
ZFS on SMR Drives

The ZFS filesystem (more often called OpenZFS lately – after the project name) is a great filesystem for many purposes – from home or desktop/laptop setups to enterprise offerings. Traditional disk drives have non-overlapping magnetic tracks laid parallel to each other. These are PMR disks (Perpendicular Magnetic Recording). To pack even more data into the same size platters, hard disk drive manufacturers also offer SMR disks (Shingled Magnetic Recording). On SMR disks each data track is written to overlap part of the previously written track – this results in narrower tracks and higher density. I will try to visualize this difference below using my favorite Enterprise Architect ASCII Edition software.

 PMR                    SMR

[xxx][___][___][___]   [xx[__[__[___]
[___][xxx][___][___]   [__[xx[__[___]
[___][___][xxx][___]   [__[__[xx[___]
[___][___][___][xxx]   [__[__[__[xxx]
[___][xxx][___][xxx]   [__[xx[__[xxx]
[xxx][___][___][xxx]   [xx[__[__[xxx]

12345678901234567890   12345678901234

I marked the filled blocks on both disks with xxx marks. As you can compare using the ‘ruler’ below each drawing, the same data takes less physical space on an SMR disk than on a traditional PMR drive. This comes at a price though. Writes are a little ‘crippled’ compared to PMR drives. Especially heavy and random write I/O is ‘problematic’ and slower on SMR drives … but that does not mean they are useless.


For backup or clone purposes they are more than enough – it is just about the price/performance ratio. My personal backup solutions are based on such SMR drives.

Speed

How does ZFS behave on SMR drives? Very well, I would say. ZFS tries to pack as much random I/O as possible into sequential writes with its features – described in detail in the zpool-features(7) man page for example.

I recently tried ZFS on top of a GELI encrypted partition on a 5 TB external USB SMR drive. I needed to copy a little more than 3 TB of data there. I used rsync(1) for that purpose. These are the arguments I use for my rsync(1) jobs.

% rsync --modify-window=1 -l -t -r -D -v -S -H --force    \
        --progress --no-whole-file --numeric-ids --delete \
        /files/ /media/external/files/

Of course I do not write all these options by hand – I just use a script wrapper for that – rsync-delete.sh – available on my scripts page.
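Just for illustration – a minimal sketch of what such a wrapper could look like (the script name and argument handling here are only an example, not the actual rsync-delete.sh):

#!/bin/sh
# example wrapper around rsync(1) with the options used above
# usage: rsync-delete.sh SOURCE/ TARGET/
if [ ${#} -ne 2 ]
then
  echo "usage: ${0} SOURCE/ TARGET/"
  exit 1
fi
rsync --modify-window=1 -l -t -r -D -v -S -H --force    \
      --progress --no-whole-file --numeric-ids --delete \
      "${1}" "${2}"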

As I started to copy files to the drive I watched the write speeds using the iostat(8) and zpool-iostat(8) tools. I expected quite slow operation but even with zstd compression and AES-XTS 256bit GELI encryption enabled I got pretty decent results.

Here are the iostat(8) results. Each line is an average over 10 minutes (600 seconds). Check the speeds for the da0 drive below.

% iostat 600
       tty            ada0             ada1              da0             cpu
 tin  tout KB/t  tps  MB/s  KB/t  tps  MB/s  KB/t  tps  MB/s  us ni sy in id
   1     1  513  120  59.9  29.5   39   1.1   742   65  46.8   4  8 17  2 69
   0     2  615   94  56.6  19.1   22   0.4   751   68  49.8   1  3 14  1 82
   0     0  561  106  57.9  17.9   20   0.4   760   70  52.0   1  2 14  1 82
   0     0 1015   57  56.8  18.4   16   0.3   769   68  50.9   1  3 15  1 81
   0     0 1017   57  56.3  18.5   16   0.3   757   68  50.6   1  3 14  1 81
   0     1  752   72  53.0  16.6   23   0.4   765   67  50.1   1  1 13  0 85
   0     0 1014   51  50.1  16.5   21   0.3   723   68  48.3   1  1 13  0 86
   0     0 1012   51  50.2  19.8   18   0.3   743   68  49.2   1  1 12  0 86

And here are the zpool-iostat(8) results.

% zpool iostat POOL 600
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
POOL        3.18T  1.37T      7     56  53.5K  40.7M
POOL        3.20T  1.34T      0     57  9.01K  41.4M
POOL        3.22T  1.33T      0     47  3.29K  32.3M
POOL        3.24T  1.31T      0     47  5.59K  33.9M
POOL        3.25T  1.29T      0     43  3.39K  24.3M
POOL        3.27T  1.28T      0     42  3.01K  25.5M
POOL        3.28T  1.27T      0     44  3.14K  26.8M
POOL        3.29T  1.26T      0     42  3.49K  23.9M

The drive was attached to a USB 3.0 port so there was no ~35 MB/s limitation from a USB 2.0 port. I would say that the results are very decent and consistent.

Tuning

There are several settings that can help you squeeze the maximum from SMR drives on the ZFS filesystem.

First are the ZFS pool settings. You want the latest zstd compression to save some space. Better compression also means fewer physical bytes need to be written to the drive, so fewer I/O operations. You should also turn atime off as it will not be needed. You should also increase recordsize to something really big like 1m (1 megabyte) so you will get a higher compressratio and ZFS will also need less metadata because fewer (larger) blocks are used. Keep in mind that ZFS still uses a variable block size and not only the 1m maximum. If something is smaller (like 100k) then it will take for example only 80k (after zstd compression is applied). You will not waste 920k here 🙂
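You can check the currently effective values (and the achieved compressratio) at any time with zfs(8) – POOL being the example pool name used throughout this post.

% zfs get compression,atime,recordsize,compressratio POOL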

Keep in mind that most newer and larger drives use 4k sectors (instead of 512b ones). Sometimes it is the 512e method, which means that the drive firmware ‘presents’ a device with 512b sectors while underneath eight of these 512b sectors lay on a single physical 4k sector. For these reasons it is important to keep several things in mind.
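On FreeBSD you can check what the drive reports with diskinfo(8) – the sectorsize field shows the logical sector size and stripesize usually reveals the physical 4k sector on 512e drives.

# diskinfo -v da0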

When adding new partitions with gpart(8) remember to align them to 4k with the -a 4k argument.

# gpart add -t freebsd-zfs -a 4k da0
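You can then verify that the new freebsd-zfs partition starts at a 4k aligned offset with:

# gpart show da0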

Next – when initializing the geli(8) encryption layer – make sure you add the -s 4096 argument.

# geli init -s 4096 /dev/da0p1
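After the init step the provider still needs to be attached – this is what creates the /dev/da0p1.eli device that the ZFS pool will live on.

# geli attach /dev/da0p1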

The last thing is the ZFS pool creation with a proper ashift property – it can not be changed later. On FreeBSD UNIX it is done this way:

# sysctl vfs.zfs.min_auto_ashift=12
# zpool create POOL da0p1.eli
# zdb -C POOL | grep ashift
                ashift: 12

If you are curious what 12 means then the table below will help you:

ASHIFT  BLOCKSIZE
     9  512b
    10  1k
    11  2k
    12  4k
    13  8k

Last but not least is the redundant_metadata option. By default it is set to all but it is desirable to set it to most. Do you need redundant metadata? I think not. When your single drive fails the redundant metadata will not help, and if your ZFS pool has some redundancy level like raidz or mirror then redundant metadata is also not needed because it is already ‘normally’ redundant, being spread across several disks.

Keep in mind that the ZFS resilver process on some of these SMR drives can take forever. Some people on Reddit reported that they successfully resilvered their ZFS pools with SMR drives but that does not have to be the case for all SMR drives out there. You can also check the Ars Technica tests of resilver on SMR disks.

Here is the summary of the suggested ZFS tunables – you will find an in-depth description of all of them in the zfsprops(7) man page.

# zfs set redundant_metadata=most POOL
# zfs set compression=zstd        POOL
# zfs set atime=off               POOL
# zfs set recordsize=1m           POOL

In theory the TRIM operations upon deletion would create additional unwanted ‘stress’ for SMR drives, which would mean that TRIM operations should be disabled on these non-SSD drives – and you can disable them entirely at the ZFS pool level … but.

TRIM commands issued by the operating system allow the internal controller of the SMR HDD to know that certain areas/blocks on the SMR platters are no longer in use. It means that writes to such areas can be performed without the slow read-modify-write pattern.

This means we are leaving the autotrim option on (enabled) for SMR drives.

# zpool set autotrim=on POOL
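You can confirm that the property took effect with:

# zpool get autotrim POOL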

Also – if needed – you can manually trigger the TRIM operations with this command.

# zpool trim POOL
# zpool status POOL
  pool: POOL
 state: ONLINE
  scan: scrub repaired 0B in 02:17:22 with 0 errors on Sun May  8 05:18:22 2022
config:

        NAME          STATE     READ WRITE CKSUM
        POOL          ONLINE       0     0     0
          da0p1.eli   ONLINE       0     0     0  (trimming)

errors: No known data errors


By default the TRIM commands are executed at a rate of 64 on FreeBSD. You can limit them to 1 and still have them enabled with the following sysctl(8) tunable.

# sysctl vfs.zfs.vdev.trim_max_active=1

If you want to make it survive across reboots then put it into the /etc/sysctl.conf file.
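For example with this line – it will be applied again on every boot.

vfs.zfs.vdev.trim_max_active=1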

Logic could suggest that simpler/older filesystems such as FreeBSD UFS could be a more suitable solution for SMR drives … but reality shows that this is not the case. Check this Reddit thread for example – Appalling Performance on External USB SMR Drive – to name just one.

Hope this article will help you get the most out of your SMR drives.

Regards.

EOF