ZFS on SMR Drives

The ZFS filesystem (lately more often called OpenZFS – after the project name) is a great filesystem for many purposes – from home or desktop/laptop setups to enterprise offerings. Traditional disk drives have non-overlapping magnetic tracks laid parallel to each other. These are PMR disks (Perpendicular Magnetic Recording). To pack even more data onto the same size platters hard disk drive manufacturers also offer SMR disks (Shingled Magnetic Recording). On SMR disks each data track is written so that it partially overlaps the previously written track – this results in narrower tracks and higher density. I will try to visualize this difference below using my favorite Enterprise Architect ASCII Edition software.

 PMR                    SMR

[xxx][___][___][___]   [xx[__[__[___]
[___][xxx][___][___]   [__[xx[__[___]
[___][___][xxx][___]   [__[__[xx[___]
[___][___][___][xxx]   [__[__[__[xxx]
[___][xxx][___][xxx]   [__[xx[__[xxx]
[xxx][___][___][xxx]   [xx[__[__[xxx]

12345678901234567890   12345678901234

I marked the filled blocks on both disks with xxx marks. As you can see by comparing the ‘width’ below each diagram, the same data takes less physical space on an SMR disk than on a traditional PMR drive. This comes at a price though. Writes are a little ‘crippled’ compared to PMR drives. Especially heavy and random write I/O is ‘problematic’ and slower on SMR drives … but that does not mean they are useless.


For backup or clone purposes they are more than enough. I personally use SMR drives for my backup solutions. It is just about the price/performance ratio.

Here are my backup solutions based on the SMR drives:

Speed

How does ZFS behave on SMR drives? Very well I would say. ZFS tries to pack as much random I/O as possible into sequential writes with its features – described in detail in the zpool-features(7) man page for example.

I recently tried ZFS on top of a GELI encrypted partition on a 5 TB external USB SMR drive. I needed to copy a little more than 3 TB of data there. I used rsync(1) for that purpose. These are the arguments I use for my rsync(1) jobs.

% rsync --modify-window=1 -l -t -r -D -v -S -H --force    \
        --progress --no-whole-file --numeric-ids --delete \
        /files/ /media/external/files/

Of course I do not type all these options by hand – I just use a script wrapper for that – rsync-delete.sh – available on my scripts page.
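
If you do not want to fetch the whole thing – below is a minimal sketch of what such a wrapper could look like. The actual rsync-delete.sh from my scripts page may differ a little.

#!/bin/sh
# minimal rsync(1) wrapper sketch - the real rsync-delete.sh may differ
if [ ${#} -ne 2 ]
then
  echo "usage: ${0##*/} SOURCE/ TARGET/"
  exit 1
fi

rsync --modify-window=1 -l -t -r -D -v -S -H --force    \
      --progress --no-whole-file --numeric-ids --delete \
      "${1}" "${2}"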

As I started to copy files to the drive I watched the write speeds using the iostat(8) and zpool-iostat(8) tools. I expected quite slow operation but even with zstd compression and AES-XTS 256bit GELI encryption enabled I got pretty decent results.

Here are the iostat(8) results. Each line is an average over 10 minutes (600 seconds). Check the speeds for the da0 drive below.

% iostat 600
       tty            ada0             ada1              da0             cpu
 tin  tout KB/t  tps  MB/s  KB/t  tps  MB/s  KB/t  tps  MB/s  us ni sy in id
   1     1  513  120  59.9  29.5   39   1.1   742   65  46.8   4  8 17  2 69
   0     2  615   94  56.6  19.1   22   0.4   751   68  49.8   1  3 14  1 82
   0     0  561  106  57.9  17.9   20   0.4   760   70  52.0   1  2 14  1 82
   0     0 1015   57  56.8  18.4   16   0.3   769   68  50.9   1  3 15  1 81
   0     0 1017   57  56.3  18.5   16   0.3   757   68  50.6   1  3 14  1 81
   0     1  752   72  53.0  16.6   23   0.4   765   67  50.1   1  1 13  0 85
   0     0 1014   51  50.1  16.5   21   0.3   723   68  48.3   1  1 13  0 86
   0     0 1012   51  50.2  19.8   18   0.3   743   68  49.2   1  1 12  0 86

And here are the zpool-iostat(8) results.

% zpool iostat POOL 600
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
POOL        3.18T  1.37T      7     56  53.5K  40.7M
POOL        3.20T  1.34T      0     57  9.01K  41.4M
POOL        3.22T  1.33T      0     47  3.29K  32.3M
POOL        3.24T  1.31T      0     47  5.59K  33.9M
POOL        3.25T  1.29T      0     43  3.39K  24.3M
POOL        3.27T  1.28T      0     42  3.01K  25.5M
POOL        3.28T  1.27T      0     44  3.14K  26.8M
POOL        3.29T  1.26T      0     42  3.49K  23.9M

The drive was attached to a USB 3.0 port so there was no ~35 MB/s limitation from a USB 2.0 port. I would say that the results are very decent and consistent.

Tuning

There are several settings that can help you squeeze the maximum from SMR drives on the ZFS filesystem.

First are the ZFS pool settings. You want the latest zstd compression to save some space. Better compression also means that fewer physical bytes need to be written to the drive – so fewer I/O operations. You should also set atime to off as it will not be needed. You should also increase recordsize to something really big like 1m (1 megabyte) so you will get a higher compressratio and will also need less metadata for the same amount of data. Keep in mind that ZFS still uses a variable block size and not only the 1m maximum. If something is smaller (like 100k) then it will take for example 80k (after the applied zstd compression). You will not waste 920k here πŸ™‚
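
To check how much space the zstd compression actually saves you can compare the logical and physical usage of the pool – for example like below (POOL being the same example pool name as in the zpool-iostat(8) output above).

% zfs get compressratio POOL
% zfs list -o name,used,logicalused POOL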

Keep in mind that most newer and larger drives use 4k sectors (instead of 512b ones). Sometimes it is the 512e method which means that the drive firmware ‘presents’ a device with 512b sectors while underneath eight such 512b sectors just lay on a single physical 4k sector. For these reasons it is important to keep several things in mind.
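
You can check what the drive actually reports with diskinfo(8) – on a 512e drive the sectorsize will show 512 while the stripesize will show 4096.

# diskinfo -v da0 | grep -E 'sectorsize|stripesize'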

When adding new partitions with gpart(8) remember to align them to 4k with the -a 4k argument.

# gpart add -t freebsd-zfs -a 4k da0
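
Note that on a brand new drive you would need to create the GPT partition scheme first (before the gpart add shown above) – and afterwards you can verify the partition layout and alignment with gpart show.

# gpart create -s gpt da0
# gpart show da0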

Next – when initializing the geli(8) encryption layer – make sure you add the -s 4096 argument.

# geli init -s 4096 /dev/da0p1
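
After the init step you also need to attach the GELI provider – that is what creates the /dev/da0p1.eli device the ZFS pool will live on.

# geli attach /dev/da0p1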

The last thing is ZFS pool creation with the proper ashift property – it cannot be changed later. On FreeBSD UNIX it is done this way:

# sysctl vfs.zfs.min_auto_ashift=12
# zpool create POOL da0p1.eli
# zdb -C POOL | grep ashift
                ashift: 12
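
On newer OpenZFS versions (2.x – shipped with FreeBSD 13) you should also be able to set ashift directly as a pool property at creation time – an alternative to the sysctl(8) tunable above.

# zpool create -o ashift=12 POOL da0p1.eli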

If you are curious what 12 means then the table below will help you:

ASHIFT  BLOCKSIZE
     9  512b
    10  1k
    11  2k
    12  4k
    13  8k

Last but not least is the redundant_metadata option. By default it is set to all but it is desirable to set it to most. Do you need redundant metadata? I think not. When your single drive fails the redundant metadata will not help – and if your ZFS pool has some redundancy level like raidz or mirror then redundant metadata is also not needed because the metadata is already ‘normally’ redundant, being spread across several disks.

Keep in mind that the ZFS resilver process on some of these SMR drives can take forever. Some people on Reddit reported that they successfully resilvered their ZFS pools with SMR drives but that does not have to be the case for all SMR drives out there. You can also check the Ars Technica tests of resilver on SMR disks.

Here is the summary of the suggested ZFS tunables – you will find an in-depth description of all of them in the zfsprops(7) man page.

# zfs set redundant_metadata=most POOL
# zfs set compression=zstd        POOL
# zfs set atime=off               POOL
# zfs set recordsize=1m           POOL
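
You can verify that all of them got applied with a single zfs get call.

% zfs get redundant_metadata,compression,atime,recordsize POOL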

In theory the TRIM operations issued upon deletion would create additional unwanted ‘stress’ for SMR drives – which would mean that TRIM operations should be disabled on non-SSD drives and that you could disable them entirely at the ZFS pool level … but.

TRIM commands issued by the operating system allow the SMR HDD internal controller to learn that certain areas/blocks on the SMR HDD platters are no longer in use. It means that writes to such areas can be performed without the slow read-modify-write pattern.

This means we are leaving the autotrim option enabled (on) for SMR drives.

# zpool set autotrim=on POOL
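
You can confirm that setting with zpool get.

% zpool get autotrim POOL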

Also – if needed – you can manually trigger the TRIM operations with this command.

# zpool trim POOL
# zpool status POOL
  pool: POOL
 state: ONLINE
  scan: scrub repaired 0B in 02:17:22 with 0 errors on Sun May  8 05:18:22 2022
config:

        NAME          STATE     READ WRITE CKSUM
        POOL          ONLINE       0     0     0
          da0p1.eli   ONLINE       0     0     0  (trimming)

errors: No known data errors
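
On recent OpenZFS versions you can also watch the TRIM progress itself with the -t flag of zpool-status(8).

# zpool status -t POOL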


By default TRIM commands are executed at a rate of 64 concurrent operations on FreeBSD. You can limit that to 1 and still have TRIM enabled with the following sysctl(8) tunable.

# sysctl vfs.zfs.vdev.trim_max_active=1

If you want to make it survive across reboots then put it into the /etc/sysctl.conf file.
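
In the /etc/sysctl.conf file it is just this single line.

vfs.zfs.vdev.trim_max_active=1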

Logic could suggest that simpler/older filesystems – such as FreeBSD UFS for example – would be a more suitable solution for SMR drives … but reality shows that this is not quite the case. Check this Reddit thread for example – Appalling Performance on External USB SMR Drive – to name just one.

Hope this article will help you get the most out of your SMR drives.

Regards.

EOF

10 thoughts on “ZFS on SMR Drives”

  1. Mara Sophie Grosch

    What about resilvering? Around two years ago I added a SMR WD Red to my existing pool, replacing a failed disk and resilvering never completed successfully. Is this fixed by now or did you not test that yet and it’s still a problem?

      1. SP

        Lots of people have though. Would love to see some tests on a raidz2 with 6 disks or so.

        Last time I tested I could never complete a resilver.

      2. debdrup

        Try offlining a drive, dd’ing a few GB from /dev/random onto various parts of the offlined drive, online it, and then wait.
        And wait, and then wait some more.

        The problem with SMR isn’t that you can’t use it as a backup pool, it’s that the random I/O properties of SMR are utterly terrible with no redeeming features.
        When you’re doing a resilver it’s essentially random which disk any particular record will be read from in order to rebuild the mirroring or distributed parity.
        So what happens when you try to resilver? It takes forever, and I know of nobody who’s succeeded in resilvering, because it’d break any and all RTO policies.

        I’d recommend that you take this article down or at the very least mark it as containing information that’s dangerous if used in production environments, because otherwise people will inevitably use it as an argument for SMR without understanding the problems.

      3. vermaden Post author

        Sure.

        Added that to the article:

        Keep in mind that ZFS resilver process on some of these SMR drives can take forever. Some people from Reddit reported that they successfully resilvered their ZFS pools with SMR drives but that does not have to be the case for all SMR drives out there. You can also check Ars Technica tests of resilver on SMR disks.

        Regards.

  2. Pingback: Valuable News – 2022/05/10 | πšŸπšŽπš›πš–πšŠπšπšŽπš—

  3. Pingback: Silent Fanless Dell Wyse 3030 LT FreeBSD Server | πšŸπšŽπš›πš–πšŠπšπšŽπš—

  4. Vladyslav

    Hi, vermaden.
    Thank you for your blog. Over time I found a lot of useful information here.

    As to the claim that “TRIM operations are not needed on non-SSD drives” – I’m afraid this might not be the case.
    TRIM commands issued by the operating system allow the SMR HDD internal controller to “know” that certain areas on HDD plates are no longer in use. And thus (small) writes to such areas could be performed without the slow read-modify-write pattern.
