
ZFS on SMR Drives

The ZFS filesystem (more often called OpenZFS lately – as the project name) is a great filesystem for many purposes – from home or desktop/laptop solutions to enterprise offerings. Traditional disk drives have non-overlapping magnetic tracks parallel to each other. These are PMR disks (Perpendicular Magnetic Recording). Hard disk drive manufacturers – to pack even more data into the same size platters – also offer SMR disks (Shingled Magnetic Recording). In SMR disks data tracks are written to overlap part of the previously written track – this results in narrower tracks and higher density. I will try to visualize this difference below using my favorite Enterprise Architect ASCII Edition software.

 PMR                    SMR

[xxx][___][___][___]   [xx[__[__[___]
[___][xxx][___][___]   [__[xx[__[___]
[___][___][xxx][___]   [__[__[xx[___]
[___][___][___][xxx]   [__[__[__[xxx]
[___][xxx][___][xxx]   [__[xx[__[xxx]
[xxx][___][___][xxx]   [xx[__[__[xxx]

12345678901234567890   12345678901234

I marked the filled blocks on both disks with xxx marks. As you can see from the 'width' scale below, the same data on an SMR disk takes less physical space than on a traditional PMR drive. This comes at a price though. Writes are a little 'crippled' compared to PMR drives. Especially heavy and random I/O writes are 'problematic' and slower on SMR drives … but it does not mean they are useless.


For backup or clone purposes they are more than enough. I personally use SMR drives for my backup solutions. It's just about the price/performance ratio.

Here are my backup solutions based on SMR drives:

Speed

How does ZFS behave on SMR drives? Very well, I would say. ZFS tries to turn as much random I/O into sequential I/O as possible with its various features – described in detail in the zpool-features(7) man page for example.

I recently tried ZFS on top of a GELI encrypted partition on a 5 TB external USB SMR drive. I needed to copy a little more than 3 TB of data there. I used rsync(1) for that purpose. These are the arguments I use for my rsync(1) jobs.

% rsync --modify-window=1 -l -t -r -D -v -S -H --force    \
        --progress --no-whole-file --numeric-ids --delete \
        /files/ /media/external/files/

Of course I do not type all these options by hand – I just use a script wrapper for that – rsync-delete.sh – available on my scripts page.
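
The exact invocation depends on the wrapper itself – treat the line below as a hypothetical example that assumes the script just takes the source and target directories and passes the options above to rsync(1).

% rsync-delete.sh /files/ /media/external/files/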

As I started to copy files onto the drive I watched the write speeds using the iostat(8) and zpool-iostat(8) tools. I expected quite slow operation but even with zstd compression and AES-XTS 256bit GELI encryption enabled I got pretty decent results.

Here are the iostat(8) results. Each line is an average over 10 minutes (600 seconds). Check the speeds for the da0 drive below.

% iostat 600
       tty            ada0             ada1              da0             cpu
 tin  tout KB/t  tps  MB/s  KB/t  tps  MB/s  KB/t  tps  MB/s  us ni sy in id
   1     1  513  120  59.9  29.5   39   1.1   742   65  46.8   4  8 17  2 69
   0     2  615   94  56.6  19.1   22   0.4   751   68  49.8   1  3 14  1 82
   0     0  561  106  57.9  17.9   20   0.4   760   70  52.0   1  2 14  1 82
   0     0 1015   57  56.8  18.4   16   0.3   769   68  50.9   1  3 15  1 81
   0     0 1017   57  56.3  18.5   16   0.3   757   68  50.6   1  3 14  1 81
   0     1  752   72  53.0  16.6   23   0.4   765   67  50.1   1  1 13  0 85
   0     0 1014   51  50.1  16.5   21   0.3   723   68  48.3   1  1 13  0 86
   0     0 1012   51  50.2  19.8   18   0.3   743   68  49.2   1  1 12  0 86

And here are the zpool-iostat(8) results.

% zpool iostat POOL 600
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
POOL        3.18T  1.37T      7     56  53.5K  40.7M
POOL        3.20T  1.34T      0     57  9.01K  41.4M
POOL        3.22T  1.33T      0     47  3.29K  32.3M
POOL        3.24T  1.31T      0     47  5.59K  33.9M
POOL        3.25T  1.29T      0     43  3.39K  24.3M
POOL        3.27T  1.28T      0     42  3.01K  25.5M
POOL        3.28T  1.27T      0     44  3.14K  26.8M
POOL        3.29T  1.26T      0     42  3.49K  23.9M

The drive was attached over a USB 3.0 port so there was no ~35 MB/s limitation of a USB 2.0 port. I would say that the results are very decent and consistent.

Tuning

There are several settings that can help you squeeze the maximum from SMR drives on the ZFS filesystem.

First are the ZFS pool settings. You want the latest zstd compression to save some space. Better compression also means fewer physical bytes need to be written to the drive, so fewer I/O operations. You should also set atime to off as it will not be needed. You should also increase recordsize to something really big like 1m (1 megabyte) so you will get a higher compressratio and there will be fewer ZFS blocks to keep metadata for. Keep in mind that ZFS will still use a variable block size and not only the 1m maximum. If something is smaller (like 100k) then it would take for example 80k (after zstd compression is applied). You will not waste 920k here 🙂
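
Once some data lands on the pool you can quickly check how these settings behave – the command below only reads the current values of the properties discussed above.

% zfs get compression,compressratio,recordsize,atime POOL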

Keep in mind that most newer and larger drives use 4k sectors (instead of 512b). Sometimes it is the 512e method, which means that the drive firmware will 'present' the device with 512b sectors while underneath eight such 512b sectors lie on a single physical 4k sector. For these reasons it is important to keep several things in mind.
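
You can check what a drive reports with diskinfo(8) – a 512e drive will typically show a 512 sectorsize together with a 4096 stripesize (output trimmed, the numbers are only illustrative).

# diskinfo -v da0
da0
        512             # sectorsize
        ...
        4096            # stripesize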

When adding new partitions with gpart(8) remember to align them to 4k with the -a 4k argument.

# gpart add -t freebsd-zfs -a 4k da0

Next – when initializing the geli(8) encryption layer – make sure you add the -s 4096 argument.

# geli init -s 4096 /dev/da0p1
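
One step not shown here – before creating the pool the GELI provider has to be attached; this asks for the passphrase and creates the /dev/da0p1.eli device used below.

# geli attach /dev/da0p1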

The last thing is the ZFS pool creation with the proper ashift property – it cannot be changed later. On FreeBSD UNIX it is done this way:

# sysctl vfs.zfs.min_auto_ashift=12
# zpool create POOL da0p1.eli
# zdb -C POOL | grep ashift
                ashift: 12

If you are curious what 12 means then the table below will help you:

ASHIFT  BLOCKSIZE
     9  512b
    10  1k
    11  2k
    12  4k
    13  8k

Last but not least is the redundant_metadata option. By default it is set to all but it is desirable to set it to most. Do you need redundant metadata? I think not. When your single drive fails the redundant metadata will not help, and if your ZFS pool has some redundancy level like raidz or mirror then redundant metadata is also not needed because it is already 'normally' redundant, being spread across several disks.

Keep in mind that the ZFS resilver process on some of these SMR drives can take forever. Some people on Reddit reported that they successfully resilvered their ZFS pools with SMR drives but that does not have to be the case for all SMR drives out there. You can also check the Ars Technica tests of resilver on SMR disks.

Here is the summary of the suggested ZFS tunables – you will find an in-depth description of all of them in the zfsprops(7) man page.

# zfs set redundant_metadata=most POOL
# zfs set compression=zstd        POOL
# zfs set atime=off               POOL
# zfs set recordsize=1m           POOL

In theory the TRIM operations upon deletion would create additional unwanted 'stress' for SMR drives, which would mean that TRIM operations should be disabled on non-SSD drives and you could disable them entirely at the ZFS pool level … but.

TRIM commands issued by the operating system allow the SMR HDD internal controller to learn that certain areas/blocks on the SMR HDD platters are no longer in use. It means that writes to such areas can be performed without the slow read-modify-write pattern.

This means we are leaving the autotrim option on (enabled) for SMR drives.

# zpool set autotrim=on POOL
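
You can verify that the property took effect – the output should look something like this.

# zpool get autotrim POOL
NAME  PROPERTY  VALUE     SOURCE
POOL  autotrim  on        local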

Also – if needed – you can manually trigger the TRIM operations with this command.

# zpool trim POOL
# zpool status POOL
  pool: POOL
 state: ONLINE
  scan: scrub repaired 0B in 02:17:22 with 0 errors on Sun May  8 05:18:22 2022
config:

        NAME          STATE     READ WRITE CKSUM
        POOL          ONLINE       0     0     0
          da0p1.eli   ONLINE       0     0     0  (trimming)

errors: No known data errors


By default the TRIM commands are executed at a rate of 64 on FreeBSD. You can limit them to 1 and still have them enabled with the following sysctl(8) tunable.

# sysctl vfs.zfs.vdev.trim_max_active=1

If you want to make it survive across reboots then put it into the /etc/sysctl.conf file.
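
For example by appending a single line – this just mirrors the sysctl(8) value above.

# echo 'vfs.zfs.vdev.trim_max_active=1' >> /etc/sysctl.conf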

Logic could suggest that simpler/older filesystems such as FreeBSD UFS could be a more suitable solution for SMR drives … but reality shows otherwise. Check this Reddit thread for example – Appalling Performance on External USB SMR Drive – to name just one.

Hope this article will help you get the most out of your SMR drives.

Regards.

EOF

ZFS Compatibility

The best free filesystem on Earth – ZFS – also often named OpenZFS recently – has become very portable in recent years of its development. The OpenZFS Distributions page lists 6 (six) operating systems already.

They are:

  • FreeBSD
  • Illumos
  • Linux
  • MacOS
  • NetBSD
  • Windows

… but if you would like to create a ZFS pool compatible with all of them … which options and ZFS features should you choose? There is an OpenZFS Feature Flags page dedicated exactly to that topic.


These are the ones that have a yes value in all operating systems.

  • async_destroy
  • bookmarks
  • empty_bpobj
  • enabled_txg
  • filesystem_limits
  • lz4_compress
  • hole_birth
  • multi_vdev_crash_dump
  • spacemap_histogram

I would also include the ones below, as only the older NetBSD 4.0.5 version does not support them – they are supported in the newer NetBSD 5.3 version.

  • embedded_data
  • large_blocks
  • sha512
  • skein

There is also a dedicated zpool-features(7) man page with that information on the OpenZFS page and also a zpool-features(7) man page on the FreeBSD page.

On the FreeBSD system the /usr/share/zfs/compatibility.d directory has files with supported ZFS Feature Flags for many major operating systems and ZFS versions.

% ls /usr/share/zfs/compatibility.d
2018
2019
2020
2021
compat-2018
compat-2019
compat-2020
compat-2021
freebsd-11.0
freebsd-11.1
freebsd-11.2
freebsd-11.3
freebsd-11.4
freebsd-12.0
freebsd-12.1
freebsd-12.2
freenas-11.0
freenas-11.1
freenas-11.2
freenas-11.3
freenas-9.10.2
grub2
openzfs-2.0-freebsd
openzfs-2.0-linux
openzfs-2.1-freebsd
openzfs-2.1-linux
openzfsonosx-1.7.0
openzfsonosx-1.8.1
openzfsonosx-1.9.3
openzfsonosx-1.9.4
truenas-12.0
ubuntu-18.04
ubuntu-20.04
zol-0.6.1
zol-0.6.4
zol-0.6.5
zol-0.7
zol-0.8

Unfortunately it lacks the NetBSD and Illumos systems for example … but having the information from the OpenZFS Feature Flags page we can find a Feature Flags set that will be supported everywhere.

Here are the stats for supported ZFS Feature Flags. The higher the number the more operating systems and ZFS versions it covers.

% grep -h '^[^#]' /usr/share/zfs/compatibility.d/* \
    | sort -n \
    | uniq -c \
    | sort -n
   2 draid
   5 bookmark_written
   5 device_rebuild
   5 livelist
   5 log_spacemap
   5 redacted_datasets
   5 redaction_bookmarks
   5 zstd_compress
   9 allocation_classes
   9 bookmark_v2
   9 project_quota
   9 resilver_defer
  10 edonr
  10 encryption
  11 large_dnode
  11 userobj_accounting
  18 spacemap_v2
  20 device_removal
  20 obsolete_counts
  20 zpool_checkpoint
  31 sha512
  31 skein
  32 multi_vdev_crash_dump
  36 filesystem_limits
  36 large_blocks
  37 bookmarks
  37 embedded_data
  37 enabled_txg
  37 extensible_dataset
  37 hole_birth
  37 spacemap_histogram
  38 async_destroy
  38 empty_bpobj
  38 lz4_compress

As GNU GRUB is very outdated when it comes to ZFS support it should be pretty bulletproof to use it as a starting point for a limited ZFS Feature Flags set.

% cat /usr/share/zfs/compatibility.d/grub2
# Features which are supported by GRUB2
async_destroy
bookmarks
embedded_data
empty_bpobj
enabled_txg
extensible_dataset
filesystem_limits
hole_birth
large_blocks
lz4_compress
spacemap_histogram

To make sure we are compatible we will now cross-check the GNU GRUB data.

First we will ‘generate’ the grep(1) command arguments that we will use in the next command.

% grep '^[^#]' /usr/share/zfs/compatibility.d/grub2 \
  | while read I
    do
      echo "-e ' ${I}' \\"
    done
-e ' async_destroy' \
-e ' bookmarks' \
-e ' embedded_data' \
-e ' empty_bpobj' \
-e ' enabled_txg' \
-e ' extensible_dataset' \
-e ' filesystem_limits' \
-e ' hole_birth' \
-e ' large_blocks' \
-e ' lz4_compress' \
-e ' spacemap_histogram' \

Let's now use these arguments to filter the ZFS features.

% grep -h '^[^#]' /usr/share/zfs/compatibility.d/* \
    | sort -n \
    | uniq -c \
    | sort -n \
    | grep -e ' async_destroy' \
           -e ' bookmarks' \
           -e ' embedded_data' \
           -e ' empty_bpobj' \
           -e ' enabled_txg' \
           -e ' extensible_dataset' \
           -e ' filesystem_limits' \
           -e ' hole_birth' \
           -e ' large_blocks' \
           -e ' lz4_compress' \
           -e ' spacemap_histogram' \
    | wc -l
      11

% grep -h '^[^#]' /usr/share/zfs/compatibility.d/grub2 | wc -l
      11

So it seems that the GRUB list of ZFS Feature Flags is pretty compatible.

Let's now cross-reference that GRUB data with the data from the OpenZFS Feature Flags page.

I will create a new /usr/share/zfs/compatibility.d/OZFF file that has these ZFS Feature Flags as its content.

% cat /usr/share/zfs/compatibility.d/OZFF
async_destroy
bookmarks
empty_bpobj
enabled_txg
filesystem_limits
lz4_compress
hole_birth
multi_vdev_crash_dump
spacemap_histogram
embedded_data
large_blocks
sha512
skein

% wc -l /usr/share/zfs/compatibility.d/OZFF
      13

So there are 11 GRUB ZFS Feature Flags and 13 OpenZFS ‘Compatible’ Feature Flags.

Let's see how they compare.

% cat /usr/share/zfs/compatibility.d/OZFF \
    | grep -e async_destroy \
           -e bookmarks \
           -e embedded_data \
           -e empty_bpobj \
           -e enabled_txg \
           -e extensible_dataset \
           -e filesystem_limits \
           -e hole_birth \
           -e large_blocks \
           -e lz4_compress \
           -e spacemap_histogram \
    | wc -l
      10

I expected 11 here instead of 10 … we will now have to compare the GRUB results with the OpenZFS Feature Flags set to find the difference.

% grep -h '^[^#]' /usr/share/zfs/compatibility.d/{grub2,OZFF} \
    | sort -n \
    | uniq -c \
    | sort -n
   1 extensible_dataset
   1 multi_vdev_crash_dump
   1 sha512
   1 skein
   2 async_destroy
   2 bookmarks
   2 embedded_data
   2 empty_bpobj
   2 enabled_txg
   2 filesystem_limits
   2 hole_birth
   2 large_blocks
   2 lz4_compress
   2 spacemap_histogram

It seems that we finally got our 10 most compatible OpenZFS Feature Flags – the features with a count of 2 are the ones present in both the grub2 and OZFF files.

It is this set:

  • async_destroy
  • bookmarks
  • embedded_data
  • empty_bpobj
  • enabled_txg
  • filesystem_limits
  • hole_birth
  • large_blocks
  • lz4_compress
  • spacemap_histogram

To make it more comfortable to use we will put them into a separate /usr/share/zfs/compatibility.d/COMPATIBLE file.

# cat /usr/share/zfs/compatibility.d/COMPATIBLE
async_destroy
bookmarks
embedded_data
empty_bpobj
enabled_txg
filesystem_limits
hole_birth
large_blocks
lz4_compress
spacemap_histogram

Let's now try to create that best-effort most-compatible ZFS pool.

I will use one of my scripts – mdconfig.sh – to easily manipulate md(4) memory disks on FreeBSD.

# truncate -s 1g FILE
# mdconfig.sh -c FILE
IN: created vnode at /dev/md0
# zpool create -o compatibility=COMPATIBLE compatible /dev/md0
# zpool list
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
compatible   960M   116K   960M        -         -     0%     0%  1.00x    ONLINE  -
zroot        118G  48.6G  69.4G        -         -    36%    41%  1.00x    ONLINE  -


Now let's see what the zpool upgrade command shows us.

# zpool upgrade
This system supports ZFS pool feature flags.

All pools are formatted using feature flags.


Some supported features are not enabled on the following pools. Once a
feature is enabled the pool may become incompatible with software
that does not support the feature. See zpool-features(7) for details.

Note that the pool 'compatibility' feature can be used to inhibit
feature upgrades.

POOL  FEATURE
---------------
compatible
      multi_vdev_crash_dump
      large_dnode
      sha512
      skein
      userobj_accounting
      encryption
      project_quota
      device_removal
      obsolete_counts
      zpool_checkpoint
      spacemap_v2
      allocation_classes
      resilver_defer
      bookmark_v2
      redaction_bookmarks
      redacted_datasets
      bookmark_written
      log_spacemap
      livelist
      device_rebuild
      zstd_compress
      draid
zroot
      userobj_accounting
      encryption
      project_quota
      allocation_classes
      resilver_defer
      bookmark_v2
      redaction_bookmarks
      redacted_datasets
      bookmark_written
      log_spacemap
      livelist
      device_rebuild
      zstd_compress
      draid

There are LOTS of ZFS Feature Flags waiting to be activated but if we want to keep our ZFS pool compatible – we will have to stay away from them 🙂
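
If you ever change your mind the restriction can be lifted later – a sketch below; note that zpool upgrade then enables all features supported by the running system and there is no way back.

# zpool set compatibility=off compatible
# zpool upgrade compatible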


You may get the impression that you are missing a lot … but you do not miss that much. Here are the ZFS Feature Flags you can fully utilize.

# zpool get all compatible | grep -v disabled
NAME        PROPERTY                       VALUE                          SOURCE
compatible  size                           960M                           -
compatible  capacity                       0%                             -
compatible  altroot                        -                              default
compatible  health                         ONLINE                         -
compatible  guid                           5395735446052695775            -
compatible  version                        -                              default
compatible  bootfs                         -                              default
compatible  delegation                     on                             default
compatible  autoreplace                    off                            default
compatible  cachefile                      -                              default
compatible  failmode                       wait                           default
compatible  listsnapshots                  off                            default
compatible  autoexpand                     off                            default
compatible  dedupratio                     1.00x                          -
compatible  free                           960M                           -
compatible  allocated                      116K                           -
compatible  readonly                       off                            -
compatible  ashift                         0                              default
compatible  comment                        -                              default
compatible  expandsize                     -                              -
compatible  freeing                        0                              -
compatible  fragmentation                  0%                             -
compatible  leaked                         0                              -
compatible  multihost                      off                            default
compatible  checkpoint                     -                              -
compatible  load_guid                      17463015630652190527           -
compatible  autotrim                       off                            default
compatible  compatibility                  COMPATIBLE                     local
compatible  feature@async_destroy          enabled                        local
compatible  feature@empty_bpobj            enabled                        local
compatible  feature@lz4_compress           active                         local
compatible  feature@spacemap_histogram     active                         local
compatible  feature@enabled_txg            active                         local
compatible  feature@hole_birth             active                         local
compatible  feature@extensible_dataset     enabled                        local
compatible  feature@embedded_data          active                         local
compatible  feature@bookmarks              enabled                        local
compatible  feature@filesystem_limits      enabled                        local
compatible  feature@large_blocks           enabled                        local

You get the very decent LZ4 compression and also the ZFS Bookmarks feature, which is very useful for the send|recv mechanism.
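
A minimal sketch of how a bookmark can serve as the incremental source for zfs send – the dataset names here are just an example, not from the article. The sending side can destroy the old snapshot and keep only the tiny bookmark.

# zfs bookmark POOL/data@backup-1 POOL/data#backup-1
# zfs destroy POOL/data@backup-1
# zfs snapshot POOL/data@backup-2
# zfs send -i POOL/data#backup-1 POOL/data@backup-2 | zfs recv BACKUP/data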

Keep in mind that the -o compatibility= switch for zpool(8) is available in OpenZFS 2.1 or newer. The 2.1 version is already available in the FreeBSD 13.1-BETA* releases and will be part of the FreeBSD 13.1-RELEASE systems. To make use of it on older FreeBSD releases you will have to use the openzfs and openzfs-kmod packages and also use the following settings in the /boot/loader.conf file.

From that one:

zfs_load=YES

Into that one:

zfs_load=NO
openzfs_load=YES

With the openzfs and openzfs-kmod packages and the above settings in the /boot/loader.conf file you can use OpenZFS 2.1 on FreeBSD 12.2 – on FreeBSD 12.3 – and on FreeBSD 13.0 … and of course on the upcoming FreeBSD 13.1 release.
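
Installing these packages is a single pkg(8) command – shown below just for completeness.

# pkg install openzfs openzfs-kmod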

Not sure if I should add anything more here. Feel free to remind me in the comments 🙂

EOF