
NETGEAR uses BTRFS in their ReadyNAS OS, and implements Tiered Storage in their latest versions. They started with a "Metadata" tier only in ReadyNAS v6.9, and then added the "Data Tier" in v6.10. The system uses SSDs as Tier-0 to speed up access to the slower HDDs. According to the description, metadata resides on the SSDs in both cases, and in the "Data Tier" case newly written data also goes to the SSDs first and is later migrated to the HDDs, either periodically or when the SSD tier fills up to a specified level.

ReadyNAS uses BTRFS on top of RAID-ed HDDs in its normal installs - e.g. my system has a RAID5 made of 4 disks, which BTRFS sees/uses as a single device.

From what I can see of how the tiering is implemented, both the "Metadata" and "Data Tier" setups are done by adding a second RAID array, made only of the SSDs, next to the main HDD RAID array, and transforming the initially single-device BTRFS into a multiple-device one.
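In stock btrfs terms I assume that amounts to something like the following (my guess at what their scripts do, using the device names from my box):

btrfs device add /dev/md127 /data      # add the SSD RAID array as a second device
btrfs filesystem show /data            # the volume now spans the HDD and SSD arrays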

What I cannot figure out is how the migration is done, how the "Metadata" case manages to separate metadata from data so that only metadata goes to the SSDs, and how the "Data Tier" mode directs new writes entirely to the SSD tier.

Any ideas?

3 Answers


You are saying Netgear has found a way to do what MergerFS tiered caching already allows, in a user-friendly and extremely simple configuration: https://github.com/trapexit/mergerfs#tiered-caching

  1. Create two MergerFS pools: A) one with all the HDDs plus the SSD ("POOL", tier0), set to write to the device with the least free space (unless it has less than X amount of free space left); B) a second pool ("POOL-ARCHIVE", tier1) containing only the HDDs.

  2. Your users and all applications only use the path of the first pool.

  3. A nightly script that moves everything that has not been touched for the past X days from the first pool to the second (easy: since the HDDs are in both pools, only data sitting on the SSD actually gets copied). This is the only thing that uses the path of the second pool (see the sketch after this list).
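For illustration, the two mounts and the nightly job could look roughly like this on a generic Linux box (the branch paths, the 50G threshold, the 30-day cut-off and the create policies are my own picks, not anything NETGEAR or ReadyNAS ships):

# /etc/fstab (hypothetical branch paths)
# tier0 pool: SSD plus all HDDs; new files go to the branch with the least
# free space, skipping any branch with less than 50G free
/mnt/ssd:/mnt/hdd1:/mnt/hdd2  /mnt/pool          fuse.mergerfs  allow_other,category.create=lfs,minfreespace=50G  0 0
# tier1 pool: HDDs only, used exclusively by the nightly migration script
/mnt/hdd1:/mnt/hdd2           /mnt/pool-archive  fuse.mergerfs  allow_other,category.create=mfs  0 0

and the nightly migration, run from cron (it works on the SSD branch directly, which has the same effect as copying pool-to-pool, since only SSD-resident files differ between the two pools):

#!/bin/sh
# move files not accessed for 30 days off the SSD branch into the HDD-only pool
find /mnt/ssd -type f -atime +30 -printf '%P\n' \
  | rsync -aHAX --files-from=- --remove-source-files /mnt/ssd/ /mnt/pool-archive/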

This is exactly how I have set up my home server. All drives are BtrFS formatted. I don't (and with this solution can't) use RAID.

The pros:

  1. When a drive fails, you only lose the data on that drive (and I mitigate this by using SnapRAID as a first backup system). You don't lose the entire pool as you would with BtrFS RAID0.
  2. This is extremely easy to set up: 2 mounts in your /etc/fstab. BAM, tiered caching!
  3. You always use the SSD first (unless it only has X amount of free space left), giving you maximum speed.

The cons:

  1. You can't use BtrFS subvolumes (spanning disks) within your MergerFS pool, as MergerFS runs on top of the filesystems in user space.
  2. This also means you can't take snapshots of subvolumes within your pool. I would love to have Time Machine-like snapshots per user-data folder in my pool.

I absolutely love MergerFS for its simplicity, but con #2 makes me very interested in how Netgear hacked a similar solution using BTRFS.

zilexa

OK, here's what I found happening during the periodic balances:

The following process is started on the host:

btrfs balance start -dsweep lt:/dev/md127:7 /data LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin DBUS_SESSION_BUS_ADDRESS=unix:path=/var/netatalk/spotlight.ipc TRACKER_USE_CONFIG_FILES=1 TRACKER_USE_LOG_FILES=1 XDG_DATA_HOME=/apps/.xdg/local/share XDG_CONFIG_HOME=/apps/.xdg/config XDG_CACHE_HOME=/apps/.xdg/cache

where /data is my tiered data volume and /dev/md127 is the SSD array used as buffer/cache.

This process runs until the data from the SSD tier is moved almost completely to the HDD tier - e.g. somewhere along the way I see:

btrfs fi sh /data
Label: '0a44c6bc:data'  uuid: ed150b8f-c986-46d0-ada8-45ee219acbac
    Total devices 2 FS bytes used 393.14GiB
    devid    1 size 7.12TiB used 359.00GiB path /dev/md126
    devid    2 size 114.68GiB used 42.06GiB path /dev/md127

and the usage of the SSD tier then goes down until it is almost zero. The strange thing is that so far I have not been able to run this command manually.
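While the migration is running it should also show up as an ordinary balance, so (assuming NETGEAR's patched tools behave like stock btrfs-progs here) its progress can be watched with:

btrfs balance status /data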

I still cannot figure out the 'sweep' balance filter.

This is what the --help shows:

# btrfs balance start --help
usage: btrfs balance start [options] <path>
Balance chunks across the devices

Balance and/or convert (change allocation profile of) chunks that
passed all filters in a comma-separated list of filters for a
particular chunk type.  If filter list is not given balance all
chunks of that type.  In case none of the -d, -m or -s options is
given balance all chunks in a filesystem. This is potentially
long operation and the user is warned before this start, with
a delay to stop it.

-d[filters]    act on data chunks
-m[filters]    act on metadata chunks
-s[filters]    act on system chunks (only under -f)
-v             be verbose
-f             force reducing of metadata integrity
--full-balance do not print warning and do not delay start
--background|--bg
               run the balance as a background process

but this does not explain how it maps to the "lt:/dev/md127:7" part of the command that runs periodically:

btrfs balance start -dsweep lt:/dev/md127:7 /data

What's the meaning here - run until the data usage of /dev/md127 falls below 7%?!
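For what it's worth, 'sweep' does not appear in mainline btrfs-progs at all, so it looks like a NETGEAR-specific patch. The closest stock equivalent I know of is the devid filter, which also relocates the chunks sitting on one particular device, though without a usage threshold like the ':7' above (a sketch, assuming the SSD array is devid 2 as in the fi show output):

# relocate the chunks currently residing on devid 2 (the SSD array);
# the allocator then places them wherever it sees fit, normally the
# much emptier HDD array
btrfs balance start -ddevid=2 /data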


It must be a cronjob that runs regularly and does the migration.

Check /etc/cron.d for entries that might be doing that.
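Something along these lines should find it (the exact locations are a guess):

grep -r "btrfs balance" /etc/cron.d /etc/cron.daily /etc/crontab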

Blitzer