3

I'm hunting for a way to utilise a slow 500GB magnetic HDD alongside a fast 500GB SSD.

I'd like to end up with a reasonable fraction of the two combined capacities [hopefully > 800GB], but deciding which files should live on which disk is going to be really tricky, and it's near impossible to cut that split cleanly along directory boundaries.

It occurs to me that something like this has already been thought of in the form of multi-tiered storage. But so far the only mechanisms I've found use the fast volume as a cache over the slow volume. That's NOT useful to me since I'd just end up with 500GB of SSD cache over a 500GB slow HDD... I'd be no better off than simply binning the slow HDD and using only the SSD.

Hypothetically a system could exist that spanned across two volumes and dynamically copied from one to the other. This would be similar to a cache but without any requirement to keep one volume wholly consistent, and with the ability to provision more capacity than either single volume.

So far all my searches have drawn blanks. I've found multiple references to LVM and XFS supporting caching, but nothing with an obvious path to achieving higher capacity than either single device.

So while it's hypothetically possible, I don't see a way to achieve it.


Note that the tricky feature of this question is achieving a solution with only one end file system. Manually choosing which files go where by directory will not be possible for me.

  • You could use e.g. part of the SSD as a fast cache for the HD, and use the rest normally. That way you'd get most of the data space in use and ready-made solutions might be easier to find. – ilkkachu Apr 17 '23 at 13:54
  • @ilkkachu interesting idea, though I'm not sure how I'd achieve that without splitting into multiple FS. – Philip Couling Apr 17 '23 at 14:03
  • yes that's the hard part. You might have to do something silly like a linear MD device out of the remaining SSD and the HD. Anyway, a cache-like system that'd make all the space usable would have to move things back and forth between the slow and fast device, all the time. And that sounds really rather awkward. – ilkkachu Apr 17 '23 at 15:10
  • "Hypothetically a system could exist that spanned across two volumes and dynamically copied from one to the other." This seems apropos: https://en.wikipedia.org/wiki/Hierarchical_storage_management Note that all the implementations are proprietary. – Andrew Henle Apr 17 '23 at 15:25
  • ZFS has "special" devices that can store metadata and small files, but it's intended to be a small portion of your total storage... – Matt Nordhoff Apr 18 '23 at 03:51

3 Answers

3

The following answer is not based on personal experience, so I cannot tell you for sure whether it would indeed work, but these are just ideas you can try.

lvmcache

The cache logical volume type uses a small and fast LV to improve the performance of a large and slow LV. It does this by storing the frequently used blocks on the faster LV.

If I understand correctly, you can limit the cache size on the fast disk, and then create another logical volume out of the slow disk and the remainder of the fast one.
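For instance, a minimal sketch (untested and purely illustrative: /dev/sda stands for the 500GB HDD, /dev/sdb for the 500GB SSD, and the sizes are placeholders) might look like:

    # Put both disks into one volume group
    pvcreate /dev/sda /dev/sdb
    vgcreate vg0 /dev/sda /dev/sdb

    # Main LV spanning the whole HDD plus ~300GB of the SSD
    lvcreate -n data -L 800G vg0 /dev/sda /dev/sdb

    # Cache pool on the remaining SSD space, then attach it to the main LV
    lvcreate --type cache-pool -n cpool -L 190G vg0 /dev/sdb
    lvconvert --type cache --cachepool vg0/cpool vg0/data

    mkfs.ext4 /dev/vg0/data

The single filesystem on vg0/data then spans both disks, while the leftover SSD space caches the hot blocks.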

Block cache

Another possible solution is bcache:

Bcache (block cache) allows one to use an SSD as a read/write cache (in writeback mode) or read cache (writethrough or writearound) for another blockdevice (generally a rotating HDD or array).

Both the backing device and the caching device could be "a whole device, a partition or any other standard block device."

Theoretically, you could split your SSD into two partitions of 200GB and 300GB. The 200GB partition would be used as the caching device.

Then you could combine the remaining 300GB partition from your SSD and the slow disk into one 800GB logical volume (using either LVM or software RAID), and use the caching device to cache it.
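A rough sketch of that layout (again untested; /dev/sda, /dev/sdb1 and /dev/sdb2 are placeholders for the HDD, the 200GB cache partition and the 300GB data partition):

    # Join the HDD and the 300GB SSD partition into one ~800GB linear device
    mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/sda /dev/sdb2

    # Register /dev/md0 as the bcache backing device and /dev/sdb1 as its
    # cache; listing both on one make-bcache call attaches them in one step
    make-bcache -B /dev/md0 -C /dev/sdb1

    # Optionally enable writeback caching, then create a single filesystem
    echo writeback > /sys/block/bcache0/bcache/cache_mode
    mkfs.ext4 /dev/bcache0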

Again, I've never tried these. lvmcache seems more native for your purpose. You can try both and compare their performance to determine which would be the better solution for you.

aviro
  • 5,532
  • Thanks. Just to point out, this still does the thing I said I wanted to avoid, splitting the data by directory tree, as it results in multiple file system partitions. – Philip Couling Apr 17 '23 at 15:36
  • @PhilipCouling you don't need to split the data by directories - as long as you don't have many different directories on your root partition. Not all the directories need caching. For instance, reading files from /usr, /bin etc (which are supposed to be read-only) is pretty quick and caching will not bring a lot of benefit there. But maybe I misunderstand something. Which directories do you have in mind, and are they all on the root partition? – aviro Apr 17 '23 at 15:45
  • Two notes: First, anything you'll need to do on the root partition will probably require rebuilding. I assume those things could be done for the root partition, but it would be pretty difficult, if not impossible, to do that on a running machine. Second note: there's a fairly limited number of folders on the root partition. Usually only a limited number of those have high I/O. It's an easy task to identify those. You don't need to create a partition for each and every one of them; you can just create symlinks to the cached FS you've created. – aviro Apr 17 '23 at 16:05
  • @PhilipCouling just out of curiosity, can you explain what's special about your system? The only assumption I made was that there aren't many folders on the root partition. Is that not true in your case? Actually, I had another assumption, which is that some folders, like /usr and /bin, don't have a lot of IO. Is this also a wrong assumption? Again, not meaning to hassle, just curious. – aviro Apr 17 '23 at 16:24
  • nothing's "special", I think your numbers are off. I'm talking in my question about 1000GB of storage. A typical OS install with all its apps is well under 50GB (and I'm being generous with that number). So when you talk about splitting along root level directories you are splitting off at most 5%, where the other 95% is likely to be lumped in one huge mass either in /home for a home computer or /var/lib for a server. If you need high IO on some of that 95% then it suddenly gets harder to choose what to make its own partition. For a concrete example: a server running K8s or docker. – Philip Couling Apr 17 '23 at 16:33
  • aviron (and @PhilipCouling): If you're carving up your SSD to use part of it specially, like a cache, you'd want to tell a filesystem like XFS to keep the log on the SSD, like mkfs.xfs -l logdev=/dev/partition-on-ssd,size=100m (example from the man page, IDK if 100MiB is an appropriate size for modern XFS). Many data-ordering operations have to wait for a log write to commit to disk. Ideally you'd like all the metadata to be on the SSD, but IDK if XFS can do that. Some other FS maybe can, perhaps btrfs if told about both devices separately, rather than an MD or LVM. – Peter Cordes Apr 18 '23 at 03:22
3

XFS provides a realtime partition:

An XFS filesystem can reside on a regular disk partition or on a logical volume. An XFS filesystem has up to three parts: a data section, a log section, and a realtime section. ...

...

The realtime section is used to store the data of realtime files. These files had an attribute bit set through xfsctl(3) after file creation, before any data was written to the file. The realtime section is divided into a number of extents of fixed size (specified at mkfs.xfs(8) time). Each file in the realtime section has an extent size that is a multiple of the realtime section extent size.

Note, though, there are limitations. Data will be written to either the realtime partition or the "normal" partition, so there's no moving of data back and forth that could expand capacity like you'd see in a hierarchical storage system.

Second, the writing application needs to set the XFS_XFLAG_REALTIME bit on files that you want written to the realtime partition, after the file is created but before any data is written to it.
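As a rough, untested illustration (device names are placeholders: /dev/sda1 for the data and log sections on the SSD, /dev/sdb1 for the realtime section on the HDD), the setup and the per-file flag could look like:

    # Create the filesystem with the HDD as the realtime section
    mkfs.xfs -r rtdev=/dev/sdb1 /dev/sda1
    mount -o rtdev=/dev/sdb1 /dev/sda1 /mnt

    # Mark a new, still-empty file as realtime so its data lands on the HDD
    touch /mnt/bigfile
    xfs_io -c "chattr +r" /mnt/bigfile

    # Or set the inherit bit on a directory so files created in it
    # default to the realtime section
    mkdir /mnt/bulk
    xfs_io -c "chattr +t" /mnt/bulk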

Andrew Henle
  • 3,780
1

Just a rough idea: Have you considered an overlay FS where the read-only part is the HDD (presumably for files that don't change or change rarely) and the overlay is on the SSD? You would need an offline mechanism that moves files from the overlay to the HDD when they are deemed stable, and that removes files from the HDD when there is an active working copy in the overlay. This offline mechanism could be run when the SSD is becoming full. Apparently people have been looking in this direction before, see Merge changes to upper filesystem to lower filesystem in Linux Overlay (OverlayFS) mount.
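A minimal sketch of the mount side (all paths are placeholders: /mnt/hdd holds the lower tree on the HDD, /mnt/ssd holds the upper and work directories on the SSD):

    # The merged view at /merged reads from the HDD but writes go to the SSD
    mkdir -p /mnt/ssd/upper /mnt/ssd/work /merged
    mount -t overlay overlay \
        -o lowerdir=/mnt/hdd,upperdir=/mnt/ssd/upper,workdir=/mnt/ssd/work \
        /merged

The offline step would then amount to rsync-ing stable files from /mnt/ssd/upper down into /mnt/hdd while /merged is unmounted, and clearing them out of the upper layer afterwards.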

With lots of work, it might even be possible to create a FUSE filesystem which does this online and automatically. Maybe there actually are approaches to this, see https://en.wikipedia.org/wiki/Filesystem_in_Userspace for some ideas and pointers.

  • It's an interesting concept certainly! The challenge would be getting it to be atomically safe, but I really like the concept. – Philip Couling Apr 18 '23 at 09:41