
Context

I'm setting up a system for an off-site backup. It is a Raspberry Pi connected to two external hard disks. The data is written on both disks in parallel (this is not RAID 1, but two disks on which the backup service writes data individually).

The files will be mostly written, very rarely read. There are two patterns: most of the time, there would be a few small files written per minute. In some cases, there would be extensive writing of large files at 30 MB/s (if the Raspberry Pi is able to sustain such a speed) for anywhere from a few minutes to a few hours.

There would be around 1,000,000 files, using 2.5 TB in total. The largest files will be around 10 GB, and there will be few of them; most will range from a few kilobytes to a few tens of megabytes. In order to avoid the issues Linux has with directories containing too many files, the data directory stores a maximum of thirty-two sub-directories, each of which stores a maximum of thirty-two sub-sub-directories, and so on, six levels deep, limiting the number of files/directories in any directory to thirty-two.
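For illustration, a file's location could be derived from a hash of its name, picking one of thirty-two values at each of the six levels (a minimal sketch; the actual naming scheme in my backup service may differ):

    import hashlib

    LEVELS = 6   # six levels of directories, as described above
    FANOUT = 32  # at most thirty-two entries per directory

    def shard_path(name):
        """Map a file name to a six-level directory path, each
        component being one of thirty-two values (hypothetical)."""
        digest = hashlib.sha256(name.encode()).digest()
        # each of the first six bytes, reduced modulo 32, picks one level
        parts = [format(digest[i] % FANOUT, "02x") for i in range(LEVELS)]
        return "/".join(parts) + "/" + name

    # e.g. shard_path("photo.jpg") might return something shaped like
    # "07/1c/0f/12/03/1a/photo.jpg"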

The device would be powered through a UPS, but there is still a risk of it being unintentionally unplugged or turned off while writing data. While the loss of a file that was being written when the device turned off or crashed is not a problem, it would be a serious issue if the crash affected other files.

The database (PostgreSQL) might be stored on those disks as well (although I haven't decided between storing it there and keeping it on the Raspberry Pi's SD card with hourly backups to the hard disks).

Question

Would there be any benefit from using XFS?

Would there be any drawbacks?

From what I've read, XFS has benefits in situations where ext4 shows its limitations, such as when dealing with exabytes of data, which is not exactly my case. It also seemed that XFS was slightly less stable a few years ago, but that no longer appears to be the case.

So, does it matter in my case, or is my situation too ordinary to justify a strong preference for one file system in particular?

  • Between XFS and ext4, there's this: https://unix.stackexchange.com/questions/35962/xfs-vs-ext4-vs-others-which-file-system-is-stable-reliable-for-long-run-such And this is more recent: https://unix.stackexchange.com/questions/440748/big-data-what-is-the-right-filesystem-ext4-or-xfs – Andrew Henle Sep 06 '18 at 20:27
  • @AndrewHenle: Thank you. I've seen the first one, which seemed more generic compared to my question (especially in terms of number of files) and doesn't have a constraint of a possible loss of power. Thanks to your comment, I discovered the second question, but here too it's too generic compared to my situation. – Arseni Mourzenko Sep 06 '18 at 20:31
  • There are more specific recommendations on Red Hat's filesystem page: https://access.redhat.com/articles/3129891 especially in the "Choosing a Local File System" section. – Andrew Henle Sep 06 '18 at 20:46

1 Answer


So, does it matter in my case, or my situation is too ordinary to justify a strong preference for a file system in particular?

The latter. As you describe it, this workload is not very demanding, and both ext4 and XFS should be able to handle it. So I think you should have no strong preference, beyond considering what you are familiar with and what is best documented.

If you use Debian, Ubuntu, or Fedora Workstation, the installer defaults to ext4. So that's what most Linux users would be familiar with.

Familiarity matters because there are a few functional differences between the two, i.e. a few little traps that might catch you out if you are used to the other one.

XFS is a great filesystem that scales well on large servers. But in one case it declined to meet an expectation that, as I understand it, Linus required ext4 to fulfil as the successor to ext3: see "Which filesystems require fsync() for crash-safety when replacing an existing file with rename()?" This is an example of where sticking to what everyone else uses for whatever purpose might help you avoid hitting things that nobody else knows to warn about :-).
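For concreteness, the crash-safe replace-via-rename pattern the linked question is about looks roughly like this (a minimal sketch in Python; the same pattern applies to both ext4 and XFS):

    import os

    def replace_file_durably(path, data):
        """Replace path with data so that a crash leaves either the
        old or the new contents, never a truncated or empty file."""
        tmp = path + ".tmp"
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)      # flush the new contents before the rename
        finally:
            os.close(fd)
        os.rename(tmp, path)  # atomically swap in the new file
        # fsync the containing directory so the rename itself is durable
        dfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)

Without the fsync() calls, ext4 tries hard to preserve the old-or-new guarantee anyway, while XFS can leave a window (around thirty seconds by default) in which the data may be lost.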

Red Hat is trying to grow a storage stack based on XFS called Stratis, including specific features and/or new work that will go into XFS. This might become interesting and lead to more widespread community expertise etc. with XFS in future. And if you want to use RHEL or CentOS somewhere, which default to XFS, then by all means go ahead. Red Hat make a lot of comprehensive manuals available (which basically apply to CentOS too).

Our glorious future of checksumming filesystems seems equally cloudy at the moment ... at least Red Hat starting Stratis suggests someone sees it that way ... so since you don't mention them, I will also try to avoid doing so. You should require your backup application to include checksums anyway, even if that doesn't give you the full goodness of ZFS's RAID support.
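For example, a backup tool could record a SHA-256 digest per file in a manifest and re-verify it later; a minimal sketch (the manifest format here is made up for illustration):

    import hashlib

    def file_sha256(path, bufsize=1 << 20):
        """Stream a file through SHA-256 so 10 GB files need not fit in RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def verify(manifest):
        """Check every 'digest  path' line of a manifest; report mismatches."""
        with open(manifest) as f:
            for line in f:
                digest, path = line.rstrip("\n").split("  ", 1)
                if file_sha256(path) != digest:
                    print("CORRUPT:", path)

This catches silent corruption on either disk, which is the main thing a checksumming filesystem would otherwise give you.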

sourcejedi
  • Note @sourcejedi's answer in the link clearly states you still need to fsync after that rename on ext4 if you care about whether the data was written. While you likely won't get a zero-length file after a rename, you can still get a corrupted one if the kernel were to crash before fully flushing buffered writes... – Anon Dec 09 '18 at 07:21
  • Also note modern (post-22) Fedora only defaults to ext4 for desktop setups but defaults to XFS on server ones (https://docs.fedoraproject.org/en-US/fedora/f27/install-guide/install/Installing_Using_Anaconda/index.html ). – Anon Dec 09 '18 at 07:32
  • @Anon thank you for adding these informative comments! I've tweaked the answer slightly. I haven't tried to explain the fsync thing any better. As I understand it, it's about exact timing, where XFS ends up with a 30-second window for loss (by default) if you don't do the right fsync(), whereas ext4 aims for a much smaller window to match ext3 behaviour. I absolutely agree, apps should be using fsync() on both of these filesystems. – sourcejedi Dec 09 '18 at 15:31