62

I have a directory tree that contains many small files, and a small number of larger files. The average size of a file is about 1 kilobyte. There are 210158 files and directories in the tree (this number was obtained by running find | wc -l).

A small percentage of files gets added/deleted/rewritten several times per week. This applies to the small files, as well as to the (small number of) larger files.

The filesystems that I tried (ext4, btrfs) have some problems with the positioning of files on disk. Over a longer span of time, the physical positions of files on the disk (rotating media, not solid state disk) become more and more randomly distributed. The negative consequence of this random distribution is that the filesystem gets slower (for example, 4 times slower than a fresh filesystem).

Is there a Linux filesystem (or a method of filesystem maintenance) that does not suffer from this performance degradation and is able to maintain a stable performance profile on rotating media? The filesystem may run on FUSE, but it needs to be reliable.

  • If you know which files are going to be big/not changing very often, and which are going to be small/frequently changing, you might want to create two filesystems with different options on them, more suited to each scenario. If you need them to be accessible as if they were part of the same structure, you can do some tricks with mount and symlinks. – Marcin Jan 10 '12 at 13:22
  • I am quite surprised to hear that btrfs (with its copy-on-write feature) has been sluggish for you over a period of time. I am curious to see the results shared by you; perhaps we can help each other find a new direction of performance tuning with it. – Nikhil Mulley Jan 10 '12 at 14:00
  • There is a new animal online, ZFS on Linux, available in native and FUSE implementations, in case you wanted to have a look. – Nikhil Mulley Jan 10 '12 at 14:03
  • I tried zfs on linux once, was quite unstable. Managed to completely lock up the filesystem quite often. Box would work, but any access to the FS would hang. – phemmer Jan 10 '12 at 14:46
  • Similar post http://serverfault.com/questions/6711/filesystem-for-millions-of-small-files – Nikhil Mulley Jan 10 '12 at 15:22
  • @Patrick yeah, I see those solutions are still naive, and it will be some time before we see them performing natively. – Nikhil Mulley Jan 10 '12 at 15:23
  • What does it matter whether they are spread out or not? Even if they are stored one after the other, if you are accessing a small subset of the files randomly, then you will still get a random IO pattern. For sequential access like tarring the whole thing up, from what I have seen, btrfs handles this best. ext4 is bad at it because it stores the file names in hash order, which is essentially random, so even if the file data is all in order, tar reads them in a random order. btrfs does a very good job of keeping them in order. Running a btrfs fi defrag every now and again helps too. – psusi Jan 11 '12 at 19:31
  • @psusi It seems it was a mistake to use the word "fragmentation" in my question. I just replaced it with "positioning of files on disk". I apologize. –  Jan 13 '12 at 14:39
  • XFS has improved greatly since 5 years ago in the area of small files, I suspect the numbers above would be very different for XFS in newer Linux distros now. –  Aug 21 '14 at 16:42
  • I was reading (a few months back) that COW file-systems fragment more than non-COW ones. Apparently this is a property of COW. Therefore, if using COW, one should run a defragmenter. If COW is not needed, then COW file-systems should be avoided. – ctrl-alt-delor Nov 04 '18 at 20:23

5 Answers

65

Performance

I wrote a small benchmark (source) to find out which file system performs best with hundreds of thousands of small files (a rough sketch of the steps is shown after the list):

  • create 300000 files (512B to 1536B) with data from /dev/urandom
  • rewrite 30000 random files and change the size
  • read 30000 sequential files
  • read 30000 random files
  • delete all files

  • sync and drop cache after every step
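
The actual bench.py is linked above; what follows is only a minimal sketch of those steps under my own assumptions (target directory, file naming, timing details), not the original script. It needs root because it writes to /proc/sys/vm/drop_caches.

    #!/usr/bin/env python3
    # Minimal sketch of the benchmark steps above -- NOT the original bench.py
    # linked in this answer. Target directory, file naming and timing details
    # are my own assumptions. Requires root because of /proc/sys/vm/drop_caches.
    import os, random, time

    TARGET = "/mnt/test/bench"   # assumed mount point of the filesystem under test
    NUM_FILES = 300000
    NUM_SAMPLE = 30000

    def path(i):
        return os.path.join(TARGET, "file%06d" % i)

    def drop_caches():
        # flush dirty data, then drop page/dentry/inode caches
        os.sync()
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")

    def timed(name, step):
        drop_caches()                      # start every step with cold caches
        start = time.time()
        step()
        os.sync()                          # include the final flush in the measurement
        print("%-8s %6.0f s" % (name + ":", time.time() - start))

    def create():
        os.makedirs(TARGET, exist_ok=True)
        for i in range(NUM_FILES):
            with open(path(i), "wb") as f:
                f.write(os.urandom(random.randint(512, 1536)))

    def rewrite():
        for i in random.sample(range(NUM_FILES), NUM_SAMPLE):
            with open(path(i), "wb") as f:   # rewrite with a new random size
                f.write(os.urandom(random.randint(512, 1536)))

    def read_sequential():
        for i in range(NUM_SAMPLE):          # the first 30000 files, in order
            with open(path(i), "rb") as f:
                f.read()

    def read_random():
        for i in random.sample(range(NUM_FILES), NUM_SAMPLE):
            with open(path(i), "rb") as f:
                f.read()

    def delete():
        for i in range(NUM_FILES):
            os.remove(path(i))

    for name, step in [("create", create), ("rewrite", rewrite),
                       ("read sq", read_sequential), ("read rn", read_random),
                       ("delete", delete)]:
        timed(name, step)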

Results (average time in seconds, lower = better):

Using Linux Kernel version 3.1.7
Btrfs:
    create:    53 s
    rewrite:    6 s
    read sq:    4 s
    read rn:  312 s
    delete:   373 s

ext4:
    create:    46 s
    rewrite:   18 s
    read sq:   29 s
    read rn:  272 s
    delete:    12 s

ReiserFS:
    create:    62 s
    rewrite:  321 s
    read sq:    6 s
    read rn:  246 s
    delete:    41 s

XFS:
    create:    68 s
    rewrite:  430 s
    read sq:   37 s
    read rn:  367 s
    delete:    36 s

Result:
While ext4 had good overall performance, ReiserFS was extremely fast at reading sequential files. It turned out that XFS is slow with many small files; you should not use it for this use case.

Fragmentation issue

The only way to prevent file systems from distributing files all over the drive is to keep the partition only as big as you really need it, while taking care not to make the partition so small that it causes intra-file fragmentation. Using LVM can be very helpful.

Further reading

The Arch Wiki has some great articles dealing with file system performance:

https://wiki.archlinux.org/index.php/Beginner%27s_Guide#Filesystem_types

https://wiki.archlinux.org/index.php/Maximizing_Performance#Storage_devices

taffer
  • You should specify what version of the kernel you're basing that comparison off of. XFS got some very significant speed improvements in one of the recent kernels (think it was 2.6.31, but don't quote me on that). – phemmer Jan 11 '12 at 01:47
  • Other note, having a UPS is vital on any filesystem not running in synchronous IO mode. Ext4, reiserfs, and btrfs would all suffer data loss without a proper shutdown. – phemmer Jan 11 '12 at 01:48
  • @Patrick Kernel: look here: even though XFS performance has been improved, it's still slow with small files (DBench uses a filesize of 5B). XFS was intentionally made by SGI for huge partitions and files that are > 4MB, not for small 1KB files. – taffer Jan 11 '12 at 09:18
  • @Patrick UPS: There is a big difference in how much data you will lose. While XFS will show data that might be corrupted as \0, even whole directories, ext4 will have added garbage to a directory or file (unless in data=journal mode). Btrfs in contrast uses transactional writes, so that you will at least end up with the last non-corrupted version of a file in case of a power outage. Also note that many people do not have a UPS outside datacenters. – taffer Jan 11 '12 at 09:27
  • 1
    btrfs internally does your lvm trick. It allocates smaller chunks of the disk and places files in those chunks, then only allocates another chunk of the disk when the existing chunks fill up. – psusi Jan 11 '12 at 19:38
  • @taffer, ext4 will not have added garbage to a directory or file after a power failure in the default data=ordered mode. Even in data=writeback, directories will not be corrupted. – psusi Jan 11 '12 at 19:42
  • @psusi btrfs: sure, when you use btrfs, you don't need lvm at all. – taffer Jan 12 '12 at 14:47
  • @psusi ext4: yes, my fault, the garbage part is only true for data=writeback and doesn't affect directories at all. Nevertheless, IIRC files might be incomplete in data=ordered mode when a power outage happens while data is being written. – taffer Jan 12 '12 at 14:52
  • 1
    That's true of any filesystem. That is why applications use things like fsync(). – psusi Jan 12 '12 at 18:49
  • @psusi As far as I know, that's not true of transactions in btrfs. – taffer Jan 12 '12 at 20:05
  • 2
    @taffer, it is. The transactions have the same effect as the journal does in other filesystems: they protect the fs metadata. In theory they can be used by applications in the way you describe, but there is currently no api to allow applications to open and close transactions. – psusi Jan 12 '12 at 20:31
  • @psusi OK, thanks for the clarification. Do you know of any small-file benchmarks comparing xfs, ext4 and reiserfs? – taffer Jan 12 '12 at 21:08
  • @taffer, nope, but I do know that linear access ( like tar ) to many small files ( like a Maildir ) on ext4 is slow as all hell because ext4 stores the file names in hash order to speed up locating an individual file, and that effectively randomizes the order of file names relative to where they are stored on disk. I had tar take 30 minutes to archive a Maildir on ext4 that took 3 minutes on btrfs. See http://askubuntu.com/q/29831/8500 – psusi Jan 12 '12 at 23:51
  • @psusi Maybe you are interested in the benchmark I have made. I also took into account the linear access you have mentioned. Unfortunately I didn't have the time to benchmark btrfs and jfs too. – taffer Jan 14 '12 at 12:47
  • Nice, but those times seem pretty fast. Are you sure you dropped cache, or is this on an SSD or something? – psusi Jan 14 '12 at 16:48
  • @psusi Yes, echo 3 > /proc/sys/vm/drop_caches after every step should have done it. I was using a 7200 rpm HDD with a Xeon W3550 processor. There is a link to the source of the benchmark in my answer, so you can try it yourself and improve it if you like. – taffer Jan 14 '12 at 17:11
  • @psusi added btrfs results – taffer Jan 16 '12 at 14:03
  • @taffer You write: "keep the partition only as big as you really need it". I think this is misleading. You cannot prevent copy-on-write filesystems (BTRFS/ZFS) from fragmenting a bit over time. A rule of thumb says that filling them permanently beyond 50% guarantees that you will encounter serious fragmentation issues sooner or later. YMMV, some see this at 70%, some already at 25%. And even if it is caused by something different, the same is more or less true for ext-type filesystems as well. (But: I fully agree with what you write about XFS.) – Tino Jun 13 '15 at 13:03
  • When you use LVM to gradually grow the FS, you shift the fragmentation issue one layer lower, which may be worse than having the FS fragment (and know about it). – Jonas Schäfer Jun 26 '17 at 09:10
  • This info is 6 years old as of this comment. XFS now includes metadata CRCs, free inode B+ trees (finobt) for better performance on "aged" filesystems, and file types are now stored in the directory by default (ftype=1) which greatly improves performance in certain scenarios with lots of (usually small) files. Do not fully trust random benchmarks online, especially outdated ones like above. While XFS is not the best choice for a filesystem that will ONLY store tons of very small files, it is by far the best filesystem choice all-around. – Jody Bruchon Jun 10 '18 at 16:01
  • @JodyLeeBruchon Who says that? XFS performance is still a mixed bag: recent benchmarks by Phoronix show that other filesystems are faster with small I/O. If you want to know what the best filesystem for your application is, you have to benchmark with your workload on your hardware and see what works best for you. Don't believe in silver bullets. – taffer Aug 22 '18 at 20:41
  • 1
    @taffer Your "recent benchmark" is from April 2015, over three years old and uses XFS with only default options. This pre-dates xfsprogs 3.2.3 which makes XFS v5 the default and all the benefits it brings. It also wasn't formatted with -m finobt=1 which is a game-changer for XFS performance with small files and heavy metadata updates. No, there are no silver bullets, but basing your opinion on old benchmarks with is not wise, especially when major performance-changing features were ignored, unavailable, or disabled. – Jody Bruchon Aug 23 '18 at 21:14
  • 1
    the linked file http://dl.dropbox.com/u/40969346/stackoverflow/bench.py is a dead link. It's be great if you could post the benchmark code, thanks! – FGiorlando Jan 01 '19 at 12:41
  • @FGiorlando I updated the link – taffer Feb 09 '19 at 15:17
  • 1
    thanks for the link update, very useful script! – FGiorlando Feb 10 '19 at 23:04
  • The file has no license. Please edit your original post to specify a license (GPL v3, MIT, or other), or upload it to GitHub with a license. Otherwise, the script is technically "no permission"; see https://choosealicense.com/no-permission/ – Poikilos Jul 22 '19 at 17:20
10

ext4 performance drops off after 1-2 million files in a directory. See this page http://genomewiki.ucsc.edu/index.php/File_system_performance created by Hiram Clawson at UCSC.

Max
8

I am using ReiserFS for this task; it is especially made for handling a lot of small files. There is an easy-to-read text about it at the Funtoo wiki.

ReiserFS also has a host of features aimed specifically at improving small file performance. Unlike ext2, ReiserFS doesn't allocate storage space in fixed one k or four k blocks. Instead, it can allocate the exact size it needs.

Baarn
  • There are stability issues as well with ReiserFS - so RH and SuSE have dropped that FS. In principle (as a B-tree-based FS), BTRFS should be comparable. – Nils Jan 10 '12 at 20:59
1

I would use the default filesystem of your OS, but instead of storing the data in many individual files, switch to an SQLite3 database.

https://sqlite.org/fasterthanfs.html

It would be much faster for your use case, and it avoids most of the syscalls once the data has been loaded into memory. Writing several pieces of data would also be faster, because wrapping them in a single SQL transaction reduces the number of disk writes, and SQLite is safer than the filesystem regarding the atomicity of writes.

Another great benefit is that you could search not only by name, but by any kind of key, just by adding a column (and an index). If the content is text, you could also full-text index it!
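
As a rough illustration of that idea (my own sketch, not from the answer: the path/content table layout and file names are assumptions), Python's built-in sqlite3 module is enough to replace a tree of tiny files with rows, writing many of them in a single transaction:

    #!/usr/bin/env python3
    # Sketch: store many small "files" as rows in an SQLite database.
    # The schema (path -> content blob) is an assumption for illustration only.
    import sqlite3

    db = sqlite3.connect("smallfiles.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path    TEXT PRIMARY KEY,
            content BLOB NOT NULL
        )
    """)

    def write_many(items):
        # Many small writes in one transaction -> one journal commit
        # instead of one disk write per file.
        with db:  # commits on success, rolls back on error
            db.executemany(
                "INSERT OR REPLACE INTO files (path, content) VALUES (?, ?)",
                items)

    def read(path):
        row = db.execute("SELECT content FROM files WHERE path = ?",
                         (path,)).fetchone()
        return row[0] if row else None

    # For text content, an FTS5 virtual table would add full-text search.

    # Usage example: a thousand ~1 KB "files" written in one go.
    write_many(("dir/file%d" % i, b"x" * 1024) for i in range(1000))
    print(len(read("dir/file42")))   # -> 1024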

-2

XFS is noted for performing very well in situations like this. This is part of why we use it at my work for our mail stores (which can contain hundreds of thousands of files in one directory). It has better fault tolerance than ReiserFS, is in much wider use, and is generally a very mature filesystem.

Additionally, XFS supports online defragmentation, though it uses a delayed allocation technique which results in less fragmentation (vs. other filesystems) to begin with.

phemmer
  • XFS is noted for performing very well in situations like this. [citation needed] – taffer Jan 14 '12 at 16:13
  • Ehm, xfs is especially known for the opposite: working really well with big files, but not that well on small ones! Look at this exhaustive benchmark for example (or jump right to the conclusion on page 10 ^^): http://www.ilsistemista.net/index.php/linux-a-unix/40-ext3-vs-ext4-vs-xfs-vs-btrfs-filesystem-comparison-on-fedora-18.html – Levite Jul 02 '14 at 12:51
  • 1
    @Levit I think you're misreading that report. The report very clearly shows that XFS performs very well for random IO. But that aside, the report does not address the type of scenario in this question, lots of files. Random IO is one thing, large numbers of files is where ext* falls on its face. – phemmer Jul 02 '14 at 13:26
  • 1
    Sry, I don't wanna make a discussion here, but if you look through the benches you see that pattern or just look at the closing statement on page 10 of this article stating for large file installation (as VM hosting system) or direct I/O, go with XFS .... which is where it really shines. I don't say it is extremely terrible with small files, but surely not ideal for it! – Levite Jul 02 '14 at 13:34
  • @Levit See page 4 & page 7. You should always look at the data someone uses to arrive at their conclusion, and never trust the conclusion by itself. – phemmer Jul 02 '14 at 15:22
  • 2
    The only place XFS is really better there are the random read/write operations (still seems strange that a truly random read pattern on a mechanical disk is able to get 10MB/s - seems to me like some optimization that does not fly in real world (imho)), whereas on page 7 it shows just what I said earlier, XFS is really good in handling big files! Look at pages 3&5, esp on 3 you see it handling small files clearly not as well as ext! I really don't have anything against XFS though, but from what you find just about everywhere, it is not the best optiom for many small files, is all I am saying! – Levite Jul 03 '14 at 06:22
  • But take any other bench that deals with small files! I just have not really seen anything where XFS would come close to ext4 (e.g.: http://www.phoronix.com/scan.php?page=article&item=linux_311_filesystems&num=2); on the other hand, for big files most of the time I see XFS ahead, and no doubt it is a really well-designed fs. – Levite Jul 03 '14 at 07:12
  • 5
    XFS also can be extremely slow when it comes to big files, if these files are extended randomly/slowly with small chunks over a long time. (The typical syslogd pattern.) For example at my side in an XFS over MD setup I just observed, that removing a 1.5 GB file took 4.75 minutes(!) while the disk drive was capped at a 100 transactions/s limit at a write rate of more than 2 MB/s. This also impacts performance of other parallel going IO operations on the same drive badly, as the drive already is maxed out. Never saw anything like that in other FS (or being tested in benchmarks). – Tino Jun 13 '15 at 12:40
  • XFS v5 has been released since these comments were left. There have been significant improvements in performance. Also, most filesystems will generally have poor performance in the case of a file that is extended randomly over time due to fragmentation. – Jody Bruchon Jun 10 '18 at 16:05