
I have an ext4-formatted disk holding thousands of automatically generated files, all of which are needed. A few thousand of them are only one byte long, and some are two bytes long. Within each of these two groups, all files are identical.

How much space can I save by locating the one-byte files (say 1000 of them), removing each one, and hard-linking its name to a single representative file?

Like this:

# ls -l
-rw-r----- 1 john john 1 Feb 25 10:29 a
-rw-r----- 1 john john 1 Feb 25 10:29 b
-rw-r----- 1 john john 1 Feb 25 10:29 c
# du -kcs ?
4   a
4   b
4   c
12  total

Try to consolidate:

# rm b c
# ln a b
# ln a c
# ll
total 12
-rw-r----- 3 john john 1 Feb 25 10:29 a
-rw-r----- 3 john john 1 Feb 25 10:29 b
-rw-r----- 3 john john 1 Feb 25 10:29 c
# du -kcs ?
4   a
4   total

(Note that du does not even list b and c, which I find curious.)

Question: Is it really that easy, and can I save 999 × 4 KiB in my 1000-file scenario if an allocation block is 4 KiB in size?

Or, does ext4 have the ability to transparently "merge tails", or store tiny files in the "directory inode" (I vaguely remember some filesystems can do that)?

(I know file allocation blocks can vary and a command like tune2fs -l /dev/sda1 can tell me.)
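As a rough check that needs neither root access nor the device name, GNU stat can report the block size of the file system holding a given path. This is only a sketch, assuming GNU coreutils; fs_block_size is a hypothetical helper name, not a standard command.

```shell
# Print the fundamental block size (in bytes) of the file system that holds
# the given path (default: current directory).
# Assumption: GNU coreutils `stat`, where `-f` queries the file system and
# `%S` is the fundamental block size.
fs_block_size() {
    stat -f -c '%S' -- "${1:-.}"
}

fs_block_size .
```

On an ext4 file system this should agree with the "Block size:" line printed by tune2fs -l for the underlying device.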

Ned64
  • Regarding "du does not even list b and c": note the -l option to du, which counts hard-linked files once per link rather than once per inode. – user4556274 Feb 25 '21 at 18:01
  • Why not just test it? It's a trivial exercise with the shell and will give you empirical numbers, immediately. – Kyle Jones Feb 25 '21 at 18:35
  • @KyleJones The disk is being written to constantly, so the numbers may not be accurate. Furthermore, I know that some systems save space "offline", like a garbage collector, so the benefits may be delayed. To be honest, I also thought someone might know and tell us right away, which would save me a lot of work. If no one knows, I will try it out and answer my own question. – Ned64 Feb 25 '21 at 21:21
  • There's no garbage collector of any kind for ext2/3/4 filesystems - after sync you see what you get. I guess you'll save space but it's not obvious how much. – Artem S. Tashkinov Feb 26 '21 at 08:40
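The -l (--count-links) option mentioned in the comments can be seen in action with a small wrapper that runs du both ways on the same names. This is a sketch assuming GNU du; du_both is a hypothetical helper name.

```shell
# Run du twice on the same paths: first with the default accounting (each
# inode is counted and listed once), then with -l, which counts every hard
# link separately. Assumption: GNU du.
du_both() {
    echo "once per inode:"
    du -kcs -- "$@"
    echo "once per link (-l):"
    du -kcsl -- "$@"
}

# Example, in the directory from the question:
#   du_both a b c
```

With a, b and c hard-linked together, the first listing should show only a and the total, while the second should list all three names.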

1 Answer


There are three parts to storing files: the blocks used to store the file contents, the inode used to store the file’s metadata, and the directory entry (or entries) pointing to the inode.
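To see these pieces for a concrete file, GNU stat can print the allocated blocks, the inode number, and the link count side by side. This is a minimal sketch assuming GNU coreutils; file_costs is a hypothetical helper name.

```shell
# Show, for each path, the pieces that make up a stored file:
#   %s = size in bytes
#   %b = number of allocated blocks, in units of %B bytes
#   %i = inode number
#   %h = number of hard links (directory entries pointing at the inode)
# Assumption: GNU coreutils `stat`.
file_costs() {
    stat -c '%n: size=%s blocks=%b (x %B bytes) inode=%i links=%h' -- "$@"
}

# Example, in the directory from the question:
#   file_costs a b c
```

Separate files show different inode numbers with links=1 each; names that are hard links to one file show the same inode number and a shared link count.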

When you create multiple separate files, in the most general case you pay all three costs once per file.

With inline data (if your file system was created with the appropriate options), you save the blocks used to store the file contents if the file is small enough, but you still need one inode per file and at least one directory entry per file.
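Whether inline data is enabled can be read from the "Filesystem features:" line that tune2fs -l prints. The filter below is a sketch that only parses that output; has_inline_data is a hypothetical helper name, and /dev/sda1 is simply the device named in the question.

```shell
# Read `tune2fs -l` output on stdin and succeed (exit status 0) only if the
# "Filesystem features:" line lists the inline_data feature.
has_inline_data() {
    grep '^Filesystem features:' | grep -qw 'inline_data'
}

# Example (tune2fs usually needs root):
#   tune2fs -l /dev/sda1 | has_inline_data && echo "inline_data is enabled"
```

The feature is normally chosen when the file system is created, for example with mkfs.ext4 -O inline_data.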

With hard links, you save the blocks used to store the file contents and the inodes: there’s only one inode, one instance of the file data (whether inline in the inode or separate), and as many directory entries as links.

Given that you need to store the directory entries anyway, hard links are effectively free. Anything else will involve more storage; exactly how much depends on your file system’s specific settings.
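One possible way to do the consolidation for a whole directory is sketched below. It assumes GNU find, stat, cmp and ln; link_duplicates is a hypothetical helper name, and the sketch deliberately handles only a single directory level. Note that hard links share contents and metadata, so this is only appropriate if nothing later modifies one of the files in place.

```shell
# Replace every regular file directly inside DIR whose contents are identical
# to REF with a hard link to REF.
# Usage: link_duplicates REF DIR
# Assumptions: GNU find/stat/cmp/ln; REF and DIR are on the same file system.
link_duplicates() {
    ref=$1
    dir=$2
    size=$(stat -c '%s' -- "$ref") || return 1
    # Candidates: same size in bytes as REF, and not already linked to REF.
    find "$dir" -maxdepth 1 -type f -size "${size}c" ! -samefile "$ref" \
        -exec sh -c '
            ref=$1
            shift
            for f in "$@"; do
                # Only link files that match REF byte for byte.
                if cmp -s -- "$ref" "$f"; then
                    ln -f -- "$ref" "$f"
                fi
            done
        ' sh "$ref" {} +
}

# Example, in the directory from the question:
#   link_duplicates ./a .
```

Afterwards, the link count of the reference file (the second column of ls -l) should equal the number of names that now share its inode.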

Stephen Kitt