Difference in total size between files on disk and files in tar archive

Question

I'm a new Linux user and still learning it. As I understand, the tar command (by itself and without options like z, j or J) doesn't compress files by default. It only bundles multiple files into a single one. Below is my testing.

root@u2004:~# du -sh /etc/
11M /etc/
root@u2004:~# tar cf etc.tar /etc
tar: Removing leading `/' from member names
root@u2004:~# du -sh etc.tar 
6.6M    etc.tar
root@u2004:~#

As you can see, the files in the /etc directory are 11M in total. After archiving them into a single file, the new archive file is 6.6M. Where does the size difference come from? Is it because the files are written contiguously and squeezed together?

Probably related: Why are there so many different ways to measure disk usage? — Kamil Maciorowski, Apr 15 '22 at 14:00
"Is it because the files are written contiguously and squeezed together?" Yes. — Chris Davies, Apr 15 '22 at 14:16

score 3 · Accepted Answer · answered Apr 15 '22 at 15:00

By default, du measures file size in 'blocks'. A small file (smaller than one block) uses only as much of its block as it needs, and the rest stays empty. That remainder cannot be used by another file, since a block can belong to only one file, so some number of bytes is 'wasted'.

tar, on the other hand, concatenates all the files into one stream, padding each one only to a multiple of 512 bytes. Much less space is 'wasted'.
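As a rough back-of-the-envelope sketch of this effect, the snippet below compares the cost of one small file on disk and inside an archive. It assumes a 4096-byte file system block (common, but check your own) and the tar format's 512-byte records, where each member gets one 512-byte header plus its data padded to a multiple of 512. The helper name round_up and the 100-byte file size are made up for illustration.

```shell
#!/bin/sh
# round_up SIZE UNIT: round SIZE up to the next multiple of UNIT
# (hypothetical helper, not part of du or tar).
round_up() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

size=100          # bytes in one small file (example value)
fs_block=4096     # assumed file system block size
tar_record=512    # record size used by the tar format

# On disk the file occupies whole blocks.
on_disk=$(round_up "$size" "$fs_block")

# In the archive it costs one header record plus its padded data.
in_tar=$(( tar_record + $(round_up "$size" "$tar_record") ))

echo "on disk: $on_disk bytes"
echo "in tar:  $in_tar bytes"
```

Under these assumptions the 100-byte file costs 4096 bytes on disk but 1024 bytes in the archive, and that gap is repeated for every small file in /etc.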

You can use the -b option of du if you want a better prediction of the tar size.

Meaning if you run

$ du -shb /etc
$ du -shb etc.tar

You will get numbers much closer to each other. The remaining difference comes from the files' metadata: the size of the directories themselves in the first case, and the size of the tar headers in the second.
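To try the comparison without touching /etc, here is a minimal sketch that builds a throwaway directory of small files and prints all three figures. It assumes GNU du (for -b and --block-size) and a tar that accepts -C; the directory layout, file count and file size are arbitrary example values.

```shell
#!/bin/sh
# Compare allocated size, apparent size and archive size for small files.
# Assumes GNU du (for -b and --block-size) and a tar that accepts -C.
demo=$(mktemp -d)                       # throwaway scratch directory
mkdir "$demo/files"

i=1
while [ "$i" -le 20 ]; do
    head -c 100 /dev/zero > "$demo/files/f$i"   # 20 files of 100 bytes each
    i=$((i + 1))
done

tar cf "$demo/files.tar" -C "$demo" files

allocated=$(du -s --block-size=1 "$demo/files" | cut -f1)  # bytes in allocated blocks
apparent=$(du -sb "$demo/files" | cut -f1)                 # apparent size in bytes
archive=$(( $(wc -c < "$demo/files.tar") ))                # archive length in bytes

echo "du (allocated): $allocated"
echo "du -b:          $apparent"
echo "tar archive:    $archive"

rm -rf "$demo"
```

The exact numbers depend on the file system. Note also that GNU tar pads the whole archive to a multiple of its record size (10240 bytes by default), so with only a handful of tiny files the archive can come out larger than du -b suggests. For something the size of /etc that padding is negligible.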

To investigate further, you can start with:

$ df /some_test_dir

That will tell you which device the directory is located on (the Filesystem column).

$ sudo /sbin/dumpe2fs /dev/?? |grep 'Block size'

Substitute your device there and you will get the block size of the file system on that device (dumpe2fs works on ext2/ext3/ext4 file systems).
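If the file system is not ext2/ext3/ext4, or you would rather not use sudo, one possible alternative is to ask stat about the file system that holds the directory. This is a sketch that assumes GNU stat, where -f reports on the file system and %S is its fundamental block size. The variable names are illustrative.

```shell
#!/bin/sh
# Print the fundamental block size of the file system holding a directory.
# Assumes GNU stat: -f = report on the file system, %S = fundamental block size.
target=${1:-.}      # directory to inspect; defaults to the current directory
block_size=$(stat -f -c %S "$target")
echo "Block size for $target: $block_size bytes"
```

Saved under a name of your choosing, the script takes the directory as its only argument, for example /some_test_dir.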

If you run du /some_test_dir and that directory is empty, you will get the size of one block.

If you now put one or more zero-length files into it, du on the directory will still report one block. That is because the files take no space at all, and the information about them is stored inside the directory's block.

For the next test, create N files in this directory, each of them smaller than a block. The actual size does not matter; it only has to be more than zero and less than a block. Now du on the directory will give you (N+1)*block: each file takes one block, and the directory itself takes one block.

If you have many files (how many depends on the file system), the directory itself can grow in order to store the information about the files in it. But the directory size will always be a multiple of the block size.
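The experiment above can be scripted roughly as follows. This is a sketch that assumes GNU du. The helper name allocated_bytes, the file names and N=5 are made up for illustration. Pass the directory to test in as the first argument, since the default temporary directory may sit on a different file system (such as tmpfs) than the one you care about.

```shell
#!/bin/sh
# Reproduce the experiment in a scratch directory (GNU du assumed).
base=${1:-${TMPDIR:-/tmp}}              # where to create the scratch directory
dir=$(mktemp -d "$base/du-test.XXXXXX")
n=5                                     # number of small files (example value)

# Bytes in blocks allocated to the directory and its contents
# (hypothetical helper).
allocated_bytes() { du -s --block-size=1 "$dir" | cut -f1; }

empty=$(allocated_bytes)                # empty directory

: > "$dir/zero1"; : > "$dir/zero2"      # two zero-length files
with_zero=$(allocated_bytes)

i=1
while [ "$i" -le "$n" ]; do
    echo x > "$dir/small$i"             # 2 bytes: more than zero, less than a block
    i=$((i + 1))
done
with_small=$(allocated_bytes)

file_count=$(( $(ls "$dir" | wc -l) ))

echo "empty directory:         $empty"
echo "with zero-length files:  $with_zero"
echo "with $n small files:      $with_small"

rm -rf "$dir"
```

On an ext4 file system with 4096-byte blocks, the reasoning above predicts the same figure on the first two lines and (N+1)*4096 on the third. Other file systems (tmpfs, btrfs, or ext4 with inline data enabled) may report different numbers.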

Thanks for the insightful answer! — Fajela Tajkiya, Apr 15 '22 at 15:14
