
tar can be used to gather a whole directory into a single file. I tried with the sample directory sampledir containing just some text files, with no subdirectories. Originally the directory occupies 52K:

$ du -h sampledir/
52K sampledir/

I ran

$ tar -cf tararchive.tar sampledir/

and the generated file is

$ du -h tararchive.tar 
40K tararchive.tar

It is smaller than sampledir, even though I did not request any compression in the command. I am referring to the BSD version of tar (used also in Ubuntu).

So, what exactly does tar do? Does it simply gather the directory and all its files, inserting headers to mark where each file begins and ends? If so, how can tararchive.tar be smaller than the original directory, even without compression?

BowPark

1 Answer


This is because files occupy disk space in whole-block increments. If your block size is 512 bytes and you have a small 100-byte file, the space it actually uses is rounded up to the next whole block, in this case 512 bytes. Tarring reduces that inefficiency because the result is a single file: the rounding happens once, for the .tar file, instead of once per file. Note that du reports this allocated space, not the sum of the file contents.
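To make the rounding concrete, here is a minimal sketch of the arithmetic. The function name allocated_size is hypothetical, and the model is idealized: it assumes every non-empty file takes a whole number of blocks and ignores filesystem-specific details such as metadata or inline storage of very small files.

```shell
# Hypothetical helper: round a file size up to a whole number of blocks.
# $1 = file size in bytes, $2 = filesystem block size in bytes
allocated_size() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

allocated_size 100 512     # prints 512
allocated_size 1 4096      # prints 4096
allocated_size 4097 4096   # prints 8192
```

Under this model, 100 one-byte files on a 4096-byte block filesystem occupy 100 × 4096 bytes, even though they hold only 100 bytes of data.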

You can really see this in action if you create 100 small files and see their size as individual files vs. combined together. Running the following commands will create a directory with 100 single-byte files and then compare the size of them individually vs. all combined into one vs. a tarball created from them.

mkdir tmp_small_file_test
# create 100 files of one byte each
for ((i=0; i<100; i++)); do head -c 1 /dev/zero > tmp_small_file_test/file$i; done
du -sh tmp_small_file_test
# on a filesystem with a 4096-byte block size, this printed 404K

# concatenate the 100 files into a single 100-byte file
cat tmp_small_file_test/file* > tmp_small_file_test/all_files_combined
du -sh tmp_small_file_test/all_files_combined
# this printed 4.0K

rm -f tmp_small_file_test/all_files_combined
tar -cf tmp_small_file_test.tar tmp_small_file_test
du -sh tmp_small_file_test.tar
# this printed 116K

NOTE: tar has some overhead for each file it stores (a header per file, plus padding of the file's data to a multiple of 512 bytes), so the tarball of the above directory isn't as small as all the files concatenated together. It's still a lot smaller than the files by themselves, at least on a filesystem with a block size of 4096.
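As a rough sketch of where the tar overhead comes from, assuming the common ustar layout: each member gets a 512-byte header, its data is padded to a multiple of 512 bytes, and the archive ends with two 512-byte blocks of zeros. The function name tar_lower_bound is hypothetical. The estimate is only a lower bound because it ignores the directory's own header and any extra padding to the archive's record size.

```shell
# Hypothetical helper: lower bound on the size of an uncompressed tar archive
# holding $1 files of $2 bytes each (ustar layout assumed).
tar_lower_bound() {
    padded=$(( ($2 + 511) / 512 * 512 ))   # data rounded up to 512-byte blocks
    echo $(( $1 * (512 + padded) + 1024 )) # per-file header + data, plus end marker
}

tar_lower_bound 100 1   # prints 103424
```

For the 100 one-byte files this gives about 101K, far less than the 400K those files occupy individually with 4096-byte blocks; the archive actually produced will be somewhat larger than this bound.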

If you're using an ext3/ext4 filesystem, you can see the block size using something like tune2fs -l /dev/sda1 | grep -i 'block size' (replace /dev/sda1 with the device holding the filesystem you're using). This should work out to roughly the first du output above divided by 101: the 100 files plus one block for the directory itself.
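If the filesystem is not ext3/ext4, or you cannot read the device directly, the block size can also be queried through a path on the mounted filesystem. This is a sketch that assumes GNU coreutils stat (the BSD and macOS versions of stat use different options); the function name fs_block_size is hypothetical.

```shell
# Hypothetical helper: print the fundamental block size, in bytes, of the
# filesystem containing the given path (defaults to the current directory).
# Assumes GNU coreutils: -f queries the filesystem, %S is its block size.
fs_block_size() {
    stat -f -c %S "${1:-.}"
}

fs_block_size .
```

No root access is needed, since it queries the mounted filesystem rather than the underlying device.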

sa289