36

I've checked these two questions (question one, question two), but they didn't help me understand. I have a file, file.txt, containing 40 lines of the string Hello World!. ls -l shows that its size is 520 bytes. When I archive this file with tar -cvf file.tar file.txt and run ls -l again, I see that file.tar is 10240 bytes. Why?

I've read some manuals and understand that archiving and compressing are different things. But can someone please explain how this works?

Peter Cordes
  • 6,466
  • 29
    Also worth noting: tar by itself does not do any compression. So the output tar file will always be larger than the input file. That's why tar files are usually ran through a compressor like gz. – Colonel Thirty Two Jan 10 '22 at 20:19
  • 1
    Have you examined the .tar file to check the contents? – Bert Jan 11 '22 at 00:07
  • 3
    @ColonelThirtyTwo “That's why tar files are usually ran through a compressor like gz” – well, they often are, but I wouldn't say “usually”. It certainly doesn't always make sense, in particular not when the archive contains e.g. compressed video files where another layer of general-purpose compression is completely useless but may take significant computation resources. – leftaroundabout Jan 12 '22 at 12:12
  • @leftaroundabout, I've rarely seen a tar file that wasn't then compressed with some algorithm. I would certainly say usually and, although it would be impossible to prove, would venture to say that >90% of all tar files are also compressed. – Ben Jan 12 '22 at 14:35
  • 8
    @Ben that seems likely for publicly-visible tarballs; however given the volume of data stored on tape in various organisations around the world, and the many instances where such tarballs aren’t compressed (e.g. because the tape drives do their own compression), it also seems likely that the proportion of compressed tarballs is far less than 90% when considering all tarballs. – Stephen Kitt Jan 12 '22 at 14:49
  • Of course, the file also occupies more than 520 bytes in the file system. Most likely not as much as 10K, but still more than the size indicated by ls or dir. – René Nyffenegger Jan 13 '22 at 07:00

2 Answers

66

tar archives have a minimum size of 10240 bytes by default; see the GNU tar manual for details (but this is not GNU-specific).

With GNU tar, you can reduce this by specifying a smaller blocking factor (-b, counted in 512-byte blocks) or a smaller record size (--record-size):

tar -cv -b 1 -f file.tar file.txt

The result will still be bigger than file.txt, because file.tar stores metadata about file.txt in addition to file.txt itself. In most cases you’ll see one block for the file’s metadata (name, size, timestamps, ownership, permissions), then the file content, then two blocks for the end-of-archive entry, so the smallest archive containing a non-zero-length file is four blocks in size (2,048 bytes with a 512-byte block).
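The arithmetic above can be sketched in a few lines of Python. This is a back-of-the-envelope model, not how any tar implementation is written: the function name tar_size is made up for illustration, and it assumes a single regular file with a short name, so that exactly one 512-byte header block is written and no extended headers are needed.

```python
BLOCK = 512  # size of one tar block in bytes


def tar_size(file_size: int, blocking_factor: int = 20) -> int:
    """Estimate the size of a tar archive holding one small regular file.

    Hypothetical helper for illustration only. Assumes one header block
    (short file name, no extended headers).
    """
    header_blocks = 1
    # File content is padded with zeroes up to a whole number of blocks.
    data_blocks = (file_size + BLOCK - 1) // BLOCK
    # The end-of-archive marker is two blocks of zeroes.
    end_blocks = 2
    total = (header_blocks + data_blocks + end_blocks) * BLOCK

    # The archive is written in whole records of blocking_factor blocks.
    record = BLOCK * blocking_factor
    return (total + record - 1) // record * record


print(tar_size(520))      # default blocking factor of 20
print(tar_size(520, 1))   # same file with -b 1
print(tar_size(1, 1))     # smallest non-empty file with -b 1
```

For the 520-byte file.txt from the question this gives 10240 with the default blocking factor of 20, and 2560 with -b 1: one header block, two data blocks, and two end-of-archive blocks.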

Stephen Kitt
  • 434,908
  • 3
    Note libarchive's bsdtar (which also supports -b) uses 512 byte blocks by default when the output goes to a regular file or is compressed with compression options passed to bsdtar (the compressed file itself will end up being padded to a 10k block though if it doesn't go to a regular file). – Stéphane Chazelas Jan 10 '22 at 13:17
  • 4
    Explanation for that oddball default: tar was once commonly used with actual tapes, which often had such odd block sizes – rackandboneman Jan 11 '22 at 18:41
  • @rackandboneman I still use tapes, but not with tar! – Stephen Kitt Jan 11 '22 at 19:37
  • 2
    @rackandboneman: tar literally stands for Tape ARchive, so yeah it's no surprise the format was designed around writing directly to tapes. – Peter Cordes Jan 12 '22 at 15:01
  • 2
    @rackandboneman: These sizes are not odd, they are very even: most of them are powers of 2 ;) 2^9 = 512, 2^11 = 2048, etc. – ohno Jan 13 '22 at 13:05
29

tar needs to do three things apart from simply storing files:

  1. Store metadata (name of the file, mode, owner, group, dates...)
  2. Mark the end of the file.
  3. Mark the end of the archive.

tar stands for "tape archive". On a tape it matters where each file ends, and the device needs to be able to find that point even when seeking (when the tape moves faster). So, for the convenience of tape devices, tar adds some zeroes at the end of every file and another set of zeroes at the end of the archive. The second question you linked to explains this.

You can see what is in the archive using hexdump -C archive.tar | less.
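The same layout can also be checked programmatically. The following is a minimal sketch using Python's standard tarfile module, which is not something the answer itself uses. It builds an archive in memory from 40 lines of Hello World! (mirroring the file in the question), then reads the name and size fields back out of the first 512-byte header block. The field offsets (name at byte 0, size as octal text at byte 124) are those of the ustar header layout, and the format is set to ustar explicitly so that no extended header precedes the file's own header.

```python
import io
import tarfile

# 40 lines of "Hello World!" = 520 bytes, like file.txt in the question.
data = b"Hello World!\n" * 40

# Build the archive in memory so nothing is written to disk.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="file.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

archive = buf.getvalue()

# The first 512-byte block is the header (metadata) for file.txt.
header = archive[:512]
name = header[0:100].rstrip(b"\0").decode()
size = int(header[124:136].rstrip(b"\0 ").decode(), 8)  # octal text field

# Everything after the file content should be zero padding:
# padding to the block boundary, the end-of-archive marker, record padding.
padding = archive[512 + size:]

print("archive bytes:", len(archive))
print("name field:   ", name)
print("size field:   ", size)
print("rest is zeroes:", set(padding) == {0})
```

Python's tarfile pads its output to 10240-byte records as well, so the in-memory archive comes out at the same 10240 bytes the question reports for file.tar.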

d.c.
  • 887