5

This question did not help me (despite having the same title), so I am posting this even though it is a duplicate question.

As far as I can tell, the total in du -k includes all the subdirectories and indicates I have 77 megabytes of data:

/raid/fpuData/oldOutput>du -ks
77063332        .
/raid/fpuData/oldOutput>tar -cvzf ../oldOutput.tar.zip *

The backup is still running, but already the file is considerably bigger than 77 megabytes:

/raid/fpuData>ls -l oldOutput.tar.zip
-rw-r--r-- 1 nobody nobody 14470610944 Jul  1 22:18 oldOutput.tar.zip

The files I'm backing up are all huge text files filled with numbers (like a huge comma-delimited spreadsheet). Something like this:

0.3454915028125262743685653,0.5590169943749474512628694,...
0.221761776923297210251107,0.3588180924674668759166707,...
-0.06101864995889930837202897,-0.09873024958113109372792593,...
-0.3001958820500086333460388,-0.4857271404396689140625654,...
...

Why is the tar file bigger than the directory? It should be compressed because I'm creating it with the z option. What's the point of tarring it, then?

Jeff
  • 183
  • 1
    You may already realize this, but the correct extension is .tar.gz. Read this. –  Jul 02 '13 at 02:32
  • 4
    As a theoretical note: it's easy to show that, for any data-compressing algorithm, there exist texts whose compressed output is at least as long as the original text. Otherwise, any string X could be compressed to a strictly shorter string Y, which would mean any file could be reduced, by repeated application of the algorithm, to the empty file; that is absurd, since a compression algorithm must be reversible (injective). Hence it's not strange for archives to be bigger than the original data (a quick shell demonstration follows below). – Bakuriu Jul 02 '13 at 07:46
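
A minimal shell demonstration of the note above (assumes gzip and /dev/urandom are available; incompressible.bin is a hypothetical file name):

head -c 1000000 /dev/urandom > incompressible.bin   # random bytes have no redundancy to exploit
wc -c < incompressible.bin                          # 1000000 bytes
gzip -c incompressible.bin | wc -c                  # typically slightly more than 1000000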

4 Answers

15

Your compressed tar file is smaller than its contents.

ls prints file sizes in bytes by default.
du -k prints file sizes in kilobytes.

14470610944 B = 14131456 KB < 77063332 KB

To make ls print file sizes in kilobytes, use the -k flag.
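
For example, to compare the two numbers in matching units (a minimal sketch, assuming GNU coreutils; the path is taken from the question):

ls -l /raid/fpuData/oldOutput.tar.zip    # size column is in bytes
du -k /raid/fpuData/oldOutput.tar.zip    # size in 1024-byte units
ls -lh /raid/fpuData/oldOutput.tar.zip   # human-readable, roughly 14G here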

  • I guess I could complain about Linux not using a "k" unit symbol or be happy that I have an answer - either way you are correct. The final tarred file size, using ls -k oldOutput.tar.zip, is 32703224 - less than half the original. – Jeff Jul 02 '13 at 20:51
  • You can use the -h flag with both du and ls to get human-readable sizes. If you would like that to be your default, add a couple of aliases to your interactive shell config (.bashrc, .zshrc, etc): alias du='du -h' and alias ls='ls -h'. –  Jul 02 '13 at 21:24
2

Remember, if you compress primarily BINARY data (e.g. *.gz, *.zip files), it is possible, even probable, that you will get an output file that is FAR LARGER than the original aggregation. So I'd lose the -z switch on the tar you are trying.
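
Before dropping it, you can test whether gzip actually shrinks one of the files (a quick sketch, assuming gzip is installed; sample.csv is a hypothetical file name):

wc -c < sample.csv            # original size in bytes
gzip -c sample.csv | wc -c    # compressed size in bytes; if this is larger, skip -z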

mdpc
  • 6,834
  • The OP is compressing text, not binary data. The text happens to be numbers but it is still text: "The files I'm backing up are all huge text files filled with numbers". – terdon Jul 02 '13 at 03:27
  • See the command tar -cvzf ../oldOutput.tar.zip *: it is compressing, via tar -z, an already zipped and compressed file, which at that point IS binary data! – mdpc Jul 02 '13 at 16:00
  • mdpc: Please clarify. What do you mean by saying I am compressing already zipped files? It seems to me that @terdon's original comment is correct. – Jeff Jul 02 '13 at 20:47
  • @Jeff you're quite right, I misread the command when I agreed with @mdpc. Your command is correctly creating oldOutput.tar.zip from (presumably) uncompressed files. – terdon Jul 02 '13 at 20:55
  • tar -cvf created a much larger file than tar -cvzf - but it ran faster. – Jeff Jul 06 '13 at 20:49
2

Text files don't compress better just because they are labeled with a "txt" extension. Text files often compress better because there tends to be a lot of extra "white space" and repeated character sequences.

I postulate that your CSV files have little to no "white space" to clean up, and actually behave more like a binary or graphics image file.
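
One way to test that postulate (a hedged sketch; notes.txt and data.csv are hypothetical names for a repetitive text file and one of your dense numeric CSVs):

for f in notes.txt data.csv; do
    orig=$(wc -c < "$f")
    comp=$(gzip -c "$f" | wc -c)
    echo "$f: $orig bytes -> $comp bytes compressed"
done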

  • There is no white space at all in my CSV files, unless new lines are considered whitespace - probably not - in which case each file has exactly 10,000 white space characters. – Jeff Jul 02 '13 at 20:49
0

Maybe you are backing up sparse files without the tar option --sparse?

You can easily find out by extracting the archive after it has finished and comparing the source directory with the extracted directory.
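
A sketch of both checks, assuming GNU du and GNU tar (paths taken from the question):

du -sk --apparent-size /raid/fpuData/oldOutput   # logical size in KB
du -sk /raid/fpuData/oldOutput                   # allocated blocks; much smaller than above if files are sparse
cd /raid/fpuData/oldOutput
tar --sparse -czf ../oldOutput.tar.gz .          # GNU tar stores the holes compactly
mkdir /tmp/check && tar -xzf /raid/fpuData/oldOutput.tar.gz -C /tmp/check
diff -r /raid/fpuData/oldOutput /tmp/check       # compare source with the extracted copy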

jordanm
  • 42,678
Hauke Laging
  • 90,279