105

At my company, we download a local development database snapshot as a db.dump.tar.gz file. The compression makes sense, but the tarball only contains a single file (db.dump).

Is there any point to archiving a single file, or is .tar.gz just such a common idiom? Why not just .gz?

Braiam
  • 35,991
gardenhead
  • 2,017
  • 5
    It is just a matter of convention in my opinion. When people see a file with a .gz extension, their first instinct is to reach for tar -zxvf. But for those who look at the file name and realise it isn't a tarball, it is perfectly fine to gzip the db dump file directly. Since I don't know the compression algorithms in detail, I am not sure whether tar helps with sparse files like db dumps, but for plain text files, gzipping the file directly has a very tiny size advantage over tarring first and then gzipping. – MelBurslan Apr 20 '16 at 12:59
  • 3
    All tarring a single file will do is add a few metadata blocks to the start and end of the file. The actual file data passes through tar to the compressor untouched. So for a large file, the size difference between plain compression and tarring will be negligible. – plugwash Apr 20 '16 at 14:29
  • In the past when trying various compression methods I found .tar.gz to be superior to most other common methods. I recall it was superior to just .tar but cannot remember if it was better than just .gz. Ironically Window's .cab format was the best of the methods I tried, which was very unexpected. – Pharap Apr 25 '16 at 03:14
  • @Pharap tar is not a compression algorithm, it's an archiving format – gardenhead Apr 25 '16 at 16:37
  • 1
    @gardenhead Well that would explain why it didn't work very well. – Pharap Apr 26 '16 at 04:18
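plugwash's estimate in the comments is easy to check: compress the same payload both ways and compare sizes. A minimal sketch, assuming GNU tar and gzip and a readable /dev/urandom; the file name db.dump is just the example from the question:

```shell
# Compare plain gzip against tar+gzip for a single file.
cd "$(mktemp -d)"
head -c 1000000 /dev/urandom > db.dump   # incompressible stand-in payload

gzip -c db.dump > plain.gz               # gzip only
tar -czf wrapped.tar.gz db.dump          # tar first, then gzip

ls -l plain.gz wrapped.tar.gz            # sizes differ by well under 1 KB
```

The tar header and end-of-archive padding are mostly zeros, so they compress to almost nothing.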

8 Answers

165

Advantages of using .tar.gz instead of .gz are that

  • tar stores more meta-data (UNIX permissions etc.) than gzip.
  • the setup can more easily be expanded to store multiple files
  • .tar.gz files are very common, only-gzipped files may puzzle some users. (cf. MelBurslan's comment)

The overhead of using tar is also very small.

Still, unless it is really needed, I do not recommend tarring a single file. There are many useful tools which can access compressed single files directly (such as zcat, zgrep, etc.; equivalents exist for bzip2 and xz).
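For example, the z-tools operate on the compressed file in place. A quick sketch, with db.dump standing in for a real text dump:

```shell
# Work with a gzipped dump directly, with no tar layer in the way.
cd "$(mktemp -d)"
printf 'CREATE TABLE users;\nINSERT INTO users VALUES (1);\n' > db.dump
gzip db.dump                     # leaves db.dump.gz

zcat db.dump.gz | head -n 1      # stream the contents
zgrep -c 'INSERT' db.dump.gz     # search without decompressing to disk
```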

jofel
  • 26,758
  • 37
    I didn't consider the meta-data aspect. Very good point – gardenhead Apr 20 '16 at 13:38
  • 5
    If I see a .gz, my first instinct is to tar -zxf foo.gz. Remembering that gzip is even a command takes a few more seconds. – bgStack15 Apr 20 '16 at 14:26
  • 2
    @bgStack15 FWIW you don't need the z (or the - for that matter), most modern tars will automatically detect the file needs to be decompressed. –  Apr 20 '16 at 18:00
  • 2
    By default gzip will store the original file name and time stamp. You can use the -N option when decompressing to restore them. – Ross Ridge Apr 22 '16 at 04:56
  • @RossRidge thanks, I removed again the text about the original file name. – jofel Apr 22 '16 at 08:50
65

You are actually asking only half of the question. The other question being, "Why would I compress a tar file with gzip?". And the answer is not just that gzip makes the file smaller (in most cases):

tar:

  • stores filename and other metadata: mode, owner ID, group ID, filesize, modification time
  • stores a checksum (for the header only)

gzip:

  • can store the original filename, but that is optional
  • has a CRC-32 checksum over the original data
  • it compresses the file

With only tar you cannot be sure your data was not corrupted. With only gzip you cannot restore the user/group ID or modification time, and possibly not the original filename.

The combination is more powerful than the individual commands/formats alone, because they complement each other's features.
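The metadata difference is visible from the listing tools alone. A sketch assuming GNU tar and gzip; the file name and permissions are arbitrary examples:

```shell
# tar records mode/owner/mtime; gzip's header holds only a name, an mtime
# and a CRC-32 of the uncompressed data.
cd "$(mktemp -d)"
printf 'data\n' > db.dump
chmod 640 db.dump                       # distinctive permissions

tar -czf db.dump.tar.gz db.dump
gzip -c db.dump > db.dump.gz

tar -tvzf db.dump.tar.gz                # shows -rw-r-----, owner, size, mtime
gzip -l db.dump.gz                      # shows sizes and ratio; no permissions
```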

Anthon
  • 79,293
  • Thanks for clarifying that! When I was reading the tar wikipedia page, I misunderstood the description to mean that the checksum was for the whole file. – gardenhead Apr 21 '16 at 14:23
  • This feels to me like the correct answer. I'd also add a few more reasons, which you might want to edit in if you agree. 1) There's no additional cost to the admin for .tgz over .tar or .gz alone: they're all just one command. 2) Admins back up, copy, relocate and move a LOT of files, for a lot of different reasons; DB backups are just one of these. They can use the same workflow, tools and commands whether backing up one or multiple files; so why special-case the syntax of the gzip command for the case where there is one file? – Dewi Morgan Apr 22 '16 at 23:12
30

There is a quite big advantage to using only-gzipped text files - the contents can be directly accessed with command-line tools like less, zgrep, zcat.

heemayl
  • 56,300
ejdi
  • 401
  • interesting point, but the question is about a database snapshot, unlikely to be a text file, and not only-gzipped. – underscore_d Apr 20 '16 at 20:39
  • 9
    @underscore_d all of my database dumps (mostly mysql and pgsql) are text dumps, partly because they're more salvageable if something happens to partially-corrupt the dump, and partly because i can pre-process any restore with the usual tools (sed, awk, perl, etc) if I need to. i.e. more reliable and more useful than binary dumps. The trade-off is that text-dumps tend to be larger (who cares? disk space is cheap and we have good compression) and restores are significantly slower (but less so if you wrap the restore in a transaction). – cas Apr 21 '16 at 00:32
  • 1
    What is the advantage of these tools over simply piping the output of a decompressor into the plain tools? – CodesInChaos Apr 23 '16 at 20:16
21

I would say it's likely that the people just don't realise they can use gzip/bzip2/xz without tar. Possibly because they come from a DOS/Windows background where it is normal for compression and archiving to be integrated in a single format (ZIP, RAR, etc).

While there may be slight advantages to using tar in some situations due to the storage of metadata or the ability to add extra files, there are also disadvantages. With a plain gzip/bzip2/xz file you can decompress it and pipe the decompressed data straight to another tool (such as your database) without ever having to store the decompressed data as a file on disk. With a tarball this is harder.
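To illustrate the piping point, here is a sketch in which wc -c stands in for the real consumer (such as psql or mysql); the -O flag assumes GNU tar, as noted in the comments:

```shell
# Stream a compressed dump straight into a consumer, never touching disk.
cd "$(mktemp -d)"
printf 'restore me\n' > db.dump
gzip -c db.dump > db.dump.gz
tar -czf db.dump.tar.gz db.dump

gunzip -c db.dump.gz     | wc -c   # plain gzip: the obvious one-liner
tar -xzOf db.dump.tar.gz | wc -c   # tarball: needs GNU tar's -O for stdout
```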

Otheus
  • 6,138
plugwash
  • 4,352
  • 2
    With GNU tar,it takes just -O switch to output to stdout, so I wouldn't say it is much harder! – hyde Apr 20 '16 at 20:28
  • 5
    The first paragraph seems plausible enough for files using the tgz extension. However, the OP's case uses tar.gz - and if these hypothetical ex-Win/DOS users are anything like I was, the first thing they say when looking at such a file is: 'Why does it have 2 extensions?'. Then they google it and quickly get the answer, which specifically explains that tar and compression are distinct. ;-) – underscore_d Apr 20 '16 at 20:42
18

There is an important difference that could make using tar important under some circumstances: Besides the "metadata" that @jofel mentioned in his answer, tar records the filename in the archive. When you extract it, you get the original filename regardless of what the archive is called.

In your case the tar archive and the file it contains have the related names db.dump.tar.gz and db.dump, but suppose you rename the tar file to 20-Apr-16.dump.tgz, or whatever. Untar this with tar xvfz, and you get db.dump. For comparison, gunzip 20-Apr-16.dump.gz and you've got 20-Apr-16.dump. (Edit: as pointed out in the comments, gzip also makes a record of the filename; but it's not normally used when unzipping). A tar archive can also contain a relative pathname that puts the extracted file in a subdirectory.

Your use case will dictate whether this kind of filename persistence is needed, or even wanted, or is actually undesirable. But certainly, regardless of compression, a tar archive travels differently from a regular file.
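A quick sketch of that renaming behaviour, assuming GNU tar; the archive name 20-Apr-16.dump.tgz is the example from above:

```shell
# The name stored inside the tar archive survives any renaming of the archive.
cd "$(mktemp -d)"
printf 'x\n' > db.dump
tar -czf archive.tgz db.dump
rm db.dump

mv archive.tgz 20-Apr-16.dump.tgz   # rename however you like
tar -xzf 20-Apr-16.dump.tgz
ls db.dump                          # extracted under its original name
```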

alexis
  • 5,759
  • 6
    gzip also records the original filename. – psusi Apr 21 '16 at 00:59
  • 8
    Yup. The name is optional in the gzip header—obviously there won't be one if you compressed the streaming output of a command—and most tools won't restore it by default (for instance, you have to use gzip --name explicitly when decompressing), but you don't have to use tar to get filename persistence. – Miles Apr 21 '16 at 03:12
  • Thanks for pointing this out, I hadn't known that. Still, since that's not the default behavior, the point stands: Distributing a file in tar format preserves the original filename (and possibly the relative path), without intervention of the recipient. Distributing a (g)zipped file doesn't. – alexis Apr 21 '16 at 14:33
8

In addition to all the other answers, I recently ran into a scripting situation where only one file was expected, but a previous employee had written the scripts to allow for more than one file being generated. So files were tarred and bzipped, then transferred, and expanded.

When the process grew to the point it made a 4.3 GB file, it rolled over and made a .dump.001 file in addition to a .dump file. All the scripts just kept working.

That is proactive sysadmin laziness defined!

Criggie
  • 1,781
2

I would tar a single file, to copy it preserving the timestamp (which is easily overlooked in downloads). File permissions and ownership are less important: download is a term that applies to systems which are not well integrated.

Whether tar'd or not, it is standard practice to compress the file to make downloads faster — and avoid running out of diskspace.
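A sketch of the timestamp round-trip, assuming GNU tar; the 2020 date is arbitrary:

```shell
# tar preserves the file's mtime across pack/unpack by default.
cd "$(mktemp -d)"
printf 'x\n' > db.dump
touch -t 202001011200 db.dump   # pretend the dump is from Jan 2020
tar -czf db.dump.tar.gz db.dump
rm db.dump

tar -xzf db.dump.tar.gz
ls -l db.dump                   # mtime shows Jan 2020, not extraction time
```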

Thomas Dickey
  • 76,765
-1

Tar is especially useful for storing multiple files that are not written to a formal file system, and it always has been. If for some reason there is, on occasion, only one file to be written, that is of no real consequence. I can dd my .tar.gz directly to /dev/sdx without regard to partition or file system. It may as well be tape.

It is generally done because the script or process has been copied from legacy code. Of course there is no need to tar if there is only one file, but it leaves room for enhancement to multiple files later.

mckenzm
  • 327