243

More and more tar archives use the xz format, based on LZMA2, for compression instead of the traditional bzip2 (bz2) compression. In fact, kernel.org made a "Good-bye bzip2" announcement in late December 2013 (27 Dec. 2013), indicating that kernel sources would from that point on be released in both tar.gz and tar.xz format, and on the main page of the website it is the tar.xz that is offered directly.

Are there specific reasons why this is happening, and what is the relevance of gzip in this context?

4 Answers

239

For distributing archives over the Internet, the following things are generally a priority:

  1. Compression ratio (i.e., how small the compressor makes the data);
  2. Decompression time (CPU requirements);
  3. Decompression memory requirements; and
  4. Compatibility (how widespread the decompression program is).

Compression memory & CPU requirements aren't very important, because you can use a large fast machine for that, and you only have to do it once.

Compared to bzip2, xz has a better compression ratio and lower (better) decompression time. It, however—at the compression settings typically used—requires more memory to decompress[1] and is somewhat less widespread. Gzip uses less memory than either.
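
A rough way to see these trade-offs yourself is a quick local test. A minimal sketch, assuming GNU time at /usr/bin/time, gzip 1.6 or newer (for -k), and a placeholder tarball called linux.tar:

    # Compare compression size, time and peak memory for the same tarball (not a rigorous benchmark).
    for c in "gzip -9" "bzip2 -9" "xz -9"; do
        /usr/bin/time -v $c -k linux.tar 2>&1 | grep -E 'Command|Elapsed|Maximum resident'
    done
    ls -l linux.tar.gz linux.tar.bz2 linux.tar.xz      # compression ratio
    # Decompression cost is what downloaders actually pay:
    /usr/bin/time -v xz -dc linux.tar.xz > /dev/null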

So, both gzip and xz format archives are posted, allowing you to pick:

  • Need to decompress on a machine with very limited memory (<32 MB): gzip. Granted, not very likely when talking about kernel sources.
  • Need to decompress with only minimal tools available: gzip
  • Want to save download time and/or bandwidth: xz (example commands are sketched below)
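
Hedged example commands for the choices above (GNU tar is assumed; the file names are placeholders):

    tar -xzf linux.tar.gz               # gzip: works nearly everywhere, tiny memory footprint
    tar -xJf linux.tar.xz               # xz: smaller download, needs more RAM to unpack
    gzip -dc linux.tar.gz | tar -xf -   # pipe form for very minimal (busybox-style) toolsets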

There isn't really a realistic combination of factors that'd get you to pick bzip2. So it's being phased out.

I looked at compression comparisons in a blog post. I didn't attempt to replicate the results, and I suspect some of them have changed (mostly, I expect xz has improved, as it's the newest).

(There are some specific scenarios where a good bzip2 implementation may be preferable to xz: bzip2 compresses files with lots of zeros, and genome DNA sequences, better than xz. Newer versions of xz now have an (optional) block mode which allows data recovery after a point of corruption, as well as parallel compression and [in theory] decompression. Previously, only bzip2 offered these.[2] However, none of this is relevant for kernel distribution.)
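
For the recovery and parallelism points, two hedged command sketches; bzip2recover ships with bzip2, the xz threading flag assumes xz 5.2 or newer, and the file names are placeholders:

    bzip2recover damaged.tar.bz2   # extracts each compressed block it can find into its own rec*.bz2 file
    xz -T0 big.tar                 # threaded compression; the output is split into independent blocks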


1: In archive size, xz -3 is around bzip2 -9; there, xz uses less memory to decompress. But xz -9 (as used, e.g., for Linux kernel tarballs) uses much more memory than bzip2 -9. (And even xz -0 needs more than gzip -9.)
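
To check the memory figures in this footnote on your own machine, a small sketch (assumes a reasonably recent xz; the file name is a placeholder):

    xz -vvl linux.tar.xz                   # reports the memory needed to decompress this file
    xz --info-memory                       # shows the memory-usage limits xz would apply here
    xz -dk --memlimit=32MiB linux.tar.xz   # errors out if decompression needs more than the limit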

2: F21 System Wide Change: lbzip2 as default bzip2 implementation

derobert
  • 109,670
  • Any comment on the topic of fault tolerance, or is that something that's always implemented completely outside of compression algorithms? – illuminÉ Jan 08 '14 at 01:00
  • @illuminÉ resiliency can't be provided without sacrificing compression ratio. It's an orthogonal problem, and while tools like Parchive exist, for distributing the kernel TCP's error handling does the job just as well. – Tobu Jan 08 '14 at 08:58
  • @illuminÉ Fault tolerance (assuming you mean something similar to par2) isn't normally a concern with distributing archives over the Internet. Downloads are assumed reliable enough (and you can just redownload if it was corrupted). Cryptographic hashes and signatures are often used, and they detect corruption as well as tampering. There are compressors that give greater fault tolerance, though at the cost of compression ratio. No one seems to find the tradeoff worth it for HTTP or FTP downloads. – derobert Jan 08 '14 at 17:02
  • xz uses LESS memory to decompress. – MichalH Aug 11 '15 at 11:32
  • @Mike Has it changed since I wrote this? In particular, footnote one explains memory usage. – derobert Aug 11 '15 at 14:03
  • @derobert I don't know, but it seems quite misleading to me, because for example (from the blog post) an xz-2 compressed file has nearly the same size as bzip2-3, but xz-2 uses 1300 KB of memory and bzip2-3 uses 1700 KB when decompressing. (Plus the compression time of xz-2 is faster.) However it uses more memory when compressing. – MichalH Aug 11 '15 at 14:15
  • @Mike right, for same file size xz uses less than bzip, but it seems people prefer the higher compression ratio instead (so real world use requires more memory). I'll think of a way to make that clearer. – derobert Aug 11 '15 at 14:18
  • @Mike does that make it clearer? I checked and confirmed that kernel sources are compressed with xz -9 (xz -vvl will tell you). – derobert Aug 11 '15 at 20:59
  • xz now supports parallel compression (and so does pixz) and xz archives can be made randomly accessible (so recoverable to some extent) and are when doing parallel compression. – Stéphane Chazelas Oct 25 '16 at 09:54
  • 128MB memory is massive for bzip2. Bzip2 tops out around 6MB. – Joshua Nov 01 '16 at 15:58
  • @StéphaneChazelas I've updated the post based on xz's parallel compression support. According to the docs at least, it doesn't do parallel decompression yet—but sounds like it's possible in theory. – derobert Nov 09 '16 at 17:53
  • @Joshua true, even less than that according to the docs. I've changed it to 32MB, as of course much of that RAM will be used for other things already running (kernel, init, shell, services, etc.). I think this is almost entirely irrelevant for kernel sources, though. (gcc will surely use more memory!) – derobert Nov 09 '16 at 17:55
  • I came across this article, suggesting xz should not be used as the compressed data might be corrupted. Thoughts? – HCSF Jun 12 '21 at 10:44
  • "It, however—at the compression settings typically used—requires more memory to decompress[1] and is somewhat less widespread." that sentence exceeds my brain's stack size – milahu Jul 03 '22 at 14:58
51

First of all, this question is not directly related to tar. Tar just creates an uncompressed archive; the compression is applied afterwards.
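
To make that separation concrete, a sketch assuming GNU tar; the directory and file names are placeholders:

    tar -cf - linux-3.12/ | xz -9 > linux-3.12.tar.xz   # what tar -cJf does, in spirit (compression level aside)
    xz -dc linux-3.12.tar.xz | tar -xf -                # and the reverse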

Gzip is known to be relatively fast compared to LZMA2 and bzip2. If speed matters, gzip (especially the multithreaded implementation pigz) is often a good compromise between compression speed and compression ratio, and there are even faster alternatives if speed is the main concern (e.g. LZ4).
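
A hedged sketch of the speed-oriented options, assuming GNU tar and that pigz and lz4 are installed (they usually are not by default):

    tar --use-compress-program=pigz -cf src.tar.gz src/    # -I for short; multithreaded, gzip-compatible output
    tar --use-compress-program=lz4 -cf src.tar.lz4 src/    # faster still, at a lower compression ratio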

However, if a high compression ratio is desired, LZMA2 beats bzip2 in almost every aspect. Compression is often slower, but it decompresses much faster and provides a much better compression ratio, at the cost of higher memory usage.

There is not much reason to use bzip2 any more, except for backwards compatibility. Furthermore, LZMA2 was designed with multithreading in mind, and many implementations by default make use of multicore CPUs (unfortunately, xz on Linux does not do this yet). This makes sense since clock speeds won't increase much any more, but the number of cores will.

There are multithreaded bzip2 implementations (e.g. pbzip2), but they are often not installed by default. Also note that multithreaded bzip2 only really pays off when compressing: in contrast to LZMA2, decompression uses a single thread if the file was compressed with a single-threaded bzip2. Parallel bzip2 variants can only leverage multicore CPUs if the file was compressed using a parallel bzip2 version, which is often not the case.
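
A short sketch of these parallel bzip2 variants (both must be installed separately; the file names are placeholders):

    pbzip2 -9 big.tar        # parallel compression; its decompression is only parallel for pbzip2-made files
    lbzip2 -d big.tar.bz2    # reportedly decompresses even stock-bzip2 files in parallel (see the comments)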

Marco
  • 33,548
  • Well some tars grok a z option. – tchrist Jan 07 '14 at 01:19
  • "speed" makes for a muddled answer, you should refer to compression speed or decompression speed. Neither pixz, pbzip2 or pigz are installed by default (or used by tar without the -I flag), but pixz and pbzip2 speed up compression and decompression and pigz is just for compression. – Tobu Jan 08 '14 at 08:33
  • @Tobu xz will be multithreaded by default, so no pixz installation will be required in the future. On some platforms xz threading is supported already. bzip2, on the other hand, is unlikely to ever be multithreaded, since the format wasn't designed with multithreading in mind. Furthermore, pbzip2 only speeds up decompression if the file has been compressed using pbzip2, which is often not the case. – Marco Jan 08 '14 at 12:07
  • @Marco I believe lbzip2 allows for parallel decompression of files even if they were compressed with a non-parallel implementation (e.g. stock bzip2). That's why I use lbzip2 over pbzip2. (It's possible this has evolved since your comment.) – RaveTheTadpole Jan 20 '15 at 06:09
  • This makes sense since the clock speeds won't increase any more - what? that's not quite true. the post was made in 2014, when Intel released the i3-4370 at 3.8GHz. in 2017, Intel released the i7-8700K at 4.7GHz. in 2018 they released the i9-9900K at 5GHz - and there's probably cpus in 2015 & 2016 that's missing on this list too – hanshenrik Mar 11 '20 at 09:54
  • @hanshenrik Look at the charts in 42 Years of Microprocessor Trend Data. You'll notice the CPU frequency reaching a plateau in the mid 2000's. That's what I'm talking about. – Marco Mar 11 '20 at 19:08
  • it's not rising as fast as it used to, sure, but it's still rising: Intel's upcoming 10th gen 2020 desktop chips will supposedly reach 5.3GHz (10th gen intel cpus are out already, but only the low-power laptop variants, the desktop gen10's aren't out yet) – hanshenrik Mar 11 '20 at 21:19
  • This answer should include why LZMA2 is relevant to the question, which asks about the xz format. From wikipedia "The .xz format, which can contain LZMA2 data" https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm#xz_and_7z_formats – Chris L. Barnes Jan 27 '21 at 18:09
23

Short answer: xz is more efficient in terms of compression ratio. So it saves disk space and reduces the bandwidth and time needed to transfer an archive over the network.

You can look at this Quick Benchmark to see the difference in practical tests.

Update: The “Quick Benchmark” web page is an elusive moving target.
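
If the linked page has moved yet again, a rough local comparison is easy to run (file names are placeholders; the absolute numbers depend heavily on the input):

    du -h linux.tar linux.tar.gz linux.tar.bz2 linux.tar.xz   # compare on-disk sizes of the same tarball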

Pablo A
  • 2,712
Slyx
  • 3,885
22

LZMA2 is a block-based compression system, whereas gzip is not. This means that LZMA2 lends itself to multi-threading. Also, if corruption occurs in an archive that was written as multiple blocks (for example with threaded compression or an explicit block size), you can generally recover data from the blocks after the damage, which you cannot do with gzip. In practice, with gzip you lose everything after the point of corruption. With a multi-block LZMA2 archive, you only lose the file(s) affected by the corrupted block(s). This can be important in larger archives with multiple files.
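
A hedged sketch of making that block structure explicit and visible, assuming xz 5.2 or newer; note that actually salvaging data past a damaged block still takes manual or third-party tooling, as the comments below point out:

    xz -T0 --block-size=16MiB big.tar   # independent 16 MiB blocks, compressed in parallel
    xz -lvv big.tar.xz                  # lists each block, each carrying its own integrity check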

  • This is a very useful and important distinction, indeed! – leden Jan 04 '17 at 16:49
  • Can you back up these claims with sources? I have yet to see an XZ recovery tool, and my known source claims otherwise: https://www.nongnu.org/lzip/xz_inadequate.html – nyov Dec 16 '19 at 09:05