34

I have a few thousand files that are individually gzip-compressed (passing the -n flag, of course, so the output is deterministic). They then go into a Git repository. I just discovered that for 3 of these files, gzip doesn't produce the same output on macOS as on Linux. Here's an example:

macOS

$ cat Engine/Extras/ThirdPartyNotUE/NoRedist/EnsureIT/9.7.0/bin/finalizer | shasum -a 256
0ac378465b576991e1c7323008efcade253ce1ab08145899139f11733187e455  -

$ cat Engine/Extras/ThirdPartyNotUE/NoRedist/EnsureIT/9.7.0/bin/finalizer | gzip --fast -n | shasum -a 256
6e145c6239e64b7e28f61cbab49caacbe0dae846ce33d539bf5c7f2761053712  -

$ cat Engine/Extras/ThirdPartyNotUE/NoRedist/EnsureIT/9.7.0/bin/finalizer | gzip -n | shasum -a 256
3562fd9f1d18d52e500619b4a5d5dfa709f5da8601b9dd64088fb5da8de7b281  -

$ gzip --version
Apple gzip 272.250.1

Linux

$ cat Engine/Extras/ThirdPartyNotUE/NoRedist/EnsureIT/9.7.0/bin/finalizer | shasum -a 256
0ac378465b576991e1c7323008efcade253ce1ab08145899139f11733187e455  -

$ cat Engine/Extras/ThirdPartyNotUE/NoRedist/EnsureIT/9.7.0/bin/finalizer | gzip --fast -n | shasum -a 256
10ac8b80af8d734ad3688aa6c7d9b582ab62cf7eda6bc1a0f08d6159cad96ddc  -

$ cat Engine/Extras/ThirdPartyNotUE/NoRedist/EnsureIT/9.7.0/bin/finalizer | gzip -n | shasum -a 256
cbf249e3a35f62a4f3b13e2c91fe0161af5d96a58727d17cf7a62e0ac3806393  -

$ gzip --version
gzip 1.6
Copyright (C) 2007, 2010, 2011 Free Software Foundation, Inc.
Copyright (C) 1993 Jean-loup Gailly.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Written by Jean-loup Gailly.

How is this possible? I thought the GZip implementation was completely standard?

UPDATE: Just to confirm that macOS and Linux versions do produce the same output most of the time, both OSes output the same hash for:

$ echo "Vive la France" | gzip --fast -n | shasum -a 256
af842c0cb2dbf94ae19f31c55e05fa0e403b249c8faead413ac2fa5e9b854768  -
terdon
  • 242,166
Pol
  • 442
  • 4
    Can you post a hexdump or xxd of these gzip files? That way we can analyze and know for sure whether it's a metadata/header difference or the data stream itself. – 640KB Mar 01 '20 at 17:30
  • 5
    It would be interesting if you posted those files (both compressed and the uncompressed) somewhere -- if they don't contain anything sensitive, of course. –  Mar 01 '20 at 18:25
  • 2
    For more gzip-related fun, offhand, I believe AdvanceComp compresses gzip as well as zip. That is, they take common files and compress them to be even smaller, while still being able to decompress right. So you can go to http://www.advancemame.it/comp-readme.html and download AdvanceComp and run advzip -z4pki 10000 file.gz to compress a typical small gzip file (using gzip -9) into a smaller gzip file. – TOOGAM Mar 02 '20 at 07:39
  • There is NO WARRANTY, to the extent permitted by law. – Firzen Mar 02 '20 at 16:30
  • 6
    I see no guarantee in the docs that -n (i.e. --no-name) produces deterministic behavior. This would then seem to be a faulty assumption on your part. Some of the answers already explain that the compression used has certain parameters or freedom to do different things. The options --fast and --best also already indicate that the process is not deterministic. So why expect the same behavior from these two different programs? – hkBst Mar 02 '20 at 16:40
  • 2
    @hkBst, I think the behavior is expected, because most of the time the results are identical. There must be a reason why it works most of the time, but not in some situations. – JPhi1618 Mar 02 '20 at 20:11

4 Answers

62

Note that the compression algorithm (Deflate) in GZip is not strictly bijective: for some data, there is more than one possible compressed output, depending on the implementation and the parameters used. So there is no guarantee at all that Apple gzip and gzip 1.6 will return the same compressed output. These outputs are all valid GZip streams; the standard only guarantees that every one of these possible outputs decompresses to the same original data.
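
A minimal sketch of this point, assuming a gzip binary on the PATH that accepts -1 through -9, -n and -dc (GNU gzip and Apple gzip both do). The sample input and the function names are illustrative and not taken from the question. It compresses the same bytes at two levels, compares the resulting streams, and checks that each stream decompresses back to the input.

```shell
# Illustrative input (an assumption, not the file from the question).
sample() { printf 'Vive la France %s\n' 1 2 3 4 5 6 7 8 9 10; }

# Succeeds if the gzip streams produced at two levels are byte-identical.
same_stream() {
    a=$(sample | gzip "-$1" -n | od -An -v -tx1)
    b=$(sample | gzip "-$2" -n | od -An -v -tx1)
    [ "$a" = "$b" ]
}

# Succeeds if a stream compressed at the given level decompresses
# to exactly the original input.
same_data() {
    a=$(sample | gzip "-$1" -n | gzip -dc | od -An -v -tx1)
    b=$(sample | od -An -v -tx1)
    [ "$a" = "$b" ]
}

if same_stream 1 9; then
    echo "compressed streams: identical"
else
    echo "compressed streams: different"
fi

if same_data 1 && same_data 9; then
    echo "decompressed data: identical"
else
    echo "decompressed data: different"
fi
```

Two levels of one program stand in here for two different programs. The same check applies when the streams come from Apple gzip and GNU gzip: the compressed bytes may differ while the decompressed data must not.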

schnaader
  • 666
  • 1
    That's what I thought too (different compression levels, etc), but according to the OP, it's only 3 files out of thousands which are different between the 2 OSes. –  Mar 01 '20 at 18:29
  • 16
    Doesn't change much. e.g. different deflate implementations can decide to split big data into different sized chunks or change the compression strategy depending on file content. So Apple Gzip and gzip 1.6 could be so similar that they make the same decisions in most cases, but with some exceptions. – schnaader Mar 01 '20 at 18:38
  • Yes they can, but do they? FWIW, both the gzip from MacOS and that from Linux are derived from the same source code, and neither is compressing by chunks (eg. with Z_FULL_FLUSH) (only dictionaries and similar do that, in order to create "almost seekable" gzipped files). –  Mar 02 '20 at 10:59
  • 5
    Well, for a detailed definite answer, testing using the files from OP would be needed (and/or code-diving in both implementations). It's true that without them, we are speculating about the source of the difference. But it still is a fact that there are more possible differences than one would expect. Another example is that there are two possible encodings of match length 258 (see https://stackoverflow.com/questions/27152629/deflate-length-of-258-double-encoding ) - which one is used depends on the implementation, and the difference would only occur on repetitive data. – schnaader Mar 02 '20 at 11:54
  • 7
    Even two different versions of GNU gzip could potentially produce different compressed results, depending on changes in the algorithms used between versions. And, at a glance, I'm not sure that this particular version of Apple gzip and this particular version of GNU gzip actually are derived from the same source. The same upstream project? Sure. But the upstream source for the Apple implementation could be from different versions of that source than the GNU implementation. And that's assuming they aren't completely independent forks started at different times in gzip's development. – Cliff Armstrong Mar 02 '20 at 19:11
  • 4
    This is enough to answer the main part of the question: How is this possible – slebetman Mar 03 '20 at 11:57
20

The format should be very stable, but see its description. It contains a field for operating system ID.  Obviously that may differ for macOS and Linux and FreeBSD and ...
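
To check the header fields rather than guess, the fixed 10-byte header can be dumped directly. This is a sketch assuming a POSIX od and a gzip that reads from stdin; the function names are made up for illustration. The field layout in the comment follows the gzip format description (RFC 1952).

```shell
# Print the fixed 10-byte gzip header of a stream read from stdin, in hex.
# Layout per RFC 1952: ID1 ID2 CM FLG MTIME(4 bytes) XFL OS
gzip_header() {
    od -An -v -tx1 -N10
}

# Print only the OS byte (0-based offset 9) as a decimal number.
gzip_os_id() {
    od -An -v -tu1 -j9 -N1 | tr -d ' \n'
}

echo hello | gzip -n | gzip_header
echo hello | gzip -n | gzip_os_id
echo
```

Running this on both machines shows whether the OS byte or any other header byte differs. In RFC 1952, the value 3 stands for Unix and 7 for Macintosh.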

vonbrand
  • 18,253
  • 6
    I believe both OS's should produce files with the OS byte of 3 - Unix. This field is a legacy from the original PKZIP file format and any *nix-like OS should get a 3. Perhaps the difference has to do with OS-specific file attributes or differences in the POSIX ACLs. Or most likely I think @schnaader's comment below is probably the case. Perhaps the OP can post binary dumps of the created files so we can be certain. – 640KB Mar 01 '20 at 17:29
  • @640KB FWIW, the ACLs are metadata (just like the inode's permissions), they're not part of the file's data, and changing them does not reflect in its md5sum. –  Mar 01 '20 at 18:11
  • 3
    @640KB however, it's absolutely true that all of MacOS, FreeBSD, Linux use the same OS id (3 = Unix) -- you can see the whole list here. –  Mar 01 '20 at 18:21
  • @mosvy re:ACLs - right, of course! I had PKZIP format on my mind, and forgot we were talking about gzip streams. – 640KB Mar 01 '20 at 18:53
  • 2
    If indeed the operating system ID is the same for Linux and macOS, then this answer (as written) is unfortunately incorrect. – Pol Mar 01 '20 at 19:36
  • 2
    The linked format description has Unix = 3 and Macintosh = 7. Does the latter not apply to modern macOS? – planetmaker Mar 02 '20 at 10:58
  • 2
    @planetmaker 7 is presumably intended for Apple's original MacOS before OS X, which wasn't Unix-like at all. – Barmar Mar 02 '20 at 17:18
  • 4
    echo hello | gzip | hexdump -C -s 9 -n 1 is a pretty easy verification of what the platform code is anywhere (it is 03 under both Apple gzip on macOS and GNU gzip on other Unix-like platforms). – Michael Homer Mar 02 '20 at 20:35
  • 2
    At least the other answers are pure speculation, but this is obviously, provably wrong. What's the purpose of keeping it around? –  Mar 04 '20 at 20:43
10

Gzip format is standard, the implementation - not necessarily. Wikipedia lists at least 5 free/oss independent implementations and there are also proprietary ones. Apple clearly outputs a different version string.

The format and the algorithm both allow for a lot of freedom and a lot of design choices that are a matter of taste and/or work better in different use cases.

See Zip Files: History, Explanation and Implementation

I generally would expect the results to be the same between different implementations only for a small percent of small-ish files.
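
To find out where two implementations diverge for a given file, the two compressed outputs can be compared byte by byte. The following is only a sketch: the function name is hypothetical, and it assumes both files were written with -n, so that no optional name field follows the fixed 10-byte header.

```shell
# Report where two gzip files first differ: in the fixed 10-byte header
# or further on, in the deflate data or the trailer.
classify_diff() {
    if cmp -s "$1" "$2"; then
        echo "identical"
        return
    fi
    # cmp -l lists every differing byte; the first field is its 1-based offset.
    offset=$(cmp -l "$1" "$2" 2>/dev/null | awk 'NR == 1 { print $1 }')
    if [ -z "$offset" ]; then
        echo "one file is a prefix of the other"
    elif [ "$offset" -le 10 ]; then
        echo "header (byte $offset)"
    else
        echo "compressed data or trailer (byte $offset)"
    fi
}

# Demo: two levels of the local gzip stand in for the two platforms.
dir=$(mktemp -d)
printf 'hello hello hello\n' | gzip -1 -n > "$dir/a.gz"
printf 'hello hello hello\n' | gzip -9 -n > "$dir/b.gz"
classify_diff "$dir/a.gz" "$dir/b.gz"
rm -rf "$dir"
```

A difference inside the first 10 bytes points at header metadata. A later offset means the encoders made different choices for the data itself.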

Paulo Tomé
  • 3,782
fraxinus
  • 299
  • "The format and the algorithm both allow for a lot of freedom and a lot of design choices" are there any hard facts behind this claim? FWIW, apple's gzip is using the standard zlib library. –  Mar 02 '20 at 11:01
  • Apple uses the standard zlib library (zlib.h), but looking at the Linux gzip code (though I only found it for 1.10 instead of 1.6) at https://fossies.org/linux/gzip/deflate.c their implementation seems to be a custom one. Also, even if both were using zlib.h, the code that wraps the deflate calls (deflateInit, deflate, etc.) could be different. – schnaader Mar 02 '20 at 12:52
  • Follow-up: gzip 1.6 source code can be found at https://ftp.gnu.org/gnu/gzip/ and also doesn't use zlib.h – schnaader Mar 02 '20 at 13:02
  • @schnaader nope, zlib is derived ("abstracted") from the gzip program and both were written by the same people. –  Mar 02 '20 at 14:12
  • 2
    And I still wouldn't conclude that they always give the same compressed output from this. Looking at the gzip changelog at https://fossies.org/linux/gzip/ChangeLog , there have been many changes. Have these been adopted in Apple's code? Can't even find out which date Apple's gzip 272.250.1 is from. Also found this Debian bug report for gzip 1.4-1 ( https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=647522 ) where compressed output wasn't the same always. Fixed in 1.4-5, but demonstrates the problem and gives another past example. The cause was non-zeroed memory polluting the dictionary. – schnaader Mar 02 '20 at 15:23
  • @mosvy the article I linked gets in great detail about some possible design choices in zip implementation. It also mentions the different approaches used in some well-known implementations. "Lempel-Ziv compression can be both fast and slow. Zopfli spends a lot of time trying to find optimal back references to squeeze out a few extra percent of compression. This is useful for data that is compressed once and used many times, such as static content on a web server. On the other end of the spectrum are compressors such as Snappy and LZ4, which match only against the most recent 4-byte prefix ..." – fraxinus Mar 02 '20 at 17:31
0

Are you sure the files before compression are identical? Some VCSs check out text files differently: using UTF-8 or not, with Windows or Linux newlines, ...

Run the SHA-command on the original files to see if you are doing the same thing.

Maybe try compression level 0 to see if that works properly.

Find some simple files you can post here that get encoded differently on both systems.

Do the files get unzipped correctly on both systems? Run the SHA-command again.

And always ask yourself: does it matter? :)

  • 8
    Yes, the files are identical: see in my question how I compute their SHA256 on both platforms before gzip. – Pol Mar 02 '20 at 02:21