gzip same input different output

Question

Check out:

data/tmp$ gzip -l tmp.csv.gz
     compressed        uncompressed  ratio uncompressed_name
           2846               12915  78.2% tmp.csv
data/tmp$ cat tmp.csv.gz | gzip -l
     compressed        uncompressed  ratio uncompressed_name
             -1                  -1   0.0% stdout
data/tmp$ tmp="$(cat tmp.csv.gz)" && echo "$tmp" | gzip -l

gzip: stdin: unexpected end of file

Ok apparently the input is not the same, but it should have been, logically. What am I missing here? Why aren't the piped versions working?

score 5 · Accepted Answer · edited Apr 13 '17 at 12:37

This command

$ tmp="$(cat tmp.csv.gz)" && echo "$tmp" | gzip -l

assigns the content of tmp.csv.gz to a shell variable and then uses echo to pipe that to gzip. The shell gets in the way, though: null characters are omitted from the variable. You can see this with a test script:

#!/bin/sh
tmp="$(cat tmp.csv.gz)" && echo "$tmp" |cat >foo.gz
cmp foo.gz tmp.csv.gz

and, with some more work, by using od (or hexdump) and looking closely at the two files. For example, the original file starts with:

0000000 037 213 010 010 373 242 153 127 000 003 164 155 160 056 143 163
        037 213  \b  \b 373 242   k   W  \0 003   t   m   p   .   c   s
0000020 166 000 305 226 141 157 333 066 020 206 277 367 127 034 012 014
          v  \0 305 226   a   o 333   6 020 206 277 367   W 034  \n  \f
0000040 331 240 110 246 145 331 362 214 252 230 143 053 251 121 064 026
        331 240   H 246   e 331 362 214 252 230   c   + 251   Q   4 026

The copy written by the script (foo.gz) drops the null in the first line of its output:

0000000 037 213 010 010 373 242 153 127 003 164 155 160 056 143 163 166
        037 213  \b  \b 373 242   k   W 003   t   m   p   .   c   s   v
0000020 305 226 141 157 333 066 020 206 277 367 127 034 012 014 331 240
        305 226   a   o 333   6 020 206 277 367   W 034  \n  \f 331 240
0000040 110 246 145 331 362 214 252 230 143 053 251 121 064 026 152 027
          H 246   e 331 362 214 252 230   c   + 251   Q   4 026   j 027

Since the data changes, it is no longer a valid gzip'd file, which produces the error.
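
The same effect can be reproduced on a tiny input. The following is a minimal sketch, not part of the original test: the file names sample.bin and copy.bin are made up for illustration, mktemp -d is assumed to be available, and the exact size of the copy depends on the shell (most sh implementations either drop the NUL or stop at it).

```sh
#!/bin/sh
# Work in a scratch directory (assumes mktemp -d is available).
dir=$(mktemp -d) && cd "$dir" || exit 1

# Three bytes: 'a', NUL, 'b'.
printf 'a\0b' > sample.bin

# Round-trip through a shell variable; printf '%s' avoids echo's own changes.
tmp="$(cat sample.bin)"
printf '%s' "$tmp" > copy.bin

orig_size=$(wc -c < sample.bin | tr -d ' ')
copy_size=$(wc -c < copy.bin | tr -d ' ')
echo "original: $orig_size bytes, copy: $copy_size bytes"

# cmp exits non-zero when the two files differ.
if cmp -s sample.bin copy.bin; then result=same; else result=different; fi
echo "$result"
```

Using printf '%s' rather than echo keeps the NUL handling separate from anything echo might do to the data.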

As noted by @coffemug, the manual page points out that gzip will report a -1 for files not in gzip'd format. However, the input is no longer a compressed file in any format, so the manual page is in a sense misleading: it does not categorize this as error-handling.

Further reading:

@wildcard points out that other characters such as backslash can add to the problem, because some versions of echo will interpret a backslash as an escape and produce a different character (or not, depending on the treatment of escapes applied to characters not in their repertoire). In the case of gzip (or most forms of compression), the various byte values are equally likely; all nulls will be omitted, while only some backslashes will cause the data to be modified.

The way to prevent this is not to try assigning a shell variable the contents of a compressed file. If you want to do that, use a better-suited language. Here is a Perl script which can count character-frequencies, as an example:

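#!/usr/bin/perl

use strict;
use warnings;

# Number of occurrences of each byte value, over all files.
our %counts;

sub doit {
    my $file = shift;

    # Three-argument open with low-precedence "or", so that a failure
    # really does reach die; binmode keeps the bytes untranslated.
    open my $fh, '<', $file or die "cannot open $file: $!";
    binmode $fh;
    my @data = <$fh>;
    close $fh;
    for my $n ( 0 .. $#data ) {
        for my $o ( 0 .. ( length( $data[$n] ) - 1 ) ) {
            my $c = substr( $data[$n], $o, 1 );
            $counts{$c} += 1;
        }
    }
}

while ( $#ARGV >= 0 ) {
    doit( shift @ARGV );
}

# Print printable ASCII as itself and everything else as an octal escape.
for my $c ( sort keys %counts ) {
    if ( ord($c) > 32 && ord($c) < 127 ) {
        printf "%s:%d\n", $c, $counts{$c} if ( $counts{$c} );
    }
    else {
        printf "\\%03o:%d\n", ord($c), $counts{$c} if ( $counts{$c} );
    }
}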
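
If the bytes really do have to sit in a shell variable, one workaround (not from the answer above, and only a sketch) is to store a text encoding of them instead of the raw data. This assumes a base64 utility that accepts -d for decoding, such as the one in GNU coreutils; the small CSV payload and the name restored.gz are placeholders.

```sh
#!/bin/sh
# Work in a scratch directory (assumes mktemp -d is available).
dir=$(mktemp -d) && cd "$dir" || exit 1

# Build a small gzip file to stand in for tmp.csv.gz.
printf 'a,b,c\n1,2,3\n' | gzip -c > tmp.csv.gz

# Keep a base64 rendering in the variable: it contains no NULs or backslashes.
encoded=$(base64 < tmp.csv.gz)

# Decode on the way out; printf '%s\n' passes the text through unchanged.
printf '%s\n' "$encoded" | base64 -d > restored.gz

if cmp -s tmp.csv.gz restored.gz; then roundtrip=ok; else roundtrip=corrupt; fi
echo "$roundtrip"
gunzip < restored.gz
```

The encoded text is roughly a third larger than the original, which is the price of making it safe for a variable.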

This was exactly what I was missing, thanks. Do you have any idea on how to prevent the shell from removing those null characters? Also do you have any idea on why the shell is behaving this way? — gwn, Jun 23 '16 at 09:31
This is only part of the problem. echo will interpret backslash sequences in various ways depending on the version used, so it is not equivalent to piping the raw data (e.g. with cat) even if the null bytes are addressed. — Wildcard, Jun 23 '16 at 19:35
I'm not sure I understand why gzip -l <data.gz works but cat data.gz | gzip -l doesn't... — Kusalananda, Jun 23 '16 at 20:08
@Kusalananda, because with gzip -l < data.gz, gzip can seek to the end of the file (on stdin) where that information is stored, while if stdin is a pipe, it cannot seek. — Stéphane Chazelas, Jun 23 '16 at 20:11

score 2 · Answer 2 · answered Jun 23 '16 at 20:36

The information about the uncompressed size of the file (actually of the uncompressed size of the last chunk as gzip files can be concatenated together) is stored as a little endian 32 bit integer in the last 4 bytes of the file.

To output that information, gzip -l seeks to the end of the file and reads those 4 bytes (actually, according to strace, it reads the last 8 bytes, that is, the CRC and the uncompressed size).

It then prints the size of the compressed file and that number. Note that this information is misleading for concatenated gzip files: there it does not give the same result as gunzip < file.gz | wc -c.
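
The mismatch for concatenated files can be seen without parsing any numbers. In this sketch (file names and payloads are invented for the demonstration, and mktemp -d is assumed), the last 8 bytes of the concatenation are compared with the last 8 bytes of its final member, while gunzip reports the combined length.

```sh
#!/bin/sh
# Work in a scratch directory (assumes mktemp -d is available).
dir=$(mktemp -d) && cd "$dir" || exit 1

printf 'hello\n'   | gzip -c > a.gz   # 6 bytes uncompressed
printf 'world!!\n' | gzip -c > b.gz   # 8 bytes uncompressed
cat a.gz b.gz > both.gz

# Decompressing reads both members, so this counts 6 + 8 bytes.
total=$(gunzip < both.gz | wc -c | tr -d ' ')

# The trailer (CRC and size) at the end of both.gz is just b.gz's trailer.
tail -c8 < both.gz > both.trailer
tail -c8 < b.gz > b.trailer
if cmp -s both.trailer b.trailer; then same_trailer=yes; else same_trailer=no; fi

echo "decompressed: $total bytes; trailer identical to b.gz: $same_trailer"
```

Anything that reads only the end of both.gz therefore sees the size of b.gz's contents, not the total.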

Now, that works if the file is seekable, but not when it isn't, as in the case of a pipe. And gzip is not smart enough to detect that and read the input fully to get to its end.

Now, in the case of:

tmp="$(cat tmp.csv.gz)" && echo "$tmp" | gzip -l

There's also the problem that shells other than zsh cannot store NUL bytes in their variables, that $(...) strips all trailing newline characters (0xa bytes), and that echo transforms its arguments (if they start with - or contain \ depending on the echo implementation) and adds an extra newline character.
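
The newline handling can be checked separately from the NUL problem. A small sketch, with a made-up file name and assuming mktemp -d is available:

```sh
#!/bin/sh
# Work in a scratch directory (assumes mktemp -d is available).
dir=$(mktemp -d) && cd "$dir" || exit 1

# Four bytes: 'x' followed by three newlines.
printf 'x\n\n\n' > nl.bin
v=$(cat nl.bin)

# $(...) has removed every trailing newline, leaving only 'x'.
kept=$(printf '%s' "$v" | wc -c | tr -d ' ')

# echo then appends one newline of its own.
echoed=$(echo "$v" | wc -c | tr -d ' ')

echo "in file: 4, in variable: $kept, after echo: $echoed"

# Backslash handling varies: some echo implementations print the two
# characters \ and n literally, others turn them into a newline.
echo 'a\nb' | od -c
```

The last command is left without an expected result on purpose, since its output depends on which echo the shell provides.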

So even if gzip -l were able to work with pipes, the input it would receive would be corrupted.

On a little endian system (like x86 ones), you can use:

tail -c4 < file.gz | od -An -tu4

to get the uncompressed size of the last chunk.

tail, unlike gzip, is able to fall back to reading the input when it cannot seek in it.
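
Where the byte order of the machine is not known, the four bytes can instead be read one at a time and combined by hand. This is a sketch rather than part of the answer above; demo.gz and its 1000-byte payload are invented so that the result can be checked, and mktemp -d is assumed to be available.

```sh
#!/bin/sh
# Work in a scratch directory (assumes mktemp -d is available).
dir=$(mktemp -d) && cd "$dir" || exit 1

# 1000 spaces, compressed; the payload only needs a known length.
printf '%1000s' '' | gzip -c > demo.gz

# od -tu1 prints each byte as an unsigned decimal, whatever the host byte order.
# The unquoted $(...) is split into four positional parameters.
set -- $(tail -c4 < demo.gz | od -An -tu1)

# The size is stored little-endian: the first byte is the least significant.
size=$(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
echo "$size"

# tail reads a pipe to its end, so a non-seekable input works as well.
set -- $(cat demo.gz | tail -c4 | od -An -tu1)
piped=$(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
echo "$piped"
```

Both commands print 1000 here, the length of the payload.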

Vombat · Answer 3 · 2016-06-23T09:01:36.860

It seems that gzip cannot recognize the name of the file when it gets its input from a pipe. I did a test like this:

$ cat file.tar.gz | gzip -tv 
  OK

$ gzip -tv file.tar.gz
  file.tar.gz: OK

So in the first case gzip is unable to recognize the name of the file, which seems to be necessary for the -l flag (in the last column of the output, uncompressed_name is stdout).

Some more info (not directly related to your question) from the gzip man page:

The uncompressed size is given as -1 for files not in gzip format, such as compressed .Z files. To get the uncompressed size for such a file, you can use:

     zcat file.Z | wc -c
