This command
$ tmp="$(cat tmp.csv.gz)" && echo "$tmp" | gzip -l
assigns the content of tmp.csv.gz to a shell variable and attempts to use echo to pipe that to gzip. But the shell's capabilities get in the way (null characters are omitted). You can see this with a test script:
#!/bin/sh
tmp="$(cat tmp.csv.gz)" && echo "$tmp" |cat >foo.gz
cmp foo.gz tmp.csv.gz
and, with some more work, by using od (or hexdump) and looking closely at the two files. For example:
0000000 037 213 010 010 373 242 153 127 000 003 164 155 160 056 143 163
037 213 \b \b 373 242 k W \0 003 t m p . c s
0000020 166 000 305 226 141 157 333 066 020 206 277 367 127 034 012 014
v \0 305 226 a o 333 6 020 206 277 367 W 034 \n \f
0000040 331 240 110 246 145 331 362 214 252 230 143 053 251 121 064 026
331 240 H 246 e 331 362 214 252 230 c + 251 Q 4 026
That is the original file. The copy drops the null after W in the first line of this output:
0000000 037 213 010 010 373 242 153 127 003 164 155 160 056 143 163 166
037 213 \b \b 373 242 k W 003 t m p . c s v
0000020 305 226 141 157 333 066 020 206 277 367 127 034 012 014 331 240
305 226 a o 333 6 020 206 277 367 W 034 \n \f 331 240
0000040 110 246 145 331 362 214 252 230 143 053 251 121 064 026 152 027
H 246 e 331 362 214 252 230 c + 251 Q 4 026 j 027
Since the data changes, it is no longer a valid gzip'd file, which produces the error.
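The same effect can be reproduced on a much smaller file. The following is only a sketch: the file names demo.bin and demo.copy are made up for illustration, and the exact size of the copy depends on the shell (bash and dash drop the nulls; other shells may behave differently).

```shell
#!/bin/sh
# Hypothetical demo file: 5 bytes, two of them NUL.
printf 'a\0b\0c' > demo.bin

# Round-trip the bytes through a shell variable, as the failing command does.
data="$(cat demo.bin)"
printf '%s' "$data" > demo.copy

# Byte counts before and after (tr strips the padding some wc versions add).
orig_size=$(wc -c < demo.bin | tr -d ' ')
copy_size=$(wc -c < demo.copy | tr -d ' ')
echo "original: $orig_size bytes, copy: $copy_size bytes"

# cmp reports the first differing byte; od -c shows the bytes themselves.
cmp demo.bin demo.copy || echo "files differ"
od -c demo.bin
od -c demo.copy
```

printf '%s' is used here instead of echo so that the only change to the data comes from the variable assignment itself.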
As noted by @coffemug, the manual page points out that gzip will report a -1 for files not in gzip'd format. However, the input is no longer a compressed file in any format, so the manual page is in a sense misleading: it does not categorize this as error-handling.
@wildcard points out that other characters such as backslash can add to the problem, because some versions of echo will interpret a backslash as an escape and produce a different character (or not, depending on the treatment of escapes applied to characters not in their repertoire). For the case of gzip (or most forms of compression), the various byte values are equally likely, so both nulls and backslashes will occur: all nulls will be omitted, while some backslashes will cause the data to be modified.
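To see which behavior a given echo has, a small check such as the following may help. It is a sketch only: the variable names are illustrative, and what echo does with the sample depends on the shell running the script.

```shell
#!/bin/sh
# Hypothetical sample: a literal backslash followed by the letter t.
sample='a\tb'

# printf with a %s format copies its argument without interpreting it.
via_printf=$(printf '%s\n' "$sample")

# echo may or may not turn \t into a tab, depending on the shell.
via_echo=$(echo "$sample")

[ "$via_printf" = "$sample" ] && echo "printf kept the backslash"
if [ "$via_echo" = "$sample" ]; then
    echo "this echo kept the backslash"
else
    echo "this echo rewrote the backslash sequence"
fi
```

Using printf '%s' avoids the backslash problem, but it cannot bring back null bytes that the variable assignment has already lost.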
The way to prevent this is not to try assigning a shell variable the contents of a compressed file. If you want to do that, use a better-suited language. Here is a Perl script which can count character-frequencies, as an example:
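#!/usr/bin/perl -w
# Count how often each byte value occurs in the files named on the command line.
use strict;

our %counts;

sub doit {
    my $file = shift;
    # Three-argument open; "or" rather than "||" so that die runs on failure.
    open my $fh, '<', $file or die "cannot open $file: $!";
    binmode $fh;    # read raw bytes, with no newline or encoding translation
    my @data = <$fh>;
    close $fh;
    for my $n ( 0 .. $#data ) {
        for my $o ( 0 .. ( length( $data[$n] ) - 1 ) ) {
            my $c = substr( $data[$n], $o, 1 );
            $counts{$c} += 1;
        }
    }
}

while ( $#ARGV >= 0 ) {
    doit( shift @ARGV );
}

# Print printable characters as themselves, everything else as an octal escape.
for my $c ( sort keys %counts ) {
    if ( ord $c > 32 && ord $c < 127 ) {
        printf "%s:%d\n", $c, $counts{$c} if ( $counts{$c} );
    }
    else {
        printf "\\%03o:%d\n", ord $c, $counts{$c} if ( $counts{$c} );
    }
}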
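If the goal is only to hand the compressed bytes to gzip, the variable can be skipped entirely. The following is a minimal sketch, assuming a made-up sample.csv rather than the tmp.csv.gz from the question; it lets gzip read the file directly and checks that a pipe delivers the bytes unchanged.

```shell
#!/bin/sh
# Hypothetical sample data, compressed so there is something to inspect.
printf 'id,name\n1,foo\n2,bar\n' > sample.csv
gzip -c sample.csv > sample.csv.gz

# Let gzip read the file itself instead of a copy held in a variable.
gzip -l sample.csv.gz

# A pipe also carries the bytes unchanged, nulls and backslashes included.
if cat sample.csv.gz | gzip -dc | cmp -s - sample.csv; then
    roundtrip=ok
else
    roundtrip=failed
fi
echo "round trip: $roundtrip"
```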
echo will interpret backslash sequences in various ways depending on the version used, so it is not equivalent to piping the raw data (e.g. with cat) even if the null bytes are addressed. – Wildcard Jun 23 '16 at 19:35
gzip -l <data.gz works but cat data.gz | gzip -l doesn't... – Kusalananda Jun 23 '16 at 20:08
With gzip -l < data.gz, gzip can seek to the end of the file (on stdin) where that information is stored, while if stdin is a pipe, it cannot seek. – Stéphane Chazelas Jun 23 '16 at 20:11