The other day I was collecting some logs from a remote server and unthinkingly gzipped the files into a single file, rather than adding the directory to a tarball. I can manually separate out some of the log files, but some of them were already gzipped. So the original files look like:

ex_access.log
ex_access.log.1.gz
ex_access.log.2.gz
ex_debug.log
ex_debug.log.1.gz
ex_debug.log.2.gz
ex_update.log
ex_update.log.1.gz
ex_update.log.2.gz

and are compressed into exlogs.gz, which upon decompression is, as you would expect, one file with all the original files concatenated. Is there a way to separate out the original gz files so that they can be decompressed normally, instead of showing up in the output as raw binary:

^_<8B>^H^H<9B>C<E8>a^@
^Cex_access.log.1^@<C4><FD><U+076E>-Kr<9D>       <DE><F7>S<9C>^W<E8><CE><F0><FF><88>y[<D5><EA>+<A1>^EHuU<A8>^K<B6><94><AA>L4E^R̤^Z^B<EA><E1><DB>}<AE>̳<B6><D6>I<C6><F8><9C><DB><C6>
<F1>@G`<E6><D6><FE><E0>3<C2><C3>ٰ̆|<E4><FC><BB>#<FD><EE><B8>~9<EA>+<A7>W+<FF><FB><FF><F6><9F><FE><97><FF><E3><97><FF><FD>^Z<E3><FF><F8><E5><FF><FE><CB><C7><FF>Iy<FC>?<8E><F9>?<F3>?<EF><B5><F7><F9><BF><FF>ß<FF>
[etc]

Yes, I could just collect the logs again (since I did have the sense to leave the originals intact), but getting approval for access to the server is a pain and I'd like to avoid it if at all possible.

Edit: the command I used is

gzip -c ex_* > exlogs.gz

3 Answers

When gzipping files into a single file, gzip creates a file containing multiple gzip streams, as if you first compressed the files separately and then concatenated them.
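
As a minimal illustration (a Python sketch using the standard gzip module, not part of this answer's tooling), two independently compressed members, once concatenated, decompress to the concatenated payloads:

import gzip

# Two independently compressed members, then concatenated --
# the same layout that `gzip -c file1 file2 > out.gz` produces.
member1 = gzip.compress(b"first file contents\n")
member2 = gzip.compress(b"second file contents\n")
combined = member1 + member2

# gunzip (and Python's gzip module) decompress all members back to back.
assert gzip.decompress(combined) == b"first file contents\nsecond file contents\n"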

This behaviour is briefly mentioned in the man page.

-c --stdout --to-stdout

Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members.

This means that every source file has a separate gzip header (which among other things contains the original file name). So in principle they can be separated while decompressing.
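
For reference, the header layout is simple enough to read by hand. The following rough Python sketch (purely illustrative; read_gzip_header is a made-up helper, and it assumes the FNAME flag is set, which gzip does by default when compressing regular files) extracts the stored name and mtime of the first member:

import struct

FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16   # header FLG bits (RFC 1952)

def read_gzip_header(buf, off=0):
    """Illustrative helper: parse one gzip member header at `off`, return (name, mtime, data_offset)."""
    if buf[off:off+2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flg = buf[off+3]
    mtime = struct.unpack("<I", buf[off+4:off+8])[0]
    pos = off + 10                       # fixed part: magic, CM, FLG, MTIME, XFL, OS
    if flg & FEXTRA:
        xlen = struct.unpack("<H", buf[pos:pos+2])[0]
        pos += 2 + xlen
    name = None
    if flg & FNAME:                      # original file name, NUL-terminated, Latin-1
        end = buf.index(b"\x00", pos)
        name = buf[pos:end].decode("latin-1")
        pos = end + 1
    if flg & FCOMMENT:
        pos = buf.index(b"\x00", pos) + 1
    if flg & FHCRC:
        pos += 2
    return name, mtime, pos              # pos is where the deflate data begins

with open("exlogs.gz", "rb") as f:
    print(read_gzip_header(f.read())[:2])    # stored name and mtime of the first member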

Unfortunately the gzip developers chose to not support this in gunzip:

If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. […] gzip is designed as a complement to tar, not as a replacement.

Un-concatenating the files isn't trivial, as neither the gzip header nor the footer contains the length of the compressed data stream. This means that, to reliably find the start of the second stream, you need to decode the whole deflate data stream, which is half-way to decompressing the whole thing.

As far as I know, there is no tool yet that can merely skim through the compressed data to find out where a stream ends, although there is some research in that area aimed at supporting quasi-random access to gzipped file contents.
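
To make that concrete: the only generally reliable way to locate the member boundaries is to decode each deflate stream and note where it stops. Here is a rough Python sketch of the idea (member_offsets is a made-up helper; it reads the whole file into memory and throws the decompressed data away, so it is an illustration rather than the tool's actual code):

import zlib

def member_offsets(path):
    """Illustrative helper: yield the byte offset at which each gzip member in `path` starts."""
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while pos < len(data):
        yield pos
        # wbits=31 tells zlib to expect a gzip wrapper; the member has to be
        # decoded (and its output discarded) just to learn where it ends.
        d = zlib.decompressobj(wbits=31)
        d.decompress(data[pos:])
        if not d.eof:
            raise ValueError("truncated gzip member at offset %d" % pos)
        # unused_data is everything after the end of this member.
        pos = len(data) - len(d.unused_data)

print(list(member_offsets("exlogs.gz")))

Once the offsets are known, each member can be copied out verbatim into its own .gz file, without recompressing anything.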

Luckily some programming libraries can be used to decompress gzip streams separately, e.g. Perl's IO::Uncompress::Gunzip, as Stéphane Chazelas mentioned in his answer, or Rust's flate2.

Finally, as a solution I wrote the tool gunzip-split. It decompresses each member separately and can also de-concatenate the file back into individual .gz files. For the latter it decompresses every member, noting the offsets where each gzip stream starts, and discards the decompressed output. This could be further optimized, but it works reasonably fast even for gigabyte-sized files.

$ ./gunzip-split --help
gunzip-split 0.1.1
Uncompress concatenated gzip files back into separate files.

USAGE: gunzip-split [OPTIONS] <FILE>

ARGS: <FILE> concatenated gzip input file

OPTIONS:
    -d, --decompress                      Decompressing all files (default)
    -f, --force                           Overwrite existing files
    -h, --help                            Print help information
    -l, --list-only                       List all contained files instead of decompressing
    -o, --output-directory <DIRECTORY>    Output directory for deconcatenated files
    -s, --split-only                      Split into multiple .gz files instead of decompressing
    -V, --version                         Print version information

$ ./gunzip-split -s -o ./out/ combined.gz
file_1: OK.
file_2: OK.

$ ls ./out
file_1.gz  file_2.gz

cg909

As it happens, in gzip -c file1 file2 > result, gzip does create two separate compressed streams, one for each file, and it even stores each file's name and modification time.

It doesn't let you use that information upon decompression, but you could use perl's IO::Uncompress::Gunzip module instead to do that. For instance with:

#! /usr/bin/perl
use IO::Uncompress::Gunzip;

$z = IO::Uncompress::Gunzip->new("-");

do {
    $h = $z->getHeaderInfo() or die "can't get headerinfo";
    open $out, ">", $h->{Name} or die "can't open $h->{Name} for writing";
    print $out $buf while $z->read($buf) > 0;
    close $out;
    utime(undef, $h->{Time}, $h->{Name}) or warn "can't update $h->{Name}'s mtime";
} while $z->nextStream;

And calling that script as that-script < exlogs.gz would restore the files, with their original names and modification times (without the sub-second part, which is not stored by gzip), in the current working directory.

ibuprofen

This is a bit complicated, but it works if the following requirements are met:

  • The merged.gz is a mix of clear ASCII data and gzipped files
  • It comes from an operation like cat log0 log1.gz log2.gz log3 log4.gz > merged.gz
  • Lines in the clear ASCII files consist of printable characters only
  • The magic bytes for gzipped files are intact (in hex 1F 8B)

Most of the programs used should already be available; sponge from moreutils can be avoided by writing to a temporary file manually.

What is done:

  1. Put lines with exclusively printable characters into a separate file for each consecutive block. Note that if two clear ASCII files were merged in a row, this does not separate them (use the logs' timestamps to split the file in that case), and the original filename is lost
  2. Put other lines into an intermediate gz_only.gz file
  3. Use the magic bytes to separate the files

The last point uses csplit, which can only split at a newline - so one is introduced before each magic-byte sequence and removed again afterwards. The script currently assumes no more than 1000 gzipped files in the merged file.
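
For comparison, the same magic-byte split can be sketched in a few lines of Python, which side-steps the newline workaround; like the csplit approach it simply trusts that 1F 8B never occurs inside a member's compressed data. This is only an illustration, the actual script follows:

# Sketch only: split gz_only.gz on the gzip magic bytes (0x1f 0x8b).
with open("gz_only.gz", "rb") as f:
    data = f.read()

parts = data.split(b"\x1f\x8b")
for n, part in enumerate(parts[1:], start=1):    # parts[0] is whatever precedes the first member
    with open("xx%03d.gz" % n, "wb") as out:
        out.write(b"\x1f\x8b" + part)            # re-attach the magic bytes removed by split()
# Note: if gz_only.gz came from the grep step, the last piece still ends
# with the newline grep appended, just as in the csplit variant.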

#!/bin/bash

#lines with printable characters go to separate files for each consecutive block
awk '{ if ($0 ~ /^[[:print:]]+$/) { print > "file_"i+0 ; oldi=i } else { if (oldi==i) {i++} } }' merged.gz

#get lines with non-printables to other merged file
grep -av '^[[:print:]]*$' merged.gz > gz_only.gz

#split into files and remember their count
#sed introduces newline before magic bytes
#csplit splits on occurrence of magic bytes and returns info on file lengths
nfiles=$( sed "s/$(printf '\x1f\x8b')/\n&/g" gz_only.gz |
          csplit - -z "/$(printf '\x1f\x8b')/" '{*}' -b'%03d.gz' | wc -l )

#first file is empty, due to introduced newline
rm -fv xx000.gz

#for all others remove the introduced newline
#note: the above grep introduced a newline to the last file
#if splitting is done for a file only concatenated from
#gz-files (no previous grep), the last file would have to
#be excluded from this operation.
for (( i=1 ; i<nfiles ; i++ )) ; do
  name=xx$(printf '%03d.gz' $i)
  head -c -1 $name | sponge $name
done

#retrieve original file name
for f in xx*gz ; do
  #this is ready for simple filenames like the suggested logs,
  #e.g. no " as file name character
  mv $f "$(file $f | awk -F'"' '{print $2}').gz"
done

#unzip files
find -name '*gz' ! -name gz_only.gz ! -name merged.gz -exec gunzip {} +

I somewhat feel that the separation into ASCII and non-ASCII parts, as well as the splitting, could be done more elegantly with perl, but I am not familiar enough with it.

FelixJN