12

I have a big .gz file. I would like to split it into 100 smaller gzip files that can each be decompressed on its own. In other words: I am not looking for a way of chopping the .gz file into chunks that would have to be put back together to be decompressed. I want to be able to decompress each of the smaller files independently.

Can it be done without recompressing the whole file?

Can it be done if the original file is compressed with --rsyncable? ("Cater better to the rsync program by periodically resetting the internal structure of the compressed data stream." sounds like these reset points might be good places to split at and probably prepend a header.)

Can it be done for any of the other compressed formats? I would imagine bzip2 would be doable - as it is compressed in blocks.

Ole Tange
  • 35,514
  • 1
    Have you tried split -b? – George Vasiliou Jan 17 '17 at 23:24
  • 5
    @GeorgeVasiliou It will not result in smaller gzip files that can be decompressed. – Ole Tange Jan 17 '17 at 23:39
  • The answer to your first question is no, this has been covered in Delete last line of gz file. The answer is probably no with most compressed formats, since what you're asking for goes against compression. I think the answer is also no with gzip --rsyncable given that “gunzip cannot tell the difference” (if you could find a place to split, you could tell that there is a place to split). It might be doable with bzip2 because of its peculiar block feature. – Gilles 'SO- stop being evil' Jan 17 '17 at 23:58
  • This may help: http://stackoverflow.com/a/22628945/4941495 Just let the standard input stream be the output of gzip -d -c bigfile.gz. – Kusalananda Jan 18 '17 at 00:18
  • Without recompressing, it would be doable with a bzip2 file indeed. It would be doable with gz or xz only by compressing each chunk independently, so this would require a recompression. – xhienne Jan 18 '17 at 00:59
  • @Gilles The reason why gunzip cannot tell the difference could also be that resetting the internal structure might happen without --rsyncable, but will happen more often with --rsyncable. – Ole Tange Jan 19 '17 at 08:01
  • zip can create an archive in pieces, but I don't believe they can be restored independently. It's just so they'll fit on external storage media. Given that archives can contain multi-level file trees, it would be kind of difficult or arbitrary to automatically decide where to split things so that restoring just some pieces would yield a usable result. – Joe Jan 22 '17 at 04:20
  • 1
    What you want is almost just a tarball with gzip files inside (not big.tar.gz but big-gzs.tar). Then all, or only a few, of the files can be extracted and decompressed (a small sketch follows these comments). I have only tried extracting the last file in a tarball, but I guess it can "fast forward" much like a tape drive can. – hschou Feb 08 '17 at 12:00
  • The answer is simply to use split -b and then cat with >> to append each split file back into the one file. It doesn't matter whether what you split is already zipped or not. Never mind, I just reread what you are asking... you want to be able to decompress the split files. – ron Oct 27 '20 at 19:40
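
A minimal sketch of the tarball idea from one of the comments above: it assumes the individually gzipped pieces (here named chunk_aa.gz, chunk_ab.gz, ...) have already been produced by recompressing parts of the data, and the member name chunk_ab.gz is only an illustrative example.

    tar -cf big-gzs.tar chunk_*.gz    # pack the pieces without further compression
    tar -tf big-gzs.tar               # list the members
    tar -xf big-gzs.tar chunk_ab.gz   # extract a single piece...
    gzip -dc chunk_ab.gz | head       # ...and decompress it on its own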

2 Answers

2

Split and join of the big file works, but it is impossible to decompress pieces of the compressed file, because essential information is distributed through the whole data stream. Another way: split the uncompressed file and compress the single parts. Now you can decompress each piece. But why? You have to merge all decompressed parts before further processing anyway.
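
As a minimal sketch of that second approach (decompress, cut into pieces, compress each piece on its own), assuming GNU split with the --filter option; the 100M chunk size and the part_ prefix are only illustrative choices:

    # decompress, cut into 100M pieces, gzip each piece as it is written
    gzip -dc big.gz | split -b 100M --filter='gzip > $FILE.gz' - part_
    # result: part_aa.gz, part_ab.gz, ... each a complete gzip file
    gzip -dc part_ab.gz | head        # any one piece decompresses on its own

Note that this recompresses everything once, which is exactly what the question hopes to avoid.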

ingopingo
  • 807
  • 2
    Fun fact: when you have the individually compressed parts (using gzip or xz), you may do concatenation and decompression, or decompression and concatenation; the order doesn't matter (see the small check after these comments). – Kusalananda Feb 24 '17 at 09:13
  • Maybe; it depends on the data. If you split and compress disk images, you have a chance to recover parts of the filesystem. If you first compress and then split, you have definitely no chance. – ingopingo Feb 24 '17 at 09:24
  • No, and that was not my premise either. I just said that the order in which you do concatenation and decompression, when you have individually compressed parts, does not matter (this is due to the compressed file formats). If compressing first, then splitting, one obviously needs to recombine first. – Kusalananda Feb 24 '17 at 09:26
  • Oh, that's cool. It works, even though every part contains an individual file header! – ingopingo Feb 24 '17 at 09:36
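
A tiny check of the order-doesn't-matter point from the comments above, assuming some individually compressed part_*.gz files exist (gzip treats a concatenation of gzip files as one multi-member file):

    cat part_*.gz | gzip -dc > joined_then_unzipped
    for f in part_*.gz; do gzip -dc "$f"; done > unzipped_then_joined
    cmp joined_then_unzipped unzipped_then_joined && echo identical
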
-1

Unless I am mistaken, I think this is not possible without altering your file and losing the ability to rebuild and decompress the big file, because you will lose the metadata (header and trailer) of the original compressed big file, and that metadata does not exist for each of your small files.

But you could create a wrapper that does the following (a rough sketch follows the list):

  1. (optional) compress the big file
  2. split your big file into 100 small chunks
  3. compress each of your small chunks with gzip
  4. decompress each chunk with gzip
  5. concatenate the chunks back into the big file
  6. (optional) decompress the big file
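
A rough sketch of such a wrapper for steps 2-5, assuming GNU split (for -n and -d) and an uncompressed input named bigfile; the chunk_ prefix and the final comparison are only there for illustration:

    split -n 100 -d bigfile chunk_        # step 2: cut into exactly 100 chunks
    gzip chunk_*                          # step 3: gzip each chunk separately
    cat chunk_*.gz | gzip -dc > rebuilt   # steps 4-5: decompress and reassemble
    cmp bigfile rebuilt && echo "round trip OK"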

Note: I am not sure about your purpose... saving storage? saving network transmission time? a system with limited space? What is your underlying need?

Best Regards

GnuTux95
  • 29
  • 4
  • I want to take a gzip file and decompress it in parallel. And before you say pigz, please test how well DEcompressing works on a 64-core machine. – Ole Tange Oct 27 '20 at 22:49
  • I was trying to say the same thing as ingopingo, but with my poor English. As for the proposal, yes, it is not interesting; it was just an explanation. Better to use the simple way: compress what you want into small, independent archives, and afterwards make a big archive with a compressor of your choice: you can also include other files that are already compressed (not sure you will gain a lot, but you can organize things differently, and with a password or special security, why not). --rsyncable seems to be used only by rsync when transferring data, to avoid retransferring the whole archive. The basic – GnuTux95 Nov 17 '21 at 11:09