
I have a big .gz file, which is 2.6 GB in itself. The file is a single large text file, and I cannot decompress it completely due to a size limitation. I want to split it into, say, 10 individual parts and decompress each one individually so that I can use each individual file.

My questions are:

  1. Is that possible?
  2. If so, can the commands also be provided as part of the answer? I am not very well versed in these commands.

Thanks

Noor
  • it's just a text file – Noor Apr 16 '17 at 13:52
  • @StephenKitt, the possible duplicate doesn't mention any command. I'm not well versed in these commands, so I'll need the commands as part of the answer too, thanks – Noor Apr 16 '17 at 13:59
  • The linked question’s answer doesn’t give any command because there is none, you can’t do what you’re asking with gzip. – Stephen Kitt Apr 16 '17 at 14:23
  • Could you say something more about what the size limitation is? Also, what is your goal? Is it to process the file somehow (could be done without storing the decompressed data), or do you want to do something else with the split file? – Kusalananda Apr 16 '17 at 14:26
  • @Kusalananda, I want to get into the text file, but cannot access it because I'm unable to decompress it – Noor Apr 16 '17 at 14:37
  • @Noor Well, that's clear, but can't you decompress it because there's not enough space left on the filesystem, or is there some other space-related issue? Also, "get into the file" can be done in a number of ways without decompressing it, depending on what you want to do. – Kusalananda Apr 16 '17 at 14:40
  • It's a disk limitation. Can I find out the number of lines in the compressed text file? – Noor Apr 16 '17 at 14:53
  • @Noor gunzip -c file.gz | wc -l would count the number of lines in the decompressed file. – Kusalananda Apr 16 '17 at 16:10
  • This is one of the issues: I don't have enough disk space to hold the decompressed file – Noor Apr 16 '17 at 16:20
  • @Noor The command I gave above will not store the decompressed file. Depending on what you want to do you may not need to store any part of the decompressed file, but you're not letting us know what you want to do, so we are struggling to help you. You may likewise grep the decompressed data with either zgrep directly, or with gunzip -c file.gz | grep ..., or pass it through awk or any other filter depending on what it is you want to do. – Kusalananda Apr 16 '17 at 17:24
  • In short, I just wanted to know the number of lines in the file to begin with. I really thank you for your help, it has been really helpful; I've used your command and it worked :) – Noor Apr 16 '17 at 17:47
  • There is a command called split - this lets you split any file (including text files, archives and compressed files) into two or more smaller ones. You specify how many lines or bytes each one should have. These files may then be compressed individually. Use cat to merge them back into one large file (after uncompressing, if you compressed each part); a short sketch follows after these comments. – Baard Kopperud Apr 16 '17 at 20:27
  • @Noor: Please state what the size limitation is on. If it's disk space, your proposed solution doesn't make any sense. If it's main memory, it still seems a bad idea, as the above discussion proves. – reinierpost Apr 16 '17 at 21:44
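
A minimal sketch of the split/compress/merge round trip described in the comments above (bigfile and the part- prefix are placeholder names; this assumes GNU split and enough disk space for the pieces):

$ split --bytes=250M bigfile part-            # pieces: part-aa, part-ab, ...
$ gzip part-??                                # compress each piece individually
$ cat part-??.gz | gunzip -c > bigfile.copy   # merge back; gzip accepts concatenated members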

2 Answers


The gzip compression format supports decompressing a file that has been created by concatenating several smaller compressed files (the decompressed output will then contain the concatenated decompressed data), but it doesn't support decompressing a compressed file that has been cut into pieces at arbitrary offsets.
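
To illustrate the concatenation property, a small sketch (the file names are arbitrary):

$ printf 'first\n' | gzip > joined.gz
$ printf 'second\n' | gzip >> joined.gz
$ gunzip -c joined.gz
first
second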

Assuming you would want to end up with a "slice" of the decompressed data, you may work around this by feeding the decompressed data into dd several times, each time selecting a different slice of the decompressed data to save to a file and discarding the rest.

Here I'm using a tiny example text file. I'm repeatedly decompressing it (which will take a bit of time for large files), and each time I pick an 8-byte slice out of the decompressed data. You would do the same, but use a much larger value for bs ("block size").

$ cat file
hello
world
1
2
3
ABC

$ gzip -f file   # using -f to force compression here, since the example is so small

$ gunzip -c file.gz | dd skip=0 bs=8 count=1 of=fragment
1+0 records in
1+0 records out
8 bytes transferred in 0.007 secs (1063 bytes/sec)

$ cat fragment
hello
wo

$ gunzip -c file.gz | dd skip=1 bs=8 count=1 of=fragment
1+0 records in
1+0 records out
8 bytes transferred in 0.000 secs (19560 bytes/sec)

$ cat fragment
rld
1
2

(etc.)

Use a bs setting that is about a tenth of the uncompressed file size, and increase skip by one in each iteration, starting from 0.
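
As a sketch, assuming the decompressed data is roughly 10 GiB so that bs=1G gives about ten slices (iflag=fullblock is a GNU dd extension that avoids short reads from the pipe; note that dd allocates a buffer of bs bytes in memory, and fragment-$i is just a placeholder name):

$ for i in 0 1 2 3 4 5 6 7 8 9; do
      gunzip -c file.gz | dd bs=1G skip="$i" count=1 iflag=fullblock of="fragment-$i"
  done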


UPDATE: The user wanted to count the number of lines in the uncompressed data (see comments attached to the question). This is easily accomplished without having to store any part of the uncompressed data to disk:

$ gunzip -c file.gz | wc -l

gunzip -c will decompress the file and write the uncompressed data to standard output. The wc utility with the -l flag will read from this stream and count the number of lines read.

Kusalananda

Well, split will happily split things for you in various ways.

To make 10 individual parts, you'd have to know the size of the uncompressed file. The following should give you files about 1 GiB in size each.

gunzip < bigfile.gz | split --line-bytes=1G - bigfile-split
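
If you don't know the uncompressed size, one way to measure it without storing anything is to count the bytes in the decompressed stream (gzip -l is not reliable here, since it reports the uncompressed size modulo 4 GiB):

$ gunzip -c bigfile.gz | wc -c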

1G is still a lot for a text file; many editors handle such large files poorly. So depending on what you really want to do with it, you might want to go for smaller splits. Or just leave it as .gz, which works well enough for zgrep and other tools, even if it has to be decompressed every single time.
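
For example, a search can run against the compressed file directly ('some pattern' is just a placeholder):

$ zgrep 'some pattern' bigfile.gz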

If this is a log file, you might want to fine-tune your log rotation to produce smaller files naturally.
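
For instance, with logrotate, a minimal size-based rotation stanza might look like this (the log path and the sizes are assumptions):

/var/log/myapp.log {
    size 100M
    rotate 20
    compress
    missingok
    notifempty
}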

frostschutz