7

I have a huge file (420 GB) compressed with gzip and I want to decompress it, but my HDD doesn't have enough space to store both the compressed file and its decompressed contents.

Would there be a way of decompressing it 'while deleting it'?

In case it helps, gzip -l says that there is only one file inside (a tar file, which I will also have to unpack somehow)

Thanks in advance!

Noxbru
  • 73
  • 1
    If it is a tar.gz file, you do not need to unzip - use tar -tzvf <file> to list the content and tar -xzvf <file> <path> to extract only part of the archive. – ridgy Jan 31 '17 at 13:29
  • 1
    @ridgy That would still require the space to store both the compressed file and the unarchived contents (if you wanted to get all of it). – Kusalananda Jan 31 '17 at 13:30
  • @ridgy No, it is not a tar.gz file (at least not explicitly). It is a .tgz that has a .tar inside. Also, would I need space just to check the list of the content files? – Noxbru Jan 31 '17 at 13:34
  • @Kusalananda It may be a solution if I can move the contents part by part to another HDD or something. – Noxbru Jan 31 '17 at 13:36
  • The suffix .tgz is a synonym for .tar.gz. Just try the tar tzvf command. You are not able to delete a file (and free the used blocks, which is more important) as long as this file is in use by some program. While running the extraction, you are able to remove the file from another terminal window, but this will only remove the entry from the directory listing and not free the occupied space as long as the extraction is running. – ridgy Jan 31 '17 at 13:40
  • It is not possible to decompress in place - you would need to write your own decompression executable. The original file will not be removed by any official tool before checking that the decompressed data has been written without errors. – Ned64 Jan 31 '17 at 13:45
  • @ridgy I checked and it worked :) although it seems that it will take an extremely long time to list all the contents :/ – Noxbru Jan 31 '17 at 13:47

3 Answers

8

Would there be a way of decompressing it 'while deleting it'?

This is what you asked for. But it may not be what you really want. Use at your own risk.

If the 420GB file is stored on a filesystem with sparse file and punch-hole support (e.g. ext4, xfs, but not ntfs), it would be possible to read the file and free the already-read blocks using fallocate --punch-hole. However, if the process is interrupted for any reason, there may be no way to recover, since all that's left is a half-deleted, half-uncompressed file. Don't attempt it without making another copy of the source file first.

Very rough proof of concept:

# dd if=/dev/urandom bs=1M count=6000 | pigz --fast > urandom.img.gz
6000+0 records in
6000+0 records out
6291456000 bytes (6.3 GB, 5.9 GiB) copied, 52.2806 s, 120 MB/s
# df -h urandom.img.gz 
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           7.9G  6.0G  2.0G  76% /dev/shm

The urandom.img.gz file occupies 76% of the available space, so it can't be uncompressed directly. Pipe the uncompressed result to md5sum so we can verify it later:

# gunzip < urandom.img.gz | md5sum
bc5ed6284fd2d2161296363edaea5a6d  -

Uncompress while hole punching (this is very rough, without any error checking whatsoever):

total=$(stat --format='%s' urandom.img.gz) # bytes
total=$((1+$total/1024/1024)) # MiB
for ((offset=0; offset < $total; offset++))
do
    # read block
    dd bs=1M skip=$offset count=1 if=urandom.img.gz 2> /dev/null
    # delete (punch-hole) blocks we read
    fallocate --punch-hole --offset="$offset"MiB --length=1MiB urandom.img.gz
done | gunzip > urandom.img

Result:

# ls -alh *
-rw-r--r-- 1 root root 5.9G Jan 31 15:14 urandom.img
-rw-r--r-- 1 root root 5.9G Jan 31 15:14 urandom.img.gz
# du -hcs *
5.9G    urandom.img
0       urandom.img.gz
5.9G    total
# md5sum urandom.img
bc5ed6284fd2d2161296363edaea5a6d  urandom.img

The checksum matches, and the size of the source file shrank from 6 GB to 0 while it was being uncompressed in place.

But there are so many things that can go wrong... better not to do it at all, or if you really have to, at least use a program that does saner error checking. The loop above does not guarantee at all that the data was read and processed before it gets deleted. If dd or gunzip returns an error for any reason, fallocate still happily tosses the data away... so if you must use this approach, better write a saner read-and-eat program.
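
As a minimal sketch of what "saner" could look like (assuming GNU dd/stat and util-linux fallocate, as in the proof of concept above; the filenames and the 1 MiB chunk size are illustrative), this variant at least refuses to punch a hole when the corresponding read failed:

#!/bin/bash
# Rough sketch only: abort on the first error instead of deleting blindly.
set -euo pipefail

src=urandom.img.gz   # illustrative filenames
dst=urandom.img

total=$(stat --format='%s' "$src")      # size in bytes
total=$(( 1 + total / 1024 / 1024 ))    # number of 1 MiB chunks

for ((offset=0; offset < total; offset++))
do
    # read one chunk; stop before punching if the read fails
    dd bs=1M skip="$offset" count=1 if="$src" 2> /dev/null ||
        { echo "read failed at chunk $offset, stopping" >&2; exit 1; }
    # free the chunk only after it was handed off to the pipe
    fallocate --punch-hole --offset="$offset"MiB --length=1MiB "$src"
done | gunzip > "$dst"

Even this only confirms that dd read each chunk; it still cannot tell whether gunzip has safely written the decompressed bytes, so the warning above stands.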

frostschutz
  • 48,978
  • This sounds between amazing and crazy... I can't back up the original file (if I could I would just uncompress it and be done). So it is highly possible that I will just get an external HDD and try one of the other suggestions. Still, your answer is the most related to the original question, thanks! :) – Noxbru Jan 31 '17 at 14:27
2

If you have a second hard drive, you may move the compressed archive there and then uncompress and unarchive it to where you want it to go:

$ mv archive.gz /mnt/somedrive/
$ cd /where/it/should/go
$ tar xvzf /mnt/somedrive/archive.gz
Kusalananda
  • 333,661
  • I guess then that I can do the same thing the other way around? from my HDD to an external device? – Noxbru Jan 31 '17 at 13:49
  • @Noxbru Absolutely. – Kusalananda Jan 31 '17 at 13:49
  • @Kusalananda Seems that this will be the way to go... does gzip/tar have a progress bar or should I use pv? – Noxbru Jan 31 '17 at 13:52
  • @Noxbru There is no progress bar. You would have to pipe the output from zcat archive.gz into pv and then on to tar xvf -. – Kusalananda Jan 31 '17 at 13:53
  • So: go to my external device, zcat /path/to/file/in/HDD.gz | pv | tar xvf - ? – Noxbru Jan 31 '17 at 14:00
  • @Noxbru Looks correct. You may run it with t in place of xv first if you wish. It will show you the table of contents, but won't write anything. (table of contents = the files in the archive) – Kusalananda Jan 31 '17 at 14:02
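
Putting the comments together, the whole transfer with a progress bar would look something like this (paths illustrative; pv is optional but shows throughput):

# run from the destination directory on the external device
cd /mnt/external/destination
# decompress on the source drive, show progress, unpack into the current directory
zcat /path/to/file/in/HDD.gz | pv | tar xvf -
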
1

It depends on what you want to do with it.

If it is a .tar.gz file, you can list the tar contents without decompressing it first with tar --list -zf /path/to/file.

Then, if you only want some files inside the tgz, you can extract them with tar -xzvf /path/to/file relative/path/to/files/inside/tar. As always, you can change the destination directory with -C.

This works because, even though a .tar.gz is actually a .tar file compressed with gzip, the combination is so common that tar supports it directly via the -z flag. The -z flag is specific to gzip, though; tar uses separate flags for other compressors (-j for bzip2, -J for xz), and for something like lz4 you would need --use-compress-program.
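
For example, listing the archive and pulling one path out onto a different drive (paths illustrative):

# list the contents without decompressing anything to disk
tar --list -zf /path/to/file.tgz
# extract one path from the archive into a directory on another drive
tar -xzvf /path/to/file.tgz -C /mnt/otherdrive relative/path/inside/tar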

As a bonus, if the file inside the .gz weren't a tar, you could always pipe the output to a pager like less, which lets you view it page by page without writing it to disk: gzcat /path/to/file | less (on many systems the command is zcat)