I want to download and extract a large zip archive (>180 GB) containing many small files of a single file format onto an SSD, but I don't have enough storage for both the zip archive and the extracted contents. I know that it is possible to extract and delete individual files from an archive using the zip command, as mentioned in the answers here and here. I could also get the names of all the files in the archive using the unzip -l command, store the results in an array as mentioned here, filter out the unnecessary values using the method given here, and iterate over them in Bash as mentioned here. So, the final logic would look something like this:

  1. List the zip file's contents using unzip -l and store the filenames in a bash array, using regular expressions to match the single file extension present in the archive.
  2. Iterate over the array of filenames and successively extract and delete individual files using the unzip -j -d and zip -d commands.
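
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names filter_by_ext and extract_and_delete, and the archive path, destination and extension passed to them, are hypothetical. It uses unzip -Z1 (zipinfo mode, names only) rather than parsing unzip -l, and reads names line by line instead of using an array, so it assumes file names contain no newlines. Note that zip -d rewrites the archive on every call, which the comments below point out needs extra space and time.

```shell
#!/usr/bin/env bash
# Sketch of the list-then-extract-and-delete loop. The function names and the
# archive path, destination and extension passed to them are hypothetical.

# Print only the names (one per line on stdin) that end in ".<ext>".
filter_by_ext() {
    local ext="$1" name
    while IFS= read -r name; do
        case $name in
            *."$ext") printf '%s\n' "$name" ;;
        esac
    done
}

# Extract each matching member into dest, then delete it from the archive.
extract_and_delete() {
    local archive="$1" dest="$2" ext="$3" names name
    # unzip -Z1 prints member names only, one per line. The list is captured
    # before the loop because zip -d rewrites the archive on each call.
    names=$(unzip -Z1 "$archive" | filter_by_ext "$ext") || return 1
    printf '%s\n' "$names" | while IFS= read -r name; do
        [ -n "$name" ] || continue
        # -j drops stored directory paths, -o overwrites without prompting.
        # Both tools treat names as patterns, so names containing
        # * ? or [ would need escaping (not handled in this sketch).
        unzip -j -o -q "$archive" "$name" -d "$dest" </dev/null || exit 1
        zip -d -q "$archive" "$name" </dev/null || exit 1
    done
}
```

A call might look like extract_and_delete input.zip /place/to/store/data csv, where all three arguments are placeholders.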

How feasible is this method in terms of time required, logic complexity, and computational resources? I am worried about the efficiency of deleting and extracting single files, especially with such a large archive. If you have any feedback or comments about this approach, I would love to hear them. Thank you all in advance for your help.

  • See "Decompress gzip file in place", but I don't know if it would work with zip, and even with gzip this method is unsafe. Probably very unsafe with zip since it's not a streaming format – frostschutz Nov 12 '23 at 14:14
  • Oh, it's about downloading... if the server supports resume / offset / range you could probably cheese it with some flavor of fuse httpfs like simple-httpfs and only download the requested segments (and not store the zip file locally at all) – frostschutz Nov 12 '23 at 14:21
  • @frostschutz I found a similar answer suggesting your approach here, and it sounds like it could work. I will be sure to give it a try and report back here. – Kumaresh Balaji Sundararajan Nov 12 '23 at 17:30
  • @frostschutz, the example you give is for an archive with a single compressed file (gz). I doubt this will work with a multi-file structure such as zip – Romeo Ninov Nov 12 '23 at 17:37
  • @KumareshBalajiSundararajan, why do you want to complicate things? One USB disk will do the job. – Romeo Ninov Nov 12 '23 at 17:38
  • @RomeoNinov at the moment, I don't have easy access to any large storage media. The first method he shared, using filesystem manipulation, was extremely complicated; however, the HTTPS file system mount seems feasible for my use case. – Kumaresh Balaji Sundararajan Nov 12 '23 at 18:38
  • @KumareshBalajiSundararajan, we are talking about a portable disk smaller than 500 GB. It can be SATA and will cost just a few bucks – Romeo Ninov Nov 12 '23 at 18:43
  • @RomeoNinov I don't have much money at the moment and I don't have many uses for external storage besides this situation. – Kumaresh Balaji Sundararajan Nov 13 '23 at 06:27
  • Unfortunately for your particular case, common tools like zip create a modified archive as a new file instead of modifying the old archive in place. In general this makes sense: the new archive is generated as a separate file, and only when it's ready does it atomically replace the old archive; so at any moment the old pathname leads either to the valid old archive or to the valid new archive. This means your plan with zip -d requires even more space. If you found a tool that can do zip -d in place, then you could shrink your zip gradually from the end. Technically, the zip format allows this. – Kamil Maciorowski Nov 15 '23 at 03:15

2 Answers

AFAIK, deleting a file from a zip archive may need twice as much space as the archive itself. So the best option is to attach a USB disk and store the archive there, then extract the files to the SSD and delete the archive (if it is no longer required).

Romeo Ninov

If the zip file:

  • contains trusted content; and
  • is available at a URL
    • on a reliable network connection

then the answers here may help.

In short, use a program that can unzip from a stream.

For example:

cd /place/to/store/data
curl https://www.example.org/input.zip | busybox unzip -

or, with bsdtar:

cd /place/to/store/data
curl https://www.example.org/input.zip | bsdtar xvf -
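
A slightly more defensive variant of the bsdtar pipeline above could look like the sketch below. It is not part of the original commands: the function name stream_extract and its two arguments are hypothetical, and it assumes curl and bsdtar are installed. The added curl options make HTTP errors fail the download instead of piping an error page into the extractor, and pipefail (where the shell supports it) lets a failed download show up in the exit status.

```shell
#!/usr/bin/env bash
# Sketch: stream a zip from a URL and unpack it without storing the archive.
# stream_extract and its arguments are hypothetical names for illustration.

# Enable pipefail only if this shell supports it, so that a curl failure
# makes the whole pipeline fail rather than only a bsdtar failure.
if (set -o pipefail) 2>/dev/null; then
    set -o pipefail
fi

stream_extract() {
    local url="$1" dest="$2"
    mkdir -p "$dest" || return 1
    # curl: -f fail on HTTP errors, -sS quiet but still report errors,
    #       -L follow redirects.
    # bsdtar: -x extract, -f - read the archive from stdin,
    #         -C change into the destination directory first.
    curl -fsSL "$url" | bsdtar -xf - -C "$dest"
}
```

For example, stream_extract https://www.example.org/input.zip /place/to/store/data would mirror the commands above. If the connection drops partway through, the files extracted so far remain on disk, but the download cannot simply be resumed mid-archive with this approach.
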
jhnc