3

I have a bunch of files with a .zip extension that I cannot seem to extract on my HPC:

$ unzip RowlandMetaG_part1.zip
Archive:  RowlandMetaG_part1.zip
warning [RowlandMetaG_part1.zip]:  13082642473 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [RowlandMetaG_part1.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

The size of the zip file itself is 17377631766 bytes.

However, when I download the file to my Mac and double-click it, the Archive Utility app is able to unpack the file (it contains a directory with about 200 gzipped files inside).

The place that generated the file says:

The files are simply zipped here on our local lab PC running Windows, then uploaded to Dropbox...most people don’t have any problems with them and many can directly download the links I give them using the Linux wget command directly into their servers, then unzip there (the Linux utility can usually handle PC-zipped files).

I'm not sure whether it matters that the files come from Dropbox, but I used curl -LO to download them (I also tried wget, which made no difference), and the files show up with ?dl=1 at the end of the file name. That said, when I download from Dropbox to my Mac, unzip still fails there with the same error.

My question: is there any way to get this to unzip on the server? Is there some software that does the same thing as Archive Utility.app, or some other way of determining which unzipping method to use?

EDIT: Based on the comments, some additional information:

$ file RowlandMetaG_part1.zip
RowlandMetaG_part3.zip: Zip archive data, at least v2.0 to extract
$ zip --version
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.

Also, I did try tar, but without success.

$ tar -xvf RowlandMetaG_part1.zip
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Archive contains `l@\022\t1\fjp\024uP\020' where numeric off_t value expected
tar: Archive contains `\024\311\032b\234\254\006\031' where numeric mode_t value expected
tar: Archive contains `\312\005hЈ\2138vÃ\032p' where numeric time_t value expected
# etc...

And I end up with crap in the directory like this:

$ ls
???MK??%b???mv?}??????@*??TZ?S?? ??????+??}n>,!???ӟw~?i?(??5?#?ʳ??z0?[?Ed?@?쑱??lT?d???A??T???H??
,??Y??:???'w,??+?ԌU??Wwxm???e~??ZJ]y??ˤ??4?SX?=y$Ʌ{N\?P}x~~?T?3????y?????'
kevbonham
  • 201

5 Answers

4

There is a chance that, although the file ends with ".zip", it is not a zip file.

You can confirm if this is a zip file (and at the same time determine what is the actual file format) using the file utility:

file RowlandMetaG_part1.zip

Once the file format is determined you can use the proper tool to unarchive it.

Marcelo
  • 3,561
  • The file appears to be a zip archive (see edit) – kevbonham Apr 18 '18 at 12:58
  • Does the place where the file was generated provide checksum info? The file may have been corrupted or truncated during download. Possibly the Mac Archive Utility can work around small corruptions or truncation that stop the zip tool you are using from working properly. – Marcelo Apr 18 '18 at 13:01
  • they did not. They definitely should, and I asked them for this info, but they are pretty unhelpful ("most people don't have issues") – kevbonham Apr 18 '18 at 13:13
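Even without a published checksum, comparing a digest computed on the Mac with one computed on the HPC would at least show whether the two copies are byte-identical. A minimal sketch, assuming Python 3 is available on both machines; the function name sha256_of_file is only illustrative:

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that a multi-gigabyte archive
    never has to fit in memory at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha256_of_file("RowlandMetaG_part1.zip") on each machine and comparing the two strings is enough; matching digests rule out a transfer problem but say nothing about whether the archive was written correctly in the first place.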
4

It turns out that, because the file is so large, unzip can't handle it (it appears to max out at 2 GB). Instead, I can use jar:

$ jar xvf RowlandMetaG_part1.zip
inflated: RowlandMetaG_part1/296E-7-26-17-O_S23_L001_R1_001.fastq.gz
# etc...
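If Java is not installed, Python's standard zipfile module is another thing to try, since it understands ZIP64 archives. This is only a sketch and has not been run against the file in the question; zipfile relies on the central directory, so it may fail on an archive where that is missing or damaged. The helper name extract_zip is an assumption for illustration:

```python
import zipfile


def extract_zip(archive_path, dest_dir="."):
    """Extract every member of `archive_path` into `dest_dir`.

    Returns the list of member names so the caller can see what
    was unpacked. Raises zipfile.BadZipFile if the archive cannot
    be read.
    """
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

The same module also has a command-line form, python3 -m zipfile -e RowlandMetaG_part1.zip target-dir/, which avoids writing any script at all.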
kevbonham
  • 201
0

Maybe try extracting it with the tar utility:

tar xvf <file-name>

Maybe this link might be relevant:

https://apple.stackexchange.com/questions/208139/how-to-deal-with-unzip-error-on-a-large-file-in-osx

  • Should have added this - I did try to extract with tar as well - no dice. – kevbonham Apr 17 '18 at 20:03
  • 1
    Ah, OK. I feel like I'm just copy-pasting, but I think this might be the solution here: https://unix.stackexchange.com/questions/115825/extra-bytes-error-when-unzipping-a-file, so maybe try brew install p7zip and then 7za x some_file.zip – mk_gunner Apr 17 '18 at 20:05
  • Ideally I'd do this on the HPC, which doesn't have brew, but I'll try to download the binary – kevbonham Apr 17 '18 at 20:09
0

I've run into the same issue and have not been able to solve it yet. If I do, I'll update this answer.

However, to set some things straight:

You can believe the OP that the file is a zip file.

The "issue" the Linux unzip tool seems to have is that the file does not have a central directory at the end but needs to be unzipped sequentially instead, and the Linux tool appears to be unable to do that.

In theory, the zip tool should be able to fix this with the -FF option, which scans the archive sequentially and then creates a new zip file from it. It turns out, however, that this does not work with large (> 4 GB) zips: it creates yet another unreadable zip file, without the central directory at the end.

Background: The PKZIP archive format stores the information about each archived item in two places: once before each compressed stream (this is mandatory, though the length information may be incorrect) and once more at the end, in a list of all stored items. The latter is kind of optional (by the definition of the standard there should always be one, but the format also allows a fallback of going through the initial entries, which Apple's zip tool apparently does).
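To check whether a given archive actually has that trailing list, one can look for the end-of-central-directory signature near the end of the file. The following is a rough sketch, assuming Python 3; the function name find_end_records is made up for illustration, and a signature match is only a hint, since the same four bytes could in principle occur in a comment by chance:

```python
import os

EOCD_SIG = b"PK\x05\x06"           # end of central directory record
ZIP64_LOCATOR_SIG = b"PK\x06\x07"  # ZIP64 end of central directory locator

# The EOCD record is 22 bytes plus a comment of at most 65535 bytes;
# a ZIP64 locator (20 bytes), if present, sits directly before it.
MAX_TAIL = 22 + 65535 + 20


def find_end_records(path):
    """Report where the end-of-archive records start, or None if absent."""
    size = os.path.getsize(path)
    tail_len = min(size, MAX_TAIL)
    with open(path, "rb") as handle:
        handle.seek(size - tail_len)
        tail = handle.read(tail_len)

    base = size - tail_len  # file offset of the first byte of `tail`
    eocd = tail.rfind(EOCD_SIG)
    locator = tail.rfind(ZIP64_LOCATOR_SIG)
    return {
        "eocd_offset": base + eocd if eocd != -1 else None,
        "zip64_locator_offset": base + locator if locator != -1 else None,
    }
```

If eocd_offset comes back as None, the archive has no recognisable central directory at the end, and only a tool that walks the local headers sequentially has a chance of reading it.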

After analysing the issue further, I believe the cause is as follows:

  • The zip file was written by the ditto command, which is a slightly modified version of zip, though I don't know the details.
  • The zip file in question is not using the zip64 format.
  • The local file header for the too-large file contains an invalid size (2^32-1). This is what confuses the unzip tool.
  • The local file header's CRC value is zero.
  • Apple's Archive Utility can unzip this file because it ignores the incorrect file size in the local header and instead decompresses the stream until the stream signals its end (deflate streams, the usual compression in zip files, have an end-of-stream marker).
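The header fields mentioned in this list can be inspected directly. Below is a minimal sketch, assuming Python 3 and an archive that begins with a local file header (no self-extractor stub in front); the function name read_first_local_header and the keys of the returned dictionary are illustrative choices, not part of any library:

```python
import struct

# Fixed 30-byte part of a local file header: signature, version needed,
# flags, method, mod time, mod date, CRC-32, compressed size,
# uncompressed size, file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
PLACEHOLDER = 0xFFFFFFFF  # 2^32 - 1, the value ZIP64 uses as "see extra field"


def _has_zip64_extra(extra):
    """Return True if the extra field contains a ZIP64 block (ID 0x0001)."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, data_len = struct.unpack_from("<HH", extra, pos)
        if header_id == 0x0001:
            return True
        pos += 4 + data_len
    return False


def read_first_local_header(path):
    """Parse the first local file header of the zip file at `path`."""
    with open(path, "rb") as handle:
        raw = handle.read(LOCAL_HEADER.size)
        if len(raw) < LOCAL_HEADER.size:
            raise ValueError("file too short for a local file header")
        (sig, _version, flags, method, _time, _date,
         crc, csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack(raw)
        if sig != b"PK\x03\x04":
            raise ValueError("file does not start with a local file header")
        name = handle.read(name_len)
        extra = handle.read(extra_len)

    return {
        "name": name.decode("utf-8", errors="replace"),
        "method": method,
        "crc32": crc,
        "compressed_size": csize,
        "uncompressed_size": usize,
        # Flag bit 3 means the real CRC and sizes follow the data instead.
        "uses_data_descriptor": bool(flags & 0x08),
        "sizes_are_placeholder": PLACEHOLDER in (csize, usize),
        "has_zip64_extra": _has_zip64_extra(extra),
    }
```

A result with sizes_are_placeholder set but has_zip64_extra unset would match the situation described above: the header claims a size of 2^32-1 without supplying the ZIP64 field that would hold the real value. This only looks at the first entry, so it says nothing about later members of the archive.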
0

Adding a note that in my case tar and jar both failed, but 7zip helped (7z x big.zip). The large-file discussion pointed me to this helpful question on the Apple stack: How to deal with unzip error on a large file in OSX?

Greenonline
  • 1,851