
Scenario: a single ~1 GB .csv.gz file is being written to an FTP folder. At the same time, my client machine connects to that folder over SFTP and attempts to pull the file down.

Q: After fetching that file, at whatever apparent length I end up with client-side, can gzip -t detect and fail the partial file, regardless of where the truncation hit?

I figure that decompressing, or testing with -t, will error out at 99% of the possible truncation points, since the fragment ends abruptly, but does the gz structure have clean cleaving points where gzip would accidentally report success?
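
For concreteness, below is a minimal sketch of the client-side gate I'd build if the answer is yes. The host, paths, and abbreviated sftp invocation are all hypothetical:

    # Hypothetical client-side gate: pull the snapshot, then let gzip -t
    # decide whether the pipeline proceeds or hard-fails.
    sftp batchuser@vendor-host:/outbox/snapshot.csv.gz /landing/ || exit 1
    if gzip -t /landing/snapshot.csv.gz; then
        echo "gzip integrity check passed; continuing"
    else
        echo "partial or corrupt snapshot; halting" >&2
        exit 1
    fi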

Mitigations that are not on the table (because if any of these were in play, I wouldn't need to ask the above):

  1. Getting the file length or MD5 hash through another network request.
    1. Polling the file length over FTP isn't great, as the server may be writing chunks to the gzip stream only sporadically. Until the batch job has closed the file handle, mistaking a lull for a complete data set would be lethal to my analysis.
    2. Being given the final file length or hash by the batch job would remove the need for this Q, but that puts an implementation burden on a team that (for this Q's purposes) may as well not exist.
  2. We can't avoid the race by scheduling reads/writes for different times of day.
  3. The server is not using atomic move operations.
  4. I don't know the CSV row/column counts; they'll change per snapshot and per integration. You may as well treat the file being gzipped as an opaque binary blob for this Q.
  5. There are no client=>SFTP network errors in play. (Those are caught and handled; my concern is reading a file that's still being sporadically written by the server's batch job.)
  6. Using a RESTful API instead of SFTP.

Didn't find an existing SO question

Several SO questions touch on handling truncation, but in contexts where loss is acceptable, as opposed to needing to reliably fail the whole workflow on any problem. (I'm computing in a medical-data context, so I'd rather have the server halt and catch fire than propagate incorrect statistics.)

  • You are going about this the wrong way. Since truncated .gz files cannot be reliably uncompressed, you should wait until the original download completes. When the filesize hasn't changed for N seconds? Read man -k inotify on the original download system. It may be able to notice the close and create a flag file. – waltinator Jan 21 '22 at 23:04
  • Sorry @waltinator, that doesn't help. I don't want truncated gz's to decompress. And as the writer of the client code, I already know whether the download completes successfully or not (see not-on-table #5). I'm looking for a reliable solution while optimizing for "no new requests put on the source company's dev team", because if I have their ear, there are several ways to solve the issue without relying on gunzip error detection. (inotify doesn't apply; I'm not shelling/exec'ing into that server, and the source job already knows when it finishes.) If I get the vendor's devs' ear, a flag file... – xander Jan 22 '22 at 00:30
  • ...that has enough details to distinguish one well-formed snapshot from the next snapshot starting to overwrite the former, means I no longer need gzip -t at all. – xander Jan 22 '22 at 00:33

1 Answer


Files in gzip format (RFC 1952) end each member with an 8-byte trailer: a CRC-32 of the uncompressed data, then the length of the uncompressed data (ISIZE). This is an ancient format and the length field only has 32 bits, so nowadays it's interpreted as the length modulo 2^32 (i.e. 4 GiB). There is no stored length or checksum of the compressed data itself. After decompression, gzip checks that the CRC-32 of the decompressed data is correct, and that the size of the decompressed data matches ISIZE modulo 2^32.
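
To make that trailer concrete, here's one way to inspect it, assuming GNU coreutils on a little-endian machine (the fields are stored little-endian, which od then reads back in host byte order):

    # Dump the two 32-bit trailer fields of a .gz member: the CRC-32,
    # then ISIZE (uncompressed length mod 2^32).
    tail -c 8 file.gz | od -An -t u4

    # Cross-check ISIZE against the actual decompressed byte count.
    gzip -dc file.gz | wc -c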

As a consequence, truncating a single-member file is detected in almost every case: the DEFLATE stream only ends when a block marked as final is complete, so a cut at an arbitrary point normally produces an "unexpected end of file" error before the trailer checks even run. The residual risk is a truncation point where the remaining compressed data happens to parse as a syntactically complete DEFLATE stream whose last 8 bytes then pass as a trailer. If the input is not deliberately crafted, there's only about a 1/2^64 chance that both the CRC-32 and the length modulo 2^32 would match at such a point. So there's only a tiny chance of an undetected error, but it's nonzero, and I'm sure it could be crafted deliberately.
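
As a sanity check of the ordinary (non-pathological) case, a single-member file cut at an arbitrary byte fails gzip -t. A quick demonstration, assuming GNU stat:

    # Build a single-member .gz, cut it in half, and confirm gzip -t
    # rejects the fragment.
    seq 1 100000 | gzip -c > whole.gz
    head -c "$(( $(stat -c %s whole.gz) / 2 ))" whole.gz > partial.gz
    gzip -t partial.gz && echo "unexpected pass" || echo "truncation detected"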

Note that this analysis only applies if the gzipped file consists of a single stream. gunzip can decompress files that consist of multiple concatenated streams, and there's no way to detect that a file containing a valid sequence of streams was meant to contain more of them. However, your production chain probably doesn't generate multi-stream files: gzip doesn't do it by itself; you'd have to concatenate the output of multiple runs manually, or use some other tool (pkzip?).
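
Here is that blind spot in script form, with two hypothetical input files. A copy truncated exactly at the first member boundary is itself a fully valid gzip file, so gzip -t passes it:

    # Two concatenated members in one .gz (legal per RFC 1952).
    gzip -c part1.csv >  snap.csv.gz
    gzip -c part2.csv >> snap.csv.gz

    # Truncate exactly at the member boundary, then test. gzip output is
    # deterministic for the same unmodified input file, so this recovers
    # the byte length of the first member.
    member1_size=$(gzip -c part1.csv | wc -c)
    head -c "$member1_size" snap.csv.gz > truncated.csv.gz
    gzip -t truncated.csv.gz && echo "passes, despite half the data missing"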

"the server is not using atomic move operations."

Unfortunately, I don't think there's a completely reliable way to detect errors without either that or an external piece of metadata (a length or cryptographic checksum) that is calculated after the server has finished writing.
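
If the vendor's team ever does become available, that metadata can be tiny. A sketch, with all filenames hypothetical; writing the sidecar only after the data file is closed means its presence doubles as a completion flag:

    # Server-side batch job:
    gzip -c snapshot.csv > snapshot.csv.gz.tmp
    mv snapshot.csv.gz.tmp snapshot.csv.gz               # atomic within one filesystem
    sha256sum snapshot.csv.gz > snapshot.csv.gz.sha256   # written last, after close

    # Client side, after fetching both files into one directory:
    sha256sum -c snapshot.csv.gz.sha256 || exit 1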

  • Thanks! Excellent. My takeaway is: for truncations due to I/O race conditions (not attacks), a false pass would need the fragment to parse as a complete stream and then accidentally collide on both the CRC-32 and the modulo-2^32 length -- and if we add a design constraint that no single file may reach 4 GiB, the ISIZE comparison becomes an exact-length check rather than a modular one. (Attackers finessing collision attacks is a topic with a bigger scope than this Q.) – xander Jan 22 '22 at 00:18
  • Wait... hey @gilles-so-stop-being-evil is the length in the first chunk? If I run tarball-pipe-10gigs.sh | gzip -c > /tmp/foo.tar.gz, gzip can't know the total length until the last chunk. I don't know if the source team is using streams, but it's unlikely.

    If the gz format has an endcap structure that wouldn't be well-formed in a truncated file, that's just another "I'm happy" scenario...

    – xander Jan 22 '22 at 00:40
  • (Went off to read RFC 1952.) While I could easily be misreading the RFC, it looks like a truncation that just happens to land after a member's ISIZE field would trick gzip -t into reporting no errors. – xander Jan 22 '22 at 01:00
  • @xander I was assuming that the file is produced by gzip (either the program or the library), without manually concatenating the results of multiple invocations. A single run of gzip produces a file with a single compressed stream. – Gilles 'SO- stop being evil' Jan 22 '22 at 10:29
  • Thanks @gilles-so-stop-being-evil -- I can assume/verify the source system isn't concatenating, merely streaming or using gzip onefile.csv, so if it's a given that one call to gzip only writes one "member," then gzip -t works for my use case. – xander Jan 23 '22 at 18:35