Scenario: a single 1 GB CSV.gz is being written to an FTP folder. At the same time, my client machine connects over sFTP to that folder and attempts to pull it down.
Q: After fetching that file, whatever apparent length I end up with client-side, can `gzip -t` detect and fail the partial file, regardless of where the truncation hit?
I figure that unzipping or `-t`-testing will error out on 99% of the possible truncation points, where a fragment ends abruptly mid-stream, but does the gz structure have clean cleaving points where gzip would accidentally report success?
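To make "clean cleaving points" concrete, this is the brute-force probe I'd run against a small known-good file (my own sketch; `sample.csv.gz` is a stand-in, and on the real 1 GB file you'd sample offsets rather than sweep them all):

```sh
# Truncate a known-good .gz at every possible length and flag any
# length at which `gzip -t` still reports success.
f=sample.csv.gz                      # stand-in for the real snapshot
size=$(wc -c < "$f")
for len in $(seq 1 $((size - 1))); do
    head -c "$len" "$f" > truncated.gz
    if gzip -t truncated.gz 2>/dev/null; then
        echo "gzip -t passed at truncation length $len"
    fi
done
```

A sweep like this can only cover one toy file, though; what I need to know is whether *any* file/offset combination exists that would pass.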
Mitigations that are not on the table (because if one of these were in play, I wouldn't need to ask the above; a sketch of the check I'm hoping to rely on follows this list):
- Getting the file length or MD5 through another network request.
- Polling the file length through FTP isn't great, as the server may be writing chunks to the gzip stream only sporadically. Until the batch job has closed the file handle, it would be lethal to my analysis to mistake a pause in writing for a complete data set.
- Being given the final file length or hash by the batch job would remove the need for this Q, but that puts an implementation burden on a team that (for this Q's purposes) may as well not exist.
- We can't avoid the race by scheduling reads and writes for different times of day.
- The server is not using atomic move operations.
- I don't know the CSV row/column counts; they'll change per snapshot and per integration. May as well treat the file being gzipped as an opaque binary blob for this Q.
- There are no client=>sFTP network errors in play (those are caught and handled; my concern is reading a file that's still being sporadically written during the server's batch job).
- Using a RESTful API instead of sFTP.
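For context, the client-side check I'm hoping is sufficient looks roughly like this sketch (host and paths are placeholders, not our production job):

```sh
# Fail-closed fetch-and-verify: any nonzero exit from gzip -t must halt
# the whole pipeline rather than let a partial snapshot through.
set -euo pipefail

# Pull the snapshot over sFTP (placeholder host/path).
sftp batch@example.com:/outbound/snapshot.csv.gz /tmp/snapshot.csv.gz

if ! gzip -t /tmp/snapshot.csv.gz; then
    echo "snapshot.csv.gz failed gzip -t; halting before any statistics run" >&2
    exit 1
fi
# ...only now does the file go on to the analysis...
```

So the whole question reduces to: is that `gzip -t` exit code trustworthy against every possible truncation?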
Didn't find an existing SO question
Several SO questions touch on handling truncations, but in a lossy-acceptable context, as opposed to needing to reliably fail the whole workflow on any problem. (I'm computing in a medical-data context, so I'd rather have the server halt and catch fire than propagate incorrect statistics.)
- *gzip: unexpected end of file with - how to read file anyway* is the reverse: they want to suppress EOF errors as not a problem for their use case.
- *Why I get unexpected end of file in my script when using gzip?* is just about POSIX stream ends intentionally introduced by `head`, and doesn't cover "is it possible for a false positive to slip in?"
- *zcat / gzip error while piping out* is very close, but doesn't ask "am I guaranteed to get this error?"
- *Merge possibly truncated gzipped log files* is also close, as it deals with partial files from terminated batch jobs, but is still about throwing out a few unreadable rows, not about guaranteeing errors.
`.gz` files cannot be reliably uncompressed; you should wait until the original download completes. When the filesize hasn't changed for N seconds? Read `man -k inotify` on the original download system. It may be able to notice the `close` and create a flag file. – waltinator Jan 21 '22 at 23:04
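For completeness, a minimal sketch of what waltinator's suggestion could look like on the server (assuming inotify-tools is available there, which is exactly the kind of server-side cooperation my constraints rule out):

```sh
# Block until the batch job closes the snapshot for writing, then
# publish a flag file the sFTP client can look for. Paths are placeholders.
inotifywait -e close_write /outbound/snapshot.csv.gz
touch /outbound/snapshot.csv.gz.done
```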