I'd like to be able to process multiple files downloaded by wget -i immediately after each one is downloaded, instead of waiting for all the files in the list to finish (i.e., for the entire wget process to exit). The trouble is: because wget downloads each file in place, I cannot be sure when a file is safe to process (fully downloaded). Ideally, the principled approach is (I believe) to have wget initially download files into a temporary directory and then mv them into the actual destination directory when complete. Because the mv is atomic*, I can guarantee that any file present in the destination directory is completely downloaded and ready for processing.
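For concreteness, here is a minimal sketch of the arrangement I have in mind. This is not a built-in wget feature; the urls.txt, incoming, and ready names are placeholders, and it sidesteps the single-process concern raised in the edit below:

    #!/bin/sh
    # Hypothetical wrapper: download each file into a staging directory,
    # then rename it into the destination once it is complete.
    mkdir -p incoming ready
    while read -r url; do
        name=$(basename "$url")
        # wget writes into incoming/ while the download is in progress
        wget -q -O "incoming/$name" "$url" &&
            mv "incoming/$name" "ready/$name"   # file appears fully written
    done < urls.txt

Anything that shows up under ready/ would then be safe to process.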
I've been through the manpage, but can't seem to find anything to this end. My current hacky approach is to use fuser to check whether wget still has the file open. But this is very fragile (what if wget opens a file multiple times?) and I'd like to avoid it.
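Concretely, the hack looks something like this (a sketch; downloads is a placeholder directory and process stands in for the real handler):

    # Hacky completeness check: assume a file is done once no process
    # (hopefully only wget) still has it open; fuser -s exits 0 if one does.
    for f in downloads/*; do
        if ! fuser -s "$f"; then
            process "$f"
        fi
    done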
If there isn't a way to achieve this exactly, is there a workaround with the same effect? The files are HTML pages, if that's at all relevant.
*Addendum: Apparently mv may not be atomic in general (although it is in my environment), but I don't think strict atomicity is needed. The only requirement is that once a file is renamed into the destination directory, it is completely downloaded (and the complete contents are immediately available at the new path).
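In practice, mv is a single rename(2) call, and therefore gives this guarantee, only when source and destination are on the same filesystem; across filesystems it degrades to a copy followed by an unlink. A quick sanity check, assuming GNU stat and the placeholder directories from the sketch above:

    # If the device numbers match, mv between these directories is a
    # single rename() and never exposes a partially written file.
    [ "$(stat -c %d incoming)" = "$(stat -c %d ready)" ] &&
        echo "same filesystem: mv is an atomic rename"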
Edit: Splitting the process up into multiple wget commands is also not ideal, because it precludes using some core features of wget (rate limiting, HTTP keepalive, DNS caching, etc.).
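The closest workaround I can think of that keeps a single wget -i process is to watch the destination directory for close_write events. This is only a sketch: it assumes inotify-tools is installed and that wget opens each output file exactly once (the same assumption the fuser hack rests on); downloads and process are placeholders again:

    #!/bin/sh
    # One wget process for the whole list; inotifywait reports each file
    # when wget closes it for writing, i.e. when its download has finished.
    mkdir -p downloads
    wget -q -i urls.txt -P downloads &
    inotifywait -m -e close_write --format '%w%f' downloads |
    while read -r f; do
        process "$f"
    done
    # Note: -m monitors forever, so the loop needs an external stop signal.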
Comments:

- […] fuser just for the last is less fragile but bad). I guess I could add a sentinel dummy file URL, but it feels like there should be a better way! – Bailey Parker May 15 '19 at 14:33

- wget [url] -O /dev/stdout | [next step of your process] – Httqm May 15 '19 at 14:39

- […] wget -i to download multiple URLs here. – Bailey Parker May 15 '19 at 14:41

- […] wget process -- hopefully only the one you're interested in; or the wget sequence could touch a file at the end, which your monitoring process then knows to skip; etc. – Jeff Schaller May 15 '19 at 14:46

- […] wget's PID to the monitoring process to get the right one. This removes the ability to have a dependency between wget and the process, though. For example, you can't do wget -i $(./monitoring-process) and have it emit new files to download. I'm probably pushing against the limits of what I should be doing here (before I should just throw everything into a script). I'm trying to lean on wget as much as possible, because it does its job well! – Bailey Parker May 15 '19 at 14:55
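For what it's worth, the touch idea from the comments could look like this (a sketch; .all-done is a placeholder sentinel name that the monitoring loop treats as a stop signal rather than as a download):

    #!/bin/sh
    mkdir -p downloads
    # Producer: fetch the whole batch, then drop a sentinel file so the
    # monitor knows nothing more is coming. Backgrounded so the consumer
    # below can run alongside it; in a real script the watch should be
    # established before wget starts, to avoid missing early files.
    ( wget -q -i urls.txt -P downloads; touch downloads/.all-done ) &

    # Consumer: process each finished file, stop when the sentinel appears.
    inotifywait -m -e close_write --format '%f' downloads |
    while read -r name; do
        [ "$name" = ".all-done" ] && break
        process "downloads/$name"
    done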