
I'd like to process the files downloaded by wget -i as soon as each one finishes downloading, instead of waiting for the entire list to complete (i.e. for the wget process to exit). The trouble is that because wget downloads each file in place, at its final path, I cannot be sure when a file is safe to process (fully downloaded). Ideally, the principled approach is (I believe) to have wget initially download files into a temporary directory and then mv them into the actual destination directory when complete. Because the mv is atomic*, I can guarantee that any file present in the destination directory is completely downloaded and ready for processing.
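For concreteness, the consumer side of that arrangement could be as simple as watching the destination directory for renames. A minimal sketch, assuming inotify-tools is available and with process and /path/to/dest/ as placeholders:

#!/bin/bash
# moved_to fires only after the rename, i.e. only for fully downloaded files
inotifywait -m -e moved_to --format '%w%f' /path/to/dest/ |
while IFS= read -r file; do
    process "$file"
done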

I've been through the manpage, but can't seem to find anything to this end. My current hacky approach is to use fuser to check whether wget still has the file open. But this is very fragile (what if wget opens a file multiple times?) and I'd like to avoid it.
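For reference, the current check looks roughly like this (a sketch; process is a placeholder, and it inherits all of the fragility described above):

for f in /path/to/dest/*.html; do
    # fuser -s exits 0 while some process still has the file open
    fuser -s "$f" || process "$f"
done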

If there isn't a way to achieve this exactly, is there a workaround that can achieve the same effect? The files are HTML pages if that's at all relevant.

*Addendum: Apparently mv may not be atomic (it is in my environment), though I don't think strict atomicity is needed here. The only requirement is that once a file has been renamed into the destination directory it is completely downloaded (and its complete contents are immediately available at the new path).
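In practice this means the temporary directory should sit on the same filesystem as the destination (e.g. a hidden subdirectory of it), so the mv is a rename rather than a copy-and-delete. A sketch of the intended pattern, using a hypothetical single-URL wget call purely to show the shape (avoiding per-URL invocations is the whole point of the question, per the edit below):

tmp=/path/to/dest/.partial     # same filesystem as the destination
mkdir -p "$tmp"
wget -P "$tmp" "$url"          # hypothetical per-URL call, only to illustrate
# rough guess at the saved filename; it appears at its final path only once complete
mv "$tmp/${url##*/}" /path/to/dest/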

edit: Splitting the process up into multiple wget commands is also not ideal because it precludes using some core features of wget (rate limiting, HTTP keepalive, DNS caching, etc.).

    Just an option for consideration: look for the second-oldest file and process that one. All sorts of caveats (only one download at a time; your processing might create new files; the last file won't have a newer file...) – Jeff Schaller May 15 '19 at 14:29
  • @JeffSchaller Yeah that's a decent solution. However, the last file not having a newer file makes it hard to come up with a reasonable termination condition (timeout isn't great depending on file size and connection speed, fuser just for the last is less fragile but bad). I guess I could add a sentinel dummy file URL, but it feels like there should be a better way! – Bailey Parker May 15 '19 at 14:33
  • What about a construct like this: wget [url] -O /dev/stdout | [next step of your process] – Httqm May 15 '19 at 14:39
  • @Httqm That could work if there was a way to distinguish between separate files. Piping loses the file name and makes it difficult (impossible?) to separate the files. Note that I'm using wget -i to download multiple URLs here. – Bailey Parker May 15 '19 at 14:41
  • @BaileyParker; it gets convoluted, but your monitoring process could look for an active wget process -- hopefully only the one you're interested in; or the wget sequence could touch a file at the end, which your monitoring process then knows to skip; etc. (see the sketch after these comments) – Jeff Schaller May 15 '19 at 14:46
  • @JeffSchaller Oh that's true. I bet with some pipes you could even communicate wget's PID to the monitoring process to get the right one. This removes the ability to have a dependency between wget and the process though. For example you can't do wget -i $(./monitoring-process) and have it emit new files to download. I'm probably pushing against the limits of what I should be doing here (before I should just throw everything into a script). I'm trying to lean on wget as much as possible, because it does its job well! – Bailey Parker May 15 '19 at 14:55
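Putting those two comment suggestions together (process everything except the newest file, and append a dummy sentinel URL to the list so every real file eventually has a newer neighbor), a rough sketch, assuming wget -i downloads sequentially into $dl_dir, filenames contain no newlines, the dummy last URL saves as sentinel.html, and process is a placeholder:

dl_dir=downloads        # where wget -i is writing
done_dir=processed      # processed files are moved here so they aren't re-handled
mkdir -p "$done_dir"

sweep() {
    # everything except the newest entry is guaranteed complete
    ls -t "$dl_dir" | tail -n +2 | while IFS= read -r f; do
        process "$dl_dir/$f" && mv "$dl_dir/$f" "$done_dir/"
    done
}

until [ -e "$dl_dir/sentinel.html" ]; do
    sweep
    sleep 1
done
sweep   # the sentinel is now the newest file, so every real page has been handled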

1 Answer


Use aria2c instead:

aria2c --on-download-complete="/path/to/script" -i file

so your script can be:

#!/bin/bash
notify-send "Finished: $3"
  • $1 is the gid from aria2c.
  • $2 is the number of files.
  • $3 is the filename.
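For the use case in the question, the hook itself can kick off the per-file processing (a sketch; process-page stands in for whatever is run on each downloaded HTML file):

#!/bin/bash
# aria2c invokes this with: gid, number of files, path of the completed file
process-page "$3"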
pLumo