Suppose we want to dispatch jobs to a collection of servers using GNU parallel. What happens if one of the servers dies (power failure, thermal shutdown, ...) while it is busy executing a job? Will GNU parallel dispatch the same job to another server, or will that job be lost forever?
2 Answers
4
It seems I should have read the man page more carefully. Failed jobs can be re-run by saving a joblog file and resuming from it, like so: parallel --resume-failed --joblog logfile
I will delete this post if it is deemed to be of little value to anyone.

niobe
- 205
4
It will be lost forever unless you use --retries, in which case it will be retried on another server. Also have a look at --filter-hosts to remove hosts that are down.

Ole Tange
- 35,514
--retries is what I am looking for. It should work well in conjunction with --resume-failed if the job server might go down as well. – niobe Nov 22 '16 at 06:29