5

Suppose we want to dispatch jobs to a collection of servers using GNU parallel. What happens if one of the servers dies (power failure, thermal shutdown, ...) while it is busy executing a job? Will GNU parallel dispatch the same job to another server, or will that job be lost forever?

niobe
  • 205

2 Answers

4

It seems I should have read the man pages more carefully. We can rerun failed jobs by saving a job log with --joblog and then resuming from it, like so: parallel --resume-failed --joblog logfile
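For illustration, here is a minimal local sketch of that workflow. The job inputs (1 2 3), the ok.N marker files, and the temporary directory are made up for the example; only --joblog and --resume-failed come from the man page.

```shell
#!/bin/sh
# Sketch only: the inputs and the "test -e" command are hypothetical
# stand-ins for real jobs.
dir=$(mktemp -d)
log="$dir/joblog"
touch "$dir/ok.1" "$dir/ok.3"    # no ok.2 yet, so job 2 will fail

# First run: --joblog records the exit status of every job.
parallel --joblog "$log" test -e "$dir"/ok.{} ::: 1 2 3

# Remove the cause of the failure, then repeat the same command with
# --resume-failed. Jobs logged as successful are skipped; job 2 is rerun.
touch "$dir/ok.2"
parallel --resume-failed --joblog "$log" test -e "$dir"/ok.{} ::: 1 2 3
resume_status=$?
```

The job log has to survive between the two runs, so in practice it should live somewhere persistent rather than in a temporary directory.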

I will delete this post if it is deemed to be of little value to anyone.

niobe
  • 205
4

It will be lost forever unless you use --retries, in which case it will be retried on another server. Also have a look at --filter-hosts, which removes hosts that are down.
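A rough sketch of --retries, under some assumptions: the SERVERS variable and the marker file are invented for the example, and the default host ':' means "run on the local machine without ssh", so the snippet can be tried without any remote servers. With a real host list such as server1,server2 (hypothetical names), --filter-hosts could be added to the same command line.

```shell
#!/bin/sh
# Sketch only: SERVERS is a hypothetical variable. ':' runs jobs locally
# without ssh; set e.g. SERVERS=server1,server2 to use real hosts.
SERVERS=${SERVERS:-:}
marker=$(mktemp -u)    # a path that does not exist yet

# The job fails on its first attempt (marker missing) and succeeds on the
# next one, standing in for a job whose server died while running it.
# --retries 3 lets parallel try the failed job again instead of dropping it.
parallel --retries 3 -S "$SERVERS" \
  'if [ -e {} ]; then echo done; else touch {}; exit 1; fi' ::: "$marker"
status=$?
rm -f "$marker"
```

With more than one host in the list, the retry goes to a host on which the job has not yet failed, which is the behaviour described above.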

Ole Tange
  • 35,514
  • Thanks. Looking at the man pages, it seems --retries is what I am looking for. It should work well in conjunction with --resume-failed if the job server might go down as well. – niobe Nov 22 '16 at 06:29