1

Suppose you have a binary that has to be run on maaaaaaaaany files (suppose the files are numbered from 1 to N). Each file has to be processed by making a call to this binary (say.... something like md5sum), and each run saves its result to a separate file. So.... if we have 1000 files and only 4 CPUs, we really don't want to do something like this (if it can be avoided at all):

i=0; while [ $i -lt 1000 ]; do md5sum a_file_$i > result_$i & i=$(( $i + 1 )); done

Because (even if bash wouldn't complain) we would end up starting 1000 processes at once, which would bring the computer to a crawl.

Is there a command I could use to tell it to run n processes at a time (start n processes, monitor when one finishes and then start another, so that the number of processes running is always n)?
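
For the record, here is a minimal sketch of that behaviour using nothing but bash's own job control (wait -n needs bash 4.3 or later); what I'm really after is a ready-made command:

i=0; active=0
while [ $i -lt 1000 ]; do
  md5sum "a_file_$i" > "result_$i" &
  active=$(( active + 1 ))
  if [ $active -ge 4 ]; then
    wait -n                      # block until any one background job finishes
    active=$(( active - 1 ))
  fi
  i=$(( i + 1 ))
done
wait                             # wait for the remaining jobs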

eftshift0
  • 591

1 Answer

2

GNU parallel is the tool you're looking for. The author, Ole Tange, is a regular here and has written several good answers to questions about it.
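
For example, a minimal sketch with parallel (assuming the files match a_file_* in the current directory; -j 4 caps the number of simultaneous jobs at 4):

parallel -j 4 'md5sum {} > {}.md5sum' ::: a_file_*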

GNU's version of xargs from findutils also has some options for running multiple jobs in parallel. It's probably easier to use for simple jobs like yours, but nowhere near as flexible or capable as parallel.

For example:

find . -maxdepth 1 -type f -name 'a_file_*' -print0 | 
  xargs -0r -L 1 -P 4 sh -c '/usr/bin/md5sum "$1" > "$1.md5sum"' {}

This will run up to 4 md5sum jobs in parallel (-P 4). I've also used the -L 1 option to limit each job to processing one filename at a time - without it, xargs would run just one job with all 1000 filenames and there would be nothing to parallelise.
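
If you'd rather drive it from the numbering in your question instead of find, something along these lines should also work (a sketch, assuming the files are numbered 1 to 1000 and use the a_file_N / result_N names from your loop):

seq 1 1000 | xargs -P 4 -n 1 sh -c 'md5sum "a_file_$1" > "result_$1"' sh

The trailing sh just fills $0, so the number read from seq ends up in $1.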

cas
  • 78,579
  • I ended up opening N terminal sessions on konsole and using a number sequence starting from a different index (increasing by N on each session). Not the most elegant way to handle but it worked. I'll take a look at both xargs and parallel for the next time I face something like this. – eftshift0 Aug 04 '17 at 17:02