2

Let's say I have a command accepting a single argument which is a file path:

mycommand myfile.txt

Now I want to execute this command for multiple files in parallel; specifically, for files matching the pattern myfile*.

Is there an easy way to achieve this?

  • 1
    You could use find for that, something like find . -name "*xls*" -exec ls -l {} \;. In your case: find . -name "*xls*" -exec mycommand {} \;. Although find searches for files recursively, you can limit it so it does not descend into subdirectories. – mnille Feb 25 '16 at 11:32
  • 1
    That would run the commands sequentially, not in parallel. – the_velour_fog Feb 25 '16 at 11:34
  • Oh, ok, sorry, right. I didn't get this point. – mnille Feb 25 '16 at 12:26
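Putting the two comments above together: find's -exec runs the command once per file, one after another, but piping find's output into xargs with -P restores the parallelism. A small sketch, using echo as a stand-in for mycommand:

```shell
# Sequential: -exec invokes the command once per match, one at a time.
# find . -name 'myfile*' -exec mycommand {} \;

# Parallel: hand the same file list to xargs, up to 4 jobs at once.
# -print0 / -0 keep filenames containing spaces or newlines intact.
find . -name 'myfile*' -type f -print0 |
  xargs -0 -P4 -n1 echo
```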

3 Answers

10

With GNU xargs and a shell that supports process substitution:

xargs -r -0 -P4 -n1 -a <(printf '%s\0' myfile*) mycommand

That would run up to 4 instances of mycommand in parallel.

If mycommand doesn't use its stdin, you can also do:

printf '%s\0' myfile* | xargs -r -0 -P4 -n1 mycommand

That also works with the xargs of modern BSDs.
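To see the -P4 fan-out in action without a real mycommand, substitute echo; the output order may vary between runs, since the jobs execute concurrently:

```shell
# Create a few sample files, then run the pipeline with echo
# standing in for mycommand.
touch myfile1.txt myfile2.txt myfile3.txt
printf '%s\0' myfile* | xargs -r -0 -P4 -n1 echo
```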

For a recursive search for myfile* files, replace the printf command with:

find . -name 'myfile*' -type f -print0

(-type f restricts the match to regular files. For a glob equivalent, you need zsh and its glob qualifiers: printf '%s\0' myfile*(.).)

  • Is the benefit of this, over a loop, that the xargs -0 and printf '%s\0' combination handles whitespace in the filenames better? – the_velour_fog Feb 25 '16 at 11:39
  • @the_velour_fog, no, the loop will handle whitespace OK as long as you don't forget to quote the variables. The benefit is that you can limit the number of concurrent processes more easily, and that you can handle find's output more easily. – Stéphane Chazelas Feb 25 '16 at 11:46
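For comparison, capping a plain loop at four concurrent jobs takes extra bookkeeping, which is the point Stéphane makes above. A minimal sketch using bash's wait -n (bash 4.3 or later; older shells would need a different mechanism), with mycommand standing in for your actual command:

```shell
#!/usr/bin/env bash
# Run at most $max background copies of mycommand at a time.
max=4
running=0
for f in myfile*; do
  mycommand "$f" &          # mycommand is your command, as in the question
  running=$((running + 1))
  if [ "$running" -ge "$max" ]; then
    wait -n                 # block until any one background job exits
    running=$((running - 1))
  fi
done
wait                        # wait for the remaining jobs
```

With xargs -P4 the same limit is a single flag, which is why the answer above prefers it.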
7

Using a loop:

for f in myfile*; do
  mycommand "$f" &
done

wait

Or use GNU parallel.

cuonglm
4

Using GNU Parallel it looks like this:

parallel mycommand ::: myfile*

It will run one job per core.

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine, or on multiple machines you have ssh access to. It can often replace a for loop.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Image: simple scheduling — 8 jobs queued per CPU]

GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time:

[Image: GNU Parallel's scheduling — a new job starts as soon as one finishes]

Installation

If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Ole Tange