
I have a bunch of PNG images in a directory, and an application called pngout that I run to compress them. It is called by a script I wrote. The problem is that the script processes one file at a time, something like this:

FILES=(./*.png)
for f in  "${FILES[@]}"
do
        echo "Processing $f file..."
        # take action on each file. $f store current file name
        ./pngout -s0 $f R${f/\.\//}
done

Processing just one file at a time takes a lot of time. While the script runs, the CPU sits at just 10%. So I discovered that I can divide the files into 4 batches, put each batch in its own directory, open four terminal windows, and fire four instances of my script at the same time; with four processes working on the images, the job takes 1/4 of the time.

The second problem is that I lose time dividing the images into batches, copying the script to four directories, opening four terminal windows, and so on.

How can I do that with one script, without having to divide anything?

I mean two things. First: from a bash script, how do I fire a process into the background (just add & to the end of the line?)? Second: how do I stop sending tasks to the background after the fourth one, and make the script wait until a task ends, firing a new background task as each one finishes, so that there are always 4 tasks running in parallel? If I don't do that, the loop will fire zillions of tasks into the background and clog the CPU.


4 Answers


If you have a copy of xargs that supports parallel execution with -P, you can simply do

printf '%s\0' *.png | xargs -0 -I {} -P 4 ./pngout -s0 {} R{}

For other ideas, the Wooledge Bash wiki has a section in the Process Management article describing exactly what you want.

jw013
  • There are also "gnu parallel" and "xjobs" designed for this case. It's mostly a matter of taste which you prefer. – wnoise Apr 01 '12 at 04:41
  • Could you please explain the proposed command? Thanks! – Eugene S Apr 01 '12 at 07:25
  • @EugeneS Could you be a bit more specific about what part? The printf collects all png files and passes them via a pipe to xargs, which collects arguments from standard input and combines them into arguments for the pngout command the OP wanted to run. The key option is -P 4, which tells xargs to use up to 4 concurrent commands. – jw013 Apr 01 '12 at 10:05
  • Sorry for not being precise. I was specifically interested in why you used printf here rather than just a regular ls .. | grep .. *.png. Also, I was interested in the xargs parameters you used (-0 and -I {}). Thanks! – Eugene S Apr 01 '12 at 11:12
  • @EugeneS It's for maximum correctness and robustness. File names are not lines, and ls cannot be used to parse file names portably and safely. The only characters safe to use as file name delimiters are \0 and /, since any other character, including \n, can be part of a file name itself. The printf uses \0 to delimit the file names, and the -0 informs xargs of this. The -I {} tells xargs to replace {} with the argument. – jw013 Apr 01 '12 at 18:23
  • Great! Thank you for your detailed explanation! – Eugene S Apr 02 '12 at 09:36
  • @jw013 - just one question: will this fire another 4 processes only once all of the original 4 are done, or will a new process be fired as each one finishes? – Duck Apr 02 '12 at 13:39
  • @DigitalRobot You can try it for yourself, but the xargs man page seems to imply it will try to keep 4 processes running at all times. – jw013 Apr 02 '12 at 17:43

In addition to the solutions already proposed, you can create a makefile that describes how to build a compressed file from an uncompressed one, and use make -j 4 to run 4 jobs in parallel. The catch is that you will need to name compressed and uncompressed files differently, or store them in different directories; otherwise writing a reasonable make rule will be impossible.
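A minimal sketch of such a makefile (not part of the original answer), assuming the R prefix from the question is used to keep the output names distinct:

# Compress every foo.png into Rfoo.png with pngout.
# Note: recipe lines must be indented with a literal TAB.

SRC := $(filter-out R%,$(wildcard *.png))   # skip outputs of a previous run
OUT := $(addprefix R,$(SRC))

all: $(OUT)

R%.png: %.png
	./pngout -s0 $< $@

With that in place, make -j 4 keeps four pngout jobs running at all times, and a re-run only processes images whose output is missing or out of date. (One caveat of the R prefix scheme: the filter-out also skips any source image whose own name happens to start with R.)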

9000

If you have GNU Parallel (http://www.gnu.org/software/parallel/) installed, you can do this:

parallel ./pngout -s0 {} R{} ::: *.png
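GNU Parallel defaults to roughly one job per CPU core; to pin it at exactly four jobs, as in the question, pass -j (an addition to the original answer):

parallel -j 4 ./pngout -s0 {} R{} ::: *.png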

You can install GNU Parallel simply by:

wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem

Watch the intro videos for GNU Parallel to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Ole Tange

To answer your two questions:

  • yes, adding & at the end of the line will instruct your shell to launch a background process.
  • using the wait command, you can ask the shell to wait for all the processes in the background to finish before proceeding any further.

Here's the script modified so that j keeps track of the number of background processes. When nb_concurrent_processes is reached, the script resets j to 0 and waits for all the background processes to finish before resuming its execution.

files=(./*.png)
nb_concurrent_processes=4
j=0
for f in "${files[@]}"
do
        echo "Processing $f file..."
        # take action on each file. $f store current file name
        ./pngout -s0 "$f" R"${f/\.\//}" &
        ((++j == nb_concurrent_processes)) && { j=0; wait; }
done
jw013
  • This will wait for the last of the four concurrent processes and will then start a set of another four. Perhaps one should build an array of four PIDs and then wait for these specific PIDs? – Nils Mar 31 '12 at 19:59
  • Just to explain my fixes to the code: (1) As a matter of style, avoid all-uppercase variable names, as they potentially conflict with internal shell variables. (2) Added quoting for $f etc. (3) Use [ for POSIX-compatible scripts, but for pure bash [[ is always preferred. In this case, (( is more appropriate for the arithmetic. – jw013 Mar 31 '12 at 20:30
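Following up on Nils's comment above: bash 4.3 added wait -n, which returns as soon as any one background job exits, so the loop can top the pool back up immediately instead of waiting for all four to finish. A sketch of the same loop reworked that way (assuming bash 4.3 or newer):

files=(./*.png)
max_jobs=4

for f in "${files[@]}"
do
        echo "Processing $f file..."
        ./pngout -s0 "$f" R"${f/\.\//}" &
        # if the pool is full, block until any single job exits
        while (( $(jobs -rp | wc -l) >= max_jobs )); do
                wait -n
        done
done
wait   # let the last few jobs finish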