
As part of my research project I'm processing a huge amount of data split up into many files.

All files in the folder foo have to be processed by the script myScript, which involves all files in the folder bar.

This is myScript:

for f in bar/*
do
    # drop from the file given as $1 every line that also occurs in $f
    awk 'NR==FNR{a[$0]=$0;next}!a[$0]' "$f" "$1" > tmp
    cp tmp "$1"
done

The obvious first idea, just processing all files with a for loop, works:

for f in foo/*
do
    ./myScript $f
done

However, this simply takes forever. Starting every myScript in the background by appending & would create thousands of instances of awk and cp running in parallel on huge inputs, which is obviously bad.

I thought of limiting the number of "threads" created with the following:

for f in foo/*
do
    THREAD_COUNT=$(ps | wc -l)
    while [ $THREAD_COUNT -ge 12 ]
    do
        sleep 1
        THREAD_COUNT=$(ps | wc -l)
    done
    ./myScript "$f" &
done

As a side note: I compare against 12 because I have 8 cores on my nodes, and at the moment ps | wc -l is called there are apparently always bash, ps and wc running as well, plus the header line of ps.

Unfortunately, calling myScript causes more than one additional entry in ps, so my script didn't behave as intended.
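
A more stable count would presumably only look at this shell's own background jobs via the jobs builtin, roughly along these lines (a sketch, assuming bash):

for f in foo/*
do
    # block while 8 or more of our own background jobs are still running
    while [ "$(jobs -rp | wc -l)" -ge 8 ]
    do
        sleep 1
    done
    ./myScript "$f" &
done
wait    # let the last batch of jobs finish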

So here's my question: Is there a simpler way? A way which is more stable?

I'm not doing anything else on the nodes, so everything happening is caused by the scripts only.


3 Answers


While you can do this with a shell script, you'd be going about it the hard way. Shell scripts aren't very good at manipulating multiple background jobs.

My recommendation is to use GNU make or some other version of make that has a -j option to execute multiple jobs in parallel. Write each subtask as a makefile rule.

I think the makefile snippet below implements your rules, but your code was hard to follow, so I might not have gotten it right. The first line enumerates the output files from the input files (note: never overwrite any input files! If the job stops in the middle for any reason, you'll end up with data for which you won't know whether it has been processed). The indented lines are the commands to run; use a tab to indent each command, not 8 spaces. In these commands, $< represents the source file (a .in file), $@ represents the target (the .out file), and $* is the target without its extension. All $ signs in shell commands must be doubled, and each command line is executed in a separate subshell, unless you put a \ at the end, which cancels that newline (so the shell sees one long line starting with set -e and ending with done).

all: $(patsubst %.in,%.out,$(wildcard foo/*.in))
%.out: %.in
        cp $< $*.tmp.in
        set -e; \
        for f in bar/*; do \
          awk 'NR==FNR{a[$$0]=$$0;next}!a[$$0]' $$f $*.tmp.in >$*.tmp.out; \
          mv $*.tmp.out $*.tmp.in; \
        done
        mv $*.tmp.in $@

Put this in a file called Makefile and call make -j12.
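
A hedged usage note on top of that: make's standard -n (dry-run) and -k (keep-going) options can help when trying this out; neither is specific to this makefile.

make -n        # dry run: only print the commands that would be executed
make -j12 -k   # run up to 12 jobs in parallel and keep going past individual failures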

  • Thanks for your Makefile approach. Actually, since I run all the scripts on temporary files on a dedicated node, it's not a problem to overwrite input data. It's only copied back to my workspace if successful (and it's not overwritten there). I have additional logic (such as locking), but I didn't consider it necessary to add it to the question. – stefan Sep 07 '12 at 07:13

Using GNU Parallel (http://www.gnu.org/software/parallel/) it looks like this:

parallel awk \'NR==FNR\{a\[\$0\]=\$0\;next\}\!a\[\$0\]\' {1} {2} '>{2}.tmp; mv {2}.tmp {2}' ::: bar/* ::: foo/*

This will run one job per core. Use -j150% to run 1.5 jobs per core.

If you just want to run several myScript in parallel do:

parallel ./myScript ::: foo/*

Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
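
A small, hedged extension of the same command, assuming you want to pin the job count to your 8 cores and keep a record of what has already run (the log file name is only a placeholder):

# run at most 8 jobs at a time and record each finished job in a log
parallel -j8 --joblog myscript.log ./myScript ::: foo/*

Together with --joblog, the --resume option lets an interrupted batch continue where it left off instead of starting from scratch.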


You may try using ulimit. From the bash man page:

ulimit [-HSTabcdefilmnpqrstuvx [limit]]
Provides control over the resources available to the shell and to processes started by it,
on systems that allow such control.
[...]
-u     The maximum number of processes available to a single user

So if you put ulimit -u 8 at an appropriate position inside the script, it limits the number of processes available to that shell to 8.

Did not test it, though.
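
For concreteness, a minimal (and equally untested) sketch of where the limit might go; the value 20 is an arbitrary assumption that has to cover the awk, cp and other helper processes each job starts:

#!/bin/bash
ulimit -u 20    # limit inherited by every child; fork fails once this user has 20 processes
for f in foo/*
do
    ./myScript "$f" &
done
wait

As the comments below point out, once the limit is hit the shell cannot fork new jobs, so failed spawns would still need to be retried.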

  • Seems like a good and easy way. Unfortunately there is an error "fork: resource temporarily unavailable" now – stefan Sep 06 '12 at 21:39
  • That is expected - if you limit the maximum number of processes you will get an error on fork if you try to exceed that number. You can still accomplish what you want with a loop that constantly tries to spawn background jobs and constantly retries the ones that failed due to hitting the process limit. – jw013 Sep 06 '12 at 23:33