As part of my research project I'm processing a huge amount of data split up into many files.
All files in the folder foo have to be processed by the script myScript, involving all elements of the folder bar.
This is myScript:
#!/bin/bash
# For every file in bar/, remove from the target file ($1) all lines
# that also occur in that bar/ file.
for f in bar/*
do
    awk 'NR==FNR{a[$0]=$0;next}!a[$0]' "$f" "$1" > tmp
    cp tmp "$1"
done
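To make the awk step concrete: while NR==FNR it reads the bar/ file and stores every line in the array a; for the second file it then prints only the lines that are not in a, i.e. it removes from the target file every line that also occurs in the bar/ file. A made-up example (the file names and contents below are purely illustrative):

$ cat bar/blacklist          # hypothetical file
apple
cherry
$ cat foo/data               # hypothetical file
apple
banana
cherry
date
$ awk 'NR==FNR{a[$0]=$0;next}!a[$0]' bar/blacklist foo/data
banana
date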
The obvious first idea, just processing all files one after another with a for loop, works:
for f in foo/*
do
    ./myScript "$f"
done
However, this simply takes forever. Starting every myScript in the background by appending & would instead create thousands of concurrently running instances of awk and cp, each with huge input, which is obviously bad.
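Spelled out, that ruled-out naive version would look roughly like this (just for illustration):

for f in foo/*
do
    ./myScript "$f" &   # one background job per file, all started at once
done
wait                    # wait for all of them to finish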
I thought of limiting the number of "threads" created with something like the following:
for f in foo/*
do
    THREAD_COUNT=$(ps | wc -l)
    # wait until fewer than 12 processes are listed before starting the next job
    while [ "$THREAD_COUNT" -ge 12 ]
    do
        sleep 1
        THREAD_COUNT=$(ps | wc -l)
    done
    ./myScript "$f" &
done
As a side note: I'm comparing against 12 because I've got 8 cores on my nodes, and at the moment of the ps | wc -l call there are apparently always bash, ps and wc running as well as the header line.
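That is, on an otherwise idle node the baseline count looks roughly like this (the output is illustrative, not a measurement), and 4 baseline lines plus 8 running copies of myScript give the threshold of 12:

$ ps | wc -l
4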
Unfortunately the call to myScript causes more than one additional entry in ps, so the behaviour of my script wasn't as intended. So here's my question: Is there a simpler way? A way which is more stable?
I'm not doing anything else on the nodes, so everything that happens is caused by these scripts alone.
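For what it's worth, the kind of thing I have in mind when I say "more stable" would be to count only the background jobs started by this shell instead of every process in ps. This is only a rough sketch (it assumes bash) and I don't know whether it's the right direction:

for f in foo/*
do
    # jobs -r -p lists only the PIDs of this shell's running background jobs,
    # so other processes on the node cannot disturb the count
    while [ "$(jobs -r -p | wc -l)" -ge 8 ]
    do
        sleep 1
    done
    ./myScript "$f" &
done
wait   # let the last batch finish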