
OK, so I have a bash function that I apply to several folders:

function task(){
  do_thing1
  do_thing2
  do_thing3
  ...
}

I want to run that function in parallel. So far I have been using a little fork trick:

N=4   # number of parallel jobs (one per core)
for temp_subj in "${raw_dir}"/MRST*
do
  ((i=i%N)); ((i++==0)) && wait   # every N iterations, wait for the previous batch to finish
  task "$temp_subj" &
done

And it works great. But I decided to get something 'cleaner' and use GNU parallel:

ls -d ${raw_dir}/MRST* | parallel task {}

The problem is that it puts EVERYTHING in parallel, including the do_thing commands within my task function, and it inevitably crashes because those have to be executed serially. I have tried modifying the call to parallel in many ways, but nothing seems to work. Any ideas?

Orchid
  • @thanasisp I tried many suggestions from that post that use GNU Parallel, but it still won't work. – Orchid Nov 20 '20 at 04:12
  • Your current solution executes the function in batches of 4: it starts 4 of them, waits for all of them to finish, then starts the next 4, and so on. If you want to keep at most 4 parallel calls of the function running at any time, use xargs. You have to export the function and use xargs -n1 -P4 calling a subshell, like in this post (see the sketch after these comments). If you want to use parallel, it is a similar thing, like in this post. Also, do not use ls output to build the arguments; use find or glob expressions. – thanasisp Nov 20 '20 at 20:21
  • @thanasisp thanks a lot for the quick reply and your suggestions. I tried xargs -n1 -P4 with a subshell, but it has the same issue as parallel: all the do_thing commands within my function are executed at the same time, as if the function were treated as a list of commands rather than "as a whole", the way & would handle it. – Orchid Nov 20 '20 at 23:02
  • Of course, task is the minimum unit that will be executed in your example. When you say in your description that task "$temp_subj" & works well, it means you are OK with that. You run 4 tasks in parallel, and each task intentionally does not run any of its commands asynchronously. If you mean to parallelize the calls inside task, then you have to rephrase the question. – thanasisp Nov 20 '20 at 23:08
  • @thanasisp "If you mean to parallelize the calls inside task then you have to rephrase the question" No, I don't, and that is my problem: parallel and xargs are doing it by default, or so it seems. – Orchid Nov 21 '20 at 04:08
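
As a side note, here is a minimal sketch of the xargs variant thanasisp describes, assuming task is defined in the current shell and only needs the subject directory as its argument:

export -f task   # make the function visible to the bash subshells xargs starts

# at most 4 jobs at a time, one directory per job;
# each task still runs its internal commands serially
find "${raw_dir}" -maxdepth 1 -name 'MRST*' -print0 |
  xargs -0 -n1 -P4 bash -c 'task "$1"' _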

1 Answer


I think your problem is related to do_thingX:

do_thing() { echo Doing "$@"; sleep 1; echo Did "$@"; }
export -f do_thing
do_thing1() { do_thing 1 "$@"; }
do_thing2() { do_thing 2 "$@"; }
do_thing3() { do_thing 3 "$@"; }
# Yes you can name functions ... - it is a bit unconventional, but it works
...() { do_thing ... "$@"; }
export -f do_thing1
export -f do_thing2
export -f do_thing3
export -f ...

function task(){
  do_thing1
  do_thing2
  do_thing3
  ...
}
export -f task

This should take 4 seconds for a single input

ls ${raw_dir}/MRST* | time parallel task {}

or you are using a different parallel than GNU Parallel. Check that it is GNU Parallel with:

$ parallel --version
GNU parallel 20201122
Copyright (C) 2007-2020 Ole Tange, http://ole.tange.dk and Free Software
Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: https://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication please cite as described in 'parallel --citation'.

Ole Tange
  • Thanks a lot Ole for your reply. The do_thingX in my function are actually commands from another toolbox, not functions I made. I actually changed my task function into a .sh file and then called: find ${raw_dir} -name "MRST*" | parallel -j+0 bash mytask.sh {/} and it's working! I just thought I could put everything in one file, e.g. a function plus the parallel command that calls the function. – Orchid Nov 24 '20 at 18:21
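
For what it's worth, keeping everything in one file, as Orchid wonders about at the end, amounts to exporting the function and calling GNU Parallel from the same script. A minimal sketch, assuming the do_thingX toolbox commands are on PATH and that task only needs the subject directory as its argument:

#!/usr/bin/env bash
raw_dir=/path/to/raw_data   # placeholder; point this at your data

task() {
  # the toolbox commands run one after the other for a single subject
  do_thing1 "$1"
  do_thing2 "$1"
  do_thing3 "$1"
}
export -f task   # make the function visible to the shells GNU Parallel starts

# one subject directory per job; parallel defaults to one job per core
find "${raw_dir}" -maxdepth 1 -name 'MRST*' | parallel task {}

GNU Parallel's env_parallel wrapper is an alternative to export -f if more functions and variables need to be passed along to the jobs.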