
I have a for loop in which a function task is called. Each call to the function returns a string that is appended to an array. I would like to parallelise this for loop. I tried using & but it does not seem to work.

Here is the code without parallelisation:

task () { sleep 1; echo "hello $1"; }
arr=()

for i in {1..3}; do
    arr+=("$(task $i)")
done

for i in "${arr[@]}"; do
    echo "$i x";
done

The output is:

hello 1 x
hello 2 x
hello 3 x

Great! But now, when I try to parallelise it with

[...]
for i in {1..3}; do
    arr+=("$(task $i)")&
done
wait
[...]

the output is empty.

UPDATE #1

Regarding the task function:

  • The function task takes some time to run and then outputs one string. After all the strings have been gathered, another for loop will loop through the strings and perform some other task.
  • The order does not matter. The output string can consist of a single line string, possibly with multiple words separated by a white space.
slm
BiBi

3 Answers


You can't send an assignment to the background: the background job runs in a fork of the shell (a subshell), so any changes it makes to a variable are never visible back in the main shell.
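A minimal demo of that (not from the question; the echoed message is just illustrative) — the backgrounded append happens in a separate process, so the parent's array stays empty:

```shell
#!/usr/bin/env bash
# The background job is a forked subshell: its append to arr
# modifies the child's copy, never the parent's.
arr=()
{ arr+=("from child"); } &   # runs in a subshell
wait
echo "parent sees ${#arr[@]} element(s)"   # prints: parent sees 0 element(s)
```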

But you could run a bunch of tasks in parallel, have them all write to a pipe, and then read whatever comes out. Or, better, use process substitution to avoid the issue of the commands in a pipeline being executed in a subshell (see Why is my variable local in one 'while read' loop, but not in another seemingly similar loop?).

As long as the outputs are single lines written atomically, they won't get intermixed, but might get reordered:

$ task() { sleep 1; echo "$1"; }
$ time while read -r line; do arr+=("$line"); done < <(for x in 1 2 3 ; do task "$x" & done)
real    0m1.006s
$ declare -p arr
declare -a arr=([0]="2" [1]="1" [2]="3")

The above runs all the tasks at the same time. There's also GNU parallel (and -P in GNU xargs), which is meant exactly for running tasks in parallel, and will run only a limited number at a time. parallel also buffers the output of each task, so you don't get intermixed data even if a task writes a line in parts.

$ mapfile -t arr < <(parallel -j4 bash ./task.sh ::: {a,b,c})
$ declare -p arr
declare -a arr=([0]="a" [1]="b" [2]="c")

(Bash's mapfile here reads the input lines into the array, similarly to the while read .. arr+=() loop above.)
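For comparison, a sketch of the same idea with xargs -P (the -P option is a GNU/BSD extension, not POSIX) instead of parallel. Note that xargs does not buffer the tasks' output, so as with the while-read version above, this is only safe when each task writes a single short line:

```shell
#!/usr/bin/env bash
# Run at most two tasks at a time with xargs -P; each input line
# becomes one task invocation. Output order may vary.
task() { sleep 0.2; echo "hello $1"; }
export -f task
mapfile -t arr < <(printf '%s\n' 1 2 3 | xargs -n1 -P2 bash -c 'task "$1"' _)
declare -p arr
```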

Running an external script as above is straightforward, but you can also have it run an exported function. Of course, each task then runs in an independent copy of the shell, so they all have their own copies of every variable, etc.

$ export -f task
$ mapfile -t arr < <(parallel task ::: {a,b,c})

The above example happened to keep a, b, and c in order, but that's a coincidence. Use parallel -k to make sure the outputs are kept in input order.

ilkkachu

A slightly naive but robust way of doing this that is portable across Bourne-like shells:

#!/bin/sh

task () {
    tid="$1"
    printf 'tid %d: Running...\n' "$tid"
    sleep "$(( RANDOM % 5 ))"   # RANDOM is a ksh/bash/zsh feature; in plain POSIX sh this sleeps 0
    printf 'tid %d: Done.\n' "$tid"
}

ntasks=10

tid=0
while [ "$tid" -ne "$ntasks" ]; do
    tid=$(( tid + 1 ))
    printf 'main: Starting task with tid=%d\n' "$tid"
    task "$tid" >"output.$tid" 2>&1 &
done

wait

tid=0
while [ "$tid" -ne "$ntasks" ]; do
    tid=$(( tid + 1 ))
    printf 'main: Processing output from task with tid=%d\n' "$tid"
    # do something with "output.$tid"
done

This spawns the tasks in the first loop, then waits for them all to finish before processing their output in the second loop. Since each task writes to its own file, this is suitable even when the tasks produce a lot of data.
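As a sketch of the second loop's "do something" step (using the same output.$tid naming as above; the file contents here are just stand-ins for finished tasks), the per-task files can be read back in tid order and then cleaned up:

```shell
#!/bin/sh
# Stand-in for the first loop: create the files the background
# tasks would have written.
ntasks=3
tid=0
while [ "$tid" -ne "$ntasks" ]; do
    tid=$(( tid + 1 ))
    echo "result $tid" > "output.$tid"
done

# Second loop: consume each task's output in tid order, then
# remove the temporary file.
tid=0
while [ "$tid" -ne "$ntasks" ]; do
    tid=$(( tid + 1 ))
    cat "output.$tid"
    rm -f "output.$tid"
done
```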

To limit the number of running tasks to at most 4, the initial loop may be changed to

tid=0
while [ "$tid" -ne "$ntasks" ]; do
    tid=$(( tid + 1 ))
    printf 'main: Starting task with tid=%d\n' "$tid"
    task "$tid" >"output.$tid" 2>&1 &

    if [ "$(( tid % 4 ))" -eq 0 ]; then
        wait
    fi
done
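Note that this batched wait stalls until the slowest task of each group of 4 has finished. If Bash 4.3+ is acceptable (giving up sh portability), a variant using wait -n keeps 4 tasks running continuously, starting a new one as soon as any slot frees up — a sketch, not part of the original portable answer:

```shell
#!/usr/bin/env bash
# Keep at most 4 tasks running; `wait -n` returns as soon as
# any one background job exits, so a replacement can start
# immediately instead of waiting for the whole batch.
task() { sleep 0.1; }

ntasks=10
running=0
tid=0
while [ "$tid" -ne "$ntasks" ]; do
    tid=$(( tid + 1 ))
    task "$tid" &
    running=$(( running + 1 ))
    if [ "$running" -ge 4 ]; then
        wait -n                      # reap one finished job
        running=$(( running - 1 ))
    fi
done
wait    # reap the remaining jobs
echo "all $ntasks tasks finished"
```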
Kusalananda

You are looking for parset (part of GNU Parallel since 20170422) or env_parset (available since 20171222):

# If you have not run:
#    env_parallel --install
# and logged in again, then you can instead run this to activate (env_)parset:
. `which env_parallel.bash`

task (){
  echo "hello $1"
  sleep 1.$1
  perl -e 'print "binary\001\002\n"'
  sleep 1.$1
  echo output of parallel jobs do not mix
}
env_parset arr task ::: {1..3}
env_parset a,b,c task ::: {1..3}

echo "${arr[1]}" | xxd
echo "$b" | xxd

parset is supported in Bash/Ksh/Zsh (including arrays), and in ash/dash (without arrays).

Ole Tange