
I have a Bash script with a loop that enters a directory, runs a certain calculation on a specific input file, then leaves the directory and does the same thing in another directory with a different input file. The calculation takes a lot of time and I would like to parallelize it.

How can I modify my script? Is there an option to do that?

myscript.sh

cd MainDir
for dir in *
do
    cd ${dir}
    LD_LIBRARY_PATH="$software"/ "$software"/calc -i /home/files/"$dir.txt" -l /home/Str/Art.pdb -a 5.0 -rf /home/file/prot -cpu 1 opt -w ${dir}_res > ${dir}_WPA.log
    cd ..
done

I use the -cpu option to indicate how many CPUs to use. I have many CPUs available, so how can I run several jobs in parallel?

e.g. if I have three different input files, I would like to run the following commands together, each in a different directory:

cd 1
LD_LIBRARY_PATH="$software"/ "$software"/calc -i /home/files/1.txt -l /home/Str/Art.pdb -a 5.0 -rf /home/file/prot -cpu 1 opt -w 1_res > 1_WPA.log
-----------
cd 2
LD_LIBRARY_PATH="$software"/ "$software"/calc -i /home/files/2.txt -l /home/Str/Art.pdb -a 5.0 -rf /home/file/prot -cpu 1 opt -w 2_res > 2_WPA.log
-----------
cd 3
LD_LIBRARY_PATH="$software"/ "$software"/calc -i /home/files/3.txt -l /home/Str/Art.pdb -a 5.0 -rf /home/file/prot -cpu 1 opt -w 3_res > 3_WPA.log

Could someone help me please? Thanks.

Tommaso
  • Why not just increase the number of CPUs (threads) allocated to the job? Be wary of spinning up a lot of different jobs working on different files as this might actually slow things down. – Philip Couling May 04 '22 at 16:45
  • @PhilipCouling How do you know the tool they are using is multi-threaded with a configurable number of worker threads? – Kusalananda May 04 '22 at 17:00
  • 1
    @Kusalananda did I misread the bit about -cpu 1? – Philip Couling May 04 '22 at 17:14
  • @PhilipCouling No, you did not misread that. Do you know what software they are using and what impact changing that option-argument would have? For all I know, that option may lock the process to the specified CPU, or it could have some completely other meaning that is specific to the problem domain. I've tried searching for software that has options matching the ones used in the question, and all I can say is that it may have something to do with protein structures. – Kusalananda May 04 '22 at 17:44
  • @Kusalananda I can't be sure of course, only that it looks like a cpu count... at least that's the way I read it. Just my experience has been that where a job offers its own threading mechanism, it's often better to use it. – Philip Couling May 04 '22 at 18:06
  • @PhilipCouling Well, I wouldn't argue with that. It's just that we know nothing about the user's tool here. – Kusalananda May 04 '22 at 18:14
  • your for loop should be for dir in */ to ensure that it only matches directories, not any regular files that might be in your MainDir (and yes, you might be certain that there aren't any regular files in there at this moment in time...but you should still program defensively because things can change and that might not always be true). Also, curly braces are not a substitute for quotes. See $VAR vs ${VAR} and to quote or not to quote – cas May 05 '22 at 02:56

3 Answers


You can add an & at the end of the command to send it to the background:

for i in 1 2 3 4; do
    (
        cd "$i" || exit
        command
        [...]
    ) &
done
wait # pause until all background processes are terminated
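Note that a bare `&` backgrounds every directory at once, which can oversubscribe the machine if there are many directories. A minimal sketch (not part of the original answer) that caps the number of concurrent jobs, assuming bash 4.3 or later for `wait -n`; the `demo` directories and the `sleep` are placeholders for the real per-directory calculation:

```shell
#!/usr/bin/env bash
# Sketch: background each per-directory job, but never run more than
# $max_jobs at once. Requires bash >= 4.3 for `wait -n`.
max_jobs=4
mkdir -p demo/{1,2,3,4,5,6}   # placeholder directories for this example
cd demo || exit 1

for dir in */; do
    (
        cd "$dir" || exit
        sleep 0.2            # placeholder for the real calculation
        touch done.marker    # proof that this directory was processed
    ) &
    # Throttle: while $max_jobs jobs are still running, wait for one to end.
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n
    done
done
wait   # block until the remaining background jobs finish
```

With six directories and `max_jobs=4`, the loop starts four jobs immediately and feeds in the remaining two as slots free up.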
DopeGhoti

With zsh:

do-calc() (
  cd -- $1 &&
    LD_LIBRARY_PATH=$software/ $software/calc \
      -i /home/files/$1.txt \
      -l /home/Str/Art.pdb \
      -a 5.0 \
      -rf /home/file/prot -cpu 1 opt -w ${1}_res > ${1}_WPA.log
)

autoload zargs
cd MainDir && zargs -rn1 -P12 -- ./*(N-/) -- do-calc

To run up to 12 of those do-calc functions in parallel.

With any Korn-like shell with process substitution support (such as bash -- the GNU shell) and with GNU utilities, you could do something similar with:

export software
cd MainDir &&
  xargs -0rn1 -P12 -a <(
      LC_ALL=C find . -maxdepth 1 ! -name '.*' -xtype d -print0 |
        sort -z
    ) sh -c '
      cd -- "$1" &&
        LD_LIBRARY_PATH="$software/" "$software/calc" \
          -i "/home/files/$1.txt" \
          -l /home/Str/Art.pdb \
          -a 5.0 \
          -rf /home/file/prot -cpu 1 opt -w "${1}_res" > "${1}_WPA.log"
      ' sh

With GNU Parallel you would do something like:

doit() {
      dir="$1"
      cd ${dir}
      LD_LIBRARY_PATH="$software"/ "$software"/calc -i /home/files/"$dir.txt" -l /home/Str/Art.pdb -a 5.0 -rf /home/file/prot -cpu 1 opt -w ${dir}_res > ${dir}_WPA.log
}
export -f doit

cd MainDir
parallel doit ::: *

This will run one job per CPU thread. If you do not like that, you can adjust it to run 13 jobs in parallel with:

parallel -j13 doit ::: *

If you have directories with names like ::: you need to do something like:

LC_ALL=C find . -maxdepth 1 ! -name '.*' -xtype d -print0 |
  parallel -0 doit

Or:

parallel --argsep /// -j13 doit /// *
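Not in the original answer, but if you also want a per-job record (start time, runtime, exit status), GNU parallel's --joblog option can be added to any of the invocations above (calc_jobs.log is a hypothetical filename):

```shell
# writes one header line plus one line per finished job to calc_jobs.log
parallel -j13 --joblog calc_jobs.log doit ::: */
```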
Ole Tange
  • 1
    It seems like things like parallel doit ::: * don't work properly if some files in the current directory are called ::: or ::::... That can get nasty. After touch ::::; ln -s /etc/shadow ., parallel echo ::: * outputs the contents of /etc/shadow (here with GNU parallel 20210822) – Stéphane Chazelas May 09 '22 at 16:17