1

Currently I have the following script for running the HaplotypeCaller program on my Unix system, in a reproducible environment I created:

#!/bin/bash
#parallel call SNPs with chromosomes by GATK
for i in 1 2 3 4 5 6 7
do
  for o in A B D
  do
    for u in _part1 _part2
    do 
      (gatk HaplotypeCaller \
        -R /storage/ppl/wentao/GATK_R_index/genome.fa \
        -I GATK/MarkDuplicates/ApproachBsortedstettler.bam \
        -L chr$i$o$u \
        -O GATK/HaplotypeCaller/HaploSample.chr$i$o$u.raw.vcf &)
    done
  done
done

gatk HaplotypeCaller \
    -R /storage/ppl/wentao/GATK_R_index/genome.fa \
    -I GATK/MarkDuplicates/ApproachBsortedstettler.bam \
    -L chrUn \
    -O GATK/HaplotypeCaller/HaploSample.chrUn.raw.vcf&

How can I change this piece of code to run in parallel, at least partially? Is it worth doing? I am trying to incorporate this whole script into a different script, which you can see in a different question here; should I? Will I get much of a performance boost?

The69Er

4 Answers

1

I don’t have parallel, and I don’t really understand what the guts of your script are doing, so I can’t test this.  But I believe that this will work, and it may be in the style you were looking for.

Rewrite the script to remove the loops, and to take arguments:

#!/bin/bash
#parallel call SNPs with chromosomes by GATK
# To be safe, verify that exactly 3 arguments were given.
if [ "$#" -ne 3 ]; then
    echo "usage: $0 i o u" >&2
    exit 1
fi
i="$1"
o="$2"
u="$3"
# (if you want, verify that the arguments are valid)
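# No trailing "&" here: parallel does the backgrounding, and it can only
# limit the number of jobs if each invocation stays in the foreground.
gatk HaplotypeCaller \
          ︙       \
    -L "chr$i$o$u" \
    -O "GATK/HaplotypeCaller/HaploSample.chr$i$o$u.raw.vcf"

# The chrUn call does not depend on the arguments.  Run it once, separately;
# left in this script, it would be started again on every invocation:
# gatk HaplotypeCaller \
#           ︙       \
#     -L chrUn -O GATK/HaplotypeCaller/HaploSample.chrUn.raw.vcf &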
Then run it like this:
printf '%s\n' {1,2,3,4,5,6,7}' '{A,B,D}' '_part{1,2} | parallel -L1 (your_script)
Let me walk you through this:
  • {1,2,3} expands as three words: 1, 2 and 3.
  • {1,2,3} {A,B} expands as five words: 1, 2, 3, A and B.
  • {1,2,3}{A,B} expands as six words: 1A, 1B, 2A, 2B, 3A and 3B.
  • {1,2,3}' '{A,B} expands as six words: 1 A, 1 B, 2 A, 2 B, 3 A and 3 B.  Note that these “words” include spaces.
  • {1,2,3,4,5,6,7}' '{A,B,D}' '_part{1,2} expands as 42 (7×3×2) words, each of which includes two spaces.
  • printf '%s\n' outputs each “word” on a separate line.  But remember, we’re talking about “words” with spaces in them.  The effect is that it prints two or three regular (non-whitespace) words per line.  For example,
    $ printf '%s\n' {1,2,3}' '{A,B}
    1 A
    1 B
    2 A
    2 B
    3 A
    3 B
    

    At this point, these are ordinary spaces; they are no longer quoted.

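  • -L1 tells parallel to run your program with the data from one line.  The intent is that the line is broken apart at the spaces, giving a set of three arguments.  (Untested, as noted above: if your parallel hands the script the whole line as a single argument instead, GNU parallel's --colsep ' ' option splits each line into columns that you can pass as {1} {2} {3}.)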
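Before piping the list into parallel, it may help to check that the expansion really produces what you expect. This is a minimal sketch using only bash and standard tools; the variable name combos is purely illustrative, and nothing here calls gatk or parallel.

```shell
#!/bin/bash
# Brace expansion is a bash feature (not POSIX sh), so run this with bash.
# Generate one "i o u" triple per line, as in the printf command above.
combos=$(printf '%s\n' {1,2,3,4,5,6,7}' '{A,B,D}' '_part{1,2})

# Count the lines: 7 x 3 x 2 combinations are expected.
printf '%s\n' "$combos" | wc -l

# Split the first line at its spaces, the way a consumer of the list would,
# and rebuild the region name from the three pieces.
read -r i o u <<< "$(printf '%s\n' "$combos" | head -n 1)"
echo "chr$i$o$u"
```

If the count or the first region name looks wrong, the problem is in the quoting of the brace expression rather than in parallel.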
1
parallel echo HaploSample.chr{1}{2}{3}.raw.vcf ::: 1 2 3 4 5 6 7 ::: A B D ::: _part1 _part2
1
#!/bin/bash
#parallel call SNPs with chromosomes by GATK

gatk_a_single() {
    i="$1"
    o="$2"
    u="$3"

    gatk HaplotypeCaller \
        -R /storage/ppl/wentao/GATK_R_index/genome.fa \
        -I GATK/MarkDuplicates/ApproachBsortedstettler.bam \
        -L chr$i$o$u \
        -O GATK/HaplotypeCaller/HaploSample.chr$i$o$u.raw.vcf
}
export -f gatk_a_single

# This call does not fit the pattern of the others, so just run that in the background
gatk_a_single Un "" "" &    

# Use 3 input sources: All combinations between all input sources will be generated and run
parallel gatk_a_single ::: 1 2 3 4 5 6 7 ::: A B D ::: _part1 _part2

# The parallel tasks are now complete
# Wait for the earlier backgrounded task to complete
wait
Ole Tange
0

As a first approximation, parallelizing makes sense when the number of jobs is <= the number of cores. Do you have 42 cores available? If not, then running all the jobs at once might not make sense.
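If you have fewer cores than jobs, one option is to cap how many run at once. The following is only a sketch under some assumptions: it needs bash 4.3 or later for wait -n, it uses nproc from GNU coreutils to count cores (falling back to 4), and run_one is a hypothetical stand-in that just prints the region instead of calling gatk.

```shell
#!/bin/bash
# Hypothetical stand-in for the real gatk call: it only prints the region.
run_one() {
    echo "would call region chr$1$2$3"
}

# Jobs allowed at once: nproc is from GNU coreutils; fall back to 4 without it.
max_jobs=$(nproc 2>/dev/null || echo 4)

run_throttled() {
    local i o u running=0
    for i in 1 2 3 4 5 6 7; do
        for o in A B D; do
            for u in _part1 _part2; do
                # At the limit: block until one background job exits.
                if [ "$running" -ge "$max_jobs" ]; then
                    wait -n
                    running=$((running - 1))
                fi
                run_one "$i" "$o" "$u" &
                running=$((running + 1))
            done
        done
    done
    wait    # collect the jobs that are still running
}

run_throttled
```

To use it for real, the body of run_one would be replaced by the gatk command from the question. Whether one job per core is the right limit depends on how many threads and how much memory each job needs, which this sketch does not try to measure.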

Here's the "naive" way to parallelize your jobs:

1) Don't RUN the commands, WRITE them:

for i in 1 2 3 4 5 6 7
do
  {
    for o in A B D
    do
      for u in _part1 _part2
      do
        echo "gatk HaplotypeCaller \
          -R /storage/ppl/wentao/GATK_R_index/genome.fa \
          -I GATK/MarkDuplicates/ApproachBsortedstettler.bam \
          -L chr$i$o$u \
          -O GATK/HaplotypeCaller/HaploSample.chr$i$o$u.raw.vcf &"
      done
    done
    # Every command above ends in "&", so finish each file with "wait";
    # otherwise the script returns before its 6 jobs are done.
    echo "wait"
  } > "file-$i.sh"
done

Now you've got 7 text files of 6 commands each. You can parallelize your jobs in batches of 6 by running those 7 scripts one after the other. If you've got enough cores (16?), you can run two batches of 6 at a time.

Lather, rinse, repeat.
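Running the batches one after the other could look something like this. It is a sketch, not a tested pipeline: run_batches is a name made up for illustration, and it sources each file so that the background jobs belong to the current shell and a single wait covers the whole batch.

```shell
#!/bin/bash
# Run each generated batch file in turn. The commands in a batch file end
# in "&", so sourcing the file starts its jobs in the background of this
# shell; "wait" then blocks until every job of that batch has finished.
run_batches() {
    local f
    for f in "$@"; do
        echo "starting batch $f"
        . "$f"
        wait
    done
}
```

With the files written by the loop above, the call would be run_batches file-1.sh file-2.sh and so on up to file-7.sh. Note that wait also waits for any other background jobs already running in the same shell.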

Jim L.
  • I have 96 real cores, so I guess I should adapt it to that number. But can GNU parallel make that better or faster? – The69Er Jul 11 '19 at 23:48