1

I am building the pipeline for a set of data and what I have for the main part is something like this

#! /bin/bash

time bwa mem -o bwa/mem/Stettler -M -t 96 -R "@RG\tID:Test\tSM:Stettler\tLB:TestLib\tPL:ILLUMINA" /storage/ppl/wentao/bwa_Index/genome.fa $1 $2
wait
echo "finished mem"
samtools view -Sb -@ 96 -o samtools/Stettler.bam bwa/mem/Stettler
wait
echo  "got stettler"
wait
time samtools sort -@ 96 -O bam -o samtools/sort/approachAsortedstettler.bam samtools/Stettler.bam
wait
echo "sorted"

time samtools index samtools/sort/approachAsortedstettler.bam
wait
echo "finished indexing"

time gatk MarkDuplicates -I samtools/sort/approachAsortedstettler.bam -O GATK/MarkDuplicates/ApproachAsortedstettler.bam -M GATK/MarkDuplicates/metrics/ApproachB
wait
echo "Marked Duplicates"
time samtools index GATK/MarkDuplicates/ApproachAsortedstettler.bam
wait
echo "indexed again ++++++++++++++++++++++++++++++++++++++++"
time bash scripts/Parallelhaplo.sh
wait
echo "Parallelhaplo"

time bash scripts/MergerHAplo.sh
wait
echo "merged"
time vcftools --vcf GATK/MergedSample_gather.raw.vcf --min-meanDP  $3 --recode --out vcftools/MergedGATKdp2.vcf
wait
echo "deep checked"
time gatk IndexFeatureFile --feature-file vcftools/MergedGATKdp2.vcf.recode.vcf
wait
echo "IFF"
time gatk SelectVariants -R /storage/ppl/wentao/GATK_R_index/genome.fa --variant vcftools/MergedGATKdp2.vcf.recode.vcf --concordance vcftools/Mergedmpileupdp2.vcf.recode.vcf -O GATK/SelectVariants/Common$
wait
echo "finished"

and the process called parallel Haplo looks like this

#!/bin/bash
#parallel call SNPs with chromosomes by GATK

for i in 1 2 3 4 5 6 7;do for o in A B D;do for u in _part1 _part2;do (gatk
 HaplotypeCaller -R /storage/ppl/wentao/GATK_R_index/genome.fa -I 
GATK/MarkDuplicates/ApproachAsortedstettler.bam -L chr$i$o$u -O 
GATK/HaplotypeCaller/HaploSample.chr$i$o$u.raw.vcf &);done;done ; done 

gatk HaplotypeCaller -R /storage/ppl/wentao/GATK_R_index/genome.fa -I 
GATK/MarkDuplicates/ApproachBsortedstettler.bam -L chrUn -O 
GATK/HaplotypeCaller/HaploSample.chrUn.raw.vcf&

wait

echo "parallel call finished"

wait

However when I then execute the script what usually happens is that ParallelHaplo is started but for some reason the wait on any of the two scripts doesn't wait for it to finish so it goes to the next step and since the next step cant find the files I just get errors. What can I then do?

terdon
  • 242,166
The69Er
  • 87
  • Does it work as expected if you have the & outside the subshell parentheses? So instead of do (gatk ... & ); done in ParallelHaplo, try do (gatk ...) & done. – terdon Jul 04 '19 at 17:44
  • I will try to thanks – The69Er Jul 04 '19 at 17:53
  • I just tested using a sleep command and can reproduce your issue when using ( sleep & ) and can confirm it works as expected when using (sleep) &. By the way, you may be interested in our sister site: [bioinformatics.se]. – terdon Jul 04 '19 at 17:58

1 Answers1

1

The problem is that you are sending the gatk process to the background inside a subshell: ( gatk ... & ). The backgrounded process is noa child of that subshell and not of your script's shell, so wait doesn't see it and doesn't wait for it. From help wait:

wait: wait [-fn] [id ...]
    Wait for job completion and return exit status.

    Waits for each process identified by an ID, which may be a process ID or a
    job specification, and reports its termination status.  If ID is not
    given, waits for all currently active child processes, and the return
    status is zero.  If ID is a job specification, waits for all processes
    in that job's pipeline.

If you change that to background the entire subshell instead ( i.e. ( gatk ... ) &or, better yet, not use a subshell at all, since it isn't doing anything useful here, it will work as expected:

for i in 1 2 3 4 5 6 7; do
  for o in A B D; do
    for u in _part1 _part2; do
      gatk HaplotypeCaller \
           -R /storage/ppl/wentao/GATK_R_index/genome.fa \
           -I GATK/MarkDuplicates/ApproachAsortedstettler.bam \
           -L chr$i$o$u \
           -O GATK/HaplotypeCaller/HaploSample.chr$i$o$u.raw.vcf &
    done
  done
done 
terdon
  • 242,166
  • No, a background command started in a subshell (eg. (sleep 3600 &) will not be a child process of the main shell -- it will be first a child of the subshell (and a grandchild of the main shell) and, after the subshell terminated (ie when a (foo &) command returned), it will be reparented to pid 1 (init) and will run in orphaned process group, no longer under the control of the shell. wait only waits for direct children. –  Jul 04 '19 at 19:05
  • @mosvy yes that's what it looks like. But I tested with sleep in a subshell and pstree showed it as a child of the script. Do you know of any other way of demonstrating this? – terdon Jul 04 '19 at 19:07
  • Demonstrate what? (sleep 90 &); pstree $$ will not show the sleep process as a child of your shell (all this assumes a shell like dash or bash, where (...) are actual subprocesses/subshells, not optimized away as in ksh93). –  Jul 04 '19 at 19:12
  • @mosvy ah, you're quite right! I was testing with ( sleep 30 ) & like an idiot. Thanks. – terdon Jul 04 '19 at 19:19