4

I'm trying to use tee in a for loop as such:

for ea in $(ls *bam)
  do samtools mpileup -f $ref $ea | \
  tee \
  >(java -jar $varscan2 mpileup2indel --output-vcf 1 > vcf/"$(echo $ea | sed s/.bam//)"_mpileup2indel.vcf) \
  >(java -jar $varscan2 mpileup2snp --output-vcf 1 > vcf/"$(echo $ea | sed s/.bam//)"_mpileup2snp.vcf) | \
  tail -n 5
done;

I.e. using the output of samtools mpileup, and piping that into two separate commands. I added the tail -n 5 to prevent the output of samtools mpileup from being printed in its entirety to stdout (I want full output to be used as input to java -jar varscan, however).

This seems to work initially, but the command doesn't seem to complete (file size for each output is smaller than if command was done without tee).

Eventually I get an error that the two java -jar $varscan commands are waiting for an input that never arrives (prior to getting a chance to start the second iteration of the loop).

Is this the best way to accomplish what I'm after, i.e. using the output of the first command in two separate commands (ideally, not recording/printing the output of the first command at all)? Is tee incompatible with for loops?

Thanks in advance.

  • 1
    Besides other problems that may occur, the tail should be piped right after the command you want to tail, not at the end of the pipe line. Also look at Why not parse ls (and what to do instead)?. – schrodingerscatcuriosity Oct 22 '21 at 17:59
  • Sorry, I don't follow. Could you write the code as it should be? And, I also don't follow why you're referring me to the 'ls parsing discussion'. Thanks. – Rebecca Eliscu Oct 22 '21 at 18:04
  • For completeness, which shell are you using? Also, do you get one error, from one of the two java -jar $varscan processes, or do you get an error from each of those two processes (i.e. two error messages)? It would probably be useful if you could edit the actual error messages into your question. – fra-san Oct 22 '21 at 18:05
  • 2
    With for ea in $(ls *bam) you are parsing the output of the ls command, which for various reasons detailed in the link is considered bad practice. – schrodingerscatcuriosity Oct 22 '21 at 18:06
  • @fra-san I'm running again because I didn't record the error! I'm using bash, btw. Will update when running is complete. – Rebecca Eliscu Oct 22 '21 at 18:21
  • 1
    @schrodigerscatcuriosity I should clarify that I want the output of samtools mpileup to be piped in its entirety to my tee commands, but not to the stdout. I'm not sure how else to accomplish this. – Rebecca Eliscu Oct 22 '21 at 18:25
  • 2
    As I was saying in a comment which I deleted because it was mis-diagnosing the cause: try using the -p option of tee: | tee -p .... –  Oct 22 '21 at 18:30
  • @CocaineMitch Trying right now! Will report back if this works. – Rebecca Eliscu Oct 22 '21 at 19:16
  • 1
    I would expect the problem to occur when piping to head, but not when to tail. In general anything that exits before EOF and causes tee to get SIGPIPE will make it not send the whole stream to files specified as operands (process substitutions in your case). Is tail -n 5 in your actual code? If not then maybe the actual command exits early. tee -p seems a good idea. If -p is not supported then try tee … | { actual_command; cat >/dev/null; }, so even after actual_command exits, cat reads the rest of the stream from tee and tee never gets SIGPIPE. – Kamil Maciorowski Oct 23 '21 at 11:14

1 Answers1

3
  1. quote your variables
  2. don't parse ls
  3. optional-but-recommended: simplify your script and don't repeat yourself. You generate the basename twice with sed and append a different suffix to it each time - it would be better to generate it once - this would reduce the risk of bugs and improve readability (and very slightly improve performance - it's "cheaper" to do something once and re-use the result than it is to do exactly the same operation twice or more).
  4. readability (i.e. the ability to read and understand a program you've written) is one of, if not the most important things when writing code....so, any time that performance is not absolutely critical it is better to prioritise writing your code in a way that makes it easier to understand. This may mean inserting more line breaks or indentation, or breaking down long and complicated commands into shorter, simpler commands. This will help write and debug the script NOW, and also help you understand it when you need to revisit it in X months (or years) time.
for ea in *.bam; do
  bn="$(basename "$ea" .bam)"
  samtools mpileup -f "$ref" "$ea" |
    tee \
      >(java -jar "$varscan2" mpileup2indel --output-vcf 1 > "vcf/${bn}_mpileup2indel.vcf") \
      >(java -jar "$varscan2" mpileup2snp --output-vcf 1 > "vcf/${bn}_mpileup2snp.vcf") |
    tail -n 5
done

Note the varying indentation levels. e.g. tee is indented slightly from samtools, then tee's args are indented from tee, and then the tail is back at the same indentation level as tee. This all helps to understand which args belong to which program, and where in the pipeline (or loop, etc) you are as you're reading it.

BTW, backslashes to continue a line are optional after a pipe character.

or even:

outdir="vcf"

for ea in *.bam; do bn="$(basename "$ea" .bam)" indel="$outdir/${bn}_mpileup2indel.vcf" snp="$outdir/${bn}_mpileup2snp.vcf"

samtools mpileup -f "$ref" "$ea" | tee
>(java -jar "$varscan2" mpileup2indel --output-vcf 1 > "$indel")
>(java -jar "$varscan2" ​mpileup2snp --output-vcf 1 > "$snp") | ​ tail -n 5 done

cas
  • 78,579
  • BTW, as @zevzek implies in a comment, you may want a wait statement after the tail (or a while pgrep -f "$varscan2" >& /dev/null; do sleep 1; done loop) - this would ensure that you don't have dozens or hundreds of java processes all running in the background at the same time (two for each .bam file). Only the two processes started by the process substitutions would be running during each pass through the loop. – cas Oct 23 '21 at 02:19