1

I'm currently working with HISAT2 and am trying to use xargs to make my life easier when inputting multiple samples.

So I have a text file "samples.txt" which has each sample name separated by spaces-

ERR199044 ERR188104 ERR188234 ERR188245 ...

My current command line input looks like this-

> cat ./samples.txt| xargs -I {} sh -c "./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran -1 ./samples/{}_chrX_1.fastq.gz -2 ./samples/{}_chrX_2.fastq.gz -S ./map/{}_chrX.sam"

The command line output I'm going for should be formatted like so for each input sample name:

./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran -1 ./samples/ERR199044_chrX_1.fastq.gz -2 ./samples/ERR199044_chrX_2.fastq.gz -S ./map/ERR199044_chrX.sam

I'm getting an error message that says-

Warning: Same mate file "./chrX_data/samples/ERR199044" appears as argument to both -1 and -2
Extra parameter(s) specified: "ERR188104_chrX_1.fastq.gz", "ERR188104_chrX_2.fastq.gz", "ERR188104_chrX.sam"
Note that if <mates> files are specified using -1/-2, a <singles> file cannot
also be specified.  Please run bowtie separately for mates and singles.
Error: Encountered internal HISAT2 exception (#1)
Command: /mnt/c/Alex/Lab_Files/RNAseq/tools/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -p 8 --dta -x ./chrX_data/indexes/chrX_tran -S ./map/ERR199044 -1 ./chrX_data/samples/ERR199044 -2 ./chrX_data/samples/ERR199044 ERR188104_chrX_1.fastq.gz ERR188104_chrX_2.fastq.gz ERR188104_chrX.sam
(ERR): hisat2-align exited with value 1

So the issue looks to be that everything after "{}" is ignored, making two distinct files look to be the same, and HISAT2 stops working.

What I'm unclear about is what the difference is between "mates" and "singles", and if there's any way to resolve this issue so I can input the same sample name and have Unix understand that it specifies multiple different files with that name in it?

Thank you!

Kusalananda
  • 333,661
andyhow
  • 13
  • 1
    Please [edit] your question to show the desired assembled command line(s) from xargs, so we can see what you are trying to achieve. Possibly the biggest issue is that -I {} implies -L 1 which is likely not what you want if your input is space-separated – steeldriver Sep 05 '19 at 00:48
  • Just added it! Thanks for the help, that certainly might be an issue I have to look into. – andyhow Sep 05 '19 at 02:45

1 Answers1

1

xargs version:

To do this, you need to split each line of the input file on white-space (which is xargs' default behaviour), and then get xargs to run the sh script once for each word, using -n 1. Alternatively, you could get the shell script to loop over "$@".

You can't use -I {} here because that causes xargs to read the input file one line at a time. Even if you set the delimiter to a space with -d ' ', you'll get errors at the end of every input line, when it begins reading the next input line.

Fortunately, you don't need to use -I {} at all - you're already just appending echo word of the input to the end of the sh command line as a separate argument, which is xargs' default behaviour without -I.

In the shell script, you refer to the argument by its positional parameter (i.e. as $1), as you would in any shell script. You can use $1 as often as you like within the shell script.

You also need to provide argument 0 (an arbitrary name for the sh process - the string "sh" is convenient - but you can use any name if you want it to be easy to find in ps) for the sh -c '...script...' command. It has to be the first argument after the '...script...' argument, and will be $0 inside the shell script.

So, you need to do something like this (without a for loop):

xargs -n 1 sh -c \
  './hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran \
    -1 "./samples/$1_chrX_1.fastq.gz" -2 "./samples/$1_chrX_2.fastq.gz" \
    -S "./map/$1_chrX.sam"' sh < ./samples.txt

or (with a for loop):

xargs sh -c '
  for f in "$@"; do \
    ./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran \
      -1 "./samples/${f}_chrX_1.fastq.gz" -2 "./samples/${f}_chrX_2.fastq.gz" \
      -S "./map/${f}_chrX.sam"
  done' sh < ./samples.txt

The for loop version will be faster because it doesn't need to execute sh multiple times (once for each "word").

It runs sh as few times as possible, with the limit being the shell's maximum command-line length (about 2MB on modern systems). Unless samples.txt is extremely large (over 200,000 entries) that means it will run sh only once.

shell while-read loop:

xargs isn't needed for a job like this. The following will work in bash, and probably other Bourne-like shells that support arrays and the -a option to read.

while read -a words; do
  for f in "${words[@]}"; do 
    ./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran \
      -1 "./samples/${f}_chrX_1.fastq.gz" \
      -2 "./samples/${f}_chrX_2.fastq.gz" \
      -S "./map/${f}_chrX.sam"
  done
done < samples.txt

This reads each word of each input line into a bash array, then (using a for loop iterating over each word in the array), runs the hisat2 program with the appropriate args.

However, see: Why is using a shell loop to process text considered bad practice?

awk version:

awk '{
  for (i=1;i<=NF;i++) {
    printf "./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran -1 \"./samples/%s_chrX_1.fastq.gz\" -2 \"./samples/%s_chrX_2.fastq.gz\" -S \"./map/%s_chrX.sam\"\n", $i, $i,$i;
  }
}' ./samples.txt | sh

Note that this pipes the output of awk in to sh to be excecuted. Without that pipe into sh, the output looks like:

./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran -1 "./samples/ERR199044_chrX_1.fastq.gz" -2 "./samples/ERR199044_chrX_2.fastq.gz" -S "./map/ERR199044_chrX.sam"
./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran -1 "./samples/ERR188104_chrX_1.fastq.gz" -2 "./samples/ERR188104_chrX_2.fastq.gz" -S "./map/ERR188104_chrX.sam"
./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran -1 "./samples/ERR188234_chrX_1.fastq.gz" -2 "./samples/ERR188234_chrX_2.fastq.gz" -S "./map/ERR188234_chrX.sam"
./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran -1 "./samples/ERR188245_chrX_1.fastq.gz" -2 "./samples/ERR188245_chrX_2.fastq.gz" -S "./map/ERR188245_chrX.sam"

This will also run quickly, as the sh script executes each line as the awk script prints it.

Alternatively, you could use sprintf to put the command line into a variable, rather than printf to stdout. Then you could use awk's system() function to execute it directly, similar to the perl example below:

perl version:

perl -lane '
  foreach $f (@F) {
    system(qw(./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran),
      -1, "./samples/${f}_chrX_1.fastq.gz",
      -2, "./samples/${f}_chrX_2.fastq.gz",
      "-S", "./map/${f}_chrX.sam");
  };' ./samples.txt

This uses perl's system() function so it executes the commands directly, without needing to be piped into sh.

For a test-run, add the word echo and a space immediately after the qw( quote-operator.

BTW, it would be easier to add code to check whether the files existed, and/or test whether each run of ./hisat2-2.1.0/hisat2 was successful or post-process the output in this perl version than in the shell or awk versions - especially if it was written as a stand-alone script rather than a one-liner. For example:

#!/usr/bin/perl -w

use strict;

while(<>) {
  foreach my $f (split) {
    my $f1 = "./samples/${f}_chrX_1.fastq.gz";
    my $f2 = "./samples/${f}_chrX_2.fastq.gz";
    my $sam = "./map/${f}_chrX.sam";

    if (!(-r $f1 && -r $f2 && -r $sam)) {
      warn "Missing or unreadable file for $f\n";
      next
    };

    my $rc = system(
        qw(echo ./hisat2-2.1.0/hisat2 -p 8 --dta -x ./indexes/chrX_tran),
        -1, $f1, -2, $f2, '-S', $sam
    );

    if ($rc) {
      warn "hisat2 returned non-zero exit code for $f: $rc\n";
    };
  }
}
cas
  • 78,579
  • @StéphaneChazelas The OP's input file contains multiple filename prefixes on each input line, word-splitting is required: I have a text file "samples.txt" which has each sample name separated by spaces. Also, I've edited my answer to more accurately describe xargs usage. – cas Sep 05 '19 at 07:00
  • BTW, all examples given in my answer will work for space and/or newline separated input. i.e. one word per line, and/or multiple space-separated words per line, whether one line or many. – cas Sep 05 '19 at 07:08
  • Ah yes, I missed that sorry. Which explains their error. -I can't be used as -I takes one line at a time. – Stéphane Chazelas Sep 05 '19 at 07:11
  • Like the original embedding of {} in the sh code, that awk approach is dangerous though (think of inputs that contain $(reboot) for instance. You'd want to do proper shell quoting to be foolproof (use single quotes and convert all single quotes to '\''). – Stéphane Chazelas Sep 05 '19 at 09:30
  • Note that it's not limit being the shell's maximum command-line length but a limit of the execve() system call (which generally is on the size of the arguments and environment variables, including strings and pointers). On most systems it can be much smaller than 2MB. Linux also has a limit on the size of a single argument/envvar. – Stéphane Chazelas Sep 05 '19 at 09:32
  • Yeah, that's kind of unavoidable with anything that gets a shell to run user-provided input. I'm assuming that the user isn't going to do anything malicious or unfortunate in their own data file. The perl versions are safer because they don't invoke a shell. – cas Sep 05 '19 at 09:34
  • @StéphaneChazelas i'm not trying to provide a perfect definition of every tangentially related concept, just some rough near-enough-is-good-enough practical advice without obscuring the actual answer by adding an extra paragraph or three for each concept that is technically out-of-scope for this Q. – cas Sep 05 '19 at 09:38
  • I cant tell y'all how helpful this is! Obviously I still have a lot to learn so getting an answer this thorough is greatly appreciated. Thanks so much! – andyhow Sep 05 '19 at 16:06
  • @andyhow I'll add another variation. this time a shell while read loop. – cas Sep 06 '19 at 00:22