2

I have a folder with almost 100 files, organized in groups of 16 files each. I need to concatenate each of the 16 files of each group into a single file. For example, one group of file names is:

randomString_$groupName- 

I have a folder with almost 100 samples. The samples were run on the Nextseq500 and are single stranded. Each sample is run on 4 flowcells, and each Nextseq500 flowcell has 4 lanes, so 16 fastq files are generated per sample (see the example below). Now I want to concatenate all of these files and generate one output file named 102697-001-001_R1.fastq.gz

HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz

HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz

HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz

HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L001_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L002_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L003_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L004_R1.fastq.gz

All of the files above should be concatenated into a single file named 102697-001-001_R1.fastq.gz (that is, the name keeps the string between the first two _ characters plus the part after the last _).

I have tried:

$ cat HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L001_R1.fastq.gz \
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L002_R1.fastq.gz \
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L003_R1.fastq.gz \
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L004_R1.fastq.gz > 102697-001-001_R1.fastq.gz

and it works, but as I have a lot of files, I don't want to do this manually.

Jeff Schaller
H.K

2 Answers

4
for name in ./*.fastq.gz; do
    rnum=${name##*_}
    rnum=${rnum%%.*}

    sample=${name#*_}
    sample=${sample%%_*}

    cat "$name" >>"${sample}_$rnum.fastq.gz"
done

This would iterate over all compressed Fastq files in the current directory and extract the sample name into the shell variable sample. For all the filenames shown in the question, this would be 102697-001-001.

The rnum variable will hold the R# bit at the end of the filename.

The sample name is extracted by taking the filename and first removing everything up to and including the first _ character, and then removing everything after and including the first _ character from that result. The value for the rnum variable is extracted in a similar manner.

The file is then simply appended onto the end of the aggregated file using cat >>. The output filename will be constructed from the sample name, the R#, and the string .fastq.gz. For the shown files, this will be 102697-001-001_R1.fastq.gz.
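To see what each parameter expansion does, here is a minimal sketch that applies the same steps to one of the filenames from the question. The name is only used as a string here, so no file needs to exist:

```shell
name=./HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz

rnum=${name##*_}      # drop the longest prefix ending in "_"    -> R1.fastq.gz
rnum=${rnum%%.*}      # drop the longest suffix starting at "."  -> R1

sample=${name#*_}     # drop the shortest prefix ending in "_"   -> 102697-001-001_ATTACTCG-...
sample=${sample%%_*}  # drop the longest suffix starting at "_"  -> 102697-001-001

# The output filename the loop would append to.
printf '%s\n' "${sample}_$rnum.fastq.gz"
```

This prints 102697-001-001_R1.fastq.gz.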

Gzip-compressed files do not have to be uncompressed in order to be concatenated. Uncompressing the resulting file will give you the uncompressed concatenation of all Fastq files.
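The following is a small sketch of that gzip behaviour, using two throwaway files in a temporary directory (the file names a.gz, b.gz and both.gz are made up for the illustration):

```shell
tmpdir=$(mktemp -d)

# Two independently compressed files.
printf 'first\n'  | gzip >"$tmpdir/a.gz"
printf 'second\n' | gzip >"$tmpdir/b.gz"

# Concatenate the compressed files byte for byte, without decompressing.
cat "$tmpdir/a.gz" "$tmpdir/b.gz" >"$tmpdir/both.gz"

# gzip reads the result as a multi-member file and decompresses
# each member in turn.
result=$(gzip -dc "$tmpdir/both.gz")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

The decompressed output contains the line "first" followed by the line "second", exactly as if the two inputs had been concatenated before compression.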


An alternative way of doing this in bash, using a regular expression to work out the output file name:

for name in ./*.fastq.gz; do
    if [[ "$name" =~ _([0-9-]+)_.*(..)\.fastq\.gz ]]; then
        outfile="${BASH_REMATCH[1]}_${BASH_REMATCH[2]}.fastq.gz"

        cat "$name" >>"$outfile"
    fi
done

The filename is matched against the regular expression

_([0-9-]+)_.*(..)\.fastq\.gz

The two groups (bits in parentheses) will pick out the relevant parts of the filename for us. The first group captures a string that only consists of characters that are either digits or dashes. This group needs to be surrounded by _ on either side. The only place in the filename that this bit matches is the sample name.

After the first group, and the _ after it, we allow for any number of any characters (.*) up to the (..)\.fastq\.gz bit. The \.fastq\.gz will match the .fastq.gz string at the end of the filename, so the last group, (..), captures the R1 immediately before that (the . pattern will match any one character, while \. will match a dot).

The two captured groups are stored as index 1 and 2 in the BASH_REMATCH array (the name is short for "Bash Regular Expression Match"), and we use these in constructing the output file name.
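As a quick check, this sketch runs the same regular expression against a single filename from the question and prints what ends up in BASH_REMATCH. It needs bash, since [[ ... =~ ... ]] and BASH_REMATCH are not available in a plain POSIX sh:

```shell
# Requires bash: [[ =~ ]] and BASH_REMATCH are bash features.
name=./HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz

if [[ "$name" =~ _([0-9-]+)_.*(..)\.fastq\.gz ]]; then
    # Index 0 holds the whole matched text, 1 and 2 hold the groups.
    printf 'whole match: %s\n' "${BASH_REMATCH[0]}"
    printf 'group 1:     %s\n' "${BASH_REMATCH[1]}"
    printf 'group 2:     %s\n' "${BASH_REMATCH[2]}"

    outfile="${BASH_REMATCH[1]}_${BASH_REMATCH[2]}.fastq.gz"
    printf 'output file: %s\n' "$outfile"
fi
```

For this name, group 1 is 102697-001-001 and group 2 is R1, so the output file is 102697-001-001_R1.fastq.gz.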

Kusalananda
  • @H.K If this answer solved your issue, please take a moment and accept it by clicking on the check mark to the left. That will mark the question as answered and is the way thanks are expressed on the Stack Exchange sites. – terdon Sep 26 '17 at 08:51
  • @Kusalananda Can you kindly give a description of the bash code? – H.K Sep 26 '17 at 09:11
  • @H.K The thing I added at the end? I will add an explanation. – Kusalananda Sep 26 '17 at 09:11
  • @Kusalananda I think there is some problem with this code:
    for name in ./*.fastq.gz; do
        rnum=${name##*_}
        rnum=${rnum%%.*}
        sample=${name#*_}
        sample=${sample%%_*}
        cat "$name" >"${sample}_$rnum.fastq.gz"
    done
    When I concatenate just the first sample manually, the output file size is different from what this loop creates. The code that you posted before adding rnum was working OK. The bash version works perfectly. – H.K Sep 27 '17 at 10:06
  • @H.K Yes, you have found a bug in my code. The cat > should be cat >> as in the bash code. I will update the answer at once. Thanks! – Kusalananda Sep 27 '17 at 10:18
  • @H.K The difference is that > replaces the contents of the target file, while >> appends to the contents. – Kusalananda Sep 27 '17 at 10:20
1

I had to do something very similar for my work.

The solution that was the cleanest for me is as follows:

ls *.fastq.gz | cut -d '_' -f2 | sort | uniq | parallel -j 16 'cat *{}*.fastq.gz > {}_R1.fastq.gz'

In this code I:

  1. find all files with the .fastq.gz extension. Note: filenames must not contain special characters (e.g., !, ?, or spaces). See AdminBee's response in the comments.
  2. cut the output of (1) at the delimiter _ and keep the second field (-f2)
  3. sort the cut text
  4. keep only the unique (i.e., uniq) text
  5. send the unique text to parallel.
  6. have parallel start up to 16 jobs (-j 16)
  7. each parallel job runs the command
    'cat *{}*.fastq.gz > {}_R1.fastq.gz'
    
    This cat command concatenates all files in the current directory whose names contain the input ({}) received from uniq. For the files in the question, the output file will be named 102697-001-001_R1.fastq.gz.

I know that it doesn't automatically capture the R1. Perhaps someone could suggest a way to capture R1 in my code?

What is great about this code is that it does this task for all unique samples in the directory. I had 32 files coming from 16 samples (i.e., Sample1_L001.fastq and Sample1_L002.fastq; Sample2_L001 and Sample2_L002; etc...). This code concatenated them all by sample in one go. So I ended up with just Sample1.fastq, Sample2.fastq, etc.
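One possible way to capture the R# part as well, sketched under some assumptions: the filenames follow the layout shown in the question (flowcell_sample_index_lane_R#.fastq.gz) and contain no spaces or newlines. The function name merge_by_sample is made up for this illustration, and the sketch swaps parallel for a plain loop and builds the file list from a glob instead of ls:

```shell
# Hypothetical helper: merge files per sample and read number
# in the current directory.
merge_by_sample() {
    # Turn each filename into a "sample R#" key. sed -n with the p flag
    # prints only names that have the expected layout, so files produced
    # by an earlier run are skipped.
    printf '%s\n' ./*.fastq.gz |
        sed -n -E 's/^[^_]*_([^_]*)_.*_(R[0-9]+)\.fastq\.gz$/\1 \2/p' |
        sort -u |
        while read -r sample rnum; do
            # Concatenate every lane and flowcell for this sample and read.
            cat ./*_"${sample}"_*_"${rnum}".fastq.gz >"${sample}_${rnum}.fastq.gz"
        done
}
```

Calling merge_by_sample from the directory that holds the files would write one SAMPLE_R#.fastq.gz per sample and read number.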

Stephan
  • Welcome to the site, and thank you for your contribution. Please note that while your approach may work in your case, parsing the output of ls is strongly discouraged, as it can fail when the filenames contain "special" characters, even something as simple as a space. – AdminBee Jul 02 '20 at 09:00
  • Thank you for that information! :). I will add that caveat to my readme file and to the answer. Fortunately, I do not work with filenames that contain special characters. Nevertheless, it was very informative to read about the issues with parsing ls's output. – Stephan Jul 03 '20 at 15:13