1

I have a fasta file of millions of paired sequences, that looks like this:

>7001289F:56:HKH3FBCXX:2:1101:1692:2074 1:N:0:CGATGT
GAGCAGAGGCACCGCTGAGCAGACAGCGAGCGAGTGAAGGGGTCAGGGGCCAGTCAGCAATCTCGTGTAGAAAGAATCACGGTCGAGCGGTGCACGCATG
>NNNNN
GACACCTTCATTTCCACTTTATTGAGCAGCGGCGCATGCGTGCACCGCTCGACCGTGATTCTTTCTACACGAGATTGCTGACTGGCCCCTGACCCCTTCA
>7001289F:56:HKH3FBCXX:2:1101:1522:2186 1:N:0:CGATGT
GTAGATGATGAATACAGCTGTTGCTGCAGCAACTGGTGCTGAGTAAGCAACTGCGATCCATGGACGCATACCTAAACGGAAAGATAATTCCCAC
>NNNNN
GTGGGAATTATCTTTCCGTTTAGGTATGCGTCCATGGATCGCAGTTGCTTACTCAGCACCAGTTGCTGCAGCAACAGCTGTATTCATCATCTAC

I need to format it as below:

>7001289F:56:HKH3FBCXX:2:1101:1692:2074 1:N:0:CGATGT
GAGCAGAGGCACCGCTGAGCAGACAGCGAGCGAGTGAAGGGGTCAGGGGCCAGTCAGCAATCTCGTGTAGAAAGAATCACGGTCGAGCGGTGCACGCATGNNNNNGACACCTTCATTTCCACTTTATTGAGCAGCGGCGCATGCGTGCACCGCTCGACCGTGATTCTTTCTACACGAGATTGCTGACTGGCCCCTGACCCCTTCA
>7001289F:56:HKH3FBCXX:2:1101:1522:2186 1:N:0:CGATGT
GTAGATGATGAATACAGCTGTTGCTGCAGCAACTGGTGCTGAGTAAGCAACTGCGATCCATGGACGCATACCTAAACGGAAAGATAATTCCCACNNNNNGTGGGAATTATCTTTCCGTTTAGGTATGCGTCCATGGATCGCAGTTGCTTACTCAGCACCAGTTGCTGCAGCAACAGCTGTATTCATCATCTAC

Basically the complex headers represent the forward reads of DNA sequence and the header immediately following represents the corresponding reverse read with the headers NNNNN. I need to append these reverse reads to the forward reads separated with only NNNNN but am struggling to remove the new line characters with sed. Can anybody shed light on this please?

terdon
  • 242,166
C.cook
  • 41

1 Answers1

1

If your file is small enough to fit in memory, you can do:

perl -00pe 's/\n>(NNNNN)\n/$1/g' file

Since your file will almost certainly be too large to load into your RAM, you can use this instead:

$ perl -pe '$c++; if($c==2){chomp}
                  elsif($c==3){s/[>\n]//g;}
                  elsif($c==4){$c=0}' file.fa
>7001289F:56:HKH3FBCXX:2:1101:1692:2074 1:N:0:CGATGT
GAGCAGAGGCACCGCTGAGCAGACAGCGAGCGAGTGAAGGGGTCAGGGGCCAGTCAGCAATCTCGTGTAGAAAGAATCACGGTCGAGCGGTGCACGCATGNNNNNGACACCTTCATTTCCACTTTATTGAGCAGCGGCGCATGCGTGCACCGCTCGACCGTGATTCTTTCTACACGAGATTGCTGACTGGCCCCTGACCCCTTCA
>7001289F:56:HKH3FBCXX:2:1101:1522:2186 1:N:0:CGATGT
GTAGATGATGAATACAGCTGTTGCTGCAGCAACTGGTGCTGAGTAAGCAACTGCGATCCATGGACGCATACCTAAACGGAAAGATAATTCCCACNNNNNGTGGGAATTATCTTTCCGTTTAGGTATGCGTCCATGGATCGCAGTTGCTTACTCAGCACCAGTTGCTGCAGCAACAGCTGTATTCATCATCTAC

Explanation

  • perl -pe '...' file.fa : for each line of the input file file.fa, run the script given by -e and -print.
  • $c++ : increment the variable $c by one for each line.
  • if($c==2){chomp} : if the current value of $c is 2, remove the newline from the end of the line. This matches the lines of forward sequences.
  • elsif($c==3){s/[>\n]//g;} : if $c is 3, the >NNNNN lines, remove the > and newline characters.
  • elsif($c==4){$c=0}' : if $c is 4, set it back to 0 again.

Note that this assumes paired reads. It will fail if you don't have exactly one reverse read for every forward read in the file. It also assumes you have your sequences on a single line. Fasta files often have multiple lines per sequence, the default is to cut at 60 characters. This has changed in recent years, but the format still allows for multi-line sequences.

terdon
  • 242,166
  • 1
    @C.cook, if this answer solved your issue, please take a moment and accept it by clicking on the check mark to the left. That will mark the question as answered and is the way thanks are expressed on the Stack Exchange sites. – terdon Mar 23 '16 at 11:34