
I am working with FASTA files containing lines such as:

\>97977-100;sample=Samp1  
TAATGATGATTTGT  
\>97978-60;sample=Samp2  
AACATTCAACGCGGTCGGTGAGTA  
\>97979-30;sample=Samp3  
AACCGTAGGAGTTGATGTGCGGT  
\>97980-20;sample=Samp4  
ACTGTCTGTATGTGGTG  

I would like to find all characters between the - and the ; on each header line and append them to the end of that line in the form ;size=(value);, so I would get:

\>97977-100;sample=Samp1;size=100;  
TAATGATGATTTGT  
\>97978-60;sample=Samp2;size=60;  
AACATTCAACGCGGTCGGTGAGTA  
\>97979-30;sample=Samp3;size=30;  
AACCGTAGGAGTTGATGTGCGGT  
\>97980-20;sample=Samp4;size=20;  
ACTGTCTGTATGTGGTG  

I found some help in another question on how to extract the characters between two strings, and I can get them with something like:

sed -n 1~2p $file | sed -e 's/.*-\(.*\);.*/\1/'

And I know how to append to the end of a line with:

sed "1~2s/$/;size=(I want this to be the output of the command above);/" $file

But I cannot get the two to work together. Substituting the output of the first command into the second sed call does not work either, as it fails with an argument-too-large error.

Jeff Schaller

1 Answer


sed solution:

sed -E 's/(.*-)([0-9]+)(;.*)/\1\2\3;size=\2;/' file

The output:

>97977-100;sample=Samp1;size=100;
TAATGATGATTTGT
>97978-60;sample=Samp2;size=60;
AACATTCAACGCGGTCGGTGAGTA
>97979-30;sample=Samp3;size=30;
AACCGTAGGAGTTGATGTGCGGT
>97980-20;sample=Samp4;size=20;
ACTGTCTGTATGTGGTG
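
The sed expression above captures three groups (everything up to the hyphen, the digits, and the rest of the line) and reuses the second group for the appended size. A slightly stricter variant, offered only as a sketch, restricts the substitution to header lines and anchors on the first hyphen. It assumes that headers always start with `>` and that the count is the run of digits between the first `-` and the following `;`. The file name `sample.fasta` is hypothetical.

```shell
# Hypothetical sample input, copied from the first record in the question.
printf '%s\n' '>97977-100;sample=Samp1' 'TAATGATGATTTGT' > sample.fasta

# /^>/   : only act on header lines, leaving sequence lines untouched.
# [^-]*  : stop at the first hyphen, so a later hyphen (for example in a
#          sample name) does not change which number is captured.
# &      : the whole matched line; \1 is the captured count.
sed -E '/^>/ s/^[^-]*-([0-9]+);.*$/&;size=\1;/' sample.fasta
```

Because `&` stands for the entire match, the header does not need to be rebuilt from separate groups.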

Or with awk:

awk -F'-' '/^>/{ $0=$0";size=" int($2) ";" }1' file
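
In the awk version, `-F'-'` splits each header on hyphens, so `$2` starts with the count (for example `100;sample=Samp1`), and `int()` keeps only the leading digits. As an alternative sketch, splitting on either `-` or `;` makes `$2` exactly the count, so no numeric conversion is needed. This assumes the part of the header before the count contains no `-` or `;`. The file name `sample.fasta` is hypothetical.

```shell
# Hypothetical sample input, copied from the second record in the question.
printf '%s\n' '>97978-60;sample=Samp2' 'AACATTCAACGCGGTCGGTGAGTA' > sample.fasta

# -F'[-;]' : treat both "-" and ";" as field separators.
# On header lines, append ";size=<count>;" to the whole line ($0).
# The trailing 1 is an always-true pattern that prints every line.
awk -F'[-;]' '/^>/{ $0 = $0 ";size=" $2 ";" } 1' sample.fasta
```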