I have 28 files that each have ~14,000 "entries". A single entry consists of a header, denoted by >string, a newline, and then a sequence which is a string. Each entry has variable length sequence/string. Across all 28 files there are identical entry headers but the sequence for each entry is variable.
For example one file CR1_ref.fasta would look like
>FBgn0080937
ATGGATAAAAGGCTCAGCGATAGTCCCGGAGATTGTCGCGTAACCAGATCCAGCATGACGCCCACCCTCCGCTTGGAGCACAGTCCCCGGCGGCAACAACAGCAACAACA
>FBgn0076379
ATGCTGCGCACCCTTTTCGCCGTGCGTGGTCAGTGCCAGCAGCTGCTGAGGAGAACATTCACCCCCCATTGCAGTGGCCAACGA
>FBgn0070974
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAACTCCTGCGGGAGCTGCCGCCGCAGAAATGCTCCAGCGCCACGCTGGCCAAGAAGGTGCTGTCGCAGAGCCCGCCGGCAGCCCCGCCGCCCACACCGGCCACAATTGTGCCGCTCACTGCGGTGCCCGTCATCCAGCTGACGCCTCCGTCGCACTCCGGCGACACGCCGCAAAAGCCAGCACCTCCGGCGCCGCCGCCGCC
The overall goal is to create ~14,000 new files. Where each file is the entry associated with a particular ID/header across all 28 files.
To extract a single entry from a single file I can use the following command
sed -n '/^>FBgn0080937$/{p;n;p;}' CR1_ref.fasta
To extract this entry across all 28 files, each ends in ref.fasta, I can do
for i in *ref.fasta; do sed -n '/^>FBgn0080937$/{p;n;p}' $i; done > FBgn0080937.fasta
I have a separate text file that has 14,000 lines each line corresponding to a header for an entry called gene.txt. The first few lines of this file look like
FBgn0080937
FBgn0076379
FBgn0070974
FBgn0081668
FBgn0076576
FBgn0076572
FBgn0079684
FBgn0070907
FBgn0080226
FBgn0072746
I would like to read through this file creating a new text file per header ID. Below $F is extracting entries for a particular header (FBgn*) and storing this in a new file. I am using the substitution command to rename sequences based on while ref.fasta file they come from.
while read -r line;
do F=$line
for i in *ref.fasta
do sed -n "/^>$F$/{s/FB.*/$i/;p;n;p;}" $i > $line.fasta
done
done < "gene.txt"
Currently this script creates 14,000 files but each file only has a single sequence.
>Z9_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
I am expecting 28 sequences one sequence per *ref.fasta file. The sed command is outputting the last entry. The expected output would be
>CR1_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACC
>FH2_ref.fasta
AGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
>MSH10_ref.fasta
CGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
>Z9_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
""
, not within single quotes''
. Also, you can just doF=$line
, there's no need to do the pass around theecho
(unless you want the side effects of the unquoted expansion inecho $line
) – ilkkachu Dec 13 '17 at 21:06fastaexplode
from theexonerate
suite which will do this for you. – terdon Dec 13 '17 at 21:42CR1_ref.fasta
for us to be able to understand what processing is necessary to get the output you desire. It would probably also be helpful to explain why you are saying'/^>$F$/{p;n;p;}'
and what you expect it to do. – G-Man Says 'Reinstate Monica' Dec 13 '17 at 23:49