2

I have 28 files that each have ~14,000 "entries". A single entry consists of a header, denoted by >string, a newline, and then a sequence which is a string. Each entry has variable length sequence/string. Across all 28 files there are identical entry headers but the sequence for each entry is variable.

For example one file CR1_ref.fasta would look like

>FBgn0080937
ATGGATAAAAGGCTCAGCGATAGTCCCGGAGATTGTCGCGTAACCAGATCCAGCATGACGCCCACCCTCCGCTTGGAGCACAGTCCCCGGCGGCAACAACAGCAACAACA
>FBgn0076379
ATGCTGCGCACCCTTTTCGCCGTGCGTGGTCAGTGCCAGCAGCTGCTGAGGAGAACATTCACCCCCCATTGCAGTGGCCAACGA
>FBgn0070974
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAACTCCTGCGGGAGCTGCCGCCGCAGAAATGCTCCAGCGCCACGCTGGCCAAGAAGGTGCTGTCGCAGAGCCCGCCGGCAGCCCCGCCGCCCACACCGGCCACAATTGTGCCGCTCACTGCGGTGCCCGTCATCCAGCTGACGCCTCCGTCGCACTCCGGCGACACGCCGCAAAAGCCAGCACCTCCGGCGCCGCCGCCGCC

The overall goal is to create ~14,000 new files. Where each file is the entry associated with a particular ID/header across all 28 files.

To extract a single entry from a single file I can use the following command

sed -n '/^>FBgn0080937$/{p;n;p;}' CR1_ref.fasta

To extract this entry across all 28 files, each ends in ref.fasta, I can do

for i in *ref.fasta; do sed -n '/^>FBgn0080937$/{p;n;p}' $i; done > FBgn0080937.fasta

I have a separate text file that has 14,000 lines each line corresponding to a header for an entry called gene.txt. The first few lines of this file look like

FBgn0080937
FBgn0076379
FBgn0070974
FBgn0081668
FBgn0076576
FBgn0076572
FBgn0079684
FBgn0070907
FBgn0080226
FBgn0072746

I would like to read through this file creating a new text file per header ID. Below $F is extracting entries for a particular header (FBgn*) and storing this in a new file. I am using the substitution command to rename sequences based on while ref.fasta file they come from.

while read -r line;
do F=$line
for i in *ref.fasta
do sed -n "/^>$F$/{s/FB.*/$i/;p;n;p;}" $i > $line.fasta
done
done < "gene.txt"

Currently this script creates 14,000 files but each file only has a single sequence.

>Z9_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC

I am expecting 28 sequences one sequence per *ref.fasta file. The sed command is outputting the last entry. The expected output would be

    >CR1_ref.fasta
    ATGCAGACGCGTCCGAGCAGTGAACC
    >FH2_ref.fasta
    AGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
    >MSH10_ref.fasta
    CGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
    >Z9_ref.fasta
    ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
Kusalananda
  • 333,661
  • 1
    Can you rewrite this question in computing terms? I only have the vaguest idea what a fasta thingie is, and I certainly couldn't identify one in a data file. – Chris Davies Dec 13 '17 at 21:05
  • 1
  • 1
    In short: variable expansion works within double quotes "", not within single quotes ''. Also, you can just do F=$line, there's no need to do the pass around the echo (unless you want the side effects of the unquoted expansion in echo $line) – ilkkachu Dec 13 '17 at 21:06
  • 3
    This is on topic here, but I would strongly recommend you post this on [bioinformatics.se] instead. The people here are very knowledgeable about text parsing but, as you can see from the comments, most of us won't have any idea what fasta is. If you like, I can migrate it over for you. If not, please [edit] and explain (also mention if your sequences will always be one line or not). You should also explain what you are trying to do because it really isn't obvious and I'm a bioinformatician. – terdon Dec 13 '17 at 21:06
  • 1
    Also, it sounds like you're trying to reinvent fastaexplode from the exonerate suite which will do this for you. – terdon Dec 13 '17 at 21:42
  • A couple of folks have asked you to explain your problem in terms that we Unix-savvy people can understand.  In case that isn't obvious, that includes posting enough of CR1_ref.fasta for us to be able to understand what processing is necessary to get the output you desire.  It would probably also be helpful to explain why you are saying '/^>$F$/{p;n;p;}' and what you expect it to do. – G-Man Says 'Reinstate Monica' Dec 13 '17 at 23:49
  • Migrating to Bioinformatics would be great if that would be the most appropriate. I will also try to edit the post to make the problem more clear – Dcastillo Dec 14 '17 at 02:11

1 Answers1

0

The shell is not really suited for this type of parsing. You can see in your own code that you read through the entirety of each file once for each of the gene names read from the gene.txt file.

The following single awk command would do the same thing quicker.

awk -F '>' '
    FNR == NR           { genes[$1]; next }
    /^>/ && $2 in genes { if (out != "") close(out);
                          out = $2 ".fa"
                          split(FILENAME, a, "_")
                          $0 = ">" a[1] "_" $2 }
    out != ""           { print >>out }' genes.txt *_ref.fasta

This first reads the genes.txt file and creates an associative array called genes from this with the gene names as keys.

When it gets to the Fasta files (the code assumes that these are all called something like XXX_ref.fasta), when we read a Fasta header, and the gene in the header is a key in our genes list, then we create an output filename from the gene name as genename.fa and rewrite the header to include the part of the current filename before the underscore.

If the original header in XXX_ref.fasta is

>genename

then this would be transformed into

>XXX_genename

The last part of the awk script sends all lines to the appropriate output file.

Testing this with the data that you have provided generates three files:

$ ls *.fa
FBgn0070974.fa FBgn0076379.fa FBgn0080937.fa

$ cat FBgn0076379.fa
>CR1_FBgn0076379
ATGCTGCGCACCCTTTTCGCCGTGCGTGGTCAGTGCCAGCAGCTGCTGAGGAGAACATTCACCCCCCATTGCAGTGGCCAACGA
Kusalananda
  • 333,661