
I have a file with this structure:

>Cluster 0
0       51aa, >MG00HS05:520:C8M1TACXX:3:1101:1428:2080/1... *
1       51aa, >MG00HS05:520:C8M1TACXX:3:1101:1658:2480/1... at 3:51:1:49/96.08%
2       51aa, >MG00HS05:520:C8M1TACXX:3:1101:15131:2756/1... at 1:51:1:51/100.00%
[thousands of similarly looking lines]
>Cluster 1
0       51aa, >MG00HS05:520:C8M1TACXX:3:1101:3733:2088/1... *
1       50aa, >MG00HS05:520:C8M1TACXX:3:1101:6962:2026/1... at 2:50:1:49/98.00%
2       51aa, >MG00HS05:520:C8M1TACXX:3:1101:14617:2071/1... at 2:51:1:50/96.08%
[thousands of similarly looking lines]
>Cluster 2
0       51aa, >MG00HS05:520:C8M1TACXX:3:1101:5164:2153/1... *
1       51aa, >MG00HS05:520:C8M1TACXX:3:1101:15660:20057/1... at 1:51:1:51/98.04%
2       51aa, >MG00HS05:520:C8M1TACXX:3:1101:8563:35493/1... at 1:50:1:51/96.08%
[thousands of similarly looking lines]

There are about two million lines starting with >.

I'd like to extract each line starting with > together with the lines that follow it, up to (but not including) the next line starting with >, and put each such block in its own file. Something like this:

File_one:

>Cluster 0
0       51aa, >MG00HS05:520:C8M1TACXX:3:1101:1428:2080/1... *
1       51aa, >MG00HS05:520:C8M1TACXX:3:1101:1658:2480/1... at 3:51:1:49/96.08%
2       51aa, >MG00HS05:520:C8M1TACXX:3:1101:15131:2756/1... at 1:51:1:51/100.00%
[thousands of similarly looking lines]

File_two:

>Cluster 1
0       51aa, >MG00HS05:520:C8M1TACXX:3:1101:3733:2088/1... *
1       50aa, >MG00HS05:520:C8M1TACXX:3:1101:6962:2026/1... at 2:50:1:49/98.00%
2       51aa, >MG00HS05:520:C8M1TACXX:3:1101:14617:2071/1... at 2:51:1:50/96.08%
[thousands of similarly looking lines]

File_three:

>Cluster 2
0       51aa, >MG00HS05:520:C8M1TACXX:3:1101:5164:2153/1... *
1       51aa, >MG00HS05:520:C8M1TACXX:3:1101:15660:20057/1... at 1:51:1:51/98.04%
2       51aa, >MG00HS05:520:C8M1TACXX:3:1101:8563:35493/1... at 1:50:1:51/96.08%
[thousands of similarly looking lines]

I've written a bash script that should do it, but it's not working. I'm not a pro at bash scripting.

mkdir FemaleMito1_clusters
while read i
        do $i > FemaleMito1_clusters/FemaleMito1_${i#>}
        n=1
        while [ `grep -A $n $i FemaleMito1_cdhit2 | tail -n1 | grep -c "^>"` -eq 0 ]
                do grep -A"$n" $i FemaleMito1_cdhit2 | tail -n1 >> FemaleMito1_clusters/FemaleMito1_"${i#>}"
                ((n++))
                done
        done < FemaleMito1_cdhit2_list #this is a file containing just the lines starting with >

How can I do it? Feel free to skip my script entirely; there's probably a one-liner that does what I'm looking for.

I also have to filter the files and keep only those with more than a certain number of lines. I thought about doing it with a simple wc -l after creating the files, but if there's a way to include this in the command without creating useless files, that would be better.
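For reference, this is roughly the post-hoc filter I had in mind (just a sketch; 100 is an arbitrary threshold, and FemaleMito1_clusters is the output directory from my script):

for f in FemaleMito1_clusters/*; do
    # drop any cluster file with fewer than 100 lines
    [ "$(wc -l < "$f")" -lt 100 ] && rm -- "$f"
done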

  • Did you try any solution available by searching the web for fasta split? This is a fairly common bioinformatics task, and it has been solved many times over already. – Kusalananda Jul 26 '19 at 16:23
  • @Kusalananda ...sometimes I forget that I'm not the only one in the world. I searched my issue in grep or awk terms, not in fasta terms. Thank you for reminding me I don't have to reinvent the wheel every time -_- – LinuxBlanket Jul 26 '19 at 16:28
  • Also, if you're not aware of it, you may enjoy the bioinformatics StackExchange site. See e.g. https://bioinformatics.stackexchange.com/search?q=split+fasta (not guaranteed to give you anything useful, mind you) – Kusalananda Jul 26 '19 at 16:30
  • @LinuxBlanket yes, while this is 100% on topic here, it really makes more sense to ask this sort of question on [bioinformatics.se], where people know what fasta is and what tools are available to parse it. – terdon Jul 26 '19 at 18:08

2 Answers


Although (as you have been advised in the comments) there are probably bioinformatics tools that are more suitable for your application, this can be done with csplit:

csplit -sz file '/^>/' '{*}'

gives

$ head xx*
==> xx00 <==
>Cluster 0
[thousands of lines]

==> xx01 <==
>Cluster 1
[thousands of lines]

==> xx02 <==
>Cluster 2
[thousands of lines]

For options controlling the numbering and format of the output file names, refer to the manual page (man csplit).
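For example, if you want file names closer to the FemaleMito1_ scheme from your script, GNU csplit lets you set the prefix and suffix format. This is just a sketch, assuming GNU csplit (-b is a GNU extension); the .clstr suffix is an arbitrary naming choice, and %07d leaves enough digits for about two million clusters:

mkdir -p FemaleMito1_clusters
csplit -sz -f FemaleMito1_clusters/FemaleMito1_ -b '%07d.clstr' \
    FemaleMito1_cdhit2 '/^>/' '{*}'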

steeldriver

You can do this quite easily in awk:

awk '/^>/ { name = $0; sub(/^>/, "", name) } { print >> (name ".fa") }' file.fa

That iterates over all of the input file's lines and, if the first character is a >, saves that line as name and then strips the leading > from name, since you don't want that in the file name. Every line is then appended to a file called name.fa, where name is whatever the current cluster's name is. (The parentheses around name ".fa" make the redirection target unambiguous; some awk implementations reject the unparenthesized form.)
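Note that with headers like >Cluster 0 the resulting file names will contain a space (Cluster 0.fa). If you'd rather avoid that, a small variation of the same one-liner (a sketch; gsub is standard awk) squashes anything unusual to underscores:

awk '/^>/ { name = $0; sub(/^>/, "", name); gsub(/[^A-Za-z0-9._-]/, "_", name) } { print >> (name ".fa") }' file.fa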

If you only want to write out the clusters that have at least a certain number of member lines, you can use:

awk -v min=4 '
    /^>/ {
        # a new cluster starts: write out the previous one if it was big enough
        if (num >= min) print seq >> (name ".fa")
        name = $0
        sub(/^>/, "", name)
        seq = $0
        num = 0
        next
    }
    {
        # accumulate member lines and count them
        seq = seq "\n" $0
        num++
    }
    END {
        # write out the last cluster
        if (num >= min) print seq >> (name ".fa")
    }' file.fa
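With -v min=4, a cluster is written out only if it has at least four member lines after its header (the header itself isn't counted); change min to whatever threshold you need.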

As a general rule, don't use shell loops for text processing (see "Why is using a shell loop to process text considered bad practice?" on this site): they're slow, cumbersome, and prone to error.

terdon
  • Thank you for the link about shell loops, it was very informative. I'm trying to make your script work, but it doesn't. It doesn't show any error, but it doesn't create any file. I'm just replacing file.fa with my file, is there anything more I should do? – LinuxBlanket Jul 26 '19 at 20:19
  • @LinuxBlanket please use fastaexplode instead of the script. But if the script isn't working, you might have whitespace before the > or some other oddity. Please edit your question and add a few lines of your actual file. Also, if you're using the second script, make sure you actually have sequences with 4 or more lines. – terdon Jul 26 '19 at 20:23
  • Well, since it's not really a fasta, fastaexplode ...exploded. I tried csplit but it's really slow, and since I'm not really using all those files but only a fraction of them I think the script is more efficient. I'll edit the question with the real data. – LinuxBlanket Jul 26 '19 at 20:30
  • @LinuxBlanket I'm not at my computer now, but the awk script should work on that. Make sure the > is the first character on each line, that there is nothing (including no spaces) before it and that the file has normal Unix newlines and hasn't been edited in windows. – terdon Jul 26 '19 at 20:37
  • Ok, now it works! Thank you! – LinuxBlanket Jul 26 '19 at 20:50