sort fasta file based on its alphanumeric ids

Question

>Ipunensis_00386        Yfr1
GCGGAGACGAAAGTTTCCGTTCACTCCTCACACCACACTCCGCCCAAATCATTGATTTGG
GCGGTT
>Ipunensis_00401        tRNA-Gly(gcc)
GCGGGTATAGCTCAGTGGTAGAGCGTCACCTTGCCAAGGTGAATGTCGCGCGTTCGAATC
GCGTTACCCGCT
>Ipunensis_00001        transposase IS4 family protein
ATGCAGAAGTTTCAGGGCATCCACTGGGTCAACCTAGACGGGCAGCACCAGGTTAGCAAT
CTCAGTGATGAGCGACGCTTAATCATCCACCTCTTGGGGCCACCTGTTGAGCGCTACTAC
CATGCCCCTGGTTAA
>Ipunensis_00002        Photosystem I assembly protein Ycf3
ATGCGTCACCCCGCCAAGTTACTCGGGTTAGTCACTCTCACCAGTATGCTTACGCTGGCT
>Ipunensis_00003        Cell wall-associated protease
ATGAAACGTTTTCTGACCAGTCTTTTGCTGACGGGCCTGCTTTGGCATAGTGGGGGCAGC
GTTGGGGTTGGGAGAGGTGCGATCGCACAAACCCAGTCCACCCCAGACCTCTACTACACC
>Ipunensis_00004        Photosystem I assembly protein Ycf3
TTGACCTGCGGCCCGCAGCCCTACCTGCCCAACCTGACTCCAGAAATTCCCATGATCTAC
CGCCTCTCGTCTCCCGGATTTTTGCTGGCGCTGCTGCTGCTATCTGCCGTCGATCCGGCA
>Ipunensis_00226        tRNA-Leu(gag)
TGCGGATGTGGTGGAACTGGTAGACACGCACGTTTGAGGGGCGTGTGGCTTACGCCTTGC
GAGTTCGAGTCTCGCCATCCGCAT
>Ipunensis_00045        tRNA-Ala(cgc)
GGGGAATTAGCTCAGCTGGTAGAGCGCTGCGATCGCACCGCAGAGGTCAGGAGTTCGAAT
CTCCTATTCTCCA
>Ipunensis_00357        glnA
ATCGTTCATCTCTTCAAACTGTCAAAGCTACTTACAAAAGCTACAGACGCACCAAGAGAC
GGAAGTAGGGGTCTGATCCCCCCGAAGGAACGCGCC
>Ipunensis_00403        tRNA-Gly(gcc)
GCGGGTATAGCTCAGTGGTAGAGCGTCACCTTGCCAAGGTGAATGTCGCGCGTTCGAATC

How can I sort the above fasta file by its alphanumeric ids, starting with >Ipunensis_00001 and continuing in ascending order?

Desired output:

>Ipunensis_00001        transposase IS4 family protein
ATGCAGAAGTTTCAGGGCATCCACTGGGTCAACCTAGACGGGCAGCACCAGGTTAGCAAT
CTCAGTGATGAGCGACGCTTAATCATCCACCTCTTGGGGCCACCTGTTGAGCGCTACTAC
CATGCCCCTGGTTAA
>Ipunensis_00002        Photosystem I assembly protein Ycf3
ATGCGTCACCCCGCCAAGTTACTCGGGTTAGTCACTCTCACCAGTATGCTTACGCTGGCT
>Ipunensis_00003        Cell wall-associated protease
ATGAAACGTTTTCTGACCAGTCTTTTGCTGACGGGCCTGCTTTGGCATAGTGGGGGCAGC
GTTGGGGTTGGGAGAGGTGCGATCGCACAAACCCAGTCCACCCCAGACCTCTACTACACC
>Ipunensis_00004        Photosystem I assembly protein Ycf3
TTGACCTGCGGCCCGCAGCCCTACCTGCCCAACCTGACTCCAGAAATTCCCATGATCTAC
CGCCTCTCGTCTCCCGGATTTTTGCTGGCGCTGCTGCTGCTATCTGCCGTCGATCCGGCA
>Ipunensis_00045        tRNA-Ala(cgc)
GGGGAATTAGCTCAGCTGGTAGAGCGCTGCGATCGCACCGCAGAGGTCAGGAGTTCGAAT
CTCCTATTCTCCA
>Ipunensis_00226        tRNA-Leu(gag)
TGCGGATGTGGTGGAACTGGTAGACACGCACGTTTGAGGGGCGTGTGGCTTACGCCTTGC
GAGTTCGAGTCTCGCCATCCGCAT
>Ipunensis_00357        glnA
ATCGTTCATCTCTTCAAACTGTCAAAGCTACTTACAAAAGCTACAGACGCACCAAGAGAC
GGAAGTAGGGGTCTGATCCCCCCGAAGGAACGCGCC
>Ipunensis_00386        Yfr1
GCGGAGACGAAAGTTTCCGTTCACTCCTCACACCACACTCCGCCCAAATCATTGATTTGG
GCGGTT
>Ipunensis_00401        tRNA-Gly(gcc)
GCGGGTATAGCTCAGTGGTAGAGCGTCACCTTGCCAAGGTGAATGTCGCGCGTTCGAATC
GCGTTACCCGCT
>Ipunensis_00403        tRNA-Gly(gcc)
GCGGGTATAGCTCAGTGGTAGAGCGTCACCTTGCCAAGGTGAATGTCGCGCGTTCGAATC

Please rewrite your question in computational terms, not bio-whatever-it-is terms. (What is the id field that you want to sort by? Are the > characters really in the file?) What I can understand from your question is "How can I sort the above specially formatted file based on some field" — Chris Davies, Aug 01 '20 at 20:39
Looking harder, it seems each section starts with the > character. Are there really multiple lines per section, or is that an artefact of the way you posted your question? — Chris Davies, Aug 01 '20 at 20:48
The >Ipunensis_00045 and >Ipunensis_00403 are missing from your expected output. If that's a mistake, please fix it, otherwise please explain it. — Ed Morton, Aug 02 '20 at 00:52
I should not be the one to answer that but the Wikipedia link shows that > starts a multi-line block. So not lines shall be sorted but blocks. — Hauke Laging, Aug 02 '20 at 14:37
Please see the corrected question. Let me know if any further clarification is needed? — Gavin, Aug 03 '20 at 15:37

guest · Answer 1 · 2020-08-02T11:55:31.987

4

With GNU sort & sed, assuming null bytes cannot appear in your file:

sed 's/^>/\x00&/' file  | sort -z | tr -d '\0'

This puts a null byte in front of every line that begins with >, so each block becomes one null-delimited record. The -z/--zero-terminated option of GNU sort then sorts those records lexicographically, and tr removes the null bytes.
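
To see the null-byte trick at work, here is a minimal sketch on a made-up two-record file (the ids and sequences are invented for illustration). Like the command above, it assumes GNU sed and GNU sort.

```shell
# Made-up sample: two records, deliberately out of order.
sample=$(mktemp)
printf '>id_2 b\nCCCC\nGG\n>id_1 a\nAAAA\n' > "$sample"

# Same pipeline as above: each block becomes one NUL-delimited record,
# sort -z orders whole records, tr strips the NULs again.
sorted=$(sed 's/^>/\x00&/' "$sample" | sort -z | tr -d '\0')
printf '%s\n' "$sorted"
# Prints:
# >id_1 a
# AAAA
# >id_2 b
# CCCC
# GG

rm -f "$sample"
```

Because each record is compared as a whole, the sequence lines travel with their header and keep their original wrapping.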

edited Aug 02 '20 at 11:55

answered Aug 02 '20 at 08:00


score 2 · Answer 2 · edited Aug 03 '20 at 09:38

With GNU sort plus POSIX awk and cut:
```
awk '/^>/{key=$1} {print key, $0}' file | sort -k1,1 -s | cut -d' ' -f2-
```
That works because GNU sort's -s ("stable sort") option retains the original line order within each block, so there is no need to print and additionally sort on the original line numbers as in the next example.
Alternatively, with all POSIX/standard UNIX tools:
```
awk '/^>/{key=$1} {print key, NR, $0}' file | sort -k1,1 -k2,2n | cut -d' ' -f3-
```
The benefit of that one is it'll work on all UNIX systems.

Both of the above have the benefit over the gawk-only alternative below that awk doesn't have to store the whole input file in memory, so they'll work even for massive input files. Only the sort in that pipeline has to handle the whole file, and it uses demand paging, etc. to be able to do just that.
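
To see why the line-number variant keeps each block intact, it can help to look at the decorated stream before it is sorted. A small sketch, using a made-up two-record file (ids and sequences are invented):

```shell
# Made-up sample: two records, deliberately out of order.
sample=$(mktemp)
printf '>id_2 b\nCCCC\nGG\n>id_1 a\nAAAA\n' > "$sample"

# Decorate: prefix every line with its block's id and its line number.
decorated=$(awk '/^>/{key=$1} {print key, NR, $0}' "$sample")
printf '%s\n' "$decorated"
# Prints:
# >id_2 1 >id_2 b
# >id_2 2 CCCC
# >id_2 3 GG
# >id_1 4 >id_1 a
# >id_1 5 AAAA

# Sort on the id, then numerically on the line number, then undecorate.
result=$(printf '%s\n' "$decorated" | sort -k1,1 -k2,2n | cut -d' ' -f3-)
printf '%s\n' "$result"
# Prints:
# >id_1 a
# AAAA
# >id_2 b
# CCCC
# GG

rm -f "$sample"
```

The second sort key is what keeps the header ahead of its sequence lines; the -s variant gets the same effect from sort's stability instead.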

Finally, with GNU awk for sorted_in (see https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning):
```
awk '
    /^>/{key=$1} {vals[key]=vals[key] $0 ORS}
    END { PROCINFO["sorted_in"]="@ind_str_asc"; for (key in vals) printf "%s", vals[key] }
' file
```
That last one is really only preferable if you're doing this as an intermediate part of some larger awk script and so using an external shell tool for sorting is a less desirable option.

You're welcome, see https://stackoverflow.com/help/someone-answers for what to do next. — Ed Morton, Aug 04 '20 at 19:41

score 0 · Answer 3 · answered Aug 02 '20 at 12:14

This is the same basic idea as guest's clever \0 trick, but doesn't require GNU tools:

$ perl -pe 'if (/^>/) { s/\n/\t/; print "\n" if $. > 1 } else { s/\n// } END { print "\n" }' foo.fa | sort | perl -pe 'y/\t/\n/' | fold -w 60
>Ipunensis_00001        transposase IS4 family protein
ATGCAGAAGTTTCAGGGCATCCACTGGGTCAACCTAGACGGGCAGCACCAGGTTAGCAAT
CTCAGTGATGAGCGACGCTTAATCATCCACCTCTTGGGGCCACCTGTTGAGCGCTACTAC
CATGCCCCTGGTTAA
>Ipunensis_00002        Photosystem I assembly protein Ycf3
ATGCGTCACCCCGCCAAGTTACTCGGGTTAGTCACTCTCACCAGTATGCTTACGCTGGCT
>Ipunensis_00003        Cell wall-associated protease
ATGAAACGTTTTCTGACCAGTCTTTTGCTGACGGGCCTGCTTTGGCATAGTGGGGGCAGC
GTTGGGGTTGGGAGAGGTGCGATCGCACAAACCCAGTCCACCCCAGACCTCTACTACACC
>Ipunensis_00004        Photosystem I assembly protein Ycf3
TTGACCTGCGGCCCGCAGCCCTACCTGCCCAACCTGACTCCAGAAATTCCCATGATCTAC
CGCCTCTCGTCTCCCGGATTTTTGCTGGCGCTGCTGCTGCTATCTGCCGTCGATCCGGCA
>Ipunensis_00045        tRNA-Ala(cgc)
GGGGAATTAGCTCAGCTGGTAGAGCGCTGCGATCGCACCGCAGAGGTCAGGAGTTCGAAT
CTCCTATTCTCCA
>Ipunensis_00226        tRNA-Leu(gag)
TGCGGATGTGGTGGAACTGGTAGACACGCACGTTTGAGGGGCGTGTGGCTTACGCCTTGC
GAGTTCGAGTCTCGCCATCCGCAT
>Ipunensis_00357        glnA
ATCGTTCATCTCTTCAAACTGTCAAAGCTACTTACAAAAGCTACAGACGCACCAAGAGAC
GGAAGTAGGGGTCTGATCCCCCCGAAGGAACGCGCC
>Ipunensis_00386        Yfr1
GCGGAGACGAAAGTTTCCGTTCACTCCTCACACCACACTCCGCCCAAATCATTGATTTGG
GCGGTT
>Ipunensis_00401        tRNA-Gly(gcc)
GCGGGTATAGCTCAGTGGTAGAGCGTCACCTTGCCAAGGTGAATGTCGCGCGTTCGAATC
GCGTTACCCGCT
>Ipunensis_00403        tRNA-Gly(gcc)
GCGGGTATAGCTCAGTGGTAGAGCGTCACCTTGCCAAGGTGAATGTCGCGCGTTCGAATC

Explanation

The first perl command replaces the newline with a \t on header lines (those starting with >) and removes it on all other lines. It also prints a newline before every header except the first, plus one at the very end. This puts each record's id and sequence on a single line, separated by a tab character.

This is then passed to sort, and then to a second perl that turns the tabs back into newlines, converting back to a fasta-like format. This assumes the header lines themselves contain no tab characters.

Finally, we pass the output through fold -w 60 to get the standard 60-char fasta format. Note, however, that this would also fold the ID line if you happen to have ID lines with more than 60 characters. If that's a problem, you could try the two awk scripts I have posted previously and do:

$ FastaToTbl foo.fa | sort | TblToFasta 
>Ipunensis_00001 transposase IS4 family protein 
ATGCAGAAGTTTCAGGGCATCCACTGGGTCAACCTAGACGGGCAGCACCAGGTTAGCAAT
CTCAGTGATGAGCGACGCTTAATCATCCACCTCTTGGGGCCACCTGTTGAGCGCTACTAC
CATGCCCCTGGTTAA
>Ipunensis_00002 Photosystem I assembly protein Ycf3 
ATGCGTCACCCCGCCAAGTTACTCGGGTTAGTCACTCTCACCAGTATGCTTACGCTGGCT
>Ipunensis_00003 Cell wall-associated protease 
ATGAAACGTTTTCTGACCAGTCTTTTGCTGACGGGCCTGCTTTGGCATAGTGGGGGCAGC
GTTGGGGTTGGGAGAGGTGCGATCGCACAAACCCAGTCCACCCCAGACCTCTACTACACC
>Ipunensis_00004 Photosystem I assembly protein Ycf3 
TTGACCTGCGGCCCGCAGCCCTACCTGCCCAACCTGACTCCAGAAATTCCCATGATCTAC
CGCCTCTCGTCTCCCGGATTTTTGCTGGCGCTGCTGCTGCTATCTGCCGTCGATCCGGCA
>Ipunensis_00045 tRNA-Ala(cgc) 
GGGGAATTAGCTCAGCTGGTAGAGCGCTGCGATCGCACCGCAGAGGTCAGGAGTTCGAAT
CTCCTATTCTCCA
>Ipunensis_00226 tRNA-Leu(gag) 
TGCGGATGTGGTGGAACTGGTAGACACGCACGTTTGAGGGGCGTGTGGCTTACGCCTTGC
GAGTTCGAGTCTCGCCATCCGCAT
>Ipunensis_00357 glnA 
ATCGTTCATCTCTTCAAACTGTCAAAGCTACTTACAAAAGCTACAGACGCACCAAGAGAC
GGAAGTAGGGGTCTGATCCCCCCGAAGGAACGCGCC
>Ipunensis_00386 Yfr1 
GCGGAGACGAAAGTTTCCGTTCACTCCTCACACCACACTCCGCCCAAATCATTGATTTGG
GCGGTT
>Ipunensis_00401 tRNA-Gly(gcc) 
GCGGGTATAGCTCAGTGGTAGAGCGTCACCTTGCCAAGGTGAATGTCGCGCGTTCGAATC
GCGTTACCCGCT
>Ipunensis_00403 tRNA-Gly(gcc) 
GCGGGTATAGCTCAGTGGTAGAGCGTCACCTTGCCAAGGTGAATGTCGCGCGTTCGAATC
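
If those scripts are not at hand, the header-folding problem can also be sidestepped by wrapping only the sequence part of each tab-joined record. The following is a minimal sketch, not the scripts referred to above: wrap_records is a hypothetical helper name, and it assumes its input is one record per line in header<TAB>sequence form.

```shell
# Hypothetical helper: reads header<TAB>sequence records on stdin and prints
# FASTA, wrapping only the sequence at the given width (default 60).
wrap_records() {
    awk -F'\t' -v w="${1:-60}" '{
        print $1                          # header line, never folded
        seq = $2
        while (length(seq) > w) {         # emit full-width chunks
            print substr(seq, 1, w)
            seq = substr(seq, w + 1)
        }
        if (length(seq) > 0) print seq    # remainder, if any
    }'
}

# Width 4 is used here only to keep the demonstration short.
wrapped=$(printf '>id_1 a header that is longer than the wrap width\tAAAACCCCGG\n' | wrap_records 4)
printf '%s\n' "$wrapped"
# Prints:
# >id_1 a header that is longer than the wrap width
# AAAA
# CCCC
# GG
```

In a pipeline this would take the place of the tab-to-newline conversion and the fold -w 60 step that follow sort.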
