Linearizing a fasta file and removing special characters in

Question

I linearized a fasta file using awk on a remote computer. When I opened it with nano, it showed that the file had been linearized. However, when I downloaded the file to my local computer and viewed it in Notepad, the file I had generated was back in its original wrapped format. Could you please advise what the reason could be?

This is the sequence:

>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP
MEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ
FFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVD
IIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEM
NLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDL
DGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP
LRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE
DFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEKDEDSSSLCSQKGGVIKQGWLHKANVNST
ITVTMKVFKRRYFYLTQLPDGSYILNSYKDEKNSKESKGCIYLDACIDVVQCPKMRRHAF
ELKMLDKYSHYLAAETEQEMEEWLIMLKKIIQINTDSLVQEKKDTVEAIQEEETSSQGKA
ENIMASLERSMHPELMKYGRETEQLNKLSRGDGRQNLFSFDSEVQRLDFSGIEPDVKPFE
EKCNKRFMVNCHDLTFNILGHIGDNAKGPPTNVEPFFINLALFDVKNNCKISADFHVDLN
PPSVREMLWGTSTQLSNDGNAKGFSPESLIHGIAESQLCYIKQGIFSVTNPHPEIFLVVR

Then I used awk to linearize it as follows:

 awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < Sequences.fa > out3.fasta

The output was :

>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP^MMEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ^MFFETRPEDLNPPKEEHIGKKKSGNDPTSVDPM$
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP^MLRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE^MDFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEK$

The problem now is that when I download it to my local computer and view it in Notepad (on my Windows computer) or MEGA, it goes back to the wrapped format. What could be the reason for this? Another issue I faced was that when I tried to remove the carets (^) in the sequences using sed 's/\^//g' out3.fasta>seq3.fasta, it did not remove them. The $ is the line break.

(1) AFAIK notepad in Windows has an option to wrap lines. It's a display option, it affects how files are displayed, the actual content is not affected. Are you sure "back to the wrapped format" is a real issue with the content? (2) The output contains ^ characters and I cannot get them from the awk code and the input you posted. Does your input contain ^M? It's possible all these problems are because you're using *nix tools to process a file with Windows line endings. Use dos2unix first. — Kamil Maciorowski, Dec 30 '22 at 21:17
@KamilMaciorowski No my input does not contain ^M. Thank you I will read up on dos2unix and then give it a try. — thole, Dec 31 '22 at 03:44
Editors have their own text wrapping feature that you need to turn off. A clear way to know is to count the number of lines your file has. Use wc -l on Linux and find /v /c "" on Windows. — seshoumara, Dec 31 '22 at 03:46
@KamilMaciorowski I managed to linearize the fasta using Python. I am not sure I should post it here as an answer since this is a Unix & Linux environment. Thanks for the suggestion earlier. — thole, Dec 31 '22 at 12:37
@thole if your input doesn't contain ^Ms then the script you wrote to linearize it is adding them, as they are present in the output you posted. P^MM in your output, for example, is <P><control-M><M>. There is no caret (^) in that string or anywhere else in your data, which is why your sed command couldn't find/replace it; the ^ you see in every ^M is just part of the control-M display, not a separate character. — Ed Morton, Dec 31 '22 at 14:42
@EdMorton thank you I didn't know this. I did manage to linearize the sequences with biopython eventually — thole, Dec 31 '22 at 15:29
@thole sure, you can post a Python solution, Python is a standard tool on *nix systems. A better solution, however, is to avoid mixing systems. If you are doing bioinformatics on Windows (which is a very bad idea since the vast majority of bioinformatics tools don't work there) then only use Windows. If you are working on macOS or Linux or other *nix systems, then stick to those and don't bring Windows into it. — terdon, Dec 31 '22 at 15:40
If you "downloaded it to your local computer", did you try opening it using WSL (Windows Subsystem for Linux)? You should even be able to run nano on your local computer with WSL. https://learn.microsoft.com/en-us/windows/wsl/install — jubilatious1, Jan 19 '23 at 21:53

score 2 · Answer 1 · answered Dec 31 '22 at 15:38

Don't do this. Never open files on Windows if you can avoid it since that breaks them by converting them to use Windows line endings, so you can no longer use them with the standard bioinformatics tools, the vast majority of which are designed for *nix systems. This is what caused the problem you have with the extra ^M.
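One way to confirm whether a file really carries Windows line endings is to count them in the raw bytes. The following is only a minimal sketch in Python rather than awk; the function name `count_line_endings` and the sample data are hypothetical and not part of the original post.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (Windows) and bare LF (Unix) line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LFs that are not part of a CRLF pair
    return {"crlf": crlf, "lf": lf}


# Hypothetical sample: one record saved with Windows line endings.
sample = b">P1\r\nMPPRR\r\nSIVEV\r\n"
print(count_line_endings(sample))
```

To run it on a real file, read the file in binary mode (for example `open("out3.fasta", "rb").read()`), because Python's text mode translates line endings on read and would hide the carriage returns.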

So, your first solution is to simply not get Windows systems involved. If you must, then you need to change the line endings. So, to make the file look folded in Windows, but not on *nix or other systems, and remembering that this will break any downstream processing you want to do on the file unless that is also done on Windows machines, you can do this with GNU awk and some others:

awk '{ if(/^>/){ print NR==1 ? $0"\r" : "\r\n"$0"\r"}else{ printf "%s",$0}} END{print "\r"}' Sequences.fa

Or, with Perl:

perl -ne 'chomp; if(/^>/){$.==1 ? print "$_\r\n" : print "\r\n$_\r\n"}else{s/\n//g; print}END{print "\r\n"}' Sequences.fa

Finally, note that there is almost certainly no reason to do this. The Fasta format allows multiple lines and most sequences will indeed be split (usually at 60 characters) across multiple lines. This is normal. I have only seen one-line sequences proliferate since the popularization of the fastq format, which also allows multi-line sequences, but since it is mostly used for short reads, you rarely actually see multi-line entries in the wild. In any case, any program that is designed to deal with fasta is perfectly happy with multi-line sequences, so this whole thing is probably unnecessary.
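If linearizing is still wanted, the stray ^M characters can be avoided by stripping any trailing carriage return from each line before joining. This is a rough sketch in Python, not a translation of the awk or Perl commands above; `linearize_fasta` is a hypothetical helper name, and the sample records are shortened for illustration.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line.

    Sequence lines that appear before the first header are discarded.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop the LF and any Windows CR
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical input with Windows line endings.
sample = [">P1\r\n", "MPPRR\r\n", "SIVEV\r\n", ">P2\r\n", "MAEVR\r\n"]
for header, seq in linearize_fasta(sample):
    print(header)
    print(seq)
```

When writing the result to disk, opening the output with `open(path, "w", newline="\n")` keeps Unix line endings even if the script happens to run on Windows.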

jubilatious1 · Answer 2 · 2023-01-19T22:22:20.310

Using Raku (formerly known as Perl_6)

raku -ne '(/^ \> /) ?? "\n$_".put !! .print;'

As mentioned in the comments above, you can try using Windows Subsystem for Linux (WSL) on your local Windows machine. You can even run nano locally!

This answer provides Raku code for 'linearizing' your Fasta sequence. There's some cross-platform Raku development going on, so you may find Raku useful going forward. See: https://www.rakudo.org/downloads

Sample Input:

>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP
MEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ
FFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVD
IIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEM
NLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDL
DGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP
LRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE
DFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEKDEDSSSLCSQKGGVIKQGWLHKANVNST
ITVTMKVFKRRYFYLTQLPDGSYILNSYKDEKNSKESKGCIYLDACIDVVQCPKMRRHAF
ELKMLDKYSHYLAAETEQEMEEWLIMLKKIIQINTDSLVQEKKDTVEAIQEEETSSQGKA
ENIMASLERSMHPELMKYGRETEQLNKLSRGDGRQNLFSFDSEVQRLDFSGIEPDVKPFE
EKCNKRFMVNCHDLTFNILGHIGDNAKGPPTNVEPFFINLALFDVKNNCKISADFHVDLN
PPSVREMLWGTSTQLSNDGNAKGFSPESLIHGIAESQLCYIKQGIFSVTNPHPEIFLVVR

Sample Output (first line of file is actually blank):

>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFPMEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQFFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVDIIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEMNLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDLDGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDPLRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYEDFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEKDEDSSSLCSQKGGVIKQGWLHKANVNSTITVTMKVFKRRYFYLTQLPDGSYILNSYKDEKNSKESKGCIYLDACIDVVQCPKMRRHAFELKMLDKYSHYLAAETEQEMEEWLIMLKKIIQINTDSLVQEKKDTVEAIQEEETSSQGKAENIMASLERSMHPELMKYGRETEQLNKLSRGDGRQNLFSFDSEVQRLDFSGIEPDVKPFEEKCNKRFMVNCHDLTFNILGHIGDNAKGPPTNVEPFFINLALFDVKNNCKISADFHVDLNPPSVREMLWGTSTQLSNDGNAKGFSPESLIHGIAESQLCYIKQGIFSVTNPHPEIFLVVR
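Since editors can wrap long lines on screen, a more reliable check than looking at the file is to test its structure directly. Below is a small sketch in Python (not Raku) that assumes a linearized file alternates header and sequence lines once blank lines are ignored; `is_linearized` is a hypothetical name used only for this illustration.

```python
def is_linearized(text: str) -> bool:
    """Return True if headers and single sequence lines strictly alternate."""
    # str.splitlines() also splits on a bare carriage return, so a stray
    # control-M inside a sequence shows up as an extra line here.
    lines = [ln for ln in text.splitlines() if ln.strip()]  # ignore blank lines
    if len(lines) % 2 != 0:
        return False
    headers, seqs = lines[0::2], lines[1::2]
    return all(h.startswith(">") for h in headers) and not any(
        s.startswith(">") for s in seqs
    )


# Shortened, hypothetical records; the leading blank line is tolerated.
print(is_linearized("\n>P1\nMPPRRSIVEV\n>P2\nMAEVR\n"))
print(is_linearized(">P1\nMPPRR\nSIVEV\n"))
```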

https://raku.org/
