I linearized a FASTA file using awk on a remote computer.
When I opened it in nano, it showed that the file had been linearized. However, when I downloaded the file to my local computer and viewed it in Notepad, the file I had generated was back in its original wrapped format. Could you please advise what the reason could be?
This is the sequence:
>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP
MEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ
FFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVD
IIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEM
NLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDL
DGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP
LRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE
DFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEKDEDSSSLCSQKGGVIKQGWLHKANVNST
ITVTMKVFKRRYFYLTQLPDGSYILNSYKDEKNSKESKGCIYLDACIDVVQCPKMRRHAF
ELKMLDKYSHYLAAETEQEMEEWLIMLKKIIQINTDSLVQEKKDTVEAIQEEETSSQGKA
ENIMASLERSMHPELMKYGRETEQLNKLSRGDGRQNLFSFDSEVQRLDFSGIEPDVKPFE
EKCNKRFMVNCHDLTFNILGHIGDNAKGPPTNVEPFFINLALFDVKNNCKISADFHVDLN
PPSVREMLWGTSTQLSNDGNAKGFSPESLIHGIAESQLCYIKQGIFSVTNPHPEIFLVVR
Then I used awk to linearize it as follows:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < Sequences.fa > out3.fasta
The output was:
>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP^MMEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ^MFFETRPEDLNPPKEEHIGKKKSGNDPTSVDPM$
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP^MLRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE^MDFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEK$
The problem now is that when I download it to my local computer and view it in Notepad (on my Windows computer) or MEGA, it goes back to the wrapped format. What could be the reason for this? Another issue I faced was that when I tried to remove the carets (^) in the sequences using sed 's/\^//g' out3.fasta > seq3.fasta, it did not remove them. The $ is the line break.
(1) notepad in Windows has an option to wrap lines. It's a display option: it affects how files are displayed, the actual content is not affected. Are you sure "back to the wrapped format" is a real issue with the content? (2) The output contains ^ characters and I cannot get them from the awk code and the input you posted. Does your input contain ^M? It's possible all these problems are because you're using *nix tools to process a file with Windows line endings. Use dos2unix first. – Kamil Maciorowski Dec 30 '22 at 21:17
[...] ^M. Thank you, I will read up on dos2unix and then give it a try. – thole Dec 31 '22 at 03:44
[...] wc -l on Linux and find /v /c "" on Windows. – seshoumara Dec 31 '22 at 03:46
[...] ^Ms then the script you wrote to linearize it is adding them as they are present in the output you posted. P^MM in your output, for example, is <P><control-M><M>. There is no caret (^) in that string or anywhere else in your data, which is why your sed command couldn't find/replace it; the ^ you see in every ^M is just part of the control-M display, not a separate character. – Ed Morton Dec 31 '22 at 14:42
[...] WSL (Windows Subsystem for Linux)? You should even be able to run nano on your local computer with WSL. https://learn.microsoft.com/en-us/windows/wsl/install – jubilatious1 Jan 19 '23 at 21:53
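Following the comments, the ^M markers are most likely carriage returns from Windows (CRLF) line endings. The awk command joins the sequence lines but keeps each carriage return inside the joined line, and a viewer that treats a bare carriage return as a line break may then show the sequence as wrapped again. Below is a minimal sketch of one way to check for carriage returns and strip them before linearizing. It assumes a POSIX shell with grep, tr and awk. The short sample file is made up and stands in for Sequences.fa, and the awk program is a slight variant of the one in the question that avoids the leading blank line.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up sample with Windows (CRLF) line endings, standing in for Sequences.fa.
printf '>P1\r\nMPPRR\r\nMEGGQ\r\n>P2\r\nMAEVR\r\nLRDLL\r\n' > Sequences.fa

# Count lines that end in a carriage return; a non-zero count means CRLF endings.
crlf_lines=$(grep -c "$(printf '\r')\$" Sequences.fa)
echo "lines ending in CR: $crlf_lines"

# Delete every carriage return, then join the sequence lines under each header.
tr -d '\r' < Sequences.fa |
  awk '/^>/ { if (NR > 1) printf("\n"); print; next }  # header: close previous record
       { printf("%s", $0) }                            # sequence line: append, no newline
       END { printf("\n") }' > out3.fasta              # terminate the last record

cat out3.fasta
```

Running dos2unix on the input first, as suggested in the comments, should achieve the same as the tr step. To clean an already linearized file, deleting the carriage returns with tr -d '\r' < out3.fasta > seq3.fasta is the likely fix, because the sed command in the question looks for a literal ^ that is not in the data.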