I linearized a FASTA file using awk on a remote computer. When I opened it in nano, it showed that the file had been linearized. However, when I downloaded the file to my local computer and viewed it in Notepad, the file I had generated was back in its original wrapped format. Could you please advise what the reason could be?
This is the sequence:
>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP
MEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ
FFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVD
IIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEM
NLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDL
DGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP
LRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE
DFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEKDEDSSSLCSQKGGVIKQGWLHKANVNST
ITVTMKVFKRRYFYLTQLPDGSYILNSYKDEKNSKESKGCIYLDACIDVVQCPKMRRHAF
ELKMLDKYSHYLAAETEQEMEEWLIMLKKIIQINTDSLVQEKKDTVEAIQEEETSSQGKA
ENIMASLERSMHPELMKYGRETEQLNKLSRGDGRQNLFSFDSEVQRLDFSGIEPDVKPFE
EKCNKRFMVNCHDLTFNILGHIGDNAKGPPTNVEPFFINLALFDVKNNCKISADFHVDLN
PPSVREMLWGTSTQLSNDGNAKGFSPESLIHGIAESQLCYIKQGIFSVTNPHPEIFLVVR
Then I used awk to linearize it as follows:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < Sequences.fa > out3.fasta
The output was:
>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP^MMEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ^MFFETRPEDLNPPKEEHIGKKKSGNDPTSVDPM$
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP^MLRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE^MDFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEK$
The problem now is that when I download it to my local computer and view it in Notepad (on my Windows computer) or MEGA, it goes back to the wrapped format. What could be the reason for this? Another issue I faced was that when I tried to remove the carets (^) in the sequences using sed 's/\^//g' out3.fasta > seq3.fasta, it did not remove them. The $ is the line break.
(1) notepad in Windows has an option to wrap lines. It's a display option: it affects how files are displayed, the actual content is not affected. Are you sure "back to the wrapped format" is a real issue with the content? (2) The output contains ^ characters and I cannot get them from the awk code and the input you posted. Does your input contain ^M? It's possible all these problems are because you're using *nix tools to process a file with Windows line endings. Use dos2unix first. – Kamil Maciorowski Dec 30 '22 at 21:17
[...] ^M. Thank you, I will read up on dos2unix and then give it a try. – thole Dec 31 '22 at 03:44
[...] wc -l on Linux and find /v /c "" on Windows. – seshoumara Dec 31 '22 at 03:46
[...] ^Ms then the script you wrote to linearize it is adding them as they are present in the output you posted. P^MM in your output, for example, is <P><control-M><M>. There is no caret (^) in that string or anywhere else in your data, which is why your sed command couldn't find/replace it; the ^ you see in every ^M is just part of the control-M display, not a separate character. – Ed Morton Dec 31 '22 at 14:42
[...] WSL (Windows Subsystem for Linux)? You should even be able to run nano on your local computer with WSL. https://learn.microsoft.com/en-us/windows/wsl/install – jubilatious1 Jan 19 '23 at 21:53
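To make the point about Windows line endings concrete, here is a minimal sketch of how one might confirm that the ^M markers are carriage returns and then linearize while stripping them. The file names sample.fa and linear.fa and the short sequences are made up for illustration, and the sketch assumes a POSIX shell with grep and awk; it is not the asker's exact pipeline.

```shell
# Build a small FASTA file with Windows (CRLF) line endings to
# reproduce the problem. sample.fa and its contents are hypothetical.
printf '>P1\r\nMPPRR\r\nSIVEV\r\n>P2\r\nMAEVR\r\nKFTKR\r\n' > sample.fa

# Hold a literal carriage return in a variable so that grep can match it.
cr=$(printf '\r')

# Count lines that end in a carriage return. A non-zero count means the
# file has CRLF line endings.
crlf_lines=$(grep -c "$cr\$" sample.fa || true)
echo "lines ending in CR: $crlf_lines"

# Count lines containing a literal caret. Editors display a carriage
# return as ^M, but it is a single control byte, so this finds nothing,
# which is why sed 's/\^//g' had no effect.
caret_lines=$(grep -c '\^' sample.fa || true)
echo "lines containing a literal caret: $caret_lines"

# Linearize: strip the trailing CR from every line first, then join the
# sequence lines. Each header after the first ends the previous record.
awk '{ sub(/\r$/, "") }
     /^>/ { if (NR > 1) printf "\n"; print; next }
     { printf "%s", $0 }
     END { printf "\n" }' sample.fa > linear.fa

cat linear.fa
```

On the toy input this reports six lines ending in CR and zero lines containing a caret, and linear.fa ends up with each header followed by a single joined sequence line. Running dos2unix on the input first, as suggested in the comments, or filtering it through tr -d '\r' should make the sub() call unnecessary.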