Removing special characters from a fasta file

Question

I recently linearized a FASTA file using awk. The output looks right, except that there is a caret (^) in my sequences, and I want to remove it. Below is my attempt; any assistance is highly appreciated.

>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP^MMEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ^MFFETRPEDLNPPKEEHIGKKKSGNDPTSVDPM
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP^MLRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE^MDFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEK
>P3
GDDSEWLKLPVDQKCEHKLWKARLSGYEEALKIFQKIKDEKSPEWSKYLGLIKKFVTDS^MNAVVQLKGLEAALVYVENAHVAGKTTGEVVSGVVSKAKELGIEICLMYVEIE^MKGESVQEELLKGLDNKNPKIIVACIETLRKALS

I tried using:

$ sed '/s: ^// seq2.fa>seq3.fa

The code above gives me the error sed: -e expression #1, char 7: unknown command: '/'. Any assistance is appreciated, thanks.

Those aren't carets, they are ^M, carriage return characters. If you just remove the ^, you will get the wrong sequence. – terdon, Dec 31 '22 at 15:02
@terdon Thank you, I did not know that. I also found an alternative way of linearizing the sequences using biopython, and I got the desired output where there were no ^M. – thole, Dec 31 '22 at 15:25
Yes, most likely because you used rstrip(), which will remove both \r and \n. But seriously, I cannot stress this enough: do NOT try to use both Windows and non-Windows systems on the same file unless you always remember to convert between the line endings. Even better, if you're doing bioinformatics, just don't use Windows at all. – terdon, Dec 31 '22 at 15:44
Oh, and if you used any of the solutions here that don't handle ^M but instead focus on ^, you will have borked your sequences with extra methionine residues, so you really want to fix that too. – terdon, Dec 31 '22 at 15:45

score 2 · Accepted Answer · edited Dec 29 '22 at 11:22

sed 's/\^//' seq2.fa>seq3.fa (to remove the first caret on each line) or sed 's/\^//g' seq2.fa>seq3.fa (to remove all carets from each line) is what you're looking for.
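
To see the difference between the two forms, here is a minimal sketch on a made-up line. The sample string is hypothetical, and this only applies if the file really contains literal ^ characters (see the comments on the question, which suggest it does not).

```shell
# Hypothetical sample line with two literal carets (not a real sequence).
line='MKV^LLA^GGT'

# Without the g flag, only the first caret on the line is removed.
first_only=$(printf '%s\n' "$line" | sed 's/\^//')

# With the g flag, every caret on the line is removed.
all_removed=$(printf '%s\n' "$line" | sed 's/\^//g')

printf '%s\n%s\n' "$first_only" "$all_removed"
```

This should print MKVLLA^GGT followed by MKVLLAGGT.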

edited Dec 29 '22 at 11:22 by Kusalananda
answered Dec 29 '22 at 05:37 by stoney

The OP's description is misleading. Those are ^M not ^. – terdon Dec 31 '22 at 15:06

Kusalananda · Answer 2 · 2022-12-29T10:34:22.927

1

If you want to remove all ^ characters from anywhere in your file, you may use tr like so:

tr -d '^' <seq2.fa >seq3.fa

The tr utility is the most efficient tool for manipulating single characters. It can delete, replace, or "squeeze" (replace multiple consecutive ones with a single one) characters. It does, however, not allow you to use any logic.
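
As a quick illustration of the three operations mentioned, here is a small sketch on made-up input strings (the strings are hypothetical, chosen only to make the effect visible):

```shell
# Delete: drop every caret from the input.
deleted=$(printf 'AC^GT^\n' | tr -d '^')

# Replace: map each lowercase letter to its uppercase counterpart.
replaced=$(printf 'acgt\n' | tr '[:lower:]' '[:upper:]')

# Squeeze: collapse a run of repeated N characters into a single N.
squeezed=$(printf 'ACNNNNGT\n' | tr -s 'N')

printf '%s\n' "$deleted" "$replaced" "$squeezed"
```

This should print ACGT, ACGT and ACNGT, one per line. Note that tr always works on the whole stream, so it cannot skip the header lines.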

If you only want to remove the character from any sequence line and avoid touching the fasta header lines:

sed '/^>/! s/\^//g' <seq2.fa >seq3.fa

This triggers the substitution command s/\^//g (which I believe you tried to use but got the order of the slashes slightly wrong) on any line that does not start with a > character. The substitution removes any ^ character on the line by repeatedly replacing it with nothing until no such character is left.

The ^ needs to be escaped since it otherwise would act as an anchor, anchoring the regular expression to the start of the line.
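
A small sketch of the header-skipping behaviour, using a made-up record that has a literal caret in both the header and the sequence (the record is hypothetical, not taken from the question):

```shell
# Hypothetical record: one caret in the header line, two in the sequence line.
cleaned=$(printf '>P1^test\nMK^VL^A\n' | sed '/^>/! s/\^//g')

# The header keeps its caret; the sequence line loses both.
printf '%s\n' "$cleaned"
```

The unescaped ^ in /^>/ is the start-of-line anchor, while the escaped \^ in the substitution matches the literal character.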

edited Dec 29 '22 at 10:34

answered Dec 29 '22 at 06:15

The OP's description is misleading. Those are ^M not ^. – terdon Dec 31 '22 at 15:05
@terdon This is clear from their other question (Linearizing a fasta file and removing special characters in), but since this was not clear in this question, I'll let my answer be an answer to the way the query is currently formulated. – Kusalananda Dec 31 '22 at 18:13
Yeah, I hadn't seen the other one when I wrote here. – terdon Dec 31 '22 at 18:14

terdon · Answer 3 · 2022-12-31T15:10:27.880

Those aren't carets (^). Windows systems use \r\n (carriage return followed by newline) to indicate the end of a line, unlike *nix systems which just use \n. The \r are often represented as ^M. See for example:

$ printf 'a\r\n' | cat -v
a^M

Indeed, I ran BLAST on one of your sequences (after removing the ^ but leaving the M) against nr and found an almost perfect hit, except that the extra Ms fall on gaps in the alignment.

I am guessing you did something with this file on a Windows system and that is what added the \r or ^M that you see. Note how each ^ is actually ^M in your example. As confirmed by the blast hit above, those are not real methionines, and you want to remove the M as well as the ^. So try something like this:

tr -d '\r' < seq2.fa > seq3.fa

Or, if the processing you have done to the file has entered a literal ^ and M, remove both:

sed 's/\^M//g' seq2.fa > seq3.fa

If you just remove the ^ you will have a wrong sequence with extra methionines.
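
To make the difference concrete, here is a rough sketch on a made-up linearized record. The sequence is hypothetical, and it assumes a cat that supports -v (GNU and BSD cat both do):

```shell
# Hypothetical linearized record: the CRs that used to end each line
# are now stranded in the middle of the sequence.
record=$(printf '>P1\nMKV\rLLA\rGGT\n')

# cat -v makes the carriage returns visible as ^M.
visible=$(printf '%s\n' "$record" | cat -v)

# Deleting only the caret glyph from that view leaves two spurious M residues.
wrong=$(printf '%s\n' "$visible" | tr -d '^')

# Deleting the actual CR bytes recovers the intended sequence.
right=$(printf '%s\n' "$record" | tr -d '\r')

printf '%s\n' "$visible" "$wrong" "$right"
```

The sequence line should come out as MKV^MLLA^MGGT, then MKVMLLAMGGT (wrong), then MKVLLAGGT (right).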

jubilatious1 · Answer 4 · 2023-01-19T23:51:47.510

Using Raku (formerly known as Perl_6)

~$ raku -e 'my $data = 1; my $fh = open($*OUT, :nl-out("\n\r")); $fh.put: $data;' | od -bc
0000000   061 012 015
           1  \n  \r
0000003
~$ raku -e 'my $data = 1; my $fh = open($*OUT, :nl-out("\n")); $fh.put: $data;' | od -bc
0000000   061 012
           1  \n
0000002

The problem the OP has encountered is due to improper end-of-line processing. To perform end-of-line processing properly, you need a language that can control this parameter. Fortunately, Raku is such a language.

Above, data is stored in the variable $data, and a filehandle named $fh is opened. The adverbial parameter :nl-out is used to set the output terminator (either \n\r or \n), and the data is output on $*OUT stdout using the proper terminator.

So, if you have a FASTA file, you can set the proper :nl-out("\n") terminator when re-opening the file on Unix/Linux systems. Of course, you can get carried away with this; see below. That's all, folks!

~$ raku -e 'my $data = 1; my $fh = open($*OUT, :nl-out("thats-all-folks")); $fh.put: $data;' | od -bc
0000000   061 164 150 141 164 163 055 141 154 154 055 146 157 154 153 163
           1   t   h   a   t   s   -   a   l   l   -   f   o   l   k   s
0000020
~$ raku -e 'my $data = 1; my $fh = open($*OUT, :nl-out("")); $fh.put: $data;' | od -bc
0000000   061
           1
0000001

(A similar adverbial parameter named :nl-in is used to control how newlines are interpreted when reading files into Raku. But since Raku auto-chomps by default, it's much less important).
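
For readers without Raku at hand, a rough shell analogue (not part of the Raku approach above) is awk, whose ORS variable sets the output record terminator. The sketch below assumes a POSIX awk and uses a made-up CRLF record; it strips the trailing CR from each line and writes a bare newline instead.

```shell
# Hypothetical CRLF input. sub() strips the trailing CR from each record,
# and ORS (the output record separator) is set explicitly to a bare newline.
normalized=$(printf '>P1\r\nMKVLA\r\n' |
  awk 'BEGIN { ORS = "\n" } { sub(/\r$/, ""); print }')

# Count any carriage returns left in the result (should be none).
crs=$(printf '%s' "$normalized" | tr -cd '\r' | wc -c | tr -d ' ')

# Inspect the bytes, in the same spirit as the od calls above.
printf '%s\n' "$normalized" | od -c
printf 'carriage returns left: %s\n' "$crs"
```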

See the page: "Newline handling in Raku" for further information.

https://raku.org
