
I recently linearized a FASTA file using awk. The output is perfect, except that there is a caret (^) in my sequences. I want to remove this caret. Below is my attempt; any assistance is highly appreciated.

>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP^MMEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ^MFFETRPEDLNPPKEEHIGKKKSGNDPTSVDPM
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP^MLRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE^MDFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEK
>P3
GDDSEWLKLPVDQKCEHKLWKARLSGYEEALKIFQKIKDEKSPEWSKYLGLIKKFVTDS^MNAVVQLKGLEAALVYVENAHVAGKTTGEVVSGVVSKAKELGIEICLMYVEIE^MKGESVQEELLKGLDNKNPKIIVACIETLRKALS

I tried using:

$ sed '/s: ^// seq2.fa>seq3.fa

The code above gives me this error: sed: -e expression #1, char 7: unknown command: `/'. Any assistance is appreciated, thanks.

user9101329
thole
  • 1
    Those aren't carets, they are ^M, carriage return characters. If you just remove the ^, you will get the wrong sequence. – terdon Dec 31 '22 at 15:02
  • @terdon Thank you, I did not know that. I also found an alternative way of linearizing the sequences using biopython, and I got the desired output, where there were no ^M – thole Dec 31 '22 at 15:25
  • Yes, most likely because you used rstrip() which will remove both \r and \n. But seriously, I cannot stress this enough: do NOT try to use both Windows and non-Windows systems on the same file unless you always remember to convert between the line endings. Even better, if you're doing bioinformatics, just don't use Windows at all. – terdon Dec 31 '22 at 15:44
  • 1
    Oh, and if you used any of the solutions here that don't handle ^M but instead focus on ^, you will have borked your sequences with extra methionine residues so you really want to fix that too. – terdon Dec 31 '22 at 15:45

4 Answers


sed 's/\^//' seq2.fa>seq3.fa (to remove the first caret on each line) or sed 's/\^//g' seq2.fa>seq3.fa (to remove all carets from each line) is what you're looking for.
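The difference between the two forms can be seen on a made-up input string (not the real file). This is only a sketch, and it only applies if the file really contains literal ^ characters:

```shell
# Hypothetical one-line input containing two literal carets.
first_only=$(printf 'AB^CD^EF\n' | sed 's/\^//')    # without g: first caret only
all_carets=$(printf 'AB^CD^EF\n' | sed 's/\^//g')   # with g: every caret on the line

printf '%s\n' "$first_only"
printf '%s\n' "$all_carets"
```

The first command leaves the second caret in place; the second removes both.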

Kusalananda
stoney

If you want to remove all ^ characters from anywhere in your file, you may use tr like so:

tr -d '^' <seq2.fa >seq3.fa

The tr utility is an efficient tool for manipulating single characters. It can delete, replace, or "squeeze" (replace multiple consecutive ones with a single one) characters. It does not, however, let you apply any logic, such as skipping certain lines.
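As a small illustration of those three operations, here is a sketch on made-up strings (the inputs and the replacement character - are assumptions for the demo, not part of the FASTA file):

```shell
# Made-up sample strings, not taken from seq2.fa.
deleted=$(printf 'AB^CD^EF\n' | tr -d '^')     # delete every ^
replaced=$(printf 'AB^CD^EF\n' | tr '^' '-')   # replace each ^ with -
squeezed=$(printf 'AB^^^CD\n' | tr -s '^')     # squeeze a run of ^ into one

printf '%s\n' "$deleted" "$replaced" "$squeezed"
```

Note that all three act on every line of the input, header lines included.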

If you only want to remove the character from any sequence line and avoid touching the fasta header lines:

sed '/^>/! s/\^//g' <seq2.fa >seq3.fa

This triggers the substitution command s/\^//g (which I believe you tried to use but got the order of the slashes slightly wrong) on any line that does not start with a > character. The substitution removes any ^ character on the line by repeatedly replacing it with nothing until no such character is left.

The ^ needs to be escaped since it otherwise would act as an anchor, anchoring the regular expression to the start of the line.
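To make the anchor-versus-literal distinction concrete, here is a minimal sketch on a hypothetical two-line record whose header deliberately contains a ^ (the record is invented for the demo):

```shell
# Hypothetical FASTA record; the header contains a ^ on purpose.
sample=$(printf '>P1^x\nAB^CD\n')

# Unescaped ^ is an anchor: it matches the empty string at the start of
# each line, so replacing it with nothing leaves the text unchanged.
anchored=$(printf '%s\n' "$sample" | sed 's/^//g')

# Escaped \^ matches the literal character; '/^>/!' skips header lines.
escaped=$(printf '%s\n' "$sample" | sed '/^>/! s/\^//g')

printf '%s\n' "$anchored" "$escaped"
```

The header keeps its ^ in both cases; only the escaped form changes the sequence line.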

Kusalananda

Those aren't carets (^). Windows systems use \r\n (carriage return followed by newline) to indicate the end of a line, unlike *nix systems which just use \n. The \r are often represented as ^M. See for example:

$ printf 'a\r\n' | cat -v
a^M

Indeed, I blasted one of your sequences (after removing the ^ but leaving the M) against nr and found an almost perfect hit, but the extra Ms are gaps:

[Image: BLAST result showing the Ms are wrong]

I am guessing you did something with this file on a Windows system and that is what added the \r or ^M that you see. Note how each ^ is actually ^M in your example. As confirmed by the blast hit above, those are not real methionines, and you want to remove the M as well as the ^. So try something like this:

tr -d '\r' < seq2.fa > seq3.fa

Or, if the processing you have done to the file has entered a literal ^ and M, remove both:

sed 's/\^M//g' seq2.fa > seq3.fa

If you just remove the ^ you will have a wrong sequence with extra methionines.
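To check which of the two cases applies before picking a command, one option is to count the lines that contain a real carriage return byte. The sketch below uses a made-up CRLF file in a temporary directory (the file names and the record are assumptions for the demo):

```shell
# Build a small CRLF-terminated FASTA file (made-up record and file names).
tmpdir=$(mktemp -d)
printf '>P1\r\nMPPRR\r\nSIVEV\r\n' > "$tmpdir/crlf.fa"

# A real carriage return is the single byte 0x0D.
cr=$(printf '\r')

# Count the lines that contain a real carriage return.
cr_lines=$(grep -c "$cr" "$tmpdir/crlf.fa" || true)

# Strip every carriage return, then count again.
tr -d '\r' < "$tmpdir/crlf.fa" > "$tmpdir/unix.fa"
cr_left=$(grep -c "$cr" "$tmpdir/unix.fa" || true)
cleaned=$(cat "$tmpdir/unix.fa")

rm -r "$tmpdir"
printf 'before: %s, after: %s\n' "$cr_lines" "$cr_left"
```

If the count is already zero but cat -v still shows ^M, the file holds the two literal characters ^ and M, and the sed 's/\^M//g' variant is the one to use.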

terdon

Using Raku (formerly known as Perl_6)

~$ raku -e 'my $data = 1; my $fh = open($*OUT, :nl-out("\n\r")); $fh.put: $data;' | od -bc
0000000   061 012 015
           1  \n  \r
0000003
~$ raku -e 'my $data = 1; my $fh = open($*OUT, :nl-out("\n")); $fh.put: $data;' | od -bc
0000000   061 012
           1  \n
0000002

The problem the OP has encountered is due to improper end-of-line processing. To perform end-of-line processing properly, you need a language that can control this parameter. Fortunately, Raku is such a language.

Above, data is stored in the variable $data, and a filehandle named $fh is opened. The adverbial parameter :nl-out is used to set the output terminator (either \n\r or \n), and the data is output on $*OUT stdout using the proper terminator.

So, if you have a FASTA file, you can set the proper :nl-out("\n") terminator when re-opening the file on Unix/Linux systems. Of course, you can get carried away with this; see below. That's all folks!

~$ raku -e 'my $data = 1; my $fh = open($*OUT, :nl-out("thats-all-folks")); $fh.put: $data;' | od -bc
0000000   061 164 150 141 164 163 055 141 154 154 055 146 157 154 153 163
           1   t   h   a   t   s   -   a   l   l   -   f   o   l   k   s
0000020
~$ raku -e 'my $data = 1; my $fh = open($*OUT, :nl-out("")); $fh.put: $data;' | od -bc
0000000   061
           1
0000001

(A similar adverbial parameter named :nl-in is used to control how newlines are interpreted when reading files into Raku. But since Raku auto-chomps by default, it's much less important).

See the page: "Newline handling in Raku" for further information.

https://raku.org

jubilatious1