Compare two DNA sequences

Question

I have two DNA sequence ATGCATGC and TACGTTGC

I want to write a program which gives "+" if upon comparing A align with T and G with C otherwise print "-"

Like

ATGCATGC
TACGTTGC
+++++---

Can anyone help me out?

Output which I expect is given above.

What I tried is as following:

#!/bin/bash
declare -a seq1=()
declare -a seq2=()
read -p 'Enter the ncleotide seq (charactor by charactor followe\d by space0) ' -a seq1
read -n1 seq1
read -p 'Enter the ncleotide seq (charactor by charactor followe\d by space0) ' -a seq2
read -n1 seq2
for a in ${seq1[*]} ; do
for b in ${seq2[*]} ; do
if [ $a == A ] || [ $b == T ] ; then
echo -n "+"
elif [ $a == A ] || [ $b == C ] ; then
echo -n " -"
elif [ $a == A ] || [ $b == G ] ; then
echo -n "-"
elif [ $a == T ] || [ $b == A ] ; then
echo -n "+"
elif [ $a == T ] || [ $b == C ] ; then
echo -n "-"
elif [ $a == T ] || [ $b == G ] ; then
echo -n "-"
elif [ $a == C ] || [ $b == G ] ; then
echo -n "+"
elif [ $a == C ] || [ $b == A ] ; then
echo -n "-"
elif [ $a == C ] || [ $b == T ] ; then
echo -n "-"
elif [ $a == G ] || [ $b == C ] ; then
echo -n "+"
elif [ $a == G ] || [ $b == A ] ; then
echo -n "-"
elif [ $a == G ] || [ $b == T ] ; then
echo -n "-"
else

echo $a $b
fi
done
done

What have you tried so far? StackExchange is not a script writing service — Chris Davies, Apr 10 '21 at 07:06
Add the code of that if condition with array to your question. — berndbausch, Apr 10 '21 at 08:42
are the sequences in separate files? or in consecutive lines in the same file? are there just two sequences, or are there more? what should happen if the sequences aren't the same length? — cas, Apr 10 '21 at 08:48
I used read -p flag as I want to prompt the 1st seq then second sequence then it will do the required job. So far I want to take only same length sequence. — Hanuman Singh Dagur, Apr 10 '21 at 08:52
read? oh, you're trying to do it with a shell script? don't do that, use awk or perl instead (note that perl has a lot of existing libraries for bioinformatics, see https://bioperl.org/). See Why is using a shell loop to process text considered bad practice? — cas, Apr 10 '21 at 09:19
Perl I don't know. Can you help me using awk ? I am new to programming. — Hanuman Singh Dagur, Apr 10 '21 at 10:20
I attached what I have tried. When I execute it gives false results. Like if I type A in first array and C in second array it gives + sign but it should give - sign. Could you please correct it or write new one. — Hanuman Singh Dagur, Apr 10 '21 at 11:56
Why in the world are you trying to do this in bash? It is a very, very bad tool for the job, and there are dozens of programs already written that do this sort of thing far better. From a biological point of view, this is pointless since you don't allow for gaps or shifting the sequence left or right. And from a programming point of view, trying to use the shell for something like that is just painful. Please use a proper sequence aligner, we have been working on this problem for more than three decades now. — terdon, Apr 10 '21 at 18:10

αғsнιη · Answer 1 · 2021-04-10T14:57:18.813

Using awk:

awk -v seq1='CATGCATGCTCAT' -v seq2='ATACGTTGCGTTA' '
function sign(s) { cmp=(cmp==""?"":cmp) s }
BEGIN{
    split(seq1, tmp1, ""); split(seq2, tmp2, "");
    for(i in tmp1) {
        if( tmp1[i] == tmp2[i] ){ sign("-"); continue }
        if( (tmp1[i] ~/[AT]/ && tmp2[i] ~/[AT]/) || 
            (tmp1[i] ~/[GC]/ && tmp2[i] ~/[GC]/) ) { sign("+"); continue }
        sign("-")
    }
    print seq1 ORS seq2 ORS cmp
}'

Output:

CATGCATGCTCAT
ATACGTTGCGTTA
-+++++-----++

Same code with explanations commented:

awk -v seq1='CATGCATGCTCAT' -v seq2='ATACGTTGCGTTA' '
# set sequences one in seq1 another in seq2
function sign(s) { cmp=(cmp==""?"":cmp) s }
function to join the the changes on +/- for each pair of chars
BEGIN{
    split(seq1, tmp1, ""); split(seq2, tmp2, "");
    # split each sequences characters into individual arrays
for(i in tmp1) {
# loop over keys on one of the arrays (assuming length of both seq will be same)

    if( tmp1[i] == tmp2[i] ){ sign(&quot;-&quot;); continue }
    # if both chars were same AA, TT, CC, GG, ..., sign should be &quot;-&quot;

    if( (tmp1[i] ~/[AT]/ &amp;&amp; tmp2[i] ~/[AT]/) || 
        (tmp1[i] ~/[GC]/ &amp;&amp; tmp2[i] ~/[GC]/) ) { sign(&quot;+&quot;); continue }
    # if one was &quot;A&quot; and another was &quot;T&quot; or vice-versa as well as
    # if one was &quot;G&quot; and another was &quot;C&quot; or vice-versa, sign should be &quot;+&quot;

    sign(&quot;-&quot;)
    # otherwise &quot;-&quot;

}
print seq1 ORS seq2 ORS cmp
# print the last result, first printing the sequences then 
# the comparison result in 'cmp'

}'

guest_7 · Answer 2 · 2021-04-12T14:29:54.513

read seq1
read seq2
printf '%s\n' "$seq1" "$seq2" |
sed -Ee '
  N;H;p;z;x

  :zip
    #   1  2     3
    s/\n(.)(.*\n)(.)/\1\3\n\2/
    s/\n\n//; # zipping is complete when the two markers collide
  t zip

  :cmp
  s/^([-+]*)(AT|TA|CG|GC)/\1+/;t cmp
  s/[ATCG]{2}/-/;t cmp
'

Output:

ATGCATGCTATC
TACGTTGCATTG
+++++---++-+

Another approach this time using awk utility where we zip the two sequences char by char and with the help of an associative array h which is pre-filled with the appropriate keys to give a + whilst the rest get a -

awk '
BEGIN {
  OFS = ORS
split("AT TA CG GC", a)
  for (var in a) h[a[var]]
seq1 = ARGV[1]; seq2 = ARGV[2]
  len = length(seq1)
while ( ++pos <= len ) {
    s = substr(seq1, pos, 1) 

        substr(seq2, pos, 1) ;
    res = res ((s in h) ? "+" : "-")
  }
print seq1, seq2, res
}
' "$seq1" "$seq2"

We use python3 to perform the sequence comparisons stored in strings and zipped

python3 -c 'import sys
s,t = sys.argv[1::]
dna = "ATCG"
d = {i+j: "+" for i,j in zip(dna[0::2],dna[1::2])}
print(*[d.get(x+y,d.get(y+x,"-")) for x,y in zip(s,t)],sep="")
' "$seq1" "$seq2"

Perl adaptation of the python example above:

printf '%s\n' "$seq1" "$seq2" |
perl -F// -pale '
  my @h = qw(A T C G);
  my %h = (@h, reverse @h);
  my $zip = join q(), map { $F[$a++], $_ } <> =~ /./g;
  $_ = $zip =~ s/(.)(.)/qw(- +)[$2 eq $h{$1}]/reg;
'

Kusalananda · Answer 3 · 2021-04-11T15:49:33.680

Using bash and tr:

#!/bin/bash
printf '%s\n%s\n' "$1" "$2"
tr 37124568 '++[-*]' <<<$(( $(tr ATCG 1234 <<<"$1 + $2") ))

Testing:

$ bash script ACCTACCATAG AGTACCCGATC
ACCTACCATAG
AGTACCCGATC
-+-+----+++

The script replaces the characters in the sequences given on the command line with digits in such a way that when you add them together, the digits 3 and 7 in the sum means + and all other digits mean -.

The character are changed in the following manner:

A --> 1
T --> 2
C --> 3
G --> 4

This means that A + T is the same as 1+2, which is 3. In the same way, 7 comes from 3+4 for C and G.

The outer tr invocation transforms all 3s and 7s into + and all other possible digits to -. The sum is calculated by an arithmetic expansion, $(( ... )), which performs the addition of the two numbers resulting from the inner tr invocation that just changes the letters to digits in the two sequences.

No validation is done on the input. There is no check for arithmetic errors, like overflow (which is likely to occur for sequences longer than a handful of base pairs, unless you take care to break them up e.g. by character).

Would you want to read the sequences from standard input instead, use

#!/bin/bash
read s1
read s2
printf '%s\n%s\n' "$s1" "$s2"
tr 37124568 '++[-*]' <<<$(( $(tr ATCG 1234 <<<"$s1 + $s2") ))

This is really clever! You might want to explain the logic of why 3 and 7 are special though, it took me a while to figure it out even knowing how tr works. — terdon, Apr 11 '21 at 12:12
@terdon Done, I think. This is only ever useful for short sequences though as you'd expect to get arithmetic overflows for longer sequences, unless you split them up. — Kusalananda, Apr 11 '21 at 13:01

Compare two DNA sequences

3 Answers3

function to join the the changes on +/- for each pair of chars