I have two genomic sequences that I want to compare letter by letter. They share almost the same pattern, but a few nucleotide transitions occur.

For example:

Sequence 1

ATGGCGATGAGCAGCGGCGGCAGTGGTGGCGGCGTCCCGGAGCAGGAGGATTCCGTGCTGTTCCGGCGCGGCACAGGCCAGAGCGATGATTCTGACATTTGGGATGATACAGCACTGATAAAAGCATATGATAAAGCTGTGGCTTCATTTAAGCATGCTCTAAAGAATGGTGACATTTGTGAAACTTCGGGTAAACCAAAAACCACACCTAAAAGAAAACCTGCTAAGAAGAATAAAAGCCAAAAGAAGAATACTGCAGCTTCCTTACAACAGTGGAAAGTTGGGGACAAATGTTCTGCCATTTGGTCAGAAGACGGTTGCATTTACCCAGCTACCATTGCTTCAATTGATTTTAAGAGAGAAACCTGTGTTGTGGTTTACACTGGATATGGAAATAGAGAGGAGCAAAATCTGTCCGATCTACTTTCCCCAATCTGTGAAGTAGCTAATAATATAGAACAAAATGCTCAAGAGAATGAAAATGAAAGCCAAGTTTCAACAGATGAAAGTGAGAACTCCAGGTCTCCTGGAAATAAATCAGATAACATCAAGCCCAAATCTGCTCCATGGAACTCTTTTCTCCCTCCACCACCCCCCATGCCAGGGCCAAGACTGGGACCAGGAAAGCCAGGTCTAAAATTCAATGGCCCACCACCGCCACCGCCACCACCACCACCCCACTTACTATCATGCTGGCTGCCTCCATTTCCTTCTGGACCACCAATAATTCCCCCACCACCTCCCATATGTCCAGATTCTCTTGATGATGCTGATGCTTTGGGAAGTATGTTAATTTCATGGTACATGAGTGGCTATCATACTGGCTATTATATGGGTTTCAGACAAAATCAAAAAGAAGGAAGGTGCTCACATTCCTTAAATTAA

Sequence 2

ATGGCGATGAGCAGCGGCGGCAGTGGTGGCGGCGTCCCGGAGCAGGAGGATTCCGTGCTGTTCCGGCGCGGCACAGGCCAGAGCGATGATTCTGACATTTGGGATGATACAGCACTGATAAAAGCATATGATAAAGCTGTGGCTTCATTTAAGCATGCTCTAAAGAATGGTGACATTTGTGAAACTTCGGGTAAACCAAAAACCACACCTAAAAGAAAACCTGCTAAGAAGAATAAAAGCCAAAAGAAGAATACTGCAGCTTCCTTACAACAGTGGAAAGTTGGGGACAAATGTTCTGCCATTTGGTCAGAAGACGGTTGCATTTACCCAGCTACCATTGCTTCAATTGATTTTAAGAGAGAAACCTGTGTTGTGGTTTACACTGGATATGGAAATAGAGAGGAGCAAAATCTGTCCGATCTACTTTCCCCAATCTGTGAAGTAGCTAATAATATAGAACAGAATGCTCAAGAGAATGAAAATGAAAGCCAAGTTTCAACAGATGAAAGTGAGAACTCCAGGTCTCCTGGAAATAAATCAGATAACATCAAGCCCAAATCTGCTCCATGGAACTCTTTTCTCCCTCCACCACCCCCCATGCCAGGGCCAAGACTGGGACCAGGAAAGCCAGGTCTAAAATTCAATGGCCCACCACCGCCACCGCCACCACCACCACCCCACTTACTATCATGCTGGCTGCCTCCATTTCCTTCTGGACCACCAATAATTCCCCCACCACCTCCCATATGTCCAGATTCTCTTGATGATGCTGATGCTTTGGGAAGTATGTTAATTTCATGGTACATGAGTGGCTATCATACTGGCTATTATATGGAAATGCTGGCATAG

Is there a bash way to compare the sequences and colorize or print the differences?

EDIT: I want to compare gene sequences by means of Linux tools.

I need a letter-by-letter comparison with the differences highlighted. I think diff is not good at comparing a single word made up of hundreds of letters.

Rui F Ribeiro
kenn
  • I don't understand why my question was downvoted. I asked for help for a scientific purpose. – kenn Sep 23 '17 at 09:33
  • Why don't you use clustal/clustalx for this? – sebasth Sep 23 '17 at 10:06
  • @don_crissti Actually I tried the diff-highlight from the link before editing my question. It's close to what I need, but the author didn't mention how to create tmp.diff. – kenn Sep 23 '17 at 10:20
  • @sebasth After your suggestion I tried quite a number of bioinformatics tools, and at last I came across jemboss, a bioinformatics tool with a GUI. I used jemboss > Alignment > Consensus > merger. It does what I want. – kenn Sep 24 '17 at 13:13

1 Answer

You can save each sequence in a separate file, e.g.

f1 - 1st genomic sequence
f2 - 2nd genomic sequence

Now transform each sequence from horizontal to vertical, so that every letter ends up on its own line:

awk '{gsub(".","&\n");printf "%s",$0}' < f1 >f1a
awk '{gsub(".","&\n");printf "%s",$0}' < f2 >f2a

This saves the reformatted sequences in two new files (f1a and f2a), one letter per line. Now compare the two files with diff:

diff -y f1a f2a  # side by side: prints both files in two columns and marks the lines that differ
diff -c f1a f2a  # context format: prints the differing lines with a few surrounding lines and their line numbers

More on diff: http://man7.org/linux/man-pages/man1/diff.1.html
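If the temporary files are not needed, the same position-by-position comparison could be sketched in a single awk pass. This is only an illustration, not part of the steps above: the function name seq_diff_positions is hypothetical, and it assumes that each input file is non-empty and holds exactly one sequence on a single line.

```shell
# Hypothetical helper (an assumption, not from the steps above): print every
# position where two one-line sequence files differ, as
# "position<TAB>letter in file 1<TAB>letter in file 2".
# A "-" marks a position that exists in only one of the sequences.
seq_diff_positions() {
    awk '
        NR == FNR { a = $0; next }      # first file: remember its sequence
        {
            b = $0                      # second file: compare against it
            n = (length(a) > length(b)) ? length(a) : length(b)
            for (i = 1; i <= n; i++) {
                x = substr(a, i, 1)
                y = substr(b, i, 1)
                if (x != y)
                    printf "%d\t%s\t%s\n", i, (x == "" ? "-" : x), (y == "" ? "-" : y)
            }
        }
    ' "$1" "$2"
}
```

It would be called as seq_diff_positions f1 f2. Because it compares by position only, it does no alignment: an inserted or deleted letter makes every later position show up as a difference, just as with the diff approach on equal-length input.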

The awk and diff steps can also be made into a small script that reads the 2 genomic sequences into variables.

If you want colored output, try colordiff (on Ubuntu: sudo apt-get install colordiff) and replace diff with colordiff in the script below. For side-by-side output, use the -y option:

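#!/bin/bash
# This script compares two genomic sequences letter by letter
echo "1st genomic sequence, followed by [ENTER]:"
read -r gena
echo "2nd genomic sequence, followed by [ENTER]:"
read -r genb
# Put every letter on its own line so diff can compare position by position
printf '%s\n' "$gena" | awk '{gsub(".","&\n");printf "%s",$0}' > /tmp/fa
printf '%s\n' "$genb" | awk '{gsub(".","&\n");printf "%s",$0}' > /tmp/fb
echo "Insert the diff argument you wish to use (e.g. -y or -c). Please refer to man diff for information. Hit [ENTER]:"
# read takes the variable name without a leading "$"
read -r arg
# $arg is left unquoted on purpose so that several options can be passed
diff $arg /tmp/fa /tmp/fb
exit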
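For a more visual result than diff's line markers, the mismatching letters could also be highlighted inline. The following is a rough sketch under stated assumptions, not a feature of diff or colordiff: the function name highlight_diff is hypothetical, each file is assumed to be non-empty and to hold one sequence on a single line, and the terminal is assumed to understand ANSI color escape sequences.

```shell
# Hypothetical helper (an assumption, not part of the script above): print the
# second sequence, wrapping every letter that differs from the letter at the
# same position in the first sequence in ANSI red.
highlight_diff() {
    awk -v red="$(printf '\033[31m')" -v off="$(printf '\033[0m')" '
        NR == FNR { a = $0; next }      # first file: reference sequence
        {
            out = ""
            for (i = 1; i <= length($0); i++) {
                c = substr($0, i, 1)
                # keep matching letters as they are, color the mismatches
                out = out ((c == substr(a, i, 1)) ? c : red c off)
            }
            print out
        }
    ' "$1" "$2"
}
```

It would be called as highlight_diff f1 f2. Letters of the second sequence that lie beyond the end of the first one are treated as mismatches and colored as well.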
Marin N.
  • Thank you for the answer and the script. It works, but I am unfamiliar with diff output. It would be great to have visual, colorized output that underlines the differences. – kenn Sep 22 '17 at 16:15
  • I am not sure if there is something implemented in diff to show the output colored. Probably it can be done, or some other tool has this feature already. I can't look into it now (remote area with a poor internet connection), but I'll check this tomorrow. – Marin N. Sep 23 '17 at 17:28
  • @kenn - I just updated my answer. – Marin N. Sep 25 '17 at 11:38
  • Thanks, but it doesn't make much difference. I found another approach. I would post it as an answer, but I am not allowed to because moderators marked the question as a duplicate. – kenn Sep 25 '17 at 17:03