
I have data like this, for example:

sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET 
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL 

and I have other data like the following, which shares some of the same strings:

tr|A0A1B1L9R9|A0A1B1L9R9_BACTU This is just an example 1-20-100

I want to match the two files: wherever an identifier from the second file also appears in the first file, the rest of that line from the second file should be pasted into the matching line of the first file. For example:

In the first data I have this:

sp|O15304|SIVA_HUMAN
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU

In the second data I have only this, which matches one entry of the first data:

tr|A0A1B1L9R9|A0A1B1L9R9_BACTU

So the output will be like this:

sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET 
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU This is just an example 1-20-100 MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL 
Jeff Schaller
Learner

1 Answer


A simple Bash script like this should work, although there may be even shorter methods.

file1.txt:

sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET 
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL

file2.txt:

tr|A0A1B1L9R9|A0A1B1L9R9_BACTU This is just an example 1-20-100

merge.sh:

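#!/bin/bash
fileone="file1.txt"
filetwo="file2.txt"

# Read file1.txt line by line, splitting each line on whitespace into an array:
# parts[0] is the prefix (identifier), parts[1] is the sequence.
# The "|| [ ... ]" part also handles a last line without a trailing newline.
while read -ra parts || [ "${#parts[@]}" -gt 0 ]; do
    # Find the line in file2.txt that begins with the prefix and keep the rest.
    # The prefix is used as a sed regex, so prefixes containing characters
    # such as . * [ ] \ or / would need escaping first.
    other_text=$(sed -n -e "s/^${parts[0]} //p" "$filetwo")
    echo "${parts[0]} $other_text ${parts[1]}"
done < "$fileone"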

This script reads file1.txt line by line, checks whether a line in the second file, file2.txt, begins with the prefix ${parts[0]}, and then merges the strings together.

How sed -n -e "s/^${parts[0]} //p" works:

  • -n means not to print anything by default.
  • -e is followed by a sed command.
  • s is the pattern replacement command.
  • The regular expression ^${parts[0]} followed by a space matches lines that begin with ${parts[0]}, which is our prefix (e.g. sp|O15304|SIVA_HUMAN), and a space.
  • The match, e.g. sp|O15304|SIVA_HUMAN plus the space after it, is replaced by the empty string.
  • p prints the transformed line, which is then stored in the variable other_text.
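
To see the effect of that sed command in isolation, here is a small sketch. The variables sample and prefix are just stand-ins I made up for one line of file2.txt and for ${parts[0]}; they are not part of the script above.

```shell
# Hypothetical stand-ins: one line of file2.txt and the prefix to look up.
sample='tr|A0A1B1L9R9|A0A1B1L9R9_BACTU This is just an example 1-20-100'
prefix='tr|A0A1B1L9R9|A0A1B1L9R9_BACTU'

# -n suppresses the default output; s/^prefix //p removes the prefix and the
# space after it, and prints only lines where that substitution happened.
other_text=$(printf '%s\n' "$sample" | sed -n -e "s/^${prefix} //p")
echo "$other_text"

# A prefix that does not occur produces no output, so the variable stays empty.
no_match=$(printf '%s\n' "$sample" | sed -n -e 's/^sp|O15304|SIVA_HUMAN //p')
echo "no match: '${no_match}'"
```

The | characters are harmless here because sed uses basic regular expressions by default, where a plain | is a literal character.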

Also, see this detailed explanation of this particular sed command.

To redirect the output to a file, you can run ./merge.sh > output.txt. You can make the script more flexible by setting fileone=$1 and filetwo=$2 and passing the files as arguments instead, like this: ./merge.sh file1.txt file2.txt
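
As a rough sketch of that argument-based variant, the loop could be wrapped in a function that takes the two file names as $1 and $2. The function name merge_files, the temporary directory and the shortened sequences are all made up for this illustration. Unlike the script above, this version avoids Bash arrays so it should also run in a plain POSIX shell, and it prints only a single space when no match is found.

```shell
# merge_files is a hypothetical helper name.
# $1: file with "ID sequence" lines, $2: file with "ID extra text" lines.
merge_files() {
    # "|| [ -n "$id" ]" also handles a last line without a trailing newline.
    while read -r id seq || [ -n "$id" ]; do
        [ -n "$id" ] || continue          # skip empty lines
        # Text that follows the ID in the second file, empty if the ID is absent.
        note=$(sed -n -e "s/^${id} //p" "$2")
        if [ -n "$note" ]; then
            printf '%s %s %s\n' "$id" "$note" "$seq"
        else
            printf '%s %s\n' "$id" "$seq"
        fi
    done < "$1"
}

# Small demo with shortened, made-up sequences in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'sp|O15304|SIVA_HUMAN MPKRSC' \
              'tr|A0A1B1L9R9|A0A1B1L9R9_BACTU MNKQLF' > "$tmpdir/file1.txt"
printf '%s\n' 'tr|A0A1B1L9R9|A0A1B1L9R9_BACTU This is just an example 1-20-100' \
    > "$tmpdir/file2.txt"

merged=$(merge_files "$tmpdir/file1.txt" "$tmpdir/file2.txt")
printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

In a real merge.sh you would call merge_files "$1" "$2" at the end instead of the demo part. Quoting the file names keeps paths with spaces working.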

rudib