How can I merge two txt files by one similar string?

Question

I have data like this, for example:

sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET 
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL

I also have a second data set like the following, which shares some of the same strings:

tr|A0A1B1L9R9|A0A1B1L9R9_BACTU This is just an example 1-20-100

I want to match the two files: wherever a string from the second text file also appears in the first text file, the extra text from the second file should be pasted into that line. For example:

In the first file I have these:

sp|O15304|SIVA_HUMAN
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU

In the second file I have only this one, which matches one of the entries in the first file:

tr|A0A1B1L9R9|A0A1B1L9R9_BACTU

So the output will look like this:

sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET 
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU This is just an example 1-20-100 MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL

So you have strings like "A1 B1" and "A2 B2". If "A2" = "A1", set "A1"= "A1 B1 B2"? I'm new at regexp but I think you can use (^.)\s(.) to capture the "part between the beginning of the line and the whitespace" and "the remainder" So I guess you would have the comparison pull out the first part and the remainder into variables, repeat for the line it would be compared to and use logic to build the result. Are you looking through all the lines in a file and merging the ones with similar 'first part's? — Jeter-work, Dec 21 '18 at 18:55
Can you explain how exactly this differs from your previous question how can I pass over a string on several txt files? — steeldriver, Dec 21 '18 at 19:31
Possible duplicate of how can I pass over a string on several txt files — Scott - Слава Україні, Dec 22 '18 at 03:10

rudib · Accepted Answer · 2018-12-21T19:15:08.067

A simple Bash script like this should work, although there may be even shorter methods.

file1.txt:

sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET 
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL

file2.txt:

tr|A0A1B1L9R9|A0A1B1L9R9_BACTU This is just an example 1-20-100

merge.sh:

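#!/bin/bash
fileone="file1.txt"
filetwo="file2.txt"

# Read file1 line by line and split each line on whitespace:
# parts[0] is the prefix (identifier), parts[1] is the sequence.
while read -ra parts; do
    # Skip empty lines.
    [ "${#parts[@]}" -eq 0 ] && continue

    # Print whatever follows "prefix " in file2 (empty if there is no match).
    other_text=$(sed -n -e "s/^${parts[0]} //p" "$filetwo")
    echo "${parts[0]}${other_text:+ $other_text} ${parts[1]}"
done < "$fileone"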

This script reads file1.txt line by line, checks whether a line in the second file, file2.txt, begins with the prefix ${parts[0]}, and if so merges the two lines together.

How sed -n -e "s/^${parts[0]} //p" works:

-n means not to print anything by default.
-e is followed by a sed command.
s is the pattern replacement command.
The regular expression "^${parts[0]} " matches lines that begin with ${parts[0]}, which is our prefix (e.g. sp|O15304|SIVA_HUMAN), followed by a space.
The match, e.g. sp|O15304|SIVA_HUMAN plus the trailing space, is replaced by the empty string.
p prints the transformed line, which is then stored in the variable other_text.
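
To see the command in isolation, here is a minimal sketch that runs it against a one-line stand-in for file2.txt. The temporary file and the variable names demo_file, prefix, matched and unmatched are assumptions made for this illustration only; they are not part of merge.sh.

```shell
# Hypothetical stand-in for file2.txt, written to a temporary file.
demo_file=$(mktemp)
printf '%s\n' 'tr|A0A1B1L9R9|A0A1B1L9R9_BACTU This is just an example 1-20-100' > "$demo_file"

# A prefix that occurs in the file: the prefix and the space after it are
# removed, and the rest of the line is printed.
prefix='tr|A0A1B1L9R9|A0A1B1L9R9_BACTU'
matched=$(sed -n -e "s/^${prefix} //p" "$demo_file")
echo "matched:   '$matched'"

# A prefix that does not occur: -n suppresses the default output,
# so nothing is printed and the variable stays empty.
prefix='sp|O15304|SIVA_HUMAN'
unmatched=$(sed -n -e "s/^${prefix} //p" "$demo_file")
echo "unmatched: '$unmatched'"

rm -f "$demo_file"
```

Note that the | characters in the prefix are taken literally here, because sed uses basic regular expressions unless told otherwise. A prefix containing / or other characters that are special to sed would need escaping first.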

Also, see this detailed explanation of this particular sed command.

To redirect the output to a file, you can run ./merge.sh > output.txt. You can make the script more flexible by setting fileone=$1 and filetwo=$2 and specifying the files as arguments instead, like this: ./merge.sh file1.txt file2.txt
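
One possible way to write the argument-based variant is sketched below. Wrapping the loop in a function named merge_files is an assumption made here so that the logic can be reused and tried out easily; the answer itself only suggests assigning $1 and $2.

```shell
#!/bin/bash
# merge_files FILE1 FILE2
# Hypothetical helper: prints every line of FILE1 and, where a line of FILE2
# starts with the same first field, inserts the rest of that FILE2 line.
merge_files() {
    local fileone=$1 filetwo=$2 other_text
    local -a parts
    while read -ra parts; do
        # Skip empty lines.
        [ "${#parts[@]}" -eq 0 ] && continue
        # Text that follows "prefix " in FILE2, or empty if there is no match.
        other_text=$(sed -n -e "s/^${parts[0]} //p" "$filetwo")
        echo "${parts[0]}${other_text:+ $other_text} ${parts[1]}"
    done < "$fileone"
}

# Run the merge only when exactly two file names are given on the command line.
if [ "$#" -eq 2 ]; then
    merge_files "$1" "$2"
fi
```

Saved as merge.sh, this would be called as ./merge.sh file1.txt file2.txt > output.txt.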
