Replacing multiple words in tab delimited file with alternative words

Question

I have a tab-delimited file with codes (e.g. ctc, nmg, nml) indicating organisms (File_1):

522 ctc:CTC00972    -   506 3.20E-138   
522 nmg:Nmag_3027   -   561 4.70E-73
522 nml:Namu_3564   -   566 1.80E-146

I also have a tab-delimited file that links the codes to the organism names (File_2):

ctc  Clostridium_tetani_E88
nla  Neisseria_lactamica
nmg  Natrialba_magadii

How can I replace the codes in File_1 with the organism names in File_2?

Desired output:

522 Clostridium_tetani_E88:CTC00972 -   506 3.20E-138   
522 Natrialba_magadii:Nmag_3027     -   561 4.70E-73
522 Neisseria_lactamica:Namu_3564   -   566 1.80E-146

Keeping in mind that there are thousands of these codes in the full data set and that the order of the organisms is not the same in File_1 and File_2.

Your data doesn't appear to be consistent with the desired output (nla not nml corresponds to Neisseria_lactamica) — steeldriver, May 16 '17 at 12:45

score 1 · Answer 1 · answered May 16 '17 at 12:52

It's really just a variant of the "build a lookup table from one file; use it when processing the other file", with the wrinkle that the lookup key needs to be split from the second field. You could do that in awk for example:

awk '
  BEGIN{OFS=FS="\t"} 
  NR==FNR {
    a[$1]=$2; next
  } 
  {
    split($2,b,":"); 
    if (b[1] in a) $2 = a[b[1]]":"b[2]
  } 1' File_2 File_1

score 0 · Answer 2 · answered May 16 '17 at 12:52

This should do the job with sed:

sed '/^[a-z]*[[:cntrl:]].*/{s/[[:cntrl:]]/###/;H;d;}
 G
 s/\([a-z]*\)\(:.*\n\)\1###\([^[:cntrl:]]*\)/\3\2/
 P
 d' file_2 file_1

It's an adoption of this generic solution. See there for an explanation of how it works.

score 0 · Answer 3 · answered May 16 '17 at 14:11

0

awk approach:

awk 'NR==FNR{a[$1]=$2;next}$2 in a{$2=a[$2]":"$3;$3=""}1' File_2 FS="[ |:]" File_1

The output:

522 Clostridium_tetani_E88:CTC00972     -   506 3.20E-138   
522 Natrialba_magadii:Nmag_3027    -   561 4.70E-73
522 nml:Namu_3564   -   566 1.80E-146

Note, that nml and nla codes don't have matches between the 2 input files

answered May 16 '17 at 14:11

RomanPerekhrest

30,212

I like the one line approach. Could you explain it a bit more, so that it can be modified? For example, how would you change the command if you wanted to replace words in column 1 instead of column 2 or how would you modify the command if column 2 didn't possess the section after the ":"? – P.tin May 19 '17 at 09:58
@P.tin, 1) to replace words in column 1 instead of column 2 - move columns as per your needs $1=a[$1]. 2) is the 2nd field of file1 doesn't have : you'll need to rearrange field separator FS="[ |:]". If you have new requirements - it should be as new question – RomanPerekhrest May 19 '17 at 10:07

Replacing multiple words in tab delimited file with alternative words

3 Answers3