
I have data like this, but much bigger. My df1.txt looks as follows:

sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET 
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL    

and my df2.txt looks as follows:

sp|O15304|SIVA_HUMAN    IGPDGR

I am trying to join the two files, so I run the following:

join df1.txt df2.txt | awk '{gsub($3, tolower($3), $2) ; print $1 "\t" $2}' > out.txt

I expect to get this:

sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLigpdgrLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL    

but instead I get this:

sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLigpdgrLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET

How can I solve it?

Jeff Schaller
Learner
  • Why would you expect to have the tr line in the output? The only way I can see of obtaining your desired output from the two example files is cat df1.txt, which presumably isn't what you're intending. – Chris Davies Dec 21 '18 at 20:48
  • @roaima I want to keep all the lines that are in df1.txt, just modified by df2.txt – Learner Dec 21 '18 at 20:52
  • And the others passed through unchanged? – Chris Davies Dec 21 '18 at 20:54

1 Answer


The problem is in the join command. You need to use -a 1.

From man join

-a FILENUM
       also print unpairable lines from file FILENUM, where
       FILENUM is 1 or  2,  corresponding to FILE1 or FILE2

i.e. the final command is

join -a 1 df1.txt df2.txt | awk '{gsub($3, tolower($3), $2) ; print $1 "\t" $2}' > out.txt
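With -a 1, the unpaired lines reach awk with only two fields, so $3 is empty for them. A slightly more defensive sketch guards the substitution with NF so that gsub only runs when a peptide was actually joined. The miniature input files and the function name highlight below are made up for illustration; note also that gsub treats $3 as a regular expression, which is harmless for plain letter sequences.

```shell
# highlight FILE1 FILE2: hypothetical helper name, not from the original post.
highlight() {
  join -a 1 "$1" "$2" |
    awk 'NF >= 3 { gsub($3, tolower($3), $2) }   # lowercase the peptide only when one was joined
         { print $1 "\t" $2 }'
}

# Made-up miniature inputs; join expects both files sorted on the first field.
dir=$(mktemp -d)
printf 'id1 AAABBBCCC\nid2 DDDEEEFFF\n' > "$dir/df1.txt"
printf 'id1 BBB\n' > "$dir/df2.txt"

result=$(highlight "$dir/df1.txt" "$dir/df2.txt")
printf '%s\n' "$result"
```

On the sample data this should lowercase BBB inside the id1 sequence and pass the id2 line through unchanged.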

Background

When troubleshooting, you should test each part of the pipe sequentially. join df1.txt df2.txt only outputs lines that are in both files. To include lines in df1.txt that have no match in df2.txt, use -a 1 as above.
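As a rough illustration of testing the first stage of the pipe on its own, the sketch below runs join with and without -a 1 on two tiny made-up files (the ids and sequences are placeholders, not the real data):

```shell
# Made-up miniature inputs, already sorted on the first field as join expects.
work=$(mktemp -d)
printf 'id1 AAABBBCCC\nid2 DDDEEEFFF\n' > "$work/df1.txt"
printf 'id1 BBB\n' > "$work/df2.txt"

# Stage 1: plain join keeps only ids present in both files.
inner=$(join "$work/df1.txt" "$work/df2.txt")
printf '%s\n' "$inner"

# Stage 2: -a 1 also keeps lines of df1.txt that have no partner in df2.txt.
outer=$(join -a 1 "$work/df1.txt" "$work/df2.txt")
printf '%s\n' "$outer"
```

Comparing the two outputs shows that the id2 line is already missing after the plain join, before awk ever sees it.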

Sparhawk
  • I have a Mac; should this still work? When I run it, I seem to get a syntax error – Learner Dec 21 '18 at 00:44
  • It does give me output when I run it without -a, but as soon as I add -a 1 or 2, it prints something like usage: join [-a fileno | -v fileno ] [-e string] [-1 field] [-2 field] [-o list] [-t char] file1 file2 – Learner Dec 21 '18 at 00:55
  • @Learner I've changed the position of the option. Could you please try again? (Often OS X ships very old versions of command line tools, because Apple doesn't like the open-source GPL licence. As an aside, I'd recommend installing from an unofficial repository.) – Sparhawk Dec 21 '18 at 00:56