Comparing and fetching words in two columns of different files

Question

I want to compare two different columns of different files and fetch the common entries among them:

file1

abc
123
ttt
kkk

file2

111 wed
222 kad
333 ttt
444 kkk

I want to compare column 1 of file1 to column 2 of file2. If there are any common entries, I want to print the match lines from file2:

Expected result:

333 ttt
444 kkk

I have tried below commands to fetch the result:

awk -F 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' file1 file2

or

join -t -1 1 -2 2 -o 2.1,2.2 file1 file2

but I didn't get the expected results.

How have you tried to resolve this so far? With what results? Please don't answer in a comment, but edit your question adding the command(s) / script(s) you used, and the resulting output(s). Please use the curly brackets { } to keep the formatting readable. — Peregrino69, Sep 26 '21 at 10:22
Your join command uses , as separator, but the input does not. Why so? — FelixJN, Sep 26 '21 at 10:40
@FelixJN, edited the question. Thank you for pointing the mistake. — Abhishek Patil, Sep 26 '21 at 10:46
Though it'll work if all of your fields are the same width as shown in your example, in general don't concatenate 2 strings without a separator ($1$2) and hope to get a unique value. Consider that a bc and ab c both concatenate to abc. Always use a separator between the strings, e.g. SUBSEP or RS or FS or OFS are the most common ones used. — Ed Morton, Sep 26 '21 at 13:17
At least part of your problem with join is that the input files must be sorted on the join columns. Neither or yours is. — Dale Hagglund, Sep 27 '21 at 08:47

terdon · Answer 1 · 2021-09-26T13:21:20.747

4

Your awk fails because you are using both fields as the key, but your file1 only has one field so both will never be present in the c array. Also, you are using the -F option to give a field separator but are not actually giving one. This means that the field separator will be the awk script itself:

awk -F 'script' file

Here's a working version of your awk attempt:

$ awk  'NR==FNR{c[$1]++;next};c[$2] ' file1 file2
333 ttt
444 kkk

And here's a more efficient version that doesn't use more memory than needed:

$ awk  'NR==FNR{c[$1]}; $2 in c' file1 file2 
333 ttt                 
444 kkk

Your join also missuses the -t option. Just like with the -F of awk, the -t also takes an argument and you aren't giving it one. Next, join requires the input to be sorted. Here's a working version of your join approach:

$ join  -1 1 -2 2 -o 2.1,2.2 <(sort file1) <(sort -k2 file2)
444 kkk
333 ttt

edited Sep 26 '21 at 13:21

answered Sep 26 '21 at 10:57

terdon

242,166

@EdMorton true, but I wanted to use the syntax of the OP to illustrate how to get that syntax working. But fair enough, I added a version that doesn't have those issues (I think, anyway!). – terdon Sep 26 '21 at 13:21
You're just missing the next in that 2nd script. – Ed Morton Sep 26 '21 at 13:22
@EdMorton no, it's omitted by design. Since only the second file has two columns, $2 in c will only ever be true when reading the second file, so I figured there was no need for it. Carlos's answer has a version with the next that could be used even if the first file can have >1 fields, so I'll keep this one as it does work for the OP and I like brevity. – terdon Sep 26 '21 at 13:45
That brevity is at the cost of performance (it'll do that $2 in c test for every line in file1) and robustness (if file1 contained an empty line you'd unexpectedly and undesirably get that line printed) and maintainability (if you ever wanted to add comments or an additional field to the end of the lines in file1 it'd break the script) though so IMHO it's not worth tightly coupling the code to the data like that just to save 5 chars. – Ed Morton Sep 26 '21 at 14:53
1

@EdMorton yep, all true. Don't really think efficiency is that relevant in the context of a couple of small files though, and I certainly don't think that either maintainability or the ability to work with blank files is relevant. Fair points though, thanks. – terdon Sep 26 '21 at 15:33

score 4 · Answer 2 · edited Sep 26 '21 at 13:15

4

awk 'NR==FNR {a[$1]; next} $2 in a' file1 file2
333 ttt
444 kkk

The above follows what you want: to compare column 1 of file1 to column 2 of file2. If there are any common entries, I want to print the match lines from file2

edited Sep 26 '21 at 13:15

Ed Morton

31,617

answered Sep 26 '21 at 11:34

Carlos Pascual

554

sseLtaH · Answer 3 · 2021-09-26T15:11:12.430

3

Does this awk work for you?

awk 'FNR==NR {array[$2]=$0; next} {$0=array[$1]}/./{print}' file2 file1

Output

$ awk 'FNR==NR {array[$2]=$0; next} {$0=array[$1]}/./{print}' file2 file1
333 ttt
444 kkk

edited Sep 26 '21 at 15:11

answered Sep 26 '21 at 10:51

sseLtaH

2,786

Prabhjot Singh · Answer 4 · 2021-09-26T13:45:50.963

Using csvjoin (part of the csvkit toolset):

Assuming file1 as:

x
abc
123
ttt
kkk

and file2 as:

y x
111 wed
222 kad
333 ttt
444 kkk

The following command would do:

csvjoin -d' ' -c x file{2,1} | tr ',' ' '

-d' ' is using space as delimiter.
-c x csvjoin would join files using column x.

As @terdon fixed the above command for file without headers. The command may be changed to this:

csvjoin -H -d ' ' -c 2,1 fil{2,1} | tr ',' ' ' | tail -n +2

The above command works for file that has no headers(-H argument).

And tail commands starts output from second record(+2). This is because csvjoin command adds a new header line at start of output.

score -1 · Answer 5 · answered Sep 28 '21 at 19:30

-1

#!/usr/bin/python
k1=open('file1','r')
for f1 in k1:
    k2=open('file2','r')
    for f2 in k2:
        if f1.strip() in f2.strip():
            print f2.strip()
~

output

333 ttt
444 kkk

answered Sep 28 '21 at 19:30

Praveen Kumar BS

5,211

This is astonishingly bad. Not only does it involve reading all of file2 for every single line of file1, it matches complete lines in file2 instead of only the second column. – Stephen Kitt Sep 29 '21 at 09:30

Comparing and fetching words in two columns of different files

5 Answers5

Linked