Comparing two files using Unix and Awk

Question

I need to compare two files, File1 and File2 (Separated by space) using 4 fields (Field 1, 2, 4 and 5 of File 1 with field1, 2, 4 and 5 of File2).

Logic:
If column 1, 2 and 4 of File 1 matches with column 1, 2 and 4 of File 2 and there is a mismatch at column 5 then both the lines from File 1 and File 2 are concatenated redirected as output. Therefore, the output file only contains those lines where Column 1, 2 and 4 of File1 and File2 matches and Column 5 does not match.

File1:

sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/68         20      .        T       C         71       PASS     N=2     F=5;U=4
sc2/24         24      .        T       G         31       PASS     N=2     F=5;U=4

File2:

sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4 
sc2/68         20      .        T        C        71      PASS    N=2       F=5;U=4 
sc2/10         24      .        T        G        31      PASS    N=2       F=5;U=4
sc2/40         59      .        T        G        31      PASS    N=2       F=5;U=4
sc2/24         24      .        A        G        38      PASS    N=2       F=5;U=4

Output:

sc2/80         20      .        A       T        86      PASS     N=2      F=5;U=4
sc2/80         20      .        A       C        80      PASS     N=2      F=5;U=4

sc2/60         55      .        G       T        76      PASS     N=2      F=5;U=4 
sc2/60         55      .        G       C        72      PASS     N=2      F=5;U=4

I am new in the field and I appreciate your help.

What happens if the files have different numbers of lines? Does the solution need to be in awk? — Joseph R., Jul 28 '13 at 23:56
The files has different number of lines. I thought awk is quite good in manipulating datasets. That is the reason I have chosen awk. — Namrata, Jul 29 '13 at 13:02
http://theunixshell.blogspot.in/2012/12/i-have-two-files-file-1-contains-3.html — Vijay, Mar 29 '14 at 13:49

slm · Accepted Answer · 2013-07-29T02:50:44.483

You can use awk. Put the following in a script, script.awk:

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}  

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Now run it like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

The script works as follows. This block creates 3 arrays, f1, f1_c14, and f1_c5. f1 contains all the lines of file1 in an array, indexed using the contents of the columns 1, 2, & 4 from file1. f1_c14 is another array with the same index (1, 2, & 4's contents) and a value of 1. The 3rd array uses the same index as the 1st 2, with the value of the 5th column from file1.

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}

The next block is responsible for printing lines from the 1st file, file1 under the conditions that the columns 1, 2, & 4 match the columns from file2, AND it will onlu print the line from file1 if the 5th columns of file1 and file2 do not match.

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

The 3rd block is responsible for printing the associated line from file2 there's a corresponding line in the array f1 for file2's columns 1, 2, & 4. Again it only prints if the 5th columns do not match.

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Example

Running the above script like so:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

You can use the column command to clean up the output slightly:

$ awk -f script.awk file1  file2 | column -t
sc2/80  20  .  A  T  86  PASS  N=2  F=5;U=4
sc2/80  20  .  A  C  80  PASS  N=2  F=5;U=4
sc2/60  55  .  G  T  76  PASS  N=2  F=5;U=4
sc2/60  55  .  G  C  72  PASS  N=2  F=5;U=4

How it works?

FNR == NR

This makes use of awk's ability to loop through files in a particular way. Here's we're looping through the files and when we're on a line that's from the first file, file, we want to run a particular block of code on this line from file1.

This example shows what FNR == NR is doing when we give it 2 simulated files. One has 4 lines in it while the other has 5 lines:

$ awk 'BEGIN {print "NR\tFNR\tline"} {print NR"\t"FNR"\t"$0}' \
     <(seq 1 4) <(seq 1 5)
NR  FNR line
1   1   1
2   2   2
3   3   3
4   4   4
5   1   1
6   2   2
7   3   3
8   4   4
9   5   5

other blocks

The other blocks, f1_c14[$1,$2,$4] AND f1[$1,$2,$4] only run when the values from those array elements has a value.

@namrata - Great. If you're issue's been resolved can you please mark this as the accepted answer so that others know that it's been solved. — slm, Jul 29 '13 at 13:20

Joseph R. · Answer 2 · 2013-07-29T00:39:38.913

Here's a solution in Perl. You should save the following code in a file and run it as a script (see below):

#!/usr/bin/perl
$file1 = '/path/to/file1';
$file2 = '/path/to/file2';
open $f1,'<',$file1;
open $f2,'<',$file2;
while(<$f1>){
    ($c1,$c2,$c4,$c5) = (split / /)[0,1,3,4]; #get relevant columns in file 1
    $lines_dictionary{"$c1 $c2 $c4"}="$c5---$_"; #create a hash entry keyed by the relevant columns
}
while(<$f2>){
    ($c1,$c2,$c4,$c5) = (split / /)[0,1,3,4]; #get relevant columns in file 2
    if(exists $lines_dictionary{"$c1 $c2 $c4"}){ #if a line with similar columns was seen in file 1
        ($file1_c5,$file1_line) = split /---/,$lines_dictionary{"$c1 $c2 $c4"}; #parse the hash entry this line in file 1
        if($file1_c5 -ne $c5){ #if column 5 of file 2 doesn't match column 5 of file 1
            print "${file1_line}$_\n"; #we only need one extra newline as the lines read from the files have trailing ones.
        }
    }
}
close $f1;
close $f2;

Use any text editor to paste this script into a file, modify the $file1 and $file2 variables to reflect the true locations of your files, then make the script executable by doing:

$ chmod +x /path/to/script

Finally, call the script:

$ /path/to/script

Disclaimer

This code is untested
This code assumes the pattern '---' is unlikely to occur in the 5th column.
This code assumes the lines in file 1 are unique (i.e. that each line has a different combination of "column1 column2 column4"). If there are multiple lines (not necessarily consecutive) containing the same data in the relevant columns, the script will use the last one (bottom-most in the file) of these lines.

Thank you Joseph R for your solution. It was of great help. – Namrata Jul 30 '13 at 15:00 — Namrata, Jul 30 '13 at 15:00
You're most welcome. Have you actually tested it? – Joseph R. Jul 30 '13 at 15:35 — Joseph R., Jul 30 '13 at 15:35

Comparing two files using Unix and Awk

2 Answers2

Example

How it works?

Linked