Looking for lines which are in one file but not in the other using Unix and Awk

Question

I have two files with 7 fields each, and I want to print the lines that are present in file1 but not in file2, comparing only field 1 and field 2.

Logic: a line of file1 should be printed when its combination of column 1 and column 2 does not occur anywhere in file2. Example: the pair "sc2/10 10" does not appear in file2, so that line is printed as output.

File1:

sc2/80         20      .        A       T         86       F=5;U=4
sc2/60         55      .        G       T         76       F=5;U=4
sc2/10         10      .        G       C         50       F=5;U=4
sc2/68         20      .        T       C         71       F=5;U=4
sc2/24         24      .        T       G         31       F=5;U=4
sc2/11         30      .        A       T         60       F=5;U=4

File2:

sc2/80         20      .        A       T         30       F=5;U=4 
sc2/60         55      .        T       T         77       F=5;U=4 
sc2/68         20      .        C       C         01       F=5;U=4
sc2/24         29      .        T       G         31       F=5;U=4
sc2/24         19      .        G       G         11       F=5;U=4
sc2/88         89      .        T       G         51       F=5;U=4

Expected Output:

sc2/10         10      .        G       C         50       F=5;U=4 
sc2/11         30      .        A       T         60       F=5;U=4

I would appreciate your help.

related: http://unix.stackexchange.com/questions/86590/comparing-files-based-on-5-fields-using-awk-and-bash — slm, Aug 21 '13 at 15:35
http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file — Ciro Santilli OurBigBook.com, Jun 03 '16 at 07:59

score 7 · Accepted Answer · edited Jun 11 '20 at 12:04

Unless the input is huge, I would save the file2 pairs in a hash and use it to skip the matching lines in file1. For example:

awk 'FNR == NR { h[$1,$2]; next }; !($1 SUBSEP $2 in h)' file2 file1

Output:

sc2/10         10      .        G       C         50       F=5;U=4         
sc2/24         24      .        T       G         31       F=5;U=4
sc2/11         30      .        A       T         60       F=5;U=4

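If I understand correctly, sc2/24 24 is correctly included in the output: file2 contains sc2/24 only with 29 and 19 in the second field, never with 24.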

Explanation

FNR == NR { h[$1,$2]; next } stores each $1/$2 pair as a key of the hash h (merely referencing an array element is enough to create it), but only while reading the first input file (file2), where FNR equals NR. The next statement skips the remaining rules and moves on to the next record.
!($1 SUBSEP $2 in h) is therefore only evaluated for file1, and it triggers the default action, { print $0 }, for lines whose $1/$2 pair is not a key of h. (Note: avoid !h[$1,$2], which is the same as !h[$1 SUBSEP $2]: it tests the element's value rather than its presence, and the reference itself would create the element.)

The above assumes the value of SUBSEP (usually the ^\ character) is not found in the first two fields of the file.
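
As a quick check, the command can be run against the sample data from the question. The following is only a sketch: the scratch directory and the variable names tmp and result are made up for illustration, the runs of spaces in the sample lines are collapsed to single spaces, and the membership test is written as (($1, $2) in h), which is an equivalent spelling of $1 SUBSEP $2 in h.

```shell
#!/bin/sh
# Work in a scratch directory (hypothetical name) so no real files are touched.
tmp=$(mktemp -d) || exit 1

# Sample data from the question, with whitespace collapsed to single spaces.
cat > "$tmp/file1" <<'EOF'
sc2/80 20 . A T 86 F=5;U=4
sc2/60 55 . G T 76 F=5;U=4
sc2/10 10 . G C 50 F=5;U=4
sc2/68 20 . T C 71 F=5;U=4
sc2/24 24 . T G 31 F=5;U=4
sc2/11 30 . A T 60 F=5;U=4
EOF

cat > "$tmp/file2" <<'EOF'
sc2/80 20 . A T 30 F=5;U=4
sc2/60 55 . T T 77 F=5;U=4
sc2/68 20 . C C 01 F=5;U=4
sc2/24 29 . T G 31 F=5;U=4
sc2/24 19 . G G 11 F=5;U=4
sc2/88 89 . T G 51 F=5;U=4
EOF

# First file (file2): record each (field 1, field 2) pair as a key of h.
# Second file (file1): print the lines whose pair is not a key of h.
result=$(awk 'FNR == NR { h[$1, $2]; next } !(($1, $2) in h)' \
    "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

This prints the sc2/10 10, sc2/24 24 and sc2/11 30 lines, in the order in which they appear in file1.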

score 0 · Answer 2 · edited May 23 '17 at 12:39

grep -Fvxf <remove> <all-lines>

compares whole lines, not individual fields
works on non-sorted files
maintains the order
is POSIX

Example:

cat <<EOF > A
b
1
a
0
01
b
1
EOF

cat <<EOF > B
0
1
EOF

grep -Fvxf B A

Output:

b
a
01
b

Explanation:

-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print only the lines that do not match
-f file: take patterns from the given file
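
To see why -x matters here, the same A and B data can be run with and without it. This is a small sketch; the scratch directory and the variable names tmp, with_x and without_x are made up for illustration.

```shell
#!/bin/sh
# Scratch directory (hypothetical name) holding the A and B files from above.
tmp=$(mktemp -d) || exit 1
printf '%s\n' b 1 a 0 01 b 1 > "$tmp/A"
printf '%s\n' 0 1 > "$tmp/B"

# With -x a pattern has to match the entire line, so the line "01" is kept.
with_x=$(grep -Fvxf "$tmp/B" "$tmp/A")

# Without -x the patterns "0" and "1" also match as substrings of "01",
# so that line is dropped as well.
without_x=$(grep -Fvf "$tmp/B" "$tmp/A")

printf 'with -x:\n%s\n\nwithout -x:\n%s\n' "$with_x" "$without_x"
rm -rf "$tmp"
```

With -x the result is b, a, 01, b; without it, only b, a, b remain.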

This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: https://stackoverflow.com/questions/18204904/fast-way-of-finding-lines-in-one-file-that-are-not-in-another
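
For input that is already sorted, one commonly used tool is comm. Whether it is the method the linked question recommends is not stated here, so treat the following as a sketch under that assumption. It also behaves differently from the grep command: the output comes out in sorted order rather than the original order, and sort -u collapses duplicate lines. The scratch directory and the names tmp and comm_out are made up for illustration.

```shell
#!/bin/sh
# Use the C locale so that sort and comm agree on the ordering.
LC_ALL=C
export LC_ALL

tmp=$(mktemp -d) || exit 1

# Same A and B data as above, sorted and de-duplicated first.
printf '%s\n' b 1 a 0 01 b 1 | sort -u > "$tmp/A.sorted"
printf '%s\n' 0 1 | sort -u > "$tmp/B.sorted"

# comm -23 suppresses column 2 (lines only in B) and column 3 (lines in
# both), leaving the lines that occur only in A.
comm_out=$(comm -23 "$tmp/A.sorted" "$tmp/B.sorted")

printf '%s\n' "$comm_out"
rm -rf "$tmp"
```

This prints 01, a and b, one per line.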
