
I have two files with seven fields each, and I want to print the lines that are present in file1 but not in file2, comparing only field 1 and field 2.

Logic: print every line of file1 whose (column 1, column 2) pair does not occur in file2. Example: the pair "sc2/10 10" does not appear anywhere in file2, so that line is printed.

File1:

sc2/80         20      .        A       T         86       F=5;U=4
sc2/60         55      .        G       T         76       F=5;U=4
sc2/10         10      .        G       C         50       F=5;U=4
sc2/68         20      .        T       C         71       F=5;U=4
sc2/24         24      .        T       G         31       F=5;U=4
sc2/11         30      .        A       T         60       F=5;U=4

File2:

sc2/80         20      .        A       T         30       F=5;U=4 
sc2/60         55      .        T       T         77       F=5;U=4 
sc2/68         20      .        C       C         01       F=5;U=4
sc2/24         29      .        T       G         31       F=5;U=4
sc2/24         19      .        G       G         11       F=5;U=4
sc2/88         89      .        T       G         51       F=5;U=4

Expected Output:

sc2/10         10      .        G       C         50       F=5;U=4 
sc2/11         30      .        A       T         60       F=5;U=4 

I would appreciate your help.

slm
Namrata

2 Answers


Unless the input is huge, I would save the file2 pairs in a hash and use it to skip the matching lines in file1. For example:

awk 'FNR == NR { h[$1,$2]; next }; !($1 SUBSEP $2 in h)' file2 file1

Output:

sc2/10         10      .        G       C         50       F=5;U=4         
sc2/24         24      .        T       G         31       F=5;U=4
sc2/11         30      .        A       T         60       F=5;U=4

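If I understand correctly, the sc2/24 24 line belongs in the output even though the expected output omits it: file2 contains sc2/24 only with 29 and 19 in the second field.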

Explanation

  • FNR == NR { h[$1,$2]; next } stores each $1/$2 pair as a key of the h hash (merely referencing an array element by subscript is enough to create it), but only while reading the first input file (file2). The next command skips to the next record, so the second pattern is never evaluated for file2.
  • !($1 SUBSEP $2 in h) is therefore only evaluated for file1, and it invokes the default block for lines whose $1/$2 pair is not a key of h. The default block is { print $0 }. (Note: avoid !h[$1,$2], which is the same as !h[$1 SUBSEP $2]. Reading the element would create it, and because the stored values are empty the test would be true for every line.)
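The difference between testing membership with in and reading an element can be checked in isolation. The following is a minimal sketch; the key "k" and the variable names are made up for illustration:

```shell
# Sketch: show that reading h["k"] creates the element, while "in" does not.
out=$(awk 'BEGIN {
    # Membership test only; h stays empty.
    before = ("k" in h) ? "present" : "absent"
    # Reading the element (here inside a negation) creates it as a side effect.
    ignored = !h["k"]
    after = ("k" in h) ? "present" : "absent"
    print "before: " before
    print "after: " after
}')
printf '%s\n' "$out"
```

This should report the key as absent before the read and present after it, which is why the one-liner uses in for the file1 test.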

The one-liner assumes that the value of SUBSEP (usually the ^\ character, octal 034) does not occur in the first two fields of either file.
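For reference, here is a self-contained sketch that runs the same idea against the sample data from the question. It is an illustration rather than a replacement for the one-liner: the temporary directory, the array name seen, and the single-space layout of the sample lines are assumptions made here (awk's default field splitting treats any run of blanks as one separator). It also uses the equivalent ($1, $2) in seen form of the membership test.

```shell
# Sketch: print lines of file1 whose (field 1, field 2) pair is absent from file2.
tmp=$(mktemp -d)

cat > "$tmp/file1" <<'EOF'
sc2/80 20 . A T 86 F=5;U=4
sc2/60 55 . G T 76 F=5;U=4
sc2/10 10 . G C 50 F=5;U=4
sc2/68 20 . T C 71 F=5;U=4
sc2/24 24 . T G 31 F=5;U=4
sc2/11 30 . A T 60 F=5;U=4
EOF

cat > "$tmp/file2" <<'EOF'
sc2/80 20 . A T 30 F=5;U=4
sc2/60 55 . T T 77 F=5;U=4
sc2/68 20 . C C 01 F=5;U=4
sc2/24 29 . T G 31 F=5;U=4
sc2/24 19 . G G 11 F=5;U=4
sc2/88 89 . T G 51 F=5;U=4
EOF

result=$(awk '
    # First file on the command line (file2): remember each key pair.
    FNR == NR { seen[$1, $2]; next }
    # Second file (file1): print lines whose key pair was never stored.
    !(($1, $2) in seen)
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that file2 is listed first, so that its keys are loaded before any line of file1 is tested.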

Thor

grep -Fvxf <remove> <all-lines>

  • works on non-sorted files
  • maintains the order
  • is POSIX

Example:

cat <<EOF > A
b
1
a
0
01
b
1
EOF

cat <<EOF > B
0
1
EOF

grep -Fvxf B A

Output:

b
a
01
b

Explanation:

  • -F: use literal strings instead of the default BRE
  • -x: only consider matches that match the entire line
  • -v: print only the lines that do not match
  • -f file: take patterns from the given file
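To see why -x matters, the same sample files can be run with and without it. This is a small sketch; the temporary directory and variable names are chosen here for illustration:

```shell
# Sketch: compare whole-line matching (-x) with substring matching.
tmp=$(mktemp -d)
printf '%s\n' b 1 a 0 01 b 1 > "$tmp/A"
printf '%s\n' 0 1 > "$tmp/B"

# With -x, a pattern must equal the entire line, so "01" is kept.
with_x=$(grep -Fvxf "$tmp/B" "$tmp/A")

# Without -x, any line containing "0" or "1" is dropped, including "01".
without_x=$(grep -Fvf "$tmp/B" "$tmp/A")

printf 'with -x:\n%s\n' "$with_x"
printf 'without -x:\n%s\n' "$without_x"
rm -rf "$tmp"
```

Without -x the line 01 disappears because it contains both patterns as substrings.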

On pre-sorted files this method is slower than other methods, since it is more general and does not take advantage of the ordering. If speed matters as well, see: https://stackoverflow.com/questions/18204904/fast-way-of-finding-lines-in-one-file-that-are-not-in-another

Ciro Santilli OurBigBook.com