3

I need to compare two text files. Each line of both files contains exactly one entry. The new file contains entries that the old one lacks. I have tried diff and vimdiff, but they don't work because the lines may be in a different order.

For example:

OLD FILE

alpha
beta
gama

NEW FILE

delta
omega
beta
alpha
gama
rho
phi

diff and vimdiff compare line 1 with line 1, line 2 with line 2, and so on. Even if I sort both files, the comparison will not succeed, because new items can fall between existing ones in the sorted versions, like "alpha, beta, rho" versus "alpha, beta, gama, rho".

How do I get a list of the entries that the new file has and the old one does not?

Duck
  • 4,674

5 Answers

3
$ awk 'FNR == NR { oldfile[$0]=1; };
  FNR != NR { if(oldfile[$0]==0) print; }' file1 file2
delta
omega
rho
phi
Hauke Laging
  • 90,279
3

I would use grep

grep -Fxvf oldfile newfile

-F: use fixed-string mode (no regex metacharacters)

-x: match the whole line (not a substring)

-f oldfile: read the strings to be matched from oldfile

-v: invert the match, i.e. print the lines NOT found in oldfile
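To make the combined effect of these four flags concrete, here is a minimal Python sketch of the same logic. It is only an illustration, not how grep is implemented, and the function name lines_only_in_new is made up for this example. It uses the sample data from the question.

```python
def lines_only_in_new(old_lines, new_lines):
    """Return the lines of new_lines that do not occur in old_lines.

    Hypothetical helper for illustration: whole lines are compared
    literally (like -F -x), and only non-matching lines are kept (like -v).
    """
    old = set(old_lines)  # the "patterns" read from the old file (like -f)
    return [line for line in new_lines if line not in old]


old_entries = ["alpha", "beta", "gama"]
new_entries = ["delta", "omega", "beta", "alpha", "gama", "rho", "phi"]

# Prints the new entries in the order they appear in the new file.
print("\n".join(lines_only_in_new(old_entries, new_entries)))
```

As with the grep command, the result keeps the order of the new file, so no sorting is needed.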

steeldriver
  • 81,074
2

A shorter awk command:

awk 'NR==FNR{a[$0];next}!($0 in a)' file1 file2

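If file1 can be empty, replace NR==FNR with FILENAME==ARGV[1]. With an empty file1, NR==FNR is also true for every line of file2, so nothing would be printed.

grep -Fxvf file2 file1 can be slow when the files are large (timings in seconds):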

$ jot -r 10000 1 100000 >file1;jot -r 10000 1 100000 >file2
$ time awk 'NR==FNR{a[$0];next}!($0 in a)' file1 file2 >/dev/null
0.015
$ time grep -Fxvf file2 file1 >/dev/null
36.758
$ time comm -13 <(sort file1) <(sort file2)>/dev/null
0.173

If you need to remove repeated lines, use

awk 'NR==FNR{a[$0];next}!b[$0]++&&!($0 in a)' file1 file2

or

comm -13 <(sort file1) <(sort -u file2)
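For readers who find the !b[$0]++ idiom hard to read, the following Python sketch shows the same idea: print a line only the first time it is seen, and only if it is absent from the first file. The function name unique_new_lines is an assumption made for this example.

```python
def unique_new_lines(old_lines, new_lines):
    """Return lines found only in new_lines, without repeats, in input order.

    Hypothetical helper for illustration; 'seen' plays the role of the
    awk array b, and 'old' plays the role of the awk array a.
    """
    old = set(old_lines)
    seen = set()
    result = []
    for line in new_lines:
        if line in old or line in seen:
            continue  # already in the old file, or already printed once
        seen.add(line)
        result.append(line)
    return result


print(unique_new_lines(["alpha"], ["rho", "alpha", "rho", "phi"]))
```

Unlike the comm variant with sort -u, this keeps the entries in the order in which they first appear in the second file.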
Lri
  • 5,223
1

If you don't mind the order of entries, sort both files and then use comm.

comm -13 <(sort old.txt) <(sort new.txt)

Or if your shell doesn't have process substitution:

sort old.txt >sorted.old.txt
sort new.txt >sorted.new.txt
comm -13 sorted.old.txt sorted.new.txt
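The reason comm needs sorted input is that it walks both files in step, a single pass over each. The Python sketch below imitates that walk for the -13 case (lines unique to the second file). It is an approximation for illustration: the name only_in_second is made up, and Python compares strings by code point, whereas sort and comm follow the locale's collation order.

```python
def only_in_second(sorted_first, sorted_second):
    """Return items that appear only in sorted_second.

    Hypothetical helper for illustration; both inputs must already be
    sorted in the same order, just as comm requires.
    """
    result = []
    i = 0  # current position in sorted_first
    for item in sorted_second:
        # Skip entries of the first list that sort before the current item.
        while i < len(sorted_first) and sorted_first[i] < item:
            i += 1
        if i < len(sorted_first) and sorted_first[i] == item:
            i += 1  # the line is in both lists: consume it and print nothing
            continue
        result.append(item)
    return result


old_entries = ["alpha", "beta", "gama"]
new_entries = ["delta", "omega", "beta", "alpha", "gama", "rho", "phi"]
print(only_in_second(sorted(old_entries), sorted(new_entries)))
```

Because both inputs are sorted first, the result comes out in sorted order rather than in the order of the new file.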

If the order of entries matters, you can use grep with the right options: -Fx to match full lines exactly, -v to exclude matching lines, and -f to read a file containing the patterns to exclude.

grep -Fxv -f old.txt new.txt
0

In case you need a Python way of doing it:

#!/usr/bin/env python3

# Read each file into a set of stripped lines; "with" closes the files.
with open('/tmp/tmp.Q3JiYGY6fs/oldfile') as oldfp:
    old = {line.strip() for line in oldfp}
with open('/tmp/tmp.Q3JiYGY6fs/newfile') as newfp:
    new = {line.strip() for line in newfp}

# Set difference: entries present in newfile but not in oldfile.
print('Lines that are present only in newfile are \n{}\n\n{} '.format(42*'-', '\n'.join(new - old)))

And the output will be (sets are unordered, so the order of the lines may vary):

Lines that are present only in newfile are 
------------------------------------------

phi
rho
omega
delta
Kannan Mohan
  • 3,231