Finding new lines in one file compared with another

Question

I need to compare two text files. Each line of both files contains one entry. The new file contains entries that the old one lacks. I have tried diff and vimdiff, but they don't work because the lines may be in a different order.

For example:

OLD FILE

alpha
beta
gama

NEW FILE

delta
omega
beta
alpha
gama
rho
phi

diff and vimdiff compare line 1 with line 1, line 2 with line 2, and so on. Even if I sort both files, the comparison will not succeed, because the sorted versions can have new items in between, like "alpha, beta, rho" versus "alpha, beta, gama, rho".

How do I get a list of the entries that the new file has and the old one does not?

I don't understand the explanation why diff over the sorted files should not work. — Hauke Laging, Nov 02 '14 at 14:29

score 3 · Accepted Answer · answered Nov 02 '14 at 14:33

$ awk 'FNR == NR { oldfile[$0] = 1 }
  FNR != NR { if (oldfile[$0] == 0) print }' file1 file2
delta
omega
rho
phi

answered Nov 02 '14 at 14:33 by Hauke Laging

score 3 · Answer 2 · edited Nov 02 '14 at 14:53

I would use grep

grep -Fxvf oldfile newfile

-F: use fixed-string mode (no metacharacters)

-x: match the whole line (not a substring)

-f oldfile: read the strings to be matched from oldfile

-v: invert the match, i.e. print lines NOT found in oldfile
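
To see why both -F and -x matter, here is a small sketch on made-up data. The scratch directory and the entries a.c, abc and a.c.bak are invented for illustration and are not from the question. Dropping either flag changes which lines survive:

```shell
# Scratch files with hypothetical entries.
dir=$(mktemp -d)
printf '%s\n' 'a.c' > "$dir/oldfile"
printf '%s\n' 'a.c' 'abc' 'a.c.bak' > "$dir/newfile"

# Fixed strings, whole-line match: only the exact line a.c is filtered out.
both=$(grep -Fxvf "$dir/oldfile" "$dir/newfile")

# Without -F, a.c is a regex whose dot matches any character,
# so abc is filtered out as well.
no_F=$(grep -xvf "$dir/oldfile" "$dir/newfile")

# Without -x, a.c matches as a substring, so a.c.bak is filtered out as well.
no_x=$(grep -Fvf "$dir/oldfile" "$dir/newfile")

printf 'with -F and -x:\n%s\n' "$both"
printf 'with -x only:\n%s\n' "$no_F"
printf 'with -F only:\n%s\n' "$no_x"

rm -r "$dir"
```

Only the first form reports both abc and a.c.bak as new, which is the behaviour the question asks for.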

edited Nov 02 '14 at 14:53 by Gilles 'SO- stop being evil'

answered Nov 02 '14 at 14:50 by steeldriver

score 2 · Answer 3 · answered Nov 03 '14 at 22:47

A shorter awk command:

awk 'NR==FNR{a[$0];next}!($0 in a)' file1 file2

If file1 can be empty, replace NR==FNR with FILENAME==ARGV[1].
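
A quick sketch of why the empty-file case matters, using hypothetical scratch files (the directory and the two entries are assumptions made for the demonstration). With an empty file1, NR and FNR stay equal while file2 is being read, so the first form treats file2 as the old file and prints nothing:

```shell
dir=$(mktemp -d)
: > "$dir/file1"                          # empty "old" file
printf '%s\n' alpha beta > "$dir/file2"

# NR==FNR is also true for every line of file2 here, so each line is
# stored in the array and skipped instead of being printed.
with_nr=$(awk 'NR==FNR{a[$0];next}!($0 in a)' "$dir/file1" "$dir/file2")

# FILENAME==ARGV[1] is true only while file1 is being read,
# so the lines of file2 are correctly reported as new.
with_filename=$(awk 'FILENAME==ARGV[1]{a[$0];next}!($0 in a)' "$dir/file1" "$dir/file2")

printf 'NR==FNR printed: [%s]\n' "$with_nr"
printf 'FILENAME==ARGV[1] printed: [%s]\n' "$with_filename"

rm -r "$dir"
```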

grep -Fxvf file2 file1 is slow for large files:

$ jot -r 10000 1 100000 >file1;jot -r 10000 1 100000 >file2
$ time awk 'NR==FNR{a[$0];next}!($0 in a)' file1 file2 >/dev/null
0.015
$ time grep -Fxvf file2 file1 >/dev/null
36.758
$ time comm -13 <(sort file1) <(sort file2) >/dev/null
0.173

If you need to remove repeated lines, use

awk 'NR==FNR{a[$0];next}!b[$0]++&&!($0 in a)' file1 file2

or

comm -13 <(sort file1) <(sort -u file2)
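
As a rough check of both variants, here is a sketch on made-up input where file2 repeats some lines (the scratch directory and the file contents are hypothetical). Temporary files stand in for process substitution so that it also runs in a plain POSIX shell:

```shell
dir=$(mktemp -d)
printf '%s\n' alpha > "$dir/file1"
printf '%s\n' rho alpha delta rho delta > "$dir/file2"

# awk: b[$0]++ is zero only the first time a line is seen, so repeats
# are skipped and new lines come out in first-seen order.
awk_out=$(awk 'NR==FNR{a[$0];next}!b[$0]++&&!($0 in a)' "$dir/file1" "$dir/file2")

# comm: sort -u drops the repeats before the comparison, and the
# result comes out in sorted order.
sort "$dir/file1" > "$dir/sorted1"
sort -u "$dir/file2" > "$dir/sorted2"
comm_out=$(comm -13 "$dir/sorted1" "$dir/sorted2")

printf 'awk:\n%s\n' "$awk_out"
printf 'comm:\n%s\n' "$comm_out"

rm -r "$dir"
```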

score 1 · Answer 4 · answered Nov 02 '14 at 14:52

If you don't mind the order of entries, sort both files and then use comm.

comm -13 <(sort old.txt) <(sort new.txt)

Or if your shell doesn't have process substitution:

sort old.txt >sorted.old.txt
sort new.txt >sorted.new.txt
comm -13 sorted.old.txt sorted.new.txt

If the order of entries matters, you can use grep with the right options: -Fx to match full lines exactly, -v to exclude matching lines, and -f to read a file containing the patterns to exclude.

grep -Fxv -f old.txt new.txt
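
To make the ordering difference concrete, here is a sketch that runs both approaches on the sample data from the question. The scratch directory is an assumption; the file names follow the ones used above:

```shell
dir=$(mktemp -d)
printf '%s\n' alpha beta gama > "$dir/old.txt"
printf '%s\n' delta omega beta alpha gama rho phi > "$dir/new.txt"

# comm needs sorted input, so the new entries come out in sorted order.
sort "$dir/old.txt" > "$dir/sorted.old.txt"
sort "$dir/new.txt" > "$dir/sorted.new.txt"
sorted_out=$(comm -13 "$dir/sorted.old.txt" "$dir/sorted.new.txt")

# grep reads new.txt as it is, so the new entries keep their original order.
ordered_out=$(grep -Fxv -f "$dir/old.txt" "$dir/new.txt")

printf 'comm:\n%s\n' "$sorted_out"
printf 'grep:\n%s\n' "$ordered_out"

rm -r "$dir"
```

Both report the same four entries; only the position of rho and phi differs.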

score 0 · Answer 5 · answered Nov 04 '14 at 13:49

If you want a Python way of doing it:

#!/usr/bin/env python3

# Read each file into a set of lines, stripped of surrounding whitespace.
with open('/tmp/tmp.Q3JiYGY6fs/oldfile') as oldfp:
    old = {line.strip() for line in oldfp}
with open('/tmp/tmp.Q3JiYGY6fs/newfile') as newfp:
    new = {line.strip() for line in newfp}

# Set difference: entries in newfile that oldfile lacks (order is arbitrary).
print('Lines that are present only in newfile are \n{}\n\n{} '.format(42*'-', '\n'.join(new - old)))

And the output will be

Lines that are present only in newfile are 
------------------------------------------

phi
rho
omega
delta
