Is there a tool to get the lines in one file that are not in another?

Question

Is there any tool that can get the lines that file A contains but file B doesn't? I could write a simple script in, e.g., Perl, but if something like that already exists, it would save me time from now on.

see "http://stackoverflow.com/questions/5812756/print-lines-from-one-file-that-are-not-contained-in-another-file" — harish.venkat, Jan 02 '12 at 13:32
http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file — Ciro Santilli OurBigBook.com, Jun 27 '15 at 08:47

Caleb · Accepted Answer · 2017-05-08T13:58:48.190

213

Yes. The standard grep tool for searching files for text strings can be used to subtract all the lines in one file from another.

grep -F -x -v -f fileB fileA

This works by using each line in fileB as a pattern (-f fileB) and treating it as a plain string to match (not a regular regex) (-F). You force the match to happen on the whole line (-x) and print out only the lines that don't match (-v). Therefore you are printing out the lines in fileA that don't contain the same data as any line in fileB.
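To see the effect of each flag, here is a small sketch with made-up sample files. The scratch directory and the file contents are assumptions for illustration and are not from the question.

```shell
# Scratch directory with hypothetical sample data.
dir=$(mktemp -d)
printf '%s\n' apple banana cherry 'a.c' > "$dir/fileA"
printf '%s\n' banana abc cherry-pie > "$dir/fileB"

# -f fileB : read one pattern per line from fileB
# -F       : treat patterns as fixed strings, so "a.c" is not a regex
# -x       : a pattern must match the whole line, so "cherry-pie" does not hit "cherry"
# -v       : print the lines that did NOT match
result=$(grep -F -x -v -f "$dir/fileB" "$dir/fileA")
printf '%s\n' "$result"

rm -r "$dir"
```

Only banana is removed: it is the only line of fileA that is identical to a whole line of fileB.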

The downside of this solution is that it doesn't take line order into account, and if your input has duplicate lines in different places you might not get what you expect. The solution to that is to use a real comparison tool such as diff. You could do this by creating a diff with the context value set to the full number of lines in the file, then parsing it for just the lines that would be removed when converting file A to file B. (Note that this command also strips the diff formatting after it selects the right lines.)

diff -U $(wc -l < fileA) fileA fileB | sed -n '1,2d;s/^-//p' > fileC
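A sketch of how the two approaches differ on input with a repeated line. The sample data is made up. The `1,2d` in the sed script skips the two header lines of the unified diff (`--- fileA` and `+++ fileB`), which would otherwise match `^-` and be printed as if they were removed lines.

```shell
# Scratch directory with hypothetical sample data: "a" appears twice in fileA.
dir=$(mktemp -d)
printf '%s\n' a b a > "$dir/fileA"
printf '%s\n' a b   > "$dir/fileB"

# grep treats fileB as a set of lines: both "a" lines match, nothing is left.
# "|| true" only hides grep's exit status 1 for "no lines selected".
grep_result=$(grep -Fxvf "$dir/fileB" "$dir/fileA" || true)

# diff is position-aware: the second "a" has no counterpart in fileB.
diff_result=$(diff -U $(wc -l < "$dir/fileA") "$dir/fileA" "$dir/fileB" |
    sed -n '1,2d;s/^-//p')

printf 'grep: [%s]\ndiff: [%s]\n' "$grep_result" "$diff_result"
rm -r "$dir"
```

With this input grep reports nothing, while the diff pipeline reports the trailing a.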

edited May 08 '17 at 13:58

answered Jan 02 '12 at 13:57

Caleb

70,105

@inderpreet99 The lower case -u argument does actually take a parameter of a number as long as it is not followed by a space. The advantage of the way I had it before is that it will work with or without a value, so you could use something in that sub command routine that returned no output. Upper case '-U' on the other hand requires an argument. – Caleb Aug 27 '13 at 21:46
be careful, grep -f is O(N^2) I believe: http://stackoverflow.com/questions/4780203/deleting-lines-from-one-file-which-are-in-another-file – rogerdpack Oct 16 '15 at 17:33
1

the diff pipeline works a treat thanks. – Felipe Alvarez Nov 03 '16 at 23:01
To account for the sort problem, you could use process substitution in the command to process each file before the grep as needed. Example: grep -F -x -v -f <(sort fileB) <(sort fileA) – Tony Cesaro Oct 04 '17 at 15:22
@TonyCesaro That would work if your data set is not order specific and duplicates don't need to be taken into account. The advantage of using diff is that position in the file gets taken into account. – Caleb Oct 05 '17 at 06:06
The options can be collapsed together to get a shorter version: grep -Fxvf fileB fileA. – Quasímodo Oct 13 '20 at 13:39

score 64 · Answer 2 · edited Aug 27 '13 at 18:41

The answer depends a great deal on the type and format of the files you are comparing.

If the files you are comparing are sorted text files, then comm, the GNU tool written by Richard Stallman and David MacKenzie, may perform the filtering you are after. It is part of GNU coreutils.

Example

Say you have the following 2 files:

$ cat a
1
2
3
4
5

$ cat b
1
2
3
4
5
6

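Lines that appear in only one of the files (-3 suppresses the lines common to both). Here only 6 is printed, indented into the second column because it is in file b but not in file a: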

$ comm <(sort a) <(sort b) -3
    6

edited Aug 27 '13 at 18:41

slm

369,824

answered Jan 03 '12 at 06:30

A friend

641

11

so sort them ? comm <(sort a) <(sort b) -1 -2 – Sirex Feb 21 '12 at 08:17
This is some weird syntax. <()? It works and I get it, but is there a name for this weirdness? – mlissner Apr 05 '17 at 22:47
@mlissner comm can also read stdin (see man page). <() is your redirection. – berbt Apr 09 '17 at 22:59
2

@mlissner <() is also known as process substitution. – miku Apr 13 '17 at 15:51
5

comm was originally written circa 1973 by someone at Bell Labs, not rms. You're referring to the GNU implementation which came a lot later. There have been a lot of different implementations of the Unix utilities across the years. – Stéphane Chazelas Oct 30 '17 at 08:33
comm doesn't seem to be fitted for use in scripts. E.g. when you show more than one column. To be frank, the first impression, it's weird. – x-yuri Jun 02 '22 at 04:55

JJS · Answer 3 · 2020-03-02T12:53:32.333

42

from stackoverflow...

comm -23 file1 file2

-23 suppresses the lines in file2 (-2) and the lines that appear in both (-3), leaving only the unique lines from file1. The files have to be sorted (they are in your example) but if not, pipe them through sort first.

See the man page here

-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
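A minimal sketch of the sort-then-comm workflow for unsorted input. The sample data and the `.sorted` file names are made up. `LC_ALL=C` is an assumption added here so that sort and comm agree on the collation order.

```shell
# Scratch directory with hypothetical, unsorted sample data.
dir=$(mktemp -d)
printf '%s\n' pear apple fig > "$dir/file1"
printf '%s\n' fig kiwi       > "$dir/file2"

# comm requires both inputs to be sorted in the same collation order.
LC_ALL=C sort "$dir/file1" > "$dir/file1.sorted"
LC_ALL=C sort "$dir/file2" > "$dir/file2.sorted"

# -2 drops lines unique to file2, -3 drops lines common to both,
# leaving column 1: lines that are only in file1.
only_in_file1=$(LC_ALL=C comm -23 "$dir/file1.sorted" "$dir/file2.sorted")
printf '%s\n' "$only_in_file1"

rm -r "$dir"
```

The result comes out in sorted order, not in the original order of file1.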

edited Mar 02 '20 at 12:53

answered Jun 07 '13 at 19:51

JJS

521

1

This doesn't work for me, for some reason ... – Jan Feb 07 '18 at 21:05
@Jan are your files sorted? How did you sort them? – JJS Feb 08 '18 at 19:03
@roaima I have additional explanation not supplied by the other answer. Also, I'm adding -2, and they are only using -3, so not exactly the same. – JJS Mar 02 '20 at 12:55
@roaima thanks for pointing out that I can improve my answer by standing out from other similar answers! – JJS Mar 02 '20 at 18:28
This is the best answer because comm is purpose-built to do what OP wanted to do. And by "best answer" I mean "will suit for most people", not "will suit for every single use case including absurdly large unsorted files". :D – Paul Jul 16 '22 at 20:30

score 9 · Answer 4 · edited May 23 '17 at 12:40

The grep and comm (with sort) methods take a long time on large files. SiegeX and ghostdog74 shared two great awk methods for extracting lines unique to one of two files over on Stack Overflow:

$ awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2

$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
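Both one-liners print the lines of file2 that do not occur in file1. A commented sketch of the second form, with made-up sample data and the array renamed to `seen` for readability:

```shell
# Scratch directory with hypothetical sample data.
dir=$(mktemp -d)
printf '%s\n' b c     > "$dir/file1"
printf '%s\n' a b d a > "$dir/file2"

result=$(awk '
    # First file only (FNR == NR): remember each whole line as an array key.
    FNR == NR { seen[$0]; next }
    # Later file: print lines never seen in the first file.
    !($0 in seen)
' "$dir/file1" "$dir/file2")
printf '%s\n' "$result"

rm -r "$dir"
```

The order of file2 and its duplicates are preserved, and neither file needs to be sorted. One caveat: if file1 is empty, `FNR == NR` is also true while reading file2, so nothing is printed.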

edited May 23 '17 at 12:40

Community

1

answered Sep 29 '13 at 05:36

Miles Wolbe

191

4

If you're doing this with huge files, then the memory constraints of loading a huge file into an associative array are going to be prohibitive. – Charles Duffy Jun 17 '16 at 14:01
@CharlesDuffy, that would be a lot worse with grep -Fvxf approach though, where the performance would be unacceptable long before memory exhaustion. – Stéphane Chazelas Jan 20 '24 at 12:29
@StéphaneChazelas, indeed; for really large files one really ought to use sort+comm, where better sort implementations' ability to generate temporary files and merge-sort them together means that memory constraints are moot. (Of course, keeping content pre-sorted to the extent possible is in this context a Very Good Idea) – Charles Duffy Jan 20 '24 at 17:57

score 5 · Answer 5 · edited Sep 15 '12 at 20:36

If the files are big and you don't have a custom order to your entries, grep takes much too long. A quick alternative would be

sort file1 > 1
sort file2 > 2
diff 1 2 | sed -n 's/^> //p'
rm 1 2

[file2-file1 results to screen, pipe to file etc.]

Changing > to < would get the opposite subtraction.
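The same idea as a sketch that works in a scratch directory from mktemp, so no files named 1 and 2 are left in the current directory. The sample data and variable names are made up. The sed script keeps only the lines that diff prefixes with "> ", which are the ones present only in the second file.

```shell
# Scratch directory with hypothetical sample data.
dir=$(mktemp -d)
printf '%s\n' c a b > "$dir/file1"
printf '%s\n' d b c > "$dir/file2"

sort "$dir/file1" > "$dir/sorted1"
sort "$dir/file2" > "$dir/sorted2"

# In diff's default output, "> " marks lines only in the second file
# and "< " marks lines only in the first file.
only_in_file2=$(diff "$dir/sorted1" "$dir/sorted2" | sed -n 's/^> //p')
printf '%s\n' "$only_in_file2"

rm -r "$dir"
```

Anchoring the pattern at the start of the line keeps diff's change markers such as `1d0` and the "< " lines out of the result.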

edited Sep 15 '12 at 20:36

jasonwryan

73,126

answered Aug 24 '12 at 15:12

Eshel Faraggi

51

score 3 · Answer 6 · answered Oct 13 '20 at 13:45

combine, from the moreutils package, is very intuitive:

combine fileA not fileB

The files do not need to be sorted and the order is preserved.

There are also xor, or and and operators.

answered Oct 13 '20 at 13:45

Quasímodo

18,865

score 1 · Answer 7 · answered Feb 21 '12 at 10:26

You could also consider vimdiff; it highlights the differences between files in a Vim editor.

answered Feb 21 '12 at 10:26

simona

1,011

2

But is there an easy way to automatically do the subtraction in Vimdiff? – Kazark Mar 14 '13 at 17:38

score 0 · Answer 8 · answered Oct 13 '20 at 16:32

awk 'NR==FNR{a[$1];next}!($1 in a){print }' fileA fileB

$ cat fileA
1
2
3
4
5
6
7
8
9

$ cat fileB
8
9
10
11
12
13
14
15
16
17
18
19

Output:

10
11
12
13
14
15
16
17
18
19

This prints the lines of fileB that are not in fileA.

answered Oct 13 '20 at 16:32

Praveen Kumar BS

5,211

Please let me know the reason for down vote – Praveen Kumar BS Oct 14 '20 at 13:44
1

You want $0, not $1. Also this is essentially the same as Miles' answer. – Quasímodo Oct 19 '20 at 15:37
