144

Is there any tool that can get the lines that file A contains but file B doesn't? I could write a simple script in, e.g., Perl, but if something like that already exists, it would save me time from now on.

john-jones
  • 1,736
daisy
  • 54,555

8 Answers

213

Yes. The standard grep tool for searching files for text strings can be used to subtract all the lines in one file from another.

grep -F -x -v -f fileB fileA

This works by using each line in fileB as a pattern (-f fileB) and treating it as a plain string to match, not a regular expression (-F). You force the match to happen on the whole line (-x) and print out only the lines that don't match (-v). Therefore you are printing out the lines in fileA that are not identical to any line in fileB.
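As a small sketch of the flags described above (the sample data and the tmpdir and result variable names are made up for illustration), the pattern a.c in fileB does not remove abc from fileA because of -F, and app does not remove apple because of -x:

```shell
# Hypothetical sample data; only the names fileA and fileB come from the answer.
tmpdir=$(mktemp -d)
printf '%s\n' apple banana abc > "$tmpdir/fileA"
printf '%s\n' banana a.c app > "$tmpdir/fileB"

# -f: read patterns from fileB, -F: plain strings, -x: whole line, -v: invert
result=$(grep -F -x -v -f "$tmpdir/fileB" "$tmpdir/fileA")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

With this data only banana has an identical line in fileB, so apple and abc are printed.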

The downside of this solution is that it doesn't take line order into account, and if your input has duplicate lines in different places you might not get what you expect. The solution to that is to use a real comparison tool such as diff. You could do this by creating a diff with the context value set to the full number of lines in the file, then parsing it for just the lines that would be removed if converting file A to file B. (Note that this command also strips the diff formatting once it has the right lines; the 1,2d skips the two file header lines, the first of which begins with - and would otherwise match as well.)

diff -U $(wc -l < fileA) fileA fileB | sed -n '1,2d;s/^-//p' > fileC
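To illustrate the duplicate-line caveat, here is a rough sketch comparing the two approaches on made-up data (the variable names are hypothetical). It assumes a diff that supports -U, and it skips the two diff header lines explicitly so that the --- line is not mistaken for a removed line:

```shell
# Hypothetical sample data: fileA contains "a" twice, fileB only once.
tmpdir=$(mktemp -d)
printf '%s\n' a b a c > "$tmpdir/fileA"
printf '%s\n' a c > "$tmpdir/fileB"

# grep treats fileB as a set of strings, so every "a" in fileA is dropped.
grep_result=$(grep -F -x -v -f "$tmpdir/fileB" "$tmpdir/fileA")

# diff pairs lines up by position, so only one "a" can match fileB's "a".
lines=$(wc -l < "$tmpdir/fileA" | tr -d ' ')
diff_result=$(diff -U "$lines" "$tmpdir/fileA" "$tmpdir/fileB" |
  sed -n -e '1,2d' -e 's/^-//p')

printf 'grep:\n%s\n' "$grep_result"
printf 'diff:\n%s\n' "$diff_result"

rm -r "$tmpdir"
```

Here grep reports only b, while the diff pipeline reports b plus one of the two a lines.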
Caleb
  • 70,105
  • @inderpreet99 The lower case -u argument does actually take a numeric parameter as long as it is not followed by a space. The advantage of the way I had it before is that it will work with or without a value, so you could use something in that sub-command that returned no output. Upper case '-U', on the other hand, requires an argument. – Caleb Aug 27 '13 at 21:46
  • be careful, grep -f is O(N^2) I believe: http://stackoverflow.com/questions/4780203/deleting-lines-from-one-file-which-are-in-another-file – rogerdpack Oct 16 '15 at 17:33
  • 1
    the diff pipeline works a treat thanks. – Felipe Alvarez Nov 03 '16 at 23:01
  • To account for the sort problem, you could use process substitution in the command to process each file before the grep as needed. Example: grep -F -x -v -f <(sort fileB) <(sort fileA) – Tony Cesaro Oct 04 '17 at 15:22
  • @TonyCesaro That would work if your data set is not order specific and duplicates don't need to be taken into account. The advantage of using diff is that position in the file gets taken into account. – Caleb Oct 05 '17 at 06:06
  • The options can be collapsed together to get a shorter version: grep -Fxvf fileB fileA. – Quasímodo Oct 13 '20 at 13:39
64

The answer depends a great deal on the type and format of the files you are comparing.

If the files you are comparing are sorted text files, then comm, the GNU tool written by Richard Stallman and David MacKenzie, may perform the filtering you are after. It is part of GNU coreutils.

Example

Say you have the following 2 files:

$ cat a
1
2
3
4
5

$ cat b
1
2
3
4
5
6

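Lines that appear in only one of the two files (here just 6, which is in b but not in a; it is indented because comm prints lines unique to the second file in column 2):

$ comm -3 <(sort a) <(sort b)
    6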
slm
  • 369,824
A friend
  • 641
42

from stackoverflow...

comm -23 file1 file2

-23 suppresses the lines unique to file2 (-2) and the lines that appear in both (-3), leaving only the lines unique to file1. The files have to be sorted (they are in your example); if they are not, pipe them through sort first.

See the man page here

-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
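A minimal sketch of the sort-then-comm approach on unsorted input, using made-up data and hypothetical names (tmpdir, the .sorted files, only_in_file1). LC_ALL=C is an assumption here to keep sort and comm on the same collation order:

```shell
# Hypothetical unsorted sample data.
tmpdir=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmpdir/file1"
printf '%s\n' fig kiwi > "$tmpdir/file2"

# comm expects both inputs sorted in the same collation order.
LC_ALL=C sort "$tmpdir/file1" > "$tmpdir/file1.sorted"
LC_ALL=C sort "$tmpdir/file2" > "$tmpdir/file2.sorted"

# -2 drops lines unique to file2, -3 drops lines common to both.
only_in_file1=$(LC_ALL=C comm -23 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")
printf '%s\n' "$only_in_file1"

rm -r "$tmpdir"
```

This prints apple and pear. Note that the result comes out in sorted order, not in the original order of file1.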
JJS
  • 521
  • 1
    This doesn't work for me, for some reason ... – Jan Feb 07 '18 at 21:05
  • @Jan are your files sorted? How did you sort them? – JJS Feb 08 '18 at 19:03
  • @roaima I have additional explanation not supplied by the other answer. Also, I'm adding -2, and they are only using -3, so not exactly the same. – JJS Mar 02 '20 at 12:55
  • @roaima thanks for pointing out that I can improve my answer by standing out from other similar answers! – JJS Mar 02 '20 at 18:28
  • This is the best answer because comm is purpose-built to do what OP wanted to do. And by "best answer" I mean "will suit for most people", not "will suit for every single use case including absurdly large unsorted files". :D – Paul Jul 16 '22 at 20:30
9

The grep and comm (with sort) methods take a long time on large files. SiegeX and ghostdog74 shared two great awk methods for this over on Stack Overflow. Both print the lines of file2 that do not appear in file1:

$ awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2

$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
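For the original question (lines in file A that file B lacks), the file to subtract goes first on the command line. A small sketch with made-up data, where the array name seen and the variable names are illustrative only:

```shell
# Hypothetical sample data; fileA is unsorted and has a duplicate line.
tmpdir=$(mktemp -d)
printf '%s\n' zebra apple mango apple > "$tmpdir/fileA"
printf '%s\n' mango kiwi > "$tmpdir/fileB"

# First file (NR==FNR): remember every line of fileB.
# Second file: print the lines of fileA that were never seen in fileB.
# Caveat: if fileB is empty, NR==FNR also holds for fileA and nothing prints.
result=$(awk 'NR==FNR { seen[$0]; next } !($0 in seen)' \
  "$tmpdir/fileB" "$tmpdir/fileA")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

Unlike sort plus comm, this keeps the original order and the duplicates of fileA: it prints zebra, apple, apple.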
  • 4
    If you're doing this with huge files, then the memory constraints of loading a huge file into an associative array are going to be prohibitive. – Charles Duffy Jun 17 '16 at 14:01
  • @CharlesDuffy, that would be a lot worse with grep -Fvxf approach though, where the performance would be unacceptable long before memory exhaustion. – Stéphane Chazelas Jan 20 '24 at 12:29
  • @StéphaneChazelas, indeed; for really large files one really ought to use sort+comm, where better sort implementations' ability to generate temporary files and merge-sort them together means that memory constraints are moot. (Of course, keeping content pre-sorted to the extent possible is in this context a Very Good Idea) – Charles Duffy Jan 20 '24 at 17:57
5

If the files are big and the order of your entries doesn't matter, grep takes much too long. A quick alternative would be

sort file1 > 1
sort file2 > 2
diff 1 2 | grep '^>' | sed -e 's/^> //'

[file2-file1 results to screen, pipe to file etc.]

Changing > to < would get the opposite subtraction. Remove the temporary files afterwards:

rm 1 2
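A sketch of the same idea with the temporary files kept in a scratch directory (the directory and variable names are made up, as is the data). It folds the filtering into a single sed call anchored at the start of the line, so only lines that diff marks with a leading > are kept:

```shell
# Hypothetical sample data.
tmpdir=$(mktemp -d)
printf '%s\n' c a b > "$tmpdir/file1"
printf '%s\n' d c b > "$tmpdir/file2"

sort "$tmpdir/file1" > "$tmpdir/1"
sort "$tmpdir/file2" > "$tmpdir/2"

# In diff's default output, "> " marks lines present only in the second file.
result=$(diff "$tmpdir/1" "$tmpdir/2" | sed -n 's/^> //p')
printf '%s\n' "$result"

rm -r "$tmpdir"
```

With this data the only line in file2 that file1 lacks is d.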

jasonwryan
  • 73,126
3

combine from the moreutils package is very intuitive:

combine fileA not fileB

The files do not need to be sorted and the order is preserved.

There are also xor, or and and operators.

Quasímodo
  • 18,865
1

You could also consider vimdiff, which highlights the differences between files in a Vim editor.

simona
  • 1,011
0
awk 'NR==FNR{a[$1];next}!($1 in a){print}' fileA fileB

$ cat fileA
1
2
3
4
5
6
7
8
9

$ cat fileB
8
9
10
11
12
13
14
15
16
17
18
19

Output:

10
11
12
13
14
15
16
17
18
19

This prints the lines of fileB whose first field does not appear in fileA.