Is there any tool that can get the lines which file A contains but file B doesn't? I could write a simple script in, e.g., perl, but if something like that already exists, it will save me time from now on.
- See http://stackoverflow.com/questions/5812756/print-lines-from-one-file-that-are-not-contained-in-another-file – harish.venkat Jan 02 '12 at 13:32
- See also http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file – Ciro Santilli OurBigBook.com Jun 27 '15 at 08:47
8 Answers
Yes. The standard `grep` tool for searching files for text strings can be used to subtract all the lines in one file from another.

grep -F -x -v -f fileB fileA

This works by using each line of fileB as a pattern (`-f fileB`) and treating it as a plain string to match rather than a regular expression (`-F`). You force the match to cover the whole line (`-x`) and print only the lines that don't match (`-v`). Therefore you are printing out the lines in fileA that don't contain the same data as any line in fileB.
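A quick worked run (the file names and contents here are just for illustration):

```shell
# Two throwaway sample files
printf '%s\n' apple banana cherry date > fileA
printf '%s\n' banana date > fileB

# Print the lines of fileA that match no whole line of fileB
grep -F -x -v -f fileB fileA
# apple
# cherry
```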
The downside of this solution is that it doesn't take line order into account, and if your input has duplicate lines in different places you might not get what you expect. The solution to that is to use a real comparison tool such as `diff`. You can do this by creating a diff whose context covers 100% of the lines in the file, then keeping just the lines that would be removed when converting fileA to fileB. (Note this command also strips the diff formatting once it has the right lines.)
diff -U $(wc -l < fileA) fileA fileB | sed -n '1,3d; s/^-//p' > fileC
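For example (sample data assumed; the `1,3d` skips the `---`/`+++`/`@@` header lines so they are not mistaken for removed lines):

```shell
printf '%s\n' one two two three > fileA
printf '%s\n' two three > fileB

# Full-context unified diff; sed drops the three header lines,
# then prints each removed line with its leading '-' stripped
diff -U "$(wc -l < fileA)" fileA fileB | sed -n '1,3d; s/^-//p'
# one
# two
```

Note how duplicates are handled: the plain grep recipe would drop both copies of `two`, while `diff` removes only the one copy needed to turn fileA into fileB.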

- @inderpreet99 The lower-case `-u` argument does actually take a numeric parameter, as long as it is not separated by a space. The advantage of the way I had it before is that it works with or without a value, so the sub-command could return no output. Upper-case `-U`, on the other hand, requires an argument. – Caleb Aug 27 '13 at 21:46
- Be careful: `grep -f` is O(N²), I believe: http://stackoverflow.com/questions/4780203/deleting-lines-from-one-file-which-are-in-another-file – rogerdpack Oct 16 '15 at 17:33
- To account for the sort problem, you can use process substitution to sort each file before the `grep` as needed. Example: `grep -F -x -v -f <(sort fileB) <(sort fileA)` – Tony Cesaro Oct 04 '17 at 15:22
- @TonyCesaro That works if your data set is not order-specific and duplicates don't need to be taken into account. The advantage of using `diff` is that position in the file gets taken into account. – Caleb Oct 05 '17 at 06:06
- The options can be collapsed together to get a shorter version: `grep -Fxvf fileB fileA`. – Quasímodo Oct 13 '20 at 13:39
The answer depends a great deal on the type and format of the files you are comparing.
If the files you are comparing are sorted text files, then the GNU tool `comm`, written by Richard Stallman and David MacKenzie, may perform the filtering you are after. It is part of coreutils.
Example
Say you have the following two files:
$ cat a
1
2
3
4
5
$ cat b
1
2
3
4
5
6
Lines in file `b` that are not in file `a`:
$ comm <(sort a) <(sort b) -3
6
- This is some weird syntax, `<()`? It works and I get it, but is there a name for this weirdness? – mlissner Apr 05 '17 at 22:47
- @mlissner `comm` can also read stdin (see the man page); `<()` is process substitution. – berbt Apr 09 '17 at 22:59
2
-
5
- `comm` was originally written circa 1973 by someone at Bell Labs, not rms. You're referring to the GNU implementation, which came a lot later. There have been a lot of different implementations of the Unix utilities across the years. – Stéphane Chazelas Oct 30 '17 at 08:33
- `comm` doesn't seem well suited for use in scripts, e.g. when you show more than one column. To be frank, on first impression it's weird. – x-yuri Jun 02 '22 at 04:55
From Stack Overflow:

comm -23 file1 file2

`-23` suppresses the lines unique to file2 (`-2`) and the lines that appear in both files (`-3`), leaving only the lines unique to file1. The files have to be sorted (they are in your example), but if not, pipe them through `sort` first.
See the man page:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
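A short demonstration (sample data assumed), sorting unsorted input on the fly with process substitution:

```shell
printf '%s\n' cherry apple banana > fileA
printf '%s\n' banana date > fileB

# Lines unique to fileA; comm requires sorted input
comm -23 <(sort fileA) <(sort fileB)
# apple
# cherry
```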

- @roaima I have additional explanation not supplied by the other answer. Also, I'm adding `-2`, and they are only using `-3`, so not exactly the same. – JJS Mar 02 '20 at 12:55
- @roaima Thanks for pointing out that I can improve my answer by standing out from other similar answers! – JJS Mar 02 '20 at 18:28
- This is the best answer, because `comm` is purpose-built to do what the OP wanted to do. And by "best answer" I mean "will suit most people", not "will suit every single use case, including absurdly large unsorted files". :D – Paul Jul 16 '22 at 20:30
The `grep` and `comm` (with `sort`) methods take a long time on large files. SiegeX and ghostdog74 shared two great `awk` methods for extracting lines unique to one of two files over on Stack Overflow:
$ awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2
$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
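Spelled out with comments (sample data assumed), the second one-liner reads as follows. `FNR` resets per input file while `NR` keeps counting, so `FNR == NR` is true only while the first file is being read:

```shell
printf '%s\n' a b c > file1
printf '%s\n' b c d e > file2

awk '
    FNR == NR { a[$0]++; next }  # first file: record each line as an array key
    !($0 in a)                   # second file: print lines absent from file1
' file1 file2
# d
# e
```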

- If you're doing this with huge files, then the memory constraints of loading a huge file into an associative array are going to be prohibitive. – Charles Duffy Jun 17 '16 at 14:01
- @CharlesDuffy That would be a lot worse with the `grep -Fvxf` approach, though, where the performance would become unacceptable long before memory exhaustion. – Stéphane Chazelas Jan 20 '24 at 12:29
- @StéphaneChazelas Indeed; for really large files one really ought to use `sort` + `comm`, where better `sort` implementations' ability to generate temporary files and merge-sort them together means that memory constraints are moot. (Of course, keeping content pre-sorted to the extent possible is, in this context, a Very Good Idea.) – Charles Duffy Jan 20 '24 at 17:57
If the files are big and your entries have no custom order, grep takes much too long. A quick alternative would be:

sort file1 > 1
sort file2 > 2
diff 1 2 | grep '^>' | sed -e 's/^> //'
[file2-file1 results to screen; pipe to a file, etc.]

Changing `^>` to `^<` gives the opposite subtraction. Clean up with `rm 1 2`.
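A worked run of this recipe (sample data assumed), using an anchored `^>` match so that only added-line markers are kept:

```shell
printf '%s\n' b a c > file1
printf '%s\n' c d b a e > file2

sort file1 > 1
sort file2 > 2
# Lines in file2 that are not in file1
diff 1 2 | grep '^>' | sed -e 's/^> //'
# d
# e
rm 1 2
```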

You could also consider `vimdiff`; it highlights the differences between files in a vim editor.

- But is there an easy way to automatically do the subtraction in vimdiff? – Kazark Mar 14 '13 at 17:38
awk 'NR==FNR{a[$1];next}!($1 in a){print }' fileA fileB
$ cat fileA
1
2
3
4
5
6
7
8
9
$ cat fileB
8
9
10
11
12
13
14
15
16
17
18
19
Output:
10
11
12
13
14
15
16
17
18
19
It will print the contents of fileB that are not in fileA.

- You want `$0`, not `$1`. Also, this is essentially the same as Miles' answer. – Quasímodo Oct 19 '20 at 15:37