23

Say I have two files: a.txt and b.txt.

The content of a.txt:

hello world

The content of b.txt:

hello world
something else

Of course I can use vimdiff to check their difference. I already know that a.txt is a subset of b.txt, meaning that b.txt contains every line that exists in a.txt (as in the example above).

My question is: how do I write the lines that exist in b.txt but not in a.txt to a file?

Yves
  • 3,291
  • possible duplicate https://unix.stackexchange.com/questions/114877/how-to-know-if-a-text-file-is-a-subset-of-another – Wissam Roujoulah Mar 06 '18 at 06:51
  • @WissamRoujoulah You see, I don't need to check whether a.txt is a subset of b.txt; I already know that it is. My question is how to write the difference to a file. – Yves Mar 06 '18 at 06:56

4 Answers

32
comm -1 -3 a.txt b.txt > c.txt

The -1 excludes lines that are only in a.txt, and the -3 excludes lines that are in both. Thus only the lines exclusive to b.txt are output (see man comm or comm --help for details). Note that comm expects both files to be sorted. The output is redirected to c.txt.
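Because comm expects sorted input, unsorted files can be sorted into copies first. This is only a minimal sketch: the sample data extends the question's example with one extra line, and a.sorted and b.sorted are hypothetical scratch names.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1

# Sample data: b.txt holds every line of a.txt plus two more, deliberately unsorted.
printf 'hello world\n' > a.txt
printf 'something else\nhello world\nanother line\n' > b.txt

# comm needs sorted input; a.sorted and b.sorted are hypothetical names.
sort a.txt > a.sorted
sort b.txt > b.sorted

# -1 drops lines found only in the first file, -3 drops lines common to both,
# which leaves the lines found only in the second file.
comm -1 -3 a.sorted b.sorted > c.txt
cat c.txt
```

The result comes out in sorted order, not in the original order of b.txt.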

If you want a full diff between the two files, use diff rather than comm, e.g.:

diff -u a.txt b.txt > c.txt
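If only the added lines are wanted from the unified diff, they can be pulled back out of c.txt. The following is a rough sketch that assumes c.txt holds a diff of a single pair of files (so the first two lines are the ---/+++ header); added.txt is a hypothetical output name.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1
printf 'hello world\n' > a.txt
printf 'hello world\nsomething else\n' > b.txt

# diff exits with status 1 when the files differ; that is not an error here.
diff -u a.txt b.txt > c.txt

# Skip the two header lines, then print each added line without its leading "+".
sed -n '3,$s/^+//p' c.txt > added.txt
cat added.txt
```

Context lines start with a space, removed lines with "-", and hunk headers with "@@", so none of them match the pattern.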
cas
  • 78,579
  • 1
    +1. comm looks like a very handy command during file comparison. But diff -u is giving a bunch of other information as well. – Utsav Mar 06 '18 at 07:18
  • 1
    The extra information from diff is context and instructions (e.g. - to delete a line, + to add a line) that allow b.txt to be reconstructed from a.txt and c.txt - i.e. c.txt is a patch file, understood by the patch program. It is also simple enough for people to understand easily. BTW, diff has several different output styles; -u is the "unified diff" format. There are also various specialised versions of diff - e.g. to do side-by-side comparisons, or compare 3 files at once, or find word differences within a line. colordiff is also handy for colourising diff output. – cas Mar 06 '18 at 07:41
  • If the a.txt and b.txt files are not sorted you will get a warning. To avoid the warning and the sorting use the --nocheck-order flag. – Skek Tek May 09 '19 at 12:05
6

If you don't care about the subset relationship, you can simply use

diff a.txt b.txt | grep '^> ' | cut -c 3- > foo.txt

For example:

$ cat a.txt
hello world
$ cat b.txt
hello world
something else
$ diff a.txt b.txt | grep '^> ' | cut -c 3- > foo.txt
$ cat foo.txt
something else
Utsav
  • 450
1

Limitation: this is not a real file diff but a set difference of the lines (which may be exactly what you need).

All differences between a.txt and b.txt:

sort a.txt b.txt | uniq -u > c.txt

Lines that are missing from a.txt (ignoring lines that are missing from b.txt):

sort a.txt a.txt b.txt | uniq -u > c.txt

Explanation: after sorting the two files together, every line common to both appears twice. uniq -u prints only the lines that occur exactly once, i.e. the lines present in only one of the files. Listing one of the inputs twice (a.txt above) suppresses every line of that file in the output.
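To make the explanation concrete, here is a small sketch. The line "only in a" is added to a.txt purely for illustration (in the question a.txt is a strict subset), and sym.txt and only_b.txt are hypothetical output names.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1
printf 'hello world\nonly in a\n' > a.txt
printf 'hello world\nsomething else\n' > b.txt

# Each file listed once: shared lines occur twice, so uniq -u drops them
# and keeps the lines that belong to exactly one file.
sort a.txt b.txt | uniq -u > sym.txt

# a.txt listed twice: every line of a.txt now occurs at least twice and is
# dropped, so only the lines exclusive to b.txt survive.
sort a.txt a.txt b.txt | uniq -u > only_b.txt

cat sym.txt
cat only_b.txt
```

Here sym.txt ends up with both "only in a" and "something else", while only_b.txt contains just "something else".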

Duplicate lines in either file ruin the output of the commands above. If your files contain duplicates, remove them first and run the commands above on the newly created files:

sort a.txt | uniq > aa.txt
sort b.txt | uniq > bb.txt

To check the result, these two commands should give you the same checksum:

sort b.txt c.txt | uniq | sha256sum
sort a.txt c.txt | uniq | sha256sum

If one of the files is a superset of the other (it has all the lines of the other, plus possibly more), you can simplify a bit. In your example b.txt is the superset, so these two commands should also give you the same checksum:

sort b.txt | sha256sum
sort a.txt c.txt | sha256sum
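The superset check can also be scripted from start to finish. This sketch assumes sha256sum (from GNU coreutils) is installed and that neither file contains duplicate lines; sum_b and sum_ac are hypothetical variable names.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1
printf 'hello world\n' > a.txt
printf 'hello world\nsomething else\n' > b.txt

# Lines that are in b.txt but not in a.txt.
sort a.txt a.txt b.txt | uniq -u > c.txt

# sha256sum prints the hash followed by a file name column; keep only the hash.
sum_b=$(sort b.txt | sha256sum | cut -d ' ' -f 1)
sum_ac=$(sort a.txt c.txt | sha256sum | cut -d ' ' -f 1)

if [ "$sum_b" = "$sum_ac" ]; then
    echo "match: a.txt plus c.txt has the same lines as b.txt"
else
    echo "mismatch"
fi
```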
redseven
  • 592
-2

b.txt - a.txt : sort a.txt a.txt b.txt | uniq -u > foo.txt

  • 1
    Welcome to U&L. Would you care to explain a little? In particular, why two instances of a.txt? – Archemar Feb 11 '22 at 08:07
  • 3
    This doesn’t preserve the order of lines in b.txt, and will remove any lines which are repeated in b.txt even if they aren’t present in a.txt. – Stephen Kitt Feb 11 '22 at 12:57
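If the order of b.txt and its repeated lines must survive, a technique other than sort | uniq -u is needed. One possible sketch, which does not come from the answers above, uses grep with a.txt as a list of fixed whole-line patterns. The sample data and the output name only_b.txt are hypothetical.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1
printf 'hello world\n' > a.txt
printf 'zebra\nhello world\nsomething else\nzebra\n' > b.txt

# -F: treat patterns as fixed strings, -x: match whole lines only,
# -v: print the lines that do NOT match, -f a.txt: read patterns from a.txt.
grep -F -x -v -f a.txt b.txt > only_b.txt
cat only_b.txt
```

Both "zebra" lines are kept, in their original positions. Note that grep exits with status 1 when it selects no lines at all, which here would simply mean b.txt has nothing beyond a.txt.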