the seemingly short and sweet solution to this is the ubiquitous but also ominous awk duplicate remover:
awk '!x[$0]++'
not only does it remove duplicates, it also keeps the original order of the input file(s). for an explanation of how this command works, see: How does awk '!a[$0]++' work?
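spelled out, the one-liner does roughly the following. this is only a sketch to show the logic; the names dedup and seen are made up for illustration.

```shell
# same logic as awk '!x[$0]++', written out step by step.
# "dedup" and "seen" are illustrative names, not part of any tool.
dedup() {
    awk '{
        if (seen[$0] == 0) {   # an entry that was never set counts as 0
            print              # first time this line shows up: print it
        }
        seen[$0]++             # remember that the line has been seen
    }' "$@"
}

printf 'a\nb\nb\na\nc\n' | dedup
```

the last command prints a, b and c once each, in the order they first appear.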
used like this
awk '!x[$0]++' file1 file2
it will print file1 then file2 without the overlap. since the overlap in file2 duplicates lines already seen in file1, it is removed.
but beware! it will remove all duplicates from the files. observe:
$ cat file1
a
b
b
overlap1
overlap2
$ cat file2
overlap1
overlap2
p
q
q
$ awk '!x[$0]++' file1 file2
a
b
overlap1
overlap2
p
q
it also removed the duplicate lines that are not part of the overlap.
if your files are otherwise duplicate free, or you want to get rid of the duplicates anyway, then this command is fine.
if you want to keep the duplicate lines then read on.
here is a method to manually remove the overlap without removing duplicates. it can be automated but i did not make the effort.
first get the last line from the first file:
$ tail -n1 file1
overlap2
now remove all lines from the second file up to and including that line (the 0,/regex/ address is a GNU sed extension):
$ sed '0,/overlap2/d' file2
p
q
q
concatenate file1 with the result of the second command and you get a concatenated file without the overlap but with duplicates preserved:
$ cat file1 <(sed '0,/overlap2/d' file2)
a
b
b
overlap1
overlap2
p
q
q
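the two manual steps can be glued together in a small function. a minimal sketch, not a polished tool: the name cat_after_last is made up, and it uses grep -xF with tail instead of sed so that the last line is matched as a whole fixed string rather than as a regex.

```shell
# sketch: print $1, then $2 starting after the first line that equals
# the last line of $1. "cat_after_last" is a made-up name.
# the ( ) body runs in a subshell so the variables do not leak.
cat_after_last() (
    last=$(tail -n 1 "$1")
    # -n: show line numbers, -x: match whole lines, -F: fixed string, no regex
    n=$(grep -nxF -e "$last" "$2" | head -n 1 | cut -d: -f1)
    cat "$1"
    # if the line was not found, n is empty and all of $2 is printed
    tail -n +"$(( ${n:-0} + 1 ))" "$2"
)
```

call it as cat_after_last file1 file2. unlike the sed command, it prints all of file2 when the last line of file1 does not occur there.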
this works and probably will work most of the time.
but beware! if the overlap has repetition this will not remove all of the overlap. observe:
$ cat evil1
a
b
overlap1
overlap2
overlap3
overlap1
overlap2
$ cat evil2
overlap1
overlap2
overlap3
overlap1
overlap2
p
overlap2
q
determine the last line of the first file:
$ tail -n1 evil1
overlap2
remove up to the first occurrence from the second file:
$ sed '0,/overlap2/d' evil2
overlap3
overlap1
overlap2
p
overlap2
q
just removing up to the first occurrence does not remove all of the overlap when there is repetition in the overlap. but because of the stray overlap2 line we also cannot just remove up to the last occurrence.
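so how to determine the maximal overlap? first find in file2 every occurrence of the last line of file1. for each occurrence test whether it is overlap, that is, whether the lines of file2 up to that occurrence equal the same number of lines at the end of file1. then take the last occurrence that is still overlap.
find every occurrence: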
$ grep -n overlap2 evil2
2:overlap2
5:overlap2
7:overlap2
test each occurrence for overlap:
$ diff -q <(tail -n2 evil1) <(head -n2 evil2)
$ diff -q <(tail -n5 evil1) <(head -n5 evil2)
$ diff -q <(tail -n7 evil1) <(head -n7 evil2)
Files /dev/fd/63 and /dev/fd/62 differ
no output means there is no difference. two lines are overlap. five lines are also overlap. but seven lines are no longer overlap. that means the occurrence at line 5 is the maximal overlap while the occurrence at line 7 is unrelated to the overlap.
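the grep and diff steps can be put in a loop. a minimal sketch under the assumption that both files are ordinary text files with newline-terminated lines; the name max_overlap is made up.

```shell
# sketch: print the length of the maximal overlap between the end of $1
# and the start of $2 (0 if there is none). "max_overlap" is a made-up name.
# the ( ) body runs in a subshell so the variables do not leak.
max_overlap() (
    last=$(tail -n 1 "$1")
    best=0
    # candidates: every line number in $2 whose whole line equals the last line of $1
    for n in $(grep -nxF -e "$last" "$2" | cut -d: -f1); do
        # the trailing x keeps $( ) from stripping empty lines at the end
        a=$(tail -n "$n" "$1"; echo x)
        b=$(head -n "$n" "$2"; echo x)
        # candidate n is overlap if the last n lines of $1 equal the first n of $2
        if [ "$a" = "$b" ]; then
            best=$n
        fi
    done
    echo "$best"
)
```

max_overlap evil1 evil2 prints 5, which is the number of lines to delete from the start of evil2.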
$ cat evil1 <(sed '1,5d' evil2)
a
b
overlap1
overlap2
overlap3
overlap1
overlap2
p
overlap2
q
as said this can be automated but i did not make the effort.
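for the curious, here is one way the automation might look. it is a sketch, not something the steps above prescribe: instead of checking only the grep candidates it uses plain brute force and tries every possible overlap length, largest first. the name cat_no_overlap is made up, and lines are assumed to be newline-terminated.

```shell
# sketch: concatenate $1 and $2, dropping the maximal overlap from $2.
# brute force: try every overlap length from the largest possible down to 1;
# the first length that matches is the maximal one.
# "cat_no_overlap" is a made-up name; the ( ) body runs in a subshell.
cat_no_overlap() (
    n1=$(( $(wc -l < "$1") ))
    n2=$(( $(wc -l < "$2") ))
    # the overlap cannot be longer than the shorter file
    n=$n1
    [ "$n2" -lt "$n" ] && n=$n2
    while [ "$n" -gt 0 ]; do
        # the trailing x keeps $( ) from stripping empty lines at the end
        a=$(tail -n "$n" "$1"; echo x)
        b=$(head -n "$n" "$2"; echo x)
        [ "$a" = "$b" ] && break
        n=$(( n - 1 ))
    done
    cat "$1"
    # skip the first n lines of $2 (n is 0 when nothing overlaps)
    tail -n +"$(( n + 1 ))" "$2"
)
```

cat_no_overlap evil1 evil2 gives the same ten lines as the manual result above. the loop reads both files again for every length, so it is slow on big files.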
related xkcd: https://xkcd.com/974/
at least i made the effort for this answer. enjoy.
diff and binary variants, but there's no guaranteed unique solution (no matter which tool you use), so this may or may not work with your data. – dirkt Jul 16 '20 at 04:17
diff is what I was missing. Makes sense. Make an answer out of it! (Maybe I fill in the technical details when I have them) – Volker Siegel Jul 16 '20 at 08:38
awk '!_[$0]++' file – Jakub Jindra Jul 16 '20 at 09:08
man comm for text files. – waltinator Jul 16 '20 at 18:52