
Assume we have two text files and we want to combine them into one.

The second file starts with lines that are also in the first file, so it repeats part of it. There is a redundant overlap.

How can I combine these files?

Ideally, the solution would also handle general binary files and more than one overlapping section.

Volker Siegel
  • I don't know of any ready-made tool, but with the additional condition that the overlapping part is at the end of the first file and the beginning of the second, and that you want the longest overlap, writing a custom tool is not hard. For multiple sections, have a look at diff and its binary variants, but there's no guaranteed unique solution (no matter which tool you use), so this may or may not work with your data. – dirkt Jul 16 '20 at 04:17
  • @dirkt Thanks, that's it, the idea of using diff is what I was missing. Makes sense. Make an answer out of it! (Maybe I'll fill in the technical details when I have them) – Volker Siegel Jul 16 '20 at 08:38
  • see https://stackoverflow.com/questions/2604088/remove-line-if-field-is-duplicate - just instead of $1 use $0 to match the whole line and not only the first field. so awk '!_[$0]++' file – Jakub Jindra Jul 16 '20 at 09:08
  • 1
    Read man comm for text files. – waltinator Jul 16 '20 at 18:52

2 Answers


If the order of the lines does not matter, this command does the job:

sort -u FILE1 FILE2 > FILE3

If the order of the lines matters, use the following command instead. Note that both commands drop every duplicate line, not only the overlap:

cat -n FILE1 FILE2 | sort -uk2 | sort -nk1 | cut -f2- > FILE3
Jaks

the seemingly short and sweet solution to this is the ubiquitous but also ominous awk duplicate remover:

awk '!x[$0]++'

not only does it remove duplicates, it also keeps the original order of the input file(s). here is an explanation of how this command works: How does awk '!a[$0]++' work?

used like this

awk '!x[$0]++' file1 file2

it will print file1 and then file2 without the overlap. since the overlap is a duplicate of lines already seen, it is removed.
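for readers who find the one-liner cryptic, the same logic can be spelled out longhand. this is only an illustrative sketch: the function name dedup_keep_order and the array name seen are made up here and are not part of the original command.

```shell
# hypothetical helper: longhand equivalent of awk '!x[$0]++'
dedup_keep_order() {
    awk '{
        if (!($0 in seen)) {   # first time this exact line shows up
            print              # so print it
        }
        seen[$0] = 1           # remember the line for all later input
    }' "$@"
}
```

it is called the same way as the one-liner: dedup_keep_order file1 file2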

but beware! it will remove all duplicates from the files. observe:

$ cat file1
a
b
b
overlap1
overlap2

$ cat file2
overlap1
overlap2
p
q
q

$ awk '!x[$0]++' file1 file2
a
b
overlap1
overlap2
p
q

it also removed the duplicate lines (b and q) that are not part of the overlap.

if your files are otherwise duplicate free, or you want to get rid of them anyway, then this command is fine.

if you want to keep the duplicate lines then read on.

here is a method to manually remove the overlap without removing duplicates. it can be automated but i did not make the effort.

first get the last line from the first file:

$ tail -n1 file1
overlap2

now remove all lines from the second file up to and including that line (the 0,/regex/ address form is a GNU sed extension):

$ sed '0,/overlap2/d' file2
p
q
q

concatenate file1 with the result of the second command and you get the combined file without the overlap but with duplicates preserved:

$ cat file1 <(sed '0,/overlap2/d' file2)
a
b
b
overlap1
overlap2
p
q
q

this works and probably will work most of the time.
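the two manual steps could be wrapped up roughly like this. it is a minimal sketch, not a tested tool: the function name merge_simple and the environment variable LAST are made up for illustration, and awk with a literal string comparison is swapped in for the sed command so that regex characters in the last line cannot cause surprises.

```shell
# hypothetical helper: print file1, then file2 minus everything up to and
# including the first line that equals the last line of file1
merge_simple() (
    last=$(tail -n 1 "$1")
    cat "$1"
    # hand the line to awk through the environment so it is used as a
    # literal string (no regex matching, no backslash-escape processing)
    LAST="$last" awk '
        found { print; next }                          # past the overlap: keep
        ($0 "") == (ENVIRON["LAST"] "") { found = 1 }  # end of overlap reached
    ' "$2"
)
```

used as merge_simple file1 file2. like the sed command, it prints nothing from file2 if that line never occurs there.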

but beware! if the overlap has repetition this will not remove all of the overlap. observe:

$ cat evil1
a
b
overlap1
overlap2
overlap3
overlap1
overlap2

$ cat evil2
overlap1
overlap2
overlap3
overlap1
overlap2
p
overlap2
q

determine the last line of the first file:

$ tail -n1 evil1
overlap2

remove everything up to the first occurrence from the second file:

$ sed '0,/overlap2/d' evil2
overlap3
overlap1
overlap2
p
overlap2
q

just removing up to the first occurrence does not remove all of the overlap when there is repetition in the overlap. but because of the stray overlap2 line we also cannot just remove up to the last occurrence.

so how to determine the maximal overlap? first find in file2 every occurrence of the last line of file1. for each occurrence test whether everything up to it is overlap. then take the last occurrence that still was overlap.

find every occurrence:

$ grep -n overlap2 evil2
2:overlap2
5:overlap2
7:overlap2

test each one for overlap:

$ diff -q <(tail -n2 evil1) <(head -n2 evil2)

$ diff -q <(tail -n5 evil1) <(head -n5 evil2)

$ diff -q <(tail -n7 evil1) <(head -n7 evil2)
Files /dev/fd/63 and /dev/fd/62 differ

no output means there is no difference. two lines are overlap. five lines are also overlap. but seven lines are no longer overlap. that means the occurrence at line 5 marks the maximal overlap, while the occurrence at line 7 is unrelated to the overlap. so remove the first five lines from the second file and concatenate:

$ cat evil1 <(sed '1,5d' evil2)
a
b
overlap1
overlap2
overlap3
overlap1
overlap2
p
overlap2
q

as said this can be automated but i did not make the effort.
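for what it is worth, one possible way to automate it is sketched below. it is an assumption-laden illustration rather than a finished tool: the function name merge_max_overlap is made up, both files are read completely into memory, and instead of calling grep and diff it simply tries every candidate overlap length from longest to shortest inside awk.

```shell
# hypothetical helper: print file1, then file2 without the longest prefix
# of file2 that is also a suffix of file1
merge_max_overlap() {
    awk '
    BEGIN {
        f1 = ARGV[1]; f2 = ARGV[2]
        na = 0; nb = 0
        # read both files into memory; appending "" forces string type so
        # that lines such as "1" and "1.0" are never compared as numbers
        while ((getline line < f1) > 0) a[++na] = line ""
        close(f1)
        while ((getline line < f2) > 0) b[++nb] = line ""
        close(f2)

        # try the longest possible overlap first, then shorter ones
        max = (na < nb) ? na : nb
        for (k = max; k > 0; k--) {
            # compare the last k lines of file1 with the first k lines of
            # file2, starting from the last line as in the manual method
            for (i = k; i > 0; i--)
                if (a[na - k + i] != b[i]) break
            if (i == 0) break      # all k lines matched
        }
        # k is now the overlap length (0 if there is none)
        for (i = 1; i <= na; i++) print a[i]
        for (i = k + 1; i <= nb; i++) print b[i]
    }' "$1" "$2"
}
```

called as merge_max_overlap evil1 evil2, this should give the same ten lines as the manual cat above. the nested loop is quadratic in the worst case, which is fine for small files but may be slow for very large ones.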

related xkcd: https://xkcd.com/974/

at least i made the effort for this answer. enjoy.

Lesmana