Diff head of files

Question

I have two files. One file, I suspect, is a subset of the other. Is there a way to diff the files to identify (in a succinct manner) where in the first file the second file fits?

Related: http://unix.stackexchange.com/questions/79135/is-there-a-condensed-side-by-side-diff-format/79152#79152 — slm, Oct 29 '13 at 20:00
Do you mean the lines of one file are a subsequence of the other, or actually a contiguous substring? — Kaz, Oct 30 '13 at 03:40

score 14 · Answer 1 · 2013-10-29T19:52:39.347

diff -e bigger smaller will do the trick, but requires some interpretation, as the output is a "valid ed script".

I made two files, "bigger" and "smaller", where the contents of "smaller" are identical to lines 5 through 9 of "bigger". Running `diff -e bigger smaller` got me:

% diff -e bigger smaller
10,15d
1,4d

Which means "delete lines 10 through 15 of 'bigger', and then delete lines 1 through 4, to get 'smaller'". That means "smaller" is lines 5 through 9 of "bigger".

Reversing the file names got me something more complicated. If "smaller" truly constitutes a subset of "bigger", only 'd' (for delete) commands will show up in the output.
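
The interpretation described above can be automated. What follows is a minimal Python sketch, assuming the output of diff -e has already been captured as a string and the line count of "bigger" is known. The function name surviving_lines and its parameters are made up for illustration and are not part of diff. It relies on the fact that the ed script lists its commands from the end of the file backwards, so every line number refers to the original "bigger".

```python
import re

# Matches an ed delete command such as "3d" or "10,15d".
_DELETE = re.compile(r"^(\d+)(?:,(\d+))?d$")


def surviving_lines(ed_script, total_lines):
    """Return the line numbers of the bigger file kept by a delete-only ed script.

    Returns None if the script contains anything other than delete commands,
    which means the smaller file is not a pure subset of the bigger one.
    (Hypothetical helper, written for this illustration only.)
    """
    deleted = set()
    for command in ed_script.splitlines():
        command = command.strip()
        if not command:
            continue  # ignore blank lines around the script
        match = _DELETE.match(command)
        if match is None:
            return None  # an append/change command: not a pure subset
        first = int(match.group(1))
        last = int(match.group(2) or first)
        deleted.update(range(first, last + 1))
    return [n for n in range(1, total_lines + 1) if n not in deleted]


if __name__ == "__main__":
    # The script from the example above, for a 15-line "bigger".
    print(surviving_lines("10,15d\n1,4d", 15))  # [5, 6, 7, 8, 9]
```

If the returned list is one unbroken run of numbers, "smaller" is a contiguous block of "bigger"; gaps in the list would mean it is only a subsequence.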

terdon · Answer 2 · 2013-10-29T20:12:37.340

score 5

You can do this visually with meld. Unfortunately, it is a GUI tool, but if you just want to do this once, and on a relatively small file, it should be fine.

[Image: output of meld a b]

edited Oct 29 '13 at 20:12

answered Oct 29 '13 at 19:47


Meld's nice, but it doesn't play quite as well with 100MB+ files. – Richard Oct 29 '13 at 20:10
@Richard no it doesn't and I would prefer a command line tool anyway, I just thought I'd mention it. – terdon Oct 29 '13 at 20:12
Looks a lot like vimdiff, which is available in terminal. – phemmer Nov 05 '13 at 23:35

Joseph R. · Answer 3 · 2013-10-29T20:03:26.570

score 2

If the files are small enough, you can slurp them both into Perl and have its regex engine do the trick:

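perl -0777e '
        # -0777 undefines $/, so each <...> read below returns a whole file
        open my $fh1, "<", "file_1" or die "file_1: $!\n";
        open my $fh2, "<", "file_2" or die "file_2: $!\n";
        my $file_1 = <$fh1>;
        my $file_2 = <$fh2>;
        # \Q...\E quotes regex metacharacters so file_2 is matched literally
        print "file_2 is", $file_1 =~ /\Q$file_2\E/ ? "" : " not";
        print " a subset of file_1\n";
'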

The -0777 switch instructs Perl to set its input record separator $/ to the undefined value so as to slurp files completely.

edited Oct 29 '13 at 20:03

answered Oct 29 '13 at 19:50


What does 777 do? I take it you are passing NULL as $/ but why? Also since these are kinda esoteric switches, an explanation would be nice for the non-perl people. – terdon Oct 29 '13 at 19:57
@terdon I am indeed doing it to slurp the files whole. Explanation added. – Joseph R. Oct 29 '13 at 20:04
But why is that necessary? $a=<$fh> should slurp anyway right? – terdon Oct 29 '13 at 20:06
@terdon Not that I know of, no. By default $/ is set to \n so that $a=<$fh> would read only one line of the file $fh has been opened to. Unless of course perl's command-line behavior has different defaults that I'm unaware of? – Joseph R. Oct 29 '13 at 20:09
Argh, yes, my bad, I almost never slurp files or use the while $foo=<FILE> idiom so I wasn't sure and ran a (wrong) test which seemed to work. Never mind :). – terdon Oct 29 '13 at 20:11
@terdon Never slurping files I can understand, but I'd be interested to see how you read your files if you've done away with the while $foo=<BAR> idiom as well. – Joseph R. Oct 29 '13 at 20:13
I use $_ and a simple while(<$fh>){}. And yes, that's the same thing, and no I did not notice and still left a silly comment. Rub it in why don't cha? ;) – terdon Oct 29 '13 at 20:14
@terdon Oh, so you're still using the while idiom, then; only Perlier. :) – Joseph R. Oct 29 '13 at 20:16
@terdon Nothing's being rubbed in here, I was genuinely interested to see what you were using. In my opinion, (ab)using $_ is the mark of Perl veterans :) – Joseph R. Oct 29 '13 at 20:19

score 1 · Answer 4 · answered Oct 29 '13 at 21:14

If the files are text files and smaller, within bigger, starts at the beginning of a line, it's not too difficult to implement with awk:

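awk 'NR==FNR{l[n++]=$0;next}          # first file: store the lines of smaller
    {w[FNR%n]=$0                      # keep the last n lines of bigger in a ring buffer
     if (FNR<n) next
     for (i=0;i<n;i++) if (w[(FNR-n+1+i)%n]!=l[i]) break
     if (i==n) {print FNR-n+1;exit}}  # print the line of bigger where smaller starts
    ' smaller bigger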
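
For comparison, here is the same line-based search as a minimal Python sketch. It assumes both files fit in memory, and the names find_block and find_block_in_files are invented for this example. It compares a whole window of lines at each position, so it still finds the match when the first lines of smaller are repeated in bigger just before the real match.

```python
def find_block(smaller_lines, bigger_lines):
    """Return the 1-based line of bigger where smaller starts, or None.

    Hypothetical helper: compares a full window of len(smaller_lines)
    lines at every candidate start position.
    """
    n = len(smaller_lines)
    if n == 0:
        return None  # an empty file has no meaningful position
    for start in range(len(bigger_lines) - n + 1):
        if bigger_lines[start:start + n] == smaller_lines:
            return start + 1
    return None


def find_block_in_files(smaller_path, bigger_path):
    """Read both files as text and search line by line."""
    with open(smaller_path) as small, open(bigger_path) as big:
        return find_block(small.read().splitlines(), big.read().splitlines())


if __name__ == "__main__":
    bigger = ["a", "a", "a", "b", "c"]
    print(find_block(["a", "a", "b"], bigger))  # 2
```

This does up to len(smaller) comparisons per line of bigger, which is fine for modest files but slower than a single pass on very large ones.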

score 1 · Answer 5 · edited Nov 05 '13 at 22:54

Your question is "Diff head of files". If you really mean that one file is the head of the other, then a simple cmp will tell you that:

cmp big_file small_file
cmp: EOF on small_file

That tells you that a difference between the two files was not detected until end-of-file was reached while reading small_file.
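
The same head check can be scripted when a yes/no answer is more convenient than parsing cmp's message. This is a minimal Python sketch; the function name is_head_of and the chunk size are arbitrary choices for the example, and it assumes both paths are regular files.

```python
def is_head_of(small_path, big_path, chunk_size=1 << 16):
    """Return True if the bytes of small_path are a prefix of big_path.

    Hypothetical helper: reads both files in chunks, so neither file
    has to fit in memory.
    """
    with open(small_path, "rb") as small, open(big_path, "rb") as big:
        while True:
            expected = small.read(chunk_size)
            if not expected:
                return True  # reached the end of small without a mismatch
            if big.read(len(expected)) != expected:
                return False  # bytes differ, or big ended first
```

Calling is_head_of("small_file", "big_file") corresponds to the situation where cmp reports EOF on small_file, or reports nothing because the files are identical.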

If, however, you mean that the entire text of small_file can occur anywhere inside big_file, then assuming you can fit both files in memory, you can use

perl -le '
   use autodie;
   undef $/;
   open SMALL, "<", "small_file";
   open BIG, "<", "big_file";
   $small = <SMALL>;
   $big = <BIG>;
   $pos = index $big, $small;
   print $pos if $pos >= 0;
'

This will print the offset within big_file where the contents of small_file are located (e.g. 0 if small_file matches at the beginning of big_file). If small_file does not match inside big_file, then nothing will be printed. If there is an error, the exit status will be non-zero.
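
When big_file is too large to read comfortably, one option is to memory-map it instead. The following is a minimal Python sketch along those lines, not a drop-in replacement for the Perl above. The function name find_offset is invented for the example, and it assumes small_file itself still fits in memory.

```python
import mmap
import os


def find_offset(small_path, big_path):
    """Return the byte offset of small's contents inside big, or -1.

    Hypothetical helper: only the small file is read into memory; the
    big file is memory-mapped and searched in place.
    """
    with open(small_path, "rb") as small:
        needle = small.read()
    if os.path.getsize(big_path) == 0:
        # An empty file cannot be memory-mapped; it can only contain an empty needle.
        return 0 if not needle else -1
    with open(big_path, "rb") as big:
        with mmap.mmap(big.fileno(), 0, access=mmap.ACCESS_READ) as haystack:
            return haystack.find(needle)
```

As with the Perl version, a result of 0 means small_file matches at the very beginning of big_file; here a missing match is reported as -1 rather than by printing nothing.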
