10

I have two files. One file, I suspect, is a subset of the other. Is there a way to diff the files to identify (in a succinct manner) where in the first file the second file fits?

Richard
  • 1,381

5 Answers

14

diff -e bigger smaller will do the trick, but requires some interpretation, as the output is a "valid ed script".

I made two files, "bigger" and "smaller", where the contents of "smaller" are identical to lines 5 through 9 of "bigger". Running diff -e bigger smaller got me:

% diff -e bigger smaller
10,15d
1,4d

Which means "delete lines 10 through 15 of 'bigger', and then delete lines 1 through 4, to get 'smaller'". That means "smaller" is lines 5 through 9 of "bigger".

Reversing the file names got me something more complicated. If "smaller" truly constitutes a subset of "bigger", only 'd' (for delete) commands will show up in the output.
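The "only 'd' commands" check can be automated. Here is a minimal sketch, assuming a POSIX shell with diff and grep -E; the function name only_deletions is made up for illustration. It only establishes that "smaller" can be obtained by deleting lines from "bigger". It does not check that the surviving lines form one contiguous block.

```shell
# only_deletions BIGGER SMALLER  (hypothetical helper name)
# Succeeds (exit 0) when the ed script from diff -e consists solely of
# delete commands such as "3d" or "10,15d".
only_deletions() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # An "a" (append) or "c" (change) command line never matches the
    # pattern, so grep -v finds it and the check fails.
    if diff -e "$1" "$2" | grep -Eqv '^[0-9]+(,[0-9]+)?d$'; then
        return 1
    fi
    return 0
}
```

It could then be used as: only_deletions bigger smaller && echo "smaller fits inside bigger"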

5

You can do this visually with meld. Unfortunately, it is a GUI tool but if you just want to do this once, and on a relatively small file, it should be fine:

The image below is the output of meld a b:

[Screenshot: output of meld a b]

terdon
  • 242,166
2

If the files are small enough, you can slurp them both into Perl and have its regex engine do the trick:

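perl -0777 -e '
        # Lexical filehandles; die if either file cannot be opened
        open my $FILE1, "<", "file_1" or die "file_1: $!";
        open my $FILE2, "<", "file_2" or die "file_2: $!";
        # With $/ undefined, each read returns the whole file
        my $file_1 = <$FILE1>;
        my $file_2 = <$FILE2>;
        # \Q...\E quotes the contents so they match literally
        print "file_2 is", $file_1 =~ /\Q$file_2\E/ ? "" : " not";
        print " a subset of file_1\n";
'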

The -0777 switch instructs Perl to set its input record separator $/ to the undefined value so as to slurp files completely.

Joseph R.
  • 39,549
  • 1
    What does 777 do? I take it you are passing NULL as $/ but why? Also since these are kinda esoteric switches, an explanation would be nice for the non-perl people. – terdon Oct 29 '13 at 19:57
  • 1
    @terdon I am indeed doing it to slurp the files whole. Explanation added. – Joseph R. Oct 29 '13 at 20:04
  • But why is that necessary? $a=<$fh> should slurp anyway right? – terdon Oct 29 '13 at 20:06
  • 1
    @terdon Not that I know of, no. By default $/ is set to \n so that $a=<$fh> would read only one line of the file $fh has been opened to. Unless of course perl's command-line behavior has different defaults that I'm unaware of? – Joseph R. Oct 29 '13 at 20:09
  • Argh, yes, my bad, I almost never slurp files or use the while $foo=<FILE> idiom so I wasn't sure and ran a (wrong) test which seemed to work. Never mind :). – terdon Oct 29 '13 at 20:11
  • @terdon Never slurping files I can understand, but I'd be interested to see how you read your files if you've done away with the while $foo=<BAR> idiom as well. – Joseph R. Oct 29 '13 at 20:13
  • I use $_ and a simple while(<$fh>){}. And yes, that's the same thing, and no I did not notice and still left a silly comment. Rub it in why don't cha? ;) – terdon Oct 29 '13 at 20:14
  • @terdon Oh, so you're still using the while idiom, then; only Perlier. :) – Joseph R. Oct 29 '13 at 20:16
  • @terdon Nothing's being rubbed in here, I was genuinely interested to see what you were using. In my opinion, (ab)using $_ is the mark of Perl veterans :) – Joseph R. Oct 29 '13 at 20:19
1

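If the files are text files and the copy of smaller inside bigger starts at the beginning of a line, it's not too difficult to implement with awk. The following prints the number of the line in bigger at which smaller starts:

awk 'NR==FNR {s[n++]=$0; next}      # first file: remember the lines of smaller
    {b[m++]=$0}                     # second file: remember the lines of bigger
    END {for (i=0; i+n<=m; i++) {   # try every possible starting line
            for (j=0; j<n && b[i+j]==s[j]; j++);
            if (j==n) {print i+1; exit}}
    }' smaller bigger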
1

Your question is "Diff head of files". If you really mean that one file is the head of the other, then a simple cmp will tell you that:

cmp big_file small_file
cmp: EOF on small_file

That tells you that a difference between the two files was not detected until end-of-file was reached while reading small_file.
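In a script, where parsing cmp's message is awkward, the head test can also be done on exit status alone. Note that this is a different technique from the one above: it compares small_file with the first bytes of big_file. It is a sketch, assuming head supports the -c option (GNU and BSD head do); is_head is a made-up name.

```shell
# is_head BIG SMALL  (hypothetical helper name)
# Succeeds when the bytes of SMALL are exactly the first bytes of BIG.
is_head() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Size of SMALL in bytes; the arithmetic strips any padding from wc.
    n=$(( $(wc -c < "$2") ))
    # Compare the first n bytes of BIG (on stdin, "-") with SMALL.
    head -c "$n" "$1" | cmp -s - "$2"
}
```

For example: is_head big_file small_file && echo "small_file is the head of big_file"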

If, however, you mean that the entire text of small_file can occur anywhere inside big_file, then assuming you can fit both files in memory, you can use:

perl -le '
   use autodie;
   undef $/;
   open SMALL, "<", "small_file";
   open BIG, "<", "big_file";
   $small = <SMALL>;
   $big = <BIG>;
   $pos = index $big, $small;
   print $pos if $pos >= 0;
'

This will print the offset within big_file where the contents of small_file are located (e.g. 0 if small_file matches at the beginning of big_file). If small_file does not match inside big_file, then nothing will be printed. If there is an error, the exit status will be non-zero.
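The offset is a count of bytes (assuming Perl reads the files as raw bytes, which is its default), and a line number is often a more convenient answer to "where does it fit". One way to convert is to count the newlines that precede the offset. This is a sketch, again assuming head -c is available; offset_to_line is an illustrative name.

```shell
# offset_to_line FILE OFFSET  (hypothetical helper name)
# Prints the 1-based number of the line that contains byte OFFSET of FILE.
offset_to_line() {
    # Newlines in the first OFFSET bytes = complete lines before the offset.
    echo $(( $(head -c "$2" "$1" | wc -l) + 1 ))
}
```

For example, offset_to_line big_file 0 prints 1, because no newline precedes the first byte.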

Joseph R.
  • 39,549
jrw32982
  • 723