I have two files. One file, I suspect, is a subset of the other. Is there a way to diff the files to identify (in a succinct manner) where in the first file the second file fits?
Related: http://unix.stackexchange.com/questions/79135/is-there-a-condensed-side-by-side-diff-format/79152#79152 – slm Oct 29 '13 at 20:00
Do you mean the lines of one file are a subsequence of the other, or actually a contiguous substring? – Kaz Oct 30 '13 at 03:40
A contiguous substring, @Kaz. – Richard Oct 30 '13 at 03:58
5 Answers
diff -e bigger smaller
will do the trick, but requires some interpretation, as the output is a "valid ed script".
I made two files, "bigger" and "smaller", where the contents of "smaller" are identical to lines 5 through 9 of "bigger". Running `diff -e bigger smaller` got me:
% diff -e bigger smaller
10,15d
1,4d
Which means "delete lines 10 through 15 of 'bigger', and then delete lines 1 through 4, to get 'smaller'". That means "smaller" is lines 5 through 9 of "bigger".
Reversing the file names got me something more complicated. If "smaller" truly constitutes a subset of "bigger", only 'd' (for delete) commands will show up in the output.
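For anyone who would rather not read the ed script by eye, here is a rough Python sketch of the interpretation step. The function name surviving_lines is made up for this illustration, and it assumes you already have the script text and the line count of "bigger". It relies on diff -e listing its commands from the bottom of the file upward, so every line number still refers to the original "bigger".

```python
import re


def surviving_lines(ed_script, total_lines):
    """Return the line numbers of 'bigger' left after a delete-only ed script.

    Returns None as soon as a command other than 'd' is seen, which means
    'smaller' cannot be obtained from 'bigger' by deletions alone.
    """
    kept = set(range(1, total_lines + 1))
    for command in ed_script.split("\n"):
        command = command.strip()
        if not command:
            continue
        match = re.fullmatch(r"(\d+)(?:,(\d+))?d", command)
        if match is None:
            return None  # an 'a' or 'c' command (or its text) showed up
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        kept -= set(range(first, last + 1))
    return sorted(kept)


print(surviving_lines("10,15d\n1,4d\n", 15))
```

With the script shown above and a 15-line "bigger", the surviving lines are 5 through 9. If the returned list has no gaps, "smaller" is one contiguous block of "bigger".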
You can do this visually with meld. Unfortunately, it is a GUI tool but if you just want to do this once, and on a relatively small file, it should be fine:
(Image: the output of meld a b.)

@Richard no it doesn't and I would prefer a command line tool anyway, I just thought I'd mention it. – terdon Oct 29 '13 at 20:12
If the files are small enough, you can slurp them both into Perl and have its regex engine do the trick:
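perl -0777e '
# Lexical filehandles; die with the system error if a file cannot be opened.
open my $fh1, "<", "file_1" or die "file_1: $!";
open my $fh2, "<", "file_2" or die "file_2: $!";
# With -0777 each read returns the whole file.
my $file_1 = <$fh1>;
my $file_2 = <$fh2>;
# \Q...\E makes the contents of file_2 match literally.
print "file_2 is", $file_1 =~ /\Q$file_2\E/ ? "" : " not";
print " a subset of file_1\n";
'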
The -0777 switch instructs Perl to set its input record separator $/ to the undefined value so as to slurp files completely.
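The \Q...\E in the pattern matters too: without it, any regex metacharacters in file_2 would be treated as a pattern instead of literal text. Here is a small Python sketch of the same idea, with re.escape standing in for \Q...\E. The function names are hypothetical and the inputs are plain strings rather than files.

```python
import re


def contains_literal(haystack, needle):
    # re.escape plays the role of \Q...\E: every character of needle
    # is matched literally, including '.', '*', '[' and friends.
    return re.search(re.escape(needle), haystack) is not None


def contains_as_pattern(haystack, needle):
    # What happens without escaping: needle is interpreted as a regex.
    return re.search(needle, haystack) is not None


big = "abc\n"
small = "a.c"
print(contains_literal(big, small))     # the literal text "a.c" is absent
print(contains_as_pattern(big, small))  # but the regex a.c matches "abc"
```

The unescaped version reports a false positive here, which is exactly what the quoting is there to prevent.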

What does 777 do? I take it you are passing NULL as $/ but why? Also since these are kinda esoteric switches, an explanation would be nice for the non-perl people. – terdon Oct 29 '13 at 19:57
@terdon I am indeed doing it to slurp the files whole. Explanation added. – Joseph R. Oct 29 '13 at 20:04
@terdon Not that I know of, no. By default $/ is set to \n so that $a=<$fh> would read only one line of the file $fh has been opened to. Unless of course perl's command-line behavior has different defaults that I'm unaware of? – Joseph R. Oct 29 '13 at 20:09
Argh, yes, my bad, I almost never slurp files or use the while $foo=<FILE> idiom so I wasn't sure and ran a (wrong) test which seemed to work. Never mind :). – terdon Oct 29 '13 at 20:11
@terdon Never slurping files I can understand, but I'd be interested to see how you read your files if you've done away with the while $foo=<BAR> idiom as well. – Joseph R. Oct 29 '13 at 20:13
I use $_ and a simple while(<$fh>){}. And yes, that's the same thing, and no I did not notice and still left a silly comment. Rub it in why don't cha? ;) – terdon Oct 29 '13 at 20:14
@terdon Oh, so you're still using the while idiom, then; only Perlier. :) – Joseph R. Oct 29 '13 at 20:16
@terdon Nothing's being rubbed in here, I was genuinely interested to see what you were using. In my opinion, (ab)using $_ is the mark of Perl veterans :) – Joseph R. Oct 29 '13 at 20:19
If the files are text files and smaller, within bigger, starts at the beginning of a line, it's not too difficult to implement with awk:
awk -v i=0 'NR==FNR{l[n++]=$0;next}
{if ($0 == l[i]) {if (++i == n) {print FNR-n+1;exit}} else i=0}
' smaller bigger
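One thing to watch with a single-pass matcher like the one above: resetting i to 0 on a mismatch does not re-examine lines that were already consumed, so a match that overlaps a failed partial match can be missed (for example, smaller = a, a, b inside bigger = a, a, a, b). Here is a minimal Python sketch that tries every starting line instead. It assumes both files fit in memory as lists of lines, and find_block is a name invented for this example.

```python
def find_block(bigger_lines, smaller_lines):
    """Return the 1-based line of bigger where smaller starts, or None."""
    n = len(smaller_lines)
    if n == 0:
        return None  # treat an empty 'smaller' as "no meaningful position"
    # Try each possible starting line and compare the whole window.
    for start in range(len(bigger_lines) - n + 1):
        if bigger_lines[start:start + n] == smaller_lines:
            return start + 1
    return None


bigger = ["a", "a", "a", "b"]
smaller = ["a", "a", "b"]
print(find_block(bigger, smaller))
```

This is quadratic in the worst case, which is usually fine for files that fit comfortably in memory.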

Your question is "Diff head of files". If you really mean that one file is the head of the other, then a simple cmp will tell you that:
cmp big_file small_file
cmp: EOF on small_file
That tells you that a difference between the two files was not detected until end-of-file was reached while reading small_file.
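The same head check can be sketched in Python without loading either file into memory. This is only an illustration of what "EOF on small_file" implies, not how cmp itself is implemented. The name is_head_of is hypothetical, and the streams are assumed to be buffered binary streams (such as files opened with mode "rb"), where read(n) returns fewer than n bytes only at end-of-file.

```python
import io


def is_head_of(small_stream, big_stream, chunk_size=65536):
    """Return True if the bytes of small_stream are a prefix of big_stream."""
    while True:
        small_chunk = small_stream.read(chunk_size)
        if not small_chunk:
            # Reached EOF on the small stream without seeing a difference.
            return True
        big_chunk = big_stream.read(len(small_chunk))
        if big_chunk != small_chunk:
            # Either a byte differs or the big stream ended first.
            return False


print(is_head_of(io.BytesIO(b"line 1\n"), io.BytesIO(b"line 1\nline 2\n")))
```

With real files you would pass open("small_file", "rb") and open("big_file", "rb") instead of the in-memory streams used here.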
If, however, you mean that the entire text of small_file can occur anywhere inside big_file, then assuming you can fit both files in memory, you can use:
perl -le '
use autodie;
undef $/;
open SMALL, "<", "small_file";
open BIG, "<", "big_file";
$small = <SMALL>;
$big = <BIG>;
$pos = index $big, $small;
print $pos if $pos >= 0;
'
This will print the offset within big_file where the contents of small_file are located (e.g. 0 if small_file matches at the beginning of big_file). If small_file does not match inside big_file, then nothing will be printed. If there is an error, the exit status will be non-zero.
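Since the question asks where in the file the match sits, it may help to turn the offset into a line number. Below is a minimal Python sketch of that step, under the same assumption that both files fit in memory. The name locate is hypothetical, and the inputs are the raw bytes of each file.

```python
def locate(big, small):
    """Return (byte_offset, line_number) of small inside big, or None."""
    pos = big.find(small)
    if pos < 0:
        return None
    # Count the newlines before the match to get a 1-based line number.
    line = big.count(b"\n", 0, pos) + 1
    return pos, line


big = b"one\ntwo\nthree\nfour\n"
small = b"two\nthree\n"
print(locate(big, small))
```

Here the match starts at byte offset 4, which is the beginning of line 2.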