How to find the offset of one binary file inside another?

Question

I have two binary files: one of a few hundred kilobytes and the other of a few gigabytes.
I want to know whether the whole smaller file is contained within the larger one and, if so, at what offset from the start of the larger file.
I am interested only in exact matches, i.e. whether the whole file is contained in the other.
Is there an existing tool or one-liner that does that?

I don't have the code, but you may be able to do it with CINT and strstr() or a similar function – Behrooz, Jun 01 '12 at 15:19

maxschlepzig · Answer 1 · 2018-04-22T20:09:24.753

5

I am not aware of an existing tool.

grep -F --binary --byte-offset --only-matching comes close, but with -F you cannot escape newlines, so a pattern that contains newline bytes is treated as several separate patterns. cmp only lets you skip a number of bytes at the start of each file, and diff does not seem to be of much help either.

But it takes only a few lines in a programming language with a decent library, for example as a C++ program using Boost:

#include <boost/algorithm/string/find.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <cassert>
#include <iostream>
using namespace boost;
using namespace boost::algorithm;
using namespace boost::iostreams;
using namespace std;

int main(int argc, char **argv)
{
  if (argc != 3) {
    cerr << "Call: " << argv[0] << " PATTERN_FILE SRC_FILE\n";
    return 3;
  }
  mapped_file_source pattern(argv[1]);
  mapped_file_source src(argv[2]);
  iterator_range<const char*> p_range(pattern.data(),
      pattern.data() + pattern.size());
  iterator_range<const char*> s_range(src.data(), src.data() + src.size());
  iterator_range<const char*> result = find_first(s_range, p_range);
  if (result) {
    size_t pos = result.begin()-s_range.begin();
    cout << pos << '\n';
    return 0;
  }
  return 1;
}

You can compile it like this (when the program source is saved as searchb.cc):

$ make CXXFLAGS="-Wall -g" LDLIBS="-lboost_iostreams" searchb

To test it:

$ dd if=WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3 of=t skip=232323 bs=1 count=4K
$ ls -l t
-rw-r--r-- 1 juser users 4096 2012-05-31 15:24 t
$ ./searchb t WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3
232323

The output is the byte offset of the match in the source file.

If the pattern file is not contained in the source file, nothing is printed and the exit status is 1.

Update: In the meantime I've implemented this simple tool in several languages (C/C++/Python/Rust/Go) and have included those implementations in my utility repository. Look for searchb*. The Python implementation is the shortest one and doesn't require any external dependencies.
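To illustrate how short a dependency-free Python version can be, here is a minimal sketch. It is not the searchb code from the repository mentioned above; the function name find_offset and its parameters are made up for this example. It assumes a 64-bit system (so the large file fits into the address space) and a non-empty haystack file, since mmap refuses to map empty files.

```python
import mmap


def find_offset(needle_path, haystack_path):
    """Return the byte offset of the first occurrence of the needle file's
    contents inside the haystack file, or -1 if it does not occur.

    Hypothetical helper for illustration only.
    """
    # The small file is simply read into memory.
    with open(needle_path, "rb") as needle_file:
        needle = needle_file.read()
    with open(haystack_path, "rb") as haystack_file:
        # Length 0 maps the whole file; the kernel pages it in on demand.
        # Raises ValueError if the haystack file is empty.
        with mmap.mmap(haystack_file.fileno(), 0,
                       access=mmap.ACCESS_READ) as haystack:
            return haystack.find(needle)
```

As with the C++ program, the large file is memory-mapped rather than read, so the same 32-bit address space limits apply.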

edited Apr 22 '18 at 20:09

answered May 31 '12 at 13:35

maxschlepzig


thanks a lot. does it load the whole pattern file into memory ? – Cyryl Płotnicki May 31 '12 at 14:49
also it does this:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::exception> >' what(): std::exception Aborted (core dumped) – Cyryl Płotnicki May 31 '12 at 15:28
@CyrylPlotnicki-Chudyk, not really, it (the kernel) memory-maps the file into virtual memory. Thus it depends on the kernel's virtual memory system (and your available resources) how much of the pattern is loaded into memory during the execution of the program. – maxschlepzig May 31 '12 at 15:36
@CyrylPlotnicki-Chudyk, the error means that mapping the file into memory failed. Perhaps a virtual memory limit is reached or the filesystem does not support memory mapping. How large are your files exactly? What distribution and what architecture are you using (64/32 bit)? – maxschlepzig May 31 '12 at 15:54
Hey, thanks for your reply ! Actually it's Fedora 16, 32bit, with ~500kb pattern and 5gb haystack files – Cyryl Płotnicki Jun 01 '12 at 04:51
@CyrylPlotnicki-Chudyk, ok, that explains it - 5gb exceeds the virtual address space of a process on a 32 Bit system (which is limited to at most 4GB, or 2GB, 3GB depending on your kernel). Thus the memory mapping is not possible. On a 64 bit system you don't have that limit. Thus, if the hardware has a 64 Bit CPU you should upgrade to a 64 Bit version of your distribution (also for other reasons). One can of course change the program such that the big file is not mapped but read as a stream. I will look into that later. – maxschlepzig Jun 01 '12 at 07:29
I thought of writing a simple KMP search in C and will look into that later on; I just thought it was a matter of me not knowing about an existing simple tool – Cyryl Płotnicki Jun 01 '12 at 19:57
To port the above code to a 32 bit system one has to write a custom input-iterator that reads characters from an input stream - rather less elegant than the above solution that uses memory-mapped IO both for pattern and text. – maxschlepzig Jul 01 '12 at 09:29
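The KMP idea from the comments suits the stream-reading approach well, because KMP never steps back in the text and therefore needs no memory mapping. Below is a rough sketch in Python rather than C; the names failure_table and kmp_find_stream and the chunk_size parameter are invented for this example. The per-byte loop in pure Python is slow on gigabyte inputs, so treat it as an illustration of the technique, not a tuned tool.

```python
def failure_table(pattern):
    """fail[i] is the length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the classic KMP failure function)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_stream(stream, pattern, chunk_size=1 << 20):
    """Return the offset of the first occurrence of pattern (bytes) in the
    binary stream, or -1. Only one chunk is held in memory at a time."""
    if not pattern:
        return 0
    fail = failure_table(pattern)
    matched = 0  # number of pattern bytes matched so far
    offset = 0   # stream offset of the first byte of the current chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1  # end of stream, no match
        for j, byte in enumerate(chunk):
            # On a mismatch, fall back to the longest shorter partial match.
            while matched and byte != pattern[matched]:
                matched = fail[matched - 1]
            if byte == pattern[matched]:
                matched += 1
            if matched == len(pattern):
                return offset + j + 1 - len(pattern)
        offset += len(chunk)
```

The match state carries over between chunks, so a match that straddles a chunk boundary is still found.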

score 1 · Answer 2 · answered Nov 27 '20 at 12:19

We do this in bioinformatics all the time - except we also want partial matches and we want to know how well they matched.

BLAT is the fastest solution I know: https://en.wikipedia.org/wiki/BLAT_(bioinformatics)

It builds an index, and once the index is built, lookups are ridiculously fast.
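To give a feel for why an index helps, here is a toy sketch for exact matches only. It is not BLAT's implementation; the functions build_index and find_with_index and the block length k are assumptions made for this illustration. The haystack is cut into non-overlapping k-byte blocks, and any needle of at least 2*k - 1 bytes must fully cover one such block, so only a few candidate positions have to be verified.

```python
from collections import defaultdict


def build_index(haystack, k):
    """Map every non-overlapping k-byte block of haystack to its offsets."""
    index = defaultdict(list)
    for off in range(0, len(haystack) - k + 1, k):
        index[haystack[off:off + k]].append(off)
    return index


def find_with_index(haystack, index, k, needle):
    """Return the lowest offset of needle in haystack, or -1.

    Requires len(needle) >= 2*k - 1, so that every occurrence fully
    contains at least one indexed block.
    """
    if len(needle) < 2 * k - 1:
        raise ValueError("needle too short for this block length")
    best = -1
    # An indexed block can start at any of the first k needle positions.
    for shift in range(k):
        for block_off in index.get(needle[shift:shift + k], ()):
            start = block_off - shift
            # Verify the candidate position with a direct comparison.
            if start >= 0 and haystack.startswith(needle, start):
                if best == -1 or start < best:
                    best = start
    return best
```

Building the index costs one pass over the large file and a fair amount of memory, which pays off only when many different needles are searched in the same haystack.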

score 0 · Answer 3 · answered Jun 01 '12 at 01:12

Here's a Python script that performs a substring search on an external file. The script was originally written by Kamran Khan and posted to his blog. I very slightly adapted it to take the search string from a file and search in standard input.

#!/usr/bin/env python3
import sys
import urllib.request

def boyermoore_horspool(fd, needle):
    """Return the offset of the first occurrence of needle (bytes) in the
    binary stream fd, or -1 if it does not occur."""
    nlen = len(needle)
    nlast = nlen - 1

    # Bad-character table: how far the window may shift, indexed by the
    # byte that is aligned with the last position of the needle.
    skip = [nlen] * 256
    for k in range(nlast):
        skip[needle[k]] = nlast - k

    pos = 0        # stream offset of the current window
    consumed = 0   # number of bytes read from the stream so far
    haystack = b""
    while True:
        # Read just enough bytes to refill the window to nlen bytes.
        more = nlen - (consumed - pos)
        morebytes = fd.read(more)
        haystack = haystack[more:] + morebytes

        if len(morebytes) < more:
            return -1  # end of stream reached without a match
        consumed += more

        # Compare window and needle from the last byte backwards.
        i = nlast
        while i >= 0 and haystack[i] == needle[i]:
            i -= 1
        if i == -1:
            return pos

        pos += skip[haystack[nlast]]

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.stderr.write("""Usage: horspool.py NEEDLE_FILE [URL]
Search for the contents of NEEDLE_FILE inside the content at URL.
If URL is omitted, search standard input.
If the content is found, print the offset of the first occurrence and return 0.
Otherwise, return 1.
""")
        sys.exit(2)
    # Read the needle as raw bytes, not as text.
    with open(sys.argv[1], "rb") as needle_file:
        needle = needle_file.read()
    # sys.stdin.buffer is the binary view of standard input.
    fd = urllib.request.urlopen(sys.argv[2]) if len(sys.argv) > 2 else sys.stdin.buffer
    offset = boyermoore_horspool(fd, needle)
    fd.close()
    if offset < 0:
        sys.exit(1)
    print(offset)

Keeely · Answer 4 · 2020-11-29T14:53:45.477

If you have the memory, just read the large file in chunks the size of the smaller one. After each chunk is read, concatenate the last two chunks read together and then do a string search on the result. In Python 3.8+ the code looks like this:

def find_at_offset(large_fp, small_fp):
    small = small_fp.read()
    blocks = [b"", b""]
    base = 0
    while blk := large_fp.read(len(small)):
        base += len(blocks[0])
        del blocks[0]
        blocks.append(blk)
        offset = b"".join(blocks).find(small)
        if offset != -1:
            return base + offset
    return -1

The concept is very simple, translates well into plain old C and doesn't require any special features like memory mapping. The caveat is that it needs available memory of at least 3-5x the size of the small file, depending on the implementation: the small file itself, the two most recent chunks and, as in the code above, a joined copy of those chunks. The benefit is that it is extremely fast, because it relies on a plain substring search, which is highly optimised.
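As a sketch of the "depending on how you implement it" part, the variant below keeps only the last len(small) - 1 bytes of the data already searched instead of a whole previous chunk, and reads in a chunk size that is independent of the small file. The name find_in_stream and the chunk_size parameter are hypothetical, and this is one possible variation rather than the approach described above.

```python
def find_in_stream(large_fp, small, chunk_size=1 << 20):
    """Return the offset of the first occurrence of small (bytes) in the
    binary file object large_fp, or -1 if it does not occur."""
    if not small:
        return 0
    keep = len(small) - 1  # overlap needed to catch straddling matches
    tail = b""             # last `keep` bytes of the data already searched
    base = 0               # file offset of tail[0]
    while chunk := large_fp.read(chunk_size):
        window = tail + chunk
        pos = window.find(small)
        if pos != -1:
            return base + pos
        # Carry over only the bytes that could start a straddling match.
        tail = window[-keep:] if keep else b""
        base += len(window) - len(tail)
    return -1
```

A call might look like find_in_stream(open("big.bin", "rb"), open("small.bin", "rb").read()), where both file names are placeholders.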
