6

What is the fastest way to grep a 400GB binary file? I need one txt file from an HDD dump; I know some strings that are in it and want to find that file in the dump.

I tried to use grep -a -C 10 searchstring, but grep crashes with an out-of-memory error when it tries to read large chunks of data without newline characters. I would also like to start searching not from the beginning but from some offset into the file.

phuclv
  • 2,086
  • Have you tried grep --mmap ... already? Real programmers would do that with vim -R -b 400gbfile and then /pattern. – ott-- Aug 13 '15 at 20:52
  • @ott your vim answer piqued my interest. Does vim really only load what it needs into memory? A quick internet search is not conclusive. Some people say it still has a hard time with large files. Others say "Ctrl-c" will stop it trying to load everything into RAM. So I'm not sure whether your answer is correct. – Michael Martinez Aug 13 '15 at 21:49
  • @ott-- Real programmers would do it in Fortran. – Parthian Shot Aug 13 '15 at 22:13
  • Maybe this could work grep --only-matching --byte-offset --binary. The --only-matching option can be implemented without buffering the whole line, but I don't know if your implementation does take advantage of this to actually save memory. --byte-offset will indicate where the matching sequence starts in the binary data stream or blob. – Totor Oct 12 '21 at 13:19
  • @ott-- The OP is using Linux so there's no --mmap at all. It's an option in BSD grep – phuclv Oct 16 '23 at 05:06

4 Answers

7

I would use strings this way:

strings 400Gfile.bin | grep -C 10 searchstring

To start at a given offset (e.g. 20G):

dd if=400Gfile.bin bs=20G skip=1 | strings | grep -C 10 searchstring
jlliagre
  • 61,204
  • When looking for plain text files add -w to strings (to make it print newlines and whitespace, i.e. full text file) and 10 should be the number of lines you expect your text file to have. With bs=20G you'll probably get an out of memory error on most systems; smaller bs and larger skip – frostschutz Aug 13 '15 at 21:06
  • @frostschutz strings has no -w option I'm aware of. Moreover, it does preserve spaces and newlines of both text and binary files without requiring any specific option. – jlliagre Aug 13 '15 at 21:11
  • The option was added in GNU strings (binutils 2.25). Without it yes | strings prints nothing so the output won't contain a whole plain text ASCII file if it contains short/empty lines. In that case have strings print the offset the string was found at and then grab the data from around that offset using dd. – frostschutz Aug 13 '15 at 21:23
  • @frostschutz Thanks. This Gnu extension is quite recent. I was confused by "newlines and whitespace" while you really meant "short and empty lines". – jlliagre Aug 13 '15 at 21:30
  • ... Hoping that any chunking doesn't split at a possible grep match. – Jeff Schaller Aug 13 '15 at 23:56
  • @JeffSchaller There is no such risk as strings will get the chunks joined together through the pipe. – jlliagre Aug 14 '15 at 00:01
  • Good point; my mistake. – Jeff Schaller Aug 14 '15 at 00:04
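Following frostschutz's suggestion of a smaller bs with a larger skip, here is a minimal sketch on a small stand-in file (the file name and sizes are invented for illustration; strings re-joins the chunks through the pipe, so matches can't be lost at block boundaries):

```shell
# Stand-in for the dump: 2 MiB of zero bytes, then some text.
dd if=/dev/zero of=demo.bin bs=1M count=2 status=none
printf 'searchstring lives here\n' >> demo.bin

# Skip the first 1 MiB in 1 MiB blocks instead of one giant 20G read.
dd if=demo.bin bs=1M skip=1 status=none | strings | grep -C 10 searchstring
```

On the real dump, bs=1M skip=20480 would start at the same 20 GiB offset as bs=20G skip=1, without dd ever trying to allocate a 20 GiB buffer.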
2

bgrep

I keep coming back to this random repo from time to time: https://github.com/tmbinc/bgrep

"Installation" as per README:

curl -L 'https://github.com/tmbinc/bgrep/raw/master/bgrep.c' | gcc -O2 -x c -o $HOME/.local/bin/bgrep -

Sample usage on a minimal example:

printf '\x01\x02abcdabcd' > myfile.bin
bgrep -B2 -A2 6263 myfile.bin

outputs:

myfile.bin: 00000003
\x02abc
myfile.bin: 00000007
dabc

because 6263 is bc in ASCII, and that two-byte sequence matches at zero-indexed positions 3 and 7.

Let's see if it works on a large file that does not fit in the 32 GB memory of my Lenovo ThinkPad P51, tested on my SSD:

dd count=100M if=/dev/zero of=myfile.bin
printf '\x01\x02abcdabcd' >> myfile.bin
time bgrep -B2 -A2 6263 myfile.bin

output:

myfile.bin: c80000003
\x02abc
myfile.bin: c80000007
dabc

real 11m26.898s
user 1m32.763s
sys 9m53.756s

So it took a while but worked.

It is a bit annoying that it doesn't support searching directly for plaintext characters; you have to give it a hex string. But we can convert as per https://stackoverflow.com/questions/2614764/how-to-create-a-hex-dump-of-file-containing-only-the-hex-characters-without-spac

bgrep `printf %s bc | od -t x1 -An -v | tr -d '\n '` myfile.bin

so a Bash function would help:

bgrepa() {
  pat=$1
  shift
  bgrep `printf %s "$pat" | od -t x1 -An -v | tr -d '\n '` "$@"
}
bgrepa bc -B2 -A2 myfile.bin

Regular expressions are not supported.

Tested on Ubuntu 23.04, bgrep 28029c9203d54f4fc9332d094927cd82154331f2.

Ciro Santilli OurBigBook.com
  • 18,092
  • 4
  • 117
  • 102
1

The problem with grep is that it needs to have an entire line in memory. If that line is so large that it doesn't fit in memory, then grep bombs. The only way around this dilemma is to feed small chunks to grep. (which is really what grep should be doing itself, anyway, but it doesn't).

Use dd so that you can specify an offset to start at, and use fold or grep --mmap to avoid running out of memory on a line that is larger than the available RAM. grep --mmap will prevent the system from choking, but may or may not prevent grep itself from choking; that would be a good thing for someone to test. fold lets you insert a newline at regular intervals, which satisfies the criterion of splitting the input into manageable chunks.

dd if=bigfile skip=xxx | fold | grep -b -a string

The -b gives you the byte offset which you will find useful to know where your text strings are located within the file.
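A self-contained check of the pipeline on invented demo data. One caveat worth knowing: grep's -b offset counts bytes in the folded stream, which includes the newlines fold inserted, so it is slightly larger than the offset in the original file (by one byte per inserted newline):

```shell
# One long "line" with no newlines, which plain grep would buffer whole.
head -c 100000 /dev/zero | tr '\0' 'A' > demo.txt
printf 'needle' >> demo.txt
head -c 100000 /dev/zero | tr '\0' 'B' >> demo.txt

# fold caps line length at 4096 bytes; -b makes grep print byte offsets.
fold -b -w 4096 demo.txt | grep -b -a needle
```

Here the match sits at byte 100000 of the original file; grep reports the folded-stream offset of the matching line, 98328 (line start 98304 plus the 24 newlines fold added before it).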

I tested this on a 100GB logical volume on one of my KVM hypervisors, using the search string "Hard" and running vmstat in a separate window to monitor performance. The logical volume is basically formatted as a hard drive (partitions and filesystems) on which a guest Linux VM is installed. There wasn't any impact on system performance. It processed each gig in about 33 seconds (of course this will vary widely depending on your hardware).

You said you wanted quick performance. This should give you the quickest performance you can get from utilities in a shell script. The only way to get a quicker search would be to write a program in C that seeks to an offset, reads a specified chunk size, feeds that chunk into a pattern-matching algorithm, and then moves on to the next chunk. It seems like this type of "improved grep" should already exist, but searching online I don't find one.
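In the meantime, that seek-read-match loop can be sketched in shell, overlapping each chunk by the pattern length so a match sitting exactly on a chunk boundary is still found (the file name, chunk size, and demo data are illustrative):

```shell
file=demo.bin
pat=needle
chunk=$((1024 * 1024))   # 1 MiB chunks
overlap=${#pat}          # overlap by the pattern length

# Demo data: the pattern deliberately straddles the first chunk boundary.
head -c $((chunk - 3)) /dev/zero > "$file"
printf '%s' "$pat" >> "$file"

size=$(wc -c < "$file")
off=0
while [ "$off" -lt "$size" ]; do
  # Read chunk + overlap bytes starting at $off; the overlap means a
  # boundary-straddling match is fully contained in one of the reads.
  if tail -c +$((off + 1)) "$file" | head -c $((chunk + overlap)) \
       | grep -q -a "$pat"; then
    echo "match in chunk starting at byte $off"
  fi
  off=$((off + chunk))
done
```

This prints one line, "match in chunk starting at byte 0", even though the pattern crosses the 1 MiB boundary. It is far slower than a dedicated C program, but it never holds more than chunk + overlap bytes in grep at once.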

Michael Martinez
  • 982
  • 7
  • 12
  • 1
    As noted in a comment on the other answer, this will fail to detect a string that happens to fall on a fold boundary. But if you run grep with a few different strings - `grep -E "string1|string2|string3"` - chances are slim that all three fall on fold boundaries, so you'll still find what you're looking for. – Michael Martinez Aug 14 '15 at 00:05
  • --mmap is only available in BSD grep, but Linux uses GNU grep – phuclv Oct 10 '23 at 07:34
0

Found a way to grep a huge JSON file (300 GB database export) that is really text, but without any newlines... The problem is getting the byte offsets of specific fields inside it.

Solved it with ugrep, making artificial "lines" via tr. This way the file is not loaded into memory in full, ugrep works in a "streamed" fashion, printing results as they are hit, and the byte offsets are "original" - they can be used for chopping the JSON into pieces, for example.

cat ./full.json | tr '{' '\n' | ugrep --fixed-strings --byte-offset --format='%f:%b:%o%~' --binary --mmap -f ./splitstrings.txt
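The offsets survive because tr replaces one byte with one byte. A tiny check with invented data (using plain grep -b here; ugrep's --byte-offset behaves the same way on the stream):

```shell
printf '{"a":1}{"b":2}{"needle":3}' > demo.json

# '{' and '\n' are both one byte, so positions in the stream are unchanged.
tr '{' '\n' < demo.json | grep -b needle
```

This prints `15:"needle":3}`, and byte 15 is exactly where `"needle"` starts in the original file, right after the `{` at byte 14.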
IPv6
  • 101