
I am currently facing a "performance problem" while using grep. I am trying to locate the occurrences of many (10,000+) keywords in many (think Linux kernel repository size) files. The objective is to generate a kind of index for each keyword:

keyword: my_keyword
filepath 1
filepath 2
------
keyword: my_keyword2
...

Being fairly new to shell scripting, I have tried a few naive approaches:

while read -r LINE; do
    echo "keyword: $LINE" >> "$OUTPUT_FILE"
    grep -E -r --include='Makefile*' "($LINE)" . >> "$OUTPUT_FILE"
    echo "------" >> "$OUTPUT_FILE"
done < "$KEYWORD_FILE"

This takes about 45 minutes to complete, with the expected results.

Note: all the keywords I search for are located in a file ("KEYWORD_FILE", so they are fixed before anything starts), and the files in which to search can also be determined before the search starts. I tried to store the list of files to search in before the search, like this:

file_set=( $(find . -name "Kbuild*" -o -name "Makefile*") )

and then replaced the grep call with

printf '%s\n' "${file_set[@]}" | parallel -k -j+0 -n 1000 -m grep -H -l -w "\($LINE\)" >> "$OUTPUT_FILE"

It takes about an hour to complete…

Question: considering I can use whatever technique I want, provided it's sh-compliant, how can this be done "faster" and/or more efficiently?

Perhaps grep is not the tool to use, and my use of parallel is wrong… any ideas are welcome!

Anthon
Nico

4 Answers

6

The objective is to generate a kind of index for each keyword:

Then use indexing software instead of grep. I've had great success using codesearch on the entire CPAN archive.
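
For instance, a one-off run could look roughly like this (a sketch only; the exact flag names should be checked against cindex/csearch's help output, and it assumes the keywords contain no regexp metacharacters; KEYWORD_FILE and OUTPUT_FILE are the variables from the question):

cindex .                                     # build the trigram index once (stored under ~/.csearchindex)
while read -r kw; do
    echo "keyword: $kw"
    csearch -l -f '(Makefile|Kbuild)' "$kw"  # -l lists matching files, -f restricts to makefile-like paths
    echo "------"
done < "$KEYWORD_FILE" > "$OUTPUT_FILE"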

  • I am doing a "one shot" operation, so a search engine might be overkill. Right now, I think a "quick grep and long wait" is preferable to "get the tool, figure things out and short wait" (emphasis on the "right now"; your tool is the way to do it, I think). – Nico Mar 27 '13 at 11:48
  • codesearch is trivial to set up (run cindex once, csearch as much as you want). I would (and have on occasion) give it a go, even for a one-off. – Dennis Kaarsemaker Mar 27 '13 at 13:48
  • This is the tool that I ended up using. Indeed, trivial to set up and pretty easy to use. Many thanks for the suggestion. – Nico Mar 27 '13 at 15:47
1

I'd write a Perl script (yes, I know Python is currently much more favored...) that slurps in the 10,000 keywords, reads each file and, for each line, sees if any of them match. If there's a match, stash it away under the keyword as file/line (a hash works fine here). Once done with the set, go over the hash, spitting out the findings for each keyword.
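
Since the question asks for something sh-compatible anyway, the same single-pass idea can be sketched with awk rather than Perl (a rough sketch only, not the script described above: it assumes plain-string keywords, that the file list fits on one command line, and that no path contains whitespace; KEYWORD_FILE and OUTPUT_FILE are the variables from the question):

awk '
    NR == FNR { kw[$0]; next }          # first file argument: slurp the keyword list
    {
        for (k in kw)                   # substring match, like the original grep call
            if (index($0, k) && !seen[k, FILENAME]) {
                seen[k, FILENAME] = 1
                hits[k] = hits[k] FILENAME "\n"
            }
    }
    END {                               # print the index in the requested format
        for (k in kw)
            printf "keyword: %s\n%s------\n", k, hits[k]
    }
' "$KEYWORD_FILE" $(find . -name 'Kbuild*' -o -name 'Makefile*') > "$OUTPUT_FILE"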

Perl (and Python) are (semi-)compiled languages: the script is compiled into an internal representation (quite compact, easy and fast to interpret), and this internal form gets (some) "optimizer" love. The speed isn't the same as hand-optimized C, but it shouldn't be 10 times slower either.

In the end, your comment above hits the nail on the head: Your time (writing, debugging; or fetching, building, learning how to use) is much more valuable than the machine time (if you leave it running overnight, who cares if it is done by 10 PM or 6 AM...).

vonbrand
0

Using the -f and -F switches to grep will likely speed things up a bit. Grep can be quite smart about handling matches for multiple patterns.
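
A sketch of how that could be wired in (assuming the keywords are fixed strings, one per line, with no regex metacharacters, and that no path contains whitespace; "candidates" is just a scratch file name used for the example):

# one recursive pass finds every makefile that contains at least one keyword
grep -rlF --include='Makefile*' --include='Kbuild*' -f "$KEYWORD_FILE" . > candidates

# the per-keyword loop then only has to look at those candidates
while read -r kw; do
    echo "keyword: $kw"
    xargs grep -lF -- "$kw" < candidates
    echo "------"
done < "$KEYWORD_FILE" > "$OUTPUT_FILE"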

The objective is to generate a kind of index for each keyword

Then perhaps the solution is a search engine (e.g. mnogo)?

symcbean
  • I'll give the -f/-F options a shot; I'll tell you how it went in about 90 mins or so. – Nico Mar 27 '13 at 10:46
  • Takes 40 mins with --mmap and -F, with the naive grep approach... slightly better! – Nico Mar 27 '13 at 11:38
0

These points come to my mind:

  1. You cannot seriously grep each file once for every keyword... and then also without -l / --files-with-matches, so grep uselessly keeps reading to the end of each file after the first match. Unfortunately there is (AFAIK) no combination of --files-with-matches with several patterns that does what you need (grep -l -f tells you which files match at least one pattern, not which pattern matched). This could be done in awk; awk cannot search directories recursively on its own (AFAIK), but it can be handed a lot of files via find.
  2. In my experience, operations that access many files (or rather their inodes) can be sped up a lot by traversing the tree with find first. I guess this reduces disk head movement considerably because the directory entries and inodes are read in blocks and this metadata gets cached. Don't make find print the whole path; -printf . is enough, and write the output to /dev/null (see the sketch after this list).
  3. The fastest solution is probably not to rely on shell tools but to write something that gets compiled (e.g. Python; edit: with a JIT). That way you can optimize both I/O (implementing --files-with-matches yourself) and CPU (test each line only against those keywords which have not matched yet), and append the file path to each index file as you go (to tmpfs?), keeping one FD open per keyword (given you have fewer than 1000). Afterwards you just have to concatenate these files. Even the find-like caching could be optimized this way: first read file names until you have, e.g., 1000 matching ones, then read their inodes (stat() the files), then search those files, then read the next 1000 names.
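
The warm-up pass from point 2 might look roughly like this (a sketch only; GNU find is assumed for -printf, and the name patterns are the ones from the question):

# walk the tree once so directory entries and inodes are cached before the real search
find . \( -name 'Kbuild*' -o -name 'Makefile*' \) -printf . > /dev/null
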
Hauke Laging
  • writing something which gets compiled (e.g. Python): Python is an interpreted language. You don't compile Python code. – darnir Mar 27 '13 at 12:08
  • A few flags got lost when I posted the question, sorry... 1) -l and -m1 are used. 2) Combination of find/grep didn't help that much... It took (a bit) more time with find than with --include. 3) You are right, grep is not the way to do this. Thanks! – Nico Mar 27 '13 at 13:00