I am currently facing a "performance problem" while using grep. I am trying to locate the occurrences of many (10,000+) keywords in many (think Linux kernel repository size) files. The objective is to generate a kind of index for each keyword:
keyword: my_keyword
filepath 1
filepath 2
------
keyword: my_keyword2
...
Being fairly new to shell scripting, I have tried a few naive approaches:
while read -r LINE; do
    echo "keyword: $LINE" >> "$OUTPUT_FILE"
    # search every Makefile* under the current tree for this keyword
    grep -E -r --include='Makefile*' "($LINE)" . >> "$OUTPUT_FILE"
    echo "-----" >> "$OUTPUT_FILE"
done < "$KEYWORD_FILE"
This takes about 45 minutes to complete, with the expected results.
Note: all the keywords I search for are located in a file ($KEYWORD_FILE), so they are fixed before anything starts, and the set of files to search can also be determined before the search starts. I tried to store that list of files ahead of time, like this:
file_set=( $(find . -name "Kbuild*" -o -name "Makefile*") )
and then replace the grep call with:
echo "${file_set[@]}" | parallel -k -j+0 -n 1000 -m grep -H -l -w "\($LINE\)" >> "$OUTPUT_FILE"
It takes about an hour to complete…
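Put together, the second attempt is roughly this (a sketch, with file_set built as above; feeding parallel one path per line is my guess at what it expects, so that detail may be wrong):
while read -r LINE; do
    echo "keyword: $LINE" >> "$OUTPUT_FILE"
    # one path per line so parallel can batch the paths into grep invocations
    printf '%s\n' "${file_set[@]}" \
        | parallel -k -j+0 -n 1000 -m grep -H -l -w "\($LINE\)" >> "$OUTPUT_FILE"
    echo "-----" >> "$OUTPUT_FILE"
done < "$KEYWORD_FILE"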
Question: considering I can use whatever technique I want, provided it's sh-compliant, how can this be done "faster" and/or more efficiently?
Perhaps grep is not the tool to use, and my use of parallel is wrong… any ideas are welcome!
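One thing I did consider: grep can read all the patterns in one pass with -f (rough sketch below; matches.txt and the example path are just placeholders), but I don't see how to get from its output back to the per-keyword blocks described above:
# single pass over all Makefiles, printing file:keyword pairs
grep -r -o -w -F --include='Makefile*' -f "$KEYWORD_FILE" . > matches.txt
# each line looks like ./some/path/Makefile:my_keyword and would still need regrouping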