6

Suppose I do a time-consuming recursive grep search. After seeing the results I want different output; for example, I want to add the option -C 3 for 3 lines of context. I can run the whole search again with the new option added, but then I have to wait just as long as before.

Is there any clever way to make grep perform the second search faster?

George M
student

5 Answers

7

The second run should already be faster (if grep is I/O bound), as the files should be in the operating system's cache.

As grep doesn't save any state at all and only works with the provided input parameters, there is no way to reuse previous results with grep itself.

If you regularly have this problem, you may want to look into desktop search engines or text indexing to improve both search time and results.
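If you expect to repeat searches over the same tree, you can also warm the cache deliberately before the first grep. A minimal sketch, assuming the tree fits in RAM (the path and pattern are illustrative):

# read every file once so later greps are served from the page cache
find . -type f -exec cat {} + > /dev/null 2>&1
grep -rn 'foo' .        # first search: now served mostly from memory
grep -rn -C 3 'foo' .   # the repeat with context is fast for the same reason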

don_crissti
Ulrich Dangel
2

If the files are still in the disk cache, the second search will be faster.

If you want to speed up searches, you need to build an index. This is well beyond grep's job: it's a search tool, not an indexing tool. The question "Command-line-friendly full-text indexing?" lists some indexing tools.

There are ways in which you can leverage grep to make repeated searches faster. For example, first obtain the list of matching files with grep -l. If your file names don't contain any whitespace or the shell wildcards *?\[, you can stuff the file names into a variable:

f=$(grep -l -r foo .)   # slow pass: collect the names of files containing "foo"
grep foo $f             # repeat the search over the matching files only
grep -C3 foo $f         # same files, now with 3 lines of context
grep foobar $f          # a narrower pattern over the same file list
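If the file names may contain whitespace, a NUL-delimited variant sidesteps the quoting problem. A sketch assuming GNU grep and xargs (grep's -Z/--null prints NUL-terminated names with -l, which xargs -0 consumes):

grep -lrZ foo . | xargs -0 grep -C3 foo    # same idea, safe for odd file names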
1

You can save the list of matching files and then grep only those files; it will be much faster. For example, you can combine find and grep:

find . -type f -exec grep -l 'PATTERN' {} \+ | xargs grep -H -C 3 'PATTERN'

If you also need to see the grep output from the first find run, it is a little harder, but still pretty easy. You just need to use something like this:

find . -exec grep -H 'PATTERN' {} \+ | tee -a out.log |
  sed 's/:.*//' | sort -u | xargs grep -C 3 'PATTERN'

The output will also be saved to the out.log file.
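Since out.log already holds every matching line prefixed with its file name, some follow-up questions can be answered from the log alone, without touching the tree again. For example:

grep 'foobar' out.log    # refine within the saved matches; no re-scan of the tree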

Jeff Schaller
rush
  • If we're talking about speed, then find . -type f -print0 | xargs -0 grep -l 'PATTERN' (one process per N files) is faster than using find's -exec (one process per file). And it allows you to use the xargs -P option to parallelise the search (see the sketch after these comments). – Alexios Jun 07 '12 at 09:36
  • 5
    Nope. You can notice the \+ (instead of the usual \;) at the end of the find construction. \+ allows find to create one process per N files too. Therefore there is almost no difference in speed between find -exec \+ and find | xargs. – rush Jun 07 '12 at 10:08
  • +1: learn something new every day! Is this a new-ish option for find? I don't remember it from Ye Days of Yore. – Alexios Jun 07 '12 at 20:45
  • @Alexios this feature is present even in POSIX find. And I found it several years ago. – rush Jun 08 '12 at 07:53
  • The last time I read the find manpage top to bottom was 1993 or 1994 and it was either SunOS 4.1.x or some ancient version of HP-UX. (that's my excuse, anyway :) ) – Alexios Jun 08 '12 at 09:26
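For reference, the two batched forms discussed in the comments look like this; both start one grep per large batch of files, and GNU xargs can additionally run batches in parallel (the pattern is illustrative):

find . -type f -exec grep -l 'PATTERN' {} \+              # POSIX find: batches many files per grep
find . -type f -print0 | xargs -0 grep -l 'PATTERN'       # equivalent batching via xargs
find . -type f -print0 | xargs -0 -P 4 grep -l 'PATTERN'  # GNU xargs: 4 greps in parallel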
1

Just for something different...
The following script does not use grep the second time around. It relies only on the line numbers gathered by grep in the first step, and uses sed for the printout.

grep -HnZ is used in the first step: H for the file name, n for the line number, and Z for a non-text delimiter \x00 between the file name and line number.

I don't think that it will be much (if any) faster than running grep over the files identified in the first pass, because each of the identified files needs to be scanned in either case. Also, it isn't accurate if anything relevant changes in the input data set between the two steps. (This just caught my interest, so here it is.)

# create 2 test files.
  printf '%s\n' {a..z} >junk1
  printf '%s\n' {a..z} >junk2

# Make list of filenames and line numbers
# then convert the list into a shell script 
# which uses 'sed' to list the lines
grep -HnZ "[gms]" junk1 junk2 | 
  # convert each file/line-number pair into sed commands
  awk -v"C=2" 'BEGIN{ FS="[\x00:]"
                 print "#!/bin/sh"
               }
               { negC=$2-C; if (negC<1){negC=1}; posC=$2+C }
               prev != $1 { 
                  if( prev ) print prev_grp "\""
                  prev = $1
                  prev_grp = "<\"" $1 "\" sed -nr \"" \
                  negC"i -- ("negC","$2","posC") "$1"\n\t"negC","posC"{p;b};"
                  next 
               }
               {  prev_grp = prev_grp" " \
                  negC"i -- ("negC","$2","posC") "$1"\n\t"negC","posC"{p;b};" 
              }
               END{ if( prev ) print prev_grp "\"" }
              '>junk.sh
chmod +x junk.sh   
./junk.sh

This is the output of the initial grep command, showing the null as \x00

junk1\x007:g
junk1\x0013:m
junk1\x0019:s
junk2\x007:g
junk2\x0013:m
junk2\x0019:s

Here is the generated script

#!/bin/sh
<"junk1" sed -nr "5i -- (5,7,9) junk1
        5,9{p;b}; 11i -- (11,13,15) junk1
        11,15{p;b}; 17i -- (17,19,21) junk1
        17,21{p;b};"
<"junk2" sed -nr "5i -- (5,7,9) junk2
        5,9{p;b}; 11i -- (11,13,15) junk2
        11,15{p;b}; 17i -- (17,19,21) junk2
        17,21{p;b};"

Here is the grep-like output, where (n,n,n) are line numbers (from, matched, to):

-- (5,7,9) junk1
e
f
g
h
i
-- (11,13,15) junk1
k
l
m
n
o
-- (17,19,21) junk1
q
r
s
t
u
-- (5,7,9) junk2
e
f
g
h
i
-- (11,13,15) junk2
k
l
m
n
o
-- (17,19,21) junk2
q
r
s
t
u

It would be pretty simple to add colour, but it would be easier to use grep (unless this offers anything desirable).

Peter.O
1
  1. Do you really need grep, i.e. do you use regexps? fgrep (grep -F, fixed strings) is faster.
  2. GNU grep has --mmap; according to the man page: «… In some situations, --mmap yields better performance …» (but it has some issues as well; see the man page).
  3. Just save the file and line numbers of the matched lines and don't re-grep; you surely don't need to search everything twice, do you? (A sketch follows after this list.)
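A minimal sketch of point 3, assuming file names contain no colons or newlines: record file:line pairs once, then replay any context width with sed instead of re-searching (hits.txt and the width of 3 are illustrative):

grep -rn 'foo' . > hits.txt    # slow pass: record file:line:match
# later: print 3 lines of context around each hit without re-searching
while IFS=: read -r file line _; do
  start=$(( line > 3 ? line - 3 : 1 ))
  printf '== %s:%s ==\n' "$file" "$line"
  sed -n "${start},$(( line + 3 ))p" "$file"
done < hits.txt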
poige