
How do I use grep to search a nested directory structure for files containing all words in my search pattern?

I want to grep for files that contain multiple words - let's use foo, bar, and bah. I can do grep -rl foo | xargs grep -l bar | xargs grep -l bah, etc., but is there an easier way to do this? I know I can use -f with a file of patterns to search for, but I believe this still combines the patterns with an OR operator (union), and I need the AND operator (intersection).
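Spelled out in full, that chained approach might look like this (a sketch assuming GNU grep and GNU xargs: -Z/--null and -0 keep file names with spaces or newlines intact, and -r stops xargs from running grep on an empty list):

grep -rlZ foo . | xargs -0r grep -lZ bar | xargs -0r grep -l bah

Only the first grep needs -r; the later stages search the explicit file list handed over by the previous one.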

Jeff Schaller

4 Answers


find + awk solution:

find . -type f -exec awk 'FNR == 1 { p1 = p2 = p3 = 0 }
                          /\<foo\>/{ p1=1 }/\<bar\>/{ p2=1 }/\<bah\>/{ p3=1 }
                          p1 && p2 && p3{ print FILENAME; nextfile }' {} +

awk program details:

  • FNR == 1 { p1 = p2 = p3 = 0 } - reset the flags at the start of each file (one awk process handles many files here, so stale flags would otherwise carry over from a previous file)
  • /\<foo\>/{ p1=1 }/\<bar\>/{ p2=1 }/\<bah\>/{ p3=1 } - on encountering each of the needed patterns, set the respective flag
  • p1 && p2 && p3 - as soon as all the patterns have been found:
    • print FILENAME - print the current filename/filepath
    • nextfile - move on to the next input file immediately (nextfile is a common extension, e.g. in GNU awk; a plain exit would end the whole run after the first matching file, because -exec ... {} + passes many files to a single awk invocation)
  • Thanks, but I'm hoping for a grep-only solution; find+awk (or sed) will work, but I'm hoping to keep the commands I have to remember as streamlined as possible. – user3.1415927 Feb 09 '18 at 17:37

My answer is similar to @RomanPerekhrest's answer. The main difference is that it takes advantage of the fact that you can get awk to process the entire input in one go by setting the record separator (RS) to something that will never match anything in the input (e.g. ^$). In other words, it slurps in each entire file and searches it as if it were a single string.

e.g.

find . -type f -exec \
  awk -v RS='^$' '/foo/ && /bar/ && /baz/ { print FILENAME }' {} +

This will list all files beneath the current directory (.) that contain ALL of the regular expressions foo, bar, and baz. If you need any or all the regular expressions to be treated as whole words, surround them with word-boundary anchors \< and \> - e.g. \<foo\>.
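For example, with an awk that supports those word-boundary operators (GNU awk does), the whole-word variant might look like:

find . -type f -exec \
  awk -v RS='^$' '/\<foo\>/ && /\<bar\>/ && /\<baz\>/ { print FILENAME }' {} +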

This also runs faster because it doesn't fork awk once for every file. Instead, find runs awk with as many filename arguments as will fit into the command-line buffer (typically 128 KiB to 2 MiB on modern-ish systems). For example, if find discovers 1000 files, it runs awk once instead of 1000 times.

Note: This requires a version of awk that allows RS to be a regular expression. See Slurp-mode in awk? for more details and an example of how to implement a limited form of "slurp mode" reading in other versions of awk.
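As a rough sketch of that idea, a limited form of slurp-mode reading can be emulated in portable POSIX awk by accumulating each file's lines into a single string (the variable names and helper function here are illustrative, not from the original answer):

find . -type f -exec awk '
  function check() { if (s ~ /foo/ && s ~ /bar/ && s ~ /baz/) print f }
  FNR == 1 { if (NR > 1) check(); s = ""; f = FILENAME }  # a new file begins: test the previous one
  { s = s $0 "\n" }                                       # accumulate the current file into s
  END { if (NR) check() }                                 # test the final file
' {} +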

Also note: this will read the entire contents of each file found into memory, one file at a time. For truly enormous files, e.g. log files that are tens of gigabytes or larger, this may exceed available RAM or even RAM+swap. As unlikely as that is, it can cause serious problems if it happens (e.g. on Linux, the kernel's OOM killer will start killing processes if the system runs out of RAM and swap).

cas
  • I must mention the initial condition "for files containing all words ..." (a slight hint) – RomanPerekhrest Feb 11 '18 at 10:37
  • @RomanPerekhrest yeah and when asked what he wanted he said he's happy with a direct match or a wildcard foo.*. I think you're placing an emphasis on "words" that the OP didn't intend. He probably doesn't want to match foo, bar, or baz either - they're just example regular expressions for him to replace with whatever it is he actually wants to search for. – cas Feb 11 '18 at 11:08
  • Do you really need nextfile, given that you’re treating each file as a single record? – G-Man Says 'Reinstate Monica' Feb 12 '18 at 18:17
  • @G-Man nope. You're right. I just tested it and it works the same without nextfile. I added it out of habit - it's an optimisation (skip to next file on first match...or more generally, skip or exit on first success) that works well when not slurping the entire file. – cas Feb 13 '18 at 00:00

For a logical AND like this, I usually fall back on awk:

awk '/foo/ && /bar/ && /bah/ { print }' /path/to/file
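To apply the same idea recursively and print file names rather than matching lines, this can be combined with find along these lines (a sketch; note that this variant requires all three patterns to appear on the same line, and that nextfile is a common awk extension, e.g. in GNU awk):

find . -type f -exec \
  awk '/foo/ && /bar/ && /bah/ { print FILENAME; nextfile }' {} +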
DopeGhoti

Using GNU grep with the -P (Perl-compatible regexes) option and positive lookaheads ((?=(regex))), you can look for the words in any order, either on a single line or anywhere within a whole file, recursively in all files found under the current directory.

grep -rlzP '(?s)(?=.*?\bfoo\b)(?=.*?\bbar\b)(?=.*?\bbah\b)' .
  • (?s) is the DOTALL modifier; it allows . to match newlines as well. (.|\n)*? or [\s\S]*? between the words would work too.

  • -z treats the input as NUL-separated records, so each whole file is searched as a single record; without it, the lookaheads would only ever see one line at a time, and multi-line matches such as file1 below would be missed.

  • in \bWORD\b, the \b are word-boundary anchors.

With input as follows:

==> file1 <==
foo here and bar
bah
end of file1

==> file2 <==
foo then bar and bah

==> file3 <==
foo foobarbah ba

==> sub-dir/file4 <==
this is foo bar bahh bah

The output is:

./file1
./file2
./sub-dir/file4
αғsнιη
  • Hmm, is there any way to do this without the perl-specific regex? I'm hoping for something slightly more generic (I also don't know perl!) – user3.1415927 Feb 11 '18 at 03:28
  • Surely there is, but you don't need to know perl to use this - most of us use grep without knowing how its code is written, don't we? :) – αғsнιη Feb 11 '18 at 04:30