3

I need to locate .php and .pl files that do not contain one string (e.g. aaa), but do contain another (e.g. bbb).

I'm currently using this command:

find /path/ \( -iname '*.php*' -or -name '*.pl*' \) -exec sh -c 'grep -l -v "aaa" {} | grep -l "bbb" {}' \; > resulttofile

It's about half a million files to search, so I'm wondering:

  • Whether my command works correctly (spot-checking some of the results looks good),
  • Whether it can be made faster (it currently takes about 2 min on a VM, but more files will be added) by using some other form, or awk or sed instead of grep, or perhaps just one combined grep instead of two.

The system is Debian GNU/Linux.

Krackout
  • 2,642
  • 12
  • 26
  • Please remember to always tell us your operating system. Especially for questions like this since different systems will have different implementations of the basic tools like grep or find. – terdon Apr 28 '23 at 11:03
  • Don’t change it now, but “not aaa but bbb” is confusing —  “aaa but not bbb” would be clearer. – G-Man Says 'Reinstate Monica' Apr 30 '23 at 06:40

3 Answers

7

Your command doesn’t work correctly: the first grep will list any file which contains a line not matching "aaa", and the second grep will ignore the first’s output since it’s given its own file to process — so you’ll get a list of files matching "bbb", regardless of whether they contain "aaa" or not. You’d need to ask grep to only list a file if it doesn’t contain any line matching "aaa" (grep -L), and use xargs to process the resulting list of files and only feed that to the second grep (or make the second grep conditional on the result of the first one).
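To see the difference between grep -l -v and grep -L concretely, here is a small sketch on a throwaway file (the file name and its contents are made up for illustration):

```shell
# Scratch directory with one hypothetical sample file containing both strings.
tmp=$(mktemp -d)
printf 'aaa\nbbb\n' > "$tmp/both.php"

# -l -v lists the file, because its "bbb" line is a line that does not match aaa.
listed_by_lv=$(grep -l -v aaa "$tmp/both.php")

# -L lists only files with no matching line at all, so nothing is listed here.
# (|| true: grep may return non-zero when it lists no file.)
listed_by_L=$(grep -L aaa "$tmp/both.php" || true)

printf 'grep -l -v: %s\n' "${listed_by_lv:-<nothing>}"
printf 'grep -L:    %s\n' "${listed_by_L:-<nothing>}"

rm -rf "$tmp"
```

The first grep reports the file even though it contains aaa, which is why the original pipeline cannot exclude such files.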

On top of that, it would work only as long as the file names that find lists don’t cause problems for the shell — notably, including {} directly in the command given to sh -c means that file names can end up being interpreted as shell commands (see Is it possible to use `find -exec sh -c` safely? for details).
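If you do want to keep the sh -c form, a minimal sketch of a safer variant passes the file names as positional parameters instead of embedding {} in the script, and makes the aaa check conditional on the bbb check. The sample files below are hypothetical, including one with a shell metacharacter in its name:

```shell
# Scratch directory with made-up sample files.
tmp=$(mktemp -d)
printf 'bbb\n'      > "$tmp/only_b.php"
printf 'aaa\nbbb\n' > "$tmp/both.php"
printf 'bbb\n'      > "$tmp/odd; name.pl"

# File names arrive as "$1", "$2", ... and are never parsed as shell code.
result=$(
  find "$tmp" \( -iname '*.php*' -o -name '*.pl*' \) -exec sh -c '
    for f do
      # Print the name only if bbb is present and aaa is absent.
      if grep -q bbb "$f" && ! grep -q aaa "$f"; then
        printf "%s\n" "$f"
      fi
    done
  ' sh {} + | sort
)
printf '%s\n' "$result"

rm -rf "$tmp"
```

This still starts up to two grep processes per file, so it is about correctness and safety rather than speed.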

The following will require fewer grep invocations and is safer, assuming you’re using GNU grep:

find /path/ \( -iname '*.php*' -o -name '*.pl*' \) -exec grep -LZ aaa {} + |
  xargs -r0 grep -l bbb

The -or operator is a GNU extension to find.  Use -o for portability.

Stephen Kitt
  • 434,908
5

Untested, but this should do what I think you want, using GNU awk for nextfile and ENDFILE:

find /path/ \( -iname '*.php*' -or -name '*.pl*' \) -exec awk '
    /aaa/{a=1} /bbb/{b=1} a&&b{nextfile} ENDFILE{if (b && !a) print FILENAME; a=b=0}
' {} + > resulttofile

The above only calls awk once on multiple files at a time so should be efficient.

The above is the general way to match multiple patterns in a file and then evaluate the combination of matches once the file has been fully read. As @G-Man Says 'Reinstate Monica' mentioned in a comment, though, you could make it more efficient in this specific case by moving on to the next file as soon as aaa matches, since the success criterion is for aaa not to be present:

/aaa/{a=1; nextfile} /bbb/{b=1} ENDFILE{if (b && !a) print FILENAME; a=b=0}
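
If GNU awk is not installed (on Debian, awk is often mawk, which has no ENDFILE), the same per-file logic can be approximated in POSIX awk by reporting on the previous file whenever a new one starts. This is a sketch of that alternative, not the script above; the report function, the prev variable, and the sample files are made up for illustration:

```shell
# Scratch directory with made-up sample files.
tmp=$(mktemp -d)
printf 'bbb\n'      > "$tmp/only_b.php"
printf 'aaa\nbbb\n' > "$tmp/both.php"
printf 'aaa\n'      > "$tmp/only_a.pl"

result=$(
  find "$tmp" \( -iname '*.php*' -o -name '*.pl*' \) -exec awk '
    # Report on the file just finished: bbb seen, aaa not seen.
    function report() { if (prev != "" && b && !a) print prev }
    FNR == 1 { report(); a = b = 0; prev = FILENAME }
    /aaa/    { a = 1 }
    /bbb/    { b = 1 }
    END      { report() }
  ' {} + | sort
)
printf '%s\n' "$result"

rm -rf "$tmp"
```

Unlike the nextfile versions, this reads every file to the end, so it trades some speed for portability.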
Ed Morton
  • 31,617
3

You may chain multiple -exec directives (or other ones) together with one find command:

find /path \( -iname '*.php*' -or -name '*.pl*' \) -exec grep -q "bbb" {} ";" \
     -exec grep -L "aaa" {} ";" > resulttofile

(The linebreak is only to fit the layout of SE).
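
Since -exec ... ";" acts as a test that is true when the command succeeds, a variant of the same idea is to negate the aaa check and let find print the name itself. A minimal sketch on made-up sample files:

```shell
# Scratch directory with made-up sample files.
tmp=$(mktemp -d)
printf 'bbb\n'      > "$tmp/only_b.php"
printf 'aaa\nbbb\n' > "$tmp/both.php"
printf 'aaa\n'      > "$tmp/only_a.pl"

# Keep a file only if grep finds bbb and does NOT find aaa, then print it.
result=$(
  find "$tmp" \( -iname '*.php*' -o -name '*.pl*' \) \
    -exec grep -q bbb {} \; ! -exec grep -q aaa {} \; -print | sort
)
printf '%s\n' "$result"

rm -rf "$tmp"
```

Like the command above, this starts one or two grep processes per file, so on half a million files it will likely be slower than forms that pass many files to a single process.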

terdon
  • 242,166
user unknown
  • 10,482