Find recursively all files whose content matches a specific regular expression

Question

I would like to search all PHP files and find a particular string that is identified by a regular expression.

The regular expression that I use to find the string is:

\$[a-zA-Z0-9]{5,8}\s\=\s.{30,50}\;\$[a-zA-Z0-9]{5,8}\s\=\s[a-zA-Z0-9]{5}\(\)

I tried to use:

grep -r "\$[a-zA-Z0-9]{5,8}\s\=\s.{30,50}\;\$[a-zA-Z0-9]{5,8}\s\=\s[a-zA-Z0-9]{5}\(\)" *.php

but this does not seem to work.

find . -name '*.php' -regex '\$[a-zA-Z0-9]{5,8}\s\=\s.{30,50}\;\$[a-zA-Z0-9]{5,8}\s\=\s[a-zA-Z0-9]{5}\(\)' -print

Does not work either.

What I need is to search a path and all its subdirectories for PHP files that contain a string identified by the regular expression stated above. What is the best way to accomplish this?

For your information this is a string similar to the ones I try to find:

<?php
$tqpbiu = '9l416rsvkt7c#*3fob\'2Heid0ypax_8u-mg5n';$wizqxqk = Array();$wizqxqk[] = $tqpbiu[11].$tqpbiu[5].$tqpbiu[21].$tqpbiu[27].$tqpbiu[9].$tqpbiu[21].$tqpbiu[29].$tqpbiu[15].$tqpbiu[31].$tqpbiu[36].$tqpbiu[11].$tqpbiu[9].$tqpbiu[22].$tqpbiu[16].$tqpbiu[36];$wizqxqk[] = ... etc.

As you have probably realized, this is malware code, so the string is similar but not identical in each file. However, the regular expression does a good job of finding every file that contains such content anywhere in it.

Previously, I downloaded all files to my Windows PC and then used EMEditor to search by regular expression. This works fine on the PC, but it requires downloading everything, and it would be nice to be able to search directly from the Linux command prompt.

Any tip would be very much appreciated.

You probably want to use single quotes in the grep command. — muru, Feb 02 '21 at 07:00

score 2 · Accepted Answer · edited Feb 05 '21 at 14:41

Since you are using grep to search using a regular expression, you have to be aware that grep by default interprets the search string as a basic regular expression (BRE). The syntax you use contains extended regular expression (ERE) syntax, so you need to add the -E flag. In a BRE, for instance, an unescaped {5,8} is taken as literal text rather than as a repetition count.
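To see the difference in isolation, here is a minimal sketch. The sample line and the shortened pattern are made up for illustration and are not taken from your files:

```shell
# Hypothetical sample line; 'abcde' stands in for one of the variable names.
sample='$abcde = 1'

# BRE (default): {5} is literal text, so nothing matches.
# '|| true' only keeps the non-zero exit status of grep from aborting a strict script.
bre_count=$(printf '%s\n' "$sample" | grep -c '\$[a-z]{5} =' || true)

# ERE (-E): {5} is a repetition count, so the line matches.
ere_count=$(printf '%s\n' "$sample" | grep -cE '\$[a-z]{5} =')

echo "BRE matches: $bre_count, ERE matches: $ere_count"
```

Note the single quotes around the pattern: inside double quotes the shell would turn \$ into a bare $ before grep ever sees it.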

Copying the string example you posted into a file test.php, the call

~$ grep -E '\$[a-zA-Z0-9]{5,8}\s=\s.{30,50}\;\$[a-zA-Z0-9]{5,8}\s=\s[a-zA-Z0-9]{5}\(\)' *.php
$tqpbiu = '9l416rsvkt7c#*3fob\'2Heid0ypax_8u-mg5n';$wizqxqk = Array();$wizqxqk[] = $tqpbiu[11].$tqpbiu[5].$tqpbiu[21].$tqpbiu[27].$tqpbiu[9].$tqpbiu[21].$tqpbiu[29].$tqpbiu[15].$tqpbiu[31].$tqpbiu[36].$tqpbiu[11].$tqpbiu[9].$tqpbiu[22].$tqpbiu[16].$tqpbiu[36];$wizqxqk[] = ... etc.

finds the string (output in bold as highlighted by grep), so you could use that with the -r option (since you seem to be using GNU grep) to recursively look for it.
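One caveat when adding -r: the shell expands *.php to the PHP files in the current directory only, so grep never descends into subdirectories that way. With GNU grep, a possible approach is to pass the directory itself and filter by name with --include. The sketch below builds a throwaway tree to try it on; the directory layout and file names are hypothetical, and the sample line is abridged from the one in the question:

```shell
# Throwaway test tree (hypothetical names).
dir=$(mktemp -d)
mkdir -p "$dir/sub/deeper"
cat > "$dir/sub/deeper/infected.php" <<'EOF'
<?php
$tqpbiu = '9l416rsvkt7c#*3fob\'2Heid0ypax_8u-mg5n';$wizqxqk = Array();
EOF
printf '%s\n' '<?php' 'echo "hello";' > "$dir/sub/clean.php"

pattern='\$[a-zA-Z0-9]{5,8}\s=\s.{30,50};\$[a-zA-Z0-9]{5,8}\s=\s[a-zA-Z0-9]{5}\(\)'

# -r recurses from '.', -l prints file names only,
# --include limits the search to *.php (GNU grep option).
matches=$(cd "$dir" && grep -rlE --include='*.php' "$pattern" .)
echo "$matches"

rm -rf "$dir"
```

On this test tree the command lists only ./sub/deeper/infected.php.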

Also, keep in mind that the -regex option of find does not check whether the file content matches the regular expression, but rather whether the file's path matches. To do a regex-based search within all .php or .txt files using find, use

find . -type f \( -name '*.php' -o -name '*.txt' \) -exec grep -EH '\$[a-zA-Z0-9]{5,8}\s=\s.{30,50}\;\$[a-zA-Z0-9]{5,8}\s=\s[a-zA-Z0-9]{5}\(\)' {} \;

where the -H option to grep will ensure the filename is printed, too. Alternatively, use grep -El etc. to only print the filenames (which makes for cleaner output if many files match).
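As a rough illustration of both points, the following sketch runs the filename-only variant on a small test tree and then shows what -regex actually tests. The file names are hypothetical, and '{} +' is used instead of '{} \;' so that grep is started once for many files rather than once per file:

```shell
# Throwaway test tree (hypothetical names).
dir=$(mktemp -d)
mkdir -p "$dir/a"
cat > "$dir/a/bad.php" <<'EOF'
<?php
$tqpbiu = '9l416rsvkt7c#*3fob\'2Heid0ypax_8u-mg5n';$wizqxqk = Array();
EOF
printf 'nothing here\n' > "$dir/a/notes.txt"

pattern='\$[a-zA-Z0-9]{5,8}\s=\s.{30,50};\$[a-zA-Z0-9]{5,8}\s=\s[a-zA-Z0-9]{5}\(\)'

# find selects files by type and name; grep -El checks the content
# and prints only the names of the files that match.
hits=$(cd "$dir" && find . -type f \( -name '*.php' -o -name '*.txt' \) -exec grep -El "$pattern" {} +)
echo "content match: $hits"

# -regex is tested against the whole path, never against the content.
by_path=$(cd "$dir" && find . -regex '.*/[a-z]*\.php')
echo "path match: $by_path"

rm -rf "$dir"
```

Both commands report ./a/bad.php here, but for different reasons: the first because of what the file contains, the second purely because of its path.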

Some general remarks

As correctly noted by Stéphane Chazelas, and as reference for possible future readers: several elements of your syntax are non-portable extensions to the regular expression syntax, and the behavior of other constructs may vary depending on the environment settings:

Shorthand character classes such as \s (not to be confused with character lists) are extensions to the standard ERE. The \s shorthand notation, for example, is a Perl extension to regular expressions and is not necessarily portable across programs designed to handle regular expressions.
The meaning of range specifications in character lists (such as [a-z]) can depend on the locale settings, specifically the collation order. The "naive" interpretation that [a-z] means abcdefgh....xyz is only guaranteed in the C locale; in others it often means aAbBcCdD ... xXyYz, so this needs to be used with care (see here and here for further discussions on the subject). If the program you use supports them, POSIX character classes may be a "safer", though not universally supported, way to express that kind of specification (the intention behind your use of [a-zA-Z0-9] would be fulfilled with the [[:alnum:]] POSIX character class, for example).
You have escaped several characters that actually don't have a special meaning in (most implementations of) regular expressions, e.g. \= and \;. This may work in many cases (the GNU awk man page, for example, states

\c The literal character c

in the section "String constants"), but should in general be avoided since when trying to port the regex to other programs/environments it may get a special meaning there (in vim, \= actually is a regex quantifier), or even within the same program in a future version.
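Putting these remarks together, the pattern could be written roughly as follows. This is only a sketch, assuming a grep that supports POSIX character classes inside bracket expressions (GNU grep does); the sample line is abridged from the question and the variable names are made up:

```shell
# Abridged sample line from the question, stored in a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
$tqpbiu = '9l416rsvkt7c#*3fob\'2Heid0ypax_8u-mg5n';$wizqxqk = Array();
EOF

# [[:alnum:]] and [[:space:]] replace [a-zA-Z0-9] and \s;
# '=' and ';' are left unescaped because they are not special in ERE.
portable='\$[[:alnum:]]{5,8}[[:space:]]=[[:space:]].{30,50};\$[[:alnum:]]{5,8}[[:space:]]=[[:space:]][[:alnum:]]{5}\(\)'

# LC_ALL=C pins collation and character classification for this one command.
count=$(LC_ALL=C grep -cE "$portable" "$file")
echo "matching lines: $count"

rm -f "$file"
```

Setting LC_ALL=C for the single command keeps [[:alnum:]] limited to ASCII letters and digits, which is closest to what [a-zA-Z0-9] was meant to express.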

No worries. It's your answer - do re-edit it to reflect what you want to say — Chris Davies, Feb 05 '21 at 16:39
