5

I have a list of strings; for each of those strings, I want to check whether it occurs anywhere in a big source code directory.

I came up with a GNU grep solution that gives me what I want:

for key in $(cat /tmp/listOfKeys.txt); do
    if [ "$(grep -rio -m 1 "$key" . | wc -l)" = "0" ]; then
        echo "$key has no occurence"; 
    fi
done

However, it's not efficient at all, since it always greps every file in the directory, even if it finds a match early. Since there are a lot of keys to look up, and quite a lot of files to search in, it is not usable as-is.

Do you know a way to do this efficiently with a "standard" unix tool?

nimai
    What is a "standard" unix tool? Folks have been writing new software, e.g. http://betterthanack.com/ – thrig Mar 04 '16 at 15:28
  • I agree it's a bit vague. I expected something like grep, awk, ... that allows me to do everything in a few command lines. I don't know Ack; it might be a good match, but it seems to make some semantics-based assumptions and I'd like to avoid that. However, I'd be curious to see what you could propose with that tool. I'll try it anyway, thanks. – nimai Mar 04 '16 at 15:41
  • Have used CodeSearch for static code bases. It takes some time (and resources) to index huge source trees, but once done searches are very fast. Not sure how this fits your Q though. – Runium Mar 04 '16 at 17:22
  • You may want to look at ctags or cscope to index your code if those strings are code symbols. – Stéphane Chazelas Mar 04 '16 at 17:42

3 Answers

5

It can at least be simplified to:

set -f # needed if you're using the split+glob operator and don't want the
       # glob part

for key in $(cat /tmp/listOfKeys.txt); do
   grep -riFqe "$key" . ||
    printf '%s\n' "$key has no occurrence"
done

That stops searching as soon as the first occurrence of the key is found (-q), and no longer treats the key as a regular expression (-F) or as a possible option to grep (-e).
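
If your keys are one per line and may contain blanks or wildcard characters, the split+glob loop can also be avoided entirely by reading the list line by line; a minimal sketch, assuming /tmp/listOfKeys.txt holds one key per line:

while IFS= read -r key; do
  grep -riFqe "$key" . ||
    printf '%s\n' "$key has no occurrence"
done < /tmp/listOfKeys.txt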

To avoid having to read the files several times, and assuming your list of keys is one key per line (as opposed to space/tab/newline-separated as in the for loop above), you could do this with GNU tools:

find . -type f -size +0 -printf '%p\0' | awk '
  # argument 2 is stdin: the NUL-delimited file list from find;
  # queue each file name as an extra argument for awk to read later
  ARGIND == 2 {ARGV[ARGC++] = $0; next}
  # argument 4 is the key list: record each key in lower case
  ARGIND == 4 {a[tolower($0)]; n++; next}
  # arguments 5 and beyond are the files queued above
  {
    l = tolower($0)
    for (i in a) if (index(l, i)) {
      delete a[i]
      if (!--n) exit   # stop once every key has been found
    }
  }
  END {
    for (i in a) print i, "has no occurrence"
  }' RS='\0' - RS='\n' /tmp/listOfKeys.txt

It's optimised in that it stops looking for a key as soon as that key has been seen, stops altogether as soon as all the keys have been found, and reads each file only once.

It assumes keys are unique in listOfKeys.txt. It will output the keys in lower case.
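
If you'd rather see the keys reported with their original spelling, a variation (just a sketch, under the same assumptions) can store the original key as the value of each array entry and print that in the END block:

find . -type f -size +0 -printf '%p\0' | awk '
  ARGIND == 2 {ARGV[ARGC++] = $0; next}
  ARGIND == 4 {a[tolower($0)] = $0; n++; next}
  {
    l = tolower($0)
    for (i in a) if (index(l, i)) {
      delete a[i]
      if (!--n) exit
    }
  }
  END {
    for (i in a) print a[i], "has no occurrence"
  }' RS='\0' - RS='\n' /tmp/listOfKeys.txt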

The GNUisms above are -printf '%p\0', ARGIND, and the ability of awk to handle NUL-delimited records. The first two can be addressed with:

find . -type f -size +0 -exec printf '%s\0' {} + | awk '
  step == 1 {ARGV[ARGC++] = $0; next}
  step == 2 {a[tolower($0)]; n++; next}
  {
    l = tolower($0)
    for (i in a) if (index(l, i)) {
      delete a[i]
      if (!--n) exit
    }
  }
  END {
    for (i in a) print i, "has no occurrence"
  }' step=1 RS='\0' - step=2 RS='\n' /tmp/listOfKeys.txt step=3

The third one could be addressed with tricks like this one, but that's probably not worth the effort. See Barefoot IO's solution for a way to bypass the problem altogether.

  • I wasn't aware of ARGIND (perhaps because I rarely use gawk). I like what you did there; it's a clever and succinct way of not only passing pathnames robustly but of having gawk read them automatically. – Barefoot IO Mar 04 '16 at 20:19
  • @BarefootIO, glad you like it. ARGIND can easily be emulated with other awks (see edit). The true gawk-specific feature here is the ability to deal with NUL delimited records. – Stéphane Chazelas Mar 04 '16 at 22:08
  • This is exactly what I wanted. It's very efficient and does the job well. Thank you! – nimai Mar 07 '16 at 13:14
  • To find all of the strings inside a file, you can run grep in a for loop: https://unix.stackexchange.com/a/462445/43233 – Noam Manos Aug 14 '18 at 07:03
5

GNU grep (as well as most variants I know of) offers a -f option, which does exactly what you need. The fgrep variant treats the input lines as plain, literal strings instead of regexes.

fgrep -rio -f /tmp/listOfKeys.txt .

And if you just want to test whether at least one match is found, add the -q option. Per Stéphane's comment, if you need to know which strings were not found, add the -h option and then pipe through this common awk idiom:

fgrep -h -rio -f /tmp/listOfKeys.txt . |
awk '{$0=tolower($0)}; !seen[$0]++' |
fgrep -v -i -x -f - /tmp/listOfKeys.txt

The second fgrep now uses the first fgrep's output (de-duplicated case-insensitively), inverts the sense, and shows the non-matching strings from the key file.
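
Since fgrep is deprecated in favor of grep -F (see the comments below), the same pipeline can also be spelled with plain GNU grep; this is just the equivalent invocation, nothing else changes:

grep -F -h -rio -f /tmp/listOfKeys.txt . |
awk '{$0=tolower($0)}; !seen[$0]++' |
grep -F -v -i -x -f - /tmp/listOfKeys.txt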

Otheus
  • Note that it won't give you the list of keys that are not matched, but you could add the -h option (GNU-specific), pipe to sort -u, and compare against the listOfKeys with comm (after having converted both to upper or lower case because of -i). – Stéphane Chazelas Mar 04 '16 at 16:10
  • Also, fgrep is deprecated in favor of grep -F. – terdon Mar 04 '16 at 16:43
  • @terdon Who the hell deprecates these things? I think they deserve a new label "deprecato-terrorist" – Otheus Mar 04 '16 at 16:45
  • Heh, actually, fgrep has been deprecated for many years now. It still works (at least in GNU grep) but there's no guarantee it will continue to do so. Admittedly, fgrep, egrep etc, are kind of silly. They're just symlinks to grep. Using an option seems far more reasonable to me. – terdon Mar 04 '16 at 17:07
  • Is it deprecated from Solaris and AIX also? @terdon Seriously, where do I complain about this? – Otheus Mar 04 '16 at 17:14
  • Dunno, but grep -E and grep -F (which have replaced egrep and fgrep respectively) are both defined by POSIX (and were also defined by the previous POSIX release). The POSIX specs don't state that the [fe]grep utilities are deprecated, but do refer to them as "historical". Sounds like you'll need to complain to a lot of people... – terdon Mar 04 '16 at 17:37
  • @terdon, they were actually introduced by POSIX, so would be in all revisions of POSIX. – Stéphane Chazelas Mar 04 '16 at 17:51
  • Your variants find all matches of all keys or one match of any key, but the question is about finding one match for each key. – Gilles 'SO- stop being evil' Mar 04 '16 at 21:20
  • @StéphaneChazelas I originally had the awk idiom, then realized a second fgrep would do the trick. Is the awk really necessary at all? I don't think so. And for some reason, my Mac's grep doesn't like -f - which I find very very strange ("No such file or directory"). – Otheus Mar 04 '16 at 22:48
  • @Gilles that's not so clear to me. Seems the OP is (1) wanting to find strings that are not in the directory (2) read all files no more than once. – Otheus Mar 04 '16 at 22:52
  • Mac's grep used to be GNU grep, but it's now a reimplementation. It looks like they broke backward compatibility when they (I don't know if "they" is FreeBSD or Apple, Apple's grep already diverges from FreeBSD) rewrote it, probably to get POSIX conformance, as there's nothing in the POSIX spec that would allow -f - to treat - as anything else but the - file in the current directory. Feel free to revert. The awk filter could increase performance if there's a very large list of matches. You'll want to keep -x though. – Stéphane Chazelas Mar 04 '16 at 23:02
  • I haven't investigated why, but this solution lists some keys several times. Otherwise, it looks good but it's clearly not as efficient as @StéphaneChazelas 's solution. – nimai Mar 07 '16 at 13:18
  • @nimai This one is necessarily exhaustive. All the files are searched for all strings, regardless of whether some have already been found. It has the benefit of simplicity and of being easy to recall from memory. Also not sure why keys are listed multiple times. – Otheus Mar 07 '16 at 13:54
1

A portable, POSIX-compliant translation of Stéphane Chazelas' gawk approach:

find . -type f -exec cat {} + |
awk '
    FNR==NR {keys[tolower($0)]; n++; next}
    {
        s = tolower($0)
        for (k in keys) 
            if (index(s, k)) {
                delete keys[k]
                if (!--n)
                    exit
            }
    }
    END {
        for (k in keys) print k, "has no occurrence"
    }
' /tmp/listOfKeys.txt -

Unless your source files are unusual, in that their names are consistently longer than their content, Stéphane's solution should be more efficient because less data is piped (which involves copying between buffers in two processes via the kernel).

Barefoot IO
  • Another benefit of your approach over mine is that if there's a very large list of files, awk doesn't wait for find to finish finding them all before starting to look for keys. The memory footprint would also be smaller. – Stéphane Chazelas Mar 04 '16 at 22:10
  • One (pathological) case where our approaches differ is when there are files with extra data after the last newline (i.e. where the last line is not terminated): with cat, that last line would be merged with the first line of the next file. – Stéphane Chazelas Mar 04 '16 at 22:12
  • @StéphaneC: Nice catch. Conceivably that merger could lead to a false positive. Perhaps awk 1 {} + instead. Each of the 4 implementations (gawk, mawk, bwk awk, busybox) that I tested accept and heal a terminal fragment. EDIT: As soon as I posted, I realized that awk's ARGV semantics (such as looking for variable assignments) would make filenames with an = unreadable. – Barefoot IO Mar 05 '16 at 00:09
  • At efficiency's expense, the case of the terminal fragment (sounds like the title of a Perry Mason episode) can be handled with an explicit newline after each cat: find . -type f -exec cat {} \; -exec echo \; – Barefoot IO Mar 05 '16 at 00:25
  • there's no problem with = if filenames start with ./ like here. – Stéphane Chazelas Mar 05 '16 at 06:52