16

I have a directory with about 26,000 files and I need to grep all of them. The problem is that I need it to be as fast as possible, so it's not ideal to make a script where grep takes the name of one file at a time from a find command and writes the matches to a file. Before the "argument list too long" issue it took about 2 minutes to grep all these files. Any ideas how to do it?

edit: there is a script that is making new files all the time, so it's not possible to put all the files into different directories.

5 Answers

22

With find:

cd /the/dir
find . -type f -exec grep pattern {} +

(-type f is to only search in regular files (also excluding symlinks, even if they point to regular files). If you want to search in any type of file except directories (but beware there are some types of files, like fifos or /dev/zero, that you generally don't want to read), replace -type f with the GNU-specific ! -xtype d (-xtype d matches files of type directory after symlink resolution).)
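
For example, the non-directory variant mentioned above would look like this (GNU find only; shown as a sketch):

cd /the/dir
find . ! -xtype d -exec grep pattern {} +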

With GNU grep:

grep -r pattern /the/dir

(but beware that unless you have a recent version of GNU grep, it will follow symlinks when descending into directories). Non-regular files won't be searched unless you add a -D read option. Recent versions of GNU grep will still not search inside symlinks though.
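
For example, to also read devices, FIFOs and sockets while recursing (usually not what you want, as noted above), a sketch would be:

grep -r -D read pattern /the/dir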

Very old versions of GNU find did not support the standard {} + syntax, but there you could use the non-standard:

cd /the/dir &&
  find . -type f -print0 | xargs -r0 grep pattern

Performance is likely to be I/O bound. That is, the time to do the search is the time needed to read all that data from storage.

If the data is on a redundant disk array, reading several files at a time might improve performance (and could degrade it otherwise). If performance is not I/O bound (because, for instance, all the data is in cache) and you have multiple CPUs, concurrent greps might help as well. You can do that with GNU xargs's -P option.

For instance, if the data is on a RAID1 array with 3 drives, or if the data is in cache and you have 3 CPUs with time to spare:

cd /the/dir &&
  find . -type f -print0 | xargs -n1000 -r0P3 grep pattern

(here using -n1000 to spawn a new grep every 1000 files, up to 3 running in parallel at a time).
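
If you'd rather not hard-code the number of parallel jobs, a possible variant (assuming GNU coreutils' nproc is available) matches it to the CPU count:

cd /the/dir &&
  find . -type f -print0 | xargs -n1000 -r0 -P"$(nproc)" grep pattern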

However note that if the output of grep is redirected, you'll end up with badly interleaved output from the 3 grep processes, in which case you may want to run it as:

find . -type f -print0 | stdbuf -oL xargs -n1000 -r0P3 grep pattern

(on a recent GNU or FreeBSD system) or use the --line-buffered option of GNU grep.
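
A sketch of that --line-buffered variant, sending the output to a hypothetical results file:

cd /the/dir &&
  find . -type f -print0 | xargs -n1000 -r0P3 grep --line-buffered pattern > /tmp/results.txt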

If pattern is a fixed string, adding the -F option could improve matters.
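
For example, with a placeholder literal string:

cd /the/dir &&
  find . -type f -exec grep -F 'some literal string' {} +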

If the data is not multi-byte character data, or if it doesn't matter for the matching of that pattern whether the data is treated as multi-byte or not, then:

cd /the/dir &&
  LC_ALL=C grep -r pattern .

could improve performance significantly.
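
The two tips can be combined when both apply, for instance:

cd /the/dir &&
  LC_ALL=C grep -rF pattern .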

If you end up doing such searches often, then you may want to index your data using one of the many search engines out there.

2

26000 files in a single directory is a lot for most filesystems. It's likely that a significant part of the time is taken reading this big directory. Consider splitting it into smaller directories with only a few hundred files each.
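
As a rough sketch of that idea (the bucket scheme and names here are made up, and note the OP's edit about new files arriving all the time), you could hash the existing file names into, say, 100 subdirectories:

cd /the/dir &&
  for f in *; do
    [ -f "$f" ] || continue                      # skip directories, including the buckets themselves
    b="bucket_$(printf '%s' "$f" | cksum | awk '{print $1 % 100}')"
    mkdir -p "$b" && mv -- "$f" "$b/"
  done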

Calling find cannot explain poor performance unless you do it wrong. It's a fast way of traversing a directory, and of ensuring that you don't risk attempting to execute a command line that's too long. Make sure that you use -exec grep PATTERN {} +, which packs as many files as it can per command invocation, and not -exec grep PATTERN {} \;, which executes grep once per file: executing the command once per file is likely to be significantly slower.
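
To make the contrast concrete (PATTERN is a placeholder):

# packs as many files as possible into each grep invocation
find /the/dir -type f -exec grep PATTERN {} +

# runs grep once per file -- typically much slower
find /the/dir -type f -exec grep PATTERN {} \;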

  • Thanks, I will google something about it and probably I will split that. I did exactly what you are writing about and it took 3 times longer than grep alone... – user2778979 Aug 09 '13 at 11:53
  • Gilles, are you saying that performance would differ significantly for 26,000 files in one directory versus 26,000 files distributed over, say, 100 directories? – user001 Sep 02 '18 at 02:55
  • 1
    @user001 Yes. How much they differ depends on the filesystem and possibly on the underlying storage, but I'd expect any filesystem to be measurably faster with 260 files in each of 100 directories compared with 26000 files in a single directory. – Gilles 'SO- stop being evil' Sep 04 '18 at 11:09
  • Thanks for the clarification. I asked a follow-up question on this point in order to understand the basis for the discrepancy. – user001 Sep 04 '18 at 22:24
1

If you need to grep ALL the files multiple times (as you said, running a script), I would suggest looking into RAM disks: copy all the files there and then grep them multiple times. This will speed up your search by a factor of at least 100.

You just need enough RAM. Otherwise, you should look into indexing the files, e.g. into Lucene or a NoSQL database, and then running queries against that.
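
A minimal sketch of the RAM-disk approach on Linux (tmpfs; the mount point and size are hypothetical, and root privileges are assumed):

mkdir -p /mnt/ramdisk &&
  mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk &&
  cp -a /the/dir/. /mnt/ramdisk/ &&
  grep -r pattern /mnt/ramdisk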

  • As noted elsewhere, this doesn't help the fact that there are too many files to run a grep against. There's also the point that: "there is a script that is making new files all the time, so it's not possible to put all files to different dirs." – Jeff Schaller Sep 04 '19 at 19:56
0

If you are open/able to rely on an external utility, the popular utility ripgrep provides the binary rg which uses parallelism and smart file filtering by default.

You can control the level of parallelism if you wish:

-j, --threads NUM

The approximate number of threads to use. A value of 0 (which is the default) causes ripgrep to choose the thread count using heuristics.
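
For example, to mirror the three-way parallelism from the xargs example above (a sketch):

rg -j 3 pattern /the/dir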

Additionally, quoting the ripgrep README, rg "is built on top of Rust's regex engine [which] uses finite automata, SIMD, and aggressive literal optimizations to make searching very fast."

Besides this, there are other aspects that make rg faster than grep. Depending on your situation, you might want to turn some of these off:

  • By default rg is recursive.

If your directory contains subdirectories that you do not want to search, you can disable the recursive behaviour with --max-depth 1.

  • By default rg does not search files listed in any .gitignore or similar file, nor does it search hidden files or binary files.

If you want to remove those filters you can add the -u flag (once (-u) for ignored files, twice (-uu) for ignored and hidden files, three times (-uuu) to stop all filters).

Assuming that you have subdirectories in your directory that you do not want to search and that the default filtering is useful in your situation, you would then run (from your directory):

rg --max-depth 1 pattern

Outside your directory, you'd have to add its path:

rg --max-depth 1 pattern /path/to/dir 

If you want to search ignored files and hidden files, but not binary files, you'd use:

rg --max-depth 1 -uu pattern

If you want to follow symlinks, you have to add the -L flag.
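
For instance, combined with the single-level search above (a sketch):

rg --max-depth 1 -L pattern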

prosoitos
-2

All files in the directory:

grep 'search string' *

And recursively:

grep -R 'search string' *
Markus
  • Care to elaborate the -1? – Markus Aug 07 '13 at 09:31
  • 4
    I didn't downvote, but there are a few issues with yours: the OP mentioned an "arg list too long" error, which your first one won't fix and is probably what the OP was doing before. The second one doesn't help either in that regard (it would have helped had you used . instead of *). * will exclude dot files (though with -R, not the ones in the recursed directories). -R, as opposed to -r, follows symlinks even with recent versions of GNU grep. You'll also have an issue with files in the current directory whose name starts with - – Stéphane Chazelas Aug 07 '13 at 11:20