52

I have a folder containing 250+ files of 2 GB each. I need to search for a string/pattern in those files and write the results to an output file. I know I can run the following command, but it is too slow!

grep mypattern * > output

I want to speed it up. As a Java programmer, I know multi-threading can be used to speed up a process. I'm stuck on how to start grep in "multi-threaded mode" and write the output into a single output file.

Abhishek
  • 629

5 Answers

42

There are two easy solutions for this, based on either xargs or parallel.

xargs Approach:

You can use xargs with find as follows:

find . -type f -print0  | xargs -0 -P number_of_processes grep mypattern > output

Here you replace number_of_processes with the maximum number of processes you want launched. However, this is not guaranteed to give you a significant performance boost if your workload is I/O-bound; in that case you might start more processes than you have cores to compensate for the time lost waiting for I/O.

Also, by using find you can specify more advanced selection criteria than just file name patterns, such as modification time.

One possible issue with this approach, as explained in Stéphane's comments below, is that with few files xargs may not start enough processes for them. One solution is to use the -n option of xargs to specify how many arguments it should take from the pipe at a time. Setting -n1 forces xargs to start a new process for every single file. This may be the desired behaviour if the files are very large (as in this question) and there is a relatively small number of them. However, if the files themselves are small, the overhead of starting a new process may undermine the advantage of parallelism, in which case a larger -n value is better. Thus, the -n option can be fine-tuned according to the sizes and number of the files.
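A hedged starting point for this question's workload (a few hundred large files) is to combine both flags: pin one file per grep with -n1 and cap the process count with -P (8 is just an assumption here; adjust it to your machine and disks). grep's -H flag forces the filename to be printed even though each invocation only sees one file:

find . -type f -print0 | xargs -0 -n1 -P8 grep -H mypattern > output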

Parallel Approach:

Another way to do it is to use Ole Tange's GNU Parallel tool parallel (available here). It offers finer-grained control over parallelism and can even be distributed over multiple hosts (beneficial if your directory is shared, for example). The simplest syntax with parallel is:

find . -type f | parallel -j+1 grep mypattern

where the option -j+1 instructs parallel to run one job more than the number of cores on your machine (this can be helpful for I/O-bound tasks; you may even try going higher).
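As a hedged variant, you can keep the filename in each match and collect everything into a single output file; -H again forces the filename prefix since each grep invocation sees only one file:

find . -type f | parallel -j+1 grep -H mypattern {} > output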

Parallel also has the advantage over xargs of keeping the output from each process contiguous rather than interleaved. For example, with xargs, if process 1 generates a line, say p1L1, process 2 generates a line p2L1, and process 1 then generates another line p1L2, the output might be:

p1L1
p2L1
p1L2

whereas with parallel the output should be:

p1L1
p1L2
p2L1

This is usually more useful than the interleaved xargs output.
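If you additionally want the per-file blocks to come out in the same order in which find lists the files, GNU parallel's -k (--keep-order) flag does that; a minimal sketch:

find . -type f | parallel -k -j+1 grep -H mypattern {} > output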

Bichoy
  • 3,106
  • 2
  • 21
  • 34
  • 1
You'd probably want to use -n in combination with -P. Otherwise, xargs may not end up spawning several processes if there are too few files. – Stéphane Chazelas Apr 25 '15 at 19:43
  • 1
    Well, -n1 would start one grep per file. Unless the files are very big and there are very few of them, you'd probably want to increase that a bit as you'll spend your time starting and stopping grep processes instead of searching in files. – Stéphane Chazelas Apr 25 '15 at 20:48
15

There are at least two ways of speeding up grep CPU-wise:

  • If you're searching for a fixed string rather than a regular expression, specify the -F flag;

  • If your pattern is ASCII-only, use an 8-bit locale instead of UTF-8, e.g. LC_ALL=C grep ....

These won't help though if your hard drive is the bottleneck; in that case probably parallelizing won't help either.
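Both tips can be combined, and they carry over to the xargs and parallel pipelines above as well; a minimal sketch for a fixed ASCII string, mirroring the question's original command:

LC_ALL=C grep -F mystring * > output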

egmont
  • 5,866
  • 2
Just saw in man grep "Direct invocation as either egrep or fgrep is deprecated, but is provided to allow historical applications that rely on them to run unmodified." Not sure this matters really, but fgrep is the same as grep -F – iyrin Apr 25 '15 at 18:05
  • 1
    Also when you say "rather than a pattern" are you referring to a regular expression? – iyrin Apr 25 '15 at 18:10
  • 1
    The "ASCII-only" search uses massively less CPU. But you need to read the caveats mentioned in the comments at http://stackoverflow.com/a/11777835/198219 – famzah Sep 19 '16 at 20:31
4

If the problem is not I/O bound you could use a tool that is optimized for multi-core processing.

You might want to take a look at sift (http://sift-tool.org, disclaimer: I am the author of this tool) or the silver searcher (https://github.com/ggreer/the_silver_searcher).

The silver searcher has a file size limit of 2 GB if you use a regex pattern rather than a simple string search.
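As a hedged illustration (assuming the tool is installed), the silver searcher's basic invocation mirrors grep's and searches the current directory recursively by default; add -u if you want to bypass its default ignore rules:

ag mypattern . > output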

svent
  • 181
  • 1
    Surely searching a bunch of files is a classic example of a problem that is IO bound? – Jonathan Hartley Oct 13 '17 at 01:18
  • 1
^ It's I/O bound, but the immediate bottleneck isn't always I/O. It's possible to build a PC with commercial equipment where the CPU will be the bottleneck, e.g. a Threadripper CPU with x16 NVMe drives. – hiburn8 Jul 14 '20 at 21:44
1

You'll at minimum want to know which files matched; since grep omits the filename when it is given only a single file to search, you have to trick it into printing it by passing /dev/null as an extra argument. Knowing which line number matched might also be nice. Here is option 1: process each file in parallel using xargs.

find . -type f | xargs -n 1 -P 8 grep -n mypattern /dev/null > output

Here is option 2: process files serially, but process blocks of each file in parallel. In this case I can't get the line number, but I can insert the filename with awk.

find . -type f | xargs -I filename parallel -j+1 --pipepart --block 10M -a filename -k "grep mypattern | awk -v OFS=: '{print \"filename\",\$0}'" > output

Option 3 combines 1 and 2.

find . -type f | xargs -n 1 -P 8 -I filename parallel -j+1 --pipepart --block 10M -a filename -k "grep mypattern | awk -v OFS=: '{print \"filename\",\$0}'" > output

I wrote a bit about this in a blog post.

Will High
  • 111
1

I realize that this does not exactly answer your question and it might not be an option for you. However, the popular utility ripgrep provides the binary rg which uses parallelism by default.

If you want to control the level of parallelism, the relevant flag is -j. From the man page:

-j, --threads NUM

The approximate number of threads to use. A value of 0 (which is the default) causes ripgrep to choose the thread count using heuristics.
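For example, to pin ripgrep to a specific thread count (8 here is only an illustration):

rg -j 8 mypattern > output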

Besides the parallelism, quoting the ripgrep README, rg "is built on top of Rust's regex engine [which] uses finite automata, SIMD, and aggressive literal optimizations to make searching very fast."

Finally, there are other aspects that make rg faster than grep. Depending on your situation, you might want to turn some of these off:

  • By default rg is recursive.

If your folder contains subfolders that you do not want to search, you can disable the recursive behaviour with --max-depth 1.

  • By default rg does not search files excluded by any .gitignore or similar file, nor does it search hidden files or binary files.

If you want to remove those filters you can add the -u flag (once for ignored files, twice for ignored and hidden files, three times to stop all filters).

In summary, from your folder of interest, in the simplest case (no subdirectories, and the default filtering actually useful for your situation), you could run:

rg mypattern > output

(From outside your folder, you'd have to add its path: rg mypattern myfolder > output).

If you have subdirs in your folder and you want to suppress the default filtering, the command would then be:

rg --max-depth 1 -uuu mypattern > output

(And from outside your folder: rg --max-depth 1 -uuu mypattern myfolder > output).

While this does not technically answer your question, since it does not use grep, and while it might not work in your case (e.g. if you cannot or do not want to install an external utility), rg uses parallelism and is much faster than grep, so I thought this might still be of use.

prosoitos
  • 203