
I need to run grep on a couple of million files, so I tried to speed it up using the two approaches mentioned here: xargs -P -n and GNU parallel. I tested this on a subset of my files (9,026 of them), and these were the results:

  1. With xargs -P 8 -n 1000, very fast:

    $ time find tex -maxdepth 1 -name "*.json" | \
                    xargs -P 8 -n 1000 grep -ohP "'pattern'" > /dev/null
    
    real    0m0.085s
    user    0m0.333s
    sys     0m0.058s
    
  2. With parallel, very slow:

    $ time find tex -maxdepth 1 -name "*.json" | \
                    parallel -j 8 grep -ohP "'pattern'" > /dev/null
    
    real    0m21.566s
    user    0m22.021s
    sys     0m18.505s
    
  3. Even sequential xargs is faster than parallel:

    $ time find tex -maxdepth 1 -name "*.json" | \
                    xargs grep -ohP 'pattern' > /dev/null
    
    real    0m0.242s
    user    0m0.209s
    sys     0m0.040s
    

xargs -P n does not work for me because the output from all the processes gets interleaved, which does not happen with parallel. So I would like to use parallel without incurring this huge slowdown.
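
My understanding is that parallel avoids the interleaving because, by default, it buffers each job's output and only prints it once the job finishes (its --group behaviour), and that this buffering is part of the per-job overhead. If that is right, a rough way to isolate the buffering cost should be to rerun the slow command with --ungroup (-u), which turns the buffering off at the price of possibly mixed output, for example:

    $ time find tex -maxdepth 1 -name "*.json" | \
                    parallel -u -j 8 grep -ohP "'pattern'" > /dev/null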

Any ideas?

UPDATE

  1. Following the answer by Ole Tange, I tried parallel -X; the results are included here for completeness:

    $ time find tex -maxdepth 1 -name "*.json" | \
        parallel -X -j 8 grep -ohP "'pattern'" > /dev/null
    
    real    0m0.563s
    user    0m0.583s
    sys     0m0.110s
    
  2. Fastest solution: following the comment by @cas, I tried grep with the -H option (to force printing the filenames) and then sorted the output (a stable-sort variant is sketched after this list). Results:

    time find tex -maxdepth 1 -name '*.json' -print0 | \
        xargs -0r -P 9 -n 500 grep --line-buffered -oHP 'pattern' | \
        sort -t: -k1 | cut -d: -f2- > /dev/null
    
    real    0m0.144s
    user    0m0.417s
    sys     0m0.095s
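
For completeness, the comments below also suggest making the sort stable and keying it on the filename field only, so that lines coming from the same file cannot be reordered relative to each other. A sketch of that variant (only the sort invocation differs from the command above; I have not re-timed it):

    $ find tex -maxdepth 1 -name '*.json' -print0 | \
        xargs -0r -P 9 -n 500 grep --line-buffered -oHP 'pattern' | \
        sort -s -t: -k1,1 | cut -d: -f2- > /dev/null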
    
  • I think the slowdown might be caused by parallel launching a new shell for each invocation. If so, is there any way to avoid it? – nofrills Mar 30 '16 at 16:16
  • how many lines of output do you expect? – iruvar Mar 30 '16 at 18:24
  • @1_CR I expect 2 - 10 lines of output from each file. – nofrills Mar 30 '16 at 21:55
  • Ole Tange is usually on this forum and will probably have the last word on this, but I suspect buffering up that many lines to be able to finally report them in order might have something to do with it – iruvar Mar 30 '16 at 22:14
  • Have you tried xargs -P8 with the -H (--with-filename) option to grep instead of -h (--no-filename), then piping the result through sort -t: -k1 | cut -d: -f2- to sort by filename and then strip it? – cas Mar 31 '16 at 00:00
  • @cas, you're going to need -s to enable stable sort (disable last resort comparison) else output from within a file could come out of order – iruvar Mar 31 '16 at 00:09
  • also, i would suggest using find ... -print0 and xargs -0 so that your script works even with filenames containing annoying characters like spaces and newlines. time find tex -maxdepth 1 -name '*.json' -print0 | xargs -0r -P 8 -n 1000 grep -oHP 'pattern' | sort -t: -k1 | cut -d: -f2- > /dev/null – cas Mar 31 '16 at 00:09
  • not sure what you mean by the -s option (how does it differ from sort -k1,1?). BTW, i seem to have found a bug in GNU grep where -H -z doesn't act like -l -z even though the description for -z implies it should. – cas Mar 31 '16 at 00:15
  • @cas, I meant the one from [here](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html). Apologies for brevity, on a mobile device now – iruvar Mar 31 '16 at 00:18
  • ah, okay. the info docs mention what --stable does, but the man page only lists --stable with minimal description. I hate the way GNU tools leave important info out of man pages and expect you to rely on their crappy .info format. – cas Mar 31 '16 at 00:23
  • actually, not a bug in grep. my mistake, should be using -Z, not -z. but using that reveals that sort can't use NUL as a field delimiter (as opposed to line separator) anyway, so there's no point in using -print0 in find or -0 in xargs. – cas Mar 31 '16 at 00:30
  • @cas: Thanks for the grep -H suggestion. However, I still get mangled output, this time within the same line, like so: line1word1_line2word1_line1word2. – nofrills Mar 31 '16 at 13:13
  • This got solved by using grep --line-buffered – nofrills Apr 03 '16 at 14:09
  • By using xargs -P you still risk getting mangled output - even with grep --line-buffered. See an example of mangling on https://www.gnu.org/software/parallel/parallel_alternatives.html#DIFFERENCES-BETWEEN-xargs-AND-GNU-Parallel – Ole Tange Mar 06 '18 at 07:28

1 Answer


Try parallel -X. As noted in the comments, the overhead of starting a new shell and opening files for buffering for each argument is probably the cause.

Be aware that GNU Parallel will never be as fast as xargs because of that. Expect an overhead of 10 ms per job. With -X this overhead is less significant as you process more arguments in one job.
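
As a minimal illustration of the difference (toy arguments rather than your files, and -j1 only to make the grouping deterministic): without -X, parallel runs one job, and hence one shell, per input line, whereas with -X it packs as many arguments as fit onto each command line, much like xargs -n:

    $ printf '%s\n' a b c d | parallel -j1 echo      # one job per argument
    a
    b
    c
    d
    $ printf '%s\n' a b c d | parallel -j1 -X echo   # all arguments packed into one job
    a b c d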

Ole Tange
  • Thanks a lot! This indeed came out to be about twice as fast as xargs. I have included the results in the question. – nofrills Mar 31 '16 at 13:14