I need to run grep on a couple of million files, so I tried to speed it up following the two approaches mentioned here: xargs -P -n and GNU parallel. I tried this on a subset of my files (9026 in number), and this was the result:
With xargs -P 8 -n 1000, very fast:

$ time find tex -maxdepth 1 -name "*.json" | \
    xargs -P 8 -n 1000 grep -ohP "'pattern'" > /dev/null

real    0m0.085s
user    0m0.333s
sys     0m0.058s
With parallel, very slow:

$ time find tex -maxdepth 1 -name "*.json" | \
    parallel -j 8 grep -ohP "'pattern'" > /dev/null

real    0m21.566s
user    0m22.021s
sys     0m18.505s
Even sequential xargs is faster than parallel:

$ time find tex -maxdepth 1 -name "*.json" | \
    xargs grep -ohP 'pattern' > /dev/null

real    0m0.242s
user    0m0.209s
sys     0m0.040s
xargs -P n does not work for me because the output from all the processes gets interleaved, which does not happen with parallel. So I would like to use parallel without incurring this huge slowdown.
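To make the interleaving concrete, here is a minimal sketch of the kind of test I mean (the file names and pattern below are made up for illustration, not my real data): two throwaway files whose matches are easy to tell apart, searched once through xargs -P and once through parallel.

# two throwaway files whose matches are easy to tell apart
yes aaaaaaaaaaaaaaaa | head -c 2M > a.txt
yes bbbbbbbbbbbbbbbb | head -c 2M > b.txt

# with xargs -P both greps write to the pipe at the same time, so
# lines from a.txt and b.txt typically come out interleaved (and,
# with block-buffered output, can even be split mid-line)
printf '%s\n' a.txt b.txt | xargs -P 2 -n 1 grep -ohP 'a+|b+' | uniq -c

# GNU parallel groups each job's output by default, so all of
# a.txt's matches are printed together, then all of b.txt's
printf '%s\n' a.txt b.txt | parallel -j 2 grep -ohP 'a+|b+' | uniq -c

With interleaved output, uniq -c should report many small alternating groups; with grouped output it should report just two large ones.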
Any ideas?
UPDATE
Following the answer by Ole Tange, I tried parallel -X; the results are here, for completeness:

$ time find tex -maxdepth 1 -name "*.json" | \
    parallel -X -j 8 grep -ohP "'pattern'" > /dev/null

real    0m0.563s
user    0m0.583s
sys     0m0.110s
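For what it's worth, my reading of the parallel man page (not something stated in the answer itself) is that without -X parallel runs one grep per file name, i.e. thousands of short-lived grep processes, while -X packs as many file names as fit onto each command line, much like xargs -n. A toy illustration of the difference:

$ parallel echo ::: a b c     # one echo per argument (three jobs; output order may vary)
a
b
c
$ parallel -X echo ::: a b c  # arguments packed onto a single command line
a b c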
Fastest solution: Following the comment by @cas, I tried grep with the -H option (to force printing the filenames) and sorting. Results here:

$ time find tex -maxdepth 1 -name '*.json' -print0 | \
    xargs -0r -P 9 -n 500 grep --line-buffered -oHP 'pattern' | \
    sort -t: -k1 | cut -d: -f2- > /dev/null

real    0m0.144s
user    0m0.417s
sys     0m0.095s
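To spell out what the sort/cut stage is doing (the toy input below is mine, not real output): grep -H prefixes each match with its file name, sort -t: -k1 brings all lines with the same file-name prefix together, and cut -d: -f2- strips the prefix again, so each file's matches end up contiguous instead of interleaved:

$ printf 'b.json:beta1\na.json:alpha1\nb.json:beta2\n' | sort -t: -k1 | cut -d: -f2-
alpha1
beta1
beta2

Note that -k1 here runs from field 1 to the end of the line, so the match text also takes part in the comparison; the comments below get into -s and -k1,1 for keeping each file's matches in their original order (see the sketch after the comments).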
COMMENTS

Could it be parallel launching a new shell for each invocation? If yes, is there any way to avoid it? – nofrills Mar 30 '16 at 16:16

xargs -P8 with the -H (--with-filename) instead of the -h (--no-filename) option to grep, then pipe through sort -t: -k1 | cut -d: -f2- to sort by filename and then strip the filename? – cas Mar 31 '16 at 00:00

You may want -s to enable stable sort (disable last resort comparison), else output from within a file could come out of order. – iruvar Mar 31 '16 at 00:09

Also use find ... -print0 and xargs -0 so that your script works even with filenames containing annoying characters like spaces and newlines: time find tex -maxdepth 1 -name '*.json' -print0 | xargs -0r -P 8 -n 1000 grep -oHP 'pattern' | sort -t: -k1 | cut -d: -f2- > /dev/null – cas Mar 31 '16 at 00:09

Didn't know about the -s option (how does it differ from sort -k1,1?). BTW, I seem to have found a bug in GNU grep where -H -z doesn't act like -l -z even though the description for -z implies it should. – cas Mar 31 '16 at 00:15

The info docs explain what --stable does, but the man page only lists --stable with a minimal description. I hate the way GNU tools leave important info out of man pages and expect you to rely on their crappy .info format. – cas Mar 31 '16 at 00:23

Not a bug in grep; my mistake, should be using -Z, not -z. But using that reveals that sort can't use NUL as a field delimiter (as opposed to line separator) anyway, so there's no point in using -print0 in find or -0 in xargs. – cas Mar 31 '16 at 00:30

Thanks for the grep -H suggestion. However, I still get mangled output, this time within the same line, like so: line1word1_line2word1_line1word2. – nofrills Mar 31 '16 at 13:13

The mangling went away after adding grep --line-buffered. – nofrills Apr 03 '16 at 14:09

With xargs -P you still risk getting mangled output, even with grep --line-buffered. See an example of mangling on https://www.gnu.org/software/parallel/parallel_alternatives.html#DIFFERENCES-BETWEEN-xargs-AND-GNU-Parallel – Ole Tange Mar 06 '18 at 07:28
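Picking up the -s / -k1,1 exchange between iruvar and cas above, here is a small sketch (again with made-up input) of the difference. With -k1 the key runs to the end of the line, so two matches from the same file are reordered alphabetically; with -s -k1,1 only the file-name field is compared and the stable sort preserves the order in which grep emitted the matches:

$ printf 'a.json:zzz\na.json:aaa\n' | sort -t: -k1 | cut -d: -f2-
aaa
zzz
$ printf 'a.json:zzz\na.json:aaa\n' | sort -s -t: -k1,1 | cut -d: -f2-
zzz
aaa

Without -s, even -k1,1 falls back to comparing whole lines when the keys are equal, which is the "last resort comparison" iruvar mentions.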