1

I want to grep for a pattern across 10M files as fast as possible on a 36-core machine. I tried this:

find . -name '*.xml' -type f | xargs  -P 20 grep "username" >> output

But I am getting some other results mixed in between.

Is there a better way to do this?

Thanks in advance.

Jeff Schaller

3 Answers

1

Given that your data is on non-RAIDed HDDs, I doubt you'll get better performance from parallelizing: the bottleneck is most likely to be I/O, not CPU.

LC_ALL=C grep -rwF --include='*.xml' username . > /on/some/other/disk/output

That may be close to the best you can achieve.
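As a rough illustration of what those options select, here is a minimal sketch on a throwaway tree. The file names and contents are made up for the demo, and it assumes a grep that supports -r and --include (GNU grep does).

```shell
#!/bin/sh
# Hypothetical sample tree, created only for this demo.
workdir=$(mktemp -d)
mkdir -p "$workdir/tree/sub"
printf 'username=alice\n' > "$workdir/tree/a.xml"      # whole-word match in an .xml file
printf 'usernames=many\n' > "$workdir/tree/sub/b.xml"  # rejected by -w ("usernames" is a longer word)
printf 'username=bob\n'   > "$workdir/tree/sub/c.txt"  # skipped by --include='*.xml'

# -r recurses, -w matches whole words only, -F treats the pattern as a fixed string.
# LC_ALL=C avoids multibyte character handling.
result=$(cd "$workdir/tree" && LC_ALL=C grep -rwF --include='*.xml' username .)
printf '%s\n' "$result"
```

Only the line from a.xml should be reported, prefixed with its path.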

To parallelize, you'd want to do it as:

LC_ALL=C find . -name '*.xml' -type f -print0 |
  LC_ALL=C xargs -r0P20 -n 1000 grep -HFw --line-buffered username > output

That assumes no output line (input line plus file pathname) is longer than 4KiB. Note that the lines from all 20 concurrent greps will end up interleaved in the output.
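If the interleaving is a problem, one possible variation (not part of the command above, just a sketch) is to have each grep batch write to its own temporary file and concatenate the pieces afterwards. It assumes GNU find, xargs and grep plus mktemp; the PARTS variable, the directory layout and the sample files are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data, created only for this demo.
workdir=$(mktemp -d)
mkdir "$workdir/data" "$workdir/parts"
printf '<user username="alice"/>\n' > "$workdir/data/a.xml"
printf '<user username="bob"/>\n'   > "$workdir/data/b.xml"
printf '<user name="carol"/>\n'     > "$workdir/data/c.xml"

# PARTS is an assumed name: the directory that collects per-batch output files.
export PARTS="$workdir/parts"
(
  cd "$workdir/data" || exit 1
  LC_ALL=C find . -name '*.xml' -type f -print0 |
    LC_ALL=C xargs -r0 -P 4 -n 1000 sh -c '
      # Each batch writes to its own unique file, so output from
      # different grep processes is never mixed within a file.
      grep -HFw username "$@" > "$(mktemp "$PARTS/part.XXXXXX")"
    ' sh
)

# Merge the per-batch files; sort is only here to make the demo output stable.
cat "$PARTS"/part.* | sort > "$workdir/output"
cat "$workdir/output"
```

The trade-off is extra temporary files and a final concatenation pass, in exchange for not needing --line-buffered or any assumption about line length.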

See:

for details.

0

I think it's because the pattern is too general. Command-line utilities usually print errors to stderr, which goes to the terminal, so they shouldn't end up in the output file.
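To check that, here is a small sketch that sends matches and error messages to separate files. The file names are made up for the demo, and missing.xml deliberately does not exist.

```shell
#!/bin/sh
# Hypothetical sample file, created only for this demo.
workdir=$(mktemp -d)
printf 'username=alice\n' > "$workdir/a.xml"

# stdout (matches) goes to "output", stderr (error messages) goes to "errors".
# grep exits non-zero because one file is missing; record that instead of failing.
status=0
grep -H username "$workdir/a.xml" "$workdir/missing.xml" \
  > "$workdir/output" 2> "$workdir/errors" || status=$?

echo "matches:"; cat "$workdir/output"
echo "errors:";  cat "$workdir/errors"
```

If the unexpected lines show up in output rather than errors, they are real matches and not error messages.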

saga
0

If you are grepping xml files in this way then your search will return the entire line containing the search string and, if the xml file has no newlines, the entire file contents. Quite a lot of "other" across 10M files.

As per @Kusalananda's comment, it isn't good practice to brute-force XML with grep; an XML parser such as xmllint is a better tool. However, if you insist...

Check the grep man page and read up on the -o option, which restricts the output to the matched text, and use a regex that defines the whole length of the match you are looking for.

If username is an attribute

grep -o 'username="[^"]*"'
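For instance, on a single-line XML document (the sample content below is made up), this sketch shows -o printing only the matched attributes instead of the whole line:

```shell
#!/bin/sh
# Hypothetical one-line XML document with no newlines between elements.
line='<users><user id="1" username="alice"/><user id="2" username="bob"/></users>'

# -o prints each match on its own line rather than the entire input line.
result=$(printf '%s\n' "$line" | grep -o 'username="[^"]*"')
printf '%s\n' "$result"
```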

Or better

xmllint --xpath "//@username"

If username is a node then something like

grep -o "username>[^<][^<]*"
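A quick sketch of the node case, again on made-up single-line input. The sed step that strips the leading tag remnant is an addition for illustration, not something the pattern requires.

```shell
#!/bin/sh
# Hypothetical one-line XML document using <username> elements.
line='<users><username>alice</username><username>bob</username></users>'

# The pattern matches from "username>" up to (not including) the next "<".
# Closing tags are skipped here because each is followed directly by "<" or end of line.
raw=$(printf '%s\n' "$line" | grep -o "username>[^<][^<]*")

# Optional: drop the "username>" prefix to keep only the node text.
names=$(printf '%s\n' "$raw" | sed 's/^username>//')
printf '%s\n' "$names"
```

Note that a closing tag followed by text or whitespace on the same line would also match, which is one more reason to prefer a real XML parser.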

Or better

xmllint --xpath "//username"

With either of the xmllint queries, wrap the query in string() to extract the attribute or node text. Note that string() applied to a node set returns the text of the first matching node only.

xmllint --xpath "string(//username)"
xmllint --xpath "string(//@username)"
bu5hman