5

There is a very large text file with two comma-separated values:

78414387,10033
78769989,12668
78771319,13677
78771340,13759
80367563,16336
81634533,10025
82878571,10196
110059366,10218
110059411,10812
110059451,10067

I need to search for these values in a log file that looks like this:

- delivery-AMC_prod_product 231825855936862016-07-02 00:00:52 c.c.i.d.s.d.DeliveryTopologyFactory$$anon$1$$anon$2 [INFO] ack: uid=57773c737e3d80d7def296c7| id=278832702| version=28| timestamp=1467432051000
- delivery-AMC_prod_product 231825855936862016-07-02 00:00:52 c.c.i.d.s.d.DeliveryTopologyFactory$$anon$1$$anon$2 [INFO] ack: uid=57773c732f18c26fe604fd04| id=284057302| version=9| timestamp=1467432051000
- delivery-AMC_prod_product 231825855936862016-07-02 00:00:52 c.c.i.d.s.d.DeliveryTopologyFactory$$anon$1$$anon$2 [INFO] ack: uid=57773c747e3d80d7def296c8| id=357229| version=1151| timestamp=1467432052000
- delivery-AMC_prod_product 231825855936862016-07-02 00:00:52 c.c.i.d.s.d.DeliveryTopologyFactory$$anon$1$$anon$2 [INFO] ack: uid=57773c742f18c26fe604fd05| id=279832706| version=35| timestamp=1467432052000
- delivery-AMC_prod_product 231825855936862016-07-02 00:00:52 c.c.i.d.s.d.DeliveryTopologyFactory$$anon$1$$anon$2 [INFO] ack: uid=57773c744697ddb976cf5a95| id=354171| version=503| timestamp=1467432052000
- delivery-AMC_prod_product 231825855936862016-07-02 00:00:53 c.c.i.d.s.d.DeliveryTopologyFactory$$anon$1$$anon$2 [INFO] ack: uid=57773c754697ddb976cf5a96| id=355638| version=1287| timestamp=1467432053000

My script:

#!/bin/bash
COUNT=0
while IFS=',' read ID VERSION; do
    VERSION=`echo $VERSION |col -bx`
    if (grep "id=${ID}| version=$VERSION" worker-6715.log.2016-$1.log.* > /dev/null); then
            let "COUNT++"
    else
            echo "$ID, $VERSION FAIL"
            exit 2

    fi
done < recon.txt

echo "All OK, $COUNT checked"
  • If I cut off unnecessary fields from the log file, would this speed the execution up?
  • If I create a RAM device and copy the logfile there, would this speed the execution up or is my Red Hat Linux 6 (Hedwig) caching the file anyway? Any other suggestions?
user1700494
  • 2,202
  • Your pattern seems to match, but if you could use -E to match it explicitly, then that would speed things up too. – Keyshov Borate Jul 04 '16 at 11:29
  • @Niranjan the | is part of the pattern, not an OR. – terdon Jul 04 '16 at 11:30
  • What does the ...| col -bx achieve? It spawns a subshell and an external process, each time round the loop, but doesn't look like it should have any effect on VERSION. – JigglyNaga Jul 04 '16 at 14:58
  • Your problem seems very close to https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions – Ole Tange Jul 05 '16 at 06:52
  • It will be good to know how big the files are compared to the available memory. If one of the files can fit in RAM it can change the best algorithm dramatically. – Ole Tange Jul 05 '16 at 07:07

5 Answers

5

Your bottleneck is reading the recon.txt file line by line in the shell. To get the failing lines you could preprocess the lines in the logs to look like the lines in recon.txt, then use comm(1) to find the set difference, perhaps like this:

comm -23 \
    <(sort -u recon.txt) \
    <(sed 's/.*| id=\([0-9]*\)| version=\([0-9]*\)|.*/\1,\2/' worker-6715.log.2016-$1.log.* | \
        sort -u)

This assumes a shell that can handle <(...) constructs. Also, please note that the lines in the result don't preserve the order of the lines in recon.txt. Keeping that order would be a lot harder (and slower).
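
If your shell can't handle those constructs, a rough equivalent with explicit temporary files might look like this (a sketch; the temp file names are arbitrary):

sort -u recon.txt > /tmp/recon.sorted
sed 's/.*| id=\([0-9]*\)| version=\([0-9]*\)|.*/\1,\2/' worker-6715.log.2016-$1.log.* |
    sort -u > /tmp/logs.sorted
comm -23 /tmp/recon.sorted /tmp/logs.sorted
rm -f /tmp/recon.sorted /tmp/logs.sorted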

If you also need the success count, you can do it the other way around: preprocess recon.txt so that it looks like what might be found in the logs, then use fgrep(1) or grep -F to do the search. Setting the locale to C can also speed things up substantially on some systems. Thus:

COUNT=$( \
    sed 's/\([0-9]*\),\([0-9]*\)/| id=\1| version=\2|/' recon.txt | \
    LC_ALL=C fgrep -f - worker-6715.log.2016-$1.log.* | \
    wc -l )

This assumes that recon.txt doesn't contain duplicates, and that each line in recon.txt matches at most once across all the logs. Lifting the first restriction would be hard. The second could be lifted with a carefully chosen comm(1).
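
For example (a sketch, reusing the sed from above): comm -12 prints only the lines common to both sorted sets, so each recon.txt line is counted at most once no matter how many times it matches in the logs:

COUNT=$( comm -12 \
    <(sort -u recon.txt) \
    <(sed 's/.*| id=\([0-9]*\)| version=\([0-9]*\)|.*/\1,\2/' worker-6715.log.2016-$1.log.* | \
        sort -u) | \
    wc -l )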

Satō Katsura
  • 13,368
  • @terdon Please. fgrep has been working fine for 40 years. It isn't going away any time soon, regardless of what agenda the GNU people are trying to push. Also, fgrep = grep -F, not grep -f. – Satō Katsura Jul 04 '16 at 11:31
  • Fair point about -f, that was a typo, my bad. As for the rest, it has nothing to do with GNU, it's POSIX. Yes, I know it's been around for years and is likely to go on being so for a while but there's no reason to force the new generation to use our bad habits. The grep -F is preferred and recommended by POSIX so why not get used to using it? – terdon Jul 04 '16 at 11:33
  • @terdon Because historically fgrep and grep are very different beasts. fgrep was based on Aho–Corasick, while grep was based on regexps. – Satō Katsura Jul 04 '16 at 11:36
  • OK. And why is that remotely relevant 20 or more years down the line? The current (and previous) version of the POSIX specs clearly defines grep -F and states that "The old [emphasis mine] egrep and fgrep commands are likely to be supported for many years to come as implementation extensions". Why continue to use the old way when the standards have changed? Why continue to produce legacy code and force devs to support obsolete commands? We should move with the times or, at least, update our recommendations with the times. – terdon Jul 04 '16 at 11:43
  • So? Does anyone actually use a stand-alone fgrep any more (I doubt it)? Or is it just a hard or symbolic link to grep? Surprisingly, GNU fgrep is a wrapper sh script around exec grep -F "$@" (I assumed it would be a link... so GNU fgrep has the overhead of forking a shell and IFS splitting the args again, UUOfgrep!). – cas Jul 04 '16 at 11:45
  • In the second method, you could use grep -c (also in POSIX) instead of passing it on to wc. – JigglyNaga Jul 04 '16 at 15:10
  • @JigglyNaga Not really. grep -c counts each line matched, and supposedly there are many of those. The OP wants the grand total, and grep -c doesn't provide that. – Satō Katsura Jul 04 '16 at 15:25
  • @SatoKatsura . Oh, I hadn't realised that there could be multiple log files. You could cat all the log files together and pipe that to grep (so that it counts them all as one), but then you'd have to put the patterns in a temp file as the argument for -f, and that's not simplifying things any more. – JigglyNaga Jul 04 '16 at 15:39
  • Hmm. Reading the OP's example again, it seems they don't want to count total lines matched, either. They want a count of the number of log files that contained any matches - in which case, add -l to the grep options, and pipe to wc -l as before. – JigglyNaga Jul 04 '16 at 15:43
  • @JigglyNaga COUNT is the number of lines from recon.txt that matches in at least one of the logs. – Satō Katsura Jul 04 '16 at 16:07
  • @SatoKatsura Your example produces the total number of lines, in all of the log files, that matched something from recon.txt. If 2 lines from recon.txt are found in 5 log files, 10 times each, that grep command will display all 100 of them. It's only the same if every ID+VERSION pair has only one match across all log files. – JigglyNaga Jul 04 '16 at 17:07
  • @JigglyNaga Hence the note about assumptions. – Satō Katsura Jul 04 '16 at 17:10
  • Oops, sorry, I think I read everything out of order there. – JigglyNaga Jul 04 '16 at 17:12
  • Sato briefly mentioned setting LC_ALL=C but this is very important in a grep on large files. See http://www.inmotionhosting.com/support/website/ssh/speed-up-grep-searches-with-lc-all for details. As discussed above, use grep -F (not fgrep) if your search string is not a regex for a performance boost. – Insomniac Software Jul 04 '16 at 19:23
1

As pointed out, the main problem here is the shell loop. I'd use awk to process the log file first, joining the id and version values and saving the result into an array, then read recon.txt and check each line against the array; if a line isn't there, save its content in a variable t and exit immediately (which runs the END block). In the END block, if there's a line saved in t, print it with a FAIL message and exit 2; otherwise just print the OK message with the number of lines in recon.txt:

awk 'NR==FNR{j=$9","$10;gsub(/[^0-9,]/, "", j);a[j]++;next}
!($0 in a){t=$0;exit}
END{if (t){print t, "FAIL"; exit 2}
else{print "All OK,", FNR, "checked"}}' logfile recon.txt

This assumes numerical values for the id and version, and that they're always in the 9th and 10th fields respectively. If that's not the case, you could use a regex like Sato does (that is, if your awk flavor supports backreferences); otherwise something like this would do:

NR==FNR{sub(/.* id=/, "");sub(/\| version=/, ",");sub(/\|.*/, "");a[$0]++;next}
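
For completeness, here is a sketch of that alternative first block slotted into the full command (same END logic as above):

awk 'NR==FNR{sub(/.* id=/, ""); sub(/\| version=/, ","); sub(/\|.*/, ""); a[$0]++; next}
!($0 in a){t=$0; exit}
END{if (t){print t, "FAIL"; exit 2}
else{print "All OK,", FNR, "checked"}}' logfile recon.txt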
don_crissti
  • 82,805
1

I'd put these files into a DB (any type you like). Perhaps don't even import the large file, but insert the lines one by one from your app which generates that output.

Then the DB engine will not waste resources calling grep each time, but will use an internal function compiled from C.

Don't forget to create indexes after the import (if you import the whole file) or after table creation (if you insert records row by row).
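
As a concrete illustration only (the answer leaves the DB choice open): a minimal sketch with sqlite3, where the table and column names are mine, and log_pairs.csv is assumed to hold the id,version pairs extracted from the logs (e.g. with the sed from the answer above):

sqlite3 recon.db <<'EOF'
CREATE TABLE recon(id INTEGER, version INTEGER);
CREATE TABLE log_pairs(id INTEGER, version INTEGER);
.mode csv
.import recon.txt recon
.import log_pairs.csv log_pairs
CREATE INDEX idx_log_pairs ON log_pairs(id, version);
-- recon.txt lines with no match anywhere in the logs
SELECT r.id, r.version
  FROM recon r LEFT JOIN log_pairs l
    ON l.id = r.id AND l.version = r.version
 WHERE l.id IS NULL;
EOF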

Putnik
  • 886
0

First convert recon.txt to a format that is fast to use for grep -F:

perl -pe 's/,/| version=/' recon.txt > recon.grep-F

Then your situation is as described on https://www.gnu.org/software/parallel/parallel_examples.html#example-grepping-n-lines-for-m-regular-expressions

Here the text is adapted for your problem:

The simplest solution to grep a big file for a lot of regexps is:

grep -f regexps.txt bigfile

Or if the regexps are fixed strings:

grep -F -f recon.grep-F worker.log

There are 2 limiting factors: CPU and disk I/O. CPU is easy to measure: If the grep takes >90% CPU (e.g. when running top), then the CPU is a limiting factor, and parallelization will speed this up. If not, then disk I/O is the limiting factor, and depending on the disk system it may be faster or slower to parallelize. The only way to know for certain is to measure.
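
One simple way to take that measurement (a sketch; watch top in another terminal while this runs):

time LC_ALL=C grep -F -f recon.grep-F worker.log > /dev/null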

If the CPU is the limiting factor parallelization should be done on the regexps:

cat recon.grep-F | parallel --pipe -L1000 --round-robin grep -F -f - worker.log

If a line matches multiple regexps, the line may be duplicated. The command will start one grep per CPU and read worker.log one time per CPU, but as that is done in parallel, all reads except the first will be cached in RAM. Depending on the size of recon.grep-F it may be faster to use --block 10m instead of -L1000. If recon.grep-F is too big to fit in RAM, remove --round-robin and adjust -L1000. This will cause worker.log to be read more times.
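
For instance, a sketch of the --block variant mentioned above (10m is the value from the manual, not a tuned number):

cat recon.grep-F | parallel --pipe --block 10m --round-robin grep -F -f - worker.log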

Some storage systems perform better when reading multiple chunks in parallel. This is true for some RAID systems and for some network file systems. To parallelize the reading of worker.log:

parallel --pipepart --block 100M -a worker.log -k grep -F -f recon.grep-F

This will split worker.log into 100MB chunks and run grep on each of these chunks. To parallelize both the reading of worker.log and recon.grep-F, combine the two using --fifo:

parallel --pipepart --block 100M -a worker.log --fifo cat recon.grep-F \
\| parallel --pipe -L1000 --round-robin grep -f - {}

If a line matches multiple regexps, the line may be duplicated.

Ole Tange
  • 35,514
0

You're performing the query "all from set A that are not in set B", in addition to transforming one set to be in the same format as the other. Most approaches to this require, at some level, nested loops.

In your version, you're using grep for the inner loop (testing every line in the logs), and bash for the outer one (every line in recon.txt). The bash loop spawns several subshells and processes each time round, which is slow. Each grep handles the inner loop internally, but still requires a new process each time.

The following line spawns three new processes (two subshells and col) each time round the outer loop:

VERSION=`echo $VERSION |col -bx`

And it's not clear what purpose it serves. If it is necessary, you should try to put the col outside the loop, so that it filters all of recon.txt in a single process. See also Why is using a shell loop to process text considered bad practice?.
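
For example, if the col -bx filtering really is needed, it could be applied once before the loop (a sketch; recon.clean.txt is an arbitrary name):

col -bx < recon.txt > recon.clean.txt
# ...then run the existing while/read loop with: done < recon.clean.txt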

Sato Katsura's answer avoids the nested loop by sorting both sets of input, storing them in temporary files, and then using comm. With the restriction that the inputs are sorted, comm doesn't have to use nested loops.

However, for large input files, the temporary files will also be large (recon.txt gets reordered, ie. output is the same size as its input; the log files are cut down to the same format and reordered). And there's the sort time itself - sort uses mergesort, which is at worst O(n log n).

If you want to avoid temporary files and sorting, you can let grep handle both loops:

cat worker-6715.log.2016-$1.log.* |
 sed 's/.*| id=\([0-9]*\)| version=\([0-9]*\)|.*/\1,\2/' |
 grep -F -x -v -f - recon.txt || ( echo -n "Lines OK: " ; wc -l < recon.txt)

As before, a single invocation of sed (copied from that answer) transforms the log format into the recon.txt format. Then grep -v reports only those lines of recon.txt that weren't found in the logs. -f - takes the reformatted logs from standard input (avoiding temporary files, but meaning that grep has to hold them all in memory). -F to match fixed strings rather than patterns, -x to match only whole lines.

As grep has been inverted to report non-matches, it will return "success" if there were any lines not found. So the || will only display the message and total count if all lines were found.

JigglyNaga
  • 7,886