4

I ran the two commands below on some very large files:

grep -E 'string1|string2' 151103*.log|grep 'string3' | grep string4

awk '/string1|string2/ && /string3/ && /string4/' 151103*.log

Both took almost the same time to run, but awk showed me the matching lines much sooner. grep showed the same results, but only at the end, when the process completed.

Since both took the same time to complete, I just want to understand the logic behind how awk and grep each perform the search.

Why is awk faster? Do the two programs use different search logic? If I reorder the strings in the searches above, does that change the search speed?

Marco
  • 33,548
  • Since you are using extended regex in the first grep why not include the second and third match as well? That saves having to parse the output 3 times and will probably affect the results. – Bram Nov 04 '15 at 15:44
  • Not really a fair test. grep will probably be faster if all patterns are searched for in the same invocation. awk is a larger program, more code and more features, while grep is smaller in code size and only looking for text patterns in files. It would be interesting if you do that. – RobertL Nov 04 '15 at 15:50
  • If any of the existing answers solves your problem, please consider accepting it via the checkmark. Thank you! – Jeff Schaller Apr 25 '17 at 10:40

4 Answers

7

GNU grep buffers its output, but GNU awk does not. Even if you were using some other awk variant, it would likely still be line-buffered when printing to a terminal, and so would flush its output at each newline. Your grep, however, writes to a pipe, and so block-buffers its output anyway. If you have GNU grep, you can use grep --line-buffered ... | grep ... in your comparison to see results just as quickly. grep will likely beat awk at practically any matching test, especially GNU grep.
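A minimal sketch of that comparison, assuming GNU grep (--line-buffered is not a POSIX option). The temporary directory and the sample file contents are made up for illustration and merely stand in for the real 151103*.log files:

```shell
# Hypothetical sample data standing in for 151103*.log
tmpdir=$(mktemp -d)
printf '%s\n' \
  'string1 string3 string4' \
  'string2 string3' \
  'string2 string4 string3' \
  'string3 string4' > "$tmpdir/151103a.log"

# Every grep that writes into a pipe is asked to flush after each line,
# so a match can travel down the pipeline as soon as it is found.
# The last grep is left as-is: when its output goes to a terminal it is
# already line-buffered.
line_buffered=$(grep -E --line-buffered 'string1|string2' "$tmpdir"/151103*.log |
  grep --line-buffered 'string3' |
  grep string4)

printf '%s\n' "$line_buffered"
rm -rf "$tmpdir"
```

The selected lines are the same as without the option; only the moment at which they appear changes.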

Here is a sed to do what you want as well:

sed -ne'/string4/{/string3/s/string[12]/&/p;}' <in >out
mikeserv
  • 58,310
  • 1
    surely doing a replacement is just slowing you down: sed -ne'/string4/{/string3/{/string[12]/p;}}' – glenn jackman Nov 04 '15 at 16:27
  • 1
    @glennjackman - depends on what the sed does with it, I guess. your syntax needs a little -expression splitting portably, but is probably better: sed -ne'/string4/{/string3/{/string[12]/p;}' -e\} – mikeserv Nov 04 '15 at 16:33
  • Thanks for the sed option; I will have to compare this with awk and grep as well – Chittha Shetty Nov 04 '15 at 16:56
3

The grep pipeline cannot output anything until the final grep for string4 matches something, and it only gets its input after the previous pipe buffer fills up. See the related questions How big is the pipe buffer? and Turn off buffering in pipe.

Depending on the frequency of the strings in your input, you could see a difference in runtimes by putting the static searches first, to give the extended regexps less to look at.
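As a rough sketch of that reordering (the sample file is hypothetical, and using -F for the two fixed strings is my own assumption about what "static" means here), both orderings select the same lines; only the amount of work handed to the extended regexp differs:

```shell
# Hypothetical sample data standing in for 151103*.log
tmpdir=$(mktemp -d)
printf '%s\n' \
  'string1 only' \
  'string2 string3' \
  'string1 string3 string4' \
  'string4 string3 string2' > "$tmpdir/151103a.log"

# Original order: the alternation is tested against every input line.
original=$(grep -E 'string1|string2' "$tmpdir"/151103*.log | grep 'string3' | grep string4)

# Reordered: the fixed-string (-F) filters run first, so the extended
# regexp only sees lines that already contain string4 and string3.
reordered=$(grep -F string4 "$tmpdir"/151103*.log | grep -F string3 | grep -E 'string1|string2')

[ "$original" = "$reordered" ] && echo "same matches either way"
rm -rf "$tmpdir"
```

Whether the reordered pipeline is actually faster depends on how rare string4 and string3 are in the real logs, so it is worth timing both on your own data.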

Jeff Schaller
  • 67,283
  • 35
  • 116
  • 255
  • 1
    That's not actually quite right. The pipe buffer doesn't have to fill up - the final grep will get its input as soon as the grep before it produces some output. The problem is that many programs, when writing to something that's not a terminal (eg a pipe) will buffer their output internally before writing to the pipe, for efficiency. – psmears Nov 04 '15 at 17:56
  • @psmears, right... so use grep's --line-buffered option with each grep command (if you are using GNU grep). Another point worth mentioning is that the buffering actually improves efficiency in many cases (it allows later programs in the pipeline to read large chunks of input at a time rather than one line at a time) but appears slower because no output is displayed until the input has worked its way through all the buffers in the pipeline. – cas Nov 04 '15 at 23:38
  • @cas: Absolutely right - that's what I was referring to with my (rather cryptic) "for efficiency" comment - it's a tradeoff between "fastest time to first output" and "fastest time to completion" :-) – psmears Nov 05 '15 at 09:52
3

Your awk example is doing the entire regex search in one pass. For each line of input, if the first, second, and third regexes all match, the line will be printed, and you'll see the output essentially immediately (upon processing of the matching line).

Your grep example is using 3 different invocations of grep (one for each regex) to do the same thing, but the output of each invocation becomes the input for the next, which means that each needs to complete before the next has anything to process.

If you had a single 1000-line file and only line 5 matched all three regexes, the awk command would give you output after processing the 5th line, before processing the 6th line. Compare that to the piped grep statements. The 1st invocation of grep would find the 5th line and whatever other lines may match the 1st regex, and after processing the 1000th (final) line of input, its output becomes the input to the 2nd invocation of grep. The 2nd invocation of grep processes however many lines the 1st printed and outputs the lines that match both the 1st and 2nd regexes, which then become the input to the 3rd invocation of grep. As the 3rd invocation of grep processes each line, it will output any line that matches its regex.

You can compare the best and worst cases of grep for the above example. If none of the lines match any of the regexes except line 5, which matches all three, then the first grep processes 1000 lines, the second grep processes 1 line, and the third grep processes 1 line: the pipeline processes 1002 lines before it has any output (best case). If all the lines match the first two regexes but only line 5 matches the third, then the piped grep construction processes 1000 + 1000 + 5 = 2005 lines before it finds the match on the 5th line and has some output (it will continue to process the remaining 995 lines from the second grep's output, but you won't see any more output because nothing else will match).
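The best-case count can be checked with a small experiment. This is only a sketch: sample.log and its contents are invented, and the counts describe how many lines each stage reads, not how long anything takes:

```shell
tmpdir=$(mktemp -d)
# Hypothetical 1000-line file in which only line 5 matches all three patterns
awk 'BEGIN { for (i = 1; i <= 1000; i++) print (i == 5 ? "string1 string3 string4" : "filler " i) }' > "$tmpdir/sample.log"

# Lines each stage reads: the first grep reads the whole file, and each
# later grep reads whatever the stage before it printed.
stage1_in=$(grep -c '' "$tmpdir/sample.log")
stage2_in=$(grep -cE 'string1|string2' "$tmpdir/sample.log")
stage3_in=$(grep -E 'string1|string2' "$tmpdir/sample.log" | grep -c 'string3')
total=$((stage1_in + stage2_in + stage3_in))

echo "lines read: $stage1_in + $stage2_in + $stage3_in = $total"
rm -rf "$tmpdir"
```

Changing the filler lines so that they also contain string1 and string3 would reproduce the worst-case arrangement instead.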

Compare that to the awk command, which checks all three regex simultaneously for each line and gives you output after processing the 5th line. The difference will be exaggerated as you check more files simultaneously.

For instance, see whether output appears sooner if, rather than running the grep command on all files at once as you did above (in theory it should, but results may vary depending on how the hits are distributed throughout your files):

grep -E 'string1|string2' 151103*.log|grep 'string3' | grep string4

you instead run the series of grep commands on each file individually, like this:

for i in 151103*.log; do
  grep -E 'string1|string2' "$i" | grep 'string3' | grep string4
done

This still won't produce output as quickly as the awk statement, but you may see a difference.

  • @Brian Thanks for your explanation.

    But unlike awk, does grep have an AND operation?

    So I came up with this expression
    grep -E 'string1|string2' 151103*.log|grep 'string3' | grep string4

    I guess the command below is not the right way to achieve it. Is there an alternative in grep that processes input the same way awk does, so that it doesn't wait for the previous output?

    grep -E 'string1|string2 && string3 && string4' 151103*.log

    – Chittha Shetty Nov 05 '15 at 11:59
  • @ChitthaShetty - use the AST grep for &. – mikeserv Nov 05 '15 at 17:47
2

While grep, awk, and sed can be used for similar tasks, each has its strengths and weaknesses.

Awk works best for tabularised data or when you need to perform calculations etc.

Sed excels at replacing text.

Grep is best at selecting rows from input data, so I was expecting it to be faster than awk for this task. Perhaps that's what you'll see if you combine the 3 grep commands into one. Right now grep is at a disadvantage, since it needs to start 3 times and the second and third invocations need to wait for input from the first, which might explain why the result comes with a delay, although I'm not sure about that.
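One possible way to combine the three commands into a single grep, as a sketch only: plain grep -E has no AND operator, so this spells out one alternative per ordering of the three parts. The variable names and sample file are made up, and the pattern assumes the matches do not overlap within a line:

```shell
# Hypothetical sample data standing in for 151103*.log
tmpdir=$(mktemp -d)
printf '%s\n' \
  'string3 string1 string4' \
  'string4 string2 string3' \
  'string1 string3' \
  'string3 string4' > "$tmpdir/151103a.log"

a='(string1|string2)'; b='string3'; c='string4'
# One alternative per ordering of the three parts (3! = 6 orderings)
all_three="$a.*$b.*$c|$a.*$c.*$b|$b.*$a.*$c|$b.*$c.*$a|$c.*$a.*$b|$c.*$b.*$a"

# Single invocation versus the original three-stage pipeline
single=$(grep -E "$all_three" "$tmpdir"/151103*.log)
piped=$(grep -E 'string1|string2' "$tmpdir"/151103*.log | grep 'string3' | grep string4)

[ "$single" = "$piped" ] && echo "single grep selects the same lines"
rm -rf "$tmpdir"
```

The pattern grows quickly as conditions are added, which is one reason the awk form with && stays easier to read.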

Bram
  • 2,459
  • Thanks all for your input. Please share a link/article, if you have come across one, explaining how exactly these two commands go about searching for a string – Chittha Shetty Nov 04 '15 at 16:54