Why is egrep [wW][oO][rR][dD] faster than grep -i word?

Question

I've been using grep -i more often and I found out that it is slower than its egrep equivalent, where I match against the upper or lower case of each letter:

$ time grep -iq "thats" testfile

real    0m0.041s
user    0m0.038s
sys     0m0.003s
$ time egrep -q "[tT][hH][aA][tT][sS]" testfile

real    0m0.010s
user    0m0.003s
sys     0m0.006s

Does grep -i do additional tests that egrep doesn't?

Try the greps the other way around, to make sure you're not measuring the difference between disk caching of the file. – EightBitTony, Mar 14 '16 at 00:26
I have grep'd the file prior to testing, so it is cached. Almost same times if done in reverse order. – tildearrow, Mar 14 '16 at 00:28
This can depend on the locale: some locales involve complex calculations to account for case insensitivity. GNU grep is particularly slow in many situations involving Unicode. What locale settings did you use? Under what Unix variant? What is the content of your test file? – Gilles 'SO- stop being evil', Mar 14 '16 at 00:35
@Gilles looks good, repeating each test here 100 times (timing the entire thing), egrep is faster than grep until I set LANG=C and then they're both roughly the same. – EightBitTony, Mar 14 '16 at 00:39
en_US.UTF-8, Linux server 3.19.0-42-lowlatency #48~14.04.1-Ubuntu SMP PREEMPT Fri Dec 18 11:34:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux. That should be the answer, I guess. – tildearrow, Mar 14 '16 at 00:42
Try the first once again - I bet it's faster the next time around. The file is probably still in cache when you run the 2nd command. – Baard Kopperud, Mar 14 '16 at 13:05
So you have demonstrated to yourself that the original premise/assertion leading to the question is basically incorrect, right? egrep [wW][oO][rR][dD] is not faster than grep -i word under the same locale and given the file is in the buffer-cache. – arielf, Mar 14 '16 at 20:36
@EightBitTony Look at user time (which does not include time waiting for disk). There is an order of magnitude in difference. – kasperd, Mar 15 '16 at 11:17
This slow-down issue with grep in UTF-8 locales has been pretty widely reported over the years. Many other GNU text-scanning programs, in particular sort, suffer from the same issue. See also http://stackoverflow.com/questions/13819635 – Adrian Pronk, Mar 19 '16 at 02:13

score 70 · Accepted Answer · edited Mar 15 '16 at 11:34

grep -i 'a' is equivalent to grep '[Aa]' in an ASCII-only locale. In a Unicode locale, character equivalences and conversions can be complex, so grep may have to do extra work to determine which characters are equivalent. The relevant locale setting is LC_CTYPE, which determines how bytes are interpreted as characters.
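The equivalence in an ASCII-only locale can be checked on a small file. The following is a minimal sketch, not part of the original answer: the sample file and its contents are made up for illustration, and grep -E is used in place of egrep (the two accept the same syntax).

```shell
# Hypothetical sample file; name and contents are invented for this sketch.
sample=$(mktemp)
printf '%s\n' 'Thats one' 'THATS two' 'nothing here' 'thats three' > "$sample"

# With LC_ALL=C every byte is one character, so -i only has to fold A-Z/a-z.
ci_matches=$(LC_ALL=C grep -ic 'thats' "$sample")
br_matches=$(LC_ALL=C grep -Ec '[tT][hH][aA][tT][sS]' "$sample")

# Both counts should agree (3 matching lines for this sample).
echo "grep -i: $ci_matches, bracket expression: $br_matches"
rm -f "$sample"
```

Running the same two commands without LC_ALL=C shows what the current locale does; on ASCII-only input like this the counts should still agree, and only the amount of work grep does may differ.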

In my experience, GNU grep can be slow when invoked in a UTF-8 locale. If you know that you're searching for ASCII characters only, invoking it in an ASCII-only locale may be faster. I expect that

time LC_ALL=C grep -iq "thats" testfile
time LC_ALL=C egrep -q "[tT][hH][aA][tT][sS]" testfile

would produce indistinguishable timings.

That being said, I can't reproduce your finding with GNU grep on Debian jessie (but you didn't specify your test file). If I set an ASCII locale (LC_ALL=C), grep -i is faster. The effect depends on the exact pattern: for example, a pattern with repeated characters reduces performance (which is to be expected).
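To compare the two commands under both locales on your own machine, a rough benchmark along these lines may help. It is only a sketch under several assumptions: the bench helper and the generated test file are hypothetical, date +%s%N (nanoseconds) is a GNU date extension, and en_US.UTF-8 must be installed, otherwise grep silently falls back to the C locale and the two rows will look alike.

```shell
# Hypothetical test file: about 9 MB of ASCII text that never matches,
# so grep -q cannot exit early and has to scan the whole file.
testfile=$(mktemp)
yes 'the quick brown fox jumps over the lazy dog' | head -n 200000 > "$testfile"

# bench LOCALE RUNS COMMAND [ARGS...]
# Runs COMMAND RUNS times under LC_ALL=LOCALE and prints total milliseconds.
bench() {
    bench_loc=$1
    bench_runs=$2
    shift 2
    bench_i=0
    bench_start=$(date +%s%N)          # GNU date: nanoseconds since the epoch
    while [ "$bench_i" -lt "$bench_runs" ]; do
        LC_ALL=$bench_loc "$@" > /dev/null
        bench_i=$((bench_i + 1))
    done
    bench_end=$(date +%s%N)
    echo $(( (bench_end - bench_start) / 1000000 ))
}

for loc in C en_US.UTF-8; do
    printf '%-12s grep -i: %s ms, bracket: %s ms\n' "$loc" \
        "$(bench "$loc" 10 grep -iq 'thats' "$testfile")" \
        "$(bench "$loc" 10 grep -Eq '[tT][hH][aA][tT][sS]' "$testfile")"
done
rm -f "$testfile"
```

The totals include process start-up, so they are only meaningful relative to each other. Results will vary with the grep version, the pattern and the file contents.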

answered Mar 14 '16 at 01:02 by Gilles 'SO- stop being evil'

The author uses Ubuntu 14.04 which ships with grep 2.10. The speed of case-insensitive matches (-i) with multibyte locales should have improved in 2.17. – Lekensteyn Mar 15 '16 at 12:24
@Lekensteyn Good to know, thanks. Ubuntu 14.04 actually comes with grep 2.16, but that's pre-2.17 too; I tested with grep 2.20, which explains why I didn't see the same slowdown. – Gilles 'SO- stop being evil' Mar 15 '16 at 12:32
Right, I was looking at the wrong LTS release, Ubuntu 12.04 ships with grep 2.10 while Ubuntu 14.04 includes grep 2.16. – Lekensteyn Mar 15 '16 at 12:49
I'm quite certain that grep -i 'a' is equivalent to grep '[Aa]' in any locale. The proper example is grep -i 'i', which is either grep '[Ii]' or grep '[İi]' (uppercase I with dot above, U+0130, Turkish locale). However, there is no efficient way for grep to find this equivalence class given a locale. – MSalters Mar 15 '16 at 13:13

score 15 · Answer 2 · edited Apr 13 '17 at 12:36

Out of curiosity, I tested this on an Arch Linux system:

$ uname -r
4.4.5-1-ARCH
$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.9G  720K  3.9G   1% /tmp
$ dd if=/dev/urandom bs=1M count=1K | base64 > foo
$ df -h .                                         
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.9G  1.4G  2.6G  35% /tmp
$ for i in {1..100}; do /usr/bin/time -f '%e' -ao grep.log grep -iq foobar foo; done
$ for i in {1..100}; do /usr/bin/time -f '%e' -ao egrep.log egrep -q '[fF][oO][oO][bB][aA][rR]' foo; done

$ grep --version
grep (GNU grep) 2.23
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

And then some stats, courtesy of "Is there a way to get the min, max, median, and average of a list of numbers in a single command?":

$ R -q -e "x <- read.csv('grep.log', header = F); summary(x); sd(x[ , 1])"
> x <- read.csv('grep.log', header = F); summary(x); sd(x[ , 1])
       V1       
 Min.   :1.330  
 1st Qu.:1.347  
 Median :1.360  
 Mean   :1.362  
 3rd Qu.:1.370  
 Max.   :1.440  
[1] 0.02322725
> 
> 
$ R -q -e "x <- read.csv('egrep.log', header = F); summary(x); sd(x[ , 1])"
> x <- read.csv('egrep.log', header = F); summary(x); sd(x[ , 1])
       V1       
 Min.   :1.330  
 1st Qu.:1.340  
 Median :1.360  
 Mean   :1.365  
 3rd Qu.:1.380  
 Max.   :1.430  
[1] 0.02320288
> 
>

I'm on the en_GB.utf8 locale, but the times are nearly indistinguishable.
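If R is not available, roughly the same summary can be obtained with sort and awk. This is a sketch, not what was used for the numbers above: it assumes a log with one elapsed time per line, as written by /usr/bin/time -f '%e' -ao, and uses a small made-up log in place of grep.log.

```shell
# Hypothetical stand-in for grep.log: one elapsed time (seconds) per line.
log=$(mktemp)
printf '%s\n' 1.33 1.36 1.44 1.35 1.37 > "$log"

# Sort numerically, then compute min, median, mean and max in one awk pass.
# LC_ALL=C keeps "." as the decimal separator regardless of the user's locale.
summary=$(LC_ALL=C sort -n "$log" | LC_ALL=C awk '
    { v[NR] = $1; sum += $1 }
    END {
        if (NR == 0) exit 1
        # Median: the middle value, or the mean of the two middle values.
        median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
        printf "min=%.3f median=%.3f mean=%.3f max=%.3f\n", v[1], median, sum / NR, v[NR]
    }')

echo "$summary"   # prints: min=1.330 median=1.360 mean=1.370 max=1.440
rm -f "$log"
```

Replacing "$log" with grep.log or egrep.log gives the per-command figures. Quartiles and the standard deviation are left out to keep the sketch short.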
