Confusion about difference in GNU/macOS grep output when using -o

Question

Why does BSD grep on macOS produce only the first word here:

$ echo "once upon a time" | grep -o "[a-z]*"
once

but all words here:

$ echo "once upon a time" | grep -o "[a-z][a-z]*"
once
upon
a
time

Or, using extended regular expressions:

$ echo "once upon a time" | grep -E -o "[a-z]*"
once

$ echo "once upon a time" | grep -E -o "[a-z]+"
once
upon
a
time

GNU grep will produce identical output for both [a-z]+ (or [a-z][a-z]*) and [a-z]*:

$ echo "once upon a time" | ggrep -E -o "[a-z]*"
once
upon
a
time

$ echo "once upon a time" | ggrep -E -o "[a-z]+"
once
upon
a
time

Hmm... there seems to be a difference between how BSD grep deals with [a-z]* and [a-z][a-z]* (using BREs here) when -o is used. The first will only return the first word, while the second will return each word separately. I'm testing on OpenBSD. GNU grep treats both expressions the same ("treats" = "gives identical results", they are quite different expressions). — Kusalananda, Mar 07 '18 at 20:09
Interestingly, adding -w, as in grep -ow '[a-z]*', returns each word, so it has something to do with the word boundaries. — Kusalananda, Mar 07 '18 at 20:14
Same on FreeBSD. See how echo " once upon a time" | grep -o "[a-z]*" returns nothing. I suspect it stops at the first empty match. GNU grep skips all the empty matches and ast-open's grep runs into infinite loops if there are any as it resumes the search for more at the same location in the text. — Stéphane Chazelas, Mar 07 '18 at 20:27
Even more strangely, echo " once upon a time" | grep -o "[a-z]*" doesn't actually return nothing - it returns a single empty line with macOS grep (BSD grep) 2.5.1-FreeBSD. echo "once upon a time"|grep -o -E '[a-z]*' returns two matches, "once" and an empty line. — Michael Homer, Mar 07 '18 at 20:33
It's as if the FreeBSD grep stopped at the first empty match. (I tested the same ancient grep on Mac.) Which is sort of logical, since if you can find one such, you can usually find many others, and printing them all (with -o) would just be silly. Now, someone just needs to go look at the source code to see if it does that on purpose, or by accident... — ilkkachu, Mar 07 '18 at 20:40
My comment above about -w refers to OpenBSD grep, which seems to be different from both FreeBSD and macOS grep (which I believe share an ancestor). OpenBSD grep does not produce empty lines with any of the examples mentioned in comments (by @MichaelHomer). — Kusalananda, Mar 07 '18 at 20:41
Counter-question: Does a pattern that matches the empty string make any sense with grep at all? Without -o, such a pattern matches each and every input line. With -o, should all the empty matches be printed? If empty matches are ignored (like GNU grep does), then a successful grep -o can produce empty output, which is sort of confusing if you think about it. — ilkkachu, Mar 07 '18 at 20:42

Kusalananda · Accepted Answer · 2018-03-07T21:55:41.270

Collecting the thoughts from the comment section, it seems this comes down to how different grep implementations have decided to deal with empty matches: the expression [a-z]* also matches the empty string.
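
That [a-z]* can match the empty string is easy to check without -o. A minimal sketch, which should work with any POSIX grep; the input line 123 is only an illustrative choice that contains no lowercase letters:

```shell
# This line has no lowercase letters, yet grep still selects it:
# the only way [a-z]* can match here is by matching the empty string.
printf '123\n' | grep '[a-z]*'
# prints: 123
```

The line is printed even though there is nothing for [a-z] to consume, so the match that selected it must have been empty.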

The -o option is not defined by POSIX, so how an implementation deals with it is left to the developers.

GNU grep evidently discards empty matches, for example the match of the empty string after once when using [a-z]*, and continues to process the input from the next character onwards.

BSD grep seems to hit the empty match, decide that, for whatever reason, that is enough, and stop there.

Stéphane mentions that the ast-open version of grep actually goes into an infinite loop at the empty match of [a-z]* after once and doesn't get past that point in the string.

OpenBSD grep seems to be different from macOS and FreeBSD grep in that adding the -w flag (which requires the matches to be delimited by word boundaries) makes [a-z]* return each word separately.

ilkkachu makes the observation that -o with a pattern that can match the empty string is in some sense confusing (or at least ambiguous). Should all empty matches be printed? There are in fact infinitely many such matches after each word in the given string.

The OpenBSD source for grep (which exhibits the same behaviour as grep on macOS for the examples in the question) contains the following (src/usr.bin/grep/util.c):

                if (r == 0) {
                        c = 1;
                        if (oflag && pmatch.rm_so != pmatch.rm_eo)
                                goto print;
                        break;
                }
        }
        if (oflag)
                return c;
print:

In other words: when the pattern matches (r == 0) and -o is in effect (oflag), the match is printed only if it is non-empty (pmatch.rm_so != pmatch.rm_eo). If the match start offset equals the match end offset (pmatch.rm_so == pmatch.rm_eo, i.e. an empty match), nothing is printed and the matching on this particular line of input ends (return c with c == 1 for "match found").
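
Since the bailout only happens on an empty match, one way to sidestep the difference is to use a pattern that cannot match the empty string. A small sketch based on the commands from the question; going by the output shown there, it should behave the same with GNU grep and with the BSD grep on macOS, although -o itself remains outside POSIX:

```shell
# Require at least one letter per match, so no empty match can occur.
# BRE spelling:
printf 'once upon a time\n' | grep -o '[a-z][a-z]*'

# ERE spelling of the same idea:
printf 'once upon a time\n' | grep -E -o '[a-z]+'
```

Both commands print once, upon, a and time on separate lines, as in the question.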

See how echo once upon a time | sed 's/[a-z]*/<&>/g' does it, and how no empty match is found after once. In echo once--upon | sed 's/[a-z]*/<&>/g', an empty match is found in between the two -s. A grep -o implementation could choose to behave like sed here, but I don't know of any that does, though you can implement it as perl -lne 'print for /pattern/' — Stéphane Chazelas, Mar 07 '18 at 21:52
You might want to look over https://unix.stackexchange.com/a/398249/5132 and the associated bug reports, too. — JdeBP, Mar 07 '18 at 22:04
@JdeBP The fix commit confirms that a bailout is done on empty matches when -o is used. This is probably incorporated into macOS grep as well (either from FreeBSD or separately). — Kusalananda, Mar 07 '18 at 22:12
Actually perl -lpe 's/[a-z]*/<$&>/g' behaves differently from sed 's/[a-z]*/<&>/g' here and does find an empty match after once. — Stéphane Chazelas, Mar 07 '18 at 22:13
Also interesting: echo once upon a time | perl -lne '@_= /[a-z]*(?{print pos()})/g' — Stéphane Chazelas, Mar 07 '18 at 22:20
@JdeBP The bailout for empty matches with -o was implemented in OpenBSD grep in 2012: https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/grep/util.c#rev1.44 — Kusalananda, Mar 07 '18 at 22:25
@StéphaneChazelas Empty match not only after once: <once><> <upon><> <a><> <time><>. — Kusalananda, Mar 07 '18 at 22:27
