Why is regex [0-9]{0,2} not greedy in sed?

Question

echo '123980925sriten34=ienat' | sed -e 's/^.*\?\([1-9][0-9]\{0,2\}\+\)\([%=+-]\).*/ \1 \2 /'

is giving the result:

 4 =

I am expecting:

 34 =

What am I not understanding?

(Oh and I even added the + and ? to make doubly sure, but afaik {0,2} should be greedy without them.)

perl -pe 's/^.*?([1-9][0-9]{0,2})([%=+-]).*/ $1 $2 /' is less annoying — user1133275, Jul 19 '19 at 23:59
Isn't it more to do with the fact that the preceding .* is greedy? — steeldriver, Jul 20 '19 at 00:00
... perhaps you're thinking that the following \? makes it non-greedy? — steeldriver, Jul 20 '19 at 00:49
wrt i even added the + and ? to make doubly sure - that will probably make doubly sure the regexp won't work. You can't just throw random characters into a regexp and hope they'll somehow improve it. Also \? and \+ are GNU sed only so if you aren't running GNU sed then they're going to be treated as literal chars - the POSIX equivalents are \{0,1\} and \{1,\} respectively. — Ed Morton, Jul 20 '19 at 05:08
\+ and \? aren't BRE, but even if they were, stacking the repetition specifiers (* and ? or {n,m} and +) isn't defined. — ilkkachu, Jul 20 '19 at 09:06

score 11 · Answer 1 · answered Jul 20 '19 at 01:37

The problem, as steeldriver states, isn’t that the [0-9]{0,2} is non-greedy; the problem is that the .*? before it is greedy. sed supports BRE and ERE, neither of which supports non-greedy matching. That’s a feature of PCREs. For example, the following commands:

$ echo 'aQbQc' | sed    's/.*\?Q/X/'
$ echo 'aQbQc' | sed    's/.*Q/X/'
$ echo 'aQbQc' | sed -r 's/.*?Q/X/'
$ echo 'aQbQc' | sed -r 's/.*Q/X/'

all output

Xc

(I’m not sure why it just ignores the ?.)

See Non-greedy match with SED regex (emulate perl's .*?).
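
To see how the match gets divided up in the question's command, one option is to put the leading .* in its own group and print every group in brackets. This is only a sketch: show_split is a made-up helper name, and the stray \? and \+ from the original command are dropped here, on the assumption (consistent with the output reported in the question) that they do not change what gets matched.

```shell
# Print each capture group in brackets to show how the match is split.
# show_split is a hypothetical helper name used only for this illustration.
show_split() {
    sed 's/^\(.*\)\([1-9][0-9]\{0,2\}\)\([%=+-]\).*/[\1] [\2] [\3]/'
}

echo '123980925sriten34=ienat' | show_split
# prints: [123980925sriten3] [4] [=]
```

The greedy .* takes everything it can, including the 3, and leaves the number group only the single digit it needs for the overall match to succeed.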

Your description of the function that you want to perform is skimpy, but I believe that I’ve reverse engineered it. You can get the desired effect by not matching the characters before the number you want to match until after you’ve found the number:

$ echo '123980925sriten34=ienat' | sed -e 's/\([1-9][0-9]\{0,2\}\+\)\([%=+-]\).*/! \1 \2 /' -e 's/.*!//'
 34 =

replacing the ! with any string known not to appear in the input data. If you have no such string, but you’re using GNU sed, you can use newline:

$ echo '123980925sriten34=ienat' | sed -e 's/\([1-9][0-9]\{0,2\}\+\)\([%=+-]\).*/\n \1 \2 /' -e 's/.*\n//'
 34 =

which, of course, cannot appear in any line.
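
As the comments on the question point out, \+ is a GNU extension and stacking it on \{0,2\} is not defined by POSIX. A possible portable variant of the first command simply drops the \+. This is a sketch under the same assumption as above, namely that ! never appears in the input; number_and_op is a hypothetical name.

```shell
# Same two-step idea, restricted to POSIX BRE syntax (no \+ or \?).
# number_and_op is a hypothetical name; '!' is assumed absent from the input.
number_and_op() {
    # Step 1: rewrite from the first number-plus-operator onward, marked with '!'.
    # Step 2: delete everything up to and including the marker.
    sed -e 's/\([1-9][0-9]\{0,2\}\)\([%=+-]\).*/! \1 \2 /' -e 's/.*!//'
}

echo '123980925sriten34=ienat' | number_and_op
# prints " 34 = " (with a trailing space)
```

Because sed picks the leftmost position at which the whole pattern can match, a run of more than three digits yields its last three: abc1234%x gives " 234 % ".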

wrt I’m not sure why it just ignores the ? - because ? after another repetition RE metachar (* in this case) is undefined behavior per POSIX, and since in other contexts it means zero-or-one, just ignoring it is as reasonable an approach as any. Essentially .*? should be treated as a bug in a regexp as it doesn't have any sensible meaning (zero-or-one repetitions of zero-or-more repetitions of any character - huh?). — Ed Morton, Jul 20 '19 at 05:01
On a GNU system, the repetition operators stack, ? after + (using ERE syntax) makes the whole thing optional (same as *), and {2,4}? matches 0, 2, 3 or 4 repetitions (try grep -Eoe 'ba{2,4}?b' against bb, bab, baab). *? is just the same as * since * can already match zero repetitions. The thing where ? makes a pattern non-greedy is a feature of Perl regexes. — ilkkachu, Jul 20 '19 at 09:03
