echo '123980925sriten34=ienat' | sed -e 's/^.*\?\([1-9][0-9]\{0,2\}\+\)\([%=+-]\).*/ \1 \2 /'

is giving the result:

 4 =

I am expecting:

 34 =

What am I not understanding?

(Oh and I even added the + and ? to make doubly sure, but afaik {0,2} should be greedy without them.)

  • perl -pe 's/^.*?([1-9][0-9]{0,2})([%=+-]).*/ $1 $2 /' is less annoying – user1133275 Jul 19 '19 at 23:59
  • Isn't it more to do with the fact that the preceding .* is greedy? – steeldriver Jul 20 '19 at 00:00
  • ... perhaps you're thinking that the following \? makes it non-greedy? – steeldriver Jul 20 '19 at 00:49
  • Regarding "I even added the + and ? to make doubly sure": that will probably make doubly sure the regexp won't work. You can't just throw random characters into a regexp and hope they'll somehow improve it. Also, \? and \+ are GNU sed only, so if you aren't running GNU sed they're going to be treated as literal characters. The POSIX equivalents are \{0,1\} and \{1,\} respectively. – Ed Morton Jul 20 '19 at 05:08
  • \+ and \? aren't BRE, but even if they were, stacking the repetition specifiers (* and ? or {n,m} and +) isn't defined. – ilkkachu Jul 20 '19 at 09:06
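
To illustrate the POSIX spellings mentioned in the comments above, here is a minimal sketch. The sample strings and variable names are made up for the demonstration; only the interval expressions \{1,\} and \{0,1\} come from the comment.

```shell
# \{1,\} is the POSIX BRE spelling of "one or more" (GNU sed's \+).
one_or_more=$(echo 'baaad' | sed 's/a\{1,\}/X/')
printf '%s\n' "$one_or_more"     # prints: bXd

# \{0,1\} is the POSIX BRE spelling of "zero or one" (GNU sed's \?).
zero_or_one=$(echo 'color colour' | sed 's/colou\{0,1\}r/X/g')
printf '%s\n' "$zero_or_one"     # prints: X X
```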

1 Answer


The problem, as steeldriver states, isn’t that the [0-9]\{0,2\} is non-greedy; the problem is that the .* before it is greedy, and the \? that follows it does not change that. sed supports BRE and ERE, neither of which supports non-greedy matching; that is a feature of Perl-compatible regular expressions (PCRE). For example, the following commands:

$ echo 'aQbQc' | sed    's/.*\?Q/X/'
$ echo 'aQbQc' | sed    's/.*Q/X/'
$ echo 'aQbQc' | sed -r 's/.*?Q/X/'
$ echo 'aQbQc' | sed -r 's/.*Q/X/'

all output

Xc

(I’m not sure why it just ignores the ?.)

See Non-greedy match with SED regex (emulate perl's .*?).
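
To see where the missing 3 goes, it can help to capture what the leading .* consumed. The following is only a diagnostic sketch: the extra capture group, the bracket labels, and the variable names are added for illustration, and the stacked \? and \+ from the question are left out so that the pattern is plain POSIX BRE.

```shell
input='123980925sriten34=ienat'

# Group 1 is the greedy .*, group 2 the number, group 3 the operator.
# The greedy .* takes as much as it can while still leaving one digit
# for [1-9], so the "3" ends up in group 1 rather than group 2.
split=$(printf '%s\n' "$input" |
    sed -e 's/^\(.*\)\([1-9][0-9]\{0,2\}\)\([%=+-]\).*/[\1] [\2] [\3]/')

printf '%s\n' "$split"    # prints: [123980925sriten3] [4] [=]
```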

Your description of the function that you want to perform is skimpy, but I believe that I’ve reverse engineered it.  You can get the desired effect by not matching the characters before the number you want to match until after you’ve found the number:

$ echo '123980925sriten34=ienat' | sed -e 's/\([1-9][0-9]\{0,2\}\)\([%=+-]\).*/! \1 \2 /' -e 's/.*!//'
 34 =

replacing the ! with any string known not to appear in the input data.  If you have no such string, but you’re using GNU sed, you can use newline:

$ echo '123980925sriten34=ienat' | sed -e 's/\([1-9][0-9]\{0,2\}\)\([%=+-]\).*/\n \1 \2 /' -e 's/.*\n//'
 34 =

which, of course, cannot appear in any line.
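
If GNU sed is not available, the same newline-marker idea can probably be written in a portable way, since POSIX sed accepts a backslash followed by a literal newline in the replacement text. The sketch below wraps it in a function; the name extract_num_op is hypothetical, and the stacked \+ is omitted because POSIX does not define it.

```shell
# extract_num_op is a hypothetical helper name, not part of the original answer.
# Step 1 inserts a newline in front of the wanted match (backslash-newline in
# the replacement is the POSIX spelling); step 2 deletes everything up to it.
extract_num_op() {
    sed -e 's/\([1-9][0-9]\{0,2\}\)\([%=+-]\).*/\
 \1 \2 /' -e 's/.*\n//'
}

result=$(echo '123980925sriten34=ienat' | extract_num_op)
printf '[%s]\n' "$result"    # prints: [ 34 = ]
```

Lines that contain no number followed by an operator pass through unchanged, because neither substitution matches them.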

  • Regarding "I’m not sure why it just ignores the ?": a ? after another repetition metacharacter (* in this case) is undefined behavior per POSIX, and since in other contexts it means zero-or-one, just ignoring it is as reasonable an approach as any. Essentially, .*? should be treated as a bug in a regexp, as it doesn't have any sensible meaning (zero-or-one repetitions of zero-or-more repetitions of any character, huh?). – Ed Morton Jul 20 '19 at 05:01
  • On a GNU system, the repetition operators stack: ? after + (using ERE syntax) makes the whole thing optional (same as *), and {2,4}? matches 0, 2, 3 or 4 repetitions (try grep -Eoe 'ba{2,4}?b' against bb, bab, baab). *? is just the same as *, since * can already match zero repetitions. The thing where ? makes a pattern non-greedy is a feature of Perl regexes. – ilkkachu Jul 20 '19 at 09:03