4

I can't seem to get the non-greedy regex to work.

Sample String:

ABCD,E《F》、《GH》、《XYIJ》、《KL》、《MN》。

Regex Code:

《.+?IJ》

Search Result:

[Image: regex search result]

Because .+? is a non-greedy quantifier, I was expecting the match to be confined to the nearest 《》 pair, as shown below.

The present output is no different from searching this:

《.+IJ》

Which essentially makes the non-greedy code redundant.

Expected Result:

[Image: expected result]

How can I get the non-greedy quantifier to work as expected?

Drew
Sati
  • "The present output is no different from searching this: `《.+IJ》`" -- that is because you have exactly *one* instance of `IJ》` in the line. Add additional instances, and you will observe how greedy vs non-greedy quantifiers work. – phils Jan 02 '20 at 01:07
  • If it helps, note that the `《` has *already* been matched before you begin matching the non-greedy expression. The latter says to find the smallest match for `.+` to get from the existing `《` to an instance of `IJ》`. If it can match the entire expression from that initial `《` then it's a good match. – phils Jan 02 '20 at 01:25

2 Answers

5

The problem is not about using non-greedy matching. It is about which chars you're matching. Specifically, you want to match 《 followed by one or more non-《 chars, followed by IJ》.

This is a regexp that finds that: 《[^《]+IJ》.
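As a rough illustration of the difference, here is a minimal sketch using Python's `re` module rather than Emacs (an assumption made only so the example is easy to run; these two patterns are spelled the same way in both syntaxes). It applies the lazy pattern and the negated-class pattern to the sample string from the question.

```python
import re

# Sample string from the question.
text = "ABCD,E《F》、《GH》、《XYIJ》、《KL》、《MN》。"

# Lazy quantifier: the match still begins at the leftmost 《 in the text.
lazy = re.search(r"《.+?IJ》", text).group()

# Negated character class: the match cannot run across another 《.
bounded = re.search(r"《[^《]+IJ》", text).group()

print(lazy)     # 《F》、《GH》、《XYIJ》
print(bounded)  # 《XYIJ》
```

The negated class works because the first two 《 candidates fail outright (no IJ》 can be reached without crossing another 《), so the search moves on to the third one.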

This is a common technique for searching text that has paired delimiters. For example, you do the same thing when searching for a quoted string: "[^"]*".


Or if you also want to deal with char escaping, "\([^"\]\|\\\(.\|\n\)\)*".

That matches either any char other than " or a backslash ([^"\]), or (\|) a backslash (\\) followed by any character. The backslash is excluded from the first alternative so that an escaped quote such as \" can be matched only by the second alternative. The "any character" part is (\(.\|\n\)), that is, either any char except newline (.) or a newline char (\n). The final * says match zero or more such things.
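The same idea can be sketched in Python's `re` syntax, which is an assumption of this example and differs from Emacs: grouping is written `(?:...)` and alternation `|`, without backslashes. The first alternative here excludes the backslash as well as the quote, so an escaped quote can only be consumed by the second alternative. The sample text is made up for illustration.

```python
import re

# Opening quote, then zero or more of:
#   - any char other than a quote or a backslash, or
#   - a backslash followed by any char (including a newline),
# then the closing quote.
string_re = re.compile(r'"(?:[^"\\]|\\(?:.|\n))*"')

# Hypothetical input containing escaped quotes inside a quoted string.
sample = r'say "a \"quoted\" word" now'

m = string_re.search(sample)
print(m.group())  # "a \"quoted\" word"
```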

(And don't forget that if you use a regexp in a Lisp string you need to double the backslashes - not shown here. See Syntax for Strings.)

Drew
  • Your answer and explanation are clear and concise. But I find the character-escaping example totally confusing. I would appreciate it if you could say more about how this regex is constructed and why. – Sati Jan 02 '20 at 05:30
  • I added some explanation. Hope it helps. – Drew Jan 02 '20 at 05:40
2

I will give the same answer as @Drew, but phrased a little differently.

Your expression 《.+?IJ》 will match the first 《, then will match the minimum number of characters (because of the ?) needed to reach the IJ》 sequence.

You cannot use the "non-greediness" of the ? to un-match the first matching 《 in order to find a later one.
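To see where greedy and non-greedy matching actually differ, here is a small sketch in Python's `re` module (used here as a stand-in for Emacs, which is an assumption of this example). The input is a made-up string with two IJ》 endings. Both patterns start at the first 《, and they differ only in where they stop.

```python
import re

# Hypothetical sample with two IJ》 endings, so the two quantifiers can differ.
text = "《AB》、《CIJ》、《DIJ》"

lazy = re.search(r"《.+?IJ》", text)    # stops at the first IJ》 it can reach
greedy = re.search(r"《.+IJ》", text)   # extends to the last IJ》

print(lazy.group())    # 《AB》、《CIJ》
print(greedy.group())  # 《AB》、《CIJ》、《DIJ》

# Both matches begin at the leftmost 《, at index 0.
print(lazy.start(), greedy.start())  # 0 0
```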

An expression you can use to do what you want is what Drew said:

《[^《]+IJ》

This will match a 《, then at least one non-《 character, then IJ, then the closing 》. For this problem, you don't need a non-greedy modifier.