Non-greedy regex group

Question

This is related to my previous question over here.

(replace-regexp "\\(\\[\\[\\)\\(zotero:.+\\]\\)\\(\\[.+，[《〈][^《〈]+[》〉]\\)?.+。\\(\\]\\]\\)" "\\1cite-\\2\\3\\4，頁")

I'd like to use the above code to shorten Chinese citation strings to author，title，page format.

Sample Data

Original String:

[[zotero://select/items/1_P36Y9V2B][周東平，〈《晉書·刑法志》校注舉隅〉，《中國古代法律文獻研究》，2016年，00期。]]

Expected Result (shortened string):

[[cite-zotero://select/items/1_P36Y9V2B][周東平，〈《晉書·刑法志》校注舉隅〉]]，頁

The code snippet [《〈][^《〈]+[》〉] correctly restricts the search to a single title element between ，s.

When joined with .+，, however, it is not non-greedy and ends up picking up both the article title (between 〈〉s) and the journal title (between 《》s) at the same time.

This results in a shortened citation which is lengthier than expected.

[[cite-zotero://select/items/1_P36Y9V2B][周東平，〈《晉書·刑法志》校注舉隅〉，《中國古代法律文獻研究》]]，頁

How do we make \\3 non-greedy, as expected?

Update:

I just realized this could be an issue specific to the data string listed above, i.e.:

周東平，〈《晉書·刑法志》校注舉隅〉，《中國古代法律文獻研究》，2016年，00期。

Somehow, the 《晉書·刑法志》 nested in 〈《晉書·刑法志》校注舉隅〉 complicates matters.

Other simpler but similarly structured data, such as this, worked as expected:

[[zotero://select/items/1_97XVI66Q][金春峯，〈周官之社會行政組織〉，收入《周官之成書及其反映的文化與時代新考》台北：東大，1993年。]]

Result:

[[cite-zotero://select/items/1_97XVI66Q][金春峯，〈周官之社會行政組織〉]]，頁

Oh - I get what you mean. Question has been edited to reflect the actual string behind the org-mode link display that is to be processed. — Sati, Mar 27 '20 at 07:44

phils · Accepted Answer · 2020-03-28T04:54:38.320

For starters, the non-greedy quantifiers are ??, +?, and *?, so you haven't specified any non-greedy matching in that regexp.
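As a small illustration of the difference, here is a sketch you can evaluate in the scratch buffer. The helper `my/first-match` and the two-title test string are made up for this example and are not part of Emacs or of your data.

```elisp
;; Hypothetical helper (not part of Emacs): return the text of the first
;; match of REGEXP in TEXT, or nil if there is no match.
(defun my/first-match (regexp text)
  (when (string-match regexp text)
    (match-string 0 text)))

;; Greedy: `.+' runs on to the last 〉 it can reach.
(my/first-match "〈.+〉" "〈甲〉，〈乙〉")   ; => "〈甲〉，〈乙〉"

;; Non-greedy: `.+?' stops at the first 〉.
(my/first-match "〈.+?〉" "〈甲〉，〈乙〉")  ; => "〈甲〉"
```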

I strongly recommend using M-x re-builder to visualise what is going on (it will show each group in a different colour).

Looking at part of your regexp, this:

，[《〈][^《〈]+[》〉]

matches a fullwidth comma (，), followed by either 《 or 〈, followed by one or more characters that are neither 《 nor 〈, and finally either 》 or 〉.

In the case of:

，〈《晉書·刑法志》校注舉隅〉，《中國古代法律文獻研究》

The pattern does not match the start of that string because after ，〈 we have a character which fails to match [^《〈] (because it is one of those two characters). Consequently that part of the regexp does not match the text until the second comma.

This 'works' overall because you had .+ preceding that in the pattern, so that simply extends to match everything up to that second comma. If you change that .+ into \\(.+\\) in re-builder, you'll see what it is matching.
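If you prefer to check this outside re-builder, the following sketch runs that fragment on its own against the bracketed description from your sample. The names `my/original-fragment`, `my/sample-text`, `my/fragment-match` and `my/leading-text` are hypothetical and exist only for this illustration.

```elisp
;; The title fragment from the original regexp, taken on its own.
(defconst my/original-fragment "，[《〈][^《〈]+[》〉]")

;; The description part of the sample link, without the brackets.
(defconst my/sample-text
  "周東平，〈《晉書·刑法志》校注舉隅〉，《中國古代法律文獻研究》，2016年，00期。")

(defun my/fragment-match (text)
  "Return what `my/original-fragment' matches first in TEXT."
  (when (string-match my/original-fragment text)
    (match-string 0 text)))

(defun my/leading-text (text)
  "Return what a grouped `.+' in front of the fragment matches in TEXT."
  (when (string-match (concat "\\(.+\\)" my/original-fragment) text)
    (match-string 1 text)))

;; The fragment cannot match at the first comma, so it matches at the second.
(my/fragment-match my/sample-text)  ; => "，《中國古代法律文獻研究》"

;; The preceding `.+' therefore swallows the author and the article title.
(my/leading-text my/sample-text)    ; => "周東平，〈《晉書·刑法志》校注舉隅〉"
```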

Try the following in re-builder:

"\\(\\[\\[\\)\\(zotero:.+\\]\\)\\(\\[\\(.+?\\)，\\(《.+?》\\|〈.+?〉\\)\\)?\\(.+\\)。\\(\\]\\]\\)"

I've added several extra groups so you can see exactly what each bit matches.

I recommend that you spend a bit of time experimenting with that.

The key difference to your original pattern is this:

\\(《.+?》\\|〈.+?〉\\)

in place of:

[《〈][^《〈]+[》〉]

This approach works for the examples you have shown.
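One thing to watch if you use that pattern for the actual replacement: the extra groups shift the numbering, so the closing ]] is group 7 rather than group 4. Below is a rough sketch of the whole replacement as a function on strings, using `replace-regexp-in-string` rather than the interactive `replace-regexp`. The names `my/zotero-cite-regexp` and `my/shorten-citation` are my own, and the replacement string is my adaptation of yours to the new group numbers.

```elisp
;; The regexp suggested above, split up so that each group can be labelled.
(defconst my/zotero-cite-regexp
  (concat "\\(\\[\\[\\)"               ; 1: opening [[
          "\\(zotero:.+\\]\\)"         ; 2: link target and its closing ]
          "\\(\\[\\(.+?\\)，"           ; 3 opens: [author，title   4: author
          "\\(《.+?》\\|〈.+?〉\\)\\)?"   ; 5: first title; group 3 closes here
          "\\(.+\\)。"                 ; 6: the rest, up to the last 。
          "\\(\\]\\]\\)"))             ; 7: closing ]]

(defun my/shorten-citation (link)
  "Return LINK shortened to author，title form (hypothetical helper)."
  ;; FIXEDCASE is t so the replacement is inserted without case conversion.
  (replace-regexp-in-string my/zotero-cite-regexp
                            "\\1cite-\\2\\3\\7，頁"
                            link t))

(my/shorten-citation
 "[[zotero://select/items/1_P36Y9V2B][周東平，〈《晉書·刑法志》校注舉隅〉，《中國古代法律文獻研究》，2016年，00期。]]")
;; => "[[cite-zotero://select/items/1_P36Y9V2B][周東平，〈《晉書·刑法志》校注舉隅〉]]，頁"
```

If you would rather keep your original replacement string, turning the extra groups into shy groups, \\(?: ... \\), should leave the numbering as it was.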

If you have any titles containing newlines then you would need to use \\(?:.\\|\n\\)+? instead of .+? because . does not match newlines. Or you could alternatively use:

\\(《[^》]+》\\|〈[^〉]+〉\\)

(Which isn't exactly the same thing, but would most likely be equivalent for your purposes, and is probably what you were meaning to express in the first instance.)
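To see one way in which the variants differ, here is a quick sketch against a made-up title that contains a newline. The constant `my/two-line-title` is hypothetical and not taken from your data. In Emacs regexps a negated character class such as [^〉] also matches a newline, whereas . does not.

```elisp
;; Made-up title spanning two lines.
(defconst my/two-line-title "〈第一行\n第二行〉")

;; `.' stops at the newline, so the non-greedy form finds nothing.
(string-match-p "〈.+?〉" my/two-line-title)              ; => nil

;; Allowing newlines explicitly makes it match from position 0.
(string-match-p "〈\\(?:.\\|\n\\)+?〉" my/two-line-title)  ; => 0

;; The negated class also crosses the newline.
(string-match-p "〈[^〉]+〉" my/two-line-title)            ; => 0
```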
