
This is related to my previous question over here.

(replace-regexp "\\(\\[\\[\\)\\(zotero:.+\\]\\)\\(\\[.+,[《〈][^《〈]+[》〉]\\)?.+。\\(\\]\\]\\)" "\\1cite-\\2\\3\\4,頁")

I'd like to use the above code to shorten Chinese citation strings to author,title,page format.


Sample Data

Original String:

[[zotero://select/items/1_P36Y9V2B][周東平,〈《晉書·刑法志》校注舉隅〉,《中國古代法律文獻研究》,2016年,00期。]]

Expected Result (shortened string):

[[cite-zotero://select/items/1_P36Y9V2B][周東平,〈《晉書·刑法志》校注舉隅〉]],頁


The code snippet [《〈][^《]+[》〉] correctly restricts the search to a single title element between one pair of title brackets.

When joined with .+,, however, it fails to be non-greedy and ends up picking up both article title (between 〈〉s) and journal title (between 《》s) at the same time.

This results in a shortened citation which is lengthier than expected.

[[cite-zotero://select/items/1_P36Y9V2B][周東平,〈《晉書·刑法志》校注舉隅〉,《中國古代法律文獻研究》]],頁

How do we make \\3 non-greedy, as expected?


Update:

I just realized this could be an issue specific to the data string listed above, i.e.:

周東平,〈《晉書·刑法志》校注舉隅〉,《中國古代法律文獻研究》,2016年,00期。

Somehow, the 《晉書·刑法志》 nested in 〈《晉書·刑法志》校注舉隅〉 complicated matters.

Other simpler, but similarly structured data such as this worked as expected:

[[zotero://select/items/1_97XVI66Q][金春峯,〈周官之社會行政組織〉,收入《周官之成書及其反映的文化與時代新考》台北:東大,1993年。]]

Result:

[[cite-zotero://select/items/1_97XVI66Q][金春峯,〈周官之社會行政組織〉]],頁

Sati
  • Both have already been listed above in the original post. – Sati Mar 27 '20 at 03:33
  • Oh - I get what you mean. Question has been edited to reflect the actual string behind the org-mode link display that is to be processed. – Sati Mar 27 '20 at 07:44

1 Answer


For starters, the non-greedy quantifiers are ??, +?, and *?, and so you haven't specified any non-greedy matching in that regexp.
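
To make the greedy/non-greedy distinction concrete, here is a minimal sketch you can evaluate in *scratch*. The test string and the helper name my/first-match are made up for illustration and are not taken from your data.

```elisp
;; Illustrative string (an assumption, not from the question):
;; two bracketed titles separated by a comma.
(defvar my/demo-titles "〈甲〉,〈乙〉")

(defun my/first-match (regexp string)
  "Return the text of the first match of REGEXP in STRING, or nil."
  (when (string-match regexp string)
    (match-string 0 string)))

;; Greedy: .+ grabs as much as it can, so the match runs to the LAST 〉.
(my/first-match "〈.+〉" my/demo-titles)   ; ⇒ "〈甲〉,〈乙〉"

;; Non-greedy: .+? stops at the FIRST 〉 that lets the match succeed.
(my/first-match "〈.+?〉" my/demo-titles)  ; ⇒ "〈甲〉"
```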


I strongly recommend using M-x re-builder to visualise what is going on (it will show each group in a different colour).


Looking at part of your regexp, this:

, [《〈][^《〈]+[》〉]

matches a comma and a space, followed by either 《 or 〈, followed by one or more characters not in that set, and finally either 》 or 〉.

In the case of:

, 〈《晉書·刑法志》校注舉隅〉,《中國古代法律文獻研究》

The pattern does not match the start of that string because after , 〈 we have a character which fails to match [^《〈] (because it is one of those two characters). Consequently that part of the regexp does not match the text until the second comma.

This 'works' overall because you had .+ preceding that in the pattern, so that simply extends to match everything up to that second comma. If you change that .+ into \\(.+\\) in re-builder, you'll see what it is matching.
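
If you would rather check this outside re-builder, the following sketch prints what group 3 of the original pattern captures. It uses the regexp and sample link exactly as posted in the question; the my/ names are hypothetical helpers of mine.

```elisp
;; Pattern and sample copied verbatim from the question.
(defconst my/original-regexp
  "\\(\\[\\[\\)\\(zotero:.+\\]\\)\\(\\[.+,[《〈][^《〈]+[》〉]\\)?.+。\\(\\]\\]\\)")

(defconst my/sample-link
  "[[zotero://select/items/1_P36Y9V2B][周東平,〈《晉書·刑法志》校注舉隅〉,《中國古代法律文獻研究》,2016年,00期。]]")

(defun my/original-group-3 ()
  "Return the text captured by group 3 of the original regexp, or nil."
  (when (string-match my/original-regexp my/sample-link)
    (match-string 3 my/sample-link)))

;; The greedy .+ inside group 3 swallows the article title and stops
;; only at the comma before the journal title.
(my/original-group-3)
;; ⇒ "[周東平,〈《晉書·刑法志》校注舉隅〉,《中國古代法律文獻研究》"
```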


Try the following in re-builder:

"\\(\\[\\[\\)\\(zotero:.+\\]\\)\\(\\[\\(.+?\\),\\(《.+?》\\|〈.+?〉\\)\\)?\\(.+\\)。\\(\\]\\]\\)"

I've added several extra groups so you can see exactly what each bit matches.

I recommend that you spend a bit of time experimenting with that.

The key difference to your original pattern is this:

\\(《.+?》\\|〈.+?〉\\)

in place of:

[《〈][^《〈]+[》〉]

This approach works for the examples you have shown.
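
As a rough check, the sketch below applies that regexp to both of your sample links with replace-regexp-in-string. Note one assumption on my part: because of the extra groups, the closing ]] is now group 7, so I have used \\7 where your replacement string had \\4. The function name my/shorten-citation is hypothetical.

```elisp
;; The regexp suggested above, with its extra groups:
;; 1 = "[[", 2 = zotero link, 3 = "[author,title", 4 = author,
;; 5 = title, 6 = remainder, 7 = "]]".
(defconst my/cite-regexp
  "\\(\\[\\[\\)\\(zotero:.+\\]\\)\\(\\[\\(.+?\\),\\(《.+?》\\|〈.+?〉\\)\\)?\\(.+\\)。\\(\\]\\]\\)")

(defun my/shorten-citation (link)
  "Return LINK shortened to author,title form, followed by \",頁\"."
  ;; The final t (FIXEDCASE) stops Emacs from adjusting letter case.
  (replace-regexp-in-string my/cite-regexp "\\1cite-\\2\\3\\7,頁" link t))

(my/shorten-citation
 "[[zotero://select/items/1_P36Y9V2B][周東平,〈《晉書·刑法志》校注舉隅〉,《中國古代法律文獻研究》,2016年,00期。]]")
;; ⇒ "[[cite-zotero://select/items/1_P36Y9V2B][周東平,〈《晉書·刑法志》校注舉隅〉]],頁"

(my/shorten-citation
 "[[zotero://select/items/1_97XVI66Q][金春峯,〈周官之社會行政組織〉,收入《周官之成書及其反映的文化與時代新考》台北:東大,1993年。]]")
;; ⇒ "[[cite-zotero://select/items/1_97XVI66Q][金春峯,〈周官之社會行政組織〉]],頁"
```

In a buffer, the same regexp and replacement string should work as the two arguments of your replace-regexp call.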

If you have any titles containing newlines then you would need to use \\(?:.\\|\n\\)+? instead of .+? because . does not match newlines. Or you could alternatively use:

\\(《[^》]+》\\|〈[^〉]+〉\\)

(Which isn't exactly the same thing, but would most likely be equivalent for your purposes, and is probably what you were meaning to express in the first instance.)
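
A small sketch of the newline point, using a made-up title that spans two lines (the variable name and string are mine, not from your data): . refuses to cross the newline, whereas a negated character class such as [^〉] matches it.

```elisp
;; Hypothetical title containing a newline (an assumption for illustration).
(defvar my/multiline-title "〈校注\n舉隅〉")

;; . never matches a newline, so this finds nothing.
(string-match "〈.+?〉" my/multiline-title)                  ; ⇒ nil

;; Either workaround matches from the start of the string.
(string-match "〈\\(?:.\\|\n\\)+?〉" my/multiline-title)     ; ⇒ 0
(string-match "〈[^〉]+〉" my/multiline-title)                ; ⇒ 0
```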

phils