7

I often need regular expressions that match strings across multiple lines. In Python and other languages, you can ask for dots to match newlines (which they do not do by default). Is it possible to do this using the default regexp utilities without hacking the source?

I know I can use a negated character class, which will consume newlines if the newline is not in the class, but this is too limited since what I really want is to just match any character at all.

Malabarba
erjoalgo
  • 1
    FWIW: In [Icicles](http://www.emacswiki.org/emacs/Icicles), you can use [**`C-M-.`** anytime in the minibuffer](http://www.emacswiki.org/emacs/Icicles_-_Dot%2c_Dot%2c_Dot) to toggle whether when you type **`.`** in your minibuffer input it matches any char (including newlines) or any char except newlines. This is handy for apropos completion, which uses regexp matching. – Drew Aug 09 '15 at 01:03

3 Answers

6

While erjoalgo's answer is correct, according to the Emacs Wiki Multiline Regex page, it is not the most efficient answer:

To match any number of characters, use this: .* – the problem is that . matches any character except newline. What many people propose next works, but is inefficient if you assume that newlines are not that common in your text: "\\(.\\|\n\\)". Better match multiple lines instead: "\\(.*\n?\\)*". The newline is optional so that the expression can end in the middle of a line. Better yet: "[\0-\377[:nonascii:]]*" avoids “Stack overflow in regexp matcher” for huge texts, e.g., > 34k.
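As a minimal sketch of the character-class approach from the quote, the snippet below pulls out the text between `[code]` and `[/code]`. It swaps in `[[:ascii:][:nonascii:]]` for the wiki's `[\0-\377[:nonascii:]]`, on the assumption that the two behave the same for this purpose. The helper name `my-code-block-body` is hypothetical.

```elisp
;; Sketch only: `my-code-block-body' is a made-up helper name.
;; The class [[:ascii:][:nonascii:]] stands in for the wiki's
;; [\0-\377[:nonascii:]] (assumed equivalent here); both are plain
;; character classes, so they also match newlines.
(defun my-code-block-body (text)
  "Return the text between [code] and [/code] in TEXT, or nil."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; NOERROR is t, so a failed search returns nil instead of signaling.
    (when (re-search-forward
           "[[]code]\\([[:ascii:][:nonascii:]]*?\\)[[]/code]" nil t)
      (match-string 1))))

(my-code-block-body "[code]line one\nline two\n[/code]")
```

Because the class matches one character at a time without an alternation, it avoids the `\\(.\\|\n\\)` group that the quote describes as inefficient.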

phils
Trebor Rude
  • 2
    Can anyone suggest why that last pattern from the Wiki was written `"[\0-\377[:nonascii:]]*"` rather than `"[\0-\177[:nonascii:]]*"` (or, moreover, `"[[:ascii:][:nonascii:]]*"`) ? – phils Aug 09 '15 at 10:16
  • 1
    `"\\(.*\n?\\)*"` will lead to pathologically slow behavior because it can match in too many different ways. Better use `".*\\(\n.*\\)?"` – Stefan Aug 09 '15 at 19:56
  • How is this an answer? It does not answer the question – Matej J Apr 05 '20 at 19:52
4

Actually, I just noticed I can do this with \(.\|[\n]\)*. For example,

[code] $ sudo wpa_supplicant -B -i wlan0 -Dwext -c universitywpa 
Successfully initialized wpa_supplicant
ioctl[SIOCSIWENCODEEXT]: Invalid argument
ioctl[SIOCSIWENCODEEXT]: Invalid argument
[/code]

-

(progn
  (re-search-forward "[[]code]\\(\\(.\\|[\n]\\)*?\\)[[]/code]")
  (match-string 1))

gives

 " $ sudo wpa_supplicant -B -i wlan0 -Dwext -c universitywpa 

Successfully initialized wpa_supplicant
ioctl[SIOCSIWENCODEEXT]: Invalid argument
ioctl[SIOCSIWENCODEEXT]: Invalid argument
"

A shorthand for this would be nicer, though, something like

(let ((re-dot-match-all t))
  (re-search-forward "[[]code]\\(.*?\\)[[]/code]")
  (match-string 1))

This would work much like binding case-fold-search. Another option would be a named character class:

(re-search-forward "[[]code]\\([[:any:]]*?\\)[[]/code]")
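Neither `re-dot-match-all` nor `[:any:]` exists, as far as I know. One thing that comes close to a readable shorthand is the `rx` macro, whose `anything` form matches any character including newline. The following is only a sketch of that idea, not something proposed in this thread, and `my-rx-code-body` is a hypothetical name.

```elisp
(require 'rx)

;; Sketch only: `my-rx-code-body' is a made-up helper name.
;; (*? anything) is the non-greedy "zero or more of any character,
;; newline included" form in rx notation.
(defun my-rx-code-body (text)
  "Return the text between [code] and [/code] in TEXT, or nil."
  (let ((re (rx "[code]" (group (*? anything)) "[/code]")))
    (when (string-match re text)
      (match-string 1 text))))

(my-rx-code-body "[code]first\nsecond\n[/code]")
```

The `rx` form also takes care of quoting the literal brackets, so the `[[]code]` trick is not needed.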
erjoalgo
  • I suppose it should be noted that `\s-` and `[:space:]` also match newlines, and `\ca` and `[:ascii:]` match any ASCII character. But see my answer for a potentially better solution. – Trebor Rude Aug 08 '15 at 20:45
  • someone care to explain the downvote? – erjoalgo Aug 08 '15 at 22:27
  • 1
    I wondered the same thing myself. I upvoted your answer since it does work. I've never cast any downvotes on emacs SE (as can be seen by my lack of the Critic badge). – Trebor Rude Aug 08 '15 at 23:08
  • Since you're writing elisp anyway, I suggest something along the lines of: `(search-forward "[code]") (buffer-substring (point) (progn (search-forward "[/code]") (match-beginning 0)))` – YoungFrog Aug 08 '15 at 23:47
  • yeah, that's exactly what I want to avoid, having to write more code, especially imperative and tedious code. the intent can be expressed more readably and concisely by making dot match newlines, which is already standard practice in other languages. – erjoalgo Aug 09 '15 at 01:33
  • There are also lots of other differences in regexp syntax between elisp and what is "standard practice in other [and typically newer] languages". Is this difference particularly noteworthy? – phils Aug 09 '15 at 10:09
2

At least in Emacs 27, the Elisp Info page "Regexp Special" answers the question:

If the lower bound of a range is greater than its upper bound, the range is empty and represents no characters. Thus, ‘[z-a]’ always fails to match, and ‘[^z-a]’ matches any character, including newline.

I tested it a few times and it seems to work. IMHO this solution is also shorter and more readable than the others.
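For illustration, here is a small sketch of `[^z-a]` applied to the `[code]` example from the other answer, assuming Emacs 27 or later as stated above. The function name `my-any-char-body` is hypothetical.

```elisp
;; Sketch only: `my-any-char-body' is a made-up helper name.
;; [z-a] is an empty range, so its complement [^z-a] matches every
;; character, newline included (documented for Emacs 27).
(defun my-any-char-body (text)
  "Return the text between [code] and [/code] in TEXT, or nil."
  (when (string-match "[[]code]\\([^z-a]*?\\)[[]/code]" text)
    (match-string 1 text)))

(my-any-char-body "[code]first\nsecond\n[/code]")
```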

hugobam