
I am trying to use a regexp to match strings, e.g. a pattern that matches these kinds of quoted strings:

"test" or 'test' or 'test with "quote"' or "test with 'quote'"

This pattern matches all those,

"\\(\"\\|'\\)[^\\1]+?\\1"

but it fails on "test\n" (and on other escape sequences such as \t and \r). I don't understand why \n (a newline) is so special that it doesn't match [^\\1], which I thought would match any character other than the opening quote. Is this expected?

If I replace [^\\1] with . then it works on all the strings, including the one with \n in it. I guess that is OK because +? makes the match non-greedy, so it does not seem to over-match.

Note, this question originated from How to highlight in different colors for variables inside `fstring` on python-mode.

Drew
John Kitchin
  • Would you mind linking where I can see what ^\1 matches? regexps and their docs give me a migraine every time I need to use one. – RichieHH Feb 17 '20 at 14:07
  • Are (negated) backreferences allowed inside character alternatives in Emacs regular expressions? The manual neither confirms nor denies that, as far as I can tell (https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Backslash.html) – rpluim Feb 17 '20 at 14:14
  • @RichieHH \\1 should match whatever is in the first group, which in this case is \(\"\|'\), i.e. a " or a '. The syntax [^\1] (I thought) would match any character that is not \1. – John Kitchin Feb 17 '20 at 14:20
  • @rpluim Good point. I guess in general this is not a good thing to do, since \1 might be matching more than one character, and in that case, this probably wouldn't do what I was expecting. I guess it is not absolutely wrong to do it though because it works as long as there is no control character between the quote characters (even if it is not documented, or defined). – John Kitchin Feb 17 '20 at 14:22
  • @JohnKitchin at least in interactive search, eg \\(f\\)[^\1] matches 'f' followed by anything, not 'f' followed by not-f. I suspect it's a corner case nobody has thought about, emacs-devel might know more. – rpluim Feb 17 '20 at 14:24
  • @rpluim there is a hint in the link you provided "For example, ‘\(.*\)\1’ matches any newline-free string that is composed of two identical halves." That sure makes it seem like newlines are special here. – John Kitchin Feb 17 '20 at 14:29
  • Thanks. I did know that, as I used it only recently! In one ear out the next with regexps. – RichieHH Feb 17 '20 at 14:35
  • @JohnKitchin that's because '.' doesn't match newlines, I don't think it's a property of \1 (I took a quick look at the relevant code in emacs, and now I'm going to have to have a bit of a lie down). – rpluim Feb 17 '20 at 14:43
  • @rpluim '.' does match a newline (and also \t and \r and \j) in this case, at least when it is written as \n within a string. It does not match past the end of a string though. – John Kitchin Feb 17 '20 at 14:59
  • @JohnKitchin. You'll have to show me the code. At least `(string-match "." "\n")` returns nil. – rpluim Feb 17 '20 at 15:11
  • And one more bit of info: '\' is not special inside character alternatives according to the last paragraph of https://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html , so I think that means backreferences aren't allowed. – rpluim Feb 17 '20 at 15:15
  • Can some of this long stream of comments perhaps be incorporated into the question? I.e., does some of it perhaps help clarify the question? Comments can be deleted at any time, and comments are not searched when you search for Q & A. – Drew Feb 17 '20 at 15:37
  • @Hubisan, your [^\\1] is matching 'anything except backslash or 1'. – rpluim Feb 17 '20 at 15:53

2 Answers


You asked about regexps. But from my perspective this is rather an X-Y-problem.

This answers your real question: how to find strings in buffers whose major mode derives from prog-mode, assuming that the mode's syntax table is set up correctly.

Instead of trying to compose a suitable regexp you can use the syntax parser built-in to Emacs. The following code shows how you can do that.

The main tools are:

  • syntax-ppss for requesting the current state of the syntax parser
  • parse-partial-sexp for getting out of a comment or string and for finding the beginning of strings (and comments)
  • scan-sexps for finding the end of the string

Note that it should be possible to use this function as a MATCHER in font-lock-keywords and related variables.

(defun my-find-next-string (&optional bound)
  "Find the next string up to BOUND.
BOUND defaults to `point-max'.
If we start within a string we skip to its end
and start there with the search."
  (unless bound
    (setq bound (point-max)))
  (let ((state (syntax-ppss))
        start)
    ;; Starting within a string or comment: parse to just after its end.
    (when (nth 8 state)
      (setq state (parse-partial-sexp (point)
                                      bound
                                      nil nil
                                      state
                                      'syntax-table)))
    ;; Search for the next string start, skipping comments.  With
    ;; COMMENTSTOP set to `syntax-table', each call stops just after
    ;; the start or end of a comment or string.
    (while (and
            (setq state (parse-partial-sexp (point)
                                            bound
                                            nil nil
                                            state
                                            'syntax-table))
            (< (point) bound)
            ;; (nth 3 state) is non-nil inside a string;
            ;; (nth 8 state) is then the position of its opening quote.
            (null (setq start (and (nth 3 state)
                                   (nth 8 state))))))
    (when start
      ;; Set the match data to the whole string, delimiters included,
      ;; and leave point at its end.
      (set-match-data
       (list (goto-char start)
             (goto-char (scan-sexps (point) 1))))
      (point))))
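
To see the core idea in isolation, here is a minimal, self-contained sketch of letting a syntax table find the end of a string. It does not use the function above. The names `my-quote-syntax-table` and `my-first-string-in` are made up for illustration, and the hand-built table is only a stand-in for the table a real major mode would provide.

```elisp
;; Hypothetical syntax table: both quote characters delimit strings and
;; backslash is the escape character.  A real mode supplies its own table.
(defvar my-quote-syntax-table
  (let ((table (make-syntax-table)))
    (modify-syntax-entry ?\" "\"" table)
    (modify-syntax-entry ?\' "\"" table)
    (modify-syntax-entry ?\\ "\\" table)
    table)
  "Illustrative syntax table treating \" and \\=' as string delimiters.")

(defun my-first-string-in (text)
  "Return the first quoted string in TEXT, delimiters included, or nil."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (with-syntax-table my-quote-syntax-table
      ;; Find the first quote character, then let `scan-sexps' locate
      ;; the matching close; it honors backslash escapes.
      (when (re-search-forward "[\"']" nil t)
        (let ((beg (match-beginning 0)))
          (condition-case nil
              (buffer-substring beg (scan-sexps beg 1))
            ;; An unterminated string signals `scan-error'.
            (scan-error nil)))))))
```

Because the parser only ends a string at the same character that opened it, a call such as `(my-first-string-in "x = 'test with \"quote\"' + 1")` should return the whole single-quoted string, and a literal backslash followed by n inside the quotes causes no trouble.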
Tobias
  • +1. But it depends on the current syntax defining "string" as indicated in the Q, i.e., double-quoted or single-quoted, with possibly embedded single-quoted or double-quoted, respectively. (But how many such nestings are allowed? Unspecified.) (And I agree with you about this likely being an X-Y problem.) – Drew Feb 17 '20 at 18:30
  • @Drew This is not really about nestings. In a double-quoted string you can use as many single-quotes you want but you need to escape every double-quote. In a single-quoted string you can use double-quotes without escape but you must escape single-quotes. – Tobias Feb 17 '20 at 20:34
  • One can imagine a syntax that allows, for example (and for whatever reason) a string with nested syntax, such as `"aaa 'bb "ccc cc" bbb' aaaa"`. The question isn't specific about what "string" syntax to respect. That's what I meant about the syntax not being well specified - OP just gave some examples. – Drew Feb 17 '20 at 22:11
  • @Drew Okay fair enough. I referred in my comment to the Python context which the OP gave. – Tobias Feb 18 '20 at 03:44

[^\\1] and [^\1] match the same thing: any single character except 1 or \. The doc is clear about this. \ and 1 are not special inside [...].
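
A quick way to check this is to evaluate the pattern from the question against a few test strings. The sketch below assumes that the failing text contains a literal backslash followed by n (as program source would), not an actual newline character. The name `my-quoted-re` is only illustrative.

```elisp
;; In Lisp source, "[^\\1]" is the regexp [^\1]: a character alternative
;; that excludes only the two characters \ and 1.
(defconst my-quoted-re "\\(\"\\|'\\)[^\\1]+?\\1"
  "The pattern from the question (hypothetical name, for illustration).")

;; Plain quoted text matches, starting at index 0.
(string-match-p my-quoted-re "\"test\"")

;; An actual newline character is fine too: [^\1] matches it.
(string-match-p my-quoted-re "\"test\n\"")

;; A literal backslash followed by n is not: the backslash is excluded
;; by [^\1], so this returns nil.
(string-match-p my-quoted-re "\"test\\n\"")

;; Likewise, a 1 between the quotes prevents a match.
(string-match-p my-quoted-re "\"test1\"")
```

So under that assumption the newline is not special at all; the match fails because of the backslash character in the text.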

Negation, other than in a character alternative, is not possible using a regular expression. Use Lisp code instead. For example, find something that might include what you don't want, test it, and exclude it if it's something you don't want.
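
As a rough sketch of that approach: find an opening quote with a regexp, then use Lisp to look for the same character again, and skip candidates that never close. The function name `my-match-quoted` is hypothetical, and the sketch deliberately ignores backslash escapes.

```elisp
(defun my-match-quoted (string &optional start)
  "Return (BEG . END) for the first quoted substring of STRING, or nil.
Searching begins at index START (default 0).  The closing quote must be
the same character as the opening one.  Backslash escapes are ignored."
  (let ((pos (or start 0))
        result)
    (while (and (not result)
                (string-match "[\"']" string pos))
      (let* ((beg (match-beginning 0))
             ;; The delimiter actually found, as a one-character string.
             (delim (char-to-string (aref string beg)))
             ;; Index of the next occurrence of that same delimiter.
             (close (string-match (regexp-quote delim) string (1+ beg))))
        (if close
            (setq result (cons beg (1+ close)))
          ;; Unterminated: exclude this candidate and keep looking.
          (setq pos (1+ beg)))))
    result))
```

For example, `(my-match-quoted "x 'test with \"quote\"' y")` returns `(2 . 21)`, which spans the whole single-quoted string including the embedded double quotes.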

(Suggestion: describe your real problem - the problem for which you thought regexp-searching would provide a solution directly.)

Drew
  • Actually, `[^\1]` matches anything except Ctrl-A. – rpluim Feb 17 '20 at 16:10
  • @rpluim No, `[^\1]` and `[^\\1]` do what Drew said if this character sequence is not part of a string to be read by the Elisp reader. Only the reader translates `\1` within a string to the character with code 1. Drew did not say anything about that. – Tobias Feb 17 '20 at 17:00
  • @Tobias Oh, I was testing in ielm, that explains it. – rpluim Feb 17 '20 at 17:52