Trouble while using string-match with rx expression

Question

I'm trying to write a function that takes a string and returns non-nil if it's the word `do` not followed by the character `:`. Something like this:

(string-match (rx "do[^:]") input)

but I get the following result:

(string-match (rx "do[^:]") "do") ;; => nil

Am I doing something wrong here? Or is there any other way to compare a string with a Regex?

Why do you use `rx` if you are giving a regexp? Remove the `rx` form and the expression will match. — Tobias, Dec 08 '19 at 22:05
Quite; one of the purposes of `rx` is to write regular expressions *without* special characters in strings -- it provides its own analogous constructs for all regexp metacharacters. You can combine the two using the rx form `(regexp "...")`, for some regexp string `"..."`, but that's a relatively unusual usage. — phils, Dec 09 '19 at 04:34

zck · Accepted Answer · 2019-12-09T04:07:57.707

Let's look at what (rx "do[^:]") expands to:

ELISP> (rx "do[^:]")
"do\\[\\^:]"

This is...not what you want. What does it match?

ELISP> (string-match (rx "do[^:]") "do[^:]t")
0 (#o0, #x0, ?\C-@)

What's going on here? From the rx documentation:

(rx &rest REGEXPS)

Translate regular expressions REGEXPS in sexp form to a regexp string. REGEXPS is a non-empty sequence of forms of the sort listed below.

The following are valid subforms of regular expressions in sexp notation.

STRING matches string STRING literally.

So if we call rx with the argument "do[^:]", Emacs will look for the string "do[^:]" literally. That's not what you want.
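As a small sketch of what "literally" means here: a string argument to rx is escaped the same way regexp-quote would escape it, while the `(regexp "...")` form mentioned in the comments treats the string as regexp syntax. The test strings are only illustrations.

```elisp
;; A plain string inside `rx' is quoted, so it equals what
;; `regexp-quote' produces for the same text.
(string= (rx "do[^:]") (regexp-quote "do[^:]"))   ; t

;; The quoted pattern does not treat [^:] as a character class.
(string-match (rx "do[^:]") "dot")                ; nil

;; The `regexp' form keeps the metacharacters, so [^:] is a
;; character class again and "dot" matches at index 0.
(string-match (rx (regexp "do[^:]")) "dot")       ; 0
```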

Later in the documentation:

‘(any SET ...)’ matches any character in SET .... SET may be a character or string.

SET may also be the name of a character class: ‘digit’, ‘control’, ‘hex-digit’, ‘blank’, ‘graph’, ‘print’, ‘alnum’, ‘alpha’, ‘ascii’, ‘nonascii’, ‘lower’, ‘punct’, ‘space’, ‘upper’, ‘word’, or one of their synonyms.

‘(not (any SET ...))’ matches any character not in SET ...

To search for any character that is not :, we can use:

ELISP> (rx "do" (not (any ":")))
"do[^:]"

And this seems like it works:

ELISP> (string-match (rx "do" (not (any ":"))) "dot")
0 (#o0, #x0, ?\C-@)

But there's a slight issue:

ELISP> (string-match (rx "do" (not (any ":"))) "do")
nil

Let's look back at the generated regex:

ELISP> (rx "do" (not (any ":")))
"do[^:]"

This says: a letter d, a letter o, and then any character that is not a colon. That is, match something with at least three characters. Let's handle this.

Because we already match every string in which "do" is followed by at least one more character, the only case left to worry about is the string ending right after "do". We can detect this by using string-end.

‘string-end’, ‘eos’, ‘eot’ matches the empty string, but only at the end of the string being matched against.

We'll look for the first two characters being "do", and then either the string ends, or it continues with a non-colon character:

ELISP> (rx "do" (or string-end (not (any ":"))))
"do\\(?:\\'\\|[^:]\\)"
ELISP> (string-match (rx "do" (or string-end (not (any ":")))) "dots")
0 (#o0, #x0, ?\C-@)
ELISP> (string-match (rx "do" (or string-end (not (any ":")))) "do")
0 (#o0, #x0, ?\C-@)

Our non-matching case:

ELISP> (string-match (rx "do" (or string-end (not (any ":")))) "do:")
nil

This works. The only thing that might trip you up is that string-match searches the whole string, so it succeeds if any occurrence of do is not followed by a colon, even if do: also appears earlier:

ELISP> (string-match (rx "do" (or string-end (not (any ":")))) "do: A thing I do.")
14 (#o16, #xe, ?\C-n)

This can be handled in different ways, depending on what you're looking for.
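One possible way, assuming only a "do" at the very start of the string should count, is to anchor the pattern with string-start. The helper name my/starts-with-do-not-colon-p is hypothetical and only for illustration. string-match-p is used so the match data is left untouched.

```elisp
;; Hypothetical helper (not from the original answer): return non-nil
;; only when INPUT begins with "do" and that "do" is not followed by
;; a colon.
(defun my/starts-with-do-not-colon-p (input)
  (string-match-p
   (rx string-start "do" (or string-end (not (any ":"))))
   input))

(my/starts-with-do-not-colon-p "do")                 ; 0
(my/starts-with-do-not-colon-p "dots")               ; 0
(my/starts-with-do-not-colon-p "do: A thing I do.")  ; nil
```

If a "do" later in the string should count as well, the unanchored pattern from above is the one to keep.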

That explains the question I had and much more! Thanks!!! — riekes, Dec 08 '19 at 22:25
