Is it possible to write an (rx ...) form that is expanded at compile time for performance, but remains partially variable?


rx-to-string may be slow

For instance, I want an expression that behaves equivalently to

(rx-to-string `(: "PREFIX"
                  (or "hello" "world" ,@user-defined-words)
                  "SUFFIX"))

Written this way, it works, but does an unnecessary amount of work at runtime (especially in more complex real-world examples). In some places, this can become a potential performance issue.

What I want is essentially a partial compile-time expansion, such as

(concat "PREFIX\\(?:hello\\|world\\|"
        (mapconcat #'regexp-quote user-defined-words "\\|")
        "\\)SUFFIX")

eval form does not work

I already know that the (rx ... (eval ...) ...) form does not work:

;; INCORRECT RESULT; VARIABLE EXPANDED AT COMPILE TIME:
(rx "PREFIX"
    (eval `(or "hello" "world" ,@user-defined-words))
    "SUFFIX")

;; When loading from source and user-defined-words is initialized to nil:
=> "PREFIX\\(?:hello\\|world\\)SUFFIX"

;; When compiling:
=> Error: Symbol's value as variable is void: user-defined-words
kdb

2 Answers

Starting from Emacs 27, rx supports this via the following forms:

(literal EXPR) Match the literal string from evaluating EXPR at run time.

(regexp EXPR)  Match the string regexp from evaluating EXPR at run time.

n.b. The regexp form in earlier versions of rx is less flexible, and only accepts a string argument.
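
As a minimal sketch of how these forms apply to the question, assuming Emacs 27 or later: the constant parts of the rx form are translated when the macro is expanded, and only the argument of (regexp ...) or (literal ...) is evaluated at run time. The names my-user-defined-words, my-word-regexp and my-tag-regexp are hypothetical and not taken from the question.

```elisp
;; Sketch only; requires the rewritten rx.el from Emacs 27 or later.
(require 'rx)

(defvar my-user-defined-words '("foo" "bar.baz")
  "Hypothetical list of user-supplied words, matched literally.")

(defun my-word-regexp ()
  "Return a regexp matching PREFIX, one known word, then SUFFIX.
Only the `regexp-opt' call is evaluated each time this runs."
  (rx "PREFIX"
      (or "hello" "world"
          (regexp (regexp-opt my-user-defined-words)))
      "SUFFIX"))

(defun my-tag-regexp (tag)
  "Return a regexp matching the string TAG literally inside angle brackets."
  (rx "<" (literal tag) ">"))
```

Here regexp-opt does the quoting of the individual words, so a word such as "bar.baz" matches only itself. For a single string that should be matched verbatim, (literal EXPR) is the shorter choice.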

phils

A compiled regexp is just a string, so (a) no, I don't believe you can make any kind of dynamic substitution when the regexp engine processes it; but (b) you can search and replace on it yourself -- simply produce a regexp string containing token substrings which you know will not otherwise occur, and ensure they occur in a context which will be safe to replace.

(rx "PREFIX" (or "hello" "world" (regexp "INFIX")) "SUFFIX")

=> "PREFIX\\(?:hello\\|world\\|INFIX\\)SUFFIX"

If you know that PREFIX and INFIX and SUFFIX are all unique in the regexp, then a simple run-time replacement of those can produce your desired outcome.

The part you need to be careful about is knowing that your placeholders are unique and replaceable. You can take advantage of the fact that the rx form (regexp FOO) trusts that FOO is a valid regexp string, and use it to insert an invalid regexp as the placeholder, safe in the knowledge that rx is never going to produce that same value itself [1].

For example, if you searched for (rx (regexp "[[:placeholder:]]")) the regexp engine would signal (invalid-regexp "Invalid character class name").
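
A quick way to check that claim is to catch the error, as in this small sketch (my-placeholder-error is a hypothetical name used only for illustration):

```elisp
;; Try to use the placeholder directly as a regexp and keep the
;; error data instead of letting the error propagate.
(defvar my-placeholder-error
  (condition-case err
      (progn (string-match "[[:placeholder:]]" "some text") nil)
    (invalid-regexp err))
  "Error data signalled when the placeholder is used as a regexp.")
```

If the placeholder ever became a valid regexp, the variable would be nil instead of holding an invalid-regexp error.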

You could therefore do something like this:

(defvar my-pattern-template
  (rx (group (or "hello" "world" (regexp "[[:placeholder:]]"))))
  "Regexp template with placeholder.")

;; Runtime...
(let* ((prefix "PREFIX")
       (suffix "SUFFIX")
       (words '("this" "is" "the" "word" "list" "to" "look" "for"))
       (regexp1 (progn
                  (string-match (rx "[[:placeholder:]]") my-pattern-template)
                  (replace-match (regexp-opt words) nil t my-pattern-template)))
       (regexp (concat (regexp-quote prefix)
                       regexp1
                       (regexp-quote suffix))))
  (re-search-forward regexp))
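
The replacement step can also be wrapped in a small helper. The following is only a sketch under a few assumptions: my-placeholder, my-template and my-fill-template are hypothetical names, and the template string is written out by hand here to stand in for the one built with rx above.

```elisp
(defconst my-placeholder "[[:placeholder:]]"
  "Hypothetical placeholder token; not a valid regexp by itself.")

(defconst my-template
  (concat "\\(hello\\|world\\|" my-placeholder "\\)")
  "Hand-written stand-in for the template string built with `rx'.")

(defun my-fill-template (template words)
  "Return TEMPLATE with each placeholder replaced by a regexp for WORDS."
  (replace-regexp-in-string (regexp-quote my-placeholder)
                            (regexp-opt words)
                            template
                            t    ; FIXEDCASE: do not alter case
                            t))  ; LITERAL: insert the replacement verbatim
```

The LITERAL argument matters because the output of regexp-opt contains backslashes, which would otherwise be interpreted as replacement escapes. Unlike the string-match and replace-match pair, this replaces every occurrence of the placeholder rather than only the first.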

[1] "\\[[:placeholder:]]" is a valid regexp containing that placeholder (equivalent to "\\[[:placehodr]]", and matching the text [:] for instance), but rx shouldn't ever generate such unnecessary duplication of characters, so it's probably still safe. These things are tricky, though, and my scheme might yet be flawed, so be suitably careful. In practice, unless the inputs are truly arbitrary, you probably have enough control to provide certainty.

phils
  • Regarding that footnote, `[[[:placeholder:]]]` may be better, in that it eliminates that particular way of unwanted matching, as `\\[[[:placeholder:]]]` is still invalid. – phils May 01 '20 at 12:46
  • There is nothing really preventing the `rx` macro from expanding into a `(concat ...)` expression, so compilation of the rx form doesn't preclude leaving parts of it variable. As for the replacement method, I am unsure: it is less readable than the well-understood `rx` macro, and I'm not sure it would be much faster than using `rx-to-string`. – kdb May 01 '20 at 13:55
  • My suggestion is certainly considerably less readable than `rx` or `rx-to-string`; but neither of them do what you want, so that seems fairly moot? As for performance, you talked about "more complex real-world examples" which "can become a potential performance issue", so I'm presuming you have some case in mind where `rx-to-string` was particularly slow. If you don't have such a scenario in mind, I would just use that. – phils May 01 '20 at 15:38
  • You'll be happy to learn that the re-written rx.el in Emacs 27 supports what you want. I'll add a different answer. – phils May 01 '20 at 15:42