I'm a little confused that the underscore does not show up in the character class.

ELISP> (require 'rx)
rx
ELISP> (rx (any "A-Z" "_"))
"[A-Z_]"
ELISP> (rx (any "A-z" "_"))
"[A-z]"

Expected result is "[A-z_]" but actual is "[A-z]".

But testing gives me curious results:

ELISP> (string-match-p (rx (any "A-z_")) "9")
nil
ELISP> (string-match-p (rx (any "A-z_")) "_")
0 (#o0, #x0, ?\C-@)

I would like to clarify 2 things:

  1. Is it correct to use (any "A-Z" "a-z" "_") to match the string "_"?
  2. Should I use (any "A-Z" "a-z" "_"), or can I safely use (any "A-z" "_"), to match the string "Abc_Def"?
serghei

2 Answers

3

Character ranges use code point order. In modern versions of Emacs, as long as you're using an encoding that's a subset of Unicode, Emacs uses Unicode code points. Unicode is an extension of ASCII, so ASCII characters keep their ASCII code points.

In ASCII, the uppercase letters occupy the code points 65 to 90 and the lowercase letters occupy the code points 97 to 122. Each of these ranges contains the 26 letters and no more, in alphabetical order. Thus the regular expression [A-Z] (which can be constructed with (rx (any "A-Z"))) matches any uppercase Latin letter, and nothing else (it can't match a digit, a punctuation symbol, a space, an accented letter, etc.). Similarly [a-z] matches any lowercase Latin letter.

The regular expression [A-z] matches any character whose code point is between 65 and 122. This includes not only the 52 Latin letters of either case, but also the six other characters whose code points are between 91 and 96. Those other characters are [\]^_`. Notice the _ in here: this is why (rx (any "A-z" "_")) returns "[A-z]". The rx macro notices that _ is included in the range A-z and simplifies [A-z_] down to the equivalent [A-z], just like (rx (any "A-C" "B")) is simplified to "[A-C]".
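To see for yourself which non-letters hide inside a range, here is a minimal sketch. The helper name demo-non-letters-in-range is made up for this illustration; it is not part of Emacs or rx.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-non-letters-in-range (from to)
  "Return a string of the characters from FROM to TO that are not ASCII letters."
  (let (extras)
    ;; Walk every code point in the range, FROM and TO included.
    (dotimes (i (1+ (- to from)))
      (let ((c (+ from i)))
        (unless (or (<= ?A c ?Z) (<= ?a c ?z))
          (push c extras))))
    ;; `push' built the list backwards; restore code point order.
    (concat (nreverse extras))))

(demo-non-letters-in-range ?A ?z)
;; => "[\\]^_`"   i.e. the six characters [ \ ] ^ _ `
```

The underscore is one of the six, which is why adding "_" to "A-z" changes nothing.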

(rx (+ (any "A-z"))) ([A-z]+ in regexp notation) happens to match Abc_Def, but it also matches things like Abc^D[ef]. So no, it isn't safe. You should never use A-z (unless you mean “an ASCII letter or one of the characters [\]^_`”, but when does this ever come up?).
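A quick way to check the difference is to anchor both patterns to the whole string. This is only a sketch: the names demo-loose-re, demo-strict-re and demo-full-match-p are made up for the example, and case-fold-search is bound to nil so that case folding does not blur the result.

```elisp
(require 'rx)

;; Hypothetical names, for illustration only.
(defconst demo-loose-re (rx bos (+ (any "A-z")) eos)
  "Whole-string match using the over-wide range A-z.")

(defconst demo-strict-re (rx bos (+ (any "A-Z" "a-z" "_")) eos)
  "Whole-string match using only ASCII letters and underscore.")

(defun demo-full-match-p (regexp string)
  "Return t if REGEXP matches STRING, matching case-sensitively."
  (let ((case-fold-search nil))
    (and (string-match-p regexp string) t)))

(demo-full-match-p demo-loose-re  "Abc_Def")    ; => t
(demo-full-match-p demo-loose-re  "Abc^D[ef]")  ; => t   (false positive)
(demo-full-match-p demo-strict-re "Abc_Def")    ; => t
(demo-full-match-p demo-strict-re "Abc^D[ef]")  ; => nil
```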

To match identifiers and integer literals of typical programming languages (ASCII letters, digits and underscores), you can use (rx (+ (any "A-Z" "a-z" "0-9" "_"))) or "[A-Za-z_0-9]+". Alternatively, you can rely on the character syntax defined by the major mode and match only word-constituent characters with (rx (+ word)) or \w+ (written "\\w+" as a Lisp string). Note that in some modes, _ has the symbol syntax rather than word syntax, so you would need (rx (+ (or word (any "_")))) or \(\w\|_\)+.
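The effect of the syntax table can be sketched as follows. The two helper names are made up for the example, and the standard syntax table is used explicitly (it gives _ symbol syntax) so that the result does not depend on whatever buffer happens to be current.

```elisp
(require 'rx)

;; Hypothetical helpers, for illustration only.
(defun demo-word-only-p (string)
  "Return t if STRING consists solely of word-constituent characters."
  (let ((case-fold-search nil))
    (with-syntax-table (standard-syntax-table)
      (and (string-match-p (rx bos (+ word) eos) string) t))))

(defun demo-word-or-underscore-p (string)
  "Return t if STRING consists solely of word characters and underscores."
  (let ((case-fold-search nil))
    (with-syntax-table (standard-syntax-table)
      (and (string-match-p (rx bos (+ (or word (any "_"))) eos) string) t))))

(demo-word-only-p "Abc_Def")           ; => nil, _ is not a word character here
(demo-word-or-underscore-p "Abc_Def")  ; => t
(demo-word-or-underscore-p "Abc-Def")  ; => nil, - is neither
```

Keep in mind that word syntax is broader than ASCII: accented letters and digits usually count as word constituents too.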

If you want to match all letters including non-ASCII ones (accented letters, Greek letters, etc.), you can use character classes, e.g. (rx (any upper digit)) or "[[:upper:][:digit:]]" to match any uppercase letter or digit.
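As a small sketch of the difference between a character class and an ASCII range, the following compares the two on É (U+00C9) and é (U+00E9). The helper names are made up for the example, and the results assume the default case table. case-fold-search is bound to nil because case folding would otherwise let [:upper:] match lowercase letters as well.

```elisp
(require 'rx)

;; Hypothetical helpers, for illustration only.
(defun demo-upper-or-digit-p (string)
  "Return t if STRING contains an uppercase letter (any script) or a digit."
  (let ((case-fold-search nil))
    (and (string-match-p (rx (any upper digit)) string) t)))

(defun demo-ascii-upper-or-digit-p (string)
  "Return t if STRING contains an ASCII uppercase letter or an ASCII digit."
  (let ((case-fold-search nil))
    (and (string-match-p (rx (any "A-Z" "0-9")) string) t)))

(demo-upper-or-digit-p "\u00C9")        ; => t,   É is uppercase
(demo-ascii-upper-or-digit-p "\u00C9")  ; => nil, É is outside A-Z
(demo-upper-or-digit-p "\u00E9")        ; => nil, é is lowercase
```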

2

An additional data point might help:

(string-match-p (rx (any "A-z")) "_")

returns 0, i.e. it matches at position 0. Looking at man ascii, I see that A is ASCII 65, z is ASCII 122, and _ is ASCII 95. So the range "A-z" is actually 65-122, which includes 95 (_).

To answer your second question, I think it's safe to use (any "A-z") to match "Abc_Def", if we can assume you won't be encountering any non-ASCII characters. This won't work for accented or non-Roman letters.

Tyler
  • (any "A-z") also matches the characters [\]^`, so no, (any "A-z") is not a safe way of matching letters. – Gilles 'SO- stop being evil' Aug 28 '17 at 20:55
    @Gilles I think we need some clarification from klay as to what safe means here. It's safe in the sense that there will be no false negatives. If we want to include only upper and lower case letters and the underscore, and match no other symbols, then this is not safe and will give false positives. – Tyler Aug 28 '17 at 21:06