Character ranges use code point order. In modern versions of Emacs, as long as you're using an encoding that's a subset of Unicode, Emacs uses Unicode code points. Unicode is an extension of ASCII, so ASCII characters keep the same code points.
In ASCII, the uppercase letters occupy the code points 65 to 90 and the lowercase letters occupy the code points 97 to 122. Each of these ranges contains the 26 letters and no more, in alphabetical order. Thus the regular expression [A-Z]
(which can be constructed with (rx (any "A-Z"))
) matches any uppercase Latin letter and nothing else (it can't match a digit, a punctuation symbol, a space, an accented letter, etc.), provided the match is case-sensitive, i.e. case-fold-search is nil. Similarly [a-z]
matches any lowercase Latin letter.
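
These claims can be checked by evaluating a few forms, for example with M-: or in the *scratch* buffer. This is only a sketch: my/only-upper-ascii-p is a made-up helper name, and it binds case-fold-search to nil because Emacs folds case in searches by default, which would let [A-Z] match lowercase letters as well.

```elisp
;; Code points of the range endpoints.
(list ?A ?Z ?a ?z)                      ; => (65 90 97 122)

(defun my/only-upper-ascii-p (string)
  "Return t if STRING is exactly one character matched by [A-Z].
Hypothetical helper; the match is forced to be case-sensitive."
  (let ((case-fold-search nil))
    (and (string-match-p (rx bos (any "A-Z") eos) string) t)))

(my/only-upper-ascii-p "Q")             ; => t
(my/only-upper-ascii-p "q")             ; => nil
(my/only-upper-ascii-p "7")             ; => nil
(my/only-upper-ascii-p "\u00c9")        ; => nil ("É" is outside the range)
```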
The regular expression [A-z]
matches any character whose code point is between 65 and 122. This includes not only the 52 Latin letters of either case, but also the six characters whose code points are between 91 and 96. Those other characters are [\]^_`
. Notice the _
in here: this is why (rx (any "A-z" "_"))
returns "[A-z]"
. The rx
macro notices that _
is included in the range A-z
and simplifies [A-z_]
down to the equivalent [A-z]
, just like (rx (any "A-C" "B"))
is simplified to "[A-C]"
.
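
To see which characters the range picks up, one can enumerate the code points from A to z and keep those that are not letters. A minimal sketch, assuming a case-sensitive match; my/range-extras is a hypothetical name introduced here for illustration.

```elisp
(defun my/range-extras ()
  "Return the characters matched by [A-z] that are not ASCII letters.
Hypothetical helper for illustration only."
  (let ((case-fold-search nil)
        (extras '()))
    (dolist (c (number-sequence ?A ?z))
      (let ((s (string c)))
        (when (and (string-match-p "[A-z]" s)
                   (not (string-match-p "[A-Za-z]" s)))
          (push c extras))))
    (apply #'string (nreverse extras))))

;; Six characters, code points 91 to 96.  The backslash is doubled
;; only in the printed representation of the string.
(my/range-extras)            ; => "[\\]^_`"

;; The simplification described above.  The exact string rx returns
;; may differ between Emacs versions, so no output is shown here.
(rx (any "A-z" "_"))
(rx (any "A-C" "B"))
```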
(rx (+ (any "A-z")))
([A-z]+
in regexp notation) happens to match Abc_Def
, but it also matches things like Abc^D[ef]
. So no, it isn't safe. You should never use A-z
(unless you mean “an ASCII letter or one of the characters [\]^_`
”, but when does this ever come up?).
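
The difference shows up as soon as the regexp is anchored to the whole string. A small sketch, where my/whole-match-p is a hypothetical helper (not part of Emacs) that wraps the regexp in string anchors and forces a case-sensitive match:

```elisp
(defun my/whole-match-p (regexp string)
  "Return t if REGEXP matches all of STRING, case-sensitively.
Hypothetical helper for illustration only."
  (let ((case-fold-search nil))
    (and (string-match-p (concat "\\`\\(?:" regexp "\\)\\'") string) t)))

(my/whole-match-p "[A-z]+" "Abc_Def")        ; => t
(my/whole-match-p "[A-z]+" "Abc^D[ef]")      ; => t, probably not intended
(my/whole-match-p "[A-Za-z_]+" "Abc_Def")    ; => t
(my/whole-match-p "[A-Za-z_]+" "Abc^D[ef]")  ; => nil
```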
To match identifiers and integer literals of typical programming languages (ASCII letters, digits and underscores), you can use (rx (+ (any "A-Z" "a-z" "0-9" "_")))
or "[A-Za-z_0-9]+"
. Alternatively, you can rely on the character syntax defined by the major mode and match only word-constituent characters with (rx (+ word))
or "\\w+"
. Note that in some modes, _
has the symbol syntax rather than word syntax, so you would need (rx (+ (or word (any "_"))))
or \(\w\|_\)+
.
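
The effect of the syntax class of _ can be sketched without loading a real major mode. The example below builds a syntax table by hand and gives _ either word syntax ("w") or symbol syntax ("_"); my/match-at-start is a hypothetical helper, and an actual major mode may of course set up its table differently.

```elisp
(defun my/match-at-start (regexp text underscore-syntax)
  "Return the part of TEXT that REGEXP matches at its beginning, or nil.
UNDERSCORE-SYNTAX is the syntax descriptor to give the underscore,
for example \"w\" (word) or \"_\" (symbol).  Hypothetical helper."
  (with-temp-buffer
    ;; A fresh table inherits from the standard syntax table, where
    ;; letters and digits are word constituents.
    (let ((table (make-syntax-table)))
      (modify-syntax-entry ?_ underscore-syntax table)
      (set-syntax-table table))
    (insert text)
    (goto-char (point-min))
    (when (looking-at regexp)
      (match-string 0))))

(my/match-at-start "\\w+" "foo_bar" "w")   ; => "foo_bar"
(my/match-at-start "\\w+" "foo_bar" "_")   ; => "foo"
(my/match-at-start (rx (+ (or word (any "_")))) "foo_bar" "_")
                                           ; => "foo_bar"
```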
If you want to match all letters including non-ASCII ones (accented letters, Greek letters, etc.), you can use character classes, e.g. (rx (any upper digit))
or "[[:upper:][:digit:]]"
to match any uppercase letter or digit.
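
As a rough check that the character classes reach beyond ASCII, the sketch below tests a few strings containing accented and Greek letters. my/upper-or-digit-p is a hypothetical helper; it binds case-fold-search to nil, since with case folding enabled the upper class also matches lowercase letters. The non-ASCII characters are written as escapes so the example does not depend on the file encoding.

```elisp
(defun my/upper-or-digit-p (string)
  "Return t if STRING has only uppercase letters and digits.
Hypothetical helper; the match is forced to be case-sensitive."
  (let ((case-fold-search nil))
    (and (string-match-p (rx bos (+ (any upper digit)) eos) string) t)))

(my/upper-or-digit-p "ABC42")               ; => t
(my/upper-or-digit-p "\u00c9\u03a942")      ; => t   ("ÉΩ42")
(my/upper-or-digit-p "\u00c9\u03c942")      ; => nil ("Éω42", ω is lowercase)
```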