In case you were very concerned about national characters and precise treatment of Unicode character classes, then the only solution I was able to find so far is the Python regex
library. Both grep
and Perl
(to my utter surprise!) didn't do the job properly.
So, the regular expression you are after is this one: \p{L}
. This is known as Unicode property shorthand version, the full version is \p{Letter}
or even p\{General_Category=Letter}
. Letter
is itself a composite class, but I won't go into details, the best reference I could find on the subject is here.
Python library isn't built-into the language (it is an alternative to the built-in re
library). So, you would need to install it, for example:
# pip install regex
Then, you could use it like so:
import regex
>>> regex.match(ur'\p{L}+', u'۱۲۳۴۵۶۷۸۹۰')
>>> regex.match(ur'\p{L}+', u'абвгд')
<regex.Match object; span=(0, 5), match=u'\u0430\u0431\u0432\u0433\u0434'>
>>> regex.match(ur'\p{L}+', u'123')
>>> regex.match(ur'\p{L}+', u'abcd')
<regex.Match object; span=(0, 4), match=u'abcd'>
>>>
You could also put this script somewhere where you can access it:
#!/usr/bin/env python
import regex
import sys
if __name__ == "__main__":
for match in regex.finditer(ur'\p{L}+', sys.argv[1].decode('utf-8')):
print match.string
And call it from Emacs like so (suppose you saved this script in ~/bin
):
(defun unicode-character-p ()
(interactive)
(let* ((current (char-after (point)))
(result (shell-command-to-string
(format "~/bin/is-character.py '%c'" current))))
(message
(if (string= result "") "Character %c isn't a letter"
"Character %c is a letter")
current)))