11

How can I determine whether the current character is a letter (an alphabetic character), i.e., whether it belongs to the character class [:alpha:] in regexp terms? I would like to write a simple function like the one below:

(defun test-letter ()
  (interactive)
  (if char-after-is-a-letter   ; placeholder for the test I am looking for
      (message "This is a letter")
    (message "This is not a letter")))

Update: Unfortunately, my assumption that the set of letters coincides with the character class [:alpha:] seems to be false.

Drew
Name

4 Answers

10

Use Unicode char properties

This should definitely work:

(memq (get-char-code-property (char-after) 'general-category)
      '(Ll Lu Lo Lt Lm Mn Mc Me Nl))

As a bonus it should also be faster than looking-at.


Emacs stores all character properties specified by the Unicode standard. They are accessible with get-char-code-property. Specifically, the general-category property tells you which characters are letters: Ll is lowercase, Lu uppercase, Lt titlecase, Lm modifier, and Lo other letters; Mn, Mc, and Me are combining marks; and Nl is letter-like numbers such as Roman numerals.
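As a rough sketch, the check above could be wrapped in a predicate and plugged into the command from the question. The name my-char-letter-p is made up for this illustration, and the nil guard is an added assumption to cover the case where (char-after) returns nil at the end of the buffer:

```elisp
;; Hypothetical helper: `my-char-letter-p' is not part of Emacs.
(defun my-char-letter-p (char)
  "Return t if CHAR is a letter according to its Unicode general category.
Return nil for other characters, and for nil (e.g. at end of buffer)."
  (and char
       (memq (get-char-code-property char 'general-category)
             '(Ll Lu Lo Lt Lm Mn Mc Me Nl))
       t))

(defun test-letter ()
  "Report whether the character after point is a letter."
  (interactive)
  (if (my-char-letter-p (char-after))
      (message "This is a letter")
    (message "This is not a letter")))
```

Keeping the test in a separate predicate makes it possible to reuse it on characters that do not come from the buffer.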

Malabarba
  • Many thanks, this solves the problem with `۱۲۳۴۵۶۷۸۹۰` but there are some false negatives, e.g. Arabic or Hebrew Alef: `א`, `ا`. – Name Feb 16 '15 at 14:05
  • @Name Fixed. Try it again. – Malabarba Feb 16 '15 at 14:08
  • Thank you again. I checked it with various alphabets and it works. The only exception I found is with some Asian scripts, such as Chinese http://en.wikipedia.org/wiki/Chinese_numerals or Japanese http://en.wikipedia.org/wiki/Japanese_numerals. For example, `五` is considered the number `5` in Japanese. Your code considers this a letter. Maybe it is a letter (like the Roman numeral `v`). Maybe somebody who is familiar with Japanese can verify this. – Name Feb 16 '15 at 15:15
  • `五` is like the English word `five`, so it is a letter. When writing the number 5 instead of the word five, they use `5` just like English. – Muir Nov 21 '18 at 16:15
  • Note that you can check the meaning of those char code properties with the function `char-code-property-description`. The ones starting with `M` are marks (nonspacing, spacing combining and enclosing), `Nl` is "Number, letter". – Joost Kremers Sep 16 '20 at 08:39
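To illustrate the comment above, here is a minimal sketch that looks up the description of each general-category value used in the answer. The function name my-describe-letter-categories is invented for this example, and the exact wording of the descriptions depends on the Emacs version:

```elisp
;; Hypothetical helper: `my-describe-letter-categories' is not part of Emacs.
(defun my-describe-letter-categories ()
  "Return an alist mapping each category symbol to its description."
  (mapcar (lambda (category)
            (cons category
                  (char-code-property-description 'general-category
                                                  category)))
          '(Ll Lu Lo Lt Lm Mn Mc Me Nl)))
```

Evaluating (my-describe-letter-categories) with M-: shows the pairs in the echo area.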
8

EDIT: This answer should be perfectly valid in 25.5 (where the bug has been fixed). For older versions, use the other option.


This should tell you whether the character at point is a letter, and it should work in any language.

(looking-at-p "[[:alpha:]]")
Malabarba
  • Many thanks, I am just curious about the difference between `looking-at-p` used in your solution and `looking-at` in the other answer. – Name Feb 15 '15 at 16:56
  • The two functions are equivalent, except that `looking-at-p` doesn't set match data. – jch Feb 15 '15 at 17:56
  • @Name looking-at-p is closer to a pure predicate, because it doesn't set the match data. If you previously performed something like a search-forward, `match-string` (and its many siblings) will return the result of the search. Meanwhile, with the non predicate version, match-string will return the result of the looking-at match. – Malabarba Feb 15 '15 at 18:35
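The match-data difference described in these comments can be seen with a small experiment. This is only a sketch; my-match-after is a made-up helper that runs a search, calls the given look-ahead function, and then reports what match-string sees:

```elisp
;; Hypothetical helper: `my-match-after' is not part of Emacs.
(defun my-match-after (look-fn)
  "Search for \"foo\", call LOOK-FN on \" bar\", and return (match-string 0)."
  (with-temp-buffer
    (insert "foo bar")
    (goto-char (point-min))
    (search-forward "foo")      ; match data now describes "foo"
    (funcall look-fn " bar")    ; may or may not overwrite the match data
    (match-string 0)))

(my-match-after #'looking-at-p) ; the search result "foo" is still there
(my-match-after #'looking-at)   ; the match data now describes " bar"
```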
5

I think you can get away with this:

(defun test-letter ()
  (interactive)
  (let ((char (char-after)))
    (if (and (eq (char-syntax char) ?w)
             ;; Exclude the ASCII digits 0-9, which also have word syntax.
             (or (> char ?9)
                 (< char ?0)))
        (message "This is a letter")
      (message "This is not a letter"))))

Update

This is less efficient, but closer to what you want:

(defun test-letter ()
  (interactive)
  (if (looking-at "[a-zA-Z]")
      (message "This is a letter")
    (message "This is not a letter")))
abo-abo
3

If you are very concerned about national characters and precise treatment of Unicode character classes, then the only solution I have been able to find so far is the Python regex library. Neither grep nor Perl (to my utter surprise!) did the job properly.

So, the regular expression you are after is this one: \p{L}. This is the shorthand form of a Unicode property; the full form is \p{Letter} or even \p{General_Category=Letter}. Letter is itself a composite class, but I won't go into details; the best reference I could find on the subject is here.

This Python library isn't built into the language (it is an alternative to the built-in re module), so you need to install it, for example:

# pip install regex

Then you could use it like so (Python 2 syntax):

>>> import regex
>>> regex.match(ur'\p{L}+', u'۱۲۳۴۵۶۷۸۹۰')
>>> regex.match(ur'\p{L}+', u'абвгд')
<regex.Match object; span=(0, 5), match=u'\u0430\u0431\u0432\u0433\u0434'>
>>> regex.match(ur'\p{L}+', u'123')
>>> regex.match(ur'\p{L}+', u'abcd')
<regex.Match object; span=(0, 4), match=u'abcd'>
>>> 

You could also put this script somewhere where you can access it:

#!/usr/bin/env python
import regex
import sys

if __name__ == "__main__":
    # Print every run of Unicode letters found in the first argument.
    for match in regex.finditer(ur'\p{L}+', sys.argv[1].decode('utf-8')):
        print match.group()

And call it from Emacs like so (suppose you saved this script as ~/bin/is-character.py):

(defun unicode-character-p ()
  (interactive)
  (let* ((current (char-after (point)))
         ;; Quote the character so that quotes and other shell
         ;; metacharacters are passed to the script safely.
         (result (shell-command-to-string
                  (format "~/bin/is-character.py %s"
                          (shell-quote-argument (string current))))))
    (message
     (if (string= result "")
         "Character %c isn't a letter"
       "Character %c is a letter")
     current)))
wvxvw