How to determine if the current character is a letter

Question

How can I determine if the current character is a letter (an alphabetic character), i.e., one that belongs to the character class [:alpha:] in regexp terms? I would like to write a simple function like the one below:

(defun test-letter ()
  (interactive)
  ;; `char-after-is-a-letter' is a placeholder for the test I am looking for.
  (if char-after-is-a-letter
      (message "This is a letter")
    (message "This is not a letter")))

Update: Unfortunately, my assumption that the class of letters is equivalent to the character class [:alpha:] seems to be false.

Malabarba · Accepted Answer · 2015-02-28T14:15:41.703

10

Use Unicode char properties

This should definitely work:

(memq (get-char-code-property (char-after) 'general-category)
      '(Ll Lu Lo Lt Lm Mn Mc Me Nl))

As a bonus it should also be faster than looking-at.

Emacs stores all character properties specified by the Unicode standard. They are accessible with get-char-code-property. Specifically, the general-category property specifies which characters are letters (Ll are lowercase, Lu are uppercase, and don't ask me what the others are).
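For anyone wondering what the other codes stand for, the same general-category data can be cross-checked outside Emacs. Below is a minimal sketch in Python using the standard unicodedata module. The names LETTER_LIKE and describe are made up for illustration, and Python's Unicode database may be a different version from the one bundled with your Emacs, so treat it as a rough check rather than a reference.

```python
import unicodedata

# The categories accepted by the memq test above, with their Unicode meanings.
# LETTER_LIKE is an illustrative name, not part of any library.
LETTER_LIKE = {
    "Lu": "Letter, uppercase",
    "Ll": "Letter, lowercase",
    "Lt": "Letter, titlecase",
    "Lm": "Letter, modifier",
    "Lo": "Letter, other",
    "Mn": "Mark, nonspacing",
    "Mc": "Mark, spacing combining",
    "Me": "Mark, enclosing",
    "Nl": "Number, letter",
}


def describe(ch):
    """Return (general category, accepted by the test above?) for one character."""
    cat = unicodedata.category(ch)
    return cat, cat in LETTER_LIKE


# Print the category of a few characters discussed in this thread.
for ch in "aAא۹五":
    print(repr(ch), *describe(ch))
```

Digits such as `۹` come out as Nd (decimal number), which is why they are rejected, while ideographs such as `五` are Lo and therefore accepted.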

edited Feb 28 '15 at 14:15

answered Feb 16 '15 at 12:02

Malabarba

Many thanks, this solves the problem with `۱۲۳۴۵۶۷۸۹۰`, but there are some false negatives, e.g. the Hebrew and Arabic alef: `א`, `ا`. – Name Feb 16 '15 at 14:05
@Name Fixed. Try it again. – Malabarba Feb 16 '15 at 14:08
Thank you again. I checked it with various alphabets and it works. The only exception I found is with some Asian scripts, such as Chinese http://en.wikipedia.org/wiki/Chinese_numerals or Japanese http://en.wikipedia.org/wiki/Japanese_numerals. For example, `五` is treated as the number `5` in Japanese. Your code considers it a letter. Maybe it is a letter (like the Roman numeral `v`). Maybe somebody who is familiar with Japanese can verify this. – Name Feb 16 '15 at 15:15
`五` is like the English word `five`, so it is a letter. When writing the number 5 instead of the word five, they use `5` just like English. – Muir Nov 21 '18 at 16:15
Note that you can check the meaning of those char code properties with the function `char-code-property-description`. The ones starting with `M` are marks (nonspacing, spacing combining and enclosing), `Nl` is "Number, letter". – Joost Kremers Sep 16 '20 at 08:39

score 8 · Answer 2 · edited Apr 13 '17 at 12:52

EDIT: This answer should be perfectly valid in 25.5 (where the bug that made [:alpha:] match some digits has been fixed). For older versions, use my other answer, which relies on Unicode char properties.

This should tell you if the current character is a letter, and it should work in any language.

 (looking-at-p "[[:alpha:]]")

answered Feb 15 '15 at 15:36

Malabarba

Many thanks, I am just curious about the difference between `looking-at-p` used in your solution and `looking-at` in the other answer. – Name Feb 15 '15 at 16:56
The two functions are equivalent, except that `looking-at-p` doesn't set match data. – jch Feb 15 '15 at 17:56
@Name `looking-at-p` is closer to a pure predicate, because it doesn't set the match data. If you previously performed something like a `search-forward`, `match-string` (and its many siblings) will still return the result of that search. With the non-predicate version, `match-string` will instead return the result of the `looking-at` match. – Malabarba Feb 15 '15 at 18:35

abo-abo · Answer 3 · 2015-02-15T15:24:56.133

5

I think you can get away with this:

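(defun test-letter ()
  (interactive)
  (let ((char (char-after)))
    ;; Word-syntax characters other than the ASCII digits 0-9 count as letters.
    (if (and char                       ; nil at end of buffer
             (eq (char-syntax char) ?w)
             (or (> char ?9)
                 (< char ?0)))
        (message "This is a letter")
      (message "This is not a letter"))))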

Update

This is less efficient, but closer to what you want:

(defun test-letter ()
  (interactive)
  ;; Matches ASCII letters only.
  (if (looking-at "[a-zA-Z]")
      (message "This is a letter")
    (message "This is not a letter")))

edited Feb 15 '15 at 15:24

answered Feb 15 '15 at 14:35

abo-abo

Thanks. One possible problem: this function treats digits (123...) as letters. – Name Feb 15 '15 at 14:39
Easily fixable. – abo-abo Feb 15 '15 at 14:42
Many thanks again. Another false positive: This considers `۹` (i.e., the Indian digit 9) or `٪` as a letter. – Name Feb 15 '15 at 14:46
Your first solution was fine with Greek letters (such as `ζ` or `α`), but the update is not. – Name Feb 15 '15 at 14:59
But combining both is a closer solution. – Name Feb 15 '15 at 15:10
Unfortunately `[:alpha:]` matches `۹` as a letter. – Name Feb 15 '15 at 15:11
`string-match`/`make-string`? Why not just `looking-at`? – Feb 15 '15 at 15:12
@Name, wrong matching of Indian digits is a bug that should be reported. – abo-abo Feb 15 '15 at 15:23
@lunaryorn, you're right; the lame approach resulted from the previous start. – abo-abo Feb 15 '15 at 15:24
Thank you for this... I have been able to implement the holy grail: if the char at point is English, translate it to Chinese; if it's not, it will be Chinese, so translate it to English! – jolyon Jun 14 '20 at 10:57

score 3 · Answer 4 · answered Feb 16 '15 at 09:19

If you are very concerned about national characters and precise treatment of Unicode character classes, the only solution I have been able to find so far is the Python regex library. Neither grep nor Perl (to my utter surprise!) did the job properly.

So, the regular expression you are after is \p{L}. This is the shorthand form of the Unicode property; the full form is \p{Letter} or even \p{General_Category=Letter}. Letter is itself a composite class, but I won't go into details; the best reference I could find on the subject is here.
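As a rough illustration of what that composite class contains: in the Unicode standard, Letter is the union of the general categories Lu, Ll, Lt, Lm and Lo. The sketch below checks this with Python's standard unicodedata module instead of a regular expression. The names LETTER_CATEGORIES and is_letter are hypothetical, and the result depends on the Unicode version shipped with your Python, so it is an approximation of \p{L} rather than a drop-in replacement.

```python
import unicodedata

# General categories that make up the composite class Letter (L).
# LETTER_CATEGORIES and is_letter are illustrative names only.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_letter(ch):
    """Return True if the single character ch has a Letter general category."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


# Cyrillic and Latin letters pass; Persian and ASCII digits do not.
for ch in "аa۱1":
    print(repr(ch), is_letter(ch))
```

Unlike the category list in the accepted answer, this excludes combining marks (Mn, Mc, Me) and letter-like numbers (Nl), because those are not part of Letter.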

The regex library isn't built into Python (it is an alternative to the built-in re library), so you need to install it, for example:

# pip install regex

Then, you could use it like so:

>>> import regex
>>> regex.match(r'\p{L}+', '۱۲۳۴۵۶۷۸۹۰')
>>> regex.match(r'\p{L}+', 'абвгд')
<regex.Match object; span=(0, 5), match='абвгд'>
>>> regex.match(r'\p{L}+', '123')
>>> regex.match(r'\p{L}+', 'abcd')
<regex.Match object; span=(0, 4), match='abcd'>

You could also put this script somewhere where you can access it:

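#!/usr/bin/env python3
import sys

import regex  # third-party library: pip install regex

if __name__ == "__main__":
    # Print each run of letters found in the first argument;
    # print nothing if there are none.
    for match in regex.finditer(r'\p{L}+', sys.argv[1]):
        print(match.group())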

And call it from Emacs like so (suppose you saved this script in ~/bin):

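(defun unicode-character-p ()
  (interactive)
  (let* ((current (char-after (point)))
         ;; Quote the character so that shell metacharacters (e.g. a
         ;; single quote) reach the script unharmed.
         (result (shell-command-to-string
                  (format "~/bin/is-character.py %s"
                          (shell-quote-argument (string current))))))
    ;; The script prints nothing when the character is not a letter.
    (message
     (if (string= result "")
         "Character %c isn't a letter"
       "Character %c is a letter")
     current)))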
