How to search an Arabic word in text without its diacritics/accents?

Question

In Arabic, as in some other languages, diacritics are used to indicate pronunciation. There is no convention on how many diacritics should be written on a single word. Some writers use the minimum (which I prefer), just enough to disambiguate pronunciation, whereas others use them superfluously or purely for aesthetic, calligraphic purposes. Thus, there is wide variation in which and how many diacritics accompany a given word. When I run isearch-forward/backward with C-s/C-r and type a word in the minibuffer without diacritics, it does not match the same word in the text if that word carries diacritics, which makes searching for a word with its potential diacritics unreliable.

Is there a way to make plain search and regexp search ignore diacritics? I hope for an answer that can be extended to regexp search (C-M-s/C-M-r) and to the grep search that I use quite often in helm-projectile to look for a word in multi-file LaTeX projects.

Update
It would be nice if all of Emacs's search functions stripped accents and diacritics from the text before matching by default, whatever the language, with a prefix argument to turn this off on demand. Typically, when I search for something, I don't expect the best editor (Emacs) to fail just because of diacritics or accents that are rarely, if ever, needed for mundane text chores.

Look at the `ucs-normalize-*` functions in `lisp/international/ucs-normalize.el`. There is no pre-defined search folding for those, like there is with case folding, but you can at least normalize a region before searching it. A good implementation is probably a fairly complex task. — Ted Zlatanov, Feb 05 '15 at 14:12
@Name, Arabic has many more possible combinations of letters (26) with accents/diacritics, so it is not for Arabic. It seems there is no substitute for language-specific libraries. I can't believe that this has already been implemented in Microsoft Word and not in Emacs all those years back. — doctorate, Feb 05 '15 at 16:02
Arabic has about 80 diacritics and 26 letters, so enumerating all combinations is a daunting task. There must be some way to strip the text of its diacritics, as implemented in `php`: http://stackoverflow.com/a/25563250/1288722 - also implemented in `Javascript`: http://stackoverflow.com/a/7193622/1288722 — doctorate, Feb 05 '15 at 16:31
Thought: is it not possible to run the string through that php cleansing function and then pass the result to something similar to `helm-swoop`? — Sean Allred, Feb 09 '15 at 00:38
Have a look at this answer (it may help you): http://stackoverflow.com/questions/5224267/javascriptremove-arabic-text-diacritic-dynamically — , Jan 19 '16 at 12:35

score 5 · Answer 1 · edited May 23 '17 at 12:38

Here's a rough start, based on the list of combining characters in this answer (and then extended). (Marking this as community wiki — please edit and improve this!)

(defconst arabic-diacritics '(#x064b #x064c #x064d #x064e #x064f #x0650 #x0651 #x0652 #x0653 #x0654 #x0655 #x0670)
  "Unicode codepoints for Arabic combining characters.")
(defconst arabic-diacritics-regexp (regexp-opt (mapcar #'string arabic-diacritics)))

(defconst arabic-equivalents
  '(
    ;; "alef" is equivalent to "alef with hamza above" etc
    (#x0627 #x0623 #x0625 #x0622)))

;; (require 'cl-lib)    
;; (defun arabic-strip-diacritics (string)
;;   (cl-reduce (lambda (s c) (remove c s)) arabic-diacritics :initial-value string))

(defun arabic-search-without-diacritics (string)
  (interactive (list (read-string "Search for: " nil nil nil t)))
  (let ((regexp
         (apply #'concat
                (mapcar (lambda (c)
                          (let ((equivalents (assq c arabic-equivalents)))
                            (concat
                             (if equivalents
                                 (regexp-opt (mapcar #'string equivalents))
                               (regexp-quote (string c)))
                             arabic-diacritics-regexp "*")))
                        string))))
    (search-forward-regexp regexp)))

So if a buffer contains "الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ", and I evaluate (arabic-search-without-diacritics "الحمد لله رب العالمين"), it finds the text. It also works interactively, as M-x arabic-search-without-diacritics.
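
The same construction can also be factored into a function that only returns the regexp, which makes it testable without moving point. What follows is a minimal sketch rather than the answer's code: the `my/` names are hypothetical, the marks are the list above written as a character class, and only alef is folded to its variants.

```elisp
;; Sketch only: the `my/' names are hypothetical, not part of Emacs.

(defconst my/arabic-marks-regexp "[\u064b-\u0655\u0670]*"
  "Regexp matching zero or more of the Arabic combining marks listed above.")

(defconst my/arabic-alef-forms "[\u0627\u0623\u0625\u0622]"
  "Regexp matching plain alef or one of its hamza/madda variants.")

(defun my/arabic-fold-to-regexp (string &optional _lax)
  "Return a regexp that matches STRING with optional diacritics after each char.
The unused _LAX argument is there only so the function has the shape
expected of an isearch regexp function (an assumption, see below)."
  (mapconcat (lambda (c)
               (concat (if (eq c #x0627)
                           my/arabic-alef-forms
                         (regexp-quote (string c)))
                       my/arabic-marks-regexp))
             string ""))

;; Usage: an undecorated query matches fully pointed text.
(string-match-p (my/arabic-fold-to-regexp "الحمد لله رب العالمين")
                "الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ")
```

Assuming Emacs 25 or later, where `search-default-mode` may hold a function that takes the search string and returns a regexp, evaluating `(setq search-default-mode #'my/arabic-fold-to-regexp)` should make plain C-s and C-r use this folding, with the usual isearch highlighting. That part is untested here and should be treated as a starting point.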

Alternative approach:

Here's a full code example that demonstrates how diacritics and other nonspacing marks (Unicode general category Mn) can be removed from normalized strings with a regexp replacement. It works with the examples given and IMO is the right approach.

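(require 'cl-lib)        ; for `cl-loop' (the bare `loop' macro is obsolete)
(require 'ucs-normalize) ; for `ucs-normalize-NFKD-string'

(defun kill-marks (string)
  "Return STRING without its nonspacing marks (general category Mn)."
  (concat (cl-loop for c across string
                   unless (eq 'Mn (get-char-code-property c 'general-category))
                   collect c)))

;; After NFKD normalization and mark removal, two spellings that differ
;; only in their diacritics compare equal.
(let* ((original1 "your Arabic string here")
       (normalized1 (ucs-normalize-NFKD-string original1))
       (original2 "your other Arabic string here")
       (normalized2 (ucs-normalize-NFKD-string original2)))
  (equal
   ;; FIXEDCASE and LITERAL are t so the replacement is inserted verbatim.
   (replace-regexp-in-string "." #'kill-marks normalized1 t t)
   (replace-regexp-in-string "." #'kill-marks normalized2 t t)))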
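
One possible way to apply the mark-stripping idea to a whole buffer is to compare each line's stripped text against the stripped query. The sketch below is an illustration under stated assumptions, not an existing Emacs feature: `my/strip-marks` and `my/lines-matching-without-marks` are hypothetical names, and the function reports line numbers instead of moving point, because positions in the stripped text no longer line up with buffer positions.

```elisp
;; Sketch only: the `my/' names are hypothetical, not part of Emacs.
(require 'seq)
(require 'ucs-normalize)

(defun my/strip-marks (string)
  "Return STRING NFKD-normalized, with nonspacing marks (Mn) removed."
  (concat
   (seq-remove (lambda (c)
                 (eq (get-char-code-property c 'general-category) 'Mn))
               (ucs-normalize-NFKD-string string))))

(defun my/lines-matching-without-marks (query)
  "Return the numbers of lines in the current buffer that contain QUERY.
Both QUERY and each line are stripped of marks before comparison."
  (let ((needle (regexp-quote (my/strip-marks query)))
        (line 1)
        hits)
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (when (string-match-p
               needle
               (my/strip-marks
                (buffer-substring-no-properties (line-beginning-position)
                                                (line-end-position))))
          (push line hits))
        (forward-line 1)
        (setq line (1+ line))))
    (nreverse hits)))

;; Usage: search a scratch buffer for an undecorated word.
(with-temp-buffer
  (insert "first line\nالْحَمْدُ لِلَّهِ\nlast line")
  (my/lines-matching-without-marks "الحمد"))
```

Because NFKD decomposes alef with hamza above into plain alef followed by a combining hamza, this approach folds the alef variants without a separate equivalence table. Running it over several files would mean visiting each file in a temporary buffer, which is left out here.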

answered Feb 05 '15 at 17:52

legoscia


I added two more diacritics commonly used in Arabic to your nice list. This is the complete sorted list `1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1648` -- feel free to update. – doctorate Feb 05 '15 at 19:37
The first function `arabic-search-without-diacritics` works well but breaks on some words, such as `الأَ`; I don't know why. Another caveat: I always have to set the input method to Arabic when I enter my string in the minibuffer, whereas with `isearch-forward/backward` it is preserved. – doctorate Feb 05 '15 at 19:43
`kill-marks` is the better approach to provide hassle-free text ready for all kinds of search. What is unclear to me is how to implement that on a whole buffer and then on multifiles? – doctorate Feb 05 '15 at 19:51
I wish that `kill-marks` will be implemented as a full-fledged search function in Emacs soon. Good luck. – doctorate Feb 05 '15 at 20:15
Right, I made the function preserve the input method. For `الأَ`, I made the search consider match "alef with hamza above" whenever you search for "alef". Is that correct? (That seems to be a single pre-composed Unicode character.) – legoscia Feb 06 '15 at 12:29
You are right, `arabic-equivalents` is brilliant -- more on alef equivalents: `1571 - 1573 - 1570` should be equivalent to `1575` -- feel free to update. Do you know by chance where I can find a complete list of such Unicode numbers for Arabic? I can provide more such tricky equivalents. – doctorate Feb 06 '15 at 13:28
By googling I stumbled over this link: http://jrgraphix.net/r/Unicode/0600-06FF -- strangely, it has different numbers than those posted here. Any idea? – doctorate Feb 06 '15 at 13:32
I tested your updated `arabic-search-without-diacritics` and it works well; input method preserved and it could match that string `الأَ`. – doctorate Feb 06 '15 at 13:46
Unicode charts usually use hexadecimal numbers. I initially used decimal numbers, but that's probably confusing, so I just converted them all to hexadecimal, using `#x` which is the Emacs Lisp prefix for hexadecimal numbers. I added the alef equivalents; you can edit the post yourself if you find more of them. – legoscia Feb 06 '15 at 14:23
Thanks! is it possible to make it like `isearch-forward/backward` highlight all occurrences and the current one differently and by invoking `s` will move forward and `r` move backward? – doctorate Feb 06 '15 at 15:32
It should be possible to feed this regexp into the isearch regexp search, but I'm not sure how to do that... – legoscia Feb 06 '15 at 16:37
do you suggest accepting this answer and opening a new post for further improvements? – doctorate Feb 06 '15 at 18:54
No, I'll open a bounty on this question once it's eligible, and award it to whomever comes up with a solution that works for isearch. – legoscia Feb 06 '15 at 19:21
Discussion on emacs-devel: http://thread.gmane.org/gmane.emacs.devel/182483 – Ted Zlatanov Feb 09 '15 at 14:45
@legoscia thanks for all your efforts to find a solution to this problem, but could you please elaborate on the alternative approach? Is there any progress regarding it in later versions of Emacs? I came across `char-fold-ad-hoc` for `Isearch`, but how to get it to work for Arabic diacritics? – doctorate Nov 20 '22 at 15:44
Not sure. `char-fold-ad-hoc` looks promising, in that searching for a plain alef also matches alef with hamza, but I didn't find a way to make it ignore combining diacritics such as sukun. Perhaps @TedZlatanov knows more. – legoscia Nov 21 '22 at 18:32
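
The decimal and hexadecimal numbers quoted in the comments above refer to the same code points. As a small illustration (the `my/` constant names are hypothetical), the decimal lists can be converted to the `#x` notation used in the answer:

```elisp
;; Decimal code points as quoted in the comments (hypothetical constant names).
(defconst my/arabic-diacritics-decimal
  '(1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1648))
(defconst my/arabic-alef-forms-decimal '(1575 1571 1573 1570))

(defun my/to-hex-notation (codepoints)
  "Return CODEPOINTS as strings in Emacs Lisp hexadecimal notation."
  (mapcar (lambda (n) (format "#x%04x" n)) codepoints))

(my/to-hex-notation my/arabic-diacritics-decimal)
;; first element is "#x064b", last is "#x0670"

(my/to-hex-notation my/arabic-alef-forms-decimal)
;; → ("#x0627" "#x0623" "#x0625" "#x0622")
```

The results line up with `arabic-diacritics` and `arabic-equivalents` in the answer, and with the hexadecimal values shown in Unicode charts.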
