How to implement character folding (ignoring Arabic diacritics, or tashkeel, a.k.a. text normalization or cleansing) when searching in Emacs?

Question

This is a follow-up to a previous question here. I am looking for a solution to the very common task of searching for a string in a text irrespective of the vowel diacritics attached to its letters. The Arabic alphabet is made of 28 letters and 80 vowel diacritics (a.k.a. tashkeel or harakat).

Example Arabic text with diacritics

الَّذِي يَزْرَعُهُ الإِنْسَانُ إِيَّاهُ يَحْصُدُ أَيْضًا.

The same text without diacritics

الذي يزرعه الإنسان اياه يحصد ايضا.

Trials

As mentioned earlier, most trials in Emacs were posted here. The function devised there, arabic-search-without-diacritics, used to work back then but no longer does. An alternative approach was added later, but I don't know how to translate it into an elisp function that performs this diacritics-free kind of search.

Another approach, mentioned here, is character folding in Isearch, a feature first introduced in Emacs 25. The example used there was searching for e in a text with many accented e's: e ｅ㋎㋍ ⓔ ⒠ ⅇ ℯ ₑ ẽ ẽ ẻ ẻ ẹ ẹ ḛ ḛ ḙ ḙ ᵉ ȩ ȩ ȇ ȇ ȅ ȅ ě ě ę ę ė ė ĕ ĕ ē ē ë ë ê ê é é è è

Hit C-s and type e; the incremental search highlights some but not all of the e's. Now run M-x isearch-toggle-char-fold and search for e again: almost all e's are highlighted, even the accented ones. My question is how to get this wonderful feature of character folding to work for Arabic, given the 28 letters and 80 diacritics it has.

Note

In the Emacs wiki, there is something called char-fold-ad-hoc

Option ‘char-fold-ad-hoc’ lets you customize this set of ad hoc foldings. The default value is the same set provided by vanilla Emacs.

I wonder whether this option is relevant to what I am trying to achieve or not. Any help would be very much appreciated.

Update 1

There is also an option called char-fold-symmetric, mentioned on the masteringemacs.org website. However, that page doesn't say where to find this option, so I couldn't check whether it works with Arabic diacritics. According to this website:

Char folding is a hidden gem of a feature.

New char-folding options: 'char-fold-include' lets you add ad hoc foldings, 'char-fold-exclude' to remove foldings from default decomposition, and 'char-fold-symmetric' to search for any of an equivalence class of characters. For example, with a nil value of 'char-fold-symmetric' you can search for "e" to find "é", but not vice versa. With a non-nil value you can search for either, for example, you can search for "é" to find "e".

Useful keybinding found in isearch help

I looked up the isearch help and found that, after invoking isearch with C-s, you can press M-s ' to toggle character folding. Unfortunately it didn't work for Arabic, only for the (e) example above.

isearch for Arabic strings in action

For some strange reason I couldn't capture the char-folding problem with Arabic text in a GIF; that was only possible with the (e) example. This is just to confirm that char-folding doesn't work for all of the words in the Arabic sentence above.

keywords

character folding, text normalization, diacritic insensitive search, text cleansing

Update 2

I found this answer in this post very informative and useful, although it is in Java. Interestingly enough, it deals with the Tatweel character as well.

`char-fold+.el` was written when vanilla `char-fold.el` was first written. The latter was changed, so the former is no longer useful. As the *Commentary* in the file indicates, `char-fold+.el` is usable only for some old Emacs 25 builds. The code is only kept available on the wiki just in case it might prove useful for some recycling. I've updated the wiki page to make this clear. I suggest you edit your question to remove anything that pertains to `char-fold+.el`. Sorry for the trouble. — Drew, Nov 21 '22 at 17:53
@Drew, I have Emacs 28.1 now and I can still use `isearch-toggle-char-fold`, so character folding is still a viable option, and whether it can be extended to Arabic is left open to be answered in this post. However, regarding the input you thankfully provided, I am not sure what you mean: do you mean that `char-fold-ad-hoc` is no longer a viable option? — doctorate, Nov 22 '22 at 06:15
Emacs-28.1 has `char-fold-include` that you can customize, but that's a bit klunky to use. Am I correct that any of the vowel diacritics can be used on any of the letters? If so, I'm sure we can come up with some lisp to pre-fill `char-fold-include` for you. — rpluim, Nov 22 '22 at 13:54
I am not sure if I understand things correctly, but I noticed that when inserting the example string including diacritics in some buffer, and then do `C-x =`, Emacs shows me the character without diacritics. Also, it looks to me that the diacritics are 'implemented' by [composing characters](https://en.wikipedia.org/wiki/Unicode_equivalence#Combining_and_precomposed_characters). Anyway, as the diacritics seem to get ignored when checking one by one (scrolling using the keyboard) the cursor position using `C-x =`, I guess it should not be too hard to ignore them also when searching. — dalanicolai, Nov 22 '22 at 15:25
When scrolling using `C-f`, it is bound to `forward-char`, but the cursor moves forward 'per column', ignoring the diacritics. When using `forward-char` programmatically, then Emacs does not ignore the diacritics (also not when calling with `call-interactively`). I don't know where this different behavior comes from. Anyway, as somehow, Emacs already seems able to ignore the diacritics, I would just ask for help directly in the [Emacs devel mailing list](https://lists.gnu.org/mailman/listinfo/emacs-devel). It looks to me that there should be some 'easy' way to achieve this. — dalanicolai, Nov 22 '22 at 15:31
As far as I can tell, if the text has base characters and combining characters only, then isearch works fine. char-folding is for when you have a precomposed character that you want to have treated as the base character, which also works out of the box for me. So, please tell us for which precise sequence of characters it's not working for you. — rpluim, Nov 22 '22 at 15:40
`char-fold-ad-hoc` is only defined in `char-fold+.el`, so yes, it's not useful for Emacs 28. — Drew, Nov 22 '22 at 16:33
The Emacs help mailing list (`help-gnu-emacs@gnu.org`) is generally a better place to ask for user help than is the Emacs dev mailing list (`emacs-devel@gnu.org`). The latter is a better place to propose/discuss possible improvements to Emacs. And `M-x report-emacs-bug` is the best place to request an enhancement (as well as to report a bug). — Drew, Nov 22 '22 at 16:36
Does it work better if you customize `search-default-mode` to `char-fold-to-regexp`? (don't use `setq`, use customize) — rpluim, Nov 22 '22 at 18:13
@rpluim, I updated the post; char-folding isearch didn't work for all of the Arabic sequences in the provided example. I would be glad to see `char-fold-include` solve this problem, but how can I look this option up? Is it a variable or a function? I didn't find it in the apropos of Emacs. I will try `char-fold-to-regexp` via customize, as you thankfully pointed out, and let you know. — doctorate, Nov 22 '22 at 18:21
Can you be more specific as to what doesn't work? I.e. "I searched for , I expected to match and ", then we can see what's going on. — rpluim, Nov 22 '22 at 18:22
@rpluim I first tried typing the first word `الذي` without diacritics of course hoping to match `الَّذِي` but it failed. I also tried all other words one by one to no avail. Just to give another example, I typed `الانسان` to match `الإِنْسَانُ` and it failed too. — doctorate, Nov 22 '22 at 18:26
by the way, it is almost impossible to come up with a map of which diacritics are more suitable being on which basic letters in Arabic, a much better strategy is to assume all 80 diacritics can be associated with any of the 28 basic letters. — doctorate, Nov 22 '22 at 18:30
OK, I think you can report this as a bug: it works with single characters, eg `ع` finds `ع` and `عُ`, but `رع` doesn't find `رَعُ` (I think I've got those right :-) ) — rpluim, Nov 22 '22 at 18:39
@rpluim exactly, works for individual chars but not for string sequence. I also found out that `char-fold-symmetric` is set to `non-nil` in my setting. Clicking on `char-fold-include` crashed my emacs. — doctorate, Nov 22 '22 at 18:44
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/140806/discussion-between-rpluim-and-doctorate). — rpluim, Nov 22 '22 at 18:45
@rpluim, Hi, is there anything new from Emacs development team? I updated the post with **Update 2** that points out a link in Java that could be useful to come up with a reliable solution, pls have a look. Plus I have generated a list of combining characters replacements with simple ones AND diacritics removal list. — doctorate, Dec 07 '22 at 07:18
@rpluim `\u0674 \u0618 \u0619 \u061A \u0640 \u064B \u064C \u064D \u064E \u064F \u0650 \u0651 \u0652 \u0653 \u0654 \u0655 \u0656 \u0657 \u0658 \u0659 \u065A \u065B \u065C \u065D \u065E \u065F \u0670 \u0624>\u0648 \u0629>\u0647 \u064A>\u0649 \u0626>\u0649` this first part is removal of diacritics. — doctorate, Dec 07 '22 at 08:29
@rpluim `\u0622>\u0627 \u0623>\u0627 \u0625>\u0627 \u0671>\u0627 \u0672>\u0627 \u0673>\u0627 \u0675>\u0627 \uFEF5>\uFEFB \uFEF7>\uFEFB \uFEF9>\uFEFB \uFEF6>\uFEFC \uFEF8>\uFEFC \uFEFA>\uFEFC` this is the second part with `>` means replacement combining letters with their cognate base letters. — doctorate, Dec 07 '22 at 08:31
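The two lists in the comments above can be read as a normalization table: one set of code points to delete, and one set of replacements. A minimal sketch of that idea follows, assuming the goal is a plain string-to-string cleanup rather than a change to Isearch itself. The function name `my/arabic-normalize` is hypothetical, and only part of the posted lists is covered: the removal list and the alef replacements. The waw, teh marbuta and yeh replacements and the lam-alef presentation forms are left out.

```elisp
;; Hypothetical helper, not part of Emacs: strip tashkeel/tatweel and
;; fold alef variants, following a subset of the lists in the comments.
(defun my/arabic-normalize (text)
  "Return TEXT with Arabic diacritics and tatweel removed and alef variants folded."
  (let* ((marks "[\u0618-\u061A\u0640\u064B-\u065F\u0670\u0674]")
         (alefs "[\u0622\u0623\u0625\u0671\u0672\u0673\u0675]")
         ;; First delete every mark listed for removal.
         (no-marks (replace-regexp-in-string marks "" text t t)))
    ;; Then replace each alef variant with bare alef (U+0627).
    (replace-regexp-in-string alefs "\u0627" no-marks t t)))
```

With this, both the search string and a copy of the text could be normalized before comparing them, at the cost of losing the mapping back to positions in the original buffer.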

score 1 · Answer 1 · answered Nov 23 '22 at 18:13

It should be pretty easy to adapt this solution from Hebrew to Arabic by replacing the base letters and diacritics with the corresponding Arabic characters. Here is a complete working solution for Hebrew:

(require 'char-fold)
(setq char-fold-symmetric t)

;; Allow the search to match nekudot
;; (one or two consecutive marks, since shin dot plus a vowel needs two)
;; https://en.wikipedia.org/wiki/Nikkud
(let ((letters
       '(
         ?\N{HEBREW LETTER ALEF}
         ?\N{HEBREW LETTER BET}
         ?\N{HEBREW LETTER GIMEL}
         ?\N{HEBREW LETTER DALET}
         ?\N{HEBREW LETTER HE}
         ?\N{HEBREW LETTER VAV}
         ?\N{HEBREW LETTER ZAYIN}
         ?\N{HEBREW LETTER HET}
         ?\N{HEBREW LETTER TET}
         ?\N{HEBREW LETTER YOD}
         ?\N{HEBREW LETTER KAF}
         ?\N{HEBREW LETTER FINAL KAF}
         ?\N{HEBREW LETTER LAMED}
         ?\N{HEBREW LETTER MEM}
         ?\N{HEBREW LETTER FINAL MEM}
         ?\N{HEBREW LETTER NUN}
         ?\N{HEBREW LETTER FINAL NUN}
         ?\N{HEBREW LETTER SAMEKH}
         ?\N{HEBREW LETTER AYIN}
         ?\N{HEBREW LETTER PE}
         ?\N{HEBREW LETTER FINAL PE}
         ?\N{HEBREW LETTER TSADI}
         ?\N{HEBREW LETTER FINAL TSADI}
         ?\N{HEBREW LETTER QOF}
         ?\N{HEBREW LETTER RESH}
         ?\N{HEBREW LETTER SHIN}
         ?\N{HEBREW LETTER TAV}
         ))
      (nekudot
       '(
         ?\N{HEBREW POINT SHEVA}
         ?\N{HEBREW POINT HATAF SEGOL}
         ?\N{HEBREW POINT HATAF PATAH}
         ?\N{HEBREW POINT HATAF QAMATS}
         ?\N{HEBREW POINT HIRIQ}
         ?\N{HEBREW POINT TSERE}
         ?\N{HEBREW POINT SEGOL}
         ?\N{HEBREW POINT PATAH}
         ?\N{HEBREW POINT QAMATS}
         ?\N{HEBREW POINT HOLAM}
         ?\N{HEBREW POINT QUBUTS}
         ?\N{HEBREW POINT DAGESH OR MAPIQ}
         ?\N{HEBREW POINT RAFE}
         ?\N{HEBREW POINT SHIN DOT}
         ?\N{HEBREW POINT SIN DOT}
         )))
  (setq char-fold-include
        (append char-fold-include
                (mapcan (lambda (letter)
                          (mapcan (lambda (nikkud1)
                                    (cons (list letter (string letter nikkud1))
                                          (mapcan (lambda (nikkud2)
                                                    (list (list letter (string letter nikkud1 nikkud2))
                                                          (list letter (string letter nikkud2 nikkud1))))
                                                  nekudot)))
                                  nekudot))
                        letters))))


;; https://en.wikipedia.org/wiki/Romanization_of_Hebrew
(setq char-fold-include
      (append char-fold-include
              '(
                (?\N{HEBREW LETTER ALEF} "'")
                (?\N{HEBREW LETTER BET} "v" "b" "bb")
                (?\N{HEBREW LETTER GIMEL} "g" "gg")
                (?\N{HEBREW LETTER DALET} "d" "dd")
                (?\N{HEBREW LETTER HE} "h")
                (?\N{HEBREW LETTER VAV} "v" "vv" "w" "ww" "o" "u")
                (?\N{HEBREW LETTER ZAYIN} "z" "zz")
                (?\N{HEBREW LETTER HET} "ch" "kh")
                (?\N{HEBREW LETTER TET} "t" "tt")
                (?\N{HEBREW LETTER YOD} "y" "yy")
                (?\N{HEBREW LETTER KAF} "kh" "kk")
                (?\N{HEBREW LETTER FINAL KAF} "kh" "kk")
                (?\N{HEBREW LETTER LAMED} "l" "ll")
                (?\N{HEBREW LETTER MEM} "m" "mm")
                (?\N{HEBREW LETTER FINAL MEM} "m" "mm")
                (?\N{HEBREW LETTER NUN} "n" "nn")
                (?\N{HEBREW LETTER FINAL NUN} "n" "nn")
                (?\N{HEBREW LETTER SAMEKH} "s" "ss")
                (?\N{HEBREW LETTER AYIN} "'")
                (?\N{HEBREW LETTER PE} "p" "pp")
                (?\N{HEBREW LETTER FINAL PE} "p" "pp")
                (?\N{HEBREW LETTER TSADI} "ts" "tz")
                (?\N{HEBREW LETTER FINAL TSADI} "ts" "tz")
                (?\N{HEBREW LETTER QOF} "k" "kk" "q" "qq")
                (?\N{HEBREW LETTER RESH} "r" "rr")
                (?\N{HEBREW LETTER SHIN} "s" "ss" "sh")
                (?\N{HEBREW LETTER TAV} "t" "tt")
                )))

(char-fold-update-table)
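For Arabic, the same construction might look roughly like the sketch below. It is an untested adaptation, not the answer's own code: the helper name `my/arabic-fold-entry` is hypothetical, the letters are assumed to be the ranges U+0621 to U+063A and U+0641 to U+064A, and the marks are limited to the eight common tashkeel (U+064B to U+0652) plus superscript alef (U+0670). It only folds a letter with one or two following marks, so it does not make the different forms of alef equivalent and does not skip Tatweel.

```elisp
(require 'char-fold)
(setq char-fold-symmetric t)

;; Hypothetical helper: build one `char-fold-include' entry of the form
;; (LETTER STRING...) covering LETTER followed by one or two of MARKS.
(defun my/arabic-fold-entry (letter marks)
  "Return an entry folding LETTER with one or two different MARKS after it."
  (let (strings)
    (dolist (m1 marks)
      (push (string letter m1) strings)
      (dolist (m2 marks)
        (unless (eq m1 m2)
          (push (string letter m1 m2) strings))))
    (cons letter (nreverse strings))))

(let ((letters (append (number-sequence #x0621 #x063A)   ; hamza .. ghain
                       (number-sequence #x0641 #x064A))) ; feh .. yeh
      (marks (append (number-sequence #x064B #x0652)     ; fathatan .. sukun
                     (list #x0670))))                    ; superscript alef
  (setq char-fold-include
        (append char-fold-include
                (mapcar (lambda (letter) (my/arabic-fold-entry letter marks))
                        letters))))

(char-fold-update-table)
```

Grouping all strings for a letter into a single entry keeps the list to one entry per letter instead of one per combination.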

OK, nekudot is the equivalent of tashkeel in Arabic, but what are `nikkud1` and `nikkud2`? What do they represent in your code? — doctorate, Nov 24 '22 at 07:03
Of great help are the keystrokes C-x 8 RET (insert-char) followed by C-c TAB, which will yank the already provided names of code points into the buffer. — doctorate, Nov 24 '22 at 07:11
Adapted; it works, but suboptimally. Arabic has more nuances, e.g. الَّذِي يَزْرَعُهُ الإِنْسَانُ إِيَّاهُ يَحْصُدُ أَيْضًا. First, it failed when I searched for الانسان because alef needs to be treated as equivalent to all other forms of alef (with hamza etc.); second, it fails when Tatweel (U+0640) is present. — doctorate, Nov 24 '22 at 18:29
more radical solution is needed to address this complicated issue. — doctorate, Nov 24 '22 at 18:35
The Tatweel character (U+0640) is a challenging character in my opinion: it is not a letter of the Arabic alphabet, yet it sits in a word like a base letter, once or multiple times, to lengthen the word. It carries no meaning or grammar, just aesthetics, so it is neither a base letter nor a diacritic. One possible way to avoid search mismatches caused by Tatweel, I guess, is a regex replacement that removes all of them, i.e. Tatweel cleansing, before doing the char-fold search. — doctorate, Nov 25 '22 at 07:35
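An alternative to cleansing the buffer beforehand would be to make the search pattern itself tolerant of Tatweel and tashkeel. The sketch below assumes that any run of U+0640, U+064B to U+065F and U+0670 may be ignored after each typed character. The names `my/arabic-ignorable` and `my/arabic-loose-regexp` are hypothetical, and the alef equivalences mentioned above are not handled.

```elisp
;; Hypothetical names; a sketch of a diacritic- and tatweel-tolerant regexp.
(defconst my/arabic-ignorable "[\u0640\u064B-\u065F\u0670]*"
  "Regexp matching any run of tatweel and common Arabic diacritics.")

(defun my/arabic-loose-regexp (string &optional _lax)
  "Return a regexp matching STRING with ignorable marks allowed after each char."
  (mapconcat (lambda (ch)
               (concat (regexp-quote (char-to-string ch))
                       my/arabic-ignorable))
             string ""))

;; Example use from Lisp: search forward for reh followed by ain,
;; regardless of any marks or tatweel between or after them.
;; (re-search-forward (my/arabic-loose-regexp "\u0631\u0639") nil t)
```

The `(STRING &optional LAX)` argument list mirrors what Isearch expects from a regexp function, so it may be possible to use this as the value of `search-default-mode`, but that is an assumption that would need testing.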
