This is a follow-up to a previous question here, looking for a solution to the very common task of searching for a string in a text irrespective of the vowel diacritics attached to its letters. The Arabic alphabet is made of 28 letters and 8 vowel diacritics (a.k.a. tashkeel or harakat).
Example Arabic text with diacritics
الَّذِي يَزْرَعُهُ الإِنْسَانُ إِيَّاهُ يَحْصُدُ أَيْضًا.
The same text without diacritics
الذي يزرعه الإنسان اياه يحصد ايضا.
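To make the goal concrete, here is a minimal sketch in Python (not Emacs Lisp, and not the function from the earlier post) of what "without diacritics" means at the code point level. The range U+064B to U+0652 and the name `strip_tashkeel` are my own assumptions for illustration; Tatweel and hamza variants are deliberately left alone.

```python
import re

# Assumption: the diacritics to ignore are U+064B (fathatan) through U+0652 (sukun).
TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text: str) -> str:
    """Return `text` with the Arabic vowel diacritics removed."""
    return TASHKEEL.sub("", text)


# First word of the example sentence, written with escapes to avoid ambiguity.
word = "\u0627\u0644\u064E\u0651\u0630\u0650\u064A"  # الَّذِي
print(strip_tashkeel(word))  # الذي
```

Stripping both the query and the text like this is enough for a yes/no match, but it loses the positions in the original text, which is why a search command needs something closer to character folding.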
Trials
As mentioned earlier, most trials in Emacs were posted here. The devised function arabic-search-without-diacritics used to work back then, but not anymore. An alternative approach was added later, but I don't know for sure how to translate it into an elisp function that achieves this diacritics-free kind of search.
Another approach mentioned here goes under the name of character folding with Isearch; this feature was first introduced in Emacs 25. The example used there was looking for e
in a text with many accented e's: e e ㋎ ㋍ ⓔ ⒠ ⅇ ℯ ₑ ẽ ẽ ẻ ẻ ẹ ẹ ḛ ḛ ḙ ḙ ᵉ ȩ ȩ ȇ ȇ ȅ ȅ ě ě ę ę ė ė ĕ ĕ ē ē ë ë ê ê é é è è
Hit C-s and type e, and the incremental search will highlight some but not all e's. Now run M-x isearch-toggle-char-fold and search for e again: almost all e's will be highlighted, even the accented ones. My question is how to achieve this wonderful feature of character folding in the Arabic language, given the 28 letters and 8 diacritics it has?
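The behaviour I am after can be sketched outside Emacs as a regular expression that tolerates any run of diacritics after each letter of the query. This is only a Python illustration of the idea, under the assumption that the diacritics are U+064B to U+0652; `fold_pattern` is a hypothetical helper name, not an Emacs or library function.

```python
import re

# Assumption: the diacritics to ignore are U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = "\u064B-\u0652"


def fold_pattern(query: str) -> "re.Pattern[str]":
    """Build a regexp matching `query` regardless of Arabic diacritics."""
    # Drop any diacritics typed in the query itself.
    bare = re.sub(f"[{DIACRITICS}]", "", query)
    # Allow zero or more diacritics after every remaining character.
    return re.compile("".join(re.escape(ch) + f"[{DIACRITICS}]*" for ch in bare))


# Two words with diacritics, written with escapes: الَّذِي يَزْرَعُهُ
text = ("\u0627\u0644\u064E\u0651\u0630\u0650\u064A "
        "\u064A\u064E\u0632\u0652\u0631\u064E\u0639\u064F\u0647\u064F")
match = fold_pattern("\u0627\u0644\u0630\u064A").search(text)  # query: الذي
print(match.span() if match else None)
```

Unlike stripping the text first, the match positions refer to the original text, which is what an incremental search needs for highlighting. In Emacs the same kind of regexp would presumably have to be produced by whatever function Isearch uses to turn the typed string into a pattern.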
Note
In the Emacs wiki, there is something called char-fold-ad-hoc:
Option ‘char-fold-ad-hoc’ lets you customize this set of ad hoc foldings. The default value is the same set provided by vanilla Emacs.
I wonder whether this option is relevant to what I am trying to achieve or not. Any help would be very much appreciated.
Update 1
There is also what is called char-fold-symmetric, mentioned on the masteringemacs.org website. However, it wasn't mentioned there where to find this option to see whether it works with Arabic diacritics or not. According to this website:
Char folding is a hidden gem of a feature.
New char-folding options: 'char-fold-include' lets you add ad hoc foldings, 'char-fold-exclude' to remove foldings from default decomposition, and 'char-fold-symmetric' to search for any of an equivalence class of characters. For example, with a nil value of 'char-fold-symmetric' you can search for "e" to find "é", but not vice versa. With a non-nil value you can search for either, for example, you can search for "é" to find "e".
Useful keybinding found in isearch help
I looked up the isearch help and found out that after invoking isearch with C-s you can press M-s ' to toggle character folding, but unfortunately it didn't work for the Arabic case, only for the (e) example above.
isearch for Arabic strings in action
For some strange reason I couldn't capture the problem of char-folding in Arabic text in a GIF; it was only possible with the (e) example. This is just to confirm that char-folding doesn't work for all of the words in the Arabic sentence above.
keywords
character folding, text normalization, diacritic insensitive search, text cleansing
Update 2
I found this answer in this post very informative and useful, although it is in Java. Interestingly enough, it deals with the Tatweel character as well.