In Persian numerals, ۰۱۲۳۴۵۶۷۸۹ is equivalent to 0123456789 in European digits.
How can I convert a Persian number (in UTF-8) to ASCII?
For example, I want ۲۱ to become 21.
Since it's a fixed set of numbers, you can do it by hand:
$ echo ۲۱ | LC_ALL=en_US.UTF-8 sed -e 'y/۰۱۲۳۴۵۶۷۸۹/0123456789/'
21
(or with tr, though GNU tr does not support this yet)
Setting your locale to en_US.utf8 (or, better, to the locale your character set belongs to) is required for sed to recognize the characters.
With perl:
$ echo "۲۱" |
perl -CS -MUnicode::UCD=num -MUnicode::Normalize -lne 'print num(NFKD($_))'
21
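For comparison, a roughly similar idea can be sketched in Python with the standard `unicodedata` module. This is not a port of the perl one-liner: it maps each character individually instead of computing the numeric value of the whole string, and the function name `to_ascii_digits` is made up for illustration.

```python
import unicodedata


def to_ascii_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        # unicodedata.decimal returns the digit value (0-9) for characters
        # with a decimal-digit property, or the default for everything else.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


print(to_ascii_digits("۲۱"))  # 21
```

Characters that are not decimal digits pass through unchanged, so the sketch can be applied to mixed text.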
LC_ALL is needed so that every single Unicode character is treated as such by sed, right?
– phk
Jun 19 '16 at 12:03
tr
for this exact purpose?
– Kevin
Jun 19 '16 at 15:26
tr
how it does not work everywhere. Also keep in mind that some tools are optimized for dealing with bytes while others are optimized for dealing with characters; with Unicode (especially UTF-8) this makes a huge difference.
– phk
Jun 19 '16 at 15:28
LC_ALL
. LC_ALL
is also not set in my environment (but LANG
is set to en_GB.UTF-8
). With the above code, I get the error “sed: 1: "y/۰۱۲۳۴۵۶۷۸۹/ ...": transform strings are not the same length”.
– Konrad Rudolph
Jun 20 '16 at 15:32
en_US.utf8
. What command did you run? How about setting LC_ALL=en_GB.UTF-8
?
– cuonglm
Jun 20 '16 at 15:42
utf8
with UTF-8
it works. In fact, I’ve never heard of the spelling without the dash. Might this be a typo in your answer or do some systems use that name? (EDIT, found this: http://superuser.com/a/999151/2269)
– Konrad Rudolph
Jun 20 '16 at 15:44
For Python there is the unidecode
library which handles such conversions in general: https://pypi.python.org/pypi/Unidecode.
In Python 2:
>>> from unidecode import unidecode
>>> unidecode(u"۰۱۲۳۴۵۶۷۸۹")
'0123456789'
In Python 3:
>>> from unidecode import unidecode
>>> unidecode("۰۱۲۳۴۵۶۷۸۹")
'0123456789'
The SO thread at https://stackoverflow.com/q/8087381/2261442 might be related.
/edit:
As Wander Nauta pointed out in the comments, and as mentioned on the Unidecode page, there is also a shell version of unidecode (under /usr/local/bin/ if installed via pip):
$ echo '۰۱۲۳۴۵۶۷۸۹' | unidecode
0123456789
unidecode
which does the same as your Python 3 snippet. Just echo '۰۱۲۳۴۵۶۷۸۹' | unidecode
should work.
– Wander Nauta
Jun 20 '16 at 11:43
unidecode/util.py
- strange that Debian doesn't include it. (Edit: Ah, mystery solved. The Debian package is out of date and older than the utility.)
– Wander Nauta
Jun 20 '16 at 15:31
A pure bash version:
#!/bin/bash
number="$1"
number=${number//۱/1}
number=${number//۲/2}
number=${number//۳/3}
number=${number//۴/4}
number=${number//۵/5}
number=${number//۶/6}
number=${number//۷/7}
number=${number//۸/8}
number=${number//۹/9}
number=${number//۰/0}
echo "Result is $number"
I have tested this on my Gentoo machine and it works.
./convert ۱۳۲
Result is 132
Done as a loop, given the list of characters (from 0 to 9) to convert:
#!/bin/bash
conv() ( LC_ALL=en_US.UTF-8
local n="$2"
for ((i=0;i<${#1};i++)); do
n=${n//"${1:i:1}"/"$i"}
done
printf '%s\n' "$n"
)
conv "۰۱۲۳۴۵۶۷۸۹" "$1"
And used as:
$ convert ۱۳۲
132
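The same position-equals-value idea can be sketched in Python. The names `conv` and `digits` simply mirror the shell function above; the sketch assumes the digit string is given in order from 0 to 9.

```python
def conv(digits: str, text: str) -> str:
    """Replace each character of `digits` with its index in that string."""
    for value, ch in enumerate(digits):
        # The character at position 0 becomes "0", position 1 becomes "1", ...
        text = text.replace(ch, str(value))
    return text


print(conv("۰۱۲۳۴۵۶۷۸۹", "۱۳۲"))  # 132
```

Because the digit list is a parameter, the same function could be called with a different set of ten digits.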
Another (rather overkill) way using grep
:
#!/bin/bash
nums=$(echo "$1" | grep -o .)
result=()
for i in $nums
do
case $i in
۱)
result+=1
;;
۲)
result+=2
;;
۳)
result+=3
;;
۴)
result+=4
;;
۵)
result+=5
;;
۶)
result+=6
;;
۷)
result+=7
;;
۸)
result+=8
;;
۹)
result+=9
;;
۰)
result+=0
;;
esac
done
echo "Result is $result"
grep
. In fact, I don't understand that line, nor why you do not set result=0
. Are you being overly cautious in case $1
contains things other than Farsi digits?
– Kusalananda
Jun 20 '16 at 06:56
number=${number//۱/1}
etc., and would avoid the echo
and grep
.
– Kusalananda
Jun 20 '16 at 07:06
We can take advantage of the fact that the Unicode code points of the Persian numerals are consecutive and ordered from 0 to 9:
$ printf '%b' '\U06F'{0..9}
۰۱۲۳۴۵۶۷۸۹
That means that the last hex digit IS the decimal value:
$ echo $(( $(printf '%d' "'۲") & 0xF ))
2
That makes this simple loop a conversion tool:
#!/bin/bash
( ### Use a locale that uses UTF-8 to make the script more reliable.
### Maybe something like LC_ALL=fa_IR.UTF-8 for you?
LC_ALL=en_US.UTF-8
a="$1"
while (( ${#a} > 0 )); do
# extract the last hex digit from the UNICODE code point
# of the first character in the string "$a":
printf '%d' $(( $(printf '%d' "'$a") & 15 ))
a=${a#?} ## Remove one character from $a
done
)
echo
Using it as:
$ sefr.sh ۰۱۲۳۴۵۶۷۸۹
0123456789
$ sefr.sh ۲۰۱
201
$ sefr.sh ۲۱
21
Note that this code also converts Arabic and Latin numerals (even if mixed):
$ sefr.sh ۴4٤۵5٥۶6٦۷7٧۸8٨۹9٩
444555666777888999
$ sefr.sh ٤٧0٠٦7١٣3٥۶٦۷
4700671335667
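The low-nibble trick can also be sketched in Python. One deliberate difference from the shell script: this sketch only applies the mask to the three digit ranges (Persian U+06F0 to U+06F9, Arabic U+0660 to U+0669, ASCII 0 to 9) and leaves every other character alone, since masking an arbitrary character would produce a meaningless number. The function name `low_nibble_digits` is made up for illustration.

```python
PERSIAN = range(0x06F0, 0x06FA)  # Extended Arabic-Indic digits
ARABIC = range(0x0660, 0x066A)   # Arabic-Indic digits
ASCII = range(0x30, 0x3A)        # 0-9


def low_nibble_digits(text: str) -> str:
    """Convert digits by keeping the last hex digit of the code point."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in PERSIAN or cp in ARABIC or cp in ASCII:
            # All three blocks start at a code point ending in hex 0,
            # so the low 4 bits are the digit's decimal value.
            out.append(str(cp & 0xF))
        else:
            out.append(ch)
    return "".join(out)


print(low_nibble_digits("۲۰۱"))  # 201
```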
'۰
. It could have been written also as '"۰'
. The reason is that printf will give the UNICODE code point if the argument starts with a single quote '
or a double quote "
. Search a little before this link for the text "If the leading character is a single-quote or double-quote"
–
Jun 29 '16 at 03:35
Since iconv
can't seem to grok this, the next port of call would be to use the tr
utility:
$ echo "۲۱" | tr '۰۱۲۳۴۵۶۷۸۹' '0123456789'
21
tr
translates one set of characters to another, so we simply tell it to translate the set of Farsi digits to the set of Latin digits.
EDIT: As user @cuonglm points out, this requires a non-GNU tr (for example, the tr on a Mac), and it also requires that $LC_CTYPE is set to en_US.UTF-8.
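A small Python sketch may help show why a byte-oriented tool struggles with this mapping while a character-oriented one does not. It assumes nothing beyond the standard library; `str.maketrans` and `str.translate` play the role of a character-aware tr here.

```python
persian = "۰۱۲۳۴۵۶۷۸۹"
ascii_digits = "0123456789"

# Ten characters, but twenty bytes in UTF-8: each Persian digit takes
# two bytes, so a byte-wise 1:1 mapping onto ten ASCII bytes cannot line up.
print(len(persian))                  # 10
print(len(persian.encode("utf-8")))  # 20

# A character-level translation table maps each digit as a whole.
table = str.maketrans(persian, ascii_digits)
print("۲۱".translate(table))         # 21
```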
en_US.utf8
.
– cuonglm
Jun 19 '16 at 12:07
numconv is in the repositories of some Linux distributions (Debian and Ubuntu, at least). Install numconv.
$ echo '۱۲۳۴۵۶۷۸۹۰' | numconv
1234567890
(Edit: Note that leading zeros are removed, and that this is purely for numeric conversion, and will not work with streams that contain non-numeric characters as well.)
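The same trade-off between numeric conversion and character translation can be seen in Python, whose built-in `int()` accepts Unicode decimal digits. This is only an illustration of the behaviour described above, not a description of how numconv works internally; the helper name `numeric_value` is made up.

```python
def numeric_value(text: str) -> int:
    """Parse a string of decimal digits (in any script) as an integer."""
    # int() understands Unicode decimal digits, but it produces a number,
    # so leading zeros are lost and non-numeric input raises ValueError.
    return int(text)


print(numeric_value("۰۱۲۳۴۵۶۷۸۹"))  # 123456789

try:
    numeric_value("۲۱ apples")
except ValueError:
    print("not a pure number")
```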
echo "۰۱۲۳۴۵۶۷۸۹" | iconv -f UTF-8 -t ascii//TRANSLIT
doesn't handle it...
– Kusalananda
Jun 19 '16 at 11:52
iconv
is just here to map characters in different encodings, but these are characters (Eastern Arabic numerals) that have no equivalent in ASCII. You can just convert them to something similar enough, but it's one-way only.
– phk
Jun 19 '16 at 12:20
iconv
was capable and not capable of doing. I was hoping that using //TRANSLIT
would help, but it didn't.
– Kusalananda
Jun 19 '16 at 12:25