How can I convert Persian numerals in UTF-8 to European numerals in ASCII?

Question

In Persian numerals, ۰۱۲۳۴۵۶۷۸۹ is equivalent to 0123456789 in European digits.

How can I convert Persian numbers (in UTF-8) to ASCII?

For example, I want ۲۱ to become 21.

Interesting, it seems like echo "۰۱۲۳۴۵۶۷۸۹" | iconv -f UTF-8 -t ascii//TRANSLIT doesn't handle it... — Kusalananda, Jun 19 '16 at 11:52
@Kusalananda: Is it really that unexpected? As I understood it iconv is just here to map characters in different encodings, but these are characters (Eastern Arabic numerals) that have no equivalent in ASCII, you can just convert them to something similar enough but it's one-way only. — phk, Jun 19 '16 at 12:20
Well, I wasn't quite sure what iconv was and wasn't capable of doing. I was hoping that using //TRANSLIT would help, but it didn't. — Kusalananda, Jun 19 '16 at 12:25
Do you also need to reverse the order? I know that Arabic numerals are written little-endian right-to-left, and Latin numerals are big-endian left-to-right (looking similar in print or on screen, but reversed in memory). Is Persian the same? — Toby Speight, Jun 20 '16 at 12:13
@TobySpeight: No reversal is needed. Arabic and Persian numbers are written left-to-right like European digits; only the alphabet is written right-to-left. — Baba, Jun 20 '16 at 19:11

score 30 · Answer 1 · edited Apr 13 '17 at 12:36

Since it's a fixed set of numbers, you can do it by hand:

$ echo ۲۱ | LC_ALL=en_US.UTF-8 sed -e 'y/۰۱۲۳۴۵۶۷۸۹/0123456789/'
21

(You could also use tr, but not GNU tr, which does not yet support multi-byte characters.)

Setting the locale to en_US.UTF-8 (or, better, to the locale your character set belongs to) is required for sed to recognize the characters.

With perl:

$ echo "۲۱" |
  perl -CS -MUnicode::UCD=num -MUnicode::Normalize -lne 'print num(NFKD($_))'
21

answered Jun 19 '16 at 11:58 · cuonglm

Setting LC_ALL is needed so that every single Unicode character is treated as such by sed, right? – phk Jun 19 '16 at 12:03
@phk: Yes, see the update. – cuonglm Jun 19 '16 at 12:04
Why must everything be a sed script? Didn't we invent tr for this exact purpose? – Kevin Jun 19 '16 at 15:26
@Kevin See the other answer involving tr how it does not work everywhere. Also keep in mind that some tools are optimized for dealing with bytes while others are for dealing with characters, with Unicode (especially UTF-8) this makes a huge difference. – phk Jun 19 '16 at 15:28
This doesn’t work for me on OS X 10.10.5/GNU bash 4.3. Weirdly enough I need to remove the explicit setting of LC_ALL. LC_ALL is also not set in my environment (but LANG is set to en_GB.UTF-8). With the above code, I get the error “sed: 1: "y/۰۱۲۳۴۵۶۷۸۹/ ...": transform strings are not the same length”. – Konrad Rudolph Jun 20 '16 at 15:32
@KonradRudolph: Check if your locale has en_US.utf8. What command did you run? How about setting LC_ALL=en_GB.UTF-8? – cuonglm Jun 20 '16 at 15:42
@cuonglm Indeed, when I replace utf8 with UTF-8 it works. In fact, I’ve never heard of the spelling without the dash. Might this be a typo in your answer or do some systems use that name? (EDIT, found this: http://superuser.com/a/999151/2269) – Konrad Rudolph Jun 20 '16 at 15:44

score 16 · Answer 2 · edited May 23 '17 at 12:39

For Python there is the unidecode library which handles such conversions in general: https://pypi.python.org/pypi/Unidecode.

In Python 2:

>>> from unidecode import unidecode
>>> unidecode(u"۰۱۲۳۴۵۶۷۸۹")
'0123456789'

In Python 3:

>>> from unidecode import unidecode
>>> unidecode("۰۱۲۳۴۵۶۷۸۹")
'0123456789'

The SO thread at https://stackoverflow.com/q/8087381/2261442 might be related.

Edit: As Wander Nauta pointed out in the comments, and as mentioned on the Unidecode page, there is also a shell version of unidecode (under /usr/local/bin/ if installed via pip):

$ echo '۰۱۲۳۴۵۶۷۸۹' | unidecode
0123456789
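
If installing a third-party package is not an option, a similar result can be sketched with the standard library alone. This is an illustration, not part of the Unidecode package: it builds a translation table with str.maketrans and applies it with str.translate, much like the sed y/// and tr approaches in the other answers. Python 3 is assumed, and the names PERSIAN_DIGITS, TO_ASCII and to_ascii_digits are made up for this example.

```python
# Standard-library sketch (Python 3 assumed); all names here are illustrative.

# Persian (Extended Arabic-Indic) digits occupy U+06F0 .. U+06F9.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
ASCII_DIGITS = "0123456789"

# Map each Persian digit to the ASCII digit at the same position.
TO_ASCII = str.maketrans(PERSIAN_DIGITS, ASCII_DIGITS)


def to_ascii_digits(text):
    """Replace Persian digits in text and leave every other character alone."""
    return text.translate(TO_ASCII)


print(to_ascii_digits("۲۱"))  # 21
```

Unlike unidecode, this only touches the ten Persian digits, so any other non-ASCII text passes through unchanged.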

answered Jun 19 '16 at 11:39 · phk

The unidecode library also ships a utility called (unsurprisingly) unidecode which does the same as your Python 3 snippet. Just echo '۰۱۲۳۴۵۶۷۸۹' | unidecode should work. – Wander Nauta Jun 20 '16 at 11:43
@Wander - the Debian package of python-unidecode doesn't ship the utility program, so the long form may be necessary on such platforms (I didn't find one in the source tarball from upstream, so perhaps the program is something added by your distribution?) – Toby Speight Jun 20 '16 at 15:25
@TobySpeight If you install it using pip it's there. – phk Jun 20 '16 at 15:30
@TobySpeight The utility is in the upstream tarball as unidecode/util.py - strange that Debian doesn't include it. (Edit: Ah, mystery solved. The Debian package is out of date and older than the utility.) – Wander Nauta Jun 20 '16 at 15:31

score 8 · Answer 3 · Vombat · 2016-06-30T06:04:49.540

A pure bash version:

#!/bin/bash

number="$1"

number=${number//۱/1}
number=${number//۲/2}
number=${number//۳/3}
number=${number//۴/4}
number=${number//۵/5}
number=${number//۶/6}
number=${number//۷/7}
number=${number//۸/8}
number=${number//۹/9}
number=${number//۰/0}

echo "Result is $number"

I have tested this on my Gentoo machine and it works.

$ ./convert ۱۳۲
Result is 132

Done as a loop, given the list of characters (from 0 to 9) to convert:

#!/bin/bash
conv() ( LC_ALL=en_US.UTF-8
         local n="$2"
         for ((i=0;i<${#1};i++)); do
              n=${n//"${1:i:1}"/"$i"}
         done
         printf '%s\n' "$n"
       )

conv "۰۱۲۳۴۵۶۷۸۹" "$1"

And used as:

$ convert ۱۳۲
132

Another (rather overkill) way using grep:

#!/bin/bash

nums=$(echo "$1" | grep -o .)
result=""   # digits are appended to this string, so start with an empty string rather than an array

for i in $nums
do
    case $i in
        ۱)
            result+=1
            ;;
        ۲)
            result+=2
            ;;
        ۳)
            result+=3
            ;;
        ۴)
            result+=4
            ;;
        ۵)
            result+=5
            ;;
        ۶)
            result+=6
            ;;
        ۷)
            result+=7
            ;;
        ۸)
            result+=8
            ;;
        ۹)
            result+=9
            ;;
        ۰)
            result+=0
            ;;
    esac
done
echo "Result is $result"

answered Jun 20 '16 at 06:50 · Vombat

Pure Bash, except for the grep. In fact, I don't understand that line, nor why you do not set result=0. Are you being overly cautious in case $1 contains things other than Farsi digits? – Kusalananda Jun 20 '16 at 06:56
@Kusalananda that line reads the Farsi digits into nums. Makes it loop-able. – Vombat Jun 20 '16 at 07:01
Ten simple substitutions would have been quicker... number=${number//۱/1} etc., and would avoid the echo and grep. – Kusalananda Jun 20 '16 at 07:06
@Kusalananda Nice. Changed it. Now it is pure Bash! ;-) – Vombat Jun 20 '16 at 07:18
@coffeMug: ۱۳۲ is 132, not 123 :D – Baba Jun 20 '16 at 10:26
@Babyy Damn copy paste! And you have sharp eyes. ;-) – Vombat Jun 20 '16 at 10:29

score 7 · Accepted Answer · 2016-06-30T01:34:31.863

We can take advantage of the fact that the Unicode code points of the Persian numerals are consecutive and ordered from 0 to 9:

$ printf '%b' '\U06F'{0..9}
۰۱۲۳۴۵۶۷۸۹

That means that the last hex digit IS the decimal value:

$ echo $(( $(printf '%d' "'۲") & 0xF ))
2

That makes this simple loop a conversion tool:

#!/bin/bash
(   ### Use a UTF-8 locale to make the script more reliable.
    ### Something like LC_ALL=fa_IR.UTF-8 may suit you better.
    LC_ALL=en_US.UTF-8
    a="$1"
    while (( ${#a} > 0 )); do
        # extract the last hex digit from the UNICODE code point
        # of the first character in the string "$a":
        printf '%d' $(( $(printf '%d' "'$a") & 15 ))
        a=${a#?}    ## Remove one character from $a
    done
)
echo

Using it as:

$ sefr.sh ۰۱۲۳۴۵۶۷۸۹
0123456789

$ sefr.sh ۲۰۱
201

$ sefr.sh ۲۱
21

Note that this code could also convert Arabic and Latin numerals (even if mixed):

$ sefr.sh ۴4٤۵5٥۶6٦۷7٧۸8٨۹9٩
444555666777888999

$ sefr.sh ٤٧0٠٦7١٣3٥۶٦۷
4700671335667
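
The same low-nibble idea can be sketched in Python for comparison. This is an illustration under assumptions, not a port of the script above: the function name low_nibble_digits is made up, Python 3 is assumed, and unlike the shell loop it only rewrites characters in the three digit ranges (ASCII, Arabic-Indic and Persian), because masking an arbitrary character with 0xF would yield a meaningless number.

```python
# Illustrative sketch (Python 3 assumed); low_nibble_digits is a made-up name.

def low_nibble_digits(text):
    """Convert ASCII, Arabic-Indic and Persian digits using code point & 0xF."""
    out = []
    for ch in text:
        cp = ord(ch)
        is_digit = (
            0x0030 <= cp <= 0x0039      # ASCII 0-9
            or 0x0660 <= cp <= 0x0669   # Arabic-Indic digits
            or 0x06F0 <= cp <= 0x06F9   # Persian (Extended Arabic-Indic) digits
        )
        if is_digit:
            # In all three ranges the last hex digit equals the decimal value.
            out.append(str(cp & 0xF))
        else:
            # Leave anything that is not a digit untouched.
            out.append(ch)
    return "".join(out)


print(low_nibble_digits("۲۰۱"))  # 201
```

The range check is the main design difference: the shell loop prints a number for every character it is given, so it is only safe on input known to contain nothing but digits.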

Thank you very much, this is a very nice solution. I have a question: in the command printf '%d' '"۰', why use a double quotation mark? — Baba, Jun 28 '16 at 07:59
@Babyy It is not a double quotation mark; it is a way to give printf an argument that starts with a single quote: '۰. It could also have been written as '"۰'. The reason is that printf will give the Unicode code point if the argument starts with a single quote ' or a double quote ". Search a little before this link for the text "If the leading character is a single-quote or double-quote" — , Jun 29 '16 at 03:35
@Babyy The code has been extended to convert Persian, Arabic, and Latin (even if mixed). — , Jun 29 '16 at 07:03

score 3 · Answer 5 · Kusalananda · 2016-06-19T12:12:06.930

Since iconv can't seem to grok this, the next port of call would be to use the tr utility:

$ echo "۲۱" | tr '۰۱۲۳۴۵۶۷۸۹' '0123456789'
21

tr translates one set of characters to another, so we simply tell it to translate the set of Farsi digits to the set of Latin digits.

EDIT: As @cuonglm points out, this requires a non-GNU tr (for example, the tr on a Mac), and it also requires that $LC_CTYPE be set to en_US.UTF-8.

answered Jun 19 '16 at 12:00 · Kusalananda

Note that it won't work with GNU tr, which does not support multi-byte characters. – cuonglm Jun 19 '16 at 12:01
Oh my. Silly GNU. ;-) – Kusalananda Jun 19 '16 at 12:02
You also need to set your locale to one that supports Unicode, like en_US.utf8. – cuonglm Jun 19 '16 at 12:07

score 1 · Answer 6 · answered Oct 20 '20 at 10:13

numconv is available in the repositories of some Linux distributions (Debian and Ubuntu, at least). Once it is installed:

$ echo '۱۲۳۴۵۶۷۸۹۰' | numconv
1234567890

(Edit: Note that leading zeros are removed, and that this is purely for numeric conversion, and will not work with streams that contain non-numeric characters as well.)
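
For comparison, Python's built-in int() shows similar behavior: it accepts Unicode decimal digits, returns a number rather than a string (so leading zeros disappear), and rejects input that contains non-numeric characters. A short sketch, assuming Python 3; parse_number is a hypothetical helper name and has nothing to do with numconv itself:

```python
# Illustrative sketch (Python 3 assumed); parse_number is a made-up helper.

def parse_number(text):
    """Parse a string of Unicode decimal digits; return None if it is not a number."""
    try:
        # int() accepts any Unicode decimal digits, including Persian ones.
        return int(text)
    except ValueError:
        # Raised when the text contains characters that are not part of a number.
        return None


print(parse_number("۰۲۱"))  # 21 (the leading zero is dropped)
```

Because the result is an integer, this suits numeric fields only; for text that mixes digits with other characters, a character-by-character translation is the better fit.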
