How to translate Unicode characters?

Question

I'm trying to convert some characters to their fullwidth forms like this:

tr 'abcdefghijklmnopqrstuvwxyz' 'ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ'

However, it doesn't work. I searched and found that tr doesn't support UTF-8, so based on an answer to that question I tried Perl instead:

perl -C -pe 'tr/abcdefghijklmnopqrstuvwxyz/ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ/'

But that didn't help either. I tried simpler versions of it:

$ echo abca | perl -C -pe 's/a/ａ/g'
ï½bcï½
$ echo abca | perl -C -pe 'tr/a/ａ/'
ïbcï

It seems Perl still treats multibyte UTF-8 characters as bytes.

How can I convert those characters properly?

Michael Homer · Accepted Answer · 2018-03-11T08:02:15.827

Both GNU and BSD sed are multibyte-aware in appropriate locales, and the y command is analogous to tr:

$ echo hello | sed -e 'y/abcdefghijklmnopqrstuvwxyz/ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ/'
ｈｅｌｌｏ

This should work in most places you're likely to run it, as long as your locale is a UTF-8 one.

The Perl issue isn't quite as simple as treating multibyte characters as bytes. Perl understands your input just fine, and even encodes the output; what it doesn't understand is the source code:

$ echo abc | perl -C -pe 'tr/abcdefghijklmnopqrstuvwxyz/ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ/'|hexdump -C
00000000  c3 af c2 bd c2 81 0a                              |.......|

The UTF-8 encoding of "ａ" is ef bd 81. Perl reads the replacement list as raw bytes, so "a" maps to ef, "b" to bd, and "c" to 81. Each of those bytes is then treated as a single character and mangled when it is encoded on the way out, giving c3 af, c2 bd, and c2 81. You need use utf8 to tell Perl (5) that the source itself is encoded as UTF-8; -C only controls the I/O the program does when it's running.

You can put use utf8; into your -e string, or use -Mutf8 on the command line:

$ echo abc | perl -C -Mutf8 -pe 'tr/abcdefghijklmnopqrstuvwxyz/ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ/'
ａｂｃ

Perl 6 does solve this issue, as it does so many others, but...
