
I'm trying to remove some characters from a file (UTF-8). I'm using tr for this purpose:

tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat 

The file contains some foreign characters (like "Латвийская" or "àé"). tr doesn't seem to understand them: it treats them as non-alpha and removes them too.

I've tried changing some of my locale settings:

LC_CTYPE=C LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=ru_RU.UTF-8 tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat

Unfortunately, none of these worked.

How can I make tr understand Unicode?


2 Answers


That's a known (1, 2, 3, 4, 5, 6) limitation of the GNU implementation of tr.

It's not so much that it doesn't support foreign, non-English or non-ASCII characters; rather, it doesn't support multi-byte characters.

Those Cyrillic characters would be treated OK if written in the iso8859-5 (single byte per character) character set (and your locale used that charset), but your problem is that you're using UTF-8, where non-ASCII characters are encoded in 2 or more bytes.
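You can see the multi-byte encoding for yourself by counting bytes (the sample character is arbitrary; any non-ASCII character works):

```shell
# In UTF-8, 'é' (U+00E9) is encoded as two bytes (0xc3 0xa9), so a
# byte-oriented tool like GNU tr sees two units where you see one character:
printf 'é' | wc -c           # prints 2 (bytes)
printf 'é' | od -An -tx1     # shows the bytes: c3 a9
```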

GNU's got a plan (see also) to fix that, and work is under way, but it's not there yet.

The FreeBSD and Solaris implementations of tr don't have the problem.


In the meantime, for most use cases of tr, you can use GNU sed or GNU awk, which do support multi-byte characters.

For instance, your:

tr -cs '[[:alpha:][:space:]]' ' '

could be written:

gsed -E 's/( |[^[:space:][:alpha:]])+/ /g'
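For example (sample input made up; shown with plain sed, which is GNU sed on GNU systems — the ASCII sample keeps the result independent of your locale):

```shell
# Squeeze every run of blanks/punctuation/digits into one space,
# as tr -cs would:
echo 'hello, world!! 42' | sed -E 's/( |[^[:space:][:alpha:]])+/ /g'
# prints: hello world   (with a trailing space, as tr -cs would leave)
```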

or:

gawk -v RS='( |[^[:space:][:alpha:]])+' '{printf "%s", sep $0; sep=" "}'

To convert between lower and upper case (tr '[:upper:]' '[:lower:]'):

gsed 's/[[:upper:]]/\l&/g'

(that l is a lowercase L, not the 1 digit).

or:

gawk '{print tolower($0)}'
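A quick check of the sed form above (sample input made up; shown with plain sed, which is GNU sed on GNU systems):

```shell
# GNU sed's \l lowercases the single character that follows in the
# replacement -- here the & back-reference, i.e. each [[:upper:]] match:
echo 'HELLO World' | sed 's/[[:upper:]]/\l&/g'
# prints: hello world
```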

For portability, perl is another alternative:

perl -Mopen=locale -pe 's/([^[:space:][:alpha:]]| )+/ /g'
perl -Mopen=locale -pe '$_=lc$_'

If you know the data can be represented in a single-byte character set, then you can process it in that charset:

(export LC_ALL=ru_RU.iso88595
 iconv -f utf-8 |
   tr -cs '[:alpha:][:space:]' ' ' |
   iconv -t utf-8) < Russian-file.utf8
  • I've accepted your answer because of the information about tr. I've solved the problem, and removed the question about how to solve it (so people looking for tr will find only information about tr, not some arbitrary problem). If you could please remove the solution too, since it's no longer needed, I'd be thankful. – MatthewRock Sep 09 '15 at 14:41
  • @MatthewRock I've kept it but reworded it and made it more generic, as giving a work-around would be useful to people with the same problem. – Stéphane Chazelas Sep 09 '15 at 15:22
  • Where do you get the idea that Cyrillic is (customarily) encoded in ISO 8859-5? Did you ever see a Russian text in anything but Unicode? – Incnis Mrsi Sep 09 '15 at 16:35
  • @IncnisMrsi, all that matters here is that ISO 8859-5 is one of those single-byte charsets that has those Cyrillic characters. Whether it's in widespread use or not is irrelevant here. If you have a locale with a KOI8-R or windows-1251 charset, by all means use it instead. – Stéphane Chazelas Sep 09 '15 at 16:43
  • @IncnisMrsi Russian on the web is almost always encoded in UTF-8 (or occasionally in Windows-1251), but only because we’ve felt the pain of many single-byte encodings early on. Here’s an ancient (circa 1998) web page with a (non-functional) encoding switcher: http://sch57.ru/collect/. – Alex Shpilkin Apr 18 '18 at 21:06
  • @IncnisMrsi Windows, however, still uses Windows-1251 in many places and IBM866 (aka “GOST alternative”, “Alt”) in the terminal. Try getting your Python 2 script with Russian output to run in both Windows terminal and IDLE, it’s fun. – Alex Shpilkin Apr 18 '18 at 21:08

Just use GNU sed (with an appropriate LANG environment variable, such as en_US.UTF-8):

% sed 'y/123/abc/; y/āōī/456/' <<< test123ingmāōī
testabcingm456