10

I have some UTF-8 .txt files which I would like to convert to all uppercase. If it were just ASCII, I could use:

tr [:lower:] [:upper:]

But since I'm working with diacritics and stuff, it doesn't seem to work. I guess it might work if I set the appropriate locale, but I need this script to be portable.

VPeric
  • 627

3 Answers

20

All of:

tr '[:lower:]' '[:upper:]'

(don't forget the quotes, otherwise that won't work if there's a file called :, l, ... or r in the current directory) or:

awk '{print toupper($0)}'

or:

dd conv=ucase

are meant to convert characters to uppercase according to the rules defined in the current locale. However, even where locales use UTF-8 as the character set and clearly define the conversion from lowercase to uppercase, at least GNU dd, GNU tr and mawk (the default awk on Ubuntu for instance) don't follow them. Also, there's no standard way to specify locales other than C or POSIX, so if you want to convert UTF-8 files to uppercase portably regardless of the current locale, you're out of luck with the standard toolchest.

As is often the case, for portability your best bet may be perl:

$ echo lľsšcčtťzž | PERLIO=:utf8 perl -pe '$_=uc'
LĽSŠCČTŤZŽ

Beware, though, that not everybody agrees on what the uppercase version of a given character is.

For instance, in Turkish locales, the uppercase of i is not I, but İ (<U0130>). Here with the Heirloom Toolchest tr instead of GNU tr:

$ echo ií | LC_ALL=C.UTF-8 tr '[:lower:]' '[:upper:]'
IÍ
$ echo ií | LC_ALL=tr_TR.UTF-8 tr '[:lower:]' '[:upper:]'
İÍ

On my system, perl's uppercase mapping is defined in /usr/share/perl/5.14/unicore/To/Upper.pl, and it differs on a few characters from GNU libc's toupper() in the C.UTF-8 locale, for instance, with perl being the more accurate. For example, perl correctly converts ɀ to Ɀ, while GNU libc (2.17) doesn't.

  • 1
    For what it's worth, I'm working with Czech letters (and the example you used is actually Slovak), where all uppercase letters are clearly defined, but the locale will probably be set to C and not Czech, so that's a problem. Perl is already used in this toolchain, so adding another use might not be too bad. Thanks for the detailed explanation, btw! – VPeric Jul 30 '13 at 20:45
7

I think you can do this with awk and its toupper function.

For example

Doesn't work with GNU tr:

$ echo lľsšcčtťzž | tr '[:lower:]' '[:upper:]'
LľSšCčTťZž

Works with GNU awk:

$ echo lľsšcčtťzž | awk '{ print toupper($0) }'
LĽSŠCČTŤZŽ
slm
  • 369,824
  • @StephaneChazelas - thanks I changed the failing example. – slm Jul 30 '13 at 19:22
  • That depends on the current locale and on the tr or awk implementation. For instance, most tr implementations will correctly convert characters in a UTF-8 locale, according to that locale's rules; GNU tr doesn't, and neither does mawk. – Stéphane Chazelas Jul 30 '13 at 20:14
  • 1
    Actually, on FreeBSD (9.1), it's the other way round. It works with tr, but not with awk – Stéphane Chazelas Jul 30 '13 at 20:23
  • @StephaneChazelas - I'm not as versed on the variances 8-). Someone just downvoted, wonder why? – slm Jul 30 '13 at 20:32
2

This works with OS X's tr but not with GNU tr:

tr '[:lower:]' '[:upper:]'

This works with gawk but not with mawk or nawk (which is /usr/bin/awk in OS X):

awk '{print toupper($0)}'

Another option is to use GNU sed:

sed 's/./\u&/g'

In Bash 4.0 and later you can also use the ^^ parameter expansion:

while IFS= read -r l; do printf '%s\n' "${l^^}"; done
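
Since each of these only works with some implementations (and in some locales), a script could probe a candidate at run time instead of guessing. A rough sketch, where the helper name upcases_utf8 and the sample string are just made up for illustration:

```shell
# Hypothetical helper: succeeds only if the given filter command
# turns a small non-ASCII UTF-8 sample into the expected uppercase.
upcases_utf8() {
    [ "$(printf 'lľsš\n' | "$@" 2>/dev/null)" = 'LĽSŠ' ]
}

# Try the candidates in order and report the first one that passes
# on this system, in the current locale.
if upcases_utf8 tr '[:lower:]' '[:upper:]'; then
    echo "tr works"
elif upcases_utf8 awk '{print toupper($0)}'; then
    echo "awk works"
elif upcases_utf8 sed 's/./\u&/g'; then
    echo "sed works"
else
    echo "none of tr, awk, sed handles UTF-8 here" >&2
fi
```

Which branch is taken depends on the installed tools and on the locale, so the probe only tells you about the machine it runs on. A four-character sample also cannot catch every mapping difference between implementations.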
nisetama
  • 1,097