All the locale variables use the same locale name so that you can specify your favorite locale in a single swoop, e.g. `LANG=en_AU.utf8`. As you surmise, the country information is occasionally relevant even in `LC_CTYPE`, e.g. the uppercase version of `i` is `I` in most languages but `İ` in Turkish (`tr_TR.utf8`). But don't expect miracles; for example the lowercase-uppercase correspondence is one-to-one, so there's no good uppercase version of `ß` in `de_DE.iso8859-1` (it should be `SS`).
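One way to observe the locale-dependent case mapping from the shell is sketched below. It assumes an `awk` whose `toupper` follows the locale's `LC_CTYPE` for multibyte characters (gawk does; some other awks only handle single bytes) and that the Turkish locale is installed; `upper_in_locale` is a helper name made up for this illustration.

```sh
#!/bin/sh
# upper_in_locale LOCALE TEXT: uppercase TEXT under the given locale.
# Hypothetical helper for illustration; relies on awk's toupper(),
# which follows LC_CTYPE only in locale-aware awk implementations.
upper_in_locale() {
    printf '%s\n' "$2" | LC_ALL="$1" awk '{ print toupper($0) }'
}

# Plain ASCII mapping in the C locale.
upper_in_locale C i

# Only try Turkish rules if the locale is actually installed
# (accepts both the utf8 and UTF-8 spellings).
if locale -a 2>/dev/null | grep -qi '^tr_TR\.utf-*8$'; then
    upper_in_locale tr_TR.utf8 i
fi
```

In the C locale the first call prints `I`. The second call is where a locale-aware awk would be expected to apply the Turkish mapping described above.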
You'll have an easier time understanding the output of `locale -k LC_CTYPE`, with `-k` to see the keyword names in addition to the values (without `-k`, the output format is designed so you can get the value of a specific keyword, e.g. `locale ctype-width`). The list of keywords and their meanings is system-dependent, as is the way locale data is stored, and doesn't interest many people, so you may not find much documentation outside the source code of your C library. By far the most useful form of the locale command is `locale -a` to list available locale names.
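The three kinds of invocation can be tried as follows. This is only a sketch: the keyword list printed by `-k` differs between systems, `charmap` is used here because POSIX specifies it as a keyword, and `locale_available` is a hypothetical helper, not part of any tool.

```sh
#!/bin/sh
# Keyword names and values for one category (output is system-dependent).
LC_ALL=C locale -k LC_CTYPE | head -n 5

# Value of a single keyword, convenient in scripts.
charmap=$(LC_ALL=C locale charmap)
echo "C locale charmap: $charmap"

# locale_available NAME: succeed if NAME appears verbatim in `locale -a`.
# Hypothetical helper; some systems list normalized spellings
# (e.g. utf8 rather than UTF-8), and this is an exact string match.
locale_available() {
    locale -a 2>/dev/null | grep -qxF -- "$1"
}

if locale_available C || locale_available POSIX; then
    echo "the C/POSIX locale is listed"
fi
```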
For GNU libc (i.e. non-embedded Linux):
- All locale data other than messages is stored in `/usr/lib/locale/locale-archive`. This file is generated by `localedef` from data in `/usr/share/i18n` and `/usr/local/share/i18n`. The format of the locale definition files in `/usr/share/i18n/locales` is only documented in the source code, I think.
- The format of the character set and encoding definition files in `/usr/share/i18n/charmaps` is standardized by POSIX:2001. These files (or, in GNU libc, the compiled version in `/usr/lib/locale/locale-archive`) are used by the `iconv` programming and command-line facility. Encoding conversions also rely on code in `/usr/lib/gconv/*.so`. The GNU libc manual documents how to write your own gconv module, though that section contains the text “This information should be sufficient to write new modules. Anybody doing so should also take a look at the available source code in the GNU C library sources.”
- Message catalogs get special treatment because each application comes with its own set. Message catalogs live in `/usr/share/locale/*/LC_MESSAGES`. The manual contains documentation for application writers. GNU libc supports both the POSIX interface `catgets` and the more powerful `gettext` interface.
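As a small illustration of the `iconv` command-line side, the sketch below converts one character from UTF-8 to UTF-16BE and dumps the resulting bytes as hex, which for characters in the Basic Multilingual Plane is the code point. It assumes `iconv` and `od` are on the PATH; `to_utf16be_hex` is a made-up helper name, and the input is written with octal escapes so the script does not depend on the editor's encoding.

```sh
#!/bin/sh
# to_utf16be_hex: read UTF-8 on stdin, print its UTF-16BE bytes as hex.
# Hypothetical helper for illustration only.
to_utf16be_hex() {
    iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n'
    echo
}

# U+2234 THEREFORE is the byte sequence E2 88 B4 in UTF-8.
printf '\342\210\264' | to_utf16be_hex    # prints 2234
```

To convert from whatever encoding the current locale uses, `-f UTF-8` could be replaced with `-f "$(locale charmap)"`.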
Written languages are indeed very complicated, even if you don't stray far from English. Are the French and German `ü` the same character (is a “tréma” exactly the same as an “umlaut”, and does it matter that French and German printers typeset the accent at a slightly different height)? What is the uppercase of `i` (it's `İ` in Turkish)? Does `Ö` transliterate to `O` if you only have ASCII (in German, it's `OE`)? Where is `Ä` sorted in a dictionary (in Swedish, it's after `Z`)? And that's just a few examples with European languages written in the Latin alphabet! The Unicode mailing list has a lot of examples and sometimes heated discussions on such topics.
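The sorting question can be poked at from the shell. The sketch below assumes `sort` honors `LC_COLLATE` and only tries the Swedish and German locales if they are installed under the names shown (the spelling glibc's `locale -a` prints); `sort_in_locale` is a helper name invented for this example.

```sh
#!/bin/sh
# sort_in_locale LOCALE: sort stdin using LOCALE's collation rules.
# Hypothetical helper for illustration only.
sort_in_locale() {
    LC_ALL="$1" sort
}

words='Z
Ä
A'

# In the C locale, sort compares raw bytes, so the UTF-8 encoded Ä
# (bytes C3 84) lands after Z (byte 5A) for purely technical reasons.
printf '%s\n' "$words" | sort_in_locale C

# Language-specific orders need the corresponding locales to be installed.
for loc in sv_SE.utf8 de_DE.utf8; do
    if locale -a 2>/dev/null | grep -qxF "$loc"; then
        echo "--- $loc"
        printf '%s\n' "$words" | sort_in_locale "$loc"
    fi
done
```

Comparing the Swedish and German outputs, where available, should show whether `Ä` is treated as a separate letter at the end of the alphabet or filed alongside `A`.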
`locale` does its best in very difficult terrain, and anyone dealing with a fringe case would (hopefully) be aware of the quirks involved and make special concessions... The `locale ctype-width` example solved my real-world problem of how to generically get `iconv` to print out the Unicode codepoint of a particular char based on the current locale (for the Multi-Lingual-Plane.. UTF-16 surrogates as usual being the fly in the ointment).. I think(?) this should do the trick: `echo -n "∴" | iconv -f "$(locale charmap)" -t "UTF-16BE" | xxd -p` ... 2234 – Peter.O Apr 26 '11 at 15:51
`en_AU` (in `en_AU.UTF-8`) etc, appears to be just a name.. It is the name of a file which defines that particular "whatever-name-you-like" locale... but that now makes me wonder where the `.UTF-8` suffix comes from, because all the LC_* vars are defined in this file... Perhaps `locale` gets all its info from this file, either directly or indirectly, and just tacks on the `.UTF-8` suffix, but I can't see any reference to UTF-8.. These definition files are in `/usr/share/i18n/locales` – Peter.O Apr 26 '11 at 17:11
`strace date` and other commands to see what files are actually used. Most locale information is compiled and not read from `/usr/share/i18n/locales` under normal operation. See my edit. – Gilles 'SO- stop being evil' Apr 26 '11 at 20:03
`.UTF-8` suffix comes from... I just changed my default `date` by modifying `/usr/share/i18n/locales/en_AU` ... using this command: `sudo localedef -f UTF-8 -i en_AU_mod fred` ... The default output of `date` is now `[2011-04-27][11.21.44]` and the output from `locale | grep LC_TIME` is: `LC_TIME=fred` ... :) – Peter.O Apr 27 '11 at 01:25