
This question was prompted by my earlier question, Chromium browser does not allow setting the default paper size for "Print to File", and by a conversation with @Gilles on chat. As pointed out by @don_crissti, and as I have verified, changing the locale (at least LC_PAPER) changes which paper size is selected.

I had never given much thought to what to select, and had always gone with en_US.UTF-8 because it seemed like a reasonable default choice.

However, @Gilles raised some concerns on chat (see the conversation starting at http://chat.stackexchange.com/transcript/message/17017095#17017095). Extracts:

Gilles: LC_PAPER defaults to $LANG

Gilles: You must have LANG=en_US.UTF-8. That's a bad idea: it sets LC_COLLATE and that's almost always a bad thing

Gilles: LC_COLLATE doesn't describe correct collation, it's too restrictive (it goes character by character). Remove LANG and instead set LC_CTYPE and LC_PAPER

Gilles: plus LC_MESSAGES if you want messages in a language other than English

Clearly, there are issues here I am not aware of, and I am sure many others are as well. So, what issues should you consider when setting locales, and how should you set them? I've always just run dpkg-reconfigure locales in Debian, and not thought twice about it.

Specific question: Should I set my locale to en_IN.UTF-8? Are there any drawbacks of doing so?

See also: Does (should) LC_COLLATE affect character ranges?

Faheem Mitha

1 Answer


Locale settings are user preferences that relate to your culture.

Locale names

On all current unix variants that I know of (but not on a few antiques), locale names follow the same pattern:

  • An ISO 639-1 lowercase two-letter language code, or an ISO 639-2 three-letter language code if the language has no two-letter code. For example, en for English, de for German, ja for Japanese, uk for Ukrainian, ber for Berber, …
  • For many but not all languages, an underscore _ followed by an ISO 3166 uppercase two-letter country code. Thus: en_US for US English, en_GB for British English, fr_CA for Canadian (Québec) French, de_DE for German of Germany, de_AT for German of Austria, ja_JP for Japanese (of Japan), etc.
  • Optionally, a dot . followed by the name of a character encoding such as UTF-8, ISO-8859-1, KOI8-U, GB2312, Big5, etc. With GNU libc at least (I don't know how widespread this is), case and punctuation are ignored in encoding names. For example, zh_CN.UTF-8 is Mandarin (simplified) Chinese encoded in UTF-8, while zh_CN is Mandarin Chinese encoded in GB2312, and zh_TW is Taiwanese (traditional) Chinese encoded in Big5.
  • Optionally, an at sign @ followed by the name of a variant. The meaning of variants is locale-dependent. For example, many European countries have an @euro locale variant where the currency sign is € and where the encoding is one that includes this character (ISO 8859-15 or ISO 8859-16), as opposed to the unadorned variant with the older currency sign. For example, en_IE (English, Ireland) uses the latin1 (ISO 8859-1) encoding and £ as the currency symbol while en_IE@euro uses the latin9 (ISO 8859-15) encoding and € as the currency symbol.

In addition, there are two locale names that exist on all unix-like systems: C and POSIX. These names are synonymous and mean computerese, i.e. default settings that are appropriate for data that is parsed by a computer program.
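
As an illustration of the naming pattern described above, here is a minimal sketch that splits a locale name into its parts using only POSIX shell parameter expansion. The function name parse_locale_name is made up for this example, and the sketch assumes the name follows the language[_COUNTRY][.encoding][@variant] pattern; it does no validation.

```shell
# parse_locale_name NAME
# Splits NAME into language, country, encoding and variant (hypothetical helper).
# Plain POSIX sh has no local variables, so these names are globals.
parse_locale_name() {
    rest=$1
    country=
    encoding=
    variant=
    # Peel the optional parts off from right to left: @variant, .encoding, _COUNTRY.
    case $rest in *@*) variant=${rest#*@}; rest=${rest%%@*} ;; esac
    case $rest in *.*) encoding=${rest#*.}; rest=${rest%%.*} ;; esac
    case $rest in *_*) country=${rest#*_}; rest=${rest%%_*} ;; esac
    language=$rest
    printf 'language=%s country=%s encoding=%s variant=%s\n' \
        "$language" "$country" "$encoding" "$variant"
}

parse_locale_name 'en_IE.ISO-8859-15@euro'
parse_locale_name 'zh_CN.UTF-8'
parse_locale_name 'C'
```

For names such as C or POSIX, every field except the language comes back empty.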

Locale settings

The following locale categories are defined by POSIX:

  • LC_CTYPE: the character set used by terminal applications: classification data (which characters are letters, punctuation, spaces, invalid, etc.) and case conversion. Text utilities typically heed LC_CTYPE to determine character boundaries.
  • LC_COLLATE: collation (i.e. sorting) order. This setting is of very limited use for several reasons:
    • Most languages have intricate rules that depend on what is being sorted (e.g. dictionary words and proper names might not use the same order) and cannot be expressed by LC_COLLATE.
    • Few tasks where proper sort order matters are performed by software that uses locale settings. For example, word processors store the language and encoding of a file in the file itself (otherwise the file wouldn't be processed correctly on a system with different locale settings) and don't care about the locale settings specified by the environment.
    • LC_COLLATE can have nasty side effects, in particular because it can cause a sort order such as A < a < B < …, which makes “between A and Z” include the lowercase letters a through y. As a result, very common regular expressions like [A-Z] break some applications.
  • LC_MESSAGES: the language of informational and error messages.
  • LC_NUMERIC: number formatting: decimal and thousands separator.
    Many applications hard-code . as a decimal separator. This makes LC_NUMERIC not very useful and potentially dangerous:
    • Even if you set it, you'll still see the default format pretty often.
    • You're likely to get into a situation where one application produces locale-dependent output and another application expects . to be the decimal point, or , to be a field separator.
  • LC_MONETARY: like LC_NUMERIC, but for amounts of local currency.
    Very few applications use this.
  • LC_TIME: date and time formatting: weekday and month names, 12 or 24-hour clock, order of date parts, punctuation, etc.
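
To make the LC_COLLATE point concrete, here is a small sketch that forces the C locale for individual commands. The sample letters are made up for illustration. What a non-C locale such as en_US.UTF-8 would produce depends on the locale data installed on your system, so only the C behaviour is shown.

```shell
# Sort four sample letters with the C locale forced for this one command.
sorted_c=$(printf '%s\n' b A a B | LC_ALL=C sort | tr '\n' ' ')
printf 'C locale sort order: %s\n' "$sorted_c"

# Count the lines matching [A-Z] when the input is a single lowercase "a".
# grep exits with a non-zero status when nothing matches, hence "|| true".
upper_matches=$(printf 'a\n' | LC_ALL=C grep -c '[A-Z]' || true)
printf 'lines matching [A-Z] in the C locale: %s\n' "$upper_matches"
```

In the C locale, sorting is by byte value, so all uppercase ASCII letters come before all lowercase ones (A B a b) and [A-Z] does not match a. Rerunning the same commands with a different LC_ALL value shows whether your installed locales behave differently.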

GNU libc, which you'll find on non-embedded Linux, defines additional locale categories:

  • LC_PAPER: the default paper size (defined by height and width).
  • LC_NAME, LC_ADDRESS, LC_TELEPHONE, LC_MEASUREMENT, LC_IDENTIFICATION: I don't know of any application that uses these.

Environment variables

Applications that use locale settings determine them from environment variables.

  • The value of the LANG environment variable is used unless overridden by another setting. If LANG is not set, the default locale is C.
  • The LC_xxx names can be used as environment variables; each one overrides LANG for its own category.
  • If LC_ALL is set, then all other values are ignored; this is primarily useful to run applications with LC_ALL=C when they need to produce the same output regardless of where they are run.
  • In addition, GNU libc uses LANGUAGE to define fallbacks for LC_MESSAGES (e.g. LANGUAGE=fr_BE:fr_FR:en to prefer Belgian French, or if unavailable France French, or if unavailable English).
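
The precedence rules above can be sketched as a small shell function. The name resolve_locale is hypothetical, and the sketch only models the LC_ALL, LC_xxx and LANG lookup for a single category; it leaves out LANGUAGE. Empty values are treated the same as unset ones, which is how POSIX describes the lookup.

```shell
# resolve_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Prints the locale name that applies to one category (hypothetical helper).
resolve_locale() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then the category's own LC_xxx variable
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG
    else
        printf 'C\n'            # nothing set: the C locale
    fi
}

# Which locale name decides the paper size in the current environment?
resolve_locale "${LC_ALL-}" "${LC_PAPER-}" "${LANG-}"
```

To see what your system actually resolves, run the locale utility without arguments: it prints the effective value of every category.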

Installing locales

Locale data can be large, so some distributions don't ship them in a usable form and instead require an additional installation step.

  • On Debian, to install locales, run dpkg-reconfigure locales and select from the list in the dialog box, or edit /etc/locale.gen and then run locale-gen.
  • On Ubuntu, to install locales, run locale-gen with the names of the locales as arguments.
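
Before generating a locale, it can help to check whether it is already installed. The sketch below compares a requested name against the output of locale -a. It assumes glibc-style output, where encoding names are listed in lowercase without punctuation (for example en_US.utf8). The function names normalize_locale_name and locale_installed are made up for this example, and the normalization is a simplification of what glibc does.

```shell
# normalize_locale_name NAME
# Lowercases the encoding part and strips its punctuation (hypothetical helper).
normalize_locale_name() {
    name=$1
    case $name in
        *.*) ;;                                   # has an encoding part
        *) printf '%s\n' "$name"; return 0 ;;     # nothing to normalize
    esac
    base=${name%%.*}                              # language[_COUNTRY]
    rest=${name#*.}                               # encoding[@variant]
    modifier=
    case $rest in
        *@*) modifier=@${rest#*@}; rest=${rest%%@*} ;;
    esac
    # LC_ALL=C keeps tr's character classes limited to ASCII.
    codeset=$(printf '%s' "$rest" | LC_ALL=C tr -cd '[:alnum:]' | LC_ALL=C tr '[:upper:]' '[:lower:]')
    printf '%s.%s%s\n' "$base" "$codeset" "$modifier"
}

# locale_installed NAME
# Succeeds if the normalized NAME appears as a whole line in "locale -a".
locale_installed() {
    wanted=$(normalize_locale_name "$1")
    locale -a 2>/dev/null | LC_ALL=C grep -Fqx -- "$wanted"
}

if locale_installed 'en_IN.UTF-8'; then
    echo 'en_IN.UTF-8 is installed'
else
    echo 'en_IN.UTF-8 is not installed (or locale -a is unavailable)'
fi
```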

You can define your own locale.

Recommendation

The useful settings are:

  • Set LC_CTYPE to the language and encoding that you encode your text files in. Ensure that your terminals use that encoding.
    For most languages, only the encoding matters. There are a few exceptions; for example, an uppercase i is I in most languages but İ in Turkish (tr_TR).
  • Set LC_MESSAGES to the language that you want to see messages in.
  • Set LC_PAPER to en_US if you want US Letter to be the default paper size and just about anything else (e.g. en_GB) if you want A4.
  • Optionally, set LC_TIME to your favorite time format.

As explained above, avoid setting LC_COLLATE and LC_NUMERIC. If you use LANG, explicitly override these two categories by setting them to C.
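
As a sketch of how this recommendation might look in a login script such as ~/.profile: the locale names below are only examples (en_US.UTF-8 for text and messages, en_GB.UTF-8 for A4 paper), so substitute your own, and make sure the locales you name are installed.

```shell
# Example settings only; adjust the locale names to your own preferences.
unset LANG LC_ALL               # avoid a blanket setting that also sets LC_COLLATE and LC_NUMERIC
export LC_CTYPE=en_US.UTF-8     # encoding of your text files and terminal
export LC_MESSAGES=en_US.UTF-8  # language of messages
export LC_PAPER=en_GB.UTF-8     # A4 as the default paper size
export LC_TIME=en_GB.UTF-8      # optional: preferred date and time format

# Alternative if you prefer to keep LANG: reset the two risky categories to C.
# export LANG=en_US.UTF-8 LC_COLLATE=C LC_NUMERIC=C
```

With LANG unset, every category that is not listed here falls back to the C locale.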