491

What does the C value for LC_ALL do in Unix-like systems?

I know that it forces the same locale for all aspects but what does C do?

jasonwryan
  • 73,126
jcubic
  • 9,932
  • 4
    If you want to resolve the xclock warning (Missing charsets in String to FontSet conversion), it is better to use LC_ALL=C.UTF-8 to avoid problems with Cyrillic. To set this environment variable, add the following line to the end of your ~/.bashrc file: export LC_ALL=C.UTF-8 – fedotsoldier Jun 19 '19 at 12:42
  • 2
    @fedotsoldier you should probably ask a question and give the answer yourself, I don't think it's related to this question. It's just an answer to a different problem you're having. – jcubic Jun 19 '19 at 13:20
  • Yeah, you are right, ok – fedotsoldier Jun 19 '19 at 13:22
  • 6
    legendary C locales rant https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f027338b0fab0f5078971fbe – qwr Sep 04 '22 at 06:28

6 Answers

507

LC_ALL is the environment variable that overrides all the other localisation settings (except $LANGUAGE under some circumstances).

Different aspects of localisation (like the thousands separator or decimal point character, character set, sorting order, month and day names, the language of application messages such as error messages, currency symbol) can be set using a few environment variables.

You'll typically set $LANG to your preference with a value that identifies your region (like fr_CH.UTF-8 if you're in French-speaking Switzerland, using UTF-8). The individual LC_xxx variables override a certain aspect. LC_ALL overrides them all. The locale command, when called without arguments, gives a summary of the current settings.

For instance, on a GNU system, I get:

$ locale
LANG=en_GB.UTF-8
LANGUAGE=
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

I can override an individual setting with for instance:

$ LC_TIME=fr_FR.UTF-8 date
jeudi 22 août 2013, 10:41:30 (UTC+0100)

Or:

$ LC_MONETARY=fr_FR.UTF-8 locale currency_symbol
€

Or override everything with LC_ALL.

$ LC_ALL=C LANG=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 cat /
cat: /: Is a directory

In a script, if you want to force a specific setting, as you don't know what settings the user has forced (possibly LC_ALL as well), your best, safest and generally only option is to force LC_ALL.
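The precedence described above (LC_ALL first, then the individual LC_xxx variable, then LANG) can be sketched as a small shell function. This is only an illustration: effective_locale is a hypothetical helper, not a standard command, and the final fallback to C is an assumption about the system default.

```shell
# Hypothetical helper (not a standard command): print the value that takes
# effect for one locale category, using the precedence LC_ALL > LC_xxx > LANG.
# Empty values count as unset.
effective_locale() {
  category=$1                        # e.g. LC_TIME or LC_COLLATE
  eval "specific=\${$category-}"     # value of the category's own variable
  if [ -n "${LC_ALL-}" ]; then
    printf '%s\n' "$LC_ALL"
  elif [ -n "$specific" ]; then
    printf '%s\n' "$specific"
  elif [ -n "${LANG-}" ]; then
    printf '%s\n' "$LANG"
  else
    printf '%s\n' C                  # assumption: the system default is C
  fi
}

# Each call runs in a subshell so the assignments do not leak.
(LC_ALL=; LC_TIME=fr_FR.UTF-8; LANG=en_GB.UTF-8; effective_locale LC_TIME)
(LC_ALL=C; LC_TIME=fr_FR.UTF-8; LANG=en_GB.UTF-8; effective_locale LC_TIME)
```

The first call prints fr_FR.UTF-8 and the second prints C, which is why forcing LC_ALL is the only way to be sure nothing else wins. Some shells may warn on stderr if fr_FR.UTF-8 is not installed; the function only reads the variables, so its output is unaffected.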

The C locale is a special locale that is meant to be the simplest locale. You could also say that while the other locales are for humans, the C locale is for computers. In the C locale, characters are single bytes; the charset is ASCII (it is not required to be, but in practice it will be on the systems most of us will ever get to use); the sorting order is based on the byte values¹; the language is usually US English (though for application messages, as opposed to things like month or day names or messages by system libraries, it's at the discretion of the application author); and things like currency symbols are not defined.

On some systems, there's a difference with the POSIX locale where for instance the sort order for non-ASCII characters is not defined.

You generally run a command with LC_ALL=C to prevent the user's settings from interfering with your script. For instance, if you want [a-z] to match the 26 ASCII characters from a to z, you have to set LC_ALL=C.
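As a minimal sketch of the [a-z] case, the snippet below filters three sample words. The sample data is made up for illustration, and the non-ASCII word is written as octal escapes so the result does not depend on how the script file is encoded.

```shell
# Sample input: "abc", "XYZ", and "été" (given as UTF-8 bytes in octal).
sample=$(printf 'abc\nXYZ\n\303\251t\303\251\n')

# -x requires the whole line to match. Under LC_ALL=C, [a-z] covers exactly
# the 26 ASCII lower-case letters, so only "abc" qualifies.
ascii_only=$(printf '%s\n' "$sample" | LC_ALL=C grep -x '[a-z]*')
printf '%s\n' "$ascii_only"
```

This prints abc. Without LC_ALL=C, whether "été" is also reported depends on the user's locale and the grep implementation, which is exactly the uncertainty the override removes.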

On GNU systems, LC_ALL=C and LC_ALL=POSIX (or LC_MESSAGES=C|POSIX) override $LANGUAGE, while LC_ALL=anything-else wouldn't.

A few cases where you typically need to set LC_ALL=C:

  • sort -u or sort ... | uniq.... In many locales other than C, on some systems (notably GNU ones), some characters have the same sorting order. sort -u doesn't report unique lines, but one of each group of lines that have equal sorting order. So if you do want unique lines, you need a locale where characters are bytes and all characters have different sorting order (which the C locale guarantees).

  • the same applies to the = operator of POSIX compliant expr or the == operator of POSIX compliant awks (mawk and gawk are not POSIX in that regard), which don't check whether two strings are identical but whether they sort the same.

  • Character ranges like in grep. If you mean to match a letter in the user's language, use grep '[[:alpha:]]' and don't modify LC_ALL. But if you want to match the a-zA-Z ASCII characters, you need either LC_ALL=C grep '[[:alpha:]]' or LC_ALL=C grep '[a-zA-Z]'². [a-z] matches the characters that sort after a and before z (though with many APIs it's more complicated than that). In other locales, you generally don't know what those are. For instance some locales ignore case for sorting so [a-z] in some APIs like bash patterns, could include [B-Z] or [A-Y]. In many UTF-8 locales (including en_US.UTF-8 on most systems), [a-z] will include the Latin letters from a to y with diacritics but not z with diacritics (since z sorts before them), which I can't imagine would be what you want (why would you want to include é and not ź?).

  • floating point arithmetic in ksh93. ksh93 honours the decimal_point setting in LC_NUMERIC. If you write a script that contains a=$((1.2/7)), it will stop working when run by a user whose locale has comma as the decimal separator:

     $ ksh93 -c 'echo $((1.1/2))'
     0.55
     $ LANG=fr_FR.UTF-8  ksh93 -c 'echo $((1.1/2))'
     ksh93: 1.1/2: arithmetic syntax error
    

Then you need things like:

    #! /bin/ksh93 -
    float input="$1" # get it as input from the user in his locale
    float output
    arith() { typeset LC_ALL=C; (($@)); }
    arith output=input/1.2 # use the dot here as it will be interpreted
                           # under LC_ALL=C
    echo "$output" # output in the user's locale

As a side note: the , decimal separator conflicts with the , arithmetic operator which can cause even more confusion.

  • When you need characters to be bytes. Nowadays, most locales are UTF-8 based which means characters can take up from 1 to 6 bytes³. When dealing with data that is meant to be bytes, with text utilities, you'll want to set LC_ALL=C. It will also improve performance significantly because parsing UTF-8 data has a cost.

  • a corollary of the previous point: when processing text where you don't know what character set the input is written in, but can assume it's compatible with ASCII (as virtually all charsets are). For instance grep '<.*>' to look for lines containing a <, > pair will not work if you're in a UTF-8 locale and the input is encoded in a single-byte 8-bit character set like iso8859-15. That's because . only matches characters, and non-ASCII characters in iso8859-15 are likely not to form a valid character in UTF-8. On the other hand, LC_ALL=C grep '<.*>' will work because any byte value forms a valid character in the C locale.

  • Any time you process input data or produce output data that is not meant to come from or go to a human. If you're talking to a user, you may want to use their convention and language, but for instance, if you generate some numbers to feed some other application that expects English style decimal points, or English month names, you'll want to set LC_ALL=C:

     $ printf '%g\n' 1e-2
     0,01
     $ LC_ALL=C printf '%g\n' 1e-2
     0.01
     $ date +%b
     août
     $ LC_ALL=C date +%b
     Aug
    

That also applies to things like case insensitive comparison (like in grep -i) and case conversion (awk's toupper(), dd conv=ucase...). For instance:

    grep -i i

is not guaranteed to match on I in the user's locale. In some Turkish locales for instance, it doesn't as upper-case i is İ (note the dot) there and lower-case I is ı (note the missing dot).
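A few of the cases in the list above can be checked with a short script. This is a sketch with made-up sample data; awk stands in for "some other application" that expects a dot as decimal separator.

```shell
# 1. Unique lines: in the C locale no two distinct lines compare equal,
#    so sort -u keeps one copy of each distinct line.
unique_lines=$(printf '%s\n' b a B a A | LC_ALL=C sort -u)
printf '%s\n' "$unique_lines"

# 2. Characters are bytes: "été" in UTF-8 (octal escapes) is 3 characters in a
#    UTF-8 locale but 5 one-byte characters in the C locale.
char_count=$(printf '\303\251t\303\251' | LC_ALL=C wc -m)
char_count=$((char_count))        # strip the padding some wc implementations add
printf '%s\n' "$char_count"

# 3. Machine-readable numbers: under LC_ALL=C the decimal separator is a dot.
ratio=$(LC_ALL=C awk 'BEGIN { printf "%.2f\n", 1 / 3 }')
printf '%s\n' "$ratio"
```

The script prints A, B, a, b on separate lines (upper case sorts first because of the byte values), then 5, then 0.33.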


Notes

¹ again, only on ASCII based systems (the immense majority of systems). POSIX requires the collation order for the C locale to be the order of characters in the ASCII charset, even on EBCDIC systems, which are therefore not allowed to do the strcoll() == strcmp() optimisation in the C locale.


² Depending on the encoding of the text, that's not necessarily the right thing to do though. That's valid for UTF-8 or single-byte character sets (like iso-8859-1), but not necessarily non-UTF-8 multibyte character sets.

For instance, if you're in a zh_HK.big5hkscs locale (Hong Kong, using the Hong Kong variant of the BIG5 Chinese character encoding), and you want to look for English letters in a file encoded in that charset, doing either:

LC_ALL=C grep '[[:alpha:]]'

or

LC_ALL=C grep '[a-zA-Z]'

would be wrong, because in that charset (and many others, though they are hardly used since UTF-8 came out), a lot of characters contain bytes that correspond to the ASCII encoding of A-Za-z characters. For instance, all of A䨝䰲丕乙乜你再劀劈呸哻唥唧噀噦嚳坽 (and many more) contain the encoding of A. 䨝 is 0x96 0x41, and A is 0x41 like in ASCII. So our LC_ALL=C grep '[a-zA-Z]' would match on those lines that contain those characters as it would misinterpret those sequences of bytes.

LC_COLLATE=C grep '[A-Za-z]'

would work, but only if LC_ALL is not otherwise set (which would override LC_COLLATE). So you may end up having to do:

grep '[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]'

if you wanted to look for English letters in a file encoded in the locale's encoding.
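The false positive described in this note can be reproduced with the two bytes quoted above. This is a sketch that feeds the raw bytes 0x96 0x41 to grep; it does not need a BIG5-HKSCS locale to be installed.

```shell
# 0x96 0x41 (octal 226 101) is a single character in BIG5-HKSCS, not the letter A.
# In the C locale grep sees two one-byte characters, and the second one is "A".
false_positives=$(printf '\226\101\n' | LC_ALL=C grep -c '[A-Za-z]')
printf '%s\n' "$false_positives"
```

The count is 1: the line is reported as containing an English letter although it holds none.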


³ some would argue it's rather 1 to 4 bytes these days now that Unicode code points (and the libraries that encode/decode UTF-8 data) have been arbitrarily restricted to code points U+0000 to U+10FFFF (0xD800 to 0xDFFF excluded) down from U+7FFFFFFF to accommodate the UTF-16 encoding, but some applications will still happily encode/decode 6-byte UTF-8 sequences (including the ones that fall in the 0xD800 .. 0xDFFF range).

  • 19
    +1, it's the best answer (for pointing out the overriding, etc). But lacks the (nice) examples of Ignacio's answer ^^ – Olivier Dulac Aug 22 '13 at 11:08
  • 2
    A minor nitpick: The C locale is only required to support the "portable character set" (ASCII 0-127), and behavior for chars > 127 is technically unspecified. In practice, most programs will treat them as opaque data and pass them through as you described. But not all: in particular, Ruby may choke on char data with bytes > 127 if running in the C locale. I honestly don't know if that's technically "conformant", but we've seen it in the wild. – Andrew Janke Dec 16 '15 at 19:26
  • 3
    @AndrewJanke, yes. Note that portable character set does not even imply ASCII nor 0-127. There has been a lot of discussion on the Austin group mailing list on what the properties of the "C" locale character set would be and the general consensus (and that will be clarified in the next spec) is that that charset would be single-byte, and encompass the full 8bit range (with the properties described here). In the meantime, yes there can be some divergence (as bug or because the spec is not explicit enough). In any case LC_ALL=C is the closest you can get to a sane behaviour. – Stéphane Chazelas Dec 16 '15 at 20:11
  • 1
    A Unicode code point in UTF-8 can have a maximum of 4 octets (or bytes), but some characters need more than one code point, which can lead to sequences longer than 6 octets. – 12431234123412341234123 Apr 18 '17 at 17:15
  • 3
    @12431234123412341234123, the original UTF-8 encoding covers up to U+7FFFFFFF (6 bytes, and there are some extensions to go up to 13 bytes like perl's \x{7FFFFFFFFFFFFFFF}) and while the range of Unicode code points has been arbitrarily restricted to U+10FFFF (due to UTF-16 design limitation), some tools still recognise/produce 6 byte characters. That's what I meant by 6 byte characters. In Unix semantics, one character is one codepoint. Your more than one codepoint "characters" are more generally referred to as grapheme clusters to disambiguate from characters. – Stéphane Chazelas Apr 18 '17 at 17:42
  • @StéphaneChazelas, what does C actually stand for? – 1.61803 Feb 23 '19 at 09:48
  • The link relative to some characters have the same sorting order seems to be rotten. Moreover, using bash 5.0, I couldn't find a LC_ALL value that would give me the drawback of having the same sorting order for two characters. Do you remember the given example in your link? – Ulysse BN Dec 27 '19 at 02:47
  • 2
    @UlysseBN, that has nothing to do with bash. It's about the locale definition. See https://lists.gnu.org/archive/html/bug-bash/2019-12/msg00098.html for a more recent example. I used to use ①②③④⑤ as striking examples (see for example What is the difference between "sort -u" and "sort | uniq"?), but they've now been fixed. Still in current GNU locales (as of glibc 2.30 at least), over 99% of characters don't have a defined order. See those – Stéphane Chazelas Dec 27 '19 at 10:32
  • This is a great answer. To clarify for my peace of mind, can it ever be destructive to use LC_ALL=C when processing text? Is it possible to lose text doing so? – Hashim Aziz Nov 29 '20 at 21:36
  • 1
    @Prometheus, yes, if you're processing non-ASCII text. – Stéphane Chazelas Nov 30 '20 at 06:52
  • Hey mate, I asked a more fleshed out question here that I'd really appreciate your input on, if possible. – Hashim Aziz Dec 01 '20 at 17:59
286

It forces applications to use the default language for output:

$ LC_ALL=es_ES man
¿Qué página de manual desea?

$ LC_ALL=C man
What manual page do you want?

and forces sorting to be byte-wise:

$ LC_ALL=en_US sort <<< $'a\nb\nA\nB'
a
A
b
B

$ LC_ALL=C sort <<< $'a\nb\nA\nB'
A
B
a
b
slm
  • 369,824
  • 33
    +1 for good examples, but lacks the important info that is in Stephane's answer... – Olivier Dulac Aug 22 '13 at 11:06
  • 14
    What do you mean by default language? – Stéphane Chazelas Sep 10 '14 at 14:59
  • 1
    @StéphaneChazelas: Whatever language the strings in the application are written in. – Ignacio Vazquez-Abrams Sep 10 '14 at 14:59
  • You mean for gettext/LC_MESSAGES? But then if they are localised then their "strings" should be in English, right? Trying to see if there's any standard requiring it. – Stéphane Chazelas Sep 10 '14 at 15:23
  • 1
    They will be in whatever language whoever wrote the application wrote them in. – Ignacio Vazquez-Abrams Sep 10 '14 at 15:25
  • 3
    Yes, I understand the author can do whatever he likes including not do what it says on the tin. The thing is. US English is the only language that can be represented correctly with the charset in LC_ALL=C, the only language where the sorting order in LC_ALL=C (LC_COLLATE) makes sense, LC_ALL=C (LC_TIME) has English month and day names. I've never seen apps where LC_ALL=C returned message in a different language from LC_ALL=en LANGUAGE=en. So am I entitled to report a bug against a program if that's not the case? (not talking about apps not translated to English here). – Stéphane Chazelas Sep 10 '14 at 19:55
  • 3
    The problem is "US English is the only language that can be represented correctly with the charset in LC_ALL=C". This is usually only true in C/C++ programs when using narrow characters, but even then there are exceptions (since there are several languages that only use characters and symbols found in ASCII). Reporting a bug when the default language is not English will make you seem... bigoted. – Ignacio Vazquez-Abrams Sep 10 '14 at 22:37
  • 1
    Do you have an example of a language other than US-English that is compatible with ASCII? (genuine question, I could not think of one myself). Do you know of a software that has been translated to English and where messages are not in English when the locale is C? – Stéphane Chazelas Sep 19 '14 at 09:47
  • 3
    Note that in English (meaning LANG=en_US.utf8) the messages can (and should) use unicode characters such as “” for quoting strings. Whereas in LANG=C, it only has ASCII ones (double quotes, backquotes and apostrophes). – Ángel Mar 10 '15 at 16:55
  • Why does it force sorting to be bytewise? – Anton K Oct 13 '16 at 13:30
  • @AntonK: Because bytewise sorting is the default. – Ignacio Vazquez-Abrams Oct 13 '16 at 13:39
  • Be careful: when I added this to my .profile and restarted, it prevented the gnome terminal from starting – Hasnaa Ibraheem Nov 16 '17 at 13:55
  • @StéphaneChazelas: English may or may not be representable in ASCII, because of foreign loan words, typographical conventions on punctuation, and so on, but IMO these don't count as the "language", so I agree that "English is compatible with ASCII". Note: NOT "US English". my system list 17 variations of English, all of which can be typed on an ASCII keyboard. And Latin. And various others, apparently: Swahili, Ido, etc. – EML Jan 16 '24 at 10:45
11

C is the default locale, and "POSIX" is an alias of "C". I guess "C" is derived from ANSI C. Maybe ANSI C defines the "POSIX" locale.

GAD3R
  • 66,769
  • Both C and UNIX by far predate ANSI C. – user Aug 22 '13 at 10:55
  • @MichaelKjörling: So? I've seen pre-ANSI documentation, and it didn't have locales. Internally at AT&T Bell Labs, everyone spoke English. – MSalters Aug 22 '13 at 14:50
  • @MSalters The fact that pre-ANSI documentation for the C language doesn't mention locales (which may or may not imply that pre-ANSI, C had no concept of locales; after all, I'm pretty sure the language still does not, but that's beside the point) does not imply that the C locale name derives from "ANSI C". – user Aug 22 '13 at 21:18
  • 6
    @MichaelKjörling: You're missing the point. When locales were introduced, "C" already meant "ANSI C". That it meant K&R C in the past is irrelevant. – MSalters Aug 23 '13 at 07:36
  • What is the "default locale"? – robertspierre Jan 28 '23 at 22:15
6

As far as I can tell, OS X uses code point collation order in UTF-8 locales, so it is an exception to some of the points mentioned in the answer by Stéphane Chazelas.

This prints 26 in OS X and 310 in Ubuntu:

export LC_ALL=en_US.UTF-8
printf %b $(printf '\\U%08x\\n' $(seq $((0x11)) $((0x10ffff))))|grep -a '[a-z]'|wc -l

The code below prints nothing in OS X, indicating that the input is sorted. The six surrogate characters that are removed cause an illegal byte sequence error.

export LC_ALL=en_US.UTF-8
for ((i=1;i<=0x1fffff;i++));do
  x=$(printf %04x $i)
  [[ $x = @(000a|d800|db7f|db80|dbff|dc00|dfff) ]]&&continue
  printf %b \\U$x\\n
done|sort -c

The code below prints nothing in OS X, indicating that there are no two consecutive code points (at least between U+000B and U+D7FF) that have the same collation order.

export LC_ALL=en_US.UTF-8
for ((i=0xb;i<=0xd7fe;i++));do
  printf %b $(printf '\\U%08x\\n' $((i+1)) $i)|sort -c 2>/dev/null&&echo $i
done

(The examples above use %b because printf \\U25 results in an error in zsh.)

Some characters and sequences of characters that have the same collation order in GNU systems do not have the same collation order in OS X. This prints ① first in OS X (using either OS X's sort or GNU sort) but ② first in Ubuntu:

export LC_ALL=en_US.UTF-8;printf %s\\n ② ①|sort

This prints three lines in OS X (using either OS X's sort or GNU sort) but one line in Ubuntu:

export LC_ALL=en_US.UTF-8;printf %b\\n \\u0d4c \\u0d57 \\u0d46\\u0d57|sort -u
nisetama
  • 1,097
6

It appears that LC_COLLATE controls the "alphabetical order" used by ls, as well. The US locale will sort as follows:

a.C
aFilename.C
aFilename.H
a.H

basically ignoring the periods. You might prefer:

a.C
a.H
aFilename.C
aFilename.H

I certainly do. Setting LC_COLLATE to C accomplishes this. Note that it will also sort lower case after all capitals:

A.C
A.H
AFilename.C
a.C
a.H
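The byte-wise ordering can be tried in a scratch directory. This is a sketch with illustrative file names; it uses LC_ALL rather than LC_COLLATE so that the result holds even when LC_ALL is already set, since LC_ALL overrides LC_COLLATE.

```shell
# Create a scratch directory with a few empty files (names are illustrative).
dir=$(mktemp -d)
touch "$dir/a.C" "$dir/aFilename.C" "$dir/a.H" "$dir/B.txt"

# In the C locale names are compared byte by byte: "B" (0x42) sorts before
# "a" (0x61), and "." (0x2e) sorts before "F" (0x46).
listing=$(cd "$dir" && LC_ALL=C ls)
printf '%s\n' "$listing"

rm -r "$dir"
```

The listing comes out as B.txt, a.C, a.H, aFilename.C, one name per line.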
HalosGhost
  • 4,790
2

In addition to @Ignacio Vazquez-Abrams's answer: for some console output, the variable has to be set for the whole session (with export) rather than for a single command.

For example,

As he mentions, in most cases it works this way:

$ man
What manual page do you want?
$ LC_ALL=es_ES man
¿Qué página de manual desea?

Yet it doesn't work in some cases:

$ LC_ALL=es_ES cpio
-bash: /usr/bin/cpio: Permission denied

So you need to do this instead:

$ export LC_ALL=es_ES
$ cpio
-bash: /usr/bin/cpio: Permiso denegado

Then switch back to English if needed:

$ export LC_ALL=C
$ cpio
-bash: /usr/bin/cpio: Permission denied
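A likely explanation, offered here as an assumption rather than something stated above: the -bash: prefix suggests the message is printed by the shell itself, and the VAR=value command form only changes the environment of the command being started, whereas export changes the shell session. A minimal sketch of that difference in scope, using a child sh to report what it sees:

```shell
unset LC_ALL

# Prefix form: the assignment is placed in the child's environment only.
child_sees=$(LC_ALL=C sh -c 'printf "%s" "${LC_ALL-unset}"')
shell_has=${LC_ALL-unset}
printf 'prefix: child=%s shell=%s\n' "$child_sees" "$shell_has"

# export form, in a subshell so the demonstration does not alter the caller:
# the shell's own variable changes, and children inherit it.
(
  export LC_ALL=C
  child_sees=$(sh -c 'printf "%s" "${LC_ALL-unset}"')
  printf 'export: child=%s shell=%s\n' "$child_sees" "${LC_ALL-unset}"
)
```

This prints "prefix: child=C shell=unset" followed by "export: child=C shell=C".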

Also note that for some non-alphabetical languages you had better append ".UTF-8", as others mention.

For example, for the Japanese language

$ export LC_ALL=ja_JP
$ cpio
-bash: /usr/bin/cpio: Ĥ

$ export LC_ALL=ja_JP.UTF-8
$ cpio
-bash: /usr/bin/cpio: 許可がありません