`LC_ALL=C sort` sorts by byte value. It will sort any input written in any charset by byte value, not only ASCII¹.

The UTF-8 encoding has the property that sorting by byte value is the same as sorting by Unicode code point: `memcmp()` will find that the encoding of U+1234 is greater than that of U+1233 or of any code point below U+1234.
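As an illustration, here's a minimal standalone C sketch (the two code points are arbitrary examples) showing that comparing UTF-8 encodings byte by byte gives the same order as comparing the code points they encode:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* UTF-8 encodings of two consecutive code points */
    const char u1233[] = "\xe1\x88\xb3"; /* U+1233 */
    const char u1234[] = "\xe1\x88\xb4"; /* U+1234 */

    /* memcmp() compares raw byte values; for valid UTF-8 that
       matches code point order, so U+1233 sorts before U+1234 */
    printf("%d\n", memcmp(u1233, u1234, 3) < 0); /* prints 1 */
    return 0;
}
```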
`C.utf-8`, `C.utf8` or `C.UTF-8` (the latter being more common in my experience) are not locales standardized by POSIX, but wherever they're found, they're meant to be locales that have most of the properties of the C locale except that the charset is UTF-8.

`LC_ALL=C.UTF-8 sort` would sort the input based on code point, but could end up decoding the UTF-8 before comparison or invoke the `strcoll()`/`strxfrm()` heavy machinery, which would end up being wasted effort given that for UTF-8, using `memcmp()` is enough for that.
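To see why that machinery is redundant, here's a hedged C sketch (it assumes the system provides a C.UTF-8 locale, which not all do) where `strcoll()` and `memcmp()` agree on valid UTF-8 input:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* C.UTF-8 is not available everywhere; bail out if missing */
    if (setlocale(LC_ALL, "C.UTF-8") == NULL) {
        fputs("no C.UTF-8 locale on this system\n", stderr);
        return 1;
    }
    const char *s1 = "a\xc2\xa3" "1";     /* "a£1", £ = U+00A3 */
    const char *s2 = "a\xe2\x82\xac" "2"; /* "a€2", € = U+20AC */

    /* both comparisons agree in sign: s1 sorts before s2 */
    printf("strcoll: %d\n", strcoll(s1, s2) < 0);   /* 1 */
    printf("memcmp:  %d\n", memcmp(s1, s2, 4) < 0); /* 1 */
    return 0;
}
```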
With GNU `sort` and GNU libc as found on many non-embedded OSes that use Linux as their kernel (here also adding NUL characters in the input, which GNU `sort` supports even though `strcoll()` doesn't):
$ printf 'a\0£1\na\0€2\n' | LC_ALL=C ltrace -e strcoll -e memcmp sort
sort->memcmp("a\0\302\2431", "a\0\342\202\254", 5) = -1
a£1
a€2
$ printf 'a\0£1\na\0€2\n' | LC_ALL=C.UTF-8 ltrace -e strcoll -e memcmp sort
sort->strcoll("a", "a") = 0
sort->strcoll("\302\2431", "\342\202\2542") = -31
a£1
a€2
(Actually, you'll find that if the two strings to compare have the same number of bytes, GNU `sort` calls `memcmp()` first before calling `strcoll()` in case they are identical, as `memcmp()` is so cheap compared to `strcoll()`.)
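That optimisation could look roughly like this (a sketch of the idea only, not GNU `sort`'s actual code):

```c
#include <string.h>

/* Compare two NUL-terminated lines: when they have the same length,
   try the cheap byte-wise equality check first, and only fall back
   to the expensive locale-aware strcoll() if the bytes differ. */
static int compare_lines(const char *a, size_t alen,
                         const char *b, size_t blen)
{
    if (alen == blen && memcmp(a, b, alen) == 0)
        return 0;
    return strcoll(a, b);
}
```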
Some timings on that output repeated 1,000,000 times:
$ printf 'a\0£1\na\0€2\n%.0s' {1..1000000} > file.test
$ wc -mc file.test
10000000 13000000 file.test
$ time LC_ALL=C sort file.test > /dev/null
LC_ALL=C sort file.test > /dev/null 0.74s user 0.06s system 390% cpu 0.205 total
$ time LC_ALL=C.UTF-8 sort file.test > /dev/null
LC_ALL=C.UTF-8 sort file.test > /dev/null 6.04s user 0.12s system 522% cpu 1.179 total
So to sort UTF-8 encoded text by code point, using `C` or `C.UTF-8` will make no difference functionally, but using `C` may be more efficient depending on the `sort` implementation.
Now, not all sequences of bytes form valid UTF-8, so when it comes to non-UTF-8 input, that is, input that contains sequences of bytes that can't be decoded as UTF-8, you may find the behaviour differs between `C` and `C.UTF-8`. Still on a GNU system:
$ print -l 'a\200b' 'a\201b' | LC_ALL=C sort -u
a�b
a�b
$ print -l 'a\200b' 'a\201b' | LC_ALL=C.UTF-8 sort -u
a�b
(where � is my terminal emulator's rendition of unknown things)
In C.UTF-8, `strcoll()` returns 0 on those two strings that don't form valid UTF-8 text, in effect reporting that they have the same sorting order.

In the C locale, any line made of a sequence of bytes other than 0 and no longer than `LINE_MAX` bytes is valid text. In C.UTF-8, there are further restrictions: `a\200b` is not valid in UTF-8, so it's not text, and as per POSIX, the behaviour of `sort` on it is unspecified.
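That `strcoll()` behaviour can be reproduced from C directly (again assuming a C.UTF-8 locale is available; on the GNU system above it prints 0):

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (setlocale(LC_ALL, "C.UTF-8") == NULL) {
        fputs("no C.UTF-8 locale on this system\n", stderr);
        return 1;
    }
    /* 0x80 and 0x81 can't start a UTF-8 sequence, so neither
       string forms valid UTF-8 text */
    printf("%d\n", strcoll("a\x80" "b", "a\x81" "b")); /* 0 here */
    return 0;
}
```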
As a side note: on GNU systems, while `LC_ALL=C` takes precedence over `$LANGUAGE` for the language of the messages, `LC_ALL=C.UTF-8` doesn't.
$ LC_ALL=C LANGUAGE=fr:es:en sort /
sort: read failed: /: Is a directory
$ LC_ALL=C.UTF-8 LANGUAGE=fr:es:en sort /
sort: échec de lecture: /: est un dossier
¹ also note that the `C` locale charset doesn't have to be based on ASCII and that ASCII only covers values 0 to 127. `C` locales that use ASCII still consider bytes 128 to 255 as characters, albeit undefined characters. The `C` locale charset has to guarantee one byte per character though, so the `C` locale charset cannot be UTF-8.