3

In some cases, the collating order of each individual character needs to be known (to be used). It is commonly expressed in the character class of a regex like [b-d]. That character class will match only one character in the range given.

Which individual characters are those in the range b-d (or other range).

It is also known that the collating order in the C locale is the byte value of each ASCII character[a] (showing only characters from 33 to 126):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Could the character range be expanded beyond ASCII?

But:

What is the collating order of individual characters in other locales ?

Is there a way to s̲h̲o̲w̲ such collating order (in any locale)?

[a] In systems where ASCII is used (most of the systems), but others may use EBCDIC or even something else.

3 Answers3

7

There are several aspects to that. We need to list all the characters in the locale's charset, select the graphical ones (like your 33 to 126 ASCII ones) and sort them.

And there's also the question of whether it makes sense to talk of collation order for characters or if it's ever defined. I'll start with that last point.

The POSIX collation algorithm

If we're talking of the C/POSIX collation algorithm as implemented by strcoll() and used by sort, or ls, or shell globs or awk/expr's <, > string comparison operator and more generally most tools that sort text based on the user's locale on POSIX systems, note that they're for comparing strings.

While in a en_US.UTF-8 locale on a GNU system, the é string that consists of a single é character sorts after the string that consists of a single e character, Stéphane sorts before Stephanie. In a cs_CZ.UTF-8 locale, c sorts between b and d, but ch sorts between h and i.

The collation algorithm considers the whole string, not characters individually. So knowing the order of characters when compared individually does not necessarily tell us how strings containing those characters will compare.

That algorithm is designed to compare strings like many languages do in the real world like in dictionaries and phone books. It is a bit too simple to cover all subtleties of sorting in different cultures (see ICU, which implement different algorithms for more on that), but is good enough for most cases.

In that algorithm, collating elements, which include characters but also combination of characters like the multi-part graphemes of the Korean alphabet or the Czech ch or on some systems the é expressed as e followed by a combining acute accent (U+0301) are assigned several weights.

And whole strings are compared using each weight in turn, from the primary weight to the last weight.

For instance, in that en_US.UTF-8 GNU locale, E, é, e, É all have the same primary weight. Stéphane and Stephanie are decomposed in

<S><t><é><p><h><a><n> <e>
<S><t><e><p><h><a><n> <i><e>

collating elements (here, one per character).

Up to n, the collating elements of the two strings have the same primary weight, but i's primary weight is greater that e's, so Stephanie sorts after Stéphane and the secondary weight doesn't even have to be considered.

Now, for Stephane vs Stéphane, when comparing primary weights, they sort the same, so the secondary weight has to be considered. If you look at /usr/share/i18n/locales/iso14651_t1_common on a GNU system as used as-is by the en_US.UTF-8 locale, you see:

<BAS> # 15
[...]
<ACA> # 18
[...]
<U0065> <e>;<BAS>;<MIN>;IGNORE # 259 e
<U00E9> <e>;<ACA>;<MIN>;IGNORE # 260 é

For characters in the latin alphabet, the secondary weight is used to compare diacritics. And the base character (BAS) sorts before the one with acute accent (ACA). So Stéphane sorts after Stephane. To compare STÉPHANE against Stéphane, we'd have to go up to the third weight where upper case sorts after lower case in English (contrary to Estonian for instance).

And there are the non-alphanumeric characters like space or punctuation whose primary weight is IGNORE and are not considered in the first comparison pass (de facto sorts between deface and degree, that doesn't mean space sorts between f and g).

For $'STE\u0301HANE' vs Stéphane, some systems like Solaris will treat E\u0301 as a collating element with same weight except for the last as the É (U+00C9) character, while some others treat \u0301 like punctuation, giving not as good results (like that $'STE\u0301HANE' before Stephane).

Not a total order

On GNU systems the sort order of U+0301 is not even defined, and there are thousands more characters in that case. I like to take the example of the rounded digit numbers (U+2460..U+2473) as those clearly should have a sorting order but don't¹:

$ touch ① ② ③ ④ ⑤
$ ls
④  ③  ⑤  ②  ①
$ ls | sort -u
④

There are also characters that are actually defined as having the exact same weights as others (like Ǝ, Ə, Ɛ that all sort the same here).

For that reason, it is actually not possible to sort arbitrary characters in some locales, unless, like some sort implementations do (and that will be a requirement in the next major version of the POSIX specification), you fall back to a memcmp() like comparison for characters that sort the same.

Listing all graphical characters in the locale's charset

Different locales may use different character sets.

There are three main categories of charsets, single byte character sets like ASCII or iso-8859-x where each byte corresponds to a character (though some may be undefined), multi-byte character sets (and encoding) like UTF-8, GB18030, BIG5 or EUCJP where characters are encoded on a varying number of bytes, and stateful ones, where a byte or sequence of bytes may represent different characters depending on whether a state-shifting code has been issued before.

That last category is rarely used in locales these days and is generally unmanageable so we can ignore it for now.

The C locale itself is guaranteed to have a single-byte character set. It doesn't have to be ASCII though it generally is on systems where it's not EBCDIC-based.

Note that some scripts like the latin one used in English are written left to right while some others are written right to left so having characters of those different scripts (as supported by some charsets) on the same line is not necessarily a good idea.

Same for combining characters which would end up being combined to random characters and together.

Also note that some charsets like Unicode are still evolving. While it's now fixed to 0..0xD7FF, 0xE000..0x10FFFF codepoint ranges, most of those are still unassigned and every new version of Unicode assigns new ones, and system vendors try to keep up.

The characters classified as graph are listed by the ISO/IEC 30112 Technical Report (2014) which follows up on ISO/IEC TR 14652 (2002). GNU locales seem to follow that, some others (like FreeBSD/Solaris) don't, but I won't blame them as that doesn't seem to make much sense to me. For instance, it excludes most spacing characters, but not U+00A0 (non-breaking space), U+2007 (figure space) nor U+200B (zero width space). It includes characters I would consider as control characters, like U+200C..U+200F, U+202D, U+202E...² that latter one, the right-to-left override being critical for this Q&A as it reverses the order of left-to-right characters:

$ printf '%b\n' '\u202E' a b c | sort | paste -sd '\0' -
‮abc

(some browsers will show cba if they support it, others abc).

It also includes most tag characters and also the thousands of private use characters, which are unlikely to be assigned, let alone drawable on your system.

For single byte charsets (on GNU systems, those for which locale ctype-mb-cur-max returns 1), listing graphical characters should just be a matter of looping over all 255 byte values (omitting the first one, NUL in every charset which is not graphical and would cause problems) and match them against [[:graph:]].

One could do that with awk for instance:

awk '
  BEGIN{
    for (i = 1; i < 256; i++) {
      c = sprintf("%c", i)
      if (c ~ /[[:graph:]]/) print c
    }
  }' | sort | paste -sd '\0' -

In a el_GR.iso88597 locale, Greek using the iso8859-7 single byte charset, that would give:

`^¨~<=>¦°_­-,;:!?/.·'ʽʼ"«»()[]{}§©@$£¤¥*\&#%+±ª―΄΅0½12²3³456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZΑαΆάΒβΓγΔδΕεΈέΖζΗηΉήΘθΙιΊίΪϊΐΚκΛλΜμΝνΞξΟοΌόΠπΡρΣσςΤτΥυΎύΫϋΰΦφΧχΨψΩωΏώ 

(with the trailing non-breaking space mis-classified as "graph" by GNU locales).

That approach can't be used for multi-byte characters.

If your iconv supports a UCS-4BE or UTF32BE charset encoding, you can generate all unicode code points as 32bit big endian numbers and convert them to the locale's charset:

perl -e 'print pack("L>*", $_, 10) for 1..0xd7ff, 0xe000..0x10ffff' |
  iconv -cf UCS-4BE |
  grep '[[:graph:]]' |
  sort

Or if it supports a UTF-8 one:

perl -C -M-warnings=nonchar -le 'print chr$_ for 1..0xd7ff, 0xe000..0x10ffff' |
  iconv -cf UTF-8 |
  grep '[[:graph:]]' |
  sort

(left on one character per line to avoid the problems mentioned above and also to avoid generating too long a line).

That works on the fact that Unicode (and thus its encodings) is meant to include characters of all other possible charsets, so there will always be a Unicode codepoint for each character in any charset. Modern systems actually define their charsets in relation to Unicode and their wchar_t usually corresponds to Unicode codepoints.

Now, as discussed above, sort resorts to a memcmp()-based comparison for characters that sort the same with strcoll(). For single-byte charsets, that will be a sort on the codepoint in those charsets; for UTF-8, that will sort on the Unicode code point as UTF-8 has that specific property. For other encodings of Unicode like the Chinese GB18030 or other multibyte charsets, that may more or less seem random.

In any case, that will mean that for two locales with the same collation order, the output of sort will be different if those locales use different character sets.

For instance, if we get back to our ①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳ rounded numbers. Unicode specifies them in that order (codepoint 0x2460 to 0x2473). In GNU locales, their order is not defined (① is neither before nor after ②). Sort will put ② after ① in a locale using UTF-8 because UTF-8 order follows the Unicode codepoint order. But in a locale like zh_CN.gb18030 that uses GB18030, another encoding of Unicode from China, the order becomes ⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳①②③④⑤⑥⑦⑧⑨⑩, down to how those characters are encoded at byte level (or at least would be if it wasn't for this bug which makes them order as ①②③④⑤⑥⑦⑧⑨⑩⑮⑯⑰⑱⑲⑳⑪⑫⑬⑭ instead).

If you wanted to order the characters of a string based on their collation order, with zsh:

printf "%s\n" ${(j::)${(s::o)string}}

Note that zsh can have NUL characters in its variables, but strcoll() doesn't work with those. zsh tries to work around that, but it's not perfect.

The result would be non-deterministic if strings contained different characters with same sorting order.


¹ 2019 edit the order of ① ② ③ ④ ⑤ has since been fixed in newer versions of the GNU libc, but as of 2.30, over 95% of characters still don't have a defined order, you can replace ① ② ③ ④ ⑤ with for instance. Hopefully, GNU locales will eventually be fixed completely (they will have to if they want to comply to the next revision of the standard) and the problem will then be limited to user-defined locales

² I believe the rationale is that the non-breaking space, figure space, zero width space etc are not included in the space category on the ground that they should not be used as delimiters (ISO30112 defines them as characters to be classified as white-space characters, to find syntactical boundaries), and graph is defined as the non-spacing printable characters (the characters that are in the print category and not the space category, though the text of ISO30112 defines it as characters to be classified as printable characters, not including the <space> character). So in effect, that's graphical characters and printable non-graphical characters that are not used as syntactical boundaries.

  • You are missing the point. –  May 15 '18 at 19:13
  • As I wrote a long time ago: You are missing the point. (1)The question is entirely about individual characters, not arbitrary strings. (2) Aiming at only graphical characters is a limitation that you have added by yourself. (3) Now, the collating order for glibc has included the byte value as a "last resort" collating element, thus: it is now a "total order". Rounded numbers sort perfectly well now. Well, at least in UTF-8. –  Jun 26 '19 at 22:25
4

In bash and zsh, it is easy to generate a range of numbers:

$ printf '\\x%x' {53..63}
\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f

And it is also easy to convert UNICODE numbers to characters:

$ printf '%b' "$(printf '\\U%x' {53..63})"
56789:;<=>?

It is a gift from the bash coder that the range for the \UXXXXXXXX has no limit, it works from code 0 up to 10FFFF (1114111 in decimal).

So, we can generate all the first 127 characters with this simple code:

$ printf '%b' "$(printf '\\U%x' {0..127})" | hd
00000000  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f  |................|
00000010  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f  |................|
00000020  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f  | !"#$%&'()*+,-./|
00000030  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f  |0123456789:;<=>?|
00000040  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f  |@ABCDEFGHIJKLMNO|
00000050  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f  |PQRSTUVWXYZ[\]^_|
00000060  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  |`abcdefghijklmno|
00000070  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f  |pqrstuvwxyz{|}~.|
00000080

Adding a NUL (0x00) character for each character and using sort -z we can see the collating order:

$  printf '%b' "$(printf '\\U%x\\0' {0..127})" | sort -z
`^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+


▒▒123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

And by removing the NUL, see the numeric order:

$ printf '%b' "$(printf '\\U%x\\0' {0..127})" | sort -z | tr -d '\0' | hd
00000000  60 5e 7e 3c 3d 3e 7c 20  5f 2d 2c 3b 3a 21 3f 2f  |`^~<=>| _-,;:!?/|
00000010  2e 27 22 28 29 5b 5d 7b  7d 40 24 2a 5c 26 23 25  |.'"()[]{}@$*\&#%|
00000020  2b 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f  |+...............|
00000030  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f  |................|
00000040  7f 30 31 32 33 34 35 36  37 38 39 61 41 62 42 63  |.0123456789aAbBc|
00000050  43 64 44 65 45 66 46 67  47 68 48 69 49 6a 4a 6b  |CdDeEfFgGhHiIjJk|
00000060  4b 6c 4c 6d 4d 6e 4e 6f  4f 70 50 71 51 72 52 73  |KlLmMnNoOpPqQrRs|
00000070  53 74 54 75 55 76 56 77  57 78 58 79 59 7a 5a     |StTuUvVwWxXyYzZ|
0000007f

Note the difference with the C locale order:

$ printf '%b' "$(printf '\\U%x\\0' {0..127})" | LC_COLLATE=C sort -z | tr -d '\0' | hd
00000000  01 02 03 04 05 06 07 08  09 0a 0b 0c 0d 0e 0f 10  |................|
00000010  11 12 13 14 15 16 17 18  19 1a 1b 1c 1d 1e 1f 20  |............... |
00000020  21 22 23 24 25 26 27 28  29 2a 2b 2c 2d 2e 2f 30  |!"#$%&'()*+,-./0|
00000030  31 32 33 34 35 36 37 38  39 3a 3b 3c 3d 3e 3f 40  |123456789:;<=>?@|
00000040  41 42 43 44 45 46 47 48  49 4a 4b 4c 4d 4e 4f 50  |ABCDEFGHIJKLMNOP|
00000050  51 52 53 54 55 56 57 58  59 5a 5b 5c 5d 5e 5f 60  |QRSTUVWXYZ[\]^_`|
00000060  61 62 63 64 65 66 67 68  69 6a 6b 6c 6d 6e 6f 70  |abcdefghijklmnop|
00000070  71 72 73 74 75 76 77 78  79 7a 7b 7c 7d 7e 7f     |qrstuvwxyz{|}~.|
0000007f

The 128 values a byte could have beside the ASCII range, printed as utf8, yield (not collation order sorted):

$ printf '%b' "$(printf '\\U%x' {125..255})"
}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

The same range sorted by the collating order is (note the first character being the acute accent and the alternating lower and Upper case vowels):

$ printf '%b' "$(printf '\\U%x\\0' {125..255})" | sort -z
´¨~÷׬¦°µ¯­¡¿·¸«»}§¶©®¤¢£¥±¼½¾¹²³áÁàÀâÂåÅäÄãêæÆ
çÇðÐéÉèÈêÊëËíÍìÌîÎïÏñÑóÓòÒôÔöÖõÕøØºßúÚùÙûÛüÜýÝÿþÞ

And the sort order becomes much much more complex when the range is expanded:

$ printf '%b' "$(printf '\\U%x\\0' {0..500})" | sort -z | tr -d '\0'
´`^¨~÷×<=>¬|¦°µ _¯­-,;:!¡?¿/.·¸'"«»()[]{}§¶©®@¤¢$£¥*\&#%+
±ƀƂƃƄƅƉƋƌƍƑƒƔƖƗƚƛƜƝƞƤƥƦƧƨƩƪƫƬƭƮƱƲƸƹƺƻƼƽƾǀǁǂǃ

▒▒0¼½¾1¹2²3³456789aAáÁàÀăĂâÂǎǍåÅäǟÄǞãÃǡǠąĄāĀªæǣÆǢ
bBƁcCćĆĉĈčČċĊçÇƈƇdDďĎđĐƊðÐdzDzDZdžDžDŽeEéÉèÈĕĔêÊěĚëËėĖęĘēĒǝƎƏƐ
fFgGǴğĞĝĜǧǦġĠǥǤģĢƓƣƢhHĥĤħĦƕiIíÍìÌĭĬîÎǐǏïÏĩĨįĮīĪıİijIJ
jJĵĴǰkKǩǨķĶƙƘlLĺĹľĽŀĿłŁļĻljLjLJmMnNńŃňŇñÑņŅʼnŋŊnjNjNJ
oOóÓòÒŏŎôÔǒǑöÖƟőŐõÕøØǫǭǪǬōŌơƠºƆœŒpPqQĸrRŕŔřŘŗŖ
sSśŚŝŜšŠşŞſßtTťŤŧŦţŢuUúÚùÙŭŬûÛǔǓůŮüǘǜǚǖÜǗǛǙǕűŰũŨųŲūŪưƯ
vVwWŵŴƿxXyYýÝŷŶÿŸƴƳzZźŹžŽżŻƶƵþÞƷǯǮ

Sorting such a large group of characters in a pure C locale is not possible as the printf generates the characters as pairs of bytes in utf8 and sort (with a plain C locale) will try to sort bytes. Or printf will fail with \U… for a C localle. What is needed is an equivalent locale that allows more characters than the 256 a byte could hold:

$ LC_ALL=C.UTF-8 printf "$(printf '\\U%x' {32..500})" | sort
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ 
[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶· 
¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñò 
óôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭ 
ĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨ
ũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣ
ƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃDŽDždžLJLjljNJNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞ
ǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰDZDzdzǴ
  • Note that in zsh/bash/ksh93, it's from \U0 to \U7fffffff supporting the whole UTF-8 range as originally specified (before it was restricted to 0..d800, e000..10ffff). GNU printf (where the \uxxxx notation originated) supports \U00000000 to \U7fffffff but reports an error for \ud800 to \udfff (non-character codepoints now reserved for the UTF-16 surrogate pairs). It also wants exactly 4 digits for \uxxxx and 8 for \uxxxxxxxx. – Stéphane Chazelas Feb 24 '18 at 22:53
  • Can you please clarify what you mean by The 128 values a byte could have beside the ASCII range, printed as utf8? – Stéphane Chazelas Feb 24 '18 at 22:57
  • 1
    @Kusalananda, in a C locale on a system where the C locale charset is ASCII, there is no non-ASCII character, but since the sorting is based on the byte value, and all byte values are considered valid characters even if not defined, LC_ALL=C sort can be used to sort any text file in any charset. A UTF-8 encoded é is seen as two undefined characters, a latin1 encoded on as one undefined character. A BIG5HKSCS one as an undefined character followed by a m, but sort doesn't care about what character they represent in the C locale as it just sorts on the byte numbers. – Stéphane Chazelas Feb 25 '18 at 20:42
  • 1
    (continued). That's not a sort order that is useful to a human, but it's one that is very simple to implement (and so very efficient), and a total order, so it's useful for instance to preprocess text before feeding to uniq -c/comm/join... where you don't care about the order as long as it's sorted. For UTF-8, ASCII or iso8859-1 encoded text, that also has that property that it sorts the file by Unicode codepoint. – Stéphane Chazelas Feb 25 '18 at 20:44
  • @StéphaneChazelas Thanks for your illumination and time! Much appreciated! – Kusalananda Feb 25 '18 at 21:17
  • @StéphaneChazelas In answer to your question: A byte could have a value from 0 to 255. ASCII is only defined and valid from 0 up to 127. The rest, from 128 up to 256 is not defined nor valid ASCII. Thus: The 128 values a byte could have beside the ASCII range is the range from 0x80 up to 0xFF. That range is used in different encodings for different characters. –  Mar 28 '18 at 23:57
0

Collate sequence is defined by locale files. Here is the sequence of all ASCII characters in locale en_US:

codes=$(for i in {32..126}; do echo "u$(printf '%04x' "$i")"; done)
grep_patterns=$(echo "$codes" | sed 's/^/^</; s/$/>/')
codes_locale=$(grep -oif - <<<"$grep_patterns" /usr/share/i18n/locales/iso14651_t1_common | tr -d '<>')

"ASCII sequence"

for c in $codes; do echo -ne "\$c"; done; echo

"en_US sequence"

for c in $codes_locale; do echo -ne "\$c"; done; echo

OUTPUT:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~

!"#%&'()*+,-./:;<=>?@[]^_`{|}~$0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

Should keep in mind that sequence order is not the same as sorting order. During sorting additional rules apply.