There are several aspects to that. We need to list all the characters in the locale's charset, select the graphical ones (like your 33 to 126 ASCII ones) and sort them.
And there's also the question of whether it makes sense to talk of collation order for characters or if it's ever defined. I'll start with that last point.
The POSIX collation algorithm
If we're talking of the C/POSIX collation algorithm as implemented by strcoll() and used by sort, ls, shell globs, awk's and expr's < and > string comparison operators, and more generally most tools that sort text based on the user's locale on POSIX systems, note that they're for comparing strings.
While in an en_US.UTF-8 locale on a GNU system, the string that consists of a single é character sorts after the string that consists of a single e character, Stéphane sorts before Stephanie. In a cs_CZ.UTF-8 locale, c sorts between b and d, but ch sorts between h and i.
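For instance, assuming the en_US.UTF-8 and cs_CZ.UTF-8 locales are available on your system:
$ printf '%s\n' Stephanie Stéphane | LC_ALL=en_US.UTF-8 sort
Stéphane
Stephanie
$ printf '%s\n' i ch h | LC_ALL=cs_CZ.UTF-8 sort
h
ch
i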
The collation algorithm considers the whole string, not characters individually. So knowing the order of characters when compared individually does not necessarily tell us how strings containing those characters will compare.
That algorithm is designed to compare strings the way many languages do in the real world, as in dictionaries and phone books. It is a bit too simple to cover all the subtleties of sorting in different cultures (see ICU, which implements different algorithms, for more on that), but it is good enough for most cases.
In that algorithm, collating elements, which include characters but also combinations of characters (like the multi-part graphemes of the Korean alphabet, the Czech ch, or on some systems the é expressed as e followed by a combining acute accent (U+0301)), are assigned several weights.
And whole strings are compared using each weight in turn, from the primary weight to the last weight.
For instance, in that en_US.UTF-8 GNU locale, E, é, e, É all have the same primary weight. Stéphane and Stephanie are decomposed into
<S><t><é><p><h><a><n> <e>
<S><t><e><p><h><a><n> <i><e>
collating elements (here, one per character).
Up to n, the collating elements of the two strings have the same primary weight, but i's primary weight is greater than e's, so Stephanie sorts after Stéphane and the secondary weight doesn't even have to be considered.
Now, for Stephane vs Stéphane, when comparing primary weights, they sort the same, so the secondary weight has to be considered. If you look at /usr/share/i18n/locales/iso14651_t1_common on a GNU system, as used as-is by the en_US.UTF-8 locale, you see:
<BAS> # 15
[...]
<ACA> # 18
[...]
<U0065> <e>;<BAS>;<MIN>;IGNORE # 259 e
<U00E9> <e>;<ACA>;<MIN>;IGNORE # 260 é
For characters in the Latin alphabet, the secondary weight is used to compare diacritics. And the base character (BAS) sorts before the one with an acute accent (ACA). So Stéphane sorts after Stephane. To compare STÉPHANE against Stéphane, we'd have to go up to the third weight, where upper case sorts after lower case in English (contrary to Estonian for instance).
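Putting those three levels together, on a GNU system with an en_US.UTF-8 locale that would typically give:
$ printf '%s\n' Stephanie STÉPHANE Stéphane Stephane | LC_ALL=en_US.UTF-8 sort
Stephane
Stéphane
STÉPHANE
Stephanie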
And there are the non-alphanumeric characters like space or punctuation whose primary weight is IGNORE and which are not considered in the first comparison pass (de facto sorts between deface and degree, but that doesn't mean the space sorts between f and g).
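Again, in a locale with that collation (en_US.UTF-8 here):
$ printf '%s\n' degree 'de facto' deface | LC_ALL=en_US.UTF-8 sort
deface
de facto
degree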
For $'STE\u0301PHANE' vs Stéphane, some systems like Solaris will treat E\u0301 as a collating element with the same weights (except for the last one) as the É (U+00C9) character, while others treat \u0301 like punctuation, giving poorer results (like that $'STE\u0301PHANE' sorting before Stephane).
Not a total order
On GNU systems the sort order of U+0301 is not even defined, and there are thousands more characters in that case. I like to take the example of the rounded digit numbers (U+2460..U+2473) as those clearly should have a sorting order but don't¹:
$ touch ① ② ③ ④ ⑤
$ ls
④ ③ ⑤ ② ①
$ ls | sort -u
④
There are also characters that are actually defined as having the exact same weights as others (like Ǝ, Ə, Ɛ that all sort the same here).
For that reason, it is actually not possible to sort arbitrary characters in some locales, unless, like some sort implementations do (and that will be a requirement in the next major version of the POSIX specification), you fall back to a memcmp()-like comparison for characters that sort the same.
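For instance, GNU sort does that last-resort byte comparison by default (it's disabled by -s, and -u compares with strcoll() only), so back in the directory of the ls example above, on a libc where those characters have no defined order, you'd typically see:
$ ls | sort -u | wc -l
1
$ ls | sort | uniq | wc -l
5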
Listing all graphical characters in the locale's charset
Different locales may use different character sets.
There are three main categories of charsets: single-byte character sets like ASCII or iso-8859-x, where each byte corresponds to a character (though some may be undefined); multi-byte character sets (and encodings) like UTF-8, GB18030, BIG5 or EUC-JP, where characters are encoded on a varying number of bytes; and stateful ones, where a byte or sequence of bytes may represent different characters depending on whether a state-shifting code has been issued before.
That last category is rarely used in locales these days and is generally unmanageable so we can ignore it for now.
The C locale itself is guaranteed to have a single-byte character set. It doesn't have to be ASCII, though it generally is except on EBCDIC-based systems.
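On a GNU system for instance, you can check which charset the C locale uses (ANSI_X3.4-1968 being glibc's name for ASCII):
$ LC_ALL=C locale charmap
ANSI_X3.4-1968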
Note that some scripts, like the Latin one used for English, are written left to right while some others are written right to left, so having characters of those different scripts (as supported by some charsets) on the same line is not necessarily a good idea. The same goes for combining characters, which would end up combining with whatever random character precedes them, or with each other.
Also note that some charsets like Unicode are still evolving. While its codepoint range is now fixed to 0..0xD7FF and 0xE000..0x10FFFF, most of those codepoints are still unassigned, every new version of Unicode assigns new ones, and system vendors try to keep up.
The characters classified as graph are listed by the ISO/IEC 30112 Technical Report (2014), which follows up on ISO/IEC TR 14652 (2002). GNU locales seem to follow that; some others (like FreeBSD/Solaris) don't, but I won't blame them as that doesn't seem to make much sense to me. For instance, it excludes most spacing characters, but not U+00A0 (non-breaking space), U+2007 (figure space) nor U+200B (zero width space). It includes characters I would consider control characters, like U+200C..U+200F, U+202D, U+202E...². That latter one, the right-to-left override, is critical for this Q&A as it reverses the order of left-to-right characters:
$ printf '%b\n' '\u202E' a b c | sort | paste -sd '\0' -
abc
(some browsers will show cba if they support it, others abc).
It also includes most tag characters and also the thousands of private use characters, which are unlikely to be assigned, let alone drawable on your system.
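As a quick check of how your locale classifies a given character, in a UTF-8 locale on a GNU system the non-breaking space is indeed reported as graph (assuming a shell whose printf understands \u escapes, as in the example above):
$ printf '%b\n' '\u00a0' | grep -c '[[:graph:]]'
1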
For single byte charsets (on GNU systems, those for which locale ctype-mb-cur-max returns 1), listing graphical characters should just be a matter of looping over all 255 byte values (omitting the first one, NUL, which is not graphical in any charset and would cause problems) and matching them against [[:graph:]].
One could do that with awk for instance:
awk '
  BEGIN {
    for (i = 1; i < 256; i++) {
      c = sprintf("%c", i)
      if (c ~ /[[:graph:]]/) print c
    }
  }' | sort | paste -sd '\0' -
In an el_GR.iso88597 locale, Greek using the iso8859-7 single-byte charset, that would give:
`^¨~<=>¦°_-,;:!?/.·'ʽʼ"«»()[]{}§©@$£¤¥*\&#%+±ª―΄΅0½12²3³456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZΑαΆάΒβΓγΔδΕεΈέΖζΗηΉήΘθΙιΊίΪϊΐΚκΛλΜμΝνΞξΟοΌόΠπΡρΣσςΤτΥυΎύΫϋΰΦφΧχΨψΩωΏώ
(with the trailing non-breaking space mis-classified as "graph" by GNU locales).
That approach can't be used for multi-byte characters.
If your iconv supports a UCS-4BE or UTF32BE charset encoding, you can generate all Unicode code points as 32-bit big-endian numbers and convert them to the locale's charset:
perl -e 'print pack("L>*", $_, 10) for 1..0xd7ff, 0xe000..0x10ffff' |
iconv -cf UCS-4BE |
grep '[[:graph:]]' |
sort
Or if it supports a UTF-8 one:
perl -C -M-warnings=nonchar -le 'print chr$_ for 1..0xd7ff, 0xe000..0x10ffff' |
iconv -cf UTF-8 |
grep '[[:graph:]]' |
sort
(left at one character per line to avoid the problems mentioned above and also to avoid generating too long a line).
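As a side note, you can get an idea of how many characters your locale classifies as graph by replacing the final grep and sort with a grep -c; the exact count will vary with your libc and Unicode version:
perl -C -M-warnings=nonchar -le 'print chr$_ for 1..0xd7ff, 0xe000..0x10ffff' |
iconv -cf UTF-8 |
grep -c '[[:graph:]]'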
That relies on the fact that Unicode (and thus its encodings) is meant to include the characters of all other possible charsets, so there will always be a Unicode codepoint for each character in any charset. Modern systems actually define their charsets in relation to Unicode and their wchar_t usually corresponds to Unicode codepoints.
Now, as discussed above, sort resorts to a memcmp()-based comparison for characters that sort the same with strcoll(). For single-byte charsets, that will be a sort on the codepoint in those charsets; for UTF-8, that will sort on the Unicode code point, as UTF-8 has that specific property. For other encodings of Unicode like the Chinese GB18030, or other multibyte charsets, the resulting order may look more or less random.
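For instance, with GNU sort in an en_US.UTF-8 locale, the rounded numbers from the earlier example come out in codepoint order thanks to that byte-level fallback (whether or not the libc gives them a collation order):
$ printf '%s\n' ③ ① ② | LC_ALL=en_US.UTF-8 sort | paste -sd ' ' -
① ② ③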
In any case, that will mean that for two locales with the same collation order, the output of sort will be different if those locales use different character sets.
For instance, if we get back to our ①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳ rounded numbers, Unicode specifies them in that order (codepoints 0x2460 to 0x2473). In GNU locales, their order is not defined (① is neither before nor after ②). sort will put ② after ① in a locale using UTF-8, because UTF-8 order follows the Unicode codepoint order. But in a locale like zh_CN.gb18030 that uses GB18030, another encoding of Unicode from China, the order becomes ⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳①②③④⑤⑥⑦⑧⑨⑩, down to how those characters are encoded at the byte level (or at least it would be, if it wasn't for this bug which makes them order as ①②③④⑤⑥⑦⑧⑨⑩⑮⑯⑰⑱⑲⑳⑪⑫⑬⑭ instead).
If you wanted to order the characters of a string based on their collation order, with zsh:
printf "%s\n" ${(j::)${(s::o)string}}
Note that zsh can have NUL characters in its variables, but strcoll() doesn't work with those. zsh tries to work around that, but it's not perfect.
The result would be non-deterministic if strings contained different characters with same sorting order.
¹ 2019 edit: the order of ① ② ③ ④ ⑤ has since been fixed in newer versions of the GNU libc, but as of 2.30, over 95% of characters still don't have a defined order, so you can replace ① ② ③ ④ ⑤ with plenty of other characters in that example. Hopefully, GNU locales will eventually be fixed completely (they will have to if they want to comply with the next revision of the standard) and the problem will then be limited to user-defined locales.
² I believe the rationale is that the non-breaking space, figure space, zero width space etc. are not included in the space category on the grounds that they should not be used as delimiters (ISO 30112 defines them as characters to be classified as white-space characters, to find syntactical boundaries), and graph is defined as the non-spacing printable characters (the characters that are in the print category and not in the space category, though the text of ISO 30112 defines it as characters to be classified as printable characters, not including the <space> character). So in effect, that's graphical characters plus printable non-graphical characters that are not used as syntactical boundaries.