(Unexpected sort order in en_US.UTF-8 locale explains the basics of UTF sorting)
Consider this test scenario:
# touch abc{.{g,t,t.t},@.s,-{s-w,t.c}}
# LC_COLLATE="en_US.UTF-8" ls -l test
total 0
-rw-r--r-- 1 root root 0 May 8 08:52 abc.g
-rw-r--r-- 1 root root 0 May 8 08:52 abc@.s
-rw-r--r-- 1 root root 0 May 8 08:52 abc-s-w
-rw-r--r-- 1 root root 0 May 8 08:52 abc.t
-rw-r--r-- 1 root root 0 May 8 08:52 abc-t.c
-rw-r--r-- 1 root root 0 May 8 08:52 abc.t.t
I had expected either "all dots" or "all minuses" to be sorted first, but the result looks like a funny mixture. The software packages in use were:
coreutils-8.25-13.7.1.x86_64
glibc-2.22-100.8.1.x86_64
glibc-locale-2.22-100.8.1.x86_64
Using LC_COLLATE=POSIX results in "correct" sorting:
# ls -l test
total 0
-rw-r--r-- 1 root root 0 May 8 08:52 abc-s-w
-rw-r--r-- 1 root root 0 May 8 08:52 abc-t.c
-rw-r--r-- 1 root root 0 May 8 08:52 abc.g
-rw-r--r-- 1 root root 0 May 8 08:52 abc.t
-rw-r--r-- 1 root root 0 May 8 08:52 abc.t.t
-rw-r--r-- 1 root root 0 May 8 08:52 abc@.s
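For illustration, the POSIX order can be reproduced outside of ls. In the POSIX locale, strings are compared byte by byte, and in ASCII "-" (0x2D) comes before "." (0x2E), which comes before "@" (0x40) and all lowercase letters. The Python sketch below is not what ls runs internally; it assumes that Python's default string comparison (by code point) matches byte order, which holds for pure ASCII names like these.

```python
# File names from the listing above.
names = ["abc.g", "abc.t", "abc.t.t", "abc@.s", "abc-s-w", "abc-t.c"]

# Plain sorted() compares code points; for ASCII names this equals byte order,
# which is how the POSIX/C locale compares strings.
posix_order = sorted(names)

for name in posix_order:
    # Show the first character after the common "abc" prefix and its code.
    print(f"{name:<8} char after 'abc': {name[3]!r} = 0x{ord(name[3]):02X}")
```

The printed codes show that the first differing character alone decides the order here, which is why all "minus" names come first, then all "dot" names, then the "@" name.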
Some details:
# locale -k LC_COLLATE
collate-nrules=0
collate-rulesets=""
collate-symb-hash-sizemb=0
collate-codeset="ANSI_X3.4-1968"
# LC_COLLATE="en_US.UTF-8" locale -k LC_COLLATE
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2707
collate-codeset="UTF-8"
Is there a way to have the sorting "explained", for example through a verbose trace or debug messages? It does not have to be for the ls command; some simple demonstration code would be fine, too.
Is there a safe alternative for UTF-8 and LC_COLLATE that sorts in a "more traditional" way? In other words, can I safely use LC_COLLATE=POSIX instead?
"-" and "." are just ignored, at least while there are other differences. I don't know of a tool that produces collation weights for arbitrary strings and locales, though, or a common UTF-8 locale that sorts lexicographically by codepoint or similar. What kind of ordering do you want? – Michael Homer May 08 '19 at 07:50
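As a rough demonstration of the idea that punctuation such as "-" and "." is ignored while other differences exist, here is a small Python sketch. The function model_key is a hypothetical, simplified model (compare the names without punctuation first, then use the full name as a tie-breaker); it is not glibc's actual multi-level algorithm. The helper sort_with_locale is also an illustrative name; it uses the standard locale.strxfrm to ask the C library for the real order, and returns None if the en_US.UTF-8 locale is not installed.

```python
import locale

# File names from the question.
names = ["abc.g", "abc.t", "abc.t.t", "abc@.s", "abc-s-w", "abc-t.c"]

# Assumption: only the punctuation that occurs in these names is "ignored".
IGNORED = "-.@"

def model_key(name):
    """Simplified model: letters decide first, the full name breaks ties."""
    letters_only = "".join(ch for ch in name if ch not in IGNORED)
    return (letters_only, name)

def sort_with_locale(items, locale_name="en_US.UTF-8"):
    """Sort with the system's collation rules; None if the locale is missing."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous setting

model_order = sorted(names, key=model_key)
print("model :", model_order)
print("locale:", sort_with_locale(names))
```

For these six names the simplified model yields the same order as the en_US.UTF-8 listing in the question (abc.g, abc@.s, abc-s-w, abc.t, abc-t.c, abc.t.t). The order returned by the real locale depends on the installed glibc and locale data, so it may differ from both the model and the listing.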