
("Unexpected sort order in en_US.UTF-8 locale" explains the basics of UTF sorting.)

Consider this test scenario:

# touch abc{.{g,t,t.t},@.s,-{s-w,t.c}}
# LC_COLLATE="en_US.UTF-8" ls -l test
total 0
-rw-r--r-- 1 root root 0 May  8 08:52 abc.g
-rw-r--r-- 1 root root 0 May  8 08:52 abc@.s
-rw-r--r-- 1 root root 0 May  8 08:52 abc-s-w
-rw-r--r-- 1 root root 0 May  8 08:52 abc.t
-rw-r--r-- 1 root root 0 May  8 08:52 abc-t.c
-rw-r--r-- 1 root root 0 May  8 08:52 abc.t.t

I had expected either "all dots" or "all minuses" to be sorted first, but the result looks like an odd mixture. The software packages in use were:

coreutils-8.25-13.7.1.x86_64
glibc-2.22-100.8.1.x86_64
glibc-locale-2.22-100.8.1.x86_64

Using LC_COLLATE=POSIX results in "correct" sorting:

# ls -l test
total 0
-rw-r--r-- 1 root root 0 May  8 08:52 abc-s-w
-rw-r--r-- 1 root root 0 May  8 08:52 abc-t.c
-rw-r--r-- 1 root root 0 May  8 08:52 abc.g
-rw-r--r-- 1 root root 0 May  8 08:52 abc.t
-rw-r--r-- 1 root root 0 May  8 08:52 abc.t.t
-rw-r--r-- 1 root root 0 May  8 08:52 abc@.s

Some details:

# locale -k LC_COLLATE
collate-nrules=0
collate-rulesets=""
collate-symb-hash-sizemb=0
collate-codeset="ANSI_X3.4-1968"
# LC_COLLATE="en_US.UTF-8" locale -k LC_COLLATE
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2707
collate-codeset="UTF-8"

Is there a way to have the sorting "explained" like some verbose trace or debug messages? It does not have to be for the ls command, but some simple demonstration code would be fine, too.
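One way to get at least a partial "trace" is to look at the transformed keys that the C library compares. The following is only a sketch, assuming Python 3 and that the locales named are installed; `collation_keys` and `NAMES` are names made up for this illustration. It relies on `locale.strxfrm`, which wraps the C library's string transformation, so the numbers it prints are implementation-specific weights and will depend on the installed locale data.

```python
import locale

# The file names from the test scenario above.
NAMES = ["abc.g", "abc.t", "abc.t.t", "abc@.s", "abc-s-w", "abc-t.c"]


def collation_keys(names, locale_name):
    """Return (name, key) pairs sorted by the LC_COLLATE rules of locale_name.

    Each key is the list of code points of locale.strxfrm(name); comparing
    these transformed strings is what decides the order.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if the locale is not installed.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        transformed = [(locale.strxfrm(name), name) for name in names]
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting
    transformed.sort()
    return [(name, [ord(ch) for ch in key]) for key, name in transformed]


if __name__ == "__main__":
    for loc in ("C", "en_US.UTF-8"):
        try:
            pairs = collation_keys(NAMES, loc)
        except locale.Error:
            print(f"{loc}: locale not available, skipped")
            continue
        print(loc)
        for name, key in pairs:
            print(f"  {name:10} {key}")
```

This does not explain which collation rule produced which weight, but it makes it visible that the comparison happens on the transformed keys rather than on the raw bytes of the names.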

Is there a safe alternative for UTF-8 and LC_COLLATE that sorts more "traditionally"? In other words: can I safely use LC_COLLATE=POSIX instead?

Jeff Schaller
U. Windl
  • The reason for that order is that - and . are just ignored, at least while there are other differences. I don't know of a tool that produces collation weights for arbitrary strings and locales, though, or a common UTF-8 locale that sorts lexicographically by codepoint or similar. What kind of ordering do you want? – Michael Homer May 08 '19 at 07:50
  • As I wrote: either "all minus first" or "all dots first", but not a mixture. Isn't the idea of sorting to speed up finding? – U. Windl Sep 08 '22 at 09:42
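
The "punctuation is ignored while there are other differences" explanation from the first comment can be checked against the listing in the question with a few lines of Python. This is only a rough emulation under stated assumptions, not glibc's actual algorithm: `punctuation_blind_key` and `emulate_sort` are hypothetical helper names, the first pass simply drops ASCII punctuation, and using the full name as the tie-breaker is an assumption made for this sketch.

```python
import string

# The file names from the test scenario in the question.
NAMES = ["abc.g", "abc.t", "abc.t.t", "abc@.s", "abc-s-w", "abc-t.c"]


def punctuation_blind_key(name):
    """Build a two-part sort key.

    First part: the name with ASCII punctuation removed, so '-', '.' and '@'
    play no role in the primary comparison.
    Second part: the unmodified name, used only to break ties (an assumption
    of this sketch, not taken from the locale definition).
    """
    stripped = "".join(ch for ch in name if ch not in string.punctuation)
    return (stripped, name)


def emulate_sort(names):
    """Sort names as if punctuation were ignored on the first pass."""
    return sorted(names, key=punctuation_blind_key)


if __name__ == "__main__":
    for name in emulate_sort(NAMES):
        print(f"{name:10} primary key: {punctuation_blind_key(name)[0]}")
```

For these six names the emulation yields abc.g, abc@.s, abc-s-w, abc.t, abc-t.c, abc.t.t, which is the same order as the en_US.UTF-8 listing in the question: the primary keys are abcg, abcs, abcsw, abct, abctc and abctt.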
