While trying to answer this question about SQL sorting, I noticed a sort order I did not expect:

$ export LC_ALL=en_US.UTF-8  
$ echo "T-700A Grouped" > sort.txt
$ echo "T-700 AGrouped" >> sort.txt
$ echo "T-700A Halved" >> sort.txt
$ echo "T-700 Whole" >> sort.txt
$ cat sort.txt | sort
T-700 AGrouped
T-700A Grouped
T-700A Halved
T-700 Whole
$ 

Why is "700 A" sorted above "700A", while "700A" is sorted above "700 W"? I would expect a space to sort before A consistently, regardless of the characters that follow it.

It works fine if you use the C locale:

$ export LC_ALL=C
$ echo "T-700A Grouped" > sort.txt
$ echo "T-700 AGrouped" >> sort.txt
$ echo "T-700A Halved" >> sort.txt
$ echo "T-700 Whole" >> sort.txt
$ cat sort.txt | sort
T-700 AGrouped
T-700 Whole
T-700A Grouped
T-700A Halved
$ 
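
For comparison, the C-locale result matches a plain code-point comparison, in which the space (0x20) always sorts before "A" (0x41). The snippet below is only a rough sketch of that behaviour; it assumes Python's default string ordering is an acceptable stand-in for the byte-wise comparison the C locale uses on these ASCII lines.

```python
# The four lines from the transcript above.
lines = [
    "T-700A Grouped",
    "T-700 AGrouped",
    "T-700A Halved",
    "T-700 Whole",
]

# Python compares str values code point by code point; for pure ASCII
# input this gives the same order as a byte-wise comparison.
c_order = sorted(lines)

for line in c_order:
    print(line)
```

This prints the two "T-700 " lines first and the two "T-700A" lines last, the same order as `sort` under `LC_ALL=C`.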
Andomar

1 Answer

Sorting is done in multiple passes, one per weight level. Each character has three (or sometimes more) weights assigned to it. Suppose, for this example, that the weights are:

         wt#1 wt#2 wt#3
space = [0000.0020.0002]
A     = [1BC2.0020.0008]

To create the sort key, the nonzero weights of the characters of a string are concatenated, one weight level at a time. A zero weight contributes nothing to the key, which is why the key for " A" begins with the first-level weight of A rather than that of the space. So

       wt#1   -- wt#2 ---   -- wt#3 ---
" A" = 1BC2   0020   0020   0002   0008
       A      sp     A      sp     A

       wt#1   wt#2   wt#3
"A"  = 1BC2   0020   0008
       A      A      A

       wt#1   -- wt#2 ---   -- wt#3 ---
"A " = 1BC2   0020   0020   0008   0002
       A      A      sp     A      sp

If you sort these arrays, comparing them element by element, you get the order you see:

       1BC2   0020   0008               => "A"
       1BC2   0020   0020   0002   0008 => " A"
       1BC2   0020   0020   0008   0002 => "A "
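
The key construction above can be sketched in a few lines of Python. This is only an illustration of the simplified scheme in this answer, not the full algorithm: the `WEIGHTS` table holds just the two example entries, the names `WEIGHTS` and `sort_key` are made up for the sketch, and the level separator that real sort keys place between levels is left out, as it is in the arrays above.

```python
# Example weights from the table above: (wt#1, wt#2, wt#3).
WEIGHTS = {
    " ": (0x0000, 0x0020, 0x0002),
    "A": (0x1BC2, 0x0020, 0x0008),
}

def sort_key(text):
    """Concatenate the nonzero weights of text, one level at a time."""
    key = []
    for level in range(3):
        for ch in text:
            weight = WEIGHTS[ch][level]
            if weight != 0:          # zero weights are skipped
                key.append(weight)
    return tuple(key)

words = ["A ", " A", "A"]
ordered = sorted(words, key=sort_key)

for word in ordered:
    # Show each string next to its key in the same hex notation.
    print(repr(word), " ".join(f"{w:04X}" for w in sort_key(word)))
```

Tuples compare element by element, so "A" comes first, then " A", then "A ".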

This is a simplification of what actually happens; see the Unicode Collation Algorithm for more details. The above example weights are actually from the standard table, with some details omitted.
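
The same idea can be applied to the four lines from the question. The sketch below uses hypothetical weights, not the real table: it assumes the space and the hyphen have a zero first-level weight, uses the lowercase code point as a stand-in first-level weight for everything else, gives every character the same second-level weight, and gives uppercase letters a larger third-level weight. The function names are made up for the illustration, and level separators are again omitted.

```python
def weights(ch):
    """Return hypothetical (level 1, level 2, level 3) weights for ch."""
    if ch in " -":
        primary = 0                   # assumed ignorable at the first level
    else:
        primary = ord(ch.lower())     # stand-in for a real first-level weight
    secondary = 0x20                  # identical for every character here
    tertiary = 0x08 if ch.isupper() else 0x02
    return (primary, secondary, tertiary)

def sort_key(text):
    """Concatenate the nonzero weights of text, one level at a time."""
    key = []
    for level in range(3):
        for ch in text:
            weight = weights(ch)[level]
            if weight != 0:
                key.append(weight)
    return tuple(key)

lines = [
    "T-700A Grouped",
    "T-700 AGrouped",
    "T-700A Halved",
    "T-700 Whole",
]

for line in sorted(lines, key=sort_key):
    print(line)
```

Under these assumptions the first level sees "t700agrouped" twice, then "t700ahalved", then "t700whole", so the space has no influence on where "Whole" lands. The two "Grouped" lines tie on the first two levels and are separated only on the third, where the space (0x02) sorts before "A" (0x08). The result matches the en_US.UTF-8 order in the question.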

Elliptical view
  • Very helpful. The fact that it sorts A-Space above Space-A above W-Space cannot be explained by a constant code point for space. As you suggest it is probably multiple passes, which section 1.6 seems to explain. – Andomar Dec 31 '15 at 00:23
  • I am a bit lost here: why does "A" become A A A, why does " A" become A sp A sp A, and why does "A " become A A sp A sp? Why do those A have weight 1BC2? And why is sp sometimes weight 0020 and sometimes 0002? Please explain a bit more for me :) – Olivier Dulac Jan 27 '21 at 15:12