
When sorting file names, ls ignores characters like -,_. I expected it to use those characters in sorting as well.

An example:

touch a1 a2 a-1 a-2 a_1 a_2 a.1 a.2 a,1 a,2

Now display these files with ls -1:

a1
a_1
a-1
a,1
a.1
a2
a_2
a-2
a,2
a.2

What I expected was something like this:

a1
a2
a,1
a,2
a.1
a.2
a_1
a_2
a-1
a-2

i.e. I expected the non-alphanumeric characters to be taken into account when sorting.

Can anyone explain this behaviour? Is it mandated by a standard? Or is it due to the encoding being UTF-8?

Update: It seems that this is related to UTF-8 sorting:

$ LC_COLLATE=C ls -1
a,1
a,2
a-1
a-2
a.1
a.2
a1
a2
a_1
a_2
daniel kullmann
  • UTF-8 and ASCII are identical if all you're using is the first 128 codepoints (which your example is). What happens if you do LC_COLLATE=C ls? – Alexios Apr 01 '12 at 09:56
  • The problem is not that ASCII and UTF-8 are identical, it's rather that UTF-8 has its own collation (sorting) rules. – daniel kullmann Apr 01 '12 at 10:01
  • Yes, it's true that [_-,.] are being grouped and somehow semi-ignored. I don't know exactly how or where such collation is defined, but it must be a collation issue, because simply, and only, changing the collation to C (via LC_COLLATE=C ls -l) is enough to give you the sort order you expected (assuming the LC_ALL is not overriding LC_COLLATE). This holds true for the entire range of characters in the Unicode Basic Multilingual Plane... I've edited my answer to include an example script which bears this out... – Peter.O Apr 01 '12 at 14:29
  • if you don't like how it works, you can create an alias and put it in your ~/.profile: alias ls='LC_COLLATE=C ls' – jippie Apr 01 '12 at 16:37
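As a rough sketch of the override discussed in the comments above, the following wraps ls in a small shell function instead of an alias. The function name ls_c is made up for illustration. It sets LC_ALL rather than LC_COLLATE, on the assumption that LC_ALL might be set in the environment, in which case it would override LC_COLLATE. Note that LC_ALL=C also affects other categories such as LC_CTYPE and LC_MESSAGES, which may not be what you want for non-ASCII file names.

```shell
# Hypothetical helper; "ls_c" is not a standard command.
# LC_ALL is used because, when set, it overrides LC_COLLATE.
ls_c() {
  LC_ALL=C ls -1 "$@"
}

# Reproduce the example from the question in a scratch directory.
demo_dir=$(mktemp -d)
( cd "$demo_dir" && touch a1 a2 a-1 a-2 a_1 a_2 a.1 a.2 a,1 a,2 )
ls_c "$demo_dir"
rm -rf "$demo_dir"
```

This should list the files in the same byte order as the LC_COLLATE=C listing in the question's update.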

4 Answers

14

EDIT: Added test for data sorted with LC_COLLATE=C


The default collation sequence appears to treat those "punctuation-type" characters as being of equal value. Use LC_COLLATE=C to sort them in codepoint order.

for i in 'a1' 'a_1' 'a-1' 'a,1' 'a.1' 'a2' 'a_2' 'a-2' 'a,2' 'a.2'; do
  echo "$i"
done | LC_COLLATE=C sort

Output

a,1
a,2
a-1
a-2
a.1
a.2
a1
a2
a_1
a_2

The following code tests all valid UTF-8 characters in the Basic Multilingual Plane (except \x00 and \x0a, for simplicity).
It compares a file in a known (generated) ascending sequence against the same file shuffled randomly and then sorted again with LC_COLLATE=C. The result shows that the C sequence is identical to the original generated sequence.

{ i=0 j=0 k=0 l=0
  for i in {0..9} {A..F} ;do
  for j in {0..9} {A..F} ;do
  for k in {0..9} {A..F} ;do
  for l in {0..9} {A..F} ;do
     (( 16#$i$j$k$l == 16#0000 )) && { printf '.' >&2; continue; }
     (( 16#$i$j$k$l == 16#000A )) && { printf '.' >&2; continue; }
     (( 16#$i$j$k$l >= 16#D800    && 
        16#$i$j$k$l <= 16#DFFF )) && { printf '.' >&2; continue; }
     (( 16#$i$j$k$l >= 16#FFFE )) && { printf '.' >&2; continue; }
     echo 0x"$i$j$k$l" |recode UTF-16BE/x4..UTF-8 || { echo "ERROR at codepoint $i$j$k$l " >&2; continue; } 
     echo 
  done
  done
  done; echo -n "$i$j$k$l " >&2
  done; echo >&2
} >listGen

             sort -R listGen    > listRandom
LC_COLLATE=C sort    listRandom > listCsort 

diff <(cat listGen;   echo "last line of listOrig " ) \
     <(cat listCsort; echo "last line of listCsort" )
echo 
cmp listGen listCsort; echo 'cmp $?='$?

Output:

63485c63485
< last line of listOrig 
---
> last line of listCsort

cmp $?=0
Peter.O
  • Where is that documented? Is that part of the Unicode standard? – daniel kullmann Apr 01 '12 at 10:04
  • Actually, they don't get the same value; those characters are simply ignored when sorting. If they were treated as having an equal value, the sorting order a_1 a2 a_2 would be impossible. – daniel kullmann Apr 01 '12 at 10:07
  • +1 for your hard work and sample code. After many hours sorting directory names with punctuation to match the way tree does it I think there is more to the story such as punctuation being removed from comparison strings or something like that. I can say the / character has to be set as the lowest character in the collating sequence no matter what else. – WinEunuuchs2Unix Apr 30 '17 at 22:14
12

This has nothing to do with the charset. Rather, it's the language that determines the collation order. The libc examines the language named in $LC_ALL, $LC_COLLATE, or $LANG (checked in that order of precedence), looks up its collation rules (e.g. /usr/share/i18n/locales/* for glibc), and orders the text as directed.
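To make the lookup concrete, here is a minimal sketch of how the effective collation setting can be derived from the environment: LC_ALL wins over LC_COLLATE, which wins over LANG, and empty values count as unset. The helper name effective_collate is invented for this example, and the final fallback to C is an assumption (the default is implementation-defined, though C/POSIX is common). The standard locale command reports the same information directly.

```shell
# Hypothetical helper; "effective_collate" is not a standard command.
# Precedence: LC_ALL, then LC_COLLATE, then LANG. Empty counts as unset.
# The fallback "C" is an assumption about the implementation default.
effective_collate() {
  printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

# Show the value that applies to the current shell.
effective_collate
```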

  • FYI: It's more complicated than this. If one were to use strcoll for example, you would see that something like aasa.c would be sorted above aas.c. – Don Scott Oct 04 '15 at 21:19
4

I am having exactly the same issue with Debian's default sort options. For me it's the comma that is being ignored, which prevents me from sorting CSV data properly and causes havoc in my AI.

The solution is that, instead of using sort on its own, I need to force sort out of its default behaviour, which seems to be -d, --dictionary-order.

Running the command:

sort -V

Fixes my problem and considers commas.

Owl
  • You might want to look at using the option -t, (or --field-separator ,) which will then sort on the values of the fields (up to a point - embedded commas in data fields will throw it off). Surely someone, somewhere has written a program that will decode the CSV, sort records, then spit them out again? – Will Crawford Jan 28 '20 at 01:15
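The field-separator suggestion in the comment above could look roughly like this. The helper name sort_csv_by_field is made up for illustration, and the sketch assumes simple CSV with no quoted fields containing commas. It forces LC_ALL=C so that the comparison is bytewise and punctuation is not skipped.

```shell
# Hypothetical helper; "sort_csv_by_field" is not a standard command.
# Usage: sort_csv_by_field FIELD_NUMBER [FILE...]
# Assumes no quoted fields with embedded commas.
sort_csv_by_field() {
  field=$1
  shift
  # -t, sets the field separator; -kN,N restricts the key to field N.
  LC_ALL=C sort -t, -k"$field,$field" "$@"
}

# Sort three sample records by their second field (as text, not numbers).
printf '%s\n' 'b,10,x' 'a,2,y' 'c,1,z' | sort_csv_by_field 2
```

Because the key is compared as text, 10 sorts before 2 here; for numeric ordering, a key such as -k2,2n would be needed instead.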
0

Just a comment: I have a real problem with my collation (es_AR.utf8), because I cannot use 'C' on account of the accents. Worst of all, the problem also appears in PostgreSQL: a condition such as BETWEEN '10' AND '10.1' (just an example) includes the value '100', which I don't expect. I guess I have to specify the collation in each query. SELECT '100' BETWEEN '10' AND '10.Z' returns true, but SELECT '100' BETWEEN '10' AND '10.Z' COLLATE "C" returns false, which is correct (in my mind).