Why does sort say that ɛ = e?

Question

ɛ ("Latin epsilon") is a letter used in certain African languages, usually to represent the vowel sound in English "bed". In Unicode it's encoded as U+025B, very distinct from everyday e.

However, if I sort the following:

eb
ed
ɛa
ɛc

it seems that sort considers ɛ and e equivalent:

ɛa
eb
ɛc
ed

What's going on here? And is there a way to make ɛ and e distinct for sorting purposes?

sorting rules are called 'collation', if that helps your googling — BlueRaja - Danny Pflughoeft, Oct 26 '18 at 21:28
Try to put a certain number of ea mixed with ɛa inside a text file and sort it. You will see that it always sorts ea before ɛa. So, no they are not considered equal. — Bakuriu, Oct 27 '18 at 09:21
Might be an obvious point, but I haven't seen it suggested explicitly yet: if you are sorting words in $(certain_african_language), the natural thing to do is setting the locale to $(certain_african_language). — Federico Poloni, Oct 28 '18 at 09:13
@FedericoPoloni A very good point! Unfortunately I haven't been able to find any locale made for this language. — Draconis, Oct 28 '18 at 15:30
@GermánBouzas This is specifically "Latin epsilon", a form designed to fit in with the Latin alphabet. They look pretty much the same, but Latin epsilon is U+025B, while Greek epsilon is U+03B5. — Draconis, Nov 01 '18 at 13:22

Stéphane Chazelas · Accepted Answer · 2018-11-08T21:53:01.783

No, it doesn't consider them equivalent; they just have the same primary weight, so to a first approximation they sort the same.

If you look at /usr/share/i18n/locales/iso14651_t1_common (as used as basis for most locales) on a GNU system (here with glibc 2.27), you'll see:

<U0065> <e>;<BAS>;<MIN>;IGNORE # 259 e
<U025B> <e>;<PCL>;<MIN>;IGNORE # 287 ɛ
<U0045> <e>;<BAS>;<CAP>;IGNORE # 577 E

e, ɛ and E have the same primary weight; e and E also have the same secondary weight, and only the third weight differentiates them.
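
To look up the weights of a given character yourself, a small helper along these lines may do. This is only a sketch: weights_of is a made-up name, and it assumes each definition line has the "<Uxxxx> weights # comment" layout shown above (the layout may differ between glibc versions).

```sh
#!/bin/sh
# weights_of CODEPOINT
# Hypothetical helper: reads collation definition lines on stdin and
# prints the weight list (second field) of the line whose first field
# is <CODEPOINT>.
weights_of() {
    awk -v cp="<$1>" '$1 == cp { print $2 }'
}

# Fed with the three lines quoted above:
weights_of U025B <<'EOF'
<U0065> <e>;<BAS>;<MIN>;IGNORE # 259 e
<U025B> <e>;<PCL>;<MIN>;IGNORE # 287 ɛ
<U0045> <e>;<BAS>;<CAP>;IGNORE # 577 E
EOF
# prints <e>;<PCL>;<MIN>;IGNORE
```

On a GNU system you would presumably redirect stdin from /usr/share/i18n/locales/iso14651_t1_common instead of the here-document.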

When comparing strings, sort (through strcoll(), the standard libc function it uses to compare strings) starts by comparing the primary weights of all characters, and only moves on to the secondary weights if the strings are equal on the primary weights (and so on with the other weights).

That's how case seems to be ignored in the sorting order in first approximation. Ab sorts between aa and ac, but Ab can sort before or after ab depending on the language rule (some languages have <MIN> before <CAP> like in British English, some <CAP> before <MIN> like in Estonian).

If e had the same sorting order as ɛ, printf '%s\n' e ɛ | sort -u would return only one line. But as <BAS> sorts before <PCL>, e alone sorts before ɛ. eɛe sorts after EEE (at the secondary weight) even though EEE sorts after eee (for which we need to go up to the third weight).
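
The level-by-level comparison can be imitated with a toy model. This is a sketch, not what glibc does internally: toy_sort is a made-up function, it only knows ASCII letters plus ɛ, and it encodes <BAS>/<PCL> and <MIN>/<CAP> as the digits 1/2 so that a plain byte-wise sort of "primary, secondary, tertiary" keys reproduces the order described above.

```sh
#!/bin/sh
# Toy three-level collation for ASCII letters and ɛ only.
# Each word gets three keys, compared in turn by a byte-wise sort:
#   primary:   base letter (ɛ counts as e, case folded)
#   secondary: 1 for <BAS>, 2 for <PCL> (ɛ)
#   tertiary:  1 for <MIN> (lowercase), 2 for <CAP> (uppercase)
toy_sort() {
    while IFS= read -r word; do
        primary=$(printf '%s\n' "$word" | LC_ALL=C sed -e 's/ɛ/e/g' | LC_ALL=C tr 'A-Z' 'a-z')
        secondary=$(printf '%s\n' "$word" | LC_ALL=C sed -e 's/[A-Za-z]/1/g' -e 's/ɛ/2/g')
        tertiary=$(printf '%s\n' "$word" | LC_ALL=C sed -e 's/ɛ/1/g' -e 's/[a-z]/1/g' -e 's/[A-Z]/2/g')
        # Tab-separated keys first, original word last.
        printf '%s\t%s\t%s\t%s\n' "$primary" "$secondary" "$tertiary" "$word"
    done | LC_ALL=C sort | cut -f4
}

printf '%s\n' eb ed ɛa ɛc | toy_sort     # ɛa eb ɛc ed (one per line)
printf '%s\n' EEE eɛe eee | toy_sort     # eee EEE eɛe (one per line)
```

The first call gives the order from the question (decided on primary weights alone); the second shows eɛe landing after EEE at the secondary level while eee and EEE are only separated at the third.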

Now, if you run the following on a system with glibc 2.27 (as mine is):

sed -n 's/\(.*;[^[:blank:]]*\).*/\1/p' /usr/share/i18n/locales/iso14651_t1_common |
  sort -k2 | uniq -Df1

You'll notice that there are quite a few characters that have been defined with the exact same 4 weights. In particular, our ɛ has the same weights as:

<U01DD> <e>;<PCL>;<MIN>;IGNORE
<U0259> <e>;<PCL>;<MIN>;IGNORE
<U025B> <e>;<PCL>;<MIN>;IGNORE

And sure enough:

$ printf '%s\n' $'\u01DD' $'\u0259' $'\u025B' | sort -u
ǝ
$ expr ɛ = ǝ
1

That can be seen as a bug of GNU libc locales. On most other systems, locales make sure all different characters have different sorting order in the end. On GNU locales, it gets even worse, as there are thousands of characters that don't have a sorting order and end up sorting the same, causing all sorts of problems (like breaking comm, join, ls or globs having non-deterministic orders...), hence the recommendation of using LC_ALL=C to work around those issues.

As noted by @ninjalj in the comments, glibc 2.28, released in August 2018, came with some improvements on that front, though AFAICS there are still some characters or collating elements defined with identical sorting order. On Ubuntu 18.10 with glibc 2.28 and in an en_GB.UTF-8 locale:

$ expr $'L\ub7' = $'L\u387'
1

(Why would U+00B7 be considered equivalent to U+0387 only when combined with L/l?!)

And:

$ perl -lC -e 'for($i=0; $i<0x110000; $i++) {$i = 0xe000 if $i == 0xd800; print chr($i)}' | sort > all-chars-sorted
$ uniq -d all-chars-sorted | wc -l
4
$ uniq -D all-chars-sorted | wc -l
1061355

(That is still over 1 million characters, 95% of the Unicode range and down from 98% in 2.27, sorting the same as other characters because their sorting order is not defined.)
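
To check whether two given strings collate as equal in whatever locale is in effect, something like the following could be used. It is a sketch built on the same sort -u behaviour as above; collate_equal is a made-up name, and the result depends entirely on your locale and libc version.

```sh
#!/bin/sh
# collate_equal A B
# Succeeds if sort -u collapses the two strings into one line, i.e. the
# current locale gives them the same sorting order.
collate_equal() {
    # Unquoted on purpose: some wc implementations pad the count with spaces.
    [ $(printf '%s\n' "$1" "$2" | sort -u | wc -l) -eq 1 ]
}

if collate_equal 'ɛ' 'ǝ'; then
    echo 'ɛ and ǝ sort the same in this locale'
else
    echo 'ɛ and ǝ are distinct in this locale'
fi
```

With LC_ALL=C in the environment, only byte-for-byte identical strings compare equal.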

This is exactly what I was looking for! For completeness, what does <PCL> stand for? The others seem to be Capital, Miniscule, and Basic? — Draconis, Oct 26 '18 at 19:51
Indeed if we put a bunch of ea and ɛa mixed together in a file we see that sort sorts all eas before ɛas. — Bakuriu, Oct 27 '18 at 09:22
From glibc 2.28, the codepoint should be used as a fallback for a 4th level weight, see https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=commitdiff;h=bc1d41044c0cf9f0214acdbfd79b6cd11fd1e8c1 https://sourceware.org/bugzilla/show_bug.cgi?id=14095 — ninjalj, Oct 27 '18 at 10:47
With the weights of U+025B changed to <U025B> <S025B>;<BASE>;<MIN>;<U025B> % LATIN SMALL LETTER OPEN E — ninjalj, Oct 27 '18 at 10:49
@ninjalj that's great news! Thanks. I'm looking forward to that version landing in Debian. I'll have some editing to do in a few of my answers here including this one. — Stéphane Chazelas, Oct 27 '18 at 20:56
@ninjalj, actually in my tests 2.28 seems to be about as bad as 2.27 (see edit). sort seems to be a lot slower as well (though that may be down to me running that ubuntu 18.10 on a virtual machine). — Stéphane Chazelas, Oct 27 '18 at 21:55
@StéphaneChazelas: The L with middle dot has specific rules for it (e.g, there are rules for U0137, U004C_00B7 and U004C_0387 giving weights <S006C>;"<BASE><VRNT1>";"<CAP><MIN>";<U013F>, DUCET has similar rules). Regarding the million characters that sort as equal, non-assigned codepoints have no weights assigned to them, plus there was some trouble in glibc with sorting some codepoints at astral planes, see: https://sourceware.org/bugzilla/show_bug.cgi?id=22898 But, for the most part, actual assigned characters should sort in a somewhat sane way, barring bugs. — ninjalj, Oct 28 '18 at 12:28
@ninjalj, yes that's what I meant, that special rule doesn't make any sense. uconv -x '[[:assigned:]]>;\n>;' < all-chars-sorted | wc -m returns 837841 even with that older version of uconv, we're still far from 1061355. Last time I looked at it (probably a few years ago now), there were a few bug reports related to that glibc bug/misfeature and more in software that use it. I'd expect most of them to still be there. I'll do some digging if you want when I have a moment. Are you involved in the glibc development? — Stéphane Chazelas, Oct 28 '18 at 12:45
@StéphaneChazelas: nope, I'm just a bystander. Looking at it a bit more, it seems the problem here is that while the UCA has up to 5 levels (L1-L4 plus Ln for a final "Identical" tie-breaking level), glibc abuses the 4th level for tie-breaking (e.g: in https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a93fecdcece3e2178834f4b4868b2309b0158753). So yep, there's still much to improve. — ninjalj, Oct 29 '18 at 01:23

score 15 · Answer 2 · answered Oct 26 '18 at 16:35

man sort:

   ***  WARNING  ***  The locale specified by the environment affects sort
   order.  Set LC_ALL=C to get the traditional sort order that uses native
   byte values.

So, try: LC_ALL=C sort file.txt
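
As a quick check with the four lines from the question (a sketch; c_sort is just a hypothetical wrapper name): in the C locale, sort compares raw bytes, and ɛ is encoded in UTF-8 as the bytes 0xC9 0x9B, which are both greater than e (0x65), so every ɛ line ends up after every e line.

```sh
#!/bin/sh
# Sort by raw byte values, whatever locale the environment specifies.
c_sort() {
    LC_ALL=C sort "$@"
}

printf '%s\n' 'ɛc' 'ed' 'ɛa' 'eb' | c_sort
# prints eb, ed, ɛa, ɛc (one per line)
```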

Ipor Sircer

That works! But why does the default locale consider these completely separate codepoints to be the same? I'm curious why this happens. – Draconis Oct 26 '18 at 16:36
@Draconis What is "the default locale"? – Kamil Maciorowski Oct 26 '18 at 16:39
@KamilMaciorowski An empty value of the environment variable; I'm not sure what locale that corresponds to. – Draconis Oct 26 '18 at 16:44
@Draconis if LC_ALL is empty, sort may use other LC_* variables, LANG or some configuration files. – Maya Oct 26 '18 at 19:43
LC_COLLATE is the string-sort-specific one, LANG is the extra-general one. – ShadowRanger Oct 27 '18 at 03:16
@NieDzejkob, if LC_ALL, LANG and LC_COLLATE are all empty or unset, sort will use the C/POSIX locale for collation, not some configuration file. Empty localisation variables are required to mean the same as unset ones. – Stéphane Chazelas Oct 28 '18 at 07:50
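
The precedence discussed in these comments can be written down as a one-liner. This is a sketch with a made-up function name (effective_collate); it only mirrors the lookup order LC_ALL, then LC_COLLATE, then LANG, treats empty values as unset, and falls back to C as described in the last comment. It does not check that the named locale is actually installed.

```sh
#!/bin/sh
# Print the name of the locale that collation would be taken from.
# "${VAR:-fallback}" treats an empty variable like an unset one.
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```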

score 8 · Answer 3 · answered Oct 26 '18 at 17:34

The character ɛ is not equal to e, but some locales place these characters close together when collating. The reasons are language-specific, and sometimes historical or even political. For example, most people would probably expect the €uro currency to come close to Europe in a dictionary.

Anyway, to see which collation you are currently using, run locale; locale -a lists the locales available on the system. To change the collation, say to C, for a single sort run, use LC_COLLATE=C sort file. Finally, to see how different locales sort your file, try:

for loc in $(locale -a); do
    echo "____${loc}____"
    # Has no effect if LC_ALL is set, since LC_ALL overrides LC_COLLATE.
    LC_COLLATE="$loc" sort file
done

Pipe the result to a grep-like tool to pick the locale that fits your needs.
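
Instead of eyeballing the output, the search can be automated if you already know the order you want. A sketch, with a made-up function name (matching_locales) and file names of your choosing: it prints each locale under which sort produces exactly the wanted order.

```sh
#!/bin/sh
# matching_locales INPUT WANTED [LOCALE...]
# INPUT:  file to sort
# WANTED: file containing the same lines in the order you want
# Locales to try come from the remaining arguments, or from
# `locale -a` when none are given.
matching_locales() {
    input=$1
    wanted=$2
    shift 2
    [ "$#" -gt 0 ] || set -- $(locale -a)
    for loc do
        # LC_ALL is emptied so that it cannot override LC_COLLATE.
        if LC_ALL= LC_COLLATE="$loc" sort "$input" 2>/dev/null | cmp -s - "$wanted"; then
            printf '%s\n' "$loc"
        fi
    done
}
```

For example, matching_locales file wanted-order would list the candidate locales, and matching_locales file wanted-order C en_GB.utf8 would only test those two.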

jimmij

This is a wonderful explanation, but the symbols seem to be considered identical, not just close together. – Draconis Oct 26 '18 at 17:47
No, they're not considered identical. Add a plain ea line to the file, then with sort -u you will get both ea and ɛa in the output. The best strategy with collation is to avoid it (export LC_COLLATE=C). Otherwise, many ugly things will happen (e.g. /tmp/[a-z] in bash will match /tmp/a and /tmp/A but not /tmp/Z). – Oct 26 '18 at 18:13
@mosvy Huh, interesting…so they are considered the same for ordering purposes but not for uniqueness purposes? – Draconis Oct 26 '18 at 18:55
they're not considered the same. see here an explanation about it. – Oct 26 '18 at 19:03
@mosvy: Character ranges [a-z] and [A-Z] have been fixed recently, see https://sourceware.org/bugzilla/show_bug.cgi?id=23393 and https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=commit;h=7cd7d36f1feb3ccacf476e909b115b45cdd46e77 – ninjalj Oct 27 '18 at 10:58
@mosvy: BTW, there are still things that need fixing, e.g: https://bugzilla.redhat.com/show_bug.cgi?id=1631472 – ninjalj Oct 27 '18 at 11:02
Good answer except for your crazy and completely made up "€uro" example. – pipe Oct 27 '18 at 21:28
@ninjalj, that may be fixed in the glibc fnmatch() and regexp ranges, but not in shells like bash that implement ranges themselves using strcoll(). ksh93 never had the problem because its range implementation uses strcoll() but also checks the case of the range ends, and only matches lowercase characters if both ends are lowercase. zsh ranges don't have the issue, as they are based on code point, not strcoll(). – Stéphane Chazelas Oct 28 '18 at 08:07
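
For scripts that need [a-z] to mean "ASCII lowercase only" regardless of the user's locale and shell, one defensive option is to do the match with the C locale forced. A sketch with a made-up function name (is_ascii_lower); the body runs in a subshell so the locale change does not leak into the caller.

```sh
#!/bin/sh
# is_ascii_lower CHAR
# Succeeds only if CHAR is a single ASCII lowercase letter.
# The ( ... ) body is a subshell, so LC_ALL=C stays local to it.
is_ascii_lower() (
    LC_ALL=C
    case $1 in
        [a-z]) true ;;
        *)     false ;;
    esac
)

for c in a A Z; do
    if is_ascii_lower "$c"; then
        echo "$c: matches [a-z]"
    else
        echo "$c: no match"
    fi
done
```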
