
Consider the following example:

$ bash --version
GNU bash, version 4.4.20(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ LC_COLLATE=C bash --norc -c '[[ B == [a-z] ]] && echo yes || echo no'
no
$ LC_COLLATE=en_GB bash --norc -c '[[ B == [a-z] ]] && echo yes || echo no'
yes
$ LC_COLLATE=C bash --norc -c '[[ B =~ [a-z] ]] && echo yes || echo no'
no
$ LC_COLLATE=en_GB bash --norc -c '[[ B =~ [a-z] ]] && echo yes || echo no'
no

It seems that when matching against a pattern (i.e., using = or ==), Bash collates according to LC_COLLATE; however, when matching against a regex (i.e., using =~) Bash collates according to POSIX or something similar.

Zsh—at least zsh 5.8.0.2-dev (x86_64-pc-linux-gnu)—prints no in all cases.

Is there any guarantee as to what exactly [a-z] will match when used in a pattern or a regex?

– d125q

1 Answer


No, there is no guarantee as to what exactly [a-z] will match, period.

Well, in any locale other than "C" (assuming the utility complies with POSIX).

The core issue is with the "range" expression (using -).
An explicit list like [abcdefghijklmnopqrstuvwxyz] will never fail.
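
For instance, spelling the range out keeps the question's test stable across locales:

$ LC_COLLATE=en_GB bash --norc -c '[[ B == [abcdefghijklmnopqrstuvwxyz] ]] && echo yes || echo no'
no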


POSIX requires that a-z be exactly abcdefghijklmnopqrstuvwxyz, yes, but only when the locale is the POSIX default, i.e. "C".

From the POSIX spec:

In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched. A range expression shall be expressed as the starting point and the ending point separated by a <hyphen-minus> ( '-' ).

And even if POSIX requires a specific meaning for a-z, any application might choose to simply ignore POSIX.

Just to show the tip of the iceberg:

Python 2.7 matches only ASCII for a-z, but Python 3.0 will match many other Unicode characters. Bash used to match only ASCII, up to version 3.2; then it switched to matching the characters that collate between a and z in the applied locale, which might include A-Y (though usually not Z). Now, in bash 5.0+, the range is governed by the globasciiranges option (added in bash 4.3, on by default since 5.0), which makes a-z aim to match mostly ASCII characters.

$ LC_COLLATE=en_GB bash -c 'shopt -u globasciiranges; [[ B == [a-z] ]] && echo yes || echo no'
yes

$ LC_COLLATE=en_GB bash -c 'shopt -s globasciiranges; [[ B == [a-z] ]] && echo yes || echo no'
no

But even with bash 5.0 and globasciiranges active, == [a-z] will match 2190 characters in the en_GB.UTF-8 locale. Just so you understand, this is the list of a-like characters allowed:

a a ͣ ⒜              ⓐ A               Ⓐ 
ᵃ ₐ ᴬ   ă Ă ắ Ắ ằ Ằ ẵ Ẵ ẳ Ẳ ấ Ấ ầ Ầ ẫ Ẫ ẩ Ẩ ǎ Ǎ Å ǻ Ǻ ᷲ ꞛ Ꞛ ǟ Ǟ ȧ Ȧ ǡ Ǡ ą
Ą ā Ā ả Ả ȁ Ȁ ȃ Ȃ ạ Ạ ặ Ặ ậ Ậ ḁ Ḁ ᷓ ꜳ Ꜳ  ℀ ᷔ ᴭ ǽ Ǽ ǣ Ǣ ㏂ ㏟ ᷕ ꜵ Ꜵ ℁ ⅍
ꜷ Ꜷ ㍳ ᷖ ꜹ Ꜹ ꜻ Ꜻ ꜽ Ꜽ ẚ ᴀ ⱥ Ⱥ ᶏ ᴁ ᴂ ᵆ ꬱ ɐ Ɐ ᵄ ɑ ᷧ Ɑ ᵅ ꬰ ᶐ ɒ Ɒ ᶛ ꭤ 

And the count changes to 1368 characters if the test is =~ [a-z]; i.e., switching to a regex changes the list of matched characters.

However, in the C locale (for bash) only the 26 ASCII letters are matched (with either == or =~).
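
Here is a minimal counting sketch to reproduce such figures (an illustration under assumptions: it needs bash >= 4.2 for printf's \U escape, and it scans only the Basic Multilingual Plane, so its totals will be lower than the full-Unicode figures quoted above):

#!/usr/bin/env bash
# Count how many BMP code points match [a-z] as a glob and as a regex
# under the current locale. Run it e.g. as:
#   LC_ALL=en_GB.UTF-8 bash count.sh
#   LC_ALL=C bash count.sh
glob=0 re=0
for ((cp = 0x21; cp <= 0xFFFF; cp++)); do
    ((cp >= 0xD800 && cp <= 0xDFFF)) && continue   # skip UTF-16 surrogates
    printf -v hex '%08x' "$cp"                     # zero-padded hex for \U
    printf -v ch "\\U$hex"                         # build the character
    [[ $ch == [a-z] ]] && ((glob++))
    [[ $ch =~ [a-z] ]] && ((re++))
done
printf 'glob [a-z]: %d   regex [a-z]: %d\n' "$glob" "$re"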

zsh

Now, zsh is set up to compare the numeric value of the wchar_t representation of the character under test against the numeric range on the right-hand side of:

[[ "$c" == [a-z] ]]

But what does that mean?

  1. In ASCII, ISO-8859 and UTF-8 locales it usually means that the numeric value of the wchar_t is the same as the character's position in the charmap.

    a. We programmers should already know about ASCII numbering. And there is no particular reason why @ should sort after #, just plain luck.

    b. As a practical example of an issue with ISO-8859: in ISO-8859-1 and ISO-8859-15, ¾ and Ÿ have the same numeric value (0xBE). How should that be resolved? See the check below.
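
    A quick check with iconv shows the clash; the single byte 0xBE decodes to a different character under each charset:

    $ printf '\xbe' | iconv -f ISO-8859-1 -t UTF-8
    ¾
    $ printf '\xbe' | iconv -f ISO-8859-15 -t UTF-8
    Ÿ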

    c. The numeric position of characters in Unicode generally follows their age: older characters usually have lower numeric values than newer ones. The ñ (\u00f1), a Spanish character, is not included in the a-z character range (\u0061 - \u007a), though any Spanish speaker would say it should be.

    $ LC_ALL=es_ES.UTF-8  zsh -c '[[ ñ == [a-z] ]] && echo yes || echo no'
    no
    

    But, shockingly, it matches in German; note, though, that this test uses =~ (the system regex library) rather than == (zsh's own glob matcher), and it is that change of operator, not the locale, that makes the difference:

    $ LC_ALL=de_DE.UTF-8  zsh -c '[[ ñ =~ [a-z] ]] && echo yes || echo no'
    yes
    
  2. But, as Stéphane Chazelas (a zsh fan) has said:

The other problem is that outside of locales using ASCII, ISO-8859-1 or UTF-8 charmaps, the behaviour varies between systems as not all systems implement wchar_t the same.

In fact, what a wchar_t means is compiler-dependent:

wchar_t is compiler-dependent and therefore not very portable. Using it for Unicode binds a program to the character model of a compiler.

  3. And, more generally, wchar_t is not needed to use Unicode at all. So why use it? (And I am using the name Unicode here to mean all existing characters.)

rant mode on


Sorry, but I must complain; I need to get this off my chest.

Yes, it is useful (for programmers) that the range a-z means a stable and preferably small list of characters. Yes, the ASCII list of lowercase letters seems like a reasonable (even if naive) solution. And yes, the situation shown above, with [a-z] matching 2190 characters in a list that can easily change from one version to the next, is a huge problem for programmers and for the security of the applications they develop.

But forcing everyone, everywhere, to be able to match only 26 letters in any locale (as POSIX now intends, by establishing that [a-z] shall match only ASCII lowercase letters in any locale) is extremely naive and oversimplifies the issue at hand.

code-pages

There was the idea of encoding languages in code-pages (lists of usually 256 characters); one of the first code-pages was ASCII (of only 128 characters), and it was American-English only.

But American-English covers only 369 million people; the rest of the world, 95% of it, doesn't match that description. Just expanding the view to the rest of the Americas, the Spanish-Portuguese part (more than 650 million people), makes English only 36% of the Americas.

ASCII was expanded to cover different languages. So there was a code-page for Cyrillic, one code-page for Hebrew, one that included the Spanish ñ, one that had the Portuguese ç, etc. But that meant that you could not write a letter in both Hebrew and Cyrillic.

Then came the idea that 16 bits were enough. And that kind of worked for a while, until we got 50,000 characters from China. Several solutions tried to shoehorn all those characters into just 16 bits, but that had no hope of working.

That didn't work, much as "who will ever need more memory than 640K?" didn't work.

utf-8

Then, thank God, the web embraced UTF-8 and languages got a lot easier. UTF-8 covers all 1,114,112 possible Unicode code-points (17 * 2^16) directly, and could cover more if needed. Yet only 143,859 characters have been assigned as of Unicode 13.0 (yes, with all the Chinese characters included).
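
The plane arithmetic, for reference:

$ echo $(( 17 * 2**16 ))
1114112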

Let's understand it: There are many languages!!!

There are many points of view!!!

programmers

We programmers (and I include myself) would surely like everyone to know about ASCII, and for only 26 letters (all the time) to be matched by [a-z]. But that is not the world we live in. Most people (I'd guess 99.9% of the world population is not a programmer) don't know about ASCII. But most people who can read and write have some idea of what a dictionary is, and a collation order is mostly a dictionary order.

I am sure that a dictionary in Greece doesn't use a-z for its main entries. The same could be repeated for most of the world.

We need to grow beyond the 26-letter a-z range.

We either embrace the change or it will be externally imposed on us, with its good and bad consequences. We had better find a solution.

I could imagine using [ascii:a-z] or something similar to mean only the 26 ASCII letters in any locale, even C. Yes, it is not backward compatible, but we are already in the middle of a non-backward-compatible problem: [a-z] in bash might match almost anything.
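
Something close to those semantics can already be emulated; the ascii_az helper below is a hypothetical sketch, not an existing feature:

# Hypothetical helper: match a single character against the 26 ASCII
# lowercase letters only, regardless of the caller's locale, by forcing
# the C locale for the comparison.
ascii_az() {
    LC_ALL=C bash --norc -c '[[ $0 == [a-z] ]]' "$1"
}

ascii_az b && echo yes || echo no    # prints: yes
ascii_az ñ && echo yes || echo no    # prints: no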

That would allow, for example, [french:a-z] and [greek:a-z], and/or [greek:α-ω], each meaning something sensible to the users of that language. And [a-z] would remain as the all-languages range that we are already getting (which, as it happens, is backward compatible).


rant mode off


  • Thank you. What about Zsh? From what I understand, Zsh’s ranges are based on codepoints, i.e., [a-z] will match anything whose codepoint falls between those of a and z. Is this correct? Of course, this is only relevant for the =/== operator as =~ uses either regex.h or pcre.h and is probably implementation-dependent. – d125q Apr 16 '21 at 06:27
  • @d125q I added some bits about zsh. I'll add more later (when I have simplified them). But please, do not oversimplify. Almost nothing in i18n (internationalization) is simple. –  Apr 18 '21 at 05:21
  • globasciiranges was added in bash 4.3. What changed in 5.0 is that it became the default. It is still very buggy and has little to do if anything with the ASCII charset. – Stéphane Chazelas Apr 18 '21 at 06:47
  • UCS2 is an encoding (two different encodings for the two endiannesses) of Unicode limited to code points 0 to 0xffff, so 65536 minus the 0xd800 .. 0xdfff range which are reserved for UTF-16 encoding and are not characters. – Stéphane Chazelas Apr 18 '21 at 06:50
  • What [a-z] matches with =~ is not up to bash (or zsh, bearing in mind that zsh can also switch between PCRE and ERE) as that's done by the regex library, so it will vary greatly with the system and version thereof. For globs in bash, the number of characters matched by [a-z] for a given locale will also vary greatly between systems and versions thereof, as it depends on how those locales are defined. – Stéphane Chazelas Apr 18 '21 at 06:56
  • In your Spanish versus German example, you switch between == (using zsh globs, well defined) and =~ (using the system regexp library, so generally random), which is what explains the difference, not the change of locale. FWIW, I have an é character in my first name and I don't think I've ever used [a-z] (or [a-f]) expecting it to match on é (note that where [a-z] matches on é, it never matches on ź). – Stéphane Chazelas Apr 18 '21 at 06:59
  • While wchar_t may be compiler dependent, it's not the problem here. wchar_t will be large enough to encompass any character supported by any locale on a given system. What matters here is what wchar_t values the C library (all the mbtowc and co functions) will give a given character. And it's true that zsh's behaviour is most useful on systems where it matches the unicode code point of the character. – Stéphane Chazelas Apr 18 '21 at 07:06
  • @StéphaneChazelas Sorry, I meant UTF-16 where I wrote UCS2. One came from the other, so they are closely related. And you are wrong: UCS2 had no surrogate points (https://en.wikipedia.org/wiki/Universal_Coded_Character_Set). Those were needed when it was modified to become UTF-16. –  Apr 18 '21 at 07:47
  • Should [french:a-z] match on æ, , á, a<U+0302>? Would [polish:a-z] match on ź? You can also use [[:alpha:]] to match on character constituents of human language scripts (note that it's whether they are alphabetical or not). Or in zsh -o rematchpcre, [[ $char =~ '\p{Latin}' ]] to match on characters of the Latin script (or \p{Greek} for the Greek script (note that it matches on α (U+03B1) and not (U+237A)) – Stéphane Chazelas Apr 18 '21 at 07:48
  • @StéphaneChazelas So, you support the idea that [a-z] should match a different list of characters in globbing than in regex. –  Apr 18 '21 at 07:49
  • @StéphaneChazelas Should [french:a-z] match on æ, ff, á, a<U+0302>? That is something to be left to the people of France to decide. Give freedom to different languages. –  Apr 18 '21 at 07:50
  • I meant that UCS2 covers 63487 characters, not 1.114.112 – Stéphane Chazelas Apr 18 '21 at 07:50
  • @StéphaneChazelas Not according to Wikipedia: UCS2 covers the full 65536 character numbers. (https://en.wikipedia.org/wiki/Universal_Coded_Character_Set) –  Apr 18 '21 at 07:52
  • you support the idea that [a-z] should match a different list of characters in globbing than in regex? No. But there's not much shells can do about that other than implementing their own regexp matcher like ksh93 does. Note that globs need to be able to match on file names which are not necessarily made of text, while regexp libraries are generally limited to text. – Stéphane Chazelas Apr 18 '21 at 07:53
  • @StéphaneChazelas No? Then finding that they are different should be viewed as a defect. No justification is required; changes should be implemented to reduce or remove such differences. –  Apr 18 '21 at 07:55
  • So, what character is UCS2BE 0xd8 0x00 for instance the encoding of? – Stéphane Chazelas Apr 18 '21 at 07:56
  • You'll find that all those issues are still being discussed. There are plenty of open bugs against libcs, languages, systems, shells about that less than ideal situation, and it's still changing a lot between versions. It will take time before something usable comes out of it, even more so before all systems agree on it. For now, zsh's glob behaviour at least in UTF-8 locales or ASCII ranges is the closest to what I consider a sane approach (and bash's among the worst). – Stéphane Chazelas Apr 18 '21 at 07:59
  • @StéphaneChazelas iconv claims it to be Ø. You can try: printf '\x00\xd8' | iconv -f UTF16BE -t UTF8 Or is it the other way around? –  Apr 18 '21 at 08:08
  • @StéphaneChazelas Yes, as usual, bashing on bash. Have a nice day ! –  Apr 18 '21 at 08:09
  • Thanks a lot for the answers and the discussion! I received way more information than I originally set out to, so I am accepting the answer. – d125q Apr 18 '21 at 08:19
  • @StéphaneChazelas globasciiranges was added in bash 4.3. Yes, it was. --- little to do if anything with the ASCII charset Well, man bash states: range expressions used in pattern matching bracket expressions behave as if in the traditional C locale when performing comparisons. As the C locale uses the ASCII ordering, it seems that globasciiranges is closely related to ASCII. –  Apr 18 '21 at 19:04
  • @StéphaneChazelas UCS-2 was one possible encoding of ISO/IEC 10646. It had only 2^16 character numeric values (65536). It was never updated to remove the surrogate range. What happened was that 2k numeric positions (1024 rows times 1024 columns) were set apart to encode 1M numeric values. Those are 16 planes of 65536 positions. Together that makes 17 planes, numerically from 0 to \x10FFFF. –  Apr 18 '21 at 19:23
  • @StéphaneChazelas From that we need to remove the row and column numeric values (2k or 2048 numbers) and some other invalid code numbers (66). That makes the count echo $(( 17 * 2**16 - 2048 - 66 )) or echo $(( 0x10ffff + 1 - 2**11 - 66 )). In short, UTF-16 covers 1114112 numbers but only 1,111,998 valid encoded characters. UCS-2 (no longer valid) still remains at 65536 code points. –  Apr 18 '21 at 19:23
  • @StéphaneChazelas I have an é character in my first name. Yes, you do. And where would you need to find the word École in a French dictionary? After the z, or somewhere in the a-z range? What is more natural for a French-speaking user? –  Apr 18 '21 at 19:28