No, there is no guarantee as to what exactly [a-z] will match, period. Well, in any locale other than "C", that is (assuming the utility complies with POSIX).
The core issue is with the "range" expression (using -). An explicit list like [abcdefghijklmnopqrstuvwxyz] will never fail.
POSIX requires that a-z be exactly abcdefghijklmnopqrstuvwxyz, yes, but only when the locale is the POSIX default, i.e. "C".
From the POSIX spec:
In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched. A range expression shall be expressed as the starting point and the ending point separated by a hyphen ( '-' ).
And even if POSIX requests a specific meaning for a-z, any application might choose to simply ignore POSIX.
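As a baseline, in the C locale bash does behave as POSIX describes. A quick illustrative check (output as observed with a common bash build; your system may differ):
$ LC_ALL=C bash -c '[[ B == [a-z] ]] && echo yes || echo no'
no
$ LC_ALL=C bash -c '[[ b == [a-z] ]] && echo yes || echo no'
yes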
Just to show the tip of the iceberg:
Python 2.7 matches only ASCII with a-z, but Python 3.0 will match many other Unicode characters. Bash used to match only ASCII up to version 3.2. Then it decided to match the characters that collate between a and z in the applied locale, which might include A-Y (but usually not Z). Now, in bash version 5.0+ the range is configurable with the globasciiranges option which, being on by default, makes a-z match mostly ASCII characters.
$ LC_COLLATE=en_GB bash -c 'shopt -u globasciiranges; [[ B == [a-z] ]] && echo yes || echo no'
yes
$ LC_COLLATE=en_GB bash -c 'shopt -s globasciiranges; [[ B == [a-z] ]] && echo yes || echo no'
no
But even with bash 5.0 and globasciiranges active, == [a-z] will match 2190 characters in the en_GB.utf-8 locale. Just so you understand, this is the list of a-like characters allowed:
a a ͣ ⒜ ⓐ A Ⓐ
ᵃ ₐ ᴬ ă Ă ắ Ắ ằ Ằ ẵ Ẵ ẳ Ẳ ấ Ấ ầ Ầ ẫ Ẫ ẩ Ẩ ǎ Ǎ Å ǻ Ǻ ᷲ ꞛ Ꞛ ǟ Ǟ ȧ Ȧ ǡ Ǡ ą
Ą ā Ā ả Ả ȁ Ȁ ȃ Ȃ ạ Ạ ặ Ặ ậ Ậ ḁ Ḁ ᷓ ꜳ Ꜳ ℀ ᷔ ᴭ ǽ Ǽ ǣ Ǣ ㏂ ㏟ ᷕ ꜵ Ꜵ ℁ ⅍
ꜷ Ꜷ ㍳ ᷖ ꜹ Ꜹ ꜻ Ꜻ ꜽ Ꜽ ẚ ᴀ ⱥ Ⱥ ᶏ ᴁ ᴂ ᵆ ꬱ ɐ Ɐ ᵄ ɑ ᷧ Ɑ ᵅ ꬰ ᶐ ɒ Ɒ ᶛ ꭤ
And the number of matched characters changes to 1368 if the test is =~ [a-z] instead; i.e. changing to a regex changes the list of matched characters.
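If you want to reproduce that kind of count yourself, a rough sketch is to walk the code points and test each one against both operators. This is only my own illustration, not how the figures above were produced: it assumes bash >= 4.2 (for the \U printf escape) and an installed en_GB.UTF-8 locale, it skips the surrogate range, it only walks the Basic Multilingual Plane, and it is slow, so the totals will not be exactly the figures above:
$ LC_ALL=en_GB.UTF-8 bash -c '
    glob=0 regex=0
    for ((cp = 0x20; cp <= 0xFFFF; cp++)); do
      (( cp >= 0xD800 && cp <= 0xDFFF )) && continue   # surrogates are not characters
      printf -v hex "%08X" "$cp"
      printf -v ch "\\U$hex"                            # build the character from its code point
      [[ $ch == [a-z] ]] && ((glob++))
      [[ $ch =~ [a-z] ]] && ((regex++))
    done
    echo "glob matches: $glob   regex matches: $regex"'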
However, in the C locale (for bash) only 26 letters are matched (with either == or =~).
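In practice, if you need exactly those 26 letters today, the two approaches I know of are an explicit list (which does not depend on the locale at all) or switching to the C locale just for the test. An illustrative sketch with bash (outputs as observed on a common build; other shells may behave differently):
$ LC_ALL=en_GB.UTF-8 bash -c '[[ B == [abcdefghijklmnopqrstuvwxyz] ]] && echo yes || echo no'
no
$ LC_ALL=en_GB.UTF-8 bash -c 'LC_ALL=C; [[ B == [a-z] ]] && echo yes || echo no'
no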
zsh
Now, zsh is set up to compare the numeric value of the wchar_t representation of the character tested with the numeric range on the right hand side of:
[[ "$c" == [a-z] ]]
But what does that mean?
In ASCII, iso-8859 and UTF-8 it usually means that the numeric value of the wchar_t is the same as the character's numeric value in the charmap.
a. We programmers should already know about ASCII numbering. And there is no particular reason why @ should sort after #, just plain luck.
b. As a practical example of an issue with iso-8859: in iso-8859-1 and iso-8859-15 the same numeric value (0xBE) means ¾ in the former and Ÿ in the latter. How should that be resolved?
c. The numeric position of characters in Unicode generally follows their age: older characters usually have a lower numeric value than newer ones. A ñ (\u00f1), a Spanish character, is not included in the a-z character range (\u0061 - \u007a). Any Spanish speaker would say it should be.
$ LC_ALL=es_ES.UTF-8 zsh -c '[[ ñ == [a-z] ]] && echo yes || echo no'
no
But shockingly, it is in German:
$ LC_ALL=de_DE.UTF-8 zsh -c '[[ ñ =~ [a-z] ]] && echo yes || echo no'
yes
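To see the numeric comparison at work, here is a small illustrative check of my own (it assumes a system where wchar_t values are Unicode code points, as with glibc, and an installed en_GB.UTF-8 locale). é is U+00E9, which falls numerically between à (U+00E0) and ö (U+00F6), so zsh accepts it regardless of what the collation order says:
$ LC_ALL=en_GB.UTF-8 zsh -c '[[ é == [à-ö] ]] && echo yes || echo no'
yes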
But, as Stéphane Chazelas (a zsh fan) has also said:
The other problem is that outside of locales using ASCII,
ISO-8859-1 or UTF-8 charmaps, the behaviour varies between
systems, as not all systems implement wchar_t's the same.
In fact, what a wchar_t means is compiler dependent:
wchar_t is compiler-dependent and therefore not very portable. Using it for Unicode binds a program to the character model of a compiler.
And, more generally, wchar_t is not needed to use Unicode at all. So: why use it? (I am using the name Unicode here to mean all existing characters.)
rant mode on
Sorry, but I must complain; I need to get this off my chest.
Yes, it is useful (for programmers) that the range a-z means a stable and preferably small list of characters. Yes, the ASCII list of lowercase letters seems like a reasonable (even if naive) solution. And yes, the situation of [a-z] matching 2190 characters, a list that easily changes from one version to the next as shown above, is a huge problem for programmers and for the security of the applications they develop.
But forcing everyone, everywhere, to be able to match only 26 letters in any locale (as is now the intent of POSIX: to establish that [a-z] will match only ASCII lowercase letters in any locale) is extremely naive and oversimplifies the issue at hand.
code-pages
There was the idea to encode languages in code-pages (lists of usually 256 characters); one of the first code-pages was ASCII (with only 128 characters). It was American-English only.
But American English is only 369 million people; the rest of the world, 95% of the world, doesn't match that description. Just expanding to the rest of America, the Spanish- and Portuguese-speaking part (more than 650 million people), makes English only 36% of America.
ASCII was expanded to cover different languages. So there was a code-page for Cyrillic, one code-page for Hebrew, one that included the Spanish ñ, one that had the Portuguese ç, etc. But that meant that you could not write a letter in both Hebrew and Cyrillic.
Then came the idea that 16 bits were enough. And that kind of worked for a while, until we got 50,000 characters from China. Several solutions tried to shoehorn all those characters into just 16 bits, but that had no hope of working.
That didn't work, much as "who will ever need more memory than 640K ?" didn't work.
utf-8
Then, thank God, the web embraced UTF-8 and languages got a lot easier. UTF-8 covers all 1,114,112 possible Unicode code points (17 * 2^16) directly, and could cover more if needed. Yes, only 143,859 characters have been assigned as of Unicode 13.0 (with all the Chinese characters included).
Let's understand it: There are many languages !!!
There are many points of view !!!
programmers
We programmers (and I am including myself as one) would surely like everyone to know about ASCII and [a-z] to match only 26 letters, all the time. But that is not the world we live in. Most people (I guess 99.9% of the world population is not a programmer) don't know about ASCII. But most people who can read and write have some idea of what a dictionary is. And a collation order is mostly a dictionary order.
I am sure the dictionary in Greece doesn't use a-z as its main entries. The same could be said for most of the world.
We need to grow beyond the 26-letter a-z range.
We either embrace the change or it will be externally imposed on us, with its good and bad consequences. We had better find a solution.
I could imagine using [ascii:a-z] or something similar to mean only the 26 ASCII letters in any locale, even C. Yes, it is not backward compatible, but we are already in the middle of the problem, a not-backward-compatible problem: [a-z] in bash might match almost anything.
That would allow, for example, [french:a-z] and [greek:a-z], and/or [greek:α-ω], and let users of that language expect something sensible. And [a-z] would remain as the all-languages range that we are already getting (which happens to be backward compatible).
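For what it is worth, a plain [α-ω] range can already behave sensibly in a Greek locale today. An illustrative check (it assumes bash and an installed el_GR.UTF-8 locale, and the exact set of matched characters carries the very same portability caveats discussed above):
$ LC_ALL=el_GR.UTF-8 bash -c '[[ λ == [α-ω] ]] && echo yes || echo no'
yes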
rant mode off