POSIX defines a handful of "character classes" in the LC_CTYPE category of the locale definition, with the following 12 names:
alnum alpha blank cntrl digit graph lower print punct space upper xdigit
They are used inside bracket expressions, for example [[:lower:][:digit:]].
Each was meant to define a very exact list of characters. For example, digit was meant to include only the characters 0123456789.
However, with time and use, the exact definition of a digit has been changing. Perl clearly matches more than 0123456789. Grep may also match more than 0123456789:
$ echo '0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९' |
grep -o '[0-9]\+'
0123456789
٠١٢٣٤٥٦٧٨
۰۱۲۳۴۵۶۷۸
߀߁߂߃߄߅߆߇߈
०१२३४५६७८
That is generally caused by the pressure to internationalize the characters in use. For example, it is perfectly natural for a Greek national to view αβγδεζηθικλμνξοπρσςτυφχψω as the lower case letters. But that is not what has been defined. In fact, each of those "character classes" has this limitation added to its POSIX definition:
In the POSIX locale
That states that the character class is only defined (and valid) in the POSIX locale, also known as the C locale.
That is most useful to programmers who need stable and well-defined lists of characters. That [0-9] could only mean 0123456789 may seem reasonable to programmers. Equally, that [a-z] only means abcdefghijklmnopqrstuvwxyz may seem reasonable to a programmer.
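For scripts that need [0-9] to mean exactly the ten ASCII digits, one common workaround is to pin the locale to C for that single command. The following is only a minimal sketch: it assumes a grep that supports -o (GNU, BSD and busybox grep do, although POSIX does not require it), and the sample string is made up for illustration.

```shell
# Illustrative input: ASCII digits followed by Arabic-Indic digits.
sample='0123456789 ٠١٢٣٤٥٦٧٨٩'

# LC_ALL=C makes grep work on single bytes, so the range [0-9]
# can only match the ten ASCII digits.
ascii_digits=$(printf '%s\n' "$sample" | LC_ALL=C grep -oE '[0-9]+')

printf '%s\n' "$ascii_digits"
```

With the sample above this prints 0123456789 only. The price is that the rest of the command also runs in the C locale, so non-ASCII text is handled as plain bytes.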
But if [a-z] is read as "lower case letters", then to a Greek national it seems unreasonable not to include any of the αβγδεζηθικλμνξοπρσςτυφχψω letters. And to a user of a collating order other than C it may seem unreasonable that [a-z] doesn't mean aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYz. But, in turn, that may be unexpected to a naive user. Many users have complained that the range [a-z] was including uppercase letters.
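The complaint about [a-z] picking up uppercase letters can usually be sidestepped by evaluating the range in the C locale, where ranges follow byte values. A small sketch, again assuming a grep with -o and using a made-up sample string:

```shell
# Illustrative input mixing lower and upper case ASCII letters.
sample='aAbBzZ'

# In the C locale the range a-z covers exactly the bytes 'a' through 'z',
# so no uppercase letter can fall inside it.
lower_only=$(printf '%s\n' "$sample" | LC_ALL=C grep -o '[a-z]' | tr -d '\n')

printf '%s\n' "$lower_only"
```

This prints abz. Whether the same range includes uppercase letters in other locales depends on the tool, its version and the locale's collation rules, which is exactly the instability described above.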
In short: character classes are defined only for the C locale.
The other locales have not been defined, which prevents their use. There is no way to ask for the lower case letters in Greek, or to include them in a character range. That is shocking in a present-day computing world that could easily use all languages in a single web page.
Now, we can improve on that.
Trying to restrict the now varied interpretations will most probably fail. We need a new syntax. What if we expand character classes to write exactly what we want them to mean:
Only digits from ASCII: [:as:digit:] <==> 0123456789
Only digits from English: [:en:digit:] <==> 0123456789
Only digits from Persian (Farsi): [:fa:digit:] <==> ۰۱۲۳۴۵۶۷۸۹
Only lowercase letters from English: [:en:lower:] <==> abcdefghijklmnopqrstuvwxyz
Only lowercase letters from Greek: [:el:lower:] <==> αβγδεζηθικλμνξοπρσςτυφχψω
Only uppercase from Russian: [:ru:upper:] <==> АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
.
.
etc.
Stable, and the same in every locale (provided the locale can encode the characters).
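Until something like this exists, the intended meaning can be roughly emulated with fixed character lists. The sketch below is purely hypothetical: class_chars is an invented helper name, it is not part of grep, sed or bash, and it covers only a few of the classes listed above.

```shell
# Hypothetical helper: maps a proposed "lang:class" name to a fixed
# list of characters. Not an existing utility or shell builtin.
class_chars() {
  case "$1" in
    as:digit|en:digit) printf '%s' '0123456789' ;;
    fa:digit)          printf '%s' '۰۱۲۳۴۵۶۷۸۹' ;;
    en:lower)          printf '%s' 'abcdefghijklmnopqrstuvwxyz' ;;
    el:lower)          printf '%s' 'αβγδεζηθικλμνξοπρσςτυφχψω' ;;
    *)                 return 1 ;;   # unknown class name
  esac
}

# Build an ordinary bracket expression from the fixed list. An explicit
# list contains no ranges, so collation order cannot change its meaning.
# The non-ASCII lists need a locale whose encoding matches this script
# (for example a UTF-8 locale) instead of LC_ALL=C.
en_lower=$(printf 'aAbB12\n' | LC_ALL=C grep -o "[$(class_chars en:lower)]" | tr -d '\n')

printf '%s\n' "$en_lower"
```

Here the output is ab. The point of the design is that the list is spelled out once, so the result does not depend on what the current locale thinks a lower case letter is.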
Who should be contacted to get this idea implemented in some utilities (grep, sed, bash maybe)?
\p syntax in perl and pcre and even javascript. Better clean it up and standardize it. The [:junk:] is already an eyesore. No need for more ugliness. Eg. this already works to select lowercase greek characters: echo αβγδабвгabcdABCDАБВГΑΒΓΔ | grep -Po '(?=\p{Greek})\p{Ll}' => α β γ δ – Mar 05 '20 at 22:51