Proposed additional POSIX "Character Classes"

Question

POSIX defines a handful of "character classes" in the LC_CTYPE category of a locale definition, with the following 12 names:

alnum alpha blank cntrl digit graph lower print punct space upper xdigit

They are used inside bracket expressions, as in [[:lower:][:digit:]].

Each was set to define a very exact list of characters.
For example, digit was meant to include only the characters 0123456789.

However, with time and use, the exact definition of a digit has been changing. Perl clearly matches more than 0123456789. Grep may also match more than 0123456789:

$ echo '0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९' |
    grep -o '[0-9]\+'
0123456789
٠١٢٣٤٥٦٧٨
۰۱۲۳۴۵۶۷۸
߀߁߂߃߄߅߆߇߈
०१२३४५६७८
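The same split between "ASCII digits" and "any decimal digit" can be seen outside grep. The following is only a sketch in Python (not one of the tools discussed here), assuming the standard `re` module; the digit runs are built from code points so the example does not depend on the terminal or the locale.

```python
import re

# Code points of the digit zero in five scripts:
# ASCII, Arabic-Indic, Extended Arabic-Indic (Persian), NKo, Devanagari.
ZEROS = [0x0030, 0x0660, 0x06F0, 0x07C0, 0x0966]

# Each run is the ten consecutive digits starting at that script's zero.
runs = ["".join(chr(zero + i) for i in range(10)) for zero in ZEROS]
text = " ".join(runs)

# For str patterns, \d is Unicode-aware by default: any decimal digit matches.
unicode_digits = re.findall(r"\d+", text)

# With re.ASCII, \d is restricted to 0123456789.
ascii_digits = re.findall(r"\d+", text, re.ASCII)

# An explicit range is compared by code point here, so it is also ASCII only.
bracket_digits = re.findall(r"[0-9]+", text)

print(len(unicode_digits), ascii_digits, bracket_digits)
```

Here the programmer has to opt in to the narrow meaning with a flag, which is the kind of explicitness the question is about.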

This drift is generally caused by the pressure to internationalize the characters used. For example, it is perfectly natural for a Greek national to view αβγδεζηθικλμνξοπρσςτυφχψω as the lower case letters. But that is not what has been defined. In fact, all those "character classes" carry this limitation in their POSIX definition:

In the POSIX locale

That states that the character class is only defined (and valid) in the C locale.
That is most useful to programmers who need stable and well-defined lists of characters.
That [0-9] could only mean 0123456789 may seem reasonable to programmers.
Similarly, it may seem reasonable to a programmer that [a-z] only means abcdefghijklmnopqrstuvwxyz. But if [a-z] is read as "lower case letters", then for a Greek national it seems unreasonable not to include any of the αβγδεζηθικλμνξοπρσςτυφχψω letters. And for a user of a collating order other than C it may seem unreasonable that [a-z] doesn't mean aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYz. But that, in turn, may be unexpected to a naive user: many users have complained that the range [a-z] was including uppercase letters.

In short: character classes are defined only for the C locale.
For the other locales nothing has been defined, which prevents their portable use. There is no way to ask for the lower case letters of Greek, or to include them in a character range. That is shocking in a present-day computing world where a single web page can easily use all languages.

Now, we can improve on that.

Trying to limit the now varied interpretation will most probably be a failure. We need a new syntax. What if we expand character classes to write exactly what we want them to mean:

Only digits from ASCII:              [:as:digit:]  <==> 0123456789
Only digits from English:            [:en:digit:]  <==> 0123456789
Only digits from Persian (Farsi):    [:fa:digit:]  <==> ۰۱۲۳۴۵۶۷۸۹
Only lowercase letters from English: [:en:lower:]  <==> abcdefghijklmnopqrstuvwxyz
Only lowercase letters from Greek:   [:el:lower:]  <==> αβγδεζηθικλμνξοπρσςτυφχψω
Only uppercase from Russian:         [:ru:upper:]  <==> АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
.
.
etc.

These would be stable and mean the same in every locale (provided the locale's encoding can represent the characters).
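To make the idea concrete, here is a minimal sketch of how such a language-qualified class could be resolved to a fixed list of characters. Everything in it is hypothetical: the `proposed_class` function, the `LANG_RANGES` table and the code point ranges chosen for each language tag are illustrative assumptions, not part of POSIX or of any existing tool. It only relies on Python's standard `unicodedata` module.

```python
import unicodedata

# Hypothetical table: language tag -> code point ranges considered for it.
# These ranges are assumptions made for this sketch only.
LANG_RANGES = {
    "en": [(0x0030, 0x007A)],                    # ASCII digits and letters
    "el": [(0x0391, 0x03A9), (0x03B1, 0x03C9)],  # basic Greek letters
    "fa": [(0x06F0, 0x06F9)],                    # Extended Arabic-Indic digits
}

# Class name -> Unicode general category used to filter the ranges.
CLASS_CATEGORY = {"digit": "Nd", "lower": "Ll", "upper": "Lu"}


def proposed_class(lang, cls):
    """Return the characters a hypothetical [:lang:cls:] would expand to."""
    wanted = CLASS_CATEGORY[cls]
    return "".join(
        chr(cp)
        for lo, hi in LANG_RANGES[lang]
        for cp in range(lo, hi + 1)
        if unicodedata.category(chr(cp)) == wanted
    )


print(proposed_class("en", "digit"))
print(proposed_class("el", "lower"))
print(proposed_class("fa", "digit"))
```

The result does not depend on the current locale, which is the property asked for above. The characters come out in code point order, so the Greek list has ς before σ.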

Who should be contacted to get this idea implemented in some utilities (grep,sed,bash maybe)?

There's already support via the \p syntax in perl and pcre and even javascript. Better clean it up and standardize it. The [:junk:] is already an eyesore. No need for more ugliness. Eg. this already works to select lowercase greek characters: echo αβγδабвгabcdABCDАБВГΑΒΓΔ | grep -Po '(?=\p{Greek})\p{Ll}' => α β γ δ — , Mar 05 '20 at 22:51
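For comparison, the selection made by that grep -P command can be approximated in Python. This is only a sketch: Python's standard `re` module has no \p syntax, so the character name prefix "GREEK" from `unicodedata` is used here as a rough stand-in for the Greek script property, and `greek_lowercase` is a name made up for this example.

```python
import unicodedata


def greek_lowercase(text):
    """Keep characters that are lowercase letters (Ll) with a GREEK name."""
    return [
        ch
        for ch in text
        # Assumption: the name prefix approximates \p{Greek}.
        if unicodedata.category(ch) == "Ll"
        and unicodedata.name(ch, "").startswith("GREEK")
    ]


sample = "αβγδабвгabcdABCDАБВГΑΒΓΔ"
print(" ".join(greek_lowercase(sample)))
```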
I'm voting to close this question as off-topic because while asking for where to find POSIX specifications is on-topic, questions on where to submit amendment proposals to such specifications is not. — AdminBee, Mar 09 '20 at 14:09

score 3 · Answer 1 · answered Mar 05 '20 at 22:30

The problem's already been addressed in POSIX, using the wide-character functions. Start with <wctype.h> and <wchar.h>, which work with the current locale, and <locale.h> for specifying which locale that might be.

It seems that no one's found it necessary to add special syntax for referencing multiple, unrelated locales in a regular expression.

Thomas Dickey

Your example comes across as UTF-8, which would be el_GR.UTF-8 – Thomas Dickey Mar 05 '20 at 23:31
POSIX provides the mechanism, but (someone may point to a useful source...) doesn't enumerate the possible locales. So the question is a little ambiguous (no single organization specifies everything) -hth – Thomas Dickey Mar 06 '20 at 00:59
neither: I recall that this is the case, but don't have time to dig out a suitable reference to provide for (what is actually) a second question. – Thomas Dickey Mar 06 '20 at 01:54

Stephen Kitt · Accepted Answer · 2020-03-06T17:06:58.163

Who should be contacted to get this idea implemented in some utilities (grep,sed,bash maybe)?

There already is some level of support. For example, on systems using the GNU C library and its locale definitions, “é” is recognised as lower-case in French locales, and “α” is recognised as lower-case in Greek locales. Farsi, as defined in the GNU C library, uses ۰۱۲۳۴۵۶۷۸۹ in some circumstances (notably, scanf and printf with the I modifier), but they are not part of the “digits” class, and I imagine Sharif FarsiWeb know what they’re doing in this regard.

Suggesting changes of this type is a bit complicated nowadays. You could always join the Austin Group and discuss this there, either through the mailing list or the bug tracker (ideally, start by lurking for a while on the mailing list, or read the archives); but POSIX isn’t really the right place to try to drive change without any existing implementations. You could try to suggest the change to the developers of the various tools involved, presumably starting with the locale definitions in some C library or other, but you’re unlikely to get too far without some pressing case (typically, a standards requirement), so you end up in a bit of a Catch-22 situation.

Nowadays I think your best bet is to come up with a valid use-case for an important customer of one of the big operating system vendors, and get the change driven that way. Then the vendor will take care of all the community-wrangling for you.
