Does (should) LC_COLLATE affect character ranges?

Question

Collation order through LC_COLLATE defines not only the sort order of individual characters, but also the meaning of character ranges. Or does it? Consider the following snippet:

unset LANGUAGE LC_ALL
echo B | LC_COLLATE=en_US grep '[a-z]'

Intuitively, B isn't in [a-z], so this shouldn't output anything. That's what happens on Ubuntu 8.04 or 10.04. But on some machines running Debian lenny or squeeze, B is found, because the range a-z includes everything that's between a and z in the collation order, including the capital letters B through Z.
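One way to see exactly which capitals a locale places inside the range is to test each letter in turn. This is a minimal sketch, assuming a POSIX shell and grep; the helper name upper_in_az is made up for illustration, and the locale name is whatever you pass in.

```shell
# upper_in_az LOCALE
# Print the uppercase ASCII letters that '[a-z]' matches when LC_COLLATE is
# set to LOCALE. The function name is hypothetical, not part of any tool.
upper_in_az() {
    # An empty LC_ALL is treated as unset, so it cannot override LC_COLLATE.
    matched=$(
        printf '%s\n' A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
            LC_ALL= LC_COLLATE="$1" grep '[a-z]' | tr -d '\n'
    ) || true   # grep exits non-zero when nothing matches; not an error here
    printf '%s\n' "$matched"
}

upper_in_az C        # prints an empty line: no capital falls inside a-z
upper_in_az en_US    # output depends on how the system defines en_US
```

On a machine showing the behavior described above, the second call would be expected to list B through Z; if en_US is not generated, the result is the same as for C.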

All systems tested do have the en_US locale generated. I also tried varying the locale: on the machines where B is matched above, the same happens in every available locale (mostly Latin-based: {en_{AU,CA,GB,IE,US},fr_FR,it_IT,es_ES,de_DE}{iso8859-1,iso8859-15,utf-8}, also Chinese locales) except Japanese (in any available encoding) and C/POSIX.

What do character ranges mean in regular expressions, when you go beyond ASCII? Why is there a difference between some Debian installations on the one hand, and other Debian installations and Ubuntu on the other? How do other systems behave? Who's right, and against which component should a bug be reported?

(Note that I'm specifically asking about the behavior of character ranges such as [a-z] in en_US locales, primarily on GNU libc-based systems. I'm not asking how to match lowercase letters or ASCII lowercase letters.)

On two Debian machines, one where B is in [a-z] and one where it isn't, the output of LC_COLLATE=en_US locale -k LC_COLLATE is

collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=1
collate-codeset="ISO-8859-1"

and the output of LC_COLLATE=en_US.utf8 locale -k LC_COLLATE is

collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2039
collate-codeset="UTF-8"

The results on my systems (PLD-Linux) are frankly disconcerting. It seems most non-English collations include capital variants in character ranges, but at least for me en_US does not. The English collation set, meanwhile, includes SOME characters from other alphabets but not a very complete set, so [a-z] includes much more than 26 characters, but not the full Turkish alphabet for example. — Caleb, Jul 02 '11 at 11:52
Doesn't reproduce on a Debian Lenny instance I've had handy. Didn't check if en_US is generated, though. — alex, Jul 02 '11 at 13:35
@alex If the locale isn't generated, the C locale is used as a fallback, and its collation order is straight byte values, so B won't be matched. Test in a locale that appears in the output of locale -a. — Gilles 'SO- stop being evil', Jul 02 '11 at 13:43
@Gilles: locale -a shows C POSIX en_US.utf8. Tried with each of them, and it's all the same: no output from grep. — alex, Jul 02 '11 at 20:02
Same here: Cannot reproduce the issue on Debian Lenny, even though en_US.utf8 shows in locale -a. — Heinzi, Jul 04 '11 at 06:32
Oddly, I can reproduce this on several Debian machines, some of which I don't have root permissions on (so it's not something weird I do as an admin). — Gilles 'SO- stop being evil', Jul 04 '11 at 06:45
Note that en_US is NOT the same as en_US.utf8, and typically means en_US.iso-8859-1, depending on exactly what you have installed. If en_US (with no suffix) doesn't appear in the output of locale -a you don't actually have this locale. What does LC_COLLATE=en_US locale -k LC_COLLATE show? — Neil Mayhew, Jul 04 '11 at 13:26
This has since turned up in a practical rather than theoretical question here: Why are capital letters included in a range of lower-case letters in an awk regex? — Caleb, Aug 24 '11 at 21:22
There are tons of (rejected) bug submissions regarding this supposed problem on Debian grep bug page (see http://bugs.debian.org/cgi-bin/pkgreport.cgi?archive=both;package=grep). — enzotib, Aug 25 '11 at 07:37
A chat conversation with Stéphane Chazelas: http://chat.stackexchange.com/rooms/26/conversation/gilles-and-sch-on-http-unix-stackexchange-com-questions-15980-does-should-lc-co — Gilles 'SO- stop being evil', Sep 02 '15 at 23:40
Could you provide the output of printf '%s' $(printf '%s\n' {a..z} {A..Z} | sort); echo in both systems? — , May 15 '18 at 22:24
@isaac Unfortunately, 7 years later, I don't seem to have access to any problematic system. They've all been upgraded or decommissioned. — Gilles 'SO- stop being evil', May 15 '18 at 22:47
However @Gilles'SO-stopbeingevil' the question in the title (Does (should) LC_COLLATE affect character ranges?) is still relevant and useful if correctly answered 8 years later. — , Dec 19 '19 at 18:19
I'm voting to close this question as off-topic because there is no way to test any answer. The OP has declared: "Unfortunately, 7 years later, I don't seem to have access to any problematic system. They've all been upgraded or decommissioned." There is no way to test or confirm any answer, and anything said would be "opinion-based". — , Apr 05 '20 at 09:54
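Several of the comments above turn on whether the locale being tested actually exists on the machine. A minimal sketch of that check, assuming the locale utility is installed; the helper name have_locale is hypothetical.

```shell
# have_locale NAME
# Succeed if NAME appears verbatim as a line in the output of `locale -a`.
# The helper name is made up for this example.
have_locale() {
    locale -a 2>/dev/null | grep -Fqx -e "$1"
}

for loc in en_US en_US.utf8 C; do
    if have_locale "$loc"; then
        echo "$loc: listed by locale -a"
    else
        echo "$loc: not listed, so programs fall back to the C locale"
    fi
done
```

The match is exact on purpose: en_US and en_US.utf8 are different locales, so a prefix match would hide the distinction raised in the comments.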

Neil Mayhew · Answer 1 · 2018-02-14T17:59:03.647

5

If you are using anything other than the C locale, you shouldn't be using ranges like [a-z], since these are locale-dependent and don't always give the results you would expect. As well as the case issue you've already encountered, some locales treat characters with diacritics (e.g. á) the same as the base character (i.e. a).

Instead, use a named character class:

echo B | grep '[[:lower:]]'

This will always give the correct result for the locale. However, you need to choose the locale to reflect the meaning of both your input text and the test you are trying to apply.

For example, if you need to find a particular byte value, use the C locale, which is always available:

echo B | LANG=C grep '[a-z]'

If this doesn't work as expected, it really is a bug.
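To make the contrast concrete, the range and the named class can be checked side by side in the C locale. This is a rough sketch, assuming a POSIX shell and grep; the helper name matches is hypothetical, and it uses LC_ALL=C rather than LANG=C so that no other locale variable in the environment can take precedence.

```shell
# matches TEXT PATTERN
# Succeed if PATTERN matches TEXT in the C locale (byte-value ordering).
# The helper name is made up for this example.
matches() {
    printf '%s\n' "$1" | LC_ALL=C grep -q -e "$2"
}

matches b '[a-z]'       && echo "b is in [a-z]"
matches B '[a-z]'       || echo "B is not in [a-z]"
matches B '[[:lower:]]' || echo "B is not in [[:lower:]]"
```

All three messages should be printed on any conforming system; a different result in the C locale would be the bug case described above.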

edited Feb 14 '18 at 17:59

answered Jul 03 '11 at 22:27

Neil Mayhew

755

I know that, it isn't what I asked. I'm specifically asking about what an explicit range means, and why different distributions (even with GNU libc and GNU grep) have different behaviors. (Downvoted because even though what you say is correct, it's irrelevant.) – Gilles 'SO- stop being evil' Jul 03 '11 at 22:47
2

My point is that the meaning of an explicit range is locale-dependent, and different systems are not required to define their locales the same way, so this is not a bug. Technically, you are abusing the system, so you shouldn't be surprised at getting "undefined" behaviour. Also, several people have commented that they can't reproduce the behaviour on their Debian systems, so there seems to be something unusual about your system(s). – Neil Mayhew Jul 04 '11 at 13:17
1

I know that the behavior of ranges depends on the locales. I'm asking how, and surprised that different systems using Glibc (and, it turns out, even different installations of the same Debian release) have different behaviors. I've added the output of locale -k to my question; it's identical on two Debian machines, one where B is in the range and one where it isn't. BTW I'm not root on either machine (so it's not something peculiar that I do as an admin). – Gilles 'SO- stop being evil' Jul 04 '11 at 19:46
echo "Baü" | LC_COLLATE=C grep -o '[[:lower:]]' returns a AND ü while echo "Baü" | LC_COLLATE=C grep -o '[a-z]' returns only a. In my eyes, "lower" is not really what the OP wanted – Daniel Alder Jan 24 '18 at 12:47
1

My original point still stands, though: don't use ranges unless you're in the C locale. I believe this is relevant to the OP, who was looking to report a bug. If you're not in the C locale, the results of using ranges are highly unpredictable and therefore can't ever be considered a bug. On the other hand, if you need to find a particular byte value, just use the C locale. My secondary point was that if you really do want to search for lowercase letters in a locale, use a character class. Even though the OP may not have been looking for this, others might if they find this question. – Neil Mayhew Feb 14 '18 at 17:54

score 1 · Answer 2 · answered Jul 04 '11 at 20:57

1

Ranges in regular expressions should observe the collation setting. Here is the relevant standard: http://pubs.opengroup.org/onlinepubs/007908799/xbd/re.html (look for "range expressions"). So echo B | LC_COLLATE=en_US grep '[a-z]' should output B given a sensible definition of the respective locale. I can't explain why this sometimes doesn't work for you, but I would be very surprised if I encountered this on a non-ancient system that is properly installed and configured.
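A rough cross-check of this is to look at where sort places B relative to a and z under the same LC_COLLATE setting. The sketch below assumes a POSIX shell; the helper name collation_order is hypothetical. Treat it as a heuristic only: a regex engine may consult a different collation table than the one sort uses, so the two need not agree exactly.

```shell
# collation_order LOCALE
# Print the letters a, B and z on one line, sorted according to LOCALE's
# collation rules. The helper name is made up for this example.
collation_order() {
    # An empty LC_ALL is treated as unset, so LC_COLLATE takes effect.
    printf '%s\n' a B z | LC_ALL= LC_COLLATE="$1" sort | paste -s -d ' ' -
}

collation_order C        # prints: B a z
collation_order en_US    # order depends on the system's en_US definition
```

If the second call places B between a and z, a collation-based range a-z would be expected to include B on that system.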

answered Jul 04 '11 at 20:57

Peter Eisentraut

2,172

2

echo B | LC_COLLATE=en_US.utf8 grep '[a-z]'
Doesn't print anything on Ubuntu 12.04 with grep 2.10.
Doesn't print anything on Centos 6.5 with grep 2.6.3.
Does work on Debian 6.0.8 with grep 2.6.3.
– Ian D. Allen Jan 27 '14 at 06:20
1

Note when testing this, you may need to unset LC_ALL first so it's not overriding LC_COLLATE. – SpinUp __ A Davis Jul 13 '22 at 17:49
None of my non-ancient systems (Debian, Ubuntu, macOS) output anything for echo B | LC_COLLATE=en_US grep '[a-z]'. – SpinUp __ A Davis Jul 13 '22 at 18:06
