Collation order through LC_COLLATE defines not only the sort order of individual characters, but also the meaning of character ranges. Or does it? Consider the following snippet:
unset LANGUAGE LC_ALL
echo B | LC_COLLATE=en_US grep '[a-z]'
Intuitively, B isn't in [a-z], so this shouldn't output anything. That's what happens on Ubuntu 8.04 or 10.04. But on some machines running Debian lenny or squeeze, B is found, because the range a-z includes everything that's between a and z in the collation order, including the capital letters B through Z.
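To make the comparison easier to repeat, here is a minimal sketch that wraps the test in a function. The name matches_range is made up for illustration, and the sketch assumes a POSIX shell and a grep that honors LC_COLLATE. Only the C locale result is predictable; the en_US result is exactly what varies between the machines described above.

```shell
# Hypothetical helper (not part of any tool): report whether character $2
# matches bracket expression $3 when LC_COLLATE is set to locale $1.
matches_range() {
    # An empty LC_ALL is treated as unset, so LC_COLLATE takes effect.
    if printf '%s\n' "$2" | LC_ALL= LC_COLLATE="$1" grep -q -- "$3"; then
        echo "match"
    else
        echo "no match"
    fi
}

# In the C locale ranges follow byte values, so B is outside [a-z].
matches_range C B '[a-z]'    # prints "no match"
```

Running matches_range en_US B '[a-z]' on each machine should then reproduce the discrepancy, provided the en_US locale is generated there.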
All systems tested do have the en_US locale generated. I also tried varying the locale: on the machines where B is matched above, the same happens in every available locale (mostly Latin-based: {en_{AU,CA,GB,IE,US},fr_FR,it_IT,es_ES,de_DE}{iso8859-1,iso8859-15,utf-8}, also Chinese locales) except Japanese (in any available encoding) and C/POSIX.
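The per-locale check can be scripted roughly as follows. This is only a sketch: range_report is a hypothetical name, and it reads locale names from standard input so that the output of locale -a can be piped into it.

```shell
# Hypothetical helper: for each locale name read from stdin, report whether
# B matches [a-z] with LC_COLLATE set to that locale.
range_report() {
    while IFS= read -r loc; do
        [ -n "$loc" ] || continue    # skip empty lines
        # An empty LC_ALL is treated as unset, so LC_COLLATE takes effect.
        if printf 'B\n' | LC_ALL= LC_COLLATE="$loc" grep -q '[a-z]' 2>/dev/null; then
            printf '%s: B matches [a-z]\n' "$loc"
        else
            printf '%s: B does not match [a-z]\n' "$loc"
        fi
    done
}

# Check two locales that exist on every POSIX system.
printf 'C\nPOSIX\n' | range_report
```

To cover everything installed on a machine, use locale -a | range_report. A locale name that is not actually generated silently behaves like C, so it would be reported as not matching.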
What do character ranges mean in regular expressions, when you go beyond ASCII? Why is there a difference between some Debian installations on the one hand, and other Debian installations and Ubuntu on the other? How do other systems behave? Who's right, and against whom should a bug be reported?
(Note that I'm specifically asking about the behavior of character ranges such as [a-z] in en_US locales, primarily on GNU libc-based systems. I'm not asking how to match lowercase letters or ASCII lowercase letters.)
On two Debian machines, one where B is in [a-z] and one where it isn't, the output of LC_COLLATE=en_US locale -k LC_COLLATE is
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=1
collate-codeset="ISO-8859-1"
and the output of LC_COLLATE=en_US.utf8 locale -k LC_COLLATE is
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2039
collate-codeset="UTF-8"
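These keywords do not show the collation order itself. One way to look at it more directly, as a rough sketch, is to sort a few letters under the locale in question. The function name collation_order and the six sample letters are made up for illustration. Note that this shows the order sort uses, which may not be the same table the regex engine consults for ranges.

```shell
# Hypothetical helper: print a few letters in the order that sort gives them
# when LC_COLLATE is set to locale $1.
collation_order() {
    for c in a b c A B C; do
        printf '%s\n' "$c"
    done | LC_ALL= LC_COLLATE="$1" sort | tr -d '\n'
    echo    # terminate the line
}

# Byte order puts all uppercase letters before all lowercase ones.
collation_order C    # prints "ABCabc"
```

Running collation_order en_US on both machines and comparing the results might show whether their collation data differ.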
[a-z] includes much more than 26 characters, but not the full Turkish alphabet for example. – Caleb Jul 02 '11 at 11:52
en_US is generated, though. – alex Jul 02 '11 at 13:35
C locale is used as a fallback, and its collation order is straight byte values, so B won't be matched. Test in a locale that appears in the output of locale -a. – Gilles 'SO- stop being evil' Jul 02 '11 at 13:43
locale -a shows C POSIX en_US.utf8. Tried with each of them, and it's all the same: no output from grep. – alex Jul 02 '11 at 20:02
en_US.utf8 shows in locale -a. – Heinzi Jul 04 '11 at 06:32
grep bug page (see http://bugs.debian.org/cgi-bin/pkgreport.cgi?archive=both;package=grep). – enzotib Aug 25 '11 at 07:37
printf '%s' $(printf '%s\n' {a..z} {A..Z} | sort); echo in both systems? – May 15 '18 at 22:24
There is no way to test/confirm any answer and anything said would be "opinion-based". – Apr 05 '20 at 09:54