Why are capital letters included in a range of lower-case letters in an awk regex?

Question

$ echo ABC | awk '$0 ~ /^[a-b]/'
ABC
$ echo ABC | awk '$0 ~ /^[a-a]/'
$ echo ABC | awk '$0 ~ /^a/'
$

As you can see, /[a-b]/ matches A, but /[a-a]/ and /a/ do not. Why?

See Does (should) LC_COLLATE affect character ranges? for more (unresolved) info on this topic. — Caleb, Aug 24 '11 at 21:07
This appears to be more than just a simple(?) LC_COLLATE issue, because using some non-C values for LC_COLLATE produces different results, depending on which utility is used. eg. 'sed' and 'grep' give different results to 'awk' when using LC_COLLATE=en_AU.UTF-8 or en_US.UTF-8 ... sed and grep manage to resolve the case issue, and only lower-case is printed (using the same values as above) — Peter.O, Aug 25 '11 at 09:38
At least in gawk (GNU Awk) this has been fixed ([a-z] matches only lowercase letters) since version 4.0: https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html — Piotr Jurkiewicz, May 30 '16 at 06:08

score 8 · Accepted Answer · edited Dec 30 '19 at 21:06

It is a "locale" problem, I think.

In my locale, it_IT, the following snippet

if [[ a < A ]]; then
  echo "a < A"
elif [[ a > A ]]; then
  echo "a > A"
else
  echo "a = A"
fi

if [[ b < A ]]; then
  echo "b < A"
elif [[ b > A ]]; then
  echo "b > A"
else
  echo "b = A"
fi

shows

a < A
b > A

so A (surprisingly) sorts between a and b, and therefore falls inside the range [a-b].

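Try forcing the C collating order; the command below should then print nothing (unless LC_ALL is set, which overrides LC_COLLATE):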

echo ABC | LC_COLLATE=C awk '$0 ~ /^[a-b]/'
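
As a minimal sketch of the suggestion above (not part of the original test, and assuming a POSIX awk on the PATH), the override can be scoped to a single awk invocation. LC_ALL is used here instead of LC_COLLATE because it takes precedence over every other LC_* variable; the variable names are only illustrative. The bracket class [[:lower:]] is shown as an alternative that asks for lower-case letters directly; some older awk implementations may not support it.

```shell
# In the C locale a range is evaluated by byte value, so [a-b] is just 'a' and 'b'.
upper_in_c=$(echo ABC | LC_ALL=C awk '$0 ~ /^[a-b]/')
lower_in_c=$(echo abc | LC_ALL=C awk '$0 ~ /^[a-b]/')

# A POSIX character class asks for "lower-case letters" directly,
# instead of "whatever collates between a and b".
upper_class=$(echo ABC | awk '$0 ~ /^[[:lower:]]/')

printf 'ABC, range in C locale: "%s"\n' "$upper_in_c"   # empty pair of quotes
printf 'abc, range in C locale: "%s"\n' "$lower_in_c"   # "abc"
printf 'ABC, [[:lower:]] class: "%s"\n' "$upper_class"  # empty pair of quotes
```

Setting the variable on the command line keeps the change local to that one awk process, so the rest of the script still sorts and formats according to the user's locale.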

Edit

The following command shows the collating order in your locale:

echo $(LC_COLLATE=C printf '%s\n' {A..z} | sort)

the output on my machine is

` ^ _ [ ] a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S t T u U v V w W x X y Y z Z

(I cannot understand from bash's manual page whether sequence expressions are expanded in locale collating order or not; it seems not.)
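
To contrast the two orders side by side, the sketch below sorts the same four letters twice, once under the C locale and once under whatever locale is active. It assumes a POSIX sort and tr; the variable names are only illustrative, and the locale is set on sort itself because that is the command doing the collating.

```shell
# Byte order: every upper-case letter sorts before every lower-case one.
c_order=$(printf '%s\n' a A b B | LC_ALL=C sort | tr '\n' ' ')

# Order in the current locale: the result depends on your settings.
locale_order=$(printf '%s\n' a A b B | sort | tr '\n' ' ')

echo "C locale:       $c_order"        # A B a b
echo "current locale: $locale_order"
```

If the second line interleaves the cases (for example a A b B), then A lies between a and b in that locale, which is the situation described above.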

answered Aug 24 '11 at 14:04

enzotib

+1, but you just need LC_COLLATE, not LC_ALL, for this particular case. – mattdm Aug 24 '11 at 15:40
@mattdm: you're right, I'm lazy – enzotib Aug 24 '11 at 15:52
@enzotib: I am puzzled... This idea seems to mean that every time I want to set a range /[a-x]/, I must use LC_COLLATE... What on earth has a collating sequence got to do with identifying what is upper case vs. lower case? ... I can't see how a collating sequence dictates what is upper case and what isn't... I keep grappling with these locale issues, and am slowly making headway, but this one has me stumped. – Peter.O Aug 24 '11 at 18:14
@fred: frequently, when using sort, join or the like, I start my scripts with export LC_COLLATE=C. Now I also have to start scripts that use awk this way :) – enzotib Aug 24 '11 at 18:44
I've followed the link in Caleb's comment (worth reading). I can now see a bit of light here, but nonetheless it still seems that something is amiss, or at least 'not as expected'. Relying on a collation order to determine case simply doesn't make sense. I think this excerpt from The Single UNIX Spec (old) is notable: "Portable applications must not use range expressions, even though all implementations support them.". In this context, does this suggest that a range expression has, by definition, side-stepped the notion of case? It deals in ranges, not case. QED! (I think :) – Peter.O Aug 24 '11 at 22:46
@enzotib Why did you use LC_COLLATE=C with your printf command in the edit? – rozcietrzewiacz Oct 24 '11 at 06:23
@rozcietrzewiacz: it is indeed inessential; it is there only to be sure that printf interprets the sequence {A..z} in a way independent of the particular locale (as the sentence that follows explains in some way: "cannot understand from bash's manual page if sequence expressions are expanded in locale collating order or not; it seems not"). – enzotib Oct 24 '11 at 07:46
I see. And I also see they are not (but I share your worries in that matter). And... I don't agree with @mattdm's first comment, because if LC_ALL was set in the environment, then changing LC_COLLATE alone would have no effect. – rozcietrzewiacz Oct 24 '11 at 09:03
The sequence eval order doesn't matter in this case since you sort after the sequence is generated. However, your example would work more accurately with LC_COLLATE next to sort: "echo $(printf '%s\n' {A..z} | LC_COLLATE='C' sort)" ... which would contrast correctly with the default case "echo $(printf '%s\n' {A..z} | LC_COLLATE='' sort)". The original syntax above never actually applies the locally modified LC_COLLATE to the sort command [of course, all bets are off if LC_ALL was set somewhere...] – MartyMacGyver Jun 01 '12 at 19:22
@Peter.O [a-z] doesn't have anything to do with case. It's the list of characters between a and z inclusive, with "between" being defined as sort order so Swedes get ä after the other letters, as per their alphabet, and Germans get it mixed in with a, as normal for German. Most non-programmers would put B between a and z, for example, apples Berlin zeppelin, not apples zeppelin Berlin, so [a-z] includes B in most locales. – prosfilaes Jun 01 '14 at 21:12
Surprisingly sort and [a-z] may yield different results. – Kamil Maciorowski Nov 21 '20 at 22:08
