
In The Linux Command Line, William Shotts claims that character ranges can be problematic. See the relevant excerpt below (emphasis mine).

Character Ranges

If you are coming from another Unix-like environment or have been reading some other books on this subject, you may have encountered the [A-Z] and [a-z] character range notations. These are traditional Unix notations and worked in older versions of Linux as well. They can still work, but you have to be careful with them because they will not produce the expected results unless properly configured. For now, you should avoid using them and use character classes instead.

What is he talking about in the last couple of sentences? What do the POSIX standards say about this?

Git Gud
  • 177

1 Answer


That most likely refers to locales whose collation order interleaves uppercase and lowercase letters, instead of sorting all of one case before the other:

$ echo "$LANG"
en_US.UTF-8
$ touch a A z Z
$ ls
A  Z  a  z
$ bash -c 'echo [a-z]'
a A z

However, the appropriate character class works:

$ bash -c 'echo [[:lower:]]'
a z

But it might also match more than just a to z:

$ LANG=fi_FI.UTF-8
$ touch ä Ä ö Ö
$ bash -c 'echo [[:lower:]]'
a z ä ö

If you want to avoid that, and only match the English lowercase letters a to z, Bash in particular has an option to interpret the ranges in the ASCII order:

$ bash -c 'shopt -s globasciiranges; echo [a-z]'
a z

And you can always force the collating order of the default (C) locale:

$ LC_COLLATE=C bash -c 'echo [a-z]'
a z

As for what POSIX says, it seems to me that ranges in bracket expressions are left unspecified in locales other than the default POSIX one. The pattern matching description refers to the regex description of bracket expressions, which says:

In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched.
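
Given that, one way to match exactly the English lowercase letters without relying on either ranges or the locale is to list the characters explicitly in the bracket expression. A minimal sketch, where the function name is_ascii_lower is hypothetical and chosen only for this example:

```shell
#!/bin/sh
# Sketch: test whether an argument is a single ASCII lowercase letter.
# The bracket expression lists every letter, so no range (and therefore
# no collation order) is involved. is_ascii_lower is an illustrative name.
is_ascii_lower() {
    case $1 in
        [abcdefghijklmnopqrstuvwxyz]) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: classify a few sample arguments.
for candidate in a z A 5 ä ab; do
    if is_ascii_lower "$candidate"; then
        echo "$candidate: ASCII lowercase letter"
    else
        echo "$candidate: something else"
    fi
done
```

It is more verbose than [a-z], but the same list can be used in a filename glob and does not need shopt or a locale override.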

ilkkachu
  • 138,973
  • Thanks. I don't understand how bash -c 'echo [a-z]' works. If I don't run the commands in the specific order that you do in your answer, it just echoes [a-z]. Can you shed some light on that? Also, is there a way to check the definition of each character class? – Git Gud Mar 10 '19 at 18:20
  • @GitGud [a-z] is a glob pattern, just like foo*; if you don't have any file starting with foo in the current dir, echo foo* will just echo foo*; if you don't have any file named a, b, c, etc in the current dir, echo [a-z] will echo just [a-z]. –  Mar 10 '19 at 18:29
  • Ah! Of course! Got it. Thanks @UncleBilly – Git Gud Mar 10 '19 at 18:31
  • @ilkkachu Just to make sure you didn't miss my second question: is there a way to check the definition of each character class? – Git Gud Mar 10 '19 at 22:24
  • @GitGud, I'm not really familiar with the internals of locales, so I can't say. – ilkkachu Mar 11 '19 at 12:49
  • Even worse, "unspecified behavior" means that it can change at any time. As of Ubuntu 20.04, the lowercase ranges in the en_US.UTF-8 locale behave just like in the POSIX or C locale, unlike in the above example, which outputs (some) uppercase. (The uppercase letter at the end of the range was always missed, most likely an implementation error.) – ubfan1 Nov 02 '20 at 21:59
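
The glob behavior explained in the comments above (a pattern with no matching file is passed through unchanged) can be reproduced in an empty directory. A minimal sketch, assuming Bash with its default globbing options; the scratch directory and variable names are illustrative only:

```shell
#!/usr/bin/env bash
# Sketch: an unmatched glob stays literal, a matched one is expanded.
# Assumes default Bash options (neither nullglob nor failglob set).
scratch=$(mktemp -d) || exit 1
cd "$scratch" || exit 1

# The directory is empty, so nothing matches and echo receives
# the pattern itself.
before=$(echo [a-z])

# Once a matching file exists, the shell replaces the pattern
# with the file name.
touch b
after=$(echo [a-z])

echo "before touch: $before"
echo "after touch:  $after"

# Clean up the scratch directory.
cd / && rm -rf "$scratch"
```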