46

In all shells I am aware of, rm [A-Z]* removes all files that start with an uppercase letter, but with bash this removes all files that start with a letter.

As this problem exists on Linux and Solaris with bash-3 and bash-4, it cannot be a bug caused by a buggy pattern matcher in libc or a misconfigured locale definition.

Is this strange and risky behavior intended, or is this just a bug that has gone unfixed for many years?

schily
  • 19,173
  • 3
    What does locale output? I cannot reproduce this (touch foo; echo [A-Z]* outputs the literal pattern, not "foo", in an otherwise empty directory). – chepner Sep 02 '15 at 17:36
  • 6
    Considering how many people have said it works for them, or have shown examples of how LC_COLLATE affects this, maybe you could edit your question to add a sample bash session which illustrates exactly the scenario you're asking about. Please include the bash version that you're using. – Kenster Sep 02 '15 at 19:15
  • If you had read all the text here, you would know what bash version I use and what I did, since I already posted the solution to my question. Let me repeat the solution: bash does not manage its own locale, so setting LC_COLLATE does not change anything until you start another bash process with the new environment. – schily Sep 02 '15 at 19:25
  • 1
    See also Does (should) LC_COLLATE affect character ranges? (but that question wasn't specifically about bash) – Gilles 'SO- stop being evil' Sep 02 '15 at 21:11
  • "setting LC_COLLATE does not change anything until you start another bash process with the new environment." That doesn't match the behavior I see with bash-4 on Solaris. It is changing the behavior in the running shell. # echo [A-Z]* ; export LC_COLLATE=C ; echo [A-Z]* A b B z Z A B Z – BowlOfRed Sep 02 '15 at 21:25
  • Which bash4 on which Solaris? – schily Sep 02 '15 at 21:31
  • 4.1.11(2) on SunOS 5.11 (x86). Same behavior on bash 3.2.25(1) on CentOS 5.7 – BowlOfRed Sep 02 '15 at 21:35
  • There must be something really strange, as I recently ran bash under truss -ulibc::setlocale and the first try to change LC_COLLATE did not call setlocale(); the second try did. Since then, all tests work as they should, even if I call another new bash with the previous LC_* setup that originally included lower case characters in [A-Z]. I would usually believe that I did something wrong, but then starting another bash could not have made a difference, and Stéphane also observed that setlocale was not called, so this cannot be based on wrong usage. – schily Sep 02 '15 at 21:56

7 Answers

73

Note that when using range expressions like [a-z], letters of the other case may be included, depending on the setting of LC_COLLATE.

LC_COLLATE is a variable which determines the collation order used when sorting the results of pathname expansion, and determines the behavior of range expressions, equivalence classes, and collating sequences within pathname expansion and pattern matching.


Consider the following:

$ touch a A b B c C x X y Y z Z
$ ls
a  A  b  B  c  C  x  X  y  Y  z  Z
$ echo [a-z] # Note the missing uppercase "Z"
a A b B c C x X y Y z
$ echo [A-Z] # Note the missing lowercase "a"
A b B c C x X y Y z Z

Notice when the command echo [a-z] is called, the expected output would be all files with lower case characters. Also, with echo [A-Z], files with uppercase characters would be expected.


Standard collations with locales such as en_US have the following order:

aAbBcC...xXyYzZ
  • Between a and z (in [a-z]) are ALL uppercase letters, except for Z.
  • Between A and Z (in [A-Z]) are ALL lowercase letters, except for a.

See:

     aAbBcC[...]xXyYzZ
     |              |
from a      to      z

     aAbBcC[...]xXyYzZ
      |              |
from  A     to       Z

If you change the LC_COLLATE variable to C it looks as expected:

$ export LC_COLLATE=C
$ echo [a-z]
a b c x y z
$ echo [A-Z]
A B C X Y Z

So, it's not a bug, it's a collation issue.


Instead of range expressions you can use POSIX defined character classes, such as upper or lower. They work also with different LC_COLLATE configurations and even with accented characters:

$ echo [[:lower:]]
a b c x y z à è é
$ echo [[:upper:]]
A B C X Y Z
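
A minimal sketch of the character-class approach applied to file names, assuming a POSIX-style shell. The function name list_upper is made up for illustration. It lists the entries of a directory whose names start with an uppercase letter, without relying on how [A-Z] collates:

```shell
# list_upper DIR (hypothetical helper): print the entries of DIR whose
# names start with an uppercase letter, one per line.
list_upper() {
    (
        cd "$1" || exit 1
        for f in [[:upper:]]*; do
            # If nothing matched, the pattern stays literal; skip it.
            [ -e "$f" ] || continue
            printf '%s\n' "$f"
        done
    )
}
```

Running list_upper . before an rm gives a preview of what [[:upper:]]* would match in the current directory.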
psmears
  • 465
  • 3
  • 8
chaos
  • 48,171
  • If this behavior were controllable by LC_* environment variables, I would not have asked. I work in the POSIX standard committee and I know of collating problems with e.g. tr, so this is what I checked first. – schily Sep 02 '15 at 16:48
  • @schily I cannot reproduce your problem with either an old bash-3 or a bash-4; both are controllable via LC_COLLATE, which is also documented in the manual. – chaos Sep 02 '15 at 17:24
  • Sorry, I cannot reproduce what you believe, but see my own answer...From the ideas in this discussion I discovered the reason for the problem. – schily Sep 02 '15 at 17:27
25

[A-Z] in bash matches all collating elements (characters, but they can also be sequences of characters like Dsz in Hungarian locales) that sort after A and before Z. In your locale, c probably sorts between B and C.

$ printf '%s\n' A a á b B c C Ç z Z Ẑ | sort
a
A
á
b
B
c
C
Ç
z
Z
Ẑ

So c or z would be matched by [A-Z], but not Ẑ or a.

$ printf '%s\n' A a á b B c C Ç z Z Ẑ |
pipe>  bash -c 'while IFS= read -r x; do case $x in [A-Z]) echo "$x"; esac; done'
A
á
b
B
c
C
Ç
z
Z

In the C locale, the order would be:

$ printf '%s\n' A a á b B c C Ç z Z Ẑ | LC_COLLATE=C sort
A
B
C
Z
a
b
c
z
Ç
á
Ẑ

So [A-Z] would match A, B, C, Z, but not Ç and still not Ẑ.
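
The same case-based test can be wrapped in a small helper to probe single characters under a chosen locale. This is only a sketch: in_az_range is a hypothetical name, and it assumes bash is installed and that the locale you pass exists on the system. Only the C locale is used here, since the result for other locales depends on the system (and on whether the bash build has globasciiranges enabled):

```shell
# in_az_range LOCALE STRING (hypothetical helper): succeeds if STRING
# matches the pattern [A-Z] in a bash started with LC_ALL=LOCALE.
in_az_range() {
    LC_ALL=$1 bash -c 'case $1 in [A-Z]) exit 0 ;; *) exit 1 ;; esac' bash "$2"
}

for ch in A b Z; do
    if in_az_range C "$ch"; then
        echo "$ch: in [A-Z]"
    else
        echo "$ch: not in [A-Z]"
    fi
done
```

In the C locale this reports A and Z as inside the range and b as outside. Replacing C with your own locale name (for example the value of LANG) shows what your interactive shell would match.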

If you want to match on upper-case letters (in any script), you can use [[:upper:]] instead. There's no builtin way in bash to only match uppercase letters in the latin script (except by listing them individually).

If you want to match the A to Z English letters without diacritics, you can either use [A-Z] or [[:upper:]] but in the C locale (assuming the data is not encoded in character sets like BIG5 or GB18030 which has several characters whose encoding contains the encoding of those letters) or list them individually ([ABCDEFGHIJKLMNOPQRSTUVWXYZ]).

Note that there is some variation between shells.

For zsh, bash -O globasciiranges (strangely named option introduced in bash-4.3), schily-sh and yash, [A-Z] matches on the characters whose code point is between that of A and that of Z, so would be equivalent to the behaviour of bash in the C locale.

For ash, mksh and ancient shells, same as zsh above but limited to single-byte charsets. That is, in a UTF-8 locale for instance, [É-Ź] would not match on Ó, but since that's [<c3><89>-<c5><b9>], that would match on byte values 0x89 to 0xc5!
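
To see where those byte values come from, here is a small sketch that dumps the UTF-8 encoding of a character. It assumes the script and the terminal both use UTF-8; utf8_bytes is a hypothetical helper name:

```shell
# utf8_bytes STRING (hypothetical helper): print the bytes of STRING
# as one lowercase hex string.
utf8_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

utf8_bytes 'É'   # c389
utf8_bytes 'Ź'   # c5b9
utf8_bytes 'Ó'   # c393
```

A byte-oriented matcher therefore reads [É-Ź] as the bytes 0xc3, 0x89 to 0xc5, and 0xb9, and since Ó is two bytes (c3 93) it cannot match as a single character.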

ksh93 behaves like bash except that it treats as special cases ranges whose ends both start with lowercase letters or uppercase letters. In that case, it only matches on collating elements that sort between those ends, but that are (or their first character for multi-character collating elements) also lowercase (or uppercase respectively). So [A-Z] there would match on É, but not on e as e does sort between A and Z but is not uppercase like A and Z.

For fnmatch() patterns (as in find -name '[A-Z]') or system regular expressions (as in grep '[A-Z]'), it depends on the system and locale. For instance, on a GNU system here, [A-Z] doesn't match on x in the en_GB.UTF-8 locale, but it does in the th_TH.UTF-8 one. It's unclear to me what information it uses to determine that, but it is apparently based on a lookup table derived from LC_COLLATE locale data.

All behaviours are allowed by POSIX as POSIX leaves the behaviour of ranges unspecified in locales other than the C locale. Now we can argue over the benefits of each approach.

bash's approach makes a lot of sense as with [C-G], we want the characters in between C and G. And using the user's sort order for what determines what's in-between is the most logical approach.

Now, the problem is that it breaks the expectations of a lot of people, especially those used to the traditional behaviour of pre-Unicode, even pre-internationalisation days. While for a normal user it may make sense that [C-I] includes h, as the letter h is between C and I, and that [A-g] does not include Z, it's a different matter for people who have dealt with ASCII only for decades.

That bash behaviour is also different from the [A-Z] range matching in other GNU tools like in GNU regular expressions (as in grep/sed...) or fnmatch() as in find -name.

It also means that what [A-Z] matches varies with the environment, with the OS and with the version of the OS. The fact that [A-Z] matches Á but not Ź is also suboptimal.

For zsh/yash, we use a different sorting order. Instead of relying on the user's notion of character order, we use the code point values of the characters. That has the benefit of being easy to understand, but from a practical point of view, outside of ASCII, it is not very useful. [A-Z] matches the 26 US-English upper-case letters, [0-9] matches decimal digits. There are code points in Unicode that follow the order of some alphabets, but that's not generalised and can't be generalised, as different people using the same script do not necessarily agree on the order of letters anyway.

For traditional shells and mksh, dash, it's broken (now that most people use multi-byte characters), but primarily because they don't have multi-byte support yet. Adding multi-byte support to shells like bash and zsh has been a huge effort and is still ongoing. yash (a Japanese shell) was designed with multi-byte support from the start.

ksh93's approach has the benefit of being consistent with the system's regular expressions or fnmatch() (or at least appears to be on GNU systems). There, it doesn't break some people's expectations, as [A-Z] doesn't include lower case letters; [A-Z] includes É (and Á, but not Ź). It's not consistent with sort or generally strcoll() order.

  • 1
    If you were right, this could be controlled via LC_* variables. There seems to be a different reason. – schily Sep 02 '15 at 16:50
  • posh also behaves like zsh and yash. – cuonglm Sep 02 '15 at 16:59
  • 1
    @cuonglm, more like mksh (both derived from pdksh). posh -c $'case Ó in [É-Ź]) echo yes; esac' returns nothing. – Stéphane Chazelas Sep 02 '15 at 17:03
  • BTW: My question was against bash, but your reply is related to sort. Did you try to check file name globbing with bash-3? – schily Sep 02 '15 at 17:04
  • 2
    @schily, I mention sort because bash globs are based on character sort order. I don't currently have access to such an old version of bash, but I can check later. Was it different then? – Stéphane Chazelas Sep 02 '15 at 17:07
  • 1
    Let me mention again: zsh, POSIX-ksh88, ksh93t+, Bourne Shell, all behave the same way as I expect. Bash is the only shell that behaves differently and bash is not controllable via the locale in this case. – schily Sep 02 '15 at 17:12
  • @Stéphane - What you get with the case statement in the Bourne Shell is expected behavior assuming that you use UTF-8. The Bourne Shell uses gmatch() for case statements and this supports wide characters but does not use strcoll() but a plain value compare for the range. 0xFF is perfectly inside the range you specified. – schily Sep 02 '15 at 22:45
  • 2
    @schily, note that \xFF there is the byte 0xFF, not the character U+00FF (ÿ itself encoded as 0xC3 0xBF). \xFF alone doesn't form a valid character so I can't see why it should be matched by [É-Ź]. – Stéphane Chazelas Sep 02 '15 at 22:51
  • @schily, having said that, it seems that zsh behaves the same. yash won't allow invalid character. – Stéphane Chazelas Sep 02 '15 at 22:56
  • @Stéphane - The way \xFF is handled depends on the shell internals. Given the fact that \x1234 is possible, this should explain that the shell has parts where characters are handled as wide characters and others where the shell uses multibyte characters. gmatch() (which handles case) exists in a place where the shell uses multibyte characters, but inside gmatch() everything is temporarily converted into wide characters for processing. – schily Sep 03 '15 at 10:19
  • @schily, in that case, the \xFF was expanded to the 0xFF byte by my shell (zsh) before passing to schily-sh. schily-sh internally wrongly identified it to U+00FF. You may want to have a look at the current discussion on the zsh ML. Note that ksh93 is a bit broken as well in that b=$'\xff' ksh -c $'[[ $b = [\uff] ]]' returns false but b=$'\xff' ksh -c $'[[ $b = [[:alpha:]] ]]' returns true. – Stéphane Chazelas Sep 03 '15 at 10:39
  • how characters are converted from a multi byte locale depends on mbtowc(). If there is a character that is officially an impossible multibyte value, mbtowc() returns -1 and the string converter advances by one and the output is still what the first wchar_t * parameter returns. – schily Sep 03 '15 at 10:59
  • @schily, yes the shell has to decide what to do with those bytes not forming parts of valid characters. What I'm saying is that treating them as if they were the characters whose code point has the same value as that byte value is not the best approach IMO. Hence me starting the discussion on the zsh ML (and on the Austin group 2 months ago). – Stéphane Chazelas Sep 03 '15 at 11:02
  • The only other way to handle this seems to drop such characters and this does not look like a better solution. – schily Sep 03 '15 at 11:09
  • @schily, there are other ways. See how zsh is now doing it. – Stéphane Chazelas Sep 09 '15 at 10:31
  • @schily Yes, Bash behavior is controllable via LC_* variables. It's just that they must be active in the running environment to work their magic. Start a new bash like this: LC_COLLATE="C" bash and try again echo [a-z]*. Or, more to the point, try: LC_COLLATE="C" bash -c 'echo [a-z]*'. –  Sep 24 '15 at 02:37
  • This was a little easier to see using printf '%s\n' {{0..99},{A..Z},{a..z}} | sort and printf '%s\n' {{0..99},{A..Z},{a..z}} | LANG=C sort and also helped me confirm my language setting is causing the collating behavior I'm seeing. – dragon788 May 22 '17 at 20:43
9

It's intended and documented in the bash documentation, in the pattern matching section. The range expression [X-Y] includes any characters that sort between X and Y, using the current locale's collating sequence and character set:

LC_ALL=en_US.utf8 bash -c 'case b in [A-Z]) echo yes; esac' 
yes

As you can see, b sorts between A and Z in the en_US.utf8 locale.

You have some choices to prevent this behavior:

# Setting LC_ALL or LC_COLLATE to C
LC_ALL=C bash -c 'echo [A-Z]*'

# Or using POSIX character class
LC_ALL=C bash -c 'echo [[:upper:]]*'

or enable globasciiranges (with bash 4.3 and above):

bash -O globasciiranges -c 'echo [A-Z]*'
cuonglm
  • 153,898
7

I observed this behavior on a new Amazon EC2 instance. Since the OP didn't offer an MCVE, I'll post one:

$ cd $(mktemp -d)
$ touch foo
$ echo [A-Z]*     # prepare for a surprise!
foo

$ echo $BASH_VERSION
4.1.2(1)-release
$ uname -a
Linux spinup-tmp12 3.14.27-25.47.amzn1.x86_64 #1 SMP Wed Dec 17 18:36:15 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

$ env | grep LC_  # no locale, let's set one
$ LC_ALL=C
$ echo [A-Z]*
[A-Z]*

$ unset LC_ALL    # ok, good. what if we go back to no locale?
$ echo [A-Z]*
foo

So, not having my LC_* set leads bash 4.1.2(1)-release on Linux to produce apparently odd behavior. I can reliably toggle the odd behavior by setting and unsetting the respective locale variables. Unsurprisingly, this behavior appears consistent through exporting:

$ export LC_ALL=C
$ bash
$ echo [A-Z]*
[A-Z]*
$ exit
$ echo $SHLVL
1
$ unset LC_ALL
$ bash
$ echo [A-Z]*
foo

While I'm seeing bash behave as Stéphane "Shellshock" Chazelas answered, I think the bash documentation on pattern matching is buggy:

For example, in the default C locale, ‘[a-dx-z]’ is equivalent to ‘[abcdxyz]’

I read that sentence (emphasis mine) as "if the relevant locale variables are not set, then bash will default to the C locale". Bash does not appear to be doing that. Instead it appears to be defaulting to a locale where the characters are sorted in dictionary order with diacritic folding:

$ echo [A-E]*
[A-E]*
$ echo [A-F]*
foo
$ touch "évocateur"
$ echo [A-F]*
foo évocateur

I think it'd be good for bash to document how it will behave when LC_* (specifically LC_CTYPE and LC_COLLATE) are undefined. But in the mean time, I'll share some wisdom:

... you have to be very careful with [character ranges] because they will not produce the expected results unless properly configured. For now, you should avoid using them and use character classes instead.

and

If you're really proper, and/or are scripting for a multi-locale environment, it's probably best to make sure you know what your locale variables are when you're matching files, or to be sure that you're coding in a completely generic way.


Update Based on @G-Man comment, let's look deeper into what's happening:

$ env | grep LANG
LANG=en_US.UTF-8

Ah, ha! That explains the collation seen earlier. Let's remove all the locale variables:

$ unset LANG LANGUAGE LC_ALL
$ env | grep -E 'LC_|LANG'
$ echo [A-Z]*
[A-Z]*

There we go. Now bash operates consistently with respect to documentation on this Linux system. If any of the locale variables are set (LANGUAGE, LANG, LC_COLLATE, LC_CTYPE, LC_ALL, etc.) then Bash uses those according to its manual. Otherwise, bash falls back to C.

The Wooledge bash FAQ has this to say:

On recent GNU systems, the variables are used in this order. If LANGUAGE is set, use that, unless LANG is set to C, in which case LANGUAGE is ignored. Also, some programs simply don't use LANGUAGE at all. Otherwise, if LC_ALL is set, use that. Otherwise, if the specific LC_* variable that covers this usage is set, use that. (For example, LC_MESSAGES covers error messages.) Otherwise, use LANG.

So the apparent problem, both in operation and documentation, can be explained by looking at the total sum of all locale driving variables.
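
For collation specifically, the precedence can be sketched as a small function. It follows the POSIX order LC_ALL, then LC_COLLATE, then LANG. The final fallback to C is an assumption that holds on glibc but is implementation-defined in general, and effective_collate is a hypothetical name. LANGUAGE is left out because, as far as I know, it only affects message translation on GNU systems:

```shell
# effective_collate (hypothetical helper): print the locale name that
# governs collation, following the order LC_ALL > LC_COLLATE > LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: the implementation-defined default is the C locale.
        printf '%s\n' C
    fi
}
```

The locale command reports the same thing for every category, so comparing its LC_COLLATE line with the output of this function is a quick sanity check.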

bishop
  • 3,209
  • If no LC_variable is present and bash does not behave as documented for the C locale, this is a bug. – schily Sep 04 '15 at 14:02
  • 1
    @bishop: (1) Typo: MVCE should be MCVE.  (2) If you want your example to be complete, you should add env | grep LANG or echo "$LANG". – G-Man Says 'Reinstate Monica' Sep 04 '15 at 21:59
  • @schily Further investigation convinced me there's no bug in the documentation or operation on this Linux system. – bishop Sep 05 '15 at 00:48
  • @G-Man Thanks! I forgot about LANG. With that hint, all is explained. – bishop Sep 05 '15 at 00:48
  • 1
    LANG was introduced around 1988 by Sun for the first localization attempts, before they discovered that a single variable is not sufficient. Today it is used as a fallback and LC_ALL is used as a forced override. – schily Sep 05 '15 at 09:52
3

Locale can change what characters are matched by [A-Z]. Use

(LC_ALL=C; rm [A-Z]*)

to eliminate the influence. (I used a subshell to localize the change).
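
A slightly more defensive sketch of the same idea, assuming you want to skip anything that is not a regular file and to cope with the case where nothing matches. remove_upper is a hypothetical name:

```shell
# remove_upper DIR (hypothetical helper): delete regular files in DIR
# whose names start with an ASCII uppercase letter.
remove_upper() {
    (
        cd "$1" || exit 1
        LC_ALL=C    # confined to this subshell
        for f in [A-Z]*; do
            # Skips directories and the literal pattern when nothing matched.
            [ -f "$f" ] || continue
            rm -- "$f"
        done
    )
}
```

Replacing rm -- with echo turns it into a dry run, which is worth doing once before trusting any glob with rm.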

choroba
  • 47,233
2

As has been already said, this is a "collating order" issue.

The range a-z may contain upper case letters in some locales:

     aAbBcC[...]xXyYzZ
     |              |
from a      to      z

The correct solution since bash 4.3 is to set the option globasciiranges:

shopt -s globasciiranges

to make bash act as if LC_COLLATE=C had been set for globbing ranges.
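
Since the option does not exist before bash 4.3, a script can try it and fall back to the C locale. A minimal sketch, wrapped in bash -c so it can be pasted into any shell. The function name ascii_range_check is made up, and the fallback uses LC_ALL=C, which also changes LC_CTYPE, so treat that as an assumption about what you want:

```shell
# ascii_range_check (hypothetical helper): report whether "b" matches
# [A-Z] once ASCII ranges have been requested.
ascii_range_check() {
    bash -c '
        # Try the shell option first; if shopt fails, switch locale instead.
        shopt -s globasciiranges 2>/dev/null || LC_ALL=C
        case b in
            [A-Z]) echo "b is in [A-Z]" ;;
            *)     echo "b is not in [A-Z]" ;;
        esac
    '
}

ascii_range_check
```

Either branch should leave b outside the range, whatever locale the calling shell uses.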

-7

It seems that I found the right answer to my own question:

Bash is buggy as it does not manage its own locale. So setting LC_* in a bash process is without effect in that shell process.

If you set the LC_COLLATE=C and then start another bash, the globbing works as expected in the new bash process.

schily
  • 19,173
  • 2
    Not in any of my bashes. – chaos Sep 02 '15 at 17:26
  • 2
    I don't repro this in any version of bash on my machine, it sounds like you didn't export it properly. – Chris Down Sep 02 '15 at 17:44
  • So you believe that something that is properly exported, so that it affects a new bash process is not properly exported? – schily Sep 02 '15 at 18:06
  • I would guess that the bug is in bash and not in Solaris. It is most unlikely that Solaris has a version with patched in bugs. In order to make a setlocale working, the right values need to be in environ while calling setlocale(). If you have access to the bash sources, you may like to check whether this is done properly in bash. How did you check whether setlocale() is called? Did you use truss -u or did you check whether related files are accessed? – schily Sep 02 '15 at 20:39
  • If you need a piece of code that is known to work, a proper setlocale setup is done in the Bourne Shell in name.c in the function dolocale(). – schily Sep 02 '15 at 20:42
  • Could you run your tests again? It suddenly started to work for me. Does it change behavior for you too? – schily Sep 02 '15 at 22:13
  • Sorry, I had messed up my test cases, On Solaris, I had LC_ALL in the environment and was only changing LC_COLLATE. env -i LC_ALL=en_GB.UTF-8 truss -t '' -u ::setlocale bash -c 'LC_COLLATE=en_GB.utf8; [[ c = [A-Z] ]] && echo A; LC_COLLATE=C; [[ c = [A-Z] ]] && echo B' instead of env -i LC_COLLATE=en_GB.UTF-8 truss -t '' -u ::setlocale bash -c 'LC_COLLATE=en_GB.utf8; [[ c = [A-Z] ]] && echo A; LC_COLLATE=C; [[ c = [A-Z] ]] && echo B'. So yes, it is working correctly on both Solaris and GNU – Stéphane Chazelas Sep 02 '15 at 22:16
  • Well I would be willing to believe that I did do something wrong as well, but then the case where starting another bash could not work. Or do you have an explanation for that case? – schily Sep 02 '15 at 22:23
  • Setting LC_COLLATE=C works for me (changes the behaviour inside a single bash, not even a subshell. (bash 4.3.30(1) on Ubuntu)). @Schily, in your earlier test, did you maybe have LC_ALL set but not exported? That would explain LC_COLLATE having no effect on the existing shell, and why starting a new process did work. – Peter Cordes Sep 03 '15 at 02:13
  • 4
    Solaris's handling of the environment is notoriously deficient, so I wouldn't be surprised if the "bug" in bash was the lack of a Solaris-specific workaround. – hobbs Sep 03 '15 at 03:01
  • 2
    @schily: Do you have a citation for where changing the LC_* variables within a shell is required to cause it to update its own locale state? I would think exactly the opposite. In particular for a shell executing a script, changing locale mid-way through parsing/execution of the script would not even have well-defined behavior, as the script is a text file and "text file" is only meaningful within the context of a single character encoding. – R.. GitHub STOP HELPING ICE Sep 03 '15 at 04:14
  • @Peter: the window where I could check was flooded from truss -u - sorry I cannot verify this anymore. – schily Sep 03 '15 at 09:53
  • @R..5: With locales, POSIX is currently not really ready, setlocale() sets the locale for the process and not for a thread and uselocale() and the *_l() functions have been added in 2008. I am not sure whether the needed functionality is mentioned in POSIX at all. Important here is how shells import environment and the fact that there is no unexport; see next comment... – schily Sep 03 '15 at 10:06
  • ksh and bash are less usable with shell variables than the Bourne Shell. The Bourne Shell imports all environment variables and exports the imported value even if you do local changes. This permits you to run a script in LC_ALL=C but all started programs use the imported locale. If you later run e.g. export LC_ALL, the modified value of the exported variable is exported. Bash and ksh always import the environment as well, but any imported variable is automatically exported as well, so local changes are always exported. – schily Sep 03 '15 at 10:10