Why does [a-z] asterisk match numbers?

Question

I have 3 directories at current path.

$ls
a_0db_data  a_clean_0db_data  a_clean_data
$ls a_*_data
a_0db_data:

a_clean_0db_data:

a_clean_data:

$ls a_[a-z]*_data
a_clean_0db_data:

a_clean_data:

I expected the last ls command to match only a_clean_data. Why did it also match the one containing 0?

bash --version
GNU bash, version 4.2.24(1)-release (i686-pc-linux-gnu)

See this question for more on the difference between a regular expression and a glob. — terdon, Sep 10 '14 at 14:24
So the fact that a_*_data matched any of these files didn't surprise you? — Cthulhu, Sep 10 '14 at 16:37

score 29 · Accepted Answer · edited Sep 10 '14 at 15:26

The [a-z] part isn't what matches the number; it's the *. You may be confusing shell globbing and regular expressions.

Tools like grep accept various flavours of regexes (basic by default, -E for extended, -P for Perl regex).

For example (-v inverts the match):

$ ls a_[a-z]*_data | grep -v "[0-9]"
a_clean_data

If you want to use a bash regex, here is an example of how to test whether the variable $ref is an integer:

re='^[0-9]+$'
if ! [[ $ref =~ $re ]] ; then
  echo "error"
fi
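The same =~ operator can be applied to the directory names from the question. What follows is only a sketch, assuming bash: it recreates the three directories in a scratch directory made with mktemp, lists candidates with a broad glob, and then keeps only the names that match a regex. The variable names (workdir, matched) are made up for illustration.

```shell
# Recreate the three directories from the question in a scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir"/a_0db_data "$workdir"/a_clean_0db_data "$workdir"/a_clean_data

# Extended regex: a_, then one or more lowercase letters, then _data.
re='^a_[a-z]+_data$'

matched=()
for path in "$workdir"/a_*_data; do   # broad glob: lists all three
  name=${path##*/}                    # strip the leading directory part
  if [[ $name =~ $re ]]; then         # the regex does the real filtering
    matched+=("$name")
  fi
done

printf '%s\n' "${matched[@]}"
rm -r "$workdir"
```

Only a_clean_data passes the regex test, because [a-z]+ cannot step over the 0 or the extra underscore in the other two names.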

edited Sep 10 '14 at 15:26

l0b0

answered Sep 10 '14 at 08:36

Sebastian

How to use bash regex then? (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_04_01.html) – user13107 Sep 10 '14 at 08:39
see this question – umläute Sep 10 '14 at 08:43

umläute · Answer 2 · 2014-09-11T08:20:29.203

So the problem is: why does a_[a-z]*_data match a_clean_0db_data?

This can be broken down into four parts:

a_ matches the beginning of a_clean_0db_data, leaving clean_0db_data to be matched
[a-z] matches any character in the range a-z (e.g. c), leaving lean_0db_data to be matched
* matches any number of characters, e.g. lean_0db
_data matches the trailing _data

In regular expressions, [a-z]* would mean any number of characters (including zero) in the range of a..z, but you are dealing with shell globbing, not with regular expressions.
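To see the two readings side by side, here is a small bash sketch that tests the three names from the question against the same text, once as a glob (with ==) and once as an extended regex (with =~). It assumes bash's [[ ]] construct; the variable names are arbitrary.

```shell
names=(a_0db_data a_clean_0db_data a_clean_data)
re='^a_[a-z]*_data$'

report=()
for name in "${names[@]}"; do
  # Unquoted right-hand side of == is a glob:
  # [a-z] is exactly one character, then * is any string.
  if [[ $name == a_[a-z]*_data ]]; then glob=yes; else glob=no; fi
  # Right-hand side of =~ is an extended regex:
  # [a-z]* is zero or more characters from the range.
  if [[ $name =~ $re ]]; then regex=yes; else regex=no; fi
  report+=("$name glob=$glob regex=$regex")
done

printf '%s\n' "${report[@]}"
# a_0db_data glob=no regex=no
# a_clean_0db_data glob=yes regex=no
# a_clean_data glob=yes regex=yes
```

The middle line is the case from the question: as a glob the pattern matches, as a regex it does not.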

If you want regular expressions, a few find implementations have a -regex predicate for that:

find . -maxdepth 1 -regex "^.*/a_[a-z]*_data$"

The -maxdepth is only here to limit the search results to the folder you are in. The regular expression has to match the entire path, not just the file name, so I have added ^.*/ to match the directory portion.
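A sketch of the find approach that can be tried without touching real data, assuming a find that supports -regex and -maxdepth (GNU find does; neither is in POSIX). The scratch directory and the variable names are made up for the example, and LC_ALL=C is added so that [a-z] means the ASCII letters only.

```shell
# Recreate the three directories from the question in a scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir/a_0db_data" "$workdir/a_clean_0db_data" "$workdir/a_clean_data"

# -regex is tested against the whole path, hence the leading .*/
found=$(LC_ALL=C find "$workdir" -maxdepth 1 -type d -regex '.*/a_[a-z]*_data')

printf '%s\n' "$found"   # prints the path of a_clean_data only
rm -r "$workdir"
```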

score 11 · Answer 3 · answered Sep 10 '14 at 09:16

* in shell patterns matches 0 or more characters. It's not to be confused with the * regular expression operator that means 0 or more of the preceding atom.

There is no equivalent of regexp * in basic shell patterns. However, various shells have extensions for that.

ksh has *(something):
```
ls a_*([a-z])_data
```
You can get the same in bash with shopt -s extglob, or in zsh with setopt kshglob:
```
shopt -s extglob
ls a_*([a-z])_data
```
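A way to check the extglob pattern without creating any directories is to test names with [[ ]] in bash. This is only a sketch: the first three names are from the question, and a__data is a made-up extra name added to show that *( ) also accepts zero repetitions.

```shell
shopt -s extglob

# a__data is hypothetical; the other three names come from the question.
names=(a_0db_data a_clean_0db_data a_clean_data a__data)

extglob_matches=()
for name in "${names[@]}"; do
  # *([a-z]) means zero or more characters from [a-z], like regex [a-z]*
  if [[ $name == a_*([a-z])_data ]]; then
    extglob_matches+=("$name")
  fi
done

printf '%s\n' "${extglob_matches[@]}"
# a_clean_data
# a__data
```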
In zsh with extendedglob enabled, # is equivalent to regexp *:
```
setopt extendedglob
ls a_[a-z]#_data
```
In recent versions of ksh93, you can also use regular expressions in globs. Here with extended regular expressions:
```
ls ~(E:a_[a-z]*_data)
```

Note that [a-z] matches different things depending on the current locale. It generally matches only the 26 unaccented Latin letters a to z in the C locale. In other locales, it generally matches more, and doesn't always make sense. To match a letter in your locale, you may prefer [[:alpha:]].
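The C locale behaviour is easy to check in bash. The sketch below pins the locale with LC_ALL=C and compares [a-z] with [[:alpha:]] on three single characters; the helper function names are made up for the example. What [a-z] matches in other locales varies, so the sketch does not try to show that.

```shell
# Pin the locale so that [a-z] means exactly the ASCII letters a to z.
LC_ALL=C

# Hypothetical helpers: each tests a single character against one pattern.
is_lower_range() { [[ $1 == [a-z] ]]; }
is_alpha_class() { [[ $1 == [[:alpha:]] ]]; }

results=()
for ch in q Q 7; do
  if is_lower_range "$ch"; then range=yes; else range=no; fi
  if is_alpha_class "$ch"; then alpha=yes; else alpha=no; fi
  results+=("$ch range=$range alpha=$alpha")
done

printf '%s\n' "${results[@]}"
# q range=yes alpha=yes
# Q range=no alpha=yes
# 7 range=no alpha=no
```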

answered Sep 10 '14 at 09:16

Stéphane Chazelas

Could you give an example of [a-z] matching more than the 26 letters matched in the C locale? From what I remember when I last looked at this, all encodings used in practice on Unix variants had ISO-646 as a base (the upper 128 codes were then used differently: directly for characters in encodings like ISO-8859-X, or combined in encodings like UTF-8 or the EUC family). Even AIX didn't have EBCDIC locales (at least none available to me). I remember trying to find out whether the POSIX/UNIX standards demanded it, but I don't remember the result. – AProgrammer Sep 10 '14 at 09:40
@AProgrammer, that's independent of the encoding; it's based on sort order (LC_COLLATE). [a-z] generally includes é or í (but not necessarily ź) in the locales where the charset has them, whether or not the codepoint in that encoding is between those of a and z. Only the C locale guarantees a sort order based on codepoint value. See this other answer for more details. – Stéphane Chazelas Sep 10 '14 at 09:46
Ok, what I missed was that the range was interpreted according to the current collation sequence. – AProgrammer Sep 10 '14 at 11:29
