I have three directories in the current directory.

$ ls
a_0db_data  a_clean_0db_data  a_clean_data
$ ls a_*_data
a_0db_data:

a_clean_0db_data:

a_clean_data:

$ ls a_[a-z]*_data
a_clean_0db_data:

a_clean_data:

I expected the last ls command to match only a_clean_data. Why did it also match the directory whose name contains 0?

$ bash --version
GNU bash, version 4.2.24(1)-release (i686-pc-linux-gnu)
umläute
user13107

3 Answers

The [a-z] part isn't what matches the number; it's the *. You may be confusing shell globbing and regular expressions.

Tools like grep accept various flavours of regular expressions (basic by default, -E for extended, and -P for Perl-compatible regexes in GNU grep).

For example, -v inverts the match, and ls -d lists the directory names themselves rather than their contents:

$ ls -d a_[a-z]*_data | grep -v "[0-9]"
a_clean_data

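If you want to use a bash regex, here is an example of how to test whether the variable $ref is an unsigned integer:

# One or more digits and nothing else
re='^[0-9]+$'
# Keep $re unquoted: a quoted right-hand side is matched as a literal string
if ! [[ $ref =~ $re ]] ; then
  echo "error"
fi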
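
The same [[ =~ ]] test can be applied to the directory names from the question. This is a minimal sketch, assuming bash; the names are hard-coded rather than read from disk, and the variable names (re, matched) are only illustrative:

```shell
#!/usr/bin/env bash
# As a regex, [a-z]* means "zero or more lowercase letters",
# so no digit can appear between a_ and _data.
re='^a_[a-z]*_data$'

matched=()
for name in a_0db_data a_clean_0db_data a_clean_data; do
  if [[ $name =~ $re ]]; then
    matched+=("$name")
  fi
done

# Only a_clean_data satisfies the regex
printf '%s\n' "${matched[@]}"
```
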
l0b0
Sebastian

So the problem is: why does a_[a-z]*_data match a_clean_0db_data?

This can be broken down into four parts:

  • a_ matches the beginning of a_clean_0db_data, leaving clean_0db_data to be matched

  • [a-z] matches any character in the range a-z (e.g. c), leaving lean_0db_data to be matched

  • * matches any number of characters, e.g. lean_0db

  • _data matches the trailing _data

In regular expressions, [a-z]* would mean any number of characters (including zero) in the range of a..z, but you are dealing with shell globbing, not with regular expressions.
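
The breakdown above can be reproduced in a scratch directory. This is a sketch assuming bash and mktemp; the C locale is forced here (an assumption, not part of the question) so that [a-z] and the sort order of the results are predictable:

```shell
#!/usr/bin/env bash
export LC_ALL=C   # assumption: fixed locale for predictable ranges and ordering
tmp=$(mktemp -d) || exit 1
cd "$tmp" || exit 1
mkdir a_0db_data a_clean_0db_data a_clean_data

# a_ [a-z] * _data: exactly one letter, then anything at all (digits included)
glob_matches=$(printf '%s\n' a_[a-z]*_data)
printf '%s\n' "$glob_matches"

# With plain globs, constraining every character takes one bracket per character
fixed_matches=$(printf '%s\n' a_[a-z][a-z][a-z][a-z][a-z]_data)
printf '%s\n' "$fixed_matches"

cd / && rm -rf "$tmp"
```

The first pattern expands to both a_clean_0db_data and a_clean_data; the second, which spells out five letters, expands only to a_clean_data.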

If you want regular expressions, a few find implementations have a -regex predicate for that:

find . -maxdepth 1 -regex "^.*/a_[a-z]*_data$"

The -maxdepth 1 is only there to limit the search to the directory you are in. The regular expression is matched against the whole path (for example ./a_clean_data), not just the file name, which is why the pattern starts with ^.*/ to consume the path portion.
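
A quick way to try this out, assuming a find that supports -regex and -maxdepth (GNU and BSD find do; plain POSIX find does not). The -type d test is an addition for this sketch and is not part of the command above:

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d) || exit 1
cd "$tmp" || exit 1
mkdir a_0db_data a_clean_0db_data a_clean_data

# The regex is applied to the whole path (./name), hence the leading ^.*/
find_matches=$(find . -maxdepth 1 -type d -regex '^.*/a_[a-z]*_data$')
printf '%s\n' "$find_matches"

cd / && rm -rf "$tmp"
```
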

umläute

* in shell patterns matches 0 or more characters. It's not to be confused with the * regular expression operator that means 0 or more of the preceding atom.

There is no equivalent of regexp * in basic shell patterns. However, various shells have extensions for that.

  • ksh has *(something):

    ls a_*([a-z])_data
    
  • you can have the same in bash with shopt -s extglob or zsh with setopt kshglob:

    shopt -s extglob
    ls a_*([a-z])_data
    
  • In zsh with extendedglob enabled, # is equivalent to regexp *:

    setopt extendedglob
    ls a_[a-z]#_data
    
  • In recent versions of ksh93, you can also use regular expressions in globs. Here with extended regular expressions:

    ls ~(E:a_[a-z]*_data)
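
For bash specifically, the extglob variant can be tried as follows. This is a sketch under a few assumptions: the C locale is forced for predictable ordering, an extra a__data directory is created purely for illustration, and +(...) (one or more repetitions) is shown alongside *(...) even though only the latter is mentioned above:

```shell
#!/usr/bin/env bash
# extglob has to be enabled before bash parses a line that uses the pattern
shopt -s extglob
export LC_ALL=C

tmp=$(mktemp -d) || exit 1
cd "$tmp" || exit 1
mkdir a_0db_data a_clean_0db_data a_clean_data a__data

# *(...) is zero or more repetitions, so the empty run in a__data matches too
star_matches=$(printf '%s\n' a_*([a-z])_data)
printf '%s\n' "$star_matches"

# +(...) is one or more repetitions, which leaves a__data out
plus_matches=$(printf '%s\n' a_+([a-z])_data)
printf '%s\n' "$plus_matches"

cd / && rm -rf "$tmp"
```

Neither pattern matches a_clean_0db_data, because the underscore and digits in clean_0db fall outside [a-z].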
    

Note that [a-z] matches different things depending on the current locale. In the C locale it generally matches only the 26 unaccented Latin letters a to z. In other locales it generally matches more, and the result doesn't always make sense. To match any letter in your locale, you may prefer [[:alpha:]].
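
The C-locale behaviour can be checked with a small case statement. This sketch assumes bash, and classify is a hypothetical helper name; what [a-z] matches in other locales depends on the system's collation data, so only the C locale is exercised here:

```shell
#!/usr/bin/env bash
export LC_ALL=C   # only the C locale guarantees codepoint-based ranges

# Report which of the two patterns a single character falls under
classify() {
  case $1 in
    [a-z])       echo "a-z" ;;
    [[:alpha:]]) echo "alpha" ;;
    *)           echo "other" ;;
  esac
}

lower=$(classify c)   # lowercase letter: caught by [a-z]
upper=$(classify C)   # uppercase letter: outside [a-z] in C, but still [[:alpha:]]
digit=$(classify 0)   # digit: neither
printf '%s %s %s\n' "$lower" "$upper" "$digit"
```
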

  • Could you give an example of [a-z] matching more than the 26 letters matched in the C locale? From what I remember when I last looked at this, all encodings used in practice on Unix variants had ISO-646 as a base (the upper 128 codes were then used differently: directly for characters in encodings like ISO-8859-X, or combined in encodings like UTF-8 or the EUC family). Even AIX didn't have EBCDIC locales (at least none available to me). I remember trying to find out whether the POSIX/UNIX standards demanded it, but I don't remember the result. – AProgrammer Sep 10 '14 at 09:40
  • @AProgrammer, that's independent of the encoding; it's based on sort order (LC_COLLATE). [a-z] generally includes é or í (but not necessarily ź) in the locales where the charset has them, whether or not the codepoint in that encoding lies between those of a and z. Only the C locale guarantees a sort order based on codepoint value. See this other answer for more details. – Stéphane Chazelas Sep 10 '14 at 09:46
  • Ok, what I missed was that the range was interpreted according to the current collation sequence. – AProgrammer Sep 10 '14 at 11:29