Why does '[a-z]*' match non-alphabetical strings?

Question

I have a file alphanum with these two lines:

123 abc
this is a line

I am confused as to why, when I run sed 's/[a-z]*/SUB/' alphanum, I get the following output:

SUB123 abc
SUB is a line

I was expecting:

123 SUB
SUB is a line

I found a fix (use sed 's/[a-z][a-z]*/SUB/' instead), but I don't understand why it works and mine doesn't.

Can you help?

https://unix.stackexchange.com/questions/154747/why-does-a-z-asterisk-match-numbers — Kamaraj, May 20 '18 at 14:52
@Kamaraj, that one is similar, but has the shell patterns vs regexes confusion on top (and the answers concentrate on the former, since that's what the ls foo* there uses). But anyway, if you find questions that are duplicates, I think you should be able to flag them as such, too. — ilkkachu, May 20 '18 at 14:58
@RozzA Note that the website that you link to supports Javascript and Perl regular expressions, not POSIX regular expressions. — Kusalananda, Jun 07 '19 at 17:44

Kusalananda · Accepted Answer · 2019-06-07T17:44:27.867

The pattern [a-z]* matches zero or more characters in the range a to z (the actual characters are dependent on the current locale). There are zero such characters at the very start of the string 123 abc, so the pattern matches there with an empty match, and SUB is inserted in front of 123. At the start of this is a line there are four of them, so the pattern matches this.

If you need at least one match, then use [a-z][a-z]* or [a-z]\{1,\}, or enable extended regular expressions with sed -E and use [a-z]+.
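As a quick check, the three alternatives above can be run side by side. This is a minimal sketch that assumes a sed supporting -E (both GNU and BSD sed do) and recreates the question's alphanum file in the current directory:

```shell
# Recreate the two-line sample file from the question.
printf '123 abc\nthis is a line\n' > alphanum

# BRE: one letter, followed by zero or more letters.
sed 's/[a-z][a-z]*/SUB/' alphanum

# BRE interval expression: at least one letter.
sed 's/[a-z]\{1,\}/SUB/' alphanum

# ERE (enabled with -E): + means one or more.
sed -E 's/[a-z]+/SUB/' alphanum
```

Each of the three commands prints 123 SUB followed by SUB is a line, which is the output the question expected.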

To visualize where the pattern matches, add parentheses around each match:

$ sed 's/[a-z]*/(&)/' file
()123 abc
(this) is a line

Or, to see all matches on the lines:

$ sed 's/[a-z]*/(&)/g' file
()1()2()3() (abc)
(this) (is) (a) (line)

Compare that last result with

$ sed -E 's/[a-z]+/(&)/g' file
123 (abc)
(this) (is) (a) (line)

Technically [a-z] matches collating elements which can be made of more than one character. For instance, in some Hungarian locales, [a-z] matches on dzs — Stéphane Chazelas, May 20 '18 at 20:47
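If locale-dependent ranges are a concern, one common way to sidestep them is to force the C locale for the sed call. This is offered as an assumption about what may suit your case, not something the thread itself prescribes. In the C locale, [a-z] covers only the 26 unaccented ASCII lowercase letters:

```shell
# Assumption: setting LC_ALL=C for this one command is acceptable,
# so that [a-z] means exactly the ASCII letters a through z.
printf '123 abc\n' | LC_ALL=C sed 's/[a-z][a-z]*/SUB/'   # prints: 123 SUB
```

The variable is set only for that single sed invocation, so the rest of the shell session keeps its usual locale.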

score 12 · Answer 2 · answered May 20 '18 at 14:49

Because * matches zero or more repetitions of the previous atom, and regex engines look for the leftmost match first. There's a substring of exactly zero letters at the start of your string, so that's where it matches. In the case where the string starts with a letter, the * matches as many letters as it can, but this is secondary to finding the leftmost match.
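A small sketch of that ordering, using the question's pattern on two made-up input lines (the second input is purely illustrative):

```shell
# Leftmost wins: the empty match at position 0 is taken even though
# a non-empty match ("abc") exists further to the right.
printf '123 abc\n' | sed 's/[a-z]*/SUB/'      # prints: SUB123 abc

# "As many as it can" applies only at that leftmost position:
# "ab" is replaced, not the longer "abcdef" later on the line.
printf 'ab abcdef\n' | sed 's/[a-z]*/SUB/'    # prints: SUB abcdef
```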

Zero-length matches can be a bit problematic, and as you saw, the solution is to modify the pattern so that it requires at least one character. With extended regexes, you could use + for that: sed -E 's/[a-z]+/SUB/'

For fun, try:

echo 'less than 123 words' | sed 's/[0-9]*/x/g'
