How or Why using `.*?` is better than `.*`?

Question

I answered this question on SuperUser, which was about the kind of regular expressions used when grepping output.

The answer I gave was this :

 tail -f log | grep "some_string.*some_string"

And then, in three comments on my answer, @Bob wrote this:

.* is greedy and might capture more than you want. .*? is usually better.

Then this,

the ? is a modifier on *, making it lazy instead of the greedy default. Assuming PCRE.

I googled PCRE, but couldn't work out its significance for my answer.

and finally this,

I should also point out that this is regex (grep doing POSIX regex by default), not a shell glob.

I only know what a regex is and some very basic usage of it in the grep command. So I couldn't follow any of those three comments, and I have these questions in mind:

What are differences in usage of .* vs. .*?
Which is better and under what circumstance? Please provide examples.

Also, it would be helpful to understand the comments, if anyone could

UPDATE: As an answer to the question "How are regexes different from shell globs?", @Kusalananda provided this link in his comment.

NOTE: If needed, please read my answer to that question for context before answering.

This is two very different questions. The first question is answered by https://unix.stackexchange.com/questions/57957/how-do-regular-expressions-differ-from-wildcards-used-to-filter-files while the second question is dependent on the application of the pattern (it can not be said to be "better" under all circumstances). — Kusalananda, May 05 '18 at 07:43
You may [edit] this question to be only about the .* vs. .*? issue. The "difference between regular expressions and shell globs" question has already been addressed on this site. — Kusalananda, May 05 '18 at 08:11

score 11 · Answer 1 · edited May 05 '18 at 15:39

Suppose I take a string like:

can cats eat plants?

Using the greedy c.*s will match can cats eat plants, i.e. everything up to the last s (only the trailing ? is left out): the string starts with c, and the greedy quantifier keeps matching until the final occurrence of s.

Whereas the lazy c.*?s stops at the first occurrence of s after the initial c, i.e. it matches can cats.

From the above example, you might be able to gather that:

"Greedy" means matching the longest possible string. "Lazy" means matching the shortest possible string. Adding a ? to a quantifier like *, +, ?, or {n,m} makes it lazy.

edited May 05 '18 at 15:39 by ilkkachu
answered May 05 '18 at 09:03 by Ashok Arora

"Shortest possible" would be cats, so it's not enforcing "shortest possible" strictly in that sense. – Kusalananda May 05 '18 at 09:07

@Kusalananda true, not strictly in that sense but "shortest possible" here means between the first occurrence of both c and s. – Ashok Arora May 05 '18 at 10:34
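The point raised in these comments can be checked directly. The following is a minimal sketch, assuming GNU grep built with PCRE support (for -P); the variable names are only illustrative. A lazy quantifier still starts at the leftmost possible position, so it does not find the overall shortest match. A negated bracket expression is one way to get the tighter match.

```shell
s='can cats eat plants?'

# Lazy: the match still begins at the first "c", then stops at the nearest "s".
lazy=$(printf '%s\n' "$s" | grep -oP 'c.*?s')

# Negated bracket expression: neither "c" nor "s" may occur inside the match,
# so the only candidate is "cats". This works in plain BRE, no -P needed.
tight=$(printf '%s\n' "$s" | grep -o 'c[^cs]*s')

printf 'lazy:  %s\n' "$lazy"    # lazy:  can cats
printf 'tight: %s\n' "$tight"   # tight: cats
```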

score 9 · Accepted Answer · edited Apr 14 '21 at 21:25

Ashok already pointed out the difference between .* and .*?, so I'll just provide some additional information.

grep (assuming the GNU version) supports 4 ways to match strings:

Fixed strings, with the -F option
Basic regular expressions (BRE), default
Extended regular expressions (ERE), with the -E option
Perl-compatible regular expressions (PCRE), with the -P option in GNU grep

grep uses BRE by default.
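As a rough illustration of how the four modes treat the same pattern differently, here is a small sketch. It assumes GNU grep built with PCRE support; the two sample lines and the variable names are made up for the example.

```shell
# Two sample lines: one contains a literal "+", the other a repeated "a".
input=$(printf '%s\n' 'a+b' 'aab')

fixed=$(printf '%s\n' "$input" | grep -F 'a+b')  # fixed string: "+" is literal
bre=$(printf '%s\n' "$input" | grep 'a+b')       # BRE: unescaped "+" is literal
ere=$(printf '%s\n' "$input" | grep -E 'a+b')    # ERE: "+" means one or more
pcre=$(printf '%s\n' "$input" | grep -P 'a+b')   # PCRE: "+" means one or more

printf 'F=%s BRE=%s ERE=%s PCRE=%s\n' "$fixed" "$bre" "$ere" "$pcre"
# F=a+b BRE=a+b ERE=aab PCRE=aab
```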

BRE and ERE are documented in the Regular Expressions chapter of POSIX and PCRE is documented on its official website. Please note that features and syntax may vary between implementations.

It's worth saying that neither BRE nor ERE supports laziness:

The behavior of multiple adjacent duplication symbols ( '+', '*', '?', and intervals) produces undefined results.

So if you want to use that feature, you'll need to use PCRE instead:

# PCRE greedy
$ grep -P -o 'c.*s' <<< 'can cats eat plants?'
can cats eat plants
# PCRE lazy
$ grep -P -o 'c.*?s' <<< 'can cats eat plants?'
can cats

Could you please explain a little about .* vs .*? ?

.* is used to match the "longest"¹ pattern possible.
.*? is used to match the "shortest"¹ pattern possible.

In my experience, the behavior you want is usually the second one.

For example, let's say we have the following string and we only want to match the html tags², not the content between them:

<title>My webpage title</title>

Now compare .* vs .*?:

# Greedy
$ grep -P -o '<.*>' <<< '<title>My webpage title</title>'
<title>My webpage title</title>
# Lazy
$ grep -P -o '<.*?>' <<< '<title>My webpage title</title>'
<title>
</title>

¹ The meaning of "longest" and "shortest" in a regex context is a bit tricky, as Kusalananda pointed out. Refer to official documentation for more information.

² It's not recommended to parse html with regex. This is just an example for educational purposes, don't use it in production.

Could you please explain a little about .* vs .*? ? — C0deDaedalus, May 05 '18 at 15:38

score 1 · Answer 3 · answered May 06 '18 at 03:02

A string could be matched in several ways (from simple to more complex):

As a static string (assume var='Hello World!'):

shell: [ "$var" = "Hello World!" ] && echo yes
grep:  echo "$var" | grep -F "Hello"
bash:  grep -F "Hello" <<<"$var"

As a glob:

shell: echo ./* # list all files in pwd.
shell: case $var in (*Worl*) echo yes;; (*) echo no;; esac
bash:  [[ "$var" == *"Worl"* ]] && echo yes

There are basic and extended globs. The case example uses basic globs. The bash [[ example uses extended globs. The first example (filename matching) could be basic or extended depending on the shell, for instance when extglob is set in bash. Both are identical in this case. grep cannot use globs.

The asterisk in a glob means something different than an asterisk in a regex:

glob:  * matches any number (including none) of any characters.
regex: * matches any number (including none) of the preceding element.
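The difference between the two asterisks can be sketched as follows. The helper names glob_match and regex_match are hypothetical, chosen just for this example; the regex is anchored with ^ and $ so that it has to cover the whole line, as a glob does.

```shell
# Glob: "a*" means "an a followed by anything".
glob_match() {
  case $1 in
    a*) echo yes ;;
    *)  echo no ;;
  esac
}

# Regex: "a*" means "zero or more a characters".
regex_match() {
  if printf '%s\n' "$1" | grep -q '^a*$'; then echo yes; else echo no; fi
}

glob_match abc    # yes: "bc" is covered by the glob *
regex_match abc   # no:  "b" and "c" are not repetitions of "a"
regex_match aaa   # yes: three repetitions of "a"
```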

As a basic regular expression (BRE):

sed:  echo "$var" | sed 's/W.*d//' # print: Hello !
grep: grep -o 'W.*d' <<<"$var" # print: World

There are no BREs in (basic) shells or awk.

Extended regular expressions (ERE):

bash: [[ "$var" =~ (H.*l) ]] # match: Hello Worl
sed:  echo "$var" | sed -E 's/(d|o)//g' # print: Hell Wrl!
awk:  awk '/W.*d/{print $1}' <<<"$var" # print: Hello
grep: grep -oE 'H.*l' <<<"$var" # print: Hello Worl

Perl Compatible Regular Expressions:

grep: grep -oP 'H.*?l' <<<"$var" # print: Hel

Only in PCRE does *? have a special syntactic meaning.
It makes the asterisk lazy (ungreedy): Laziness Instead of Greediness.

$ grep -oP 'e.*l' <<<"$var"
ello Worl

$ grep -oP 'e.*?l' <<<"$var"
el

This is just the tip of the iceberg: there are greedy, lazy, and docile (or possessive) quantifiers. There are also lookahead and lookbehind, but those do not apply to the asterisk *.

There is an alternative to get the same effect as a non-greedy regex:

$ grep -o 'e[^o]*o' <<<"$var"
ello

The idea is very simple: instead of a dot ., use a bracket expression that excludes the character that should end the match, here [^o]. With a web tag:

$ grep -o '<[^>]*>' <<<'<script type="text/javascript">document.write(5 + 6);</script>'
<script type="text/javascript">
</script>

The above should completely clarify all three of @Bob's comments. Paraphrasing:

A .* is a common regex, not a glob.
Only a regex could be PCRE compatible.
In PCRE, a ? modifies the * quantifier: .* is greedy, .*? is not.

Questions

What are differences in usage of .* vs. .*?
- A .*? is valid only in PCRE syntax.
- A .* is more portable.
- The same effect as a non-greedy match can be achieved by replacing the dot with a negated character range: [^a]*

Which is better and under what circumstance? Please provide examples.
Better? It depends on the goal. Neither is better in general; each is useful for different purposes. I have provided several examples above. Do you need more?
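One more sketch of "it depends on the goal", assuming GNU grep built with PCRE support; the sample line and variable names are invented for the example. When grep only filters lines (no -o), as in a tail -f log | grep pipeline, greedy and lazy patterns select the same lines. The difference only shows up once the matched text itself is extracted.

```shell
line='foo 1 bar 2 bar'

# As a line filter, both patterns select the same line.
greedy_line=$(printf '%s\n' "$line" | grep 'foo.*bar')
lazy_line=$(printf '%s\n' "$line" | grep -P 'foo.*?bar')

# With -o, only the matched part is printed, and the two differ.
greedy_part=$(printf '%s\n' "$line" | grep -o 'foo.*bar')
lazy_part=$(printf '%s\n' "$line" | grep -oP 'foo.*?bar')

printf '%s\n' "$greedy_part"   # foo 1 bar 2 bar
printf '%s\n' "$lazy_part"     # foo 1 bar
```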
