How to get complex regex to work in bash

Question

The idea is to get a rudimentary check on input pattern for a url:

$ ns='abc.def.com'
$ reg_expr="\N*\.(\D{2}|\D{3})$"
$ echo $reg_expr
\N*\.(\D{2}|\D{3})$
$ [[ $ns =~ "$reg_expr" ]] && echo "ok" || echo "no"
no

However, the regex always fails. Online regex checkers report that the same pattern works fine.

https://regex101.com/r/vXxv1w/1

Why does this happen?

It's worth noting that many regular expression testing websites do not support POSIX regular expressions, commonly used by Unix text-processing tools such as grep, sed, and awk. While some websites may offer helpful features like syntax highlighting and debugging tools, they may not accurately reflect how your regular expression will behave with standard Unix tools. — Kusalananda, Oct 07 '23 at 09:30
See "Why does my regular expression work in X but not in Y?" — Gordon Davisson, Oct 08 '23 at 19:59

Stéphane Chazelas · Answer 1 · 2023-10-07T09:25:03.350

That

\N*\.(\D{2}|\D{3})$

is a perl regexp. bash's [[ =~ ]] operator takes a POSIX extended regexp (ERE), not a perl regexp.

To use a perl-style regexp, use zsh and its rematchpcre option:

set -o rematchpcre
[[ $ns =~ '\N*\.(\D{2}|\D{3})$' ]]

Now, that regexp doesn't make much sense.

\N is meant to match on any character other than newline, but unless you set the s flag (like with (?s)), . won't match a newline anyway, so you could replace \N with a plain . (dot).
Having <anything>* at the start or end of a regexp is pointless as it matches on 0 or more of <anything>, so it matches on nothing as well. [[ $ns =~ '\.(\D{2}|\D{3})$' ]] is functionally equivalent.
\D matches on any character other than decimal digits which are either 0123456789 or something like 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯෦෧෨෩෪෫෬෭෮෯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᪀᪁᪂᪃᪄᪅᪆᪇᪈᪉᪐᪑᪒᪓᪔᪕᪖᪗᪘᪙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꧐꧑꧒꧓꧔꧕꧖꧗꧘꧙꧰꧱꧲꧳꧴꧵꧶꧷꧸꧹꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹０１２３４５６７８９ depending on whether the matching is done based on Unicode character properties or not (not by default in zsh unless you add (*UCP)). So that also matches on . characters which makes your check fall apart.
\D{2}|\D{3} can be written \D{2,3}
$ in perl regexps matches on either the end of the subject or before a newline character at the end of the subject. To match at the end of the subject, you use \z instead.

So assuming you want to match on strings that end with . followed by 2 to 3 characters other than digits and ., with PCRE, you'd need:

# zsh
set -o rematchpcre
[[ $ns =~ '\.[^\d.]{2,3}\z' ]]

Or with ERE:

# zsh/bash -O compat31
[[ $ns =~ '\.[^[:digit:].]{2,3}$' ]]

Or:

# zsh/bash/ksh93
regex='\.[^[:digit:].]{2,3}$'
[[ $ns =~ $regex ]]

Bearing in mind that what [[:digit:]] matches varies with the locale and system. Use [^0123456789.] to match on any character other than . and those specific decimal digit characters.
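
As a minimal sketch of the ERE variant in bash, assuming the intent described above (a final dot followed by 2 to 3 characters that are neither digits nor dots): the function name check_ns is made up for illustration and is not part of the question. It uses the explicit [^0123456789.] form so the result does not depend on the locale.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption, not from the question):
# prints "ok" when the argument ends in a dot followed by 2 or 3
# characters that are neither ASCII decimal digits nor dots.
check_ns() {
  # Explicit digit list instead of [[:digit:]] to stay locale-independent.
  local regex='\.[^0123456789.]{2,3}$'
  # $regex is left unquoted so that bash treats it as an ERE.
  if [[ $1 =~ $regex ]]; then
    echo ok
  else
    echo no
  fi
}

check_ns 'abc.def.com'    # ends in .com
check_ns 'abc.def.c0m'    # digit in the last label
check_ns 'abc.def.info'   # last label is 4 characters long
```

The first call prints ok and the other two print no. Keeping the pattern in a variable and expanding it unquoted sidesteps the quoting differences between shells and bash versions.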

\D, [^\d.] and [^[:digit:].] all also match on newline characters. If you wanted to make sure strings that contain newline characters don't match, you'd need regex='^.*\.[^\d\n.]{2,3}\z' (or regex='^\N*\.[^\d\n.]{2,3}\z' to make it more explicit) in perl-style RE and regex=$'^[^\n]*\.[^[:digit:].\n]{2,3}$' in ERE.
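
To illustrate the newline point in bash, here is a rough sketch comparing the two ERE forms given above on an input that contains a newline. The variable names loose and strict and the helper matches are assumptions introduced for this example only.

```shell
#!/usr/bin/env bash
# The two ERE forms from the answer; the variable names are made up.
loose='\.[^[:digit:].]{2,3}$'
strict=$'^[^\n]*\\.[^[:digit:].\n]{2,3}$'   # $'...' turns \n into a real newline

# Hypothetical helper: report whether string $1 matches ERE $2.
matches() {
  if [[ $1 =~ $2 ]]; then
    echo match
  else
    echo "no match"
  fi
}

input=$'abc.c\nm'                 # a newline hides inside the last "label"

matches "$input" "$loose"         # bracket expression also accepts the newline
matches "$input" "$strict"        # newline is excluded everywhere
matches 'abc.def.com' "$strict"   # ordinary input still matches
```

With bash on a typical system this prints match, no match, match: the loose pattern accepts the embedded newline as one of the 2 to 3 trailing characters, while the strict one rejects any string containing a newline.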

The regex is also pointless since it isn't anchored. These solutions would match !@#$!.abc, for example. Presumably, the OP wants to enclose the regex in ^ and $. — terdon, Oct 07 '23 at 09:47
@terdon I did make that point already. Note that the OP used $ — Stéphane Chazelas, Oct 07 '23 at 11:42

Sotto Voce · Answer 2 · 2023-10-08T19:29:48.543

In the man page for bash, there are several paragraphs describing the operators that are valid within the [[ expression ]] test. The one covering regular expression matching says:

An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered a POSIX extended regular expression and matched accordingly ...

(skipping a couple of sentences)

If any part of the pattern is quoted, the quoted portion is matched literally. This means every character in the quoted portion matches itself, instead of having any special pattern matching meaning. If the pattern is stored in a shell variable, quoting the variable expansion forces the entire pattern to be matched literally.

To summarize, inside [[ expression ]], don't put quotes around the regular expression, and don't put quotes around a variable that contains the regular expression. Even though you've specified =~ as the comparison operator, quoting the regular expression changes the comparison to a mere string match, like ==.
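
A small sketch of the quoting effect, assuming bash 3.2 or later with default compatibility settings. The ERE used here is only an illustration (a final dot followed by 2 or 3 non-digit, non-dot characters), not the pattern from the question, and the result variable names are made up.

```shell
#!/usr/bin/env bash
ns='abc.def.com'
regex='\.[^[:digit:].]{2,3}$'   # illustrative ERE, stored in a variable

# Quoted expansion: the pattern is matched as a literal string, so this
# only succeeds if $ns contains the text of $regex character for character.
if [[ $ns =~ "$regex" ]]; then quoted=ok; else quoted=no; fi

# Unquoted expansion: the pattern is interpreted as a POSIX ERE.
if [[ $ns =~ $regex ]]; then unquoted=ok; else unquoted=no; fi

echo "quoted: $quoted, unquoted: $unquoted"
```

This prints "quoted: no, unquoted: ok". Note that inside [[ ]] an unquoted variable expansion is not subject to word splitting or globbing, so leaving $regex unquoted there is safe.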

It would also be good to follow the references through the man pages to see what regex syntax is supported. You may need to use [[:digit:]] and [^[:digit:]] instead of \d and \D. (I could be mistaken so check it for yourself)
