How to quantify POSIX extended regex?

Question

Password should start with a capital (uppercase) letter
Password should contain a lower-case letter
Password should contain a number
Password length should be a minimum of 8 and less than 16 characters

I want to use POSIX character classes in a bash script and I have the following, which doesn't work. I don't know how to group it so that I can meet the length condition.

^[[:upper:]][[:lower:]]+[[:upper:]]*[[:digit:]]+$

Where should {8,15} go?

For conditions like these, IMO it's best to just express them as separate expressions (^[[:upper:]][[:alnum:]]{8,15}$ followed by checks for [[:lower:]] and [[:digit:]]). It'll be far easier to understand and maintain. — muru, Apr 17 '22 at 03:37
Something like if [[ "$pass" =~ ^[[:upper:]][[:alnum:]]{8,15}$ && "$pass" =~ [[:lower:]]+[[:digit:]]+ ]]? I think the order within the second condition will matter then. — Cruise5, Apr 17 '22 at 04:51

Stéphane Chazelas · Answer 1 · 2022-04-17T09:29:19.147

POSIX extended regular expressions have no "and" operator and no look-around operators, so to construct a single regexp that positively validates those passwords, you'd need one thousands of characters long that lists all the combinations of lowercase letter, digit and number of characters in between, something like:

u='[[:upper:]]' l='[[:lower:]]' d='[[:digit:]]'
regexp="^$u(($l$d|$d$l).{5,12}|($d.$l|.$d$l|$l.$d|.$l$d).{4,11}|...etc...)\$"

It would be so long that you'd likely reach some limit in your system's regexp engine.

Here, it would be easier to match several regexps:

valid_password=(
  '^[[:upper:]]'
  '[[:lower:]]'
  '[[:digit:]]'
  '^.{8,15}$'
)
validate_password() {
  local regexp
  for regexp in "${valid_password[@]}"; do
    [[ $1 =~ $regexp ]] || return
  done
}
if validate_password "$some_password"; then
  echo OK
fi

Negative matching with a single regexp would be easier, however:

incorrect='^([^[:upper:]].*|[^[:digit:]]*|[^[:lower:]]*|.{0,7}|.{16,})$'

(That is, the password is incorrect if it starts with a character other than an uppercase letter, or is made entirely of non-digits or entirely of non-lowercase characters, or is 0 to 7 characters long, or is 16 or more characters long.)

If [[ $password =~ $incorrect ]] returns true, that means the password is incorrect. However, if it returns false, that could also be because $password contains sequences of bytes that don't form valid characters, so you'd also want to add a check for [[ $password =~ ^.*$ ]] to verify that the password is made of valid characters before declaring it valid.
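The two checks described above could be combined roughly as follows. This is only a sketch: the function name password_ok and the variable valid_text are made up for illustration, and the sample passwords are arbitrary.

```bash
# Sketch only: password_ok and valid_text are illustrative names.
incorrect='^([^[:upper:]].*|[^[:digit:]]*|[^[:lower:]]*|.{0,7}|.{16,})$'
valid_text='^.*$'   # meant to fail on byte sequences that are not valid characters

password_ok() {
  # First make sure the whole string is valid text in the current locale.
  [[ $1 =~ $valid_text ]] || return 1
  # Then reject anything that matches one of the "incorrect" alternatives.
  [[ $1 =~ $incorrect ]] && return 1
  return 0
}

for candidate in 'Abcdefg1' 'abcdefg1' 'Abcdefgh' 'ABCDEFG1' 'Abc1'; do
  if password_ok "$candidate"; then
    echo "$candidate: OK"
  else
    echo "$candidate: rejected"
  fi
done
```

Keeping both regexps in variables avoids any quoting surprises on the right-hand side of =~.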

If switching from bash to zsh is an option, you could use PCREs that do have some look-around operators, which would make it easier:

set -o rematchpcre
[[ $password =~ '^(?=.*\d)(?=.*\p{Ll})\p{Lu}.{7,14}\Z' ]]

Note that if $password is not valid text in the locale, that will fail (return false) and an error will be reported. Note also that PCRE doesn't support multibyte encodings other than UTF-8.

Also note that variables in zsh can contain the NUL character. The PCRE API, unlike the POSIX ERE API, doesn't choke on those bytes, but you'd likely want to reject those characters in passwords along with all other control characters (including newline).

(Note that I've not tested any of this.)

In your valid_password function, wouldn't a lowercase character and a digit have to follow the first uppercase character? Except the first uppercase, the ordering shouldn't matter but it seems to me that it does in the function. — Cruise5, Apr 17 '22 at 15:33
@Cruise5 no, an unanchored pattern can match anywhere in the string, so the lowercase character and digit can be anywhere (except for the first character, because of the previous regex). — muru, Apr 17 '22 at 15:35
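
A quick way to see this in bash, as a small sketch. The helper name matches and the sample string are made up for illustration; the digit deliberately comes before any lowercase letter.

```bash
# Illustrative helper (not from the answer): succeeds if string $1 matches ERE $2.
matches() { [[ $1 =~ $2 ]]; }

pw='A1bcdefg'   # arbitrary sample: digit first, lowercase letters after

# Anchored with ^, so only the first character is examined.
matches "$pw" '^[[:upper:]]' && echo 'starts with an uppercase letter'
# Unanchored, so a match anywhere in the string is enough.
matches "$pw" '[[:lower:]]'  && echo 'contains a lowercase letter'
matches "$pw" '[[:digit:]]'  && echo 'contains a digit'
# Anchored at both ends, so it constrains the total length.
matches "$pw" '^.{8,15}$'    && echo 'length is 8 to 15'
```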

Kusalananda · Answer 2 · 2022-04-17T06:40:18.440

Your proposed regular expression requires digits to occur at the end of the string. It also does not allow lower-case letters to occur after any internal upper-case letters. It forces the second character to be a lower-case letter. None of these restrictions was part of the list of conditions.
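
To make those restrictions concrete, here is a small sketch that runs the regexp from the question against a few sample strings. The function name try and the samples are made up for illustration.

```bash
# The regexp from the question, unchanged.
asker='^[[:upper:]][[:lower:]]+[[:upper:]]*[[:digit:]]+$'

# Illustrative helper: report whether $1 matches the asker's regexp.
try() {
  if [[ $1 =~ $asker ]]; then
    echo "$1: match"
  else
    echo "$1: no match"
  fi
}

try 'Abcdefg1'   # match: lowercase run, then digits at the end
try 'A1bcdefg'   # no match: the second character is not a lowercase letter
try 'AbcDefg1'   # no match: lowercase letters after an internal uppercase letter
try 'Ab1'        # match, even though it is far too short
```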

You have four different conditions on some string $pw. It makes the most sense to test them one after the other: the tests are easier to write and understand, we're free to modify the restrictions separately from each other, and we can more easily tell the user which of the conditions the string does not pass if we need to. Doing the tests in sequence also allows us to add new conditions easily, like "must contain a punctuation character" or "must not contain three lower-case letters in a row".

if [[ $pw == [[:upper:]]* ]] &&
   [[ $pw == *[[:lower:]]* ]] &&
   [[ $pw == *[[:digit:]]* ]] &&
   [ "${#pw}" -ge 8 ] && [ "${#pw}" -lt 16 ]
then
    echo valid
else
    echo invalid
fi

The code above doesn't use regular expressions as it's not needed, and it assumes that "number" means "digit" as opposed to "any number in any notation."
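
If you do want to tell the user which conditions failed, the same tests can be run separately rather than chained with &&. A minimal sketch, assuming a function called check_password (a made-up name) and message wording of my own choosing:

```bash
# Sketch only: check_password and the messages are illustrative.
check_password() {
  local pw=$1 rc=0
  if [[ $pw != [[:upper:]]* ]]; then
    echo 'must start with an uppercase letter'; rc=1
  fi
  if [[ $pw != *[[:lower:]]* ]]; then
    echo 'must contain a lowercase letter'; rc=1
  fi
  if [[ $pw != *[[:digit:]]* ]]; then
    echo 'must contain a digit'; rc=1
  fi
  if [ "${#pw}" -lt 8 ] || [ "${#pw}" -ge 16 ]; then
    echo 'must be 8 to 15 characters long'; rc=1
  fi
  # Zero only if every condition passed.
  return "$rc"
}

if check_password 'abc'; then
  echo valid
else
  echo invalid
fi
```

Each condition stays on its own, so adding or changing one later doesn't touch the others.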

Note that it's not just challenging to do the length test as part of a single regular expression; it is also unreasonably tricky to, at the same time, make sure that the string contains at least one lower-case character and a digit in any order. You may possibly do this in one expression, but it would be awkward.

Note that in bash, if $password is not valid text in the locale, glob pattern matching switches to an ASCII byte-wise mode, which could give false positives or false negatives in some locales. It would make sense to also verify that the password is valid text. — Stéphane Chazelas, Apr 17 '22 at 09:18
