Can someone tell me the meaning of each line, with an example? I don't understand why a regex is used, or what [!0122...] means.

#!/bin/sh
is_integer ()
{
    case "${1#[+-]}" in
        (*[!0123456789]*) return 1 ;;
        ('')              return 1 ;;
        (*)               return 0 ;;
    esac
}
Chris Davies
disovox
    If it helps, those aren't regular expressions, they're shell globs – Chris Davies Oct 24 '20 at 08:53
    You will get downvotes if you fail to copy&paste correctly from one part of your question to another.  You will get downvotes if you say ‘‘explain every line of this (with an example!)’’ without showing that you have made any attempt to understand it yourself.  If you have attempted to figure this out, but you have come up completely dry, you need to improve your research skills.  If you have figured out part of it, tell us what part you understand and don’t say ‘‘explain every line …’’. – G-Man Says 'Reinstate Monica' Oct 24 '20 at 17:31

2 Answers

12
#!/bin/sh

is a comment in the syntax of the shell. However, that #! tells the kernel, when executing the file, to run the interpreter stored at the /bin/sh path to interpret it, passing the path of the script as an argument.

is_integer () compound-command

is the POSIX sh syntax to define a function.

{
   ...
}

is a compound command called a command group. Its only purpose is to group commands, here to make it the body of the function. Here, it's superfluous as its content is only one compound command, however using the { ... } command group as the body of every function is common practice and makes for more readable code so is generally recommended. The same function could have been written:

is_integer () case "${1#[+-]}" in
  (*[!0123456789]*) return 1 ;;
  ('')              return 1 ;;
  (*)               return 0 ;;
esac

case something in (pattern1 | pattern2) ...;; (pattern3) ...;; esac is a case/esac construct (a compound command) which matches something against each pattern in turn and, upon the first match, executes the corresponding code.

Here something is ${1#[-+]}. That's a parameter expansion, which applies the ${param#pattern} operator to the 1 parameter, which is the first argument to the function. That operator strips the shortest string that matches the pattern from the start of the contents of the parameter. [-+] is a wildcard pattern (not a regexp) that matches either the - or the + character. So ${1#[-+]} expands to the value of the first argument stripped of a leading sign. If the first argument was -2, that becomes 2. If it was -, it becomes the empty string. If it was 2, it stays 2.
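As a small illustration of that expansion, here is a sketch using a hypothetical helper named strip_sign (the name is not part of the original code) that prints what the case statement actually gets to see:

```shell
#!/bin/sh
# strip_sign is a hypothetical helper, used only for this illustration.
strip_sign() {
    # Remove one leading + or - (the shortest match of the [+-] glob), if any.
    printf '%s\n' "${1#[+-]}"
}

strip_sign -2     # prints 2
strip_sign +17    # prints 17
strip_sign 2      # prints 2 (no sign to strip)
strip_sign -      # prints an empty line
strip_sign --5    # prints -5 (only one sign is removed)
```

The last call shows why a value like --5 still ends up rejected: the remaining - is a non-digit.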

You'll notice "${1#[+-]}" is quoted. Generally, you need to quote parameter expansions, as otherwise they're subject to split+glob. This is one of the very few contexts where that wouldn't happen, so strictly speaking those quotes are superfluous (but they do no harm and are still good practice).
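To see that the word after case is not subject to split+glob, here is a minimal sketch; show_case_word is a hypothetical name, and the expansion is left unquoted on purpose:

```shell
#!/bin/sh
# show_case_word is a hypothetical helper for this illustration.
show_case_word() {
    # $1 is deliberately unquoted: the word after "case" undergoes
    # neither word splitting nor filename generation.
    case $1 in
        ('a   b *') echo "matched as one unsplit, unglobbed word" ;;
        (*)         echo "no match" ;;
    esac
}

show_case_word 'a   b *'   # prints: matched as one unsplit, unglobbed word
```

In most other contexts (command arguments, for instance) leaving the expansion unquoted would split that value on spaces and expand the *.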

Then that value is matched against some patterns.

*[!0123456789]* is * --any number of characters (though most shells will also accept non-characters)-- followed by [!0123456789] --any character that is neither 0 nor 1... nor 9-- followed by any number of characters (* again). So it will match any string that contains a character (or non-character in most shells) that is not a decimal digit.
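That pattern can be tried on its own. The sketch below uses a hypothetical function name, has_non_digit, and only isolates the first branch of the original case statement:

```shell
#!/bin/sh
# has_non_digit is a hypothetical name for this sketch.
has_non_digit() {
    case "$1" in
        (*[!0123456789]*) return 0 ;;  # at least one non-digit somewhere
        (*)               return 1 ;;  # only digits, or the empty string
    esac
}

for s in 123 12a3 ' 42' 4.2 ''; do
    if has_non_digit "$s"; then
        printf '"%s" contains a non-digit\n' "$s"
    else
        printf '"%s" contains no non-digit\n' "$s"
    fi
done
```

Note that the empty string contains no non-digit, which is why the original function needs the separate '' branch.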

If there's a match, the return 1 code is executed, which causes the function to return with exit code 1 which, like any number other than 0, means false / failure.

'' is one way to represent the empty string. The empty string is also not a valid number but wouldn't have been matched by the previous pattern.

Then * matches anything. So the return 0 would be run for any string that didn't match any of the previous patterns. It's superfluous here as the case statement is the last command in that function, and a case statement returns success / true if no command was run within.

So here, that function definition could be shortened to:

is_integer() case ${1#[-+]} in
  ('' | *[!0123456789]*) false
esac

Though that doesn't make it more legible.
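To check that the shortened definition behaves like the original, both can be run side by side. This is only a sketch: the _long and _short suffixes are hypothetical names added here so that both definitions can coexist.

```shell
#!/bin/sh
# Original form, renamed for the comparison.
is_integer_long() {
    case "${1#[+-]}" in
        (*[!0123456789]*) return 1 ;;
        ('')              return 1 ;;
        (*)               return 0 ;;
    esac
}

# Shortened form: the case statement itself is the function body, and it
# returns 0 when no pattern matches.
is_integer_short() case ${1#[-+]} in
  ('' | *[!0123456789]*) false
esac

for arg in 42 -7 +0 '' - 4x 1.5 ' 3'; do
    if is_integer_long "$arg";  then long=yes;  else long=no;  fi
    if is_integer_short "$arg"; then short=yes; else short=no; fi
    printf '%-6s long=%s short=%s\n' "\"$arg\"" "$long" "$short"
done
```

Both columns should agree on every line: yes for 42, -7 and +0, no for the rest.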

In any case, that code is right to use [0123456789]. For input validation in particular (and validating input is critical when it is used in shell arithmetic expressions, see Security Implications of using unsanitized data in Shell Arithmetic evaluation), [0-9] or [[:digit:]] should not be used, especially if your sh implementation is bash. [0-9] may match any character (or possibly multi-character collation element) that sorts between 0 and 9, and [[:digit:]] on some BSDs matches the digits of any decimal numeral system, not only the 0123456789 English ones, even in English locales.

For instance, on a GNU system, in a typical US English locale (which these days tends to use UTF-8 as its charset), in bash, [0-9] would also match hundreds of other characters. And on FreeBSD, in that same locale, [[:digit:]] would match hundreds of different characters.

If you let one of those characters through during input validation, you're not closing the paths to those arbitrary code injection vulnerabilities. In ksh and on GNU systems, some of them are valid variable names (and that's the case for many of the characters matched by [0-9]). If a variable with such a name is set (in the environment for instance) and contains a[0$(reboot>&2)], then:

is_integer "$1" || exit
echo "$(( $1 + 1 ))"

in ksh will cause a reboot if is_integer fails to reject that input.
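The validate-then-compute pattern can be sketched as a small wrapper. add_one is a hypothetical name, and the payload below uses a harmless echo as a stand-in for reboot; with the strict [0123456789] check it is rejected before it ever reaches the arithmetic expansion:

```shell
#!/bin/sh
is_integer() {
    case "${1#[+-]}" in
        (*[!0123456789]*) return 1 ;;
        ('')              return 1 ;;
        (*)               return 0 ;;
    esac
}

# add_one is a hypothetical wrapper: it refuses anything that is not an
# optional sign followed by ASCII decimal digits.
add_one() {
    if ! is_integer "$1"; then
        printf 'add_one: not an integer: %s\n' "$1" >&2
        return 1
    fi
    echo "$(( $1 + 1 ))"
}

add_one 41                                        # prints 42
add_one -3                                        # prints -2
add_one 'a[0$(echo injected >&2)]' || echo "rejected"
```

The last call prints the error message and "rejected"; the command substitution inside the argument is never evaluated.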

To use a regular expression to do the matching, you'd need expr or awk, though few shells have those commands builtin, so it would be less efficient. Some [ implementations, like the [ builtin of zsh or yash, can also do regexp matching. And some shells also have a [[ ... ]] conditional expression construct that can do regexp matching, but none of those are in standard sh, and they come with their own problems when it comes to input validation.

While the * shell wildcard in most sh implementations will match on sequences of bytes even if some of them don't form valid characters, same for [!0123456789], the .* or [^0123456789] regexp equivalent often doesn't.

Here, it may not be a problem as long as that matching is positive. Doing a negative matching like:

regexp() {
  awk -- 'BEGIN {exit !(ARGV[1] ~ ARGV[2])}' "$@"
}

is_integer() {
  ! regexp "${1#[-+]}" '^(.*[^0123456789].*)?$'
}

as a direct translation of that case statement would be wrong: it would fail to reject input that contains sequences of bytes not forming valid characters. But

is_number() {
  regexp "$1" '^[-+]?[0123456789]+$'
}

should be fine, as it would reject any input containing sequences of bytes not forming valid characters.
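A quick way to exercise the positive-matching variant is sketched below. It repeats the two definitions so that it is self-contained, and it assumes an awk implementation is available in PATH:

```shell
#!/bin/sh
# Succeeds if string $1 matches extended regular expression $2.
regexp() {
  awk -- 'BEGIN {exit !(ARGV[1] ~ ARGV[2])}' "$@"
}

# Positive match: optional sign, then one or more ASCII decimal digits.
is_number() {
  regexp "$1" '^[-+]?[0123456789]+$'
}

for arg in 42 -7 +0 '' - 4x; do
    if is_number "$arg"; then
        printf '"%s" accepted\n' "$arg"
    else
        printf '"%s" rejected\n' "$arg"
    fi
done
```

Each call forks an awk process, which is the efficiency cost mentioned above compared with the builtin case construct.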

3

It returns true (zero) if the first argument to the function is an integer, and false (1) if it isn't.

It does this by first removing any single + or - sign from the beginning of the 1st argument's value. This is what "${1#[+-]}" does. This is using a standard parameter expansion ${variable#pattern}, which removes the shortest substring matching pattern from the start of the value in the variable variable. The pattern should be a shell globbing pattern, not a regular expression.

It then runs the resulting value through a series of pattern matches (globbing patterns, not regular expressions). The first pattern that matches will trigger the corresponding return statement.

The first pattern tests whether there is some character other than a digit in the string. This pattern could also have been written *[!0-9]* or *[![:digit:]]* (but see the other answer for why those can match more than the ten ASCII digits). The ! functions in the same way as a ^ would in a regular expression character class or range (i.e. as in [^...], and some shells would accept a ^ here too): it inverts the given character class or range. The pattern *[!0123456789]* could be understood as "match a non-digit anywhere in the given string". The * at the start and end of the pattern are needed because shell globbing patterns are always anchored (the corresponding regular expression would have been [^0-9], or ^.*[^0-9].*$ with explicit unneeded anchoring).
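The anchoring can be observed directly. In this sketch, matches_glob is a hypothetical helper that succeeds when its first argument matches the glob given as its second argument:

```shell
#!/bin/sh
# matches_glob is a hypothetical helper: succeeds if string $1 matches glob $2.
matches_glob() {
    case "$1" in
        ($2) return 0 ;;   # unquoted, so $2 is treated as a pattern
        (*)  return 1 ;;
    esac
}

# Without the surrounding *, the pattern has to match the whole string.
matches_glob 12a3 '[!0123456789]'   && echo yes || echo no   # prints no
matches_glob a    '[!0123456789]'   && echo yes || echo no   # prints yes
matches_glob 12a3 '*[!0123456789]*' && echo yes || echo no   # prints yes
```

A bare [!0123456789] only matches strings that consist of exactly one non-digit character.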

The second pattern is simply testing whether the string is empty.

The last pattern matches any string.

An alternative implementation of the function in bash (which allows for pattern matching with == inside [[ ... ]]):

is_integer () {
    set -- "${1#[+-]}"
    if [ -z "$1" ] || [[ $1 == *[!0-9]* ]]; then
        return 1
    fi

    return 0
}

Kusalananda