
Why can't I put a regular expression on the left side of the ~ operator when using gawk?

For example, given the file below, whose fields are delimited by tabs (\t):

$ cat cats
siberian    1970    73  2500
shorthair   1999    60  3000
longhair    1998    102 9859
scottish    2001    30  6000

If I use gawk to find a record, it works:

$ gawk '$1 ~ /h/' cats
shorthair   1999    60  3000
longhair    1998    102 9859
scottish    2001    30  6000

However, if I swap the operands $1 and /h/, it doesn't:

$ gawk '/h/ ~ $1' cats
gawk: cmd. line:1: warning: regular expression on left of `~' or `!~' operator

The gawk man page for the ~ operator says:

Regular expression match, negated match. NOTE: Do not use a constant regular expression (/foo/) on the left-hand side of a ~ or !~. Only use one on the right-hand side. The expression /foo/ ~ exp has the same meaning as (($0 ~ /foo/) ~ exp). This is usually not what was intended.

I don't understand how the expression /foo/ gets evaluated as ($0 ~ /foo/). Also, the note only seems to make the weaker claim that "bad things will happen if you put a constant regular expression on the left"; it doesn't make the stronger claim that "the behaviour of gawk is undefined if you put a constant regular expression on the left, because it wasn't programmed to be used in this way".

I basically don't understand how the operator ~ is evaluated internally.

Jeff Schaller

1 Answer


To quote the POSIX spec for awk:

When an ERE token appears as an expression in any context other than as the right-hand of the ~ or !~ operator or as one of the built-in function arguments described below, the value of the resulting expression shall be the equivalent of:

$0 ~ /ere/

This (combined with the action defaulting to { print }) is why you can use awk as a grep substitute by just doing awk '/b/' <file.
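To see the value a bare regular expression produces, here is a minimal sketch that assigns it to a variable. It assumes the cats file from the question, rebuilt here with printf, and the variable names cats, flags, and matched are only illustrative.

```shell
# The cats file from the question, rebuilt with tab-separated fields.
cats=$(printf 'siberian\t1970\t73\t2500\nshorthair\t1999\t60\t3000\nlonghair\t1998\t102\t9859\nscottish\t2001\t30\t6000\n')

# /h/ is used as a value here, not as the right-hand side of ~,
# so it stands for ($0 ~ /h/) and yields 1 (match) or 0 (no match).
flags=$(printf '%s\n' "$cats" | awk '{ matched = /h/; print matched, $1 }')

printf '%s\n' "$flags"
```

This should print 0 for siberian and 1 for the other three records, which are the same three records that $1 ~ /h/ selected in the question.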

So, the answer is just "it's defined to work that way". /ere/ is defined to mean $0 ~ /ere/ except in certain circumstances, and /ere/ ~ $1 is not one of the exceptional circumstances, so it gets evaluated as ($0 ~ /ere/) ~ $1.
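The equivalence can be checked with a small experiment. This is only a sketch: the sample data is made up so that the first field is "0" or "1", because the left-hand side collapses to one of those two values and $1 is then used as the regular expression it is matched against. The variable names are illustrative, and plain awk is used since the behaviour comes from the language rather than from gawk.

```shell
# Hypothetical data: field 1 is "0" or "1"; field 2 may or may not contain an "h".
data=$(printf '1\thello\n0\thello\n1\tworld\n0\tworld\n')

# Regex on the left. gawk warns on stderr, so the warning is silenced here.
implicit=$(printf '%s\n' "$data" | awk '/h/ ~ $1' 2>/dev/null)

# The same program with the expansion spelled out.
explicit=$(printf '%s\n' "$data" | awk '($0 ~ /h/) ~ $1')

printf '%s\n' "$implicit"
```

Both forms should select the records "1 hello" and "0 world": in the first, $0 ~ /h/ is 1 and the string "1" matches the regex 1; in the second, $0 ~ /h/ is 0 and "0" matches the regex 0. With the cats file nothing is printed, because neither "0" nor "1" matches a regex such as shorthair.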

ilkkachu
godlygeek
  • Thanks, it makes sense now. Unfortunately, I made the mistake of asking a question about awk when in fact I was looking at the man page for gawk (which is installed on my system), and that POSIX quote is not on the gawk man page. I will edit my question and change all the references from awk to gawk. – Jerry Marbas Jun 16 '15 at 00:40
  • Note that gawk vs awk doesn't actually make any difference - both of them will behave this way, because that's just how the awk language is defined to work. – godlygeek Jun 16 '15 at 17:59
  • Yes, I know... what I meant was that the POSIX text you quoted was not in the gawk man page. If it were, I probably wouldn't have asked the question. However, I'm still glad I asked, because your answer cleared up a couple of other confusing details. Thanks – Jerry Marbas Jun 17 '15 at 00:14
  • 1
    @jerry The direct POSIX quote is not likely to be in any man page, gawk or otherwise - I found it at http://pubs.opengroup.org/onlinepubs/9699919799/ - in general, the Open Group specifications are the way to figure out what POSIX guarantees to be portable across conforming implementations. For how the gawk manual (not man page!) explains this behavior, see here: http://www.gnu.org/software/gawk/manual/html_node/Using-Constant-Regexps.html#Using-Constant-Regexps – godlygeek Jun 17 '15 at 15:28