8

From "How can I extract only the pid column and only the pathname column in the lsof output?":

awk '{ for (i=9; i<=NF; i++) {
    if ($i ~ "string" && $1 != "wineserv" && $5 == "REG" && $NF ~ "\.pdf") {
        $1=$2=$3=$4=$5=$6=$7=$8=""
        print
    }
}}'

The regex "\.pdf" matches /.../pdf.../... in gawk, but not in mawk. I wonder why?

Thanks.

Jeff Schaller
Tim

4 Answers

11

I don't think it's about the regex, but about how the double-quoted string is handled. C-style escapes (like \n) are interpreted in awk strings, and gawk and mawk treat invalid escapes differently:

$ mawk 'BEGIN { print "\."; }'
\.
$ gawk 'BEGIN { print "\."; }'
gawk: cmd. line:1: warning: escape sequence `\.' treated as plain `.'
. 

That is, mawk seems to leave the backslash as-is, while gawk removes it (and complains, at least in my version). So, the actual regexes used are different: in gawk the regex is .pdf, which of course matches /pdf, since the dot matches any single character, while in mawk your regex is \.pdf, where the dot is escaped and matched literally.

GNU awk's manual explicitly mentions it's not portable to use a backslash before a character with no defined backslash-escape sequence (see the box "Backslash Before Regular Characters"):

If you place a backslash in a string constant before something that is not one of the characters previously listed, POSIX awk purposely leaves what happens as undefined. There are two choices:

Strip the backslash out
This is what BWK awk and gawk both do. For example, "a\qc" is the same as "aqc".
Leave the backslash alone
Some other awk implementations do this. In such implementations, typing "a\qc" is the same as typing "a\\qc".
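To check which of the two choices a particular awk makes, a small probe along these lines may help. This is only a sketch: probe_backslash is a made-up helper name, and it assumes the awk under test is on your PATH. The 2>/dev/null is there because gawk prints a warning for the invalid escape.

```shell
# probe_backslash is a hypothetical helper name used only for this sketch.
# $1 is the awk command to test (defaults to "awk").
probe_backslash() {
  # length("\.") is 1 if the backslash is stripped, 2 if it is kept.
  n=$("${1:-awk}" 'BEGIN { print length("\.") }' 2>/dev/null)
  case $n in
    1) echo "strips the backslash" ;;
    2) echo "keeps the backslash" ;;
    *) echo "unexpected result: $n" ;;
  esac
}

probe_backslash awk
```

Going by the behavior shown above, gawk should report that it strips the backslash and mawk that it keeps it; other implementations may fall on either side.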

I assume you want the dot to be escaped in the regex, so the safe ways are either $NF ~ "\\.pdf", or $NF ~ /\.pdf/ (since with the regex literal /.../, the escapes aren't "double processed").

The POSIX text also notes the double processing of the escapes:

If the right-hand operand [of ~ or !~] is any expression other than the lexical token ERE, the string value of the expression shall be interpreted as an extended regular expression, including the escape conventions described above. Note that these same escape conventions shall also be applied in determining the value of a string literal (the lexical token STRING), and thus shall be applied a second time when a string literal is used in this context.

So, this works in both gawk and mawk:

$ ( echo .pdf; echo /pdf ) |
  awk '{ if ($0 ~ "\\.pdf") print "   match: " $0; else print "no match: " $0; }'
   match: .pdf
no match: /pdf

as does this:

$ ( echo .pdf; echo /pdf ) |
  awk '{ if ($0 ~ /\.pdf/) print "   match: " $0; else print "no match: " $0; }'
   match: .pdf
no match: /pdf
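To make the two rounds of escape processing visible separately, one can print the string before using it as a regex. This is a minimal sketch; show_rounds and the sample input lines are just illustrative names and data, not anything from the question.

```shell
# show_rounds is a hypothetical helper name used only for this sketch.
show_rounds() {
  awk 'BEGIN {
         # Round 1: string escapes turn "\\." into the two characters \.
         re = "\\.pdf"
         print "regex text: " re
       }
       {
         # Round 2: the regex engine reads \. as a literal dot.
         print $0 ": " (($0 ~ re) ? "match" : "no match")
       }'
}

printf '%s\n' 'x.pdf' 'x/pdf' | show_rounds
```

The first output line should be the regex text \.pdf with a single backslash, followed by a match for x.pdf and no match for x/pdf.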
ilkkachu
  • See also http://austingroupbugs.net/view.php?id=1105 There's going to be some changes in that regard in the next POSIX version. – Stéphane Chazelas Feb 18 '19 at 12:57
  • @StéphaneChazelas, I was hoping you'd have a clarification. (I thought about making a proper Q out of it, but didn't get that done.) But the text in that bug report is still a bit too much standardese for me... Should that update on the escape characters table be understood so that in /\[/, \[ stays as-is, and then the backslash makes the [ lose its special properties during the regex processing? So that the intention indeed is that /\[/ does the same as \[ in grep etc? – ilkkachu Feb 19 '19 at 13:02
  • [.] is a more readable way to match a dot as @mosvy mentions below. – jrw32982 Feb 20 '19 at 19:07
5

As the table of escape sequences in the POSIX awk specification shows, the behavior of a backslash in an awk regular expression is undefined unless it is followed by up to 3 octal digits, another backslash, or any of the characters ["/abfnrtv].

Your best bet is to write [.] instead of \. if you want to match a literal dot.
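For illustration, here is a small sketch of the bracket form used both as a string and as a regex literal. The helper name classify_dot and the sample inputs are made up for the example.

```shell
# classify_dot is a hypothetical helper name used only for this sketch.
classify_dot() {
  awk '{
    # "[.]pdf" is a string used as a dynamic regex. It contains no backslash,
    # so string-escape processing cannot alter it.
    s = ($0 ~ "[.]pdf") ? "match" : "no match"
    # /[.]pdf/ is the same pattern written as a regex literal.
    r = ($0 ~ /[.]pdf/) ? "match" : "no match"
    print $0 ": string=" s ", literal=" r
  }'
}

printf '%s\n' 'report.pdf' 'dir/pdf' | classify_dot
```

Both forms should agree: report.pdf matches and dir/pdf does not, whichever awk runs it.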

Note that in this case it is mawk's behavior that departs from general practice: while all awk implementations I know of let you escape \., \+ and \* inside a regex literal (/foo\.bar/), only mawk lets you do the same inside a string used as a regular expression ($0 ~ "foo\.bar").

  • and, depending on what info you need, using /proc or sock_diag directly from perl or python may be a better idea than parsing the output of lsof. –  Feb 16 '19 at 07:58
  • I'm not so sure about this. Above the table it says "The awk utility shall make use of the extended regular expression notation [link] except that it shall allow the use of C-language conventions for escaping special characters" and behind the link "An ERE special character [...] when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself.". I don't think the first sentence is supposed to mean that the table there should override the usual meaning of the backslash in ERE. – ilkkachu Feb 16 '19 at 11:22
  • It does say "The interpretation of an ordinary character preceded by an unescaped <backslash> is undefined", but then the dot is explicitly listed as a special character. Here, the backslash works fine in escaping the dot (in all versions of awk I can find): echo '/pdf \.pdf' |mawk '$1 ~ $2 { print "match"; }' – ilkkachu Feb 16 '19 at 11:26
  • that table is pretty clear (any backslash followed by a character not in this table or the \n, \r, etc => undefined). That table is also referenced from the description of the STRING token. Anyways, that spec should codify existing practice and, here mawk is the odd one. –  Feb 16 '19 at 12:05
  • hmm yes, that's a rather interesting contradiction. I wonder, is there some awk implementation that would take that table to the letter, and treat e.g. \[, \* or \( as something other than escaped special characters? – ilkkachu Feb 16 '19 at 12:11
  • The awk spec in susv4 has a lot of defects, eg. this. It also doesn't mention that a split (either implicit or explicit) will create dual string/number values, unlike other funcs or assignments (that's only vaguely and carefully alluded to ;-)). All in all, I don't see anything to change in my little answer to this trivial question: it states both the standard-specified and the actual real-life behavior, and recommends [.], which unlike \\. should work everywhere (ERE, BRE, pcre, emacs, regexes stored in single or double quoted strings, etc). –  Feb 17 '19 at 10:29
3

Use the right tool for the job. You have these 2 expressions:

$i ~ "string"
$NF ~ "\.pdf"

but in both cases the patterns are literal strings. So no reason to even bother with regex matching, just use literal string matching:

index($i, "string")
index($NF, ".pdf")

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_13
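A quick sketch of the difference, comparing index() with the unescaped regex ".pdf". The helper name compare_literal and the sample inputs are made up for the example.

```shell
# compare_literal is a hypothetical helper name used only for this sketch.
compare_literal() {
  awk '{
    # index() returns the 1-based position of the substring, or 0 if absent,
    # so no character in ".pdf" is treated specially.
    lit = index($0, ".pdf") ? "found" : "not found"
    # As a regex, the unescaped dot in ".pdf" matches any single character.
    re = ($0 ~ ".pdf") ? "match" : "no match"
    print $0 ": index=" lit ", regex=" re
  }'
}

printf '%s\n' 'a.pdf' 'a/pdf' | compare_literal
```

For a/pdf the regex still matches while index() reports nothing found, which is the literal behavior wanted here.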

2

As in many other languages, \x means one thing in a string and another in a regular expression. You can use either

$NF ~ /\.pdf/

or

$NF ~ "\\.pdf"

In gawk, the string "\.pdf" is just a strange way of saying ".pdf".

JJoao