This is not an error; the standard just isn't clear enough in its attempt to codify existing practice.
The mawk(1) manual is more explicit:
split(expr, A, sep) works as follows:
...
(2) If sep = " " (a single space), then <SPACE> is trimmed from the front and back of expr, and sep becomes <SPACE>. mawk defines <SPACE> as the regular expression /[ \t\n]+/. Otherwise sep is treated as a regular expression, except that meta-characters are ignored for a string of length 1, e.g., split(x, A, "*") and split(x, A, /*/) are the same.
Also, the GNU awk manual from the current sources:
split(s, a [, r [, seps] ])
...
Splitting behaves identically to field splitting, described above. In particular, if r is a single-character string, that string acts as the separator, even if it happens to be a regular expression metacharacter.
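As a rough illustration of what both manuals describe, the sketch below compares a one-character separator with a longer one. The variable names and sample strings are made up for this example, and it assumes a POSIX-style awk is on the PATH.

```shell
# A one-character separator is taken literally, even if it is a regex metacharacter.
literal_dot=$(awk 'BEGIN { n = split("foo.bar.baz", A, "."); print n, A[2] }')
literal_star=$(awk 'BEGIN { n = split("2*3*4", B, "*"); print n, B[3] }')

# In a longer string the dot regains its regex meaning: ".a" matches
# "any character followed by a", which here is the "ba" of bar and of baz.
regex_dot=$(awk 'BEGIN { n = split("foo.bar.baz", C, ".a"); print n, C[2] }')

echo "literal dot:  $literal_dot"
echo "literal star: $literal_star"
echo "regex dot:    $regex_dot"
```

With "." alone the second element is bar; with ".a" the string is cut around each "ba", so the second element becomes "r.".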
This is the description from the SUSv4 standard:
An extended regular expression can be used to separate fields by assigning a string containing the expression to the built-in variable FS, either directly or as a consequence of using the -F sepstring option. The default value of the FS variable shall be a single <space>. The following describes FS behavior:
1. If FS is a null string, the behavior is unspecified.
2. If FS is a single character:
   a. If FS is <space>, skip leading and trailing <blank> and <newline> characters; fields shall be delimited by sets of one or more <blank> or <newline> characters.
   b. Otherwise, if FS is any other character c, fields shall be delimited by each single occurrence of c.
3. Otherwise, the string value of FS shall be considered to be an extended regular expression. Each occurrence of a sequence matching the extended regular expression shall delimit fields.
Your example matches 2.b.
Although that text only mentions FS, the behavior is the same in all awk implementations for any value passed as the 3rd argument to split, including the case where that argument is a single space.
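The rules quoted from the standard, and their carry-over to split, can be checked with a few one-liners. This is only a sketch: the sample inputs and variable names are invented for illustration, and it assumes a POSIX-style awk.

```shell
# FS is the default single space: leading and trailing blanks are skipped.
nf_space=$(printf '  a  b  \n' | awk '{ print NF }')

# FS is any other single character: each occurrence delimits a field.
nf_dot=$(printf 'a.b.c\n' | awk -F. '{ print NF }')

# FS is longer than one character: it is an extended regular expression.
nf_ere=$(printf 'a1b22c\n' | awk -F'[0-9]+' '{ print NF }')

# The space rule also applies to an explicit 3rd argument of split ...
n_space=$(awk 'BEGIN { print split("  a  b  ", A, " ") }')

# ... whereas "[ ]" is a regex, so every single space delimits a field,
# including the empty fields at both ends and between the doubled spaces.
n_bracket=$(awk 'BEGIN { print split(" a  b ", A, "[ ]") }')

echo "$nf_space $nf_dot $nf_ere $n_space $n_bracket"
```

The last two lines show the difference between a literal space argument, which triggers the trimming behavior, and a bracket expression matching a space, which does not.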
That behavior is unlikely to ever change, because the FS variable is just a string (awk doesn't have regexp objects like JavaScript or Perl; you cannot assign a regexp to a variable, as in a=/./ or $a=qr/./); it's the split function (called either implicitly or explicitly) that interprets its argument as described above.
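To make the point concrete, here is a small sketch in which the separator lives in an ordinary string variable and only split decides how to read it. The variable name sep and the sample input are hypothetical, and a POSIX-style awk is assumed.

```shell
out=$(awk 'BEGIN {
    sep = "."                      # a plain string of length 1
    n1 = split("a.b:c", A, sep)    # taken literally: "a" and "b:c"

    sep = "[.:]"                   # still a plain string, but longer than 1
    n2 = split("a.b:c", B, sep)    # interpreted as an ERE: "a", "b", "c"

    print n1, n2
}')
echo "$out"
```

Nothing about sep changes type between the two calls; the only difference is its length, which is what split looks at.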
The origin of this behavior may be compatibility with the "old" awk, where FS (or the 3rd argument to split) was always treated as a single character. Example (on Unix V7):
$ awk 'BEGIN{FS="."; print split("foo.bar.baz", a, "bar"); print a[2] }'
3
ar.
$ awk 'BEGIN{FS="."; print split("foo.bar.baz", a, /bar/); print a[2] }'
awk: syntax error near line 1
awk: illegal statement near line 1
Bus error - core dumped
/.../ format. It is also described in the gawk documentation (scroll down to the split function): https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html – George Vasiliou Nov 24 '18 at 15:25