awk 'processing_script_here' my=file.txt
seems to stop and wait indefinitely...
What's going on here and how do I make it work?
In most versions of awk, arguments after the program to execute are either:
1. A file
2. A variable assignment of the form x=y
Since your filename is being interpreted as case #2, awk is still waiting for something to read on stdin (since it doesn't perceive that there has been any filename passed).
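A quick way to see this for yourself, as a rough sketch: the file name and contents below are made up for illustration, and stdin is fed from a pipe so that awk does not block on the terminal.

```shell
# Sketch: awk takes my=file.txt as the assignment my="file.txt", not as a file.
# The temporary directory and the file contents are hypothetical.
dir=$(mktemp -d)
printf 'from the file\n' > "$dir/my=file.txt"

# Run inside the directory so the operand is the bare name my=file.txt.
assign_out=$(cd "$dir" && printf 'from stdin\n' | awk '{ print $0 " (my=" my ")" }' my=file.txt)
echo "$assign_out"    # from stdin (my=file.txt)
```

The line comes from stdin and the awk variable my holds file.txt; the file on disk is never opened.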
Portably, this behaviour is documented in POSIX:
Either of the following two types of argument can be intermixed:
- file: A pathname of a file that contains the input to be read, which is matched against the set of patterns in the program. If no file operands are specified, or if a file operand is '-', the standard input shall be used.
- assignment: An operand that begins with an underscore or alphabetic character from the portable character set (see the table in the Base Definitions volume of IEEE Std 1003.1-2001, Section 6.1, Portable Character Set), followed by a sequence of underscores, digits, and alphabetics from the portable character set, followed by the '=' character, shall specify a variable assignment rather than a pathname.
As such, portably, you have a few options (#1 is likely the least intrusive):
1. Use awk ... ./my=file, which sidesteps this since . is not "an underscore or alphabetic character from the portable character set".
2. Use awk ... < my=file. However, this doesn't work well with multiple files.
3. Make a hard link with ln my=file my_file, and then use my_file as normal. No copying will be performed, and both files will be backed by the same data and inode metadata. After using it, it's safe to remove the link created as the number of references to the inode will still be greater than 0.
./my=file work? % awk 'processing_script_here' ./my=file.txt awk: fatal: cannot open file ./my=file.txt' for reading (No such file or directory).
This should be portable because ./my isn't a valid variable name, so shouldn't be parsed that way.
– Stephen Harris
Dec 22 '18 at 21:17
./ works just fine with either bash or dash on Ubuntu.
– Sergiy Kolodyazhnyy
Dec 22 '18 at 21:51
= is preceded by an underscore or alphabetic character from the portable character set (see the table in the Base Definitions volume of IEEE Std 1003.1-2001, Section 6.1, Portable Character Set), followed by a sequence of underscores, digits, and alphabetics from the portable character set. So a file path like ++foo=bar.txt or =foo or ./foo=bar are all OK as that . or + is not a [_a-zA-Z].
– Stéphane Chazelas
Dec 22 '18 at 22:04
./my=file will be passed through verbatim.
– Chris Down
Dec 23 '18 at 00:00
awk ... < myfile doesn't make any difference with regards to seeking (not sure why awk would want to seek anyway). That does mean however that the name of the file is not available in FILENAME.
– Stéphane Chazelas
Dec 23 '18 at 00:08
strace awk '{print $1,$2}' < /etc/passwd there are instances of lseek() going through several system files, such as /proc/self/maps but nothing related to the actual file itself.
– Sergiy Kolodyazhnyy
Dec 23 '18 at 00:20
stdin is not seekable, which isn't quite the case as shown in https://unix.stackexchange.com/questions/337739/what-is-the-difference-between-cat-file-binary-and-binary-file/337750#337750
– Sergiy Kolodyazhnyy
Dec 23 '18 at 00:23
awk '{print $1,$2}' /etc/passwd. The point is that having the shell open the file as opposed to awk doesn't make any difference as to whether it makes it seekable or not. Actually, in awk '{exit}' < /etc/passwd, you'd expect awk to seek back to the end of the first record upon that exit to make sure it leaves the position within stdin there. POSIX requires that. /usr/xpg4/bin/awk does it on Solaris, but neither gawk nor mawk seem to do it on GNU/Linux.
– Stéphane Chazelas
Dec 23 '18 at 00:28
awk to seek back to the end of the current record on exit? It's true that Solaris' awk does that, but that's pretty dumb IMHO, because it introduces a difference in behavior when the stdin isn't seekable, e.g. in ksh -c 'printf "1st\n2nd\n3rd\n" > f; exec < <(cat f); awk "{exit}"; echo AFTER; cat' vs. ... ; exec <f; ...
–
Dec 23 '18 at 05:39
awk
that way.
– Stéphane Chazelas
Dec 23 '18 at 09:14
awk 'FNR==1{system("cat")}' < file which even Solaris' /usr/xpg4/bin/awk doesn't handle "properly" (the POSIX spec should probably be clarified first about that one).
– Stéphane Chazelas
Dec 23 '18 at 11:18
As Chris says, arguments of the form variablename=anything are treated as variable assignments (which are performed at the time the arguments are processed, as opposed to the (newer) -v var=value ones, which are performed before the BEGIN statements) instead of input file names.
That can be useful in things like:
awk '{print $1}' FS=/ RS='\n' file1 FS='\n' RS= file2
Where you can specify a different FS/RS per file. It's also commonly used in:
awk '!file1_processed{a[$0]; next}; {...}' file1 file1_processed=1 file2
Which is a safer version of:
awk 'NR==FNR{a[$0]; next}; {...}' file1 file2
(which doesn't work if file1 is empty).
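To make the empty-file1 pitfall concrete, here is a small sketch. The temporary files and the "second:" label are made up for illustration, and the {...} placeholder is filled with a plain print.

```shell
# Sketch: compare the NR==FNR idiom with the flag idiom when file1 is empty.
# File names and contents are hypothetical.
tmp=$(mktemp -d)
: > "$tmp/file1"                      # empty first file
printf 'a\nb\n' > "$tmp/file2"

# With no records in file1, NR==FNR stays true all through file2,
# so every line is swallowed by the first rule and nothing is printed.
nrfnr_out=$(awk 'NR==FNR{a[$0]; next}; {print "second:", $0}' "$tmp/file1" "$tmp/file2")

# The flag only becomes true once awk reaches the file1_processed=1 operand,
# which happens after file1 whether or not it had any lines.
flag_out=$(awk '!file1_processed{a[$0]; next}; {print "second:", $0}' "$tmp/file1" file1_processed=1 "$tmp/file2")

printf 'NR==FNR: [%s]\nflag: [%s]\n' "$nrfnr_out" "$flag_out"
```

The first command prints nothing, while the second prints both lines of file2.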
But that gets in the way when you have files whose name contains = characters.
Now, that's only a problem when what's left of the first = is a valid awk variable name.
What constitutes a valid variable name in awk is stricter than in sh.
POSIX requires it to be something like:
[_a-zA-Z][_a-zA-Z0-9]*
With only characters of the portable character set. However, the /usr/xpg4/bin/awk of Solaris 11 at least is not compliant in that regard and allows any alphabetical characters in the locale in variable names, not just a-zA-Z.
So an argument like x+y=foo or =bar or ./foo=bar is still treated as an input file name and not an assignment, as what's left of the first = is not a valid variable name. An argument like Stéphane=Chazelas.txt may or may not be, depending on the awk implementation and locale.
That's why with awk, it's recommended to use:
awk '...' ./*.txt
instead of
awk '...' *.txt
for instance, to avoid the problem if you can't guarantee that the names of the txt files won't contain = characters.
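A small sketch of the difference between the two globs, with made-up file names and contents:

```shell
# Sketch: a file named a=b.txt is lost with *.txt but read with ./*.txt.
# File names and contents are hypothetical.
tmp=$(mktemp -d)
printf 'x\n' > "$tmp/a=b.txt"
printf 'y\n' > "$tmp/c.txt"

# Bare glob: a=b.txt is consumed as the assignment a="b.txt"; only c.txt is read.
bare_out=$(cd "$tmp" && awk '{ print FILENAME ": " $0 }' *.txt)

# Prefixed glob: every operand starts with ".", so both are read as files.
prefixed_out=$(cd "$tmp" && awk '{ print FILENAME ": " $0 }' ./*.txt)

printf '%s\n---\n%s\n' "$bare_out" "$prefixed_out"
```

Only c.txt shows up in the first run; the second run lists ./a=b.txt and ./c.txt.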
Also, beware that an argument like -vfoo=bar.txt may be treated as an option if you use:
awk -f file.awk -vfoo=bar.txt
(also applies to awk '{code}' -vfoo=bar.txt with the awk from busybox versions prior to 1.28.0, see corresponding bug report).
Again, using ./*.txt works around that (using a ./ prefix also helps with a file called - which otherwise awk understands as meaning standard input instead).
That's also why
#! /usr/bin/awk -f
shebangs don't really work. While the var=value ones can be worked around by fixing the ARGV values (add a ./ prefix) in a BEGIN statement:
#! /usr/bin/awk -f
BEGIN {
  for (i = 1; i < ARGC; i++)
    if (ARGV[i] ~ /^[_[:alpha:]][_[:alnum:]]*=/)
      ARGV[i] = "./" ARGV[i]
}
# rest of awk script
That won't help with the option ones as those are seen by awk and not the awk script.
One potential cosmetic issue with using that ./ prefix is that it ends up in FILENAME, but you can always use substr(FILENAME, 3) to strip it if you don't want it.
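Putting the ARGV fix and the FILENAME clean-up together might look something like the sketch below. The script name fix.awk and the input file are made up. The explicit a-zA-Z ranges are a swap for the POSIX character classes, on the assumption that some older awk builds don't support [:alpha:]. The strip is guarded so that it only fires when the name really starts with ./

```shell
# Sketch: prefix assignment-looking operands with ./ and hide the prefix on output.
# fix.awk and my=file.txt are hypothetical names.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/my=file.txt"

cat > "$tmp/fix.awk" <<'EOF'
BEGIN {
    # Turn operands that look like var=value into ./var=value so awk opens them.
    for (i = 1; i < ARGC; i++)
        if (ARGV[i] ~ /^[_a-zA-Z][_a-zA-Z0-9]*=/)
            ARGV[i] = "./" ARGV[i]
}
{
    # Drop the leading ./ again for display purposes.
    name = FILENAME
    if (substr(name, 1, 2) == "./")
        name = substr(name, 3)
    print name ": " $0
}
EOF

fixed_out=$(cd "$tmp" && awk -f fix.awk my=file.txt)
echo "$fixed_out"    # my=file.txt: hello
```

Note that this also strips a ./ that the user typed explicitly, which is usually harmless when the name is only used for display.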
The GNU implementation of awk fixes all those issues with its -E option.
After -E, gawk expects only the path of the awk script (where - still means stdin) and then a list of input file paths only (and there, not even - is treated specially).
It's specially designed for:
#! /usr/bin/gawk -E
shebangs where the arguments are always input files (note that you're still free to edit that ARGV list in a BEGIN statement).
You can also use it as:
gawk -e '...awk code here...' -E /dev/null *.txt
We use -E with an empty script (/dev/null) just to make sure those *.txt afterwards are always treated as input files, even if they contain = characters.
../foo, /path/to/foo and paths that are in a different encoding) -- in which case substr(FILENAME,3) won't be enough, or it's a one shot script where the user basically knows what the filenames are -- in which case s/he probably shouldn't bother with any of them containing = either ;-)
–
Dec 23 '18 at 04:09
./ is a problem, but that it may be undesirable under certain conditions, such as cases where the filename has to be included in the output, in which case ./ would be redundant and unnecessary, so you'll need to get rid of it somehow. Here's at least one example. As for the user knowing what the filenames are - well, in this case we also know what the filename is, but = still gets in the way of proper processing. So can a leading -.
– Sergiy Kolodyazhnyy
Dec 23 '18 at 08:53
./ prefix to work around that awk (mis)feature but then you end up with that ./ on output which you may want to strip. See how to check if the first line of file contain a specific string? as an example.
– Stéphane Chazelas
Dec 23 '18 at 09:18
./ but also the global (absolute path) / which makes awk interpret the argument as a file.
–
Oct 24 '19 at 15:39
To quote the gawk documentation (emphasis added):
Any additional arguments on the command line are normally treated as input files to be processed in the order specified. However, an argument that has the form var=value, assigns the value value to the variable var—it does not specify a file at all.
Why does the command stop and wait? Because in the form awk 'processing_script_here' my=file.txt there is no file specified by the above definition: my=file.txt is interpreted as a variable assignment, and if no file is given, awk reads stdin (also evident from strace, which shows that awk in such a command is waiting on a read(0, ...) syscall).
This is also documented in the POSIX awk specification; see the OPERANDS section and the assignments part of it.
Variable assignment is evident in awk '{print foo}' foo=bar /etc/passwd, where the value of foo is printed for every line in /etc/passwd. Specifying ./foo=bar or a full path, however, does work.
Note that running strace on awk '1' foo=bar as well as checking with cat foo=bar shows that this is an awk-specific issue: execve does show the filename as an argument passed, so shells have nothing to do with env variable assignments in this case.
Additionally, please note that awk '...script...' foo=bar will not cause environment variable creation by the shell, since environment variable assignments must precede a command to take effect. See POSIX Shell Grammar Rules, point number 7. Additionally, this can be verified via awk '{print ENVIRON["foo"]}' foo=bar /etc/passwd
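As a rough check of both points, the sketch below uses a made-up two-line input file instead of /etc/passwd so that the output is predictable. The square brackets around the ENVIRON lookup are only there to make an empty value visible.

```shell
# Sketch: foo=bar after the program text sets an awk variable, not an environment variable.
# The input file is hypothetical.
unset foo                              # make sure no foo is inherited from the caller
tmp=$(mktemp -d)
printf 'line1\nline2\n' > "$tmp/input"

# foo is visible as an awk variable for every input line.
var_out=$(awk '{ print foo }' foo=bar "$tmp/input")

# ENVIRON["foo"] stays empty: the shell never exported anything.
env_out=$(awk '{ print "[" ENVIRON["foo"] "]" }' foo=bar "$tmp/input")

printf '%s\n---\n%s\n' "$var_out" "$env_out"
```

The first command prints bar once per input line; the second prints [] once per line.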
awk '{ ... }' ./my=file.txt.
– Kevin E
Feb 12 '20 at 10:52