
From inside an AWK script, I can pass variables as arguments to external utilities:

awk 'BEGIN {
    filename = "path_to_file_without_space"
    "file " filename | getline
    print $0
}'

But if the variable contains spaces,

awk 'BEGIN {
    filename = "path to file with spaces"
    "file " filename | getline
    print $0
}'

I get the error

file: cannot open `path' (No such file or directory)

This suggests that the argument is split on whitespace, much the same way that a shell splits unquoted variables on whitespace. I thought of disabling shell field splitting by setting the shell's IFS to null, like so

"IFS= file " filename | getline

Or by setting IFS to null before running the AWK command, but neither option makes any difference. How can I avoid this field splitting?

Kusalananda
  • IFS does NOT control parsing of commands in a shell, it only controls some later steps, like 'expanding' $var or $(cmd) (if outside doublequotes) or executing read v1 v2 – dave_thompson_085 Apr 25 '20 at 04:30

2 Answers

5

You will have to quote the name of the file:

awk 'BEGIN {
    filename = "path to file with spaces"
    "file \"" filename "\"" | getline
    print
}'

or, as suggested in comments, for ease of reading,

awk 'BEGIN {
    DQ = "\042" # double quote (ASCII octal 42)
    filename = "path to file with spaces"
    "file " DQ filename DQ | getline
    print
}'

or, assuming this is part of a larger awk program,

BEGIN {
    SQ = "\047"
    DQ = "\042"
}

BEGIN {
    name = "filename with spaces"
    cmd = sprintf("file %s%s%s", DQ, name, DQ)

    cmd | getline
    close(cmd)

    print
}

That is, close the command when done with it to save on open file handles. Set up convenience "constants" in a separate BEGIN block (these blocks are executed in order). Create the command using sprintf into a separate variable. (Most of these things are obviously for longer or more complicated awk programs that need to present a readable structure to be maintainable; one could also imagine writing dquote() and squote() functions that quote strings.)

The left hand side of the "pipe" will evaluate to the literal string

file "path to file with spaces"

Basically, using cmd | getline makes awk call sh -c with a single argument, which is the string cmd. That string therefore must be properly quoted for executing with sh -c.
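To make the effect visible without depending on file, here is a minimal sketch. The helper name count_args is made up for illustration, and "set -- ...; echo $#" stands in for the real utility: it simply reports how many arguments sh ended up with after parsing the command string.

```shell
#!/bin/sh
# Illustrative only: count_args is a hypothetical helper, and
# "set -- ...; echo $#" stands in for the real utility.
# Note: -v processes backslash escapes, so this demo assumes the
# name contains no backslashes.
count_args() {
    awk -v name="$1" -v quoted="$2" 'BEGIN {
        if (quoted)
            cmd = "set -- \"" name "\"; echo $#"   # sh sees: set -- "path to ..."; echo $#
        else
            cmd = "set -- " name "; echo $#"       # sh sees: set -- path to ...; echo $#
        cmd | getline n
        close(cmd)
        print n
    }'
}

count_args "path to file with spaces" 0   # prints 5
count_args "path to file with spaces" 1   # prints 1
```

The unquoted string is parsed by sh into five words, the quoted one into a single word, which is exactly the difference between the failing and the working file command.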

The technical details are found in the POSIX standard:

expression | getline [var]

Read a record of input from a stream piped from the output of a command. The stream shall be created if no stream is currently open with the value of expression as its command name. The stream created shall be equivalent to one created by a call to the popen() function with the value of expression as the command argument and a value of r as the mode argument. As long as the stream remains open, subsequent calls in which expression evaluates to the same string value shall read subsequent records from the stream. The stream shall remain open until the close function is called with an expression that evaluates to the same string value. At that time, the stream shall be closed as if by a call to the pclose() function. If var is omitted, $0 and NF shall be set; otherwise, var shall be set and, if appropriate, it shall be considered a numeric string (see Expressions in awk).

The popen() function referred to here is the C popen() library function. This arranges for the given string to be executed by sh -c.

You'll have exactly the same issue with system() if executing a command using a filename with spaces, but in that case the C library's system() function is called, which also calls sh -c in a similar way to popen() (but with different plumbing of I/O streams).
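A sketch of the same fix applied to system(), assuming a temporary directory from mktemp -d and using ls -d in place of file. The helper name list_with_system is invented for this example.

```shell
#!/bin/sh
# Illustrative only: list_with_system is a hypothetical helper.
# The name is wrapped in double quotes before it reaches sh -c,
# just as with cmd | getline. Assumes the name has no backslashes
# (because of -v) and no characters special inside double quotes.
list_with_system() {
    awk -v name="$1" 'BEGIN {
        DQ = "\042"
        status = system("ls -d -- " DQ name DQ)
        exit (status != 0)
    }'
}

dir=$(mktemp -d)
: > "$dir/name with spaces"
list_with_system "$dir/name with spaces"
rm -r "$dir"
```

Without the DQ pair, ls would be handed three separate names and report each of them as missing.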

So, no amount of setting IFS to anything would help if sh -c was invoked with the single argument

file path to file with spaces
Kusalananda
  • In awk scripts where I need many quotes, I set up a var as -vq=\" or even {q=sprintf("%c",34)} – dave_thompson_085 Apr 25 '20 at 04:34
  • Always: BEGIN { SQ = "\047"; DQ = "\042"; } – Paul_Pedant Apr 25 '20 at 12:39
  • @Paul_Pedant Thanks, that makes it more readable. – Kusalananda Apr 25 '20 at 12:47
  • I would normally assemble the whole command like: cmd = sprintf ("file %s", SQ filename SQ); (a) Easy to debug. (b) You should close the pipework with close (cmd) after use. – Paul_Pedant Apr 25 '20 at 12:51
  • @Paul_Pedant Thanks, I've incorporated that. I will not add more style stuff as it distracts from the actual issue. – Kusalananda Apr 25 '20 at 13:00
  • @Paul_Pedant, but that makes the assumption that you're on an ASCII-based system (not too unreasonable an assumption these days). Using -v sq="'" -v dq='"' would make it more readable and portable. – Stéphane Chazelas Apr 25 '20 at 14:23
  • @StéphaneChazelas But inconvenient to use unless wrapped up in a shell script. – Kusalananda Apr 25 '20 at 14:25
  • If not wrapped up in a shell script, then it's just sq = "'"; dq = "\"". Still more readable and more portable. – Stéphane Chazelas Apr 25 '20 at 14:26
  • House style. My awk scripts are generally wrapped in '..', so putting sq = "'" inside a begin breaks it. I usually write my whole awk as a separate variable assignment, so the command line is some distance away, isolating the -v and -F options from the awk code. I dislike mixing escaped and syntactic quotes. Basically, I'm playing the %ages to avoid typos (especially hard-to-see ones). I also tend to declare BLK = " ";, NIL = "";, NUL = "\0";, and stderr = "cat 1>&2"; for readability and compatibility reasons. – Paul_Pedant Apr 25 '20 at 15:02
  • @Paul_Pedant I'm not entirely sure, but I think Stéphane possibly, maybe, implied using dq and sq as environment variables. – Kusalananda Apr 25 '20 at 15:32
  • @Kusalananda I'm seeing -v as specifying awk variables as command-line options. Env vars, or awk assignments in place of file args, would not like spaces around the =. I despair of sq = ENVIRON["sq"]; Maybe Stéphane refers to a shebang-awk file. I rely on my visual cues, and I trust my readership (3 or 4 benighted souls) will forgive me, but I prefer gsub (DQ, BLK) over gsub ("\"", " "). – Paul_Pedant Apr 25 '20 at 16:43
  • @Paul_Pedant :-) I will leave this comment thread be for now, as it's on a topic tangential to the question. – Kusalananda Apr 25 '20 at 16:52
  • Thanks for the detailed answer. That helps quite a bit. –  Apr 25 '20 at 20:38
3

Note that for arbitrary file names, spaces are the least of your worries. Consider for instance a file called $(reboot) or foo;reboot #whatever or foo|reboot|bar...

awk calls sh to interpret command lines in its cmdline | getline, print | cmdline, system(cmdline), so when building the command line out of arbitrary input, it's critical to properly escape arguments to avoid command injection vulnerabilities.
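As a harmless illustration of the risk, the sketch below wraps a name in double quotes only, with echo INJECTED standing in for something destructive. The helper name expand_demo is made up for the example.

```shell
#!/bin/sh
# Illustrative only: expand_demo is a hypothetical helper.
# Double quotes stop word splitting, but sh still performs
# command substitution inside them.
expand_demo() {
    awk 'BEGIN {
        name = "$(echo INJECTED)"
        cmd = "echo \"" name "\""    # sh sees: echo "$(echo INJECTED)"
        cmd | getline line
        close(cmd)
        print line
    }'
}

expand_demo   # prints INJECTED, not the literal name
```

The inner command runs even though the name was quoted, which is why double quotes are not sufficient for untrusted input.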

Quoting in shells is a tricky business. Shells have a great number of different quoting operators ('...', "...", \, $'...', $"...") all but '...' being potentially unsafe as they don't escape every character (in particular, they don't escape the \ character which is a dangerous one as its encoding is also found in the encoding of other characters in some charsets).

It's also important not to use the old `...` form of command substitution in the shell code, as it introduces another level of backslash processing.

Say you have the arbitrary file name in an environment variable:

#! /bin/sh -
FILE="${1?No file provided}"
export FILE

awk -v q="'" '
  function shquote(s) {
    gsub(q, "&\"&\"&", s)
    return q s q
  }
  BEGIN {
    cmdline = "file -- " shquote(ENVIRON["FILE"])
    if ((cmdline | getline) > 0)
      print "The first line of \""cmdline"\" output was \""$0"\"."
    else
      print "Could not read a line from \""cmdline"\" output."
    if (close(cmdline) != 0)
      print cmdline" failed."
  }'

Above, shquote() takes a string as argument and quotes it for sh by enclosing it in single quotes (the safest quotes), except that single quotes in the string itself are changed to '"'"', that is a closing ', followed by a ' quoted with "..." followed by another ' that reopens another single-quoted string.
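To check what shquote() produces, a small sketch can print the quoted form and then pass it through sh again. The helper name show_quoted and the NAME environment variable are assumptions made for this example; printf stands in for the real utility.

```shell
#!/bin/sh
# Illustrative only: show_quoted and NAME are hypothetical names.
# Line 1 of the output is the quoted form handed to sh;
# line 2 is what the command actually receives as its argument.
show_quoted() {
    NAME=$1 awk -v q="'" '
      function shquote(s) {
        gsub(q, "&\"&\"&", s)
        return q s q
      }
      BEGIN {
        quoted = shquote(ENVIRON["NAME"])
        print quoted
        cmd = "printf \"%s\\n\" " quoted
        cmd | getline line
        close(cmd)
        print line
      }'
}

show_quoted "it's \$(echo INJECTED)"
```

For the name it's $(echo INJECTED), the first line should read 'it'"'"'s $(echo INJECTED)' and the second should be the name itself, unchanged and unexpanded.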

You'll notice above a few other hints at other possible caveats:

  • you need a -- to make sure your file name is not taken as an option if it starts with -.
  • the output of that file command is not guaranteed to be on a single line, especially if the filename itself contains newline characters. After all, the newline character is as valid as any in a file name. getline only reads one record, records being lines by default. See Slurp-mode in awk? for hints as to how to read the whole output.
  • that output could also not have any line at all. To tell that from an empty first line, you'd need to check the return value of getline.
  • it's a good idea to check the exit status of the command as well to report problems if need be. That's done by looking at the value returned by close(). Note however that there are variations between awk implementations on how that value encodes the exit status. The only common thing between all is that that value is 0 when the command succeeds (exits with a 0 exit code).
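The points about multi-line output and exit status can be sketched as follows. The helper name read_all and the CMD environment variable are invented for this example, and CMD is assumed to be a trusted, already properly quoted command line. Only zero versus non-zero is tested on the close() value, since the exact encoding differs between awk implementations.

```shell
#!/bin/sh
# Illustrative only: read_all and CMD are hypothetical names.
# Reads every record of the command output, then reports whether
# close() returned zero (success) or anything else (failure).
read_all() {
    CMD=$1 awk 'BEGIN {
        cmd = ENVIRON["CMD"]
        n = 0
        while ((cmd | getline line) > 0)
            out[++n] = line
        status = close(cmd)
        for (i = 1; i <= n; i++)
            print i ": " out[i]
        print "status: " (status == 0 ? "ok" : "failed")
    }'
}

read_all 'printf "%s\n" one two'
read_all 'false'
```

The first call lists two numbered lines followed by "status: ok"; the second produces no lines and reports "status: failed".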