0

To start, this script runs in a GitHub workflow using shell bash; the YAML is truncated for readability. I've tried a number of things to make it work as a multiline script, so I can have comments.

  set -x
  set -e
  AWK_SOURCE=$( cat <<- AWK
  '
    {
      if ( $1 !~ /delete/ # ensure we are not trying to process deleted files
      && $4 !~ /theme.puml|config.puml/ # do not try to process our theme or custom config
      && $4 ~ /.puml/ ) # only process puml files
      { printf "%s ", $4 } # only print the file name and strip newlines for spaces
    }
    END { print "" } # ensure we do print a newline at the end
  '
  AWK
  )
  GIT_OUTPUT=`git diff-tree -r --no-commit-id --summary ${GITHUB_SHA}`
  AWK_OUPUT=`echo $GIT_OUTPUT | awk -F' ' $AWK_SOURCE`
  echo "::set-output name=files::$GIT_OUTPUT"
  set +e
  set +x

this is my current error

If I run it as a single line, it works fine

git diff-tree -r --no-commit-id --summary HEAD | awk -F' ' '{ if ( $1 !~ /delete/ && $4 !~ /theme.puml|config.puml/ && $4 ~ /.puml/ ) { printf "%s ", $4 } } END { print "" }'

this is the output/error I'm currently getting, I've gotten different ones.

shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
+ set -e
++ cat
+ AWK_SOURCE=''\''
  {
    if (  !~ /delete/ # ensure we are not trying to process deleted files
    &&  !~ /theme.puml|config.puml/ # do not try to process our theme or custom config
    &&  ~ /.puml/ ) # only process puml files
    { printf "%s ",  } # only print the file name and strip newlines for spaces
  }
  END { print "" } # ensure we do print a newline at the end
'\'''
++ git diff-tree -r --no-commit-id --summary 6c72c8a8dabf19ae2439ee506b9a4a636027193e
+ GIT_OUTPUT=' create mode 100644 .config/plantuml/config.puml
 create mode 100644 .config/plantuml/theme.puml
 delete mode 100644 config.puml
 create mode 100644 docs/README.md
 create mode 100644 docs/domain-model/README.md
 create mode 100644 docs/domain-model/user.md
 create mode 100644 docs/domain-model/user.puml
 delete mode 100644 theme.puml
 delete mode 100644 user.puml
 delete mode 100644 user.svg'
++ echo create mode 100644 .config/plantuml/config.puml create mode 100644 .config/plantuml/theme.puml delete mode 100644 config.puml create mode 100644 docs/README.md create mode 100644 docs/domain-model/README.md create mode 100644 docs/domain-model/user.md create mode 100644 docs/domain-model/user.puml delete mode 100644 theme.puml delete mode 100644 user.puml delete mode 100644 user.svg
++ awk '-F ' \' '{' if '(' '!~' /delete/ '#' ensure we are not trying to process deleted files '&&' '!~' '/theme.puml|config.puml/' '#' do not try to process our theme or custom config '&&' '~' /.puml/ ')' '#' only process puml files '{' printf '"%s' '",' '}' '#' only print the file name and strip newlines for spaces '}' END '{' print '""' '}' '#' ensure we do print a newline at the end \'
awk: cmd. line:1: '
awk: cmd. line:1: ^ invalid char ''' in expression
+ AWK_OUPUT=

how can I retain my multiline awk with comments?

muru
xenoterracide
  • copy/paste your script into http://shellcheck.net and it'll tell you about some of the issues with it. Why are you trying to save your awk script in a variable? – Ed Morton Mar 11 '21 at 19:36
  • @EdMorton that was just one of my attempts to solve this problem, I found that suggestion via the google, on stackoverflow, and I think some other stack exchange. – xenoterracide Mar 12 '21 at 17:39
  • If you got an answer to your question then see https://unix.stackexchange.com/help/someone-answers for what to do next. – Ed Morton Mar 17 '21 at 00:16

4 Answers

2

Your main issue is that the awk code isn't quoted, which makes the shell replace things like $4 in the code. To protect the code from the shell, make sure that the here-document is quoted. You get a quoted here-document by enclosing the starting delimiting word in quotes, as in <<'AWK' or <<"AWK", or by escaping it as <<\AWK.
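To make the difference concrete, here is a minimal sketch of an unquoted versus a quoted here-document. The variable names (expanded, literal, result) and the sample input line are made up for illustration, and the set -- line is only there so that $4 is known to be empty in the demo.

```shell
set --   # clear positional parameters so "$4" is empty in this demo

# Unquoted delimiter: the shell expands $4 inside the here-document,
# so the awk program text is damaged before awk ever sees it.
expanded=$(cat <<AWK
{ print $4 }
AWK
)

# Quoted delimiter: the body is taken literally, so awk receives
# both $4 and the comment unchanged.
literal=$(cat <<'AWK'
{
    print $4    # fourth field of each input line
}
AWK
)

printf 'expanded: %s\n' "$expanded"

# The variable must be double-quoted when handed to awk.
result=$(printf 'a b c d\n' | awk "$literal")
printf 'awk printed: %s\n' "$result"
```

The first printf shows the program with $4 already replaced by nothing, which is what the trace in the question shows as well; the second shows awk picking out the fourth field from the literal program.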

Here's a rewrite of your script the way I would write it:

git diff-tree -r --no-commit-id --summary "$GITHUB_SHA" |
awk '
    $1 !~ /^delete/ && $4 !~ /(theme|config)\.puml$/ && $4 ~ /\.puml$/ {
        name[++n] = $4
    }
    END {
        $0 = ""
        for (i in name) $i = name[i]
        printf "::set-output name=files::%s\n", $0
    }'

Note that I'm not storing intermediate data in variables. Doing so is inefficient (you may not know how much data you need to store in a variable) and invites quoting mistakes, which cause the shell to split values on whitespace and perform filename globbing. Your use of $GIT_OUTPUT and $AWK_SOURCE without quoting is problematic in this respect, and echo $GIT_OUTPUT is particularly troublesome since echo may modify the data if it contains backslashes, depending on the configuration of the shell.
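As a small sketch of the splitting problem, assuming bash and a made-up two-line value in a variable called data (not real git output), counting the lines awk receives shows how the unquoted expansion collapses the records into one:

```shell
# Hypothetical sample data, shaped like two lines of git diff-tree --summary output.
data=' create mode 100644 a.puml
 delete mode 100644 b.puml'

# Unquoted: the shell splits the value on whitespace and echo rejoins the
# words with single spaces, so awk sees a single line.
unquoted_lines=$(echo $data | awk 'END { print NR }')

# Quoted, with printf: both lines reach awk intact.
quoted_lines=$(printf '%s\n' "$data" | awk 'END { print NR }')

printf 'unquoted: %s line(s), quoted: %s line(s)\n' "$unquoted_lines" "$quoted_lines"
```

This is the same collapse that is visible in the echo line of the trace in the question.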

About quoting: See When is double-quoting necessary?

I'm using the standard pattern { action } syntax in the script to build up an array, name, of the strings that you want to parse out. In the END block, I create an output record, $0, and print it with the same ::set-output prefix that you printed with echo.

You could also do it like this, which leaves you a bit more room for comments:

git diff-tree -r --no-commit-id --summary "$GITHUB_SHA" |
awk '
    $1 ~ /^delete/ {
        # skip these
        next
    }
    $4 ~ /(theme|config)\.puml$/ {
        # and these...
        next
    }
    $4 ~ /\.puml$/ {
        # pick out filename (we assume no whitespace in filenames)
        name[++n] = $4
    }
    END {
        $0 = ""
        for (i in name) $i = name[i]
        printf "::set-output name=files::%s\n", $0
    }'

If you want to insist on having the awk source code in a here-document, I'd do it like this:

awk_script=$(mktemp) || exit 1
trap 'rm -f "$awk_script"' EXIT

cat <<'AWK_CODE' >"$awk_script"
$1 !~ /^delete/ && $4 !~ /(theme|config)\.puml$/ && $4 ~ /\.puml$/ {
    name[++n] = $4
}
END {
    $0 = ""
    for (i in name) $i = name[i]
    printf "::set-output name=files::%s\n", $0
}
AWK_CODE

git diff-tree -r --no-commit-id --summary "$GITHUB_SHA" | awk -f "$awk_script"

I.e., save the awk script to a temporary file that is invoked using awk -f later, and removed at the end of the script (here, using a trap). But for such a short awk program, I see no added benefit of doing this compared with using the script in a single-quoted string as shown first. It's messy and contains a lot of extra commands just for maintenance, apart from the two central commands that need to be executed.

Ed Morton
Kusalananda
  • why is the for loop better? – xenoterracide Mar 12 '21 at 17:40
  • @xenoterracide Sorry, what for loop? You mean the one in the awk code? It's just a way to turn an array into a record for printing with the value of OFS as the delimiter (a space by default), so that you don't have to think about inserting spaces manually or when to output a newline at the end. – Kusalananda Mar 12 '21 at 17:45
  • yeah, the one in the awk code. Ok. – xenoterracide Mar 12 '21 at 18:06
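The loop discussed in these comments can be tried in isolation. The following is only a sketch with made-up input (one, two, three) and a made-up variable name (joined); it shows that assigning to fields of an empty record makes awk rebuild $0 with OFS, a single space by default, between the values.

```shell
joined=$(printf 'one\ntwo\nthree\n' | awk '
    { name[++n] = $1 }                 # collect one value per input line
    END {
        $0 = ""                        # start from an empty record
        for (i in name) $i = name[i]   # assigning a field rebuilds $0 using OFS
        print                          # no manual spaces or trailing newline logic
    }')

printf '%s\n' "$joined"
```

Because each value is assigned to the field with its own index, the order in which for (i in name) visits the array does not affect the result.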
2

Put your code in functions, not variables, something like this (untested and still room for improvement):

set -x
set -e
do_awk() {
    awk '
        ($1 !~ /delete/) &&                 # ensure we are not trying to process deleted files
        ($4 !~ /theme.puml|config.puml/) && # do not try to process our theme or custom config
        ($4 ~ /.puml/) {                    # only process puml files
            printf "%s ", $4                # only print the file name and strip newlines for spaces
        }
        END { print "" }                    # ensure we do print a newline at the end
    ' "${@:--}"
}
GIT_OUTPUT=$(git diff-tree -r --no-commit-id --summary "$GITHUB_SHA")
AWK_OUTPUT=$(printf '%s\n' "$GIT_OUTPUT" | do_awk)
echo "::set-output name=files::$AWK_OUTPUT"
set +e
set +x
Ed Morton
  • hmm... I had problem because of the comments with the if, that no one mentioned, I'm curious though as to why your awk doesn't need an if, and what the "$@:--" does, and why I should put it in a function. – xenoterracide Mar 12 '21 at 17:44
  • An awk script is made up of a list of <condition> { <action> } statements where the action is executed if the condition is true. You CAN alternatively write <condition1> { if (<condition2>) <action> } since the default condition1 is true but that's just not necessary. The comments in your "if" weren't a problem best I can tell (you didn't say what problem you had with them) but white space matters in awk so you can't write a condition as foo \n&& bar, it has to be foo &&\n bar. – Ed Morton Mar 12 '21 at 17:55
  • "${@:--}" in a shell script or function means "read the args as files if present, otherwise read stdin" so you can call a script/function as script file or cat file | script or script then type input. Encapsulating code is the reason functions exist (and aliases for trivial cases). Variables are for holding values. – Ed Morton Mar 12 '21 at 17:57
  • I was getting syntax errors where the comments were in the if – xenoterracide Mar 12 '21 at 18:04
  • Ah, I see - that'd probably be because you were passing your variable to awk unquoted (awk -F' ' $AWK_SOURCE instead of awk -F' ' "$AWK_SOURCE") and so it was getting converted by the shell from multi-line to a single line and therefore every part of the script after the first # was treated as within the comment that started at that point, resulting in half a script. I expect shellcheck.net (see my first comment) warned you about that. – Ed Morton Mar 12 '21 at 18:06
  • even after I got rid of the variable. – xenoterracide Mar 12 '21 at 18:09
  • Then I have no idea what code you were executing or what the specific error message you got was and so I can't help you debug it. Don't just paste the error, also paste the code that resulted in that error and be sure to put it in your question, don't post it in a comment. – Ed Morton Mar 12 '21 at 18:09
  • yeah, I know, you were too fast for me, I was adding "I'll see if I can repro later, and paste the error." ;) – xenoterracide Mar 12 '21 at 18:10
  • Also make sure to run it through shellcheck first so we don't have to look at it if the tool tells you what the issue is. – Ed Morton Mar 12 '21 at 18:12
  • Also - if it's going to be another case where you're storing the awk script in a variable, I personally wouldn't want to even look at that to try to guess where the issues might be between what the shell is doing with it when defined vs when used and what awk is doing with it. – Ed Morton Mar 12 '21 at 18:15
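The "${@:--}" idiom described in the comments above can be sketched as follows, assuming bash. The function name count_lines, the temporary file, and the sample input are all made up for illustration; the function simply counts input lines so the two ways of calling it are easy to compare.

```shell
count_lines() {
    # With arguments, awk reads the named files; with none, "${@:--}"
    # expands to "-", which awk treats as standard input.
    awk 'END { print NR }' "${@:--}"
}

tmp=$(mktemp) || exit 1
printf 'a\nb\nc\n' >"$tmp"

from_file=$(count_lines "$tmp")               # called with a file argument
from_stdin=$(printf 'x\ny\n' | count_lines)   # called at the end of a pipeline

rm -f "$tmp"

printf 'from file: %s, from stdin: %s\n' "$from_file" "$from_stdin"
```

Wrapping the awk call in a function keeps the program text in one single-quoted string, so none of the variable quoting problems discussed above can arise.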
0

The easiest way (in terms of readability and maintainability) to my mind is to send your awk script to a temporary file to then be sourced by awk:

awksrc=$(mktemp) || exit 1
cat << 'EOF' > "${awksrc}"
{
  if ( $1 !~ /delete/ # ensure we are not trying to process deleted files
       && $4 !~ /theme.puml|config.puml/ # do not try to process our theme or custom config
       && $4 ~ /.puml/ 
  ) # only process puml files
      { printf "%s ", $4 } # only print the file name and strip newlines for spaces
}
END { print "" } # ensure we do print a newline at the end
EOF
echo "$GIT_OUTPUT" | awk -f "${awksrc}" 
rm -f "${awksrc}"
Ed Morton
DopeGhoti
0

I've never used GitHub Workflow, but the documentation says you can use a custom shell. It would seem if you say:

steps:
  - name: process puml files
    run: <your awk script here>
    shell: awk -f {0}

or some permutation thereof, you should be able to run the raw awk script without the shell shenanigans.