
I'm trying to generalize:

$ awk -F":" '{ print $7 ":" $1 }' /etc/passwd

into a script, with the delimiter, input file and selection of columns provided as command line arguments, something like:

#! /bin/bash
# parse command line arguments into variables `delimiter`, `cols` and `inputfile`
...    

awk -F"$delimiter" '{ print '"$cols"' }' "$inputfile"

Input comes from a file, but reading from stdin should also be possible. I would prefer to specify the columns as separate arguments, in a given order. The output delimiter is the same as the input delimiter, as in the example command.

How would you write such a script?

  • How much do you want to generalize it? What are the things that should be changeable from the command line of the script? Should it accept input from stdin? Do you want optional arguments with defaults or a fixed number of args? Should the output delimiter be the input delimiter? Do you want to give the complete inner part in $cols as an argument to the script or do you want to give one or many or a range of columns as arguments to your script? If you just want to cut two fields with a static delimiter use cut like this: cut -d: -f1,7. – Lucas Jul 22 '18 at 18:19
  • @Lucas: I want the delimiter, input file and selection of columns to be provided as command line arguments to the script. Input is from a file, but stdin input should also work. I would prefer to specify the columns as separate arguments, in order. Optional arguments are possible, if they make the script easier to write and use. The output delimiter is the same as the input delimiter, as in the example command. – Tim Jul 22 '18 at 18:23
  • @Tim How is this different from cut? How would you want the command line to look? Whatever it looks like, it's going to be a wrapper around cut, not awk. – Kusalananda Jul 22 '18 at 21:24
  • @Kusalananda cut and awk can both work. But awk is more powerful in general, and I feel it is always difficult to write a shell script wrapping an awk command, so I am trying to see how that is done in general. The design of the script's command line interface is open, as long as it is good and flexible. – Tim Jul 22 '18 at 22:03
  • @Tim Wrapping a general awk command cannot be done. Wrapping a specific awk command is easy. In this case though, the specific awk command degenerates to the cut utility, and the only thing that needs to be done by the wrapper is to sort out the command line arguments. If these are in the same form as with cut, then no wrapper is needed. – Kusalananda Jul 22 '18 at 22:13
  • @Kusalananda cut cannot reorder columns, but awk can, according to http://matt.might.net/articles/sql-in-the-shell/. So I use awk, not cut (see the short demo after these comments). – Tim Jul 22 '18 at 22:16
  • @Tim Well, that's a good point that wasn't mentioned in the question. You should add that there. – Kusalananda Jul 22 '18 at 22:39
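
For reference, here is a quick demonstration of the reordering point from the comments above (the exact output depends on the local passwd file; the lines below are illustrative):

$ cut -d: -f7,1 /etc/passwd | head -n 1     # cut emits selected fields in input order
root:/bin/bash
$ awk -F: '{ print $7 ":" $1 }' /etc/passwd | head -n 1     # awk prints them in the requested order
/bin/bash:root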

3 Answers

2

You can use bash's getopts builtin to do some command line parsing:

#!/bin/bash
# defaults: colon as delimiter, print fields 1 and 2
delimiter=:
first=1
second=2
# -d sets the delimiter, -f the first field, -s the second field
while getopts d:f:s: FLAG; do
  case $FLAG in
    d) delimiter=$OPTARG;;
    f) first=$OPTARG;;
    s) second=$OPTARG;;
    *) echo error >&2; exit 2;;
  esac
done
shift $((OPTIND-1))
# remaining operands are the input files; with none given, awk reads standard input
awk -F"$delimiter" -v "OFS=$delimiter" -v first="$first" -v second="$second" '{ print $first OFS $second }' "$@"
Lucas
  • Thanks. There can be an arbitrary number of fields selected. – Tim Jul 22 '18 at 19:38
  • You ought to pass first and second using -v too. – Kusalananda Jul 22 '18 at 19:42
  • If you want to select arbitrary fields you are very close to putting the whole awk script into a bash variable. And then it has to be given from the command line to your bash script and then you could just as well type out the literal awk command. In this sense awk itself would be the maximal generalisation of the bash script you are looking for. – Lucas Jul 22 '18 at 20:29
  • @Kusalananda and how do I access the ith field inside awk if i is an awk variable? – Lucas Jul 22 '18 at 20:29
  • awk -v i=7 '{ print $i }' – Kusalananda Jul 22 '18 at 21:20
  • @Kusalananda cool, I didn't know that. – Lucas Jul 22 '18 at 21:35
2

The following shell script takes an optional -d option to set the delimiter (tab is the default), as well as a mandatory -c option with a column specification.

The column specification is similar to that of cut but also allows for rearranging and duplicating the output columns, as well as specifying ranges backwards. Open ranges are also supported.

The file to parse is given on the command line as the last operand, or passed on standard input.

#!/bin/sh

delim='\t'   # tab is default delimiter

# parse command line option
while getopts 'd:c:' opt; do
    case $opt in
        d)
            delim=$OPTARG
            ;;
        c)
            cols=$OPTARG
            ;;
        *)
            echo 'Error in command line parsing' >&2
            exit 1
    esac
done
shift "$(( OPTIND - 1 ))"

if [ -z "$cols" ]; then
    echo 'Missing column specification (the -c option)' >&2
    exit 1
fi

# ${1:--} will expand to the filename or to "-" if $1 is empty or unset
cat "${1:--}" |
awk -F "$delim" -v cols="$cols" '
    BEGIN {
        # output delim will be same as input delim
        OFS = FS

        # get array of column specs
        ncolspec = split(cols, colspec, ",")
    }

    {
        # get fields of current line
        # (need this as we are rewriting $0 below)
        split($0, fields, FS)

        nf = NF     # save NF in case we have an open-ended range
        $0 = "";    # empty $0

        # go through given column specification and
        # create a record from it
        for (i = 1; i <= ncolspec; ++i)
            if (split(colspec[i], r, "-") == 1)
                # single column spec
                $(NF+1) = fields[colspec[i]]
            else {
                # column range spec

                if (r[1] == "") r[1] = 1    # open start range
                if (r[2] == "") r[2] = nf   # open end range

                if (r[1] < r[2])
                    # forward range
                    for (j = r[1]; j <= r[2]; ++j)
                        $(NF + 1) = fields[j]
                else
                    # backward range
                    for (j = r[1]; j >= r[2]; --j)
                        $(NF + 1) = fields[j]
            }

        print
    }'

There's a slight inefficiency in this, as the code needs to re-parse the column specification for each input line. If support for open-ended ranges is not needed, or if all lines are assumed to have exactly the same number of columns, a single pass over the specification can be done in the BEGIN block (or in a separate NR==1 block) to create an array of the fields that should be output.
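
A sketch of that variant, assuming open-ended ranges are not needed so the whole expansion can happen in the BEGIN block (the option parsing in the wrapper stays as above; only the awk program changes):

awk -F "$delim" -v cols="$cols" '
    BEGIN {
        OFS = FS

        # expand the column specification once into a flat list of field numbers
        ncolspec = split(cols, colspec, ",")
        nout = 0
        for (i = 1; i <= ncolspec; ++i)
            if (split(colspec[i], r, "-") == 1)
                out[++nout] = colspec[i]
            else {
                start = r[1] + 0; end = r[2] + 0
                step = (start <= end) ? 1 : -1
                for (j = start; j != end + step; j += step)
                    out[++nout] = j
            }
    }

    {
        # per line, just pick the precomputed fields
        split($0, fields, FS)
        $0 = ""
        for (i = 1; i <= nout; ++i)
            $(NF + 1) = fields[out[i]]
        print
    }'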

Missing: Sanity check for column specification. A malformed specification string may well cause weirdness.
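
As a minimal sketch, a guard in the wrapper before awk is called could reject the most obviously malformed strings (stray characters, empty items, leading or trailing commas), though it would not catch every nonsensical spec such as 1--2:

case $cols in
    *[!0-9,-]*|*,,*|,*|*,)
        echo 'Malformed column specification' >&2
        exit 1
        ;;
esac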

Testing:

$ cat file
1:2:3
a:b:c
@:(:)
$ sh script.sh -d : -c 1,3 <file
1:3
a:c
@:)
$ sh script.sh -d : -c 3,1 <file
3:1
c:a
):@
$ sh script.sh -d : -c 3-1,1,1-3 <file
3:2:1:1:1:2:3
c:b:a:a:a:b:c
):(:@:@:@:(:)
$ sh script.sh -d : -c 1-,3 <file
1:2:3:3
a:b:c:c
@:(:):)
Kusalananda
-1

Thanks for the replies. Here is my script. I created it by trial and error, which doesn't always lead to a working solution, and I don't have a systematic way of coming up with a script, which is what I always aim for. Please provide some code review if you can. Thanks.

The script works in the following examples (I am not sure whether it works in general):

$ projection -d ":" /etc/passwd 4 3 6 7

$ projection -d "/" /etc/passwd 4 3 6 7

Script projection is:

#! /bin/bash

# default arg value
delim="," # CSV by default
# Parse flagged arguments:
while getopts "td:" flag
do
  case $flag in
    d) delim=$OPTARG;;
    t) delim="\t";;
    ?) exit;;
  esac
done
# Delete the flagged arguments:
shift $(($OPTIND -1))

inputfile="$1"
shift 1

fs=("$@")
# prepend "$" to each field number                                                                                                                                                
fields=()
for f in "${fs[@]}"; do
    fields+=(\$"$f")
done

awk -F"$delim" "{ print $(join_by.sh " \"$delim\" " "${fields[@]}") }" "$inputfile"

where join_by.sh is

#! /bin/bash

# https://stackoverflow.com/questions/1527049/join-elements-of-an-array
# https://stackoverflow.com/a/2317171/

# get the separator:
d="$1";
shift;

# interpolate the other parameters with the separator,
# treating the first parameter specially
echo -n "$1";
shift;
printf "%s" "${@/#/$d}";
Tim
  • Your shell script is the same as (IFS="$delim"; echo "${fields[*]}"). – Kusalananda Jul 23 '18 at 16:21
  • I dislike the fact that you inject shell code into the awk script. It would be safer to pass a list of field numbers as a string, and then let awk do a tiny bit of looping. – Kusalananda Jul 23 '18 at 16:23
  • @Kusalananda (IFS="$delim"; echo "${fields[*]}") works only when the delimiter is a single character, not when it is a string. Or am I wrong? – Tim Jul 25 '18 at 17:42
  • No, that's correct, only the first character of IFS will be used. – Kusalananda Jul 25 '18 at 17:49
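
A sketch of what the last two comments suggest: pass the field numbers to awk as a single string and loop inside awk, instead of injecting shell-built code into the awk program. The option handling mirrors the original projection script; this is an illustrative rewrite, not the original:

#! /bin/bash

delim="," # CSV by default
while getopts "td:" flag
do
  case $flag in
    d) delim=$OPTARG;;
    t) delim=$'\t';;   # a literal tab character
    ?) exit 2;;
  esac
done
shift $(($OPTIND - 1))

inputfile="$1"
shift 1

# hand the field numbers to awk as one space-separated string
awk -F"$delim" -v OFS="$delim" -v list="$*" '
    BEGIN { n = split(list, f, " ") }
    {
        # build the output record from the requested fields, in the given order
        out = $(f[1])
        for (i = 2; i <= n; ++i)
            out = out OFS $(f[i])
        print out
    }' "$inputfile"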