The following shell script takes an optional -d option to set the delimiter (tab is the default), as well as a mandatory -c option with a column specification.
The column specification is similar to that of cut, but it also allows for rearranging and duplicating the output columns, as well as for specifying ranges backwards. Open ranges are also supported.
The file to parse is given on the command line as the last operand, or passed on standard input.
#!/bin/sh
delim='\t'  # tab is default delimiter
# parse command line option
while getopts 'd:c:' opt; do
    case $opt in
        d)
            delim=$OPTARG
            ;;
        c)
            cols=$OPTARG
            ;;
        *)
            echo 'Error in command line parsing' >&2
            exit 1
    esac
done
shift "$(( OPTIND - 1 ))"
if [ -z "$cols" ]; then
    echo 'Missing column specification (the -c option)' >&2
    exit 1
fi
# ${1:--} will expand to the filename or to "-" if $1 is empty or unset
cat "${1:--}" |
awk -F "$delim" -v cols="$cols" '
    BEGIN {
        # output delim will be same as input delim
        OFS = FS
        # get array of column specs
        ncolspec = split(cols, colspec, ",")
    }
    {
        # get fields of current line
        # (need this as we are rewriting $0 below)
        split($0, fields, FS)
        nf = NF   # save NF in case we have an open-ended range
        $0 = "";  # empty $0
        # go through given column specification and
        # create a record from it
        for (i = 1; i <= ncolspec; ++i)
            if (split(colspec[i], r, "-") == 1)
                # single column spec
                $(NF+1) = fields[colspec[i]]
            else {
                # column range spec
                if (r[1] == "") r[1] = 1   # open start range
                if (r[2] == "") r[2] = nf  # open end range
                if (r[1] < r[2])
                    # forward range
                    for (j = r[1]; j <= r[2]; ++j)
                        $(NF + 1) = fields[j]
                else
                    # backward range
                    for (j = r[1]; j >= r[2]; --j)
                        $(NF + 1) = fields[j]
            }
        print
    }'
There's a slight inefficiency in this, as the code needs to re-parse the column specification for each new line. If support for open-ended ranges is not needed, or if all lines are assumed to have exactly the same number of columns, the specification could instead be parsed just once, in the BEGIN block (or in a separate NR==1 block), to create an array of the fields that should be output.
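As a rough illustration of that single-pass idea, here is a minimal sketch that expands the specification once in the BEGIN block. It assumes that every range is closed (no "2-" or "-3"), since NF is not yet known in BEGIN. The function name select_cols and its argument order are made up for this example and are not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper: select_cols DELIM COLSPEC [FILE]
# Assumption: COLSPEC contains only single columns and closed ranges.
select_cols () {
    awk -F "$1" -v cols="$2" '
        BEGIN {
            OFS = FS
            # expand the specification once into out[1..nout]
            ncolspec = split(cols, colspec, ",")
            for (i = 1; i <= ncolspec; ++i)
                if (split(colspec[i], r, "-") == 1)
                    out[++nout] = colspec[i] + 0
                else {
                    lo = r[1] + 0   # force numeric comparison
                    hi = r[2] + 0
                    step = (lo < hi) ? 1 : -1
                    for (j = lo; j != hi + step; j += step)
                        out[++nout] = j
                }
        }
        {
            # build the output record from the precomputed list
            line = ""
            for (i = 1; i <= nout; ++i)
                line = line (i > 1 ? OFS : "") $(out[i])
            print line
        }' "${3:--}"
}

# Example call: reversed range followed by a single column
printf '1:2:3\na:b:c\n' | select_cols : 3-1,2
```

The per-line work is then reduced to a single loop over the precomputed list. Supporting open ranges this way would mean doing the expansion in an NR==1 block instead, under the assumption that all lines have the same number of columns.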
Missing: Sanity check for column specification. A malformed specification string may well cause weirdness.
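One way such a check might look is sketched below. It assumes that a valid specification is a comma-separated list in which each item is N, N-M, N-, -M, or a bare - (all columns), with N and M being positive integers. The function name valid_colspec is hypothetical, and the exact set of accepted forms is a design choice rather than something the script above defines.

```shell
#!/bin/sh
# Hypothetical helper: valid_colspec COLSPEC
# Exit status 0 if every comma-separated item looks like
# N, N-M, N-, -M or "-" (N and M positive integers), 1 otherwise.
valid_colspec () {
    awk -v cols="$1" '
        BEGIN {
            n = split(cols, spec, ",")
            if (n == 0) exit 1              # empty specification
            for (i = 1; i <= n; ++i)
                if (spec[i] == "" ||
                    spec[i] !~ /^([1-9][0-9]*)?(-([1-9][0-9]*)?)?$/)
                    exit 1                  # empty or malformed item
        }'
}

# Example: reject a specification with a non-numeric item
if ! valid_colspec '3-1,x'; then
    echo 'Malformed column specification' >&2
fi
```

In the script, a call like this could go right after the test for an empty $cols, exiting with an error message when the check fails.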
Testing:
$ cat file
1:2:3
a:b:c
@:(:)
$ sh script.sh -d : -c 1,3 <file
1:3
a:c
@:)
$ sh script.sh -d : -c 3,1 <file
3:1
c:a
):@
$ sh script.sh -d : -c 3-1,1,1-3 <file
3:2:1:1:1:2:3
c:b:a:a:a:b:c
):(:@:@:@:(:)
$ sh script.sh -d : -c 1-,3 <file
1:2:3:3
a:b:c:c
@:(:):)
$cols as an argument to the script, or do you want to give one or many or a range of columns as arguments to your script? If you just want to cut two fields with a static delimiter, use cut like this: cut -d: -f1,7. – Lucas Jul 22 '18 at 18:19
cut? How would you want the command line to look? Whatever it looks like, it's going to be a wrapper around cut, not awk. – Kusalananda Jul 22 '18 at 21:24
cut and awk can both work. But awk is more powerful in general, and I feel it is always difficult to write a shell script wrapping an awk command, so I am trying to see how that is done in general. The design of the command line interface of the script is up to being good and flexible. – Tim Jul 22 '18 at 22:03