3

I have this awk statement that reads a YAML file and outputs a particular value. I need to loop this awk inside a loop where I read a key value from a list of values and pass that key to awk.

The YAML file has this structure:

abc:
  NAME: Bob
  OCCUPATION: Technician
def:
  NAME: Jane
  OCCUPATION: Engineer

Say I want the OCCUPATION value for key abc, which is Technician. Through googling I managed to construct an awk statement that gives what I want:

> awk 'BEGIN{OFS=""} /^[^ ]/{ f=/^abc:/; next } f{ if (sub(/:$/,"")) abc=$2; else print abc,$1 $2}' test.yml| grep "OCCUPATION:" | cut -d':' -f2
Technician

However, passing the key with awk's -v option does not seem to output anything when I use this loop:

items="abc,def"
for item in $(echo $items | sed "s/,/ /g"); 
do
 echo $item;
 awk -v name="$item" 'BEGIN{OFS=""} /^[^ ]/{ f=/^\name:/; next } f{ if (sub(/:$/,"")) name=$2; else print name,$1 $2}' test.yml| grep "OCCUPATION:" | cut -d':' -f2; 
done

I get only the debug echoes I added:

abc
def

Where am I going wrong? I thought the variable should be interpreted correctly inside awk?

EDIT: Based on steeldriver's comment I have changed the input a little:

items="abc,def"
for item in $(echo $items | sed "s/,/ /g"); 
do
 echo $item;
 awk -v name="$item" 'BEGIN{OFS=""} /^[^ ]/{ f=name; next } f{ if (sub(/:$/,"")) name=$2; else print name,$1 $2}' test.yml| grep "OCCUPATION:" | cut -d':' -f2; 
done

However, now every OCCUPATION value is printed for each key:

abc
Technician
Engineer
def
Technician
Engineer

I tried the ~ operator, but I don't think I am using it correctly because it gives me errors, so I decided to pass the value directly, but that produces duplicates :/

cas
  • You can't use a variable inside a regexp constant /.../, you need to use the explicit ~ operator - see Pass shell variable as a /pattern/ to awk – steeldriver Apr 19 '21 at 00:50
  • I tried something similar without the ~ operator, just the value passed and it seems to work but gives duplicate results :/ – user3674993 Apr 19 '21 at 01:07
  • The variable equivalent to f=/^abc:/ would be something like f = ($0 ~ "^" name ":") however I can't help feeling you would be better served by a proper YAML-aware tool – steeldriver Apr 19 '21 at 01:15
  • That works, but I can see how this is not a very clean approach! Unfortunately I am constrained to base bash and can't install jq or use python or anything else to process this :( – user3674993 Apr 19 '21 at 01:19
  • Do you have perl installed? The YAML module has been included as one of the standard core perl modules since around perl v5.14 in 2013. Try running perl --version, if it's >= 5.14, you have the YAML module. Also try perl -MYAML -e '' - if that compiles without error, you definitely have the YAML module installed. – cas Apr 19 '21 at 05:47
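
As a rough sketch of what the comments above suggest (not a YAML parser, and assuming the flat two-level layout of the sample file): in the edited loop, f=name assigns a non-empty string, which awk treats as true on every line, so every block gets printed. Building the regexp dynamically and applying it with ~ restricts the match to the requested block. The function name get_occupations is made up for illustration, and a key containing regexp metacharacters would need escaping first.

```shell
# Hypothetical helper: print each key followed by its OCCUPATION value.
# Usage: get_occupations FILE "key1,key2"
# Assumes top-level keys start in column 1 and their fields are indented.
get_occupations() {
  file=$1
  items=$2
  for item in $(printf '%s\n' "$items" | tr ',' ' '); do
    printf '%s\n' "$item"
    awk -v name="$item" '
      # Top-level line: set the flag only if it is the requested key.
      /^[^ ]/ { f = ($0 ~ "^" name ":"); next }
      # Inside the requested block: strip "  OCCUPATION: " and print the rest.
      f && $1 == "OCCUPATION:" { sub(/^[^:]*:[ \t]*/, ""); print }
    ' "$file"
  done
}
```

Called as get_occupations test.yml "abc,def", this should print each key once, followed only by its own OCCUPATION value, without the grep and cut stages.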

6 Answers

7

When working with structured text like YAML or JSON or XML, you really should use a parser that "understands" the structure. There are several specific command-line tools for various kinds of structured text (e.g. xmlstarlet for xml, jq for json, and yq for yaml), and most programming/scripting languages have libraries for parsing and processing structured text.

Here's how to do it in perl, using the perl core YAML module:

(this requires a version of perl >= 5.14, which is when the YAML module was included as a standard part of the core module distribution. perl 5.14 was released in 2013. For earlier versions of perl, you can install YAML with cpan).

#!/usr/bin/perl

use strict;
use YAML qw(LoadFile);

my $file = shift;            # first arg is the input filename
my $data = LoadFile($file);  # load the yaml data into a hashref variable

# loop over the remaining args (i.e. the keys)
foreach my $item (@ARGV) {
  print "$item\n";
  print $$data{$item}{'OCCUPATION'}, "\n";
}

Save this as, e.g. yaml.pl and make it executable with chmod +x yaml.pl.

If your yaml data is saved in a file called input.yaml, you can run it like this:

$ ./yaml.pl input.yaml abc def
abc
Technician
def
Engineer

Like awk or sed, this can also be condensed into an inscrutable one-liner:

$ perl -MYAML=LoadFile -E '$data=LoadFile(shift);foreach (@ARGV) {say $_;say $$data{$_}{"OCCUPATION"}}' input.yaml abc def
abc
Technician
def
Engineer

perl can also automatically split the arguments for you. e.g. if you change the foreach loop to:

foreach my $item (split /\s*,\s*/,join(",",@ARGV)) {

you can run it as:

$ ./yaml.pl input.yaml abc def

or

$ ./yaml.pl input.yaml "abc,def"

or any combination (assuming hypothetical ghi and jkl keys):

$ ./yaml.pl input.yaml "abc,def" ghi jkl
cas
  • Note: once you've written a little perl (or whatever) script to extract the data you want, you can always use it within a shell script, same as you can call awk or yq or sed or whatever. Part of the idea of unix, and the point of learning languages like awk or sed or perl, is to use them to build your own custom tools when you need them. Some of them will be general purpose tools that you use often, some will be one-off disposable tools that you never use again. Learning to think like a tool-maker is a valuable habit that will pay off for the rest of your life. – cas Apr 19 '21 at 07:18
6

Using yq (the jq wrapper from https://kislyuk.github.io/yq/) to parse the YAML on the command line (or in a script):

$ yq -r '.abc.OCCUPATION' file.yml
Technician

Giving it abc and def in a shell loop:

$ for thing in abc def; do yq -r --arg node "$thing" '$node,.[$node].OCCUPATION' file.yml; done
abc
Technician
def
Engineer

or, for tab-delimited columns:

$ for thing in abc def; do yq -r --arg node "$thing" '[$node,.[$node].OCCUPATION] | @tsv' file.yml; done
abc     Technician
def     Engineer

That is, call yq with --arg followed by the yq variable's name that you want to set, and then the value that you're setting it to. Then use the variable in the yq expression. This works identically in jq.

Without a shell loop and instead taking the values from the top-level keys:

$ yq -r 'foreach keys[] as $node (.;.;[$node,.[$node].OCCUPATION]|@tsv)' file.yml
abc     Technician
def     Engineer

There are a few other tools called yq out there that all do YAML parsing. If you install yq using snap on Ubuntu, you get a version from someone called Mike Farah. It works differently, and I tend to use it to convert to JSON and then pipe the data to jq:

$ yq -j e file.yml | jq -r '.abc.OCCUPATION'
Technician
$ for thing in abc def; do yq -j e file.yml | jq -r --arg node "$thing" '$node,.[$node].OCCUPATION'; done
abc
Technician
def
Engineer

or, for tab-delimited columns:

$ for thing in abc def; do yq -j e file.yml | jq -r --arg node "$thing" '[$node,.[$node].OCCUPATION] | @tsv'; done
abc     Technician
def     Engineer
Kusalananda
3

You don't need a shell loop to process simple text when you have a proper text-processing tool such as awk. The following uses GNU awk, which lets us define a multi-character RS and provides RT, which holds the text that matched the current RS:

$ awk -v RS='(^|\n)[a-z]+:\n' 'rt ~ /^abc:\n$/ { print $NF; exit } { rt=RT }' infile
Technician

To strictly check that the reported value really belongs to the "OCCUPATION" key, and to pass the header and key in variables instead of hardcoding them, you could do:

$ awk -v hdr='abc' -v key='OCCUPATION' -v RS='(^|\n)[a-z]+:\n' -F'\n' \
'rt ~ ("^" hdr ":\n") { 
     for(i=1; i<=NF; i++)
         if(match($i, "^\\s*" key ":\\s*" )) { print substr($i, RSTART+RLENGTH); exit }
}
{ rt=RT }' infile
Technician
αғsнιη
3

Using any POSIX awk:

$ awk -v key='abc' -v fld='OCCUPATION' '
    /^[^[:space:]]/{ inKeyBlock = (index($1,key":")==1) }
    inKeyBlock && (index($1,fld":")==1) { sub(/[^:]*:[[:space:]]*/,""); print }
' file
Technician

In the unlikely event that you don't have a POSIX awk or some other awk that supports character classes then just change [[:space:]] to [ \t], and [^[:space:]] to [^ \t].
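
To handle the comma-separated list from the question without a shell loop, the same approach could be extended roughly as follows. This is only a sketch under the same flat-layout assumption; the names lookup_fields, keys, want and cur are invented for illustration, and it uses the [ \t] fallback so it does not depend on character-class support.

```shell
# Hypothetical wrapper around a single awk call.
# Usage: lookup_fields FILE "key1,key2" FIELD
# Prints "key<TAB>value" for each requested key, in file order.
lookup_fields() {
  awk -v keys="$2" -v fld="$3" '
    # Turn the comma-separated list into a lookup table.
    BEGIN { n = split(keys, k, ","); for (i = 1; i <= n; i++) want[k[i]] = 1 }
    # Top-level line: remember the key name and whether it was requested.
    /^[^ \t]/ { cur = $1; sub(/:$/, "", cur); inKeyBlock = (cur in want); next }
    # Requested block: strip everything up to the first colon and print.
    inKeyBlock && index($1, fld ":") == 1 { sub(/[^:]*:[ \t]*/, ""); print cur "\t" $0 }
  ' "$1"
}
```

For example, lookup_fields file "abc,def" OCCUPATION reads the file once. Note that the output follows the order of the blocks in the file, not the order of the keys in the list.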

Ed Morton
  • Do you prefer [[:space:]] to [[:blank:]]? I mean, a space and a tab is what [[:blank:]] matches... At least in the POSIX locale. – Kusalananda Apr 19 '21 at 15:16
  • Yes, I always use [[:space:]] unless I need to exclude newline and a few other whitespace chars and then I use [[:blank:]], that way I don't get a surprise when I'm writing code that can contain multi-line records (or evolves to such). I also find the name [[:blank:]] to be a little misleading since it sounds like it's only matching a blank char whereas IMHO [[:space:]] is a bit clearer as matching all spaces but that's a very minor thing. – Ed Morton Apr 19 '21 at 15:18
1

Also using awk:

awk -F'[[:space:]]+' '$1 == "" {if (s == "abc:" && $2 == "OCCUPATION:") print $3; next} {s=$1}' file
Technician

This would fail if the occupation was e.g. "Network technician", or anything containing a space though. So to guard against that:

awk -F'[[:space:]]+' '$1 == "" {if (s == "abc:" && $2 == "OCCUPATION:") { sub(/[^:]*:[[:space:]]*/,""); print }; next} {s=$1}' file
Technician

The { sub(/[^:]*:[[:space:]]*/,""); print } used here in place of print $3 is taken from Ed Morton's solution.

Kusalananda
0
# var definitions to be used by sed
key='abc'
subkey='OCCUPATION'
s='[[:blank:]]'

# make vars plug worthy in LHS of sed
for i in \\ \[ ^ \$ . \* /;do
  key=${key//"$i"/\\"$i"}
  subkey=${subkey//"$i"/\\"$i"}
done

# invoke sed with the variables
sed -ne "
  /^$key:\$/,/^[^:]*:\$/ s/^$s$s*$subkey:$s*//p
" input.yaml
Technician
guest_7