I want to write a data parser script. The example data is:

name: John Doe
description: AM
email: john@doe.cc
lastLogon: 999999999999999
status: active
name: Jane Doe
description: HR
email: jane@doe.cc
lastLogon: 8888888888
status: active
...
name: Foo Bar
description: XX
email: foo@bar.cc
status: inactive

The key-value pairs are always in the same order (name, description, email, lastLogon, status), but some of the fields may be missing. It is also not guaranteed that the first record is complete.

The expected output is delimiter-separated (e.g. CSV) values:

John Doe,AM,john@doe.cc,999999999999999,active
Jane Doe,HR,jane@doe.cc,8888888888,active
...
Foo Bar,XX,foo@bar.cc,n/a,inactive

My solution uses a while read loop. The main part of my script:

while read line; do
    grep -q '^name:' <<< "$line" && status=''
    case "${line,,}" in
        name*) # capture value ;;
        desc*) # capture value ;;
        email*) # capture value ;;
        last*) # capture value ;;
        status*) # capture value ;;
    esac
    if test -n "$status"; then
        printf '%s,%s,%s,%s,%s\n' "${name:-n/a}" ... etc ...
        unset name ... etc ...
    fi
done < input.txt

This works, but it is obviously very slow. The execution time with 703 lines of data:

real    0m37.195s
user    0m2.844s
sys     0m22.984s

I'm considering an awk approach, but I'm not experienced enough with it.

annahri
  • 2,075

2 Answers

The following awk program should work. Ideally, you would save it to a separate file (e.g. squash_to_csv.awk):

#!/bin/awk -f

BEGIN {
    FS=": *"
    OFS=","
    recfields=split("name,description,email,lastLogon,status",fields,",")
}

function printrec(record) {
    for (i=1; i<=recfields; i++) {
        if (record[i]=="") record[i]="n/a"
        printf "%s%s",record[i],i==recfields?ORS:OFS;
        record[i]="";
    }
}

$1=="name" && (FNR>1) {
    printrec(current)
}

{
    for (i=1; i<=recfields;i++) {
        if (fields[i]==$1) {
            current[i]=$2
            break
        }
    }
}

END {
    printrec(current)
}

You can then call this as

awk -f squash_to_csv.awk input.dat
John Doe,AM,john@doe.cc,999999999999999,active
Jane Doe,HR,jane@doe.cc,8888888888,active
Foo Bar,XX,foo@bar.cc,n/a,inactive

This will perform some initialization in the BEGIN block:

  • set the input field separator to "a : followed by zero or more spaces"
  • set the output field separator to ,
  • initialize an array of field names (we take a static approach and hard-code the list)
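
To see what this field separator does in practice, here is a small sketch. The helper name show_fields is made up for illustration and is not part of the program above; it just prints the fields awk sees for one line, joined with |. Note how a value that itself contains a colon followed by a space is split into more than two fields, so $2 alone would only hold part of it.

```shell
# Hypothetical helper (not part of the answer's program): print the fields
# that awk sees when FS is ": *", joined with "|".
show_fields() {
    printf '%s\n' "$1" | awk -F': *' '{
        out = $1
        for (i = 2; i <= NF; i++) out = out "|" $i
        print out
    }'
}

show_fields 'name: John Doe'             # prints: name|John Doe
show_fields 'description: HR: payroll'   # prints: description|HR|payroll
```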

If the name field is encountered, it will check if it is on the first line of the file, and if not, print the previously collected data. It will then start collecting the next record in the array current, beginning with the name field just encountered.

For all other lines (I assume for simplicity that there are no empty or comment lines - but then again, this program should just silently ignore those), the program checks which of the fields is mentioned on the line, and stores the value at the appropriate position in the current array used for the current record.

The function printrec takes such an array as parameter and performs the actual output. Missing values are substituted with n/a (or any other string you may want to use). After printing, the fields are cleared so that the array is ready for the next bunch of data.

At the end, the last record is also printed.

Note

  1. If the "value" part of the file can also include :-space-combinations, you can harden the program by replacing
    current[i]=$2
    
    by
    sub(/^[^:]*: */,"")
    current[i]=$0
    
    which will set the value to "everything after the first :-space combination" on the line, by removing (sub) everything up to and including the first :-space combination on the line.
  2. If any of the fields can contain the output separator character (in your example ,), you will have to take appropriate measures to either escape that character or quote the output, depending on the standard you want to adhere to.
  3. As you correctly noted, shell loops are very much discouraged as tools for text processing. If you are interested in reading more, you may want to look at this Q&A.
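
As a rough sketch of the quoting mentioned in note 2, assuming you want the common CSV convention (wrap a field in double quotes if it contains a comma or a double quote, and double any embedded double quotes): the names csvquote and tsv_to_csv are made up for this example, and it reads tab-separated input only to keep the demonstration self-contained.

```shell
# Hypothetical demo: convert tab-separated lines to CSV, quoting fields
# only where the CSV convention requires it.
tsv_to_csv() {
    awk -F'\t' '
    # Assumed helper: quote s if it contains a comma or a double quote.
    function csvquote(s) {
        if (s ~ /[",]/) {
            gsub(/"/, "\"\"", s)    # double every embedded double quote
            s = "\"" s "\""         # wrap the whole field in double quotes
        }
        return s
    }
    {
        for (i = 1; i <= NF; i++)
            printf "%s%s", csvquote($i), (i < NF ? "," : "\n")
    }'
}

printf 'Doe, John\tsays "hi"\tactive\n' | tsv_to_csv
# prints: "Doe, John","says ""hi""",active
```

In the program above, the same idea would amount to defining csvquote next to printrec and printing csvquote(record[i]) instead of record[i].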
AdminBee
  • 22,803
$ cat tst.awk
BEGIN {
    OFS = ","
    numTags = split("name description email lastLogon status",tags)
}
{
    tag = val = $0
    sub(/ *:.*/,"",tag)
    sub(/[^:]+: */,"",val)
}
(tag == "name") && (NR>1) { prt() }
{ tag2val[tag] = val }
END { prt() }

function prt(   tagNr,tag,val) {
    for ( tagNr=1; tagNr<=numTags; tagNr++ ) {
        tag = tags[tagNr]
        val = ( tag in tag2val ? tag2val[tag] : "n/a" )
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
    delete tag2val
}

$ awk -f tst.awk file
John Doe,AM,john@doe.cc,999999999999999,active
Jane Doe,HR,jane@doe.cc,8888888888,active
Foo Bar,XX,foo@bar.cc,n/a,inactive

If you want a header line printed too then just add this to the end of the BEGIN section:

for ( tagNr=1; tagNr<=numTags; tagNr++ ) {
    tag = tags[tagNr]
    printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
}
Ed Morton
  • 31,617