2

I am searching for an efficient way using (UNIX) command line tools to read a flat file into a(ny) database, perhaps after first converting to an intermediate structured format, for example XML or csv. The flat file has multiple records with lines containing each a single key and value, such as follows. The number of variables may vary between records and their number and names are only known after reading the input file. (A further complication is, there may be some nesting of repeated variables, but this may be ignored for the time being.)

[RecordUUID.1]
"Variable1Key"="Variable1Value"
"Variable2Key"="Variable2Value"
"Variable3Key"="Variable3Value"

[RecordUUID.4]
"Variable1Key"="Variable1Value"
"Variable5Key1"="Variable51Value1"
"Variable5Key1"="Variable51Value2"
"Variable5Key2"="Variable52Value1"
"Variable5Key2"="Variable52Value2"

I have checked answers given for transforming rows to columns and other answers on this site but none seemed to fit. The problem seems similar to reading INI or VCARD files, but I could not find a generic solution; an XSL transformation may be conceivable but was not found. Any pointers would be more than welcome, thanks.

1 Answers1

0

I'm not sure exactly what you're looking for as an end result, but here is a Python script that will convert your data to XML:

#!/usr/bin/env python2
# -*- coding: ascii -*-
"""transform.py

Parses a data file contain textual records in the following format:

    [RecordUUID.n]
    "Variable1Key"="Variable1Value"
    "Variable2Key"="Variable2Value"
    "Variable3Key"="Variable3Value"

and converts it to an XML document with record-elements of the following form:

    <RecordUUID.1>
        <Variable1Key>Variable1Value</Variable1Key>
        <Variable2Key>Variable2Value</Variable2Key>
        <Variable3Key>Variable3Value</Variable3Key>
    </RecordUUID.1>
"""

import sys
import re
from xml.etree.ElementTree import Element, SubElement, tostring
from xml.dom.minidom import parseString

# Creat a root element for the XML document
root = Element('root')

# Set a variable to keep track of the current record
current_record = None

# Parse the data and construct an XML representation
with open(sys.argv[1]) as datafile:

    # Extract the non-empty lines from the data file
    lines = [line.strip() for line in datafile if line.strip()]

    # Iterate over the lines
    for line in lines:

        # Check to see if we've reached a new record
        if "RecordUUID" in line:

            # Extract the record ID
            eid = line.strip()[1:-1]

            # Add a new child element to the document
            # and update the current record
            current_record = SubElement(root, eid)

        # Check to see if we've reached a new key-value pair
        else:
            match = re.match(r'^"(\w+)"="(\w+)"$', line.strip())

            # If we have a key-value pair then update the current record
            if match:
                key, value = match.groups()
                SubElement(current_record, key).text = value

# Display the generated XML document
print(parseString(tostring(root)).toprettyxml(indent="\t"))

If we put the following data (i.e. the example data from your question) in a file called data.txt:

[RecordUUID.1]
"Variable1Key"="Variable1Value"
"Variable2Key"="Variable2Value"
"Variable3Key"="Variable3Value"

[RecordUUID.4]
"Variable1Key"="Variable1Value"
"Variable5Key1"="Variable51Value1"
"Variable5Key1"="Variable51Value2"
"Variable5Key2"="Variable52Value1"
"Variable5Key2"="Variable52Value2"

And then run the script:

user@host:~$ python transform.py data.txt

Then we get the following output:

<?xml version="1.0" ?>
<root>
    <RecordUUID.1>
        <Variable1Key>Variable1Value</Variable1Key>
        <Variable2Key>Variable2Value</Variable2Key>
        <Variable3Key>Variable3Value</Variable3Key>
    </RecordUUID.1>
    <RecordUUID.4>
        <Variable1Key>Variable1Value</Variable1Key>
        <Variable5Key1>Variable51Value1</Variable5Key1>
        <Variable5Key1>Variable51Value2</Variable5Key1>
        <Variable5Key2>Variable52Value1</Variable5Key2>
        <Variable5Key2>Variable52Value2</Variable5Key2>
    </RecordUUID.4>
</root>
igal
  • 9,886