Convert / import flat file with multiple records, one variable per line

Question

I am searching for an efficient way using (UNIX) command line tools to read a flat file into a(ny) database, perhaps after first converting to an intermediate structured format, for example XML or csv. The flat file has multiple records with lines containing each a single key and value, such as follows. The number of variables may vary between records and their number and names are only known after reading the input file. (A further complication is, there may be some nesting of repeated variables, but this may be ignored for the time being.)

[RecordUUID.1]
"Variable1Key"="Variable1Value"
"Variable2Key"="Variable2Value"
"Variable3Key"="Variable3Value"

[RecordUUID.4]
"Variable1Key"="Variable1Value"
"Variable5Key1"="Variable51Value1"
"Variable5Key1"="Variable51Value2"
"Variable5Key2"="Variable52Value1"
"Variable5Key2"="Variable52Value2"

I have checked answers given for transforming rows to columns and other answers on this site but none seemed to fit. The problem seems similar to reading INI or VCARD files, but I could not find a generic solution; an XSL transformation may be conceivable but was not found. Any pointers would be more than welcome, thanks.

@Archemar Previously tried:

flat file extractor expects one record per row

Transposing rows and columns expect fixed variables

XFlat not updated, not working with current java

XSL transformations such as here no starting point found for own schema

transpose.c expects matrix input

sed to create e.g. XML or HTML (remains an option) — , Jul 24 '15 at 21:30
@Archemar Also tried: remarshal with toml which seems not to support repeated keys and PHP VCARD parser, which however have hardcoded fieldnames. — , Jul 24 '15 at 23:20
Do you have a list of expected fieldnames or does this need to dynamically adjust? If you have a (short) list of expected fieldnames, one approach would be to start with INI or VCARD code and modify to suit your needs. Of cours a genarlized solution would be much more interesting. — robartsd, Oct 16 '15 at 20:25
This question lacks a sample input/output pair. What does the alternative XML look like for the sample data? What would a database look like? — Kaz, Aug 31 '16 at 02:57

igal · Answer 1 · 2017-12-10T01:09:20.173

I'm not sure exactly what you're looking for as an end result, but here is a Python script that will convert your data to XML:

#!/usr/bin/env python2
# -*- coding: ascii -*-
"""transform.py

Parses a data file contain textual records in the following format:

    [RecordUUID.n]
    "Variable1Key"="Variable1Value"
    "Variable2Key"="Variable2Value"
    "Variable3Key"="Variable3Value"

and converts it to an XML document with record-elements of the following form:

    <RecordUUID.1>
        <Variable1Key>Variable1Value</Variable1Key>
        <Variable2Key>Variable2Value</Variable2Key>
        <Variable3Key>Variable3Value</Variable3Key>
    </RecordUUID.1>
"""

import sys
import re
from xml.etree.ElementTree import Element, SubElement, tostring
from xml.dom.minidom import parseString

# Creat a root element for the XML document
root = Element('root')

# Set a variable to keep track of the current record
current_record = None

# Parse the data and construct an XML representation
with open(sys.argv[1]) as datafile:

    # Extract the non-empty lines from the data file
    lines = [line.strip() for line in datafile if line.strip()]

    # Iterate over the lines
    for line in lines:

        # Check to see if we've reached a new record
        if "RecordUUID" in line:

            # Extract the record ID
            eid = line.strip()[1:-1]

            # Add a new child element to the document
            # and update the current record
            current_record = SubElement(root, eid)

        # Check to see if we've reached a new key-value pair
        else:
            match = re.match(r'^"(\w+)"="(\w+)"$', line.strip())

            # If we have a key-value pair then update the current record
            if match:
                key, value = match.groups()
                SubElement(current_record, key).text = value

# Display the generated XML document
print(parseString(tostring(root)).toprettyxml(indent="\t"))

If we put the following data (i.e. the example data from your question) in a file called data.txt:

[RecordUUID.1]
"Variable1Key"="Variable1Value"
"Variable2Key"="Variable2Value"
"Variable3Key"="Variable3Value"

[RecordUUID.4]
"Variable1Key"="Variable1Value"
"Variable5Key1"="Variable51Value1"
"Variable5Key1"="Variable51Value2"
"Variable5Key2"="Variable52Value1"
"Variable5Key2"="Variable52Value2"

And then run the script:

user@host:~$ python transform.py data.txt

Then we get the following output:

<?xml version="1.0" ?>
<root>
    <RecordUUID.1>
        <Variable1Key>Variable1Value</Variable1Key>
        <Variable2Key>Variable2Value</Variable2Key>
        <Variable3Key>Variable3Value</Variable3Key>
    </RecordUUID.1>
    <RecordUUID.4>
        <Variable1Key>Variable1Value</Variable1Key>
        <Variable5Key1>Variable51Value1</Variable5Key1>
        <Variable5Key1>Variable51Value2</Variable5Key1>
        <Variable5Key2>Variable52Value1</Variable5Key2>
        <Variable5Key2>Variable52Value2</Variable5Key2>
    </RecordUUID.4>
</root>

Convert / import flat file with multiple records, one variable per line

1 Answers1