3

I have a large collection of XML files with the same exact structure:

$ cat file_<ID>.xml
... 
 ... 
   ...
      <double>1.2342</double>
      <double>2.3456</double>
      ...
   ...
 ... 
... 

where the number of such <double> entries in each XML file is fixed and known (in my particular case, 168).

I need to build a single csv file with the content of all these XML files stored as follows:

file_0001 1.2342 2.3456 ... 
file_0002 1.2342 2.3456 ... 

etc.

How can I do this efficiently?


The best I have come up with is this:

#!/usr/bin/env zsh

for x in $path_to_xmls/*.xml; do

# 1) Get the doubles ignoring everything else
# 2) Remove line breaks within the same file
# 3) Add a new line at the end to construct the CSV file
# 4) Join the columns together

cat $x | grep -F '<double>' | \
sed -r 's/.*>([0-9]+\.*[0-9]*).*?/\1/' | \
tr '\n' ' ' | sed -e '$a\' >> table_numbers.csv

echo ${x:t} >> file_IDs.csv

done

paste file_IDs.csv table_numbers.csv > final_table.csv

When I time the above script in a folder with ~10K XML files I get:

./from_xml_to_csv.sh  100.45s user 94.84s system 239% cpu 1:21.48 total

not terrible, but I am hoping to work with 100x or 1000x more files. How can I make this processing more efficient?

Also, with my solution above, could I ever end up in a situation where the glob expansion reaches a limit, e.g. when working with millions of files? (the typical "too many args" problem).

Update

For anyone interested in a great solution to this problem, please read @mikeserv's answer. It is the fastest and the one that scales up best by far.

  • I'm no shell expert but for the too many args there's always xargs. Or you could export function then use find...-exec func{} +... for huge dirs. I do recall a limit of 32768 dirs on some bsd distro that screwed with some command combo. You could skip sed and use tr only to remove tag if element contents are just [0-9.] as element is just <double></double>, tr can do all that. Do you need to cat or can redirect straight to grep. Maybe can cut down on pipes and use more redirection. – gwillie Jul 23 '15 at 00:56
  • It's hard to answer XML processing questions when your XML isn't valid. Any chance you could do a more complete version? – Sobrique Aug 07 '15 at 11:56

5 Answers

5

This should do the trick:

awk -F '[<>]' '
      NR!=1 && FNR==1{printf "\n"} 
      FNR==1{sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME} 
      /double/{printf " %s", $3}
      END{printf "\n"}
    ' $path_to_xml/*.xml > final_table.csv

Explanation:

  • awk: use the program awk, I tested it with GNU awk 4.0.1
  • -F '[<>]': use < and > as field separators
  • NR!=1 && FNR==1{printf "\n"}: if it is not the first line overall (NR!=1) but the first line of a file (FNR==1) print a newline
  • FNR==1{sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME}: if it is the first line of a file, strip away anything up to the last / (sub(".*/", "", FILENAME)) in the name of the file (FILENAME), strip a trailing .xml (sub(".xml$", "", FILENAME)) and print the result (printf FILENAME)
  • /double/{printf " %s", $3}: if a line contains "double" (/double/), print a space followed by the third field (printf " %s", $3). With < and > as separators, this is the number (the first field is anything before the first < and the second field is double). If you want, you can format the numbers here; for example, using %8.3f instead of %s prints every number with 3 decimal places and an overall width (including the dot and decimal places) of at least 8 (a short sketch follows this list).
  • END{printf "\n"}: after the last line print an additional newline (this could be optional)
  • $path_to_xml/*.xml: the list of files
  • > final_table.csv: put the result into final_table.csv by redirecting the output
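For example, with the %8.3f format mentioned above, the command would look like this (a minimal sketch; only the /double/ rule changes, everything else stays the same):

awk -F '[<>]' '
      NR!=1 && FNR==1{printf "\n"}
      FNR==1{sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME}
      /double/{printf " %8.3f", $3}
      END{printf "\n"}
    ' $path_to_xml/*.xml > final_table.csv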

If you run into "argument list too long" errors, you can use find with the -exec option to pass the files in batches instead of passing the whole list at once:

find $path_to_xml -maxdepth 1 -type f -name '*.xml' -exec awk -F '[<>]' '
      NR!=1 && FNR==1{printf "\n"} 
      FNR==1{sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME} 
      /double/{printf " %s", $3}
      END{printf "\n"}
    ' {} + > final_table.csv

Explanation:

  • find $path_to_xml: tell find to list files in $path_to_xml
  • -maxdepth 1: do not descend into subfolders of $path_to_xml
  • -type f: only list regular files (this also excludes $path_to_xml itself)
  • -name '*.xml': only list files that match the pattern *.xml; the pattern needs to be quoted, otherwise the shell will try to expand it
  • -exec COMMAND {} +: run the command COMMAND with the matching files as parameters in place of {}. + indicates that multiple files may be passed at once, which reduces forking. If you use \; (the ; needs to be quoted, else it is interpreted by the shell) instead of +, the command is run for each file separately (a quick illustration follows this list).
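A quick way to see the difference between + and \; is to substitute echo for the real command (just an illustration, not part of the solution):

find $path_to_xml -maxdepth 1 -type f -name '*.xml' -exec echo {} +
find $path_to_xml -maxdepth 1 -type f -name '*.xml' -exec echo {} \;

The first form prints many file names per echo invocation (as few invocations as possible), while the second prints exactly one name per invocation.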

You can also use xargs in conjunction with find:

find $path_to_xml -maxdepth 1 -type f -name '*.xml' -print0 |
 xargs -0 awk -F '[<>]' '
      NR!=1 && FNR==1{printf "\n"} 
      FNR==1{sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME} 
      /double/{printf " %s", $3}
      END{printf "\n"}
    ' > final_table.csv

Explanation:

  • -print0: output list of files separated by null characters
  • | (pipe): redirects standard output of find to the standard input of xargs
  • xargs: builds command lines from the arguments read on standard input (here: file names) and runs them, passing as many arguments to each invocation as possible
  • -0: direct xargs to assume arguments are separated by null characters

The awk program can also be written using the BEGINFILE and ENDFILE special patterns:

awk -F '[<>]' '
      BEGINFILE {sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME} 
      /double/{printf " %s", $3}
      ENDFILE {printf "\n"}
    ' $path_to_xml/*.xml > final_table.csv

where the BEGINFILE and ENDFILE rules are run at the start and end of each input file (if your awk supports them; GNU awk does).

Jeff Schaller
Adaephon
  • Thanks -- I'm afraid I now get "argument list too long: awk". – Amelio Vazquez-Reina Jul 23 '15 at 14:19
  • This is great. Thanks!! -- The explanations are very helpful. I measured the latest solutions and they don't have the "argument list too long" errors anymore. They run on ~20 secs against 10K XML files, making them on par with @mikeserv's slowest solution, but slower than mikeserv's faster solution. This is definitely a GREAT and very clear solution to the problem! – Amelio Vazquez-Reina Jul 24 '15 at 23:17
4

Regarding glob expansion possibly exceeding a limit - yes and no. The shell is already running, so the expansion itself won't stop it; ARG_MAX only comes into play when arguments are handed to an exec'd command. But if you were to pass the entire globbed array as arguments to a single command, then yes, that is a definite possibility. The portable and robust way to handle this involves find...

find . \! -name . -prune -name pattern -type f -exec cat {} + | ...

...which will only cat regular files in the current directory with a name which matches pattern, but will also only invoke cat as many times as is necessary to avoid exceeding ARG_MAX.
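If you are curious what that limit is on your own system, getconf will report it (just a quick check, not part of the pipeline above):

getconf ARG_MAX

The value is the maximum combined size, in bytes, of the arguments and environment passed to an exec'd command.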

In fact, though, since you have a GNU sed we can almost do the whole thing with just sed in a find script.

cd /path/to/xmls
find . \! -name . -prune -name \*.xml -type f -exec  \
    sed -sne'1F;$x;/\n*\( \)*<\/*double>/!d' \
        -e  '$s//\1/gp;H' {} + | paste -d\\0 - -

I thought of another way. This will be very fast, but it absolutely depends on there being exactly 168 matches per file, and there can only be the one . dot in the filenames.

(   export LC_ALL=C; set '' - -
    while [ "$#" -lt 168 ]; do set "$@$@"; done
    shift "$((${#}-168))"
    find . \! -name . -prune -name \*.xml -type f      \
              -exec  grep -F '<double>' /dev/null {} + |
    tr \<: '>>' | cut -d\> -f1,4 | paste -d\  "$@"     |
    sed 'h;s|./[^>]*>||g;x;s|\.x.*||;s|..||;G;s|\n| |'
)

As requested, here's a little breakdown of how this command works:

  1. ( ... )

    • In the first place, the whole little script is run within its own subshell because there are a few global environmental properties that we'll be altering in the course of its execution, and this way when the job is done all of the properties we alter will be restored to their original values - whatever they were.
  2. export LC_ALL=C; set '' - -
    • By setting the current locale to C we can save our filters a lot of effort. In a UTF-8 locale any char might be represented by one or several bytes apiece, and any char found will need to be selected out of many thousands of possibilities. In the C locale each char is a single byte, and there are only 128 of them. It makes char matching a much quicker affair overall.
    • The set statement changes the shell's positional parameters. Doing set '' - - sets $1 to the empty string, and $2 and $3 to -.
  3. while ... set "$@$@"; done; shift ...
    • Basically the whole point of this statement is to get an array of 168 dashes. We'll be using paste later to replace sequential sets of 167 newlines with spaces, while preserving the 168th. The simplest way to do this is to give it 168 argument references to - stdin and tell it to paste all of those together (a scaled-down illustration follows this breakdown).
  4. find ... -exec grep -F '<double>' /dev/null ...
    • The find bit has been previously discussed, but with grep we print only those lines which can be matched against the -Fixed string <double>. By making grep's first argument /dev/null - which is a file that can never match our string - we ensure that grep is always searching 2 or more file arguments for every invocation. When invoked with 2 or more named search files grep will always print the filename like file_000.xml: at the head of every output line.
  5. tr \<: '>>'
    • Here we translate every occurrence in grep's output of either : or < characters to >.
    • At this point a sample matched line will look like ./file_000.xml> >double>0.0000>/double>.
  6. cut -d\> -f1,4
    • cut discards everything from its input except the 1st and 4th fields, as divided by > chars.
    • At this point a sample matched line will look like ./file_000.xml>0.0000.
  7. paste -d\ "$@"
    • Already discussed, but here we paste input lines in batches of 168.
    • At this point 168 matched lines occur together like: ./file_000.xml>0.000 .../file_000.xml>0.167
  8. sed 'h;s|./[^>]*>||g;x;s|\.xml.*||;s|..||;G;s|\n| |'
    • Now the faster, smaller utilities have already done the majority of the work. On a multicore system, they've probably even done it concurrently. And those utilities - especially cut and paste - are far faster at what they do than any attempt at emulation we might make with higher-level utilities like sed or, even worse, awk. But I've taken it as far as I can with them, and now I have to call on sed.
    • First I hold a copy of every input line, then I globally remove every occurrence of the pattern ./[^>]*> in pattern space - so every occurrence of the filename. At this point sed's pattern space looks like: 0.000 0.0001...0.167
    • Then I exchange hold and pattern spaces and remove everything from \.xml.* on - so everything from the first filename onward on the saved copy of the line. I then strip the first two chars - the ./ - and at this point pattern space looks like file_000.
    • So all that remains is to stick them together. I Get a copy of hold space appended to pattern space following a \newline char, then I s///ubstitute the \newline for a space.
    • And so, finally, pattern space looks like file_000 0.000...0.167. And that is what sed writes to output for each file find passes to grep.
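As a scaled-down illustration of the paste trick from step 3 (3 dashes instead of 168, with made-up input), paste reads one line from its standard input for every - argument and joins each group into a single output line:

printf '%s\n' a b c d e f | paste -d' ' - - -

This prints a b c on the first line and d e f on the second; the real script does the same with 168 dashes, so each output line collects one file's worth of grep matches.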
mikeserv
  • Thanks -- but I am afraid I get: sed: -e expression #2, char 17: unknown option to 's' find: 'grep' terminated by signal 13 find: 'grep' terminated by signal 13. I am using GNU's variants of these tools in OS X. – Amelio Vazquez-Reina Jul 23 '15 at 15:15
  • More specifically: sed (GNU sed) 4.2.2, find (GNU findutils) 4.4.2 and grep (GNU grep) 2.21. – Amelio Vazquez-Reina Jul 23 '15 at 15:21
  • Thanks -- For some reason this is very slow (slower than the solution in the OP) on my machine. I'm really surprised to see this, since your solution looks definitely cleaner and involves less piping etc. from what I can see. Hmm, I'm still investigating. – Amelio Vazquez-Reina Jul 23 '15 at 17:13
  • Thanks -- This is interesting -- It doesn't seem to work yet, I basically end up with csv with two XML filenames per line, i.e. first line looks like file1.xml file2.xml , second line like file3.xml file4.xml, etc. – Amelio Vazquez-Reina Jul 23 '15 at 18:20
  • @AmelioVazquez-Reina - the bottom one works. And it's fast. Still, though, the two names per line was because I had a -n there, and it shouldn't have been. So just sed -se not sed -sne. – mikeserv Jul 23 '15 at 19:20
  • Thanks -- This is really strange. I have updated my OP with your latest solution. – Amelio Vazquez-Reina Jul 23 '15 at 20:27
  • And by the way, your assumption is correct (one dot per file and 168 numbers per file) – Amelio Vazquez-Reina Jul 23 '15 at 20:28
  • @AmelioVazquez-Reina - oh - zsh? zsh does a weird thing with $# in math expansions. Sorry I should have done ${#} to protect against that in the first place. – mikeserv Jul 23 '15 at 20:38
  • That worked! @mikeserve. I only had to fix type -f to have it be type f (editing your query if you don't mind). Impressive speed! I hope this is useful for others -- extracting structured data from many files into a csv file. I renamed the OP to signal that. – Amelio Vazquez-Reina Jul 23 '15 at 21:00
  • Awesome. By the way, I saw your posterior edits to the answer, which is great. Would you mind elaborating perhaps a bit on how the second solution works? It may help others adapt this answer to future multiple-file to-single-CSV transformation questions :). – Amelio Vazquez-Reina Jul 23 '15 at 21:56
  • Amazing. Thanks! I tried the first solution (with GNU versions) but I ended up with the same file1.xml file2.xml\nfile3.xml file4.xml ... pattern as before. – Josh Jul 24 '15 at 00:12
  • Thanks @mikeserv! Both solutions now work great. This a huge speed up from the original solution in the OP (and 100X faster than a Python solution that properly loads xml's one by one to build a CSV dataframe in Pandas). Current times in my machine: Sol (1): 19 seconds. Sol (2): 14 seconds, both transforming ~10K XML files into a single CSV file. – Amelio Vazquez-Reina Jul 24 '15 at 18:21
  • Thanks @mikeserv I prepended Sol (1) with export LC_ALL=C; and the timing for it was still ~ 20.28 seconds (i.e. ~unchanged). – Amelio Vazquez-Reina Jul 24 '15 at 21:23
  • The second solution still runs in ~ 15s – Amelio Vazquez-Reina Jul 24 '15 at 21:35
  • @AmelioVazquez-Reina - even with all of the additional setup and the additional execs and etc, it's this kind of stuff that makes shell pipelines powerful. Unix is timeshared - and so if there are multiple processes running in a foreground process group they will get execution priority - and each winds up operating on its own processor core. And so there is no delay when the data is passed around - it's just many hands making light work. The key is to make the execs count - a single pipeline which handles all data in a job from start to finish, and they're almost always worth it. – mikeserv Jul 24 '15 at 21:39
2

Please, on behalf of future maintenance programmers and sysadmins - DON'T use a regex to parse XML. XML is structured data, and it is NOT well suited to regex parsing - you can 'fake it' by pretending it's plain text, but there are plenty of semantically identical XML documents that won't look the same to a regex. For example, line feeds can be embedded inside elements, and tags can be self-closing.

Thus - use a parser - I have mocked up some source data, because your XML isn't valid. Give me a more complete sample, and I'll give you a more complete answer.

At a basic level - we extract the double nodes like this:

#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig -> new;
$twig -> parse ( \*DATA ); 

foreach my $double ( $twig -> get_xpath('//double') ) {
   print $double -> trimmed_text,"\n";
}

__DATA__
<root> 
 <subnode> 
   <another_node>
      <double>1.2342</double>
      <double>2.3456</double>
      <some_other_tag>fish</some_other_tag>
   </another_node>
 </subnode>
</root> 

This prints:

1.2342
2.3456

So we expand this:

#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;
use Text::CSV;

my $twig = XML::Twig->new;
my $csv  = Text::CSV->new( { eol => "\n" } );   # set eol so each printed row ends with a newline

#open our results file
open( my $output, ">", "results.csv" ) or die $!;
#iterate each XML File. 
foreach my $filename ( glob("/path/to/xml/*.xml") ) {
    #parse it
    $twig->parsefile($filename);
    #extract all the text of all the 'double' elements. 
    my @doubles = map { $_->trimmed_text } $twig->get_xpath('//double');
    #print it as comma separated. 
    $csv->print( $output, [ $filename, @doubles ] );

}
close($output);

I think that should do the trick (without sample data, I can't say for sure). But note - by using an XML parser we don't get tripped up by any of the reformatting that is perfectly valid according to the XML spec. And by using a CSV module, we aren't going to get caught out by any fields with embedded commas or line feeds.

If you're looking for more specific nodes - you can specify a more detailed path. As it is, the above just looks for any instance of double. But you can use:

get_xpath("/root/subnode/another_node/double")
Sobrique
0

You can try this one-liner for each file. awk's multiple delimiters do the splitting efficiently, and tr concatenates all the lines in memory rather than on disk.

for f in `ls *.xml` ; 
do 
     echo $f,`grep double $f | awk  -F  '[<>]' '{print $3}' | tr '\n' ','`; 
done

I cannot profile this on my end, since I do not have the same data, but my hunch is that it should be faster.

Apart from that, this is an easy problem to divide and conquer - if you have access to multiple machines or a farm, you can just divide the whole task across them and finally concatenate all the outputs into one file. This way command-line limits and memory can also be managed. A rough single-machine sketch of the batching idea follows.
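Here is that sketch (assuming GNU xargs with the -P option; extract.awk is a hypothetical file holding an awk program like the one in Adaephon's answer, and the part_* names are placeholders):

find $path_to_xml -maxdepth 1 -type f -name '*.xml' -print0 |
    xargs -0 -n 1000 -P 4 sh -c 'awk -F "[<>]" -f extract.awk "$@" > "$(mktemp ./part_XXXXXX)"' sh
cat ./part_* > final_table.csv
rm ./part_*

Each batch of up to 1000 files gets its own awk process and its own partial output file, so nothing interleaves; the rows simply end up in whatever order the batches finish.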

amisax
  • can you explain the down-vote instead of the eat,shoot and leave ? – amisax Jul 23 '15 at 04:26
  • In most (all?) cases grep is not needed in conjunction with awk. Instead of grep double $f | awk -F '[<>]' '{print $3}' you can just use awk -F '[<>]' '/double/ {print $3}' $f. Also tr is not really needed here: echo $f$(awk -F '[<>]' '/double/{printf ",%d", $3} END{print ""}' $f) – Adaephon Jul 23 '15 at 09:39
  • To be clear, neither was the downvote from me nor did I try to explain why you got a downvote (personally I do not think it is warranted). It was just something I noticed when reading your answer. – Adaephon Jul 23 '15 at 18:51
  • @Adaphon - the downvote is mine, and it is primarily for for f in $(ls *.xml); do echo $f, $(grep double $f | awk ...); done... That code leaves so much open to interpretation, it might as well not be a program. It completely depends on IFS and shell-glob settings, and doesn't bother to check or set either. While the likelihood is small, it is possible that, given the right conditions, that code could erase and/or overwrite on-disk files and data. – mikeserv Jul 23 '15 at 23:28
  • @Adaephon - you are right ,a single awk could do it .. – amisax Jul 24 '15 at 04:31
  • @mikeserv - so is using find a better idea than ls ? I am okay with the downvote - no issues there as long as there is something to learn with it ! – amisax Jul 24 '15 at 04:32
  • It depends. If you need to take an action on each of the results then probably, and that's because find can -exec. But each situation is different. But that's not really what I was talking about though. If you're going to split with IFS you have to set it first. Don't rely on default values - especially not in a public forum where anyone can copy and paste what you've written directly to their terminal prompt regardless of what else they had previously pasted there. And you also have to take care not to glob. If the split results contain shell globs that match, it gets bad. – mikeserv Jul 24 '15 at 04:47
  • And @Adaephon, by the way, whether or not awk can find the match on its own, if speed is what is asked for, then you should use grep. grep is far faster at finding a match than awk. – mikeserv Jul 24 '15 at 04:49
  • @mikeserv That may depend on the implementation and the specific case. I did a quick check on some log files (cumulative size around 1.5 GiB, on ramdisk), searching for lines containing "line" (over all around 7.5 million matches) and for me grep was (far) slower. grep 'line' logfiles/* > /dev/null (GNU grep 2.16) takes about 15 seconds to finish, while gawk '/line/{print}' > /dev/null (GNU awk 4.0.1) takes about 8.5 seconds and mawk '/line/{print}' > /dev/null (Mike's AWK 1.3.3) only needs about 2.9 seconds. – Adaephon Jul 24 '15 at 08:08
  • @Adaephon - that doesn't make any sense. Except that you're doing logfiles/* for grep and not so for gawk. You should be using -F though if you're not searching for a regexp. – mikeserv Jul 24 '15 at 10:53
  • @mikeserv Sorry, typing error, of course all three commands used logfiles/* as filelist. grep -F 'line' logfiles/* > /dev/null takes even longer with around 30 seconds. While grep -P 'line' logfiles/* > /dev/null (using PCRE) beats at least gawk with a time of 6 seconds. On a completely different machine with newer versions (grep 2.21, gawk 4.1.3, mawk 1.3.4) it looks different: gawk 8.0 s, mawk 4.2 s, grep 2.8 s, grep -F 3.4 s and grep -P 5.2 s. Here grep is indeed faster than awk. So, it really depends. (and grep -F does not help performance) – Adaephon Jul 24 '15 at 12:47
  • @Adaephon - wow. It's ... weird. In my experience awk is always the slowest tool in the box - only a step ahead of the turtles like perl or python, sed is faster, grep still more so, and dd, expand, cut, paste set the pace. This has been true with many different sets of tools that I have used - not just GNU. I'm thoroughly confused/intrigued by your results. Are the commands interpreting charsets in different ways? – mikeserv Jul 24 '15 at 14:27
-1

You’re writing twice for every file, and that is probably the most expensive part. You’ll instead want to try to keep the whole thing in memory, probably in an array, and then write once at the end (a rough zsh sketch follows).
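A rough zsh sketch of that idea, keeping the extraction step from the question but collecting the rows in an array and writing once at the end (untested; the sed expression here is a simplified, hypothetical variant of the original):

#!/usr/bin/env zsh

typeset -a rows
for x in $path_to_xmls/*.xml; do
    # extract the numbers, join them with spaces, and keep the row in memory
    nums=$(grep -F '<double>' $x | sed -r 's|.*<double>([0-9.]+)</double>.*|\1|' | tr '\n' ' ')
    rows+=("${x:t} $nums")
done
# one write at the end instead of two appends per file
print -rl -- $rows > final_table.csv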

Look into ulimit if you start hitting memory limits. If you’re increasing this workload to 10-100x, you’re looking at maybe 10-100 GB of memory. You could batch this in a loop that does so-many-thousands per iteration. I’m not sure if this needs to be a repeatable process, but get more sophisticated if you need it to be faster/more robust. Otherwise, hand stitch the batches afterward.

You’re also spawning several processes per file - one for every pipe you have. You could do the whole parsing/munging (grep/sed/tr) with a single process. After the grep, Zsh can handle the other translations via expansions (see man zshexpn). Or you could do the whole sed part in one invocation with multiple expressions. sed may be faster if you avoid the -r (extended regex) and non-greediness. Your grep could also pull out matching lines from many files at once and write to intermediate temp files. Know your bottlenecks, though, and don’t fix what isn’t one.