
There are about 10,000 files in files/ and 10,000 lines in metadata.csv, where metadata.csv contains information about the files. I have a shell script which prints information about each file followed by the content of the file:

#!/bin/sh
for FILE in `find files/ -type f`
do
    ID=`echo $FILE | sed 's/some/thing/'`
    cat metadata.csv | awk -v ORS="" -v id=$ID -F "\t" '$1==id {
        print "id=\""id"\" Some=\""$2"\" Thing=\""$5"\" "}'
    cat $FILE
done

I thought I could speed this up by assigning the content of metadata.csv to a variable METADATA, reasoning that the file wouldn't be read from disk each time but would be held in memory instead:

#!/bin/sh
METADATA=`cat metadata.csv`
for FILE in `find files/ -type f`
do
    ID=`echo $FILE | sed 's/some/thing/'`
    echo "$METADATA" | awk -v ORS="" -v id=$ID -F "\t" '$1==id {
        print "id=\""id"\" Some=\""$2"\" Thing=\""$5"\" "}'
    cat $FILE
done

But the second one is not faster: the first one runs in about 1 minute, and the second one takes more than 2 minutes.

How does this work, and why is the second script slower rather than faster?

edit: on my system /bin/sh -> dash

  • There should be I/O caching on subsequent reads of the file. Also, that cat is useless and wastes CPU, awk can read the file directly. – thrig Mar 09 '16 at 23:36

1 Answer


You didn't provide enough information for others to reproduce your benchmark. I made my own and found the echo method to be slightly faster with dash and ksh, and about the same with mksh. The ratio was a lot less than 1:2 even when there was a difference. Obviously this depends on a lot of things, including the shell, the kernel, the implementation of the utilities, and the content of the data files.

Between these two methods, there isn't an obvious winner. Reading from the disk costs practically nothing because the file will be in the cache. Calling cat has the overhead of forking an external process, whereas echo is a shell builtin. If your sh is bash, its echo builtin prints its argument one line at a time even when the output is going to a pipe, which may account for a little of the slowness. Dash and ksh don't do that; typically they have better performance than bash.
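
If you want to reproduce the comparison, a rough benchmark along these lines is enough. This is only a sketch, not the benchmark used above; it assumes files/ holds the data files directly and that your ksh supports ${var/pattern/replacement} substitution:

#!/bin/ksh
# Time the three lookup methods over the same set of files.
# Run it twice: the first pass also warms the page cache.
METADATA=$(cat metadata.csv)

lookup_cat() {
    for FILE in files/*; do
        ID=${FILE/some/thing}
        cat metadata.csv | awk -v id="$ID" -F '\t' '$1==id' >/dev/null
    done
}

lookup_echo() {
    for FILE in files/*; do
        ID=${FILE/some/thing}
        echo "$METADATA" | awk -v id="$ID" -F '\t' '$1==id' >/dev/null
    done
}

lookup_file() {
    for FILE in files/*; do
        ID=${FILE/some/thing}
        awk -v id="$ID" -F '\t' '$1==id' metadata.csv >/dev/null
    done
}

time lookup_cat
time lookup_echo
time lookup_file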

There are a number of optimizations you could make in your script.

  • An obvious optimization on the cat method is to use redirection instead (<metadata.csv awk …), or pass metadata.csv as an argument to awk. In my tests, redirection was very slightly faster than echo, and there wasn't a measurable difference between redirection and awk … metadata.csv. (A sketch combining this with the other local fixes follows the list.)

  • When you use an unquoted variable expansion, in addition to failing horribly if the value contains certain characters, it makes extra work for the shell because it has to do the splitting and globbing. Always use double quotes around variable substitutions unless you know why you need to omit them.

  • Similarly you're parsing the output of find, which will choke on some file names, and requires extra work. The canonical solution is to use find -exec; this may or may not be faster though, because that also has to do extra work to start a shell to process the files.
  • I presume that your awk script is simplified from the real thing. With the script you show, assuming that the first column of the CSV file contains only characters that aren't special in regexes, you could try using sed instead; it would be more cryptic, but it might be a little faster because more specialized tools are usually faster. There's no guarantee that you'll get an improvement though, let alone a measurable one.
  • When you set ID, you call an external program. Depending on exactly what you're doing here, this may be doable with the shell's own string manipulation constructs: they're typically not very fast and not very powerful, but they don't require calling an external program.
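
As a minimal sketch of these per-item fixes, keeping the loop structure of the question (ksh/bash syntax assumed; a plain files/* glob stands in for find here, so it only covers a single directory level):

#!/bin/ksh
# Sketch: awk reads metadata.csv directly (no cat), variable expansions are
# quoted, and the ID is derived with shell substitution instead of echo | sed.
for FILE in files/*; do
    [ -f "$FILE" ] || continue
    ID=${FILE/some/thing}
    awk -v ORS="" -v id="$ID" -F '\t' '$1==id {
        print "id=\"" id "\" Some=\"" $2 "\" Thing=\"" $5 "\" "}' metadata.csv
    cat "$FILE"
done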

All in all, combining those local optimizations, I'd go with

#!/bin/ksh
find files/ -type f -exec ksh -c '
  for FILE do
    ID=${FILE//some/thing}
    sed "/^$ID\t/ s/\([^\t]*\)\t\([^\t]*\)\t[^\t]*\t[^\t]*\t\([^\t]*\).*/id=\"\1\" Some=\"\2\" Thing=\"\3\"/" metadata.csv
    cat "$FILE"
  done' _ {} +

There may be a faster algorithm though. You're processing the whole metadata set for each file. Especially if each file only matches one line, that's a lot of unnecessary comparisons. It's likely to be faster to generate the list of IDs from file names and collate it with the metadata. Untested code:

#!/bin/ksh
join -j 1 -t $'\t' -o 2.1,2.2,2.5,1.2 \
     <(find files/ -type f | sed 's!\(.*\)/some$!\1/thing\t&!' | sort) \
     <(sort metadata.csv) |
awk -F '\t' '{
    print "id =\"" $1 "\" Some=\"" $2 "\" Thing=\" $3 "\"";
    system("cat \047" $4 "\047"); # Assuming no single quotes in file names
}'
  • There's also the awk-hashtable method IF the data fits in memory: find whatever | awk -F\\t -vd=\" -vq=\' 'FNR==NR {info[$1]="id="d$1d"Some="d$2d"Thing="d$3d; next} {id=$1; sub(/some/,"thing",id); print info[id]; system("cat "q$1q)}' metadata.csv -. And we could avoid the spawn-cat cost but add stdio overhead with while(getline x <$1)print x; close($1) (ninjad typo) – dave_thompson_085 Mar 10 '16 at 10:18
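
Spelled out more readably, that comment's idea looks roughly like this. This is only a sketch, adapted to the question's columns ($2 = Some, $5 = Thing); it assumes, as in the question, that the ID is the file path with some replaced by thing, and that paths contain no tabs or newlines:

find files/ -type f | awk -F '\t' '
    # First input (metadata.csv): build an in-memory table keyed by ID.
    FNR == NR {
        info[$1] = "id=\"" $1 "\" Some=\"" $2 "\" Thing=\"" $5 "\" "
        next
    }
    # Second input (file names from find on stdin).
    {
        id = $1
        sub(/some/, "thing", id)          # same transformation as ID=
        printf "%s", info[id]
        while ((getline line < $1) > 0)   # print the file without spawning cat
            print line
        close($1)
    }
' metadata.csv -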