You didn't provide enough information for others to reproduce your benchmark. I made my own and found the echo method to be slightly faster with dash and ksh, and about the same with mksh. The ratio was a lot less than 1:2 even when there was a difference. Obviously this depends on a lot of things, including the shell, the kernel, the implementation of the utilities, and the content of the data files.
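To reproduce that kind of comparison, something like the sketch below is enough. It assumes a metadata.csv in the current directory, uses the trivial awk program 1 as a stand-in for the real script, and the list of shells should be trimmed to what is actually installed.
#!/bin/ksh
# Time 1000 runs of each variant under each shell; "awk 1" is a placeholder
# for the real awk program and metadata.csv must exist in the current directory.
n=1000
for shell in dash mksh ksh bash; do
  echo "== $shell =="
  time "$shell" -c 'data=$(cat metadata.csv); i=0
    while [ "$i" -lt '"$n"' ]; do echo "$data" | awk 1; i=$((i+1)); done' >/dev/null
  time "$shell" -c 'i=0
    while [ "$i" -lt '"$n"' ]; do cat metadata.csv | awk 1; i=$((i+1)); done' >/dev/null
  time "$shell" -c 'i=0
    while [ "$i" -lt '"$n"' ]; do <metadata.csv awk 1; i=$((i+1)); done' >/dev/null
done
The absolute numbers matter less than how the variants compare within a single shell.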
Between these two methods, there isn't an obvious winner. Reading from the disk costs practically nothing because the file will be in the cache. Calling cat has the overhead of forking an external process, whereas echo is a shell builtin. If your sh is bash, its echo builtin prints its argument one line at a time even when the output is going to a pipe, which may account for a little of the slowness. Dash and ksh don't do that; typically they have better performance than bash.
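If you want to see the extra fork/exec directly rather than infer it from timings, counting execve calls is one way to do it (Linux-specific, requires strace; awk 1 is again only a stand-in for the real script):
# The first pipeline shows an execve for cat as well as for awk;
# the redirection variant only shows the shell and awk.
strace -f -c -e trace=execve sh -c 'cat metadata.csv | awk 1 >/dev/null'
strace -f -c -e trace=execve sh -c '<metadata.csv awk 1 >/dev/null'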
There are a number of optimizations you could make in your script.
- An obvious optimization on the cat method is to use redirection instead (<metadata.csv awk …), or pass metadata.csv as an argument to awk. In my tests, redirection was very slightly faster than echo, and there wasn't a measurable difference between redirection and awk … metadata.csv.
- When you use an unquoted variable expansion, in addition to failing horribly if the value contains certain characters, it makes extra work for the shell because it has to do the splitting and globbing. Always use double quotes around variable substitutions unless you know why you need to omit them (see the short illustration after this list).
- Similarly, you're parsing the output of find, which will choke on some file names and requires extra work. The canonical solution is to use find -exec; this may or may not be faster though, because that also has to do extra work to start a shell to process the files.
- I presume that your awk script is simplified from the real thing. With the script you show, assuming that the first column of the CSV file contains only characters that aren't special in regexes, you could try using sed instead; it would be more cryptic, but it might be a little faster because more specialized tools are usually faster. There's no guarantee that you'll get an improvement though, let alone a measurable one.
- When you set ID, you call an external program. Depending on exactly what you're doing here, this may be doable with the shell's own string manipulation constructs: they're typically not very fast and not very powerful, but they don't require calling an external program (see the sketch after this list).
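To make the quoting point concrete (the file name below is made up):
file='my file *.txt'
printf '<%s>\n' $file     # unquoted: split into several words, and *.txt may be globbed
printf '<%s>\n' "$file"   # quoted: passed through as a single argument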
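And for the ID computation, if, say, the ID were just the file name minus its directory and extension (a guess, since the real command isn't shown), parameter expansion gives the same result without one process per file:
FILE=files/foo/bar-1234.data       # made-up layout
ID=$(basename "$FILE" .data)       # runs an external program for every file
ID=${FILE##*/}; ID=${ID%.data}     # same result using only the shell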
All in all, combining those local optimizations, I'd go with
#!/bin/ksh
find files/ -type f -exec sh -c '
  for FILE do
    ID=${FILE//some/thing}
    sed "/^$ID\t/ s/\([^\t]*\)\t\([^\t]*\)\t[^\t]*\t[^\t]*\t\([^\t]*\).*/id=\"\1\" Some=\"\2\" Thing=\"\3\"/" metadata.csv
    cat "$FILE"
  done' _ {} +
There may be a faster algorithm though. You're processing the whole metadata set for each file. Especially if each file only matches one line, that's a lot of unnecessary comparisons. It's likely to be faster to generate the list of IDs from file names and collate it with the metadata. Untested code:
#!/bin/ksh
join -j 1 -t $'\t' -o 2.1,2.2,2.5,1.2 \
     <(find files/ -type f | sed 's!/some$!/thing\t&!' | sort) \
     <(sort metadata.csv) |
awk -F '\t' '{
  print "id=\"" $1 "\" Some=\"" $2 "\" Thing=\"" $3 "\"";
  system("cat \047" $4 "\047"); # assuming no single quotes in file names
}'