bash string manipulation speed vs. pipeline

Question

I have a file filled with md5 checksums and filenames. I need to perform some processing on each line, so I need to know:

Which is the checksum
Which is the filename

and act accordingly. That is, I need to read the checksum into one variable and the filename into another. The filename may contain non-ASCII characters, but I don't expect to see newlines. Each line looks like this:

05c00367e8914ca1be0964821d127977  ./.fseventsd/0000000000097aa1
cd9d4291f59a43c0e3d73ff60a337bb5  ./.fseventsd/00000000000fdfec
5d1280769e741e04622cfd852f33a138  ./.fseventsd/0000000000103197
8dda3534e5bbc0be1d15db2809123c50  ./.fseventsd/000000000017c9ca
(...etc., about 100,000 lines)

Traditionally, I might perform something like this:

md5sum=$(echo "$line" | awk '{print $1}')
filename=$(echo "$line" | sed 's/[^ ]*  //')

But how much faster would it be if I did this:

md5sum=${line%%" "*}
filename=${line#*"  "}

The correct solution to this depends on what it is you are expecting to do with the pathnames and checksums. — Kusalananda, Jul 13 '18 at 07:14

Mike S · Answer 1 · 2018-07-13T23:22:41.093

I tested by setting just one of the variables (the checksum, so $file stays empty in the output), running this script twice:

while read line; do
        md5sum=${line%%" "*}
        #md5sum=$(echo $line | awk '{print $1}')
        echo "SUM: $md5sum FILE:_$file"
done < manifest.Stuph.180620

first with

md5sum=${line%%" "*}

and next with

md5sum=$(echo $line | awk '{print $1}')

where the file "manifest.Stuph.180620" is 100939 lines long (about 14 MiB), gave the following results:

First run (using bash's builtin string manipulation)

real    0m4.750s
user    0m4.174s
sys     0m0.550s

Second run (using the pipeline)

real    10m54.255s
user    4m42.257s
sys     7m32.880s
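
For readers who want to see what the two variants compute, here is a minimal sketch on a single line, not the benchmark itself. The sample line is copied from the question; the variable names (sum_builtin, sum_pipe and so on) are made up for illustration, and the input is quoted and fed with printf rather than echo so the double space survives.

```bash
#!/usr/bin/env bash
# Sample line copied from the question's manifest excerpt.
line='05c00367e8914ca1be0964821d127977  ./.fseventsd/0000000000097aa1'

# Builtin: %%" "* strips the longest suffix that starts at a space,
# i.e. everything from the first space onward.
sum_builtin=${line%%" "*}
# Builtin: #*"  " strips the shortest prefix that ends in two spaces.
file_builtin=${line#*"  "}

# Pipeline: each command substitution forks a subshell and starts an
# external program (awk or sed) just to process this one line.
sum_pipe=$(printf '%s\n' "$line" | awk '{print $1}')
file_pipe=$(printf '%s\n' "$line" | sed 's/[^ ]*  //')

printf 'SUM: %s FILE: %s\n' "$sum_builtin" "$file_builtin"
if [ "$sum_builtin" = "$sum_pipe" ] && [ "$file_builtin" = "$file_pipe" ]; then
    echo "both methods agree"
fi
```

The results are the same either way; the difference is that the pipeline version pays for extra processes on every one of the roughly 100,000 iterations, which is consistent with the large sys time in the second run.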

Some (such as myself) will say that if speed matters, you shouldn't be messing around in the shell anyway. But sometimes you want to be more efficient, no matter what environment you're using for the job.

Note that doing this:

while read md5sum filename; do
    (...etc...)

is even more efficient than doing the variable assignment, though that gain is far smaller than the one from eliminating the command substitution/pipe/awk construct. What I find most interesting is the size of the gap between bash builtins and external commands. I'll be more diligent about learning and using the fancy builtin stuff!
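
A minimal sketch of letting read do the splitting, assuming the two-column md5sum layout shown in the question. The temporary manifest, its second entry, and the count/last_sum/last_file variables are hypothetical, there only to make the sketch self-contained.

```bash
#!/usr/bin/env bash
# Hypothetical two-line manifest in the question's format.
manifest=$(mktemp)
printf '%s\n' \
    '05c00367e8914ca1be0964821d127977  ./.fseventsd/0000000000097aa1' \
    'cd9d4291f59a43c0e3d73ff60a337bb5  ./dir with spaces/file name' > "$manifest"

count=0
# -r keeps backslashes literal. read assigns the first field to md5sum
# and the rest of the line, inner spaces included, to filename.
while read -r md5sum filename; do
    count=$((count + 1))
    last_sum=$md5sum
    last_file=$filename
    printf 'SUM: %s FILE: %s\n' "$md5sum" "$filename"
done < "$manifest"

rm -f "$manifest"
```

One caveat: with the default IFS, read trims leading and trailing whitespace, so a filename that itself begins or ends with a space would not survive this intact.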

Why the cat in a process substitution? And why not read checksum and pathname with read directly? — Kusalananda, Jul 13 '18 at 07:12

score 0 · Answer 2 · answered Jul 13 '18 at 02:46

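Yes, using bash builtins avoids many system calls, in particular the fork and exec needed to start each external command. This matters most when the work is repeated many times, as in a loop or recursion.

Another example: prefer the glob * over $(ls).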
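
A small sketch of the glob versus $(ls) point, using a hypothetical scratch directory and made-up file names. Besides saving a fork, the glob keeps each file name as one word, while the output of ls is word-split by the shell.

```bash
#!/usr/bin/env bash
# Hypothetical scratch directory containing one awkward file name.
dir=$(mktemp -d)
touch "$dir/plain.txt" "$dir/two words.txt"

glob_count=0
for f in "$dir"/*; do        # expanded by the shell itself: one word per file
    glob_count=$((glob_count + 1))
done

ls_count=0
for f in $(ls "$dir"); do    # runs ls, then splits its output on whitespace
    ls_count=$((ls_count + 1))
done

echo "glob: $glob_count  ls: $ls_count"   # prints "glob: 2  ls: 3"
rm -rf "$dir"
```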

Bash offers some simple string manipulations (cutting and substitution), but not much more, because it is not designed for that. For example, it is difficult to validate the presence of a pattern in a string without an external command.

External programs (cat, sed, grep, awk, cut, sort, ...) are better optimized for their tasks.
