I have a file filled with md5 checksums and filenames. I need to perform some processing on each line, so I need to know:

  • Which is the checksum
  • Which is the filename

and act accordingly. That is, I need to slurp the checksum into a variable, then the filename. The filename may have non-ascii characters in it but I don't expect to see newlines. It looks like this:

05c00367e8914ca1be0964821d127977  ./.fseventsd/0000000000097aa1
cd9d4291f59a43c0e3d73ff60a337bb5  ./.fseventsd/00000000000fdfec
5d1280769e741e04622cfd852f33a138  ./.fseventsd/0000000000103197
8dda3534e5bbc0be1d15db2809123c50  ./.fseventsd/000000000017c9ca
(...etc., about 100,000 lines)

Traditionally, I might perform something like this:

md5sum=$(echo "$line" | awk '{print $1}')
filename=$(echo "$line" | sed 's/[^ ]*  //')

But how much faster would it be if I did this:

md5sum=${line%%" "*}
filename=${line#*"  "}
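
For reference, here is a minimal sketch of what those two expansions do to a single line. The sample line is the first one from the listing above; nothing else is assumed.

```shell
#!/usr/bin/env bash
# Sample line taken from the listing above.
line='05c00367e8914ca1be0964821d127977  ./.fseventsd/0000000000097aa1'

# %%" "* removes the longest suffix that starts at a space,
# which leaves only the checksum.
md5sum=${line%%" "*}

# #*"  " removes the shortest prefix that ends at the first double space,
# which leaves the filename (single spaces inside the name are kept).
filename=${line#*"  "}

printf 'SUM: %s\nFILE: %s\n' "$md5sum" "$filename"
```

Both expansions run inside the shell itself, so no subshell or external process is started for them.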
Mike S

2 Answers

I tested this by setting just one of the variables, running the following script twice:

while IFS= read -r line; do
        md5sum=${line%%" "*}
        #md5sum=$(echo $line | awk '{print $1}')
        # Only the checksum is extracted in this test; the filename is never set.
        echo "SUM: $md5sum"
done < manifest.Stuph.180620

first with

md5sum=${line%%" "*}

and next with

md5sum=$(echo $line | awk '{print $1}')

where the file "manifest.Stuph.180620" is 100939 lines long (about 14 MiB), gave the following results:

First run (using bash's builtin string manipulation)

real    0m4.750s
user    0m4.174s
sys     0m0.550s

Second run (using the pipeline)

real    10m54.255s
user    4m42.257s
sys     7m32.880s
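
As a rough back-of-the-envelope check, the wall-clock figures above can be turned into a speedup factor and a per-line cost. This sketch uses only the "real" times and the line count reported here; the variable names are just for illustration.

```shell
#!/usr/bin/env bash
# "real" times from the two runs above, in seconds.
builtin_s=4.750
pipeline_s=654.255   # 10m54.255s = 10*60 + 54.255
lines=100939         # line count of the manifest

# The shell has no floating-point arithmetic, so awk does the division.
summary=$(awk -v b="$builtin_s" -v p="$pipeline_s" -v n="$lines" 'BEGIN {
    printf "speedup: %.1fx\n", p / b
    printf "pipeline: %.2f ms/line\n", p / n * 1000
    printf "builtin: %.3f ms/line\n", b / n * 1000
}')
printf '%s\n' "$summary"
# speedup: 137.7x
# pipeline: 6.48 ms/line
# builtin: 0.047 ms/line
```

The large sys time in the second run is consistent with most of the cost going into creating processes for the command substitution, the pipe, and awk on every line.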

Some (myself included) will say that if speed matters, you shouldn't be messing around in the shell anyway. But sometimes you want to be more efficient, no matter what environment you're using for the job.

Note that doing this:

while read md5sum filename; do
    (...etc...)

is even more efficient than doing the variable assignment, but that gain is small next to the one from eliminating the command substitution/pipe/awk construct. What I find most interesting is the gap in performance between bash builtins and external commands. I'll be more diligent about learning and using the fancy builtin stuff!
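
A minimal sketch of that read-based loop, filled out so it can be run on its own. The temporary manifest and its second entry (a name containing spaces) are made up for illustration and are not from my real file.

```shell
#!/usr/bin/env bash
# Hypothetical two-line manifest; the second name contains single spaces.
manifest=$(mktemp)
printf '%s\n' \
  '05c00367e8914ca1be0964821d127977  ./.fseventsd/0000000000097aa1' \
  'cd9d4291f59a43c0e3d73ff60a337bb5  ./dir/name with spaces.txt' > "$manifest"

result=''
# read splits on IFS whitespace: the first field goes to md5sum and the
# rest of the line, inner spaces included, goes to filename.
# -r keeps backslashes in the input literal.
while read -r md5sum filename; do
    result="${result}SUM: $md5sum FILE: $filename
"
done < "$manifest"
rm -f "$manifest"

printf '%s' "$result"
```

One caveat with this form: with the default IFS, read also strips leading and trailing whitespace, so a filename that begins or ends with a space would not survive intact.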

Mike S
  • Why the cat in a process substitution? And why not read checksum and pathname with read directly? – Kusalananda Jul 13 '18 at 07:12
  • 1) Muscle memory. Thanks, I've removed the spurious cat. Bad kitty! 2) True, that works, and maybe it's more succinct, but I was interested in the difference between the builtin and the pipe where the technique is similar between the two styles. I often use the pipe construct for more complex problems. I was curious how much that hurt my efficiency, and whether I should apply myself more to optimizing it in my shell work in general. It seems the answer is yes, if it may help (i.e., larger datasets). This is just an example, and I was surprised at the speedup! – Mike S Jul 13 '18 at 23:31