3

I've got the following script:

#!/usr/bin/env bash
# Script to generate MD5 hash for each line.
[ $# -eq 0 ] && { echo "Usage: $0 file"; exit 1; }
file=$1
shopt -s expand_aliases
alias calc_md5='while read -r line; do md5sum <<<$line; done'
paste <(sort "$file" | uniq | calc_md5) <(sort "$file" | uniq)
times

It prints the MD5 checksum for each line, side by side, which is exactly what I need. For example:

$ ./md5_lines.sh file.dat
5c2ce561e1e263695dbd267271b86fb8  - line 1
83e7cfc83e3d1f45a48d6a2d32b84d69  - line 2
0f2d633163ca585e5fc47a510e60f1ff  - line 3
73bb3632fc91e9d1e1f7f0659da7ec5c  - line 4

The problem with the above script is that it needs to read and parse the file twice, once for each column/stream. Ideally, I'd like to sort the lines and make them unique only once, and use that result as the input for both columns.

How can I convert the above script to parse the file only once (sort & uniq), then redirect the output to two different streams and display the lines side by side, so that it works more quickly for larger files?


Here is another attempt of mine:

tee >(calc_md5) >(cat -) \
      < <(sort "$file" | uniq) \
      >/dev/null
times

but it prints the streams separately (not side-by-side).

Ideally, I'd like to use paste, the same way as tee, however it gives me the error:

$ paste >(cat -) >(cat -) </etc/hosts
paste: /dev/fd/63: Permission denied
kenorb
  • @DavidFoerster The entire stream would give a different checksum; I need a checksum generated for each line. I used md5sum to simplify the script. My original use case was to find partial SHA conflicts with: alias calc_sha="php -r 'while(\$line = fgets(STDIN)){ echo substr(sha1(strtok(\$line, PHP_EOL)), 6, 9) . PHP_EOL; };'" in a large list of IDs (but that's a topic for another question), so ideally I'd like to avoid invoking separate instances for performance reasons and work with streams instead. But any information about dealing with multiple streams is useful. – kenorb Mar 04 '18 at 21:03
  • @kenorb: Ah, my mistake! I thought you were running md5sum -c in "check" mode for some reason. Although I would still find it more readable and cleaner to define an equivalent function instead of an alias. – David Foerster Mar 04 '18 at 23:19

3 Answers

8

If you want to display two things side by side you could just use printf for formatted printing.

#!/bin/bash
sort "$1" | uniq | while IFS= read -r line; do
    md5=$(md5sum <<< "$line")
    printf "%s %s\n" "$md5" "$line"
done 
times
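
In a loop like this, the exact form of read matters: without IFS= and -r, read trims leading and trailing whitespace and treats backslashes as escapes, so the text being hashed may differ from the line in the file. A minimal sketch of a variant that preserves both, wrapped in a hypothetical hash_lines function (the name is only for illustration):

```bash
#!/usr/bin/env bash
# hash_lines: hypothetical helper that reads lines on stdin and prints
# "<md5sum output> <line>" for each one.
hash_lines() {
    local line md5
    # IFS= keeps leading/trailing blanks; -r keeps backslashes literal.
    while IFS= read -r line; do
        # The here-string appends a newline, so md5sum hashes "$line\n".
        md5=$(md5sum <<< "$line")
        printf '%s %s\n' "$md5" "$line"
    done
}
```

It could be used as: sort "$1" | uniq | hash_lines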
terdon
Captain Wobbles
5

A couple of Perl approaches:

  1. Use Perl to get the md5sum

    $ perl -ne 'BEGIN{  
                    use Digest::MD5  qw(md5_hex)
                } 
                $k{$_}=md5_hex("$_"); 
                END{
                    print "$k{$_} - $_" for sort keys(%k)
                }' file
    5c2ce561e1e263695dbd267271b86fb8 - line 1
    83e7cfc83e3d1f45a48d6a2d32b84d69 - line 2
    0f2d633163ca585e5fc47a510e60f1ff - line 3
    73bb3632fc91e9d1e1f7f0659da7ec5c - line 4
    d82912361d84a675530f5e32aa6eeda1 - line 5
    

    And yes, this is a one-liner:

    perl -ne 'BEGIN{use Digest::MD5  qw(md5_hex)} $k{$_}=md5_hex("$_"); END{print "$k{$_} - $_" for sort keys(%k)}' file
    

    This should be much faster than doing this sort of processing in the shell.

  2. Use a system call

    $ perl -lne 'chomp($md=`md5sum <<<"$_"`); print "$md $_" if !$seen{$_}++' file
    83e7cfc83e3d1f45a48d6a2d32b84d69  - line 2
    0f2d633163ca585e5fc47a510e60f1ff  - line 3
    d82912361d84a675530f5e32aa6eeda1  - line 5
    73bb3632fc91e9d1e1f7f0659da7ec5c  - line 4
    5c2ce561e1e263695dbd267271b86fb8  - line 1
    
terdon
  • Your part 2 has several issues: it needs a zsh-compatible sh for <<<, it is an ACE vulnerability ($_ is interpreted as shell code!), and of course it is going to be even less efficient than a while read loop, as you're also running one shell per line (not even per unique line) in addition to one md5sum. – Stéphane Chazelas Mar 05 '18 at 12:01
  • @StéphaneChazelas indeed. I basically added it for fun and because the OP was using the external command. I'd use option 1 myself or, even better, your much simplified version. – terdon Mar 05 '18 at 12:16
2

Going for a while read loop has many of the issues mentioned at "Why is using a shell loop to process text considered bad practice?"

Here, I'd use perl:

sort -u < "$file" | perl -MDigest::MD5=md5_hex -lpe '
  $_ = md5_hex($_) . " - " . $_'

Your more general question looks like a duplicate of or variation on "tee + cat: use an output several times and then concatenate results".
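
For this particular case, one simple way to reuse the sorted output without sorting twice, assuming a temporary file is acceptable, is to store it once and read it for both columns. The sketch below is only an illustration: md5_side_by_side is a hypothetical name, and calc_md5 mirrors the alias from the question as a function. The temporary file is still read twice, but the sort and uniq step runs only once.

```bash
#!/usr/bin/env bash
# calc_md5: hash each input line separately (line plus newline).
calc_md5() {
    local line
    while IFS= read -r line; do
        md5sum <<< "$line"
    done
}

# md5_side_by_side FILE: sort and de-duplicate FILE once, then print
# "<md5sum output><TAB><line>" for each unique line.
md5_side_by_side() {
    local file=$1 tmp status
    tmp=$(mktemp) || return 1
    # Sort and unique exactly once; both columns read this result.
    LC_ALL=C sort -u -- "$file" > "$tmp"
    # paste needs readable inputs: a <(...) substitution and the temp file.
    paste <(calc_md5 < "$tmp") "$tmp"
    status=$?
    rm -f -- "$tmp"
    return "$status"
}
```

It could be called as: md5_side_by_side file.dat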

Note that two lines sorting the same (in which case sort -u retains only one of them) does not imply that they are identical and have the same MD5 checksum. You may want to use LC_ALL=C sort -u so that the sorting and uniquing are based on a byte-to-byte comparison as opposed to strcoll(). Also beware that some sort implementations could choke on non-text input, which in the C locale would still include too-long lines, unterminated lines, or lines containing NUL characters.
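
As a small illustration of byte-wise uniquing, here is a sketch with a hypothetical unique_bytewise wrapper (the name is an assumption, not an existing tool). In the C locale, lines are compared byte by byte, so uppercase ASCII letters sort before lowercase ones and only byte-identical lines are merged:

```bash
#!/usr/bin/env bash
# unique_bytewise [FILE...]: sort and de-duplicate by raw byte values.
# With no FILE arguments, sort reads standard input.
unique_bytewise() {
    LC_ALL=C sort -u -- "$@"
}

# "a" appears twice and is kept once; "B" (0x42) sorts before "a" (0x61).
printf 'b\na\nB\na\n' | unique_bytewise
# B
# a
# b
```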