
I have a bunch of files and for each row there is a unique value I'm trying to obscure with a hash.

However, there are 3M rows across the files, and a rough calculation puts the time needed to complete the process at a hilariously long 32 days.

for y in files*; do 
  cat $y | while read z; do
    KEY=$(echo $z | awk '{ print $1 }' | tr -d '"')
    HASH=$(echo $KEY | sha1sum | awk '{ print $1 }')
    sed -i -e "s/$KEY/$HASH/g" $y
  done
done
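
(For scale: 32 days over 3M rows works out to roughly 0.9 seconds per row, which is about what the handful of process spawns plus a full sed -i rewrite of the file for every single row would cost.)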

To improve this process's speed, I assume I'm going to have to introduce some concurrency.

A hasty attempt based on https://unix.stackexchange.com/a/216475 led me to

N=4
(
for y in files*; do 
  cat $y | while read z; do
    ((i=i%N)); ((i++==0)) && wait
    (KEY=$(echo $z | awk '{ print $1 }' | tr -d '"')
    HASH=$(echo $KEY | sha1sum | awk '{ print $1 }')
    sed -i -e "s/$KEY/$HASH/g) & 
  done
done
)

Which performs no better.

Example input

"2000000000" : ["200000", "2000000000"]
"2000000001" : ["200000", "2000000001"]

Example output

"e8bb6adbb44a2f4c795da6986c8f008d05938fac" : ["200000", "e8bb6adbb44a2f4c795da6986c8f008d05938fac"]
"aaac41fe0491d5855591b849453a58c206d424df" : ["200000", "aaac41fe0491d5855591b849453a58c206d424df"]

Perhaps I should read the lines concurrently and then perform the hash and replace on each line?

  • Please include an example input with very few lines and the output you expect. Currently, this code will be very inefficient for many lines, because it contains many process calls, around 10 per line. Also the shell is not suitable for reading large files line by line. To be performant, you have to call one program to process the whole file. – thanasisp Dec 10 '20 at 17:50
  • @thanasisp I can give an example of the input/output. I see no way around using this many calls per line, unless I were to create an intermediary file, which of course I can do as an experiment at least. But I think the only expensive call out of all of those is sha1sum. – Liam Pieri Dec 10 '20 at 18:09
  • I believe this sed -i takes almost all of the time, but after some small, reproducible, testable example, the whole process can be improved. – thanasisp Dec 10 '20 at 18:13
  • The major fault in your process (first attempt) is that sed is happily overwriting the current file while cat is still reading it, and sending some unknown amount of it into a pipe. The second version has an unbalanced quote and does not name the file it is editing. Nobody doubts that this would run for 32 days: it creates at least 20 million processes and rewrites a file 3 million times. Shell is so much the wrong tool for this. Last time I saw a script like this, I got the runtime down from 30 days to 2 minutes using awk. – Paul_Pedant Dec 10 '20 at 18:57
  • @Paul_Pedant the problem is that, while we should be able to improve the performance, having to call sha1sum once per line WILL be a major bottleneck in this case since calling it from awk would require creating a subshell every time awk reads a line. If there's a tool out there with sha1sum functionality built in (perl? python?) that'd be a much better approach wrt execution speed. – Ed Morton Dec 10 '20 at 19:01
  • @αғsнιη please reread my Example "2000000001" : ["200000", "2000000000"] was my fat fingered mistake. The input line should have actually been "2000000001" : ["200000", "2000000001"]. – Liam Pieri Dec 10 '20 at 19:08
  • @EdMorton I'm verifying my own solution within Python. Almost there! – Liam Pieri Dec 10 '20 at 19:39
  • @EdMorton Standard sha1sum is a single-shot tool. I just timed it for 10,000 12-digit numbers, got 2m44s. That scales to 6 days for 30 million values. If this was my team's project, I would have somebody writing a bulk version by now. I can see python3 can import hashlib, and the example in https://www.geeksforgeeks.org/sha-in-python gives the same result as GNU sha1sum. Back on solid ground, I think. – Paul_Pedant Dec 10 '20 at 19:56
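
(Not part of the original thread: a quick sanity check of the hashlib route mentioned in the comments above. Python's hashlib.sha1 gives the same digest as GNU sha1sum as long as both are fed exactly the same bytes; note that echo appends a newline, so "echo KEY | sha1sum" hashes the key plus a newline.)

import hashlib

key = "2000000000"
# same digest as:  printf '%s' 2000000000 | sha1sum   (no trailing newline)
print(hashlib.sha1(key.encode()).hexdigest())
# same digest as:  echo 2000000000 | sha1sum          (echo appends a newline)
print(hashlib.sha1((key + "\n").encode()).hexdigest())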

2 Answers


FWIW I think this is the fastest way you could do it in a shell script:

$ cat tst.sh
#!/usr/bin/env bash

for file in "$@"; do
    while IFS='"' read -ra a; do
        sha=$(printf '%s' "${a[1]}" | sha1sum)
        sha="${sha%% *}"
        printf '%s"%s"%s"%s"%s"%s"%s\n' "${a[0]}" "$sha" "${a[2]}" "${a[3]}" "${a[4]}" "$sha" "${a[6]}"
    done < "$file"
done

$ cat file
"2000000000" : ["200000", "2000000000"]
"2000000001" : ["200000", "2000000001"]

$ ./tst.sh file
"e8bb6adbb44a2f4c795da6986c8f008d05938fac" : ["200000", "e8bb6adbb44a2f4c795da6986c8f008d05938fac"]
"aaac41fe0491d5855591b849453a58c206d424df" : ["200000", "aaac41fe0491d5855591b849453a58c206d424df"]

but as I mentioned in the comments, you'd be better off for speed of execution using a tool with sha1sum functionality built in, e.g. python.
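
(Not from the original answer: a minimal sketch of that python route, using the same split-on-double-quote idea as tst.sh but hashing in-process with hashlib instead of forking sha1sum once per line. The filename handling is an assumption; it just prints the transformed lines to stdout.)

#!/usr/bin/env python3
# sketch: hash the quoted key in each line, one input file per argument
import hashlib
import sys

for path in sys.argv[1:]:
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split('"')
            # in the sample input, parts[1] and parts[5] hold the value to obscure
            digest = hashlib.sha1(parts[1].encode()).hexdigest()
            parts[1] = digest
            parts[5] = digest
            print('"'.join(parts))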

Ed Morton

As advised by Ed Morton, with a little help from python.

Create a python script /tmp/sha1.py and make it executable:

#! /usr/local/bin/python -u

import hashlib
import sys

for line in sys.stdin:
    words = line.split()
    str_hash = hashlib.sha1(words[0].encode())
    words[0] = str_hash.hexdigest()
    print(" ".join(words))

The first line should contain the correct location of your python, but don't remove the "-u": it keeps Python's stdin/stdout unbuffered, so each hash is written back to the coprocess immediately instead of sitting in a buffer and deadlocking the ksh script.

Then a ksh script, which you should also make executable:

#! /usr/bin/ksh

/tmp/sha1.py |&

for y in files*
do
    while read A B
    do
        eval "echo $A" >&p
        read A <&p
        echo "$A" $B
    done < $y > TMP.$y
    mv TMP.$y $y
done

Terminate sha1.py:

exec 3>&p
exec 3>&-
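
(For reference: the "|&" on the /tmp/sha1.py line starts it as a ksh coprocess; ">&p" writes a key to it and "<&p" reads the hash back, so one long-lived Python process serves every line instead of a new sha1sum per row. The two exec lines at the end close the pipe to the coprocess so it sees end-of-file and exits.)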

Now, if you want performance, you should let python handle a complete file at once. The following script treats each input line as a filename and does your dirty work:

#! /usr/local/bin/python

import hashlib
import os
import sys

for IFileNmX in sys.stdin:
    IFileNm = IFileNmX.strip()
    IFile = open(IFileNm,'r')
    OFileNm = ".".join(["TMP",IFileNm])
    OFile = open(OFileNm,'w')
    for line in IFile.readlines():
        words = line.split()
        word1 = words[0].strip('"')
        str_hash = hashlib.sha1(word1.encode())
        words[0] = "".join(['"',str_hash.hexdigest(),'"'])
        OFile.write("".join([" ".join(words),'\n']))
    OFile.close()
    IFile.close()
    os.rename(OFileNm,IFileNm)

If you call this script /tmp/sha1f.py and make it executable, I wonder how many minutes

ls files* | /tmp/sha1f.py

would take. My system took 12 seconds to deal with a 400 MB file of a million lines. But that's boasting, of course.

  • Thank you, this informed my solution. – Liam Pieri Dec 10 '20 at 20:14
  • You can use the os.listdir function to get the file list in python, and the multiprocessing module, shipped in the python standard library, to process several files in parallel. – wl2776 Dec 11 '20 at 06:40
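
(A sketch of that suggestion, not from the thread: collect the files with glob/os.listdir and let a multiprocessing.Pool run the per-file hashing from the answer above in parallel, one worker per file. The "files*" pattern is assumed from the question.)

#!/usr/bin/env python3
# sketch: hash the first field of every line, processing the files in parallel
import glob
import hashlib
import os
from multiprocessing import Pool

def hash_file(path):
    tmp = ".".join(["TMP", path])
    with open(path) as src, open(tmp, "w") as dst:
        for line in src:
            words = line.split()
            key = words[0].strip('"')
            words[0] = '"' + hashlib.sha1(key.encode()).hexdigest() + '"'
            dst.write(" ".join(words) + "\n")
    os.rename(tmp, path)

if __name__ == "__main__":
    with Pool() as pool:
        pool.map(hash_file, glob.glob("files*"))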