I have a bunch of files, and in each row there is a unique value I'm trying to obscure with a hash.
However, there are 3M rows across the files, and a rough calculation puts the time needed to complete the process at a hilariously long 32 days.
for y in files*; do
    while read -r z; do
        # extract the quoted first field as the key
        KEY=$(echo "$z" | awk '{ print $1 }' | tr -d '"')
        HASH=$(echo "$KEY" | sha1sum | awk '{ print $1 }')
        # replace every occurrence of the key across the whole file
        sed -i -e "s/$KEY/$HASH/g" "$y"
    done < "$y"
done
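For reference, a rough way to see which step dominates is to time the key-extraction/hash step and the per-row sed -i separately on a small sample (a sketch; files1 and sample are placeholder names I've made up):
head -n 100 files1 > sample
time while read -r z; do
    KEY=$(echo "$z" | awk '{ print $1 }' | tr -d '"')
    echo "$KEY" | sha1sum > /dev/null           # extraction + hashing only
done < sample
time while read -r z; do
    KEY=$(echo "$z" | awk '{ print $1 }' | tr -d '"')
    sed -i -e "s/$KEY/$KEY/g" sample            # extraction + whole-file rewrite per row
done < sample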
To improve this process's speed, I assume I'm going to have to introduce some concurrency.
A hasty attempt based on https://unix.stackexchange.com/a/216475 led me to:
N=4
(
for y in files*; do
    while read -r z; do
        ((i=i%N)); ((i++==0)) && wait
        (
            KEY=$(echo "$z" | awk '{ print $1 }' | tr -d '"')
            HASH=$(echo "$KEY" | sha1sum | awk '{ print $1 }')
            sed -i -e "s/$KEY/$HASH/g" "$y"
        ) &
    done < "$y"
done
)
Which performs no better.
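I suspect that's because every row still triggers a whole-file sed -i, backgrounded or not. If the key only ever needs to be replaced within its own row (which is all the example data below needs), a single pass per file that rewrites each line as it is read would avoid sed -i entirely; a minimal sketch, writing to a .hashed copy (a name I've made up) rather than editing in place:
for y in files*; do
    while IFS= read -r z; do
        KEY=$(echo "$z" | awk '{ print $1 }' | tr -d '"')
        HASH=$(echo "$KEY" | sha1sum | awk '{ print $1 }')
        # replace the key on this line only and write the line out once
        printf '%s\n' "${z//$KEY/$HASH}"
    done < "$y" > "$y.hashed"
done
This still forks awk and sha1sum for every row, so it is far from fast, but it drops the per-row rewrite of the whole file.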
Example input
"2000000000" : ["200000", "2000000000"]
"2000000001" : ["200000", "2000000001"]
Example output
"e8bb6adbb44a2f4c795da6986c8f008d05938fac" : ["200000", "e8bb6adbb44a2f4c795da6986c8f008d05938fac"]
"aaac41fe0491d5855591b849453a58c206d424df" : ["200000", "aaac41fe0491d5855591b849453a58c206d424df"]
Perhaps I should read the lines concurrently, then perform the hash-replace on each line?
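Or perhaps the concurrency belongs at the file level rather than the line level: if each file is handled in a single pass (as in the sketch above), the N-background-jobs pattern from the linked answer can be applied per file instead of per row. A sketch, again writing to hypothetical .hashed copies:
hash_one_file() {
    # single-pass rewrite of one file, as sketched above
    while IFS= read -r z; do
        KEY=$(echo "$z" | awk '{ print $1 }' | tr -d '"')
        HASH=$(echo "$KEY" | sha1sum | awk '{ print $1 }')
        printf '%s\n' "${z//$KEY/$HASH}"
    done < "$1" > "$1.hashed"
}
N=4
i=0
for y in files*; do
    ((i=i%N)); ((i++==0)) && wait    # start at most N files at a time
    hash_one_file "$y" &
done
wait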
Comments
sed -i takes almost all of the time, but after some small reproducible testable example, the whole process can be improved. – thanasisp Dec 10 '20 at 18:13
… sha1sum functionality built in (perl? python?), that'd be a much better approach wrt execution speed. – Ed Morton Dec 10 '20 at 19:01
"2000000001" : ["200000", "2000000000"] was my fat-fingered mistake. The input line should have actually been "2000000001" : ["200000", "2000000001"]. – Liam Pieri Dec 10 '20 at 19:08