I am looking for an effective and simple way to generate IDs for the following content using a bash script:
{"name": "John", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"name": "John1", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"name": "John2", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"name": "John3", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
Expected output:
{"id": "XXX", "name": "John", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"id": "XXX", "name": "John1", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"id": "XXX", "name": "John2", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"id": "XXX", "name": "John3", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
I will have approximately 5,000,000 similar records and I want to generate a repeatable, predictable ID for each one. I am constrained by time: the file has to be processed and loaded into an SQLite database within a 20-minute window on a Linux machine.
MD5 and SHA1 are too expensive to use, unless something like GNU Parallel running 16 threads on an AMD Ryzen 1900X CPU could get it done in a few minutes?
I have tried MD5 and computed 28,000 IDs in 1 min 45 s. With SHA1 it took 2 min 3 s.
I was thinking about creating a very simple ID:
JohnGatesGermany20180
John1GatesGermany20180
John2GatesGermany20180
John3GatesGermany20180
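A minimal sketch of this concatenation idea, done in a single awk process rather than a per-line bash loop. It assumes every line has exactly the five keys shown above in the same order and that no value contains an escaped double quote; `add_concat_id` and `with_ids.txt` are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: build the concatenated ID in one awk process.
# Assumption: fixed key order (name, surname, country, age, height)
# and no escaped double quotes inside any value.
add_concat_id() {
  awk -F'"' '{
    # With the double quote as separator, the values are fields 4, 8, 12, 16, 20.
    id = $4 $8 $12 $16 $20
    # Re-emit the object with "id" as a new first key.
    printf "{\"id\": \"%s\", %s\n", id, substr($0, 2)
  }'
}

# Example: add_concat_id < testfile1.txt > with_ids.txt
```

Plain concatenation can map different records to the same ID (age 20 and height 180 versus age 201 and height 80, for example), so adding a separator between the values may be worth considering.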
What would you recommend, given that the following requirements have to be met:
- bash
- Linux
- 5,000,000 records to process
- under 20 minutes
- the ID has to be the same for identical JSON lines
Performed tests:
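#!/usr/local/bin/bash
# Generate a name-based UUID once per input line (one uuidgen process per line).
# The name is fixed here; for per-record IDs it would be "$line".
while IFS= read -r line
do
    uuid=$(uuidgen -s --namespace @dns --name "www.example.com")
done < testfile1.txt
MD5 hashing of 1,000,000 lines:
$ time bash script.sh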
real 13m6.914s
user 10m24.523s
sys 2m56.095s
cksum computing a CRC over 1,000,000 lines:
#!/usr/local/bin/bash
# Checksum each line with cksum, again starting one process per line.
while IFS= read -r line
do
    # uuid=$(uuidgen -s --namespace @dns --name "www.example.com")
    echo "$line $uuid" | cksum >> test3.txt
done < testfile1.txt
$ time bash script.sh
real 12m49.396s
user 12m23.219s
sys 4m1.417s
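Both test loops start a separate process (uuidgen or cksum) for every input line, so a good share of the measured time is probably process start-up rather than the hashing itself. The following is a rough sketch of deriving a repeatable ID from each line inside one awk process instead. The helper name `add_hash_id` and the two-multiplier polynomial checksum are illustrative assumptions; this is neither MD5 nor SHA1, collisions are unlikely but possible, and it has not been timed at the 5,000,000-record scale.

```shell
#!/usr/bin/env bash
# Sketch: compute a content-based ID per line without starting
# an external command for each line.
# Assumption: a simple non-cryptographic checksum is acceptable as an ID.
add_hash_id() {
  LC_ALL=C awk '
    BEGIN {
      # Lookup table from a single-byte character to its numeric value.
      for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i
    }
    {
      h1 = 0; h2 = 0
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = ord[substr($0, i, 1)]
        # Two polynomial checksums with different multipliers and moduli.
        h1 = (h1 * 31 + c) % 2147483647
        h2 = (h2 * 131 + c) % 2147483629
      }
      # Prepend the 16-hex-digit ID as a new first key of the JSON object.
      printf "{\"id\": \"%08x%08x\", %s\n", h1, h2, substr($0, 2)
    }'
}

# Example: add_hash_id < testfile1.txt > test3.txt
```

Because the ID depends only on the bytes of the line, identical JSON lines always get the same ID, while any change in whitespace or key order produces a different one.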
uuidgen? – DopeGhoti Aug 02 '18 at 20:45
sed '/^\s*$/d' file > file2 && awk '{print NR-0,$0}' file2 > file3. That will remove any blank lines and send the output to file2, and then sequentially number the lines and send the output to file3. You can then examine file3 to make sure that it's what you want and then export it to your SQLite database. Your original file will remain the same in case you don't like what you see. To make it cleaner, you can also use sed to remove the brackets, colons, commas, and double quotes. – Nasir Riley Aug 02 '18 at 21:53
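The two steps in the comment above can also be sketched as a single awk pass, which avoids the intermediate file. `number_lines` is a hypothetical helper name. Note that a sequential number is only repeatable if the input file always arrives in the same order.

```shell
#!/usr/bin/env bash
# Sketch: skip blank (empty or whitespace-only) lines and prefix
# every remaining line with a running number.
number_lines() {
  awk 'NF { print ++n, $0 }'
}

# Example: number_lines < file > file3
```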