3

I am looking for effective and simple ID generation for the following content using a bash script:

{"name": "John", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"name": "John1", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"name": "John2", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"name": "John3", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}


{"id": "XXX", "name": "John", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"id": "XXX", "name": "John1", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"id": "XXX", "name": "John2", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}
{"id": "XXX", "name": "John3", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"}

I will have approximately 5,000,000 similar records and I want to generate a repeatable, predictable ID for each of them. I am constrained by time: I need to process the file and load it into an SQLite database on a Linux machine within a 20 minute window.

MD5 and SHA1 are too expensive to use as-is, unless I can do something like GNU Parallel on 16 threads on an AMD Ryzen 1900X CPU, which might manage to do it in a few minutes?

I have tried with MD5 and managed to calculate 28,000 IDs in 1 minute 45 seconds. With SHA1 it took me 2 minutes 3 seconds.
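
Something along these lines is what I have in mind (just a rough sketch, assuming GNU Parallel and md5sum are installed); it spreads the input across cores, but still starts one md5sum per line:

# Split stdin into ~10 MB blocks on line boundaries and hash each line in parallel,
# one job per CPU core; --keep-order keeps the output in input order.
parallel --pipe --keep-order --block 10M '
  while IFS= read -r line; do
    printf "%s" "$line" | md5sum | cut -d" " -f1
  done
' < testfile1.txt > ids.txt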

I was thinking about creating a very simple ID:

JohnGatesGermany20180
John1GatesGermany20180
John2GatesGermany20180
John3GatesGermany20180

What would you recommend, given that the following requirements have to be met:

  • bash
  • Linux
  • 5,000,000 records to process
  • under 20 minutes
  • the ID has to be the same for the same JSON lines

Performed tests:

#!/usr/local/bin/bash

while IFS= read -r line
do
   uuid=$(uuidgen -s --namespace @dns --name "www.example.com" )
done < testfile1.txt

md5 hashing of 1,000,000 lines:

$ time bash script.sh

real    13m6.914s
user    10m24.523s
sys 2m56.095s

cksum doing a CRC on 1,000,000 lines:

#!/usr/local/bin/bash

while IFS= read -r line
do
#    uuid=$(uuidgen -s --namespace @dns --name "www.example.com" )
    echo "$line $uuid"|cksum >> test3.txt
done < testfile1.txt

$ time bash script.sh

real    12m49.396s
user    12m23.219s
sys 4m1.417s
Jeff Schaller
Anna

  • Why not just use uuidgen? – DopeGhoti Aug 02 '18 at 20:45
  • Or just sequentially number them? – muru Aug 02 '18 at 21:04
  • To add to what muru said, use sed '/^\s*$/d' file > file2 && awk '{print NR-0,$0} file2 > file3'. That will remove any blank lines and send the output to file2 and then sequentially number the lines and send the output to file3. You can then examine file3 to make sure that it's what you want and then export it to your SQL Lite database. Your original file will remain the same in case you don't like what you see. To make it cleaner, you can also use sed to remove the brackets, colons, commas, and double quotes. – Nasir Riley Aug 02 '18 at 21:53
  • Apologies, I have not specified one detail: I want the UUID to be the same for the same lines. – Anna Aug 02 '18 at 23:21
  • If you're importing into a database, have you considered allowing the database to generate unique IDs itself? Then rows will be uniquely identifiable within the database and you don't have to generate them. – Jeff Schaller Aug 03 '18 at 01:35
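
A minimal sketch of that suggestion, assuming SQLite is the target database: a column declared INTEGER PRIMARY KEY aliases SQLite's rowid and is assigned automatically on insert, so no ID has to be computed at all.

sqlite3 database.db <<'SQL'
CREATE TABLE data (
    id      INTEGER PRIMARY KEY,   -- aliases the rowid; filled in automatically
    name    TEXT,
    surname TEXT,
    country TEXT,
    age     TEXT,
    height  TEXT
);
INSERT INTO data (name, surname, country, age, height)
VALUES ('John', 'Gates', 'Germany', '20', '180');
SQL

Note that such IDs are sequential rather than derived from the line content, so on their own they would not satisfy the "same ID for the same line" requirement.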

4 Answers

4

I'll wager that the reason your scripts take so long is that you're running uuidgen (or cksum) once for each line. Plenty of time is wasted simply starting those processes.

With 5M lines of the form {"name": "John%d", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"} in a file on a tmpfs filesystem, the following Python script finished in seconds:

#! /usr/bin/env python3

import hashlib
import sys
for line in sys.stdin:
    print(hashlib.md5(line.rstrip('\n').encode('utf-8')).hexdigest())

Execution:

$ time ./foo.py < input > output
./foo.py < input > output  6.00s user 0.13s system 99% cpu 6.135 total
% wc -l input output
  5000000 input
  5000000 output
 10000000 total

Since this is Python, you could also JSON-decode the lines and insert an ID into each. Even inefficient code like:

#! /usr/bin/env python3

import hashlib
import json
import sys
for line in sys.stdin:
    l = line.rstrip('\n').encode('utf-8')
    o = json.loads(line)
    o["id"] = hashlib.md5(l).hexdigest()
    print(json.dumps(o))

Finished in less than a minute:

% time ./foo.py < input > output
./foo.py < input > output  42.11s user 0.42s system 99% cpu 42.600 total

% head output 
{"name": "John1", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "2dc573ccb15679f58abfc44ec8169e52"}
{"name": "John2", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "ee0583acaf8ad0e502bf5abd29f37edb"}
{"name": "John3", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "a7352ebb79db8c8fc2cc8758eadd9ea3"}
{"name": "John4", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "2062ad1b67ccdce55663bfd523ce1dfb"}
{"name": "John5", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "5f81325c104c01c3e82abd2190f14bcf"}
{"name": "John6", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "493e0c9656f74ec3616e60886ee38e6a"}
{"name": "John7", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "19af9ef2e20466d0fb0efcf03f56d3f6"}
{"name": "John8", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "2348bd47b20ac6445213254c6a8aa80b"}
{"name": "John9", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "090a521b4a858705dc69bf9c8dca6c19"}
{"name": "John10", "surname": "Gates", "country": "Germany", "age": "20", "height": "180", "id": "fc3c699323cbe399e210e4a191f04003"}

My specs:

  • Intel® Core™ i7-8700 CPU @ 3.20GHz × 12
  • 2666MHz DDR4 memory

Your uuidgen-based script barely managed to finish 500k lines in 4 minutes. Modified to save the output:

#!/usr/bin/bash

while IFS= read -r line
do
   uuidgen -s --namespace @dns --name "$line"
done < input > uuid

Execution:

% timeout 240 ./foo.sh
% wc -l uuid
522160 uuid
muru
1

An implementation of your simple ID idea in awk, assuming the JSON lines are as you've shown them -- each record on a single line:

awk -F'"' 'BEGIN{OFS=FS} {$1=$1"\"id\": \""$4$8$12$16$20"\", "; }1' < input
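
If the field numbering isn't obvious, here is the same program spread out with comments; it assumes every record carries exactly these five quoted values, in this order:

awk -F'"' '
BEGIN { OFS = FS }   # rebuild each record with the same separator it was split on
{
    # Splitting on double quotes puts the values in fields 4, 8, 12, 16 and 20
    # (name, surname, country, age, height); prepend them as an id member.
    $1 = $1 "\"id\": \"" $4 $8 $12 $16 $20 "\", "
    print
}' < input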

I don't have a comparable system to yours, so you'll have to see if the timing is acceptable.

Jeff Schaller
0

More as a thought experiment, I wanted to see how far we could push CLI tools to solve this type of problem. To that end, I tried using the fast hashing CLI tool xxHash to do the work.

xxHash is an extremely fast non-cryptographic hash algorithm, working at speeds close to RAM limits. It is proposed in two flavors, 32 and 64 bits.

It's available for just about every programming language, but for this experiment I'm going to use the CLI flavor, xxhsum, specifically its 32-bit mode, i.e. xxhsum -H0.

As you've found out, and as others have stated, calling the hashing CLI tool (or any tool) over and over is typically where these approaches fall down. Calling xxhsum 5M times would be a sub-optimal way to use it. Its strength lies in file I/O, so what if we were to take the 5M lines and convert them into 5M files?

That task is actually trivial in Linux, using the split command:

split -l 1 afile

And how fast would it be to hash, say, 1M of these files, each containing a single line like this?

sample 1 line file

$ cat datadir/xzeyw
{"name": "John4000", "surname": "Gates", "country": "Germany", "age": "20", "height": "180"} 

directory with 1M files

$ ls -l datadir | wc -l
1000002

time to hash them

$ { time xxhsum -H0 * > ../nfile 2>&1 ;} |& awk '/real|user|sys/ {print $1": "$2"\t"}' | tr -d '\n'
real: 0m6.998s  user: 0m5.007s  sys: 0m1.569s

Yes, that's correct, it took ~7 seconds! I find this rather impressive. Used in this manner, we've only incurred the cost of starting xxhsum once and let it loop through the 1M files itself.

Drawbacks to this method

One of the drawbacks of doing it this way is, of course, the split. This becomes our most expensive operation, as you can imagine, since we have to take a single file with X rows and explode it onto the HDD as X files, each containing a single line.

Here's some of that data:

./hashy.bash

make data
---------
real: 0m17.492s user: 0m12.434s sys: 0m4.788s

split data
----------
real: 2m15.180s user: 0m0.700s  sys: 2m4.443s

hash data
---------
real: 0m6.487s  user: 0m5.798s  sys: 0m0.459s

Here we can see that our split operation took ~2 minutes. NOTE: The first line in this output shows the time to construct a file with 1M lines of JSON in it.

Another drawback is the number of files we're dealing with on the command line. I'm using * in places, which will expand to 1M or 5M filenames; that could be considered dangerous, and it is. Keep in mind that as you increase the number of files, you run the risk of exceeding the amount of space allocated for command-line arguments.
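
One way to sidestep that limit (just a sketch, not part of the timed runs above) is to let find and xargs batch the filenames into several xxhsum invocations instead of expanding * on a single command line:

# Run from the parent of datadir; xargs keeps each xxhsum call under ARG_MAX.
find datadir -type f -print0 | xargs -0 xxhsum -H0 > nfile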

Conclusion

So, as you can imagine, solving a problem like this using 1M or 5M files seems almost ridiculous, and I would have to agree. But it's still an interesting experiment, in that it shows that if you leverage CLI tools in an appropriate manner, you can get excellent performance out of them.

Code for hashy.bash

If anyone is interested in the code:

$ cat hashy.bash
#!/bin/bash

echo ""
echo "make data"
echo "---------"
rm -f afile
{ time for i in {0..1000000};do echo "{\"name\": \"John${i}\", \"surname\": \"Gates\", \"country\": \"Germany\", \"age\": \"20\", \"height\": \"180\"}">> afile ;done ;} \
  |& awk '/real|user|sys/ {print $1": "$2"\t"}' | tr -d '\n'
echo ""
echo ""

rm -fr datadir && mkdir datadir && cd datadir

echo "split data"
echo "----------"
{ time split -l 1 ../afile ;} |& awk '/real|user|sys/ {print $1": "$2"\t"}' | tr -d '\n'
echo ""
echo ""

echo "hash data"
echo "---------"
{ time xxhsum -H0 * > ../nfile 2>&1 ;} |& awk '/real|user|sys/ {print $1": "$2"\t"}' | tr -d '\n'

cd - > /dev/null 2>&1
echo ""
echo ""

slm
0

Start by importing the data into your SQLite database. Here I use Miller (mlr) to convert the JSONL data that you present into CSV, and then I read that into the table data in a new database:

mlr --l2c cat file.json | sqlite3 database.db '.import --csv /dev/stdin data'

When that's done, you can create your identifiers according to your proposed scheme with an UPDATE statement:

sqlite> .mode box
sqlite> SELECT * FROM data;
┌───────┬─────────┬─────────┬─────┬────────┐
│ name  │ surname │ country │ age │ height │
├───────┼─────────┼─────────┼─────┼────────┤
│ John  │ Gates   │ Germany │ 20  │ 180    │
│ John1 │ Gates   │ Germany │ 20  │ 180    │
│ John2 │ Gates   │ Germany │ 20  │ 180    │
│ John3 │ Gates   │ Germany │ 20  │ 180    │
└───────┴─────────┴─────────┴─────┴────────┘
sqlite> ALTER TABLE data ADD COLUMN id TEXT;
sqlite> UPDATE data SET id = concat(name,surname,country,age,height);
sqlite> SELECT * FROM data;
┌───────┬─────────┬─────────┬─────┬────────┬────────────────────────┐
│ name  │ surname │ country │ age │ height │           id           │
├───────┼─────────┼─────────┼─────┼────────┼────────────────────────┤
│ John  │ Gates   │ Germany │ 20  │ 180    │ JohnGatesGermany20180  │
│ John1 │ Gates   │ Germany │ 20  │ 180    │ John1GatesGermany20180 │
│ John2 │ Gates   │ Germany │ 20  │ 180    │ John2GatesGermany20180 │
│ John3 │ Gates   │ Germany │ 20  │ 180    │ John3GatesGermany20180 │
└───────┴─────────┴─────────┴─────┴────────┴────────────────────────┘

You could obviously have Miller create your id column on the fly. The following uses the MD5 hash of each record's space-delimited fields:

mlr --l2c put '$id = md5(joinv($*," "))' file.json | sqlite3 database.db '.import --csv /dev/stdin data'

sqlite> .mode box
sqlite> SELECT * FROM data;
┌───────┬─────────┬─────────┬─────┬────────┬──────────────────────────────────┐
│ name  │ surname │ country │ age │ height │                id                │
├───────┼─────────┼─────────┼─────┼────────┼──────────────────────────────────┤
│ John  │ Gates   │ Germany │ 20  │ 180    │ 150c35e2efb7093e1c30a46a0226f82c │
│ John1 │ Gates   │ Germany │ 20  │ 180    │ c58a8be627dc1d6c9da36dd6de9fa62d │
│ John2 │ Gates   │ Germany │ 20  │ 180    │ e41b62a821f51c13eea2191ebcbb5837 │
│ John3 │ Gates   │ Germany │ 20  │ 180    │ 8e1012a599356fee66727107b750ba1a │
└───────┴─────────┴─────────┴─────┴────────┴──────────────────────────────────┘

Testing this on a recent (2020) MacBook Air (M1), computing the MD5 hashes with Miller and importing 5 million records into a database takes about 42 seconds.

Kusalananda