12

I bought cheap 2 TB HDDs (60 € each) and want to check, before using them, whether they actually return the data they were fed when read back. I checked some cheap thumb drives by copying large files I had lying around onto them and verifying the hashes of the data they gave back (and found drives which simply throw data away once their actual storage capacity is exhausted). Unfortunately, I don't have any 2 TB files lying around.

I now want to generate 2 TB of pseudorandom data, write it to the disks, and take a hash of the disks. I then want to write the same data directly to the hash function and get the hash it should produce this way. The pseudorandom function doesn't have to be cryptographically secure in any way, it just needs to produce data with high entropy fast.

If I write a script which just hashes a variable containing a number, prints the hash to stdout, increments the variable, and repeats, the data rate is way too slow, even when using a fast CPU. It's about 5 orders of magnitude too slow (not even 60 kByte/s).

Now, I could attempt to do this with tee, but that seems like a really bad idea, and I can't just reproduce the same data over and over again.

Ideally, I'd pass some short argument (a number, a string, I don't care) to the program and get an arbitrarily large amount of data out at its stdout, and that data is the same on each call.

UTF-8
  • 3,237
  • 2
    Why do you need to do it with random data? Why not output increasing numbers continuously? If you absolutely need it to be random, most computer-based random number generators use a seed which makes it reproducible. Since performance is an issue, you should do it in C, not in a shell script. – Julie Pelletier Mar 18 '17 at 23:20
  • It doesn't have to be random for this but it'd be nice and it'd be nice to have this. Are there any pseudorandom generators which take their seeds as an argument? Writing a C program for this would not be hard (I could just use srand() and rand().) but if there is a standard tool to get this done, knowing about it would be nice. – UTF-8 Mar 18 '17 at 23:24
  • 1
    If you know how to program it in 5 minutes, go ahead and do it. It's rather unlikely that someone would choose that over basic utilities like badblocks which is also included in fsck. Your concept will not result in a better verification. If no bad blocks are found, then your drive looks good. If you find some then it's starting to get less reliable. Of course there are other factors that can cause a drive to fail but these won't be noticeable in either tests until it fails. – Julie Pelletier Mar 19 '17 at 02:37

2 Answers

12

Just encrypt zeroes(***).

Encryption does exactly what you want. Encrypted zeroes look like random data, and decrypting turns them back into zeroes. It's deterministic, repeatable, and reversible, as long as you know the key and keep using the same cipher settings(**).

Example overwriting a drive with random data using cryptsetup:

cryptsetup open --type plain --cipher aes-xts-plain64 /dev/deletedisk cryptodeletedisk
# overwrite with zeroes
pv < /dev/zero > /dev/mapper/cryptodeletedisk
# verify (also drop caches or power cycle)
echo 3 > /proc/sys/vm/drop_caches
pv < /dev/mapper/cryptodeletedisk | cmp - /dev/zero
### alternatively, run badblocks in destructive mode:
badblocks -w -b 4096 -t 0 -v -s /dev/mapper/cryptodeletedisk

This should utilize full disk speed on a modern system with AES-NI.

If no errors are found, you know that the drive was fully overwritten with random data, and returned the correct data when reading it back.
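If you would rather compare hashes than run cmp: reading the mapper device decrypts the drive back to zeroes, so the expected hash is simply the hash of equally many zero bytes. A minimal sketch, demonstrated here on a zero-filled temporary file; for the real check, point `dev` at /dev/mapper/cryptodeletedisk and use `blockdev --getsize64` for the size:

```shell
dev=$(mktemp)                          # stand-in for /dev/mapper/cryptodeletedisk
head -c 1048576 /dev/zero > "$dev"     # a real run reads the decrypted drive instead

size=$(stat -c %s "$dev")              # block device: size=$(blockdev --getsize64 "$dev")
head -c "$size" /dev/zero | sha256sum  # expected: hash of that many zero bytes
sha256sum "$dev"                       # actual: hash of the decrypted contents
```

If the two hashes match, every sector came back exactly as written.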


Example: using cryptsetup to repeatedly produce the same random data (without involving real storage).

In this example, instead of using /dev/zero as the data source, we exploit that sparse files can be created arbitrarily large and are also fully zero:

# truncate -s 1E exabyte_of_zero
# cryptsetup open --type plain --cipher aes-xts-plain64 --readonly exabyte_of_zero exabyte_of_random
Enter passphrase for exabyte_of_zero: yourseed
# hexdump -C -n 64 /dev/mapper/exabyte_of_random
# cat /dev/mapper/exabyte_of_random | something_that_wanted_random_data

This only works if your filesystem supports sparse files without side effects (does not work on tmpfs).

Note: this is just an example; actually creating a virtual exabyte device like this might have side effects. Limit the size to something reasonable for your use case, and put the file in a location where backup scripts don't try to pick it up and compress it, etc.


cryptsetup provides a readable block device, and it's seekable, so you can also use it to start comparing data somewhere in the middle of a file. You can also use it to analyze the random data at any offset and make sure it's really random and does not repeat(*). A traditional PRNG would usually require you to re-generate everything from start.
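For example, dd can start reading at any offset of the mapper device without generating the data before it. A sketch of the seek semantics, shown on a small file (substitute if=/dev/mapper/exabyte_of_random and, say, bs=1M skip=1024 to sample the pseudorandom stream 1 GiB in):

```shell
f=$(mktemp)
printf 'AAAABBBBCCCC' > "$f"
# skip=1 jumps over the first bs-sized block before reading
dd if="$f" bs=4 skip=1 count=1 status=none   # prints the middle block: BBBB
```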


cryptsetup requires root permissions (although the resulting device-mapper device could be made readable by any other user).

Without root permissions, you can use openssl to generate random data by encrypting zeroes. (Already mentioned in comments.)

$ openssl enc -pbkdf2 -aes-256-ctr -nosalt \
      -pass pass:yourseed < /dev/zero \
      | hexdump -C -n 64
00000000  62 5e 3d cd 39 dc d6 a2  bb 73 2d 0f 63 b1 f1 75  |b^=.9....s-.c..u|
00000010  4d 84 f5 75 cb b6 1e 33  9c e8 41 9c 76 4b 7e 12  |M..u...3..A.vK~.|
00000020  c2 90 d5 93 2d a9 9e a0  48 bd b8 3e a5 1a d6 f7  |....-...H..>....|
00000030  2c a6 e0 07 4d 5a 45 31  13 dc ef 97 df 76 c5 b8  |,...MZE1.....v..|
00000040

This approach is also suggested by coreutils documentation on Sources of random data as a way to "generate a reproducible arbitrary amount of pseudo-random data given a seed value".
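Put together, this answers the original question directly: write the seeded stream to the disk, then regenerate the same stream to compute the expected hash. A sketch using a small image file (seed and size are placeholders; for a real drive, substitute the device path and its size in bytes, and double-check the target before writing):

```shell
seed=yourseed
size=$((16 * 1024 * 1024))   # demo size; use the disk size in bytes
disk=$(mktemp)               # substitute e.g. /dev/sdX for a real drive

stream() {
    openssl enc -pbkdf2 -aes-256-ctr -nosalt -pass "pass:$seed" \
        < /dev/zero 2>/dev/null | head -c "$size"
}

stream > "$disk"                       # write the pseudorandom data
stream | sha256sum                     # expected hash, computed directly
head -c "$size" "$disk" | sha256sum    # hash of what the disk gives back
```

If the drive silently discards data, the second hash will differ.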


Instead of encrypting zeroes, you can also use a traditional PRNG for the job. It's just that I'm not aware of any standard tools that provide it. So this approach involves picking any PRNG algorithm / library of your choice and writing a few lines of code to produce the data.
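To illustrate the idea, here is a seeded xorshift32 sketch in bash arithmetic (hex output; orders of magnitude too slow for terabytes, so a real implementation would be those same few lines in C, but the determinism is identical):

```shell
# xorshift32: the same seed always produces the same stream
xorshift_stream() {
    local x=$1 n=$2 i
    for ((i = 0; i < n; i++)); do
        (( x ^= (x << 13) & 0xFFFFFFFF ))
        (( x ^= x >> 17 ))
        (( x ^= (x << 5) & 0xFFFFFFFF ))
        printf '%08x' "$x"
    done
    echo
}

xorshift_stream 12345 8   # 8 deterministic 32-bit words
```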

shred has a nice PRNG, but you can't seed it yourself and you can't pipe its output, so there is no way to use it here.

tee can multiplex data from a single random source to multiple processes; however, it requires this data to be consumed immediately and in parallel (example: Shuffle two parallel text files), so it is only suitable if the data has to be identical but does not have to be reproduced again at a later time.
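A quick sketch of that parallel consumption, feeding one stream to two hashers through FIFOs:

```shell
d=$(mktemp -d) && cd "$d"
mkfifo p1 p2
sha256sum < p1 > h1 &     # consumer 1
sha256sum < p2 > h2 &     # consumer 2
head -c 1048576 /dev/zero | tee p1 > p2   # one pass over the source
wait
a=$(cut -d' ' -f1 h1)
b=$(cut -d' ' -f1 h2)
printf '%s\n%s\n' "$a" "$b"   # both consumers saw identical data
```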


(*) When encrypting, it's important to choose the correct cipher settings. For example, aes-xts-plain repeats data after 2 TiB; this was fixed in aes-xts-plain64, so don't use aes-xts-plain anymore.

(**) Different cipher settings lead to different results. The commands shown in this answer rely on some default settings, so the data may not be identical in the long term. For example, cryptsetup might use either 256- or 512-bit keys, and openssl uses a default iteration count for its key derivation.

(***) The reason we encrypt zeroes at all is that the kernel offers performant data sources for arbitrary amounts of zeroes. Encrypting any other pattern would work just as well.

frostschutz
  • 48,978
  • I want this to be very flexible, that's why I'm asking about data streams. Cryptsetup doesn't allow me to pass its output to stdout. Furthermore, it has to be run as root to work. If hardware acceleration isn't available, encryption is probably way slower than generating random data by other means. – UTF-8 Mar 18 '17 at 23:38
  • @UTF-8 stdout/pipe kind of also works. In terms of speed: well, it was considerably faster than /dev/urandom last time I tried it on a NAS with old / slow / unaccelerated CPU. But urandom implementation changed a lot since then... you can go with PRNG but that is already what badblocks does with -t random. Not sure about your "has to be run as root to work", you have to be root to write to raw disk in first place. – frostschutz Mar 18 '17 at 23:54
  • Yes, I know that I have to be root to write to disks but I also have to be root when I don't want to write to disks. I had the problem of needing reproducible pseudorandom data in the past and always cheated my way around it but now I'm looking for an actual solution. Do you know about a PRNG tool for large data for the terminal? I only know of rand but it only produces relatively small numbers represented as decimal numbers encoded in UTF-8. Plus, it doesn't typically come with distros (which is another downside, it's not a requirement that it's a standard tool (combi) but it'd be nice). – UTF-8 Mar 19 '17 at 00:03
  • There's also a very important performance difference between encryption and simple RNG. – Julie Pelletier Mar 19 '17 at 02:31
  • 2
    @UTF-8 A more convenient way to encrypt zeros is the following: openssl enc -e -aes-128-ctr -k "key" -in /dev/zero -out /dev/stdout (substitute /dev/stdout with anything you want) This generates at 1.1 GB/s, massively outstripping almost any drive's throughput. To reverse this, use -d rather than -e, change the -in argument and check that what you read back is just zeros. – Iwillnotexist Idonotexist Mar 19 '17 at 04:13
  • Thanks! Weirdly, the command produces the error message error writing output file when piping its output to dd after a few seconds (but before dd terminates): http://pastebin.com/4cb3bP50 Do you know why this is? It doesn't happen when I use the command to write to a regular file directly. – UTF-8 Mar 19 '17 at 13:57
  • 1
    badblocks really is a much better choice for checking whether a storage medium is a fraud. I checked a fraudulent thumb drive with sudo badblocks -wsv /dev/sdb and it found bad blocks within a few minutes. – UTF-8 Mar 19 '17 at 14:19
  • @frostschutz this is ingenious! :) thank you for this explanation with examples :) also, can you check if there are no syntax errors in your commands, since, for example, I noticed that deletecryptodisk is not defined anywhere? was it supposed to reference /dev/deletedisk or /dev/mapper/cryptodeletedisk ? – Mladen B. Mar 12 '23 at 08:50
0

If you are that concerned about data integrity, just use e.g. ZFS: it has built-in integrity checking. New hardware may be fine, but as the years pass it can go bad.

https://changelog.complete.org/archives/9769-silent-data-corruption-is-real

Here’s something you never want to see:

ZFS has detected a checksum error:

eid: 138
class: checksum
host: alexandria
time: 2017-01-29 18:08:10-0600
vtype: disk

This means there was a data error on the drive. But it’s worse than a typical data error — this is an error that was not detected by the hardware. Unlike most filesystems, ZFS and btrfs write a checksum with every block of data (both data and metadata) written to the drive, and the checksum is verified at read time. Most filesystems don’t do this, because theoretically the hardware should detect all errors. But in practice, it doesn’t always, which can lead to silent data corruption. That’s why I use ZFS wherever I possibly can.

Chris Davies
  • 116,213
  • 16
  • 160
  • 287