57

I tried a bash script, but it took too long just to create a simple 1 MB file. I think the answer lies in using /dev/random or /dev/urandom, but other posts here only show how to fill a file with all kinds of data using them; I want only numbers.

So, is there a command that I can use to create a random file of size 1 GB containing only numbers between 0 and 9?

Edit: I want the output to be something like this

0 1 4 7 ..... 9
8 7 5 8 ..... 8
....
....
8 7 5 3 ..... 3

The range is 0 - 9, meaning only the numbers 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. Also, I need them space separated, 100 per line, for n lines. I don't care what n is; I want my final size to be 1 GB.

Edit: I am using Ubuntu 16.04 LTS

    See http://superuser.com/questions/470949/how-do-i-create-a-1gb-random-file-in-linux – BillThor Nov 17 '16 at 00:03
  • @BillThor I already checked that, but solutions posted there add some random gibberish to the file, not numbers. I want only numbers. – posixKing Nov 17 '16 at 01:19
  • What specifically is the Linux distribution and version? – mdpc Nov 17 '16 at 02:58
  • 21
    You should probably say what you mean by "random" - cryptographic strength random, or is a pseudo-random sequence adequate? – Toby Speight Nov 17 '16 at 09:07
  • 4
    @posixKing: Note that although my answer is definitely tongue-in-cheek -- I'm not actually suggesting to write a C program for such a task! --, if you routinely generate such huge datasets, or you generate them often, the approach may save you time. (On my laptop, it generates 1GB of space-separated digits in about ten seconds.) However, if this is an one-off, don't even think about writing a C program for this (unless you like programming, and consider this practice or such); the shell commands and utilities achieve the task in less total time and effort spent. – Nominal Animal Nov 17 '16 at 10:05
  • @Mark - kernel module? For an operating system kernel? No, the fastest way is probably a GRUB module. You might need to write your own code to write to the filesystem, though... – Toby Speight Nov 17 '16 at 18:08
  • 7
    This is pretty fast and RFC 1149.5 compliant: yes 4 | tr '\n' ' ' | fold -w 200 | head -c1G – Matthew Crumley Nov 17 '16 at 21:00
  • 4
  • It depends a lot on how random you want your data to be. – njzk2 Nov 18 '16 at 04:29
  • 1
    Do you fastest to run, or fastest to type? If you mean fastest execution time, no shell command is going to be as fast as a C program, especially if you use a very cheap RNG, and take advantage of the fact that you only want one-digit numbers (so formatting integers into strings becomes trivial). You could probably vectorize a simple RNG like XORSHIFT with SSE or even AVX2 (different seeds in each vector element), and chop that up into byte elements. So I'd guess you might manage to generate results at something like 32 bytes (including spaces) every 4 clock cycles on Intel Haswell. – Peter Cordes Nov 18 '16 at 08:39
  • 1
    That's enough to saturate main memory write bandwidth with a single core on a 3GHz CPU (dual channel DDR3 1600MHz ~= 25GB/s). – Peter Cordes Nov 18 '16 at 08:41
  • @mdpc that a question has been answered on another site is not a reason to close it here! Please don't vote to close questions because they have an answer somewhere else. Instead, post an answer with the solution that you found elsewhere. – terdon Nov 18 '16 at 10:26
  • Even a crypto quality RNG should be able to beat 1 GB/s without a problem on a modern Intel CPU. So you typically don't need to sacrifice quality to achieve good performance. – CodesInChaos Nov 22 '16 at 09:00

10 Answers

82

This:

 LC_ALL=C tr '\0-\377' \
             '[0*25][1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][x*]' \
    < /dev/urandom |
    tr -d x |
    fold -w 1 |
    paste -sd "$(printf '%99s\\n')" - |
    head -c1G

(assuming a head implementation that supports -c) appears to be reasonably fast on my system.

tr translates the whole byte range (0 to 255; 0 to 0377 in octal): the first 25 byte values to 0, the next 25 to 1... and so on up to 9; the remaining values (250 to 255) go to "x", which we then discard (with tr -d x), since we want a uniform distribution (assuming /dev/urandom itself is uniform) and so must not give a bias to some digits.

That produces one digit for 97% of the bytes of /dev/urandom. fold -w 1 makes it one digit per line. paste -s is called with a list of separators that consists of 99 space characters and one newline character, so as to have 100 space-separated digits on each line.

head -c1G will get the first GiB (2^30 bytes) of that. Note that the last line will be truncated and undelimited. You could truncate to 2^30-1 bytes and add the missing newline by hand, or truncate to 10^9 bytes instead, which is 50 million of those 200-byte lines (head -n 50000000 would also make it a standard/portable command).
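For instance, here is a hedged, scaled-down sketch of the exact-size variant: 5 lines of 100 space-separated digits (1000 bytes), sized with head -n so the last line stays complete. It uses tr -dc '0-9' as an alternative digit filter (also unbiased, since each of the ten digit bytes is equally likely to survive), and assumes a GNU-ish userland as in the answer:

```shell
# Keep only the ASCII bytes '0'..'9' from /dev/urandom, one per line,
# then glue them back together 100 per line (99 spaces + 1 newline as
# the cycling separator list), and stop after 5 complete lines.
LC_ALL=C tr -dc '0-9' < /dev/urandom |
  fold -w 1 |
  paste -sd "$(printf '%99s\\n')" - |
  head -n 5 > sample.txt
```

Each line is 100 digits plus 99 spaces plus a newline, i.e. 200 bytes, so head -n keeps the output an exact multiple of the line length.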

These timings (obtained by zsh on a quad-core system), give an indication of where the CPU time is spent:

LC_ALL=C tr '\0-\377'  < /dev/urandom  0.61s user 31.28s system 99% cpu 31.904 total
tr -d x  1.00s user 0.27s system 3% cpu 31.903 total
fold -w 1  14.93s user 0.48s system 48% cpu 31.902 total
paste -sd "$(printf '%99s\\n')" -  7.23s user 0.08s system 22% cpu 31.899 total
head -c1G > /dev/null  0.49s user 1.21s system 5% cpu 31.898 total

The first tr is the bottleneck; most of the time is spent in the kernel (I suppose for the random number generation). The timing is roughly in line with the rate I can get bytes from /dev/urandom (about 19MiB/s; here we produce 2 bytes for each 0.97 byte of /dev/urandom, at a rate of 32MiB/s). fold seems to be spending an unreasonable amount of CPU time (15s) just to insert a newline character after every byte, but that doesn't affect the overall time as it works on a different CPU in my case (adding the -b option makes it very slightly more efficient; dd cbs=1 conv=unblock seems like a better alternative).

You can do away with the head -c1G and shave off a few seconds by instead setting a limit on the file size in a subshell (limit filesize 1024m with zsh, or ulimit -f "$((1024*1024))" with most other shells, including zsh).
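A hedged sketch of that ulimit variant, scaled down to a small cap so it finishes quickly. Note that ulimit -f counts 512-byte blocks in POSIX sh but 1024-byte blocks in bash, so the cap below is 512 KiB or 1 MiB depending on the shell; either way the kernel refuses writes past the limit and the writing process is signalled, which is what terminates the pipeline:

```shell
# Run the generator in a subshell with a file-size limit instead of head.
(
  ulimit -f 1024                       # 512 KiB (sh) or 1 MiB (bash)
  LC_ALL=C tr -dc '0-9' < /dev/urandom |
    fold -w 1 |
    paste -sd "$(printf '%99s\\n')" - > digits-limited.txt
) 2> /dev/null
wc -c digits-limited.txt               # at most 1048576 bytes
```

The file name digits-limited.txt is just for illustration. With the full-size limit from the answer, the same structure produces the 1 GiB file without head having to count bytes.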

That could be improved if we extracted 2 digits from each byte, but we would need a different approach for that. The above is very efficient because tr just looks up each byte in a 256-byte array. It can't do that for 2 bytes at a time, and using things like hexdump -e '1/1 "%02u"', which computes the text representation of a byte using more complex algorithms, would be more expensive than the random number generation itself. Still, if, as in my case, you have CPU cores with time to spare, it may still manage to shave off a few seconds:

With:

< /dev/urandom LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -n250000000 -ve '500/1 "%02u" "\n"' |
  fold -w1 |
  paste -sd "$(printf '%99s\\n')" - > /dev/null

I get (note however that here it's 1,000,000,000 bytes as opposed to 1,073,741,824):

LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' < /dev/urandom  0.32s user 18.83s system 70% cpu 27.001 total
tr -d x  2.17s user 0.09s system 8% cpu 27.000 total
hexdump -n250000000 -ve '500/1 "%02u" "\n"'  26.79s user 0.17s system 99% cpu 27.000 total
fold -w1  14.42s user 0.67s system 55% cpu 27.000 total
paste -sd "$(printf '%99s\\n')" - > /dev/null  8.00s user 0.23s system 30% cpu 26.998 total

More CPU time overall, but better distributed between my 4 CPU cores, so it ends up taking less wall-clock time. The bottleneck is now hexdump.

If we use dd instead of the line-based fold, we can actually reduce the amount of work hexdump needs to do and improve the balance of work between CPUs:

< /dev/urandom LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -ve '"%02u"' |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

(here assuming GNU dd for its iflag=fullblock and status=none) which gives:

LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' < /dev/urandom  0.32s user 15.58s system 99% cpu 15.915 total
tr -d x  1.62s user 0.16s system 11% cpu 15.914 total
hexdump -ve '"%02u"'  10.90s user 0.32s system 70% cpu 15.911 total
dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock  5.44s user 0.19s system 35% cpu 15.909 total
paste -sd "$(printf '%99s\\n')" - > /dev/null  5.50s user 0.30s system 36% cpu 15.905 total

Back to the random-number generation being the bottleneck.

Now, as pointed out by @OleTange, if you have the openssl utility, you could use it to get a faster (especially on processors that have AES instructions) pseudo-random generator of bytes.

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom

on my system spews 15 times as many bytes per second as /dev/urandom does. (I can't comment on how it compares as a cryptographically secure source of randomness, if that applies to your use case.)
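A rough, hedged way to reproduce that comparison on your own machine is to count how many bytes each source delivers in a fixed window (this assumes GNU coreutils' timeout; absolute numbers vary a lot with CPU and kernel version):

```shell
# Count bytes delivered in one second by each source.
urandom_rate=$(timeout 1 cat /dev/urandom | wc -c)
openssl_rate=$(timeout 1 sh -c \
  '</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom 2>/dev/null' \
  | wc -c)
echo "/dev/urandom: $urandom_rate bytes/s"
echo "openssl ctr:  $openssl_rate bytes/s"
```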

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom 2> /dev/null | 
  LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -ve '"%02u"' |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

Now gives:

openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom < /dev/zero 2> /dev/null  1.13s user 0.16s system 12% cpu 10.174 total
LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]'  0.56s user 0.20s system 7% cpu 10.173 total
tr -d x  2.50s user 0.10s system 25% cpu 10.172 total
hexdump -ve '"%02u"'  9.96s user 0.19s system 99% cpu 10.172 total
dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock  4.38s user 0.20s system 45% cpu 10.171 total
paste -sd "$(printf '%99s\\n')" - > /dev/null

Back to hexdump being the bottleneck.

As I still have CPUs to spare, I can run 3 of those hexdumps in parallel.

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom 2> /dev/null | 
  LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  (hexdump -ve '"%02u"' <&3 & hexdump -ve '"%02u"' <&3 & hexdump -ve '"%02u"') 3<&0 |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

(the <&3 is needed for shells other than zsh, which redirect commands' stdin from /dev/null when they are run in the background).

Now down to 6.2 seconds and my CPUs almost fully utilised.

  • 3
    I just deleted my previous answer and voted for this one. I didn't get some of the requirements. Nice answer btw. – Marcelo Nov 17 '16 at 00:33
  • 3
    why don't you generate multiple digits each pass? Even if you read in byte-by-byte you're still able to produce 2 digits each time – phuclv Nov 17 '16 at 06:00
  • @LưuVĩnhPhúc, I've removed the perl variant which was significantly slower anyway. I can't get 2 digits per byte with that tr|fold|paste approach. – Stéphane Chazelas Nov 18 '16 at 07:40
  • I'm afk or I'd try this myself, but you could try converting 42 bytes at a time to 100-102 digits using bc (then drop the 0, 1, or 2 most significant digits). – Eric Towers Nov 18 '16 at 14:39
  • https://gitlab.com/ole.tange/tangetools/tree/master/rand generates 1-4 GB pseudorandom per second (AES quality). – Ole Tange Nov 19 '16 at 00:01
  • @OleTange. Thanks that does indeed generate random bytes a lot faster than /dev/urandom. I've added that to the answer. – Stéphane Chazelas Nov 19 '16 at 07:48
40

This is partially a tongue-in-cheek answer, because of the title of the question.

When you look for "the fastest way to ...", the answer is almost always some specialized tool. This "answer" shows one such tool, just so you can experiment.

This is not a serious answer, because you should not look into specialized tools for jobs you only do once or very rarely. You see, you'll end up spending more time looking for tools and learning about them than actually doing stuff. Shells and utilities like bash and awk are not the fastest, but you can usually write a one-liner to achieve the job, spending only seconds. Better scripting languages like perl can also be used, although the learning curve for perl is steep, and I hesitate to recommend it for such purposes, because I've been traumatized by awful perl projects. python, on the other hand, is slightly handicapped by its rather slow I/O; it is only an issue when you filter or generate gigabytes of data, though.

In any case, the following C89 example program (which uses POSIX.1 for a higher-accuracy clock, only if available) should achieve about a 100 MB/s generation rate (tested on Linux on a laptop with an Intel i5-4200U processor, piping the output to /dev/null), using a pretty good pseudo-random number generator. (The output should pass all the BigCrush tests except the MatrixRank test, as the code uses xorshift64* and the exclusion method to avoid biasing the digits.)

decimal-digits.c:

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

/* This program is licensed under the CC0 license,
   https://creativecommons.org/publicdomain/zero/1.0/
   In other words, this is dedicated to the public domain.
   There are no warranties either, so if something breaks,
   you only have yourself to blame. */

#if _POSIX_C_SOURCE-199309 >= 0

static uint64_t time_seed(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}

#else

static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}

#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit. */
static unsigned char digit(void)
{
    static uint64_t      digits_cache = 0;
    static unsigned char digits_cached = 0;
    unsigned char        retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;

    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace(*end))
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;

    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    for (line = 0UL; line < lines; line++) {
        fputc('0' + digit(), stdout);
        for (col = 1UL; col < cols; col++) {
            fputc(' ', stdout);
            fputc('0' + digit(), stdout);
        }
        fputc('\n', stdout);

        /* Check for write errors. */
        if (ferror(stdout))
            return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

We can make it a lot faster if we switch to a line buffer, and fwrite() it once, instead of outputting one digit at a time. Note that we still keep the stream fully buffered, to avoid partial (non-power-of-two) writes if the output is a block device.

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

#if _POSIX_C_SOURCE-199309 >= 0

static uint64_t time_seed(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}

#else

static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}

#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit. */
static unsigned char digit(void)
{
    static uint64_t      digits_cache = 0;
    static unsigned char digits_cached = 0;
    unsigned char        retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;

    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace(*end))
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    char         *oneline;

    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    /* Allocate memory for a full line. */
    oneline = malloc((size_t)(2 * cols + 1));
    if (!oneline) {
        fprintf(stderr, "Not enough memory for %lu column buffer.\n", cols);
        return EXIT_FAILURE;
    }

    /* Set spaces and terminating newline. */
    for (col = 0; col < cols; col++)
        oneline[2*col + 1] = ' ';
    oneline[2*cols - 1] = '\n';

    /* Not needed, but in case a code modification treats it as a string. */
    oneline[2*cols] = '\0';

    for (line = 0UL; line < lines; line++) {
        for (col = 0UL; col < cols; col++)
            oneline[2*col] = '0' + digit();

        if (fwrite(oneline, 2*cols, 1, stdout) != 1)
            return EXIT_FAILURE;
    }

    /* Check for write errors. */
    if (ferror(stdout))
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}

Note: both examples edited on 2016-11-18 to ensure uniform distribution of digits (zero is excluded; see e.g. here for comparison and details on various pseudo-random number generators).

Compile using for example

gcc -Wall -O2 decimal-digits.c -o decimal-digits

and optionally install system-wide to /usr/bin using

sudo install -o root -g root -m 0755 decimal-digits /usr/bin

It takes the number of digits per line, and the number of lines. Because 1000000000 / 100 / 2 = 5000000 (five million; total bytes divided by columns, divided by the 2 bytes per column, i.e. digit plus separator), you can use

./decimal-digits 100 5000000 > digits.txt

to generate the gigabyte-sized digits.txt as desired by the OP.
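Whatever generator you use, it is worth sanity-checking the output format. The checks below are hedged sketches; to keep them runnable as-is, they first build a tiny stand-in file (hypothetical name check-me.txt) with awk in the same 100-digits-per-line layout, instead of the real gigabyte file:

```shell
# Build a 5-line stand-in in the expected format (200 bytes per line).
awk 'BEGIN { srand(42)
             for (l = 0; l < 5; l++) {
                 line = int(rand() * 10)
                 for (c = 1; c < 100; c++)
                     line = line " " int(rand() * 10)
                 print line
             } }' > check-me.txt

# Every line should have exactly 100 fields...
awk 'NF != 100 { exit 1 }' check-me.txt && echo "100 fields per line: OK"
# ...and contain nothing but digits and spaces.
! grep -q '[^0-9 ]' check-me.txt && echo "digits and spaces only: OK"
```

Run the same two checks against digits.txt to confirm the real file matches the OP's spec.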

Note that the program itself is written more with readability than efficiency in mind. My intent here is not to showcase the efficiency of the code — I'd use POSIX.1 and low-level I/O anyway, rather than generic C interfaces — but to let you easily see the balance between the effort spent developing dedicated tools and their performance, compared to one-liners or short shell or awk scriptlets.

Using the GNU C library, calling the fputc() function for every character output incurs a very small overhead (of an indirect function call, or conditionals -- the FILE interface is actually pretty complex and versatile, you see). On this particular Intel Core i5-4200U laptop, redirecting the output to /dev/null, the first (fputc) version takes about 11 seconds, whereas the line-at-a-time version takes just 1.3 seconds.

I happen to often write such programs and generators only because I like to play with huge datasets. I'm weird that way. For example, I once wrote a program to print all finite positive IEEE-754 floating-point values into a text file, with sufficient precision to yield the exact same value when parsed. The file was a few gigabytes in size (perhaps 4G or so); there are not as many finite positive floats as one might think. I used this to compare implementations that read and parse such data.

For normal use cases, like the OP's, shell scripts, scriptlets and one-liners are the better approach: less time spent to accomplish the overall task. (Except if they need a different file every day or so, or there are many people who need a different file, in which (rare) case a dedicated tool like the above might warrant the effort spent.)

  • Yeah, probably mmap() is the easiest route to the best I/O speed — but benchmark before making any claims! – Toby Speight Nov 17 '16 at 18:14
  • @TobySpeight: In Linux, low-level I/O, i.e. using write(), is typically faster than mmap(). fwrite() is not much slower. Yes, I've benchmarked that (just not for this particular example); write() in large chunks (262144, 524288, or 1048576 bytes) tends to outperform the other methods. The version of fputc() implemented in GNU C library (which I have also benchmarked extensively) is slow for a number of reasons; in particular, implementation having to do either conditional jumps or indirect calls for every character added; that slight overhead incurred so often adds up. – Nominal Animal Nov 17 '16 at 18:44
  • Just out of interest - have you done a performance comparison with the other answers? – Digital Trauma Nov 17 '16 at 20:17
  • 2
    @DigitalTrauma: I just ran them for you, redirecting the output to /dev/null. Stéphane Chazelas' scriptlet takes about 52 seconds; perl snippet (including the head filtering) about 58 seconds; your shuf snippet (with correct timing; you only measure shuf time, assuming paste wont take any longer) takes about 69 seconds. James Hollis' C++11 a line-at-a-time program takes 14 seconds. The above program takes 10 seconds. – Nominal Animal Nov 18 '16 at 05:33
  • @DigitalTrauma: Just for fun, I added the line-at-a-time version. That takes just 1.3 seconds. In summary, a simple program, say James Hollis' version, takes about one fourth the time a shell scriptlet or perl one-liner takes. The difference to mine is the PRNG; like I said, xorshift variants are fast, and their output is quite random enough for most purposes. (The exclusion method used avoids the zero issue, so the output is truly uniform.) See eg. here for details on various PRNGs. – Nominal Animal Nov 18 '16 at 05:48
  • Note that although I believe the xorshift64* PRNG does not yield zeros, I added an explicit check. This means that as used in these two programs, the output is both technically and really uniformly distributed. The execution times (over a few test runs) after the modification are 10 and 1.3 seconds, respectively. – Nominal Animal Nov 18 '16 at 05:57
  • 4
    (Lost my train of thought above, sorry.) Point is, picking the right algorithm -- the sufficiently random but very fast PRNG here --, yielded almost an order of magnitude (10×) speed increase. (The latter version of my programs is about 40 times faster than the shell or perl snippets.) This is typical. Perhaps I should have emphasized choosing the right algorithm when writing a program more in my answer above? (On the other hand, this is not a programming question, but an Unix/Linux question, on how to generate lots of digits.) – Nominal Animal Nov 18 '16 at 06:14
  • Hopefully, low level languages still have advantage on high level ones... which are build upon those good old low level ones ;-) – Serge Ballesta Nov 18 '16 at 07:25
  • @SergeBallesta: Well, we could go even lower level, and leverage the SSE/AVX/AVX512 extensions provided by current Intel and AMD processors. (This answer has one of my AVX2 brute force float programs.) Here, exclusion method means no gains from vectorizing the PRNG, but certainly the splitting into individual digits could be vectorized and thus sped up. However, at 0.77 GB/s output means we're close to bandwidth limits -- we do need at least one copy for power of two sized writes --, so it is diminishing returns vs. effort spent... – Nominal Animal Nov 18 '16 at 08:18
  • Yeah, I had exactly the same idea on reading the word "fastest": execution time or typing time? I was thinking of a vectorized XORSHIFT with a different seed in each element, then somehow chop that up into byte elements (with spaces for padding). If throughput bottlenecks on RNG latency (possible if the processing into decimal digits can be done very cheaply), then use two vectors of RNGs in parallel with different seeds. You might be able to max out main memory bandwidth with a single core, but only with a much cheaper quintillion(). (~25GB/s on a dual-channel DDR3-1600 Haswell). – Peter Cordes Nov 18 '16 at 08:56
  • So when we have CPUs with non-volatile storage attached to the memory bus directly, this will be useful :P (This is why Intel designed instructions like CLWB and PCOMMIT, but probably such high-bandwidth non-volatile storage will be a server thing before it's affordable for desktops, and we'll still have SATA or PCIe M.2 SSDs for a good while. While googling that, I see Intel has deprecated PCOMMIT, since they've decided not to require it after CLWB for nv storage.) – Peter Cordes Nov 18 '16 at 09:03
  • @PeterCordes: Right! This lappy actually has a Samsung MZ7PD128, 128GB SSD at ~ 500MB/s :) I think the digit splitting (digit()) is the actual bottleneck above, on most current Intel/AMD machines. OTOH, using more than one thread to compute large chunks (perhaps a megabyte at a time?) would still yield best wall-clock savings, I think. But, spending just an hour to speed it up by 50% means you'd need to run it over 5000 times (5TB of data generated) to recoup the time spent. Nah; I'd just put this one running on the background in a scriptlet, instead, nice'd and ionice'd, if needed... :) – Nominal Animal Nov 18 '16 at 11:40
  • I was thinking about this some more while getting to sleep: If your requirement for a uniform distribution is not very strict, you could use a modular multiplicative inverse to %10 every 16-bit element in a vector of RNG results (using PMULHUW to replace each element with the high half of the product). Then PADDW _mm_set1_epi16( (' '<<8) | '0' ) to convert the 0-9 integer in the low byte to an ASCII digit, and the zero in the high byte to a space. This could turn a vector of random bytes into an ASCII vector in a few insns, SSE2 or AVX2. – Peter Cordes Nov 18 '16 at 20:05
  • Hmm, I just looked at how your digit() function works. It's not strictly uniform; there is a tiny bias away from high numbers for the first digit, because 10 doesn't go evenly into 2^64. But AFAICT, that's not a problem after you divide by 10 to get the higher decimal digits of the uint64_t. And you stop soon enough that you don't run into trouble with the most-significant digit, since 2^64 / (10^18) is ~18.4. – Peter Cordes Nov 19 '16 at 04:14
  • @PeterCordes: (Thanks for the bugfix!) Did you forget that digit() relies on quintillion(), which uses the exclusion method to yield an uniform random number between 0 and 999,999,999,999,999,999, inclusive? – Nominal Animal Nov 19 '16 at 07:16
  • @NominalAnimal: oh, yes, I did forget that. For my vectorized version, I wasn't interested in any conditional behaviour, and wanted to see how fast it could run with somewhat relaxed quality constraints, but still good enough to be called properly random. I got a decent speedup from using multiple decimal digits of my uint16_t elements, instead of just dividing them by 6554 (which is faster than %10). I should finish writing up my answer and post it. – Peter Cordes Nov 19 '16 at 09:04
  • My vectorized SSE2 version produces 1GiB of data in 0.21 seconds (writing to /dev/null in 128k chunks) on a 2.4GHz Core2Duo 64-bit, where your version on the same hardware runs in 3.20 seconds (gcc5.2 -O3 -march=native, or 3.42s with clang-3.8). As I commented on James's answer, rewriting the same hot-in-cache buffer in-place is much faster than memcpy speed. (Linux's /dev/null device driver discards the write() buffer without touching it at all). – Peter Cordes Nov 19 '16 at 09:19
  • @PeterCordes: Agreed; I'd love to see your implementation (and whatever other relevant notes or comments you might have). For good performance on real-world storage devices, you definitely want to write in multiples of the storage allocation size (might be non-power-of-two on RAIDs). With multiple threads, a write-only memory map might be faster in Linux (due to more opportunities for the I/O elevator to reorder writes), especially on larger RAIDs, I suspect. – Nominal Animal Nov 19 '16 at 11:15
23

If you have shuf available (recent GNU coreutils does) you can do this:

time shuf -r -n $((512*1024*1024)) -i 0-9 | paste -sd "$(printf '%99s\\n')" -

On my VM, this is now a bit slower than Stéphane's answer by about a 3:4 factor.

22

If you don't need very high quality randomness, and close-to-uniform distribution is good enough, you can go really fast, especially on a modern CPU with efficient SIMD integer vectors like x86 with SSE2 or AVX2.

This is like @NominalAnimal's answer since we both had the same idea, but manually vectorized for x86. (And with worse quality random numbers, but still probably good enough for a lot of use-cases.) This runs about 15 or 30 times faster than @Nominal's code, at ~13GB/s of ASCII output on a 2.5GHz Intel Haswell CPU with AVX2. That's still less than theoretical max main memory bandwidth (dual channel DDR3-1600 is about 25.6GB/s), but I was timing writing to /dev/null so it's actually just rewriting a buffer that stays hot in cache. Skylake should run this same code significantly faster than Haswell (see the bottom of this answer).

Assuming you actually bottleneck on I/O to disk or piping this somewhere, a fast implementation means your CPU doesn't even have to clock higher than idle. It uses much less total energy to produce the result. (Battery life / heat / global warming.)

This is so fast that you probably don't want to write it to disk. Just re-generate as-needed (from the same seed if you want the same data again). Even if you want to feed it to a multi-threaded process that can use all CPUs, running this to pipe the data to it will leave it hot in L3 cache (and L2 cache on the core that wrote it), and use very little CPU time. (But note that piping adds a lot of overhead vs. writing to /dev/null. On a Skylake i7-6700k, piping to wc -c or another program that just reads + discards its input, it's about 8x slower than writing to /dev/null, and only uses 70% of a CPU. But that's still 4.0GB/s on a 3.9GHz CPU.)

Re-generating it is faster than re-reading it even from a fast PCIe-connected SSD, but IDK if it's more power efficient (the vector-integer multiplier is kept pretty busy, and it's probably pretty power-hungry, along with other AVX2 256b vector ALUs). OTOH, I don't know how much CPU time reading from disk would take away from something that was maxing out all cores processing this input. I'd guess that a context-switch to re-generate in 128k chunks might be competitive with running filesystem / pagecache code and allocating pages to read data from disk. Of course, if it's already hot in the pagecache, it's just basically memcpy. OTOH, we already write about as fast as memcpy! (which has to split main memory bandwidth between reading and writing). (Also note that writing to memory that's not already hot in cache usually triggers a read-for-ownership to maintain cache coherency, which can be avoided with non-temporal stores, or with x86's rep movsb (optimized memcpy and memset in microcode, which avoids RFO, since Andy Glew's implementation of it in P6 (Pentium Pro))).


So far this is only a proof of concept, and the newline handling is only approximately correct. It's wrong around the ends of a power-of-2 buffer. With more development time, I'm confident I could find a more efficient way to insert newlines that's also exactly correct, with overhead at least as low as this (compared to outputting only spaces). I think that overhead is something like 10 to 20%. I'm only interested in knowing how fast we could make this run, not in actually having a polished version of it, so I'll leave that part as an exercise for the reader, with comments describing some ideas.


On a Haswell i5 at its 2.5GHz max turbo, with DDR3-1600MHz RAM, timed producing 100GiB, with times scaled down to 1GiB. (Timed on cygwin64 on Win10 with gcc5.4 -O3 -march=native, omitted -funroll-loops since I was having a hard enough time getting decent timing runs on this borrowed laptop. Should have just booted Linux on a USB).

writing to /dev/null unless otherwise specified.

  • James Hollis's: (not tested)
  • Nominal's fwrite version: ~2.21s
  • this (SSE2): ~0.142s (unscaled times = real=14.232s, user=13.999s, sys=0.187s).
  • this (AVX-128): ~0.140s
  • this (AVX2): ~0.073s (unscaled: real=0m7.291s, user=0m7.125s, sys=0m0.155s).
  • this (AVX2) cygwin piping to wc -c, with 128kiB buffer size: 0.32s with CPU at 2.38GHz (max dual-core turbo). (unscaled times: real=32.466s user=11.468s sys=41.092s, including both this and wc). Only half the data was actually copied, though, because my silly program assumes that write does the full buffer, even though that's not the case and cygwin write() only does 64k per call into a pipe.

So with SSE2 this is about 15 times faster than @Nominal Animal's scalar code. With AVX2, it's about 30 times faster. I didn't try a version of Nominal's code which just uses write() instead of fwrite(), but presumably for large buffers stdio mostly stays out of the way. If it is copying the data, that would account for a lot of slowdown.


Times to produce 1GB of data on a Core2Duo E6600 (Merom 2.4GHz, 32kiB private L1, 4MiB shared L2 caches), DDR2-533MHz in 64-bit Linux 4.2 (Ubuntu 15.10). Still using a 128kiB buffer size for write(), haven't explored that dimension.

writing to /dev/null unless otherwise specified.

  • (SSE2) this with newline handling and 4 vectors of digits from each vector of random bytes: 0.183s (timed doing 100GiB in 18.3s, but similar results for 1GiB runs). 1.85 instructions per cycle.
  • (SSE2) this, piping to wc -c: 0.593s (unscaled: real=59.266s user=20.148s sys=1m6.548s, including wc's CPU time). Same number of write() system calls as with cygwin, but actually piping all the data because Linux handles all 128k of a write() to a pipe.
  • NominalAnimal's fwrite() version (gcc5.2 -O3 -march=native), run with ./decdig 100 $((1024*1024*1024/200)) > /dev/null: 3.19s +/- 0.1%, with 1.40 instructions per cycle. -funroll-loops made maybe a tiny difference. clang-3.8 -O3 -march=native: 3.42s +/- 0.1%
  • Nominal-fwrite piping to wc -c: real=3.980s user=3.176s sys=2.080s
  • James Hollis's line-at-a-time version (clang++-3.8 -O3 -march=native): 22.885s +/- 0.07%, with 0.84 instructions per cycle. (g++5.2 was slightly slower: 22.98s). Writing only one line at a time probably hurt significantly.
  • Stéphane Chazelas's tr < /dev/urandom | ...: real=41.430s user=26.832s sys=40.120s. tr was getting all of a CPU core to itself most of the time, spending nearly all its time in the kernel driver generating random bytes and copying them to a pipe. The other core on this dual core machine was running the rest of the pipeline.
  • time LC_ALL=C head -c512M </dev/urandom >/dev/null: i.e. just reading that much randomness with no piping: real=35.018s user=0.036s sys=34.940s.
  • Lưu Vĩnh Phúc's perl program (perl v5.20.2 from Ubuntu15.10):
    LANG=en_CA.UTF-8: real=4m32.634s user=4m3.288s sys=0m29.364s.
    LC_ALL=C LANG=C: real=4m18.637s user=3m50.324s sys=0m29.356s. Still very slow.

  • (SSE2) this with no newline handling, and either 3 or 4 vectors of digits from each vector of random bytes (almost exactly the same speed: the dig3 = v%10 step is about break-even on this HW): 0.166s (1.82 instructions per cycle). This is basically the lower limit for what we can come close to with perfectly efficient newline handling.

  • (SSE2) Old version of this with no newline handling, but only getting one digit per uint16_t element using v%10, 0.222 seconds +/- 0.4%, 2.12 instructions per cycle. (Compiled with gcc5.2, -march=native -O3 -funroll-loops. Unroll loops does happen to help for this code on this hardware. Don't use it blindly, especially for large programs).
  • (SSE2) Old version of this, writing to a file (on a RAID10f2 of 3 fast magnetic hard drives, not very optimized for writes): ~4 seconds. Could go faster by tweaking kernel I/O buffer settings to allow a lot more dirty data before write() blocks. "System" time is still ~1.0 seconds, much higher than "user" time. On this old system with slow DDR2-533 RAM, it takes ~4x longer for the kernel to memcpy the data into the pagecache and run XFS functions than it does for my loop to keep rewriting it in-place in a buffer that stays hot in cache.

How it's done

A fast PRNG is obviously essential. xorshift128+ can be vectorized, so you have two or four 64-bit generators in parallel, in elements of a SIMD vector. Each step produces a full vector of random bytes. (256b AVX2 implementation here with Intel intrinsics). I picked it over Nominal's choice of xorshift*, because 64-bit vector integer multiplication is only possible in SSE2/AVX2 with extended-precision techniques.


Given a vector of random bytes, we can chop up each 16-bit element into multiple decimal digits. We produce multiple vectors of 16-bit elements that are each one ASCII digit + ASCII space. We store that directly into our output buffer.

My original version just used x / 6554 to get one random digit from every uint16_t element of a vector. It's always between 0 and 9, inclusive. It's biased away from 9, because (2^16 -1 ) / 6554 is only 9.99923. (6554 = ceil((2^16-1)/10), which ensures that the quotient is always < 10.)

x/6554 can be computed with one multiply by a "magic" constant (the fixed-point reciprocal) and a right shift of the high-half result. This is the best case for division by a constant; some divisors take more operations, and signed division takes extra work. x % 10 has a similar bias and isn't as cheap to compute. (gcc's asm output is equivalent to x - 10*(x/10), i.e. an extra multiply and subtract on top of the division using a modular multiplicative inverse.) Also, the lowest bit of xorshift128+ is not as high quality, so dividing to take entropy from high bits is better (for quality as well as speed) than modulo to take entropy from low bits.

However, we can use more of the entropy in each uint16_t by looking at the low decimal digits, like @Nominal's digit() function. For maximum performance, I decided to take the low 3 decimal digits and x/6554, to save one PMULLW and PSUBW (and probably some MOVDQA) vs. the higher quality option of taking the 4 low decimal digits. x/6554 is slightly affected by the low 3 decimal digits, so there is some correlation between digits from the same element (8 or 16 digits separation in the ASCII output, depending on vector width).

I think gcc is dividing by 100 and by 1000, rather than a longer chain that successively divides by 10, so it's probably not significantly shortening the length of the non-loop-carried dependency chain that produces 4 results from each PRNG output. port0 (vector multiply and shift) is the bottleneck because of the modular multiplicative inverses, and the shifts in xorshift+, so it's definitely useful to save a vector-multiply.

xorshift+ is so fast that even using only ~3.3 bits of randomness from every 16 (i.e. 20% efficiency) is not a lot slower than chopping it up into multiple decimal digits. We only approximate the uniform distribution, because this answer is focused on speed as long as the quality isn't too bad.

Any kind of conditional behaviour that keeps a variable number of elements would take much more work. (But could maybe still be done somewhat efficiently using SIMD left-packing techniques. However, that gets less efficient for small element sizes; giant shuffle-mask lookup tables are not viable, and there's no AVX2 lane-crossing shuffle with smaller than 32-bit elements. A 128b PSHUFB version might still be able to generate a mask on the fly with BMI2 PEXT/PDEP, like you can for AVX2 with larger elements, but it's tricky because a 64-bit integer only holds 8 bytes. The godbolt link on that answer has some code that might work for higher element counts.)


If latency of the RNG is a bottleneck, we could go even faster by running two vectors of generators in parallel, alternating which one we use. The compiler can still easily keep everything in registers in an unrolled loop, and that lets the two dependency chains run in parallel.

In the current version, chopping up the output of the PRNG, we actually bottleneck on port 0 throughput, not PRNG latency, so there's no need for that.


The code: AVX2 version

Full version with more comments on the Godbolt compiler explorer.

Not very tidy, sorry I have to get to sleep and want to get this posted.

To get the SSE2 version, s/_mm256/_mm, s/256/128/, s/v16u/v8u/, and change vector_size(32) to 16. Also change the newline increment from 4*16 to 4*8. (Like I said, code is messy, and not well set up for compiling two versions. Didn't originally plan on making an AVX2 version, but then I really wanted to test on a Haswell CPU I had access to.)

#include <immintrin.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
//#include <string.h>

// This would work equally fast 128b or 256b at a time (AVX2):
// https://stackoverflow.com/questions/24001930/avx-sse-version-of-xorshift128
struct rngstate256 {
    __m256i state0;
    __m256i state1;
};

static inline __m256i xorshift128plus_avx2(struct rngstate256 *sp)
{
    __m256i s1 = sp->state0;
    const __m256i s0 = sp->state1;
    sp->state0 = s0;
    s1 = _mm256_xor_si256(s1, _mm256_slli_epi64(s1, 23));
    __m256i state1new = _mm256_xor_si256(_mm256_xor_si256(_mm256_xor_si256(s1, s0),
                            _mm256_srli_epi64(s1, 18)),
                      _mm256_srli_epi64(s0, 5));
    sp->state1 = state1new;
    return _mm256_add_epi64(state1new, s0);
}



// GNU C native vectors let us get the compiler to do stuff like %10 each element
typedef unsigned short v16u __attribute__((vector_size(32)));

__m256i* vec_store_digit_and_space(__m256i vec, __m256i *restrict p)
{
    v16u v = (v16u)vec;
    v16u ten = (v16u)_mm256_set1_epi16(10);

    v16u divisor = (v16u)_mm256_set1_epi16(6554);  // ceil((2^16-1) / 10.0)
    v16u div6554 = v / divisor;      // One decimal digit (0..9) from the high bits of each element.
    // Probably some correlation with the modulo-based values, especially dig3, but we do this instead of
    // dig4 for more ILP and fewer instructions total.

    v16u dig1 = v % ten;
    v /= ten;
    v16u dig2 = v % ten;
    v /= ten;
    v16u dig3 = v % ten;
    //  dig4 would overlap much of the randomness that div6554 gets

    const v16u ascii_digitspace = (v16u)_mm256_set1_epi16( (' '<<8) | '0');

    v16u *vecbuf = (v16u*)p;
    vecbuf[0] = div6554 | ascii_digitspace;
    vecbuf[1] = dig1    | ascii_digitspace;
    vecbuf[2] = dig2    | ascii_digitspace;
    vecbuf[3] = dig3    | ascii_digitspace;
    return p + 4;  // always a constant number of full vectors
}


void random_decimal_fill_buffer(char *restrict buf, size_t len, struct rngstate256 *restrict rngstate)
{
    buf = __builtin_assume_aligned(buf, 32);

    // copy to a local so clang can keep state in register, even in the non-inline version
    // restrict works for gcc, but apparently clang still thinks that *buf might alias *rngstate
    struct rngstate256 rng_local = *rngstate;

    __m256i *restrict p = (__m256i*restrict)buf;
    __m256i *restrict endbuf = (__m256i*)(buf+len);
    static unsigned newline_pos = 0;
    do {
        __m256i rvec = xorshift128plus_avx2(&rng_local);
        p = vec_store_digit_and_space(rvec, p);  // stores multiple ASCII vectors from the entropy in rvec

#if 1
        // this is buggy at the end or start of a power-of-2 buffer:
        // usually there's a too-short line, sometimes a too-long line
        const unsigned ncols = 100;
        newline_pos += 4*16;
        if (newline_pos >= ncols) {
            newline_pos -= ncols;
            char *cur_pos = (char*)p;
            *(cur_pos - newline_pos*2 - 1) = '\n';
        }
#endif
        // Turning every 100th space into a newline.
        // 1) With an overlapping 1B store to a location selected by a counter.  A down-counter would be more efficient
        // 2) Or by using a different constant for ascii_digitspace to put a newline in one element

        // lcm(200, 16) is 400 bytes, so unrolling the loop enough to produce two full lines makes a pattern of full vectors repeat
        // lcm(200, 32) is 800 bytes
        // a power-of-2 buffer size doesn't hold a whole number of lines :/
        // I'm pretty sure this can be solved with low overhead, like maybe 10% at worst.
    } while(p <= endbuf-3);

    *rngstate = rng_local;
}



#define BUFFER_SIZE (128 * 1024)
const static size_t bufsz = BUFFER_SIZE;
__attribute__((aligned(64))) static char static_buf[BUFFER_SIZE];

int main(int argc, char *argv[])
{
    // TODO: choose a seed properly.  (Doesn't affect the speed)
    struct rngstate256 xorshift_state = {
      _mm256_set_epi64x(123, 456, 0x123, 0x456),
      _mm256_set_epi64x(789, 101112, 0x789, 0x101112)
    };

    for (int i=0; i < 1024ULL*1024*1024 / bufsz * 100; i++) {
        random_decimal_fill_buffer(static_buf, bufsz, &xorshift_state);
        size_t written = write(1, static_buf, bufsz);
        (void)written;
        //fprintf(stderr, "wrote %#lx of %#lx\n", written, bufsz);
    }

}

Compile with gcc, clang, or ICC (or hopefully any other compiler that understands the GNU C dialect of C99, and Intel's intrinsics). GNU C vector extensions are highly convenient to get the compiler to generate the magic numbers for division/modulo using modular multiplicative inverses, and occasional __attribute__s are useful.

This could be written portably, but it would take more code.


Performance notes:

The overlapping-store to insert newlines has significant overhead to decide where to place it (branch mispredictions, and frontend bottlenecks on Core2), but the store itself has no impact on performance. Commenting out just that store instruction in the compiler's asm (leaving all the branching the same) left the performance on Core2 completely unchanged, with repeated runs giving the same time to +/- less than 1%. So I conclude that the store buffer / cache handle it just fine.

Still, using some kind of rotating window of ascii_digitspace with one element having a newline might be even faster, if we unroll enough that any counters/branching go away.


Writing to /dev/null is basically a no-op, so the buffer probably stays hot in L2 cache (256kiB per core on Haswell). The perfect speedup from 128b vectors to 256b vectors is expected: there are no extra instructions, and everything (including the stores) happens with twice the width. The newline-insertion branch is taken twice as often, though. I unfortunately didn't time on my Haswell cygwin setup with that part #ifdefed out.

2.5GHz * 32B / 13.7GB/s = 5.84 cycles per AVX2-store on Haswell. That's pretty good, but could be faster. Maybe there's more overhead in the cygwin system calls than I thought. I didn't try commenting those out in the compiler's asm output (which would ensure that nothing optimized away).

L1 cache can sustain one 32B store per clock, and L2 is not much lower bandwidth (higher latency, though).

When I looked at IACA a few versions ago (without the branching for newlines, but only getting one ASCII vector per RNG vector), it was predicting something like one 32B vector store per 4 or 5 clocks.

I was hoping to get more of a speedup from extracting more data from each RNG result, based on looking at the asm myself, considering Agner Fog's guides and other optimization resources (I've added links for them in the SO x86 tag wiki).

Likely it would be significantly faster on Skylake, where vector integer multiply and shift can run on twice as many ports (p0 / p1) compared to Haswell (p0 only). xorshift and the digit extraction both use a lot of shifts and multiplies. (Update: Skylake runs it at 3.02 IPC, giving us 3.77 cycles per 32-byte AVX2 store, timed at 0.030s per 1GB iteration, writing to /dev/null on Linux 4.15 on i7-6700k at 3.9GHz.)


It doesn't require 64-bit mode to work well. The SSE2 version is just as fast when compiled with -m32, because it doesn't need very many vector registers, and all the 64-bit math is done in vectors, not general-purpose registers.

It's actually slightly faster in 32-bit mode on Core2, because compare/branch macro-fusion only works in 32-bit mode, so there are fewer uops for the out-of-order core (18.3s (1.85 Instructions Per Clock) vs. 16.9s (2.0 IPC)). The smaller code-size from having no REX prefixes also helps Core2's decoders.

Also, some reg-reg vector moves are replaced with loads, since not all the constants fit in vector regs anymore. Since load throughput from L1 cache isn't a bottleneck, this actually helps. (e.g. multiplying by a constant vector of set1(10): movdqa xmm0, xmm10 / pmullw xmm0, xmm1 turns into movdqa xmm0, [constant] / pmullw xmm0, xmm1.) Since reg-reg MOVDQA requires an ALU port, it competes with the real work being done, but a MOVDQA load only competes for front-end decode bandwidth. (Having a 4-byte address inside many instructions cancels out a lot of the gain from saving REX prefixes.)

I wouldn't be surprised if saving ALU MOVDQA uops is where the real gains are coming from, since the frontend should be keeping up with the average of 2.0 IPC pretty well.

All these differences disappear on Haswell, where the whole thing should run from the decoded-uop cache, if not the loopback buffer. ALU+branch macro-fusion works in both modes since Nehalem.

Peter Cordes
  • 6,466
  • 7
    I just love how you went "beast mode" into the subject! :) More importantly, it is an excellent example of what kind of gains are available, if you truly need or want to squeeze out the maximum performance, utilizing very low-level knowledge of the hardware at hand. Plus, we are only using one thread here; most current desktop and server Intel/AMD processors (and even ARM ones in lightweight tablets and SBCs) have multiple cores, so there are still more real-world-time-taken speedups available. And finally, how impractical "the fastest way to" questions are, due to sheer effort involved. – Nominal Animal Nov 20 '16 at 05:57
  • 1
    @NominalAnimal: Yeah, even a slow ARM quad or octo core could easily saturate main memory bandwidth doing the same thing with NEON (even if they were hooked up to fast dual channel DDR3), if it has 64-bit integer SIMD adds and shifts. I assume NEON has 16-bit element-size multiplies for audio work. Instruction-scheduling would be much more work for an in-order ARM, because each iteration of the loop-carried dependency chain (the xorshift128+) feeds a few independent dependency chains of chopping that up and getting it stored to memory... – Peter Cordes Nov 20 '16 at 06:11
  • ... Out-of-order execution eats that for breakfast, because the whole thing is short enough that several iterations fit in the ROB (192 uops on HSW IIRC). (i.e. the "window" of instructions that out-of-order execution sees includes multiple iterations). So the CPU can be finishing the final store for 2 or 3 iterations ago while also getting started on the beginning of the current iteration. This hides the latency of the independent chains, so only throughput matters. On an in-order core, this would require software pipelining... – Peter Cordes Nov 20 '16 at 06:11
  • ... A good ARM compiler should do some of that for you if you write it with intrinsics (or GNU C native vector syntax for the whole thing, like I should have done in the first place). I don't have any experience with doing that for real, so you might need to massage your loop and maybe do some manual unrolling / software-pipelining in the source to get good asm. (There are some out-of-order ARM cores, found in higher-end phones, but they don't have as large an out-of-order window as Haswell. OTOH, they have lower peak throughput, too, so there's less to gain from finding more ILP). – Peter Cordes Nov 20 '16 at 06:17
  • 1
  • @NominalAnimal: also, agreed on the silliness of the question. "Fastest" with no restrictions on quality of randomness is silly... With BTRFS, the same data on disk can be part of a file multiple times (see EXTENT_SAME in 4.2). So you can generate a random 4kiB or 1MB and repeat it. It's randomness with a short period, but it's still random, and costs only metadata I/O. (Actually, you need the block to end with a newline. lcm(4096, 200) = 4096 * 200 = 819200 = 800kiB, so your repeat block is any multiple of that.) – Peter Cordes Nov 20 '16 at 06:48
  • And although I looked at the compiler-generated asm, I didn't do much with performance counters, or hand-write any asm myself. I'm sure there are more minor speedups (and for newline handling, major speedups) to be had. And of course the question of doing anything with this data... I guess piping it somewhere makes sense, because generating on the fly as-needed is faster and cheaper than writing to disk and re-reading, unless the reader is multi-threaded and can use all cores. (And even then, gen on the fly leaves it hot in L3 cache.) – Peter Cordes Nov 20 '16 at 06:51
  • Actually the repeat period for newline-at-the-end-of-a-block is LCM(4096, 200) = 102400 = 100kiB. Depending what you're doing with the data, that might be useful to know. – Peter Cordes Nov 22 '16 at 07:45
  • (I don't really want to consider the cache-DMA effects on various hardware; the rules for whether DMA invalidates the cache or not, and so on, are too varied for my limited brain to say anything meaningful about them ;). Note: 100kiB = 25×4kiB = 5 × 5 × 4 kiB. If the storage (device, or the filesystem on the device -- RAIDs tend to have multiples of some power of two) uses a power-of-two block size larger than 4kiB, 100kiB is not large enough. I'd really prefer to let the user specify the block size at run time, even with the considerably added complexity. – Nominal Animal Nov 22 '16 at 09:58
  • @NominalAnimal: Modern x86 has cache-coherent DMA. The L3 has inclusive tags, so the "system agent" snoops the last-level cache on the way from PCIe to memory controller. I guess invalidating if necessary on hits. I don't really know the details, but I'm pretty sure that x86 has had cache-coherent DMA for many years now (but didn't way back in the day). See this diagram, which includes Skylake with eDRAM, which it uses as a memory-side cache (i.e. it can cache DMA). (This is a change vs. Broadwell, where the eDRAM was an L4 cache.) – Peter Cordes Nov 22 '16 at 10:11
  • Does that apply to AMD ones too? I'm not familiar with the latest ones' details -- I'm still on a Phenom II when I'm not on a i5-4200U laptop. – Nominal Animal Nov 22 '16 at 10:12
  • @NominalAnimal: Yes, AMD has had cache-coherent DMA since before PhenomII. AFAIK, the mechanism is basically the same: snoop caches for system I/O on its way through the CPU on its way to the on-chip memory controllers. The details are different, and I'm not sure exactly how AMD implements it power-efficiently if they don't have one set inclusive tags to snoop the way Intel does with their large shared last-level cache. But anyway yes, x86 in general has cache-coherent DMA since at least 2002, probably earlier (like P6), not just Intel's implementation. – Peter Cordes Nov 22 '16 at 10:17
  • @NominalAnimal: anyway yes, you might well see a win from making write() system calls in a size that matches the ideal IO request size for your storage, if you're actually going to write to disk. Note the update to my answer where I suggest that re-generating on the fly from the same seed may be better in a lot of cases than writing to disk. So I don't consider writing to disk a very interesting use-case compared to piping. However you're probably right that a tiny win in loop-unrolling efficiency isn't worth less efficient piping / copying. – Peter Cordes Nov 22 '16 at 10:19
14

Here is a solution I hope is simple to understand:

od -An -x /dev/urandom | tr -dc 0-9 | fold -w100 | awk NF=NF FS= | head -c1G
  • od creates a uniform stream of hexadecimal digits from /dev/urandom.
  • tr gets rid of letters, only keeping 0-9 digits
  • fold ensures there are 100 digits per line
  • awk inserts spaces inside lines
  • head truncates the input to 1 gigabyte
  • 2
    That's a nice alternative way to produce more than one digit per byte of /dev/urandom while still having a uniform distribution, that produces 320 digits for every 256 bytes of /dev/urandom on average (less than when you convert bytes < 200 modulo 100 to decimal which gives you 400 digits for every 256 bytes though). – Stéphane Chazelas Nov 18 '16 at 15:10
6

You can use the jot command for this:

jot -r 50000000 0 9 | fmt -w 200 > output.txt
gardenhead
  • 2,017
6

This is similar to Stéphane Chazelas' method, however I read 64 bits at once to improve performance. The distribution is still uniform, but now you get 19 digits for each 8 bytes instead of only 8 in the best case, as before.

perl -nle 'BEGIN{$/=\8; $,=" "}
           $n = unpack("Q");
           next if $n >= 10000000000000000000;
           $s = sprintf("%019u", $n);
           push @a, (split //, $s);
           if (@a >= 100) {print (splice @a, 0, 100);}' < /dev/urandom | head -c1G

On a 32-bit platform, 9 digits will be read each time instead of 19.

phuclv
  • 2,086
  • This may raise exception if your system does not support 64-bit integer or perl isn't compiled with quad support. – cuonglm Nov 17 '16 at 06:11
  • @cuonglm yes as I said if perl is not 64 bits on that system then the program must be changed to next if $n >= 1000000000; $s = sprintf("%09u", $n); to get only 9 digits – phuclv Nov 17 '16 at 06:14
  • You can't, the program will crash at $n = unpack("Q") if quad isn't supported. – cuonglm Nov 17 '16 at 06:15
  • 1
    @cuonglm change to BEGIN{$/=\4; $,=" "} $n = unpack("L"); also – phuclv Nov 17 '16 at 06:18
  • @downvoter: is there any reason for the downvote? can you just comment what is wrong or should be improved? – phuclv Nov 18 '16 at 02:35
  • 1
    Sorry, but this gets 19 digits from 8 bytes input only about 54.2% of the time and none the rest, averaging 1.29 digits per input byte. If more like Stephane you use <16e18 and divide by 16, you get 18 digits 86.7% for 1.95 dpB. With 32bit, <4e9 /4 gets 9 digits 93.1% for 2.10 dpB. But 5 bytes (as hex(H10)) <1e12 gives 12 digits 90.9% for 2.18 dpB, or splitting the hex in half and doing each half <1e6 gives 6 digits 95.4% for 2.29 dpB; this approaches the limit of log_10(256)=2.41. – dave_thompson_085 Nov 19 '16 at 08:31
  • This is unfortunately a lot slower than Stéphane's tr pipeline, on x86-64 Ubuntu15.10 on my Core2Duo. 4m18s vs. 0m41.4s real time, both writing to /dev/null and run with LC_ALL=C. See benchmarks in my answer (and a solution that does the same thing in 0.18 seconds on the same hardware, with lower quality randomness). – Peter Cordes Nov 21 '16 at 04:06
3

I kind of agree with Nominal Animal in using a compiled programming language if you need speed. However, you do not have to write your own RNG code in C. C++11 offers the excellent Mersenne Twister as part of its standard library.

#include <time.h>
#include <random>
#include <iostream>
using namespace std;

int main() {
    mt19937 gen(time(0)); 
    uniform_int_distribution<> dist(0,9);

    for(int j=0; j<5000000; j++){
        for (int i = 0; i < 99; i++) {  
            cout << dist(gen) << " ";
        }  
        cout << dist(gen) << endl;
    }
    return 0;
}

The above code is reasonably simple and takes about a minute when I pipe the output to a file. We can go a lot faster by creating a string big enough for 100 digits and hacking the digits into it. This lets us call cout once per line rather than once per digit.

#include <time.h>
#include <random>
#include <iostream>
using namespace std;

int main() {
    mt19937 gen(time(0)); 
    uniform_int_distribution<> dist(0,9);

    char line[201];
    for(int i=1; i<199; i++)
        line[i] = ' ';
    line[199] = '\n';
    line[200] = 0;

    for(int j=0; j<5000000; j++){
        for (int i = 0; i < 199; i += 2) {  
            line[i] = dist(gen)+'0';
        }  
        cout << line;
    }
    return 0;
}

This code takes around six seconds on my machine. Remember, it writes to standard output, so redirect it to a file.

I have a couple of disclaimers. First, I'm writing this on a Windows PC. I think the libraries are all present on Linux, but if I'm wrong, be sure to point it out.

Also, it actually outputs exactly half a billion space-separated digits, which is technically a gigabyte but maybe not exactly what you wanted. It outputs 5 million lines, 100 digits per line. If the difference is important, you can increase the number of lines. On my Windows box the file comes out slightly larger than 10^9 bytes, because standard output in text mode on Windows writes each "\n" as CRLF, adding one extra byte per line.

  • 2
    Hey, the criticism is not really fair! :) Most of my program is command-line parameter parsing. If I also omit comments, error checks, and hardcode the number of columns and lines output, I can make it less than twice the size of your code -- hardly monstrous. :) Kidding aside: Yes, the libraries are available in most Linux distributions. On my laptop, your line-at-a-time takes about 14 seconds, whereas my line-at-a-time version takes just 1.3 seconds. The difference is only due to the PRNG: Mersenne Twister is that much slower than Xorshift64*. – Nominal Animal Nov 18 '16 at 06:06
  • 1
    There is one practical thing I'd like to point out that you have missed, but I hope you don't take it as negativity, just something to think about: As I mentioned in my answer, one-shot programs are rarely worth the time they took to write. That is why adding command-line parsing, and the help usage text, is almost always worthwhile. I have a large set of such utility programs, and rather than hunt their sources to find out what each of them does, I just run them, so they will tell me; and I can modify their behaviour enough to suit more than one need. Amortizing development cost. – Nominal Animal Nov 18 '16 at 06:11
  • @NominalAnimal another important thing is that you piped the output to /dev/null which would be far faster than writing to a real file – phuclv Nov 18 '16 at 10:51
  • @LưuVĩnhPhúc: Well, not really. This lappy has a Samsung 128GB SSD, with ~ 500 MB/s sequential reads and writes. Put four in a Linux software-RAID0 configuration, and you'll get well over a gigabyte a second reads and writes when generating such large datasets (I estimate ~ 1.75 GB/s). 1GB/s was reached years ago with 12 SATA drives (spinning platters, not even SSDs) with Linux sw-RAID0. (Note: bytes/s, not bits/s.) Sure, it sounds silly for a "normal" machine, but those who play with large datasets, find that worthwhile -- you shave time off everything you do (with large datasets) that way. – Nominal Animal Nov 18 '16 at 11:49
  • @Nominal Animal That's true. My program would be nearly as big as yours if it was parsing command line parameters, handling errors and offering help text. I kept it short to make it obvious how it worked. – James Hollis Nov 18 '16 at 12:13
  • @LưuVĩnhPhúc: I have a desktop with an I7 cpu, so it is inaccurate to compare our running times anyway. I'm sure he piped the output to the same place when comparing our code. – James Hollis Nov 18 '16 at 12:42
  • 1
    @NominalAnimal and Lu'u: More importantly, if you have enough RAM, the program can exit well before the data is all on disk. Most of the work in a large write() system call is a memcpy into the pagecache, which blocks only if the kernel decides to do that instead of allocating more buffer space. This program should only bottleneck on disk I/O when memory is tight, or if you had used O_DIRECT to bypass the pagecache. If you write() in chunks smaller than cache size, hopefully your data only goes into main memory once, and the buffer that's rewritten in place stays hot in L2 or L3 cache. – Peter Cordes Nov 18 '16 at 20:38
  • @JamesHollis: Good point. I did use the same command form, time bash -c 'stanza-to-generate-data' > /dev/null, for all tests; with the complete pipe sequence (including head) in the stanza-to-generate-data. I even made sure the machine is quiescent otherwise. The variation from test to test was well under a second for long ones, and about 10% for short (< 15s) ones. This variation comes from the CPU governor I was using (i.e., CPU clock rate changes under load), although I ensured performace was used (and laptop on AC power, not on battery). Not a true benchmark, but indicative. – Nominal Animal Nov 19 '16 at 07:26
  • @PeterCordes: Quite true. If I were to measure real-work storage speeds on some machine, I'd sync and drop caches sync ; echo 3 >/proc/sys/vm/drop_caches before the test, and include a final sync in the timings -- but then it is imperative that the system is quiescent otherwise during the test. It must be then repeated a few times, to avoid spikes caused by other activity. I use the median (not mean) of the timings as the result. That yields timings that include both data generation and the time it takes for the data to be on disk. Usually, slower one (often disk) dictates the result. – Nominal Animal Nov 19 '16 at 07:33
  • @NominalAnimal: I've been playing with this. My vectorized SSE2 version writes 1GiB to /dev/null in 0.208 seconds +/- 0.5% or so, writing in 128k chunks (Core2Duo E6600, 2.4GHz). But writing to a file, on XFS on a Linux md RAID10f2 of 3 disks, it takes about 4 seconds. (with about 1 second of "system" CPU time and still 0.2 seconds of user time). Piping into wc -c takes 0.61 seconds of real time. (Bash's time builtin measure the same 0.208s of user time, and 0.68s of system time, including both processes in the pipeline). I can produce data in cache much faster than memcpy (DDR2-533). – Peter Cordes Nov 19 '16 at 09:02
  • @PeterCordes: Interesting results -- I love how it shows that combining low-level knowledge (SSE/AVX extensions, cache behaviour) with a good algorithm can still squeeze a nice speedup! Have you measured your RAID sustained write rate? Note that mine is a laptop, Core i5-4200U, and I've not tuned either the hardware or the kernel settings for performance; so I do expect larger variance in my timings. OTOH, I only ran a few (under a dozen) tests, so my variances are .. unreliable :). Also, do you have comparison timings with the other implementations here, on your hardware? – Nominal Animal Nov 19 '16 at 11:09
1

It depends on your definition of "random". If you mean cryptographically random, you just have to get a good library, bite the bullet, and wait for it to run.

If you just need something that looks pretty random, here's an easy way:

  1. Get a file that is several GB long. Your favorite movie will be good.
  2. Gzip it, an easy way to squeeze out repeated patterns
  3. Go through the file a nybble (half a byte) at a time. Each value will be between 0 and 15. Throw away any value less than 1 or greater than 10. Subtract 1 from each of the first billion survivors and write each one out as a digit.

It might take an hour to run on a slow machine; fast enough and random enough for most purposes.
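A rough shell rendering of those three steps (a sketch, not the author's code; movie.avi stands in for whatever large file you use, and the two tr steps do the "discard, then subtract 1" part on the hex representation of each nibble):

```shell
gzip -c < movie.avi |    # squeeze out repeated patterns
  od -An -v -tx1 |       # one hex character per nibble
  tr -d ' \n' |          # strip od's whitespace
  tr -d '0b-f' |         # throw away nibbles < 1 or > 10
  tr '1-9a' '0-9' |      # subtract 1: 1..10 becomes 0..9
  head -c 1000000000 > digits.txt
```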

1
#!/bin/bash
FILE_CREAT='/tmp/testfile'
MAX_SIZE=$(( 1 * 1024 * 1024 ))   # du -s reports 1 KB blocks, so this is 1 GB
rm -f ${FILE_CREAT}
while true
do
    STRING=''
    for (( i = 0 ; i < 100 ; i++ ))
    do
        NUM_RAN=$(cat /dev/urandom | tr -dc 0-9 | head -c 1)
        if [ $i -eq 0 ]
        then
            STRING=${NUM_RAN}
        else
            STRING=${STRING}' '${NUM_RAN}
        fi
    done
    echo ${STRING} >> $FILE_CREAT
    FILE_SIZE=$(du -s ${FILE_CREAT} | awk '{print $1}')
    if [ ${FILE_SIZE} -ge ${MAX_SIZE} ]
    then
        break
    fi
done
exit 0
NamNT
  • 1
    Welcome to the site! See links on my profile page. There are a great many issues here that I see almost universally in shell scripts, but that doesn't make them correct. – Wildcard Nov 18 '16 at 07:01
  • And also http://unix.stackexchange.com/q/235985/135943 – Wildcard Nov 18 '16 at 07:03
  • @Wildcard - as shell scripts in the wild go, this one is actually pretty ok. What issues are you referring to (outside of lack of variable quoting) – iruvar Nov 19 '16 at 04:04
  • @iruvar, actually, you're right. And looking at it again, there are actually no potentials for split/glob problems either unless IFS contains digits. The biggest non-stylistic improvement I can see would be to move the >> $FILE_CREAT to a > $FILE_CREAT appended to the done line, so the file will only need to be opened once. – Wildcard Nov 19 '16 at 04:22
  • @iruvar, although actually, the inner loop can be replaced with something like cat /dev/urandom | tr -dc 0-9 | fold -bw 100 | sed 's/./& /g;s/ $//;1q' But we're getting closer and closer to just copying the top answer. – Wildcard Nov 19 '16 at 04:28
  • 2
    @Wildcard: never cat file | tr when you can just tr <file. IIRC, you can even <file tr. I thought you were just talking about this shell script looking clunky and slow, like du | awk after every line to check the size, and re-opening the file for append every line instead of redirecting outside the loop. – Peter Cordes Nov 19 '16 at 14:12
  • 2
    @PeterCordes, yes. Why is using a shell loop to process text considered bad practice? is particularly relevant—this script is based on an idea that Bash is a programming language such as C, which it's not. But, @NamNT, I hope you stick around on this site because it's clear that you have a very logical mind. :) – Wildcard Nov 19 '16 at 16:07
  • 4
    @PeterCordes cat /dev/urandom | busy-cmd is one of those rare cases where it can make sense as it can split the random generation and busy cmd between processors. Not so much for tr but makes a difference for Sam's od for instance. – Stéphane Chazelas Nov 21 '16 at 07:06
  • 1
    @StéphaneChazelas: oh right!! Yeah, the read() system call is where the RNG CPU time is spent. – Peter Cordes Nov 21 '16 at 07:07