15

I read about RNGs on Wikipedia and $RANDOM function on TLDP but it doesn't really explain this result:

$ max=$((6*3600))
$ for f in {1..100000}; do echo $(($RANDOM%max/3600)); done | sort | uniq -c
  21787 0
  22114 1
  21933 2
  12157 3
  10938 4
  11071 5

Why are the values above about 2x more inclined to be 0, 1, 2 than 3, 4, 5 but when I change the max modulo they're almost equally spread over all 10 values?

$ max=$((9*3600))
$ for f in {1..100000}; do echo $(($RANDOM%max/3600)); done | sort | uniq -c
  11940 0
  11199 1
  10898 2
  10945 3
  11239 4
  10928 5
  10875 6
  10759 7
  11217 8
Stephen Kitt
  • 434,908
cprn
  • 1,025
  • 9
    The usual answer to this is to reroll (discard the number you received and pick another) if you're between the maximum value for RANDOM and the highest possible value that can divide evenly into your modulo. That's not usual-to-RANDOM, that's usual-to-using-modulo-to-restrict-RNG-domain across all languages/tools/etc. implementing RNGs of that type. – Charles Duffy Jul 04 '19 at 21:12
  • 7
    See my 2013 article on the source of this bias if you want some nice graphs of how bad it gets: https://ericlippert.com/2013/12/16/how-much-bias-is-introduced-by-the-remainder-technique/ – Eric Lippert Jul 05 '19 at 17:41
  • 1
    "The generation of random numbers is too important to be left to chance." - Robert Coveyou. FYI though: most programs are unable to generate truly random numbers – jesse_b Jul 05 '19 at 19:54
  • @Eric Lippert thank you, I'll read it gladly! – cprn Jul 06 '19 at 10:29
  • @Jesse_b Yes, I'm perfectly aware. It's also mentioned in the attached articles. – cprn Jul 06 '19 at 10:30
  • 1
    Note that, even though you are seeing issues due to modulo bias, the $RANDOM variable does not use a good PRNG internally. – forest Jul 07 '19 at 01:19

2 Answers2

37

To expand on the topic of modulo bias, your formula is:

max=$((6*3600))
$(($RANDOM%max/3600))

And in this formula, $RANDOM is a random value in the range 0-32767.

   RANDOM Each time this parameter is referenced, a random integer between
          0 and 32767 is generated.

It helps to visualize how this maps to possible values:

0 = 0-3599
1 = 3600-7199
2 = 7200-10799
3 = 10800-14399
4 = 14400-17999
5 = 18000-21599
0 = 21600-25199
1 = 25200-28799
2 = 28800-32399
3 = 32400-32767

So in your formula, the probability for 0, 1, 2 is twice that of 4, 5. And probability of 3 is slightly higher than 4, 5 too. Hence your result with 0, 1, 2 as winners and 4, 5 as losers.

When changing to 9*3600, it turns out as:

0 = 0-3599
1 = 3600-7199
2 = 7200-10799
3 = 10800-14399
4 = 14400-17999
5 = 18000-21599
6 = 21600-25199
7 = 25200-28799
8 = 28800-32399
0 = 32400-32767

1-8 have the same probability, but there is still a slight bias for 0, and hence 0 was still the winner in your test with 100'000 iterations.

To fix the modulo bias, you should first simplify the formula (if you only want 0-5 then the modulo is 6, not 3600 or even crazier number, no sense in that). This simplification alone will reduce your bias by a lot (32766 maps to 0, 32767 to 1 giving a tiny bias to those two numbers).

To get rid of bias altogether, you need to re-roll, (for example) when $RANDOM is lower than 32768 % 6 (eliminate the states that do not map perfectly to available random range).

max=6
for f in {1..100000}
do
    r=$RANDOM
    while [ $r -lt $((32768 % $max)) ]; do r=$RANDOM; done
    echo $(($r%max))
done | sort | uniq -c | sort -n

Test result:

  16425 5
  16515 1
  16720 0
  16769 2
  16776 4
  16795 3

The alternative would be using a different random source that does not have noticable bias (orders of magnitude larger than just 32768 possible values). But implementing a re-roll logic anyway doesn't hurt (even if it likely never comes to pass).

frostschutz
  • 48,978
  • Your answer is largely correct, except: "you need to re-roll when $RANDOM is lower than 32768 % 6" should actually be "equal to or greater than floor((RANDMAX+1)/6)*6" (i.e. 32766), and fix the associated shell code below that. – Nayuki Jul 06 '19 at 13:48
  • @Nayuki if you can point out a specific error (that applies within the given context) I'll be happy to correct it. My solution is just an example, there is different ways to do it. You can remove bias from start range, or end range, or somewhere in the middle, it makes no difference. You can calculate it better (and not do a modulo in every iteration). You can handle special cases such as arbitrary modulos and randmax values, also handle RANDMAX=INTMAX where RANDMAX+1 doesn't exist, but that was not the focus here. – frostschutz Jul 06 '19 at 14:47
  • Your reply is significantly worse than your post. First of all, I pointed out specifically which phrase of yours is factually wrong. Note that "32768 % 6" == 2, so you want to reroll every time $RANDOM < 2? Regarding bias at start/end/midde of range, your entire post is about removing bias at the end of range, and my response caters to exactly that too. Third, you talk about handling RANDMAX=INTMAX, but in your answer you mentioned the value 32768 (= 32767+1) numerous times, which implies you are comfortable with computing RANDMAX+1. – Nayuki Jul 06 '19 at 15:05
  • 2
    @Nayuki my code removes 0 and 1, yours removes 32766 and 32767 and I'd like you to elaborate: what difference does it make? I'm only human, I make mistakes, but all you said so far is "it's wrong" without explaining or showing why. Thank you. – frostschutz Jul 06 '19 at 15:09
  • 1
    Never mind, figured it out. Sorry about the false alarm. – Nayuki Jul 06 '19 at 15:17
  • I don't want to change the original post now but I was calculating a random number of seconds within the range of 6 hours, thus 6*3600; could have simplify it for the question, sorry. – cprn Sep 03 '19 at 20:45
23

This is modulo bias. If RANDOM is well constructed, each value between 0 and 32767 is produced with equal probability. When you use modulo, you change the probabilities: the probabilities of all the values above the modulo are added to the values they map to.

In your example, 6×3600 is approximately two thirds of the range of values. The probabilities of the top third are therefore added to those of the bottom third, which means that values from 0 to 2 (approximately) are twice as likely to be produced as values from 3 to 5. 9×3600 is nearly 32767, so the modulo bias is much smaller and only affects values from 32400 to 32767.

To answer your main question, at least in Bash the random sequence is fully predictable if you know the seed. See intrand32 in variables.c.

Stephen Kitt
  • 434,908