After the graphics/imagery work of the last experience, I thought to revisit, once again, what ‘quality’ means for pseudo-random number generators.
I added code for the xoroshiro128+ RNG, which uses combinations of xor, rotate, and shift operations to generate random numbers quickly.
On the Raspberry Pi 4 running Raspbian:
$ ./xmcpi.x
xmcpi.f90 number of images: 1
approximating pi in 1 x 10100100100 points
nsum 7932562674.0000000 msum 10100100100.000000
pi 3.1415926535897931 iterated pi: 3.1415778439661208
pi error 1.4809623672284999E-005
Elapsed wall clock time 278. seconds, using 1 images.
pi@raspberrypi:~/rnums $ mpirun -np 2 -H localhost:4 ./xmcpi.x
xmcpi.f90 number of images: 2
approximating pi in 2 x 10100100100 points
nsum 7932590015.0000000 msum 10100100100.000000
nsum 7932562674.0000000 msum 10100100100.000000
pi 3.1415926535897931 iterated pi: 3.1415832579718690
pi error 9.3956179241239113E-006
Elapsed wall clock time 279. seconds, using 2 images.
$ mpirun -np 4 -H localhost:4 ./xmcpi.x
xmcpi.f90 number of images: 4
approximating pi in 4 x 10100100100 points
nsum 7932562674.0000000 msum 10100100100.000000
nsum 7932637592.0000000 msum 10100100100.000000
nsum 7932649267.0000000 msum 10100100100.000000
nsum 7932590015.0000000 msum 10100100100.000000
pi 3.1415926535897931 iterated pi: 3.1415965419986285
pi error -3.8884088353619006E-006
Elapsed wall clock time 403. seconds, using 4 images.
The RPi4 overheats and throttles, resulting in 400+ seconds of elapsed time when running on 4 cores at once. However, the results are numerically identical to those from the AMD64-compatible Intel CPUs.
So by this very simple empirical test, the xoroshiro128+ code is platform-independent.
So let’s try a different test: the statistical test suites frequently mentioned by RNG practitioners (authors, users, critics). This is the TestU01 suite, which includes a number of well-known RNGs and a scheme for testing anything we might come up with.
The code is, roughly
#include "unif01.h"
#include "bbattery.h"
#include "xoroshiro128plus_xd_xjr.h"
#include "xoroshiro128plus_xtc.h"
#include "pcg_od_ojr.h"
int main (void)
{
    unif01_Gen *gen;

    /* seed the generator's two 64-bit state words */
    xoroshiro128plus_srandom(134893149852ull, 18324891485ull);

    /* wrap the double-in-[0,1) function for TestU01 and run the battery */
    gen = unif01_CreateExternGen01 ("xoroshiro128plus_xd", xd_random_r8);
    bbattery_SmallCrush (gen);
    unif01_DeleteExternGen01 (gen);
    ...
This is fairly dense code, but it basically runs the SmallCrush battery of tests on the double-precision floating-point values generated by xoroshiro128plus_xd.
The test subjects are the RNG/float-conversion combinations named in the includes above.
SmallCrush is the shorter suite, and all of its tests pass.
Crush is a longer suite; of its tests, only 3 fail with unlikely p-values (outside 0.001 to 0.999).
So these RNGs are not so bad, according to TestU01.
Looking at the binary (or hexadecimal) representation of the floating point results, it’s pretty clear there are issues with the [1,2)-1 and the times-a-constant methods of creating floating-point numbers from integer random numbers.
Some of the least significant bits are either always zero or mostly zero. For example, the frequency of 1-bits in the “xtc” method gives
bit nbr actual expected
bit 0 2522555 5050050
bit 1 3786768 5050050
bit 2 4416094 5050050
bit 3 4733053 5050050
bit 4 4893247 5050050
bit 5 4969861 5050050
and the “xjr” method gives
bit 0 0 5050050
bit 1 2523599 5050050
bit 2 3785404 5050050
bit 3 4415926 5050050
bit 4 4734824 5050050
bit 5 4892002 5050050
meaning the least-significant bit is never 1, and the next few bits are not 1 half the time as they should be. (The number 5050050 is the expected count of ones.)
The “good” method by Downey looks like
bit 0 5048674 5050050
bit 1 5048621 5050050
bit 2 5050834 5050050
bit 3 5048751 5050050
bit 4 5049737 5050050
bit 5 5049658 5050050
What is an objective, scientific method to measure this? Chi-square, of course; it is a test for discrete distributions with large counts. Here we have a stored mantissa of 23 bits (single-precision) or 52 bits (double-precision), all of which should be independent and uniformly 1 or 0 half the time. We’ll run a chi-square test on the distribution of the 1-bits in these mantissas.
And for good measure I added a test of half-octets (nibbles, 4 bits), whose values should average 7.5.
The other measure is the frequency of exponents. This should be geometric: numbers ¼ the size should have an exponent ¼ as likely. For example
Expon. Freq.
...
-20 98
-19 167
-18 388
-17 753
-16 1480
...
-4 6258023
-3 12508379
-2 25025356
-1 50049539
This is easily tested by taking log2 of the frequency and checking that log2(freq) is linear with the exponent.
(The above list of exponent frequencies is from od_random_r4 for 10G values. The smaller exponents -31 through -21 have been omitted.)
Most of the single-precision methods work fine, since they essentially take 64 bits (or 48, for the drand48 builtin) and use only 23 of them. Any issues with lsbs do not appear.
The acceptable methods for single precision are the PCG, xoroshiro128+, and xoroshiro128++ RNGs with the Downey (d), the [1,2)-1 (jr), and the times-a-constant conversions. These also pass the half-octet tests.
Not surprisingly, for double precision only the Downey (d) methods pass the bits tests. The xoroshiro128+ passes the half-octet tests, while PCG passes only about a third of the time. As the half-octet tests are biased (some bits have more influence on the average value of a half-octet than others), it’s not clear to me this is really a useful test.
Here are some sample results for 20 tests of PCG and xoroshiro128+ on 100M numbers:
test      chisquare-bits  chisquare-halfoctets  Note
od_...r8  20 pass         9 of 20 pass          PCG
xd_...r8  20 pass         20 of 20 pass         xoroshiro128+
This contrasts with the integer64 tests, where fewer of both the PCG and xoroshiro128+ runs passed the half-octet tests. Rerunning those with the same seeds gives:
test      chisquare-bits  chisquare-halfoctets  Note
pcg       20 pass         6 of 20 pass          PCG
x...128+  20 pass         4 of 20 pass          xoroshiro128+
It is not surprising they differ; the floating-point numbers are derived from the random integers by shifts and, sometimes, replacement of bits.
One variant worth considering is a cryptographic transformation of the random integer before conversion to float, for example AES128 in CTR mode, to see whether that changes the result. It is slower, and no better.
Statistical test suites composed by statistics practitioners are not as complete as you might hope: easy-to-check defects in RNGs are missed, and this has been true for years.
The practitioners have the wrong objective in mind: generating statistics, not generating usable random numbers.
Do the least-significant-bits of pseudo-random floating-point numbers matter? Especially the double-precision kind? Why, for some applications, yes they do. When taking differences, when making exact calculations locally and averaging over large spatial areas (such as particle-in-cell methods), when testing sensitivity of numerical models by masking (truncating values), and so on.
April 2020
Asahi Super Dry
Links:
Pierre L'Ecuyer website
xoshiro / xoroshiro generators and the PRNG shootout
OpenBSD Numerics Experience - 1 - RNG
OpenBSD Numerics Experience - 2 - RNG floats
OpenBSD Numerics Experience - 3 - FFTW
OpenBSD Numerics Experience - 4 - CAF
OpenBSD Numerics Experience - 5 - MPI Networking
OpenBSD Numerics Experience - 6 - Memory Models
OpenBSD Numerics Experience - 7 - Python Image Display
OpenBSD Numerics Experience - 8 - RNGs, again
OpenBSD Numerics Experience - 9 - Nim
OpenBSD Numerics Experience - A - Graphical Display
OpenBSD Numerics Experience - B - ParaView
OpenBSD Numerics Experience - C - Numerical Debugging
OpenBSD Numerics