I wouldn't say any of my statistics were "too good," although idk exactly what value would indicate that. The lowest ones were still in the teens. Also, isn't some of what you're talking about above with the coin covered by degrees of freedom? A coin only has a single DoF.
Agreed, it was really just an aside because I love talking about probability; it's of little practical value beyond better understanding the limitations of tests like chi-squared. You're right that the coin has only 1 degree of freedom, but that doesn't cover the phenomenon I'm talking about.
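If it helps make the degrees-of-freedom bookkeeping concrete, here's a minimal sketch on simulated rolls (not your actual data), using scipy's chisquare with its defaults: a k-sided die gives k - 1 degrees of freedom, so the coin gets 1 and a d20 gets 19.

```python
# Minimal sketch: chi-squared goodness-of-fit on *simulated* fair rolls,
# just to show where the degrees of freedom come from (df = k - 1).
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)

for sides in (2, 20):                                        # coin vs. d20
    n_rolls = 1000                                           # arbitrary sample size
    rolls = rng.integers(1, sides + 1, size=n_rolls)         # fair k-sided die
    observed = np.bincount(rolls, minlength=sides + 1)[1:]   # counts of faces 1..k
    expected = np.full(sides, n_rolls / sides)               # equal expected counts
    stat, p = chisquare(observed, expected)                  # df = k - 1 by default
    print(f"k = {sides}: chi2 = {stat:.2f}, df = {sides - 1}, p = {p:.3f}")
```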
To be clear, I actually combined the raw data for that analysis, which should be equivalent to a single dataset with a larger # of rolls, no?
Not really. It's equivalent to a larger dataset with more rolls
internally, but multiple comparisons correction is about the procedure external to the test. The idea is that when you perform the test on data genuinely drawn from the distribution you're testing (i.e. the kind you want to not reject), there's a 5% chance that you incorrectly reject the hypothesis that the sampled data follows the theoretical distribution. Multiple comparisons correction is about recognizing this fact and adjusting your testing procedure to reduce the chance of this error (a Type 1 error). If you keep testing the same, or essentially the same, hypothesis on different datasets, eventually you're going to reject it - you expect to incorrectly reject it 5% of the time, after all. The Bonferroni correction essentially brings you back to a 5% chance of incorrectly rejecting across the whole set of tests you've performed.

In this instance, you're repeatedly testing whether the dice are fair using different datasets, but if the dice really are fair then that test should still (incorrectly) reject about 1 in 20 times. You can calculate the probability of making a Type 1 error in at least one of two tests as
P(Type 1) = 2*(0.95)*(0.05) + (0.05)^2 = 0.0975
Generally the formula is
P(Type 1) = sum_{i=1}^{n} [n!/(i!(n-i)!)] * a^i * (1-a)^(n-i)
where n is the number of tests and a is the alpha level of the tests.
Better yet, calculate it as 1 - the probability that you don't make a Type 1 error (the most frequently used trick in all of probability theory, no doubt) to see that
P(Type 1) = 1-(1-a)^n
Anyway, by the time n = 20 you're at a 64% chance of a Type 1 error (falsely rejecting the hypothesis at least once), and by n = 100 there's a 99% chance you make the error. So if you've tested 20 different dice-rolling datasets, there's a 64% chance you falsely rejected the null at least once. The Bonferroni correction for n tests basically brings that back down to a 5% chance across all your tests by decreasing the nominal alpha level of each individual test to a/n.
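If it helps, here's a quick sketch of those numbers: just the formula above evaluated for a few values of n, plus what the Bonferroni-corrected per-test alpha does to the overall error rate.

```python
# Family-wise Type 1 error rate for n tests at alpha = 0.05, using
# P(Type 1) = 1 - (1 - a)^n, plus the effect of the Bonferroni correction.
alpha = 0.05
for n in (2, 20, 100):
    fwer = 1 - (1 - alpha) ** n               # chance of at least one false rejection
    bonf = alpha / n                          # Bonferroni per-test alpha
    fwer_corrected = 1 - (1 - bonf) ** n      # back to roughly 5% across all n tests
    print(f"n = {n:3d}: FWER = {fwer:.4f}, per-test alpha = {bonf:.5f}, "
          f"corrected FWER = {fwer_corrected:.4f}")
```

For n = 2 that reproduces the 0.0975 above, and the corrected rate comes out a shade under 0.05 because Bonferroni is slightly conservative for independent tests.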
Actually, the original base RNG system *was* bad. If you look at the 1st plot in the "Niara Data" tab, low rolls were clearly being preferentially followed by other low rolls, forming this sinusoidal pattern. No one actually did a statistical test on the level of correlation though. And originally, their Karmic Die system alternated between low and high values, as shown by the 2nd plot.
Yeah, that is quite striking actually. The lag-1 Pearson autocorrelation is 0.16, which is high: just outside the 95% CI suggested by my simulation, [-0.1582, 0.1582]. The Durbin-Watson test p-value is 0.0232, but considering that we picked the bad-looking dataset out of the many that have been tested, I wouldn't get too excited about that. What really does stand out is that the lag-2 autocorrelation is significantly larger at 0.22, which is well outside any confidence interval you might like, even if you've performed 100 tests. Across my 1 million simulations of 200 dice rolls I didn't see a single result even half as extreme as that. I'd say seeing that kind of increasing autocorrelation (with lag) in a sample from a uniform distribution would be a genuinely rare event. I'm still not convinced that the most likely explanation is Larian's RNG being faulty, but there's certainly something going on. I'm not really one for the data-collection side of things, but I might just grit my teeth, collect my own sample from the game over the weekend, and do some tests.
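For what it's worth, the null simulation I mentioned is nothing fancy. A rough sketch of it would look like this (I'm assuming d20 rolls and sequences of 200, and I've cut the replications down from 1 million so it runs quickly):

```python
# Simulate fair d20 sequences, compute lag-1 and lag-2 Pearson autocorrelations,
# and take the central 95% of each as a rough confidence interval under the null.
import numpy as np

def lag_autocorr(x, lag):
    """Pearson correlation between the series and itself shifted by `lag`."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

rng = np.random.default_rng(0)
n_rolls, n_sims = 200, 100_000          # assumed sequence length; reps reduced for speed
ac1 = np.empty(n_sims)
ac2 = np.empty(n_sims)
for i in range(n_sims):
    rolls = rng.integers(1, 21, size=n_rolls)   # fair d20
    ac1[i] = lag_autocorr(rolls, 1)
    ac2[i] = lag_autocorr(rolls, 2)

for lag, ac in ((1, ac1), (2, ac2)):
    lo, hi = np.percentile(ac, [2.5, 97.5])
    print(f"lag {lag}: 95% interval under a fair die ~ [{lo:.4f}, {hi:.4f}]")
```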
The thing is, it's not hard to do a good job of simulating samples from a uniform distribution in this day and age. Any standard pseudorandom number generator implemented in essentially any computer language will do a good job - when new ones come around they get tested to hell and back by people who take this kind of thing extremely seriously. So, to put a Bayesian spin on things, my prior belief is very much that it should be fine. You've piqued my interest though.
Edit: xkcd comic on multiple comparisons:
https://xkcd.com/882/