 mersenneforum.org Statistics and statistical properties
 Register FAQ Search Today's Posts Mark Forums Read 2019-05-22, 19:50 #1 kriesel   "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 3×23×67 Posts Statistics and statistical properties This is a reference thread regarding statistics of GIMPS results and the expected and observed statistical properties of the various computation results. Please use the reference discussion thread for comments https://www.mersenneforum.org/showthread.php?t=23383 Table of Contents 1. This post 2. Statistical properties of categories of GIMPS results and interim results https://www.mersenneforum.org/showpo...98&postcount=2 3. Statistics of Mersenne number factors found https://www.mersenneforum.org/showpo...99&postcount=3 4. Statistics of Mersenne Prime Exponents https://www.mersenneforum.org/showpo...00&postcount=4 5. Probability estimates for trial factors in same bit level or class https://www.mersenneforum.org/showpo...82&postcount=5 6. Benford's Law https://www.mersenneforum.org/showpo...35&postcount=6 etc tbd See also https://www.mersenneforum.org/showpo...40&postcount=4 regarding error rates of primality tests. Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-09-17 at 17:53  2019-05-22, 19:50 #2 kriesel   "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 10010000011112 Posts Statistical properties of categories of GIMPS results and interim results Statistical properties of categories of GIMPS results and interim results (originally posted at https://www.mersenneforum.org/showpo...32&postcount=1) I've been thinking lately, about what the expected probability distributions are for various categories of outputs, in GIMPS computing runs, how the distributions may be affected by various types of errors, and whether or when it might be feasible to detect the difference. All the following is in regard to p>>64, such as at the current GIMPS wavefronts of primality testing and double checking, and up to p<232. In perl, the table for gpu applications of believed-bad res64 values is currently %badresidues=( 'cllucas', '0x0000000000000002, 0xffffffff80000000', 'cudalucas', '0x0000000000000000, 0x0000000000000002, 0xfffffffffffffffd', 'cudapm1', '0x0000000000000000, 0x0000000000000001, 0xfff7fffbfffdfffe, 0xfff7fffbfffdffff, 0xfff7fffbfffffffe, 0xfff7fffbffffffff, '. '0xfff7fffffffdfffe, 0xfff7fffffffdffff, 0xfff7fffffffffffe, 0xfff7ffffffffffff, 0xfffffffbfffdfffe, 0xfffffffbfffdffff, '. '0xfffffffbfffffffe, 0xfffffffbffffffff, 0xfffffffffffdfffe, 0xfffffffffffdffff, 0xfffffffffffffffe, 0xffffffffffffffff', 'gpuowl', '0x0000000000000000', 'mfaktc', '', 'mfakto', '' ); For cpu-based primality testers, interim res64 values 0x00, 0x02, and 0xfffffffffffffffd are regarded as bad LL values, and 0x00 is regarded as a bad PRP res64 value. However, in https://www.mersenneforum.org/showth...?t=5862&page=2 we see that res64 values 0x000000000000 or 0xffffffffffffffff occur sometimes legitimately, as in the last iteration before the LL test identifies a Mersenne prime. a) LL final residues: 0 if Mp is a prime and the computation went correctly. Certain error conditions generate spurious 0x02 or 0x0 or 0xffffffff80000000 or 0xfffffffffffffffd sequences or final res64. Some of these are application specific. Composite mersenne numbers' final LL full res are I think expected to be uniformly distributed in 0 2019-05-22, 20:00 #3 kriesel   "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 3×23×67 Posts Statistics of Mersenne number factors found A table of numbers of exponents with 0 to 7 factors under 2^65 is available at https://www.mersenneforum.org/showpo...1&postcount=46 Fully factored exponents show higher prevalence of p=3 mod 4 than p=1 mod 4. https://www.mersenneforum.org/showpo...4&postcount=62 And the ratio rises as they are screened for p > increasing thresholds. Semiprime exponents show higher prevalence of p=3 mod 4 than p=1 mod 4. https://www.mersenneforum.org/showpo...5&postcount=69 And the ratio rises to 2:1 as they are screened for p > increasing thresholds. Exponents p = 3 mod 4 have slightly more factors found, than p= 1 mod 4. https://www.mersenneforum.org/showpo...4&postcount=77 Also f=1 mod 8 has fewer factors found than f=7 mod 8, by about 12%. https://www.mersenneforum.org/showpo...6&postcount=80 Combining consideration of f = 1 or 7 mod 8 and 1 or 5 mod 6, f = 1, 7, 17, or 23 mod 24. "for exponents p=3 (mod 4), factors that are f=23 (mod 24) are found about one-third more frequently than the other kinds." https://www.mersenneforum.org/showpo...7&postcount=85 Known factors and other info can be downloaded from links at https://www.mersenneforum.org/showpo...1&postcount=11 for p<109; for above, at https://www.mersenneforum.org/showpo...54&postcount=9 Wagstaff paper: Divisors of Mersnne Numbers, MATHEMATICS OF COMPUTATION VOLUME 40, NUMBER 161 JANUARY 1983. PAGES 385-397 http://www.ams.org/journals/mcom/198...-0679454-X.pdf Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2019-11-19 at 06:35  2019-05-22, 20:03 #4 kriesel   "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 3×23×67 Posts Statistics of Mersenne Prime Exponents Begin with the content of https://primes.utm.edu/notes/faq/NextMersenne.html We expect slightly less than 6 Mersenne primes in a 10:1 interval on exponent. M82589933 is part of an unusual run of (at least) 13 in the 107 to 108 interval. "For comparison: 1 to 10: 4 10 to 100: 6 100 to 1000: 4 1000 to 10000: 8 10000 to 100000: 6 100000 to 1000000: 5 1000000 to 10000000: 5" https://www.mersenneforum.org/showpo...70&postcount=5 (The next decade after the above, 107 to 108 is not completely tested first time yet. 108 to 109 0 so far, with a great deal of further computation to do.) p for Mp is 1 mod 8 at twice the frequency of 3, 5, or 7 mod 8 up to Mp50. https://www.mersenneforum.org/showpo...2&postcount=53 Wagstaff conjecture predicts about 57 Mp below p=10^9. https://www.mersenneforum.org/showpo...7&postcount=55 Of the known Mersenne primes, many more are p=1 mod 4 (31) than p=3 mod 4 (19). (Mp51 had not been announced yet) https://www.mersenneforum.org/showpo...8&postcount=59 For #51, p= 82589933 = 1 mod 4, so 32 vs 19 (1.68:1) Mersenne primes are twice as likely to be p=1 mod 4 as 3 mod 4 conjectured, with some empirical tabular support. Within p=1 (mod 4), Mersenne primes are twice as likely to have p=1 (mod 8) rather than p=5 (mod 8) https://www.mersenneforum.org/showpo...5&postcount=71 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-09-17 at 22:52  2019-07-08, 01:50   #5
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3·23·67 Posts Probability estimates for trial factors in same bit level or class

The odds of finding a factor in a bit level of trial factoring are ln(bitlevel/(bitlevel-1)), or approx. 1/(bitlevel - 1/2) (better than 18 ppm at bitlevel 70, 15 ppm at bitlevel 76, 12 ppm at 86). These estimates are probabilities for completing the entire bit level, not quitting after the first class containing a factor.

What are the odds of finding more than one trial factor in a bit level?
per https://mersenneforum.org/showpost.p...postcount=1147
no factors: (bitlevel-1)/bitlevel
one factor: (bitlevel-1)/bitlevel * (ln(bitlevel/(bitlevel-1)))
Two factors: (bitlevel-1)/bitlevel * (ln(bitlevel/(bitlevel-1)))2 / 2
N factors: (bitlevel-1)/bitlevel * (ln(bitlevel/(bitlevel-1)))n / n!
The probability of finding n factors declines quickly with increasing n.

In the same class of the same bit level?

Two factors/bit-level case
This case appears likely to occur many thousands of times in p<109.
If two factors or more are found in the same bit level, it doesn't matter which class the first one occupies.
Assuming the class location of the next factor found is independent, there's a one in c chance it coincides with the class of the first factor, where c is the ACTIVE number of classes. For MORE_CLASSES, the 2x2x3x5x7x11 = 4620 classes are winnowed down to 960 by whole-class presieving for very small primes, so c=960; for LESS_CLASSES, the 420 classes are winnowed down to 96, so c=96.
LESS_CLASSES: 96/420 = 1/2 x 2/3 x 4/5 x 6/7 = 48 / 210 = ~.2286
MORE_CLASSES: 960/4620= 1/2 x 2/3 x 4/5 x 6/7 x 10/11 = 480 / 2310 =~ .2078 (9.1% smaller)
hypothetical EXTRA_CLASSES: 11520 / 60060 = 1/2 x 2/3 x 4/5 x 6/7 x 10/11 x 12/13 = 5760 / 30030 = ~.1918 (7.7% smaller)

The chance of two factors both in the same class is then (bitlevel-1)/bitlevel * (ln(bitlevel/(bitlevel-1)))2 / 2 / c.
For 75 bits, 74/75 * (ln(75/74))2/2/c = 8.89E-5/c; c=96 for less-classes so 9.26E-7, or c=960 for more-classes so 9.26E-8 (using Fortran style exponential notation).

If two factors are found in the same bit level, the chance of them landing in the same class is 1/c, and the chance of different classes 1-1/c. (1/c +(1-1/c) = 1, sum of probabilities = 1, check.)

Three factors/bit-level case
This case appears likely to occur hundreds of times in p<109.
If three factors are found in the same bit level, the chance of none of them coinciding in the same class is p0= (c-1)/c * (c-2)/c = (c2-3c+2)/c2 = 1- (3c-2)/c2
The chance of any two of them landing in the same class is the second one matches the first but the third does not, or the first two do not match but the third matches one of them,
1/c (c-1)/c +(c-1)/c 2/c = (c-1)/c2 + (2c-2)/c2 = 3(c-1)/c2
while the chance of all 3 landing in the same class is 1/c2, for a combined chance of (3c-2)/c2.
(p0 +p2 +p3 = 1- (3c-2)/c2 + 3(c-1)/c2 + 1/c2 = 1, check.)

For less-classes the chance of 2 or more landing in the same class is ( 3 x 96 - 2 ) / 962 =~ .031033
while for more-classes it's (3 x 960 - 2 ) / 9602 =~ .0031228 chance of coincident classes.
The probability of 3 factors is (bitlevel-1)/bitlevel * (ln(bitlevel/(bitlevel-1)))3 / 3!
For bitlevel 75, p=74/75 * (ln(75/74))^3 / 3! = 3.978E-7, so per factoring attempt,
less-classes yields a chance of coincident factors by 3 found of 1.233E-8 while more-classes yields 1.241E-9.

The combinations of two or three factors per bit level coinciding in a class are then, for 75 bits,
Less-classes: 9.26E-7 + 1.233E-8 = 9.38E-7
More-classes: 9.26E-8 + 1.241E-9 = 9.38E-8
The odds of four or more are not zero, but they fall to the right of the rounding of the above figures.

Note, for three or more factors per bit level, these are not binomial distributions (since probabilities of matching are not constant with repeated trials)

Four factors/bit-level case
This case appears likely to occur few times in p<109.
If four factors are found in the same bit level, the chance of each falling in a separate class is the chance of 3 successive misses. The chance of any two of them landing in the same class is, the second matches the first but the third and fourth do not, otherwise the third matches the first or second but the others do not, or otherwise the fourth matches one of the preceding 3 but the first three did not match each other.
The chance of 3 coinciding is 4-fold; 123, 124, 134, or 234; 1/c2 * (c-1)/c + (1/c)(c-1)/c * 1/c + (c-1)/c(1/c)2 + (c-1)/c(1/c)2= 4(c-1)/c3
The chance of all four coinciding is 1/c3.
no coincidences: p0= (c-1)/c (c-2)/c (c-3)/c = (c-1)(c-2)(c-3)/c3 = 1 - (6 c2 -11c +6)/c3)
coincidence of just 2: p2= 1/c (c-1)(c-2)/c2 + (c-1)(c-2)/c2 * 2 / c + 3/c *(c-1)(c-2)/c2 = 6(c-1)(c-2)/c3
coincidence of 3 in a class: p3= 4(c-1)/c3
coincidence of 4 in the same class: p4= 1/c3
The probability sums for the four-factors cases listed above don't add up to probability one, so something's not quite right yet. P0+p2+p3+p4 = 1 -3(c-1)/c3 instead of one.
There's also a slight possibility of a double coincidence of two, which could occur one of three ways.
A) First and second match, third misses first/second but third and fourth match. p22a=1/c * (c-1)/c * 1/c for this subcase.
B) First and third match, second and fourth match, first and second do not match. p22b=p22a.
C) First and fourth match, second and third match, first and second do not match. p22c=p22a.
Total, p22=3(c-1)/c3

In total, chances of 2 or more coinciding: 6(c-1)(c-2)/c3 + 4(c-1)/c3 + 1/c3 +3(c-1)/c3 = ( 6(c-1)(c-2) + 4(c-1) + 1 +3(c-1)) / c3
= ( 6c2 -18c +12 +7c -7 +1 ) / c3 = (6 c2 -11c +6) / c3

none: c=96, p0=0.938687; c=960, p= 0.9937619
2 coincide: c=96, p2=0.060560; c=960, p2= 0.0062305
3: c=96, p3=4.295e-4; c=960, p3= 4.34E-6
4: c=96, p4=1.13e-6; c=960, p4= 1.13E-9
dual-2: c=96, p22=0.000322; c=960, p22=3.25e-6

2 to 4: c=96, pn=0.061313; c=960, pn= 0.0062381
c=96, p0+pn= 1, check;
c=960, p0+pn= 1, check.

More than four factors/bit-level cases
These cases are very unlikely to occur in p<109 and so are not calculated here.
They also rapidly become more complex, as illustrated by the numbers of coincidence subcases possible versus number of factors found rises.
5: 2; 3; 4; 5; 22; 23
6: 2; 3; 4; 5; 6; 22; 23; 24; 33; 222
7: 2; 3; 4; 5; 6; 7; 22; 23; 24; 25; 33; 34; 222; 223
8: 2; 3; 4; 5; 6; 7; 8; 22; 23; 24; 25; 26; 33; 34; 35; 44; 222; 223; 224; 233; 2222
9: 2; 3; 4; 5; 6; 7; 8; 9; 22; 23; 24; 25; 26; 27; 33; 34; 35; 36; 44; 45; 222; 223; 224; 225; 233; 234; 333; 2222; 2223
10: 2; 3; 4; 5; 6; 7; 8; 9; 10; 22; 23; 24; 25; 26; 27; 28; 33; 34; 35; 36; 37; 44; 45; 46; 55; 222; 223; 224; 225; 226; 233; 234; 235; 244; 333; 334; 2222; 2223; 2224; 2233; 22222

Summary
Odds are estimated for a variety of bit levels and number of factors per level or class in the attachment. Odds are sufficient that in p<109, up to 2 factors per class seems likely to occur sometime (dozens to thousands of occurrences depending on factoring depth and frequency of use of less-classes), but 5 or more factors per class occurring sometime is unlikely; 3 or more is unlikely from 65 bits upward.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files tf odds.pdf (14.1 KB, 53 views) tf odds3.pdf (16.9 KB, 26 views)

Last fiddled with by kriesel on 2020-10-18 at 13:29  2020-09-17, 17:51   #6
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3×23×67 Posts Benford's Law

Benford's Law https://en.wikipedia.org/wiki/Benford's_law comes from observations that many cases of observed or measured sets of numbers have a nonuniform distribution of leading digits. That is, 1 is the most common leading digit, at 30.1% rather than 11.1%,and 9 the least, at 4.6% rather than 11.1%, for base ten numbers. It also applies to second and third digits, and to expression in other number bases. Numbers following Benford's law have digit probabilities based on the logarithm of the numbers. It requires that the numbers are distributed over several orders of magnitude, which at 2 to 82589933, Mersenne prime exponents certainly are, as well as the Mersenne primes' number of decimal digits or digits in other number bases. Log10(82589933/2)=7.616; log10(24862048/1) =7.396.

Benford's law is one of the tests used to detect fabricated numbers, such as in tax or accounting or research fraud. The numbers that people think are random when faking values generally are not random, generally avoiding round numbers, repeated digits, ascending or descending series, palindromes, etc. and may even reveal individuals' distinct tendencies. (I have no reason to suspect fakery in the Mersenne primes that have been repeatedly checked, so expected them to exhibit usual probability behavior including probable deviations from expected statistical values, with significant deviation expected due to low number of samples available.)

Considering the 51 known Mersenne primes, in base ten, their exponents follow Benford's law quite closely, with occurrence within 10% of expected, usually within +-1.
But not so for the number of digits in the decimal expression of the Mersenne primes.
https://www.mersenne.org/primes/
In decimal, the leading digit 4 is noticeably underrepresented, 5 is absent, and 6 is substantially overrepresented. The absence of a leading 5 seems to line up with some of the larger ratio gaps between successive exponents.
Based on the Lenstra & Pomerance conjecture we expect on average a ratio of ~1.47576 on exponent. (https://primes.utm.edu/notes/faq/NextMersenne.html)
Mersenne primes with exponents 127 and 521 (exponent ratio 4.102) have 39 and 157 decimal digits respectively; 1279 and 2203 (exponent ratio 1.722) have 386 and 664; 11213 and 19937 (exponent ratio 1.778) have 3376 and 6002 digits;
132049 and 216091 (exponent ratio 1.636) have 39751 and 65050;
1398269 and 2976221 (exponent ratio 2.129) have 420921 and 895932;
13466917 and 20996011 (exponent ratio 1.559) have 4053946 and 6320430.
ALL the ranges where we could look for a known Mersenne prime with a leading 5 in the number of decimal digits have greater than expected average gaps present, and no Mersenne prime, except for the first, 13 and 17, exponent ratio 1.308, having 4 and 6 digits respectively. Average observed ratio for gaps excluding 5 as a leading digit of the number of digits is 2.033, about 38% larger than the expected average ratio.

So let's convert that list of decimal number of decimal digits into hexadecimal. Ahd octal.
Following are for each known Mersenne prime, the sequence number, exponent p, decimal number of decimal digits, of the Mersenne prime, and hexadecimal and octal representations of number of decimal digits;
Code:
 #         p   digits (hex)     (octal)
1         2        1 0x1       o1
2         3        1 0x1       o1
3         5        2 0x2       o2
4         7        3 0x3       o3
5        13        4 0x4       o4
6        17        6 0x6       o6
7        19        6 0x6       o6
8        31       10 0xA       o12
9        61       19 0x13      o23
10        89       27 0x1B      o33
11       107       33 0x21      o41
12       127       39 0x27      o47
13       521      157 0x9D      o235
14       607      183 0xB7      o267
15      1279      386 0x182     o602
16      2203      664 0x298     o1230
17      2281      687 0x2AF     o1257
18      3217      969 0x3C9     o1711
19      4253     1281 0x501     o2401
20      4423     1332 0x534     o2464
21      9689     2917 0xB65     o5545
22      9941     2993 0xBB1     o5661
23     11213     3376 0xD30     o6460
24     19937     6002 0x1772    o13562
25     21701     6533 0x1985    o14605
26     23209     6987 0x1B4B    o15513
27     44497    13395 0x3453    o32123
28     86243    25962 0x656A    o62552
29    110503    33265 0x81F1    o100761
30    132049    39751 0x9B47    o115507
31    216091    65050 0xFE1A    o177032
32    756839   227832 0x379F8   o674770
33    859433   258716 0x3F29C   o771234
34   1257787   378632 0x5C708   o1343410
35   1398269   420921 0x66C39   o1466071
36   2976221   895932 0xDABBC   o3325674
37   3021377   909526 0xDE0D6   o3360326
38   6972593  2098960 0x200710  o10003420
39  13466917  4053946 0x3DDBBA  o17355672
40  20996011  6320430 0x60712E  o30070456
41  24036583  7235733 0x6E6895  o33464225
42  25964951  7816230 0x774426  o35642046
43  30402457  9152052 0x8BA634  o42723064
44  32582657  9808358 0x95A9E6  o45324746
45  37156667 11185272 0xAAAC78  o52526170
46  42643801 12837064 0xC3E0C8  o60760310
47  43112609 12978189 0xC6080D  o61404015
48* 57885161 17425170 0x109E312 o102361422
49* 74207281 22338618 0x154DC3A o125156072
50* 77232917 23249425 0x162C211 o130541021
51* 82589933 24862048 0x17B5D60 o136656540
Again, leading digit 6 is over-represented in the hexadecimal, at more than double the expected count. Despite ones being favored by the available range of data. Excluding +-1 variations from expected frequency, 4 7 and E are underrepresented; E is absent. Similarly, in octal, leading digit 6 is more than twice as frequent as expected.
The substantial over-representation of leading digit 6 in bases 8, 10 and 16 seems odd. If there's some reason for that, other than statistics in low population sizes, please comment in a discussion thread.
Maybe it is surprising, maybe it isn't. Consider that if there's one digit that is the most over-represented, in one base, by pure randomness, its chance of being the most over-represented in a second base also seems to be 1/base. That's only mildly low for 10 or 16, not very low.
The usual test criterion, for rejecting that the difference between observed and expected, is less than 5% probability from chance, is not an issue in any of the Pearson chi-squared cases tabulated, although one is borderline. The 3 measures of x2, m and d gave widely different results for the same case.

Note to self: pursue further; perhaps try to calculate probability distribution for the various numbers of the different digits. It's not as "simple" as a single binomial distribution. Purpose is to put some numbers to how likely or unlikely the extent of digit-6 over-representation is in either base 10 or 16. Or some study of number theory and statistics.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files benfords law.pdf (33.0 KB, 11 views)

Last fiddled with by kriesel on 2020-09-25 at 21:48 Reason: fixed DOF effect on calculation of Pearson's chi-squared tests, added octal Thread Tools Show Printable Version Email this Page Similar Threads Thread Thread Starter Forum Replies Last Post kriesel Probability & Probabilistic Number Theory 1 2019-05-22 22:59 henryzz Homework Help 5 2013-03-25 11:26 diep Math 27 2010-01-13 20:18 mack Information & Answers 2 2009-09-07 05:48 9021951 Math 29 2005-03-15 16:26

All times are UTC. The time now is 15:38.

Wed Oct 28 15:38:59 UTC 2020 up 48 days, 12:49, 3 users, load averages: 1.77, 1.91, 1.90