2021-06-18, 16:49 | #12 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·29·127 Posts |
Use as hardware reliability test
There's a pretty good video on this at https://www.youtube.com/watch?v=n0U7fPKRlVs.
That YouTube video contains what appear to me to be some inaccuracies. Hyperthreading should not usually be used in primality testing; performance is usually better without employing the additional threads in prime95 / mprime. Prime95 discloses CPU type, number of cores, whether hyperthreading is available, instructions supported, cache sizes, etc., under Options, CPU... Prime95 will not test the reliability of your GPU, IGP, PCIe slots, etc. Consider Gpuowl, CUDALucas -memtest, mfakto or mfaktc selftest, or other GPU GIMPS applications for that, as well as actual hardware test software.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-11-10 at 18:14 |
2022-08-13, 20:25 | #13 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·29·127 Posts |
Memory use during PRP
P-1, P+1, or ECM factoring may benefit from considerable allowed RAM. Primality testing via PRP or LL DC has less need of RAM. For exponents up to ~120M, a few hundred MB per worker is enough. Even dual workers at 500+M exponent each do not use 3 GiB of RAM, on systems with several times that or more installed.
From the small charted set of data, it looks like ~2.6 bytes times the sum of the exponents being primality tested at the moment on the system is a usable rough estimate of required RAM, in the absence of any of the memory-hungry factoring algorithms. I don't think the Windows version matters; data were collected from Vista through Windows 11. The prime95 version probably does not matter much either.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2022-08-13 at 20:47 |
2022-09-08, 05:46 | #14 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×29×127 Posts |
Interpreting the error-counts 32-bit word
(draft)
I haven't verified that the output form in results.txt or results.json.txt matches the internal storage form, but it seems likely. From prime95 v30.8b15 source module commonb.c, starting at line 6116: Code:
/* Increment the error counter. The error counter is one 32-bit field containing 5 values. Prior to version 29.3, this was */
/* a one-bit flag if this is a continuation from a save file that did not track error counts, a 7-bit count of errors that were */
/* reproducible, a 8-bit count of ILLEGAL SUMOUTs or zeroed FFT data or corrupt units_bit, a 8-bit count of convolution errors */
/* above 0.4, and a 8-bit count of SUMOUTs not close enough to SUMINPs. */
/* NOTE: The server considers an LL run clean if the error code is XXaaYY00 and XX = YY and aa is ignored. That is, repeatable */
/* round off errors and all ILLEGAL SUMOUTS are ignored. */
/* In version 29.3, a.k.a. Wf in result lines, the 32-bit field changed. See comments in the code below. */

void inc_error_count (
    int type,
    unsigned long *error_count)
{
    unsigned long addin, orin, maxval;

    addin = orin = 0;
    if (type == 0) addin = 1, maxval = 0xF;                   // SUMINP != SUMOUT
    else if (type == 4) addin = 1 << 4, maxval = 0x0F << 4;   // Jacobi error check
    else if (type == 1) addin = 1 << 8, maxval = 0x3F << 8;   // Roundoff > 0.4
    else if (type == 5) orin = 1 << 14;                       // Zeroed FFT data
    else if (type == 6) orin = 1 << 15;                       // Units bit, counter, or other value corrupted
    else if (type == 2) addin = 1 << 16, maxval = 0xF << 16;  // ILLEGAL SUMOUT
    else if (type == 7) addin = 1 << 20, maxval = 0xF << 20;  // High reliability (Gerbicz or dblchk) PRP error
    else if (type == 3) addin = 1 << 24, maxval = 0x3F << 24; // Repeatable error

    if (addin && (*error_count & maxval) != maxval) *error_count += addin;
    *error_count |= orin;
}
Reading the error-count word as eight hex digits, numbered 1 (most significant) through 8 (least significant):
digit 1 & 0xC: most significant two bits, unassigned
digits 1 & 2, masked with 0x3F: repeatable error count field, max value 63 base 10
digit 3: GEC error field, max value 15
digit 4: ILLEGAL SUMOUT field, max value 15
digit 5 & 4: FFT data zeroed error bit field, 0 or 1
digit 5 & 8: corrupted data indicator bit field, 0 or 1
digits 5 & 6, masked with 0x3F: roundoff error > 0.4 field, max value 63 base 10
digit 7: Jacobi symbol error check field, max value 15
digit 8: SUMINP != SUMOUT errors field, max value 15
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2022-09-09 at 01:20 |
2022-09-25, 17:08 | #15 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·29·127 Posts |
P-1 performance
Mprime / prime95 v30.8 introduces a P-1 performance enhancement, using polynomials to achieve almost 100% pairing of primes in stage 2. This allows cost-effective factoring to higher stage 2 bounds, achieving a higher probability of finding a factor and saving more primality tests.
Use v30.8 or later for P-1 factoring, and allow adequate memory to enable the gains; a few GB is better than only running stage 1. The savings (which I define, over many very similar exponents, as the estimated probability of finding a factor and thereby avoiding some full PRP tests, times the estimated cost per primality test avoided, minus the estimated cost of P-1 factor testing as a function of exponent, RAM, and bounds) are approximately logarithmic with allowed memory: 4 GiB is better than nothing, 16 GiB is good, 32 is better, and more is better yet, until the onset of paging/swapping, which can cut performance drastically and turn the expected gain into a large loss.
Prime95's GUI limits allowed stage 2 RAM to 90% of installed physical system RAM. That limit can be overridden by editing local.txt's Memory= line with a text editor, then restarting prime95.
Use adequate memory and bounds the first time P-1 is run on an exponent. "Optimizing" by first running P-1 to low bounds selected to maximize factors found per unit of initial computing effort is actually a DE-optimization for the project. Avoid inadequate-bounds factoring attempts whenever feasible, even if it means asking someone else to do the P-1 run.
Typically on the CPUs I've benchmarked, at the wavefront of DC or first-time testing, fewer cores per worker produces the highest aggregate throughput, but the difference is slight. The response of prime95 v30.8 P-1 to a lot of allowed RAM is larger. So run two workers on a single-CPU-package system; they will use about the same amount of time in stages 1 and 2, and will alternate using large quantities of memory for stage 2, fully employing available memory and maximizing the expected net savings of computing time. See the second attachment.
That attachment clearly shows, by comparing the first-try and retry curves for similar exponents (current first-primality-test wavefront), that the expected time saved is much larger for a first try, even at considerably less allowed RAM, than for a retry with nearly 64 GiB of RAM. It also indicates there is not much difference in expected time saved versus retried exponent for the same allowed RAM, from near the current DC wavefront (66M) to the first-test wavefront (110M).
Multi-socket systems (dual-Xeon, quad-Xeon, etc.) may have nonuniform memory access (NUMA). Specifying a large amount of allowed RAM that causes significant traffic over the NUMA interconnect (QPI, UPI, etc.) may be slower than using a lesser amount of RAM all attached to one processor socket. Performance may be better on a dual-Xeon system running 4 workers, with ~45% of system RAM allowed per worker, so that it can all be on the near side of the NUMA interconnect. There was a noticeable dip in performance on a dual-Xeon system with 2 workers when using more than half the total system RAM in stage 2 in a single worker, which would require some of it to be accessed across the NUMA boundary. See the first attachment.
In prime95 v30.8b14, with prime95 optimizing bounds freely for the given exponent and allowed stage 2 RAM, the selected B1 is observed to increase slightly (~O(ram^0.15)) with allowed RAM; the selected B2 is observed to increase nearly linearly with allowed RAM (~O(ram^0.77) to O(ram^0.98)); and the B2/B1 ratio increases substantially (~O(ram^0.62) to O(ram^0.89)) with allowed RAM. I believe the stage 2 performance increase results in selecting somewhat lower B1, as well as much higher B2, for the same exponent and allowed RAM than would have occurred in v30.7 or earlier. Perhaps counterintuitively, these optimizations for total probable compute time result in longer P-1 stage 1 and stage 2 times with increasing allowed RAM.
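As a concrete sketch of the local.txt override of the 90% GUI cap mentioned earlier: the Memory= value is given in MB (the specific number below is hypothetical; 61440 MB would correspond to 60 GiB, above the ~90% cap the GUI would impose on a 64 GiB system).

```
Memory=61440
```

Restart prime95 after editing local.txt so the new allowance takes effect.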
See also https://www.mersenneforum.org/showpo...&postcount=724 and https://www.mersenneforum.org/showpo...&postcount=727
The preceding is all in the context of mprime/prime95 doing both stage 1 and stage 2. A recent development in gpuowl supports passing the results of P-1 stage 1 performed standalone in gpuowl into an mprime/prime95 folder and worktodo file, for performance of stage 2 by prime95. See https://mersenneforum.org/showpost.p...postcount=2870 and some posts following it for more info, relevant to gpuowl ~v7.2-129 and configuring prime95 v30.8+ to work with it.
A brief comparison of v30.7b9 and v30.8b14 on an i5-1035G1, Windows 10, two 32 GiB DIMMs installed, on the same P-1 assignment, at 60 GiB allowed stage 2 RAM, running a single worker with 4 physical cores, yielded the following Code:
for worktodo line PFactor=(AID),1,2,118970857,-1,77,1.3

version    B1       B2           factor odds  runtime estimate     computed odds/day of P-1
v30.7b9    587,000  25,716,000   3.59%        12 hours 26 minutes  6.93%/day
v30.8b14   840,000  267,531,000  5.49%        10 hours 41 minutes  12.33%/day
ratio      1.431    10.403       1.529        0.859                1.779

Last fiddled with by kriesel on 2023-01-23 at 20:24 Reason: updated second attachment, defined savings |
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
gpuOwL-specific reference material | kriesel | kriesel | 32 | 2022-08-07 17:06 |
clLucas-specific reference material | kriesel | kriesel | 5 | 2021-11-15 15:43 |
Mfakto-specific reference material | kriesel | kriesel | 5 | 2020-07-02 01:30 |
gpu-specific reference material | kriesel | kriesel | 4 | 2019-11-03 18:02 |
CUDAPm1-specific reference material | kriesel | kriesel | 12 | 2019-08-12 15:51 |