
2021-08-15, 23:09   #12
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

19·311 Posts
Tuning Mlucas V20

My first try of Mlucas V20.0 was in Ubuntu 18.04 LTS installed in WSL1 on Windows 10 Home x64 on an i7-8750H laptop.

These were run with mfaktc also running on the laptop's discrete GPU, nothing on the IGP, a web browser with active Google Colab sessions, and TightVNC remote desktop used for all access. Prime95 was stopped and exited before the test.

Experimenting a bit with Ernst's posted readme guidance, I obtained the timings shown in the attachment. Since some of these cases were run with only 100 iterations, and I may have affected them somewhat with interactive use, I may rerun some of them, perhaps after the next update release. Timings in a bit of production running seem to have gradually improved; possibly that relates to ambient temperature.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 i7-8750H timings.pdf (40.7 KB, 50 views)

Last fiddled with by kriesel on 2021-08-31 at 23:57

2021-09-06, 21:02   #13
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13425₈ Posts
Mlucas V20.1 timings on various hardware and environments, & prime95 compared

Preface: None of the following should be mistaken for criticism of anyone's efforts or results. Writing such software is hard. Making it produce correct outputs is harder. Making it fast and functional on a variety of inputs, hardware, environments, etc. is harder still. Few even dare to try.

Note: prime95 prevents running multiple instances in the same folder. Mlucas does not prevent simultaneously running multiple instances on the same exponent in the same folder. Don't do that; it creates an awful mess.

Case 1: PRP DC 84.7M on i7-8750H (6 cores, 12 hyperthreads)
Mlucas v20.1 on Ubuntu 18.04 LTS atop WSL1 on Win10: nominal 4-thread 18 iters/sec; nominal 8-thread 29 iters/sec, so 47 iters/sec throughput for the system as operated, and potentially up to 54 iters/sec combined for 3 processes of 4 threads each.
V29.5b6 prime95 benchmark on Windows 10 Home x64, same system: benchmarked all FFT lengths 2M-32M. For 84.7M the FFT length is 4480K; 88.7 to 93.4 iters/sec throughput. Best throughput is all 6 cores on one worker, which also gives minimum latency.
Mlucas v20.1/WSL1 performance observed is ~50 to 61% that of prime95/Win on this system. Note that prime95 has improved speed in some respects since the version benchmarked. Access via TightVNC and GPU app overhead were present and should have been about constant.

Case 2: Dual E5-2697v2 (12 cores & x2 HT each) for wavefront PRP
V29.8b6 prime95 on Win10, benchmarked at 5760K FFT length; best was 2 workers, 238 iters/sec throughput.
Mlucas v20.1/WSL 8-thread, 5632K FFT length, 15.09 ms/iter -> 66.3 iters/sec. Optimistically extrapolating to triple throughput for 24 cores, 198.8 iters/sec. Other benchmarking showed a disadvantage to using all hyperthreads, versus 1 thread per physical core.
Mlucas V20.0/WSL 4-thread:
Code:
5632  msec/iter =   29.25  ROE[avg,max] = [0.231956749, 0.312500000]  radices = 176 16 32 32  0  0  0  0  0  0
This corresponds to 1000/29.25 = 34.2 iters/sec. Optimistically extrapolating to 6x throughput for 24 cores, 205.1 iters/sec throughput.
Mlucas/WSL performance is ~83.5-86.2% of prime95 under favorable assumptions. Note that's in comparison to V29.8b6, not the current v30.6b4 prime95.
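The iters/sec and percentage figures in these cases are simple unit conversions from the per-iteration timings; here is a minimal standalone C sketch of that arithmetic, using the Case 2 values above (illustration only, not Mlucas code):
Code:
#include <stdio.h>

/* Convert a ms/iter timing to iters/sec; sum workers for aggregate throughput. */
static double iters_per_sec(double ms_per_iter) { return 1000.0 / ms_per_iter; }

int main(void) {
	double mlucas_8t  = iters_per_sec(15.09);  /* ~66.3 iters/sec */
	double mlucas_24c = 3.0 * mlucas_8t;       /* optimistic 3-process extrapolation, ~198.8 */
	double p95_best   = 238.0;                 /* best prime95 2-worker throughput */
	printf("Mlucas 8-thread: %.1f iters/sec\n", mlucas_8t);
	printf("24-core extrapolation: %.1f iters/sec (%.1f%% of prime95)\n",
	       mlucas_24c, 100.0 * mlucas_24c / p95_best);
	return 0;
}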
Case 3: i7-4770 (4 cores & x2 HT) for wavefront PRP; 5-way test on a dual-boot Win10/Ubuntu system

Mlucas V20.1 / Ubuntu 20.04 / WSL2 / Win10 sandwich on the Windows boot (primary) partition:
Code:
5632  msec/iter =   23.68  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0
1000 ms/sec / (23.68 ms/iter) = 42.23 iters/sec. Probably suffered somewhat from RDP and GPU apps running on Windows simultaneously.

Mlucas V20.1 / Ubuntu 20.04 LTS booted from a second partition on the same system drive; 8 threads, which had shown an advantage over 4 threads in WSL:
Code:
5632  msec/iter =   15.98  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0
1000 ms/sec / (15.98 ms/iter) = 62.58 iters/sec. Much improved throughput over the WSL2 scenario above.

prime95 v30.6b4 with the usual RDP, GPU apps running, etc., so some overhead load:
Code:
FFTlen=5600K, Type=3, Arch=4, Pass1=448, Pass2=12800, clm=2 (3 cores, 1 worker): 13.37 ms. Throughput: 74.82 iter/sec.

prime95 v30.6b4, Windows 10 Pro x64, no RDP or GPU apps running:
Code:
Timings for 5760K FFT length (4 cores, 1 worker): 11.31 ms. Throughput: 88.43 iter/sec.
Timings for 5760K FFT length (4 cores, 2 workers): 22.88, 22.71 ms. Throughput: 87.73 iter/sec.
Timings for 5760K FFT length (4 cores, 4 workers): 45.76, 44.74, 44.90, 44.29 ms. Throughput: 89.06 iter/sec.
Average of the 3 worker counts: 88.41 iters/sec.

mprime v30.6b4, Ubuntu 20.04, logged in at the console, no GPU apps, no remote access, minimal overhead:
Code:
Timings for 5760K FFT length (4 cores, 1 worker): 11.24 ms. Throughput: 88.97 iter/sec.
Timings for 5760K FFT length (4 cores, 2 workers): 22.43, 22.48 ms. Throughput: 89.06 iter/sec.
Timings for 5760K FFT length (4 cores, 4 workers): 44.65, 44.65, 44.60, 44.76 ms. Throughput: 89.55 iter/sec.
Average of the 3 worker counts: 89.19 iters/sec (1.0088x the Windows low-overhead average; mprime/Linux max throughput is 1.0055x prime95/Windows max throughput).

mprime and prime95 timings are very close for equalized system overhead on the same hardware. While there's essentially no Linux-vs-Windows speed advantage for prime95/mprime, there may be one for Mlucas, because of the core-virtualization issue on WSL, which is currently required to run Mlucas on Windows. That should have less effect when the cores are fully loaded with enough Mlucas threads to occupy them all.

Mlucas/WSL performance: 42.23/74.82 ≈ 56.4% of prime95/Win10; both sessions may have been negatively impacted by remote-desktop overhead. Mlucas/Ubuntu performance: 62.58/88.43 ≈ 70.8% of prime95/Win10 single-worker; both the Windows and Ubuntu timings were taken without remote-desktop overhead or GPU apps. Benchmarking experimental error is unknown; digitization error is up to ~0.09%.

Mlucas can currently perform LL, PRP, and P-1 computations on higher exponents than any other GIMPS software known. Benchmark first and estimate run times.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-09-06 at 21:31
2021-09-17, 18:53   #14
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

19·311 Posts
Mlucas releases

This is an incomplete draft list.

2017-06-15  V17.0  https://mersenneforum.org/showthread.php?t=22391
2017-07-02? V17.1  https://mersenneforum.org/showthread.php?t=2977
2019-02-20  V18    https://mersenneforum.org/showthread.php?t=24100
2019-12-01  V19    https://mersenneforum.org/showthread.php?t=24990
2021-02-11  V19.1  ARMv8-SIMD / Clang/LLVM compiler compatibility  https://mersenneforum.org/showthread.php?t=26483
2021-07-31  V20.0  P-1 support; automake script makemake.sh  https://mersenneforum.org/showthread.php?t=27031
2021-08-31  V20.1  faster P-1 stage 2, some bug fixes, print refinements, new help provisions, corrected reference residues, raised maximum Mp limits  https://mersenneforum.org/showthread.php?t=27114
tbd         V20.2? minor cleanup such as labeling factor bits as bits, additional bug fixes; possibly resync mfactor variable types with the shared routines' typing from Mlucas
tbd         V21?   PRP proof generation planned

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-09-19 at 13:41
2021-09-17, 23:14   #15
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1011100010101₂ Posts
Wish list

Features I'd like to see added in Mlucas. As always, it's the developer's call what actually happens; his time, his talent, his program. These are in when-I-thought-to-write-them order.

- PRP proof file generation by VDF, preferably in V2 format such as prime95 and gpuowl produce. I think this is generally agreed to be the highest-priority feature addition.
- ETAs in the .stat file output
- Jacobi symbol check for LL / LLDC running
- Solution for the WSL-related core-hopping seen on Xeon Phi and elsewhere
- Solution for building native Windows executables
- Solution for building multithreaded native Windows executables
- Ability to accept a list of interim LL 64-bit residues from a parallel run, for comparison at widely spaced iteration counts such as every 5M or 10M from previous runs; useful in DC / TC / new-discovery verification
- Date/time stamps on the first record of console or nohup.out output, or upon restart in the .stat file, and on most other output
- Allowing only one process to run per folder at a time
- Multiple-worker integration into a single process
- Segmenting a P-1 stage 2 run onto multiple processes or machines for running in parallel; this will be necessary for F33 P-1 stage 2, and may be useful for OBD P-1 also
- Total run time per P-1 stage, and the total of both stages and GCDs, output by the program for more convenient benchmarking and run-time-scaling measurement
- Change worktodo.ini to worktodo.txt
- On Ernst's wish list, judging by reading some source code, is a GUI someday.
- Integral PrimeNet API use (what else?)

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-10-14 at 17:40
2021-09-17, 23:18   #16
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

19·311 Posts
Bug list

This is a partial list, organized mostly by the version in which each bug was first seen. Testing has only involved Mersenne-number-related capabilities; no attempt was made to test the Fermat-number capabilities.

V17.0

Gave msec/iter times, but labeled them sec/iter. Resolved in a later version.

V18.0
?

V19.0
?

V19.1
?

V20.0
Several are described at https://mersenneforum.org/showpost.p...47&postcount=1.
Upgrade to V20.1 for faster P-1 stage 2 and multiple bug fixes.
At least one bug slipped past brief testing and so is also present in V20.1; see the P-1 stage 2 restart issue etc. below.

V20.1
1. Mislabels an n-bit P-1 factor found as having n decimal digits.
Code:
Found 70-digit factor in Stage 2: 646560662529991467527
The following examples are from V20.0:
Code:
Found 95-digit factor in Stage 1: 33287662948300610984694812407
Found 84-digit factor in Stage 2: 15299475858498328182948679
Log10(33,287,662,948,300,610,984,694,812,407) = 28.52..., so the factor has 29 decimal digits; Log10(33,287,662,948,300,610,984,694,812,407)/Log10(2) = 94.74..., so it has 95 bits.
Appears to be corrected in a subsequent update under development.
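A quick standalone way to check such labels (my own C sketch, not Mlucas code): the decimal digit count of a factor f is floor(log10(f)) + 1 and the bit count is floor(log2(f)) + 1.
Code:
#include <math.h>
#include <stdio.h>

int main(void) {
	/* The stage 2 factor reported above as "70-digit" */
	double f = 646560662529991467527.0;  /* rounding to double is fine for size checks */
	printf("%d digits, %d bits\n",
	       (int)floor(log10(f)) + 1,   /* 21 decimal digits */
	       (int)floor(log2(f))  + 1);  /* 70 bits */
	return 0;
}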
2. When there is a restart in P-1 stage 2 (an intended Mlucas stop/restart, or Windows Update or a power failure pulling the rug out from under Linux/WSL and Mlucas), the subsequent result record for the P-1 stopped/restarted in stage 2 has 1970-01-01 midnight as its time stamp, instead of the actual completion time. <exponent>.stat file entries are OK. The P-1 stage 2 restart code path bypasses the usual initialization of calendar_time, which later affects the result output timestamp.
Appears to be corrected in a subsequent update under development.
3. More recently, also on Ubuntu/WSL/Win10, I've observed peculiar result-line date values such as "4442758-11-21 10:39:25 UTC" on ~2021-10-07, after recovering from a large-memory-related stage 2 Mlucas crash on 10M and 106M exponent runs.
4. Factors found at a GCD early in stage 2 are reported as if they were found in stage 1, with only the stage 1 bound given. Computing the effective stage 2 bound in such a case is neither easy nor clear.
5. A factor found after a full but interrupted stage 2 was likewise reported with stage 1 bounds only.
6. -maxalloc with a % that equates to more than ~32 GiB of attempted usage results in a segmentation fault at the beginning of P-1 stage 2. Observed on a 128 GiB ram AVX system with a Win10/WSL1/Ubuntu 18.04.2 LTS combo; Ernst has been able to reproduce the issue on a KNL/Ubuntu system. Some variables that were typed uint32 will need to be uint64. Until it is resolved, a workaround is to use less of the available ram, at some loss of speed. Appears to be corrected in a subsequent update under development.
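The ~32 GiB threshold is consistent with a 32-bit count of 8-byte doubles wrapping around (my inference, not confirmed against the source): 2^32 doubles x 8 bytes = 32 GiB. A standalone C illustration:
Code:
#include <stdint.h>
#include <stdio.h>

int main(void) {
	uint64_t want_bytes = 40ULL << 30;                 /* request 40 GiB */
	uint32_t ndoubles32 = (uint32_t)(want_bytes / 8);  /* wraps past 2^32 doubles */
	printf("requested %llu bytes; 32-bit double count = %u (should be %llu)\n",
	       (unsigned long long)want_bytes, ndoubles32,
	       (unsigned long long)(want_bytes / 8));
	return 0;
}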
7. In P-1 at least, some values that ought to be recalculated for each worktodo item appear to be reused unchanged instead. The number of buffers in stage 2 is the first example seen: not recalculated from 106M to 334M. Another is FFT length, which did not get updated from a 1M P-1 task to a 3M P-1 task immediately following. B2start is another variable that gets carried over. Possible workarounds include sorting and segregating assignments to similar exponents, or using the command line and scripting to give disparate exponents separate program sessions.
8. For P-1 and probably other work types, on small exponents for which a stage of computation may complete faster than the checkpoint save interval or .stat file update interval, no stage timing is saved to the .stat file or displayed on stdout/stderr. This means run-time scaling measurements on small exponents cannot be made, except with a stopwatch or the batch file / shell script equivalent.
9. In self-test on enormous FFT lengths (256M - 512M) on 2 models of AVX512 CPUs on Ubuntu/WSL/Win10, one of the 512M radix sets reproducibly produces a segfault crash, preventing output of the line needed to finish the self-test. On Ernst's attempt to reproduce on bare Ubuntu on AVX512, with perhaps a later version of the source code, an excessive roundoff error is flagged instead. A workaround is to hand-edit the mlucas.cfg file based on console output from the other radix sets completed before the crash. Since the enormous-class FFT lengths begin at 256M, there's little or no need for them currently in the Mersenne number realm; FFT length 192M is expected to be sufficient for P-1 factoring attempts on OBD candidates.
10. For the worktodo entry:
PMinus1=00000000000000000000000000000000,1,2,3000077,-1,8000,1200000
...
Product of Stage 1 prime powers with b1 = 8000 is 11649 bits (183 limbs), vs estimated 12035. Setting PRP_BASE = 3.
ERROR: at line 1165 of file ../src/mi64.c
Assertion failed: mi64_shl: zero-length array or shift count >= 64!
Code needs modification for the case where it incorrectly generates a shift count of 64.
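For context, shifting a 64-bit operand by 64 or more bits is undefined behavior in C, which is presumably why mi64_shl asserts rather than proceeding. A hedged sketch of the kind of boundary guard needed (my hypothetical helper, not the actual mi64.c fix):
Code:
#include <stdint.h>

/* Shift a 64-bit word left by 0..64 bits safely; x << 64 is undefined in C. */
static uint64_t shl64_safe(uint64_t x, uint32_t count) {
	if (count == 0)  return x;
	if (count >= 64) return 0;  /* all bits shifted out */
	return x << count;
}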
11. A lingering bug related to relocation-prime handling in P-1 stage 2 restart.
12. See also the Mlucas readme.html for a more comprehensive list (currently 14 bullet points, some of which appear to be compound).
v20.1.1 (2021-11-01 or 2021-11-06)
1. Observed on a first attempt on WSL: if mlucas.ini does not exist (so no entries exist), running -s m -iters 100 (and probably other command-line variations) generates incorrect error messages about possible mlucas.ini entries: "User set unsupported value LowMem = nan in mlucas.ini ... ignoring.
User set non-whole-number CheckInterval = nan in mlucas.ini ... ignoring." Apparently the program does not initialize default values before looking for mlucas.ini and interpreting its contents if found. A likely simple workaround is to create mlucas.ini with applicable contents. This appears to be addressed in the 2021-11-23 patch.
2. Attempts to run the worktodo.ini entry PRP=00000000000000000000000000000000,1,2,3321928171,-1,91,0
yielded the following, thought to be due to some variables being cast to (int). Consequently, Mersenne number PRP and LL testing would be capped at 2^31-1 until further revision:
Code:
INFO: Maximum recommended exponent for FFT length (196608 Kdbl) = 3409766353; p[ = 3321928171]/pmax_rec = 0.9742392373.
Initial DWT-multipliers chain length = [medium] in carry step.
INFO: no restart file found...starting run from scratch.
ERROR: at line 1695 of file ../src/Mlucas.c
Assertion failed: Require (int)maxiter > 0
Run times impose practical limits of their own. End users may attempt raising the limit to 2^32-1 by removing the relevant (int) casts in their copy of the source code before performing a build. For example,
Code:
ASSERT(HERE, (int)maxiter > 0,"Require (int)maxiter > 0");
becomes
Code:
ASSERT(HERE, maxiter > 0,"Require maxiter > 0");
(It would be useful for debugging if the assert output the current maxiter value there.) Or perhaps the compiler-acceptable equivalent of
Code:
char msg[48];
snprintf(msg, sizeof(msg), "Require maxiter = %u > 0", (unsigned)maxiter);
ASSERT(HERE, maxiter > 0, msg);
The 2021-11-23 patch has as line 1697,
Code:
ASSERT(HERE, maxiter > 0,"Require (uint32)maxiter > 0");
which provides a bit more exponent range. A more complete solution, revising the source code to use 64-bit ints for more variables so as to support the full nominal capability of the largest implemented FFT lengths, is planned at some point. Until that is done, it will limit testing the code in several ways. https://www.mersenne.ca/exponent/8937021983 is a prime exponent above 2^32 and above the nominal Mlucas v20.x limit of PRP exponents supported with the 512M FFT. Attempting to run PRP on it produces not a message about the exponent being too large, but
Code:
 worktodo.ini  entry: PRP=00000000000000000000000000000000,1,2,8937021983,-1,91,0

check_kbnc: Mersenne exponent must be prime!
ERROR: at line 590 of file ../src/Mlucas.c
Assertion failed: [k,b,n,c] portion of in_line fails to parse  correctly!
8937021983 is prime, but 8937021983 mod 2^32 = 347,087,391 = 3 × 7 × 16,527,971 is composite. Similarly, for an exponent below the AsympConst = 0.6 estimated limit of the 512M FFT length: 8883334793 mod 2^32 = 293,400,201 = 3 × 97,800,067.
Code:
 worktodo.ini entry: Pminus1=00000000000000000000000000000000,1,2,8883334793,-1,10000,10000

check_kbnc: Mersenne exponent must be prime!
ERROR: at line 716 of file ../src/Mlucas.c
Assertion failed: [k,b,n,c] portion of in_line fails to parse correctly!
What's occurring would be clearer if the error message indicated the exponent as perceived by the code at that point, e.g.
Quote:
 check_kbnc: Mersenne exponent 293,400,201 must be prime!
Just above 2^32: 4294967311 mod 2^32 = 15 = 3 × 5.
Code:
 worktodo.ini entry: Pminus1=00000000000000000000000000000000,1,2,4294967311,-1,10000,10000

check_kbnc: Mersenne exponent must be prime!
ERROR: at line 716 of file ../src/Mlucas.c
Assertion failed: [k,b,n,c] portion of in_line fails to parse correctly!
Meanwhile, just below 2^32, P-1 on 4294967291 appears at least initially to run. Note that some prime exponents > 2^32 may correspond to another, smaller prime mod 2^32 and may then fail a different assert regarding minimum magnitude. For example, 4294967357 mod 2^32 = 61, while the minimum supported exponent is 4096.
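The mod-2^32 behavior above is exactly what a narrowing cast does; a standalone C illustration (not the actual Mlucas parsing code):
Code:
#include <stdint.h>
#include <stdio.h>

int main(void) {
	/* Exponents from the examples above, truncated as a (uint32) cast would do */
	uint64_t p[] = { 8937021983ULL, 8883334793ULL, 4294967311ULL, 4294967357ULL };
	for (int i = 0; i < 4; i++)
		printf("%llu mod 2^32 = %u\n", (unsigned long long)p[i], (uint32_t)p[i]);
	return 0;
}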
3. FFT length can carry over from a self-test to a worktodo item, possibly yielding impossibly high bits/word.
Code:
ERROR: at line 1285 of file ../src/Mlucas.c
Assertion failed: ERROR: specified FFT length 44 K is much too small: Recommended length for this p = 196608 K.
Reportedly fixed in the 2021-11-23 patch.
4. P-1 stage 2 may indicate >100% completion along the way. The known factor is still found, so it's cosmetic. Per Ernst, it's a side effect of ensuring no paired primes below B2 get skipped, by pairing some primes above B2 with them, IIUC. However, there's much more status output from 100% to 104+% than from 0% to 100%, and that's a bug, for which a fix is planned in the next update. Reportedly fixed in the 2021-11-23 patch.
5. Attempts to stop a running Mlucas session, such as to restart from the beginning or to make mlucas.ini changes take effect, may produce assorted .stat file anomalies, depending on the stop method used. kill -9 (e.g., issued from top) works; Ctrl-C creates anomalies. (There is a patch file available which is thought to address most cases, though possibly not P-1 stage 2; the 2021-11-23 patch likely incorporates that fix, if not more, as a result of restored signal handling.) Following is a list of some of the interesting features of the resulting .stat file. These are mostly also typical signs of a computation gone wrong.
1. Repeating res64 values with differing iteration counts during Ctrl-C kill attempts.
2. Clocks anomalies vs. the elapsed time indicated by time stamps. For example, 100k iter to 110k iter in 13 seconds by time stamp, with the clocks value unchanged and a 10-fold increase in ms/iter.
3. Unchanged shift count between status lines indicated as 10k iterations apart, in a run using nonzero shift.
4. Error measures of zero in some entries.
5. Exactly the same clocks values 10k iterations apart early in a run.
6. At every log interval Mlucas claims to be restarting, including during the last run, in which it was left to run undisturbed while I slept.
7. Mismatch of res64 for the same iteration count, exponent, and computation type between the affected run and independent runs' log output. After a thorough restart-from-scratch process, the 10k and 100k iteration res64s match the corresponding res64s from a previous gpuowl run. I suspect the erroneous res64 in the affected Mlucas run is a res64 carried over from some earlier iteration count.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-11-24 at 15:52 Reason: misc edits

2021-10-10, 16:16   #17
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

19·311 Posts
V20.1.x P-1 run time scaling

(draft)

Based on very limited data, run time scales approximately as p^2.2 for typical recommended bounds, in line with expectations from other applications and from first principles. Run-time scaling for Mlucas v20.1 on a dual-Xeon-E5-2697v2 system, on Ubuntu atop WSL and Windows 10, is consistent with whole OBD P-1 factoring attempts of ~10 months' standalone duration, or ~15 months with the other usual loads. Running stage 2 in parallel on multiple systems can be used to ensure OBD P-1 completion in under a year. (Note the fit to points including a 10M exponent is inaccurate, because that point was run with 8 cores, unlike 16 for the others.)
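As an illustration of using that scaling exponent: given one measured run, t(p) ≈ t_ref · (p/p_ref)^2.2. The reference point below is a made-up placeholder, not a fitted value from the attachment:
Code:
#include <math.h>
#include <stdio.h>

/* Extrapolate P-1 run time from one measured point via t ~ p^2.2. */
static double est_days(double p, double p_ref, double t_ref_days) {
	return t_ref_days * pow(p / p_ref, 2.2);
}

int main(void) {
	double p_ref = 107.0e6, t_ref = 0.5;  /* hypothetical: half a day at 107M */
	printf("332M estimate: %5.1f days\n", est_days(332.0e6, p_ref, t_ref));
	printf("  1G estimate: %5.1f days\n", est_days(1.0e9,   p_ref, t_ref));
	return 0;
}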

When selecting exponents for run-time scaling tests, I recommend including at least one with a known factor that should be found with the usual bounds. Run it first, to anchor the low end of the scaling; M10000831 works well. Widely spaced other exponents of use to GIMPS compose the rest: the current first-test wavefront ~107M, ~220M, 332M (100Mdigit), and higher (~500M-700M). Running them in that order allows a scaling fit to develop in a spreadsheet with the least compute-time expenditure. That helps avoid single data points costing months, or the appearance of a hung application.

I'll add more here after more data points complete running with the 192M-FFT-optimized core count on V20.1.1.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 p1 run time scaling.pdf (39.1 KB, 3 views)

Last fiddled with by kriesel on 2021-11-18 at 18:40 Reason: update attachment which now includes gcd run time scaling

2021-10-22, 20:23   #18
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

19×311 Posts
Max exponent versus fft length

Mlucas v20.1 fft length vs. maxp (excerpted from get_fft_radices.c, subject to change). Note: for 256M fft length and larger, -shift 0 is required.
Code:
With AsympConst = 0.6, this gives the following maxP values for various FFT lengths:

maxn(N) as a function of AsympConst:
        N          AC = 0.6     AC = 0.4
 ----------      ----------   ----------
        1 K           22686        22788  [0.45% larger]
        2 K           44683        44888
        3 K           66435        66742
        4 K           88029        88438
        5 K          109506       110018
        6 K          130892       131506
        7 K          152201       152918
        8 K          173445       174264
        9 K          194632       195554
       10 K          215769       216793
       12 K          257912       259141
       14 K          299904       301338
       16 K          341769       343407
       18 K          383521       385364
       20 K          425174       427222
       24 K          508222       510679
       28 K          590972       593840
       32 K          673470       676747
       36 K          755746       759433
       40 K          837827       841923
       48 K         1001477      1006392
       56 K         1164540      1170275
       64 K         1327103      1333656
       72 K         1489228      1496601
       80 K         1650966      1659158
       96 K         1973430      1983260
      112 K         2294732      2306201
      128 K         2615043      2628150
      144 K         2934488      2949234
      160 K         3253166      3269550
      176 K         3571154      3589176
      192 K         3888516      3908176
      208 K         4205305      4226604
      224 K         4521565      4544502
      240 K         4837335      4861911
      256 K         5152648      5178863
      288 K         5782016      5811507
      320 K         6409862      6442630
      352 K         7036339      7072384
      384 K         7661575      7700897
      416 K         8285675      8328273
      448 K         8908726      8954601
      480 K         9530805      9579957
      512 K        10151977     10204406
      576 K        11391823     11450805
      640 K        12628648     12694184
      704 K        13862759     13934849
      768 K        15094405     15173048
      832 K        16323795     16408992
      896 K        17551103     17642854
      960 K        18776481     18874785
     1024 K        20000058     20104916  [0.52% larger]
     1152 K        22442252     22560217
     1280 K        24878447     25009519
     1408 K        27309250     27453429
     1536 K        29735157     29892444
     1664 K        32156582     32326975
     1792 K        34573872     34757372
     1920 K        36987325     37183933
     2048 K        39397201     39606917
     2304 K        44207097     44443027
     2560 K        49005071     49267215
     2816 K        53792328     54080687
     3072 K        58569855     58884428
     3328 K        63338470     63679258
     3584 K        68098867     68465868
     3840 K        72851637     73244853
     4096 K        77597294     78016725
     4608 K        87069012     87540871
     5120 K        96517023     97041311
     5632 K       105943724    106520441
     6144 K       115351074    115980220
     6656 K       124740700    125422275
     7168 K       134113980    134847983
     7680 K       143472090    144258522
  8 M =   8192 K  152816052    153654913
  9 M =   9216 K  171464992    172408710
 10 M =  10240 K  190066770    191115346
 11 M =  11264 K  208626152    209779586
 12 M =  12288 K  227147031    228405322
 13 M =  13312 K  245632644    246995793
 14 M =  14336 K  264085729    265553736
 15 M =  15360 K  282508628    284081492
 16 M =  16384 K  300903371    302581093
 18 M =  18432 K  337615274    339502711  <*** smallest 100-Mdigit moduli ***
 20 M =  20480 K  374233313    376330465
 22 M =  22528 K  410766968    413073835
 24 M =  24576 K  447223981    449740563
 26 M =  26624 K  483610796    486337093
 28 M =  28672 K  519932856    522868869
 30 M =  30720 K  556194824    559340552
 32 M =  32768 K  592400738    595756181  <*** Nov 2015: No ROE issues with a run of p = 595799947 [maxErr = 0.375], corr. to AsympConst ~= 0.4
 36 M =  36864 K  664658102    668432976
 40 M =  40960 K  736728582    740922886
 44 M =  45056 K  808631042    813244776
 48 M =  49152 K  880380890    885414055
 52 M =  53248 K  951990950    957443546
 56 M =  57344 K  1023472059   1029344085
 60 M =  61440 K  1094833496   1101124952
 64 M =  65536 K  1166083299   1172794185
 72 M =  73728 K  1308275271   1315825018
 80 M =  81920 K  1450095024   1458483632
 88 M =  90112 K  1591580114   1600807583
 96 M =  98304 K  1732761219   1742827549
104 M = 106496 K  1873663870   1884569060
112 M = 114688 K  2014309644   2026053695
120 M = 122880 K  2154717020   2167299932
128 M = 131072 K  2294902000   2308323773
144 M = 147456 K  2574659086   2589758580
160 M = 163840 K  2853674592   2870451808
176 M = 180224 K  3132023315   3150478252
192 M = 196608 K  3409766353   3429899013  <<<<*** Allows numbers slightly > 1Gdigit
208 M = 212992 K  3686954556   3708764937
224 M = 229376 K  3963630903   3987119006
240 M = 245760 K  4239832202   4264998026  <*** largest FFT length for 32-bit exponents
256 M = 262144 K  4515590327   4542433872
288 M = 294912 K  5065885246   5096084235
320 M = 327680 K  5614702299   5648256731
352 M = 360448 K  6162190494   6199100369
384 M = 393216 K  6708471554   6748736872
416 M = 425984 K  7253646785   7297267547
448 M = 458752 K  7797801823   7844778027
480 M = 491520 K  8341010002   8391341650
512 M = 524288 K  8883334834   8937021925  [0.60% larger]
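One programmatic use of this table is picking the smallest listed FFT length whose maxP covers a given exponent. A standalone C sketch with just a few sample rows from the AC = 0.6 column above (my own illustration, not Mlucas's selection logic):
Code:
#include <stddef.h>
#include <stdio.h>

static const struct { unsigned fft_k; unsigned long long maxp; } tab[] = {
	{   5632,  105943724ULL },
	{  18432,  337615274ULL },  /* smallest 100-Mdigit moduli */
	{ 196608, 3409766353ULL },  /* allows slightly > 1 Gdigit */
	{ 524288, 8883334834ULL },
};

/* Smallest listed FFT length (in K) whose maxP covers exponent p, or 0 if none. */
static unsigned fft_for(unsigned long long p) {
	for (size_t i = 0; i < sizeof tab / sizeof tab[0]; i++)
		if (p <= tab[i].maxp) return tab[i].fft_k;
	return 0;
}

int main(void) {
	printf("332192831 -> %u K\n", fft_for(332192831ULL));  /* 18432 K = 18M */
	printf("104000000 -> %u K\n", fft_for(104000000ULL));  /* 5632 K */
	return 0;
}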
Top of this reference thread: https://www.mersenneforum.org/showthread.php?t=23427
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-11-13 at 21:55
2021-10-25, 09:50   #19
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

19×311 Posts
Ram required versus exponent for P-1 stage 2 in Mlucas

Stage 1 can be run to gigadigit or higher in 12 GiB of ram, probably less. Stage 2 is much more demanding of ram than stage 1. A rough estimate of the stage 2 ram required for a successful launch is 30 buffers x (fft length x 8 bytes)/buffer; see the sketch following this post. For example, if for gigadigit stage 2, 256M is the fastest fft length of sufficient size (>= 192Mi), then 30 x 256Mi x 8 bytes = 61440 MiB = 60 GiB. In that case the ram requirement could be reduced by commenting out the 256M entry in mlucas.cfg, accepting the somewhat slower 192M timing in exchange for a smaller footprint: 30 x 192Mi x 8 bytes = 45 GiB. Observed ram usage of the Mlucas program in top was 45.9 GiB.

Stage 2 P-1 for F33 is estimated to require 30 x 512Mi x 8 bytes = 120 GiB. Wavefront stage 2 P-1 at ~107M exponent would require 6M fft length, ~1.4 GiB; 100Mdigit would require 18M fft length, ~4.2 GiB; a ~1G exponent would require 56M fft length, ~13.1 GiB. Since free ram in a WSL session is ~12 GiB on a 16 GiB system, the maximum exponent for stage 2 on such a system would be ~910M.

The values above are the estimated minimums to be able to run the stage; somewhat more ram can enable faster completion. In preliminary testing I've observed Mlucas allocate buffers in multiples of 24 or 40, plus, per Ernst, the equivalent of ~5 buffers required for other data. For OBD, 64 GiB would give the same speed as 60 or 80 GiB, but 96 GiB or more may allow use of 48 buffers or more and somewhat higher speed. There are diminishing returns with successive doublings of ram and buffer count, as observed in other P-1 capable software. Ram in use may fluctuate somewhat: a run of 468M was observed at 6.72 GiB virtual size, 6.02 GiB resident in top during stage 2, while the estimate for its 26M fft length and the 24 buffers used would give 6.1 GiB.

Top of this reference thread: https://www.mersenneforum.org/showthread.php?t=23427
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-11-17 at 18:06
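The buffer-count ram arithmetic referenced in the post above, as a standalone C sketch (the helper name is mine; the 30-buffer rule of thumb is from the post):
Code:
#include <stdio.h>

/* Rough P-1 stage 2 ram estimate: nbuf buffers, each of fft_mi Mi-doubles
   (8 bytes per double), i.e. fft_mi * 8 MiB per buffer. Result in GiB. */
static double stage2_gib(double fft_mi, int nbuf) {
	return fft_mi * 8.0 * nbuf / 1024.0;
}

int main(void) {
	printf("192M fft, 30 buffers: %6.1f GiB\n", stage2_gib(192.0, 30)); /*  45.0 */
	printf("512M fft, 30 buffers: %6.1f GiB\n", stage2_gib(512.0, 30)); /* 120.0 */
	printf("  6M fft, 30 buffers: %6.1f GiB\n", stage2_gib(  6.0, 30)); /*  ~1.4 */
	return 0;
}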
2021-11-17, 12:37   #20
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1715₁₆ Posts
Optimizing core count for fastest iteration time of a single task

On a dual-processor-package system, 2 x E5-2697v2 (each 12 cores plus x2 hyperthreading, for a total of 2 x 12 x 2 = 48 logical processors), within Ubuntu running atop WSL1 on Win10 Pro x64, a series of self-tests for a single fft length and varying core counts was run in Mlucas v20.1.1 (Nov 6 tarball). Usage this way is likely when attempting to complete one testing task as quickly as possible (minimum latency); examples are OBD or F33 P-1, or confirming a new Mersenne prime discovery. It is very likely not the maximum-throughput case, which is what typical production running would constitute.

The fastest iteration time, ~400 ms/iter at 192M (suitable for OBD P-1), was obtained at 20 cores, which is less than the total physical core count of 24.
Iteration times had limited reproducibility at 100 iterations (10% or worse variability); reproducibility was much better with 1000-iteration runs.
Best reproducibility was apparently obtained by running nothing else: no interactive use, not even top, although I left a gpuowl instance running uninterrupted.

Thread efficiency, computed as (1-thread ms/iter) / (N-thread ms/iter x N threads), varied widely, down to 20.6% at 48 threads. At the fastest iteration time, 20 threads, it was 65.8%. Power-of-two thread counts were local maxima in most cases.
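That efficiency calculation as a standalone C sketch (the single-thread and 48-thread timings below are approximate placeholders chosen to match the quoted percentages, not the attachment's measured values):
Code:
#include <stdio.h>

/* Parallel efficiency: speedup over 1 thread, divided by thread count. */
static double thread_eff(double t1_ms, double tN_ms, int nthreads) {
	return t1_ms / (tN_ms * nthreads);
}

int main(void) {
	double t1 = 5300.0;  /* approx. single-thread ms/iter at 192M (placeholder) */
	printf("20 threads @ ~400 ms/iter: %.1f%%\n", 100.0 * thread_eff(t1, 403.0, 20));
	printf("48 threads @ ~535 ms/iter: %.1f%%\n", 100.0 * thread_eff(t1, 535.0, 48));
	return 0;
}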

The tests were performed by writing and launching a simple sequential shell script specifying the Mlucas command line and output redirection, followed by a rename of mlucas.cfg before the next thread-count run.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 optimizing 192M thread count.pdf (27.8 KB, 3 views)

Last fiddled with by kriesel on 2021-11-17 at 14:46

2021-11-18, 19:22   #21
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

19×311 Posts
File size scaling

The size of the largest files depends on the exponent and the computation type; how many there are depends on the computation type. File size is independent of the P-1 bounds, being determined by the residues mod Mp. LL will have p and q files. PRP will also have a .G file for the GEC data. P-1 will have p and q files and eventually .s1 and .s2 files, plus .s1_prod, which is ~100x smaller. .stat files tend to be of modest size. A ~4G-exponent P-1 attempt means 1-2 GiB of space used, even with the minimum possible bounds.
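Since each of those large files holds a residue mod Mp, a rough per-file estimate is p bits = p/8 bytes. A standalone C sketch (the four-large-file count for P-1 is my reading of the list above):
Code:
#include <stdio.h>

/* A residue mod Mp occupies about p bits = p/8 bytes on disk. */
static double residue_mib(double p) { return p / 8.0 / (1024.0 * 1024.0); }

int main(void) {
	double p = 4.0e9;  /* ~4G exponent P-1 attempt */
	printf("one residue file: ~%.0f MiB\n", residue_mib(p));                 /* ~477 MiB */
	printf("p + q + .s1 + .s2: ~%.1f GiB\n", 4.0 * residue_mib(p) / 1024.0); /* ~1.9 GiB */
	return 0;
}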
Attached Files
 file sizes.pdf (17.8 KB, 3 views)
