2021-08-15, 23:09  #12 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
6,089 Posts 
Tuning Mlucas V20
My first try with Mlucas V20.0 was in Ubuntu 18.04 LTS installed in WSL1 on Windows 10 Home x64 on an i7-8750H laptop.
These tests were run with mfaktc also running on the laptop's discrete GPU, nothing on the IGP, a web browser with active Google Colab sessions, and TightVNC remote desktop for all access. Prime95 was stopped and exited before the test.

Experimenting a bit with Ernst's posted readme guidance, I obtained the timings shown in the attachment. Since some of these cases would have run with only 100 iterations, and I may have affected them somewhat with some interactive use, I may rerun some of them, perhaps after the next update release. Timings in a bit of production running seem to have gradually improved; possibly that relates to ambient temperature.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-08-31 at 23:57 
2021-09-06, 21:02  #13 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
6,089 Posts 
Mlucas V20.1 timings on various hardware and environments, & prime95 compared
Preface:
None of the following should be mistaken for criticism of anyone's efforts or results. Writing such software is hard. Making it produce correct outputs is harder. Making it fast and functional across a variety of inputs, hardware, environments, etc., is harder still. Few even dare to try.

Note: prime95 prevents running multiple instances in the same folder. Mlucas does not prevent simultaneously running multiple instances on the same exponent in the same folder. Don't do that; it creates an awful mess.

Case 1: PRP DC 84.7M on i7-8750H (6-core, 12-hyperthread)
Mlucas v20.1 on Ubuntu 18.04 LTS atop WSL1 on Win10: nominal 4-thread 18 iters/sec; nominal 8-thread 29 iters/sec, so 47 iters/sec throughput for the system as operated, potentially up to 54 iters/sec combined for 3 processes of 4 threads each.
V29.5b6 prime95 benchmark on Windows 10 Home x64, same system: benchmarked all FFT lengths 2M-32M. For 84.7M, the fft length is 4480K; 88.7 to 93.4 iters/sec throughput. Best throughput is all 6 cores on one worker, which also gives minimum latency.
Mlucas v20.1/WSL1 performance observed is ~50 to 61% that of prime95/Win on this system. Note prime95 has subsequently improved speed in some respects since the version benchmarked. Access via TightVNC & GPU app overhead were present and should have been roughly constant.

Case 2: Dual E5-2697 v2 (12-core & x2 HT each) for wavefront PRP
V29.8b6 prime95 on Win10, benchmark at 5760K fft length; best was 2 workers, 238 iters/sec throughput.
Mlucas v20.1/WSL 8-thread, 5632K fft length, 15.09 ms/iter -> 66.3 iters/sec. Optimistically extrapolating to triple throughput for 24 cores, 198.8 iters/sec. Other benchmarking showed a disadvantage to using all hyperthreads, versus 1 thread per physical core.
Mlucas V20.0/WSL 4-thread:
Code:
5632  msec/iter =   29.25  ROE[avg,max] = [0.231956749, 0.312500000]  radices = 176 16 32 32  0  0  0  0  0  0
Optimistically extrapolating to 6x throughput for 24 cores, 205.1 iters/sec throughput. Mlucas/WSL performance is ~83.5-86.2% of prime95 under favorable assumptions. Note that's in comparison to V29.8b6, not the current v30.6b4 prime95.

Case 3: i7-4770 (4-core & x2 HT) for wavefront PRP; 5-way test on a dual-boot Win10/Ubuntu system
Mlucas V20.1 / Ubuntu 20.04 / WSL2 / Win10 sandwich on the Windows boot (primary) partition:
Code:
5632  msec/iter =   23.68  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0
Mlucas V20.1 / Ubuntu 20.04 LTS boot on the second partition of the same system drive; 8-thread, which showed an advantage over 4-thread in WSL:
Code:
5632  msec/iter =   15.98  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0
prime95 v30.6b4, usual RDP, GPU apps running, etc., so some overhead load:
Code:
FFTlen=5600K, Type=3, Arch=4, Pass1=448, Pass2=12800, clm=2 (3 cores, 1 worker): 13.37 ms.
prime95 v30.6b4, Windows 10 Pro x64, no RDP or GPU apps running:
Code:
Timings for 5760K FFT length (4 cores, 1 worker):  11.31 ms.  Throughput: 88.43 iter/sec.
Timings for 5760K FFT length (4 cores, 2 workers): 22.88, 22.71 ms.  Throughput: 87.73 iter/sec.
Timings for 5760K FFT length (4 cores, 4 workers): 45.76, 44.74, 44.90, 44.29 ms.  Throughput: 89.06 iter/sec.
mprime v30.6b4, Ubuntu 20.04, logged in at the console, no GPU apps, no remote access, minimal overhead:
Code:
Timings for 5760K FFT length (4 cores, 1 worker):  11.24 ms.  Throughput: 88.97 iter/sec.
Timings for 5760K FFT length (4 cores, 2 workers): 22.43, 22.48 ms.  Throughput: 89.06 iter/sec.
Timings for 5760K FFT length (4 cores, 4 workers): 44.65, 44.65, 44.60, 44.76 ms.  Throughput: 89.55 iter/sec.
(Mprime/Linux max throughput is 1.0055x prime95/Windows max throughput.)

mprime and prime95 timings are very close for equalized system overhead on the same hardware. While there's essentially no Linux-vs-Windows speed advantage for prime95/mprime, there may be one for Mlucas, because of the core-virtualization issue on WSL, which is currently required to run Mlucas on Windows. This should have less effect when the cores are fully loaded with enough Mlucas threads to occupy them all.
Mlucas/WSL performance: 42.23/74.82 ~ 56.4% of prime95/Win10. Both sessions may have been negatively impacted by remote-desktop overhead.
Mlucas/Ubuntu performance: 62.58/88.43 ~ 70.8% of prime95/Win10 single-worker. Both the Windows and Ubuntu timings were without remote-desktop overhead or GPU apps.
Benchmarking experimental error is unknown. Digitization error is up to ~0.09%.
Mlucas can currently perform LL, PRP, and P-1 computations on higher exponents than any other GIMPS software known. Benchmark and estimate run times before committing to long runs.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-09-06 at 21:31 
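The per-iteration timings above convert directly to throughput via iters/sec = 1000 / (ms/iter). A minimal sketch of that arithmetic, using the Case 3 Mlucas/Ubuntu and prime95/Win10 numbers (the helper function is illustrative, not part of either program):

```python
# Convert per-iteration timings to throughput and compare two programs,
# using the Case 3 numbers above (Mlucas on Ubuntu vs. prime95 on Win10).

def iters_per_sec(ms_per_iter: float) -> float:
    """Throughput in iterations/second from a ms/iter timing."""
    return 1000.0 / ms_per_iter

mlucas_ubuntu = iters_per_sec(15.98)   # ~62.58 iters/sec
prime95_win   = iters_per_sec(11.31)   # ~88.43 iters/sec

ratio = mlucas_ubuntu / prime95_win    # ~0.708, i.e. ~70.8%
print(f"{mlucas_ubuntu:.2f} / {prime95_win:.2f} = {ratio:.1%}")
```

The same conversion reproduces the other ratios quoted above (e.g. 23.68 ms/iter -> 42.23 iters/sec for the WSL2 case).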
2021-09-17, 18:53  #14 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
6,089 Posts 
Mlucas releases
This is an incomplete draft list.
2017-06-15  V17.0  https://mersenneforum.org/showthread.php?t=22391
2017-07-02? V17.1  https://mersenneforum.org/showthread.php?t=2977
2019-02-20  V18    https://mersenneforum.org/showthread.php?t=24100
2019-12-01  v19    https://mersenneforum.org/showthread.php?t=24990
2021-02-11  v19.1  ARMv8 SIMD / Clang/LLVM compiler compatibility  https://mersenneforum.org/showthread.php?t=26483
2021-07-31  V20.0  P-1 support; automake script makemake.sh  https://mersenneforum.org/showthread.php?t=27031
2021-08-31  V20.1  faster P-1 stage 2, some bug fixes, print refinements, new help provisions, corrected reference residues, raised maximum Mp limits  https://mersenneforum.org/showthread.php?t=27114
tbd  V20.2?  minor cleanup such as labeling factor bits as bits, additional bug fixes; possibly resync mfactor variable types with shared routines typing from Mlucas
tbd  V21?  PRP proof generation planned

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-09-19 at 13:41 
2021-09-17, 23:14  #15 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
6,089 Posts 
Wish list
Features I'd like to see added to Mlucas. As always, it's the developer's call what actually happens; his time, his talent, his program. These are in when-I-thought-to-write-them order.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-10-14 at 17:40 
2021-09-17, 23:18  #16  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
6,089 Posts 
Bug list
This is a partial list, organized mostly by the version in which each bug was first seen. Testing has involved only Mersenne-number-related capabilities; no attempt was made to test the Fermat-number capabilities.
V17.0: Gave msec/iter times, but labeled them sec/iter. Resolved in a later version.
V18.0: ?
V19.0: ?
V20.0: There are several described at https://mersenneforum.org/showpost.p...47&postcount=1. Upgrade to V20.1 for faster P-1 stage 2 and multiple bug fixes. At least one bug slipped by brief testing, so it is present in V20.1 also. See also the P-1 stage 2 restart issue etc. below.
V20.1:
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-12-03 at 14:34 Reason: misc edits 

2021-10-10, 16:16  #17 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
6089_{10} Posts 
V20.1.x P1 run time scaling
Based on very limited data, run time scaling is approximately p^{2.1} for typical recommended bounds, in line with expectations from other applications and from first principles. (So twice the exponent is more than four times the run time, for nontrivial exponents, where fixed-duration or low-order-scaling setup time does not affect scaling much.)
When selecting exponents for run-time scaling tests, I recommend at least one with a known factor that should be found with usual bounds. That goes first, to anchor the low end of the scaling; M10000831 works well. Widely spaced other exponents of use to GIMPS compose the rest: current first-test wavefront ~107M, ~220M, 332M (100M-digit), & higher (~500M-700M). Running them in that order allows a scaling fit to develop in a spreadsheet with the least compute-time expenditure. That helps avoid single data points costing months, or the appearance of a hung application. If running on WSL & Windows, take care to pause Windows updating for a duration sufficient for the scaling runs to complete without interruption, for an easier situation when tabulating compute time per exponent.

The first attachment shows results of runs made while also running other GIMPS loads, plus brief stage 1 tests without the other loads. Run time scaling for Mlucas v20.1 on a dual-Xeon E5-2697 v2 system on Ubuntu atop WSL & Windows 10, 128 GiB ECC RAM, is consistent with OBD P-1 factoring whole attempts of ~10 months standalone duration, ~15 months with other usual loads. Running stage 2 in parallel on multiple systems can be used to ensure OBD P-1 completion in under a year. (Note the fit to points including a 10M exponent is inaccurate, because that point was run with 8 cores, unlike 16 for the others in that set.)

An experimental sequence of self-tests at 192M fft length for varying thread counts indicated 20 threads was latency-optimal for this system at that length. See https://www.mersenneforum.org/showpo...8&postcount=20

A second run-time scaling set for the same dual-Xeon E5-2697 v2 system on Ubuntu atop WSL & Windows 10 was run with 20 threads and Mlucas V20.1.1. See the second attachment. Several run time estimates for gigadigit P-1 were computed, all shorter than one year. There are a few available ways to shorten run time relative to those tests and estimates, listed in the attachment. 
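The p^{2.1} scaling can be used to extrapolate a long run's duration from one shorter measured run. A minimal sketch, with an illustrative (not measured) anchor timing:

```python
# Extrapolate P-1 run time using the approximate p^2.1 scaling above.
# The 24-hour anchor below is illustrative; substitute a measured timing.

def scaled_runtime(t_anchor_hours: float, p_anchor: int, p_target: int,
                   exponent: float = 2.1) -> float:
    """Estimated run time at p_target, given a measured time at p_anchor."""
    return t_anchor_hours * (p_target / p_anchor) ** exponent

# If a ~107M-exponent P-1 took, say, 24 hours, a 332M (100M-digit) attempt
# would take roughly (332/107)^2.1 ~ 10.8x as long:
print(scaled_runtime(24.0, 107_000_000, 332_000_000))  # ~259 hours
```

Fitting measured points in a spreadsheet, as described above, refines the 2.1 exponent for a particular system.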
A comparison of the stat files of M3321928171 from the first scaling run in Mlucas V20.1 and the second in V20.1.1, with differing core counts but matching B1=17,000,000, shows the stage 1 iteration 100,000 res64 values match. (Res64: C51C82322FC7CBE6) See also the "requirements for comparability of interim residues" post, https://www.mersenneforum.org/showpo...6&postcount=35

An additional run-time scaling on a similar system (dual-Xeon E5-2690, 64 GiB ECC RAM) is incomplete and so is not yet attached, but the preliminary results through 332M are encouraging that it will also qualify for gigadigit P-1 with estimated solo run time under ~1.5 years.

Given the run-time scalings obtained so far, and mlucas.cfg timing for 192M fft length, we can estimate that a timing under ~500 ms/iter is required to qualify for OBD P-1 to designated bounds solo within a year. Assuming run time is 1/3 stage 1 and 2/3 stage 2, stage 1 taking no more than (4 months * 365/12 days/month * 24 hours/day * 3600 seconds/hour) / (17000000 * 1.442 iters) = 10512000 s / 24514000 iters ~= 429 msec/iter provides a rough magnitude check. So far I have observed s2/s1 duration ratios lower than 2 during run-time scaling at exponents < 1G (and those would improve with greater memory allowed for stage 2), which would permit a somewhat longer stage 1 than 4 months, and longer than 429 msec/iter 192M fft length self-test times.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-11-30 at 14:16 Reason: added second attachment, 1G-digit 100k iter interim residue comparison, max ms/iter estimate 
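The 429 msec/iter magnitude check above can be reproduced directly; a sketch of the arithmetic (the 4-month stage 1 budget and the B1 value are taken from the text):

```python
# Reproduce the stage 1 ms/iter budget estimate above: if stage 1 may take
# at most 4 average months and needs ~B1 * 1.442 squarings, the
# per-iteration budget follows directly.

b1 = 17_000_000                            # stage 1 bound used above
stage1_iters = b1 * 1.442                  # ~24,514,000 squarings
budget_seconds = 4 * 365 / 12 * 24 * 3600  # 4 average months = 10,512,000 s

ms_per_iter_budget = budget_seconds / stage1_iters * 1000
print(f"{ms_per_iter_budget:.0f} ms/iter")  # 429 ms/iter
```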
2021-10-22, 20:23  #18 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1011111001001_{2} Posts 
Max exponent versus fft length
Mlucas v20.1 fft length, maxp (excerpted from get_fft_radices.c, subject to change)
Note: for 256M fft length and larger, shift 0 is required.
Code:
With AsympConst = 0.6, this gives the following maxP values for various FFT lengths:

maxn(N) as a function of AsympConst:
      N      AC = 0.6    AC = 0.4
      1 K       22686       22788  [0.45% larger]
      2 K       44683       44888
      3 K       66435       66742
      4 K       88029       88438
      5 K      109506      110018
      6 K      130892      131506
      7 K      152201      152918
      8 K      173445      174264
      9 K      194632      195554
     10 K      215769      216793
     12 K      257912      259141
     14 K      299904      301338
     16 K      341769      343407
     18 K      383521      385364
     20 K      425174      427222
     24 K      508222      510679
     28 K      590972      593840
     32 K      673470      676747
     36 K      755746      759433
     40 K      837827      841923
     48 K     1001477     1006392
     56 K     1164540     1170275
     64 K     1327103     1333656
     72 K     1489228     1496601
     80 K     1650966     1659158
     96 K     1973430     1983260
    112 K     2294732     2306201
    128 K     2615043     2628150
    144 K     2934488     2949234
    160 K     3253166     3269550
    176 K     3571154     3589176
    192 K     3888516     3908176
    208 K     4205305     4226604
    224 K     4521565     4544502
    240 K     4837335     4861911
    256 K     5152648     5178863
    288 K     5782016     5811507
    320 K     6409862     6442630
    352 K     7036339     7072384
    384 K     7661575     7700897
    416 K     8285675     8328273
    448 K     8908726     8954601
    480 K     9530805     9579957
    512 K    10151977    10204406
    576 K    11391823    11450805
    640 K    12628648    12694184
    704 K    13862759    13934849
    768 K    15094405    15173048
    832 K    16323795    16408992
    896 K    17551103    17642854
    960 K    18776481    18874785
   1024 K    20000058    20104916  [0.52% larger]
   1152 K    22442252    22560217
   1280 K    24878447    25009519
   1408 K    27309250    27453429
   1536 K    29735157    29892444
   1664 K    32156582    32326975
   1792 K    34573872    34757372
   1920 K    36987325    37183933
   2048 K    39397201    39606917
   2304 K    44207097    44443027
   2560 K    49005071    49267215
   2816 K    53792328    54080687
   3072 K    58569855    58884428
   3328 K    63338470    63679258
   3584 K    68098867    68465868
   3840 K    72851637    73244853
   4096 K    77597294    78016725
   4608 K    87069012    87540871
   5120 K    96517023    97041311
   5632 K   105943724   106520441
   6144 K   115351074   115980220
   6656 K   124740700   125422275
   7168 K   134113980   134847983
   7680 K   143472090   144258522
  8 M =   8192 K   152816052   153654913
  9 M =   9216 K   171464992   172408710
 10 M =  10240 K   190066770   191115346
 11 M =  11264 K   208626152   209779586
 12 M =  12288 K   227147031   228405322
 13 M =  13312 K   245632644   246995793
 14 M =  14336 K   264085729   265553736
 15 M =  15360 K   282508628   284081492
 16 M =  16384 K   300903371   302581093
 18 M =  18432 K   337615274   339502711  <*** smallest 100M-digit moduli ***
 20 M =  20480 K   374233313   376330465
 22 M =  22528 K   410766968   413073835
 24 M =  24576 K   447223981   449740563
 26 M =  26624 K   483610796   486337093
 28 M =  28672 K   519932856   522868869
 30 M =  30720 K   556194824   559340552
 32 M =  32768 K   592400738   595756181  <*** Nov 2015: No ROE issues with a run of p = 595799947 [maxErr = 0.375], corr. to AsympConst ~= 0.4
 36 M =  36864 K   664658102   668432976
 40 M =  40960 K   736728582   740922886
 44 M =  45056 K   808631042   813244776
 48 M =  49152 K   880380890   885414055
 52 M =  53248 K   951990950   957443546
 56 M =  57344 K  1023472059  1029344085
 60 M =  61440 K  1094833496  1101124952
 64 M =  65536 K  1166083299  1172794185
 72 M =  73728 K  1308275271  1315825018
 80 M =  81920 K  1450095024  1458483632
 88 M =  90112 K  1591580114  1600807583
 96 M =  98304 K  1732761219  1742827549
104 M = 106496 K  1873663870  1884569060
112 M = 114688 K  2014309644  2026053695
120 M = 122880 K  2154717020  2167299932
128 M = 131072 K  2294902000  2308323773
144 M = 147456 K  2574659086  2589758580
160 M = 163840 K  2853674592  2870451808
176 M = 180224 K  3132023315  3150478252
192 M = 196608 K  3409766353  3429899013  <<<<*** Allows numbers slightly > 1G-digit
208 M = 212992 K  3686954556  3708764937
224 M = 229376 K  3963630903  3987119006
240 M = 245760 K  4239832202  4264998026  <*** largest FFT length for 32-bit exponents
256 M = 262144 K  4515590327  4542433872
288 M = 294912 K  5065885246  5096084235
320 M = 327680 K  5614702299  5648256731
352 M = 360448 K  6162190494  6199100369
384 M = 393216 K  6708471554  6748736872
416 M = 425984 K  7253646785  7297267547
448 M = 458752 K  7797801823  7844778027
480 M = 491520 K  8341010002  8391341650
512 M = 524288 K  8883334834  8937021925  [0.60% larger]
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-11-13 at 21:55 
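Given the table, picking the smallest adequate fft length for an exponent is a lookup over the maxP column. A sketch using a few rows excerpted from the AC = 0.6 column (the helper function and its row selection are illustrative, not Mlucas code):

```python
import bisect

# A few (fft_length_in_K, maxP at AsympConst = 0.6) rows excerpted from the
# table above; the full table in get_fft_radices.c has many more entries.
MAXP = [
    (18 * 1024, 337_615_274),     # 18M:  smallest 100M-digit moduli
    (192 * 1024, 3_409_766_353),  # 192M: allows slightly > 1G-digit
    (240 * 1024, 4_239_832_202),  # 240M: largest FFT for 32-bit exponents
    (512 * 1024, 8_883_334_834),
]

def smallest_fft_k(p: int) -> int:
    """Smallest tabulated fft length (in K) whose maxP covers exponent p."""
    i = bisect.bisect_left([maxp for _, maxp in MAXP], p)
    if i == len(MAXP):
        raise ValueError("exponent exceeds tabulated limits")
    return MAXP[i][0]

print(smallest_fft_k(3_321_928_171))  # gigadigit exponent -> 196608 (192M)
```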
2021-10-25, 09:50  #19 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
6089_{10} Posts 
Ram required versus exponent for P1 stage 2 in Mlucas
Stage 1 can be run to gigadigit exponents or higher in 12 GiB of RAM, probably less.
Stage 2 is much more demanding of RAM than stage 1. A rough estimate of stage 2 RAM required for a successful launch is 30 buffers x (fft length x 8 bytes)/buffer. For example, if for gigadigit stage 2, 256M is the fastest fft length of sufficient size (>=192Mi), then 30 x 256Mi x 8 bytes = 61440 MiB = 60 GiB. In that case the RAM required could be reduced by commenting out the 256M entry in mlucas.cfg, accepting the somewhat slower 192M timing in exchange for a smaller RAM requirement: 30 x 192Mi x 8 bytes = 45 GiB. Observed RAM usage of the Mlucas program in top was 45.9 GiB.

Stage 2 P-1 for F33 is estimated to require 30 x 512Mi x 8 bytes = 120 GiB. Wavefront stage 2 P-1 at ~107M exponent would require 6M fft length, ~1.4 GiB. 100M-digit would require 18M fft length, ~4.2 GiB. A ~1G exponent would require 56M fft length, ~13.1 GiB. So since free RAM in a WSL session is ~12 GiB on a 16 GiB system, the max exponent for stage 2 on such a system would be ~910M.

The values above are the estimated minimums to be able to run the stage. Somewhat more RAM could enable faster completion. In preliminary testing, I've observed Mlucas allocate buffers in multiples of 24 or 40, plus, per Ernst, the equivalent of ~5 buffers required for other data. For OBD, 64 GiB would give the same speed as 60 or 80 GiB, but 96 GiB or higher may allow use of 48 buffers or more and somewhat higher speed. There are diminishing returns with successive doublings of RAM and buffer count, as observed in other P-1-capable software. RAM in use may fluctuate somewhat. A run of 468M is observed with 6.72 GiB virtual size, 6.02 GiB resident in top in stage 2, while the estimate would give, for its 26M fft length, 6.1 GiB for 24 buffers used.

Top of this reference thread: https://www.mersenneforum.org/showthread.php?t=23427
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-11-17 at 18:06 
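The buffer arithmetic above can be sketched as follows (the 30-buffer rule of thumb is the estimate from the text, not an exact Mlucas allocation):

```python
# Estimate P-1 stage 2 RAM from fft length, per the rule of thumb above:
# roughly 30 buffers of (fft_length * 8 bytes) each.

def stage2_ram_gib(fft_len_mi: float, buffers: int = 30) -> float:
    """Approximate stage 2 RAM in GiB for an fft length given in Mi doubles."""
    bytes_needed = buffers * fft_len_mi * 2**20 * 8
    return bytes_needed / 2**30

print(stage2_ram_gib(256))  # 256M fft: 60.0 GiB, as computed above
print(stage2_ram_gib(192))  # 192M fft: 45.0 GiB
print(stage2_ram_gib(6))    # ~107M wavefront, 6M fft: ~1.4 GiB
```

Raising `buffers` to 48 (with correspondingly more RAM) models the higher-speed configurations mentioned above.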
2021-11-17, 12:37  #20 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
6,089 Posts 
Optimizing core count for fastest iteration time of a single task
On a dual-processor-package system, 2 x E5-2697 v2 (each 12-core plus x2 hyperthreading, for a total of 2 x 12 x 2 = 48 logical processors), within Ubuntu running atop WSL1 on Win 10 Pro x64, a series of self-tests for a single fft length and varying cpu core counts was run in Mlucas v20.1.1 (Nov 6 tarball). Usage that way would be likely when attempting to complete one testing task as quickly as possible (minimum latency); examples are OBD or F33 P-1, or confirming a new Mersenne prime discovery. It is very likely not the maximum-throughput case, which would constitute typical production running.
The fastest iteration time, ~400 ms/iter at 192M fft length (suitable for OBD P-1), was obtained at 20 cores, which is less than the total physical core count of 24. Iteration times were observed to have limited reproducibility at 100 iterations (10% or worse variability). Reproducibility was much better with 1000-iteration runs. Best reproducibility was apparently obtained by running nothing else: no interactive use, not even top, although I left a gpuowl instance running uninterrupted.

The thread efficiency (= 1-thread ms/iter / (N-thread ms/iter * N threads)) varied widely, down to 20.6% at 48 threads. At the fastest iteration time, 20 threads, it was 65.8%. Power-of-two thread counts were in most cases local maxima.

The tests were performed by writing and launching a simple sequential shell script specifying the Mlucas command line and output redirection, followed by a rename of mlucas.cfg before the next thread-count run.

Top of this reference thread: https://www.mersenneforum.org/showthread.php?t=23427
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-11-17 at 14:46 
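Thread efficiency as used above is the 1-thread time divided by N times the N-thread time. A sketch (the 1-thread timing below is back-computed from the quoted 65.8% figure, i.e. hypothetical, not a measurement from the post):

```python
# Thread efficiency: how well N threads convert into speedup, defined as
# (1-thread ms/iter) / (N-thread ms/iter * N). 1.0 would be perfect scaling.

def thread_efficiency(ms_1thread: float, ms_nthread: float, n: int) -> float:
    return ms_1thread / (ms_nthread * n)

# Hypothetical numbers consistent with the post: ~400 ms/iter at 20 threads
# and ~65.8% efficiency imply a 1-thread time near 5264 ms/iter.
print(f"{thread_efficiency(5264, 400, 20):.1%}")  # 65.8%
```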
2021-11-18, 19:22  #21 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
6,089 Posts 
File size scaling
Size of the largest files depends on exponent and computation type; their number depends on the computation type. File size is independent of P-1 bounds; it is determined by the size of a residue mod Mp. LL will have p and q files. PRP will also have a .G file for the GEC data. P-1 will have p and q files and eventually .s1 and .s2 files, and a .s1_prod file which is ~100X smaller. .stat files tend to be of modest size. A 4G P-1 attempt means 12 GiB of space used, even with the minimum possible bounds.
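As a rough sketch of why file size tracks the exponent: a full residue mod Mp occupies ~p/8 bytes, regardless of P-1 bounds (this is a per-file estimate only; the number of such files accumulated varies by computation type, as described above):

```python
# Rough residue-savefile size: a mod-Mp residue is p bits, i.e. ~p/8 bytes
# (plus a small header), independent of P-1 bounds. Several such files
# accumulate per job (p, q, .s1, .s2, ...).

def residue_file_gib(p: int) -> float:
    """Approximate size in GiB of one full-residue savefile for exponent p."""
    return p / 8 / 2**30

p = 3_321_928_171  # gigadigit exponent
print(f"{residue_file_gib(p):.2f} GiB per residue file")  # ~0.39 GiB
```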
