2021-09-24, 00:03 | #23 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·3·541 Posts |
Why don't we compute GCD for a P-1 stage in parallel with the next stage or assignment?
See discussion beginning at https://mersenneforum.org/showpost.p...&postcount=439
Most computations are now done multicore in prime95 / mprime and Mlucas. The P-1 GCD is an exception. Running the P-1 stage 1 computation multicore, then the GCD single-core, then P-1 stage 2 (if stage 1 did not find a factor and available memory is sufficient for stage 2), and then the stage 2 GCD single-core, leaves most of the cores idle for the duration of each GCD.

Gpuowl does run the GCD in parallel. In a case with multiple Radeon VII GPUs served by a single slow CPU, sequential GCD was taking about 5 minutes of the 40-minute wavefront P-1 to optimal bounds, leaving the GPU idle during the GCD. Running the GCD in parallel while speculatively proceeding with the next stage or assignment added ~14% to throughput. The chance that a following stage, or PRP progress on the same exponent, turns out to be unnecessary, with near-optimal bounds applied, is ~2%. If the following work is for a different exponent, there is no potential loss.

The potential gain on CPU applications such as prime95 / mprime or Mlucas seems smaller. Ballpark calculations indicate of order 0.075% to 0.26% of P-1 time, depending on bounds and number of cores per worker. The analysis neglects the initially higher speed a multicore worker may experience when it resumes multicore operation, after the package has cooled down during reduced-core-count operation in the serial GCD.

Since, at optimized bounds and limits, TF and P-1 each take about 1/40 as long as a primality test, the possible gain per exponent overall is diluted by a factor of about 1/42, to ~62 ppm of exponent (TF + P-1 + PRP) time in one case (880M on a 16-core Xeon Phi 7210 worker). In the i3-9100 (4 cores, no hyperthreading) 27.4M case, 5.05sec x 2 / 2hr29min x 3cores/4cores = 0.075% of P-1 time; that would correspond to ~0.075% x 1/42 = 18 ppm of (TF + P-1 + PRP) time.

Where hyperthreading is available, the full core count might remain available and productive for parallel speculative execution of the next work while waiting for the GCD to complete.
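The ballpark figures above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration using numbers quoted in this post; the function names are hypothetical, and it assumes the GPU sits fully idle during a sequential GCD and that TF and P-1 each take exactly 1/40 of the primality test time.

```python
def gpu_throughput_gain(gcd_minutes, total_minutes):
    """Fractional throughput gain if the GCD no longer stalls the GPU.

    Assumes total_minutes includes the GCD and the GPU is idle for all of it.
    """
    busy_minutes = total_minutes - gcd_minutes
    return total_minutes / busy_minutes - 1.0


def p1_share_of_exponent_time(tf_ratio=1 / 40, p1_ratio=1 / 40):
    """Fraction of (TF + P-1 + PRP) time spent in P-1, taking PRP time as 1."""
    return p1_ratio / (1.0 + tf_ratio + p1_ratio)


def ppm_of_exponent_time(p1_fraction_saved):
    """Convert a saving given as a fraction of P-1 time to ppm of exponent time."""
    return p1_fraction_saved * p1_share_of_exponent_time() * 1e6


if __name__ == "__main__":
    print(f"GPU gain from parallel GCD: {gpu_throughput_gain(5, 40):.1%}")
    print(f"P-1 share of exponent time: 1/{1 / p1_share_of_exponent_time():.0f}")
    for percent in (0.075, 0.26):
        ppm = ppm_of_exponent_time(percent / 100)
        print(f"{percent}% of P-1 time -> {ppm:.0f} ppm of exponent time")
```

The dilution factor falls out of the assumed ratios: with PRP time taken as 1, P-1 is 1/40 out of a total of 1 + 2/40, which is 1/42.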
Where hyperthreading is not available, it might be necessary to temporarily reduce that worker's core count by one while the GCD runs on the freed core. The above figures include that effect, of regaining n-1 cores' productivity out of the n allocated to a worker. With hyperthreading used for the GCD, the savings may be as high as ~0.28% of P-1 time, or 66 ppm of exponent time.

That such maneuvers are not employed in mprime / prime95 or Mlucas may indicate that the authors, if they have evaluated it, determined that their time is better spent elsewhere or that there are higher priorities. The average of the 66 and 18 ppm possible gains is the equivalent of adding 2/3 of a computer to the 15761 seen active on the project in the past 30 days, or of finishing a year's running 22 minutes sooner.

Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-09-24 at 00:05 |
2021-09-26, 12:55 | #24 |
Why don't we preallocate PRP proof disk space in parallel with the computation?
(Some of this was originally posted as parts of https://mersenneforum.org/showpost.p...&postcount=441 and https://mersenneforum.org/showpost.p...&postcount=443)
In recent versions of mprime / prime95, the entire disk space footprint for PRP proof generation temporary residues is preallocated before the PRP computation begins. On a Xeon Phi 7210 this was observed to take about 3 minutes for a 500M PRP, using a single core of an otherwise-stalled 16-core worker. Why not run the preallocation on one core, and the initial PRP iterations on the remaining cores of the worker, in parallel? One could compute a time estimate for the space preallocation and a time estimate for when the first proof residue must be deposited, parallelize only when there's a comfortable time margin, and also ensure the first residue save waits for preallocation to complete.

Preallocating PRP proof power 8 space took 15.6 GB and 3 minutes at 500M on a Xeon Phi 7210 in Windows 10 with a rotating drive. Forecast PRP time was 328.5 days, ~473040 minutes. 3/473040 x 15cores/16cores = 6 ppm of PRP time saved. This is a micro-optimization. Use of hyperthreading may allow slightly more, by using n rather than n-1 worker cores; in that subcase the preallocation runs on a thread using a different logical core (hyperthread).

Proof generation disk space is proportional to exponent and exponential in proof power, so presumably preallocation time is ~linear in exponent, while PRP run time is proportional to ~exponent^{2.1}. So at 110M, preallocation time at proof power 8 is ~110/500 x 3 min = 0.66 minutes; run time is ~(110/500)^{2.1} x 328.5 d = 13.67 days; and the ratio is 0.66 min / 13.67 days / (1440 minutes/day) = 34 ppm, substantially more than for larger exponents. If there are no truncation losses, 34 ppm is equivalent to adding 34e-6 x 15700 computers on GIMPS in the past month = 0.53 computers, or to increasing an assumed average clock rate of 3 GHz by 102 kHz.
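The scaling argument above can be written out as a small calculation. This is a sketch only, using the measured 500M figures from this post as the reference point; the function names are hypothetical, and the linear and exponent^2.1 scaling laws are the rough assumptions stated above, not values taken from any GIMPS software.

```python
REF_EXPONENT = 500e6     # reference exponent (500M)
REF_PREALLOC_MIN = 3.0   # observed preallocation time at proof power 8, minutes
REF_PRP_DAYS = 328.5     # forecast PRP run time at the reference exponent, days
MIN_PER_DAY = 1440


def prealloc_minutes(exponent):
    """Preallocation time, assumed linear in exponent."""
    return REF_PREALLOC_MIN * exponent / REF_EXPONENT


def prp_days(exponent, scaling=2.1):
    """PRP run time, assumed proportional to exponent**scaling."""
    return REF_PRP_DAYS * (exponent / REF_EXPONENT) ** scaling


def saving_ppm(exponent, core_factor=1.0):
    """Preallocation time as ppm of PRP run time, scaled by the cores regained."""
    run_minutes = prp_days(exponent) * MIN_PER_DAY
    return prealloc_minutes(exponent) / run_minutes * core_factor * 1e6


if __name__ == "__main__":
    print(f"500M: {saving_ppm(500e6, core_factor=15 / 16):.1f} ppm")
    ppm = saving_ppm(110e6)
    print(f"110M: {prealloc_minutes(110e6):.2f} min prealloc, "
          f"{prp_days(110e6):.2f} d run, {ppm:.1f} ppm")
    # The post rounds to whole ppm before converting to equivalents.
    rounded = round(ppm)
    print(f"equivalent computers out of 15700: {rounded * 1e-6 * 15700:.2f}")
    print(f"equivalent clock gain at 3 GHz: {rounded * 1e-6 * 3e9 / 1e3:.0f} kHz")
```

Because run time grows faster than preallocation time, the ratio shrinks as exponents grow, which is why the 110M case shows a larger saving than the 500M case.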
At (gpuowl maximum) proof power 10 the file would be 4 times larger, so it would presumably take 4 times longer to preallocate, 2.64 minutes; at mprime / prime95 maximum proof power 12 the file would be 4 times larger again, so it would presumably take ~10.6 minutes to preallocate.

A rough estimate of the time from the beginning of preallocation and PRP iterations to the first proof residue save, for prime95's maximum supported proof power 12 (which gives the earliest residue save), for 110M, 4-worker, is 13.67 d / 2^{12} x 1440 min/day = 4.8 minutes. So in this case (Xeon Phi 7210, 8 workers or fewer) there is not sufficient time to fully preallocate in parallel at the maximum proof power, given the projected 10.6 minutes of preallocation time. At proof power 10, 110M, the first proof residue would be reached ~19.2 minutes from the start on a Xeon Phi 7210 4-worker setup, vs. ~2.6 minutes preallocation time; even a single-worker setup, at 4.8 minutes to the first proof residue, would be OK in parallel. At the default power 8, with 0.66 minutes preallocation at 110M vs. ~77 minutes to reach the first residue to save, there's ample time for parallel preallocation in the 4-worker case, and also in the 2-worker or 1-worker case, or even up to 16-worker.

In some other processor / drive / exponent / proof-power combinations, there may be a need to cache the first residue in RAM, or to stall the worker once the first proof generation interim residue is reached until preallocation has completed. The longer we wait, the less worthwhile it becomes: SSDs replacing rotating drives may reduce preallocation time and so diminish the potential time savings. This would need to be a very quick modification to be worth the programming and test time, the risk of new bugs, and the additional complexity. As always, it's the authors' call whether any perceived optimization is worthwhile relative to other opportunities and priorities.

Note, this currently relates only to prime95 / mprime. Gpuowl supports proof powers up to 10 and does not preallocate.
Mlucas does not have PRP proof generation implemented and released yet, so its behavior is TBD.
Last fiddled with by kriesel on 2021-10-01 at 22:22 Reason: exponential proof file size growth with proof power; 3 example powers |
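The timing comparison in the post above (preallocation time versus time to the first proof residue, for proof powers 8, 10 and 12) can be sketched as follows. This is a rough illustration only: it takes the post's 110M estimates as inputs and assumes, as the post does, that preallocation time doubles with each proof power step, that the first residue is due after 1/2^power of the run, and that a worker's run time scales linearly with the number of workers sharing the Xeon Phi 7210. The function names are hypothetical.

```python
PREALLOC_MIN_POWER8 = 0.66   # estimated preallocation minutes at 110M, power 8
PRP_DAYS_4_WORKER = 13.67    # estimated PRP run time at 110M with 4 workers, days
MIN_PER_DAY = 1440


def prealloc_minutes(power):
    """Preallocation time, assumed to double per proof power step."""
    return PREALLOC_MIN_POWER8 * 2 ** (power - 8)


def first_residue_minutes(power, workers=4):
    """Time until the first proof residue must be saved."""
    run_minutes = PRP_DAYS_4_WORKER * MIN_PER_DAY * workers / 4
    return run_minutes / 2 ** power


def can_overlap(power, workers=4, margin=1.0):
    """True if preallocation finishes before the first residue is due."""
    return first_residue_minutes(power, workers) >= margin * prealloc_minutes(power)


if __name__ == "__main__":
    for power in (8, 10, 12):
        for workers in (1, 4, 8, 16):
            print(f"power {power:2d}, {workers:2d} workers: "
                  f"prealloc {prealloc_minutes(power):5.2f} min, "
                  f"first residue {first_residue_minutes(power, workers):6.2f} min, "
                  f"overlap ok: {can_overlap(power, workers)}")
```

A `margin` above 1.0 would express the "comfortable time margin" mentioned in the post; the value to use is a judgment call and is not specified there.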
2022-03-01, 07:22 | #25 |
Why don't we run GIMPS computation on ASICs?
Time, money, and talent.
Developing ASICs is costly and time consuming. https://electronics.stackexchange.co...stom-asic-made

Okay, how about Google cloud TPUs? Google already paid for R&D, design and production of several generations of TPU (tensor processing unit). (Or any other Big Budget Corp that has done similarly, if any.) No software has yet been developed, tested, and released for GIMPS computation on TPUs at the exponents and fft lengths of interest for current or future wavefronts' P-1 factoring or primality testing, or for TF at the exponents and bit levels of interest either.

TPUs were designed for massive parallelism and high performance, but at the low precision typical of neural networks. Bfloat16 has effectively an 8-bit mantissa. It would take a LOT of those instructions to equal the effect of 53-bit precision DP on a GPU: probably several dozen instructions to emulate a DP mul or add, plus custom coding to use those in place of library functions that employ lower precision. There are some threads discussing DP emulation using SP on GPUs, or related topics including NTT, including at least the following: https://www.mersenneforum.org/showthread.php?t=23926 https://www.mersenneforum.org/showthread.php?t=25977

A team of 7 undergraduate students under the guidance of danc2 and tdulcet made a start on code to perform probable prime tests of Mersenne numbers on a Google Colaboratory TPU, and has since stopped its effort. You can read about it on github or watch their concluding presentation video on youtube. While it implements a considerable feature set for so quick a development, it is not currently productive for GIMPS use. It is currently limited to exponents of ~4423 to ~9689 on TPU, because the fft library routines use bfloat16 low precision internally, which limits the available capacity for handling carries and results in excessive round-off error at higher exponents. It is currently orders of magnitude slower than existing GIMPS software.
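To illustrate the precision gap described above, the sketch below emulates bfloat16 in plain Python by truncating a float32 to its top 16 bits (real hardware typically rounds rather than truncates, so this is a simplification), and counts how many 8-bit pieces a 53-bit significand would have to be split into. The helper names are hypothetical, and this is not a description of how TensorPrime or any TPU library works internally.

```python
import math
import struct

BF16_SIGNIFICAND_BITS = 8   # 7 stored mantissa bits plus 1 implicit leading bit
DP_SIGNIFICAND_BITS = 53    # IEEE 754 double precision significand


def to_bfloat16(x):
    """Truncate x to bfloat16 precision by keeping the top 16 bits of its float32 form."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def limbs_needed(target_bits=DP_SIGNIFICAND_BITS, limb_bits=BF16_SIGNIFICAND_BITS):
    """Minimum number of limb_bits-wide pieces needed to hold target_bits."""
    return math.ceil(target_bits / limb_bits)


if __name__ == "__main__":
    print(to_bfloat16(1 + 2**-7))   # smallest step above 1.0 that survives
    print(to_bfloat16(1 + 2**-8))   # one bit finer is lost
    n = limbs_needed()
    print(f"{n} pieces per value, {n * n} partial products in a schoolbook multiply")
```

The partial-product count is only a lower-bound style estimate under these assumptions, since a practical emulation would also need headroom for carries, but it is consistent with the "several dozen instructions" figure.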
As I understand it, no appreciable optimization effort has been made yet, and some functions are performed on the CPU (carry propagation, each iteration), so traffic between the TPU and the associated CPU is very frequent and slows operation. Also, it is JIT compiled each time, which probably imposes considerable initial overhead for small exponents. (The bulk of the run time for M521 appears to be initialization.) It already includes the Gerbicz Error Check, provision for save files and resuming from them, small sets of arguments and settings, and some effort toward implementation of a PrimeNet interface. They've apparently named it TensorPrime.

It strikes me as an impressive and promising beginning. It may lead someday to a version that is useful for GIMPS production effort. Any volunteers interested in continuing the effort may contribute to the git repository or contact danc2 or tdulcet to coordinate effort.

Last fiddled with by kriesel on 2022-03-17 at 15:34 Reason: add description of the TensorPrime effort |
2022-05-01, 00:32 | #26 |
Why don't we use direct GPU-storage transfers in GIMPS apps?
See https://static.rainfocus.com/nvidia/...212001mbEW.pdf
Instead of the CPU and system RAM being involved in retrieving data from, or sending it to, an SSD or HD, and separately passing it to or from the GPU, let the GPU and storage device talk directly, for speed. This requires use cases, and hardware that supports it, at enough prevalence that it's worthwhile for the GIMPS software authors to support it in their applications. There might not be enough use cases, or enough such gear in participants' possession, to be worthwhile. Also, the authors would need to know such capability existed. It would ideally be available on both NVIDIA and AMD, Windows and Linux.
AMD: https://gpuopen.com/direct-storage-support/
Windows: https://news.xbox.com/en-us/2021/06/...er-for-gaming/
Linux: https://www.reddit.com/r/linux_gamin...the_works_for/
Last fiddled with by kriesel on 2022-05-01 at 01:29 |