kriesel
2021-09-24, 00:03   #23

Why don't we compute GCD for a P-1 stage in parallel with the next stage or assignment?

See discussion beginning at https://mersenneforum.org/showpost.p...&postcount=439

Most computations in prime95 / mprime and Mlucas are done multicore at this point; the P-1 GCD is an exception. The current sequence is: run P-1 stage 1 multicore, then the stage 1 GCD on a single core, then (if stage 1 found no factor and available memory is sufficient) P-1 stage 2 multicore, then the stage 2 GCD on a single core again. That leaves most of a worker's cores idle for the duration of each GCD.
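To make the idea concrete, here is a minimal C++ sketch of the overlap. It is not taken from prime95 / mprime or Mlucas: it assumes GMP's mpz_gcd for the single-core GCD and std::async for the background thread, and start_stage2_speculatively() / abandon_stage2() are hypothetical placeholders for whatever the host program does next.

Code:
// Minimal sketch, not actual GIMPS client code. Assumes GMP for the
// bignum GCD; the stage 2 functions are hypothetical placeholders.
#include <gmp.h>
#include <future>

static void start_stage2_speculatively() { /* multicore work resumes here */ }
static void abandon_stage2()             { /* discard speculative progress */ }

// gcd(residue - 1, 2^p - 1) > 1 means this P-1 stage found a factor.
static bool gcd_finds_factor(mpz_srcptr stage_residue, mpz_srcptr mersenne) {
    mpz_t g, rm1;
    mpz_inits(g, rm1, NULL);
    mpz_sub_ui(rm1, stage_residue, 1);
    mpz_gcd(g, rm1, mersenne);              // the long single-core step
    bool found = mpz_cmp_ui(g, 1) > 0;
    mpz_clears(g, rm1, NULL);
    return found;
}

void after_stage1(mpz_srcptr stage1_residue, mpz_srcptr mersenne) {
    // Kick the stage 1 GCD onto one spare core...
    std::future<bool> gcd1 = std::async(std::launch::async,
                                        gcd_finds_factor,
                                        stage1_residue, mersenne);
    // ...while the remaining cores proceed speculatively.
    start_stage2_speculatively();
    if (gcd1.get())          // factor found: ~2% chance at optimal bounds
        abandon_stage2();    // the speculative work was unnecessary
}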

Gpuowl already does this. In a case with multiple Radeon VII GPUs served by a single slow CPU, the sequential GCD was taking about 5 minutes of a 40-minute wavefront P-1 to near-optimal bounds, leaving the GPU idle during the GCD. Running the GCD in parallel while speculatively proceeding with the next stage or assignment added ~14% to throughput. With near-optimal bounds applied, the chance that the speculative progress on the same exponent (the following stage or PRP) turns out to be unnecessary is ~2%. If the following work is for a different exponent, there is no potential loss at all.
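As a rough consistency check on that figure (an inference from the numbers above, not a separate measurement): cutting the 5 idle minutes out of a 40-minute P-1 shortens it to 35 minutes, and 40/35 = 1.143, i.e. ~14% more P-1 attempts per unit time; the ~2% of occasionally wasted speculative work barely dents that.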

The potential gain in CPU applications such as prime95 / mprime or Mlucas appears smaller. Ballpark calculations indicate a saving of order 0.075% to 0.26% of P-1 time, depending on bounds and on the number of cores per worker. The analysis neglects the briefly higher speed a multicore worker may see on resuming full multicore operation, thanks to package cooldown during the reduced-core-count serial GCD.

Since, at optimized bounds and limits, TF and P-1 each take about 1/40 as long as a primality test, P-1 is about 1/42 of the total (TF + P-1 + PRP) time per exponent, and the possible overall gain per exponent is diluted by a factor of about 42. That works out to ~62 ppm of exponent (TF + P-1 + PRP) time in one case (880M on a 16-core Xeon Phi 7210 worker). In the i3-9100 (4 cores, no hyperthreading) 27.4M case, 5.05 sec x 2 GCDs / 2 hr 29 min x 3 cores regained / 4 cores = 0.075% of P-1 time, which corresponds to ~0.075% x 1/42 = 18 ppm of (TF + P-1 + PRP) time.
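The structure of that estimate, expressed as a hypothetical helper (the constants follow the post above; the measured cases fold in actual stage and GCD times, so this won't reproduce their exact figures):

Code:
// Back-of-envelope structure of the estimate above; not from any GIMPS source.
// Two GCDs per P-1 attempt; during each serial GCD, (n-1) of the worker's
// n cores sit idle and would be regained by overlapping.
double p1_fraction_saved(double gcd_sec, double p1_sec, double cores) {
    return 2.0 * gcd_sec / p1_sec * (cores - 1.0) / cores;
}

// TF and P-1 each take ~1/40 of a PRP test, so P-1 is ~1/42 of the total
// (TF + P-1 + PRP) time; the whole-exponent gain is diluted by that factor.
double exponent_fraction_saved(double p1_fraction) {
    return p1_fraction / 42.0;
}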

Where hyperthreading is available, the full core count might remain available and productive for the parallel speculative execution of the next work while waiting for the GCD to complete. Where hyperthreading is not available, it might be necessary to temporarily reduce that worker's core count by one while the GCD runs on the freed core. The figures above include that effect: regaining the productivity of n-1 cores out of the n allocated to a worker. With hyperthreading used for the GCD, the saving may be as high as ~66 ppm of exponent time (~0.28% of P-1 time).
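One hypothetical way to get the hyperthreaded behavior without surrendering a worker core, assuming default Linux scheduling (not something prime95 or Mlucas is known to do): run the GCD in a deprioritized background thread, so the scheduler tends to park it on a spare hardware thread.

Code:
// Hypothetical Linux-specific sketch; setpriority() on a thread id is a
// Linux quirk that deprioritizes just that one thread.
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <thread>

void gcd_in_background(/* residue, modulus */) {
    std::thread([] {
        // Lowest priority: the scheduler favors the worker threads for
        // full cores, leaving the GCD to a hyperthread sibling.
        setpriority(PRIO_PROCESS, (id_t)syscall(SYS_gettid), 19);
        // ... run the mpz_gcd here, then report the result ...
    }).detach();   // worker threads keep running at full count meanwhile
}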

That such maneuvers are not employed in mprime / prime95 or Mlucas may indicate that the authors, if they have evaluated the idea, concluded their time is better spent elsewhere, or that there are higher priorities. The average of the 66 and 18 ppm possible gains is equivalent to adding 2/3 of a computer to the 15,761 seen active on the project in the past 30 days, or to finishing a year's running about 22 minutes sooner.


Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
