Old 2021-09-26, 12:55   #24
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

31·191 Posts
Why don't we preallocate PRP proof disk space in parallel with the computation?

(Some of this was originally posted as parts of https://mersenneforum.org/showpost.p...&postcount=441 and https://mersenneforum.org/showpost.p...&postcount=443)

In recent versions of mprime / prime95, the entire disk space footprint for PRP proof generation temporary residues is preallocated before the PRP computation begins. On a Xeon Phi 7210 this was observed to take about 3 minutes for a 500M PRP, using a single core of a 16-core worker whose other cores were stalled in the meantime.

Why not run the preallocation on one core, and the initial PRP iterations on the remaining cores of the worker in parallel? One could compute a time estimate for the space preallocation and a time estimate for when the first proof residue must be deposited, parallelize only when there's a comfortable time margin, and also ensure the first residue save waits for completion of the preallocation.

Preallocating PRP proof power 8 space (15.6 GB) took 3 minutes at 500M on a Xeon Phi 7210 in Windows 10 with a rotating drive.
Forecast PRP time is 328.5 days ~ 473040 minutes. 3/473040 x 15 cores/16 cores = ~6 ppm of PRP time saved. This is a microoptimization. Use of hyperthreading may allow a slightly higher saving by using n rather than n-1 worker cores for the PRP; in that subcase the preallocation runs on a thread using a different logical core (hyperthread).
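As a quick check of the arithmetic above, here is a minimal Python sketch. The function name and parameters are made up for illustration; the inputs are the figures quoted in this post.

```python
def ppm_saved(prealloc_min, prp_days, cores_per_worker):
    """Fraction of PRP run time saved, in ppm, if the other cores of the
    worker compute during preallocation instead of sitting idle."""
    prp_min = prp_days * 1440  # minutes per day
    # One core does the preallocation; the remaining cores could be iterating.
    busy_fraction = (cores_per_worker - 1) / cores_per_worker
    return prealloc_min / prp_min * busy_fraction * 1e6


# 3 minutes of preallocation, 328.5 day forecast, 16-core worker
print(round(ppm_saved(3.0, 328.5, 16), 2))
```

This lands just under 6 ppm for the 500M case.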

Proof generation disk space is proportional to exponent and exponential with proof power, so presumably preallocation time is ~linear with exponent, while PRP run time is proportional to ~exponent^2.1. So at 110M, preallocate time at proof power 8 is ~110/500 x 3 min = 0.66 minutes; run time is ~(110/500)^2.1 x 328.5 d = 13.67 days; and the ratio is 0.66 min / (13.67 days x 1440 minutes/day) = ~34 ppm, substantially more than for larger exponents.
If there are no truncation losses, 34 ppm is equivalent to adding 34e-6 x 15700 computers on GIMPS in the past month = 0.53 computers, or increasing an assumed average clock rate of 3GHz by 102. kHz.
At (gpuowl maximum) proof power 10 the file would be 4 times larger so presumably take 4 times longer to preallocate, 2.64 minutes; at mprime/prime95 max proof power 12 the file would be yet 4 times larger so presumably take ~10.6 minutes to preallocate.
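The scaling estimates above can be reproduced with a short Python sketch. It assumes, as the post does, that preallocation time is linear in exponent and multiplies by 4 for every two steps of proof power (so doubles per step), and that run time goes as exponent^2.1. The constant and function names are illustrative only.

```python
# Reference measurement from this post: 500M, proof power 8, Xeon Phi 7210.
REF_EXPONENT = 500e6
REF_POWER = 8
REF_PREALLOC_MIN = 3.0
REF_PRP_DAYS = 328.5


def prealloc_minutes(exponent, power):
    """Assumed: linear in exponent, doubling per proof power step."""
    return REF_PREALLOC_MIN * (exponent / REF_EXPONENT) * 2 ** (power - REF_POWER)


def prp_days(exponent):
    """Assumed: run time proportional to exponent^2.1."""
    return REF_PRP_DAYS * (exponent / REF_EXPONENT) ** 2.1


def overhead_ppm(exponent, power):
    """Preallocation time as a fraction of PRP run time, in ppm."""
    return prealloc_minutes(exponent, power) / (prp_days(exponent) * 1440) * 1e6


for power in (8, 10, 12):
    print(power, round(prealloc_minutes(110e6, power), 2))
print(round(prp_days(110e6), 2), round(overhead_ppm(110e6, 8), 1))
```

Note the ppm figure here omits the 15/16 core factor, matching the 110M calculation in the text.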

A rough estimate of the time from the start of preallocation and PRP iterations to the first proof residue save, for prime95's max supported proof power 12 (which gives the earliest residue save), at 110M, 4-worker: 13.67 d / 2^12 x 1440 min/day = 4.8 minutes. So in this case (Xeon Phi 7210, 8 workers or fewer) there is not sufficient time to fully preallocate in parallel at max proof power, given the projected 10.6 minutes of preallocate time.

At proof power 10, 110M, first proof residue time would be ~19.2 minutes from start on the Xeon Phi 7210 4-worker, vs. ~2.6 minutes preallocate time; even a single-worker setup, at 4.8 minutes to the first proof residue, would be ok in parallel.
At the default power 8, with 0.66 minutes preallocate time at 110M vs. ~77 minutes to reach the first residue to save, there's ample time for parallel preallocation in the 4-worker case, and also in the 2-worker or 1-worker case, or even up to 16-worker.
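The comparisons above can be tabulated with a small Python sketch. It assumes the first residue is due after 1/2^power of the run, and that time to first residue scales linearly with worker count relative to the 4-worker figure (which is what the 19.2 vs. 4.8 minute numbers imply). Names are illustrative, and no safety margin is applied.

```python
# 110M figures from this post (Xeon Phi 7210, 4-worker run time).
PRP_DAYS_110M_4WORKER = 13.67
PREALLOC_MIN_POWER8 = 0.66


def prealloc_minutes(power):
    """Assumed: preallocation time doubles per proof power step."""
    return PREALLOC_MIN_POWER8 * 2 ** (power - 8)


def first_residue_minutes(power, workers):
    """Assumed: first residue due after 1/2^power of the run, and run time
    per worker scales linearly with the number of workers sharing the CPU."""
    return PRP_DAYS_110M_4WORKER * 1440 / 2 ** power * workers / 4


def can_overlap(power, workers):
    """True if preallocation would finish before the first residue is due."""
    return prealloc_minutes(power) < first_residue_minutes(power, workers)


for power in (8, 10, 12):
    ok = [w for w in (1, 2, 4, 8, 16) if can_overlap(power, w)]
    print(f"power {power}: overlap fits for worker counts {ok}")
```

A real implementation would want a comfortable margin rather than a bare less-than test.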

In some other processor/drive/exponent/proof-power combinations, there may be a need to cache the first residue in RAM, or to stall the worker when the first proof generation interim residue is reached until the preallocation has completed.
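To make the idea concrete, here is a toy Python sketch of the "stall only if needed" variant. It is not prime95 / mprime code and says nothing about how those programs are structured; the function names, the placeholder iteration loop, and the use of truncate() as a stand-in for real preallocation are all assumptions for illustration.

```python
import os
import tempfile
import threading


def preallocate(path, size_bytes, done):
    """Reserve the file's full size, then signal completion."""
    with open(path, "wb") as f:
        # Stand-in for real preallocation; truncate() may create a sparse
        # file on some file systems rather than reserving blocks.
        f.truncate(size_bytes)
    done.set()


def run_prp(path, residue_size, n_residues, iters_per_residue):
    """Toy worker loop: iterate while preallocation runs in the background,
    and block only at a residue save if preallocation is not yet done."""
    done = threading.Event()
    t = threading.Thread(target=preallocate,
                         args=(path, residue_size * n_residues, done))
    t.start()
    for k in range(n_residues):
        for _ in range(iters_per_residue):
            pass  # placeholder for the PRP iterations
        residue = bytes([k % 256]) * residue_size  # placeholder residue data
        done.wait()  # returns immediately once preallocation has finished
        with open(path, "r+b") as f:
            f.seek(k * residue_size)
            f.write(residue)
    t.join()
    return os.path.getsize(path)


if __name__ == "__main__":
    scratch = os.path.join(tempfile.mkdtemp(), "proof.tmp")
    print(run_prp(scratch, residue_size=16, n_residues=4, iters_per_residue=1000))
```

The RAM-caching alternative would replace the done.wait() stall with buffering the residue and writing it out later.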

The longer we wait the less worthwhile it becomes. SSDs replacing rotating drives may reduce preallocate time and so diminish the potential time savings.

That would need to be a very quick modification to be worth the programming and test time and risk of new bugs and additional complexity. As always, it's the authors' call whether any perceived optimization is worthwhile relative to other opportunities and priorities.

Note, this currently relates only to prime95 / mprime. Gpuowl supports proof powers up to 10 and does not preallocate. Mlucas does not have PRP proof generation implemented and released yet, so its behavior is TBD.


Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-10-01 at 22:22 Reason: exponential proof file size growth with proof power; 3 example powers