mersenneforum.org P-1 on OBD candidates
 User Name Remember Me? Password
 Register FAQ Search Today's Posts Mark Forums Read

2023-04-21, 01:59   #45
chalsall
If I May

"Chris Halsall"
Sep 2002

22·5·571 Posts

Quote:
 Originally Posted by kriesel Hence, why ECC ram, time limits on run lengths, and qualifying run-time scaling with re-finding known factors are proposed.
Dude! Even with all of that, unless you have people in your matrix, you're doomed.

2023-04-21, 12:45   #46
Andrew Usher

Dec 2022

2×3×7×13 Posts

Quote:
 Originally Posted by kriesel ... for OBD P-1, Mlucas v20.x will be required, which does periodic GCDs during stage 2, lowering stage 2 cost in the case of a stage 2 factor found.
This is interesting; it would seem a better approach. It can of course be simulated in prime95 using many runs (also possible for stage 1) but that generally loses performance, while this should take only the GCD time itself.

I immediately ask myself it is changes optimum bounds. Clearly, it can't change the optimum B2 for a fixed B1 - but should B1 remain fixed? I think the answer is, theoretically, no. As the average run time is reduces for the same factor probability, the marginal benefit should favor increasing B1 (and so also B2, we can take a first approximation that B2/B1 should be fixed). That is small, though, as only if a factor is actually found in stage 2 does it come into action. Mental calculation beats James's calculator for differences this small - simply work out the time saved and the increased factor probablity with bounds - I get a rough estimate that the increase should be about 1.5% for your conditions - not really enough to worry about, and smaller than the effect of the extra bit of TF you do over the 'target'.

 2023-04-23, 16:40 #47 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 8,069 Posts IIRC, the OBD TF target of 92 bits was chosen based on the high GHD demand making use of recent NVIDIA GPUs likely, which have high TF performance and extremely low DP/SP performance ratio such as 1/64, which pushes the optimal use of such GPUs to higher TF bits than the usual target. And OBD TF software was ready, long before OBD P-1 software was. Improvements in efficiency of a P-1 stage will increase the optimal bound of that stage, and affect the optimal bound for the other stage slightly. Addition of polynomial multiplication in prime95 improved its stage 2 efficiency considerably, drastically increasing the optimal stage 2 bound and somewhat increasing the optimal stage 1 bound. Addition of polynomial multiplication to gpuowl stage 2 is unlikely, since most GPUs lack sufficient onboard ram to make good use of it. Addition of polynomial multiplication to Mlucas could help OBD, but I don't expect that to occur within the year or longer, if ever. Extension of prime95 to much larger fft lengths (~192M) would bring polynomial multiplication accelerated P-1 stage 2 to OBD, but that also is not expected to occur any time soon. Last fiddled with by kriesel on 2023-04-23 at 16:41
 2023-04-25, 12:45 #48 Andrew Usher   Dec 2022 2·3·7·13 Posts I am sure your decision to TF to 92 was sensibly based on your GPU capability, yes, so that's not in question. Although the periodic GCDs seem to make _stage 2_ more efficient, they do not actually raise the optimum B2/B1 - as I omitted a proof I give this: fixing B1, the optimal B2 is when the marginal increase in factors found falls below the marginal cost of a longer stage 2 (compared with the primality test), and neither is changed by the periodic GCDs. It is, as I said, rather raising optimal B1 that is the primary effect.
 2023-04-25, 14:26 #49 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 806910 Posts OBD TF to 92 bits might be 1 bit too high to be compute-effort-optimal. But it may be closer to calendar time optimal. Calendar could be compressed further by doing the last bit of TF in parallel with the previous bits and the start of P-1 on stage 1, on 3 distinct runs & hardware, at some risk of wasted effort if one of them finds a factor. Periodic GCD in P-1 stage 2 reduces stage2 cost when a factor is found early. Optimal P-1 bounds are whatever maximizes the probable savings in total compute time per exponent on the average. I think that will be where the partial derivatives of total compute time with B1 or B2 are zero, within the constraint of allowable ram. Mersenne.ca target bounds probably do not reflect the effect of periodic stage 2 GCD, at least within the 0-1G range or prime95 range ~1.169G, if not 10G. Early termination on factor found would I think cause slightly higher B2 to be optimal, by lowering average cost up to a specified B2. Finding optimal B1,B2 for the Mlucas case of early and often stage 2 GCD is more complex than for the prime95, gpuowl, cudapm1 etc case of performing stage 2 GCD once at the end. I note that in GIMPS software currently, GCD is a single-core computation. To my knowledge no one has implemented for GIMPS software a multithreaded GCD algorithm. This can be confirmed by viewing source code, by cpu core utilization during GCDs, and by observing a increase in performance of prime95 when running with Mlucas during Mlucas GCDs. Gpuowl: stage 1 GCD is done on cpu in parallel (as a separate thread) with speculative start of stage 2 on the GPU; stage 2 GCD is done in parallel (as a separate thread) with the next worktodo item if available occupying the GPU. prime95: GCD is done sequentially with other work for the worker, so n-1 cores go idle for the duration of the GCD on an n core worker. Mlucas: GCD is done sequentially, so n-1 cores go idle for the duration of the GCD in an n-core instance. CUDAPm1: GCDs are done on one cpu core, so the GPU goes idle for the duration of the GCD. For currently plausible numbers of cpu cores, the potential performance gain from parallelizing GCD seems small. https://www.wisdom.weizmann.ac.il/~oded/PDF/gcd.pdf https://oaciss.uoregon.edu/icpp18/pu...23s2-file1.pdf Last fiddled with by kriesel on 2023-04-25 at 14:31
2023-04-29, 12:43   #50
Andrew Usher

Dec 2022

2×3×7×13 Posts

Quote:
 Originally Posted by kriesel I think that [optimal bounds] will be where the partial derivatives of total compute time with B1 or B2 are zero ...
This is the correct rule and how I drew my conclusion. The extra GCDs _do not_ change the derivative with B2, as the new factors found with a marginal increase in it occur at the end of stage 2 where there's no time saving. They _do_ change that with B1, because the numerator (factor probability) remains the same, while the denominator (time) falls. It is still nothing to worry about as P-1 efficiency is a very flat plateau.

 2023-07-03, 17:18 #51 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 8,069 Posts The additional stage 2 GCDs have nonzero cost and nonzero probable benefit to overall compute time. If chosen correctly they can reduce average cost of stage 2 to a given bound, by up to perhaps ~1/2 stage 2 compute time x probability of stage 2 factor found, changing the total-compute-time (of P-1 and PRP) cost function vs. bounds, and therefore likely changing the average-probable-compute-time-optimal B2 (slightly). Optimal B2 increased greatly with recent efficiency improvements in prime95. The additional cost saving due to additional stage 2 GCDs is a small effect; very roughly 3% stage 2 factor probability, times ~40% stage 2 run time, times ~2.5% ratio P-1 time/prp time overall, times ~1/2 stage 2 as a total of P-1 run time, .03 0.4 /40 /2 ~ 0.015% of PRP time, or 0.6% of P-1 time. Number of GCDs is an additional optimization variable. GCD time scales more slowly with exponent than PRP time does, so optimization would favor more GCDs at larger exponent. The benefit was real enough & implementation straightforward enough for Ernst Mayer to implement interim stage 2 GCDs in Mlucas. (~0.6% of a year on F33 or OBD depending on hardware is ~2.2 days saved per run on average.) Given a choice to think Ernst got it wrong, or Andrew Usher, I'll bet on Ernst being right. He's been at this for decades, publicly. The payoff functions in total probable compute time over many exponents of unknown factorability other than TF history, could vary between polynomial multiplication implementation of P-1 and other implementations. Which is to say, the cost of additional GCDs might make sense in Mlucas P-1 implementation and not in v30.8 or later mprime's polynomial multiplication implementation. I would also trust George to get that call right, as well as prioritizing other larger efficiency or reliability improvements. Just yesterday Gerbicz posted about GEC applicability to ECM stage 1, which might appeal to George as a future enhancement. Last fiddled with by kriesel on 2023-07-03 at 17:26

 Similar Threads Thread Thread Starter Forum Replies Last Post fivemack NFS@Home 1 2014-11-30 07:52 pepi37 Riesel Prime Search 8 2014-04-17 20:51 japelprime Prime Sierpinski Project 2 2011-12-28 07:38 wblipp Operation Billion Digits 6 2011-04-10 17:45 cochet Miscellaneous Math 4 2008-10-24 14:33

All times are UTC. The time now is 10:07.

Sun Sep 24 10:07:12 UTC 2023 up 11 days, 7:49, 0 users, load averages: 0.75, 0.87, 0.81

Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔