#1 |
Oct 2007
Manchester, UK
1373₁₀ Posts |
Does anyone know of some utility that can handle P-1 on these monsters?
v29.8 of Prime95 says it only accepts exponents up to 595,800,000, which I assume corresponds to its maximum FFT size. A few years ago LaurV made a post about potentially implementing P-1 in CUDA, which sounds encouraging, but I don't know if he or anyone else got much further. https://www.mersenneforum.org/showpo...3&postcount=11
At least Prime95 can give optimal bounds for P-1. If I put in a candidate TF'd to 86 bits, it recommends B1=B2=44,680,000, with no stage 2, due to RAM limitations I believe. This doesn't sound completely unreasonable, and offers a 3.53% chance of a factor, which is slightly higher than the chance of a factor from continuing TF up to 89 bits (3/89 ~ 3.37%).
If I say the candidate has been TF'd to 91 bits instead (which seems to be roughly where GPU TFing should stop), then Prime95 offers the bounds B1=B2=30,920,000 with a 2.07% chance of a factor. It seems a bit odd to me that the bounds are LOWER when TF has progressed more, but alright.
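The 3/89 estimate can be cross-checked against the usual GIMPS rule of thumb that a Mersenne number has a factor between 2^(k-1) and 2^k with probability of roughly 1/k. The following is a minimal Python sketch assuming that heuristic; the function name is made up for illustration, and the 3.53% P-1 figure is simply the value Prime95 reported above, not something computed here.

```python
def tf_factor_chance(bits_done: int, bits_target: int) -> float:
    """Approximate chance that TF from bits_done up to bits_target finds a factor.

    Assumes the rule of thumb that each bit level k contributes about 1/k.
    """
    return sum(1.0 / k for k in range(bits_done + 1, bits_target + 1))


tf_chance = tf_factor_chance(86, 89)   # 1/87 + 1/88 + 1/89
crude_estimate = 3 / 89                # the shortcut used in the post
p1_chance_reported = 0.0353            # Prime95's figure for B1=B2=44,680,000

print(f"TF 86->89 (sum of 1/k): {tf_chance:.4%}")
print(f"TF 86->89 (3/89):       {crude_estimate:.4%}")
print(f"P-1 as reported:        {p1_chance_reported:.4%}")
```

Under this assumption the per-level sum comes out slightly above 3/89, at about 3.4%, but still below the reported P-1 chance, so the comparison in the post holds either way.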
#2 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts |
On an FMA3-capable system prime95 should be capable of going to 920M (50M fft since V29.2). I'm running 701M now on 29.7b1 x64.
https://www.mersenneforum.org/showpo...&postcount=218
CUDAPm1 has been around for years but doesn't reach that high, due to various issues, although it nominally supports sufficiently large fft lengths on gpus with sufficient ram. https://www.mersenneforum.org/showthread.php?t=23389
Quote:
Last fiddled with by kriesel on 2019-06-04 at 06:26 |
#3 |
Oct 2007
Manchester, UK
55D₁₆ Posts |
#4 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts |
Quote:
#5 |
Oct 2007
Manchester, UK
55D₁₆ Posts |
Perhaps in time the limits will be raised such that stage 1 for these numbers will be possible, though I completely understand why enabling such functionality is not exactly top priority.
As for memory usage, is there any possibility that it could be lowered if the second stage were broken down into multiple chunks, similar to stage 2 of ECM?
#6 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts |
Quote:
Of course, they are all going to be impacted by the roughly p^2.2 run time scaling also. The 701M P-1 run on my i7-8750H (all cores, one worker) took 32.3 days at NRP~25, while a recent 430M P-1 on a 3GB GTX1060 took ~5 days at NRP=5 (for both stages, no factor found). Those would scale to ~989 days and ~449 days respectively, per P-1 on a gigadigit candidate. Note that run time also lengthens when NRP goes toward 1 due to memory size limitations.
(CUDAPm1 reference info: https://www.mersenneforum.org/showthread.php?t=23389; re prime95 see https://www.mersenneforum.org/showthread.php?t=23900)
Last fiddled with by kriesel on 2019-06-11 at 19:26
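The scaled estimates in this post follow from applying the p^2.2 rule to the two measured runs. Here is a rough Python sketch of that arithmetic. It assumes the gigadigit exponent 3321928171 mentioned elsewhere in this thread and rounds the source exponents to 701M and 430M, so the results land near, not exactly on, the quoted figures; the function and constant names are illustrative only.

```python
GIGADIGIT_EXPONENT = 3_321_928_171  # assumed target; an OBD exponent named in this thread


def scale_run_time(days: float, p_from: float, p_to: float, power: float = 2.2) -> float:
    """Scale a measured run time from exponent p_from to p_to, assuming time ~ p**power."""
    return days * (p_to / p_from) ** power


# Measured runs from the post (exponents rounded, so results are approximate).
cpu_days = scale_run_time(32.3, 701e6, GIGADIGIT_EXPONENT)  # i7-8750H, prime95
gpu_days = scale_run_time(5.0, 430e6, GIGADIGIT_EXPONENT)   # GTX1060, CUDAPm1

print(f"i7-8750H estimate: ~{cpu_days:.0f} days")
print(f"GTX1060 estimate:  ~{gpu_days:.0f} days")
```

This ignores the NRP effect noted above, so real runs with constrained memory would likely take longer.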
#7 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts |
Mlucas v20.x added P-1 factoring, and has sufficiently large fft lengths to tackle (slowly; it's a big job) gigadigit Mersenne P-1. The P-1 feature is maturing, with beta testing ongoing and bugs occasionally found, reported, and fixed. So OBD P-1 is now becoming feasible.
I propose the following:
#8 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts |
Clarifying P-1 stage memory requirements for Mlucas and OBD candidates:
Quote:
Mlucas will support performing subdivisions of stage 2 on multiple systems. Each system so employed will need adequate RAM for running stage 2. Subdivision of stage 2 bounds reduces total calendar time for a run, but does not reduce the memory requirement per system.
Last fiddled with by kriesel on 2021-10-27 at 18:23
#9 | |
Jun 2003
The Computer
191₁₆ Posts |
Thanks to James, I was able to get mersenne.ca/obd to support reservations to 92 bits. Thus, the 3321928171 exponent is now available to that bit depth.
Quote:
With DDR5, and presumably ECC RAM, reaching the mainstream, within 1-2 years we should have an exponent factored to 92 bits, along with boxes with DDR5 RAM and a processor like a 12th Gen i9 or the upcoming Zen 4, which would remove the need for multiple processors.
#10 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts |
I think we can make some adjustments as situations arise.
If someone runs stage 1 and realizes they won't be able to complete stage 2 in time, or at all, they could:
The point of writing a post about how to coordinate is to hopefully promote efficiency. One could run stage 1 on a 16 GiB system, copy the files, and split stage 2 across 2 or more systems containing at least ~64 GiB RAM each, apportioning the stage 2 bounds so the systems working in tandem finish at about the same time.
Apportion stage 2 delta bounds according to iters/sec throughput for the relevant fft length. Selftest in Mlucas produces the reciprocal, time/iteration, expressed in msec/iter, in mlucas.cfg.
Approximate example: if system A is twice as fast as system B, the two begin work at the same time, and B2start = B1 (an oversimplification) = 17,000,000, then B2range = 1G - 17M = 983M and B2range/3 = ~327M. System A runs stage 2 for bounds 17M to 327*2+17 = 671M; system B runs stage 2 for bounds 671M to 1G.
This would be good practice for participating in Ernst's planned significantly parallel performance of F33 P-1 stage 2, which will require ~208 GiB RAM per system.
I expect multiple OBD candidates to be TF completed to 92 bits within 6 months. On an RTX2080, OBD TF 89-90 bits is ~19 days, so to 91 is ~+38 days and to 92 is ~+76 days. (All using mfaktc GPU kernel "barrett92_mul32_gs".) Starting from 88 done (level 22), that would be ~10 + 19 + 38 + 76 = 143 days.
While the faster GPUs climb the mountain to 92, lesser GPUs can work on lower remaining bit levels on additional exponents. On a GTX 1650, exp=3321929519 bit_min=87 bit_max=88 (9435.16 GHz-days) is ~12.7 days, so 88-89 takes ~25 days, 89-90 ~51 days, etc.
I consider used dual-Xeon or Xeon Phi workstations or servers a good combination of features, performance and cost for PRP or P-1.
Last fiddled with by kriesel on 2021-10-29 at 15:33
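The two pieces of arithmetic in this post, splitting the stage 2 range by throughput and doubling TF time per bit level, can be sketched in Python as follows. This is only an illustration under the post's own simplifications (B2start = B1, B2 = 1G, time doubling with each bit level); the function names are hypothetical and nothing here reads mlucas.cfg or calls Mlucas or mfaktc.

```python
def split_stage2(b2_start: int, b2_end: int, throughputs: list[float]) -> list[tuple[int, int]]:
    """Split [b2_start, b2_end] into contiguous ranges sized in proportion to throughput.

    throughputs are relative iters/sec values; from msec/iter use 1000.0 / msec.
    """
    total = sum(throughputs)
    span = b2_end - b2_start
    ranges, lo, acc = [], b2_start, 0.0
    for i, rate in enumerate(throughputs):
        acc += rate
        is_last = i == len(throughputs) - 1
        # Pin the final boundary to b2_end so rounding never leaves a gap.
        hi = b2_end if is_last else b2_start + round(span * acc / total)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def tf_days(base_days: float, base_level: int, level: int) -> float:
    """Estimate TF days for the bit level ending at `level`, assuming time doubles per level."""
    return base_days * 2.0 ** (level - base_level)


# System A is twice as fast as system B; B2start = B1 = 17M, B2 = 1G.
stage2_ranges = split_stage2(17_000_000, 1_000_000_000, [2.0, 1.0])
print(stage2_ranges)

# RTX2080: 89-90 bits measured at ~19 days; estimate levels ending at 89 through 92.
rtx2080_days = [tf_days(19.0, 90, lvl) for lvl in range(89, 93)]
rtx2080_total = sum(rtx2080_days)
print(rtx2080_days, rtx2080_total)
```

Because the sketch does not truncate B2range/3 to 327M, its A/B boundary falls near 672M, not the 671M in the hand calculation; the difference is negligible at this scale.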
#11 |
Jun 2003
The Computer
401 Posts |
That sounds good. I think Stage 1 and Stage 2 should be checked out separately, but the user who completed Stage 1 will have preference to complete Stage 2, just like the user who completed the TF to 92 bits would have preference for Stage 1. Then the times could be reduced, i.e. Stage 1 would be forfeited after one year or six months with no progress report. Stage 2 would probably be more lenient, especially if multiple users are working on the same exponent simultaneously.
This would be good for many users. In my case I could run it on my server, but it would probably be too slow to do both stages solo. If I split Stage 2 with Ken's dual 12 core machine, for example, it would speed things up and still utilize the 128 GB of ECC RAM I have available.