mersenneforum.org Prime95 30.8 (big P-1 changes, see post #551)

2021-11-24, 12:21   #1
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

17547₈ Posts
Prime95 30.8 (big P-1 changes, see post #551)

Quote:
 Originally Posted by petrw1 How low is "low"?
On my quad core, 8GB machine:

version 30.7:

Code:
[Work thread Nov 23 11:19] P-1 on M26899799 with B1=1000000, B2=30000000
[Work thread Nov 23 11:19] Using FMA3 FFT length 1440K, Pass1=320, Pass2=4608, clm=2, 4 threads using large pages
[Work thread Nov 23 12:03] M26899799 stage 1 complete. 2884382 transforms. Total time: 2612.637 sec.
[Work thread Nov 23 12:03] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.522 sec.
[Work thread Nov 23 12:03] D: 420, relative primes: 587, stage 2 primes: 1779361, pair%=97.87
[Work thread Nov 23 12:03] Using 6856MB of memory.
[Work thread Nov 23 12:03] Stage 2 init complete. 1267 transforms. Time: 6.631 sec.
[Work thread Nov 23 12:51] M26899799 stage 2 complete. 1947219 transforms. Total time: 2869.270 sec.
[Work thread Nov 23 12:51] Stage 2 GCD complete. Time: 5.941 sec.
[Work thread Nov 23 12:51] M26899799 completed P-1, B1=1000000, B2=30000000, Wi8: B63215A0
version 30.8:
Code:
[Work thread Nov 24 05:56] P-1 on M26899981 with B1=1000000, B2=30000000
[Work thread Nov 24 05:57] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.500 sec.
[Work thread Nov 24 05:57] Switching to FMA3 FFT length 1600K, Pass1=640, Pass2=2560, clm=1, 4 threads using large pages
[Work thread Nov 24 05:57] Using 6788MB of memory.  D: 1050, 120x403 polynomial multiplication.
[Work thread Nov 24 05:58] Stage 2 init complete. 2842 transforms. Time: 11.922 sec.
[Work thread Nov 24 06:15] M26899981 stage 2 complete. 360415 transforms. Total time: 1052.009 sec.
[Work thread Nov 24 06:15] Stage 2 GCD complete. Time: 5.965 sec.
[Work thread Nov 24 06:15] M26899981 completed P-1, B1=1000000, B2=30000000, Wi8: B63F15AE
At 27M stage 2 is 2.7x faster.

version 30.7:
Code:
[Work thread Nov 24 06:21] P-1 on M9100033 with B1=1000000, B2=30000000
[Work thread Nov 24 06:21] Using FMA3 FFT length 480K, Pass1=384, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 06:33] M9100033 stage 1 complete. 2884376 transforms. Total time: 731.323 sec.
[Work thread Nov 24 06:33] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.594 sec.
[Work thread Nov 24 06:33] D: 924, relative primes: 1774, stage 2 primes: 1779361, pair%=99.03
[Work thread Nov 24 06:33] Using 6859MB of memory.
[Work thread Nov 24 06:33] Stage 2 init complete. 3299 transforms. Time: 4.871 sec.
[Work thread Nov 24 06:44] M9100033 stage 2 complete. 1849244 transforms. Total time: 620.501 sec.
[Work thread Nov 24 06:44] Stage 2 GCD complete. Time: 1.538 sec.
[Work thread Nov 24 06:44] M9100033 completed P-1, B1=1000000, B2=30000000, Wi8: EA555A34
version 30.8:
Code:
[Work thread Nov 24 07:01] P-1 on M9100051 with B1=1000000, B2=30000000
[Work thread Nov 24 07:02] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.478 sec.
[Work thread Nov 24 07:02] Switching to FMA3 FFT length 560K, Pass1=448, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 07:02] Using 6787MB of memory.  D: 2730, 288x1216 polynomial multiplication.
[Work thread Nov 24 07:02] Stage 2 init complete. 7640 transforms. Time: 11.084 sec.
[Work thread Nov 24 07:04] M9100051 stage 2 complete. 119105 transforms. Total time: 98.746 sec.
[Work thread Nov 24 07:04] Stage 2 GCD complete. Time: 1.555 sec.
[Work thread Nov 24 07:04] M9100051 completed P-1, B1=1000000, B2=30000000, Wi8: EAB65A35
At 9M stage 2 is 5.7x faster.
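
The two ratios can be reproduced from the log timings. This is a minimal sketch, assuming the quoted speedups count stage 2 init time plus the stage 2 run; that is an inference from the numbers, not something stated in the post. The names `timings` and `stage2_speedup` are made up for illustration.

```python
# Stage 2 timings in seconds, copied from the logs above: (stage 2 init, stage 2 run).
timings = {
    "27M": {"30.7": (6.631, 2869.270), "30.8": (11.922, 1052.009)},
    "9M": {"30.7": (4.871, 620.501), "30.8": (11.084, 98.746)},
}


def stage2_speedup(old, new):
    """Ratio of total stage 2 time (init + run), old version over new version."""
    return sum(old) / sum(new)


speedups = {
    size: stage2_speedup(t["30.7"], t["30.8"]) for size, t in timings.items()
}

for size, ratio in speedups.items():
    print(f"{size}: stage 2 is {ratio:.1f}x faster")
# 27M: stage 2 is 2.7x faster
# 9M: stage 2 is 5.7x faster
```

Leaving the init time out gives roughly 6.3x for the 9M case, so the 5.7x figure only matches when init is counted.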

I'm working as fast as I can to get a pre-beta ready. It won't work on anything but Mersenne numbers. Won't support save files in stage 2. And I wouldn't trust the "optimal B2" calculations.

Stage 2 performance and optimal B2 change dramatically with the amount of RAM available. You are much better off allowing only one worker to do stage 2 at a time. Optimal B2 will probably be in the 100*B1 to 500*B1 range. Consequently, you'll need to shift your strategies somewhat.
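
To put numbers on that range, here is a small sketch for the B1 used in the runs above. The function `b2_range` and its parameter names are hypothetical, and the 100x and 500x multipliers are just the rough estimate from this post, not Prime95's actual optimal-B2 calculation.

```python
def b2_range(b1, low_mult=100, high_mult=500):
    """Return the (low, high) B2 bounds for a given B1 under the rough 100x-500x estimate."""
    return low_mult * b1, high_mult * b1


b1 = 1_000_000  # B1 from the example runs above
low, high = b2_range(b1)
print(f"B1={b1:,}: B2 between {low:,} and {high:,}")
# B1=1,000,000: B2 between 100,000,000 and 500,000,000
```

For comparison, the example runs used B2=30000000, which is only 30*B1.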

2021-11-24, 13:49   #2
axn

Jun 2003

2·3·17·53 Posts

Few questions, in no particular order:
1) Why is 30.8 using larger FFTs on these two examples?
2) Is this using GMP-ECM-like stage 2 -- i.e. O(sqrt(B2)) [I think] given sufficient RAM?
3) "It won't work on anything but Mersenne numbers. Won't support save files in stage 2" - Are these statements about the pre-beta build or something inherent about the algo?

Last fiddled with by axn on 2021-11-24 at 13:51

2021-11-24, 13:53   #3
firejuggler

"Vincent"
Apr 2010
Over the rainbow

2·1,429 Posts

Very, very nice improvement. I think I will return to pm1 with the new year.

2021-11-24, 14:10   #4
Zhangrc

"University student"
May 2021
Beijing, China

2²·67 Posts

And, when will Prime95 combine P-1 stage 1 with PRP? That's another 0.5% to 1% speed improvement. And after that, I think it's safe to change the number-of-tests-saved value from 2 to 1.

Last fiddled with by Zhangrc on 2021-11-24 at 14:12

2021-11-24, 15:01   #5
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

8,039 Posts

Quote:
 Originally Posted by axn Few questions, in no particular order: 1) Why is 30.8 using larger FFTs on these two examples? 2) Is this using GMP-ECM-like stage 2 -- i.e. O(sqrt(B2)) [I think] given sufficient RAM? 3) "It won't work on anything but Mersenne numbers. Won't support save files in stage 2" - Are these statements about the pre-beta build or something inherent about the algo?
1) The algorithm requires "spare bits" in each FFT word. Should you decide to run the pre-beta, turn on round-off checking to make sure I've not made a mistake in estimating the correct number of spare bits required.
2) Yes. Pavel Atnashev and I have been brainstorming about how we can adapt that algorithm for our needs. Two or three bright ideas came together to produce these results.
3) These restrictions apply to the pre-beta.

2021-11-24, 15:41   #6
petrw1
1976 Toyota Corona years forever!

"Wayne"
Nov 2006

5229₁₀ Posts

Quote:
 Originally Posted by Prime95 At 27M stage 2 is 2.7x faster. At 9M stage 2 is 5.7x faster. Stage 2 performance and optimal B2 change dramatically with the amount of RAM available. You are much better off allowing only one worker to do stage 2 at a time. Optimal B2 will probably be in the 100*B1 to 500*B1 range. Consequently, you'll need to shift your strategies somewhat.
Wow; there have been a lot of P-1 improvements since version 30.x.
29.x to 30.3 was about 40% faster overall
30.3 to 30.7 was about 15% faster again.
Now another 200-570% faster .... Amazing.

I have an i7-7820X with DDR4-3600 RAM that, for unknown reasons, performs best with 1 worker x 8 cores.
I have 20GB RAM allocated to Prime95. This should be exciting.

2021-11-24, 15:45   #7
petrw1
1976 Toyota Corona years forever!

"Wayne"
Nov 2006

3²×7×83 Posts

Quote:
 Originally Posted by Zhangrc And, when will Prime95 combine P-1 stage 1 with PRP? That's another 0.5% to 1% speed improvement. And after that, I think it's safe to change the number-of-tests-saved value from 2 to 1.
I could be out to lunch but in my mind another line of thinking is:
If P-1 is so fast now relative to PRP let it find as many factors as possible and save as many expensive PRP tests as possible. Maybe it should be 2.5 or 3 to 1 tests-saved?

Similarly it is because GPUs are SOOOO much faster at TF that we bumped the pre-PRP TF by a few bits to save PRP tests.

2021-11-24, 16:33   #8
axn

Jun 2003

2×3×17×53 Posts

Quote:
 Originally Posted by petrw1 29.x to 30.3 was about 40% faster overall
I believe 30.4 was the first improvement.

Quote:
 Originally Posted by petrw1 Now another 200-570% faster .... Amazing.
IIUC, the more RAM you allocate (or more to the point, the more temps it can allocate), the greater the speedup ratio. So the 2x-6x seen in George's 6.5GB allocation will be bested by your 20GB (assuming all of that goes to a single worker). I currently have 57 GB allocated, split 6 ways (so 9.5 GB per worker). With 30.8, I will have to drastically change the workflow to give one worker all 57 GB. I might also have to put another 32 GB I have lying around into the machine -- interesting times ahead.

2021-11-24, 17:02   #9
petrw1
1976 Toyota Corona years forever!

"Wayne"
Nov 2006

5229₁₀ Posts

Quote:
 Originally Posted by axn I believe 30.4 was the first improvement. IIUC, the more RAM you allocate (or more to the point, the more temps it can allocate), the greater the speedup ratio. So the 2x-6x seen in George's 6.5GB allocation will be bested by your 20GB (assuming all of that goes to a single worker). I currently have 57 GB allocated, split 6 ways (so 9.5 GB per worker). With 30.8, I will have to drastically change the workflow to give one worker all 57 GB. I might also have to put another 32 GB I have lying around into the machine -- interesting times ahead.
30.4 is probably correct... I knew it was 30.x.

George can chime in but I would think 9.5G per worker seems like a lot for the new version.

2021-11-24, 17:07   #10
axn

Jun 2003

2·3·17·53 Posts

Quote:
 Originally Posted by petrw1 George can chime in but I would think 9.5G per worker seems like a lot for the new version.
If it is anything like GMP-ECM, it will eat up all the memory you can throw at it.

If you ran George's test cases with 20 GB instead of 6.5 GB, you should see another factor of sqrt(20/6.5)=1.7x saved. /IIUC
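
That estimate is easy to check with a few lines. This is a minimal sketch of the square-root rule of thumb above, assuming stage 2 speed scales with the square root of available memory as it does for a GMP-ECM-style stage 2; whether Prime95 30.8 actually scales this way is not confirmed here. The name `estimated_speedup` is made up for illustration.

```python
import math


def estimated_speedup(new_gb, old_gb):
    """Heuristic extra speedup from more RAM, assuming speed ~ sqrt(memory)."""
    return math.sqrt(new_gb / old_gb)


# 20 GB versus the ~6.5 GB used in George's test cases.
factor = estimated_speedup(20, 6.5)
print(f"{factor:.2f}x")
# 1.75x
```

Under the same assumption, doubling the memory given to one worker buys about 1.41x, so the gain per extra GB shrinks as the allocation grows.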

2021-11-24, 19:34   #11
Luminescence

Oct 2021
Germany

2×3×23 Posts

Quote:
 Originally Posted by axn If it is anything like GMP-ECM, it will eat up all the memory you can throw at it. If you ran George's test cases with 20 GB instead of 6.5 GB, you should see another factor of sqrt(20/6.5)=1.7x saved. /IIUC
Are there any diminishing returns? I can run 2 workers with ~50 GB each or one with 100-110 GB.
