2021-11-24, 12:21  #1
P90 years forever!
Aug 2002
Yeehaw, FL
2×4,099 Posts 
Prime95 30.8 (big P-1 changes, see post #551)
On my quad core, 8GB machine:
version 30.7: Code:
[Work thread Nov 23 11:19] P-1 on M26899799 with B1=1000000, B2=30000000
[Work thread Nov 23 11:19] Using FMA3 FFT length 1440K, Pass1=320, Pass2=4608, clm=2, 4 threads using large pages
[Work thread Nov 23 12:03] M26899799 stage 1 complete. 2884382 transforms. Total time: 2612.637 sec.
[Work thread Nov 23 12:03] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.522 sec.
[Work thread Nov 23 12:03] D: 420, relative primes: 587, stage 2 primes: 1779361, pair%=97.87
[Work thread Nov 23 12:03] Using 6856MB of memory.
[Work thread Nov 23 12:03] Stage 2 init complete. 1267 transforms. Time: 6.631 sec.
[Work thread Nov 23 12:51] M26899799 stage 2 complete. 1947219 transforms. Total time: 2869.270 sec.
[Work thread Nov 23 12:51] Stage 2 GCD complete. Time: 5.941 sec.
[Work thread Nov 23 12:51] M26899799 completed P-1, B1=1000000, B2=30000000, Wi8: B63215A0
version 30.8: Code:
[Work thread Nov 24 05:56] P-1 on M26899981 with B1=1000000, B2=30000000
[Work thread Nov 24 05:57] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.500 sec.
[Work thread Nov 24 05:57] Switching to FMA3 FFT length 1600K, Pass1=640, Pass2=2560, clm=1, 4 threads using large pages
[Work thread Nov 24 05:57] Using 6788MB of memory. D: 1050, 120x403 polynomial multiplication.
[Work thread Nov 24 05:58] Stage 2 init complete. 2842 transforms. Time: 11.922 sec.
[Work thread Nov 24 06:15] M26899981 stage 2 complete. 360415 transforms. Total time: 1052.009 sec.
[Work thread Nov 24 06:15] Stage 2 GCD complete. Time: 5.965 sec.
[Work thread Nov 24 06:15] M26899981 completed P-1, B1=1000000, B2=30000000, Wi8: B63F15AE
30.7 Code:
[Work thread Nov 24 06:21] P-1 on M9100033 with B1=1000000, B2=30000000
[Work thread Nov 24 06:21] Using FMA3 FFT length 480K, Pass1=384, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 06:33] M9100033 stage 1 complete. 2884376 transforms. Total time: 731.323 sec.
[Work thread Nov 24 06:33] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.594 sec.
[Work thread Nov 24 06:33] D: 924, relative primes: 1774, stage 2 primes: 1779361, pair%=99.03
[Work thread Nov 24 06:33] Using 6859MB of memory.
[Work thread Nov 24 06:33] Stage 2 init complete. 3299 transforms. Time: 4.871 sec.
[Work thread Nov 24 06:44] M9100033 stage 2 complete. 1849244 transforms. Total time: 620.501 sec.
[Work thread Nov 24 06:44] Stage 2 GCD complete. Time: 1.538 sec.
[Work thread Nov 24 06:44] M9100033 completed P-1, B1=1000000, B2=30000000, Wi8: EA555A34
30.8 Code:
[Work thread Nov 24 07:01] P-1 on M9100051 with B1=1000000, B2=30000000
[Work thread Nov 24 07:02] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.478 sec.
[Work thread Nov 24 07:02] Switching to FMA3 FFT length 560K, Pass1=448, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 07:02] Using 6787MB of memory. D: 2730, 288x1216 polynomial multiplication.
[Work thread Nov 24 07:02] Stage 2 init complete. 7640 transforms. Time: 11.084 sec.
[Work thread Nov 24 07:04] M9100051 stage 2 complete. 119105 transforms. Total time: 98.746 sec.
[Work thread Nov 24 07:04] Stage 2 GCD complete. Time: 1.555 sec.
[Work thread Nov 24 07:04] M9100051 completed P-1, B1=1000000, B2=30000000, Wi8: EAB65A35
I'm working as fast as I can to get a pre-beta ready. It won't work on anything but Mersenne numbers, and it won't support save files in stage 2. I also wouldn't trust the "optimal B2" calculations.
Stage 2 performance and optimal B2 change dramatically with the amount of RAM available. You are much better off allowing only one worker to do stage 2 at a time. Optimal B2 will probably be in the 100*B1 to 500*B1 range. Consequently, you'll need to shift your strategies somewhat.
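For anyone who wants the stage 2 ratios without reaching for a calculator, here is a minimal Python sketch. The times are copied from the logs above; the function names and the dictionary layout are made up for illustration. Keep in mind that each pair compares two neighbouring exponents rather than the same one, and that the init and GCD times are left out, so treat the ratios as rough.

```python
# Stage 2 "Total time" values (seconds) copied from the logs in this post.
# Each pair compares two neighbouring exponents, so the ratio is approximate.
STAGE2_SECONDS = {
    "M26899799 (30.7) vs M26899981 (30.8)": (2869.270, 1052.009),
    "M9100033 (30.7) vs M9100051 (30.8)": (620.501, 98.746),
}


def speedup(old_seconds, new_seconds):
    """Return how many times faster the new run is than the old one."""
    return old_seconds / new_seconds


def suggested_b2_range(b1, low=100, high=500):
    """Return the (low*B1, high*B1) window mentioned in the post.

    The 100x and 500x multipliers are the rough guess quoted above,
    not values computed by Prime95.
    """
    return (low * b1, high * b1)


for label, (old, new) in STAGE2_SECONDS.items():
    print(f"{label}: {speedup(old, new):.2f}x faster in stage 2")

print("B2 window for B1=1000000:", suggested_b2_range(1_000_000))
```

The two ratios come out at roughly 2.7x and 6.3x, and the B2 window for B1=1000000 is 100,000,000 to 500,000,000, well above the B2=30000000 used in these runs.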
2021-11-24, 13:49  #2
Jun 2003
2·7·389 Posts 
Few questions, in no particular order:
1) Why is 30.8 using larger FFTs on these two examples?
2) Is this using a GMP-ECM-like stage 2, i.e. O(sqrt(B2)) [I think] given sufficient RAM?
3) "It won't work on anything but Mersenne numbers. Won't support save files in stage 2" - Are these statements about the pre-beta build or something inherent in the algorithm?
Last fiddled with by axn on 2021-11-24 at 13:51
2021-11-24, 13:53  #3
"Vincent"
Apr 2010
Over the rainbow
2^{2}×7×103 Posts 
Very, very nice improvement. I think I will return to pm1 with the new year. 
2021-11-24, 14:10  #4
"University student"
May 2021
Beijing, China
100001101_{2} Posts 
And when will Prime95 combine P-1 stage 1 with PRP? That's another 0.5% to 1% speed improvement.
After that, I think it's safe to change the number-of-tests-saved value from 2 to 1.
Last fiddled with by Zhangrc on 2021-11-24 at 14:12
2021-11-24, 15:01  #5
P90 years forever!
Aug 2002
Yeehaw, FL
2·4,099 Posts 
Quote:
2) Yes. Pavel Atnashev and I have been brainstorming about how we can adapt that algorithm for our needs. Two or three bright ideas came together to produce these results.
3) These restrictions apply to the pre-beta.

2021-11-24, 15:41  #6
1976 Toyota Corona years forever!
"Wayne"
Nov 2006
Saskatchewan, Canada
2^{3}×661 Posts 
Quote:
29.x to 30.3 was about 40% faster overall. 30.3 to 30.7 was about 15% faster again. Now another 200-570% faster... Amazing.
I have an i5-7820x with 3600 DDR4 RAM that, for unknown reasons, performs best with 1 worker x 8 cores. I have 20GB RAM allocated to Prime95. This should be exciting.

2021-11-24, 15:45  #7
1976 Toyota Corona years forever!
"Wayne"
Nov 2006
Saskatchewan, Canada
2^{3}×661 Posts 
Quote:
If P-1 is now so fast relative to PRP, let it find as many factors as possible and save as many expensive PRP tests as possible. Maybe it should be 2.5 or 3 to 1 tests-saved? Similarly, it is because GPUs are SOOOO much faster at TF that we bumped the pre-PRP TF by a few bits to save PRP tests.
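The trade-off being argued here can be put into a small Python sketch. This is a simplified expected-value model, not Prime95's actual bound-selection code: a P-1 run pays off when the chance of finding a factor, times the tests it would save, times the cost of one PRP test, exceeds the cost of the P-1 run. The function names and the example costs are hypothetical.

```python
def expected_net_saving(p_factor, tests_saved, prp_cost, p1_cost):
    """Expected work saved by one P-1 run, in the same units as the costs.

    p_factor    -- assumed probability that P-1 finds a factor
    tests_saved -- PRP tests avoided if a factor is found
    prp_cost    -- cost of one PRP test (hypothetical units)
    p1_cost     -- cost of the P-1 run (same units)
    """
    return p_factor * tests_saved * prp_cost - p1_cost


def break_even_probability(tests_saved, prp_cost, p1_cost):
    """Smallest factor probability at which the P-1 run pays for itself."""
    return p1_cost / (tests_saved * prp_cost)


# Hypothetical costs: one PRP test = 100 units, one P-1 run = 3 units.
for tests_saved in (1, 2, 3):
    p = break_even_probability(tests_saved, prp_cost=100.0, p1_cost=3.0)
    print(f"tests_saved={tests_saved}: break-even probability {p:.4f}")
```

In this model a cheaper P-1 run lowers the break-even probability in direct proportion, which is why a faster stage 2 argues for spending more effort on P-1 rather than less.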

2021-11-24, 16:33  #8
Jun 2003
1010101000110_{2} Posts 
I believe 30.4 was the first improvement.
IIUC, the more RAM you allocate (or more to the point, the more temps it can allocate), the greater the speed-up ratio. So the 2x-6x seen with George's 6.5GB allocation will be bested by your 20GB (assuming all of that goes to a single worker). I currently have 57 GB allocated, which is split 6 ways (so 9.5 GB per worker). With 30.8, I will have to drastically change the workflow to give one worker all 57 GB. I might also have to look at putting another 32 GB I have lying around into the machine. Interesting times ahead.
2021-11-24, 17:02  #9
1976 Toyota Corona years forever!
"Wayne"
Nov 2006
Saskatchewan, Canada
2^{3}·661 Posts 
Quote:
George can chime in, but I would think 9.5GB per worker is a lot for the new version.

2021-11-24, 17:07  #10
Jun 2003
2·7·389 Posts 
Quote:
If you ran George's test cases with 20 GB instead of 6.5 GB, you should see another factor of sqrt(20/6.5)=1.7x saved. /IIUC 
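The estimate above can be reproduced with a couple of lines of Python. The sketch assumes, as this post does, that the stage 2 gain grows with the square root of the RAM given to a single worker; that scaling is the poster's understanding, not a measured result, and the function name is made up for illustration.

```python
import math


def relative_stage2_gain(ram_gb, baseline_gb):
    """Extra stage 2 speed-up factor from more RAM, assuming sqrt scaling.

    The square-root relationship is the assumption stated in the post,
    not something confirmed by benchmarks.
    """
    return math.sqrt(ram_gb / baseline_gb)


# 20 GB instead of the 6.5 GB used in George's runs.
print(f"20 GB vs 6.5 GB: {relative_stage2_gain(20, 6.5):.2f}x")

# One worker with all 57 GB instead of six workers with 9.5 GB each.
print(f"57 GB vs 9.5 GB: {relative_stage2_gain(57, 9.5):.2f}x")
```

Under the same assumption, giving one worker all 57 GB instead of 9.5 GB would be worth about sqrt(6), or roughly 2.4x, per stage 2 run, though only one worker at a time would get that benefit.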

2021-11-24, 19:34  #11
"Florian"
Oct 2021
Germany
11·17 Posts 
Are there any diminishing returns? I can run 2 workers with ~50GB each or one with 100-110GB.
