- **Software** (*https://www.mersenneforum.org/forumdisplay.php?f=10*)
- **Prime95 30.8 (big P-1 changes, see post #551)** (*https://www.mersenneforum.org/showthread.php?t=27366*)

[QUOTE=petrw1;593681]:drama:
How low is "low"?[/QUOTE]

On my quad core, 8GB machine:

version 30.7:
[CODE][Work thread Nov 23 11:19] P-1 on M26899799 with B1=1000000, B2=30000000
[Work thread Nov 23 11:19] Using FMA3 FFT length 1440K, Pass1=320, Pass2=4608, clm=2, 4 threads using large pages
[Work thread Nov 23 12:03] M26899799 stage 1 complete. 2884382 transforms. Total time: 2612.637 sec.
[Work thread Nov 23 12:03] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.522 sec.
[Work thread Nov 23 12:03] D: 420, relative primes: 587, stage 2 primes: 1779361, pair%=97.87
[Work thread Nov 23 12:03] Using 6856MB of memory.
[Work thread Nov 23 12:03] Stage 2 init complete. 1267 transforms. Time: 6.631 sec.
[Work thread Nov 23 12:51] M26899799 stage 2 complete. 1947219 transforms. Total time: 2869.270 sec.
[Work thread Nov 23 12:51] Stage 2 GCD complete. Time: 5.941 sec.
[Work thread Nov 23 12:51] M26899799 completed P-1, B1=1000000, B2=30000000, Wi8: B63215A0[/CODE]

version 30.8:
[CODE][Work thread Nov 24 05:56] P-1 on M26899981 with B1=1000000, B2=30000000
[Work thread Nov 24 05:57] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.500 sec.
[Work thread Nov 24 05:57] Switching to FMA3 FFT length 1600K, Pass1=640, Pass2=2560, clm=1, 4 threads using large pages
[Work thread Nov 24 05:57] Using 6788MB of memory. D: 1050, 120x403 polynomial multiplication.
[Work thread Nov 24 05:58] Stage 2 init complete. 2842 transforms. Time: 11.922 sec.
[Work thread Nov 24 06:15] M26899981 stage 2 complete. 360415 transforms. Total time: 1052.009 sec.
[Work thread Nov 24 06:15] Stage 2 GCD complete. Time: 5.965 sec.
[Work thread Nov 24 06:15] M26899981 completed P-1, B1=1000000, B2=30000000, Wi8: B63F15AE[/CODE]

At 27M stage 2 is 2.7x faster.
30.7:
[CODE][Work thread Nov 24 06:21] P-1 on M9100033 with B1=1000000, B2=30000000
[Work thread Nov 24 06:21] Using FMA3 FFT length 480K, Pass1=384, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 06:33] M9100033 stage 1 complete. 2884376 transforms. Total time: 731.323 sec.
[Work thread Nov 24 06:33] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.594 sec.
[Work thread Nov 24 06:33] D: 924, relative primes: 1774, stage 2 primes: 1779361, pair%=99.03
[Work thread Nov 24 06:33] Using 6859MB of memory.
[Work thread Nov 24 06:33] Stage 2 init complete. 3299 transforms. Time: 4.871 sec.
[Work thread Nov 24 06:44] M9100033 stage 2 complete. 1849244 transforms. Total time: 620.501 sec.
[Work thread Nov 24 06:44] Stage 2 GCD complete. Time: 1.538 sec.
[Work thread Nov 24 06:44] M9100033 completed P-1, B1=1000000, B2=30000000, Wi8: EA555A34[/CODE]

30.8:
[CODE][Work thread Nov 24 07:01] P-1 on M9100051 with B1=1000000, B2=30000000
[Work thread Nov 24 07:02] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.478 sec.
[Work thread Nov 24 07:02] Switching to FMA3 FFT length 560K, Pass1=448, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 07:02] Using 6787MB of memory. D: 2730, 288x1216 polynomial multiplication.
[Work thread Nov 24 07:02] Stage 2 init complete. 7640 transforms. Time: 11.084 sec.
[Work thread Nov 24 07:04] M9100051 stage 2 complete. 119105 transforms. Total time: 98.746 sec.
[Work thread Nov 24 07:04] Stage 2 GCD complete. Time: 1.555 sec.
[Work thread Nov 24 07:04] M9100051 completed P-1, B1=1000000, B2=30000000, Wi8: EAB65A35[/CODE]

At 9M stage 2 is 5.7x faster.

I'm working as fast as I can to get a pre-beta ready. It won't work on anything but Mersenne numbers. It won't support save files in stage 2. And I wouldn't trust the "optimal B2" calculations. Stage 2 performance and the optimal B2 change dramatically with the amount of RAM available. You are much better off allowing only one worker to do stage 2 at a time. Optimal B2 will probably be in the 100*B1 to 500*B1 range. Consequently, you'll need to shift your strategies somewhat.
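As a sanity check, the quoted 2.7x and 5.7x ratios follow from the stage 2 timings in the logs above if you count init plus the main phase (this arithmetic is an editorial addition, not from the thread):

```python
# Stage 2 times (seconds) taken from the logs above: init time + main-phase time.
v307_27m = 6.631 + 2869.270   # 30.7 on M26899799
v308_27m = 11.922 + 1052.009  # 30.8 on M26899981
v307_9m = 4.871 + 620.501     # 30.7 on M9100033
v308_9m = 11.084 + 98.746     # 30.8 on M9100051

print(round(v307_27m / v308_27m, 1))  # 2.7
print(round(v307_9m / v308_9m, 1))    # 5.7
```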

A few questions, in no particular order:
1) Why is 30.8 using larger FFTs in these two examples?
2) Is this using a GMP-ECM-like stage 2 -- i.e. O(sqrt(B2)) [I think] given sufficient RAM?
3) "It won't work on anything but Mersenne numbers. Won't support save files in stage 2" -- are these statements about the pre-beta build, or something inherent to the algo?

:bow:
Very, very nice improvement. I think I will return to P-1 in the new year.

And when will Prime95 combine P-1 stage 1 with PRP? That's another 0.5% to 1% speed improvement.
And after that, I think it's safe to change the number-of-tests-saved value from 2 to 1.

[QUOTE=axn;593749]Few questions, in no particular order:
1) Why is 30.8 using larger FFTs on these two examples? 2) Is this using GMP-ECM-like stage 2 -- i.e. O(sqrt(B2)) [I think] given sufficient RAM? 3) "It won't work on anything but Mersenne numbers. Won't support save files in stage 2" - Are these statements about the pre-beta build or something inherent about the algo?[/QUOTE]
1) The algorithm requires "spare bits" in each FFT word. Should you decide to run the pre-beta, turn on round-off checking to make sure I've not made a mistake in estimating the correct number of spare bits required.
2) Yes. Pavel Atnashev and I have been brainstorming about how we can adapt that algorithm for our needs. Two or three bright ideas came together to produce these results.
3) These restrictions apply to the pre-beta.
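For readers unfamiliar with the algorithm under discussion, here is a minimal Python sketch of P-1 with the classic per-prime stage 2 loop in place of 30.8's polynomial-multiplication stage 2. This is illustrative only -- the function, bounds, and base are editorial choices, not Prime95's implementation:

```python
from math import gcd

def primes_up_to(n):
    """Simple sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(sieve[p * p :: p]))
    return [i for i, is_p in enumerate(sieve) if is_p]

def pminus1(N, B1, B2):
    """Toy P-1: returns a nontrivial factor of N, or None."""
    # Stage 1: x = 3^E mod N, where E packs every prime power <= B1.
    x = 3
    for p in primes_up_to(B1):
        pk = p
        while pk * p <= B1:
            pk *= p
        x = pow(x, pk, N)
    g = gcd(x - 1, N)
    if 1 < g < N:
        return g
    # Stage 2: hope p-1 = (B1-smooth part) * q for a single prime q in (B1, B2].
    # This is the naive per-prime loop; 30.8 instead evaluates a large
    # polynomial in x (the GMP-ECM-style idea), which is far cheaper.
    acc = 1
    for q in primes_up_to(B2):
        if q > B1:
            acc = acc * (pow(x, q, N) - 1) % N
    g = gcd(acc, N)
    return g if 1 < g < N else None

# Cole's 1903 factorisation of M67: 193707721 * 761838257287.
# 193707720 = 2^3 * 3^3 * 5 * 67 * 2677, so B1=100 covers everything
# except the prime 2677, which stage 2 then catches.
print(pminus1(2**67 - 1, B1=100, B2=3000))  # 193707721
```

The naive stage 2 costs one modular exponentiation per prime up to B2; the polynomial-evaluation approach trades that for large multiplications whose count grows much more slowly with B2, which is why the gain increases with available RAM (more temporaries means larger polynomials).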

[QUOTE=Prime95;593747]At 27M stage 2 is 2.7x faster. At 9M stage 2 is 5.7x faster. Stage 2 performance and optimal B2 changes dramatically with amount of RAM available. You are much better off allowing only one worker to do stage 2 at a time. Optimal B2 will probably be in the 100*B1 to 500*B1 range. Consequently, you'll need to shift your strategies somewhat.[/QUOTE]
Wow; there have been a lot of P-1 improvements in the 30.x versions:
29.x to 30.3 was about 40% faster overall.
30.3 to 30.7 was about 15% faster again.
Now another 200-570% faster .... Amazing.

I have an i5-7820x with 3600 DDR4 RAM that, for unknown reasons, performs best with 1 worker x 8 cores. I have 20GB of RAM allocated to Prime95. This should be exciting.

[QUOTE=Zhangrc;593753]And, when will Prime95 combine P-1 stage 1 with PRP? That's another 0.5% to 1% speed improvement.
And after that, I think it's safe to change the number-of-tests-saved value from 2 to 1.[/QUOTE]
I could be out to lunch, but in my mind another line of thinking is: if P-1 is so fast now relative to PRP, let it find as many factors as possible and save as many expensive PRP tests as possible. Maybe it should be 2.5 or 3 tests-saved, rather than 1? Similarly, it is because GPUs are SOOOO much faster at TF that we bumped the pre-PRP TF level by a few bits to save PRP tests.

[QUOTE=petrw1;593761]29.x to 30.3 was about 40% faster overall[/quote]
I believe 30.4 was the first improvement.

[QUOTE=petrw1;593761]Now another 200-570% faster .... Amazing.[/quote]
IIUC, the more RAM you allocate (or, more to the point, the more temporaries it can allocate), the greater the speed-up ratio. So the 2x-6x seen with George's 6.5GB allocation will be bested by your 20GB (assuming all of that goes to a single worker). I currently have 57 GB allocated, which is split 6 ways (so 9.5 GB per worker). With 30.8, I will have to drastically change the workflow to give one worker all 57 GB. I might also have to look at putting another 32 GB I have lying around into the machine -- interesting times ahead.

[QUOTE=axn;593766]I believe 30.4 was the first improvement.
IIUC, the more RAM you allocate (or more to the point, the more temps it can allocate), the greater the speed up ratio. So the 2x-6x seen in George's 6.5GB allocation will be bested by your 20GB (assuming all of that goes to a single worker). I have currently 57 GB allocated which is split 6 way (so 9.5 GB per worker). With 30.8, I will have to drastically change the workflow to give one worker all 57 GB. I might have to also look at putting another 32 GB I have lying around into the machine as well -- interesting times ahead.[/QUOTE]
30.4 is probably correct... I knew it was 30-something. George can chime in, but I would think 9.5GB per worker seems like a lot for the new version.

[QUOTE=petrw1;593770]George can chime in but I would think 9.5G per worker seems like a lot for the new version.[/QUOTE]
If it is anything like GMP-ECM, it will eat up all the memory you can throw at it. If you ran George's test cases with 20 GB instead of 6.5 GB, you should see roughly another sqrt(20/6.5) ≈ 1.75x speedup. /IIUC
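Under axn's square-root rule of thumb, the projections work out as follows. Note the sqrt scaling is itself an assumption (IIUC, as axn says), not a documented property of 30.8:

```python
from math import sqrt

def projected_speedup(new_ram_gb, old_ram_gb):
    """axn's rule of thumb (assumed): stage 2 gain ~ sqrt of the RAM ratio."""
    return sqrt(new_ram_gb / old_ram_gb)

print(round(projected_speedup(20, 6.5), 2))  # 1.75  (petrw1's 20GB vs George's 6.5GB)
print(round(projected_speedup(57, 6.5), 2))  # 2.96  (axn's 57GB given to one worker)
```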

[QUOTE=axn;593771]If it is anything like GMP-ECM, it will eat up all the memory you can throw at it.
If you ran George's test cases with 20 GB instead of 6.5 GB, you should see another factor of sqrt(20/6.5)=1.7x saved. /IIUC[/QUOTE]
Are there any diminishing returns? I can run 2 workers with ~50GB each, or one with 100-110GB.
