mersenneforum.org  

2017-11-24, 17:51   #1
aurashift

Jan 2015

253₁₀ Posts
mprime #7 timing summary request

hi.


So, it'd be helpful on these humongo servers (and in the future, when this stuff trickles down to everyone else) to have a summary at the end of running the timing tests. Finding the pertinent information is doable but less fun when there are 2 sockets of 22 hyperthreaded cores, each doing 1000 iterations. (A high number of iterations is getting very important for a representative result, because turbo boost is becoming more and more of a factor. The CPU I'm referencing in this thread has a turbo boost range of 1600 MHz, from 2.1 to 3.7 GHz!)


Useful items to add for each core count iteration:


1) Minimum, maximum, mean, and biggest range for the ms timings.
2) What # of cores gets you the best timing bang for the buck. (This is gonna get pertinent when more channels of RAM come out... AMD EPYC has eight channels :O. I'm pretty sure I'm reaching the performance limit of my RAM.) (My general rule is that it's always gonna be 2 threads, but I'd like to be validated on that.)
3) I forgot what 3 was. I'm having a brain fart. Maybe an edit to follow.
4) 4 isn't 3. More than 1000 iterations as an option, please.
5) Maybe some kind of option for a parallel multiplier to get the host under load. Turbo boost only helps to skew the results: every time you start a thread, the clock frequency is gonna fluctuate wildly. If you're using the entire box for mprime, the timings may not be representative once it gets to 100% utilization. Something like running the same timing test concurrently X times with Y threads.
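
For anyone who wants this kind of summary today, here is a minimal Python sketch of items 1 and 2, assuming the per-iteration timings have already been pulled out of the mprime output into plain lists of milliseconds. The function names and the input layout (a dict mapping core count to a list of ms timings) are hypothetical, not anything mprime produces. "Biggest range" is read here as max minus min, and "bang for the buck" as iterations per second per core, which is only one possible reading.

```python
from statistics import mean


def summarize_timings(timings_ms):
    """Min, max, mean, and range (max - min) for one list of ms timings."""
    if not timings_ms:
        raise ValueError("need at least one timing")
    lo, hi = min(timings_ms), max(timings_ms)
    return {"min": lo, "max": hi, "mean": mean(timings_ms), "range": hi - lo}


def summarize_by_cores(results):
    """results: hypothetical dict {core_count: [ms, ms, ...]}."""
    return {cores: summarize_timings(t) for cores, t in sorted(results.items())}


def widest_range(results):
    """Core count whose timings vary the most (largest max - min)."""
    return max(results, key=lambda c: summarize_timings(results[c])["range"])


def best_per_core(results):
    """Core count with the most iterations per second per core.

    Assumption: 'bang for the buck' means (1000 / mean_ms) / cores.
    """
    return max(results, key=lambda c: (1000.0 / mean(results[c])) / c)
```

Feed it whatever timings you parsed from your own run. With a per-core metric like this the smallest core count will often win, so you may prefer to compare the marginal gain from each added core instead.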

Last fiddled with by aurashift on 2017-11-24 at 18:23
2017-11-24, 18:45   #2
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

3×11×233 Posts

I assume you are talking about the Advanced/Time menu choice.

Try using the Options/Benchmark menu choice. The throughput benchmarks are what I use to decide the best configuration (on my admittedly small core count machines). Of special interest: you can use a comma-separated list of workers/cores to benchmark, which greatly reduces the number of less-than-useful combinations.
2017-11-24, 23:15   #3
aurashift

Jan 2015

11·23 Posts

Maybe you're right; I need to wrap my head around this better. The FFT is running in CPU cache, right? I'm not even sure I know what questions to ask.

Last fiddled with by aurashift on 2017-11-24 at 23:16
2017-11-24, 23:34   #4
aurashift

Jan 2015

253₁₀ Posts

I guess I want to know if there are any unusual tweaks I can use to better utilize this: https://ark.intel.com/products/12049...Cache-2_10-GHz
2017-11-25, 00:08   #5
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013

37×79 Posts

Quote:
Originally Posted by aurashift
i guess i want to know if there's any unusual tweaks i can use to better utilize this https://ark.intel.com/products/12049...Cache-2_10-GHz
That CPU downclocks hard when doing AVX. mprime doesn't scale perfectly to 22 cores. There will be a sweet spot combination of number of cores in use and number of workers.

Also, memory bandwidth will factor in. Going by my Skylake experience, 6 channels of DDR4-2666 will hit a bottleneck around 14 or 15 cores in that chip. Any gains over that will probably make nothing but heat.
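
To put a rough number on that, here is a small Python sketch of the theoretical peak bandwidth, assuming the standard 64-bit (8-byte) DDR4 channel width. The function name is made up for illustration, and this is a ceiling, not what a real workload will sustain.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumes each channel moves bytes_per_transfer bytes per transfer,
    which is 8 for a standard 64-bit DDR4 channel.
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


# 6 channels of DDR4-2666: 6 * 2666 * 8 / 1000 = 127.968 GB/s
total = peak_bandwidth_gb_s(6, 2666)
print(f"peak: {total:.1f} GB/s, per core at 14 cores: {total / 14:.1f} GB/s")
```

So at 14 busy cores each one gets at most about 9 GB/s of that peak, and less in practice.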

I would suggest benchmarking:

1 and 2 workers: 12, 14, 16, and 18 cores
3 workers: 12, 15, and 18 cores
12, 14, 16, and 18 workers: 1 core each

If the 18 case is better than the 15/16 case, try 21/22 workers/cores, too.
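
If you want to keep track of those runs, the list above can be sketched in Python as (workers, total cores) pairs, with a helper to pick the winner by total throughput. Everything here is hypothetical bookkeeping, not an mprime interface: it assumes you copy the per-worker ms/iter figures out of the benchmark output by hand, and it reads the "N workers, 1 core" line as one core per worker.

```python
def suggested_configs():
    """The benchmark combinations listed above, as (workers, total_cores)."""
    configs = []
    for workers in (1, 2):
        for cores in (12, 14, 16, 18):
            configs.append((workers, cores))
    for cores in (12, 15, 18):
        configs.append((3, cores))
    for n in (12, 14, 16, 18):
        configs.append((n, n))  # assumption: one core per worker
    return configs


def throughput_iter_per_s(per_worker_ms):
    """Total iterations per second, summed over all workers."""
    return sum(1000.0 / ms for ms in per_worker_ms)


def best_config(measurements):
    """measurements: dict {(workers, cores): [ms/iter for each worker]}.

    Returns the configuration with the highest total throughput.
    """
    return max(measurements, key=lambda cfg: throughput_iter_per_s(measurements[cfg]))
```

Comparing total iterations per second rather than ms/iter matters because a multi-worker setup can have a slower per-worker time and still get more work done overall.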

And be sure to say yes to benchmarking All Complex FFT. Some of the timings are quite a bit faster. New mprime will eventually figure out the fastest FFT, but it won't adjust the number of workers/cores to use, so you need to benchmark the FFTs to figure out the number of workers/cores to use.

Last fiddled with by Mark Rose on 2017-11-25 at 00:10
2017-11-26, 21:28   #6
NookieN

Aug 2002

2×29 Posts

I've been using its 22-core sibling, the E5-2696v4, for a few months now. 22 cores in one worker can get down to about 1.39 ms/iter for a 4M FFT. I did not find any combination with multiple workers that yielded better throughput than a single worker on all cores. It usually runs at 2600 or 2700 MHz with all cores active.

So a single 22c Skylake with 6 channels at 2666 MT/s should be able to do a bit better. The fact that it's a 2S system will probably hurt somewhat. If your RAM is rated for a higher speed than 2666 MT/s, you won't be able to run it at a higher clock, but you should be able to drop tCL, tRCD, tRP, and tRAS somewhat.

All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.