mersenneforum.org  

Old 2019-09-08, 15:44   #1
scan80269
 
"Sam"
Jun 2019
California, USA

11101₂ Posts
CPU Energy Efficiency for Prime95

I just picked up an Intel Core i9-9900T CPU (8C/16T, 35W TDP). In combination with 32GB of DDR4-3600 dual rank memory and an ASRock Z390 Phantom Gaming-ITX/ac motherboard, it managed to achieve some decent throughput figures:

Timings for 2048K FFT length (8 cores, 1 worker): 1.11 ms. Throughput: 904.50 iter/sec.
Timings for 2304K FFT length (8 cores, 1 worker): 1.47 ms. Throughput: 680.55 iter/sec.
Timings for 2400K FFT length (8 cores, 1 worker): 1.83 ms. Throughput: 545.20 iter/sec.
Timings for 2560K FFT length (8 cores, 1 worker): 1.95 ms. Throughput: 512.50 iter/sec.
Timings for 2688K FFT length (8 cores, 1 worker): 2.03 ms. Throughput: 492.94 iter/sec.
Timings for 2880K FFT length (8 cores, 1 worker): 2.29 ms. Throughput: 437.28 iter/sec.
Timings for 3072K FFT length (8 cores, 1 worker): 2.42 ms. Throughput: 413.06 iter/sec.
Timings for 3200K FFT length (8 cores, 1 worker): 2.55 ms. Throughput: 391.66 iter/sec.
Timings for 3360K FFT length (8 cores, 1 worker): 2.82 ms. Throughput: 354.72 iter/sec.
Timings for 3456K FFT length (8 cores, 1 worker): 2.82 ms. Throughput: 353.99 iter/sec.
Timings for 3584K FFT length (8 cores, 1 worker): 2.95 ms. Throughput: 339.23 iter/sec.
Timings for 3840K FFT length (8 cores, 1 worker): 3.14 ms. Throughput: 318.70 iter/sec.
Timings for 4096K FFT length (8 cores, 1 worker): 3.47 ms. Throughput: 288.33 iter/sec.
Timings for 4480K FFT length (8 cores, 1 worker): 3.92 ms. Throughput: 255.15 iter/sec.
Timings for 4608K FFT length (8 cores, 1 worker): 3.88 ms. Throughput: 258.06 iter/sec.
Timings for 4800K FFT length (8 cores, 1 worker): 4.34 ms. Throughput: 230.54 iter/sec.
Timings for 5120K FFT length (8 cores, 1 worker): 4.53 ms. Throughput: 220.61 iter/sec.
Timings for 5376K FFT length (8 cores, 1 worker): 4.80 ms. Throughput: 208.33 iter/sec.
Timings for 5760K FFT length (8 cores, 1 worker): 5.39 ms. Throughput: 185.67 iter/sec.
Timings for 6144K FFT length (8 cores, 1 worker): 5.61 ms. Throughput: 178.22 iter/sec.
Timings for 6400K FFT length (8 cores, 1 worker): 5.98 ms. Throughput: 167.09 iter/sec.
Timings for 6720K FFT length (8 cores, 1 worker): 6.19 ms. Throughput: 161.55 iter/sec.
Timings for 6912K FFT length (8 cores, 1 worker): 6.55 ms. Throughput: 152.76 iter/sec.
Timings for 7168K FFT length (8 cores, 1 worker): 6.47 ms. Throughput: 154.53 iter/sec.
Timings for 7680K FFT length (8 cores, 1 worker): 7.02 ms. Throughput: 142.46 iter/sec.
Timings for 8064K FFT length (8 cores, 1 worker): 7.46 ms. Throughput: 134.00 iter/sec.
Timings for 8192K FFT length (8 cores, 1 worker): 7.51 ms. Throughput: 133.24 iter/sec.
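
For anyone who wants to tabulate results like these, here is a minimal sketch of a parser for the benchmark lines. The regular expression is only an assumption based on the line format quoted above, and `parse_benchmark` is a hypothetical helper name, not part of Prime95.

```python
import re

# Pattern inferred from the benchmark lines quoted above (an assumption,
# not an official Prime95 format specification).
LINE_RE = re.compile(
    r"Timings for (\d+)K FFT length \((\d+) cores?, (\d+) workers?\):\s*"
    r"([\d.]+) ms\.\s+Throughput: ([\d.]+) iter/sec\."
)

def parse_benchmark(text):
    """Return {fft_size_in_K: (ms_per_iter, iters_per_sec)} for each matching line."""
    results = {}
    for m in LINE_RE.finditer(text):
        fft_k = int(m.group(1))
        results[fft_k] = (float(m.group(4)), float(m.group(5)))
    return results

# Two lines copied from the results above.
sample = """\
Timings for 2048K FFT length (8 cores, 1 worker): 1.11 ms. Throughput: 904.50 iter/sec.
Timings for 5120K FFT length (8 cores, 1 worker): 4.53 ms. Throughput: 220.61 iter/sec.
"""

bench = parse_benchmark(sample)
for fft_k, (ms, ips) in sorted(bench.items()):
    # With a single worker, throughput should be close to 1000 / ms;
    # small gaps come from the ms figure being rounded to two decimals.
    print(f"{fft_k}K: reported {ips:.2f} iter/sec, 1000/ms = {1000 / ms:.2f}")
```

With one worker the reported throughput is essentially the reciprocal of the per-iteration time, so either column can be used for an efficiency calculation.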

This CPU strikes me as being quite energy efficient in running Prime95. Its throughput is similar to that of a Core i7-8700K CPU (6C/12T, 95W TDP), but at nearly 1/3 the power dissipation.

It would be interesting to determine which modern CPU can achieve the highest efficiency for running Prime95, essentially an iters/sec per watt metric.

The CPUs delivering the highest throughput numbers tend not to be the most energy efficient. My i7-5960X rig with quad channel DDR4 memory is currently the fastest among my systems, but this CPU consumes almost 120W while running Prime95, and its throughput is nowhere close to 3X of the i9-9900T. Upcoming Intel Cascade Lake-X CPUs may provide an efficiency improvement over current gen Core X CPUs, so I'll be watching those closely.

Mobile CPUs with high core count may also be candidates for the energy efficiency crown.
Old 2019-09-08, 16:31   #2
axn
 
Jun 2003

2×5×479 Posts

Keep in mind that 35W TDP != 35W max power consumption. Have you tried to measure the actual power at the wall? Obviously, the total system power will be much higher, but the CPU itself might draw much more than 35W at full load, especially when using AVX.
Old 2019-09-08, 16:44   #3
scan80269
 
"Sam"
Jun 2019
California, USA

29 Posts

Yes, I'm fully aware of that.

When Prime95 is started from idle, the i9-9900T engages Turbo Boost and consumes ~70W package power for several seconds, as measured using HWMonitor. The package power then goes back down close to 35W and stays there.

I did not override any default settings for CPU such as AVX offset in the BIOS, so once the CPU runs out of power & thermal headroom with Turbo Boost, it reduces the frequencies of the cores and the package power returns to TDP level.

Total system power is of course much higher than 35W, with memory, chipset, graphics, storage, VRs, etc. all consuming power in addition to the CPU itself, but this is true for any computer system.

Last fiddled with by scan80269 on 2019-09-08 at 16:50
Old 2019-09-08, 16:48   #4
nomead
 
"Sam Laur"
Dec 2018
Turku, Finland

101001000₂ Posts

Also see how the clock speed behaves during the run. As I recall, the 35W "TDP" limited parts run at full speed for some time (from a few seconds up to tens of seconds at most, depending on the motherboard manufacturer's parameters in the BIOS) and then throttle the clock lower if CPU demand stays high. And if the cooling is designed for 35 watts, it will probably also hit some sort of thermal throttling during a longer run. So check with a full exponent test, not just a throughput benchmark.
Old 2019-09-08, 17:11   #5
M344587487
 
"Composite as Heck"
Oct 2017

2·3³·13 Posts

Quote:
Originally Posted by scan80269
...
It would be interesting to determine which modern CPU can achieve the highest efficiency for running Prime95, essentially an iters/sec per watt metric.
...
Probably the 3900X or 3950X is the most energy efficient consumer CPU, assuming the 64MB of cache is a big deal. Epyc Zen 2 will be the best if server CPUs are included, due to running in the sweet spot of the power curve and packing the compute power more densely, therefore having less overhead per iteration and probably better utilising the sweet spot of the PSU. But if we include GPUs, the Radeon VII beats all reasonable options.
Old 2019-09-08, 17:17   #6
hansl
 
Apr 2019

11001101₂ Posts

Quote:
Originally Posted by M344587487
Probably the 3900X or 3950X is the most energy efficient consumer CPU, assuming the 64MB of cache is a big deal. Epyc Zen 2 will be the best if server CPUs are included, due to running in the sweet spot of the power curve and packing the compute power more densely, therefore having less overhead per iteration and probably better utilising the sweet spot of the PSU. But if we include GPUs, the Radeon VII beats all reasonable options.
Wouldn't the 3000 series chips be bandwidth limited with only 2 memory channels?
I could definitely see the EPYC doing well though; 8 channels per socket, I think?
Old 2019-09-08, 17:24   #7
scan80269
 
"Sam"
Jun 2019
California, USA

11101₂ Posts

Your description is spot on.

Intel CPU Turbo Boost frequencies for all cores correspond to a power level way higher than TDP, especially when running AVX. The Turbo Boost duration for desktop Intel CPUs is typically no more than a few seconds by default, after which the CPU core frequencies will go down to bring the steady state package power consumption in line with TDP.

Attached screen shot is from my i7-8700T CPU running Prime95 exponent 86622433. The cores are at 2.7GHz most of the time, which is a bit higher than the 2.4GHz "base frequency", but nowhere near the 4.0GHz "max Turbo frequency". Steady state package power fluctuates slightly but is always very close to TDP at 35W.

Thermal throttling is an entirely different thing, and occurs when CPU internal temperature reaches "PROCHOT", typically 100C but may be higher or lower depending on CPU model. As long as the thermal solution can evacuate TDP level heat away from the CPU, the core/package temperatures should not reach PROCHOT. My i7-8700T has a giant NOFAN CR-95C passive cooler as thermal solution, and the CPU package and core temperatures are below 60C while running Prime95 24x7.
Attached thumbnail: Core_i7-8700T_Prime95_Exponent_86622433.png (48.0 KB)

Last fiddled with by scan80269 on 2019-09-08 at 17:26
Old 2019-09-08, 17:44   #8
scan80269
 
"Sam"
Jun 2019
California, USA

35₈ Posts

Quote:
Originally Posted by hansl
Wouldn't the 3000 series chips be bandwidth limited with only 2 memory channels?
I could definitely see the EPYC doing well though; 8 channel per socket i think?
How about the Xeon W-3175X with 28 cores and 6 channels of DDR4 memory and 255W TDP? I can't afford such a system (the motherboard alone is $1800US) so will need help from someone else to see how it fares in computational efficiency as well as throughput for Prime95.

I suspect the i9-9900T with high speed DDR4 memory (e.g. 3600 dual rank) may be hard to beat in efficiency, since the TDP is only 35W, so CPUs with higher TDP will need to deliver several times the throughput to come out on top. For example, even with AVX512 and 6-channel memory, I doubt if a Xeon W-3175X platform can achieve >7.2X the throughput of what I posted for i9-9900T/DDR4-3600. Wouldn't mind being proven wrong, though.

Perhaps the way to compare is to simply take the iters/sec figure for leading edge exponents (5120K FFT??) divided by the CPU steady state package power, with optimized thread and worker counts for each CPU.
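
As a rough sketch of that metric, the snippet below divides the 5120K throughput from my first post by a 35 W steady-state package power, and also works out the break-even factor for a 255 W part. The function name is made up for illustration, and treating the W-3175X as running exactly at its 255 W TDP is an assumption, since nobody here has measured its package power.

```python
def iters_per_sec_per_watt(iters_per_sec, package_watts):
    """Efficiency metric: throughput divided by steady-state CPU package power."""
    return iters_per_sec / package_watts

# i9-9900T at 5120K FFT, taken from the benchmark in the first post,
# with package power assumed to settle at the 35 W TDP.
I9_9900T_IPS = 220.61
I9_9900T_WATTS = 35.0
i9_eff = iters_per_sec_per_watt(I9_9900T_IPS, I9_9900T_WATTS)

# Break-even for a 255 W part: ASSUMES it draws exactly its TDP under load.
W3175X_WATTS = 255.0
break_even_factor = W3175X_WATTS / I9_9900T_WATTS
break_even_ips = I9_9900T_IPS * break_even_factor

print(f"i9-9900T: {i9_eff:.2f} iter/sec per watt")
print(f"A 255 W CPU needs more than {break_even_factor:.2f}x the throughput "
      f"({break_even_ips:.0f} iter/sec at 5120K) to match it")
```

The comparison is only meaningful if every CPU is measured the same way, i.e. steady-state package power after the Turbo Boost window has expired, and with the best thread/worker split for that CPU.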

Last fiddled with by scan80269 on 2019-09-08 at 17:45
Old 2019-09-08, 17:53   #9
mackerel
 
Feb 2016
UK

110001110₂ Posts

On Intel, TDP is the maximum power required to run at base clock at elevated but within-spec temperatures. Most enthusiast level systems with adequate cooling will boost and remain at PL2 indefinitely, which is somewhere above TDP. This is considered in-spec by Intel; it isn't overclocking. Where TDP really comes into play is in thermally limited systems, such as laptops and horrible systems from box shifters like Dell. Because of the limited cooling potential, they allow a short boost above TDP before pulling back to it. Clocks will probably be below max turbo, but don't necessarily have to drop all the way down to base.

As for efficiency, it mostly comes down to where on the efficiency curve you run a CPU. Lower clocks at lower voltage help a huge amount, if you can throw enough cores at it. So a direct comparison between Intel and AMD isn't trivial.

I'm doing the new Fermat divisor project on PrimeGrid at the moment. Based on CPU self-reported power, I can also look at average work unit time, and work out a production over time. Combined, I can work out the number of tasks per kWh. Units at time of writing are quite small, 120k-128k FFT, running 1 task per core.
3600 - 328 units/kWh
3700X - 430 units/kWh
6700k - 230 units/kWh
E5 2683v3 - 293 units/kWh
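
A minimal sketch of that arithmetic, for anyone who wants to reproduce it on their own hardware. The function name and the example inputs are hypothetical; they are not the measurements behind the figures above.

```python
def units_per_kwh(avg_package_watts, avg_task_seconds, concurrent_tasks):
    """Work units completed per kWh of CPU package energy.

    avg_package_watts: CPU self-reported average package power, in watts
    avg_task_seconds:  average wall-clock time of one work unit, in seconds
    concurrent_tasks:  number of tasks running at once (e.g. 1 per core)
    """
    tasks_per_hour = concurrent_tasks * 3600.0 / avg_task_seconds
    kwh_per_hour = avg_package_watts / 1000.0
    return tasks_per_hour / kwh_per_hour

# Hypothetical example: 8 tasks in parallel, 600 s each, 80 W package power.
example = units_per_kwh(avg_package_watts=80.0,
                        avg_task_seconds=600.0,
                        concurrent_tasks=8)
print(f"{example:.0f} units/kWh")
```

Note that this counts package power only; measuring at the wall would give lower units/kWh figures for every system.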

Mainstream consumer CPUs tend to be more biased towards clock than efficiency, which is in part why there are lower power versions available too.

The 3600 and 3700X reported near enough the same power used running 6 and 8 tasks respectively, and clocks only differed by about 100 MHz. I don't know if they used a better bin on the 3700X but it certainly is more efficient.

Been on my "to do" list for a while, but I wanted to see what sort of tradeoffs can be had by essentially underclocking/undervolting. Might be simpler to run a lower power limit and let the CPU take care of it.
Old 2019-09-08, 19:23   #10
xx005fs
 
"Eric"
Jan 2018
USA

211 Posts

Quote:
Originally Posted by scan80269
How about the Xeon W-3175X with 28 cores and 6 channels of DDR4 memory and 255W TDP? I can't afford such a system (the motherboard alone is $1800US) so will need help from someone else to see how it fares in computational efficiency as well as throughput for Prime95.

I suspect the i9-9900T with high speed DDR4 memory (e.g. 3600 dual rank) may be hard to beat in efficiency, since the TDP is only 35W, so CPUs with higher TDP will need to deliver several times the throughput to come out on top. For example, even with AVX512 and 6-channel memory, I doubt if a Xeon W-3175X platform can achieve >7.2X the throughput of what I posted for i9-9900T/DDR4-3600. Wouldn't mind being proven wrong, though.

Perhaps the way to compare is to simply take the iters/sec figure for leading edge exponents (5120K FFT??) divided by the CPU steady state package power, with optimized thread and worker counts for each CPU.
In my opinion, you really can't beat the efficiency of GPUs. If you purchase something like a Radeon VII, you can easily achieve around 1 ms/it at 5120K FFT while pulling around 160W, and it is significantly cheaper than a crap ton of 9900T systems. On the Nvidia side you don't really have a choice, since all the GeForce cards are crippled on FP64, though a Titan V has absolutely insane performance while drawing little power (in my experience, 0.83 ms/it at 5120K FFT and 150W). If you are going to shell out for a system as expensive as a W-3175X, you might as well purchase a Titan V, which is cheaper, faster, and draws less power. The biggest advantage of GPUs, though, is that you can scale up extremely easily by just adding more cards, and they take up relatively little space.
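
To put those numbers on a common footing, here is a small sketch that converts ms/it and watts into joules per iteration at 5120K FFT. The GPU figures are the approximate ones quoted in this post; the i9-9900T row uses the 4.53 ms benchmark figure from the first post and assumes a 35 W steady-state package power. The function name is made up for illustration.

```python
def joules_per_iter(ms_per_iter, watts):
    """Energy per iteration: power (W = J/s) times time per iteration (s)."""
    return watts * ms_per_iter / 1000.0

# (ms per iteration at 5120K FFT, power draw in watts)
figures = {
    "Radeon VII (approx., as quoted)": (1.0, 160.0),
    "Titan V (as quoted)": (0.83, 150.0),
    "i9-9900T (benchmark, assumed 35 W package)": (4.53, 35.0),
}

for name, (ms, watts) in figures.items():
    print(f"{name}: {joules_per_iter(ms, watts):.3f} J/iter")
```

The comparison is not apples to apples: GPU board power includes the card's own memory, while CPU package power leaves out DRAM, chipset, and the rest of the system.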
Old 2019-09-08, 21:34   #11
mackerel
 
Feb 2016
UK

2·199 Posts

Quote:
Originally Posted by xx005fs
In my opinion, you really can't beat efficiencies of GPUs
At the risk of taking this on a tangent, I presume that is for large Mersenne tasks. I know there have been attempts at implementing e.g. LLR on GPU, with performance that wasn't anything to talk about. I haven't kept up to date, but presume nothing has changed recently. I'm left wondering whether this is a GPU limitation, a code limitation, or a math limitation. For now, CPUs are still optimal for many forms of prime finding.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.