mersenneforum.org

scan80269 2019-09-08 15:44

CPU Energy Efficiency for Prime95
 
I just picked up an Intel Core i9-9900T CPU (8C/16T, 35W TDP) and, in combination with 32GB of dual-rank DDR4-3600 memory and an ASRock Z390 Phantom Gaming-ITX/ac motherboard, managed to achieve some decent throughput figures:

Timings for 2048K FFT length (8 cores, 1 worker): 1.11 ms. Throughput: 904.50 iter/sec.
Timings for 2304K FFT length (8 cores, 1 worker): 1.47 ms. Throughput: 680.55 iter/sec.
Timings for 2400K FFT length (8 cores, 1 worker): 1.83 ms. Throughput: 545.20 iter/sec.
Timings for 2560K FFT length (8 cores, 1 worker): 1.95 ms. Throughput: 512.50 iter/sec.
Timings for 2688K FFT length (8 cores, 1 worker): 2.03 ms. Throughput: 492.94 iter/sec.
Timings for 2880K FFT length (8 cores, 1 worker): 2.29 ms. Throughput: 437.28 iter/sec.
Timings for 3072K FFT length (8 cores, 1 worker): 2.42 ms. Throughput: 413.06 iter/sec.
Timings for 3200K FFT length (8 cores, 1 worker): 2.55 ms. Throughput: 391.66 iter/sec.
Timings for 3360K FFT length (8 cores, 1 worker): 2.82 ms. Throughput: 354.72 iter/sec.
Timings for 3456K FFT length (8 cores, 1 worker): 2.82 ms. Throughput: 353.99 iter/sec.
Timings for 3584K FFT length (8 cores, 1 worker): 2.95 ms. Throughput: 339.23 iter/sec.
Timings for 3840K FFT length (8 cores, 1 worker): 3.14 ms. Throughput: 318.70 iter/sec.
Timings for 4096K FFT length (8 cores, 1 worker): 3.47 ms. Throughput: 288.33 iter/sec.
Timings for 4480K FFT length (8 cores, 1 worker): 3.92 ms. Throughput: 255.15 iter/sec.
Timings for 4608K FFT length (8 cores, 1 worker): 3.88 ms. Throughput: 258.06 iter/sec.
Timings for 4800K FFT length (8 cores, 1 worker): 4.34 ms. Throughput: 230.54 iter/sec.
Timings for 5120K FFT length (8 cores, 1 worker): 4.53 ms. Throughput: 220.61 iter/sec.
Timings for 5376K FFT length (8 cores, 1 worker): 4.80 ms. Throughput: 208.33 iter/sec.
Timings for 5760K FFT length (8 cores, 1 worker): 5.39 ms. Throughput: 185.67 iter/sec.
Timings for 6144K FFT length (8 cores, 1 worker): 5.61 ms. Throughput: 178.22 iter/sec.
Timings for 6400K FFT length (8 cores, 1 worker): 5.98 ms. Throughput: 167.09 iter/sec.
Timings for 6720K FFT length (8 cores, 1 worker): 6.19 ms. Throughput: 161.55 iter/sec.
Timings for 6912K FFT length (8 cores, 1 worker): 6.55 ms. Throughput: 152.76 iter/sec.
Timings for 7168K FFT length (8 cores, 1 worker): 6.47 ms. Throughput: 154.53 iter/sec.
Timings for 7680K FFT length (8 cores, 1 worker): 7.02 ms. Throughput: 142.46 iter/sec.
Timings for 8064K FFT length (8 cores, 1 worker): 7.46 ms. Throughput: 134.00 iter/sec.
Timings for 8192K FFT length (8 cores, 1 worker): 7.51 ms. Throughput: 133.24 iter/sec.
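
For anyone who wants to work with these figures, here is a minimal Python sketch that parses lines in the format above. It is not part of Prime95; the names `LINE_RE` and `parse_benchmark_line` are made up for illustration, and the pattern only assumes the layout shown in this post.

```python
import re

# Pattern for one benchmark line in the format shown above (an assumption
# based on this post only, not on any Prime95 specification).
LINE_RE = re.compile(
    r"Timings for (\d+)K FFT length \((\d+) cores?, (\d+) workers?\): "
    r"\s*([\d.]+) ms\.\s+Throughput: ([\d.]+) iter/sec\."
)

def parse_benchmark_line(line):
    """Return (fft_k, cores, workers, ms_per_iter, iters_per_sec), or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    fft_k, cores, workers = (int(m.group(i)) for i in (1, 2, 3))
    return fft_k, cores, workers, float(m.group(4)), float(m.group(5))

sample = ("Timings for 5120K FFT length (8 cores, 1 worker): "
          "4.53 ms. Throughput: 220.61 iter/sec.")
fft_k, cores, workers, ms, ips = parse_benchmark_line(sample)

# With a single worker, throughput should be close to 1000 / ms
# (not exactly equal, because the printed ms value is rounded).
print(fft_k, ms, ips, round(1000.0 / ms, 2))
```

With one worker the reported throughput is essentially the reciprocal of the per-iteration time, so either column can be used for the comparisons discussed in this thread.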

This CPU strikes me as being quite energy efficient in running Prime95. Its throughput is similar to that of a Core i7-8700K CPU (6C/12T, 95W TDP) but at roughly 1/3 the power dissipation.

It would be interesting to determine which modern CPU can achieve the highest efficiency for running Prime95, essentially an iters/sec per watt metric.

The CPUs delivering the highest throughput numbers tend not to be the most energy efficient. My i7-5960X rig with quad channel DDR4 memory is currently the fastest among my systems, but this CPU consumes almost 120W while running Prime95, and its throughput is nowhere close to 3X that of the i9-9900T. Upcoming Intel Cascade Lake-X CPUs may provide an efficiency improvement over current gen Core X CPUs, so I'll be watching those closely.

Mobile CPUs with high core count may also be candidates for the energy efficiency crown.

axn 2019-09-08 16:31

Keep in mind that 35W TDP != 35W max power consumption. Have you tried to measure the actual power at the wall? Obviously, the total system power will be much higher, but the CPU itself might draw much more than 35W at full load, especially when using AVX.

scan80269 2019-09-08 16:44

Yes, I'm fully aware of that.

When Prime95 is started from idle, the i9-9900T engages Turbo Boost and consumes ~70W package power for several seconds, as measured using HWMonitor. The package power then goes back down close to 35W and stays there.

I did not override any default settings for CPU such as AVX offset in the BIOS, so once the CPU runs out of power & thermal headroom with Turbo Boost, it reduces the frequencies of the cores and the package power returns to TDP level.

Total system power is of course much higher than 35W, with memory, chipset, graphics, storage, VRs, etc. all consuming power in addition to the CPU itself, but this is true for any computer system.

nomead 2019-09-08 16:48

Also see how the clock speed behaves during the run. As I recall, the 35W "TDP" limited parts run at full speed for some time (some seconds, tens of seconds at most, depending on the motherboard manufacturer's parameters in the BIOS) and then throttle the clock lower if CPU demand stays high. And if the cooling is designed for 35 watts, it will probably also hit some sort of thermal throttling during a longer run. Check this with a full exponent test, not just a throughput benchmark.

M344587487 2019-09-08 17:11

[QUOTE=scan80269;525475]...
It would be interesting to determine which modern CPU can achieve the highest efficiency for running Prime95, essentially an iters/sec per watt metric.
...[/QUOTE]
Probably the 3900X or 3950X is the most energy-efficient consumer CPU, assuming the 64MB of cache is a big deal. Epyc Zen 2 will be the best once server CPUs are included, because it runs in the sweet spot of the power curve and packs compute more densely, so it has less overhead per iteration and probably makes better use of the PSU's sweet spot. But if we include GPUs, the Radeon VII beats all reasonable options.

hansl 2019-09-08 17:17

[QUOTE=M344587487;525483]Probably the 3900X or 3950X is the most energy-efficient consumer CPU, assuming the 64MB of cache is a big deal. Epyc Zen 2 will be the best once server CPUs are included, because it runs in the sweet spot of the power curve and packs compute more densely, so it has less overhead per iteration and probably makes better use of the PSU's sweet spot. But if we include GPUs, the Radeon VII beats all reasonable options.[/QUOTE]
Wouldn't the 3000 series chips be bandwidth limited with only 2 memory channels?
I could definitely see the EPYC doing well though; 8 channels per socket, I think?

scan80269 2019-09-08 17:24

1 Attachment(s)
Your description is spot on.

Intel CPU Turbo Boost frequencies for all cores correspond to a power level way higher than TDP, especially when running AVX. The Turbo Boost duration for desktop Intel CPUs is typically no more than a few seconds by default, after which the CPU core frequencies will go down to bring the steady state package power consumption in line with TDP.

The attached screenshot is from my i7-8700T CPU running Prime95 exponent 86622433. The cores are at 2.7GHz most of the time, which is a bit higher than the 2.4GHz "base frequency", but nowhere near the 4.0GHz "max Turbo frequency". Steady state package power fluctuates slightly but is always very close to TDP at 35W.

Thermal throttling is an entirely different thing, and occurs when CPU internal temperature reaches "PROCHOT", typically 100C but may be higher or lower depending on CPU model. As long as the thermal solution can evacuate TDP level heat away from the CPU, the core/package temperatures should not reach PROCHOT. My i7-8700T has a giant NOFAN CR-95C passive cooler as thermal solution, and the CPU package and core temperatures are below 60C while running Prime95 24x7.

scan80269 2019-09-08 17:44

[QUOTE=hansl;525485]Wouldn't the 3000 series chips be bandwidth limited with only 2 memory channels?
I could definitely see the EPYC doing well though; 8 channels per socket, I think?[/QUOTE]

How about the Xeon W-3175X with 28 cores and 6 channels of DDR4 memory and 255W TDP? I can't afford such a system (the motherboard alone is $1800US) so will need help from someone else to see how it fares in computational efficiency as well as throughput for Prime95.

I suspect the i9-9900T with high speed DDR4 memory (e.g. 3600 dual rank) may be hard to beat in efficiency, since the TDP is only 35W, so CPUs with higher TDP will need to deliver several times the throughput to come out on top. For example, even with AVX512 and 6-channel memory, I doubt if a Xeon W-3175X platform can achieve >7.2X the throughput of what I posted for i9-9900T/DDR4-3600. Wouldn't mind being proven wrong, though.

Perhaps the way to compare is to simply take the iters/sec figure for leading edge exponents (5120K FFT??) divided by the CPU steady state package power, with optimized thread and worker counts for each CPU.
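
A rough Python sketch of that comparison, using only the figures already posted in this thread. The function names are hypothetical, and it assumes steady state package power equals TDP for both CPUs, which is what HWMonitor shows for the i9-9900T but is unverified for the Xeon W-3175X.

```python
def iters_per_sec_per_watt(iters_per_sec, package_watts):
    """Efficiency metric: throughput divided by steady state package power."""
    return iters_per_sec / package_watts

def break_even_throughput_factor(candidate_watts, baseline_watts):
    """How many times the baseline throughput a candidate needs to match its efficiency."""
    return candidate_watts / baseline_watts

# i9-9900T at 5120K FFT: 220.61 iter/sec at about 35 W package power.
baseline_eff = iters_per_sec_per_watt(220.61, 35.0)

# Xeon W-3175X: 255 W TDP, used here as a stand-in for package power (assumption).
factor = break_even_throughput_factor(255.0, 35.0)

print(f"i9-9900T: {baseline_eff:.2f} iter/sec per watt")
print(f"W-3175X needs more than {factor:.2f}x the throughput, "
      f"i.e. about {220.61 * factor:.0f} iter/sec at 5120K FFT")
```

The break-even factor is just the ratio of the two power figures, so any measured package power can be substituted for the TDP values once someone has real numbers.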

mackerel 2019-09-08 17:53

On Intel, TDP is the maximum power required to run at base clock at elevated but within spec temperatures. Most enthusiast level systems with adequate cooling will boost and remain at PL2 indefinitely, which is somewhere above TDP. This is considered in-spec by Intel; it isn't overclocking. Where TDP really comes into play is in thermally limited systems, such as laptops and horrible systems from box shifters like Dell. Because of the limited cooling potential, they allow a short boost above TDP before pulling back to it. Clocks will probably be below max turbo, but they don't necessarily have to drop all the way down to base.

As for efficiency, it mostly comes down to where on the efficiency curve you run a CPU. Lower clocks at lower voltage help a huge amount, if you can throw enough cores at it. So a direct comparison between Intel and AMD isn't trivial.

I'm doing the new Fermat divisor project on PrimeGrid at the moment. Based on CPU self-reported power, I can also look at average work unit time, and work out a production over time. Combined, I can work out the number of tasks per kWh. Units at time of writing are quite small, 120k-128k FFT, running 1 task per core.
3600 - 328 units/kWh
3700X - 430 units/kWh
6700k - 230 units/kWh
E5 2683v3 - 293 units/kWh
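
The tasks-per-kWh calculation described above can be sketched in Python as follows. This is only an illustration: the function name is made up, and the inputs in the example call are placeholders, not the measurements behind the figures listed above (those raw numbers were not posted).

```python
def units_per_kwh(parallel_tasks, avg_unit_seconds, avg_watts):
    """Work units completed per kWh of CPU self-reported energy.

    parallel_tasks   -- tasks running at once (1 task per core here)
    avg_unit_seconds -- average wall time for one work unit
    avg_watts        -- average CPU self-reported power while running
    """
    units_per_hour = parallel_tasks * 3600.0 / avg_unit_seconds
    kilowatts = avg_watts / 1000.0
    return units_per_hour / kilowatts

# Placeholder inputs, NOT measured values: 8 tasks, 900 s per unit, 80 W.
# That is 32 units per hour at 0.08 kW, so about 400 units/kWh.
print(units_per_kwh(parallel_tasks=8, avg_unit_seconds=900.0, avg_watts=80.0))
```

Because it uses CPU self-reported power, the result ignores the rest of the system, so it is best used to compare CPUs against each other rather than to estimate an electricity bill.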

Mainstream consumer CPUs tend to be more biased towards clock than efficiency, which is in part why there are lower power versions available too.

The 3600 and 3700X reported near enough the same power used running 6 and 8 tasks respectively, and clocks only differed by about 100 MHz. I don't know if they used a better bin on the 3700X but it certainly is more efficient.

It's been on my "to do" list for a while: I want to see what sort of tradeoffs can be had by essentially underclocking/undervolting. Might be simpler to run a lower power limit and let the CPU take care of it.

xx005fs 2019-09-08 19:23

[QUOTE=scan80269;525492]How about the Xeon W-3175X with 28 cores and 6 channels of DDR4 memory and 255W TDP? I can't afford such a system (the motherboard alone is $1800US) so will need help from someone else to see how it fares in computational efficiency as well as throughput for Prime95.

I suspect the i9-9900T with high speed DDR4 memory (e.g. 3600 dual rank) may be hard to beat in efficiency, since the TDP is only 35W, so CPUs with higher TDP will need to deliver several times the throughput to come out on top. For example, even with AVX512 and 6-channel memory, I doubt if a Xeon W-3175X platform can achieve >7.2X the throughput of what I posted for i9-9900T/DDR4-3600. Wouldn't mind being proven wrong, though.

Perhaps the way to compare is to simply take the iters/sec figure for leading edge exponents (5120K FFT??) divided by the CPU steady state package power, with optimized thread and worker counts for each CPU.[/QUOTE]

In my opinion, you really can't beat the efficiency of GPUs. If you purchase something like a Radeon VII, you can easily achieve around 1ms/it for 5120K FFT while pulling around 160W, and it is significantly cheaper than a crap ton of 9900T systems. On the Nvidia side, you don't really have a choice since all the GeForce cards are crippled on FP64. A Titan V, though, has absolutely insane performance while drawing little power (from my experience 0.83ms/it on 5120K FFT and 150W). If you are going to shell out for a system as expensive as a W-3175X, you might as well purchase a Titan V, which is cheaper, faster, and draws less power. The biggest advantage of GPUs, though, is that you can scale up extremely easily by just adding more cards, and they take up relatively little space.
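
One way to put the numbers quoted in this thread on a common scale is energy per iteration (watts times seconds per iteration). The Python sketch below does that arithmetic. The function name is hypothetical, and the power figures are not directly comparable: the i9-9900T value is CPU package power only, and the posts do not say how the GPU power was measured, so treat the output as a rough illustration rather than a ranking.

```python
def joules_per_iteration(ms_per_iter, watts):
    """Energy per iteration in joules: power (W) times time per iteration (s)."""
    return watts * ms_per_iter / 1000.0

# (ms per iteration at 5120K FFT, watts) as quoted in this thread.
quoted = {
    "i9-9900T (package power only)": (4.53, 35.0),
    "Radeon VII": (1.0, 160.0),
    "Titan V": (0.83, 150.0),
}

for name, (ms, watts) in quoted.items():
    print(f"{name}: {joules_per_iteration(ms, watts):.4f} J/iter")
```

Lower is better here. The CPU figure excludes memory, chipset, and the rest of the platform, and a GPU also needs a host system, so wall power measurements would be needed for a fair comparison.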

mackerel 2019-09-08 21:34

[QUOTE=xx005fs;525504]In my opinion, you really can't beat efficiencies of GPUs[/QUOTE]

At the risk of taking this on a tangent, I presume that is for large Mersenne tasks. I know there have been attempts at implementing e.g. LLR on GPU with... performance that wasn't anything to talk about. I haven't kept up to date, but presume nothing has changed recently. I'm left wondering if this is a GPU limitation, a code limitation, or a math limitation? For now, CPUs are still optimal for many forms of prime finding.

