mersenneforum.org

scan80269 2019-09-08 15:44

CPU Energy Efficiency for Prime95
 
I just picked up an Intel Core i9-9900T CPU (8C/16T, 35W TDP) and, in combination with 32GB of dual-rank DDR4-3600 memory and an ASRock Z390 Phantom Gaming-ITX/ac motherboard, managed to achieve some decent throughput figures:

Timings for 2048K FFT length (8 cores, 1 worker): 1.11 ms. Throughput: 904.50 iter/sec.
Timings for 2304K FFT length (8 cores, 1 worker): 1.47 ms. Throughput: 680.55 iter/sec.
Timings for 2400K FFT length (8 cores, 1 worker): 1.83 ms. Throughput: 545.20 iter/sec.
Timings for 2560K FFT length (8 cores, 1 worker): 1.95 ms. Throughput: 512.50 iter/sec.
Timings for 2688K FFT length (8 cores, 1 worker): 2.03 ms. Throughput: 492.94 iter/sec.
Timings for 2880K FFT length (8 cores, 1 worker): 2.29 ms. Throughput: 437.28 iter/sec.
Timings for 3072K FFT length (8 cores, 1 worker): 2.42 ms. Throughput: 413.06 iter/sec.
Timings for 3200K FFT length (8 cores, 1 worker): 2.55 ms. Throughput: 391.66 iter/sec.
Timings for 3360K FFT length (8 cores, 1 worker): 2.82 ms. Throughput: 354.72 iter/sec.
Timings for 3456K FFT length (8 cores, 1 worker): 2.82 ms. Throughput: 353.99 iter/sec.
Timings for 3584K FFT length (8 cores, 1 worker): 2.95 ms. Throughput: 339.23 iter/sec.
Timings for 3840K FFT length (8 cores, 1 worker): 3.14 ms. Throughput: 318.70 iter/sec.
Timings for 4096K FFT length (8 cores, 1 worker): 3.47 ms. Throughput: 288.33 iter/sec.
Timings for 4480K FFT length (8 cores, 1 worker): 3.92 ms. Throughput: 255.15 iter/sec.
Timings for 4608K FFT length (8 cores, 1 worker): 3.88 ms. Throughput: 258.06 iter/sec.
Timings for 4800K FFT length (8 cores, 1 worker): 4.34 ms. Throughput: 230.54 iter/sec.
Timings for 5120K FFT length (8 cores, 1 worker): 4.53 ms. Throughput: 220.61 iter/sec.
Timings for 5376K FFT length (8 cores, 1 worker): 4.80 ms. Throughput: 208.33 iter/sec.
Timings for 5760K FFT length (8 cores, 1 worker): 5.39 ms. Throughput: 185.67 iter/sec.
Timings for 6144K FFT length (8 cores, 1 worker): 5.61 ms. Throughput: 178.22 iter/sec.
Timings for 6400K FFT length (8 cores, 1 worker): 5.98 ms. Throughput: 167.09 iter/sec.
Timings for 6720K FFT length (8 cores, 1 worker): 6.19 ms. Throughput: 161.55 iter/sec.
Timings for 6912K FFT length (8 cores, 1 worker): 6.55 ms. Throughput: 152.76 iter/sec.
Timings for 7168K FFT length (8 cores, 1 worker): 6.47 ms. Throughput: 154.53 iter/sec.
Timings for 7680K FFT length (8 cores, 1 worker): 7.02 ms. Throughput: 142.46 iter/sec.
Timings for 8064K FFT length (8 cores, 1 worker): 7.46 ms. Throughput: 134.00 iter/sec.
Timings for 8192K FFT length (8 cores, 1 worker): 7.51 ms. Throughput: 133.24 iter/sec.

This CPU strikes me as being quite energy efficient in running Prime95. Its throughput is similar to that of a Core i7-8700K CPU (6C/12T, 95W TDP) but at nearly 1/3 the power dissipation.

It would be interesting to determine which modern CPU can achieve the highest efficiency for running Prime95, essentially an iters/sec per watt metric.

The CPUs delivering the highest throughput numbers tend not to be the most energy efficient. My i7-5960X rig with quad channel DDR4 memory is currently the fastest among my systems, but this CPU consumes almost 120W while running Prime95, and its throughput is nowhere close to 3X of the i9-9900T. Upcoming Intel Cascade Lake-X CPUs may provide an efficiency improvement over current gen Core X CPUs, so I'll be watching those closely.

Mobile CPUs with high core count may also be candidates for the energy efficiency crown.

axn 2019-09-08 16:31

Keep in mind that 35W TDP != 35W max power consumption. Have you tried to measure the actual power at the wall? Obviously, the total system power will be much higher, but the CPU itself might draw much more than 35W at full load, especially when using AVX.

scan80269 2019-09-08 16:44

Yes, I'm fully aware of that.

When Prime95 is started from idle, the i9-9900T engages Turbo Boost and consumes ~70W package power for several seconds, as measured using HWMonitor. The package power then goes back down close to 35W and stays there.

I did not override any default settings for CPU such as AVX offset in the BIOS, so once the CPU runs out of power & thermal headroom with Turbo Boost, it reduces the frequencies of the cores and the package power returns to TDP level.

Total system power is of course much higher than 35W, with memory, chipset, graphics, storage, VRs, etc. all consuming power in addition to the CPU itself, but this is true for any computer system.

nomead 2019-09-08 16:48

Also see how the clock speed behaves during the run. As I recall, the 35W "TDP" limited parts run at full speed for some time (some seconds, at most tens of seconds, depending on the motherboard manufacturer's parameters in the BIOS) and then throttle the clock lower if CPU demand stays high. And if the cooling is designed for 35 watts, it will probably also hit some sort of thermal throttling during a longer run. So try a full exponent test, not just the throughput benchmark.

M344587487 2019-09-08 17:11

[QUOTE=scan80269;525475]...
It would be interesting to determine which modern CPU can achieve the highest efficiency for running Prime95, essentially an iters/sec per watt metric.
...[/QUOTE]
Probably the 3900X or 3950X are the most energy efficient consumer CPUs, assuming the 64MB of cache is a big deal. Epyc Zen 2 will be the best if server CPUs are included, due to running in the sweet spot of the power curve and more densely packing the compute power, therefore having less overhead per iteration and probably better utilising the sweet spot of the PSU. But if we include GPUs the Radeon VII beats all reasonable options.

hansl 2019-09-08 17:17

[QUOTE=M344587487;525483]Probably the 3900X or 3950X are the most energy efficient consumer CPUs, assuming the 64MB of cache is a big deal. Epyc Zen 2 will be the best if server CPUs are included, due to running in the sweet spot of the power curve and more densely packing the compute power, therefore having less overhead per iteration and probably better utilising the sweet spot of the PSU. But if we include GPUs the Radeon VII beats all reasonable options.[/QUOTE]
Wouldn't the 3000 series chips be bandwidth limited with only 2 memory channels?
I could definitely see the EPYC doing well though; 8 channels per socket, I think?

scan80269 2019-09-08 17:24

1 Attachment(s)
Your description is spot on.

Intel CPU Turbo Boost frequencies for all cores correspond to a power level way higher than TDP, especially when running AVX. The Turbo Boost duration for desktop Intel CPUs is typically no more than a few seconds by default, after which the CPU core frequencies will go down to bring the steady state package power consumption in line with TDP.

The attached screenshot is from my i7-8700T CPU running Prime95 on exponent 86622433. The cores are at 2.7GHz most of the time, which is a bit higher than the 2.4GHz "base frequency", but nowhere near the 4.0GHz "max Turbo frequency". Steady state package power fluctuates slightly but is always very close to the 35W TDP.

Thermal throttling is an entirely different thing, and occurs when CPU internal temperature reaches "PROCHOT", typically 100C but may be higher or lower depending on CPU model. As long as the thermal solution can evacuate TDP level heat away from the CPU, the core/package temperatures should not reach PROCHOT. My i7-8700T has a giant NOFAN CR-95C passive cooler as thermal solution, and the CPU package and core temperatures are below 60C while running Prime95 24x7.

scan80269 2019-09-08 17:44

[QUOTE=hansl;525485]Wouldn't the 3000 series chips be bandwidth limited with only 2 memory channels?
I could definitely see the EPYC doing well though; 8 channels per socket, I think?[/QUOTE]

How about the Xeon W-3175X with 28 cores and 6 channels of DDR4 memory and 255W TDP? I can't afford such a system (the motherboard alone is $1800US) so will need help from someone else to see how it fares in computational efficiency as well as throughput for Prime95.

I suspect the i9-9900T with high speed DDR4 memory (e.g. 3600 dual rank) may be hard to beat in efficiency, since the TDP is only 35W, so CPUs with higher TDP will need to deliver several times the throughput to come out on top. For example, even with AVX512 and 6-channel memory, I doubt if a Xeon W-3175X platform can achieve >7.2X the throughput of what I posted for i9-9900T/DDR4-3600. Wouldn't mind being proven wrong, though.

Perhaps the way to compare is to simply take the iters/sec figure for leading edge exponents (5120K FFT??) divided by the CPU steady state package power, with optimized thread and worker counts for each CPU.
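
A minimal Python sketch of the metric proposed above. It assumes the steady-state package power of the i9-9900T equals its 35 W TDP (an assumption based on the HWMonitor readings described earlier; a measured value should replace it) and uses the 5120K figure from the benchmark in the first post. The function names are illustrative only.

```python
def iters_per_joule(iters_per_sec: float, watts: float) -> float:
    """Throughput divided by power: (iter/s) / W equals iterations per joule."""
    return iters_per_sec / watts


def breakeven_throughput_ratio(watts_a: float, watts_b: float) -> float:
    """How many times faster CPU B must be than CPU A to match A's efficiency."""
    return watts_b / watts_a


# i9-9900T at 5120K FFT: 220.61 iter/sec, ASSUMED 35 W steady-state package power.
print(round(iters_per_joule(220.61, 35.0), 2))

# A 255 W part compared against a 35 W part, both assumed to run at TDP.
print(round(breakeven_throughput_ratio(35.0, 255.0), 2))
```

Under these assumptions the break-even multiplier comes out a little above 7, in line with the ">7.2X" estimate for the Xeon W-3175X.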

mackerel 2019-09-08 17:53

On Intel, TDP is the maximum power required to run at base clock at elevated but within spec temperatures. Most enthusiast level systems with adequate cooling will boost and remain at PL2 indefinitely, which is somewhere above TDP. This is considered in-spec by Intel; it isn't overclocking. Where TDP really comes into play is in thermally limited systems, such as laptops and horrible systems from box shifters like Dell. Because of the limited cooling potential, they allow a short boost above TDP before pulling back to it. Clocks will probably be below max turbo, but don't necessarily have to drop all the way down to base.

As for efficiency, it mostly comes down to where on the efficiency curve you run on a CPU. Lower clocks at lower voltage helps a huge amount, if you can throw enough cores at it. So a direct comparison between Intel and AMD isn't trivial.

I'm doing the new Fermat divisor project on PrimeGrid at the moment. Based on CPU self-reported power, I can also look at average work unit time, and work out a production over time. Combined, I can work out the number of tasks per kWh. Units at time of writing are quite small, 120k-128k FFT, running 1 task per core.
3600 - 328 units/kWh
3700X - 430 units/kWh
6700k - 230 units/kWh
E5 2683v3 - 293 units/kWh
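
The units/kWh arithmetic described above can be sketched roughly as follows. The inputs in the example call are hypothetical placeholders, not the measurements behind the figures listed, and the function name is made up for illustration.

```python
def tasks_per_kwh(package_watts: float, avg_task_seconds: float,
                  concurrent_tasks: int) -> float:
    """Tasks completed per kWh, given CPU self-reported power and average task time."""
    tasks_per_hour = concurrent_tasks * 3600.0 / avg_task_seconds
    kwh_per_hour = package_watts / 1000.0
    return tasks_per_hour / kwh_per_hour


# HYPOTHETICAL inputs: 8 tasks running at once, 600 s per task, 80 W reported.
print(round(tasks_per_kwh(80.0, 600.0, 8), 1))
```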

Mainstream consumer CPUs tend to be more biased towards clock than efficiency, which is in part why there are lower power versions available too.

The 3600 and 3700X reported near enough the same power used running 6 and 8 tasks respectively, and clocks only differed by about 100 MHz. I don't know if they used a better bin on the 3700X but it certainly is more efficient.

Been on my "to do" list for a while, but I wanted to see what sort of tradeoffs can be had by essentially underclocking/undervolting. Might be simpler to run a lower power limit and let the CPU take care of it.

xx005fs 2019-09-08 19:23

[QUOTE=scan80269;525492]How about the Xeon W-3175X with 28 cores and 6 channels of DDR4 memory and 255W TDP? I can't afford such a system (the motherboard alone is $1800US) so will need help from someone else to see how it fares in computational efficiency as well as throughput for Prime95.

I suspect the i9-9900T with high speed DDR4 memory (e.g. 3600 dual rank) may be hard to beat in efficiency, since the TDP is only 35W, so CPUs with higher TDP will need to deliver several times the throughput to come out on top. For example, even with AVX512 and 6-channel memory, I doubt if a Xeon W-3175X platform can achieve >7.2X the throughput of what I posted for i9-9900T/DDR4-3600. Wouldn't mind being proven wrong, though.

Perhaps the way to compare is to simply take the iters/sec figure for leading edge exponents (5120K FFT??) divided by the CPU steady state package power, with optimized thread and worker counts for each CPU.[/QUOTE]

In my opinion, you really can't beat the efficiency of GPUs. If you purchase something like a Radeon VII you can easily achieve around 1ms/it for 5120K FFT while pulling around 160W, and it is significantly cheaper than a crap ton of 9900T systems. On the Nvidia side, you don't really have a choice since all the GeForce cards are crippled on FP64, though a Titan V has absolutely insane performance while drawing little power (from my experience 0.83ms/it on 5120K FFT at 150W). If you are going to shell out for a system as expensive as a W-3175X, you might as well purchase a Titan V, which is cheaper, faster, and draws less power. The biggest advantage of GPUs, though, is that you can scale things up extremely easily by just adding more cards, and they take up relatively little space.

mackerel 2019-09-08 21:34

[QUOTE=xx005fs;525504]In my opinion, you really can't beat the efficiency of GPUs[/QUOTE]

At the risk of taking this on a tangent, I presume that is for large mersenne tasks. I know there have been attempts at implementing e.g. LLR on GPU with... performance that wasn't anything to talk about. I haven't kept up to date, but presume nothing has changed recently. I'm left wondering if this is a GPU limitation, a code limitation, or a math limitation? For now, CPUs are still optimal for many forms of prime finding.

VBCurtis 2019-09-09 01:47

[QUOTE=scan80269;525492]I suspect the i9-9900T with high speed DDR4 memory (e.g. 3600 dual rank) may be hard to beat in efficiency, since the TDP is only 35W, so CPUs with higher TDP will need to deliver several times the throughput to come out on top. For example, even with AVX512 and 6-channel memory, I doubt if a Xeon W-3175X platform can achieve >7.2X the throughput of what I posted for i9-9900T/DDR4-3600. Wouldn't mind being proven wrong, though.

Perhaps the way to compare is to simply take the iters/sec figure for leading edge exponents (5120K FFT??) divided by the CPU steady state package power, with optimized thread and worker counts for each CPU.[/QUOTE]
I don't see how it is useful to compare CPU power draw for efficiency, rather than wall-socket power draw. If my CPU draws double the power of your 9900 for 50% more production, I think my system is more efficient per watt of power used, because my wall-socket power drawn is less than 50% more than yours. If you think of efficiency as "lowest cost of electricity per LL test completed", then it's wall-socket wattage for sure.
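
Whether the example above works out depends on how much fixed, non-CPU power the system draws. Here is a rough Python sketch under a simplifying assumption that wall power is CPU power plus a constant overhead (PSU losses that vary with load are ignored). The function names and the 35 W baseline are illustrative.

```python
def system_iters_per_joule(iters_per_sec: float, cpu_watts: float,
                           overhead_watts: float) -> float:
    """Efficiency at the wall, ASSUMING wall power = CPU power + fixed overhead."""
    return iters_per_sec / (cpu_watts + overhead_watts)


def breakeven_overhead_watts(cpu_watts: float, power_ratio: float,
                             throughput_ratio: float) -> float:
    """Overhead above which the bigger CPU is the more efficient system.

    The bigger CPU draws power_ratio times the CPU power and delivers
    throughput_ratio times the throughput (throughput_ratio > 1).
    """
    return cpu_watts * (power_ratio - throughput_ratio) / (throughput_ratio - 1.0)


# Double the CPU power for 50% more production, against a 35 W baseline CPU.
print(breakeven_overhead_watts(35.0, 2.0, 1.5))
```

In this simple model, the system with the hungrier CPU comes out ahead at the wall only when the fixed overhead exceeds the break-even value, which is why wall measurements are needed to settle the question.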

nomead 2019-09-09 13:01

[QUOTE=hansl;525485]Wouldn't the 3000 series chips be bandwidth limited with only 2 memory channels?[/QUOTE]

And they are, if the FFT size is large enough so that the data doesn't fit inside the L3 cache. If it does fit, the bandwidth limit is significantly higher, probably what the Infinity Fabric can transfer (write at half the speed of reads) is the limiting factor. See [URL="https://www.mersenneforum.org/showpost.php?p=524000&postcount=103"]benchmarks[/URL] and [URL="https://www.mersenneforum.org/showpost.php?p=525003&postcount=110"]speculation[/URL] in the Zen 2 thread.

retina 2019-09-09 13:34

[QUOTE=scan80269;525492]Perhaps the way to compare is to simply take the iters/sec figure for leading edge exponents (5120K FFT??) divided by the CPU steady state package power, with optimized thread and worker counts for each CPU.[/QUOTE]There is more to a computer than just the CPU. You need to measure all the components. RAM, mobo, PSU, GPU, HDD/SSD, fans, display, etc. A CPU doesn't run with nothing connected.

Buy one of those power plug watt meters and use that figure as the basis for efficiency computations.

scan80269 2019-09-10 01:54

OK. I picked up several used Watts-Up Pro meters a couple of years ago, and can put them to good use measuring AC power of my systems running Prime95.

Since virtually all of my systems running Prime95 are headless, with no display, keyboard or mouse attached, that is the way I'll be measuring the system power with watt meters.

Questions. Will system power vary by the assigned exponents? Should I just take AC power readings while the systems are running their current assignments? Would it matter if the current assignment is LL-D or PRP? Or should I pause the assignments and launch a specific torture test? Most of my systems are presently running PRP on exponents in the 88 million range.

VBCurtis 2019-09-10 03:19

Try your various ideas, report whether outlet-wattage varies among any of those things. I doubt it, but I am prepared to be surprised.

hansl 2019-09-10 03:43

As I understand it, the smallest of the FFTs are allegedly the most torturous, so I would expect them to draw a few more watts than larger ones.

ewmayer 2019-09-10 19:41

[QUOTE=hansl;525602]As I understand it, the smallest of the FFTs are allegedly the most torturous, so I would expect them to draw a few more watts than larger ones.[/QUOTE]

Not necessarily - those small FFTs fit almost entirely in L1/2 caches, so tend to stress the CPU the most. Larger FFTs will spill into L3 and then into main memory, so may leave the CPU slightly less stressed but the overall system more stressed. Watts-at-wall will tell the tale.

scan80269 2019-09-22 01:36

Here are two of my systems running Prime95 with AC power consumption measured using Watts-Up Pro meters:

System #1
- Intel NUC8i7BEH
- Intel Core i7-8559U CPU (28W TDP)
- 16GB DDR4-2133 memory (2 x Samsung 8GB 2133 2Rx8 SODIMM, CL15, CR=1T)
- Samsung 850 PRO 256GB SATA SSD
- Akasa Turing fanless chassis

4 cores 1 worker
PRP exponent 86831357, FFT=4608K
ms/iter: 5.736
AC power: 47.2W

System #2
- ASRock Z390M-ITX/ac motherboard
- Intel Core i9-9900T CPU (35W TDP)
- Nofan CR-95C passive heatsink (black pearl)
- 32GB DDR4-3600 memory (2 x Corsair 16GB 3600 2Rx8 UDIMM, CL17, CR=2T)
- Samsung SM961 512GB NVMe PCIe SSD
- Seasonic Prime Titanium 600W fanless power supply
- ThermalTake Core P1 chassis

8 cores 1 worker
PRP exponent 86846297, FFT=4608K
ms/iter: 4.513
AC power: 59.6W

Both systems are running headless: no display, keyboard or mouse attached.

The second system draws more power from the wall than the first system, but cranks faster. Efficiency wise (iters/watt) they are quite similar.
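
The comparison can be reproduced from the numbers above with a few lines of Python. This is only a sketch of the arithmetic; the function name is made up for illustration.

```python
def wall_efficiency(ms_per_iter: float, ac_watts: float) -> float:
    """Iterations per joule at the wall: (1000 / ms per iter) / AC watts."""
    iters_per_sec = 1000.0 / ms_per_iter
    return iters_per_sec / ac_watts


# (ms/iter, AC watts) as reported for the two systems at FFT=4608K.
systems = {
    "System #1 (i7-8559U)": (5.736, 47.2),
    "System #2 (i9-9900T)": (4.513, 59.6),
}

for name, (ms_per_iter, ac_watts) in systems.items():
    print(f"{name}: {wall_efficiency(ms_per_iter, ac_watts):.2f} iter/J")
```

Both systems land at roughly 3.7 iterations per joule at the wall.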

petrw1 2019-09-24 05:34

Power bill becoming a concern
 
I'm thinking of replacing some outdated CPUs for MUCH MUCH faster but hopefully NOT a lot more power consumption.

What would I need to buy to produce the same LL or P1 throughput as these 4 CPUs:
i5-2500
i5-3570
i5-3570
i5-3570K

For example my i7-7820X produces about the same P1 throughput as the first 3 combined
… at I suspect about 1.5 times the power draw of 1 of them.

The first PC above also has a GTX-980 GPU which I'd like to replace with maybe a 2080.

OR... might I be better off getting a GPU that is efficient at P1/LL, rather than a big honking CPU?

I understand Titan for example is really good at LL but not so great at TF whereas the 2080 is great at TF but so-so at LL.

Is there a card that is good at both? (That doesn't cost the same as a car?)

scan80269 2019-09-24 06:51

[QUOTE=petrw1;526456]I'm thinking of replacing some outdated CPUs for MUCH MUCH faster but hopefully NOT a lot more power consumption.

What would I need to buy to produce the same LL or P1 throughput as these 4 CPUs:
i5-2500
i5-3570
i5-3570
i5-3570K

For example my i7-7820X produces about the same P1 throughput as the first 3 combined
… at I suspect about 1.5 times the power draw of 1 of them.

The first PC above also has a GTX-980 GPU which I'd like to replace with maybe a 2080.

OR... might I be better off getting a GPU that is efficient at P1/LL, rather than a big honking CPU?

I understand Titan for example is really good at LL but not so great at TF whereas the 2080 is great at TF but so-so at LL.

Is there a card that is good at both? (That doesn't cost the same as a car?)[/QUOTE]

Recent generations of Intel desktop processors (e.g. Coffee Lake-R) should provide significant performance increases AND power reductions compared to your Intel 2nd gen Core (Sandy Bridge) and 3rd gen Core (Ivy Bridge) processors.

I've been a fan of low power Intel desktop processors, and am especially partial to the 35W series, such as i7-7700T (4-core, Kaby Lake), i7-8700T (6-core, Coffee Lake) and even i9-9900T (8-core, Coffee Lake refresh). These are paired with dual-channel DDR4 memory at 2400 or 2666, which has nearly twice the clock frequency of DDR3 memory at 1333 or 1600, or more than twice if you count memory overclocking. I've recently found DDR4-3600 dual-ranked (16GB sticks) memory quite optimal for running Prime95 PRP with leading edge exponents (FFT=4608K or higher).
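
For a rough sense of what the memory speeds mean, the theoretical peak bandwidth can be estimated in Python as transfers per second times bus width times channel count. This sketch assumes the standard 64-bit (8-byte) channel of DDR3/DDR4 DIMMs and gives an upper bound only; sustained bandwidth under Prime95 will be lower. The function name is illustrative.

```python
def peak_bandwidth_gbs(mega_transfers_per_sec: float, channels: int,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (an upper bound, not measured)."""
    return mega_transfers_per_sec * 1e6 * bus_bytes * channels / 1e9


configs = [
    ("DDR3-1600 dual channel", 1600, 2),
    ("DDR4-2666 dual channel", 2666, 2),
    ("DDR4-3600 dual channel", 3600, 2),
    ("DDR4-2133 quad channel", 2133, 4),
]

for label, rate, channels in configs:
    print(f"{label}: {peak_bandwidth_gbs(rate, channels):.1f} GB/s")
```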

My fastest system for Prime95 has an i7-5960X CPU (Haswell-E, 140W) with quad channel DDR4-2133 CL14 memory, but its energy efficiency is substantially inferior to my i7-8559U and i9-9900T systems.

So it comes down to whether you want to pursue high energy efficiency or high absolute performance. Favoring the former can help lower your electric bill and still achieve some increase in prime computing throughput. The latter may get you to speed record territory but with a sizable penalty in electricity cost.

preda 2019-09-24 11:08

[QUOTE=petrw1;526456]I'm thinking of replacing some outdated CPUs for MUCH MUCH faster but hopefully NOT a lot more power consumption.
[...]
OR... might I be better off getting a GPU that is efficient at P1/LL, rather than a big honking CPU?

I understand Titan for example is really good at LL but not so great at TF whereas the 2080 is great at TF but so-so at LL.

Is there a card that is good at both? (That doesn't cost the same as a car?)[/QUOTE]

Radeon VII is both powerful *and* efficient for PRP. (numbers: I run one of my RadeonVIIs in "power efficient" mode; it is using 150W (self-reported by the GPU; at the plug I measure about 25% more overhead), and does 955us/it at the wavefront)

juza89 2020-01-05 11:50

I just recently upgraded my computer to a 9900K.

I did a couple of tests regarding wattage vs. throughput. I only tested the 2880K FFT because that's the size I am double-checking at the moment.

[B]Results for 9900K @ 3.6GHz, all cores:[/B]
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (1 core, 1 worker): 11.45 ms. Throughput: 87.36 iter/sec.
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (2 cores, 1 worker): 5.98 ms. Throughput: 167.18 iter/sec.
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (2 cores, 2 workers): 11.71, 11.60 ms. Throughput: 171.63 iter/sec.
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (4 cores, 1 worker): 3.22 ms. Throughput: 310.47 iter/sec.
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (4 cores, 2 workers): 7.12, 7.06 ms. Throughput: 281.97 iter/sec.
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (4 cores, 4 workers): 14.38, 14.23, 14.26, 14.28 ms. Throughput: 279.98 iter/sec.
[B]FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (6 cores, 1 worker): 2.73 ms. Throughput: 366.08 iter/sec.[/B]
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (6 cores, 2 workers): 6.80, 6.79 ms. Throughput: 294.25 iter/sec.
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (6 cores, 4 workers): 21.49, 21.41, 10.72, 10.71 ms. Throughput: 279.89 iter/sec.
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (8 cores, 1 worker): 2.76 ms. Throughput: 362.61 iter/sec.
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (8 cores, 2 workers): 7.24, 7.23 ms. Throughput: 276.43 iter/sec.
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (8 cores, 4 workers): 14.98, 15.23, 15.15, 15.02 ms. Throughput: 265.06 iter/sec.
FFTlen=2880K, Type=3, Arch=4, Pass1=320, Pass2=9216, clm=4 (8 cores, 8 workers): 30.88, 30.40, 31.11, 30.18, 30.40, 30.64, 30.48, 30.48 ms. Throughput: 261.71 iter/sec.

Double-checking work with 6 cores, 1 worker results in 62 W power consumption as reported by HWMonitor.

Same test on the 9900K with a 4.7GHz boost on all cores: 6 cores, 1 worker was still the fastest at 380 iter/sec, and power usage was 125 W.

Conclusion: doubling the power consumption results in only a roughly 5% performance increase in Prime95.
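
The figures above can be checked with a short Python sketch; the helper names are made up for illustration, and the inputs are the HWMonitor package power and benchmark throughput quoted in this post.

```python
def percent_gain(base: float, new: float) -> float:
    """Relative increase of new over base, in percent."""
    return (new / base - 1.0) * 100.0


def iters_per_joule(iters_per_sec: float, watts: float) -> float:
    """Package-level efficiency: throughput divided by package power."""
    return iters_per_sec / watts


stock = (366.08, 62.0)   # 3.6GHz: iter/sec, package watts
boost = (380.0, 125.0)   # 4.7GHz: iter/sec, package watts

print(f"throughput gain: {percent_gain(stock[0], boost[0]):.1f}%")
print(f"power increase:  {percent_gain(stock[1], boost[1]):.1f}%")
print(f"iter/J at 3.6GHz: {iters_per_joule(*stock):.2f}")
print(f"iter/J at 4.7GHz: {iters_per_joule(*boost):.2f}")
```

The throughput gain works out to just under 4% for about twice the power, so efficiency roughly halves at the higher clock.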

VBCurtis 2020-01-06 04:28

Conclusion: Your CPU is waiting on your memory to provide data.
A higher-GHz CPU, in this case, is hurry up and wait.

You could try *under*clocking the CPU, if you were looking for peak efficiency; I imagine you could drop wattage 10% or more while still waiting on memory a bit.

nomead 2020-01-06 05:56

I have an i5-8400 with crappy slow OEM memory at work. It's a 6-core processor but I'm actually running Prime95 on just 4 cores because the throughput was best at that setting. So it is starved for memory bandwidth even sooner.

juza89 2020-01-25 11:25

1 Attachment(s)
I did some more testing for the 9900K.
I overclocked the memory to 3600MHz at 1.38V and got it stable.
I tested different CPU speeds from 800MHz to 4000MHz. No need to go faster because memory starts to bottleneck.

Fastest speed was 455.32 iter/sec with 6 cores @ 4000MHz, consuming 84.5 watts. That results in 5.39 iters/watt.

For peak efficiency per watt, 1500MHz was able to get 283.27 iter/sec with 25 watts of consumption. That's 11.33 iters/watt!

For anyone interested, I've attached all the data that I collected in a spreadsheet.

retina 2020-01-25 11:35

[QUOTE=juza89;535934]I did some more testing for the 9900K.
I overclocked the memory to 3600MHz at 1.38V and got it stable.
I tested different CPU speeds from 800MHz to 4000MHz. No need to go faster because memory starts to bottleneck.

Fastest speed was 455.32 iter/sec with 6 cores @ 4000MHz, consuming 84.5 watts. That results in 5.39 iters/watt.

For peak efficiency per watt, 1500MHz was able to get 283.27 iter/sec with 25 watts of consumption. That's 11.33 iters/watt!

For anyone interested, I've attached all the data that I collected in a spreadsheet.[/QUOTE]Thanks for the values.

Note that if you compute iterations/sec per Watt then the output unit is iterations/Joule (because 1 Watt = 1 Joule/sec).

kriesel 2020-01-25 23:04

iterations/Joule is an interesting measure, but a great deal depends on the fft length.
And on whether the system's many auxiliary loads are fed by those Joules, or only the cpu. Where is the power consumption measured, at the wall plug, the cpu's sensors, or elsewhere? Computational effort per iteration is O(n log n log log n), not constant.
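
To illustrate why iterations/Joule figures at different FFT lengths are not directly comparable, here is a rough Python sketch that takes the O(n log n log log n) expression literally as a relative cost model. That is a strong simplification: constant factors, cache effects and memory bandwidth are all ignored, and the function names are illustrative.

```python
import math


def relative_work(fft_len_k: int) -> float:
    """Relative cost of one iteration, using n log n log log n as a crude model."""
    n = fft_len_k * 1024
    return n * math.log(n) * math.log(math.log(n))


def work_ratio(small_fft_k: int, large_fft_k: int) -> float:
    """How much more work one iteration at the larger FFT length represents."""
    return relative_work(large_fft_k) / relative_work(small_fft_k)


# 2880K (the benchmark runs) versus 4608K (the wall-power measurements).
print(round(work_ratio(2880, 4608), 2))
```

Under this model an iteration at 4608K represents noticeably more than 1.6 times the work of one at 2880K, so a lower iter/J figure at the larger size does not by itself mean a less efficient machine.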

juza89 2020-01-26 08:07

[QUOTE=kriesel;535973]iterations/Joule is an interesting measure, but a great deal depends on the fft length.
And on whether the system's many auxiliary loads are fed by those Joules, or only the cpu. Where is the power consumption measured, at the wall plug, the cpu's sensors, or elsewhere? Computational effort per iteration is O(n log n log log n), not constant.[/QUOTE]

All measurements were done using the 2880K FFT.
Power usage was measured by reading the CPU's sensors (package power) with HWMonitor while the first FFT implementation was running in the Prime95 benchmark, and I averaged it by eye. For example, where I recorded 25 W, the reading was actually fluctuating between 24.7 and 25.3. Random spikes were ignored; I figured they were probably background processes occasionally consuming CPU cycles.
There was little difference in consumption between the different FFT implementations, but I didn't bother taking measurements for every implementation. It would have been too time-consuming.
The whole point of the iters/joule measure was to find the most power-efficient speed for the CPU. I would have guessed it to be at the slowest speed and lowest core voltage, but that turned out not to be the case.
First I changed CPU speeds in the BIOS with voltages on auto, but for some reason the voltages didn't go lower than 0.9 V. Then I found software called ThrottleStop, which lets you change CPU speed on the fly from Windows. The voltages at the different speeds were the stock voltages the CPU asked for.

TheJudger 2020-01-28 22:07

Stock Ryzen 9 3900X with dual DDR4-3200 (dual rank).

[CODE]Prime95 64-bit version 29.8, RdtscTiming=1
Timings for 2880K FFT length (12 cores, 1 worker): 1.33 ms. Throughput: 750.80 iter/sec.
Timings for 2880K FFT length (12 cores, 2 workers): 2.10, 2.10 ms. Throughput: 954.10 iter/sec.
[/CODE]

L3 cache works fine! When you increase the FFT size (somewhere near 4M) you want to switch to 12 cores, 1 worker because it doesn't fit twice into the L3 cache anymore.
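
A back-of-the-envelope Python check of the cache argument, assuming 8 bytes (one double) per FFT element and treating the 3900X's 64MB of L3 as a single pool. Both are simplifications: the real working set includes additional tables, and the L3 is split across core complexes. The function names are illustrative.

```python
def fft_working_set_mib(fft_len_k: int, bytes_per_element: int = 8) -> float:
    """Approximate FFT data size in MiB, ASSUMING one 8-byte double per element."""
    return fft_len_k * 1024 * bytes_per_element / 2**20


def workers_fit_in_l3(fft_len_k: int, workers: int, l3_mib: float = 64.0) -> bool:
    """True if all workers' FFT data is strictly smaller than the L3 pool."""
    return workers * fft_working_set_mib(fft_len_k) < l3_mib


for fft_len_k in (2880, 4096):
    size = fft_working_set_mib(fft_len_k)
    print(f"{fft_len_k}K: {size:.1f} MiB per worker, "
          f"2 workers fit: {workers_fit_in_l3(fft_len_k, 2)}")
```

With these assumptions two 2880K workers need about 45 MiB, while two 4096K workers would need the entire 64 MiB, which matches the observation that the crossover sits somewhere near 4M.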

Oliver

TheJudger 2020-01-31 21:21

Hi,

full benchmarks here: [URL="https://mersenneforum.org/showpost.php?p=536336&postcount=788"]https://mersenneforum.org/showpost.php?p=536336&postcount=788[/URL]

Oliver

DrobinsonPE 2020-10-25 06:38

1 Attachment(s)
I decided that my computers were using too much power, so I have started tuning them for efficiency. My first set of results is for an i3-9100.

With the setting I chose to use, mprime speed dropped by less than 10% with almost a 46% drop in power use. In addition to the significant drop in power use, the computer fans spin much slower now and make significantly less noise. The one thing I should have measured during the testing but didn't was the CPU temperature. I will look into adding that in the future. Most likely the temperature is much lower with the reduced power use.

See the attached picture for the details.
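
As a quick worked example in Python, the percentages quoted above translate into an efficiency gain as follows. The 10% and 46% inputs are the rounded bounds from the text, not the exact values in the attached table, and the function name is made up for illustration.

```python
def efficiency_gain(speed_drop_fraction: float, power_drop_fraction: float) -> float:
    """Ratio of new to old iterations per joule after tuning."""
    return (1.0 - speed_drop_fraction) / (1.0 - power_drop_fraction)


# Speed down by (at most) 10%, power down by about 46%.
print(round(efficiency_gain(0.10, 0.46), 2))
```

Since the speed drop was less than 10%, this is a lower bound: each joule buys at least about two thirds more iterations than before.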

axn 2020-10-25 07:55

[QUOTE=DrobinsonPE;561058]With the setting I chose to use, mprime speed dropped by less than 10% with almost a 46% drop in power use. In addition to the significant drop in power use, the computer fans spin much slower now and make significantly less noise. The one thing I should have measured during the testing but didn't was the CPU temperature. I will look into adding that in the future. Most likely the temperature is much lower with the reduced power use.[/QUOTE]

Very nice. How was the power measured?

Uncwilly 2020-10-25 10:12

[QUOTE=axn;561064]Very nice. How was the power measured?[/QUOTE]

The table says it was measured with a Kill-a-Watt.

axn 2020-10-25 10:31

[QUOTE=Uncwilly;561073]It says with a Kill-a-Watt on the table.[/QUOTE]

Ah, missed that.

M344587487 2020-10-25 11:48

[QUOTE=scan80269;526249]Here are two of my systems running Prime95 with AC power consumption measured using Watts-Up Pro meters:

System #1
- Intel NUC8i7BEH
- Intel Core i7-8559U CPU (28W TDP)
- 16GB DDR4-2133 memory (2 x Samsung 8GB 2133 2Rx8 SODIMM, CL15, CR=1T)
- Samsung 850 PRO 256GB SATA SSD
- Akasa Turing fanless chassis

4 cores 1 worker
PRP exponent 86831357, FFT=4608K
ms/iter: 5.736
AC power: 47.2W

System #2
- ASRock Z390M-ITX/ac motherboard
- Intel Core i9-9900T CPU (35W TDP)
- Nofan CR-95C passive heatsink (black pearl)
- 32GB DDR4-3600 memory (2 x Corsair 16GB 3600 2Rx8 UDIMM, CL17, CR=2T)
- Samsung SM961 512GB NVMe PCIe SSD
- Seasonic Prime Titanium 600W fanless power supply
- ThermalTake Core P1 chassis

8 cores 1 worker
PRP exponent 86846297, FFT=4608K
ms/iter: 4.513
AC power: 59.6W

Both systems are running headless: no display, keyboard or mouse attached.

The second system draws more power from the wall than the first system, but cranks faster. Efficiency wise (iters/watt) they are quite similar.[/QUOTE]
Here are my 4700U results, tuned for efficiency:
8 cores 1 worker
PRP exponent 86846297, FFT=4608K
ms/iter: 8.1
AC power: 20.5W
it/j: 6.02

The small cache hurts throughput but the efficiency is nice. Measured at the wall at 0.5W resolution; I ran the test for a few hours to let the it/s settle (I think thermal saturation came into play as it was ~7.8 ms/it initially and slowly crept up to 8.1). 11W chip power target, headless, 2x16GB 3200 CL22, tlp enabled in bat mode, fan set to performance, 1TB NVMe. With Ethernet, Wi-Fi and Bluetooth enabled it's ~21W, with Bluetooth disabled it's ~20.5W and with everything disabled it's ~20W. When everything was disabled, power was measured with everything unplugged.

Much better it/j than the above Intel parts but still very poor compared to an R7 no doubt. The CPU in the newest Intel laptop part, the i7-1165G7, generally performs much worse than Zen 2 mobile, but it has more cache so P95 may be an outlier in Intel's favour. Zen 3 desktop parts next month and mobile parts next year are the ones to watch.
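
For a sense of scale, these numbers can be turned into wall-clock time and energy for one complete test. The Python sketch below assumes a PRP test takes roughly as many iterations as the exponent and ignores any extra overhead such as error checking; the function names are illustrative.

```python
def full_test_days(exponent: int, ms_per_iter: float) -> float:
    """Approximate duration of one test, ASSUMING ~exponent iterations."""
    return exponent * ms_per_iter / 1000.0 / 86400.0


def full_test_kwh(exponent: int, ms_per_iter: float, ac_watts: float) -> float:
    """Approximate energy at the wall for one test, in kWh."""
    seconds = exponent * ms_per_iter / 1000.0
    return ac_watts * seconds / 3.6e6  # 1 kWh = 3.6e6 J


exponent, ms_per_iter, ac_watts = 86846297, 8.1, 20.5
print(f"{full_test_days(exponent, ms_per_iter):.1f} days, "
      f"{full_test_kwh(exponent, ms_per_iter, ac_watts):.2f} kWh")
```

Under those assumptions one test takes a little over 8 days and about 4 kWh at the wall.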

Someone come out with an APU with an integrated stack of HBM2e already and stick it on an SBC or SFF.

Viliam Furik 2020-10-25 18:01

[QUOTE=M344587487;561079]Here's my 4700u results tuned for efficiency:[/QUOTE]

If it's a CPU in a laptop, does it even make sense to measure the power consumption at the wall? You are in fact measuring the speed at which the battery charges (which should be almost constant, as the AC adapter is made to do so), and unless the laptop circuitry can direct power around the battery when plugged in, it is not useful at all, IMO.

Try the same experiment with CPU idling. If it pulls 20.5 W, it is only made to charge the battery at an almost constant rate, and I was right.

If it pulls significantly less (<15W), it can direct the power around the battery, you realized the measurement correctly, and I feel terribly stupid.

Based on the AMD site, the CPU can pull up to 25 W, despite the 15 W TDP.

M344587487 2020-10-25 19:38

It's an Asus PN50, an SFF PC using laptop components, no battery or screen, with an external power brick like a laptop. By default the chip's configured to use 25W in its boost state for a few minutes before dropping to 15W sustained; the defaults can be set by the manufacturer to match their cooler. I've set every power target to 11W for efficiency.



Even if it was a CPU in a laptop I'd argue power at the wall is a valid way to measure as long as you measure for a long-enough period of time to average any inconsistencies away. In the end all power consumed is power the laptop spends; a battery to maintain is just another variable like extra peripherals, different motherboards with different quality VRMs and different quality PSUs. That said, I'm pretty sure that unless you catch the battery in a discharge/recharge cycle (automatic maintenance), the battery can be removed as a factor.

Viliam Furik 2020-10-25 20:02

[QUOTE=M344587487;561118]It's an Asus PN50...[/QUOTE]
Oh, then I am sorry. We have a saying "I am sprinkling ash onto my head." - rough translation. I've found out there is a similar idiom in English - "Eat humble pie." Anyway, I think you've got the point.

[QUOTE=M344587487;561118]
Even if it was a CPU in a laptop I'd argue power at the wall is a valid way to measure...[/QUOTE]
I still don't think that would be a valid measurement. I will do an experiment soon, with my pretty old Sandy Bridge laptop. I will post the results when it's done.

DrobinsonPE 2020-10-25 21:17

1 Attachment(s)
My next efficiency tuning challenge is a computer that I have a GTX 1650 Super GPU in. I previously tried setting different power levels using nvidia-smi -pl and at the suggestion of one of the forum members locked the gpu to the base clock using nvidia-smi -lgc 1530.

Using the -lgc command to set the GPU frequency, I recorded mfakto output, power use, and GPU temperature.

I ended up setting the GPU at 1350MHz because with a 79W total power draw, it can run continuously instead of only running at night.

Total power reduction so far:
i3-9100 - 85W to 46W = 39W savings
GTX-1650 S - 145W to 79W = 66W savings

See the attached picture for details.

DrobinsonPE 2020-10-26 05:47

1 Attachment(s)
No one mentioned that my previous results were for a GPU not a CPU. However, to cover up that sin, I did include some information about mprime on the CPU in the picture so I at least pretended to be on subject for this thread.

To redeem myself, here is efficiency data on an i5-8250U. This is not a laptop. It is a Gigabyte Brix, a "NUC" style computer. Interesting to note, the processor is already optimized and, as the results show, there really was no significant gain in efficiency by lowering the CPU frequency. Therefore, I will continue to run it at top speed until the fan dies.

Total power reduction so far:
i3-9100 - 85W to 46W = 39W savings
GTX-1650 S - 145W to 79W = 66W savings
i5-8250U - 30.5W to 30.5W = 0W savings

See the attached picture for details.

retina 2020-10-26 05:51

[QUOTE=DrobinsonPE;561154]i3-9100 - 85W to 46W = 39W savings
GTX-1650 S - 145W to 79W = 66W savings[/QUOTE]Based upon the rule-of-thumb of USD1 per year for each watt then you will have $105 in your pocket in one year from now.

kruoli 2020-10-26 11:39

[QUOTE=Viliam Furik;561123]"I am sprinkling ash onto my head."[/QUOTE]

"Asche auf mein Haupt", we have that, too. :grin:

DrobinsonPE 2020-10-26 14:30

[QUOTE=retina;561156]Based upon the rule-of-thumb of USD1 per year for each watt then you will have $105 in your pocket in one year from now.[/QUOTE]

I wish I lived in the rule-of-thumb area of the USA. Some areas of the USA pay a lot more than that.

The utility company I am forced to deal with charges about $0.25 per kWh for baseline use and $0.306 per kWh for tier 2 use. It is a safe assumption that my household uses all of the baseline power, so any power used for "extra" non-essential use comes at the tier 2 rate.

1W*24H/D*365D/Y=8760Wh/Y
8760Wh/Y/1000Wh/kWh=8.76kWh/Y
8.76kWh/Y*0.306$/kWh=$2.68/Y

1 watt for 1 year is $2.68 for me.

At $0.306 per kWh, 250W continuous equals 6 kWh/day. That is $1.84 per day.

My goal is to stay at or under that number. Anything higher than that and my already expensive utility bill starts looking too much like a mortgage.
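The arithmetic above can be sketched in a few lines of Python for anyone who wants to plug in their own rate. The function names are hypothetical; only the $0.306/kWh tier 2 rate and the 250W budget come from this post:

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours

def yearly_cost(watts: float, usd_per_kwh: float) -> float:
    """Cost of drawing `watts` continuously for one year."""
    return watts / 1000.0 * HOURS_PER_YEAR * usd_per_kwh

def daily_cost(watts: float, usd_per_kwh: float) -> float:
    """Cost of drawing `watts` continuously for one day."""
    return watts / 1000.0 * 24 * usd_per_kwh

TIER2_RATE = 0.306  # USD per kWh, tier 2 rate quoted above

print(f"1 W for a year:  ${yearly_cost(1, TIER2_RATE):.2f}")
print(f"250 W for a day: ${daily_cost(250, TIER2_RATE):.2f}")
```

Swapping in a rate of about $0.114/kWh gives the USD1 per watt-year rule of thumb mentioned earlier.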

i3-9100 - 46W
GTX-1650 S - 79W
i5-8250U - 30.5W
J4105 - 20W
Ryzen 3-3200G - 75W

I am currently at 250.5W. If I want to build another computer...I either need to lower my current power use or get rid of a less efficient computer.

That is the problem. How to justify building the next one.

Next up is the Ryzen 3-3200G.

Uncwilly 2020-10-26 17:33

There is a thread all about the prices that forumites pay for power. It is a bit out of date for some.
[url]https://www.mersenneforum.org/showthread.php?t=22350[/url]

DrobinsonPE 2020-10-27 05:03

1 Attachment(s)
And here are the results of the Ryzen 3-3200G efficiency testing. Following the same procedure, I adjusted the CPU frequency with cpupower-gui and observed the change in mprime output. This time psensor was also used to monitor CPU temperature.

The results were very different this time. Either the motherboard or the CPU did not like the CPU frequency changing because the results were not a smooth decrease in power use and mprime output. Instead the power use and mprime output showed no change for large changes in frequency with sudden step changes in both power use and output as the frequency setting dropped.

Another complication with this test was that I did not have another computer nearby so I had to edit the Google Sheets form on the computer I was testing. This had a noticeable effect on the results.

I decided to leave the computer running at 2900MHz for a while to see what happens. That gives me a 51% drop in power use with a 32% drop in output.
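To put that trade-off in terms of output per watt, here is a rough Python sketch. The function name is made up for illustration, and the 51% and 32% figures are simply the ones quoted above:

```python
def relative_efficiency(power_drop: float, output_drop: float) -> float:
    """Ratio of new output-per-watt to old output-per-watt.

    Both arguments are fractional drops, e.g. 0.51 for a 51% drop.
    """
    return (1.0 - output_drop) / (1.0 - power_drop)

gain = relative_efficiency(power_drop=0.51, output_drop=0.32)
print(f"Output per watt is {gain:.2f}x the stock setting")
```

With those inputs the underclocked setting does roughly 1.39 times as much work per watt as stock.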

Total power reduction so far:
i3-9100 - 85W to 46W = 39W savings
GTX-1650 S - 145W to 79W = 66W savings
i5-8250U - 30.5W to 30.5W = 0W savings
3200G - 75W to 36W = 39W savings

See the attached picture for details.

phillipsjk 2020-11-01 07:07

AMD Opteron VS AMD Athlon X2
 
Because I noticed that my "white elephant" machine takes around 96kWh to do an 80-100M LL test (in slightly less efficient low power mode), I decided to compare it to an Athlon X2 5000+ machine that I found more efficient than a Pentium D back in the day.


Don't have detailed measurements that I was able to find, but my "White Elephant" can do an LL test on an 80M exponent in about 21 days, at a power draw of 150W/CPU.

Specifications:
[LIST][*] Quad AMD Opteron 6272 (16 cores/CPU, 2.1GHz, 1.6GHz in low power mode)[*] 128GiB of LP DDR3 ECC RAM[*] 64GB flash drive for storage (btrfs)[/LIST] Power = Work/Time -> Work=(Power)(Time) -> (0.150kW)(24hours/day)(21days) = [B]96kWh[/B].


The Athlon X2 5000+, drawing only 123W/CPU, has an LL runtime of about 144 days for an 88M exponent.

Specifications:
[LIST][*] Athlon X2 5000+ (2 cores, 2.2GHz)[*] 2GiB of DDR RAM[*] Live DVD for testing[/LIST] Work=(Power)(Time) -> (0.123kW)(24hours/day)(144days) = [B]425kWh[/B]


Not sure why I was surprised that the old machine got trounced so badly.
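For anyone wanting to redo the energy-per-test arithmetic, here is a minimal Python sketch of Work = Power x Time using the figures quoted in this post. The function name is made up for illustration. Note that the Opteron inputs exactly as written (150W, 21 days) multiply out to 75.6kWh rather than the quoted 96kWh, so the quoted figure presumably rests on a power draw or runtime that isn't spelled out here; the Athlon figure matches.

```python
def energy_kwh(power_watts: float, days: float) -> float:
    """Energy in kWh for a constant power draw over a number of days."""
    return power_watts / 1000.0 * 24.0 * days

# Inputs exactly as written in the post
opteron_kwh = energy_kwh(150, 21)
athlon_kwh = energy_kwh(123, 144)

print(f"Opteron 6272: {opteron_kwh:.1f} kWh per LL test")
print(f"Athlon X2:    {athlon_kwh:.1f} kWh per LL test")
```

The two exponents are not the same size (80M vs 88M), so this only gives a ballpark comparison.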


Part 2 will be checking how much more efficient my AMD VEGA 56 GPU is once I get it working.

Mark Rose 2020-11-01 18:51

You'll see similar efficiency gains using modern CPUs.

My Intel Skylake quad cores from about five years ago, with total system power at 63 watts from the wall (underclocked and undervolted), can do an 80M exponent in about 3 days, or about 4.5 kWh.

However, I'm no longer running four systems off one power supply, and now have GPUs and case fans adding to the power consumption.

phillipsjk 2020-11-02 07:49

[QUOTE=Mark Rose;561852]You'll see similar efficiency gains using modern CPUs.

My Intel Skylake quad cores from about five years ago, with total system power at 63 watts from the wall (underclocked and undervolted), can do an 80M exponent in about 3 days, or about 4.5 kWh.
[/QUOTE]

Since GPUs appear to be off-topic in this thread, my next test may actually be [URL="https://ark.intel.com/content/www/us/en/ark/compare.html?productIds=33910,29765"]Intel Core 2 Quad Q6600 (2.4GHz) vs (Duo) E8400 (3.0GHz)[/URL] (with DDR2).


Did not realize the CPUs in my "White Elephant" may be 9 years old now. It is one of my newer machines.

DrobinsonPE 2020-11-08 01:55

2 Attachment(s)
Here are the efficiency testing results for two more computers. I pulled the 3200G out of the Deskmini A300 and installed an A8-9600 that I already had in its place. The 3200G is now on an ASRock B450-HDV R4.0 so it will get re-tested for efficiency in the future.

Celeron J4105 - This computer did not respond well to adjusting the CPU frequency so I found a different way to increase the efficiency.

A8-9600 - This is not an efficient processor. I knew older AMD CPUs were not that good but I did not expect it to be this bad. The response to lowering the CPU frequency was similar to the 3200G so it is probably the motherboard that does not like the operating system to adjust the CPU frequency.

Total power reduction so far:
i3-9100 - 85W to 46W = 39W savings
GTX-1650 S - 145W to 79W = 66W savings
i5-8250U - 30.5W to 30.5W = 0W savings
[STRIKE]3200G - 75W to 36W = 39W savings[/STRIKE]
J4105 - 20W to 21.4W = -1.4W savings
A8-9600 - 70W to 36W = 34W savings

See the attached pictures for details. In the pictures I am tracking a lot more than just power use changes.

