2021-07-25, 14:49   #9
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3·1,933 Posts

Quote:
 Originally Posted by drkirkby I have a few questions about that table
On the CUDALucas and TF benchmark pages at mersenne.ca, pause your mouse cursor over any blue column heading, or over the downward arrow to the right of GHzDays/day, for a popup description.

The mersenne.ca CUDALucas benchmark page is useful, within its limitations, for relative comparisons between GPUs. The ~295 GHD/day values for Radeon VII are old, from older, less efficient versions of Gpuowl or from CUDALucas, and considerably understate the maximum performance available with recent versions of Gpuowl.

Tdulcet reported ~75% longer run times on Colab & NVIDIA Tesla GPUs with CUDALucas than with recent Gpuowl.
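
To put that run time figure in throughput terms, here is a minimal Python sketch. The function name is hypothetical and it assumes throughput is simply the inverse of run time for the same work; it is only an illustration, not something taken from Tdulcet's report.

```python
def relative_throughput(extra_runtime_percent):
    """Throughput of the slower program relative to the faster one.

    Assumes throughput is inversely proportional to run time for the
    same exponent, so "75% longer run time" means 1 / 1.75 of the throughput.
    """
    return 1.0 / (1.0 + extra_runtime_percent / 100.0)


# CUDALucas relative to recent Gpuowl, using the ~75% figure quoted above
cudalucas_vs_gpuowl = relative_throughput(75)
print(f"CUDALucas delivers about {cudalucas_vs_gpuowl:.0%} of Gpuowl's throughput")
```

Under that assumption, ~75% longer run time works out to a bit under 60% of the throughput on the same hardware.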

I've extensively benchmarked a Radeon VII across a wide variety of Gpuowl versions and all fft lengths supported in them from 3M to 192M, on Windows 10, for specified conditions. The resulting timings in ms/iter can be seen in the last attachment of https://www.mersenneforum.org/showpo...35&postcount=2. Taking the best version's timing for each fft length, those timings correspond to a performance range from 316. to 486. GHD/day. (I used the first / default fft formulation for each size. On occasion an alternate formulation may perform better.)

Note that these measurements were made with the GPU neither clocked as aggressively as I and others have been able to use reliably on Radeon VIIs with Hynix VRAM, nor operating at full GPU power, nor on the highest-performance OS/driver combination. Benchmarking was done at an 86% power limit for improved power efficiency. Also, ROCm on Linux reportedly provides the highest performance; Woltman reported 510 GHD/day with it at, IIRC, 5M fft. Compare that to 447. at reduced power and clock on Windows at 5M. Finally, power consumption may be elevated by the more-aggressive-than-standard GPU fan curve I'm using.

Note also that mprime/prime95 and Gpuowl each have some fft lengths for which running the next higher fft can be faster.
I've found in benchmarking Gpuowl that the 13-smooth ffts (3.25M, 6.5M etc) tend to be slower than the next larger fft (3.5M, 7M, etc.), as does 15M.

At the current wavefront of ~105.1M, the 5.5M fft applies, and Gpuowl V6.11-380 benchmarked at 0.821 ms/iter, which corresponds to 0.9987 day/exponent/GPU and 419. GHD/day/GPU, again at reduced GPU power, on Windows, with below-maximum reliable VRAM clocking. I computed ~1.53 GHD/d/W for a multi-Radeon VII system, with power measured at the AC power cord, while running prime95 on its CPU. The GPU-only efficiency would be slightly higher.
That AC input power accounts for all power used, including the system RAM, which drkirkby omitted from his list; at 384GiB ECC, it is probably consuming considerable power in his system. Due to the high cost of a UPS with >1 kW output, I am running my GPU rig with inline surge suppression but no UPS.
Indicated power per GPU ranged from 190 to 212W at the 86% setting. Total AC input power divided by the number of GPUs operating was less than the nominal max GPU TDP. I'm currently running these GPUs at 80% for better power efficiency. The 419. GHD/day/GPU at ~200W actual per GPU is ~2.1 GHD/d/W on the GPUs alone, omitting system overhead and conversion losses.
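
For anyone wanting to redo the arithmetic above with their own timings, here is a rough Python sketch. The function names are hypothetical, and it assumes a primality test takes about one iteration per bit of the exponent (so ~105.1M iterations here); it ignores proof generation, save-file and startup overhead.

```python
SECONDS_PER_DAY = 86_400


def days_per_exponent(exponent, ms_per_iter):
    """Approximate days for one test, assuming ~one iteration per exponent bit."""
    return exponent * ms_per_iter / 1000.0 / SECONDS_PER_DAY


def ghd_per_day_per_watt(ghd_per_day, watts):
    """Power efficiency in GHz-days per day per watt."""
    return ghd_per_day / watts


# Figures quoted above: ~105.1M exponent, 0.821 ms/iter, 419 GHD/day, ~200 W per GPU
days = days_per_exponent(105_100_000, 0.821)
gpu_efficiency = ghd_per_day_per_watt(419, 200)

print(f"{days:.4f} day/exponent/GPU")
print(f"{gpu_efficiency:.1f} GHD/d/W on the GPU alone")
```

The ~1.53 GHD/d/W whole-system figure is the same division, but with total AC input power per GPU in the denominator instead of indicated GPU power.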

One Radeon VII so configured can match the throughput of the dual-26-core-8167M $5000 system under certain conditions, at better power efficiency, and the original cost of the entire open frame system divided by the number of GPUs was ~$700. That is more power efficient, and much more capital efficient per unit throughput. It would still be ~4x more cost effective today than the 8167M system if built at current GPU costs.

Last fiddled with by kriesel on 2021-07-25 at 15:45