A 3.6GHz 8-core Skylake-X with DDR-3600 memory. Running new AVX-512 FFT code:

Timings for 4480K FFT length (1 core, 1 worker): 12.53 ms. Throughput: 79.83 iter/sec.
Timings for 4480K FFT length (2 cores, 1 worker): 6.94 ms. Throughput: 144.15 iter/sec.
Timings for 4480K FFT length (3 cores, 1 worker): 5.21 ms. Throughput: 192.10 iter/sec.
Timings for 4480K FFT length (4 cores, 1 worker): 4.09 ms. Throughput: 244.70 iter/sec.
Timings for 4480K FFT length (5 cores, 1 worker): 3.49 ms. Throughput: 286.31 iter/sec.
Timings for 4480K FFT length (6 cores, 1 worker): 3.15 ms. Throughput: 317.06 iter/sec.
Timings for 4480K FFT length (7 cores, 1 worker): 2.95 ms. Throughput: 339.29 iter/sec.
Timings for 4480K FFT length (8 cores, 1 worker): 2.95 ms. Throughput: 338.73 iter/sec.
Timings for 4480K FFT length (5 cores, 5 workers): 15.56, 15.50, 15.48, 15.39, 15.41 ms. Throughput: 323.30 iter/sec.
Timings for 4480K FFT length (6 cores, 6 workers): 16.90, 16.85, 16.78, 16.70, 16.77, 16.73 ms. Throughput: 357.38 iter/sec.
Timings for 4480K FFT length (7 cores, 7 workers): 18.71, 18.74, 18.63, 18.54, 18.56, 18.56, 18.63 ms. Throughput: 375.84 iter/sec.
Timings for 4480K FFT length (8 cores, 8 workers): 21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36 ms. Throughput: 377.63 iter/sec.

The poor CPU is crying out for more memory bandwidth.

BTW, the old AVX code:

Timings for 4480K FFT length (8 cores, 8 workers): 24.81, 24.84, 24.70, 24.80, 24.77, 24.84, 24.80, 24.82 ms. Throughput: 322.61 iter/sec.
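For readers who want to check the throughput lines, here is a minimal sketch. It assumes the reported throughput is simply the sum of each worker's rate (1000 / time in ms); that is an inference from the numbers, not something the benchmark output states. The function and dictionary names are made up for illustration.

```python
# Per-worker iteration times in milliseconds, copied from the benchmark above.
timings_ms = {
    "AVX-512, 8 cores, 8 workers": [21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36],
    "AVX, 8 cores, 8 workers": [24.81, 24.84, 24.70, 24.80, 24.77, 24.84, 24.80, 24.82],
}


def throughput_iter_per_sec(times_ms):
    """Sum the per-worker rates: a worker taking t ms does 1000/t iter/sec.

    Assumption: workers run independently, so their rates add.
    """
    return sum(1000.0 / t for t in times_ms)


for label, times in timings_ms.items():
    print(f"{label}: {throughput_iter_per_sec(times):.2f} iter/sec")
```

The results should land within a few hundredths of the reported 377.63 and 322.61 iter/sec; the small gap is presumably rounding in the printed timings.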
 Nice! Are these all at fixed clock speeds regardless of the workload? (i.e. AVX and AVX512 both running at 3.6 GHz?) 17% speedup is more than I expected given the memory bottleneck.
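The 17% figure follows from the two 8-worker throughput numbers in the benchmark. A small sketch of the arithmetic (the helper name is hypothetical):

```python
def speedup_percent(new_rate, old_rate):
    """Percentage gain of new_rate over old_rate."""
    return (new_rate / old_rate - 1.0) * 100.0


# Throughput values taken from the 8 cores, 8 workers lines above.
avx512_rate = 377.63  # iter/sec, new AVX-512 code
avx_rate = 322.61     # iter/sec, old AVX code

print(f"AVX-512 over AVX: {speedup_percent(avx512_rate, avx_rate):.1f}%")
```

This compares throughput only. It says nothing about whether both runs used the same clock speed, which is the open question in the post.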
2018-09-03, 07:20   #3
mackerel

Feb 2016
UK

623₈ Posts

From my previous observations, the "old" AVX code wouldn't be significantly limited by RAM in that configuration. Any significant increase from AVX-512 would push it there though. Is my understanding correct that AVX-512 could double throughput if not limited by RAM? What sort of speedup do you see for smaller FFTs that fit in cache? I can see it shaking things up when it eventually makes its way to LLR.

As a side thought, I assume the CPU temperatures when running would be a little warmer than with the old code. It might get a lot hotter if it weren't limited...
 @Mystical: not sure what the AVX-512 offsets are. I just plugged the chip in and let it rip using the motherboard's default settings. sensors reports:

Code:
Physical id 0: +59.0°C (high = +95.0°C, crit = +105.0°C)
Core 0: +54.0°C (high = +95.0°C, crit = +105.0°C)
Core 1: +43.0°C (high = +95.0°C, crit = +105.0°C)
Core 2: +57.0°C (high = +95.0°C, crit = +105.0°C)
Core 3: +46.0°C (high = +95.0°C, crit = +105.0°C)
Core 4: +56.0°C (high = +95.0°C, crit = +105.0°C)
Core 5: +57.0°C (high = +95.0°C, crit = +105.0°C)
Core 6: +54.0°C (high = +95.0°C, crit = +105.0°C)
Core 7: +59.0°C (high = +95.0°C, crit = +105.0°C)

i7z snapshot:

Code:
Socket [0] - [physical cores=8, logical cores=16, max online cores ever=8]
TURBO DISABLED on 8 Cores, Hyper Threading ON
Max Frequency without considering Turbo 3599.00 MHz (99.97 x [36])
Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 Cores is 45x/41x/40x/40x/40x/40x
Real Current Frequency 3603.84 MHz [99.97 x 36.05] (Max of below)
Core [core-id] :Actual Freq (Mult.) C0% Halt(C1)% C3 % C6 % Temp VCore
Core 1 [0]: 3500.11 (35.01x) 1 99.9 0 0 54 0.9535
Core 2 [1]: 3599.86 (36.01x) 1 0.801 0 98.6 43 0.9777
Core 3 [2]: 3491.47 (34.92x) 1 100 0 0 56 0.9749
Core 4 [3]: 3603.84 (36.05x) 1 0.147 0 99.8 46 0.9835
Core 5 [4]: 3503.82 (35.05x) 1 100 0 0 57 0.9595
Core 6 [5]: 3487.92 (34.89x) 1 100 0 0 55 0.9529
Core 7 [6]: 3489.45 (34.90x) 1 100 0 0 54 0.9545
Core 8 [7]: 3500.00 (35.01x) 100 2.78 0 0 59 0.9590
C1 = Processor running with halts (States >C0 are power saver modes with cores idling)
C3 = Cores running with PLL turned off and core cache turned off
C6, C7 = Everything in C3 + core state saved to last level cache, C7 is deeper than C6
 Interestingly, uptime reports only 6 cores in use:

Code:
george@SkylakeX:~/mers295/linux64$ uptime
 09:59:12 up 78 days, 17:05, 2 users, load average: 6.01, 6.00, 6.00
uptime is not the best measure. Better to look in /proc/cpuinfo to see how many cores there are. Good luck getting the load to 8.0.

Here is my justification: When running my own code written with gwnum, I get less load than JP's LLR but similar timings.
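As a rough illustration of the /proc/cpuinfo suggestion, the sketch below counts physical cores by collecting unique ("physical id", "core id") pairs and prints them next to the load average. It assumes a Linux x86 system where /proc/cpuinfo carries those two fields (some VMs and other architectures omit them); the function names are made up for this example.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text.

    Hyper-threaded siblings share a pair, so they are counted once.
    """
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            cores.add((physical_id, value.strip()))
    return len(cores)


def report(path="/proc/cpuinfo"):
    """Print physical core count next to the 1-minute load average."""
    with open(path) as f:
        physical = count_physical_cores(f.read())
    load1, _load5, _load15 = os.getloadavg()  # Unix only
    print(f"physical cores: {physical}, 1-min load average: {load1:.2f}")


if __name__ == "__main__":
    # Only run where both the file and the load-average call are available.
    if os.path.exists("/proc/cpuinfo") and hasattr(os, "getloadavg"):
        report()
```

Note that os.cpu_count() would return logical CPUs (16 on the machine above with Hyper Threading on), which is why the sketch parses the file instead.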

 My bad. I've been running the new code doing Gerbicz PRPs on Skylake-X and it finished two work units and will not get any more work. I've got some unexpected debugging to do. The good news is that I didn't lose much throughput with two cores idle. The temps and i7z data above are inaccurate. The benchmarks are OK, as they were run after "kill -SIGSTOP" on the running mprime.
2018-09-03, 21:07   #8
Mysticial

Sep 2016

513₈ Posts

Quote:
 Originally Posted by Prime95 @Mystical: not sure what the AVX-512 offsets are. I just plugged the chip in and let it rip using the motherboard's default settings.
Sounds like it's probably -4 for AVX512.

If you want a raw cycle-for-cycle comparison of AVX vs. AVX512, you'll need to force it from the BIOS.

So you'll need to zero the offsets for both AVX and AVX512. But you'll also need to drop all the turbos to no higher than 3.6 GHz. Otherwise, you'll roast the machine when it tries to run AVX512 @ 4.0 GHz on all 8 cores.
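To make the offset arithmetic concrete, here is a small sketch. It assumes the effective AVX-512 multiplier is the turbo multiplier minus the offset, a nominal 100 MHz base clock (i7z above reports 99.97), and the 40x all-core turbo implied by the 4.0 GHz remark. The -4 offset is only the guess made in this post, and the function name is hypothetical.

```python
BCLK_MHZ = 100.0  # assumed nominal base clock


def effective_mhz(turbo_multiplier, offset, bclk_mhz=BCLK_MHZ):
    """Clock after subtracting an AVX offset from the turbo multiplier."""
    return (turbo_multiplier - offset) * bclk_mhz


# Guessed default: 40x all-core turbo with an AVX-512 offset of 4.
print(effective_mhz(40, 4))  # multiplier 36
# Offsets zeroed but turbo left alone: the case the post warns about.
print(effective_mhz(40, 0))  # multiplier 40
# Offsets zeroed and turbo capped at 36x for a like-for-like comparison.
print(effective_mhz(36, 0))  # multiplier 36
```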

2018-09-04, 02:12   #9
GP2

Sep 2003

A17₁₆ Posts

Quote:
 Originally Posted by Prime95 Timings for 4480K FFT length (1 core, 1 worker): 12.53 ms. Throughput: 79.83 iter/sec. Timings for 4480K FFT length (8 cores, 8 workers): 21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36 ms. Throughput: 377.63 iter/sec. The poor CPU is crying out for more memory bandwidth.
One of the reasons I remain a fan of running on the cloud rather than a physical box is that using 8 separate one-core virtual machines really does mean 8 times the throughput of one core. Unless you somehow contrive to get them running on the same physical cloud server, which if it ever became an issue could be avoided with staggered starts.

In the above example, that would give you 640 iter/sec combined rather than 378, which is 70% more.

Although the nominal cost advantage is probably still in favor of a barebones server farm setup (unless you live in an area with expensive power), this factor does partly tilt the balance back the other way somewhat. That, plus the fact that the upgrade to Skylake hardware was free, just start using the new instance type, which was a 20% boost even on an AVX-to-AVX basis, and now I guess based on this benchmark will be an additional 17% boost when AVX-512 code is available.
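The 640 iter/sec estimate can be reproduced with a short sketch. It takes the post's assumption at face value, namely that 8 separate one-core instances each deliver the full single-core rate; the function names are illustrative only.

```python
def ideal_throughput(single_core_rate, instances):
    """Combined rate if every instance runs at the full single-core rate."""
    return single_core_rate * instances


def gain_percent(ideal_rate, shared_rate):
    """How much more the ideal case delivers than the shared-memory box."""
    return (ideal_rate / shared_rate - 1.0) * 100.0


single_core = 79.83   # iter/sec, 1 core, 1 worker (from the benchmark)
shared_box = 377.63   # iter/sec, 8 cores, 8 workers on one machine

ideal = ideal_throughput(single_core, 8)
print(f"ideal: {ideal:.2f} iter/sec, gain: {gain_percent(ideal, shared_box):.1f}%")
```

The unrounded values come out slightly under the 640 and 70% quoted in the post, which appear to be rounded.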

2018-09-04, 19:51   #10
Mysticial

Sep 2016

331 Posts

Quote:
 Originally Posted by GP2 One of the reasons I remain a fan of running on the cloud rather than a physical box is that using 8 separate one-core virtual machines really does mean 8 times the throughput of one core. Unless you somehow contrive to get them running on the same physical cloud server, which if it ever became an issue could be avoided with staggered starts. In the above example, that would give you 640 iter/sec combined rather than 378, which is 70% more. Although the nominal cost advantage is probably still in favor of a barebones server farm setup (unless you live in an area with expensive power), this factor does partly tilt the balance back the other way somewhat. That, plus the fact that the upgrade to Skylake hardware was free, just start using the new instance type, which was a 20% boost even on an AVX-to-AVX basis, and now I guess based on this benchmark will be an additional 17% boost when AVX-512 code is available.
That sounds like a great way to piss off other cloud users!

Throw tons of single-threaded bandwidth-heavy AVX512 workloads on the cloud. Not only do you eat up all the memory bandwidth, you throttle their clocks as well!

2018-09-04, 20:38   #11
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

29×101 Posts

Quote:
 Originally Posted by Mysticial That sounds like a great way to piss off other cloud users! Throw tons of single-threaded bandwidth-heavy AVX512 workloads on the cloud. Not only do you eat up all the memory bandwidth, you throttle their clocks as well!
If EC2 users care enough, they can select dedicated tenancy instances.

