#1
P90 years forever!
Aug 2002
Yeehaw, FL
2²·11·167 Posts
A 3.6GHz 8-core Skylake-X with DDR-3600 memory. Running the new AVX-512 FFT code:
Code:
Timings for 4480K FFT length (1 core, 1 worker):  12.53 ms.  Throughput:  79.83 iter/sec.
Timings for 4480K FFT length (2 cores, 1 worker):  6.94 ms.  Throughput: 144.15 iter/sec.
Timings for 4480K FFT length (3 cores, 1 worker):  5.21 ms.  Throughput: 192.10 iter/sec.
Timings for 4480K FFT length (4 cores, 1 worker):  4.09 ms.  Throughput: 244.70 iter/sec.
Timings for 4480K FFT length (5 cores, 1 worker):  3.49 ms.  Throughput: 286.31 iter/sec.
Timings for 4480K FFT length (6 cores, 1 worker):  3.15 ms.  Throughput: 317.06 iter/sec.
Timings for 4480K FFT length (7 cores, 1 worker):  2.95 ms.  Throughput: 339.29 iter/sec.
Timings for 4480K FFT length (8 cores, 1 worker):  2.95 ms.  Throughput: 338.73 iter/sec.
Timings for 4480K FFT length (5 cores, 5 workers): 15.56, 15.50, 15.48, 15.39, 15.41 ms.  Throughput: 323.30 iter/sec.
Timings for 4480K FFT length (6 cores, 6 workers): 16.90, 16.85, 16.78, 16.70, 16.77, 16.73 ms.  Throughput: 357.38 iter/sec.
Timings for 4480K FFT length (7 cores, 7 workers): 18.71, 18.74, 18.63, 18.54, 18.56, 18.56, 18.63 ms.  Throughput: 375.84 iter/sec.
Timings for 4480K FFT length (8 cores, 8 workers): 21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36 ms.  Throughput: 377.63 iter/sec.
The poor CPU is crying out for more memory bandwidth.

BTW, the old AVX code:
Code:
Timings for 4480K FFT length (8 cores, 8 workers): 24.81, 24.84, 24.70, 24.80, 24.77, 24.84, 24.80, 24.82 ms.  Throughput: 322.61 iter/sec.
Last fiddled with by Prime95 on 2018-09-03 at 05:00
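As a sanity check on the arithmetic (this is a standalone sketch, not mprime code; every number in it is simply copied from the timings above): the aggregate throughput for N workers is the sum of 1000/t over the per-worker iteration times t in ms.
Code:
/* Standalone sanity check of the reported aggregate throughput.
   Aggregate iter/sec = sum over workers of 1000 / (ms per iteration). */
#include <stdio.h>

int main(void) {
    /* 8 cores, 8 workers, AVX-512 timings quoted above (ms/iter) */
    const double t[8] = {21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36};
    double total = 0.0;
    for (int i = 0; i < 8; i++)
        total += 1000.0 / t[i];     /* iter/sec contributed by worker i */
    printf("aggregate throughput: %.2f iter/sec\n", total);   /* prints ~377.6 */
    return 0;
}
The same sum over the old-AVX 8-worker timings gives ~322.6 iter/sec, i.e. the quoted ~17% gain.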
#2
Sep 2016
331 Posts
Nice!
Are these all at fixed clock speeds regardless of the workload (i.e., AVX and AVX-512 both running at 3.6 GHz)? A 17% speedup is more than I expected given the memory bottleneck.

Last fiddled with by Mysticial on 2018-09-03 at 05:13
#3
Feb 2016
UK
623₈ Posts
From my previous observations, the "old" AVX code wouldn't be significantly limited by RAM in that configuration. Any significant increase from AVX-512 would push it there, though.

Am I right in assuming AVX-512 could roughly double throughput if it weren't limited by RAM? What sort of speedup do you see for smaller FFTs that fit in cache? I can see it shaking things up when it eventually makes its way to LLR.

As a side thought, I assume the CPU temperatures when running would be a little warmer than with the old code. It might get a lot hotter if it weren't limited...
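To put a very rough number on the RAM question, here is a back-of-envelope sketch of the memory traffic at 4480K. Every constant in it is an assumption rather than a measurement (8-byte double elements, roughly two full read+write passes over the data per iteration, quad-channel DDR4-3600), so treat the output as order-of-magnitude only:
Code:
/* Back-of-envelope memory-traffic estimate for the 4480K FFT benchmark.
   Assumptions (not measurements): 8-byte doubles, ~2 full read+write
   passes over the data per iteration, quad-channel DDR4-3600. */
#include <stdio.h>

int main(void) {
    const double elements  = 4480.0 * 1024.0;            /* FFT length */
    const double work_set  = elements * 8.0;             /* ~36.7 MB per worker */
    const double passes    = 2.0;                        /* assumed passes per iteration */
    const double traffic   = work_set * passes * 2.0;    /* read + write each pass */
    const double iters_sec = 377.6;                      /* 8-worker AVX-512 aggregate */
    const double demand    = traffic * iters_sec / 1e9;  /* GB/s of DRAM traffic needed */
    const double ddr4_peak = 4 * 3600e6 * 8.0 / 1e9;     /* 115.2 GB/s theoretical peak */
    printf("estimated demand: %.0f GB/s of %.0f GB/s theoretical peak\n",
           demand, ddr4_peak);
    return 0;
}
That lands around 55 GB/s against a ~115 GB/s theoretical peak, and sustained streaming bandwidth is usually well below theoretical, so it is at least plausible that the AVX-512 code is already bumping into the memory system in a way a cache-resident FFT would not.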
#4
P90 years forever!
Aug 2002
Yeehaw, FL
2²·11·167 Posts
@Mysticial: not sure what the AVX-512 offsets are. I just plugged the chip in and let it rip using the motherboard's default settings.
sensors reports: Code:
Physical id 0:  +59.0°C  (high = +95.0°C, crit = +105.0°C)
Core 0:         +54.0°C  (high = +95.0°C, crit = +105.0°C)
Core 1:         +43.0°C  (high = +95.0°C, crit = +105.0°C)
Core 2:         +57.0°C  (high = +95.0°C, crit = +105.0°C)
Core 3:         +46.0°C  (high = +95.0°C, crit = +105.0°C)
Core 4:         +56.0°C  (high = +95.0°C, crit = +105.0°C)
Core 5:         +57.0°C  (high = +95.0°C, crit = +105.0°C)
Core 6:         +54.0°C  (high = +95.0°C, crit = +105.0°C)
Core 7:         +59.0°C  (high = +95.0°C, crit = +105.0°C)
Code:
Socket [0] - [physical cores=8, logical cores=16, max online cores ever=8]
TURBO DISABLED on 8 Cores, Hyper Threading ON
Max Frequency without considering Turbo 3599.00 MHz (99.97 x [36])
Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 Cores is 45x/41x/40x/40x/40x/40x
Real Current Frequency 3603.84 MHz [99.97 x 36.05] (Max of below)

  Core [core-id]  :Actual Freq (Mult.)  C0%   Halt(C1)%  C3 %  C6 %  Temp  VCore
  Core 1 [0]:      3500.11 (35.01x)       1     99.9       0      0    54  0.9535
  Core 2 [1]:      3599.86 (36.01x)       1     0.801      0   98.6    43  0.9777
  Core 3 [2]:      3491.47 (34.92x)       1     100        0      0    56  0.9749
  Core 4 [3]:      3603.84 (36.05x)       1     0.147      0   99.8    46  0.9835
  Core 5 [4]:      3503.82 (35.05x)       1     100        0      0    57  0.9595
  Core 6 [5]:      3487.92 (34.89x)       1     100        0      0    55  0.9529
  Core 7 [6]:      3489.45 (34.90x)       1     100        0      0    54  0.9545
  Core 8 [7]:      3500.00 (35.01x)     100     2.78       0      0    59  0.9590

C1 = Processor running with halts (States >C0 are power saver modes with cores idling)
C3 = Cores running with PLL turned off and core cache turned off
C6, C7 = Everything in C3 + core state saved to last level cache, C7 is deeper than C6
#5
P90 years forever!
Aug 2002
Yeehaw, FL
2²·11·167 Posts
Interestingly, uptime reports only 6 cores in use:
Code:
george@SkylakeX:~/mers295/linux64$ uptime
 09:59:12 up 78 days, 17:05,  2 users,  load average: 6.01, 6.00, 6.00
#6
Sep 2002
Database er0rr
37×97 Posts

Quote:
Here is my justification: when running my own code written with gwnum, I get a lower load than JP's LLR but similar timings.

Last fiddled with by paulunderwood on 2018-09-03 at 14:24
#7
P90 years forever!
Aug 2002
Yeehaw, FL
7348₁₀ Posts
My bad. I've been running the new code doing Gerbicz PRPs on the Skylake-X; it finished two work units and will not get any more work, leaving two workers idle. I've got some unexpected debugging to do.

The good news is that I didn't lose much throughput with two cores idle. The temps and i7z data above are inaccurate. The benchmarks are OK, as they were done after a "kill -SIGSTOP" on the running mprime.
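For anyone who hasn't used that trick: SIGSTOP freezes the process without killing it and SIGCONT resumes it later, so a paused mprime doesn't steal cycles or bandwidth from a benchmark. A minimal sketch of the same thing from C (the PID below is a placeholder, not a real one):
Code:
/* Pause a running mprime for the duration of a benchmark, then resume it.
   Equivalent to `kill -SIGSTOP <pid>` followed by `kill -SIGCONT <pid>`.
   The PID is a placeholder; look up the real one with ps or pidof. */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

int main(void) {
    pid_t pid = 12345;                  /* hypothetical mprime PID */
    if (kill(pid, SIGSTOP) != 0) {      /* freeze the whole process */
        perror("SIGSTOP");
        return 1;
    }
    /* ... run the benchmark here ... */
    if (kill(pid, SIGCONT) != 0) {      /* let mprime pick up where it left off */
        perror("SIGCONT");
        return 1;
    }
    return 0;
}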
#8
Sep 2016
513₈ Posts

Quote:
If you want a raw cycle-for-cycle comparison of AVX vs. AVX512, you'll need to force it from the BIOS. So you'll need to zero the offsets for both AVX and AVX512. But you'll also need to drop all the turbos to no higher than 3.6 GHz. Otherwise, you'll roast the machine when it tries to run AVX512 @ 4.0 GHz on all 8 cores. |
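Once the offsets and turbos are pinned, it's worth confirming what the cores actually run at under AVX-512 load. i7z or turbostat already show this; a minimal do-it-yourself sketch, assuming Linux, just scrapes the "cpu MHz" lines from /proc/cpuinfo while the benchmark is running:
Code:
/* Print the current frequency of each logical CPU on Linux by reading the
   "cpu MHz" lines from /proc/cpuinfo. Run it while mprime is benchmarking. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) { perror("/proc/cpuinfo"); return 1; }
    char line[256];
    int cpu = 0;
    double mhz;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "cpu MHz : %lf", &mhz) == 1)
            printf("cpu %2d: %.0f MHz\n", cpu++, mhz);
    fclose(f);
    return 0;
}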
#9
Sep 2003
A17₁₆ Posts

Quote:
In the above example, that would give you 640 iter/sec combined rather than 378, which is about 70% more. Although the nominal cost advantage probably still favors a barebones server farm setup (unless you live in an area with expensive power), this factor does tilt the balance back the other way somewhat. On top of that, the upgrade to Skylake hardware was free: you just start using the new instance type, which was a 20% boost even on an AVX-to-AVX basis, and based on this benchmark I guess there will be an additional 17% boost when AVX-512 code is available.
#10
Sep 2016
331 Posts

Quote:
Throw tons of single-threaded, bandwidth-heavy AVX-512 workloads on the cloud. Not only do you eat up all the memory bandwidth, you throttle their clocks as well!
#11
"/X\(‘-‘)/X\"
Jan 2013
29×101 Posts
If EC2 users care enough, they can select dedicated tenancy instances.