mersenneforum.org AVX512 performance on new shiny Intel kit

2017-10-13, 17:31   #1
heliosh

Oct 2017
++41

53 Posts

How much does AVX512 improve LL performance at the same clock rate compared to AVX2?
2017-10-13, 22:36   #2
ewmayer
2ω=0

Sep 2002
República de California

5·2,351 Posts

Quote:
 Originally Posted by heliosh How much does AVX512 improve LL performance at the same clock rate compared to AVX2?
I got ~1.6x for my AVX512 implementation running on KNL. YMMV.

2017-10-13, 23:07   #3
Mysticial

Sep 2016

101110010₂ Posts

Quote:
 Originally Posted by ewmayer I got ~1.6x for my AVX512 implementation running on KNL. YMMV.
FWIW, y-cruncher's untuned AVX512 gained anywhere from 5% to -10% on my system out-of-box, depending on how badly it throttled. (That minus 10% is not a typo. Once the system throttles, it throttles hard.)

Once I fixed the throttling, the AVX2 -> AVX512 gain hovered around 10%-ish.

Once I tuned the AVX512 binary, that grew to about 15%.

Once I overclocked the cache and memory, it grew up to 25%.

Version 0.7.4 (ETA end of weekend) makes both the AVX2 and AVX512 builds faster, but more so the AVX512 one. And I think it widens the gap by another percent or two.

By comparison, my BBP benchmark gained 90% (1.9x faster) going from AVX2 -> AVX512 in the absence of throttling. Cache and memory are irrelevant since it's L1 only.

Last fiddled with by Mysticial on 2017-10-13 at 23:09

2017-10-14, 21:50   #4
ewmayer
2ω=0

Sep 2002
República de California

11755₁₀ Posts

Quote:
 Originally Posted by Mysticial FWIW, y-cruncher's untuned AVX512 gained anywhere from 5% to -10% on my system out-of-box depending on badly it throttled. (That minus 10% is not a typo. Once the system throttles, it throttles hard.) Once I fixed the throttling, the AVX2 -> AVX512 gain hovered around 10%-ish. Once I tuned the AVX512 binary, that grew to about 15%. Once I overclocked the cache and memory, it grew up to 25%. Version 0.7.4 (ETA end of weekend) makes both the AVX2 and AVX512 faster, but more so the AVX512. And I think it widens the gap by another percent or two. By comparison, my BBP benchmark gained 90% (1.9x faster) going from AVX2 -> AVX512 in the absence of throttling. Cache and memory are irrelevant since it's L1 only.
I must've missed the relevant post(s), but how did you fix the throttling issue?

It would be interesting to compare your so-far-disappointing AVX512 gains for y-cruncher to Mlucas - back in July in this same thread you posted a bunch of Mlucas timings for an AVX512 build in a Ubuntu sandbox you installed, but I don't believe we considered doing an AVX2 build on your hardware. Comparing those 2 binaries would tell us if your AVX512 gains for y-cruncher are in line with those for my code running on the same hardware.

Note that if you're running Win10, building under Linux is now greatly eased by MSFT having actually done something right for once, by way of adding native Linux-sandbox support to that version of their OS. The Mlucas readme page has info on that, but the bottom line is that once you open such a native shell, everything works beautifully.

2017-10-15, 19:40   #5
Mysticial

Sep 2016

2×5×37 Posts

Quote:
 Originally Posted by ewmayer I must've missed the relevant post(s), but how did you fix the throttling issue?
I fixed it by compensating for the voltage droop on the CPU input voltage (among other things). Details here: http://www.overclock.net/t/1634045/s...tom-throttling

When I ran the Mlucas benchmarks, I had already fixed the throttling, so it doesn't invalidate those results. But I have significantly changed the overclock settings since then.

Now I have all 128GB running at 3800 MT/s - which is almost enough to match 6-channel 2666 MT/s on the servers. (This Samsung B-die stuff really is that good. Too bad it's so expensive now. And it's sold out everywhere so I can't even get another fix if I wanted to.)
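For anyone checking the "almost enough to match" claim: peak DDR4 bandwidth is channels × transfer rate × 8 bytes per 64-bit transfer. A quick sketch - the quad-channel-desktop vs. 6-channel-server split here is my assumption about the platforms, not a measurement:

```python
# Peak theoretical DDR4 bandwidth: channels * MT/s * 8 bytes per transfer.
# Quad-channel desktop vs. 6-channel server is an assumed configuration.
def peak_bw_gbs(channels, mts):
    return channels * mts * 8 / 1000  # GB/s

desktop = peak_bw_gbs(4, 3800)  # overclocked quad-channel
server  = peak_bw_gbs(6, 2666)  # stock 6-channel
print(f"desktop {desktop:.1f} GB/s vs server {server:.1f} GB/s")
```

So 4 channels at 3800 MT/s lands within about 5% of 6 channels at 2666 MT/s - "almost enough to match".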

Quote:
 It would be interesting to compare your so-far-disappointing AVX512 gains for y-cruncher to Mlucas - back in July in this same thread you posted a bunch of Mlucas timings for an AVX512 build in a Ubuntu sandbox you installed, but I don't believe we considered doing an AVX2 build on your hardware. Comparing those 2 binaries would tell us if your AVX512 gains for y-cruncher are in line to those for my code running on the same hardware. Note if you're running Win10, build-under-linux is now greatly eased by MSFT having actually done something right for once by way of adding native linux-sandbox support to that version of their OS. The Mlucas readme page has info on that, but bottom line that once you open such a native shell, everything works beautifully.
If you give me an updated set of commands to run, I'll grab the latest version and run them again in Ubuntu.

Last fiddled with by Mysticial on 2017-10-15 at 19:44

2017-10-15, 21:36   #6
ewmayer
2ω=0

Sep 2002
República de California

5·2,351 Posts

Quote:
 Originally Posted by Mysticial If you give me an updated set of commands to run, I'll grab the latest version and run them again in Ubuntu.
If the md5 checksum for your Mlucas tarball mismatches the current one, that means I added some code patches after you downloaded your copy - nothing major, but some stuff related to making self-tests more robust which you will want. These assume binaries built in separate obj-dirs (both within the src-dir, e.g. from src, 'mkdir obj_avx2' and 'mkdir obj_avx512'), i.e. both can use the same executable name:

AVX2:
Code:
gcc -c -O3 -mavx2 -DUSE_AVX2 -DUSE_THREADS ../*.c >& build.log
grep -i error build.log
[assuming the above grep comes up empty:]
gcc -o Mlucas *.o -lm -lpthread -lrt
AVX512:
Code:
gcc -c -O3 -march=skylake-avx512 -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
grep -i error build.log
[assuming the above grep comes up empty:]
gcc -o Mlucas *.o -lm -lpthread -lrt

A simple set of comparative timings at a GIMPS-representative FFT length should suffice for our basic purposes, but you are of course free to spend as much or as little time playing with this as you like. I suggest using 100 iters only for the 1-thread self-tests; 1000 iters will give more accurate timings for anything beyond that. I further suggest opening the mlucas.cfg file resulting from the initial set of self-tests in an editor and annotating each cfg-file best-timing line as it is printed with the -cpu options you used for that set of timings. (You can replace the reference-residues stuff that gets printed to the right of the 10-entry FFT radix set on each line with your annotations - I find that a convenient method in my own work.)
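If you want to tabulate results afterward, something along these lines pulls the best msec/iter per FFT length out of a cfg file - a throwaway sketch of mine, not part of Mlucas, and it assumes the timing lines look like "4096  msec/iter = 6.19  ROE[avg,max] = ...":

```python
import re

# Hypothetical helper (not Mlucas source): scan mlucas.cfg-style text for
# lines of the form "<fftlen>  msec/iter = <time> ..." and keep the
# fastest (lowest) time seen for each FFT length.
def best_timings(cfg_text):
    best = {}
    for m in re.finditer(r"^\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)", cfg_text, re.M):
        fft, msec = int(m.group(1)), float(m.group(2))
        if fft not in best or msec < best[fft]:
            best[fft] = msec
    return best

sample = ("4096  msec/iter = 27.05  ROE[avg,max] = [0.28, 0.31]\n"
          "4096  msec/iter =  6.19  ROE[avg,max] = [0.32, 0.41]\n")
print(best_timings(sample))  # -> {4096: 6.19}
```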

So let's say we want to do comparative timings @4096K FFT length. On an Intel 10-physical-core CPU I would do like so, each line producing a new cfg-file entry. The code is heavily geared toward power-of-2 thread counts, so I would deviate from that only at the 'what the heck - let's try all the cores' end:

./Mlucas -fftlen 4096 -iters 100 -cpu 0
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:1
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:3
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:7
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:9

./Mlucas -fftlen 4096 -iters 100 -cpu 0,10
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:1,10:11
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:3,10:13
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:7,10:17
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:19
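For reference, the -cpu argument above is a comma-separated list of logical-core indices, with lo:hi ranges inclusive - so 0:3,10:13 means cores 0-3 plus their HT siblings 10-13 on a 10-core part. A toy parser to illustrate the simple form (my own sketch, not Mlucas source, and it ignores any stride extension of the syntax):

```python
# Illustrative parser for the simple form of the -cpu syntax used above:
# comma-separated entries, each either a single index or an inclusive
# lo:hi range. Not Mlucas source code.
def parse_cpu(spec):
    cores = []
    for part in spec.split(","):
        if ":" in part:
            lo, hi = map(int, part.split(":"))
            cores.extend(range(lo, hi + 1))
        else:
            cores.append(int(part))
    return cores

print(parse_cpu("0:3,10:13"))  # -> [0, 1, 2, 3, 10, 11, 12, 13]
print(parse_cpu("0,10"))       # -> [0, 10]
```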

Last fiddled with by ewmayer on 2017-10-15 at 21:38

2017-10-17, 01:21   #7
Mysticial

Sep 2016

370₁₀ Posts

AVX2:
Code:
17.0
4096  msec/iter = 27.05  ROE[avg,max] = [0.278125000, 0.312500000]  radices =  64 32 32 32  0  0  0  0  0  0  100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
4096  msec/iter = 16.42  ROE[avg,max] = [0.265871720, 0.328125000]  radices = 256  8  8  8 16  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter = 10.70  ROE[avg,max] = [0.276559480, 0.343750000]  radices =  64 32 32 32  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  6.54  ROE[avg,max] = [0.276559480, 0.343750000]  radices =  64 32 32 32  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  6.19  ROE[avg,max] = [0.320515867, 0.406250000]  radices = 256 32 16 16  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter = 24.78  ROE[avg,max] = [0.264118304, 0.296875000]  radices = 256  8  8  8 16  0  0  0  0  0  100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
4096  msec/iter = 14.92  ROE[avg,max] = [0.320515867, 0.406250000]  radices = 256 32 16 16  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  9.64  ROE[avg,max] = [0.320515867, 0.406250000]  radices = 256 32 16 16  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.95  ROE[avg,max] = [0.320564191, 0.406250000]  radices = 256 32 16 16  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.84  ROE[avg,max] = [0.265964343, 0.328125000]  radices = 256  8  8  8 16  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
AVX512:
Code:
17.0
4096  msec/iter = 21.77  ROE[avg,max] = [0.245026507, 0.281250000]  radices =  32 16 16 16 16  0  0  0  0  0  100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
4096  msec/iter = 15.21  ROE[avg,max] = [0.244451338, 0.304687500]  radices =  32 16 16 16 16  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  8.70  ROE[avg,max] = [0.275551707, 0.375000000]  radices =  64 32 32 32  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.37  ROE[avg,max] = [0.275551707, 0.375000000]  radices =  64 32 32 32  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.36  ROE[avg,max] = [0.300239107, 0.375000000]  radices = 256 16 16 32  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter = 18.90  ROE[avg,max] = [0.301116071, 0.375000000]  radices = 256 16 16 32  0  0  0  0  0  0  100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
4096  msec/iter = 12.86  ROE[avg,max] = [0.300271323, 0.375000000]  radices = 256 16 16 32  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  7.91  ROE[avg,max] = [0.300093503, 0.406250000]  radices =  64  8 16 16 16  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.42  ROE[avg,max] = [0.275567816, 0.375000000]  radices =  64 32 32 32  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.00  ROE[avg,max] = [0.300271323, 0.375000000]  radices = 256 16 16 32  0  0  0  0  0  0  1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
So about 15 - 20%.

Hardware Specs:
AVX2 @ 4.1 GHz (overclocked from 3.6 GHz)
AVX512 @ 3.8 GHz (overclocked from 3.3 GHz)
Cache @ 3.0 GHz (overclocked from 2.4 GHz)
Memory @ 3800 MT/s (overclocked from 2666 MT/s)

The clock speeds are constant regardless of load, so no higher speeds for having only 1 or 2 cores active.

Overall, this is very similar to what I'm seeing. The threaded benchmarks suck because of the memory bandwidth. The single-threaded benchmarks suck almost as much, from what I've observed to be the L3.
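To put a number on that: pairing the best msec/iter figures above by thread count (the single-thread numbers come from the noisier 100-iter runs), the AVX2 -> AVX512 gains work out as:

```python
# Best msec/iter per thread count, transcribed from the runs above
# (lower is better). 20 threads = all 10 cores with HT.
best_avx2   = {1: 27.05, 2: 16.42, 4: 10.70, 8: 6.54, 10: 6.19, 20: 5.84}
best_avx512 = {1: 21.77, 2: 15.21, 4:  8.70, 8: 5.37, 10: 5.36, 20: 5.00}

for t in sorted(best_avx2):
    gain = best_avx2[t] / best_avx512[t] - 1.0
    print(f"{t:2d} threads: AVX512 {100 * gain:5.1f}% faster")
```

The best-vs-best case is 5.84/5.00, i.e. roughly 17%, squarely in the 15 - 20% range.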
2017-10-18, 00:40   #8
ewmayer
∂2ω=0

Sep 2002
República de California

5·2,351 Posts

Thanks, Alex - so on the good-news front, "It's not you." :)

Does your big-FFT y-cruncher code get any benefit from running 2 threads per physical core, as we see in the Mlucas timings?

Last fiddled with by ewmayer on 2017-10-18 at 00:41
2017-10-18, 01:46   #9
Mysticial

Sep 2016

2×5×37 Posts

Quote:
 Originally Posted by ewmayer Thanks, Alex - so on the good-news front, "It's not you." :) Does your big-FFT y-cruncher code get any benefit from running 2 threads per physical core, as we see in the Mlucas timings?
About 15% gain on this box. So more than Mlucas.

But this is taken over an entire Pi computation/benchmark. The workload is a lot less homogeneous, so there's a mix of more- and less-optimized code.

2017-12-14, 05:58   #10
Mysticial

Sep 2016

2·5·37 Posts

My 3800 MT/s memory overclock soft-errored for a second time in the past 2 months. When I tested with the base clock bumped to 102 MHz (a 2% safety margin), it soft-errored within a minute. In retrospect, I never actually tested this overclock with any safety margin at all. So this prompted me to retest the entire overclock.

During this process, I lifted the temperature throttle limits and noticed that my 3.8 GHz AVX512 overclock was shooting way past the usual throttle point and well above 100C. 5 months ago, I picked 3.8 GHz because it was the highest speed that wouldn't exceed 85C. But in those 5 months, there have been enough memory improvements (both in software and in the overclock) that the code simply runs a hell of a lot hotter than before. I've also noticed an increase in benchmark inconsistency recently, but I didn't realize how badly it was throttling since I don't usually have CPU-Z open during benchmarks.

Without the throttling, the CPU utilization seems to have gone up - presumably because the throttling was causing a load imbalance between the cores (since the hotter cores throttle harder).

Going by the same 85C limit, I seem to have lost about 200 MHz during these 5 months. Yet the net speedup is more than 40%.
2017-12-26, 17:15   #11
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

8164₁₀ Posts

@Mysticial: Do you know if the vblendmpd instruction runs on the same ports as the FMA units? http://users.atw.hu/instlatx64/Genui...InstLatX64.txt shows vblendmpd with a latency of one and a throughput of two.

