2017-10-13, 17:31   #1
heliosh

AVX512 performance on new shiny Intel kit

How much does AVX512 improve LL performance at the same clock rate compared to AVX2?

2017-10-13, 22:36   #2
ewmayer

Quote:
Originally Posted by heliosh
How much does AVX512 improve LL performance at the same clock rate compared to AVX2?
I got ~1.6x for my AVX512 implementation running on KNL. YMMV.

2017-10-13, 23:07   #3
Mysticial

Quote:
Originally Posted by ewmayer
I got ~1.6x for my AVX512 implementation running on KNL. YMMV.
FWIW, y-cruncher's untuned AVX512 gained anywhere from 5% to -10% on my system out-of-box, depending on how badly it throttled. (That minus 10% is not a typo. Once the system throttles, it throttles hard.)

Once I fixed the throttling, the AVX2 -> AVX512 gain hovered around 10%-ish.

Once I tuned the AVX512 binary, that grew to about 15%.

Once I overclocked the cache and memory, it grew up to 25%.

Version 0.7.4 (ETA end of weekend) makes both the AVX2 and AVX512 faster, but more so the AVX512. And I think it widens the gap by another percent or two.


By comparison, my BBP benchmark gained 90% (1.9x faster) going from AVX2 -> AVX512 in the absence of throttling. Cache and memory are irrelevant since it's L1 only.


2017-10-14, 21:50   #4
ewmayer

Quote:
Originally Posted by Mysticial
FWIW, y-cruncher's untuned AVX512 gained anywhere from 5% to -10% on my system out-of-box, depending on how badly it throttled. [...]
I must've missed the relevant post(s), but how did you fix the throttling issue?

It would be interesting to compare your so-far-disappointing AVX512 gains for y-cruncher to Mlucas. Back in July in this same thread you posted a bunch of Mlucas timings for an AVX512 build in an Ubuntu sandbox you installed, but I don't believe we did an AVX2 build on your hardware at the time. Comparing those two binaries would tell us whether your AVX512 gains for y-cruncher are in line with those for my code running on the same hardware.

Note that if you're running Win10, building under Linux is now greatly eased by MSFT having, for once, actually done something right and added native Linux-sandbox support to that version of their OS. The Mlucas readme page has details, but the bottom line is that once you open such a native shell, everything works beautifully.
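
(A minimal sketch of that path, assuming the stock Ubuntu-on-Windows environment and that no compiler is installed yet - these are just the standard Ubuntu packages, nothing Mlucas-specific:)
Code:
# inside the Ubuntu/WSL shell; build tools are not installed by default
sudo apt-get update
sudo apt-get install build-essential    # provides gcc, make and the libc headers
gcc --version                           # sanity-check before building Mlucas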

2017-10-15, 19:40   #5
Mysticial

Quote:
Originally Posted by ewmayer
I must've missed the relevant post(s), but how did you fix the throttling issue?
I fixed it by compensating for the voltage droop on the CPU input voltage (among other things). Details here: http://www.overclock.net/t/1634045/s...tom-throttling

When I ran the Mlucas benchmarks, I had already fixed the throttling, so it doesn't invalidate those results. But I have significantly changed the overclock settings since then.

Now I have all 128GB running at 3800 MT/s - which is almost enough to match 6-channel 2666 MT/s on the servers. (This Samsung B-die stuff really is that good. Too bad it's so expensive now. And it's sold out everywhere so I can't even get another fix if I wanted to.)
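
(As a rough sanity check on that claim - assuming the desktop part is quad-channel and counting 8 bytes per transfer - the theoretical peak bandwidths work out to roughly:)
Code:
4 channels x 3800 MT/s x 8 bytes = ~121.6 GB/s   (this box, overclocked)
6 channels x 2666 MT/s x 8 bytes = ~128.0 GB/s   (6-channel server platform)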

Quote:
It would be interesting to compare your so-far-disappointing AVX512 gains for y-cruncher to Mlucas - back in July in this same thread you posted a bunch of Mlucas timings for an AVX512 build in a Ubuntu sandbox you installed, but I don't believe we considered doing an AVX2 build on your hardware. Comparing those 2 binaries would tell us if your AVX512 gains for y-cruncher are in line to those for my code running on the same hardware.

Note if you're running Win10, build-under-linux is now greatly eased by MSFT having actually done something right for once by way of adding native linux-sandbox support to that version of their OS. The Mlucas readme page has info on that, but bottom line that once you open such a native shell, everything works beautifully.
If you give me an updated set of commands to run, I'll grab the latest version and run them again in Ubuntu.


2017-10-15, 21:36   #6
ewmayer

Quote:
Originally Posted by Mysticial
If you give me an updated set of commands to run, I'll grab the latest version and run them again in Ubuntu.
If the md5 checksum for your Mlucas tarball mismatches the current one, that means I added some code patches after you downloaded your copy - nothing major, but some stuff related to making the self-tests more robust, which you will want. The build commands below assume binaries built in separate obj-dirs (both within the src-dir, e.g. from src, 'mkdir obj_avx2' and 'mkdir obj_avx512'), so both can use the same executable name:

AVX2:
gcc -c -O3 -mavx2 -DUSE_AVX2 -DUSE_THREADS ../*.c >& build.log
grep -i error build.log
[Assuming above grep comes up empty] gcc -o Mlucas *.o -lm -lpthread -lrt

AVX512:
gcc -c -O3 -march=skylake-avx512 -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
grep -i error build.log
[Assuming above grep comes up empty] gcc -o Mlucas *.o -lm -lpthread -lrt
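
(Purely as a convenience, the same steps scripted end-to-end - a minimal sketch which assumes the two obj-dirs have already been created under src, and that your gcc accepts -march=skylake-avx512:)
Code:
#!/bin/bash
# Build both Mlucas binaries, one per obj-dir, using exactly the flags above.
# Run this from the src directory; each obj-dir keeps its own build.log and Mlucas binary.
# Note: grep exits non-zero when no 'error' lines are found, so '||' links only on a clean compile.
cd obj_avx2
gcc -c -O3 -mavx2 -DUSE_AVX2 -DUSE_THREADS ../*.c &> build.log
grep -i error build.log || gcc -o Mlucas *.o -lm -lpthread -lrt
cd ../obj_avx512
gcc -c -O3 -march=skylake-avx512 -DUSE_AVX512 -DUSE_THREADS ../*.c &> build.log
grep -i error build.log || gcc -o Mlucas *.o -lm -lpthread -lrt
cd ..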

A simple set of comparative timings at a GIMPS-representative FFT length should suffice for our basic purposes, but you are of course free to spend as much or as little time playing with this as you like. I suggest using 100 iterations only for the 1-thread self-tests; 1000 iterations gives more accurate timings for anything beyond that. I further suggest opening the mlucas.cfg file resulting from the initial set of self-tests in an editor and annotating each cfg-file best-timing line, as it is printed, with the -cpu options you used for that set of timings. (You can replace the reference-residues stuff that gets printed to the right of the 10-entry FFT radix-set on each line with your annotations; I find that a convenient method in my own work.)
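
(For illustration, a hypothetical annotated best-timing line - with the reference-residue text replaced by the -cpu flags used for that run - might look like:)
Code:
      4096  msec/iter =    6.54  ROE[avg,max] = [0.276559480, 0.343750000]  radices =  64 32 32 32  0  0  0  0  0  0    -cpu 0:7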

So let's say we want to do comparative timings at 4096K FFT length. On an Intel 10-physical-core CPU I would do it like so, each line producing a new cfg-file entry. The code is heavily geared toward power-of-2 thread counts, so I would deviate from that only at the 'what the heck - let's try all 10 cores' end:

1 thread per physical core:
./Mlucas -fftlen 4096 -iters 100 -cpu 0
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:1
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:3
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:7
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:9

2 threads per physical core:
./Mlucas -fftlen 4096 -iters 100 -cpu 0,10
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:1,10:11
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:3,10:13
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:7,10:17
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:19


2017-10-17, 01:21   #7
Mysticial

AVX2:
Code:
17.0
      4096  msec/iter =   27.05  ROE[avg,max] = [0.278125000, 0.312500000]  radices =  64 32 32 32  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
      4096  msec/iter =   16.42  ROE[avg,max] = [0.265871720, 0.328125000]  radices = 256  8  8  8 16  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =   10.70  ROE[avg,max] = [0.276559480, 0.343750000]  radices =  64 32 32 32  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    6.54  ROE[avg,max] = [0.276559480, 0.343750000]  radices =  64 32 32 32  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    6.19  ROE[avg,max] = [0.320515867, 0.406250000]  radices = 256 32 16 16  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =   24.78  ROE[avg,max] = [0.264118304, 0.296875000]  radices = 256  8  8  8 16  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
      4096  msec/iter =   14.92  ROE[avg,max] = [0.320515867, 0.406250000]  radices = 256 32 16 16  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    9.64  ROE[avg,max] = [0.320515867, 0.406250000]  radices = 256 32 16 16  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    5.95  ROE[avg,max] = [0.320564191, 0.406250000]  radices = 256 32 16 16  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    5.84  ROE[avg,max] = [0.265964343, 0.328125000]  radices = 256  8  8  8 16  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
AVX512:
Code:
17.0
      4096  msec/iter =   21.77  ROE[avg,max] = [0.245026507, 0.281250000]  radices =  32 16 16 16 16  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
      4096  msec/iter =   15.21  ROE[avg,max] = [0.244451338, 0.304687500]  radices =  32 16 16 16 16  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    8.70  ROE[avg,max] = [0.275551707, 0.375000000]  radices =  64 32 32 32  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    5.37  ROE[avg,max] = [0.275551707, 0.375000000]  radices =  64 32 32 32  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    5.36  ROE[avg,max] = [0.300239107, 0.375000000]  radices = 256 16 16 32  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =   18.90  ROE[avg,max] = [0.301116071, 0.375000000]  radices = 256 16 16 32  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
      4096  msec/iter =   12.86  ROE[avg,max] = [0.300271323, 0.375000000]  radices = 256 16 16 32  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    7.91  ROE[avg,max] = [0.300093503, 0.406250000]  radices =  64  8 16 16 16  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    5.42  ROE[avg,max] = [0.275567816, 0.375000000]  radices =  64 32 32 32  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
      4096  msec/iter =    5.00  ROE[avg,max] = [0.300271323, 0.375000000]  radices = 256 16 16 32  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
So about 15 - 20%.
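
(As a quick cross-check, reading the best like-for-like entries off the two tables above - assuming the entries appear in the same order as the suggested runs:)
Code:
best 1-thread-per-core run:  6.19 msec/iter (AVX2)  vs  5.36 msec/iter (AVX512)  ->  ~15% faster
best 2-thread-per-core run:  5.84 msec/iter (AVX2)  vs  5.00 msec/iter (AVX512)  ->  ~17% faster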

Hardware Specs:
  • AVX2 @ 4.1 GHz (overclocked from 3.6 GHz)
  • AVX512 @ 3.8 GHz (overclocked from 3.3 GHz)
  • Cache @ 3.0 GHz (overclocked from 2.4 GHz)
  • Memory @ 3800 MT/s (overclocked from 2666 MT/s)
The clock speeds are constant regardless of load. So no higher speeds for having only 1 or 2 cores active.


Overall, this is very similar to what I'm seeing.
  • The threaded benchmarks suck because of the memory bandwidth.
  • The single-threaded benchmarks suck almost as much, due to what I've observed to be the L3.

2017-10-18, 00:40   #8
ewmayer

Thanks, Alex - so on the good-news front, "It's not you." :)

Does your big-FFT y-cruncher code get any benefit from running 2 threads per physical core, as we see in the Mlucas timings?


2017-10-18, 01:46   #9
Mysticial

Quote:
Originally Posted by ewmayer
Thanks, Alex - so on the good-news front, "It's not you." :)

Does your big-FFT y-cruncher code get any benefit from running 2 threads per physical core, as we see in the Mlucas timings?
About 15% gain on this box. So more than Mlucas.

But this is taken over an entire Pi computation/benchmark. The workload is a lot less homogeneous, so it's a mix of more- and less-optimized code.
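
(For comparison, taking the best entries from the Mlucas tables above - again assuming they map to the 1- and 2-thread-per-core runs in order - the corresponding Mlucas gain from hyperthreading is roughly:)
Code:
AVX2:   6.19 -> 5.84 msec/iter  =  ~6% gain from 2 threads per core
AVX512: 5.36 -> 5.00 msec/iter  =  ~7% gain from 2 threads per core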

2017-12-14, 05:58   #10
Mysticial

My 3800 MT/s memory overclock soft-errored for a second time in the past 2 months. When I tested with the base clock bumped to 102 MHz (2% safety margin), it soft-errored within a minute. In retrospect, I never actually tested this overclock with any safety margin at all. So this prompted me to retest the entire overclock.

During this process, I lifted the temperature throttle limits and noticed that my 3.8 GHz AVX512 overclock was shooting way past the usual throttle point and well above 100C. 5 months ago, I picked 3.8 GHz because it was the highest speed that wouldn't exceed 85C. But in those 5 months, there have been enough memory improvements (both in software and in the overclock) that the code simply runs a hell of a lot hotter than before.

I've also noticed an increase in benchmark inconsistency recently, but I didn't realize how badly it was throttling since I don't usually have CPUz open during benchmarks. Without the throttling, the CPU utilization seems to have gone up - presumably because the throttling was causing a load imbalance between the cores (since the hotter cores throttle harder).

Going by the same 85C limit, I seem to have lost about 200 MHz during these 5 months. Yet the net speedup is more than 40%.

2017-12-26, 17:15   #11
Prime95

@Mysticial: Do you know if the vblendmpd instruction is running on the same ports as the FMA units?

http://users.atw.hu/instlatx64/Genui...InstLatX64.txt shows vblendmpd with a latency of one and a throughput of two.