mersenneforum.org  

Old 2018-09-03, 04:55   #1
Prime95
Preliminary Skylake-X benchmark

A 3.6 GHz 8-core Skylake-X with DDR4-3600 memory, running the new AVX-512 FFT code:

Timings for 4480K FFT length (1 core, 1 worker): 12.53 ms. Throughput: 79.83 iter/sec.
Timings for 4480K FFT length (2 cores, 1 worker): 6.94 ms. Throughput: 144.15 iter/sec.
Timings for 4480K FFT length (3 cores, 1 worker): 5.21 ms. Throughput: 192.10 iter/sec.
Timings for 4480K FFT length (4 cores, 1 worker): 4.09 ms. Throughput: 244.70 iter/sec.
Timings for 4480K FFT length (5 cores, 1 worker): 3.49 ms. Throughput: 286.31 iter/sec.
Timings for 4480K FFT length (6 cores, 1 worker): 3.15 ms. Throughput: 317.06 iter/sec.
Timings for 4480K FFT length (7 cores, 1 worker): 2.95 ms. Throughput: 339.29 iter/sec.
Timings for 4480K FFT length (8 cores, 1 worker): 2.95 ms. Throughput: 338.73 iter/sec.

Timings for 4480K FFT length (5 cores, 5 workers): 15.56, 15.50, 15.48, 15.39, 15.41 ms. Throughput: 323.30 iter/sec.
Timings for 4480K FFT length (6 cores, 6 workers): 16.90, 16.85, 16.78, 16.70, 16.77, 16.73 ms. Throughput: 357.38 iter/sec.
Timings for 4480K FFT length (7 cores, 7 workers): 18.71, 18.74, 18.63, 18.54, 18.56, 18.56, 18.63 ms. Throughput: 375.84 iter/sec.
Timings for 4480K FFT length (8 cores, 8 workers): 21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36 ms. Throughput: 377.63 iter/sec.

The poor CPU is crying out for more memory bandwidth.
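
To make the flattening concrete, here is a minimal sketch (Python, using only the throughput figures from the table above) that computes how far each core count falls short of linear scaling:

Code:
# Per-core scaling efficiency of the 1-worker runs above (4480K FFT).
# Throughput figures are copied verbatim from the benchmark output.
throughput = {1: 79.83, 2: 144.15, 3: 192.10, 4: 244.70,
              5: 286.31, 6: 317.06, 7: 339.29, 8: 338.73}

for cores, iters in throughput.items():
    efficiency = iters / (cores * throughput[1])  # fraction of perfect linear scaling
    print(f"{cores} cores: {iters:7.2f} iter/sec, {efficiency:5.1%} of linear")
Efficiency drops from about 90% at 2 cores to roughly 53% at 8 cores, which looks like a shared-resource ceiling (here, memory bandwidth) rather than a compute limit.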



BTW, the old AVX code:

Timings for 4480K FFT length (8 cores, 8 workers): 24.81, 24.84, 24.70, 24.80, 24.77, 24.84, 24.80, 24.82 ms. Throughput: 322.61 iter/sec.
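
For what it's worth, the AVX-512 gain at 8 cores / 8 workers can be read straight off the two runs; a quick sketch using nothing beyond the printed throughputs:

Code:
# Throughput at 8 cores / 8 workers, from the two runs above.
avx512 = 377.63   # new AVX-512 FFT code
avx    = 322.61   # old AVX FFT code
print(f"AVX-512 throughput gain: {avx512 / avx - 1:.1%}")   # roughly 17%
That ratio is where the ~17% figure discussed below comes from.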

Old 2018-09-03, 05:13   #2
Mysticial

Nice!

Are these all at fixed clock speeds regardless of the workload? (i.e. AVX and AVX-512 both running at 3.6 GHz?)

A 17% speedup (377.63 vs. 322.61 iter/sec at 8 cores, 8 workers) is more than I expected given the memory bottleneck.

Old 2018-09-03, 07:20   #3
mackerel

From my previous observations, the "old" AVX code wouldn't be significantly limited by RAM in that configuration. Any significant increase from AVX-512 would push it there, though.

Is my understanding correct that AVX-512 could roughly double throughput if it weren't limited by RAM? What sort of speedup do you see for smaller FFTs that fit in cache? I can see it shaking things up when it eventually makes its way to LLR.

As a side thought, I assume the CPU temperatures when running would be a little higher than with the old code. It might get a lot hotter if it weren't bandwidth-limited...
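
For a rough sense of the traffic involved, here is a back-of-the-envelope sketch. The pass count per iteration is a guess rather than anything measured, and a quad-channel DDR4-3600 platform is assumed; the only hard inputs are the FFT length and the 8-worker throughput from the first post.

Code:
# Back-of-the-envelope memory-traffic estimate for the 8-core, 8-worker run.
# ASSUMPTIONS (not measured): ~4 streaming passes over the FFT data per
# iteration, and a quad-channel DDR4-3600 memory subsystem.
fft_doubles    = 4480 * 1024            # 4480K FFT length, double precision
bytes_per_iter = fft_doubles * 8 * 4    # ~36.7 MB of data, ~4 passes per iteration
iters_per_sec  = 377.63                 # combined 8-worker throughput from above

traffic_gbs = bytes_per_iter * iters_per_sec / 1e9
peak_gbs    = 4 * 8 * 3600e6 / 1e9      # 4 channels x 8 bytes x 3600 MT/s

print(f"~{traffic_gbs:.0f} GB/s of traffic vs. ~{peak_gbs:.0f} GB/s theoretical peak")
Even with that modest guessed pass count the estimate lands around half of the theoretical peak, and sustained bandwidth in practice is well below theoretical, so it is easy to see how throughput stops scaling well before 8 cores.
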
Old 2018-09-03, 13:56   #4
Prime95

@Mysticial: not sure what the AVX-512 offsets are. I just plugged the chip in and let it rip using the motherboard's default settings.


sensors reports:

Code:
Physical id 0:  +59.0°C  (high = +95.0°C, crit = +105.0°C)
Core 0:         +54.0°C  (high = +95.0°C, crit = +105.0°C)
Core 1:         +43.0°C  (high = +95.0°C, crit = +105.0°C)
Core 2:         +57.0°C  (high = +95.0°C, crit = +105.0°C)
Core 3:         +46.0°C  (high = +95.0°C, crit = +105.0°C)
Core 4:         +56.0°C  (high = +95.0°C, crit = +105.0°C)
Core 5:         +57.0°C  (high = +95.0°C, crit = +105.0°C)
Core 6:         +54.0°C  (high = +95.0°C, crit = +105.0°C)
Core 7:         +59.0°C  (high = +95.0°C, crit = +105.0°C)
i7z snapshot:

Code:
Socket [0] - [physical cores=8, logical cores=16, max online cores ever=8]
  TURBO DISABLED on 8 Cores, Hyper Threading ON
  Max Frequency without considering Turbo 3599.00 MHz (99.97 x [36])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 Cores is  45x/41x/40x/40x/40x/40x
  Real Current Frequency 3603.84 MHz [99.97 x 36.05] (Max of below)
        Core [core-id]  :Actual Freq (Mult.)      C0%   Halt(C1)%  C3 %   C6 %  Temp      VCore
        Core 1 [0]:       3500.11 (35.01x)         1    99.9       0       0    54      0.9535
        Core 2 [1]:       3599.86 (36.01x)         1    0.801      0    98.6    43      0.9777
        Core 3 [2]:       3491.47 (34.92x)         1     100       0       0    56      0.9749
        Core 4 [3]:       3603.84 (36.05x)         1    0.147      0    99.8    46      0.9835
        Core 5 [4]:       3503.82 (35.05x)         1     100       0       0    57      0.9595
        Core 6 [5]:       3487.92 (34.89x)         1     100       0       0    55      0.9529
        Core 7 [6]:       3489.45 (34.90x)         1     100       0       0    54      0.9545
        Core 8 [7]:       3500.00 (35.01x)       100    2.78       0       0    59      0.9590
C1 = Processor running with halts (States >C0 are power saver modes with cores idling)
C3 = Cores running with PLL turned off and core cache turned off
C6, C7 = Everything in C3 + core state saved to last level cache, C7 is deeper than C6
Old 2018-09-03, 14:00   #5
Prime95

Interestingly, uptime reports only 6 cores in use:

Code:
george@SkylakeX:~/mers295/linux64$ uptime
 09:59:12 up 78 days, 17:05,  2 users,  load average: 6.01, 6.00, 6.00
Old 2018-09-03, 14:07   #6
paulunderwood

Quote:
Originally Posted by Prime95
Interestingly, uptime reports only 6 cores in use:

Code:
george@SkylakeX:~/mers295/linux64$ uptime
 09:59:12 up 78 days, 17:05,  2 users,  load average: 6.01, 6.00, 6.00
uptime is not the best measure. It's better to look in /proc/cpuinfo to see how many cores there are. Good luck getting the load to 8.0.

Here is my justification: when running my own code written with gwnum, I get less load than JP's LLR but similar timings.
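
If it helps, here is a minimal sketch of those checks on Linux (nothing Prime95-specific; it just counts logical CPUs in /proc/cpuinfo and reads /proc/loadavg):

Code:
# Count logical CPUs and read the 1/5/15-minute load averages (Linux only).
with open("/proc/cpuinfo") as f:
    logical_cpus = sum(1 for line in f if line.startswith("processor"))

with open("/proc/loadavg") as f:
    load_1m, load_5m, load_15m = f.read().split()[:3]

print(f"{logical_cpus} logical CPUs; load averages: {load_1m}, {load_5m}, {load_15m}")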

Old 2018-09-03, 19:55   #7
Prime95

My bad. I've been running the new code doing Gerbicz PRPs on the Skylake-X; it finished two work units and won't get any more work. I've got some unexpected debugging to do.

The good news is that I didn't lose much throughput with two cores idle. The temps and i7z data above are inaccurate. The benchmarks are OK, as they were done after a "kill -SIGSTOP" on the running mprime.
Old 2018-09-03, 21:07   #8
Mysticial

Quote:
Originally Posted by Prime95
@Mysticial: not sure what the AVX-512 offsets are. I just plugged the chip in and let it rip using the motherboard's default settings.
Sounds like it's probably -4 for AVX512.

If you want a raw cycle-for-cycle comparison of AVX vs. AVX512, you'll need to force it from the BIOS.

So you'll need to zero the offsets for both AVX and AVX512. But you'll also need to drop all the turbos to no higher than 3.6 GHz. Otherwise, you'll roast the machine when it tries to run AVX512 @ 4.0 GHz on all 8 cores.
Old 2018-09-04, 02:12   #9
GP2

Quote:
Originally Posted by Prime95
Timings for 4480K FFT length (1 core, 1 worker): 12.53 ms. Throughput: 79.83 iter/sec.

Timings for 4480K FFT length (8 cores, 8 workers): 21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36 ms. Throughput: 377.63 iter/sec.

The poor CPU is crying out for more memory bandwidth.
One of the reasons I remain a fan of running on the cloud rather than a physical box is that using 8 separate one-core virtual machines really does mean 8 times the throughput of one core, unless you somehow contrive to get them running on the same physical cloud server; if that ever became an issue, it could be avoided with staggered starts.

In the above example, that would give you about 8 × 79.83 ≈ 640 iter/sec combined rather than 378, which is roughly 70% more.

Although the nominal cost advantage is probably still in favor of a barebones server-farm setup (unless you live in an area with expensive power), this factor partly tilts the balance back the other way. On top of that, the upgrade to Skylake hardware was free (just start using the new instance type), which was a 20% boost even on an AVX-to-AVX basis, and based on this benchmark there should be an additional 17% boost once AVX-512 code is available.
Old 2018-09-04, 19:51   #10
Mysticial

Quote:
Originally Posted by GP2
One of the reasons I remain a fan of running on the cloud rather than a physical box is that using 8 separate one-core virtual machines really does mean 8 times the throughput of one core, unless you somehow contrive to get them running on the same physical cloud server; if that ever became an issue, it could be avoided with staggered starts.

In the above example, that would give you about 8 × 79.83 ≈ 640 iter/sec combined rather than 378, which is roughly 70% more.

Although the nominal cost advantage is probably still in favor of a barebones server-farm setup (unless you live in an area with expensive power), this factor partly tilts the balance back the other way. On top of that, the upgrade to Skylake hardware was free (just start using the new instance type), which was a 20% boost even on an AVX-to-AVX basis, and based on this benchmark there should be an additional 17% boost once AVX-512 code is available.
That sounds like a great way to piss off other cloud users!

Throw tons of single-threaded bandwidth-heavy AVX512 workloads on the cloud. Not only do you eat up all the memory bandwidth, you throttle their clocks as well!

Old 2018-09-04, 20:38   #11
Mark Rose

Quote:
Originally Posted by Mysticial
That sounds like a great way to piss off other cloud users!

Throw tons of single-threaded bandwidth-heavy AVX512 workloads on the cloud. Not only do you eat up all the memory bandwidth, you throttle their clocks as well!

If EC2 users care enough, they can select dedicated tenancy instances.