mersenneforum.org  


Old 2018-09-04, 23:36   #12
GP2
 

Quote:
Originally Posted by Mysticial View Post
That sounds like a great way to piss off other cloud users!

Throw tons of single-threaded bandwidth-heavy AVX512 workloads on the cloud. Not only do you eat up all the memory bandwidth, you throttle their clocks as well!
Amazon probably has literally millions of servers, maybe tens of millions. The exact number is a closely guarded secret, and outsider estimates have varied. So even if you could afford to launch a hundred or a thousand number-crunching instances, it would still be just a drop in the bucket.

An underlying physical server for c5 instances has 36 cores. I use just one or two of them, because it always makes more sense to run multiple one-core instances than one multi-core instance, both from a throughput standpoint and for cost-effectiveness. So it really doesn't have that much of an impact.

PS: Amazon's cloud CPUs never get throttled. The data centers always ensure a proper temperature, so Amazon paid Intel to create a custom chip with all the CPU-throttling logic removed, and they presumably used the freed-up die space for other purposes.
Old 2018-09-04, 23:46   #13
GP2
 

Quote:
Originally Posted by Mark Rose View Post
If EC2 users care enough, they can select dedicated tenancy instances.
Don't try that at home: it adds an additional global fee of $2/hour for every region where you have at least one dedicated instance, which works out to $336 per week. That's insignificant for a business budget, but for our purposes, at spot prices of 1.9 cents/hour, the same $2/hour would let you run 105 Skylake cores @ 3.0 GHz simultaneously.
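
For the curious, the arithmetic behind those figures (a quick sketch using only the prices quoted above; actual spot prices vary by region and over time):

Code:
# Back-of-the-envelope check of the dedicated-tenancy fee vs. spot pricing.
dedicated_fee_per_hour = 2.00      # USD per hour, per region with at least one dedicated instance
spot_price_per_core_hour = 0.019   # USD per hour, quoted spot price for one Skylake core

hours_per_week = 24 * 7
print(dedicated_fee_per_hour * hours_per_week)            # 336.0 USD per week
print(dedicated_fee_per_hour / spot_price_per_core_hour)  # ~105 cores for the same hourly spend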
Old 2018-09-05, 05:21   #14
NookieN
 

Quote:
Originally Posted by GP2 View Post
An underlying physical server for c5 instances has 36 cores. I use just one or two of them, because it always makes more sense to run multiple one-core instances than one multi-core instance, both from a throughput standpoint and for cost-effectiveness. So it really doesn't have that much of an impact.

PS: Amazon's cloud CPUs never get throttled. The data centers always ensure a proper temperature, so Amazon paid Intel to create a custom chip with all the CPU-throttling logic removed, and they presumably used the freed-up die space for other purposes.
What does /proc/cpuinfo identify the chip as? I assume it's an 8124M, which means they have an 8S blade where you get two sockets, giving you 72 threads. The most cores currently in a single Xeon die are 28 (ignoring Xeon Phi). Those are almost 3x as expensive though; it's still much cheaper to just buy the multisocket board.

Even if throttling is in fact disabled, they would have just changed the thermal limits programmed in the fuses.
Old 2018-09-05, 08:19   #15
GP2
 

Quote:
Originally Posted by NookieN View Post
What does /proc/cpuinfo identify the chip as? I assume it's an 8124M, which means they have an 8S blade where you get two sockets, giving you 72 threads. The most cores currently in a single Xeon die are 28 (ignoring Xeon Phi). Those are almost 3x as expensive though; it's still much cheaper to just buy the multisocket board.

Yes, and with two sockets as you mention:

model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz

core id : 0
core id : 1
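
For anyone who wants to poke at their own instance, here is a minimal sketch (plain Python reading the standard Linux /proc/cpuinfo fields, nothing AWS-specific) that summarizes the model name and the socket/core layout. Note that on a one- or two-vCPU instance you only see the vCPUs allocated to you, not the whole 36-core host.

Code:
# Summarize CPU model and socket/core topology from /proc/cpuinfo (Linux).
from collections import defaultdict

models = set()
sockets = defaultdict(set)   # physical id -> set of core ids

def flush(entry):
    if entry:
        models.add(entry.get("model name", "unknown"))
        sockets[entry.get("physical id", "0")].add(entry.get("core id", "0"))

cpu = {}
with open("/proc/cpuinfo") as f:
    for line in f:
        if not line.strip():           # a blank line ends one logical-CPU block
            flush(cpu)
            cpu = {}
        elif ":" in line:
            key, _, value = line.partition(":")
            cpu[key.strip()] = value.strip()
flush(cpu)                             # in case the file lacks a trailing blank line

print("model(s):", ", ".join(sorted(models)))
print("sockets :", len(sockets))
print("cores   :", sum(len(cores) for cores in sockets.values()))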


Quote:
Even if throttling is in fact disabled, they would have just changed the thermal limits programmed in the fuses.

The thing about thermal throttling being disabled was actually something I read about for the introduction of the previous-generation C4 instances that were built around the custom Xeon E5-2666 v3 chip (Haswell). I don't recall now if it was an officially announced feature or just rumor and scuttlebutt. When it comes to the C5 instances, I assumed the same thing applied, but who knows, maybe it's not the case.
Old 2018-09-07, 00:46   #16
xx005fs
 
"Eric"
Jan 2018
USA

22×53 Posts
Default

Quote:
Originally Posted by Prime95 View Post
A 3.6GHz 8-core Skylake-X with DDR-3600 memory. Running new AVX-512 FFT code:

Timings for 4480K FFT length (1 core, 1 worker): 12.53 ms. Throughput: 79.83 iter/sec.
Timings for 4480K FFT length (2 cores, 1 worker): 6.94 ms. Throughput: 144.15 iter/sec.
Timings for 4480K FFT length (3 cores, 1 worker): 5.21 ms. Throughput: 192.10 iter/sec.
Timings for 4480K FFT length (4 cores, 1 worker): 4.09 ms. Throughput: 244.70 iter/sec.
Timings for 4480K FFT length (5 cores, 1 worker): 3.49 ms. Throughput: 286.31 iter/sec.
Timings for 4480K FFT length (6 cores, 1 worker): 3.15 ms. Throughput: 317.06 iter/sec.
Timings for 4480K FFT length (7 cores, 1 worker): 2.95 ms. Throughput: 339.29 iter/sec.
Timings for 4480K FFT length (8 cores, 1 worker): 2.95 ms. Throughput: 338.73 iter/sec.

Timings for 4480K FFT length (5 cores, 5 workers): 15.56, 15.50, 15.48, 15.39, 15.41 ms. Throughput: 323.30 iter/sec.
Timings for 4480K FFT length (6 cores, 6 workers): 16.90, 16.85, 16.78, 16.70, 16.77, 16.73 ms. Throughput: 357.38 iter/sec.
Timings for 4480K FFT length (7 cores, 7 workers): 18.71, 18.74, 18.63, 18.54, 18.56, 18.56, 18.63 ms. Throughput: 375.84 iter/sec.
Timings for 4480K FFT length (8 cores, 8 workers): 21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36 ms. Throughput: 377.63 iter/sec.

The poor CPU is crying out for more memory bandwidth.



BTW, the old AVX code:

Timings for 4480K FFT length (8 cores, 8 workers): 24.81, 24.84, 24.70, 24.80, 24.77, 24.84, 24.80, 24.82 ms. Throughput: 322.61 iter/sec.
The throughput surely looks amazing on Skylake-X compared to regular consumer i7 and Ryzen chips. But is the speed really worth the platform cost, or is it just easier to get two AMD or Nvidia GPUs with unlocked double precision and high memory bandwidth? Or are there any advantages for the CPU over a GPU?
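
As a side note on reading the quoted figures: the throughput appears to be simply the sum of the per-worker iteration rates, where each rate is 1000 divided by the ms-per-iteration. A quick check against the 8-core numbers above, using only values from the quote:

Code:
# Throughput sanity check: sum of per-worker iteration rates (1000 / ms per iter).
# Timings are copied from the "8 cores, 8 workers" line quoted above.
eight_workers_ms = [21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36]

print(sum(1000.0 / ms for ms in eight_workers_ms))  # ~377.6 iter/sec (quoted: 377.63)
print(1000.0 / 2.95)                                # ~339.0 iter/sec (quoted: 338.73, 8 cores / 1 worker)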
Old 2018-09-07, 16:15   #17
Prime95

Quote:
Originally Posted by xx005fs View Post
The throughput surely looks amazing on Skylake-X compared to regular consumer i7 and Ryzen chips. But is the speed really worth the platform cost, or is it just easier to get two AMD or Nvidia GPUs with unlocked double precision and high memory bandwidth? Or are there any advantages for the CPU over a GPU?
The amazing speed is due to quad-channel memory.

Your cheapest solution is to buy two i7s or Ryzens with fast memory. See the "George's dream build" thread for my highest bang-for-the-buck solution.

Last I looked, CPUs were more energy-efficient at LL testing than GPUs.
Old 2018-09-07, 16:51   #18
Mysticial
 

Quote:
Originally Posted by GP2 View Post
Amazon probably has literally millions of servers, maybe tens of millions. The exact number is a closely guarded secret, and outsider estimates have varied. So even if you could afford to launch a hundred or a thousand number-crunching instances, it would still be just a drop in the bucket.

An underlying physical server for c5 instances has 36 cores. I use just one or two of them, because it always makes more sense to run multiple one-core instances than one multi-core instance, both from a throughput standpoint and for cost-effectiveness. So it really doesn't have that much of an impact.

PS: Amazon's cloud CPUs never get throttled. The data centers always ensure a proper temperature, so Amazon paid Intel to create a custom chip with all the CPU-throttling logic removed, and they presumably used the freed-up die space for other purposes.
Amazon doesn't throttle at all? Have you tried the worst possible AVX512 loads, such as Linpack or FireStarter?

It's actually less about the thermals and more about the stability. Even if the cooling can handle the thermals, it doesn't mean the AVX512 silicon can handle the higher clocks. For example, my 7940X has at least 2 cores that will not run AVX512 at the same stock non-AVX speed without a significant voltage increase.

OTOH, Intel could simply bin their chips in a way so that all the chips they sell Amazon for this purpose are able to run AVX512 at the non-AVX speeds.

-----

I'm less convinced that the throttling logic takes up enough die area to be reused for something else. Is this just speculation, or is there a source somewhere? This would imply changes to the silicon, and given all the binning that's done, I would think it's better to have everything come off the same assembly line so the parts can be sold to any market depending on how they turn out.

The other thing is that throttling AVX512 goes both ways. If you're not throttling AVX512, then you are throttling non-AVX. IOW, you're leaving a lot of headroom for non-AVX. So even if Amazon is liquid nitrogen cooling their custom parts to handle AVX512 at full speed, they could be running non-AVX at even higher speeds.

An alternate possibility is that the Amazon hardware is tuned to be power efficient for non-AVX. But they will have the cooling necessary to handle un-throttled AVX512 even if it means extremely high voltages that kill off any power efficiency (assuming AVX512 is still so rare that they don't need it to be efficient).

Old 2018-09-07, 17:01   #19
mackerel
 

Where I am, the 7800X (6-core) isn't priced much differently from the consumer 6-cores (8700K, 8086K). You do have to pay a bit more for the mobo, and by implication you'll be running at least 16 GB of RAM to get quad channel. For that you do get quad-channel RAM feeding 6 cores. I still don't think I'm seeing as good scaling as expected over quad-core/dual-channel; I suspect the mesh cache.

My 7800X will do AVX2 loads (Prime95/LLR) at stock voltage up to 4.3 GHz. I can't recall what the highest stable AVX2 clock was with more voltage, but you're dropping down the efficiency slope if you do that anyway. Running y-cruncher with AVX-512, I've managed to complete a 10B run at 4.3 GHz and elevated voltage. So... it is pretty punishing.
Old 2018-09-07, 19:09   #20
GP2
 

Quote:
Originally Posted by Mysticial View Post
OTOH, Intel could simply bin their chips in a way so that all the chips they sell Amazon for this purpose are able to run AVX512 at the non-AVX speeds.

-----

I'm less convinced that the throttling logic takes up enough die area to be reused for something else. Is this just speculation, or is there a source somewhere? This would imply changes to the silicon, and given all the binning that's done, I would think it's better to have everything come off the same assembly line so the parts can be sold to any market depending on how they turn out.
The previous-generation c4 instances used a custom chip: "Xeon E5-2666 v3 (Haswell) processors optimized specifically for EC2" (click on the "Compute Optimized" tab on the linked page). I definitely recall reading something at the time about the "optimization" having to do with removing throttling logic, but when I tried to google it after posting the above I couldn't find anything. Maybe it was speculation after all.

The current-generation c5 instances say "3.0 GHz Intel Xeon Platinum processors with new Intel Advanced Vector Extension 512 (AVX-512) instruction set", with no mention of customization specifically for EC2. So maybe those really are off the same assembly line, I don't know. The /proc/cpuinfo output says model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz.

By contrast, the Skylake chips on Google Compute Engine are unspecified as to model, and slower: model name : Intel(R) Xeon(R) CPU @ 2.00GHz. On some other VM instances I think I've seen a different GHz figure, so Google's cloud may be somewhat heterogeneous in terms of clock speed, but in any case it's considerably slower than AWS EC2.
Old 2018-09-07, 20:12   #21
ET_

Quote:
Originally Posted by GP2 View Post
The previous-generation c4 instances used a custom chip: "Xeon E5-2666 v3 (Haswell) processors optimized specifically for EC2" (click on the "Compute Optimized" tab on the linked page). I definitely recall reading something at the time about the "optimization" having to do with removing throttling logic, but when I tried to google it after posting the above I couldn't find anything. Maybe it was speculation after all.

The current-generation c5 instances say "3.0 GHz Intel Xeon Platinum processors with new Intel Advanced Vector Extension 512 (AVX-512) instruction set", with no mention of customization specifically for EC2. So maybe those really are off the same assembly line, I don't know. The /proc/cpuinfo output says model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz.

By contrast, the Skylake chips on Google Compute Engine are unspecified as to model, and slower: model name : Intel(R) Xeon(R) CPU @ 2.00GHz. On some other VM instances I think I've seen a different GHz figure, so Google's cloud may be somewhat heterogeneous in terms of clock speed, but in any case it's considerably slower than AWS EC2.
Prime95 reports the Skylake chips on Google Compute Engine as running at 2506 MHz.
Old 2018-09-07, 20:49   #22
xx005fs
 
"Eric"
Jan 2018
USA

22×53 Posts
Default

Quote:
Originally Posted by Prime95 View Post
The amazing speed is due to quad-channel memory.

Your cheapest solution is to buy two i7s or Ryzens with fast memory. See the "George's dream build" thread for my highest bang-for-the-buck solution.

Last I looked, CPUs were more energy-efficient at LL testing than GPUs.
That might be an advantage. However, a GPU like the Titan V draws about 300 watts under load if you don't overclock and does under 1 ms/it for 85M exponents, whereas a consumer Ryzen/i5 draws around 65 watts and does 6 ms/it, which works out to better energy efficiency for the Titan V. Also, I found that a Vega 56 can be undervolted quite a lot and then achieves 2.06 ms/it at about 200 W board power. I'm hoping for a GPU implementation in Prime95 in the future, as that would identify my devices and put their credit up on the site instead of manual testing.
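
Taking those numbers at face value, here's a quick back-of-the-envelope joules-per-iteration comparison (a sketch using only the figures quoted in this post, not measured results):

Code:
# Rough energy-per-iteration comparison using the figures quoted above:
# watts under load * seconds per iteration = joules per iteration.
devices = {
    "Titan V":      (300.0, 0.001),    # ~300 W, <1 ms/it on an 85M exponent
    "Ryzen/i5":     (65.0,  0.006),    # ~65 W, ~6 ms/it
    "Vega 56 (UV)": (200.0, 0.00206),  # ~200 W board power, 2.06 ms/it
}

for name, (watts, sec_per_iter) in devices.items():
    print(f"{name:13s} {watts * sec_per_iter:.2f} J/iteration")
# Titan V ~0.30, Ryzen/i5 ~0.39, Vega 56 ~0.41 J per iteration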