#1
"Composite as Heck"
Oct 2017
761₁₀ Posts
Zen 2 details were announced today. Highlights include:
There are more details about Zen 2, 7nm GPUs and a few other things if you want to sift through the talk: https://www.youtube.com/watch?v=GwX13bo0RDQ&t=3686s

They also announced gen 1 Epyc AWS instances: https://www.mersenneforum.org/showthread.php?t=23782

For our purposes the biggest news is the upgraded FPU: with it, each core will be able to utilise more memory bandwidth, meaning the lower core count parts will be the sweet spot for saturating available memory bandwidth. Probably quad core will be enough to saturate dual channel on Ryzen 3rd gen, similar to Intel's sweet spot now. Better power efficiency with the 7nm node shrink is another no-brainer. They didn't mention cache beyond the vague bullet point above.

Speculation: Having a dedicated I/O die may allow for a more performant memory controller. It may allow for UMA. It makes sense that they will create two 14nm dies (a smaller one for Ryzen and a bigger one for Epyc and Threadripper) but use the same 7nm chiplet throughout. They've said nothing of an iGPU chiplet as the event was all about servers. I wouldn't be surprised if the iGPU chiplet was a Zen 2 version of what we have now in the 2400G (quad core CPU + iGPU on a single chip). This allows for the possibility of Ryzen to have up to 12 cores and an iGPU, and scales down nicely to quad and dual core + iGPU for the low end. I don't think they'll dedicate an entire chiplet to iGPU, as the one in the 2400G is limited by memory bandwidth as it is, unless they've decided against a 12 core Ryzen and make the iGPU chiplet much smaller than the 8 core chiplet.
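The "quad core enough to saturate dual channel" reasoning above can be sketched as a back-of-envelope calculation. The DDR4-3200 speed grade and the `dual_channel_bw_gbs` helper below are illustrative assumptions, not figures from the talk:

```python
# Rough per-core share of dual-channel DDR4 bandwidth at various core counts.
# Peak bandwidth = transfers/s x bus width (8 bytes/channel) x channel count.

def dual_channel_bw_gbs(mt_per_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s; real sustained bandwidth is lower."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

peak = dual_channel_bw_gbs(3200)   # 51.2 GB/s for dual-channel DDR4-3200
for cores in (4, 6, 8):
    print(f"{cores} cores: {peak / cores:.1f} GB/s per core")  # 12.8 / 8.5 / 6.4
```

The more cores sharing the same two channels, the thinner each core's slice of bandwidth, which is why the lower core count parts look like the sweet spot for bandwidth-bound work.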
#2
Feb 2016
UK
623₈ Posts
Yup, most excited about the FPU upgrade, as it means that if I want a farm of LLR crunchers, quad core offerings on dual channel RAM will probably hit a sweet spot of price/performance/power. The question is, how long will it be before consumer versions? I'm hoping, at worst, they'll keep the cadence they've had up to now, with Ryzen 1000 and 2000 both being launched around April, from memory.
#3
"Composite as Heck"
Oct 2017
761₁₀ Posts
Quote:
Last fiddled with by M344587487 on 2018-11-07 at 10:51
#4
"Composite as Heck"
Oct 2017
761₁₀ Posts
The official video is now up with much better audio and seemingly extra details: https://www.youtube.com/watch?v=kC3ny3LBfi4
#5
Sep 2016
101001011₂ Posts
If I understood the presentation and slides correctly, the Zen 2 FPU is the same as the old one, but doubled in width. That makes it (in theory) better than the non-AVX512 Intels.
On Zen 1, you could do 2 x 128-bit FMA + 2 x 128-bit FADD (though not sustainably). If it's the same or better on Zen 2, you can probably do 2 x 256-bit FMA + 2 x 256-bit FADD (also probably not sustainably). So you get FMA parity with non-AVX512 Intel, and you have a couple of extra FADD units to pick at the FADDs lying around (since most code is not going to be 100% FMA and will have FADDs as well).
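The peak-FLOPs arithmetic implied above can be tallied as follows. The Zen 2 unit counts are the speculated ones from this post, and the `peak_dp_flops_per_cycle` helper is my own illustration (counting each FMA as two ops, each FADD as one):

```python
# Peak double-precision FLOPs per cycle from execution-unit counts and
# vector width. 64-bit lanes per vector = vector bits / 64.

def peak_dp_flops_per_cycle(fma_units: int, fadd_units: int, vec_bits: int) -> int:
    lanes = vec_bits // 64
    return (fma_units * 2 + fadd_units) * lanes

zen1  = peak_dp_flops_per_cycle(2, 2, 128)  # 2x128b FMA + 2x128b FADD -> 12
zen2  = peak_dp_flops_per_cycle(2, 2, 256)  # speculated 2x256b FMA + 2x256b FADD -> 24
intel = peak_dp_flops_per_cycle(2, 0, 256)  # non-AVX512 Intel: 2x256b FMA -> 16
print(zen1, zen2, intel)  # 12 24 16
```

On paper that puts the speculated Zen 2 ahead of non-AVX512 Intel whenever code has standalone FADDs to feed the extra adders, and at parity on pure-FMA code.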
#6
Feb 2016
UK
13×31 Posts
Quote:
You make a good point, that this may be skewed to higher end parts. We don't have Zen+ quad cores, do we? The 2000 series APUs are still Zen. I need to catch up on the full video tonight.

Quote:
It feels like my world is imploding: having bought Intels for so long due to their FPU performance, my next system(s) may switch to Zen 2. I will probably still get a Skylake-X refresh at some point for AVX-512.
#7
Sep 2016
14B₁₆ Posts
Quote:
#8
Feb 2016
UK
13×31 Posts
Fortunately for my prime interests, many of them are still searchable at FFT sizes small enough to run out of cache. Peak CPU performance without RAM limiting is still possible. It does limit the value somewhat... I also wonder if the combined effective cache is big enough to take RAM out of the equation, at least on higher core parts.
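To put rough numbers on "small enough FFT sizes to run out of cache": an FFT over N double-precision residues touches at least N × 8 bytes of data. This is my own approximation, ignoring twiddle tables and scratch space, so it underestimates the true footprint:

```python
# Minimum data working set of an FFT of length fft_k (in units of 1024 doubles).

def fft_footprint_mib(fft_k: int) -> float:
    return fft_k * 1024 * 8 / (1024 * 1024)

for fft_k in (256, 1024, 4608):
    print(f"{fft_k}K FFT: ~{fft_footprint_mib(fft_k):.0f} MiB")  # ~2 / ~8 / ~36 MiB
```

So a 4608K FFT needs around 36 MiB, well beyond e.g. the 16 MB of L3 on a Ryzen 7 1800X, while the smaller FFT sizes used for many prime searches fit comfortably in cache.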
#9
∂²ω=0
Sep 2002
República de California
10110101010110₂ Posts
Just by way of reference, for the current Ryzen (a.k.a. Ryzen 1), would someone be so kind as to post mprime/Prime95 timings at various FFT lengths of interest here? Please not the usual blizzard of various-thread-count-configurations, just the ms/iter (or throughput in iters/sec) for the total-system-throughput-maximizing config at each FFT length.
Back when I did a trial build of Mlucas 17.1 on David Stanfill's then-new octocore Ryzen, I got some very interesting results. Here are ms/iter numbers for 1-core-running at various FFT lengths:
Code:
ewmayer@RyzenBeast:~/mlucas_v17.1/obj_avx2$ cat mlucas.cfg
17.1
1024  msec/iter =  9.96  ROE[avg,max] = [0.233398438, 0.281250000]  radices = 32 16 32 32  0 0 0 0 0 0
1152  msec/iter = 11.78  ROE[avg,max] = [0.262165179, 0.312500000]  radices = 36 16 32 32  0 0 0 0 0 0
1280  msec/iter = 12.73  ROE[avg,max] = [0.277678571, 0.343750000]  radices = 40 16 32 32  0 0 0 0 0 0
1408  msec/iter = 14.78  ROE[avg,max] = [0.286049107, 0.343750000]  radices = 44 16 32 32  0 0 0 0 0 0
1536  msec/iter = 15.36  ROE[avg,max] = [0.246595982, 0.312500000]  radices = 48 16 32 32  0 0 0 0 0 0
1664  msec/iter = 17.80  ROE[avg,max] = [0.299107143, 0.375000000]  radices = 52 16 32 32  0 0 0 0 0 0
1792  msec/iter = 17.98  ROE[avg,max] = [0.292968750, 0.343750000]  radices = 56 16 32 32  0 0 0 0 0 0
1920  msec/iter = 20.80  ROE[avg,max] = [0.290178571, 0.375000000]  radices = 60 16 32 32  0 0 0 0 0 0
2048  msec/iter = 20.58  ROE[avg,max] = [0.238539342, 0.281250000]  radices = 32 32 32 32  0 0 0 0 0 0
2304  msec/iter = 24.23  ROE[avg,max] = [0.237191336, 0.281250000]  radices = 36 32 32 32  0 0 0 0 0 0
2560  msec/iter = 26.22  ROE[avg,max] = [0.294642857, 0.375000000]  radices = 40 32 32 32  0 0 0 0 0 0
2816  msec/iter = 30.43  ROE[avg,max] = [0.241350446, 0.281250000]  radices = 44 32 32 32  0 0 0 0 0 0
3072  msec/iter = 31.48  ROE[avg,max] = [0.235825893, 0.281250000]  radices = 48 32 32 32  0 0 0 0 0 0
3328  msec/iter = 36.65  ROE[avg,max] = [0.308035714, 0.375000000]  radices = 52 32 32 32  0 0 0 0 0 0
3584  msec/iter = 36.95  ROE[avg,max] = [0.255747768, 0.312500000]  radices = 56 32 32 32  0 0 0 0 0 0
3840  msec/iter = 42.50  ROE[avg,max] = [0.255217634, 0.281250000]  radices = 240 16 16 32  0 0 0 0 0 0
4096  msec/iter = 43.84  ROE[avg,max] = [0.238957868, 0.281250000]  radices = 32 16 16 16 16  0 0 0 0 0
4608  msec/iter = 50.09  ROE[avg,max] = [0.236300223, 0.281250000]  radices = 144 16 32 32  0 0 0 0 0 0
5120  msec/iter = 55.05  ROE[avg,max] = [0.298772321, 0.343750000]  radices = 160 16 32 32  0 0 0 0 0 0
5632  msec/iter = 64.28  ROE[avg,max] = [0.233816964, 0.281250000]  radices = 176 16 32 32  0 0 0 0 0 0
6144  msec/iter = 66.76  ROE[avg,max] = [0.273158482, 0.343750000]  radices = 24 16 16 16 32  0 0 0 0 0
6656  msec/iter = 76.08  ROE[avg,max] = [0.249162946, 0.281250000]  radices = 208 16 32 32  0 0 0 0 0 0
7168  msec/iter = 76.73  ROE[avg,max] = [0.261049107, 0.312500000]  radices = 224 16 32 32  0 0 0 0 0 0
7680  msec/iter = 85.84  ROE[avg,max] = [0.266587612, 0.312500000]  radices = 240 16 32 32  0 0 0 0 0 0

o 'Overloading' each physical core with 2 threads (1 per logical core on that single phys-core) cuts throughput by ~10%;
o 1 thread per physical core is best, not by a huge amount, but still;
o Running 1 single-thread LL test on each of the 8 physical cores barely dents the timing versus just 1 such job on the entire system. I.e. if competition for system memory bandwidth were as big an issue here as it is known to be on Intel, running 8 single-thread jobs should appreciably reduce the per-job throughput versus the above numbers. But e.g. with 8 single-thread LL tests running, all @4608K, here are the numbers:
Code:
M85836229  Iter# = 870000 [ 1.01% complete]  clocks = 00:08:47.698 [ 0.0528 sec/iter]
M85836271  Iter# = 860000 [ 1.00% complete]  clocks = 00:08:46.750 [ 0.0527 sec/iter]
M85836449  Iter# = 860000 [ 1.00% complete]  clocks = 00:08:47.069 [ 0.0527 sec/iter]
M85836847  Iter# = 860000 [ 1.00% complete]  clocks = 00:08:48.398 [ 0.0528 sec/iter]
M85836869  Iter# = 870000 [ 1.01% complete]  clocks = 00:08:45.096 [ 0.0525 sec/iter]
M85836871  Iter# = 860000 [ 1.00% complete]  clocks = 00:08:47.868 [ 0.0528 sec/iter]
M85836931  Iter# = 860000 [ 1.00% complete]  clocks = 00:09:01.621 [ 0.0542 sec/iter]
M85836953  Iter# = 860000 [ 1.00% complete]  clocks = 00:08:45.647 [ 0.0526 sec/iter]
Here are salient /proc/cpuinfo details (just for the first processor in the file) for the above system:
Code:
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD Ryzen 7 1800X Eight-Core Processor
stepping        : 1
microcode       : 0x8001126
cpu MHz         : 3850.000
cache size      : 512 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic overflow_recov succor smca
bugs            : fxsave_leak sysret_ss_attrs null_seg
bogomips        : 7685.18
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro
Last fiddled with by ewmayer on 2018-11-09 at 20:49
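The "barely dents the timing" observation can be quantified directly from the posted 4608K numbers (one job alone at 50.09 ms/iter versus eight concurrent jobs at ~52.6-54.2 ms/iter each):

```python
# Aggregate throughput comparison using the timings posted above.

single_job = 50.09 / 1000                  # sec/iter, one 4608K job running alone
eight_jobs = [0.0528, 0.0527, 0.0527, 0.0528,
              0.0525, 0.0528, 0.0542, 0.0526]  # sec/iter, eight concurrent jobs

tput_1 = 1 / single_job                    # iters/sec with one job
tput_8 = sum(1 / s for s in eight_jobs)    # aggregate iters/sec with eight jobs

print(f"1 job : {tput_1:.1f} iter/s")
print(f"8 jobs: {tput_8:.1f} iter/s ({tput_8 / tput_1:.2f}x scaling)")
```

That works out to roughly 7.6x aggregate scaling for 8x the jobs, i.e. only a few percent per-job slowdown, supporting the point that memory-bandwidth contention is far less of an issue here than on the Intel parts.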
#10
Feb 2016
UK
193₁₆ Posts
Interesting interview at Anandtech: https://www.anandtech.com/show/13578...rk-papermaster
One part of interest:

Quote:
#11
"Eric"
Jan 2018
USA
2²·5³ Posts
I saw the image of the delidded Epyc 7nm 64 core part, in which the central I/O die seems massive. Could there possibly be a really fast L4 cache on it, of a decent size and with very high bandwidth (i.e. much higher than the roughly 150GB/s from eight-channel DDR4-2666 memory)?
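For reference, the theoretical peak for eight-channel DDR4-2666 works out a little above the quoted ~150GB/s, which is presumably a realistic post-overhead figure; a quick check:

```python
# Theoretical peak DRAM bandwidth: transfers/s x 8 bytes per channel x channels.
# Sustained bandwidth after refresh/turnaround overheads is typically lower.

def peak_bw_gbs(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

print(f"{peak_bw_gbs(2666, 8):.1f} GB/s")  # ~170.6 GB/s theoretical peak
```

An on-package L4 would only be interesting if it could clearly beat that figure (and cut latency), which is exactly the question being asked about the big I/O die.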