#1
Sep 2016
In light of the massive memory bottlenecks that I've observed with my 10-core 7900X along with the launch of 18-core chips, I felt the urge to bring back the topic of double-double arithmetic - even if it's much earlier than I had originally anticipated.
Going back to my post here, the motivation behind DD (double-double) is that DP FFTs are completely memory bound. DD has a larger "word" and therefore allows a higher memory efficiency. (I estimate about 20% lower memory/bandwidth consumption.) The trade-off, of course, is that DD is much slower - but maybe not quite as much as it looks at first glance. Double-double addition and multiplication can be done in as few as 7 and 4 instructions respectively if you cut all the corners (see the bottom of this post for specifics). But in cutting those corners, you'll need to rebalance the numbers every once in a while at the cost of 3 instructions. So a hypothetical radix-4 butterfly (3 twiddle factors) would be:
Code:
8 complex DD adds + 3 complex DD multiplies
  = 16 DD add + 3 (4 DD mul + 2 DD add)
  = 22 DD add + 12 DD mul
If we rebalance every two layers, the computational cost becomes:
Code:
22 (7) + 12 (4) + 8 * 3 = 226 FPU instructions
By comparison, a DP radix-4 butterfly can be done in about 24 FMAs. (I'm not sure what's actually achieved now by P95 and Mlucas.) However, a DP FFT does ~17 bits/word of useful work while a DD FFT does ~43 bits/word. So the relative "computation/useful work" ratio is:
Code:
(226 / 43) / (24 / 17) = 5.26 / 1.41 ≈ 3.73
So we start off with a 3.73x computational difference between DP and DD right now. The memory difference is 20% in the other direction. So in a world where computation is free and memory access is the only factor, DD will be 20% faster.

As far as I understand right now, P95 (and LL in general) is completely memory-bound with just AVX2. Now we have AVX512, so that 3.73x margin cuts in half to about 2x. Now the question is: How memory bound is LL right now with AVX2? Would dropping the memory footprint by 20% at the cost of 2x the computation lead to a net speedup? If not for the low-core chips, what about the 18-core Skylake X?

Admittedly, my analysis is overly simplified. I have neither an implementation nor a benchmark, so there's a lot of room for error. But if it's reasonably accurate, I'm thinking we might be getting close to crossing that threshold if we haven't already with the 18-core 7980XE.

Double-double Multiply: 4 FP instructions
Code:
A = {AL, AH}
B = {BL, BH}

H = AH * BH;
L = AH * BH - H;   // FMA
L = AL * BH + L;   // FMA
L = AH * BL + L;   // FMA

return {L, H};
Double-double Addition: 7 FP instructions
Code:
A = {AL, AH}
B = {BL, BH}

IH = abs_max(AH, BH);   // AVX512-DQ "vrangepd"
IL = abs_min(AH, BH);   // AVX512-DQ "vrangepd"

H = AH + BH;
L = H - IH;
L = IL - L;
L = L + AL;
L = L + BL;

return {L, H};
Rebalance: 3 FP instructions
Code:
A = {AL, AH}

H = AL + AH;
L = H - AH;
L = AL - L;

return {L, H};
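For anyone who wants to play with these sequences without AVX512 hardware, here's a minimal Python emulation. Python floats are IEEE doubles, the FMA is emulated exactly via Fraction (a correctly rounded a*b + c), and plain abs comparisons stand in for vrangepd. This is a sketch of the cut-corner sequences above, not production code.

```python
from fractions import Fraction

def fma(a, b, c):
    # Emulated fused multiply-add: exact a*b + c, rounded once to a double.
    # float(Fraction) rounds correctly, so this matches a hardware FMA.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def dd_mul(a, b):
    # The 4-instruction multiply above; note the AL*BL term is dropped.
    (ah, al), (bh, bl) = a, b
    h = ah * bh
    l = fma(ah, bh, -h)      # exact low half of ah*bh
    l = fma(al, bh, l)
    l = fma(ah, bl, l)
    return (h, l)

def dd_add(a, b):
    # The 7-instruction add; abs_max/abs_min play the role of vrangepd.
    (ah, al), (bh, bl) = a, b
    ih, il = (ah, bh) if abs(ah) >= abs(bh) else (bh, ah)
    h = ah + bh
    l = h - ih
    l = il - l
    l = l + al
    l = l + bl
    return (h, l)

def rebalance(a):
    # The 3-instruction renormalization (a fast-two-sum).
    ah, al = a
    h = al + ah
    l = h - ah
    l = al - l
    return (h, l)

def to_dd(x):
    # Split an exact Fraction into a double-double pair.
    hi = float(x)
    return (hi, float(x - Fraction(hi)))
```

A quick sanity check: squaring 1/3 as a double-double with dd_mul agrees with 1/9 to well beyond 100 bits, versus ~53 bits for a plain double, despite the corners cut.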
#2
"Forget I exist"
Jul 2009
Dumbassville
I'll only say this: markup vs. margin. A 20% markdown (i.e., 20% off the original time) only buys you room for 25% more computation on what's left before you're back where you started (assuming the same number of computations, of course). So no - unless something really magical cuts things down a bit more, you can't take a 20% markdown (margin) and then a 100% markup and expect to end up with less of anything.
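The trade-off being argued here can be sketched as a toy max(compute, memory) roofline model. The 3.73x and 0.8x factors come from the opening post's estimates; the machine "rates" below are hypothetical, just to show the two regimes.

```python
# Toy roofline model: run time per useful bit is bounded by whichever
# resource saturates first. Work is normalized so DP costs 1.0 of each;
# per the opening post, DD needs ~3.73x the FLOPs but only ~0.8x the bytes.
def runtime(compute_work, memory_work, compute_rate, memory_rate):
    return max(compute_work / compute_rate, memory_work / memory_rate)

def dp_time(compute_rate, memory_rate):
    return runtime(1.0, 1.0, compute_rate, memory_rate)

def dd_time(compute_rate, memory_rate):
    return runtime(3.73, 0.8, compute_rate, memory_rate)

# Heavily memory-starved machine: DD's smaller footprint wins (0.8x run time).
assert dd_time(10.0, 1.0) < dp_time(10.0, 1.0)

# Compute-limited machine: DD's 3.73x FLOP cost dominates and it loses.
assert dd_time(2.0, 1.0) > dp_time(2.0, 1.0)
```

In this model DD only wins once the compute-to-bandwidth ratio exceeds 3.73 / 0.8 ≈ 4.7x the DP balance point - i.e., only on parts whose FPUs are mostly idle waiting on memory.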
Last fiddled with by science_man_88 on 2017-10-04 at 00:48 |
#3
Sep 2016

Quote:
Instead of the DD algorithm being faster, the computational power goes up by another factor of 2x without an improvement in memory bandwidth. (Which is kind of what just happened with the 18-core Skylake.) But yes, even when both algorithms are completely memory-bound, the DD algorithm's run-time won't be less than 80% of the DP algorithm's.

Last fiddled with by Mysticial on 2017-10-04 at 01:06
#4
"Forget I exist"
Jul 2009

Quote:
#5
Feb 2016
UK
The implementation side is beyond me, but I do have observations on performance. Some scenarios to consider:
For a "fast" modern Intel quad, say >= 4 GHz, I found 3200-rated dual-channel, dual-rank ram to give pretty good scaling (relatively small FFTs). Basic ram at 2133 just cripples it.

For Skylake-X I'm running a 7800X (6 cores) with 3200 quad-channel single-rank ram, and on paper that should be comparable in CPU:ram ratio to the quad above for AVX2. For AVX-512 I think it'll hurt without some change. Was there any timeline given for an AVX-512 implementation?

The 8700K, expected to launch tomorrow, gives us 6 cores at a high potential clock but still with dual-channel ram. As a consumer part it won't have AVX-512 to worry about, but I think AVX2 will be starved without insane ram speeds.

I get the feeling consumer-level core counts will tend to increase more now, as will overall potential peak throughput, but they will remain at dual-channel ram. We're going to have even more cores than we can feed.
#6
Einyen
Dec 2003
Denmark

Quote:
http://www.mersenneforum.org/showthread.php?t=20575

In post #8 of that thread you can see that only at 2240K FFT with 1 worker can dual-channel keep up with quad-channel; with more workers and/or larger FFTs, quad-channel is always faster. The question then is at what point it becomes memory bound with quad-channel as well.
#7
Sep 2016

Quote:
If 3200 dual-channel is "enough" for a 4 GHz Skylake (AVX2), then the 7800X shouldn't have any issues at all with 4 channels (AVX2).

Quote:
Quote:
IOW, there's a core-count war between AMD and Intel with little to no attention given to memory. For now, the chips with the largest memory bottleneck will still be the HCC Skylake X line with 18 cores + AVX512. Personally, I'd like to see quad-channel mainstream and 8-channel HEDT. (IOW, every DIMM is a channel.) Though I don't know if it's even possible to route that many traces on a motherboard.

Quote:
Once my 7900X becomes available again, I'll need to do some more targeted tests to see how bad it is right now with P95 AVX2. VTune has the ability to record memory bandwidth usage. In my application (with AVX512), the bandwidth usage is basically a backwards Poisson distribution where the "hump" kisses the limit of 75 GB/s on my system.

Last fiddled with by Mysticial on 2017-10-04 at 18:14
#8
"Robert Gerbicz"
Oct 2005
Hungary
#9
"Kieren"
Jul 2011
In My Own Galaxy!
IIRC, the boost from Dual Rank is around 15%.
#10
Sep 2016
AL=BL=1, AH=BH=0 is an invalid double-double. AH must be greater in absolute magnitude than AL (unless both are zero, in which case the entire double-double is zero). (More precisely, AH should be around 2^52 times larger than AL.) The same applies to BH/BL.

Last fiddled with by Mysticial on 2017-10-04 at 18:27
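One way to state that invariant: the low word must be so small relative to the high word that adding it back changes nothing after rounding. A one-line Python check (floats are IEEE doubles), just as a sketch of the normalization rule:

```python
def is_valid_dd(hi, lo):
    # (hi, lo) is a normalized double-double only if lo vanishes into
    # hi's rounding error: fl(hi + lo) == hi. Covers the all-zero case too.
    return hi + lo == hi

# The invalid example from this post: {AL=1, AH=0}.
assert not is_valid_dd(0.0, 1.0)

assert is_valid_dd(1.0, 2.0 ** -54)   # |lo| well below an ulp of hi
assert is_valid_dd(0.0, 0.0)
```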
#11
P90 years forever!
Aug 2002
Yeehaw, FL

Quote:
My experience with dual-channel DDR4-2400 Kaby Lake is that prime95 becomes memory bound at roughly 3 cores. Some back-of-the-envelope estimates: using 3200, about 4 cores. Skylake X with quad-channel would be memory bound at about 8 cores. A perfect AVX-512 implementation on Skylake-X would be memory bound at 4 cores. So on a 16-core chip that gives you 1/4 FPU utilization -- enough to handle the 3.73x computational difference.

Question: Is a 3.73x computational difference the best we can do? How about Ernst's DP+INT? An NTT? I once experimented with a 128-bit fixed-point FP CUDA implementation -- we can make adds fast, but muls are slower. It was a while ago, so I don't know how well it would work in AVX-512.
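Those estimates all follow one scaling rule: cores-until-memory-bound scales with channel count and DRAM speed, and halves if per-core FLOPs double. A quick sanity check in Python - the baseline "3 cores at dual-channel 2400 with AVX2" is this post's estimate, not a measurement, and it assumes AVX-512 doubles per-core throughput:

```python
def memory_bound_cores(channels, mt_per_s, avx512=False):
    # Baseline estimate from this post: dual-channel DDR4-2400 + AVX2
    # saturates memory at ~3 cores. Scale linearly with bandwidth,
    # halve if each core does twice the FLOPs (AVX-512).
    baseline = 3.0
    bandwidth_scale = (channels / 2.0) * (mt_per_s / 2400.0)
    flops_scale = 2.0 if avx512 else 1.0
    return baseline * bandwidth_scale / flops_scale

assert round(memory_bound_cores(2, 3200)) == 4               # dual 3200, AVX2
assert round(memory_bound_cores(4, 3200)) == 8               # quad 3200, AVX2
assert round(memory_bound_cores(4, 3200, avx512=True)) == 4  # quad 3200, AVX-512
```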