mersenneforum.org Double-Double Arithmetic

2017-10-04, 00:22   #1
Mysticial

Sep 2016

14C16 Posts

Double-Double Arithmetic

In light of the massive memory bottlenecks that I've observed with my 10-core 7900X, along with the launch of the 18-core chips, I felt the urge to bring back the topic of double-double arithmetic - even if it's much earlier than I had originally anticipated.

Going back to my post here, the motivation behind DD (double-double) is that DP FFTs are completely memory-bound. DD has a larger "word" and therefore allows higher memory efficiency. (I estimate about 20% lower memory/bandwidth consumption.)

The trade-off, of course, is that DD is much slower - but maybe not quite as much as it looks at first glance. Double-double addition and multiplication can be done in as few as 7 and 4 instructions respectively if you cut all the corners (see the bottom of this post for specifics). But in cutting those corners, you'll need to rebalance the numbers every once in a while at the cost of 3 instructions. So a hypothetical radix-4 butterfly (3 twiddle factors) would be:

Code:
8 complex DD adds + 3 complex DD muls
  = 16 DD add + 3 * (4 DD mul + 2 DD add)
  = 22 DD add + 12 DD mul
How often we need to rebalance is unclear at this point. It needs to be done at some point to avoid too much loss of precision, but doing it after every single DD operation is probably overkill. If we rebalance every two layers, the computational cost becomes:

Code:
22 (7) + 12 (4) + 8 * 3 = 226 FPU instructions
(Digressing a bit: this computational density will still require around radix 32-64 to avoid saturating the memory bandwidth on the 10-core Skylake X. And you may need to go as high as radix 256-1024 to avoid bandwidth saturation on the 18-core Skylake. In the special case of LL, where the FFTs are 2-pass and merged on both ends, it'll probably remain compute-bound on the 18-core - but probably just barely.)

By comparison, a DP radix-4 butterfly can be done in about 24 FMAs.
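For reference, the instruction tally above can be checked mechanically. This is just the arithmetic from the post written out (all counts are the ones quoted there, not new measurements):

```python
# Tallying the radix-4 DD butterfly cost (rebalance every two layers).
DD_ADD = 7        # FP instructions per corner-cut DD add
DD_MUL = 4        # FP instructions per corner-cut DD mul
REBALANCE = 3     # FP instructions per renormalization

dd_adds = 8 * 2 + 3 * 2    # 8 complex adds (2 DD adds each) + 2 DD adds inside each complex mul
dd_muls = 3 * 4            # 4 DD muls per complex mul, 3 twiddle factors
rebalances = 8             # 8 DD components renormalized per two layers

total = dd_adds * DD_ADD + dd_muls * DD_MUL + rebalances * REBALANCE
print(dd_adds, dd_muls, total)   # 22 12 226
```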
(Not sure what's actually achieved now with P95 and Mlucas.)

However, a DP FFT does ~17 bits/word of work while a DD FFT does ~43 bits/word. So the relative "computation/useful work" is:

DP FFT: 24 / 17 = 1.41 instructions / bit
DD FFT: 226 / 43 = 5.26 instructions / bit

So we start off with a 3.73x computational difference between DP and DD right now. The memory difference is 20% in the other direction. So in a world where computation is free and memory access is the only factor, DD will be 20% faster.

As far as I understand right now, P95 (and LL in general) is completely memory-bound with just AVX2. Now we have AVX512. So that 3.73x margin cuts in half to about 2x.

Now the question is: How memory-bound is LL right now with AVX2? Would dropping the memory footprint by 20% at the cost of 2x the computation lead to a net speedup? If not for the low-core chips, what about the 18-core Skylake X?

Admittedly, my analysis is overly simplified. I have neither an implementation nor a benchmark, so there's a lot of room for error. But if it's reasonably accurate, I'm thinking we might be getting close to crossing that threshold, if we haven't already with the 18-core 7980XE.

Double-Double Multiply: 4 FP instructions
Code:
A = {AL, AH}
B = {BL, BH}

H = AH * BH;
L = AH * BH - H;  // FMA
L = AL * BH + L;  // FMA
L = AH * BL + L;  // FMA
return {L, H};
Double-Double Add: 7 FP instructions
Code:
A = {AL, AH}
B = {BL, BH}

IH = abs_max(AH, BH);  // AVX512-DQ "vrangepd"
IL = abs_min(AH, BH);  // AVX512-DQ "vrangepd"
H = AH + BH;
L = H - IH;
L = IL - L;
L = L + AL;
L = L + BL;
return {L, H};
Rebalance: 3 instructions
Code:
A = {AL, AH}

H = AL + AH;
L = H - AH;
L = AL - L;
return {L, H};
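To make the three primitives concrete, here is a scalar Python model of them. This is a sketch, not the vector code: `fma` is emulated exactly with rationals in place of the hardware FMA instruction, and the ternaries stand in for vrangepd's abs_max/abs_min:

```python
from fractions import Fraction

def fma(a, b, c):
    # Correctly-rounded fused multiply-add, emulated exactly via rationals.
    # (The real code would use the vfmadd* hardware instructions.)
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def dd_mul(al, ah, bl, bh):
    # The 4-instruction corner-cut DD multiply (drops the AL*BL term).
    h = ah * bh
    l = fma(ah, bh, -h)   # exact low half of AH*BH
    l = fma(al, bh, l)
    l = fma(ah, bl, l)
    return l, h

def dd_add(al, ah, bl, bh):
    # The 7-instruction corner-cut DD add; ih/il model vrangepd.
    ih = ah if abs(ah) >= abs(bh) else bh   # abs_max(AH, BH)
    il = bh if abs(ah) >= abs(bh) else ah   # abs_min(AH, BH)
    h = ah + bh
    l = h - ih
    l = il - l            # exact rounding error of AH + BH
    l = l + al
    l = l + bl
    return l, h

def rebalance(al, ah):
    # The 3-instruction renormalization (a quick two-sum).
    h = al + ah
    l = h - ah
    l = al - l
    return l, h

# (1 + 1e-17)^2 ~= 1 + 2e-17; the corner-cut multiply recovers the low word:
print(dd_mul(1e-17, 1.0, 1e-17, 1.0))   # (2e-17, 1.0)
```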
2017-10-04, 00:48   #2
science_man_88

"Forget I exist"
Jul 2009
Dumbassville

26×131 Posts

Quote:
 Originally Posted by Mysticial Would dropping the memory footprint by 20% at the cost of 2x the computation lead to a net speedup? If not for the low-core chips, what about the 18-core Skylake X?
I'll only say this: markup vs. margin. A 20% markdown (a margin of the original time, etc.) only buys you 25% more computation time per computation (assuming the same number of computations, of course). So no - unless something really magical cuts things down a bit more, you can't do a 20% markdown (margin) and then a 100% markup and expect to take less of anything.

Last fiddled with by science_man_88 on 2017-10-04 at 00:48

2017-10-04, 01:04   #3
Mysticial

Sep 2016

22·83 Posts

Quote:
 Originally Posted by science_man_88 I'll only say this. markup vs margin. a 20% markdown ( a margin of the original time etc.) means only a total of 25% more computation time per computation ( assuming the same number of computations of course) so no unless something really magical cuts things down a bit more you can't do 20% markdown ( margin) and then a 100% markup and expect to take less of anything.
I was actually thinking of the opposite of "magical".

Instead of the DD algorithm being faster, the computational power goes up another factor of 2x without an improvement in memory bandwidth. (which is kind of what just happened with the 18-core Skylake)

But yes, even when both algorithms are completely memory-bound, the DD algorithm won't be less than 80% of the run-time of the DP algorithm.
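That 80% floor falls out of a toy roofline model (hypothetical units; the 3.73x compute and 0.8x memory factors are the ones from post #1):

```python
# Toy roofline model: run time is whichever resource saturates first.
def run_time(compute_units, memory_units, flops, bandwidth):
    return max(compute_units / flops, memory_units / bandwidth)

bandwidth = 1.0
for flops in (1.0, 4.0, 16.0):   # compute throughput grows, bandwidth stays flat
    dp = run_time(1.0, 1.0, flops, bandwidth)    # DP: baseline compute and memory
    dd = run_time(3.73, 0.8, flops, bandwidth)   # DD: 3.73x compute, 0.8x memory
    print(flops, dd / dp)
```

As the compute side grows, the DD/DP run-time ratio falls from 3.73 toward the memory-bound floor of 0.8, but never below it.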

Last fiddled with by Mysticial on 2017-10-04 at 01:06

2017-10-04, 01:14   #4
science_man_88

"Forget I exist"
Jul 2009
Dumbassville

26×131 Posts

Quote:
 Originally Posted by Mysticial I was actually thinking of the opposite of "magical". Instead of the DD algorithm being faster, the computational power goes up another factor of 2x without an improvement in memory bandwidth. (which is kind of what just happened with the 18-core Skylake) But yes, even when both algorithms are completely memory-bound, the DD algorithm won't be less than 80% of the run-time of the DP algorithm.
Yeah - that, and if you divide the true half, 1.865, by 1.25, you get under 1.5x as the factor between the two.

2017-10-04, 11:56   #5
mackerel

Feb 2016
UK

419 Posts

The implementation side is beyond me, but I do have observations on performance. Some scenarios to consider:

For a "fast" modern Intel quad, say >=4 GHz, I found 3200-rated dual-channel, dual-rank ram to give pretty good scaling (relatively small FFTs). Basic ram at 2133 just cripples it.

For Skylake-X, I'm running a 7800X (6 cores) with 3200 quad-channel single-rank ram, and on paper that should be comparable in CPU:ram ratio to the quad above for AVX2. For AVX-512 I think it'll hurt without some change. Was there any timeline for AVX-512 implementation?

The 8700K, expected to launch tomorrow, gives us 6 cores at high potential clocks but still with dual-channel ram. As a consumer-line part it won't have AVX-512 to worry about, but I think AVX2 will be starved without insane ram speeds.

I get the feeling consumer-level core counts will tend to increase more now, as will overall potential peak throughput, but they will remain at dual-channel ram. We're going to have even more cores than we can feed.
2017-10-04, 17:49   #6
ATH
Einyen

Dec 2003
Denmark

C3B16 Posts

Quote:
 Originally Posted by Mysticial Now the question is: How memory bound is LL right now with AVX2? Would dropping the memory footprint by 20% at the cost of 2x the computation lead to a net speedup? If not for the low-core chips, what about the 18-core Skylake X?
You can see my old dual- vs. quad-channel RAM test on the 8-core 5960X: http://www.mersenneforum.org/showthread.php?t=20575

In post #8 you can see that only at 2240K FFT with 1 worker can dual channel keep up with quad channel; with more workers and/or higher FFTs, quad channel is always faster. The question is then: at what point is it memory-bound with quad channel as well?

2017-10-04, 18:00   #7
Mysticial

Sep 2016

14C16 Posts

Quote:
 Originally Posted by mackerel The implementation side is beyond me, but I do have observations on performance. Some scenarios to consider: For a "fast" modern Intel quad, say >=4 GHz, I found 3200 rated dual channel, dual rank ram to give pretty good scaling (relative small FFTs). Basic ram at 2133 just cripples it. For Skylake-X 7800X (6 cores) I'm running 3200 quad channel single rank, and on paper that should be comparable in CPU:ram ratio to the quad above for AVX2. For AVX-512 I think it'll hurt without some change.
I can't imagine dual-rank vs. single-rank having nearly as much of a performance impact as dual vs. quad-channel.

If 3200 dual-channel is "enough" for a 4 GHz Skylake (AVX2), then the 7800X shouldn't have any issues at all with 4 channels (AVX2).

Quote:
 Was there any timeline for AVX-512 implementation?
Mlucas already has it since Ernst asked me to benchmark it. But I haven't heard of anything from George.

Quote:
 The expected to be launched tomorrow 8700k gives us 6 cores at high potential clock but still with dual channel ram. As consumer line it wont have AVX-512 to worry about, but I think AVX2 will be starved without insane ram speeds. I get the feeling consumer level core counts will tend to increase more now, as will overall potential peak throughput, but they will remain at dual channel ram. We're going to have even more cores than we can feed.
There's rumors of 12-core Ryzens coming out in February and 8-core Coffee Lake later next year. Both with dual-channel. But neither with AVX512.

IOW, there's a core-count war between AMD and Intel with little to no attention given to memory.

For now, the chips with the largest memory bottleneck will still be the HCC Skylake X line with 18 cores + AVX512.

Personally, I'd like to see quad-channel mainstream and 8-channel HEDT. (IOW, every DIMM is a channel.) Though I don't know if it's even possible to route that many traces on a motherboard.

Quote:
 Originally Posted by ATH You can see my old Dual vs Quad channel RAM test on the 8 core 5960X: http://www.mersenneforum.org/showthread.php?t=20575 In post #8 you can see that only at 2240K FFT and 1 worker can dual channel keep up with quad channel, but with more workers and/or higher FFT quad channel is always faster. The question is then at what point is it memory bound with quad channel as well.
A 60% speedup from 2 -> 4 channels is, IMO, quite massive. And it gives an insight into what the upcoming Intel chips will look like. And all without AVX512.

Once my 7900X becomes available again, I'll need to do some more targeted tests to see how bad it is right now with P95 AVX2. VTune has the ability to record memory bandwidth usage. In my application (with AVX512), the bandwidth usage is basically a backwards Poisson distribution where the "hump" kisses the limit of 75 GB/s on my system.

Last fiddled with by Mysticial on 2017-10-04 at 18:14

2017-10-04, 18:18   #8
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

2·733 Posts

Quote:
 Originally Posted by Mysticial
 Double-Double Multiply: 4 FP instructions
 Code:
 A = {AL, AH}
 B = {BL, BH}

 H = AH * BH;
 L = AH * BH - H;  // FMA
 L = AL * BH + L;  // FMA
 L = AH * BL + L;  // FMA
 return {L, H};
It is totally wrong for AL=BL=1, AH=BH=0.

 2017-10-04, 18:19 #9 kladner     "Kieren" Jul 2011 In My Own Galaxy! 27AE16 Posts IIRC, the boost from Dual Rank is around 15%.
2017-10-04, 18:24   #10
Mysticial

Sep 2016

5148 Posts

Quote:
 Originally Posted by R. Gerbicz It is totally wrong for AL=BL=1, AH=BH=0.

AL=BL=1, AH=BH=0 is an invalid double-double. AH must be greater in absolute magnitude than AL (unless both are zero, in which case the entire double-double is zero). (More precisely, AH should be around 2^52 times larger than AL.)

Same applies to BH/BL.
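For illustration, here is Gerbicz's input run through a scalar Python model of the 4-instruction multiply (a sketch: `fma` is emulated exactly with rationals in place of the hardware FMA):

```python
from fractions import Fraction

def fma(a, b, c):
    # Correctly-rounded fused multiply-add, emulated exactly via rationals.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def dd_mul(al, ah, bl, bh):
    # The 4-instruction DD multiply from post #1.
    h = ah * bh
    l = fma(ah, bh, -h)
    l = fma(al, bh, l)
    l = fma(ah, bl, l)
    return l, h

# Gerbicz's input: AL=BL=1, AH=BH=0 -- an unnormalized double-double.
print(dd_mul(1.0, 0.0, 1.0, 0.0))   # (0.0, 0.0): wrong, the true product is 1

# The same value 1 in normalized form, {AL, AH} = {0, 1}:
print(dd_mul(0.0, 1.0, 0.0, 1.0))   # (0.0, 1.0): correct
```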

Last fiddled with by Mysticial on 2017-10-04 at 18:27

2017-10-04, 18:35   #11
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

3×11×227 Posts

Quote:
 Originally Posted by Mysticial Mlucas already has it since Ernst asked me to benchmark it. But I haven't heard of anything from George.
This winter's project will be AVX-512.

My experience with dual-channel DDR4-2400 Kaby Lake is that prime95 becomes memory-bound at roughly 3 cores. Some back-of-the-envelope estimates: using 3200 RAM, about 4 cores. Skylake X with quad-channel: memory-bound at about 8 cores. A perfect AVX-512 implementation on Skylake-X would be memory-bound at 4 cores. So on a 16-core chip that gives you 1/4 FPU utilization -- enough to handle the 3.73X computational difference.
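Those back-of-the-envelope estimates chain together like this (every input here is an estimate from the post, not a measurement):

```python
# Cores that can be fed before prime95 goes memory-bound (AVX2 baseline).
base_cores = 3                            # dual-channel DDR4-2400 Kaby Lake

cores_3200 = base_cores * 3200 / 2400     # faster RAM feeds more cores
cores_quad = cores_3200 * 2               # 2 -> 4 memory channels
cores_avx512 = cores_quad / 2             # perfect AVX-512 doubles per-core demand

print(cores_3200, cores_quad, cores_avx512)   # 4.0 8.0 4.0

# On a 16-core chip, memory-bound at 4 cores leaves 4x FPU headroom --
# more than the 3.73x computational difference DD needs.
print(16 / cores_avx512)   # 4.0
```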

Question: Is the 3.73X computational difference the best we can do? How about Ernst's DP+INT? An NTT? I once experimented with a 128-bit fixed-point FP CUDA implementation -- we can make the add fast, but muls are slower. It was a while ago, so I don't know how well it would work in AVX-512.

