mersenneforum.org  

Old 2017-10-04, 00:22   #1
Mysticial
 
 
Sep 2016

Double-Double Arithmetic

In light of the massive memory bottlenecks that I've observed with my 10-core 7900X along with the launch of 18-core chips, I felt the urge to bring back the topic of double-double arithmetic - even if it's much earlier than I had originally anticipated.

Going back to my post here, the motivation behind DD (double-double) is that DP FFTs are completely memory bound. DD has a larger "word" and therefore allows higher memory efficiency. (I estimate about 20% lower memory/bandwidth consumption.)
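(Sanity-checking that 20% estimate against the per-word densities used further down - ~17 bits per 8-byte DP word vs. ~43 bits per 16-byte DD word:)
Code:
DP: 8 bytes / 17 bits  = 0.47 bytes per useful bit
DD: 16 bytes / 43 bits = 0.37 bytes per useful bit
0.37 / 0.47 = 0.79  ->  roughly 21% less memory traffic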

The trade-off, of course, is that DD is much slower. But perhaps not as much slower as it looks at first glance.

Double-double addition and multiplication can be done in as few as 7 and 4 instructions respectively if you cut all the corners (see bottom of post for specifics). But in cutting those corners, you'll need to rebalance the numbers every once in a while at the cost of 3 instructions.

So a hypothetical radix-4 butterfly (3 twiddle factors) would be:
Code:
8 complex DD adds + 3 complex DD multiplies
= 16 DD add + 3 × (4 DD mul + 2 DD add)
= 22 DD add + 12 DD mul
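(To spell out the complex-multiply count: it's just the schoolbook complex product with every real operation promoted to a DD operation.)
Code:
(a + bi)(c + di) = (ac - bd) + (ad + bc)i
real part: ac, bd  ->  2 DD mul + 1 DD add
imag part: ad, bc  ->  2 DD mul + 1 DD add
total:                 4 DD mul + 2 DD add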
How often we need to rebalance is unclear at this point. It needs to be done at some point to avoid too much loss of precision. But doing it after every single DD operation is probably overkill.

If we rebalance every two layers, the computational cost becomes:
Code:
22 DD add × 7 + 12 DD mul × 4 + 8 rebalance × 3 = 154 + 48 + 24 = 226 FPU instructions
(Digressing a bit, this computational density will still require around radix 32-64 to avoid saturating the memory bandwidth on the 10-core Skylake X. And you may need to go as high as radix 256 - 1024 to avoid bandwidth saturation on the 18-core Skylake. In the special case of LL where the FFTs are 2-pass and merged on both ends, it'll probably remain compute-bound on the 18-core - but probably just barely.)


By comparison, a DP radix-4 butterfly can be done in about 24 FMAs. (I'm not sure what's actually achieved now by P95 and Mlucas.)

However, a DP FFT does ~17 bits/word of work while a DD FFT does ~43 bits/word. So the relative "computation/useful work" is:
  • DP FFT: 24 / 17 = 1.41 instructions / bit
  • DD FFT: 226 / 43 = 5.26 instructions / bit

So we start off with a 3.73x (5.26 / 1.41) computational difference between DP and DD right now. The memory difference is 20% in the other direction. So in a world where computation is free and memory access is the only factor, DD will be 20% faster.

As far as I understand right now, P95 (and LL in general) is completely memory-bound with just AVX2. Now we have AVX512, which doubles the FP throughput without adding memory bandwidth. So that 3.73x margin cuts roughly in half, to about 2x.

Now the question is: How memory bound is LL right now with AVX2? Would dropping the memory footprint by 20% at the cost of 2x the computation lead to a net speedup? If not for the low-core chips, what about the 18-core Skylake X?

Admittedly, my analysis is overly simplified. I have neither an implementation nor a benchmark, so there's a lot of room for error. But if it's reasonably accurate, I'm thinking we might be getting close to crossing that threshold, if we haven't already with the 18-core 7980XE.



Double-Double Multiply: 4 FP instructions
Code:
A = {AL, AH}
B = {BL, BH}

H = AH * BH;
L = AH * BH - H;  //  FMA: exact rounding error of AH * BH
L = AL * BH + L;  //  FMA: cross term
L = AH * BL + L;  //  FMA: cross term (AL * BL is dropped entirely - that's the corner cut)

return {L, H};
Double-Double Add: 7 FP instructions
Code:
A = {AL, AH}
B = {BL, BH}

IH = abs_max(AH, BH);   //  AVX512-DQ "vrangepd": input with the larger magnitude
IL = abs_min(AH, BH);   //  AVX512-DQ "vrangepd": input with the smaller magnitude
H = AH + BH;
L = H - IH;             //  portion of the smaller input that made it into H
L = IL - L;             //  portion that was rounded away
L = L + AL;             //  fold in both low words
L = L + BL;

return {L, H};
Rebalance: 3 instructions
Code:
A = {AL, AH}

H = AL + AH;      //  new high word
L = H - AH;       //  portion of AL that made it into H
L = AL - L;       //  portion that didn't fit (Fast2Sum error term)

return {L, H};
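For anyone who wants to experiment with these outside of AVX512, here is a minimal scalar C sketch of the same three primitives, using C99 fma()/fabs() in place of vrangepd. The dd struct and the function names are just for illustration - they aren't from any existing library.
Code:
#include <math.h>
#include <stdio.h>

typedef struct { double hi, lo; } dd;   /* hi carries the leading ~53 bits, lo the residue */

/* Corner-cut multiply: 4 FP ops. The lo*lo term is deliberately dropped. */
static dd dd_mul(dd a, dd b) {
    double h = a.hi * b.hi;
    double l = fma(a.hi, b.hi, -h);   /* exact rounding error of the high product */
    l = fma(a.lo, b.hi, l);           /* cross term */
    l = fma(a.hi, b.lo, l);           /* cross term */
    return (dd){ h, l };
}

/* Corner-cut add: 7 FP ops. The fabs() compare stands in for vrangepd. */
static dd dd_add(dd a, dd b) {
    int a_big = fabs(a.hi) >= fabs(b.hi);
    double ih = a_big ? a.hi : b.hi;  /* abs_max */
    double il = a_big ? b.hi : a.hi;  /* abs_min */
    double h = a.hi + b.hi;
    double l = h - ih;                /* portion of the smaller input that made it into h */
    l = il - l;                       /* portion that was rounded away */
    l = l + a.lo;                     /* fold in both low words */
    l = l + b.lo;
    return (dd){ h, l };
}

/* Rebalance (Fast2Sum): fold lo back into hi so |lo| stays tiny. */
static dd dd_rebalance(dd a) {
    double h = a.lo + a.hi;
    double l = h - a.hi;              /* portion of lo that made it into the new hi */
    l = a.lo - l;                     /* portion that didn't fit */
    return (dd){ h, l };
}

int main(void) {
    dd x = { 1.0, 1e-20 };
    dd y = { 3.0, -2e-21 };
    dd p = dd_rebalance(dd_add(dd_mul(x, y), y));   /* x*y + y */
    printf("%.17g + %.17g\n", p.hi, p.lo);          /* ~6 + 2.6e-20 */
    return 0;
}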
Old 2017-10-04, 00:48   #2
science_man_88
 
 
"Forget I exist"
Jul 2009
Dumbassville


Quote:
Originally Posted by Mysticial
Would dropping the memory footprint by 20% at the cost of 2x the computation lead to a net speedup? If not for the low-core chips, what about the 18-core Skylake X?
I'll only say this: markup vs. margin. A 20% markdown (a margin of the original time, etc.) means a total of 25% more computation time per computation (assuming the same number of computations, of course). So no - unless something really magical cuts things down a bit more, you can't take a 20% markdown (margin) and then a 100% markup and expect to take less of anything.

Old 2017-10-04, 01:04   #3
Mysticial
 
 
Sep 2016


Quote:
Originally Posted by science_man_88
I'll only say this: markup vs. margin. A 20% markdown (a margin of the original time, etc.) means a total of 25% more computation time per computation (assuming the same number of computations, of course). So no - unless something really magical cuts things down a bit more, you can't take a 20% markdown (margin) and then a 100% markup and expect to take less of anything.
I was actually thinking of the opposite of "magical".

Instead of the DD algorithm being faster, the computational power goes up another factor of 2x without an improvement in memory bandwidth. (which is kind of what just happened with the 18-core Skylake)

But yes, even when both algorithms are completely memory-bound, the DD algorithm won't be less than 80% of the run-time of the DP algorithm.

Old 2017-10-04, 01:14   #4
science_man_88
 
 
"Forget I exist"
Jul 2009
Dumbassville


Quote:
Originally Posted by Mysticial
I was actually thinking of the opposite of "magical".

Instead of the DD algorithm being faster, the computational power goes up another factor of 2x without an improvement in memory bandwidth. (which is kind of what just happened with the 18-core Skylake)

But yes, even when both algorithms are completely memory-bound, the DD algorithm won't be less than 80% of the run-time of the DP algorithm.
Yeah, that - and if you divide the true half, 1.865, by 1.25, you get under 1.5x as the factor between the two.
Old 2017-10-04, 11:56   #5
mackerel
 
 
Feb 2016
UK


The implementation side is beyond me, but I do have observations on performance. Some scenarios to consider:

For a "fast" modern Intel quad, say >=4 GHz, I found 3200 rated dual channel, dual rank ram to give pretty good scaling (relative small FFTs). Basic ram at 2133 just cripples it.

For Skylake-X 7800X (6 cores) I'm running 3200 quad channel single rank, and on paper that should be comparable in CPU:ram ratio to the quad above for AVX2. For AVX-512 I think it'll hurt without some change. Was there any timeline for AVX-512 implementation?

The 8700K, expected to launch tomorrow, gives us 6 cores at a high potential clock but still with dual-channel ram. As a consumer-line part it won't have AVX-512 to worry about, but I think AVX2 will be starved without insane ram speeds.

I get the feeling consumer level core counts will tend to increase more now, as will overall potential peak throughput, but they will remain at dual channel ram. We're going to have even more cores than we can feed.
Old 2017-10-04, 17:49   #6
ATH
Einyen
 
 
Dec 2003
Denmark


Quote:
Originally Posted by Mysticial
Now the question is: How memory bound is LL right now with AVX2? Would dropping the memory footprint by 20% at the cost of 2x the computation lead to a net speedup? If not for the low-core chips, what about the 18-core Skylake X?
You can see my old Dual vs Quad channel RAM test on the 8 core 5960X:

http://www.mersenneforum.org/showthread.php?t=20575

In post #8 you can see that only at 2240K FFT and 1 worker can dual channel keep up with quad channel; with more workers and/or higher FFTs, quad channel is always faster. The question is then at what point it becomes memory bound with quad channel as well.
Old 2017-10-04, 18:00   #7
Mysticial
 
 
Sep 2016


Quote:
Originally Posted by mackerel
The implementation side is beyond me, but I do have observations on performance. Some scenarios to consider:

For a "fast" modern Intel quad, say >=4 GHz, I found 3200 rated dual channel, dual rank ram to give pretty good scaling (relative small FFTs). Basic ram at 2133 just cripples it.

For Skylake-X 7800X (6 cores) I'm running 3200 quad channel single rank, and on paper that should be comparable in CPU:ram ratio to the quad above for AVX2. For AVX-512 I think it'll hurt without some change.
I can't imagine dual-rank vs. single-rank having nearly as much of a performance impact as dual vs. quad-channel.

If 3200 dual-channel is "enough" for a 4 GHz Skylake (AVX2), then the 7800X shouldn't have any issues at all with 4 channels (AVX2).

Quote:
Was there any timeline for AVX-512 implementation?
Mlucas already has it - Ernst asked me to benchmark it. But I haven't heard anything from George.

Quote:

The 8700K, expected to launch tomorrow, gives us 6 cores at a high potential clock but still with dual-channel ram. As a consumer-line part it won't have AVX-512 to worry about, but I think AVX2 will be starved without insane ram speeds.

I get the feeling consumer level core counts will tend to increase more now, as will overall potential peak throughput, but they will remain at dual channel ram. We're going to have even more cores than we can feed.
There are rumors of 12-core Ryzens coming out in February and an 8-core Coffee Lake later next year. Both with dual-channel, neither with AVX512.

IOW, there's a core-count war between AMD and Intel with little to no attention given to memory.

For now, the chips with the largest memory bottleneck will still be the HCC Skylake X line with 18 cores + AVX512.

Personally, I'd like to see quad-channel mainstream and 8-channel HEDT. (IOW, every DIMM is a channel.) Though I don't know if it's even possible to route that many traces on a motherboard.

Quote:
Originally Posted by ATH
You can see my old Dual vs Quad channel RAM test on the 8 core 5960X:

http://www.mersenneforum.org/showthread.php?t=20575

In post #8 you can see that only at 2240K FFT and 1 worker can dual channel keep up with quad channel; with more workers and/or higher FFTs, quad channel is always faster. The question is then at what point it becomes memory bound with quad channel as well.
A 60% speedup going from 2 to 4 channels is, IMO, quite massive. And it gives an insight into what the upcoming Intel chips will look like. And all without AVX512.

Once my 7900X becomes available again, I'll need to do some more targeted tests to see how bad it is right now with P95 AVX2. VTune has the ability to record memory bandwidth usage. In my application (with AVX512), the bandwidth usage is basically a backwards Poisson distribution where the "hump" kisses the limit of 75 GB/s on my system.

Old 2017-10-04, 18:18   #8
R. Gerbicz
 
 
"Robert Gerbicz"
Oct 2005
Hungary


Quote:
Originally Posted by Mysticial
Double-double Multiply: 4 FP instructions
Code:
A = {AL, AH}
B = {BL, BH}

H = AH * BH;
L = AH * BH - H;  //  FMA
L = AL * BH + L;  //  FMA
L = AH * BL + L;  //  FMA

return {L, H};
It is totally wrong for AL=BL=1, AH=BH=0.
Old 2017-10-04, 18:19   #9
kladner
 
 
"Kieren"
Jul 2011
In My Own Galaxy!


IIRC, the boost from Dual Rank is around 15%.
Old 2017-10-04, 18:24   #10
Mysticial
 
 
Sep 2016


Quote:
Originally Posted by R. Gerbicz
It is totally wrong for AL=BL=1, AH=BH=0.

AL=BL=1, AH=BH=0 is an invalid double-double. AH must be greater in absolute magnitude than AL (unless both are zero, in which case the entire double-double is zero). (More precisely, AH should be around 2^52 times larger than AL.)

Same applies to BH/BL.
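In code form (a hypothetical helper, not from any library), the invariant is simply that the low word is too small to perturb the high word:
Code:
#include <stdbool.h>

/*  {AL, AH} is a normalized double-double exactly when adding AL to AH
    changes nothing, i.e. |AL| <= 1/2 ulp(AH). The AL = BL = 1,
    AH = BH = 0 case above fails this check.  */
static bool dd_is_normalized(double AH, double AL) {
    return AH + AL == AH;
}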

Old 2017-10-04, 18:35   #11
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL


Quote:
Originally Posted by Mysticial
Mlucas already has it - Ernst asked me to benchmark it. But I haven't heard anything from George.
This winter's project will be AVX-512.

My experience with dual-channel DDR4-2400 Kaby Lake is that prime95 becomes memory bound at roughly 3 cores. Some back-of-the-envelope estimates: using DDR4-3200, about 4 cores. Skylake-X with quad-channel would be memory bound at about 8 cores. A perfect AVX-512 implementation on Skylake-X would be memory bound at 4 cores. So on a 16-core chip that gives you 1/4 FPU utilization -- enough to handle the 3.73X computational difference.

Question: Is a 3.73X computational difference the best we can do? How about Ernst's DP+INT? An NTT? I once experimented with a 128-bit fixed-point FP CUDA implementation -- we can make add fast, but muls are slower. It was a while ago, so I don't know how well it would work in AVX-512.
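(As a rough scalar illustration of that add/mul asymmetry - assuming GCC/Clang's unsigned __int128 here, not anything from the CUDA experiment - the add is a single carry chain, while the high half of the product needs four 64x64 partial products:)
Code:
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* 128-bit fixed-point add: compiles to one add + one add-with-carry. */
static u128 fx_add(u128 a, u128 b) { return a + b; }

/* High 128 bits of the 256-bit product: four 64x64 partial products
   plus carry handling -- this is where the mul cost comes from. */
static u128 fx_mulhi(u128 a, u128 b) {
    uint64_t a0 = (uint64_t)a, a1 = (uint64_t)(a >> 64);
    uint64_t b0 = (uint64_t)b, b1 = (uint64_t)(b >> 64);
    u128 p00 = (u128)a0 * b0;
    u128 p01 = (u128)a0 * b1;
    u128 p10 = (u128)a1 * b0;
    u128 p11 = (u128)a1 * b1;
    u128 mid = (p00 >> 64) + (uint64_t)p01 + (uint64_t)p10;
    return p11 + (p01 >> 64) + (p10 >> 64) + (mid >> 64);
}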