mersenneforum.org: I wonder if there is a single precision version LL-test for Nvidia GPU computing

2018-12-22, 19:32 #1
Neutron3529

Dec 2018
China

43 Posts

I wonder if there is a single precision version LL-test for Nvidia GPU computing

Firstly, thanks for reading this thread, since my English is poor. Several years ago, https://www.mersenneforum.org/showthread.php?t=12576 described a single-precision version, but when I visit "http://www.garlic.com/%7Ewedgingt/mers.tar.gz" I get: "Oh dear, we seemed to have gotten a bit lost!" I want to know whether a single-precision LL test is possible for GPU computing. In fact, my GTX 1060 is really slow at double-precision computing; it is even 50% slower and runs 4x hotter than my CPU when executing the LL test. I know that using single precision would boost my GPU computing, but I don't know how to convert the original .cu code into a single-precision one, since I don't know how to choose the FFT length, etc. Could anyone help me? Thanks.
2018-12-22, 20:00   #2
chalsall
If I May

"Chris Halsall"
Sep 2002
Barbados

292C₁₆ Posts

Quote:
 Originally Posted by Neutron3529 I want to know whether a single precision LL-test is possible for GPU computing.
Short answer: Yes, it is possible.

Longer answer: It doesn't make sense. Many more OPs and/or memory needed.

Quote:
 Originally Posted by Neutron3529 Could anyone help me?
We have some /very/ good GPU programmers here, writing code optimally for this very rarefied problem space.

If it was possible to improve the throughput using SP, they would have done it.

2018-12-22, 21:02   #3
ewmayer
∂2ω=0

Sep 2002
República de California

2³·3²·163 Posts

Quote:
 Originally Posted by chalsall Short answer: Yes, it is possible. Longer answer: It doesn't make sense. Many more OPs and/or memory needed. We have some /very/ good GPU programmers here, writing code optimally for this very rarefied problem space. If it was possible to improve the throughput using SP, they would have done it.
That assumes they made a serious effort to do so - are there any threads describing such?

Also - have any of the GPU coders actually written the FFT infrastructure code, or are all the GPU LL programs using library FFTs? If the latter, that would make it easier to try an SP program, since most mathlib-style FFT packages come in SP and DP versions, i.e. only the IBDWT wrappers would need to be coded up in SP.

Here are some actual estimated-FFT-length numbers, as given by the utility function I use to set FFT-length breakpoints, based on ROE-as-multidimensional-random-walk heuristics I developed for the F24 paper (which also discusses the choice of asymptotic constant, which is expected to be O(1) and whose precise choice affects the aggressiveness of FFT-length settings, immaterial for the present 'big picture' discussion). Here is a simple *nix bc function which implements same (invoke bc in floating-point mode, as 'bc -l'):
Code:
/* maxp(bmant, n, asympconst): estimated maximum exponent testable at FFT
   length n with a bmant-bit floating-point mantissa; asympconst is the O(1)
   asymptotic constant discussed above. l() is bc's natural log. */
define maxp(bmant, n, asympconst) {
	auto ln2inv, ln_n, lnln_n, l2_n, lnl2_n, l2l2_n, lnlnln_n, l2lnln_n, wbits;
	ln2inv = 1.0/l(2.0);
	ln_n = l(1.0*n); lnln_n = l(ln_n); l2_n = ln2inv*ln_n; lnl2_n = l(l2_n); l2l2_n = ln2inv*lnl2_n; lnlnln_n = l(lnln_n); l2lnln_n = ln2inv*lnlnln_n;
	/* wbits = average bits per input word; multiply by FFT length for max p */
	wbits = 0.5*( bmant - asympconst - 0.5*(l2_n + l2l2_n) - 1.5*(l2lnln_n) );
	return(wbits*n);
}
For example, comparing DP and SP for the FFT length of the current GIMPS wavefront, and rounding the outputs of the function:
DP:
maxp(53, 4608*2^10, 0.4) = 87540871
SP:
maxp(24, 4608*2^10, 0.4) = 19121287
so SP would allow exponents ~4.5x smaller than DP at this FFT length.

Flipping things around and asking what SP FFT length is needed to handle current-wavefront exponents gives 24576K = 24M, slightly above 5x the DP FFT length. So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win.

Last fiddled with by ewmayer on 2018-12-22 at 21:08

2018-12-22, 21:07   #4
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2²·5·7·47 Posts

Quote:
 Originally Posted by chalsall Short answer: Yes, it is possible. ... If it was possible to improve the throughput using SP, they would have done it.
And furthermore, on the OpenCL side, attempts were made and performance tests run. See gpuowl v1.8 and v1.9's DP, SP, M31 and M61 transforms etc. SP resulted in too few bits/word to be useful at the current exponents and FFT lengths of interest.

https://www.mersenneforum.org/showpo...&postcount=224

It's always good to review such things periodically though, and ask questions.
Sometimes new hardware brings new capabilities such that the old conclusions no longer hold. I recall a discussion of the RTX 2xxx hardware features, about how mixing FP and integer operations might be beneficial there.
CUDA-oriented development of Mersenne prime hunting software seems nearly dormant right now.

Last fiddled with by kriesel on 2018-12-22 at 21:11

2018-12-22, 21:12   #5
chalsall
If I May

"Chris Halsall"
Sep 2002
Barbados

2²·5·17·31 Posts

Quote:
 Originally Posted by ewmayer That assumes they made a serious effort to do so - are there any threads describing such? [snip] So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win.
Fair point.

In all honesty, I was being guided by the Economics 101 joke: "Two economists are walking down the street. One says to the other, 'Is that a \$20 bill just lying there on the street?' 'Of course not,' says the other economist, 'or else someone would have already picked it up.'"

If anyone wants to try to do this work (and make it profitable taking into account their time), all the power to them!

2018-12-22, 21:13   #6
ewmayer
∂2ω=0

Sep 2002
República de California

2³·3²·163 Posts

Quote:
 Originally Posted by kriesel And furthermore, on the OpenCl side, attempts were made and performance tests made. See gpuowl v1.8 and v1.9's M31 and M61 transforms etc https://www.mersenneforum.org/showpo...&postcount=224
Did the gpuowl author actually try an SP-FFT, or just the small-mersenne-prime-mod FGTs? It's a crucial distinction, because the latter involve a *lot* of integer-instruction overhead for double-width intermediate products (64-bits for M31, 128-bits for M61) and for modding of intermediates, whereas an SP-FFT is just like a DP one, merely at a lot lower precision, i.e. needing the pure-integer input vector storing the LL-test residue to be divided into many smaller words than in the DP case.

Edit: OK, I see preda (gpuowl author) says he tried SP but found it to be 'useless' - That needs a bit more digging, IMO. It's certainly possible that my above random-walk heuristic somehow fails to capture what happens at such lower precisions, but it's sufficiently general that such a gross mismatch between theory and practice seems hard to fathom.

I wonder if there's a simple way to modify e.g. the Mlucas scalar-DP-build FFT to act as SP, perhaps by fiddling the x86 hardware rounding mode, i.e. the code still uses doubles to hold instruction inputs and outputs, but the CPU is set to convert all instruction results from DP to SP before writing back to memory.

Last fiddled with by ewmayer on 2018-12-22 at 21:23

2018-12-22, 21:26   #7
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

19B4₁₆ Posts

Quote:
 Originally Posted by ewmayer That assumes they made a serious effort to do so - are there any threads describing such? Also - have any of the GPU coders actually written the FFT infrastructure code, or are all the GPU LL programs using library FFTs? ... Here are some actual estimated-FFT-length numbers, as given by the utility function I use to set FFT-length breakpoints, based on ROE-as-multidimensional-random-walk heuristics I developed for the F24 paper (which also discusses the choice of asymptotic constant, which is expected to be O(1) and whose precise choice affects the aggressiveness of FFT-length settings, immaterial for the present 'big picture' discussion). Here is a simple *nix bc function which implements same (invoke bc in floating-point mode, as 'bc -l'): Code: define maxp(bmant, n, asympconst) { auto ln2inv, ln_n, lnln_n, l2_n, lnl2_n, l2l2_n, lnlnln_n, l2lnln_n, wbits; ln2inv = 1.0/l(2.0); ln_n = l(1.0*n); lnln_n = l(ln_n); l2_n = ln2inv*ln_n; lnl2_n = l(l2_n); l2l2_n = ln2inv*lnl2_n; lnlnln_n = l(lnln_n); l2lnln_n = ln2inv*lnlnln_n; wbits = 0.5*( bmant - asympconst - 0.5*(l2_n + l2l2_n) - 1.5*(l2lnln_n) ); return(wbits*n); } For example, comparing DP and SP for the FFT length of the current GIMPS wavefront, and rounding the outputs of the function: DP: maxp(53, 4608*2^10, 0.4) = 87540871 SP: maxp(24, 4608*2^10, 0.4) = 19121287 so SP would allow exponents ~4.5x smaller than DP at this FFT length. Flipping things around and asking what SP FFT length is needed to handle current-wavefront exponents gives 24576K = 24M, slightly above 5x the DP FFT length. So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win.
It's my understanding CUDALucas uses NVIDIA's cufft library, which currently tops out at 128M. Extrapolating up from 19,121,287 at a 4.5M FFT length, that would put an upper bound on SP somewhere around 272,000,000 at 64M. (Same bound for CUDAPm1, but about a year less of future use due to the offset in the P-1 vs. primality-testing wavefronts. I defer to ATH's post below for the actual calculation.) I recall seeing a post claiming CUDALucas -cufftbench runs to 128M, but haven't attempted >64M myself to confirm, or gotten an answer whether that was a modified version.
It's my understanding Mihai rolled his own FFTs for gpuowl: v1.8 had SP, DP, M31, and M61.
SP/DP ratios are all over the place, from a small sample of GPU models: 1/2, 1/12, 1/16, 1/32. https://www.mersenneforum.org/showpo...12&postcount=3
It's also possible that SP is of lower or no benefit on the AMD OpenCL GPUs with 1/16 SP/DP that Mihai codes for, and of sufficient benefit on newer CUDA GPUs that are 1/32 SP/DP.

Last fiddled with by kriesel on 2018-12-22 at 22:18 Reason: fix extrapolation (thanks ATH for the catch)

2018-12-22, 21:46   #8
ATH
Einyen

Dec 2003
Denmark

3×11×101 Posts

Quote:
 Originally Posted by kriesel It's my understanding CUDALucas uses NVIDIA's cufft library, which currently tops out at 128M. Extrapolating up from 19121287 / 24M, that would put an upper bound on SP somewhere around 102,000,000, maybe not worth attempting since in a couple years it would be too small for the wavefront.
You misunderstood: The 19M exponent limit was for 4608K FFT, while the 24M FFT was required for the current 8xM wavefront.

Using the code Ernst posted I get 84,836,115 as the upper limit for SP 24M FFT.

Even though CUDALucas has FFT lengths up to 128M, we do not know if that would also be the limit for an SP FFT, but if it is, the code gives 362,166,717 as the maximum for an SP 128M FFT.

Last fiddled with by ATH on 2018-12-22 at 21:59

2018-12-22, 22:09 #9
ewmayer
∂2ω=0

Sep 2002
República de California

2³·3²·163 Posts

I've decided to spend a few hours over the upcoming holiday week to hack a SP version of Mlucas (just the non-SIMD no-asm build based on C doubles, obviously) - I need to see what happens for myself. It will take more than simply global-replacing 'double' with 'float' in the sources, but shouldn't require all that much more. For example, my current setup inits small O(sqrt(n))-element tables of FFT roots-of-unity and DWT-weights data using quad-float arithmetic and then rounding results to double ... for the SP version I'll need to replace that with double arithmetic and then rounding results to float. I'm more concerned with the possible occurrence of code which contains some kind of 'hidden assumption of DP', but I hope such will be minimal, if it exists.
2018-12-22, 22:10   #10
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

14664₈ Posts

Quote:
 Originally Posted by Neutron3529 Firstly thanks for reading this thread since my English is poor..
Welcome.
Your English seems nearly perfect.
Possibly you'll find the attachment in the second post of https://www.mersenneforum.org/showthread.php?t=23371 useful.
Your GTX 1060 can run LL (CUDALucas), P-1 (CUDAPm1), or trial factoring (mfaktc). Its contribution would probably be maximized by running trial factoring.
Unfortunately there is no CUDA PRP code for Mersenne hunting currently.

2018-12-22, 23:30   #11
chalsall
If I May

"Chris Halsall"
Sep 2002
Barbados

10540₁₀ Posts

Quote:
 Originally Posted by ewmayer I've decided to spend a few hours over the upcoming holiday week to hack a SP version of Mlucas (just the non-SIMD no-asm build based on C doubles, obviously) - I need to see what happens for myself.
Sweet! You are one of the most qualified to do this analysis. Sincerely.

Will you be taking into consideration the available compute, and its SP/DP ratios?
