mersenneforum.org The future is 24bit

2021-05-25, 03:46 #45
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1010100111001₂ Posts

Way back in V1.9 you implemented four transforms, and I did some testing and determined that the -fft M61 NTT transform, at size 4M, has a limit of about 19.996 bits/word, providing about 8% higher exponent reach. https://www.mersenneforum.org/showpo...31&postcount=8

It was, however, considerably slower than the DP transform: 18.9 ms/iter vs 10.9 ms/iter on an RX550, or 1.734 times the clock time. For a narrow range of exponents in V1.9, running 4M M61 was a little faster than 8M DP. The DP transforms have become much faster and more numerous over time. https://www.mersenneforum.org/showpo...35&postcount=2

On a Radeon VII, V1.9 4M DP takes 0.84 msec/iter, while V6.11-357 4M takes 0.579 msec/iter, ~45% faster, so an estimated 2.5x as fast as the V1.9 M61. (All timings on Windows 7 for the RX550, or Windows 10 for the Radeon VII.)

It's not clear how a 2^64-2^32+1 NTT gets 25 bits/word usable when M61 got ~20 bits/word usable, or how it could be so much faster as to be competitive. Maybe there are some good reasons in favor of those, which you could elaborate on.
2021-05-25, 06:41   #46
preda

"Mihai Preda"
Apr 2015

3·457 Posts

Quote:
 Originally Posted by kriesel It's not clear how a 2^64-2^32+1 NTT gets 25 bits/word usable, when M61 got ~20 bits/word usable. Or how it could be so much faster as to be competitive. Maybe there are some good reasons in favor of those, which you could elaborate on.
There is a difference of 3 bits between M61 and the 64-bit prime above, which explains 3/2 = 1.5 bits of the difference (squaring doubles the word size, so each input word gets half of the extra modulus bits).

Another important element that enables 25 bits is that I'm now using "balanced" words (signed, with equal probability of being positive or negative), while back when doing the M61 work I was using non-balanced words (all positive). This affects the magnitude of the convolution output words in two ways: first, being balanced, the magnitude of the input words is reduced by 1 bit. Second, when summing many balanced words, the magnitude of the sum grows with the square root of the number of summands, while when summing unbalanced words it grows linearly.

Anyway, looking at 25 usable bits:
- Because of balanced words, each input word has a max magnitude of 24 bits (plus the sign bit). Squaring, we get a max magnitude of 48 bits for the squared words. Doing a convolution of length 2^22 (4M), experimentally we see a contribution of 64 - 48 == 16 bits from the convolution. This value (16) is in the ballpark of what we see with the usual complex FFTs of the same size.
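The square-root vs linear growth claim above can be checked numerically. Below is a toy Python sketch; the 8-bit words and 2^16 summands are made-up small stand-ins for gpuowl's 24-bit words and 4M convolution length, chosen only so the script runs quickly.

```python
import random

# Toy check of the balanced-vs-unbalanced claim above.
random.seed(1)
n = 1 << 16   # number of summands (stand-in for the convolution length)
bits = 8      # toy word size (stand-in for 24-bit words)

# all-positive ("unbalanced") words in [0, 2^8)
unbalanced = sum(random.randrange(1 << bits) for _ in range(n))
# signed ("balanced") words in [-2^7, 2^7)
balanced = sum(random.randrange(-(1 << (bits - 1)), 1 << (bits - 1))
               for _ in range(n))

# An unbalanced sum needs roughly bits + log2(n) bits of magnitude;
# a balanced sum only roughly bits + log2(n)/2.
print(unbalanced.bit_length(), abs(balanced).bit_length())
```

On typical runs the unbalanced sum lands near 24 bits (8 + 16) while the balanced sum stays well under 18 bits, which is the square-root vs linear difference the post describes.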

About how to get it as fast as FP64, I don't know; I'll have to see. Theoretically it's not impossible, but we'll see how it works in practice.

2021-05-25, 06:50   #47
preda

"Mihai Preda"
Apr 2015

3·457 Posts

Quote:
 Originally Posted by Prime95 When I looked at this many years ago, I played with .... 3a) Float implemented as 128-bit mantissa and an implied exponent (or was it 96-bit mantissas with an implied exponent??) This can reduce the amount of bit shifting. Adds and subtracts require no shifts as all mantissas are using the same implied exponent. Every FFT level (or maybe every few FFT levels), shift all the mantissas, changing the implied exponent. IIRC, on a GTX 580 (or was it a GTX 660?), my testing indicated I'd get roughly the same performance as the cuFFT library using 64-bit floats.
Yes, this is one option I considered. It's tricky to do the "shift every (couple of) FFT levels" well: done the conservative way (shift every level), it's wasteful of the precision bits; done the aggressive way (shift every couple of levels), it fails sometimes; done adaptively (shift only when the actual data requires it), it's complicated and thus slow. But yes, it should be a workable solution as well.
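A minimal Python sketch of the shared-implied-exponent idea quoted above (the names and the shift schedule are invented for illustration; this is not Prime95's code): adds and subtracts become plain integer operations with no per-element shifting, and a single bulk shift every few levels renormalizes the whole array.

```python
# Toy model: each value is mantissa * 2**exp, with ONE exp shared by the
# whole array, so butterflies need no per-element alignment shifts.

def butterfly(a, b):
    # add/sub butterfly: no shifts needed, both operands share the exponent
    return a + b, a - b

def renormalize(mantissas, exp, shift):
    # done once every few FFT levels: shift all mantissas, bump the exponent
    return [m >> shift for m in mantissas], exp + shift

mantissas, exp = [3 << 20, 5 << 20], 0
x, y = butterfly(*mantissas)            # (8 << 20, -(2 << 20))
mantissas, exp = renormalize([x, y], exp, 1)

# the represented values are unchanged by renormalization
assert mantissas[0] << exp == 8 << 20
assert mantissas[1] << exp == -(2 << 20)
```

The tension the reply describes is the `shift` schedule: shifting every level throws away precision bits unnecessarily, while shifting rarely risks mantissa overflow on worst-case data.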

2021-05-26, 20:33   #48
preda

"Mihai Preda"
Apr 2015

10101011011₂ Posts

Quote:
 Originally Posted by preda About how to get it as fast as FP64, I don't know; I'll have to see. Theoretically it's not impossible, but we'll see how it works in practice.
I did some preliminary performance measurements, and the initial results are a bit disappointing: the NTT is about 3x slower than the equivalent FP64. It seems the code is now compute-bound (vs. memory-bound), and the main problem may be the frequent modular reduction (normalization) that is part of every ADD and SUB. I'll keep at it.
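To illustrate the per-operation reduction cost mentioned above, here is a toy Python version of modular add/sub for p = 2^64 - 2^32 + 1, using the standard identity 2^64 ≡ 2^32 - 1 (mod p) to fold the carry. gpuowl's actual GPU kernels are of course different; this only shows the extra compare/fold work attached to every ADD and SUB.

```python
# p = 2^64 - 2^32 + 1, the NTT prime discussed in this thread
P = (1 << 64) - (1 << 32) + 1
MASK64 = (1 << 64) - 1

def add_mod(a, b):
    # inputs assumed already reduced (< P)
    s = a + b
    if s > MASK64:                        # carry out of 64 bits:
        s = (s & MASK64) + (1 << 32) - 1  # fold via 2^64 ≡ 2^32 - 1 (mod P)
    if s >= P:
        s -= P
    return s

def sub_mod(a, b):
    # inputs assumed already reduced (< P)
    return a - b if a >= b else a - b + P

assert add_mod(P - 1, P - 2) == (2 * P - 3) % P
assert sub_mod(0, 1) == P - 1
```

On a GPU each of those conditionals becomes extra ALU work on every butterfly, which is where the compute-bound behavior can come from.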

Last fiddled with by preda on 2021-05-26 at 20:33

2021-05-26, 22:14   #49
chalsall
If I May

"Chris Halsall"
Sep 2002

2×67×73 Posts

Quote:
 Originally Posted by preda I'll keep at it.
I've been watching your work. I suspect many others have been as well.

Even if the code paths prove not to be profitable, it will be only because you've proven they aren't.

Few understand how important this kind of work is. Thank you for doing this.

2021-05-27, 00:41   #50
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7,541 Posts

Quote:
 Originally Posted by preda I did some preliminary performance measurements, and the initial results are a bit disappointing -- the NTT being about 3x slower than the equivalent FP64.
What hardware? An nVidia card with 1/32 FP64 vs a Radeon VII with 1/2 FP64 performance ought to have vastly different breakeven points.

2021-05-27, 00:50   #51
ewmayer
2ω=0

Sep 2002
República de California

2D7F₁₆ Posts

Quote:
 Originally Posted by preda I did some preliminary performance measurements, and the initial results are a bit disappointing -- the NTT being about 3x slower than the equivalent FP64. It seems the code is compute-bound now (vs. memory-bound), and the main problem may be the frequent modular reduction (normalization) that is part of every ADD and SUB. I'll keep at it.
One of the reasons p = M61 is an attractive modulus for Mersenne-mod FGT(p^2) is that one can do several rounds of add/sub butterflies before having to mod-reduce. The wide-MULs of course are the sticking point there.
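A toy Python illustration of that deferred-reduction point (the constants and level counts are mine, not Mlucas internals): with p = 2^61 - 1 held in a 64-bit word there are 3 spare bits, so a few add/sub butterfly levels can run unreduced before a single mod-reduction.

```python
M61 = (1 << 61) - 1  # p = 2^61 - 1

def reduce_m61(x):
    # fold the high bits down using 2^61 ≡ 1 (mod M61)
    x = (x & M61) + (x >> 61)
    return x - M61 if x >= M61 else x

# Two unreduced add levels on reduced inputs: partial sums stay below 2^63,
# well within a 64-bit register (a third level would still fit under 2^64),
# so one reduction at the end suffices.
vals = [M61 - 1, M61 - 2, M61 - 3, M61 - 4]
total = (vals[0] + vals[1]) + (vals[2] + vals[3])
assert reduce_m61(total) == sum(vals) % M61
```

The wide multiplies get no such reprieve: a 61x61-bit product must be reduced before it fits back into a 64-bit word, which is the sticking point the post mentions.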

2021-05-27, 02:58 #52
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3·1,811 Posts

Thank you for post 46.

Gpuowl V1.9 -fft M61 gives ~20 bits/word vs DP's ~18+ bits/word, with M61 runtime ~1.73 times longer for the same FFT length in words, on low-ratio (1/16 to 1/4 SP/DP) AMD GPUs. But M61 with a wider selection of FFT lengths would allow using shorter, faster FFT lengths for the same exponent, for a ~1.73 x 0.92 = ~1.6:1 disadvantage in that case.

For the current work, the 2^64-2^32+1 NTT gives ~25 bits/word vs. DP's 18.6 bits/word, so the NTT could use significantly shorter FFT lengths for the same exponent. For a ~78M exponent DC, ~18/25 x 4M DP ≈ 3M NTT length. So a same-length NTT could be 33% slower than DP and still tie on speed for the same exponent. It's not clear to me whether Preda has already allowed for that in his ~3:1 speed disadvantage, or what hardware that was on (CPU?). NTT may still be slower than DP on AMD GPUs or NVIDIA Tesla GPUs. But consider the DP-lamed consumer NVIDIA GPUs:

RTX 2080: 10.07 TFLOPS FP32, 314.6 GFLOPS FP64 (1:32 performance ratio); INT32?
https://www.techpowerup.com/gpu-spec...rtx-2080.c3224
TU104: 3072 INT32 units, 96 FP64 units (32:1 unit ratio)
https://www.techpowerup.com/gpu-specs/nvidia-tu104.g854

The RTX 2080 has ~2.5-3x the TF performance in mfaktc (INT32, SP) of the GTX 1080, but about 10% less in CUDALucas or presumably Gpuowl (DP).

NVIDIA Tesla P100 PCIe: FP32 9.526 TFLOPS, FP64 4.763 TFLOPS (1:2); INT32?
https://www.techpowerup.com/gpu-spec...ie-16-gb.c2888
(lower memory bandwidth than the Radeon VII though, 732 vs 1024 GB/s)

Radeon VII: FP32 13.44 TFLOPS, FP64 3.36 TFLOPS (1:4); INT32?
https://www.techpowerup.com/gpu-specs/radeon-vii.c3358

Unfortunately Gpuowl V1.9 won't run on an RTX 2080, so I can't get V1.9 comparison runs of DP vs M61 on that GPU or similar:

Code:
gpuOwL v1.9- GPU Mersenne primality checker
GeForce RTX 2080, 46x1710MHz
OpenCL compilation in 4227 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77973559u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 "
PRP-3: FFT 4M (1024 * 2048 * 2) of 77973559 (18.59 bits/word)
[2021-05-26 21:07:15 Central Daylight Time] Starting at iteration 0
error -5 (carryConv)
Assertion failed!
Program: C:\Users\user\Documents\gpuowl v1.9\gpuowl.exe
File: clwrap.h, Line 230
Expression: check(clEnqueueNDRangeKernel(queue, kernel, 1, __null, &workSize, &groupSize, 0, __null, __null), name.c_str())

Code:
C:\Users\user\Documents\gpuowl v1.9>gpuowl -device 0 -size 4M -fft M61
gpuOwL v1.9- GPU Mersenne primality checker
GeForce RTX 2080, 46x1710MHz
OpenCL compilation in 9824 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77973559u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u "
Note: using long carry kernels
PRP-3: FFT 4M (1024 * 2048 * 2) of 77973559 (18.59 bits/word)
[2021-05-26 21:08:42 Central Daylight Time] Starting at iteration 0
error -5
Assertion failed!
Program: C:\Users\user\Documents\gpuowl v1.9\gpuowl.exe
File: clwrap.h, Line 234
Expression: check(clEnqueueReadBuffer(queue, buf, blocking, start, size, data, 0, __null, __null))

I'd welcome the chance to run some new NTT test code on that GPU, temporarily interrupting production TF runs, or on an RTX 2080 Super, GTX 1650, or GTX 1650 Super, all 1/32 SP/DP. (But they're all running Windows currently, so source and a make win target would be good, whenever that makes sense.)

Last fiddled with by kriesel on 2021-05-27 at 03:20
2021-05-27, 07:03   #53
preda

"Mihai Preda"
Apr 2015

3·457 Posts

Quote:
 Originally Posted by Prime95 What hardware? An nVidia card with 1/32 FP64 vs a Radeon VII with 1/2 FP64 performance ought to have vastly different breakeven points.
I was testing on a Radeon VII.

2021-06-02, 13:33 #54
preda

"Mihai Preda"
Apr 2015

2533₈ Posts

In the same setup (FFT 4M, Radeon VII, exponent around 107M) I gained about 33% performance by tweaking the low-level modular primitives (add, sub, mul) with inline assembly. So now the performance is only 2x as bad as DP for the same exponent. The kernels may still be compute-bound at the moment. Next I need to look into some higher-level optimizations. The code is in the NTT2 branch of gpuowl on github. It's not useful for anything yet; it's more of a feasibility study at this point. https://github.com/preda/gpuowl/blob/NTT2/gpuowl.cl
2021-06-02, 13:58 #55
preda

"Mihai Preda"
Apr 2015

3·457 Posts

There's one more interesting factoid about the difference between the "classic" FFT and the NTT: when working with complex numbers in the classic FFT, the inverse transform equals the conjugate of the forward transform applied to the conjugate of the data, i.e. iFFT(x) == conjugate(dFFT(conjugate(x))), and conjugate() is a local operation. So in the classic setup we only need to implement one transform, which serves as both forward and inverse. The advantage is that there is only one set of precomputed twiddle tables, shared by the direct and the inverse FFT.

The NTT, OTOH, has no notion of "complex conjugate" anymore. We can still implement the inverse in terms of the forward transform, but we need to "flip" the data words around in a way that is non-local, thus costly (costly on a GPU, at least). Rather than doing that complicated words-flip, I chose to implement two transforms, direct and inverse, which unfortunately no longer share the twiddle tables. (The "flip" is: mirror[i] = word[(N - i) mod N], for i := 0 to N-1.)
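Both identities in that post can be checked at toy scale in Python. The sizes, the prime 257, and the root of unity 16 below are my own toy choices (16^2 ≡ -1 mod 257, so 16 is a 4th root of unity), not gpuowl's parameters.

```python
import cmath

# (1) Classic FFT: iFFT(x) == conjugate(FFT(conjugate(x))) / N,
#     so one transform plus a local conjugate serves both directions.
def dft(x):
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n)
                for j in range(n)) for i in range(n)]

def idft(x):
    n = len(x)
    return [sum(x[j] * cmath.exp(2j * cmath.pi * i * j / n)
                for j in range(n)) / n for i in range(n)]

x = [1 + 2j, 3 - 1j, 0.5j, -2 + 0j]
via_conj = [v.conjugate() / len(x) for v in dft([v.conjugate() for v in x])]
assert all(abs(a - b) < 1e-9 for a, b in zip(idft(x), via_conj))

# (2) NTT: no conjugate exists, but the inverse is the forward transform of
#     the mirrored data mirror[i] = word[(N - i) mod N] (a non-local
#     shuffle), scaled by 1/N.
P, W, N = 257, 16, 4

def ntt(a):
    return [sum(a[j] * pow(W, i * j, P) for j in range(N)) % P
            for i in range(N)]

a = [5, 7, 11, 13]
fwd = ntt(a)
mirror = [fwd[(N - i) % N] for i in range(N)]
inv = [(v * pow(N, -1, P)) % P for v in ntt(mirror)]
assert inv == a
```

The flip works because reindexing by (N - j) mod N turns the forward twiddles W^(ij) into W^(-ij), exactly the inverse transform's twiddles; the cost on a GPU comes from that reindexing being a non-local memory pattern.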
