2021-05-25, 03:46  #45 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1010100111001_{2} Posts 
Way back on V1.9 you had implemented 4 transforms, and I did some testing and determined for the fft M61 NTT transform, size 4M, a limit of about 19.996 bits/word, providing about 8% higher exponent reach. https://www.mersenneforum.org/showpo...31&postcount=8
It was, however, considerably slower than the DP transform: 18.9 ms/iter vs 10.9 ms/iter on an RX550, 1.734 times the clock time. It was a little faster to run 4M M61 than 8M DP, for a narrow range of exponents, in V1.9. The DP transforms have gotten much faster and more numerous over time. https://www.mersenneforum.org/showpo...35&postcount=2 On a Radeon VII, V1.9 4M DP takes 0.84 msec, while V6.11-357 4M takes 0.579 msec/iter, ~45% faster, so an estimated 2.5x as fast as the V1.9 M61. (All timings on Windows 7 for the RX550 or Windows 10 for the Radeon VII.) It's not clear how a 2^64-2^32+1 NTT gets 25 bits/word usable, when M61 got ~20 bits/word usable. Or how it could be so much faster as to be competitive. Maybe there are some good reasons in favor of those, which you could elaborate. 
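As a sanity check on those figures, assuming the usual rule of thumb that the maximum testable exponent scales as bits/word times the FFT length in words, the ~8% reach difference follows directly (the ~18.59 bits/word DP figure is what a 77973559 exponent at 4M works out to):

```python
# Back-of-envelope check: max testable exponent ~ bits/word * FFT length.
N = 4 * 1024 * 1024            # 4M FFT length in words

bpw_m61 = 19.996               # measured M61 NTT limit quoted above
bpw_dp = 77973559 / N          # ~18.59 bits/word for the DP transform

reach_gain = bpw_m61 / bpw_dp - 1
print(f"{reach_gain:.1%}")     # ~7.6%, i.e. "about 8% higher exponent reach"
```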
2021-05-25, 06:41  #46  
"Mihai Preda"
Apr 2015
3·457 Posts 
Another important element that enables 25 bits is that now I'm using "balanced" words (signed, with equal probability positive/negative), while when doing the M61 a while back I was using non-balanced words (all positive). This affects the magnitude of the convolution output words in two ways: first, being balanced, the magnitude of the input words is reduced by 1 bit; second, when summing many balanced words, the magnitude of the sum grows with the square root of the number of summands, while when summing unbalanced words, the magnitude grows linearly.

Anyway, looking at 25 usable bits: because of balanced words, each input word has a max magnitude of 24 bits (plus the sign bit). Squaring, we get a max magnitude of the squared words of 48 bits. Doing a convolution of length 2^22 (4M), experimentally we see a contribution of 64-48 == 16 bits from the convolution. This value (16) is in the ballpark of what we see with usual complex FFTs of the same size.

As for how to get it as fast as FP64, I don't know; I'll have to see. Theoretically it's not impossible, but we'll see how it works in practice. 
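A minimal sketch of the two effects described above (the helper names here are illustrative, not gpuowl's): folding digits from [0, 2^B) into a balanced range saves one bit of magnitude, and a quick experiment shows the square-root vs. linear growth of a convolution output word:

```python
import random

def to_balanced(n, B, ndigits):
    """Split n into balanced base-2^B digits (illustrative helper)."""
    base, half = 1 << B, 1 << (B - 1)
    digits = []
    for _ in range(ndigits):
        d = n % base
        n //= base
        if d >= half:          # fold [half, base) down to [-half, -1]...
            d -= base
            n += 1             # ...and carry the borrow upward
        digits.append(d)
    return digits

def from_balanced(digits, B):
    return sum(d << (i * B) for i, d in enumerate(digits))

x = 123456789
ds = to_balanced(x, 5, 7)
assert from_balanced(ds, 5) == x
assert all(-(1 << 4) <= d < (1 << 4) for d in ds)   # B-1 bits plus sign

# One output word of a length-N convolution sums N products; balanced
# summands mostly cancel (~sqrt(N) growth), unbalanced ones add up (~N).
random.seed(1)
N, B = 1024, 8
bal = [random.randint(-(1 << (B - 1)), (1 << (B - 1)) - 1) for _ in range(N)]
unb = [random.randint(0, (1 << B) - 1) for _ in range(N)]
s_bal = abs(sum(bal[i] * bal[N - 1 - i] for i in range(N)))
s_unb = sum(unb[i] * unb[N - 1 - i] for i in range(N))
assert s_bal < s_unb           # balanced sum is far smaller in magnitude
```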

2021-05-25, 06:50  #47  
"Mihai Preda"
Apr 2015
3·457 Posts 
2021-05-26, 20:33  #48 
"Mihai Preda"
Apr 2015
10101011011_{2} Posts 
I did some preliminary performance measurements, and the initial results are a bit disappointing: the NTT is about 3x slower than the equivalent FP64. It seems the code is compute-bound now (vs. memory-bound), and the main problem may be the frequent modular reduction (normalization) that is part of every ADD and SUB. I'll keep at it.
Last fiddled with by preda on 2021-05-26 at 20:33 
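For context on why every ADD/SUB carries that reduction: with p = 2^64 - 2^32 + 1, a 64-bit add can overflow, and the carry has to be folded back in using 2^64 ≡ 2^32 - 1 (mod p). A sketch of the general technique in Python, with masking standing in for uint64 wraparound (this is the standard approach for this prime, not gpuowl's actual kernel code):

```python
P = (1 << 64) - (1 << 32) + 1
MASK = (1 << 64) - 1

def mod_add(a, b):
    """a + b mod P for a, b in [0, P), using only 64-bit-style operations."""
    s = (a + b) & MASK               # wrapped 64-bit sum
    if (a + b) >> 64:                # carry out: fold 2^64 ≡ 2^32 - 1 (mod P)
        s += (1 << 32) - 1           # cannot overflow again (s <= 2^64 - 2^33)
    if s >= P:                       # at most one conditional subtraction
        s -= P
    return s

def mod_sub(a, b):
    return mod_add(a, P - b if b else 0)

for a, b in [(P - 1, P - 1), (P - 1, 1), (0, 0), (123456789, P - 2)]:
    assert mod_add(a, b) == (a + b) % P
    assert mod_sub(a, b) == (a - b) % P
```

The point visible even in this sketch: every add needs a carry check plus a compare-and-subtract, work that a plain FP64 add simply doesn't have.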
2021-05-26, 22:14  #49 
If I May
"Chris Halsall"
Sep 2002
Barbados
2×67×73 Posts 
I've been watching your work. I suspect many others have been as well.
Even if the code paths prove not to be profitable, it will be only because you've proven they aren't. Few understand how important this kind of work is. Thank you for doing this. 
2021-05-27, 00:41  #50 
P90 years forever!
Aug 2002
Yeehaw, FL
7,541 Posts 
What hardware? An NVIDIA card with 1/32 FP64 vs a Radeon VII with 1/4 FP64 performance ought to have vastly different break-even points.

2021-05-27, 00:50  #51  
∂^{2}ω=0
Sep 2002
República de California
2D7F_{16} Posts 
2021-05-27, 02:58  #52 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3·1,811 Posts 
Thank you for post 46.
Gpuowl V1.9 fft M61 ~20 bits/word vs DP ~18+bits/word, M61 runtime ~1.73 times longer, for same FFT length in words, on low ratio 1/16 1/4 SP/DP AMD GPUs. But M61 with a wider selection of fft lengths would allow using shorter faster fft lengths for the same exponent, so ~1.73 x 0.92 = ~1.6:1 disadvantage in that case. For the current work, 2^642^32+1 NTT ~25 bits/word, vs. DP 18.6 bits/word; the NTT could use significantly shorter FFT lengths for the same exponent. For a ~78M exponent DC, ~18/25 x 4M DP ~3M fft NTT length. So the samelength NTT could be 33% slower than DP and be a tie for the speed on the same exponent. It's not clear to me whether Preda has already allowed for that in his ~3:1 speed disadvantage, or on what hardware that was (CPU?). NTT may still be slower than DP on AMD GPUs or NVIDIA Tesla GPUs. But on those DPlamed consumer NVIDIA GPUs; RTX2080 10.07 TFLOPS FP32, 314.6 GFLOPS FP64 (1:32 performance ratio); ? INT32? https://www.techpowerup.com/gpuspec...rtx2080.c3224 TU104 3072 INT32 units, 96 FP64 units (48:1 unit ratio) https://www.techpowerup.com/gpuspecs/nvidiatu104.g854 The RTX2080 has ~2.53 x the TF performance in mfaktc (INT32, SP) of the GTX1080, but about 10% less in CUDALucas or presumably Gpuowl (DP). NVIDIA Tesla P100 PCIe FP32 9.526 TFLOPS, FP64 4.763 TFLOPS (1:2), INT32? https://www.techpowerup.com/gpuspec...ie16gb.c2888 (lower memory bandwidth than the Radeon VII though, 732 vs 1024 GB/s) Radeon VII FP32 13.44 TFLOPS, FP64 3.36 TFLOPS (1:4), INT32? https://www.techpowerup.com/gpuspecs/radeonvii.c3358 Unfortunately Gpuowl V1.9 won't run on an RTX2080, so can't get comparison runs in V1.9 on that GPU or similar for DP vs M61: Code:
gpuOwL v1.9 GPU Mersenne primality checker
GeForce RTX 2080, 46x1710MHz
OpenCL compilation in 4227 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77973559u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 "
PRP-3: FFT 4M (1024 * 2048 * 2) of 77973559 (18.59 bits/word)
[2021-05-26 21:07:15 Central Daylight Time] Starting at iteration 0
error 5 (carryConv)
Assertion failed!
Program: C:\Users\user\Documents\gpuowl v1.9\gpuowl.exe
File: clwrap.h, Line 230
Expression: check(clEnqueueNDRangeKernel(queue, kernel, 1, __null, &workSize, &groupSize, 0, __null, __null), name.c_str())
Code:
C:\Users\user\Documents\gpuowl v1.9>gpuowl -device 0 -size 4M -fft M61
gpuOwL v1.9 GPU Mersenne primality checker
GeForce RTX 2080, 46x1710MHz
OpenCL compilation in 9824 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77973559u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u "
Note: using long carry kernels
PRP-3: FFT 4M (1024 * 2048 * 2) of 77973559 (18.59 bits/word)
[2021-05-26 21:08:42 Central Daylight Time] Starting at iteration 0
error 5
Assertion failed!
Program: C:\Users\user\Documents\gpuowl v1.9\gpuowl.exe
File: clwrap.h, Line 234
Expression: check(clEnqueueReadBuffer(queue, buf, blocking, start, size, data, 0, __null, __null))

Last fiddled with by kriesel on 2021-05-27 at 03:20 
2021-05-27, 07:03  #53 
"Mihai Preda"
Apr 2015
3·457 Posts 

2021-06-02, 13:33  #54 
"Mihai Preda"
Apr 2015
2533_{8} Posts 
In the same setup (FFT 4M, Radeon VII, exponent around 107M) I gained about 33% performance by tweaking (with inline assembly) the low-level modular primitives (add, sub, mul). So now the performance is only 2x as bad as the DP of the same exponent. The kernels may be compute-bound at the moment. Now I need to look into some higher-level optimizations.
The code is in the NTT2 branch of gpuowl on github. It's not useful for anything yet; it's more like a feasibility study at this point. https://github.com/preda/gpuowl/blob/NTT2/gpuowl.cl 
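For reference, what makes this particular prime attractive despite the add/sub reduction cost is that a full 128-bit product reduces with a few shifts and adds, since 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p). A Python sketch of that reduction (the general technique for this prime, not the actual inline-assembly version):

```python
P = (1 << 64) - (1 << 32) + 1

def mod_mul(a, b):
    """a * b mod P via the split hi/lo reduction for p = 2^64 - 2^32 + 1."""
    x = a * b                            # 128-bit product
    lo = x & ((1 << 64) - 1)             # bits 0..63
    hi_lo = (x >> 64) & ((1 << 32) - 1)  # bits 64..95:  weight 2^64 ≡ 2^32 - 1
    hi_hi = x >> 96                      # bits 96..127: weight 2^96 ≡ -1
    r = lo + (hi_lo << 32) - hi_lo - hi_hi
    return r % P                         # final fold (adds/compares on a GPU)

for a, b in [(P - 1, P - 1), (1 << 63, 1 << 63), (12345, 67890)]:
    assert mod_mul(a, b) == (a * b) % P
```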
2021-06-02, 13:58  #55 
"Mihai Preda"
Apr 2015
3·457 Posts 
There's one more interesting factoid about the difference between "classic" FFT and NTT:
When working with complex numbers in the classic FFT, the inverse transform is equal to the conjugate of the forward transform applied to the conjugate of the data, i.e. iFFT(x) == conjugate(dFFT(conjugate(x))), and conjugate() is a local operation. So in the classic setup, we only need to implement one transform that serves both as forward and inverse. The advantage of this is that there is only one set of precomputed twiddle tables, shared by the direct and the inverse FFT.

The NTT, OTOH, does not have a notion of "complex conjugate" anymore. We can still implement the inverse in terms of the forward transform, but we need to "flip" the data words around in a way which is non-local, thus costly (costly on a GPU, at least). Rather than doing that complicated "words-flip", I chose to implement two transforms, direct and inverse, which unfortunately don't share the twiddle tables anymore. (The "flip" is this: mirror[i] = word[(N-i) mod N], for i := 0 to N-1.) 
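The flip identity can be checked with a toy transform; the parameters below (p = 17, N = 4, ω = 4) are chosen purely for illustration and have nothing to do with gpuowl's actual field:

```python
# Toy check of the flip identity: iNTT(X) == NTT(flip(X)) / N (mod p),
# where flip(x)[i] = x[(N - i) % N].

p, N, w = 17, 4, 4            # 4 has order 4 mod 17 (4^2 = 16 ≡ -1)

def ntt(x, root):
    """Naive forward transform: X[k] = sum_j x[j] * root^(j*k) mod p."""
    return [sum(x[j] * pow(root, j * k, p) for j in range(N)) % p
            for k in range(N)]

def intt(X):
    """Textbook inverse transform, for comparison."""
    inv_n, w_inv = pow(N, -1, p), pow(w, -1, p)
    return [(inv_n * sum(X[k] * pow(w_inv, j * k, p) for k in range(N))) % p
            for j in range(N)]

def flip(x):
    return [x[(N - i) % N] for i in range(N)]

x = [3, 1, 4, 1]
X = ntt(x, w)
inv_n = pow(N, -1, p)
via_flip = [(inv_n * v) % p for v in ntt(flip(X), w)]
assert via_flip == intt(X) == x   # forward transform of flipped data inverts
```

The non-locality is visible in flip(): element i must fetch element (N - i) mod N, a gather pattern that doesn't map onto the contiguous memory accesses GPU transform kernels are built around.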