mersenneforum.org NTT faster than FFT?
 Register FAQ Search Today's Posts Mark Forums Read

 2021-07-18, 05:22 #23 moytrage   Jul 2021 3·11 Posts That's why I implemented both FFT and NTT from scratch by myself with those skills that I have. I implemented both so that to have relative comparison. Both my variants are average-optimized so theirs comparison is quite fair. Also theirs inner loops and formulas look very much the same. My FFT uses AVX1 SIMD to do DOUBLE multiplies. And NTT uses non-SIMD 64x64->128 multiply (built-in in CPU). The reason I see why my FFT is 10x slower than my NTT is because NTT uses full 64-bit words (actually around 60-bit) for computation, because NTT doesn't suffer from error overflow, there is no error. While my FFT uses splitting into very small words of 3,4-bits size. So 64/3 = 21x difference in word size. If anyone can correct me - if I used wrong thechnique of FFT with splitting into 3-bit words? Is there any other way to solve task? This 3-bit splitting was done to keep final error within bounds of 0.05-0.1. 3-bit is for 2^25 bytes input numbers sizes. Of cause for smaller sizes like 2^20 I use 7-8 bits. This 3 bits were chosen based on experimental results for random inputs of different sizes, table was built to achieve this. NTT doesn't suffer from cumulative error. Instead it uses Chinese Reminder Theorem (CRT) to restore higher precision. In CRT I used 3 primes, because NTT with 64-bit words needs around 64 + 64 + 64 = 192-bit final result meaning 3 primes of 64-bits each are needed in CRT. So this gives 3x slow down of NTT, but still NTT is 10x faster than my FFT. So my conclusion was that if for my middle-optimized implementations of both FFT and NTT with almost same algorithms and loops are different in 10x times then I could imagine that if Prime95 gurus implemented NTT with same high optimizations in my like they did in GWNUM then it could happen that imaginary Prime95's NTT could be faster than Prime95's FFT, that's why I created this discussion thread initially. Also people say that NTT is much slower due to lots of modular divisions. But I don't have any divisions in my code at all, only multiplications. Because I used Barrett and Montgomery reducing which allowed replacing all divisions with multiplicaions (and shift/add). Last fiddled with by moytrage on 2021-07-18 at 05:23
2021-07-18, 05:35   #24
moytrage

Jul 2021

3·11 Posts

Quote:
 Originally Posted by Prime95 Fast implementations load up the L2 and L3 caches with data, do as much work as possible, then write that back to slow memory.
Basically exactly this is done in my code of both FFT and NTT. I have input array A and no recursion. Outer loop runs Log(N) times and inner loop goes through A sequentially from start to finish. And no recursion. To achieve no recursion input array is permuted. So I have Log(N) sequential reads/writes of my A array, hence I conclude that my algorithm is already quite cache friendly.

2021-07-18, 05:55   #25
moytrage

Jul 2021

3×11 Posts

Quote:
 Originally Posted by Mysticial Computationally, FFTs are much faster than NTT.
I don't understand why NTT should be much slower computationally-wise. People say it is slow due to lots of division in modular reduction. But I used Barrett and Montgomery reductions everywhere which allowed me to use no divisions at all, only mult/add operations.

So after Barrett+Montomerry reduction I don't see the reason why FFT should be much faster. Both of them use SIMD and/or built-in CPU multiplications and that's it. So both algorithms look to me about the same. More than that they both have almost the same code, loops and algorithms. I can't see real difference.

Last fiddled with by moytrage on 2021-07-18 at 06:16

2021-07-18, 06:51   #26
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·11·269 Posts

If your NTT (which NTT?) is 9.8 sec/iter and your FFT is 10-15 times slower than your NTT, we'll call it 12, that's ~118. sec/iter for moytrage FFT, while prime95 you timed as 0.34 sec. That's a ratio of ~346. between FFT implementations.

IBDWT vs. padded FFT is what, 2:1 based on length for padding? Bits/word in DP FFT is typ 17-19 in prime95 or other established code (CUDALucas, Mlucas, Gpuowl) depending on FFT length, not 3-4, so that would be ~4-6 fold. Memory or cache efficiency probably gets killed by the low bits/word, and perhaps the usage patterns as Prime95 described. He's done a LOT of optimization work over the decades.

The integer based code I thoroughly tested for exponent limit due to iteration error in Gpuowl 1.9 was good for ~20 bits/word, in M61 Fast Galois transform 4M length. It was SP or M31 that was expected to provide ~3-5 bits/word. Author's description at V1.8: https://mersenneforum.org/showpost.p...postcount=224; V1.9 exponent M61 transform limit test results at https://www.mersenneforum.org/showpo...31&postcount=8

I don't know why you're only able to use 3-4 bits/word in your DP based FFT and yet see .07 -.1 roundoff error at that. Other software goes up to ~0.3-0.4 roundoff error or slightly higher for performance with many more bits/word for DP based FFTs. Usable bits/word versus FFT length are given in a column on the right of some of the more recent processor-specific benchmarks in the prime95 reference thread.

If you plan to turn your code into a prime testing program, reading background on the development of Gpuowl, Mlucas, CUDALucas, prime95 could be enlightening. There are forum threads for most if not all.

Quote:
 Originally Posted by moytrage But the point is in different thing - not to proof for myself that algorithm gives correct multiplication results, but just to compare speeds, relative speed of FFT against NTT.
If the code is wrong, even rarely, relative speed does not much matter in an application as demanding as primality testing. Every branch, every case, needs well designed test inputs.

2021-07-18, 07:16   #27
moytrage

Jul 2021

3·11 Posts

Quote:
 Originally Posted by kriesel I don't know why you're only able to use 3-4 bits/word in your DP based FFT.
Experimentally I got following table:

Code:
{
{6, 22}, {7, 22}, {8, 20}, {9, 19}, {10, 20}, {11, 17}, {12, 16}, {13, 14}, {14, 13}, {15, 12},
{16, 11}, {17, 10}, {18, 9}, {19, 8}, {20, 7}, {21, 5}, {22, 4}, {23, 3}, {24, 3},
}
First number is power of count of uint-64 words, e.g. 2^22 U64 words (2^25 bytes) correspond to 4 bits. Second number is number bits in a word that are needed to achieve 0.1 error.

These bits I got when tried to limit error below 0.1. So this number of bits is the largest number of bits that still give error below 0.1 All errors measured experimentally on random input numbers. Of cause if I limit error to 0.4 then of cause I get more bits.

I just thought limiting to 0.4 is not very safe. Of cause if some extra errors checking things are done then 0.4 is also allowable.

The only reason I can think of why error would be higher than in Prime95 is because maybe of using -ffast-math switch when compiling. It makes math faster but less accurate. Also I use SIMD which is less accurate than built-in CPU multiplication.

Do I understand correctly that Prime95 somehow for some reason uses 17-19 bits with error of 0.4? Does it use DOUBLE-64 computations? Not 128 or 256 bit floats? Not long-double 80 bit floats?

Of cause if I use 18 bits instead of 3 bits then I'll be 6x times faster. How came that Prime95 managed to use 18 bits with 0.4 error?

Actually I didn't want to make any Primality-proof program out of my code, especially if it is dozens times slower than existing Prime95. I only implemented FFT and NTT from scratch just for educational purpose because always wanted to learn what FFT/NTT is.

If somebody quite clever already tested that NTT is much slower than IBDWT for same level of optimizations then I'm quite happy, that was my only concern when starting this thread, concern "if" Prime95 could become faster.

If everybody here says that NTT is inherintly slower than FFT, then I could understand why FFT was chosen for Prime95. I just can't understand - both FFT and NTT uses just multiplications (and few additions), taking into account that all NTT's divisions can be replaced by multiplications of Barrett and Montgomery reductions, then why FFT is always considered faster? Both are just multiplications, around same number of operations O(Log(N) * N), then both should be about the same. Integer multiplications is a bit slower than DOUBLE, but NTT can use almost all 64 bits for computational words, while FFT only 17-19 bits as in Prime95.

Last fiddled with by moytrage on 2021-07-18 at 07:25

2021-07-18, 07:39   #28
moytrage

Jul 2021

3×11 Posts

Quote:
 Originally Posted by kriesel Bits/word in DP FFT is typ 17-19 in prime95.
So I rerun my bits evaluation part and found out that for error 0.4 my code needs 8-bits, to remind for error 0.1 it needed 4 bits. Sure 8-bit words will be 2 times faster than 4-bit word.

How come that 18 bits can be used I can't see if I do just DOUBLE-64 bit operations everywhere, not float-128 or float-256.

Maybe I did some cruel mistake somewhere like doing multiplications/additions in some very wrong order that reduces precision a lot.

BTW does anyone know exact formula of error? If two large numbers are pseudo-random what is the formula for error based on numbers size and number of bits in a word?

Also do I understand correctly that if I have K-bit word and N such words then it means that I need at least float type that can hold (K + K + Log2(N)) bits of mantissa? Because when doing polynomials multiplication, each polynomial of degree N and each coefficient having K bits then multiplying these two polynomials has each coefficient of size not more than (K + K + Log2(N)) bits, because each final coefficient's value is at most N * (2^K)^2 after opening parenthesis.

Taking above into account then if word is 19 bits and large numbers are 2^25 bytes (2^28 bits) then there will be 2^23.75 such words. So using formula above maximal polynomial's coefficient after multiplication could be of size (19 + 19 + 23.75)=61.75 bits. But standard IEEE double-64 type has only 52-bit mantissa, meaning it can't hold 61.75 bits precisely. How come then that 19 bits can be used for word size?

Last fiddled with by moytrage on 2021-07-18 at 08:04

 2021-07-18, 08:10 #29 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 10111000111102 Posts There's something very unusual about that bits vs length table. It seems to me it gives for DP about what might be expected for SP. Some examples, that use DP FFT: CUDALucas v2.06, 17.6 bits/word at 18M FFT length, on GTX1060 GPU Code: | Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done | | Jul 18 00:17:41 | M332197331 112030000 0x88fc24530de05291 | 18432K 0.33594 36.0679 360.67s | 91:21:46:05 33.72% | 18.29 bits/word at 2.25M FF length Code: Starting M43158617 fft length = 2304K | Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done | | Jul 24 20:20:44 | M43158617 20000 0x6c847491e15e5282 | 2304K 0.28906 4.0205 80.41s | 2:00:10:42 0.04% | Gpuowl V7.2-53, 17.02 bits/word at 56M FFT length, on Radeon VII GPU (PRP3 catches ~all errors by GEC, retries until a block passes the error check, or stops if too many consecutive check errors) Code: 2021-07-18 02:02:47 asr2/radeonvii 999299999 OK 103300000 10.34% 0af6d6e23d1b882d 13969 us/it + check 7.43s + save 3.78s; ETA 144d 20:45 17 errors 2 Other gpuowl FFT lengths and exponent ranges can be seen at https://www.mersenneforum.org/showpo...36&postcount=9 Mlucas V17, 17.6 bits/word on 18M FFT length, one core of Xeon E5645 Code: M332220523: using FFT length 18432K = 18874368 8-byte floats. this gives an average 17.601676676008438 bits per digit Using complex FFT radices 36 16 16 32 32 [Jul 22 13:31:39] M332220523 Iter# = 10000 [ 0.00% complete] clocks = 02:16:04.515 [816.4516 msec/iter] Res64: 1A313D709BFA6663. AvgMaxErr = 0.171224865. MaxErr = 0.250000000. Source code can be consulted for fft length exponent limits.
2021-07-18, 08:14   #30
moytrage

Jul 2021

1000012 Posts

Quote:
 Originally Posted by kriesel It seems to me it gives for DP about what might be expected for SP.

Quote:
 Originally Posted by moytrage Also do I understand correctly that if I have K-bit word and N such words then it means that I need at least float type that can hold (K + K + Log2(N)) bits of mantissa? Because when doing polynomials multiplication, each polynomial of degree N and each coefficient having K bits then multiplying these two polynomials has each coefficient of size not more than (K + K + Log2(N)) bits, because each final coefficient's value is at most N * (2^K)^2 after opening parenthesis. Taking above into account then if word is 19 bits and large numbers are 2^25 bytes (2^28 bits) then there will be 2^23.75 such words. So using formula above maximal polynomial's coefficient after multiplication could be of size (19 + 19 + 23.75)=61.75 bits. But standard IEEE double-64 type has only 52-bit mantissa, meaning it can't hold 61.75 bits precisely. How come then that 19 bits can be used for word size?
So basically I can't see how 19-bits words computation can fit into 52-bit mantissa of IEEE double-64 type? Are my quoted formulas above not correct?

 2021-07-18, 08:30 #31 axn     Jun 2003 3·11·157 Posts To OP: Instead of implementing your own FFT (probably too inefficient) or comparing against a special purpose software like P95 (too specialized), try using a high-performance generalized FFT library like FFTW for your FFT baseline. You could also use OpenPFGW which supports testing of arbitrary numbers using generic FFT (rather than special form numbers using IBDWT). It shares the same underlying FFT routines with P95, so it is very high performance. Having said that, 3-4 bits per fft limb sounds too conservative. You might be able to get away with more number of bits. Also, are you using balanced digits representation? That will allow you to put many more bits per limb.
2021-07-18, 08:45   #32
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·11·269 Posts

Quote:
 Originally Posted by moytrage So basically I can't see how 19-bits words computation can fit into 52-bit mantissa of IEEE double-64 type?
It does or doesn't, depending on fft length.
See the paragraph starting "Maximum usable" in https://www.mersenneforum.org/showpo...65&postcount=6
IEEE 754 binary64 is 53 bit (1 implicit) for normalized values.

2021-07-18, 08:55   #33
moytrage

Jul 2021

3×11 Posts

Quote:
 Originally Posted by axn Instead of implementing your own FFT (probably too inefficient)...
I implemented both FFT and NTT as a part of self-educational process, I didn't know about them before and was always wanted to learn and implement them.

And only as a byproduct I found out that FFT got slower, only because FFT used 4 bits and NTT almost 64 bits per word.

So my initial purpose wasn't to invent some cool and fast library faster than Prime95. Also in advance I didn't know that FFT will appear to be slower.

Please read my previous post above, do you think that my formulas are not correct? Using these formulas I can't see how 18-19 bits words in FFT can be used with 52-bit mantissa of IEEE double-64.

I compared my FFT and NTT because they are two about the same, same level of optimizations and almost same skeleton of algorithm, same loops, almost same formulas. Both use only multiplications (no divisions) (NTT uses only multiplications because of Barrett and Montgomery reductions).

And I can see clearly that only reason that FFT is 10x slower than NTT is because NTT uses almost 64 bit words and FFT uses 4 bit words, hence their Fourier table sizes 16x different hence the slow down.

But already I found out how to do some speedup, 8-bit words can be used instead of 4-bit, because 8-bit words give 0.4 error, while before I was targetting 0.1 error. If I can target 0.4 error then of cause I can use 8-bit words and get 2x speedup.

 Similar Threads Thread Thread Starter Forum Replies Last Post indomit Information & Answers 4 2020-10-07 10:50 paulunderwood Miscellaneous Math 13 2016-08-02 00:05 lidocorc Software 2 2008-11-08 09:26 1260 Miscellaneous Math 23 2005-09-04 07:12 clowns789 Miscellaneous Math 3 2004-05-27 23:39

All times are UTC. The time now is 08:42.

Wed Dec 1 08:42:05 UTC 2021 up 131 days, 3:11, 1 user, load averages: 0.95, 1.19, 1.27