 mersenneforum.org NTT faster than FFT?
2021-07-18, 08:59   #34
moytrage

Jul 2021

3×11 Posts Quote:
 Originally Posted by axn Also, are you using balanced digits representation?
Never heard of this kind of representation. What is it about? How can I use it, or where can I read about it?

I just slice the raw bytes of the input large number into fixed-size 4-bit words and that's it. Afterwards I convert these 4-bit words into complex numbers and do all the FFT transformations. At the end I do a carry propagation.

2021-07-18, 09:04   #35
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7632₁₀ Posts Quote:
 Originally Posted by moytrage While my FFT uses splitting into very small words of 3-4 bits in size.
The sign-bit is your friend. Also called balanced representation.

If you want to store 10 bits, store a number in the range -512 to 511 rather than 0 to 1023.

2021-07-18, 09:16   #36
moytrage

Jul 2021

3·11 Posts Quote:
 Originally Posted by Prime95 The sign-bit is your friend. Also called balanced representation.
Seems to me that the sign bit will not save me from the generic formula I mentioned above: if a word is K bits (signed or unsigned) and the FFT size is N numbers, then multiplying two polynomials whose coefficients are below 2^K in value results in a final polynomial with coefficients below (K + K + log2(N)) bits.

Signed balanced representation could probably save 2 bits, i.e. the result will be below (K - 1 + K - 1 + log2(N)) bits in absolute value.

Thus if Prime95, for a quite large number (e.g. 2^28 bits), uses words 17 bits in size (about 2^28/17 ≈ 2^24 words in total), then the final coefficients will be (17 - 1 + 17 - 1 + 24) = 56 bits, which is more than the 53-bit mantissa of a 64-bit double can hold. And the mantissa should not only hold a value of this size but also cope with all the rounding errors that accumulate over the multiplications of millions of numbers.
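That worst-case bound is easy to tabulate; a short sketch (hypothetical helper, not Prime95 code):

```python
# A quick check (hypothetical helper, not Prime95 code) of the worst-case
# coefficient-size bound discussed above: multiplying two polynomials with
# K-bit words and transform size N gives product coefficients of at most
# 2K + log2(N) bits, or 2(K-1) + log2(N) bits with balanced digits.
import math

def product_coeff_bits(word_bits, fft_size, balanced=False):
    k = word_bits - 1 if balanced else word_bits
    return 2 * k + int(math.log2(fft_size))

# The numbers from the text: 17-bit balanced words, 2^24 of them.
print(product_coeff_bits(17, 2**24, balanced=True))  # 56, above the 53-bit mantissa
```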

Also very interesting: if I map all numbers to the signed balanced range, how do I then map the final result back? To me it looks strange that after such a mapping the final result can be unmapped to correct values. Does this kind of mapping and unmapping fit well with the operation of multiplication?

Last fiddled with by moytrage on 2021-07-18 at 09:17

2021-07-18, 09:23   #37
axn

Jun 2003

1010000011100₂ Posts Quote:
 Originally Posted by moytrage I implemented both FFT and NTT as part of a self-educational process; I didn't know about them before and had always wanted to learn and implement them. And only as a byproduct did I find out that FFT got slower, only because FFT used 4 bits and NTT almost 64 bits per word. So my initial purpose wasn't to invent some cool fast library faster than Prime95. Also, I didn't know in advance that FFT would turn out to be slower.
Sure. It is always better to learn by doing. My point was that, rather than concluding "my implementation of FFT is slower than my implementation of NTT", you concluded "FFT is slower than NTT". Hence the recommendation to compare against state-of-the-art implementations. But it looks like you already have some idea of the main reason for the slowness.

Quote:
 Originally Posted by moytrage Never heard of this kind of representation. What is it about? How can I use it, or where can I read about it? I just slice the raw bytes of the input large number into fixed-size 4-bit words and that's it. Afterwards I convert these 4-bit words into complex numbers and do all the FFT transformations. At the end I do a carry propagation.
Let's take your case of base-16 (4-bit) representation. In balanced-digit form, we use signed digits in the range (-base/2) <= x <= base/2. If the ith digit is > base/2, we subtract base from it and add 1 to the (i+1)th digit.

Example: Let [1, 9, 12, 7] be your number in base-16 (0x19C7 = 6599). Starting from lowest digit, we check:

7 > 8? No. Retain as-is.
12 > 8? Yes. Use 12-16 = -4. add 1 to next digit.
(9+1) > 8? Yes. Use 10-16 = -6. add 1 to next digit.
(1+1) > 8? No. Retain 2.

So the new representation is [2, -6, -4, 7]. The advantage of balanced representation is two-fold. One: the magnitudes of the digits are 1 bit smaller. Two: the intermediate products are both positive and negative, so their sum undergoes a lot of cancellation and the final sum-of-products is much smaller. Hence you can fit a much larger number of bits per limb.

2021-07-18, 09:40   #38
axn

Jun 2003

2²×3²×11×13 Posts Quote:
 Originally Posted by moytrage Also very interesting, if I map all numbers to signed balanced range, how then I map back final result? To me it looks strange that after mapping final result can unmap to correct values. Is this kind of mapping and unmapping fits well with operation of multiplication?
Trivially done during digit normalization / carry propagation. Just remember that the numbers can now be both positive and negative. In fact, we have to remap the numbers back to the signed representation in preparation for the next iteration.

2021-07-18, 09:59   #39
moytrage

Jul 2021

41₈ Posts Quote:
 Originally Posted by axn Trivially done during digit normalization / carry propagation.
If I map two numbers to signed form, then multiply them, then convert the result back to unsigned form, what guarantees that this sequence of operations gives the same result as just multiplying the two unsigned values and getting an unsigned result?

Is it trivially obvious that these two give the same results?

Last fiddled with by moytrage on 2021-07-18 at 10:00

2021-07-18, 10:33   #40
moebius

Jul 2009
Germany

7·89 Posts Quote:
 Originally Posted by Prime95 That's a bit rude.
And I thought RTFM stands for "read the farging manual".

2021-07-18, 11:30   #41
moytrage

Jul 2021

21₁₆ Posts Quote:
 Originally Posted by axn So the new balanced representation is [2, -6, -4, 7].
Thanks! Now I see that the balanced representation actually gives the correct result. I'll try to incorporate it into my code and see what benefit it gives.

2021-07-18, 12:52   #42
jasonp
Tribal Bullet

Oct 2004

3·1,181 Posts

As a general rule, when performing a convolution of size 2^N on words of size B bits, the convolution results require 2*B+N bits to represent exactly. When using double-precision floating point you get a 53-bit mantissa to fit these results. Suppose your numbers to multiply are of size 100M digits ~ 330M bits (< 2^29). This can be represented as 2^17 x 12-bit words or 2^16 x 13-bit words, or some other combination. If using the latter combination, a convolution will produce results with 42-bit words, which easily fits in a 53-bit floating point mantissa with extra bits to round each element correctly to an integer. So you can afford to be much more aggressive choosing B than you are concluding. Now this is an approximation to the real sizes of things; when doing general multiplies the number of words in the product is twice as large, so the FFT needs to be of size 2^(N+1).

Using balanced representation gives two bonuses: the sign bit of the floating point representation gives an extra bit of mantissa to fit the results, but much more importantly the convolution results now form a distribution centered about 0, which allows choosing a B large enough to risk a convolution result that fails to be correctly rounded, provided the probability that such a result appears is deemed small enough. This is one of the reasons Prime95 can use 19-bit words with very large transforms, despite the error analysis above indicating it would be too dangerous.

The other thing to conclude from this is that an NTT modulus around 2^64 is nearly the same size as a 53-bit floating point mantissa, so if you are using a much smaller B in your FFT then you need to look over the analysis you did to choose the sizes; both kinds of transforms should admit nearly the same size arrays of words. Now, it is possible to use the CRT and construct very large convolution results out of comparatively very small NTTs, but the time and memory for the postprocessing to apply the CRT grows basically quadratically with the number of CRT moduli. My experience over a long time trying to make NTTs competitive with FFTs is that nothing you do with NTTs can overcome the latency of 64-bit integer multiplies compared to the pipeline throughput of double-precision floating point multiplies. For a while I thought GPU computing would be an exception, since consumer-grade GPUs have severely limited double-precision throughput that could make NTTs competitive, but an NTT on a GPU would require 31-32-bit moduli plus the CRT, and there are very few such moduli that have a root of 2 of order large enough to make the requisite-size NTT. See here for my last attempt.

Last fiddled with by jasonp on 2021-07-20 at 17:43 Reason: fix convolution size

2021-07-18, 13:07   #43
moytrage

Jul 2021

3·11 Posts Quote:
 Originally Posted by jasonp Suppose your numbers to multiply are of size 100M digits ~ 330M bits (<2^29). This can be represented as 2^17 x 12 bit words or 2^16 x 13-bit words, or some other combination.
Number 2^29 is not 2^17 words of 12 bits each, but 2^29/12 ≈ 2^25.4 words of 12 bits each.

2021-07-18, 20:21   #44
ewmayer
2ω=0

Sep 2002
República de California

11658₁₀ Posts Quote:
 Originally Posted by moytrage Never heard of this kind of representation. What is it about? How can I use it, or where can I read about it?
I gave you a link to the original Crandall/Fagin DWT paper two full pages' worth of posts ago. I suggest you cool the posting byte-rate and actually read it and work through the short-length examples it contains.

That paper does not delve into the details of just why balanced-digit representation is as good as it is at reducing roundoff errors. In fact it is due both to said representation and to the use of a fast algorithm such as the FFT for computing the resulting convolution. Start with a balanced-digit vector input and compute the convolution using both a matrix-multiply DFT and an FFT. In exact arithmetic the outputs are the same; in practice, the DFT approach gives much higher roundoff errors. Attached is the relevant snippet from a paper the above poster jasonp and I have some familiarity with (this section was written by me, so any errors in it are on me, not him):
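The comparison described above can be tried numerically; a quick NumPy sketch (hypothetical code, not from the attached paper), computing the same circular convolution of balanced-digit inputs via a matrix-multiply DFT and via an FFT, checked against the exact integer result:

```python
# A numerical sketch (not from the attached paper) of the DFT-vs-FFT
# roundoff comparison: same balanced-digit input, circular convolution
# computed two ways, both compared against the exact integer convolution.
import numpy as np

rng = np.random.default_rng(1)
N = 256
a = rng.integers(-8, 9, N).astype(float)   # balanced digits in -8..8
b = rng.integers(-8, 9, N).astype(float)

# Exact circular convolution (all values exactly representable in doubles).
exact = np.zeros(N)
for i in range(N):
    for j in range(N):
        exact[(i + j) % N] += a[i] * b[j]

# FFT-based circular convolution.
via_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

# Matrix-multiply DFT and inverse DFT.
k = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(k, k) / N)
Winv = np.conj(W) / N
via_dft = (Winv @ ((W @ a) * (W @ b))).real

# Both round to the exact result here; the interesting quantity is the
# size of the roundoff error each approach leaves behind.
print("FFT max error:", np.max(np.abs(via_fft - exact)))
print("DFT max error:", np.max(np.abs(via_dft - exact)))
```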
Attached Thumbnails
