mersenneforum.org interested in theory+implementation: FFT multipl

2008-07-30, 04:02   #1
antlu

Jul 2008

13 Posts

Greetings all,

I assume I am like many others on this forum in that I am implementing my own arbitrary-precision integer arithmetic library for the purpose of conducting Lucas-Lehmer tests. I understand that if I am to have any hope of testing multi-million exponents, I will need to implement an FFT squaring algorithm. Unfortunately, the academic papers on this subject are hopelessly obscure for me, as I lack the necessary mathematical background (ring theory? what?). If someone could please offer me a little direction in this, I would be grateful.

Also, what is the difference between the Schonhage-Strassen algorithm and other FFTs? Is the new Furer algorithm (2007) also considered an FFT? Can anyone share their implementations?
2008-07-30, 14:44   #2
jasonp
Tribal Bullet

Oct 2004

3537₁₀ Posts

Quote:
 Originally Posted by antlu I understand that if I am to have any hope of testing multi-million exponents, I will need to implement an FFT squaring algorithm. Unfortunately, the academic papers on this subject are hopelessly obscure for me, as I lack the necessary mathematical background (ring theory? what?). If someone could please offer me a little direction in this, I would be grateful.
See the list of FFT references at fftw.org, which includes many web pages and papers at varying levels of complexity. One old but very helpful reference if you're starting from scratch is

K. Jayasuriya, "Multiprecision Arithmetic Using Fast Hartley Transforms", Appl. Math. and Comp. 75:239-251 (1996)
Quote:
 Also, what is the difference b/t the Schonhage-Strassen algorithm and other FFT's? Is the new Furer algorithm (2007) also considered an FFT? Can anyone share their implementations?
All FFT algorithms recursively convert a dense matrix multiplication into a chain of sparse matrix multiplications. Most of the differences between the various flavors of FFTs boil down to changing the things that are being multiplied. Sometimes (most of the time) they're complex floating point numbers, other times they are integers modulo a prime, and with Schonhage-Strassen they are huge integers modulo a huge composite that has a special form. All that matters is that the dense-to-sparse decomposition is possible.
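To make the dense-to-sparse idea concrete, here is a minimal recursive radix-2 Cooley-Tukey FFT in Python (my own illustrative sketch, not code from this thread): each level replaces one dense size-n DFT with two size-n/2 DFTs plus a sparse twiddle-factor combining pass.

```python
import cmath

def fft(a):
    """Recursive radix-2 Cooley-Tukey FFT (len(a) must be a power of 2).

    Each recursion level turns the dense size-n DFT into two size-n/2
    DFTs plus a sparse combining step -- the dense-to-sparse
    decomposition described above.
    """
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2])
    odd = fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # complex n-th root of unity
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```

Replacing the complex root of unity with a different root (modulo a prime, or modulo a special composite as in Schonhage-Strassen) changes the flavor of the transform but not this recursive structure.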

2008-08-01, 19:33   #3
antlu

Jul 2008

13 Posts

Thanks a lot, I will certainly give the material a look. It seems a lot more accessible.
2008-08-05, 03:01   #4
antlu

Jul 2008

13 Posts

I must say I am pretty disappointed with the PDF book by fftguru.com. Though the author may have a strong command of the subject (I cannot say otherwise), I certainly do not think he has a strong command of how to teach others. He spends very little time explaining anything. Amazingly, he finds it somehow interesting or even significant to digress into his personal experience of trying to find a way to beat the odds at roulette. While I realize he was probably trying to make a point about how the FFT could be applied in that situation, I felt it was extremely inappropriate (I downloaded this to learn the FFT, not to read about his boring escapades).

Another chapter is devoted (very poorly, I believe) to complex numbers. Anybody interested in learning the FFT should already have experience with complex numbers, including Euler's formula from complex analysis; I don't believe he even bothered to mention it. I could not understand the later chapters, because I lacked understanding of the foundations he attempted to explain in the first chapter. The main thing I got out of it was what it means to say that the Fourier transform converts a function from its representation in the time domain into a representation in the frequency domain. He even had a simple picture, and for that I am grateful. But I expected so much more.

I decided to just buy a book on the subject. I'm sure his source code will be a lot more interesting and intelligible once I have actually learned the basics.
2008-08-08, 00:15   #5
antlu

Jul 2008

13 Posts

Question: which is faster, a number-theoretic transform or a transform using complex n-th roots of unity?
2008-08-09, 22:19   #6
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

13·491 Posts

Quote:
 Originally Posted by antlu question: which is faster: number theoretic transform, or transform using complex n-th roots of unity?
Unless you're using very odd hardware, complex roots of unity will be much faster because FP-multiply is much better implemented than reduction modulo a prime of more than 32 bits.

NTT has a guarantee of correctness which is perhaps more comfortable than the floating-point methods.
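For concreteness, here is a toy number-theoretic transform in Python (an illustrative sketch of my own, not code from this thread): it is the same DFT, but with the complex n-th root of unity replaced by an n-th root of unity modulo a prime p, so the convolution comes out as exact integers with no rounding at all.

```python
def ntt(a, p, w):
    """Naive number-theoretic transform: the DFT over Z/pZ, using a
    primitive n-th root of unity w mod p in place of a complex root.
    O(n^2) on purpose, to keep the definition visible."""
    n = len(a)
    return [sum(a[j] * pow(w, j * k, p) for j in range(n)) % p
            for k in range(n)]

def cyclic_convolution_mod(a, b, p, w):
    """Exact cyclic convolution via the NTT: forward transform both
    inputs, multiply pointwise, inverse transform. All arithmetic is
    exact modular-integer arithmetic, hence the 'guarantee'."""
    n = len(a)
    fa, fb = ntt(a, p, w), ntt(b, p, w)
    fc = [(x * y) % p for x, y in zip(fa, fb)]
    winv = pow(w, p - 2, p)   # w^{-1} mod p, by Fermat's little theorem
    ninv = pow(n, p - 2, p)   # n^{-1} mod p, to normalize the inverse
    return [(x * ninv) % p for x in ntt(fc, p, winv)]
```

With p = 257 (chosen so that 4 divides p - 1) and w = pow(3, 64, 257) = 241, a primitive 4th root of unity mod 257, `cyclic_convolution_mod([1, 2, 3, 4], [1, 2, 3, 4], 257, 241)` returns the exact cyclic self-convolution [26, 28, 26, 20]. The speed question above is about how expensive those `pow(..., p)` reductions are compared to hardware floating-point multiplies.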

2008-08-11, 19:33   #7
ewmayer
2ω=0

Sep 2002
República de California

3×53×31 Posts

Quote:
 Originally Posted by fivemack Unless you're using very odd hardware, complex roots of unity will be much faster because FP-multiply is much better implemented than reduction modulo a prime of more than 32 bits. NTT has a guarantee of correctness which is perhaps more comfortable than the floating-point methods.
That "guarantee" is highly misleading - even in the absence of ROE there are no guarantees that convolution outputs won't overflow an integer register, for instance, unless one uses a transform length sufficiently large to satisfy some rigorous convolution-output-size bound, which is likely to be ridiculously conservative.
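As a rough illustration of the kind of bound being described (my own sketch with assumed parameters, not Ernst's numbers): if a length-n transform carries b-bit digits, a worst-case convolution output can be as large as n*(2^b - 1)^2, and that must fit in the integer register.

```python
def max_safe_digit_bits(n, register_bits=64):
    """Largest digit size b (in bits) for which the worst-case
    length-n convolution output n * (2^b - 1)^2 still fits in an
    unsigned register of the given width. Typical inputs come
    nowhere near this worst case, which is why the rigorous bound
    is so conservative."""
    b = 1
    while n * (2 ** (b + 1) - 1) ** 2 < 2 ** register_bits:
        b += 1
    return b
```

For example, under this worst-case bound a length-2^20 transform feeding 64-bit registers is limited to 22-bit digits, so honoring the guarantee means packing fewer bits per transform element (hence a longer transform) than a probabilistic error estimate would allow.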

And both floating-point and integer *hardware* units are subject to similar kinds of hardware-level glitches, whether they be due to 3rd-party device drivers, overheating, bad memory, voltage glitches or "cosmic rays".

As I've recounted on several occasions, during the all-integer parallel "wavefront" verification of the compositeness of F24, on at least 3 occasions we had a residue mismatch between a pure-integer partial-interval "triple check" run and the [matching] independent-floating-point-FFT wavefront results. In every such case, rerun of the pure-integer code for the given iteration range on an alternate PC gave a result that matched the floating-point ones, which were run on workstation-class hardware [a Sparc 1 and an SGI Octane]. So marginal-quality PC hardware is not rendered less marginal simply by running in pure-integer mode.

Dual floating-point runs with aggressively set transform lengths and independent computation data [as is done by GIMPS] is a far better way to go in the sense of establishing a result "beyond reasonable doubt", especially when one is testing thousands of numbers per week, which would make the order-of-magnitude speed penalty associated with a pure-integer run unacceptable. [In the case of F24 we had just *one* result to verify, and dual-FP-result verification was not as widely accepted yet, especially by the academic community.]

2008-08-17, 20:17   #8
antlu

Jul 2008

13 Posts

Some additional questions, if I may.

For an NTT, I would have to reduce modulo a 32-bit prime to have any hope of acceptable speed, because reducing by anything larger would require that I implement the reduction in software, correct?

Quote:
 Originally Posted by ewmayer That "guarantee" is highly misleading - even in the absence of ROE there are no guarantees that convolution outputs won't overflow an integer register, for instance, unless one uses a transform length sufficiently large to satisfy some rigorous convolution-output-size bound, which is likely to be ridiculously conservative.
What is ROE? My uneducated guess would be that the E stands for error.
The point you're making here is that the transform size would have to be incredibly large (and perhaps larger than is feasible) to ensure no overflows?

Quote:
 Originally Posted by ewmayer Dual floating-point runs with aggressively set transform lengths and independent computation data [as is done by GIMPS] is a far better way to go in the sense of establishing a result "beyond reasonable doubt", especially when one is testing thousands of numbers per week, which would make the order-of-magnitude speed penalty associated with a pure-integer run unacceptable.
The lesson I should take from here is: double-precision floating point rocks, pure-integer is too slow?

The main reason why I was considering NTT was the "guarantee" of correctness. But I guess I should stick with what others have already proven to be effective. Where may I find the mathematics associated with ensuring that double-precision FP rounding generates a correct result? I briefly glanced through volume 2 of Knuth's Art of Computer Programming: Seminumerical Algorithms, and there is a section on applying FFT to multiplication, and I thought there may have been something there related to it. For now, I will be content to bury my head in a DFT book, and then an FFT book.

2008-08-18, 18:47   #9
ewmayer
2ω=0

Sep 2002
República de California

3×53×31 Posts

Quote:
 Originally Posted by antlu For a NTT I would have to reduce modulo a 32-bit prime number to even hope for acceptable speed b/c doing so by anything larger would require that I implement the reduction in software, correct?
No, most current commodity CPUs have decent 64-bit support. I believe to access the full 64-bit integer functionality of the Core and AMD64 series you need to be running a 64-bit OS, however. [I whimsically refer to this crippling of functionality in 32-bit mode as the "Vista pump", but I'm sure Intel had *cough* excellent technical reasons to do things this way.]
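As an illustration of why 64-bit integer support matters here (my own sketch, not from this thread): NTT codes often pick a prime of special form so that the double-width product can be reduced with shifts and adds instead of a general division. A popular choice is p = 2^64 - 2^32 + 1, for which 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p).

```python
P = (1 << 64) - (1 << 32) + 1   # NTT-friendly prime: 2^64 - 2^32 + 1

def reduce_mod_p(x):
    """Reduce a product x < P^2 modulo P using the congruences
    2^64 = 2^32 - 1 and 2^96 = -1 (mod P), i.e. shifts and adds
    rather than a wide division. The trailing '% P' is only a
    final tidy-up; optimized code would use one or two
    conditional subtractions instead."""
    x0 = x & 0xFFFFFFFF          # bits  0..31
    x1 = (x >> 32) & 0xFFFFFFFF  # bits 32..63
    x2 = (x >> 64) & 0xFFFFFFFF  # bits 64..95
    x3 = x >> 96                 # bits 96..127
    r = x0 + (x1 << 32) + x2 * ((1 << 32) - 1) - x3
    return r % P
```

In C the 128-bit product would live in a `__int128` or a hi/lo register pair; Python's arbitrary-precision integers stand in for that here, which is exactly the "implement the reduction in software" cost the question asks about when the hardware cannot do it natively.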

Quote:
 What is ROE?
Roundoff Error.

Quote:
 The point you're making here is that the transform size would have to be incredibly large (and perhaps larger than is feasible) to ensure no overflows?
Partly, but more that any such "guarantees" assume perfect hardware reliability, which is unrealistic. Given that hardware [especially at the low end, i.e. the consumer PC mass market] is far from 100% reliable, it's far better to *assume* some non-negligible fraction of results will be incorrect and instead structure the testing procedures to reliably catch such occurrences. And given that, one should further use the fastest testing algorithm available, assuming that that speed does not incur a hugely higher failure rate than an alternate testing method. Paired floating-point results using different intermediate data at every stage of the computation [in the case of the LL test, via a random power-of-2 shift of the initial residue] fit the bill.
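To illustrate the shifted-residue idea at toy scale (an illustrative sketch, not GIMPS code): keep the LL residue multiplied by 2^s mod 2^p - 1. Squaring doubles the shift (mod p, since 2^p ≡ 1 mod M_p), so every intermediate value differs between two runs with different shifts, yet both unshift to the same final residue.

```python
def ll_residue(p):
    """Unshifted Lucas-Lehmer residue for M_p = 2^p - 1
    (zero exactly when M_p is prime)."""
    m = (1 << p) - 1
    x = 4
    for _ in range(p - 2):
        x = (x * x - 2) % m
    return x

def ll_residue_shifted(p, s):
    """Same test with the residue kept multiplied by 2^s mod M_p.
    Each squaring doubles the shift (taken mod p, since 2^p = 1
    mod M_p), and the '-2' term must be shifted to match; at the
    end, multiplying by 2^(p - s) undoes the final shift."""
    m = (1 << p) - 1
    x = (4 << s) % m
    for _ in range(p - 2):
        s = (2 * s) % p                      # shift doubles per squaring
        x = (x * x - pow(2, s + 1, m)) % m   # subtract 2 * 2^(new shift)
    return (x * pow(2, (p - s) % p, m)) % m
```

For example, `ll_residue_shifted(11, 5)` equals `ll_residue(11)` even though no intermediate value of the two runs coincides, and for the Mersenne prime M_7 both runs finish at zero.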

Quote:
 The lesson I should take from here is: double-precision floating point rocks, pure-integer is too slow?
Not intrinsically - fixed-point computation is in fact quite a bit cheaper in terms of silicon real estate than floating-point - but because of market considerations, CPU vendors have thrown much more silicon at the FPUs, so on contemporary commodity CPUs this is a fair description. And with the recently announced quad-width [8x32-bit and 4x64-bit] SSE extensions from Intel, it would appear that the balance will continue to skew further toward floating-point computation.

Quote:
 The main reason why I was considering NTT was the "guarantee" of correctness. But I guess I should stick with what others have already proven to be effective. Where may I find the mathematics associated with ensuring that double-precision FP rounding generates a correct result? I briefly glanced through volume 2 of Knuth's Art of Computer Programming: Seminumerical Algorithms, and there is a section on applying FFT to multiplication, and I thought there may have been something there related to it. For now, I will be content to bury my head in a DFT book, and then an FFT book.
For floating-point FFT, Colin Percival [don't have the precise reference, but I believe it's a Mathematics of Computation paper] has done some rigorous error analysis. For a more-useful probabilistic error estimation, see here. Come to think of it, Percival's paper is referenced in the linked paper.

2008-08-18, 22:05   #10

cheesehead
"Richard B. Woods"
Aug 2002
Wisconsin USA

2²×3×641 Posts

Quote:
 Originally Posted by ewmayer "cosmic rays"
antlu,

Don't be misled by Ernst's quotation marks there. Cosmic rays (high-energy particles from outer space) really are more likely to have an effect on the smaller-and-smaller circuitry we're getting. A particle hit that disrupts three or four atoms on a chip is more likely to cause an error on today's narrow circuitry than on the slower-but-wider circuits of yesteryear.

Last fiddled with by cheesehead on 2008-08-18 at 22:05

