mersenneforum.org  

Old 2008-07-30, 04:02   #1
antlu
 
Jul 2008

13 Posts
Default interested in theory+implementation: FFT multiplication

Greetings all,

I assume I am like many others on this forum in that I am implementing my own arbitrary-precision integer arithmetic library for the purpose of conducting Lucas-Lehmer tests.

I understand that if I am to have any hope of testing multi-million exponents, I will need to implement an FFT squaring algorithm. Unfortunately, the academic papers on this subject are hopelessly obscure for me, as I lack the necessary mathematical background (ring theory? what?). If someone could please offer me a little direction in this, I would be grateful.

Also, what is the difference between the Schönhage-Strassen algorithm and other FFTs? Is the new Fürer algorithm (2007) also considered an FFT? Can anyone share their implementations?
Old 2008-07-30, 14:44   #2
jasonp
Tribal Bullet
 
Oct 2004

3537₁₀ Posts
Default

Quote:
Originally Posted by antlu View Post
I understand that if I am to have any hope of testing multi-million exponents, I will need to implement an FFT squaring algorithm. Unfortunately, the academic papers on this subject are hopelessly obscure for me, as I lack the necessary mathematical background (ring theory? what?). If someone could please offer me a little direction in this, I would be grateful.
See the list of FFT references at fftw.org, which includes many web pages and papers at varying levels of complexity. One old but very helpful reference if you're starting from scratch is

K. Jayasuriya, "Multiprecision Arithmetic Using Fast Hartley Transforms", Appl. Math. and Comp. 75:239-251 (1996)
Quote:
Also, what is the difference between the Schönhage-Strassen algorithm and other FFTs? Is the new Fürer algorithm (2007) also considered an FFT? Can anyone share their implementations?
All FFT algorithms recursively convert a dense matrix multiplication into a chain of sparse matrix multiplications. Most of the differences between the various flavors of FFTs boil down to changing the things that are being multiplied. Sometimes (most of the time) they're complex floating point numbers, other times they are integers modulo a prime, and with Schonhage-Strassen they are huge integers modulo a huge composite that has a special form. All that matters is that the dense-to-sparse decomposition is possible.
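In code, the simplest version of that recursion looks something like the following - a bare-bones radix-2 sketch over complex doubles, purely illustrative (the function name is a placeholder, and real libraries work iteratively and in place):

Code:
#include <cmath>
#include <complex>
#include <vector>

// Recursive radix-2 Cooley-Tukey FFT: each level replaces the dense length-n
// DFT (an n-by-n matrix applied to the data) with two length-n/2 DFTs plus a
// sparse "twiddle and combine" pass.  n must be a power of 2.
void fft_radix2(std::vector<std::complex<double> > &a)
{
    const std::size_t n = a.size();
    if (n <= 1) return;

    // Decimation in time: split into even- and odd-indexed halves.
    std::vector<std::complex<double> > even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft_radix2(even);
    fft_radix2(odd);

    // Recombine with the twiddle factors exp(-2*pi*i*k/n): the sparse pass.
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < n / 2; ++k) {
        std::complex<double> t = std::polar(1.0, -2.0 * pi * k / n) * odd[k];
        a[k]         = even[k] + t;
        a[k + n / 2] = even[k] - t;
    }
}

Replace std::complex<double> with integers modulo a prime, and exp(-2*pi*i*k/n) with powers of a primitive n-th root of unity in that prime field, and you have a number-theoretic transform; the recursion is identical, only the ring changes.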
Old 2008-08-01, 19:33   #3
antlu
 
Jul 2008

13 Posts
Default

Thanks a lot, I will certainly give the material a look. It seems a lot more accessible.
Old 2008-08-05, 03:01   #4
antlu
 
Jul 2008

13 Posts
Default

I must say I am pretty disappointed with the PDF book by fftguru.com.

Though the author may have a strong command of the subject (I cannot say otherwise), I certainly do not think he has a strong command of how to teach others. He spends very little time explaining anything. Amazingly, he finds it somehow interesting or even significant to digress into his personal experience of trying to beat the odds at roulette. While I realize he was probably trying to make a point about how the FFT could be applied in that situation, I felt it was extremely inappropriate (I downloaded this to learn the FFT, not to read about his boring escapades). Another chapter is devoted (very poorly, I believe) to complex numbers. Anybody interested in learning the FFT should already have experience with complex numbers, including Euler's formula from complex analysis; I don't believe he even bothered to mention it. I could not understand the later chapters, because I lacked the foundations he attempted to explain in the first chapter.

The main thing I got out of this was what it means when one says that the Fourier transform converts a function from its representation in the time domain into a representation in the frequency domain. He even had a simple picture, and for that I am grateful. But I expected so much more. I decided to just buy a book on the subject. I'm sure his source code will be a lot more interesting and intelligible once I have actually learned the basics.
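For what it's worth, the one definition I did manage to extract (please correct me if I've mangled it) is just

X_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i k n / N}, for k = 0, ..., N-1,

i.e. each frequency-domain value X_k sums the time-domain samples x_n against powers of the k-th complex root of unity.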
Old 2008-08-08, 00:15   #5
antlu
 
Jul 2008

13 Posts
Default

Question: which is faster, a number-theoretic transform or a transform using complex n-th roots of unity?
Old 2008-08-09, 22:19   #6
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

13·491 Posts
Default

Quote:
Originally Posted by antlu View Post
Question: which is faster, a number-theoretic transform or a transform using complex n-th roots of unity?
Unless you're using very odd hardware, complex roots of unity will be much faster because FP-multiply is much better implemented than reduction modulo a prime of more than 32 bits.

NTT has a guarantee of correctness which is perhaps more comfortable than the floating-point methods.
Old 2008-08-11, 19:33   #7
ewmayer
2ω=0
 
Sep 2002
República de California

3×53×31 Posts
Default

Quote:
Originally Posted by fivemack View Post
Unless you're using very odd hardware, complex roots of unity will be much faster because FP-multiply is much better implemented than reduction modulo a prime of more than 32 bits.

NTT has a guarantee of correctness which is perhaps more comfortable than the floating-point methods.
That "guarantee" is highly misleading - even in the absence of ROE there are no guarantees that convolution outputs won't overflow an integer register, for instance, unless one uses a transform length sufficiently large to satisfy some rigorous convolution-output-size bound, which is likely to be ridiculously conservative.

And both floating-point and integer *hardware* units are subject to similar kinds of hardware-level glitches, whether they be due to 3rd-party device drivers, overheating, bad memory, voltage glitches or "cosmic rays".

As I've recounted on several occasions, during the all-integer parallel "wavefront" verification of the compositeness of F24, on at least 3 occasions we had a residue mismatch between a pure-integer partial-interval "triple check" run and the [matching] independent-floating-point-FFT wavefront results. In every such case, rerun of the pure-integer code for the given iteration range on an alternate PC gave a result that matched the floating-point ones, which were run on workstation-class hardware [a Sparc 1 and an SGI Octane]. So marginal-quality PC hardware is not rendered less marginal simply by running in pure-integer mode.

Dual floating-point runs with aggressively set transform lengths and independent computation data [as is done by GIMPS] are a far better way to go in the sense of establishing a result "beyond reasonable doubt", especially when one is testing thousands of numbers per week, which would make the order-of-magnitude speed penalty associated with a pure-integer run unacceptable. [In the case of F24 we had just *one* result to verify, and dual-FP-result verification was not as widely accepted yet, especially by the academic community.]
Old 2008-08-17, 20:17   #8
antlu
 
Jul 2008

13 Posts
Default

Thanks for your responses.
Some additional questions, if I may.

For an NTT I would have to reduce modulo a 32-bit prime to have any hope of acceptable speed, because reducing modulo anything larger would require that I implement the reduction in software, correct?

Quote:
Originally Posted by ewmayer View Post
That "guarantee" is highly misleading - even in the absence of ROE there are no guarantees that convolution outputs won't overflow an integer register, for instance, unless one uses a transform length sufficiently large to satisfy some rigorous convolution-output-size bound, which is likely to be ridiculously conservative.
What is ROE? My uneducated guess would be that the E stands for error.
The point you're making here is that the transform size would have to be incredibly large (and perhaps larger than is feasible) to ensure no overflows?

Quote:
Originally Posted by ewmayer View Post
Dual floating-point runs with aggressively set transform lengths and independent computation data [as is done by GIMPS] are a far better way to go in the sense of establishing a result "beyond reasonable doubt", especially when one is testing thousands of numbers per week, which would make the order-of-magnitude speed penalty associated with a pure-integer run unacceptable.
The lesson I should take from here is: double-precision floating point rocks, pure-integer is too slow?

The main reason why I was considering NTT was the "guarantee" of correctness. But I guess I should stick with what others have already proven to be effective. Where may I find the mathematics associated with ensuring that double-precision FP rounding generates a correct result? I briefly glanced through volume 2 of Knuth's Art of Computer Programming: Seminumerical Algorithms, and there is a section on applying FFT to multiplication, and I thought there may have been something there related to it. For now, I will be content to bury my head in a DFT book, and then an FFT book.
Old 2008-08-18, 18:47   #9
ewmayer
2ω=0
 
Sep 2002
República de California

3×53×31 Posts
Default

Quote:
Originally Posted by antlu View Post
For an NTT I would have to reduce modulo a 32-bit prime to have any hope of acceptable speed, because reducing modulo anything larger would require that I implement the reduction in software, correct?
No, most current commodity CPUs have decent 64-bit support. I believe to access the full 64-bit integer functionality of the Core and AMD64 series you need to be running a 64-bit OS, however. [I whimsically refer to this crippling of functionality in 32-bit mode as the "Vista pump", but I'm sure Intel had *cough* excellent technical reasons to do things this way.]
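For concreteness - and purely as an illustration on my part, not code from any existing NTT - here is what a 64-bit modular multiply looks like once you have the 128-bit intermediate type that gcc/icc expose on a 64-bit target, along with the special-form-prime trick that avoids the hardware divide:

Code:
#include <cstdint>

// Placeholder names.  mulmod_generic pays for a full 64-bit divide;
// mulmod_m61 shows how a special-form prime (here 2^61 - 1) lets the
// reduction be done with shifts and adds instead.
static const uint64_t M61 = ((uint64_t)1 << 61) - 1;   // 2^61 - 1, prime

static uint64_t mulmod_generic(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

// Assumes a and b are already reduced below 2^61 - 1.
static uint64_t mulmod_m61(uint64_t a, uint64_t b)
{
    unsigned __int128 t = (unsigned __int128)a * b;
    uint64_t lo = (uint64_t)t & M61;      // low 61 bits
    uint64_t hi = (uint64_t)(t >> 61);    // remaining high bits
    uint64_t r  = lo + hi;                // fold, since 2^61 == 1 (mod 2^61 - 1)
    if (r >= M61) r -= M61;               // at most one correction needed
    return r;
}

The '% p' in the generic version is the expensive part; special-form primes are what make 64-bit NTT butterflies reasonably cheap.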

Quote:
What is ROE?
Roundoff Error.

Quote:
The point you're making here is that the transform size would have to be incredibly large (and perhaps larger than is feasible) to ensure no overflows?
Partly, but more that any such "guarantees" assume perfect hardware reliability, which is unrealistic. Given that hardware [especially at the low end, i.e. the consumer PC mass market] is far from 100% reliable, it's far better to *assume* some non-negligible fraction of results will be incorrect and instead structure the testing procedures to reliably catch such occurrences. And given that, one should further use the fastest testing algorithm available, assuming that that speed does not incur a hugely higher failure rate than an alternate testing method. Paired floating-point results using different intermediate data at every stage of the computation [in the case of the LL test, via a random power-of-2 shift of the initial residue] fit the bill.
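If the shifted-residue bookkeeping sounds mysterious, here is a toy-sized illustration at p = 31, small enough to run with plain 64-bit integers (again leaning on gcc's 128-bit intermediate type); it is my own sketch of the idea, not code from Prime95 or Mlucas:

Code:
#include <cstdint>
#include <cstdio>

// Toy Lucas-Lehmer test of M_p = 2^p - 1 at p = 31, carrying the residue
// with a power-of-2 shift.  Only the bookkeeping is of interest here; real
// programs do the squaring with an FFT, not a 128-bit multiply.
static const uint64_t P  = 31;                        // M31 = 2147483647 is prime
static const uint64_t MP = ((uint64_t)1 << P) - 1;

static uint64_t mulmod(uint64_t a, uint64_t b)        // a*b mod 2^P - 1, toy size
{
    return (uint64_t)(((unsigned __int128)a * b) % MP);
}

int main()
{
    uint64_t shift = 7;                               // arbitrary initial shift s < P
    uint64_t x = mulmod(4, (uint64_t)1 << shift);     // stored residue = 2^s * 4 (mod MP)

    for (uint64_t i = 0; i < P - 2; ++i) {
        x = mulmod(x, x);                             // now holds 2^(2s) * (true residue)^2
        shift = (2 * shift) % P;                      // exponents reduce mod P, since 2^P == 1 (mod MP)
        uint64_t two = mulmod(2, (uint64_t)1 << shift);   // the "-2" must be shifted as well
        x = (x + MP - two) % MP;
    }

    // Un-rotate by multiplying with 2^(P - shift).  A second run with a different
    // initial shift sees different intermediate data throughout, but must un-rotate
    // to the same final residue.
    uint64_t result = mulmod(x, (uint64_t)1 << ((P - shift) % P));
    std::printf("LL residue for p = %u: %llu  (0 means M_p is prime)\n",
                (unsigned)P, (unsigned long long)result);
    return 0;
}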

Quote:
The lesson I should take from here is: double-precision floating point rocks, pure-integer is too slow?
Not intrinsically - fixed-point computation is in fact quite a bit cheaper in terms of silicon real estate than floating-point - but because of market considerations, CPU vendors have thrown much more silicon at the FPUs, so on contemporary commodity CPUs this is a fair description. And with the quad-width [8x32-bit and 4x64-bit] SSE extensions recently announced by Intel, it would appear that the balance will continue to skew further toward floating-point computation.

Quote:
The main reason why I was considering NTT was the "guarantee" of correctness. But I guess I should stick with what others have already proven to be effective. Where may I find the mathematics associated with ensuring that double-precision FP rounding generates a correct result? I briefly glanced through volume 2 of Knuth's Art of Computer Programming: Seminumerical Algorithms, and there is a section on applying FFT to multiplication, and I thought there may have been something there related to it. For now, I will be content to bury my head in a DFT book, and then an FFT book.
For floating-point FFT, Colin Percival [don't have the precise reference, but I believe it's a Mathematics of Computation paper] has done some rigorous error analysis. For a more-useful probabilistic error estimation, see here. Come to think of it, Percival's paper is referenced in the linked paper.
Old 2008-08-18, 22:05   #10
cheesehead
 
"Richard B. Woods"
Aug 2002
Wisconsin USA

2²×3×641 Posts
Default

Quote:
Originally Posted by ewmayer View Post
"cosmic rays"
antlu,

Don't be misled by Ernst's quotation marks there. Cosmic rays (high-energy particles from outer space) really are more likely to have an effect on the smaller-and-smaller circuitry we're getting. A particle hit that disrupts three or four atoms on a chip is more likely to cause an error on today's narrow circuitry than on the slower-but-wider circuits of yesteryear.

Last fiddled with by cheesehead on 2008-08-18 at 22:05
Old 2008-08-18, 22:17   #11
antlu
 
Jul 2008

13 Posts
Default

I see. I really appreciate your time, btw.

Could you please elaborate a bit on the error detection methods employed by GIMPS? I understand that an LL residue for a particular exponent must be double-checked by a different individual before results are accepted. But what is the power-of-2 shift of the initial residue you mention? Is this just that instead of beginning the LL sequence with 4, you begin with a different power of 2 (or 4 shifted by a power of 2)? It would seem that hardware reliability is a huge factor in testing large exponents. >32 million iterations, with i-dunno-how-many machine operations per iteration. Would there be any advantage to saving intermediate results? Thus, instead of performing an LL test in its entirety, we perform some fraction of it, and then repeat this part to verify that the result up to some number of iterations is correct before proceeding? Just a thought.

I am convinced that I should proceed with FP. I have heard mentions of a Discrete Weighted Transform (DWT) and my very shallow understanding (please correct me on my errors) of its advantage is that one does not need to pad the input with zeros as one would need to in a regular FFT (padding with zeros to extend the input to a power-of-2 length). The tradeoff being that one must perform two convolutions (cyclic and negacyclic?) instead of one to arrive at the final result. I don't even understand most of what I said yet. Perhaps it will make more sense after some reading.

In your opinion, would there be any benefit to including a hybrid-squaring routine where Karatsuba and Toom-3 are used for smaller operands? When L^2 - 2 (mod Mersenne) is relatively small, the iteration may be faster with these routines than with an FFT.

I look forward to Intel's new Nehalem microarchitecture.