#1
∂2ω=0
Sep 2002
República de California
103·113 Posts
David Willmore (hi, David!) wrote (in the Software forum):
Quote:
But it's far from being as simple as "Duh - just slap a modular transform in there next to the FFT" - a long-winded explication follows. Math-o-phobes, beware! It's not as bad as a typical posting to the Number Theory mailing list, but it does require some patience and basic algebra.

* * *

Indeed, I am still planning on implementing a mixed floating/integer transform, but currently my top priority is finishing work on a releasable version of a C implementation of my Mlucas code. There's not terribly much work remaining to be done on the basic algorithmic source code, but my for-pay job doesn't leave me oodles of time to work on this.

The idea behind the mixed floating/integer (more precisely, floating/modular) algorithm is this: do 2 transforms side by side. One is the usual approximate-arithmetic floating FFT; the other is an exact modular transform over some suitably chosen ring of integers. For those of you not familiar with commutative rings (and I am far from an expert in this area myself), the most important things to realize here are the following:

1) A ring defined by a real-integer prime modulus q has multiplicative group order q-1; a ring defined by a Gaussian-integer modulus q (i.e. complex pairs a + b*i, with a and b separately reduced mod q) has group order q^2-1.

2) For a length-N complex transform, we need Nth roots of unity. For the complex FFT this is never a problem (since the complex numbers admit roots of unity of any desired order), but in the modular case a modulus must be chosen which admits the desired variety of runlengths (see (4) below). Typically we want a variety of runlengths N = k*2^n, where k is a small odd number and n is decently large, say at least 20 or so. In order to be able to split each interval between adjacent power-of-2 runlengths into 4 equal-length pieces (as Prime95 does), we need k = 1,3,5,7; in order to split each interval into 8 equal-length pieces (as Mlucas does), we need k = 1,3,5,7,9,11,13,15.
3) For a Crandall/Fagin-style Mersenne-mod DWT to be built on top of the FFT, we also need Nth roots of 2. This latter requirement wrecks many schemes based on cleverly-found primes having the needed roots of unity.

4) For a modular transform of length N to be possible, N must divide the group order. For example, Nick Craig-Wood's modular code for the StrongARM CPU uses the prime q = 2^64 - 2^32 + 1, which defines a ring having multiplicative group order q-1 = 2^64 - 2^32 = 2^32 * (3·5·17·257·65537). Thus this prime admits power-of-two components of the runlength as large as 2^32, as well as the small odd factors k = 3 and 5, but not k = 7,9,11,13,15. This modulus also lends itself to fast hardware operations, another desirable property.

* * *

The implementation I am planning will do modular arithmetic using the Gaussian (complex) integers modulo q, which form the finite field GF(q^2). Here, q = 2^p-1 will itself be a Mersenne prime, so this will be an interesting recursive use of Mersenne primes - we use arithmetic modulo a small one to help us look for a big one. These integers have many nice properties - modding is very fast, and their lower-order power-of-2 roots of unity look just like those for the floating FFT, i.e. the modular transform will look very much like the floating FFT.

What about the needed roots of 1 and 2? Roots of 2 turn out to be easy for Mersenne-mod arithmetic, since (2^j)^N == 2^(j*N) == 2^(j*N mod p) modulo q. That is, for 2^j to be an Nth root of 2 modulo q, we need j*N == 1 modulo p. Since p is prime, arithmetic modulo p defines a field, which means we are guaranteed to be able to find such a multiplicative inverse of N modulo p. (Example: for p = 61 and runlength N = 3*2^18 = 768K, we have j*N mod p == j*(N mod p) mod p; here N mod p = 3*2^18 mod 61 = 3*27 = 81 == 20 (mod 61), and j = 58 is the multiplicative inverse of 20 mod p (since 20*58 = 1160 = 19*61 + 1), so 2^58 is our Nth root of 2.)
In other words, all the roots of 2 needed for the DWT are themselves powers of 2, so multiplies by them can be effected simply via shift, mask and add operations. How about roots of unity? For q = 2^p-1, the group order is q^2-1 = (q+1)*(q-1) = 2^(p+1)*[2^(p-1)-1]. That means all powers of two up to 2^(p+1) are allowed in the transform length, as well as all factors of the composite 2^(p-1)-1. For p > 2, this latter integer is guaranteed to always have 3 as a factor (i.e. we can always at least split each interval between adjacent power-of-2 runlengths into 2 equal-length pieces); the full factorization for some of the smaller Mersennes is: Code:
  p    p-1        factors of 2^(p-1)-1
---    --------   --------------------
  3    2          3
  5    2^2        3*5
  7    2*3        3^2*7
 13    2^2*3      3^2*5*7*13
 17    2^4        3*5*17*257
 19    2*3^2      3^3*7*19*73
 31    2*3*5      3^2*7*11*31*151*331
 61    2^2*3*5    3^2*5^2*7*11*13*31*41*61*151*331*1321
 89    2^3*11     3*5*17*23*89*353*397*683*2113*P10
107    2*53       3*107*6361*69431*20394401*P14
127    2*3^2*7    3^3*7^2*19*43*73*127*337*5419*92737*649657*P11

* * *

At this point, the astute reader who knows a bit about all-integer transforms may be asking, "But won't the floating FFT and modular transform simply return the same set of outputs? What's the use of that?" It is true that if we do either by itself, we must limit our input digit sizes so that each floating or integer output digit contains all of the bits of the corresponding exact convolution output digit. For 64-bit hardware arithmetic, that typically limits us to something around 20 bits per input digit, i.e. to do an LL test of 2^p-1 we need a transform length N ~= p/20.

The key thing that doing a floating and modular transform side by side allows is MUCH LARGER INPUTS, because even if the bottom log2(q) bits of each floating output have been contaminated by roundoff error, we can use the modular output (which is exact modulo q) to exactly reconstruct the messed-up lower bits. How much larger can the inputs be? Well, convolution outputs scale as the square of the inputs, so e.g. for q = 2^61-1 our inputs can be roughly sqrt(q), i.e. 30.5 bits, larger. That means a transform length less than 40% of what an all-floating code would need to test the same number for primality. In other words, even if the parallel-transform scheme winds up needing twice as many clock cycles for a given transform length (which could happen due to less-than-optimal instruction scheduling, especially with HLL code), it will still yield a small gain. In an ideal scenario, we would get all of the added integer instructions "for free" and the resulting code would run more than 2X as fast as all-floating.
I expect the actual performance to lie somewhere between those two extremes. Of course, it could also happen that the compiler does a horrible job and the parallel scheme winds up being so slow it's not worth doing (that's what happened to my first prototype of this scheme, written in Fortran-90 code about 3 years ago, and lacking lots of algorithmic tricks I've learned since). In that case one would need to start writing assembly - I hope to be able to get away with no more than a handful of key assembly-code macros, since I don't have the time to write reams of ASM for various platforms.

Last fiddled with by ewmayer on 2008-03-27 at 16:25 Reason: Cleaned up formatting
#2
Apr 2003
Berlin, Germany
192 Posts
Replying to a 13-month-old thread
I'm currently working on a mixed FGT/FFT, starting from 2 sources made available by Carey Bloodworth. First tests (of reconstructing digits) are very promising. On the GIMPS Opteron I also did first tests of fast multiply chains, with a throughput of 1 every 2 cycles. This requires quick calculation of the modulo right after the product is available, so that the multiplier can store the next product into rax:rdx. It is necessary to load something into rax before an imul to break the dependency chain (and to multiply 2 new values). The LEA instruction would have been fine for shifting one value and adding another, but it doesn't set any flags. I think that will be a different hybrid transform (at least source-code wise). I'm open to exchanging experiences.
#3
Sep 2003
Brazil
2×7 Posts
Let me take this opportunity to ask: I have been wondering about a similar optimization: doing a multi-prime Fermat-style FFT, using both the ALU and the FPU at the same time... this way I would be able to do 2 transforms "simultaneously"... Is that possible, or is it complete nonsense?
#4
∂2ω=0
Sep 2002
República de California
103·113 Posts
The 2nd edition of Crandall & Pomerance adds a nice little exercise regarding primitive roots in F*q[sup]2[/sup] (the multiplicative group of the complex (Gaussian) integers modulo q), where q = 2[sup]p[/sup]-1 is a Mersenne prime. The first edition only mentioned the well-known Creutzburg-Tasche closed-form formula (Theorem 9.5.20 in the 1st edition; 9.5.21 in the 2nd) yielding roots of any desired power-of-2 order in F*q[sup]2[/sup], but I whined to Richard Crandall to say something about roots of the full order q[sup]2[/sup]-1, which may have prompted him to add the following exercise, relating primitive roots in F*q[sup]2[/sup] to those (of order q-1) in the real multiplicative group F*q. Note that I took the liberty of swapping p and q in the problem statement, to make the notation agree with that in my original post in this thread:
Quote:
I would add the following, as well: If a+bi is a primitive root in F*q[sup]2[/sup], what can you say about b+ai? What special property do roots of odd order in F*q[sup]2[/sup] have, and how might this be algorithmically useful?

Last fiddled with by ewmayer on 2005-12-08 at 04:15
#5
∂2ω=0
Sep 2002
República de California
103×113 Posts
Revival of this long-dormant thread by way of a quiz question for our math-o-phile readers, related to the search for primitive roots in F*q[sup]2[/sup], that is, identifying roots of the full order q[sup]2[/sup]-1:
True or False: To test the primitivity of a candidate root R = a + bi, it suffices to test whether R[sup](q[sup]2[/sup]-1)/2[/sup] == -1 (mod q).
#6
Nov 2003
22×5×373 Posts
Quote:
#7
∂2ω=0
Sep 2002
República de California
103·113 Posts