20090102, 03:40  #1 
Tribal Bullet
Oct 2004
3^{3}·131 Posts 
Reference code for allinteger convolutions
There seems to be a resurgence of interest in using integer transforms to implement primefinding algorithms, so I've decided to release code that I wrote in 2004 that can do that, at least for the specific problems I was interested in at the time.
The attached will build a library, libfgt.a, on x86_64 systems using gcc. The library has functions that perform largenumber multiplies using Fast Galois transforms in the field of complex numbers mod 2^611. All I needed at the time was multipleprecision squaring modulo Fermat numbers, so that part of the code is complete, but I never got around to Mersennemod squaring. There are a lot of goodies in this library; it's constructed much like older versions of FFTW, so that transforms are recursive and built adaptively from modular building blocks. The building blocks themselves are emitted by one of two code generators, one for twiddle transforms and the other for the base case transforms. Those code generators go to a lot of trouble to optimize the underlying transforms, and I'd like to thank dresdenboy64 for a lot of help over several months in generating those transforms. Finally, the library automatically inserts cachelinesize holes in the transform array to minimize poweroftwo array strides, and the holes stack up recursively when different size power of two transforms are composed together. There is a sample program, timefgt.c, which prints the fastest FGT decomposition for each transform size and the time needed for a squaring at that size. I've only done a few cursory comparisons with the code of others, and believe that a fermatmod squaring using this code is about 15% slower than an equivalent size Mersennemod squaring using mlucas. If you're investigating integer transforms yourself, hopefully this code will be useful, for comparison purposes if nothing else. Apologies for the lack of documentation. 
20090102, 07:04  #2 
Dec 2008
Boycotting the Soapbox
1011010000_{2} Posts 
I dig your programming philosophy: if you can't write selfexplanatory code, you should not write programs.
Now, I see lots of potential for speedup...especially in functions like 'vector_mul', 'vector_scale', etc. For example, while t0, t1, t11 and t12 can reside in GPRs and keep the 64>128bit multiplier busy you could plop results onto the stack and load into SSEregisters the next cycle. Then for example: if t4 and t8 are in xmm0 and t2 and t6 are in xmm1, the following can be trivially vectorized: Code:
t2 = t2 + t4; t6 = t6 + t8; t2 = (t2 >> 61) + (t2 & M61); t6 = (t6 >> 61) + (t6 & M61); a[3].r = (t2 << 3) + ((t2 >> 61) << 3); a[3].i = (t6 << 3) + ((t6 >> 61) << 3); Code:
typedef struct { uint64 r, i; } complex; Code:
typedef struct { uint64 *r,*i; }complex_vector; Question: instead of using a codegenerator, had you considered using recursive templates with specialization? 
20090102, 15:17  #3 
Tribal Bullet
Oct 2004
3^{3}×131 Posts 
I didn't use SSE2 in the vector multiply functions because they require 64x64 multiplication yielding a 128bit product, and my sense was that any savings from doing two at a time would be wiped out by having quadruple the number of integer multiplications to deal with. Never verified that, though.
I find that the code that I want to write is complex enough that I have to use the language I'm most comfortable with, which unfortunately is C. Finally, I'd bragged elsewhere that this was the fastest allinteger code available, but extrapolating the timing figures from this paper it seems that isn't true. Incidentally, if you're contemplating improvements to SchonhageStrassen then the implementation in GMP should be a really good place to start. As a reference, the FGT code needs 70 seconds to perform a F31mod squaring on a 2GHz Opteron Last fiddled with by jasonp on 20090102 at 15:20 
20090102, 16:26  #4 
Mar 2007
Austria
2×151 Posts 
Sorry for a dumb question,but: Why does the library do the transforms modulo 2^611? For finding big primes we will have to reduce modulo much larger numbers,or?
Last fiddled with by ewmayer on 20090102 at 16:55 Reason: Nugget, this is a poor place for beginners ... suggest you google "number theoretic transform" 
20090102, 17:01  #5  
Dec 2008
Boycotting the Soapbox
2^{4}·3^{2}·5 Posts 
Quote:
I would first try to do two 64x64>128 multiplies, swizzle the results into two 128bit stacklocations and load into SSE for vector processing. I don't think the load/store unit is the bottleneck here, so it looks like the cost setting up values for SSE processing is only zero0.5 instructions/value. In my experience the key is to help the processor's outoforder engine by interleaving as much independent code as possible. So, first write code that does (using as few registers as possible): load>multiply>stack>load>sse_vectorstuff Then you can use the extra registers to softwarepipeline this, if it's not doing 3 instructions/clock (or whatever the limit is). Quote:
Same here. Then I read a quote by A. Stepanov of stlfame concerning Java: Quote:
http://ubiety.uwaterloo.ca/~tveldhui/papers/techniques/techniques01.html#l39 And I was hooked, because I realized one should always make everything a template. Take your FGTcode. Had you written the function vector_scale (with C++ template features) as Code:
template<typename REAL, typename IMAG, typename SCALE, int N> void FGT<N>::scale( REAL& a, IMAG& b, SCALE& c ) { ... } P.S. when I write template code, I make all classes CAPITALS. I don't use global macros, so there is no conflict. I think this makes it much easier to read meta code. Everybody else uses _t or _T suffixes or prefixes, which I find confusing. Compare: Code:
template<typename VALUE> VALUE::RESULT function( VALUE value ); Code:
template<typename value_t> value_t::result_t function( value_t value ); Last fiddled with by __HRB__ on 20090102 at 17:58 

20090126, 03:04  #6 
Dec 2008
Boycotting the Soapbox
2^{4}×3^{2}×5 Posts 
Scratch that.
According to http://swox.com/doc/x86timing.pdf the 64x64>128 bit multiply (mul) has a throughput of only 1/4 on Core 2, but in 4 clocks we can do eight 32x32>64bit multiplies using SSE2 (pmuludq), which is twice the work. Pasting the 4 parts together and doing the modulo 2^611, is something that can be easily combined. Now, a basic butterfly has 6 loads (2 complex values + 1 complex root) and 4 stores, so if two 61bit values fit into a 128bit load/store operation, this leaves us only 15 instructions to do one complex multiply and two complex adds (mod 2^611). The 61/(53  safety bits, if any) advantage alone isn't enough to beat floating point FFTs, though. 
20090126, 16:32  #7  
"Richard B. Woods"
Aug 2002
Wisconsin USA
17014_{8} Posts 
Quote:
See http://mathworld.wolfram.com/Karatsu...plication.html So, the savings depend on whether the add/subtracts can be performed (with overlapping) faster than the omitted multiplication could've. Last fiddled with by cheesehead on 20090126 at 16:53 

20090126, 17:33  #8  
Dec 2008
Boycotting the Soapbox
2^{4}·3^{2}·5 Posts 
Quote:
Doing a complex multiply with 3 adds and 3 muls (instead of 2 adds and 4 muls), on the other hand, is something that one should definitely do. 

20090126, 18:07  #9  
∂^{2}ω=0
Sep 2002
República de California
10110101101010_{2} Posts 
Quote:


20090126, 19:45  #10 
Dec 2008
Boycotting the Soapbox
2^{4}×3^{2}×5 Posts 
True, but I'm still convinced that this the wrong approach, unless we get a hardware modular unit. Until that happens, I think the only way to beat fpFFTs is to use a nword modulus where multiplies, adds and modulo can be done in O(n) or better.

20090126, 21:08  #11 
Tribal Bullet
Oct 2004
3^{3}·131 Posts 
I don't have the time to make any improvements to this code. Modifying the code generators to issue SSE2 intrinsics would be a neat exercise but is major surgery. Heck, while I'm dreaming you could interleave FGTs in the SSE2 unit with FGTs in the ALU.

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
FrankeKleinjung specialq lattice sieving code reference  ray10may  YAFU  1  20170401 02:17 
k*b^n+/c where b is an integer greater than 2 and c is an integer from 1 to b1  jasong  Miscellaneous Math  5  20160424 03:40 
Reference Labels for Manual Assignments  ClownRoyal  Information & Answers  5  20121019 20:07 
integer nth root code  bsquared  Computer Science & Computational Number Theory  6  20090929 19:54 
NFS reference  Jushi  Math  2  20060828 12:07 