#1 · jasonp (Tribal Bullet) · Oct 2004 · 3³·131 Posts
There seems to be a resurgence of interest in using integer transforms to implement prime-finding algorithms, so I've decided to release code that I wrote in 2004 that can do that, at least for the specific problems I was interested in at the time.
The attached will build a library, libfgt.a, on x86_64 systems using gcc. The library has functions that perform large-number multiplies using fast Galois transforms in the field of complex numbers mod 2^61-1, i.e. GF(p^2) with p = 2^61-1. All I needed at the time was multiple-precision squaring modulo Fermat numbers, so that part of the code is complete, but I never got around to Mersenne-mod squaring.

There are a lot of goodies in this library; it's constructed much like older versions of FFTW, so that transforms are recursive and built adaptively from modular building blocks. The building blocks themselves are emitted by one of two code generators, one for twiddle transforms and the other for the base-case transforms. Those code generators go to a lot of trouble to optimize the underlying transforms, and I'd like to thank dresdenboy64 for a lot of help over several months in generating those transforms.

Finally, the library automatically inserts cache-line-size holes in the transform array to minimize power-of-two array strides, and the holes stack up recursively when different-size power-of-two transforms are composed together.

There is a sample program, timefgt.c, which prints the fastest FGT decomposition for each transform size and the time needed for a squaring at that size. I've only done a few cursory comparisons with the code of others, and believe that a Fermat-mod squaring using this code is about 15% slower than an equivalent-size Mersenne-mod squaring using Mlucas.

If you're investigating integer transforms yourself, hopefully this code will be useful, for comparison purposes if nothing else. Apologies for the lack of documentation.
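To give a flavor of the core arithmetic, here is a minimal standalone sketch — not code from libfgt, and the names are made up — of a complex multiply mod 2^61-1, using gcc's unsigned __int128 for the 64x64->128-bit products:

Code:
/* a sketch only -- not code from libfgt; the names are made up */
#include <stdint.h>

#define M61 ((uint64_t)0x1FFFFFFFFFFFFFFF)    /* p = 2^61 - 1 */

/* reduce a 128-bit value mod p: since 2^61 == 1 (mod p), just
   fold the high bits down and add */
static uint64_t reduce128(unsigned __int128 x)
{
    uint64_t r = (uint64_t)(x & M61) +
                 (uint64_t)((x >> 61) & M61) +
                 (uint64_t)(x >> 122);       /* sum < 2^63 */
    r = (r & M61) + (r >> 61);               /* one more fold */
    return r >= M61 ? r - M61 : r;
}

/* (ar + i*ai)*(br + i*bi) mod p, with i^2 = -1; p == 3 (mod 4),
   so x^2+1 is irreducible and these pairs form GF(p^2) */
static void cmul_m61(uint64_t ar, uint64_t ai,
                     uint64_t br, uint64_t bi,
                     uint64_t *cr, uint64_t *ci)
{
    uint64_t t0 = reduce128((unsigned __int128)ar * br);
    uint64_t t1 = reduce128((unsigned __int128)ai * bi);
    uint64_t t2 = reduce128((unsigned __int128)ar * bi);
    uint64_t t3 = reduce128((unsigned __int128)ai * br);
    uint64_t s  = t2 + t3;

    *cr = (t0 >= t1) ? t0 - t1 : t0 + M61 - t1;   /* ar*br - ai*bi */
    *ci = (s >= M61) ? s - M61 : s;               /* ar*bi + ai*br */
}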
#2 · __HRB__ (Boycotting the Soapbox) · Dec 2008 · 1011010000₂ Posts
I dig your programming philosophy: if you can't write self-explanatory code, you should not write programs.
Now, I see lots of potential for speedup, especially in functions like 'vector_mul', 'vector_scale', etc. For example, while t0, t1, t11 and t12 reside in GPRs and keep the 64->128-bit multiplier busy, you could drop results onto the stack and load them into SSE registers the next cycle. Then, for example, if t4 and t8 are in xmm0 and t2 and t6 are in xmm1, the following can be trivially vectorized:

Code:
t2 = t2 + t4;
t6 = t6 + t8;
t2 = (t2 >> 61) + (t2 & M61);
t6 = (t6 >> 61) + (t6 & M61);
a[3].r = (t2 << 3) + ((t2 >> 61) << 3);
a[3].i = (t6 << 3) + ((t6 >> 61) << 3);
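Something like this sketch with SSE2 intrinsics (hypothetical and untested; it assumes {t2, t6} and {t4, t8} have already been packed into XMM registers):

Code:
#include <stdint.h>
#include <emmintrin.h>

typedef struct { uint64_t r, i; } complex61;

/* vt26 = {t2, t6}, vt48 = {t4, t8}; writes a[3].r and a[3].i */
static void butterfly_tail(complex61 *a3, __m128i vt26, __m128i vt48)
{
    const __m128i m61 = _mm_set1_epi64x(0x1FFFFFFFFFFFFFFFLL);

    __m128i s = _mm_add_epi64(vt26, vt48);          /* t2+t4 | t6+t8 */
    s = _mm_add_epi64(_mm_srli_epi64(s, 61),        /* (s >> 61) +   */
                      _mm_and_si128(s, m61));       /* (s & M61)     */
    __m128i r = _mm_add_epi64(                      /* (s << 3) +    */
        _mm_slli_epi64(s, 3),                       /* ((s>>61)<<3)  */
        _mm_slli_epi64(_mm_srli_epi64(s, 61), 3));
    _mm_storeu_si128((__m128i *)a3, r);             /* one 128-bit store */
}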
Also, instead of the interleaved

Code:
typedef struct { uint64 r, i; } complex;

a split layout would make the SSE loads contiguous:

Code:
typedef struct { uint64 *r, *i; } complex_vector;

Question: instead of using a code generator, had you considered using recursive templates with specialization?
#3 · jasonp (Tribal Bullet) · Oct 2004 · 3³×131 Posts
I didn't use SSE2 in the vector multiply functions because they require 64x64 multiplication yielding a 128-bit product, and my sense was that any savings from doing two at a time would be wiped out by having to deal with quadruple the number of integer multiplications. I never verified that, though.
I find that the code I want to write is complex enough that I have to use the language I'm most comfortable with, which unfortunately is C.

Finally, I'd bragged elsewhere that this was the fastest all-integer code available, but extrapolating the timing figures from this paper it seems that isn't true. Incidentally, if you're contemplating improvements to Schönhage-Strassen, the implementation in GMP should be a really good place to start. As a reference point, the FGT code needs 70 seconds to perform an F31-mod squaring on a 2GHz Opteron.

Last fiddled with by jasonp on 2009-01-02 at 15:20
#4 · Mar 2007 · Austria · 2×151 Posts
Sorry for a dumb question, but: why does the library do the transforms modulo 2^61-1? For finding big primes, won't we have to reduce modulo much larger numbers?
Last fiddled with by ewmayer on 2009-01-02 at 16:55 Reason: Nugget, this is a poor place for beginners ... suggest you google "number theoretic transform" |
#5 · __HRB__ (Boycotting the Soapbox) · Dec 2008 · 2⁴·3²·5 Posts

Quote:
I would first try to do two 64x64->128 multiplies, swizzle the results into two 128-bit stack locations, and load into SSE for vector processing. I don't think the load/store unit is the bottleneck here, so it looks like the cost of setting up values for SSE processing is only 0-0.5 instructions/value.

In my experience the key is to help the processor's out-of-order engine by interleaving as much independent code as possible. So, first write code that does (using as few registers as possible): load -> multiply -> stack -> load -> sse_vectorstuff. Then you can use the extra registers to software-pipeline this, if it's not doing 3 instructions/clock (or whatever the limit is).
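A sketch of that load -> multiply -> stack -> load pattern (hypothetical and unscheduled; the names are made up):

Code:
#include <stdint.h>
#include <emmintrin.h>

/* two 64x64->128 products in the ALU, halves parked on the stack,
   then reloaded as two 128-bit SSE values for the vector part */
static void mul2_to_sse(uint64_t a0, uint64_t b0,
                        uint64_t a1, uint64_t b1,
                        __m128i *lo, __m128i *hi)
{
    unsigned __int128 p0 = (unsigned __int128)a0 * b0;
    unsigned __int128 p1 = (unsigned __int128)a1 * b1;

    uint64_t buf[4];                        /* stack staging area */
    buf[0] = (uint64_t)p0;                  /* low halves...      */
    buf[1] = (uint64_t)p1;
    buf[2] = (uint64_t)(p0 >> 64);          /* ...and high halves */
    buf[3] = (uint64_t)(p1 >> 64);

    *lo = _mm_loadu_si128((__m128i *)&buf[0]);   /* {lo0, lo1} */
    *hi = _mm_loadu_si128((__m128i *)&buf[2]);   /* {hi0, hi1} */
}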
Quote:

Same here. Then I read a quote by A. Stepanov of STL fame concerning Java:

Quote:
http://ubiety.uwaterloo.ca/~tveldhui/papers/techniques/techniques01.html#l39

And I was hooked, because I realized one should always make everything a template. Take your FGT code. Had you written the function vector_scale (with C++ template features) as

Code:
template<typename REAL, typename IMAG, typename SCALE, int N>
void FGT<N>::scale( REAL& a, IMAG& b, SCALE& c )
{
    ...
}
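Something like this (a rough sketch, names made up; the combine step is elided, and N must be a power of two here): the transform size is a compile-time parameter, the compiler unrolls the recursion, and a specialization holds the hand-tuned base case that the code generator would otherwise emit:

Code:
#include <stdint.h>
typedef uint64_t uint64;

template<int N>
struct FGT {
    static void forward(uint64 *re, uint64 *im) {
        FGT<N/2>::forward(re, im);               // recurse on halves
        FGT<N/2>::forward(re + N/2, im + N/2);
        // ... combine the halves with twiddles mod 2^61-1 ...
    }
};

template<>
struct FGT<8> {                                  // specialized base case
    static void forward(uint64 *re, uint64 *im) {
        /* hand-scheduled 8-point transform goes here */
        (void)re; (void)im;
    }
};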
P.S. When I write template code, I make all classes CAPITALS. I don't use global macros, so there is no conflict. I think this makes it much easier to read meta code. Everybody else uses _t or _T suffixes or prefixes, which I find confusing. Compare:

Code:
template<typename VALUE> VALUE::RESULT function( VALUE value );
with:

Code:
template<typename value_t> value_t::result_t function( value_t value );

Last fiddled with by __HRB__ on 2009-01-02 at 17:58
#6 · __HRB__ (Boycotting the Soapbox) · Dec 2008 · 2⁴×3²×5 Posts
Scratch that.
According to http://swox.com/doc/x86-timing.pdf, the 64x64->128-bit multiply (mul) has a throughput of only 1/4 on Core 2, but in 4 clocks we can do eight 32x32->64-bit multiplies using SSE2 (pmuludq), which is twice the work. Pasting the four pieces together and doing the reduction mod 2^61-1 can easily be folded into the surrounding code.

Now, a basic butterfly has 6 loads (2 complex values + 1 complex root) and 4 stores, so if two 61-bit values fit into a 128-bit load/store operation, this leaves us only 15 instructions to do one complex multiply and two complex adds (mod 2^61-1). The advantage of 61 bits per word versus (53 minus safety bits, if any) for floating point isn't enough on its own to beat floating-point FFTs, though.
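For reference, the accounting above in scalar form — one 64x64->128 product decomposed into the four 32x32->64 pieces, of which pmuludq computes two per instruction (a sketch, untested):

Code:
#include <stdint.h>

static void mul_64x64_128(uint64_t a, uint64_t b,
                          uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t ll = a0 * b0, lh = a0 * b1;     /* the four pieces */
    uint64_t hl = a1 * b0, hh = a1 * b1;

    uint64_t mid = lh + (ll >> 32);          /* cannot overflow */
    uint64_t mid2 = mid + hl;                /* may wrap...     */
    uint64_t carry = (mid2 < hl);            /* ...so catch it  */

    *lo = (mid2 << 32) | (uint32_t)ll;
    *hi = hh + (mid2 >> 32) + (carry << 32);
}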
#7 · cheesehead ("Richard B. Woods") · Aug 2002 · Wisconsin USA · 17014₈ Posts

Quote:
See http://mathworld.wolfram.com/Karatsu...plication.html

So, the savings depend on whether the extra add/subtracts can be performed (with overlapping) faster than the omitted multiplication could have been.
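For illustration, one level of Karatsuba on 32-bit operands (a sketch, not from any of the code discussed here): three multiplies plus extra add/subtracts in place of the schoolbook four:

Code:
#include <stdint.h>

static uint64_t mul_karatsuba32(uint32_t x, uint32_t y)
{
    uint64_t x1 = x >> 16, x0 = x & 0xffff;
    uint64_t y1 = y >> 16, y0 = y & 0xffff;

    uint64_t p2 = x1 * y1;                          /* high  */
    uint64_t p0 = x0 * y0;                          /* low   */
    uint64_t p1 = (x1 + x0) * (y1 + y0) - p2 - p0;  /* = x1*y0 + x0*y1 */

    return (p2 << 32) + (p1 << 16) + p0;            /* == (uint64_t)x * y */
}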
Last fiddled with by cheesehead on 2009-01-26 at 16:53

#8 · __HRB__ (Boycotting the Soapbox) · Dec 2008 · 2⁴·3²·5 Posts

Quote:
Doing a complex multiply with 3 adds and 3 muls (instead of 2 adds and 4 muls), on the other hand, is something that one should definitely do. |
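Concretely (a sketch; shown in doubles for clarity, though the identity is the same mod 2^61-1), with the twiddle w = c + i*d stored as the precomputed triple (c, c+d, d-c):

Code:
typedef struct { double re, im; } cplx;
typedef struct { double c, c_plus_d, d_minus_c; } twiddle;

/* (a + i*b) * (c + i*d) in 3 muls and 3 add/subtracts */
static cplx cmul3(cplx z, twiddle w)
{
    double k = w.c * (z.re + z.im);       /* c*(a+b)      */
    cplx out;
    out.re = k - z.im * w.c_plus_d;       /* = a*c - b*d  */
    out.im = k + z.re * w.d_minus_c;      /* = a*d + b*c  */
    return out;
}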
#9 · ewmayer (∂²ω=0) · Sep 2002 · República de California · 10110101101010₂ Posts

Quote:
#10 · __HRB__ (Boycotting the Soapbox) · Dec 2008 · 2⁴×3²×5 Posts
True, but I'm still convinced that this is the wrong approach, unless we get a hardware modular unit. Until that happens, I think the only way to beat fp-FFTs is to use an n-word modulus where multiplies, adds and the modulo reduction can all be done in O(n) or better.
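As an illustration of an O(n) modulo (a sketch using gcc's unsigned __int128, with a hypothetical two-word Mersenne-style modulus m = 2^128-1, which folds the same way 2^61-1 does within one word):

Code:
typedef unsigned __int128 u128;

static u128 reduce_m128(u128 hi, u128 lo)    /* value = hi*2^128 + lo */
{
    u128 r = lo + hi;                        /* 2^128 == 1 (mod m)    */
    if (r < lo)                              /* wraparound carry      */
        r += 1;
    return r;      /* note: r == m also represents zero */
}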
#11 · jasonp (Tribal Bullet) · Oct 2004 · 3³·131 Posts
I don't have the time to make any improvements to this code. Modifying the code generators to issue SSE2 intrinsics would be a neat exercise, but it's major surgery. Heck, while I'm dreaming, you could interleave FGTs in the SSE2 unit with FGTs in the ALU.