mersenneforum.org Reference code for all-integer convolutions

2009-01-02, 03:40   #1
jasonp
Tribal Bullet

Oct 2004

Reference code for all-integer convolutions

There seems to be a resurgence of interest in using integer transforms to implement prime-finding algorithms, so I've decided to release code that I wrote in 2004 that can do that, at least for the specific problems I was interested in at the time.

The attached will build a library, libfgt.a, on x86_64 systems using gcc. The library has functions that perform large-number multiplies using Fast Galois transforms in the field of complex numbers mod 2^61-1. All I needed at the time was multiple-precision squaring modulo Fermat numbers, so that part of the code is complete, but I never got around to Mersenne-mod squaring.

There are a lot of goodies in this library; it's constructed much like older versions of FFTW, so that transforms are recursive and built adaptively from modular building blocks. The building blocks themselves are emitted by one of two code generators, one for twiddle transforms and the other for the base case transforms. Those code generators go to a lot of trouble to optimize the underlying transforms, and I'd like to thank dresdenboy64 for a lot of help over several months in generating those transforms. Finally, the library automatically inserts cache-line-size holes in the transform array to minimize power-of-two array strides, and the holes stack up recursively when different size power of two transforms are composed together.

There is a sample program, timefgt.c, which prints the fastest FGT decomposition for each transform size and the time needed for a squaring at that size. I've only done a few cursory comparisons with other code, and believe that a Fermat-mod squaring using this library is about 15% slower than an equivalent-size Mersenne-mod squaring using Mlucas.

If you're investigating integer transforms yourself, hopefully this code will be useful, for comparison purposes if nothing else. Apologies for the lack of documentation.
Attached Files
 fgt64.zip (17.8 KB, 178 views)

2009-01-02, 07:04   #2
__HRB__

Dec 2008
Boycotting the Soapbox

I dig your programming philosophy: if you can't write self-explanatory code, you shouldn't write programs.

Now, I see lots of potential for speedup, especially in functions like 'vector_mul', 'vector_scale', etc. For example, while t0, t1, t11 and t12 reside in GPRs and keep the 64->128-bit multiplier busy, you could plop results onto the stack and load them into SSE registers the next cycle. Then, for example, if t4 and t8 are in xmm0 and t2 and t6 are in xmm1, the following can be trivially vectorized:

Code:
t2 = t2 + t4;
t6 = t6 + t8;
t2 = (t2 >> 61) + (t2 & M61);
t6 = (t6 >> 61) + (t6 & M61);
a[3].r = (t2 << 3) + ((t2 >> 61) << 3);
a[3].i = (t6 << 3) + ((t6 >> 61) << 3);
I think doing this in several places should make it a piece of cake to achieve a speedup of 15%. I think 50% faster could be possible, but one would have to fix whole sections. The first thing to change would be:

Code:
typedef struct { uint64 r, i; } complex;
This is a bad idea if you're aiming for maximum throughput, because it makes 'complex*' a 'vector of structs'. To exploit maximum parallelism it's much better to have a 'struct of vectors', such as:

Code:
typedef struct { uint64 *r, *i; } complex_vector;
Small changes, huge difference. This makes it really trivial to vectorize code: load two pointers and wheeeeeeeeeeeeeee!

Question: instead of using a code generator, had you considered using recursive templates with specialization?
2009-01-02, 15:17   #3
jasonp
Tribal Bullet

Oct 2004

I didn't use SSE2 in the vector multiply functions because they require 64x64 multiplication yielding a 128-bit product, and my sense was that any savings from doing two at a time would be wiped out by having quadruple the number of integer multiplications to deal with. I never verified that, though.

I find that the code I want to write is complex enough that I have to use the language I'm most comfortable with, which unfortunately is C.

Finally, I'd bragged elsewhere that this was the fastest all-integer code available, but extrapolating the timing figures from this paper it seems that isn't true. Incidentally, if you're contemplating improvements to Schönhage-Strassen, the implementation in GMP should be a really good place to start.

As a reference, the FGT code needs 70 seconds to perform an F31-mod squaring on a 2GHz Opteron.

Last fiddled with by jasonp on 2009-01-02 at 15:20
2009-01-02, 16:26   #4
nuggetprime

Mar 2007
Austria

Sorry for a dumb question, but: why does the library do the transforms modulo 2^61-1? For finding big primes we will have to reduce modulo much larger numbers, no?

Last fiddled with by ewmayer on 2009-01-02 at 16:55 Reason: Nugget, this is a poor place for beginners ... suggest you google "number theoretic transform"
2009-01-02, 17:01   #5
__HRB__

Dec 2008
Boycotting the Soapbox


Quote:
 Originally Posted by jasonp I didn't use SSE2 in the vector multiply functions because they require 64x64 multiplication yielding a 128-bit product, and my sense was that any savings from doing two at a time would be wiped out by having quadruple the number of integer multiplications to deal with. Never verified that, though.
Nobody says you cannot use the best of both worlds!

I would first try to do two 64x64->128 multiplies, swizzle the results into two 128-bit stack-locations and load into SSE for vector processing.

I don't think the load/store unit is the bottleneck here, so it looks like the cost of setting up values for SSE processing is only 0 to 0.5 instructions/value.

In my experience the key is to help the processor's out-of-order engine by interleaving as much independent code as possible.

So, first write code that does this, using as few registers as possible.

Then you can use the extra registers to software-pipeline this, if it's not doing 3 instructions/clock (or whatever the limit is).

Quote:
 I find that the code that I want to write is complex enough that I have to use the language I'm most comfortable with, which unfortunately is C.
Hm.

Same here. Then I read a quote by A. Stepanov of STL fame concerning Java:

Quote:
 Originally Posted by A.Stepanov It keeps all the stuff that I never use in C++ - inheritance, virtuals - OO gook - and removes the stuff that I find useful.
http://ubiety.uwaterloo.ca/~tveldhui/papers/techniques/techniques01.html#l39

And I was hooked, because I realized one should always make everything a template.

Take your FGT-code. Had you written the function vector_scale (with C++ template features) as

Code:
template<typename REAL, typename IMAG, typename SCALE, int N>
void FGT<N>::scale( REAL& a, IMAG& b, SCALE& c )
{
...
}
with M61<uint64> being a type with overloaded arithmetic operators, including 'operator%()', you would only have to rewrite specializations for M2001<bignum_t> and could reuse all the unspecialized code.

P.S. When I write template code, I make all classes CAPITALS. I don't use global macros, so there is no conflict. I think this makes meta code much easier to read. Everybody else uses _t or _T suffixes or prefixes, which I find confusing. Compare:

Code:
template<typename VALUE> VALUE::RESULT function( VALUE value );
with

Code:
template<typename value_t> value_t::result_t function( value_t value );

Last fiddled with by __HRB__ on 2009-01-02 at 17:58

2009-01-26, 03:04   #6
__HRB__

Dec 2008
Boycotting the Soapbox


Quote:
 Originally Posted by __HRB__ Nobody says you cannot use the best of both worlds!
Scratch that.

According to http://swox.com/doc/x86-timing.pdf the 64x64->128 bit multiply (mul) has a throughput of only 1/4 on Core 2, but in 4 clocks we can do eight 32x32->64-bit multiplies using SSE2 (pmuludq), which is twice the work.

Pasting the four partial products back together and doing the reduction modulo 2^61-1 can easily be combined into the same step.

Now, a basic butterfly has 6 loads (2 complex values + 1 complex root) and 4 stores, so if two 61-bit values fit into a 128-bit load/store operation, this leaves us only 15 instructions to do one complex multiply and two complex adds (mod 2^61-1).

The 61/(53 - safety bits, if any) advantage alone isn't enough to beat floating point FFTs, though.

2009-01-26, 16:32   #7
cheesehead
"Richard B. Woods"
Aug 2002
Wisconsin USA


Quote:
 Originally Posted by jasonp I didn't use SSE2 in the vector multiply functions because they require 64x64 multiplication yielding a 128-bit product, and my sense was that any savings from doing two at a time would be wiped out by having quadruple the number of integer multiplications to deal with.
The number of half-size multiplications to get the same result as a full-size multiplication can be three, instead of four, if some auxiliary additions/subtractions are done.

See http://mathworld.wolfram.com/Karatsu...plication.html

So the savings depend on whether the adds/subtracts can be performed (with overlapping) faster than the omitted multiplication could have been.

Last fiddled with by cheesehead on 2009-01-26 at 16:53

2009-01-26, 17:33   #8
__HRB__

Dec 2008
Boycotting the Soapbox


Quote:
 Originally Posted by cheesehead The number of half-size multiplications to get the same result as a full-size multiplication can be three, instead of four, if some auxiliary additions/subtractions are done. See http://mathworld.wolfram.com/Karatsu...plication.html So, the savings depend on whether the add/subtracts can be performed (with overlapping) faster than the omitted multiplication could've.
I think Karatsuba will only make sense if we're using p = 2^607-1 or something that big. But if we're processing that many bits, using 2^n+1 will be *much* faster.

Doing a complex multiply with 3 adds and 3 muls (instead of 2 adds and 4 muls), on the other hand, is something that one should definitely do.

2009-01-26, 18:07   #9
ewmayer
2ω=0

Sep 2002
República de California


Quote:
 Originally Posted by __HRB__ According to http://swox.com/doc/x86-timing.pdf the 64x64->128 bit multiply (mul) has a throughput of only 1/4 on Core 2, but in 4 clocks we can do eight 32x32->64-bit multiplies using SSE2 (pmuludq), which is twice the work.
If the 64x64 and 8-way 32x32s can be done in parallel (i.e. don't use the same hardware multipliers), one could conceivably achieve 3 128-bit products every 4 cycles, which is smokin' fast.

2009-01-26, 19:45   #10
__HRB__

Dec 2008
Boycotting the Soapbox


Quote:
 Originally Posted by ewmayer If the 64x64 and 8-way 32x32s can be done in parallel (i.e. don't use the same hardware multipliers), one could conceivably achieve 3 128-bit products every 4 cycles, which is smokin' fast.
True, but I'm still convinced that this is the wrong approach, unless we get a hardware modular unit. Until that happens, I think the only way to beat fp-FFTs is to use an n-word modulus where multiplies, adds and modulo can be done in O(n) or better.

2009-01-26, 21:08   #11
jasonp
Tribal Bullet

Oct 2004

I don't have the time to make any improvements to this code. Modifying the code generators to issue SSE2 intrinsics would be a neat exercise, but it's major surgery. Heck, while I'm dreaming, you could interleave FGTs in the SSE2 unit with FGTs in the ALU.

