20080326, 14:47  #1 
(loop (#_fork))
Feb 2006
Cambridge, England
2×7×461 Posts 
Efficient modular reduction with SSE
The vague idea here is to do four parallel NTTs, say modulo the primes {51,54,60,63}*2^25+1. The operations inside the NTT are ring operations, so you can do them in a vectorised fashion; you have to do a CRT step before carry propagation. The primes are less than 2^31 so modular addition and subtraction are trivial.
Transform length is 2^25, and I think you can have a digit size of 48 bits before worrying about roundoff error (since the product of the primes is greater than 2^(96+25)), so that's exact arithmetic on 800Mbit integers. But is there a nice way to do AB mod p, with A, B, p 32bit numbers, without having to do a full 64x64 multiply of the intermediate product AB by (2^94/p) using a total of nine pmuludq instructions and more shifts than the human brain or the permute unit can stand? 
20080326, 16:04  #2 
(loop (#_fork))
Feb 2006
Cambridge, England
2·7·461 Posts 
My current quad 64x64>128 multiply (input in four SSE2 registers, output in four SSE2 registers) takes 104 cycles on K8 and 117 on core2. This is an age, and in particular is significantly longer than using the integer multiplier four times. Has anyone got a notably better one?
The code is of a reasonably parallel shape, so I have no idea why it takes quite so long  there are 85 instructions, and lots of them can be executed in parallel. 
20080326, 18:51  #3  
"Mark"
Apr 2003
Between here and the
5·7·199 Posts 
Quote:


20080326, 19:28  #4 
∂^{2}ω=0
Sep 2002
República de California
5×2,351 Posts 
Having only had a cursory glance at Tom's sample code, a few comments:
 In 64bit mode, you'd need just *one* scalar MUL operation, which can complete every other cycle, so the SSE variant looks even worse there.  That's an awful lot of unpack.shuffle/repacking going on. If possible, consider rearranging the data, e.g. by doing 4 such wideMULs in parallel, to make the SSE code look much more like the scalar analog, thus eliminating the datashuffling. This will cause register spills, but one can just write data to their original array slots or to local storage as needed  trading a few register spills/fills for tens of inplace shuffles should be well worth it.  You're using intrinsics  In my experience that often leads to dismal performance. If you have access to Visual Studio under 32bit Windows, I suggest trying some raw inline assembly in lieu of intrinsics  VS has a very simple inline ASM syntax relative to e.g. GCC. Or if youi're an old hand at GCC inline ASM, try that.  The Penryn version of Core2 will support some improved vectorized integer MUL instruction by way of SSE4.1 support, e.g. PMULDQ. Together with the above shuffleless parallel widemul variant you should be able to get the per128bitproduct pipelined throughput to something on the order of 10 cycles or better. Still worse than the 2 cycles in 64bit mode, but quite competitive with the scalar version in 32bit mode. 
20080326, 19:56  #5 
(loop (#_fork))
Feb 2006
Cambridge, England
14466_{8} Posts 
I *am* doing four wideMULs in parallel; I realise there's an awful lot of shuffling going on, but it's for getting things lined up correctly for 64bit SSE2 arithmetic. PMULUDQ multiplies slots 0 and 2 of one register by slots 0 and 2 of another putting the result into slots 01 and 23 of the output, which is why I need the initial eight unpack instructions, and the rest of the shifting is needed to get carries not to fall off the top of registers.
Ah, yes, if I work less generally (I only really need the top 64 bits of the product for the division) I can avoid the final shuffling about and just return sumzw231 and sumzw232. I checked that the output from gcc4.1.2 looked pretty much like how I'd write the assembly language if I had enough brain to follow the register allocations; I'd rather be writing this in a nice macro assembler, but I don't know of one clever enough to let me write in A>B+C style and sort out the copying required for the annoying x86 destroyoneinput instructions. I don't think PMULDQ helps  signed multiplies are no use for multiprecision arithmetic. The code is anyway twice as fast as I thought it was because I'd forgotten that RDTSC takes perceptible time; looping a thousand times takes about 55k cycles. 
20080326, 21:12  #6 
(loop (#_fork))
Feb 2006
Cambridge, England
1100100110110_{2} Posts 
OK, SSE is obviously the wrong answer here; a simple implementation using standard 64bit operations (and magicreciprocals for division) takes under 25 cycles to do four 32x32%32>32 operations.

20080326, 22:40  #7 
∂^{2}ω=0
Sep 2002
República de California
5·2,351 Posts 
In 64bit mode, I agree  but in 32bit mode, I'm thinking that maybe trying to do *every* part of the 128bit product algorithm in SSE mode is the real problem: SSE mode is good for the 32x32bit subproducts, but is a pain for reassembling them with the needed carries. OTOH, nonSSE unsignedcompares and adds should be efficient for doing the reassembly. I still think ~10 pipelined cycles is achievable, but need to work through the details of the above hybrid approach to prove or disprove that timing estimate.
Last fiddled with by ewmayer on 20080326 at 22:42 
20080327, 13:34  #8  
Tribal Bullet
Oct 2004
3^{2}·5·79 Posts 
Quote:


20080327, 14:20  #9 
(loop (#_fork))
Feb 2006
Cambridge, England
1936_{16} Posts 
Thanks for that pointer  I'd forgotten that ECM stage 2 is an application of large Fourier transforms. Indeed you're doing exactly what I thought of doing, but with the NTTs mod the different small primes sequential rather than concurrent, and the butterfly code looks very familiar.
I need to read Jorg's book more carefully, I still don't quite understand inplace Fourier transforms so my fft spends most of its time in Code:
left[u]=input[v++]; right[u++]=input[v++]; 
20080401, 14:01  #10 
Tribal Bullet
Oct 2004
110111100011_{2} Posts 
The decimationinfrequency version is just a matrixvector product, where the matrix is a chain of factorized Fourier matrices. The decimationintime version is another matrixvector product, where the matrix is the transpose of the chain of factorized Fourier matrices. You can use that formulation to understand the butterfly structure as well. Nussbaumer has a good (by now quite old) introductory book on the subject as well.
Last fiddled with by jasonp on 20080401 at 14:04 Reason: fix errors 
20080401, 19:49  #11 
∂^{2}ω=0
Sep 2002
República de California
26753_{8} Posts 
My Mlucas code does a forward DIF transform with radices [r1,r2,...,rn], followed by an inverse DIT transform with the same radices in reverse order. This allows for a nice transposeandbitreversalless inplace implementation. Lots of details  literally years' worth of coding and optimization work  buried in that brief description, however.
Getting back to the original thread topic, here's a crude firsttry sketch of an SSEbased 64x64>128bit unsigned integer MUL. This is mostly SSE2 but a few of the new SSE4.2 instructions [e.g. pcmpgtq  for the life of me I can't understand why this and some other very basic quadword instructions were not implemented until 4.2], so the big caveat is that I have no actual hardware to try it out on  probably a couple glaring bugs in there. But gives at least a rough idea of what this might look like in SSE4.2  fuse a couple of these together [for instance to allow for 2 64bit middleproduct parts to be processed sidebyside] and I still think 10 cycles per 128bit result is reasonable. This uses the MSVC inlineasm syntax: Code:
__asm [instruction] dest, src, [imm8] Code:
/* 64x64>128bit unsigned integer MUL using SSE2 Inputs in adjacent 64bit memlocs A = [a0,a1], B = [b0,b1], a0,a1,b0,b1 are 32bit unsigned int, in memory we have a 128bit chunk with a0 at the low address, assume 128bit aligned: */ uint64 *ABinputs; /* Make sure this is 16byte aligned */ ... [allocate local storage for ABinputs of at least 2 64bit elements; init them] ... __asm mov eax, ABinputs /* Pointer to inputpair memloc */ __asm movaps xmm0,[eax] /* xmm[3:2:1:0] = [b1,b0,a1,a0] */ __asm pshufd xmm1,xmm0,0x10 /* Move xmm0[1:0] = [a1,a0] into xmm1[2:0] to get xmm1 = [,a1,,a0], slots 1 and 3 are don'tcares */ __asm pshufd xmm2,xmm0,0x32 /* Move xmm0[3:2] = [a0,a1] into xmm2[2:0] to get xmm2 = [,b1,,b0], slots 1 and 3 are don'tcares */ __asm pmuludq xmm2,xmm1 /* xmm2 = [a0*b0,a1*b1] */ __asm pshufd xmm0,xmm0,0x23 /* Move xmm0[2:3] = [b1,b0] into xmm0[2:0] to get xmm0 = [b1,xx,b0,yy], slots 1 and 3 are don'tcares */ __asm pmuludq xmm0,xmm1 /* xmm0 = [a0*b1,a1*b0] */ __asm pshufd xmm2,xmm2,0x39 /* Circular rightshift the contents of xmm2 by one 32bit word: xmm[3:2:1:0] < xmm[0:3:2:1] */ __asm movq xmm1,xmm0 /* xmm1.hi = 0; xmm1.lo = a0*b1 */ __asm psrldq xmm0,0x8 /* xmm0.hi = 0; xmm0.lo = a1*b0 */ __asm paddq xmm2,xmm1 /* Add a0*b1 to middle 64 bits of accumulator and check for carryout */ __asm pcmpgtq xmm1,xmm2 /* xmm1 = 0xff... if overflow occurred, 0 otherwise. This NEEDS SSE4.2! */ __asm paddq xmm2,xmm0 /* Add a1*b0 to middle 64 bits of accumulator and check for carryout */ __asm pcmpgtq xmm0,xmm2 __asm psubq xmm0,xmm1 /* Total carry, negated */ __asm pslldq xmm0,0x8 /* Leftshift carry to set up folding into high 32 bits of result, which are currently in xmm2[2] */ __asm psubq xmm2,xmm0 /* We know that the carryin cannot overflow the 32 high bits, since the full product must be < 2^128 */ __asm pshufd xmm2,xmm2,0x93 /* Circular leftshift the contents of xmm2 by one 32bit word: xmm[3:2:1:0] < xmm[2:1:0:3] */ /* Result is in xmm2 */ Last fiddled with by ewmayer on 20130402 at 01:46 Reason: add note about pcmpgtq being signed compare 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Efficient Test  paulunderwood  Computer Science & Computational Number Theory  5  20170609 14:02 
Fast modular reduction for primes < 512 bits?  BenR  Computer Science & Computational Number Theory  2  20160327 00:37 
Most efficient way to LL  hj47  Software  11  20090129 00:45 
CPU stats reduction  Prime95  PrimeNet  3  20081117 19:57 
Lattice Reduction  R.D. Silverman  Factoring  2  20050803 13:55 