∂^{2}ω=0
Sep 2002
República de California
3·7·13·43 Posts

Fermatmodmul with circularly shifted residues
I spent an appreciable amount of time last year wrestling with basic concepts and implementationincode of Fermatmod squaring chains in the presence of residue shift. This proves rather more interesting (a polite euphemism for "pain in the butt") than for the Mersennemod case. The Gerbicz check renders it less important, but it may still prove useful in some contexts, so figured I'd post a summary.
Let s0 = initial shift, i.e. instead of starting the Pépin test of Fm with initial seed x0, we start with (x0 << s0) (mod Fm).
[1] Data layout: for Mersennemod IBDWT, we have 2 different wordsizes differing by one bit, so need to loop over the elements in our variablewordsize array until we find the one containing the (s0)th bit, then shift x0 by the appropriate small number of bits to place it on the proper bit boundary within that target word. For Fermatmod, the easy case is that of powerof2 FFT length, for which we have a fixed powerof2 wordsize. More generally we will have a nonpowerof2 FFT length optimized for the target modulus  e.g. for F24 we can use 896 Kdoubles, and for F30 I did one run at 60M and the other (now 93% complete) at 64M. In both cases we use a special acyclictransformeffecting DWT which multiplies our N input words by the first N (2N)th complex roots of unity, i.e. the (j)th input gets multiplied by DWT weight w[j] = exp(I*Pi*j/N). This means multiplying our original real input word by a complex number, but we notice that w[j+N/2] = I*w[j], making it natural to group the (j)th and (j+N/2)th input words together in memory and treat them as a single input to a complex FFT. This is the socalled "rightangle transform" trick, and is detailed in the original Crandall/Fagin 1994 DWT paper. In the Mersennemod IBDWT case we can also treat adjacent even/oddindexed imput elements as a single input to a complex FFT, but as the Mersennemod IBDWT weights are strictly real, these are simply pairs of adjacent input words.
For Fermatmod arithmetic using a nonpowerof2length transform we need to layer a Mersennestyle dualwordlength base representation and similarly defined IBDWT weights atop our acycliceffecting DWT. One simplification relative to the Mersenne case is that for FFT length [odd]*2^k, the Fermatmod wordsizes and IBDWT weights recur in repeating length[odd] patterns.
In the context of a rightangle Fermatmod acyclic transform, finding the target word for a given initialseed shift means looping first over the evenindexed elements of our residue vector (considered here as a lengthN real array), then if the accumulated wordsizes are still less than the target shift s0, looping over the oddindexed elements until we reach the target word. For the Fermatmod case we need only do this for the initial seed of the Pépin test; for the Mersennemod case, specifically the LucasLehmer test, we must also repeat the exercise on each subsequent iteration in order to find the proper bit offset for injection of the 2 term added to the outout of each modsquaring output. This can be done very efficiently  specifically in O(1) time  by initializing a suitable bitmap encoding the variable wordsizes of the Mersennemod IBDWT and using that to effect a fast lookup scheme.
[2] For a basic squaringchainwithshift scheme as we do for Mersenne number M(p), each modsquaring doubles the shift count (mod p). That means that for any s0 != 0 (mod p) the shift will remain nonzero and nonrepeating during the LL/PRPtest, since at squaring i, our shift s = s0.2^i (mod p) and the multiplicative group (mod p) is of maximal order p1.
OTOH for squaring chains modulo a Fermat number Fm, each subsequent squaring doubles the shift count (mod 2^m), so we have the problem that any nonzero initial shift s0 will lead to a shift s = 0 after precisely m squaremods, and the shift will remain 0 throughout the remainder of the squaring chain. One way to get around this problem is to define an auxiliary pseudorandombit array, and on the (i)th squaremod, if bit[i] = 1, do a further moddoubling of the residue, thus yielding the shiftcount update s[i] = 2.s[i1] + bit[i]. For a pair of Pépin tests of Fm such as are typically done to allow for realtime crossvalidation of interim residues, it is also advisable to seed the bitarrays differently for each run.
[3] For the randombitoffset scheme in [2], we further need a postprocessing step to determine the correct sign to apply to the final residue; this relates to the shiftdoubling modulo the binary exponent which occurs on each squaremod. For Mp, squaring a shifted residue R.2^s gives R^2.2^2s (mod Mp) and 2^2s == 2^(2sp) (mod Mp), so at each iteration we simply double the shift count (mod p). For arithmetic modulo Fm, by way of contrast, each time the doubled shift count incurs a reduction (mod 2^m) the shifted residue undergoes a change of sign: R^2.2^2s (mod Fm) == R^2.2^(2s2^m) (mod Fm), so a shiftedresidue scheme must keep track of both the shift count and the sign of the shifted residue, relative to the true unshifted one.
However: since  unlike LL  this PRP test involves just repeated modsquaring, we don't care if the sign of the residue on any particular intermediate iteration is flipped, since the subsequent modsquaring will flip it back to +. The only thing we care about is whether the sign of the final residue (either for the Pépin test or for a particular checkpoint subinterval thereof) has flipped w.r.to that of its immediate predecessor, the penultimate residue, since such a finalsquaring signflip has no ensuing modsquaring to autocorrect it. Thus, if there is a finalsquaring sign flip we need to unflip the sign of the resulting residue manually. But that is not all, since generating a shiftremoved final residue also involves a modular multiply or divide by a power of 2. I've always found it conceptually easier to multiplymod rather to dividemod, so for both that and lesscodetomaintain reasons, my mi64 multiword int library has lowlevel code only for leftward circular shift, and effects sbit rightward circular shift of a bbit number by a (bs)bit leftward one. That is all that is needed working modulo a Mersenne number, as these are of form 2^p1 and thus any bits offshifted to the left get rewrapped from the right with + sign. Once again, Fermat moduli are trickier. Modulo Fm, a leftward circular shift must feed the bits offshifted to the left back into the right end with a  sign, and the aforementioned shrcdoneasshlc substitution implies a negation (mod Fm) of the resulting shiftremoved residue.
Thus, if at end of our current checkpoint interval or Pépin test, our residue R needed a signflip due the doubledshiftcount needing a reduction (mod 2^m), our shrcdoneasshlc already took care of it. If not, the residue needs an explicit negation. Rather than doing an explicit Fm  R with the attendant digitwise negations and borrows, one can simply do a bitwisecomplement of the residue vector and further += 2 of limb 0 in order to recover the unshifted residue.
Last fiddled with by ewmayer on 20210804 at 00:42
Reason: ficks speelink
