View Single Post
Old 2021-07-02, 18:57   #81
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

3·7·13·43 Posts
Default Fermat-modmul with circularly shifted residues

I spent an appreciable amount of time last year wrestling with basic concepts and implementation-in-code of Fermat-mod squaring chains in the presence of residue shift. This proves rather more interesting (a polite euphemism for "pain in the butt") than for the Mersenne-mod case. The Gerbicz check renders it less important, but it may still prove useful in some contexts, so figured I'd post a summary.

Let s0 = initial shift, i.e. instead of starting the Pépin test of Fm with initial seed x0, we start with (x0 << s0) (mod Fm).

[1] Data layout: for Mersenne-mod IBDWT, we have 2 different wordsizes differing by one bit, so need to loop over the elements in our variable-wordsize array until we find the one containing the (s0)th bit, then shift x0 by the appropriate small number of bits to place it on the proper bit boundary within that target word. For Fermat-mod, the easy case is that of power-of-2 FFT length, for which we have a fixed power-of-2 wordsize. More generally we will have a non-power-of-2 FFT length optimized for the target modulus - e.g. for F24 we can use 896 Kdoubles, and for F30 I did one run at 60M and the other (now 93% complete) at 64M. In both cases we use a special acyclic-transform-effecting DWT which multiplies our N input words by the first N (2N)th complex roots of unity, i.e. the (j)th input gets multiplied by DWT weight w[j] = exp(I*Pi*j/N). This means multiplying our original real input word by a complex number, but we notice that w[j+N/2] = I*w[j], making it natural to group the (j)th and (j+N/2)th input words together in memory and treat them as a single input to a complex FFT. This is the so-called "right-angle transform" trick, and is detailed in the original Crandall/Fagin 1994 DWT paper. In the Mersenne-mod IBDWT case we can also treat adjacent even/odd-indexed imput elements as a single input to a complex FFT, but as the Mersenne-mod IBDWT weights are strictly real, these are simply pairs of adjacent input words.

For Fermat-mod arithmetic using a non-power-of-2-length transform we need to layer a Mersenne-style dual-wordlength base representation and similarly defined IBDWT weights atop our acyclic-effecting DWT. One simplification relative to the Mersenne case is that for FFT length [odd]*2^k, the Fermat-mod wordsizes and IBDWT weights recur in repeating length-[odd] patterns.

In the context of a right-angle Fermat-mod acyclic transform, finding the target word for a given initial-seed shift means looping first over the even-indexed elements of our residue vector (considered here as a length-N real array), then if the accumulated wordsizes are still less than the target shift s0, looping over the odd-indexed elements until we reach the target word. For the Fermat-mod case we need only do this for the initial seed of the Pépin test; for the Mersenne-mod case, specifically the Lucas-Lehmer test, we must also repeat the exercise on each subsequent iteration in order to find the proper bit offset for injection of the -2 term added to the outout of each mod-squaring output. This can be done very efficiently - specifically in O(1) time - by initializing a suitable bitmap encoding the variable wordsizes of the Mersenne-mod IBDWT and using that to effect a fast lookup scheme.

[2] For a basic squaring-chain-with-shift scheme as we do for Mersenne number M(p), each mod-squaring doubles the shift count (mod p). That means that for any s0 != 0 (mod p) the shift will remain nonzero and nonrepeating during the LL/PRP-test, since at squaring i, our shift s = s0.2^i (mod p) and the multiplicative group (mod p) is of maximal order p-1.

OTOH for squaring chains modulo a Fermat number Fm, each subsequent squaring doubles the shift count (mod 2^m), so we have the problem that any nonzero initial shift s0 will lead to a shift s = 0 after precisely m square-mods, and the shift will remain 0 throughout the remainder of the squaring chain. One way to get around this problem is to define an auxiliary pseudorandom-bit array, and on the (i)th square-mod, if bit[i] = 1, do a further mod-doubling of the residue, thus yielding the shift-count update s[i] = 2.s[i-1] + bit[i]. For a pair of Pépin tests of Fm such as are typically done to allow for real-time cross-validation of interim residues, it is also advisable to seed the bit-arrays differently for each run.

[3] For the random-bit-offset scheme in [2], we further need a postprocessing step to determine the correct sign to apply to the final residue; this relates to the shift-doubling modulo the binary exponent which occurs on each square-mod. For Mp, squaring a shifted residue R.2^s gives R^2.2^2s (mod Mp) and 2^2s == 2^(2s-p) (mod Mp), so at each iteration we simply double the shift count (mod p). For arithmetic modulo Fm, by way of contrast, each time the doubled shift count incurs a reduction (mod 2^m) the shifted residue undergoes a change of sign: R^2.2^2s (mod Fm) == -R^2.2^(2s-2^m) (mod Fm), so a shifted-residue scheme must keep track of both the shift count and the sign of the shifted residue, relative to the true unshifted one.

However: since - unlike LL - this PRP test involves just repeated mod-squaring, we don't care if the sign of the residue on any particular intermediate iteration is flipped, since the subsequent mod-squaring will flip it back to +. The only thing we care about is whether the sign of the final residue (either for the Pépin test or for a particular checkpoint subinterval thereof) has flipped w.r.to that of its immediate predecessor, the penultimate residue, since such a final-squaring sign-flip has no ensuing mod-squaring to autocorrect it. Thus, if there is a final-squaring sign flip we need to unflip the sign of the resulting residue manually. But that is not all, since generating a shift-removed final residue also involves a modular multiply or divide by a power of 2. I've always found it conceptually easier to multiply-mod rather to divide-mod, so for both that and less-code-to-maintain reasons, my mi64 multiword int library has low-level code only for leftward circular shift, and effects s-bit rightward circular shift of a b-bit number by a (b-s)-bit leftward one. That is all that is needed working modulo a Mersenne number, as these are of form 2^p-1 and thus any bits off-shifted to the left get rewrapped from the right with + sign. Once again, Fermat moduli are trickier. Modulo Fm, a leftward circular shift must feed the bits off-shifted to the left back into the right end with a - sign, and the aforementioned shrc-done-as-shlc substitution implies a negation (mod Fm) of the resulting shift-removed residue.

Thus, if at end of our current checkpoint interval or Pépin test, our residue R needed a sign-flip due the doubled-shift-count needing a reduction (mod 2^m), our shrc-done-as-shlc already took care of it. If not, the residue needs an explicit negation. Rather than doing an explicit Fm - R with the attendant digitwise negations and borrows, one can simply do a bitwise-complement of the residue vector and further += 2 of limb 0 in order to recover the unshifted residue.

Last fiddled with by ewmayer on 2021-08-04 at 00:42 Reason: ficks speelink
ewmayer is offline   Reply With Quote