 mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Programming (https://www.mersenneforum.org/forumdisplay.php?f=29)
-   -   A fast C/C++ function for base-2 modular exponentation (https://www.mersenneforum.org/showthread.php?t=24755)

 hansl 2019-09-07 00:34

A fast C/C++ function for base-2 modular exponentation

Hello, I'm writing my own C++ program for Mersenne trial factoring and seeing how fast I can make it.

My program uses GMP lib for some of the math, but I found the "mpz_powm*" functions to be not particularly fast.

This is my first go at a function using "native types" (+ compiler extension __uint128_t ) and its limited to factors up to 64-bits:
[code]
// return Mod(2,m)^e
uint64_t pow2mod(uint64_t e, uint64_t m) {
int lz = __builtin_clzl(e);
int bits_remain = (lz > 58) ? 0 : 58 - lz;
__uint128_t temp = 1ul << ((e >> bits_remain) & 63);
for(; bits_remain >= 6; ) {
temp = (temp*temp) % m;
temp = (temp*temp) % m;
temp = (temp*temp) % m;
temp = (temp*temp) % m;
temp = (temp*temp) % m;
temp = (temp*temp) % m;
bits_remain -= 6;
temp = (temp*(1ul << ((e >> bits_remain) & 63))) % m;
}
switch(bits_remain) {
case 5: temp = (temp*temp) % m;
case 4: temp = (temp*temp) % m;
case 3: temp = (temp*temp) % m;
case 2: temp = (temp*temp) % m;
case 1: temp = (temp*temp) % m;
}
return (temp*(1ul << ((e & ((1<<bits_remain)-1))))) % m;
}
[/code]
I was pleased to find that it apparently produces correct results and improved my overall speed by ~84% over using gmp's mpz_powm_ui.

But I'm wondering what I could do to make it faster still? Without resorting to handcrafted assembly.

Or ideas to improve the upper range past 64bits (for the modulus of course, not the exponent) without sacrificing speed?

I know TF'ing past 64bits has pretty much already been covered for Primenet's exponents up to 1e9 (by TJAOI), so I'm specifically doing this for mersenne.ca's 1e10 range, much of which is still only factored to 55bits.

The program is still at a very early/rough stage, but I plan to share it all with source once its a bit more fleshed out. Its purpose is for fast bulk factoring over large ranges of exponents.

 xilman 2019-09-07 15:44

[QUOTE=hansl;525363]Hello, I'm writing my own C++ program for Mersenne trial factoring and seeing how fast I can make it.

My program uses GMP lib for some of the math, but I found the "mpz_powm*" functions to be not particularly fast.

This is my first go at a function using "native types" (+ compiler extension __uint128_t ) and its limited to factors up to 64-bits:
[code]
// return Mod(2,m)^e
uint64_t pow2mod(uint64_t e, uint64_t m) {
int lz = __builtin_clzl(e);
int bits_remain = (lz > 58) ? 0 : 58 - lz;
__uint128_t temp = 1ul << ((e >> bits_remain) & 63);
for(; bits_remain >= 6; ) {
temp = (temp*temp) % m;
temp = (temp*temp) % m;
temp = (temp*temp) % m;
temp = (temp*temp) % m;
temp = (temp*temp) % m;
temp = (temp*temp) % m;
bits_remain -= 6;
temp = (temp*(1ul << ((e >> bits_remain) & 63))) % m;
}
switch(bits_remain) {
case 5: temp = (temp*temp) % m;
case 4: temp = (temp*temp) % m;
case 3: temp = (temp*temp) % m;
case 2: temp = (temp*temp) % m;
case 1: temp = (temp*temp) % m;
}
return (temp*(1ul << ((e & ((1<<bits_remain)-1))))) % m;
}
[/code]
I was pleased to find that it apparently produces correct results and improved my overall speed by ~84% over using gmp's mpz_powm_ui.

But I'm wondering what I could do to make it faster still? Without resorting to handcrafted assembly.

Or ideas to improve the upper range past 64bits (for the modulus of course, not the exponent) without sacrificing speed?

I know TF'ing past 64bits has pretty much already been covered for Primenet's exponents up to 1e9 (by TJAOI), so I'm specifically doing this for mersenne.ca's 1e10 range, much of which is still only factored to 55bits.

The program is still at a very early/rough stage, but I plan to share it all with source once its a bit more fleshed out. Its purpose is for fast bulk factoring over large ranges of exponents.[/QUOTE]A minor point about the first comment. I believe you are returning (2^e) mod m, not (2 mod m) ^e.

One possibility is that instead of using (implied) division, perhaps it may be faster to multiply by the modular inverse of m.

Another: perhaps you can exploit the structure of m for the case of Mersenne factors.

 LaurV 2019-09-07 16:10

[QUOTE=xilman;525387]A minor point about the first comment. I believe you are returning (2^e) mod m, not (2 mod m) ^e.[/QUOTE]
You are both right, and we prefer hansl notation here. You old guy :paul: should learn some Pari :razz:
In programming languages and algebra systems a variable keeps its type (see Pari), regardless of what operations you do with it, therefore Mod(x, y) is a variable whose type is "mod" (which is different from the integer type). Mod(5,3) is not 2, but Mod(2,3). Similar, Mod(2,5)^3 is Mod(3,5), which in math notation is "2^3=3 (mod 5)". Which is right.

[CODE] GP/PARI CALCULATOR Version 2.11.1 (released)
amd64 running mingw (x86-64/GMP-6.1.2 kernel) 64-bit version
compiled: Nov 21 2018, gcc version 6.3.0 20170516 (GCC)
(readline v6.2 disabled, extended help enabled)

Copyright (C) 2000-2018 The PARI Group

PARI/GP is free software, covered by the GNU General Public License, and comes WITHOUT ANY WARRANTY WHATSOEVER.
Type ? for help, \q to quit.
Type ?17 for how to get moral (and possibly technical) support.
parisize = 10000000, primelimit = 1000000

(23:07:32) gp:> Mod(2,7)^5
%1 = Mod(4, 7)
(23:07:48) gp:>[/CODE]However, you are totally right about multiplying with the modular inverse, a Barrett reduction will work faster for exponentiation.

Exploiting the structure of m for mersenne factors - we don't understand what you mean, when he does TF, he does not do "mod m", it is just an "unfortunate" choice for the name of the variable, confusing with the "standard" notation. Using the standard notation, m=2^p-1, q=2kp+1 is a factor, what he need to repeatedly do, is "mod q" (i.e. his formal parameter is m, but he call the function with actual parameter q). And for q there is not much a structure he can exploit. Please enlighten us, maybe we are missing something.

 xilman 2019-09-07 16:48

[QUOTE=LaurV;525389]You are both right, and we prefer hansl notation here. You old guy :paul: should learn some Pari :razz:
In programming languages and algebra systems a variable keeps its type (see Pari), regardless of what operations you do with it, therefore Mod(x, y) is a variable whose type is "mod" (which is different from the integer type). Mod(5,3) is not 2, but Mod(2,3). Similar, Mod(2,5)^3 is Mod(3,5), which in math notation is "2^3=3 (mod 5)". Which is right.
[/QUOTE]True, but hansl explicitly stated that he was using C++ types, not Pari types. In raw C++ there is no native type "mod".

You young guy should learn how to read what is written, not what you believe ought to have been written.

 hansl 2019-09-07 17:44

Its just mathematical shorthand in comments, not a big deal.

Anyways if the comment was adhering to pure C++ the "^" operator would be interpreted as XOR instead of exponentiation too.

 hansl 2019-09-07 21:14

I've been trying some micro-optimizations to little success.

I thought since the same exponent is used over and over, I could pre-process it to some degree: count leading zeroes ahead of time, split into 6-bit chunks and pass pointer to array of those, etc. But none of those resulted in a speed improvement.

I only found one tiny change that seems to provide consistent improvement of about 0.75%

Replacing the multiplication by a shifted one, with just directly shifting the temp value:
[code]
(temp*(1ul << ((e >> bits_remain) & 63))) % m;
// becomes
(temp << ((e >> bits_remain) & 63)) % m;
[/code]

I've also started trying to implement Barrett reduction, but haven't quite figured out that algorithm yet. For example I'm not sure how to choose optimal word size and/or "k" value.

I'll keep working on it though, now that I'm out of other, easier ideas to try.

 hansl 2019-09-08 00:57

OK, I think I have the Barrett reduction working, and it looks like I got a 21% improvement over my previous implementation. Not bad, but I was hoping Barrett would boost things more than that.

There's one big 128x128 multiplication, which I might be doing in a really inefficient way. Not sure if there's a better way to extract the low and high words, or to handle overflows in the summation. Only the upper 128bits are kept for the result anyways, per the algorithm.
Not sure if my implementation is 100% correct with regard to error terms, but seems to work for the inputs I've been testing.

I based this on the definition from Handbook of Applied Cryptography Ch 14.
Algorithm 14.42
The chapters are available online: [url]http://cacr.uwaterloo.ca/hac/[/url]

I chose parameters b=2^64 and k=1, using the nomenclature from HAC. Then tried to simplify the math as much as I could based on those values.

[code]
#define H64(x) (x>>64)
#define L64(x) ((__uint128_t)((uint64_t)(x)))
#define _mulshr128(xh,xl,yh,yl) (xh*yh+(((((xh*yl)>>2)+((xl*yh)>>2)+(H64(xl*yl)>>2)))>>62))
#define MULSHR128(x,y) _mulshr128(H64(x),L64(x),H64(y),L64(y))

inline __uint128_t reduce( __uint128_t mu, __uint128_t x, uint64_t m) {
__uint128_t r = x - MULSHR128(x, mu)*m;
while (r >= m) r -= m;
return r;
}

uint64_t pow2mod_barrett(uint64_t e, uint64_t m)
{
// supposed to be b^2/m, but best I can do is (b^2-1)/m , difference should be negligible?
__uint128_t mu = (~__uint128_t(0))/m;
int lz = __builtin_clzl(e);
int bits_remain = (lz > 58) ? 0 : 58 - lz;
__uint128_t temp = 1ul << ((e >> bits_remain) & 63);
for (; bits_remain >= 6;)
{
temp = reduce(mu, temp*temp, m);
temp = reduce(mu, temp*temp, m);
temp = reduce(mu, temp*temp, m);
temp = reduce(mu, temp*temp, m);
temp = reduce(mu, temp*temp, m);
temp = reduce(mu, temp*temp, m);
bits_remain -= 6;
temp = reduce(mu, temp << ((e >> bits_remain) & 63), m);
}
switch (bits_remain)
{
case 5: temp = reduce(mu, temp*temp, m);
case 4: temp = reduce(mu, temp*temp, m);
case 3: temp = reduce(mu, temp*temp, m);
case 2: temp = reduce(mu, temp*temp, m);
case 1: temp = reduce(mu, temp*temp, m);
}
return reduce(mu, temp << ((e & ((1 << bits_remain) - 1))), m);
}
[/code]

 snme2pm1 2019-09-08 01:55

[QUOTE=xilman;525390]True, but hansl explicitly stated that he was using C++ types, not Pari types. In raw C++ there is no native type "mod".

You young guy should learn how to read what is written, not what you believe ought to have been written.[/QUOTE]

Entirely off topic, I had imagined that since xilman and LaurV both knew about mud mold capacitors like me, we must be of similar vintage?
p.s. Would it be warranted for us each to expose ourselves, I mean age wise!
Maybe it suffices to say that we are all 60+

 hansl 2019-09-08 03:35

Any other ideas for further improvement?

I'm wondering if there's any particular optimizations specific to "sqrmod" operations that I'm not taking advantage of.

Also, would it be worth trying Montgomery reduction instead of Barrett?

Or maybe I should focus my efforts on the sieving part of the program at this point.

 snme2pm1 2019-09-08 04:48

I've had experience with MR, but not Barrett, but note that Barrett appears to feature more highly in TF work.
[QUOTE=hansl;525436]Any other ideas for further improvement?

I'm wondering if there's any particular optimizations specific to "sqrmod" operations that I'm not taking advantage of.

Also, would it be worth trying Montgomery reduction instead of Barrett?

Or maybe I should focus my efforts on the sieving part of the program at this point.[/QUOTE]

 Prime95 2019-09-08 04:49

The very first temp*temp can be eliminated (shift the starting value more). The very first reduce can probably be simplified because x is a power of two.

All times are UTC. The time now is 02:26.