2019-09-08, 13:01  #12 
Romulan Interpreter
"name field"
Jun 2011
Thailand
2^{4}·613 Posts 
We are, and we respect each other. At least, I respect Paul a lot! Newcomers and youngsters may be a little put off by these "blow-exchanges", or however one wants to call them, but it is nothing serious. This is part of the fun, to pull each other's sleeve when we make mistakes, because nobody is perfect. One can't really understand a movie like Last Vegas unless he reaches that "age of understanding".

2019-09-08, 16:36  #13  
Bamboozled!
"ð’‰ºð’ŒŒð’‡·ð’†·ð’€"
May 2003
Down not across
25426_{8} Posts 
Quote:
The ability successfully to decide whether or not to take one of my posts seriously is a sign of maturity. LaurV, despite his extreme youth, shows clear evidence of maturity. Last fiddled with by xilman on 2019-09-08 at 16:53 Reason: Add final sentence 

2019-09-08, 20:32  #14 
∂^{2}ω=0
Sep 2002
República de California
2^{2}·3·7·139 Posts 
Try processing multiple TF moduli side-by-side in order to mitigate MUL latency. That is easier when one is working with atomic operations (as opposed to your 128-bit intrinsic stuff) and/or in asm, i.e. when one has access to the underlying hardware-arithmetic sequence and can structure things to best use available registers. But you should still get a decent boost in the context of your high-level coding paradigm.

2019-09-10, 07:08  #15  
Apr 2019
CD_{16} Posts 
Quote:
Code:
#define H64(x) (x>>64)
#define L64(x) ((__uint128_t)((uint64_t)(x)))
#define _mulshr128(xh,xl,yh,yl) (xh*yh+(((((xh*yl)>>2)+((xl*yh)>>2)+(H64(xl*yl)>>2)))>>62))
#define MULSHR128(x,y) _mulshr128(H64(x),L64(x),H64(y),L64(y))

inline __uint128_t reduce(__uint128_t mu, __uint128_t x, uint64_t m)
{
    __uint128_t r = x - MULSHR128(x, mu)*m;
    while (unlikely(r >= m)) r -= m;
    return r;
}

// Simplified version for first reduction, where x = 2^(e0), e0 > 64.
// This means that multiplying x*mu can be done with just a left shift.
// Since we only want the top 128 bits of this product,
// mu_shr is mu shifted right by (128-e0) ***expected to be done by caller***
inline __uint128_t reduce_e0(__uint128_t mu_shr, __uint128_t x, uint64_t m)
{
    __uint128_t r = x - mu_shr*m;
    while (unlikely(r >= m)) r -= m;
    return r;
}

uint64_t pow2mod_barrett(uint64_t e, uint64_t m)
{
    __uint128_t mu = (~__uint128_t(0))/m;
    int lz = __builtin_clzl(e);
    int bits_remain = (lz > 57) ? 0 : 57 - lz;
    int e0 = e >> bits_remain; // take leading 7 bits
    __uint128_t temp = reduce_e0(mu >> (128-e0), (__uint128_t)1 << e0, m);
    for (; bits_remain >= 6;) {
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        bits_remain -= 6;
        temp = reduce(mu, temp << ((e >> bits_remain) & 63), m);
    }
    switch (bits_remain) {
        case 5: temp = reduce(mu, temp*temp, m);
        case 4: temp = reduce(mu, temp*temp, m);
        case 3: temp = reduce(mu, temp*temp, m);
        case 2: temp = reduce(mu, temp*temp, m);
        case 1: temp = reduce(mu, temp*temp, m);
            return reduce(mu, temp << ((e & ((1 << bits_remain) - 1))), m);
    }
    return temp;
}
Quote:
I worked some more on my k sieving code, and reduced some overhead. I've been testing/benchmarking the code against known factors downloaded from mersenne.ca. So far the largest workload I've tried is all exponents from 1277 to 1,000,000,000, from 39 to 40 bits. I have a separate program to verify the results, and it finds exactly all factors in that range (plus a few composites, but I can just filter those in post). It did this in 74m29s on a single thread of my 2700X @ 4.0GHz. Not sure if that's good or not, but it definitely beats the PARI script I had been using. Extrapolating as 2x per bit-level increase, it would only take me 2378 thread-years to go up to 64 bits. No idea how TJAOI does it (custom GPU code and some serious hardware, maybe?), but damn! Actually I've found the time per bit increase doesn't quite scale 2x, it's a little less, but still... Anyway, I think I can improve it a bit more with a better/existing presieve, which I plan to focus on next. 

2019-09-10, 15:22  #16  
"Tilman Neumann"
Jan 2016
Germany
13·37 Posts 
Quote:
Thanks for that. Looking up the "Barrett reduction" Wikipedia entry told me that this is what Thilo and I were doing in our "fast trial division by inverse multiplication". What we had was the floating-point multiplication variant. Switching to integers makes another nice improvement. 
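For anyone following along from the Wikipedia entry: the integer variant boils down to one precomputed reciprocal per modulus and one high multiply per reduction. A minimal sketch (my own illustration, not Tilman's or hansl's actual code; the function names are made up):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Precompute mu ~= floor(2^64 / m) once per modulus m > 1. */
uint64_t barrett_mu(uint64_t m) { return ~(uint64_t)0 / m; }

/* x mod m via one high multiply instead of a hardware divide.
   q = floor(x*mu / 2^64) underestimates x/m by at most a small constant,
   so a short correction loop finishes the job. */
uint64_t barrett_mod(uint64_t x, uint64_t m, uint64_t mu) {
    uint64_t q = (uint64_t)(((u128)x * mu) >> 64); /* q ~= x / m        */
    uint64_t r = x - q * m;                        /* r < small multiple of m */
    while (r >= m) r -= m;
    return r;
}
```

For example, barrett_mod(123456789, 97, barrett_mu(97)) returns 123456789 % 97 = 39; the divide only happens once in barrett_mu, so repeated reductions by the same m avoid the DIV latency entirely.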

2019-09-12, 15:57  #17  
Apr 2019
11001101_{2} Posts 
Quote:
I tend to run mprime P-1, LL-D, or PRP tasks on all cores (no hyperthreads), and use the spare hyperthreads for my extra TF stuff. I briefly tried to add vectorization/SIMD using OpenMP pragmas but didn't have any luck making things go faster. It's my first time messing with that, so I'm likely doing many things wrong. 

2019-09-13, 19:38  #18  
∂^{2}ω=0
Sep 2002
República de California
2^{2}×3×7×139 Posts 
Quote:
Code:
MUL_LOHI(x,y,&lo,&hi); // 2B-bit product x*y, output halves into lo and hi
MULL(qinv,lo,lo);      // lo = qinv*lo, low B-bit product
lo = MULH(q,lo);       // High B bits of 2B-bit product q*lo
In a typical hardware implementation, each of those MULs has 4-5 cycle latency. So in the context of e.g. Mersenne TF work, where we do a bunch of modular powerings 2^p (mod q) for various factor candidates (values of q), it makes sense to feed, say, 4 such q's at a time to a 4-way modpow loop, meaning that there will be 4 independent ones of each of the above 3 MULs:
Code:
MUL_LOHI(x0,y0,&lo0,&hi0);
MUL_LOHI(x1,y1,&lo1,&hi1);
MUL_LOHI(x2,y2,&lo2,&hi2);
MUL_LOHI(x3,y3,&lo3,&hi3);

MULL(qinv0,lo0,lo0);
MULL(qinv1,lo1,lo1);
MULL(qinv2,lo2,lo2);
MULL(qinv3,lo3,lo3);

lo0 = MULH(q0,lo0);
lo1 = MULH(q1,lo1);
lo2 = MULH(q2,lo2);
lo3 = MULH(q3,lo3);
Thus, say, if latency = 4 cycles and one MUL can start every cycle, the result of the MUL_LOHI(x0,y0,&lo0,&hi0) becomes available just as it is needed for the ensuing MULL(qinv0,lo0,lo0), and so forth. 
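As a concrete illustration of that 3-MUL chain (a sketch I put together, not code from this thread; B = 64, using gcc/clang's __int128 for the double-width product, with qinv = q^(-1) mod 2^64 from a Newton iteration):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* qinv = q^-1 mod 2^64 for odd q, by Newton iteration
   (correct bits double per step: 3 -> 6 -> 12 -> 24 -> 48 -> 96). */
static inline uint64_t inv64(uint64_t q) {
    uint64_t x = q;
    for (int i = 0; i < 5; ++i) x *= 2 - q * x;
    return x;
}

/* One Montgomery multiply, x*y*2^-64 mod q, as the 3-MUL chain:
   full-width product, low-half multiply by qinv, high-half multiply by q. */
static inline uint64_t montmul(uint64_t x, uint64_t y, uint64_t q, uint64_t qinv) {
    u128 prod = (u128)x * y;               /* MUL_LOHI(x,y,&lo,&hi) */
    uint64_t lo = (uint64_t)prod;
    uint64_t hi = (uint64_t)(prod >> 64);
    lo *= qinv;                            /* MULL(qinv,lo,lo)      */
    lo = (uint64_t)(((u128)q * lo) >> 64); /* lo = MULH(q,lo)       */
    uint64_t r = hi - lo;
    return hi < lo ? r + q : r;            /* fold back into [0,q)  */
}

/* Four independent chains: with 4-5 cycle MUL latency and one MUL
   issued per cycle, the four streams keep the multiplier busy. */
static inline void montmul_x4(const uint64_t x[4], const uint64_t y[4],
                              const uint64_t q[4], const uint64_t qinv[4],
                              uint64_t r[4]) {
    for (int i = 0; i < 4; ++i)
        r[i] = montmul(x[i], y[i], q[i], qinv[i]);
}
```

Whether the compiler actually schedules the unrolled x4 loop like the hand-interleaved listing above is worth confirming in the disassembly.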

2019-09-20, 16:50  #19 
Apr 2019
5×41 Posts 
OK, thanks for the explanation, I understand better now.
I made a new version of my MULSHR128 routine (128x128, keeping the top 128 bits of the result), which should be more correct (and I think marginally faster) than my previous one that shifted away lsb's before carrying: Code:
inline __uint128_t MULSHR128(__uint128_t x, __uint128_t y)
{
    __uint128_t xl = L64(x);
    __uint128_t xh = H64(x);
    __uint128_t yl = L64(y);
    __uint128_t yh = H64(y);
    __uint128_t xhyl = xh*yl;
    __uint128_t xlyh = xl*yh;
    uint64_t xlyl_h = H64(xl*yl);
    unsigned long long z;
    return xh*yh + H64(xhyl) + H64(xlyh)
         + __builtin_uaddll_overflow(xhyl, xlyh, &z)
         + __builtin_uaddll_overflow(z, xlyl_h, &z);
}
So in my latest code, pow2mod is a functor class now, using operator()(...) to handle multiple moduli in parallel. I wrote parallel versions for x4, x3, x2, and x1. I'm showing just the "x1" version for brevity, but the others basically consist of the same individual lines repeated with a 0,1,2,3 suffix on each variable name. Also merged the above code of MULSHR128 into the reduce function: Code:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

constexpr inline __uint128_t H64(__uint128_t x) { return x >> 64; }
constexpr inline __uint128_t L64(__uint128_t x) { return uint64_t(x); }

struct pow2mod
{
    __uint128_t pow_init;
    uint8_t sqrs[6];
    uint8_t e_chunks[6];
    uint8_t mu_shr_init;
    uint8_t chunk_count;

    // store deconstructed exponent for simplified powmod, base 2
    pow2mod(uint64_t e)
    {
        int lz = __builtin_clzl(e);
        int bits_remain = (lz > 57) ? 0 : 57 - lz;
        int e0 = e >> bits_remain;
        this->pow_init = __uint128_t(1) << e0;
        this->mu_shr_init = 128 - e0;
        int n = 0;
        // shift chunks of bits, 6 at a time, into msb of e
        if (bits_remain) {
            for (e <<= lz + 7; e; ++n, e <<= 6) {
                lz = __builtin_clzl(e);
                e <<= lz;
                bits_remain -= lz + 6;
                sqrs[n] = lz + 6;
                e_chunks[n] = e >> 58; // grab top 6 bits
            }
        }
        // exponent more than 7 bits total
        if (n > 0) {
            e_chunks[n-1] >>= -bits_remain;
            sqrs[n-1] += bits_remain; // adjust squarings for next-to-last chunk
        }
        chunk_count = n;
    }

private:
    inline void reduce_x1(__uint128_t &x0, uint64_t mu_l0, uint64_t mu_h0, uint64_t m0)
    {
        __uint128_t xl0 = L64(x0);
        __uint128_t xh0 = H64(x0);
        __uint128_t xlyh0 = xl0 * mu_h0;
        __uint128_t xhyl0 = xh0 * mu_l0;
        __uint128_t xlyl_h0 = H64(xl0*mu_l0);
        __uint128_t xhyh0 = xh0 * mu_h0;
        unsigned long long z0;
        xhyh0 = (xhyh0 + H64(xlyh0) + H64(xhyl0)
               + __builtin_uaddll_overflow(xhyl0, xlyh0, &z0)
               + __builtin_uaddll_overflow(z0, xlyl_h0, &z0));
        xhyh0 *= m0;
        x0 -= xhyh0;
        while (unlikely(x0 >= m0)) x0 -= m0;
    }

public:
    inline uint8_t operator()(uint64_t m0)
    {
        constexpr __uint128_t B = ~__uint128_t(0);
        __uint128_t mu0 = B / m0;
        uint64_t mu_l0 = mu0;
        uint64_t mu_h0 = mu0 >> 64;
        // initial reduction for first 7 bits of exponent
        __uint128_t temp0 = mu0 >> mu_shr_init;
        temp0 *= m0;
        temp0 = this->pow_init - temp0;
        while (unlikely(temp0 >= m0)) temp0 -= m0;
        for (size_t i = 0; i < chunk_count; ++i) {
            for (uint8_t j = 0; j < sqrs[i]; ++j) {
                // temp = (temp * temp) % m;
                temp0 *= temp0;
                reduce_x1(temp0, mu_l0, mu_h0, m0);
            }
            // temp = (temp << chunks[i]) % m;
            temp0 <<= e_chunks[i];
            reduce_x1(temp0, mu_l0, mu_h0, m0);
        }
        // check if each result == 1
        return (temp0 == 1);
    }
};
I'm still trying to figure out a nice way to control flow between my old way of pulling one candidate at a time out of the sieve/classes and calling this parallel version. I've also done some profiling with perf, and realized I can see the effects of this latency from overall cycles spent on instructions, or more specifically by recording the event "perf record -e stalled-cycles-backend ...". With the above code, I noticed that 20% of overall time is spent in the initial "mu = 2^128 / modulus" calculation, which calls a builtin named "__udivti3". The disassembly of this in perf shows a few div instructions very close one after the other, so it's fairly non-ideal in terms of latency. Not sure how I could implement my own parallelized version of that particular operation, though. Are there any special tricks when the numerator is a constant power of 2? 
2019-09-21, 01:50  #20 
Apr 2019
5×41 Posts 
I did some more timing of the various "parallel" computation functions I wrote, and found that x1, x3, and x4 all give basically the same performance.
Only x2 seems to benefit, with a ~16% increase in speed. edit: Here are some "perf stat" results for the x2 version, still with the dumb "try every k" code Code:
Performance counter stats for './merstf 2900000000 2900000100 55 60':

        146,871.08 msec task-clock                #    1.000 CPUs utilized
            15,371      context-switches          #    0.105 K/sec
                38      cpu-migrations            #    0.000 K/sec
               195      page-faults               #    0.001 K/sec
   585,354,021,276      cycles                    #    3.985 GHz                     (83.33%)
       706,590,899      stalled-cycles-frontend   #    0.12% frontend cycles idle    (83.33%)
   455,069,579,192      stalled-cycles-backend    #   77.74% backend cycles idle     (83.33%)
 1,496,220,929,283      instructions              #    2.56  insn per cycle
                                                  #    0.30  stalled cycles per insn (83.33%)
    54,473,802,597      branches                  #  370.895 M/sec                   (83.33%)
       624,591,586      branch-misses             #    1.15% of all branches         (83.33%)

     146.939449114 seconds time elapsed

     146.823542000 seconds user
       0.031982000 seconds sys
Last fiddled with by hansl on 2019-09-21 at 01:57 
2019-09-21, 12:44  #21  
"Graham uses ISO 8601"
Mar 2014
AU, Sydney
11·23 Posts 
Quote:
There are some forums that I visit that would delete a message like this and the prior ones as off-topic. mersenneforum.org moderators have demonstrated a more relaxed attitude. p.s. I took up the challenge proposed by LaurV and have viewed the video Last Vegas (2013). It didn't evoke any awkward feelings at all. 

2019-10-30, 11:38  #22 
Mar 2013
7 Posts 
I think I may have a slightly different method for doing Mersenne number trial division that you can test in your project. This method can be summarized as "negative-power base-2 modular exponentiation using Montgomery reduction." Instead of computing Mod(2,m)^q and testing whether the result is 1, we can compute Mod(2,m)^(-q) and test whether the result is 1.
The idea is to view Montgomery reduction as adding a multiple of the modulus to the accumulator so that the 64 least significant bits of the accumulator become zeros, then right-shifting the accumulator by 64 bits. This right-shift operation has the effect of multiplying the accumulator by 2^(-64). We want to utilize this effect to bring the exponent more and more negative, towards our target. Code:
Example with exponent = 33816457; we will compute 2^(-33816457).
Start with 2^0 = 1.
Perform Montgomery reduction; accumulator now holds 2^(-64).

In the following table, sqmont means "square accumulator, then perform
Montgomery reduction"; this has the "2x-64" effect on the exponent.
div2 means "divide the accumulator by 2"; this has the "minus 1" effect
on the exponent.

Operation  Exponent        Operation  Exponent
sqmont     -192            div2       -193
sqmont     -450            div2       -451
sqmont     -966            div2       -967
sqmont     -1998           div2       -1999
sqmont     -4062           div2       -4063
sqmont     -8190           div2       -8191
sqmont     -16446          div2       -16447
sqmont     -32958          div2       -32959
sqmont     -65982          div2       -65983
sqmont     -132030         div2       -132031
sqmont     -264126         div2       -264127
sqmont     -528318         div2       -528319
sqmont     -1056702
sqmont     -2113468
sqmont     -4227000        div2       -4227001
sqmont     -8454066
sqmont     -16908196
sqmont     -33816456       div2       -33816457
Besides trial division for Mersenne numbers, this method can be applied to the base-2 Fermat primality test. 
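Here is one way the sqmont/div2 walk could look in code. This is my reading of the method, not m4rtyn's implementation, and it assumes odd m < 2^63; writing n = e + 64, each sqmont doubles and each div2 increments the (negated) exponent target, so walking the bits of n below its leading 7-bit chunk lands exactly on -e:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Montgomery reduction: t * 2^-64 mod m, for odd m < 2^63,
   with minv = -m^-1 mod 2^64. */
static uint64_t redc(u128 t, uint64_t m, uint64_t minv) {
    uint64_t q = (uint64_t)t * minv;  /* zeroes the low 64 bits of t + q*m */
    uint64_t r = (uint64_t)((t + (u128)q * m) >> 64);
    return r >= m ? r - m : r;
}

/* Test m | 2^e - 1 by computing 2^(-e) mod m and comparing with 1. */
static int divides_2e_minus_1(uint64_t e, uint64_t m) {
    uint64_t minv = m;                /* Newton: m^-1 mod 2^64, m odd */
    for (int i = 0; i < 5; ++i) minv *= 2 - m * minv;
    minv = -minv;                     /* REDC wants -m^-1 mod 2^64 */

    uint64_t n = e + 64;              /* negated target exponent + 64 */
    int t = 0;
    while ((n >> t) > 127) ++t;       /* leading chunk z0 in [64,127] */
    uint64_t z0 = n >> t;
    /* after one reduction, acc = 2^(64 - z0), i.e. target value z0 */
    uint64_t acc = redc((u128)1 << (128 - z0), m, minv);

    while (t--) {
        acc = redc((u128)acc * acc, m, minv);  /* sqmont: exponent 2x-64 */
        if ((n >> t) & 1)                      /* div2:   exponent -1    */
            acc = (acc >> 1) + ((acc & 1) ? (m >> 1) + 1 : 0);
    }
    return acc == 1;                  /* 2^(-e) == 1  <=>  m | 2^e - 1 */
}
```

As a sanity check, divides_2e_minus_1(11, 23) and divides_2e_minus_1(67, 193707721) should both report a factor (2047 = 23·89, and 193707721 is Cole's famous factor of 2^67-1), while divides_2e_minus_1(11, 3) should not.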