20190821, 14:44  #166 
"Tilman Neumann"
Jan 2016
Germany
1000001001_{2} Posts 
Back...
I am using gcc version 7.2.0 from mingw64. Did you compare your rho and ecm on my test files? I think that the test data may make a difference. My files contain semiprimes N=a*b where the lower factor may range from cbrt(N) to sqrt(N). This is my favorite test set because N near cbrt(N) seem to be the most difficult for Hart and Lehman algorithms. If I am not mistaken then you use semiprimes where both factors have the same bit size? Just a guess, but it might be that rho is faster on my testset because it is better in finding small factors of small arguments than ecm... Another possibility is that my rho implementation is pretty good. EDIT: Probably not, as you wrote that you ported my rho implementation... EDIT 2: You never know what Java does below the hood. Lots of AVX stuff. A 1 to 1 port may not be equivalent. Last fiddled with by Till on 20190821 at 14:50 Reason: AVX 
20190821, 14:57  #167  
"Ben"
Feb 2007
E8B_{16} Posts 
My rho and your rho, and my ecm and your slightly modified ecm (which perform the same), all in C, on both my testdata and your testdata.
Quote:
Quote:
I've been using linux gcc 7.3.0. I think I have mingw somewhere on a windows system with a similar cpu, so I might be able to have a look at it. 

20190821, 17:45  #168  
"Tilman Neumann"
Jan 2016
Germany
1000001001_{2} Posts 
Quote:
Now I tried and got a big surprise. I had to adjust the submod() a bit but then I obtained a factor 1.5 speedup! (combined from addMod() and subMod() improvements) Seems as if my profiling attempts were not very cunning. Maybe jvisualvm is not very good on such tasks (using too much cpu time itself ?), maybe it was me who isn't. Thanks a lot for the advice! 

20190821, 18:34  #169 
"Tilman Neumann"
Jan 2016
Germany
521 Posts 
TinyEcm is my best (Java) algorithm now for N from 52 to 62 bit. Below 52 bits, Hart and Lehman are ruling (at equal pace). In Java, Thilo&me's latest Hart/Lehman implementations (Hart_Fast2Mult, Lehman_CustomKOrder) are factor 1.4 faster than the Lehman_Fast you ported time ago. If you port these, our numbers might get closer another bit.
The above still does not explain why timyecm.c does not work that good on my computer, but I hope your examination on a similar machine will give some insights. 
20190822, 15:20  #170  
"Tilman Neumann"
Jan 2016
Germany
521 Posts 
Quote:
O0 g3 Wall c fmessagelength=0 With O3 g3 Wall c fmessagelength=0 tinyecm.c is factor 3 faster. Now the data should look more agreeable. If you want I'll post an update of my performance comparison table. Last fiddled with by Till on 20190822 at 15:23 Reason: last sentence workover 

20190822, 15:38  #171  
"Ben"
Feb 2007
3·17·73 Posts 
Quote:
[edit] Oh, and glad to hear that the addmod() and submod() changes had a nice impact. I suspected they might, because I ran into a similar thing when an earlier inefficient submod() implementation of mine caused the whole algorithm to slow down by at least 30%. Last fiddled with by bsquared on 20190822 at 15:44 

20190822, 16:38  #172  
"Tilman Neumann"
Jan 2016
Germany
521 Posts 
Quote:
Here is the updated table: Code:
bits  Hart Lehman1 Lehman2 Squfof1 Squfof2 Rho1 Rho2 ECM(J)  ECM(C) ++ 42  1.01 1.01 1.37 2.29 3.85 2.35 2.59 4.31  1.24 44  1.60 1.60 2.18 3.23 5.28 2.98 3.26 4.80  1.34 46  2.56 2.48 3.48 4.84 7.34 3.95 4.29 6.93  1.50 48  3.87 3.81 5.46 6.92 10.14 5.45 5.96 6.56  1.82 50  6.33 6.38 8.67 11.82 16.49 7.74 8.48 7.84  2.09 52  11.01 10.31 14.28 18.78 21.51 10.48 11.85 9.45  2.52 54  16.00 16.20 22.90 29.37 14.09 15.59 12.87  3.27 56  25.74 25.65 36.89 40.69 19.21 20.48 14.71  3.81 58  40.84 41.20 59.31 56.01 27.64 18.70  4.86 60  63.28 77.57 37.71 22.71  5.90 62  98.75 106.21 49.30 28.82  7.20 * Hart = class Hart_Fast2Mult(false) * Lehman1 = class Lehman_CustomKOrder(false) * Lehman2 = class Lehman_Fast(false) * Squfof1 = class SquFoF31Preload; works for N<=52 bit * Squfof2 = class SquFoF63 * Rho1 = class PollardRhoBrentMontgomeryR64Mul63; too slow for N>=58 bit * Rho2 = class PollardRhoBrentMontgomery64 * ECM(J) = class TinyEcm; my port of tinyecm.c * ECM(C) = tinyecm.c; slightly modified to work in my IDE and on my CPU I'ld say that now the table is reflecting the effort you put into your assembler code. And yes, my CPU seems to obtain some rehabilitation, too, or not? Your advive on addmod() and submod() was pretty cool. Do you have some more ? ;) Trying to optimize u64div() was not that successful... 

20190822, 17:29  #173  
"Ben"
Feb 2007
7213_{8} Posts 
Quote:
Quote:
Also the method add_getHigh, on top of the "new" and the 2 method calls, seems needlessly complex. You could use the ideas of the new submod() routine as follows: a = low + o_lo; b = high + o_hi; if (a < low) return b + 1 else return b combining it all together in C looks like this: Code:
uint64_t montMul64(uint64_t x, uint64_t y, uint64_t n, uint64_t nhat) { uint64_t a_hi = x >> 32; uint64_t b_hi = y >> 32; uint64_t a_lo = x & 0xFFFFFFFFULL; uint64_t b_lo = y & 0xFFFFFFFFULL; uint64_t lo_prod = a_lo * b_lo; uint64_t med_prod1 = a_hi * b_lo; uint64_t med_prod2 = a_lo * b_hi; uint64_t med_term = med_prod1 + med_prod2; uint64_t c = 0; uint64_t hi_prod = a_hi * b_hi; uint64_t xy_hi = (((lo_prod >> 32) + med_term) >> 32) + hi_prod; if ((med_prod1 < 0 && med_prod2 < 0)  ((med_prod1 < 0  med_prod2 < 0) && med_term >= 0)) xy_hi += 1ULL << 32; uint64_t xy_lo = (med_term << 32) + lo_prod; uint64_t u = xy_lo * nhat; a_hi = u >> 32; b_hi = n >> 32; a_lo = u & 0xFFFFFFFFULL; b_lo = n & 0xFFFFFFFFULL; lo_prod = a_lo * b_lo; med_prod1 = a_hi * b_lo; med_prod2 = a_lo * b_hi; med_term = med_prod1 + med_prod2; hi_prod = a_hi * b_hi; uint64_t un_hi = (((lo_prod >> 32) + med_term) >> 32) + hi_prod; if ((med_prod1 < 0 && med_prod2 < 0)  ((med_prod1 < 0  med_prod2 < 0) && med_term >= 0)) un_hi += 1ULL << 32; uint64_t un_lo = ((med_term & 0xFFFFFFFFULL) << 32) + lo_prod; uint64_t r_lo = un_lo + xy_lo; c = 0; if (r_lo < un_lo) c = 1; uint64_t r_hi = un_hi + xy_hi + c; if (r_hi >= n) r_hi = n; return r_hi; } 

20190822, 18:06  #174 
"Ben"
Feb 2007
3723_{10} Posts 
One last idea, although this didn't make any difference in C.
If you make a specialized montSqr64 then the beginning of the routine can be simplified a bit: Code:
uint64_t a_hi = x >> 32; uint64_t a_lo = x & 0xFFFFFFFFULL; uint64_t lo_prod = a_lo * a_lo; uint64_t med_prod1 = a_hi * a_lo; uint64_t med_term = med_prod1 + med_prod1; uint64_t c = 0; uint64_t hi_prod = a_hi * a_hi; uint64_t xy_hi = (((lo_prod >> 32) + med_term) >> 32) + hi_prod; if (med_prod1 < 0) xy_hi += 1ULL << 32; uint64_t xy_lo = (med_term << 32) + lo_prod; You'll also have to replace the appropriate calls in add() and dup() with montSqr64. Last fiddled with by bsquared on 20190822 at 18:09 
20190822, 18:32  #175 
"Tilman Neumann"
Jan 2016
Germany
521 Posts 
Thanks for all those suggestions, Ben!
I'll check them all, and also keep in mind post #158... 
20190823, 11:48  #176  
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
37×313 Posts 
Quote:
E.g. Code:
long addmod(long x, long y, long n) { long r0 = xy; return r0 + (n & ((nr0) >> (sizeof (long) * sizeof (char))); } /* Check this carefully, I typed it in but have not fed it into a compiler !!! */ This trick is also widely used for constanttime arithmetic to avoid sidechannel attacks in securityaware software. Last fiddled with by xilman on 20190823 at 11:50 Reason: Add final line 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Factoring Mersenne numbers  paulunderwood  Miscellaneous Math  18  20170827 14:56 
Factoring Big Numbers In c#  ShridharRasal  Factoring  10  20080320 17:17 
Question about factoring code  Peter Nelson  Software  9  20050330 18:28 
Factoring Fermat Numbers  Axel Fox  Software  14  20030704 18:57 
Questions about SSE2 code and Factoring  Joe O  Software  2  20020913 23:39 