mersenneforum.org Some code for factoring numbers up to 2^63

 2019-08-21, 14:44 #166 Till     "Tilman Neumann" Jan 2016 Germany 1000001001₂ Posts Back... I am using gcc version 7.2.0 from mingw64. Did you compare your rho and ecm on my test files? I think the test data may make a difference. My files contain semiprimes N=a*b where the smaller factor may range from cbrt(N) to sqrt(N). This is my favorite test set because factors near cbrt(N) seem to be the most difficult for the Hart and Lehman algorithms. If I am not mistaken, you use semiprimes where both factors have the same bit size? Just a guess, but it might be that rho is faster on my test set because it is better at finding small factors of small arguments than ecm... Another possibility is that my rho implementation is pretty good. EDIT: Probably not, as you wrote that you ported my rho implementation... EDIT 2: You never know what Java does under the hood. Lots of AVX stuff. A 1-to-1 port may not be equivalent. Last fiddled with by Till on 2019-08-21 at 14:50 Reason: AVX
2019-08-21, 14:57   #167
bsquared

"Ben"
Feb 2007

E8B₁₆ Posts

Quote:
 Originally Posted by Till Did you compare your rho and ecm on my test files?
My rho and your rho, and my ecm and your slightly modified ecm (which perform the same), all in C, on both my testdata and your testdata.

Quote:
 Originally Posted by Till I think that the test data may make a difference. My files contain semiprimes N=a*b where the lower factor may range from cbrt(N) to sqrt(N). This is my favorite test set because N near cbrt(N) seem to be the most difficult for Hart and Lehman algorithms. If I am not mistaken then you use semiprimes where both factors have the same bit size? Just a guess, but it might be that rho is faster on my testset because it is better in finding small factors of small arguments than ecm...
ECM is pretty good at finding small factors too :) Anyway, this is exactly what I wanted to test. At least on my machine, ecm was faster than either rho implementation on both test sets.

Quote:
 Originally Posted by Till Another possibility is that my rho implementation is pretty good ;-)
Yes, and not wanting to discount this, I tested both... yours and mine perform similarly in C.

I've been using linux gcc 7.3.0. I think I have mingw somewhere on a windows system with a similar cpu, so I might be able to have a look at it.

2019-08-21, 17:45   #168
Till

"Tilman Neumann"
Jan 2016
Germany

1000001001₂ Posts

Quote:
 Originally Posted by bsquared Can you try (if you haven't already), something like this:
Code:
long submod(long x, long y, long n) {
	long r0 = x - y;
	if (r0 > x) {
		r0 += n;
	}
	return r0;
}

long addmod(long x, long y, long n) {
	long r0 = x + y;
	if (r0 >= n) {
		r0 -= n;
	}
	return r0;
}
There are about as many addmod and submod calls in the code as mulmods, so this might make a difference if Java is issuing division ops for the '%'s.

Now I tried it and got a big surprise.
I had to adjust submod() a bit, but then I obtained a factor-1.5 speedup (combined from the addMod() and subMod() improvements)!

Seems my profiling attempts were not very cunning. Maybe jvisualvm is not very good at such tasks (using too much CPU time itself?), or maybe it was me who wasn't.

Thanks a lot for the advice!
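For reference, the compare-and-correct routines can be sketched in C with unsigned arithmetic (hypothetical names, not Till's actual Java code; in Java, long is signed, so the wraparound test needs rephrasing, e.g. via Long.compareUnsigned):

```c
#include <assert.h>
#include <stdint.h>

/* Compare-and-correct modular add/sub for 0 <= x, y < n, avoiding '%'.
   The unsigned wraparound test 'r0 > x' detects that x - y went below
   zero. A sketch of the adjusted routines, not code from the thread. */
static uint64_t submod64(uint64_t x, uint64_t y, uint64_t n)
{
    uint64_t r0 = x - y;
    if (r0 > x)          /* subtraction wrapped around */
        r0 += n;
    return r0;
}

static uint64_t addmod64(uint64_t x, uint64_t y, uint64_t n)
{
    uint64_t r0 = x + y;
    if (r0 >= n || r0 < x)   /* reduce; 'r0 < x' also catches wraparound for n near 2^64 */
        r0 -= n;
    return r0;
}
```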

 2019-08-21, 18:34 #169 Till     "Tilman Neumann" Jan 2016 Germany 521 Posts TinyEcm is now my best (Java) algorithm for N from 52 to 62 bits. Below 52 bits, Hart and Lehman are ruling (at equal pace). In Java, Thilo's and my latest Hart/Lehman implementations (Hart_Fast2Mult, Lehman_CustomKOrder) are a factor 1.4 faster than the Lehman_Fast you ported a while ago. If you port these, our numbers might get a bit closer. The above still does not explain why tinyecm.c does not perform that well on my computer, but I hope your examination on a similar machine will give some insights.
2019-08-22, 15:20   #170
Till

"Tilman Neumann"
Jan 2016
Germany

521 Posts

Quote:
 Originally Posted by bsquared I'm still wondering how the java rho-brent benchmarks are beating the C ecm benchmarks on your cpu.
Hey, I found it. I haven't worked much with C in the last 20 years and was using the default compiler settings of Eclipse CDT:
-O0 -g3 -Wall -c -fmessage-length=0

With
-O3 -g3 -Wall -c -fmessage-length=0
tinyecm.c is a factor of 3 faster.

Now the data should look more agreeable. If you want I'll post an update of my performance comparison table.

Last fiddled with by Till on 2019-08-22 at 15:23 Reason: last sentence workover

2019-08-22, 15:38   #171
bsquared

"Ben"
Feb 2007

3·17·73 Posts

Quote:
 Originally Posted by Till Hey, I found it. I haven't worked much with C in the last 20 years and was using the default compiler settings of Eclipse CDT: -O0 -g3 -Wall -c -fmessage-length=0 With -O3 -g3 -Wall -c -fmessage-length=0 tinyecm.c is a factor of 3 faster. Now the data should look more agreeable. If you want I'll post an update of my performance comparison table.
Ah ha! Yes, that would do it. Yeah if it's not too much trouble it would be good to have the updated data posted in one place.

Oh, and glad to hear that the addmod() and submod() changes had a nice impact. I suspected they might, because I ran into a similar thing when an earlier inefficient submod() implementation of mine caused the whole algorithm to slow down by at least 30%.

Last fiddled with by bsquared on 2019-08-22 at 15:44

2019-08-22, 16:38   #172
Till

"Tilman Neumann"
Jan 2016
Germany

521 Posts

Quote:
 Originally Posted by bsquared Ah ha! Yes, that would do it. Yeah if it's not too much trouble it would be good to have the updated data posted in one place.  Oh, and glad to hear that the addmod() and submod() changes had a nice impact. I suspected they might, because I ran into a similar thing when an earlier inefficient submod() implementation of mine caused the whole algorithm to slow down by at least 30%.

Here is the updated table:
Code:
bits    | Hart     Lehman1   Lehman2   Squfof1   Squfof2   Rho1      Rho2      ECM(J)   | ECM(C)
--------+------------------------------------------------------------------------------+-------
42      | 1.01     1.01      1.37      2.29      3.85      2.35      2.59      4.31     | 1.24
44      | 1.60     1.60      2.18      3.23      5.28      2.98      3.26      4.80     | 1.34
46      | 2.56     2.48      3.48      4.84      7.34      3.95      4.29      6.93     | 1.50
48      | 3.87     3.81      5.46      6.92      10.14     5.45      5.96      6.56     | 1.82
50      | 6.33     6.38      8.67      11.82     16.49     7.74      8.48      7.84     | 2.09
52      | 11.01    10.31     14.28     18.78     21.51     10.48     11.85     9.45     | 2.52
54      | 16.00    16.20     22.90               29.37     14.09     15.59     12.87    | 3.27
56      | 25.74    25.65     36.89               40.69     19.21     20.48     14.71    | 3.81
58      | 40.84    41.20     59.31               56.01               27.64     18.70    | 4.86
60      | 63.28                                  77.57               37.71     22.71    | 5.90
62      | 98.75                                  106.21              49.30     28.82    | 7.20
Legend:
* Hart = class Hart_Fast2Mult(false)
* Lehman1 = class Lehman_CustomKOrder(false)
* Lehman2 = class Lehman_Fast(false)
* Squfof1 = class SquFoF31Preload; works for N<=52 bit
* Squfof2 = class SquFoF63
* Rho1 = class PollardRhoBrentMontgomeryR64Mul63; too slow for N>=58 bit
* Rho2 = class PollardRhoBrentMontgomery64
* ECM(J) = class TinyEcm; my port of tinyecm.c
* ECM(C) = tinyecm.c; slightly modified to work in my IDE and on my CPU

I'd say the table now reflects the effort you put into your assembler code. And yes, my CPU seems to be somewhat rehabilitated too, doesn't it?

Your advice on addmod() and submod() was pretty cool. Do you have some more? ;)

Trying to optimize u64div() was not that successful...

2019-08-22, 17:29   #173
bsquared

"Ben"
Feb 2007

7213₈ Posts

Quote:
 Originally Posted by Till I'd say the table now reflects the effort you put into your assembler code. And yes, my CPU seems to be somewhat rehabilitated too, doesn't it?
Yes, it looks much better.

Quote:
 Originally Posted by Till Your advice on addmod() and submod() was pretty cool. Do you have some more? ;) Trying to optimize u64div() was not that successful...
My only suggestion at this point is to combine the operations of montMul64 into one method. montMul64 has 3 "new Uint128" in it and 3 other method calls (to getLow and getHigh). Maybe Java is super efficient at object creation and cleanup but if that stuff is anything more than zero overhead then combining everything should be a win.

Also the method add_getHigh, on top of the "new" and the 2 method calls, seems needlessly complex. You could use the ideas of the new submod() routine as follows:
a = low + o_lo;
b = high + o_hi;

if (a < low)
	return b + 1;
else
	return b;
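The carry idea above, written out as a standalone 128-bit add (a sketch with hypothetical names, not code from the thread):

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit add over (hi, lo) pairs: the carry out of the low words is
   detected by the unsigned wraparound test (sum < addend), so no
   hardware carry flag is needed. */
static void add128(uint64_t *hi, uint64_t *lo, uint64_t o_hi, uint64_t o_lo)
{
    uint64_t a = *lo + o_lo;
    uint64_t b = *hi + o_hi;
    if (a < *lo)     /* low-word addition wrapped: propagate the carry */
        b += 1;
    *lo = a;
    *hi = b;
}
```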

combining it all together in C looks like this:
Code:
uint64_t montMul64(uint64_t x, uint64_t y, uint64_t n, uint64_t nhat)
{
	uint64_t a_hi = x >> 32;
	uint64_t b_hi = y >> 32;
	uint64_t a_lo = x & 0xFFFFFFFFULL;
	uint64_t b_lo = y & 0xFFFFFFFFULL;

	/* 64x64 -> 128-bit product of x and y from 32x32 partial products.
	   With unsigned types the carries are detected by wraparound tests;
	   a '< 0' comparison on a uint64_t would always be false. */
	uint64_t lo_prod = a_lo * b_lo;
	uint64_t med_prod1 = a_hi * b_lo;
	uint64_t med_prod2 = a_lo * b_hi;
	uint64_t med_term = med_prod1 + med_prod2;
	uint64_t c1 = (med_term < med_prod1) ? 1 : 0;	/* cross sum carried */
	uint64_t hi_prod = a_hi * b_hi;

	uint64_t s = (lo_prod >> 32) + med_term;
	uint64_t c2 = (s < med_term) ? 1 : 0;		/* fold-in carried */
	uint64_t xy_hi = (s >> 32) + hi_prod + ((c1 + c2) << 32);
	uint64_t xy_lo = (med_term << 32) + lo_prod;

	uint64_t u = xy_lo * nhat;

	/* 64x64 -> 128-bit product of u and n, same scheme */
	a_hi = u >> 32;
	b_hi = n >> 32;
	a_lo = u & 0xFFFFFFFFULL;
	b_lo = n & 0xFFFFFFFFULL;

	lo_prod = a_lo * b_lo;
	med_prod1 = a_hi * b_lo;
	med_prod2 = a_lo * b_hi;
	med_term = med_prod1 + med_prod2;
	c1 = (med_term < med_prod1) ? 1 : 0;
	hi_prod = a_hi * b_hi;

	s = (lo_prod >> 32) + med_term;
	c2 = (s < med_term) ? 1 : 0;
	uint64_t un_hi = (s >> 32) + hi_prod + ((c1 + c2) << 32);
	uint64_t un_lo = (med_term << 32) + lo_prod;

	/* r = (xy + u*n) / 2^64; the low halves cancel to zero by construction */
	uint64_t r_lo = un_lo + xy_lo;
	uint64_t c = (r_lo < un_lo) ? 1 : 0;
	uint64_t r_hi = un_hi + xy_hi + c;
	if (r_hi >= n)
		r_hi -= n;
	return r_hi;
}
In C, the above is "only" 2x slower than my assembly code. But you are seeing a factor of 3x to 4x between C and Java (now that the C is fixed). Not all of that comes from montMul64, but probably a lot of it does. Anyway, hope this helps! If it does, then a similar optimization could be done to your montgomeryMult method in PollardRhoBrentMontgomery64.
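For anyone porting this, a compact reference using gcc/clang's __uint128_t is handy for cross-checking a 32x32 schoolbook version like the one above. The helper names below are illustrative, not from tinyecm.c, and the sketch assumes n odd and n < 2^63 (tinyecm's range), so the 128-bit sum cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Reference Montgomery multiply with R = 2^64: returns x*y*R^{-1} mod n. */
static uint64_t montmul_ref(uint64_t x, uint64_t y, uint64_t n, uint64_t nhat)
{
    __uint128_t t = (__uint128_t)x * y;
    uint64_t u = (uint64_t)t * nhat;         /* u = t * (-n^{-1}) mod 2^64 */
    __uint128_t s = t + (__uint128_t)u * n;  /* low 64 bits are now zero  */
    uint64_t r = (uint64_t)(s >> 64);
    return (r >= n) ? r - n : r;
}

/* nhat = -n^{-1} mod 2^64 by Newton iteration; n must be odd. */
static uint64_t compute_nhat(uint64_t n)
{
    uint64_t x = n;              /* n is its own inverse mod 8 for odd n */
    int i;
    for (i = 0; i < 5; i++)
        x *= 2 - n * x;          /* each step doubles the correct bits   */
    return (uint64_t)0 - x;
}
```

The defining invariant montmul_ref(x, y, n, nhat) * 2^64 ≡ x*y (mod n) can then be asserted against 128-bit arithmetic for random inputs.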

 2019-08-22, 18:06 #174 bsquared     "Ben" Feb 2007 3723₁₀ Posts One last idea, although this didn't make any difference in C. If you make a specialized montSqr64 then the beginning of the routine can be simplified a bit:
Code:
uint64_t a_hi = x >> 32;
uint64_t a_lo = x & 0xFFFFFFFFULL;

uint64_t lo_prod = a_lo * a_lo;
uint64_t med_prod1 = a_hi * a_lo;
uint64_t med_term = med_prod1 + med_prod1;
uint64_t hi_prod = a_hi * a_hi;

uint64_t xy_hi = (((lo_prod >> 32) + med_term) >> 32) + hi_prod;
if (med_prod1 >> 63) xy_hi += 1ULL << 32;   /* doubling overflowed 64 bits */
uint64_t xy_lo = (med_term << 32) + lo_prod;
after which it is the same as montMul64. You'll also have to replace the appropriate calls in add() and dup() with montSqr64. Last fiddled with by bsquared on 2019-08-22 at 18:09
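To sanity-check the squaring shortcut, the 128-bit front end can be written with explicit carry tracking and compared against __uint128_t. A sketch for illustration, not bsquared's exact code; the carry tests use unsigned wraparound rather than the signed '< 0' idiom:

```c
#include <assert.h>
#include <stdint.h>

/* Full 64x64 -> 128-bit squaring, as a specialized montSqr64 might
   start: only three 32x32 partial products, with the middle one
   doubled. The doubling carries exactly when med_prod's top bit is
   set, and folding in lo_prod's high half can carry once more. */
static void sqr64_full(uint64_t x, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_hi = x >> 32;
    uint64_t a_lo = x & 0xFFFFFFFFULL;

    uint64_t lo_prod = a_lo * a_lo;
    uint64_t med_prod = a_hi * a_lo;
    uint64_t med_term = med_prod + med_prod;     /* doubled cross product */
    uint64_t c1 = med_prod >> 63;                /* doubling overflowed?  */
    uint64_t hi_prod = a_hi * a_hi;

    uint64_t s = (lo_prod >> 32) + med_term;
    uint64_t c2 = (s < med_term) ? 1 : 0;        /* fold-in carried?      */

    *hi = (s >> 32) + hi_prod + ((c1 + c2) << 32);
    *lo = (med_term << 32) + lo_prod;
}
```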
 2019-08-22, 18:32 #175 Till     "Tilman Neumann" Jan 2016 Germany 521 Posts Thanks for all those suggestions, Ben! I'll check them all, and also keep in mind post #158...
2019-08-23, 11:48   #176
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

37×313 Posts

Quote:
 Originally Posted by bsquared Can you try (if you haven't already), something like this:
Code:
long submod(long x, long y, long n) {
	long r0 = x - y;
	if (r0 > x) {
		r0 += n;
	}
	return r0;
}

long addmod(long x, long y, long n) {
	long r0 = x + y;
	if (r0 >= n) {
		r0 -= n;
	}
	return r0;
}
The conditional arithmetic may compile to mis-predicted branches on some architectures and cause an overall slowdown. Sometimes it is faster to mask, shift and add/subtract.

E.g.

Code:
long submod(long x, long y, long n) {
	long r0 = x - y;
	/* arithmetic right shift broadcasts the sign bit: the mask is
	   all ones exactly when r0 < 0 (CHAR_BIT is in <limits.h>) */
	return r0 + (n & (r0 >> (sizeof (long) * CHAR_BIT - 1)));
}

/* Check this carefully, I typed it in but have not fed it into a compiler !!! */
Have you experimented along these lines? A decent compiler will fold the shift-count expression into a constant. You may wish to (very carefully) convert it to a macro or, better, give the compiler an inline directive.

This trick is also widely used for constant-time arithmetic to avoid side-channel attacks in security-aware software.
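A minimal compilable version of the mask trick (hypothetical name; assumes an arithmetic right shift of signed values, which is implementation-defined in C but what mainstream compilers provide):

```c
#include <assert.h>
#include <limits.h>

/* Branchless submod via sign extension: when x - y underflows, the
   arithmetic right shift spreads r0's sign bit into an all-ones mask,
   so n is added back; otherwise the mask is zero. Assumes 0 <= x, y < n. */
static long submod_branchless(long x, long y, long n)
{
    long r0 = x - y;
    long mask = r0 >> (sizeof (long) * CHAR_BIT - 1);   /* 0 or -1 */
    return r0 + (n & mask);
}
```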

Last fiddled with by xilman on 2019-08-23 at 11:50 Reason: Add final line
