mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Factoring

Old 2019-08-21, 14:44   #166
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

1000001001₂ Posts
Default

Back...


I am using gcc version 7.2.0 from mingw64.


Did you compare your rho and ecm on my test files?
I think the test data may make a difference. My files contain semiprimes N = a*b where the smaller factor may range from cbrt(N) to sqrt(N). This is my favorite test set because semiprimes whose smaller factor is near cbrt(N) seem to be the most difficult for the Hart and Lehman algorithms. If I am not mistaken, you use semiprimes where both factors have the same bit size?


Just a guess, but rho might be faster on my test set because it is better than ecm at finding small factors of small arguments...
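For reference, the rho variant under discussion is Pollard's rho with Brent's cycle detection. A generic textbook sketch (using the gcc/clang __uint128_t extension for the modular multiply; the implementations being benchmarked replace this with Montgomery multiplication, which is the part being tuned):

```c
#include <stdint.h>

static uint64_t gcd64(uint64_t a, uint64_t b) {
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* 64-bit modular multiply via the gcc/clang 128-bit extension */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
    return (uint64_t)((__uint128_t)a * b % n);
}

/* Brent's variant: the saved point x is only updated at power-of-two
   indices, so each iteration costs one squaring instead of Floyd's two.
   Returns a (not necessarily prime) divisor of n; returns n itself on a
   rare full-cycle collision, in which case one retries with another c. */
uint64_t pollard_brent(uint64_t n, uint64_t c) {
    uint64_t x = 2;   /* snapshot y_{2^k - 1} */
    uint64_t y = 2;   /* current iterate */
    uint64_t r = 1;   /* current block length (power of two) */
    uint64_t d = 1;
    while (d == 1) {
        x = y;
        for (uint64_t i = 0; i < r && d == 1; i++) {
            y = (mulmod(y, y, n) + c) % n;
            uint64_t diff = (x > y) ? x - y : y - x;
            d = gcd64(diff, n);
        }
        r <<= 1;
    }
    return d;
}
```

Production versions also batch the gcd over many differences; this sketch takes a gcd every step for clarity.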


Another possibility is that my rho implementation is pretty good.
EDIT: Probably not, as you wrote that you ported my rho implementation...


EDIT 2: You never know what Java does under the hood. Lots of AVX stuff. A 1-to-1 port may not be equivalent.

Last fiddled with by Till on 2019-08-21 at 14:50 Reason: AVX
Till is offline   Reply With Quote
Old 2019-08-21, 14:57   #167
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

E8B₁₆ Posts
Default

Quote:
Originally Posted by Till View Post
Did you compare your rho and ecm on my test files?
My rho and your rho, and my ecm and your slightly modified ecm (which perform the same), all in C, on both my test data and your test data.

Quote:
Originally Posted by Till View Post
I think the test data may make a difference. My files contain semiprimes N = a*b where the smaller factor may range from cbrt(N) to sqrt(N). This is my favorite test set because semiprimes whose smaller factor is near cbrt(N) seem to be the most difficult for the Hart and Lehman algorithms. If I am not mistaken, you use semiprimes where both factors have the same bit size?


Just a guess, but rho might be faster on my test set because it is better than ecm at finding small factors of small arguments...
ECM is pretty good at finding small factors too :) Anyway, this is exactly what I wanted to test. At least on my machine, ecm was faster than either rho implementation on both test sets.

Quote:
Originally Posted by Till View Post
Another possibility is that my rho implementation is pretty good ;-)
Yes, and not wanting to discount this, I tested both... yours and mine perform similarly in C.

I've been using linux gcc 7.3.0. I think I have mingw somewhere on a windows system with a similar cpu, so I might be able to have a look at it.
bsquared is offline   Reply With Quote
Old 2019-08-21, 17:45   #168
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

1000001001₂ Posts
Default

Quote:
Originally Posted by bsquared View Post
Can you try (if you haven't already), something like this:
Code:
long submod(long x, long y, long n) {
    long r0;
    r0 = x-y;
    if (r0 > x) {
        r0 += n;
    }
    return r0;
}

long addmod(long x, long y, long n) {
    long r0;
    r0 = x+y;
    if (r0 >= n) {
        r0 -= n;
    }
    return r0;
}
There are about as many addmod and submod calls in the code as mulmods, so this might make a difference if Java is issuing division ops for the '%'s.

Now I tried it and got a big surprise.
I had to adjust the submod() a bit, but then I obtained a factor-1.5 speedup! (combined from the addMod() and subMod() improvements)
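Presumably the needed adjustment is the signed form of the tests: Java's long is signed, so the unsigned wraparound check `r0 > x` becomes a sign check. A sketch in C with signed 64-bit types, assuming operands already reduced to [0, n) with n < 2^62 (a guess at the adjustment, not Till's actual code):

```c
#include <stdint.h>

/* Signed variant of the suggested submod: a borrow shows up as r0 < 0. */
int64_t submod_s(int64_t x, int64_t y, int64_t n) {
    int64_t r0 = x - y;
    if (r0 < 0)        /* wrap back into [0, n) */
        r0 += n;
    return r0;
}

/* Signed addmod: since x, y < n, one conditional subtraction suffices. */
int64_t addmod_s(int64_t x, int64_t y, int64_t n) {
    int64_t r0 = x + y;
    if (r0 >= n)
        r0 -= n;
    return r0;
}
```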


Seems as if my profiling attempts were not very cunning. Maybe jvisualvm is not very good at such tasks (using too much CPU time itself?), or maybe it was me.


Thanks a lot for the advice!
Till is offline   Reply With Quote
Old 2019-08-21, 18:34   #169
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

521 Posts
Default

TinyEcm is now my best (Java) algorithm for N from 52 to 62 bits. Below 52 bits, Hart and Lehman rule (at equal pace). In Java, Thilo's and my latest Hart/Lehman implementations (Hart_Fast2Mult, Lehman_CustomKOrder) are a factor 1.4 faster than the Lehman_Fast you ported some time ago. If you port these, our numbers might get closer by another bit.


The above still does not explain why tinyecm.c does not perform that well on my computer, but I hope your examination on a similar machine will give some insights.
Till is offline   Reply With Quote
Old 2019-08-22, 15:20   #170
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

521 Posts
Default

Quote:
Originally Posted by bsquared View Post
I'm still wondering how the java rho-brent benchmarks are beating the C ecm benchmarks on your cpu.
Hey, I found it. I haven't worked much with C in the last 20 years and was using the default compiler settings of Eclipse CDT:
-O0 -g3 -Wall -c -fmessage-length=0

With
-O3 -g3 -Wall -c -fmessage-length=0
tinyecm.c is a factor 3 faster.

Now the data should look more agreeable. If you want I'll post an update of my performance comparison table.

Last fiddled with by Till on 2019-08-22 at 15:23 Reason: last sentence workover
Till is offline   Reply With Quote
Old 2019-08-22, 15:38   #171
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3·17·73 Posts
Default

Quote:
Originally Posted by Till View Post
Hey, I found it. I haven't worked much with C in the last 20 years and was using the default compiler settings of Eclipse CDT:
-O0 -g3 -Wall -c -fmessage-length=0

With
-O3 -g3 -Wall -c -fmessage-length=0
tinyecm.c is a factor 3 faster.

Now the data should look more agreeable. If you want I'll post an update of my performance comparison table.
Ah ha! Yes, that would do it. Yeah if it's not too much trouble it would be good to have the updated data posted in one place.

[edit]
Oh, and glad to hear that the addmod() and submod() changes had a nice impact. I suspected they might, because I ran into a similar thing when an earlier inefficient submod() implementation of mine caused the whole algorithm to slow down by at least 30%.

Last fiddled with by bsquared on 2019-08-22 at 15:44
bsquared is offline   Reply With Quote
Old 2019-08-22, 16:38   #172
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

521 Posts
Default

Quote:
Originally Posted by bsquared View Post
Ah ha! Yes, that would do it. Yeah if it's not too much trouble it would be good to have the updated data posted in one place.

[edit]
Oh, and glad to hear that the addmod() and submod() changes had a nice impact. I suspected they might, because I ran into a similar thing when an earlier inefficient submod() implementation of mine caused the whole algorithm to slow down by at least 30%.

Here is the updated table:
Code:
bits |    Hart Lehman1 Lehman2 Squfof1 Squfof2    Rho1    Rho2  ECM(J) | ECM(C)
-----+-----------------------------------------------------------------+-------
  42 |    1.01    1.01    1.37    2.29    3.85    2.35    2.59    4.31 |   1.24
  44 |    1.60    1.60    2.18    3.23    5.28    2.98    3.26    4.80 |   1.34
  46 |    2.56    2.48    3.48    4.84    7.34    3.95    4.29    6.93 |   1.50
  48 |    3.87    3.81    5.46    6.92   10.14    5.45    5.96    6.56 |   1.82
  50 |    6.33    6.38    8.67   11.82   16.49    7.74    8.48    7.84 |   2.09
  52 |   11.01   10.31   14.28   18.78   21.51   10.48   11.85    9.45 |   2.52
  54 |   16.00   16.20   22.90           29.37   14.09   15.59   12.87 |   3.27
  56 |   25.74   25.65   36.89           40.69   19.21   20.48   14.71 |   3.81
  58 |   40.84   41.20   59.31           56.01           27.64   18.70 |   4.86
  60 |   63.28                           77.57           37.71   22.71 |   5.90
  62 |   98.75                          106.21           49.30   28.82 |   7.20
Legend:
* Hart = class Hart_Fast2Mult(false)
* Lehman1 = class Lehman_CustomKOrder(false)
* Lehman2 = class Lehman_Fast(false)
* Squfof1 = class SquFoF31Preload; works for N <= 52 bits
* Squfof2 = class SquFoF63
* Rho1 = class PollardRhoBrentMontgomeryR64Mul63; too slow for N >= 58 bits
* Rho2 = class PollardRhoBrentMontgomery64
* ECM(J) = class TinyEcm; my port of tinyecm.c
* ECM(C) = tinyecm.c; slightly modified to work in my IDE and on my CPU


I'd say the table now reflects the effort you put into your assembly code. And my CPU seems to get some rehabilitation too, doesn't it?


Your advice on addmod() and submod() was pretty cool. Do you have some more? ;)
Trying to optimize u64div() was not that successful...
Till is offline   Reply With Quote
Old 2019-08-22, 17:29   #173
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

7213₈ Posts
Default

Quote:
Originally Posted by Till View Post

I'd say the table now reflects the effort you put into your assembly code. And my CPU seems to get some rehabilitation too, doesn't it?
Yes, it looks much better.

Quote:
Originally Posted by Till View Post

Your advice on addmod() and submod() was pretty cool. Do you have some more? ;)
Trying to optimize u64div() was not that successful...
My only suggestion at this point is to combine the operations of montMul64 into one method. montMul64 has 3 "new Uint128" allocations in it and 3 other method calls (to getLow and getHigh). Maybe Java is super efficient at object creation and cleanup, but if that stuff is anything more than zero overhead, then combining everything should be a win.

Also the method add_getHigh, on top of the "new" and the 2 method calls, seems needlessly complex. You could use the ideas of the new submod() routine as follows:
Code:
a = low + o_lo;
b = high + o_hi;

if (a < low)
    return b + 1;
else
    return b;

combining it all together in C looks like this:
Code:
uint64_t montMul64(uint64_t x, uint64_t y, uint64_t n, uint64_t nhat)
{
    /* 64x64 -> 128-bit product of x and y via 32-bit halves */
    uint64_t a_hi = x >> 32;
    uint64_t b_hi = y >> 32;
    uint64_t a_lo = x & 0xFFFFFFFFULL;
    uint64_t b_lo = y & 0xFFFFFFFFULL;

    uint64_t lo_prod = a_lo * b_lo;
    uint64_t med_prod1 = a_hi * b_lo;
    uint64_t med_prod2 = a_lo * b_hi;
    uint64_t med_term = med_prod1 + med_prod2;
    /* unsigned carry out of the cross-term sum has weight 2^96, i.e. 2^32 in the high word */
    uint64_t med_carry = (med_term < med_prod1) ? (1ULL << 32) : 0;
    uint64_t hi_prod = a_hi * b_hi;

    uint64_t xy_lo = lo_prod + (med_term << 32);
    uint64_t xy_hi = hi_prod + (med_term >> 32) + med_carry + (xy_lo < lo_prod);

    /* Montgomery reduction: u = xy*nhat mod 2^64, result = (xy + u*n) / 2^64 */
    uint64_t u = xy_lo * nhat;

    a_hi = u >> 32;
    b_hi = n >> 32;
    a_lo = u & 0xFFFFFFFFULL;
    b_lo = n & 0xFFFFFFFFULL;

    lo_prod = a_lo * b_lo;
    med_prod1 = a_hi * b_lo;
    med_prod2 = a_lo * b_hi;
    med_term = med_prod1 + med_prod2;
    med_carry = (med_term < med_prod1) ? (1ULL << 32) : 0;
    hi_prod = a_hi * b_hi;

    uint64_t un_lo = lo_prod + (med_term << 32);
    uint64_t un_hi = hi_prod + (med_term >> 32) + med_carry + (un_lo < lo_prod);

    uint64_t r_lo = un_lo + xy_lo;         /* low words sum to 0 mod 2^64 by construction... */
    uint64_t c = (r_lo < un_lo) ? 1 : 0;   /* ...but may carry into the high word */
    uint64_t r_hi = un_hi + xy_hi + c;
    if (r_hi >= n)
        r_hi -= n;
    return r_hi;
}
In C, the above is "only" 2x slower than my assembly code. But you are seeing a factor of 3x to 4x between C and Java (now that the C is fixed). Not all of that comes from montMul64, but probably a lot of it does. Anyway, hope this helps! If it does, then a similar optimization could be done to your montgomeryMult method in PollardRhoBrentMontgomery64.
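For anyone cross-checking a port, a compact reference using the gcc/clang __uint128_t extension can serve as ground truth. mont_nhat and mulredc are illustrative names made up here, not tinyecm.c's routines; the sketch assumes odd n < 2^63, which covers the <= 62-bit range discussed in this thread:

```c
#include <stdint.h>

/* nhat = -n^{-1} mod 2^64 for odd n, via Newton's iteration; each step
   doubles the number of correct low bits (3 -> 6 -> 12 -> 24 -> 48 -> 96). */
uint64_t mont_nhat(uint64_t n) {
    uint64_t x = n;             /* n is its own inverse mod 2^3 when odd */
    for (int i = 0; i < 5; i++)
        x *= 2 - n * x;         /* now x = n^{-1} mod 2^64 */
    return (uint64_t)0 - x;     /* negate to get -n^{-1} mod 2^64 */
}

/* Reference Montgomery multiply: returns x*y/2^64 mod n. Inputs in
   Montgomery form stay in Montgomery form. Assumes odd n < 2^63 so that
   T + m*n cannot overflow 128 bits. */
uint64_t mulredc(uint64_t x, uint64_t y, uint64_t n, uint64_t nhat) {
    __uint128_t T = (__uint128_t)x * y;
    uint64_t m = (uint64_t)T * nhat;                 /* m*n cancels T's low word */
    __uint128_t t = (T + (__uint128_t)m * n) >> 64;  /* t < 2n */
    return (t >= n) ? (uint64_t)(t - n) : (uint64_t)t;
}
```

Converting into Montgomery form is x*2^64 mod n (e.g. `((__uint128_t)x << 64) % n`), and `mulredc(z, 1, n, nhat)` converts back.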
bsquared is offline   Reply With Quote
Old 2019-08-22, 18:06   #174
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3723₁₀ Posts
Default

One last idea, although this didn't make any difference in C.

If you make a specialized montSqr64 then the beginning of the routine can be simplified a bit:

Code:
uint64_t a_hi = x >> 32;
uint64_t a_lo = x & 0xFFFFFFFFULL;

uint64_t lo_prod = a_lo * a_lo;
uint64_t med_prod1 = a_hi * a_lo;
uint64_t med_term = med_prod1 + med_prod1;
/* doubling the cross term overflows iff its top bit was set */
uint64_t med_carry = (med_term < med_prod1) ? (1ULL << 32) : 0;
uint64_t hi_prod = a_hi * a_hi;

uint64_t xy_lo = lo_prod + (med_term << 32);
uint64_t xy_hi = hi_prod + (med_term >> 32) + med_carry + (xy_lo < lo_prod);
after which it is the same as montMul64.

You'll also have to replace the appropriate calls in add() and dup() with montSqr64.

Last fiddled with by bsquared on 2019-08-22 at 18:09
bsquared is offline   Reply With Quote
Old 2019-08-22, 18:32   #175
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

521 Posts
Default

Thanks for all those suggestions, Ben!
I'll check them all, and also keep in mind post #158...
Till is offline   Reply With Quote
Old 2019-08-23, 11:48   #176
xilman
Bamboozled!
 
xilman's Avatar
 
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

37×313 Posts
Default

Quote:
Originally Posted by bsquared View Post
Can you try (if you haven't already), something like this:
Code:
long submod(long x, long y, long n) {
    long r0;
    r0 = x-y;
    if (r0 > x) {
        r0 += n;
    }
    return r0;
}

long addmod(long x, long y, long n) {
    long r0;
    r0 = x+y;
    if (r0 >= n) {
        r0 -= n;
    }
    return r0;
}
The conditional arithmetic may compile to mispredicted branches on some architectures and cause an overall slowdown. Sometimes it is faster to mask, shift, and add/subtract.

E.g.

Code:
#include <limits.h>   /* for CHAR_BIT */

long submod(long x, long y, long n) {
    long r0 = x - y;
    /* arithmetic right shift of the sign bit gives an all-ones mask iff r0 < 0 */
    return r0 + (n & (r0 >> (sizeof(long) * CHAR_BIT - 1)));
}

/* Check this carefully, I typed it in but have not fed it into a compiler !!! */
Have you experimented along these lines? A decent compiler will fold the shift amount to a constant. You may wish to (very carefully) convert it to a macro or, better, give the compiler an inline directive.

This trick is also widely used for constant-time arithmetic to avoid side-channel attacks in security-aware software.
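A verified form of the masked trick, for signed 64-bit operands already reduced to [0, n) with n < 2^62 (note it relies on arithmetic right shift of negative values, which is implementation-defined in C but universal on mainstream compilers):

```c
#include <stdint.h>

/* Branchless submod: the sign bit of r, smeared across the word by an
   arithmetic shift, selects whether to add n back. */
int64_t submod_masked(int64_t x, int64_t y, int64_t n) {
    int64_t r = x - y;
    return r + (n & (r >> 63));   /* mask is -1 iff r < 0 */
}

/* Branchless addmod: subtract n unconditionally, then add it back iff the
   result went negative (i.e. x + y < n). */
int64_t addmod_masked(int64_t x, int64_t y, int64_t n) {
    int64_t r = x + y - n;
    return r + (n & (r >> 63));
}
```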

Last fiddled with by xilman on 2019-08-23 at 11:50 Reason: Add final line
xilman is offline   Reply With Quote