#67
"Ben"
Feb 2007
3,361 Posts

Quote:
Expanding on the example from before, with B1=7M on 2^1277-1:

avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve
gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve

Now, each avx-ecm stage 2 curve is 2.8 times less likely to find a factor, but if you had 64 seconds to spend doing something, would you rather run 2 gmp-ecm stage 2 curves or 8 avx-ecm stage 2 curves? Even with each avx-ecm curve being 2.8 times less likely to find a factor, you'd be better off.

Quote:
It gets interesting with generic inputs, when the throughput advantage of avx-ecm is not as great. In that case, I think it makes sense to run avx-ecm stage 1 followed by 8 gmp-ecm stage 2's: you get all of the probability advantage of gmp-ecm stage 2 but with avx-ecm's higher stage 1 throughput. I am putting together some data on this case.
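To put a number on that tradeoff, here is a minimal back-of-the-envelope sketch in C using the stage 2 timings above. It is purely illustrative (not part of avx-ecm or gmp-ecm); "effective curves" just means curves weighted by the quoted 2.8x relative probability.

Code:
#include <stdio.h>

int main(void)
{
    /* Stage 2 timings quoted above for B1=7M on 2^1277-1:
       gmp-ecm: 32 sec per curve; avx-ecm: 55 sec for a batch of 8 curves. */
    const double rel_prob = 1.0 / 2.8;      /* an avx-ecm stage 2 curve is ~2.8x
                                               less likely to find a factor     */

    /* In roughly 64 seconds you can run 2 gmp-ecm stage 2 curves,
       or one avx-ecm batch of 8 stage 2 curves (with time to spare). */
    double gmp_effective = 2.0;             /* each gmp-ecm curve counts as 1    */
    double avx_effective = 8.0 * rel_prob;  /* 8 curves, each worth ~1/2.8 curve */

    printf("gmp-ecm stage 2: %.2f effective curves\n", gmp_effective); /* 2.00  */
    printf("avx-ecm stage 2: %.2f effective curves\n", avx_effective); /* ~2.86 */
    return 0;
}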

#68
"Ben"
Feb 2007
3,361 Posts
Here are a couple of other things to keep in mind:

1) For lower-end Skylake processors the speedups are not as dramatic, with or without the special Mersenne math. I ran on a 7800x and only saw about a 2.1x speedup for 2^1277-1. More testing/benchmarking is needed on a variety of avx512-capable processors. The good news is that this situation will only improve for avx-ecm as time goes on. I plan to implement enhancements as soon as I can get my hands on an Ice Lake CPU, or whatever CPU supports AVX-512IFMA.

2) avx-ecm's arithmetic comes in fixed steps of 208 bits, so a curve at, say, 416 bits takes just as long as a curve at 623 bits. gmp-ecm, on the other hand, adjusts the size of its arithmetic in 64-bit steps. This could mean fine-tuning any crossover calculations for given input sizes (a sketch of how that granularity falls out of the input size follows below).
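To illustrate the 208-bit granularity, here is a small standalone sketch of how the fixed word counts fall out of the input size. It is my own illustration, not avx-ecm's code; the rounding rule (next multiple of 208 above the input size) is inferred from the MAXBITS/NWORDS/NBLOCKS lines printed in the runs later in this thread, so the real code may differ in detail.

Code:
#include <stdio.h>

#define DIGITBITS 52            /* bits per vector word, per avx-ecm's banner */
#define STEP (4 * DIGITBITS)    /* 208-bit granularity: 4 words per block     */

int main(void)
{
    int inputs[] = { 416, 623, 941, 1277 };
    for (int i = 0; i < 4; i++) {
        int bits    = inputs[i];
        /* next multiple of 208 strictly above the input size (inferred rule) */
        int maxbits = ((bits / STEP) + 1) * STEP;
        int nwords  = maxbits / DIGITBITS;
        int nblocks = nwords / 4;
        printf("input %4d bits -> MAXBITS %4d, NWORDS %2d, NBLOCKS %d\n",
               bits, maxbits, nwords, nblocks);
    }
    return 0;
}

With this rule, 416-bit and 623-bit inputs both land on 624-bit arithmetic (hence the same runtime), and a 941-bit input lands on MAXBITS 1040, NWORDS 20, NBLOCKS 5, matching the output shown later in the thread.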

#69
"Curtis"
Feb 2005
Riverside, CA
107778 Posts
Thank you for rewarding my "I'm confused so I'll state the obvious and see where I'm wrong"! This makes much sense, and I'm all clear now.

#70
"Ben"
Feb 2007
3,361 Posts
I updated the first post with a table of timings and some analysis for a generic 900-bit input and a couple of different Mersenne inputs.

It looks like a hybrid plan is best for most B1 levels. The relative speeds of avx-ecm and gmp-ecm should be checked on the particular CPU beforehand, and the crossovers adjusted accordingly.

Last fiddled with by bsquared on 2020-11-03 at 19:50

#71
"Ben"
Feb 2007
3,361 Posts
Now processing 2^n+1 similarly to 2^n-1 (with a special modulo).

Here's a run verifying some of the factors of 2^941+1:

Code:
./avx-ecm "(2^941+1)/3/1738969" 8 3000000 starting process 60820 commencing parallel ecm on 3562975409753765647916625929286022733708067351160837396219090783379741715638222929662078302565476897014358133541083282322395686863751130956979946597434060607614269433132169774980382077527684835263114705184392392863404860166373197969977946080763606872357617994374214244244159779 ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64 Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941 cached 3001134 primes < 49999991 Input has 919 bits, using 1 threads (8 curves/thread) Processing in batches of 100000000 primes Using special Mersenne mod for factor of: 2^941+1 Initialization took 0.0899 seconds. Cached 5761455 primes in range [2 : 99999989] commencing curves 0-7 of 8 Building curves took 0.0001 seconds. commencing Stage 1 @ prime 2 accumulating prime 2943173 Stage 1 completed at prime 2999999 with 6040073 point-adds and 565931 point-doubles Stage 1 took 19.2999 seconds found factor 5062155107501579 in stage 1 in thread 0, vec position 0, with sigma = 424412553419277870 found factor 5062155107501579 in stage 1 in thread 0, vec position 5, with sigma = 8913334498632421225 commencing stage 2 at p=3000017, A=3000690 w = 1155, R = 480, L = 16, umax = 9240, amin = 1299 found 5317482 primes in range [99999989 : 199999963] commencing stage 2 at p=100000007, A=99965250 w = 1155, R = 480, L = 16, umax = 9240, amin = 43275 found 5173389 primes in range [199999963 : 299999957] commencing stage 2 at p=199999991, A=199979010 w = 1155, R = 480, L = 16, umax = 9240, amin = 86571 found 53 primes in range [299999957 : 300000997] commencing stage 2 at p=299999977, A=299974290 w = 1155, R = 480, L = 16, umax = 9240, amin = 129859 Stage 2 took 14.7954 seconds performed 9087802 pair-multiplies for 16035509 primes in stage 2 performed 266472 point-additions and 57 point-doubles in stage 2 found factor 5062155107501579 in stage 2 in thread 0, vec position 0, with sigma = 424412553419277870 found factor 5062155107501579 in stage 2 in thread 0, vec position 1, with sigma = 15371561253067696485 found factor 5062155107501579 in stage 2 in thread 0, vec position 2, with sigma = 13475164758719783696 found factor 1053336016261649316809 in stage 2 in thread 0, vec position 4, with sigma = 333700535869040322 found factor 5062155107501579 in stage 2 in thread 0, vec position 5, with sigma = 8913334498632421225 Process took 34.3731 seconds. Last fiddled with by bsquared on 2020-11-05 at 15:17 |

#72
"Ben"
Feb 2007
3,361 Posts
And now processing 2^n-c similarly to 2^n-1 (with a special modulo), where "c" is small (single-limb). GMP-ECM doesn't do this, as far as I can tell.

Example:

Code:
./avx-ecm-52-icc-static "(2^941 - 5987059)/211/18413" 8 250000 1
starting process 97314
commencing parallel ecm on 4784305585655994717562720658383425526844379349934463348761460837207088185558434958330904147209498950842345840906605538877380399014736734618876417744155088454218322338609736739235487631453648349849153543969782570442284070067941772311074196433531076214607947202503691346829979451
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 920 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
Using special pseudo-Mersenne mod for factor of: 2^941-5987059
Initialization took 0.1466 seconds.
Cached 1565994 primes in range [2 : 25000991]
commencing curves 0-7 of 8
Building curves took 0.0002 seconds.
commencing Stage 1 @ prime 2
accumulating prime 180511
Stage 1 completed at prime 249989 with 496674 point-adds and 51127 point-doubles
Stage 1 took 1.8810 seconds
commencing stage 2 at p=250007, A=249480
w = 1155, R = 480, L = 16, umax = 9240, amin = 108
Stage 2 took 1.4885 seconds
performed 854762 pair-multiplies for 1543883 primes in stage 2
performed 30752 point-additions and 49 point-doubles in stage 2
found factor 4655226350979187 in stage 2 in thread 0, vec position 3, with sigma = 9539569019480891334
found factor 4655226350979187 in stage 2 in thread 0, vec position 7, with sigma = 17882739001568946650
Process took 3.7186 seconds.

Code:
echo "(2^941-5987059)/211/18413" | ../ecm-704-linux/ecm -v 250000 GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asm-redc] [ECM] Tuned for x86_64/k8/params.h Running on cpu Input number is (2^941-5987059)/211/18413 (277 digits) Using MODMULN [mulredc:1, sqrredc:1] Computing batch product (of 360843 bits) of primes up to B1=250000 took 17ms Using B1=250000, B2=128992510, polynomial Dickson(3), sigma=1:1245750705 dF=2048, k=3, d=19110, d2=11, i0=3 Expected number of curves to find a factor of n digits: 35 40 45 50 55 60 65 70 75 80 6022 87544 1534319 3.1e+07 7.5e+08 2e+10 6.7e+13 1e+19 1.3e+24 Inf Step 1 took 1673ms Last fiddled with by bsquared on 2020-11-05 at 21:04 Reason: gmp-ecm output comparison |

#73
"Ben"
Feb 2007
64418 Posts
There have been a few updates to the avx-ecm repo:
1) Added algebraic factor removal for Mersenne inputs.
2) Added checkpoint saving every 1e8 primes in stage 1.
3) Implemented batch inversion in stage 2, which is now almost twice as fast (a sketch of the batch-inversion idea is included after the timings below).

Here's an example of that last item compared to previously:

Quote:
Code:
B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 32 sec for 8 curves, 4.0 sec/stg2-curve

I think that Bob's paper says to equalize the stage 1 and stage 2 runtimes, so the best answer is probably the latter. It might be moot if the best-best answer is still to use a hybrid plan with gmp-ecm stage 2. However, this might change the crossover point (from pure avx-ecm to hybrid).
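For reference, the batch inversion in item 3 is presumably Montgomery's simultaneous-inversion trick: one modular inversion plus roughly 3n multiplications replaces n separate inversions, which is the standard way to amortize inversions when normalizing many stage 2 points. A minimal GMP sketch of the idea (not avx-ecm's vectorized code; the interface is mine):

Code:
#include <stdlib.h>
#include <gmp.h>

/* Montgomery's simultaneous-inversion trick: invert a[0..n-1] mod N in place
   using ONE modular inversion plus about 3n multiplications.
   Assumes gcd(a[i], N) = 1 for every i; in ECM, a gcd > 1 would itself be the
   factor you are looking for, so that case is handled separately. */
void batch_invert(mpz_t *a, int n, const mpz_t N)
{
    mpz_t *prefix = malloc((size_t)n * sizeof(mpz_t));
    mpz_t inv, tmp;
    int i;

    /* forward pass: prefix[i] = a[0]*a[1]*...*a[i] mod N */
    mpz_init_set(prefix[0], a[0]);
    for (i = 1; i < n; i++) {
        mpz_init(prefix[i]);
        mpz_mul(prefix[i], prefix[i - 1], a[i]);
        mpz_mod(prefix[i], prefix[i], N);
    }

    /* one inversion of the full product */
    mpz_init(inv);
    mpz_init(tmp);
    mpz_invert(inv, prefix[n - 1], N);

    /* backward pass: peel off one inverse at a time */
    for (i = n - 1; i > 0; i--) {
        mpz_mul(tmp, inv, prefix[i - 1]);   /* tmp = a[i]^-1              */
        mpz_mod(tmp, tmp, N);
        mpz_mul(inv, inv, a[i]);            /* inv = (a[0]*...*a[i-1])^-1 */
        mpz_mod(inv, inv, N);
        mpz_set(a[i], tmp);
    }
    mpz_set(a[0], inv);

    for (i = 0; i < n; i++)
        mpz_clear(prefix[i]);
    free(prefix);
    mpz_clear(inv);
    mpz_clear(tmp);
}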

#74
"Ben"
Feb 2007
3,361 Posts
More updates:
1) Fixed some bugs in the fast stage 2, after finally bearing down and understanding all of the nuances of Montgomery's PAIR algorithm correctly. I had it mostly right before, but not 100%, which caused some factors to be missed. Thank <deity> for a gold reference implementation in gmp-ecm. Also, fixing the bugs actually improved the speed a bit.

2) Got a shiny new laptop with a Tiger Lake i7-1165G7 in order to get AVX512IFMA up and running. It helped more than I thought it would: going from my floating-point 52-bit double-precision integer multipliers to IFMA speeds up all of the vector math by about 40%! (A sketch of the IFMA multiply-accumulate step follows after the timings below.) On this de-facto quick benchmark, this cpu gets:

Code:
B1=7M on 2^1277-1:
avx-ecm stage 1: 47.3 sec for 8 curves, 5.9 sec/stg1-curve
avx-ecm stage 2: 18.3 sec for 8 curves, 2.3 sec/stg2-curve
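For reference, the IFMA primitive involved: AVX-512 IFMA provides fused 52x52-bit to 104-bit integer multiply-adds, so 52-bit limb products can be accumulated directly instead of going through the double-precision FMA path. A tiny sketch of the accumulate step (illustrative, not avx-ecm's actual kernel); it needs a compiler flag along the lines of -mavx512ifma.

Code:
#include <immintrin.h>

/* One multiply-accumulate step of an 8-way vectorized multi-word multiply in a
   52-bit radix: each 64-bit lane holds one limb of one of 8 independent numbers.
   acc_lo gathers the low 52 bits of each 52x52-bit product, acc_hi the high 52
   bits; carries are propagated later, once the accumulators approach 64 bits.
   Requires AVX512IFMA (Ice Lake / Tiger Lake and newer). */
static inline void madd52(__m512i *acc_lo, __m512i *acc_hi,
                          __m512i a, __m512i b)
{
    *acc_lo = _mm512_madd52lo_epu64(*acc_lo, a, b);
    *acc_hi = _mm512_madd52hi_epu64(*acc_hi, a, b);
}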

#75
P90 years forever!
Aug 2002
Yeehaw, FL
729810 Posts
Have you played with improving PRAC?
I recently did and gained one or two percent. This may be because prime95 can store data FFTed, which may change the costing functions. A double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain. Glad to share the C code for the changes.
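To make the cost discussion concrete, here is a tiny sketch of the kind of weighting involved: PRAC evaluates several candidate Lucas chains per prime and keeps the cheapest, and the dup/add weights decide which one that is. The transform counts are the prime95 figures quoted above; the operation counts in main() are made-up numbers purely to show how the weighting can flip the choice.

Code:
#include <stdio.h>

/* prime95's figures quoted above: a point doubling costs 10 transforms, a
   point addition costs 12. Other codes would plug in their own weights
   (e.g. counting multiplies and squarings instead of transforms). */
#define COST_DUP 10.0
#define COST_ADD 12.0

/* Weighted cost of a candidate Lucas chain with the given operation counts. */
static double chain_cost(unsigned dups, unsigned adds)
{
    return COST_DUP * (double)dups + COST_ADD * (double)adds;
}

int main(void)
{
    /* Hypothetical operation counts for two candidate chains for the same
       prime: both use 23 operations total, but the dup/add weighting makes
       the dup-heavier chain the cheaper one. */
    double c1 = chain_cost(12, 11);  /* 12*10 + 11*12 = 252 transforms */
    double c2 = chain_cost(10, 13);  /* 10*10 + 13*12 = 256 transforms */
    printf("chain 1: %.0f transforms, chain 2: %.0f transforms\n", c1, c2);
    printf("prefer chain %d\n", c1 <= c2 ? 1 : 2);
    return 0;
}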

#76
"Ben"
Feb 2007
336110 Posts

Quote:
I'm also looking into gmp-ecm's different parameterizations. "param 1" seems to be the default for me in most cases. It looks like they are using small parameters so that the multiply by (A+2)/4 is a single limb (saving a multiply when duplicating), and then not using PRAC at all, instead using Montgomery's ladder, which has a larger percentage of doublings, on a pre-multiplied set of B1 primes. Eliminating a multiply could save 10% or so.

What I don't understand yet is how it can be single-limb if they are using Montgomery multiplies (redc).

Last fiddled with by bsquared on 2020-11-29 at 18:16
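For context on where the saved multiply lives: in the x-only doubling step on a Montgomery curve, the constant d = (A+2)/4 multiplies 4XZ once per doubling. Below is a rough GMP sketch of the standard doubling formulas (not gmp-ecm's code; the names are mine). If d fits in a single limb, that one general multiply can become an mpz_mul_ui (or the vectorized equivalent), which is the saving being described.

Code:
#include <gmp.h>

/* x-only doubling on a Montgomery curve, (X:Z) -> 2*(X:Z) mod N, with
   d = (A+2)/4.  Standard formulas: with AA=(X+Z)^2, BB=(X-Z)^2, C=AA-BB,
   X' = AA*BB and Z' = C*(BB + d*C).  t1, t2, t3 are scratch variables. */
void mont_dup(mpz_t X, mpz_t Z, const mpz_t d, const mpz_t N,
              mpz_t t1, mpz_t t2, mpz_t t3)
{
    mpz_add(t1, X, Z);        /* t1 = X + Z                 */
    mpz_mul(t1, t1, t1);      /* t1 = AA = (X+Z)^2          */
    mpz_mod(t1, t1, N);
    mpz_sub(t2, X, Z);        /* t2 = X - Z                 */
    mpz_mul(t2, t2, t2);      /* t2 = BB = (X-Z)^2          */
    mpz_mod(t2, t2, N);
    mpz_sub(t3, t1, t2);      /* t3 = C = AA - BB = 4*X*Z   */
    mpz_mul(X, t1, t2);       /* X' = AA * BB               */
    mpz_mod(X, X, N);
    mpz_mul(t1, t3, d);       /* <-- the multiply by d = (A+2)/4: a full mpz_mul
                                     in general, an mpz_mul_ui if d is one limb */
    mpz_add(t1, t1, t2);      /* t1 = BB + d*C              */
    mpz_mul(Z, t3, t1);       /* Z' = C * (BB + d*C)        */
    mpz_mod(Z, Z, N);
}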

#77
"Ben"
Feb 2007
3,361 Posts

Quote:
It also seemed to help very slightly for me to define the cost functions to include the faster nature of a square versus a multiply. Doing that increases the total dups/adds ratio very slightly in stage 1.

In my tinyecm code (< 64 bits), where there are just a few fixed and very small B1 values, I found that in a few cases the cost of PRAC(p1*p2) is cheaper than PRAC(p1) + PRAC(p2). It isn't practical to test combinations of primes like this in real time for arbitrary (and larger) B1, but I wonder if it is worth compiling a list of prime pairs that are cheaper when combined below bounds like 1M, 3M, 11M, etc.
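A sketch of what compiling such a list might look like: sieve the primes below the bound, compare the chain cost of each pair taken together against the two chains taken separately, and record the winners. The cost model below is a deliberate placeholder (a plain ladder count, under which a combined chain is never cheaper), so to reproduce the effect you would swap in a real PRAC cost routine; the bound and output format are illustrative only.

Code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder cost model: a plain Montgomery-ladder count (one dup + one add
   per bit below the leading bit). NOTE: with this model cost(p*q) is never
   below cost(p) + cost(q); substitute a real PRAC cost (best over its
   candidate multipliers) to actually find the pairs described above. */
static double chain_cost(uint64_t n)
{
    int bits = 0;
    while (n > 1) { n >>= 1; bits++; }
    return (double)bits * (10.0 + 12.0);
}

int main(void)
{
    const uint64_t B1 = 10000;  /* small demo bound; real lists would be
                                   precomputed offline for 1M, 3M, 11M, ... */
    char *composite = calloc(B1, 1);
    for (uint64_t i = 2; i * i < B1; i++)
        if (!composite[i])
            for (uint64_t j = i * i; j < B1; j += i)
                composite[j] = 1;

    /* Compare every pair of odd primes p <= q below B1. */
    for (uint64_t p = 3; p < B1; p += 2) {
        if (composite[p]) continue;
        for (uint64_t q = p; q < B1; q += 2) {
            if (composite[q]) continue;
            double separate = chain_cost(p) + chain_cost(q);
            double combined = chain_cost(p * q);
            if (combined < separate)
                printf("%llu * %llu: %.0f vs %.0f\n",
                       (unsigned long long)p, (unsigned long long)q,
                       combined, separate);
        }
    }
    free(composite);
    return 0;
}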