mersenneforum.org > YAFU AVX-ECM

2020-11-03, 02:50   #67
bsquared

"Ben"
Feb 2007

3,361 Posts

Quote:
 Originally Posted by VBCurtis When GMP-ECM is faster for Stage 2 with a higher B2, there is only advantage from using it. Save time, get more work- what's not to like?
If it were comparing curve-for-curve, sure, but it's not that straightforward: avx-ecm's stage 2 may be slower and use a smaller B2, but it runs 8 curves at once in that time.

Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve

gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve

Now, each avx-ecm stage 2 curve is 2.8 times less likely to find a factor. But if you had 64 seconds to spend, would you rather run 2 gmp-ecm stage 2 curves or 8 avx-ecm stage 2 curves? Even with each avx-ecm curve being 2.8 times less likely to find a factor, you'd be better off.
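Spelling out that arithmetic as a sketch (all numbers are the timings quoted above; "expected curves" just weights the avx-ecm count by the 2.8x per-curve probability penalty):

```python
# Sketch of the throughput comparison above, using the quoted timings.
gmp_stage2 = 32.0        # sec per gmp-ecm stage 2 curve
avx_stage2 = 55.0 / 8    # sec per avx-ecm stage 2 curve (8 curves per batch)
penalty = 2.8            # each avx-ecm curve is ~2.8x less likely to find a factor

budget = 64.0            # seconds to spend
gmp_expected = budget / gmp_stage2              # 2.0 full-strength curves
avx_expected = (budget / avx_stage2) / penalty  # ~3.3 curve-equivalents

print(gmp_expected, avx_expected)
```

Even after discounting each avx-ecm curve by 2.8x, the batch of 8 comes out ahead on expected factors found per unit time.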

Quote:
 Originally Posted by VBCurtis The difficult tradeoff would be on larger numbers, when the avx-ecm stage 2 is likely faster than GMP-ECM but a smaller B2. That leads to some need for study.
The same math as above applies. I think you are better off running pure avx-ecm until time/curve * probability-multiplier is greater than gmp-ecm's stage 2 runtime.
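That crossover rule might be written as follows (a sketch; the variable names are mine, not avx-ecm's):

```python
# Hedged sketch of the crossover rule above: stay with pure avx-ecm while its
# effective stage 2 cost (time per curve times the probability multiplier)
# stays below gmp-ecm's stage 2 time per curve.

def prefer_pure_avx(avx_sec_per_curve, prob_multiplier, gmp_stage2_sec):
    # True -> keep running pure avx-ecm; False -> switch to gmp-ecm stage 2
    return avx_sec_per_curve * prob_multiplier < gmp_stage2_sec

# With the numbers from the example: 6.87 * 2.8 = 19.2 < 32, so pure avx-ecm.
print(prefer_pure_avx(6.87, 2.8, 32.0))
```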

It gets interesting with generic inputs, when the throughput advantage of avx-ecm is not as great. Then, I think it makes sense to run avx-ecm stage 1 followed by 8 gmp-ecm stage 2's. Then you get all of the probability advantage of gmp-ecm stage 2 but with avx-ecm's higher stage 1 throughput. I am putting together some data on this case.

2020-11-03, 03:05   #68
bsquared

"Ben"
Feb 2007

3,361 Posts

Here are a couple of other things to keep in mind:

1) For lower-end Skylake processors the speedups are not as dramatic, with or without the special Mersenne math. I ran on a 7800X and only saw about a 2.1x speedup for 2^1277-1. More testing/benchmarking is needed on a variety of AVX-512-capable processors. The good news is that this situation will only improve for avx-ecm as time goes on. I plan to implement enhancements as soon as I can get my hands on an Ice Lake cpu, or whatever cpu supports AVX-512IFMA.

2) avx-ecm uses fixed-time arithmetic in steps of 208 bits, so a curve at, say, 416 bits takes just as long as a curve at 623 bits. gmp-ecm, on the other hand, adjusts the size of its arithmetic in 64-bit steps. This could call for fine-tuning the crossover for any given input size.
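The size granularity can be illustrated like this (a sketch: the 208-bit step and the 941-bit -> MAXBITS=1040 data point come from the logs later in the thread, but the plain round-up rule is my assumption; the 416-vs-623 example in the post suggests the real rule may include some headroom):

```python
import math

# avx-ecm (per this thread) works in fixed 208-bit blocks, while gmp-ecm
# adjusts its arithmetic in 64-bit limbs.  Assuming plain round-up:

def avx_maxbits(nbits, step=208):
    return step * math.ceil(nbits / step)

def gmp_limb_bits(nbits, limb=64):
    return limb * math.ceil(nbits / limb)

print(avx_maxbits(941))    # 1040, matching "Choosing MAXBITS = 1040 ... input size 941"
print(gmp_limb_bits(941))  # 960: a much finer granularity
```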
2020-11-03, 03:20   #69
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

3×29×53 Posts

Quote:
 Originally Posted by bsquared Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve
gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve
Thank you for rewarding my "I'm confused so I'll state the obvious and see where I'm wrong"! This makes much sense, and I'm all clear now.

2020-11-03, 18:03   #70
bsquared

"Ben"
Feb 2007

3,361 Posts

I updated the first post with a table of timings and some analysis for a generic 900-bit input and a couple of different Mersenne inputs. It looks like a hybrid plan is best at most B1 levels. The relative speeds of avx-ecm and gmp-ecm should be checked on the particular cpu beforehand, and the crossovers adjusted.

Last fiddled with by bsquared on 2020-11-03 at 19:50
2020-11-05, 15:01   #71
bsquared

"Ben"
Feb 2007

3,361 Posts

Now processing 2^n+1 similarly to 2^n-1 (with special modulo). Here, verifying some of the factors of 2^941+1:

Code:
./avx-ecm "(2^941+1)/3/1738969" 8 3000000
starting process 60820
commencing parallel ecm on 3562975409753765647916625929286022733708067351160837396219090783379741715638222929662078302565476897014358133541083282322395686863751130956979946597434060607614269433132169774980382077527684835263114705184392392863404860166373197969977946080763606872357617994374214244244159779
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 919 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
Using special Mersenne mod for factor of: 2^941+1
Initialization took 0.0899 seconds.
Cached 5761455 primes in range [2 : 99999989]
commencing curves 0-7 of 8
Building curves took 0.0001 seconds.
commencing Stage 1 @ prime 2
accumulating prime 2943173
Stage 1 completed at prime 2999999 with 6040073 point-adds and 565931 point-doubles
Stage 1 took 19.2999 seconds
found factor 5062155107501579 in stage 1 in thread 0, vec position 0, with sigma = 424412553419277870
found factor 5062155107501579 in stage 1 in thread 0, vec position 5, with sigma = 8913334498632421225
commencing stage 2 at p=3000017, A=3000690
w = 1155, R = 480, L = 16, umax = 9240, amin = 1299
found 5317482 primes in range [99999989 : 199999963]
commencing stage 2 at p=100000007, A=99965250
w = 1155, R = 480, L = 16, umax = 9240, amin = 43275
found 5173389 primes in range [199999963 : 299999957]
commencing stage 2 at p=199999991, A=199979010
w = 1155, R = 480, L = 16, umax = 9240, amin = 86571
found 53 primes in range [299999957 : 300000997]
commencing stage 2 at p=299999977, A=299974290
w = 1155, R = 480, L = 16, umax = 9240, amin = 129859
Stage 2 took 14.7954 seconds
performed 9087802 pair-multiplies for 16035509 primes in stage 2
performed 266472 point-additions and 57 point-doubles in stage 2
found factor 5062155107501579 in stage 2 in thread 0, vec position 0, with sigma = 424412553419277870
found factor 5062155107501579 in stage 2 in thread 0, vec position 1, with sigma = 15371561253067696485
found factor 5062155107501579 in stage 2 in thread 0, vec position 2, with sigma = 13475164758719783696
found factor 1053336016261649316809 in stage 2 in thread 0, vec position 4, with sigma = 333700535869040322
found factor 5062155107501579 in stage 2 in thread 0, vec position 5, with sigma = 8913334498632421225
Process took 34.3731 seconds.

Also, I'm only pushing updates to the avx-ecm repo for now. When things settle out a bit I'll revise yafu.

Last fiddled with by bsquared on 2020-11-05 at 15:17
2020-11-05, 20:59   #72
bsquared

"Ben"
Feb 2007

3,361 Posts

And now processing 2^n-c similarly to 2^n-1 (with special modulo), where "c" is small (single-limb). GMP-ECM doesn't do this, as far as I can tell. Example:

Code:
./avx-ecm-52-icc-static "(2^941 - 5987059)/211/18413" 8 250000 1
starting process 97314
commencing parallel ecm on 4784305585655994717562720658383425526844379349934463348761460837207088185558434958330904147209498950842345840906605538877380399014736734618876417744155088454218322338609736739235487631453648349849153543969782570442284070067941772311074196433531076214607947202503691346829979451
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 920 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
Using special pseudo-Mersenne mod for factor of: 2^941-5987059
Initialization took 0.1466 seconds.
Cached 1565994 primes in range [2 : 25000991]
commencing curves 0-7 of 8
Building curves took 0.0002 seconds.
commencing Stage 1 @ prime 2
accumulating prime 180511
Stage 1 completed at prime 249989 with 496674 point-adds and 51127 point-doubles
Stage 1 took 1.8810 seconds
commencing stage 2 at p=250007, A=249480
w = 1155, R = 480, L = 16, umax = 9240, amin = 108
Stage 2 took 1.4885 seconds
performed 854762 pair-multiplies for 1543883 primes in stage 2
performed 30752 point-additions and 49 point-doubles in stage 2
found factor 4655226350979187 in stage 2 in thread 0, vec position 3, with sigma = 9539569019480891334
found factor 4655226350979187 in stage 2 in thread 0, vec position 7, with sigma = 17882739001568946650
Process took 3.7186 seconds.
And to compare:

Code:
echo "(2^941-5987059)/211/18413" | ../ecm-704-linux/ecm -v 250000
GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asm-redc] [ECM]
Tuned for x86_64/k8/params.h
Running on cpu
Input number is (2^941-5987059)/211/18413 (277 digits)
Using MODMULN [mulredc:1, sqrredc:1]
Computing batch product (of 360843 bits) of primes up to B1=250000 took 17ms
Using B1=250000, B2=128992510, polynomial Dickson(3), sigma=1:1245750705
dF=2048, k=3, d=19110, d2=11, i0=3
Expected number of curves to find a factor of n digits:
35      40      45      50      55      60      65      70      75      80
6022    87544   1534319 3.1e+07 7.5e+08 2e+10   6.7e+13 1e+19   1.3e+24 Inf
Step 1 took 1673ms

Pretty substantial stage 1 throughput improvement for inputs of this form.

Last fiddled with by bsquared on 2020-11-05 at 21:04 Reason: gmp-ecm output comparison
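The special pseudo-Mersenne modulo being referred to is the standard folding trick: since 2^n ≡ c (mod 2^n - c), the high bits of an intermediate result can be folded down with one small multiply instead of a full division. A minimal scalar sketch (my illustration, not avx-ecm's vectorized code):

```python
def reduce_pm(x, n, c):
    """Reduce x modulo 2^n - c for small c, by folding: 2^n == c (mod 2^n - c)."""
    m = (1 << n) - c
    mask = (1 << n) - 1
    while x >> n:                      # while bits remain above position n-1
        x = (x & mask) + (x >> n) * c  # fold the high part down
    return x - m if x >= m else x

# Matches ordinary % for the modulus from the post:
n, c = 941, 5987059
x = 3 ** 2000  # ~3170-bit value, as if from a multi-word product
print(reduce_pm(x, n, c) == x % ((1 << n) - c))  # True
```

Because c is single-limb, each fold costs one small multiply and an add, which is where the stage 1 throughput win for inputs of this form comes from.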
2020-11-19, 17:17   #73
bsquared

"Ben"
Feb 2007

3,361 Posts

There have been a few updates to the avx-ecm repo:

1) added algebraic factor removal for Mersenne inputs
2) added checkpoint saving every 1e8 primes in stage 1
3) implemented batch inversion in stage 2, now almost twice as fast

Here's an example of that last item, compared to before:
Quote:
 Originally Posted by bsquared Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 55 sec for 8 curves, 6.8 sec/stg2-curve
Now:
Code:
B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 32 sec for 8 curves, 4.0 sec/stg2-curve
At this point I'm not sure whether it's better to keep the faster stage 2 time or to run to higher B2 bounds, keeping the timings similar to before.

I think that Bob's paper says to equalize the stage 1 and stage 2 runtimes, so the best answer is probably the latter. It might be moot if the best answer overall is still a hybrid plan with gmp-ecm stage 2. However, this might change the crossover point (from pure avx-ecm to hybrid).
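For reference, the batch inversion mentioned in update 3 is presumably Montgomery's simultaneous-inversion trick: n modular inverses for the price of one inversion plus 3(n-1) multiplies. A generic sketch (my illustration, not avx-ecm's code):

```python
def batch_inverse(xs, p):
    """Invert every element of xs mod prime p using a single modular inversion
    (Montgomery's trick): build prefix products, invert the total, unwind."""
    n = len(xs)
    prefix = [1] * (n + 1)
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] * x % p
    inv = pow(prefix[n], -1, p)        # the one and only inversion (Python 3.8+)
    out = [0] * n
    for i in range(n - 1, -1, -1):
        out[i] = inv * prefix[i] % p   # inverse of xs[i]
        inv = inv * xs[i] % p          # peel xs[i] off the running inverse
    return out

p = (1 << 61) - 1                      # a Mersenne prime, for flavor
xs = [2, 3, 12345, 987654321]
print(all(x * y % p == 1 for x, y in zip(xs, batch_inverse(xs, p))))  # True
```

Since inversions are far more expensive than multiplies, amortizing one inversion over a whole vector of stage 2 denominators is consistent with the near-2x speedup reported.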

2020-11-28, 00:58   #74
bsquared

"Ben"
Feb 2007

3,361 Posts

More updates:

1) Fixed some bugs in the fast stage 2, after finally bearing down and understanding all of the nuances of Montgomery's PAIR algorithm correctly. I had it mostly right before, but not 100%, which caused some factors to be missed. Thanks for a gold reference implementation in gmp-ecm. Also, the bug fixes actually improved the speed a bit.

2) Got a shiny new laptop with a Tiger Lake i7-1165G7 in order to get AVX-512IFMA up and running. It helped more than I thought it would. Going from my 52-bit double-precision floating-point integer multipliers to IFMA speeds up all of the vector math by about 40%! In a quick benchmark this cpu gets:

Code:
B1=7M on 2^1277-1:
avx-ecm stage 1: 47.3 sec for 8 curves, 5.9 sec/stg1-curve
avx-ecm stage 2: 18.3 sec for 8 curves, 2.3 sec/stg2-curve

The next goals are usability-focused: a proper set of command-line flags, options to control memory use and when to exit, improved logging, and reading in savefiles.
2020-11-29, 05:19   #75
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7²·149 Posts

Have you played with improving PRAC? I recently did and gained one or two percent. This may be because prime95 can store data FFTed, which may change the costing functions: a double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain. Glad to share the C code for the changes.
2020-11-29, 18:13   #76
bsquared

"Ben"
Feb 2007

6441₈ Posts

Quote:
 Originally Posted by Prime95 Have you played with improving PRAC? I recently did and gained one or two percent. This may be because prime95 can store data FFTed which may change the costing functions. A double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain. Glad to share the C code for the changes.
I tried adjusting how much time is spent optimizing chains, without notable success. I'd like to see your code; thanks for offering to share it.

I'm also looking into gmp-ecm's different parameterizations. "param 1" seems to be the default for me in most cases. It looks like they use small parameters so that the multiply by (A+2)/4 is a single limb (saving a multiply when duplicating). They then skip PRAC entirely, instead using a Montgomery ladder, which has a larger percentage of doublings, on a pre-multiplied set of B1 primes. Eliminating a multiply could save 10% or so. What I don't understand yet is how it can be single-limb if they are using Montgomery multiplies (redc).
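For context, the (A+2)/4 constant appears in the Montgomery-curve x-only doubling formula, which is why keeping it single-limb saves a full-width multiply per doubling. A sketch of the standard projective doubling (textbook formulas, not gmp-ecm's code):

```python
def xdbl(X, Z, a24, p):
    """Double the point (X:Z) on a Montgomery curve B*y^2 = x^3 + A*x^2 + x,
    x-only, with a24 = (A+2)/4.  Five multiplies, one of them by a24."""
    t1 = (X + Z) * (X + Z) % p       # (X+Z)^2
    t2 = (X - Z) * (X - Z) % p       # (X-Z)^2
    t3 = (t1 - t2) % p               # 4*X*Z
    X2 = t1 * t2 % p
    Z2 = t3 * (t2 + a24 * t3) % p    # if a24 is one limb, this mult is cheap
    return X2, Z2

# Projective consistency check: scaling (X:Z) must not change the result ratio.
p = (1 << 127) - 1
X, Z, a24 = 5, 1, 7
X2, Z2 = xdbl(X, Z, a24, p)
X2s, Z2s = xdbl(3 * X % p, 3 * Z % p, a24, p)
print(X2 * Z2s % p == X2s * Z2 % p)  # True: same projective point
```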

Last fiddled with by bsquared on 2020-11-29 at 18:16

2020-11-30, 14:47   #77
bsquared

"Ben"
Feb 2007

3,361 Posts

Quote:
 Originally Posted by Prime95 Have you played with improving PRAC? I recently did and gained one or two percent.
The gain is not quite as large for the smaller inputs I'm dealing with, but it's still about 1%.

It also seemed to help slightly to define the cost functions to account for a square being faster than a multiply; doing that slightly increases the overall dup/add ratio in stage 1.

In my tinyecm code (< 64 bits), where there are just a few fixed and very small B1 values, I found that in a few cases PRAC(p1*p2) is cheaper than PRAC(p1) + PRAC(p2). It isn't practical to test combinations of primes like this in real time for arbitrary (and larger) B1, but I wonder if it is worth compiling a list of prime pairs that are cheaper when combined, for bounds like 1M, 3M, 11M, etc.
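The combined-primes idea can be illustrated with a crude stand-in for PRAC costing: a plain binary double-and-add ladder, where reaching n costs (bit length - 1) doubles plus (popcount - 1) adds. This is a toy model only: PRAC finds much better Lucas chains, and x-only Montgomery arithmetic cannot actually use a plain binary ladder, so the real crossovers differ.

```python
# Toy chain cost: binary double-and-add, with equal weights for a double
# and an add (real codes weight these differently, e.g. prime95's 10 vs 12
# transforms).  This is NOT PRAC, just an illustration of the idea that
# cost(p1*p2) can undercut cost(p1) + cost(p2).

def binary_cost(n, dbl=1.0, add=1.0):
    doubles = n.bit_length() - 1
    adds = bin(n).count("1") - 1
    return doubles * dbl + adds * add

def combined_is_cheaper(p1, p2):
    return binary_cost(p1 * p2) < binary_cost(p1) + binary_cost(p2)

# 5*13 = 65 = 0b1000001: 6 doubles + 1 add = 7, versus
# cost(5) = 3 plus cost(13) = 5 -> 8, so combining wins under this model;
# 3*5 = 15 = 0b1111 costs 6 versus 2 + 3 = 5, so there it does not.
print(combined_is_cheaper(5, 13), combined_is_cheaper(3, 5))
```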