- **YAFU**
(*https://www.mersenneforum.org/forumdisplay.php?f=96*)

- **AVX-ECM**
(*https://www.mersenneforum.org/showthread.php?t=25056*)

[QUOTE=VBCurtis;562018]When GMP-ECM is faster for Stage 2 with a higher B2, there is only advantage from using it. Save time, get more work - what's not to like?[/QUOTE]
If it were comparing curve-for-curve, sure, but it's not that straightforward. avx-ecm stage 2 may be slower and use a smaller B2, but it is doing 8 curves at once in that time. Expanding on the example from before, with B1=7M on 2^1277-1:

avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve
gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve

Now, each avx-ecm stage 2 curve is 2.8 times less likely to find a factor, but if you had 64 seconds to spend doing something, would you rather run 2 gmp-ecm stage 2 curves or 8 avx-ecm stage 2 curves? Even with each avx-ecm curve being 2.8 times less likely to find a factor, you'd be better off.

[QUOTE=VBCurtis;562018]The difficult tradeoff would be on larger numbers, when the avx-ecm stage 2 is likely faster than GMP-ECM but with a smaller B2. That leads to some need for study.[/QUOTE]
The same math as above applies. I think you are better off running pure avx-ecm until time/curve * probability-multiplier is greater than gmp-ecm's stage 2 runtime.

It gets interesting with generic inputs, where the throughput advantage of avx-ecm is not as great. Then I think it makes sense to run avx-ecm stage 1 followed by 8 gmp-ecm stage 2's: you get all of the probability advantage of gmp-ecm stage 2, but with avx-ecm's higher stage 1 throughput. I am putting together some data on this case.
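The throughput argument above can be sketched numerically. This is just the arithmetic from this post; the 2.8x probability multiplier is the rough figure quoted here, not a measured constant:

```python
# Per-curve stage 2 times from the 2^1277-1 example above, with each
# avx-ecm curve assumed ~2.8x less likely to find a factor than a
# gmp-ecm curve (the rough figure from this post).
avx_s2_per_curve = 55 / 8        # 6.875 sec/curve (8 curves per vector pass)
gmp_s2_per_curve = 32.0          # sec/curve
avx_prob = 1 / 2.8               # success probability relative to one gmp-ecm curve

# "factor-finding power" per second of stage 2 work
avx_rate = avx_prob / avx_s2_per_curve
gmp_rate = 1.0 / gmp_s2_per_curve

print(f"avx-ecm: {avx_rate:.4f}/sec   gmp-ecm: {gmp_rate:.4f}/sec")
```

Even after dividing by 2.8, the vectorized curves deliver more expected factors per second than gmp-ecm's single faster curve.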

Here are a couple other things to keep in mind:
1) For lower-end Skylake processors the speedups are not as dramatic, with or without the special Mersenne math. I ran on a 7800X and only saw about a 2.1x speedup for 2^1277-1. [U]More testing/benchmarking is needed for a variety of avx512-capable processors[/U]. The good news is that this situation will only improve for avx-ecm as time goes on. I plan to implement enhancements as soon as I can get my hands on an Ice Lake cpu, or whatever cpu supports AVX-512IFMA.
2) avx-ecm uses fixed-time arithmetic in steps of 208 bits, so a curve at, say, 416 bits takes just as long as a curve at 623 bits. gmp-ecm, on the other hand, will adjust the size of its arithmetic in 64-bit steps. This could mean fine-tuning adjustments to any crossover math that is performed for given input sizes.
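The sizing in item 2 can be sketched as follows. This is my guess at the rounding rule, chosen to be consistent with the claim that 416-bit and 623-bit inputs cost the same and with the logs later in the thread (a 941-bit input gets MAXBITS = 1040):

```python
def maxbits(input_bits, step=208):
    """Smallest multiple of `step` strictly greater than input_bits.
    A guessed model of avx-ecm's fixed-size arithmetic, consistent with
    the 416-vs-623-bit claim and with the logs (941 bits -> MAXBITS 1040)."""
    return (input_bits // step + 1) * step

print(maxbits(941))                 # 1040, matching the log output
print(maxbits(416), maxbits(623))   # 624 624 - same size, same runtime
print(maxbits(624))                 # 832 - the next step up
```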

[QUOTE=bsquared;562023]Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve
gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve[/QUOTE]
Thank you for rewarding my "I'm confused, so I'll state the obvious and see where I'm wrong"! This makes much sense, and I'm all clear now.

I updated the first post with a table of timings and some analysis for a generic 900-bit input and a couple different Mersenne inputs.
It looks like a hybrid plan is best for most B1 levels. The relative speeds of avx-ecm and gmp-ecm should be checked on the particular cpu beforehand, and the crossovers adjusted accordingly.
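The crossover rule suggested earlier in the thread (stay with pure avx-ecm until time/curve times the probability multiplier exceeds gmp-ecm's stage 2 runtime) can be written as a small helper. The 2.8 default is only the example figure from this thread and should be re-measured per cpu and input size:

```python
def stage2_plan(avx_s2_per_curve, gmp_s2_per_curve, prob_multiplier=2.8):
    """Pick a stage 2 plan per the rule suggested earlier in the thread:
    pure avx-ecm while (time/curve * probability multiplier) stays below
    gmp-ecm's per-curve stage 2 time.  The 2.8 default is the example
    figure from this thread; measure it for your own cpu."""
    if avx_s2_per_curve * prob_multiplier < gmp_s2_per_curve:
        return "pure avx-ecm"
    return "hybrid: avx-ecm stage 1 + gmp-ecm stage 2"

# The 2^1277-1 example: 6.875 * 2.8 = 19.25 < 32, so pure avx-ecm wins.
print(stage2_plan(55 / 8, 32.0))
```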

Now processing 2^n+1 similarly to 2^n-1 (with special modulo).
Here, verifying some of the factors of 2^941+1:
[CODE]./avx-ecm "(2^941+1)/3/1738969" 8 3000000
starting process 60820
commencing parallel ecm on 3562975409753765647916625929286022733708067351160837396219090783379741715638222929662078302565476897014358133541083282322395686863751130956979946597434060607614269433132169774980382077527684835263114705184392392863404860166373197969977946080763606872357617994374214244244159779
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 919 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
Using special Mersenne mod for factor of: 2^941+1
Initialization took 0.0899 seconds.
Cached 5761455 primes in range [2 : 99999989]
commencing curves 0-7 of 8
Building curves took 0.0001 seconds.
commencing Stage 1 @ prime 2
accumulating prime 2943173
Stage 1 completed at prime 2999999 with 6040073 point-adds and 565931 point-doubles
Stage 1 took 19.2999 seconds
found factor 5062155107501579 in stage 1 in thread 0, vec position 0, with sigma = 424412553419277870
found factor 5062155107501579 in stage 1 in thread 0, vec position 5, with sigma = 8913334498632421225
commencing stage 2 at p=3000017, A=3000690
w = 1155, R = 480, L = 16, umax = 9240, amin = 1299
found 5317482 primes in range [99999989 : 199999963]
commencing stage 2 at p=100000007, A=99965250
w = 1155, R = 480, L = 16, umax = 9240, amin = 43275
found 5173389 primes in range [199999963 : 299999957]
commencing stage 2 at p=199999991, A=199979010
w = 1155, R = 480, L = 16, umax = 9240, amin = 86571
found 53 primes in range [299999957 : 300000997]
commencing stage 2 at p=299999977, A=299974290
w = 1155, R = 480, L = 16, umax = 9240, amin = 129859
Stage 2 took 14.7954 seconds
performed 9087802 pair-multiplies for 16035509 primes in stage 2
performed 266472 point-additions and 57 point-doubles in stage 2
found factor 5062155107501579 in stage 2 in thread 0, vec position 0, with sigma = 424412553419277870
found factor 5062155107501579 in stage 2 in thread 0, vec position 1, with sigma = 15371561253067696485
found factor 5062155107501579 in stage 2 in thread 0, vec position 2, with sigma = 13475164758719783696
found factor 1053336016261649316809 in stage 2 in thread 0, vec position 4, with sigma = 333700535869040322
found factor 5062155107501579 in stage 2 in thread 0, vec position 5, with sigma = 8913334498632421225
Process took 34.3731 seconds.[/CODE]
Also, for now I'm only pushing updates to the avx-ecm repo. When things settle out a bit I'll revise yafu.

And now processing 2^n-c similarly to 2^n-1 (with special modulo), where "c" is small (single-limb). GMP-ECM doesn't do this, as far as I can tell.
Example:
[CODE]./avx-ecm-52-icc-static "(2^941 - 5987059)/211/18413" 8 250000 1
starting process 97314
commencing parallel ecm on 4784305585655994717562720658383425526844379349934463348761460837207088185558434958330904147209498950842345840906605538877380399014736734618876417744155088454218322338609736739235487631453648349849153543969782570442284070067941772311074196433531076214607947202503691346829979451
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 920 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
[U]Using special pseudo-Mersenne mod for factor of: 2^941-5987059[/U]
Initialization took 0.1466 seconds.
Cached 1565994 primes in range [2 : 25000991]
commencing curves 0-7 of 8
Building curves took 0.0002 seconds.
commencing Stage 1 @ prime 2
accumulating prime 180511
Stage 1 completed at prime 249989 with 496674 point-adds and 51127 point-doubles
Stage 1 took 1.8810 seconds
commencing stage 2 at p=250007, A=249480
w = 1155, R = 480, L = 16, umax = 9240, amin = 108
Stage 2 took 1.4885 seconds
performed 854762 pair-multiplies for 1543883 primes in stage 2
performed 30752 point-additions and 49 point-doubles in stage 2
found factor 4655226350979187 in stage 2 in thread 0, vec position 3, with sigma = 9539569019480891334
found factor 4655226350979187 in stage 2 in thread 0, vec position 7, with sigma = 17882739001568946650
Process took 3.7186 seconds.[/CODE]
And to compare:
[CODE]echo "(2^941-5987059)/211/18413" | ../ecm-704-linux/ecm -v 250000
GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asm-redc] [ECM]
Tuned for x86_64/k8/params.h
Running on cpu
Input number is (2^941-5987059)/211/18413 (277 digits)
[U]Using MODMULN [mulredc:1, sqrredc:1][/U]
Computing batch product (of 360843 bits) of primes up to B1=250000 took 17ms
Using B1=250000, B2=128992510, polynomial Dickson(3), sigma=1:1245750705
dF=2048, k=3, d=19110, d2=11, i0=3
Expected number of curves to find a factor of n digits:
35       40      45       50       55       60     65       70     75       80
6022     87544   1534319  3.1e+07  7.5e+08  2e+10  6.7e+13  1e+19  1.3e+24  Inf
Step 1 took 1673ms[/CODE]
Pretty substantial stage 1 throughput improvement for inputs of this form.
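The special pseudo-Mersenne mod is presumably the standard folding trick: since 2^n ≡ c (mod 2^n - c), the high half of a product can be multiplied by the small constant c and added back in, instead of doing a full division. A minimal sketch of the idea (not avx-ecm's vectorized code):

```python
def pseudo_mersenne_reduce(x, n, c):
    """Reduce x mod M = 2^n - c using 2^n ≡ c (mod M): repeatedly fold
    the bits above position n down by multiplying them by c.  A sketch
    of the standard trick, not avx-ecm's actual implementation."""
    M = (1 << n) - c
    mask = (1 << n) - 1
    while x >> n:                       # while anything remains above bit n
        x = (x >> n) * c + (x & mask)   # fold: hi * c + lo
    return x - M if x >= M else x       # one conditional subtract suffices

# Check against plain division for the modulus from the example above.
n, c = 941, 5987059
x = 3**1200  # an arbitrary big test value
assert pseudo_mersenne_reduce(x, n, c) == x % ((1 << n) - c)
```

Because c is single-limb, each fold costs only one small multiply and an add, which is where the stage 1 throughput win for this input form comes from.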

There have been a few updates to the avx-ecm repo:
1) added algebraic factor removal for Mersenne inputs
2) added checkpoint saving every 1e8 primes in stage 1
3) implemented batch inversion in stage 2, which is now almost twice as fast

Here's an example of that last item, compared to previously:
[QUOTE=bsquared;562023]Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 55 sec for 8 curves, 6.8 sec/stg2-curve[/QUOTE]
Now:
[CODE]B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 32 sec for 8 curves, 4.0 sec/stg2-curve[/CODE]
At this point I'm not sure whether it's better to keep the faster stage 2 time or run to higher B2 bounds, keeping timings similar to before. I think Bob's paper says to equalize the stage 1 and stage 2 runtimes, so the best answer is probably the latter. It might be moot if the best answer overall is still a hybrid plan with gmp-ecm stage 2, although this might change the crossover point (from pure avx-ecm to hybrid).
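For reference, "batch inversion" here is presumably Montgomery's simultaneous-inversion trick: k modular inverses for the price of one inversion plus 3(k-1) multiplications. A minimal sketch (not avx-ecm's actual stage 2 code):

```python
def batch_inverse(xs, p):
    """Montgomery's simultaneous-inversion trick: invert every element
    of xs mod the prime p with a single modular inversion and
    3*(len(xs)-1) multiplications."""
    k = len(xs)
    prefix = [1] * (k + 1)             # prefix[i] = x0*x1*...*x(i-1) mod p
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] * x % p
    inv = pow(prefix[k], p - 2, p)     # the single (expensive) inversion
    out = [0] * k
    for i in range(k - 1, -1, -1):
        out[i] = inv * prefix[i] % p   # = x_i^{-1}
        inv = inv * xs[i] % p          # peel x_i off the running inverse
    return out

p = 2**127 - 1                         # a Mersenne prime, just for the demo
xs = [3, 5, 7, 11]
assert all(x * y % p == 1 for x, y in zip(xs, batch_inverse(xs, p)))
```

Since an inversion costs many multiplies, amortizing one inversion over a whole batch of stage 2 points is a plausible source of the near-2x speedup.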

More updates:
1) Fixed some bugs in the fast stage 2, after finally bearing down and understanding all of the nuances of Montgomery's PAIR algorithm. I had it mostly right before, but not 100%, which caused some factors to be missed. Thank <deity> for a gold reference implementation in gmp-ecm. Fixing the bugs actually improved the speed a bit, too.
2) Got a shiny new laptop with a Tiger Lake i7-1165G7 in order to get AVX-512IFMA up and running. It helped more than I thought it would: going from my 52-bit double-precision floating-point integer multipliers to IFMA speeds up all of the vector math by about 40%! On this de facto quick benchmark the cpu gets:
[CODE]B1=7M on 2^1277-1:
avx-ecm stage 1: 47.3 sec for 8 curves, 5.9 sec/stg1-curve
avx-ecm stage 2: 18.3 sec for 8 curves, 2.3 sec/stg2-curve[/CODE]
The next goals are usability-focused: a proper set of command line flags, options to control memory use and when to exit, improved logging, and reading in savefiles.

Have you played with improving PRAC?
I recently did and gained one or two percent. This may be because prime95 can store data FFTed, which may change the costing functions: a double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain. Glad to share the C code for the changes.

[QUOTE=Prime95;564720]Have you played with improving PRAC?
I recently did and gained one or two percent. This may be because prime95 can store data FFTed which may change the costing functions. A double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain. Glad to share the C code for the changes.[/QUOTE]
I tried to adjust the balance of time spent optimizing chains, without notable success. I'd like to see your code; thanks for sharing.

I'm also looking into gmp-ecm's different parameterizations. "param 1" seems to be the default for me in most cases. It looks like they are using small parameters so that the multiply by (A+2)/4 is a single limb (saving a multiply when duplicating). Then they don't use PRAC at all, instead using Montgomery's ladder, which has a larger percentage of doublings, on a pre-multiplied product of the B1 primes. Eliminating a multiply could save 10% or so. What I don't understand yet is how it can be single-limb if they are using Montgomery multiplies (redc).
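For anyone following along, the Montgomery ladder mentioned here does exactly one differential addition and one doubling per scalar bit, which is why its doubling fraction (roughly 50%) is higher than PRAC's. A self-contained x-only sketch using the generic textbook formulas (not gmp-ecm's code; the prime, A, and base x below are arbitrary toy values):

```python
def xdbl(X, Z, a24, p):
    """x-only doubling on a Montgomery curve; a24 = (A+2)/4 is the
    constant that gmp-ecm's small parameterizations keep single-limb."""
    s = (X + Z) * (X + Z) % p
    d = (X - Z) * (X - Z) % p
    t = (s - d) % p                      # = 4*X*Z
    return s * d % p, t * (d + a24 * t) % p

def xadd(XP, ZP, XQ, ZQ, XD, ZD, p):
    """x-only differential addition: x(P+Q) from x(P), x(Q), x(P-Q)."""
    u = (XP - ZP) * (XQ + ZQ) % p
    v = (XP + ZP) * (XQ - ZQ) % p
    return ZD * (u + v) ** 2 % p, XD * (u - v) ** 2 % p

def ladder(k, x, A, p):
    """Montgomery ladder for x([k]P): one xadd and one xdbl per scalar
    bit after the leading one, so about half of all operations are
    doublings (versus PRAC chains, which are addition-heavy)."""
    a24 = (A + 2) * pow(4, p - 2, p) % p
    R0 = (x % p, 1)                      # R0 = [m]P for the bit prefix m
    R1 = xdbl(*R0, a24, p)               # invariant: R1 - R0 = P
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            R0 = xadd(*R0, *R1, x % p, 1, p)
            R1 = xdbl(R1[0], R1[1], a24, p)
        else:
            R1 = xadd(*R0, *R1, x % p, 1, p)
            R0 = xdbl(R0[0], R0[1], a24, p)
    return R0

# Consistency check: x([6]P) from the ladder must match doubling x([3]P)
# (projective points agree when cross-multiplied coordinates are equal).
p, A, x = 2**61 - 1, 682, 7              # arbitrary toy parameters
a24 = (A + 2) * pow(4, p - 2, p) % p
X6, Z6 = ladder(6, x, A, p)
X3, Z3 = ladder(3, x, A, p)
Xd, Zd = xdbl(X3, Z3, a24, p)
assert (X6 * Zd - Xd * Z6) % p == 0
```

Note how a24 multiplies the 4XZ term once per doubling: if a24 fits in a single limb, that multiply is cheap, which is the saving described above.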

[QUOTE=Prime95;564720]Have you played with improving PRAC?
I recently did and gained one or two percent.[/QUOTE]
The gain is not quite as large for me on the smaller inputs I am dealing with, but it is still about 1%. It also seemed to help very slightly to define the cost functions to account for a square being faster than a multiply; doing that increases the total dups/adds ratio very slightly in stage 1.

In my tinyecm code (< 64 bits), where there are just a few fixed and very small B1 values, I found that in a few cases the cost of PRAC(p1*p2) is cheaper than PRAC(p1) + PRAC(p2). It isn't practical to test combinations of primes like this in real time for arbitrary (and larger) B1, but I wonder if it is worth compiling a list of prime pairs that are cheaper when combined, below bounds like 1M, 3M, 11M, etc.
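The combined-vs-separate effect can be illustrated even with a crude cost model. The sketch below costs a plain binary double-and-add chain at prime95's quoted 10 transforms per double and 12 per add; this is only a stand-in for a real Lucas-chain/PRAC cost (x-only Montgomery arithmetic genuinely requires Lucas chains), but it shows how a product's chain can be cheaper than the two primes' chains combined:

```python
def chain_cost(k, dup=10, add=12):
    """Cost of a binary double-and-add chain for k, in prime95-style
    'transforms' (double = 10, add = 12).  A crude stand-in for PRAC's
    Lucas-chain cost, used only to illustrate the combined-primes effect."""
    dbls = k.bit_length() - 1        # one double per bit after the leading one
    adds = bin(k).count("1") - 1     # one add per set bit after the leading one
    return dup * dbls + add * adds

p1, p2 = 3, 11
combined = chain_cost(p1 * p2)               # 33 = 0b100001: 5 dbls + 1 add = 62
separate = chain_cost(p1) + chain_cost(p2)   # 22 + 54 = 76
print(combined, separate)                    # the product's chain is cheaper here
```

The win comes from the product having a much sparser bit pattern than its factors; a real PRAC cost function would be the right filter for building the proposed list of cheap prime pairs.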

