mersenneforum.org > YAFU AVX-ECM
 Register FAQ Search Today's Posts Mark Forums Read

2020-11-03, 02:50   #67
bsquared

"Ben"
Feb 2007

13×257 Posts

Quote:
 Originally Posted by VBCurtis When GMP-ECM is faster for Stage 2 with a higher B2, there is only advantage from using it. Save time, get more work- what's not to like?
If it were comparing curve-for-curve, sure, but it's not that straightforward. avx-ecm stage 2 may be slower and smaller B2, but it is doing 8 curves at once in that time.

Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve

gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve

Now, each avx-ecm stage 2 curve is 2.8 times less likely to find a factor, but if you had 64 seconds to spend doing something, would you rather run 2 gmp-ecm stage 2 curves or 8 avx-ecm stage 2 curves? Even with each avx-ecm being 2.8 times less likely to find a factor, you'd be better off.

Quote:
 Originally Posted by VBCurtis The difficult tradeoff would be on larger numbers, when the avx-ecm stage 2 is likely faster than GMP-ECM but a smaller B2. That leads to some need for study.
The same math as above applies. I think you are better off running pure avx-ecm until time/curve * probability-multiplier is greater than gmp-ecm's stage 2 runtime.

It gets interesting with generic inputs, when the throughput advantage of avx-ecm is not as great. Then, I think it makes sense to run avx-ecm stage 1 followed by 8 gmp-ecm stage 2's. Then you get all of the probability advantage of gmp-ecm stage 2 but with avx-ecm's higher stage 1 throughput. I am putting together some data on this case.

 2020-11-03, 03:05 #68 bsquared     "Ben" Feb 2007 13×257 Posts Here are a couple other things to keep in mind: 1) for lower-end skylake processors the speedups are not as dramatic, with or without special Mersenne math. I ran on a 7800x and only saw about 2.1 times speedup for 2^1277-1. More testing/benchmarking is needed for a variety of avx512 capable processors. The good news is that this situation will only improve for avx-ecm as time goes on. I plan to implement enhancements as soon I can get my hands on an ice lake cpu or whatever cpu supports AVX-512IFMA. 2) avx-ecm uses fixed-time arithmetic in steps of 208 bits. So a curve at, say, 416 bits takes just as long as a curve at 623 bits. gmp-ecm on the other hand will adjust the size of its arithmetic in 64-bit steps. This could mean fine-tune adjustments for any crossover math that is performed for given input sizes.
2020-11-03, 03:20   #69
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

112·37 Posts

Quote:
 Originally Posted by bsquared Expanding on the example from before, with B1=7M on 2^1277-1: avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve gmp-ecm stage 1: 60 sec for 1 curve gmp-ecm stage 2: 32 sec for 1 curve
Thank you for rewarding my "I'm confused so I'll state the obvious and see where I'm wrong"! This makes much sense, and I'm all clear now.

 2020-11-03, 18:03 #70 bsquared     "Ben" Feb 2007 13×257 Posts I updated the first post with a table of timings and some analysis for a generic 900-bit input, and a couple different Mersenne inputs. It looks like a hybrid plan is best for most B1 levels. The relative speeds of avx-ecm and gmp-ecm should be checked on the particular cpu beforehand, and crossovers adjusted. Last fiddled with by bsquared on 2020-11-03 at 19:50
 2020-11-05, 15:01 #71 bsquared     "Ben" Feb 2007 13·257 Posts Now processing 2^n+1 similarly to 2^n-1 (with special modulo) Here verifying some of the factors of 2^941+1 Code: ./avx-ecm "(2^941+1)/3/1738969" 8 3000000 starting process 60820 commencing parallel ecm on 3562975409753765647916625929286022733708067351160837396219090783379741715638222929662078302565476897014358133541083282322395686863751130956979946597434060607614269433132169774980382077527684835263114705184392392863404860166373197969977946080763606872357617994374214244244159779 ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64 Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941 cached 3001134 primes < 49999991 Input has 919 bits, using 1 threads (8 curves/thread) Processing in batches of 100000000 primes Using special Mersenne mod for factor of: 2^941+1 Initialization took 0.0899 seconds. Cached 5761455 primes in range [2 : 99999989] commencing curves 0-7 of 8 Building curves took 0.0001 seconds. commencing Stage 1 @ prime 2 accumulating prime 2943173 Stage 1 completed at prime 2999999 with 6040073 point-adds and 565931 point-doubles Stage 1 took 19.2999 seconds found factor 5062155107501579 in stage 1 in thread 0, vec position 0, with sigma = 424412553419277870 found factor 5062155107501579 in stage 1 in thread 0, vec position 5, with sigma = 8913334498632421225 commencing stage 2 at p=3000017, A=3000690 w = 1155, R = 480, L = 16, umax = 9240, amin = 1299 found 5317482 primes in range [99999989 : 199999963] commencing stage 2 at p=100000007, A=99965250 w = 1155, R = 480, L = 16, umax = 9240, amin = 43275 found 5173389 primes in range [199999963 : 299999957] commencing stage 2 at p=199999991, A=199979010 w = 1155, R = 480, L = 16, umax = 9240, amin = 86571 found 53 primes in range [299999957 : 300000997] commencing stage 2 at p=299999977, A=299974290 w = 1155, R = 480, L = 16, umax = 9240, amin = 129859 Stage 2 took 14.7954 seconds performed 9087802 pair-multiplies for 16035509 primes in stage 2 performed 266472 point-additions and 57 point-doubles in stage 2 found factor 5062155107501579 in stage 2 in thread 0, vec position 0, with sigma = 424412553419277870 found factor 5062155107501579 in stage 2 in thread 0, vec position 1, with sigma = 15371561253067696485 found factor 5062155107501579 in stage 2 in thread 0, vec position 2, with sigma = 13475164758719783696 found factor 1053336016261649316809 in stage 2 in thread 0, vec position 4, with sigma = 333700535869040322 found factor 5062155107501579 in stage 2 in thread 0, vec position 5, with sigma = 8913334498632421225 Process took 34.3731 seconds. Also, I'm only updating the avx-ecm repo with updates, for now. When things settle out a bit I'll revise yafu. Last fiddled with by bsquared on 2020-11-05 at 15:17
 2020-11-05, 20:59 #72 bsquared     "Ben" Feb 2007 13·257 Posts And now processing 2^n-c similarly to 2^n-1 (with special modulo), where "c" is small (single-limb). GMP-ECM doesn't do this, as far as I can tell. Example: Code: ./avx-ecm-52-icc-static "(2^941 - 5987059)/211/18413" 8 250000 1 starting process 97314 commencing parallel ecm on 4784305585655994717562720658383425526844379349934463348761460837207088185558434958330904147209498950842345840906605538877380399014736734618876417744155088454218322338609736739235487631453648349849153543969782570442284070067941772311074196433531076214607947202503691346829979451 ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64 Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941 cached 3001134 primes < 49999991 Input has 920 bits, using 1 threads (8 curves/thread) Processing in batches of 100000000 primes Using special pseudo-Mersenne mod for factor of: 2^941-5987059 Initialization took 0.1466 seconds. Cached 1565994 primes in range [2 : 25000991] commencing curves 0-7 of 8 Building curves took 0.0002 seconds. commencing Stage 1 @ prime 2 accumulating prime 180511 Stage 1 completed at prime 249989 with 496674 point-adds and 51127 point-doubles Stage 1 took 1.8810 seconds commencing stage 2 at p=250007, A=249480 w = 1155, R = 480, L = 16, umax = 9240, amin = 108 Stage 2 took 1.4885 seconds performed 854762 pair-multiplies for 1543883 primes in stage 2 performed 30752 point-additions and 49 point-doubles in stage 2 found factor 4655226350979187 in stage 2 in thread 0, vec position 3, with sigma = 9539569019480891334 found factor 4655226350979187 in stage 2 in thread 0, vec position 7, with sigma = 17882739001568946650 Process took 3.7186 seconds. And to compare: Code: echo "(2^941-5987059)/211/18413" | ../ecm-704-linux/ecm -v 250000 GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asm-redc] [ECM] Tuned for x86_64/k8/params.h Running on cpu Input number is (2^941-5987059)/211/18413 (277 digits) Using MODMULN [mulredc:1, sqrredc:1] Computing batch product (of 360843 bits) of primes up to B1=250000 took 17ms Using B1=250000, B2=128992510, polynomial Dickson(3), sigma=1:1245750705 dF=2048, k=3, d=19110, d2=11, i0=3 Expected number of curves to find a factor of n digits: 35 40 45 50 55 60 65 70 75 80 6022 87544 1534319 3.1e+07 7.5e+08 2e+10 6.7e+13 1e+19 1.3e+24 Inf Step 1 took 1673ms Pretty substantial stage 1 throughput improvement, for inputs of this form. Last fiddled with by bsquared on 2020-11-05 at 21:04 Reason: gmp-ecm output comparison
2020-11-19, 17:17   #73
bsquared

"Ben"
Feb 2007

64158 Posts

There have been a few updates to the avx-ecm repo:

1) added algebraic factor removal of Mersenne inputs
2) added checkpoint saving every 1e8 primes in stage 1
3) implemented batch inversion in stage 2, now almost twice as fast

Here's an example of that last item compared to previously
Quote:
 Originally Posted by bsquared Expanding on the example from before, with B1=7M on 2^1277-1: avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve avx-ecm stage 2: 55 sec for 8 curves, 6.8 sec/stg2-curve
Now:
Code:
B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 32 sec for 8 curves, 4.0 sec/stg2-curve
At this point I'm not sure whether it's better to keep the faster stage 2 time or run to higher B2 bounds, keeping timings similar to previously.

I think that Bob's paper says to equalize the stage 1 and stage 2 runtimes, so the best answer is probably the latter. It might be moot if the best-best answer is to still use a hybrid plan with gmp-ecm stage 2. However this might change the crossover point (from pure avx-ecm to hybrid).

All times are UTC. The time now is 19:15.

Thu Nov 26 19:15:14 UTC 2020 up 77 days, 16:26, 3 users, load averages: 1.28, 1.21, 1.32

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.