20201103, 02:50  #67  
"Ben"
Feb 2007
3,373 Posts 
Quote:
Expanding on the example from before, with B1=7M on 2^12771: avxecm stage 1: 78 sec for 8 curves, 9.75 sec/curve avxecm stage 2: 55 sec for 8 curves, 6.87 sec/curve gmpecm stage 1: 60 sec for 1 curve gmpecm stage 2: 32 sec for 1 curve Now, each avxecm stage 2 curve is 2.8 times less likely to find a factor, but if you had 64 seconds to spend doing something, would you rather run 2 gmpecm stage 2 curves or 8 avxecm stage 2 curves? Even with each avxecm being 2.8 times less likely to find a factor, you'd be better off. Quote:
It gets interesting with generic inputs, when the throughput advantage of avxecm is not as great. Then, I think it makes sense to run avxecm stage 1 followed by 8 gmpecm stage 2's. Then you get all of the probability advantage of gmpecm stage 2 but with avxecm's higher stage 1 throughput. I am putting together some data on this case. 

20201103, 03:05  #68 
"Ben"
Feb 2007
6455_{8} Posts 
Here are a couple other things to keep in mind:
1) for lowerend skylake processors the speedups are not as dramatic, with or without special Mersenne math. I ran on a 7800x and only saw about 2.1 times speedup for 2^12771. More testing/benchmarking is needed for a variety of avx512 capable processors. The good news is that this situation will only improve for avxecm as time goes on. I plan to implement enhancements as soon I can get my hands on an ice lake cpu or whatever cpu supports AVX512IFMA. 2) avxecm uses fixedtime arithmetic in steps of 208 bits. So a curve at, say, 416 bits takes just as long as a curve at 623 bits. gmpecm on the other hand will adjust the size of its arithmetic in 64bit steps. This could mean finetune adjustments for any crossover math that is performed for given input sizes. 
20201103, 03:20  #69 
"Curtis"
Feb 2005
Riverside, CA
2·2,339 Posts 
Thank you for rewarding my "I'm confused so I'll state the obvious and see where I'm wrong"! This makes much sense, and I'm all clear now.

20201103, 18:03  #70 
"Ben"
Feb 2007
3,373 Posts 
I updated the first post with a table of timings and some analysis for a generic 900bit input, and a couple different Mersenne inputs.
It looks like a hybrid plan is best for most B1 levels. The relative speeds of avxecm and gmpecm should be checked on the particular cpu beforehand, and crossovers adjusted. Last fiddled with by bsquared on 20201103 at 19:50 
20201105, 15:01  #71 
"Ben"
Feb 2007
3,373 Posts 
Now processing 2^n+1 similarly to 2^n1 (with special modulo)
Here verifying some of the factors of 2^941+1 Code:
./avxecm "(2^941+1)/3/1738969" 8 3000000 starting process 60820 commencing parallel ecm on 3562975409753765647916625929286022733708067351160837396219090783379741715638222929662078302565476897014358133541083282322395686863751130956979946597434060607614269433132169774980382077527684835263114705184392392863404860166373197969977946080763606872357617994374214244244159779 ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64 Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941 cached 3001134 primes < 49999991 Input has 919 bits, using 1 threads (8 curves/thread) Processing in batches of 100000000 primes Using special Mersenne mod for factor of: 2^941+1 Initialization took 0.0899 seconds. Cached 5761455 primes in range [2 : 99999989] commencing curves 07 of 8 Building curves took 0.0001 seconds. commencing Stage 1 @ prime 2 accumulating prime 2943173 Stage 1 completed at prime 2999999 with 6040073 pointadds and 565931 pointdoubles Stage 1 took 19.2999 seconds found factor 5062155107501579 in stage 1 in thread 0, vec position 0, with sigma = 424412553419277870 found factor 5062155107501579 in stage 1 in thread 0, vec position 5, with sigma = 8913334498632421225 commencing stage 2 at p=3000017, A=3000690 w = 1155, R = 480, L = 16, umax = 9240, amin = 1299 found 5317482 primes in range [99999989 : 199999963] commencing stage 2 at p=100000007, A=99965250 w = 1155, R = 480, L = 16, umax = 9240, amin = 43275 found 5173389 primes in range [199999963 : 299999957] commencing stage 2 at p=199999991, A=199979010 w = 1155, R = 480, L = 16, umax = 9240, amin = 86571 found 53 primes in range [299999957 : 300000997] commencing stage 2 at p=299999977, A=299974290 w = 1155, R = 480, L = 16, umax = 9240, amin = 129859 Stage 2 took 14.7954 seconds performed 9087802 pairmultiplies for 16035509 primes in stage 2 performed 266472 pointadditions and 57 pointdoubles in stage 2 found factor 5062155107501579 in stage 2 in thread 0, vec position 0, with sigma = 424412553419277870 found factor 5062155107501579 in stage 2 in thread 0, vec position 1, with sigma = 15371561253067696485 found factor 5062155107501579 in stage 2 in thread 0, vec position 2, with sigma = 13475164758719783696 found factor 1053336016261649316809 in stage 2 in thread 0, vec position 4, with sigma = 333700535869040322 found factor 5062155107501579 in stage 2 in thread 0, vec position 5, with sigma = 8913334498632421225 Process took 34.3731 seconds. Last fiddled with by bsquared on 20201105 at 15:17 
20201105, 20:59  #72 
"Ben"
Feb 2007
3,373 Posts 
And now processing 2^nc similarly to 2^n1 (with special modulo), where "c" is small (singlelimb). GMPECM doesn't do this, as far as I can tell.
Example: Code:
./avxecm52iccstatic "(2^941  5987059)/211/18413" 8 250000 1 starting process 97314 commencing parallel ecm on 4784305585655994717562720658383425526844379349934463348761460837207088185558434958330904147209498950842345840906605538877380399014736734618876417744155088454218322338609736739235487631453648349849153543969782570442284070067941772311074196433531076214607947202503691346829979451 ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64 Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941 cached 3001134 primes < 49999991 Input has 920 bits, using 1 threads (8 curves/thread) Processing in batches of 100000000 primes Using special pseudoMersenne mod for factor of: 2^9415987059 Initialization took 0.1466 seconds. Cached 1565994 primes in range [2 : 25000991] commencing curves 07 of 8 Building curves took 0.0002 seconds. commencing Stage 1 @ prime 2 accumulating prime 180511 Stage 1 completed at prime 249989 with 496674 pointadds and 51127 pointdoubles Stage 1 took 1.8810 seconds commencing stage 2 at p=250007, A=249480 w = 1155, R = 480, L = 16, umax = 9240, amin = 108 Stage 2 took 1.4885 seconds performed 854762 pairmultiplies for 1543883 primes in stage 2 performed 30752 pointadditions and 49 pointdoubles in stage 2 found factor 4655226350979187 in stage 2 in thread 0, vec position 3, with sigma = 9539569019480891334 found factor 4655226350979187 in stage 2 in thread 0, vec position 7, with sigma = 17882739001568946650 Process took 3.7186 seconds. Code:
echo "(2^9415987059)/211/18413"  ../ecm704linux/ecm v 250000 GMPECM 7.0.4 [configured with GMP 6.2.0, enableasmredc] [ECM] Tuned for x86_64/k8/params.h Running on cpu Input number is (2^9415987059)/211/18413 (277 digits) Using MODMULN [mulredc:1, sqrredc:1] Computing batch product (of 360843 bits) of primes up to B1=250000 took 17ms Using B1=250000, B2=128992510, polynomial Dickson(3), sigma=1:1245750705 dF=2048, k=3, d=19110, d2=11, i0=3 Expected number of curves to find a factor of n digits: 35 40 45 50 55 60 65 70 75 80 6022 87544 1534319 3.1e+07 7.5e+08 2e+10 6.7e+13 1e+19 1.3e+24 Inf Step 1 took 1673ms Last fiddled with by bsquared on 20201105 at 21:04 Reason: gmpecm output comparison 
20201119, 17:17  #73  
"Ben"
Feb 2007
3,373 Posts 
There have been a few updates to the avxecm repo:
1) added algebraic factor removal of Mersenne inputs 2) added checkpoint saving every 1e8 primes in stage 1 3) implemented batch inversion in stage 2, now almost twice as fast Here's an example of that last item compared to previously Quote:
Code:
B1=7M on 2^12771: avxecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1curve avxecm stage 2: 32 sec for 8 curves, 4.0 sec/stg2curve I think that Bob's paper says to equalize the stage 1 and stage 2 runtimes, so the best answer is probably the latter. It might be moot if the bestbest answer is to still use a hybrid plan with gmpecm stage 2. However this might change the crossover point (from pure avxecm to hybrid). 

20201128, 00:58  #74 
"Ben"
Feb 2007
3,373 Posts 
More updates:
1) Fixed some bugs in the fast stage 2, after finally bearing down and understanding all of the nuances of Montgomery's PAIR algorithm correctly. I had it mostly right before, but not 100%, which caused some factors to be missed. Thank <deity> for a gold reference implementation in gmpecm. Also, the fixed bugs actually improved the speed a bit. 2) Got a shiny new laptop with a tigerlake i71165G7 in order to get AVX512IFMA up and running. It helped more than I thought it would. Going from my floating point 52bit double precision integer multipliers to IFMA speeds up all of the vector math by about 40%! In this defacto quick benchmark this cpu gets: Code:
B1=7M on 2^12771: avxecm stage 1: 47.3 sec for 8 curves, 5.9 sec/stg1curve avxecm stage 2: 18.3 sec for 8 curves, 2.3 sec/stg2curve 
20201129, 05:19  #75 
P90 years forever!
Aug 2002
Yeehaw, FL
3·5·491 Posts 
Have you played with improving PRAC?
I recently did and gained one or two percent. This may be because prime95 can store data FFTed which may change the costing functions. A double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain. Glad to share the C code for the changes. 
20201129, 18:13  #76  
"Ben"
Feb 2007
3,373 Posts 
Quote:
I'm also looking into gmpecm's different parameterizations. "param 1" seems to be the default for me in most cases. It looks like they are using small parameters so that the multiply by (A+2)/4 is a single limb (saving a multiply when duplicating). Then not using PRAC at all, instead using Montgomery's ladder, which has a larger percentage of doublings, on a premultiplied set of B1 primes. Eliminating a multiply could save 10% or so. What I don't understand yet is how it can be singlelimb if they are using montgomery multiplies (redc). Last fiddled with by bsquared on 20201129 at 18:16 

20201130, 14:47  #77  
"Ben"
Feb 2007
3,373 Posts 
Quote:
It also seemed to help very slightly for me to define the cost functions to include the faster nature of a square versus a multiply. Doing that increases the total dups / adds ratio very slightly in stage 1. In my tinyecm code (< 64 bits) where there are just a few fixed and very small B1 values, I found that in a few cases the cost of PRAC(p1*p2) is cheaper than PRAC(p1) + PRAC(p2). It isn't practical to test combinations of primes like this in real time for arbitrary (and larger) B1. But I wonder if it is worth compiling a list of prime pairs that are cheaper when combined below bounds like 1M, 3M, 11M, etc. 
