mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > YAFU

Reply
 
Thread Tools
Old 2020-11-03, 02:50   #67
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

13×257 Posts
Default

Quote:
Originally Posted by VBCurtis View Post
When GMP-ECM is faster for Stage 2 with a higher B2, there is only advantage from using it. Save time, get more work- what's not to like?
If it were comparing curve-for-curve, sure, but it's not that straightforward. avx-ecm stage 2 may be slower and smaller B2, but it is doing 8 curves at once in that time.

Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve

gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve

Now, each avx-ecm stage 2 curve is 2.8 times less likely to find a factor, but if you had 64 seconds to spend doing something, would you rather run 2 gmp-ecm stage 2 curves or 8 avx-ecm stage 2 curves? Even with each avx-ecm being 2.8 times less likely to find a factor, you'd be better off.

Quote:
Originally Posted by VBCurtis View Post
The difficult tradeoff would be on larger numbers, when the avx-ecm stage 2 is likely faster than GMP-ECM but a smaller B2. That leads to some need for study.
The same math as above applies. I think you are better off running pure avx-ecm until time/curve * probability-multiplier is greater than gmp-ecm's stage 2 runtime.

It gets interesting with generic inputs, when the throughput advantage of avx-ecm is not as great. Then, I think it makes sense to run avx-ecm stage 1 followed by 8 gmp-ecm stage 2's. Then you get all of the probability advantage of gmp-ecm stage 2 but with avx-ecm's higher stage 1 throughput. I am putting together some data on this case.
bsquared is offline   Reply With Quote
Old 2020-11-03, 03:05   #68
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

13×257 Posts
Default

Here are a couple other things to keep in mind:

1) for lower-end skylake processors the speedups are not as dramatic, with or without special Mersenne math.
I ran on a 7800x and only saw about 2.1 times speedup for 2^1277-1.
More testing/benchmarking is needed for a variety of avx512 capable processors.
The good news is that this situation will only improve for avx-ecm as time goes on. I plan to implement enhancements as soon I can get my hands on an ice lake cpu or whatever cpu supports AVX-512IFMA.


2) avx-ecm uses fixed-time arithmetic in steps of 208 bits. So a curve at, say, 416 bits takes just as long as a curve at 623 bits.
gmp-ecm on the other hand will adjust the size of its arithmetic in 64-bit steps. This could mean fine-tune adjustments for any crossover math that is performed for given input sizes.
bsquared is offline   Reply With Quote
Old 2020-11-03, 03:20   #69
VBCurtis
 
VBCurtis's Avatar
 
"Curtis"
Feb 2005
Riverside, CA

112·37 Posts
Default

Quote:
Originally Posted by bsquared View Post
Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve

gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve
Thank you for rewarding my "I'm confused so I'll state the obvious and see where I'm wrong"! This makes much sense, and I'm all clear now.
VBCurtis is offline   Reply With Quote
Old 2020-11-03, 18:03   #70
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

13×257 Posts
Default

I updated the first post with a table of timings and some analysis for a generic 900-bit input, and a couple different Mersenne inputs.

It looks like a hybrid plan is best for most B1 levels.

The relative speeds of avx-ecm and gmp-ecm should be checked on the particular cpu beforehand, and crossovers adjusted.

Last fiddled with by bsquared on 2020-11-03 at 19:50
bsquared is offline   Reply With Quote
Old 2020-11-05, 15:01   #71
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

13·257 Posts
Default

Now processing 2^n+1 similarly to 2^n-1 (with special modulo)

Here verifying some of the factors of 2^941+1

Code:
./avx-ecm "(2^941+1)/3/1738969" 8 3000000
starting process 60820
commencing parallel ecm on 3562975409753765647916625929286022733708067351160837396219090783379741715638222929662078302565476897014358133541083282322395686863751130956979946597434060607614269433132169774980382077527684835263114705184392392863404860166373197969977946080763606872357617994374214244244159779
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 919 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
Using special Mersenne mod for factor of: 2^941+1
Initialization took 0.0899 seconds.
Cached 5761455 primes in range [2 : 99999989]

commencing curves 0-7 of 8
Building curves took 0.0001 seconds.
commencing Stage 1 @ prime 2
accumulating prime 2943173
Stage 1 completed at prime 2999999 with 6040073 point-adds and 565931 point-doubles
Stage 1 took 19.2999 seconds

found factor 5062155107501579 in stage 1 in thread 0, vec position 0, with sigma = 424412553419277870

found factor 5062155107501579 in stage 1 in thread 0, vec position 5, with sigma = 8913334498632421225


commencing stage 2 at p=3000017, A=3000690
w = 1155, R = 480, L = 16, umax = 9240, amin = 1299
found 5317482 primes in range [99999989 : 199999963]

commencing stage 2 at p=100000007, A=99965250
w = 1155, R = 480, L = 16, umax = 9240, amin = 43275
found 5173389 primes in range [199999963 : 299999957]

commencing stage 2 at p=199999991, A=199979010
w = 1155, R = 480, L = 16, umax = 9240, amin = 86571
found 53 primes in range [299999957 : 300000997]

commencing stage 2 at p=299999977, A=299974290
w = 1155, R = 480, L = 16, umax = 9240, amin = 129859
Stage 2 took 14.7954 seconds
performed 9087802 pair-multiplies for 16035509 primes in stage 2
performed 266472 point-additions and 57 point-doubles in stage 2

found factor 5062155107501579 in stage 2 in thread 0, vec position 0, with sigma = 424412553419277870

found factor 5062155107501579 in stage 2 in thread 0, vec position 1, with sigma = 15371561253067696485

found factor 5062155107501579 in stage 2 in thread 0, vec position 2, with sigma = 13475164758719783696

found factor 1053336016261649316809 in stage 2 in thread 0, vec position 4, with sigma = 333700535869040322

found factor 5062155107501579 in stage 2 in thread 0, vec position 5, with sigma = 8913334498632421225
Process took 34.3731 seconds.
Also, I'm only updating the avx-ecm repo with updates, for now. When things settle out a bit I'll revise yafu.

Last fiddled with by bsquared on 2020-11-05 at 15:17
bsquared is offline   Reply With Quote
Old 2020-11-05, 20:59   #72
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

13·257 Posts
Default

And now processing 2^n-c similarly to 2^n-1 (with special modulo), where "c" is small (single-limb). GMP-ECM doesn't do this, as far as I can tell.

Example:
Code:
./avx-ecm-52-icc-static "(2^941 - 5987059)/211/18413" 8 250000 1
starting process 97314
commencing parallel ecm on 4784305585655994717562720658383425526844379349934463348761460837207088185558434958330904147209498950842345840906605538877380399014736734618876417744155088454218322338609736739235487631453648349849153543969782570442284070067941772311074196433531076214607947202503691346829979451
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 920 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
Using special pseudo-Mersenne mod for factor of: 2^941-5987059
Initialization took 0.1466 seconds.
Cached 1565994 primes in range [2 : 25000991]

commencing curves 0-7 of 8
Building curves took 0.0002 seconds.
commencing Stage 1 @ prime 2
accumulating prime 180511
Stage 1 completed at prime 249989 with 496674 point-adds and 51127 point-doubles
Stage 1 took 1.8810 seconds


commencing stage 2 at p=250007, A=249480
w = 1155, R = 480, L = 16, umax = 9240, amin = 108
Stage 2 took 1.4885 seconds
performed 854762 pair-multiplies for 1543883 primes in stage 2
performed 30752 point-additions and 49 point-doubles in stage 2

found factor 4655226350979187 in stage 2 in thread 0, vec position 3, with sigma = 9539569019480891334

found factor 4655226350979187 in stage 2 in thread 0, vec position 7, with sigma = 17882739001568946650
Process took 3.7186 seconds.
And to compare:
Code:
echo "(2^941-5987059)/211/18413" | ../ecm-704-linux/ecm -v 250000
GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asm-redc] [ECM]
Tuned for x86_64/k8/params.h
Running on cpu
Input number is (2^941-5987059)/211/18413 (277 digits)
Using MODMULN [mulredc:1, sqrredc:1]
Computing batch product (of 360843 bits) of primes up to B1=250000 took 17ms
Using B1=250000, B2=128992510, polynomial Dickson(3), sigma=1:1245750705
dF=2048, k=3, d=19110, d2=11, i0=3
Expected number of curves to find a factor of n digits:
35      40      45      50      55      60      65      70      75      80
6022    87544   1534319 3.1e+07 7.5e+08 2e+10   6.7e+13 1e+19   1.3e+24 Inf
Step 1 took 1673ms
Pretty substantial stage 1 throughput improvement, for inputs of this form.

Last fiddled with by bsquared on 2020-11-05 at 21:04 Reason: gmp-ecm output comparison
bsquared is offline   Reply With Quote
Old 2020-11-19, 17:17   #73
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

64158 Posts
Default

There have been a few updates to the avx-ecm repo:

1) added algebraic factor removal of Mersenne inputs
2) added checkpoint saving every 1e8 primes in stage 1
3) implemented batch inversion in stage 2, now almost twice as fast

Here's an example of that last item compared to previously
Quote:
Originally Posted by bsquared View Post
Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 55 sec for 8 curves, 6.8 sec/stg2-curve
Now:
Code:
B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 32 sec for 8 curves, 4.0 sec/stg2-curve
At this point I'm not sure whether it's better to keep the faster stage 2 time or run to higher B2 bounds, keeping timings similar to previously.

I think that Bob's paper says to equalize the stage 1 and stage 2 runtimes, so the best answer is probably the latter. It might be moot if the best-best answer is to still use a hybrid plan with gmp-ecm stage 2. However this might change the crossover point (from pure avx-ecm to hybrid).
bsquared is offline   Reply With Quote
Reply

Thread Tools


All times are UTC. The time now is 19:15.

Thu Nov 26 19:15:14 UTC 2020 up 77 days, 16:26, 3 users, load averages: 1.28, 1.21, 1.32

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.