mersenneforum.org (https://www.mersenneforum.org/index.php)
-   YAFU (https://www.mersenneforum.org/forumdisplay.php?f=96)

 bsquared 2020-11-03 02:50

[QUOTE=VBCurtis;562018]When GMP-ECM is faster for Stage 2 with a higher B2, there is only advantage from using it. Save time, get more work- what's not to like?
[/QUOTE]

If it were a curve-for-curve comparison, sure, but it's not that straightforward. avx-ecm stage 2 may be slower and use a smaller B2, but it is doing 8 curves at once in that time.

Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve

gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve

Now, each avx-ecm stage 2 curve is 2.8 times less likely to find a factor. But if you had 64 seconds to spend, would you rather run 2 gmp-ecm stage 2 curves or 8 avx-ecm stage 2 curves? Even with each avx-ecm curve being 2.8 times less likely to find a factor, you'd be better off.
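As a sanity check on that arithmetic, here is the comparison as a quick sketch (timings and the 2.8x per-curve probability penalty are the numbers quoted above):

```python
# Compare stage 2 throughput using the timings quoted above for 2^1277-1.
# Treat one gmp-ecm stage 2 curve as 1 unit of factor-finding probability;
# each avx-ecm stage 2 curve is then worth 1/2.8 of a unit.
gmp_stg2 = 32.0    # seconds per gmp-ecm stage 2 curve
avx_batch = 55.0   # seconds for one batch of 8 avx-ecm stage 2 curves
penalty = 2.8      # avx-ecm curve is 2.8x less likely to find a factor

budget = 64.0                     # seconds available
gmp_units = budget // gmp_stg2    # 2 curves -> 2.0 units of probability
avx_units = 8 / penalty           # 8 curves (in 55 s) -> ~2.86 units

assert avx_units > gmp_units      # avx-ecm still wins on expected yield
```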

[QUOTE=VBCurtis;562018]
The difficult tradeoff would be on larger numbers, when the avx-ecm stage 2 is likely faster than GMP-ECM but a smaller B2. That leads to some need for study. [/QUOTE]

The same math as above applies. I think you are better off running pure avx-ecm until (time per avx-ecm curve) * (probability multiplier) exceeds gmp-ecm's stage 2 runtime per curve.

It gets interesting with generic inputs, where the throughput advantage of avx-ecm is not as great. There, I think it makes sense to run avx-ecm stage 1 followed by 8 gmp-ecm stage 2s: you get all of the probability advantage of gmp-ecm's stage 2 but with avx-ecm's higher stage 1 throughput. I am putting together some data on this case.
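For illustration only (using the Mersenne-input timings above, where the stage 1 gap is larger than it would be for a generic input), the per-curve cost of the hybrid plan looks like:

```python
# Hybrid plan: avx-ecm stage 1 feeds 8 separate gmp-ecm stage 2 runs, so each
# curve keeps gmp-ecm's full stage 2 probability.  Timings from the post above.
avx_stg1 = 78.0 / 8    # 9.75 s per curve of avx-ecm stage 1
gmp_stg1 = 60.0        # s per curve of gmp-ecm stage 1
gmp_stg2 = 32.0        # s per curve of gmp-ecm stage 2

pure_gmp = gmp_stg1 + gmp_stg2   # 92.0  s per full curve
hybrid   = avx_stg1 + gmp_stg2   # 41.75 s per full curve, same probability

assert hybrid < pure_gmp
```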

 bsquared 2020-11-03 03:05

Here are a couple other things to keep in mind:

1) For lower-end Skylake processors the speedups are not as dramatic, with or without the special Mersenne math.
I ran on a 7800X and only saw about a 2.1x speedup for 2^1277-1.
[U]More testing/benchmarking is needed across a variety of AVX-512-capable processors[/U].
The good news is that this situation will only improve for avx-ecm as time goes on. I plan to implement enhancements as soon as I can get my hands on an Ice Lake cpu, or whatever cpu supports AVX-512IFMA.

2) avx-ecm uses fixed-size (and hence fixed-time) arithmetic in steps of 208 bits, so a curve at, say, 416 bits takes just as long as a curve at 623 bits.
gmp-ecm, on the other hand, adjusts the size of its arithmetic in 64-bit steps. This means any crossover calculations should be fine-tuned for the given input size.
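A small sketch of the size granularity. The avx-ecm rounding rule here is an assumption: round up to the next multiple of 208 bits, always leaving some headroom (consistent with 941-bit inputs getting MAXBITS = 1040 in the logs later in the thread, and with 416-bit and 623-bit curves costing the same):

```python
# Size granularity sketch.  avx_maxbits is an assumed rule, not avx-ecm's
# actual code: round up to the next 208-bit multiple, strictly greater than
# the input size (so a 941-bit input gets MAXBITS = 1040).
def avx_maxbits(nbits, step=208):
    return step * (nbits // step + 1)

def gmp_limbs(nbits, limb=64):
    return -(-nbits // limb)        # plain ceiling division, 64-bit steps

assert avx_maxbits(416) == avx_maxbits(623) == 624   # same cost
assert avx_maxbits(941) == 1040
assert gmp_limbs(416) == 7 and gmp_limbs(623) == 10  # different cost
```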

 VBCurtis 2020-11-03 03:20

[QUOTE=bsquared;562023]Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve

gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve[/QUOTE]

Thank you for rewarding my "I'm confused so I'll state the obvious and see where I'm wrong"! This makes much sense, and I'm all clear now.

 bsquared 2020-11-03 18:03

I updated the first post with a table of timings and some analysis for a generic 900-bit input, and a couple different Mersenne inputs.

It looks like a hybrid plan is best for most B1 levels.

The relative speeds of avx-ecm and gmp-ecm should be checked beforehand on the particular cpu, and the crossovers adjusted accordingly.

 bsquared 2020-11-05 15:01

avx-ecm now processes 2^n+1 similarly to 2^n-1 (with a special modular reduction).

Here it verifies some of the factors of 2^941+1:

[CODE]./avx-ecm "(2^941+1)/3/1738969" 8 3000000
starting process 60820
commencing parallel ecm on 3562975409753765647916625929286022733708067351160837396219090783379741715638222929662078302565476897014358133541083282322395686863751130956979946597434060607614269433132169774980382077527684835263114705184392392863404860166373197969977946080763606872357617994374214244244159779
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Processing in batches of 100000000 primes
Using special Mersenne mod for factor of: 2^941+1
Initialization took 0.0899 seconds.
Cached 5761455 primes in range [2 : 99999989]

commencing curves 0-7 of 8
Building curves took 0.0001 seconds.
commencing Stage 1 @ prime 2
accumulating prime 2943173
Stage 1 completed at prime 2999999 with 6040073 point-adds and 565931 point-doubles
Stage 1 took 19.2999 seconds

found factor 5062155107501579 in stage 1 in thread 0, vec position 0, with sigma = 424412553419277870

found factor 5062155107501579 in stage 1 in thread 0, vec position 5, with sigma = 8913334498632421225

commencing stage 2 at p=3000017, A=3000690
w = 1155, R = 480, L = 16, umax = 9240, amin = 1299
found 5317482 primes in range [99999989 : 199999963]

commencing stage 2 at p=100000007, A=99965250
w = 1155, R = 480, L = 16, umax = 9240, amin = 43275
found 5173389 primes in range [199999963 : 299999957]

commencing stage 2 at p=199999991, A=199979010
w = 1155, R = 480, L = 16, umax = 9240, amin = 86571
found 53 primes in range [299999957 : 300000997]

commencing stage 2 at p=299999977, A=299974290
w = 1155, R = 480, L = 16, umax = 9240, amin = 129859
Stage 2 took 14.7954 seconds
performed 9087802 pair-multiplies for 16035509 primes in stage 2
performed 266472 point-additions and 57 point-doubles in stage 2

found factor 5062155107501579 in stage 2 in thread 0, vec position 0, with sigma = 424412553419277870

found factor 5062155107501579 in stage 2 in thread 0, vec position 1, with sigma = 15371561253067696485

found factor 5062155107501579 in stage 2 in thread 0, vec position 2, with sigma = 13475164758719783696

found factor 1053336016261649316809 in stage 2 in thread 0, vec position 4, with sigma = 333700535869040322

found factor 5062155107501579 in stage 2 in thread 0, vec position 5, with sigma = 8913334498632421225
Process took 34.3731 seconds.
[/CODE]

Also, for now I'm only pushing updates to the avx-ecm repo. When things settle out a bit I'll revise yafu.

 bsquared 2020-11-05 20:59

And now 2^n-c is processed similarly to 2^n-1 (with a special modular reduction), where "c" is small (single-limb). GMP-ECM doesn't do this, as far as I can tell.

Example:
[CODE]./avx-ecm-52-icc-static "(2^941 - 5987059)/211/18413" 8 250000 1
starting process 97314
commencing parallel ecm on 4784305585655994717562720658383425526844379349934463348761460837207088185558434958330904147209498950842345840906605538877380399014736734618876417744155088454218322338609736739235487631453648349849153543969782570442284070067941772311074196433531076214607947202503691346829979451
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Processing in batches of 100000000 primes
[U]Using special pseudo-Mersenne mod for factor of: 2^941-5987059[/U]
Initialization took 0.1466 seconds.
Cached 1565994 primes in range [2 : 25000991]

commencing curves 0-7 of 8
Building curves took 0.0002 seconds.
commencing Stage 1 @ prime 2
accumulating prime 180511
Stage 1 completed at prime 249989 with 496674 point-adds and 51127 point-doubles
Stage 1 took 1.8810 seconds

commencing stage 2 at p=250007, A=249480
w = 1155, R = 480, L = 16, umax = 9240, amin = 108
Stage 2 took 1.4885 seconds
performed 854762 pair-multiplies for 1543883 primes in stage 2
performed 30752 point-additions and 49 point-doubles in stage 2

found factor 4655226350979187 in stage 2 in thread 0, vec position 3, with sigma = 9539569019480891334

found factor 4655226350979187 in stage 2 in thread 0, vec position 7, with sigma = 17882739001568946650
Process took 3.7186 seconds.
[/CODE]

And to compare:
[CODE]
echo "(2^941-5987059)/211/18413" | ../ecm-704-linux/ecm -v 250000
GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asm-redc] [ECM]
Tuned for x86_64/k8/params.h
Running on cpu
Input number is (2^941-5987059)/211/18413 (277 digits)
[U]Using MODMULN [mulredc:1, sqrredc:1][/U]
Computing batch product (of 360843 bits) of primes up to B1=250000 took 17ms
Using B1=250000, B2=128992510, polynomial Dickson(3), sigma=1:1245750705
dF=2048, k=3, d=19110, d2=11, i0=3
Expected number of curves to find a factor of n digits:
35 40 45 50 55 60 65 70 75 80
6022 87544 1534319 3.1e+07 7.5e+08 2e+10 6.7e+13 1e+19 1.3e+24 Inf
Step 1 took 1673ms
[/CODE]

Pretty substantial stage 1 throughput improvement, for inputs of this form.
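The idea behind the special reduction is the standard pseudo-Mersenne fold: with N = 2^n - c, the identity 2^n ≡ c (mod N) lets the high part be folded back in with one small multiply instead of a full division. A minimal scalar sketch (not avx-ecm's vectorized code):

```python
# Pseudo-Mersenne reduction mod N = 2^n - c for small c: split x into
# x = hi*2^n + lo and replace 2^n by c until x fits in n bits.
def reduce_pm(x, n, c):
    N = (1 << n) - c
    mask = (1 << n) - 1
    while x >> n:                       # fold the high part down
        x = (x & mask) + (x >> n) * c
    return x - N if x >= N else x       # at most one final subtraction

n, c = 941, 5987059                     # the modulus from the example above
x = 1234567 ** 200                      # arbitrary big test value
assert reduce_pm(x, n, c) == x % ((1 << n) - c)
```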

 bsquared 2020-11-19 17:17

There have been a few updates to the avx-ecm repo:

1) added algebraic factor removal of Mersenne inputs
2) added checkpoint saving every 1e8 primes in stage 1
3) implemented batch inversion in stage 2, now almost twice as fast

Here's an example of that last item, compared to the previous timings:
[QUOTE=bsquared;562023]
Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 55 sec for 8 curves, 6.8 sec/stg2-curve
[/QUOTE]

Now:
[CODE]
B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 32 sec for 8 curves, 4.0 sec/stg2-curve[/CODE]

At this point I'm not sure whether it's better to keep the faster stage 2 time or to run to higher B2 bounds, keeping timings similar to before.

I think Bob's paper says to equalize the stage 1 and stage 2 runtimes, so the best answer is probably the latter. It might be moot if the overall best answer is still a hybrid plan with gmp-ecm stage 2; however, this might change the crossover point from pure avx-ecm to hybrid.
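For reference, the batch inversion in item 3 above is Montgomery's simultaneous-inversion trick: n modular inverses for the price of one inversion plus about 3n multiplications. A scalar sketch of the trick itself:

```python
# Montgomery's simultaneous inversion: invert n residues with a single
# modular inversion plus about 3n multiplications.
def batch_inverse(xs, p):
    n = len(xs)
    prefix = [1] * (n + 1)
    for i, x in enumerate(xs):          # prefix[i+1] = x0*...*xi mod p
        prefix[i + 1] = prefix[i] * x % p
    inv = pow(prefix[n], p - 2, p)      # the one inversion (p prime, Fermat)
    out = [0] * n
    for i in range(n - 1, -1, -1):      # peel one factor off at a time
        out[i] = inv * prefix[i] % p
        inv = inv * xs[i] % p
    return out

p = 2**127 - 1                          # a Mersenne prime for the demo
xs = [3, 7, 11, 12345]
assert all(x * y % p == 1 for x, y in zip(xs, batch_inverse(xs, p)))
```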

 bsquared 2020-11-28 00:58

1) Fixed some bugs in the fast stage 2, after finally bearing down and correctly working through all of the nuances of Montgomery's PAIR algorithm. I had it mostly right before, but not 100%, which caused some factors to be missed. Thank <deity> for a gold reference implementation in gmp-ecm. Fixing the bugs actually improved the speed a bit, too.

2) Got a shiny new laptop with a Tiger Lake i7-1165G7 in order to get AVX-512IFMA up and running. It helped more than I thought it would: going from my 52-bit integer multipliers built on double-precision floating point to IFMA speeds up all of the vector math by about 40%!

On this de facto quick benchmark, the new cpu gets:
[CODE]
B1=7M on 2^1277-1:
avx-ecm stage 1: 47.3 sec for 8 curves, 5.9 sec/stg1-curve
avx-ecm stage 2: 18.3 sec for 8 curves, 2.3 sec/stg2-curve
[/CODE]

The next goals are usability-focused: a proper set of command line flags, options to control memory use and when to exit, improved logging, and reading in savefiles.

 Prime95 2020-11-29 05:19

Have you played with improving PRAC?

I recently did and gained one or two percent. This may be because prime95 can store data FFTed which may change the costing functions. A double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain.

Glad to share the C code for the changes.

 bsquared 2020-11-29 18:13

[QUOTE=Prime95;564720]Have you played with improving PRAC?

I recently did and gained one or two percent. This may be because prime95 can store data FFTed which may change the costing functions. A double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain.

Glad to share the C code for the changes.[/QUOTE]

I tried adjusting how much time is spent optimizing chains, without notable success. I'd like to see your code; thanks for offering to share it.

I'm also looking into gmp-ecm's different parameterizations. "param 1" seems to be the default for me in most cases. It looks like they use small parameters so that the multiply by (A+2)/4 fits in a single limb (saving a multiply when duplicating). They then skip PRAC entirely and instead use Montgomery's ladder, which has a larger percentage of doublings, on a pre-multiplied product of the B1 primes. Eliminating a multiply could save 10% or so. What I don't understand yet is how the constant can stay single-limb if they are using Montgomery multiplication (REDC).
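To make the ladder/PRAC contrast concrete: the ladder spends exactly one add and one double per scalar bit regardless of the bit pattern, while PRAC finds shorter Lucas chains with proportionally fewer doubles. A sketch with plain integers standing in for x-only curve points (so "add" is + and "double" is *2; a hypothetical helper, not gmp-ecm's code):

```python
# Montgomery ladder over integers as stand-ins for x-only curve points.
# The invariant R1 - R0 == P is what makes differential addition possible.
def ladder(k, P):
    adds = dbls = 0
    R0, R1 = 0, P
    for bit in bin(k)[2:]:                 # msb -> lsb
        if bit == '1':
            R0, R1 = R0 + R1, 2 * R1
        else:
            R0, R1 = 2 * R0, R0 + R1
        adds += 1
        dbls += 1
    return R0, adds, dbls                  # R0 == k*P

kP, adds, dbls = ladder(1009, 5)           # 1009: a stage 1 prime; 5: a "point"
assert kP == 1009 * 5
assert adds == dbls == (1009).bit_length() # one add + one double per bit
```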

 bsquared 2020-11-30 14:47

[QUOTE=Prime95;564720]Have you played with improving PRAC?

I recently did and gained one or two percent. [/QUOTE]

The gain is not quite as large for the smaller inputs I'm dealing with, but it is still about 1%.

It also seemed to help very slightly to make the cost functions account for a square being faster than a multiply; doing that slightly increases the overall dup/add ratio in stage 1.

In my tinyecm code (for inputs < 64 bits), where there are just a few fixed and very small B1 values, I found that in a few cases PRAC(p1*p2) is cheaper than PRAC(p1) + PRAC(p2). It isn't practical to test combinations of primes like this in real time for arbitrary (and larger) B1, but I wonder if it is worth compiling a list of prime pairs that are cheaper when combined, below bounds like 1M, 3M, 11M, etc.
