mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   YAFU (https://www.mersenneforum.org/forumdisplay.php?f=96)
-   -   AVX-ECM (https://www.mersenneforum.org/showthread.php?t=25056)

bsquared 2019-12-30 22:21

AVX-ECM
 
Just created a repository on github with new ECM code: [url]https://github.com/bbuhrow/avx-ecm[/url]

It computes curves in parallel using AVX-512 vector arithmetic, and is also multi-threaded. I've measured it to have about 2x the throughput of GMP-ECM using 1 thread on a Xeon Gold 5122 CPU. It does both stage 1 and stage 2 in parallel, but stage 2 is just the standard continuation with pairing (by default, B2=100xB1).

It will output a savefile after stage 1 that will work with GMP-ECM stage 2, if desired.
e.g.:
ecm -resume save_b1.txt 1000000 < input.txt

Compile on Linux using either icc or gcc-7.3.0 or above (I've only tested icc and gcc-7.3.0).

e.g.:
make MAXBITS=416 COMPILER=gcc730 SKYLAKEX=1

The MAXBITS parameter must be a multiple of 208; when it is, 8 curves are performed in parallel per thread.

Alternatively, if DIGITBITS=32 is specified during make, then MAXBITS must be a multiple of 128 and 16 curves are performed in parallel per thread.

The 52-bit version is generally faster (and the default) but for some sizes a 32-bit version can have higher throughput (because of the 208-bit jumps between sizes).

If anyone wants to test it out, you will of course need an AVX-512-capable computer. Let me know if there are any problems.

Happy factoring!

============================================================
[Edit] Sticky comparison tables

generic 900-bit input
Xeon Gold 5122 CPU, 1 thread, icc compiler
Notes:
1) "weight" is the ratio of the number of curves gmp-ecm needs for the given t-level using -power 1 and B2=100*B1 to the number it needs with its default settings.
2) hybrid = (avx-ecm stage 1 time + 8*gmp-ecm stage 2 time) / 8

[CODE]
generic 900-bit
                   avx-ecm                                        weighted
t-level  B1        stg 1    stg 2    sum      sec/curve  weight   sec/curve
35       1.00E+06    13.0      9.2     22.2       2.78      1.85       5.14
40       3.00E+06    39.3     26.3     65.6       8.20      2.07      17.01
45       1.00E+07   131.1     81.1    212.2      26.53      2.35      62.33
50       4.30E+07   567.1    334.3    901.4     112.68      2.46     277.42
55       1.10E+08  1448.3    822.6   2270.9     283.86      2.63     747.74
60       2.60E+08  3423.0   1874.0   5303.0     662.98      2.87    1902.6

gmp-ecm-704                                               hybrid
t-level  B1        stg 1    stg 2    sum      sec/curve   sec/curve
35       1.00E+06     6.59     3.45    10.04     10.04        5.08
40       3.00E+06    19.8      7.77    27.57     27.57       12.68
45       1.00E+07    65.9     25.4     91.3      91.30       41.79
50       4.30E+07   287.5     78.8    366.3     366.30      149.69
55       1.10E+08   738.3    196.2    934.5     934.50      377.24
60       2.60E+08  1750.0    412.5   2162.5    2162.5       841.1

                   gmp/weighted-avx   gmp/hybrid        avx/hybrid
t-level  B1        sec/curve ratio    sec/curve ratio   ratio
35       1.00E+06        1.95              1.98          1.01
40       3.00E+06        1.62              2.17          1.34
45       1.00E+07        1.46              2.18          1.49
50       4.30E+07        1.32              2.45          1.85
55       1.10E+08        1.25              2.48          1.98
60       2.60E+08        1.14              2.57          2.26
[/CODE]

Conclusions:
Below B1=1M, probably better to just use avx-ecm.
At B1=1M and above, hybrid is best.
Somewhere above 260M, pure GMP-ECM likely beats pure AVX-ECM, due to increasing weight, but hybrid is still best.
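For anyone who wants to re-derive the last two columns of the tables, here is a small Python sketch of the two formulas from the notes above, applied to the t-level 35 row of the generic 900-bit data (all numbers are copied from the table; this is my reading of the definitions, not code from AVX-ECM):

```python
# t-level 35 row of the generic 900-bit table:
avx_stg1, avx_stg2 = 13.0, 9.2   # AVX-ECM batch times for 8 simultaneous curves (s)
gmp_stg2 = 3.45                  # GMP-ECM stage 2 time per curve (s)
weight = 1.85                    # curve-count ratio from note 1

# AVX-ECM raw and weighted per-curve cost (8 curves per batch):
avx_per_curve = (avx_stg1 + avx_stg2) / 8   # ~2.78 s/curve
weighted = avx_per_curve * weight           # ~5.13 s/curve (table: 5.14)

# Hybrid: AVX-ECM stage 1, then GMP-ECM stage 2 on each of the 8 residues (note 2):
hybrid = (avx_stg1 + 8 * gmp_stg2) / 8      # ~5.08 s/curve (table: 5.08)

print(avx_per_curve, weighted, hybrid)
```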

For Mersenne inputs with primitive size between 832 and 1039 bits:
[CODE]
                   gmp/weighted-avx   gmp/hybrid        avx/hybrid
t-level  B1        sec/curve ratio    sec/curve ratio   ratio
35       1.00E+06        3.72              2.26          0.61
40       3.00E+06        3.06              2.56          0.84
45       1.00E+07        2.79              2.55          0.91
50       4.30E+07        2.49              2.97          1.19
55       1.10E+08        2.31              3.11          1.35
60       2.60E+08        2.09              3.29          1.57
[/CODE]

Hybrid best starting at B1 around 30M.



For Mersenne inputs with primitive size between 1248 and 1456 bits:
[CODE]
                   gmp/weighted-avx   gmp/hybrid        avx/hybrid
t-level  B1        sec/curve ratio    sec/curve ratio   ratio
35       1.00E+06        3.12              2.06          0.66
40       3.00E+06        2.59              2.32          0.89
45       1.00E+07        2.30              2.32          1.01
50       4.30E+07        2.08              2.67          1.28
55       1.10E+08        1.92              2.79          1.46
60       2.60E+08        1.74              2.95          1.69
65       8.50E+08        1.65              3.17          1.92
70       2.90E+09        1.52              3.40          2.23
75       7.60E+09        1.44              3.59          2.50
[/CODE]

Hybrid best starting at B1 around 10M.

mathwiz 2019-12-30 23:03

Hmmmm. Just tried it out on an AVX-512 capable machine, with B1=1e6 and input M433 ([url]http://factordb.com/index.php?id=1000000000000000433[/url]):

[CODE]./avx-ecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 100 1000000
commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591
ECM has been configured with MAXBITS = 416, NWORDS = 8, VECLEN = 8
starting process 171275
cached 3001134 primes < 49999991
Input has 433 bits, using 1 threads (100 curves/thread)
Processing in batches of 100000000 primes
Initialization took 0.0743 seconds.
Cached 5761455 primes in range [2 : 99999989]

commencing curves 0-7 of 100
Building curves took 0.0005 seconds.
commencing Stage 1 @ prime 2
accumulating prime 997693
Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles
Stage 1 took 4.1971 seconds

found factor 7573885960626120430675586583017049046287571561744815615 in stage 1 in thread 0, vec position 0, with sigma = 9805897911615405889

found factor 125483996795288390556263452518000395100487722298175677228160127607935 in stage 1 in thread 0, vec position 1, with sigma = 5941742753921851068

found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 2, with sigma = 17488052527796218971

found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 3, with sigma = 162542135677334094

found factor 9704134944669326942243922594738986945861288047284534894845604139673055 in stage 1 in thread 0, vec position 4, with sigma = 9639283928122886917

found factor 19065414140013908092677804414156626466105803295 in stage 1 in thread 0, vec position 5, with sigma = 14437779671575784496

found factor 455237245388167216759545663443996983027266740639813141535 in stage 1 in thread 0, vec position 6, with sigma = 5479294375532123583

found factor 161449908068885681487281375790285735990727019303070577509707381288895 in stage 1 in thread 0, vec position 7, with sigma = 16438922582860911842


commencing stage 2 at p=1000003, A=1000230
w = 1155, R = 480, L = 16, umax = 9240, amin = 433
Stage 2 took 3.1940 seconds
performed 3188920 pair-multiplies for 5682957 primes in stage 2
performed 95044 point-additions and 53 point-doubles in stage 2
something failed: tid = 0, vec = 0 has zero result
something failed: tid = 0, vec = 1 has zero result
something failed: tid = 0, vec = 2 has zero result
something failed: tid = 0, vec = 3 has zero result
something failed: tid = 0, vec = 4 has zero result
something failed: tid = 0, vec = 5 has zero result
something failed: tid = 0, vec = 6 has zero result
something failed: tid = 0, vec = 7 has zero result
Process took 7.4682 seconds.[/CODE]

Something seems wrong here? Or am I holding it wrong?

Compare to:

[CODE]4$ echo "22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591" | ./ecm -c 10 1e6
GMP-ECM 7.0.4 [configured with GMP 6.1.2, GWNUM 29.8, --enable-asm-redc] [ECM]
Due to incompatible licenses, this binary file must not be distributed.
Input number is 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 (131 digits)
Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:14118429249538045316
Step 1 took 2892ms
Step 2 took 2391ms
Run 2 out of 10:
Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:15640357536401406010
********** Factor found in step 1: 22086765417396827057
Found prime factor of 20 digits: 22086765417396827057
Composite cofactor 1004282751855334395242501879256940939756665292745707836441839543208470249351377144560254198205767424080811494063 has 112 digits[/CODE]

bsquared 2019-12-30 23:06

It looks like your input is larger than 416 bits. So you'd have to re-build with MAXBITS=624. Actually, this is a case where MAXBITS=512 DIGITBITS=32 will give you slightly higher throughput (because 512 is quite a bit smaller than 624).

bsquared 2019-12-31 14:45

New version that chooses MAXBITS automatically based on the input size. You still have to choose DIGITBITS at build time, as it affects the underlying word size, which can't easily be changed on the fly. Default is 52-bit words, giving steps of 208 bits. The arithmetic is fixed-size within these steps, so curves on a 417-bit input take the same amount of time as curves on a 623-bit input. This is similar to GPU-ECM, although with finer-grained size resolution.

The 32-bit build path gives steps of 128 bits. So for this input it will choose a 512-bit curve size which is faster than a 624-bit curve with 52-bit words.
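The selection rule described above can be sketched like this (my reconstruction from the description and the outputs in this thread, not the program's actual code; `choose_maxbits` is a hypothetical helper name):

```python
def choose_maxbits(input_bits, digitbits=52):
    """Smallest multiple of the step size that holds the input.

    Step is 208 bits for 52-bit words and 128 bits for 32-bit words
    (a reconstruction of the rule described above, not AVX-ECM source).
    """
    step = 208 if digitbits == 52 else 128
    maxbits = step
    while maxbits < input_bits:
        maxbits += step
    return maxbits

# 433-bit input, as in this thread:
print(choose_maxbits(433, 52))   # 624 (52-bit words)
print(choose_maxbits(433, 32))   # 512 (32-bit words)
```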

[CODE]./avx-ecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 16 1000000 1
commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591
starting process 226272
ECM has been configured with DIGITBITS = 32, VECLEN = 16
Choosing MAXBITS = 512, NWORDS = 16, NBLOCKS = 4 based on input size 433
cached 3001134 primes < 49999991
Input has 433 bits, using 1 threads (16 curves/thread)
Processing in batches of 100000000 primes
Initialization took 0.0895 seconds.
Cached 5761455 primes in range [2 : 99999989]

commencing curves 0-15 of 16
Building curves took 0.0006 seconds.
commencing Stage 1 @ prime 2
accumulating prime 997693
Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles
Stage 1 took 11.0072 seconds

found factor 22086765417396827057 in stage 1 in thread 0, vec position 10, with sigma = 415569030

found factor 22086765417396827057 in stage 1 in thread 0, vec position 11, with sigma = 3510146781


commencing stage 2 at p=1000003, A=1000230
w = 1155, R = 480, L = 16, umax = 9240, amin = 433
Stage 2 took 7.5996 seconds
performed 3188920 pair-multiplies for 5682957 primes in stage 2
performed 95044 point-additions and 53 point-doubles in stage 2

found factor 22086765417396827057 in stage 2 in thread 0, vec position 0, with sigma = 551874572

found factor 22086765417396827057 in stage 2 in thread 0, vec position 5, with sigma = 2520392143

found factor 22086765417396827057 in stage 2 in thread 0, vec position 6, with sigma = 1474306994

found factor 22086765417396827057 in stage 2 in thread 0, vec position 8, with sigma = 2588423220

found factor 22086765417396827057 in stage 2 in thread 0, vec position 10, with sigma = 415569030

found factor 22086765417396827057 in stage 2 in thread 0, vec position 11, with sigma = 3510146781

found factor 22086765417396827057 in stage 2 in thread 0, vec position 12, with sigma = 1003661608
Process took 18.7103 seconds.
[/CODE]

For this size input, 16 simultaneous 512-bit curves complete in 18.7 sec or 1.17 sec/curve versus GMP-ECM's 1.72 sec/curve.

bsquared 2019-12-31 20:16

Another commit:
+ fixed printing issues and get_ui/set_ui with mingw
+ parameterization seems not exactly compatible with gmp-ecm: looking into it
+ adding windows binaries for 32-bit and 52-bit arithmetic versions

The mingw64 build for windows is a little slower but seems to work.

I'm not exactly sure what's going on with the parameterization... for some sigma values this program finds factors when it seems like it shouldn't, and conversely it misses factors that gmp-ecm finds with the same sigma. On average, though, the factor-finding rate seems about the same.

For example:
[CODE]./avx-ecm 274083933049131091178904352004902235910804839081565071801565126307886989756226870709872076759480838311782134623705910949147664751751242690094033868727929674779456111539120708680813658241 128 1000000 16

<snip>
found factor 618970019642690137449562111 in stage 2 in thread 8, vec position 2, with sigma = 17981939361520122334

[/CODE]

But using [URL="https://www.mersenneforum.org/showpost.php?p=236443&postcount=1"]this sage code[/URL] [URL="http://magma.maths.usyd.edu.au/calc/"]here[/URL] gives the group order for that factor and sigma as:
[CODE][ <2, 2>, <3, 2>, <17193611656740451817020637, 1> ][/CODE]

And gmp-ecm with that sigma only finds the factor 74383430474532481 with the following entirely reasonable group order:
[CODE][ <2, 2>, <3, 3>, <19, 1>, <23, 1>, <3617, 1>, <435735059, 1> ][/CODE]

mathwiz 2019-12-31 20:52

Thanks for all the updates!

Would it be a reasonable FR to request basic expression parsing, so that the input does not have to be the full decimal representation of N?

bsquared 2019-12-31 21:56

[QUOTE=mathwiz;533849]Thanks for all the updates!

Would it be a reasonable FR to request basic expression parsing, so that the input does not have to be the full decimal representation of N?[/QUOTE]

Yes, that's reasonable :)

BTW, thanks for testing it.

chris2be8 2020-01-01 16:36

How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.

Chris

CRGreathouse 2020-01-01 17:08

[QUOTE=chris2be8;533905]How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.[/QUOTE]

Wikipedia [URL="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX-512"]writes[/URL], without citation:
[INDENT]As of 2019, there are no AMD CPUs that support AVX-512, and AMD has not yet released plans to support AVX-512.[/INDENT]
As the [url=https://www.amd.com/en/products/cpu/amd-ryzen-5-2600]Ryzen 5 2600[/url] was launched in 2018 I trust it does not support AVX-512.

CRGreathouse 2020-01-02 04:53

As for Intel, their advanced search allows you to [url=https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&1_Filter-InstructionSetExtensions=3533]search for AVX-512[/url] and find all processors with it. (You can further limit the search, or just look up a chip and see if it's there.)

EdH 2020-01-02 14:45

I would have sworn my earlier sessions with Colab showed AVX512 for their Xeon CPUs when I checked. Now, none of my recent sessions have. The CPU type just says Xeon CPU @ 2.00GHz. It does give me a family, etc. for the CPU(s), but no name. Is there a way to find that via the command line?

If I can, I'll set up a Colab AVX-ECM example in my "How I . . ." series.

bsquared 2020-01-02 14:52

[QUOTE=EdH;533996]I would have sworn my earlier sessions with Colab showed AVX512 for their Xeon CPUs when I checked. Now, none of my recent sessions have. The CPU type just says Xeon CPU @ 2.00GHz. It does give me a family, etc. for the CPU(s), but no name. Is there a way to find that via the command line?

If I can, I'll set up a Colab AVX-ECM example in my "How I . . ." series.[/QUOTE]

Not sure if this applies across all distros, but at least for CentOS/RedHat you can do:
cat /proc/cpuinfo

which will list the cpuflags, e.g.:

[CODE]flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp
lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf
eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16
xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave
avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd
mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle
avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a [B]avx512f avx512dq[/B] rdseed adx
smap clflushopt clwb [B]avx512cd avx512bw avx512vl[/B] xsaveopt xsavec xgetbv1
cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_
window hwp_epp hwp_pkg_req pku ospke md_clear spec_ctrl intel_stibp flush_l1d
[/CODE]

AVX-ECM needs avx512f and avx512dq.
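If you'd rather check from a script than eyeball that flags line, here's a small Python sketch (the helper name is mine) that looks for those two flags in /proc/cpuinfo-style text:

```python
def has_avx_ecm_flags(cpuinfo_text):
    """True if a 'flags' line lists both avx512f and avx512dq."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return {"avx512f", "avx512dq"} <= flags
    return False

# On Linux:
# with open("/proc/cpuinfo") as f:
#     print(has_avx_ecm_flags(f.read()))
```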

bsquared 2020-01-02 14:59

Speaking of CPU features, I can't wait for systems with avx512ifma to become available. That feature should give a good speed boost.

GP2 2020-01-02 15:01

[QUOTE=chris2be8;533905]How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.[/QUOTE]

[c]grep flags /proc/cpuinfo[/c] should show which specific subsets of the AVX-512 instruction set are supported. For instance, the CPU I use has avx512f avx512dq avx512cd avx512bw avx512vl. For the meanings, see [URL="https://en.wikipedia.org/wiki/AVX-512#Instruction_set"]Wikipedia[/URL].

chalsall 2020-01-02 15:33

[QUOTE=bsquared;533997]cat /proc/cpuinfo[/QUOTE]

I just happen to have an SSH session into a Colab session open right now.

[CODE]root@InstanceSSH:~# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) CPU @ 2.00GHz
stepping : 3
microcode : 0x1
cpu MHz : 2000.168
cache size : 39424 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 4000.33
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) CPU @ 2.00GHz
stepping : 3
microcode : 0x1
cpu MHz : 2000.168
cache size : 39424 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 4000.33
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:[/CODE]

bsquared 2020-01-02 15:35

[QUOTE=chalsall;534005]I just happen to have an SSH session into a Colab session open right now.

[CODE]root@InstanceSSH:~# cat /proc/cpuinfo

flags : ... avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl ...
[/CODE][/QUOTE]

Nice!

bsquared 2020-01-02 16:16

I bolted on a simplified version of yafu's front end calculator:

[CODE]./avx-ecm "fib(791)/13/677/216416017" 8 1000000 1
starting process 418960
commencing parallel ecm on 478538644094177490260709299559653477212580778084175536200356719720677713018603842152898226485988725658355447208677180777716686530734249077491242934517377
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 624, NWORDS = 12, NBLOCKS = 3 based on input size 508
cached 3001134 primes < 49999991
Input has 508 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
Initialization took 0.1045 seconds.
Cached 5761455 primes in range [2 : 99999989]

commencing curves 0-7 of 8
Building curves took 0.0006 seconds.
commencing Stage 1 @ prime 2
accumulating prime 997693
Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles
Stage 1 took 6.8980 seconds

found factor 272602401466814027129 in stage 1 in thread 0, vec position 3, with sigma = 1465376940087886860


commencing stage 2 at p=1000003, A=1000230
w = 1155, R = 480, L = 16, umax = 9240, amin = 433
Stage 2 took 4.9852 seconds
performed 3188920 pair-multiplies for 5682957 primes in stage 2
performed 95044 point-additions and 53 point-doubles in stage 2

found factor 272602401466814027129 in stage 2 in thread 0, vec position 3, with sigma = 1465376940087886860

found factor 272602401466814027129 in stage 2 in thread 0, vec position 4, with sigma = 16373433592041433963

found factor 272602401466814027129 in stage 2 in thread 0, vec position 7, with sigma = 12457157968472999552
Process took 11.9819 seconds.
[/CODE]

So now you should be able to use +, -, *, /, ^, % , # (primorial), ! (factorial), as well as fib(), luc(), >>, <<, bitwise ops (xor, and, not, or), and a few other things.

PhilF 2020-01-02 17:34

Sorry, duplicate post...

CRGreathouse 2020-01-02 18:29

[QUOTE=GP2;533999][c]grep flags /proc/cpuinfo[/c] should show which specific subsets of the AVX-512 instruction set are supported.[/QUOTE]

For the lazy:
[code]egrep -o 'avx512[^ ]*' /proc/cpuinfo | sort -u[/code]

mathwiz 2020-01-02 23:17

FWIW, I'm getting segfaults with the current master branch. Was a bug introduced?

bsquared 2020-01-03 00:37

[QUOTE=mathwiz;534063]FWIW, I'm getting segfaults with the current master branch. Was a bug introduced?[/QUOTE]

Probably... can you let me know what command line was giving you problems? It has been working for me but I haven't done a lot of testing either.

bsquared 2020-01-03 02:33

[QUOTE=bsquared;534071]Probably... can you let me know what command line was giving you problems? It has been working for me but I haven't done a lot of testing either.[/QUOTE]

I see now after testing on a different system with gcc that it doesn't work at all. I committed a fix (was missing a couple libraries in the new calc front end). Tested on linux with icc, with mingw on windows/msys2, and on windows-subsystem-for-linux/gcc.



[QUOTE=bsquared;533848]

+ parameterization seems not exactly compatible with gmp-ecm: looking into it

I'm not exactly sure what's going on with the parameterization... this program finds factors when it seems like it shouldn't for some sigma and conversely doesn't find factors when gmp-ecm does for some sigma. On average though the factor-finding rate seems about the same.
[/QUOTE]

Also fixed this. It was a simple overflow when setting random sigma values into an mpz_t.

EdH 2020-01-03 04:19

I had been using cat /proc/cpuinfo and lscpu, but really appreciate the egrep line.

I have discovered that if I try to get a new python3 notebook, it invariably has no avx512. However, if I "Connect" from the "Welcome" page, I get the full list of avx512bw/cd/dq/f/vl. I hope to build a Colab AVX-ECM instance soon.

PhilF 2020-01-03 19:01

[QUOTE=EdH;534088]I had been using cat /proc/cpuinfo and lscpu, but really appreciate the egrep line.

I have dicovered that if I try to get a new python3 motebook, it invariably has no avx512. However, if I "Connect" the "Welcome" page, I get the full list of avx512bw/cd/dq/f/vl. I hope to build a Colab AVX-ECM instance soon.[/QUOTE]

I think it is more random than that. I always connect through the Welcome page, but the CPU I get is hit-or-miss between Haswell, Broadwell, and Skylake. Only the Skylake has AVX-512. If I need a Skylake, right after I connect I use lscpu to check, then I reset the session until I get one. It usually takes only a few tries.

EdH 2020-01-03 19:47

[QUOTE=PhilF;534138]I think it is more random than that. I always connect through the Welcome page, but the CPU I get is hit-or-miss between Haswell, Broadwell, or Skylake. Only the Skylake has AVX-512. If I need a Skylake, right after I connect I use lscpu to check, then I reset the session until I get a one. It usually takes only a few tries.[/QUOTE]
You seem to be quite right. I tried several without success (even from the Welcome page) and then got one, but alas, I couldn't get avx-ecm to compile. GCC was 7.4.0 - changing GCC=7.4.0 in the Makefile didn't help. I'll play more later.

mathwiz 2020-01-03 19:51

[QUOTE=EdH;534140]You seem to be quite right. I tried several without success (even from Welcome page) and then got one, but alas, I couldn't get avx-ecm to compile. GCC was 7.4.0 - changing GCC=7.4.0 in the Makefile didn't help. I'll play more later.[/QUOTE]

I was able to get it to compile and run successfully on a Skylake machine by doing the following:

[CODE]!sudo apt-get install libgmp-dev
!git clone https://github.com/bbuhrow/avx-ecm.git
!cd avx-ecm/ && ls
!make -j 8 SKYLAKEX=1 COMPILER=gcc[/CODE]

And running with, e.g.:

[CODE]!cd avx-ecm && ./avx-ecm "2^1277-1" 100 1000000 8[/CODE]

ATH 2020-01-03 20:16

I'm getting compiling errors on an EC2 Ubuntu 18.04 instance with GCC 9.2.1

Edit: ...and same error with GCC 7.4.0

[CODE]ubuntu@ip-172-31-1-23:/mnt-efs/z/avx-ecm$ gcc --version
gcc (Ubuntu 9.2.1-17ubuntu1~18.04.1) 9.2.1 20191102
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ubuntu@ip-172-31-1-23:/mnt-efs/z/avx-ecm$ make MAXBITS=416 COMPILER=gcc SKYLAKEX=1
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/presieve.o eratosthenes/presieve.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/count.o eratosthenes/count.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/offsets.o eratosthenes/offsets.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/primes.o eratosthenes/primes.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/roots.o eratosthenes/roots.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/linesieve.o eratosthenes/linesieve.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/soe.o eratosthenes/soe.c
eratosthenes/soe.c: In function ‘spSOE’:
eratosthenes/soe.c:273:43: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘uint32_t’ {aka ‘unsigned int’} [-Wformat=]
273 | printf("using %lu primes, max prime = %lu \n", sdata.pboundi, sieve_p[sdata.pboundi]); // sdata.pbound);
| ~~^ ~~~~~~~~~~~~~~~~~~~~~~
| | |
| long unsigned int uint32_t {aka unsigned int}
| %u
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/tiny.o eratosthenes/tiny.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/worker.o eratosthenes/worker.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/soe_util.o eratosthenes/soe_util.c
eratosthenes/soe_util.c: In function ‘check_input’:
eratosthenes/soe_util.c:140:17: warning: assignment to ‘__mpz_struct (*)[1]’ {aka ‘struct <anonymous> (*)[1]’} from incompatible pointer type ‘__mpz_struct *’ {aka ‘struct <anonymous> *’} [-Wincompatible-pointer-types]
140 | sdata->offset = offset;
| ^
eratosthenes/soe_util.c: In function ‘init_sieve’:
eratosthenes/soe_util.c:172:13: warning: implicit declaration of function ‘spGCD’ [-Wimplicit-function-declaratio ]
172 | if (spGCD(i, (uint64_t)prodN) == 1)
| ^~~~~
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/wrapper.o eratosthenes/wrapper.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o threadpool.o threadpool.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o main.o main.c
<command-line>: error: expected identifier or ‘(’ before numeric constant
avx_ecm.h:107:10: note: in expansion of macro ‘MAXBITS’
107 | uint32_t MAXBITS;
| ^~~~~~~
main.c: In function ‘main’:
main.c:203:17: error: lvalue required as left operand of assignment
203 | MAXBITS = 208;
| ^
main.c:206:21: error: lvalue required as left operand of assignment
206 | MAXBITS += 208;
| ^~
main.c:211:17: error: lvalue required as left operand of assignment
211 | MAXBITS = 128;
| ^
main.c:214:21: error: lvalue required as left operand of assignment
214 | MAXBITS += 128;
| ^~
main.c:274:24: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t’ {aka ‘long unsigned int’} [-Wformat=]
274 | printf("Input has %d bits, using %d threads (%d curves/thread)\n",
| ~^
| |
| int
| %ld
main.c:325:13: warning: unused variable ‘j’ [-Wunused-variable]
325 | int j;
| ^
main.c:145:9: warning: unused variable ‘nextptr’ [-Wunused-variable]
145 | char **nextptr;
| ^~~~~~~
main.c:144:14: warning: unused variable ‘j’ [-Wunused-variable]
144 | uint32_t i, j;
| ^
main.c:140:12: warning: unused variable ‘siglist’ [-Wunused-variable]
140 | uint32_t *siglist;
| ^~~~~~~
main.c:138:15: warning: unused variable ‘n’ [-Wunused-variable]
138 | bignum **f, *n;
| ^
main.c:138:11: warning: unused variable ‘f’ [-Wunused-variable]
138 | bignum **f, *n;
| ^
At top level:
main.c:50:12: warning: ‘debugctr’ defined but not used [-Wunused-variable]
50 | static int debugctr = 0;
| ^~~~~~~~
Makefile:174: recipe for target 'main.o' failed
make: *** [main.o] Error 1
ubuntu@ip-172-31-1-23:/mnt-efs/z/avx-ecm$
[/CODE]

bsquared 2020-01-03 20:45

[QUOTE=ATH;534146]I'm getting compiling errors on an EC2 Ubuntu 18.04 instance with GCC 9.2.1

Edit: ...and same error with GCC 7.4.0
[/QUOTE]

As of post #4, you don't need the MAXBITS=416 directive on the compile line. Just build with COMPILER=gcc and SKYLAKEX=1.

Sorry, I will update the readme.

mathwiz 2020-01-03 22:04

From my limited knowledge of Makefiles and gcc, I notice that the gcc options for SKYLAKEX=1 specify "-march=skylake-avx512" but not "-mtune=skylake-avx512". Is this deliberate?

bsquared 2020-01-03 22:24

[QUOTE=mathwiz;534168]From my limited knowledge of Makefile's and gcc, I notice that the gcc options for SKYLAKEX=1 specify "-march=skylake-avx512" but not "-mtune=skylake-avx512". Is this deliberate?[/QUOTE]

From [url]https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html[/url]:

[QUOTE]
[B]3.18.59 x86 Options[/B]
These ‘-m’ options are defined for the x86 family of computers.

-march=cpu-type
Generate instructions for the machine type cpu-type. In contrast to -mtune=cpu-type, which merely tunes the generated code for the specified cpu-type, -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated. Specifying -march=cpu-type implies -mtune=cpu-type.
[/QUOTE]

So according to the last sentence there, we don't need mtune if we've already put in march. I ran some experiments anyway and didn't see any speed difference when adding mtune.

I'll also mention that if you have access to Intel's compiler, it seems to do a much better job with this code. For the 624-bit size, for instance, I get 8 stage 1 curves at B1=1e6 in 8 seconds with gcc versus 7 seconds with icc.

Even thinking about building a statically-linked executable gives me a headache though...

ATH 2020-01-04 00:57

Thanks, it worked. I tested it in an EC2 instance with 1 core / 2 threads (Xeon Cascadelake) on a 402 bit prime number:

[U]160 curves at B1=1,000,000:[/U]
GMP-ECM: [B]300.6 sec[/B]
AVX-ECM 1 thread: [B]179.5 sec[/B]
AVX-ECM 2 threads: [B]144.8 sec[/B]


I created a 402 bit composite number with a 30 digit (97 bit) factor and tested it 10 times with B1=1,000,000 (35 digit level) to see how fast the factor was found with GMP-ECM and AVX-ECM:

[CODE]
GMP-ECM AVX-ECM
curve 23 curves 0-15
curve 34 curves 16-31
curve 48 curves 32-47
curve 81 curves 48-63 (found twice)
curve 89 curves 64-79
curve 129 curves 144-159
curve 135 curves 144-159
curve 168 curves 160-175 (found twice)
curve 195 curves 384-399
curve 264 curves 416-431[/CODE]

So based on the speed from the first test GMP-ECM found the factor 10 times in 2191 seconds and AVX-ECM found the factor 10 (12) times in 1419 seconds.

I took the 12 sigma values for which AVX-ECM found the factor and tested them in GMP-ECM; it found the factor with all 12 sigmas as well.

EdH 2020-01-04 03:17

[QUOTE=mathwiz;534142]I was able to get it to compile and run successfully on a Skylake machine by doing the following:

[CODE]!sudo apt-get install libgmp-dev
!git clone https://github.com/bbuhrow/avx-ecm.git
!cd avx-ecm/ && ls
!make -j 8 SKYLAKEX=1 COMPILER=gcc[/CODE]And running with, e.g.:

[CODE]!cd avx-ecm && ./avx-ecm "2^1277-1" 100 1000000 8[/CODE][/QUOTE]
Thanks! I got a Colab session to work. I'll play a little more and then make a "How I . . ." for my blog section.

bsquared 2020-01-04 03:37

[QUOTE=ATH;534182]Thanks, it worked. I tested it in an EC2 instance with 1 core / 2 threads (Xeon Cascadelake) on a 402 bit prime number:

[U]160 curves at B1=1,000,000:[/U]
GMP-ECM: [B]300.6 sec[/B]
AVX-ECM 1 thread: [B]179.5 sec[/B]
AVX-ECM 2 threads: [B]144.8 sec[/B]


I created a 402 bit composite number with a 30 digit (97 bit) factor and tested it 10 times with B1=1,000,000 (35 digit level) to see how fast the factor was found with GMP-ECM and AVX-ECM:

[CODE]
GMP-ECM AVX-ECM
curve 23 curves 0-15
curve 34 curves 16-31
curve 48 curves 32-47
curve 81 curves 48-63 (found twice)
curve 89 curves 64-79
curve 129 curves 144-159
curve 135 curves 144-159
curve 168 curves 160-175 (found twice)
curve 195 curves 384-399
curve 264 curves 416-431[/CODE]

So based on the speed from the first test GMP-ECM found the factor 10 times in 2191 seconds and AVX-ECM found the factor 10 (12) times in 1419 seconds.

I took the 12 sigma values for which AVX-ECM found the factor and tested them in GMP-ECM; it found the factor with all 12 sigmas as well.[/QUOTE]

Fantastic, thank you for those tests!

Decent speedup from the hyperthread - I haven't had a chance to test that yet so thanks again.

EdH 2020-01-04 17:57

It took quite a few tries to get a machine with avx512 today, but all went well when I did.

I haven't tried this yet, but hope to soon and wondered what the thoughts are for a setup that uses the GPU branch of ECM for stage 1 and this avx-ecm for stage 2. This may be the exact answer I've been searching for with my Colab ECM-GPU session setup, since prior to this I could only run a single stage 2 thread against all the stage 1 curves via GPU.

Thoughts?

bsquared 2020-01-04 20:47

A GPU is no doubt the most efficient approach for stg1 if you can get one. This program is probably a good option for stg2 but keep in mind:
1) you will have to run more curves, not quite 2x more but probably close to that. Gmp-ecm with -v is your friend.
2) avx-ecm does not yet have the ability to resume curves :)

EdH 2020-01-05 04:00

[QUOTE=bsquared;534240]A GPU is no doubt the most efficient approach for stg1 if you can get one. This program is probably a good option for stg2 but keep in mind:
1) you will have to run more curves, not quite 2x more but probably close to that. Gmp-ecm with -v is your friend.
[B]2) avx-ecm does not yet have the ability to resume curves[/B] :)[/QUOTE]
That I had misunderstood. Bummer! I will wait patiently for that ability. . .

Thanks for all your work.

CRGreathouse 2020-01-05 04:11

[QUOTE=bsquared;534240]A GPU is no doubt the most efficient approach for stg1 if you can get one.[/QUOTE]

Is there a program that supports GPUs for this? :blush: What kind of stage 1--stage 2 tradeoff is reasonable in a setup like that?

mathwiz 2020-01-05 15:38

I notice that with large inputs, say 2^8192+1, GMP-ECM (with GWNUM) is still significantly faster at least on my system (Xeon Gold 6154 @ 3GHz).

For example, with this input and B1=1e6, GMP-ECM curves take about 20-30 seconds each:

[CODE]Using gwnum_ecmStage1(1, 2, 8192, 1, 1000000, 1)
Step 1 took 19146ms
Estimated memory usage: 203.12MB
Initializing tables of differences for F took 6ms
Computing roots of F took 186ms
Building F from its roots took 393ms
Computing 1/F took 176ms
Initializing table of differences for G took 32ms
Computing roots of G took 142ms
Building G from its roots took 388ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 105ms
Reducing G * H mod F took 156ms
Computing roots of G took 142ms
Building G from its roots took 389ms
Computing G * H took 105ms
Reducing G * H mod F took 155ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 106ms
Reducing G * H mod F took 156ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 105ms
Reducing G * H mod F took 155ms
Computing roots of G took 143ms
Building G from its roots took 387ms
Computing G * H took 104ms
Reducing G * H mod F took 155ms
Computing polyeval(F,G) took 911ms
Computing product of all F(g_i) took 64ms
Step 2 took 6291ms
********** Factor found in step 2: 3603109844542291969
Found prime factor of 19 digits: 3603109844542291969
Composite cofactor ((2^8192+1)/2710954639361)/3603109844542291969 has 2436 digits[/CODE]

But, 10+ minutes later, AVX-ECM is still running! ("./avx-ecm 2^8192+1 4 1000000 4" to be precise).

EdH 2020-01-05 16:19

[QUOTE=CRGreathouse;534262]Is there a program that supports GPUs for this? :blush: What kind of stage 1--stage 2 tradeoff is reasonable in a setup like that?[/QUOTE]There is a GPU branch in the latest GMP-ECM which I have played around with using Colab. I have a description [URL="https://www.mersenneforum.org/showthread.php?t=24887"]here[/URL]. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.

bsquared 2020-01-05 16:41

[QUOTE=mathwiz;534282]I notice that with large inputs, say 2^8192+1, GMP-ECM (with GWNUM) is still significantly faster at least on my system (Xeon Gold 6154 @ 3GHz).

For example, with this input and B1=1e6, GMP-ECM curves take about 20-30 seconds each:

[CODE]Using gwnum_ecmStage1(1, 2, 8192, 1, 1000000, 1)
...
[/CODE]

But, 10+ minutes later, AVX-ECM is still running! ("./avx-ecm 2^8192+1 4 1000000 4" to be precise).[/QUOTE]

Well...
1) No fair using gwnum
2) You are running avx-ecm with 4 threads, which is 32 curves, so whatever the time ends up being you will have to divide by 32 for a fair comparison.

But you are correct that avx-ecm will become less and less competitive as the input size goes up, because I do not have any sub-quadratic methods implemented. You are also correct that GMP-ECM will benefit significantly on numbers of special form (like the one you picked).

I'm not advocating that avx-ecm replace gmp-ecm, far from it! It is just another tool that may be useful in some situations. Long term, it probably makes the most sense to merge avx-ecm into gmp-ecm somehow.

bsquared 2020-01-05 16:51

[QUOTE=EdH;534286]There is a GPU branch in the latest GMP-ECM which I have played around with using Colab. I have a description [URL="https://www.mersenneforum.org/showthread.php?t=24887"]here[/URL]. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.[/QUOTE]

The GPU branch, IIRC, also uses fixed-size arithmetic like avx-ecm. But I believe it only has 2 fixed sizes available, maybe 508 and 1016 (I don't recall exactly)? So if you want to run curves on a 509-bit number you are forced to use 1016-bit arithmetic, which of course is not optimal. avx-ecm uses fixed-size arithmetic as well, but in steps of either 128 bits or 208 bits, so you have more efficient choices.

So the question of which path is most efficient depends on the size and form of the number.

chris2be8 2020-01-05 17:29

[QUOTE=EdH;534286]There is a GPU branch in the latest GMP-ECM which I have played around with using Colab. I have a description [URL="https://www.mersenneforum.org/showthread.php?t=24887"]here[/URL]. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.[/QUOTE]

You need to get ecm to do stage 1 on the GPU and save state to a file. Then split that into as many pieces as you have CPUs and run 1 ecm task against each piece.

Eg:
[code]
rm $NAME.save
ecm -gpu -save $NAME.save $B1 1 <$INI | tee -a $LOG
split -nr/2 $NAME.save $NAME.save
wait # If you are in a loop
ecm -resume $NAME.saveaa $B1 | tee -a $LOG.1 | grep '[Ff]actor' &
ecm -resume $NAME.saveab $B1 | tee -a $LOG.2 | grep '[Ff]actor' &
# This would be the end of the loop
wait # Until all stage 2's end
[/code]

$NAME is name of project (anything you like).
$INI is name of file with the number in it.
$B1 you can probably guess.
$LOG is the name of the log files.

If you want to do more curves than the GPU does in 1 go then put the code into a loop.

This is a simplified version of my code, which has to allow for the CPU being faster doing stage 2 than the GPU doing stage 1. But I've not tested this.

See ecm's README file for details of what the parameters do. And README.gpu.

Chris

chalsall 2020-01-05 18:01

[QUOTE=EdH;534286]I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway.[/QUOTE]

Please remember that Colab instances give you two *hyperthreaded* cores; only one real core. I don't know anything about this codebase, but HT'ing is useless for mprime.

I've already "burnt" my disposable Kaggle account, so I've been hesitant to experiment with CPU usage there with my main / development account, but CPU-only Kaggle instances give you two real cores; hyperthreaded to four.

FWIW.

bsquared 2020-10-30 20:55

[QUOTE=bsquared;560751]I will run some tests using AVX-ECM. Scaled tests, at B1=7e6, show that first-stage curve throughput is about 2.4x larger than GMP-ECM.


... But I've never run AVX-ECM with B1 anywhere close to that large. My bet is it crashes... but I'll test and see what happens.[/QUOTE]

Replying here to a post in [URL="https://www.mersenneforum.org/showthread.php?t=25941"]this [/URL]thread, to prevent further hijacking of that thread.

I was right, it crashed :smile:.

While I was fixing the bug I realized that I was using Montgomery arithmetic for this input that would be much better off using a special Mersenne reduction. So I implemented that feature, and now AVX-ECM is about 4.5 times faster than GMP-ECM on 2^1277-1.

GMP-ECM:
[CODE]
B1 Time, 1 curve, seconds
7e5 5.67
7e6 57.2
7e7 581
...
7e9 59448 (extrapolated)
[/CODE]

AVX-ECM
[CODE]
B1 Time, 8 curves, seconds
7e5 10.5
7e6 101.1
7e7 1019
...
7e9 102351 (actual)
[/CODE]

102351 / 8 is about 3.5 hours per curve versus about 16.5 hours/curve.
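As a quick sanity check, the per-curve arithmetic above can be reproduced in a couple of lines (a sketch only; all figures are taken from the tables in this post):

```python
# Stage 1 times at B1=7e9 from the tables above.
avx_8curves_s = 102351        # AVX-ECM, 8 simultaneous curves (actual)
gmp_1curve_s = 59448          # GMP-ECM, 1 curve (extrapolated)

avx_h = avx_8curves_s / 8 / 3600      # hours per AVX-ECM curve
gmp_h = gmp_1curve_s / 3600           # hours per GMP-ECM curve
print(round(avx_h, 1), round(gmp_h, 1), round(gmp_h / avx_h, 1))
# -> 3.6 16.5 4.6  (the ~3.5 h vs ~16.5 h per curve quoted above, rounded)
```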

I've checked the latest code into r393 if anyone wants to do some testing. You will need an avx512-capable cpu.

mathwiz 2020-10-30 21:08

[QUOTE=bsquared;561591]I've checked the latest code into r393 if anyone wants to do some testing. You will need an avx512-capable cpu.[/QUOTE]

Is [url]https://github.com/bbuhrow/avx-ecm/[/url] the latest repo? Unless I'm reading wrong, the only commit recently (3 days ago) was just to README.md?

bsquared 2020-10-30 21:13

[QUOTE=mathwiz;561594]Is [url]https://github.com/bbuhrow/avx-ecm/[/url] the latest repo? Unless I'm reading wrong, the only commit recently (3 days ago) was just to README.md?[/QUOTE]

Ah, good question. Sorry, I've folded it all into yafu, so it is yafu r393 that you'll need. It is probably a good idea to update the avx-ecm repo as well since that is significantly easier to build. I can do that tonight, hopefully.

VBCurtis 2020-10-30 22:14

One more reason to buy one of those Phi barebones systems!

Nice work.

mathwiz 2020-10-30 22:50

[QUOTE=bsquared;561597]Ah, good question. Sorry, I've folded it all into yafu, so it is yafu r393 that you'll need. It is probably a good idea to update the avx-ecm repo as well since that is significantly easier to build. I can do that tonight, hopefully.[/QUOTE]

Gotcha. Is there an easy way to build just AVX-ECM from the latest yafu sources? Or, if you could update the avx-ecm git repo, that'd be great too!

I tried:

[CODE]$ svn checkout https://svn.code.sf.net/p/yafu/code/branches/wip yafu-code
$ make -j 8 SKYLAKEX=1[/CODE]

But I get some linker errors like:

[CODE]gcc-9: error: /sppdg/scratch/buhrow/projects/gmp_install/gmp-6.2.0/lib/libgmp.a: No such file or directory[/CODE]

bsquared 2020-10-31 00:36

[QUOTE=VBCurtis;561614]One more reason to buy one of those Phi barebones systems!

Nice work.[/QUOTE]

Thanks!

I actually have some data for the KNL, linearly scaled from a run with B1=7e6.

7210 KNL:
2048 simultaneous curves with 256 threads
354 hours/run
10.4 minutes/curve

7250 KNL
2176 simultaneous curves with 272 threads
353 hours/run
9.7 minutes/curve

B2 then takes about 3/4 of that to go up to 100*B1, another 264 hours, so that's a lengthy ~25 day runtime per batch. But you get 2000+ curves at the end of it (B1=7G).
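Those per-curve figures follow directly from the batch numbers; a two-line check (values copied from above):

```python
# minutes/curve = hours per run * 60 / simultaneous curves
print(round(354 * 60 / 2048, 1))   # 7210 KNL -> 10.4 min/curve
print(round(353 * 60 / 2176, 1))   # 7250 KNL -> 9.7 min/curve
```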
[edit]
Running with fewer threads and curves trades off throughput for lower wall-clock time, if that is a concern. I don't have data for the exact trades there, yet. Also, it will save intermediate checkpoint files every 100M primes for each curve. The savefiles can be fed to GMP-ECM for stage 2, if you'd rather run stage 2 that way. Run with -saveB1 on the command line to enable saving.

Memory use is pretty minimal for both stages; won't be a problem to stay in MCDRAM.

bsquared 2020-10-31 00:50

[QUOTE=mathwiz;561617]Gotcha. Is there an easy way to build just AVX-ECM from the latest yafu sources? Or, if you could update the avx-ecm git repo, that'd be great too!

I tried:

[CODE]$ svn checkout https://svn.code.sf.net/p/yafu/code/branches/wip yafu-code
$ make -j 8 SKYLAKEX=1[/CODE]

But I get some linker errors like:
[/QUOTE]

No easy way to just build part of yafu. Makefiles are not my strong suit. I very much recommend [URL="https://www.mersenneforum.org/showthread.php?t=23087"]EdH's guides[/URL] on building yafu. It requires a small amount of makefile editing to point to your msieve, ecm, and gmp locations.

IMO there are reasons to prefer the yafu version of avx-ecm. It integrates into factor(), so it's fire and forget on any input. It will auto-recognize Mersennes and determine the fastest approach (sometimes it's better to just use Montgomery arithmetic). It uses the NFS algebraic factor extraction framework that's already there, so you could feed it, say, (2^1265-1) / 121441 / 533831 / 136149590957161 / 556590676780973231 and it would pull out all the annoying algebraic factors and go to work on the remaining C223. You can tell it to pretest up to a specific t-level, etc.

avx-ecm has none of that, but I will update it with the new vector math at least, with maybe a command line switch to turn it on.

wombatman 2020-11-01 04:38

Posting here only because the issue seems to be related to avx-ecm. I'm trying to build the latest revision of YAFU, and no matter what I pass during make, it's failing with errors like:

[CODE]/usr/bin/ld: factor/avx-ecm/avx_ecm_main.o: in function `vec_ecm_main':
/home/wombat/yafu-code/branches/wip/factor/avx-ecm/avx_ecm_main.c:266: undefined reference to `snfs_init'
/usr/bin/ld: /home/wombat/yafu-code/branches/wip/factor/avx-ecm/avx_ecm_main.c:272: undefined reference to `find_primitive_factor'
/usr/bin/ld: /home/wombat/yafu-code/branches/wip/factor/avx-ecm/avx_ecm_main.c:279: undefined reference to `snfs_clear'
/usr/bin/ld: /home/wombat/yafu-code/branches/wip/factor/avx-ecm/avx_ecm_main.c:464: undefined reference to `vecmulmod52_mersenne'
/usr/bin/ld: /home/wombat/yafu-code/branches/wip/factor/avx-ecm/avx_ecm_main.c:465: undefined reference to `vecsqrmod52_mersenne'
collect2: error: ld returned 1 exit status
make: *** [Makefile:397: all] Error 1[/CODE]

I'm passing in this:

[CODE] make clean && make ECM=1 USE_SSE41=1[/CODE]

This error is recent, though I can't say exactly which revision I was able to successfully build. Any suggestions?

bsquared 2020-11-01 15:10

[QUOTE=wombatman;561758]Posting here only because the issue seems to be related to avx-ecm. I'm trying to build the latest revision of YAFU, and no matter what I pass during make, it's failing with errors like:

[CODE]/usr/bin/ld: factor/avx-ecm/avx_ecm_main.o: in function `vec_ecm_main':
/home/wombat/yafu-code/branches/wip/factor/avx-ecm/avx_ecm_main.c:266: undefined reference to `snfs_init'
/usr/bin/ld: /home/wombat/yafu-code/branches/wip/factor/avx-ecm/avx_ecm_main.c:272: undefined reference to `find_primitive_factor'
/usr/bin/ld: /home/wombat/yafu-code/branches/wip/factor/avx-ecm/avx_ecm_main.c:279: undefined reference to `snfs_clear'
/usr/bin/ld: /home/wombat/yafu-code/branches/wip/factor/avx-ecm/avx_ecm_main.c:464: undefined reference to `vecmulmod52_mersenne'
/usr/bin/ld: /home/wombat/yafu-code/branches/wip/factor/avx-ecm/avx_ecm_main.c:465: undefined reference to `vecsqrmod52_mersenne'
collect2: error: ld returned 1 exit status
make: *** [Makefile:397: all] Error 1[/CODE]

I'm passing in this:

[CODE] make clean && make ECM=1 USE_SSE41=1[/CODE]

This error is recent, though I can't say exactly which revision I was able to successfully build. Any suggestions?[/QUOTE]

Looks like you just need to use more makefile options. You need at least NFS=1 to bring in the nfs stuff and SKYLAKEX=1 for the avx512 vector arithmetic stuff. And if you use SKYLAKEX=1 then you also need USE_AVX2=1. My typical build line looks like this:

make NFS=1 USE_AVX2=1 USE_BMI2=1 SKYLAKEX=1

also, ECM=1 is a msieve thing, not a yafu thing, so you can leave that off.

bsquared 2020-11-01 15:30

Another quick usage note...
Normally, yafu will switch over from using an internal GMP-ECM to an external GMP-ECM (pointed to by the ecm_path variable in yafu.ini) at B1=50k. With AVX-ECM enabled, by default that switchover happens at B1=40M instead. If you want to use AVX-ECM at higher B1, put that crossover value into the variable "ext_ecm" in yafu.ini.

For example, if I want to run B1=110M with AVX-ECM rather than the external GMP-ECM, set ext_ecm=120000000 (or anything larger than 110M).
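A minimal yafu.ini fragment for that scenario might look like the following (a sketch; the ecm_path value is a placeholder, and the variable names are as described above):

```ini
ecm_path=/path/to/gmp-ecm/ecm
ext_ecm=120000000
```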

mathwiz 2020-11-01 15:30

(waits patiently for the old avx-ecm git repo to be updated with the latest code... it's so much easier to build :)

bsquared 2020-11-01 16:56

[QUOTE=mathwiz;561813](waits patiently for the old avx-ecm git repo to be updated with the latest code... it's so much easier to build :)[/QUOTE]

Thanks for your patience :blush:

Now updated, tested, and it seems to work. Also checked in a pre-built version for Windows. The mingw compiler doesn't do as good a job, but something for Windows is better than nothing.

wombatman 2020-11-01 17:02

[QUOTE=bsquared;561807]Looks like you just need to use more makefile options. You need at least NFS=1 to bring in the nfs stuff and SKYLAKEX=1 for the avx512 vector arithmetic stuff. And if you use SKYLAKEX=1 then you also need USE_AVX2=1. My typical build line looks like this:

make NFS=1 USE_AVX2=1 USE_BMI2=1 SKYLAKEX=1

also, ECM=1 is a msieve thing, not a yafu thing, so you can leave that off.[/QUOTE]

Sorry, I should have been clearer. The CPU is an Ivy Bridge-E, so it can't do any of the AVX-ECM stuff (to my understanding).

bsquared 2020-11-01 17:06

[QUOTE=wombatman;561828]Sorry, I should have been clearer. The CPU is an Ivy Bridge-E, so it can't do any of the AVX-ECM stuff (to my understanding).[/QUOTE]

Ah, ok. I have only checked that things work *with* avx512 and did not stop to consider that things might break without it. Thanks for the report... stay tuned.

wombatman 2020-11-01 17:18

[QUOTE=bsquared;561829]Ah, ok. I have only checked that things work *with* avx512 and did not stop to consider that things might break without it. Thanks for the report... stay tuned.[/QUOTE]

No worries! If it's just a simple makefile change, I'm good with doing that. :smile:

bsquared 2020-11-01 17:26

[QUOTE=bsquared;561829]Ah, ok. I have only checked that things work *with* avx512 and did not stop to consider that things might break without it. Thanks for the report... stay tuned.[/QUOTE]

Ok, try r394.

wombatman 2020-11-01 17:33

[QUOTE=bsquared;561832]Ok, try r394.[/QUOTE]

Builds successfully and seems to be working well using factor(rsa(300)) to check both ECM and SIQS (my primary usage). Thanks very much for the quick response!

bsquared 2020-11-02 18:49

addmod and submod can also be simplified for Mersenne inputs and this has a measurable improvement... about 3% faster in stage 1 and 7% in stage 2.

Both yafu and avx-ecm repos have been updated.

Also, I've seen that avx-ecm is about 10-20% faster than yafu. I'm not sure why; the only difference I can see is that yafu has a lot more code around it, and I can't see how the overhead would be that large.

I've done some testing, but if anyone is able to test either package on finding [URL="https://homes.cerias.purdue.edu/~ssw/cun/pmain1020.txt"]known Mersenne factors[/URL], that'd be great.

bsquared 2020-11-02 19:34

[QUOTE=bsquared;561976]
Also, I've seen that avx-ecm is about 10-20% faster than yafu. I'm not sure why; the only difference I can see is that yafu has a lot more code around it, and I can't see how the overhead would be that large.
[/QUOTE]

Ok, most of that is explained by different compiler options. avx-ecm was using -O3 and yafu -O2. With -O3, yafu is much better, but avx-ecm is still 5% faster or so.

Given all of the above, for 2^1277-1, AVX-ECM is 6.2 times faster than GMP-ECM now. I hope I have GMP-ECM configured optimally for this system for a fair comparison... here is what it shows for a run with -v:

[CODE] echo "2^1277-1" | ~/ecm-704-linux/ecm -v 7000000
GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asm-redc] [ECM]
Tuned for x86_64/k8/params.h
Running on cpu
Input number is 2^1277-1 (385 digits)
Using special division for factor of 2^1277-1
Using B1=7000000, B2=17125579390, polynomial Dickson(12), sigma=0:1088740413510573297
dF=16384, k=6, d=158340, d2=11, i0=34
Expected number of curves to find a factor of n digits:
35 40 45 50 55 60 65 70 75 80
167 1024 7351 60402 558826 5744532 6.5e+07 7.8e+08 1.1e+10 2.8e+11
Step 1 took 60676ms[/CODE]

B2 takes 32 sec vs. 55 sec for 8 curves (@ B2=100*B1)

VBCurtis 2020-11-02 22:15

[QUOTE=bsquared;561986]
B2 takes 32 sec vs. 55 sec for 8 curves (@ B2=100*B1)[/QUOTE]
Which one is which for this comparison? Are both GMP-ECM and avx-ecm running the same B2, or is the longer time related to GMP-ECM running its standard B2 level?

Your stage 1 might be so fast that it calls for running the lesser stage 2 and not using GMP-ECM at all; I haven't quite decided which way to go for stage 2.

Also, if I'm reading the start of this thread correctly, the avx-ecm speedup on generic smallish inputs (say, 1000 bits or under) is around double GMP-ECM stage 1 speed, while on special Mersenne numbers it's 5x or more?

My use cases are M1277 in the short run, and production work around 900 bits with no special form in the long run. Seems maybe M1277 calls for no GMP-ECM stage 2, while production work may call for using the stage1-save option and GMP-ECM for stage 2.

bsquared 2020-11-02 22:36

[QUOTE=VBCurtis;562006]Which one is which for this comparison? Are both GMP-ECM and avx-ecm running the same B2, or is the longer time related to GMP-ECM running its standard B2 level?

Your stage 1 might be so fast that it calls for running the lesser stage 2 and not using GMP-ECM at all; I haven't quite decided which way to go for stage 2.

Also, if I'm reading the start of this thread correctly, the avx-ecm speedup on generic smallish inputs (say, 1000 bits or under) is around double GMP-ECM stage 1 speed, while on special Mersenne numbers it's 5x or more?

My use cases are M1277 in the short run, and production work around 900 bits with no special form in the long run. Seems maybe M1277 calls for no GMP-ECM stage 2, while production work may call for using the stage1-save option and GMP-ECM for stage 2.[/QUOTE]

gmp-ecm ran its larger stage 2 in 32 seconds while avx-ecm ran its smaller stage 2 in 55 seconds, or 6.8 sec/curve.
gmp-ecm B2 was B2=17125579390
avx-ecm B2 was 700000000 (24.4 times smaller)

This B2 size disparity will grow the larger B1 gets.
Also complicating matters is that gmp-ecm uses the Brent-Suyama extension for stage 2, which increases the odds of finding a factor slightly. avx-ecm doesn't do that.

So unfortunately it is a complicated tradeoff that has to consider factor-finding probabilities as well as speed. I don't think there is enough data for avx-ecm yet to compare probability of finding a factor per unit time (or per curve), compared to gmp-ecm. Not sure I would even know how to do that analysis if I had the data, although maybe I could figure it out.

ATH's data set [URL="https://www.mersenneforum.org/showpost.php?p=534182&postcount=31"]here [/URL]did show avx-ecm finding *more* factors per curve than gmp-ecm did, but stuff like that will happen for randomized algorithms and small data sets.

After thinking about it a bit, I think we could extract the probabilities from gmp-ecm -v, while running with power 1 (no brent-suyama) and forcing B2=100B1.

bsquared 2020-11-02 22:47

[QUOTE=bsquared;562007]
After thinking about it a bit, I think we could extract the probabilities from gmp-ecm -v, while running with power 1 (no brent-suyama) and forcing B2=100B1.[/QUOTE]

Here's what gmp-ecm spit out:

Input number is (2^1277-1) (385 digits)
Using special division for factor of 2^1277-1
Using B1=10000000000, B2=637998273712822, polynomial Dickson(30), sigma=0:6343133439254916928
dF=4194304, k=3, d=48498450, d2=23, i0=184
Expected number of curves to find a factor of n digits:
35 40 45 50 55 60 65 70 75 80
7 19 52 161 545 1993 7815 32642 144421 673682


Input number is (2^1277-1) (385 digits)
Using special division for factor of 2^1277-1
Using B1=10000000000, B2=1000000000000, polynomial x^1, sigma=0:11834000506611778745
dF=131072, k=6, d=1345890, d2=11, i0=7420
Expected number of curves to find a factor of n digits:
35 40 45 50 55 60 65 70 75 80
16 44 134 442 1578 6046 24696 106998 489087 2349846

The 75-digit number of curves ratio is 3.38, which is essentially a multiplier in favor of gmp-ecm curves.

So if avx-ecm is 6.2 times higher throughput, but 3.38 times less likely to find a factor at t75, then the net gain for avx-ecm is about 1.83 for 2^1277-1.
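For anyone following along, the net-gain arithmetic is just this (a sketch; curve counts copied from the two -v runs above):

```python
# Expected t75 curve counts from the two gmp-ecm -v runs above.
default_b2_curves = 144421   # Dickson(30), default B2
small_b2_curves = 489087     # power 1, B2=100*B1 (avx-ecm's stage 2 settings)

prob_ratio = small_b2_curves / default_b2_curves   # per-curve advantage of gmp-ecm
net_gain = 6.2 / prob_ratio                        # 6.2x = avx-ecm throughput edge
print(round(prob_ratio, 2), round(net_gain, 2))    # ~3.39 and ~1.83 net gain
```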

Similar calculations would have to be applied to generic inputs at particular B1's, I think. gmp-ecm's advantage is its vastly superior stage 2. So the avx-ecm stage 1 followed by gmp-ecm stage 2 recipe is probably a winner in a lot of cases. But for 2^1277-1 it looks like pure avx-ecm wins.

Thoughts?

VBCurtis 2020-11-03 01:47

When GMP-ECM is faster for stage 2 with a higher B2, there is only advantage in using it. Save time, get more work: what's not to like?

The difficult tradeoff would be on larger numbers, when the avx-ecm stage 2 is likely faster than GMP-ECM but a smaller B2. That leads to some need for study.

bsquared 2020-11-03 02:50

[QUOTE=VBCurtis;562018]When GMP-ECM is faster for stage 2 with a higher B2, there is only advantage in using it. Save time, get more work: what's not to like?
[/QUOTE]

If we were comparing curve-for-curve, sure, but it's not that straightforward. avx-ecm stage 2 may be slower and use a smaller B2, but it is doing 8 curves at once in that time.

Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve

gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve

Now, each avx-ecm stage 2 curve is 2.8 times less likely to find a factor, but if you had 64 seconds to spend, would you rather run 2 gmp-ecm stage 2 curves or 8 avx-ecm stage 2 curves? Even with each avx-ecm curve being 2.8 times less likely to find a factor, you'd still be better off.
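That comparison can be written as a tiny expected-value sketch (timings and the 2.8x multiplier are from this post):

```python
budget_s = 64
gmp_curves = budget_s / 32      # gmp-ecm stage 2: 32 s/curve -> 2 curves
avx_curves = 8                  # one avx-ecm stage 2 batch (~55 s) -> 8 curves
penalty = 2.8                   # each avx-ecm curve ~2.8x less likely to succeed

print(gmp_curves, round(avx_curves / penalty, 2))
# 2.0 "effective" gmp-ecm curves vs ~2.86 for avx-ecm: avx-ecm still ahead
```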

[QUOTE=VBCurtis;562018]
The difficult tradeoff would be on larger numbers, when the avx-ecm stage 2 is likely faster than GMP-ECM but a smaller B2. That leads to some need for study. [/QUOTE]

The same math as above applies. I think you are better off running pure avx-ecm until time/curve * probability-multiplier is greater than gmp-ecm's stage 2 runtime.

It gets interesting with generic inputs, when the throughput advantage of avx-ecm is not as great. Then, I think it makes sense to run avx-ecm stage 1 followed by 8 gmp-ecm stage 2's. Then you get all of the probability advantage of gmp-ecm stage 2 but with avx-ecm's higher stage 1 throughput. I am putting together some data on this case.

bsquared 2020-11-03 03:05

Here are a couple other things to keep in mind:

1) for lower-end skylake processors the speedups are not as dramatic, with or without special Mersenne math.
I ran on a 7800x and only saw about 2.1 times speedup for 2^1277-1.
[U]More testing/benchmarking is needed for a variety of avx512 capable processors[/U].
The good news is that this situation will only improve for avx-ecm as time goes on. I plan to implement enhancements as soon as I can get my hands on an Ice Lake cpu or whatever cpu supports AVX-512IFMA.


2) avx-ecm uses fixed-time arithmetic in steps of 208 bits. So a curve at, say, 416 bits takes just as long as a curve at 623 bits.
gmp-ecm, on the other hand, adjusts the size of its arithmetic in 64-bit steps, which allows fine-tuned crossover decisions for given input sizes.

VBCurtis 2020-11-03 03:20

[QUOTE=bsquared;562023]Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avx-ecm stage 2: 55 sec for 8 curves, 6.87 sec/curve

gmp-ecm stage 1: 60 sec for 1 curve
gmp-ecm stage 2: 32 sec for 1 curve[/QUOTE]

Thank you for rewarding my "I'm confused so I'll state the obvious and see where I'm wrong"! This makes much sense, and I'm all clear now.

bsquared 2020-11-03 18:03

I updated the first post with a table of timings and some analysis for a generic 900-bit input, and a couple different Mersenne inputs.

It looks like a hybrid plan is best for most B1 levels.

The relative speeds of avx-ecm and gmp-ecm should be checked on the particular cpu beforehand, and crossovers adjusted.

bsquared 2020-11-05 15:01

Now processing 2^n+1 similarly to 2^n-1 (with special modulo)

Here verifying some of the factors of 2^941+1

[CODE]./avx-ecm "(2^941+1)/3/1738969" 8 3000000
starting process 60820
commencing parallel ecm on 3562975409753765647916625929286022733708067351160837396219090783379741715638222929662078302565476897014358133541083282322395686863751130956979946597434060607614269433132169774980382077527684835263114705184392392863404860166373197969977946080763606872357617994374214244244159779
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 919 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
Using special Mersenne mod for factor of: 2^941+1
Initialization took 0.0899 seconds.
Cached 5761455 primes in range [2 : 99999989]

commencing curves 0-7 of 8
Building curves took 0.0001 seconds.
commencing Stage 1 @ prime 2
accumulating prime 2943173
Stage 1 completed at prime 2999999 with 6040073 point-adds and 565931 point-doubles
Stage 1 took 19.2999 seconds

found factor 5062155107501579 in stage 1 in thread 0, vec position 0, with sigma = 424412553419277870

found factor 5062155107501579 in stage 1 in thread 0, vec position 5, with sigma = 8913334498632421225


commencing stage 2 at p=3000017, A=3000690
w = 1155, R = 480, L = 16, umax = 9240, amin = 1299
found 5317482 primes in range [99999989 : 199999963]

commencing stage 2 at p=100000007, A=99965250
w = 1155, R = 480, L = 16, umax = 9240, amin = 43275
found 5173389 primes in range [199999963 : 299999957]

commencing stage 2 at p=199999991, A=199979010
w = 1155, R = 480, L = 16, umax = 9240, amin = 86571
found 53 primes in range [299999957 : 300000997]

commencing stage 2 at p=299999977, A=299974290
w = 1155, R = 480, L = 16, umax = 9240, amin = 129859
Stage 2 took 14.7954 seconds
performed 9087802 pair-multiplies for 16035509 primes in stage 2
performed 266472 point-additions and 57 point-doubles in stage 2

found factor 5062155107501579 in stage 2 in thread 0, vec position 0, with sigma = 424412553419277870

found factor 5062155107501579 in stage 2 in thread 0, vec position 1, with sigma = 15371561253067696485

found factor 5062155107501579 in stage 2 in thread 0, vec position 2, with sigma = 13475164758719783696

found factor 1053336016261649316809 in stage 2 in thread 0, vec position 4, with sigma = 333700535869040322

found factor 5062155107501579 in stage 2 in thread 0, vec position 5, with sigma = 8913334498632421225
Process took 34.3731 seconds.
[/CODE]

Also, for now I'm only pushing updates to the avx-ecm repo. When things settle out a bit I'll revise yafu.

bsquared 2020-11-05 20:59

And now processing 2^n-c similarly to 2^n-1 (with special modulo), where "c" is small (single-limb). GMP-ECM doesn't do this, as far as I can tell.

Example:
[CODE]./avx-ecm-52-icc-static "(2^941 - 5987059)/211/18413" 8 250000 1
starting process 97314
commencing parallel ecm on 4784305585655994717562720658383425526844379349934463348761460837207088185558434958330904147209498950842345840906605538877380399014736734618876417744155088454218322338609736739235487631453648349849153543969782570442284070067941772311074196433531076214607947202503691346829979451
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 920 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
[U]Using special pseudo-Mersenne mod for factor of: 2^941-5987059[/U]
Initialization took 0.1466 seconds.
Cached 1565994 primes in range [2 : 25000991]

commencing curves 0-7 of 8
Building curves took 0.0002 seconds.
commencing Stage 1 @ prime 2
accumulating prime 180511
Stage 1 completed at prime 249989 with 496674 point-adds and 51127 point-doubles
Stage 1 took 1.8810 seconds


commencing stage 2 at p=250007, A=249480
w = 1155, R = 480, L = 16, umax = 9240, amin = 108
Stage 2 took 1.4885 seconds
performed 854762 pair-multiplies for 1543883 primes in stage 2
performed 30752 point-additions and 49 point-doubles in stage 2

found factor 4655226350979187 in stage 2 in thread 0, vec position 3, with sigma = 9539569019480891334

found factor 4655226350979187 in stage 2 in thread 0, vec position 7, with sigma = 17882739001568946650
Process took 3.7186 seconds.
[/CODE]

And to compare:
[CODE]
echo "(2^941-5987059)/211/18413" | ../ecm-704-linux/ecm -v 250000
GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asm-redc] [ECM]
Tuned for x86_64/k8/params.h
Running on cpu
Input number is (2^941-5987059)/211/18413 (277 digits)
[U]Using MODMULN [mulredc:1, sqrredc:1][/U]
Computing batch product (of 360843 bits) of primes up to B1=250000 took 17ms
Using B1=250000, B2=128992510, polynomial Dickson(3), sigma=1:1245750705
dF=2048, k=3, d=19110, d2=11, i0=3
Expected number of curves to find a factor of n digits:
35 40 45 50 55 60 65 70 75 80
6022 87544 1534319 3.1e+07 7.5e+08 2e+10 6.7e+13 1e+19 1.3e+24 Inf
Step 1 took 1673ms
[/CODE]

Pretty substantial stage 1 throughput improvement, for inputs of this form.
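The special reduction for 2^n - c can be sketched generically as follows; this is plain pseudo-Mersenne folding, not the vectorized code in the repo:

```python
def reduce_pseudo_mersenne(x, n, c):
    """Reduce x mod N = 2^n - c (c small and positive), using the
    identity 2^n ≡ c (mod N) to fold the high bits back in."""
    N = (1 << n) - c
    mask = (1 << n) - 1
    while x >> n:                       # more than n bits remain
        x = (x >> n) * c + (x & mask)   # x = hi*2^n + lo ≡ hi*c + lo
    return x - N if x >= N else x
```

Because c is single-limb, each fold costs one small multiply and an add, which is why this form gets nearly the same benefit as a true Mersenne number.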

bsquared 2020-11-19 17:17

There have been a few updates to the avx-ecm repo:

1) added algebraic factor removal of Mersenne inputs
2) added checkpoint saving every 1e8 primes in stage 1
3) implemented batch inversion in stage 2, now almost twice as fast

Here's an example of that last item compared to previously
[QUOTE=bsquared;562023]
Expanding on the example from before, with B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 55 sec for 8 curves, 6.8 sec/stg2-curve
[/QUOTE]

Now:
[CODE]
B1=7M on 2^1277-1:
avx-ecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1-curve
avx-ecm stage 2: 32 sec for 8 curves, 4.0 sec/stg2-curve[/CODE]

At this point I'm not sure whether it's better to keep the faster stage 2 time or run to higher B2 bounds, keeping timings similar to previously.

I think that Bob's paper says to equalize the stage 1 and stage 2 runtimes, so the best answer is probably the latter. It might be moot if the best-best answer is to still use a hybrid plan with gmp-ecm stage 2. However this might change the crossover point (from pure avx-ecm to hybrid).
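For reference, the batch inversion in item 3 above is presumably Montgomery's simultaneous-inversion trick; a minimal scalar sketch of the idea (the repo's version is vectorized):

```python
def batch_inverse(xs, p):
    """Invert every element of xs mod p using ONE modular inversion
    plus about 3*(len(xs)-1) multiplications (Montgomery's trick)."""
    # prefix products: pre[i] = xs[0]*...*xs[i] mod p
    pre, acc = [], 1
    for x in xs:
        acc = acc * x % p
        pre.append(acc)
    inv = pow(acc, -1, p)          # single inversion of the full product
    out = [0] * len(xs)
    for i in range(len(xs) - 1, 0, -1):
        out[i] = inv * pre[i - 1] % p   # peel off xs[i]^-1
        inv = inv * xs[i] % p           # inv now inverts pre[i-1]
    out[0] = inv
    return out
```

Trading many inversions for one is a big win in stage 2, where lots of points must be normalized at once.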

bsquared 2020-11-28 00:58

More updates:

1) Fixed some bugs in the fast stage 2, after finally bearing down and understanding all of the nuances of Montgomery's PAIR algorithm correctly. I had it mostly right before, but not 100%, which caused some factors to be missed. Thank <deity> for a gold reference implementation in gmp-ecm. Also, the fixed bugs actually improved the speed a bit.

2) Got a shiny new laptop with a tigerlake i7-1165G7 in order to get AVX512IFMA up and running. It helped more than I thought it would: going from my 52-bit double-precision floating-point integer multipliers to IFMA speeds up all of the vector math by about 40%!

On the de-facto quick benchmark, this cpu gets:
[CODE]
B1=7M on 2^1277-1:
avx-ecm stage 1: 47.3 sec for 8 curves, 5.9 sec/stg1-curve
avx-ecm stage 2: 18.3 sec for 8 curves, 2.3 sec/stg2-curve
[/CODE]

The next goals are usability-focused. Get a proper set of command line flags in there, some options to control memory use and when to exit, improved logging, and reading in savefiles.

Prime95 2020-11-29 05:19

Have you played with improving PRAC?

I recently did and gained one or two percent. This may be because prime95 can store data FFTed which may change the costing functions. A double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain.

Glad to share the C code for the changes.

bsquared 2020-11-29 18:13

[QUOTE=Prime95;564720]Have you played with improving PRAC?

I recently did and gained one or two percent. This may be because prime95 can store data FFTed which may change the costing functions. A double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain.

Glad to share the C code for the changes.[/QUOTE]

I tried adjusting how much time is spent optimizing chains, without notable success. I'd like to see your code; thanks for offering to share.

I'm also looking into gmp-ecm's different parameterizations. "param 1" seems to be the default for me in most cases. It looks like they choose small parameters so that the multiply by (A+2)/4 is a single limb (saving a multiply when duplicating). They then skip PRAC entirely, instead running Montgomery's ladder, which has a larger percentage of doublings, on a pre-computed product of the B1 primes. Eliminating a multiply could save 10% or so. What I don't understand yet is how the constant can stay single-limb if they are using montgomery multiplies (redc).
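To show where that (A+2)/4 constant actually enters, here is an x-only Montgomery ladder sketch over a prime field: the constant a24 = (A+2)/4 appears only in the doubling step, so making it single-limb cheapens every doubling. This is a generic illustration, not gmp-ecm's code:

```python
def xdbl(X, Z, a24, p):
    # doubling on a Montgomery curve; a24 = (A+2)/4 is the only
    # place the curve constant is used
    t1 = (X + Z) * (X + Z) % p
    t2 = (X - Z) * (X - Z) % p
    t3 = (t1 - t2) % p                        # = 4*X*Z
    return t1 * t2 % p, t3 * (t2 + a24 * t3) % p

def xadd(X1, Z1, X2, Z2, xd, p):
    # differential addition: the difference of the inputs has affine x = xd
    u = (X1 - Z1) * (X2 + Z2) % p
    v = (X1 + Z1) * (X2 - Z2) % p
    return (u + v) * (u + v) % p, xd * (u - v) * (u - v) % p

def ladder(k, x0, A, p):
    """Affine x-coordinate of [k]P given x(P) = x0, for k >= 1."""
    a24 = (A + 2) * pow(4, -1, p) % p
    x2, z2 = 1, 0                  # R0 = point at infinity
    x3, z3 = x0, 1                 # R1 = P; invariant: R1 - R0 = P
    for i in reversed(range(k.bit_length())):
        if (k >> i) & 1:
            (x2, z2), (x3, z3) = (x3, z3), (x2, z2)
        x3, z3 = xadd(x2, z2, x3, z3, x0, p)   # one add per bit
        x2, z2 = xdbl(x2, z2, a24, p)          # one double per bit
        if (k >> i) & 1:
            (x2, z2), (x3, z3) = (x3, z3), (x2, z2)
    return x2 * pow(z2, -1, p) % p
```

Note the ladder does exactly one add and one double per scalar bit, hence the larger fraction of doublings compared with PRAC chains.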

bsquared 2020-11-30 14:47

[QUOTE=Prime95;564720]Have you played with improving PRAC?

I recently did and gained one or two percent. [/QUOTE]

The gain is not quite as large for me for the smaller inputs I am dealing with, but it is a gain of about 1%.

It also seemed to help very slightly to have the cost functions account for a square being faster than a multiply. Doing that increases the total dups/adds ratio very slightly in stage 1.

In my tinyecm code (< 64 bits) where there are just a few fixed and very small B1 values, I found that in a few cases the cost of PRAC(p1*p2) is cheaper than PRAC(p1) + PRAC(p2). It isn't practical to test combinations of primes like this in real time for arbitrary (and larger) B1. But I wonder if it is worth compiling a list of prime pairs that are cheaper when combined below bounds like 1M, 3M, 11M, etc.

bsquared 2020-11-30 16:32

[QUOTE=bsquared;564835]
In my tinyecm code (< 64 bits) where there are just a few fixed and very small B1 values, I found that in a few cases the cost of PRAC(p1*p2) is cheaper than PRAC(p1) + PRAC(p2). It isn't practical to test combinations of primes like this in real time for arbitrary (and larger) B1. But I wonder if it is worth compiling a list of prime pairs that are cheaper when combined below bounds like 1M, 3M, 11M, etc.[/QUOTE]

Ran a quick test attempting to pair primes < 1M. There are ~5k pairs that are cheaper this way, sometimes dramatically so[SUP]1[/SUP]. But there are ~78k primes below this bound so the total cost savings is a fraction of a percent.

edit: never mind - this is not valid; I have to check that the chain actually works! The conclusion is unchanged - finding cheaper pairs doesn't seem worthwhile.
[CODE]prime 9769 * 9781 cost = 103.5
prime 9769 cost = 109.0
prime 9781 cost = 103.5
(sum 212.5, savings 109.0 @ prac v=0.620181980807)
[/CODE]

mathwiz 2020-12-08 16:41

If I do a "git pull" on the latest code, and then "make COMPILER=gcc SKYLAKEX=1", I'm getting a bunch of linker errors:

[code]gcc -g -O3 -mavx -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I../../gmp-6.2.1 -c -o vec_common.o vec_common.c
gcc -g -O3 -mavx -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I../../gmp-6.2.1 -c -o calc.o calc.c
gcc -g -O3 -mavx -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I../../gmp-6.2.1 -c -o queue.o queue.c
rm -f libavxecm.a
ar r libavxecm.a eratosthenes/presieve.o eratosthenes/count.o eratosthenes/offsets.o eratosthenes/primes.o eratosthenes/roots.o eratosthenes/linesieve.o eratosthenes/soe.o eratosthenes/tiny.o eratosthenes/worker.o eratosthenes/soe_util.o eratosthenes/wrapper.o threadpool.o main.o ecm.o util.o vecarith.o vecarith52.o vec_common.o calc.o queue.o
ar: creating libavxecm.a
ranlib libavxecm.a
gcc -g -O3 -mavx -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I../../gmp-6.2.1 eratosthenes/presieve.o eratosthenes/count.o eratosthenes/offsets.o eratosthenes/primes.o eratosthenes/roots.o eratosthenes/linesieve.o eratosthenes/soe.o eratosthenes/tiny.o eratosthenes/worker.o eratosthenes/soe_util.o eratosthenes/wrapper.o threadpool.o main.o ecm.o util.o vecarith.o vecarith52.o vec_common.o calc.o queue.o -o avx-ecm libavxecm.a -L../../gmp-6.2.1/.libs -lm -lgmp -lpthread
/usr/bin/ld: eratosthenes/count.o:/tmp/avx-ecm/eratosthenes/soe.h:323: multiple definition of `nmasks'; eratosthenes/presieve.o:/tmp/avx-ecm/eratosthenes/soe.h:323: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avx-ecm/eratosthenes/soe.h:341: multiple definition of `szSOEp'; eratosthenes/presieve.o:/tmp/avx-ecm/eratosthenes/soe.h:341: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avx-ecm/eratosthenes/soe.h:340: multiple definition of `spSOEprimes'; eratosthenes/presieve.o:/tmp/avx-ecm/eratosthenes/soe.h:340: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avx-ecm/eratosthenes/soe.h:337: multiple definition of `P_MAX'; eratosthenes/presieve.o:/tmp/avx-ecm/eratosthenes/soe.h:337: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avx-ecm/eratosthenes/soe.h:336: multiple definition of `P_MIN'; eratosthenes/presieve.o:/tmp/avx-ecm/eratosthenes/soe.h:336: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avx-ecm/eratosthenes/soe.h:335: multiple definition of `NUM_P'; eratosthenes/presieve.o:/tmp/avx-ecm/eratosthenes/soe.h:335: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avx-ecm/eratosthenes/soe.h:334: multiple definition of `PRIMES'; eratosthenes/presieve.o:/tmp/avx-ecm/eratosthenes/soe.h:334: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avx-ecm/eratosthenes/soe.h:330: multiple definition of `SOE_VFLAG'; eratosthenes/presieve.o:/tmp/avx-ecm/eratosthenes/soe.h:330: first defined here[/code]

bsquared 2020-12-08 16:47

[QUOTE=mathwiz;565667]If I do a "git pull" on the latest code, and then "make COMPILER=gcc SKYLAKEX=1", I'm getting a bunch of linker errors:
[/QUOTE]

Those are very similar errors to what is happening [URL="https://www.mersenneforum.org/showthread.php?t=23087"]here[/URL], with someone trying to compile YAFU using gcc 10.2.0. Are you using the same version of gcc, by any chance?

As with yafu, I see no problems with gcc 7.3.0 or icc 18.0.3.

[edit]
I know that defining globals in header files is frowned upon, but I didn't know it was an error as long as the header is only included once. That is why they are protected with #ifndef/#define include guards. Could that have changed with the latest gcc? Maybe putting -std=c11 in the make would help?
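Include guards only prevent a header from being included twice within one translation unit; every .c file that includes soe.h still emits its own definition of those globals, and newer gcc (10+) defaults to -fno-common, so the linker rejects the duplicates instead of merging them. A sketch of the usual extern workaround, with illustrative names taken from the error log above:

```c
/* soe.h -- declare the globals, don't define them (guards unchanged) */
#ifndef SOE_H
#define SOE_H
#include <stdint.h>

extern uint64_t *PRIMES;    /* declaration only: no storage allocated */
extern uint64_t NUM_P;

#endif

/* soe.c -- exactly one translation unit provides the definitions */
#include "soe.h"
uint64_t *PRIMES = NULL;
uint64_t NUM_P = 0;
```

Alternatively, adding -fcommon to CFLAGS restores the old behavior without source changes.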

EdH 2020-12-08 17:16

Just to note, gcc 8.3.0 (Debian) and 9.3.0 (Ubuntu) worked fine for YAFU (wip), last time I upgraded all my machines.


Note: I will probably move this to the other thread later, since it's not really about AVX-ECM.

mathwiz 2020-12-08 21:36

[QUOTE=EdH;565675]Just to note, gcc 8.3.0 (Debian) and 9.3.0 (Ubuntu) worked fine for YAFU (wip), last time I upgraded all my machines.


Note: I will probably move this to the other thread later, since it's not really about AVX-ECM.[/QUOTE]

Indeed, using gcc-8 instead of my default (gcc-10) appears to work fine.

bsquared 2020-12-08 21:51

[QUOTE=mathwiz;565704]Indeed, using gcc-8 instead of my default (gcc-10) appears to work fine.[/QUOTE]

Good to know, thanks.

I've asked a question about this over in the programming thread; more people read that, and someone might know what's going on. I've browsed the gcc changelog but nothing jumps out at me as a possible cause of this behavior. No time at the moment to dig deeper... hoping that someone smarter than me can bail me out here.

