AVX-ECM
Just created a repository on github with new ECM code: [url]https://github.com/bbuhrow/avxecm[/url]
It computes curves in parallel using AVX-512 vector arithmetic, and is also multithreaded. I've measured it to have about 2x the throughput of GMP-ECM using 1 thread on a Xeon Gold 5122 CPU. It does both stage 1 and stage 2 in parallel, but stage 2 is just the standard continuation with pairing (by default, B2=100*B1). It will output a savefile after stage 1 that will work with GMP-ECM stage 2, if desired, e.g.:

ecm -resume save_b1.txt 1000000 < input.txt

Compile on Linux using either icc or gcc 7.3.0 or above (I've only tested icc and gcc 7.3.0), e.g.:

make MAXBITS=416 COMPILER=gcc730 SKYLAKEX=1

The MAXBITS parameter must be a multiple of 208. When this is the case, 8 curves are performed in parallel per thread. Alternatively, if DIGITBITS=32 is specified during make, then MAXBITS must be a multiple of 128 and 16 curves are performed in parallel per thread. The 52-bit version is generally faster (and the default), but for some sizes a 32-bit version can have higher throughput (because of the 208-bit jumps between sizes).

If anyone wants to test it out you will of course need an AVX-512-capable computer. Let me know if there are any problems. Happy factoring!

============================================================
[Edit] Sticky comparison tables

Generic 900-bit input, Xeon Gold 5122 CPU, 1 thread, icc compiler.

Notes:
1) "weight" is obtained by determining the number of curves for the given tlevel using -power 1 and B2=100*B1 in gmpecm, and then also using default gmpecm settings, and taking the ratio.
2) hybrid = (avxecm stage 1 time + 8*gmpecm stage 2 time) / 8

[CODE]generic 900-bit

                  avxecm                                          weighted
tlevel  B1        stg 1    stg 2    sum     sec/curve   weight   sec/curve
35      1.00E+06  13       9.2      22.2    2.78        1.85     5.14
40      3.00E+06  39.3     26.3     65.6    8.20        2.07     17.01
45      1.00E+07  131.1    81.1     212.2   26.53       2.35     62.33
50      4.30E+07  567.1    334.3    901.4   112.68      2.46     277.42
55      1.10E+08  1448.3   822.6    2270.9  283.86      2.63     747.74
60      2.60E+08  3423     1874     5303    662.98      2.87     1902.6

                  gmpecm 7.0.4                                    hybrid
tlevel  B1        stg 1    stg 2    sum     sec/curve   sec/curve
35      1.00E+06  6.59     3.45     10.04   10.04       5.08
40      3.00E+06  19.8     7.77     27.57   27.57       12.68
45      1.00E+07  65.9     25.4     91.3    91.30       41.79
50      4.30E+07  287.5    78.8     366.3   366.30      149.69
55      1.10E+08  738.3    196.2    934.5   934.50      377.24
60      2.60E+08  1750     412.5    2162.5  2160.5      841.1

                  gmp/weighted-avx   gmp/hybrid         avx/hybrid
tlevel  B1        sec/curve ratio    sec/curve ratio    ratio
35      1.00E+06  1.95               1.98               1.01
40      3.00E+06  1.62               2.17               1.34
45      1.00E+07  1.46               2.18               1.49
50      4.30E+07  1.32               2.45               1.85
55      1.10E+08  1.25               2.48               1.98
60      2.60E+08  1.14               2.57               2.26[/CODE]

Conclusions: Below B1=1M, probably better to just use avxecm. At B1=1M and above, hybrid is best. Somewhere above 260M, pure GMP-ECM likely beats pure AVX-ECM, due to increasing weight, but hybrid is still best.

For Mersenne inputs with primitive size between 832 and 1039 bits:
[CODE]                  gmp/weighted-avx   gmp/hybrid         avx/hybrid
tlevel  B1        sec/curve ratio    sec/curve ratio    ratio
35      1.00E+06  3.72               2.26               0.61
40      3.00E+06  3.06               2.56               0.84
45      1.00E+07  2.79               2.55               0.91
50      4.30E+07  2.49               2.97               1.19
55      1.10E+08  2.31               3.11               1.35
60      2.60E+08  2.09               3.29               1.57[/CODE]
Hybrid best starting at B1 around 30M.

For Mersenne inputs with primitive size between 1248 and 1456 bits:
[CODE]                  gmp/weighted-avx   gmp/hybrid         avx/hybrid
tlevel  B1        sec/curve ratio    sec/curve ratio    ratio
35      1.00E+06  3.12               2.06               0.66
40      3.00E+06  2.59               2.32               0.89
45      1.00E+07  2.30               2.32               1.01
50      4.30E+07  2.08               2.67               1.28
55      1.10E+08  1.92               2.79               1.46
60      2.60E+08  1.74               2.95               1.69
65      8.50E+08  1.65               3.17               1.92
70      2.90E+09  1.52               3.40               2.23
75      7.60E+09  1.44               3.59               2.50[/CODE]
Hybrid best starting at B1 around 10M.
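To make the two derived columns concrete, the t35 row of the generic 900-bit tables works out as follows, straight from the definitions in the notes above (rounding explains the small differences in other rows):

[CODE]weighted sec/curve = 2.78 * 1.85         = 5.14
hybrid sec/curve   = (13 + 8*3.45) / 8   = 40.6 / 8 ~= 5.08[/CODE]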
Hmmmm. Just tried it out on an AVX-512-capable machine, with B1=1e6 and input M433 ([url]http://factordb.com/index.php?id=1000000000000000433[/url]):
[CODE]./avxecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 100 1000000
commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591
ECM has been configured with MAXBITS = 416, NWORDS = 8, VECLEN = 8
starting process 171275
cached 3001134 primes < 49999991
Input has 433 bits, using 1 threads (100 curves/thread)
Processing in batches of 100000000 primes
Initialization took 0.0743 seconds.
Cached 5761455 primes in range [2 : 99999989]
commencing curves 0-7 of 100
Building curves took 0.0005 seconds.
commencing Stage 1 @ prime 2
accumulating prime 997693
Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles
Stage 1 took 4.1971 seconds

found factor 7573885960626120430675586583017049046287571561744815615 in stage 1 in thread 0, vec position 0, with sigma = 9805897911615405889
found factor 125483996795288390556263452518000395100487722298175677228160127607935 in stage 1 in thread 0, vec position 1, with sigma = 5941742753921851068
found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 2, with sigma = 17488052527796218971
found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 3, with sigma = 162542135677334094
found factor 9704134944669326942243922594738986945861288047284534894845604139673055 in stage 1 in thread 0, vec position 4, with sigma = 9639283928122886917
found factor 19065414140013908092677804414156626466105803295 in stage 1 in thread 0, vec position 5, with sigma = 14437779671575784496
found factor 455237245388167216759545663443996983027266740639813141535 in stage 1 in thread 0, vec position 6, with sigma = 5479294375532123583
found factor 161449908068885681487281375790285735990727019303070577509707381288895 in stage 1 in thread 0, vec position 7, with sigma = 16438922582860911842

commencing stage 2 at p=1000003, A=1000230
w = 1155, R = 480, L = 16, umax = 9240, amin = 433
Stage 2 took 3.1940 seconds
performed 3188920 pair-multiplies for 5682957 primes in stage 2
performed 95044 point-additions and 53 point-doubles in stage 2
something failed: tid = 0, vec = 0 has zero result
something failed: tid = 0, vec = 1 has zero result
something failed: tid = 0, vec = 2 has zero result
something failed: tid = 0, vec = 3 has zero result
something failed: tid = 0, vec = 4 has zero result
something failed: tid = 0, vec = 5 has zero result
something failed: tid = 0, vec = 6 has zero result
something failed: tid = 0, vec = 7 has zero result
Process took 7.4682 seconds.[/CODE]

Something seems wrong here? Or am I holding it wrong? Compare to:

[CODE]$ echo "22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591" | ./ecm -c 10 1e6
GMP-ECM 7.0.4 [configured with GMP 6.1.2, GWNUM 29.8, --enable-asmredc] [ECM]
Due to incompatible licenses, this binary file must not be distributed.
Input number is 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 (131 digits)
Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:14118429249538045316
Step 1 took 2892ms
Step 2 took 2391ms
Run 2 out of 10:
Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:15640357536401406010
********** Factor found in step 1: 22086765417396827057
Found prime factor of 20 digits: 22086765417396827057
Composite cofactor 1004282751855334395242501879256940939756665292745707836441839543208470249351377144560254198205767424080811494063 has 112 digits[/CODE]
It looks like your input is larger than 416 bits. So you'd have to rebuild with MAXBITS=624. Actually, this is a case where MAXBITS=512 DIGITBITS=32 will give you slightly higher throughput (because 512 is quite a bit smaller than 624).

New version that chooses MAXBITS automatically based on the input size. You still have to specify DIGITBITS during build as this impacts the underlying word size, which can't be easily changed on the fly. Default is 52-bit, resulting in steps of 208 bits. It is fixed-size arithmetic within these steps, so curves at 417 bits will take the same amount of time as curves with 623 bits. This is similar to GPU-ECM, although with more fine-grained resolution.
The 32-bit build path gives steps of 128 bits. So for this input it will choose a 512-bit curve size, which is faster than a 624-bit curve with 52-bit words.

[CODE]./avxecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 16 1000000 1
commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591
starting process 226272
ECM has been configured with DIGITBITS = 32, VECLEN = 16
Choosing MAXBITS = 512, NWORDS = 16, NBLOCKS = 4 based on input size 433
cached 3001134 primes < 49999991
Input has 433 bits, using 1 threads (16 curves/thread)
Processing in batches of 100000000 primes
Initialization took 0.0895 seconds.
Cached 5761455 primes in range [2 : 99999989]
commencing curves 0-15 of 16
Building curves took 0.0006 seconds.
commencing Stage 1 @ prime 2
accumulating prime 997693
Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles
Stage 1 took 11.0072 seconds

found factor 22086765417396827057 in stage 1 in thread 0, vec position 10, with sigma = 415569030
found factor 22086765417396827057 in stage 1 in thread 0, vec position 11, with sigma = 3510146781

commencing stage 2 at p=1000003, A=1000230
w = 1155, R = 480, L = 16, umax = 9240, amin = 433
Stage 2 took 7.5996 seconds
performed 3188920 pair-multiplies for 5682957 primes in stage 2
performed 95044 point-additions and 53 point-doubles in stage 2
found factor 22086765417396827057 in stage 2 in thread 0, vec position 0, with sigma = 551874572
found factor 22086765417396827057 in stage 2 in thread 0, vec position 5, with sigma = 2520392143
found factor 22086765417396827057 in stage 2 in thread 0, vec position 6, with sigma = 1474306994
found factor 22086765417396827057 in stage 2 in thread 0, vec position 8, with sigma = 2588423220
found factor 22086765417396827057 in stage 2 in thread 0, vec position 10, with sigma = 415569030
found factor 22086765417396827057 in stage 2 in thread 0, vec position 11, with sigma = 3510146781
found factor 22086765417396827057 in stage 2 in thread 0, vec position 12, with sigma = 1003661608
Process took 18.7103 seconds.[/CODE]

For this size input, 16 simultaneous 512-bit curves complete in 18.7 sec or 1.17 sec/curve versus GMP-ECM's 1.72 sec/curve.
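To illustrate how the fixed-size steps map to an input, here is a small standalone sketch (illustrative only, not the actual avx-ecm source) of the size selection described above:

[CODE]/* Illustrative sketch only, not avx-ecm code: round an input size up to the
 * next fixed arithmetic size.  52-bit words give 208-bit steps, 32-bit words
 * give 128-bit steps. */
#include <stdint.h>
#include <stdio.h>

static uint32_t choose_maxbits(uint32_t input_bits, int digitbits)
{
    uint32_t step = (digitbits == 52) ? 208 : 128;
    return ((input_bits + step - 1) / step) * step;
}

int main(void)
{
    /* the 433-bit input above: 624 bits with 52-bit words, 512 with 32-bit words */
    printf("%u %u\n", choose_maxbits(433, 52), choose_maxbits(433, 32));
    return 0;
}[/CODE]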
Another commit:
+ fixed printing issues and get_ui/set_ui with mingw
+ parameterization seems not exactly compatible with gmpecm: looking into it
+ adding Windows binaries for 32-bit and 52-bit arithmetic versions

The mingw64 build for Windows is a little slower but seems to work.

I'm not exactly sure what's going on with the parameterization... this program finds factors when it seems like it shouldn't for some sigma and conversely doesn't find factors when gmpecm does for some sigma. On average though the factor-finding rate seems about the same. For example:

[CODE]./avxecm 274083933049131091178904352004902235910804839081565071801565126307886989756226870709872076759480838311782134623705910949147664751751242690094033868727929674779456111539120708680813658241 128 1000000 16
<snip>
found factor 618970019642690137449562111 in stage 2 in thread 8, vec position 2, with sigma = 17981939361520122334[/CODE]

But using [URL="https://www.mersenneforum.org/showpost.php?p=236443&postcount=1"]this sage code[/URL] [URL="http://magma.maths.usyd.edu.au/calc/"]here[/URL] gives the group order for that factor and sigma as:

[CODE][ <2, 2>, <3, 2>, <17193611656740451817020637, 1> ][/CODE]

And gmpecm with that sigma only finds the factor 74383430474532481, with the following entirely reasonable group order:

[CODE][ <2, 2>, <3, 3>, <19, 1>, <23, 1>, <3617, 1>, <435735059, 1> ][/CODE]
Thanks for all the updates!
Would it be a reasonable FR to request basic expression parsing, so that the input does not have to be the full decimal representation of N? 
[QUOTE=mathwiz;533849]Thanks for all the updates!
Would it be a reasonable FR to request basic expression parsing, so that the input does not have to be the full decimal representation of N?[/QUOTE] Yes, that's reasonable :) BTW, thanks for testing it. 
How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.
Chris 
[QUOTE=chris2be8;533905]How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.[/QUOTE]
Wikipedia [URL="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX512"]writes[/URL], without citation: [INDENT]As of 2019, there are no AMD CPUs that support AVX512, and AMD has not yet released plans to support AVX512.[/INDENT] As the [url=https://www.amd.com/en/products/cpu/amdryzen52600]Ryzen 5 2600[/url] was launched in 2018 I trust it does not support AVX512. 
As for Intel, their advanced search allows you to [url=https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&1_FilterInstructionSetExtensions=3533]search for AVX512[/url] and find all processors with it. (You can further limit the search, or just look up a chip and see if it's there.)

I would have sworn my earlier sessions with Colab showed AVX512 for their Xeon CPUs when I checked. Now, none of my recent sessions have. The CPU type just says Xeon CPU @ 2.00GHz. It does give me a family, etc. for the CPU(s), but no name. Is there a way to find that via the command line?
If I can, I'll set up a Colab AVXECM example in my "How I . . ." series. 
[QUOTE=EdH;533996]I would have sworn my earlier sessions with Colab showed AVX512 for their Xeon CPUs when I checked. Now, none of my recent sessions have. The CPU type just says Xeon CPU @ 2.00GHz. It does give me a family, etc. for the CPU(s), but no name. Is there a way to find that via the command line?
If I can, I'll set up a Colab AVXECM example in my "How I . . ." series.[/QUOTE] Not sure if this applies across all distros, but at least for CentOS/RedHat you can do:

cat /proc/cpuinfo

which will list the cpu flags, e.g.:

[CODE]flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a [B]avx512f avx512dq[/B] rdseed adx smap clflushopt clwb [B]avx512cd avx512bw avx512vl[/B] xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear spec_ctrl intel_stibp flush_l1d[/CODE]

AVX-ECM needs avx512f and avx512dq.
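If you'd rather check from a program than read the flags, a small standalone test like this (just an illustration, not part of AVX-ECM) works with gcc:

[CODE]/* Check for the AVX-512 subsets AVX-ECM needs, using GCC's builtin CPU detection. */
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();
    int ok = __builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512dq");
    printf("avx512f + avx512dq: %s\n", ok ? "supported" : "not supported");
    return ok ? 0 : 1;
}[/CODE]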
Speaking of cpu features, can't wait for systems with avx512ifma to become available. Should get a good speed boost with that feature.

[QUOTE=chris2be8;533905]How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.[/QUOTE]
[c]grep flags /proc/cpuinfo[/c] should show which specific subsets of the AVX512 instruction set are supported. For instance, the CPU I use has avx512f avx512dq avx512cd avx512bw avx512vl. For the meanings, see [URL="https://en.wikipedia.org/wiki/AVX512#Instruction_set"]Wikipedia[/URL]. 
[QUOTE=bsquared;533997]cat /proc/cpuinfo[/QUOTE]
I just happen to have an SSH session into a Colab session open right now.

[CODE]root@InstanceSSH:~# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) CPU @ 2.00GHz
stepping        : 3
microcode       : 0x1
cpu MHz         : 2000.168
cache size      : 39424 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips        : 4000.33
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) CPU @ 2.00GHz
stepping        : 3
microcode       : 0x1
cpu MHz         : 2000.168
cache size      : 39424 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips        : 4000.33
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:[/CODE]
[QUOTE=chalsall;534005]I just happen to have an SSH session into a Colab session open right now.
[CODE]root@InstanceSSH:~# cat /proc/cpuinfo
flags : ... avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl ...[/CODE][/QUOTE]

Nice!
I bolted on a simplified version of yafu's front end calculator:
[CODE]./avxecm "fib(791)/13/677/216416017" 8 1000000 1 starting process 418960 commencing parallel ecm on 478538644094177490260709299559653477212580778084175536200356719720677713018603842152898226485988725658355447208677180777716686530734249077491242934517377 ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64 Choosing MAXBITS = 624, NWORDS = 12, NBLOCKS = 3 based on input size 508 cached 3001134 primes < 49999991 Input has 508 bits, using 1 threads (8 curves/thread) Processing in batches of 100000000 primes Initialization took 0.1045 seconds. Cached 5761455 primes in range [2 : 99999989] commencing curves 07 of 8 Building curves took 0.0006 seconds. commencing Stage 1 @ prime 2 accumulating prime 997693 Stage 1 completed at prime 999983 with 2001925 pointadds and 194143 pointdoubles Stage 1 took 6.8980 seconds found factor 272602401466814027129 in stage 1 in thread 0, vec position 3, with sigma = 1465376940087886860 commencing stage 2 at p=1000003, A=1000230 w = 1155, R = 480, L = 16, umax = 9240, amin = 433 Stage 2 took 4.9852 seconds performed 3188920 pairmultiplies for 5682957 primes in stage 2 performed 95044 pointadditions and 53 pointdoubles in stage 2 found factor 272602401466814027129 in stage 2 in thread 0, vec position 3, with sigma = 1465376940087886860 found factor 272602401466814027129 in stage 2 in thread 0, vec position 4, with sigma = 16373433592041433963 found factor 272602401466814027129 in stage 2 in thread 0, vec position 7, with sigma = 12457157968472999552 Process took 11.9819 seconds. [/CODE] So now you should be able to use +,,*,/,^,%,# (primorial), ! (factorial), as well as fib(), luc(), >>, <<, binary ops (xor,and,not,or), and a few other things. 
Sorry, duplicate post...

[QUOTE=GP2;533999][c]grep flags /proc/cpuinfo[/c] should show which specific subsets of the AVX512 instruction set are supported.[/QUOTE]
For the lazy: [code]egrep -o 'avx512[^ ]*' /proc/cpuinfo | sort -u[/code]
FWIW, I'm getting segfaults with the current master branch. Was a bug introduced?

[QUOTE=mathwiz;534063]FWIW, I'm getting segfaults with the current master branch. Was a bug introduced?[/QUOTE]
Probably... can you let me know what command line was giving you problems? It has been working for me but I haven't done a lot of testing either. 
[QUOTE=bsquared;534071]Probably... can you let me know what command line was giving you problems? It has been working for me but I haven't done a lot of testing either.[/QUOTE]
I see now after testing on a different system with gcc that it doesn't work at all. I committed a fix (was missing a couple libraries in the new calc front end). Tested on Linux with icc, with mingw on Windows/msys2, and on Windows Subsystem for Linux/gcc.

[QUOTE=bsquared;533848]+ parameterization seems not exactly compatible with gmpecm: looking into it

I'm not exactly sure what's going on with the parameterization... this program finds factors when it seems like it shouldn't for some sigma and conversely doesn't find factors when gmpecm does for some sigma. On average though the factor-finding rate seems about the same.[/QUOTE]

Also fixed this. It was a simple overflow when trying to set random sigma values into an mpz_t.
I had been using cat /proc/cpuinfo and lscpu, but really appreciate the egrep line.
I have discovered that if I try to get a new python3 notebook, it invariably has no avx512. However, if I "Connect" the "Welcome" page, I get the full list of avx512bw/cd/dq/f/vl. I hope to build a Colab AVX-ECM instance soon.
[QUOTE=EdH;534088]I had been using cat /proc/cpuinfo and lscpu, but really appreciate the egrep line.
I have discovered that if I try to get a new python3 notebook, it invariably has no avx512. However, if I "Connect" the "Welcome" page, I get the full list of avx512bw/cd/dq/f/vl. I hope to build a Colab AVX-ECM instance soon.[/QUOTE] I think it is more random than that. I always connect through the Welcome page, but the CPU I get is hit-or-miss between Haswell, Broadwell, or Skylake. Only the Skylake has AVX-512. If I need a Skylake, right after I connect I use lscpu to check, then I reset the session until I get one. It usually takes only a few tries.
[QUOTE=PhilF;534138]I think it is more random than that. I always connect through the Welcome page, but the CPU I get is hit-or-miss between Haswell, Broadwell, or Skylake. Only the Skylake has AVX-512. If I need a Skylake, right after I connect I use lscpu to check, then I reset the session until I get one. It usually takes only a few tries.[/QUOTE]
You seem to be quite right. I tried several without success (even from the Welcome page) and then got one, but alas, I couldn't get avxecm to compile. GCC was 7.4.0 - changing GCC=7.4.0 in the Makefile didn't help. I'll play more later.
[QUOTE=EdH;534140]You seem to be quite right. I tried several without success (even from Welcome page) and then got one, but alas, I couldn't get avxecm to compile. GCC was 7.4.0  changing GCC=7.4.0 in the Makefile didn't help. I'll play more later.[/QUOTE]
I was able to get it to compile and run successfully on a Skylake machine by doing the following:

[CODE]!sudo apt-get install libgmp-dev
!git clone https://github.com/bbuhrow/avxecm.git
!cd avxecm/ && ls
!make -j 8 SKYLAKEX=1 COMPILER=gcc[/CODE]

And running with, e.g.:

[CODE]!cd avxecm && ./avxecm "2^1277-1" 100 1000000 8[/CODE]
I'm getting compiling errors on an EC2 Ubuntu 18.04 instance with GCC 9.2.1
Edit: ...and same error with GCC 7.4.0

[CODE]ubuntu@ip17231123:/mntefs/z/avxecm$ gcc --version
gcc (Ubuntu 9.2.1-17ubuntu1~18.04.1) 9.2.1 20191102
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ubuntu@ip17231123:/mntefs/z/avxecm$ make MAXBITS=416 COMPILER=gcc SKYLAKEX=1
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/presieve.o eratosthenes/presieve.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/count.o eratosthenes/count.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/offsets.o eratosthenes/offsets.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/primes.o eratosthenes/primes.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/roots.o eratosthenes/roots.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/linesieve.o eratosthenes/linesieve.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/soe.o eratosthenes/soe.c
eratosthenes/soe.c: In function ‘spSOE’:
eratosthenes/soe.c:273:43: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘uint32_t’ {aka ‘unsigned int’} [-Wformat=]
  273 |     printf("using %lu primes, max prime = %lu \n", sdata.pboundi, sieve_p[sdata.pboundi]); // sdata.pbound);
      |                                         ~~^        ~~~~~~~~~~~~~~~~~~~~~~
      |                                           |                      |
      |                                           long unsigned int      uint32_t {aka unsigned int}
      |                                         %u
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/tiny.o eratosthenes/tiny.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/worker.o eratosthenes/worker.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/soe_util.o eratosthenes/soe_util.c
eratosthenes/soe_util.c: In function ‘check_input’:
eratosthenes/soe_util.c:140:17: warning: assignment to ‘__mpz_struct (*)[1]’ {aka ‘struct <anonymous> (*)[1]’} from incompatible pointer type ‘__mpz_struct *’ {aka ‘struct <anonymous> *’} [-Wincompatible-pointer-types]
  140 |   sdata->offset = offset;
      |                 ^
eratosthenes/soe_util.c: In function ‘init_sieve’:
eratosthenes/soe_util.c:172:13: warning: implicit declaration of function ‘spGCD’ [-Wimplicit-function-declaration]
  172 |         if (spGCD(i, (uint64_t)prodN) == 1)
      |             ^~~~~
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o eratosthenes/wrapper.o eratosthenes/wrapper.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o threadpool.o threadpool.c
gcc -fopenmp -DMAXBITS=416 -g -O3 -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I/projects/gmp-6.0.0a/install/include -c -o main.o main.c
<command-line>: error: expected identifier or ‘(’ before numeric constant
avx_ecm.h:107:10: note: in expansion of macro ‘MAXBITS’
  107 | uint32_t MAXBITS;
      |          ^~~~~~~
main.c: In function ‘main’:
main.c:203:17: error: lvalue required as left operand of assignment
  203 |         MAXBITS = 208;
      |                 ^
main.c:206:21: error: lvalue required as left operand of assignment
  206 |             MAXBITS += 208;
      |                     ^~
main.c:211:17: error: lvalue required as left operand of assignment
  211 |         MAXBITS = 128;
      |                 ^
main.c:214:21: error: lvalue required as left operand of assignment
  214 |             MAXBITS += 128;
      |                     ^~
main.c:274:24: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t’ {aka ‘long unsigned int’} [-Wformat=]
  274 |     printf("Input has %d bits, using %d threads (%d curves/thread)\n",
      |                       ~^
      |                        |
      |                        int
      |                       %ld
main.c:325:13: warning: unused variable ‘j’ [-Wunused-variable]
  325 |         int j;
      |             ^
main.c:145:9: warning: unused variable ‘nextptr’ [-Wunused-variable]
  145 |   char **nextptr;
      |          ^~~~~~~
main.c:144:14: warning: unused variable ‘j’ [-Wunused-variable]
  144 |   uint32_t i, j;
      |               ^
main.c:140:12: warning: unused variable ‘siglist’ [-Wunused-variable]
  140 |   uint32_t *siglist;
      |             ^~~~~~~
main.c:138:15: warning: unused variable ‘n’ [-Wunused-variable]
  138 |   bignum **f, *n;
      |                ^
main.c:138:11: warning: unused variable ‘f’ [-Wunused-variable]
  138 |   bignum **f, *n;
      |            ^
At top level:
main.c:50:12: warning: ‘debugctr’ defined but not used [-Wunused-variable]
   50 | static int debugctr = 0;
      |            ^~~~~~~~
Makefile:174: recipe for target 'main.o' failed
make: *** [main.o] Error 1
ubuntu@ip17231123:/mntefs/z/avxecm$ [/CODE]
[QUOTE=ATH;534146]I'm getting compiling errors on an EC2 Ubuntu 18.04 instance with GCC 9.2.1
Edit: ...and same error with GCC 7.4.0 [/QUOTE] As of post #4, you don't need the MAXBITS=416 directive on the compile line. Just build with COMPILER=gcc and SKYLAKEX=1. Sorry, I will update the readme. 
From my limited knowledge of Makefiles and gcc, I notice that the gcc options for SKYLAKEX=1 specify "-march=skylake-avx512" but not "-mtune=skylake-avx512". Is this deliberate?

[QUOTE=mathwiz;534168]From my limited knowledge of Makefiles and gcc, I notice that the gcc options for SKYLAKEX=1 specify "-march=skylake-avx512" but not "-mtune=skylake-avx512". Is this deliberate?[/QUOTE]
From [url]https://gcc.gnu.org/onlinedocs/gcc/x86Options.html[/url]:

[QUOTE][B]3.18.59 x86 Options[/B]

These ‘-m’ options are defined for the x86 family of computers.

-march=cpu-type
Generate instructions for the machine type cpu-type. In contrast to -mtune=cpu-type, which merely tunes the generated code for the specified cpu-type, -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated. Specifying -march=cpu-type implies -mtune=cpu-type.[/QUOTE]

So according to the last sentence there, we don't need -mtune if we've already put in -march. I ran some experiments anyway and didn't see any speed difference when adding -mtune.

I'll also mention that if you have access to Intel's compiler, it seems to do a much better job with this code. For the 624-bit size, for instance, I get 8 stage 1 curves at B1=1e6 in 8 seconds with gcc versus 7 seconds with icc. Even thinking about building a statically-linked executable gives me a headache though...
Thanks, it worked. I tested it in an EC2 instance with 1 core / 2 threads (Xeon Cascadelake) on a 402 bit prime number:
[U]160 curves at B1=1,000,000:[/U]
GMP-ECM: [B]300.6 sec[/B]
AVX-ECM 1 thread: [B]179.5 sec[/B]
AVX-ECM 2 threads: [B]144.8 sec[/B]

I created a 402-bit composite number with a 30-digit (97-bit) factor and tested it 10 times with B1=1,000,000 (35-digit level) to see how fast the factor was found with GMP-ECM and AVX-ECM:

[CODE]GMP-ECM       AVX-ECM
curve  23     curves 0-15
curve  34     curves 16-31
curve  48     curves 32-47
curve  81     curves 48-63  (found twice)
curve  89     curves 64-79
curve 129     curves 144-159
curve 135     curves 144-159
curve 168     curves 160-175 (found twice)
curve 195     curves 384-399
curve 264     curves 416-431[/CODE]

So based on the speed from the first test, GMP-ECM found the factor 10 times in 2191 seconds and AVX-ECM found the factor 10 (12) times in 1419 seconds.

I took the 12 sigma values for which AVX-ECM found the factor and tested them in GMP-ECM, and it found the factor as well with all 12 sigmas.
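(For reference, those totals appear to follow from the per-curve rates of the first test: GMP-ECM needed 23+34+48+81+89+129+135+168+195+264 = 1166 curves at about 300.6/160 ~= 1.88 sec/curve ~= 2191 sec, while AVX-ECM completed batches totalling 16+32+48+64+80+160+160+176+400+432 = 1568 curves at about 144.8/160 ~= 0.91 sec/curve ~= 1419 sec.)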
[QUOTE=mathwiz;534142]I was able to get it to compile and run successfully on a Skylake machine by doing the following:
[CODE]!sudo apt-get install libgmp-dev
!git clone https://github.com/bbuhrow/avxecm.git
!cd avxecm/ && ls
!make -j 8 SKYLAKEX=1 COMPILER=gcc[/CODE]

And running with, e.g.:

[CODE]!cd avxecm && ./avxecm "2^1277-1" 100 1000000 8[/CODE][/QUOTE]

Thanks! I got a Colab session to work. I'll play a little more and then make a "How I . . ." for my blog section.
[QUOTE=ATH;534182]Thanks, it worked. I tested it in an EC2 instance with 1 core / 2 threads (Xeon Cascadelake) on a 402 bit prime number:
[U]160 curves at B1=1,000,000:[/U]
GMP-ECM: [B]300.6 sec[/B]
AVX-ECM 1 thread: [B]179.5 sec[/B]
AVX-ECM 2 threads: [B]144.8 sec[/B]

I created a 402-bit composite number with a 30-digit (97-bit) factor and tested it 10 times with B1=1,000,000 (35-digit level) to see how fast the factor was found with GMP-ECM and AVX-ECM:

[CODE]GMP-ECM       AVX-ECM
curve  23     curves 0-15
curve  34     curves 16-31
curve  48     curves 32-47
curve  81     curves 48-63  (found twice)
curve  89     curves 64-79
curve 129     curves 144-159
curve 135     curves 144-159
curve 168     curves 160-175 (found twice)
curve 195     curves 384-399
curve 264     curves 416-431[/CODE]

So based on the speed from the first test, GMP-ECM found the factor 10 times in 2191 seconds and AVX-ECM found the factor 10 (12) times in 1419 seconds.

I took the 12 sigma values for which AVX-ECM found the factor and tested them in GMP-ECM, and it found the factor as well with all 12 sigmas.[/QUOTE]

Fantastic, thank you for those tests! Decent speedup from the hyperthread - I haven't had a chance to test that yet, so thanks again.
It took quite a few tries to get a machine with avx512 today, but all went well when I did.
I haven't tried this yet, but hope to soon, and wondered what the thoughts are for a setup that uses the GPU branch of ECM for stage 1 and this avxecm for stage 2. This may be the exact answer I've been searching for with my Colab ECM-GPU session setup, since prior to this I could only run a single stage 2 thread against all the stage 1 curves via GPU. Thoughts?
A GPU is no doubt the most efficient approach for stg1 if you can get one. This program is probably a good option for stg2 but keep in mind:
1) you will have to run more curves, not quite 2x more but probably close to that. GMP-ECM with -v is your friend.
2) avxecm does not yet have the ability to resume curves :)
[QUOTE=bsquared;534240]A GPU is no doubt the most efficient approach for stg1 if you can get one. This program is probably a good option for stg2 but keep in mind:
1) you will have to run more curves, not quite 2x more but probably close to that. GMP-ECM with -v is your friend.
[B]2) avxecm does not yet have the ability to resume curves[/B] :)[/QUOTE] That I had misunderstood. Bummer! I will wait patiently for that ability. . . Thanks for all your work.
[QUOTE=bsquared;534240]A GPU is no doubt the most efficient approach for stg1 if you can get one.[/QUOTE]
Is there a program that supports GPUs for this? :blush: What kind of stage 1 / stage 2 tradeoff is reasonable in a setup like that?
I notice that with large inputs, say 2^8192+1, GMPECM (with GWNUM) is still significantly faster at least on my system (Xeon Gold 6154 @ 3GHz).
For example, with this input and B1=1e6, GMP-ECM curves take about 20-30 seconds each:

[CODE]Using gwnum_ecmStage1(1, 2, 8192, 1, 1000000, 1)
Step 1 took 19146ms
Estimated memory usage: 203.12MB
Initializing tables of differences for F took 6ms
Computing roots of F took 186ms
Building F from its roots took 393ms
Computing 1/F took 176ms
Initializing table of differences for G took 32ms
Computing roots of G took 142ms
Building G from its roots took 388ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 105ms
Reducing G * H mod F took 156ms
Computing roots of G took 142ms
Building G from its roots took 389ms
Computing G * H took 105ms
Reducing G * H mod F took 155ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 106ms
Reducing G * H mod F took 156ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 105ms
Reducing G * H mod F took 155ms
Computing roots of G took 143ms
Building G from its roots took 387ms
Computing G * H took 104ms
Reducing G * H mod F took 155ms
Computing polyeval(F,G) took 911ms
Computing product of all F(g_i) took 64ms
Step 2 took 6291ms
********** Factor found in step 2: 3603109844542291969
Found prime factor of 19 digits: 3603109844542291969
Composite cofactor ((2^8192+1)/2710954639361)/3603109844542291969 has 2436 digits[/CODE]

But, 10+ minutes later, AVX-ECM is still running! ("./avxecm 2^8192+1 4 1000000 4" to be precise).
[QUOTE=CRGreathouse;534262]Is there a program that supports GPUs for this? :blush: What kind of stage 1stage 2 tradeoff is reasonable in a setup like that?[/QUOTE]There is a GPU branch in the latest GMPECM which I have played around with using Colab. I have a description [URL="https://www.mersenneforum.org/showthread.php?t=24887"]here[/URL]. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.

[QUOTE=mathwiz;534282]I notice that with large inputs, say 2^8192+1, GMPECM (with GWNUM) is still significantly faster at least on my system (Xeon Gold 6154 @ 3GHz).
For example, with this input and B1=1e6, GMP-ECM curves take about 20-30 seconds each:

[CODE]Using gwnum_ecmStage1(1, 2, 8192, 1, 1000000, 1)
...
[/CODE]

But, 10+ minutes later, AVX-ECM is still running! ("./avxecm 2^8192+1 4 1000000 4" to be precise).[/QUOTE]

Well...

1) No fair using gwnum
2) You are running avxecm with 4 threads, which is 32 curves, so whatever the time ends up being you will have to divide by 32 for a fair comparison.

But you are correct that avxecm will become less and less competitive as the input size goes up, because I do not have any subquadratic methods implemented. You are also correct that GMP-ECM will benefit significantly on numbers of special form (like the one you picked).

I'm not advocating that avxecm replace gmpecm, far from it! It is just another tool that may be useful in some situations. Long term, it probably makes the most sense to merge avxecm into gmpecm somehow.
[QUOTE=EdH;534286]There is a GPU branch in the latest GMPECM which I have played around with using Colab. I have a description [URL="https://www.mersenneforum.org/showthread.php?t=24887"]here[/URL]. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.[/QUOTE]
The GPU branch, IIRC, is also fixed-size arithmetic like avxecm. But I believe it only has 2 fixed sizes available, maybe 508 and 1016 (I don't recall exactly)? So if you want to run curves on a 509-bit number you are forced to use 1016-bit arithmetic, which of course is not optimal. avxecm is fixed-size arithmetic as well, but in steps of either 128 bits or 208 bits, so you have more efficient choices. So the question of what the most efficient possible path is depends on the size and form of the number.
[QUOTE=EdH;534286]There is a GPU branch in the latest GMPECM which I have played around with using Colab. I have a description [URL="https://www.mersenneforum.org/showthread.php?t=24887"]here[/URL]. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.[/QUOTE]
You need to get ecm to do stage 1 on the GPU and save state to a file. Then split that into as many pieces as you have CPUs and run 1 ecm task against each piece. Eg:
[code]
rm $NAME.save
ecm -gpu -save $NAME.save $B1 1 < $INI | tee -a $LOG
split -n r/2 $NAME.save $NAME.save
wait # If you are in a loop
ecm -resume $NAME.saveaa $B1 | tee -a $LOG.1 | grep [Ff]actor &
ecm -resume $NAME.saveab $B1 | tee -a $LOG.2 | grep [Ff]actor &
# This would be the end of the loop
wait # Until all stage 2's end
[/code]
$NAME is the name of the project (anything you like). $INI is the name of the file with the number in it. $B1 you can probably guess. $LOG is the name of the log files. If you want to do more curves than the GPU does in 1 go then put the code into a loop.

This is a simplified version of my code, which has to allow for the CPU being faster doing stage 2 than the GPU doing stage 1. But I've not tested this. See ecm's README file for details of what the parameters do. And README.gpu.
[QUOTE=EdH;534286]I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway.[/QUOTE]
Please remember that Colab instances give you two *hyperthreaded* cores; only one real core. I don't know anything about this codebase, but HT'ing is useless for mprime. I've already "burnt" my disposable Kaggle account, so I've been hesitant to experiment with CPU usage there with my main / development account, but CPU-only Kaggle instances give you two real cores, hyperthreaded to four. FWIW.
[QUOTE=bsquared;560751]I will run some tests using AVX-ECM. Scaled tests, at B1=7e6, show that first-stage curve throughput is about 2.4x larger than GMP-ECM.
... But I've never run AVX-ECM with B1 anywhere close to that large. My bet is it crashes... but I'll test and see what happens.[/QUOTE]

Replying here to a post in [URL="https://www.mersenneforum.org/showthread.php?t=25941"]this[/URL] thread, to prevent further hijacking of that thread.

I was right, it crashed :smile:. While I was fixing the bug I realized that I was using Montgomery arithmetic for this input, which would be much better off using a special Mersenne reduction. So I implemented that feature, and now AVX-ECM is about 4.5 times faster than GMP-ECM on 2^1277-1.

GMP-ECM:
[CODE]B1      Time, 1 curve, seconds
7e5     5.67
7e6     57.2
7e7     581
...
7e9     59448 (extrapolated)[/CODE]

AVX-ECM:
[CODE]B1      Time, 8 curves, seconds
7e5     10.5
7e6     101.1
7e7     1019
...
7e9     102351 (actual)[/CODE]

102351 / 8 is about 3.5 hours per curve versus about 16.5 hours/curve.

I've checked the latest code into r393 if anyone wants to do some testing. You will need an AVX-512-capable CPU.
[QUOTE=bsquared;561591]I've checked the latest code into r393 if anyone wants to do some testing. You will need an AVX-512-capable CPU.[/QUOTE]
Is [url]https://github.com/bbuhrow/avxecm/[/url] the latest repo? Unless I'm reading wrong, the only commit recently (3 days ago) was just to README.md? 
[QUOTE=mathwiz;561594]Is [url]https://github.com/bbuhrow/avxecm/[/url] the latest repo? Unless I'm reading wrong, the only commit recently (3 days ago) was just to README.md?[/QUOTE]
Ah, good question. Sorry, I've folded it all into yafu, so it is yafu r393 that you'll need. It is probably a good idea to update the avxecm repo as well since that is significantly easier to build. I can do that tonight, hopefully. 
One more reason to buy one of those Phi barebones systems!
Nice work. 
[QUOTE=bsquared;561597]Ah, good question. Sorry, I've folded it all into yafu, so it is yafu r393 that you'll need. It is probably a good idea to update the avxecm repo as well since that is significantly easier to build. I can do that tonight, hopefully.[/QUOTE]
Gotcha. Is there an easy way to build just AVX-ECM from the latest yafu sources? Or, if you could update the avxecm git repo, that'd be great too!

I tried:

[CODE]$ svn checkout https://svn.code.sf.net/p/yafu/code/branches/wip yafucode
$ make -j 8 SKYLAKEX=1[/CODE]

But I get some linker errors like:

[CODE]gcc9: error: /sppdg/scratch/buhrow/projects/gmp_install/gmp6.2.0/lib/libgmp.a: No such file or directory[/CODE]
[QUOTE=VBCurtis;561614]One more reason to buy one of those Phi barebones systems!
Nice work.[/QUOTE]

Thanks! I actually have some data for the KNL, linearly scaled from a run with B1=7e6.

7210 KNL: 2048 simultaneous curves with 256 threads, 354 hours/run, 10.4 minutes/curve
7250 KNL: 2176 simultaneous curves with 272 threads, 353 hours/run, 9.7 minutes/curve

B2 then takes about 3/4 of that to go up to 100*B1, another 264 hours, so that's a lengthy ~25 day runtime per batch. But you get 2000+ curves at the end of it (B1=7G).

[edit]
Running with fewer threads and curves trades off throughput for lower wallclock time, if that is a concern. I don't have data for the exact trades there, yet.

Also, it will save intermediate checkpoint files every 100M primes for each curve. The savefiles can be fed to GMP-ECM for stage 2, if you'd rather run stage 2 that way. Run with saveB1 on the command line to enable saving.

Memory use is pretty minimal for both stages; won't be a problem to stay in MCDRAM.
[QUOTE=mathwiz;561617]Gotcha. Is there an easy way to build just AVXECM from the latest yafu sources? Or, if you could update the avxecm git repo, that'd be great too!
I tried:

[CODE]$ svn checkout https://svn.code.sf.net/p/yafu/code/branches/wip yafucode
$ make -j 8 SKYLAKEX=1[/CODE]

But I get some linker errors like:[/QUOTE]

No easy way to just build part of yafu. Makefiles are not my strong suit. I very much recommend [URL="https://www.mersenneforum.org/showthread.php?t=23087"]EdH's guides[/URL] on building yafu. It requires a small amount of makefile editing to point to your msieve, ecm, and gmp locations.

IMO there are reasons to prefer the yafu version of avxecm. It integrates into factor(), so it's fire-and-forget on any input. It will auto-recognize Mersennes and determine the fastest approach (sometimes it's better to just use Montgomery arithmetic). It uses the NFS algebraic factor extraction framework that's already there, so you could feed it, say, (2^1265-1) / 121441 / 533831 / 136149590957161 / 556590676780973231 and it would pull out all the annoying algebraic factors and go to work on the remaining C223. You can tell it to pretest up to a specific t-level, etc.

avxecm has none of that, but I will update it with the new vector math at least, with maybe a command line switch to turn it on.
Posting here only because the issue seems to be related to avxecm. I'm trying to build the latest revision of YAFU, and no matter what I pass during make, it's failing with errors like:
[CODE]/usr/bin/ld: factor/avxecm/avx_ecm_main.o: in function `vec_ecm_main':
/home/wombat/yafucode/branches/wip/factor/avxecm/avx_ecm_main.c:266: undefined reference to `snfs_init'
/usr/bin/ld: /home/wombat/yafucode/branches/wip/factor/avxecm/avx_ecm_main.c:272: undefined reference to `find_primitive_factor'
/usr/bin/ld: /home/wombat/yafucode/branches/wip/factor/avxecm/avx_ecm_main.c:279: undefined reference to `snfs_clear'
/usr/bin/ld: /home/wombat/yafucode/branches/wip/factor/avxecm/avx_ecm_main.c:464: undefined reference to `vecmulmod52_mersenne'
/usr/bin/ld: /home/wombat/yafucode/branches/wip/factor/avxecm/avx_ecm_main.c:465: undefined reference to `vecsqrmod52_mersenne'
collect2: error: ld returned 1 exit status
make: *** [Makefile:397: all] Error 1[/CODE]

I'm passing in this:

[CODE]make clean && make ECM=1 USE_SSE41=1[/CODE]

This error is recent, though I can't say exactly which revision I was able to successfully build. Any suggestions?
[QUOTE=wombatman;561758]Posting here only because the issue seems to be related to avxecm. I'm trying to build the latest revision of YAFU, and no matter what I pass during make, it's failing with errors like:
[CODE]/usr/bin/ld: factor/avxecm/avx_ecm_main.o: in function `vec_ecm_main':
/home/wombat/yafucode/branches/wip/factor/avxecm/avx_ecm_main.c:266: undefined reference to `snfs_init'
/usr/bin/ld: /home/wombat/yafucode/branches/wip/factor/avxecm/avx_ecm_main.c:272: undefined reference to `find_primitive_factor'
/usr/bin/ld: /home/wombat/yafucode/branches/wip/factor/avxecm/avx_ecm_main.c:279: undefined reference to `snfs_clear'
/usr/bin/ld: /home/wombat/yafucode/branches/wip/factor/avxecm/avx_ecm_main.c:464: undefined reference to `vecmulmod52_mersenne'
/usr/bin/ld: /home/wombat/yafucode/branches/wip/factor/avxecm/avx_ecm_main.c:465: undefined reference to `vecsqrmod52_mersenne'
collect2: error: ld returned 1 exit status
make: *** [Makefile:397: all] Error 1[/CODE]

I'm passing in this:

[CODE]make clean && make ECM=1 USE_SSE41=1[/CODE][/QUOTE]

Looks like you just need to use more makefile options. You need at least NFS=1 to bring in the nfs stuff and SKYLAKEX=1 for the avx512 vector arithmetic stuff. And if you use SKYLAKEX=1 then you also need USE_AVX2=1. My typical build line looks like this:

make NFS=1 USE_AVX2=1 USE_BMI2=1 SKYLAKEX=1

Also, ECM=1 is a msieve thing, not a yafu thing, so you can leave that off.
Another quick usage note...
Normally, yafu will switch over from using an internal GMP-ECM to an external GMP-ECM (pointed to by the ecm_path variable in yafu.ini) at B1=50k. With AVX-ECM enabled, by default that switchover happens at B1=40M instead. If you want to use AVX-ECM at higher B1, put that crossover value into the variable "ext_ecm" in yafu.ini. For example, if I want to use B1=110M with AVX-ECM and not with the external GMP-ECM, set ext_ecm=120000000, or anything larger than 110M.
(waits patiently for the old avxecm git repo to be updated with the latest code... it's so much easier to build :)

[QUOTE=mathwiz;561813](waits patiently for the old avxecm git repo to be updated with the latest code... it's so much easier to build :)[/QUOTE]
Thanks for your patience :blush: Now updated, tested, seems to work. Also checked in a prebuilt version for windows. The mingw compiler doesn't do as good of a job, but something for windows is better than nothing. 
[QUOTE=bsquared;561807]Looks like you just need to use more makefile options. You need at least NFS=1 to bring in the nfs stuff and SKYLAKEX=1 for the avx512 vector arithmetic stuff. And if you use SKYLAKEX=1 then you also need USE_AVX2=1. My typical build line looks like this:
make NFS=1 USE_AVX2=1 USE_BMI2=1 SKYLAKEX=1

Also, ECM=1 is a msieve thing, not a yafu thing, so you can leave that off.[/QUOTE]

Sorry, I should have been clearer. The CPU is an Ivy Bridge-E, so it can't do any of the AVX-ECM stuff (to my understanding).
[QUOTE=wombatman;561828]Sorry, I should have been clearer. The CPU is an Ivy BridgeE, so it can't do any of the AVXECM stuff (to my understanding).[/QUOTE]
Ah, ok. I have only checked that things work *with* avx512, did not stop to consider that things might break without it. Thanks for the report... stay tuned. 
[QUOTE=bsquared;561829]Ah, ok. I have only checked that things work *with* avx512, did not stop to consider that things might break without it. Thanks for the report... stay tuned.[/QUOTE]
No worries! If it's just a simple makefile change, I'm good with doing that. :smile: 
[QUOTE=bsquared;561829]Ah, ok. I have only checked that things work *with* avx512, did not stop to consider that things might break without it. Thanks for the report... stay tuned.[/QUOTE]
Ok, try r394. 
[QUOTE=bsquared;561832]Ok, try r394.[/QUOTE]
Builds successfully and seems to be working well using factor(rsa(300)) to check both ECM and SIQS (my primary usage). Thanks very much for the quick response! 
addmod and submod can also be simplified for Mersenne inputs and this has a measurable improvement... about 3% faster in stage 1 and 7% in stage 2.
Both the yafu and avxecm repos have been updated.

Also, I've seen that avxecm is about 10-20% faster than yafu. I'm not sure why; the only difference that I can see is that yafu has a lot more code around it, and I can't see how the overhead would be that large.

I've done some testing, but if anyone is able to test either package on finding [URL="https://homes.cerias.purdue.edu/~ssw/cun/pmain1020.txt"]known Mersenne factors[/URL], that'd be great.
[QUOTE=bsquared;561976]
Also, I've seen that avxecm is about 10-20% faster than yafu. I'm not sure why, the only difference that I can see is that yafu has a lot more code around it and I can't see how the overhead would be that large.[/QUOTE]

Ok, most of that is explained by different compiler options. avxecm was using -O3 and yafu -O2. With -O3, yafu is much better, but avxecm is still 5% faster or so.

Given all of the above, for 2^1277-1, AVX-ECM is 6.2 times faster than GMP-ECM now. I hope I have GMP-ECM configured optimally for this system for a fair comparison... here is what it shows for a run with -v:

[CODE]echo "2^1277-1" | ~/ecm704linux/ecm -v 7000000
GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asmredc] [ECM]
Tuned for x86_64/k8/params.h
Running on cpu
Input number is 2^1277-1 (385 digits)
Using special division for factor of 2^1277-1
Using B1=7000000, B2=17125579390, polynomial Dickson(12), sigma=0:1088740413510573297
dF=16384, k=6, d=158340, d2=11, i0=34
Expected number of curves to find a factor of n digits:
35       40       45       50       55       60       65       70       75       80
167      1024     7351     60402    558826   5744532  6.5e+07  7.8e+08  1.1e+10  2.8e+11
Step 1 took 60676ms[/CODE]

B2 takes 32 sec vs. 55 sec for 8 curves (@ B2=100*B1).
[QUOTE=bsquared;561986]
B2 takes 32 sec vs. 55 sec for 8 curves (@ B2=100*B1)[/QUOTE]

Which one is which for this comparison? Are both GMP-ECM and avxecm running the same B2, or is the longer time related to GMP-ECM running its standard B2 level?

Your stage 1 might be so fast that it calls for running the lesser stage 2 and not using GMP-ECM at all; I haven't quite decided which way to go for stage 2.

Also, if I'm reading the start of this thread correctly, the AVX-ECM speedup on generic smallish inputs (say, 1000 bits or under) is around double GMP-ECM stage 1 speed, while on special Mersenne numbers it's 5x or more? My use cases are M1277 in the short run, and production work around 900 bits with no special form in the long run. Seems maybe M1277 calls for no GMP-ECM stage 2, while production work may call for using the stage 1 save option and GMP-ECM for stage 2.
[QUOTE=VBCurtis;562006]Which one is which for this comparison? Are both GMPECM and avxecm running the same B2, or is the longer time related to GMPECM running its standard B2 level?
Your stage 1 might be so fast that it calls for running the lesser stage 2 and not using GMPECM at all; I haven't quite decided which way to go for stage 2. Also, if I'm reading the start of this thread correctly, the AVX-ECM speedup on generic smallish inputs (say, 1000 bits or under) is around double GMPECM stage 1 speed, while on special Mersenne numbers it's 5x or more? My use cases are M1277 in the short run, and production work around 900 bits with no special form in the long run. Seems maybe M1277 calls for no GMPECM stage 2, while production work may call for using the stage 1 save option and GMPECM for stage 2.[/QUOTE]

gmpecm ran its larger stage 2 in 32 seconds, while avxecm ran its smaller stage 2 in 55 seconds, or 6.8 sec/curve.

gmpecm B2 was 17125579390
avxecm B2 was 700000000 (24.4 times smaller)

This B2 size disparity will grow the larger B1 gets. Also complicating matters is that gmpecm uses the Brent-Suyama extension for stage 2, which increases the odds of finding a factor slightly. avxecm doesn't do that. So unfortunately it is a complicated tradeoff that has to consider factor-finding probabilities as well as speed.

I don't think there is enough data for avxecm yet to compare the probability of finding a factor per unit time (or per curve) with gmpecm. Not sure I would even know how to do that analysis if I had the data, although maybe I could figure it out. ATH's data set [URL="https://www.mersenneforum.org/showpost.php?p=534182&postcount=31"]here[/URL] did show avxecm finding *more* factors per curve than gmpecm did, but stuff like that will happen for randomized algorithms and small data sets.

After thinking about it a bit, I think we could extract the probabilities from gmpecm -v, while running with -power 1 (no Brent-Suyama) and forcing B2=100*B1.
[QUOTE=bsquared;562007]
After thinking about it a bit, I think we could extract the probabilities from gmpecm -v, while running with -power 1 (no Brent-Suyama) and forcing B2=100*B1.[/QUOTE]

Here's what gmpecm spit out:

[CODE]Input number is (2^1277-1) (385 digits)
Using special division for factor of 2^1277-1
Using B1=10000000000, B2=637998273712822, polynomial Dickson(30), sigma=0:6343133439254916928
dF=4194304, k=3, d=48498450, d2=23, i0=184
Expected number of curves to find a factor of n digits:
35       40       45       50       55       60       65       70       75       80
7        19       52       161      545      1993     7815     32642    144421   673682

Input number is (2^1277-1) (385 digits)
Using special division for factor of 2^1277-1
Using B1=10000000000, B2=1000000000000, polynomial x^1, sigma=0:11834000506611778745
dF=131072, k=6, d=1345890, d2=11, i0=7420
Expected number of curves to find a factor of n digits:
35       40       45       50       55       60       65       70       75       80
16       44       134      442      1578     6046     24696    106998   489087   2349846[/CODE]

The 75-digit number-of-curves ratio is 3.38, which is essentially a multiplier in favor of gmpecm curves. So if avxecm is 6.2 times higher throughput, but 3.38 times less likely to find a factor at t75, then the net gain for avxecm is about 1.83 for 2^1277-1.

Similar calculations would have to be applied to generic inputs at particular B1's, I think.

gmpecm's advantage is its vastly superior stage 2. So the avxecm stage 1 followed by gmpecm stage 2 recipe is probably a winner in a lot of cases. But for 2^1277-1 it looks like pure avxecm wins. Thoughts?
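(Spelling out the arithmetic: the t75 curve counts above give 489087 / 144421 ~= 3.4, and the net throughput gain is 6.2 / 3.38 ~= 1.83.)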
When GMPECM is faster for Stage 2 with a higher B2, there is only advantage from using it. Save time, get more work - what's not to like?
The difficult tradeoff would be on larger numbers, when the avxecm stage 2 is likely faster than GMPECM but with a smaller B2. That leads to some need for study.
[QUOTE=VBCurtis;562018]When GMPECM is faster for Stage 2 with a higher B2, there is only advantage from using it. Save time, get more work - what's not to like?
[/QUOTE]

If it were comparing curve-for-curve, sure, but it's not that straightforward. The avxecm stage 2 may be slower and use a smaller B2, but it is doing 8 curves at once in that time. Expanding on the example from before, with B1=7M on 2^1277-1:

avxecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avxecm stage 2: 55 sec for 8 curves, 6.87 sec/curve
gmpecm stage 1: 60 sec for 1 curve
gmpecm stage 2: 32 sec for 1 curve

Now, each avxecm stage 2 curve is 2.8 times less likely to find a factor, but if you had 64 seconds to spend doing something, would you rather run 2 gmpecm stage 2 curves or 8 avxecm stage 2 curves? Even with each avxecm curve being 2.8 times less likely to find a factor, you'd be better off.

[QUOTE=VBCurtis;562018]
The difficult tradeoff would be on larger numbers, when the avxecm stage 2 is likely faster than GMPECM but with a smaller B2. That leads to some need for study.
[/QUOTE]

The same math as above applies. I think you are better off running pure avxecm until time/curve * probability multiplier is greater than gmpecm's stage 2 runtime.

It gets interesting with generic inputs, when the throughput advantage of avxecm is not as great. Then, I think it makes sense to run avxecm stage 1 followed by 8 gmpecm stage 2's. Then you get all of the probability advantage of gmpecm stage 2 but with avxecm's higher stage 1 throughput. I am putting together some data on this case.
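The "64 seconds of stage 2" argument above boils down to an expected-value comparison. A minimal sketch, using only the 2.8x figure and the timings quoted in the post (illustration only):

[CODE]
#include <stdio.h>

int main(void)
{
    double p = 1.0;                        /* per-curve success chance of a gmpecm
                                              stage 2 curve, in arbitrary units    */
    double gmp_expected = 2.0 * p;         /* ~64 sec buys 2 gmpecm stage 2 curves */
    double avx_expected = 8.0 * (p / 2.8); /* or one batch of 8 avxecm curves,
                                              each 2.8x less likely to succeed     */

    printf("expected factors: gmpecm %.2f vs avxecm %.2f\n",
           gmp_expected, avx_expected);
    return 0;
}
[/CODE]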
Here are a couple other things to keep in mind:
1) For lower-end Skylake processors the speedups are not as dramatic, with or without the special Mersenne math. I ran on a 7800X and only saw about a 2.1x speedup for 2^1277-1. [U]More testing/benchmarking is needed for a variety of avx512-capable processors[/U]. The good news is that this situation will only improve for avxecm as time goes on. I plan to implement enhancements as soon as I can get my hands on an Ice Lake CPU or whatever CPU supports AVX512-IFMA.

2) avxecm uses fixed-time arithmetic in steps of 208 bits, so a curve at, say, 416 bits takes just as long as a curve at 623 bits. gmpecm, on the other hand, will adjust the size of its arithmetic in 64-bit steps. This could mean fine-tuning any crossover math that is performed for given input sizes (see the sketch below for how the 208-bit step is chosen).
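For anyone wondering how the 208-bit step plays out, here is a size-selection sketch consistent with the "Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941" line that avxecm prints; it is not necessarily the exact code in the repo:

[CODE]
#include <stdio.h>

int main(void)
{
    int bits    = 941;                    /* size of the input in bits         */
    int nblocks = (bits + 207) / 208;     /* round up to a whole 208-bit block */
    int maxbits = 208 * nblocks;          /* 1040 for a 941-bit input          */
    int nwords  = maxbits / 52;           /* 20 words of 52 bits each          */

    printf("MAXBITS = %d, NWORDS = %d, NBLOCKS = %d\n", maxbits, nwords, nblocks);
    return 0;
}
[/CODE]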
[QUOTE=bsquared;562023]Expanding on the example from before, with B1=7M on 2^1277-1:
avxecm stage 1: 78 sec for 8 curves, 9.75 sec/curve
avxecm stage 2: 55 sec for 8 curves, 6.87 sec/curve
gmpecm stage 1: 60 sec for 1 curve
gmpecm stage 2: 32 sec for 1 curve[/QUOTE]

Thank you for rewarding my "I'm confused so I'll state the obvious and see where I'm wrong"! This makes much sense, and I'm all clear now.
I updated the first post with a table of timings and some analysis for a generic 900-bit input, and a couple of different Mersenne inputs.
It looks like a hybrid plan is best for most B1 levels. The relative speeds of avxecm and gmpecm should be checked on the particular cpu beforehand, and crossovers adjusted. 
Now processing 2^n+1 similarly to 2^n-1 (with special modulo).
Here verifying some of the factors of 2^941+1:

[CODE]./avxecm "(2^941+1)/3/1738969" 8 3000000
starting process 60820
commencing parallel ecm on 3562975409753765647916625929286022733708067351160837396219090783379741715638222929662078302565476897014358133541083282322395686863751130956979946597434060607614269433132169774980382077527684835263114705184392392863404860166373197969977946080763606872357617994374214244244159779
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 919 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
Using special Mersenne mod for factor of: 2^941+1
Initialization took 0.0899 seconds.
Cached 5761455 primes in range [2 : 99999989]
commencing curves 0-7 of 8
Building curves took 0.0001 seconds.
commencing Stage 1 @ prime 2
accumulating prime 2943173
Stage 1 completed at prime 2999999 with 6040073 point-adds and 565931 point-doubles
Stage 1 took 19.2999 seconds
found factor 5062155107501579 in stage 1 in thread 0, vec position 0, with sigma = 424412553419277870
found factor 5062155107501579 in stage 1 in thread 0, vec position 5, with sigma = 8913334498632421225
commencing stage 2 at p=3000017, A=3000690
w = 1155, R = 480, L = 16, umax = 9240, amin = 1299
found 5317482 primes in range [99999989 : 199999963]
commencing stage 2 at p=100000007, A=99965250
w = 1155, R = 480, L = 16, umax = 9240, amin = 43275
found 5173389 primes in range [199999963 : 299999957]
commencing stage 2 at p=199999991, A=199979010
w = 1155, R = 480, L = 16, umax = 9240, amin = 86571
found 53 primes in range [299999957 : 300000997]
commencing stage 2 at p=299999977, A=299974290
w = 1155, R = 480, L = 16, umax = 9240, amin = 129859
Stage 2 took 14.7954 seconds
performed 9087802 pair-multiplies for 16035509 primes in stage 2
performed 266472 point-additions and 57 point-doubles in stage 2
found factor 5062155107501579 in stage 2 in thread 0, vec position 0, with sigma = 424412553419277870
found factor 5062155107501579 in stage 2 in thread 0, vec position 1, with sigma = 15371561253067696485
found factor 5062155107501579 in stage 2 in thread 0, vec position 2, with sigma = 13475164758719783696
found factor 1053336016261649316809 in stage 2 in thread 0, vec position 4, with sigma = 333700535869040322
found factor 5062155107501579 in stage 2 in thread 0, vec position 5, with sigma = 8913334498632421225
Process took 34.3731 seconds.
[/CODE]

Also, for now I'm only pushing updates to the avxecm repo. When things settle out a bit I'll revise yafu.
And now processing 2^n-c similarly to 2^n-1 (with special modulo), where "c" is small (single-limb). GMPECM doesn't do this, as far as I can tell.
Example:

[CODE]./avxecm52iccstatic "(2^941 - 5987059)/211/18413" 8 250000 1
starting process 97314
commencing parallel ecm on 4784305585655994717562720658383425526844379349934463348761460837207088185558434958330904147209498950842345840906605538877380399014736734618876417744155088454218322338609736739235487631453648349849153543969782570442284070067941772311074196433531076214607947202503691346829979451
ECM has been configured with DIGITBITS = 52, VECLEN = 8, GMP_LIMB_BITS = 64
Choosing MAXBITS = 1040, NWORDS = 20, NBLOCKS = 5 based on input size 941
cached 3001134 primes < 49999991
Input has 920 bits, using 1 threads (8 curves/thread)
Processing in batches of 100000000 primes
[U]Using special pseudo-Mersenne mod for factor of: 2^941-5987059[/U]
Initialization took 0.1466 seconds.
Cached 1565994 primes in range [2 : 25000991]
commencing curves 0-7 of 8
Building curves took 0.0002 seconds.
commencing Stage 1 @ prime 2
accumulating prime 180511
Stage 1 completed at prime 249989 with 496674 point-adds and 51127 point-doubles
Stage 1 took 1.8810 seconds
commencing stage 2 at p=250007, A=249480
w = 1155, R = 480, L = 16, umax = 9240, amin = 108
Stage 2 took 1.4885 seconds
performed 854762 pair-multiplies for 1543883 primes in stage 2
performed 30752 point-additions and 49 point-doubles in stage 2
found factor 4655226350979187 in stage 2 in thread 0, vec position 3, with sigma = 9539569019480891334
found factor 4655226350979187 in stage 2 in thread 0, vec position 7, with sigma = 17882739001568946650
Process took 3.7186 seconds.
[/CODE]

And to compare:

[CODE]echo "(2^941-5987059)/211/18413" | ../ecm704linux/ecm -v 250000
GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asm-redc] [ECM]
Tuned for x86_64/k8/params.h
Running on cpu
Input number is (2^941-5987059)/211/18413 (277 digits)
[U]Using MODMULN [mulredc:1, sqrredc:1][/U]
Computing batch product (of 360843 bits) of primes up to B1=250000 took 17ms
Using B1=250000, B2=128992510, polynomial Dickson(3), sigma=1:1245750705
dF=2048, k=3, d=19110, d2=11, i0=3
Expected number of curves to find a factor of n digits:
35      40      45      50      55      60      65      70      75      80
6022    87544   1534319 3.1e+07 7.5e+08 2e+10   6.7e+13 1e+19   1.3e+24 Inf
Step 1 took 1673ms
[/CODE]

Pretty substantial stage 1 throughput improvement, for inputs of this form.
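The special modulo here is the usual pseudo-Mersenne fold: since 2^n = c (mod 2^n - c), the high part of a product folds back in with one small multiply. A scalar GMP sketch of the identity (the avxecm code is vectorized and works on 52-bit lanes, so this is only an illustration):

[CODE]
#include <gmp.h>

/* reduce x modulo N = 2^n - c, where c is a small positive constant */
void reduce_2n_minus_c(mpz_t x, unsigned long n, unsigned long c, const mpz_t N)
{
    mpz_t hi;
    mpz_init(hi);
    while (mpz_sizeinbase(x, 2) > n) {
        mpz_tdiv_q_2exp(hi, x, n);   /* hi = floor(x / 2^n)                */
        mpz_tdiv_r_2exp(x, x, n);    /* x  = x mod 2^n                     */
        mpz_addmul_ui(x, hi, c);     /* x += c * hi, since 2^n = c (mod N) */
    }
    if (mpz_cmp(x, N) >= 0)
        mpz_sub(x, x, N);
    mpz_clear(hi);
}
[/CODE]

The 2^n+1 case works the same way, with the high part subtracted instead of added.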
There have been a few updates to the avxecm repo:
1) Added algebraic factor removal for Mersenne inputs.
2) Added checkpoint saving every 1e8 primes in stage 1.
3) Implemented batch inversion in stage 2; it is now almost twice as fast (a generic sketch of the trick follows below).

Here's an example of that last item compared to previously:

[QUOTE=bsquared;562023]
Expanding on the example from before, with B1=7M on 2^1277-1:

avxecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1 curve
avxecm stage 2: 55 sec for 8 curves, 6.8 sec/stg2 curve
[/QUOTE]

Now:

[CODE]
B1=7M on 2^1277-1:
avxecm stage 1: 78 sec for 8 curves, 9.7 sec/stg1 curve
avxecm stage 2: 32 sec for 8 curves, 4.0 sec/stg2 curve[/CODE]

At this point I'm not sure whether it's better to keep the faster stage 2 time or run to higher B2 bounds, keeping timings similar to previously. I think Bob's paper says to equalize the stage 1 and stage 2 runtimes, so the best answer is probably the latter. It might be moot if the best answer is still to use a hybrid plan with gmpecm stage 2; however, this might change the crossover point (from pure avxecm to hybrid).
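For item 3, the batch inversion is presumably Montgomery's simultaneous-inversion trick: one modular inversion plus about three multiplications per element. A generic GMP sketch of the idea (the actual avxecm stage 2 is vectorized and may organize it differently):

[CODE]
#include <stdlib.h>
#include <gmp.h>

/* invert x[0..k-1] modulo N in place, using a single mpz_invert call */
void batch_invert(mpz_t *x, int k, const mpz_t N)
{
    mpz_t *pre = malloc(k * sizeof(mpz_t));
    mpz_t inv, tmp;
    int i;

    mpz_inits(inv, tmp, NULL);

    /* prefix products: pre[i] = x[0]*x[1]*...*x[i] mod N */
    mpz_init_set(pre[0], x[0]);
    for (i = 1; i < k; i++) {
        mpz_init(pre[i]);
        mpz_mul(pre[i], pre[i - 1], x[i]);
        mpz_mod(pre[i], pre[i], N);
    }

    /* one inversion of the full product */
    mpz_invert(inv, pre[k - 1], N);

    /* peel off one inverse at a time from the back */
    for (i = k - 1; i > 0; i--) {
        mpz_mul(tmp, inv, pre[i - 1]);  /* x[i]^-1 = inv * (x[0]*...*x[i-1])  */
        mpz_mod(tmp, tmp, N);
        mpz_mul(inv, inv, x[i]);        /* drop x[i] from the running inverse */
        mpz_mod(inv, inv, N);
        mpz_set(x[i], tmp);
    }
    mpz_set(x[0], inv);

    for (i = 0; i < k; i++)
        mpz_clear(pre[i]);
    free(pre);
    mpz_clears(inv, tmp, NULL);
}
[/CODE]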
More updates:
1) Fixed some bugs in the fast stage 2, after finally bearing down and understanding all of the nuances of Montgomery's PAIR algorithm correctly. I had it mostly right before, but not 100%, which caused some factors to be missed. Thank <deity> for a gold reference implementation in gmpecm. Also, the bug fixes actually improved the speed a bit.

2) Got a shiny new laptop with a Tiger Lake i7-1165G7 in order to get AVX512-IFMA up and running. It helped more than I thought it would: going from my double-precision floating-point 52-bit integer multipliers to IFMA speeds up all of the vector math by about 40% (a small illustration of the instructions follows below). In what has become my de facto quick benchmark, this cpu gets:

[CODE]
B1=7M on 2^1277-1:
avxecm stage 1: 47.3 sec for 8 curves, 5.9 sec/stg1 curve
avxecm stage 2: 18.3 sec for 8 curves, 2.3 sec/stg2 curve
[/CODE]

The next goals are usability-focused: get a proper set of command line flags in there, some options to control memory use and when to exit, improved logging, and reading in savefiles.
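For item 2, the relevant AVX512-IFMA instructions are the 52-bit multiply-adds. This fragment just shows the two intrinsics that replace the double-precision FMA trick for a 52x52->104-bit product; it is not avxecm's actual code (compile with -mavx512f -mavx512ifma):

[CODE]
#include <immintrin.h>

/* accumulate b*c into a 104-bit (lo, hi) accumulator, eight 52-bit lanes at once;
   carries between the 52-bit halves still have to be propagated separately */
static inline void madd52(__m512i *lo, __m512i *hi, __m512i b, __m512i c)
{
    *lo = _mm512_madd52lo_epu64(*lo, b, c);  /* lo += low  52 bits of b*c */
    *hi = _mm512_madd52hi_epu64(*hi, b, c);  /* hi += high 52 bits of b*c */
}
[/CODE]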
Have you played with improving PRAC?
I recently did and gained one or two percent. This may be because prime95 can store data FFTed, which may change the costing functions: a double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain. Glad to share the C code for the changes.
[QUOTE=Prime95;564720]Have you played with improving PRAC?
I recently did and gained one or two percent. This may be because prime95 can store data FFTed, which may change the costing functions: a double costs 10 transforms, an add costs 12 transforms. Also, prime95 is usually dealing with larger numbers and can afford to spend a little more time searching for the best chain. Glad to share the C code for the changes.[/QUOTE]

I tried to adjust the balance of time spent optimizing chains, without notable success. I'd like to see your code - thanks for sharing.

I'm also looking into gmpecm's different parameterizations. "param 1" seems to be the default for me in most cases. It looks like they are using small parameters so that the multiply by (A+2)/4 is a single limb (saving a multiply when duplicating), and then not using PRAC at all, instead using Montgomery's ladder, which has a larger percentage of doublings, on a pre-multiplied set of B1 primes. Eliminating a multiply could save 10% or so. What I don't understand yet is how it can be single-limb if they are using Montgomery multiplies (REDC).
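For context, here is where the (A+2)/4 constant shows up in the Montgomery-coordinate doubling; this is written with plain GMP just to show the single multiply that param 1 turns into a one-limb multiply. It is a sketch, not gmpecm's or avxecm's actual code:

[CODE]
#include <gmp.h>

/* double the point (X:Z) on a Montgomery curve; d = (A+2)/4 mod N */
void mont_dup(mpz_t X, mpz_t Z, const mpz_t d, const mpz_t N)
{
    mpz_t s, t, u;
    mpz_inits(s, t, u, NULL);

    mpz_add(s, X, Z); mpz_mul(s, s, s); mpz_mod(s, s, N);  /* s = (X+Z)^2       */
    mpz_sub(t, X, Z); mpz_mul(t, t, t); mpz_mod(t, t, N);  /* t = (X-Z)^2       */
    mpz_sub(u, s, t); mpz_mod(u, u, N);                    /* u = s - t = 4*X*Z */

    mpz_mul(X, s, t); mpz_mod(X, X, N);                    /* X2 = s*t          */
    mpz_mul(Z, d, u);                      /* the multiply by d = (A+2)/4:
                                              cheap if d fits in a single limb  */
    mpz_add(Z, Z, t); mpz_mod(Z, Z, N);
    mpz_mul(Z, Z, u); mpz_mod(Z, Z, N);                    /* Z2 = u*(t + d*u)  */

    mpz_clears(s, t, u, NULL);
}
[/CODE]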
[QUOTE=Prime95;564720]Have you played with improving PRAC?
I recently did and gained one or two percent. [/QUOTE]

The gain is not quite as large for me on the smaller inputs I am dealing with, but it is still about 1%. It also helped slightly to define the cost functions to account for a square being faster than a multiply; doing that increases the total dups/adds ratio very slightly in stage 1.

In my tinyecm code (< 64 bits), where there are just a few fixed and very small B1 values, I found that in a few cases the cost of PRAC(p1*p2) is cheaper than PRAC(p1) + PRAC(p2). It isn't practical to test combinations of primes like this in real time for arbitrary (and larger) B1. But I wonder if it is worth compiling a list of prime pairs that are cheaper when combined below bounds like 1M, 3M, 11M, etc.
[QUOTE=bsquared;564835]
In my tinyecm code (< 64 bits), where there are just a few fixed and very small B1 values, I found that in a few cases the cost of PRAC(p1*p2) is cheaper than PRAC(p1) + PRAC(p2). It isn't practical to test combinations of primes like this in real time for arbitrary (and larger) B1. But I wonder if it is worth compiling a list of prime pairs that are cheaper when combined below bounds like 1M, 3M, 11M, etc.[/QUOTE]

Ran a quick test attempting to pair primes < 1M. There are ~5k pairs that are cheaper this way, sometimes dramatically so[SUP]1[/SUP]. But there are ~78k primes below this bound, so the total cost savings is a fraction of a percent.

edit: never mind - this is not valid. You have to check that the chain actually works! The conclusion is unchanged - finding cheaper pairs doesn't seem worthwhile.

[CODE]prime 9769 * 9781 cost = 103.5
prime 9769 cost = 109.0
prime 9781 cost = 103.5
(sum 212.5, savings 109.0 @ prac v=0.620181980807)
[/CODE]
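For completeness, the pair search described above is just a quadratic loop over the primes below the bound. prac_cost() here is a hypothetical stand-in for whatever cost the PRAC chain generator reports, and, per the edit above, a cheaper combined cost only matters if a chain reaching p1*p2 actually exists:

[CODE]
#include <stdio.h>

double prac_cost(unsigned long k);  /* hypothetical: best Lucas-chain cost found for k */

void find_cheap_pairs(const unsigned long *primes, int np)
{
    for (int i = 0; i < np; i++) {
        for (int j = i + 1; j < np; j++) {
            unsigned long p1 = primes[i], p2 = primes[j];
            double separate = prac_cost(p1) + prac_cost(p2);
            double combined = prac_cost(p1 * p2);   /* assumes p1*p2 fits in a word */

            if (combined < separate)
                printf("%lu * %lu: %.1f vs %.1f\n", p1, p2, combined, separate);
        }
    }
}
[/CODE]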
If I do a "git pull" on the latest code, and then "make COMPILER=gcc SKYLAKEX=1", I'm getting a bunch of linker errors:
[code]gcc -g -O3 -mavx -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I../../gmp-6.2.1 -c -o vec_common.o vec_common.c
gcc -g -O3 -mavx -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I../../gmp-6.2.1 -c -o calc.o calc.c
gcc -g -O3 -mavx -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I../../gmp-6.2.1 -c -o queue.o queue.c
rm -f libavxecm.a
ar r libavxecm.a eratosthenes/presieve.o eratosthenes/count.o eratosthenes/offsets.o eratosthenes/primes.o eratosthenes/roots.o eratosthenes/linesieve.o eratosthenes/soe.o eratosthenes/tiny.o eratosthenes/worker.o eratosthenes/soe_util.o eratosthenes/wrapper.o threadpool.o main.o ecm.o util.o vecarith.o vecarith52.o vec_common.o calc.o queue.o
ar: creating libavxecm.a
ranlib libavxecm.a
gcc -g -O3 -mavx -march=skylake-avx512 -DSKYLAKEX -Wall -I. -I../../gmp-6.2.1 eratosthenes/presieve.o eratosthenes/count.o eratosthenes/offsets.o eratosthenes/primes.o eratosthenes/roots.o eratosthenes/linesieve.o eratosthenes/soe.o eratosthenes/tiny.o eratosthenes/worker.o eratosthenes/soe_util.o eratosthenes/wrapper.o threadpool.o main.o ecm.o util.o vecarith.o vecarith52.o vec_common.o calc.o queue.o -o avxecm libavxecm.a -L../../gmp-6.2.1/.libs -lm -lgmp -lpthread
/usr/bin/ld: eratosthenes/count.o:/tmp/avxecm/eratosthenes/soe.h:323: multiple definition of `nmasks'; eratosthenes/presieve.o:/tmp/avxecm/eratosthenes/soe.h:323: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avxecm/eratosthenes/soe.h:341: multiple definition of `szSOEp'; eratosthenes/presieve.o:/tmp/avxecm/eratosthenes/soe.h:341: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avxecm/eratosthenes/soe.h:340: multiple definition of `spSOEprimes'; eratosthenes/presieve.o:/tmp/avxecm/eratosthenes/soe.h:340: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avxecm/eratosthenes/soe.h:337: multiple definition of `P_MAX'; eratosthenes/presieve.o:/tmp/avxecm/eratosthenes/soe.h:337: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avxecm/eratosthenes/soe.h:336: multiple definition of `P_MIN'; eratosthenes/presieve.o:/tmp/avxecm/eratosthenes/soe.h:336: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avxecm/eratosthenes/soe.h:335: multiple definition of `NUM_P'; eratosthenes/presieve.o:/tmp/avxecm/eratosthenes/soe.h:335: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avxecm/eratosthenes/soe.h:334: multiple definition of `PRIMES'; eratosthenes/presieve.o:/tmp/avxecm/eratosthenes/soe.h:334: first defined here
/usr/bin/ld: eratosthenes/count.o:/tmp/avxecm/eratosthenes/soe.h:330: multiple definition of `SOE_VFLAG'; eratosthenes/presieve.o:/tmp/avxecm/eratosthenes/soe.h:330: first defined here[/code]
[QUOTE=mathwiz;565667]If I do a "git pull" on the latest code, and then "make COMPILER=gcc SKYLAKEX=1", I'm getting a bunch of linker errors:
[/QUOTE] Those are very similar errors to what is happening [URL="https://www.mersenneforum.org/showthread.php?t=23087"]here[/URL], with someone trying to compile YAFU using gcc 10.2.0. Are you using the same version of gcc, by any chance? As with yafu, I see no problems with gcc 7.3.0 or icc 18.0.3. [edit] I know that defining globals in header files is frowned upon, but I didn't know it was an error as long as the header is only included once. That is why they are protected with the #ifndef X #define X macros. Could that have changed with the latest gcc? Maybe putting a std=c11 in the make would help? 
Just to note, gcc 8.3.0 (Debian) and 9.3.0 (Ubuntu) worked fine for YAFU (wip), last time I upgraded all my machines.
Note: I will probably move this to the other thread later, since it's not really about AVXECM. 
[QUOTE=EdH;565675]Just to note, gcc 8.3.0 (Debian) and 9.3.0 (Ubuntu) worked fine for YAFU (wip), last time I upgraded all my machines.
Note: I will probably move this to the other thread later, since it's not really about AVXECM.[/QUOTE] Indeed, using gcc8 instead of my default (gcc10) appears to work fine. 
[QUOTE=mathwiz;565704]Indeed, using gcc8 instead of my default (gcc10) appears to work fine.[/QUOTE]
Good to know, thanks. I've asked a question about this over in the programming thread; maybe more people read that and someone might know what's going on. I've browsed the gcc changelog but nothing jumps out at me as a possible cause of this behavior. No time at the moment to dig deeper... hoping that someone smarter than me can bail me out here. 