20191230, 22:21  #1 
"Ben"
Feb 2007
D4B_{16} Posts 
AVXECM
Just created a repository on github with new ECM code: https://github.com/bbuhrow/avxecm
It computes curves in parallel using AVX512 vector arithmetic, and is also multithreaded. I've measured it to have about 2x the throughput of GMPECM using 1 thread on a Xeon Gold 5122 CPU. It does both stage 1 and stage 2 in parallel, but stage 2 is just the standard continuation with pairing (by default, B2=100xB1). It will output a savefile after stage 1 that will work with GMPECM stage 2, if desired. e.g.: ecm resume save_b1.txt 1000000 < input.txt compile on linux using either icc or gcc7.3.0 or above (I've only tested icc and gcc7.3.0). e.g.: make MAXBITS=416 COMPILER=gcc730 SKYLAKEX=1 the MAXBITS parameter must be a multiple of 208. When this is the case, 8 curves are performed in parallel per thread. Alternatively, if DIGITBITS=32 is specified during make, then MAXBITS must be a mutiple of 128 and 16 curves are performed in parallel per thread. The 52bit version is generally faster (and the default) but for some sizes a 32bit version can have higher throughput (because of the 208bit jumps between sizes). If anyone wants to test it out you will of course need an avx512 capable computer. Let me know if there are any problems. Happy factoring! ============================================================ [Edit] Sticky comparison tables generic 900bit input Xeon Gold 5122 CPU, 1 thread, icc compiler Notes: 1) "weight" is obtained by determining the number of curves for the given tlevel using power 1 and B2=100B1 in gmpecm, and then also using default gmpecm settings, and taking the ratio. 2) hybrid = (avxecm stage 1 time + 8*gmpecm stage 2 time) / 8 Code:
generic 900bit avxecm weighted tlevel B1 stg 1 stg2 sum sec/curve weight sec/curve 35 1.00E+06 13 9.2 22.2 2.78 1.85 5.14 40 3.00E+06 39.3 26.3 65.6 8.20 2.07 17.01 45 1.00E+07 131.1 81.1 212.2 26.53 2.35 62.33 50 4.30E+07 567.1 334.3 901.4 112.68 2.46 277.42 55 1.10E+08 1448.3 822.6 2270.9 283.86 2.63 747.74 60 2.60E+08 3423 1874 5303 662.98 2.87 1902.6 gmpecm704 hybrid tlevel B1 stg 1 stg2 sum sec/curve sec/curve 35 1.00E+06 6.59 3.45 10.04 10.04 5.08 40 3.00E+06 19.8 7.77 27.57 27.57 12.68 45 1.00E+07 65.9 25.4 91.3 91.30 41.79 50 4.30E+07 287.5 78.8 366.3 366.30 149.69 55 1.10E+08 738.3 196.2 934.5 934.50 377.24 60 2.60E+08 1750 412.5 2162.5 2160.5 841.1 gmp/weightedavx gmp/hybrid avx/hybrid tlevel B1 sec/curve ratio sec/curve ratio ratio 35 1.00E+06 1.95 1.98 1.01 40 3.00E+06 1.62 2.17 1.34 45 1.00E+07 1.46 2.18 1.49 50 4.30E+07 1.32 2.45 1.85 55 1.10E+08 1.25 2.48 1.98 60 2.60E+08 1.14 2.57 2.26 Below B1=1M, probably better to just use avxecm. At B1=1M and above, hybrid is best. Somewhere above 260M, pure GMPECM likely beats pure AVXECM, due to increasing weight, but hybrid is still best. For Mersenne inputs with primitive size between 832 and 1039 bits: Code:
gmp/weightedavx gmp/hybrid avx/hybrid tlevel B1 sec/curve ratio sec/curve ratio ratio 35 1.00E+06 3.72 2.26 0.61 40 3.00E+06 3.06 2.56 0.84 45 1.00E+07 2.79 2.55 0.91 50 4.30E+07 2.49 2.97 1.19 55 1.10E+08 2.31 3.11 1.35 60 2.60E+08 2.09 3.29 1.57 For Mersenne inputs with primitive size between 1248 and 1456 bits: Code:
gmp/weightedavx gmp/hybrid avx/hybrid tlevel B1 sec/curve ratio sec/curve ratio ratio 35 1.00E+06 3.12 2.06 0.66 40 3.00E+06 2.59 2.32 0.89 45 1.00E+07 2.30 2.32 1.01 50 4.30E+07 2.08 2.67 1.28 55 1.10E+08 1.92 2.79 1.46 60 2.60E+08 1.74 2.95 1.69 65 8.50E+08 1.65 3.17 1.92 70 2.90E+09 1.52 3.40 2.23 75 7.60E+09 1.44 3.59 2.50 Last fiddled with by bsquared on 20201103 at 19:49 Reason: adding tables 
20191230, 23:03  #2 
Mar 2019
149 Posts 
Hmmmm. Just tried it out on an AVX512 capable machine, with B1=1e6 and input M433 (http://factordb.com/index.php?id=1000000000000000433):
Code:
./avxecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 100 1000000 commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 ECM has been configured with MAXBITS = 416, NWORDS = 8, VECLEN = 8 starting process 171275 cached 3001134 primes < 49999991 Input has 433 bits, using 1 threads (100 curves/thread) Processing in batches of 100000000 primes Initialization took 0.0743 seconds. Cached 5761455 primes in range [2 : 99999989] commencing curves 07 of 100 Building curves took 0.0005 seconds. commencing Stage 1 @ prime 2 accumulating prime 997693 Stage 1 completed at prime 999983 with 2001925 pointadds and 194143 pointdoubles Stage 1 took 4.1971 seconds found factor 7573885960626120430675586583017049046287571561744815615 in stage 1 in thread 0, vec position 0, with sigma = 9805897911615405889 found factor 125483996795288390556263452518000395100487722298175677228160127607935 in stage 1 in thread 0, vec position 1, with sigma = 5941742753921851068 found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 2, with sigma = 17488052527796218971 found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 3, with sigma = 162542135677334094 found factor 9704134944669326942243922594738986945861288047284534894845604139673055 in stage 1 in thread 0, vec position 4, with sigma = 9639283928122886917 found factor 19065414140013908092677804414156626466105803295 in stage 1 in thread 0, vec position 5, with sigma = 14437779671575784496 found factor 455237245388167216759545663443996983027266740639813141535 in stage 1 in thread 0, vec position 6, with sigma = 5479294375532123583 found factor 161449908068885681487281375790285735990727019303070577509707381288895 in stage 1 in thread 0, vec position 7, with sigma = 16438922582860911842 commencing stage 2 at p=1000003, A=1000230 w = 1155, R = 480, L = 16, umax = 9240, amin = 433 Stage 2 took 3.1940 seconds performed 3188920 pairmultiplies for 5682957 primes in stage 2 performed 95044 pointadditions and 53 pointdoubles in stage 2 something failed: tid = 0, vec = 0 has zero result something failed: tid = 0, vec = 1 has zero result something failed: tid = 0, vec = 2 has zero result something failed: tid = 0, vec = 3 has zero result something failed: tid = 0, vec = 4 has zero result something failed: tid = 0, vec = 5 has zero result something failed: tid = 0, vec = 6 has zero result something failed: tid = 0, vec = 7 has zero result Process took 7.4682 seconds. Compare to: Code:
4$ echo "22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591"  ./ecm c 10 1e6 GMPECM 7.0.4 [configured with GMP 6.1.2, GWNUM 29.8, enableasmredc] [ECM] Due to incompatible licenses, this binary file must not be distributed. Input number is 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 (131 digits) Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:14118429249538045316 Step 1 took 2892ms Step 2 took 2391ms Run 2 out of 10: Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:15640357536401406010 ********** Factor found in step 1: 22086765417396827057 Found prime factor of 20 digits: 22086765417396827057 Composite cofactor 1004282751855334395242501879256940939756665292745707836441839543208470249351377144560254198205767424080811494063 has 112 digits 
20191230, 23:06  #3 
"Ben"
Feb 2007
41×83 Posts 
It looks like your input is larger than 416 bits. So you'd have to rebuild with MAXBITS=624. Actually, this is a case where MAXBITS=512 DIGITBITS=32 will give you slightly higher throughput (because 512 is quite a bit smaller than 624).

20191231, 14:45  #4 
"Ben"
Feb 2007
3403_{10} Posts 
New version that chooses MAXBITS automatically based on the input size. You still have to specify DIGITBITS during build as this impacts the underlying word size, which can't be easily changed on the fly. Default is 52bit, resulting in steps of 208 bits. It is a fixedsize arithmetic within these steps, so curves at 417 bits will take the same amount of time as curves with 623 bits. This is similar to GPUECM, although with more finegrained resolution.
The 32bit build path gives steps of 128 bits. So for this input it will choose a 512bit curve size which is faster than a 624bit curve with 52bit words. Code:
./avxecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 16 1000000 1 commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 starting process 226272 ECM has been configured with DIGITBITS = 32, VECLEN = 16 Choosing MAXBITS = 512, NWORDS = 16, NBLOCKS = 4 based on input size 433 cached 3001134 primes < 49999991 Input has 433 bits, using 1 threads (16 curves/thread) Processing in batches of 100000000 primes Initialization took 0.0895 seconds. Cached 5761455 primes in range [2 : 99999989] commencing curves 015 of 16 Building curves took 0.0006 seconds. commencing Stage 1 @ prime 2 accumulating prime 997693 Stage 1 completed at prime 999983 with 2001925 pointadds and 194143 pointdoubles Stage 1 took 11.0072 seconds found factor 22086765417396827057 in stage 1 in thread 0, vec position 10, with sigma = 415569030 found factor 22086765417396827057 in stage 1 in thread 0, vec position 11, with sigma = 3510146781 commencing stage 2 at p=1000003, A=1000230 w = 1155, R = 480, L = 16, umax = 9240, amin = 433 Stage 2 took 7.5996 seconds performed 3188920 pairmultiplies for 5682957 primes in stage 2 performed 95044 pointadditions and 53 pointdoubles in stage 2 found factor 22086765417396827057 in stage 2 in thread 0, vec position 0, with sigma = 551874572 found factor 22086765417396827057 in stage 2 in thread 0, vec position 5, with sigma = 2520392143 found factor 22086765417396827057 in stage 2 in thread 0, vec position 6, with sigma = 1474306994 found factor 22086765417396827057 in stage 2 in thread 0, vec position 8, with sigma = 2588423220 found factor 22086765417396827057 in stage 2 in thread 0, vec position 10, with sigma = 415569030 found factor 22086765417396827057 in stage 2 in thread 0, vec position 11, with sigma = 3510146781 found factor 22086765417396827057 in stage 2 in thread 0, vec position 12, with sigma = 1003661608 Process took 18.7103 seconds. 
20191231, 20:16  #5 
"Ben"
Feb 2007
41·83 Posts 
Another commit:
+ fixed printing issues and get_ui/set_ui with mingw + parameterization seems not exactly compatible with gmpecm: looking into it + adding windows binaries for 32bit and 52bit arithmetic versions The mingw64 build for windows is a little slower but seems to work. I'm not exactly sure what's going on with the parameterization... this program finds factors when it seems like it shouldn't for some sigma and conversely doesn't find factors when gmpecm does for some sigma. On average though the factorfinding rate seems about the same. For example: Code:
./avxecm 274083933049131091178904352004902235910804839081565071801565126307886989756226870709872076759480838311782134623705910949147664751751242690094033868727929674779456111539120708680813658241 128 1000000 16 <snip> found factor 618970019642690137449562111 in stage 2 in thread 8, vec position 2, with sigma = 17981939361520122334 Code:
[ <2, 2>, <3, 2>, <17193611656740451817020637, 1> ] Code:
[ <2, 2>, <3, 3>, <19, 1>, <23, 1>, <3617, 1>, <435735059, 1> ] Last fiddled with by bsquared on 20191231 at 20:22 
20191231, 20:52  #6 
Mar 2019
149 Posts 
Thanks for all the updates!
Would it be a reasonable FR to request basic expression parsing, so that the input does not have to be the full decimal representation of N? 
20191231, 21:56  #7 
"Ben"
Feb 2007
6513_{8} Posts 

20200101, 16:36  #8 
Sep 2009
7E9_{16} Posts 
How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.
Chris 
20200101, 17:08  #9  
Aug 2006
1011101100001_{2} Posts 
Quote:
As of 2019, there are no AMD CPUs that support AVX512, and AMD has not yet released plans to support AVX512.As the Ryzen 5 2600 was launched in 2018 I trust it does not support AVX512. 

20200102, 04:53  #10 
Aug 2006
3^{2}·5·7·19 Posts 
As for Intel, their advanced search allows you to search for AVX512 and find all processors with it. (You can further limit the search, or just look up a chip and see if it's there.)

20200102, 14:45  #11 
"Ed Hall"
Dec 2009
Adirondack Mtns
3683_{10} Posts 
I would have sworn my earlier sessions with Colab showed AVX512 for their Xeon CPUs when I checked. Now, none of my recent sessions have. The CPU type just says Xeon CPU @ 2.00GHz. It does give me a family, etc. for the CPU(s), but no name. Is there a way to find that via the command line?
If I can, I'll set up a Colab AVXECM example in my "How I . . ." series. 