mersenneforum.org > YAFU AVX-ECM
 User Name Remember Me? Password
 Register FAQ Search Today's Posts Mark Forums Read

 2019-12-30, 22:21 #1 bsquared     "Ben" Feb 2007 336110 Posts AVX-ECM Just created a repository on github with new ECM code: https://github.com/bbuhrow/avx-ecm It computes curves in parallel using AVX-512 vector arithmetic, and is also multi-threaded. I've measured it to have about 2x the throughput of GMP-ECM using 1 thread on a Xeon Gold 5122 CPU. It does both stage 1 and stage 2 in parallel, but stage 2 is just the standard continuation with pairing (by default, B2=100xB1). It will output a savefile after stage 1 that will work with GMP-ECM stage 2, if desired. e.g.: ecm -resume save_b1.txt 1000000 < input.txt compile on linux using either icc or gcc-7.3.0 or above (I've only tested icc and gcc-7.3.0). e.g.: make MAXBITS=416 COMPILER=gcc730 SKYLAKEX=1 the MAXBITS parameter must be a multiple of 208. When this is the case, 8 curves are performed in parallel per thread. Alternatively, if DIGITBITS=32 is specified during make, then MAXBITS must be a mutiple of 128 and 16 curves are performed in parallel per thread. The 52-bit version is generally faster (and the default) but for some sizes a 32-bit version can have higher throughput (because of the 208-bit jumps between sizes). If anyone wants to test it out you will of course need an avx-512 capable computer. Let me know if there are any problems. Happy factoring! ============================================================  Sticky comparison tables generic 900-bit input Xeon Gold 5122 CPU, 1 thread, icc compiler Notes: 1) "weight" is obtained by determining the number of curves for the given t-level using -power 1 and B2=100B1 in gmp-ecm, and then also using default gmp-ecm settings, and taking the ratio. 2) hybrid = (avx-ecm stage 1 time + 8*gmp-ecm stage 2 time) / 8 Code: generic 900-bit avx-ecm weighted t-level B1 stg 1 stg2 sum sec/curve weight sec/curve 35 1.00E+06 13 9.2 22.2 2.78 1.85 5.14 40 3.00E+06 39.3 26.3 65.6 8.20 2.07 17.01 45 1.00E+07 131.1 81.1 212.2 26.53 2.35 62.33 50 4.30E+07 567.1 334.3 901.4 112.68 2.46 277.42 55 1.10E+08 1448.3 822.6 2270.9 283.86 2.63 747.74 60 2.60E+08 3423 1874 5303 662.98 2.87 1902.6 gmp-ecm-704 hybrid t-level B1 stg 1 stg2 sum sec/curve sec/curve 35 1.00E+06 6.59 3.45 10.04 10.04 5.08 40 3.00E+06 19.8 7.77 27.57 27.57 12.68 45 1.00E+07 65.9 25.4 91.3 91.30 41.79 50 4.30E+07 287.5 78.8 366.3 366.30 149.69 55 1.10E+08 738.3 196.2 934.5 934.50 377.24 60 2.60E+08 1750 412.5 2162.5 2160.5 841.1 gmp/weighted-avx gmp/hybrid avx/hybrid t-level B1 sec/curve ratio sec/curve ratio ratio 35 1.00E+06 1.95 1.98 1.01 40 3.00E+06 1.62 2.17 1.34 45 1.00E+07 1.46 2.18 1.49 50 4.30E+07 1.32 2.45 1.85 55 1.10E+08 1.25 2.48 1.98 60 2.60E+08 1.14 2.57 2.26 Conclusions: Below B1=1M, probably better to just use avx-ecm. At B1=1M and above, hybrid is best. Somewhere above 260M, pure GMP-ECM likely beats pure AVX-ECM, due to increasing weight, but hybrid is still best. For Mersenne inputs with primitive size between 832 and 1039 bits: Code:  gmp/weighted-avx gmp/hybrid avx/hybrid t-level B1 sec/curve ratio sec/curve ratio ratio 35 1.00E+06 3.72 2.26 0.61 40 3.00E+06 3.06 2.56 0.84 45 1.00E+07 2.79 2.55 0.91 50 4.30E+07 2.49 2.97 1.19 55 1.10E+08 2.31 3.11 1.35 60 2.60E+08 2.09 3.29 1.57 Hybrid best starting at B1 around 30M. For Mersenne inputs with primitive size between 1248 and 1456 bits: Code:  gmp/weighted-avx gmp/hybrid avx/hybrid t-level B1 sec/curve ratio sec/curve ratio ratio 35 1.00E+06 3.12 2.06 0.66 40 3.00E+06 2.59 2.32 0.89 45 1.00E+07 2.30 2.32 1.01 50 4.30E+07 2.08 2.67 1.28 55 1.10E+08 1.92 2.79 1.46 60 2.60E+08 1.74 2.95 1.69 65 8.50E+08 1.65 3.17 1.92 70 2.90E+09 1.52 3.40 2.23 75 7.60E+09 1.44 3.59 2.50 Hybrid best starting at B1 around 10M. Last fiddled with by bsquared on 2020-11-03 at 19:49 Reason: adding tables
 2019-12-30, 23:03 #2 mathwiz   Mar 2019 11×13 Posts Hmmmm. Just tried it out on an AVX-512 capable machine, with B1=1e6 and input M433 (http://factordb.com/index.php?id=1000000000000000433): Code: ./avx-ecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 100 1000000 commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 ECM has been configured with MAXBITS = 416, NWORDS = 8, VECLEN = 8 starting process 171275 cached 3001134 primes < 49999991 Input has 433 bits, using 1 threads (100 curves/thread) Processing in batches of 100000000 primes Initialization took 0.0743 seconds. Cached 5761455 primes in range [2 : 99999989] commencing curves 0-7 of 100 Building curves took 0.0005 seconds. commencing Stage 1 @ prime 2 accumulating prime 997693 Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles Stage 1 took 4.1971 seconds found factor 7573885960626120430675586583017049046287571561744815615 in stage 1 in thread 0, vec position 0, with sigma = 9805897911615405889 found factor 125483996795288390556263452518000395100487722298175677228160127607935 in stage 1 in thread 0, vec position 1, with sigma = 5941742753921851068 found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 2, with sigma = 17488052527796218971 found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 3, with sigma = 162542135677334094 found factor 9704134944669326942243922594738986945861288047284534894845604139673055 in stage 1 in thread 0, vec position 4, with sigma = 9639283928122886917 found factor 19065414140013908092677804414156626466105803295 in stage 1 in thread 0, vec position 5, with sigma = 14437779671575784496 found factor 455237245388167216759545663443996983027266740639813141535 in stage 1 in thread 0, vec position 6, with sigma = 5479294375532123583 found factor 161449908068885681487281375790285735990727019303070577509707381288895 in stage 1 in thread 0, vec position 7, with sigma = 16438922582860911842 commencing stage 2 at p=1000003, A=1000230 w = 1155, R = 480, L = 16, umax = 9240, amin = 433 Stage 2 took 3.1940 seconds performed 3188920 pair-multiplies for 5682957 primes in stage 2 performed 95044 point-additions and 53 point-doubles in stage 2 something failed: tid = 0, vec = 0 has zero result something failed: tid = 0, vec = 1 has zero result something failed: tid = 0, vec = 2 has zero result something failed: tid = 0, vec = 3 has zero result something failed: tid = 0, vec = 4 has zero result something failed: tid = 0, vec = 5 has zero result something failed: tid = 0, vec = 6 has zero result something failed: tid = 0, vec = 7 has zero result Process took 7.4682 seconds. Something seems wrong here? Or am I holding it wrong? Compare to: Code: 4\$ echo "22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591" | ./ecm -c 10 1e6 GMP-ECM 7.0.4 [configured with GMP 6.1.2, GWNUM 29.8, --enable-asm-redc] [ECM] Due to incompatible licenses, this binary file must not be distributed. Input number is 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 (131 digits) Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:14118429249538045316 Step 1 took 2892ms Step 2 took 2391ms Run 2 out of 10: Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:15640357536401406010 ********** Factor found in step 1: 22086765417396827057 Found prime factor of 20 digits: 22086765417396827057 Composite cofactor 1004282751855334395242501879256940939756665292745707836441839543208470249351377144560254198205767424080811494063 has 112 digits
 2019-12-30, 23:06 #3 bsquared     "Ben" Feb 2007 3,361 Posts It looks like your input is larger than 416 bits. So you'd have to re-build with MAXBITS=624. Actually, this is a case where MAXBITS=512 DIGITBITS=32 will give you slightly higher throughput (because 512 is quite a bit smaller than 624).
 2019-12-31, 14:45 #4 bsquared     "Ben" Feb 2007 D2116 Posts New version that chooses MAXBITS automatically based on the input size. You still have to specify DIGITBITS during build as this impacts the underlying word size, which can't be easily changed on the fly. Default is 52-bit, resulting in steps of 208 bits. It is a fixed-size arithmetic within these steps, so curves at 417 bits will take the same amount of time as curves with 623 bits. This is similar to GPU-ECM, although with more fine-grained resolution. The 32-bit build path gives steps of 128 bits. So for this input it will choose a 512-bit curve size which is faster than a 624-bit curve with 52-bit words. Code: ./avx-ecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 16 1000000 1 commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 starting process 226272 ECM has been configured with DIGITBITS = 32, VECLEN = 16 Choosing MAXBITS = 512, NWORDS = 16, NBLOCKS = 4 based on input size 433 cached 3001134 primes < 49999991 Input has 433 bits, using 1 threads (16 curves/thread) Processing in batches of 100000000 primes Initialization took 0.0895 seconds. Cached 5761455 primes in range [2 : 99999989] commencing curves 0-15 of 16 Building curves took 0.0006 seconds. commencing Stage 1 @ prime 2 accumulating prime 997693 Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles Stage 1 took 11.0072 seconds found factor 22086765417396827057 in stage 1 in thread 0, vec position 10, with sigma = 415569030 found factor 22086765417396827057 in stage 1 in thread 0, vec position 11, with sigma = 3510146781 commencing stage 2 at p=1000003, A=1000230 w = 1155, R = 480, L = 16, umax = 9240, amin = 433 Stage 2 took 7.5996 seconds performed 3188920 pair-multiplies for 5682957 primes in stage 2 performed 95044 point-additions and 53 point-doubles in stage 2 found factor 22086765417396827057 in stage 2 in thread 0, vec position 0, with sigma = 551874572 found factor 22086765417396827057 in stage 2 in thread 0, vec position 5, with sigma = 2520392143 found factor 22086765417396827057 in stage 2 in thread 0, vec position 6, with sigma = 1474306994 found factor 22086765417396827057 in stage 2 in thread 0, vec position 8, with sigma = 2588423220 found factor 22086765417396827057 in stage 2 in thread 0, vec position 10, with sigma = 415569030 found factor 22086765417396827057 in stage 2 in thread 0, vec position 11, with sigma = 3510146781 found factor 22086765417396827057 in stage 2 in thread 0, vec position 12, with sigma = 1003661608 Process took 18.7103 seconds. For this size input, 16 simultaneous 512-bit curves complete in 18.7 sec or 1.17 sec/curve versus GMP-ECM's 1.72 sec/curve.
 2019-12-31, 20:16 #5 bsquared     "Ben" Feb 2007 1101001000012 Posts Another commit: + fixed printing issues and get_ui/set_ui with mingw + parameterization seems not exactly compatible with gmp-ecm: looking into it + adding windows binaries for 32-bit and 52-bit arithmetic versions The mingw64 build for windows is a little slower but seems to work. I'm not exactly sure what's going on with the parameterization... this program finds factors when it seems like it shouldn't for some sigma and conversely doesn't find factors when gmp-ecm does for some sigma. On average though the factor-finding rate seems about the same. For example: Code: ./avx-ecm 274083933049131091178904352004902235910804839081565071801565126307886989756226870709872076759480838311782134623705910949147664751751242690094033868727929674779456111539120708680813658241 128 1000000 16 found factor 618970019642690137449562111 in stage 2 in thread 8, vec position 2, with sigma = 17981939361520122334 But using this sage code here gives the group order for that factor and sigma as: Code: [ <2, 2>, <3, 2>, <17193611656740451817020637, 1> ] And gmp-ecm with that sigma only finds the factor 74383430474532481 with the following entirely reasonable group order: Code: [ <2, 2>, <3, 3>, <19, 1>, <23, 1>, <3617, 1>, <435735059, 1> ] Last fiddled with by bsquared on 2019-12-31 at 20:22
 2019-12-31, 20:52 #6 mathwiz   Mar 2019 11×13 Posts Thanks for all the updates! Would it be a reasonable FR to request basic expression parsing, so that the input does not have to be the full decimal representation of N?
2019-12-31, 21:56   #7
bsquared

"Ben"
Feb 2007

3,361 Posts

Quote:
 Originally Posted by mathwiz Thanks for all the updates! Would it be a reasonable FR to request basic expression parsing, so that the input does not have to be the full decimal representation of N?
Yes, that's reasonable :)

BTW, thanks for testing it.

 2020-01-01, 16:36 #8 chris2be8     Sep 2009 198010 Posts How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems. Chris
2020-01-01, 17:08   #9
CRGreathouse

Aug 2006

3×1,987 Posts

Quote:
 Originally Posted by chris2be8 How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.
Wikipedia writes, without citation:
As of 2019, there are no AMD CPUs that support AVX-512, and AMD has not yet released plans to support AVX-512.
As the Ryzen 5 2600 was launched in 2018 I trust it does not support AVX-512.

 2020-01-02, 04:53 #10 CRGreathouse     Aug 2006 3×1,987 Posts As for Intel, their advanced search allows you to search for AVX-512 and find all processors with it. (You can further limit the search, or just look up a chip and see if it's there.)
 2020-01-02, 14:45 #11 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 1101110101012 Posts I would have sworn my earlier sessions with Colab showed AVX512 for their Xeon CPUs when I checked. Now, none of my recent sessions have. The CPU type just says Xeon CPU @ 2.00GHz. It does give me a family, etc. for the CPU(s), but no name. Is there a way to find that via the command line? If I can, I'll set up a Colab AVX-ECM example in my "How I . . ." series.

 Thread Tools

All times are UTC. The time now is 14:49.

Wed Jan 20 14:49:16 UTC 2021 up 48 days, 11 hrs, 0 users, load averages: 4.06, 3.53, 3.63

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.