mersenneforum.org > YAFU AVX-ECM

 2020-01-04, 17:57 #34 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 3²×13×31 Posts It took quite a few tries to get a machine with avx512 today, but all went well when I did. I haven't tried this yet, but hope to soon and wondered what the thoughts are for a setup that uses the GPU branch of ECM for stage 1 and this avx-ecm for stage 2. This may be the exact answer I've been searching for with my Colab ECM-GPU session setup, since prior to this I could only run a single stage 2 thread against all the stage 1 curves via GPU. Thoughts?
 2020-01-04, 20:47 #35 bsquared     "Ben" Feb 2007 3,371 Posts A GPU is no doubt the most efficient approach for stg1 if you can get one. This program is probably a good option for stg2 but keep in mind: 1) you will have to run more curves, not quite 2x more but probably close to that. Gmp-ecm with -v is your friend. 2) avx-ecm does not yet have the ability to resume curves :)
2020-01-05, 04:00   #36
EdH

"Ed Hall"
Dec 2009

7053₈ Posts

Quote:
 Originally Posted by bsquared A GPU is no doubt the most efficient approach for stg1 if you can get one. This program is probably a good option for stg2 but keep in mind: 1) you will have to run more curves, not quite 2x more but probably close to that. Gmp-ecm with -v is your friend. 2) avx-ecm does not yet have the ability to resume curves :)
That I had misunderstood. Bummer! I will wait patiently for that ability. . .

Thanks for all your work.

2020-01-05, 04:11   #37
CRGreathouse

Aug 2006

2²·1,493 Posts

Quote:
 Originally Posted by bsquared A GPU is no doubt the most efficient approach for stg1 if you can get one.
Is there a program that supports GPUs for this? What kind of stage 1 / stage 2 tradeoff is reasonable in a setup like that?

 2020-01-05, 15:38 #38 mathwiz   Mar 2019 2⁴·3² Posts

I notice that with large inputs, say 2^8192+1, GMP-ECM (with GWNUM) is still significantly faster, at least on my system (Xeon Gold 6154 @ 3GHz). For example, with this input and B1=1e6, GMP-ECM curves take about 20-30 seconds each:

Code:
Using gwnum_ecmStage1(1, 2, 8192, 1, 1000000, 1)
Step 1 took 19146ms
Estimated memory usage: 203.12MB
Initializing tables of differences for F took 6ms
Computing roots of F took 186ms
Building F from its roots took 393ms
Computing 1/F took 176ms
Initializing table of differences for G took 32ms
Computing roots of G took 142ms
Building G from its roots took 388ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 105ms
Reducing G * H mod F took 156ms
Computing roots of G took 142ms
Building G from its roots took 389ms
Computing G * H took 105ms
Reducing G * H mod F took 155ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 106ms
Reducing G * H mod F took 156ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 105ms
Reducing G * H mod F took 155ms
Computing roots of G took 143ms
Building G from its roots took 387ms
Computing G * H took 104ms
Reducing G * H mod F took 155ms
Computing polyeval(F,G) took 911ms
Computing product of all F(g_i) took 64ms
Step 2 took 6291ms
********** Factor found in step 2: 3603109844542291969
Found prime factor of 19 digits: 3603109844542291969
Composite cofactor ((2^8192+1)/2710954639361)/3603109844542291969 has 2436 digits

But, 10+ minutes later, AVX-ECM is still running! ("./avx-ecm 2^8192+1 4 1000000 4" to be precise).
2020-01-05, 16:19   #39
EdH

"Ed Hall"
Dec 2009

3²·13·31 Posts

Quote:
 Originally Posted by CRGreathouse Is there a program that supports GPUs for this? What kind of stage 1--stage 2 tradeoff is reasonable in a setup like that?
There is a GPU branch in the latest GMP-ECM which I have played around with using Colab. I have a description here. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.

2020-01-05, 16:41   #40
bsquared

"Ben"
Feb 2007

3,371 Posts

Quote:
 Originally Posted by mathwiz I notice that with large inputs, say 2^8192+1, GMP-ECM (with GWNUM) is still significantly faster at least on my system (Xeon Gold 6154 @ 3GHz). For example, with this input and B1=1e6, GMP-ECM curves take about 20-30 seconds each: Code: Using gwnum_ecmStage1(1, 2, 8192, 1, 1000000, 1) ... But, 10+ minutes later, AVX-ECM is still running! ("./avx-ecm 2^8192+1 4 1000000 4" to be precise).
Well...
1) No fair using gwnum
2) You are running avx-ecm with 4 threads, which is 32 curves, so whatever the time ends up being you will have to divide by 32 for a fair comparison.

But you are correct that avx-ecm will become less and less competitive as the input size goes up, because I do not have any sub-quadratic methods implemented. You are also correct that GMP-ECM will benefit significantly on numbers of special form (like the one you picked).

I'm not advocating that avx-ecm replace gmp-ecm, far from it! It is just another tool that may be useful in some situations. Long term, it probably makes the most sense to merge avx-ecm into gmp-ecm somehow.
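For anyone redoing that normalization themselves, here is a minimal sketch of the per-curve arithmetic (the helper name and the 640-second figure are made up for illustration; the 8-curves-per-thread factor comes from AVX-ECM running 8 curves at once in AVX-512 lanes, as described above):

```python
# AVX-ECM runs 8 curves simultaneously per thread in AVX-512 vector
# lanes, so wall-clock time amortizes over threads * 8 curves.
# The 640-second figure below is a placeholder, not a measurement.

def avx_ecm_per_curve(total_seconds: float, threads: int, lanes: int = 8) -> float:
    """Per-curve wall time when `threads * lanes` curves run at once."""
    return total_seconds / (threads * lanes)

# mathwiz's command used 4 threads, i.e. 4 * 8 = 32 curves in flight:
assert 4 * 8 == 32

# e.g. a 640 s wall-clock run with 4 threads works out to 20 s per curve:
assert avx_ecm_per_curve(640.0, threads=4) == 20.0
```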

Last fiddled with by bsquared on 2020-01-05 at 16:51

2020-01-05, 16:51   #41
bsquared

"Ben"
Feb 2007

3,371 Posts

Quote:
 Originally Posted by EdH There is a GPU branch in the latest GMP-ECM which I have played around with using Colab. I have a description here. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.
The GPU branch, IIRC, is also fixed-size arithmetic like avx-ecm. But I believe it only has 2 fixed sizes available, maybe 508 and 1016 (I don't recall exactly)? So if you want to run curves on a 509-bit number you are forced to use 1016-bit arithmetic, which of course is not optimal. avx-ecm is fixed-sized arithmetic as well but in steps of either 128 bits or 208 bits, so you have more efficient choices.

So the question of which path is most efficient depends on the size and form of the number.
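To make the padding overhead concrete, here is a rough sketch (purely illustrative: the 508/1016 GPU sizes are the ones guessed above, the round-up-to-working-size rule and function names are my assumptions, and I use the 208-bit step for avx-ecm):

```python
import math

def gpu_working_bits(n_bits: int) -> int:
    """GPU branch (as described above): only two fixed sizes available."""
    for size in (508, 1016):
        if n_bits <= size:
            return size
    raise ValueError("input too large for the fixed GPU sizes")

def avx_ecm_working_bits(n_bits: int, step: int = 208) -> int:
    """AVX-ECM: round the input up to the next multiple of the step size."""
    return step * math.ceil(n_bits / step)

# A 509-bit number forces 1016-bit arithmetic on the GPU branch,
# but only 624-bit arithmetic in AVX-ECM:
assert gpu_working_bits(509) == 1016
assert avx_ecm_working_bits(509) == 624
```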

2020-01-05, 17:29   #42
chris2be8

Sep 2009

2,003 Posts

Quote:
 Originally Posted by EdH There is a GPU branch in the latest GMP-ECM which I have played around with using Colab. I have a description here. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.
You need to get ecm to do stage 1 on the GPU and save state to a file. Then split that into as many pieces as you have CPUs and run 1 ecm task against each piece.

Eg:
Code:
rm $NAME.save
ecm -gpu -save $NAME.save $B1 1 < $INI | tee -a $LOG
split -nr/2 $NAME.save $NAME.save
wait # If you are in a loop
ecm -resume $NAME.saveaa $B1 | tee -a $LOG.1 | grep [Ff]actor &
ecm -resume $NAME.saveab $B1 | tee -a $LOG.2 | grep [Ff]actor &
# This would be the end of the loop
wait # Until all stage 2's end

$NAME is the name of the project (anything you like).
$INI is the name of the file with the number in it.
$B1 you can probably guess.
$LOG is the name of the log files.

If you want to do more curves than the GPU does in 1 go then put the code into a loop.

This is a simplified version of my code, which has to allow for the CPU being faster doing stage 2 than the GPU doing stage 1. But I've not tested this.

See ecm's README file for details of what the parameters do. And README.gpu.

Chris

2020-01-05, 18:01   #43
chalsall
If I May

"Chris Halsall"
Sep 2002

3⁶×13 Posts

Quote:
 Originally Posted by EdH I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway.
Please remember that Colab instances give you two *hyperthreaded* cores; only one real core. I don't know anything about this codebase, but HT'ing is useless for mprime.

I've already "burnt" my disposable Kaggle account, so I've been hesitant to experiment with CPU usage there with my main / development account, but CPU-only Kaggle instances give you two real cores; hyperthreaded to four.

FWIW.

2020-10-30, 20:55   #44
bsquared

"Ben"
Feb 2007

3,371 Posts

Quote:
 Originally Posted by bsquared I will run some tests using AVX-ECM. Scaled tests, at B1=7e6, show that first-stage curve throughput is about 2.4x larger than GMP-ECM. ... But I've never run AVX-ECM with B1 anywhere close to that large. My bet is it crashes... but I'll test and see what happens.
Replying here to a post in this thread, to prevent further hijacking of that thread.

I was right, it crashed.

While I was fixing the bug I realized that I was using Montgomery arithmetic for this input that would be much better off using a special Mersenne reduction. So I implemented that feature, and now AVX-ECM is about 4.5 times faster than GMP-ECM on 2^1277-1.
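For anyone curious about why the special reduction is so much cheaper: for N = 2^1277 - 1 we have 2^1277 ≡ 1 (mod N), so reducing a product is just a shift, a mask, and an add instead of a full Montgomery reduction. A toy sketch of the math (the real code works on vectorized limbs; this only shows the idea):

```python
import random

P = 1277
N = (1 << P) - 1  # the Mersenne number 2^1277 - 1
MASK = N          # low P bits

def mersenne_reduce(x: int) -> int:
    """Reduce x mod 2^P - 1 using 2^P ≡ 1 (mod N): split x into its
    low and high P-bit halves and add them, repeating until x fits."""
    while x.bit_length() > P:
        x = (x & MASK) + (x >> P)
    if x >= N:
        x -= N
    return x

# Agrees with ordinary modular reduction on a random product:
a, b = random.getrandbits(P), random.getrandbits(P)
assert mersenne_reduce(a * b) == (a * b) % N
```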

GMP-ECM:
Code:
B1      Time, 1 curve, seconds
7e5      5.67
7e6      57.2
7e7      581
...
7e9      59448  (extrapolated)
AVX-ECM
Code:
B1      Time, 8 curves, seconds
7e5      10.5
7e6      101.1
7e7      1019
...
7e9      102351  (actual)
102351 / 8 ≈ 12794 seconds, which is about 3.5 hours per curve, versus about 16.5 hours per curve for GMP-ECM (extrapolated).
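Checking that arithmetic (numbers copied from the tables above; AVX-ECM's time is divided by 8 because each run does 8 curves at once):

```python
# Timings from the tables above, B1 = 7e9 on 2^1277-1.
gmp_ecm_per_curve_s = 59448   # extrapolated, 1 curve at a time
avx_ecm_total_s = 102351      # measured, 8 curves in parallel

avx_ecm_per_curve_s = avx_ecm_total_s / 8
assert round(avx_ecm_per_curve_s) == 12794

# roughly 3.55 hours per curve vs 16.5 hours per curve:
assert abs(avx_ecm_per_curve_s / 3600 - 3.55) < 0.05
assert abs(gmp_ecm_per_curve_s / 3600 - 16.5) < 0.1

# about a 4.6x per-curve speedup, consistent with the
# "about 4.5 times faster" figure above:
assert 4.4 < gmp_ecm_per_curve_s / avx_ecm_per_curve_s < 4.7
```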

I've checked the latest code into r393 if anyone wants to do some testing. You will need an AVX-512-capable CPU.

Last fiddled with by bsquared on 2020-10-30 at 21:00 Reason: specify input, and requirements