mersenneforum.org > Factoring Projects > YAFU
2020-01-04, 17:57  #34
EdH

It took quite a few tries to get a machine with avx512 today, but all went well when I did.

I haven't tried this yet, but hope to soon and wondered what the thoughts are for a setup that uses the GPU branch of ECM for stage 1 and this avx-ecm for stage 2. This may be the exact answer I've been searching for with my Colab ECM-GPU session setup, since prior to this I could only run a single stage 2 thread against all the stage 1 curves via GPU.

Thoughts?

2020-01-04, 20:47  #35
bsquared

A GPU is no doubt the most efficient approach for stage 1 if you can get one. This program is probably a good option for stage 2, but keep in mind:
1) You will have to run more curves, not quite 2x more but probably close to that. GMP-ECM with -v is your friend.
2) avx-ecm does not yet have the ability to resume curves :)

2020-01-05, 04:00  #36
EdH

Quote:
Originally Posted by bsquared View Post
A GPU is no doubt the most efficient approach for stg1 if you can get one. This program is probably a good option for stg2 but keep in mind:
1) you will have to run more curves, not quite 2x more but probably close to that. Gmp-ecm with -v is your friend.
2) avx-ecm does not yet have the ability to resume curves :)
That I had misunderstood. Bummer! I will wait patiently for that ability. . .

Thanks for all your work.

2020-01-05, 04:11  #37
CRGreathouse


Quote:
Originally Posted by bsquared View Post
A GPU is no doubt the most efficient approach for stg1 if you can get one.
Is there a program that supports GPUs for this? What kind of stage 1--stage 2 tradeoff is reasonable in a setup like that?

2020-01-05, 15:38  #38
mathwiz

I notice that with large inputs, say 2^8192+1, GMP-ECM (with GWNUM) is still significantly faster, at least on my system (Xeon Gold 6154 @ 3GHz).

For example, with this input and B1=1e6, GMP-ECM curves take about 20-30 seconds each:

Code:
Using gwnum_ecmStage1(1, 2, 8192, 1, 1000000, 1)
Step 1 took 19146ms
Estimated memory usage: 203.12MB
Initializing tables of differences for F took 6ms
Computing roots of F took 186ms
Building F from its roots took 393ms
Computing 1/F took 176ms
Initializing table of differences for G took 32ms
Computing roots of G took 142ms
Building G from its roots took 388ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 105ms
Reducing  G * H mod F took 156ms
Computing roots of G took 142ms
Building G from its roots took 389ms
Computing G * H took 105ms
Reducing  G * H mod F took 155ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 106ms
Reducing  G * H mod F took 156ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 105ms
Reducing  G * H mod F took 155ms
Computing roots of G took 143ms
Building G from its roots took 387ms
Computing G * H took 104ms
Reducing  G * H mod F took 155ms
Computing polyeval(F,G) took 911ms
Computing product of all F(g_i) took 64ms
Step 2 took 6291ms
********** Factor found in step 2: 3603109844542291969
Found prime factor of 19 digits: 3603109844542291969
Composite cofactor ((2^8192+1)/2710954639361)/3603109844542291969 has 2436 digits
But, 10+ minutes later, AVX-ECM is still running! ("./avx-ecm 2^8192+1 4 1000000 4" to be precise).

2020-01-05, 16:19  #39
EdH

Quote:
Originally Posted by CRGreathouse View Post
Is there a program that supports GPUs for this? What kind of stage 1--stage 2 tradeoff is reasonable in a setup like that?
There is a GPU branch in the latest GMP-ECM which I have played around with using Colab. I have a description here. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.

2020-01-05, 16:41  #40
bsquared

Quote:
Originally Posted by mathwiz View Post
I notice that with large inputs, say 2^8192+1, GMP-ECM (with GWNUM) is still significantly faster at least on my system (Xeon Gold 6154 @ 3GHz).

For example, with this input and B1=1e6, GMP-ECM curves take about 20-30 seconds each:

Code:
Using gwnum_ecmStage1(1, 2, 8192, 1, 1000000, 1)
...
But, 10+ minutes later, AVX-ECM is still running! ("./avx-ecm 2^8192+1 4 1000000 4" to be precise).
Well...
1) No fair using gwnum
2) You are running avx-ecm with 4 threads, which is 32 curves, so whatever the time ends up being you will have to divide by 32 for a fair comparison.

But you are correct that avx-ecm will become less and less competitive as the input size goes up, because I do not have any sub-quadratic methods implemented. You are also correct that GMP-ECM will benefit significantly on numbers of special form (like the one you picked).

I'm not advocating that avx-ecm replace gmp-ecm, far from it! It is just another tool that may be useful in some situations. Long term, it probably makes the most sense to merge avx-ecm into gmp-ecm somehow.

Last fiddled with by bsquared on 2020-01-05 at 16:51

2020-01-05, 16:51  #41
bsquared

Quote:
Originally Posted by EdH View Post
There is a GPU branch in the latest GMP-ECM which I have played around with using Colab. I have a description here. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.
The GPU branch, IIRC, also uses fixed-size arithmetic like avx-ecm. But I believe it only has 2 fixed sizes available, maybe 508 and 1016 bits (I don't recall exactly)? So if you want to run curves on a 509-bit number you are forced to use 1016-bit arithmetic, which of course is not optimal. avx-ecm uses fixed-size arithmetic as well, but in steps of either 128 bits or 208 bits, so you have more efficient choices.

So the answer to which path is most efficient depends on the size and form of the number.
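To illustrate the padding effect (the sizes here are just the ones mentioned above; the actual limb layouts in either program may differ):

```python
# Hypothetical sketch of the overhead from fixed-size arithmetic.
# A 509-bit input forced into 1016-bit words does roughly 4x the
# multiplication work of a tight fit; 208-bit steps waste far less.

def padded_bits(nbits, step):
    """Round nbits up to the next multiple of step (ceiling division)."""
    return -(-nbits // step) * step

n = 509
print(padded_bits(n, 1016))  # GPU-branch style padding: 1016
print(padded_bits(n, 208))   # avx-ecm style 208-bit steps: 624
```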

2020-01-05, 17:29  #42
chris2be8

Quote:
Originally Posted by EdH View Post
There is a GPU branch in the latest GMP-ECM which I have played around with using Colab. I have a description here. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.
You need to get ecm to do stage 1 on the GPU and save state to a file. Then split that into as many pieces as you have CPUs and run 1 ecm task against each piece.

Eg:
Code:
rm $NAME.save
ecm -gpu -save $NAME.save $B1 1 <$INI | tee -a $LOG
split -nr/2 $NAME.save $NAME.save
wait # If you are in a loop
ecm -resume $NAME.saveaa $B1 | tee -a $LOG.1 | grep '[Ff]actor' &
ecm -resume $NAME.saveab $B1 | tee -a $LOG.2 | grep '[Ff]actor' &
# This would be the end of the loop
wait # Until all stage 2's end
$NAME is the name of the project (anything you like).
$INI is the name of the file containing the number.
$B1 you can probably guess.
$LOG is the name of the log files.

If you want to do more curves than the GPU does in 1 go then put the code into a loop.

This is a simplified version of my code, which has to allow for the CPU being faster doing stage 2 than the GPU doing stage 1. But I've not tested this.

See ecm's README file for details of what the parameters do. And README.gpu.

Chris

2020-01-05, 18:01  #43
chalsall


Quote:
Originally Posted by EdH View Post
I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway.
Please remember that Colab instances give you two *hyperthreaded* cores; only one real core. I don't know anything about this codebase, but HT'ing is useless for mprime.

I've already "burnt" my disposable Kaggle account, so I've been hesitant to experiment with CPU usage there with my main / development account, but CPU-only Kaggle instances give you two real cores; hyperthreaded to four.

FWIW.

2020-10-30, 20:55  #44
bsquared


Quote:
Originally Posted by bsquared View Post
I will run some tests using AVX-ECM. Scaled tests, at B1=7e6, show that first-stage curve throughput is about 2.4x larger than GMP-ECM.


... But I've never run AVX-ECM with B1 anywhere close to that large. My bet is it crashes... but I'll test and see what happens.
Replying here to a post in this thread, to prevent further hijacking of that thread.

I was right, it crashed.

While I was fixing the bug I realized that I was using Montgomery arithmetic for this input that would be much better off using a special Mersenne reduction. So I implemented that feature, and now AVX-ECM is about 4.5 times faster than GMP-ECM on 2^1277-1.
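A minimal sketch of the idea behind that special reduction (not Ben's actual code): since 2^p ≡ 1 (mod 2^p - 1), a product can be folded back with shifts and masks instead of a full Montgomery multiply:

```python
def mersenne_reduce(x, p):
    """Reduce x modulo 2^p - 1 using only shift, mask, and add."""
    m = (1 << p) - 1
    while x > m:
        # fold the high bits down: 2^p == 1 (mod 2^p - 1)
        x = (x & m) + (x >> p)
    return 0 if x == m else x

# quick check against plain modular reduction at p = 1277
p = 1277
m = (1 << p) - 1
a, b = 3**800 % m, 5**800 % m
assert mersenne_reduce(a * b, p) == (a * b) % m
```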

GMP-ECM:
Code:
B1      Time, 1 curve, seconds
7e5      5.67
7e6      57.2
7e7      581
...
7e9      59448  (extrapolated)
AVX-ECM:
Code:
B1      Time, 8 curves, seconds
7e5      10.5
7e6      101.1
7e7      1019
...
7e9      102351  (actual)
102351 / 8 is about 3.5 hours per curve, versus about 16.5 hours per curve for GMP-ECM.
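Working those numbers out explicitly (assuming avx-ecm's 8 simultaneous curves and the B1=7e9 timings from the tables above):

```python
avx_total = 102351   # seconds for 8 simultaneous avx-ecm curves (actual)
gmp_curve = 59448    # seconds for 1 GMP-ECM curve (extrapolated)

avx_curve = avx_total / 8
print(f"avx-ecm: {avx_curve / 3600:.2f} h/curve")  # ~3.55
print(f"gmp-ecm: {gmp_curve / 3600:.2f} h/curve")  # ~16.51
print(f"ratio:   {gmp_curve / avx_curve:.2f}x")    # ~4.65
```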

I've checked the latest code into r393 if anyone wants to do some testing. You will need an AVX-512-capable CPU.

Last fiddled with by bsquared on 2020-10-30 at 21:00 Reason: specify input, and requirements