It took quite a few tries to get a machine with avx512 today, but all went well when I did.
I haven't tried this yet, but hope to soon and wondered what the thoughts are for a setup that uses the GPU branch of ECM for stage 1 and this avxecm for stage 2. This may be the exact answer I've been searching for with my Colab ECMGPU session setup, since prior to this I could only run a single stage 2 thread against all the stage 1 curves via GPU. Thoughts? 
A GPU is no doubt the most efficient approach for stg1 if you can get one. This program is probably a good option for stg2 but keep in mind:
1) you will have to run more curves, not quite 2x more but probably close to that. GMP-ECM with -v is your friend.
2) avxecm does not yet have the ability to resume curves :) 
[QUOTE=bsquared;534240]A GPU is no doubt the most efficient approach for stg1 if you can get one. This program is probably a good option for stg2 but keep in mind:
1) you will have to run more curves, not quite 2x more but probably close to that. GMP-ECM with -v is your friend. [B]2) avxecm does not yet have the ability to resume curves[/B] :)[/QUOTE] That I had misunderstood. Bummer! I will wait patiently for that ability... Thanks for all your work. 
[QUOTE=bsquared;534240]A GPU is no doubt the most efficient approach for stg1 if you can get one.[/QUOTE]
Is there a program that supports GPUs for this? :blush: What kind of stage 1 / stage 2 tradeoff is reasonable in a setup like that? 
I notice that with large inputs, say 2^8192+1, GMP-ECM (with GWNUM) is still significantly faster, at least on my system (Xeon Gold 6154 @ 3GHz).
For example, with this input and B1=1e6, GMP-ECM curves take about 20-30 seconds each: [CODE]Using gwnum_ecmStage1(1, 2, 8192, 1, 1000000, 1)
Step 1 took 19146ms
Estimated memory usage: 203.12MB
Initializing tables of differences for F took 6ms
Computing roots of F took 186ms
Building F from its roots took 393ms
Computing 1/F took 176ms
Initializing table of differences for G took 32ms
Computing roots of G took 142ms
Building G from its roots took 388ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 105ms
Reducing G * H mod F took 156ms
Computing roots of G took 142ms
Building G from its roots took 389ms
Computing G * H took 105ms
Reducing G * H mod F took 155ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 106ms
Reducing G * H mod F took 156ms
Computing roots of G took 146ms
Building G from its roots took 388ms
Computing G * H took 105ms
Reducing G * H mod F took 155ms
Computing roots of G took 143ms
Building G from its roots took 387ms
Computing G * H took 104ms
Reducing G * H mod F took 155ms
Computing polyeval(F,G) took 911ms
Computing product of all F(g_i) took 64ms
Step 2 took 6291ms
********** Factor found in step 2: 3603109844542291969
Found prime factor of 19 digits: 3603109844542291969
Composite cofactor ((2^8192+1)/2710954639361)/3603109844542291969 has 2436 digits[/CODE] But, 10+ minutes later, AVX-ECM is still running! ("./avxecm 2^8192+1 4 1000000 4", to be precise). 
[QUOTE=CRGreathouse;534262]Is there a program that supports GPUs for this? :blush: What kind of stage 1stage 2 tradeoff is reasonable in a setup like that?[/QUOTE]There is a GPU branch in the latest GMPECM which I have played around with using Colab. I have a description [URL="https://www.mersenneforum.org/showthread.php?t=24887"]here[/URL]. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.

[QUOTE=mathwiz;534282]I notice that with large inputs, say 2^8192+1, GMP-ECM (with GWNUM) is still significantly faster, at least on my system (Xeon Gold 6154 @ 3GHz).
For example, with this input and B1=1e6, GMP-ECM curves take about 20-30 seconds each: [CODE]Using gwnum_ecmStage1(1, 2, 8192, 1, 1000000, 1) ... [/CODE] But, 10+ minutes later, AVX-ECM is still running! ("./avxecm 2^8192+1 4 1000000 4", to be precise).[/QUOTE] Well...
1) No fair using gwnum.
2) You are running avxecm with 4 threads, which is 32 curves, so whatever the time ends up being you will have to divide by 32 for a fair comparison.

But you are correct that avxecm will become less and less competitive as the input size goes up, because I do not have any subquadratic methods implemented. You are also correct that GMP-ECM will benefit significantly on numbers of special form (like the one you picked). I'm not advocating that avxecm replace GMP-ECM, far from it! It is just another tool that may be useful in some situations. Long term, it probably makes the most sense to merge avxecm into GMP-ECM somehow. 
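To make the fairness point above concrete, here is a small Python sketch of the per-curve arithmetic. The GMP-ECM figures come from the log earlier in the thread; the 20-minute avxecm wall time is a made-up placeholder for illustration, since that run had not finished:

```python
# Fair comparison: divide wall-clock time by the number of curves run
# simultaneously (avxecm with 4 threads runs 8 curves/thread = 32 curves).

def per_curve_seconds(wall_seconds, curves):
    return wall_seconds / curves

# GMP-ECM, from the log above: one curve, stage 1 + stage 2.
print(per_curve_seconds(19.146 + 6.291, 1))   # ~25.4 s/curve

# avxecm: a hypothetical 20-minute wall time spread over 32 curves:
print(per_curve_seconds(20 * 60, 32))         # 37.5 s/curve
```

So even a 20-minute avxecm run would only be ~1.5x slower per curve than the gwnum-assisted GMP-ECM run, not 40x slower as the raw wall-clock times suggest.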
[QUOTE=EdH;534286]There is a GPU branch in the latest GMPECM which I have played around with using Colab. I have a description [URL="https://www.mersenneforum.org/showthread.php?t=24887"]here[/URL]. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.[/QUOTE]
The GPU branch, IIRC, is also fixed-size arithmetic like avxecm. But I believe it only has 2 fixed sizes available, maybe 508 and 1016 (I don't recall exactly)? So if you want to run curves on a 509-bit number you are forced to use 1016-bit arithmetic, which of course is not optimal. avxecm uses fixed-size arithmetic as well, but in steps of either 128 bits or 208 bits, so you have more efficient choices. So the question of which path is most efficient depends on the size and form of the number. 
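To illustrate why smaller step sizes waste less work, here is a sketch of the padding arithmetic. The rounding rule is my assumption about how a fixed-size implementation picks its working precision, and the 508/1016 sizes are the hedged recollection above:

```python
def padded_bits(n_bits, step):
    """Round a modulus size up to the next multiple of the arithmetic step."""
    return -(-n_bits // step) * step  # ceiling division, then scale back up

# GPU branch (if the sizes really are 508/1016): a 509-bit input
# spills over to full 1016-bit arithmetic.
print(padded_bits(509, 508))   # 1016

# avxecm-style steps of 208 or 128 bits give a much closer fit:
print(padded_bits(509, 208))   # 624
print(padded_bits(509, 128))   # 512
```

Roughly speaking, the wasted work scales with the gap between the padded size and the true size, so 512 vs 1016 bits is a big deal for a 509-bit input.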
[QUOTE=EdH;534286]There is a GPU branch in the latest GMPECM which I have played around with using Colab. I have a description [URL="https://www.mersenneforum.org/showthread.php?t=24887"]here[/URL]. Unfortunately, its default is to run the number of curves based on the GPU cores for stage 1 and then all those curves are processed singly by the CPU for stage 2. I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway. Additionally, someone told me that the GPU branch might not be as reliable as the main branch in finding factors.[/QUOTE]
You need to get ecm to do stage 1 on the GPU and save state to a file. Then split that into as many pieces as you have CPUs and run 1 ecm task against each piece. E.g.: [code]
rm $NAME.save
ecm -gpu -save $NAME.save $B1 1 <$INI | tee -a $LOG
split -n r/2 $NAME.save $NAME.save
wait # If you are in a loop
ecm -resume $NAME.saveaa $B1 | tee -a $LOG.1 | grep [Ff]actor &
ecm -resume $NAME.saveab $B1 | tee -a $LOG.2 | grep [Ff]actor &
# This would be the end of the loop
wait # Until all stage 2's end
[/code] $NAME is the name of the project (anything you like). $INI is the name of the file with the number in it. $B1 you can probably guess. $LOG is the name of the log files. If you want to do more curves than the GPU does in 1 go, then put the code into a loop. This is a simplified version of my code, which has to allow for the CPU being faster doing stage 2 than the GPU doing stage 1. But I've not tested this. See ecm's README file for details of what the parameters do. And README.gpu. Chris 
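A GMP-ECM save file holds one residue per line, so any line-wise split balances the stage-2 work; `split -n r/2` in the recipe above distributes lines round-robin. Here is a minimal Python sketch of that distribution step (the data are stand-ins, not real save-file lines):

```python
def split_round_robin(lines, pieces):
    """Deal save-file lines out to `pieces` workers, round-robin,
    like GNU `split -n r/N` does with whole lines."""
    return [lines[i::pieces] for i in range(pieces)]

curves = ['a', 'b', 'c', 'd', 'e']          # stand-ins for saved curves
print(split_round_robin(curves, 2))         # [['a', 'c', 'e'], ['b', 'd']]
```

Round-robin (rather than splitting the file in half by bytes) matters because it never cuts a residue line in two and keeps the piece sizes within one curve of each other.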
[QUOTE=EdH;534286]I have found no way (yet) to get more than 1 CPU to do stage 2 on the Colab instance, which only has 2 CPUs, anyway.[/QUOTE]
Please remember that Colab instances give you two *hyperthreaded* cores; only one real core. I don't know anything about this codebase, but HT'ing is useless for mprime. I've already "burnt" my disposable Kaggle account, so I've been hesitant to experiment with CPU usage there with my main / development account, but CPU-only Kaggle instances give you two real cores, hyperthreaded to four. FWIW. 
[QUOTE=bsquared;560751]I will run some tests using AVX-ECM. Scaled tests, at B1=7e6, show that first-stage curve throughput is about 2.4x larger than GMP-ECM.
... But I've never run AVX-ECM with B1 anywhere close to that large. My bet is it crashes... but I'll test and see what happens.[/QUOTE] Replying here to a post in [URL="https://www.mersenneforum.org/showthread.php?t=25941"]this [/URL]thread, to prevent further hijacking of that thread. I was right, it crashed :smile:. While I was fixing the bug I realized that I was using Montgomery arithmetic for this input that would be much better off using a special Mersenne reduction. So I implemented that feature, and now AVX-ECM is about 4.5 times faster than GMP-ECM on 2^1277-1.

GMP-ECM: [CODE]
B1      Time, 1 curve, seconds
7e5     5.67
7e6     57.2
7e7     581
...
7e9     102351 (actual is wrong here; see below)
[/CODE] AVX-ECM: [CODE]
B1      Time, 8 curves, seconds
7e5     10.5
7e6     101.1
7e7     1019
...
7e9     102351 (actual)
[/CODE] 102351 / 8 seconds is about 3.5 hours per curve, versus about 16.5 hours per curve for GMP-ECM. I've checked the latest code into r393 if anyone wants to do some testing. You will need an AVX-512 capable CPU. 
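Working the per-curve numbers from the two tables above (GMP-ECM times are for 1 curve, AVX-ECM times cover 8 simultaneous curves), the speedup is consistent with the "about 4.5 times faster" figure:

```python
# Per-curve speedup of AVX-ECM over GMP-ECM on 2^1277-1,
# using the timings from the tables above.
gmp = {7e5: 5.67, 7e6: 57.2, 7e7: 581.0, 7e9: 59448.0}
avx = {7e5: 10.5, 7e6: 101.1, 7e7: 1019.0, 7e9: 102351.0}

for b1 in sorted(gmp):
    speedup = gmp[b1] / (avx[b1] / 8)   # divide avx wall time by 8 curves
    print(f"B1={b1:.0e}: {speedup:.2f}x per-curve speedup")
```

The ratio holds fairly steady (roughly 4.3x-4.6x) across four orders of magnitude of B1, which is what you would expect when both programs' stage 1 cost scales linearly in B1.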