mersenneforum.org  

Old 2021-08-24, 23:35   #1
SethTro
 
Faster GPU-ECM with CGBN

Resurrecting this thread. If anyone is running numbers smaller than C155, they should reach out to me.

(Moderator note: Referenced thread is here)

My new CGBN-enabled code is something like 7x faster:

Code:
$ echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 20000
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 93ms of CPU time / 7258ms of GPU time
Computing 3584 Step 2 on CPU took 71933ms

$ echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 15ms of CPU time / 1019ms of GPU time
Computing 3584 Step 2 on CPU took 72142ms
For numbers smaller than C300, it's generally 2-3x faster:

Code:
$  echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
Computing 1792 Step 1 took 91ms of CPU time / 3810ms of GPU time
Computing 1792 Step 2 on CPU took 83417ms

$ echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
Computing 1792 Step 1 took 15ms of CPU time / 1588ms of GPU time
Computing 1792 Step 2 on CPU took 83521ms
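For reference, the quoted speedups can be read straight off the stage-1 GPU times in the two transcripts; this is just arithmetic on the numbers above, not new measurements:

```python
# Stage-1 GPU times (ms) copied from the transcripts above
c146_plain, c146_cgbn = 7258, 1019   # (2^499-1)/20959, 3584 curves
c301_plain, c301_cgbn = 3810, 1588   # 2^997-1, 1792 curves

print(round(c146_plain / c146_cgbn, 1))  # 7.1 -> "something like 7x"
print(round(c301_plain / c301_cgbn, 1))  # 2.4 -> "generally 2-3x"
```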
I'm actively working on the code at https://github.com/sethtroisi/gmp-ec...pu_integration if you are a developer, and I could possibly distribute Linux binaries if we had a place to store them.

Last fiddled with by bsquared on 2021-08-27 at 19:13
Old 2021-08-25, 11:59   #2
henryzz

This is an impressive speedup. I assume there is no chance of stage 2 being ported to run on CUDA? GPU memory was too small back when the old GPU code was written, and I suspect that may still be the case when running many curves in parallel (maybe different sections of the stage 2 range could be done in parallel instead?).
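A sketch of the splitting idea: GMP-ECM already accepts stage 2 as a range (`B1 B2min-B2max` on the command line), so the B2 interval could be cut into chunks and each chunk handed to a separate process. The chunking below and the save-file name `stage1.save` are my own illustration, not anything in the current code:

```python
def split_stage2(b2min, b2max, parts):
    """Cut [b2min, b2max] into contiguous chunks for separate ecm runs."""
    step = (b2max - b2min) // parts
    bounds = [b2min + i * step for i in range(parts)] + [b2max]
    return list(zip(bounds[:-1], bounds[1:]))

# Default B2 for B1=20000, taken from the transcripts above;
# each printed command would cover one slice of the stage 2 range.
for lo, hi in split_stage2(20000, 3804582, 4):
    print(f"./ecm -resume stage1.save 20000 {lo}-{hi}")
```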



Binaries for several programs have been hosted on the forum server. I would suggest messaging Xyzzy.

Does the Windows Visual Studio compilation work for this? I would either need that or CUDA working under WSL2.
Old 2021-08-25, 13:07   #3
bsquared
 

Quote:
Originally Posted by SethTro View Post
My new CGBN enabled code is something like 7x faster [...]
Very nice!

I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, and trying to use these tools just throws a bunch of errors about missing files. Any advice on building this for Linux?

Also, are you by any chance looking at implementing a standard continuation for stage 2 on the GPU? It is very helpful for speeding up the process as a whole if these can also be run in parallel, even given that each curve is slightly less likely to find a factor:

Code:
./yafu "ecm(2^997-1,1792)" -B1ecm 20000 -threads 16


ecm: 1792/1792 curves on C301 @ B1=20000, B2=100*B1
ecm: process took 5.5385 seconds.
[edit]
Sorry, I saw that henryzz already brought this up.

Last fiddled with by bsquared on 2021-08-25 at 13:11
Old 2021-08-25, 15:42   #4
chris2be8
 

I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version:
Code:
Download gmp-ecm-gpu_integration.zip
unzip gmp-ecm-gpu_integration.zip
cd gmp-ecm-gpu_integration
autoreconf -si

./configure --enable-gpu=30 # The previous version needed --enable-gpu=sm30. The following messages look relevant:
-snip-
checking that CUDA Toolkit version and runtime version are the same... (9.1/9.1) yes
-snip-
configure: with_cgbn: , , "
-snip-

make # This fails with messages about CGBN being missing.
What version of CUDA Toolkit and runtime is needed to support CGBN? And where is cgbn.h on your system?

Also what makes of GPU support it? Mine may be too old.

Chris

PS: I use ECM set to do 512-bit arithmetic for numbers below 2^506. This is about 3 times faster than the 1024-bit version, so you may be *only* getting a factor of 3 over all ranges. But that's still very nice.
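The 512-bit vs. 1024-bit point amounts to choosing the smallest kernel width that fits the input with a little headroom. In this sketch the 6-bit margin (512 − 506) is my inference from the 2^506 cutoff quoted above, not something from the source:

```python
def pick_kernel_bits(n, kernels=(512, 1024), margin=6):
    """Smallest available kernel width that fits n plus some carry headroom."""
    bits = n.bit_length()
    for k in sorted(kernels):
        if bits + margin <= k:
            return k
    raise ValueError("number too large for available kernels")

print(pick_kernel_bits((2**499 - 1) // 20959))  # 512  (a 485-bit cofactor)
print(pick_kernel_bits(2**997 - 1))             # 1024
```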

Last fiddled with by chris2be8 on 2021-08-25 at 15:43
Old 2021-08-25, 16:19   #5
bsquared
 

Thanks to the autoreconf -si hint, I've progressed a little further and have run through the configure process for an sm_70 card. But now I'm getting a bunch of these errors:

Code:
ptxas /tmp/tmpxft_00008eaa_00000000-5_cudakernel.ptx, line 2378; error   : Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
which probably has nothing to do with your cgbn enhancements. Does anyone know if gpu-ecm will build and run on an sm_70 card, or have hints on how to proceed?
Old 2021-08-25, 17:34   #6
bsquared
 

Quote:
Originally Posted by bsquared View Post
But now I'm getting a bunch of these errors [...] Anyone know if gpu-ecm will build and run on a sm_70 card or hints on how to proceed?
Fixed it.

On line 10 of cudakernel_default.cu, replace this line:

Code:
while(__any(cy[threadIdx.x])!=0)
with this one:

Code:
while(__any_sync(__activemask(),cy[threadIdx.x])!=0)
Now I have a working gpu-ecm! Later I'll try to get the cgbn part working.

Code:
% echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:6119 (5120 curves)
GPU: Block: 32x32x1 Grid: 160x1x1 (5120 parallel curves)
Computing 5120 Step 1 took 485ms of CPU time / 4990ms of GPU time
Computing 5120 Step 2 on CPU took 247281ms

Last fiddled with by bsquared on 2021-08-25 at 17:52
Old 2021-08-25, 17:48   #7
SethTro
 

Quote:
Originally Posted by bsquared View Post
Very nice!
I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, but trying to use these tools just throws a bunch of errors about missing files. Any advice on building this for linux?
There are some instructions in README.dev. This is what I use:

Code:
autoreconf -i
./configure --enable-gpu=61 --with-cuda=/usr/local/cuda CC=gcc-9 --with-cgbn-include=/home/five/Projects/CGBN/include/cgbn
make -j8
Replace `--enable-gpu=61` with a different compute capability if you need one (and apply the change from #27 if you need sm_70), but you already seem to have that figured out.

Quote:
Originally Posted by chris2be8 View Post
I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version:

What version of CUDA Toolkit and runtime is needed to support CGBN? And where is cgbn.h on your system?
I ran `git clone https://github.com/NVlabs/CGBN.git` under /home/five/Projects,
then added `--with-cgbn-include=/home/five/Projects/CGBN/include/cgbn` to the list of options I pass to `./configure`.

Last fiddled with by SethTro on 2021-08-25 at 18:08
Old 2021-08-25, 18:24   #8
bsquared
 

I got CGBN set up and configured your code with the proper cgbn include path.

Needed to make the following changes to get it to compile:

Code:
Lines 510 and 587, replace 100'000'000 with 1000000000

Around line 640:
//const std::vector<uint32_t> available_kernels = { 512, 1024 };
  uint32_t available_kernels[2] = { 512, 1024 };
  uint32_t num_available_kernels = 2;

and then in the following loop:
//for (kernel_bits : available_kernels) {
  for (i=0; i<num_available_kernels; i++) {
    kernel_bits = available_kernels[i];
Running with -gpu -cgbn gives me an error:

Code:
echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves)
CUDA error occurred: cannot set while device is active in this process
While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync)   (file cgbn_stage1.cu, line 591)
Running with -gpu still works fine.

[edit]
Just commenting out line 591 makes it work. CPU usage does go up during GPU execution, though.

Note the factor of 10 increase in B1 compared to before. Speedup is about 8x! Awesome!

Code:
% echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=200000, B2=0, sigma=3:1000-3:6119 (5120 curves)
Computing 5120 Step 1 took 3587ms of CPU time / 6088ms of GPU time
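The ~8x figure checks out against the earlier -gpu run on the same card (post #6: 5120 curves at B1=20000 in 4990ms of GPU time), comparing stage-1 throughput in B1 units per GPU second; again this is only arithmetic on the posted timings:

```python
# Stage-1 GPU timings from the two transcripts (same card, 5120 curves each)
plain = {"b1": 20_000, "gpu_ms": 4990}   # -gpu,       B1=20000  (post #6)
cgbn = {"b1": 200_000, "gpu_ms": 6088}   # -gpu -cgbn, B1=200000 (above)

def throughput(r):
    """Stage-1 work per GPU second; the curve count cancels in the ratio."""
    return r["b1"] / (r["gpu_ms"] / 1000)

print(round(throughput(cgbn) / throughput(plain), 1))  # 8.2
```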

Last fiddled with by bsquared on 2021-08-25 at 18:30 Reason: new results
Old 2021-08-25, 18:42   #9
SethTro
 

I committed the first tweak you made, so you can `git pull` (you might need `git fetch` then `git reset --hard origin/gpu_integration`).

I'm not sure why you get the error for

Code:
CUDA_CHECK(cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync));
Google suggests this might happen if I've already started running things, but the code shouldn't have at that point.
You could try replacing line 135 in cudakernel.cu (while leaving line 591 commented out):

Code:
-  errCheck (cudaSetDeviceFlags (cudaDeviceScheduleYield));
+ errCheck (cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync));
If anyone else experiences this, let me know so I can try to fix it, but I'm going to ignore it for now.

Last fiddled with by SethTro on 2021-08-25 at 18:47
Old 2021-08-27, 15:46   #10
chris2be8
 

I got that error too:
Code:
$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
CUDA error occurred: cannot set while device is active in this process
While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync)   (file cgbn_stage1.cu, line 601)
Though the different line number makes me suspect I may have an old version of your code. What command should I run to download the latest version?

After commenting out line 601 it works:
Code:
$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
Running GPU kernel<24,128> ...
Copying results back to CPU ...
Computing 384 Step 1 took 1653ms of CPU time / 2928ms of GPU time
But after raising B1 to 200000 it fails:
Code:
$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=200000, B2=0, sigma=3:1000-3:1383 (384 curves)
Running GPU kernel<24,128> ...
CUDA error occurred: the launch timed out and was terminated
While running cudaDeviceSynchronize()   (file cgbn_stage1.cu, line 632)
@bsquared, what happens if you try raising B1 by another factor of 10? How high a B1 will it take?

This GPU has CUDA arch 3.0; is that new enough for CGBN? I also have a newer GPU with CUDA arch 5.2, so I'll try installing on that next.

NB: msieve reports the CUDA arch of the card when it's used for GPU-based poly selection. I'm using that to check what they really support.

Chris

PS. Should discussion of ecm with CGBN be split off into another thread?
Old 2021-08-27, 19:24   #11
bsquared
 

Quote:
Originally Posted by chris2be8 View Post

@bsquared, what happens if you try raising B1 by another factor of 10? How high a B1 will it take?
I have run 2^997-1 up to B1=10M so far with no problems. I will go up another order of magnitude; that should take a few hours.

(2^499-1)/20959 has a factor that is found on my card in stage 1 at B1=2M (sigma=3890).