Faster GPU-ECM with CGBN (https://www.mersenneforum.org/showthread.php?t=27103)

SethTro 2021-08-24 23:35

Faster GPU-ECM with CGBN
 
Resurrecting this thread. If anyone is running numbers smaller than C155 they should reach out to me.

(Moderator note: Referenced thread is [URL="https://www.mersenneforum.org/showthread.php?t=25115"]here[/URL])

My new CGBN-enabled code is something like 7x faster:

[CODE]
$ echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 20000
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 93ms of CPU time / 7258ms of GPU time
Computing 3584 Step 2 on CPU took 71933ms

$ echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 15ms of CPU time / 1019ms of GPU time
Computing 3584 Step 2 on CPU took 72142ms
[/CODE]
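
For anyone who wants to compare their own runs, here is a rough sketch of how the ratio can be pulled out of the two logs. The helper names `gpu_ms` and `speedup` are made up for this example and are not part of GMP-ECM; the only thing assumed about ecm is the "Step 1 took ..." line format shown above.

```shell
#!/bin/sh
# gpu_ms: read ecm output on stdin and print the GPU time in ms taken from the
# "Computing N Step 1 took Xms of CPU time / Yms of GPU time" line.
# (Hypothetical helper, not part of GMP-ECM.)
gpu_ms() {
    sed -n 's/^Computing [0-9]* Step 1 took [0-9]*ms of CPU time \/ \([0-9]*\)ms of GPU time.*$/\1/p'
}

# speedup OLD_MS NEW_MS: print OLD_MS / NEW_MS to two decimal places.
speedup() {
    awk -v t_old="$1" -v t_new="$2" 'BEGIN { printf "%.2f\n", t_old / t_new }'
}

# Step 1 lines copied from the two C146 runs above.
old=$(echo "Computing 3584 Step 1 took 93ms of CPU time / 7258ms of GPU time" | gpu_ms)
new=$(echo "Computing 3584 Step 1 took 15ms of CPU time / 1019ms of GPU time" | gpu_ms)
speedup "$old" "$new"   # prints 7.12
```

In practice the two `echo` lines would be replaced by the saved output of each ecm run piped into `gpu_ms`.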

For numbers smaller than C300 it's generally 2-3x faster:

[CODE]
$ echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
Computing 1792 Step 1 took 91ms of CPU time / 3810ms of GPU time
Computing 1792 Step 2 on CPU took 83417ms

$ echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
Computing 1792 Step 1 took 15ms of CPU time / 1588ms of GPU time
Computing 1792 Step 2 on CPU took 83521ms
[/CODE]

I'm actively working on the code in [url]https://github.com/sethtroisi/gmp-ecm/tree/gpu_integration[/url] if you are a developer, and I could possibly distribute Linux binaries if we had a place to store them.

henryzz 2021-08-25 11:59

This is an impressive speedup. I assume there is no chance of stage 2 being ported to run on CUDA? GPU memory was too small back when the old GPU code was written. I suspect that may still be the case when running many curves in parallel (maybe different sections of the stage 2 range could be done in parallel instead?).
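
To put a number on why stage 2 matters here, a quick back-of-the-envelope sketch using the C146 timings from the first post. It assumes the reported GPU and CPU times simply add up, which ignores that stage 2 could be spread over several CPU cores; the function names are only illustrative.

```shell
#!/bin/sh
# stage2_share S1_MS S2_MS: percentage of the total time spent in stage 2.
stage2_share() {
    awk -v s1="$1" -v s2="$2" 'BEGIN { printf "%.1f\n", 100 * s2 / (s1 + s2) }'
}

# overall_speedup OLD_S1 OLD_S2 NEW_S1 NEW_S2: end-to-end speedup of a whole
# run, assuming stage 1 and stage 2 times are summed.
overall_speedup() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" \
        'BEGIN { printf "%.2f\n", (a + b) / (c + d) }'
}

stage2_share 1019 72142                 # prints 98.6
overall_speedup 7258 71933 1019 72142   # prints 1.08
```

Under that assumption the 7x gain in stage 1 becomes only about 1.08x end to end, because stage 2 on the CPU takes roughly 98.6% of the time.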



Binaries for several programs have been hosted on the forum server. I would suggest messaging Xyyzy.

Does the Windows Visual Studio compilation work for this? I would either need that or CUDA working under WSL2.

bsquared 2021-08-25 13:07

[QUOTE=SethTro;586454]My new CGBN enabled code is something like 7x faster[/QUOTE]

Very nice!

I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, and trying to use these tools just throws a bunch of errors about missing files. Any advice on building this for Linux?

Also are you by any chance looking at implementing a standard continuation for stage 2 on the GPU? It is very helpful for speeding up the process as a whole if these can also be run in parallel, even given that each curve is slightly less likely to find a factor:

[CODE]./yafu "ecm(2^997-1,1792)" -B1ecm 20000 -threads 16


ecm: 1792/1792 curves on C301 @ B1=20000, B2=100*B1
ecm: process took 5.5385 seconds.[/CODE]

[edit]
Sorry, I see that henryzz already brought this up.

chris2be8 2021-08-25 15:42

I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version:
[code]
Download gmp-ecm-gpu_integration.zip
unzip gmp-ecm-gpu_integration.zip
cd gmp-ecm-gpu_integration
autoreconf -si

./configure --enable-gpu=30 # The previous version needed --enable-gpu=sm30. The following messages look relevant:
-snip-
checking that CUDA Toolkit version and runtime version are the same... (9.1/9.1) yes
-snip-
configure: with_cgbn: , , "
-snip-

make # This fails with messages about CGBN being missing.
[/code]

What version of CUDA Toolkit and runtime is needed to support CGBN? And where is cgbn.h on your system?

Also what makes of GPU support it? Mine may be too old.

Chris

PS. I use ECM set to do 512-bit arithmetic for numbers below 2^506. This is about 3 times faster than the 1024-bit version. So you may be *only* getting a factor of 3 over all ranges. But that's still very nice.
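
As a rough way to tell from the digit count whether a number falls under that 2^506 limit, here is a small sketch. It uses the bound ceil(digits * log2(10)) on the bit length, so it is conservative: a number can be a bit smaller than the bound suggests. The function names are made up for illustration, and the 506-bit threshold is simply the figure quoted above.

```shell
#!/bin/sh
# max_bits DIGITS: upper bound on the bit length of a DIGITS-digit number,
# i.e. ceil(DIGITS * log2(10)).
max_bits() {
    awk -v d="$1" 'BEGIN {
        x = d * log(10) / log(2)
        b = int(x)
        if (x > b) b++
        print b
    }'
}

# fits_512 DIGITS: succeed if every DIGITS-digit number is below 2^506.
fits_512() {
    [ "$(max_bits "$1")" -le 506 ]
}

max_bits 146    # prints 486
fits_512 146 && echo "146 digits: 512-bit arithmetic is enough"
fits_512 301 || echo "301 digits: too large for 512-bit arithmetic"
```

By this bound anything up to 152 digits is guaranteed to fit; 153 digits and above would need to be checked against the actual bit length.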

bsquared 2021-08-25 16:19

Thanks to the autoreconf -si hint, I've progressed a little further and have run through a configure process for a sm_70 card. But now I'm getting a bunch of these errors:

[CODE]ptxas /tmp/tmpxft_00008eaa_00000000-5_cudakernel.ptx, line 2378; error : Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
[/CODE]

which probably has nothing to do with your CGBN enhancements. Does anyone know whether gpu-ecm will build and run on an sm_70 card, or have hints on how to proceed?

bsquared 2021-08-25 17:34

[QUOTE=bsquared;586493]Anyone know if gpu-ecm will build and run on a sm_70 card or hints on how to proceed?[/QUOTE]

Fixed it.

On line 10 of cudakernel_default.cu replace this line:

[CODE]
while(__any(cy[threadIdx.x])!=0)
[/CODE]

with this one

[CODE]
while(__any_sync(__activemask(),cy[threadIdx.x])!=0)
[/CODE]

Now I have a working gpu-ecm! Later I'll try to get the CGBN part working.

[CODE]
% echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:6119 (5120 curves)
GPU: Block: 32x32x1 Grid: 160x1x1 (5120 parallel curves)
Computing 5120 Step 1 took 485ms of CPU time / 4990ms of GPU time
Computing 5120 Step 2 on CPU took 247281ms
[/CODE]

SethTro 2021-08-25 17:48

[QUOTE=bsquared;586471]Very nice!
I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, but trying to use these tools just throws a bunch of errors about missing files. Any advice on building this for linux?
[/QUOTE]

There are some instructions in README.dev.
This is what I use:

[CODE]
autoreconf -i
./configure --enable-gpu=61 --with-cuda=/usr/local/cuda CC=gcc-9 -with-cgbn-include=/home/five/Projects/CGBN/include/cgbn
make -j8
[/CODE]

Replace `--enable-gpu=61` with a different compute capability if you need one (and apply the change from [URL="https://www.mersenneforum.org/showpost.php?p=586504&postcount=27"]#27[/URL] if you need SM_70), but you already seem to have that figured out.

[QUOTE=chris2be8;586488]I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version:

What version of CUDA Toolkit and runtime is needed to support CGBN? And where is cgbn.h on your system?
[/QUOTE]

I ran `git clone https://github.com/NVlabs/CGBN.git` under /home/five/Projects
then added `-with-cgbn-include=/home/five/Projects/CGBN/include/cgbn` to the list of options I pass to `./configure`
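
Putting those steps together, a minimal sketch of the whole sequence as a script, meant to be run from inside the gmp-ecm checkout. The `run` and `build_ecm` helpers and the `DRY_RUN` switch are invented for this example; the configure options are the ones from the command above, written with the conventional double dash, and `CC=gcc-9` is left out because it is specific to my machine.

```shell
#!/bin/sh
# run CMD...: execute the command, or only print it when DRY_RUN is set.
# (Hypothetical helper so the sequence can be inspected before running it.)
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_ecm COMPUTE_CAP CGBN_DIR [CUDA_DIR]
# Clone CGBN if needed, then configure and build from the current directory.
build_ecm() {
    cap="$1"
    cgbn_dir="$2"
    cuda_dir="${3:-/usr/local/cuda}"

    # Fetch CGBN only if it is not already present.
    if [ ! -d "$cgbn_dir" ]; then
        run git clone https://github.com/NVlabs/CGBN.git "$cgbn_dir" || return 1
    fi
    run autoreconf -i || return 1
    run ./configure --enable-gpu="$cap" --with-cuda="$cuda_dir" \
        --with-cgbn-include="$cgbn_dir/include/cgbn" || return 1
    run make -j8
}

# Preview the commands for a compute capability 6.1 card without running them.
DRY_RUN=1 build_ecm 61 "$HOME/Projects/CGBN"
```

Dropping the `DRY_RUN=1` prefix runs the commands for real; change `61` to match your card.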

bsquared 2021-08-25 18:24

I got CGBN set up and configured your code with the proper cgbn include path.

Needed to make the following changes to get it to compile:

[CODE]
Lines 510 and 587: replace 100'000'000 with 100000000

Around line 640:
//const std::vector<uint32_t> available_kernels = { 512, 1024 };
uint32_t available_kernels[2] = { 512, 1024 };
uint32_t num_available_kernels = 2;

and then in the following loop:
//for (kernel_bits : available_kernels) {
for (uint32_t i = 0; i < num_available_kernels; i++) {
    kernel_bits = available_kernels[i];
[/CODE]

Running with -gpu -cgbn gives me an error:

[CODE]echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves)
CUDA error occurred: cannot set while device is active in this process
While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 591)
[/CODE]

Running with -gpu still works fine.

[edit]
Just commenting out line 591 makes it work. CPU usage does go up during GPU execution though.

Note the factor of 10 increase in B1 compared to before. Speedup is about 8x! Awesome!

[CODE]% echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=200000, B2=0, sigma=3:1000-3:6119 (5120 curves)
Computing 5120 Step 1 took 3587ms of CPU time / 6088ms of GPU time
[/CODE]

SethTro 2021-08-25 18:42

I committed the first tweak you made, so you can `git pull` (you might need `git fetch` followed by `git reset --hard origin/gpu_integration`).

I'm not sure why you get the error for

[CODE]CUDA_CHECK(cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync));
[/CODE]

Google suggests this might happen if I've already started running things on the device, but the code shouldn't have at that point.
You could try replacing line 135 in cudakernel.cu (while leaving line 591 commented out):

[CODE]
- errCheck (cudaSetDeviceFlags (cudaDeviceScheduleYield));
+ errCheck (cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync));
[/CODE]

If anyone else experiences this, let me know so I can try to fix it, but I'm going to ignore it for now.

chris2be8 2021-08-27 15:46

I got that error too:
[code]
$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
CUDA error occurred: cannot set while device is active in this process
While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 601)
[/code]

Though the different line number makes me suspect I may have an old version of your code. What command should I run to download the latest version?

After commenting out line 601 it works:
[code]
$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
Running GPU kernel<24,128> ...
Copying results back to CPU ...
Computing 384 Step 1 took 1653ms of CPU time / 2928ms of GPU time
[/code]

But after raising B1 to 200000 it fails:
[code]
$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=200000, B2=0, sigma=3:1000-3:1383 (384 curves)
Running GPU kernel<24,128> ...
CUDA error occurred: the launch timed out and was terminated
While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 632)
[/code]

@bsquared, what happens if you try raising B1 by another factor of 10? How high a B1 will it take?

This GPU has CUDA arch 3.0, is that new enough for CGBN? I also have a newer GPU with CUDA arch 5.2, I'll try installing on that next.

NB. msieve says what CUDA arch the card is when used for GPU based poly selection. I'm using that to check what they really support.

Chris

PS. Should discussion of ecm with CGBN be split off into another thread?

bsquared 2021-08-27 19:24

[QUOTE=chris2be8;586664]

@bsquared, what happens if you try raising B1 by another factor of 10? How high a B1 will it take?
[/QUOTE]

I have run 2^997-1 up to B1=10M so far with no problems. I will go up another order of magnitude; should take a few hours.

(2^499-1)/20959 has a factor that is found on my card in stage 1 at B1=2M (sigma=3890).
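
For anyone wanting to repeat the "raise B1 by another factor of 10" test on their own card, a small sketch that only prints the command lines rather than running them. The helper names are made up; the ecm options are copied from the commands earlier in the thread, with the trailing 0 setting B2=0 so that only stage 1 runs.

```shell
#!/bin/sh
# b1_values START COUNT: print COUNT values of B1, each ten times the last.
b1_values() {
    b1="$1"
    n="$2"
    while [ "$n" -gt 0 ]; do
        echo "$b1"
        b1=$((b1 * 10))
        n=$((n - 1))
    done
}

# sweep_cmds EXPR START COUNT: print one stage-1-only ecm command per B1 value.
sweep_cmds() {
    for b1 in $(b1_values "$2" "$3"); do
        printf 'echo "%s" | ./ecm -gpu -cgbn -sigma 3:1000 %s 0\n' "$1" "$b1"
    done
}

# Four runs on 2^997-1, from B1=20000 up to B1=20000000.
sweep_cmds "(2^997-1)" 20000 4
```

Piping the output into `sh` would run the commands one after another, stopping being up to whoever watches for the timeout error.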


All times are UTC.