Faster GPU-ECM with CGBN
Resurrecting this thread. If anyone is running numbers smaller than C155 they should reach out to me.
(Moderator note: Referenced thread is [URL="https://www.mersenneforum.org/showthread.php?t=25115"]here[/URL]) My new CGBN-enabled code is something like 7x faster:
[CODE]
$ echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 20000
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 93ms of CPU time / 7258ms of GPU time
Computing 3584 Step 2 on CPU took 71933ms

$ echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 15ms of CPU time / 1019ms of GPU time
Computing 3584 Step 2 on CPU took 72142ms
[/CODE]
For numbers smaller than C300 it's generally 2-3x faster:
[CODE]
$ echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
Computing 1792 Step 1 took 91ms of CPU time / 3810ms of GPU time
Computing 1792 Step 2 on CPU took 83417ms

$ echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
Computing 1792 Step 1 took 15ms of CPU time / 1588ms of GPU time
Computing 1792 Step 2 on CPU took 83521ms
[/CODE]
I'm actively working on the code in [url]https://github.com/sethtroisi/gmp-ecm/tree/gpu_integration[/url] if you are a developer, and I could possibly distribute Linux binaries if we had a place to store them.
This is an impressive speedup. I assume there is no chance of stage 2 being ported to run on CUDA? GPU memory was too small back when the old GPU code was written. I suspect that may still be the case when running many curves in parallel (maybe different sections of the stage 2 range could be done in parallel instead?).
Binaries for several programs have been hosted on the forum server. I would suggest messaging Xyzzy. Does the Windows Visual Studio compilation work for this? I would need either that or CUDA working under WSL2.
[QUOTE=SethTro;586454]Resurrecting this thread. If anyone is running numbers smaller than C155 they should reach out to me.
My new CGBN enabled code is something like 7x faster [CODE] $ echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 20000 Input number is (2^499-1)/20959 (146 digits) Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves) Computing 3584 Step 1 took 93ms of CPU time / 7258ms of GPU time Computing 3584 Step 2 on CPU took 71933ms $$ echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 Input number is (2^499-1)/20959 (146 digits) Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves) Computing 3584 Step 1 took 15ms of CPU time / 1019ms of GPU time Computing 3584 Step 2 on CPU took 72142ms [/CODE] For numbers smaller than C300 It's generally 2-3x faster [CODE] $ echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000 Input number is (2^997-1) (301 digits) Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves) Computing 1792 Step 1 took 91ms of CPU time / 3810ms of GPU time Computing 1792 Step 2 on CPU took 83417ms $ echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 Input number is (2^997-1) (301 digits) Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves) Computing 1792 Step 1 took 15ms of CPU time / 1588ms of GPU time Computing 1792 Step 2 on CPU took 83521ms [/CODE] I'm working on the code actively in [url]https://github.com/sethtroisi/gmp-ecm/tree/gpu_integration[/url] if you are a developer and can possible distribute Linux binaries if we had a place to store them.[/QUOTE] Very nice! I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, but trying to use these tools just throws a bunch of errors about missing files. Any advice on building this for linux? Also are you by any chance looking at implementing a standard continuation for stage 2 on the GPU? It is very helpful for speeding up the process as a whole if these can also be run in parallel, even given that each curve is slightly less likely to find a factor: [CODE]./yafu "ecm(2^997-1,1792)" -B1ecm 20000 -threads 16 ecm: 1792/1792 curves on C301 @ B1=20000, B2=100*B1 ecm: process took 5.5385 seconds.[/CODE] [edit] sorry saw that henryzz already brought this up |
I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version:
[code]
Download gmp-ecm-gpu_integration.zip
unzip gmp-ecm-gpu_integration.zip
cd gmp-ecm-gpu_integration
autoreconf -si
./configure --enable-gpu=30   # The previous version needed --enable-gpu=sm30.
# The following messages look relevant:
-snip-
checking that CUDA Toolkit version and runtime version are the same... (9.1/9.1) yes
-snip-
configure: with_cgbn: , , "
-snip-
make   # This fails with messages about CGBN being missing.
[/code]
What version of CUDA Toolkit and runtime is needed to support CGBN? And where is cgbn.h on your system? Also what makes of GPU support it? Mine may be too old.

Chris

PS I use ECM set to do 512 bit arithmetic for numbers below 2^506. This is about 3 times faster than the 1024 bit version. So you may be *only* getting a factor of 3 over all ranges. But that's still very nice.
Thanks to the autoreconf -si hint, I've progressed a little further and have run through a configure process for a sm_70 card. But now I'm getting a bunch of these errors:
[CODE]
ptxas /tmp/tmpxft_00008eaa_00000000-5_cudakernel.ptx, line 2378; error : Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
[/CODE]
which probably has nothing to do with your cgbn enhancements. Anyone know if gpu-ecm will build and run on an sm_70 card, or have hints on how to proceed?
[QUOTE=bsquared;586493]Thanks to the autoreconf -si hint, I've progressed a little further and have run through a configure process for a sm_70 card. But now I'm getting a bunch of these errors:
[CODE]
ptxas /tmp/tmpxft_00008eaa_00000000-5_cudakernel.ptx, line 2378; error : Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
[/CODE]
which probably has nothing to do with your cgbn enhancements. Anyone know if gpu-ecm will build and run on an sm_70 card, or have hints on how to proceed?[/QUOTE] Fixed it. On line 10 of cudakernel_default.cu replace this line:
[CODE]
while(__any(cy[threadIdx.x])!=0)
[/CODE]
with this one:
[CODE]
while(__any_sync(__activemask(),cy[threadIdx.x])!=0)
[/CODE]
Now I have a working gpu-ecm! Later I'll try to get the cgbn part working.
[CODE]
% echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:6119 (5120 curves)
GPU: Block: 32x32x1 Grid: 160x1x1 (5120 parallel curves)
Computing 5120 Step 1 took 485ms of CPU time / 4990ms of GPU time
Computing 5120 Step 2 on CPU took 247281ms
[/CODE]
[QUOTE=bsquared;586471]Very nice!
I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, but trying to use these tools just throws a bunch of errors about missing files. Any advice on building this for linux? [/QUOTE] There are some instructions in README.dev. This is what I use:
[CODE]
autoreconf -i
./configure --enable-gpu=61 --with-cuda=/usr/local/cuda CC=gcc-9 -with-cgbn-include=/home/five/Projects/CGBN/include/cgbn
make -j8
[/CODE]
Replace `--enable-gpu=61` with a different compute capability if you need one (and apply the change from [URL="https://www.mersenneforum.org/showpost.php?p=586504&postcount=27"]#27[/URL] if you need SM_70), but you already seem to have that figured out.

[QUOTE=chris2be8;586488]I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version: What version of CUDA Toolkit and runtime is needed to support CGBN? And where is cgbn.h on your system? [/QUOTE] I ran `git clone https://github.com/NVlabs/CGBN.git` under /home/five/Projects then added `-with-cgbn-include=/home/five/Projects/CGBN/include/cgbn` to the list of options I pass to `./configure`.
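Pulling those pieces together, a from-scratch build looks roughly like the sketch below. The compute capability (61), the compiler (gcc-9), and the paths are just examples from my setup; adjust them for your card and system.
[CODE]
# Get CGBN (header-only) and the gpu_integration branch of gmp-ecm
git clone https://github.com/NVlabs/CGBN.git ~/Projects/CGBN
git clone https://github.com/sethtroisi/gmp-ecm.git -b gpu_integration
cd gmp-ecm

# Configure for your GPU's compute capability and point at the CGBN headers
autoreconf -i
./configure --enable-gpu=61 --with-cuda=/usr/local/cuda CC=gcc-9 \
    -with-cgbn-include=$HOME/Projects/CGBN/include/cgbn
make -j8

# Quick smoke test
echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -sigma 3:1000 20000
[/CODE]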
I got CGBN set up and configured your code with the proper cgbn include path.
Needed to make the following changes to get it to compile:
[CODE]
Lines 510 and 587, replace 100'000'000 with 1000000000

Around line 640:
//const std::vector<uint32_t> available_kernels = { 512, 1024 };
uint32_t available_kernels[2] = { 512, 1024 };
uint32_t num_available_kernels = 2;

and then in the following loop:
//for (kernel_bits : available_kernels) {
for (i=0; i<num_available_kernels; i++) {
    kernel_bits = available_kernels[i];
[/CODE]
Running with -gpu -cgbn gives me an error:
[CODE]
echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves)
CUDA error occurred: cannot set while device is active in this process
While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 591)
[/CODE]
Running with -gpu still works fine.

[edit] Just commenting out line 591 makes it work. CPU usage does go up during GPU execution, though. Note the factor of 10 increase in B1 compared to before. Speedup is about 8x! Awesome!
[CODE]
% echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=200000, B2=0, sigma=3:1000-3:6119 (5120 curves)
Computing 5120 Step 1 took 3587ms of CPU time / 6088ms of GPU time
[/CODE]
I committed the first tweak you made so you can 'git pull' (you might need `git fetch` `git reset --hard origin/gpu_integration`)
I'm not sure why you get the error for
[CODE]
CUDA_CHECK(cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync));
[/CODE]
Google suggests this might happen if I've already started running things, but the code shouldn't have at that point. You could try replacing line 135 in cudakernel.cu (while leaving line 591 commented out):
[CODE]
- errCheck (cudaSetDeviceFlags (cudaDeviceScheduleYield));
+ errCheck (cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync));
[/CODE]
If anyone else experiences this let me know so I can try to fix it, but I'm going to ignore it for now.
I got that error too:
[code] $ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0 GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is ((2^499-1)/20959) (146 digits) Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves) CUDA error occurred: cannot set while device is active in this process While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 601) [/code] Though the different line number makes me suspect I may have an old version of your code. What command should I run to download the latest version? After commenting out line 601 it works: [code] $ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0 GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is ((2^499-1)/20959) (146 digits) Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves) Running GPU kernel<24,128> ... Copying results back to CPU ... Computing 384 Step 1 took 1653ms of CPU time / 2928ms of GPU time [/code] But after raising B1 to 200000 it fails: [code] $ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0 GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is ((2^499-1)/20959) (146 digits) Using B1=200000, B2=0, sigma=3:1000-3:1383 (384 curves) Running GPU kernel<24,128> ... CUDA error occurred: the launch timed out and was terminated While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 632) [/code] @bsquared, what happens if you try raising B1 by another factor of 10? How high a B1 will it take? This GPU has CUDA arch 3.0, is that new enough for CGBN? I also have a newer GPU with CUDA arch 5.2, I'll try installing on that next. NB. msieve says what CUDA arch the card is when used for GPU based poly selection. I'm using that to check what they really support. Chris PS. Should discussion of ecm with CGBN be split off into another thread? |
[QUOTE=chris2be8;586664]
@bsquared, what happens if you try raising B1 by another factor of 10? How high a B1 will it take? [/QUOTE] I have run 2^997-1 up to B1=10M so far with no problems. I will go up another order of magnitude; should take a few hours. (2^499-1)/20959 has a factor that is found on my card in stage 1 at B1=2M (sigma=3890). |
[QUOTE=bsquared;586677]I have run 2^997-1 up to B1=10M so far with no problems. I will go up another order of magnitude; should take a few hours.
(2^499-1)/20959 has a factor that is found on my card in stage 1 at B1=2M (sigma=3890).[/QUOTE] You might try (2^419-1)/4398294875195008479937661267298757530097217 which has a 38 digit factor or (2^569-1)/160592976218334727184554268072735638438202191 which has a 42 digit factor [QUOTE=chris2be8;586664]I got that error too: [code] $ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0 GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is ((2^499-1)/20959) (146 digits) Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves) CUDA error occurred: cannot set while device is active in this process While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 601) [/code] Though the different line number makes me suspect I may have an old version of your code. What command should I run to download the latest version? [/QUOTE] You can run `git pull` (tries to pull any changes I've made but fails if I do a bad thing that I like to do). If that fails you can run this command to reset to the state of my branch (this is destructive of any changes you made) `git fetch` `git reset --hard origin/gpu_integration` Thanks for confirming you also see the cudaError, I'll investigate more now that multiple people see it. |
That was a painful configure. I'm using nvhpc rather than the CUDA toolkit, so the directory structure is different and the cuda lib is just a stub so the check for cuInit() fails. I can point to the actual lib, but the version doesn't match the nvhpc version (which is fine, nvhpc doesn't require them to match) but that check fails. Then the check for CGBN ignores the --with-gmp directory so that check fails because it can't find GMP but reports that it can't find cgbn.h. After bypassing all of those checks, I get a working binary. Yay!
[CODE]$ echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0 GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is (2^499-1)/20959 (146 digits) Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves) Computing 5120 Step 1 took 282ms of CPU time / 705ms of GPU time $ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 2000000 0 GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is ((2^499-1)/20959) (146 digits) Using B1=2000000, B2=0, sigma=3:1000-3:6119 (5120 curves) GPU: factor 1998447222711143545931606352264121 found in Step 1 with curve 2890 (-sigma 3:3890) Computing 5120 Step 1 took 47146ms of CPU time / 112480ms of GPU time ********** Factor found in step 1: 1998447222711143545931606352264121 Found prime factor of 34 digits: 1998447222711143545931606352264121 Prime cofactor (((2^499-1)/20959))/1998447222711143545931606352264121 has 113 digits [/CODE] |
[QUOTE=frmky;586708]That was a painful configure. I'm using nvhpc rather than the CUDA toolkit, so the directory structure is different and the cuda lib is just a stub so the check for cuInit() fails. I can point to the actual lib, but the version doesn't match the nvhpc version (which is fine, nvhpc doesn't require them to match) but that check fails. Then the check for CGBN ignores the --with-gmp directory so that check fails because it can't find GMP but reports that it can't find cgbn.h. After bypassing all of those checks, I get a working binary. Yay!
[CODE]
$ echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves)
Computing 5120 Step 1 took 282ms of CPU time / 705ms of GPU time

$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 2000000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=2000000, B2=0, sigma=3:1000-3:6119 (5120 curves)
GPU: factor 1998447222711143545931606352264121 found in Step 1 with curve 2890 (-sigma 3:3890)
Computing 5120 Step 1 took 47146ms of CPU time / 112480ms of GPU time
********** Factor found in step 1: 1998447222711143545931606352264121
Found prime factor of 34 digits: 1998447222711143545931606352264121
Prime cofactor (((2^499-1)/20959))/1998447222711143545931606352264121 has 113 digits
[/CODE][/QUOTE] Glad you got a working binary! Would you mind measuring the speedup of echo "2^997-1" with -gpu vs -cgbn?

I feel for you on setup. I didn't know any automake, and figuring out how to configure it took me a whole day. If you were willing, you could try making this change and seeing if the cgbn.h check respects the --with-gmp-lib path after it:
[CODE]
diff --git a/acinclude.m4 b/acinclude.m4
index fbbf94df..04694003 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -612,7 +612,7 @@ AS_IF([test "x$enable_gpu" = "xyes" ],
             #include <gmp.h>
             #include <cgbn.h>
         ],
-        [-I$cgbn_include -lgmp],
+        [-I$cgbn_include $GMPLIB],
         [AC_MSG_RESULT([yes])],
         [
           AC_MSG_RESULT([no])
@@ -620,7 +620,7 @@ AS_IF([test "x$enable_gpu" = "xyes" ],
         ]
       )
       AC_DEFINE([HAVE_CGBN_H], [1], [Define to 1 if cgbn.h exists])
-      NVCCFLAGS="-I$with_cgbn_include -lgmp $NVCCFLAGS"
+      NVCCFLAGS="-I$with_cgbn_include $GMPLIB $NVCCFLAGS"
       want_cgbn="yes"
   ])
 ])
[/CODE]
I made a fairly large change so that the GPU kernel is called in much smaller batches. This helps with system responsiveness and also gives a progress status.
After further testing on the system I was trying it on, I'm fairly sure it won't work because the GPU is sm_30 (Kepler GK104) and CGBN needs sm_35 (Kepler GK110) or higher. I had to look in the Makefiles shipped with CGBN to find out which sm_.. corresponds to which marketing name Nvidia uses.
I have a newer GPU which is sm_52. But that system has an old version of the CUDA Toolkit and runtime (7.5/7.5). And that doesn't support CGBN. So I'll have to upgrade CUDA on that system. Which will take a while. I've just used [c]git clone 'https://github.com/sethtroisi/gmp-ecm.git'[/c] to grab a clean copy of the latest code instead of downloading gmp-ecm-gpu_integration.zip from the web site. I assume that's the best way to get it. PS. [c]git clone 'https://github.com/sethtroisi/gmp-ecm.git'[/c] isn't right, it doesn't include cgbn_stage1.* So what should I use? |
[QUOTE=chris2be8;586726]After further testing on the system I was trying it on I'm fairly sure it won't work because the GPU is sm_30 (Fermi architecture) and CGBN needs sm_35 (Kepler) or higher. I had to look in the Makefiles shipped with CGBN to find out what sm_.. corresponds to what marketing name Nvidia use.
...[/QUOTE]Thank you! I've been watching this thread, wondering if I should try it with my sm_30 card, which won't build GMP-ECM or Msieve with CUDA 10.x even though that version is supposed to support the architecture. CUDA 11.x does not support it.
There still seem to be cudaDeviceSynchronize issues, although I get mine at a very different line number.
[CODE]
echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
CUDA error occurred: unknown error
While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 733)
[/CODE]
I have managed to get the non-cgbn code working under WSL2, although this required updating Windows to 21H2 to enable GPU support, which is only available as a preview so far.
[QUOTE=EdH;586728]Thank you! I've been watching this thread, wondering if I should try it with my sm_30 card, that won't build GMP-ECM or Msieve with CUDA 10.x, which is supposed to support that architecture. CUDA 11.x does not support it.[/QUOTE]
My system with that card has CUDA 9.1 on it. So if you can "upgrade" to that level, ecm and msieve should build.
[QUOTE=chris2be8;586726]After further testing on the system I was trying it on I'm fairly sure it won't work because the GPU is sm_30 (Fermi architecture) and CGBN needs sm_35 (Kepler) or higher. I had to look in the Makefiles shipped with CGBN to find out what sm_.. corresponds to what marketing name Nvidia use.
I have a newer GPU which is sm_52. But that system has an old version of the CUDA Toolkit and runtime (7.5/7.5). And that doesn't support CGBN. So I'll have to upgrade CUDA on that system. Which will take a while. I've just used [c]git clone 'https://github.com/sethtroisi/gmp-ecm.git'[/c] to grab a clean copy of the latest code instead of downloading gmp-ecm-gpu_integration.zip from the web site. I assume that's the best way to get it. PS. [c]git clone 'https://github.com/sethtroisi/gmp-ecm.git'[/c] isn't right, it doesn't include cgbn_stage1.* So what should I use?[/QUOTE] The correct command to download the CGBN branch should be [c]git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration folder_name[/c]. PS: Also receiving the same CUDA error as chris2be8, [c]$ echo "2^997-1" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0 GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is 2^997-1 (301 digits) Using B1=20000, B2=0, sigma=3:1000-3:1639 (640 curves) CUDA error occurred: cannot set while device is active in this process While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 600)[/c] |
WraithX [URL="https://github.com/sethtroisi/gmp-ecm/pull/3#pullrequestreview-741052640"]proposed a patch[/URL] which will hopefully resolve the `CUDA error occurred: cannot set while device is active in this process` error. It should be checked in later today / tonight.
With Wraith's patch (with the header fixed) I now get:
[CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 2000 GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM] Input number is (2^499-1)/20959 (146 digits) Using B1=2000, B2=147396, sigma=3:1000-3:4583 (3584 curves) Computing 3584 Step 1 took 1863ms of CPU time / 2291ms of GPU time Computing 3584 Step 2 on CPU took 12404ms echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 2000 GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM] Input number is (2^499-1)/20959 (146 digits) Using B1=2000, B2=147396, sigma=3:1000-3:4583 (3584 curves) GPU: Block: 32x32x1 Grid: 112x1x1 (3584 parallel curves) Computing 3584 Step 1 took 3668ms of CPU time / 6199ms of GPU time Computing 3584 Step 2 on CPU took 12445ms[/CODE] Although with B1=20000 I still get: [CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM] Input number is (2^499-1)/20959 (146 digits) Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves) CUDA error (702) occurred: the launch timed out and was terminated While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 731)[/CODE] |
[QUOTE=henryzz;586760]
Although with B1=20000 I still get: [CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM] Input number is (2^499-1)/20959 (146 digits) Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves) CUDA error (702) occurred: the launch timed out and was terminated While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 731)[/CODE][/QUOTE] What happens if you specify 0 for B2? Like this: [CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 0[/CODE] |
[QUOTE=WraithX;586761]What happens if you specify 0 for B2? Like this:
[CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 0[/CODE][/QUOTE] The same thing. If I run fewer curves at once it works. Possibly just that my GPU is pathetic (750 Ti):
[CODE]
echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:1319 (320 curves)
Computing 320 Step 1 took 756ms of CPU time / 1269ms of GPU time
Computing 320 Step 2 on CPU took 7488ms
[/CODE]
You might try changing this in cgbn_stage1.cu:
[CODE]
-#define S_BITS_PER_CALL 10000
+#define S_BITS_PER_CALL 1000
[/CODE]
then running with -v, which might tell you when the GPU died (and also might prevent timeouts):
[CODE]
$ echo "(2^499-1)/20959" | ./ecm -v -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^499-1)/20959 (146 digits)
GPU: will use device 0: GeForce GTX 1080 Ti, compute capability 6.1, 28 MPs.
Using B1=20000, B2=0, sigma=3:1000-3:4583 (3584 curves)
Running CGBN<512,4> kernel<112,128> at bit 0/28820 (0.0%)...
Running CGBN<512,4> kernel<112,128> at bit 1000/28820 (3.5%)...
...
Running CGBN<512,4> kernel<112,128> at bit 27000/28820 (93.7%)...
Running CGBN<512,4> kernel<112,128> at bit 28000/28820 (97.2%)...
Copying results back to CPU ...
Computing 3584 Step 1 took 15ms of CPU time / 1105ms of GPU time
Throughput: 3244.848 curves per second (on average 0.31ms per Step 1)
[/CODE]
[QUOTE=SethTro;586711]Glad you got a working binary! Would you mind measuring the speedup of echo "2^997-1" with -gpu vs -cgbn?[/QUOTE]
[CODE]$ echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000 0 GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is (2^997-1) (301 digits) Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves) GPU: Block: 32x32x1 Grid: 160x1x1 (5120 parallel curves) Computing 5120 Step 1 took 183ms of CPU time / 5364ms of GPU time $ echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0 GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is (2^997-1) (301 digits) Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves) Computing 5120 Step 1 took 1284ms of CPU time / 3057ms of GPU time [/CODE] I'll try the configure changes later. Overnight I ran 2560 stage-1 curves on the C201 blocking the aliquot sequence starting at 3366 using B1=85e7. I'm working through stage 2 on those now. |
Those changes to acinclude.m4 aren't enough. It still can't find gmp.h during the test compile. We need to add a -I for the gmp include directory. And that breaks the build since it's trying to include libgmp.a during compile.
Reducing S_BITS_PER_CALL has fixed it for me. Thank you 😀
Current git fails for inputs near 512 bits. It seems that a condition is the wrong way around:
[CODE]
diff --git a/cgbn_stage1.cu b/cgbn_stage1.cu
index 1b512ecd..f67f8715 100644
--- a/cgbn_stage1.cu
+++ b/cgbn_stage1.cu
@@ -653,7 +653,7 @@ int run_cgbn(mpz_t *factors, int *array_stage_found,
 #endif /* IS_DEV_BUILD */
   for (int k_i = 0; k_i < available_kernels.size(); k_i++) {
     uint32_t kernel_bits = available_kernels[k_i];
-    if (kernel_bits + 6 >= mpz_sizeinbase(N, 2)) {
+    if (kernel_bits >= mpz_sizeinbase(N, 2) + 6) {
       BITS = kernel_bits;
       assert( BITS % 32 == 0 );
       TPI = (BITS <= 512) ? 4 : (BITS <= 2048) ? 8 : (BITS <= 8192) ? 16 : 32;
[/CODE]
[QUOTE=Gimarel;586805]Current git fails for inputs near 512 Bits. It seems that there is a condition the wrong way:[CODE]diff --git a/cgbn_stage1.cu b/cgbn_stage1.cu
index 1b512ecd..f67f8715 100644 --- a/cgbn_stage1.cu +++ b/cgbn_stage1.cu @@ -653,7 +653,7 @@ int run_cgbn(mpz_t *factors, int *array_stage_found, #endif /* IS_DEV_BUILD */ for (int k_i = 0; k_i < available_kernels.size(); k_i++) { uint32_t kernel_bits = available_kernels[k_i]; - if (kernel_bits + 6 >= mpz_sizeinbase(N, 2)) { + if (kernel_bits >= mpz_sizeinbase(N, 2) + 6) { BITS = kernel_bits; assert( BITS % 32 == 0 ); TPI = (BITS <= 512) ? 4 : (BITS <= 2048) ? 8 : (BITS <= 8192) ? 16 : 32;[/CODE][/QUOTE] Whoops, totally backwards, coding is hard :p I'll fix it tonight. Thanks for testing |
Has anyone checked ecm-cgbn can find factors? On my system with a sm_30 GPU I updated test.gpuecm to pass -cgbn to ecm. But it failed to find any factors when the test cases expected them to be found!
It is *probably* because sm_30 is too low for CGBN. It will be a while before I can test my newer GPU. The system it's on is running an old version of Linux which doesn't support CUDA 9.0. (I've been working on an "if it works, don't fix it" basis since it's only used for computations.) Upgrading Linux will probably need a complete re-install, which I'll need to plan for a time when I don't need the machine for a few hours/days. And I'd be happier if I were sure CGBN would work once I got it installed.
[QUOTE=chris2be8;586868]Has anyone checked ecm-cgbn can find factors? On my system with a sm_30 GPU I updated test.gpuecm to pass -cgbn to ecm. But it failed to find any factors when the test cases expected them to be found!
It is *probably* because sm_30 is too low for CGBN. It will be a while before I can test my newer GPU. The system it's on is running an old version of Linux which doesn't support CUDA 9.0. (I've been working on an "if it works, don't fix it" basis since it's only used for computations.) Upgrading Linux will probably need a complete re-install, which I'll need to plan for a time when I don't need the machine for a few hours/days. And I'd be happier if I were sure CGBN would work once I got it installed.[/QUOTE] Yes, many of us have found the same test factor for (2^499-1)/20959 and I've verified several times that the residues exactly match those produced by `-gpu`. I've also tested with `$ sage check_gpuecm.sage "./ecm -cgbn"`.
[QUOTE=chris2be8;586868]Has anyone checked ecm-cgbn can find factors?[/QUOTE]
Yes, test.gpuecm completes successfully both with and without -cgbn. I'm using a V100 with CUDA 11.3.
Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?
[QUOTE=frmky;586881]Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?[/QUOTE]
I am working on the ability to process ecm save files with yafu, but it isn't ready yet.
[QUOTE=frmky;586881]Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?[/QUOTE]Not sure if I'm understanding the question, but would [URL="https://www.mersenneforum.org/showthread.php?t=15508"]ECM.py[/URL] work?
Edit: For my Colab-GPU ECM experiments, I use [code]python3 ecm.py -resume residues[/code] to run the residues from the Colab GPU stage 1 portion. I think I have all the threads, etc. set in the Python code, but they can be used on the command line, as well. The latest version is [URL="https://www.mersenneforum.org/showpost.php?p=518249&postcount=109"]here[/URL].
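If you'd rather script the manual approach by hand, it is only a few lines of shell. This is just a sketch; the input number, B1, file names, and number of chunks are placeholders, not anything specific to this thread.
[CODE]
# Stage 1 on the GPU only (B2=0), saving one residue line per curve
echo "(2^997-1)" | ./ecm -gpu -cgbn -save residues.txt 1000000 0

# Split the save file into 16 chunks without breaking lines (GNU split)
split -n l/16 residues.txt part_

# Resume stage 2 on the CPU, one process per chunk
for f in part_*; do
  ./ecm -resume "$f" 1000000 &
done
wait
[/CODE]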
@EdH I started using ECM.py again and it's great!
---

I wrote a bunch of code today so S_BITS_PER_BATCH is dynamic and there's better verbose output. Verbose output includes this message when the kernel size is much larger than the input number:
[CODE]
Input number is 2^239-1 (72 digits)
Compiling custom kernel for 256 bits should be ~180% faster
CGBN<512, 4> running kernel<56 block x 128 threads>
[/CODE]
I doubt that verbose is the right place for this output (as I'm not sure how many people can actually recompile CUDA code), but if you have a working setup it's as easy as changing
[CODE]
- typedef cgbn_params_t<4, 512> cgbn_params_4_512;
+ typedef cgbn_params_t<4, 256> cgbn_params_4_512;
[/CODE]
---

ETA and estimated throughput:
[CODE]
Copying 716800 bits of data to GPU
CGBN<640, 8> running kernel<112 block x 128 threads>
Computing 100 bits/call, 0/4328085 (0.0%)
Computing 110 bits/call, 100/4328085 (0.0%)
Computing 121 bits/call, 210/4328085 (0.0%)
...
Computing 256 bits/call, 1584/4328085 (0.0%)
Computing 655 bits/call, 5630/4328085 (0.1%)
Computing 1694 bits/call, 16050/4328085 (0.4%)
Computing 2049 bits/call, 35999/4328085 (0.8%), ETA 184 + 2 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 56489/4328085 (1.3%), ETA 183 + 2 = 185 seconds (~103 ms/curves)
...
Computing 2049 bits/call, 158939/4328085 (3.7%), ETA 178 + 7 = 185 seconds (~103 ms/curves)
Computing 2049 bits/call, 363839/4328085 (8.4%), ETA 169 + 16 = 185 seconds (~103 ms/curves)
...
Computing 2049 bits/call, 1798139/4328085 (41.5%), ETA 109 + 77 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 2003039/4328085 (46.3%), ETA 100 + 86 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 4052039/4328085 (93.6%), ETA 12 + 175 = 187 seconds (~104 ms/curves)
Copying results back to CPU ...
Computing 1792 Step 1 took 240ms of CPU time / 186575ms of GPU time
Throughput: 9.605 curves per second (on average 104.12ms per Step 1)
[/CODE]
This is nice as it gives very early feedback (estimates after 1-5 seconds are very accurate) if you are changing `-gpucurves` or playing with custom kernel bit sizes. I've found that doubling gpucurves can lead to 2x worse throughput! So I may need to add some warnings.
[QUOTE=SethTro;586911]@EdH I started using ECM.py again and it's great!
--- [/QUOTE]Good to read. I just wish I could get my sm_30 card to do something. . . (2 sm-20s and 1 sm_30 and none will do anything productive, . . . yet. With all the install/reinstall/remove activity, now the sm_30 machine is complaining about a linux-kernel, so I've taken a break from trying more.) |
I couldn't find it in the thread (hope I didn't just overlook it), how does the speed of ECM on GPU generally compare to CPU? Say a GTX 1660 or similar.
And is it so that only small B1 values can be used? I found [URL="https://eprint.iacr.org/2020/1265.pdf"]this paper[/URL] and they also only seem to have used B1=50k. With a 2080 Ti they achieved "2781 ECM trials", I guess curves, per second for B1=50k. That is very fast, but if the B1 size is severely limited, a CPU is still required for larger factors? |
[QUOTE=bur;586936]I couldn't find it in the thread (hope I didn't just overlook it), how does the speed of ECM on GPU generally compare to CPU? Say a GTX 1660 or similar.
And is it so that only small B1 values can be used? I found [URL="https://eprint.iacr.org/2020/1265.pdf"]this paper[/URL] and they also only seem to have used B1=50k. With a 2080 Ti they achieved "2781 ECM trials", I guess curves, per second for B1=50k. That is very fast, but if the B1 size is severely limited, a CPU is still required for larger factors?[/QUOTE] The most important factor is the size of N (which is limitted by CGBN to 32K for GPUs or ~10,000 digits). Both CPU and GPU have the same linear scaling for B1 which can be increased to any number you want. the speedup is strongly depends on your CPU vs GPU. For my 1080ti vs 2600K 250 bits 46x faster on GPU 500 bits 48x faster on GPU 1000 bits 68x faster on GPU 1500 bits 83x faster on GPU 2000 bits 46x faster on GPU Which means we are seeing roughly the same scaling for the GPU as CPU for bit levels < 2K. Informal testing with larger inputs (2048 - 32,768 bits) bits shows the CPU outscales GPU for larger inputs and the speedup slowly decreases from ~50x to ~25x as bits increase from 2K to 16K. At the maximal value of 32K bits performances has decreases again to 14x (from 26x at 16K bits) |
[QUOTE=frmky;586881]Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?[/QUOTE]It is what I used to do when GPU-enabled ECM still worked on my machines. It was a trivial script to write.
I re-cloned the gpu_integration branch to capture the latest changes and went through the build process with the following caveats:
Specifying --with-gmp together with --with-cgbn-include doesn't work. I had to use the system default gmp (6.0.0).

With compute 70 I still have to replace __any with __any_sync(__activemask() on line 10 of cudakernel_default.cu.

Building with gcc I get this error in cgbn_stage1.cu:
cgbn_stage1.cu(654): error: initialization with "{...}" is not allowed for object of type "const std::vector<uint32_t, std::allocator<uint32_t>>"
I suppose I need to build with g++ instead?

Anyway, I can get past all of that and get a working binary, and the CPU usage is now much lower. But now the GPU portion appears to be about 15% slower?

Before:
[CODE]
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 75571ms of CPU time / 129206ms of GPU time
Throughput: 39.627 curves per second (on average 25.24ms per Step 1)
[/CODE]
New clone:
[CODE]
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 643ms of CPU time / 149713ms of GPU time
Throughput: 34.199 curves per second (on average 29.24ms per Step 1)
[/CODE]
Anyone else seeing this?
Hello,
I've upgraded my system with a GTX 970 (sm_52) to openSUSE 42.2 and installed CUDA 9.0 on it. But when I try to compile ecm with GPU support ./configure says: [code] configure: Using cuda.h from /usr/local/cuda/include checking cuda.h usability... no checking cuda.h presence... yes configure: WARNING: cuda.h: present but cannot be compiled configure: WARNING: cuda.h: check for missing prerequisite headers? configure: WARNING: cuda.h: see the Autoconf documentation configure: WARNING: cuda.h: section "Present But Cannot Be Compiled" configure: WARNING: cuda.h: proceeding with the compiler's result configure: WARNING: ## ----------------------------------- ## configure: WARNING: ## Report this to ecm-discuss@inria.fr ## configure: WARNING: ## ----------------------------------- ## checking for cuda.h... no configure: error: required header file missing [/code] README.gpu says: [code] Some versions of CUDA are not compatible with recent versions of gcc. To specify which C compiler is called by the CUDA compiler nvcc, type: $ ./configure --enable-gpu --with-cuda-compiler=/PATH/DIR If you get errors about "cuda.h: present but cannot be compiled" Try using an older CC: $ ./configure --enable-gpu CC=gcc-8 The value of this parameter is directly passed to nvcc via the option "--compiler-bindir". By default, GMP-ECM lets nvcc choose what C compiler it uses. [/code] The only gcc installed now is version 4.8.5. Should I install an older gcc (if so what level) or should I upgrade the OS to a higher level so I can install a newer CUDA? Does anyone have ecm working with CUDA 9.0 or higher on openSUSE and if so what level of openSUSE? Chris (getting slightly frustrated by now) |
[QUOTE=chris2be8;587001]The only gcc installed now is version 4.8.5. Should I install an older gcc (if so what level) or should I upgrade the OS to a higher level so I can install a newer CUDA? Does anyone have ecm working with CUDA 9.0 or higher on openSUSE and if so what level of openSUSE?
Chris (getting slightly frustrated by now)[/QUOTE]I've passed the frustration point with my systems. I was getting the same with my Ubuntu 20.04 with all the 10.x and 11.x CUDA versions (my card isn't supported by CUDA 11.x, anyway). I installed and made default several older gcc versions (8, 9, 10).* I gave up for now.

* I'm curious about the gcc version number difference between yours and mine. The default Ubuntu 20.04 gcc is 9.3.0, my Debian Buster is 8.3.0, and the default for my Fedora 33 is 10.3.1. Is your version actually that old compared to mine?
[c]gcc --version[/c] returns:
[code] gcc (SUSE Linux) 4.8.5 Copyright (C) 2015 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. [/code] [c]zypper search gcc[/c] shows it as gcc48 and says gcc5 and gcc6 could also be installed. I've installed clang as well: [code] clang --version clang version 3.8.0 (tags/RELEASE_380/final 262553) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /usr/bin [/code] but that gets a different error: [code] ./configure --enable-gpu=50 --with-cuda=/usr/local/cuda --with-cuda-compiler=clang CC=clang ... configure: Using nvcc compiler from from /usr/local/cuda/bin checking for compatibility between gcc and nvcc... no configure: error: gcc version is not compatible with nvcc [/code] I don't think my problems are due to openSUSE. So if someone who has ecm with cgbn working on any Linux distro could say what version of CUDA and what compiler version they have I could probably get it working. |
[QUOTE=bsquared;586996]I re-cloned the gpu_integration branch to capture the latest changes and went through the build process with the following caveats:
specifying --with-gmp together with --with-cgbn-include doesn't work. I had to use the system default gmp (6.0.0). With compute 70 I still have to replace __any with __any_sync(__activemask() on line 10 of cudakernel_default.cu. Building with gcc I get this error in cgbn_stage1.cu: cgbn_stage1.cu(654): error: initialization with "{...}" is not allowed for object of type "const std::vector<uint32_t, std::allocator<uint32_t>>" I suppose I need to build with g++ instead? [/QUOTE] [B]I rebased the branch to clean up the git history, so everyone will likely need to `git pull` and `git reset --hard origin/gpu_integration`. I'm sorry, but we're in development and the history is much nicer to review now.[/B]

I fixed the vector initialization issue and have included your "__any_sync(__activemask()" fix in the repo (I forgot to credit you in the commit, but I'll try to do that the next time I rebase).

I'm not sure why --with-gmp doesn't work with --with-cgbn-include; if you have some sense of why, I'm happy to try to fix it. If it's failing on "checking if CGBN is present..." maybe try adding more flags to acinclude.m4:617 [-I$cgbn_include $GMPLIB], maybe "-I$with_gmp_include" and/or "-L$with_gmp_lib".
[QUOTE=EdH;587008]I've passed the frustration point with my systems. I was getting the same with my Ubuntu 20.04 with all the 10.x and 11.x CUDA versions (my card isn't supported by CUDA 11.x, anyway). I installed and made default several older gcc versions (8, 9, 10).* I gave up for now.
* I'm curious about the gcc version number difference between yours and mine. The default Ubuntu 20.04 gcc is 9.3.0, my Debian Buster is 8.3.0, and the default for my Fedora 33 is 10.3.1. Is your version actually that old compared to mine?[/QUOTE] I know that feeling and I really empathize. I'm building on the pile of kludge that is CUDA and I wish I could make this easier. Did you try with CC=gcc-9? I can also maybe add some debug output to the configure log to show which CC it's using.

I personally use this to configure:
[CODE]./configure --enable-gpu=61 --with-cuda=/usr/local/cuda CC=gcc-9 -with-cgbn-include=/home/five/Projects/CGBN/include/cgbn[/CODE]
and my gcc / nvcc versions are:
[CODE]
$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
[/CODE]
If you tell me what compute / sm_arch your card is I can try building and sending you a binary.
In my case, everything except ECM and Msieve seemed to be working, but I've uninstalled everything now, and I thought from a few posts ago that my arch 3.0 was perhaps too ancient, 3.5 being necessary. ATM, updates, etc. are also giving me errors, so I was going to step back for a bit. I've uninstalled all the CUDA, NVIDIA, etc. from the system. In its latest iteration, although I had installed CUDA 10.2, nvcc and nvidia-smi claimed to be running CUDA 11, which does not support architecture 3.0. I'll try another installation some time soon and then see where it stalls. If I can't get ECM to build for GPU with my card, there is no point trying to add in cgbn, is there?
Thanks!
[QUOTE=EdH;587008]I've passed the frustration point with my systems. I was getting the same with my Ubuntu 20.04 with all the 10.x and 11.x CUDA versions (my card isn't supported by CUDA 11.x, anyway). I installed and made default several older gcc versions (8, 9, 10).* I gave up for now.
* I'm curious about the gcc version number difference between yours and mine. The default Ubuntu 20.04 gcc is 9.3.0, my Debian Buster is 8.3.0, and the default for my Fedora 33 is 10.3.1. Is your version actually that old compared to mine?[/QUOTE]Which is why I would love for someone to make a [B]fully[/B] static Linux executable for a relatively low SM value. OK, it would not be as fast as the latest and greatest, but at least it would be much faster than a purely CPU version. I'd do it myself but haven't been able to compile with CUDA for far too long now. :sad:
And I've been having "fun" with msieve's CUDA support. The version I had been running failed saying [c][sort_engine.cu, 95] sort engine: (CUDA error 78: a PTX JIT compilation failed)[/c] (probably because it was compiled with an old version of CUDA). So I decided to install the latest version of msieve, revision 1043. Which also failed, with a message saying "file not found" but of course not saying *which* file it could not find. After a lot of puzzling I found revision 1043 notes the card is compute architecture 5.2 and tries to load stage1_core_sm52.ptx. But the Makefile as shipped is only set up to build ptx files for sm20, sm30, sm35 and sm50. So you are out of luck with any other architecture. I hacked the Makefile, first to remove sm20, which CUDA 9.0 doesn't support, then to add sm52 once I realised that was missing.
The makefile probably should build ptx files for all of this list:
[code]
~/msieve.1043/trunk> strings msieve | grep ptx
stage1_core_sm20.ptx
stage1_core_sm30.ptx
stage1_core_sm35.ptx
stage1_core_sm50.ptx
stage1_core_sm52.ptx
stage1_core_sm61.ptx
stage1_core_sm70.ptx
stage1_core_sm75.ptx
stage1_core_sm86.ptx
[/code]
If I hadn't known of the [c]strings[/c] command I would have been stuck.
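If the Makefile doesn't have a rule for your architecture, the missing ptx file can probably also be generated by hand with nvcc. This is only a sketch: it assumes the kernel source is stage1_core.cu (as the existing sm_xx targets suggest) and that the include paths match what the Makefile already passes.
[CODE]
# Hypothetical manual build of the missing ptx for an sm_52 card
nvcc -ptx -arch=sm_52 -o stage1_core_sm52.ptx stage1_core.cu
[/CODE]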
[QUOTE=bsquared;586996]
Anyway I can get past all of that and get a working binary and the cpu usage is now much lower. But now the gpu portion appears to be about 15% slower? Before: [CODE] Input number is 2^997-1 (301 digits) Computing 5120 Step 1 took 75571ms of CPU time / 129206ms of GPU time Throughput: 39.627 curves per second (on average 25.24ms per Step 1) [/CODE] New clone: [CODE] Input number is 2^997-1 (301 digits) Computing 5120 Step 1 took 643ms of CPU time / 149713ms of GPU time Throughput: 34.199 curves per second (on average 29.24ms per Step 1) [/CODE] Anyone else seeing this?[/QUOTE] Can you try running with `-v --gpucurves 1280` and `--gpucurves 2560` (if you are having fun you can also try 640 and 1792)? The new code should give you approximate timings quite quickly so no need to complete a full run. I have seen 2x and 4x slowdowns when gpucurves is large. I may need to put in some code that searches for optimal throughput. |
[QUOTE=xilman;587015]Which is why I would love for someone to make a [B]fully[/B] static Linux executable for a relatively low SM value.
OK, it would not be as fast as the latest and greatest but at least it would be much faster than a purely cpu version. I'd do it myself but haven't been able to compile with CUDA for far too long now.:sad:[/QUOTE] I don't know how static linking works especially with respect to CUDA but I compilled ecm with all supported SM (including sm35 and sm70) using CUDA 11.2. Feel free to try it, but I wouldn't be to hopeful. It doesn't run in colab and gives an error ./ecm_cgbn_cuda11_2: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by ./ecm_cgbn_cuda11_2) [url]https://static.cloudygo.com/static/ecm_cgbn_cuda11_2[/url] ^ I pinky-promise this isn't a virus |
cudacommon.h is missing from the git repository.
[QUOTE=chris2be8;587010][c]gcc --version[/c] returns:
[code] gcc (SUSE Linux) 4.8.5 Copyright (C) 2015 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. [/code][c]zypper search gcc[/c] shows it as gcc48 and says gcc5 and gcc6 could also be installed. [/quote] My guess is that your gcc version may be too old. I would try the most recent version you can get your hands on. The easiest way may be to update your OS into a version that isn't end of life. |
[QUOTE=SethTro;587019]Can you try running with `-v --gpucurves 1280` and `--gpucurves 2560` (if you are having fun you can also try 640 and 1792)?
The new code should give you approximate timings quite quickly so no need to complete a full run. I have seen 2x and 4x slowdowns when gpucurves is large. I may need to put in some code that searches for optimal throughput.[/QUOTE] 1280: (~31 ms/curves) 2560: (~21 ms/curves) 640: (~63 ms/curves) 1792: (~36 ms/curves) So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves)) With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds. |
[QUOTE=frmky;587024]cudacommon.h is missing from the git repository.[/QUOTE]
Fixed along with another issue. |
[QUOTE=bsquared;587026]1280: (~31 ms/curves)
2560: (~21 ms/curves)
640: (~63 ms/curves)
1792: (~36 ms/curves)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ ~25 ms/curves).

With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.[/QUOTE] I added `gpu_throughput_test.sh` which runs different sized inputs and measures throughput. On my system maximum throughput is achieved at:

256 bits: 2x default curves (or 3584 curves), same speed at 4x default too
512 bits: 2x and 4x default curves
1024 bits: only at default curves
extra testing at 2048 bits: 1.5x and 3x outperform 2x and 4x slightly
[QUOTE=SethTro;587033]I added `gpu_throughput_test.sh` which runs different sized inputs and measures throughput.
On my system maximum throughput is achieved at:

256 bits: 2x default curves (or 3584 curves), same speed at 4x default too
512 bits: 2x and 4x default curves
1024 bits: only at default curves
extra testing at 2048 bits: 1.5x and 3x outperform 2x and 4x slightly[/QUOTE] Maybe this relates to registers used by the kernel? Max threads per block? Any insight from CUDA experts would be appreciated.
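One way to check the register side of that question (my suggestion, not something the build does by default) is to ask ptxas for its resource-usage report when the CUDA objects are compiled; it prints registers per thread and shared memory per block for each kernel, which is what bounds how many threads can be resident at once.
[CODE]
# Add -Xptxas -v to whatever nvcc command line the build already uses
# (keep the existing include/arch flags); the output lists per-kernel
# register and shared-memory usage.
nvcc -Xptxas -v ... cgbn_stage1.cu
[/CODE]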
I halved compile time by adding cgbn_swap and avoiding inlining double_add_v2 twice.
Sadly I pushed the branch, so it will probably fail to compile for everyone till [url]https://github.com/NVlabs/CGBN/pull/17[/url] gets pulled.

---

@bsquared, you might try changing TPB_DEFAULT from 128 to 512. In some initial testing with ./gpu_throughput_test.sh it looks like larger gpucurves don't slow down any more; more testing to follow tomorrow.
[QUOTE=henryzz;587025]My guess is that your gcc version may be too old. I would try the most recent version you can get your hands on. The easiest way may be to update your OS into a version that isn't end of life.[/QUOTE]
I've installed gcc-6 (the latest in the repositories) and that gets past that error, but fails a bit further on: [code] gcc-6 --version gcc-6 (SUSE Linux) 6.2.1 20160826 [gcc-6-branch revision 239773] Copyright (C) 2016 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. ./configure --enable-gpu=30 --with-cuda=/usr/local/cuda CC=gcc-6 -with-cgbn-include=/home/chris/CGBN/include/cgbn ... configure: Using cuda.h from /usr/local/cuda/include checking cuda.h usability... yes checking cuda.h presence... yes checking for cuda.h... yes checking that CUDA Toolkit version is at least 3.0... (9.0) yes configure: Using CUDA dynamic library from /usr/local/cuda/lib64 checking for cuInit in -lcuda... yes checking that CUDA Toolkit version and runtime version are the same... no configure: error: 'cuda.h' and 'cudart' library have different versions, you have to reinstall CUDA properly, or use the --with-cuda parameter to tell configure the path to the CUDA library and header you want to use [/code] That error message doesn't make much sense because I only have one version of CUDA installed on the system. So it's probably failing to compile a test program. So I'll try upgrading the OS next. Then install later versions of CUDA and gcc. |
[QUOTE=chris2be8;587067]I've installed gcc-6 (the latest in the repositories) and that gets past that error, but fails a bit further on:
[code] configure: error: 'cuda.h' and 'cudart' library have different versions, you have to reinstall CUDA properly, or use the --with-cuda parameter to tell configure the path to the CUDA library and header you want to use [/code] That error message doesn't make much sense because I only have one version of CUDA installed on the system. So it's probably failing to compile a test program. [/QUOTE] You can find the literal program it failed to compile in config.log or the shape in acinclude.m4 (basically wrap the 2nd block in int maint() { ... }) [CODE] AC_RUN_IFELSE([AC_LANG_PROGRAM([ [ #include <stdio.h> #include <string.h> #include <cuda.h> #include <cuda_runtime.h> ]],[[ int libversion; cudaError_t err; err = cudaRuntimeGetVersion (&libversion); if (err != cudaSuccess) { printf ("Could not get runtime version\n"); printf ("Error msg: %s\n", cudaGetErrorString(err)); return -1; } printf("(%d.%d/", CUDA_VERSION/1000, (CUDA_VERSION/10) % 10); printf("%d.%d) ", libversion/1000, (libversion/10) % 10); if (CUDA_VERSION == libversion) return 0; else return 1; ]])], [/CODE] And you can find the command line it tried to compile this with in config.log too (my guess is something like gcc-9 -o conftest -I/usr/local/cuda/include -g -O2 -I/usr/local/cuda/include -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/ local/cuda/lib64 conftest.c -lcudart -lstdc++ -lcuda -lrt ) |
I think this can be triggered if the version of CUDA supported by the driver doesn't match the toolkit version. But this is usually ok as long as the driver is a little newer. I think both this and the lack of cuInit() in the CUDA lib should be warnings, not errors. Both of these are ok in some circumstances.
Happy me!
I found two 35 digit factors from a [URL="http://factordb.com/index.php?id=1100000002657449020"]C303[/URL] today (from [URL="https://docs.google.com/spreadsheets/d/1IuxGlf6dEUd8Qixu87P-_r6sgdG7Yl8UUPXS6rKBpbM/edit#gid=1905095108"]Factoring for a publication[/URL]) [CODE] GPU: factor 404157820975138535541421971085010741 found in Step 1 with curve 1796 (-sigma 3:1850760857) GPU: factor 404157820975138535541421971085010741 found in Step 1 with curve 2049 (-sigma 3:1850761110) GPU: factor 404157820975138535541421971085010741 found in Step 1 with curve 2449 (-sigma 3:1850761510) Computing 3584 Step 1 took 2294ms of CPU time / 1816867ms of GPU time ********** Factor found in step 1: 404157820975138535541421971085010741 Found prime factor of 36 digits: 404157820975138535541421971085010741 [/CODE] Then [CODE] Thu 2021/09/02 23:25:50 UTC Step 1 took 0ms Thu 2021/09/02 23:25:50 UTC Step 2 took 9668ms Thu 2021/09/02 23:25:50 UTC ********** Factor found in step 2: 51858345311243630596653971633910169 Thu 2021/09/02 23:25:50 UTC Found prime factor of 35 digits: 51858345311243630596653971633910169 [/CODE] Feels good that this code is being useful :) |
[QUOTE=SethTro;587108]Feels good that this code is being useful :)[/QUOTE]
Nearly all of the factors that I found for Factoring for a Publication 2 used this code. |
I'm still puzzling over it. I've upgraded the system to openSUSE Leap 15.3 and installed CUDA 11.4. But no matter what I do [c]lspci -v[/c] still says [c]Kernel modules: nouveau[/c]
I've tried everything I can find in the CUDA Installation Guide for Linux. And everything I can find on the web. But it still loads the nouveau kernel module, not the one shipped with CUDA. Has anyone any idea how to get it to use the Nvidia drivers? NB. On the system with the GTX 970: [code] 4core:~ # lspci -v -s 01:00 01:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1) (prog-if 00 [VGA controller]) Subsystem: eVga.com. Corp. Device 3978 Flags: fast devsel, IRQ 11 Memory at f6000000 (32-bit, non-prefetchable) [disabled] [size=16M] Memory at e0000000 (64-bit, prefetchable) [disabled] [size=256M] Memory at f0000000 (64-bit, prefetchable) [disabled] [size=32M] I/O ports at e000 [disabled] [size=128] Expansion ROM at f7000000 [disabled] [size=512K] Capabilities: [60] Power Management version 3 Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+ Capabilities: [78] Express Legacy Endpoint, MSI 00 Capabilities: [100] Virtual Channel Capabilities: [250] Latency Tolerance Reporting Capabilities: [258] L1 PM Substates Capabilities: [128] Power Budgeting <?> Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?> Capabilities: [900] #19 Kernel modules: nouveau [/code] Compare with on the system with a CC 3.0 card: [code] root@sirius:~# lspci -v -s 07:00 07:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 760] (rev a1) (prog-if 00 [VGA controller]) Subsystem: Micro-Star International Co., Ltd. [MSI] GK104 [GeForce GTX 760] Flags: bus master, fast devsel, latency 0, IRQ 76 Memory at f6000000 (32-bit, non-prefetchable) [size=16M] Memory at e8000000 (64-bit, prefetchable) [size=128M] Memory at f0000000 (64-bit, prefetchable) [size=32M] I/O ports at e000 [size=128] [virtual] Expansion ROM at 000c0000 [disabled] [size=128K] Capabilities: [60] Power Management version 3 Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+ Capabilities: [78] Express Endpoint, MSI 00 Capabilities: [b4] Vendor Specific Information: Len=14 <?> Capabilities: [100] Virtual Channel Capabilities: [128] Power Budgeting <?> Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?> Capabilities: [900] #19 Kernel driver in use: nvidia Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia [/code] Compare the last line of output in each case. If it's because CUDA 11.4 doesn't support this card I could try removing CUDA 11.4 and installing CUDA 10.x But would that work. |
Following [URL="https://askubuntu.com/questions/1095825/what-does-modprobe-blacklist-nouveau-do"]this solution[/URL] (although Ubuntu)
[QUOTE] Boot to Ubuntu, but before you login in to Ubuntu, press Cntrl+Alt+F2 run the following command: sudo nano /etc/modprobe.d/blacklist-nouveau.conf add the 2 following lines, save & exit blacklist nouveau options nouveau modeset=0 run the following command sudo update-initramfs -u [/QUOTE] reboot. run [c]lsmod | grep nvidia[/c] HTH |
Thanks, but I've already tried that:
[code] 4core:/etc/modprobe.d # cat 60-blacklist.nouveau.conf blacklist nouveau options nouveau modeset=0 [/code] And it is in the current initramfs: [code] 4core:/etc/modprobe.d # lsinitrd -f /etc/modprobe.d/60-blacklist.nouveau.conf blacklist nouveau options nouveau modeset=0 [/code] lsmod doesn't show any nvidia kernel modules: [code] 4core:/etc/modprobe.d # lsmod | grep -i nvidia 4core:/etc/modprobe.d # [/code] On my system where CUDA (but not cgbn) works: [code] root@sirius:~# lsmod | grep nvidia nvidia_uvm 876544 0 nvidia_drm 49152 5 nvidia_modeset 1122304 14 nvidia_drm nvidia 19517440 682 nvidia_uvm,nvidia_modeset drm_kms_helper 180224 1 nvidia_drm drm 483328 8 drm_kms_helper,nvidia_drm ipmi_msghandler 102400 2 ipmi_devintf,nvidia [/code] |
Did you install through Yast or a direct download from nVidia?
|
I've already tried that:
[code] 4core:/etc/modprobe.d # cat 60-blacklist.nouveau.conf blacklist nouveau options nouveau modeset=0 [/code] And it is in initrd: [code] 4core:/etc/modprobe.d # lsinitrd -f /etc/modprobe.d/60-blacklist.nouveau.conf blacklist nouveau options nouveau modeset=0 [/code] Digging a bit further I don't think the nvidia kernel modules are correctly installed: [code] 4core:/lib/modules # find . -name 'nvidia*' ./4.12.14-lp150.12.82-default/updates/nvidia-uvm.ko ./4.12.14-lp150.12.82-default/updates/nvidia-modeset.ko ./4.12.14-lp150.12.82-default/updates/nvidia.ko ./4.12.14-lp150.12.82-default/updates/nvidia-drm.ko ./5.3.18-57-default/weak-updates/updates/nvidia-uvm.ko ./5.3.18-57-default/weak-updates/updates/nvidia-modeset.ko ./5.3.18-57-default/weak-updates/updates/nvidia.ko ./5.3.18-57-default/weak-updates/updates/nvidia-drm.ko ./5.3.18-57-default/kernel/drivers/net/ethernet/nvidia ./5.3.18-57-preempt/kernel/drivers/net/ethernet/nvidia ./5.3.18-59.19-preempt/kernel/drivers/net/ethernet/nvidia ./5.3.18-59.19-default/weak-updates/updates/nvidia-uvm.ko ./5.3.18-59.19-default/weak-updates/updates/nvidia-modeset.ko ./5.3.18-59.19-default/weak-updates/updates/nvidia.ko ./5.3.18-59.19-default/weak-updates/updates/nvidia-drm.ko ./5.3.18-59.19-default/kernel/drivers/net/ethernet/nvidia 4core:/lib/modules # uname -r 5.3.18-59.19-preempt [/code] So the kernel I'm running won't find them because it will look in 5.3.18-59.19-preempt even though they are installed in 5.3.18-59.19-default (next question, how to fix this cleanly). But at least I think I know where I'm going now. |
[QUOTE=paulunderwood;587259]Did you install through Yast or a direct download from nVidia?[/QUOTE]
zypper on the command line. Following the instructions on Nvidia's web site [url]https://developer.nvidia.com/cuda-downloads[/url] |
Some of the instructions I saw in the past had a separate step, almost hidden, that was required to install the driver. Is it possible there is a driver install step missing in your procedure?
For my Ubuntu repository install of 10.2, it automatically installs the 470 driver, no matter what I have beforehand. Is there an equivalent to this Ubuntu command?:[code]sudo [B]ubuntu-drivers devices[/B] WARNING:root:_pkg_get_support nvidia-driver-390: package has invalid Support Legacyheader, cannot determine support level == /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0 == modalias : pci:v000010DEd00000FFDsv0000103Csd00000967bc03sc00i00 vendor : NVIDIA Corporation model : GK107 [NVS 510] driver : nvidia-driver-450-server - distro non-free driver : nvidia-driver-450 - third-party non-free driver : nvidia-driver-460-server - distro non-free driver : nvidia-driver-455 - third-party non-free driver : nvidia-driver-418-server - distro non-free driver : nvidia-340 - distro non-free driver : nvidia-driver-465 - third-party non-free driver : nvidia-driver-390 - distro non-free driver : nvidia-driver-470 - third-party non-free recommended driver : nvidia-driver-418 - third-party non-free driver : nvidia-driver-410 - third-party non-free driver : nvidia-driver-470-server - distro non-free driver : nvidia-driver-440 - third-party non-free driver : nvidia-driver-460 - third-party non-free driver : xserver-xorg-video-nouveau - distro free builtin[/code]Would such be of any help? |
After rebooting using the 5.3.18-59.19-default kernel the nvidia drivers are picked up:
[code] 4core:~ # lspci -v -s 01:00 01:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1) (prog-if 00 [VGA controller]) Subsystem: eVga.com. Corp. Device 3978 Flags: bus master, fast devsel, latency 0, IRQ 16 Memory at f6000000 (32-bit, non-prefetchable) [size=16M] Memory at e0000000 (64-bit, prefetchable) [size=256M] Memory at f0000000 (64-bit, prefetchable) [size=32M] I/O ports at e000 [size=128] [virtual] Expansion ROM at f7000000 [disabled] [size=512K] Capabilities: [60] Power Management version 3 Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+ Capabilities: [78] Express Legacy Endpoint, MSI 00 Capabilities: [100] Virtual Channel Capabilities: [258] L1 PM Substates Capabilities: [128] Power Budgeting <?> Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?> Capabilities: [900] #19 Kernel driver in use: nvidia Kernel modules: nouveau, nvidia_drm, nvidia [/code] I'll need to fix that but it can wait for now. Then I started testing things ... msieve works OK: [code] Sat Sep 4 19:10:51 2021 Msieve v. 1.54 (SVN 1043) Sat Sep 4 19:10:51 2021 random seeds: 6e515738 cae1a347 Sat Sep 4 19:10:51 2021 factoring 1522605027922533360535618378132637429718068114961380688657908494580122963258952897654000350692006139 (100 digits) Sat Sep 4 19:10:51 2021 no P-1/P+1/ECM available, skipping Sat Sep 4 19:10:51 2021 commencing number field sieve (100-digit input) Sat Sep 4 19:10:51 2021 commencing number field sieve polynomial selection Sat Sep 4 19:10:51 2021 polynomial degree: 4 Sat Sep 4 19:10:51 2021 max stage 1 norm: 1.16e+17 Sat Sep 4 19:10:51 2021 max stage 2 norm: 8.33e+14 Sat Sep 4 19:10:51 2021 min E-value: 9.89e-09 Sat Sep 4 19:10:51 2021 poly select deadline: 54 Sat Sep 4 19:10:51 2021 time limit set to 0.01 CPU-hours Sat Sep 4 19:10:51 2021 expecting poly E from 1.49e-08 to > 1.71e-08 Sat Sep 4 19:10:51 2021 searching leading coefficients from 10000 to 1000000 Sat Sep 4 19:10:52 2021 using GPU 0 (NVIDIA GeForce GTX 970) Sat Sep 4 19:10:52 2021 selected card has CUDA arch 5.2 Sat Sep 4 19:11:19 2021 polynomial selection complete Sat Sep 4 19:11:19 2021 elapsed time 00:00:28 [/code] But I've been having fun with ecm. The problem with conftest turned out to be: [code] chris@4core:~> gcc-9 -o conftest -I/usr/local/cuda/include -g -O2 -I/usr/local/cuda/include -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 conftest.c -lcudart -lstdc++ -lcuda -lrt -lm -lm -lm -lm -lm /usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: cannot find -lstdc++ collect2: error: ld returned 1 exit status [/code] So changing ./configure line 15498 from [c]CUDALIB="-lcudart -lstdc++"[/c] to [c]CUDALIB="-lcudart"[/c] made it work OK. I then got a lot of errors like this: [code] Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4 [/code] So edited the Makefile to only build for sm_52 since that's all I need. But trying to build CGBN support I get: [code] chris@4core:~/ecm-cgbn/gmp-ecm> make make all-recursive make[1]: Entering directory '/home/chris/ecm-cgbn/gmp-ecm' Making all in x86_64 make[2]: Entering directory '/home/chris/ecm-cgbn/gmp-ecm/x86_64' make[2]: Nothing to be done for 'all'. 
make[2]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm/x86_64' make[2]: Entering directory '/home/chris/ecm-cgbn/gmp-ecm' /bin/sh ./libtool --tag=CC --mode=compile /usr/local/cuda/bin/nvcc --compile -I/home/chris/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_52,code=sm_52 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU -o cgbn_stage1.lo cgbn_stage1.cu -static libtool: compile: /usr/local/cuda/bin/nvcc --compile -I/home/chris/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_52,code=sm_52 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU cgbn_stage1.cu -o cgbn_stage1.o cgbn_stage1.cu(435): error: identifier "cgbn_swap" is undefined detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]" (757): here cgbn_stage1.cu(442): error: identifier "cgbn_swap" is undefined detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]" (757): here cgbn_stage1.cu(435): error: identifier "cgbn_swap" is undefined detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]" (760): here cgbn_stage1.cu(442): error: identifier "cgbn_swap" is undefined detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]" (760): here 4 errors detected in the compilation of "cgbn_stage1.cu". make[2]: *** [Makefile:2571: cgbn_stage1.lo] Error 1 make[2]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm' make[1]: *** [Makefile:1903: all-recursive] Error 1 make[1]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm' make: *** [Makefile:783: all] Error 2 [/code] This is after several attempts to run make, so hopefully only the relevant messages. But I've got an older version of ecm working on the GPU (at last!) So i'll leave it for now. |
[QUOTE=SethTro;587047]I halved compile time by adding cgbn_swap and avoiding inlining double_add_v2 twice.
[/QUOTE] Does it affect the runtime? I don't care much about the compile time. Just compile a few small kernels for testing, and once it's stable include a good coverage of kernels and just let it compile overnight if necessary. In my current build I included all of [CODE] typedef cgbn_params_t<4, 256> cgbn_params_256; typedef cgbn_params_t<4, 512> cgbn_params_512; typedef cgbn_params_t<8, 768> cgbn_params_768; typedef cgbn_params_t<8, 1024> cgbn_params_1024; typedef cgbn_params_t<8, 1536> cgbn_params_1536; typedef cgbn_params_t<8, 2048> cgbn_params_2048; typedef cgbn_params_t<16, 3072> cgbn_params_3072; typedef cgbn_params_t<16, 4096> cgbn_params_4096; typedef cgbn_params_t<16, 5120> cgbn_params_5120; typedef cgbn_params_t<16, 6144> cgbn_params_6144; typedef cgbn_params_t<16, 7168> cgbn_params_7168; typedef cgbn_params_t<16, 8192> cgbn_params_8192; typedef cgbn_params_t<32, 10240> cgbn_params_10240; typedef cgbn_params_t<32, 12288> cgbn_params_12288; typedef cgbn_params_t<32, 14336> cgbn_params_14336; typedef cgbn_params_t<32, 16384> cgbn_params_16384; typedef cgbn_params_t<32, 18432> cgbn_params_18432; typedef cgbn_params_t<32, 20480> cgbn_params_20480; typedef cgbn_params_t<32, 22528> cgbn_params_22528; typedef cgbn_params_t<32, 24576> cgbn_params_24576; typedef cgbn_params_t<32, 28672> cgbn_params_28672; typedef cgbn_params_t<32, 32768> cgbn_params_32768; [/CODE] and it took a little over an hour to compile for sm_70. |
[QUOTE=chris2be8;587267]
So changing ./configure line 15498 from [c]CUDALIB="-lcudart -lstdc++"[/c] to [c]CUDALIB="-lcudart"[/c] made it work OK. [/QUOTE] Use YaST to search for the dev file of libstdc++ and install it (and its dependencies), and then link with -lstdc++ |
[QUOTE=chris2be8;587267]
This is after several attempts to run make, so hopefully only the relevant messages. But I've got an older version of ecm working on the GPU (at last!) So I'll leave it for now.[/QUOTE] This is an easy fix, you are on the home stretch! I had committed a change that depends on [url]https://github.com/NVlabs/CGBN/pull/17[/url] being accepted. I've now committed a change reverting that to 3 cgbn_set's for now. After you `git pull` everything should build! Alternatively you can replace your CGBN directory with this one: `git clone -b cgbn_swap [email]git@github.com:sethtroisi/CGBN.git[/email]` |
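For context, "3 cgbn_set's" means the usual swap-by-copy fallback. A rough sketch of what that looks like inside a CGBN kernel, with placeholder variable names (the real code in cgbn_stage1.cu uses its own), is:
[CODE]
/* Sketch only: swapping two CGBN residues without cgbn_swap, using three copies.
   'env', 'a', 'b' and 'temp' are placeholder names for illustration. */
typename env_t::cgbn_t temp;
cgbn_set (env, temp, a);   /* temp = a */
cgbn_set (env, a, b);      /* a = b    */
cgbn_set (env, b, temp);   /* b = temp */
[/CODE]
The cgbn_swap added in the linked pull request provides this as a single call.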
[QUOTE=frmky;587274]Does it affect the runtime? I don't care much about the compile time. Just compile a few small kernels for testing, and once it's stable include a good coverage of kernels and just let it compile overnight if necessary. In my current build I included all of
[CODE] typedef cgbn_params_t<4, 256> cgbn_params_256; typedef cgbn_params_t<4, 512> cgbn_params_512; typedef cgbn_params_t<8, 768> cgbn_params_768; typedef cgbn_params_t<8, 1024> cgbn_params_1024; ......... typedef cgbn_params_t<32, 32768> cgbn_params_32768; [/CODE] and it took a little over an hour to compile for sm_70.[/QUOTE] It doesn't reduce runtime, but it does make it faster for me to test things and slightly reduces register pressure. |
[QUOTE=SethTro;587290]Alternatively you can use replace your CGBN directory with this one. `git clone -b cgbn_swap [email]git@github.com:sethtroisi/CGBN.git[/email]`[/QUOTE]
That fails: [code] chris@4core:~> git clone -b cgbn_swap git@github.com:sethtroisi/CGBN.git Cloning into 'CGBN'... The authenticity of host 'github.com (140.82.121.4)' can't be established. RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8. Are you sure you want to continue connecting (yes/no/[fingerprint])? yes Warning: Permanently added 'github.com,140.82.121.4' (RSA) to the list of known hosts. git@github.com: Permission denied (publickey). fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists. [/code] And 'git pull' does nothing: [code] chris@4core:~/CGBN> git pull Already up to date. [/code] Unless I'm not using it correctly. |
[QUOTE=chris2be8;587296]That fails:
And 'git pull' does nothing: [code] chris@4core:~/CGBN> git pull Already up to date. [/code]Unless I'm not using it correctly.[/QUOTE] Ignore this, but for completeness' sake you can probably clone my copy of CGBN with `git clone -b cgbn_swap https://github.com/sethtroisi/CGBN.git` The top entry from `git log` should be [CODE] commit 1595e543801bcbffd2c36cbf978baff843c09876 (HEAD -> gpu_integration, origin/gpu_integration) Author: Seth Troisi <sethtroisi@google.com> Date: Sat Sep 4 20:26:30 2021 -0700 reverted the cgbn_swap change till that is accepted [/CODE] If so you should be able to build. If not, try `git fetch` then `git pull origin gpu_integration` |
I'm still stuck. I re-downloaded everything from scratch and re-ran autoreconf -si, ./configure and make. But make still fails:
[code] ... libtool: link: ( cd ".libs" && rm -f "libecm.la" && ln -s "../libecm.la" "libecm.la" ) /bin/sh ./libtool --tag=CC --mode=link gcc-9 -g -I/usr/local/cuda/include -g -O2 -DWITH_GPU -R /usr/local/cuda/lib64 -o ecm ecm-auxi.o ecm-b1_ainc.o ecm-candi.o ecm-eval.o ecm-main.o ecm-resume.o ecm-addlaws.o ecm-torsions.o ecm-getprime_r.o aprtcle/ecm-mpz_aprcl.o ecm-memusage.o libecm.la -lgmp -lrt -lm -lm -lm -lm -lm libtool: link: gcc-9 -g -I/usr/local/cuda/include -g -O2 -DWITH_GPU -o ecm ecm-auxi.o ecm-b1_ainc.o ecm-candi.o ecm-eval.o ecm-main.o ecm-resume.o ecm-addlaws.o ecm-torsions.o ecm-getprime_r.o aprtcle/ecm-mpz_aprcl.o ecm-memusage.o ./.libs/libecm.a -L/usr/local/cuda/lib64 -lcudart -lgmp -lrt -lm -Wl,-rpath -Wl,/usr/local/cuda/lib64 /usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: ./.libs/libecm.a(cgbn_stage1.o): in function `cgbn_ecm_stage1': tmpxft_00007e39_00000000-6_cgbn_stage1.cudafe1.cpp:(.text+0x8b3): undefined reference to `operator delete(void*)' /usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: tmpxft_00007e39_00000000-6_cgbn_stage1.cudafe1.cpp:(.text+0x196e): undefined reference to `operator delete(void*)' /usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: ./.libs/libecm.a(cgbn_stage1.o): in function `void std::vector<unsigned int, std::allocator<unsigned int> >::_M_realloc_insert<unsigned int>(__gnu_cxx::__normal_iterator<unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> > >, unsigned int&&)': tmpxft_00007e39_00000000-6_cgbn_stage1.cudafe1.cpp:(.text._ZNSt6vectorIjSaIjEE17_M_realloc_insertIJjEEEvN9__gnu_cxx17__normal_iteratorIPjS1_EEDpOT_[_ZNSt6vectorIjSaIjEE17_M_realloc_insertIJjEEEvN9__gnu_cxx17__normal_iteratorIPjS1_EEDpOT_]+0x50): undefined reference to `operator new(unsigned long)' /usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: tmpxft_00007e39_00000000-6_cgbn_stage1.cudafe1.cpp:(.text._ZNSt6vectorIjSaIjEE17_M_realloc_insertIJjEEEvN9__gnu_cxx17__normal_iteratorIPjS1_EEDpOT_[_ZNSt6vectorIjSaIjEE17_M_realloc_insertIJjEEEvN9__gnu_cxx17__normal_iteratorIPjS1_EEDpOT_]+0xc8): undefined reference to `operator delete(void*)' /usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: ./.libs/libecm.a(cgbn_stage1.o):(.data.rel.local.DW.ref.__gxx_personality_v0[DW.ref.__gxx_personality_v0]+0x0): undefined reference to `__gxx_personality_v0' collect2: error: ld returned 1 exit status make[2]: *** [Makefile:973: ecm] Error 1 make[2]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm' make[1]: *** [Makefile:1903: all-recursive] Error 1 make[1]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm' make: *** [Makefile:783: all] Error 2 [/code] Any ideas? |
Did you install the dev package of libstdc++ with YaST?
|
Success!
The vital bit of info came from putting "__gxx_personality_v0" into duckduckgo. That told me it's provided by libstdc++, which is the g++ runtime. After installing gcc9-g++ and its runtime package libstdc++6-devel-gcc9, everything works. This has been an educational experience. Next step is to benchmark cgbn on my GPU. |
Benchmark results:
[code] chris@4core:~/ecm-cgbn/gmp-ecm> date;echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 20000 0;date Sun 5 Sep 19:42:42 BST 2021 GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is (2^499-1)/20959 (146 digits) Using B1=20000, B2=0, sigma=3:1000-3:4583 (3584 curves) GPU: Using device code targeted for architecture compile_52 GPU: Ptx version is 52 GPU: maxThreadsPerBlock = 1024 GPU: numRegsPerThread = 31 sharedMemPerBlock = 24576 bytes GPU: Block: 32x32x1 Grid: 112x1x1 (3584 parallel curves) Computing 3584 Step 1 took 190ms of CPU time / 20427ms of GPU time Sun 5 Sep 19:43:03 BST 2021 chris@4core:~/ecm-cgbn/gmp-ecm> date;echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 0;date Sun 5 Sep 19:43:29 BST 2021 GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is (2^499-1)/20959 (146 digits) Using B1=20000, B2=0, sigma=3:1000-3:4583 (3584 curves) GPU: Using device code targeted for architecture compile_52 GPU: Ptx version is 52 GPU: maxThreadsPerBlock = 640 GPU: numRegsPerThread = 93 sharedMemPerBlock = 0 bytes Computing 3584 Step 1 took 30ms of CPU time / 3644ms of GPU time Sun 5 Sep 19:43:33 BST 2021 chris@4core:~/ecm-cgbn/gmp-ecm> date;echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000 0;date Sun 5 Sep 19:44:25 BST 2021 GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is (2^997-1) (301 digits) Using B1=20000, B2=0, sigma=3:1000-3:1831 (832 curves) GPU: Using device code targeted for architecture compile_52 GPU: Ptx version is 52 GPU: maxThreadsPerBlock = 1024 GPU: numRegsPerThread = 31 sharedMemPerBlock = 24576 bytes GPU: Block: 32x32x1 Grid: 26x1x1 (832 parallel curves) Computing 832 Step 1 took 188ms of CPU time / 4552ms of GPU time Sun 5 Sep 19:44:30 BST 2021 chris@4core:~/ecm-cgbn/gmp-ecm> date;echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0;date Sun 5 Sep 19:44:41 BST 2021 GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM] Input number is (2^997-1) (301 digits) Using B1=20000, B2=0, sigma=3:1000-3:1831 (832 curves) GPU: Using device code targeted for architecture compile_52 GPU: Ptx version is 52 GPU: maxThreadsPerBlock = 640 GPU: numRegsPerThread = 93 sharedMemPerBlock = 0 bytes Computing 832 Step 1 took 8ms of CPU time / 1995ms of GPU time Sun 5 Sep 19:44:44 BST 2021 [/code] So about 5 times faster for (2^499-1)/20959 and about twice as fast for 2^997-1. But these are all small cases. But my overall throughput won't increase much because my CPU can't do stage 2 as fast as the GPU can do stage 1 now. But that's not your fault. And any speedup is nice. Thanks. Other lessons learnt: autoreconf -si creates symlinks to missing files while autoreconf -i copies them. Using -si saves space, but if you upgrade to a new level of automake you can get hanging symlinks: [code] lrwxrwxrwx 1 chris users 32 Nov 12 2015 INSTALL -> /usr/share/automake-1.13/INSTALL lrwxrwxrwx 1 chris users 35 Nov 12 2015 ltmain.sh -> /usr/share/libtool/config/ltmain.sh [/code] They needed updating to: [code] lrwxrwxrwx 1 chris users 32 Sep 4 19:20 INSTALL -> /usr/share/automake-1.15/INSTALL lrwxrwxrwx 1 chris users 38 Sep 4 19:20 ltmain.sh -> /usr/share/libtool/build-aux/ltmain.sh [/code] Not a common issue though. 
And some suggestions for the install process: INSTALL-ecm should tell users to run autoreconf -i (or -si) before running ./configure (which is created by autoreconf -i). ./configure compiles several small programs and runs them to check things. If one of those compiles fails, it should put out a message saying the compile failed, not one saying it found different levels of the runtime library etc. And if the compile normally produces no output, letting any output it does produce go to the screen would be informative (e.g. when it can't find -lstdc++). Chris |
[QUOTE=chris2be8;587337]Success![/QUOTE]
I'm glad we finally got here! A 2.2x speedup for the 1024-bit case is almost exactly what everyone else is seeing (except bsquared, maybe because of a newer card?). You can often improve overall throughput by adjusting to 1.2*B1 and 1/2*B2 (and checking that the expected number of curves stays roughly the same). This especially helps if Stage 1 time < Stage 2 time / cores. I'll reflect on your notes and see if I can improve the documentation / configure script. |
[QUOTE=SethTro;587429]
I'll reflect on your notes and see if I can improve the documentation / configure script.[/QUOTE] How about updating INSTALL-ecm like this: [code] diff -u INSTALL-ecm INSTALL-ecm.new --- INSTALL-ecm 2021-09-05 12:13:55.613439408 +0100 +++ INSTALL-ecm.new 2021-09-07 16:37:42.903291304 +0100 @@ -19,6 +19,7 @@ 1) check your configuration with: + $ autoreconf -i $ ./configure The configure script accepts several options (see ./configure --help). [/code] That's a minimum change to get new users started. |
[QUOTE=chris2be8;587449]How about updating INSTALL-ecm like this:
[code] diff -u INSTALL-ecm INSTALL-ecm.new --- INSTALL-ecm 2021-09-05 12:13:55.613439408 +0100 +++ INSTALL-ecm.new 2021-09-07 16:37:42.903291304 +0100 @@ -19,6 +19,7 @@ 1) check your configuration with: + $ autoreconf -i $ ./configure The configure script accepts several options (see ./configure --help). [/code] That's a minimum change to get new users started.[/QUOTE] That document describes what users should do when they have downloaded an official release. When building an official release, you do not need to run [C]autoreconf -i[/C]. You only need to run [C]autoreconf -i[/C] when you download a development version with git or svn. I don't think adding [C]autoreconf -i[/C] to this document is a good idea. Looking at the various documents, I see that [C]README.dev[/C] has the advice of running [C]autoreconf -i[/C]. |
How about having INSTALL-ecm tell users to run [c]autoreconf -i[/c] if they don't have a ./configure in the directory?
And if people get an official release, would the files created by autoreconf -i be correct for their OS etc.? |
@Chris: Did you get your sm_30 card working or just the higher arch one?
|
Just the higher arch one (sm_52). Sorry.
PS. Does CGBN increase the maximum size of number that can be handled? I'd try it, but I'm tied up catching up with ECM work I delayed while I was getting ecm-cgbn working. |
[QUOTE=chris2be8;587572]Just the higher arch one (sm_52). Sorry.
PS. Does CGBN increase the maximum size of number that can be handled? I'd try it, but I'm tied up catching up with ECM work I delayed while I was getting ecm-cgbn working.[/QUOTE] Yes! In cgbn_stage1.cu search for the line [c]/* NOTE: Custom kernel changes here[/c]. You can either add a new kernel, or (what I recommend) just change `cgbn_params_512`:
[CODE]
- typedef cgbn_params_t<4, 512> cgbn_params_512;
+ typedef cgbn_params_t<TPI_SEE_COMMENT_ABOVE, YOUR_BITS_HERE> cgbn_params_512; // My name is a lie
[/CODE]
The absolute limit is 32,768 bits. I found that GPU/CPU performance decreases 3x from 1,024 bits to 16,384 bits, then an additional 2x above 16,384 bits; that's still something like 13x faster on my system, but possibly no longer competitive watt for watt. Read the nearby comment for a sense of how long it will take to compile. |
I spent most of today working on new optimal bounds. It can be a [URL="https://www.mersenneforum.org/showpost.php?p=587617&postcount=22"]large speedup[/URL] to use these instead of the traditionally optimal B1 bounds. ecm can confirm they represent a full t<X> while taking substantially less time when accounting for the GPU speedup.
Full table at [url]https://github.com/sethtroisi/misc-scripts/tree/main/ecm_gpu_optimizer[/url] and an excerpt below [CODE]GPU speedup/CPU cores digits optimal B1 optimal B2 B2/B1 ratio expected curves Fast GPU + 4 cores 40/4 35 2,567,367 264,075,603 103 809 40/4 40 8,351,462 1,459,547,807 175 1760 40/4 45 38,803,644 17,323,036,685 446 2481 40/4 50 79,534,840 58,654,664,284 737 7269 40/4 55 113,502,213 96,313,119,323 849 29883 40/4 60 322,667,450 395,167,622,450 1225 56664 Fast GPU + 8 cores 40/8 35 1,559,844 351,804,250 226 1038 40/8 40 6,467,580 2,889,567,750 447 1843 40/8 45 29,448,837 35,181,170,876 1195 2599 40/8 50 40,201,280 58,928,323,592 1466 11993 40/8 55 136,135,593 289,565,678,027 2127 20547 40/8 60 479,960,096 3,226,409,839,042 6722 30014[/CODE] |
[QUOTE=bsquared;587026]1280: (~31 ms/curves)
2560: (~21 ms/curves) 640: (~63 ms/curves) 1792: (~36 ms/curves) So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves)) With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.[/QUOTE] Two late-night performance thoughts. 1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55. It's a nice debug check while this is still in development, but it has never tripped, so it's overly cautious, especially if it costs 10% performance. 2. Would you mind sharing what card you have and the full -v output (especially the lines that start with "GPU: ")? |
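For anyone trying this, the change is a one-line edit; a sketch (presumably near the top of cgbn_stage1.cu, around the line number mentioned above, which may differ between checkouts):
[CODE]
/* cgbn_stage1.cu: disable the extra normalization checks in the kernel. */
#define VERIFY_NORMALIZED 0   /* set back to 1 when debugging */
[/CODE]
Rebuild afterwards so the kernel is recompiled with the check removed.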
[QUOTE=SethTro;587628]Two late night performance thoughts.
1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55 It's a nice debug check while this is still in development but it has never tripped so it's overly cautious especially if it costs 10% performance. 2. Would you mind sharing what card you have and the full output from -v output (especially the lines that start with "GPU: ")[/QUOTE] Hmm, when running on 2^997-1 I'm getting *better* throughput with VERIFY_NORMALIZED 1, 53.5 curves/sec with it defined to 1 vs. 45.6 curves/sec with it defined to 0, both running -gpucurves 2560. If I set gpucurves 5120 then the no_verify version is 15% faster, but still slower than -gpucurves 2560. It is a Tesla V100-SXM2-32GB (compute capability 7.0, 80 MPs, maxSharedPerBlock = 49152 maxThreadsPerBlock = 1024 maxRegsPerBlock = 65536) |
[QUOTE=bsquared;587026]1280: (~31 ms/curves)
2560: (~21 ms/curves) 640: (~63 ms/curves) 1792: (~36 ms/curves) So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves)) With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.[/QUOTE] I was confused when you saw only moderate gains so I rented a V100 (V100-SXM2-16GB) on AWS today. I'm seeing the new code be 3.1x faster which is similar to the 2-3x improvement I've seen on a 1080ti, 970, and K80. [CODE] $ echo "(2^997-1)" | ./ecm -cgbn -v -sigma 3:1000 1000000 0 Computing 5120 Step 1 took 134ms of CPU time / 69031ms of GPU time Throughput: 74.170 curves per second (on average 13.48ms per Step 1) $ echo "(2^997-1)" | ./ecm -gpu -v -sigma 3:1000 1000000 0 Computing 5120 Step 1 took 10911ms of CPU time / 218643ms of GPU time Throughput: 23.417 curves per second (on average 42.70ms per Step 1) [/CODE] |
Hello, I've got an error while trying to run curves with B1=11e7:
[CODE]ecm-cgbn: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.[/CODE] Is this some sort of CGBN limitation? |
It's there to prevent GPU memory issues, so it can be ignored (unless you run with a very large number).
It's on my to-do list to remove but I'm sadly without internet today. You can remove the assert and everything will be fine. |
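Concretely, the check that trips is the assertion quoted in the error above; a sketch of the edit (the exact line and surrounding code may differ in your checkout) is simply to drop or relax the upper bound:
[CODE]
/* cgbn_stage1.cu, allocate_and_set_s_bits(): the guard that trips for large B1. */
assert (1 <= num_bits);               /* keep the basic sanity check              */
/* assert (num_bits <= 100000000); */ /* drop the hard cap that large B1 exceeds  */
[/CODE]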
[QUOTE=unconnected;591358]Hello, I've got an error while trying to run curves with B1=11e7:
[CODE]ecm-cgbn: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.[/CODE] Is this some sort of CGBN limitation?[/QUOTE] I just merged [URL]https://gitlab.inria.fr/zimmerma/ecm/-/merge_requests/27[/URL], which contains a fix for the B1 limit along with a number of quality-of-life improvements: multiple kernels included by default (512 and 1024), estimated timing, better overflow detection, and faster compilation. |
[B]SethTro[/B], thanks for the explanation and improvements!
|