mersenneforum.org > Factoring > Faster GPU-ECM with CGBN
(https://www.mersenneforum.org/showthread.php?t=27103)

frmky 2021-08-30 19:34

Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?

bsquared 2021-08-30 19:42

[QUOTE=frmky;586881]Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?[/QUOTE]

I am working on the ability to process ecm save files with yafu, but it isn't ready yet.

EdH 2021-08-30 19:44

[QUOTE=frmky;586881]Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?[/QUOTE]Not sure if I'm understanding the question, but would [URL="https://www.mersenneforum.org/showthread.php?t=15508"]ECM.py[/URL] work?

Edit: For my Colab-GPU ECM experiments, I use:[code]python3 ecm.py -resume residues[/code]to run the residues from the Colab GPU stage 1 portion. I think I have all the threads, etc. set in the Python code, but they can also be set on the command line.

The latest version is [URL="https://www.mersenneforum.org/showpost.php?p=518249&postcount=109"]here[/URL].

SethTro 2021-08-31 08:49

@EdH I started using ECM.py again and it's great!

---

I wrote a bunch of code today so S_BITS_PER_BATCH is dynamic and there's better verbose output.

Verbose output includes this message when the kernel size is much larger than the input number.
[CODE]
Input number is 2^239-1 (72 digits)
Compiling custom kernel for 256 bits should be ~180% faster
CGBN<512, 4> running kernel<56 block x 128 threads>
[/CODE]
I doubt that verbose is the right place for this output (I'm not sure how many people can actually recompile the CUDA code), but if you have a working setup it's as easy as changing

[CODE]
- typedef cgbn_params_t<4, 512> cgbn_params_4_512;
+ typedef cgbn_params_t<4, 256> cgbn_params_4_512;
[/CODE]
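The suggestion boils down to picking the smallest compiled width that fits N. A sketch of that check (Python for clarity; the list of widths and the exact fit rule are illustrative, not the actual cgbn_stage1.cu logic):

[CODE]
# Illustrative only: pick the smallest compiled CGBN width that fits N.
# The real list depends on which cgbn_params_t<TPI, BITS> variants were
# compiled in; CGBN widths must be multiples of 32.
COMPILED_BITS = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768]

def suggest_kernel(n):
    n_bits = n.bit_length()
    for bits in COMPILED_BITS:
        if bits > n_bits:  # strictly larger; the real check may differ
            return bits
    raise ValueError("input exceeds the 32K-bit CGBN limit")

print(suggest_kernel(2**239 - 1))  # -> 256, vs. the 512-bit default above
[/CODE]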

---

ETA and estimated throughput

[CODE]
Copying 716800 bits of data to GPU
CGBN<640, 8> running kernel<112 block x 128 threads>
Computing 100 bits/call, 0/4328085 (0.0%)
Computing 110 bits/call, 100/4328085 (0.0%)
Computing 121 bits/call, 210/4328085 (0.0%)
...
Computing 256 bits/call, 1584/4328085 (0.0%)
Computing 655 bits/call, 5630/4328085 (0.1%)
Computing 1694 bits/call, 16050/4328085 (0.4%)
Computing 2049 bits/call, 35999/4328085 (0.8%), ETA 184 + 2 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 56489/4328085 (1.3%), ETA 183 + 2 = 185 seconds (~103 ms/curves)
...
Computing 2049 bits/call, 158939/4328085 (3.7%), ETA 178 + 7 = 185 seconds (~103 ms/curves)
Computing 2049 bits/call, 363839/4328085 (8.4%), ETA 169 + 16 = 185 seconds (~103 ms/curves)
...
Computing 2049 bits/call, 1798139/4328085 (41.5%), ETA 109 + 77 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 2003039/4328085 (46.3%), ETA 100 + 86 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 4052039/4328085 (93.6%), ETA 12 + 175 = 187 seconds (~104 ms/curves)
Copying results back to CPU ...
Computing 1792 Step 1 took 240ms of CPU time / 186575ms of GPU time
Throughput: 9.605 curves per second (on average 104.12ms per Step 1)
[/CODE]

This is nice as it gives very early feedback (estimates after 1-5 seconds are very accurate) if you are changing `-gpucurves` or playing with custom kernel bit sizes.
I've found that doubling gpucurves can lead to 2x worse throughput! So I may need to add some warnings.
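For the curious, the ETA line is just (estimated remaining) + (elapsed so far). Roughly how it can be computed (my reading of the output above, not necessarily the exact code):

[CODE]
# Rough reconstruction of the ETA line from the progress counters.
def eta_line(bits_done, bits_total, elapsed_s):
    rate = bits_done / elapsed_s                  # stage 1 exponent bits/second
    remaining_s = (bits_total - bits_done) / rate
    return "ETA %.0f + %.0f = %.0f seconds" % (
        remaining_s, elapsed_s, remaining_s + elapsed_s)

# Approximately reproduces the 41.5% line ("ETA 109 + 77 = 186 seconds").
print(eta_line(1798139, 4328085, 77))
[/CODE]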

EdH 2021-08-31 12:41

[QUOTE=SethTro;586911]@EdH I started using ECM.py again and it's great!
---
[/QUOTE]Good to read. I just wish I could get my sm_30 card to do something. . . (Two sm_20s and one sm_30, and none will do anything productive . . . yet. With all the install/reinstall/remove activity, the sm_30 machine is now complaining about the Linux kernel, so I've taken a break from trying more.)

bur 2021-08-31 17:53

I couldn't find it in the thread (hope I didn't just overlook it): how does the speed of ECM on a GPU generally compare to a CPU? Say a GTX 1660 or similar.

And is it the case that only small B1 values can be used? I found [URL="https://eprint.iacr.org/2020/1265.pdf"]this paper[/URL], and they also only seem to have used B1=50k. With a 2080 Ti they achieved "2781 ECM trials" (curves, I assume) per second at B1=50k. That is very fast, but if the B1 size is severely limited, a CPU is still required for larger factors?

SethTro 2021-08-31 21:01

[QUOTE=bur;586936]I couldn't find it in the thread (hope I didn't just overlook it): how does the speed of ECM on a GPU generally compare to a CPU? Say a GTX 1660 or similar.

And is it the case that only small B1 values can be used? I found [URL="https://eprint.iacr.org/2020/1265.pdf"]this paper[/URL], and they also only seem to have used B1=50k. With a 2080 Ti they achieved "2781 ECM trials" (curves, I assume) per second at B1=50k. That is very fast, but if the B1 size is severely limited, a CPU is still required for larger factors?[/QUOTE]

The most important factor is the size of N (which CGBN limits to 32K bits on GPUs, or roughly 10,000 digits).
Both CPU and GPU have the same linear scaling in B1, which can be increased to any value you want.

The speedup depends strongly on your CPU vs your GPU. For my 1080ti vs 2600K:

250 bits 46x faster on GPU
500 bits 48x faster on GPU
1000 bits 68x faster on GPU
1500 bits 83x faster on GPU
2000 bits 46x faster on GPU

Which means we are seeing roughly the same scaling on the GPU as on the CPU for bit levels < 2K.
Informal testing with larger inputs (2048 - 32,768 bits) shows the CPU outscales the GPU: the speedup slowly decreases from ~50x to ~25x as inputs grow from 2K to 16K bits, and at the maximum of 32K bits it has dropped again to ~14x (from 26x at 16K bits).
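To put a number on that linear scaling: stage 1 multiplies the starting point by every prime power up to B1, and that exponent has about B1/ln 2 ≈ 1.4427·B1 bits, so stage 1 time grows proportionally with B1 on either device. A quick sanity check against the ETA log earlier in the thread (a sketch; the B1=3e6 value is my inference from the bit count):

[CODE]
import math

# Stage 1 computes s*P where s is the product of all prime powers <= B1.
# By the prime number theorem log2(s) ~ B1 / ln(2) ~ 1.4427 * B1,
# so stage 1 work is linear in B1 on CPU and GPU alike.
def stage1_bits(b1):
    return b1 / math.log(2)

print(round(stage1_bits(3_000_000)))  # ~4328085, the bit count in the log above
[/CODE]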

xilman 2021-09-01 01:50

[QUOTE=frmky;586881]Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?[/QUOTE]That is what I used to do when GPU-enabled ECM still worked on my machines. It was a trivial script to write.
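Something along these lines (a sketch, not my original script; the file name, core count, and bounds are placeholders):

[CODE]
import subprocess

SAVEFILE, CORES, B1, B2 = "stage1.save", 4, "3e6", "3e9"  # placeholders

# Each line of a GPU -save file is one curve's stage 1 residue, so
# splitting is just dealing the lines out round-robin, one file per core.
lines = open(SAVEFILE).read().splitlines()
procs = []
for i in range(CORES):
    part = "part%d.save" % i
    with open(part, "w") as f:
        f.write("\n".join(lines[i::CORES]) + "\n")
    # Resuming with the same B1 skips straight to stage 2 up to B2.
    procs.append(subprocess.Popen(["ecm", "-resume", part, B1, B2]))

for p in procs:
    p.wait()
[/CODE]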

bsquared 2021-09-01 15:02

I re-cloned the gpu_integration branch to capture the latest changes and went through the build process with the following caveats:

Specifying --with-gmp together with --with-cgbn-include doesn't work. I had to use the system default gmp (6.0.0).

With compute 70 I still have to replace __any(cond) with __any_sync(__activemask(), cond) on line 10 of cuda_kernel_default.cu.

Building with gcc I get this error in cgbn_stage1.cu: cgbn_stage1.cu(654): error: initialization with "{...}" is not allowed for object of type "const std::vector<uint32_t, std::allocator<uint32_t>>"

I suppose I need to build with g++ instead?

Anyway, I can get past all of that and get a working binary, and the CPU usage is now much lower. But the GPU portion now appears to be about 15% slower?

Before:
[CODE]
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 75571ms of CPU time / 129206ms of GPU time
Throughput: 39.627 curves per second (on average 25.24ms per Step 1)
[/CODE]

New clone:
[CODE]
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 643ms of CPU time / 149713ms of GPU time
Throughput: 34.199 curves per second (on average 29.24ms per Step 1)
[/CODE]

Anyone else seeing this?

chris2be8 2021-09-01 16:42

Hello,

I've upgraded my system with a GTX 970 (sm_52) to openSUSE 42.2 and installed CUDA 9.0 on it. But when I try to compile ecm with GPU support ./configure says:
[code]
configure: Using cuda.h from /usr/local/cuda/include
checking cuda.h usability... no
checking cuda.h presence... yes
configure: WARNING: cuda.h: present but cannot be compiled
configure: WARNING: cuda.h: check for missing prerequisite headers?
configure: WARNING: cuda.h: see the Autoconf documentation
configure: WARNING: cuda.h: section "Present But Cannot Be Compiled"
configure: WARNING: cuda.h: proceeding with the compiler's result
configure: WARNING: ## ----------------------------------- ##
configure: WARNING: ## Report this to ecm-discuss@inria.fr ##
configure: WARNING: ## ----------------------------------- ##
checking for cuda.h... no
configure: error: required header file missing
[/code]

README.gpu says:
[code]
Some versions of CUDA are not compatible with recent versions of gcc.
To specify which C compiler is called by the CUDA compiler nvcc, type:

$ ./configure --enable-gpu --with-cuda-compiler=/PATH/DIR

If you get errors about "cuda.h: present but cannot be compiled"
Try using an older CC:

$ ./configure --enable-gpu CC=gcc-8

The value of this parameter is directly passed to nvcc via the option
"--compiler-bindir". By default, GMP-ECM lets nvcc choose what C compiler it
uses.
[/code]

The only gcc installed now is version 4.8.5. Should I install an older gcc (if so, what version?), or should I upgrade the OS so I can install a newer CUDA? Does anyone have ecm working with CUDA 9.0 or higher on openSUSE, and if so, on what release of openSUSE?

Chris (getting slightly frustrated by now)

EdH 2021-09-01 17:47

[QUOTE=chris2be8;587001]The only gcc installed now is version 4.8.5. Should I install an older gcc (if so, what version?), or should I upgrade the OS so I can install a newer CUDA? Does anyone have ecm working with CUDA 9.0 or higher on openSUSE, and if so, on what release of openSUSE?

Chris (getting slightly frustrated by now)[/QUOTE]I've passed the frustration point with my systems. I was getting the same with Ubuntu 20.04 on all the 10.x and 11.x CUDA versions (my card isn't supported by CUDA 11.x, anyway). I installed and made default several older gcc versions (8, 9, 10).* I gave up for now.

* I'm curious about the gcc version number difference between yours and mine. The default Ubuntu 20.04 gcc is 9.3.0, my Debian Buster's is 8.3.0, and the default on my Fedora 33 is 10.3.1. Is your version actually that old compared to mine?

