mersenneforum.org Faster GPU-ECM with CGBN

 2021-08-30, 19:34 #34 frmky     Jul 2003 So Cal 3·751 Posts Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?
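For reference, the save-file splitting step of that workflow is simple enough to script. A minimal, hypothetical Python sketch (the function name, file names, and round-robin scheme are illustrative, not part of GMP-ECM; it assumes the save file holds one residue per line, as GMP-ECM stage 1 save files do):

```python
# Hypothetical helper: split a GMP-ECM stage 1 save file (one residue
# per line) into n_parts files, one per `ecm -resume` process.
from pathlib import Path

def split_save_file(save_path, n_parts, out_prefix="part"):
    lines = Path(save_path).read_text().splitlines(keepends=True)
    parts = [[] for _ in range(n_parts)]
    for i, line in enumerate(lines):
        parts[i % n_parts].append(line)  # round-robin keeps parts balanced
    out_files = []
    for i, chunk in enumerate(parts):
        out = Path(f"{out_prefix}{i}.save")
        out.write_text("".join(chunk))
        out_files.append(out)
    return out_files
```

Each resulting `part0.save`, `part1.save`, ... would then be handed to its own `ecm -resume partN.save B1 B2` process, which is essentially the script frmky describes.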
2021-08-30, 19:42   #35
bsquared

"Ben"
Feb 2007

2⁹·7 Posts

Quote:
 Originally Posted by frmky Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?
I am working on the ability to process ecm save files with yafu, but it isn't ready yet.

2021-08-30, 19:44   #36
EdH

"Ed Hall"
Dec 2009

4,177 Posts

Quote:
 Originally Posted by frmky Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?
Not sure if I'm understanding the question, but would ECM.py work?

Edit: For my Colab-GPU ECM experiments, I use:
Code:
python3 ecm.py -resume residues
to run the residues from the Colab GPU stage 1 portion. I think I have all the thread settings, etc., in the Python code, but they can also be set on the command line.

Last fiddled with by EdH on 2021-08-30 at 20:25

 2021-08-31, 08:49 #37 SethTro     "Seth" Apr 2019 409 Posts @EdH I started using ECM.py again and it's great!

---

I wrote a bunch of code today so S_BITS_PER_BATCH is dynamic and there's better verbose output. The verbose output includes this message when the kernel size is much larger than the input number.
Code:
Input number is 2^239-1 (72 digits)
Compiling custom kernel for 256 bits should be ~180% faster
CGBN<512, 4> running kernel<56 block x 128 threads>
I doubt that verbose is the right place for this output (as I'm not sure how many people can actually recompile the CUDA code), but if you have a working setup it's as easy as changing
Code:
- typedef cgbn_params_t<4, 512> cgbn_params_4_512;
+ typedef cgbn_params_t<4, 256> cgbn_params_4_512;

---

ETA and estimated throughput
Code:
Copying 716800 bits of data to GPU
CGBN<640, 8> running kernel<112 block x 128 threads>
Computing 100 bits/call, 0/4328085 (0.0%)
Computing 110 bits/call, 100/4328085 (0.0%)
Computing 121 bits/call, 210/4328085 (0.0%)
...
Computing 256 bits/call, 1584/4328085 (0.0%)
Computing 655 bits/call, 5630/4328085 (0.1%)
Computing 1694 bits/call, 16050/4328085 (0.4%)
Computing 2049 bits/call, 35999/4328085 (0.8%), ETA 184 + 2 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 56489/4328085 (1.3%), ETA 183 + 2 = 185 seconds (~103 ms/curves)
...
Computing 2049 bits/call, 158939/4328085 (3.7%), ETA 178 + 7 = 185 seconds (~103 ms/curves)
Computing 2049 bits/call, 363839/4328085 (8.4%), ETA 169 + 16 = 185 seconds (~103 ms/curves)
...
Computing 2049 bits/call, 1798139/4328085 (41.5%), ETA 109 + 77 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 2003039/4328085 (46.3%), ETA 100 + 86 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 4052039/4328085 (93.6%), ETA 12 + 175 = 187 seconds (~104 ms/curves)
Copying results back to CPU
...
Computing 1792 Step 1 took 240ms of CPU time / 186575ms of GPU time
Throughput: 9.605 curves per second (on average 104.12ms per Step 1)
This is nice as it gives very early feedback (estimates after 1-5 seconds are very accurate) if you are changing -gpucurves or playing with custom kernel bit sizes. I've found that doubling gpucurves can lead to 2x worse throughput, so I may need to add some warnings. Last fiddled with by SethTro on 2021-08-31 at 08:49
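For reference, an ETA in the "remaining + elapsed = total" form shown in the log above can be produced by simple linear extrapolation. This is a guess at the logic, not the actual GMP-ECM code; the function name is mine:

```python
def eta_estimate(done, total, elapsed_s):
    """Rough ETA in the style of the log above: 'remaining + elapsed = total'.

    Assumes progress is linear in work completed (a reasonable assumption
    once the bits/call has ramped up to its steady-state value).
    """
    if done == 0:
        return None                      # no rate information yet
    rate = elapsed_s / done              # seconds per unit of work
    remaining = (total - done) * rate
    return remaining, elapsed_s, remaining + elapsed_s
```

This also explains why the estimates stabilize after only a few seconds: once a steady rate is observed, the linear extrapolation barely moves.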
2021-08-31, 12:41   #38
EdH

"Ed Hall"
Dec 2009

4,177 Posts

Quote:
 Originally Posted by SethTro @EdH I started using ECM.py again and it's great! ---
Good to read. I just wish I could get my sm_30 card to do something. . . (2 sm_20s and 1 sm_30, and none will do anything productive, . . . yet. With all the install/reinstall/remove activity, the sm_30 machine is now complaining about the Linux kernel, so I've taken a break from trying more.)

 2021-08-31, 17:53 #39 bur     Aug 2020 79*6581e-4;3*2539e-3 2²×7×17 Posts I couldn't find it in the thread (I hope I didn't just overlook it): how does the speed of ECM on a GPU generally compare to a CPU? Say a GTX 1660 or similar. And is it the case that only small B1 values can be used? I found this paper, and they also only seem to have used B1=50k. With a 2080 Ti they achieved "2781 ECM trials" (I guess curves) per second for B1=50k. That is very fast, but if the B1 size is severely limited, is a CPU still required for larger factors? Last fiddled with by bur on 2021-08-31 at 17:56
2021-08-31, 21:01   #40
SethTro

"Seth"
Apr 2019

409 Posts

Quote:
 Originally Posted by bur I couldn't find it in the thread (hope I didn't just overlook it), how does the speed of ECM on GPU generally compare to CPU? Say a GTX 1660 or similar. And is it so that only small B1 values can be used? I found this paper and they also only seem to have used B1=50k. With a 2080 Ti they achieved "2781 ECM trials", I guess curves, per second for B1=50k. That is very fast, but if the B1 size is severely limited, a CPU is still required for larger factors?
The most important factor is the size of N (which is limited by CGBN to 32K bits on GPUs, or ~10,000 digits).
Both CPU and GPU have the same linear scaling in B1, which can be increased to any value you want.

The speedup depends strongly on your CPU vs your GPU. For my 1080 Ti vs 2600K:

250 bits 46x faster on GPU
500 bits 48x faster on GPU
1000 bits 68x faster on GPU
1500 bits 83x faster on GPU
2000 bits 46x faster on GPU

This means we are seeing roughly the same scaling on the GPU as on the CPU for bit levels < 2K.
Informal testing with larger inputs (2,048 to 32,768 bits) shows the CPU out-scales the GPU: the speedup slowly decreases from ~50x to ~25x as the size increases from 2K to 16K bits. At the maximal value of 32K bits, performance has decreased again, to 14x (from 26x at 16K bits).

Last fiddled with by SethTro on 2021-08-31 at 21:02
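The linear-B1 scaling described above makes back-of-the-envelope planning easy: measure ms/curve once at a reference B1, then scale. A minimal sketch, assuming strictly linear stage 1 cost in B1 (the function name and example numbers are illustrative, not from GMP-ECM):

```python
def stage1_time_estimate(ms_per_curve_ref, b1_ref, b1, n_curves):
    """Estimate total stage 1 wall time in seconds, assuming stage 1
    cost scales linearly in B1 (as stated in the post above).

    ms_per_curve_ref: measured ms/curve at the reference bound b1_ref.
    """
    ms_per_curve = ms_per_curve_ref * (b1 / b1_ref)  # linear B1 scaling
    return ms_per_curve * n_curves / 1000.0
```

For example, a batch measured at ~104 ms/curve with B1=50000 would be expected to take roughly twice as long per curve at B1=100000.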

2021-09-01, 01:50   #41
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

2·5,557 Posts

Quote:
 Originally Posted by frmky Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?
It is what I used to do when GPU-enabled ECM still worked on my machines. It was a trivial script to write.

Last fiddled with by xilman on 2021-09-01 at 01:51 Reason: Fix typo

 2021-09-01, 15:02 #42 bsquared     "Ben" Feb 2007 3584₁₀ Posts I re-cloned the gpu_integration branch to capture the latest changes and went through the build process, with the following caveats:

Specifying --with-gmp together with --with-cgbn-include doesn't work; I had to use the system default gmp (6.0.0).
With compute 70 I still have to replace __any with __any_sync(__activemask()) on line 10 of cudakernel_default.cu.
Building with gcc I get this error in cgbn_stage1.cu:
cgbn_stage1.cu(654): error: initialization with "{...}" is not allowed for object of type "const std::vector>"
I suppose I need to build with g++ instead?

Anyway, I can get past all of that and get a working binary, and the CPU usage is now much lower. But now the GPU portion appears to be about 15% slower?

Before:
Code:
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 75571ms of CPU time / 129206ms of GPU time
Throughput: 39.627 curves per second (on average 25.24ms per Step 1)
New clone:
Code:
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 643ms of CPU time / 149713ms of GPU time
Throughput: 34.199 curves per second (on average 29.24ms per Step 1)
Anyone else seeing this?
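As a sanity check, the throughput lines in that output can be reproduced from the raw curve counts and GPU times. A quick sketch (function name is mine) confirming the roughly 15% slowdown:

```python
def throughput(curves, gpu_ms):
    """Curves/second and average ms/curve, matching the
    'Throughput: ...' lines printed by the GPU ECM output."""
    return curves / (gpu_ms / 1000.0), gpu_ms / curves

cps_old, ms_old = throughput(5120, 129206)  # before: ~39.6 curves/s
cps_new, ms_new = throughput(5120, 149713)  # new clone: ~34.2 curves/s
slowdown = (ms_new - ms_old) / ms_old       # fractional increase in ms/curve
```

Running this gives a slowdown of about 0.159, consistent with the "about 15% slower" observation.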
 2021-09-01, 16:42 #43 chris2be8     Sep 2009 2,221 Posts Hello,

I've upgraded my system with a GTX 970 (sm_52) to openSUSE 42.2 and installed CUDA 9.0 on it. But when I try to compile ecm with GPU support, ./configure says:
Code:
configure: Using cuda.h from /usr/local/cuda/include
checking cuda.h usability... no
checking cuda.h presence... yes
configure: WARNING: cuda.h: present but cannot be compiled
configure: WARNING: cuda.h: check for missing prerequisite headers?
configure: WARNING: cuda.h: see the Autoconf documentation
configure: WARNING: cuda.h: section "Present But Cannot Be Compiled"
configure: WARNING: cuda.h: proceeding with the compiler's result
configure: WARNING: ## ----------------------------------- ##
configure: WARNING: ## Report this to ecm-discuss@inria.fr ##
configure: WARNING: ## ----------------------------------- ##
checking for cuda.h... no
configure: error: required header file missing
README.gpu says:
Code:
Some versions of CUDA are not compatible with recent versions of gcc.
To specify which C compiler is called by the CUDA compiler nvcc, type:
$ ./configure --enable-gpu --with-cuda-compiler=/PATH/DIR
If you get errors about "cuda.h: present but cannot be compiled"
try using an older CC: $ ./configure --enable-gpu CC=gcc-8
The value of this parameter is directly passed to nvcc via the option
"--compiler-bindir". By default, GMP-ECM lets nvcc choose what C
compiler it uses.
The only gcc installed now is version 4.8.5. Should I install an older gcc (if so, what level) or should I upgrade the OS to a higher level so I can install a newer CUDA? Does anyone have ecm working with CUDA 9.0 or higher on openSUSE, and if so, what level of openSUSE?

Chris (getting slightly frustrated by now)
2021-09-01, 17:47   #44
EdH

"Ed Hall"
Dec 2009

4,177 Posts

Quote:
 Originally Posted by chris2be8 The only gcc installed now is version 4.8.5. Should I install an older gcc (if so what level) or should I upgrade the OS to a higher level so I can install a newer CUDA? Does anyone have ecm working with CUDA 9.0 or higher on openSUSE and if so what level of openSUSE? Chris (getting slightly frustrated by now)
I've passed the frustration point with my systems. I was getting the same with my Ubuntu 20.04 on all the 10.x and 11.x CUDA versions (my card isn't supported by CUDA 11.x anyway). I installed and made default several older gcc versions (8, 9, 10).* I gave up for now.

* I'm curious about the gcc version number difference between yours and mine. The default Ubuntu 20.04 gcc is 9.3.0, my Debian Buster's is 8.3.0, and the default for my Fedora 33 is 10.3.1. Is your version actually that old compared to mine?
