mersenneforum.org > Factoring Projects > Factoring
2021-08-30, 19:34   #34
frmky

Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?
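(Illustration, not from the thread: the split step can be a few lines of Python. File names and chunk count here are made up; the sketch assumes GMP-ECM's -save output has one residue line per curve.)

```python
# Sketch of the split-and-resume workflow described above (hypothetical
# file names; assumes one residue line per curve, which is how GMP-ECM's
# -save output is laid out).
from pathlib import Path

def split_save_file(save_path, n_parts, out_dir="."):
    """Round-robin residue lines into n_parts chunk files; returns the
    chunk paths, one per `ecm -resume` process."""
    lines = Path(save_path).read_text().splitlines()
    paths = []
    for k in range(n_parts):
        p = Path(out_dir) / f"residues.{k}"   # illustrative name
        p.write_text("\n".join(lines[k::n_parts]) + "\n")
        paths.append(p)
    return paths
```

Each chunk would then get its own stage 2 process, e.g. one `ecm -resume residues.0 <B1> <B2>` per core (arguments hypothetical).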
2021-08-30, 19:42   #35
bsquared ("Ben")

Quote:
Originally Posted by frmky
Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?
I am working on the ability to process ecm save files with yafu, but it isn't ready yet.
2021-08-30, 19:44   #36
EdH ("Ed Hall")

Quote:
Originally Posted by frmky
Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?
Not sure if I'm understanding the question, but would ECM.py work?

Edit: For my Colab-GPU ECM experiments, I use:
Code:
python3 ecm.py  -resume residues
to run the residues from the Colab GPU stage 1 portion. I think I have all the threads, etc., set in the Python code, but they can be given on the command line as well.

The latest version is here.

2021-08-31, 08:49   #37
SethTro ("Seth")

@EdH I started using ECM.py again and it's great!

---

I wrote a bunch of code today so S_BITS_PER_BATCH is dynamic and there's better verbose output.

Verbose output includes this message when the kernel size is much larger than the input number.
Code:
Input number is 2^239-1 (72 digits)
Compiling custom kernel for 256 bits should be ~180% faster
CGBN<512, 4> running kernel<56 block x 128 threads>
I doubt that verbose is the right place for this output (as I'm not sure how many people can actually recompile CUDA code), but if you have a working setup it's as easy as changing:

Code:
-  typedef cgbn_params_t<4, 512>  cgbn_params_4_512;
+  typedef cgbn_params_t<4, 256>  cgbn_params_4_512;
---

ETA and estimated throughput

Code:
Copying 716800 bits of data to GPU
CGBN<640, 8> running kernel<112 block x 128 threads>
Computing 100 bits/call, 0/4328085 (0.0%)
Computing 110 bits/call, 100/4328085 (0.0%)
Computing 121 bits/call, 210/4328085 (0.0%)
...
Computing 256 bits/call, 1584/4328085 (0.0%)
Computing 655 bits/call, 5630/4328085 (0.1%)
Computing 1694 bits/call, 16050/4328085 (0.4%)
Computing 2049 bits/call, 35999/4328085 (0.8%), ETA 184 + 2 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 56489/4328085 (1.3%), ETA 183 + 2 = 185 seconds (~103 ms/curves)
...
Computing 2049 bits/call, 158939/4328085 (3.7%), ETA 178 + 7 = 185 seconds (~103 ms/curves)
Computing 2049 bits/call, 363839/4328085 (8.4%), ETA 169 + 16 = 185 seconds (~103 ms/curves)
...
Computing 2049 bits/call, 1798139/4328085 (41.5%), ETA 109 + 77 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 2003039/4328085 (46.3%), ETA 100 + 86 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 4052039/4328085 (93.6%), ETA 12 + 175 = 187 seconds (~104 ms/curves)
Copying results back to CPU ...
Computing 1792 Step 1 took 240ms of CPU time / 186575ms of GPU time
Throughput: 9.605 curves per second (on average 104.12ms per Step 1)
This is nice, as it gives very early feedback (estimates after 1-5 seconds are very accurate) if you are changing `-gpucurves` or playing with custom kernel bit sizes.
I've found that doubling gpucurves can lead to 2x worse throughput! So I may need to add some warnings.
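(Editor's illustration, not ecm's code: the ETA lines above are consistent with a simple constant-rate estimate; the function and variable names here are made up.)

```python
def eta_estimate(bits_done, bits_total, elapsed_s):
    """Remaining and total time assuming a constant bits-per-second rate,
    matching log lines like 'ETA 109 + 77 = 186 seconds'."""
    rate = bits_done / elapsed_s                 # bits per second so far
    remaining = (bits_total - bits_done) / rate  # first number in the ETA line
    return remaining, remaining + elapsed_s      # (remaining, total)

# 41.5% of the bits done after 77 s of GPU time:
remaining, total = eta_estimate(1798139, 4328085, 77)
```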

2021-08-31, 12:41   #38
EdH ("Ed Hall")

Quote:
Originally Posted by SethTro
@EdH I started using ECM.py again and it's great!
---
Good to read. I just wish I could get my sm_30 card to do something... (2 sm_20s and 1 sm_30, and none will do anything productive... yet. With all the install/reinstall/remove activity, the sm_30 machine is now complaining about the Linux kernel, so I've taken a break from trying more.)
2021-08-31, 17:53   #39
bur

I couldn't find it in the thread (I hope I didn't just overlook it): how does the speed of ECM on a GPU generally compare to a CPU? Say a GTX 1660 or similar.

And is it the case that only small B1 values can be used? I found this paper, and they also seem to have used only B1=50k. With a 2080 Ti they achieved "2781 ECM trials" (curves, I guess) per second for B1=50k. That is very fast, but if B1 is severely limited, a CPU is still required for larger factors?

2021-08-31, 21:01   #40
SethTro ("Seth")

Quote:
Originally Posted by bur
I couldn't find it in the thread (hope I didn't just overlook it), how does the speed of ECM on GPU generally compare to CPU? Say a GTX 1660 or similar.

And is it so that only small B1 values can be used? I found this paper and they also only seem to have used B1=50k. With a 2080 Ti they achieved "2781 ECM trials", I guess curves, per second for B1=50k. That is very fast, but if the B1 size is severely limited, a CPU is still required for larger factors?
The most important factor is the size of N (which CGBN limits to 32K bits on the GPU, or ~10,000 digits).
Both CPU and GPU have the same linear scaling in B1, which can be increased to any value you want.
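(To make the linear-B1 claim concrete, an editor's sketch: one timed run is enough to extrapolate stage 1 cost to any other B1. The function name and figures are illustrative, not measurements from this thread.)

```python
def extrapolate_stage1(t_measured_s, b1_measured, b1_target):
    """Stage 1 cost grows ~linearly in B1 (on both CPU and GPU), so a
    single timed run extrapolates to other B1 values."""
    return t_measured_s * (b1_target / b1_measured)

# e.g. if a batch at B1=50000 takes 10 s, B1=1000000 should take about 200 s
```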

The speedup depends strongly on your CPU vs. your GPU. For my 1080 Ti vs. my 2600K:

250 bits 46x faster on GPU
500 bits 48x faster on GPU
1000 bits 68x faster on GPU
1500 bits 83x faster on GPU
2000 bits 46x faster on GPU

This means we are seeing roughly the same scaling on the GPU as on the CPU for bit levels < 2K.
Informal testing with larger inputs (2048 to 32,768 bits) shows the CPU outscales the GPU: the speedup slowly decreases from ~50x to ~25x as the size grows from 2K to 16K bits. At the maximum of 32K bits, performance has decreased again to 14x (from 26x at 16K bits).

2021-09-01, 01:50   #41
xilman

Quote:
Originally Posted by frmky
Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?
That is what I used to do when GPU-enabled ECM still worked on my machines. It was a trivial script to write.

2021-09-01, 15:02   #42
bsquared ("Ben")

I re-cloned the gpu_integration branch to capture the latest changes and went through the build process with the following caveats:

Specifying --with-gmp together with --with-cgbn-include doesn't work; I had to use the system default GMP (6.0.0).

With compute 70 I still have to replace __any(...) with __any_sync(__activemask(), ...) on line 10 of cuda_kernel_default.cu.

Building with gcc, I get this error in cgbn_stage1.cu: cgbn_stage1.cu(654): error: initialization with "{...}" is not allowed for object of type "const std::vector<uint32_t, std::allocator<uint32_t>>"

I suppose I need to build with g++ instead?

Anyway, I can get past all of that and get a working binary, and the CPU usage is now much lower. But now the GPU portion appears to be about 15% slower?

Before:
Code:
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 75571ms of CPU time / 129206ms of GPU time
Throughput: 39.627 curves per second (on average 25.24ms per Step 1)
New clone:
Code:
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 643ms of CPU time / 149713ms of GPU time
Throughput: 34.199 curves per second (on average 29.24ms per Step 1)
Anyone else seeing this?
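(Editor's note, plain arithmetic rather than GMP-ECM code: the throughput lines quoted above are just curves divided by GPU wall time, which also puts a number on the regression.)

```python
def throughput(curves, gpu_ms):
    """Curves per second and average ms per curve for a Step 1 batch."""
    return curves / (gpu_ms / 1000.0), gpu_ms / curves

old = throughput(5120, 129206)   # before: ~39.6 curves/s, ~25.2 ms/curve
new = throughput(5120, 149713)   # new clone: ~34.2 curves/s, ~29.2 ms/curve
slowdown = new[1] / old[1] - 1   # ~0.16, i.e. the ~15% regression reported
```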
2021-09-01, 16:42   #43
chris2be8

Hello,

I've upgraded my system with a GTX 970 (sm_52) to openSUSE 42.2 and installed CUDA 9.0 on it. But when I try to compile ecm with GPU support ./configure says:
Code:
configure: Using cuda.h from /usr/local/cuda/include
checking cuda.h usability... no
checking cuda.h presence... yes
configure: WARNING: cuda.h: present but cannot be compiled
configure: WARNING: cuda.h:     check for missing prerequisite headers?
configure: WARNING: cuda.h: see the Autoconf documentation
configure: WARNING: cuda.h:     section "Present But Cannot Be Compiled"
configure: WARNING: cuda.h: proceeding with the compiler's result
configure: WARNING:     ## ----------------------------------- ##
configure: WARNING:     ## Report this to ecm-discuss@inria.fr ##
configure: WARNING:     ## ----------------------------------- ##
checking for cuda.h... no
configure: error: required header file missing
README.gpu says:
Code:
Some versions of CUDA are not compatible with recent versions of gcc.
To specify which C compiler is called by the CUDA compiler nvcc, type:

  $ ./configure --enable-gpu --with-cuda-compiler=/PATH/DIR

If you get errors about "cuda.h: present but cannot be compiled"
Try using an older CC:

  $ ./configure --enable-gpu CC=gcc-8

The value of this parameter is directly passed to nvcc via the option
"--compiler-bindir". By default, GMP-ECM lets nvcc choose what C compiler it
uses.
The only gcc installed now is version 4.8.5. Should I install an older gcc (if so, what version), or should I upgrade the OS to a newer release so I can install a newer CUDA? Does anyone have ecm working with CUDA 9.0 or higher on openSUSE, and if so, what level of openSUSE?

Chris (getting slightly frustrated by now)
2021-09-01, 17:47   #44
EdH ("Ed Hall")

Quote:
Originally Posted by chris2be8
The only gcc installed now is version 4.8.5. Should I install an older gcc (if so what level) or should I upgrade the OS to a higher level so I can install a newer CUDA? Does anyone have ecm working with CUDA 9.0 or higher on openSUSE and if so what level of openSUSE?

Chris (getting slightly frustrated by now)
I've passed the frustration point with my systems. I was getting the same with my Ubuntu 20.04 with all the 10.x and 11.x CUDA versions (my card isn't supported by CUDA 11.x, anyway). I installed and made default several older gcc versions (8, 9, 10).* I gave up for now.

* I'm curious about the gcc version number difference between yours and mine. The default Ubuntu 20.04 gcc is 9.3.0, my Debian Buster's is 8.3.0, and the default on my Fedora 33 is 10.3.1. Is your version actually that old compared to mine?