#133
"Ed Hall"
Dec 2009
Adirondack Mtns
1259₁₆ Posts
I understood about the -gpucurves, but what confused me was the:
Code:
CGBN<512, 4> running kernel<56 block x 256 threads> input number is 246 bits

Thanks for helping me understand this and for a great speedup.
#134
Sep 2009
2⁶·37 Posts
ecm-gpu downloaded from https://gitlab.inria.fr/zimmerma/ecm.git works for b1=11e7:
Code:
chris@4core:~/ecm-cgbn.2/ecm> date;time ./ecm -gpu -cgbn -save test2.save 110000000 1 <b58+148.ini;date
Tue 8 Mar 08:18:31 GMT 2022
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=1, sigma=3:35896186-3:35898617 (2432 curves)
GPU: Large B1, S = 158705536 bits = 151 MB
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 896
GPU: numRegsPerThread = 67 sharedMemPerBlock = 0 bytes
Computing 2432 Step 1 took 4508ms of CPU time / 4674180ms of GPU time

real 78m0.885s
user 0m10.992s
sys 0m2.513s
Tue 8 Mar 09:36:32 GMT 2022

The older version without -cgbn took about 9 hours to do the same job. Many thanks for the speed up.
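As a quick sanity check on the figures in that log, the sketch below works out the GPU time per curve and the rough speedup over the roughly 9-hour run mentioned for the older code. The variable names are mine, and the 9-hour figure is the poster's approximation, so the ratio is only approximate.

```python
# Figures copied from the log in the post above.
gpu_ms = 4_674_180   # "4674180ms of GPU time"
curves = 2432        # "(2432 curves)"
old_hours = 9.0      # "about 9 hours" for the older, non-CGBN run (approximate)

ms_per_curve = gpu_ms / curves            # stage 1 GPU time per curve
gpu_minutes = gpu_ms / 60_000             # total GPU time in minutes
speedup = (old_hours * 60) / gpu_minutes  # old run time / new run time

print(f"{ms_per_curve:.0f} ms of GPU time per curve")
print(f"{gpu_minutes:.1f} minutes of GPU time in total")
print(f"roughly {speedup:.1f}x faster than the older run")
```

The per-curve figure is the useful one when comparing against other B1 values or other cards, since the curve count changes from run to run.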
#135
"Ed Hall"
Dec 2009
Adirondack Mtns
1001001011001₂ Posts
Sorry if this has an "elementary" answer, but is there an optimum value that B1 should be a multiple of?
I'm currently basing my B1 values on what 896 curves need for the different t-levels. Should I adjust B1 to a nearby multiple of some base value and then adjust -gpucurves accordingly, or am I complicating things?
#136
"Seth"
Apr 2019
19×23 Posts
TL;DR: If you are still running B2, you should probably set B1 for each t-level based on this chart, then round the number of curves to the nearest multiple of 896. This is probably within 20% of optimal for >= t45.

You could optimize slightly by increasing B1 if you round down or decreasing B1 if you round up, so that ecm -v prints an "expected number of curves to find a factor" equal to the number of curves you are using.

In practice, for small factors everything is really fast, so for a single number who cares. But if you were working on factordb or a huge number of inputs (>5000) you would want to do something smarter. In theory the code could run one curve each for 896 different numbers, or something like that.

It can also make sense to tune the B1/B2 ratio based on how much RAM you have and how fast your CPU is versus your GPU. For example, see the discussion here. I wrote some hacky shell code to do this at sethtro/misc-scripts/ecm_gpu_optimizer
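The rounding step can be sketched in a few lines. This is only an illustration: the function name is hypothetical, the batch size of 896 is the figure used in this thread for one particular card, and the curve counts in the usage lines are placeholder inputs, not values taken from the chart.

```python
def round_to_batch(curves_needed: int, batch: int = 896) -> int:
    """Round a curve count to the nearest positive multiple of `batch`.

    `batch` is the number of curves the GPU runs at once (896 in this
    thread; other cards may differ).
    """
    if curves_needed <= 0:
        raise ValueError("curves_needed must be positive")
    # Integer round-half-up, with a floor of one full batch.
    batches = max(1, (curves_needed + batch // 2) // batch)
    return batches * batch


# Placeholder curve counts, for illustration only.
for needed in (100, 7553, 17769):
    rounded = round_to_batch(needed)
    print(f"{needed} curves -> {rounded} ({rounded - needed:+d})")
```

The signed difference printed at the end shows how far the rounded count is from the target, which is the gap a small B1 adjustment would be meant to close.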
#137
"Ed Hall"
Dec 2009
Adirondack Mtns
4697₁₀ Posts
Thanks. This gives me something to study. Unfortunately, the machine I was able to get to run the GPU has only 2 cores and 8G RAM. But, I have a script now that sends the residues to a second machine and moves to the next B1 level. Of course, now the GPU is the bottleneck since I'm only running stage 1 operations on its machine. I'm still looking at what might be best for my setup.
#138
"Seth"
Apr 2019
19×23 Posts
Code:
$ echo "1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417" | ./ecm -cgbn -v 11e5 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Using B1=1100000, B2=0, sigma=3:1276799189-3:1276800020 (832 curves)
Compiling custom kernel for 640 bits should be ~144% faster
CGBN<1024, 8> running kernel<26 block x 256 threads> input number is 569 bits
Computing 1158 bits/call, 96372/1586512 (6.1%), ETA 106 + 7 = 113 seconds (~135 ms/curves)
Computing 1158 bits/call, 212172/1586512 (13.4%), ETA 97 + 15 = 113 seconds (~135 ms/curves)
Computing 1158 bits/call, 327972/1586512 (20.7%), ETA 89 + 23 = 112 seconds (~135 ms/curves)

After changing
- typedef cgbn_params_t<8, 1024> cgbn_params_1024;
+ typedef cgbn_params_t<8, 640> cgbn_params_1024;

$ echo "1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417" | ./ecm -cgbn -v 11e5 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Using B1=1100000, B2=0, sigma=3:230651649-3:230652480 (832 curves)
CGBN<640, 8> running kernel<26 block x 256 threads> input number is 569 bits
Computing 1863 bits/call, 146292/1586512 (9.2%), ETA 67 + 7 = 74 seconds (~89 ms/curves)
Computing 1863 bits/call, 332592/1586512 (21.0%), ETA 60 + 16 = 76 seconds (~92 ms/curves)
Computing 1863 bits/call, 518892/1586512 (32.7%), ETA 52 + 25 = 77 seconds (~93 ms/curves)

Last fiddled with by SethTro on 2022-03-17 at 09:07
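One way to read the "CGBN<bits, tpi> running kernel<...>" line in these logs is sketched below. It assumes the second template parameter is the number of threads that cooperate on a single curve. That is an inference from the output rather than something stated in the thread, but it is consistent with the numbers: 26 blocks x 256 threads / 8 gives the 832 curves printed above. The regular expression and function name are mine.

```python
import re

# Matches lines such as:
#   CGBN<1024, 8> running kernel<26 block x 256 threads> input number is 569 bits
KERNEL_RE = re.compile(
    r"CGBN<(\d+), (\d+)> running kernel<(\d+) block x (\d+) threads>"
)


def curves_from_kernel_line(line: str) -> int:
    """Curves per launch, assuming `tpi` threads work together on each curve."""
    m = KERNEL_RE.search(line)
    if m is None:
        raise ValueError("no CGBN kernel line found")
    _bits, tpi, blocks, threads = map(int, m.groups())
    return blocks * threads // tpi


line = "CGBN<1024, 8> running kernel<26 block x 256 threads> input number is 569 bits"
print(curves_from_kernel_line(line))

# Per-curve timings reported before and after the 640-bit change.
before_ms = 135
after_ms = (89, 93)
print([round(before_ms / t, 2) for t in after_ms])
```

Under the same reading, the "CGBN<512, 4> running kernel<56 block x 256 threads>" line quoted earlier in the thread corresponds to 3584 curves per launch, which is four batches of 896.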
#139
Apr 2010
2²×3×19 Posts
If trying custom kernel sizes, also try 768 bits. For me (RTX 2060 Super) that's faster than 640 bits.
#140
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
2³·7·107 Posts
If that's the case, then a kernel benchmark that identifies the fastest kernels for each card would be useful. I currently have a version with all the possible kernels added, up to 300 digits or so.
#141
Sep 2009
2368₁₀ Posts
I've looked at your chart for recommended B1 and B2 values, but it confuses my script's calculations of how much ECM to do for a number of a given size. I need to do some serious thinking to get it all to work together.
#142
I moo ablest echo power!
May 2013
1,801 Posts
Hi, I've built this under WSL2, and everything works quite nicely, but when I run the test file (gpu_throughput_test.sh), CGBN fails when the input number is large enough:
"No available CGBN Kernel large enough to process N(1864 bits)"
I saw some posts earlier in the thread that might apply, but I thought it would be best to ask before I start messing with anything.
#143
"Seth"
Apr 2019
665₈ Posts
If you want to run ECM on numbers > 1020 bits look around line 670 in cgbn_stage1.cu
Last fiddled with by SethTro on 2022-04-03 at 06:22
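To illustrate why the 1864-bit input is rejected, here is a rough model of the kernel selection, written as a sketch and not taken from cgbn_stage1.cu. It assumes the build only contains the two kernel sizes that appear in the logs in this thread (512 and 1024 bits) and that a kernel needs a few spare bits above the input size. The 4-bit margin is a guess based on the "> 1020 bits" remark, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def pick_kernel(
    n_bits: int,
    kernel_sizes: Sequence[int] = (512, 1024),  # assumed: sizes seen in this thread's logs
    margin_bits: int = 4,                       # assumed: inferred from "> 1020 bits"
) -> Optional[int]:
    """Return the smallest kernel size that fits an n_bits input, or None."""
    for size in sorted(kernel_sizes):
        if n_bits <= size - margin_bits:
            return size
    return None


print(pick_kernel(246))                     # small input from earlier in the thread
print(pick_kernel(569))                     # the 172-digit number
print(pick_kernel(569, (512, 640, 1024)))   # with a custom 640-bit kernel compiled in
print(pick_kernel(1864))                    # no kernel is large enough
```

In this model, supporting larger inputs means adding a larger entry to the kernel list, which matches the suggestion to look at the kernel definitions in cgbn_stage1.cu.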