2022-03-08, 13:32  #133
"Ed Hall"
Dec 2009
Adirondack Mtns
1259_{16} Posts 
I understood about the -gpucurves option, but what confused me was this line:
Code:
CGBN<512, 4> running kernel<56 block x 256 threads> input number is 246 bits

Thanks for helping me understand this and for a great speedup.
2022-03-08, 16:39  #134
Sep 2009
2^{6}·37 Posts 
ecmgpu downloaded from https://gitlab.inria.fr/zimmerma/ecm.git works for b1=11e7:
Code:
chris@4core:~/ecmcgbn.2/ecm> date;time ./ecm -gpu -cgbn -save test2.save 110000000 1 <b58+148.ini;date
Tue 8 Mar 08:18:31 GMT 2022
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=1, sigma=3:358961863:35898617 (2432 curves)
GPU: Large B1, S = 158705536 bits = 151 MB
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 896
GPU: numRegsPerThread = 67 sharedMemPerBlock = 0 bytes
Computing 2432 Step 1 took 4508ms of CPU time / 4674180ms of GPU time

real 78m0.885s
user 0m10.992s
sys 0m2.513s
Tue 8 Mar 09:36:32 GMT 2022

The older version without cgbn took about 9 hours to do the same job. Many thanks for the speed up.
2022-03-08, 19:43  #135
"Ed Hall"
Dec 2009
Adirondack Mtns
1001001011001_{2} Posts 
Sorry if this has an "elementary" answer, but is there an optimum value that B1 should be a multiple of?
I'm currently basing my B1 values on what 896 curves need for the different t-levels. Should I adjust B1 to a close multiple of a base value and then adjust -gpucurves accordingly, or am I complicating things?
2022-03-08, 20:46  #136
"Seth"
Apr 2019
19×23 Posts 
Quote:
TL;DR: If you are still running B2, you should probably set B1 for each t-level based on this chart, then round the number of curves to the nearest multiple of 896. This is probably within 20% of optimal for >= t45. You could slightly optimize by increasing B1 if you round down or decreasing B1 if you round up (so that ecm -v prints an "expected number of curves to find a factor" equal to the number of curves you are using).

In practice, for small factors everything is really fast, so for a single number who cares; but if you were working on factordb or a huge number of inputs (>5000) you would want to do something smarter. In theory the code could run one curve for 896 different numbers or something.

It can also make sense to tune the B1/B2 ratio based on how much RAM you have and how fast your CPU is versus your GPU. For example, see the discussion here. I wrote some hacky shell code to do this at sethtro/miscscripts/ecm_gpu_optimizer
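The rounding rule in this post can be sketched in a few lines of Python. This is only an illustration, not the optimizer script mentioned above: the function names are made up for the example, the batch size of 896 is the figure quoted in this thread (it depends on the GPU), and the curve counts passed in are placeholders rather than values from the chart.

```python
import math

# Curves per GPU batch as quoted in this thread; hardware dependent (assumption).
GPU_BATCH = 896


def round_curves(expected_curves: float, batch: int = GPU_BATCH) -> int:
    """Round a curve count to the nearest positive multiple of the GPU batch."""
    # floor(x + 0.5) rounds halves up and avoids Python's banker's rounding.
    batches = max(1, math.floor(expected_curves / batch + 0.5))
    return batches * batch


def b1_adjustment(expected_curves: float, batch: int = GPU_BATCH) -> str:
    """Say which way B1 should move so the rounded count still reaches the t-level."""
    rounded = round_curves(expected_curves, batch)
    if rounded < expected_curves:
        return "increase B1"   # fewer curves than expected: each must do more work
    if rounded > expected_curves:
        return "decrease B1"   # more curves than expected: each can do less work
    return "keep B1"


if __name__ == "__main__":
    for curves in (1000, 1400, 4480):  # placeholder counts, not chart values
        print(curves, round_curves(curves), b1_adjustment(curves))
```

After adjusting B1, the check suggested above still applies: run ecm -v and compare its expected number of curves with the rounded count.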

2022-03-08, 22:36  #137
"Ed Hall"
Dec 2009
Adirondack Mtns
4697_{10} Posts 
Thanks. This gives me something to study. Unfortunately, the machine I was able to get to run the GPU has only 2 cores and 8G RAM. But, I have a script now that sends the residues to a second machine and moves to the next B1 level. Of course, now the GPU is the bottleneck since I'm only running stage 1 operations on its machine. I'm still looking at what might be best for my setup.

2022-03-17, 08:55  #138
"Seth"
Apr 2019
19×23 Posts 
Quote:
Code:
$ echo "1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417" | ./ecm -cgbn -v 11e5 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Using B1=1100000, B2=0, sigma=3:12767991893:1276800020 (832 curves)
Compiling custom kernel for 640 bits should be ~144% faster
CGBN<1024, 8> running kernel<26 block x 256 threads> input number is 569 bits
Computing 1158 bits/call, 96372/1586512 (6.1%), ETA 106 + 7 = 113 seconds (~135 ms/curves)
Computing 1158 bits/call, 212172/1586512 (13.4%), ETA 97 + 15 = 113 seconds (~135 ms/curves)
Computing 1158 bits/call, 327972/1586512 (20.7%), ETA 89 + 23 = 112 seconds (~135 ms/curves)

After changing

- typedef cgbn_params_t<8, 1024> cgbn_params_1024;
+ typedef cgbn_params_t<8, 640> cgbn_params_1024;

$ echo "1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417" | ./ecm -cgbn -v 11e5 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Using B1=1100000, B2=0, sigma=3:2306516493:230652480 (832 curves)
CGBN<640, 8> running kernel<26 block x 256 threads> input number is 569 bits
Computing 1863 bits/call, 146292/1586512 (9.2%), ETA 67 + 7 = 74 seconds (~89 ms/curves)
Computing 1863 bits/call, 332592/1586512 (21.0%), ETA 60 + 16 = 76 seconds (~92 ms/curves)
Computing 1863 bits/call, 518892/1586512 (32.7%), ETA 52 + 25 = 77 seconds (~93 ms/curves)

Last fiddled with by SethTro on 2022-03-17 at 09:07
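To put the two runs side by side, here is a small Python sketch that works only from the per-curve timings printed in the logs above (135 ms/curve with the 1024-bit kernel; 89, 92 and 93 ms/curve with the 640-bit kernel). The helper names are made up for the example, and averaging the three progress samples is a simplification rather than a proper benchmark.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def speedup(old_ms_per_curve: float, new_ms_per_curve: float) -> float:
    """How many times faster the new kernel is per curve."""
    return old_ms_per_curve / new_ms_per_curve


CURVES = 832                       # from "(832 curves)" in both runs
MS_1024 = [135, 135, 135]          # ms/curve samples, 1024-bit kernel
MS_640 = [89, 92, 93]              # ms/curve samples, 640-bit kernel

old = mean(MS_1024)
new = mean(MS_640)

# Sanity check: curves * ms/curve should roughly match the printed ETA totals.
total_old_s = CURVES * old / 1000
total_new_s = CURVES * new / 1000

if __name__ == "__main__":
    print(f"1024-bit kernel: {old:.1f} ms/curve, about {total_old_s:.0f} s total")
    print(f" 640-bit kernel: {new:.1f} ms/curve, about {total_new_s:.0f} s total")
    print(f"speedup: {speedup(old, new):.2f}x")
```

The totals this computes line up with the ETA figures in the logs (about 112 seconds versus about 76 seconds), so the smaller kernel finishes stage 1 in roughly two thirds of the time on this input.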

2022-03-17, 09:16  #139
Apr 2010
2^{2}×3×19 Posts 
If trying custom kernel sizes, also try 768 bits. For me (GTX 2060 Super) that's faster than 640 bits.

2022-03-17, 09:37  #140
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
2^{3}·7·107 Posts 
If that's the case, then a kernel benchmark that identifies the fastest kernels for each card would be useful. I currently have a version with all the possible kernels added, up to 300 digits or so.

2022-03-17, 16:53  #141
Sep 2009
2368_{10} Posts 
Quote:
I've looked at your chart for recommended B1 and B2 values, but it confuses my script's calculations of how much ECM to do for a number of a given size. I need to do some serious thinking to get it to all work together. 

2022-04-03, 03:52  #142
I moo ablest echo power!
May 2013
1,801 Posts 
Hi, I've built this under WSL2, and everything works quite nicely, but when I run the test file (gpu_throughput_test.sh), CGBN fails when the input number is large enough:
"No available CGBN Kernel large enough to process N(1864 bits)"
I saw some posts earlier in the thread that might apply, but I thought it would be best to ask before I start messing with anything.
2022-04-03, 06:22  #143
"Seth"
Apr 2019
665_{8} Posts 
Quote:
If you want to run ECM on numbers > 1020 bits, look around line 670 in cgbn_stage1.cu

Last fiddled with by SethTro on 2022-04-03 at 06:22
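The kernel choices seen in this thread (512 bits for a 246-bit input, 1024 bits for a 569-bit input, and the error for 1864 bits) are consistent with a simple "smallest kernel that fits" rule. The Python sketch below models that rule and is not the code in cgbn_stage1.cu: the kernel list is taken from the log lines in this thread, the second number in each pair is assumed to be the other template parameter shown in the typedef above, and the 4 bits of headroom are inferred from the "> 1020 bits" remark for a 1024-bit kernel.

```python
from typing import Optional, Sequence, Tuple

Kernel = Tuple[int, int]  # (bits, second template parameter as printed in the log)

# Kernels seen in this thread's logs: CGBN<512, 4> and CGBN<1024, 8>.
DEFAULT_KERNELS: Sequence[Kernel] = ((512, 4), (1024, 8))

# Assumption: a kernel needs this many spare bits above the input size,
# inferred from "numbers > 1020 bits" not fitting a 1024-bit kernel.
HEADROOM_BITS = 4


def pick_kernel(n_bits: int,
                kernels: Sequence[Kernel] = DEFAULT_KERNELS,
                headroom: int = HEADROOM_BITS) -> Optional[Kernel]:
    """Return the smallest kernel that can hold an n_bits input, or None."""
    for bits, param in sorted(kernels):
        if n_bits + headroom <= bits:
            return (bits, param)
    return None


if __name__ == "__main__":
    for n_bits in (246, 569, 1864):
        kernel = pick_kernel(n_bits)
        if kernel is None:
            print(f"no kernel large enough for N({n_bits} bits)")
        else:
            print(f"N({n_bits} bits) -> CGBN<{kernel[0]}, {kernel[1]}>")
```

Under this model, adding a larger entry to the kernel list is what lets an 1864-bit input run, and adding a 640-bit or 768-bit entry is what lets a 569-bit input avoid the slower 1024-bit kernel.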
