2021-09-09, 17:03  #89
"Seth"
Apr 2019
2^{4}×3^{3} Posts 
Quote:
Code:
/* NOTE: Custom kernel changes here
   You can either add a new kernel or I recommend just changing `cgbn_params_512`
- typedef cgbn_params_t<4, 512> cgbn_params_512;
+ typedef cgbn_params_t<TPI_SEE_COMMENT_ABOVE, YOUR_BITS_HERE> cgbn_params_512; // My name is a lie
The absolute limit is 32,768 bits. I found that GPU/CPU performance decreases 3x from 1,024 bits to 16,384 bits, then an additional 2x above 16,384. That is still something like 13x faster on my system, but possibly no longer competitive watt for watt.
Read the nearby comment for a sense of how long it will take to compile.

2021-09-10, 04:01  #90
"Seth"
Apr 2019
660_{8} Posts 
I spent most of today working on new optimal bounds. It can be a large speedup to use these instead of the traditionally optimal B1 bounds. ecm can confirm they represent a full t<X> while taking substantially less time when accounting for the GPU speedup.
Full table at https://github.com/sethtroisi/miscs..._gpu_optimizer and an excerpt below.
Code:
GPU speedup/CPU cores  digits   optimal B1         optimal B2  B2/B1 ratio  expected curves
Fast GPU + 4 cores
40/4                       35    2,567,367        264,075,603          103              809
40/4                       40    8,351,462      1,459,547,807          175             1760
40/4                       45   38,803,644     17,323,036,685          446             2481
40/4                       50   79,534,840     58,654,664,284          737             7269
40/4                       55  113,502,213     96,313,119,323          849            29883
40/4                       60  322,667,450    395,167,622,450         1225            56664
Fast GPU + 8 cores
40/8                       35    1,559,844        351,804,250          226             1038
40/8                       40    6,467,580      2,889,567,750          447             1843
40/8                       45   29,448,837     35,181,170,876         1195             2599
40/8                       50   40,201,280     58,928,323,592         1466            11993
40/8                       55  136,135,593    289,565,678,027         2127            20547
40/8                       60  479,960,096  3,226,409,839,042         6722            30014
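The B2/B1 ratio column is plain arithmetic on the two bounds, so it is easy to cross-check. Here is a minimal Python sketch using three rows copied from the 40/4 block. The helper name `b2_b1_ratio` is made up for illustration and is not part of the optimizer script.

```python
# Rows copied from the "Fast GPU + 4 cores" excerpt: (digits, B1, B2).
ROWS = [
    (35, 2_567_367, 264_075_603),
    (40, 8_351_462, 1_459_547_807),
    (45, 38_803_644, 17_323_036_685),
]


def b2_b1_ratio(b1, b2):
    """Return B2/B1 rounded to the nearest integer (hypothetical helper)."""
    return round(b2 / b1)


for digits, b1, b2 in ROWS:
    print(f"t{digits}: B2/B1 ~ {b2_b1_ratio(b1, b2)}")
```

The rounded values match the ratio column for these rows; the rest of the table can be checked the same way.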
2021-09-10, 09:46  #91
"Seth"
Apr 2019
2^{4}×3^{3} Posts 
Quote:
1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55. It's a nice debug check while this is still in development, but it has never tripped, so it's overly cautious, especially if it costs 10% performance.
2. Would you mind sharing what card you have and the full -v output (especially the lines that start with "GPU: ")?

2021-09-13, 14:11  #92
"Ben"
Feb 2007
3,617 Posts 
Quote:
It is a Tesla V100-SXM2-32GB (compute capability 7.0, 80 MPs, maxSharedPerBlock = 49152, maxThreadsPerBlock = 1024, maxRegsPerBlock = 65536).

2021-09-21, 08:35  #93
"Seth"
Apr 2019
2^{4}·3^{3} Posts 
Quote:
I'm seeing the new code be 3.1x faster, which is similar to the 2-3x improvement I've seen on a 1080ti, 970, and K80.
Code:
$ echo "(2^997-1)" | ./ecm -cgbn -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 134ms of CPU time / 69031ms of GPU time
Throughput: 74.170 curves per second (on average 13.48ms per Step 1)

$ echo "(2^997-1)" | ./ecm -gpu -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 10911ms of CPU time / 218643ms of GPU time
Throughput: 23.417 curves per second (on average 42.70ms per Step 1)
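For anyone comparing their own runs, the throughput lines follow directly from the curve count and the GPU time. A small Python sketch of that arithmetic, using only the numbers printed above (the function names are mine, not anything from ecm):

```python
def curves_per_second(curves, gpu_ms):
    """Throughput in curves per second from a curve count and GPU time in ms."""
    return curves / (gpu_ms / 1000.0)


def ms_per_curve(curves, gpu_ms):
    """Average GPU milliseconds spent per Step 1 curve."""
    return gpu_ms / curves


CURVES = 5120
cgbn = curves_per_second(CURVES, 69031)    # CGBN kernel run
old = curves_per_second(CURVES, 218643)    # older GPU kernel run

print(f"cgbn: {cgbn:.3f} curves/s, {ms_per_curve(CURVES, 69031):.2f} ms per curve")
print(f"gpu:  {old:.3f} curves/s, {ms_per_curve(CURVES, 218643):.2f} ms per curve")
print(f"speedup: {cgbn / old:.2f}x")
```

The ratio of the two throughputs is where the roughly 3x figure comes from.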

2021-10-22, 13:50  #94
May 2009
Russia, Moscow
2,767 Posts 
Hello, I've got an error while trying to run curves with B1=11e7:
Code:
ecmcgbn: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed. 
2021-10-22, 18:50  #95
"Seth"
Apr 2019
2^{4}·3^{3} Posts 
It's there to prevent GPU memory issues, so it can be ignored (unless you run with a very large number).
It's on my todo list to remove but I'm sadly without internet today. You can remove the assert and everything will be fine. 
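A rough way to see why B1=11e7 trips the assert while smaller bounds do not: assuming the stage 1 exponent s is the product of all prime powers up to B1 (the usual description of ECM stage 1; I have not checked this against cgbn_stage1.cu), its size is about B1/ln(2) bits. The sketch below is Python, the function names are hypothetical, and the 100000000 limit is taken from the assertion message.

```python
import math

LIMIT_BITS = 100_000_000  # upper bound from the assertion message


def estimated_s_bits(b1):
    """Approximate bit length of s, using log(lcm(1..B1)) ~ B1."""
    return b1 / math.log(2)


def exact_s_bits(b1):
    """Exact bit length of the product of maximal prime powers <= b1 (small b1 only)."""
    sieve = bytearray([1]) * (b1 + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(b1) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, b1 + 1, i)))
    s = 1
    for p in range(2, b1 + 1):
        if sieve[p]:
            pk = p
            while pk * p <= b1:  # largest power of p not exceeding b1
                pk *= p
            s *= pk
    return s.bit_length()


for b1 in (43_000_000, 110_000_000):
    bits = estimated_s_bits(b1)
    status = "over" if bits > LIMIT_BITS else "under"
    print(f"B1={b1}: ~{bits:.3e} bits, {status} the limit")

print("sanity check at B1=100000:", exact_s_bits(100_000), "bits exact")
```

Under that assumption the limit is reached a little below B1=7e7, which would fit B1=11e7 failing.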
2021-10-24, 08:47  #96
"Seth"
Apr 2019
660_{8} Posts 
Quote:
I just merged https://gitlab.inria.fr/zimmerma/ecm...ge_requests/27, which contains a fix for the B1 limit along with a number of quality-of-life improvements: multiple kernels included by default (512 and 1024), estimated timing, better overflow detection, and faster compilation.

2021-10-24, 13:35  #97
May 2009
Russia, Moscow
2767_{10} Posts 
SethTro, thanks for the explanation and improvements!

2021-11-04, 10:12  #98
"Seth"
Apr 2019
2^{4}×3^{3} Posts 
I was playing around with CGBN today and I realized that it doesn't use fast squaring. In GMP, fast squaring yields a 1.5x speedup. I filed issue 19 asking the author what would be needed to add support for fast squaring.
I then discovered their paper ("Optimizing Modular Multiplication for NVIDIA’s Maxwell GPUs") on the subject. It suggests that a more modest 20-30% gain is likely. The main doubling loop contains 11 additions, 4 multiplications, and 4 squares, so this would likely only be a ~10% final gain, but it is something we (READ: I) should try to track down.
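To make the ~10% estimate concrete, here is a back-of-the-envelope cost model in Python. Everything except the operation counts is an assumption: I read the 20-30% gain as a squaring costing 0.7 to 0.8 of a multiplication, and I guess that an addition costs about 0.1 of a multiplication. Neither number is measured.

```python
# Operation counts for the main doubling loop, as quoted in the post.
ADDS, MULS, SQUARES = 11, 4, 4


def loop_speedup(square_cost, add_cost=0.1):
    """Speedup of the loop when a squaring costs `square_cost` multiplications.

    Costs are in units of one multiplication; `add_cost` is an assumed value.
    """
    before = ADDS * add_cost + MULS + SQUARES * 1.0
    after = ADDS * add_cost + MULS + SQUARES * square_cost
    return before / after


for square_cost in (0.8, 0.7):
    print(f"square = {square_cost} mul -> loop speedup {loop_speedup(square_cost):.3f}x")
```

With these guesses the loop speeds up by roughly 10 to 15%, in the same ballpark as the estimate above. A more expensive addition would shrink the gain further.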
2021-11-08, 13:41  #99
May 2009
Russia, Moscow
2,767 Posts 
BTW, the 54-digit factor was found using Kaggle and ECM with CGBN support. 3584@43e6 took almost 3 hours for stage 1 on a Tesla P100.
Code:
Resuming ECM residue saved by @58c8c7d3f28a with GMP-ECM 7.0.5-dev on Sun Nov 7 16:49:14 2021
Input number is 32548578398364358484341350345766214474783986512971108655859583723767495515336168718870906961859034438402149815916929838626831190652930427474273050773518305674391 (161 digits)
Using B1=43000000-43000000, B2=240490660426, polynomial Dickson(12), sigma=3:2723506384
Step 1 took 1ms
Step 2 took 40649ms
********** Factor found in step 2: 414964253388127406110807725062798487272054568225225131
Found prime factor of 54 digits: 414964253388127406110807725062798487272054568225225131
Prime cofactor 78437065681223350191914183317403238121110774952722134895604084183194227719884667275947965573494888337419461 has 107 digits