[QUOTE=chris2be8;587572]Just the higher arch one (sm_52). Sorry.
PS. Does CGBN increase the maximum size of number that can be handled? I'd try it, but I'm tied up catching up with ECM work I delayed while I was getting ecm-cgbn working.[/QUOTE]

Yes! In cgbn_stage1.cu search for this line:

[CODE]/* NOTE: Custom kernel changes here[/CODE]

You can either add a new kernel or, as I recommend, just change `cgbn_params_512`:

[CODE]- typedef cgbn_params_t<4, 512> cgbn_params_512;
+ typedef cgbn_params_t<TPI_SEE_COMMENT_ABOVE, YOUR_BITS_HERE> cgbn_params_512; // My name is a lie[/CODE]

The absolute limit is 32,768 bits. I found that GPU performance (relative to the CPU) decreases about 3x from 1,024 bits to 16,384 bits, then an additional 2x above 16,384 bits. That's still something like 13x faster on my system, but possibly no longer competitive watt for watt. Read the nearby comment for a sense of how long it will take to compile.
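To make that concrete, here is a minimal sketch of what a larger kernel might look like (the TPI value and bit size are illustrative choices of mine, not from the post; the comment above the typedef in cgbn_stage1.cu gives the exact TPI/BITS pairings):

[CODE]// Sketch: a 2048-bit kernel. TPI and BITS must be changed together --
// TPI is one of {4, 8, 16, 32}, and larger BITS values require a larger
// TPI (see the "Custom kernel" comment in cgbn_stage1.cu for the pairing).
typedef cgbn_params_t<8, 2048> cgbn_params_512;  // name kept so the existing plumbing still compiles[/CODE]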
I spent most of today working on new optimal bounds. Using these instead of the traditionally optimal B1 bounds can be a [URL="https://www.mersenneforum.org/showpost.php?p=587617&postcount=22"]large speedup[/URL]. ecm can confirm they represent a full t<X> while taking substantially less time once the GPU speedup is accounted for.
Full table at [url]https://github.com/sethtroisi/misc-scripts/tree/main/ecm_gpu_optimizer[/url] and an excerpt below.

[CODE]GPU speedup/   digits   optimal B1    optimal B2         B2/B1   expected
CPU cores                                                 ratio   curves

Fast GPU + 4 cores
40/4           35       2,567,367     264,075,603          103      809
40/4           40       8,351,462     1,459,547,807        175     1760
40/4           45       38,803,644    17,323,036,685       446     2481
40/4           50       79,534,840    58,654,664,284       737     7269
40/4           55       113,502,213   96,313,119,323       849    29883
40/4           60       322,667,450   395,167,622,450     1225    56664

Fast GPU + 8 cores
40/8           35       1,559,844     351,804,250          226     1038
40/8           40       6,467,580     2,889,567,750        447     1843
40/8           45       29,448,837    35,181,170,876      1195     2599
40/8           50       40,201,280    58,928,323,592      1466    11993
40/8           55       136,135,593   289,565,678,027     2127    20547
40/8           60       479,960,096   3,226,409,839,042   6722    30014[/CODE]
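As a usage sketch (the save/resume split is the standard way to pair GPU stage 1 with CPU stage 2; the filename and composite are placeholders, not from the post): taking the 45-digit row for a fast GPU plus 4 cores, run stage 1 on the GPU at B1=38,803,644 and then stage 2 on the CPU at B2=17,323,036,685.

[CODE]# Stage 1 on the GPU, saving residues (B2=0 skips stage 2 here):
echo "<composite>" | ./ecm -cgbn -save stage1.txt 38803644 0

# Stage 2 on the CPU with the matching B2 (split stage1.txt across cores):
./ecm -resume stage1.txt 38803644 17323036685[/CODE]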
[QUOTE=bsquared;587026]1280: (~31 ms/curve)
2560: (~21 ms/curve)
640: (~63 ms/curve)
1792: (~36 ms/curve)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ ~25 ms/curve). With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6 ms/curve on both new and old builds.[/QUOTE]

Two late-night performance thoughts:

1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55 (a sketch of the change follows this post). It's a nice debug check while this is still in development, but it has never tripped, so it's overly cautious, especially if it costs 10% performance.

2. Would you mind sharing what card you have and the full -v output (especially the lines that start with "GPU: ")?
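The toggle is presumably just a preprocessor define near the top of cgbn_stage1.cu (the line number is from the post; I haven't verified the surrounding context):

[CODE]// cgbn_stage1.cu, around line 55: trade the debug safety net for
// ~10% throughput by disabling the per-operation normalization check.
#define VERIFY_NORMALIZED 0   // was 1[/CODE]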
[QUOTE=SethTro;587628]Two late-night performance thoughts:

1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55. It's a nice debug check while this is still in development, but it has never tripped, so it's overly cautious, especially if it costs 10% performance.

2. Would you mind sharing what card you have and the full -v output (especially the lines that start with "GPU: ")?[/QUOTE]

Hmm, when running on 2^997-1 I'm getting *better* throughput with VERIFY_NORMALIZED 1: 53.5 curves/sec with it defined to 1 vs. 45.6 curves/sec with it defined to 0, both running -gpucurves 2560. If I set -gpucurves 5120 then the no-verify version is 15% faster, but still slower than -gpucurves 2560.

It is a Tesla V100-SXM2-32GB (compute capability 7.0, 80 MPs, maxSharedPerBlock = 49152, maxThreadsPerBlock = 1024, maxRegsPerBlock = 65536).
[QUOTE=bsquared;587026]1280: (~31 ms/curve)
2560: (~21 ms/curve)
640: (~63 ms/curve)
1792: (~36 ms/curve)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ ~25 ms/curve). With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6 ms/curve on both new and old builds.[/QUOTE]

I was confused when you saw only moderate gains, so I rented a V100 (V100-SXM2-16GB) on AWS today. I'm seeing the new code be 3.1x faster, which is in line with the 2-3x improvement I've seen on a 1080ti, 970, and K80.

[CODE]$ echo "(2^997-1)" | ./ecm -cgbn -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 134ms of CPU time / 69031ms of GPU time
Throughput: 74.170 curves per second (on average 13.48ms per Step 1)

$ echo "(2^997-1)" | ./ecm -gpu -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 10911ms of CPU time / 218643ms of GPU time
Throughput: 23.417 curves per second (on average 42.70ms per Step 1)[/CODE]
Hello, I've got an error while trying to run curves with B1=11e7:
[CODE]ecm-cgbn: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.[/CODE] Is this a sort of CGBN limitation?
It's there to prevent GPU memory issues, so it can be ignored (unless you run with a very large number).

It's on my to-do list to remove, but I'm sadly without internet today. You can remove the assert and everything will be fine.
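Concretely (reconstructed from the assertion message above; I haven't checked the surrounding code), the workaround is to comment out the check in cgbn_stage1.cu:

[CODE]/* cgbn_stage1.cu, around line 525 -- reconstructed from the assertion
 * message. Commenting this out lets larger B1 values through: */
// assert(1 <= num_bits && num_bits <= 100000000);[/CODE]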
[QUOTE=unconnected;591358]Hello, I've got an error while trying to run curves with B1=11e7:
[CODE]ecm-cgbn: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.[/CODE] Is this a sort of CGBN limitation?[/QUOTE]

I just merged [URL]https://gitlab.inria.fr/zimmerma/ecm/-/merge_requests/27[/URL], which contains a fix for the B1 limit along with a number of quality-of-life improvements: multiple kernels included by default (512 and 1024 bits), estimated timing, better overflow detection, and faster compilation.
[B]SethTro[/B], thanks for the explanation and improvements!
I was playing around with CGBN today and realized that it [URL="https://github.com/NVlabs/CGBN/blob/master/include/cgbn/impl_cuda.cu#L1033"]doesn't use fast squaring[/URL]. In GMP, fast squaring yields a [URL="https://gmplib.org/manual/Basecase-Multiplication"]1.5x speedup[/URL] over basecase multiplication. I filed [URL="https://github.com/NVlabs/CGBN/issues/19"]issue 19[/URL] asking the author what would be needed to add support for fast squaring.
I then discovered their paper ([URL="http://www.acsel-lab.com/arithmetic/arith23/data/1616a047.pdf"]"Optimizing Modular Multiplication for NVIDIA’s Maxwell GPUs"[/URL]) on the subject. It suggests that a more modest 20-30% gain from squaring is likely. The main doubling loop contains 11 additions, 4 multiplications, and 4 squarings, so this would likely only be a ~10% final gain, but it's something we (READ: I) should try to track down.
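Back-of-envelope for that ~10% (my own weighting, not from the paper): if a modular addition costs roughly 0.2 of a multiplication, the doubling loop costs about 11 x 0.2 + 4 + 4 ≈ 10.2 multiply-equivalents; a 25% speedup on the 4 squarings saves about 1.0 of those, i.e. roughly 10% overall.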
BTW, the 54-digit factor was found using Kaggle and ECM with CGBN support. 3584@43e6 took almost 3 hours for stage 1 on a Tesla P100.
[CODE]Resuming ECM residue saved by @58c8c7d3f28a with GMP-ECM 7.0.5-dev on Sun Nov 7 16:49:14 2021
Input number is 32548578398364358484341350345766214474783986512971108655859583723767495515336168718870906961859034438402149815916929838626831190652930427474273050773518305674391 (161 digits)
Using B1=43000000-43000000, B2=240490660426, polynomial Dickson(12), sigma=3:2723506384
Step 1 took 1ms
Step 2 took 40649ms
********** Factor found in step 2: 414964253388127406110807725062798487272054568225225131
Found prime factor of 54 digits: 414964253388127406110807725062798487272054568225225131
Prime cofactor 78437065681223350191914183317403238121110774952722134895604084183194227719884667275947965573494888337419461 has 107 digits[/CODE]