mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   Faster GPU-ECM with CGBN (https://www.mersenneforum.org/showthread.php?t=27103)

SethTro 2021-09-09 17:03

[QUOTE=chris2be8;587572]Just the higher arch one (sm_52). Sorry.

PS. Does CGBN increase the maximum size of number that can be handled? I'd try it, but I'm tied up catching up with ECM work I delayed while I was getting ecm-cgbn working.[/QUOTE]

Yes! In cgbn_stage1.cu, search for this line:
/* NOTE: Custom kernel changes here

You can either add a new kernel, or (my recommendation) just change `cgbn_params_512`:

- typedef cgbn_params_t<4, 512> cgbn_params_512;
+ typedef cgbn_params_t<TPI_SEE_COMMENT_ABOVE, YOUR_BITS_HERE> cgbn_params_512; // My name is a lie

The absolute limit is 32,768 bits. I found that GPU-over-CPU performance decreases 3x going from 1,024 bits to 16,384 bits, and an additional 2x above 16,384 bits. That's still something like 13x faster on my system, but possibly no longer competitive watt for watt. Read the nearby comment for a sense of how long it will take to compile.
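
For concreteness, a new kernel might look like this (a hypothetical 2048-bit example; the valid TPI values are spelled out in the comment referenced above, so double-check against it):

[CODE]// Hypothetical extra kernel for numbers up to 2048 bits.
// TPI (threads per instance) must be a value CGBN accepts for this size;
// see the "Custom kernel changes here" comment for the constraints.
typedef cgbn_params_t<16, 2048> cgbn_params_2048;[/CODE]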

SethTro 2021-09-10 04:01

I spent most of today working out new optimal bounds. Using these instead of the traditionally optimal B1 bounds can be a [URL="https://www.mersenneforum.org/showpost.php?p=587617&postcount=22"]large speedup[/URL]. ecm can confirm they represent a full t<X> while taking substantially less time once the GPU speedup is accounted for.

Full table at [url]https://github.com/sethtroisi/misc-scripts/tree/main/ecm_gpu_optimizer[/url]; an excerpt is below:

[CODE]GPU speedup/CPU cores  digits   optimal B1         optimal B2  B2/B1 ratio  expected curves
Fast GPU + 4 cores
40/4                       35    2,567,367        264,075,603          103              809
40/4                       40    8,351,462      1,459,547,807          175             1760
40/4                       45   38,803,644     17,323,036,685          446             2481
40/4                       50   79,534,840     58,654,664,284          737             7269
40/4                       55  113,502,213     96,313,119,323          849            29883
40/4                       60  322,667,450    395,167,622,450         1225            56664
Fast GPU + 8 cores
40/8                       35    1,559,844        351,804,250          226             1038
40/8                       40    6,467,580      2,889,567,750          447             1843
40/8                       45   29,448,837     35,181,170,876         1195             2599
40/8                       50   40,201,280     58,928,323,592         1466            11993
40/8                       55  136,135,593    289,565,678,027         2127            20547
40/8                       60  479,960,096  3,226,409,839,042         6722            30014[/CODE]
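
As a usage sketch (my own example, not from the repo): taking the 40/4 row for 45 digits above, stage 1 runs on the GPU with B2=0 and the saved residues are resumed on the CPU for stage 2:

[CODE]# Hypothetical invocation for the 40/4, 45-digit row: GPU stage 1 only,
# saving residues, then CPU stage 2 with the table's B2.
echo "(2^997-1)" | ./ecm -cgbn -save t45.save 38803644 0
ecm -resume t45.save 38803644 17323036685[/CODE]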

SethTro 2021-09-10 09:46

[QUOTE=bsquared;587026]1280: (~31 ms/curves)
2560: (~21 ms/curves)
640: (~63 ms/curves)
1792: (~36 ms/curves)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves))

With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.[/QUOTE]

Two late-night performance thoughts.
1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55 (see the sketch after this list).
It's a nice debug check while this is still in development, but it has never tripped, so it's overly cautious, especially if it costs 10% performance.
2. Would you mind sharing what card you have and the full -v output (especially the lines that start with "GPU: ")?
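
For reference, the toggle is a plain compile-time flag (a sketch based on the line number above; the exact context in cgbn_stage1.cu may differ):

[CODE]// cgbn_stage1.cu, around line 55: set to 0 to skip the normalization
// self-check in exchange for possibly ~10% more throughput.
#define VERIFY_NORMALIZED 0[/CODE]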

bsquared 2021-09-13 14:11

[QUOTE=SethTro;587628]Two late-night performance thoughts.
1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55.
It's a nice debug check while this is still in development, but it has never tripped, so it's overly cautious, especially if it costs 10% performance.
2. Would you mind sharing what card you have and the full -v output (especially the lines that start with "GPU: ")?[/QUOTE]

Hmm, when running on 2^997-1 I'm getting *better* throughput with VERIFY_NORMALIZED 1: 53.5 curves/sec with it defined to 1 vs. 45.6 curves/sec with it defined to 0, both running -gpucurves 2560. If I set -gpucurves 5120 then the no-verify version is 15% faster, but still slower than -gpucurves 2560.

It is a Tesla V100-SXM2-32GB (compute capability 7.0, 80 MPs, maxSharedPerBlock = 49152, maxThreadsPerBlock = 1024, maxRegsPerBlock = 65536).

SethTro 2021-09-21 08:35

[QUOTE=bsquared;587026]1280: (~31 ms/curves)
2560: (~21 ms/curves)
640: (~63 ms/curves)
1792: (~36 ms/curves)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves))

With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.[/QUOTE]

I was confused that you saw only moderate gains, so I rented a V100 (V100-SXM2-16GB) on AWS today.
I'm seeing the new code run 3.1x faster, which is in line with the 2-3x improvement I've seen on a 1080 Ti, 970, and K80.

[CODE]
$ echo "(2^997-1)" | ./ecm -cgbn -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 134ms of CPU time / 69031ms of GPU time
Throughput: 74.170 curves per second (on average 13.48ms per Step 1)


$ echo "(2^997-1)" | ./ecm -gpu -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 10911ms of CPU time / 218643ms of GPU time
Throughput: 23.417 curves per second (on average 42.70ms per Step 1)
[/CODE]

unconnected 2021-10-22 13:50

Hello, I've got an error while trying to run curves with B1=11e7:

[CODE]ecm-cgbn: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.[/CODE]
Is this some sort of CGBN limitation?

SethTro 2021-10-22 18:50

It's there to prevent GPU memory issues, so it can be ignored (unless you run with a truly huge number).
It's on my to-do list to remove, but I'm sadly without internet today.
You can remove the assert and everything will be fine.
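
Concretely, the check being hit is the one named in the assert message (reconstructed from that message; the surrounding code may differ):

[CODE]// cgbn_stage1.cu, around line 525: guards the size of s (the product of
// prime powers <= B1). Comment it out, or raise the bound, for large B1.
assert(1 <= num_bits && num_bits <= 100000000);[/CODE]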

SethTro 2021-10-24 08:47

[QUOTE=unconnected;591358]Hello, I've got an error while trying to run curves with B1=11e7:

[CODE]ecm-cgbn: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.[/CODE]
Is this some sort of CGBN limitation?[/QUOTE]


I just merged [URL]https://gitlab.inria.fr/zimmerma/ecm/-/merge_requests/27[/URL], which contains a fix for the B1 limit along with a number of quality-of-life improvements: multiple kernels included by default (512-bit and 1024-bit), estimated timing, better overflow detection, and faster compilation.

unconnected 2021-10-24 13:35

[B]SethTro[/B], thanks for the explanation and improvements!

SethTro 2021-11-04 10:12

I was playing around with CGBN today and I realized that it [URL="https://github.com/NVlabs/CGBN/blob/master/include/cgbn/impl_cuda.cu#L1033"]doesn't use fast squaring[/URL]. In GMP, fast squaring yields a [URL="https://gmplib.org/manual/Basecase-Multiplication"]1.5x speedup[/URL]. I filed [URL="https://github.com/NVlabs/CGBN/issues/19"]issue 19[/URL] asking the author what would be needed to add support for it.

I then discovered their paper ([URL="http://www.acsel-lab.com/arithmetic/arith23/data/1616a047.pdf"]"Optimizing Modular Multiplication for NVIDIA’s Maxwell GPUs"[/URL]) on the subject. It suggests that a more modest 20-30% gain is likely.

The main doubling loop contains 11 additions, 4 multiplications, and 4 squarings, so this would likely only be a ~10% final gain, but it's something we (READ: I) should try to track down.
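
Back-of-envelope for that ~10% (my own arithmetic, assuming additions are cheap next to multiplications): if fast squaring makes a squaring ~25% cheaper than a multiplication, the loop's multiplicative work drops from 4 + 4 = 8 mul-equivalents to 4 + 4(0.75) = 7, about 12%, and the 11 additions dilute that toward ~10% overall.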

unconnected 2021-11-08 13:41

BTW, the 54-digit factor was found using Kaggle and ECM with CGBN support. 3584@43e6 took almost 3 hours for stage 1 on a Tesla P100.

[CODE]Resuming ECM residue saved by @58c8c7d3f28a with GMP-ECM 7.0.5-dev on Sun Nov 7 16:49:14 2021
Input number is 32548578398364358484341350345766214474783986512971108655859583723767495515336168718870906961859034438402149815916929838626831190652930427474273050773518305674391 (161 digits)
Using B1=43000000-43000000, B2=240490660426, polynomial Dickson(12), sigma=3:2723506384
Step 1 took 1ms
Step 2 took 40649ms
********** Factor found in step 2: 414964253388127406110807725062798487272054568225225131
Found prime factor of 54 digits: 414964253388127406110807725062798487272054568225225131
Prime cofactor 78437065681223350191914183317403238121110774952722134895604084183194227719884667275947965573494888337419461 has 107 digits
[/CODE]

