mersenneforum.org  

2021-09-09, 17:03   #89
SethTro
"Seth"
Apr 2019
2³×3²×5 Posts

Quote:
Originally Posted by chris2be8
Just the higher arch one (sm_52). Sorry.

PS. Does CGBN increase the maximum size of number that can be handled? I'd try it, but I'm tied up catching up with ECM work I delayed while I was getting ecm-cgbn working.

Yes! In cgbn_stage1.cu, search for this line:
/* NOTE: Custom kernel changes here

You can either add a new kernel or, as I recommend, just change `cgbn_params_512`:

- typedef cgbn_params_t<4, 512> cgbn_params_512;
+ typedef cgbn_params_t<TPI_SEE_COMMENT_ABOVE, YOUR_BITS_HERE> cgbn_params_512; // My name is a lie

The absolute limit is 32,768 bits. I found that GPU/CPU performance decreases 3x from 1,024 bits to 16,384 bits, then by an additional 2x above 16,384 bits. That is still something like 13x faster on my system, but possibly no longer competitive watt for watt. Read the nearby comment for a sense of how long it will take to compile.
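For anyone picking the two template arguments, here is a rough Python sketch of the arithmetic behind them. It assumes CGBN's usual layout (32-bit limbs shared across the TPI threads of a group, with TPI one of 4, 8, 16 or 32, and BITS a multiple of 32). The helper name `limbs_per_thread` is made up for illustration and is not part of ecm or CGBN, and the comment in cgbn_stage1.cu remains the authority on which combinations are actually supported.

```python
MAX_BITS = 32 * 1024  # the 32,768-bit limit mentioned above


def limbs_per_thread(tpi: int, bits: int) -> int:
    """Hypothetical helper: 32-bit limbs held by each thread for a (TPI, BITS) pair."""
    if tpi not in (4, 8, 16, 32):
        raise ValueError("assumption: TPI must be 4, 8, 16 or 32")
    if bits <= 0 or bits > MAX_BITS:
        raise ValueError("bits must be between 1 and 32,768")
    if bits % 32 != 0:
        raise ValueError("assumption: BITS must be a multiple of 32")
    total_limbs = bits // 32
    # Ceiling division: any leftover limbs are assumed to be padded out.
    return -(-total_limbs // tpi)


# The stock typedef cgbn_params_t<4, 512> gives 4 limbs per thread.
print(limbs_per_thread(4, 512))
# At the 32,768-bit limit with TPI = 32, each thread would hold 32 limbs.
print(limbs_per_thread(32, MAX_BITS))
```

More limbs per thread generally means more registers per thread, which is one plausible reason the larger kernels slow down and take longer to compile.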
2021-09-10, 04:01   #90
SethTro
"Seth"
Apr 2019
2³×3²×5 Posts

I spent most of today working on new optimal bounds. Using these instead of the traditionally optimal B1 bounds can give a large speedup. ecm can confirm that they represent a full t<X> while taking substantially less time once the GPU speedup is accounted for.

Full table at https://github.com/sethtroisi/misc-s..._gpu_optimizer and an excerpt below

Code:
GPU speedup/CPU cores  digits   optimal B1         optimal B2  B2/B1 ratio  expected curves
Fast GPU + 4 cores
40/4                       35    2,567,367        264,075,603          103              809
40/4                       40    8,351,462      1,459,547,807          175             1760
40/4                       45   38,803,644     17,323,036,685          446             2481
40/4                       50   79,534,840     58,654,664,284          737             7269
40/4                       55  113,502,213     96,313,119,323          849            29883
40/4                       60  322,667,450    395,167,622,450         1225            56664
Fast GPU + 8 cores
40/8                       35    1,559,844        351,804,250          226             1038
40/8                       40    6,467,580      2,889,567,750          447             1843
40/8                       45   29,448,837     35,181,170,876         1195             2599
40/8                       50   40,201,280     58,928,323,592         1466            11993
40/8                       55  136,135,593    289,565,678,027         2127            20547
40/8                       60  479,960,096  3,226,409,839,042         6722            30014
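As a quick sanity check on how to read the excerpt, the B2/B1 column is simply B2 divided by B1, rounded to the nearest integer. The small Python sketch below recomputes it for the first three "Fast GPU + 4 cores" rows. The row values are copied from the table above; the names `ROWS` and `b2_over_b1` are only illustrative.

```python
# (digits, optimal B1, optimal B2, listed B2/B1 ratio, expected curves)
ROWS = [
    (35, 2_567_367, 264_075_603, 103, 809),
    (40, 8_351_462, 1_459_547_807, 175, 1760),
    (45, 38_803_644, 17_323_036_685, 446, 2481),
]


def b2_over_b1(b1: int, b2: int) -> int:
    """Rounded B2/B1 ratio, as listed in the table."""
    return round(b2 / b1)


for digits, b1, b2, listed, curves in ROWS:
    ratio = b2_over_b1(b1, b2)
    print(f"t{digits}: B2/B1 = {b2 / b1:.2f}, rounds to {ratio} (table lists {listed})")
```

The ratio grows with the digit level and with the number of CPU cores, which fits the idea that more stage 2 capacity per GPU lets B2 be pushed further relative to B1.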
2021-09-10, 09:46   #91
SethTro
"Seth"
Apr 2019
101101000₂ Posts

Quote:
Originally Posted by bsquared
1280: (~31 ms/curves)
2560: (~21 ms/curves)
640: (~63 ms/curves)
1792: (~36 ms/curves)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves))

With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.

Two late-night performance thoughts:
1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55.
It's a nice debug check while this is still in development, but it has never tripped, so it's overly cautious, especially if it costs 10% performance.
2. Would you mind sharing what card you have and the full -v output (especially the lines that start with "GPU: ")?
2021-09-13, 14:11   #92
bsquared
"Ben"
Feb 2007
2×1,787 Posts

Quote:
Originally Posted by SethTro
Two late-night performance thoughts:
1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55.
It's a nice debug check while this is still in development, but it has never tripped, so it's overly cautious, especially if it costs 10% performance.
2. Would you mind sharing what card you have and the full -v output (especially the lines that start with "GPU: ")?

Hmm, when running on 2^997-1 I'm getting *better* throughput with VERIFY_NORMALIZED set to 1: 53.5 curves/sec with it defined to 1 vs. 45.6 curves/sec with it defined to 0, both running -gpucurves 2560. If I set -gpucurves 5120, then the no_verify version is 15% faster, but still slower than -gpucurves 2560.

It is a Tesla V100-SXM2-32GB (compute capability 7.0, 80 MPs, maxSharedPerBlock = 49152, maxThreadsPerBlock = 1024, maxRegsPerBlock = 65536).
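Since this thread mixes ms/curve and curves/sec, a tiny Python sketch for converting between the two may help when comparing runs. The numbers are the ones reported in this post; the function names are only illustrative.

```python
def ms_per_curve(curves_per_sec: float) -> float:
    """Convert a throughput in curves/sec to average milliseconds per curve."""
    return 1000.0 / curves_per_sec


def curves_per_sec(ms: float) -> float:
    """Convert average milliseconds per curve to curves/sec."""
    return 1000.0 / ms


verify_on = 53.5   # curves/sec with VERIFY_NORMALIZED 1, -gpucurves 2560
verify_off = 45.6  # curves/sec with VERIFY_NORMALIZED 0, -gpucurves 2560

print(f"verify on : {ms_per_curve(verify_on):.1f} ms/curve")
print(f"verify off: {ms_per_curve(verify_off):.1f} ms/curve")
print(f"verify on is {verify_on / verify_off:.2f}x the verify-off throughput")
```

On these figures the build with the check enabled works out to roughly 18.7 ms/curve against roughly 21.9 ms/curve without it, at -gpucurves 2560.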
2021-09-21, 08:35   #93
SethTro
"Seth"
Apr 2019
2³·3²·5 Posts

Quote:
Originally Posted by bsquared
1280: (~31 ms/curves)
2560: (~21 ms/curves)
640: (~63 ms/curves)
1792: (~36 ms/curves)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves))

With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.

I was confused when you saw only moderate gains, so I rented a V100 (V100-SXM2-16GB) on AWS today.
I'm seeing the new code run 3.1x faster, which is similar to the 2-3x improvement I've seen on a 1080ti, 970, and K80.

Code:
$ echo "(2^997-1)" | ./ecm -cgbn -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 134ms of CPU time / 69031ms of GPU time
Throughput: 74.170 curves per second (on average 13.48ms per Step 1)


$ echo "(2^997-1)" | ./ecm -gpu -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 10911ms of CPU time / 218643ms of GPU time
Throughput: 23.417 curves per second (on average 42.70ms per Step 1)
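To compare runs like these without doing the division by hand, the "Computing ... Step 1 took ..." lines can be parsed directly. This is a minimal Python sketch that assumes the log format is exactly as pasted above; the regular expression and the `parse_step1` helper are illustrative and not part of ecm.

```python
import re

# The two "Computing" lines from the -cgbn and -gpu runs above, in that order.
LOG = """\
Computing 5120 Step 1 took 134ms of CPU time / 69031ms of GPU time
Computing 5120 Step 1 took 10911ms of CPU time / 218643ms of GPU time
"""

PATTERN = re.compile(
    r"Computing (\d+) Step 1 took (\d+)ms of CPU time / (\d+)ms of GPU time"
)


def parse_step1(line: str):
    """Return curve count, timings and derived rates, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    curves, cpu_ms, gpu_ms = (int(g) for g in match.groups())
    return {
        "curves": curves,
        "cpu_ms": cpu_ms,
        "gpu_ms": gpu_ms,
        "curves_per_sec": curves * 1000 / gpu_ms,
        "ms_per_curve": gpu_ms / curves,
    }


cgbn, gpu = (parse_step1(line) for line in LOG.splitlines())
speedup = gpu["gpu_ms"] / cgbn["gpu_ms"]

print(f"-cgbn: {cgbn['curves_per_sec']:.3f} curves/sec, {cgbn['ms_per_curve']:.2f} ms/curve")
print(f"-gpu : {gpu['curves_per_sec']:.3f} curves/sec, {gpu['ms_per_curve']:.2f} ms/curve")
print(f"GPU-time speedup: {speedup:.2f}x")
```

The derived rates reproduce the "Throughput" lines that ecm prints, and the ratio of the two GPU times comes out a little above 3.1x for this pair of runs.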
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.