20201021, 17:23  #2520  
Jul 2018
Martin, Slovakia
2^{2}×3^{2}×7 Posts 
Quote:
FP32 performance of the RTX 2080 Ti is 11.75 TFLOPS, which translates to 5875 GHzD/D, which really is the most I can observe at stock settings. 

20201022, 01:38  #2521 
Jul 2009
Germany
2^{2}×113 Posts 
This one has similar speed to the GeForce GTX 980 Ti... (Once I have all the comparison values together, I should create a Top 100 ranking list.)
Code:
20201021 23:33:27 Tesla T40 OpenCL compilation in 1.81 s
20201021 23:33:29 Tesla T40 77936867 OK 0 loaded: blockSize 400, 0000000000000003
20201021 23:33:29 Tesla T40 validating proof residues for power 8
20201021 23:33:29 Tesla T40 Proof using power 8
20201021 23:33:34 Tesla T40 77936867 OK 800 0.00%; 4247 us/it; ETA 3d 19:57; 1579c241dc63eca6 (check 1.82s)
20201021 23:47:52 Tesla T40 77936867 OK 200000 0.26%; 4299 us/it; ETA 3d 20:50; f0b04b45b0855bd2 (check 1.85s)
20201022 00:02:15 Tesla T40 77936867 OK 400000 0.51%; 4304 us/it; ETA 3d 20:43; c03f94396a5aa29e (check 1.85s)
20201022 00:16:37 Tesla T40 77936867 OK 600000 0.77%; 4300 us/it; ETA 3d 20:22; b9decd65ca71b629 (check 1.84s)
Last fiddled with by moebius on 20201022 at 02:05 
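The ETA column in a log like this follows from the per-iteration time. A minimal Python sketch, assuming a PRP test needs roughly one iteration per bit of the exponent (so about 77936867 iterations here); the function name `eta` is made up for illustration and is not part of gpuowl:

```python
def eta(exponent: int, done: int, us_per_it: float) -> str:
    """Format the remaining time as 'Xd HH:MM', assuming one iteration per exponent bit."""
    remaining_s = (exponent - done) * us_per_it / 1_000_000
    minutes = int(remaining_s // 60)          # whole minutes left, rounded down
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours:02d}:{mins:02d}"

# Values taken from the first progress line of the log above.
print(eta(77936867, 800, 4247))
```

With the numbers from the first progress line, the estimate lands within about a minute of the ETA that gpuowl printed; the small difference comes down to rounding.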
20201022, 04:35  #2522  
"/X\(‘‘)/X\"
Jan 2013
3·971 Posts 
Quote:
The RTX 30xx series is the same, but the INT32 cores can also do FP32, so it can give up to double the FP32 performance of the RTX 20xx series, but only equivalent INT32 performance for the same number of cores at the same frequency. 

20201022, 19:14  #2523  
Jul 2018
Martin, Slovakia
2^{2}·3^{2}·7 Posts 
Quote:
But shouldn't the code then be reworked to use FP32? It seems like it should work: FP32 has a much higher maximum value, so it could potentially extend the range for the maximal exponent. (If so, please remove the minimal limit, too.) That is my view of how it could work; I may be absolutely wrong. If it were successfully reworked, and the DP-by-SP experiment also turned out to be successful, GIMPS would buy out all RTX 3080s and RTX 3090s (maybe not those, they are very expensive) within a few days. 

20201022, 22:36  #2524 
"Composite as Heck"
Oct 2017
17·41 Posts 
There is potential. It's been discussed a little on the forum, but from the sounds of it, it's not straightforward. There's no rush to buy or to experiment with an implementation: unlike the R7, which may only have had a production run measured in tens of thousands, there will eventually be millions of the 30 series.
You may be mildly overestimating the buying power of GIMPSters ;) 
20201023, 02:41  #2525  
"Eric"
Jan 2018
USA
211 Posts 
Quote:
Stock: 1600MHz Core, 850MHz HBM2 memory, 250W
Code:
gpuowlwin prp 77936867 maxAlloc 8192 nospin
20201022 19:30:10 gpuowl v7.066gebe49cc
20201022 19:30:10 Note: not found 'config.txt'
20201022 19:30:10 config: prp 77936867 maxAlloc 8192 nospin
20201022 19:30:10 device 0, unique id ''
20201022 19:30:10 TITAN V0 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20201022 19:30:10 TITAN V0 77936867 OpenCL args "DEXP=77936867u DWIDTH=1024u DSMALL_HEIGHT=256u DMIDDLE=8u DCARRY64=1 DCARRYM64=1 DMM_CHAIN=1u DMM2_CHAIN=2u DMAX_ACCURACY=1 DWEIGHT_STEP_MINUS_1=0xa.c42d0d7cec038p5 DIWEIGHT_STEP_MINUS_1=0x8.0e50c8817ddf8p5 clunsafemathoptimizations clstd=CL2.0 clfinitemathonly "
20201022 19:30:10 TITAN V0 77936867
20201022 19:30:10 TITAN V0 77936867 OpenCL compilation in 0.01 s
20201022 19:30:10 TITAN V0 77936867 maxAlloc: 8.0 GB
20201022 19:30:10 TITAN V0 77936867 P1(0) 0 bits
20201022 19:30:10 TITAN V0 77936867 PRP starting from beginning
20201022 19:30:10 TITAN V0 77936867 OK 0 loaded: blockSize 400, 0000000000000003
20201022 19:30:10 TITAN V0 77936867 validating proof residues for power 8
20201022 19:30:10 TITAN V0 77936867 Proof using power 8
20201022 19:30:11 TITAN V0 77936867 OK 800 0.00% 1579c241dc63eca6 596 us/it + check 0.27s + save 0.11s; ETA 12:54
20201022 19:30:16 TITAN V0 77936867 10000 0.01% fc4f135f7cf4ad29 588 us/it
20201022 19:30:22 TITAN V0 77936867 20000 0.03% 3cd1bd9d5e09cbc5 589 us/it
20201022 19:30:28 TITAN V0 77936867 30000 0.04% c4e0ff35e3290d98 590 us/it
20201022 19:30:34 TITAN V0 77936867 40000 0.05% dffe1b1b0d748128 590 us/it
20201022 19:30:40 TITAN V0 77936867 50000 0.06% 52e286945371ed29 590 us/it
20201022 19:30:46 TITAN V0 77936867 60000 0.08% 0945da4dc08bdd95 590 us/it
20201022 19:30:52 TITAN V0 77936867 70000 0.09% 7131fa4eb77f4bb2 590 us/it
20201022 19:30:58 TITAN V0 77936867 80000 0.10% 8d76071d27ee4221 591 us/it
20201022 19:31:04 TITAN V0 77936867 90000 0.12% 0bacff453b2f470e 590 us/it
20201022 19:31:10 TITAN V0 77936867 100000 0.13% 6d7296b9e2830f50 591 us/it
20201022 19:31:12 TITAN V0 77936867 Stopping, please wait..
20201022 19:31:13 TITAN V0 77936867 OK 104400 0.13% 587552d3b9350467 592 us/it + check 0.27s + save 0.11s; ETA 12:48
20201022 19:31:13 TITAN V0 Exiting because "stop requested"
20201022 19:31:13 TITAN V0 Bye
Code:
gpuowlwin prp 77936867 maxAlloc 8192 nospin
20201022 19:34:11 gpuowl v7.066gebe49cc
20201022 19:34:11 Note: not found 'config.txt'
20201022 19:34:11 config: prp 77936867 maxAlloc 8192 nospin
20201022 19:34:11 device 0, unique id ''
20201022 19:34:11 TITAN V0 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20201022 19:34:11 TITAN V0 77936867 OpenCL args "DEXP=77936867u DWIDTH=1024u DSMALL_HEIGHT=256u DMIDDLE=8u DCARRY64=1 DCARRYM64=1 DMM_CHAIN=1u DMM2_CHAIN=2u DMAX_ACCURACY=1 DWEIGHT_STEP_MINUS_1=0xa.c42d0d7cec038p5 DIWEIGHT_STEP_MINUS_1=0x8.0e50c8817ddf8p5 clunsafemathoptimizations clstd=CL2.0 clfinitemathonly "
20201022 19:34:11 TITAN V0 77936867
20201022 19:34:11 TITAN V0 77936867 OpenCL compilation in 0.01 s
20201022 19:34:11 TITAN V0 77936867 maxAlloc: 8.0 GB
20201022 19:34:11 TITAN V0 77936867 P1(0) 0 bits
20201022 19:34:11 TITAN V0 77936867 PRP starting from beginning
20201022 19:34:12 TITAN V0 77936867 OK 0 loaded: blockSize 400, 0000000000000003
20201022 19:34:12 TITAN V0 77936867 validating proof residues for power 8
20201022 19:34:12 TITAN V0 77936867 Proof using power 8
20201022 19:34:12 TITAN V0 77936867 OK 800 0.00% 1579c241dc63eca6 500 us/it + check 0.23s + save 0.11s; ETA 10:49
20201022 19:34:17 TITAN V0 77936867 10000 0.01% fc4f135f7cf4ad29 494 us/it
20201022 19:34:22 TITAN V0 77936867 20000 0.03% 3cd1bd9d5e09cbc5 495 us/it
20201022 19:34:27 TITAN V0 77936867 30000 0.04% c4e0ff35e3290d98 496 us/it
20201022 19:34:32 TITAN V0 77936867 40000 0.05% dffe1b1b0d748128 497 us/it
20201022 19:34:37 TITAN V0 77936867 50000 0.06% 52e286945371ed29 497 us/it
20201022 19:34:42 TITAN V0 77936867 60000 0.08% 0945da4dc08bdd95 498 us/it
20201022 19:34:47 TITAN V0 77936867 70000 0.09% 7131fa4eb77f4bb2 499 us/it
20201022 19:34:52 TITAN V0 77936867 80000 0.10% 8d76071d27ee4221 499 us/it
20201022 19:34:57 TITAN V0 77936867 90000 0.12% 0bacff453b2f470e 500 us/it
20201022 19:35:02 TITAN V0 77936867 100000 0.13% 6d7296b9e2830f50 500 us/it
20201022 19:35:07 TITAN V0 77936867 110000 0.14% 8cbfd4435622bda7 500 us/it
20201022 19:35:08 TITAN V0 77936867 Stopping, please wait..
20201022 19:35:09 TITAN V0 77936867 OK 113600 0.15% fb675f1fc2063c9b 501 us/it + check 0.23s + save 0.11s; ETA 10:50
20201022 19:35:09 TITAN V0 Exiting because "stop requested"
20201022 19:35:09 TITAN V0 Bye
It seems that the new version doesn't let me use CARRY32, which the older 6.11 version did, and 6.11 appears to run faster. Here's the result for 6.11 on the same exponent:
Code:
gpuowl device 0 carry short use CARRY32,ORIG_SLOWTRIG,IN_WG=128,IN_SIZEX=16,IN_SPACING=4,OUT_WG=128,OUT_SIZEX=16,OUT_SPACING=4 nospin block 100 maxAlloc 10000 B1 750000 rB2 20 prp 77936867
20201022 19:36:40 gpuowl v6.11364g36f4e2a
20201022 19:36:40 Note: not found 'config.txt'
20201022 19:36:40 config: device 0 carry short use CARRY32,ORIG_SLOWTRIG,IN_WG=128,IN_SIZEX=16,IN_SPACING=4,OUT_WG=128,OUT_SIZEX=16,OUT_SPACING=4 nospin block 100 maxAlloc 10000 B1 750000 rB2 20 prp 77936867
20201022 19:36:40 device 0, unique id ''
20201022 19:36:40 TITAN V0 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20201022 19:36:40 TITAN V0 Expected maximum carry32: 583B0000
20201022 19:36:40 TITAN V0 OpenCL args "DEXP=77936867u DWIDTH=1024u DSMALL_HEIGHT=256u DMIDDLE=8u DPM1=0 DMM_CHAIN=1u DMM2_CHAIN=2u DMAX_ACCURACY=1 DWEIGHT_STEP_MINUS_1=0xa.c42d0d7cec038p5 DIWEIGHT_STEP_MINUS_1=0x8.0e50c8817ddf8p5 DCARRY32=1 DIN_SIZEX=16 DIN_SPACING=4 DIN_WG=128 DORIG_SLOWTRIG=1 DOUT_SIZEX=16 DOUT_SPACING=4 DOUT_WG=128 clunsafemathoptimizations clstd=CL2.0 clfinitemathonly "
20201022 19:36:40 TITAN V0
20201022 19:36:40 TITAN V0 OpenCL compilation in 0.01 s
20201022 19:36:40 TITAN V0 77936867 OK 0 loaded: blockSize 100, 0000000000000003
20201022 19:36:40 TITAN V0 validating proof residues for power 8
20201022 19:36:40 TITAN V0 Proof using power 8
20201022 19:36:41 TITAN V0 77936867 OK 200 0.00%; 502 us/it; ETA 0d 10:53; 2619e0f0cb78fe50 (check 0.09s)
20201022 19:38:16 TITAN V0 77936867 OK 200000 0.26%; 478 us/it; ETA 0d 10:19; f0b04b45b0855bd2 (check 0.20s)
20201022 19:39:52 TITAN V0 77936867 OK 400000 0.51%; 480 us/it; ETA 0d 10:21; c03f94396a5aa29e (check 0.09s)
20201022 19:40:50 TITAN V0 Stopping, please wait..
20201022 19:40:50 TITAN V0 77936867 OK 519700 0.67%; 480 us/it; ETA 0d 10:20; 19d648e17333ad91 (check 0.09s)
20201022 19:40:50 TITAN V0 Exiting because "stop requested"
20201022 19:40:50 TITAN V0 Bye
Last fiddled with by xx005fs on 20201023 at 02:47 
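For comparing runs like the ones above, the per-iteration times can be pulled out of the log text. A small Python sketch; it only assumes that each progress line contains a number followed by `us/it`, as in the logs quoted here, and the helper name `mean_us_per_it` is hypothetical:

```python
import re
from statistics import mean

# Matches figures such as "478 us/it" anywhere in a log line.
US_PER_IT = re.compile(r"(\d+) us/it")

def mean_us_per_it(log_text: str) -> float:
    """Average every '<n> us/it' figure found in a chunk of log text."""
    values = [int(m) for m in US_PER_IT.findall(log_text)]
    if not values:
        raise ValueError("no 'us/it' figures found")
    return mean(values)

# Two progress lines copied from the 6.11 run above.
sample = """
20201022 19:38:16 TITAN V0 77936867 OK 200000 0.26%; 478 us/it; ETA 0d 10:19; f0b04b45b0855bd2 (check 0.20s)
20201022 19:39:52 TITAN V0 77936867 OK 400000 0.51%; 480 us/it; ETA 0d 10:21; c03f94396a5aa29e (check 0.09s)
"""
print(mean_us_per_it(sample))
```

Averaging several progress lines smooths out the first line of a run, which includes startup overhead and tends to read a little high.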

20201023, 03:55  #2526 
Jul 2009
Germany
1C4_{16} Posts 
Thank you very much. As I expected, the Titan V is so far the second best, with 478 us/it compared to 442 us/it for a Tesla V100-SXM2-16GB. I'm already working on an application-oriented top list for gpuowl, which I will publish here in the forum.

20201023, 04:43  #2527  
P90 years forever!
Aug 2002
Yeehaw, FL
2×3×1,193 Posts 
Quote:
undervolted, underclocked to sclk=3, mem overclocked to 1200:
Code:
20201023 04:23:47 gfx906+sramecc0 77936867 OK 800 0.00%; 556 us/it; ETA 0d 12:02; 1579c241dc63eca6 (check 0.39s)
20201023 04:24:04 gfx906+sramecc0 77936867 OK 30000 0.04%; 561 us/it; ETA 0d 12:08; c4e0ff35e3290d98 (check 0.39s)
20201023 04:24:21 gfx906+sramecc0 77936867 OK 60000 0.08%; 560 us/it; ETA 0d 12:07; 0945da4dc08bdd95 (check 0.39s)
Code:
20201023 04:30:52 gfx906+sramecc0 77936867 OK 270000 0.35%; 985 us/it; ETA 0d 21:15; dc349756c5f05abf (check 0.57s)
20201023 04:31:01 gfx906+sramecc0 77936867 OK 270000 0.35%; 986 us/it; ETA 0d 21:16; dc349756c5f05abf (check 0.57s)
20201023 04:32:22 gfx906+sramecc0 77936867 OK 360000 0.46%; 985 us/it; ETA 0d 21:14; 992df79b843f90de (check 0.57s)
20201023 04:32:32 gfx906+sramecc0 77936867 OK 360000 0.46%; 985 us/it; ETA 0d 21:14; 992df79b843f90de (check 0.57s)
undervolted, underclocked (slightly) to sclk=4, mem overclocked to 1200:
Code:
20201023 04:26:43 gfx906+sramecc0 77936867 OK 90000 0.12%; 526 us/it; ETA 0d 11:22; 0bacff453b2f470e (check 0.38s)
20201023 04:26:47 gfx906+sramecc0 77936867 OK 97200 0.12%; 525 us/it; ETA 0d 11:22; ddaaad369befab47 (check 0.36s)
Code:
20201023 04:27:51 gfx906+sramecc0 77936867 OK 150000 0.19%; 920 us/it; ETA 0d 19:53; 127631386c6a9b17 (check 0.55s)
20201023 04:28:01 gfx906+sramecc0 77936867 OK 150000 0.19%; 920 us/it; ETA 0d 19:53; 127631386c6a9b17 (check 0.54s)
20201023 04:28:19 gfx906+sramecc0 77936867 OK 180000 0.23%; 920 us/it; ETA 0d 19:53; 6bee5d054f770861 (check 0.54s)
20201023 04:28:29 gfx906+sramecc0 77936867 OK 180000 0.23%; 920 us/it; ETA 0d 19:53; 6bee5d054f770861 (check 0.56s)

20201023, 05:00  #2528  
Romulan Interpreter
Jun 2011
Thailand
21344_{8} Posts 
Quote:
Say, for example, you want to rewrite mfaktc (which uses int32) to use FP32, to speed it up on some cards which have "pure FP32" hardware. On most cards, the same units do either integer or fp32 processing, so you won't gain anything, but some gaming cards have dedicated fp32 cores inside, which suck at integer arithmetic, and you may get a speedup by doing so.

But... a 32-bit register can only store 2^32 different values, regardless of how you "see" this register (i.e. regardless of the encoding you associate with it). In the "unsigned int32" encoding, you can put there any number from 0 to 2^32-1 exactly, i.e. losslessly, without losing information. It means that when you write 89, you read back 89. In the "fp32" encoding, only a much smaller share of the numbers in this range can be stored losslessly. Actually, only the integers up to 2^24, about 0.4% of the range, are all stored exactly. For most of the "larger" numbers (or those smaller than 1, fractional, by the way), you write "x", but when you read back, you read an "x+epsilon" or "x-epsilon". The encoding is not "exact". It is the same idea as when you count to a hundred: you do it one by one, but when you get higher, you say "a few hundred", or "a few thousand", or "the budget of this project is about five and a half million"; you are no longer interested in the exact value, and look only at the most significant digits, as many as you can remember (store in the "space" in your brain).

That's not useful for integer arithmetic: you will need to use two FP32 registers to store the same information as you store in one int32 register, and that is worth it only if you can achieve double the speed (well, roughly; things are more complex than that). The whole issue is that, in 32-bit floats, numbers are represented as "sign*1.fraction*2^exponent", where the sign, fraction, and exponent are all stored inside the 32-bit register, so they take 32 bits in total, and their positions and sizes are fixed. 
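The round-trip loss described above can be checked with nothing but the Python standard library. This sketch packs an integer into IEEE 754 single precision with `struct` and reads it back; the helper name `roundtrip_fp32` is only for illustration:

```python
import struct

def roundtrip_fp32(n: int) -> int:
    """Store n as a 32-bit float and read it back as an integer."""
    (value,) = struct.unpack("<f", struct.pack("<f", float(n)))
    return int(value)

# Small integers and exact powers of two survive; 2**24 + 1 does not.
for n in (89, 2**24, 2**24 + 1, 2**89):
    back = roundtrip_fp32(n)
    print(n, "exact" if back == n else f"read back as {back}")
```

The loop shows both sides of the trade-off: a huge power of two such as 2^89 fits, while an integer just above 2^24 comes back changed.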
As the sign is 1 bit, you can only have 8 bits for the exponent, and 23 bits for the fraction. Therefore, you can represent a very large number, like 618970019642690137449562112 (which is 2^89), by setting the exponent to 89 and the fraction to zero, but you will not be able to store most of the numbers in between, like for example 33554433 (which is 2^25+1), just a 26-bit number. If you google "the smallest positive integer that can't be stored in fp32" (or just go to Wikipedia and read the theory), you will find out a lot of interesting things.

For a smaller scale, imagine you have a 3-bit register. You can store inside it a pattern between 000 binary (decimal zero) and 111 binary (decimal 7). You can see this as an "unsigned integer on 3 bits", and then the information inside represents a number between 0 and 7, in order in binary: 000=0, 001=1, 010=2, 011=3, 100=4, 101=5, 110=6, 111=7. No other possibility.

You can also consider this as a "signed integer on 3 bits", and in that case you need a bit to store the sign. Let's say the first bit is for the sign; then the largest integer you can store there will be 3 (using the two remaining bits) and your values will be, in order: 100=-4, 101=-3, 110=-2, 111=-1, 000=0, 001=1, 010=2, 011=3. There is no other possibility (and yes, there is a reason to put them in that order: to have additions and multiplications work properly, without changing the addition and multiplication rules).

You could also see the 3 bits as an "unsigned float on 3 bits", and in that case, the information inside will represent (I use letters for decimal numbers to avoid confusion with 0 and 1 binary): 000=100=zero*, 110=0.25, 111=0.5, 001=one, 010=3, 011=7. The advantage is that you can store "higher numbers", as well as numbers which are not integers, but you lose accuracy, as you can't store all the numbers in between. To store the integer 5 exactly, you will need two of these "3-bit registers". So, here you can store a "larger" number (as well as a smaller, fractional one) compared with the unsigned integer, but every time you write a 4, you will read back a 3, and every time you write a 6, you will read back a 7. But yes, you can store a "larger" number, for sure.

*Edit: note that here you have 2 possibilities to store the value "zero". This is deliberate, because floats, in theory, NEVER represent exact values; therefore you may consider zero as being an infinitesimally small value, and it makes sense to have a positive and a negative one (like an "epsilon", in math, or even in programming).
Last fiddled with by LaurV on 20201023 at 07:55 
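The unsigned and signed readings of the 3-bit register from the post above can be tabulated in a few lines of Python. This sketch assumes the signed reading is ordinary two's complement, which gives the same ordering as the list in the post; the toy 3-bit float is left out because the post does not spell out its full encoding. The function names are illustrative only:

```python
BITS = 3

def as_unsigned(pattern: int) -> int:
    """Interpret the low BITS bits as an unsigned integer."""
    return pattern & ((1 << BITS) - 1)

def as_signed(pattern: int) -> int:
    """Interpret the low BITS bits as a two's-complement integer."""
    value = as_unsigned(pattern)
    # If the top (sign) bit is set, the value wraps into the negatives.
    return value - (1 << BITS) if value & (1 << (BITS - 1)) else value

for pattern in range(1 << BITS):
    print(f"{pattern:03b}  unsigned={as_unsigned(pattern)}  signed={as_signed(pattern)}")
```

The same eight bit patterns cover 0 to 7 in one reading and -4 to 3 in the other; no reading can make three bits distinguish more than eight values.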

20201023, 05:19  #2529 
Jul 2009
Germany
2^{2}×113 Posts 
Thanks for the trouble. I'll take the best value for one instance, because it should be a fair comparison. It is only important that gpuowl runs stably, without errors, with the selected settings.
Last fiddled with by moebius on 20201023 at 05:22 
20201023, 08:54  #2530  
Aug 2020
2^{5} Posts 
Quote:
108980089 and the result was refused, though it was assigned to me through Primenet. As I have seen the name of the PRP tester mentioned before, did this proof certification succeed? 
