#23
Nov 2010
Germany
3×199 Posts
Quote:
Quote:
Code:
void mul_24_48(uint *res_hi, uint *res_lo, uint a, uint b)
/* res_hi*(2^24) + res_lo = a * b */
{
  *res_lo  = mul24(a,b);
  *res_hi  = (mul_hi(a,b) << 8) | (*res_lo >> 24);
  *res_lo &= 0xFFFFFF;
}
Code:
mul_24_48(&(a.d1),&(a.d0),k_tab[tid],4620); // NUM_CLASSES

becomes

19 z: MUL_UINT24   ____, R1.x, (0x0000120C, 6.473998905e-42f).x
   t: MULHI_UINT   ____, R1.x, (0x0000120C, 6.473998905e-42f).x
20 x: LSHL         ____, PS19, (0x00000008, 1.121038771e-44f).x
   y: LSHR         ____, PV19.z, (0x00000018, 3.363116314e-44f).y
   z: AND_INT      ____, PV19.z, (0x00FFFFFF, 2.350988562e-38f).z
   w: ADD_INT      ____, KC0[1].x, PV19.z
21 x: ADD_INT      ____, KC0[1].x, PV20.z
   z: AND_INT      R4.z, PV20.w, (0x00FFFFFF, 2.350988562e-38f).x
   w: OR_INT       ____, PV20.y, PV20.x
Quote:
<any 64-bit> >> 32 = 0 (of type int)
<any 64-bit> >> 32ULL = <upper half> (of type long)

Took me a while to find that ... but you may have meant something different.

And to those still waiting for the real thing - it comes now. I just added Oliver's pre-release signal handler, which is really required for OpenCL: otherwise the graphics driver crashes Windows 7 when there are still kernels in the queue but the process is already gone. And yes, it may crash your machine too - do not use it on production machines yet.

Just remember: this is a first test version. Though performance figures may be interesting, I'm more interested to see whether it runs at all, whether it is stable, and whether there are odd side effects. Do not attempt to upload "no factor" results to primenet yet.
#24
Sep 2006
The Netherlands
2×17×23 Posts
Quote:
So this instruction eats 5 cycles if you look at it from that viewpoint (or, seen from the five execution units, it eats 1 cycle on each). The fast instruction is MULHI_UINT24, which runs at the full 1351 Gflops speed (on the HD 6970 series, and I suppose on the 5000 series as well). As you see, it doesn't generate that one.

So this is a piece of turtle-slow assembler code for several reasons. Not just because it's the T unit; there are other issues as well.

Quote:
#25
Sep 2006
The Netherlands
1100001110₂ Posts
Code:
*res_lo  = mul24(a,b);
*res_hi  = (mul_hi(a,b) << 8) | (*res_lo >> 24);
*res_lo &= 0xFFFFFF;

Let me explain. You generate res_lo using a mul24. That is a fast instruction. The problem comes after this - directly after this. The GPUs are not out-of-order processors; they focus on throughput. So I hope I formulate it not too badly if I say that after the execution unit issues the mul24, it takes a cycle or 8 before the result is available. However, you immediately need that result shifted right by 24. That's too quick. The GPU already hides a little bit of the latency by running multiple threads, but that doesn't cover everything.

The mul_hi result gets shifted left by 8. Now, a good compiler would hide this latency, yet that's not what you can expect here. Directly afterwards it needs the result for an OR with the shifted part of res_lo. The mul_hi itself eats 5 cycles on the 5000 series - or rather, it occupies all the units at the same time - and then another 8 cycles later its result is available to be OR-ed with the other half, which we can expect to be available at the same time, those 8 cycles later.

So in reality we have a few operations 'in flight' at the same time here. The shift-left after the mul_hi and the shift-right are in flight simultaneously, yet not yet available, as that takes another 8 cycles. So directly issuing an OR is not so clever. The AND with res_lo we can safely assume to be available by then, of course, as we already needed it on the line above.

So from a programming viewpoint this is utmost beginner's code, for 2 reasons.

Last fiddled with by diep on 2011-06-09 at 14:39
#26
Sep 2006
The Netherlands
2·17·23 Posts
Quote:
My email: diep@xs4all.nl
#27
Sep 2006
The Netherlands
782₁₀ Posts
Quote:
The square_72_144 multiplication basically seems like an optimized version of what Chrisjp posted over here. So with near 100% certainty your code is faster, especially considering how much you tested it. So I wonder why you posted this comment? Can you explain what you wrote over here?

Thanks,
Vincent

Quote:
#28
Nov 2010
Germany
3·199 Posts
Quote:
Mainly I'm looking for a faster modulus, because that is where ~3/4 of the TF effort is currently spent (squaring is less than 20%). There must be some reason why Chrisjp had better performance figures. I want to see why (probably due to Barrett), and take over what makes sense. And having a fast 84-bit kernel is of course another advantage over 72 bits.

Last fiddled with by Bdot on 2011-06-10 at 11:40 Reason: typo
#29
Jun 2010
17 Posts
My primary machine here has an unlocked (6970-equivalent) HD 6950 2GB. I would be willing to test this stuff.
#30
Nov 2010
Germany
1001010101₂ Posts
Quote:
I'm not sure I will send out version 0.04 again, as I just built a vectorized version of the 71-bit kernel. It may take another few days to be finalized. My first tests showed a speedup of 30-40%. You'll probably get this one when it's ready.
#31
Nov 2010
Germany
597₁₀ Posts
I just sent out mfakto version 0.05 to a few interested people.
Main highlight is the use of vector data types, which on my GPU (an HD 5750) raises throughput from 60M/s to 100M/s when using multiple instances, and from 36M/s to 88M/s for a single instance.
#32
Jun 2010
10001₂ Posts
I tested this with my 6970. Single instance.
I adjusted SievePrimes to max performance, 55k. This resulted in a 90M/s rate on an M57 68-to-69 assignment.

Then I played with the other settings.

VectorSize, best to worst, was 4, 8, 2, 1, 16. 16 had a HUGE performance drop - down to 14M/s. I could only just tell 4 was better than 8; any closer and it would probably be within the normal range it runs at. 2 and 1 both ran in the 7xM/s range.

Best GridSize was 4, then 3, 2, 1, 0.

Curiously, the GPU never kicked into high-performance mode; it stayed at 500 MHz. ??

Last fiddled with by Colt45ws on 2011-06-19 at 04:10
#33
Nov 2010
Germany
3·199 Posts
Thanks for testing so quickly!
I guess SievePrimes is at 5k, not 55k? Are the GridSize differences big enough to try even bigger ones? For my GPU the UI was not usable anymore with the next bigger one ... but for fast GPUs I could check whether bigger grids would always fit ...

Did you monitor the CPU load? On my box I see it never go really high (max 50%). I'm afraid that, for now, the only way to fully utilize such a capable GPU is many instances of mfakto (working on different exponents) - which on my machine has the nasty issue of sometimes freezing the machine ... working on it. Maybe it's time for a multi-threaded siever ... until the GPU siever comes.