#45
Nov 2010
Germany
3·199 Posts
I already have in mind to use the same code for a vector of data (just replace all uint by uint4, for instance); the result of a comparison is then 0 or -1 (all bits set). What I was really hoping for is something that propagates the carry with a little less logic, as that will really slow down additions and subtractions ...
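For illustration, a minimal OpenCL C sketch of that point (names are made up, not actual mfakto code): with vector types a relational operator returns 0 or -1 per component, so a 0/1 borrow is just the negated comparison result.
Code:
/* Sketch only: per-lane borrow of res = a - b for uint4 operands.
   OpenCL vector comparisons return 0 or -1 (all bits set) per component. */
uint4 borrow_of_sub4(uint4 a, uint4 b, uint4 res)
{
    int4 wrapped = (res > a);     /* -1 in each lane where a - b wrapped around */
    return as_uint4(-wrapped);    /* -(-1) = 1, so this is a 0/1 borrow per lane */
}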
#46
Jan 2005
Caught in a sieve
5×79 Posts
Would the "ulong" data type (a 64-bit unsigned integer) help?
#47
"Ben"
Feb 2007
7·11·47 Posts
If you cast the arguments to uint64 then the borrow can be computed by shifting...
Code:
tmp = (uint64)a - (uint64)b;
sub = (uint32)tmp;
borrow = tmp >> 63;
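As a rough OpenCL C sketch of that idea (OpenCL's ulong/uint standing in for uint64/uint32; the function name sub32_borrow and the borrow-in handling are illustrative additions, not bsquared's code):
Code:
/* Sketch: 32-bit subtract-with-borrow via a 64-bit intermediate.
   ulong is 64-bit, uint is 32-bit in OpenCL C. */
uint sub32_borrow(uint a, uint b, uint borrow_in, uint *borrow_out)
{
    ulong tmp = (ulong)a - (ulong)b - (ulong)borrow_in;
    *borrow_out = (uint)(tmp >> 63);   /* 1 iff the subtraction wrapped */
    return (uint)tmp;                  /* low 32 bits of the difference */
}
Whether the 64-bit intermediate is actually cheaper than the comparison logic depends on how the target GPU emulates 64-bit integer subtraction.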
#48
Jan 2008
France
2²·3·47 Posts
- do as many computations as you can without taking care of carries, and do a specific pass for handling them; of course that could lead to a big slowdown if you have to reload values and memory is the limiting factor
- another way to compute carries is bit arithmetic; let's say you want the carry from a - b:
Code:
res = a - b;
carry = ((b & ~a) | (res & ~a) | (b & res)) >> (bitsize-1);
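For what it's worth, a sketch of how that formula could be chained with an incoming borrow (my own illustration, applying the same formula twice; not from the post above):
Code:
/* Sketch: one limb of a multi-word subtraction using only bit operations.
   c1 is the borrow of a - b, c2 the borrow of subtracting the incoming borrow;
   at most one of them can be 1. */
uint sub32_borrow_bits(uint a, uint b, uint borrow_in, uint *borrow_out)
{
    uint t   = a - b;
    uint c1  = ((b & ~a) | (t & ~a) | (b & t)) >> 31;
    uint res = t - borrow_in;
    uint c2  = ((borrow_in & ~t) | (res & ~t) | (borrow_in & res)) >> 31;
    *borrow_out = c1 | c2;
    return res;
}
Whether this ends up cheaper than a comparison-based variant depends on how the target GPU handles compares versus plain ALU operations.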
#49
Nov 2010
Germany
597₁₀ Posts
Thanks for all your suggestions so far. I'll definitely try the (ulong) conversion and compare it to my current bunch of logic. I still welcome suggestions ;-)
Here's the task again:
Inputs: uint a, uint b, uint carry (which is the borrow of the previous (lower) 32 bits)
Output: uint res = a - b, carry (which should be the borrow for the next (higher) 32 bits)
Currently this basically looks like
Code:
res = a - b - carry;
carry = (res > a) || ((res == a) && carry);
Code:
carry = (res >= a) || carry;
We have available all logical operators, +, -, and the OpenCL Integer Functions. But maybe a total of 6 operations for one 32-bit subtraction with borrow is already the minimum for OpenCL?
I did not quite understand how evaluating carries afterwards can save anything. Access to the operands is no problem; it's all in registers.
The bit-wise operations lead to a total of 10 instructions (?) for one subtraction ... less likely to be an acceleration ;-)
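To make the per-limb cost concrete, here is a sketch of a full 96-bit subtraction built from exactly the recurrence quoted above (the int96 struct and the function name are just for illustration):
Code:
/* Sketch: 96-bit subtraction a - b over three uint limbs, d0 = lowest. */
typedef struct { uint d0, d1, d2; } int96;

int96 sub96(int96 a, int96 b)
{
    int96 r;
    uint carry;

    r.d0  = a.d0 - b.d0;
    carry = (r.d0 > a.d0);                            /* no borrow-in for the lowest limb */

    r.d1  = a.d1 - b.d1 - carry;
    carry = (r.d1 > a.d1) || ((r.d1 == a.d1) && carry);

    r.d2  = a.d2 - b.d2 - carry;                      /* borrow out of the top limb is dropped */
    return r;
}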
#50
Jan 2008
France
234₁₆ Posts
The post-pass carry evaluation, for instance, is very useful for vector-like architectures. And it might be possible on some micro-architectures that the result of a comparison blocks a pipeline while logical operations don't, thus making the logical variant faster even though it uses more operations. But then again I don't know anything about your target and OpenCL, so I'm probably completely off track.
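One possible reading of the "post-pass" idea, sketched in OpenCL C for a few 32-bit limbs (an illustration of the concept, not the poster's code): the first loop has no cross-limb dependencies and vectorizes easily, and only the second, cheap loop is sequential.
Code:
/* Sketch: two-pass multi-limb subtraction res = a - b, limb 0 = lowest. */
#define N_LIMBS 3

void sub_two_pass(const uint *a, const uint *b, uint *res)
{
    uint d[N_LIMBS], borrow[N_LIMBS];
    uint bin = 0;

    /* pass 1: independent limb-wise differences and borrow flags */
    for (int i = 0; i < N_LIMBS; i++) {
        d[i]      = a[i] - b[i];
        borrow[i] = (d[i] > a[i]);                /* 1 iff a[i] < b[i] */
    }

    /* pass 2: propagate the borrows from lower to higher limbs */
    for (int i = 0; i < N_LIMBS; i++) {
        res[i] = d[i] - bin;
        bin    = borrow[i] | (bin & (d[i] == 0)); /* borrow out of limb i */
    }
}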
#52
Jan 2005
Caught in a sieve
5·79 Posts
Bdot, what did you wind up finding gives the fastest math, 32-bit numbers or 24-bit numbers? And what form of math? Or are you still working on it?
#53
Nov 2010
Germany
3·199 Posts
I played around with this a little ...
Here's the current performance analysis of the 79-bit barrett kernel:
Code:
Name               Throughput
Radeon HD 5870     135 M Threads\Sec
Radeon HD 6970     118 M Threads\Sec
Radeon HD 6870     100 M Threads\Sec
Radeon HD 5770      68 M Threads\Sec
Radeon HD 4890      66 M Threads\Sec
Radeon HD 4870      58 M Threads\Sec
FireStream 9270     58 M Threads\Sec
FireStream 9250     48 M Threads\Sec
Radeon HD 4770      46 M Threads\Sec
Radeon HD 6670      38 M Threads\Sec
Radeon HD 4670      35 M Threads\Sec
Radeon HD 5670      23 M Threads\Sec
Radeon HD 6450      10 M Threads\Sec
Radeon HD 4550       7 M Threads\Sec
Radeon HD 5450       6 M Threads\Sec
Unfortunately, right now I have a problem where some of the kernel's code is skipped unless I enable kernel tracing. I need that fixed before I can get you another version for testing.
#54
Nov 2010
Germany
3×199 Posts
76 M/s  mfakto_cl_71_4:      3x24-bit, 4-vectored kernel
68 M/s  mfakto_cl_barrett79: 2.5x32-bit unvectored barrett kernel
53 M/s  mfakto_cl_barrett92: 3x32-bit unvectored barrett kernel
44 M/s  mfakto_cl_71:        3x24-bit unvectored kernel
The barrett kernels currently need a nasty workaround for a compiler bug, costing ~3%. I'm still working on vectorizing the barretts; a similar speedup as for the 24-bit kernel can be expected, so the 32-bit based kernels will be a lot faster than 24-bit.
A 24-bit based barrett kernel that was suggested on the forum runs at 75 M/s, but as it uses vectors for the representation of the factors, it cannot (easily) be enhanced to run on a vector of candidates. If that were easily possible, the 24-bit kernel might run for the crown again. But it would not be far ahead of the 32-bit kernel, and the 32-bit one has the advantage of handling FCs up to 79 bits instead of 71 bits.
#55
Nov 2010
Germany
3·199 Posts
Oh boy, I finally found why the barrett kernels sometimes behaved oddly ...
OpenCL performs bit-shifts only up to the number of bits in the target type, and for that it evaluates only the necessary bits of the shift operand. So for a bit-shift a >> b on 32-bit values, only the lowest 5 bits of b are used; the rest is ignored ... Therefore the code that used bit-shifts of 32 or more positions to implicitly zero the target did not deliver the expected result.
The fix for that comes without a performance penalty ... a little more testing and version 0.06 will be ready.
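A minimal sketch of the pitfall and one possible guard (illustrative only, not mfakto's actual fix):
Code:
/* In OpenCL C the shift count is taken modulo the width of the shifted type,
   so for a 32-bit uint only the low 5 bits of the count are used:
   x >> 32 behaves like x >> 0 and returns x, not 0. */
uint safe_shr(uint x, uint n)
{
    return (n >= 32) ? 0u : (x >> n);
}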