mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2011-06-23, 17:34   #45
Bdot
 
 
Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by ldesnogu View Post
Warning: I don't know anything about OpenCL...

Why do you use ||, && and ?: at all? Doesn't OpenCL say a comparison result is either 0 or 1?
For scalar data types that is true. I could have saved a conditional load, but I guess the compiler will optimize that out anyway.

I already plan to use the same code for a vector of data (just replace every uint with uint4, for instance), and then the result of a comparison is 0 or -1 (all bits set).

What I was really hoping for is something that propagates the carry with a little less logic, as that will really slow down additions and subtractions ...
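
As a purely illustrative sketch (not mfakto code; a and b are placeholder uint4 operands), the 0/-1 convention lets the borrow be taken straight from the comparison mask:

Code:
// Per-lane borrow of a uint4 subtraction a - b (illustrative sketch).
// In OpenCL, a vector comparison yields 0 (false) or -1 (all bits set) per lane.
uint4 res    = a - b;
uint4 borrow = as_uint4(res > a) & (uint4)1;   // 1 in each lane whose subtraction wrapped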
Old 2011-06-23, 22:15   #46
Ken_g6
 
 
Jan 2005
Caught in a sieve

5×79 Posts

Would the "ulong" data type (a 64-bit unsigned integer) help?
Old 2011-06-24, 00:18   #47
bsquared
 
 
"Ben"
Feb 2007

3,617 Posts

If you cast the arguments to a 64-bit unsigned type (ulong in OpenCL), then the borrow can be computed by shifting...

tmp = (ulong)a - (ulong)b;     // the 64-bit difference wraps "negative" when a < b
sub = (uint)tmp;               // low 32 bits are the result
borrow = (uint)(tmp >> 63);    // sign bit of the 64-bit difference is the borrow
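
An incoming borrow from the previous (lower) limb could be folded into the same 64-bit subtraction; a hedged sketch in OpenCL C (the function name sub_borrow and its interface are illustrative, not taken from mfakto):

Code:
// One 32-bit subtract-with-borrow step via a 64-bit intermediate (sketch).
uint sub_borrow(uint a, uint b, uint *borrow)   // *borrow is 0 or 1 on entry
{
    ulong tmp = (ulong)a - (ulong)b - (ulong)(*borrow);
    *borrow   = (uint)(tmp >> 63);   // sign bit of the 64-bit result = borrow out
    return (uint)tmp;                // low 32 bits are the difference
}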
Old 2011-06-24, 08:41   #48
ldesnogu
 
 
Jan 2008
France

2×281 Posts

Quote:
Originally Posted by Bdot View Post
What I was really hoping for is something that propagates the carry with a little less logic, as that will really slow down additions and subtractions ...
Two other random ideas (again sorry if it's not applicable...):

- do as many computations as you can without taking care of carries and do a specific pass for handling them; of course that could lead to a big slowdown if you have to reload values and memory is the limiting factor

- another way to compute carries is bit arithmetic; let's say you want the carry from a - b
Code:
res = a - b;
carry = ((b & ~a) | (res & ~a) | (b & res)) >> (bitsize-1);
where bitsize is the number of bits of a, b and res. Again that could be slower than your original code.
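
For anyone who wants to convince themselves of the identity, here is a quick host-side C check (illustrative only, not part of the thread's code) comparing it against the straightforward comparison-based borrow for the 32-bit case:

Code:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Randomized check: the MSB formula must agree with (a < b). */
    for (int i = 0; i < 1000000; i++) {
        uint32_t a = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        uint32_t b = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        uint32_t res   = a - b;
        uint32_t carry = ((b & ~a) | (res & ~a) | (b & res)) >> 31;
        if (carry != (a < b)) {
            printf("mismatch at a=%08x b=%08x\n", (unsigned)a, (unsigned)b);
            return 1;
        }
    }
    puts("bitwise borrow matches the comparison-based borrow");
    return 0;
}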
Old 2011-06-24, 16:27   #49
Bdot
 
 
Nov 2010
Germany

597₁₀ Posts

Thanks for all your suggestions so far. I'll definitely try the (ulong) conversion and compare it to my current bunch of logic. I still welcome suggestions ;-)

Here's the task again:

Inputs: uint a, uint b, uint carry (the borrow from the previous (lower) 32 bits)
Output: uint res = a - b - carry, plus the new carry (the borrow for the next (higher) 32 bits)

Currently this basically looks like:
Code:
 res   = a - b - carry;
 carry = (res > a) || ((res == a) && carry);
I'm looking for something simpler for the evaluation of the new carry. Something like
Code:
  carry = (res >= a) || carry;
(Yes, I know this one is not correct)

We have available all logical operators, +, -, and the OpenCL Integer Functions. But maybe a total of 6 operations for one 32-bit subtraction with borrow is already the minimum for OpenCL?

I did not quite understand how evaluating the carries afterwards can save anything. Access to the operands is no problem, it's all in registers. The bit-wise operations lead to a total of 10 instructions (?) for one subtraction ... not likely to be an acceleration ;-)
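
For reference, here is what a fully chained bit-arithmetic version of the step could look like - a sketch (not mfakto code) that combines the formula from post #48 with the incoming borrow; it does come to roughly a dozen operations:

Code:
// Sketch: res = a - b - carry using only subtraction and bitwise ops.
uint t    = a - b;
uint bor1 = ((b & ~a) | (t & ~a) | (b & t)) >> 31;   // borrow of a - b
res       = t - carry;
uint bor2 = (~t & res) >> 31;                        // borrow of t - carry (carry is 0 or 1)
carry     = bor1 | bor2;                             // at most one of the two can be set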
Old 2011-06-24, 16:43   #50
ldesnogu
 
 
Jan 2008
France

2×281 Posts

Quote:
Originally Posted by Bdot View Post
I did not quite understand how evaluating the carries afterwards can save anything. Access to the operands is no problem, it's all in registers. The bit-wise operations lead to a total of 10 instructions (?) for one subtraction ... not likely to be an acceleration ;-)
Well, that all depends on two things: what your compiler is able to do with the form of your program, and what your micro-architecture is able to do.

The post-pass carry evaluation, for instance, is very useful for vector-like architectures. And on some micro-architectures the result of a comparison might stall a pipeline while logical operations don't, which would make the logical variant faster even though it uses more operations.

But then again I don't know anything about your target and OpenCL, so I'm probably completely off track.
Old 2011-07-03, 15:42   #51
apsen
 
Jun 2011

131 Posts

Quote:
Originally Posted by Bdot View Post

As I have only this one ATI GPU I wanted to see if anyone would be willing to help testing on different hardware.

Current requirements: OpenCL 1.1 (i.e. only ATI GPUs), Windows 64-bit.
I have an HD 4550 and Windows 2008 R2 x64. Would that work?

Andriy
Old 2011-07-03, 18:52   #52
Ken_g6
 
 
Jan 2005
Caught in a sieve

395₁₀ Posts

Bdot, what did you wind up finding does the fastest math: 32-bit numbers or 24-bit numbers? And what form of math? Or are you still working on it?
Old 2011-07-04, 11:35   #53
Bdot
 
 
Nov 2010
Germany

3·199 Posts

I played around with this a little ...

Quote:
Originally Posted by ldesnogu View Post
Two other random ideas (again sorry if it's not applicable...):

- do as many computations as you can without taking care of carries and do a specific pass for handling them; of course that could lead to a big slowdown if you have to reload values and memory is the limiting factor
I have at most 5-6 operations that I can do before checking the carries. The runtime stays exactly the same, and the reason is that the compiler reorders the instructions anyway as it sees fit. Even a few repeated steps that were necessary for the carry ops did not influence the runtime, as they were optimized out.

Quote:
Originally Posted by ldesnogu View Post
- another way to compute carries is bit arithmetic; let's say you want the carry from a - b
Code:
res = a - b;
carry = ((b & ~a) | (res & ~a) | (b & res)) >> (bitsize-1);
where bitsize is the number of bits of a, b and res. Again that could be slower than your original code.
I was really surprised by that one. Even though this is way more operations than my original code, it runs at the same speed! Not a bit faster, not a bit slower with the single-vector kernel. Comparing the assembly, it turns out that many of the additional operations are "hidden" in otherwise unused slots. Following up with the vector version of the kernel, which has fewer unused slots, I did see that the kernel takes 3 cycles more - with a total of ~700 cycles that's less than 0.5% ...

Here's the current performance analysis of the 79-bit barrett kernel:

Code:
Name                Throughput
Radeon HD 5870      135 M Threads\Sec
Radeon HD 6970      118 M Threads\Sec
Radeon HD 6870      100 M Threads\Sec
Radeon HD 5770       68 M Threads\Sec
Radeon HD 4890       66 M Threads\Sec
Radeon HD 4870       58 M Threads\Sec
FireStream 9270      58 M Threads\Sec
FireStream 9250      48 M Threads\Sec
Radeon HD 4770       46 M Threads\Sec
Radeon HD 6670       38 M Threads\Sec
Radeon HD 4670       35 M Threads\Sec
Radeon HD 5670       23 M Threads\Sec
Radeon HD 6450       10 M Threads\Sec
Radeon HD 4550        7 M Threads\Sec
Radeon HD 5450        6 M Threads\Sec
This is the peak performance of the kernel, given enough CPU power to feed in the factor candidates fast enough. Empirically, 1M threads/sec translates to 1.2 - 1.5 GHz-days/day.

Unfortunately, right now I have a problem where some of the kernel's code is skipped unless I enable kernel tracing. I need that fixed before I can get you another version for testing.
Old 2011-07-04, 12:54   #54
Bdot
 
 
Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by Ken_g6 View Post
Bdot, what did you wind up finding does the fastest math: 32-bit numbers or 24-bit numbers? And what form of math? Or are you still working on it?
Currently the fastest kernel is a 24-bit based kernel working on a vector of 4 factor candidates at once. Here's the list of kernels I currently have, along with their performance on an HD 5770:

76 M/s mfakto_cl_71_4: 3x24-bit, 4-vectored kernel
68 M/s mfakto_cl_barrett79: 2.5x32-bit unvectored barrett kernel
53 M/s mfakto_cl_barrett92: 3x32-bit unvectored barrett kernel
44 M/s mfakto_cl_71: 3x24-bit unvectored kernel

The barrett kernels currently need a nasty workaround for a compiler bug, costing ~3%. I'm still working on vectorizing the barretts; a speedup similar to the 24-bit kernel's can be expected, so the 32-bit based kernels should end up a lot faster than the 24-bit one.

A 24-bit based barrett kernel that was suggested on the forum runs at 75 M/s, but as it uses vectors for the representation of the factors, it cannot (easily) be extended to run on a vector of candidates. If that were easily possible, the 24-bit kernel might run for the crown again. But it would not be far ahead of the 32-bit kernel, and the 32-bit one has the advantage of handling FCs (factor candidates) up to 79 bits instead of 71 bits.
Old 2011-07-05, 08:12   #55
Bdot
 
 
Nov 2010
Germany

3·199 Posts

Oh boy, I finally found why the barrett kernels sometimes behaved oddly ...

OpenCL performs bit-shifts only up to the number of bits in the target type, and it only evaluates as many bits of the shift operand as needed for that. So for a bit-shift a >> b with 32-bit values, only the lowest 5 bits of b are used and the rest are ignored ... Therefore the code that used bit-shifts of 32 or more positions to implicitly zero the target did not deliver the expected result.
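
A small illustration of the pitfall (made-up values, not the actual kernel code):

Code:
uint a = 0x12345678;
uint n = 32;
uint wrong = a >> n;                   // the shift count is taken mod 32, so this shifts by 0 and yields a
uint right = (n > 31) ? 0 : (a >> n);  // an explicit guard gives the intended zero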

The fix for that comes without a performance penalty ... a little more testing and version 0.06 will be ready.