mersenneforum.org mfakto: an OpenCL program for Mersenne prefactoring

2011-06-19, 11:34   #34
Colt45ws

Jun 2010

17 Posts

No, 55000
I have to amend my previous post: I must have made a mistake when keeping track of which GridSize I was running. 3, 2, and 1 are identical, maybe leaning towards 2 almost imperceptibly. Then 4 and 0.

CPU load is around 13%, or about 65% of a single core.
I'm running an i7-920 @ 4 GHz.
Attached Files
 GridSize.txt (11.6 KB, 650 views)

2011-06-19, 15:31   #35
Christenson

Dec 2010
Monticello

5·359 Posts

mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.

2011-06-19, 17:17   #36
henryzz
Just call me Henry

"David"
Sep 2007
Liverpool (GMT/BST)

3²×5×7×19 Posts

Quote:
 Originally Posted by Christenson mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.
Can I suggest that if you can get receiving assignments working faster than both then you should. It is fine to only report results occasionally but running out of work is bad.

2011-06-19, 18:15   #37
davieddy

"Lucan"
Dec 2006
England

2·3·13·83 Posts

Quote:
 Originally Posted by henryzz but running out of work is bad.
That happens to be one of my favourite occupations.

But if picking the low-hanging fruit floats your boat,
OTOH if you get bored with finding new factors (or "getting work"),
try making it as easy for us CPU-bound, patient,
LL-testing prime searchers as possible.

TFing X to X+1 is 1/7th of the effort of TFing X to X+3.*

David

*Open to correction, but you get the idea.
1+2+4 = 7
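
A minimal sketch of the 1+2+4 arithmetic above, assuming each bit level costs twice as much as the one before it (roughly twice as many candidate factors). The function names are made up for illustration.

```c
#include <stdint.h>

/* Relative cost of one bit level, in units of the first level (X to X+1).
   Assumption: the cost doubles with every additional bit level. */
uint64_t tf_level_cost(unsigned levels_above_x)
{
    return (uint64_t)1 << levels_above_x;
}

/* Total relative cost of trial factoring from X to X+n:
   1 + 2 + ... + 2^(n-1), which equals 2^n - 1. */
uint64_t tf_total_cost(unsigned n)
{
    uint64_t total = 0;
    for (unsigned i = 0; i < n; i++)
        total += tf_level_cost(i);
    return total;
}
```

With these assumptions, tf_total_cost(3) is 7 while the first level alone costs 1, which is the 1/7th figure.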

Last fiddled with by davieddy on 2011-06-19 at 18:52

2011-06-19, 18:41   #38
Christenson

Dec 2010
Monticello

1795₁₀ Posts

Henry: Once automatic reporting begins to work, it will come all at once....I'm having issues with learning my tools (eclipse) right now, just have to sit and work at it...then add the mutex and thread management stuff and call the appropriate parts of P95.

As for you CPU-bound, LL-testing types (which, incidentally, includes myself), don't worry. The way I look at it is that TF and P-1 both have as their goal making as many LL tests as possible unnecessary. Odds of finding a factor for a given exponent, for the current bit level of 70, are about 1/70. Supposing the GPUs are 128 times faster than the CPUs, then we can do 7 extra bit levels, which will factor about 10% of the candidates that wouldn't have been factored by CPU. This helps, but the real speed-up in finding M48 and beyond will be in freed-up CPUs not doing TF and in the GPU LL tests.

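
The estimate in the post above can be sketched as follows. This is only the post's own rough arithmetic: a 128x speedup is taken to pay for log2(128) = 7 extra bit levels, and each level is treated as an independent 1-in-70 chance. Both function names are hypothetical.

```c
/* Whole extra bit levels that a speedup factor pays for, assuming each
   level costs twice the previous one: floor(log2(speedup)). */
unsigned extra_bit_levels(unsigned long speedup)
{
    unsigned levels = 0;
    while (speedup > 1) {
        speedup >>= 1;
        levels++;
    }
    return levels;
}

/* Chance that at least one of `levels` bit levels yields a factor, treating
   every level as an independent 1-in-`odds` event (the post's rough figure,
   not a measured probability). */
double factored_fraction(unsigned levels, double odds)
{
    double miss = 1.0;
    for (unsigned i = 0; i < levels; i++)
        miss *= 1.0 - 1.0 / odds;
    return 1.0 - miss;
}
```

Under those assumptions, factored_fraction(extra_bit_levels(128), 70.0) comes out a little under 10%, in line with the "about 10%" quoted above.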
2011-06-19, 18:50   #39
Bdot

Nov 2010
Germany

1001010101₂ Posts

Quote:
 Originally Posted by henryzz Can I suggest that if you can get receiving assignments working faster than both then you should. It is fine to only report results occasionally but running out of work is bad.
Hehe, if receiving the assignment takes longer than the task itself, then we don't need to optimize the GPU kernels anymore ...

2011-06-19, 21:45   #40
Ken_g6

Jan 2005
Caught in a sieve

18B₁₆ Posts

Quote:
 Originally Posted by Christenson mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.

2011-06-23, 02:07   #41
Bdot

Nov 2010
Germany

597₁₀ Posts

This missing carry flag is driving me nuts ...

Does anyone have a better idea for the carry propagation?

Code:
typedef struct _int96_t { uint d0, d1, d2; } int96_t;

void sub_96(int96_t *res, int96_t a, int96_t b)
/* a must be greater or equal b!
   res = a - b */
{
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);
  res->d2 = a.d2 - b.d2 - (((res->d1 > a.d1) || ((res->d1 == a.d1) && carry)) ? 1 : 0);
}

I also need this for an int192 (6x32 bit). Then the above logic would become quite lengthy ... Do I really need to use something like this:

Code:
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);
  carry   = (res->d1 > a.d1) || ((res->d1 == a.d1) && carry);
  res->d2 = a.d2 - b.d2 - (carry ? 1 : 0);
  carry   = (res->d2 > a.d2) || ((res->d2 == a.d2) && carry);
  res->d3 = a.d3 - b.d3 - (carry ? 1 : 0);
  carry   = (res->d3 > a.d3) || ((res->d3 == a.d3) && carry);
  res->d4 = a.d4 - b.d4 - (carry ? 1 : 0);
  ...

Last fiddled with by Bdot on 2011-06-23 at 02:08

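
One way to keep the 6x32-bit case short is to store the limbs in an array and carry the borrow through a loop. The following is a plain C sketch of that idea, not code from mfakto: the array-based int192_t layout and the name sub_192 are assumptions, and uint32_t stands in for OpenCL's uint. Whether a given OpenCL compiler unrolls the loop and keeps the limbs in registers would need to be checked on the target device.

```c
#include <stdint.h>

#define LIMBS 6  /* 6 x 32 bits = 192 bits */

/* Hypothetical layout: d[0] is the least significant limb. */
typedef struct { uint32_t d[LIMBS]; } int192_t;

/* res = a - b; a must be greater than or equal to b. */
void sub_192(int192_t *res, const int192_t *a, const int192_t *b)
{
    uint32_t borrow = 0;

    for (int i = 0; i < LIMBS; i++) {
        uint32_t ai = a->d[i];
        uint32_t diff = ai - b->d[i] - borrow;

        /* Same test as in the post: a borrow leaves this limb if the result
           wrapped above ai, or if it equals ai while a borrow came in
           (which only happens when the b limb is 0xFFFFFFFF). */
        borrow = (diff > ai) || ((diff == ai) && borrow);
        res->d[i] = diff;
    }
}
```

Each limb of a is read before the matching limb of res is written, so res may point to the same object as a.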
2011-06-23, 04:21   #42
Ken_g6

Jan 2005
Caught in a sieve

5×79 Posts

Getting the carries would be a lot simpler if the number were 4x24-bit numbers instead of 3x32. Bit shifts could be used instead of conditionals, and conditionals on AMD are slow. This would also seem to allow for easier multiplication, when 24-bit multiplies are faster than 32-bit ones.

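
A rough sketch of the 4x24-bit idea: each limb holds 24 bits inside a 32-bit word, so a limb difference that goes negative shows up in the top bit of the word and the borrow can be extracted with a shift. The type and function names here are invented for illustration, and uint32_t stands in for OpenCL's uint.

```c
#include <stdint.h>

#define MASK24 0xFFFFFFu

/* Hypothetical layout: four 24-bit limbs, each stored in a 32-bit word,
   d0 least significant. Every limb must be below 2^24. */
typedef struct { uint32_t d0, d1, d2, d3; } int96_24_t;

/* res = a - b; a must be greater than or equal to b. */
void sub_96_24(int96_24_t *res, int96_24_t a, int96_24_t b)
{
    uint32_t t;

    /* Each difference lies in [-2^24, 2^24), so after wrapping to 32 bits
       bit 31 is set exactly when a borrow is needed from the next limb. */
    t = a.d0 - b.d0;              res->d0 = t & MASK24;
    t = a.d1 - b.d1 - (t >> 31);  res->d1 = t & MASK24;
    t = a.d2 - b.d2 - (t >> 31);  res->d2 = t & MASK24;
    t = a.d3 - b.d3 - (t >> 31);  res->d3 = t & MASK24;
}
```

No comparison is needed at all in this form; the price is a fourth limb and the masking step.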
2011-06-23, 09:14   #43
Bdot

Nov 2010
Germany

597₁₀ Posts

Quote:
 Originally Posted by Ken_g6 Getting the carries would be a lot simpler if the number were 4x24-bit numbers instead of 3x32. Bit shifts could be used instead of conditionals, and conditionals on AMD are slow. This would also seem to allow for easier multiplication, when 24-bit multiplies are faster than 32-bit ones.
Sure, the 24-bit stuff works quite well. I just wanted to get a 32-bit kernel running in order to compare exactly that.

BTW, conditional loads are not slow (1st cycle: eval condition and prepare the two possible load values, 2nd cycle: load it), they run at full speed. Only branches having a different control flow have that big penalty, which consists of executing both branches plus some overhead to mask out one of the executions.

2011-06-23, 09:44   #44
ldesnogu

Jan 2008
France

1000110010₂ Posts

Warning: I don't know anything about OpenCL...

Why do you use ||, && and ?: at all? Doesn't OpenCL say a comparison result is either 0 or 1? If so, then I would have written:

Code:
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - carry;
  res->d2 = a.d2 - b.d2 - ((res->d1 > a.d1) | ((res->d1 == a.d1) & carry));

and:

Code:
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - carry;
  carry   = (res->d1 > a.d1) | ((res->d1 == a.d1) & carry);
  res->d2 = a.d2 - b.d2 - carry;
  ...
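
A small C sketch for checking the branch-free form against a reference that uses 64-bit intermediates. It relies on scalar comparisons evaluating to 0 or 1, which holds in C and, as far as I know, for scalar types in OpenCL C as well (vector comparisons behave differently). The names sub_96_branchless and sub_96_ref are made up, and uint32_t stands in for uint.

```c
#include <stdint.h>

typedef struct { uint32_t d0, d1, d2; } int96_t;

/* res = a - b, a >= b, using only |, & and 0/1 comparison results. */
void sub_96_branchless(int96_t *res, int96_t a, int96_t b)
{
    uint32_t carry = (b.d0 > a.d0);

    res->d0 = a.d0 - b.d0;
    res->d1 = a.d1 - b.d1 - carry;
    res->d2 = a.d2 - b.d2 - ((res->d1 > a.d1) | ((res->d1 == a.d1) & carry));
}

/* Reference: each limb difference is computed in 64 bits, where a negative
   result wraps and sets bit 63, which then serves as the borrow. */
void sub_96_ref(int96_t *res, int96_t a, int96_t b)
{
    uint64_t t;

    t = (uint64_t)a.d0 - b.d0;
    res->d0 = (uint32_t)t;
    t = (uint64_t)a.d1 - b.d1 - (t >> 63);
    res->d1 = (uint32_t)t;
    t = (uint64_t)a.d2 - b.d2 - (t >> 63);
    res->d2 = (uint32_t)t;
}
```

Comparing the two on inputs where a middle limb of b is 0xFFFFFFFF and a borrow comes in exercises the (res->d1 == a.d1) & carry term.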
