mersenneforum.org World's second-dumbest CUDA program

2012-06-30, 05:03   #34
bsquared

"Ben"
Feb 2007

2×1,789 Posts

Here's the code. I'd be interested to hear how it runs on other hardware (or *if* it runs).
Attached Files
 erato.cu.txt (11.0 KB, 207 views)

2012-06-30, 06:17   #35
Dubslow
Basketry That Evening!

"Bunslow the Bold"
Jun 2011

40 Posts
2012-06-30, 09:27   #36
rcv

Dec 2011

8F₁₆ Posts

I gave a version of my sieving code to TheJudger, Bdot, and Prime95, on or about March 8, 2012.

It's not a perfect apples-to-apples comparison, partly for the reasons given by Dubslow. Simplifying my code to just sieve the integers from 1 to 1e9 should not slow it down, but it probably wouldn't speed it up by very much either. (Setup of the sieving loops might be simplified, but the compute-intensive sieving itself would be unchanged.)

Looking at the timings I took circa March on my EVGA GTX560Ti, scaling to find primes to 1e9, and assuming I haven't made an arithmetic error, my code should take about 55406 usec, plus a few hundred usec of setup time. This includes sieving with all of the primes below 2^15, which is slightly more sieving than necessary to find big primes to 1e9.

My sieving code respects the "class" concept used by mfaktc/mfakto. [Contrast 960 of 4620 classes with 2 of 6 for the code posted above. More classes provide a nice performance advantage for big problems, but add overhead for small problems, such as merely finding the primes to 1e9. (Another non-apples-to-apples issue.)] My sieving code is also larger, more complex, and uses multiple kernels, each of which is optimized a bit differently than the next.

For the mfaktc/mfakto problem, the sieved results do not have to leave the GPU. As bsquared noted, for counting the primes to 1e9, the sieved results do not have to leave the GPU either. Be aware that transfer of the primes themselves would add significant data-transfer overhead to any performance metric.

The code was released under GPLv3. Bdot had my blessing to put the code under source control. [I don't know whether he did.]
2012-06-30, 12:08   #37
bsquared

"Ben"
Feb 2007

2×1,789 Posts

Quote:
 Originally Posted by rcv Looking at the timings I took circa March, on my EVGA GTX560Ti, and scaling to find primes to 1e9, and assuming I haven't made an arithmetic error, my code should take about 55406 usec, plus a few hundred usec of setup time. This includes sieving with all of the primes below 2^15, which is slightly more sieving than necessary to find big primes to 1e9.
Wow, cool! My apologies for coming around after 2 days of looking at GPU code and presuming to have come up with the fastest thing around. Also good to know that targeting a version of the above for mfakt* isn't necessary. I'd really like to have a look at your code, if it's still available.

2012-07-02, 16:43   #38
bsquared

"Ben"
Feb 2007

2·1,789 Posts

More playing around this past weekend:

Code:
5.149000 milliseconds for big sieve
5761455 big primes (< 100000000) found

64.112999 milliseconds for big sieve
50847534 big primes (< 1000000000) found

1161.977051 milliseconds for big sieve
455052511 big primes (< 10000000000) found
Attached Files
 erato.cu.txt (14.0 KB, 275 views)

2012-07-03, 02:32   #39
bsquared

"Ben"
Feb 2007

3578₁₀ Posts

+ Early abort for primes that don't hit a block at all
+ axn's multiply by inverse trick (thanks again!)

Code:
4.929000 milliseconds for big sieve
5761455 big primes (< 100000000) found

57.847000 milliseconds for big sieve
50847534 big primes (< 1000000000) found

891.489014 milliseconds for big sieve
455052511 big primes (< 10000000000) found
2012-07-03, 10:02   #40
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

2⁵×5×37 Posts

Quote:
 Originally Posted by bsquared + Early abort for primes that don't hit a block at all + axn's multiply by inverse trick (thanks again!) Code: 4.929000 milliseconds for big sieve 5761455 big primes (< 100000000) found 57.847000 milliseconds for big sieve 50847534 big primes (< 1000000000) found 891.489014 milliseconds for big sieve 455052511 big primes (< 10000000000) found
Looks like you are catching up to rcv's code. Any clue how they would compare on the same hardware? rcv's GPU looks a little more powerful, I think, but yours is a Tesla.

2012-07-03, 11:30   #41
axn

Jun 2003

19×271 Posts

I see that the code is using plain "malloc" for allocating the two buffers which are later used in cudaMemcpy. Technically, it is more efficient to allocate "pinned" buffers using cudaHostAlloc (and cudaFreeHost) and use them in cudaMemcpy. However, the gains will probably be negligible (due to the small-ish size of the buffers). Nonetheless, it is there for the taking.

EDIT:- This may or may not affect the timing, since the code is only timing the kernel itself.

Last fiddled with by axn on 2012-07-03 at 11:33
2012-07-03, 13:39   #42
bsquared

"Ben"
Feb 2007

DFA₁₆ Posts

Quote:
 Originally Posted by henryzz Looks like you are catching up to rcv's code. Any clue how they would compare on the same hardware? rcv's gpu looks a little more powerful I think but yours is tesla.
Mine is a Tesla M2050. I have little idea how that compares with his, but the Tesla is pretty fast. I would really like to see rcv's code to do a proper comparison and see if any of his ideas are transferable (assuming he's willing to do that). It's not that I don't trust his timing estimate, but sieving that many classes may cause things to scale in unexpected ways. Every experiment I tried with more than 2 of 6 classes ran slower - a lot slower. But maybe that's because I implemented it badly. *shrug*. I'm pretty happy with where the performance is now.

Quote:
 Originally Posted by axn I see that the code is using plain "malloc" for allocating the two buffers which are later used in cudaMemcpy. Technically, it is more efficient to allocate "pinned" buffers using cudaHostAlloc (and cudaFreeHost) and using them in cudaMemcpy . However, the gains will probably be negligible (due to the small-ish size of the buffers). Nonetheless, it is there for the taking. EDIT:- This may or may not affect the timing, since the code is only timing the kernel itself.
Didn't seem to make any difference, but it's probably the preferred way to do it, so thanks for the pointer.

Code.

It's maybe worth noting that for the 10e9 sieve interval, the kernel sieves approximately as many bytes as a SIQS job for a C60 (with a "factor base" more than twice as big), and does it 7 times faster. Granted, there are a lot more operations in siqs vs. erato. And it will be very hard (impossible?) to avoid read-modify-writes, so use of atomics might be necessary. But this suggests it might not be totally insane to implement a decent siqs on a gpu.

2012-07-04, 04:50   #43
axn

Jun 2003

19×271 Posts

Quote:
 Originally Posted by bsquared Code.
Looking at the latest code, the compiler is actually doing a very good job.

One micro-optimization: where you're using an array s[2], you can change it to use two scalar variables s0 & s1, followed by this:

Code:
uint32 sxor = s0 ^ s1;
uint32 scur = (k == 0) ? s0 : s1;
for ( ; i < block_size; )
{
    locsieve[i] = 0;
    i += scur;
    scur ^= sxor;
}
This changes that loop from
Code:
	/*0998*/     /*0x0851dc4340000000*/ 	ISCADD R7, R5, R2, 0x2;
/*09a0*/     /*0x004fdc05c9000000*/ 	STS.U8 [R4], RZ;
/*09a8*/     /*0x07f15c23328ac000*/ 	ICMP.NE R5, RZ, 0x1, R5;
/*09b0*/     /*0x0071dc85c0000000*/ 	LDL R7, [R7];
/*09b8*/     /*0x10711c0348000000*/ 	IADD R4, R7, R4;
/*09c0*/     /*0x0041dc03188ec080*/ 	ISETP.LT.U32.AND P0, pt, R4, 0x2000, pt;
/*09c8*/     /*0x200001e74003ffff*/ 	@P0 BRA 0x998;
to
Code:
	/*09a8*/     /*0x009fdc05c9000000*/ 	STS.U8 [R9], RZ;
/*09b0*/     /*0x24725c0348000000*/ 	IADD R9, R7, R9;
/*09b8*/     /*0x1c41dc8368000000*/ 	LOP.XOR R7, R4, R7;
/*09c0*/     /*0x0091dc03188ec080*/ 	ISETP.LT.U32.AND P0, pt, R9, 0x2000, pt;
/*09c8*/     /*0x600001e74003ffff*/ 	@P0 BRA 0x9a8;
And it totally avoids the use of two memory locations (which, incidentally, are "local" memory and, AFAIK, slower(!) than shared memory). As usual, net savings may be iffy :(

EDIT:- Now that I've looked at the code closer, you can drop the index k and instead use scur directly wherever k appears. This will actually allow you to combine the two "unrolled loops".

Another suggestion for a micro-optimization. In the main "pid" loop, instead of two variables j & pid, you can directly loop on pid, like so:
Code:
for (int pid =threadIdx.x + startprime; pid < maxp; pid += threadsPerBlock)
thus removing the inner "if" and the need to send the "nump" parameter to the kernel. This should have almost no impact on runtime :)

Last fiddled with by axn on 2012-07-04 at 05:08

2012-07-04, 05:30   #44
axn

Jun 2003

1010000011101₂ Posts

Major (code size) optimization. The first stage reduction can be rewritten as:

Code:
j = threadIdx.x * range;
uint8 sum = 0;
for (k=0; k < range; k++)
    sum += 1 & (bitsieve[threadIdx.x] >> k);
locsieve[j] = sum;
Note that this only works when range = 32 (i.e. block_size / threads_per_block). But then again, the initial logic of setting bitsieve[threadIdx.x] only works when that is the case. This alone reduces the code size by well over 25%!

EDIT:- Probably you need ~bitsieve[]

EDIT2:- In another "duh!" moment, I just realized that if the above transformation is valid, then you can get rid of the entire bitsieve array and replace it with a single uint32 bitsieve, because that is the slice of bitsieve that a thread is dealing with.

Last fiddled with by axn on 2012-07-04 at 05:52

