mersenneforum.org  

Old 2012-06-30, 05:03   #34
bsquared
 
"Ben"
Feb 2007

2×1,789 Posts

Here's the code. I'd be interested to hear how it runs on other hardware (or *if* it runs).
Attached Files
File Type: txt erato.cu.txt (11.0 KB, 207 views)
Old 2012-06-30, 06:17   #35
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts

Mfakt* search for factors of Mersenne numbers; a factor of 2^p-1 must be congruent to 1 (mod 2p). Thus they produce a list of k's (the multipliers) such that 2p*k+1 is between 2^a and 2^b for some integers a and b. (Currently, depending on the exponent, the final TF bound is 2^72 or 2^73, though mfaktc supports TFing up to 96 bits.) This list of potential factors is then sieved against small primes on the CPU before being sent to the GPU for the actual divisibility test. Currently, in "vanilla" versions, the depth of the sieve is variable; the "SievePrimes" variable is allowed to be between 5000 and 250000. Presumably that's the number of small primes by which a factor candidate is trial-divided before being sent to the GPU (if none of the small primes divides the candidate). There has been some experimentation with reducing SievePrimes, down to around 1500, which seemed to be a sweet spot, and someone (rcv?) had experimented with sieving on the GPU. While that wasn't a failure (it did seem to work), nothing has come of it yet.

tl;dr: Arbitrary numbers of up to ~32 digits are sieved for primality against 5000-250000 small primes.

I'll test your code on my GTX460 when I get home tomorrow.

Last fiddled with by Dubslow on 2012-06-30 at 06:20
Old 2012-06-30, 09:27   #36
rcv
 
Dec 2011

8F₁₆ Posts

I gave a version of my sieving code to TheJudger, Bdot, and Prime95, on or about March 8, 2012. It's not a perfect apples-to-apples comparison, partly for the reasons given by Dubslow. Simplifying my code to just sieve the integers from 1 to 1e9 should not slow it down, but it probably wouldn't speed it up by very much either. (Setup of the sieving loops might be simplified, but the compute-intensive sieving, itself, would be unchanged.)

Looking at the timings I took circa March, on my EVGA GTX560Ti, and scaling to find primes to 1e9, and assuming I haven't made an arithmetic error, my code should take about 55406 usec, plus a few hundred usec of setup time. This includes sieving with all of the primes below 2^15, which is slightly more sieving than necessary to find big primes to 1e9.

My sieving code respects the "class" concept used by mfaktc/mfakto. [Contrast 960 of 4620 classes with 2 of 6 for the code posted above. More classes provide a nice performance advantage for big problems, but add overhead for small problems, such as merely finding the primes to 1e9. (Another non-apples-to-apples issue.)]

My sieving code is also larger, more complex, and uses multiple kernels, each of which is optimized a bit differently than the next.

For the mfaktc/mfakto problem, the sieved results do not have to leave the GPU. As bsquared noted, for counting the primes to 1e9, the sieved results do not have to leave the GPU. Be aware that transfer of the primes, themselves, would add significant data-transfer overhead to any performance metric.

The code was released under GPLv3. Bdot had my blessing to put the code under source control. [I don't know whether he did.]
Old 2012-06-30, 12:08   #37
bsquared
 
"Ben"
Feb 2007

2×1,789 Posts

Quote:
Originally Posted by rcv

Looking at the timings I took circa March, on my EVGA GTX560Ti, and scaling to find primes to 1e9, and assuming I haven't made an arithmetic error, my code should take about 55406 usec, plus a few hundred usec of setup time. This includes sieving with all of the primes below 2^15, which is slightly more sieving than necessary to find big primes to 1e9.
Wow, cool! My apologies for coming around after 2 days of looking at gpu code and presuming to have come up with the fastest thing around. Also good to know that targeting a version of the above for mfakt* isn't necessary. I'd really like to have a look at your code, if it's still available.
Old 2012-07-02, 16:43   #38
bsquared
 
"Ben"
Feb 2007

2·1,789 Posts

More playing around this past weekend:

Code:
5.149000 milliseconds for big sieve
5761455 big primes (< 100000000) found
 
64.112999 milliseconds for big sieve
50847534 big primes (< 1000000000) found
 
1161.977051 milliseconds for big sieve
455052511 big primes (< 10000000000) found
Attached Files
File Type: txt erato.cu.txt (14.0 KB, 275 views)
Old 2012-07-03, 02:32   #39
bsquared
 
"Ben"
Feb 2007

3578₁₀ Posts

+ Early abort for primes that don't hit a block at all
+ axn's multiply by inverse trick (thanks again!)

Code:
4.929000 milliseconds for big sieve
5761455 big primes (< 100000000) found

57.847000 milliseconds for big sieve
50847534 big primes (< 1000000000) found

891.489014 milliseconds for big sieve
455052511 big primes (< 10000000000) found
Old 2012-07-03, 10:02   #40
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT/BST)

2⁵×5×37 Posts

Quote:
Originally Posted by bsquared
+ Early abort for primes that don't hit a block at all
+ axn's multiply by inverse trick (thanks again!)

Code:
4.929000 milliseconds for big sieve
5761455 big primes (< 100000000) found

57.847000 milliseconds for big sieve
50847534 big primes (< 1000000000) found

891.489014 milliseconds for big sieve
455052511 big primes (< 10000000000) found
Looks like you are catching up to rcv's code. Any clue how they would compare on the same hardware? rcv's GPU looks a little more powerful, I think, but yours is a Tesla.
Old 2012-07-03, 11:30   #41
axn
 
Jun 2003

19×271 Posts

I see that the code is using plain "malloc" to allocate the two buffers which are later used in cudaMemcpy. Technically, it is more efficient to allocate "pinned" buffers using cudaHostAlloc (and cudaFreeHost) and use them in cudaMemcpy. However, the gains will probably be negligible (due to the small-ish size of the buffers). Nonetheless, it is there for the taking.

EDIT:- This may or may not affect the timing, since the code is only timing the kernel itself.

Last fiddled with by axn on 2012-07-03 at 11:33
Old 2012-07-03, 13:39   #42
bsquared
 
"Ben"
Feb 2007

DFA₁₆ Posts

Quote:
Originally Posted by henryzz
Looks like you are catching up to rcv's code. Any clue how they would compare on the same hardware? rcv's GPU looks a little more powerful, I think, but yours is a Tesla.
Mine is a Tesla M2050. I have little idea how that compares with his, but the Tesla is pretty fast. I would really like to see rcv's code to do a proper comparison and see if any of his ideas are transferable (assuming he's willing to do that). It's not that I don't trust his timing estimate, but sieving that many classes may cause things to scale in unexpected ways. Every experiment I tried with more than 2/6 classes ran slower - a lot slower. But maybe that's because I implemented it badly. *shrug*. I'm pretty happy with where the performance is now.


Quote:
Originally Posted by axn
I see that the code is using plain "malloc" to allocate the two buffers which are later used in cudaMemcpy. Technically, it is more efficient to allocate "pinned" buffers using cudaHostAlloc (and cudaFreeHost) and use them in cudaMemcpy. However, the gains will probably be negligible (due to the small-ish size of the buffers). Nonetheless, it is there for the taking.

EDIT:- This may or may not affect the timing, since the code is only timing the kernel itself.
Didn't seem to make any difference, but it's probably the preferred way to do it, so thanks for the pointer.

Code.

It's maybe worth noting that for the 10e9 sieve interval, the kernel sieves approximately as many bytes as a SIQS job for a C60 (with a "factor base" more than twice as big), and does it 7 times faster. Granted, there are a lot more operations in siqs vs. erato. And it will be very hard (impossible?) to avoid read-modify-writes, so use of atomics might be necessary. But this suggests it might not be totally insane to implement a decent siqs on a gpu.
Old 2012-07-04, 04:50   #43
axn
 
Jun 2003

19×271 Posts

Quote:
Originally Posted by bsquared
Looking at the latest code, the compiler is actually doing a very good job.

One micro-optimization: where you're using an array s[2], you can replace it with two scalar variables s0 & s1, followed by this:

Code:
			uint32 sxor = s0 ^ s1;
			uint32 scur = (k == 0) ? s0 : s1;
			for( ;i < block_size; )
			{
				locsieve[i] = 0;
				i += scur;
				scur ^= sxor;
			}
This changes that loop from
Code:
	/*0998*/     /*0x0851dc4340000000*/ 	ISCADD R7, R5, R2, 0x2;
	/*09a0*/     /*0x004fdc05c9000000*/ 	STS.U8 [R4], RZ;
	/*09a8*/     /*0x07f15c23328ac000*/ 	ICMP.NE R5, RZ, 0x1, R5;
	/*09b0*/     /*0x0071dc85c0000000*/ 	LDL R7, [R7];
	/*09b8*/     /*0x10711c0348000000*/ 	IADD R4, R7, R4;
	/*09c0*/     /*0x0041dc03188ec080*/ 	ISETP.LT.U32.AND P0, pt, R4, 0x2000, pt;
	/*09c8*/     /*0x200001e74003ffff*/ 	@P0 BRA 0x998;
to
Code:
	/*09a8*/     /*0x009fdc05c9000000*/ 	STS.U8 [R9], RZ;
	/*09b0*/     /*0x24725c0348000000*/ 	IADD R9, R7, R9;
	/*09b8*/     /*0x1c41dc8368000000*/ 	LOP.XOR R7, R4, R7;
	/*09c0*/     /*0x0091dc03188ec080*/ 	ISETP.LT.U32.AND P0, pt, R9, 0x2000, pt;
	/*09c8*/     /*0x600001e74003ffff*/ 	@P0 BRA 0x9a8;
And it totally avoids that usage of two memory locations (which, incidentally, are "local" memory and AFAIK slower(!) than shared memory). As usual, net savings may be iffy :(

EDIT:- Now that I've looked at the code closer, you can replace the index k and instead use scur directly wherever k appears. This will actually allow you to combine the two "unrolled loops".

Another suggestion for a micro-optimization. In the main "pid" loop, instead of two variables j & pid, you can directly loop on pid, like so:
Code:
for (int pid =threadIdx.x + startprime; pid < maxp; pid += threadsPerBlock)
thus removing the inner "if" and the need to send the "nump" parameter to the kernel. This should have almost no impact on runtime :)

Last fiddled with by axn on 2012-07-04 at 05:08
Old 2012-07-04, 05:30   #44
axn
 
Jun 2003

1010000011101₂ Posts

Major (code size) optimization. The first stage reduction can be rewritten as:

Code:
	j=threadIdx.x * range;
	uint8 sum = 0;
	for (k=0; k<range; k++)
		sum += locsieve[j+k] & (bitsieve[threadIdx.x] >> k);
	locsieve[j] = sum;
Note that this only works when range = 32 (i.e. block_size / threads_per_block). But then again, the initial logic of setting bitsieve[threadIdx.x] only works when that is the case.

This alone reduces the code size by well over 25%!

EDIT:- Probably you need ~bitsieve[]

EDIT2:- In another "duh!" moment, I just realized that if the above transformation is valid, then you can get rid of the entire bitsieve array and replace it with a single uint32 bitsieve, because that is the slice of bitsieve that a thread is dealing with.

Last fiddled with by axn on 2012-07-04 at 05:52