mersenneforum.org  

Old 2012-06-27, 17:11   #23
bsquared

Quote:
Originally Posted by axn View Post
Have you tried looking at the generated assembly (not PTX, but native ISA)? It might give you ideas as to where you can optimize.
No - don't even know how to do that.

But nearly every tweak I try now makes things slower, which is actually encouraging.

I want to try one algorithmic improvement first before tinkering with low-level stuff (a mod30 wheel). I've given up on bit-packing: the read-modify-writes implied by the bit-twiddling introduce race conditions that, as far as I can see, can only be mitigated by performance-crushing atomic operations.
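For anyone following along, the race in question looks roughly like this (a sketch with made-up names, not the actual kernel): with one flag per bit, clearing a bit is a read-modify-write on a whole 32-bit word, so two threads clearing different bits of the same word can silently drop each other's update unless the store is atomic.

```cuda
// Sketch only: illustrates the bit-packed hazard, not the real kernel.
__global__ void cross_off(unsigned int *bits, unsigned int p, unsigned int n)
{
    for (unsigned int m = p * p + threadIdx.x * 2 * p; m < n;
         m += blockDim.x * 2 * p) {
        // RACY: load, modify, store -- another thread clearing a different
        // bit of the same word between our load and store gets overwritten.
        // bits[m >> 5] &= ~(1u << (m & 31));

        // Correct but serializing: every conflicting update goes through
        // the atomic unit, which is the "performance-crushing" part.
        atomicAnd(&bits[m >> 5], ~(1u << (m & 31)));
    }
}
```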

Last fiddled with by bsquared on 2012-06-27 at 17:11
Old 2012-06-27, 19:56   #24
bsquared

If I allocate data "on the device" using cudaMalloc, is that data available to any kernel I run thereafter? If so, is the state of the data preserved between kernel calls? I'm wondering if I can break up my increasingly large device kernel into several separate ones that each act on a resident set of on-device data.
Old 2012-06-28, 04:39   #25
axn

Quote:
Originally Posted by bsquared View Post
No - don't even know how to do that.
Are you using CUDA toolkit 4.2? It comes with a neat utility called cuobjdump which can parse the exe and spit out any cubin included in it in device-specific assembly format. [It might have come with previous 4.x versions also, but it is definitely not there in the 3.2 toolkit.] Actually, if you can post either the executable or the complete source listing (including cutil_inline.h), I can get the asm output for you.
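For reference, assuming the 4.x toolkit's cuobjdump is on your PATH, the invocation is roughly this (flag spellings as I recall them from the toolkit docs; check cuobjdump --help on your install):

```shell
# Dump the device-native assembly (SASS) embedded in an executable or cubin
cuobjdump -sass ./erato

# Extract the embedded PTX instead, for comparison
cuobjdump -ptx ./erato
```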

Quote:
Originally Posted by bsquared View Post
If I allocate data "on the device" using cudaMalloc, is that data available to any kernel I run thereafter? If so, is the state of the data preserved between kernel calls?
Yes and yes. Though I'm not sure you meant to use cudaMalloc from _within_ __device__ code; I've only ever called cudaMalloc from host code.
Old 2012-06-28, 12:27   #26
bsquared

I'm running 3.2, so thanks for your offer. I'll take you up on that once I'm done tinkering around.

And yeah, I meant calling cudaMalloc from host code.
Old 2012-06-28, 15:00   #27
axn

Quote:
Originally Posted by bsquared View Post
I'm running 3.2, so thanks for your offer. I'll take you up on that once I'm done tinkering around.

And yeah, I meant calling cudaMalloc from host code.
I did compile it using 4.2 (after commenting out some offending bits). Couple of observations:

1. There is a divide operation (dist / p). That is _very_ expensive. You could probably replace it with (uint32) __fdividef((float) dist, (float) p), though its accuracy is only about 22 significant bits. You might also calculate the (scaled) inverse of the primes on the CPU and use it to avoid the division.

2. Initialization of sieve -- the special check of < N should be avoided and just set everything to 1. As it is, the compiler does a decent job (it unrolls the loop completely, with 4 instructions per iteration), but unconditional setting is, of course, 1 instruction per iteration.

3. Where you set sieve[i], [i+p], [i+2p], ..., you can use the more intuitive way -- [i], [i+p], [i+p*2], [i+p*3],... [i+p*7] -- the compiler is good enough to optimize it away.

I haven't looked at the rest of the asm in detail.

Last fiddled with by axn on 2012-06-28 at 15:00
Old 2012-06-28, 15:43   #28
bsquared

Quote:
Originally Posted by axn View Post
I did compile it using 4.2 (after commenting out some offending bits). Couple of observations:

1. There is a divide operation (dist / p). That is _very_ expensive. You could probably replace it with (uint32) __fdividef((float) dist, (float) p), though its accuracy is only about 22 significant bits. You might also calculate the (scaled) inverse of the primes on the CPU and use it to avoid the division.

2. Initialization of sieve -- the special check of < N should be avoided and just set everything to 1. As it is, the compiler does a decent job (it unrolls the loop completely, with 4 instructions per iteration), but unconditional setting is, of course, 1 instruction per iteration.

3. Where you set sieve[i], [i+p], [i+2p], ..., you can use the more intuitive way -- [i], [i+p], [i+p*2], [i+p*3],... [i+p*7] -- the compiler is good enough to optimize it away.

I haven't looked at the rest of the asm in detail.
Thanks! Good observations.

I tried 3.) and surprisingly it was consistently about 1% slower. Assuming the timer is accurate to that degree, it seems the compiler has a harder time with p*5 and p*7: replacing those with p*4+p and p*4+p*2+p was ~1% faster than p*5 and p*7, while replacing p*3 and p*6 with their binary equivalents had no effect.

Re: 2.), I round up the effective sieve interval so that I only have to deal with an integer number of blocks, so the last few indices of the interval will correspond to primes that are outside the interval of interest. The check (< N) is there so that I don't count those sieve locations. I suppose the check could be in the counting routine, but it does have to be somewhere. Removing the check makes the count inaccurate, and (recompiling... checking...) gives a 2 millisecond speedup ( < 0.5%).

Re: 1.), I also need dist % p, and I don't think 22 bits will be enough. For example, in the worst case dist is ~2^32 and p = 3. Since it is so expensive this is a great place to look for improvements, but so far I haven't seen an easy way that also retains functionality. Mul by inverse might do it, but the inverse might need to be a double.

Thanks again!
Old 2012-06-29, 05:46   #29
axn

Quote:
Originally Posted by bsquared View Post
I tried 3.) and surprisingly it was consistently about 1% slower. Assuming the timer is accurate to that degree, it seems the compiler has a harder time with p*5 and p*7: replacing those with p*4+p and p*4+p*2+p was ~1% faster than p*5 and p*7, while replacing p*3 and p*6 with their binary equivalents had no effect.
Perhaps 4.2 is better at this compared to 3.2, since I observed no difference in the generated codes.

Quote:
Originally Posted by bsquared View Post
Re: 2.), I round up the effective sieve interval so that I only have to deal with an integer number of blocks, so the last few indices of the interval will correspond to primes that are outside the interval of interest. The check (< N) is there so that I don't count those sieve locations. I suppose the check could be in the counting routine, but it does have to be somewhere. Removing the check makes the count inaccurate, and (recompiling... checking...) gives a 2 millisecond speedup ( < 0.5%).
Yeah, I didn't expect much of a performance gain from this. It just sounded wasteful for all those blocks to undergo that check when only the last one needs it. Perhaps you can unconditionally set all of them to 1, and then repeat the conditional logic only inside a check for the last block. This will be a win for every block except the last, and a slight loss for the last one. Also, the unconditional set can be sped up by combining 4 byte-writes into 1 word-write.

Quote:
Originally Posted by bsquared View Post
Re: 1.), I also need dist % p, and I don't think 22 bits will be enough. For example, in the worst case dist is ~2^32 and p = 3. Since it is so expensive this is a great place to look for improvements, but so far I haven't seen an easy way that also retains functionality. Mul by inverse might do it, but the inverse might need to be a double.
This should work [I haven't tested exhaustively]. For each prime p, compute invp = 0xffffffffU / p. [This should be done on the CPU and sent to the kernel.]

In GPU, instead of "dist / p", do "__umulhi(dist, invp)". This should be either correct or less by 1, which can be conditionally corrected. Once you have the correct quotient, the "dist % p" is just "dist - steps*p" (which is a single IMAD instruction). I don't know whether it'd be a net win because of additional data transfer (and additional CPU divisions). To minimize that impact, I'd transfer both the p's and inv's in the same array, preferably pairing the p & inv in adjacent locations.
Old 2012-06-29, 06:00   #30
axn

This:
Code:
	for (j=threadIdx.x * (range/4), k=0; k<(range/4); k++)
	{
		((uint32 *) locsieve)[j+k] = 0x01010101;
	}
resulted in this:
Code:
	/*0020*/     /*0x04011de218040404*/ 	MOV32I R4, 0x1010101;
	/*0028*/     /*0x04015de218040404*/ 	MOV32I R5, 0x1010101;
	/*0030*/     /*0x04019de218040404*/ 	MOV32I R6, 0x1010101;
	/*0038*/     /*0x0401dde218040404*/ 	MOV32I R7, 0x1010101;

	/*0048*/     /*0x04021de218040404*/ 	MOV32I R8, 0x1010101;
	/*0050*/     /*0x04025de218040404*/ 	MOV32I R9, 0x1010101;
	/*0058*/     /*0x04029de218040404*/ 	MOV32I R10, 0x1010101;
	/*0060*/     /*0x0402dde218040404*/ 	MOV32I R11, 0x1010101;

	/*0070*/     /*0x00311cc5c9000000*/ 	STS.128 [R3], R4;
	/*0080*/     /*0x40321cc5c9000000*/ 	STS.128 [R3+0x10], R8;
Note the two 128-bit stores?

OTOH, this
Code:
	for (j=threadIdx.x * range, k=0; k<range; k++)
	{
		locsieve[j+k] = 1;
	}
results in this:
Code:
	/*0038*/     /*0x00315c05c9000000*/ 	STS.U8 [R3], R5;
	/*0040*/     /*0x04315c05c9000000*/ 	STS.U8 [R3+0x1], R5;
	/*0048*/     /*0x08315c05c9000000*/ 	STS.U8 [R3+0x2], R5;
	/*0050*/     /*0x0c315c05c9000000*/ 	STS.U8 [R3+0x3], R5;
	/*0058*/     /*0x10315c05c9000000*/ 	STS.U8 [R3+0x4], R5;
	/*0060*/     /*0x14315c05c9000000*/ 	STS.U8 [R3+0x5], R5;
	/*0068*/     /*0x18315c05c9000000*/ 	STS.U8 [R3+0x6], R5;
	/*0070*/     /*0x1c315c05c9000000*/ 	STS.U8 [R3+0x7], R5;
	/*0078*/     /*0x20315c05c9000000*/ 	STS.U8 [R3+0x8], R5;
	/*0080*/     /*0x24315c05c9000000*/ 	STS.U8 [R3+0x9], R5;
	/*0088*/     /*0x28315c05c9000000*/ 	STS.U8 [R3+0xa], R5;
	/*0090*/     /*0x2c315c05c9000000*/ 	STS.U8 [R3+0xb], R5;
	/*0098*/     /*0x30315c05c9000000*/ 	STS.U8 [R3+0xc], R5;
	/*00a0*/     /*0x34315c05c9000000*/ 	STS.U8 [R3+0xd], R5;
	/*00a8*/     /*0x38315c05c9000000*/ 	STS.U8 [R3+0xe], R5;
	/*00b0*/     /*0x3c315c05c9000000*/ 	STS.U8 [R3+0xf], R5;
	/*00b8*/     /*0x40315c05c9000000*/ 	STS.U8 [R3+0x10], R5;
	/*00c0*/     /*0x44315c05c9000000*/ 	STS.U8 [R3+0x11], R5;
	/*00c8*/     /*0x48315c05c9000000*/ 	STS.U8 [R3+0x12], R5;
	/*00d0*/     /*0x4c315c05c9000000*/ 	STS.U8 [R3+0x13], R5;
	/*00d8*/     /*0x50315c05c9000000*/ 	STS.U8 [R3+0x14], R5;
	/*00e0*/     /*0x54315c05c9000000*/ 	STS.U8 [R3+0x15], R5;
	/*00e8*/     /*0x58315c05c9000000*/ 	STS.U8 [R3+0x16], R5;
	/*00f0*/     /*0x5c315c05c9000000*/ 	STS.U8 [R3+0x17], R5;
	/*00f8*/     /*0x60315c05c9000000*/ 	STS.U8 [R3+0x18], R5;
	/*0100*/     /*0x64315c05c9000000*/ 	STS.U8 [R3+0x19], R5;
	/*0108*/     /*0x68315c05c9000000*/ 	STS.U8 [R3+0x1a], R5;
	/*0110*/     /*0x6c315c05c9000000*/ 	STS.U8 [R3+0x1b], R5;

	/*0120*/     /*0x70315c05c9000000*/ 	STS.U8 [R3+0x1c], R5;
	/*0128*/     /*0x74315c05c9000000*/ 	STS.U8 [R3+0x1d], R5;
	/*0130*/     /*0x78315c05c9000000*/ 	STS.U8 [R3+0x1e], R5;
	/*0138*/     /*0x7c315c05c9000000*/ 	STS.U8 [R3+0x1f], R5;
Old 2012-06-29, 17:23   #31
jasonp

To answer bsquared's earlier question: yes, any memory that you allocate is available to all kernels launched afterwards; just give them a pointer to it. It is also preserved across calls.

In fact, the CUDA model recommends global memory as the only way to perform global synchronization across all threads and thread blocks of a kernel launch: if you want to be sure that everything has completed producing something, leave that something in global memory and put the consuming code in a subsequent kernel.
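A minimal sketch of that pattern (illustrative names and launch dimensions, error checking omitted): allocate once from host code, then hand the same pointer to successive kernels.

```cuda
// Global memory persists between launches, and launches on the same stream
// run in order, so pass2 sees everything pass1 wrote.
__global__ void pass1(unsigned char *sieve, int n) { /* mark composites */ }
__global__ void pass2(unsigned char *sieve, int n, int *count) { /* count survivors */ }

void run(int n)
{
    unsigned char *d_sieve;
    int *d_count;
    cudaMalloc((void **)&d_sieve, n);          // allocated once...
    cudaMalloc((void **)&d_count, sizeof(int));

    pass1<<<64, 256>>>(d_sieve, n);            // ...visible to this kernel...
    pass2<<<64, 256>>>(d_sieve, n, d_count);   // ...and still intact for this one

    cudaFree(d_sieve);
    cudaFree(d_count);
}
```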
Old 2012-06-30, 03:07   #32
bsquared

Quote:
Originally Posted by axn View Post
This:
Code:
	for (j=threadIdx.x * (range/4), k=0; k<(range/4); k++)
	{
		((uint32 *) locsieve)[j+k] = 0x01010101;
	}
Awesome, doing that for most blocks and doing the byte-by-byte check only for the top few blocks gave about a 9% speedup. 136ms to count primes up to 2^30 now.
Old 2012-06-30, 04:19   #33
bsquared
update

Got a mod6 wheel working. Now sub 100ms for 2^30 (err, sorry, 1e9)

Code:
computername 903 % erato 1000000000
Device 0 supports CUDA 2.0
It has warp size 32, 32768 regs per block, 1024 threads per block
max Threads 1024 x 1024 x 64
max Grid 65535 x 65535 x 1
total constant memory 65536
3401 small primes (< 31622) found
using grid of 202x202 blocks with 256 threads per block and 14 primes per thread
99.602997 milliseconds for big sieve

50847534 big primes (< 1000000000) found
The 2*3 wheel is nice because the jumps between sieve locations follow a regular pattern that is reasonable to code. I've tried the 2*3*5 wheel but haven't been able to get a speedup: either I have to sieve over 8 residue classes or the pattern of jumps becomes much more complex, and either way the GPU doesn't react well.

This is now the fastest sieve I'm aware of (for ranges of primes below 2^32). Even yafu with 12 threads on a dual hex-core Xeon 5680 is slower (by a hair).

So is this useful to anyone? I remember a while back the mfaktc/mfakto projects wanted to do sieving on the GPU, but I don't know whether they sieve for primes in this range, or for "general purpose" ones like these.

Last fiddled with by bsquared on 2012-06-30 at 04:21