mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Programming (https://www.mersenneforum.org/forumdisplay.php?f=29)
-   -   World's second-dumbest CUDA program (https://www.mersenneforum.org/showthread.php?t=11900)

bsquared 2012-06-27 17:11

[QUOTE=axn;303504]have you tried looking at the generated assembly (not ptx, but native isa)? it might give you ideas as to where you can optimize.[/QUOTE]

No - don't even know how to do that.

But nearly every tweak I try now makes things slower, which is actually encouraging.

I want to try one algorithmic improvement first before tinkering with low-level stuff (a mod30 wheel). I've given up on bit-packing. The read-modify-writes implied by the bit-twiddling that needs to happen introduce race conditions that, as far as I can see, can only be mitigated by performance-crushing atomic operations.
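
For example, safely clearing a single bit seems to require something like this (just a sketch, not code from my kernel):
[CODE]// sketch only: with a bit-packed sieve, two threads clearing different
// bits of the same 32-bit word race on the read-modify-write, so the
// update has to be atomic
__device__ void clear_bit(uint32 *bitsieve, uint32 n)
{
    atomicAnd(&bitsieve[n >> 5], ~(1u << (n & 31)));
}[/CODE]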

bsquared 2012-06-27 19:56

If I allocate data "on the device" using cudaMalloc, is that data available to any kernel I run thereafter? If so, is the state of the data preserved between kernel calls? I'm wondering if I can break up my increasingly large device kernel into several separate ones that each act on a resident set of on-device data.

axn 2012-06-28 04:39

[QUOTE=bsquared;303507]No - don't even know how to do that. [/QUOTE]
Are you using CUDA toolkit 4.2? It comes with a neat utility called cuobjdump which can parse the exe and dump any cubin included in it in device-specific assembly format. [It might have come with previous 4.x versions also, but it is definitely not there in the 3.2 toolkit.] Actually, if you can post either the executable or the complete source listing (including cutil_inline.h), I can get the asm output for you.

[QUOTE=bsquared;303526]If I allocate data "on the device" using cudaMalloc, is that data available to any kernel I run thereafter? If so, is the state of the data preserved between kernel calls? [/QUOTE]
Yes and yes. Though, I'm not sure you mean to use cudaMalloc from _within_ __device__ code. I've only used cudaMalloc from within host code.

bsquared 2012-06-28 12:27

I'm running 3.2, so thanks for your offer. I'll take you up on that once I'm done tinkering around.

And yeah, I meant calling cudaMalloc from host code.

axn 2012-06-28 15:00

[QUOTE=bsquared;303581]I'm running 3.2, so thanks for your offer. I'll take you up on that once I'm done tinkering around.

And yeah, I meant calling cudaMalloc from host code.[/QUOTE]

I did compile it using 4.2 (after commenting out some offending bits). Couple of observations:

1. There is a divide operation (dist / p). That is _very_ expensive. You could probably replace it with (uint32) __fdividef((float) dist, (float) p). It depends, though: that is only accurate to about 22 significant bits. You might also calculate the (scaled) inverses of the primes on the CPU and use them to avoid the division.

2. Initialization of the sieve -- the special check of < N should be avoided; just set everything to 1. As it is, the compiler does a decent job (it unrolls the loop completely, with 4 instructions per iteration), but an unconditional set is, of course, 1 instruction per iteration.

3. Where you set sieve[i], [i+p], [i+2p], ..., you can use the more intuitive form -- [i], [i+p], [i+p*2], [i+p*3], ..., [i+p*7] -- the compiler is good enough to optimize the multiplications away.

I haven't looked at the rest of the asm in detail.

bsquared 2012-06-28 15:43

[QUOTE=axn;303586]I did compile it using 4.2 (after commenting out some offending bits). Couple of observations:

1. There is a divide operation (dist / p). That is _very_ expensive. You could probably replace it with (uint32) __fdividef((float) dist, (float) p). It depends, though: that is only accurate to about 22 significant bits. You might also calculate the (scaled) inverses of the primes on the CPU and use them to avoid the division.

2. Initialization of the sieve -- the special check of < N should be avoided; just set everything to 1. As it is, the compiler does a decent job (it unrolls the loop completely, with 4 instructions per iteration), but an unconditional set is, of course, 1 instruction per iteration.

3. Where you set sieve[i], [i+p], [i+2p], ..., you can use the more intuitive form -- [i], [i+p], [i+p*2], [i+p*3], ..., [i+p*7] -- the compiler is good enough to optimize the multiplications away.

I haven't looked at the rest of the asm in detail.[/QUOTE]

Thanks! Good observations.

I tried 3.) and, surprisingly, it was consistently about 1% slower. Assuming the timer is accurate to that degree, it seems the compiler has a harder time with p*5 and p*7. Replacing those with p*4+p and p*4+p*2+p was ~1% faster than p*5 and p*7. Replacing p*3 and p*6 with their binary equivalents had no effect.
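
In code, the change amounts to something like this sketch (illustrative names only, not my actual kernel):
[CODE]// sketch only: build the awkward odd multiples from shifts/adds
uint32 p2 = p + p;                // p*2
uint32 p4 = p2 + p2;              // p*4
locsieve[i + p4 + p]      = 0;    // was locsieve[i + p*5]
locsieve[i + p4 + p2 + p] = 0;    // was locsieve[i + p*7][/CODE]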

Re: 2.), I round up the effective sieve interval so that I only have to deal with an integer number of blocks, so the last few indices of the interval will correspond to primes that are outside the interval of interest. The check (< N) is there so that I don't count those sieve locations. I suppose the check could be in the counting routine, but it does have to be somewhere. Removing the check makes the count inaccurate, and (recompiling... checking...) gives a 2 millisecond speedup ( < 0.5%).

Re: 1.), I also need dist % p, and I don't think 22 bits will be enough. For example, in the worst case dist is ~2^32 and p = 3. Since it is so expensive this is a great place to look for improvements, but so far I haven't seen an easy way that also retains functionality. Mul by inverse might do it, but the inverse might need to be a double.

Thanks again!

axn 2012-06-29 05:46

[QUOTE=bsquared;303588]I tried 3.) and, surprisingly, it was consistently about 1% slower. Assuming the timer is accurate to that degree, it seems the compiler has a harder time with p*5 and p*7. Replacing those with p*4+p and p*4+p*2+p was ~1% faster than p*5 and p*7. Replacing p*3 and p*6 with their binary equivalents had no effect.[/quote]
Perhaps 4.2 is better at this compared to 3.2, since I observed no difference in the generated code.

[QUOTE=bsquared;303588]Re: 2.), I round up the effective sieve interval so that I only have to deal with an integer number of blocks, so the last few indices of the interval will correspond to primes that are outside the interval of interest. The check (< N) is there so that I don't count those sieve locations. I suppose the check could be in the counting routine, but it does have to be somewhere. Removing the check makes the count inaccurate, and (recompiling... checking...) gives a 2 millisecond speedup ( < 0.5%).[/quote]
Yeah. I didn't expect much of a performance gain from this. It just sounded wasteful for all those blocks to undergo that check when only the last one needs it. Hmmm... Perhaps you can unconditionally set all of them to 1, and then check whether the block is the last one and repeat the conditional logic inside that check. That would be a win for all blocks except the last, and a slight loss for the last block. Also, the unconditional set can be sped up by combining 4 byte-writes into 1 word-write.

[QUOTE=bsquared;303588]Re: 1.), I also need dist % p, and I don't think 22 bits will be enough. For example, in the worst case dist is ~2^32 and p = 3. Since it is so expensive this is a great place to look for improvements, but so far I haven't seen an easy way that also retains functionality. Mul by inverse might do it, but the inverse might need to be a double.[/QUOTE]

This should work [I haven't tested it exhaustively]. For each prime p, compute invp = 0xffffffffU / p. [This should be done on the CPU, and the values sent to the kernel.]

In GPU, instead of "dist / p", do "__umulhi(dist, invp)". This should be either correct or less by 1, which can be conditionally corrected. Once you have the correct quotient, the "dist % p" is just "dist - steps*p" (which is a single IMAD instruction). I don't know whether it'd be a net win because of additional data transfer (and additional CPU divisions). To minimize that impact, I'd transfer both the p's and inv's in the same array, preferably pairing the p & inv in adjacent locations.
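
Roughly like this (untested sketch; the names are made up, not from the actual code):
[CODE]// untested sketch of the idea; names are illustrative
// host side, once per prime:  invp = 0xffffffffU / p;
__device__ void divrem_by_p(uint32 dist, uint32 p, uint32 invp,
                            uint32 *quot, uint32 *rem)
{
    uint32 q = __umulhi(dist, invp);  // exact or low by 1
    uint32 r = dist - q * p;          // single IMAD
    if (r >= p) {                     // conditional correction
        q += 1;
        r -= p;
    }
    *quot = q;
    *rem  = r;
}[/CODE]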

axn 2012-06-29 06:00

This:
[CODE] for (j=threadIdx.x * (range/4), k=0; k<(range/4); k++)
{
((uint32 *) locsieve)[j+k] = 0x01010101;
}[/CODE]

resulted in this:
[CODE] /*0020*/ /*0x04011de218040404*/ MOV32I R4, 0x1010101;
/*0028*/ /*0x04015de218040404*/ MOV32I R5, 0x1010101;
/*0030*/ /*0x04019de218040404*/ MOV32I R6, 0x1010101;
/*0038*/ /*0x0401dde218040404*/ MOV32I R7, 0x1010101;

/*0048*/ /*0x04021de218040404*/ MOV32I R8, 0x1010101;
/*0050*/ /*0x04025de218040404*/ MOV32I R9, 0x1010101;
/*0058*/ /*0x04029de218040404*/ MOV32I R10, 0x1010101;
/*0060*/ /*0x0402dde218040404*/ MOV32I R11, 0x1010101;

/*0070*/ /*0x00311cc5c9000000*/ STS.128 [R3], R4;
/*0080*/ /*0x40321cc5c9000000*/ STS.128 [R3+0x10], R8;[/CODE]
Note the two 128-bit stores?

OTOH, this
[CODE] for (j=threadIdx.x * range, k=0; k<range; k++)
{
locsieve[j+k] = 1;
}[/CODE]
results in this:
[CODE] /*0038*/ /*0x00315c05c9000000*/ STS.U8 [R3], R5;
/*0040*/ /*0x04315c05c9000000*/ STS.U8 [R3+0x1], R5;
/*0048*/ /*0x08315c05c9000000*/ STS.U8 [R3+0x2], R5;
/*0050*/ /*0x0c315c05c9000000*/ STS.U8 [R3+0x3], R5;
/*0058*/ /*0x10315c05c9000000*/ STS.U8 [R3+0x4], R5;
/*0060*/ /*0x14315c05c9000000*/ STS.U8 [R3+0x5], R5;
/*0068*/ /*0x18315c05c9000000*/ STS.U8 [R3+0x6], R5;
/*0070*/ /*0x1c315c05c9000000*/ STS.U8 [R3+0x7], R5;
/*0078*/ /*0x20315c05c9000000*/ STS.U8 [R3+0x8], R5;
/*0080*/ /*0x24315c05c9000000*/ STS.U8 [R3+0x9], R5;
/*0088*/ /*0x28315c05c9000000*/ STS.U8 [R3+0xa], R5;
/*0090*/ /*0x2c315c05c9000000*/ STS.U8 [R3+0xb], R5;
/*0098*/ /*0x30315c05c9000000*/ STS.U8 [R3+0xc], R5;
/*00a0*/ /*0x34315c05c9000000*/ STS.U8 [R3+0xd], R5;
/*00a8*/ /*0x38315c05c9000000*/ STS.U8 [R3+0xe], R5;
/*00b0*/ /*0x3c315c05c9000000*/ STS.U8 [R3+0xf], R5;
/*00b8*/ /*0x40315c05c9000000*/ STS.U8 [R3+0x10], R5;
/*00c0*/ /*0x44315c05c9000000*/ STS.U8 [R3+0x11], R5;
/*00c8*/ /*0x48315c05c9000000*/ STS.U8 [R3+0x12], R5;
/*00d0*/ /*0x4c315c05c9000000*/ STS.U8 [R3+0x13], R5;
/*00d8*/ /*0x50315c05c9000000*/ STS.U8 [R3+0x14], R5;
/*00e0*/ /*0x54315c05c9000000*/ STS.U8 [R3+0x15], R5;
/*00e8*/ /*0x58315c05c9000000*/ STS.U8 [R3+0x16], R5;
/*00f0*/ /*0x5c315c05c9000000*/ STS.U8 [R3+0x17], R5;
/*00f8*/ /*0x60315c05c9000000*/ STS.U8 [R3+0x18], R5;
/*0100*/ /*0x64315c05c9000000*/ STS.U8 [R3+0x19], R5;
/*0108*/ /*0x68315c05c9000000*/ STS.U8 [R3+0x1a], R5;
/*0110*/ /*0x6c315c05c9000000*/ STS.U8 [R3+0x1b], R5;

/*0120*/ /*0x70315c05c9000000*/ STS.U8 [R3+0x1c], R5;
/*0128*/ /*0x74315c05c9000000*/ STS.U8 [R3+0x1d], R5;
/*0130*/ /*0x78315c05c9000000*/ STS.U8 [R3+0x1e], R5;
/*0138*/ /*0x7c315c05c9000000*/ STS.U8 [R3+0x1f], R5;[/CODE]

jasonp 2012-06-29 17:23

To answer bsquared's earlier question: yes, any memory that you allocate is available to all kernels afterwards; just give them a pointer to it. It is also preserved across kernel calls.

In fact, the CUDA model recommends global memory as the only way to perform global synchronization across all the threads and thread blocks of a kernel launch. I.e., if you want to be sure that every thread has completed something, leave that something in global memory and put the consuming code in a subsequent kernel.
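
The pattern looks roughly like this (illustrative sketch with made-up kernel names, not code from any particular project):
[CODE]// a cudaMalloc'd buffer persists across kernel launches, so a later
// kernel can consume what an earlier one produced
typedef unsigned int uint32;

__global__ void produce(uint32 *buf) { buf[threadIdx.x] = threadIdx.x; }

__global__ void consume(uint32 *buf, uint32 *out)
{
    // the kernel launch boundary guarantees produce() has finished
    atomicAdd(out, buf[threadIdx.x]);
}

void host_side(void)
{
    uint32 *d_buf, *d_out, result = 0;
    cudaMalloc((void **)&d_buf, 256 * sizeof(uint32));
    cudaMalloc((void **)&d_out, sizeof(uint32));
    cudaMemcpy(d_out, &result, sizeof(uint32), cudaMemcpyHostToDevice);

    produce<<<1, 256>>>(d_buf);         // writes device memory
    consume<<<1, 256>>>(d_buf, d_out);  // reads what produce() left behind

    cudaMemcpy(&result, d_out, sizeof(uint32), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    cudaFree(d_out);
}[/CODE]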

bsquared 2012-06-30 03:07

[QUOTE=axn;303630]This:
[CODE] for (j=threadIdx.x * (range/4), k=0; k<(range/4); k++)
{
((uint32 *) locsieve)[j+k] = 0x01010101;
}[/CODE]

[/QUOTE]

Awesome, doing that for most blocks and doing the byte-by-byte check only for the top few blocks gave about a 9% speedup. 136ms to count primes up to 2^30 now.

bsquared 2012-06-30 04:19

update
 
Got a mod6 wheel working. Now sub 100ms for 2^30 (err, sorry, 1e9) :smile:

[CODE]computername 903 % erato 1000000000
Device 0 supports CUDA 2.0
It has warp size 32, 32768 regs per block, 1024 threads per block
max Threads 1024 x 1024 x 64
max Grid 65535 x 65535 x 1
total constant memory 65536
3401 small primes (< 31622) found
using grid of 202x202 blocks with 256 threads per block and 14 primes per thread
99.602997 milliseconds for big sieve

50847534 big primes (< 1000000000) found

[/CODE]

The 2*3 wheel is nice because it has a regular pattern to the jumps between sieve locations that is reasonable to code (see the sketch below). I've tried the 2*3*5 wheel but haven't been able to get a speedup: I either have to sieve over the 8 residue classes (mod 30) or the pattern of jumps becomes much more complex. Either way the GPU doesn't react well.
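
To illustrate the regular pattern of jumps (just a sketch with flat indexing and made-up names, not my actual kernel):
[CODE]// sketch only: for a prime p > 3, the multiples k*p with k coprime to 6
// (k = 5, 7, 11, 13, ...) are reached by jumps that simply alternate
// between 2*p and 4*p
uint32 m   = 5 * p;      // first multiple past p itself
uint32 gap = 2 * p;      // 5*p -> 7*p
while (m < N) {
    sieve[m] = 0;        // mark composite
    m   += gap;
    gap  = 6 * p - gap;  // alternate 2*p and 4*p
}[/CODE]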

This is now the fastest sieve I'm aware of (for ranges of primes below 2^32). Even yafu with 12 threads on a dual hex-core Xeon 5680 is slower (by a hair).

So is this useful to anyone? I remember a while back the mfaktc/o projects wanted to do sieving on the GPU, but I don't know if they are sieving for these primes (in this range, or "general purpose" ones like this).

