mersenneforum.org World's second-dumbest CUDA program

 2009-07-13, 13:17 #12 jasonp Tribal Bullet     Oct 2004 3×1,181 Posts Hmm, maybe NFS linear algebra can benefit from this (and pstach's CADO presentation).
2009-07-25, 01:36   #13
SPWorley

Jul 2009

11111₂ Posts

Quote:
 Originally Posted by frmky I've written just a few simple programs. Besides the obvious things, like having lots of threads to mask latencies and accessing global memory as little as possible, getting the best performance really seems to be trial and error. I was involved in the integration of distributed.net's RC5 CUDA core. I found that seemingly trivial changes, like changing an always-positive integer from signed to unsigned, or assigning blockDim.x, blockIdx.x, and threadIdx.x to an integer before doing ANY math, made huge differences in the speed of the code. In fact, just the move to CUDA 2.2 has halved the speed of the RC5 code. I recently asked why on the nVidia forums but haven't gotten any replies.
Lots of possible reasons. I just replied to you on that forum, though admittedly I haven't tried compiling the code yet, especially since I don't have 2.0 around anymore.

 2009-10-17, 18:47 #14
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

6441₁₀ Posts

I can get a cheap factor-5 speed-up for the program above by splitting the primes into small primes (the first eight) and large primes (all the rest). The large primes are handled with the Sieve kernel above, one prime per thread, and the small primes one chunk of memory per thread by

Code:
__global__ static void SplitSieve(char* sieve, int prime, int zap, int sieve_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = prime*(prime+idx); ... )
        ...
}
...
SplitSieve<<< ... >>>(device_big_sieve, primes[r], 400, N);
cudaThreadSynchronize();

This is still desperately inefficient; I'm doing half a billion one-byte writes, when it ought to be possible to do something coalesced for the small primes. I assume I can't use a texture 179 units wide and let the texture-fetch unit do my modulo operations for free?

I'm a bit confused, for coalesced operations, about how I tell which thread in the warp I am: is it threadIdx.x % 32, or does it depend on the launch parameters? Is this even a meaningful question?

Atomic operations on the GTX275 are supposedly fairly fast; packing eight odd numbers per byte should allow very quick bulk writing for the small primes, and then atomicAnd to zap bits one at a time for the large primes.

I'm noticing that small black and white spots appear and disappear on my desktop, not only while I'm running CUDA jobs but afterwards (it may be connected to having a window well behind the one I'm writing in scrolling upwards); presumably this is a Linux driver issue triggered by CUDA doing something odd in main memory.
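The packing arithmetic proposed here (eight odd numbers per byte, atomicAnd to zap one bit) can be sanity-checked single-threaded on the CPU. A minimal sketch, with hypothetical helper names; since only one thread writes, the device-side atomicAnd becomes an ordinary masked AND:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit i stands for the odd number 2i+1: eight odd numbers per byte. */
static void zap_bit(uint8_t *s, uint32_t i)
{
    /* on the device this would be atomicAnd(&s[i>>3], ~(1 << (i&7))) */
    s[i >> 3] &= (uint8_t)~(1u << (i & 7));
}

static int get_bit(const uint8_t *s, uint32_t i)
{
    return (s[i >> 3] >> (i & 7)) & 1;
}

/* Count primes below N (N <= 16384) with the packed layout. */
int count_primes_packed(uint32_t N)
{
    uint8_t s[1024];
    uint32_t half = N / 2;          /* indices 0 .. half-1 cover the odds < N */
    memset(s, 0xFF, sizeof(s));     /* start with every odd number live */
    for (uint32_t p = 3; p * p < N; p += 2)
        if (get_bit(s, (p - 1) / 2))                    /* p is still unmarked */
            for (uint32_t i = (p * p - 1) / 2; i < half; i += p)
                zap_bit(s, i);                          /* zap its odd multiples */
    int count = 1;                  /* the prime 2 */
    for (uint32_t i = 1; i < half; i++)                 /* skip i = 0 (the number 1) */
        count += get_bit(s, i);
    return count;
}
```

In this layout each byte of `sieve` holds eight consecutive odd numbers, which is why one large prime zapping single bits needs an atomic: two threads clearing different bits of the same byte would otherwise lose updates.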
 2009-10-19, 09:57 #15
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

1100100101001₂ Posts

The texture-mapping approach works nicely; I can fill 200MB with the pattern with the primes < 30 removed in about 6ms.

Things that bit me:
- it's kernel<<<numBlocks, threadsPerBlock>>>, not the other way round
- you want to index on threadIdx.x (not on gidx = blockDim.x*blockIdx.x + threadIdx.x) when filling shared memory at the start of a kernel
- blockDim.x is the number of threads in the block; I can't figure out how to get at the number of blocks in the grid from inside the kernel, but obviously you can pass it as a parameter
- loads from global memory are slow and not cached at all; shared memory is fast; I haven't used texture memory yet
- you can't just fill the shared memory when blockIdx.x == 0 and trust that it's still there the next time through the grid
- <<<30,256>>> is a good configuration on the GTX275 for texture-mapping-like jobs

Things to try: texture memory, atomic bitwise operations for the larger primes.
2009-10-19, 15:00   #16
frmky

Jul 2003
So Cal

4222₈ Posts

Quote:
 Originally Posted by fivemack it's kernel<<<numBlocks, threadsPerBlock>>>, not the other way round
Should have warned you when I discovered that. Sorry.
Quote:
 Originally Posted by fivemack you want to index on threadIdx.x (not on gidx = blockDim.x*blockIdx.x+threadIdx.x) when filling shared memory at the start of a kernel; you can't just fill the shared memory when blockIdx.x==0 and trust that it's still there the next time through the grid
I too have trouble remembering that shared memory is shared within, and persists for, the lifetime of a block, not an entire grid.
Quote:
 Originally Posted by fivemack blockDim.x is the number of threads in the block; I can't figure out how to get at the number of blocks in the grid from the kernel, but obviously you can pass it as a parameter.
gridDim.x (and gridDim.y if 2D)
Quote:
 Originally Posted by fivemack loads from global memory are slow and not cached at all. Shared memory is fast. I haven't used texture memory yet.
Never used it, but supposedly writes behave the same as to ordinary global memory, while reads, even somewhat random ones, are fast as long as the data being read is nearby.
Quote:
 Originally Posted by fivemack <<<30,256>>> is a good configuration on the GTX275 for texture-mapping-like jobs.
The grid size needs to be a multiple of the number of MPs, and, as I've just learned from Jason, a multiplier of 1 works well.
Quote:
 Originally Posted by fivemack Things to try: texture memory, atomic bitwise operations for the larger primes.
Good luck!

 2009-11-10, 16:24 #17
__HRB__

Dec 2008
Boycotting the Soapbox

2⁴·3²·5 Posts

Need a dumb program

There is a way to do fast multiprecision multiplies using a SIMD-in-a-register technique, but the runtime would be something like 2x-4x that of a floating-point FFT. Since the algorithm uses only AND, OR, XOR, NAND, shuffles and shifts, it might have an advantage if one considers energy consumption.

I don't have an NVIDIA card, but maybe someone is willing to code and run two simple (and stupid) test programs to compute

Code:
(watts(program 1) - watts(idle)) / (watts(program 2) - watts(idle))

preferably by measuring power consumption directly at the power supply. If that isn't possible, maybe one can get an estimate by setting the fan speed on the card constant and computing

Code:
(maxtemp(program 1) - maxtemp(idle)) / (maxtemp(program 2) - maxtemp(idle))

by measuring the GPU temperature instead.

Program 1: fill an array 'x' of int with 0xFFFFFFFF, and repeatedly compute y = y xor x.
Program 2: fill the array 'x' of float with -1/3, and repeatedly compute y = x + y*y, using a fused multiply-accumulate instruction.

Arrays should be large enough to maximize throughput.
2011-03-05, 18:04   #18
alexhiggins732

Mar 2010
Brick, NJ

67₁₀ Posts

Quote:
 Originally Posted by fivemack I can get a cheap factor-5 speed-up for the program above by splitting the primes into small primes (the first eight) and large primes (all the rest); the large primes are handled with the Sieve kernel above, one prime per thread, and the small primes one chunk of memory per thread by
Code:
__global__ static void SplitSieve(char* sieve, int prime, int zap, int sieve_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = prime*(prime+idx); ... )
        ...
}
...
SplitSieve<<< ... >>>(device_big_sieve, primes[r], 400, N);
cudaThreadSynchronize();

Quote:
 Originally Posted by fivemack This is still desperately inefficient; I'm doing half a billion one-byte writes, when it ought to be possible to do something coalesced for the small primes.
A quick look at this shows that a major cause of slowdown is thread starvation. The code above was running something like 3 threads per block, with each thread sieving a single prime across the entire sieve. Just to hide global memory latency you need about 50% occupancy.

I was able to get a large speed up by assigning each block to a prime and having each thread sieve a partition of the entire sieve.

Code:
__global__ static void Sieve(u8 * sieve, int * primes, int maxp, unsigned int sieve_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ((idx > 0) && (idx < maxp))
    {
        int ithPrime = primes[idx];
        // i = (ithPrime-1)/2 + ithPrime, though the compiler knew this
        for (unsigned int i = (3*ithPrime) >> 1; i < sieve_size; i += ithPrime)
            sieve[i] = 1;
    }
}
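A CPU model of this kernel's indexing, assuming (as the code suggests) that sieve[i] stands for the odd number 2i+1: crossing off from index (3*ithPrime)>>1 in steps of ithPrime hits exactly the odd multiples of the prime, so the unmarked entries are the odd primes. A sketch with hypothetical names:

```c
#include <assert.h>
#include <string.h>

/* Counts primes below 2*sieve_size+1, using the same odd-only indexing
   as the GPU kernel: sieve[i] represents the odd number 2i+1, and each
   odd prime p is crossed off from index (3*p)>>1 in steps of p. */
int count_odd_sieve(unsigned int sieve_size)   /* sieve_size <= 65536 */
{
    static unsigned char sieve[1 << 16];
    memset(sieve, 0, sieve_size);
    for (unsigned int p = 3; p * p < 2 * sieve_size + 1; p += 2)
        if (!sieve[(p - 1) / 2])               /* p itself is still prime */
            for (unsigned int i = (3 * p) >> 1; i < sieve_size; i += p)
                sieve[i] = 1;                  /* mark the odd multiples of p */
    int count = 1;                             /* the prime 2 */
    for (unsigned int i = 1; i < sieve_size; i++)  /* skip i = 0, the number 1 */
        count += !sieve[i];
    return count;
}
```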
Use this instead:
Code:
__global__ static void Sieve2(u8 * sieve, int * primes, int maxp, unsigned int sieve_size)
{
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int block_size = sieve_size / threads_per_block;   // default: 134217728/256 = 524288
    if ((bx > 0) && (bx < maxp))
    {
        int p = primes[bx];
        unsigned int i;
        int lim;
        if (tx == 0)
        {
            i = (3*p) >> 1;   // next multiple of p that is not even, i.e. 3*p
            lim = block_size;
        }
        else
        {
            i = block_size * tx;   // start of this thread's partition
            lim = i + block_size;

            i <<= 1;
            i -= 1;
            i /= p;   // set i to the largest i satisfying (ithPrime*i) < start
            i *= p;   // set i = (ithPrime*i) < start
            i += p;   // set i to the smallest value with (ithPrime*i) > start
            if ((i & 1) == 0) i += p;   // keep it odd
            i -= 1;
            i >>= 1;
        }
        if (lim > sieve_size)
            lim = sieve_size;   // make sure lim <= sieve_size so we don't go past the end of the array
        for (; i < lim; i += p)
            sieve[i] = 1;
    }
}
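The index juggling in the else branch can be checked on the CPU against a brute-force search (helper names are mine; the kernel special-cases tx == 0, so the partition start here is assumed to be at least 1):

```c
#include <assert.h>

/* Sieve2's partition-start arithmetic: given an odd prime p and a partition
   starting at index `start` >= 1 (index i stands for the odd number 2i+1),
   the shift/divide sequence should land on the smallest index >= start
   whose odd number is a multiple of p. */
unsigned int first_hit(unsigned int p, unsigned int start)
{
    unsigned int i = start;
    i <<= 1;
    i -= 1;                    /* the odd number just below the partition */
    i /= p;
    i *= p;
    i += p;                    /* smallest multiple of p above it */
    if ((i & 1) == 0) i += p;  /* keep it odd */
    i -= 1;
    i >>= 1;                   /* back to an index */
    return i;
}

/* Reference: scan upward for the first index whose odd number p divides. */
unsigned int first_hit_brute(unsigned int p, unsigned int start)
{
    unsigned int i = start;
    while ((2 * i + 1) % p != 0) i++;
    return i;
}
```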
It sieves up to 2^30 in 2 seconds on a GTX 260. YAFU takes 2.4375 seconds to sieve the same range.

By compacting the bytes into a bit array, simply by changing sieve[i] = 1; to sieve[i>>3] |= 1 << (i&7);, the sieve takes only ~0.6 seconds up to 2^30. However, the read-modify-write on shared bytes introduces a race condition between threads, which spits out inaccurate results.
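The compaction itself is sound when only one writer touches each byte; a single-threaded CPU sketch (hypothetical names) showing that the OR form preserves the other seven bits and still yields the right prime count. On the device, the byte-wide read-modify-write would need atomicOr (or one thread per byte) to be race-free:

```c
#include <assert.h>
#include <string.h>

#define HALF 500   /* indices 0..499 stand for the odd numbers 1..999 */

/* Bit-array sieve: bit i of the array stands for the odd number 2i+1.
   The OR keeps the other seven bits of the shared byte intact, which a
   plain store (sieve[i>>3] = 1 << (i&7)) would clobber. */
int count_primes_bits(void)
{
    unsigned char bits[(HALF + 7) / 8];
    memset(bits, 0, sizeof(bits));
    for (unsigned int p = 3; p * p < 2 * HALF + 1; p += 2)
        for (unsigned int i = (3 * p) >> 1; i < HALF; i += p)
            bits[i >> 3] |= (unsigned char)(1u << (i & 7));
    int count = 1;                            /* the prime 2 */
    for (unsigned int i = 1; i < HALF; i++)   /* skip i = 0, the number 1 */
        count += !((bits[i >> 3] >> (i & 7)) & 1);
    return count;
}
```

With many GPU threads the unsynchronized |= still races on the shared byte, which matches the inaccurate results reported above.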

Still, this is a very unoptimized implementation which runs on par with highly optimized versions of the CPU sieve, so maybe this shouldn't be written off as un-doable yet.

I think the best approach may be to have all primes sieve a partition of the sieve in shared memory, using the bit-array approach, and then copy the shared memory to global memory in a bank-conflict-free and coalesced manner. Playing around with that approach, I can achieve a computation time of 0.2 seconds for up to 2^32 and a similar time writing to global memory. That gives me hope that a range of 2^32 should be sievable in ~0.5 seconds.

However, attempts at combining the two end up with race conditions and bank conflicts which hit performance dramatically, but I still have a few more ideas to try.

2012-03-06, 14:30   #19
ET_
Banned

"Luigi"
Aug 2002
Team Italia

1001011011110₂ Posts

Quote:
 Originally Posted by alexhiggins732 I think the best approach may be to have all primes sieve a partition of the sieve in shared memory, using the bit-array approach, and then copy the shared memory to global memory in a bank-conflict-free and coalesced manner. Playing around with that approach, I can achieve a computation time of 0.2 seconds for up to 2^32 and a similar time writing to global memory. That gives me hope that a range of 2^32 should be sievable in ~0.5 seconds. However, attempts at combining the two end up with race conditions and bank conflicts which hit performance dramatically, but I still have a few more ideas to try.
Any news?

Luigi

 2012-06-26, 18:51 #20
bsquared
"Ben"

Feb 2007

2·1,789 Posts

I recently discovered that one of the servers at work has a Tesla M2050 running CUDA 3.2, so I thought I'd make my first foray into GPU programming by trying out this problem. Below is a segmented sieve that counts the primes in the 10^8 interval in 22 milliseconds, or about 200x faster (not quite 11k times faster, but still not too shabby). A 2^30 range takes 250 milliseconds, which is about twice as fast as YAFU's single-threaded CPU version on an X5680. So far, though, a multi-threaded CPU will still beat the GPU. But I have a bunch more optimizations left to try on the GPU (bit packing, additional "spokes" in the wheel sieve, coalesced writes for the smallest primes, etc.).

Code:
#include <stdio.h>
#include <cutil.h>

typedef unsigned char u8;
typedef unsigned int uint32;
typedef unsigned char uint8;
typedef long long unsigned int uint64;

int maxT = 0;
const int threadsPerBlock = 256;
const uint32 block_size = 8192;

__global__ static void SegSieve(uint32 *count, uint32 * primes, int maxp, int nump, uint32 N)
{
    uint32 j,k;
    uint32 bid = blockIdx.y * gridDim.x + blockIdx.x;
    uint32 range = block_size / threadsPerBlock;
    __shared__ u8 locsieve[block_size];

    // everyone init the array. 
for (j=threadIdx.x * range, k=0; k N if ((2*(bid * block_size + j + k)+1) < N) locsieve[j+k] = 1; else locsieve[j+k] = 0; } // regroup before sieving __syncthreads(); // now sieve the array for (j=0; j= block_start)) start_offset = p/2 + p - block_start; else { uint32 dist = (block_start - p/2); uint32 steps = dist / p; if ((dist % p) == 0) start_offset = (steps * p) + p/2 - block_start; else start_offset = (steps * p) + p + p/2 - block_start; } for(uint32 i=start_offset ;i < block_size; i+=p) locsieve[i] = 0; } } // regroup before counting __syncthreads(); // each thread sum a range of the array for (j=threadIdx.x * range + 1, k=0; k 0) { prime = (uint32)(2*i + 1); for (j=i+prime;j= 1) { printf("Device %d supports CUDA %d.%d\n",i, prop.major, prop.minor); printf("It has warp size %d, %d regs per block, %d threads per block\n",prop.warpSize, prop.regsPerBlock, prop.maxThreadsPerBlock); printf("max Threads %d x %d x %d\n",prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]); maxT = prop.maxThreadsDim[0]; printf("max Grid %d x %d x %d\n",prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]); printf("total constant memory %d\n",prop.totalConstMem); break; } } } if(i == count) { fprintf(stderr, "There is no device supporting CUDA 1.x.\n"); return false; } cudaSetDevice(i); return true; } int main(int argc, char** argv) { uint64 N = 100000; uint32 Nsmall; int numblocks; int primes_per_thread; uint64 array_size; uint32* primes; uint32* device_primes; uint32* block_counts; uint32 np; unsigned int thandle; uint32* block_counts_on_host; if (argc > 1) N = strtoull(argv[1],NULL,10); Nsmall = (uint32)sqrt(N); if(!InitCUDA()) { return 0; } primes = (uint32*)malloc(Nsmall*sizeof(uint32)); np = tiny_soe(Nsmall, primes); printf("%d small primes (< %d) found\n", np, Nsmall); // put the primes on the device CUDA_SAFE_CALL(cudaMalloc((void**) &device_primes, sizeof(uint32) * np)); CUDA_SAFE_CALL(cudaMemcpy(device_primes, primes, sizeof(uint32)*np, 
cudaMemcpyHostToDevice)); cutCreateTimer(&thandle); cutStartTimer(thandle); array_size = (N / 2 / block_size + 1) * block_size; numblocks = array_size / block_size; primes_per_thread = (np + threadsPerBlock - 1) / threadsPerBlock; printf("using %d blocks and %d threads, with %d primes per thread\n", numblocks, threadsPerBlock, primes_per_thread); CUDA_SAFE_CALL(cudaMalloc((void**) &block_counts, sizeof(uint32) * numblocks)); CUDA_SAFE_CALL(cudaMemset(block_counts, 0, sizeof(uint32) * numblocks)); dim3 grid((uint32)sqrt(numblocks)+1,(uint32)sqrt(numblocks)+1); SegSieve<<>>(block_counts, device_primes, np, primes_per_thread, N); cudaThreadSynchronize(); cutStopTimer(thandle); printf("%f milliseconds for big sieve\n",cutGetTimerValue(thandle)); block_counts_on_host = (uint32 *)malloc(numblocks * sizeof(uint32)); cudaMemcpy(block_counts_on_host, block_counts, sizeof(uint32) * numblocks, cudaMemcpyDeviceToHost); cudaFree(device_primes); cudaFree(block_counts); uint32 nbig = 0; for (int i=0; i
 2012-06-27, 13:28 #21 bsquared     "Ben" Feb 2007 2·1,789 Posts Some loop unrolling gave nearly a factor of 2 speedup: Code: __global__ static void SegSieve(uint32 *count, uint16 * primes, int maxp, int nump, uint32 N) { uint32 i,j,k; uint32 bid = blockIdx.y * gridDim.x + blockIdx.x; uint32 range = block_size / threadsPerBlock; __shared__ u8 locsieve[block_size]; // everyone init the array. for (j=threadIdx.x * range, k=0; k N if ((2*(bid * block_size + j + k)+1) < N) locsieve[j+k] = 1; else locsieve[j+k] = 0; } // regroup before sieving __syncthreads(); // now sieve the array for (j=0; j= block_start)) start_offset = p/2 + p - block_start; else { uint32 dist = (block_start - p/2); uint32 steps = dist / p; if ((dist % p) == 0) start_offset = (steps * p) + p/2 - block_start; else start_offset = (steps * p) + p + p/2 - block_start; } if (p < 1024) { uint32 stop = block_size - p * 8 + p; for(i=start_offset ;i < stop; i+=8*p) { locsieve[i] = 0; locsieve[i+p] = 0; locsieve[i+p*2] = 0; locsieve[i+p*2+p] = 0; locsieve[i+p*4] = 0; locsieve[i+p*4+p] = 0; locsieve[i+p*4+p*2] = 0; locsieve[i+p*7] = 0; } } else if (p < 2048) { uint32 stop = block_size - p * 4 + p; for(i=start_offset ;i < stop; i+=4*p) { locsieve[i] = 0; locsieve[i+p] = 0; locsieve[i+p*2] = 0; locsieve[i+p*2+p] = 0; } } else if (p < 4096) { uint32 stop = block_size - p * 2 + p; for(i=start_offset ;i < stop; i+=2*p) { locsieve[i] = 0; locsieve[i+p] = 0; } } else i = start_offset; for( ;i < block_size; i+=p) locsieve[i] = 0; } } // regroup before counting __syncthreads(); // each thread sum a range of the array for (j=threadIdx.x * range + 1, k=0; k
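The p < 1024 branch of that unrolling can be checked on the CPU: the eight-wide strided stores plus a scalar tail must touch exactly the cells the simple loop touches. A sketch with hypothetical names, mirroring the posted kernel's stop computation:

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 8192

/* Simple strided clear, as in the un-unrolled kernel. */
void clear_simple(unsigned char *s, unsigned int start, unsigned int p)
{
    for (unsigned int i = start; i < BLOCK_SIZE; i += p)
        s[i] = 0;
}

/* Eight-way unrolled clear with a scalar tail; requires p < BLOCK_SIZE/8,
   matching the kernel's p < 1024 guard for an 8192-byte block. */
void clear_unrolled(unsigned char *s, unsigned int start, unsigned int p)
{
    unsigned int i = start;
    unsigned int stop = BLOCK_SIZE - p * 8 + p;   /* last safe 8-wide start */
    for (; i < stop; i += 8 * p) {
        s[i] = 0;         s[i + p] = 0;
        s[i + p * 2] = 0; s[i + p * 3] = 0;
        s[i + p * 4] = 0; s[i + p * 5] = 0;
        s[i + p * 6] = 0; s[i + p * 7] = 0;
    }
    for (; i < BLOCK_SIZE; i += p)                /* scalar tail */
        s[i] = 0;
}

/* Both versions must clear exactly the same cells. */
int unroll_matches(unsigned int start, unsigned int p)
{
    static unsigned char a[BLOCK_SIZE], b[BLOCK_SIZE];
    memset(a, 1, BLOCK_SIZE);
    memset(b, 1, BLOCK_SIZE);
    clear_simple(a, start, p);
    clear_unrolled(b, start, p);
    return memcmp(a, b, BLOCK_SIZE) == 0;
}
```

The unrolled loop trades a few redundant address computations for far fewer branch tests per store, which is presumably where the factor-of-2 came from.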
 2012-06-27, 17:03 #22 axn     Jun 2003 1010000011111₂ Posts Have you tried looking at the generated assembly (not PTX, but the native ISA)? It might give you ideas as to where you can optimize.

