World's dumbest CUDA program?
Posted by xilman, 2009-11-16, 09:57 (#1)

Any CUDA people out there who may be able to explain why my GPU code doesn't seem to be able to write to global memory?

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <cutil_inline.h>

__global__ void
testKernel(unsigned char *output)
{
  int k;
  /* Each thread owns 16 consecutive bytes: 8 bytes of 42, then 8 bytes of 66 */
  const unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (k=0; k < 8; k++) output[16*tid+k] = 42;
  for (k=0; k < 8; k++) output[16*tid+k+8] = 66;
}

#define BLOCK_SIZE 2

#define MEM_SIZE (NUM_THREADS * 8)

void
run_test (int argc, char** argv)
{
  unsigned char *h_output, *d_output;	/* Host and device memory for output */
  int iter;
  dim3  grid (BLOCK_SIZE, 1, 1);	/* setup execution parameters */
  dim3  threads (THREADS_PER_BLOCK, 1, 1);

  if (cutCheckCmdLineFlag(argc, (const char**)argv, "device"))
    cutilDeviceInit(argc, argv);
  else
    cudaSetDevice (cutGetMaxGflopsDeviceId());

  /* Allocate host memory for output */
  h_output = (unsigned char *) calloc (1, 2 * MEM_SIZE);
  /* Allocate device memory for output */
  cutilSafeCall (cudaMalloc ((void**) &d_output, 2 * MEM_SIZE));

  /* Execute the kernel */
  testKernel <<< grid, threads >>> (d_output);

  /* Check if kernel execution generated an error */
  cutilCheckMsg ("Kernel execution failed");

  cutilSafeCall (cudaThreadSynchronize());	/* Wait for threads to complete. */

  /* Copy results from device to host memory */
  cutilSafeCall (cudaMemcpy (h_output, d_output, 2 & MEM_SIZE,
			     cudaMemcpyDeviceToHost));
  cutilSafeCall (cudaThreadSynchronize());	/* Wait for threads to complete. */

  for (iter = 0; iter < 2*MEM_SIZE; iter++) {
    printf ("%d %02x\n", iter, h_output[iter]);
  }
  free (h_output);
  cutilSafeCall (cudaFree (d_output));
  cudaThreadExit ();
}

int main (int argc, char** argv)
{
  run_test (argc, argv);
  cutilExit(argc, argv);
}

Environment: 64-bit RedHat EL5.2; Tesla C1060; fresh install of CUDA 2.3; all SDK projects built without error and a representative selection run correctly.

This test case was developed from the SDK template project, then stripped down pretty much to bare bones. It allocates memory for the output on both host and device, calls a kernel to write constant non-zero bytes to the device buffer, copies the buffer back and prints whatever has arrived. On my system it invariably prints zeros.
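
For reference, the output the kernel is meant to produce can be sketched on the host alone, with no GPU involved. This is only an illustration under stated assumptions: the value 4 for THREADS_PER_BLOCK is made up (the actual value is not shown in the post), and the names NUM_BLOCKS, BYTES_PER_THREAD and fill_expected are hypothetical helpers, not part of the program above. NUM_BLOCKS plays the role of what the program calls BLOCK_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed launch shape, for illustration only. */
#define NUM_BLOCKS        2   /* the program above calls this BLOCK_SIZE */
#define THREADS_PER_BLOCK 4   /* hypothetical; real value not shown in the post */
#define NUM_THREADS       (NUM_BLOCKS * THREADS_PER_BLOCK)
#define BYTES_PER_THREAD  16  /* 8 bytes of 42 followed by 8 bytes of 66 */

/* Fill `out` with the pattern testKernel is meant to write:
   thread tid owns bytes [16*tid, 16*tid + 16). */
void fill_expected(unsigned char *out, size_t num_threads)
{
  size_t tid;
  assert(out != NULL);
  for (tid = 0; tid < num_threads; tid++) {
    memset(out + BYTES_PER_THREAD * tid, 42, 8);      /* prints as 2a */
    memset(out + BYTES_PER_THREAD * tid + 8, 66, 8);  /* prints as 42 */
  }
}
```

With a buffer of NUM_THREADS * BYTES_PER_THREAD bytes filled this way, a memcmp against h_output after the device-to-host copy would show at which byte the two first differ, which may be easier to read than the raw printf dump.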

A kernel which does significant computation takes significant time to run, implying that the kernel is being called, but still doesn't write to global memory.

The fact that everything works except my code suggests that I have a conceptual error rather than a system bug.
