mersenneforum.org


xilman 2009-11-16 09:57

World's dumbest CUDA program?
 
Any CUDA people out there who may be able to explain why my GPU code doesn't seem to be able to write to global memory?

[CODE]#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <cutil_inline.h>

__global__ void
testKernel(unsigned char *output)
{
    int k;
    const unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (k=0; k < 8; k++) output[16*tid+k] = 42;
    for (k=0; k < 8; k++) output[16*tid+k+8] = 66;
}

#define BLOCK_SIZE 2
#define THREADS_PER_BLOCK 2
#define NUM_THREADS (BLOCK_SIZE * THREADS_PER_BLOCK)

#define MEM_SIZE (NUM_THREADS * 8)

void
run_test (int argc, char** argv)
{
    unsigned char *h_output, *d_output; /* Host and device memory for output */
    int iter;
    dim3 grid (BLOCK_SIZE, 1, 1); /* setup execution parameters */
    dim3 threads (THREADS_PER_BLOCK, 1, 1);

    if (cutCheckCmdLineFlag(argc, (const char**)argv, "device"))
        cutilDeviceInit(argc, argv);
    else
        cudaSetDevice (cutGetMaxGflopsDeviceId());

    /* Allocate host memory for output */
    h_output = (unsigned char *) calloc (1, 2 * MEM_SIZE);
    /* Allocate device memory for output */
    cutilSafeCall (cudaMalloc ((void**) &d_output, 2 * MEM_SIZE));

    /* Execute the kernel */
    testKernel <<< grid, threads >>> (d_output);

    /* Check if kernel execution generated an error */
    cutilCheckMsg ("Kernel execution failed");

    cutilSafeCall (cudaThreadSynchronize()); /* Wait for threads to complete. */

    /* Copy results from device to host memory */
    cutilSafeCall (cudaMemcpy (h_output, d_output, 2 & MEM_SIZE,
                               cudaMemcpyDeviceToHost));
    cutilSafeCall (cudaThreadSynchronize()); /* Wait for threads to complete. */

    for (iter = 0; iter < 2*MEM_SIZE; iter++) {
        printf ("%d %02x\n", iter, h_output[iter]);
    }
    free (h_output);
    cutilSafeCall (cudaFree (d_output));
    cudaThreadExit ();
}

int main (int argc, char** argv)
{
    run_test (argc, argv);
    cutilExit(argc, argv);
}
[/CODE]
Environment: 64-bit RedHat EL5.2; Tesla C1060; fresh install of CUDA 2.3; all SDK projects built without error and a representative selection run correctly.

This test case was developed from the SDK template project, then stripped down pretty much to the bare bones. It allocates a buffer on the host and one in global memory on the device, calls a kernel to write constant non-zero bytes, then prints what's happened, if anything. On my system it invariably prints zeros.
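
For reference, here is a minimal host-only sketch of what the printout should look like if the writes land. It emulates the kernel's indexing on the CPU for the same 2 x 2 launch. The helpers `emulate_kernel` and `dump` are hypothetical names for illustration only, not part of the CUDA SDK or the posted program.

```c
#include <stdio.h>

#define NUM_THREADS 4        /* 2 blocks x 2 threads, as in the post */
#define BYTES_PER_THREAD 16  /* each thread writes 8 + 8 bytes */

/* Host-only stand-in for testKernel (hypothetical helper): fills the
   buffer the way the device threads are meant to, one tid at a time. */
void emulate_kernel(unsigned char *output)
{
    unsigned tid;
    int k;
    for (tid = 0; tid < NUM_THREADS; tid++) {
        for (k = 0; k < 8; k++) output[16 * tid + k] = 42;     /* 0x2a */
        for (k = 0; k < 8; k++) output[16 * tid + k + 8] = 66; /* 0x42 */
    }
}

/* Print in the same "index hex-byte" format as the loop in run_test. */
void dump(const unsigned char *buf, int n)
{
    int i;
    for (i = 0; i < n; i++) printf("%d %02x\n", i, buf[i]);
}
```

With a 64-byte buffer (NUM_THREADS * BYTES_PER_THREAD), each 16-byte group should come out as eight 2a bytes followed by eight 42 bytes, rather than zeros.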

A kernel which does significant computation takes significant time to run, which implies that the kernel is being called, but the output still shows no writes to global memory.

The fact that everything works except my code suggests that I have a conceptual error rather than a system bug.


Paul

xilman 2009-11-16 10:26

[QUOTE=xilman;196023]
[CODE]
/* Copy results from device to host memory */
cutilSafeCall (cudaMemcpy (h_output, d_output, 2 & MEM_SIZE,
cudaMemcpyDeviceToHost));
[/CODE]

The fact that everything works except my code suggests that I have a conceptual error rather than a system bug.[/QUOTE]
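Found the problem: a simple and silly typo. The '&' key is next to the '*' key on my keyboard, so the copy size came out as 2 & MEM_SIZE instead of 2 * MEM_SIZE. That expression evaluates to zero, so the cudaMemcpy copied nothing and the host buffer kept the zeros it got from calloc.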
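
To spell out the arithmetic, here is a small host-only sketch using the same macros as the posted program. The function names `copy_size_typo` and `copy_size_intended` are made up for illustration. MEM_SIZE expands to 32, which shares no set bits with 2, so the bitwise AND is 0, whereas the intended product is 64.

```c
#include <stddef.h>

#define BLOCK_SIZE 2
#define THREADS_PER_BLOCK 2
#define NUM_THREADS (BLOCK_SIZE * THREADS_PER_BLOCK)
#define MEM_SIZE (NUM_THREADS * 8)

/* Size as typed in the original post: bitwise AND of
   2 (binary 000010) and 32 (binary 100000). */
size_t copy_size_typo(void)
{
    return 2 & MEM_SIZE;
}

/* Size as intended: twice MEM_SIZE, matching the allocations. */
size_t copy_size_intended(void)
{
    return 2 * MEM_SIZE;
}
```

A zero-byte cudaMemcpy is, as far as I know, not treated as an error, which would explain why cutilSafeCall stayed quiet while the printout showed nothing but zeros.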


Paul


All times are UTC.