mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Programming (https://www.mersenneforum.org/forumdisplay.php?f=29)
-   -   World's second-dumbest CUDA program (https://www.mersenneforum.org/showthread.php?t=11900)

ewmayer 2015-02-11 04:55

[QUOTE=ewmayer;394667]Pardon the rank-n00b-ness on my part here, but when I use nvcc to build the codebase, do I get something other than a CUDA binary? Are you talking about some kind of secondary "tuning" compile/link pass? If so, how much of a speedup might one expect from the added tuning pass?[/QUOTE]

I had a look at the build-log for mfaktc on my Linux system, and it seems the required combination of compile-time options is ' --ptxas-options=-v --generate-code ...'. Here are the results for my current two GPU modular-powering options:

[b]***** 64-bit integer-modpow: *****[/b]
[i]
nvcc --ptxas-options=-v --generate-code arch=compute_20,code=sm_20 --generate-code arch=compute_30,code=sm_30 -c -DUSE_GPU twopmodq64.cu

ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'GPU_TF64' for 'sm_20'
ptxas info : Function properties for GPU_TF64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 19 registers, 68 bytes cmem[0], 4 bytes cmem[16]
ptxas info : Compiling entry function 'VecModpow64' for 'sm_20'
ptxas info : Function properties for VecModpow64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 19 registers, 84 bytes cmem[0], 4 bytes cmem[16]
ptxas info : Compiling entry function 'GPU_TF64_q4' for 'sm_20'
ptxas info : Function properties for GPU_TF64_q4
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 68 bytes cmem[0], 4 bytes cmem[16]
ptxas info : Compiling entry function 'GPU_TF64_pop64' for 'sm_20'
ptxas info : Function properties for GPU_TF64_pop64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 19 registers, 60 bytes cmem[0], 4 bytes cmem[16]

ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'GPU_TF64' for 'sm_30'
ptxas info : Function properties for GPU_TF64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 24 registers, 356 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Compiling entry function 'VecModpow64' for 'sm_30'
ptxas info : Function properties for VecModpow64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 372 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Compiling entry function 'GPU_TF64_q4' for 'sm_30'
ptxas info : Function properties for GPU_TF64_q4
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 30 registers, 356 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Compiling entry function 'GPU_TF64_pop64' for 'sm_30'
ptxas info : Function properties for GPU_TF64_pop64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 348 bytes cmem[0], 4 bytes cmem[2]
[/i]
[b]***** 78-bit float64-modpow: *****[/b]
[i]
nvcc --ptxas-options=-v --generate-code arch=compute_20,code=sm_20 --generate-code arch=compute_30,code=sm_30 -c -DUSE_GPU twopmodq80.cu

ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'GPU_TF78' for 'sm_20'
ptxas info : Function properties for GPU_TF78
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 35 registers, 68 bytes cmem[0], 12 bytes cmem[16]
ptxas info : Compiling entry function 'VecModpow78' for 'sm_20'
ptxas info : Function properties for VecModpow78
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 36 registers, 84 bytes cmem[0], 12 bytes cmem[16]
ptxas info : Compiling entry function 'GPU_TF78_q4' for 'sm_20'
ptxas info : Function properties for GPU_TF78_q4
400 bytes stack frame, 268 bytes spill stores, 88 bytes spill loads
ptxas info : Used 63 registers, 68 bytes cmem[0], 12 bytes cmem[16]
ptxas info : Compiling entry function 'GPU_TF78_pop64' for 'sm_20'
ptxas info : Function properties for GPU_TF78_pop64
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 36 registers, 60 bytes cmem[0], 12 bytes cmem[16]

ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'GPU_TF78' for 'sm_30'
ptxas info : Function properties for GPU_TF78
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 38 registers, 356 bytes cmem[0], 12 bytes cmem[2]
ptxas info : Compiling entry function 'VecModpow78' for 'sm_30'
ptxas info : Function properties for VecModpow78
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 40 registers, 372 bytes cmem[0], 12 bytes cmem[2]
ptxas info : Compiling entry function 'GPU_TF78_q4' for 'sm_30'
ptxas info : Function properties for GPU_TF78_q4
424 bytes stack frame, 336 bytes spill stores, 152 bytes spill loads
ptxas info : Used 63 registers, 356 bytes cmem[0], 12 bytes cmem[2]
ptxas info : Compiling entry function 'GPU_TF78_pop64' for 'sm_30'
ptxas info : Function properties for GPU_TF78_pop64
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 41 registers, 348 bytes cmem[0], 12 bytes cmem[2]
[/i]
Do those register counts look reasonable for the functions in question? Are there certain 'sweet spot' breakover points (such as staying at or below a power of 2) one should target here, and is such targeting best done via a __launch_bounds__ qualifier on the function in question, or by playing with the -maxrregcount compile option?
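
For concreteness, here is an untested, illustrative sketch of the two approaches (dummy kernel and body, not the actual twopmodq code):

[code]
// Per-kernel capping via __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM):
__global__ void
__launch_bounds__(256, 2)  // <= 256 threads/block; want >= 2 resident blocks/SM
dummy_modpow_kernel(const unsigned long long *q, unsigned int *res, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        res[i] = (unsigned int)(q[i] & 1ULL);  // placeholder body
}

// The alternative is file-wide, set at compile time:
//   nvcc --ptxas-options=-v -maxrregcount=32 -c -DUSE_GPU twopmodq64.cu
// __launch_bounds__ constrains a single kernel, whereas -maxrregcount caps
// every kernel in the translation unit.
[/code]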

Prime95 2015-02-11 15:06

Get the CUDA occupancy spreadsheet. There you can plug in shared memory usage as well as your register usage and learn your occupancy.

I can tell you that when you see lots of spill loads and stores you are generally in trouble. That means you've used up the maximum number of fast registers (which also hurts occupancy) and the compiler is forced to spill to local memory, which lives in slow off-chip device memory.
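
Back-of-envelope example, assuming the Fermi sm_20 limits of 32768 registers and 48 warps (1536 threads) per SM: your 63-register GPU_TF78_q4 kernel needs 63 x 32 = 2016 registers per warp, so at most floor(32768/2016) = 16 warps fit on an SM, i.e. 16/48 = 33% occupancy before shared-memory limits even enter the picture. That is what "hurts occupancy" means in practice.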

ewmayer 2015-02-12 22:51

Thanks, George - [url=https://devtalk.nvidia.com/default/topic/368105/cuda-occupancy-calculator-helps-pick-optimal-thread-block-size/]here is the link[/url] to the nVidia (MS-Excel-based) occupancy calculator. Will play with it when I get home this evening.

Note that there is another, similar-sounding facility that lets CUDA users get occupancy data for their kernels at runtime. This distinct Occupancy Calculator API is described in section 5.2.3.1 of the current CUDA C Programming Guide. I tried the sample code in the Guide (with one of my own kernels, obviously), but get [i]error: 'cudaOccupancyMaxActiveBlocksPerMultiprocessor' : identifier not found.[/i] Missing header file, perhaps? (The sample code and surrounding documentation mention no extra headers or link-libraries as being needed to enable this.)
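
For reference, the Guide's sample boils down to something like the following self-contained version (the kernel is just a stand-in for one of mine; note the occupancy API only appeared in CUDA 6.5, if memory serves, so an older toolkit would also explain the 'identifier not found'):

[code]
#include <stdio.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of my own (e.g. GPU_TF64):
__global__ void MyKernel(int *d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d[idx] *= 2;
}

int main()
{
    int numBlocks;             // result: max active blocks per multiprocessor
    const int blockSize = 256;

    // Requires CUDA 6.5+ and <cuda_runtime.h>:
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MyKernel,
                                                  blockSize, 0 /*dyn. smem*/);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Active blocks/SM: %d => occupancy %.0f%%\n", numBlocks,
           100.0 * (numBlocks * blockSize) / prop.maxThreadsPerMultiProcessor);
    return 0;
}
[/code]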

Have any of our readers used the above API?

