mersenneforum.org  

Old 2015-02-11, 04:55   #111
ewmayer
2ω=0
Sep 2002
República de California

2·3·29·67 Posts

Quote:
Originally Posted by ewmayer View Post
Pardon the rank-n00b-ness on my part here, but when I use nvcc to build the codebase, I get something other than a CUDA binary? Are you talking about some kind of secondary "tuning" compile/link pass? If so, how much of a speedup might one expect from the added tuning pass?
I had a look at the build-log for mfaktc on my Linux system, and it seems the required combination of compile-time options is ' --ptxas-options=-v --generate-code ...'. Here are the results for my current two GPU modular-powering options:

***** 64-bit integer-modpow: *****

nvcc --ptxas-options=-v --generate-code arch=compute_20,code=sm_20 --generate-code arch=compute_30,code=sm_30 -c -DUSE_GPU twopmodq64.cu

ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'GPU_TF64' for 'sm_20'
ptxas info : Function properties for GPU_TF64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 19 registers, 68 bytes cmem[0], 4 bytes cmem[16]
ptxas info : Compiling entry function 'VecModpow64' for 'sm_20'
ptxas info : Function properties for VecModpow64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 19 registers, 84 bytes cmem[0], 4 bytes cmem[16]
ptxas info : Compiling entry function 'GPU_TF64_q4' for 'sm_20'
ptxas info : Function properties for GPU_TF64_q4
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 68 bytes cmem[0], 4 bytes cmem[16]
ptxas info : Compiling entry function 'GPU_TF64_pop64' for 'sm_20'
ptxas info : Function properties for GPU_TF64_pop64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 19 registers, 60 bytes cmem[0], 4 bytes cmem[16]

ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'GPU_TF64' for 'sm_30'
ptxas info : Function properties for GPU_TF64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 24 registers, 356 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Compiling entry function 'VecModpow64' for 'sm_30'
ptxas info : Function properties for VecModpow64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 372 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Compiling entry function 'GPU_TF64_q4' for 'sm_30'
ptxas info : Function properties for GPU_TF64_q4
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 30 registers, 356 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Compiling entry function 'GPU_TF64_pop64' for 'sm_30'
ptxas info : Function properties for GPU_TF64_pop64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 348 bytes cmem[0], 4 bytes cmem[2]

***** 78-bit float64-modpow: *****

nvcc --ptxas-options=-v --generate-code arch=compute_20,code=sm_20 --generate-code arch=compute_30,code=sm_30 -c -DUSE_GPU twopmodq80.cu

ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'GPU_TF78' for 'sm_20'
ptxas info : Function properties for GPU_TF78
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 35 registers, 68 bytes cmem[0], 12 bytes cmem[16]
ptxas info : Compiling entry function 'VecModpow78' for 'sm_20'
ptxas info : Function properties for VecModpow78
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 36 registers, 84 bytes cmem[0], 12 bytes cmem[16]
ptxas info : Compiling entry function 'GPU_TF78_q4' for 'sm_20'
ptxas info : Function properties for GPU_TF78_q4
400 bytes stack frame, 268 bytes spill stores, 88 bytes spill loads
ptxas info : Used 63 registers, 68 bytes cmem[0], 12 bytes cmem[16]
ptxas info : Compiling entry function 'GPU_TF78_pop64' for 'sm_20'
ptxas info : Function properties for GPU_TF78_pop64
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 36 registers, 60 bytes cmem[0], 12 bytes cmem[16]

ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'GPU_TF78' for 'sm_30'
ptxas info : Function properties for GPU_TF78
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 38 registers, 356 bytes cmem[0], 12 bytes cmem[2]
ptxas info : Compiling entry function 'VecModpow78' for 'sm_30'
ptxas info : Function properties for VecModpow78
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 40 registers, 372 bytes cmem[0], 12 bytes cmem[2]
ptxas info : Compiling entry function 'GPU_TF78_q4' for 'sm_30'
ptxas info : Function properties for GPU_TF78_q4
424 bytes stack frame, 336 bytes spill stores, 152 bytes spill loads
ptxas info : Used 63 registers, 356 bytes cmem[0], 12 bytes cmem[2]
ptxas info : Compiling entry function 'GPU_TF78_pop64' for 'sm_30'
ptxas info : Function properties for GPU_TF78_pop64
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 41 registers, 348 bytes cmem[0], 12 bytes cmem[2]

Do those register counts look reasonable for the functions in question? Are there certain 'sweet spot' breakover points (such as <= a power of 2) one should target here, and is such targeting best done via the __launch_bounds__ qualifier on the function in question or by playing with the -maxrregcount compile option?
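
For readers weighing the two options in the question above, here is a minimal sketch of the per-kernel route. The kernel square_kernel, the helper squares_match, and both limit values are hypothetical and are not taken from twopmodq64.cu or twopmodq80.cu. The sketch uses cudaFuncGetAttributes to read back the same register and local-memory figures that ptxas prints, so the effect of changing the bounds can be checked without rereading the build log.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical limits chosen for illustration; tune them per kernel and per GPU.
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_SM     4

// __launch_bounds__ states the largest block size this kernel will be launched
// with and, optionally, how many blocks should be resident per SM, so the
// compiler can limit register use for this one kernel.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
square_kernel(const unsigned long long *in, unsigned long long *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Runs square_kernel on n > 0 values; returns true if the GPU results match the host.
bool squares_match(int n)
{
    std::vector<unsigned long long> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = (unsigned long long)i + 3ULL;

    unsigned long long *d_in = 0, *d_out = 0;
    size_t bytes = (size_t)n * sizeof(unsigned long long);
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    int blocks = (n + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;
    square_kernel<<<blocks, MAX_THREADS_PER_BLOCK>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i] * h_in[i]) return false;
    return true;
}

int main()
{
    // Query the compiled kernel's resource usage at runtime.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, square_kernel) != cudaSuccess) {
        printf("could not query kernel attributes\n");
        return 1;
    }
    printf("registers per thread: %d, local memory per thread: %lu bytes\n",
           attr.numRegs, (unsigned long)attr.localSizeBytes);
    printf("results %s\n", squares_match(1000) ? "match" : "differ");
    return 0;
}
```

The -maxrregcount option applies one cap to every kernel in the compilation unit, whereas __launch_bounds__ is set per kernel, so the qualifier is usually the finer-grained tool when only one function (such as a heavily spilling one) needs attention.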
Old 2015-02-11, 15:06   #112
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

2⁴·3²·53 Posts

Get the CUDA occupancy spreadsheet. There you can plug in shared memory usage as well as your register usage and learn your occupancy.

I can tell you that when you see lots of spill loads and stores you are generally in trouble. That means you've used the maximum number of fast registers (which also hurts occupancy) and the compiler is forced to spill to slow local memory, which lives in off-chip device memory just like global memory.
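
To put rough numbers on the register side of that, here is a host-only sketch of the register-limited part of the occupancy arithmetic. It is a simplification and not a replacement for the spreadsheet: it ignores register-allocation granularity, shared memory, and the per-SM block limit, so it gives an upper bound only. The struct, the function name, and the per-SM limits (32768 registers and 1536 threads for sm_20, 65536 registers and 2048 threads for sm_30) are assumptions of this sketch, taken from NVIDIA's published compute-capability tables.

```cuda
#include <cstdio>

// Per-SM resource limits assumed for this sketch.
struct SmLimits {
    const char *name;
    int regs_per_sm;         // 32-bit registers available per multiprocessor
    int max_threads_per_sm;  // maximum resident threads per multiprocessor
    int warp_size;
};

// Upper bound on resident warps per SM imposed by register use alone.
// Real hardware rounds register allocation up, so actual values can be lower.
int reg_limited_warps(const SmLimits &sm, int regs_per_thread)
{
    int max_warps = sm.max_threads_per_sm / sm.warp_size;
    int by_regs   = sm.regs_per_sm / (regs_per_thread * sm.warp_size);
    return by_regs < max_warps ? by_regs : max_warps;
}

int main()
{
    const SmLimits sm20 = { "sm_20", 32768, 1536, 32 };
    const SmLimits sm30 = { "sm_30", 65536, 2048, 32 };

    // A few register counts taken from the ptxas output quoted earlier in the thread.
    const int counts20[] = { 19, 25, 35, 63 };
    const int counts30[] = { 24, 30, 38, 63 };

    for (int i = 0; i < 4; ++i) {
        int w = reg_limited_warps(sm20, counts20[i]);
        printf("%s: %2d regs/thread -> at most %2d of %d warps\n",
               sm20.name, counts20[i], w, sm20.max_threads_per_sm / sm20.warp_size);
    }
    for (int i = 0; i < 4; ++i) {
        int w = reg_limited_warps(sm30, counts30[i]);
        printf("%s: %2d regs/thread -> at most %2d of %d warps\n",
               sm30.name, counts30[i], w, sm30.max_threads_per_sm / sm30.warp_size);
    }
    return 0;
}
```

Under these assumptions the 63-register q4 kernels are capped at 16 of 48 warps on sm_20 and 32 of 64 warps on sm_30, before the spill traffic is even counted.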
Old 2015-02-12, 22:51   #113
ewmayer
2ω=0
Sep 2002
República de California

2×3×29×67 Posts

Thanks, George - here is the link to the nVidia (MS-Excel-based) occupancy calculator. Will play with it when I get home this evening.

Note that there is another similar-sounding functionality enabling CUDA users to get occupancy data at runtime for their kernels. This distinct Occupancy Calculator functionality is described in section 5.2.3.1 of the current CUDA C Programming guide. I tried the sample code in the C-guide (with one of my own kernels, obviously), but get error: 'cudaOccupancyMaxActiveBlocksPerMultiprocessor' : identifier not found. Missing header file, perhaps? (The sample code and surrounding documentation mentions no extra headers or link-libraries as being needed to enable this.)

Have any of our readers used the above API?
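
A minimal sketch of calling the runtime occupancy API follows, with a hypothetical demo_kernel and helper max_blocks_per_sm standing in for a real kernel. To my knowledge cudaOccupancyMaxActiveBlocksPerMultiprocessor is declared through cuda_runtime.h and first shipped with CUDA 6.5, so an "identifier not found" error may simply indicate an older toolkit rather than a missing header. The sketch guards the call with CUDART_VERSION on that assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>   // declares the occupancy API in toolkits that ship it

// Hypothetical stand-in kernel; substitute the kernel of interest.
__global__ void demo_kernel(unsigned long long *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i] + 1ULL;
}

// Returns the number of demo_kernel blocks of the given size that can be
// resident on one SM, or -1 if the query fails or the toolkit lacks the API.
int max_blocks_per_sm(int block_size)
{
#if CUDART_VERSION >= 6050   // assumption: the API first appeared in CUDA 6.5
    int blocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, demo_kernel, block_size, 0 /* dynamic shared memory bytes */);
    return err == cudaSuccess ? blocks : -1;
#else
    (void)block_size;
    return -1;
#endif
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device found\n");
        return 1;
    }

    const int block_size = 256;
    int blocks = max_blocks_per_sm(block_size);
    if (blocks < 0) {
        printf("occupancy query unavailable (runtime headers report version %d)\n",
               CUDART_VERSION);
        return 1;
    }

    // Convert resident blocks to resident warps and compare with the device limit.
    int active_warps = blocks * block_size / prop.warpSize;
    int max_warps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("%d blocks/SM at %d threads/block: %d of %d warps (%.1f%% occupancy)\n",
           blocks, block_size, active_warps, max_warps,
           100.0 * active_warps / max_warps);
    return 0;
}
```

Running nvcc --version shows which toolkit is actually being picked up, which is worth checking when several CUDA installs coexist on one machine.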
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.