mersenneforum.org mfaktc: a CUDA program for Mersenne prefactoring

 2009-12-24, 02:04 #23 jasonp Tribal Bullet     Oct 2004 3534₁₀ Posts Yes, there are patches to Nvidia's version of the Open64 compiler that allow high-half multiplies and also the add and subtract instructions that generate and consume carry bits. I wasn't able to build the compiler using either MinGW or Cygwin, even though the source specifically has build directories for those. Given the deafening silence on the Nvidia forums, probably nobody else is able to do it either :) Have you tried building the compiler on Linux? If all you do is generate the PTX code but not run it, then you don't need any of the other GPU infrastructure on the machine...
2009-12-24, 10:19   #24
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

3×3,529 Posts

Quote:
 Originally Posted by TheJudger The "trick" is to hack the ptx code (let's say it is like assembler code on a CPU) and replace one instruction. The nvidia compiler has no intrinsic for [u]mul24hi while it exists in the ptx code. (The 24x24 multiply is faster, as mentioned before.) Bad news #1: The "ptx hack" is ugly!!! I have to check some compilers... There is a patch to enable some more intrinsics but I was not able to build the compiler. :(
Could you post details of your __[u]mul24hi() trick please? If you prefer, PM or email will be just as good but posting here will aid other CUDA programmers too.

I've seen the nvidia forum postings and the alleged patches to nvcc but I've never managed to get it working either. It will be very useful in some code I'm writing which, at present, has to use __umulhi() and nasty shifting and masking.

Thanks,
Paul

2009-12-24, 14:38   #25
TheJudger

"Oliver"
Mar 2005
Germany

457₁₆ Posts

Quote:
 Originally Posted by axn I am assuming you have a preliminary sieve to get the TF candidates. Why not just lower the sieve limit for that one? In fact, the ideal scenario would involve the program benchmarking at runtime to pick the optimal sieve limit.
This won't maximize the throughput of the machine.
And for the next generation GPU "Fermi" I might have to force the sieve to sieve only up to 17 or so :(

 2009-12-24, 15:00 #26 TheJudger     "Oliver" Mar 2005 Germany 11·101 Posts

Jason/Paul: did you check other CUDA compilers? E.g. PGI advertises their compiler as CUDA-capable.

Jason: yep, I tried on Linux and failed. (Actually I'm developing my code under Linux.)

Paul: for sure, here we go (hopefully you're familiar with bash):

My code contained only a single __umulhi(). Since the device functions are always inlined, it appears several times in the ptx code.

Step #1 (just for safety):
- comment out that __umulhi()
- add "--keep" to the nvcc command line (this will generate a lot of files, so do it in a separate subdirectory) and compile the code
- check that there is no "mul.hi.u32" in the ptx code
- comment the __umulhi() back in

Step #2:
- add "--dry-run" to the nvcc command line and compile the code (it won't actually compile). This shows you the commands issued by nvcc. Write down the commands issued after the ptx-code generation.

Step #3:
- compile your code with the --keep option again
- modify the ptx file (search & replace mul.hi.u32 with mul24.hi.u32)
- run the commands which you wrote down in step #2

My script used for compiling without the ptx hack:
Code:
#!/bin/bash -xe
rm -f sieve.o main.o main.exe
gcc -Wall -O2 -c sieve.c -o sieve.o
nvcc -c main.cu -o main.o -I /opt/cuda/include/ --ptxas-options=-v
gcc -fPIC -o main.exe sieve.o main.o -L/opt/cuda/lib64/ -lcudart

And now with the ptx hack:
Code:
#!/bin/bash
mkdir compile_bla_bla
cd compile_bla_bla
gcc -Wall -O2 -c ../sieve.c -o sieve.o
nvcc -c ../main.cu -o main.o -I /opt/cuda/include/ --ptxas-options=-v --keep
cat main.ptx | sed s/mul\.hi\.u32/mul24\.hi\.u32/ > main.ptx.new
mv main.ptx main.ptx.old
mv main.ptx.new main.ptx
rm -f main.sm_10.cubin main.cu.cpp main.o
ptxas --key="xxxxxxxxxx" -arch=sm_10 -v "main.ptx" -o "main.sm_10.cubin"
fatbin --key="xxxxxxxxxx" --source-name="../main.cu" --usage-mode="-v " --embedded-fatbin="main.fatbin.c" "--image=profile=sm_10,file=main.sm_10.cubin" "--image=profile=compute_10,file=main.ptx"
cudafe++ --m64 --gnu_version=40302 --diag_error=host_device_limited_call --diag_error=ms_asm_decl_not_allowed --parse_templates --gen_c_file_name "main.cudafe1.cpp" --stub_file_name "main.cudafe1.stub.c" --stub_header_file_name "main.cudafe1.stub.h" "main.cpp1.ii"
gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart" -I. -I"/opt/cuda/include/" -m64 -o "main.cu.cpp" "main.cudafe1.cpp"
gcc -c -x c++ "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart" -I. -I"/opt/cuda/include/" -m64 -o "main.o" "main.cu.cpp"
gcc -fPIC -o ../main.exe sieve.o main.o -L/opt/cuda/lib64/ -lcudart
cd ..
rm compile_bla_bla -rf

If you want to replace only some of your __[u]mulhi() with __[u]mul24hi() then it will be a bit more complicated. :( And of course the build script uses some system-specific paths...

Before you replace the instruction, remember the different behaviour: __[u]mulhi() returns bits 32 to 63 of the product while __[u]mul24hi() returns bits 16 to 47.

Last fiddled with by TheJudger on 2009-12-24 at 15:09
2009-12-24, 15:09   #27
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

3×3,529 Posts

Quote:
 Originally Posted by TheJudger Bad news #2: My siever is too slow. Before the latest optimisation, a single core of a Core 2 running at 3GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow because the GPU code got faster. I have to think about the possibilities:
(1) speed up the siever by writing better code (I'm not sure I can do this). If "Fermi" is only twice as fast as the GT200 chip (since it has roughly double the number of shaders) and has no other improvements, I need to speed up the siever by another factor of 2.
(2) write a multithreaded siever. I think I can do this but I'm not really happy with this solution.
(3) put the siever on the GPU. I'm not sure whether this would work...
(4) newer GPUs are capable of running several "kernels" at the same time. With some modifications to the code it should be possible to have several instances of the application running at the same time. If the GPU is too fast for one CPU core, just start another test on a different exponent on a 2nd core, ...
Personally I prefer (4). Any comments?
I'd try a variant of (3) as follows:

Write a siever in CUDA and run it on the GPU (no TF, in other words) until you have at least a few hundred megabytes of sieved results. You would use compact storage, obviously. Something like this would work: all factors are of the form 2kp+1, so store only the k values, and then only as deltas from the previous value. The factors also form 16 "obvious" residue classes (something exploited by prime95 since the very early days), so store 16 such lists of deltas. I don't know whether an unsigned char is enough to store the deltas, but an unsigned short surely would be. The results would be stored on disk or in cpu RAM as appropriate. Then you can feed the results into a separate TF kernel in a separate cpu thread.

Paul

2009-12-24, 15:22   #28
ldesnogu

Jan 2008
France

3·181 Posts

Quote:
 Originally Posted by TheJudger My siever is too slow. Before the latest optimisation, a single core of a Core 2 running at 3GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow because the GPU code got faster. I have to think about the possibilities: (1) speed up the siever by writing better code (I'm not sure I can do this). If "Fermi" is only twice as fast as the GT200 chip (since it has roughly double the number of shaders) and has no other improvements, I need to speed up the siever by another factor of 2.
Ernst has some code in Mlucas for trial factoring in factor.c. You could perhaps steal some ideas?

Last fiddled with by ldesnogu on 2009-12-24 at 15:22

 2009-12-24, 15:23 #29 TheJudger     "Oliver" Mar 2005 Germany 11·101 Posts Execept that I'm doing the sieve on the CPU thats more or less the way I'm doing the sieving. I generate a list of 2^20 k's (in the same class) at once and transfer them to the GPU. The k's are stored as uint32 k_ref_hi, uint32 k_ref_lo and uint32 *k_delta. The deltas are relative to k_ref, NOT to the previous k_delta (think parallel ;)) For 2^20 k's and sieving the first 4000 odd primes the k_delta grows above 300.000.000 so a short surely doesn't fit. This sieve is segmented and so small that it fits into the L1-cache of the CPU. From my feeling sieving doesn't fit well on CUDA.
 2010-01-02, 00:06 #30 TheJudger     "Oliver" Mar 2005 Germany 10001010111₂ Posts

Hi and a happy new year!

GTX 275, Core 2 Duo overclocked to 4GHz, sieving up to 37831 (4000th odd prime):

One process on the CPU:
M66362159 TF from 2^64 to 2^65: 180s (siever still too slow)

Two processes on the CPU at the same time:
M66362159 TF from 2^64 to 2^65: 279s
M66362189 TF from 2^64 to 2^65: 279s

(Using two CPU cores easily keeps the GPU busy all the time. A 2.66GHz Core 2 Duo should be fine for a GTX 275 with the current code.)

There are some compile-time options which run fine on newer GPUs but won't work on older ones (e.g. asynchronous memory transfers aren't supported on G80 chips). Therefore I need to add some checks to the code for whether the current GPU is capable or not.

I had a horribly stupid "bug" during my attempt to make the code capable of running multiple CPU processes concurrently on one GPU: CUDA source files are usually named "*.cu". My favorite editor doesn't know ".cu" files (for syntax highlighting), so I was lazy and made symlinks "*.c" -> "*.cu". This worked fine until I copied the files with scp (secure copy, openssh): after copying, the "*.c" files were real copies of the "*.cu" files. So I edited the "*.c" files and compiled the "*.cu" files..... It took some time to figure out why code changes didn't make any difference...
2010-01-02, 00:33   #31
TheJudger

"Oliver"
Mar 2005
Germany

11·101 Posts

Quote:
 Originally Posted by axn I am assuming you have a preliminary sieve to get the TF candidates. Why not just lower the sieve limit for that one? In fact, the ideal scenario would involve the program benchmarking at runtime to pick the optimal sieve limit.
Some more details on this topic:

Generating a candidate list (2^20 candidates)
Code:
Sieve limit                | time | number of raw candidates
31 (10th odd prime)        |  8ms | ~1.83M
547 (100th odd prime)      | 16ms | ~3.16M
7927 (1000th odd prime)    | 21ms | ~4.48M
104743 (10000th odd prime) | 28ms | ~5.77M
(hopefully no typo in this list...)

"Raw candidates" is the average number of candidates before sieving needed to produce a list of 2^20 candidates after sieving.
As you can see, the runtime of the siever depends mostly on the number of raw candidates.

Generating a list of 2^20 candidates with sieving up to 104743 takes ~3.5 times longer than generating a list of the same size with sieving up to 31. BUT compare the number of raw candidates: sieving up to 104743 covers 5.77/1.83 = 3.15 times as many raw candidates!

So lowering the sieve limit will help to keep the GPU busy, BUT it won't increase the throughput much. It will mostly generate more heat on the GPU and burn electricity.

Last fiddled with by TheJudger on 2010-01-02 at 00:34

2010-01-02, 23:59   #32
msft

Jul 2009
Tokyo

1142₈ Posts

Happy new year, TheJudger!
Quote:
 Originally Posted by TheJudger (using two CPUs core easily keep the GPU busy all the time. A 2.66GHz Core 2 Duo should be fine for a GTX 275 (with the current code))
Why use the CPU?
Is something slow on the GPU?

 2010-01-03, 01:43 #33 TheJudger     "Oliver" Mar 2005 Germany 11·101 Posts

Hello msft,

I'm sceptical about sieving on GPUs. I would be happy if somebody proves me wrong on that topic. ;)

- Sieving in one big array with all GPU threads (each GPU thread handles different small primes).
Problem: this needs global memory for the array. When using one char per candidate I need only writes. When using one bit per candidate I need reads and writes (on global memory!). This should be very slow.

- Each thread sieves a small segment.
Problem: the starting offsets need to be calculated much more often. For this I use a modified Euclidean algorithm, which needs ifs. Divergent code paths break parallel execution as far as I know. :(
Problem: the load might get imbalanced between threads if the segments are too small. If the segments are bigger (to minimize the effect of imbalanced threads), the kernel runtimes on the GPU get LONG, which stalls the GUI (if there is one).
