#23
Tribal Bullet
Oct 2004
3534₁₀ Posts
Yes, there are patches to Nvidia's version of the Open64 compiler that allow high-half multiplies, and also the add and subtract instructions that generate and consume carry bits. I wasn't able to build the compiler using either MinGW or Cygwin, even though the source specifically has build directories for those. Given the deafening silence from the Nvidia forums, probably nobody else is able to do it either :)

Have you tried building the compiler in Linux? If all you do is generate the PTX code but not run it, then you don't need any of the other GPU infrastructure on the machine...
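For readers wondering what those instructions buy you: a minimal sketch, assuming nvcc's inline-PTX support, of emitting the carry chain by hand (the function name and operand layout are illustrative, not taken from the patches):

Code:
// Hypothetical example: add two 64-bit values held as 32-bit halves using
// the PTX carry instructions that the patched compiler would generate.
__device__ void add_64(unsigned int *rlo, unsigned int *rhi,
                       unsigned int alo, unsigned int ahi,
                       unsigned int blo, unsigned int bhi)
{
    unsigned int lo, hi;
    asm("add.cc.u32 %0, %2, %4;\n\t"  // low word: generates the carry bit
        "addc.u32   %1, %3, %5;\n\t"  // high word: consumes the carry bit
        : "=r"(lo), "=r"(hi)
        : "r"(alo), "r"(ahi), "r"(blo), "r"(bhi));
    *rlo = lo;
    *rhi = hi;
}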
#24
Bamboozled!
"πΊππ·π·π"
May 2003
Down not across
3×3,529 Posts
Quote:
I've seen the Nvidia forum postings and the alleged patches to nvcc, but I've never managed to get it working either. It would be very useful in some code I'm writing which, at present, has to use __umulhi() and nasty shifting and masking.

Thanks,
Paul
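For context, a minimal sketch of the workaround Paul describes (the helper name mul_full_64 is hypothetical): __umulhi() only yields the high 32 bits of a 32x32-bit product, so the full 64-bit result must be assembled by hand, and splitting operands into smaller limbs is where the shifting and masking come in.

Code:
// Hypothetical helper: assemble the full 64-bit product from the two
// 32-bit halves that the hardware exposes separately.
__device__ unsigned long long mul_full_64(unsigned int a, unsigned int b)
{
    unsigned int lo = a * b;           // low 32 bits (wraps by design)
    unsigned int hi = __umulhi(a, b);  // high 32 bits
    return ((unsigned long long)hi << 32) | lo;
}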
#25
"Oliver"
Mar 2005
Germany
457₁₆ Posts
Quote:
And for the next-generation GPU "Fermi" I might have to force the sieve to go only up to 17 or so :(
#26
"Oliver"
Mar 2005
Germany
11·101 Posts
Jason/Paul: did you check other CUDA compilers?
E.g. PGI advertises their compiler as CUDA-capable.

Jason: yep, I tried on Linux and failed. (Actually I'm developing my code under Linux.)

Paul: sure, here we go (hopefully you're familiar with bash):

My code contained only a single __umulhi(). Since device functions are always inlined, it appears several times in the PTX code.

Step #1 (just for safety):
- comment out that __umulhi()
- add "--keep" to the nvcc command line (this will generate a lot of files, so do it in a separate subdirectory) and compile the code
- check that there is no "mul.hi.u32" in the PTX code
- comment the __umulhi() back in

Step #2:
- add "--dry-run" to the nvcc command line and compile the code (actually it won't compile). This shows you the commands issued by nvcc. Write down the commands issued after the PTX-code generation.

Step #3:
- compile your code with the --keep option again
- modify the PTX file (search & replace mul.hi.u32 with mul24.hi.u32)
- run the commands you wrote down in step #2

My script used for compiling without the PTX hack:

Code:
#!/bin/bash -xe
rm -f sieve.o main.o main.exe
gcc -Wall -O2 -c sieve.c -o sieve.o
nvcc -c main.cu -o main.o -I /opt/cuda/include/ --ptxas-options=-v
gcc -fPIC -o main.exe sieve.o main.o -L/opt/cuda/lib64/ -lcudart

And the script with the PTX hack applied:

Code:
#!/bin/bash
mkdir compile_bla_bla
cd compile_bla_bla
gcc -Wall -O2 -c ../sieve.c -o sieve.o
nvcc -c ../main.cu -o main.o -I /opt/cuda/include/ --ptxas-options=-v --keep
cat main.ptx | sed s/mul\.hi\.u32/mul24\.hi\.u32/ > main.ptx.new
mv main.ptx main.ptx.old
mv main.ptx.new main.ptx
rm -f main.sm_10.cubin main.cu.cpp main.o
ptxas --key="xxxxxxxxxx" -arch=sm_10 -v "main.ptx" -o "main.sm_10.cubin"
fatbin --key="xxxxxxxxxx" --source-name="../main.cu" --usage-mode="-v " --embedded-fatbin="main.fatbin.c" "--image=profile=sm_10,file=main.sm_10.cubin" "--image=profile=compute_10,file=main.ptx"
cudafe++ --m64 --gnu_version=40302 --diag_error=host_device_limited_call --diag_error=ms_asm_decl_not_allowed --parse_templates --gen_c_file_name "main.cudafe1.cpp" --stub_file_name "main.cudafe1.stub.c" --stub_header_file_name "main.cudafe1.stub.h" "main.cpp1.ii"
gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart" -I. -I"/opt/cuda/include/" -m64 -o "main.cu.cpp" "main.cudafe1.cpp"
gcc -c -x c++ "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart" -I. -I"/opt/cuda/include/" -m64 -o "main.o" "main.cu.cpp"
gcc -fPIC -o ../main.exe sieve.o main.o -L/opt/cuda/lib64/ -lcudart
cd ..
rm compile_bla_bla -rf

And of course the build script uses some system-specific paths...

Before you replace the instruction, remember the different behaviour: __[u]mulhi() returns bits 32 to 63 of the 64-bit product, while mul24.hi returns bits 16 to 47 of the 48-bit product.

Last fiddled with by TheJudger on 2009-12-24 at 15:09
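To make that difference concrete, a hedged illustration (not from the post; plain CUDA C exposes no mul24-high intrinsic, so the 24-bit variant is reconstructed here with a 64-bit multiply):

Code:
// Both functions take 32-bit operands but slice different windows out of
// the product. The mul24 form is only a valid substitute when both
// operands fit in 24 bits and the shifted window is accounted for.
__device__ unsigned int hi_mul32(unsigned int a, unsigned int b)
{
    return __umulhi(a, b);            // bits 32..63 of the 64-bit product
}

__device__ unsigned int hi_mul24(unsigned int a, unsigned int b)
{
    unsigned long long p = (unsigned long long)(a & 0xFFFFFF)
                         * (unsigned long long)(b & 0xFFFFFF);
    return (unsigned int)(p >> 16);   // bits 16..47 of the 48-bit product
}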
#27
Bamboozled!
"πΊππ·π·π"
May 2003
Down not across
3×3,529 Posts
Quote:
Write a siever in CUDA and run it on the GPU (no TF, in other words) until you have at least a few hundred megabytes of sieved results. You would use compact storage, obviously. Something like this would work: all factors are of the form 2kp+1, so store only the k values, and store them only as deltas from the previous value. The factors also form 16 "obvious" residue classes (something exploited by prime95 since the very early days), so store 16 such lists of deltas. I don't know whether an unsigned char is enough to store the deltas, but an unsigned short surely would be. The results would be stored on disk or in CPU RAM as appropriate. Then you can feed the results into a separate TF kernel in a separate CPU thread.

Paul
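A minimal sketch of that storage scheme (all names are hypothetical, and the class selection is simplified to k mod 16 for illustration; the lists are large, so allocate them on the heap):

Code:
#define NUM_CLASSES 16
#define LIST_CAP    (1 << 20)

// One delta list per residue class; each delta is relative to the
// previous k stored in the same class, as suggested above.
typedef struct {
    unsigned long long last_k;
    unsigned short deltas[LIST_CAP];
    unsigned int count;
} klist;

// Append a sieved k to its class list; returns 0 if the gap between
// consecutive survivors no longer fits in an unsigned short.
int store_k(klist *lists, unsigned long long k)
{
    klist *l = &lists[k % NUM_CLASSES];
    unsigned long long d = k - l->last_k;
    if (d > 0xFFFF || l->count == LIST_CAP)
        return 0;
    l->deltas[l->count++] = (unsigned short)d;
    l->last_k = k;
    return 1;
}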
#28
Jan 2008
France
3·181 Posts
Quote:
EDIT: A link could help: http://hogranch.com/mayer/README.html

Last fiddled with by ldesnogu on 2009-12-24 at 15:22
#29
"Oliver"
Mar 2005
Germany
11·101 Posts
Except that I'm doing the sieve on the CPU, that's more or less the way I'm doing the sieving.

I generate a list of 2^20 k's (in the same class) at once and transfer them to the GPU. The k's are stored as uint32 k_ref_hi, uint32 k_ref_lo and uint32 *k_delta. The deltas are relative to k_ref, NOT to the previous k_delta (think parallel ;)). For 2^20 k's and sieving the first 4000 odd primes, k_delta grows above 300,000,000, so a short surely doesn't fit. The sieve is segmented and so small that it fits into the L1 cache of the CPU.

My feeling is that sieving doesn't fit well on CUDA.
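A minimal sketch of that layout on the GPU side (the kernel body is illustrative, not TheJudger's code; the parameter names follow the post). Because every delta is relative to the shared k_ref rather than to the previous entry, each thread reconstructs its own k independently, with no serial dependency between neighbours:

Code:
__global__ void tf_kernel(unsigned int k_ref_hi, unsigned int k_ref_lo,
                          const unsigned int *k_delta)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // assemble the 64-bit reference and add this thread's own delta
    unsigned long long k = ((unsigned long long)k_ref_hi << 32) | k_ref_lo;
    k += k_delta[i];

    // ... trial factoring of 2*k*p+1 would go here ...
}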
#30
"Oliver"
Mar 2005
Germany
10001010111₂ Posts
Hi and a happy new year!
GTX 275, Core 2 Duo overclocked to 4GHz, sieving up to 37831 (4000th odd prime):

One process on the CPU:
M66362159 TF from 2^64 to 2^65: 180s (siever still too slow)

Two processes on the CPU at the same time:
M66362159 TF from 2^64 to 2^65: 279s
M66362189 TF from 2^64 to 2^65: 279s

(Using two CPU cores it is easy to keep the GPU busy all the time. A 2.66GHz Core 2 Duo should be fine for a GTX 275 with the current code.)

There are some compile-time options which run fine on newer GPUs but won't work on older ones (e.g. asynchronous memory transfers aren't supported on G80 chips). Therefore I need to add checks to the code for whether the current GPU is capable or not (a sketch follows at the end of this post).

I had a horribly stupid "bug" during my attempt to make the code capable of running multiple CPU processes concurrently on one GPU: CUDA source files are usually named "*.cu". My favourite editor doesn't know ".cu" files (for syntax highlighting), so I was lazy and made symlinks "*.c" -> "*.cu". This worked fine until I copied the files with scp (secure copy, OpenSSH): after the copy, the "*.c" files were plain copies of the "*.cu" files. So I was editing the "*.c" files and compiling the "*.cu" files... It took some time to figure out why code changes didn't make any difference...
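A minimal sketch of such a capability check (not TheJudger's actual code), using the CUDA runtime API; G80-class chips report 0 for the deviceOverlap property that asynchronous transfers rely on:

Code:
#include <cuda_runtime.h>
#include <stdio.h>

/* Returns nonzero if the device can overlap kernel execution with
   asynchronous memory transfers; G80 chips cannot. */
int gpu_supports_async_copy(int device)
{
    struct cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return 0;
    printf("GPU: %s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);
    return prop.deviceOverlap;
}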
#31
"Oliver"
Mar 2005
Germany
11·101 Posts
Quote:
Generating a candidate list (2^20 candidates) Code:
Sieve limit                | time | raw candidates
31     (10th odd prime)    |  8ms | ~1.83M
547    (100th odd prime)   | 16ms | ~3.16M
7927   (1000th odd prime)  | 21ms | ~4.48M
104743 (10000th odd prime) | 28ms | ~5.77M

"Raw candidates" is the number of candidates needed before sieving in order to generate a list of 2^20 candidates after sieving (on average).

As you can see, the runtime of the siever depends mostly on the number of raw candidates. Generating a list of 2^20 candidates while sieving up to 104743 takes ~3.5 times longer than generating a list of the same size while sieving up to 31. BUT compare the number of raw candidates: sieving up to 104743 covers 5.77/1.83 = 3.15 times as many raw candidates! So lowering the sieve limit will help to keep the GPU busy, BUT it won't increase throughput much. It will just generate more heat on the GPU and burn electricity.

Last fiddled with by TheJudger on 2010-01-02 at 00:34
#32
Jul 2009
Tokyo
1142₈ Posts
#33
"Oliver"
Mar 2005
Germany
11·101 Posts
Hello msft,
I'm sceptical about sieving on GPUs. I would be happy if somebody proved me wrong on that topic. ;)

- sieving in one big array with all GPU threads (each GPU thread handles different small primes)
Problem: this needs global memory for the array. When using one char per candidate I need only writes. When using one bit per candidate I need reads and writes (on global memory!). This should be very slow.

- each thread sieves a small segment (see the sketch after this list)
Problem: the starting offsets need to be calculated much more often. For this I use a modified Euclidean algorithm, which needs ifs. Different code paths will break parallel execution, as far as I know. :(
Problem: it might get imbalanced between threads if the segments are too small. If the segments are bigger (to minimize the effect of imbalanced threads), the kernel runtimes on the GPU get LONG, which stalls the GUI (if there is one).
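A hedged sketch of the second approach (all names and sizes are assumptions; the post's modified Euclidean algorithm is replaced by a plain 64-bit modulo for clarity):

Code:
#define SEG_LEN 4096  /* candidates per thread segment (assumed) */

__global__ void sieve_segments(unsigned char *sieve,
                               const unsigned int *primes, int nprimes,
                               unsigned long long k0)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long seg_start = k0 + (unsigned long long)tid * SEG_LEN;
    unsigned char *seg = sieve + (unsigned long long)tid * SEG_LEN;

    for (int i = 0; i < nprimes; i++) {
        unsigned int p = primes[i];
        /* offset of the first multiple of p inside this segment --
           the per-prime, per-thread work described above */
        unsigned int off = (unsigned int)((p - seg_start % p) % p);
        for (unsigned int j = off; j < SEG_LEN; j += p)
            seg[j] = 1;  /* write-only marking, one char per candidate */
    }
}

Even in this toy form the imbalance is visible: threads whose segments catch different numbers of multiples do different amounts of work per prime.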
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
mfakto: an OpenCL program for Mersenne prefactoring | Bdot | GPU Computing | 1668 | 2020-12-22 15:38 |
The P-1 factoring CUDA program | firejuggler | GPU Computing | 753 | 2020-12-12 18:07 |
gr-mfaktc: a CUDA program for generalized repunits prefactoring | MrRepunit | GPU Computing | 32 | 2020-11-11 19:56 |
mfaktc 0.21 - CUDA runtime wrong | keisentraut | Software | 2 | 2020-08-18 07:03 |
World's second-dumbest CUDA program | fivemack | Programming | 112 | 2015-02-12 22:51 |