#397 | |
Jul 2009
Tokyo
26216 Posts |
Hi, Karl M Johnson
Quote:
Code:
msft@ubuntu:~$ ./CUDALucas -c1000 33333333 &
[1] 14609
msft@ubuntu:~$ Iteration 10000 M( 33333333 )C, 0xd717246f501c7d94, n = 2097152, CUDALucas v1.0
msft@ubuntu:~$ kill 14609
msft@ubuntu:~$ [1]+ Done ./CUDALucas -c1000 33333333
msft@ubuntu:~$ ./CUDALucas -c1000 c33333333
caso 2
Iteration 20000 M( 33333333 )C, 0x7f036ff2b230121b, n = 2097152, CUDALucas v1.0 |
#398 |
Mar 2010
3·137 Posts |
Works! Thanks! |
#399 |
Jun 2005
3·43 Posts |
Small update. Windows will not rename a file if it is open. This happens in CUDALucas if you run from a checkpoint file - the c24603451 file will remain open. All this means is that CUDALucas won't be able to back up the old checkpoint before updating it. It's not a significant problem - I only noticed it because I needed both the current and backup files to time execution speed.
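The rename problem described above can be sketched in a few lines of C. This is only an illustration, not the code from the attached archive: `backup_checkpoint` and its parameter names are hypothetical. It assumes the checkpoint was opened with `fopen`, and it closes that stream before renaming, because Windows refuses to rename a file that is still open.

```c
#include <stdio.h>

/* Hypothetical helper, not taken from the CUDALucas source.
 * Closes the checkpoint stream (if one is open) and then moves the
 * checkpoint file to the backup name.
 * Returns 0 on success, -1 on failure. */
int backup_checkpoint(FILE *open_ckpt, const char *ckpt_name,
                      const char *backup_name)
{
    /* Windows will not rename a file that is still open, so close first. */
    if (open_ckpt != NULL && fclose(open_ckpt) != 0)
        return -1;

    /* On Windows rename() also fails when the target already exists, so
     * drop any older backup. A failure here (for example, no backup yet)
     * is not treated as an error. */
    remove(backup_name);

    return rename(ckpt_name, backup_name) == 0 ? 0 : -1;
}
```

After this call the program would write the new checkpoint under `ckpt_name`, leaving the previous state available under `backup_name`.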
In any case, I've fixed the problem and included it here. I've also included cudart64_31_9.dll in the archive so it should be everything you need to build and/or run in one shot.
cudalucas.1.0a.winx64.zip
Run times on my factory overclocked GTX 275, along with some rough run times for current work assignments. I know these aren't the most efficient use of the code but it's a good basis for comparison to a CPU.
8.96 msec/iter @ 2M FFT (~ 2.5 days for a 25M LL double check)
18.8 msec/iter @ 4M FFT (~ 11 days for a 47M LL first time run)
Not sure how that compares to Linux versions, but it's definitely fast enough to be useful. |
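For anyone who wants to redo the rough run-time estimates above for a different exponent, here is a minimal sketch of the arithmetic. The function name `ll_days` is made up for this example, and it assumes one iteration per bit of the exponent (an LL test of M(p) needs p - 2 squarings, which is close enough for an estimate).

```c
/* Rough wall-clock estimate for a Lucas-Lehmer test, in days.
 * Assumption: about one iteration per bit of the exponent. */
double ll_days(double exponent, double msec_per_iter)
{
    const double msec_per_day = 24.0 * 60.0 * 60.0 * 1000.0;
    return exponent * msec_per_iter / msec_per_day;
}
```

With the figures from this post, `ll_days(25e6, 8.96)` comes out at about 2.6 days and `ll_days(47e6, 18.8)` at about 10.2 days, in line with the rough numbers quoted.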
#400 |
P90 years forever!
Aug 2002
Yeehaw, FL
2²×3×659 Posts |
Does Nvidia's license let you include cufft64_31_9.dll? If so, please include that in your zip file so that it truly includes everything you need to run cudalucas.
Last fiddled with by Prime95 on 2011-02-04 at 03:06 |
#401 |
Jul 2009
Tokyo
2×5×61 Posts |
Code:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cutil_inline.h> /* cutilSafeCall / cufftSafeCall from the CUDA SDK */

int main()
{
    cufftHandle plan;
    cudaEvent_t start, stop;
    double *x;
    double *g_x;
    int i, j, imax;

    imax = 1024*1024*4;
    /* Allocate 2*imax doubles: the in-place Z2Z runs below treat the buffer
       as up to 3M double-complex values, which needs 6M doubles. */
    cutilSafeCall(cudaMalloc((void**)&g_x, sizeof(double)*imax*2));
    x = ((double *)malloc(sizeof(double)*imax*2));
    for(i=0;i<imax*2;i++)x[i]=0;
    cutilSafeCall(cudaMemcpy(g_x, x, sizeof(double)*imax*2, cudaMemcpyHostToDevice));
    cutilSafeCall( cudaEventCreate(&start) );
    cutilSafeCall( cudaEventCreate(&stop) );

    /* Complex-to-complex transforms, lengths in steps of 1M */
    for(j=1024*1024;j<imax;j+=1024*1024)
    {
        cufftSafeCall(cufftPlan1d(&plan, j, CUFFT_Z2Z, 1));
        /* one untimed warm-up call */
        cufftSafeCall(cufftExecZ2Z(plan,(cufftDoubleComplex *)g_x,(cufftDoubleComplex *)g_x, CUFFT_INVERSE));
        cutilSafeCall( cudaEventRecord(start, 0) );
        for(i=0;i<10;i++)
            cufftSafeCall(cufftExecZ2Z(plan,(cufftDoubleComplex *)g_x,(cufftDoubleComplex *)g_x, CUFFT_INVERSE));
        cutilSafeCall( cudaEventRecord(stop, 0) );
        cutilSafeCall( cudaEventSynchronize(stop) );
        float outerTime;
        cutilSafeCall( cudaEventElapsedTime(&outerTime, start, stop) );
        printf("CUFFT_Z2Z size=%d k time=%f msec\n",j/1024,outerTime/10);
        cufftSafeCall(cufftDestroy(plan));
    }

    /* Real-to-complex transforms, lengths in steps of 256k */
    for(j=1024*1024;j<imax;j+=256*1024)
    {
        cufftSafeCall(cufftPlan1d(&plan, j, CUFFT_D2Z, 1));
        /* one untimed warm-up call */
        cufftSafeCall(cufftExecD2Z(plan,g_x,(cufftDoubleComplex *)g_x));
        cutilSafeCall( cudaEventRecord(start, 0) );
        for(i=0;i<10;i++)
            cufftSafeCall(cufftExecD2Z(plan,g_x,(cufftDoubleComplex *)g_x));
        cutilSafeCall( cudaEventRecord(stop, 0) );
        cutilSafeCall( cudaEventSynchronize(stop) );
        float outerTime;
        cutilSafeCall( cudaEventElapsedTime(&outerTime, start, stop) );
        printf("CUFFT_D2Z size=%d k time=%f msec\n",j/1024,outerTime/10);
        cufftSafeCall(cufftDestroy(plan));
    }

    cutilSafeCall(cudaFree((char *)g_x));
    free(x);
    cutilSafeCall( cudaEventDestroy(start) );
    cutilSafeCall( cudaEventDestroy(stop) );
    return 0;
}

CUFFT_Z2Z size=2048 k time=6.288720 msec
CUFFT_Z2Z size=3072 k time=10.626810 msec
CUFFT_D2Z size=1024 k time=1.947040 msec
CUFFT_D2Z size=1280 k time=2.580678 msec
CUFFT_D2Z size=1536 k time=3.186858 msec
CUFFT_D2Z size=1792 k time=3.640893 msec
CUFFT_D2Z size=2048 k time=4.063977 msec
CUFFT_D2Z size=2304 k time=4.664579 msec
CUFFT_D2Z size=2560 k time=5.340890 msec
CUFFT_D2Z size=2816 k time=76.725174 msec
CUFFT_D2Z size=3072 k time=6.547805 msec
CUFFT_D2Z size=3328 k time=98.685196 msec
CUFFT_D2Z size=3584 k time=7.542326 msec
CUFFT_D2Z size=3840 k time=8.636828 msec

Non-power-of-2 lengths are an improvement, but not enough. |
#402 | |
Banned
"Luigi"
Aug 2002
Team Italia
3×5×17×19 Posts |
Quote:
Luigi |
#403 |
Jul 2009
Tokyo
2×5×61 Posts |
#404 |
Dec 2010
23 Posts |
![]()
There are wide variations in the run times of similar-sized transforms, depending on their factorization: CUFFT (CUDA 3.2) on Fermi supports 2^a * 3^b * 5^c * 7^d transforms, with pure powers of 2 and 3 being pretty fast, but powers of 5 noticeably slower, and products of powers giving orders-of-magnitude differences in runtimes, some quite good, some horrible, depending on which bases and which powers.
I've got tabulations of runtimes based on a complete search of [a, b, c, d] values giving FFTs of length between 2^18 and 2^24. I use them (manually, at present) to pick LL run lengths, and will be correlating them with the maximum word sizes for LL that give acceptable errors. That is more to understand convolution errors with balanced integers than for the Mersenne stuff per se. Just one more thing I should put online for anyone who's interested. |
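As a small illustration of the factorization point, the sketch below checks whether a length has the form 2^a * 3^b * 5^c * 7^d. The helper names `is_7_smooth` and `next_7_smooth` are invented for this example and are not part of CUFFT or CUDALucas. Applied to the D2Z timings posted earlier in the thread, the two slow sizes are exactly the ones that fail the check: 2816k = 2^18 * 11 and 3328k = 2^18 * 13, while every other size listed there has only the factors 2, 3, 5 and 7.

```c
/* Returns 1 if n = 2^a * 3^b * 5^c * 7^d for some a, b, c, d >= 0,
 * otherwise 0. Hypothetical helper for illustration only. */
int is_7_smooth(unsigned long n)
{
    static const unsigned long primes[] = { 2, 3, 5, 7 };
    int i;

    if (n == 0)
        return 0;
    for (i = 0; i < 4; i++)
        while (n % primes[i] == 0)
            n /= primes[i];
    return n == 1; /* anything left over is a prime factor above 7 */
}

/* Smallest length >= n that has only the factors 2, 3, 5 and 7. */
unsigned long next_7_smooth(unsigned long n)
{
    while (!is_7_smooth(n))
        n++;
    return n;
}
```

A length that passes this check is not automatically fast, as the post above explains; the check only rules out the sizes that contain larger prime factors.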
#405 |
Jun 2005
3·43 Posts |
Another update for the Windows version. It looks like the previous version I posted will end up in an infinite loop once a test finishes. The result will be printed to mersarch.txt, but not to the console.
There should be no problem using this version to complete a test started using one of the other Windows versions. I've included a fix in the source and executable attached to this post. I didn't add cufft64_31_9.dll because that file is something like 5MB compressed - it's too large to fit as an attachment.
cudalucas.1.0b.winx64.zip |
#407 |
Jan 2011
Dudley, MA, USA
73 Posts |
There seem to be a couple of upper limits to this right now. I tried running higher exponents and got a couple of different errors:
#CUDALucas 151150000
err = 0.353794, increasing n from 8388608
CUDALucas.cu(534) : cufftSafeCall() CUFFT error.
I'm guessing it's because of: "The cuFFT manual states that 1-D ffts are supported for < 8 million elements."
The other is at exponents around 318750000, where I hit the memory limit on my 768MB card. At 336000000, it wants over 1GB. Combined, these prevent it from being useful for the 100 million digit numbers. (I can't be the only one eyeing this as making that task feasible.) |
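To put numbers on the first limit, here is a minimal sketch of the arithmetic, using helper names (`bits_per_word`, `buffer_mib`) that are made up for this example. It assumes the usual picture for this kind of FFT multiplication: the exponent's bits are spread over the n FFT words, so each word carries about p / n bits, and the round-off error grows as that figure rises. The second helper gives the size of a single array of n doubles; how many such arrays the program keeps on the card is not stated in this thread, so it is only a lower bound on memory use.

```c
/* Average number of bits carried by each FFT word for exponent p
 * at FFT length n. Hypothetical helper for illustration only. */
double bits_per_word(unsigned long exponent, unsigned long fft_len)
{
    return (double)exponent / (double)fft_len;
}

/* Size in MiB of one array of fft_len doubles (8 bytes each). */
double buffer_mib(unsigned long fft_len)
{
    return (double)fft_len * 8.0 / (1024.0 * 1024.0);
}
```

For the failing run, `bits_per_word(151150000, 8388608)` is about 18.0, compared with about 15.9 for the 33333333 run at n = 2097152 shown earlier in the thread. That fits the error of 0.35 and the attempt to raise n past 8388608, which is where the cuFFT size limit quoted above comes in.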