CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW)
1 Attachment(s)
Hi,
I converted MacLucasFFTW to CUDA/CUFFT (single precision) and ran it on an ION/Atom 330.
[quote]
ION$mkdir mers
ION$cd mers
ION$wget [URL]http://www.garlic.com/%7Ewedgingt/mers.tar.gz[/URL]
ION$tar -zxvf mers.tar.gz
ION$patch -p0 -d . < MacLucasFFTW.cuda.0.patch
ION$cd mers
ION$/usr/local/cuda/bin/nvcc -DMERS_PACKAGE -DBIT_SIEVE -DTESTING_SMALL_EXPONENTS -DSIEVE_SIZE_IN_BYTES=32 -DNUM_SMALL_PRIMES=32768 -O3 -DDO_NOT_USE_LONG_DOUBLE -I/usr/local/include MacLucasFFTW.c setup.c rw.c balance.c zero.c -L/usr/local/lib -c
ION$g++ -fPIC -o MacLucasFFTW MacLucasFFTW.o setup.o rw.o balance.o zero.o -L/usr/local/cuda/lib -L/NVIDIA_GPU_Computing_SDK/C/lib -L/NVIDIA_GPU_Computing_SDK/C/common/common/lib/linux -lcudart -L/usr/local/cuda/lib -L/NVIDIA_GPU_Computing_SDK/C/lib -L/NVIDIA_GPU_Computing_SDK/C/common/lib/linux -lcufft -lcutil -lm
ION$ time ./MacLucasFFTW 11213
1 2048
...
11001 2048
M( 11213 )P, n = 2048, MacLucasFFTW v8.1 Ballester

real 0m4.945s
user 0m4.200s
sys 0m0.744s
ION$ time ./MacLucasFFTW 216091
1 32768
1 32768
1 32768
1 65536
1001 65536
...
216001 65536
M( 216091 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 35m14.453s
user 30m41.439s
sys 4m32.585s
[/quote]
It cannot resume a run.
Thank you, |
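For readers new to the output format: a line such as `M( 11213 )P` reports the result of a Lucas-Lehmer test on 2^11213 - 1, with the trailing P presumably marking "prime". The sketch below is only an illustration of that test using Python's built-in big integers. It is not the MacLucasFFTW code, which does the squaring step with floating-point FFTs, and the function name `lucas_lehmer` is made up for this example.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime. The exponent p must itself be prime."""
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs p > 2
    m = (1 << p) - 1  # the Mersenne number being tested
    s = 4
    # p - 2 squarings modulo m; this squaring is the step the GPU FFT speeds up
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in (7, 11, 11213):
        print(f"M( {p} ){'P' if lucas_lehmer(p) else 'C'}")
```

The loop runs p - 2 times, which is why the timings in this thread are divided by the exponent to get a per-iteration figure.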
[QUOTE=msft;192993]
ION$ time ./MacLucasFFTW 216091
...
real 35m14.453s
[/QUOTE]
Wow, that's slow! For fun, I thought I'd try it on the C1060. A 64-bit compile didn't work, but 32-bit works fine, although it runs at about the same speed. It's completely bandwidth limited with all of the transfers on and off the device. |
Hi, Mr fmky.
It depends on PCI bus bandwidth. (My choice of ION was correct...)
Right now there are 4 CPU <-> GPU data transfers per iteration. Reducing that to 2 is easy, but getting to 0 is very difficult. Putting every routine on the GPU... is a nightmare.

ION$ time ./MacLucasFFTW 859433
...
859001 262144
M( 859433 )P, n = 262144, MacLucasFFTW v8.1 Ballester

real 553m57.624s
user 486m37.685s
sys 63m24.838s |
It depends on latency, not bandwidth.
11211 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 130m0.106s
user 108m8.678s
sys 11m52.809s

Using a 4194304 point fft: 130 min / 11213 iter = 0.70 sec/iter

11211 8388608
M( 11213 )P, n = 8388608, MacLucasFFTW v8.1 Ballester

real 263m41.280s
user 217m26.123s
sys 25m9.138s

Using an 8388608 point fft: 263 min / 11213 iter = 1.41 sec/iter |
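The sec/iter figures quoted in this thread are the `real` wall-clock time divided by the exponent. A small sketch of that arithmetic follows; the helper names `real_seconds` and `sec_per_iter` are made up for this example, and the inputs are the two `real` values reported above.

```python
import re


def real_seconds(real: str) -> float:
    """Convert a `time` value such as '130m0.106s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if match is None:
        raise ValueError(f"unrecognised time format: {real!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def sec_per_iter(real: str, exponent: int) -> float:
    """Wall-clock seconds per iteration, dividing by the exponent as the posts do."""
    return real_seconds(real) / exponent


if __name__ == "__main__":
    # (real time, exponent, FFT length) taken from the runs reported above
    runs = [("130m0.106s", 11213, 4194304), ("263m41.280s", 11213, 8388608)]
    for real, p, n in runs:
        print(f"n = {n}: {sec_per_iter(real, p):.2f} sec/iter")
```

For these two runs it gives about 0.70 and 1.41 sec/iter, so doubling the FFT length roughly doubles the time per iteration.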
1 Attachment(s)
[B]I made a MacLucasFFTW / Ubuntu 9.04 (32-bit) / CUDA 2.3 / CUFFT (double precision) version.[/B]
|
GTX260 result.
Hi,
I got a GTX260 in Akihabara.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 4m14.854s
user 2m53.103s
sys 1m21.725s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 62m52.755s
user 44m54.440s
sys 17m58.311s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 22m30.253s
user 20m19.644s
sys 2m10.604s

2048k fft sec/iter = 0.12

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 44m57.511s
user 40m37.008s
sys 4m20.572s

4096k fft sec/iter = 0.24 |
GTX260 CUFFT double precision benchmark.
The benchmark is based on the source from [URL="http://www.science.uwaterloo.ca/%7Ehmerz/CUDA_benchFFT/"]http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/[/URL]
CUFFT double precision Complex to Complex fft.
calculated 4 point fft 1000 times in 0.008224 seconds = 4.863870 mflops
calculated 8 point fft 1000 times in 0.011834 seconds = 10.140262 mflops
calculated 16 point fft 1000 times in 0.022916 seconds = 13.964068 mflops
calculated 32 point fft 1000 times in 0.018881 seconds = 42.370436 mflops
calculated 64 point fft 1000 times in 0.021270 seconds = 90.268460 mflops
calculated 128 point fft 1000 times in 0.033173 seconds = 135.049269 mflops
calculated 256 point fft 1000 times in 0.040438 seconds = 253.227750 mflops
calculated 512 point fft 1000 times in 0.039472 seconds = 583.703807 mflops
calculated 1024 point fft 1000 times in 0.053949 seconds = 949.044903 mflops
calculated 2048 point fft 1000 times in 0.063347 seconds = 1778.141443 mflops
calculated 4096 point fft 1000 times in 0.073652 seconds = 3336.771353 mflops
calculated 8192 point fft 1000 times in 0.072037 seconds = 7391.749264 mflops
calculated 16384 point fft 1000 times in 0.095094 seconds = 12060.484847 mflops
calculated 32768 point fft 1000 times in 0.168577 seconds = 14578.497099 mflops
calculated 65536 point fft 1000 times in 0.290185 seconds = 18067.368993 mflops
calculated 131072 point fft 1000 times in 0.541373 seconds = 20579.382834 mflops
calculated 262144 point fft 1000 times in 1.012113 seconds = 23310.597859 mflops
calculated 524288 point fft 1000 times in 1.930565 seconds = 25799.369189 mflops
calculated 1048576 point fft 1000 times in 3.874456 seconds = 27063.825242 mflops
calculated 2097152 point fft 1000 times in 7.998022 seconds = 27531.927191 mflops
calculated 4194304 point fft 1000 times in 16.420866 seconds = 28096.778989 mflops
calculated 8388608 point fft 1000 times in 34.149594 seconds = 28248.942537 mflops

CUFFT double precision Complex to Complex fft with memory transfer.
calculated 4 point fft 1000 times in 0.039361 seconds = 1.016236 mflops
calculated 8 point fft 1000 times in 0.043003 seconds = 2.790504 mflops
calculated 16 point fft 1000 times in 0.054218 seconds = 5.902111 mflops
calculated 32 point fft 1000 times in 0.051855 seconds = 15.427609 mflops
calculated 64 point fft 1000 times in 0.053721 seconds = 35.740155 mflops
calculated 128 point fft 1000 times in 0.065968 seconds = 67.911776 mflops
calculated 256 point fft 1000 times in 0.075842 seconds = 135.017605 mflops
calculated 512 point fft 1000 times in 0.076387 seconds = 301.622209 mflops
calculated 1024 point fft 1000 times in 0.096225 seconds = 532.085824 mflops
calculated 2048 point fft 1000 times in 0.116388 seconds = 967.797693 mflops
calculated 4096 point fft 1000 times in 0.149064 seconds = 1648.687601 mflops
calculated 8192 point fft 1000 times in 0.188968 seconds = 2817.832906 mflops
calculated 16384 point fft 1000 times in 0.297410 seconds = 3856.224820 mflops
calculated 32768 point fft 1000 times in 0.541233 seconds = 4540.742277 mflops
calculated 65536 point fft 1000 times in 1.005135 seconds = 5216.095639 mflops
calculated 131072 point fft 1000 times in 1.937994 seconds = 5748.790044 mflops
calculated 262144 point fft 1000 times in 3.775305 seconds = 6249.285866 mflops
calculated 524288 point fft 1000 times in 7.420558 seconds = 6712.077396 mflops
calculated 1048576 point fft 1000 times in 14.830570 seconds = 7070.368855 mflops
calculated 2097152 point fft 1000 times in 29.878380 seconds = 7369.909602 mflops
calculated 4194304 point fft 1000 times in 60.144817 seconds = 7671.042373 mflops
calculated 8388608 point fft 1000 times in 121.557722 seconds = 7936.064479 mflops

/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./bandwidthTest
Running on......
device 0:GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2529.5

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2173.9

calculated 8388608 point fft 1000 times in 34.149594 seconds = 28248.942537 mflops
calculated 8388608 point fft 1000 times in 121.557722 seconds = 7936.064479 mflops

8388608*16/1024/1024 (Mbyte) /2100 (MB/s) * 2 * 1000 (times) = 121.9 sec |
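The transfer estimate and the mflops column can be reproduced with a few lines of Python. This is a sketch under stated assumptions: 16 bytes per point is one double-precision complex value, 2100 MB/s is the rounded bandwidth used in the post, and the factor 2 is one copy in each direction. The mflops formula, 5 N log2(N) operations per transform, is not stated in the thread; it is assumed here because it reproduces the reported numbers. The function names are made up for this example.

```python
import math


def transfer_seconds(points: int,
                     bytes_per_point: int = 16,      # one double-precision complex value
                     bandwidth_mb_s: float = 2100.0,  # rounded figure used in the post
                     directions: int = 2,             # host to device and back
                     repeats: int = 1000) -> float:
    """Estimated total time spent copying the FFT buffer over the bus."""
    megabytes = points * bytes_per_point / 1024 / 1024
    return megabytes / bandwidth_mb_s * directions * repeats


def fft_mflops(points: int, repeats: int, seconds: float) -> float:
    """mflops assuming 5 * N * log2(N) operations per transform (an assumption)."""
    operations = 5 * points * math.log2(points) * repeats
    return operations / seconds / 1e6


if __name__ == "__main__":
    n = 8388608
    print(f"transfer estimate: {transfer_seconds(n):.1f} sec")
    print(f"without transfer:  {fft_mflops(n, 1000, 34.149594):.1f} mflops")
    print(f"with transfer:     {fft_mflops(n, 1000, 121.557722):.1f} mflops")
```

Changing `bandwidth_mb_s` or `directions` shows how sensitive the with-transfer figure is to the bus speed and to the number of copies per transform.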
1 Attachment(s)
Hi,
New result on GTX260.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 2m16.778s
user 1m22.469s
sys 0m54.311s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 31m40.768s
user 19m31.713s
sys 12m9.150s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 15m2.462s
user 10m15.066s
sys 4m47.262s

2048k fft sec/iter = 0.08

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 30m14.297s
user 20m37.141s
sys 9m41.688s

4096k fft sec/iter = 0.16
Thank you, |
1 Attachment(s)
Hi,
New result on GTX260.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 1m42.692s
user 1m2.768s
sys 0m39.906s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 23m8.920s
user 14m3.845s
sys 9m5.094s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 8m35.896s
user 5m5.511s
sys 3m30.349s

2048k fft sec/iter = 0.046

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 17m14.207s
user 10m0.930s
sys 7m12.587s

4096k fft sec/iter = 0.092
Thank you, |
1 Attachment(s)
Hi,
New result on GTX260.

M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 1m39.794s
user 1m7.864s
sys 0m31.934s

M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 20m33.342s
user 14m1.825s
sys 6m31.548s

M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 7m27.026s
user 5m8.783s
sys 2m18.257s

2048k fft sec/iter = 0.040

M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 14m54.153s
user 10m14.254s
sys 4m39.897s

4096k fft sec/iter = 0.080
Thank you, |
Good work.
My Core i7 920, clocked at default settings, gives:
Best time for 2048K FFT length: 39.869 ms.
Best time for 4096K FFT length: 87.849 ms.
So you're getting into the realm of what's theoretically expected. Can't wait until the 3xx series comes out with a 5-fold increase in 64-bit float performance.
-- Craig |