20131002, 12:37  #1 
Mar 2010
3·137 Posts 
How to pick the best FFT size: a CUDALucas guide
Greetings.
I wrote this guide after thinking of a semiautomated method for best FFT size generation. Nothing new to be honest, just a systematic approach. The method, of course, could be fully scripted, but I neither possess the knowledge to do that or feel the need to, it's easy already. The only prerequesites are a perl interpreter and unixutils/Cygwin. Let's begin! 0. Pick an exponent 1. Run Code:
CUDALucas t c 10000 $exp$ 2.1a If the $starting_FFT$ did not fail, then the new $starting_FFT$ = 2^n, where the old $starting_FFT$ > 2^n. 3. Run the following command, where working_FFT < ending_FFT >= 2^n. Code:
CUDALucas cufftbench $starting_FFT$ $ending_FFT$ 512 > data.txt 5. Run Code:
perl ne "chomp; my $a=substr ($_, 24); my $b=substr ($_, 0, 24); print \"$a $b\n\"" < data.txt > data_flipped.txt Code:
sort n $../data_flipped.txt$ o $../FFT_data.txt$ Code:
CUDALucas t c 500 f $best FFT size$ $exp$ Example № 1: 0. Let's pick 6972593 as our exponent. 1. With my GTX Titan, the starting FFT was chosen as 393216, which is 2^17 * 3. 2.1a The new starting FFT should be 2^18 (262144), since 2^18 < 2^17 * 3. 3. Ending FFT was chosen as 524800, since 2^17 * 3 < 2^19 + 512 > 2^19. Full command now looks like Code:
CUDALucas cufftbench 262144 524800 512 > data.txt 5. Done, sample data_flipped.txt attached. 6. Done, sorted by efficiency FFT sizes are in the attached FFT_data.txt file. 7. First seven FFT sizes produced error rates > 0.25, no good. 8. FFT size of 401408, which is 2^13 * 7^2 was picked as the fastest. As a result, it produces timings of 0.3095ms per LL iteration vs 0.3520ms per LL iteration with the default(393216) FFT size. Example № 2: 0. Let's pick 68321161 as our exponent. 1. With GTX titan, the starting FFT was chosen as 3670016, which is 2^19 * 7. 2. Working FFT was chosen as 3932160, which is 2^18 * 3 * 5. 3. Ending FFT was chosen as 4194816, since 2^19 * 7 < 2^22 + 512 > 2^22. Full command now looks like Code:
CUDALucas cufftbench 3670016 4194816 512 > data.txt 5. Done, sample data_flipped.txt attached. 6. Done, sorted by efficiency FFT sizes are in the attached FFT_data.txt file. 7. First FFT size was golden. 8. FFT size of 4096000, which is 2^15 * 5^3, was picked as the fastest. As a result, it produces timings of 2.9670ms per LL iteration vs. 4.2085ms per LL iteration with the default (3932160) FFT size. This would save me 20h of work, should I decide to do a LL test on 68321161. Thoughs and assumptions: Cufftbench can be reworked to test numbers only in the following forms: 2^n 2^n * 3^m 2^n * 5^m 2^n * 7^m 2^n * 3^m * 5^l 2^n * 3^m * 7^l 2^n * 5^m * 7^l 2^n * 3^m * 5^l * 7^k where n,m,l,k are within all the numbers within some sane boundary ([1;32] ?) Once the numbers are generated, another algorithm applies, which discards values that are too big and too small (this is the current cufftbench's start and end values). This way, we don't waste our time on numbers, which will yield bad timings, and the FFT size selection happens a lot faster. Of course, I may be missing something here, but as far as I know, current highend NV gpus benefit from FFT sizes in some of those forms. Other forms may or may not be beneficial for future CUDA/OpenCL GPUs. This guide should be useful for huge exponents. I'm currently doing a LL test again 59473411, and the best FFT size for it is 3359232, which is 2^9 * 3^8. Interesting combination, ehh? I'd like to hear your comments, suggestions and corrections regarding all of this. Last fiddled with by Karl M Johnson on 20131002 at 12:39 Reason: yes 
20131002, 22:30  #2 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
100111011_{2} Posts 
Be careful with the fft size. Last I looked, in order to give correct results, the fft size must be a multiple of the threads parameter. I doubt you are using threads > 512, but it would be worth checking just to make sure.
Also, the pointwise multiplication kernel requires fft size be a multiple of 512 for correct output, so don't go lower than 2^9 on the power of two part. Last fiddled with by owftheevil on 20131002 at 22:33 
20131002, 22:41  #3 
If I May
"Chris Halsall"
Sep 2002
Barbados
61·167 Posts 

20131003, 09:17  #4  
Mar 2010
3·137 Posts 
Quote:
No big deal. I assume cufftbench will not produce such FFT sizes (since it depends on the threads parameter), so that changes nothing. As for pointwise multiplication, if the FFT size is indivisible by 2^9, while the threads are set to < 2^9, it just throws an error rate of 0.5 at you. Again, no big deal. If it is even possible to stumble across such a FFT size, it can safely be skipped. Any other comments? Am I the only one picking golden FFT sizes before commencing a LL test (it's somewhat akin to GNFS's polynomial selection)? 

20131003, 12:50  #5  
"Kieren"
Jul 2011
In My Own Galaxy!
2·3·1,693 Posts 
Quote:
I do appreciate, and try to take advantage of the efforts of those more thoughtful and skilled. 

20131003, 13:48  #6  
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
3^{2}·5·7 Posts 
Quote:
So in the rare cases that it might show up, the problem is simply annoying rather than critical, since the program gives immediate feedback that something is wrong rather than quietly producing an incorrect result many hours later. That's good to know. 

20131010, 08:20  #7 
Mar 2010
3·137 Posts 
Here's another guide which utilizes an easier approach.
I've done a cufftbench on a [25M;100M] exponent range, which was identified as FFT sizes of [1048576;8388608], with a step of 512. Now, there are three lists: The raw results of FFT benchmark, FFT sizes sorted by efficiency and factored FFT sizes, sorted by efficiency or download all from sendspace. The devs may need all 3 lists, users of Cudalucas should stick with the second. Now, to pick a golden FFT size, note the FFT size Cudalucas selects, and search for a similar (or even a smaller one, if Culu doesn't fail) size. Here's an example: Say I want to commence a LL test on 2^85123453  1. Here's what I get when first launching Cudalucas: Code:
Starting M85123453 fft length = 4718592 iteration = 33 < 1000 && err = 0.253906 >= 0.25, increasing n from 4718592 Starting M85123453 fft length = 5242880 Iteration 1000 M( 85123453 )C, 0xdf4efecbfda348f5, n = 5242880, CUDALucas v2.03 err = 0.0244 (0:05 real, 4.5829 ms/iter, ETA 108:21:49) Search the sorted FFT size list for "CUFFT_Z2Z size= 4\d\d\d\d\d\d" (you'll need a text editor that supports regxp, Notepad++ with TextFX plugin does). First results will be 4096000, which is too small for us. After a dozen of clicks, you should find 4741632. Even though it's bigger, here's what it produces: Code:
Starting M85123453 fft length = 4741632 Iteration 1000 M( 85123453 )C, 0xdf4efecbfda348f5, n = 4741632, CUDALucas v2.03 err = 0.2969 (0:04 real, 4.1280 ms/iter, ETA 97:36:25) Iteration 2000 M( 85123453 )C, 0x8898eec48732d5cc, n = 4741632, CUDALucas v2.03 err = 0.3140 (0:04 real, 4.0200 ms/iter, ETA 95:03:02) Iteration 3000 M( 85123453 )C, 0x1fc79db79df64352, n = 4741632, CUDALucas v2.03 err = 0.3140 (0:04 real, 4.0224 ms/iter, ETA 95:06:28) Iteration 4000 M( 85123453 )C, 0x7c85fa5ae6a60809, n = 4741632, CUDALucas v2.03 err = 0.3140 (0:04 real, 4.0204 ms/iter, ETA 95:03:34) Soon, you should find 4816896. Looks like it's what we need: Code:
Iteration 1000 M( 85123453 )C, 0xdf4efecbfda348f5, n = 4816896, CUDALucas v2.03 err = 0.2188 (0:05 real, 4.1924 ms/iter, ETA 99:07:43) Iteration 2000 M( 85123453 )C, 0x8898eec48732d5cc, n = 4816896, CUDALucas v2.03 err = 0.2188 (0:04 real, 4.0760 ms/iter, ETA 96:22:35) Iteration 3000 M( 85123453 )C, 0x1fc79db79df64352, n = 4816896, CUDALucas v2.03 err = 0.2227 (0:04 real, 4.1040 ms/iter, ETA 97:02:15) A note to Cudalucas developers: as you can see from the third list, a lot of FFT sizes are not entirely made of classic 2/3/5/7 integers, yet they still remain faster than some others. Conclusion(s): 1. Cufftbenchmark is perfect, no need to change it, as it revealed FFT sizes composed of interesting factors. No connection found. 2. Testing FFT sizes on [100M;200M] range would take a while. Comments, ideas, suggestions are welcome. Last fiddled with by Karl M Johnson on 20131010 at 08:22 
20131010, 17:45  #8 
If I May
"Chris Halsall"
Sep 2002
Barbados
10187_{10} Posts 
Interesting...
For those who do CUDALucas work, could a table be derived which would show where the "sweat spots" (and the related FFT length) are for the ranges currently being worked? This thought comes out of discussions Kracker and LaurV recently had as to where their kit's best (read: most efficient) work area was for DCing. 
20131011, 04:15  #9 
Mar 2010
3×137 Posts 
Such a table will be highly beneficial, but I could only compile it by using Cudalucas with constantly increasing exponents.
I've seen people estimate FFT lengths based on exponent size. Is there a formula for approximate FFT size, which is based on exponent? Or is it a rule of thumb? 
20131011, 07:33  #10 
Romulan Interpreter
"name field"
Jun 2011
Thailand
71×139 Posts 
The numbers are systemdependent (i.e. GPU mostly). A table will only have guidance character, but the user still need to bench its own system for the range of the expos he is doing. You will be very surprised!
All this discussion we already had for cudaLucas, when Dubslow repeatedly changed (see following posts there) the table used for FFT selection in that program. We got a list with those best FFT's for different cards, including for Tesla, or for Titan, (and if you dig around those posts, there is a lot of info, say read 45 pages back and forth, from about page 150, in the cudaLucas thread), which table is partially embedded into cudaLucas 2.04, and it is doing well for my gtx580, but still has some "small flicks" here and there. I always do my tuning when I switch exponents/ranges. I mean, the justwritten tutorial is nice, and kudos/kotgw to Karl for putting all together, but renegotiating the same things every 4 months is a bit bothering, don't you think? The idea is that if you want to squeeze the last bit of speed from the GPU, you need to tune, anyhow. If not, then let cudaLucas to do its job, he is (almost) very good in doing it. With a "table", you won't be able to beat it (except for few very particular cards/systems), unless you make this table a "book" (one table for one GPU, or so). Of course, there are "guides" about the "granularity" of the FFT (again, see cuLu thread), finer granularities doing better than "some small factors times a big prime", but that is not a rule. There are cases when (depending of the number of threads, card, exponent), for example, x*y*2^2*5 and x*y*2*3^2 are both worse than x*y*19. See concrete example in those lists. Last fiddled with by LaurV on 20131011 at 07:43 
20131011, 20:23  #11  
∂^{2}ω=0
Sep 2002
República de California
11688_{10} Posts 
Quote:


Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
PRP: Pick A Range  Citrix  Prime Sierpinski Project  2  20140216 18:47 
Guide  OmbooHankvald  Operation Billion Digits  4  20090826 08:18 
Help me pick a math course.  jasong  Math  9  20050311 21:04 
Pick and Choose  Wacky  Puzzles  5  20030716 20:02 
Pick a stone, or two, .... or three  Wacky  Puzzles  5  20030624 16:11 