mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2013-10-13, 00:22   #12
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

32·5·7 Posts
Default

From fairly extensive testing on my own systems, I get a power regresssion:

fft = 0.03665* exponent^1.12217

or

exponent = 21.852 * fft ^0.9783

Only one fft I've tested so far falls more than 1% below this bound, 3136k.
owftheevil is offline   Reply With Quote
Old 2013-10-16, 09:01   #13
Karl M Johnson
 
Karl M Johnson's Avatar
 
Mar 2010

3·137 Posts
Default

Quote:
Originally Posted by LaurV View Post
The numbers are system-dependent (i.e. GPU mostly). A table will only have guidance character, but the user still need to bench its own system for the range of the expos he is doing. You will be very surprised!

All this discussion we already had for cudaLucas, when Dubslow repeatedly changed (see following posts there) the table used for FFT selection in that program. We got a list with those best FFT's for different cards, including for Tesla, or for Titan, (and if you dig around those posts, there is a lot of info, say read 4-5 pages back and forth, from about page 150, in the cudaLucas thread), which table is partially embedded into cudaLucas 2.04, and it is doing well for my gtx580, but still has some "small flicks" here and there. I always do my tuning when I switch exponents/ranges.

I mean, the just-written tutorial is nice, and kudos/kotgw to Karl for putting all together, but re-negotiating the same things every 4 months is a bit bothering, don't you think?

The idea is that if you want to squeeze the last bit of speed from the GPU, you need to tune, anyhow. If not, then let cudaLucas to do its job, he is (almost) very good in doing it. With a "table", you won't be able to beat it (except for few very particular cards/systems), unless you make this table a "book" (one table for one GPU, or so). Of course, there are "guides" about the "granularity" of the FFT (again, see cuLu thread), finer granularities doing better than "some small factors times a big prime", but that is not a rule. There are cases when (depending of the number of threads, card, exponent), for example, x*y*2^2*5 and x*y*2*3^2 are both worse than x*y*19. See concrete example in those lists.
After doing several tests on two different GPUs, it seems you're right, the precomputed results are only good if the same GPU will be used.
Drivers, Cudalucas version and settings were all the same.
The only exception in the tests was power of two FFT, and only on the minimum band.


The first approach works for any GPUs, both current and future, so it should be used
Karl M Johnson is offline   Reply With Quote
Old 2013-10-20, 12:25   #14
Manpowre
 
"Svein Johansen"
May 2013
Norway

C916 Posts
Default FFT Bench Titan and gtx 590

This weekend was the first weekend where I were able to fiddle with my platform since mid august.. and oboy did I get new experiences.
* New driver from Nvidia from mid-september has the memory fix where we had to take memory down to 2500mhz. I can now run stock 3004mhz on memory on the titans.
* I researched FFTs.. and I discovered a new FFT length for my exponents I am testing which saved a few hours
* With FFT research, I confirmed the FFT length for my 590s were the most optimal one.

I also looked into interesting FFT sizes for both platforms, titan and 590 board. This overview is in the attached excel file. All red FFT lengths are interesting as they are large and quick compared to the other ones. The FFT length I tested was from 3m-5m which is suitable for most people. I would love to do this on 690, 680, 780, etc.. but I simply do not have these cards.

Atleast I found the best FFT length for my exponents I am testing, and for the titans, the fiddling during weekend, took my exponents from 84h per test to 60h per test. Simply with a combination of new driver and adjusting GPU clock speed and memory speed and adjusted FFT length.

The attached overview, simply also shows the difference between the platforms. Titan has one 4m FFT length with great result, while 590 has a different 4m FFT which is quick. This explains easily why the 590s now are 2.5 times slower per titan. But since 590s can do 2 exponents simultaniously, they still perform compared to Titan. also 590s are cheap these days.

One of the Titans is right now computing a very detailed FFTlength overview from 3.5m-4.5m as this is the range of FFTs I am looking for the exponents I am testing. I will take time to get this list, so I am waiting patiently for the titan to process the list. I will post my findings after some tests.. I am offcourse hoping for a new FFT length which gives me even better performance. well see.
Attached Files
File Type: zip titan and 590 fft.zip (133.3 KB, 108 views)

Last fiddled with by Manpowre on 2013-10-20 at 12:25
Manpowre is offline   Reply With Quote
Old 2013-10-20, 16:40   #15
Manpowre
 
"Svein Johansen"
May 2013
Norway

3×67 Posts
Default

I checked 20k FFT lengths, and the quickest one between 3.5m-4.5m is 4096000 with 0,888446 msec for Titan. (590 has different quickest FFT length in this FFTlength area).
This FFTlength of 4096000 also shows up using 1024 between each fft length when benchmarking using cudalucas. nice confirmation...

Last fiddled with by Manpowre on 2013-10-20 at 16:42
Manpowre is offline   Reply With Quote
Old 2013-11-03, 04:08   #16
fairsky
 
Sep 2013

41 Posts
Default

Quote:
Originally Posted by Karl M Johnson View Post
Greetings.

I wrote this guide after thinking of a semi-automated method for best FFT size generation.
Nothing new to be honest, just a systematic approach.
The method, of course, could be fully scripted, but I neither possess the knowledge to do that or feel the need to, it's easy already.

The only pre-requesites are a perl interpreter and unixutils/Cygwin.

Let's begin!



0. Pick an exponent
1. Run
Code:
CUDALucas -t -c 10000 $exp$
2. Note the failed FFT($starting_FFT) size and the one that works(working_FFT).
2.1a If the $starting_FFT$ did not fail, then the new $starting_FFT$ = 2^n, where the old $starting_FFT$ > 2^n.
3. Run the following command, where working_FFT < ending_FFT >= 2^n.
Code:
CUDALucas -cufftbench $starting_FFT$ $ending_FFT$ 512 > data.txt
4. After the command has finished, delete all non-FFT related text from the data.txt file (usually it's 2-3 lines).
5. Run
Code:
perl -ne "chomp; my $a=substr ($_, 24); my $b=substr ($_, 0, 24); print \"$a $b\n\"" < data.txt > data_flipped.txt
6. Run
Code:
sort -n $../data_flipped.txt$ -o $../FFT_data.txt$
7. Run the following command, if at first 500 iterations you see that error size >= 0.25, pick the next best FFT size and run the command with it again.
Code:
CUDALucas -t -c 500 -f $best FFT size$ $exp$
8. If the error is < 2.5, you're good to go.



Example № 1:
0. Let's pick 6972593 as our exponent.
1. With my GTX Titan, the starting FFT was chosen as 393216, which is 2^17 * 3.
2.1a The new starting FFT should be 2^18 (262144), since 2^18 < 2^17 * 3.
3. Ending FFT was chosen as 524800, since 2^17 * 3 < 2^19 + 512 > 2^19. Full command now looks like
Code:
CUDALucas -cufftbench 262144 524800 512 > data.txt
4. Done, sample data.txt attached.
5. Done, sample data_flipped.txt attached.
6. Done, sorted by efficiency FFT sizes are in the attached FFT_data.txt file.
7. First seven FFT sizes produced error rates > 0.25, no good.
8. FFT size of 401408, which is 2^13 * 7^2 was picked as the fastest. As a result, it produces timings of 0.3095ms per LL iteration vs 0.3520ms per LL iteration with the default(393216) FFT size.

Example № 2:
0. Let's pick 68321161 as our exponent.
1. With GTX titan, the starting FFT was chosen as 3670016, which is 2^19 * 7.
2. Working FFT was chosen as 3932160, which is 2^18 * 3 * 5.
3. Ending FFT was chosen as 4194816, since 2^19 * 7 < 2^22 + 512 > 2^22. Full command now looks like
Code:
CUDALucas -cufftbench 3670016 4194816 512 > data.txt
4. Done, sample data.txt attached.
5. Done, sample data_flipped.txt attached.
6. Done, sorted by efficiency FFT sizes are in the attached FFT_data.txt file.
7. First FFT size was golden.
8. FFT size of 4096000, which is 2^15 * 5^3, was picked as the fastest. As a result, it produces timings of 2.9670ms per LL iteration vs. 4.2085ms per LL iteration with the default (3932160) FFT size. This would save me 20h of work, should I decide to do a LL test on 68321161.



Thoughs and assumptions:

Cufftbench can be reworked to test numbers only in the following forms:
2^n
2^n * 3^m
2^n * 5^m
2^n * 7^m
2^n * 3^m * 5^l
2^n * 3^m * 7^l
2^n * 5^m * 7^l
2^n * 3^m * 5^l * 7^k
where n,m,l,k are within all the numbers within some sane boundary ([1;32] ?)
Once the numbers are generated, another algorithm applies, which discards values that are too big and too small (this is the current cufftbench's start and end values).
This way, we don't waste our time on numbers, which will yield bad timings, and the FFT size selection happens a lot faster.
Of course, I may be missing something here, but as far as I know, current high-end NV gpus benefit from FFT sizes in some of those forms.
Other forms may or may not be beneficial for future CUDA/OpenCL GPUs.

This guide should be useful for huge exponents.

I'm currently doing a LL test again 59473411, and the best FFT size for it is 3359232, which is 2^9 * 3^8.
Interesting combination, ehh?



I'd like to hear your comments, suggestions and corrections regarding all of this.


Why did you start the fft from the value close to the original one?
I am asking, could one test the best-fft by ranging fft from such as 1024 to 3276800?
Then pick up the best one (defined as the fast one as well as throws no error or less error than 0.12) .
Surely, this had happened in my job cases, and error is less than 0,1 with the fft_picked less or less less than the fft_automatic.
fairsky is offline   Reply With Quote
Old 2013-11-03, 05:30   #17
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

2·11·439 Posts
Default

Quote:
Originally Posted by fairsky View Post
Why did you start the fft from the value close to the original one?
If you have a bag full of 10 cm nails and want to put them into a fence, you may try to do this very fast with a 100 kilo-tons pneumatic hammer, be done in five minutes and at the end can't find the fence anymore, in the rubble you create, or you can try what is written on the bag: "use a half kilo manual hammer to insert these nails into the fence". Sometimes the half kilo can be too heavy for you if you hammer all the day with it, your hand will be in terrible pain, or, if you are a strong guy, it can be too light and detrimental to your productivity. So you can "tune" the hammer between, say, 300 grams and 800 grams, to be more productive and get less pain in your muscles. For sure you won't use a 5 grams jewelery hammer, otherwise you may need to hammer all your life for that bag of nails. Neither a one ton hammer won't be good, if you still need the fence at the end... So, stay around the prescription on the bag. Why? Well... to understand why, you need to learn the subtleties of FFT multiplication...

Last fiddled with by LaurV on 2013-11-03 at 05:52 Reason: s/tones/tons
LaurV is online now   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
PRP:- Pick A Range Citrix Prime Sierpinski Project 2 2014-02-16 18:47
Guide OmbooHankvald Operation Billion Digits 4 2009-08-26 08:18
Help me pick a math course. jasong Math 9 2005-03-11 21:04
Pick and Choose Wacky Puzzles 5 2003-07-16 20:02
Pick a stone, or two, .... or three Wacky Puzzles 5 2003-06-24 16:11

All times are UTC. The time now is 13:00.


Tue Aug 3 13:00:59 UTC 2021 up 11 days, 7:29, 1 user, load averages: 2.99, 2.72, 2.61

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.