mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2012-07-04, 16:17   #45
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

2·1,789 Posts
Default

Outstanding, axn!

All together, your optimizations sped it up nearly 20%!

Code:
3.889000 milliseconds for big sieve
5761455 big primes (< 100000000) found

47.001999 milliseconds for big sieve
50847534 big primes (< 1000000000) found

741.184021 milliseconds for big sieve
455052511 big primes (< 10000000000) found
code

And, it made me think of one more thing to try... hopefully I'll get a chance to later today.
bsquared is offline   Reply With Quote
Old 2012-07-05, 14:16   #46
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

2×1,789 Posts
Default

+ "Continue" can be "break" ()
+ added three more primes to the small prime pre-sieve
+ a few more code cleanups

now 15% faster:

Code:
3.314000 milliseconds for big sieve
5761455 big primes (< 100000000) found

39.870998 milliseconds for big sieve
50847534 big primes (< 1000000000) found

654.596985 milliseconds for big sieve
455052511 big primes (< 10000000000) found
bsquared is offline   Reply With Quote
Old 2012-07-05, 15:34   #47
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

37·269 Posts
Default

Great work bsquared and axn!

Now I'm going to ask a question which will probably reveal my naivety on such matters...

At what point can such GPU processing be incorporated into mfakt* to eliminate the current need for CPU pre-sieving? I know that rcv did some work on this, but I don't know if it actually went anywhere.

As it currently is, for higher-end GPUs, it can take many CPU cores just to saturate a single GPU.

Insight from those who actually understand how all this works would be greatly appreciated.
chalsall is offline   Reply With Quote
Old 2012-07-05, 16:10   #48
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

2·1,789 Posts
Default

Quote:
Originally Posted by chalsall View Post
Great work bsquared and axn!

Now I'm going to ask a question which will probably reveal my naivety on such matters...

At what point can such GPU processing be incorporated into mfakt* to eliminate the current need for CPU pre-sieving? I know that rcv did some work on this, but I don't know if it actually went anywhere.
There's a fair bit of work involved in taking basic sieve of eratosthenes code like this into a useful supplier of k's for mfakt*. Translating sieve bits to k's, honoring the class residue system of mfakt*, working with large (96 bit) integers, etc. This is stuff rcv has already done (on top of his very efficient sieve).

I've been talking to rcv about this a little. His code was incorporated into Bdot's github repository, but I don't know how far anyone has taken it toward merging it into the master (it is currently a branch). I also don't know how much work, if any, has been done to figure out how much of a hit TF performance will take if the GPU is also doing sieving, whether or not that performance hit is a net win given that freed CPU cycles can be used for other work (it depends on what other work is done, of course), or to figure out optimal sieve/TF crossover points, etc. There is likely a lot of work to do there.

I don't have any plans at the moment to try to port this code into mfakt*. My only ulterior motive (hope? notion?) is to someday use the sieving tricks I've learned to make a fast tiny-qs engine for factoring composite residues in multi-large-prime variants of SIQS or NFS.
bsquared is offline   Reply With Quote
Old 2012-07-05, 16:14   #49
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

2×1,789 Posts
Default

+ special main loop for larger primes

Only really helps if many larger primes are needed, of course:
Code:
3.324000 milliseconds for big sieve
5761455 big primes (< 100000000) found

39.411999 milliseconds for big sieve
50847534 big primes (< 1000000000) found

587.780029 milliseconds for big sieve
455052511 big primes (< 10000000000) found
code
bsquared is offline   Reply With Quote
Old 2012-07-05, 16:34   #50
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

233418 Posts
Default

Quote:
Originally Posted by bsquared View Post
I also don't know how much work, if any, has been done to figure out how much of a hit TF performance will take if the GPU is also doing sieving, whether or not that performance hit is a net win given that freed CPU cycles can be used for other work (it depends on what other work is done, of course), or to figure out optimal sieve/TF crossover points, etc. There is likely a lot of work to do there.
Thank you.

I would argue that while GPU TFing will of course take a "hit" by not having the candidates pre-sieved by a CPU, it would be better to have the GPUs do all the work required for TFing. This would mean that only a single instance of mfakt* would need to be run per card, and the CPU(s) would be free to do other work.

As the musical davieddy once said, "a CPU shouldn't get out of bed to do trial factoring."
chalsall is offline   Reply With Quote
Old 2012-07-06, 03:18   #51
axn
 
axn's Avatar
 
Jun 2003

514910 Posts
Default

I would think that the logical thing would be to start with rcv's code, and see if you can incorporate some of the concepts there.
axn is offline   Reply With Quote
Old 2012-07-15, 16:26   #52
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

1DD016 Posts
Default

More ideas for you:

pinfo is an array of structures. Structure of arrays usually works better (threads coalesce global memory references better). That is, turn pinfo into two half-sized arrays (primes and pinv).

Division is bad, oh so bad. Eliminate your div-by-3s and div-by-6s by storing p/6 in pinfo instead of p. Won't you also have to store p mod 6 - increasing global memory accessing costs? You could, or you could have two pinfo arrays, one for 1 mod 6 and another for 5 mod 6 primes and simply double the amount of kernel code that processes primes.
Prime95 is offline   Reply With Quote
Old 2012-07-16, 03:28   #53
axn
 
axn's Avatar
 
Jun 2003

141D16 Posts
Default

Quote:
Originally Posted by Prime95 View Post
Division is bad, oh so bad. Eliminate your div-by-3s and div-by-6s by storing p/6 in pinfo instead of p. Won't you also have to store p mod 6 - increasing global memory accessing costs? You could, or you could have two pinfo arrays, one for 1 mod 6 and another for 5 mod 6 primes and simply double the amount of kernel code that processes primes.
The compiler optimizes the "division by constants" away, so that shouldn't be a problem.
axn is offline   Reply With Quote
Old 2012-07-16, 13:36   #54
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

2·1,789 Posts
Default

Thanks for the suggestions George.

As axn said, though, the divisions by constants get optimized away, so anything I try to do to get rid of them only makes things slower.

Using two half sized arrays did help a bit - thanks again!

Code:
3.307000 milliseconds for big sieve
5761455 big primes (< 100000000) found

38.532001 milliseconds for big sieve
50847534 big primes (< 1000000000) found

556.763000 milliseconds for big sieve
455052511 big primes (< 10000000000) found
code
bsquared is offline   Reply With Quote
Old 2012-07-16, 15:17   #55
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

24·32·53 Posts
Default

Quote:
Originally Posted by bsquared View Post
Using two half sized arrays did help a bit - thanks again!
Glad to help! Something else to try to reduce memory usage. Change the primes array from a uint32 to a uint16. Instead of the prime number, store the difference between the new prime to test and the last prime tested by the thread. In other words, it will be the distance to the 256th previous prime. This reduces memory accesses by half at the cost of one addition.

I haven't thought it through, but you should be able to do the same thing to pinv once the prime gets large enough.

P.S. You may have to read 32-bits at a time from memory and split it into uint16s yourself (there is a PTX instruction to do that) to get the best performance. I'm still learning CUDA myself, so let me know what works and what doesn't. Thanks!
Prime95 is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
mfaktc: a CUDA program for Mersenne prefactoring TheJudger GPU Computing 3506 2021-09-18 00:04
The P-1 factoring CUDA program firejuggler GPU Computing 753 2020-12-12 18:07
End of the world as we know it (in music) firejuggler Lounge 3 2012-12-22 01:43
World Cup Soccer davieddy Hobbies 111 2011-05-28 19:21
World's dumbest CUDA program? xilman Programming 1 2009-11-16 10:26

All times are UTC. The time now is 13:05.


Thu Oct 21 13:05:13 UTC 2021 up 90 days, 7:34, 1 user, load averages: 1.79, 1.50, 1.38

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.