mersenneforum.org World's second-dumbest CUDA program

2012-07-04, 16:17   #45
bsquared

"Ben"
Feb 2007

2·1,789 Posts

Outstanding, axn! All together, your optimizations sped it up nearly 20%!

Code:
3.889000 milliseconds for big sieve
5761455 big primes (< 100000000) found
47.001999 milliseconds for big sieve
50847534 big primes (< 1000000000) found
741.184021 milliseconds for big sieve
455052511 big primes (< 10000000000) found

And, it made me think of one more thing to try... hopefully I'll get a chance to later today.
2012-07-05, 14:16   #46
bsquared

"Ben"
Feb 2007

2×1,789 Posts

+ "continue" can be "break"
+ added three more primes to the small prime pre-sieve
+ a few more code cleanups

Now 15% faster:

Code:
3.314000 milliseconds for big sieve
5761455 big primes (< 100000000) found
39.870998 milliseconds for big sieve
50847534 big primes (< 1000000000) found
654.596985 milliseconds for big sieve
455052511 big primes (< 10000000000) found
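[The "small prime pre-sieve" above isn't spelled out in the thread. The usual trick is to precompute the composite pattern of a few tiny primes over one full period (the product of those primes), then initialize each sieve segment by copying that pattern instead of striking the tiny primes' multiples individually. A minimal host-side C sketch of the idea; all names are illustrative, not bsquared's actual code:]

```c
#include <stdint.h>
#include <string.h>

/* Pre-sieved tiny primes; their product is the pattern period. */
static const uint32_t tiny[] = {3, 5, 7};
#define PATTERN_LEN 105  /* 3 * 5 * 7 */

static uint8_t pattern[PATTERN_LEN];  /* 1 = divisible by 3, 5 or 7 */

/* Build the periodic composite pattern once, up front. */
static void build_pattern(void) {
    memset(pattern, 0, sizeof pattern);
    for (size_t i = 0; i < sizeof tiny / sizeof tiny[0]; i++)
        for (uint32_t j = 0; j < PATTERN_LEN; j += tiny[i])
            pattern[j] = 1;
}

/* Initialize a segment covering [base, base + len) by tiling the pattern,
   instead of striking multiples of 3, 5 and 7 one prime at a time.
   (Real code would copy whole aligned chunks; the per-element form
   keeps the sketch short.) */
static void presieve_segment(uint8_t *seg, uint64_t base, uint32_t len) {
    for (uint32_t i = 0; i < len; i++)
        seg[i] = pattern[(base + i) % PATTERN_LEN];
}
```

[Adding a prime to the pre-sieve multiplies the pattern length by that prime, so only the first handful of primes are worth including.]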
2012-07-05, 15:34   #47
chalsall
If I May

"Chris Halsall"
Sep 2002
Barbados

37·269 Posts

Great work, bsquared and axn!

Now I'm going to ask a question which will probably reveal my naivety on such matters... At what point can such GPU processing be incorporated into mfakt* to eliminate the current need for CPU pre-sieving? I know that rcv did some work on this, but I don't know if it actually went anywhere.

As it currently is, for higher-end GPUs it can take many CPU cores just to saturate a single GPU. Insight from those who actually understand how all this works would be greatly appreciated.
2012-07-05, 16:10   #48
bsquared

"Ben"
Feb 2007

2·1,789 Posts

Quote:
 Originally Posted by chalsall Great work bsquared and axn! Now I'm going to ask a question which will probably reveal my naivety on such matters... At what point can such GPU processing be incorporated into mfakt* to eliminate the current need for CPU pre-sieving? I know that rcv did some work on this, but I don't know if it actually went anywhere.
There's a fair bit of work involved in turning basic Sieve of Eratosthenes code like this into a useful supplier of k's for mfakt*: translating sieve bits to k's, honoring mfakt*'s class residue system, working with large (96-bit) integers, and so on. This is stuff rcv has already done (on top of his very efficient sieve).

I've been talking to rcv about this a little. His code was incorporated into Bdot's github repository, but I don't know how far anyone has taken it toward merging it into master (it is currently a branch). I also don't know how much work, if any, has been done to figure out how much of a hit TF performance will take if the GPU is also doing sieving; whether that performance hit is a net win, given that the freed CPU cycles can be used for other work (it depends on what other work is done, of course); or what the optimal sieve/TF crossover points are. There is likely a lot of work to do there.

I don't have any plans at the moment to try to port this code into mfakt*. My only ulterior motive (hope? notion?) is to someday use the sieving tricks I've learned to make a fast tiny-qs engine for factoring composite residues in multi-large-prime variants of SIQS or NFS.
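[The first task mentioned above, turning surviving sieve bits into candidate k's under a class residue system, can be sketched roughly as follows. This is a hedged illustration only: the class count, the bit-to-k mapping, and the names (NUM_CLASSES, bit_to_k, collect_k) are all assumptions for the sketch, and mfakt*'s real class system and constants differ.]

```c
#include <stdint.h>

/* Illustrative class count; mfakt*'s real values differ. */
#define NUM_CLASSES 96

/* Assume bit i in the sieve for class c stands for
   k = k_min + c + i * NUM_CLASSES. */
static uint64_t bit_to_k(uint64_t k_min, uint32_t c, uint64_t i) {
    return k_min + c + i * (uint64_t)NUM_CLASSES;
}

/* Scan one 32-bit sieve word; clear bits are surviving candidates.
   Returns how many k values were written to out. */
static int collect_k(uint32_t word, uint64_t k_min, uint32_t c,
                     uint32_t word_idx, uint64_t *out, int max_out) {
    int n = 0;
    for (int b = 0; b < 32 && n < max_out; b++)
        if (!(word & (1u << b)))
            out[n++] = bit_to_k(k_min, c, (uint64_t)word_idx * 32 + b);
    return n;
}
```

[The remaining work bsquared lists, the 96-bit arithmetic on q = 2kp+1 and feeding the k's to the TF kernels, is the part rcv's code already covers.]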

2012-07-05, 16:14   #49
bsquared

"Ben"
Feb 2007

2×1,789 Posts

+ special main loop for larger primes

Only really helps if many larger primes are needed, of course:

Code:
3.324000 milliseconds for big sieve
5761455 big primes (< 100000000) found
39.411999 milliseconds for big sieve
50847534 big primes (< 1000000000) found
587.780029 milliseconds for big sieve
455052511 big primes (< 10000000000) found
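[The post doesn't show the "special main loop", but the standard reason a separate loop pays off for large primes is this: once a prime is at least as large as the sieve segment, it strikes at most one location per segment, so the usual striding inner loop collapses to a single bounds check. A minimal host-side C sketch of that idea, with illustrative names, not bsquared's actual code:]

```c
#include <stdint.h>

/* For primes >= seg_len, each prime hits at most one byte of the segment
   [base, base + seg_len), so no inner striding loop is needed.  next[i]
   carries prime i's next multiple across segments (real code would start
   it at p*p). */
static void sieve_large(uint8_t *seg, uint64_t base, uint32_t seg_len,
                        const uint32_t *p, uint64_t *next, int np) {
    for (int i = 0; i < np; i++) {
        if (next[i] >= base && next[i] < base + seg_len) {
            seg[next[i] - base] = 1;   /* mark composite */
            next[i] += p[i];           /* advance to the next multiple */
        }
    }
}
```

[That matches the timings above: the sub-1e8 sieve barely moves, while the 1e10 run, which is dominated by primes larger than a segment, drops from ~655 ms to ~588 ms.]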
2012-07-05, 16:34   #50
chalsall
If I May

"Chris Halsall"
Sep 2002

23341₈ Posts

Quote:
 Originally Posted by bsquared I also don't know how much work, if any, has been done to figure out how much of a hit TF performance will take if the GPU is also doing sieving, whether or not that performance hit is a net win given that freed CPU cycles can be used for other work (it depends on what other work is done, of course), or to figure out optimal sieve/TF crossover points, etc. There is likely a lot of work to do there.
Thank you.

I would argue that while GPU TFing will of course take a "hit" by not having the candidates pre-sieved by a CPU, it would be better to have the GPUs do all the work required for TFing. This would mean that only a single instance of mfakt* would need to be run per card, and the CPU(s) would be free to do other work.

As the musical davieddy once said, "a CPU shouldn't get out of bed to do trial factoring."

2012-07-06, 03:18   #51
axn

Jun 2003

5149₁₀ Posts

I would think that the logical thing would be to start with rcv's code and see if you can incorporate some of the concepts there.
2012-07-15, 16:26   #52
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1DD0₁₆ Posts

More ideas for you:

pinfo is an array of structures. A structure of arrays usually works better (threads coalesce global memory references better). That is, turn pinfo into two half-sized arrays (primes and pinv).

Division is bad, oh so bad. Eliminate your div-by-3s and div-by-6s by storing p/6 in pinfo instead of p. "Won't you also have to store p mod 6, increasing global memory access costs?" You could, or you could have two pinfo arrays, one for the 1 mod 6 primes and another for the 5 mod 6 primes, and simply double the amount of kernel code that processes primes.
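[George's first suggestion is the classic AoS-to-SoA transformation. With an array of structures, thread t and thread t+1 read p fields 8 bytes apart, so a warp's loads are spread over twice as many memory transactions; with two flat arrays, consecutive threads read consecutive 32-bit words, which the hardware coalesces. A host-side C sketch of the split (names illustrative):]

```c
#include <stdint.h>
#include <stddef.h>

/* Array-of-structures layout: p and pinv interleaved in memory. */
typedef struct { uint32_t p; uint32_t pinv; } pinfo_t;

/* Structure-of-arrays layout: split into two half-sized arrays so that
   consecutive threads touch consecutive words of primes[] (or pinv[]),
   letting the GPU coalesce a warp's loads into full-width transactions. */
static void aos_to_soa(const pinfo_t *in, size_t n,
                       uint32_t *primes, uint32_t *pinv) {
    for (size_t i = 0; i < n; i++) {
        primes[i] = in[i].p;
        pinv[i]   = in[i].pinv;
    }
}
```

[A kernel that only needs the prime, not its inverse, also reads half as much data in the SoA layout, which is a second, independent win.]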
2012-07-16, 03:28   #53
axn

Jun 2003

141D₁₆ Posts

Quote:
 Originally Posted by Prime95 Division is bad, oh so bad. Eliminate your div-by-3s and div-by-6s by storing p/6 in pinfo instead of p. Won't you also have to store p mod 6 - increasing global memory accessing costs? You could, or you could have two pinfo arrays, one for 1 mod 6 and another for 5 mod 6 primes and simply double the amount of kernel code that processes primes.
The compiler optimizes the "division by constants" away, so that shouldn't be a problem.
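[What axn is referring to: when the divisor is a compile-time constant, the compiler replaces the divide with a multiply by a precomputed fixed-point reciprocal plus a shift, so no hardware division ever happens. The transformation for an unsigned 32-bit divide by 6, written out by hand (0xAAAAAAAB is the well-known magic constant, ceil(2^34 / 6); this is the code mainstream compilers typically emit for `x / 6`):]

```c
#include <stdint.h>

/* Strength-reduced x / 6 for uint32_t: multiply by the 33-bit reciprocal
   of 6 and shift.  The 64-bit product cannot overflow. */
static uint32_t div6(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 34);
}
```

[Since the compiler already does this automatically for a literal `/ 6`, hand-rolling it, or restructuring the data to avoid the division, buys nothing, which is exactly what bsquared reports below.]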

2012-07-16, 13:36   #54
bsquared

"Ben"
Feb 2007

2·1,789 Posts

Thanks for the suggestions, George. As axn said, though, the divisions by constants get optimized away, so anything I try to do to get rid of them only makes things slower.

Using two half-sized arrays did help a bit - thanks again!

Code:
3.307000 milliseconds for big sieve
5761455 big primes (< 100000000) found
38.532001 milliseconds for big sieve
50847534 big primes (< 1000000000) found
556.763000 milliseconds for big sieve
455052511 big primes (< 10000000000) found
2012-07-16, 15:17   #55
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2⁴·3²·53 Posts

Quote:
 Originally Posted by bsquared Using two half sized arrays did help a bit - thanks again!
Glad to help! Something else to try, to reduce memory usage: change the primes array from uint32 to uint16. Instead of the prime number itself, store the difference between the new prime to test and the last prime tested by the thread; in other words, the distance to the 256th-previous prime. This halves the memory accessed at the cost of one addition.

I haven't thought it through, but you should be able to do the same thing to pinv once the prime gets large enough.

P.S. You may have to read 32 bits at a time from memory and split them into uint16s yourself (there is a PTX instruction to do that) to get the best performance. I'm still learning CUDA myself, so let me know what works and what doesn't. Thanks!
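[George's gap-encoding idea, sketched on the host side. With 256 threads each walking every 256th prime, the stored uint16 is the distance from a prime to the one 256 positions earlier, i.e. the same thread's previous prime; the first 256 primes remain full 32-bit bases. The names (NTHREADS, encode_gaps, nth_prime_of_thread) are illustrative assumptions, and a real kernel would advance incrementally rather than re-sum from the base:]

```c
#include <stdint.h>
#include <stddef.h>

#define NTHREADS 256  /* each "thread" walks every 256th prime */

/* Encode: keep the first NTHREADS primes as full 32-bit bases; for the
   rest store only the 16-bit distance to the prime NTHREADS positions
   earlier.  base[] needs NTHREADS entries, gap[] needs n - NTHREADS. */
static void encode_gaps(const uint32_t *primes, size_t n,
                        uint32_t *base, uint16_t *gap) {
    for (size_t i = 0; i < n; i++) {
        if (i < NTHREADS)
            base[i] = primes[i];
        else
            gap[i - NTHREADS] = (uint16_t)(primes[i] - primes[i - NTHREADS]);
    }
}

/* Decode the k-th prime handled by thread t by a running sum of gaps:
   one extra addition per prime, in exchange for half the bytes read. */
static uint32_t nth_prime_of_thread(const uint32_t *base, const uint16_t *gap,
                                    int t, size_t k) {
    uint32_t p = base[t];
    for (size_t j = 0; j < k; j++)
        p += gap[j * NTHREADS + t];
    return p;
}
```

[One caveat worth checking before adopting this: the stride-256 gap, roughly 256 times the local average prime gap, must stay below 65536, which bounds how far the prime list can extend before uint16 overflows.]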

