2021-06-21, 12:15   #1
MJansen

GPU Prime Gap searching

Quote:
Originally Posted by CraigLo View Post
Thanks. I use 1 1080 TI. My code doesn't work well for large numbers so I decided to switch to the max gap search until I have time to rewrite it. I planned to run it over the summer with the hope of finding 1 new record above 1432. I got lucky that this gap is so close to the previous max gap.
Cool! I would love to know more about GPU programming, so many questions:
a. what code language do you use?
b. is it under Windows or Linux (or a Linux dialect)?
c. can you give a (simplified) code example?
d. how does a CUDA core compare to a CPU thread? Can you use all 1564 CUDA cores for prime gap hunting in parallel, just like CPU threads?
e. is your energy bill going up by a lot?
f. I looked for GPUs and the 1080 Ti is available second hand (introduced in 2017), but still at high prices (EUR 500-700), and some have the explicit note that they have not been used for mining Bitcoin. Does mining (and I guess 24/7 prime gap hunting) wear out the GPU faster than usual?

Curious as ever,
Michiel Jansen
2021-06-22, 07:02   #2
CraigLo

Quote:
Originally Posted by MJansen View Post
Cool! I would love to know more about GPU programming, so many questions:
a. what code language do you use?
I use CUDA, which is an extension of C. It is Nvidia-specific. I looked at OpenCL when I started, but at the time it didn't have support for arithmetic carry. I'm not sure if this has changed.
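Just to illustrate what the C extension looks like, and why carry support matters for multi-word arithmetic, here is a toy sketch (not my actual search code) of a 96-bit add using the carry chain CUDA exposes through inline PTX:

Code:
// Toy sketch only: add two 96-bit numbers (three 32-bit words each)
// using the add-with-carry instructions exposed via inline PTX.
__device__ void add96(const unsigned int a[3], const unsigned int b[3],
                      unsigned int r[3])
{
    asm("add.cc.u32  %0, %3, %6;\n\t"   // low words, sets the carry flag
        "addc.cc.u32 %1, %4, %7;\n\t"   // middle words + carry, sets carry
        "addc.u32    %2, %5, %8;\n\t"   // high words + carry
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]),
          "r"(b[0]), "r"(b[1]), "r"(b[2]));
}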
Quote:
b. is it under Windows or Linux (or a Linux dialect)?
I've only used Linux, but others have used Windows.
Quote:
c. can you give a (simplified) code example?
The CUDA programming guide is good.
https://docs.nvidia.com/cuda/cuda-c-...ide/index.html
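As a simplified example of what a complete CUDA program looks like (this is the standard toy vector add, nothing to do with prime gaps):

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Toy example: each GPU thread adds one pair of array elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));   // unified memory for brevity
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);   // 256 threads per block
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}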
Quote:
d. how does a CUDA core compare to a CPU thread? Can you use all 1564 CUDA cores for prime gap hunting in parallel, just like CPU threads?
See the programming guide for details.
The GPU is broken up into SMs (streaming multiprocessors). Each SM has a fixed amount of fast shared memory and a fixed number of registers that can be used by all threads in the SM. Each SM can run 128 threads at a time (this varies with the GPU architecture). Code is executed in warps of 32 threads: every thread in a warp runs the same code and executes the same instruction at the same time.

My GPU has 28 SMs x 128 = 3584 cores. They can all run at the same time. I have each thread process a different candidate prime. My code slows down as the numbers get bigger because each thread uses too many resources to be able to use all the cores. To fix this I will need to have multiple threads process a single candidate (a summer/fall project maybe, and I really need something like Karatsuba multiplication).
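In skeleton form the one-thread-per-candidate mapping is just the following (the PRP function name here is a stand-in, not my real routine):

Code:
// Hypothetical skeleton: one GPU thread per sieve-surviving candidate.
__device__ int fermat_base2(unsigned long long n);   // PRP test, defined elsewhere

__global__ void checkCandidates(const unsigned long long *candidate,
                                int numCandidates, int *isPrp)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's candidate
    if (i < numCandidates)
        isPrp[i] = fermat_base2(candidate[i]);
}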

Quote:
e. is your energy bill going up by a lot?
The 1080 TI has a peak of about 250 W. Over the winter I was heating my basement with primes so it didn't cost me much. Now the cost is noticeable but pretty small.

Quote:
f. I looked for GPUs and the 1080 Ti is available second hand (introduced in 2017), but still at high prices (EUR 500-700), and some have the explicit note that they have not been used for mining Bitcoin. Does mining (and I guess 24/7 prime gap hunting) wear out the GPU faster than usual?
I bought mine on eBay 3 years ago for $500. Chip prices are crazy right now. I've probably had it running for less than a year in total since I bought it. I'm sure running it constantly will decrease its life, but I haven't had any problems with mine.
2021-06-22, 19:21   #3
MJansen

Hi Craig,

thanks for the info! I have been reading up on the Nvidia guide, but it is a lot to take in. It seems like a toolkit that requires you to handle memory management, parallel programming, sending tasks to cores/warps/SMs, writing your own math functions, retrieving results, preventing deadlocks ... pfff, no easy task! Kudos to you for having come this far!

Conceptually I am trying to wrap my head around what can be done in parallel. You wrote that you have each thread (equal to 32 cores? So 28 x 128 / 32 = 112 simultaneous threads? That would be a lot more than the 24 of my AMD 3900 ...) process a single prime (or do you mean prime gap?). Forgive me if I get this wrong, but this confuses me a little.

Regarding a parallel approach, my assumption would be (though I do not know if it is possible on a GPU) to have each of the 112 threads check a single candidate in a prime gap interval (with a Fermat test or Miller-Rabin test), and if no PRPs are found, feed in the next 112 candidates, until a PRP is found. Of course the workload per thread could be smaller, so you can work on more intervals at a time, I guess.

Another approach I can think of would be to sieve the interval around the primorial start value for composites, so that only the stronger candidates remain to be checked with a pseudoprime test on the CPU. But I am totally new to this and nowhere near knowledgeable enough to have a good understanding of how to use the GPU to its fullest! Again my curiosity speaking: what conceptual approach have you implemented?

PS: no room in my house needs an extra heater ;-) and 1564 was my mistake, I scrambled the correct core number (3584).

Kind regards
Michiel
2021-06-22, 20:32   #4
CraigLo

Quote:
Originally Posted by MJansen View Post
Conceptually I am trying to wrap my head around what can be done in parallel. You wrote that you have each thread (equal to 32 cores? So 28 x 128 / 32 = 112 simultaneous threads?) process a single prime (or do you mean prime gap?). [...]
It is actually 3584 cores. What I was trying to say is that you cannot have 3584 completely independent programs running at the same time. The cores are grouped into blocks of 32, called warps, which all run the same code at the same time. But you can have all 3584 cores each running a separate Fermat test.
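For concreteness, here is what a base-2 Fermat test looks like for a single small number. This is a toy 32-bit version, not my 65-bit code, and it uses the slow 64-bit remainder for clarity, which is exactly the operation a tuned kernel would replace:

Code:
// Illustrative only: base-2 Fermat test for odd n with 2 < n < 2^32.
// Real code for 65-bit candidates needs multi-word arithmetic and a
// fast reduction (Montgomery or similar) instead of the 64-bit %.
__device__ bool fermatBase2(unsigned int n)
{
    unsigned long long result = 1, base = 2 % n;
    unsigned int e = n - 1;                  // check 2^(n-1) mod n == 1
    while (e) {
        if (e & 1)
            result = (result * base) % n;    // products fit in 64 bits
        base = (base * base) % n;
        e >>= 1;
    }
    return result == 1;
}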
2021-06-23, 13:47   #5
MJansen

Hi Craig,

I am trying to compare my CPU-only calculations to the GPU's calculating power. Because of the differences in architecture it is not quite the same, but am I correct in saying that your GPU can run 3584 / 32 = 112 simultaneous gap searches (scripts) on the 112 warps? And that each warp can process 32 prime candidates at a time? Is that correct?

I read a thread elsewhere on this forum on GPU sieving (300+ pages ...) and noted a few things:
a. the thread started in 2009 already! There must be some people around who have built up knowledge about GPU programming since then.
b. if you use the full capacity of the GPU, your display will be hampered/frozen, so disconnect the GPU from the monitor and use a separate GPU for display.
c. developing cross-platform / generic GPU code is tricky in the sense that Nvidia driver updates, Windows or Unix/Linux updates, different video cards, etc. can cause errors in older (working) code.
d. not sure where I read it, but there was some mention of a penalty for using % operations on the GPU; is that something you have encountered?

Kind regards
Michiel
2021-06-23, 16:02   #6
CraigLo

Quote:
Originally Posted by MJansen View Post
I am trying to compare my CPU-only calculations to the GPU's calculating power. [...] am I correct in saying that your GPU can run 3584 / 32 = 112 simultaneous gap searches (scripts) on the 112 warps? And that each warp can process 32 prime candidates at a time? [...] there was some mention of a penalty for using % operations on the GPU; is that something you have encountered?
We should probably start a separate GPU thread.

You can have 3584 simultaneous gap searches. In my 65-bit optimized code roughly 1/2 the candidates are prime after sieving. If I processed 32 candidates for a single gap I would be giving up about a factor of 16 in performance. I think the term Nvidia uses for a warp is single instruction multiple thread (SIMT) which is similar to SIMD.

A single GPU core is going to be slower than a single CPU core. GPU operations are 32-bit. Integer multiplication is slower than floating-point multiplication (I think a 32-bit integer multiply takes ~7x the cycles of a 32-bit float multiply on mine). Integer division and mod are very slow on a GPU and should be avoided as much as possible. GPUs are designed for fast 32-bit float multiplies and fast memory access. Everything else is slower.
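One standard way around the slow % (not necessarily how my code does it) is to precompute floor(2^32 / m) once per modulus and reduce with a multiply-high plus one conditional subtract:

Code:
// Sketch of a Barrett-style reduction for 32-bit operands.
// Assumes 1 < m < 2^31 and r = floor(2^32 / m) precomputed on the host.
__device__ unsigned int mod32(unsigned int x, unsigned int m, unsigned int r)
{
    unsigned int q = __umulhi(x, r);    // q is floor(x/m) or one less
    unsigned int t = x - q * m;         // so t lies in [0, 2m)
    return (t >= m) ? t - m : t;        // one conditional subtract finishes it
}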

For an idea of the processing power, the specs are listed here:
https://en.wikipedia.org/wiki/GeForce_10_series

The 1080 Ti has about 10000 GFLOPS. That counts a multiply and an accumulate as separate operations, so 5000 GMACs. Dividing by 7 gives me about 700E9 32-bit integer multiply-accumulates per second.

On the memory side, global memory bandwidth is about 480 GB/s and shared memory is about a factor of 10 higher.
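As an illustration of why that matters (a generic pattern, not specific to the gap code): a block can stage data it reuses into __shared__ memory once and then reread it at the much higher speed:

Code:
// Generic pattern: stage a tile of sieving primes in fast shared memory
// so every thread in the block can reuse it without more global reads.
// Assumes the kernel is launched with 256 threads per block.
__global__ void useSharedTile(const unsigned int *primes, int numPrimes)
{
    __shared__ unsigned int tile[256];           // one tile per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPrimes)
        tile[threadIdx.x] = primes[i];           // one global read per thread
    __syncthreads();                             // tile now visible to the block

    // ... each thread can now read any tile[] entry repeatedly
    //     at shared-memory speed ...
}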

In my experience, my internet browser is slow while the GPU is running. Everything else I do is unaffected, but I don't use that computer for much.

I've only used Linux and the GTX 10 series, so I'm not sure about cross-platform issues or driver updates. Different GPU architectures (9 series vs 10 series, etc.) have different instruction sets and features. Even within a single family the optimal parameters will probably change due to different capabilities.
2021-06-24, 13:00   #7
MJansen

Craig wrote:
We should probably start a separate GPU thread.
MJ: that would be good. Could anybody move the relevant posts to a separate thread? Or should I open a new thread altogether?

The rest of Craig's post, quoted below, should be discussed in the new thread.

You can have 3584 simultaneous gap searches. [...]
2021-06-24, 15:54   #8
MJansen

Wow, that was fast! Many thanks to whoever made this a separate thread!

Regarding the rest of the post, @Craig: I am going to try, in my own words, to summarize what I think I understand. This probably means I will not use the correct terms, so please cut me some slack in that regard:

I think you use your CPU (1 or more threads?) to sieve an interval and feed the remaining candidates to the GPU. In the GPU a SIMT (let's use that term) has the code to send the candidates to the cores in that SIMT and perform a Fermat test. Is that the basic concept?

How the results get stored or processed you have not yet mentioned so I will not guess on that.

Regarding sieving: Jens Kruse Andersen made a very efficient treesieve program (especially for larger gaps). It is 32-bit C code and outputs text files of the remaining candidates in an interval around the prime gap center. Is that something that can be used? Let me rephrase: can you store an exe file on the GPU/SIMT and run it?

Or can you - using Linux - access or store the GMP library from the GPU to allow for fast multiplication?

I am not familiar with Linux (just Windows, sorry!), but there is software around (and maybe people here have their own sieving code; Dana has incorporated a very efficient sieve in his Perl next_prime / prev_prime functions).

What else have we learned:
a. a GPU core is about half as fast as (or even slower than) a CPU core.
b. division and modulo operations are very slow on a GPU.
c. use the GPU's own memory where possible (much faster), but the practical implications are not yet clear (to me at least).

Kind regards and thanks again for creating this thread!
Michiel

2021-06-24, 22:59   #9
CraigLo

Quote:
Originally Posted by MJansen View Post
I think you use your CPU (1 or more threads?) to sieve an interval and feed the remaining candidates to the GPU. In the GPU, a SIMT (let's use that term) has the code to send the candidates to the cores in that SIMT and perform a Fermat test. Is that the basic concept? [...]
I do all my sieving on the GPU. Sieving is limited by memory speed, which is much higher on the GPU. I run the sieving and prime-check code at the same time. They work well together because one is memory-bandwidth limited and the other is processor limited. While waiting for a memory access the GPU can switch to another warp that is ready to run; it does this automatically and instantly. Running them at the same time takes less time than running one and then the other.
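Roughly, a windowed GPU sieve can be structured like this (a generic Eratosthenes-style sketch, not my actual 65-bit code):

Code:
// Generic windowed-sieve sketch: one thread per sieving prime, marking
// composites in [base, base + n) in a global bit array.  Assumes the
// sieving primes are all smaller than base and n is a multiple of 32.
__global__ void sieveWindow(const unsigned int *primes, int numPrimes,
                            unsigned long long base, unsigned int n,
                            unsigned int *compositeBits)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPrimes) return;

    unsigned long long p = primes[i];
    unsigned long long m = ((base + p - 1) / p) * p;   // first multiple >= base
    // (a tuned kernel would precompute base % p to avoid this 64-bit divide)
    for (; m < base + n; m += p) {
        unsigned int off = (unsigned int)(m - base);
        atomicOr(&compositeBits[off >> 5], 1u << (off & 31));
    }
}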

If you already had pre-computed candidates you could send them to the GPU for processing. The 65-bit code I'm running now is fast enough that even if you had pre-computed candidates I think you would be limited by how quickly you could send them from the CPU to the GPU.
2021-06-25, 15:09   #10
MJansen

Quote:
Originally Posted by CraigLo View Post
I do all my sieving on the GPU. [...] If you already had pre-computed candidates you could send them to the GPU for processing. [...]
This means at least the following two conceptual approaches:
A. using the GPU for pre-sieving and PRP test
B. pre-sieve on CPU and using the GPU for PRP test only

The latter approach was used in the thread I mentioned (https://mersenneforum.org/showthread.php?t=12827), and there were regular remarks about the CPU not being fast enough, so the GPU was left waiting. So I guess your approach, with both the pre-sieving and the PRP test on the GPU, circumvents that.

Usually pre-sieving depends on (a combination of) wheel-based approaches, modulo operations, or a gcd check for small factors. These methods rely on division or remainder calculations, so they do not seem the most efficient approach for a GPU. Treesieve is fast but also heavy on modulo operations. I admit I am curious how your 65-bit code handles the pre-sieving!

Regarding the hardware, I am not sure when I will have a GPU installed to start experimenting, so for the time being it is conceptual discussion only. I found a nice overview of Nvidia GPUs and their specs: https://www.studio1productions.com/A...-GPU-Chart.htm ; this makes for a nice wishlist ;-)

Kind regards
Michiel
2021-06-25, 17:42   #11
ATH

Quote:
Originally Posted by MJansen View Post
The latter approach was used in the thread I mentioned (https://mersenneforum.org/showthread.php?t=12827) and there were regular remarks about the CPU being not fast enough so the GPU was waiting. So I guess your approach with both pre sieving and PRP test on the GPU circumvents that.
mfaktc can also sieve on the GPU: SieveOnGPU=1 in mfaktc.ini. I think almost everyone uses that now, and I think it was only in the very beginning of mfaktc that people were using the CPU for sieving, but I could be wrong about that.