mersenneforum.org  

Old 2022-09-08, 19:28   #716
pepi37
 
 
Dec 2011
After 1.58M nines:)

5²·67 Posts

Did my testing show you where the problem is? My CPU is still faster than the RTX 2060!
Any other ideas? A different setup?
Old 2022-09-08, 21:43   #717
rogue
 
 
"Mark"
Apr 2003
Between here and the

16100₈ Posts

Quote:
Originally Posted by pepi37 View Post
Did my testing show you where the problem is? My CPU is still faster than the RTX 2060!
Any other ideas? A different setup?
You can try a different value for -g, but remember that you are comparing a single GPU to multiple cores on a CPU.

Unlike some of the other sieves using the framework, srsieve2cl uses a lot more GPU memory, as each thread has to maintain its own set of tables in memory. When it comes to discrete logs, you can use less memory, but then the computation time for each p varies significantly. The discrete log used by srsieve2cl uses a method that "flattens the curve" for the calculation regardless of p, but requires more memory. It might be possible to modify the algorithm to use less memory in the GPU, but that could lead to other issues. One of the worst things with the current algorithm is that there are many conditionals and the remaining loops can't really be unrolled. It would likely require a completely different algorithm to get more speed out of it.
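
The memory-versus-time trade-off for discrete logs can be illustrated with textbook baby-step giant-step (BSGS). This is only a minimal Python sketch, not the srsieve2cl kernel: the function name `bsgs` and the `baby_steps` parameter are hypothetical, and the "flattened" variant that srsieve2cl actually uses is not shown in this thread. It assumes p is prime and does not divide the base.

```python
from math import isqrt

def bsgs(base, target, p, baby_steps=None):
    """Solve base**x == target (mod p) for prime p; return x or None.

    baby_steps is the table size (hypothetical knob): a larger table
    uses more memory but needs fewer giant steps.
    """
    m = baby_steps or isqrt(p - 1) + 1
    # Baby steps: remember the smallest j for each base**j, 0 <= j < m.
    table = {}
    cur = 1
    for j in range(m):
        table.setdefault(cur, j)
        cur = cur * base % p
    # Each giant step multiplies by base**(-m); needs Python 3.8+.
    factor = pow(base, -m, p)
    gamma = target % p
    giant_steps = (p - 2) // m + 1  # enough to cover exponents 0..p-2
    for i in range(giant_steps):
        if gamma in table:
            return i * m + table[gamma]
        gamma = gamma * factor % p
    return None
```

With the default table size both loops run about sqrt(p) times. Passing a small `baby_steps` shrinks the table but lengthens the giant-step loop, and the number of iterations before a hit then depends on where the answer falls, which is one way the per-p run time can vary.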
Old 2022-09-08, 22:43   #718
pepi37
 
 
Dec 2011
After 1.58M nines:)

1675₁₀ Posts

Quote:
Originally Posted by rogue View Post
You can try a different value for -g, but remember that you are comparing a single GPU to multiple cores on a CPU.

Unlike some of the other sieves using the framework, srsieve2cl uses a lot more GPU memory, as each thread has to maintain its own set of tables in memory. When it comes to discrete logs, you can use less memory, but then the computation time for each p varies significantly. The discrete log used by srsieve2cl uses a method that "flattens the curve" for the calculation regardless of p, but requires more memory. It might be possible to modify the algorithm to use less memory in the GPU, but that could lead to other issues. One of the worst things with the current algorithm is that there are many conditionals and the remaining loops can't really be unrolled. It would likely require a completely different algorithm to get more speed out of it.

I agree with all that, but then what is the purpose of the OpenCL sieve? The RTX 2060 is not the most powerful card, but it is not bad at all. I expected a huge difference in speed.
And as ryanp says, "I don't know how else to explain an A100 being the same speed as a regular 64 core machine."
The A100 is a beast of a GPU...
Old 2022-09-09, 01:58   #719
rogue
 
 
"Mark"
Apr 2003
Between here and the

16100₈ Posts

Quote:
Originally Posted by pepi37 View Post
I agree with all that, but then what is the purpose of the OpenCL sieve? The RTX 2060 is not the most powerful card, but it is not bad at all. I expected a huge difference in speed.
And as ryanp says, "I don't know how else to explain an A100 being the same speed as a regular 64 core machine."
The A100 is a beast of a GPU...
If you compare a single GPU vs a single core on a CPU, that is where you see the difference in speed. A GPU core is not equivalent to a CPU core.
Old 2022-09-09, 13:31   #720
ryanp
 
 
Jun 2012
Boulder, CO

2³×3×19 Posts

Quote:
Originally Posted by rogue View Post
If you compare a single GPU vs a single core on a CPU, that is where you see the difference in speed. A GPU core is not equivalent to a CPU core.
I'm not really sure how that information helps. The reality is that we don't seem to have found any combination of params that actually makes srsieve2cl any faster than a simple 4- or 8-core CPU run ("-W").
Old 2022-09-09, 13:57   #721
rebirther
 
 
Sep 2011
Germany

6744₈ Posts

It looks like srsieve2cl cannot run on my HD7950:


Code:
Sieving with generic logic for p >= 1000000000
Split 27683 base 486 sequences into 27683 base 486^1 sequences.

OpenCL Error: Program build failure
       in call to clBuildProgram
       "C:\Users\user\AppData\Local\Temp\OCL2224T5.cl", line 165: warning: statement
          is unreachable
  resBM64 = mmmPowmod(resBM64, BABY_STEPS, thePrime, _q, _one);
  ^

"C:\Users\user\AppData\Local\Temp\OCL2224T5.cl", line 238: warning: statement
          is unreachable
  return 0;
  ^

Error:E013:Insufficient Private Resources!

Does anyone have a test line? I think the file is too large to handle. It was running on my old Win7 PC; the card has 3 GB of VRAM.
Old 2022-09-09, 15:16   #722
rogue
 
 
"Mark"
Apr 2003
Between here and the

2⁶·113 Posts

Quote:
Originally Posted by rebirther View Post
It looks like srsieve2cl cannot run on my HD7950:


Code:
Sieving with generic logic for p >= 1000000000
Split 27683 base 486 sequences into 27683 base 486^1 sequences.

Error:E013:Insufficient Private Resources!
Does anyone have a test line? I think the file is too large to handle. It was running on my old Win7 PC; the card has 3 GB of VRAM.
I have not seen this before. You can use -K to create multiple kernels. That will reduce how much memory is needed.
Old 2022-09-09, 15:33   #723
rogue
 
 
"Mark"
Apr 2003
Between here and the

2⁶·113 Posts

Quote:
Originally Posted by ryanp View Post
I'm not really sure how that information helps. The reality is that we don't seem to have found any combination of params that actually makes srsieve2cl any faster than a simple 4- or 8-core CPU run ("-W").
You cannot assume that xx cores on a GPU are equivalent to xx cores on a CPU. You can't even assume that a GPU core at yy GHz is the same as a CPU core at yy GHz.

There are many factors that impact the speed of the GPU. The GPU is great for parallelizing tasks. How much of a speedup you actually get depends on the size of each task and how much memory each task needs. The discrete log algorithm is much larger than typical GPU tasks and requires a lot more memory than they do. The driver is also a factor, as it is responsible for compiling the OpenCL C (in the kernel) to machine code. Some drivers are better than others at that task.

Based upon my testing across various computers with GPUs, -G1 is anywhere from 5x to 10x faster than -W1. So if -G1 is only 5x faster than -W1 for you and you compare it against -W8, then the CPU will clearly be faster; but if you use -G1 -W5, that will be twice as fast as -W5 alone.
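
The arithmetic behind that comparison can be written out as a small Python sketch. It assumes throughput simply adds up linearly across workers, which is a simplification (it ignores contention and any CPU time needed to feed the GPU); the function names are hypothetical and only the 5x to 10x range comes from the post.

```python
def combined_rate(cpu_workers, gpu_workers, gpu_speedup):
    """Throughput in units of one CPU worker, assuming linear scaling."""
    return cpu_workers + gpu_workers * gpu_speedup

def gain_from_gpu(cpu_workers, gpu_workers, gpu_speedup):
    """How many times faster CPU+GPU is than the same CPU workers alone."""
    return combined_rate(cpu_workers, gpu_workers, gpu_speedup) / cpu_workers

# -W8 alone versus -G1 alone, with the GPU worker worth 5 CPU workers.
print(combined_rate(8, 0, 5), combined_rate(0, 1, 5))

# -G1 -W5 versus -W5 alone.
print(gain_from_gpu(5, 1, 5))

# -G1 -W36 versus -W36 alone, at both ends of the 5x to 10x range.
for speedup in (5, 10):
    print(speedup, round(gain_from_gpu(36, 1, speedup), 2))
```

Under this model one GPU worker doubles a 5-worker run, but adds only about 14% to 28% on top of 36 CPU workers, which is consistent with a large multi-core machine seeing little benefit.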
Old 2022-09-09, 15:43   #724
rebirther
 
 
Sep 2011
Germany

DE4₁₆ Posts

Quote:
Originally Posted by rogue View Post
I have not seen this before. You can use -K to create multiple kernels. That will reduce how much memory is needed.

The file was too large, I think. I tested a sieve file with 2k's in it and it ran. The question is: how many k's per sieve file can srsieve2cl handle?
Old 2022-09-09, 16:56   #725
ryanp
 
 
Jun 2012
Boulder, CO

456₁₀ Posts

Quote:
Originally Posted by rogue View Post
Based upon my testing across various computers with GPUs, -G1 is anywhere from 5x to 10x faster than -W1. So if -G1 is only 5x faster than -W1 for you and you compare it against -W8, then the CPU will clearly be faster; but if you use -G1 -W5, that will be twice as fast as -W5 alone.
OK, can you suggest a combination of -g and -G that will be faster than -W 36? That's sort of been the question all along.
Old 2022-09-09, 17:57   #726
rogue
 
 
"Mark"
Apr 2003
Between here and the

7232₁₀ Posts

Quote:
Originally Posted by ryanp View Post
OK, can you suggest a combination of -g and -G that will be faster than -W 36? That's sort of been the question all along.
I cannot provide one that would be faster on the GPU alone, without using CPU workers.

I can only suggest -W36 -G1 as -W36 alone will not use the GPU. It will require trial and error to determine if you need to use a higher value for -G or the default for -g.


All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
