mersenneforum.org

mtsieve (https://www.mersenneforum.org/showthread.php?t=23042)

pepi37 2022-09-08 19:28

Did my testing show you where the problem is? My CPU is still faster than the RTX 2060!
Any other ideas? A different setup?

rogue 2022-09-08 21:43

[QUOTE=pepi37;612978]Did my testing show you where the problem is? My CPU is still faster than the RTX 2060!
Any other ideas? A different setup?[/QUOTE]

You can try a different value for -g, but remember that you are comparing a single GPU to multiple cores on a CPU.

Unlike some of the other sieves using the framework, srsieve2cl uses a lot more GPU memory because each thread has to maintain its own set of tables in memory. Discrete logs can be computed with less memory, but then the computation time for each p varies significantly. The discrete log used by srsieve2cl uses a method that "flattens the curve" for the calculation regardless of p, but requires more memory. It might be possible to modify the algorithm to use less memory in the GPU, but that could lead to other issues. One of the worst things about the current algorithm is that it has many conditionals and the remaining loops can't really be unrolled. It would likely require a completely different algorithm to get more speed out of it.
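
To make the memory versus time trade-off concrete, here is a minimal Python sketch of a textbook baby-step giant-step discrete log. It is an illustration only and not the srsieve2cl kernel: the function name `bsgs_discrete_log` and the `baby_steps` parameter are made up for this example, and the sketch assumes p is prime and does not divide the base. The baby-step table is what costs memory; shrinking it means more giant steps per p.

```python
from math import isqrt


def bsgs_discrete_log(base, target, p, baby_steps=None):
    """Return the smallest n >= 0 with base**n % p == target % p, or None.

    Assumes p is prime and base is not a multiple of p (hypothetical helper,
    not taken from mtsieve). Requires Python 3.8+ for pow() with a negative
    exponent.
    """
    order_bound = p - 1
    # Table size: about sqrt(p) by default, or a caller-chosen smaller value.
    m = baby_steps or isqrt(order_bound) + 1

    # Baby steps: remember the smallest j for each value base^j, 0 <= j < m.
    table = {}
    value = 1
    for j in range(m):
        table.setdefault(value, j)
        value = value * base % p

    # Giant steps: multiply by base^(-m) until we land in the table.
    factor = pow(base, -m, p)
    gamma = target % p
    giant_steps = (order_bound + m - 1) // m  # ceil(order_bound / m)
    for i in range(giant_steps + 1):
        if gamma in table:
            return i * m + table[gamma]
        gamma = gamma * factor % p
    return None  # target is not a power of base modulo p


if __name__ == "__main__":
    p = 101
    target = pow(2, 37, p)
    print(bsgs_discrete_log(2, target, p))                # default table size
    print(bsgs_discrete_log(2, target, p, baby_steps=3))  # tiny table, more steps
```

With the default table size the search needs about sqrt(p) table entries and at most about sqrt(p) giant steps. With `baby_steps=3` the table is tiny, but up to roughly (p-1)/3 giant steps may be needed, so the run time per p swings much more widely. On a GPU, every thread working on its own p would need its own table, which is the memory pressure described above.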

pepi37 2022-09-08 22:43

[QUOTE=rogue;612989]You can try a different value for -g, but remember that you are comparing a single GPU to multiple cores on a CPU.

Unlike some of the other sieves using the framework, srsieve2cl uses a lot more GPU memory because each thread has to maintain its own set of tables in memory. Discrete logs can be computed with less memory, but then the computation time for each p varies significantly. The discrete log used by srsieve2cl uses a method that "flattens the curve" for the calculation regardless of p, but requires more memory. It might be possible to modify the algorithm to use less memory in the GPU, but that could lead to other issues. One of the worst things about the current algorithm is that it has many conditionals and the remaining loops can't really be unrolled. It would likely require a completely different algorithm to get more speed out of it.[/QUOTE]


I agree with all that, but what is the purpose of the OpenCL sieve? The RTX 2060 is not the most powerful card, but it is not bad at all. I expected a huge difference in speed.
And as ryanp says, "I don't know how else to explain an A100 being the same speed as a regular 64 core machine."
The A100 is a beast of a GPU card...

rogue 2022-09-09 01:58

[QUOTE=pepi37;612990]I agree with all that, but what is the purpose of the OpenCL sieve? The RTX 2060 is not the most powerful card, but it is not bad at all. I expected a huge difference in speed.
And as ryanp says, "I don't know how else to explain an A100 being the same speed as a regular 64 core machine."
The A100 is a beast of a GPU card...[/QUOTE]

If you compare a single GPU vs a single core on a CPU, that is where you see the difference in speed. A GPU core is not equivalent to a CPU core.

ryanp 2022-09-09 13:31

[QUOTE=rogue;613005]If you compare a single GPU vs a single core on a CPU, that is where you see the difference in speed. A GPU core is not equivalent to a CPU core.[/QUOTE]

I'm not really sure how that information helps. The reality is that we don't seem to have found any combination of params that actually makes [C]srsieve2cl[/C] any faster than a simple 4- or 8-core CPU run ("-W").

rebirther 2022-09-09 13:57

It looks like srsieve2cl cannot run on my HD7950:


[CODE]Sieving with generic logic for p >= 1000000000
Split 27683 base 486 sequences into 27683 base 486^1 sequences.

OpenCL Error: Program build failure
in call to clBuildProgram
"C:\Users\user\AppData\Local\Temp\OCL2224T5.cl", line 165: warning: statement
is unreachable
resBM64 = mmmPowmod(resBM64, BABY_STEPS, thePrime, _q, _one);
^

"C:\Users\user\AppData\Local\Temp\OCL2224T5.cl", line 238: warning: statement
is unreachable
return 0;
^

Error:E013:Insufficient Private Resources![/CODE]


Does anyone have a test line? I think the file is too large to handle. It was running on my old Win7 PC; the card has 3 GB of VRAM.

rogue 2022-09-09 15:16

[QUOTE=rebirther;613056]It looks like srsieve2cl cannot run on my HD7950:


[CODE]Sieving with generic logic for p >= 1000000000
Split 27683 base 486 sequences into 27683 base 486^1 sequences.

Error:E013:Insufficient Private Resources![/CODE]

Does anyone have a test line? I think the file is too large to handle. It was running on my old Win7 PC; the card has 3 GB of VRAM.[/QUOTE]

I have not seen this before. You can use -K to create multiple kernels. That will reduce how much memory is needed.

rogue 2022-09-09 15:33

[QUOTE=ryanp;613054]I'm not really sure how that information helps. The reality is that we don't seem to have found any combination of params that actually makes [C]srsieve2cl[/C] any faster than a simple 4- or 8-core CPU run ("-W").[/QUOTE]

You cannot assume that xx cores on a GPU are equivalent to xx cores on a CPU. You can't even assume that a GPU core at yy GHz is the same as a CPU core at yy GHz.

There are many factors that impact the speed of the GPU. The GPU is great for parallelizing tasks. How much of a speed bump you actually get is dependent upon the size of each task and how much memory each task needs. The discrete log algorithm is much larger than typical GPU tasks and requires a lot more memory than typical GPU tasks. The driver is also a factor as it is responsible for compiling the OpenCL C (in the kernel) to machine code. Some drivers are better than others at the task.

Based upon my testing across various computers with GPUs, -G1 is anywhere from 5x to 10x faster than -W1. So if -G1 is only 5x faster than -W1 for you and you want to use -W8, then the CPU will clearly be faster, but if you use -G1 -W5, that will be twice as fast as -W5 alone.
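
The arithmetic behind those numbers can be sketched in a few lines of Python. This is a back-of-the-envelope model only: the function name `combined_throughput` is made up for this example, and it assumes throughput adds up perfectly (no contention between the GPU worker and the CPU workers), which real runs will not match exactly.

```python
def combined_throughput(gpu_ratio, cpu_workers, gpu_workers=1):
    """Estimated throughput in units of one CPU worker (-W1).

    gpu_ratio is how many times faster one GPU worker (-G1) is than -W1.
    Assumes ideal, purely additive scaling (an assumption, not a measurement).
    """
    return cpu_workers + gpu_workers * gpu_ratio


gpu_only = combined_throughput(5, cpu_workers=0)  # -G1 at 5x
cpu_only_8 = 8                                    # -W8
mixed = combined_throughput(5, cpu_workers=5)     # -G1 -W5

print(gpu_only < cpu_only_8)  # True: -W8 beats -G1 alone at 5x
print(mixed / 5)              # 2.0: -G1 -W5 is twice -W5 alone

# Adding one GPU worker to a 36-worker CPU run, at the 5x and 10x ends.
for ratio in (5, 10):
    print(ratio, combined_throughput(ratio, cpu_workers=36) / 36)
```

Under the same assumptions, adding -G1 to a -W36 run would only be worth roughly 1.14x to 1.28x over -W36 alone, which is why a single GPU looks unimpressive next to a machine with many cores.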

rebirther 2022-09-09 15:43

[QUOTE=rogue;613061]I have not seen this before. You can use -K to create multiple kernels. That will reduce how much memory is needed.[/QUOTE]


I think the file was too large. I tested a sieve file with 2k's in it and it ran. The question is: how many k's per sieve file can srsieve2cl handle?

ryanp 2022-09-09 16:56

[QUOTE=rogue;613063]Based upon my testing across various computers with GPUs, -G1 is anywhere from 5x to 10x faster than -W1. So if -G1 is only 5x faster than -W1 for you and you want to use -W8, then the CPU will clearly be faster, but if you use -G1 -W5, that will be twice as fast as -W5 alone.[/QUOTE]

OK, can you suggest a combination of [C]-g[/C] and [C]-G[/C] that will be faster than -W 36? That's sort of been the question all along.

rogue 2022-09-09 17:57

[QUOTE=ryanp;613071]OK, can you suggest a combination of [C]-g[/C] and [C]-G[/C] that will be faster than -W 36? That's sort of been the question all along.[/QUOTE]

I cannot provide one that is faster on the GPU without using CPU workers.

I can only suggest -W36 -G1 as -W36 alone will not use the GPU. It will require trial and error to determine if you need to use a higher value for -G or the default for -g.

