mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   mtsieve (https://www.mersenneforum.org/showthread.php?t=23042)

rogue 2022-09-09 18:02

[QUOTE=rebirther;613065]The file was too large, I think; I tested a sievefile with 2 k's in it and it was running. The question is how many k's per sievefile srsieve2cl can handle.[/QUOTE]

-K will split the number of sequences. In your case you have 27683 sequences, so -K2 will use 13842 and 13841 sequences per kernel. -K3 will split 27683 across 3 kernels. On the GPUs I have tested, I have been able to do over 6000 sequences per kernel. For you that would be -K5. The max is under 10000 for my GPUs, but I don't know what the exact max is.

I have observed that the higher the value for -K, the worse the overall performance will be compared to the CPU. It will require some trial and error on your part to see which performs better.

rebirther 2022-09-09 20:56

[QUOTE=rogue;613078]-K will split the number of sequences. In your case you have 27683 sequences, so -K2 will use 13842 and 13841 sequences per kernel. -K3 will split 27683 across 3 kernels. On the GPUs I have tested, I have been able to do over 6000 sequences per kernel. For you that would be -K5. The max is under 10000 for my GPUs, but I don't know what the exact max is.

I have observed that the higher the value for -K, the worse the overall performance will be compared to the CPU. It will require some trial and error on your part to see which performs better.[/QUOTE]


I have tried -K1 through -K5; all end with the same error on a base with 4800 k's.

rogue 2022-09-09 23:47

[QUOTE=rebirther;613091]I have tried -K1 through -K5; all end with the same error on a base with 4800 k's.[/QUOTE]

You can try higher values for -K to see at what point it starts to work.

BTW, did you download the latest srsieve2cl.exe from sourceforge? It is not in the mtsieve.7z file. The previous one has a bug.

rebirther 2022-09-10 06:06

[QUOTE=rogue;613105]You can try higher values for -K to see at what point it starts to work.

BTW, did you download the latest srsieve2cl.exe from sourceforge? It is not in the mtsieve.7z file. The previous one has a bug.[/QUOTE]


Yes, I have the latest version, and no, it is also not working with -K10.

rogue 2022-10-09 03:35

I ran into something interesting today. Apparently if the kernel requires a CL_KERNEL_PRIVATE_MEM_SIZE that is too large, then the kernel will fail. On Windows you get an OpenCL error when the kernel executes, but on OS X you get an "Abort trap: 6" error. With a little debugging I see an error under the covers that implies that there isn't enough memory. Unfortunately, neither of these results tells you what the real problem is.

Each GPU seems to have a different limit for how much private memory a kernel can use. For some GPUs I see that limit at 64KB; on others I see it at 512KB. When you run with -H it will output this value.
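For reference, the value reported by -H can be read through the standard OpenCL API once the kernel has been built. This fragment assumes an already-created cl_kernel and cl_device_id; it is not taken from the mtsieve source, and it cannot run without an OpenCL device, so it is shown only as a sketch.

```c
#include <stdio.h>
#include <CL/cl.h>   /* on OS X this is <OpenCL/opencl.h> */

/* Query how much private memory a compiled kernel uses per work item.
   CL_KERNEL_PRIVATE_MEM_SIZE is only known after the kernel is built
   for a specific device. */
static cl_ulong kernel_private_mem(cl_kernel kernel, cl_device_id device) {
    cl_ulong bytes = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(bytes), &bytes, NULL);
    return bytes;
}
```

Note this reports usage, not the device's limit; as discussed below, there is no corresponding device query for the maximum.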

This seems to only impact srsieve2cl when sieving with multiple sequences, because there are arrays in the main kernel whose sizes vary depending upon the number of sequences sieved. This means that using -K will be necessary in order to keep CL_KERNEL_PRIVATE_MEM_SIZE under the limit. The main driver in srsieve2cl is the number of baby steps for the BSGS algorithm, as this is used to determine the number of elements in the arrays. It might be possible to reduce the number of baby steps, which could then increase sieving speed if one can use a lower value for -K.

Unfortunately I cannot find any documentation that tells me what the limit is, although it is possible that I haven't dug far enough yet. If I knew, then I could adjust -K automatically to avoid failures at runtime, since the failures are mysterious to end users of the software.

ruckuslemur 2022-10-10 00:14

There's no such thing as CL_DEVICE_PRIVATE_MEM_SIZE (a device-level private memory limit), so one can either assume that it's > 16KB on most GPUs or use local/global memory for the arrays, depending on which one performs best.

rogue 2022-10-10 12:30

Sorry, I meant CL_KERNEL_PRIVATE_MEM_SIZE.

These arrays are local arrays in the kernel itself. Using global arrays is possible, but performance would take a hit. It would still be limited by the amount of memory available to the GPU.

ruckuslemur 2022-10-10 17:04

[QUOTE=rogue;615246]Sorry, I meant CL_KERNEL_PRIVATE_MEM_SIZE.

These arrays are local arrays in the kernel itself. Using global arrays is possible, but performance would take a hit. It would still be limited by the amount of memory available to the GPU.[/QUOTE]
I apologize for having been vague:

What I meant is that the out-of-resources error is due to the program attempting to allocate more than the maximum amount of private memory, which, unlike local or global memory, cannot be queried (hence there is no CL_DEVICE_PRIVATE_MEM_SIZE). On AMD platforms, the error is likely thrown when a kernel with __private declarations is initialized after other kernels have used all available private memory and have spilled registers into global memory ([URL]https://community.amd.com/t5/archives-discussions/silent-private-memory-size-limit/td-p/166604[/URL]); however, this behavior, much like the amount of private memory itself, is strictly implementation-dependent. The only real solution to the crash is to minimize private memory usage, either by limiting it to the arbitrary value of 16 KB per work item ([URL]https://stackoverflow.com/questions/22083507/is-there-a-maximum-limit-to-private-memory-in-opencl[/URL]) and spreading the work across multiple kernels (which won't necessarily come out to 16 KB per kernel and may still cause the error), or by allocating the arrays elsewhere, which, as you said, could make the program much less efficient but shouldn't produce a dramatic slowdown assuming that some register spilling is already happening.

rogue 2022-10-10 18:01

[QUOTE=ruckuslemur;615266]I apologize for having been vague:

What I meant is that the out-of-resources error is due to the program attempting to allocate more than the maximum amount of private memory, which, unlike local or global memory, cannot be queried (hence there is no CL_DEVICE_PRIVATE_MEM_SIZE). On AMD platforms, the error is likely thrown when a kernel with __private declarations is initialized after other kernels have used all available private memory and have spilled registers into global memory ([url]https://community.amd.com/t5/archives-discussions/silent-private-memory-size-limit/td-p/166604[/url]); however, this behavior, much like the amount of private memory itself, is strictly implementation-dependent. The only real solution to the crash is to minimize private memory usage, either by limiting it to the arbitrary value of 16 KB per work item ([url]https://stackoverflow.com/questions/22083507/is-there-a-maximum-limit-to-private-memory-in-opencl[/url]) and spreading the work across multiple kernels (which won't necessarily come out to 16 KB per kernel and may still cause the error), or by allocating the arrays elsewhere, which, as you said, could make the program much less efficient but shouldn't produce a dramatic slowdown assuming that some register spilling is already happening.[/QUOTE]

These are the variables in the kernel that require the memory:

[code]
ushort h_table[HASH_SIZE];
ushort h_olist[HASH_ELEMENTS];
ulong h_BJ64[HASH_ELEMENTS+1];
ulong resBDCK64[SUBSEQUENCES];
[/code]

and they exist for the entire execution of the kernel. There is nothing I can do about that.

These values are set via #define statements prepended to the kernel in C++ code. HASH_ELEMENTS is the number of baby steps. HASH_SIZE is the smallest power of 2 greater than HASH_ELEMENTS. SUBSEQUENCES is computed based upon the sequences. Sometimes sequences = subsequences, but not necessarily. Subsequences cannot be split across kernels.

Removing the variables would require a different discrete log algorithm, but as I have stated elsewhere that can be problematic in other ways. Some discrete log algorithms require a lot less memory, but the execution time for each distinct prime varies greatly. That isn't good when we are doing discrete logs in parallel as the slowest one will dictate the speed of the application. Others are much faster, but they require a lot more memory. So we have a "middle of the road" approach that uses a little more memory, but flattens the execution time for each distinct prime.

The end user can specify -K and -b (a new command line switch) to affect the calculation for the #defined values, but it will be trial and error (for the end user) to determine which values work and which values provide the best throughput. For example, many combinations will work, but some will yield higher p/sec than others. In most cases the default will be best.

-K is used to split the sequences across multiple kernels. The default is 1 which means that it will try to do the discrete log for all sequences in a single kernel execution. Using -K 2 will split sequences across 2 kernels, and so on.

-b is used in the calculation of the number of baby steps (and thus giant steps). The default is 1.0; reducing it decreases the number of baby steps, which reduces both HASH_SIZE and HASH_ELEMENTS and thus memory consumption in the kernel. Giant steps are more expensive than baby steps, so reducing the number of baby steps alone is not the best approach.

As I stated and based upon your response, I cannot detect in code that the kernel will fail when -K and -b are not good. That sucks for end users. Fortunately by using the -H command line switch the user can probably figure out the maximum value for CL_KERNEL_PRIVATE_MEM_SIZE for their GPU before errors occur and adjust -K and -b to stay below that value.

I have only seen this issue when trying to sieve thousands of sequences concurrently. I'm one of the few who has been doing that. I have sieved over 50,000 sequences at a time. It required -K3 on an NVIDIA RTX A5000, and 500 KB seems to be the upper bound of CL_KERNEL_PRIVATE_MEM_SIZE before failure. On an AMD Radeon Pro 5700 the upper bound for CL_KERNEL_PRIVATE_MEM_SIZE is closer to 50 KB.

rogue 2022-10-19 21:03

I posted srsieve2cl at sourceforge. Here is what I added:

[code]
srsieve2/srsieve2cl: version 1.6.4
Added -b. This parameter is used to calculate the number of baby steps and giant steps
for the discrete log. The default is 1.0. As this value increases, so does the number
of baby steps while decreasing the number of giant steps.

The parameter has an impact on CL_KERNEL_PRIVATE_MEM_SIZE in GPU kernels (output when
using -H). If that value is too high, then srsieve2cl will terminate. The reasons
for termination vary depending on the OS.

An example of usage would be when you are sieving thousands of sequences and have to use -K.
In one test I had to use -K5 (the GPU has about 4 GB of memory). When using -K3 -b0.2 there
is a nearly 40% performance boost. I am unaware (at this time) of how to detect the maximum
amount of memory available to a kernel so there is no way to auto-set -K and -b to the
optimal values at runtime. It will require trial and error for the end user.

The impact of changing from the default value has not been determined for CPU workers.
[/code]

SethTro 2022-10-27 09:38

[QUOTE=rogue;585589]srsieve2 does not work correctly when d > 1. I don't know how easy that will be to fix. I thought it was correct and working but is clearly not.

I do see a separate issue in factor validation when abs(c) > 1, so I will fix that.[/QUOTE]

Did this issue ever get fixed? I think I'm running into it again (but possibly a different issue).

[CODE]
$ ./srsieve2 -W1 -p 10 -P 1e8 -n 10 -N 300000 -s "1*9^n-7"
srsieve2 v1.6.4, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic for p >= 10
Sieve started: 10 < p < 1e8 with 299991 terms (10 < n < 300000, k*9^n-7) (expecting 262492 factors)
Sieving with single sequence c=1 logic for p >= 257
BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720
Split 1 base 9 sequence into 132 base 9^180 sequences.
Legendre summary: Approximately 2 B needed for Legendre tables

...

518400 bytes used for congruent q and ladder indices
263200 bytes used for congruent qs and ladders
Fatal Error: Invalid factor: 1*9^39060-7 mod 1303 = 1297
[/CODE]I'm sieving "(3^n-7)/2", which I think I can hack to avoid d by testing 9^n-7 and disabling the division-by-2 test. I'm excited that srsieve2 might be 10-100x faster than my hand-coded sieve.

