mersenneforum.org mtsieve

2022-09-09, 18:02   #727
rogue

"Mark"
Apr 2003
Between here and the

1B8E₁₆ Posts

Quote:
 Originally Posted by rebirther The file was too large, I think. I tested a sieve file with 2 k's in it and it was running. The question is: how many k's per sieve file can srsieve2cl handle?
-K will split the number of sequences. In your case you have 27683 sequences, so -K2 will use 13842 and 13841 sequences per kernel. -K3 will split 27683 across 3 kernels. On the GPUs I have tested, I have been able to do over 6000 sequences per kernel. For you that would be -K5. The max is under 10000 for my GPUs, but I don't know what the exact max is.

I have observed that the higher the value for -K, the worse the overall performance will be compared to the CPU. It will require some trial and error on your part to see which performs better.

2022-09-09, 20:56   #728
rebirther

Sep 2011
Germany

2⁷×3³ Posts

Quote:
 Originally Posted by rogue -K will split the number of sequences. In your case you have 27683 sequences, so -K2 will use 13842 and 13841 sequences per kernel. -K3 will split 27683 across 3 kernels. On the GPUs I have tested, I have been able to do over 6000 sequences per kernel. For you that would be -K5. The max is under 10000 for my GPUs, but I don't know what the exact max is. I have observed that the higher the value for -K, the worse the overall performance will be compared to the CPU. It will require some trial and error on your part to see which performs better.

I have tried -K1 through -K5; all end up with the same error on a base with 4800 k's.

2022-09-09, 23:47   #729
rogue

"Mark"
Apr 2003
Between here and the

2·3,527 Posts

Quote:
 Originally Posted by rebirther I have tried -K1 through -K5; all end up with the same error on a base with 4800 k's.
You can try higher values for -K to see at what point it starts to work.

BTW, did you download the latest srsieve2cl.exe from SourceForge? It is not in the mtsieve.7z file. The previous one has a bug.

2022-09-10, 06:06   #730
rebirther

Sep 2011
Germany

110110000000₂ Posts

Quote:
 Originally Posted by rogue You can try higher values for -K to see at what point it starts to work. BTW, did you download the latest srsieve2cl.exe from SourceForge? It is not in the mtsieve.7z file. The previous one has a bug.

Yes, I have the latest version, and no, it is also not working with -K10.

2022-10-09, 03:35   #731
rogue

"Mark"
Apr 2003
Between here and the

1101110001110₂ Posts

I ran into something interesting today. Apparently if a kernel requires a CL_KERNEL_PRIVATE_MEM_SIZE that is too large, then the kernel will fail. On Windows you get an OpenCL error when the kernel executes, but on OS X you get an "Abort trap: 6" error. With a little debugging I see an error under the covers that implies that there isn't enough memory. Unfortunately, neither of these results tells you what the real problem is.

Each GPU seems to have a different limit for how much private memory a kernel can use. For some GPUs I see that limit at 64 KB; for others I see it at 512 KB. When you run with -H it will output this value to you.

This seems to only impact srsieve2cl when sieving with multiple sequences, because there are arrays in the main kernel whose sizes depend upon the number of sequences being sieved. This means that using -K will be necessary in order to keep CL_KERNEL_PRIVATE_MEM_SIZE under the limit. The main driver in srsieve2cl is the number of baby steps for the BSGS algorithm, as this is used to determine the number of elements in the arrays. It might be possible to reduce the number of baby steps, which could then increase sieving speed if one can use a lower value for -K.

Unfortunately I cannot find any documentation that tells me what the limit is, although it is possible that I haven't dug far enough yet. If I knew it, then I could adjust -K automatically to avoid failures at runtime, since the failures are mysterious to end users of the software.
2022-10-10, 00:14   #732
ruckuslemur

Sep 2022

5₁₀ Posts

There's no such thing as CL_DEVICE_PRIVATE_MEM_SIZE (the kernel limit), so one can either make the assumption that it's > 16 KB on most GPUs or use local/global memory for the arrays, depending on which one performs the best.

Last fiddled with by ruckuslemur on 2022-10-10 at 01:00
2022-10-10, 12:30   #733
rogue

"Mark"
Apr 2003
Between here and the

1101110001110₂ Posts

Sorry, I meant CL_KERNEL_PRIVATE_MEM_SIZE. These arrays are local arrays in the kernel itself. Using global arrays is possible, but performance would take a hit. It would still be limited by the amount of memory available to the GPU.
2022-10-10, 17:04   #734
ruckuslemur

Sep 2022

5 Posts

Quote:
 Originally Posted by rogue Sorry, I meant CL_KERNEL_PRIVATE_MEM_SIZE. These arrays are local arrays in the kernel itself. Using global arrays is possible, but performance would take a hit. It would still be limited by the amount of memory available to the GPU.
I apologize for having been vague:

What I meant is that the out-of-resources error is due to the program's attempt to allocate more than the maximum amount of private memory, which, unlike local or global memory, cannot be queried (hence there is no CL_DEVICE_PRIVATE_MEM_SIZE). On AMD platforms, the error is likely thrown when a kernel with __private declarations is initialized after other kernels have used all available private memory and have spilled registers into global memory (https://community.amd.com/t5/archive...it/td-p/166604); however, this behavior, much like the amount of private memory, is strictly implementation-dependent. The only real solution to the crash is to minimize private memory usage, either by limiting it to the arbitrary value of 16 KB per work item (https://stackoverflow.com/questions/...mory-in-opencl) and spreading the work across multiple kernels (which won't necessarily come out to 16 KB per kernel and may still cause the error), or by attempting to allocate the arrays elsewhere, which as you said has the potential to make the program extremely inefficient, but shouldn't produce a dramatic slowdown assuming that some register spilling is already happening.

Last fiddled with by ruckuslemur on 2022-10-10 at 17:29

2022-10-10, 18:01   #735
rogue

"Mark"
Apr 2003
Between here and the

2×3,527 Posts

Quote:
 Originally Posted by ruckuslemur I apologize for having been vague: What I meant is that the out of resources error is due to the program's attempt to allocate more than the maximum amount of private memory, which, unlike local or global memory, cannot be queried (hence there is no CL_DEVICE_PRIVATE_MEM_SIZE). On AMD platforms, the error is likely thrown when a kernel with __private declarations is initialized after other kernels have used all available private memory and have spilled registers in global memory (https://community.amd.com/t5/archive...it/td-p/166604), however this behavior, much like the amount of private memory, is strictly implementation-dependent. The only real solution to the crash is to minimize private memory usage either by limiting it to the arbitrary value of 16 KB per work item (https://stackoverflow.com/questions/...mory-in-opencl) spread across multiple kernels (which isn't necessarily 16KB per kernel and may still cause the error for the same reason) or by attempting to allocate the arrays elsewhere, which as you said has the potential to make the program extremely inefficient but shouldn't produce a dramatic slowdown assuming that some register spilling is already happening.
These are the variables in the kernel that require the memory:

Code:
ushort  h_table[HASH_SIZE];
ushort  h_olist[HASH_ELEMENTS];
ulong   h_BJ64[HASH_ELEMENTS+1];
ulong   resBDCK64[SUBSEQUENCES];
and they exist for the entire execution of the kernel. There is nothing I can do about that.

These values are set via #define statements prepended to the kernel in the C++ code. HASH_ELEMENTS is the number of baby steps. HASH_SIZE is the smallest power of 2 greater than HASH_ELEMENTS. SUBSEQUENCES is computed based upon the sequences. Sometimes sequences = subsequences, but not necessarily. Subsequences cannot be split across kernels.

Removing the variables would require a different discrete log algorithm, but as I have stated elsewhere that can be problematic in other ways. Some discrete log algorithms require a lot less memory, but the execution time for each distinct prime varies greatly. That isn't good when we are doing discrete logs in parallel as the slowest one will dictate the speed of the application. Others are much faster, but they require a lot more memory. So we have a "middle of the road" approach that uses a little more memory, but flattens the execution time for each distinct prime.

The end user can specify -K and -b (a new command line switch) to affect the calculation for the #defined values, but it will be trial and error (for the end user) to determine which values work and which values provide the best throughput. For example, many combinations will work, but some will yield higher p/sec than others. In most cases the default will be best.

-K is used to split the sequences across multiple kernels. The default is 1 which means that it will try to do the discrete log for all sequences in a single kernel execution. Using -K 2 will split sequences across 2 kernels, and so on.

-b is used in the calculation of the number of baby steps (and thus giant steps). The default is 1.0; by reducing it one can reduce both HASH_SIZE and HASH_ELEMENTS, thus reducing memory consumption in the kernel. Giant steps are more expensive than baby steps, however, so blindly reducing the number of baby steps is not always the best approach.

As I stated, and based upon your response, I cannot detect in code whether the kernel will fail when the -K and -b values are not good. That sucks for end users. Fortunately, by using the -H command line switch, the user can probably figure out the maximum value of CL_KERNEL_PRIVATE_MEM_SIZE for their GPU before errors occur and adjust -K and -b to stay below that value.

I have only seen this issue when trying to sieve thousands of sequences concurrently. I'm one of the few who has been doing that. I have sieved over 50,000 sequences at a time. That required -K3 on an NVIDIA RTX A5000, where 500 KB seems to be the upper bound of CL_KERNEL_PRIVATE_MEM_SIZE before failure. On an AMD Radeon Pro 5700 the upper bound for CL_KERNEL_PRIVATE_MEM_SIZE is closer to 50 KB.

2022-10-19, 21:03   #736
rogue

"Mark"
Apr 2003
Between here and the

2·3,527 Posts

I posted srsieve2cl at SourceForge. Here is what I added:

Code:
srsieve2/srsieve2cl: version 1.6.4
Added -b.  This parameter is used to calculate the number of baby steps and
giant steps for the discrete log.  The default is 1.0.  As this value
increases, so does the number of baby steps while decreasing the number of
giant steps.  The parameter has an impact on CL_KERNEL_PRIVATE_MEM_SIZE in
GPU kernels (output when using -H).  If that value is too high, then
srsieve2cl will terminate.  The reasons for termination vary depending upon
the OS.  An example of usage would be when you are sieving thousands of
sequences and have to use -K.  In one test I had to use -K5 (the GPU has
about 4 GB of memory).  When using -K3 -b0.2 there is a nearly 40%
performance boost.  I am unaware (at this time) of how to detect the
maximum amount of memory available to a kernel, so there is no way to
auto-set -K and -b to the optimal values at runtime.  It will require
trial and error for the end user.  The impact of changing from the
default value has not been determined for CPU workers.
2022-10-27, 09:38   #737
SethTro

"Seth"
Apr 2019

752₈ Posts

Quote:
 Originally Posted by rogue srsieve2 does not work correctly when d > 1. I don't know how easy that will be to fix. I thought it was correct and working but is clearly not. I do see a separate issue in factor validation when abs(c) > 1, so I will fix that.
Did this issue ever get fixed? I think I'm running into it again (but possibly a different issue).

Code:
$ ./srsieve2 -W1 -p 10 -P 1e8 -n 10 -N 300000 -s "1*9^n-7"
srsieve2 v1.6.4, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic for p >= 10
Sieve started: 10 < p < 1e8 with 299991 terms (10 < n < 300000, k*9^n-7) (expecting 262492 factors)
Sieving with single sequence c=1 logic for p >= 257
BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720
Split 1 base 9 sequence into 132 base 9^180 sequences.
Legendre summary:  Approximately 2 B needed for Legendre tables

...

518400 bytes used for congruent q and ladder indices
263200 bytes used for congruent qs and ladders
Fatal Error:  Invalid factor: 1*9^39060-7 mod 1303 = 1297
I'm sieving "(3^n-7)/2", which I think I can hack to avoid d by testing 9^n-7 and disabling the division-by-2 test. I'm excited that srsieve2 might be 10-100x faster than my hand-coded sieve.
