mersenneforum.org  

Old 2022-09-09, 18:02   #727
rogue
 
"Mark"
Apr 2003
Between here and the

1B8E₁₆ Posts

Quote:
Originally Posted by rebirther View Post
The file was too large, I think; I tested a sieve file with 2 k's in it and it was running. The question is: how many k's per sieve file can srsieve2cl handle?
-K will split the number of sequences. In your case you have 27683 sequences, so -K2 will use 13842 and 13841 sequences per kernel. -K3 will split the 27683 across 3 kernels. On the GPUs I have tested, I have been able to do over 6000 sequences per kernel. For you that would be -K5. The max is under 10000 for my GPUs, but I don't know what the exact max is.

I have observed that the higher the value for -K, the worse the overall performance will be compared to the CPU. It will require some trial and error on your part to see which performs better.
Old 2022-09-09, 20:56   #728
rebirther
 
Sep 2011
Germany

2⁷×3³ Posts

Quote:
Originally Posted by rogue View Post
-K will split the number of sequences. In your case you have 27683 sequences, so -K2 will use 13842 and 13841 sequences per kernel. -K3 will split the 27683 across 3 kernels. On the GPUs I have tested, I have been able to do over 6000 sequences per kernel. For you that would be -K5. The max is under 10000 for my GPUs, but I don't know what the exact max is.

I have observed that the higher the value for -K, the worse the overall performance will be compared to the CPU. It will require some trial and error on your part to see which performs better.

I have tried -K1 through -K5; all end with the same error with a base of 4800 k's.
Old 2022-09-09, 23:47   #729
rogue
 
"Mark"
Apr 2003
Between here and the

2·3,527 Posts

Quote:
Originally Posted by rebirther View Post
I have tried -K1 through -K5; all end with the same error with a base of 4800 k's.
You can try higher values for -K to see at what point it starts to work.

BTW, did you download the latest srsieve2cl.exe from SourceForge? It is not in the mtsieve.7z file. The previous one has a bug.
Old 2022-09-10, 06:06   #730
rebirther
 
Sep 2011
Germany

110110000000₂ Posts

Quote:
Originally Posted by rogue View Post
You can try higher values for -K to see at what point it starts to work.

BTW, did you download the latest srsieve2cl.exe from SourceForge? It is not in the mtsieve.7z file. The previous one has a bug.

Yes, I have the latest version, and no, it is also not working with -K10.
Old 2022-10-09, 03:35   #731
rogue
 
"Mark"
Apr 2003
Between here and the

1101110001110₂ Posts

I ran into something interesting today. Apparently if the kernel requires a CL_KERNEL_PRIVATE_MEM_SIZE that is too large, then the kernel will fail. On Windows you get an OpenCL error when the kernel executes, but on OS X you get an "Abort trap: 6" error. With a little debugging I see an error under the covers that implies that there isn't enough memory. Unfortunately, neither of these results tells you what the real problem is.

Each GPU seems to have a different limit for how much private memory a kernel can use. For some GPUs I see that limit at 64 KB; on others I see it at 512 KB. When you run with -H it will output this value to you.
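
For anyone who wants to see where that number comes from, here is a minimal host-side sketch of how CL_KERNEL_PRIVATE_MEM_SIZE is queried in OpenCL (presumably what the -H output is based on). It assumes you already have a built cl_kernel and the cl_device_id it will run on; the reportPrivateMemUsage name is just for the example and is not part of mtsieve.

Code:
   #include <stdio.h>
   #ifdef __APPLE__
   #include <OpenCL/opencl.h>
   #else
   #include <CL/cl.h>
   #endif

   /* Print how much private memory a compiled kernel needs on a device.
      This is the value srsieve2cl reports with -H; if it is above the
      (undocumented) per-GPU limit, the kernel fails at runtime. */
   void reportPrivateMemUsage(cl_kernel kernel, cl_device_id device)
   {
      cl_ulong privateBytes = 0;
      cl_int status = clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(privateBytes), &privateBytes, NULL);

      if (status == CL_SUCCESS)
         printf("CL_KERNEL_PRIVATE_MEM_SIZE = %llu bytes\n", (unsigned long long) privateBytes);
      else
         printf("clGetKernelWorkGroupInfo failed with error %d\n", (int) status);
   }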

This seems to only impact srsieve2cl when sieving multiple sequences, because there are arrays in the main kernel whose sizes depend upon the number of sequences sieved. This means that using -K will be necessary in order to keep CL_KERNEL_PRIVATE_MEM_SIZE under the limit. The main driver in srsieve2cl is the number of baby steps for the BSGS algorithm, as this is used to determine the number of elements in the arrays. It might be possible to reduce the number of baby steps, which could then increase sieving speed if one can use a lower value for -K.

Unfortunately I cannot find any documentation that tells me what the limit is, although it is possible that I haven't dug far enough yet. If I knew it, I could adjust -K automatically to avoid failures at runtime, since the failures are mysterious to end users of the software.
Old 2022-10-10, 00:14   #732
ruckuslemur
 
Sep 2022

5₁₀ Posts

There's no such thing as CL_DEVICE_PRIVATE_MEM_SIZE (i.e., the private memory limit cannot be queried from the device), so one can either assume that it's > 16 KB on most GPUs or use local/global memory for the arrays, depending on which one performs best.

Last fiddled with by ruckuslemur on 2022-10-10 at 01:00
Old 2022-10-10, 12:30   #733
rogue
 
"Mark"
Apr 2003
Between here and the

1101110001110₂ Posts

Sorry, I meant CL_KERNEL_PRIVATE_MEM_SIZE.

These arrays are local arrays in the kernel itself. Using global arrays is possible, but performance would take a hit. It would still be limited by the amount of memory available to the GPU.
Old 2022-10-10, 17:04   #734
ruckuslemur
 
Sep 2022

5 Posts

Quote:
Originally Posted by rogue View Post
Sorry, I meant CL_KERNEL_PRIVATE_MEM_SIZE.

These arrays are local arrays in the kernel itself. Using global arrays is possible, but performance would take a hit. It would still be limited by the amount of memory available to the GPU.
I apologize for having been vague:

What I meant is that the out-of-resources error is due to the program's attempt to allocate more than the maximum amount of private memory, which, unlike local or global memory, cannot be queried (hence there is no CL_DEVICE_PRIVATE_MEM_SIZE). On AMD platforms, the error is likely thrown when a kernel with __private declarations is initialized after other kernels have used all available private memory and have spilled registers into global memory (https://community.amd.com/t5/archive...it/td-p/166604); however, this behavior, much like the amount of private memory, is strictly implementation-dependent. The only real solution to the crash is to minimize private memory usage, either by limiting it to the arbitrary value of 16 KB per work item (https://stackoverflow.com/questions/...mory-in-opencl) and spreading the work across multiple kernels (which won't necessarily come out to 16 KB per kernel and may still cause the error), or by attempting to allocate the arrays elsewhere, which, as you said, has the potential to make the program extremely inefficient but shouldn't produce a dramatic slowdown assuming that some register spilling is already happening.

Last fiddled with by ruckuslemur on 2022-10-10 at 17:29
Old 2022-10-10, 18:01   #735
rogue
 
"Mark"
Apr 2003
Between here and the

2×3,527 Posts

Quote:
Originally Posted by ruckuslemur View Post
I apologize for having been vague:

What I meant is that the out-of-resources error is due to the program's attempt to allocate more than the maximum amount of private memory, which, unlike local or global memory, cannot be queried (hence there is no CL_DEVICE_PRIVATE_MEM_SIZE). On AMD platforms, the error is likely thrown when a kernel with __private declarations is initialized after other kernels have used all available private memory and have spilled registers into global memory (https://community.amd.com/t5/archive...it/td-p/166604); however, this behavior, much like the amount of private memory, is strictly implementation-dependent. The only real solution to the crash is to minimize private memory usage, either by limiting it to the arbitrary value of 16 KB per work item (https://stackoverflow.com/questions/...mory-in-opencl) spread across multiple kernels (which isn't necessarily 16 KB per kernel and may still cause the error for the same reason), or by attempting to allocate the arrays elsewhere, which, as you said, has the potential to make the program extremely inefficient but shouldn't produce a dramatic slowdown assuming that some register spilling is already happening.
These are the variables in the kernel that require the memory:

Code:
   ushort  h_table[HASH_SIZE];
   ushort  h_olist[HASH_ELEMENTS];
   ulong   h_BJ64[HASH_ELEMENTS+1];
   ulong   resBDCK64[SUBSEQUENCES];
and they exist for the entire execution of the kernel. There is nothing I can do about that.

These values are set via #define statements prepended to the kernel in C++ code. HASH_ELEMENTS is the number of baby steps. HASH_SIZE is the smallest power of 2 greater than HASH_ELEMENTS. SUBSEQUENCES is computed based upon the sequences; sometimes sequences = subsequences, but not necessarily. Subsequences cannot be split across kernels.
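
To put rough numbers on it, below is a back-of-the-envelope estimate of the private memory those four arrays need per work item, using only the declarations above and the OpenCL sizes of ushort (2 bytes) and ulong (8 bytes). The example values for the #defines are hypothetical, not taken from an actual srsieve2cl run.

Code:
   #include <stdio.h>

   /* Rough per-work-item footprint of the four kernel arrays shown above.
      hashSize is the smallest power of 2 greater than hashElements (the
      number of baby steps); subsequences depends on what is being sieved. */
   static size_t privateBytesNeeded(size_t hashSize, size_t hashElements, size_t subsequences)
   {
      return 2 * hashSize             /* ushort h_table[HASH_SIZE]      */
           + 2 * hashElements         /* ushort h_olist[HASH_ELEMENTS]  */
           + 8 * (hashElements + 1)   /* ulong  h_BJ64[HASH_ELEMENTS+1] */
           + 8 * subsequences;        /* ulong  resBDCK64[SUBSEQUENCES] */
   }

   int main(void)
   {
      /* Hypothetical example: 4000 baby steps and 30000 subsequences in one kernel. */
      size_t bytes = privateBytesNeeded(4096, 4000, 30000);
      printf("~%zu KB of private memory per work item\n", bytes / 1024);
      return 0;
   }
Splitting the sequences across kernels with -K shrinks the SUBSEQUENCES term, while lowering -b (see below) shrinks the HASH_SIZE and HASH_ELEMENTS terms.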

Removing the variables would require a different discrete log algorithm, but as I have stated elsewhere that can be problematic in other ways. Some discrete log algorithms require a lot less memory, but the execution time for each distinct prime varies greatly. That isn't good when we are doing discrete logs in parallel as the slowest one will dictate the speed of the application. Others are much faster, but they require a lot more memory. So we have a "middle of the road" approach that uses a little more memory, but flattens the execution time for each distinct prime.

The end user can specify -K and -b (a new command line switch) to affect the calculation of the #defined values, but it will be trial and error (for the end user) to determine which values work and which provide the best throughput. For example, many combinations will work, but some will yield a higher p/sec than others. In most cases the default will be best.

-K is used to split the sequences across multiple kernels. The default is 1 which means that it will try to do the discrete log for all sequences in a single kernel execution. Using -K 2 will split sequences across 2 kernels, and so on.

-b is used in the calculation of the number of baby steps (and thus giant steps). The default is 1.0; by reducing it one can reduce both HASH_SIZE and HASH_ELEMENTS, thus reducing memory consumption in the kernel. Giant steps are more expensive than baby steps, however, so reducing the number of baby steps alone is not the best approach.
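
To illustrate that trade-off, here is a generic baby-step/giant-step sizing sketch. It is not srsieve2's actual formula; it just assumes the kernel has to cover a range R of exponents and that a factor b (standing in for -b) scales the baby-step count.

Code:
   #include <math.h>
   #include <stdio.h>

   /* Generic BSGS sizing: babySteps * giantSteps must cover the range R.
      Shrinking babySteps (smaller b) lowers hash-table memory but raises
      the number of giant steps, which are more expensive per step. */
   int main(void)
   {
      const double R = 300000.0;                  /* hypothetical range of n per kernel */
      const double factors[] = { 1.0, 0.5, 0.2 }; /* stand-ins for -b values            */

      for (int i = 0; i < 3; i++)
      {
         unsigned babySteps  = (unsigned) ceil(factors[i] * sqrt(R));
         unsigned giantSteps = (unsigned) ceil(R / babySteps);

         printf("b=%.1f  baby steps=%u  giant steps=%u\n", factors[i], babySteps, giantSteps);
      }

      return 0;
   }
In this toy example, b=1.0 gives a balanced split (about 548 of each), while b=0.2 shrinks the memory-hungry baby-step tables to about 110 entries at the cost of roughly 2,700 of the more expensive giant steps.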

As I stated, and based upon your response, I cannot detect in code that the kernel will fail when -K and -b are not good. That sucks for end users. Fortunately, by using the -H command line switch the user can probably figure out the maximum value of CL_KERNEL_PRIVATE_MEM_SIZE for their GPU before errors occur and adjust -K and -b to stay below that value.

I have only seen this issue when trying to sieve thousands of sequences concurrently. I'm one of the few who has been doing that. I have sieved over 50,000 sequences at a time. It required -K3 on an NVIDIA RTX A5000, and 500 KB seems to be the upper bound of CL_KERNEL_PRIVATE_MEM_SIZE before failure. On an AMD Radeon Pro 5700 the upper bound for CL_KERNEL_PRIVATE_MEM_SIZE is closer to 50 KB.
Old 2022-10-19, 21:03   #736
rogue
 
"Mark"
Apr 2003
Between here and the

2·3,527 Posts

I posted srsieve2cl at SourceForge. Here is what I added:

Code:
   srsieve2/srsieve2cl: version 1.6.4
      Added -b.  This parameter is used to calculate the number of baby steps and giant steps
      for the discrete log.  The default is 1.0.  As this value increases, so does the number
      of baby steps while decreasing the number of giant steps.

      The parameter has an impact on CL_KERNEL_PRIVATE_MEM_SIZE in GPU kernels (output when
      using -H).  If that value is too high, then srsieve2cl will terminate.  The reasons
      for termination vary depending on the platform.

      An example of usage would be when you are sieving thousands of sequences and have to use -K.
      In one test I had to use -K5 (the GPU has about 4 GB of memory).  When using -K3 -b0.2 there
      is a nearly 40% performance boost.  I am unaware (at this time) of how to detect the maximum
      amount of memory available to a kernel, so there is no way to auto-set -K and -b to the
      optimal values at runtime.  It will require trial and error for the end user.

      The impact of changing from the default value has not been determined for CPU workers.
Old 2022-10-27, 09:38   #737
SethTro
 
"Seth"
Apr 2019

752₈ Posts

Quote:
Originally Posted by rogue View Post
srsieve2 does not work correctly when d > 1. I don't know how easy that will be to fix. I thought it was correct and working, but it clearly is not.

I do see a separate issue in factor validation when abs(c) > 1, so I will fix that.
Did this issue ever get fixed? I think I'm running into it again (but possibly a different issue).

Code:
$ ./srsieve2 -W1 -p 10 -P 1e8 -n 10 -N 300000 -s "1*9^n-7"
srsieve2 v1.6.4, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic for p >= 10
Sieve started: 10 < p < 1e8 with 299991 terms (10 < n < 300000, k*9^n-7) (expecting 262492 factors)
Sieving with single sequence c=1 logic for p >= 257
BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720
Split 1 base 9 sequence into 132 base 9^180 sequences.
Legendre summary:  Approximately 2 B needed for Legendre tables

...

518400 bytes used for congruent q and ladder indices
263200 bytes used for congruent qs and ladders
Fatal Error:  Invalid factor: 1*9^39060-7 mod 1303 = 1297
I'm sieving "(3^n-7)/2", which I think I can hack to avoid d by testing 9^n-7 and disabling the division-by-2 test. I'm excited that srsieve2 might be 10-100x faster than my hand-coded sieve.