mersenneforum.org mtsieve enhancements
 Register FAQ Search Today's Posts Mark Forums Read

2020-04-26, 00:50   #12
Citrix

Jun 2003

22·397 Posts

Quote:
 Originally Posted by rogue I'm thinking that a faster implementation of the HashTable class would be a good start.
For converting the hash table to bitmap (bit index) this is what I was thinking of:-

Step 1:-
For BSGS we want to compare G[i] to B[j] for giant and baby step respectively.
Assuming there are G_n giant steps and B_n baby steps.

Then we calculate G[i] (mod p) and B[j] (mod p)

We want to store the baby step value (could be the other way around also)

Instead of comparing G[i] and B[j] we want to limit our search space by using probability. This can reduce the amount of memory space needed and search operations but can produce false positive. There will be a inverse relationship between how much memory we use and how many false positives we get.

Let G[i]=B[j] (mod p)
then [G[i]=B[j] (mod p)](mod n)
G[i]%n=GN[i]
B[j]%n=BN[j]
GN[i]=BN[j]

We will store BN[j] for comparison with GN[i]
** if n is a power of 2 (2^x) then this modular arithmetic can be done at no extra cost.

So we do not need to store the full 64 bits of G[i] and B[j]. Effectively we have reduced our search space.

We can store these in a hash table or use Step 2 below in certain cases to speed up even further.

For very small B_n range (<5-10) we can just use an array instead of hash table which should be faster. This would happen if there were too few candidates left in the sieve.

Step 2:-
For small n we can create an bit index (array of bits) to store all B[j] values. Does not work for large n values and would need to use a regular hash table.

At default all values in the array are set to zero.

Then for each BN[j] value you convert the corresponding location in the array to 1.

For each GN[i] you read the corresponding value from the array.

Complexity to add a value is O(1) and to search a value is O (1)

If you do get a positive result (=1) it suggests that G[i] might produce a factor and then you need to check G[i] against B[1..j]. This will differentiate between false positive and true factors.

This might not need to be done if there were no values corresponding to a particular G[i] in the candidate left range.

To check G[i] against B[1..j] instead of repeating B_n steps this can be done in 2*sqrt(B_n) steps by repeating the BSGS algorithm again on this subrange.

Code:
Bit array [n];
for (int a=0; a<n; a++) {array[a]=0;}
int b=BN[j];
array[b]=1;
int g=GN[i];
if(array[g]==1) {// possible factor found and need to double check}

2) Choosing the size of n

We want to reduce the number of false positive and reduce the amount of memory used
Larger the n the fewer the number of false positive and more the amount of memory used.

Percent chances of a positive result will be (G_n*B_n*100)/n

For a 1% false positive rate n will have to be around G_n*B_n*100

The false positive rate should be adjusted to the speed up from using less memory/array structure vs extra work needed to test the false positive.

n values will quickly become very large and this is best suited for low weight candidates or small ranges being searched with a few k.

 2020-04-26, 02:53 #13 Happy5214     "Alexander" Nov 2008 The Alamo City 3×52×11 Posts My desktop's Core 2 doesn't have AVX, so this is from my laptop (Intel Core i3-4012Y, Kubuntu 18.04): Code: sse_mulmod_4a_4b_4p 2018927 ms 5047 ms (per 1e6 mulmod) vec_avx_mulmod 21365975 ms 6676 ms (per 1e6 mulmod) vec2_mulmod64_mmx 82026202 ms 1708 ms (per 1e6 mulmod) vec4_mulmod64_mmx 34685917 ms 722 ms (per 1e6 mulmod) vec6_mulmod64_mmx 24030532 ms 500 ms (per 1e6 mulmod) vec8_mulmod64_mmx 18026937 ms 375 ms (per 1e6 mulmod) vec2_mulmod64_fpu 80305117 ms 1673 ms (per 1e6 mulmod) vec4_mulmod64_fpu 38246894 ms 796 ms (per 1e6 mulmod) vec6_mulmod64_fpu 27558843 ms 574 ms (per 1e6 mulmod) vec8_mulmod64_fpu 18187244 ms 378 ms (per 1e6 mulmod) As an aside, I had to add a definition before an include to get it to compile, so it's now Code:  #define __USE_GNU #include  Last fiddled with by Happy5214 on 2020-04-26 at 02:55
 2020-04-26, 12:18 #14 rogue     "Mark" Apr 2003 Between here and the 11010000110102 Posts I am not concerned about memory usage on a CPU as much as I would be on a GPU, but if the lookup could be faster, that would be helpful. Please write some code so that we can see how it performs.
2020-04-26, 14:16   #15
rogue

"Mark"
Apr 2003
Between here and the

2·13·257 Posts

I added the mtsieve function fpu_mulmod_4a_4b_4p to this. I didn't include it previously because it I didn't use it correctly which led to incorrect output. Here is the updated output:

Code:
fpu_mulmod_4a_4b_4p   890625 ms 2226 ms (per 1e6 mulmod)
sse_mulmod_4a_4b_4p   734375 ms 1835 ms (per 1e6 mulmod)
vec_avx_mulmod  3828125 ms 1196 ms (per 1e6 mulmod)
vec2_mulmod64_mmx 30625000 ms  638 ms (per 1e6 mulmod)
vec4_mulmod64_mmx 12359375 ms  257 ms (per 1e6 mulmod)
vec6_mulmod64_mmx  8187500 ms  170 ms (per 1e6 mulmod)
vec8_mulmod64_mmx  6078125 ms  126 ms (per 1e6 mulmod)
vec2_mulmod64_fpu 29906250 ms  623 ms (per 1e6 mulmod)
vec4_mulmod64_fpu 13546875 ms  282 ms (per 1e6 mulmod)
vec6_mulmod64_fpu  8968750 ms  186 ms (per 1e6 mulmod)
vec8_mulmod64_fpu  6031250 ms  125 ms (per 1e6 mulmod)
fpu_mulmod_4a_4b_4p works like this:

Code:
a[0] = a[0] * b[0] mod p[0]
a[1] = a[1] * b[1] mod p[1]
a[2] = a[2] * b[2] mod p[2]
a[3] = a[3] * b[3] mod p[3]
vec4_mulmod64_fpu works like this:

Code:
for (int i=0; i<count; i++)
a[0] = a[0] * b[0] mod p
a[1] = a[1] * b[1] mod p
a[2] = a[2] * b[2] mod p
a[3] = a[3] * b[3] mod p
Now if mtsieve had a function that worked like this:

Code:
for (int i=0; i<count; i++)
a[0] = a[0] * b[0] mod p[0]
a[1] = a[1] * b[1] mod p[1]
a[2] = a[2] * b[2] mod p[2]
a[3] = a[3] * b[3] mod p[3]
and had a speed faster than 1500 ms per 1e6 mulmod (compared to the above), that would be killer for the entire framework. Not only could that be used to give a 20% or better performance boost to most sieves, it would also allow most sieves to support sieving for p up to 2^62. vec4_mulmod64_fpu has very limited usage in the existing sieves of the framework, but the new function could be used in most, if not all of the sieves.

The sources to both of these are included in the attached and the usage of both is in the test.c file. For those of you are are not afraid of x86 assembly, I'd love you to take a crack at such a function.
Attached Files
 mulmod_test.7z (26.9 KB, 214 views)

Last fiddled with by rogue on 2020-04-26 at 14:17

 2020-04-26, 15:46 #16 storm5510 Random Account     Aug 2009 Not U. + S.A. 227910 Posts Using your update. Same parameter as before. 100. Attached Thumbnails
2020-04-26, 16:27   #17
Citrix

Jun 2003

22×397 Posts

Quote:
 Originally Posted by rogue I am not concerned about memory usage on a CPU as much as I would be on a GPU, but if the lookup could be faster, that would be helpful. Please write some code so that we can see how it performs.
I am not sure how to integrate the code into mtsieve to do the comparison. The code is below:-

As I mentioned in the earlier post that it only is useful for small n range or low weight and might not work for a larger range due to memory requirements. It might not be of any practical use. It does use more memory but look up is significantly faster.

See: https://docs.microsoft.com/en-us/cpp...s?view=vs-2019
for more details on bitset.

Code:
#include <bitset> // built in STD library for bitmaps.

//BSGS algorithm
// For a range r=b*g for k*2^n+1
// then baby step = k*2^(1) to k*2^b -- b steps (mod p)
// giant step will be -2^(-n) to -(2^b) step 2^b --g steps (mod p)

const int g = 32; // number of giant steps
const int b = 32; // number of baby steps
const int c = 1024; // for 0.1% false positive rate. rate=1/c*100%
int n = g*b*c; //works better if n=2^m
std::bitset<n> bitmap; // by default in the constructor everything is set to zero/false

void main(void)
{
uint64_t giant[g];
uint64_t baby[b];
// replace arrays with mulmod code in mtsieve to generate giant and baby values (mod p)

for (int i = 0; i < b; i++)
{
uint32_t temp = baby[i];
temp = temp%n; // can use bit operations if n is a power of 2
bitmap.set(temp); //sets bit to 1
}
// we are done with setting up the table
// Now we will compare with the giant values
for (int j= 0; j < g; j++)
{
uint32_t temp = giant[j];
temp = temp%n; // can use bit operations if n is a power of 2
if (bitmap.test(temp))
{
// we might have a possible factor
// Possible factor will be k*2^(b*j)*2^x+1 =0 (mod p)
// Where x is between 1...b
// Need to search this range for factors
// Repeat BSGS for this smaller range.
}
}

bitmap.reset(); // sets everything to false
}

Last fiddled with by Citrix on 2020-04-26 at 16:49

2020-04-26, 19:26   #18
Citrix

Jun 2003

22×397 Posts

Quote:
 Originally Posted by rogue Now if mtsieve had a function that worked like this: Code: for (int i=0; i
Question for you:-
For the low weight sequences-
Assuming BSGS requires 64 steps each for a range of 4096
Would the following simpler algorithm be faster on CPU or GPU than BSGS for a 4096 range?
Multiple primes could be tested in parallel.
(Assuming we use BestQ and Legendre symbols in both algorithms).

Code:
void low_wt_sieve (int k, int base, int nmin, int nmax, int sign, int step, int p)
{
// checks if p divides k*base^n+-sign (over range of n with base^step size increments)

// calculate number of iterations
int number=(nmax-nmin)/step+1;
int start=(k*base^nmin)%p;
int step_exp=base^step%p;
for (int x=0; x<number; x++)
{
start=start*step_exp%p;
if (start+sign==p || start-sign==0)
{
//output factor
}
}
}
Thanks.

2020-04-26, 21:29   #19
rogue

"Mark"
Apr 2003
Between here and the

1A1A16 Posts

Quote:
 Originally Posted by Citrix Question for you:- For the low weight sequences- Assuming BSGS requires 64 steps each for a range of 4096 Would the following simpler algorithm be faster on CPU or GPU than BSGS for a 4096 range? Multiple primes could be tested in parallel. (Assuming we use BestQ and Legendre symbols in both algorithms). Code: void low_wt_sieve (int k, int base, int nmin, int nmax, int sign, int step, int p) { // checks if p divides k*base^n+-sign (over range of n with base^step size increments) // calculate number of iterations int number=(nmax-nmin)/step+1; int start=(k*base^nmin)%p; int step_exp=base^step%p; for (int x=0; x
As for GPU vs CPU, I don't know the answer without running any tests. The issue with the GPU is the usage of memory. Although GPUs have a lot of memory, each "thread" typically accesses only a small portion of that. You can use global memory that all threads can access, but it is my understanding that global memory access is expensive. I only have access to AMD GPUs, so OpenCL is the only choice, in other words, no CUDA. Fortunately OpenCL can be run on NVIDIA GPUs.

As for your other post using bitset, I'll look into it further to see if it can be used.

 2020-04-27, 01:42 #20 rogue     "Mark" Apr 2003 Between here and the 1A1A16 Posts Upon reviewing the code, vec4_mulmod64_fpu works like this: vec4_mulmod64_fpu works like this: Code: for (int i=0; i 1e8 (or larger) because the current function will be faster at finding terms with small factors. The new function I'm thinking of would be better when the factors are further apart. Does anyone want to take a crack at it? Does anyone know if there is an additional latency to using fmul with a memory address instead of st(x)? It doesn't appear that there is one from Agner Fog's tables, but I don't know if anyone has practical experience who can also answer the question.
 2020-04-27, 15:37 #21 rogue     "Mark" Apr 2003 Between here and the 2·13·257 Posts As I think about it more, primorial and factorial won't benefit that much because they have a specialized routine that reduces the number of asm function calls.
 2020-04-27, 18:21 #22 rogue     "Mark" Apr 2003 Between here and the 2·13·257 Posts I realized that I wasn't computing the runtimes correctly for the vec_xx functions. Here are updated numbers, which make a lot more sense now. I'm sorry that I didn't discover my mistake sooner. I assume that nobody has wasted time on this besides me. Code: fpu_mulmod_4a_4b_4p 859375 ms 2148 ms (per 1e6 mulmod) sse_mulmod_4a_4b_4p 703125 ms 1757 ms (per 1e6 mulmod) vec_avx_mulmod 3750000 ms 1171 ms (per 1e6 mulmod) vec2_mulmod64_mmx 57687500 ms 1201 ms (per 1e6 mulmod) vec4_mulmod64_mmx 47265625 ms 984 ms (per 1e6 mulmod) vec6_mulmod64_mmx 44859375 ms 934 ms (per 1e6 mulmod) vec8_mulmod64_mmx 44953125 ms 936 ms (per 1e6 mulmod) vec2_mulmod64_fpu 58921875 ms 1227 ms (per 1e6 mulmod) vec4_mulmod64_fpu 50531250 ms 1052 ms (per 1e6 mulmod) vec6_mulmod64_fpu 50812500 ms 1058 ms (per 1e6 mulmod) vec8_mulmod64_fpu 42796875 ms 891 ms (per 1e6 mulmod)` Further analysis reveals that access to memory is a killer for performance in the asm routines. Last fiddled with by rogue on 2020-04-27 at 18:31

 Similar Threads Thread Thread Starter Forum Replies Last Post rogue Software 677 2022-08-15 00:13 rogue Software 304 2021-11-06 13:51 kar_bon No Prime Left Behind 10 2008-03-28 11:21 Greenbank Octoproth Search 2 2006-12-03 17:28 Reboot It Software 16 2003-10-17 01:31

All times are UTC. The time now is 17:22.

Tue Aug 16 17:22:13 UTC 2022 up 40 days, 12:09, 1 user, load averages: 1.06, 1.30, 1.32