 mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)

 Citrix 2020-04-26 00:50

[QUOTE=rogue;543610]
I'm thinking that a faster implementation of the HashTable class would be a good start.[/QUOTE]

For converting the hash table to bitmap (bit index) this is what I was thinking of:-

Step 1:-
For BSGS we want to compare G[i] to B[j] for giant and baby step respectively.
Assuming there are G_n giant steps and B_n baby steps.

Then we calculate G[i] (mod p) and B[j] (mod p)

We want to store the baby step value (could be the other way around also)

Instead of comparing G[i] and B[j] we want to limit our search space by using probability. This can reduce the amount of memory space needed and search operations but can produce false positive. There will be a inverse relationship between how much memory we use and how many false positives we get.

Let G[i]=B[j] (mod p)
then [G[i]=B[j] (mod p)](mod n)
G[i]%n=GN[i]
B[j]%n=BN[j]
GN[i]=BN[j]

We will store BN[j] for comparison with GN[i]
** if n is a power of 2 (2^x) then this modular arithmetic can be done at no extra cost.

So we do not need to store the full 64 bits of G[i] and B[j]. Effectively we have reduced our search space.

We can store these in a hash table or use Step 2 below in certain cases to speed up even further.

For very small B_n range (<5-10) we can just use an array instead of hash table which should be faster. This would happen if there were too few candidates left in the sieve.

Step 2:-
For small n we can create an bit index (array of bits) to store all B[j] values. Does not work for large n values and would need to use a regular hash table.

At default all values in the array are set to zero.

Then for each BN[j] value you convert the corresponding location in the array to 1.

For each GN[i] you read the corresponding value from the array.

Complexity to add a value is O(1) and to search a value is O (1)

If you do get a positive result (=1) it suggests that G[i] might produce a factor and then you need to check G[i] against B[1..j]. This will differentiate between false positive and true factors.

This might not need to be done if there were no values corresponding to a particular G[i] in the candidate left range.

To check G[i] against B[1..j] instead of repeating B_n steps this can be done in 2*sqrt(B_n) steps by repeating the BSGS algorithm again on this subrange.

[CODE]
Bit array [n];
for (int a=0; a<n; a++) {array[a]=0;}
int b=BN[j];
array[b]=1;
int g=GN[i];
if(array[g]==1) {// possible factor found and need to double check}
[/CODE]

2) Choosing the size of n

We want to reduce the number of false positive and reduce the amount of memory used
Larger the n the fewer the number of false positive and more the amount of memory used.

Percent chances of a positive result will be (G_n*B_n*100)/n

For a 1% false positive rate n will have to be around G_n*B_n*100

The false positive rate should be adjusted to the speed up from using less memory/array structure vs extra work needed to test the false positive.

n values will quickly become very large and this is best suited for low weight candidates or small ranges being searched with a few k.

 Happy5214 2020-04-26 02:53

My desktop's Core 2 doesn't have AVX, so this is from my laptop (Intel Core i3-4012Y, Kubuntu 18.04):

[CODE]sse_mulmod_4a_4b_4p 2018927 ms 5047 ms (per 1e6 mulmod)
vec_avx_mulmod 21365975 ms 6676 ms (per 1e6 mulmod)
vec2_mulmod64_mmx 82026202 ms 1708 ms (per 1e6 mulmod)
vec4_mulmod64_mmx 34685917 ms 722 ms (per 1e6 mulmod)
vec6_mulmod64_mmx 24030532 ms 500 ms (per 1e6 mulmod)
vec8_mulmod64_mmx 18026937 ms 375 ms (per 1e6 mulmod)
vec2_mulmod64_fpu 80305117 ms 1673 ms (per 1e6 mulmod)
vec4_mulmod64_fpu 38246894 ms 796 ms (per 1e6 mulmod)
vec6_mulmod64_fpu 27558843 ms 574 ms (per 1e6 mulmod)
vec8_mulmod64_fpu 18187244 ms 378 ms (per 1e6 mulmod)[/CODE]

As an aside, I had to add a definition before an include to get it to compile, so it's now
[code] #define __USE_GNU
#include <sys/resource.h>[/code]

 rogue 2020-04-26 12:18

I am not concerned about memory usage on a CPU as much as I would be on a GPU, but if the lookup could be faster, that would be helpful. Please write some code so that we can see how it performs.

 rogue 2020-04-26 14:16

1 Attachment(s)
I added the mtsieve function fpu_mulmod_4a_4b_4p to this. I didn't include it previously because it I didn't use it correctly which led to incorrect output. Here is the updated output:

[code]
fpu_mulmod_4a_4b_4p 890625 ms 2226 ms (per 1e6 mulmod)
sse_mulmod_4a_4b_4p 734375 ms 1835 ms (per 1e6 mulmod)
vec_avx_mulmod 3828125 ms 1196 ms (per 1e6 mulmod)
vec2_mulmod64_mmx 30625000 ms 638 ms (per 1e6 mulmod)
vec4_mulmod64_mmx 12359375 ms 257 ms (per 1e6 mulmod)
vec6_mulmod64_mmx 8187500 ms 170 ms (per 1e6 mulmod)
vec8_mulmod64_mmx 6078125 ms 126 ms (per 1e6 mulmod)
vec2_mulmod64_fpu 29906250 ms 623 ms (per 1e6 mulmod)
vec4_mulmod64_fpu 13546875 ms 282 ms (per 1e6 mulmod)
vec6_mulmod64_fpu 8968750 ms 186 ms (per 1e6 mulmod)
vec8_mulmod64_fpu 6031250 ms 125 ms (per 1e6 mulmod)
[/code]

fpu_mulmod_4a_4b_4p works like this:

[code]
a = a * b mod p
a = a * b mod p
a = a * b mod p
a = a * b mod p
[/code]

vec4_mulmod64_fpu works like this:

[code]
for (int i=0; i<count; i++)
a = a * b mod p
a = a * b mod p
a = a * b mod p
a = a * b mod p
[/code]

Now if mtsieve had a function that worked like this:

[code]
for (int i=0; i<count; i++)
a = a * b mod p
a = a * b mod p
a = a * b mod p
a = a * b mod p
[/code]

and had a speed faster than 1500 ms per 1e6 mulmod (compared to the above), that would be killer for the entire framework. Not only could that be used to give a 20% or better performance boost to most sieves, it would also allow most sieves to support sieving for p up to 2^62. vec4_mulmod64_fpu has very limited usage in the existing sieves of the framework, but the new function could be used in most, if not all of the sieves.

The sources to both of these are included in the attached and the usage of both is in the test.c file. For those of you are are not afraid of x86 assembly, I'd love you to take a crack at such a function.

 storm5510 2020-04-26 15:46

1 Attachment(s)
Using your update. Same parameter as before. 100.

 Citrix 2020-04-26 16:27

[QUOTE=rogue;543855]I am not concerned about memory usage on a CPU as much as I would be on a GPU, but if the lookup could be faster, that would be helpful. Please write some code so that we can see how it performs.[/QUOTE]

I am not sure how to integrate the code into mtsieve to do the comparison. The code is below:-

As I mentioned in the earlier post that it only is useful for small n range or low weight and might not work for a larger range due to memory requirements. It might not be of any practical use. It does use more memory but look up is significantly faster.

See: [url]https://docs.microsoft.com/en-us/cpp/standard-library/bitset-class?view=vs-2019[/url]
for more details on bitset.

[CODE]
#include <bitset> // built in STD library for bitmaps.

//BSGS algorithm
// For a range r=b*g for k*2^n+1
// then baby step = k*2^(1) to k*2^b -- b steps (mod p)
// giant step will be -2^(-n) to -(2^b) step 2^b --g steps (mod p)

const int g = 32; // number of giant steps
const int b = 32; // number of baby steps
const int c = 1024; // for 0.1% false positive rate. rate=1/c*100%
int n = g*b*c; //works better if n=2^m
std::bitset<n> bitmap; // by default in the constructor everything is set to zero/false

void main(void)
{
uint64_t giant[g];
uint64_t baby[b];
// replace arrays with mulmod code in mtsieve to generate giant and baby values (mod p)

for (int i = 0; i < b; i++)
{
uint32_t temp = baby[i];
temp = temp%n; // can use bit operations if n is a power of 2
bitmap.set(temp); //sets bit to 1
}
// we are done with setting up the table
// Now we will compare with the giant values
for (int j= 0; j < g; j++)
{
uint32_t temp = giant[j];
temp = temp%n; // can use bit operations if n is a power of 2
if (bitmap.test(temp))
{
// we might have a possible factor
// Possible factor will be k*2^(b*j)*2^x+1 =0 (mod p)
// Where x is between 1...b
// Need to search this range for factors
// Repeat BSGS for this smaller range.
}
}

bitmap.reset(); // sets everything to false
}
[/CODE]

 Citrix 2020-04-26 19:26

[QUOTE=rogue;543860]
Now if mtsieve had a function that worked like this:

[code]
for (int i=0; i<count; i++)
a = a * b mod p
a = a * b mod p
a = a * b mod p
a = a * b mod p
[/code]
[/QUOTE]

Question for you:-
For the low weight sequences-
Assuming BSGS requires 64 steps each for a range of 4096
Would the following simpler algorithm be faster on CPU or GPU than BSGS for a 4096 range?
Multiple primes could be tested in parallel.
(Assuming we use BestQ and Legendre symbols in both algorithms).

[CODE]
void low_wt_sieve (int k, int base, int nmin, int nmax, int sign, int step, int p)
{
// checks if p divides k*base^n+-sign (over range of n with base^step size increments)

// calculate number of iterations
int number=(nmax-nmin)/step+1;
int start=(k*base^nmin)%p;
int step_exp=base^step%p;
for (int x=0; x<number; x++)
{
start=start*step_exp%p;
if (start+sign==p || start-sign==0)
{
//output factor
}
}
}
[/CODE]

Thanks.

 rogue 2020-04-26 21:29

[QUOTE=Citrix;543882]Question for you:-
For the low weight sequences-
Assuming BSGS requires 64 steps each for a range of 4096
Would the following simpler algorithm be faster on CPU or GPU than BSGS for a 4096 range?
Multiple primes could be tested in parallel.
(Assuming we use BestQ and Legendre symbols in both algorithms).

[CODE]
void low_wt_sieve (int k, int base, int nmin, int nmax, int sign, int step, int p)
{
// checks if p divides k*base^n+-sign (over range of n with base^step size increments)

// calculate number of iterations
int number=(nmax-nmin)/step+1;
int start=(k*base^nmin)%p;
int step_exp=base^step%p;
for (int x=0; x<number; x++)
{
start=start*step_exp%p;
if (start+sign==p || start-sign==0)
{
//output factor
}
}
}
[/CODE][/QUOTE]

As for GPU vs CPU, I don't know the answer without running any tests. The issue with the GPU is the usage of memory. Although GPUs have a lot of memory, each "thread" typically accesses only a small portion of that. You can use global memory that all threads can access, but it is my understanding that global memory access is expensive. I only have access to AMD GPUs, so OpenCL is the only choice, in other words, no CUDA. Fortunately OpenCL can be run on NVIDIA GPUs.

As for your other post using bitset, I'll look into it further to see if it can be used.

 rogue 2020-04-27 01:42

Upon reviewing the code, vec4_mulmod64_fpu works like this:

vec4_mulmod64_fpu works like this:

[code]
for (int i=0; i<count; i++)
y = x * b mod p
y = x * b mod p
y = x * b mod p
y = x * b mod p
[/code]

so it isn't a fair comparison between the two as b/p is precomputed. For the function I'm looking for, I cannot do that as b and p are variable. In doing more investigation into the performance of fpu_mulmod_4a_4b_4p, it seem that the overhead of calling the function is driving up the time to execute it but quite a bit. I think that is where I need to look. For some sieves one can also pre-compute all b as a long double (factorial and primorial for example). Since factorial is already GPU enabled, it might not benefit that sieve as much as primorial. It would probably only make sense to use a new function for p > 1e8 (or larger) because the current function will be faster at finding terms with small factors. The new function I'm thinking of would be better when the factors are further apart. Does anyone want to take a crack at it?

Does anyone know if there is an additional latency to using fmul with a memory address instead of st(x)? It doesn't appear that there is one from Agner Fog's tables, but I don't know if anyone has practical experience who can also answer the question.

 rogue 2020-04-27 15:37

As I think about it more, primorial and factorial won't benefit that much because they have a specialized routine that reduces the number of asm function calls.

 rogue 2020-04-27 18:21

I realized that I wasn't computing the runtimes correctly for the vec_xx functions. Here are updated numbers, which make a lot more sense now. I'm sorry that I didn't discover my mistake sooner. I assume that nobody has wasted time on this besides me.

[code]
fpu_mulmod_4a_4b_4p 859375 ms 2148 ms (per 1e6 mulmod)
sse_mulmod_4a_4b_4p 703125 ms 1757 ms (per 1e6 mulmod)
vec_avx_mulmod 3750000 ms 1171 ms (per 1e6 mulmod)
vec2_mulmod64_mmx 57687500 ms 1201 ms (per 1e6 mulmod)
vec4_mulmod64_mmx 47265625 ms 984 ms (per 1e6 mulmod)
vec6_mulmod64_mmx 44859375 ms 934 ms (per 1e6 mulmod)
vec8_mulmod64_mmx 44953125 ms 936 ms (per 1e6 mulmod)
vec2_mulmod64_fpu 58921875 ms 1227 ms (per 1e6 mulmod)
vec4_mulmod64_fpu 50531250 ms 1052 ms (per 1e6 mulmod)
vec6_mulmod64_fpu 50812500 ms 1058 ms (per 1e6 mulmod)
vec8_mulmod64_fpu 42796875 ms 891 ms (per 1e6 mulmod)
[/code]

Further analysis reveals that access to memory is a killer for performance in the asm routines.

All times are UTC. The time now is 22:33.