Old 2020-04-26, 00:50   #12
Citrix

Quote:
Originally Posted by rogue
I'm thinking that a faster implementation of the HashTable class would be a good start.
For converting the hash table to a bitmap (bit index), this is what I was thinking of:-

Step 1:-
For BSGS we want to compare G[i] to B[j], the giant-step and baby-step values respectively.
Assume there are G_n giant steps and B_n baby steps.

Then we calculate G[i] (mod p) and B[j] (mod p)

We want to store the baby step values (it could be the other way around too).

Instead of comparing G[i] and B[j] directly, we want to limit our search space by using probability. This can reduce the amount of memory needed and the number of search operations, but it can produce false positives. There is an inverse relationship between how much memory we use and how many false positives we get.

Let G[i] = B[j] (mod p). Reducing both sides mod n:
GN[i] = G[i] mod n
BN[j] = B[j] mod n
so GN[i] = BN[j].


We will store BN[j] for comparison with GN[i]
** If n is a power of 2 (2^x), this modular arithmetic can be done at essentially no extra cost (a bitwise AND with n-1).
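A minimal sketch of that power-of-two reduction (my illustration; the variable names are not from the original):

Code:
/* n = 2^x, so v mod n == v & (n - 1): no division needed */
uint64_t mask = (uint64_t)n - 1;
uint32_t GN_i = (uint32_t)(G_i & mask);  /* G[i] mod n */
uint32_t BN_j = (uint32_t)(B_j & mask);  /* B[j] mod n */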

So we do not need to store the full 64 bits of G[i] and B[j]. Effectively we have reduced our search space.

We can store these in a hash table or use Step 2 below in certain cases to speed up even further.

For a very small B_n range (<5-10) we can just use an array instead of a hash table, which should be faster. This would happen if there were too few candidates left in the sieve.

Step 2:-
For small n we can create a bit index (an array of bits) to store all the B[j] values. This does not work for large n values, where a regular hash table is needed instead.

By default, all values in the array are set to zero.

Then for each BN[j] value, set the corresponding location in the array to 1.

For each GN[i], read the corresponding value from the array.

The complexity to add a value is O(1), and to search for a value is O(1).

If you do get a positive result (=1), it suggests that G[i] might produce a factor, and then you need to check G[i] against B[1..j]. This differentiates between false positives and true factors.

This check can be skipped if no candidates are left in the range corresponding to that particular G[i].

To check G[i] against B[1..j], instead of repeating B_n steps, this can be done in 2*sqrt(B_n) steps by repeating the BSGS algorithm on this subrange.
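For example, with B_n = 4096 baby steps, rechecking a single hit costs about 2*sqrt(4096) = 128 steps instead of 4096.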

Code:
/* n, B_n, G_n, BN[] and GN[] as defined above:
   mark every baby-step residue, then probe each giant-step residue */
uint8_t array[n];                  /* one byte per entry, for simplicity */
for (int a = 0; a < n; a++) array[a] = 0;

for (int j = 0; j < B_n; j++)
    array[BN[j]] = 1;

for (int i = 0; i < G_n; i++)
    if (array[GN[i]] == 1)
    {
        /* possible factor found; double-check G[i] against B[1..j] */
    }

2) Choosing the size of n

We want to reduce both the number of false positives and the amount of memory used. The larger n is, the fewer false positives we get, but the more memory we use.

The percent chance of a positive result will be (G_n*B_n*100)/n.

For a 1% false positive rate, n will have to be around G_n*B_n*100.
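For example, with G_n = B_n = 64 and a 1% false positive rate, n ≈ 64*64*100 = 409,600 bits, i.e. a bitmap of about 50 KB.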

The false positive rate should be tuned so that the speedup from using less memory and a simpler array structure outweighs the extra work needed to test the false positives.

n values will quickly become very large, so this is best suited for low weight candidates, or for small ranges being searched with a few k.
Old 2020-04-26, 02:53   #13
Happy5214

My desktop's Core 2 doesn't have AVX, so this is from my laptop (Intel Core i3-4012Y, Kubuntu 18.04):

Code:
sse_mulmod_4a_4b_4p  2018927 ms 5047 ms (per 1e6 mulmod)
vec_avx_mulmod 21365975 ms 6676 ms (per 1e6 mulmod)
vec2_mulmod64_mmx 82026202 ms 1708 ms (per 1e6 mulmod)
vec4_mulmod64_mmx 34685917 ms  722 ms (per 1e6 mulmod)
vec6_mulmod64_mmx 24030532 ms  500 ms (per 1e6 mulmod)
vec8_mulmod64_mmx 18026937 ms  375 ms (per 1e6 mulmod)
vec2_mulmod64_fpu 80305117 ms 1673 ms (per 1e6 mulmod)
vec4_mulmod64_fpu 38246894 ms  796 ms (per 1e6 mulmod)
vec6_mulmod64_fpu 27558843 ms  574 ms (per 1e6 mulmod)
vec8_mulmod64_fpu 18187244 ms  378 ms (per 1e6 mulmod)
As an aside, I had to add a definition before an include to get it to compile, so it's now
Code:
   #define __USE_GNU
   #include <sys/resource.h>

Last fiddled with by Happy5214 on 2020-04-26 at 02:55
Old 2020-04-26, 12:18   #14
rogue

I am not as concerned about memory usage on a CPU as I would be on a GPU, but if the lookup could be faster, that would be helpful. Please write some code so that we can see how it performs.
Old 2020-04-26, 14:16   #15
rogue

I added the mtsieve function fpu_mulmod_4a_4b_4p to this. I didn't include it previously because I didn't use it correctly, which led to incorrect output. Here is the updated output:

Code:
fpu_mulmod_4a_4b_4p   890625 ms 2226 ms (per 1e6 mulmod)
sse_mulmod_4a_4b_4p   734375 ms 1835 ms (per 1e6 mulmod)
     vec_avx_mulmod  3828125 ms 1196 ms (per 1e6 mulmod)
  vec2_mulmod64_mmx 30625000 ms  638 ms (per 1e6 mulmod)
  vec4_mulmod64_mmx 12359375 ms  257 ms (per 1e6 mulmod)
  vec6_mulmod64_mmx  8187500 ms  170 ms (per 1e6 mulmod)
  vec8_mulmod64_mmx  6078125 ms  126 ms (per 1e6 mulmod)
  vec2_mulmod64_fpu 29906250 ms  623 ms (per 1e6 mulmod)
  vec4_mulmod64_fpu 13546875 ms  282 ms (per 1e6 mulmod)
  vec6_mulmod64_fpu  8968750 ms  186 ms (per 1e6 mulmod)
  vec8_mulmod64_fpu  6031250 ms  125 ms (per 1e6 mulmod)
fpu_mulmod_4a_4b_4p works like this:

Code:
a[0] = a[0] * b[0] mod p[0]
a[1] = a[1] * b[1] mod p[1]
a[2] = a[2] * b[2] mod p[2]
a[3] = a[3] * b[3] mod p[3]
vec4_mulmod64_fpu works like this:

Code:
for (int i=0; i<count; i++)
   a[0] = a[0] * b[0] mod p
   a[1] = a[1] * b[1] mod p
   a[2] = a[2] * b[2] mod p
   a[3] = a[3] * b[3] mod p
Now if mtsieve had a function that worked like this:

Code:
for (int i=0; i<count; i++)
   a[0] = a[0] * b[0] mod p[0]
   a[1] = a[1] * b[1] mod p[1]
   a[2] = a[2] * b[2] mod p[2]
   a[3] = a[3] * b[3] mod p[3]
and had a speed faster than 1500 ms per 1e6 mulmod (compared to the above), that would be killer for the entire framework. Not only could it give a 20% or better performance boost to most sieves, it would also allow most sieves to support sieving for p up to 2^62. vec4_mulmod64_fpu has very limited usage in the existing sieves of the framework, but the new function could be used in most, if not all, of the sieves.
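For reference, here is a portable C sketch of what such a function would compute (my illustration only, not the asm implementation being requested; mulmod64 is a hypothetical helper using GCC/Clang's unsigned __int128):

Code:
#include <stdint.h>

/* hypothetical helper: one 64-bit mulmod via a 128-bit intermediate */
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t p)
{
   return (uint64_t)(((unsigned __int128)a * b) % p);
}

/* the requested behavior: four independent (a, b, p) lanes per call */
static void mulmod_4a_4b_4p_vec(uint64_t a[4], const uint64_t b[4],
                                const uint64_t p[4], int count)
{
   for (int i = 0; i < count; i++)
      for (int lane = 0; lane < 4; lane++)
         a[lane] = mulmod64(a[lane], b[lane], p[lane]);
}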

The sources to both of these are included in the attached archive, and the usage of both is in the test.c file. For those of you who are not afraid of x86 assembly, I'd love for you to take a crack at such a function.

Attached Files: mulmod_test.7z (26.9 KB)

Last fiddled with by rogue on 2020-04-26 at 14:17
Old 2020-04-26, 15:46   #16
storm5510

Using your update, with the same parameter as before (100).
Attached image: mulmod.JPG (69.0 KB)
Old 2020-04-26, 16:27   #17
Citrix

Quote:
Originally Posted by rogue
I am not as concerned about memory usage on a CPU as I would be on a GPU, but if the lookup could be faster, that would be helpful. Please write some code so that we can see how it performs.
I am not sure how to integrate the code into mtsieve to do the comparison. The code is below:-

As I mentioned in the earlier post, this is only useful for a small n range or low-weight candidates, and might not work for a larger range due to memory requirements. It might not be of any practical use. It does use more memory, but lookup is significantly faster.

See: https://docs.microsoft.com/en-us/cpp...s?view=vs-2019
for more details on bitset.

Code:
#include <bitset>  // standard library bitset for the bitmap
#include <cstdint>

// BSGS algorithm
// For a range r=b*g for k*2^n+1
// then baby step = k*2^(1) to k*2^b -- b steps (mod p)
// giant step will be -2^(-n) to -(2^b) step 2^b -- g steps (mod p)

const int g = 32;    // number of giant steps
const int b = 32;    // number of baby steps
const int c = 1024;  // for 0.1% false positive rate. rate=1/c*100%
const int n = g*b*c; // must be a compile-time constant for std::bitset;
                     // works better if n=2^m
std::bitset<n> bitmap; // the constructor sets everything to zero/false

int main()
{
	uint64_t giant[g];
	uint64_t baby[b];
	// replace arrays with mulmod code in mtsieve to generate giant and baby values (mod p)

	for (int i = 0; i < b; i++)
	{
		uint32_t temp = baby[i] % n; // reduce before truncating to 32 bits;
		                             // can use bit operations if n is a power of 2
		bitmap.set(temp); // sets bit to 1
	}
	// we are done with setting up the table
	// Now we will compare with the giant values
	for (int j = 0; j < g; j++)
	{
		uint32_t temp = giant[j] % n; // can use bit operations if n is a power of 2
		if (bitmap.test(temp))
		{
			// we might have a possible factor
			// Possible factor will be k*2^(b*j)*2^x+1 = 0 (mod p)
			// Where x is between 1...b
			// Need to search this range for factors
			// Repeat BSGS for this smaller range.
		}
	}

	bitmap.reset(); // sets everything back to false
	return 0;
}
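(With these constants, n = 32*32*1024 = 1,048,576 bits, so the bitmap above occupies 128 KB.)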

Last fiddled with by Citrix on 2020-04-26 at 16:49
Old 2020-04-26, 19:26   #18
Citrix

Quote:
Originally Posted by rogue
Now if mtsieve had a function that worked like this:

Code:
for (int i=0; i<count; i++)
   a[0] = a[0] * b[0] mod p[0]
   a[1] = a[1] * b[1] mod p[1]
   a[2] = a[2] * b[2] mod p[2]
   a[3] = a[3] * b[3] mod p[3]
Question for you:-
For low weight sequences: assuming BSGS requires 64 steps each for a range of 4096, would the following simpler algorithm be faster on a CPU or GPU than BSGS for a 4096 range? Multiple primes could be tested in parallel.
(Assuming we use BestQ and Legendre symbols in both algorithms.)

Code:
// checks if p divides k*base^n +/- sign over a range of n, in increments of step
// ('^' in the original sketch meant exponentiation; mulmod and powmod are
// assumed 64-bit modular helpers)
void low_wt_sieve(uint64_t k, uint32_t base, uint32_t nmin, uint32_t nmax,
                  uint64_t sign, uint32_t step, uint64_t p)
{
	// calculate number of iterations
	uint32_t number = (nmax - nmin) / step + 1;
	uint64_t start = mulmod(k % p, powmod(base, nmin, p), p); // k*base^nmin mod p
	uint64_t step_exp = powmod(base, step, p);                // base^step mod p

	for (uint32_t x = 0; x < number; x++)
	{
		start = mulmod(start, step_exp, p);
		if (start + sign == p || start - sign == 0)
		{
			// output factor
		}
	}
}
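For completeness, a minimal sketch of the mulmod and powmod helpers assumed above (my addition; portable C using unsigned __int128 on GCC/Clang, not mtsieve's asm routines):

Code:
#include <stdint.h>

/* a*b mod p via a 128-bit intermediate */
static inline uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
	return (uint64_t)(((unsigned __int128)a * b) % p);
}

/* b^e mod p by square-and-multiply */
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p)
{
	uint64_t r = 1;
	b %= p;
	while (e > 0)
	{
		if (e & 1)
			r = mulmod(r, b, p);
		b = mulmod(b, b, p);
		e >>= 1;
	}
	return r;
}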
Thanks.
Old 2020-04-26, 21:29   #19
rogue

Quote:
Originally Posted by Citrix
Question for you:-
For low weight sequences: assuming BSGS requires 64 steps each for a range of 4096, would the following simpler algorithm be faster on a CPU or GPU than BSGS for a 4096 range? Multiple primes could be tested in parallel.
(Assuming we use BestQ and Legendre symbols in both algorithms.)

Code:
// checks if p divides k*base^n +/- sign over a range of n, in increments of step
// ('^' in the original sketch meant exponentiation; mulmod and powmod are
// assumed 64-bit modular helpers)
void low_wt_sieve(uint64_t k, uint32_t base, uint32_t nmin, uint32_t nmax,
                  uint64_t sign, uint32_t step, uint64_t p)
{
	// calculate number of iterations
	uint32_t number = (nmax - nmin) / step + 1;
	uint64_t start = mulmod(k % p, powmod(base, nmin, p), p); // k*base^nmin mod p
	uint64_t step_exp = powmod(base, step, p);                // base^step mod p

	for (uint32_t x = 0; x < number; x++)
	{
		start = mulmod(start, step_exp, p);
		if (start + sign == p || start - sign == 0)
		{
			// output factor
		}
	}
}
As for GPU vs CPU, I don't know the answer without running tests. The issue with the GPU is memory usage. Although GPUs have a lot of memory, each "thread" typically accesses only a small portion of it. You can use global memory that all threads can access, but it is my understanding that global memory access is expensive. I only have access to AMD GPUs, so OpenCL is the only choice; in other words, no CUDA. Fortunately, OpenCL can also run on NVIDIA GPUs.

As for your other post using bitset, I'll look into it further to see if it can be used.
Old 2020-04-27, 01:42   #20
rogue

Upon reviewing the code, vec4_mulmod64_fpu actually works like this:

Code:
for (int i=0; i<count; i++)
   y[0] = x[0] * b mod p
   y[1] = x[1] * b mod p
   y[2] = x[2] * b mod p
   y[3] = x[3] * b mod p
so it isn't a fair comparison between the two, as b/p is precomputed. For the function I'm looking for, I cannot do that because b and p are variable. In doing more investigation into the performance of fpu_mulmod_4a_4b_4p, it seems that the overhead of calling the function is driving up its execution time by quite a bit. I think that is where I need to look.

For some sieves one can also pre-compute all b as a long double (factorial and primorial, for example). Since factorial is already GPU enabled, it might not benefit that sieve as much as primorial. It would probably only make sense to use a new function for p > 1e8 (or larger), because the current function will be faster at finding terms with small factors. The new function I'm thinking of would be better when the factors are further apart. Does anyone want to take a crack at it?

Does anyone know if there is an additional latency to using fmul with a memory address instead of st(x)? It doesn't appear that there is one from Agner Fog's tables, but I don't know if anyone has practical experience who can also answer the question.
Old 2020-04-27, 15:37   #21
rogue

As I think about it more, primorial and factorial won't benefit that much, because they have a specialized routine that reduces the number of asm function calls.
Old 2020-04-27, 18:21   #22
rogue

I realized that I wasn't computing the runtimes correctly for the vec_xx functions. Here are the updated numbers, which make a lot more sense now. I'm sorry I didn't discover my mistake sooner; I assume that nobody besides me has wasted time on this.

Code:
fpu_mulmod_4a_4b_4p   859375 ms 2148 ms (per 1e6 mulmod)
sse_mulmod_4a_4b_4p   703125 ms 1757 ms (per 1e6 mulmod)
     vec_avx_mulmod  3750000 ms 1171 ms (per 1e6 mulmod)
  vec2_mulmod64_mmx 57687500 ms 1201 ms (per 1e6 mulmod)
  vec4_mulmod64_mmx 47265625 ms  984 ms (per 1e6 mulmod)
  vec6_mulmod64_mmx 44859375 ms  934 ms (per 1e6 mulmod)
  vec8_mulmod64_mmx 44953125 ms  936 ms (per 1e6 mulmod)
  vec2_mulmod64_fpu 58921875 ms 1227 ms (per 1e6 mulmod)
  vec4_mulmod64_fpu 50531250 ms 1052 ms (per 1e6 mulmod)
  vec6_mulmod64_fpu 50812500 ms 1058 ms (per 1e6 mulmod)
  vec8_mulmod64_fpu 42796875 ms  891 ms (per 1e6 mulmod)
Further analysis reveals that access to memory is a killer for performance in the asm routines.

Last fiddled with by rogue on 2020-04-27 at 18:31