20200921, 19:23  #56  
"Mark"
Apr 2003
Between here and the
1100010101010_{2} Posts 
Quote:
Note that the code does not have any sr1sieve or sr2sieve enhancements. I have a ways to go to pull in the sr2sieve bsgs logic. 

20200922, 14:27  #57 
Apr 2013
Durham, UK
3^{2}·7 Posts 
I think EXP_WORK relates to the additional mulmods involved in precalculating b^d (mod p) for all the appropriate remainder values 'd' where 0 <= d < Q.
E.g. We are rewriting the terms of a sequence k * b^n 1 as k * b^d * b^(Qm) 1 where n=Qm+d. The idea behind Q is to identify which values 'd' are unnecessary, thus ending up with less than Q subsequences over a range 1/Q times the size of the original range, hopefully giving a performance improvement. There could be multiple terms with the same remainder 'd' so we can precalculate b^d (mod p) for those instances. My understanding from looking at the source code of sr1sieve (bsgs.c) is that it is this mulmod that we are accounting for in EXP_WORK. 
20200924, 03:45  #58  
Jun 2003
1,579 Posts 
Quote:
For multiple k (sr2sieve) the program should intelligently decide which b^d should be computed. In Srsieve2 the following function would need to be corrected to make it faster for single k. Code:
void GenericWorker::BuildTables(uint64_t *baseToUse, uint64_t *primeList, double *invp, uint64_t *bm64){ ...} 

20200924, 05:31  #59 
Jun 2003
1,579 Posts 
I tried to replace mulmod by powmod to avoid unnecessary multiplication. I get an error during run time. Any idea why the calculations are not being done correctly?
Code:
void GenericWorker::BuildTables(uint64_t *baseToUse, uint64_t *primeList, double *invp, uint64_t *bm64) { uint64_t inv_b[4]; uint32_t qIdx, seqIdx, ssIdx; uint64_t umod[4], inv[4]; fpu_push_1divp(primeList[3]); fpu_push_1divp(primeList[2]); fpu_push_1divp(primeList[1]); fpu_push_1divp(primeList[0]); // Precompute 1/b^d (mod p) for 0 <= d <= Q. bd64[0][0] = bd64[0][1] = bd64[0][2] = bd64[0][3] = 1; bm64[0] = inv_b[0] = invmod64(baseToUse[0], primeList[0]); bm64[1] = inv_b[1] = invmod64(baseToUse[1], primeList[1]); bm64[2] = inv_b[2] = invmod64(baseToUse[2], primeList[2]); bm64[3] = inv_b[3] = invmod64(baseToUse[3], primeList[3]); /* for (qIdx=1; qIdx<ii_BestQ; qIdx++) { bd64[qIdx][0] = bm64[0]; bd64[qIdx][1] = bm64[1]; bd64[qIdx][2] = bm64[2]; bd64[qIdx][3] = bm64[3]; fpu_mulmod_4a_4b_4p(bm64, inv_b, primeList); } */ for (seqIdx=0; seqIdx<ii_SequenceCount; seqIdx++) { ck64[seqIdx][0] = smod64(ip_Sequences[seqIdx].c, primeList[0]); ck64[seqIdx][1] = smod64(ip_Sequences[seqIdx].c, primeList[1]); ck64[seqIdx][2] = smod64(ip_Sequences[seqIdx].c, primeList[2]); ck64[seqIdx][3] = smod64(ip_Sequences[seqIdx].c, primeList[3]); umod[0] = umod64(ip_Sequences[seqIdx].k, primeList[0]); umod[1] = umod64(ip_Sequences[seqIdx].k, primeList[1]); umod[2] = umod64(ip_Sequences[seqIdx].k, primeList[2]); umod[3] = umod64(ip_Sequences[seqIdx].k, primeList[3]); inv[0] = invmod64(umod[0], primeList[0]); inv[1] = invmod64(umod[1], primeList[1]); inv[2] = invmod64(umod[2], primeList[2]); inv[3] = invmod64(umod[3], primeList[3]); fpu_mulmod_4a_4b_4p(ck64[seqIdx], inv, primeList); } // Compute c/(k*b^d) (mod p) for each subsequence. for (ssIdx=0; ssIdx<ii_SubsequenceCount; ssIdx++) { bdck64[ssIdx][0] = ck64[ip_Subsequences[ssIdx].seqIdx][0]; bdck64[ssIdx][1] = ck64[ip_Subsequences[ssIdx].seqIdx][1]; bdck64[ssIdx][2] = ck64[ip_Subsequences[ssIdx].seqIdx][2]; bdck64[ssIdx][3] = ck64[ip_Subsequences[ssIdx].seqIdx][3]; fpu_powmod_4b_1n_4p(bm64, ip_Subsequences[ssIdx].d, primeList); //fpu_mulmod_4a_4b_4p(bdck64[ssIdx], bd64[ip_Subsequences[ssIdx].d], primeList); fpu_mulmod_4a_4b_4p(bdck64[ssIdx], bm64, primeList); bm64[0] = inv_b[0]; bm64[1] = inv_b[1]; bm64[2] = inv_b[2]; bm64[3] = inv_b[3]; } fpu_powmod_4b_1n_4p(bm64, ii_BestQ, primeList); fpu_pop(); fpu_pop(); fpu_pop(); fpu_pop(); } 
20200924, 12:44  #60 
"Mark"
Apr 2003
Between here and the
14252_{8} Posts 
fpu_powmod_4b_1n_4p() assumes that there is nothing on the FPU stack, but fpu_mulmod_4a_4b_4p() requires the first four entries on the stack to be preset (per fpu_push_1divp). You would have to create a version of fpu_powmod_4b_1n_4p() that supports precomputed values on the stack.

20200926, 12:15  #61  
"Alexander"
Nov 2008
The Alamo City
3×7×29 Posts 
Quote:
I discovered when I implemented fpu_powmod() that the x87 version does in fact use righttoleft exponentiation, as I noted in a previous post, so you were right on that concern earlier. Last fiddled with by Happy5214 on 20200926 at 12:16 

20200926, 13:06  #62  
"Mark"
Apr 2003
Between here and the
14252_{8} Posts 
Quote:


20201227, 18:26  #63 
"Mark"
Apr 2003
Between here and the
2×7×11×41 Posts 
I have posted mtsieve 2.1.0 over at sourceforge. There are many changes with this release, but users of gcwsievecl, mfsieve, and mfsievecl will get the most benefit. The changes are:
Code:
2.1.0  December 27, 2020 framework: On Windows, switched to using gcc from msys2 instead of gcc from mingw64. O3 gives a nice performance bump to some of the nonASM code. Exit after showing help when using h swtich instead of giving fatal error. Add logic to support validation of factors passed with I, but not all sieves are coded to do this validation. On GPU builds, default W to 0 and G to 1. A "special" CPU worker will be created as necessary to test ranges of p that are to small for the GPU code. For GPU executables, add H to show some GPU details when each kernel is created. Improve factor rate calculation when using GPU workers. Created FUTURE.txt for items I would like to get working. cksieve: version 1.3 Added logic to validate factors passed with I. gcwsieve, gcwsievecl: version 1.4 Added logic to validate factors passed with I. Add f option to gcwsieve so that it can generate ABC output that is supported by LLR. Added M option and fixed S for gcwsievecl. Improved speed of GPU code of gcwsievecl by about 25%. The GPU code is more than 20x faster than the CPU for the same amount of work. mfsieve, mfsievecl: version 1.7 Added logic to validate factors passed with I. Replaced all ASM routines with nonASM routines based upon Montgomery mulitplication. This provides a speed bump of at least 30% over the previous release. Fixed the GPU code and replaced magic number logic with Montgomery multiplcation as it is faster. The new GPU code is more then 25% faster than the old GPU code. The GPU code is more than 20x faster than the CPU for the same amount of work. sgsieve: version 1.2 Added logic to validate factors passed with I. twinsieve: version 1.3 Added logic to validate factors passed with I. xyyxsieve, xyyxsievecl: version 1.8 Added logic to validate factors passed with I. Code:
2.0.7  December 22, 2020 framework: Replace AMD SDK on Windows with open source SDK for OpenCL. Add more output for H for GPU executables. afsieve, afsievecl: Replace logic with Montgomery multiplcation. Add factor validation when using I. dmdsieve: Add factor validation when using I. fbncsieve: Add factor validation when using I. fkbnsieve: Add factor validation when using I. gfndsieve: Add factor validation when using I. kbbsieve: Add factor validation when using I. psieve: Add factor validation when using I. pixsieve: srsieve2: Implement sr1sieve and sr2sieve logic. Add factor validation when using I. 
20201227, 19:21  #64 
Dec 2011
After milion nines:)
2^{4}·89 Posts 
If you can rise upper limit of srsieve2.

20201227, 20:06  #65 
"Mark"
Apr 2003
Between here and the
2×7×11×41 Posts 

20201227, 21:01  #66 
Dec 2011
After milion nines:)
590_{16} Posts 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
srsieve/sr2sieve enhancements  rogue  Software  300  20210318 20:31 
mtsieve  rogue  Software  543  20210227 18:43 
LLRnet enhancements  kar_bon  No Prime Left Behind  10  20080328 11:21 
TODO list and suggestions/comments/enhancements  Greenbank  Octoproth Search  2  20061203 17:28 
Suggestions for future enhancements  Reboot It  Software  16  20031017 01:31 