![]() |
I fixed the issue that caused the c=1 (single sequence) sieve to slow down. While playing around with the code, I decided to make BASE_MULTIPLE, LIMIT_BASE, and POWER_RESIDUE_LCM configurable via command line switches. These are used primarily for the c=1 sieves and not the generic sieve. Although the default setting for these provides some of the fastest sieving, I have found some configurations that provide a 4% speed bump. Many other configurations are dreadful. The point is that you will want to take a range of 1e9 primes and try various settings for the switches. This could take a couple of hours, but considering that sieving deeply enough could take many days to reach optimal sieving depth, it would be worth the time.
Before I take the c=1 (multiple sequence) sieve testing I will finally implement the code for -L, where srsieve2 can read from a file then save Legendre tables to a file. Right now it appears that the main benefit is when one has to build large Legendre tables (or many Legendre tables) because building small ones doesn't take very long. |
In working the logic to read/write files for the Legendre tables, I have a question to the users. Would it be preferable to have a single file for all sequences or to have a single file per sequence?
In the case of a single file, if you have a lot of sequences, then that file can be quite large. In one of my current tests with nearly 4000 sequences that file is over 6 GB in size. It takes over five hours to build (single threaded). I'm leaning towards a change to have a single file per sequence. -L would now be used to provide a directory name (instead of a file name). The file names would be auto-generated. This would allow a user to run multiple instances of srsieve2 concurrently (same box or different box) to build those files then run a single instance (with one or more workers) that could then read the files for the sequences it uses. Those files could easily be saved for future use. Thoughts? |
my initial reaction was "one file please, clutter."
Then your plan for organization convinced me that one file per sequence is fine too! |
Not a lot of feedback, but my mind was made up for me in testing.
I am running into an issue with large binary files, specifically the Legendre file when it exceeds 2 GB in size. Per the gcc doc I should be able to use fseek() to go beyond 2 GB, but I cannot despite the defining _FILE_OFFSET_BITS as 64 at compile time. I suspect that srsieve2 was never used to manage such a large file since it doesn't work either. For this reason I am going to modify -L to represent a directory (which can be relative or absolute to srsieve2) instead of a file. That probably isn't what everyone wants to read, but I cannot imagine it being too much of a hardship for users. |
I have finally made enough progress to commit my changes. I haven't posted a build yet as I need to do more testing, but everything looks good right now. The new release is version 1.6.0.
My goal was never to make this logic as fast as sr2sieve. My hope was to get it "in the ballpark", but the results below are worse than what I expected. I know some of the reasons why it is slower (no hand-tuned assembly for unrolling loops), but there could be others. The long term goal was to implement this logic in the GPU, where I expect it to perform much better than sr2sieve. The unknown is whether or not it will perform better that the generic logic in srsieve2cl 1.5.3. I have not started coding it yet. I think it will be partly dependent upon whether or not I can determine the reasons for v1.6.0 to be so much slower. Here are some timings for one range that has 265 sequences. Each run produced the same list of factors. All of these were run on the same laptop. The CPU is an Intel i7-8850H. The GPU is an NVIDIA Quadro P3200. [code] srsieve 5397 seconds sr2sieve (no Legendre tables) 4083 seconds sr2sieve (with Legendre tables) 1772 seconds srsieve2 v1.5.3 (generic logic) 3424 seconds srsieve2cl v1.5.3 (generic logic, -g10 (default)) 672 secomds srsieve2cl v1.5.3 (generic logic, -g100) 602 secomds srsieve2 v1.6.0 (c=1 logic, no legendre tables) 5015 seconds srsieve2 v1.6.0 (c=1 logic, with legendre tables) 3504 seconds [/code] Once I can compile and run this on OS X, I can compare there as I have an AMD Radeon on that computer. I think the results are pretty clear if you have multiple sequences where abs(c) = 1 for all sequences: [LIST][*]If you have a GPU, use srsieve2cl, although I strongly suggest you run some tests as GPU speeds vary significantly.[*]If you do not have a GPU and are sieving sequences which support Legendre tables, use sr2sieve.[*]If you do not have a GPU and are sieving sequences which do not support Legendre tables, use srsieve2.[*]If you do not have a GPU and have a mix of sequences where some support Legendre tables and some do not, split into two inputs, run one thru sr2sieve and one thru srsieve2.[/LIST] |
On the latest commit (r160) I am running into an issue with the compilation of srsieve2:
[code]sierpinski_riesel/CisOneWithOneSequenceGpuWorker.cpp:87:56: error: ‘POWER_RESIDUE_LCM’ was not declared in this scope 87 | sprintf(define12, "#define POWER_RESIDUE_LCM %u\n", POWER_RESIDUE_LCM); | [/code] which seems strange as a) we are defining this here, and b) in CisOneSequenceHelper.h we have the related ii_PowerResidueLcm defined. |
[QUOTE=Dylan14;597820]On the latest commit (r160) I am running into an issue with the compilation of srsieve2:
[code]sierpinski_riesel/CisOneWithOneSequenceGpuWorker.cpp:87:56: error: ‘POWER_RESIDUE_LCM’ was not declared in this scope 87 | sprintf(define12, "#define POWER_RESIDUE_LCM %u\n", POWER_RESIDUE_LCM); | [/code] which seems strange as a) we are defining this here, and b) in CisOneSequenceHelper.h we have the related ii_PowerResidueLcm defined.[/QUOTE] That has to be fixed before I can release the next build. If you need the GPU worker for the single sequence c=1 sieve, then d/l the latest .7z file of the Windows executables. If you want to take srsieve2 for multi sequence c=1 sieving for a spin, make srsieve2 and not srsieve2cl. |
I fixed the source to address the compiler errors for srsieve2cl.
|
mtsieve 2.2.3 is finally released.
At this time if you are sieving multiple sequences with srsieve2/srsieve2cl, use -l0 to force usage of the generic logic. Without it that command line option it will use the sr2sieve logic, but my testing has shown that to be slower than the generic logic. Also there is no OpenCL variant supporting sr2sieve logic. Here are the updates: [code] framework: Modified sieving and factor rate calculations to account for long-running sieving processes which lead to those rates fluctuating a lot. Added xmallocNew() which can return NULL if not enough memory can be allocated which allows callers to decide how to handle that condition. For OpenCL sievers, added a call to clFinish() immediately after the call to clEnqueueNDRangeKernel() so that the GPU time is more accurately calculated. fkbnsieve: version 1.4 Do not create output file if there are no terms to write to it. Exit sieving early if there are no remaining terms. smsieve, smsievecl: version 1.0 Initial release for sieving of Smarandache numbers between Sm(100000) and Sm(9999999). Cannot be used at this time for sieving p < 10000, but that is okay since a project has a pre-sieved input file to use. It is impractical at this time to sieve values larger than Sm(9999999) due to the time needed for a PRP test with pfgw. srsieve2, srsieve2cl: version 1.6.0 Changed the calculation of factor density so that the GPU workers can use a lot less GPU memory. Added -S to split sequences into groups when feeding the GPU. The following only apply to the abs(c) = 1 logic: Added sr2sieve logic which uses Legendre checks in support of sieving when there are multiple sequences. Modified -L to represent a directory to hold a back up of the Legendre tables. There will be one file per unique b/k/c. Modified -l to specify the amount of memory to allocate to Legendre tsbles. -l0 disables use of Legendre tables. Added -U to specify a multiplier to compute BASE_MULTIPLE. The specified value is multiplied by 2 to compute BASE_MULTIPLE. Added -V to specify a multiplier to compute POWER_RESIDUE_LCM. The specified value is multiplied by BASE_MULTIPLE to compute POWER_RESIDUE_LCM. Added -X to specify a multiplier to compute LIMIT_BASE. The specified value is multiplied by POWER_RESIDUE_LCM to compute LIMIT_BASE. BASE_MULTIPLE: least exponent Q for which sieving in subsequence base b^Q will be allowed. POWER_RESIDUE_LCM: For a prime p that satisfies p=1 (mod r), an "r-th power residue test" checks whether a subsequence of k*b^n+c can possibly contain any terms of the form x^r (mod p). If there are none then that subsequence can be omitted from the BSGS step. To conduct r-th power residue tests for each r in a set R of prime powers, set POWER_RESIDUE_LCM to lcm(R). LIMIT_BASE: Allow sieving in base b^Q for Q chosen from the divisors of LIMIT_BASE. When sieving with a single sequence the defaults are: bmMultiplier = 15, prlMultiplier = 24, lbMultiplier = 1 BASE_MULTIPLE = 30, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 When sieving with multiple sequences the defaults are: bmMultiplier = 1, prlMultiplier = 360, lbMultiplier = 1 BASE_MULTIPLE = 2, POWER_RESIDUE_LCM = 720, LIMIT_BASE = 720 It is recommend to play with these multipliers to find the optimal values as the provided defaults are from sr1sieve and sr2sieve and some values can provide faster sieving. The recommendation is to sieve a range of 1e9 and see how long it takes (use sr2sieve.log to see how long it actually takes). Vary -U, -V, and-X evaluate the results. With a script and multiple cores, this could done relatively quickly for most inputs. [/code] |
latest revision
e:\MTSIEVE\216\MT>k1b2sieve -P 100000000 -w 1e7 -W 1 -n 100 -N 10000 -c 1 -c 3 k1b2sieve v1.1, a program to find factors of 2^n+c numbers for variable n and c Fatal Error: kmax must be specified |
[QUOTE=pepi37;601563]latest revision
e:\MTSIEVE\216\MT>k1b2sieve -P 100000000 -w 1e7 -W 1 -n 100 -N 10000 -c 1 -c 3 k1b2sieve v1.1, a program to find factors of 2^n+c numbers for variable n and c Fatal Error: kmax must be specified[/QUOTE] Probably means cmax, i.e. -C. I will fix the text. |
All times are UTC. The time now is 05:04. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.