[QUOTE=matzetoni;592673][C]> gfndsievecl.exe -n22001 -N25000 -k2000000 -K3000000 -o"out1.txt"[/C]
I tried running the above input with the newest mtsieve version 2.2.2.7 and the program just exits after 5 minutes with no output and no error written to the log file.[/QUOTE] I found the issue and will fix it. In the interim, use gfndsieve to sieve to 1e8, then switch to gfndsievecl.
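Something along these lines should work (a sketch only: I am assuming the standard mtsieve framework switches -p/-P for the prime range and -i for the input terms file apply to gfndsieve/gfndsievecl, and the 1e12 upper bound is just a placeholder for however deep you want to sieve):

[code]
gfndsieve.exe   -n22001 -N25000 -k2000000 -K3000000 -P100000000 -o"out1.txt"
gfndsievecl.exe -i"out1.txt" -p100000000 -P1000000000000 -o"out2.txt"
[/code]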
I am sieving a conjecture with 7223 sequences for CRUS. I was trying to use srsieve2cl and ran into some issues.
I discovered that srsieve2cl consumes too much memory when using a lot of sequences, so I have made an adjustment to the calculation of factor density, which can be changed using the -M command line switch. This impacts how much memory is allocated for factors returned from GPU memory to CPU memory. In the future I will add functionality to account for reduced factor density on the fly. As one sieves more deeply, the amount of memory needed for moving factors from GPU to CPU decreases. For now the adjustment to the calculation is sufficient.

I have noticed that when using the GPU, a full CPU core is used even though the CPU is waiting on the GPU (on Windows). I will modify the code to exclude GPU time when computing the factor rate until I have fully investigated the issue. I also noticed in testing that long-running kernels (a few seconds or longer) impact the computed factor rate, making it fluctuate up and down depending upon how many kernel executions completed in the previous minute. To account for this, I have increased the minimum number of minutes used to compute the factor rate.

My hope is to find time between Christmas and New Year's to finally get the sr2sieve functionality (which supports Legendre tables when one has multiple sequences) implemented into srsieve2 and srsieve2cl. Based upon how long it takes to build Legendre tables when one has many sequences, I will try to offload that logic onto the GPU so it should take seconds to build the tables instead of minutes or hours. For these 7223 sequences it took hours to build those tables.

Using a fixed range of P and the exact same input file (7223 sequences and 6868941 terms), here is the wall time (in seconds) to run the various programs:

[code]
srsieve      1407
sr2sieve     2334  (no Legendre tables)
sr2sieve     fail  (with Legendre tables)
srsieve2     1183
srsieve2cl    213  (using the default of -M10 -g10)
srsieve2cl    147  (using -M1 -g100)
[/code]

I used wall time rather than CPU time because srsieve/sr2sieve don't compute the latter and because srsieve2cl computes CPU utilization to account for the fact that it is multi-threaded. I ensured that other CPU-intensive applications were not running, so I cannot determine how much these programs are impacted by other CPU-intensive programs running concurrently.

There was not enough memory to build the Legendre tables for sr2sieve. I don't know how much memory it needed; it stopped about halfway through building them. That implies that srsieve2 and srsieve2cl will also fail to allocate the necessary memory for those tables. When I make the enhancements to srsieve2, I will see what I can do about failing earlier rather than later. I doubt there is much I can do about reducing memory usage with so many sequences.
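To illustrate why the factor buffer can shrink as the sieve gets deeper (this is only a back-of-the-envelope sketch with made-up names, not the actual srsieve2cl code): by Mertens' theorem the fraction of remaining terms expected to pick up a factor while sieving from p1 to p2 is roughly 1 - log(p1)/log(p2), so the number of factor slots needed per chunk drops as p1 grows.

[code]
#include <cmath>
#include <cstdint>

// Rough illustration (not the actual mtsieve code) of how many factor slots a
// GPU->CPU buffer needs for one chunk of primes.  By Mertens' theorem the
// fraction of remaining terms that gain a factor while sieving from p1 to p2
// is approximately 1 - log(p1)/log(p2).
uint64_t EstimateFactorSlots(uint64_t remainingTerms,
                             double p1, double p2,
                             double safetyMultiplier /* playing the role of an -M-style adjustment */)
{
   double expectedFactors = remainingTerms * (1.0 - std::log(p1) / std::log(p2));

   // Round up and apply the safety multiplier so an unusually lucky chunk
   // does not overflow the buffer.
   return (uint64_t) std::ceil(expectedFactors * safetyMultiplier) + 1;
}
[/code]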
[QUOTE=rogue;594426]I am sieving a conjecture with 7223 sequences for CRUS. I was trying to use srsieve2cl and ran into some issues.
...[/QUOTE] sr2sieve has a limit of a little over 2 GB. It is a 64-bit build, so I don't know what is causing that limit.

I trimmed down the input file to 2541 sequences with 2421027 terms. The time for sr2sieve here is after it has built the Legendre tables.

[code]
sr2sieve      759  (no Legendre tables)
sr2sieve      327  (with Legendre tables)
srsieve2      446
srsieve2cl     36  (using the default of -M10 -g10)
srsieve2cl     32  (using -M1 -g100)
[/code]

So even without Legendre tables, srsieve2cl kicks sr2sieve's ass. FYI, sr2sieve took over an hour to build those Legendre tables; that time is not included in this table. srsieve2 and srsieve2cl with Legendre tables will be faster, but how much faster is unknown until I write that code.

All tests were for the same range of primes, so I don't know why the times do not scale.
I have about 1/3 the number of terms, yet the runs are more than 3x as fast. I suspect memory cache misses, but that is just a guess. It requires further investigation.
Mark, I was intrigued by your speed tests, so I decided to dig my old code up and run a few too.
I assumed from the numbers (and the reservations page) that you were running tests on R126, so I fired up a quick sieve file with 7223 sequences and 8212814 terms, so reasonably similar, just less well sieved. I see a roughly 12x performance difference between srsieve (~30,000 p/sec) and my CUDA code (~365,000 p/sec) on my hardware (5600X and GTX 1060), which is pretty similar to the 10x difference you see between srsieve and srsieve2cl. If you ran your tests on the same hardware you mentioned a few pages back (i7-8850H and P3200), then adjusting for hardware, my native CUDA code looks to be about 50% faster than srsieve2cl.

I do, however, have a form of Legendre implemented, so I would expect you to see a pretty reasonable performance bump from implementing Legendre, based on that difference and obviously the underlying theory. I really need to spend an evening getting srsieve2cl to work on my machine; then I might be able to help identify any potential areas for gains. When you come to implement it, you may find you get more performance calculating Legendre on the fly rather than generating the tables and storing them. I certainly found that, as it helped to relieve the memory pressure of looking things up.

I also suspect you are correct that your greater-than-3x performance improvement with 1/3 the terms is memory pressure. The GPU cache size makes an incredible difference to my code; I've seen 2x performance from cards with similar compute but 33%-50% more L2 cache. It appears your code sees something reasonably similar.
[QUOTE=rob147147;594442]Mark, I was intrigued by your speed tests, so I decided to dig my old code up and run a few too.
...[/QUOTE] The memory hit for the Legendre tables implies that computing on the fly could be better.
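For reference, computing the symbol on the fly doesn't need any table at all: the standard binary Jacobi-symbol algorithm is enough, since for an odd prime modulus the Jacobi symbol equals the Legendre symbol. A minimal sketch (not mtsieve's actual code):

[code]
#include <cstdint>
#include <utility>

// Jacobi symbol (a/n) for odd n > 0; returns -1, 0 or +1.  For an odd prime n
// this equals the Legendre symbol, which is what the on-the-fly path needs.
int JacobiSymbol(uint64_t a, uint64_t n)
{
   int result = 1;

   a %= n;
   while (a != 0)
   {
      while ((a & 1) == 0)
      {
         a >>= 1;
         // (2/n) = -1 when n = 3 or 5 (mod 8)
         if ((n & 7) == 3 || (n & 7) == 5)
            result = -result;
      }

      std::swap(a, n);
      // Quadratic reciprocity: flip the sign when both are 3 (mod 4)
      if ((a & 3) == 3 && (n & 3) == 3)
         result = -result;

      a %= n;
   }

   return (n == 1) ? result : 0;
}
[/code]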
I have started implementing the changes to support the sr2sieve functionality in srsieve2. So far I have focused on the building of the Legendre tables. Although not difficult, the main issue with sr2sieve is that one has no idea of the progress when it is expected to take a long time to build those tables. srsieve2 will now give an estimate as to when it will complete. srsieve2 will also "abort" building the Legendre tables if they are too large, i.e. need too much memory. sr2sieve would just terminate; srsieve2 will go on its merry way after giving a message. What I don't know yet is whether srsieve2 using the c=1 logic without Legendre tables will be faster than the generic logic, which does not use Legendre tables. What I also don't know is whether computing Legendre values on the fly will be faster if there isn't enough memory to build them.
I do intend to make a change to split the sequences into groups when feeding the GPU. That should boost performance (fewer cache misses) and it should address problems for those with a GPU that has less memory.
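A trivial sketch of what that splitting might look like (hypothetical names; the real implementation would also need per-group GPU buffers):

[code]
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: hand the GPU worker one slice of the sequences at a
// time instead of all of them at once.
template <typename Sequence, typename Kernel>
void SieveInGroups(const std::vector<Sequence> &sequences,
                   size_t sequencesPerGroup,
                   Kernel &&runKernelOnGroup)
{
   for (size_t first = 0; first < sequences.size(); first += sequencesPerGroup)
   {
      size_t last = std::min(first + sequencesPerGroup, sequences.size());

      // Each call only needs buffers sized for this slice, which reduces the
      // GPU memory footprint and improves cache locality.
      runKernelOnGroup(&sequences[first], last - first);
   }
}
[/code]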
It has been a week, so time for an update.
The code is a mess right now. As I was working on the logic for sr2sieve, I ended up having to refactor some of the single-sequence logic (sr1sieve) used to build the congruent q, ladder, and Legendre tables, since some of that code is shared between both of those sieves. A lot of this was in consideration of using OpenCL for the new sr2sieve logic. On the plus side, I addressed some code that was confusing to read and thus hard to understand. So at this point it won't compile. It might take a day or more to fix all of those problems, but I first need to revisit the new code for the sr2sieve functionality because I'm certain that I have missed some logic or did not implement some logic correctly.

For the Legendre tables, the code will first determine how much memory is needed for those tables and spit that out. It will then try to allocate that memory. If it fails, then it won't use the Legendre tables. But there are other considerations that I have to look into.

First, what if the code can build some Legendre tables, but not all of them? Based upon my understanding of the code, srsieve2 should use the Legendre tables for the sequences for which there was enough memory, but then fall back and compute the Legendre symbol on the fly for the other sequences.

Second, I recall reading that there are cases where one could use sr1sieve to sieve more than one k (in series) and that would be faster than sieving both k with sr2sieve. I haven't tested that. I can imagine it might be useful if one can build the Legendre tables for only one k, but not both (due to memory limitations). I will have to do some testing to see if that is true. I don't think it is, but I could be wrong.

Third, how much of a performance hit will it be for the OpenCL logic (for multiple sequences) if it has to compute the Legendre symbol on the fly? This goes back to my concern about how much GPU memory is needed.

Fourth, before I started on these changes I had implemented the logic to split the number of sequences sieved at a time with the GPU. I didn't complete that testing, although the results at the time were promising.

Fifth, I need to address building OpenCL code on OS X. With Apple's changes that are pushing developers to use Metal, it no longer supplies the OpenCL headers even though I can use the OpenCL framework. I will have to pull those from Khronos, put them into SVN, and use them for builds on OS X. Next year I hope to get an M1 MacBook, so mtsieve will be supported on that, although some sieves might not be ported because few people use them and they rely heavily on x86 asm.
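A minimal sketch of that "report the size, try to allocate, fall back" behaviour described above (hypothetical names, not the actual srsieve2 code):

[code]
#include <cstdint>
#include <cstdio>
#include <new>
#include <vector>

// Hypothetical sketch: report the estimated table size, try to allocate it,
// and fall back to computing Legendre symbols on the fly if allocation fails.
bool TryBuildLegendreTable(uint64_t bytesNeeded, std::vector<uint8_t> &table)
{
   printf("Legendre tables for this sequence need %llu bytes\n",
          (unsigned long long) bytesNeeded);

   try
   {
      table.resize(bytesNeeded);
   }
   catch (const std::bad_alloc &)
   {
      printf("Not enough memory; this sequence will compute the Legendre symbol on the fly\n");
      return false;
   }

   return true;
}
[/code]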
Very good work Rogue.
A note on your 1-k or 2-k sequence question. We have to go back to a few years after CRUS started, when someone finally upgraded sr2sieve to run more efficiently. If memory serves me correctly (or someone can find the post our dear missed Lennart made), that takes us all the way back to 2010 or 2011. Before that update of sr2sieve, it was indeed faster to run 2 instances of sr1sieve, each sieving 1 sequence, than to sieve 2 sequences with sr2sieve; sr2sieve was in fact only fastest for 3 or more sequences. After the upgrade, sr1sieve was only faster for a single sequence and not for conjectures with 2 or more k's remaining.

Looking forward to seeing what can be done on my GT 1030's once I complete SR383 testing to n=1M and have to start sieving about 39 sequences for either 1M n, 9M n or 49M n (39M candidates, 351M candidates or 1,911M candidates) - that sieve is about 970 days into the future :smile:
Thanks for letting me know KEP. I will do some testing to verify behavior when the code is ready.
In the next release the -l parameter, which is currently used to turn off the Legendre logic, will be modified to accept a value indicating how much memory you want to allocate for Legendre tables. It will default to 2^62, which is ridiculous of course. At runtime srsieve2 will tell you how much memory is needed and will give you an idea how long it will take to build the Legendre tables, giving an ETC. If your machine doesn't have enough memory, it will terminate immediately (as opposed to trying to build the tables and then failing before it finishes). It won't tell you how much memory you have available, but if you combine how much memory it says it needs with how much you can see is available, you can run again with -l limiting how much memory it will allocate for those tables. srsieve2 will only allocate up to that limit. This means that some sequences will be tested using pre-built Legendre tables and others will compute the Legendre symbol on the fly.

This brings up a few things.

First, I suspect that srsieve2 calculating the Legendre symbol on the fly might be slower than the generic sieving logic. I have seen this with srsieve vs sr2sieve. This might be dependent upon the sequences being sieved or might be true for all sequences.

Second, if computing the Legendre symbol on the fly is slower than generic sieving, then it would make sense to sieve the sequences with a Legendre table using the sr2sieve code and sieve those without using the srsieve code. I could possibly do this in code, but it would be easier for me to have users split the sequences across two input files and sieve them separately. This could be a challenge to do in srsieve2. If anything, it would go on the wish list.

Third, I am thinking about building two Legendre tables for mixed-parity sequences, which is what sr1sieve and srsieve2 (with a single sequence) do. When sr2sieve was initially written, Geoff was more concerned about memory requirements, but today's computers typically have a lot more memory than the ones that existed when sr2sieve was first created. I think that this could improve performance by about 25%, but it will take some time for me to verify the accuracy of that assumption. I'm trying to think how to best manage this if a computer doesn't have enough memory. I have some options. One is "all or nothing", where all sequences use only one method. This could be based upon available memory, as creating two tables instead of one for a sequence requires more memory. In this case I would try to allocate all memory for two Legendre tables (for mixed-parity sequences) and, if there isn't enough, switch to allocating memory for one Legendre table per sequence. I could write the code to use available memory to pick and choose which sequences use one Legendre table and which use two, but that would be a lot of work and might not be worth the effort.

Fourth, although not written yet, the GPU logic for multiple sequences with Legendre tables is a concern. Some computers will have more available CPU memory than GPU memory. They will fail when sieving starts as opposed to when the program builds the Legendre tables. So this is a question of whether or not to force the GPU to compute the Legendre symbol on the fly when the CPU could use Legendre tables.

Fifth, it might be possible that the GPU logic computing the Legendre symbol (no Legendre tables) is faster than the CPU logic computing the Legendre symbol, yet the CPU with Legendre tables is faster than the GPU with Legendre tables. This is likely dependent upon the end user's GPU and CPU. If that is the case, then the end user will need to do some testing to find out what is best for them.

Right now I do not have a switch to force use of the generic logic when the c=1 logic is available. That will be a requirement for the next release.

If anyone here has opinions on the direction I am taking, feel free to share them.
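To make the "some sequences get tables, the rest compute on the fly" idea concrete, here is one way the -l budget could be applied per sequence (purely illustrative, with made-up names, not the planned srsieve2 code):

[code]
#include <cstdint>
#include <vector>

struct SeqInfo
{
   uint64_t legendreBytes;   // memory needed for this sequence's table(s)
   bool     useTable;        // true -> pre-built table, false -> on the fly
};

// Hypothetical sketch: walk the sequences and give each one a pre-built
// Legendre table until the -l memory budget is exhausted; the remaining
// sequences compute the Legendre symbol on the fly.
void AssignLegendreTables(std::vector<SeqInfo> &sequences, uint64_t memoryBudget)
{
   uint64_t used = 0;

   for (auto &seq : sequences)
   {
      if (used + seq.legendreBytes <= memoryBudget)
      {
         seq.useTable = true;
         used += seq.legendreBytes;
      }
      else
         seq.useTable = false;
   }
}
[/code]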
I like your ideas Mark :smile:
That said, in the wake of the experience I have now had trying to get mfaktc running on Linux Mint 20.4, one really, really big wish: if mtsieve is not, like mprime, a truly STAND-ALONE program where everything needed for the program to execute and test is self-contained in the binary, then please include the driver files and whatever else is needed externally for mtsieve to work on a Linux machine in the zip file (with instructions on where to put the files). It simply isn't a walk in the park using Linux, especially if one has little experience and the machine(s) cannot get online for at least 4 or maybe even 6 months - if ever - when one has to, for instance, install CUDA and make mfaktc (just to give an example) see the version of CUDA that the Linux terminal claims is installed.

Still looking forward to seeing what can be done and achieved when using mtsieve :smile:

Merry Christmas to you and your loved ones :smile:
Someone could probably write some kind of service to start a .bat file at startup. On Linux or OS X, this would be a .sh file.
One could wrap a client/server app around it, but that would take a lot of work.

As of today, the generic sieve and the c=1 sieve (for a single sequence) are working again, which required a bit of testing after all of the refactoring to support the c=1 sieve (for multiple sequences). Unfortunately the c=1 sieve for a single sequence lost a lot of speed. I do not know why. It finds all of the expected factors, so I need to investigate this further. Hopefully the cause is fairly obvious.

There is another issue in the c=1 GPU sieve (for a single sequence) that causes it to crash, but I cannot trigger the problem when running through gdb. Very mysterious and annoying. I have yet to find the cause of that problem. Once I build on OS X, it is more likely that I will be able to trigger the issue in lldb, but that remains to be seen.

I have yet to start testing the c=1 sieve (for multiple sequences). I know it won't work "out of the box", but I hope it is close. Once that works I can focus on the GPU version of that sieve.

All that being said, there is still a lot of testing to be done, but I feel confident that it will be ready next week.