#1
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
4,909 Posts
This thread is intended to hold only reference material specifically for CUDALucas.

(Suggestions are welcome. Discussion posts in this thread are not encouraged. Please use the reference material discussion thread. Off-topic posts may be moved or removed, to keep the reference threads clean, tidy, and useful.) If you're already set up and running in CUDALucas, scroll to the bottom of the post for the thread table of contents.

How to set up and run CUDALucas

Gpuowl and PRP are recommended for new first-time primality tests on GPUs that can run them. Gpuowl has superior error detection and handling, much lower verification cost due to its proof-generation capability, and is also faster than CUDALucas 2.06. To perform LL DC on GPUs that can run Gpuowl, Gpuowl is recommended as well, unless the first test was done with zero shift, since recent Gpuowl versions include the Jacobi check but lack nonzero-shift capability. Attempting PRP first is a reliable way to assess a GPU's reliability. Some older NVIDIA gpus can't run Gpuowl but can run CUDALucas, which lacks the Jacobi check. Running LL DC on them is recommended, not first-time LL tests.

Download from https://download.mersenne.ca/CUDALucas; for Linux https://download.mersenne.ca/CUDALuc...nux-x86_64.zip or for Windows https://download.mersenne.ca/CUDALuc...indows32.64.7z or https://sourceforge.net/projects/cudalucas/files/

Create a user directory and unzip the software in it. Get the appropriate CUDA-level cufft and cudart library files for your gpu and OS from https://download.mersenne.ca/CUDA-DLLs and place them in the same directory. Review the cudalucas.ini file. Keep an original version for reference. Only make changes you're sure of.

Get the cl-startup script below, for Windows, or tdulcet's scripts for Linux from http://www.mersenneforum.org/showpos...12&postcount=1. Edit carefully to adapt to your gpu and environment. Read and run them. Be patient. Depending on gpu model and other variables, the Windows startup script can take hours or days to complete.
On an RTX2080 (which is better suited to TF), a single-pass memory test alone takes about 75 minutes. If it crashes with an out-of-memory error, reduce the number of 25MB blocks to just below what it logged as attempting, and try again. (A test on an RTX2080 worked with 314 blocks.) The cl-startup script includes rerunning a small known Mersenne prime, of your choice by editing the file. Do not proceed to new work until it completes that correctly.

If the gpu shows memory errors, you might be able to clear them up by improving cooling or lowering the clock speed. Until it passes a comprehensive memory test, don't use it for primality testing. Retesting gpu memory annually and regularly performing double checks are recommended. CUDALucas is very vulnerable to memory errors since it has neither the Gerbicz error check nor the Jacobi check. System ram errors or gpu vram errors can cause wrong primality test results. See also the draft readme update at https://www.mersenneforum.org/showpo...84&postcount=6

It is likely that future discoveries of Mersenne primes will be made with Gpuowl via PRP/GEC, and confirmed with Gpuowl on a Radeon VII running LL with the Jacobi check. Parallel confirmations are likely in CUDALucas, prime95, and Mlucas on the fastest available reliable hardware.

To obtain LL DC assignments, go to https://www.mersenne.org/manual_assignment/ and check at the upper right that you're logged in. Specify the number of assignments and workers. (Start small.) At "Preferred work type:" select "Double-check LL tests". Click "Get Assignments". The page will update with assignments. (Eventually; be patient. Do not click page refresh unless you want multiple batches and have already copied the previous batch.) Copy and paste the assignments from the page into a worktodo.txt file in your CUDALucas working directory. Then launch the CUDALucas program to test the assignments. These can take a long time. Start with the smallest exponents you can get, until you develop a sense of the time required.
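For orientation, an LL double-check assignment pasted into worktodo.txt generally takes the form below. The 32-character assignment ID and the specific values here are made up for illustration; use exactly what the manual assignment page gives you.

```
DoubleCheck=0123456789ABCDEF0123456789ABCDEF,56831489,75,1
```

The fields after the assignment ID are the exponent, the trial-factoring bit depth already completed, and a flag indicating whether P-1 factoring has been done.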
Generally, time required on a given gpu is proportional to p^2.1 and is measured in days, weeks, or months. Longer than about a month per assignment is likely to be unreliable even on good equipment. ECC system ram may help reliability. An RTX2080 Super takes about 40 hours for a 53M LL DC, and is much more productive for the project when performing TF with mfaktc.

To report results, go to https://www.mersenne.org/manual_result/ and check at the upper right that you're logged in. Copy and paste recent (previously unreported) results into the page and click submit. The page will refresh. Note any error messages, and whether your double-check(s) matched. I usually append a marker in the results.txt file they came from to indicate what precedes it has been reported, then save it.

If the double-check does not match the first test's residue64, it means at least one of them is wrong. A triple check to resolve which can be requested at https://mersenneforum.org/showthread.php?t=24148 On rare occasions quad or higher checks are needed and can be requested there too. You can also help out there with the workload of triple checks. See post one of that thread and the gpu link there.

While the workload of managing the lengthy tests manually is small, https://www.mersenneforum.org/showpo...92&postcount=3 includes an attachment describing client management software options, which might be useful if you'd like to add some automation to the assignment and result reporting process for your gpu(s).

For more information on CUDALucas specifications and limits, the development and discussion thread link, etc., see the attachment at Available Mersenne Prime hunting software.

Table of contents
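The p^2.1 rule of thumb can be used to extrapolate run times from a single measured point, such as the ~40-hour 53M LL DC figure for an RTX2080 Super. A minimal sketch (the 75M and 106M exponents are illustrative, and actual times vary with fft-length breaks, clocks, and cooling):

```python
# Estimate LL test time from one measured data point, assuming time ~ p^2.1.
# Reference point from the text: ~40 hours for a 53M-exponent LL DC on an RTX2080 Super.
REF_P = 53_000_000      # reference exponent
REF_HOURS = 40.0        # measured run time at the reference exponent

def estimated_hours(p, ref_p=REF_P, ref_hours=REF_HOURS, power=2.1):
    """Scale the reference timing by (p / ref_p) ** 2.1."""
    return ref_hours * (p / ref_p) ** power

if __name__ == "__main__":
    for p in (53_000_000, 75_000_000, 106_000_000):
        print(f"M{p}: ~{estimated_hours(p):.0f} hours")
```

Doubling the exponent roughly quadruples the run time (2^2.1 ≈ 4.3), which is why starting with small assignments is advised.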
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-01-24 at 16:49. Reason: revised how-to section slightly to encourage gpuowl use
#2
Timings for an assortment of exponents are tabulated and charted for reference. Note, only one trial per combination was tabulated, so no measure was made or indication given of run-to-run reproducibility for the same inputs. (One exponent's run time was estimated from a fit made on results from several other exponents, with unexpectedly good results. Likely fit error appears to be 1-5% typically, subject to revision later.) See the attachment. This is a somewhat different way of looking at test speed than the GPU Lucas-Lehmer performance benchmarks at http://www.mersenne.ca/cudalucas.php
Run time power fits were made for timings from CUDALucas v2.05.1 on a GTX480 separately, for p<10^6 (p^1.339), 10^6<p<10^7 (p^1.849), and 10^7<p<10^8 (p^2.095). Compare those results to the expected asymptotic run time scaling p^2 log p log log p (~p^2.117).

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-18 at 14:25
#3
Here is the latest posted version of the list I am maintaining for CUDALucas. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have, preferably by PM to kriesel.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-18 at 14:25. Reason: added items and references
#4
What it may look like if your gpu should not be allowed to run or benchmark 1024 threads: http://www.mersenneforum.org/showpos...postcount=2634
Set up CUDALucas, and notes on the -r option (checks residues for up to 8192k, no higher): http://www.mersenneforum.org/showpos...postcount=2620
Choosing a CUDA level (considerations include which gives maximum performance, what the installed driver supports, and what the installed gpu model(s) require): http://www.mersenneforum.org/showpos...postcount=2625
Gpus changing device numbers when another drops out (the gory details): http://www.mersenneforum.org/showpos...postcount=2603
Edited readme, vintage April 2017 (no longer current): http://www.mersenneforum.org/showpos...postcount=2576
Unanswered questions about the ini file contents: http://www.mersenneforum.org/showpos...postcount=2579

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-18 at 14:25
#5
On Linux:
I haven't tried it or even looked at it, but http://www.mersenneforum.org/showpos...12&postcount=1 indicates a CUDALucas install and startup script. Make sure you get the latest version of CUDALucas, with checks for known-bad interim residues.

On Windows:

The attachment is a draft startup batch file (not installation), derived from a more compact but less annotated one I have used. It is for use after the necessary files are unzipped and placed in a folder, the CUDALucas.ini file configured, driver installed, dlls added, etc.

Note, while the CUDALucas program supports up to 256M fft lengths, they are not recommended. Run times at or above 64M fft length can be years or decades. For example, on a GTX 1080 Ti, a run of an exponent of approximately 1 billion has an expected fft length of 57344K, which has a benchmark time of about 43.8 msec/iteration, corresponding to a 1.39-year run time estimate. Estimated time on a near-maximal exponent ~2,147,483,647 would be ~2,147,483,647 iterations times 94.16 msec/iteration at fft length 128M, or 6.4 years on a GTX 1080 Ti. The chance of a primality test that long completing correctly without the GEC or Jacobi check is small. The exponent is currently capped at 2,147,483,645 for fft lengths 128M to 256M in the CUDALucas fft file.

Reliability of a run taking multiple months or years is expected to be low, since there is no Gerbicz check or Jacobi check in CUDALucas. There's also no support in the program for residue self-tests above fft length 8M. Also, mersenne.org does not assign primality tests for such high exponents (p>10^9) or accept results for them (nor does any other site to my knowledge).

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-07-16 at 19:09
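The run-time arithmetic above is easy to check: an LL test of M(p) takes about p iterations (p-2 squarings), so total time is roughly exponent times the per-iteration benchmark time. A sketch using the figures quoted above:

```python
# Convert a per-iteration benchmark time into a total LL run-time estimate.
# Figures from the text: ~43.8 ms/iteration at fft length 57344K on a GTX 1080 Ti
# for an exponent near 1 billion; ~94.16 ms/iteration at fft length 128M.

def ll_run_time_years(p, ms_per_iter):
    """An LL test of M(p) needs about p iterations; convert to years."""
    seconds = p * ms_per_iter / 1000.0
    return seconds / (365.25 * 24 * 3600)

if __name__ == "__main__":
    # Reproduces (approximately) the ~1.39-year and ~6.4-year estimates above.
    print(f"{ll_run_time_years(1_000_000_000, 43.8):.2f} years")
    print(f"{ll_run_time_years(2_147_483_647, 94.16):.2f} years")
```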
#6
The CUDALucas readme has been updated somewhat to include info on maximum fft length and more recent CUDA levels, as well as other additions or clarifications.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-18 at 14:26
#7
The current CUDALucas code supports exponent values up to 2^31-1 = 2,147,483,647. A quick test on a GTX1080Ti:
Code:
Wed Jan 09 04:41:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.78                 Driver Version: 378.78                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 2000        WDDM  | 0000:02:00.0      On |                  N/A |
|100%   78C    P0    N/A /  N/A |     88MiB /  1024MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108... WDDM  | 0000:03:00.0     Off |                  N/A |
| 66%   82C    P2   220W / 250W |   1619MiB / 11264MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1868    C   ...Documents\mfaktc q2000\mfaktc-win-64.exe      N/A  |
|    1      4644    C   ...CUDALucas2.06beta-CUDA8.0-Windows-x64.exe     N/A  |
+-----------------------------------------------------------------------------+
Code:
Continuing M999999937 @ iteration 4302 with fft length 57344K, 0.00% done
|   Date     Time    |  Test Num   Iter       Residue         |   FFT    Error    ms/It    Time  |      ETA     Done  |
|  Jan 09  04:45:26  | M999999937  5000  0xb723ad2cf90fefd5   | 57344K  0.18750  40.3755  28.18s | 473:09:25:34 0.00% |
|  Jan 09  04:46:07  | M999999937  6000  0x00c230e56a4bc3ca   | 57344K  0.20313  40.6178  40.61s | 472:20:17:29 0.00% |
|  Jan 09  04:46:48  | M999999937  7000  0x7d01674dde8ecc02   | 57344K  0.18945  40.9224  40.92s | 472:22:59:37 0.00% |

Extrapolating linearly for memory requirements (which is optimistic; above 2G, code gets a bit bigger), and by the 2.1 power for run time versus exponent. Note, while I was originally composing this, as the gpu warmed up, the projected run time increased about 0.5% beyond what's tabulated here for M1G, from which all the others are extrapolated: Code:
     p    VRAM GB   run time (years per exponent)
 M617M      1.00      0.47
 M1G        1.62      1.3
 M1234M     2.00      2.0
 M2G        3.24      5.6   (~current exponent limit in the CUDALucas code)
 M2.53G     4.00      9.1
 M3G        4.86     13.1
 M3.32G     5.38     16.2
 M3.7G      5.99     20.3
 M4G        6.48     23.9
 M4.94G     8.00     37.2
 M5G        8.10     38.2
 M6G        9.72     56.
 M6.8G     11.02     73.
 M7G       11.34     77.
 M8G       12.96    102.
 M9G       14.58    131.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-01-05 at 20:55
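The table can be reproduced from the M1G baseline row by the stated extrapolation rules: VRAM linear in p, run time by the 2.1 power. A sketch of that arithmetic (an extrapolation model, not a measurement):

```python
# Extrapolate VRAM (linear in p) and run time (p^2.1) from the M1G baseline,
# as described in the text. Baseline: M1G -> 1.62 GB, 1.3 years on a GTX 1080 Ti.
BASE_P = 1.0e9
BASE_VRAM_GB = 1.62
BASE_YEARS = 1.3

def extrapolate(p):
    vram = BASE_VRAM_GB * p / BASE_P           # memory grows ~linearly with exponent
    years = BASE_YEARS * (p / BASE_P) ** 2.1   # run time grows ~p^2.1
    return vram, years

if __name__ == "__main__":
    for p in (2.0e9, 4.0e9, 8.0e9):
        vram, years = extrapolate(p)
        print(f"M{p/1e9:.0f}G: {vram:.2f} GB, {years:.1f} years")
```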
#8
It depends. What does best mean? What does your gpu require? What's the lowest CUDA level usable with your gpu? What's the highest CUDA level that still supports your gpu? What other gpus do you have in the same system, that may constrain your choice of driver version or CUDA level? What driver versions and CUDA level support are available for the OS you are using? Which fft lengths will you run the most? Which releases have unacceptable reliability or bugs? Which is fastest on your hardware, and most-used fft lengths, providing the other considerations are acceptable?
In general, unless your gpu is so new that it requires the latest release, it's likely some other CUDA version could perform better, and it can vary depending on the program inputs. I tested years ago for a performance dependence on NVIDIA driver version (~v260-v378), and did not see any statistically significant difference by driver version on a GTX480 under Windows 7, CUDA versions 4.2-8.0. (However, on AMD, I have seen reductions of performance with driver version updates, including a drop of over 5%.) On NVIDIA, CUDA version does affect speed. Test results ranging from CUDA5 to CUDA8, provided by ATH for his Titan Black gpu, are shown in the attachment. CUDA8 is rarely the fastest.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-26 at 07:20
#9
Note, this apparently goes to stderr since it is unaffected by output redirection.
Code:
$ CUDALucas -h|-v
$ CUDALucas [-d device_number] [-info] [-i inifile] [-threads t1 t2] [-c checkpoint_iteration] [-f fft_length] [-s folder] [-polite iteration] [-k] exponent|input_file name
$ CUDALucas [-d device_number] [-info] [-i inifile] [-threads t1 t2] -r [0|1]
$ CUDALucas [-d device_number] -cufftbench start end passes (see cudalucas.ini)
$ CUDALucas [-d device_number] -threadbench start end passes mode (see cudalucas.ini)
$ CUDALucas [-d device_number] -memtest size passes (see cudalucas.ini)

    -h          print this help message
    -v          print version number
    -info       print device information
    -i          set .ini file name (default = "CUDALucas.ini")
    -threads    set threads numbers (eg -threads 256 128)
    -f          set fft length (if round off error then exit)
    -s          save all checkpoint files
    -polite     GPU is polite every n iterations (default -polite 0) (-polite 0 = GPU aggressive)
    -r          exec residue test.
    -k          enable keys (see CUDALucas.ini for details.)

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-18 at 14:27
#10
Save file size is almost proportional to exponent. There is a small amount of space used for constants regardless of exponent size. Based on fits to observed file size over a wide range of exponents (~1.4M to 1G), a very wide extrapolation is made to estimate file sizes that would be required for some very large exponents, up to M127 as an exponent.
File systems' max file size would become limiting at ~67 bits, but run time becomes limiting much lower. Gpu ram size also imposes a limit.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-05-31 at 14:06
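A simple model consistent with "almost proportional to exponent": the save file must hold the full residue of about p bits, i.e. at least p/8 bytes packed, plus a small constant overhead. This is an illustrative lower-bound estimate under assumed parameters, not CUDALucas's exact on-disk format (the actual file may be larger if the residue is stored unpacked):

```python
# Lower-bound save-file size estimate: a full residue of M(p) is ~p bits,
# i.e. ~p/8 bytes packed, plus an assumed small fixed header (hypothetical value).
HEADER_BYTES = 4096  # assumed constant overhead, for illustration only

def savefile_bytes_lower_bound(p):
    """Packed-residue size in bytes for exponent p, plus assumed header."""
    return p // 8 + HEADER_BYTES

if __name__ == "__main__":
    for p in (1_400_000, 100_000_000, 1_000_000_000):
        print(f"p={p:>13,}: ~{savefile_bytes_lower_bound(p)/1e6:.1f} MB minimum")
```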
Similar threads:
Thread | Thread Starter | Forum | Replies | Last Post
Reference material discussion thread | kriesel | kriesel | 62 | 2020-12-12 08:57
Mersenne Prime GPU Computing reference material | kriesel | kriesel | 31 | 2020-07-09 14:04
Mfaktc-specific reference material | kriesel | kriesel | 8 | 2020-04-17 03:50
How do you obtain material of which your disapproval governs? | jasong | jasong | 97 | 2015-09-14 00:17
CUDALucas Residue Test (-r) Reference Table | Brain | GPU Computing | 0 | 2012-04-12 20:21