#1 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×5×17×29 Posts |
This thread is here for comparison to the gpu-based applications. Please use the reference discussion thread https://www.mersenneforum.org/showthread.php?t=23383 to make comments or suggestions.
Mprime and prime95 are specific to Intel-compatible processors. Older processor models will have limited support, if any. For ARM and other cpus that are not Intel-compatible, see mlucas.

Version v30.3 (build 6), and for MacOS v29.8 build 7, are available at https://www.mersenne.org/download/. V30.3 and later are PRP-proof capable versions, which greatly reduce the effort of verifying a primality test but require considerably more disk space to do so. See the readme.txt and other documentation included in the compressed distribution file for more info on that. There is also v30.4, announced for prerelease testing; it is now at build 9 and considered a beta release. Older versions, v29.8b6 and earlier, for legacy operating systems are also available at https://www.mersenne.org/download/.

Setup instructions are included at https://www.mersenne.org/download/. Follow with https://www.mersenne.org/gettingstarted/.

One thing to avoid is installing into "Program Files" or other restricted directories. Permissions problems will follow. Making a separate working directory for prime95 under the user's home directory is the way to go.

I strongly recommend benchmarking over the range of fft lengths expected to be used, analyzing the results in a spreadsheet, and configuring for the best throughput that is consistent with latencies shorter than the applicable assignment expiration periods. Configure worker windows for your preferred work type, and make sure that it is not trial factoring; gpus are far more effective at that.

It is normal for the PrimeNet server to issue only LL DC to a new prime95 installation, until each prime95/mprime worker has completed 4 LL DC successfully. After a new installation accumulates a history of reliability, the PrimeNet server will allow additional work types. For remaining questions see the program's extensive included documentation.
PRP run time scaling for low p https://www.mersenneforum.org/showpo...78&postcount=2
P-1 run time scaling https://www.mersenneforum.org/showpo...92&postcount=3
Effect of number of workers https://www.mersenneforum.org/showpo...18&postcount=4
Effect of number of workers (continued) https://www.mersenneforum.org/showpo...19&postcount=5
Effect of frequent interim residue output https://www.mersenneforum.org/showpo...44&postcount=6
Prime95 documentation https://www.mersenneforum.org/showpo...03&postcount=7
Prime95 exponent limits https://www.mersenneforum.org/showpo...74&postcount=8
PRP proof capable versions https://www.mersenneforum.org/showpo...35&postcount=9
Performing version upgrades https://www.mersenneforum.org/showpo...2&postcount=10
Effect of number of workers continued 2 https://www.mersenneforum.org/showpo...4&postcount=11
See also the Concepts in GIMPS Trial Factoring post at https://www.mersenneforum.org/showpo...23&postcount=6
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-02-13 at 19:16 Reason: updated note on available versions; added directory advice and initial LL DC comments |
#2 |
Run time is fitted as approximately proportional to $p^{2.094}$, for 86243 <= p <= 2976221. LL run time is expected to scale very similarly. For comparison, a theoretical fft convolution based primality tester scales as $p^2 \log p \log\log p$, which over the mersenne.org interval fits as $p^{2.117}$. Overhead at low exponents lowers the power on a fit.
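As a rough check on the theoretical comparison above, the effective power of p^2 log p log log p can be estimated with a least-squares fit in log-log space. This is only a sketch: the interval bounds passed to `fitted_power` are my assumption, since the post does not state which bounds were used for the 2.117 figure, so the result will be near that value but need not match it.

```python
import math

def model_cost(p):
    """Relative cost of an fft-based primality test: p^2 * log p * log log p."""
    return p * p * math.log(p) * math.log(math.log(p))

def fitted_power(p_min, p_max, samples=200):
    """Least-squares slope of log(cost) versus log(p) on log-spaced exponents."""
    lo, hi = math.log(p_min), math.log(p_max)
    xs = [lo + i * (hi - lo) / (samples - 1) for i in range(samples)]
    ys = [math.log(model_cost(math.exp(x))) for x in xs]
    mean_x = sum(xs) / samples
    mean_y = sum(ys) / samples
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Assumed interval (not taken from the post): 86243 up to 10^9.
power = fitted_power(86243, 1_000_000_000)
print(f"effective power over the assumed interval: {power:.3f}")
```

The fitted power sits a little above 2 because the log factors grow slowly; narrowing the interval toward small exponents raises it slightly.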
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2019-11-18 at 14:30 |
#3 |
A small number of widely spaced exponents were run to observe the run time scaling.
For prime95 v29.4b8 x64 run on a Windows 7 x64 system with dual E5-2670 chips, 4 cores (half a chip package) per worker, 32,000 MB allowance per worker, run time was approximately proportional to $p^{2.33}$ up to 595M (27 days), a somewhat higher power than observed for P-1 on gpus (~2.1).

Another prime95 v29.4b8 x64 run, on an FMA equipped i7-7500U Windows 10 x64 system, seemed to be taking inordinately long to perform P-1 at p=101M, with 7,200 MB memory allowed, on one core. It had been running for two weeks to perform stage 1 and reach 90% in stage 2. It appeared to be paging to disk excessively. The same system can complete an 83M primality test per core in about 2.5 weeks. It was allowed to complete that P-1 and then reset to 4096M memory allowed, after it was found to still page excessively at 6144M. This is a system with 8GB ram currently. In all cases it was running 1 core per worker; the other worker was running an 83M LL. It projected P-1 run times ranging from 4.4 days for 201M to 43 days for 605M and 67 days for 701M. However, attempting 605M resulted in "Cannot initialize FFT code, errcode=1002". The fit to observed run time is $p^{2.087}$ (with five data points).

Another run, a mix of prime95 v29.7b1, v29.8b3, and v29.8b6, on an FMA equipped i7-8750H Windows 10 x64 system, was able to run 801M (at 8GB allocated of its 16GB installed ram, 37 days run time) and 901M (at 12GB allocated, 57 days run time), and is expected to be capable of up to 920.8M. The offset in the estimated days runtime is believed to be due to whether mfakto is running on the Intel igp or not. Prime95 seems to be using somewhat lower bounds than GPU72 figures for exponents above p~400M.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-01-05 at 14:05 Reason: updated i7-8750h attachment for new data |
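The kind of power-law fit described in the post above can be sketched as a straight-line fit in log-log space. The three points used here are the run times prime95 projected on the i7-7500U (201M, 605M, 701M), not the five observed points behind the 2.087 figure, which are not listed in the post; expect a similar but not identical exponent. The function name is my own.

```python
import math

def fit_power_law(points):
    """Fit days = c * p**k by least squares on (log p, log days); return (k, c)."""
    xs = [math.log(p) for p, _ in points]
    ys = [math.log(days) for _, days in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    k = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    c = math.exp(mean_y - k * mean_x)
    return k, c

# (exponent, projected days) as quoted in the post for the i7-7500U system.
projected = [(201e6, 4.4), (605e6, 43.0), (701e6, 67.0)]
k, c = fit_power_law(projected)
print(f"fitted power k = {k:.2f}")
```

With so few points the fitted power is sensitive to each one, which is a reason to treat any single exponent from a handful of runs as approximate.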
#4 |
Similar to the number of threads choices in gpu applications, on multicore systems, the effect of number of cores per worker in prime95 is unpredictable, and so there is provision for benchmarking.
Number of workers could be chosen to optimize performance. But which measure of performance? Aggregate throughput maximized, latency of one assignment minimized, number of joules used for a 100 GHzD primality test, aggregate throughput given a constraint of latency low enough to avoid assignment expiration, something else? For which single fft length, or for the current and next several?

For minimum latency, as for confirming a newly discovered Mersenne prime, Madpoo has run experiments on a dual-14-core system. He reported the fastest primality test time at around 20 cores out of the 28 available; with any more than 6 on the lesser-used package, the increased package to package data transfers slow the progress.

To pick a number of cores/worker per cpu type that is a reasonable compromise for maximum aggregate throughput, so I can set it and forget it for months or years on each system, I ran the built in prime95 benchmarking over wide fft ranges for a variety of cores/worker, on a variety of cpu types. Then the timings were tabulated in spreadsheets and graphed. If going after the maximum performance per fft length, consider that some work types restart from the beginning when the number of workers is changed. Read the readme.txt and other files, back up before changing number of workers, plan ahead, etc.

Some patterns emerge. Worker counts that would straddle the divide between processor packages if divided evenly typically do not provide as much throughput. A 12-core 2-package system with 3 workers with equal cores/worker would have at least one worker with cores in each package (4, 2+2, 4). George indicates recent versions of prime95 prevent the straddle by assigning unequal numbers of cores to the workers. For larger core counts there can be quite a few choices to evaluate. What's fastest for one fft length may not be for others. A compromise that averages a small percentage penalty is usually available.
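The straddling pattern can be checked with a few lines of arithmetic. This is an illustrative sketch only, assuming cores are numbered package by package and divided evenly among workers; it does not describe how prime95 actually assigns cores, and the function name is my own.

```python
def straddling_workers(total_cores, packages, workers):
    """Return indices of workers whose cores span more than one package.

    Assumes cores 0..total_cores-1 are numbered package by package and each
    worker gets an equal, contiguous block of cores.
    """
    if total_cores % packages or total_cores % workers:
        raise ValueError("this sketch only handles even splits")
    per_package = total_cores // packages
    per_worker = total_cores // workers
    straddlers = []
    for w in range(workers):
        first = w * per_worker
        last = first + per_worker - 1
        # A worker straddles if its first and last cores sit in different packages.
        if first // per_package != last // per_package:
            straddlers.append(w)
    return straddlers

# 12 cores, 2 packages, 3 workers: the middle worker gets 2 cores in each package.
print(straddling_workers(12, 2, 3))
```

Running the same check for 2 or 4 workers on that system returns an empty list, which matches the observation that those worker counts avoid the package divide.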
Plotting the various combinations with trend lines seems a useful visualization method for selecting one configuration to run with for a long time.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-11-15 at 19:28 |
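Besides plotting, the "compromise with a small average penalty" idea from the post above can be expressed numerically. The sketch below assumes benchmark results have been tabulated as total iterations/sec per configuration and fft length; the configuration labels and all throughput numbers are hypothetical placeholders, not measurements from any system discussed here.

```python
def average_penalty(throughput):
    """throughput: {config: {fft_length: total iterations/sec}}.

    Return {config: mean fractional shortfall versus the best config at each
    fft length}, using only fft lengths benchmarked for every config.
    """
    fft_lengths = set.intersection(*(set(t) for t in throughput.values()))
    best = {f: max(t[f] for t in throughput.values()) for f in fft_lengths}
    return {
        cfg: sum(1 - t[f] / best[f] for f in fft_lengths) / len(fft_lengths)
        for cfg, t in throughput.items()
    }

def pick_compromise(throughput):
    """Pick the config with the smallest average penalty across fft lengths."""
    penalties = average_penalty(throughput)
    return min(penalties, key=penalties.get)

# HYPOTHETICAL numbers for illustration only.
example = {
    "1 worker x 4 cores": {"4096K": 100.0, "5120K": 80.0},
    "2 workers x 2 cores": {"4096K": 110.0, "5120K": 78.0},
    "4 workers x 1 core": {"4096K": 104.0, "5120K": 70.0},
}
choice = pick_compromise(example)
print(choice, average_penalty(example)[choice])
```

This optimizes aggregate throughput only; a latency constraint, such as finishing before assignment expiration, would need to be applied as a filter on the candidate configurations first.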
#5 |
Working around the 5-attachment limit per post:
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-11-13 at 13:27 Reason: cosmetic cleanup for i5-1035G1 |
#6 |
Timing runs on LL DC on the same 51M exponent and old 32-bit hardware with prime95 29.4b7 yield conflicting information on the cost of a Res64 output as a multiple of an ordinary iteration. The res64 cost is estimated as 7/8 to 4 times an iteration. Note that because of numbering skew between prime95 and other conventions, prime95 outputs res64 at 3 successive iterations, with cost ~3.1 to 12 times an iteration. The lower value is based on prime95-provided timings per iteration, the higher value on prime95-provided time stamp of 1 second resolution of the res64 output line.
An initial attempt to make a similar measurement on an i7-8750H with UHD630 igp in prime95 v29.4b8 x64 yielded negative per-res64 cost in two tries. I speculate this was an interaction with mfakto running at the same time on the same chip package power budget. Performance monitor indicates the cpu utilization drops considerably when frequent interim residue output is enabled.

A retest, with the UHD630 mfakto instance halted, yielded timings that indicate a cost per PRP3 res64 interim output on the i7-8750H system of 2.7 seconds, equivalent to 263 iterations, on an 83M primality test. One of the 6 cores stays very busy while the rest are only used at a low duty cycle when outputting an interim residue every 10 iterations. This cut throughput from 96.6 iter/sec to 3.54 iter/sec, a rather severe 96.3% reduction. The estimated effect on run time for the exponent when producing interim residues for the PrimeNet server at 5,000,000 iteration intervals is about 45 seconds, 52ppm of run time. The retest was brief, taking 48 seconds for iterations with interim residues and 114 seconds without, so accuracy is no better than a percent or two. Note also that the cpu clock was not held constant during the test. In this case the agreement between time stamp based rates and program-computed ms/iter was very good, ~1/4%.

Another test, on a dual-xeon-e5-2690 system, v29.6b6 x64 on Win10, 4 cores/worker, 83.9M PRP tests, gave ~305 iterations/interim residue64, 3.45 sec/interim residue, or around 61ppm for the default 5,000,000 iteration interval. The preceding figure ignores the initial 500K-iteration interim residue, which raises the impact a bit to 65ppm for ~84M exponents, and somewhat more for DC exponents.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-18 at 14:31 |
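The ppm figures in the post above reduce to a simple ratio: the time for one interim residue, expressed in iterations, divided by the residue interval. A minimal sketch, using only the numbers quoted in the post (the function name is my own, and the initial 500K-iteration residue is ignored, as in the post's first estimate):

```python
def interim_residue_overhead_ppm(seconds_per_residue, iterations_per_second,
                                 residue_interval=5_000_000):
    """Overhead of interim residue output in parts per million of run time.

    Each block of `residue_interval` iterations costs one residue output, so
    the exponent cancels: overhead = (residue cost in iterations) / interval.
    """
    iterations_equivalent = seconds_per_residue * iterations_per_second
    return 1e6 * iterations_equivalent / residue_interval

# i7-8750H: 2.7 s per residue at 96.6 iter/sec.
i7 = interim_residue_overhead_ppm(2.7, 96.6)
# Dual Xeon: 3.45 s per residue, ~305 iterations per residue.
xeon = interim_residue_overhead_ppm(3.45, 305 / 3.45)
print(f"i7-8750H: {i7:.1f} ppm, dual Xeon: {xeon:.1f} ppm")
```

Because the exponent cancels, the same ppm applies to any exponent tested at the same iteration rate; the initial 500K-iteration residue adds a fixed cost that matters proportionally more for smaller (DC) exponents.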
#7 |
Most GIMPS applications include a readme file. Prime95 has very comprehensive documentation included in the zip package, in multiple files.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2019-11-18 at 14:32 |
#8 |
Prime95 and its sibling mprime contain many code paths specific to processor types and exponent magnitudes. What range of exponents is supported varies by processor type. I think what has been implemented was determined by a combination of processor throughput versus exponent size and decisions by George on which to spend his programming time.
There are several ways to determine what these limits are.
George has made statements about them in email or on the forum. https://mersenneforum.org/showpost.p...&postcount=219
The whatsnew.txt describes numerous changes in what was supported.
The source code is available for examination.
Trying runs on differing hardware and OS may obscure the situation, because it could be an old operating system version, not the processor type, that prevents running some versions of code.
Last fiddled with by kriesel on 2020-05-25 at 00:21 |
#9 |
UPDATE:
V30.3b6 is now generally available. This automatically uploads proof files and includes resource limit features. Direct download links for prime95 64-bit for Windows; mprime 64-bit for Linux. (32 bit and other variations also available.) V30.3b6 appears on the main GIMPS software page and the mersenne.ca download mirror.

Previously: Per https://www.mersenneforum.org/showpo...&postcount=119 V30.1b1 prime95 or mprime are available and require manual uploading of proof files. Direct download from dropbox: prime95 for Windows 64-bit; mprime for Linux 64-bit. A run of PRP with proof becomes conspicuous by its multi-gigabyte p.residues file. These downloads contain all the necessary code including dll files. (V30.2b1 DID NOT contain the dlls. Install v30.1b1 first, then v30.2b1 atop it.)

The standalone command-line uploader, which works for gpuowl as well as prime95, is described briefly at https://www.mersenneforum.org/showpo...&postcount=154 but the direct download from dropbox for Windows x64 is no longer available. It can be found as an attachment at https://www.mersenneforum.org/showpo...0&postcount=26 NOTE: it is not being maintained, and the preferred usage is to upload through a current version of prime95 or mprime. Usage is
Code:
uploader user_id proof_filename[ chunk_size[ upload_rate_limit]]
(Note, for gpuowl, there are more choices; https://www.mersenneforum.org/showpo...0&postcount=26, some of which might conceivably apply to prime95/mprime too, at least for the most adventurous. But I encourage users to stick with prime95 & mprime's built in PrimeNet API & supported features whenever practical.)
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-09-17 at 14:14 Reason: V30.3b6 general release update |
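For anyone scripting the standalone uploader despite its unmaintained status, the nested optional arguments in the usage line above mean a rate limit can only be given together with a chunk size. A small sketch that builds the argument list in that order; the user id and file name shown are hypothetical, and since the post does not state the units of `chunk_size` or `upload_rate_limit`, the values are passed through untouched.

```python
def uploader_command(user_id, proof_filename, chunk_size=None, upload_rate_limit=None):
    """Build the argv list for: uploader user_id proof_filename[ chunk_size[ upload_rate_limit]]"""
    if upload_rate_limit is not None and chunk_size is None:
        # The usage line nests the rate limit inside the chunk size option.
        raise ValueError("upload_rate_limit requires chunk_size")
    args = ["uploader", str(user_id), str(proof_filename)]
    if chunk_size is not None:
        args.append(str(chunk_size))
    if upload_rate_limit is not None:
        args.append(str(upload_rate_limit))
    return args

# Hypothetical user id and proof file name, for illustration only.
command = uploader_command("example_user", "example.proof")
print(" ".join(command))
```

The list could be handed to `subprocess.run(command, check=True)` from the directory containing the uploader; as the post says, uploading through a current prime95 or mprime remains the preferred route.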
#10 |
The most efficient method will depend on whether it's a single install or a fleet of them to be upgraded. Each leaves your results files, worktodo, log files, work in progress files, and configuration files in place and undisturbed. (But you should be doing regular system backups anyway.)
Single install:
1. Stop and exit the prime95 program to allow prime95 program files to be overwritten.
2. Download the zip file.
3. Unzip it. If necessary, move the new files into your working directory. Select replace if prompted.
4. Restart the program in the working directory.

Multiple systems, with USB drive:
1. Download the zip file. Put it onto the USB drive. Unzip it there.
2. On each system:
   a. Insert the USB stick.
   b. Stop and exit the prime95 program to allow prime95 program files to be overwritten.
   c. Copy the new version's files from the USB stick to the working directory, overwriting the old.
   d. Start the program in the working directory.
   e. "Eject" the USB drive. Its file explorer window will close.
   f. Remove the USB stick.

Multiple systems, with network drive:
1. Download the zip file. Put it onto the network drive. Unzip it there.
2. On each system:
   a. In file explorer, navigate to the update version prime95 folder on the network drive.
   b. Stop and exit the prime95 program to allow prime95 program files to be overwritten.
   c. Copy the new version's files from the network folder to the working directory, overwriting the old.
   d. Start the program in the working directory.
   e. Close the file explorer window for the update version folder.

It's possible to streamline the above somewhat with a bit of batch script. Strictly speaking, it is not necessary to copy and overwrite files that have not changed from the previous version, but it does little harm. Unneeded copying can be efficiently avoided by date sorting both source and destination folders, and only copying what's newer than the corresponding destination file.
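The "only copy what's newer" step can also be scripted. Below is a minimal sketch in Python rather than batch script, assuming the unzipped new version sits in one folder and the working directory is another; the function name is my own, and prime95 should already be stopped before running anything like this.

```python
import os
import shutil

def copy_newer_files(source_dir, dest_dir):
    """Copy files from source_dir to dest_dir when the destination copy is
    missing or older. Returns the names copied. Subdirectories are skipped,
    and files that exist only in dest_dir are left untouched."""
    copied = []
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        dst = os.path.join(dest_dir, name)
        if not os.path.isfile(src):
            continue
        if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
            shutil.copy2(src, dst)  # copy2 preserves the modification time
            copied.append(name)
    return copied
```

Because files present only in the destination are never touched, user files such as worktodo.txt stay in place, consistent with the manual procedure.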
For more detail, quoted with some editing, from S485122 at https://mersenneforum.org/showpost.p...61&postcount=4
prime.txt contains the GIMPS user data,
local.txt contains the machine data,
worktodo.txt contains the current work (assigned or not),
at some times a file named prime.spl, which contains the results not yet transmitted to the server, might be present,
the work files pnnnnnnn mnnnnnnnn etc and their backup copies .bu, bu2, etc...
None of these files are in the prime95.zip archive and will thus not be overwritten. They are essential for continuity. There are other user files that are not in the archive either, but they are less critical (results.txt, results.json.txt, prime.log, gwnum.txt, ...)
In other words, keep all other files in the folder, since they contain your user and machine data and preferences, your work in progress and results. The only files overwritten will be the program and version dependent files.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-11-19 at 22:26 |
#11 |
Additional processor types:
FMA3 capable 6-core i7-8750H (no code running on the IGP at the time)
Xeon Phi 7250 (68 cores in one socket); see also https://www.mersenneforum.org/showthread.php?t=25767
Last fiddled with by kriesel on 2020-11-19 at 22:19 |