2018-12-14, 19:54   #2
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1342₁₆ Posts
PRP run time scaling for low p

Run time is fitted as approximately proportional to p^2.094, for 86243 <= p <= 2976221. LL run time is expected to scale very similarly. For comparison, a theoretical FFT-convolution-based primality tester scales as p^2 log p log log p, which over the mersenne.org interval fits as p^2.117. Overhead at low exponents lowers the power on a fit.
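The effective power of the theoretical cost can be estimated with a log-log least-squares fit. What follows is a minimal sketch, not the method used for the attached spreadsheet. It assumes sample points spaced geometrically over the measured range quoted above; the bounds of the mersenne.org interval are not stated here, so the result need not equal 2.117. The helper names are illustrative.

```python
import math

def fit_power_law(ps, ts):
    """Least-squares fit of ln(t) = ln(c) + k*ln(p); returns (k, c)."""
    xs = [math.log(p) for p in ps]
    ys = [math.log(t) for t in ts]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    k = sxy / sxx
    return k, math.exp(my - k * mx)

def theoretical_cost(p):
    """p^2 log p log log p, up to an arbitrary constant factor."""
    return p * p * math.log(p) * math.log(math.log(p))

def geometric_grid(lo, hi, n):
    """n points from lo to hi, evenly spaced in ln(p)."""
    ratio = (hi / lo) ** (1 / (n - 1))
    return [lo * ratio ** i for i in range(n)]

# Assumed sampling: 50 points over the measured range quoted above.
ps = geometric_grid(86243, 2976221, 50)
power, _ = fit_power_law(ps, [theoretical_cost(p) for p in ps])
print(f"effective power over this range: {power:.3f}")
```

The same fit_power_law helper can be applied to measured (exponent, run time) pairs to obtain a fitted power for real data.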

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 prp run times low Mp.pdf (15.9 KB, 204 views)

Last fiddled with by kriesel on 2019-11-18 at 14:30

2018-12-24, 22:06   #3
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×5×17×29 Posts
Prime95 P-1 run time scaling

A small number of widely spaced exponents were run to observe the run time scaling.

For prime95 v29.4b8 x64 run on a Windows 7 x64 system with dual E5-2670 chips, 4 cores (half a chip package) per worker, and 32,000 MB allowance per worker, run time was approximately proportional to p^2.33 for exponents up to 595M (27 days), a somewhat higher power than observed for P-1 on GPUs (~2.1).

Another prime95 v29.4b8 x64 run, on an FMA-equipped i7-7500U Windows 10 x64 system, seemed to be taking inordinately long to perform P-1 at p=101M, with 7,200 MB memory allowed and one core. It had been running for two weeks to perform stage 1 and reach 90% in stage 2, and appeared to be paging to disk excessively. The same system can complete an 83M primality test per core in about 2.5 weeks. It was allowed to complete that P-1 and was then reset to 4096M memory allowed, after it was found to still page excessively at 6144M. This system currently has 8 GB of RAM. In all cases it was running 1 core per worker; the other worker was running an 83M LL. It projected P-1 run times ranging from 4.4 days for 201M to 43 days for 605M and 67 days for 701M. However, attempting 605M resulted in "Cannot initialize FFT code, errcode=1002".
The fit to observed run time is p^2.087 (with five data points).
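A rough cross-check of such a fit can be made from any two (exponent, run time) pairs: if t is proportional to p^k, then k = ln(t2/t1) / ln(p2/p1). The sketch below uses the projected run times quoted above for 201M and 605M. The function names are illustrative, and a two-point estimate is cruder than the five-point fit.

```python
import math

def two_point_power(p1, t1, p2, t2):
    """Exponent k in t ~ p**k implied by two (exponent, run time) pairs."""
    return math.log(t2 / t1) / math.log(p2 / p1)

def scale_run_time(t_ref, p_ref, p_new, power):
    """Extrapolate a run time from a reference point using t ~ p**power."""
    return t_ref * (p_new / p_ref) ** power

# Projected P-1 run times quoted above: 4.4 days at 201M, 43 days at 605M.
k = two_point_power(201e6, 4.4, 605e6, 43.0)
print(f"two-point power: {k:.2f}")

# Extrapolate from the 201M point using the fitted power 2.087.
est_605 = scale_run_time(4.4, 201e6, 605e6, 2.087)
print(f"estimated days at 605M: {est_605:.1f}")
```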

Another run, using a mix of prime95 v29.7b1, v29.8b3, and v29.8b6 on an FMA-equipped i7-8750H Windows 10 x64 system, was able to run 801M (with 8 GB allocated of its 16 GB installed RAM, 37 days run time) and 901M (with 12 GB allocated, 57 days run time), and is expected to be capable of up to 920.8M. The offset in the estimated days of run time is believed to be due to whether mfakto is running on the Intel IGP or not. Prime95 seems to be using somewhat lower bounds than the GPU72 figures for exponents above p~400M.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 p-1 run time scaling e5-2670.pdf (14.1 KB, 142 views) p-1 run time scaling FMA i7-7500u.pdf (16.1 KB, 149 views) p-1 run time scaling FMA i7-8750H.pdf (18.4 KB, 143 views)

Last fiddled with by kriesel on 2020-01-05 at 14:05 Reason: updated i7-8750h attachment for new data

2018-12-28, 18:13   #4
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×5×17×29 Posts
Effect of number of workers

As with the choice of thread count in GPU applications, the effect of the number of cores per worker in prime95 on multicore systems is hard to predict, so prime95 makes provision for benchmarking.

The number of workers could be chosen to optimize performance. But which measure of performance? Aggregate throughput maximized, latency of one assignment minimized, number of joules used for a 100 GHzD primality test, aggregate throughput given a constraint of latency low enough to avoid assignment expiration, something else? For which single FFT length, or for the current and next several?

For minimum latency, as when confirming a newly discovered Mersenne prime, Madpoo has run experiments on a dual-14-core system. He reported that the fastest primality test time occurred at around 20 of the 28 available cores; with any more than 6 cores on the less-used package, the increased package-to-package data transfers slow progress.

To pick a number of cores/worker per CPU type that is a reasonable compromise for maximum aggregate throughput, so that I can set it and forget it for months or years on each system, I ran the built-in prime95 benchmarking over wide FFT ranges for a variety of cores/worker settings, on a variety of CPU types. The timings were then tabulated in spreadsheets and graphed.

If going after the maximum performance per fft length, consider that some work types restart from the beginning when the number of workers is changed. Read the readme.txt and other files, back up before changing number of workers, plan ahead, etc.

Some patterns emerge. Worker counts that would straddle the divide between processor packages if divided evenly typically do not provide as much throughput. A 12-core 2-package system with 3 workers with equal cores/worker would have at least one worker with cores in each package (4, 2+2, 4). George indicates recent versions of prime95 prevent the straddle by assigning unequal numbers of cores to the workers.
For larger core counts there can be quite a few choices to evaluate. What's fastest for one fft length may not be for others. A compromise that averages a small percentage penalty is usually available. Plotting the various combinations with trend lines seems a useful visualization method for selecting one configuration to run with for a long time.
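
Picking a compromise configuration can also be done numerically, as a complement to plotting. The sketch below is one possible approach, not necessarily how the attached spreadsheets were analyzed. The throughput numbers are invented placeholders, and the dictionary layout and function names are assumptions for illustration. Real input would be the aggregate throughput reported by the prime95 benchmark for each worker count and FFT length.

```python
def average_penalty(throughput):
    """Mean fractional shortfall of each configuration versus the best
    configuration at each FFT length.

    throughput: {config_name: {fft_length_k: aggregate_iter_per_sec}}
    """
    fft_lengths = sorted(next(iter(throughput.values())))
    best = {f: max(cfg[f] for cfg in throughput.values()) for f in fft_lengths}
    return {
        name: sum(1 - cfg[f] / best[f] for f in fft_lengths) / len(fft_lengths)
        for name, cfg in throughput.items()
    }

def pick_compromise(throughput):
    """Return the configuration with the smallest average penalty."""
    penalties = average_penalty(throughput)
    return min(penalties, key=penalties.get), penalties

# Hypothetical aggregate throughput (iter/sec) on a 12-core, 2-package system.
example = {
    "2 workers x 6 cores": {2048: 400.0, 4096: 190.0, 8192: 90.0},
    "3 workers x 4 cores": {2048: 380.0, 4096: 170.0, 8192: 80.0},
    "4 workers x 3 cores": {2048: 410.0, 4096: 185.0, 8192: 84.0},
}
choice, penalties = pick_compromise(example)
for name, pen in penalties.items():
    print(f"{name}: average penalty {pen:.1%}")
print("selected:", choice)
```

Weighting the FFT lengths by how long each is expected to be in use would be a natural refinement.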

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 2-core Core 2 Duo E8200 performance.pdf (27.7 KB, 157 views) 2-core i3-M370 performance.pdf (28.3 KB, 176 views) dual 4-core e5520 performance.pdf (28.3 KB, 157 views) dual 6-core X5650 performance tune.pdf (30.7 KB, 160 views) dual 6-core E5645 performance.pdf (31.9 KB, 169 views)

Last fiddled with by kriesel on 2020-11-15 at 19:28

2018-12-28, 18:16   #5
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1001101000010₂ Posts
Effect of number of workers continued

Working around the 5-attachment limit per post:

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 dual 8-core e5-2670 performance.pdf (29.0 KB, 164 views) i7-4790 performance.pdf (37.4 KB, 151 views) dual e5-2697 prime95 performance.pdf (125.6 KB, 164 views) nuc performance.pdf (89.5 KB, 136 views) i5-1035g1 performance.pdf (96.3 KB, 58 views)

Last fiddled with by kriesel on 2020-11-13 at 13:27 Reason: cosmetic cleanup for i5-1035G1

2019-03-14, 03:49   #6
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·5·17·29 Posts
Effect of frequent Res64 output

Timing runs on LL DC on the same 51M exponent and old 32-bit hardware with prime95 29.4b7 yield conflicting information on the cost of a Res64 output as a multiple of an ordinary iteration. The res64 cost is estimated as 7/8 to 4 times an iteration. Note that because of numbering skew between prime95 and other conventions, prime95 outputs res64 at 3 successive iterations, with cost ~3.1 to 12 times an iteration. The lower value is based on prime95-provided timings per iteration, the higher value on prime95-provided time stamp of 1 second resolution of the res64 output line.

An initial attempt to make a similar measurement on an i7-8750H with UHD630 igp in prime95 v29.4b8 x64 yielded negative per-res64 cost in two tries. I speculate this was an interaction with mfakto running at the same time on the same chip package power budget. Performance monitor indicates the cpu utilization drops considerably when frequent interim residue output is enabled.

A retest, with the UHD630 mfakto instance halted, yielded timings that indicate a cost per PRP3 res64 interim output on the i7-8750H system of 2.7 seconds, equivalent to 263 iterations, on an 83M primality test. When outputting an interim residue every 10 iterations, one of the 6 cores stays very busy while the rest are only used at a low duty cycle. This cut throughput from 96.6 iter/sec to 3.54 iter/sec, a rather severe 96.3% reduction. The estimated effect on run time for the exponent when producing interim residues for the primenet server at 5,000,000 iteration intervals is about 45 seconds, 52 ppm of run time. The retest was brief, taking 48 seconds for iterations with interim residues, and 114 seconds without, so accuracy is no better than a percent or two. Note also the cpu clock was not held constant during the test. In this case the agreement between time stamp based rates and program-computed ms/iter was very good, ~1/4%.
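
The per-residue cost can be cross-checked from the two throughput figures alone. This is a minimal sketch, assuming one interim residue per 10 iterations and roughly 83,000,000 iterations for the whole test (the exact exponent is not given above). The function name is illustrative.

```python
def res64_cost(rate_plain, rate_with_output, output_interval):
    """Seconds and equivalent iterations added by one interim residue.

    rate_plain: iter/sec without interim residue output
    rate_with_output: iter/sec with a residue every output_interval iterations
    """
    extra_seconds = (output_interval / rate_with_output
                     - output_interval / rate_plain)
    return extra_seconds, extra_seconds * rate_plain

seconds, iterations = res64_cost(96.6, 3.54, 10)
reduction = 1 - 3.54 / 96.6

# Overhead at the normal reporting interval, assuming ~83M iterations total.
total_iterations = 83_000_000
residues = total_iterations / 5_000_000
overhead_seconds = residues * seconds
overhead_ppm = 1e6 * overhead_seconds / (total_iterations / 96.6)

print(f"cost per residue: {seconds:.2f} s = {iterations:.0f} iterations")
print(f"throughput reduction: {reduction:.1%}")
print(f"overhead: {overhead_seconds:.0f} s, {overhead_ppm:.0f} ppm")
```

The results land close to the figures quoted above; small differences come from rounding and from the assumed iteration count.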

Another test, on a dual-xeon-e5-2690 system, v29.6b6 x64 on Win10, 4 cores/worker, 83.9M PRP tests, gave ~305 iterations/interim residue64, 3.45 sec/interim residue, or around 61ppm for the default 5,000,000 iteration interval. The preceding figure ignores the initial 500K-iteration interim residue, which raises the impact a bit to 65ppm for ~84M exponents, and somewhat more for DC exponents.
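
The ppm figures follow from the per-residue cost divided by the reporting interval, plus one extra residue for the 500K-iteration report. A minimal sketch follows; the 51M double-check case assumes the same 305-iteration cost per residue, which was not measured above, and the function name is illustrative.

```python
def interim_residue_overhead_ppm(cost_iterations, interval, exponent=None):
    """Overhead in ppm from interim residues every `interval` iterations.

    If `exponent` is given, one extra residue (the initial 500K-iteration
    report) is added, spread over roughly `exponent` iterations.
    """
    ppm = 1e6 * cost_iterations / interval
    if exponent is not None:
        ppm += 1e6 * cost_iterations / exponent
    return ppm

base = interim_residue_overhead_ppm(305, 5_000_000)
first_time = interim_residue_overhead_ppm(305, 5_000_000, exponent=84_000_000)
# Assumption: same per-residue cost in iterations at a ~51M DC exponent.
double_check = interim_residue_overhead_ppm(305, 5_000_000, exponent=51_000_000)
print(f"{base:.0f} ppm, {first_time:.0f} ppm, {double_check:.0f} ppm")
```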

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 res64 timing for prime95.pdf (12.1 KB, 167 views) res64 timing for prime95 v29.4b8 i7-8750h.pdf (13.1 KB, 161 views) res64 timing for prime95 v29.6b6 e5-2690.pdf (11.8 KB, 160 views)

Last fiddled with by kriesel on 2019-11-18 at 14:31

2019-08-12, 17:22   #7
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·5·17·29 Posts
Prime95 documentation

Most GIMPS applications include a readme file. Prime95 has very comprehensive documentation included in the zip package, in multiple files.

License.txt for the license terms
Readme.txt for the new user and periodic reference
Whatsnew.txt particularly useful when upgrading
Stress.txt relating to stress testing and reliability testing
Undoc.txt documentation of the perhaps less frequently used options

Read them early and often. Like for the other applications, reading the documentation again after additional experience with the program is useful.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-18 at 14:32

2020-05-25, 00:07   #8
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11502₈ Posts
Prime95 exponent limits

Prime95 and its sibling mprime contain many code paths specific to processor types and exponent magnitudes. What range of exponents is supported varies by processor type. I think what has been implemented was determined by a combination of processor throughput versus exponent size and decisions by George on which to spend his programming time.

There are several ways to determine what these limits are.
George has made statements about them in email or on the forum.
https://mersenneforum.org/showpost.p...&postcount=219

The whatsnew.txt describes numerous changes in what was supported.

The source code is available for examination.

Trying runs on differing hardware and OS may obscure the situation, because it could be that it's an old operating system version, not the processor type, that prevents running some versions of code.
Attached Files
 processor specific summary.txt (250 Bytes, 148 views) exponent limits versus hardware.txt (278 Bytes, 132 views) prime95 exponent and fft length limits vs processor type from mult.asm source.txt (2.5 KB, 146 views) prime95 fft highlights of whatsnew.txt (7.0 KB, 155 views)

Last fiddled with by kriesel on 2020-05-25 at 00:21

2020-07-27, 16:03   #9
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×5×17×29 Posts
PRP proof capable versions

UPDATE: V30.3b6 is now generally available. This automatically uploads proof files and includes resource limit features. Direct download links for prime95 64-bit for Windows; mprime 64-bit for Linux. (32 bit and other variations also available.) V30.3b6 appears on the main GIMPS software page and the mersenne.ca download mirror.

Previously:
Per https://www.mersenneforum.org/showpo...&postcount=119 V30.1b1 prime95 or mprime are available and require manual uploading of proof files.
Direct download from dropbox: prime95 for Windows 64-bit; mprime for Linux 64-bit
A run of PRP with proof becomes conspicuous by its multi-gigabyte p.residues file.
These downloads contain all the necessary code including dll files. (V30.2b1 DID NOT contain the dlls. Install v30.1b1 first, then v30.2b1 atop it.)

The standalone command-line uploader, which works for gpuowl as well as prime95, is described briefly at https://www.mersenneforum.org/showpo...&postcount=154 but the direct download from dropbox for Windows x64 is no longer available. It can be found as an attachment at https://www.mersenneforum.org/showpo...0&postcount=26
NOTE: it is not being maintained, and preferred usage is upload through a current version of prime95 or mprime.
Usage is
Code:
uploader user_id proof_filename[ chunk_size[ upload_rate_limit]]
with chunk_size expressed in MB and upload_rate_limit expressed in Mbps apparently.

(Note, for gpuowl, there are more choices; https://www.mersenneforum.org/showpo...0&postcount=26, some of which might conceivably apply to prime95/mprime too, at least for the most adventurous. But I encourage users to stick with prime95 & mprime's built in PrimeNet API & supported features whenever practical.)

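For scripted use, the uploader command line can be assembled as in the sketch below, which reflects only the positional arguments in the usage line above. The helper name, the example user ID and the proof file name are hypothetical. The command is printed rather than executed.

```python
import shlex

def build_uploader_command(user_id, proof_filename, chunk_size=None,
                           upload_rate_limit=None, program="uploader"):
    """Build the argument list for the standalone uploader.

    The optional arguments are nested: upload_rate_limit can only be
    given if chunk_size is also given.
    """
    if upload_rate_limit is not None and chunk_size is None:
        raise ValueError("upload_rate_limit requires chunk_size")
    args = [program, user_id, proof_filename]
    if chunk_size is not None:
        args.append(str(chunk_size))
        if upload_rate_limit is not None:
            args.append(str(upload_rate_limit))
    return args

# Hypothetical user ID and proof file name, for illustration only.
cmd = build_uploader_command("example_user", "example.proof", chunk_size=5)
print(shlex.join(cmd))
```

Passing the list to subprocess.run would execute it; that step is omitted because the uploader is unmaintained and the built-in upload in prime95 or mprime is preferred.
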
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-09-17 at 14:14 Reason: V30.3b6 general release update

2020-11-15, 19:38   #11
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·5·17·29 Posts
Effect of number of workers continued 2

FMA3 capable 6-core i7-8750H (no code running on the IGP at the time)
Attached Files
 peregrine performance.pdf (34.0 KB, 68 views) xeon phi 7250 performance.pdf (105.7 KB, 68 views)

Last fiddled with by kriesel on 2020-11-19 at 22:19

