mersenneforum.org Mfaktc-specific reference material
 Register FAQ Search Today's Posts Mark Forums Read

 2018-05-28, 18:38 #1 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 2·3·821 Posts Mfaktc-specific reference material This thread is intended as a home for reference material specific to the mfaktc program. (Suggestions are welcome. Discussion posts in this thread are not encouraged. Please use the reference material discussion thread http://www.mersenneforum.org/showthread.php?t=23383. Off-topic posts may be moved or removed, to keep the reference threads clean, tidy, and useful.) Mfaktc howto The following assumes you've already read the generic "How to get started in gpu computing for GIMPS" portion of https://www.mersenneforum.org/showpo...89&postcount=1, and duplicates little of that here. Download a suitable version of mfaktc from here https://download.mersenne.ca/mfaktc/mfaktc-0.21 You might as well get one of the larger-GpuSieveSize-enabled ones allowing up to 2047 Mbits if one is available. You'll also need the appropriate CUDArt library. CUDA Dlls are in the CUDA_dlls folder of https://download.mersenne.ca/ If you can't find a suitable version there, you may be able to get one built by one of the participants on the mfaktc thread. Install in a suitable user subfolder. Set the needed folder permissions so suitable permissions are inherited by files created there. Modify the mfaktc.ini file to customize it to your gpu model, primenet account name, and system name. (I usually incorporate system name, gpu identification, and instance number together. Something like systemname/gpuid-wn; condor/gtx1060-w2 for example.) Correct the comments in the standard ini file to reflect the increased GpuSieveSize maximum. Create a Windows batch file or Linux shell script with a short name. Set the device number there. Consider redirecting console output to a file or employing a good tee program. Create a desktop shortcut for easy launch of the batch file or script. (Eventually, for multiple instances or multiple gpus this could launch a routine that invokes the individual-instance files with short time delays between, so you have a few seconds to see whether each launched correctly or a bug occurred.) You may want to try GPU-Z as a utility on your Windows system to see an indication of what the computer thinks is installed for your gpu (CUDA opencl openGl etc), graphically monitor gpu parameters, maybe even log them if you want. One of many utilities listed in https://www.mersenneforum.org/showpo...74&postcount=6 which also lists some Linux alternatives. It can be handy while getting a gpu application going. When it's not needed shut it down along with other idle applications to reduce overhead that's costing performance. Now is a good time to run mfaktc with -h >>help.txt in your working directory. Run once, refer to as often as needed. Run the self test: mfaktc -st >>selftest.txt Or the longer one: mfaktc -st2 >>selftest.txt Check the results. Resolve any reliability issues before proceeding to real GIMPS work. Create a worktodo file and put some assignments in there. Start with few, in case your gpu or igp does not work out. Get the type you plan to run the most. Get them from https://www.mersenne.org/manual_gpu_assignment/ Results are reported manually at https://www.mersenne.org/manual_result/ Run one instance with default settings. Modify tuning in the mfaktc.ini file, stop and restart the program to make it take effect, and document performance for each modification. Tune one variable at a time. For best accuracy and reproducibility, minimize interactive use during a benchmark period. Averaging many timing samples in a spreadsheet improves accuracy. See Mfaktc tuning for optimal throughput for tuning advice. An example of tuning for an RTX 2080 Super is here. There are other gpu models' tuning shown also near it in the same thread. When substantially changing type of work, such as switching between 100M tasks and 100Mdigit tasks, or significantly changing bit levels, especially when changing kernels results, retuning is suggested. (You may want to save different tunes for different exponent and bit level ranges, for later reuse.) For best performance use a SSD, tune for single-instance first, and tune for maximum throughput by experimenting with multiple instances last. (I suggest keeping the work of simultaneous instances similar. If they are too different, different kernels may reduce throughput. Test for that.) Multiple instances give slightly higher sustained throughput, and keep the gpu working if one instance has a problem, runs out of work, or is stopped briefly to replenish work and report results. Slow gpus benefit less from large GpuSieveSize and multiple instances; fast gpus benefit more. Get familiar with the nvidia-smi command line tool at some point. That's more efficient for when you get into production mode. It's probably hiding somewhere like the following. I usually make a batch file with a short name nv.bat (and suggest doing the Linux equivalent) Code: :loop "c:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" pause goto loop Nvidia-smi has less overhead than graphical monitoring utilities such as GPU-Z. Used as above, it will list all CUDA gpus working with the driver and whatever executables are using them, memory usage, power, temperature, etc. One keystroke in that command prompt window will refresh it. Nvidia-smi can also control power and other parameters. Power savings can be quite considerable; 50% reduction in power from the nominal maximum seems to still provide better than 80% of maximum throughput. Beware, nvidia-smi gpu number may not match mfaktc device number in the case of multiple gpus per system. After getting the program functioning manually, you can consider continuing to operate it that way, or trying one of the client management software described at Available Mersenne prime hunting client management software, each of which have their own install requirements, or GPU to 72 https://mersenneforum.org/forumdisplay.php?f=95 (Note much of the preceding was derived from or copied from https://mersenneforum.org/showthread.php?t=25673) Table of Contents: This post Run time versus exponent and bit level for a GTX480 http://www.mersenneforum.org/showpos...19&postcount=2 Bug and wish list http://www.mersenneforum.org/showpos...21&postcount=3 Mfaktc throughput variables http://www.mersenneforum.org/showpos...76&postcount=4 Volta and Turing support https://www.mersenneforum.org/showpo...29&postcount=5 Concepts in GIMPS trial factoring https://www.mersenneforum.org/showpo...23&postcount=6 Mfaktc -h help output https://www.mersenneforum.org/showpo...90&postcount=7 Mfaktc tuning for optimal throughput https://www.mersenneforum.org/showpo...99&postcount=8 gr-mfaktc https://www.mersenneforum.org/showpo...37&postcount=9 etc tbd Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-07-16 at 19:00 Reason: Added large howto section
2018-05-28, 18:41   #2
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

133E16 Posts
Mfaktc runtime scaling versus exponent and bit depth

Timings for an assortment of exponents and bit depths are tabulated and charted, for reference. Note, only one trial per combination was tabulated, so no measure made or indication given of reproducibility run to run for same inputs. See the attachment. This is a somewhat different way of looking at it than the factoring benchmarks at http://www.mersenne.ca

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 mfaktc trial factoring runtime scaling.pdf (20.7 KB, 212 views)

Last fiddled with by kriesel on 2019-11-18 at 14:00

2018-05-28, 19:24   #3
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×3×821 Posts
Mfaktc bug and wish list

Known and suspected bugs and desired features or changes are listed. Some cross references to other gpu programs are included. Some issues may be hardware related or due to other causes.

As always, such lists are in appreciation of the various authors' past efforts, contributions, and skills, and intended to help improve both the software and the users' experience.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 mfaktc bug and wish list.pdf (48.5 KB, 275 views)

Last fiddled with by kriesel on 2019-11-18 at 14:01

2018-05-31, 03:12   #4
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

114768 Posts
Mfaktc throughput variables

Investigating Mfaktc consistently indicating 98% GPU usage in GPU-Z, rather than the 100% that CUDAPm1 and CUDALucas typically produce, there are slight gains in total throughput to be had by running multiple instances. There are confounding factors, including:

Definitely:
Exponent being run (substantial effect)
Factoring depth being run (minor effect)
Overhead of running the screen display with the same GPU (varying effect)
How many instances are running per gpu
GPU sieving parameters: GPUSieveProcessSize, GPUSieveSize, GPUSievePrimes; tuning these can provide a several percent boost in throughput
Frequency of checkpoint saving (minor effect)

Probably:
GPU clock throttling due to thermal limits
How similar the work of multiple instances are (similar is better, very dissimilar may cause a net slowdown. Running very different bit levels that cause use of different kernels seems likely to cause this. Very different exponents may.)

Potentially, at high exponent or instance count:
memory constraints (although mfaktc doesn't use much memory per instance at current wavefront exponents)

Load sharing for multiple instances appears to be not quite equal; they have slightly differing throughputs.

An operational advantage of running multiple instances is being able to stop one, add work, look at results, change ini file settings, etc, without noticeable loss of throughput; the others will essentially saturate the GPU while one instance is out of action temporarily. If one instance runs out of work the other running instances run a bit faster than if it hadn't, limiting lost total throughput.

On a GTX480, some slight effect of CUDA level (test your own gpu models; there are reports GTX10xx like v8.0 better than lower versions)
V0.20 CUDA 6.5 346.38 GhzD/d ----fastest 100%
V0.21 Less Classes CUDA 4.2 340.18 ---98.2%
V0.21 Less Classes CUDA 8.0 340.43 ---98.3%
Running the less classes versions could cost ~0.9 week per year of throughput.

Note, v0.20 checkpoint file was ignored by either flavor of less-classes, 3 hours throughput was lost when switching version, by the Less-Classes versions of mfaktc starting the bit depth over at each version change. I was able to get it back by reverting to V0.20 CUDA6.5 64-bit version.

These results were seen on a system with an actual spinning hard drive. Gain from multiple instances may be less with a solid state disk or ramdisk.

The effect of lower than 100% usage with a single instance is stronger with newer faster gpus such as the GTX 1080 or GTX 1080 Ti. The GTX 1080 Ti under test is running at 90% gpu load with one instance, 95% with two instances. (Not the less-classes version.)

The above is observed using the standard Mfaktc release. The modified versions allowing GpuSieveSize up to 2047M when well tuned reduce but do not eliminate the throughput advantage of running multiple instances.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 parallel throughput.pdf (17.4 KB, 187 views)

Last fiddled with by kriesel on 2020-07-01 at 17:12 Reason: added comments on similar work

 2018-10-09, 21:40 #5 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 492610 Posts Volta and Turing support For versions compiled for CUDA 10, see https://www.mersenneforum.org/showpo...postcount=2929 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2019-11-18 at 14:03
2019-02-14, 03:10   #6
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×3×821 Posts
Concepts in GIMPS trial factoring (TF)

(note, sort of mfaktc oriented, more so toward the end. Where description applies to both mfaktc and mfakto, they are referred to as mfaktx)

Following is my evolving attempt to list and describe the many concepts and optimization considerations relevant to GIMPS trial factoring, in a way that assumes little or no prior knowledge.

https://www.mersenneforum.org/showpo...35&postcount=3

1 Trial factoring work is generally assigned by exponent and bit levels;
Mersenne exponent,
bit level already trial factored to or where to start, and
bit level to factor up to.
Nearly all TF is now done on gpus not cpus. Assignments and result reporting are typically automatic for mprime/prime95, but manually or through separate programs for gpu applications.
Manual assignment (reservation) links and result submission links are available in the attachment at https://www.mersenneforum.org/showpo...11&postcount=9
See https://www.mersenneforum.org/showpo...92&postcount=3 for a list of client management programs that get assignments and/or submit results.

2 Each bit level of TF is about as much computing effort as all bit levels below it, for the same exponent.
Probability of finding a factor by trial factoring between 2x and 2x+1 is approximately 1/x. https://www.mersenne.org/various/math.php
So while the probability of finding a factor goes down slowly with x, the effort goes up exponentially with x.
Assignments are made up to the point where it's more cost effective to primality test instead of continuing to trial factor. Finding a factor by trial factoring before any primality tests or P-1 factoring saves 2 x (1 + primality test error rate) primality tests plus the P-1 factoring time, which is typically a few percent of a primality test. Error rate is around 2%, so about 2(1+.02)+.03 =~ 2.07 primality tests equivalent. Except that P-1 effort (bounds selection) is optimized for maximum time savings, so actually <2(1+0.02) or <2.04 primality tests.

3 TF makes use of the special form of factors of Mersenne numbers, f = 2 k p +1, where p is the prime exponent of the Mersenne number, k is a positive integer. See https://en.wikipedia.org/wiki/Mersen...rsenne_numbers

4 Use a wheel generator https://en.wikipedia.org/wiki/Wheel_factorization for candidate factors, to efficiently exclude and not even consider candidate factors that have very small factors themselves, such as 2, 3, 5, 7 and usually 11. 2 2 3 5 7 = 420, the Less-classes number; 2 2 3 5 7 11 = 4620, the more-classes number. https://www.mersenneforum.org/showpo...1&postcount=35
https://www.mersenneforum.org/showpo...7&postcount=37
(There was discussion of going to 13 or higher, but that was considered not worthwhile. 2 2 3 5 7 11 13 = 60060, etc. Higher complexity and overhead, trading off against diminishing returns of incremental number of candidates excluded.) Prime95 uses a simpler 2 2 3 5 = 60 wheel spokes, which reduces to 16 classes (26.67%). For Mfaktx at 420, 96 survive (22.86%) or at 4620, 960 (20.78%). The next would be 11520/60060=19.18%. (A potential reduction of 7.7%) (I need to do a bit more homework to describe why there are two factors of 2 in the number of classes. I think that's due to including #6 into the wheel generator; it's not just evens that can be dismissed, but 6 out of 8 mod 8 cases, or 3 out of 4.)

5 For a given exponent, exclude entire classes of candidate factors with a single test. Additionally there is subsequent fast sieving in the surviving classes for small prime factors, to eliminate the bits representing the corresponding k values of small primes beyond the wheel generator level.
https://www.mersenneforum.org/showpo...7&postcount=37

6 Make use of the special form of factors of Mersenne numbers, that they are 1 or 7 mod 8. (Discard or do not generate candidate factors that are not 1 or 7 mod 8). A short proof for f=+/-1 mod 8 is at https://primes.utm.edu/notes/proofs/MerDiv.html.

7 Use of dense representation of the candidate factors, as a bit map of k values. https://www.mersenneforum.org/showpo...4&postcount=72
(That bit map is exponent-specific, so can not be reused for other exponents.)

8 These candidate factors are sieved, somewhat, but not exhaustively. Candidate factors found composite by sieving are discarded. Trial with prime candidates is sufficient, with composite candidates redundant.

9 Exhaustive sieving of candidate factors can produce less throughput per total computing effort, than a lesser amount of sieving of candidates. There's user control of sieving depth available to allow adjustment to near optimal for the exponents, depths, and other variables, in mfaktc and mfakto.
https://www.mersenneforum.org/showpo...&postcount=180

10 Candidate factors surviving wheel generation and restriction to 1 or 7 mod 8 are sieved somewhat. Implementation in mfaktx is by a maximum sieve prime value, that as I recall is set in mfaktx.ini. Sieving level (at some version of mfaktc) must not exceed the level where possible candidate factors are included in sieving values. This makes sieve limit settings dependent on exponent for low exponent low bit level combinations. https://www.mersenneforum.org/showpo...&postcount=788
(If I recall correctly, in later versions of mfaktc special handling of certain cases relaxes this restriction on sieving level. See in #21 below)

11 Surviving candidate factors are tried. (With MOSTLY prime candidates, and the relatively few composite candidates that passed the speed-optimized incomplete sieving.)

12 The trial method is to generate a representation of the Mersenne prime plus 1 (2p), modulo the trial factor, by a succession of squarings and doublings according to the bit pattern in the exponent, modulo the trial factor. If 2p mod f =1, 2p-1 mod f = 0 proving f is a factor. The powering method is much much faster than trial long division for the very sizable numbers of interest near the current GIMPS wavefront, and uses far less memory than a simple approach of representing the full dividend at the beginning, and more so on both counts for larger numbers. (Retina points out though at https://www.mersenneforum.org/showpo...&postcount=478 that there would be no need to create a full representation of p ones at the beginning of long division. Since all the binary digit positions of a mersenne number are ones they could simply be added at the right as needed. Also, we care only about the remainder, not the quotient.) There's a description and brief small-numbers example of the modpow method at https://www.mersenne.org/various/math.php which is called left to right binary method in https://en.wikipedia.org/wiki/Modular_exponentiation. It is not necessarily the minimum-time method, but it's reliably pretty good, and fairly simple. The modulo trial-factor at each step keeps the operands at a manageable size for speed: 3 32-bit words, far smaller than the tens of millions or hundreds of millions of bits that naive trial-division could involve or deferred modulo could generate. Trial division would take (O)p operations, while modpow takes only (O)log2p. For p~90M, p/log2p ~3,400,000. If stepping through long division 32 bits at a time, ~106400; log2106400 ~17. The efficiency of the modpow method allows trial factoring to many more bits factor depth than even a well implemented long division trial would. See point #2. Therefore the probability of finding a factor is increased significantly with the modpow approach.

13 On significantly parallel computing hardware, such as gpus, many trial factors can be run in parallel, so successive subsets of candidates are distributed over the many available cores. Gpus are so much more relatively effective at TF than other types of GIMPS computation, that the optimal gpu TF bit level is typically a few bits higher than for cpus, and cpu TF is generally avoided. (See also point 2 again.) Xeon Phi are reportedly quite fast at trial factoring, but not very common, and rather expensive. "Additionally, something like an FPGA would be really fast, but most users don't have one. And ASICs would be even faster, but way too expensive." (kruoli PM) FPGA or ASIC likely would have yet different TF bit level optima. Guides to suggested TF bit levels versus exponent and gpu model are available for each gpu model listed at https://www.mersenne.ca/cudalucas.php; click on the gpu model of interest. Note that the curves correspond to optimization for one or two primality tests to be performed on the same gpu model as the TF. If the TF will be performed on a high TF/DP performance ratio gpu but the primality test performed on a lower TF/DP ratio gpu, the optimal TF is likely somewhere between the two levels. For example, for a 100Mdigit mersenne number, TF on RTX2080 indicates 83 bits but Radeon indicates 80 bits; 81 or 82 may be the optimal for the combination RTX 2080 & Radeon VII PRP or LL. RTX2080: https://www.mersenne.ca/cudalucas.php?model=744 Radeon VII: https://www.mersenne.ca/cudalucas.php?model=752

14 Operation is as multiword integers, sometimes with some bits used for carries. https://www.mersenneforum.org/showpo...3&postcount=17 72-bit math via 3 32-bit words with 24 bits data, 8 bits for carries.
https://www.mersenneforum.org/showpo...&postcount=159

15 Different code sequences are written for different bit level ranges of trial factor, for best speeds for various bit level ranges. Higher bit levels take longer code sequences, so factors tried per unit time declines at higher bit levels on the same hardware and exponent. https://www.mersenneforum.org/showpo...&postcount=231
Cursory examination of current source code shows the following 17 distinct sequences identified in mfaktc:
_71BIT_MUL24
_75BIT_MUL32
_95BIT_MUL32
BARRETT76_MUL32
BARRETT77_MUL32
BARRETT79_MUL32
BARRETT87_MUL32
BARRETT88_MUL32
BARRETT92_MUL32
_75BIT_MUL32_GS
BARRETT76_MUL32_GS
BARRETT77_MUL32_GS
BARRETT79_MUL32_GS
BARRETT87_MUL32_GS
BARRETT88_MUL32_GS
BARRETT92_MUL32_GS
_95BIT_MUL32_GS
(In the above, _GS indicates gpu-sieving. https://www.mersenneforum.org/showpo...postcount=3082)
(mfakto is similar although it does not include 95-bit code or capability. That's not an issue for the GIMPS search since optimal final TF level via gpu is generally 86 bits or less in the mersenne.org range of exponents <109; 92 bits to Mlucas range max 232. But see also #43 below)
Judging by the names above, most of these are based on Barrett reduction. https://en.wikipedia.org/wiki/Barrett_reduction

16 "First barrett kernel in mfaktc was BARRETT92, all other kernels are stripped down versions.
From BARRETT92 to BARRETT79 first (fixed inverse, multibit in single stage possible, a bit faster)
From there we go from BARRETT92 to BARRETT88 and BARRETT87 by (re)moving interim correction steps and some other "tricks" (loss of accuracy in interim steps (small example 22 mod 10 yields 12 (instead of 2))). Trading accuracy for speed. The same 'tricks' lead from BARRETT79 to BARRETT77 and BARRETT76." https://www.mersenneforum.org/showpo...postcount=3080

17 "funnel shift" barrett 87 https://www.mersenneforum.org/showpo...postcount=2243

18 The frequency of primes on the number line declines gradually as the bit level increases. This partially offsets the effect of the longer code sequences for higher bit levels mentioned in #15.

19 For gpu applications, there are various implementation approaches for performance. https://www.mersenneforum.org/showpo...9&postcount=29

20 multiple streams and data sets allowing concurrent data transfer and processing https://www.mersenneforum.org/showpo...&postcount=152

21 On-gpu sieving of factor candidates: https://www.mersenneforum.org/showpo...&postcount=554
https://www.mersenneforum.org/showpo...postcount=2999

22 There's a small performance advantage to 32-bit images when available, when using GPU sieving. (32bit addresses are smaller, placing less demand on memory bandwidth) https://www.mersenneforum.org/showpo...postcount=1981
However, at CUDA8 or higher (which means GTX10xx or newer models), only 64-bit CUDA is available.

23 How, in the CUDA or OpenCl cases, all those factor candidates get bundled into batches for processing in parallel on many gpu cores is very hazy for me. Presumably it involves using as many of the cores as possible as much of the time as possible, implying work batches of equal size / run time.
Prime95 comments on passing out chunks of work and using memory bandwidth efficiently: https://www.mersenneforum.org/showpo...postcount=1634
Oliver writes to R Gerbicz about it https://www.mersenneforum.org/showpo...postcount=3020

24 Could FP be faster than integer math for the kernels? Probably not https://www.mersenneforum.org/showpo...postcount=2377
Mark Rose did a detailed analysis in 2014 (and maybe revisiting that for recent gpu designs would be useful) https://www.mersenneforum.org/showpo...postcount=2380

25 mfaktc v0.21 announce, and ruminations about possible v0.22 https://www.mersenneforum.org/showpo...postcount=2492

26 ini file parameter tuning advice, https://www.mersenneforum.org/showpo...postcount=2505 - 2508
in this order:
GPUSieveProcessSize
GPUSieveSize
GPUSievePrimes

27 There was mention of an mfaktc v0.22-pre2 back in 2015 at https://www.mersenneforum.org/showpo...postcount=2547
"Removed old stuff (CC 1.x code, CUDA compatibility < 6.5 dropped, minor changes and bugfixed)."
https://www.mersenneforum.org/showpo...postcount=3080
The current released version is v0.21, which incorporates worktodo.add file support and is slightly faster than v0.20.

28 Experimenting on tuning with a GTX1080Ti, RTX2080, and RTX2080Super, on Windows 7, they seem to like the upper end of the tuning variable limits. Possibly the faster cards would benefit from higher maximums than documented for mfaktc v0.21. (Recompile required.)
# GPUSieveSize defines how big of a GPU sieve we use (in M bits).
# Minimum: GPUSieveSize=4
# Maximum: GPUSieveSize=128 (except in altered versions allowing 2047)
# Default: GPUSieveSize=64
GPUSieveSize=2047
# GPUSieveProcessSize defines how far many bits of the sieve each TF block
# processes (in K bits). Larger values may lead to less wasted cycles by
# reducing the number of times all threads in a warp are not TFing a
# candidate. However, more shared memory is used which may reduce occupancy.
# Smaller values should lead to a more responsive system (each kernel takes
# less time to execute). GPUSieveProcessSize must be a multiple of 8.
# Minimum: GPUSieveProcessSize=8
# Maximum: GPUSieveProcessSize=32
# Default: GPUSieveProcessSize=16
GPUSieveProcessSize=32
GPUSievePrimes ~94000

29 Some RTX20xx owners have modified the tuning variable limits, recompiled, and obtained performance gains of several percent, with diminishing incremental returns as GPUSieveSize grows to gigabits https://www.mersenneforum.org/showpo...postcount=3071
Maximum gpusievesize was found to be 2047 (Mbits) after program modification. This may relate to computing bit positions with signed 32 bit integer math. Other model gpus that show greater combined throughput with multiple mfaktc instances with the 128Mbit GpuSieveSize maximum, also benefit from the modified code. For example, a GTX1080Ti gains several percent throughput running a single modified instance. It appears there would be additional gain on a single instance if the limit was raised to 4095 Mbits by changing to unsigned 32 bit integer math for the bit positions, since even at 2047Mbits, it shows performance improvement compared to somewhat lesser values, and benefits from running multiple instances. RTX2080 and RTX2080Super also show additional throughput by several percent from running multiple instances after available tuning is completed.

30 The decoupling of FP from integer math in the RTX20xx may change the tradeoffs significantly or allow use of both simultaneously for TF math. As far as I know this is as yet unexplored.

31 According to some, the probability of finding a factor for least effort is optimized by TF up to one bit less than the eventual tradeoff limit, then P-1 with potential computing time savings optimized by the program itself (prime95 or CUDAPm1), and then, if no factor is found in P-1, the last bit level of TF, constituting the remaining half of the TF effort for that exponent. This is somewhat different sequencing than usual PrimeNet assignment.

32 Since gpus are far more effective at TF than at primality testing or P-1, and far faster at TF than cpus are, assignments seek to prevent TF on cpus, by performing all useful TF on gpus. TF on gpus allows going about 4 bits deeper, within computing time tradeoff optimization, saving some primality tests that cpu-optimized-TF limits would not have found factors for. The optimal number of bits deeper depend on the properties of the gpu models, with the higher ~40:1 TF/primality test ratio of the RTX20xx family probably justifying 5 bits deeper for their use. As far as I know primenet contains no provision for automatically taking them 1 bit higher on RTX20xx.

33 The time required to perform a single TF attempt (one candidate factor on one exponent) is so brief, and the reliability of modern hardware high enough over such brief periods, that it does not pay to double check TF. Not double checking TF lets the TF go one bit higher, finding more factors. But there is a sort of double-checking being performed, with a completely different approach, by user TJAOI, as uncwilly pointed out at https://www.mersenneforum.org/showpo...postcount=3043. There's a link there to a whole thread about that activity.

34 There is some limited overlap of coverage of factors between TF and P-1, estimated at about 20-25%. The economics of TF and the sequential progress of TF assignments dictate that TF factors found will have at most a certain number of bits (ranging up to 75 bits for 84M, to 86 bits for 999M). P-1 can economically find factors with more bits, but requires that the number one less than the factor be rather smooth (composite with many smallish factors). Some factors that TF should find will be smooth, and many will not. Therefore P-1 factoring currently serves as an incomplete double-check on the TF runs. "Stage 2 of P-1 helps to find factors f=2kp+1, where k contains one less smooth factor. If k=ab with b being the biggest prime factor of k, then it is sufficent to do P-1 with B1 = maximum prime factor of a and B2 = b." (kruoli by PM)

35 Reported factors are confirmed at the primenet server by repeating the computation for that exponent and factor, because that is quick to do and factors are worth confirming. No-factor reports are not confirmed there. Historically, TF was on the honor system, with no security code attached to a result outside prime95, or verification of completion of the work. A recent development is Robert Gerbicz's proposal for a method of incorporating verification of performing the TF work in the no-factor-found case. See https://www.mersenneforum.org/showpo...7&postcount=30 and some following discussion on the same thread. At the moment this is only conceptual, not implemented in any software to my knowledge. There's also discussion of other approaches in https://mersenneforum.org/showthread.php?t=25493

36 There is currently no Mersenne TF code for the following interface specifications:
DirectCompute
DirectX
OpenGL
PhysX
Vulkan
In many cases, this is not a limitation, because the Intel CPU TF code in prime95 and mprime, and gpu TF code in Mfaktc (NIVIDIA CUDA) and Mfakto (OpenCL on AMD gpus and some Intel IGPs) provide considerable coverage. However, there are older IGPs or discrete GPUs that do not have the required support for CUDA or OpenCL but do have support for other interface specifications such as DirectX or OpenGL. It's likely the performance of these older devices would be low compared to the ~20GhzD/day of current IGPs, or hundreds or thousands of GhzD/day on current discrete gpus.

37 Mfaktc and Mfakto are capable of factoring Mersenne numbers with exponents up to 232.
Extending them to exponents above 232 would require a rewrite of the individual kernels as well as the surrounding code. http://mersenneforum.org/showpost.ph...postcount=1148 If this was attempted, it would be advisable to maintain separate source files or use conditional compilation to produce separate executables, to preserve efficiency for exponents below 232. Once done, though slower than the 32-bit-exponent max versions, these are likely to be much faster over 232 than Factor5, which has no limit, or Ernst Mayer's Mfactor, which has some limits depending on its build options. There's no hurry. There's a great deal of work to do within the mersenne.org limit of 109, and much more from there to 232, that will take centuries. https://www.mersenneforum.org/showpo...postcount=2862

38 "Checkpoints are written between two residue classes by just remembering the last finished residue class." The checkpoint files are small and simple. https://www.mersenneforum.org/showpo...postcount=3008
Perform version upgrades between bit levels. The checkpoint files include what version was used to produce them, and are not accepted by a later version of software, which will restart the bit level instead.

39 The number of residue classes (see #4 above) is set at compile time in mfaktc, or in the ini file in Mfakto (choosing between 420 and 4620). Another possibility that has been suggested is to make it dynamic, adjusting to exponent and bit level. https://www.mersenneforum.org/showpo...postcount=2999

40 Some practical considerations constraining number of classes include
magnitude of spacing between candidate factors (32 vs 64bit),
frequency and effect of the gpu work queue running empty, and
effect on throughput of running a full last block per class that extends past the bit level's end.
https://www.mersenneforum.org/showpo...postcount=3002

41 In mfaktc or mfakto, there is screen output once per class. This can range from several classes per second or more, at high exponents and low bit levels, to nearly an hour or more per class at high bit levels on slow gpus. (Bit levels already reached in trial factoring previously ranged from 64 to 87, 223=8,388,608:1 within the mersenne.org space. As of early 2021, they ranged from 72 to 87 bits, 215=65536:1.) There is no user control of screen output frequency other than the approximately 11:1 effect of selecting more-classes or less-classes. In some cases a solid state disk or ramdisk is recommended to prevent logging or checkpointing from diminishing TF throughput unacceptably. A proposed user control for rate of screen output has not been implemented. https://www.mersenneforum.org/showpo...postcount=2703 Using less-classes to somewhat limit output frequency can cost of order 1.7% TF throughput. https://www.mersenneforum.org/showpo...postcount=2706

42 Since finding factors is more likely with low k values than high k values, lowest bit levels are run first, for efficiency. For the primary purpose of GIMPS, finding new Mersenne primes, we only care about finding a factor quickly enough to eliminate an exponent p from any further processing, thereby saving computing time overall. The probability of finding a factor times the computing time of further trial factoring, plus P-1 factoring, and running two primality tests, must be less than the effort of trial factoring for there to be net savings. Trying the most-likely factors first optimizes the savings.

43 The optimal TF bit level described in #2, #13 and #32 depends somewhat on the gpu model, since different gpu models have different ratios of TF and primality test performance. Recent developments may raise the optimal above the levels described. RTX20xx increased parallelism between FP and int computations, or its much higher performance ratio between TF and primality testing with current code, for example. Separate code and assignments according to gpu model may be necessary to make best use of the properties of the RTX20xx gpus and older gpus.

44 There is a discussion at https://www.mersenneforum.org/showpo...postcount=1511 to ~1514
regarding possibly eliminating the 20 to 25% overlap of P-1 and TF described in #34. If doing so had little or no overhead, this could provide a considerable speedup in TF, which may speed up the total per-exponent effort in GIMPS by <~0.03 x 0.25 / 2, or <~0.375%. Not large, but worth exploring. Its effect is diminished from there since current practice is to opportunistically seek a quickly found factor by TF in the lower bit levels, then P-1 after all TF is done. See also #31, regarding P-1 being run before the last, highest TF level, which may offer an opportunity for saving <~0.19% if the final bit level of TF, which is comparable in effort to all preceding levels combined, is performed only on k values not also covered by the preceding P-1. Taking full advantage of this probably involves
modifying primenet TF and P-1 assignment policy, in both level and sequence,
modifying code in the P-1 applications (prime95, CUDAPm1, GpuOwL) for determining optimal P-1 bounds,
modifying TF applications code (mfaktc, mfakto, mprime/prime95; gpuowl TF branch, Mfactor) to efficiently skip the relevant factor candidates, and
modifying mersenne.ca gpu-specific TF level versus exponent charts and other project documentation.
It may also involve changing or adding parameters to the worktodo lines or ini files or command line options to indicate whether the TF run is to include or exclude the smooth factor candidates. (Different use cases for TF may require different behavior.) New TF worktodo line structure implies modification of the parsing code in each TF application. New TF worktodo line structure also implies modifying some of the several client management programs (see http://www.mersenneforum.org/showpos...92&postcount=3). Making use of this also probably involves changes to TF result report records to indicate whether the smooth factors were tried or skipped, including primenet and manual reports and assigned and unassigned work, impacting PrimeNet and its API, web manual report page processing, mfaktc, mfakto, prime95, gpuowl, Mfactor, and some of the client management programs. In total, over a dozen code authors would need to be on board with the change. Not all are currently actively maintaining what they wrote.

45 from https://www.mersenneforum.org/showpo...4&postcount=23 re efficiency versus assignment size (roughly 2bitlevel/exponent)
Quote:
 mfakto (and mfaktc) are more efficient with larger assignments. This does not mean larger exponents, but "more work" or "longer runtime per class", generally bigger bit ranges. The reason is a certain one-time initialization effort per class - no matter if the class will just test 1 million factor candidates or 1 billion. In the first case, the one-time effort may account for, say, 75% of the sieving effort, in the latter case just 0.3%. Plus there is an average of half a block wasted for each class. If the class consists of only one block, that's 50%, for 1000 blocks it's just 0.05%.
46 Results are reported by gpu applications as exponent, factor(s) found, optionally username and computerid, program and version and kernel used, optionally time and date stamp, or
exponent, no factor from bit level a to b, optionally username and computerid, program and version and kernel used, optionally time and date stamp (but not in that order). Other possible variables such as factor candidate sieving limit are not reported.

47 as #46 above implies, sometimes more than one factor will be present within a bit level. The behavior of mfaktx is controlled in such cases by part of mfaktx.ini[CODE]# possible values for StopAfterFactor:
# 0: Do not stop the current assignment after a factor was found.
# 1: When a factor was found for the current assignment stop after the
# current bitlevel. This makes only sense when Stages is enabled.
# 2: When a factor was found for the current assignment stop after the
# current class.
#
# Default: StopAfterFactor=1
It's uncommon but finding two factors in a single bit level does occur.

48 More than one factor per class within a bit level for an exponent also occurs. Its overall impact on the GIMPS project is small. See https://www.mersenneforum.org/showpost.php?p=520982&postcount=5
Up to 10 factors per class are supported, by use of a small array for factors found. That would be very hard to test for. More than 4 per class/bitlevel/exponent is estimated to be quite rare.
https://www.mersenneforum.org/showpost.php?p=410784&postcount=35
For some examples of exponents with multiple factors found in one class in one bit level see https://mersenneforum.org/showpost.p...postcount=3262

See also https://www.mersenneforum.org/showpo...&postcount=471; what's described here is referred to there as "search by p".
(Thanks, to TheJudger, kruoli and others who have reviewed this large list or responded to PMs and offered constructive feedback.)

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-01-03 at 15:47 Reason: updated #41 for bit level progress

 2019-08-12, 15:29 #7 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 114768 Posts Mfaktc -h help output Code: mfaktc v0.21 Copyright (C) 2009, 2010, 2011, 2012, 2013, 2014, 2015 Oliver Weihe (o.weihe@t-online.de) This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING. This is free software, and you are welcome to redistribute it under certain conditions; see COPYING for details. Usage: mfaktc-win-64 [options] -h display this help and exit -d specify the device number used by this program -tf trial factor M from 2^ to 2^ and exit instead of parsing the worktodo file -st run builtin selftest and exit -st2 same as -st but extended range for k_min/m_max -v set verbosity (min = 0, default = 1, more = 2, max/debug = 3) options for debuging purposes --timertest run test of timer functions and exit --sleeptest run test of sleep functions and exit Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2019-11-18 at 14:04
2019-09-29, 15:15   #8
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

10011001111102 Posts
Mfaktc tuning for maximum throughput

Per https://www.mersenneforum.org/showpo...23&postcount=6 item 26,
tune mfaktc.ini gpu sieve parameters in this order:
GPUSieveProcessSize
GPUSieveSize
GPUSievePrimes

For faster gpu models, a recompiled version of mfaktc allowing GpuSieveSize up to 2047 can produce up to 5% greater throughput over GpuSieveSize=128 for GTX1080Ti, more than 10% for RTX 2080 Super.

After tuning single-instance throughput, faster gpu models can benefit from running multiple instances on the same gpu simultaneously. The optimal number of instances appears to depend on both gpu model and cpu model. The potential gain is greater with faster gpus, less with slower gpus. I think the gain is that when one instance is waiting on display and disk writes, another can keep the gpu working. This has been observed on Win7 or Win10 systems with conventional rotating disks. Solid state disk or ramdisk may reduce or eliminate this multi-instance advantage.

Whether the optimal tune for multiple instances differs from the optimal tune for single instances is untested currently.
Note also the optimal tune is expected to slowly change with exponent and bit level.

CUDA level and driver version seem to have little or no effect.
More_classes is more efficient than less_classes, unless the classes are so fast that it becomes somewhat I/O bound. Most exponents and bit levels remaining between the wavefront and the mersenne.org limit are better run in more_classes.

A tuning example for GTX1080Ti gpu is attached. A second example with two rounds of further tuning for the modified GpuSieveSize up to 2047 version is also attached. It gains a little more performance and a GPU-Z indicated 100% utilization with 3 instances.
Final tune check of an RTX2080 gpu using GPUSieveSize up to 2047 is also attached. It gains a little more performance and a GPU-Z indicated 100% utilization with 3 instances.
Both models indicate possible further gains if GpuSieveSize was not limited by 231-1 bits due to signed 32-bit bit position computation.
The newest gpu models respond substantially to tuning, of order 3 to 10% gain.

Lower speed gpus show less response to tuning, and their optimal tune tends to be at somewhat lower values of tune parameters.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 gtx1080ti mfaktc tuning.pdf (18.1 KB, 193 views) gtx1080ti tuning 2047 limit.pdf (31.6 KB, 187 views) rtx2080super tuning.pdf (16.9 KB, 148 views) gtx1050ti-tune.txt (2.1 KB, 113 views)

Last fiddled with by kriesel on 2020-08-23 at 15:29 Reason: updated for GTX1050Ti

 2020-04-17, 03:50 #9 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 2×3×821 Posts gr-mfaktc There is a variant of mfaktc which also supports bases other than 2, called gr-mfaktc (generalized repunits mfaktc). It is somewhat slower for base 2 than mfaktc, so not recommended for Mersenne numbers. See https://mersenneforum.org/showthread.php?t=24901 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-04-17 at 03:51

 Similar Threads Thread Thread Starter Forum Replies Last Post kriesel kriesel 62 2020-12-12 08:57 kriesel kriesel 31 2020-07-09 14:04 jasong jasong 97 2015-09-14 00:17 jasonp Software 17 2009-01-29 02:25 Jushi Math 2 2006-08-28 12:07

All times are UTC. The time now is 09:11.

Sun Mar 7 09:11:44 UTC 2021 up 94 days, 5:23, 0 users, load averages: 2.47, 2.88, 2.91