mersenneforum.org  

2020-11-27, 22:51   #1
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
11453₈ Posts

Mfactor-specific thread

This is a reference thread specific to Ernst Mayer's Mfactor program and, where it matters, to the CPU-oriented builds of it. Please comment in the reference material discussion thread, not here. (Posts here may be incorporated with attribution, moved, or removed without recourse.)

In most cases, GIMPS trial factoring should be performed on GPUs using mfaktc or mfakto or, for special purposes, special programs such as mmff on NVIDIA GPUs. Mfactor comes into the picture for special cases those programs won't handle, such as trial factoring Mersenne numbers beyond their limits.

This whole thread is a draft in progress and some posts may be mostly a placeholder at the moment or in portions.
Please note that Ernst describes this software as "experimental". Expect some rough edges.


(Getting started section to follow someday.
Choose the fewest-word build that fits the requirements of the task at hand. See the attached mfactor bits table.pdf
If building yourself, test the resulting build(s) such as by finding the small known factors of MM31.
Single threaded:
Poor man's multithreaded:
Linux multithreaded
)
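As an independent cross-check of the self-test suggested above (finding the small known factors of MM31), a reported factor can be verified outside Mfactor. The sketch below is not Mfactor code. It assumes only the standard facts that q divides 2^p - 1 exactly when 2^p mod q = 1, and that factors have the form q = 2*k*p + 1. The factor value used is the well-known smallest factor of MM31 and is not taken from this thread.

```python
# Independent check of a reported factor; this is not how Mfactor itself works internally.
P = 2**31 - 1  # 2147483647, the exponent of MM31 = 2^(2^31 - 1) - 1


def divides_mersenne(p, q):
    """Return True if q divides 2^p - 1, i.e. 2^p == 1 (mod q)."""
    return pow(2, p, q) == 1


def k_of(p, q):
    """Return k such that q = 2*k*p + 1; raise if q does not have that form."""
    k, remainder = divmod(q - 1, 2 * p)
    if remainder:
        raise ValueError("q is not of the form 2*k*p + 1")
    return k


# Well-known smallest factor of MM31 (an outside fact, not quoted from this thread).
SMALLEST_MM31_FACTOR = 295257526626031

if __name__ == "__main__":
    print("k =", k_of(P, SMALLEST_MM31_FACTOR))
    print("divides MM31:", divides_mersenne(P, SMALLEST_MM31_FACTOR))
```

A freshly built binary that fails to report a factor which passes this check is suspect.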

Some Mfactor notes:
  1. The effective minimum kmax is set to 16,336,320 by the size of the small-primes sieve. https://www.mersenneforum.org/showpo...6&postcount=16
  2. The NWORD build is limited to 64,000,000 bits for both exponent and factor. So it could theoretically be used on MM57885161 but not on MM74207281 and up.
  3. The help output of the program is considerable. https://www.mersenneforum.org/showpo...8&postcount=24 Ernst cautioned it may not be current.
  4. Savefiles are not implemented, although the program emits messages about them.
  5. It runs controlled by command line parameters. No ini files, config files, etc.
  6. It outputs to stdout and stderr. Any logging is because of redirection or tee use, on command lines, in batch files, or shell scripts.
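Note 2 can be turned into a quick eligibility check. This is a minimal sketch, assuming the 64,000,000-bit figure quoted above and using the fact that the exponent of MM_p, namely 2^p - 1, is exactly p bits long. It checks the exponent only; the factor candidates must fit within the limit as well. The function names are made up for illustration.

```python
NWORD_LIMIT_BITS = 64_000_000  # limit quoted in note 2 of this post


def mm_exponent_bits(p):
    """Bit length of 2^p - 1, the exponent of the double Mersenne MM_p."""
    return p  # 2^p - 1 is p consecutive one-bits in binary


def nword_could_handle(p):
    """True if the exponent of MM_p fits within the quoted NWORD limit."""
    return mm_exponent_bits(p) <= NWORD_LIMIT_BITS


if __name__ == "__main__":
    for p in (57885161, 74207281):
        print(f"MM{p}: exponent bits = {mm_exponent_bits(p)}, fits = {nword_could_handle(p)}")
```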

Naming convention is as follows, for the builds posted in this thread:
Mfactor-<arch>-<x>w[-tfc][-mt], where:
<arch> is base, for any x86-64, ...
<x> is the number of words, or n for the variable-word-count (NWORD) build;
if -tfc is present, it's the 960-pass out of 4620 classes variant, otherwise it's 16-pass out of 60 classes;
if -mt is present, it's a multithreaded build, otherwise it's single-threaded.
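The naming convention is regular enough to decode mechanically. The following is a minimal sketch; the regular expression and the function name parse_name are illustrative assumptions, not part of Mfactor, and only the name pattern described above is covered.

```python
import re

# Pattern for Mfactor-<arch>-<x>w[-tfc][-mt] as described in this post.
_NAME = re.compile(
    r"^Mfactor-(?P<arch>[^-]+)-(?P<words>\d+|n)w(?P<tfc>-tfc)?(?P<mt>-mt)?$"
)


def parse_name(name):
    """Decode a build name into its properties; raise ValueError if it does not match."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized Mfactor build name: {name!r}")
    tfc = match["tfc"] is not None
    return {
        "arch": match["arch"],
        "words": match["words"],          # "1".."4", or "n" for variable
        "classes": 4620 if tfc else 60,
        "passes": 960 if tfc else 16,
        "multithreaded": match["mt"] is not None,
    }


if __name__ == "__main__":
    for name in ("Mfactor-base-1w", "Mfactor-base-2w-tfc-mt", "Mfactor-base-nw-tfc"):
        print(name, parse_name(name))
```

Strip any .exe or .gz suffix from an attachment name before passing it in.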


Table of contents for Mfactor-specific thread (this thread)
  1. Intro and table of contents (this post)
  2. 16-pass Windows builds https://www.mersenneforum.org/showpo...27&postcount=2
  3. 960-pass Windows builds https://www.mersenneforum.org/showpo...28&postcount=3
  4. Linux builds https://www.mersenneforum.org/showpo...29&postcount=4
  5. Poor-man's multithreading approximation https://www.mersenneforum.org/showpo...33&postcount=5
  6. etc
(add bug and wish list)

For more background, see https://www.mersenneforum.org/showthread.php?t=25009 and other content available from links there


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: pdf mfactor bits table.pdf (10.9 KB, 11 views)

Last fiddled with by kriesel on 2021-01-06 at 21:25 Reason: added tee use to note 6

2020-11-27, 22:57   #2
kriesel

16-pass Windows builds

These were built using msys2 on a Windows 7 X64 Pro dual-Xeon E5645 system. They are single threaded because that's all that build approach supports.
Differing number of words allows for fast runs on small operands, and for bigger factors and exponents. See the mfactor bits table attached to post one of this thread.
These were built for the common base of 64-bit Intel compatible cpus, not the higher SSE2, FMA3, or AVX512 flavors, so should run regardless of processor model. (Those higher processor capabilities are only supported for a subset of word lengths, as shown in the bits table attachment of post 1, but would give higher performance where supported.)

After renaming factor.c.txt to factor.c, these were built by the following:
gcc -c -Os ../get*c && rm get_preferred_fft_radix.o
gcc -c -Os ../imul_macro.c ../mi64.c ../qfloat.c ../rng_isaac.c ../two*c ../types.c ../util.c
gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 ../factor.c ../get_cpuid.c
gcc -o Mfactor-base-1w *o -lm

gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DP2WORD ../factor.c
gcc -o Mfactor-base-2w *o -lm

gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DP3WORD ../factor.c
gcc -o Mfactor-base-3w *o -lm

gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DP4WORD ../factor.c
gcc -o Mfactor-base-4w *o -lm

gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DNWORD ../factor.c
gcc -o Mfactor-base-nw *o -lm


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: exe Mfactor-base-1w.exe (895.7 KB, 9 views)
File Type: exe Mfactor-base-2w.exe (900.8 KB, 13 views)
File Type: exe Mfactor-base-3w.exe (902.8 KB, 10 views)
File Type: exe Mfactor-base-4w.exe (904.3 KB, 11 views)
File Type: exe Mfactor-base-nw.exe (893.2 KB, 11 views)

Last fiddled with by kriesel on 2020-12-20 at 17:37

2020-11-27, 22:58   #3
kriesel

960-pass Windows builds

These were built using msys2 on a Windows 7 X64 Pro dual-Xeon E5645 system. They are single threaded because that's all that build approach supports. The higher pass count is somewhat more efficient in avoiding composite factor candidates.
It also allows more flexibility in number of processes run in parallel if doing that.

Differing number of words allows for fast runs on small operands, and for bigger factors and exponents. See the mfactor bits table attached to post one of this thread.

After renaming factor.c.txt to factor.c, and building the 16-pass, which already compiled some needed modules, these were built by the following:
rem large number of passes builds, for better sieving, finer pass granularity, better manycore multithreading
gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DTF_CLASSES=4620 ../factor.c ../get_cpuid.c
gcc -o Mfactor-base-1w-tfc *o -lm

gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DTF_CLASSES=4620 -DP2WORD ../factor.c
gcc -o Mfactor-base-2w-tfc *o -lm

gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DTF_CLASSES=4620 -DP3WORD ../factor.c
gcc -o Mfactor-base-3w-tfc *o -lm

gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DTF_CLASSES=4620 -DP4WORD ../factor.c
gcc -o Mfactor-base-4w-tfc *o -lm

gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DTF_CLASSES=4620 -DNWORD ../factor.c
gcc -o Mfactor-base-nw-tfc *o -lm
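The per-variant compile and link lines above follow one pattern, so they can be generated rather than typed. This sketch only prints the command text and does not run gcc. It reproduces the flags exactly as used in the commands in this thread and assumes the shared modules were already compiled as described. The function name build_commands is made up for illustration.

```python
# Word-count variants and the preprocessor define each one uses (1w uses none).
WORD_DEFINES = {"1": None, "2": "-DP2WORD", "3": "-DP3WORD", "4": "-DP4WORD", "n": "-DNWORD"}


def build_commands(words, tfc=False):
    """Return the compile and link command strings for one build variant."""
    flags = ["-DFACTOR_STANDALONE", "-DTRYQ=4"]
    if tfc:
        flags.append("-DTF_CLASSES=4620")  # 960 passes out of 4620 classes
    if WORD_DEFINES[words]:
        flags.append(WORD_DEFINES[words])
    # In the posted commands, get_cpuid.c is compiled along with the 1-word build only.
    sources = "../factor.c ../get_cpuid.c" if words == "1" else "../factor.c"
    target = f"Mfactor-base-{words}w" + ("-tfc" if tfc else "")
    return [
        f"gcc -c -Os {' '.join(flags)} {sources}",
        f"gcc -o {target} *o -lm",
    ]


if __name__ == "__main__":
    for tfc in (False, True):
        for words in WORD_DEFINES:
            print("\n".join(build_commands(words, tfc)))
            print()
```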


These may be very useful for long tasks on manycore systems (dual-Xeons, Xeon Phi). However, for small tasks, they may be slower. There seems to be a considerable overhead disadvantage to many classes at small exponent and bit level. Case in point, on condorella (dual E5645, Win 7 X64 Pro):

60 classes: M(2147483647) has 3 factors in range k = [0, 68726898240], passes 0-15
Performed 2740062501 trial divides
Clocks = 00:19:47.068 = 1187.068 seconds

4620 classes: M(2147483647) has 3 factors in range k = [0, 69004615680], passes 0-959
Performed 2751128805 trial divides
Clocks = 00:23:34.701 = 1414.701 seconds = 1.19176 times the 60-classes, 16-passes timing
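The comparison above can be reproduced from the reported clock strings and trial-divide counts. This is a small sketch that uses only the figures quoted in this post. The helper name clock_to_seconds is made up, and the hh:mm:ss.sss format is assumed from the two lines shown.

```python
def clock_to_seconds(clock):
    """Convert an 'hh:mm:ss.sss' clock string to seconds."""
    hours, minutes, seconds = clock.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Figures quoted above for M(2147483647) on the same system.
RUNS = {
    60: {"clock": "00:19:47.068", "trial_divides": 2740062501},
    4620: {"clock": "00:23:34.701", "trial_divides": 2751128805},
}

if __name__ == "__main__":
    for classes, run in RUNS.items():
        secs = clock_to_seconds(run["clock"])
        rate = run["trial_divides"] / secs
        print(f"{classes} classes: {secs:.3f} s, {rate:,.0f} trial divides/s")
    ratio = clock_to_seconds(RUNS[4620]["clock"]) / clock_to_seconds(RUNS[60]["clock"])
    print(f"4620-class run took {ratio:.5f} times as long")
```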


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: exe Mfactor-base-1w-tfc.exe (891.7 KB, 8 views)
File Type: exe Mfactor-base-2w-tfc.exe (896.8 KB, 11 views)
File Type: exe Mfactor-base-3w-tfc.exe (899.3 KB, 10 views)
File Type: exe Mfactor-base-4w-tfc.exe (900.3 KB, 9 views)
File Type: exe Mfactor-base-nw-tfc.exe (889.2 KB, 11 views)

Last fiddled with by kriesel on 2020-12-29 at 21:03

2020-11-27, 23:04   #4
kriesel

Linux builds

These were built on Ubuntu 18.04 / WSL on a Windows 10 Pro x64 i7-8750H system. See the mfactor bits table attached to post one of this thread. For an indication of what is possible with many threads on Knights Landing under Linux, see https://www.mersenneforum.org/showpo...&postcount=165


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: gz Mfactor-base-1w.gz (245.9 KB, 8 views)
File Type: gz Mfactor-base-2w-tfc-mt.gz (251.3 KB, 11 views)

Last fiddled with by kriesel on 2020-12-28 at 00:21

2020-11-27, 23:24   #5
kriesel

Poor-man's multithreading approximation

Launching separate processes for separate pass numbers with output redirection to pass-specific files permits using multiple cores on Windows. If there's a system crash before the processes complete their work, it's possible to resume each from roughly where it left off, by manually specifying beginning and ending k values. For large process counts, that can become tedious.

The following two paragraphs first appeared here:
"Poor man's multithreading" is running multiple processes for the same bit level and exponent, with different passmin and passmax. For example, 4-way, to use 4 cores with an msys2 compiled image,
passmin 0 passmax 3,
passmin 4 passmax 7,
passmin 8 passmax 11,
passmin 12 passmax 15.
This works well for powers of two passes per run. 1,2,4,8.
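The 4-way split above generalizes to any process count that divides the number of passes. This sketch computes the ranges and prints one command line per process without running anything. The -m, -bmin and -bmax flags appear elsewhere in this thread; the -passmin and -passmax spellings are assumed from the parameter names, and the exponent, bit levels, and log file name pattern are placeholders for illustration only.

```python
def split_passes(total_passes, processes):
    """Return inclusive (passmin, passmax) pairs, one per process."""
    if processes < 1 or total_passes % processes != 0:
        raise ValueError("process count must divide the number of passes evenly")
    width = total_passes // processes
    return [(i * width, (i + 1) * width - 1) for i in range(processes)]


def command_lines(exe, exponent, bmin, bmax, total_passes, processes):
    """Build one redirected command line per process (text only, nothing is launched)."""
    lines = []
    for lo, hi in split_passes(total_passes, processes):
        log = f"M{exponent}-b{bmax}-pass{lo}.log"  # hypothetical log file name
        lines.append(
            f"{exe} -m {exponent} -bmin {bmin} -bmax {bmax} "
            f"-passmin {lo} -passmax {hi} > {log} 2>&1"  # flag spellings assumed
        )
    return lines


if __name__ == "__main__":
    # Placeholder exponent and bit levels; 16 passes matches the 60-class builds.
    for line in command_lines("Mfactor-base-1w", 2147483647, 60, 61, 16, 4):
        print(line)
```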

If the build is done with -DTF_CLASSES=4620 for finer pass granularity, then the passmin and passmax values range from 0 to 959, so there are 960 = 2^6 * 3 * 5 passes. This larger pass count, with its numerous small factors, allows much more choice in the degree of parallelism. 960 has 28 divisors:
1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 16, 20, 24, 30, 32, 40, 48, 60, 64, 80, 96, 120, 160, 192, 240, 320, 480, 960
For brief runs there is no point to going to high degrees of parallelism, and -DTF_CLASSES seems to introduce higher overhead into a single run. For lengthy runs, the only way to get run times reasonable may be high degrees of parallelism. Using hyperthreading helps.
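The usable process counts are simply the divisors of the pass count, so they can be listed for either build type. This is a minimal sketch; the brute-force loop is fine for numbers this small.

```python
def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


if __name__ == "__main__":
    for passes in (16, 960):
        choices = divisors(passes)
        print(f"{passes} passes: {len(choices)} possible process counts: {choices}")
```

For 16 passes this gives only the powers of two up to 16, while 960 passes allow counts such as 3, 6, 12, 24, or 48 that match common core counts.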

These additional choices may better fit the available number of cpu cores and hyperthreads on a given system.

Note that in the case of a system problem, automated update restart, power outage, etc., it can be unpleasant to have a large number of incomplete passes to deal with. A good UPS, a stable, up-to-date, reliable system, and a well-chosen number of parallel processes are recommended to minimize the chore of continuing from the k values captured in log files. Or sacrifice some throughput and resume all the processes from the lowest maximum k value reached among all the processes being resumed, with a script to relaunch them all. Nevertheless, parallel processes can be powerful when the run time is weeks even with 16-64 processes.
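The "resume everything from the lowest maximum k" option can be sketched as follows. The k values here are invented for illustration, and reading them out of the log files is left out because the log format is not described in this post. The function names are hypothetical.

```python
def common_resume_k(last_k_by_process):
    """Lowest of the highest k values reached, usable as a single restart point."""
    return min(last_k_by_process.values())


def redone_k_span(last_k_by_process):
    """Total k range that gets repeated when every process restarts from the common point."""
    start = common_resume_k(last_k_by_process)
    return sum(k - start for k in last_k_by_process.values())


if __name__ == "__main__":
    # Hypothetical progress of four processes, keyed by starting pass number.
    progress = {0: 5_200_000_000, 4: 4_900_000_000, 8: 5_050_000_000, 12: 5_300_000_000}
    print("restart all processes at k =", common_resume_k(progress))
    print("k span repeated across processes =", redone_k_span(progress))
```

The repeated span is the throughput sacrificed in exchange for one relaunch script with a single starting k.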

Different hardware seems to behave differently. On a dual-Xeon E5-2697 v2 system (dual 12-core, 2-way hyperthreading), I've run 16 processes in parallel and seen only minor differences in duration among them, and ~15% impact on prime95 throughput. On a Knights Landing Xeon Phi (these have 4-way hyperthreading and 64, 68 or 72 cores), with 64 Mfactor processes and 4 prime95 workers, the Mfactor processes varied significantly in run time: longest = 1.68 x shortest, 151.8 hours vs. 255+ hours. That was for mfactor-base-2w-tfc -m 60651732991 -bmin 85 -bmax 86, using MCDRAM only, with the OS assigning processes to the cores without user involvement. The impact on the prime95 workers varied greatly too, from ~10% to, IIRC, a more than 100% increase in primality test iteration time for the highest numbered worker.

The exponent and bit level entries in the attached Windows batch files are for illustration only. Please do not run them as is without coordination with me. They take too long to waste time by duplicating effort. There's no web or other server site known to me for coordinating work on such large exponents, other than perhaps posting messages somewhere on the forum.

https://www.mersenneforum.org/showpo...04&postcount=5 gives an indication of how long one of the Mfactor runs took.

For simplicity, or maybe because I didn't think of it soon enough, the log files are output in the working directory, not one level lower. It's straightforward to create a folder for the exponent and final bit level, put the code there, and create all the run files there, then move the code to another folder for another run. But that is not required. Since the individual processes' log files are named according to exponent, starting pass number, and ending bit level, multiple bit levels or even exponents could be run in the same directory at the same time without log file name collisions. For example, one could run 1 process to do bit level x, 2 processes for x+1, 4 processes for x+2, and 8 processes for x+3, which would all complete in about the same time, assuming there are enough hyperthreads available that each Mfactor process gets its own register set. I strongly recommend starting with small process counts and small bit levels, for short run times, to become familiar with the program and batch script operation, and to confirm run time projections before attempting higher bit levels or more complex runs. Run time of a bit level, when setup overhead is small compared to factoring time, ideally scales as 2^bits / exponent / parallelprocesscount. Run times of weeks are easy to exceed.
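The scaling rule at the end of the paragraph above explains why 1, 2, 4, and 8 processes on successive bit levels finish together. This sketch computes relative run time only, in arbitrary units. It assumes the ideal scaling stated above and ignores setup overhead and hyperthread contention. The exponent and bit levels used are placeholders.

```python
from fractions import Fraction


def relative_run_time(bits, exponent, processes):
    """Ideal relative run time: 2^bits / exponent / process count (arbitrary units)."""
    return Fraction(2**bits, exponent * processes)


if __name__ == "__main__":
    exponent = 2147483647  # placeholder exponent
    x = 70                 # placeholder starting bit level
    for step, processes in enumerate((1, 2, 4, 8)):
        t = relative_run_time(x + step, exponent, processes)
        print(f"bit level {x + step} with {processes} processes: {float(t):.6e}")
```

All four lines print the same value, since each extra bit doubles the work and each doubling of the process count halves the time per process.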


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: zip mfactor-batch-files.zip (4.9 KB, 7 views)

Last fiddled with by kriesel on 2021-01-12 at 18:24

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.