This ftp site contains Ernst Mayer's C source code for performing Lucas-Lehmer tests of prime-exponent Mersenne numbers. It also includes a simple Python script for assignment management via the GIMPS project's PrimeNet server, or you can manually manage your work, if you prefer. In short, everything you need to search for Mersenne primes on your Intel, AMD or non-x86-CPU-based computer!
Mlucas is an open-source program for primality testing of Mersenne numbers in search of a world-record prime. You may use it to test any suitable number as you wish, but it is preferable that you do so in a coordinated fashion, as part of the Great Internet Mersenne Prime Search (GIMPS). Note that on x86 processors Mlucas is not as efficient as the main GIMPS client, George Woltman's Prime95 program (a.k.a. mprime for the linux version), but that program is not 100% open-source. Prime95 is also only available for platforms based on the x86 processor architecture.
[1] Use the Paypal "Send Money" function with my e-mail address, and mark it as Friends and Family rather than a purchase. Note that such money-sending is only fee-less if you are in the US and you use a PayPal balance or a bank account to fund the transaction. Note that "bank account" appears to include debit cards linked to such, since my Paypal account is linked to such a unified bank/checking/debit-card money-market account and I never pay to send money to US relatives. But check your transaction preview details before sending to be sure!
[2] E-mail me for a snail-mail address to which you can send your old-fashioned personal check for only the cost of your time and postage.
06 Jul 2022: Patch Alert: Small patch which fixes an infrequent (but run-killing when it occurs) issue with multithreaded runs of p-1 stage 2. The version number remains v20.1.1.
12 May 2022: Prototype Mlucas v21 used to test compositeness/probable-primality of cofactors of F25−F30.
20 Mar 2022: Patch Alert: A user assertion-exit report revealed a bug in the logic used to perform the Gerbicz error check in PRP tests. The bug only manifests for users doing PRP tests and who have set CheckInterval = 1000, the minimal allowable value, in their mlucas.ini file, overriding the default values of 10000 (≤ 4 threads) and 100000 (> 4 threads). (The only good reason to use such a small value is on slow devices where 10000 iterations need substantially more than, say 10 minutes, as checkpoints more frequent than a minute or so will cost 1-2% throughput). The version number remains v20.1.1.
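To see why the default CheckInterval values are usually fine, here is a quick back-of-envelope sketch; the 60 msec/iter figure is just an assumed example for a slowish device, not a measured Mlucas timing:

```shell
# Assumed example: a slow device at 60 msec per iteration,
# with the default CheckInterval of 10000 (<= 4 threads).
msec_per_iter=60
check_interval=10000
minutes=$(( msec_per_iter * check_interval / 1000 / 60 ))
echo "$minutes minutes between checkpoints"   # prints: 10 minutes between checkpoints
```

Only if that figure comes out at substantially more than 10 minutes is dropping to CheckInterval = 1000 worth the 1-2% checkpoint-overhead cost.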
02 Dec 2021: Patch Alert: Due to a user report of bad behavior of regular (non '-9') kill with his multithreaded run, signal handling has been changed to immediate-exit without savefile write. Suggest users 'killall -9 Mlucas' any ongoing jobs at their earliest convenience and switch to the updated code; using that, regular 'kill' should work. Also, workfile assignments are now echoed to the per-exponent log (.stat) file, not just to stderr (e.g. to the nohup.out file).
23 Nov 2021: Patch Alert:
06 Nov 2021: Patch Alert: There was a PRP-postprocessing bug leading to a sanity-check assertion-exit, and a p-1 premature-exit-at-end-of-stage-1 bug in the initial v20.1.1 release. (01 Nov, md5 = e3302de913e7bf65a83985d68a1193e1). Please make sure you have the current version if you do either of these kinds of GIMPS assignments.
01 Nov 2021: v20.1.1 Patch-release, containing two modest feature-adds and several bugfixes. Thanks to tdulcet and kriesel for much beating-on-the-code QA work, bug reports and stack traces:
Feature-add #1 is to support several new user options stored in a file mlucas.ini in the user's run directory. These are CheckInterval, LowMem and InterimGCD. See the help.txt file for details and restrictions.
Feature-add #2 is to support p-1 "stage 2 continuation", meaning if one has run p-1 to the default stage 1 and 2 bounds B1 and B2 (or to stage bounds specified via the Pminus1= assignment format) and failed to find a factor, one can run one or more deeper stage 2 intervals at the same B1 in an attempt to find a factor. See the help.txt file for details and examples.
Here are descriptions of the bugfixes:
31 Aug 2021: v20.1 released: This is an update-release of v20, but with enough changes as to warrant a minor-version number increment.
I urge users to delete (or rename) the mlucas.cfg file they are using for runs and run the self-tests using the v20.1 build to generate a fresh one, due to the v20 suboptimal-radix-set selection issue mentioned in the list below.
Changes include:
30 Jul 2021: v20.0 released: The major feature-add is support for p-1 factoring. Special thanks to Mihai Preda and George Woltman for the total-immersion course in state-of-the-art p-1 stage 2 optimizations, and Paul Underwood for remote access to his Odroid N2 with 4GB RAM, which allowed me to test my stage 2 implementation in an Arm-NEON-SIMD build context. Thanks also to Teal Dulcet for help with the make-Makefile scripting, and Ken Kriesel for trying the automake under WSL.
The key new runtime flag added in v20 is -maxalloc, which takes an argument representing the limit on the percentage of system free RAM used in p-1 stage 2. Note: free RAM, not total system RAM; for Linux this is based on the value of the sysinfo "freeram" field. The default is 90%; if you start seeing very small numbers in the 'top' Mem-usage summary (typically several lines above the big columnwise tasks-table), or said table starts showing the dreaded 'kswapd' entries near the top of the %CPU column, you need to lower the allocation. I've not found a reliable way to obtain free-RAM numbers for MacOS, so the default there is 50% of available RAM, based on the sysctl "hw.memsize" value.
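On Linux you can get a rough feel for what the default -maxalloc 90 permits by reading /proc/meminfo; this is only a sketch − Mlucas itself queries the sysinfo "freeram" field, which corresponds to the MemFree line:

```shell
# Linux-only sketch: 90% of current free RAM, in MB.
free_kb=$(awk '/^MemFree:/{print $2}' /proc/meminfo)
echo "$(( free_kb * 90 / 100 / 1024 )) MB usable under the default -maxalloc 90"
```

Re-run this while Mlucas is in stage 2 to see how close you are to the limit.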
Available memory determines the number of floating-point-residue-sized 'buffers' used in the p-1 stage 2 prime-pairing algorithm. The greater the number of buffers, the greater the fraction of stage 2 primes we can pair, which reduces the stage 2 work. At the current GIMPS testing wavefront of 6M FFT, each stage 2 buffer needs about 50MB of RAM. Note that once one gets beyond 500-1000 memory buffers in stage 2, one gets rapidly diminishing returns, as the prime-pairing percentage approaches 100% − this plot illustrates this for the Mlucas v20 pairing algorithm (blue circles represent actual pairing %; the black curve is a semi-decent-at-fewer-than-2000-buffers curve fit). (The code uses the actual data tables to make its determination, not the curve fit.)
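The ~50MB-per-buffer figure follows directly from the residue size − a 6M-double FFT residue occupies 6·2^20 doubles of 8 bytes each. A quick check of that arithmetic:

```shell
fft_len=$(( 6 * 1024 * 1024 ))              # 6M FFT: residue held as 6M doubles
buf_mib=$(( fft_len * 8 / 1024 / 1024 ))    # 8 bytes per double -> MiB per buffer
echo "$buf_mib MiB per stage 2 buffer"      # prints: 48 MiB, i.e. the ~50MB quoted above
```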
Savefiles: During stage 1, the code uses the same redundant p[exp], q[exp] savefile pair as for LL/PRP tests, where [exp] stands for the exponent p of the Mersenne number M(p) being used. On completion of stage 1 the p[exp] file is renamed p[exp].s1; this can later be used to run stage 2 continuations to deeper-than-default bounds, if the user wishes. (This is "advanced usage", not recommended − I will likely add a note to the thread linked at bottom of this release summary with how-to details; note that a deeper stage 1 requires rerun-from scratch). Stage 2 uses a single p[exp].s2 savefile; both s1 and s2 savefiles are roughly [exp] bits in size. During p-1 the logfile (p[exp].stat) output will also look very similar to that for LL/PRP tests, except each checkpoint-summary will include either 'S1' or 'S2', e.g.
[2021-07-28 13:22:38] M110763649 S2 at q = 3270330 [ 1.99% complete] clocks = 00:05:34.349 [ 33.3849 msec/iter] Res64: 99F98F04EFD4E49A. AvgMaxErr = 0.124210464. MaxErr = 0.187500000.

For both p-1 stages, 'iter' refers to FFT-based mod-M(p) multiplies (more precisely, "mod-squaring equivalents" in requiring precisely one forward and one inverse FFT); in stage 2 each such multiply processes between 1 and 2 stage 2 primes, the precise value depending on the number of memory buffers allocated; the value is the reciprocal of the ordinate value in the above-linked plot.
Note that for p-1 run ahead of a scheduled PRP test of the same exponent, stage 1 − let's say it does s1 iterations − is not fused with the corresponding opening s1 iterations of the PRP test. This is a possible future enhancement.
Other stuff:
Users of low-memory systems (< 4GB RAM) such as Raspberry Pi and Odroid-style micro-PCs will only get stage 1 of the p-1 algorithm, since running an effective stage 2 on an exponent requires a minimum of 30 floating-point-residue-sized 'buffers', which at the current GIMPS testing wavefront translates to around 1.5GB of free RAM. I did an Arm-Neon SIMD build of v20 on my sole remaining Samsung Galaxy S7 android phone, and quickly discovered that nearly all of its 4GB RAM was eaten up by various system tasks and unremovable apps. (Linux-running micros will be better in that regard, but still.)
Users who for any reason decide to continue with v19: please make sure you are using the patched version of the primenet.py script described at the top of the v19.1 section here.
Post feedback and any questions regarding v20.1.1 here.
Jump to Download & Build section
For general questions about the math behind the Lucas-Lehmer test, general primality testing or related topics (and also no small number of unrelated topics, such as this one following in the literary-humor footsteps of Ambrose Bierce), check out the Mersenne Forum.
For discussions specifically about Mlucas at the above venue, see the Mlucas subforum there.
The 'which' commands below show how to check whether the various pieces are already installed. I list only components which are not always included in a typical Linux distro or MacOS release:
Users need GCC ('which gcc') and/or the Clang/LLVM compiler ('which clang'; this is the native one for MacOS), the xz compression utilities ('which xz'), GNU make ('which make'), and − for those using some version of the primenet.py work-management script − python ('which python'). Nice-to-haves include wget and gdb. For example, under Ubuntu Linux one would use the following, where [] indicates the optional nice-to-have packages:
sudo apt -y install build-essential xz-utils wget python [git lm-sensors ssh openssh-server]

Here, build-essential is a meta-package that installs gcc/g++/make and a few other packages commonly used in a standard libc toolchain. Among the nice-to-haves, 'git' is useful for the numerous third-party freeware apps which use that repository system. The lm-sensors package is useful for monitoring system temperatures, e.g. via the alias temp='sensors|grep Package'; I suggest first using simply 'sensors' to see the various components which appear, repeating under CPU load to see which ones get hot, then customizing the alias to suit your system. The two ssh-related packages are useful for remote system administration.
Mlucas v20 and later also require GMP, but the install of that package is described separately, below.
WSL 2 is supposed to be faster than WSL 1, at least for IO, but I suspect Mlucas might actually be faster with WSL 1, since there would be no overhead from the VM used with WSL 2 (do not quote me on that). With WSL 2, Mlucas would also not be able to access all the system's RAM and the RAM usage percentage would be incorrect, since it would only be for the VM. If someone is actually going to use Mlucas with the WSL, they may want to first test with both WSL 1 and 2.

Another user adds further notes re. WSL1-versus-2:
Manual WSL1 installation is straightforward, and possible on hardware that lacks the virtualization features required for WSL2, allowing multithreaded Mlucas builds and runs on Windows 10 systems that can't run WSL2. Case in point: Xeon Phi 7210 & 7250, and some other processor models too. On Win 10, WSL1 install is simple: manual-method step 1, restart, then step 6; the distro is usable as soon as the distro download, bsdtar extract (visible in Task Manager) and install complete − in my case, Ubuntu 18.04 LTS.
Users switching from one version of the script to the other for any reason should delete the current primenet.py script and local.ini file from their run directory and re-register using the new script.
For the Do-It-Yourselfers, the build and tune flow is detailed below.
Linux: If 'ls -l /usr/include/gmp*' comes up with one or more header files, it is likely that GMP is already installed on your system. If that comes up empty:
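A sketch of that check, with the usual package-manager fix if it fails; the libgmp-dev package name is the Debian/Ubuntu one, other distros name it differently (e.g. gmp-devel on RedHat-family systems):

```shell
# Check the usual header locations; suggest the Debian/Ubuntu package if absent.
if ls /usr/include/gmp* >/dev/null 2>&1 || ls /usr/local/include/gmp* >/dev/null 2>&1; then
    msg="GMP headers found"
else
    msg="GMP headers missing; on Debian/Ubuntu: sudo apt -y install libgmp-dev"
fi
echo "$msg"
```

Installing via the package manager avoids the build-from-source procedure below entirely, if a sufficiently recent GMP is packaged for your distro.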
MacOS:
Here is the procedure − the needed downloads can all be to the ~/Downloads folder, no need to put them in any special location:
1. Download the .tar.xz of the current GMP from https://gmplib.org/ (in my case, that was gmp-6.2.1.tar.xz):
2. install gmp:
tar xJf gmp*xz
cd gmp-[your downloaded version]
./configure
make
make check
make install    (re-run with sudo if you receive a 'permission denied' message)
HWLOC: [NOTE: this package will only be needed as of v21, but is still useful for visualizing the hardware topology, especially for manycore and hybrid BIG+little processors.] In the context of 'apt', another useful associated tool is 'apt-file', which allows one to search for the package which provides a given file: for example, when first playing with the hwloc API, I needed the hwloc.h header, which was still missing after I installed the hwloc shared-object runtime libraries. I first used 'sudo apt-get install apt-file' to install apt-file, followed by 'apt-file update' to build the file cache needed by that for its search. Lastly, 'apt-file find hwloc.h' revealed that under Ubuntu one needs the libhwloc-dev package to get the needed header, thus 'sudo apt install -y libhwloc-dev'.
Installing hwloc on CentOS is even more fun: between CentOS 8.2 and 8.3 some clever soul decided to change the package-name capitalization from 'PowerTools' to 'powertools', so you first need to check your CentOS version, using 'lsb_release -a'. (If that gives 'lsb_release: command not found', you first need to - yep! - install yet another package, this one from the RedHat - sorry, I mean 'redhat' - side of the RedHat/CentOS alliance: 'sudo yum install redhat-lsb-core'). Once you know your CentOS version number:
More fun: On CentOS the resulting command-line tool was named not 'lstopo' but 'lstopo-no-graphics', so I added alias lstopo='lstopo-no-graphics' to my .bashrc file to uniformize command naming across my various Linux machines.
The command-line tool can be used to generate a nifty graphical representation of your hardware topology output to various image formats, e.g. 'lstopo my_machine.svg' produces output in scaled vector graphics format. Here are some examples of such graphical topology views:
To do primality tests, you'll need to build the latest Mlucas C source code release. First get the release tarball, available via wget as wget --backups=1 https://www.mersenneforum.org/mayer/src/C/mlucas_v20.1.1.txz (the '--backups=1' option causes wget to append a '.1' to any existing tarfile of the same name, which can come in handy; note that 'wget -O' permits direct-overwrite-of-existing, but requires the target filename to be explicitly listed by name), or via-click-to-download, here: mlucas_v20.1.1.txz (06 Jul 2022, 1690368 bytes ~= 1.6 MB, md5 checksum = 8d8851f5e383d8a74cf067192474256a). Then use 'tar xJf mlucas_v20.1.1.txz' to one-step uncompress/unpack the archive.
That unpacking will create a directory next to the .txz file whose name is the same as the prefix of the compressed tarchive, mlucas_v20.1.1. Inside that directory you will see the following ascii-text files: a bash-script file makemake.sh, a Python script primenet.py and a help.txt file with descriptions of the various command-line options. You will also see a directory 'src' containing the various C source and header files. If you wish to view any of the sourcefiles in an editor, I recommend using a 4-column tab setting, since much of the C code and especially the inline-assembly contained in various .h files is in multicolumn form and needs a 4-column tab setting to line up properly for viewing.
'bash makemake.sh' will create an 'obj' directory next to src, create a Makefile suitable for your CPU architecture inside obj, do a parallel-compile dividing the task of compilation among as many threads as your CPU has cores and, if all goes well, create an executable inside obj named simply 'Mlucas'.
If you want to build with the Clang/LLVM (or other GCC-compatible) compiler instead of GCC, run 'export CC=clang' before running the makemake.sh script. If you have already run the script and want to recompile Mlucas with a different compiler without rerunning the entire script, you must first run 'make clean' inside the obj directory and then invoke 'make' with CC set to the desired compiler's name:
cd mlucas_v20.1.1/obj/
make clean
CC=[compiler name] make -j "$(nproc)"

On my Intel Haswell quad running Ubuntu, the latest version of Clang for that was installed as clang-9, so after first building and testing with GCC, I made that replacement in the above CC= snip, then tested the resulting Clang binary for performance against the GCC one. (I found no appreciable difference on that machine.)
Build Errors: Should the above make-script indicate errors, the first thing to do is to search the build.log file generated in the 'obj' directory for the keyword 'error' − use a case-insensitive search, so that e.g. 'error' and 'Error' both get flagged. If you can't figure out the problem that way yourself, please e-mail me copies of the Makefile and build.log files inside 'obj'.
Issue: "clang: error: the clang compiler does not support '-march=native'": This affects Clang builds on the Arm platform, including Apple Silicon CPUs such as M1. The workaround is 'bash makemake.sh asimd', which will build an Mlucas_asimd-named binary in the obj_asimd directory. The binary can be renamed just 'Mlucas' post-build, if you prefer.
Issue: "cc: Command not found": Some users have reported hitting this make error under certain WSL installs. On both my Mac and various Linux boxes, 'which cc' points to /usr/bin/cc, and 'ls -l /usr/bin/cc' shows it to be an alias (soft-link) to gcc. But that is an assumption on the part of the makemake.sh script, which clearly does not hold on all Linuxes. On such systems, typing 'export CC=gcc' before retrying 'bash makemake.sh' should fix the problem.
Cross-compilation: The default mode for makemake.sh is to compile for the highest vector-SIMD mode and other key architectural features supported by the host machine. In general, cross-compiled binaries will not have quite the performance of natively compiled ones. Nevertheless, we anticipate that some users will want the convenience of building for multiple targets on a single host build platform. To that end, the script supports a single optional SIMD-mode targeting flag, which assumes the build host platform is capable of building said mode. The flag must be one of the supported SIMD-arithmetic types: [x86_64: avx512_skylake avx512_knl avx2 avx sse2]; [Armv8: asimd]. E.g. to build an avx2-SIMD binary on a skylake-X host, 'bash makemake.sh avx2' will create an object-file directory obj_avx2 and a binary named Mlucas_avx2 within it.
Rebuilding an updated version of the source file archive: This may be needed when upgrading from a beta to a release version, or after downloading a patched version of a release, or for users who want to make local modifications to one or more source files. Generally, one should always start with 'make clean' in the obj-file directory (or 'rm -f obj/*.o' from the same top-level directory which contains makemake.sh), making sure one substitutes obj_[extension] for 'obj' in the case of a cross-compiled binary. Then 'cd ..' and re-run 'bash makemake.sh' on the updated source-file archive in /src. Gnu make is smart enough that after download of a new src-file archive version or user-modification of one or more source files, it will rebuild only the *.c files that have changed, but that can be dangerous as it does not account for any header-file changes that may have occurred. One of the nice things about parallel-make is that it's so speedy that full-rebuilds are easy and fast. Incremental rebuilds of just one or more .c files followed by relinking to obtain a new executable should be undertaken only by users who really know what they are doing.
[various]: warning: "/*" within comment [-Wcomment]
[various]: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
[various]: warning: cast from pointer to integer of different size
twopmodq80.c: In function `twopmodq78_3WORD_DOUBLE':
twopmodq80.c:1032: warning: right shift count ≥ width of type
twopmodq80.c:1032: warning: left shift count is negative
These are similarly benign − the -Wcomment warnings are a result of the way I do commenting in my inline-asm code; the -Wstrict-aliasing ones are a result of using precomputed integer bitfields to init various floating-point data according to IEEE-float format rules; the cast warnings are due to some array-alignment code which only needs the bottom few bits of a pointer, and the shift-count warnings are a result of compiler speculation-type optimizations.
   0x0000000000723126 <+1718>: vmovdqa %xmm5,0x690(%rsp)
   0x000000000072312f <+1727>: vmovq %rsi,%xmm5
   0x0000000000723134 <+1732>: lea 0x4540(%rbp),%rsi
   0x000000000072313b <+1739>: vpinsrq $0x1,%rcx,%xmm5,%xmm13
   0x0000000000723141 <+1745>: vmovq %rsi,%xmm5
   0x0000000000723146 <+1750>: lea 0x4580(%rbp),%rsi
=> 0x000000000072314d <+1757>: vmovdqa64 %xmm16,%xmm6
   0x0000000000723153 <+1763>: vpinsrq $0x1,%rsi,%xmm5,%xmm5

There are several other vmovdqa in the surrounding disassembly, but this is the only 64-suffixed one. This family of instructions is listed as "Move Aligned Packed Integer Values". Only the 512-bit ZMM-operand version is available in AVX512F (foundation instructions); the XMM,YMM forms need AVX512VL, the vector-length instruction subset, which is not supported by KNL. The workaround is to recompile just the radix64_ditN_cy_dif1.c source file with -O2, then relink.
./Mlucas -fft 192 -iters 100 -radset 0

You will want to look through the resulting informational output for a line reading "INFO: System has [X] available processor cores.", which I have bolded below in the sample output from my Core2 macbook. Here, the number reported refers to *logical* (virtual) processor cores. Intel and AMD users: if this number is double the number of physical system cores, your CPU supports hyperthreading. This is important in the "Performance Tune for Your Machine" section below.
This particular testcase should produce the following 100-iteration residues, with some platform-dependent variability in the roundoff errors and possibly the final residue shift count, which users need not concern themselves with:
INFO: System has 2 available processor cores.
...
100 iterations of M3888509 with FFT length 196608 = 192 K, final residue shift count = 744463
Res64: 71E61322CCFB396C. AvgMaxErr = 0.255430821. MaxErr = 0.312500000. Program: E19.1
Res mod 2^35 - 1 = 29259839105
Res mod 2^36 - 1 = 50741070790

[If the residues differ from these internally-pretabulated 100-iteration ones, the code will emit a visually-loud error message.]
./Mlucas -fft 192 -iters 100 -radset 0 -nthread 2

The -cpu flag and the differing Intel and AMD hyperthreaded-core-numbering conventions:
On non-hyperthreaded CPUs, this should nearly double the throughput (= half the runtime) versus the initial single-threaded (default) run. On hyperthreaded x86 processors, Intel users should see a nearly 2-fold speedup running this way, but AMD users won't. That's because '-nthread 2' really translates to 'run 2-threaded, with thread affinities set to logical CPU cores 0 and 1'. By 'logical cores' we mean the multiple (typically 2, but sometimes more) 'virtual cores' mapped to each physical CPU core in modern 'hyperthreaded' (Intel's term) CPU architectures. The Intel numbering system here is that on a system with n physical cores, physical CPU core 0 maps to logical cores 0 and n; physical CPU core 1 maps to logical cores 1 and n+1, etc. AMD uses a different logical-core numbering convention than Intel, whereby physical CPU core 0 maps to logical cores 0 and 1; physical CPU core 1 maps to logical cores 2 and 3, and so forth. The Intel-specificity of the Mlucas -nthread option is one reason it is deprecated (still supported but recommended-against) in v17 and beyond; another is that it does not permit setting the processor affinity of an Mlucas instance to a specific set of logical cores.
For these reasons Mlucas v17 introduced a new and much-more-flexible flag '-cpu', which accepts any mix of comma-separated individual core indices and core-index ranges of the form low:high and low:high:stride, where if stride is omitted it defaults to 1, and if high is also omitted, it means "run 1 thread on logical core [low]". Thus for our Intel user, -nthread 2 is equivalent to -cpu 0:1, but now our user can run a second 2-threaded job using -cpu 2:3 and be reasonably sure that the two runs are not competing for the same CPU cores. Our AMD user will similarly see no runtime benefit from replacing -nthread 2 with -cpu 0:1 (since on AMD both have the same effect of overloading a single physical CPU core), but will find that -cpu 0,2 (or in colon-delimited syntax, 0:2:2, i.e. 'use cores 0 through 2 in increments of 2') gives the expected 2-threaded speedup.
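To make the -cpu syntax concrete, here is a small hypothetical bash helper − expand_cpu is not part of Mlucas, just an illustration − that expands a -cpu argument into the explicit list of logical cores it denotes:

```shell
# Hypothetical helper: expand a -cpu style spec (comma-separated indices,
# low:high ranges, low:high:stride ranges) into an explicit core list.
expand_cpu() {
    local spec=$1 part lo hi st i out=""
    IFS=',' read -ra parts <<< "$spec"
    for part in "${parts[@]}"; do
        IFS=':' read -r lo hi st <<< "$part"
        hi=${hi:-$lo}; st=${st:-1}          # defaults: high=low, stride=1
        for (( i = lo; i <= hi; i += st )); do out+="$i "; done
    done
    echo "${out% }"
}
expand_cpu 0:1      # prints "0 1"     (the Intel -nthread 2 equivalent)
expand_cpu 0:2:2    # prints "0 2"     (the AMD 2-threaded example above)
expand_cpu 0:1,6:7  # prints "0 1 6 7" (one 2-core/4-thread instance on a 6c/12t Intel CPU)
```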
[Advanced users: For a complete list of Mlucas command line options, see the help.txt file at top level in the unpacked source directory tree.]
After building the source code, the first thing that should be done is a set of self-tests to make sure the binary works properly on your system. During these self-tests, the code also collects various timing data which allow it to configure itself for optimal performance on your hardware. It does this by saving data about the optimal FFT radix combination at each FFT length tried in the self-test to a configuration file, named mlucas.cfg. Once this file has been generated, it will be read whenever the program is invoked to get the optimal-FFT data (specifically, the optimal set of radices into which to subdivide each FFT length) for the exponent currently being tested.
To perform the needed self-tests for a typical-user setup (which implies that you'll be either doing double-checking or first-time probable-prime (PRP) testing), first remove or rename any existing mlucas.cfg file from a previous code build/release in the run directory, then type − note you'll want to insert a specific set of -cpu options from the list below in place of the [-cpu flags] placeholder − the following to run a bunch of self-tests. This first basic test will run single-threaded self-tests and needs from a few minutes on fast Intel hardware to upwards of an hour on humbler CPUs such as my little Odroid:
./Mlucas -s m [-cpu flags] >& test.log
mv mlucas.cfg mlucas.cfg.1thr
The above 'Mlucas -s m' command tells the program to perform a series of self-tests for FFT lengths in the 'medium' range, which currently means FFT lengths from 1024K-7680K, covering Mersenne numbers with exponents from 20M − 143M. You should run the self-tests under unloaded or constant-load conditions before starting work on any real assignments, so as to get the most-reliable optimal-FFT data for your machine, and to be able to identify and work around any anomalous timing data. (See example below for illustration of that). This may take a while, especially in single-threaded mode; you can monitor progress of the process by opening the mlucas.cfg file in an editor and watching the various-FFT-length entries get added as each set of tests at a given FFT length completes. When done, please check the resulting test.log file for error messages. You may see a few messages of the form
***** Excessive level of roundoff error detected - this radix set will not be used. *****
but a whole lot of such, or residue-mismatch or other kinds of errors means that something has likely gone awry in your build. This can be something as mundane as the compiler using unsafe optimizations for one or more FFT-radix functions, or something more serious. In such cases, please contact me, the program author, and attach zipped copies of your build.log and test.log, along with information about your compiler version and compute platform (CPU and OS).
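A quick command-line way to make that check; the real usage in your run directory is simply 'grep -ic error test.log', but the snippet below runs it against a tiny synthetic log so that it is self-contained:

```shell
# Real usage:  grep -ic error test.log
# Self-contained demo on a synthetic log:
log=$(mktemp)
printf 'Iter 1000 ok\n***** Excessive level of roundoff error detected - this radix set will not be used. *****\nRes64 ok\n' > "$log"
errcount=$(grep -ic error "$log")   # case-insensitive count of error-flagged lines
echo "$errcount error-flagged line(s)"
rm -f "$log"
```

A small count, consisting only of the roundoff-error messages described above, is benign; dozens of hits or any residue-mismatch lines warrant a closer look.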
Setting Up for Running on Multicore Systems:
Much testing experience shows that on systems with 4 or more physical cores, the optimal tradeoff between run-management effort, RAM usage (especially in the presence of p-1 work) and total system throughput is achieved by assigning 2 or 4 physical cores (and possibly their logical counterparts if hyperthreaded) per Mlucas instance. Further, even if your CPU can run more than 4-threaded, you don't want to run small FFT lengths (< 1M) with > 4 threads, as the thread-overhead will kill parallel performance there. Here are my suggestions for what to enter in the [-cpu flags] field for several common hardware types, and how to create a simple bash runscript which on boot-up can be invoked just once as 'bash jobs.sh', if your goal is to maximize total throughput of your system.
Note that in the primenet.py arguments in the runscript examples, the '-T 150' arguments represent a preferred worktype of "First time PRP tests", which is the default handed out by the server to decently fast user machines. See the Get exponents from PrimeNet section below for details regarding the possible choices here. The '-maxalloc' argument provided for each Mlucas instance limits the maximum amount of memory used by stage 2 of p-1 factoring. The default is 90% of free RAM, but for multiple instances one wants to allow for the fact that 2 or more instances may be doing simultaneous stage 2 work and cut the limit somewhat. The best values depend on the users' work mix; the values below are best guesses. Lastly, the copy-into-multiple-run-directories commands and sample bash scripts assume the user has registered their CPU using the primenet.py script, thus creating a local.ini file as detailed in Step 3 below − don't do the actual setup or run the bash script until this has been done, and you have decided on a preferred worktype for yourself:
./Mlucas -s m -cpu 0:1 >& test.log

Then create as many run directories as needed to "fill the processor" up with such 2-thread instances − e.g. on a 10-core system, starting at the same directory level as 'src' and 'obj':
mkdir run0 run1 run2 run3 run4
cp ../obj/mlucas.cfg primenet.py local.ini run0
cp ../obj/mlucas.cfg primenet.py local.ini run1
cp ../obj/mlucas.cfg primenet.py local.ini run2
cp ../obj/mlucas.cfg primenet.py local.ini run3
cp ../obj/mlucas.cfg primenet.py local.ini run4

Then in one's home directory, create a jobs.sh file with these contents:
#!/bin/bash
if [ "$EUID" -eq 0 ]; then echo "init script needs to be executed as regular user (no sudo)" && exit; fi
cd ~/mlucas_v20.1.1/run0 && nohup ../obj/Mlucas -maxalloc 30 -cpu 0:1 &
cd ~/mlucas_v20.1.1/run0 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run1 && nohup ../obj/Mlucas -maxalloc 30 -cpu 2:3 &
cd ~/mlucas_v20.1.1/run1 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run2 && nohup ../obj/Mlucas -maxalloc 30 -cpu 4:5 &
cd ~/mlucas_v20.1.1/run2 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run3 && nohup ../obj/Mlucas -maxalloc 30 -cpu 6:7 &
cd ~/mlucas_v20.1.1/run3 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run4 && nohup ../obj/Mlucas -maxalloc 30 -cpu 8:9 &
cd ~/mlucas_v20.1.1/run4 && nohup ./primenet.py -d -T 150 &
./Mlucas -s m -cpu 0:3 >& test.log

Then create as many run directories as needed to "fill the processor" up with such 4-thread instances − e.g. on a 12-core system, starting at the same directory level as 'src' and 'obj':
mkdir run0 run1 run2
cp ../obj/mlucas.cfg primenet.py local.ini run0
cp ../obj/mlucas.cfg primenet.py local.ini run1
cp ../obj/mlucas.cfg primenet.py local.ini run2

Then in one's home directory, create a jobs.sh file with these contents:
#!/bin/bash
if [ "$EUID" -eq 0 ]; then echo "init script needs to be executed as regular user (no sudo)" && exit; fi
cd ~/mlucas_v20.1.1/run0 && nohup ../obj/Mlucas -maxalloc 50 -cpu 0:3 &
cd ~/mlucas_v20.1.1/run0 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run1 && nohup ../obj/Mlucas -maxalloc 50 -cpu 4:7 &
cd ~/mlucas_v20.1.1/run1 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run2 && nohup ../obj/Mlucas -maxalloc 50 -cpu 8:11 &
cd ~/mlucas_v20.1.1/run2 && nohup ./primenet.py -d -T 150 &
./Mlucas -s m -cpu 0:1 >& test0.log
mv mlucas.cfg mlucas.cfg.2c.2t
./Mlucas -s m -cpu 0:1,n:n+1 >& test1.log
mv mlucas.cfg mlucas.cfg.2c.4t

Compare the ms/iter numbers between the two resulting cfg-files. For the one which is generally faster across the various FFT lengths, do 'ln -s [faster of the mlucas.cfg.2c.*t files] mlucas.cfg' to create a symbolic link between it and 'mlucas.cfg', which is where Mlucas expects to find optimal FFT-tuning parameters at runtime. Then create as many run directories as needed to "fill the processor" with such instances − e.g. on a 6c/12t system where the 2c.4t cfg-file showed overall better timings, one could 'cd ..' out of obj to the same directory level as 'src' and 'obj', then, assuming these all live under a directory in the user's home named 'mlucas_v20.1.1':
mkdir run0 run1 run2
cp ../obj/mlucas.cfg primenet.py local.ini run0
cp ../obj/mlucas.cfg primenet.py local.ini run1
cp ../obj/mlucas.cfg primenet.py local.ini run2

Then in one's home directory, create a jobs.sh file with these contents:
#!/bin/bash
if [ "$EUID" -eq 0 ]; then echo "init script needs to be executed as regular user (no sudo)" && exit; fi
cd ~/mlucas_v20.1.1/run0 && nohup ../obj/Mlucas -maxalloc 50 -cpu 0:1,6:7 &
cd ~/mlucas_v20.1.1/run0 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run1 && nohup ../obj/Mlucas -maxalloc 50 -cpu 2:3,8:9 &
cd ~/mlucas_v20.1.1/run1 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run2 && nohup ../obj/Mlucas -maxalloc 50 -cpu 4:5,10:11 &
cd ~/mlucas_v20.1.1/run2 && nohup ./primenet.py -d -T 150 &
./Mlucas -s m -cpu 0:3 >& test0.log
mv mlucas.cfg mlucas.cfg.4c.4t
./Mlucas -s m -cpu 0:3,n:n+3 >& test1.log
mv mlucas.cfg mlucas.cfg.4c.8t

Compare the ms/iter numbers between the two resulting cfg-files. For the one which is generally faster across the various FFT lengths, do 'ln -s [faster of the mlucas.cfg.4c.*t files] mlucas.cfg' to create a symbolic link between it and 'mlucas.cfg', which is where Mlucas expects to find optimal FFT-tuning parameters at runtime. Then create as many run directories as needed to "fill the processor" with such instances − e.g. on a 12c/24t system where the 4c.8t cfg-file showed overall better timings, one could 'cd ..' out of obj to the same directory level as 'src' and 'obj', then, assuming these all live under a directory in the user's home named 'mlucas_v20.1.1':
mkdir run0 run1 run2
cp ../obj/mlucas.cfg primenet.py local.ini run0
cp ../obj/mlucas.cfg primenet.py local.ini run1
cp ../obj/mlucas.cfg primenet.py local.ini run2

Then in one's home directory, create a jobs.sh file with these contents:
#!/bin/bash
if [ "$EUID" -eq 0 ]; then echo "init script needs to be executed as regular user (no sudo)" && exit; fi
cd ~/mlucas_v20.1.1/run0 && nohup ../obj/Mlucas -maxalloc 50 -cpu 0:3,12:15 &
cd ~/mlucas_v20.1.1/run0 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run1 && nohup ../obj/Mlucas -maxalloc 50 -cpu 4:7,16:19 &
cd ~/mlucas_v20.1.1/run1 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run2 && nohup ../obj/Mlucas -maxalloc 50 -cpu 8:11,20:23 &
cd ~/mlucas_v20.1.1/run2 && nohup ./primenet.py -d -T 150 &
./Mlucas -s m -cpu 0:3:2 >& test0.log
mv mlucas.cfg mlucas.cfg.2c.2t
./Mlucas -s m -cpu 0:3 >& test1.log
mv mlucas.cfg mlucas.cfg.2c.4t

and as described above, create a symlink to the one showing better timings, then create multiple rundirs with copies of mlucas.cfg, primenet.py and local.ini. For a 6c/12t AMD system, the "run 3 instances, each using 2c/2t" bash runscript would look like this:
#!/bin/bash
if [ "$EUID" -eq 0 ]; then echo "init script needs to be executed as regular user (no sudo)" && exit; fi
cd ~/mlucas_v20.1.1/run0 && nohup ../obj/Mlucas -maxalloc 50 -cpu 0:3:2 &
cd ~/mlucas_v20.1.1/run0 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run1 && nohup ../obj/Mlucas -maxalloc 50 -cpu 4:7:2 &
cd ~/mlucas_v20.1.1/run1 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run2 && nohup ../obj/Mlucas -maxalloc 50 -cpu 8:11:2 &
cd ~/mlucas_v20.1.1/run2 && nohup ./primenet.py -d -T 150 &

If on the other hand 2c/4t is faster on one's system, the "run 3 instances, each using 2c/4t" bash runscript would look like this:
#!/bin/bash
if [ "$EUID" -eq 0 ]; then echo "init script needs to be executed as regular user (no sudo)" && exit; fi
cd ~/mlucas_v20.1.1/run0 && nohup ../obj/Mlucas -maxalloc 50 -cpu 0:3 &
cd ~/mlucas_v20.1.1/run0 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run1 && nohup ../obj/Mlucas -maxalloc 50 -cpu 4:7 &
cd ~/mlucas_v20.1.1/run1 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run2 && nohup ../obj/Mlucas -maxalloc 50 -cpu 8:11 &
cd ~/mlucas_v20.1.1/run2 && nohup ./primenet.py -d -T 150 &
./Mlucas -s m -cpu 0:7:2 >& test0.log
mv mlucas.cfg mlucas.cfg.4c.4t
./Mlucas -s m -cpu 0:7 >& test1.log
mv mlucas.cfg mlucas.cfg.4c.8t

and as described above, create a symlink to the one showing better timings, then create multiple rundirs with copies of mlucas.cfg, primenet.py and local.ini. For a 12c/24t AMD system, the above "run 3 instances, each using 4c/8t" bash runscript would look like this:
#!/bin/bash
if [ "$EUID" -eq 0 ]; then echo "init script needs to be executed as regular user (no sudo)" && exit; fi
cd ~/mlucas_v20.1.1/run0 && nohup ../obj/Mlucas -maxalloc 50 -cpu 0:7 &
cd ~/mlucas_v20.1.1/run0 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run1 && nohup ../obj/Mlucas -maxalloc 50 -cpu 8:15 &
cd ~/mlucas_v20.1.1/run1 && nohup ./primenet.py -d -T 150 &
cd ~/mlucas_v20.1.1/run2 && nohup ../obj/Mlucas -maxalloc 50 -cpu 16:23 &
cd ~/mlucas_v20.1.1/run2 && nohup ./primenet.py -d -T 150 &
If for some reason you want to generate optimal-FFT-params data for a single FFT length not covered by the standard self-tests, you can do so using the following command template:
./Mlucas -fft [n] -iters [100|1000|10000] [-cpu [args]]

First specify the FFT length, in units of Kdoubles − the supported lengths are of the form [8,9,10,11,12,13,14,15]*2^k, with k some integer ≥ 10. Then replace the [-cpu [args]] placeholder in the above command template with the desired cores-to-use specifier: nothing for a 1-threaded self-test, -cpu [args] for multithreaded. Lastly, for a single-FFT-length self-test you must explicitly specify the iteration count via '-iters [100|1000|10000]' − 100 is OK for 1-thread tests, but I suggest 1000 for thread counts between 4 and 15, and 10000 for ≥ 16 threads, in order to reduce the thread-and-data-tables-initialization overhead to a reasonable level.
Each single-length self-test should add 1 line to your mlucas.cfg file. The cfg-file lines appended by such single-length self-tests will have some additional residue data following the "radices = " listing, which you can ignore, since Mlucas stops parsing of these lines after reading in the radices.
Format of the mlucas.cfg file:
If you are running multiple copies of Mlucas, a copy of the mlucas.cfg file should be placed into each working directory, along with a worktodo.ini file containing assignments from the PrimeNet server which will be done by a copy of the Mlucas executable run from that working directory. Note that the program can run without the .cfg file, but with a proper configuration file (in particular one which was run under unloaded or constant-load conditions) it will run optimally at each runlength.
What is contained in the configuration file? Well, let's let one speak for itself. The following mlucas.cfg file was generated on a 2.8 GHz AMD Opteron running RedHat 64-bit Linux; comment lines beginning with # or // set the commentary off from the actual optimal-FFT-radix data:
#
# mlucas.cfg file
// Insert comments as desired in lines beginning with a # or // symbol, as long as such commenting occurs below line 1, which is reserved.
//
# First non-comment line contains program version used to generate this mlucas.cfg file;
14.1
#
# Remaining non-comment lines contain data about the optimal FFT parameters at each runlength on the host platform.
# Each line below contains an FFT length in units of Kdoubles (i.e. the number of 8-byte floats used to store the
# LL test residues for the exponent being tested), the best timing achieved at that FFT length on the host platform
# and the range of per-iteration worst-case roundoff errors encountered (these should not exceed 0.35 or so), and the
# optimal set of complex-FFT radices (whose product divided by 512 equals the FFT length in Kdoubles) yielding that timing.
#
2048  sec/iter =  0.134  ROE[min,max] = [0.281250000, 0.343750000]  radices = 32 32 32 32  0  0  0  0  0  0  [Any text offset from the list-ending 0 by whitespace is ignored]
2304  sec/iter =  0.148  ROE[min,max] = [0.242187500, 0.281250000]  radices = 36  8 16 16 16  0  0  0  0  0
2560  sec/iter =  0.166  ROE[min,max] = [0.281250000, 0.312500000]  radices = 40  8 16 16 16  0  0  0  0  0
2816  sec/iter =  0.188  ROE[min,max] = [0.328125000, 0.343750000]  radices = 44  8 16 16 16  0  0  0  0  0
3072  sec/iter =  0.222  ROE[min,max] = [0.250000000, 0.250000000]  radices = 24 16 16 16 16  0  0  0  0  0
3584  sec/iter =  0.264  ROE[min,max] = [0.281250000, 0.281250000]  radices = 28 16 16 16 16  0  0  0  0  0
4096  sec/iter =  0.300  ROE[min,max] = [0.250000000, 0.312500000]  radices = 16 16 16 16 32  0  0  0  0  0

Note that as of Jun 2014 the per-iteration timing data written to the mlucas.cfg file have been changed from seconds to milliseconds, but that change of scale is immaterial with respect to the notes below.
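The cfg-file comments note that the product of the complex-FFT radices, divided by 512, equals the FFT length in Kdoubles. A minimal Python sketch (illustrative only, not part of Mlucas) verifying this for the sample lines:

```python
from functools import reduce

def fft_length_kdoubles(radices):
    # Product of the nonzero complex-FFT radices, divided by 512,
    # gives the FFT length in Kdoubles (per the cfg-file comments above).
    nonzero = [r for r in radices if r != 0]
    return reduce(lambda a, b: a * b, nonzero) // 512

# First two sample lines above:
print(fft_length_kdoubles([32, 32, 32, 32, 0, 0, 0, 0, 0, 0]))  # -> 2048
print(fft_length_kdoubles([36, 8, 16, 16, 16, 0, 0, 0, 0, 0]))  # -> 2304
```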
You are free to modify or append data to the right of the # signs in the .cfg file and to add or delete comment lines beginning with a # as desired. For instance, one useful thing is to add information about the specific build and platform at the top of the file. Any text to the right of the 0-terminated radices list for each FFT length is similarly ignored, whether it is preceded by a # or // or not. (But there must be a whitespace separator between the list-ending 0 and any following text).
One important thing to look for in a .cfg file generated on your local system is non-monotone timing entries in the sec/iter (seconds per iteration at the particular FFT length) data. For instance, consider the following snippet from an example mlucas.cfg file:
1536  sec/iter = 0.225
1664  sec/iter = 0.244
1792  sec/iter = 0.253
1920  sec/iter = 0.299
2048  sec/iter = 0.284
We see that the per-iteration time for runlength 1920K is actually greater than that for the next-larger vector length that follows it. If you encounter such occurrences in the mlucas.cfg file generated by the self-test run on your system, don't worry about it − when parsing the cfg file the program always "looks one FFT length beyond" the default one for the exponent in question. If the timing for the next-larger-available runlength is less than that for the default FFT length, the program will use the larger runlength. The only genuinely problematic case with this scheme is if both the default and next-larger FFT lengths are slower than an even larger runlength further down in the file, but this scenario is exceedingly rare. (If you do encounter it, please notify the author and in the meantime just let the run proceed).
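The "look one FFT length beyond" rule just described can be sketched in a few lines of Python; this is an illustrative model of the selection logic, not Mlucas's actual cfg-parsing code:

```python
def pick_fft_length(cfg_timings, default_fft):
    """Model of the lookahead rule described above: cfg_timings maps
    FFT length (Kdoubles) -> per-iteration timing. If the next-larger
    length in the file is faster than the default length, use it."""
    lengths = sorted(cfg_timings)
    if default_fft not in cfg_timings:
        return default_fft
    i = lengths.index(default_fft)
    if i + 1 < len(lengths) and cfg_timings[lengths[i + 1]] < cfg_timings[default_fft]:
        return lengths[i + 1]
    return default_fft

# Using the snippet above: 1920K is slower than 2048K, so a run whose
# default runlength is 1920K would be bumped up to 2048K:
timings = {1536: 0.225, 1664: 0.244, 1792: 0.253, 1920: 0.299, 2048: 0.284}
print(pick_fft_length(timings, 1920))  # -> 2048
```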
Aside: This type of thing most often occurs for FFT lengths with non-power-of-2 leading radices (which are algorithmically less efficient than power-of-2 radices) just slightly less than a power-of-2 FFT length (e.g. 2048K = 2^21), and for FFT lengths involving a radix which is an odd prime greater than 7. It can also happen if for some reason the compiler does a relatively poorer job of optimization on a particular FFT radix, or if some FFT radix combinations happen to give better or worse memory-access and cache behavior on the system in question. Such nonmonotonicities have gotten more rare with each recent Mlucas release, and especially so at larger (say, > 1024K) FFT lengths, but they do still crop up now and again.
Assuming your self-tests ran successfully, reserve a range of exponents from the GIMPS PrimeNet server.
By far the easiest way to do this, and also to submit results as they become available, is to use the Python script named primenet.py for automated PrimeNet assignments management; it is to be found in the top-level directory of the unzipped Mlucas source-archive, next to the makemake.sh script. The first thing to do is to invoke the help menu to see the various options and their abbreviated names (e.g. '--username' can be abbreviated as '-u'):
python primenet.py -h

Next, register your computer under your PrimeNet username with the server, now using the abbreviated forms of the various double-minus-prefixed options listed in the above help output − note that <> and [] denote required and optional arguments, respectively:
python primenet.py [-d] -r -u <uid> [-T <worktype>] [-H <computer name>] [--frequency <CPU freq (MHz)>] [-m <total mem (MiB)>]

The '-d' flag enables helpful debug diagnostics, and '-r' is short for '--register'. The registration step creates a local.ini file; as the PrimeNet server will preclude the system from getting some worktypes if the user sets too low a value for the CPU frequency or total memory, I suggest users manually edit the local.ini file after registration to fix any incorrect values for their machine; e.g. on my Intel Core i3 NUC mini, I set 'frequency = 2000' and 'memory = 8192' to reflect the 2 GHz CPU frequency and the 8 GB of RAM.
p-1 Assignments: Note that although Mlucas v20.1.1 supports p-1 testing in form of either Pfactor or Pminus1 assignments in the worktodo.ini file, in terms of the primenet.py script p-1 is only supported implicitly, i.e. as a Pminus1 assignment split from an LL|PRP assignment needing p-1 done: Mlucas does this splitting automatically when it encounters such an LL|PRP assignment with p-1 remaining to be done in the workfile. Users who want to do standalone p-1 testing must either fetch such assignments from the server via the Manual Tests form and paste them into the worktodo.ini file, or use the Dulcet/Connelly primenet script linked above, which supports standalone p-1 assignment fetch.
After you create however many run-subdirectories you want to run jobs from (say, one per 2 or 4 physical CPU cores of your system, as in the above 'Performance-tune for your machine' section), you need to copy the mlucas.cfg file resulting from the post-build self-tests into each, and also place a copy of primenet.py and the registration-created local.ini file into each rundir − the Performance-tune section already described this, except for the local.ini copies. Then, assuming you created a job-management jobs.sh bash-script as described, one level above your various run directories, 'bash jobs.sh' will launch all the needed Mlucas and primenet.py instances. In the primenet.py invocations in my jobs.sh examples, -d enables some useful debug diagnostics, and -T 150 reflects my preferred worktype. The 'nohup' prefix for the Mlucas and primenet.py invocations is short for "no hangup", and means the process can safely be invoked on a remote machine and continue after the user logs out of said machine. It also causes all terminal output by the process in question to be diverted to the 'nohup.out' file in the run directory, which is useful for tracking down problems.
As of this writing, the available worktype arguments for the -T flag are as follows:
Worktype:

Code | Mnemonic         | Description
-----|------------------|-------------------------------------------------
 101 | DoubleCheck      | Double-check LL tests
 150 | SmallestAvailPRP | First-time PRP tests (Gerbicz)
 151 | DoubleCheckPRP   | Double-check PRP tests (Gerbicz)
 152 | WorldRecordPRP   | World-record-sized numbers to PRP-test (Gerbicz)
 153 | 100MdigitPRP     | 100M-digit numbers to PRP-test (Gerbicz)
If the -T argument is omitted, the script uses the default worktype, DoubleCheck (numeric value 101). You must be connected to the internet when you launch the script; once it has done its initial work-fetching you can be offline most of the time: the script simply periodically checks whether there are any new results in the run directory in which it was launched; if so, *and* it is able to connect to the PrimeNet server, it submits the new results (usually just one, unless you are offline nearly all the time) and fetches new work; otherwise it sleeps and retries later. The default is to check for 'results to submit/work to get?' every 6 hours; you may override this via the -t option, followed by your desired time interval in seconds. '-t 0' means run a single-shot get-work-to-do and quit, if for some reason you prefer to periodically run the script manually yourself.
If the script runs successfully you should see a worktodo.ini file (if none existed already, the script creates it; otherwise it appends new work to the existing version of the file) with at least 2 LL|PRP-test assignments in it. The script will also periodically check the results.txt file in each run-subdirectory in which it is invoked. Whenever one or more new results are found and a connection to the internet is active during one of these periodic checks, the result is automatically submitted to the PrimeNet server, and the worktodo.ini file in the run directory 'topped up' to make sure it has at least 2 valid entries, the first of which will correspond to the currently ongoing job. Thus, the first time you use it, you just need to run the py-script in each local run directory to grab work to be done, then invoke the Mlucas binary to start the Lucas-Lehmer testing.
Offline Testing:
Users who wish to eschew this can continue to use the PrimeNet manual testing webforms at mersenne.org as described further down on this page, but folks running multiple copies of the program will find the .py-script greatly simplifies things. See the Get exponents from PrimeNet section for the simple instructions. Here's the procedure (for less-experienced users, I suggest toggling between the PrimeNet links and my explanatory comments):
The various PrimeNet work assignments supported by Mlucas are of several different forms (master reference available here). [aid,] refers to a 32-hexit assignment ID, with the [] indicating that it is optional − it may either be omitted or have "n/a," (case-insensitive) put in its place. A 32-hexit string of zeros can also be used, though this will not be written to any JSON-formatted results.txt file entry, since the PrimeNet v5 API rejects all-zeros with an ERROR_INVALID_ASSIGNMENT_KEY. For the JSON result formatting, anything but a 32-hexit-and-nonzero value will be ignored. Here are the assignment types supported by Mlucas:
{assignment type}={Unique assignment ID},{Mersenne exponent},{trial-factored to this many bits},{p-1 factoring has/has-not been tried}
PRP={Unique assignment ID},{k},{b},{n},{c},{trial-factored to this many bits},{ignore}

Here the integers k,b,n,c denote that k*b^n + c is the number being tested. Currently only Mersenne-number moduli are supported for this assignment type, i.e. k = 1, b = 2, n an odd prime and c = −1.
PRP={Unique assignment ID},{k},{b},{n},{c},{trial-factored to this many bits},{ignore},{base used for first-time PRP test},{PRP residue type}

The latter 2 arguments are needed because the DC must use the same base as the first-time test did, and because there are several 'flavors' of the underlying Fermat-PRP test which early implementations used. Mlucas only supports residue types 1 and 5, which differ only if the PRP test is followed by a fast Suyama-style cofactor-PRP test (currently unsupported by Mlucas), i.e. they are identical as far as PRP-testing goes. Note there is also a PRPDC= form of PRP-DC assignment, which is otherwise identical to the PRP= first-time-test format, and assumes the first-time test was type 1 and used base 3. (Let us just say that PRP testing was not a win for standardization and unambiguous documentation.)
Pminus1=[aid,]k,b,n,c,B1,B2[,TF_BITS][,B2_start][,known_factors]

where the rightmost optional argument, if provided, must be in the form of a comma-separated list of known factors bookended with double-quotes; and
Pfactor=[aid,]k,b,n,c,TF_BITS,ll_tests_saved_if_factor_found

Note that in contrast to the other, simpler assignment typenames, the parsing of the "Pminus1=" and "Pfactor=" typenames is case-insensitive.
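To make the field layouts above concrete, here is a small Python sketch that splits a worktodo.ini line into its typename, optional assignment ID and remaining fields. It is illustrative only − Mlucas's real parser handles additional cases (case-insensitive Pminus1=/Pfactor= typenames, quoted known-factor lists, "n/a," placeholders, and so on):

```python
def parse_worktodo_line(line):
    """Split one worktodo.ini entry into (typename, aid, fields).
    Illustrative sketch only, not Mlucas's actual parsing code."""
    typename, _, rest = line.strip().partition('=')
    fields = rest.split(',')
    # The leading 32-hexit assignment ID is optional for some types:
    aid = None
    if len(fields[0]) == 32 and all(c in '0123456789abcdefABCDEF' for c in fields[0]):
        aid = fields[0]
        fields = fields[1:]
    return typename, aid, fields

line = "DoubleCheck=B83D23BF447184F586470457AD1E03AF,22831811,66,1"
print(parse_worktodo_line(line))
# -> ('DoubleCheck', 'B83D23BF447184F586470457AD1E03AF', ['22831811', '66', '1'])
```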
Examples of typical assignments returned by the server follow:
Test=DDD21F2A0B252E499A9F9020E02FE232,48295213,69,0
Gets split into:
Pminus1=DDD21F2A0B252E499A9F9020E02FE232,1,2,48295213,-1,400000,12000000

M48295213 has not been previously LL-tested (otherwise the assignment would begin with "DoubleCheck=" instead of "Test="). The long hexadecimal string is a unique assignment ID generated by the PrimeNet v5 server as an anti-poaching measure. The ",69" indicates that M48295213 has been trial-factored to depth 2^69 with no factors found. The 0 following the 69 indicates that p-1 still needs to be done, so Mlucas splits the original assignment into a pair, the Pminus1= member of which specifies p-1 factoring to stage 1 and 2 bounds of 400000 and 12000000, respectively. If the p-1 run fails to turn up a factor, the program will proceed to the LL test; should p-1 find a factor, both assignments are deleted.

DoubleCheck=B83D23BF447184F586470457AD1E03AF,22831811,66,1

M22831811 has already had a first-time LL test performed, been trial-factored to a depth of 2^66, and has had p-1 factoring attempted with no small factors found, so perform a second LL test of M22831811 in order to validate the result of the initial test. (Or refute it − in case of mismatching residues for the first-time test and the double-check, a triple-check assignment would be generated by the server, whose format would however still read "DoubleCheck".)

PRP=C57FF1C644A0CB16F5E2B5B3A9FC4E1D,1,2,98024161,-1,77,2
Gets split into:
Pminus1=C57FF1C644A0CB16F5E2B5B3A9FC4E1D,1,2,98024161,-1,800000,29000000

This is a probable-prime-test assignment, meaning the Gerbicz check will be done at regular intervals throughout. The 4 integers following the hexadecimal assignment ID define the number to be tested as 1*2^98024161 - 1, i.e. the Mersenne number M98024161. This number has not been previously tested (otherwise the assignment would have 2 additional trailing arguments). The ",77" indicates that M98024161 has been trial-factored to depth 2^77, and the trailing 2 indicates that p-1 factoring has not been tried, i.e. p-1 would save 2 full-length primality tests (a first-time one and a double-check one) were it to find a small factor. Mlucas splits the original assignment into a pair, the Pminus1= member of which specifies p-1 factoring to stage 1 and 2 bounds of 800000 and 29000000, respectively. If the p-1 run fails to turn up a factor, the program will proceed to the PRP test; should p-1 find a factor, both assignments are deleted.

PRP=C42540C352E54E906108D48FA5D89488,1,2,80340397,-1,75,0,3,1

This is a PRP-double-check assignment, or PRP-DC for short. M80340397 has been trial-factored to depth 2^75, has had sufficient p-1 testing done (the 0 following the 75), and has already had a first-time PRP test performed, using base 3 and returning residue type 1.
If you are using the PrimeNet manual testing pages rather than the primenet.py script, copy the Test=... or DoubleCheck=... lines returned by the server into the worktodo.ini file, which must be in the same directory as the Mlucas executable (or contain a symbolic link to it) and the mlucas.cfg file. If this file does not yet exist, create it. If this file already has some existing entries, append any new ones below them.
Note that Mlucas makes no distinction between first-time LL tests and double-checks − this distinction is only important to the PrimeNet server.
Most exponents handed out by the PrimeNet server have already been trial-factored to the recommended depth (i.e. will be of the 'Test' or 'DoubleCheck' assignment type), so in most cases, no additional trial-factoring effort is necessary. If you have exponents that require additional trial factoring, you'll want to either return those assignments or, if you have a fast GPU installed on your system, download the appropriate GPU client from the GPU72 project to do the trial factoring.
Lastly, if you wish to Lucas-Lehmer test some non-server-assigned prime exponent, you can simply enter the raw exponent on a line by itself in the worktodo.ini file. To instead run a PRP test, with its superior hardware-error detection capability, create an assignment of the form
PRP=00000000000000000000000000000000,1,2,[your exponent here],-1,75,0
Then save the file and invoke Mlucas with your preferred -cpu arguments.
The program will run silently in background, leaving you free to do other things or to log out (If running on a remote machine, make sure to prefix the './Mlucas' with 'nohup'). Every 10000 iterations (or every 100000 if > 4 threads are used), the program writes a timing to the "p{exponent}.stat" file (which is automatically created for each exponent), and writes the current residue and all other data it needs to pick up at this point (in case of a crash or powerdown) to a pair of restart files, named "p{exponent}" and "q{exponent}." (The second is a backup, in the rare event the first is corrupt.) When the exponent finishes, the program writes the least significant 64 bits of the final residue (in hexadecimal form, just like Prime95) to the .stat and results.txt (master output) file. Any round-off or FFT convolution error warnings are written as they are detected both to the status and to the output file, thus preserving a record of them when the Lucas-Lehmer test of the current exponent is completed.
p-1 savefiles: During stage 1, the code uses the same redundant p[exp], q[exp] savefile pair as for LL/PRP tests, where [exp] stands for the exponent p of the Mersenne number M(p) being worked on. On completion of stage 1 the p[exp] file is renamed p[exp].s1; this can later be used to run stage 2 continuations to deeper-than-default bounds, if the user wishes. (This is "advanced usage", not recommended except for power-users who know what they are doing; note that a deeper stage 1 requires a rerun from scratch.) Stage 2 uses a single p[exp].s2 savefile; both s1 and s2 savefiles are roughly [exp] bits in size. During p-1 the logfile (p[exp].stat) output will also look very similar to that for LL/PRP tests, except that each checkpoint summary will include either 'S1' or 'S2', e.g.
[2021-07-28 13:22:38] M110763649 S2 at q = 3270330 [ 1.99% complete] clocks = 00:05:34.349 [ 33.3849 msec/iter] Res64: 99F98F04EFD4E49A. AvgMaxErr = 0.124210464. MaxErr = 0.187500000.

For both p-1 stages, 'iter' refers to FFT-based mod-M(p) multiplies (more precisely, "mod-squaring equivalents", each requiring precisely one forward and one inverse FFT); in stage 2 each such multiply processes between 1 and 2 stage 2 primes, the precise value depending on the number of memory buffers allocated.
For LL and PRP-tests, the program also saves a persistent p-savefile every 10M iterations, with extensions .10M, .20M, ..., reflecting which iteration the file contains restart data for. This allows for a partial-rerun − even in parallel 10Miter subinterval reruns, if desired − in case the final result proves suspect.
ADDING NEW EXPONENTS TO THE WORKTODO.INI FILE: You may add or modify ALL BUT THE FIRST ENTRY (i.e. the current one, which may span 2 lines in the case of a Pminus1 assignment split off from an LL|PRP one) in the worktodo.ini file while the program is running.
For users who prefer not to use the automated Python assignments-management script: to report results (either after finishing a range, or as they come in), log in to your PrimeNet account and then proceed to the Manual Test Results Check In page. Paste the results you wish to report − that is, one or more lines of the results.txt file (any results added to that file since your last checkin) − into the large window on that page. Note that as of v19 the results-line format has changed to so-called JSON (JavaScript object notation) format, so here are sample results-entry formats, depending on version:
v18 and earlier:
M86748829 is not prime. Res64: F28B3E531E99C315. Program: E18.0. Final residue shift count = 14725081
v19 writes a similar human-readable line as above to the p[exponent].stat file, but the output to the results.txt file is in JSON format, e.g.:
{"status":"C", "exponent":81253819, "worktype":"PRP-3", "res64":"CE9AB357C704369B", "residue-type":1, "fft-length":4718592, "shift-count":66554884, "error-code":"00000000", "program":{"name":"Mlucas", "version":"19.0"}, "timestamp":"2019-12-01 03:42:39 UTC", "aid":"0CF485CD87F1BAC48B6B05D3A2094579"}
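Since the v19+ results line is plain JSON, it can be inspected with any JSON parser; a minimal Python sketch using the sample entry above:

```python
import json

# The v19+ results.txt entry shown above, verbatim:
line = ('{"status":"C", "exponent":81253819, "worktype":"PRP-3", '
        '"res64":"CE9AB357C704369B", "residue-type":1, "fft-length":4718592, '
        '"shift-count":66554884, "error-code":"00000000", '
        '"program":{"name":"Mlucas", "version":"19.0"}, '
        '"timestamp":"2019-12-01 03:42:39 UTC", '
        '"aid":"0CF485CD87F1BAC48B6B05D3A2094579"}')
result = json.loads(line)
# "C" = composite; a probable prime would be reported with status "P"
print(result["status"], result["exponent"], result["res64"])
```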
If for some reason you need more time than the 180-day default to complete a particular assignment, go to the Manual Test Time Extension page, check the about-to-expire exponents, then go to the bottom of the page and click on "Extend checked exponents 60 days".
Mlucas creates savefiles (a.k.a. "checkpoint" files) at regular intervals to permit safe restart in case a run is interrupted for any reason, such as the host machine going down or a code crash.

o Bytewise: Residues from LL or PRP-tests or p-1 factoring work are stored in minimum-size and endian-independent form.
o LL vs PRP+Gerbicz:
o LL, PRP-test and p-1 stage 1 residues are stored in redundant savefile pairs. For work on the Mersenne number M[exponent] = 2^[exponent] − 1, these files are named p[exponent] and q[exponent]. At the conclusion of a p-1 factoring run, the primary p[exponent] savefile is renamed p[exponent].s1. The reason for this is twofold:
You can track your overall progress (for both automated and manual testing work) at the PrimeNet server's producer page. Note that this does not include pre-v5-server manual test results. (That includes most of my GIMPS work, in case you were feeling personally slighted ;).
It uses the well-known Lucas-Lehmer test for Mersenne numbers, which involves selecting an initial seed number (typically 4, but other values, such as 10, also work), then repeatedly squaring and subtracting 2, with the result of each squaring being reduced modulo M(p), the number under test. This square/subtract-2 step is done exactly p-2 times. If the result (modulo M(p)) is zero, M(p) is prime. For an excellent and much more in-depth discussion of the Lucas-Lehmer test and many other prime-related topics, please visit Chris Caldwell's web page on Mersenne numbers. Starting with v19, the code also supports a probable-primality-testing algorithm (PRP) recently added to the GIMPS worktype options. Both the mathematically rigorous Lucas-Lehmer test (LL) and the probabilistic PRP test will identify likely M-primes with similar accuracy − the tiny risk of a "false positive" with the PRP test is dwarfed in practice by the odds of a run going awry due to hardware error or data corruption of some kind. The major advantage of PRP is that it supports the recently-introduced Gerbicz checking mechanism, which allows one to cheaply catch more or less all such residue-integrity errors, allowing typical consumer-style commodity computing hardware to run such tests with similar reliability as expensive high-end server hardware using ECC memory.
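The square/subtract-2 iteration just described is simple enough to sketch in a few lines of Python − vastly slower than Mlucas's FFT-based arithmetic, of course, but enough to see the algorithm:

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test for M(p) = 2^p - 1, p an odd prime.
    Seed s = 4, then p-2 rounds of s <- s^2 - 2 (mod M(p));
    M(p) is prime if and only if the final s is 0."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# M(7) = 127 is prime; M(11) = 2047 = 23 * 89 is not:
print(lucas_lehmer(7), lucas_lehmer(11))  # -> True False
```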
Since the numbers being tested (and hence the intermediate residues in the LL test) are so large that they typically must be distributed across millions of computer words, by far the most time-consuming part of the LL test is the modular squaring.
For a large integer occupying N computer words, a simple digit-by-digit ("grammar school") multiply operation (which needs on the order of N2 machine operations) is much too slow to be practical. Rather, the code uses a multiply algorithm based on discrete convolution. (Math-geek joke: A discrete convolution is one which doesn't kiss and tell.) The discrete convolution algorithm is best-known from the field of signal processing, but also proves to have a surprising and very nifty application to the task of multi-precision multiplication. Long story short, recasting the bignum multiply as a discrete convolution allows one to use highly efficient discrete-convolution-effecting algorithms, the best-known of which is the fast Fourier transform (FFT), which is described in many numerical analysis texts, such as the well-known Numerical Recipes books. (These even have a set of so-called multiprecision integer routines, but I suggest staying away from those except by way of learning the rudiments − they're awful in efficiency terms.) For a back-of-the-envelope-style worked example illustrating the procedure, see here.
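To make the multiply-as-convolution idea concrete, here is a tiny Python sketch: the digit vectors of the two inputs are convolved (by the O(N^2) grammar-school method here, purely for clarity − an FFT computes the same convolution in O(N log N)), then carries are propagated to recover the product's digits:

```python
def convolution_multiply(a_digits, b_digits, base=10):
    """Multiply two numbers given as little-endian digit vectors by
    forming their acyclic convolution, then propagating carries.
    Illustrative O(N^2) version of the convolution an FFT would compute."""
    n = len(a_digits) + len(b_digits) - 1
    conv = [0] * n
    for i, a in enumerate(a_digits):
        for j, b in enumerate(b_digits):
            conv[i + j] += a * b
    # Carry propagation turns the raw convolution back into base-10 digits:
    digits, carry = [], 0
    for c in conv:
        carry, d = divmod(c + carry, base)
        digits.append(d)
    while carry:
        carry, d = divmod(carry, base)
        digits.append(d)
    return digits

# 123 * 456 = 56088, digits little-endian:
print(convolution_multiply([3, 2, 1], [6, 5, 4]))  # -> [8, 8, 0, 6, 5]
```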
The code also uses the now-well-known Discrete Weighted Transform technique of Crandall and Fagin to implicitly do the modding. This permits one to "fill" the digits of the input vector to the FFT-based squaring, and thus to reduce the vector length by a factor of 2 or more relative to any pre-1994 codes. For a detailed reference, see the original 1994 Crandall/Fagin DWT paper, kindly posted online by Barry Fagin. A useful identity for computing the inverse weights for Mersenne-mod IBDWT is here; this needs just a single reciprocal 1/2N, where N is the transform length, to be computed. For a plausible discrete-convolution roundoff error heuristic built atop this, see this research paper − in fact Mlucas uses said heuristic to automate the selection of FFT length based on the Mersenne exponent, in the given_N_get_maxP() function in get_fft_radices.c.
The upshot is, to write the world's fastest Mersenne testing program, one must write (or make use of) the world's fastest FFT algorithm.
Mlucas uses a custom FFT implementation written by me (EWM). I first started on this algorithmic journey in the late summer of 1996, and being a complete novice at transform-based arithmetic at the time, the first FFT routines I used were those from Numerical Recipes. Since then, the code has greatly evolved, and the FFT I currently use looks absolutely nothing like the original one, although it is doing basically the same thing (except for the non-power-of-2 vector length routines − NR has nothing along those lines). In recent years I have also augmented the original generic high-level C-code FFT implementation with inline assembly code to take advantage of more-recent x86 processors' SIMD vector-processing capabilities. This more than doubles the program's per-cycle throughput on AMD64 and Intel CPUs supporting SSE2 (roughly, Opteron through Core2), and nearly doubles it again on Intel CPUs supporting the newer AVX and AVX2 instruction sets with their 256-bit-wide registers, each of which can hold 4 doubles. AVX-512 again roughly doubles the throughput, if the memory hierarchy can keep up with the demands of such data-hungry processors, which is alas typically not the case on modern manycore systems. The code also has dedicated inline assembly for the Armv8 instruction set with its 128-bit SIMD, which has recently gone mainstream in the form of the Apple Silicon line of processors.
I have not had time or desire to package the FFT core of Mlucas into a form suitable for inclusion in the FFTW benchmarks, but my own comparisons indicate that the Mlucas FFT is typically at least twice as fast as FFTW for the vector lengths of interest to Mersenne prime searchers (real vectors of length 128K and larger, where K = 2^10 = 1024) running on comparable hardware.
As of this writing (v20.1.1), Mlucas can test Mersenne numbers M(p) = 2^p-1 with prime exponents up to the limit set by the maximum supported FFT length of 512M. (Note that exponents > 2^32, thus FFT lengths 256M-512M, require '-shift 0' to run.) In practice, this translates to M(p) with p approaching 9 billion. For exponents > 2^32, there is at present a hard limit of 1 million iterations in LL/PRP-test mode, since full-length primality tests of numbers this large are nowhere near practicable as of this writing. But such moduli can be useful for software and hardware parallel-scaling tests, and perhaps p-1 factoring attempts. For example, using the avx-512 build of Mlucas and running the first-available FFT-radix set at FFT length 512M doubles on 64 cores of an Intel Knights Landing workstation via './Mlucas -fft 512M -cpu 0:63 -iters 1000 -radset 0 -shift 0', we obtain:
NTHREADS = 64
INFO: Maximum recommended exponent for this runlength = 8937021925; p[ = 8883334793]/pmax_rec = 0.9939927268
Initial DWT-multipliers chain length = [short] in carry step
M8883334793: using FFT length 524288K = 536870912 8-byte floats, initial residue shift count = 0
this gives an average 16.546500461176038 bits per digit
Using complex FFT radices 1024 16 16 32 32
mers_mod_square: Init threadpool of 64 threads
Using 64 threads in carry step
1000 iterations of M8883334793 with FFT length 536870912 = 524288 K, final residue shift count = 0
Res64: E496C31B2534F520. AvgMaxErr = 0.261731586. MaxErr = 0.312500000. Program: E20.0
Res mod 2^35 - 1 = 11663456039
Res mod 2^36 - 1 = 21852180932
Clocks = 00:09:12.499

The maximum roundoff error (ROE) encountered was 0.312500000, which is within the safe zone, since the program is using "chain length = [short]" in the carry step; a [hiacc] mode is also available there to further reduce the ROEs should any come dangerously close to the fatal value of 0.5, at which point we can no longer round the floating-point-convolution output in question to the nearest integer with any degree of confidence. We further see that the default self-test exponent for the 512M FFT length is 99.4% of the maximum recommended exponent (8937021925) for this runlength; thus the largest "probably safe" exponent is the nearest prime <= 8937021925, which is 8937021911.
For p-1 the only limitation is that the product of stage 1 prime powers have bitcount < 2^32, which in practice translates to a stage 1 prime bound B1 a little under 3 billion − again much larger than what is feasible to run at this time, except on very small moduli. (And the prime-powers algorithm currently implemented by the code is a simple grammar-school O(n^2)-operation one with a few 64-bit math enhancements, and thus becomes grindingly slow for B1 beyond a few million.)