Welcome to the Great Internet Mersenne Prime Search! (The not-PC-only version ;)

This ftp site contains Ernst Mayer's C source code for performing Lucas-Lehmer tests of prime-exponent Mersenne numbers. It also includes a simple Python script for assignment management via the GIMPS project's PrimeNet server, or you can manually manage your work, if you prefer. In short, everything you need to search for Mersenne primes on your Intel, AMD or non-x86-CPU-based computer!

Mlucas is an open-source program for primality testing of Mersenne numbers in search of a world-record prime. You may use it to test any suitable number as you wish, but it is preferable that you do so in a coordinated fashion, as part of the Great Internet Mersenne Prime Search (GIMPS). Note that on x86 processors Mlucas is not (yet) as efficient as the main GIMPS client, George Woltman's Prime95 program (a.k.a. mprime for the linux version), but that program is not 100% open-source. Prime95 is also only available for platforms based on the x86 processor architecture.

Page Index:


[Shameless beg-for-money section] Please consider donating to support ongoing Mlucas development. While this is a long-term labor of love, not money, donations to help defray expenses are greatly appreciated. Whatever you can spare to support this kind of open-source distributed-computing project is welcome! Note that Paypal charges [2.9% + $0.30] for each such donation, e.g. of a $100 donation, $96.80 makes it to me. If you prefer to keep these fees from eating into your hard-earned money, you have two workaround options:

[1] Use the Paypal "Send Money" function with my e-mail address, and mark it as Friends and Family rather than a purchase. Note that such money-sending is only fee-less if you are in the US and you use a PayPal balance or a bank account to fund the transaction. Note that "bank account" appears to include linked debit cards: my Paypal account is linked to a unified bank/checking/debit-card money-market account and I never pay to send money to US relatives. But check your transaction preview details before sending to be sure!

[2] E-mail me for a snail-mail address to which you can send your old-fashioned personal check for only the cost of your time and postage.


News:

09 Feb 2021: v19.1 released: This restores Clang/LLVM-compiler buildability for Armv8 SIMD builds. Briefly, Laurent Desnogues reported that v19 fails to build on the new Apple Silicon M1 CPU (which uses the same Armv8+Neon 128-bit SIMD instruction set Mlucas has supported starting with v17) using the Clang/LLVM compiler, which is native on MacOS platforms. (He was able to build v19 on M1 using GCC inside the Brew environment, however.) The error emitted by Clang is "inline assembly requires more registers than available". In fact the issue is wider - all recent versions of Clang fail to build the code on Armv8. I've long used both GCC and Clang on my venerable Macbook Classic, where Clang compiles are significantly faster, but that machine is frozen at OSX 10.6.8 and its versions of both compilers are old. Using GCC and old versions of Clang, the relevant constraint on inline-asm macros has always been that such macros can use no more than 30 arguments. That may seem like more than anyone might ever consider using, but Mlucas deals in assembly macros much, much larger than the typical ones which litter sites like Stack Exchange, especially short-length discrete Fourier transform macros for complex radices up to 32, many of which involve fairly intricate permuted-IO-address-list patterning.

Some experimentation revealed that the above error involves the general-purpose registers (GPRs), not the vector-SIMD ones, and the relevant constraint for Clang builds on Armv8 involves the sum of the number of macro arguments (#arg) and number of GPRs used (#gpr) by the macro:

(#arg + #gpr) ≤ [#gpr available]
The Armv8 AArch64 architecture provides 31 GPRs x0-x30, of which x30 is special in that it is used as the "link register" which stores the return address of subroutines, thus in general [#gpr available] = 30. On Apple Silicon platforms the OS further reserves GPRs x18 and x29, so one's inline-asm should avoid using these and limit the above sum to 28. Oddly, Clang shows no analogous constraint with respect to the x86_64 versions of the various DFT assembly macros, even though x86_64 has just 16 GPRs, and only 14 of these are available for use by inline-asm. (In 64-bit terms, rsp and rbp are reserved.) The Mlucas v19 codebase had 10 core-DFT macros which violated the above constraint, so v19.1 involved rewriting the IO addressing for these. That sounds pretty easy, but in order to maintain a uniform cross-platform interface for these macros, for each one I needed to re-do the I/O address computations for no fewer than 5 distinct versions, corresponding to the 5 major SIMD architectures supported by Mlucas: SSE2, AVX, AVX2, AVX-512 (x86_64), and 128-bit Neon Advanced SIMD (Armv8). I expect little or no performance impact, as the runtime is dominated by floating-point SIMD arithmetic. However, perhaps Clang builds will offer some speed improvements in terms of integration of the asm-macros into the surrounding C "glue" code. And it never hurts to have multiple build options.
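To make the constraint concrete with a worked example: a DFT macro taking 16 address arguments may use at most 30 - 16 = 14 GPRs in its body under generic Armv8 Clang, but at most 28 - 16 = 12 on Apple Silicon.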

Apple Silicon users: For Apple M1, there is at present no reliable way to assign an Mlucas instance to the performance CPU or the efficiency CPU, each of which has 4 cores. Generally splitting threads from a single Mlucas instance across distinct CPU cores of such hybrid systems (another example of which is the Odroid N2) is a bad idea, but a user reports that using -cpu 0:7 on his M1 gives a significant performance boost over -cpu 0:3, so Apple Silicon and MacOS appear to make good use of both the performance and efficiency CPUs together for such multithreaded heavy compute workloads. It is not clear at this time whether using a pair of 4-thread jobs one with -cpu 0:3 and the other with -cpu 4:7 (these would need to be run inside separate run directories) is any better in terms of total throughput than a single 8-thread job.
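For anyone who wants to try the pair-of-4-thread-jobs experiment, here is a minimal sketch - the run-directory names are hypothetical, and it assumes the Mlucas binary sits one level up and each directory has its own mlucas.cfg and worktodo.ini, per the setup sections below:

mkdir -p run0 run1
(cd run0 && ../Mlucas -cpu 0:3 &)    # 4 threads on cores 0-3
(cd run1 && ../Mlucas -cpu 4:7 &)    # 4 threads on cores 4-7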

Other stuff: The v19.1 release also includes fixes for some bugs, such as a missing C-preprocessor #if wrapper for some x86_64 assembly macros in the mi64.c file which was causing some people's Armv8 compiles to fail on that file. Loïc Le Loarer's enhanced version of the primenet.py assignment-management script described in the 7 July 2020 note below is also part of the release, though note that Teal Dulcet and Daniel Connelly have built on that work in a big way in conjunction with tdulcet's auto-install-and-tune bash script for Linux, detailed in the Download and Build section below. Also, some incorrectly tabulated reference residues for the Mlucas self-tests (specifically for FFT length 1792K and the 'huge' FFT lengths 64-240M) have been corrected, though no one in their right mind will be using the latter for anything other than timing/scalability tests any time soon. 32-bit x86 SIMD builds have not been supported for the last several releases, and now all the attendant assembly-macros are gone from the source archive, as well.

For folks who already have a working GCC-compiled v19 binary there is no compelling reason to upgrade, but note that savefiles are fully compatible between v19.1 and v19 in case you want to try the new version, built with either GCC or Clang.

Known Bugs: There is a memory-corruption bug affecting the 'huge'-FFT self-tests and causing mismatching residues starting midway through that series of self-tests. Since these FFT lengths start at roughly 10x those for the current GIMPS testing-wavefront, this will only affect users who want to play with such large FFT lengths. If you are one of those, note that you can still self-test each such FFT length individually, e.g. for 128M, './Mlucas -fftlen 131072 -iters [100|1000|10000] -cpu [...]'. The largest such supported FFT length is 256M = 262144 Kdoubles, which can in theory test exponents slightly larger than 2^32, although (cf. the Algorithmic Q & A section) the current actual limit is the prime 2^32 - 65.

Post feedback and any questions here.

Return to Page Index

Jump to Download & Build section


07 Jul 2020: Primenet-API-capable primenet.py script with run-progress-updating: Thanks to Loïc Le Loarer, we have a vastly improved version of the primenet.py assignment-management script for users to try out. Until further notice (likely until the v20 release) the old one will continue shipping with the program source tarball, but we urge users to download the new version (15.1 KB, md5 = 6fefeef5773a7d0de9f1d2c7983c20f0), peruse the README_primenet.txt file in the unpacked archive, and give the new py-script a try. Post feedback and any questions here.

04 Jul 2020: Patch Alert: I have found and fixed a pair of critical bugs affecting FFT lengths of form 3*2^k. This means current GIMPS double-checks at FFT length 3M (3072K) and the recently-reached first-time-test wavefront at 6M (6144K). Here are the precise FFT radices and build modes affected - note ARM builds are not affected, but I rebuilt those binaries as a matter of course, to make sure my bugfix didn't inadvertently break anything there:

The bug specifically affects exponents in roughly the lower 90% of those covered by FFT lengths of form 3*2^k. My pre-release testing missed it because it has in the past focused on exponents toward the upper limits of each FFT length, in order to make sure the floating-point accuracy of each code release has not degraded. Source tarballs (bzip2 and xzip-compressed) and ARM binaries have been updated.

07 Jun 2020: Parallel-make script: Thanks to GIMPSer tdulcet, we have a parallel-make build script for users to try out. Please post feedback, bug reports, etc, in that thread.


01 Dec 2019: v19.0 released: (Special thanks to GIMPSer Laurent Desnogues for the AVX-512 builds and debug-data generation.) The bugs turned up during the beta-testing stage are detailed here. Note that it appears that interrupt-handling is not fully debugged - the Gerbicz-check logic significantly complicates said handling, since there are now multiple possible states of execution (compared to the simpler repeated-auto-squaring of the LL test) which need to be treated differently in terms of interrupt-handling. Until further notice, if you need to interrupt a PRP-test run, use the "nuclear kill" option, 'kill -9 [pid]', rather than e.g. 'fg/ctrl-c' or regular 'kill'. You can minimize lost runtime by waiting until the current iteration interval completes, i.e. until the latest savefile updates complete. This should only be necessary for PRP-tests, not LL-tests, but is also usable in the rare event of a hung LL-test interrupt attempt.

You can use v19 as a drop-in replacement for a v18 build, once you have re-run self-tests to generate a v19-optimized mlucas.cfg file. v18-generated savefiles are upward compatible with v19, since the savefile format for LL-test assignments is identical. The new PRP assignments supported by v19, on the other hand, use an extended savefile format which cannot be read by v18.

Report bugs and build/run issues to this dedicated Mersenneforum.org thread.

Significant changes/enhancements/bugfixes/new-features:


General Questions:

For general questions about the math behind the Lucas-Lehmer test, general primality testing or related topics (and also no small number of unrelated topics, such as this one following in the literary-humor footsteps of Ambrose Bierce), check out the Mersenne Forum.

For discussions specifically about Mlucas at the above venue, see the Mlucas subforum there.


Windows Users:

Note that as described in this article in The Hacker News, Win10 supports Linux Bash Shell and Ubuntu Binaries. In such a shell you can just follow the Linux-build instructions below using the level of SIMD vector-arithmetic appropriate for your CPU. While I hate to find myself in the business of promoting Windows in any way, the fact is that while pre-Win10 users can build the code by installing the proper Linux emulation environment as detailed in the following section, none of those emulators supports multithreaded builds and the standard Posix thread/core affinity-setting mechanisms which are crucial to getting the most out of a modern multicore CPU. So if you run Windows and you want to play with the code, you might as well upgrade to Win10 if you've not already done so, because it makes it trivial to do a Linux-style build.

Pre-Win10: If you have some good reason to not upgrade to Win10, here is your best option: Mlucas does not support building with Windows tools, but Windows users can download a pair of popular freeware packages to provide themselves with a Linux/GCC-like environment in which to build and run the code. (Thanks to Mersenne-forum member Anders Höglund for the setup here.)

First, you'll need a suitable archiver utility which handles various common Linux compression formats along with the Linux 'tar' (tape-archive, its name a historic artifact) command. I use the freeware 7zip package, which can handle most linux compression formats including .xz, .7z and .bz2. Download to your C-drive and run the extractor .exe.

Next, download the msys2-x86_64 package, which provides both the needed Linux emulation environment and the underlying MINGW compiler-tools installation. After downloading to c:, click to run the self-extractor.

Note that in the ensuing package-install and configuration steps, you will need to be connected to the Internet, and will need to quit and restart MSYS2 several times. I restart via the Start Menu → All Programs → MSYS2 64bit. When I press 'return' on the last category, a dropdown menu appears with these 3 items, of which you want the bottom-most one (MSYS2 MSYS):

MSYS2 MinGW 32-bit
MSYS2 MinGW 64-bit
MSYS2 MSYS

From the resulting command shell (and with a working internet connection), run these package-management commands in MSYS2, replying 'Y' to any do-you-wish-to-go-ahead-and-install prompts:

pacman -Syu (then exit & restart MSYS2)
pacman -Su (then exit & restart MSYS2)

Lastly, install the compiler and python-scripting tools:

pacman -S mingw-w64-x86_64-gcc
pacman -S mingw-w64-x86_64-python2

...and do one small manual edit of the c:\msys64\etc\profile (text) file: append the ':/mingw64/bin' snippet, including the leading :-separator, to the MSYS2_PATH line, so that it reads:

MSYS2_PATH="/usr/local/bin:/usr/bin:/bin:/mingw64/bin"

Then do a final exit & restart of MSYS2 and you are ready to go. Everything is located inside the C:\msys64 folder; there is no additional c:\mingw64 folder installation needed as in the older MSYS.

To test that everything is set up properly, type 'which gcc' in your just-opened shell. That should point to /mingw64/bin/gcc, and 'gcc -v' should show the version of the compiler in your installation on the final line of screen output.

Return to Page Index


DOWNLOAD AND BUILD THE CODE -- Automated Linux-script Method

Thanks to fellow Mersenner and coder Teal Dulcet, Linux users now have the option of using a very nice auto-install script which automates the source-download, builds in fast parallel-make mode and does an auto-tuning stage to identify the run mode (number of Mlucas instances and core/thread combinations for each) which maximizes total throughput on the user's system -- though code-build enthusiasts are free to play in Do-It-Yourself mode if they prefer. To get the latest version of the script:

wget https://raw.github.com/tdulcet/Distributed-Computing-Scripts/master/mlucas.sh -nv
chmod u+x mlucas.sh

On invocation, after the unpacking of the various C source and header files, you may see a warning like Warning: pip3 is not installed and the Requests library may also not be installed; as long as the script continues to the build stage you should be fine. Among other things, the script will fetch a custom version of the primenet.py work-management script enhanced in multiple ways by T. Dulcet and Daniel Connelly.

If you prefer to continue using an older version of primenet.py such as the L. Le Loarer-enhanced one in the unpacked Mlucas /src directory, you are free to do so and still use the above script for Mlucas install and autotune-for-your-hardware, though in that case, after wgetting mlucas.sh, I recommend commenting out the block of code near the bottom of the script beginning with
echo -e "Registering computer with PrimeNet\n"
python3 ../primenet.py [...]
and ending with a for/done loop which starts one or more primenet.py and Mlucas instances, each in its own run directory.

Author tdulcet writes regarding usage:

Here is the usage for the install script (optionally replace [PrimeNet User ID] with your GIMPS User ID, [Computer name] with your computer's name, [Type of work] with your preferred type of work and [Idle time to run] with the amount of idle time to wait before running in minutes):
./mlucas.sh [PrimeNet User ID] [Computer name] [Type of work] [Idle time to run]

If you do not supply the optional parameters, it is as if you ran it like this, which defaults to first-time PRP testing for the preferred assignment type (150), and waits for 10 minutes system idle time to start running:

[./mlucas.sh "$USER" "$HOSTNAME" 150 10]

Because of the new PrimeNet script, users can also run it anonymously as they can Prime95/MPrime:

[./mlucas.sh ANONYMOUS]

For users who want to build with the Clang/LLVM compiler instead of GCC:

Users can run [export CC=clang] before running the install script to build Mlucas with Clang instead of the default GCC. If they have already run the script and want to recompile Mlucas with a different compiler without rerunning the entire script, they can run (replace clang in the second command with the compiler's name):

cd mlucas_v19.1/obj/
CC=clang make -j "$(nproc)"
make clean

While the install script is designed for Linux, I have also tested it on Windows 10 with the Windows Subsystem for Linux (WSL) and it fully works.

(On my Intel Haswell quad running Ubuntu, the latest version of Clang for that was installed as clang-9, so I made that replacement in the above CC= snip.)

Note: In order to obtain accurate timings for your machine, the script must run with minimal CPU contention from other processes. You can use 'top', watch the output for a few seconds and scan the uppermost entries for ones consuming more than 10% CPU. It is often surprising what one finds -- for example, my Haswell system is mainly used for administrative work, occasional build-testing and new-code debug, and to host a pair of cutting-edge GPUs which do the heavy crunching, but that GPU work results in parasitic "kworker" tasks being spawned which start out near 0% CPU but over the course of hours and days consume more and more CPU cycles. For my timings below, I first rebooted the system, fired up the GPU jobs and verified that the kworker load was still negligible at the time I invoked the script.

As the script runs -- and it will typically need an hour or more -- you may see some scary-looking "FATAL ERROR...Halting test" messages. These indicate that a particular combination of FFT radices among the various ones tried during the tuning tests incurred excess roundoff error and will not be used for production runs. Here is the output of the auto-tune stage of the script for my now-somewhat-aged Intel Haswell quad CPU, which does not support hyperthreading, thus only the combinations (1 instance, 4 threads), (2 instances, 2 threads each) and (4 instances, 1 thread each) are tried:

Summary

	Adjusted msec/iter times (ms/iter) vs Actual iters/sec total throughput (iters/s) for each combination

FFT     #1                #2                #3
length  ms/iter  iters/s  ms/iter  iters/s  ms/iter  iters/s
2048K   21.52    185.082  18.84    190.722  17.56    179.875
2304K   29.48    155.569  23.24    137.457  21.52    146.095
2560K   30.84    150.943  23.9     135.382  22.27    141.446
2816K   34.88    110.424  28.86    67.395   26.77    117.024
3072K   37.76    121.139  29.8     115.644  28.08    114.379
3328K   45.6     96.880   34.52    98.834   32.29    99.873
3584K   42.64    107.331  33.64    92.411   31.26    104.778
3840K   52.04    82.795   40.06    83.119   37.28    83.241
4096K   47.56    82.284   41.14    74.339   37.29    65.860
4608K   56.76    69.876   44.28    74.940   41.91    76.779
5120K   61.52    68.531   49.16    66.017   46.35    68.679
5632K   74.56    55.531   55.4     60.101   52.45    60.391
6144K   73.6     54.183   62.3     53.080   57.05    43.384
6656K   98.84    43.827   71.74    42.502   67.56    44.556
7168K   91       48.005   69.8     47.815   65.83    47.055
7680K   112.68   35.229   84.64    37.439   78.81    37.190

Fastest combination
#  Workers/Runs  Threads        First -cpu argument
1  1             4, 1 per core  0:3

Mean/Average faster     #  Workers/Runs  Threads        First -cpu argument
1.065 ± 0.164 (106.5%)  2  2             2, 1 per core  0:1
1.027 ± 0.096 (102.7%)  3  4             1, 1 per core  0

Here, "Adjusted msec/iter" means in per-thread terms. Looking in the /obj directory afterward, there were 3 mlucas.N.cfg files with N = 0,1,2 corresponding to the above combos 1,2,3, of which the one deemed fastest overall, mlucas.0.cfg, was soft-linked under the name mlucas.cfg, which is where the program looks for its optimal FFT parameters for production-mode runs, and whose 4-thread timings were exactly 1/4th the adjusted per-thread ms/iter timings above in the column pair under #1. Thus the actual per-iteration timings for 4-thread mode range from 5.38-28.17 ms/iter for FFT lengths 2048K-7680K. For the mean/average output lines, the script uses a simple unweighted arithmetic mean of the iters/sec ± sqrt(variance). So for this machine, combo #1 was on average 6.5% faster than #2, and 2.7% faster than #3.

Also note the timing anomalies at 2816K and 3328K - those involve the prime-length 11 and 13 DFT macros, which have a slightly higher per-input opcount than their radix-10,12,14 neighbors due to their primality. However, they serve as merely the first layer in a much larger FFT computation, the rest of which uses an efficient-as-can-be large power of 2. The slowdowns at 2816K and 3328K are not echoed at their 2x-length cousins, 5632K and 6656K. Similarly, the timings for FFT lengths 3840K and 7680K, both of form 15*2^n, are anomalously high on this machine. I see no such weirdness on the other Intel AVX2 hardware I have, a 2-core/4-thread Broadwell (mostly a die shrink of Haswell) NUC mini. But it serves to illustrate the kinds of quirks one may see on particular hardware platforms. In such cases Mlucas will by default use the next-larger FFT length when invoked for production work, unless the user forces the smaller FFT length via the command line.

On successful completion, the script will attempt to register your computer with the Primenet server if this has not previously been done, and will print instructions for the user to edit their crontab file to set up for auto-running in the computed throughput-maximizing instance/thread configuration.

Return to Page Index

Jump to "Get exponents from PrimeNet" section for description of the various assignment types


DOWNLOAD AND BUILD THE CODE -- Manual Method

To do primality tests, you'll need to build the latest Mlucas C source code release. First get the release tarball, available in two differently-compressed forms - Windows users (again, pre-Win10) should have already downloaded the above-linked 7zip freeware archiver utility, so should just use that in conjunction with the smaller xz-compressed tarchive:

  • If your system has Xzip installed (do 'which xz' and see if that comes up empty or with a path-to-binary), get mlucas_v19.1.txz (07 Feb 2020, 1.6 MB, md5 checksum = 2b9af033d4bbb6d439d70bb9bc0c2617). (If the file extension looks unfamiliar, note that some people prefer a 'tar.xz' extension, as would result from a 2-step 'first tar, then Xzip-compress' procedure). Then use 'tar xJf mlucas_v19.1.txz' to one-step uncompress/unpack the archive.

  • Otherwise, get the bzip2-compressed mlucas_v19.1.tbz2 (07 Feb 2020, 2.8 MB, md5 checksum = 070f824de3aa8e820fd6adfab5c79746). (If the file extension looks unfamiliar, note that some people prefer a 'tar.bz2' extension, as would result from a 2-step 'first tar, then bzip2-compress' procedure). Then use 'tar xjf mlucas_v19.1.tbz2' to one-step uncompress/unpack the archive. That unpacking will create a directory next to the .txz or .tbz2 file whose name is the same as the prefix of the compressed tarchive, e.g. mlucas_v19.1 for v19.1. Inside that you will find a directory 'src' containing the various C source and header files, along with a Python-script textfile named primenet.py. If you wish to view any of these sourcefiles in an editor, I recommend using a 4-column tab setting, since much of the C code and especially the inline-assembly contained in various .h files is in multicolumn form and needs a 4-column tab setting to line up properly for viewing.
    Windows (pre-Win10) Users: Assuming you successfully installed MSYS2 as described above, everything below should work for you, except that the MSYS2/MINGW emulation environment does not support multithreaded builds. Thus just select the appropriate SIMD vector-mode for your x86 processor using the /proc/cpuinfo-based procedure described below (or none if non-x86), and omit -DUSE_THREADS from your compile statement.

    Once you have the Mlucas tarball downloaded and unzipped to your C: drive, you cd to it in the MSYS2 shell, via 'cd /c/mlucas_v19.1/src', where '/c' is MSYS2's syntax for the C: drive. (Similarly, /e points to the removable-media mount point, which is handy for unnetworked 'sneakernetting' of files between your MSYS2-rendered Windows filesystem and a USB flash drive). Type 'ls' to list the files in your src-subdirectory.


    Determining the SIMD build mode using the /proc/cpuinfo file: To see if your x86 CPU (either Intel or AMD) supports single-instruction-multiple-data (SIMD) vector arithmetic (and if so, what the highest-supported relevant SIMD level is), type the following regular-expression-search commands, stopping as soon as you get a hit (a line of text containing the substring-being-searched-for will be echoed to stdout):

    grep avx512 /proc/cpuinfo
    grep avx2 /proc/cpuinfo
    grep avx /proc/cpuinfo
    grep sse2 /proc/cpuinfo

    Whichever of these gave you the first hit, you will use -DUSE_[capitalized search substring] in your compile command line, e.g. if grepping for 'avx2' gave you the first hit, you use -DUSE_AVX2 in your compile command.
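    If you prefer a one-shot check, a small (hypothetical) bash loop can mirror the stop-at-first-hit procedure above and report the highest-supported level directly:

    for simd in avx512 avx2 avx sse2; do
        grep -q "$simd" /proc/cpuinfo && { echo "use -DUSE_${simd^^}"; break; }
    done

    (Here ${simd^^} is bash's uppercase expansion, turning e.g. 'avx2' into the 'AVX2' needed for the -DUSE_AVX2 flag.)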

    Mac OS X has no /proc/cpuinfo file, but Mac users can instead type "sysctl hw.optional" in a Terminal shell and look for the above SIMD instruction set names there.

    ARM users: grep asimd /proc/cpuinfo to see if your CPU is ARMv8, i.e. supports the advanced SIMD instructions for which I added assembly-code support as of v17.1. Even if your CPU lacks ASIMD, you can still build Mlucas, you will just be restricted to a generic-C-code (non-SIMD) build. I built and tested both versions on my Odroid C2, and the generic-C (non-SIMD) one runs at around 2/3 the speed of the SIMD one. This is less of a difference than builds with and without SSE2-assembly on my old Intel Core2, because the latter has dedicated high-throughput functional units to support the SSE2 SIMD whereas the ARM, being ruthlessly optimized for minimal power consumption and 'shared silicon', has both the non-SIMD and SIMD arithmetic instructions share the same underlying functional units, e.g. a pair of 64-bit hardware floating-point adders, which can be used to execute two 64-bit FADDs per cycle in non-SIMD mode or a paired 64-bit vector FADD in SIMD mode. The theoretically achievable floating-point arithmetic throughput is thus the same for both kinds of builds - the speedup I mentioned for SIMD builds is all due to the optimized inline-assembly making better usage of the available functional units than the compiler-optimized generic-C-code builds. However, on CPUs which throw serious silicon at the SIMD compute units such as Apple M1, there will be much larger speed gains for SIMD builds.


    Building: The build procedure is so simple, there is little point in the extra script-infrastructure and maintenance work needed by the usual linux ./configure-then-make procedure - let's illustrate using a multithreaded x86/SSE2 build under 64-bit Linux. (Again, pre-Win10 Windows users must omit -DUSE_THREADS from their compile statement; Win10 users should simply be following the Linux build instructions using the built-in Bash shell support.) Within the directory resulting from the unpacking of the compressed source tarball, I suggest creating an 'obj' subdir next to the src-directory (or specific-build-mode-named object-subdirs if you want to try multiple build modes, say obj_avx2 and obj_avx512 on newer Intel x86 systems), then cd'ing into the obj-dir and doing like so (again, this example is specifically for an SSE2 vector-SIMD-arithmetic build):

    gcc -c -O3 -DUSE_SSE2 -DUSE_THREADS ../src/*.c >& build.log
    grep error build.log

    [Assuming above grep comes up empty] gcc -o Mlucas *.o -lm -lpthread -lrt

    The various other (including non-x86) build modes are all slight variants of the above example procedure - if you are building a binary for the host machine, you can replace the architecture-specific -m[arch] values below with a generic -march=native:
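    For instance, a hypothetical AVX2 variant of the SSE2 example above - the same procedure, just swapping in the -DUSE_AVX2 flag determined via the /proc/cpuinfo method above, with gcc targeting the host CPU - would look like:

    gcc -c -O3 -march=native -DUSE_AVX2 -DUSE_THREADS ../src/*.c >& build.log
    grep error build.log
    [Assuming above grep comes up empty] gcc -o Mlucas *.o -lm -lpthread -lrt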

    Return to Page Index


    COMMON BUILD ISSUES AND WORKAROUNDS

    Once you have successfully linked a binary, I suggest you first try a spot-check at some smallish FFT length, say

    ./Mlucas -fftlen 192 -iters 100 -radset 0

    You will want to look through the resulting informational output for a line reading "INFO: System has [X] available processor cores.", which appears in the sample output below from my Core2 macbook. Here, the number reported refers to *logical* (virtual) processor cores. Intel and AMD users: if this number is double the number of physical system cores, that means your CPU supports hyperthreading. This is important in the "Performance Tune for Your Machine" section below.

    This particular testcase should produce the following 100-iteration residues, with some platform-dependent variability in the roundoff errors and possibly the final residue shift count, which users need not concern themselves with:

    INFO: System has 2 available processor cores.
    ...
    100 iterations of M3888509 with FFT length 196608 = 192 K, final residue shift count = 744463
    Res64: 71E61322CCFB396C. AvgMaxErr = 0.255430821. MaxErr = 0.312500000. Program: E19.1
    Res mod 2^35 - 1 =          29259839105
    Res mod 2^36 - 1 =          50741070790
    
    [If the residues differ from these internally-pretabulated 100-iteration ones, the code will emit a visually-loud error message.]
    If that works, try rerunning the same case, now with 2 threads rather than the default single-threaded:

    ./Mlucas -fftlen 192 -iters 100 -radset 0 -nthread 2

    On non-hyperthreaded CPUs, this should nearly double the throughput (= half the runtime) versus the initial single-threaded (default) run. On hyperthreaded x86 processors, Intel users should see a nearly 2-fold speedup running this way, but AMD users won't. That's because '-nthread 2' really translates to 'run 2-threaded, with thread affinities set to logical CPU cores 0 and 1'. By 'logical cores' we mean the multiple (typically 2, but sometimes more) 'virtual cores' mapped to each physical CPU core in modern 'hyperthreaded' (Intel's term) CPU architectures. The Intel numbering system here is that on a system with n physical cores, physical CPU core 0 maps to logical cores 0 and n; physical CPU core 1 maps to logical cores 1 and n+1, etc. AMD uses a different logical-core numbering convention than Intel, whereby physical CPU core 0 maps to logical cores 0 and 1; physical CPU core 1 maps to logical cores 2 and 3, and so forth. The Intel-specificity of the Mlucas -nthread option is one reason it is deprecated (still supported but recommended-against) in v17 and beyond; another is that it does not permit setting the processor affinity of an Mlucas instance to a specific set of physical (and/or logical, in the case of hyperthreaded CPUs) cores.

    For these reasons Mlucas v17 introduced a new and much-more-flexible flag '-cpu', which accepts any mix of comma-separated individual core indices and core-index ranges of form low:high and low:high:stride, where if stride is omitted it defaults to 1, and if high is also omitted, it means "run 1 thread on logical core [low]". Thus for our Intel user, -nthread 2 is equivalent to -cpu 0:1, but now our user can run a second 2-threaded job using -cpu 2:3 and be reasonably sure that the two runs are not competing for the same CPU cores. Our AMD user will similarly see no runtime benefit from replacing -nthread 2 with -cpu 0:1 (since on AMD both have the same effect of overloading a single physical CPU core), but will find that -cpu 0,2 (or in colon-delimited syntax, 0:2:2, i.e. 'use cores 0 through 2 in increments of 2') gives the expected 2-threaded speedup.


    STEP 2 - PERFORMANCE-TUNE FOR YOUR MACHINE

    [Advanced users: For a complete list of Mlucas command line options, type 'Mlucas -h', and note the topical help-submenu options.]

    After building the source code, the first thing that should be done is a set of self-tests to make sure the binary works properly on your system. During these self-tests, the code also collects various timing data which allow it to configure itself for optimal performance on your hardware. It does this by saving data about the optimal FFT radix combination at each FFT length tried in the self-test to a configuration file, named mlucas.cfg. Once this file has been generated, it will be read whenever the program is invoked to get the optimal-FFT data (specifically, the optimal set of radices into which to subdivide each FFT length) for the exponent currently being tested.

    To perform the needed self-tests for a typical-user setup (which implies that you'll be either doing double-checking or first-time LL testing), first remove or rename any existing mlucas.cfg file from a previous code build/release in the run directory, then type the following to run a bunch of self-tests - note you'll want to insert a specific set of -cpu options from the list below in place of the [-cpu flags] placeholder. This needs from a few minutes on fast Intel hardware to upwards of an hour on humbler CPUs such as my little Odroid:

    Mlucas -s m [-cpu flags] >& selftest.log

    Here is what to enter in the [-cpu flags] field for several common hardware types, if your goal is to maximize total throughput of your system:
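    For illustration - these are assumed examples based on the core-numbering discussion elsewhere on this page, so adapt them to your own core count:

    Intel quad, no hyperthreading:             -cpu 0:3
    AMD 4-core, 2 logical cores per physical:  -cpu 0:6:2
    Single-threaded baseline:                  omit the -cpu flag entirely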

    The above 'Mlucas -s m' command tells the program to perform a series of self-tests for FFT lengths in the 'medium' range, which currently means FFT lengths from 1024K-7680K, covering Mersenne numbers with exponents from 20M - 143M. You should run the self-tests under unloaded or constant-load conditions before starting work on any real assignments, so as to get the most-reliable optimal-FFT data for your machine, and to be able to identify and work around any anomalous timing data. (See example below for illustration of that). This may take a while, especially in single-threaded mode; you can monitor progress of the process by opening the mlucas.cfg file in an editor and watching the various-FFT-length entries get added as each set of tests at a given FFT length completes. When done, please check the resulting selftest.log file for error messages. You should expect to see a few messages of the form

    ***** Excessive level of roundoff error detected - this radix set will not be used. *****

    but a whole lot of such, or residue-mismatch or other kinds of errors means that something has likely gone awry in your build. This can be something as mundane as the compiler using unsafe optimizations for one or more FFT-radix functions, or something more serious. In such cases, please contact me, the program author, and attach zipped copies of your build.log and selftest.log, along with information about your compiler version and compute platform (CPU and OS).

    If for some reason you want to generate optimal-FFT-params data for a single FFT length not covered by the standard self-tests, you can do so using the following command template:

    ./Mlucas -fftlen [n] -iters [100|1000|10000] [-cpu [args]]

    First specify the FFT length, in units of Kdoubles - the supported lengths are of the form [8,9,10,11,12,13,14,15]*2^k, with k some integer ≥ 10. Then replace the [-cpu [args]] placeholder in the command above with the desired cores-to-use specifiers: nothing for a 1-threaded self-test, -cpu [args] for multithreaded. Lastly, if you are trying to run a single-FFT-length self-test, you must explicitly specify the iteration count via '-iters [100|1000|10000]' -- 100 is OK for 1-thread tests, but I suggest using 1000 for thread counts between 4 and 15, and 10000 for ≥ 16 threads, in order to reduce the thread-and-data-tables-initialization overhead to a reasonable level.
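    For example, a hypothetical 4-threaded tuning run at FFT length 4608K (one of the lengths in the timing table above) would be:

    ./Mlucas -fftlen 4608 -iters 1000 -cpu 0:3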

    Each single-length self-test should add 1 line to your mlucas.cfg file. The cfg-file lines appended by such single-length self-tests will have some additional residue data following the "radices = " listing, which you can ignore, since Mlucas stops parsing of these lines after reading in the radices.


    Format of the mlucas.cfg file:

    If you are running multiple copies of Mlucas, a copy of the mlucas.cfg file should be placed into each working directory, along with a worktodo.ini file containing assignments from the PrimeNet server which will be done by a copy of the Mlucas executable run from that working directory. Note that the program can run without the .cfg file, but with a proper configuration file (in particular one which was run under unloaded or constant-load conditions) it will run optimally at each runlength.

    What is contained in the configuration file? Well, let's let one speak for itself. The following mlucas.cfg file was generated on a 2.8 GHz AMD Opteron running RedHat 64-bit linux. The comment lines (those beginning with a '#') set off the actual optimal-FFT-radix data:

    	#
    	# mlucas.cfg file
    	# Insert comments as desired in lines beginning with a # or // symbol, as long as such commenting occurs below line 1, which is reserved.
    	#
    	# First non-comment line contains program version used to generate this mlucas.cfg file;
    	14.1
    	#
    	# Remaining non-comment lines contain data about the optimal FFT parameters at each runlength on the host platform.
    	# Each line below contains an FFT length in units of Kdoubles (i.e. the number of 8-byte floats used to store the
    	# LL test residues for the exponent being tested), the best timing achieved at that FFT length on the host platform
    	# and the range of per-iteration worst-case roundoff errors encountered (these should not exceed 0.35 or so), and the
    	# optimal set of complex-FFT radices (whose product divided by 512 equals the FFT length in Kdoubles) yielding that timing.
    	#
    	2048  sec/iter =    0.134  ROE[min,max] = [0.281250000, 0.343750000]  radices =  32 32 32 32  0  0  0  0  0  0  [Any text offset from the list-ending 0 by whitespace is ignored]
    	2304  sec/iter =    0.148  ROE[min,max] = [0.242187500, 0.281250000]  radices =  36  8 16 16 16  0  0  0  0  0
    	2560  sec/iter =    0.166  ROE[min,max] = [0.281250000, 0.312500000]  radices =  40  8 16 16 16  0  0  0  0  0
    	2816  sec/iter =    0.188  ROE[min,max] = [0.328125000, 0.343750000]  radices =  44  8 16 16 16  0  0  0  0  0
    	3072  sec/iter =    0.222  ROE[min,max] = [0.250000000, 0.250000000]  radices =  24 16 16 16 16  0  0  0  0  0
    	3584  sec/iter =    0.264  ROE[min,max] = [0.281250000, 0.281250000]  radices =  28 16 16 16 16  0  0  0  0  0
    	4096  sec/iter =    0.300  ROE[min,max] = [0.250000000, 0.312500000]  radices =  16 16 16 16 32  0  0  0  0  0
    
    Note that as of Jun 2014 the per-iteration timing data written to mlucas.cfg file have been changed from seconds to milliseconds, but that change in scaling is immaterial with respect to the notes below.

    You are free to modify or append data to the right of the # signs in the .cfg file and to add or delete comment lines beginning with a # as desired. For instance, one useful thing is to add information about the specific build and platform at the top of the file. Any text to the right of the 0-terminated radices list for each FFT length is similarly ignored, whether it is preceded by a # or // or not. (But there must be a whitespace separator between the list-ending 0 and any following text).

    One important thing to look for in a .cfg file generated on your local system is non-monotone timing entries in the sec/iter (seconds per iteration at the particular FFT length) data. For instance, consider the following snippet from an example mlucas.cfg file (note the out-of-sequence 1920K timing):

    	1536  sec/iter =    0.225
    	1664  sec/iter =    0.244
    	1792  sec/iter =    0.253
    	1920  sec/iter =    0.299
    	2048  sec/iter =    0.284
    

    We see that the per-iteration time for runlength 1920K is actually greater than that for the next-larger vector length that follows it. If you encounter such occurrences in the mlucas.cfg file generated by the self-test run on your system, don't worry about it -- when parsing the cfg file the program always "looks one FFT length beyond" the default one for the exponent in question. If the timing for the next-larger-available runlength is less than that for the default FFT length, the program will use the larger runlength. The only genuinely problematic case with this scheme is if both the default and next-larger FFT lengths are slower than an even larger runlength further down in the file, but this scenario is exceedingly rare. (If you do encounter it, please notify the author and in the meantime just let the run proceed).

    Aside: This type of thing most often occurs for FFT lengths with non-power-of-2 leading radices (which are algorithmically less efficient than power-of-2 radices) just slightly less than a power-of-2 FFT length (e.g. 2048K = 2^21), and for FFT lengths involving a radix which is an odd prime greater than 7. It can also happen if for some reason the compiler does a relatively poorer job of optimization on a particular FFT radix, or if some FFT radix combinations happen to give better or worse memory-access and cache behavior on the system in question. Such nonmonotonicities have gotten more rare with each recent Mlucas release, and especially so at larger (say, > 1024K) FFT lengths, but they do still crop up now and again.



    Users who just want to start doing GIMPS work after completing the above build self-test should skip down to the Reserve exponents from PrimeNet section. Those who want to see if multithreaded running (other than the 2-threads-per-physical-core mode described above, specifically for hyperthreaded Intel CPUs) offers any gain on their system should read the following subsection.



    Advanced Users:

    Note that the default in automated self-test mode is the same as for production run mode: to use a single thread running on a single physical core, using 100-iteration timing runs of the various FFT lengths and radix combinations at each length. You may also explicitly specify the desired number of self-test iterations, but for this to produce a .cfg file you must use one of the 3 standard values, '-iters 100', '-iters 1000' or '-iters 10000' for which the code stores pretabulated results which it uses to validate (or reject) self-test results. 100 is nice for 1- and perhaps 2-thread testing, but on fast systems with ≥ 2 threads, 1000 is better, because it yields a more-precise timing and is better at catching radix sets which may yield an unsafely high level of roundoff error for exponents near the upper limit of what the code allows for a given FFT length. Thus, to run the small, medium and large self-tests 2-threaded and with 1000 iterations per individual subtest, first save the 1-threaded mlucas.cfg file under a different name, e.g. mlucas.cfg.1thr. Then, on Intel systems:

    ./Mlucas -s m -iters 1000 -nthread 2

    or, equivalently:

    ./Mlucas -s m -iters 1000 -cpu 0:1

    On systems using a different core-numbering system than Intel you will need to modify the core indices in multithread runs suitably, e.g. on AMD our 2-threaded timings should use

    ./Mlucas -s m -iters 1000 -cpu 0,2 (Note: 0,2 not 0:2 -- the latter means "use cores 0,1,2" but we want only cores 0 and 2 here)

    On systems other than Intel and AMD a quick single-case timing experiment should suffice to reveal whether the physical-core-numbering scheme is like that of Intel or AMD, or perhaps something else. Compare the runtimes for these:

    ./Mlucas -fftlen 192 -iters 100 -radset 0 [This is your 1-thread baseline timing]
    ./Mlucas -fftlen 192 -iters 100 -radset 0 -cpu 0:1
    ./Mlucas -fftlen 192 -iters 100 -radset 0 -cpu 0,2

    If -cpu 0:1 gives a clearly better timing - in the sense that the runtimes are on average < 0.5x the 1-thread ones - than 1-thread and -cpu 0,2, use the former (Intel) core-numbering scheme. If -cpu 0,2 gives the clear best timing, use the AMD numbering scheme. If neither of the 2-threaded runs gives a timing better than (say) 0.6x the 1-thread timing, you should stick to single-threaded running, 1 job per physical core.

    Once your 2-threaded self-tests complete, for the total system throughput to beat the simple one-single-threaded-job-per-physical-CPU, the per-iteration timings in the 2-thread .cfg file need to be on average half those in the single-thread .cfg file. If they are not, it's probably best to just go single-threaded. Rename the 2-threaded mlucas.cfg file mlucas.cfg.2thr, and either remove the .1thr extension you added to the 1-thread .cfg file, or place a soft-link to that one in each of your production run directories, under the alias mlucas.cfg. (E.g. 'mkdir run0 && cd run0 && ln -s ../mlucas.cfg.1thr mlucas.cfg'.)

    To follow the 2-threaded self-test with a 4-threaded one for purposes of timing comparison, first save the 2-threaded mlucas.cfg file under a different name, e.g. mlucas.cfg.2thr. Then on Intel:

    ./Mlucas -s m -iters 1000 -cpu 0:3

    or on AMD, where the following 3 -cpu argument sets are all equivalent, and illustrate the various available syntaxes:

    ./Mlucas -s m -iters 1000 -cpu 0,2,4,6
    ./Mlucas -s m -iters 1000 -cpu 0:6:2
    ./Mlucas -s m -iters 1000 -cpu 0:7:2 [think of C loop of form for(i = 0; i <= 7; i += 2)]

    And don't forget to

    mv mlucas.cfg mlucas.cfg.4thr

    For 4-threaded to give better total throughput than four single-threaded jobs, the .4thr timings need to be roughly 3.5x or more faster than the .1thr ones. The fuzzy-factor here is due to memory contention effects, whereby multiple 1-thread runs will compete for the same system memory bandwidth and slow each other down. Starting four 1-thread production runs and letting them run through the first several 10000-iteration checkpoints will give you per-iteration timings you can more fairly compare to those for the same FFT length in the mlucas.cfg.4thr file.
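    As a concrete (made-up) example: if your mlucas.cfg.1thr file shows 40 ms/iter at some FFT length, the corresponding mlucas.cfg.4thr entry should read roughly 40/3.5 ≈ 11.4 ms/iter or less before the single 4-threaded job clearly beats four 1-threaded ones.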

    Additional Notes:

    If you want to do the self-tests of the various available radix sets for one particular FFT length, enter

    Mlucas -s {FFT length in K} -iters [100 | 1000 | 10000]

    For instance, to test all FFT-radix combos supported for FFT length 704K for 10000 iterations each, enter

    Mlucas -s 704 -iters 10000

    The above single-FFT-length self-test feature is particularly handy if the binary you are using throws errors for one or more particular FFT lengths, which interrupt the complete self-test before it has a chance to complete the configuration file. In that case, after notifying me (please!) the user must skip the offending FFT length and go on to the next-higher one, and in this fashion build a .cfg file one FFT length at a time. (Note that each such test appends to any existing mlucas.cfg file, so make sure to comment out or delete any older entries for a given FFT length after running any new timing tests, if you plan to do any actual "production" LL testing.)

    Overloading of Physical Cores:

    On some platforms running 2 threads per physical core may offer some performance benefit. It is difficult to predict in advance when this will be the case: For example, on my Intel Haswell quad I get the best performance from running one thread per physical core, but on my dual-core Intel Broadwell NUC, using 4 threads, thus 2 threads per physical core, gives a 5-10% throughput boost over 1-thread per physical core. On AMD Ryzen, I not only see no gain from, but observe a pronounced deterioration in throughput from running more than 1 thread per physical core. On the just-released Google Cloud Skylake Xeon instances (which support the new AVX-512 instruction set), my code gets a huge (nearly 2-fold) throughput boost from using 2 threads per physical core.

    To experiment with this yourself, you can again use a small set of self-tests, though I recommend using an FFT length for these which is reflective of current GIMPS assignments (As I write this, that means 5120K and 5632K FFT length for first-time tests and 2560K and 2816K for double-checks). It is also crucial to understand the CPU vendor's core numbering scheme here: On an Intel n-physical-core system, threads 0 and n map to physical core 0, threads 1 and n+1 map to physical core 1, and so forth through physical core n-1. On AMD, threads 0 and 1 map to physical core 0, threads 2 and 3 map to physical core 1, etc. Thus if you want to gauge whether overloading will help for your GIMPS assignment, e.g. if you are testing at 4096K, try a targeted single-FFT-length set of self-tests at that length (again, after saving any existing mlucas.cfg file under a suitable name to keep it from being appended to by this test):

    Intel n-core: ./Mlucas -fftlen 4096 -iters 1000 -cpu 0,n (Insert your system's value of n, e.g. on a quad-core use -cpu 0,4 - and note the comma-separator here in place of the colon!)

    AMD n-core: ./Mlucas -fftlen 4096 -iters 1000 -cpu 0:1

    Then compare the resulting mlucas.cfg file entry's timing against that for the same FFT length in the mlucas.cfg.1thr file you should have saved previously.


    Advanced Usage: Manycore Systems and Multithreaded Runs:

    Note that the -cpu flag supports logical-core parametrization not only via standalone low:high:stride triplets, but also comma-separated triplets. This allows for a highly flexible affinity-setting schema. Let's say I find on my 32-physical-core system that running four Mlucas instances (labeled w0-w3, where w stands for 'worker'), each using eight index-adjacent physical cores and either 8 threads or 16 threads (in the second case we are thus overloading each physical core with 2 software threads), gives the best total system throughput. Then here are the resulting -cpu arguments for each of our 4 jobs (program instances), for the Intel and AMD logical-core-numbering schemes, in both 1-thread-per-physical-core and 2-thread-per-physical-core modes:

            1 thread per physical core:       2 threads per physical core:
    Worker  Intel         AMD                 Intel               AMD
    w0      -cpu 0:7      -cpu 0:14:2         -cpu 0:7,32:39      -cpu 0:15
    w1      -cpu 8:15     -cpu 16:30:2        -cpu 8:15,40:47     -cpu 16:31
    w2      -cpu 16:23    -cpu 32:46:2        -cpu 16:23,48:55    -cpu 32:47
    w3      -cpu 24:31    -cpu 48:62:2        -cpu 24:31,56:63    -cpu 48:63
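    As a usage sketch (hypothetical paths - this assumes the Mlucas binary sits one level up and run directories run0-run3 each already contain their own mlucas.cfg and worktodo.ini), the 4-worker Intel 1-thread-per-physical-core layout above could be launched via:

    for i in 0 1 2 3; do
        (cd run$i && ../Mlucas -cpu $((8*i)):$((8*i+7)) &)
    done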


    STEP 3 - RESERVE EXPONENTS FROM PRIMENET

    Assuming your self-tests ran successfully, reserve a range of exponents from the GIMPS PrimeNet server.

    By far the easiest way to do this and also submit results as they become available is to use the Python script named primenet.py for automated PrimeNet assignments management - this is to be found in the Mlucas src-directory.

    After you create however many run-subdirectories you want to run jobs from (say, one per physical CPU core of your system) and copy the mlucas.cfg file resulting from the post-build self-tests into each, you also place a copy of src/primenet.py into each rundir (or prepend 'primenet.py' below with the - absolute or relative - path to the src-directory). Then just cd into each rundir in turn and - assuming you have a valid PrimeNet user account with user ID 'uid' and password 'pwd' (see below on how to create one, if not) - run the script like so, after filling in the [] fields with your own preferences and login credentials:

    python primenet.py -d [-T [worktype]] -u [uid] -p [pwd] [-t [frequency]] &

    Here, -d enables some useful debug diagnostics, nice to use on your first usage of the script; in fact I recommend always using this flag, unless you have a very good reason to run the script in 'silent' (no term-output) mode. My own preference is to shut WiFi off on my various Mlucas-running devices when it's not needed, and periodically - typically when I see a device has finished an exponent - enable WiFi and then run the script in one-shot mode, again with debug-info enabled:

    python primenet.py -d [-T [worktype]] -u [uid] -p [pwd] -t 0 &

    As of this writing, the available worktype arguments for the -T flag are as follows:

    Worktype:

    Code  Mnemonic          Description
    101   DoubleCheck       Double-checking LL tests
    150   SmallestAvailPRP  First-time PRP tests (Gerbicz)
    151   DoubleCheckPRP    Double-check PRP tests (Gerbicz)
    152   WorldRecordPRP    World-record-sized numbers to PRP test (Gerbicz)
    153   100MdigitPRP      100M-digit numbers to PRP test (Gerbicz)

    Note that as of 8 Apr 2021, the server no longer hands out first-time LL assignments; any requests for such will be converted to LL double-checking assignments. The change is due to the fact that probable-prime (PRP) testing, which supports both the Gerbicz error-checking mechanism and a quickly checkable single-run certificate of correctness (support for the latter will appear in an Mlucas release later this year), has proven itself superior to LL-testing in terms of generating reliable results. Any M(p) flagged as a probable prime via PRP test will still be verified using multiple LL tests and various independent software clients, since PRP tests are, as the name implies, only "this number is very likely to be prime" in nature when they yield such a result. (They are, on the other hand, rigorous proofs of compositeness for composite M(p).)
    If the -T argument is omitted the script will use the default worktype, which is DoubleCheck (numeric value 101). You must be connected to the internet when you launch the script; once it has done its initial work-fetching you can be offline most of the time. The script will simply periodically check whether there are any new results in the run directory in which it was launched; if yes *and* it is able to connect to the PrimeNet server, it will submit the new results (usually just one, unless you are offline nearly all the time) and fetch new work; otherwise it will sleep and retry later. The default is to check for 'results to submit/work to get?' every 6 hours; you may override this via the -t option, followed by your desired time interval in seconds. '-t 0' means run a single-shot get-work-to-do and quit, if for some reason you prefer to periodically run the script manually yourself.
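    For instance, a hypothetical first-time-PRP setup which checks in at the default 6-hour interval (21600 seconds) would be:

    python primenet.py -d -T 150 -u [uid] -p [pwd] -t 21600 &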

    If the script runs successfully you should see a worktodo.ini file (if none existed already, the script creates it; otherwise it appends new work to the existing version of the file) with at least 2 LL-test assignments in it. The script will also periodically check the results.txt file in each run-subdirectory in which it is invoked. Whenever one or more new results are found and a connection to the internet is active during one of these periodic checks, the results are automatically submitted to the PrimeNet server, and the worktodo.ini file in the run directory is 'topped up' to make sure it has at least 2 valid entries, the first of which will correspond to the currently ongoing job. Thus, the first time you use it, you just need to run the py-script in each local run directory to grab work to be done, then invoke the Mlucas binary to start the Lucas-Lehmer testing.

    Offline Testing:

    Users who wish to eschew this can continue to use the PrimeNet manual testing webforms at mersenne.org as described further down on this page, but folks running multiple copies of the program will find the .py-script greatly simplifies things. See the Get exponents from PrimeNet section for the simple instructions. Here's the procedure (for less-experienced users, I suggest toggling between the PrimeNet links and my explanatory comments):

    The various PrimeNet work assignments supported by Mlucas are of several different forms (master reference available here):

    Examples of typical assignments returned by the server follow, each with an explanation:

    Test=DDD21F2A0B252E499A9F9020E02FE232,48295213,69,0

        M48295213 has not been previously LL-tested (otherwise the assignment would begin with "DoubleCheck=" instead of "Test="). The long hexadecimal string is a unique assignment ID generated by the PrimeNet v5 server as an anti-poaching measure. The ",69" indicates that M48295213 has been trial-factored to depth 2^69 with no factors found. The trailing 0 indicates that p-1 factoring still needs to be done, but Mlucas currently does not support p-1 factoring, so perform a first-time LL test of M48295213. (But see the note below these examples re. doing the p-1 step using one of the GIMPS clients which support p-1, which will find a small factor in ~5% of such cases, thus saving an LL test.)

    DoubleCheck=B83D23BF447184F586470457AD1E03AF,22831811,66,1

        M22831811 has already had a first-time LL test performed, been trial-factored to a depth of 2^66, and has had p-1 factoring attempted with no small factors found, so perform a second LL test of M22831811 in order to validate the result of the initial test. (Or refute it - in case of mismatching residues for the first-time test and the double-check, a triple-check assignment would be generated by the server, whose format would however still read "DoubleCheck".)

    PRP=C57FF1C644A0CB16F5E2B5B3A9FC4E1D,1,2,98024161,-1,77,2

        This is a probable-prime-test assignment, meaning the Gerbicz check will be done at regular intervals throughout. The 4 integers following the hexadecimal assignment ID define the number to be tested as 1*2^98024161-1, i.e. the Mersenne number M98024161. This number has not been previously tested (otherwise the assignment would have 2 additional trailing arguments). The ",77" indicates that M98024161 has been trial-factored to depth 2^77, and the trailing 2 indicates that p-1 factoring has not been tried, i.e. were p-1 to find a small factor it would save 2 full-length primality tests (a first-time one and a double-check one). Mlucas ignores that as a currently-unused field in this assignment type, which means ~5% chance of missing a small factor should the user proceed with the PRP test.

    PRP=C42540C352E54E906108D48FA5D89488,1,2,80340397,-1,75,1,3,1

        This is a PRP-double-check assignment, or PRP-DC for short. M80340397 has been trial-factored to depth 2^75, has had some p-1 testing done but could use a deeper round of p-1 (the 1 following the 75), and has already had a first-time PRP test performed, using base 3 and returning residue type 1.

    If you are using the PrimeNet manual testing pages rather than the primenet.py script, copy the Test=... or DoubleCheck=... lines returned by the server into the worktodo.ini file, which must reside in the same directory as the Mlucas executable and the mlucas.cfg file (or symbolic links to them). If this file does not yet exist, create it. If this file already has some existing entries, append any new ones below them.

    Note that Mlucas makes no distinction between first-time LL tests and double-checks - this distinction is only important to the PrimeNet server.

    Most exponents handed out by the PrimeNet server have already been trial-factored to the recommended depth (i.e. will be of the 'Test' or 'DoubleCheck' assignment type), so in most cases no additional trial-factoring effort is necessary. If you have exponents that require additional trial factoring, you'll want to either return those assignments or, if you have a fast GPU installed on your system, download the appropriate GPU client from the GPU72 project to do the trial factoring, as those platforms are now much more efficient for such work than using Prime95's TF option on a PC. Mlucas does have trial-factoring capability, but that functionality requires significant added work to make it suitable for general-public use, thus it is not part of the current executable build. I plan to address that in a future release, depending on how that part of the code shapes up.

    If p-1 has not been done (a Test or DoubleCheck assignment line ends with ',0'; a PRP or PRP-DC assignment line has a 1 or 2 following the trial-factored-to bit depth), the user is advised to farm out the p-1 step to one of the GIMPS clients which support that test type (Prime95/mprime, gpuOwl, CUDAPm1). For example, in my own setup I have an Intel quadcore ATX-case deskside box and a bunch of ARM-based Android "broke-o-phones" running a mix of LL and PRP tests using Mlucas. The ATX box also hosts an AMD Radeon VII GPU doing PRP tests using Mihai Preda's gpuOwl code. (In fact the Radeon VII gpuOwl runs give ~20x the throughput of the CPU ones - I would probably idle the CPU were it not important for purposes of 24/7 QA testing of my Mlucas code.) I use gpuOwl (whose most recent version does not support LL-testing, only PRP, but that is moot as far as the preceding round of p-1 testing goes) to do all the needed p-1, then farm out the surviving candidates to the Mlucas-running devices.

    To create a (p-1)-test-only assignment from the above kinds of primality-test assignments, cast the assignment into the same form as a PRP one and replace 'PRP' with 'Pfactor'. Of the above 4 example assignments, the first and the last two have not had (sufficient) p-1 done, and the resulting p-1 assignments would be

    Pfactor=DDD21F2A0B252E499A9F9020E02FE232,1,2,48295213,-1,69,2
    Pfactor=C57FF1C644A0CB16F5E2B5B3A9FC4E1D,1,2,98024161,-1,77,2
    Pfactor=C42540C352E54E906108D48FA5D89488,1,2,80340397,-1,75,1
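    If you have many such lines to convert, a few lines of Python can automate the reformatting. The following is just an illustrative sketch based on the field layouts described above, not part of the Mlucas distribution; in particular, the 'tests saved' value of 1 for a DoubleCheck-derived assignment is my assumption (only the double-check run remains to be saved):

        # Sketch: convert Test=/DoubleCheck=/PRP= worktodo lines to Pfactor= ones,
        # per the assignment-field layouts described in the examples above.
        def to_pfactor(line):
            worktype, rest = line.strip().split("=", 1)
            f = rest.split(",")
            if worktype in ("Test", "DoubleCheck"):
                # Test=<aid>,<exponent>,<TF bit depth>,<p-1-done flag>
                # 'Tests saved': 2 if no primality test done yet, else 1 (assumed)
                saved = "2" if worktype == "Test" else "1"
                return "Pfactor=%s,1,2,%s,-1,%s,%s" % (f[0], f[1], f[2], saved)
            if worktype == "PRP":
                # PRP=<aid>,1,2,<exponent>,-1,<TF bit depth>,<tests saved>[,...]
                return "Pfactor=" + ",".join(f[:7])
            raise ValueError("unrecognized assignment type: " + worktype)

        print(to_pfactor("Test=DDD21F2A0B252E499A9F9020E02FE232,48295213,69,0"))
        # -> Pfactor=DDD21F2A0B252E499A9F9020E02FE232,1,2,48295213,-1,69,2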

    Lastly, if you wish to test some non-server-assigned prime exponent, you can simply enter the raw exponent on a line by itself in the worktodo.ini file.


    STEP 4 - PRIMALITY TEST

    Your run setup depends on how many instances of the code you will be running - as with the build-self-test section above, I will gear things toward a typical GIMPS user who wishes to maximize overall system throughput, even if that means the individual instances run more slowly than they would if each were given software threads on multiple physical cores.

    Intel quad-core:

    You'll want to create 4 run directories (say run0, run1, run2, run3) - I usually do this inside the dir where the Mlucas exe and the mlucas.cfg file built from it reside, so my examples assume this. Within each you'll want to soft-link to the master cfg file and create a worktodo.ini file - you can do the latter by running the primenet.py script, if you like. I'm going to assume your dir structure is similar to mine, and that you are working from within the obj-dir which contains the Mlucas binary - customize to suit if you use a different one. Then, e.g. for run0:

    mkdir run0 && cd run0 && ln -s ../mlucas.cfg
    python primenet.py -d -T 100 -u [uid] -p [pwd]

    If your Intel CPU is hyperthreaded - i.e. 2 threads per physical core give a boost on your system - then run 'nice ../Mlucas -cpu 0,4 &'; if non-hyperthreaded, use 'nice ../Mlucas -cpu 0 &'. Then do similarly to set up dirs run1, run2, run3, and use these run commands from within each in turn:

    run1: nice ../Mlucas -cpu 1,5 &
    run2: nice ../Mlucas -cpu 2,6 &
    run3: nice ../Mlucas -cpu 3,7 &

    If non-hyperthreaded Intel, just use -cpu 1, -cpu 2 and -cpu 3, respectively.
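    If you'd rather create all 4 run directories and their cfg-file links in one go, a shell one-liner along the lines of the per-directory commands above does the trick (adjust paths to suit; you'll still run the primenet.py script in each directory to fetch work):

    for i in 0 1 2 3; do mkdir run$i && ln -s ../mlucas.cfg run$i/; done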

    AMD multi-core:

    On AMD you want just one single-threaded job per physical core, but the core affinities set by the -cpu flag depend on whether your AMD CPU supports hyperthreading or not. For a hyperthreaded CPU, your runs from directories run0,1,2,... should use -cpu 0,2,4,..., i.e. each job sets the core-index of the affinity setting 2 higher than the preceding one. For a non-hyperthreaded AMD, the indices increment by one: runs from directories run0,1,2,... should use -cpu 0,1,2,... .
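    E.g. for a hyperthreaded quad-core AMD CPU, the run commands from within run0 through run3 would be:

    run0: nice ../Mlucas -cpu 0 &
    run1: nice ../Mlucas -cpu 2 &
    run2: nice ../Mlucas -cpu 4 &
    run3: nice ../Mlucas -cpu 6 &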

    ARM v8 or above:

    You want one Mlucas job for each quad-core socket of your system. For example on a 2-socket octocore system you set up run directories run0 and run1, then from within each in turn:

    run0: nice ../Mlucas -cpu 0:3 &
    run1: nice ../Mlucas -cpu 4:7 &

    If your system is a hybrid big.LITTLE 2-socket with a mix of per-socket core counts, you'll want to first consult the /proc/cpuinfo file to check the core-to-socket numbering scheme, then fiddle the -cpu arguments above suitably. For example, on a prototype Odroid N1 system I did a build on, cores 0-3 belonged to the 'little' Cortex-A53 CPU and cores 4-5 to the 'big' Cortex-A72 CPU. On such systems one almost always wants a different Mlucas job running on each CPU, so on the N1 I simply modified the core assignments for the second run above to '4:5'. On some systems - multicore smartphones are a common example of this - one may have an even wider variety of configurations: 3-socket, one 'main' processor whose cores are partially reserved for system jobs (e.g. only 2 of 4 physical cores of the CPU appear in /proc/cpuinfo), etc. Users interested in running Mlucas on such systems should first have a read-through of the Mersenneforum CellPhone Compute Cluster for GIMPS thread.
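    For example, on ARM Linux one quick way to eyeball the core layout is to list the per-core 'CPU part' identifiers, which differ between the big and little cores:

    grep -E 'processor|CPU part' /proc/cpuinfo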


    Windows (pre-Win10) Users: Since the emulated POSIX build setup does not support multithreaded builds, you will simply need to start as many single-threaded jobs as there are physical cores and, instead of setting affinity via the -cpu flag, rely on the operating system to manage job/core affinity. Windows Task Manager and timings (compared to your self-test ones) will give you a good idea as to whether the OS is up to the task.


    The program will run silently in the background, leaving you free to do other things or to log out. Every 10000 iterations (or every 100000 if more than 4 threads are used), the program writes a timing to the "p{exponent}.stat" file (which is automatically created for each exponent), and writes the current residue and all other data it needs to pick up at this point (in case of a crash or powerdown) to a pair of restart files, named "p{exponent}" and "q{exponent}". (The second is a backup, in the rare event the first is corrupt.) When the exponent finishes, the program writes the least-significant 64 bits of the final residue (in hexadecimal form, just like Prime95) to the .stat file and to results.txt (the master output file). Any round-off or FFT convolution-error warnings are written to both the .stat and output files as they are detected, thus preserving a record of them once the Lucas-Lehmer test of the current exponent is completed.

    Dec 2014: The program also saves a persistent p-savefile every 10M iterations, with extensions .10M, .20M, ..., reflecting which iteration the file contains restart data for. This allows for a partial rerun - even as parallel 10M-iteration subinterval reruns, if desired - in case the final result proves suspect.

    ADDING NEW EXPONENTS TO THE WORKTODO.INI FILE: You may add or modify ALL BUT THE FIRST EXPONENT (i.e. the current one) in the worktodo.ini file while the program is running. When the current exponent finishes, the program opens the file, deletes the first entry and, if there is another exponent on what was line 2 (and now is line 1), starts work on that one.


    STEP 5 - SEND YOUR RESULTS TO PRIMENET

    For users who prefer not to use the automated Python assignments-management script: to report results (either after finishing a range, or as they come in), log in to your PrimeNet account and then proceed to the Manual Test Results Check In page. Paste the results you wish to report - that is, one or more lines of the results.txt file (any results which were added to that file since your last check-in) - into the large entry window there. Note that as of v19 the results-line format has changed to JSON (JavaScript Object Notation), so here are sample results-entry formats, depending on version:

    v18 and earlier:
    M86748829 is not prime. Res64: F28B3E531E99C315. Program: E18.0. Final residue shift count = 14725081

    v19 writes a human-readable line similar to the above to the p[exponent].stat file, but the output to the results.txt file is in JSON format, e.g.:
    {"status":"C", "exponent":81253819, "worktype":"PRP-3", "res64":"CE9AB357C704369B", "residue-type":1, "fft-length":4718592, "shift-count":66554884, "error-code":"00000000", "program":{"name":"Mlucas", "version":"19.0"}, "timestamp":"2019-12-01 03:42:39 UTC", "aid":"0CF485CD87F1BAC48B6B05D3A2094579"}

    If for some reason you need more time than the 180-day default to complete a particular assignment, go to the Manual Test Time Extension page, check the about-to-expire exponents, then go to the bottom of the page and click on "Extend checked exponents 60 days".


    TRACKING YOUR CONTRIBUTION

    You can track your overall progress (for both automated and manual testing work) at the PrimeNet server's producer page. Note that this does not include pre-v5-server manual test results. (That includes most of my GIMPS work, in case you were feeling personally slighted ;).


    ALGORITHMIC Q & A