mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Computer Science & Computational Number Theory (https://www.mersenneforum.org/forumdisplay.php?f=116)
-   -   Pépin tests of Fermat numbers beyond F24 (https://www.mersenneforum.org/showthread.php?t=18748)

ewmayer 2013-10-24 00:07

Pépin tests of Fermat numbers beyond F24
 
Placeholder thread to summarize what has been done in this area and collect new results as they complete.

General math/algo background:
As is well known, Fermat numbers admit both a very efficient rigorous primality test and a fast discrete "implicit mod" convolution method for effecting it, analogously to the Mersennes. I'm running the Pépin test for F[sub]28[/sub] as I type this on my quad-core [strike]Sandy Bridge[/strike]Haswell box. This number (in fact all of F25-32) is already known to be composite by way of a small factor, but the Pépin residue is still useful for fast cofactor PRP testing, and the Fermats make a good code-development milestone for my in-development AVX and manycore codes, since the Fermat-mod discrete convolution has a somewhat simpler structure than does the Mersenne-mod. (Technically, it's a complex "right-angle" transform with a hybrid of acyclic weighting for the plus-sign mod and a Mersenne-style irrational-base DWT to allow for use of non-power-of-2 transform lengths - my F[sub]28[/sub] run is using an FFT length of 15*2[sup]20[/sup] floating doubles.)
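For definiteness, the criterion being computed is simply: F[sub]m[/sub] = 2^2^m + 1 is prime if and only if 3^((F[sub]m[/sub]-1)/2) == -1 (mod F[sub]m[/sub]), i.e. 2^m - 1 modular squarings of 3. A minimal GMP-based sketch of that powering (throwaway illustration only, usable for tiny m - the whole point of the FFT-based code is to make those squarings feasible at the sizes discussed here):

[code]#include <stdio.h>
#include <gmp.h>

/* Pepin test: F_m = 2^(2^m) + 1 is prime iff 3^((F_m-1)/2) == -1 (mod F_m).
   Only practical for small m; the runs in this thread perform the same
   squaring chain with an FFT-based Fermat-mod multiply instead of mpz_powm. */
int pepin_is_prime(unsigned long m)
{
    mpz_t Fm, base, e, r;
    int is_prime;
    mpz_init(Fm); mpz_init(base); mpz_init(e); mpz_init(r);
    mpz_ui_pow_ui(Fm, 2, 1UL << m);  /* 2^(2^m) */
    mpz_add_ui(Fm, Fm, 1);           /* F_m */
    mpz_sub_ui(e, Fm, 1);
    mpz_tdiv_q_2exp(e, e, 1);        /* (F_m - 1)/2 = 2^(2^m - 1) */
    mpz_set_ui(base, 3);
    mpz_powm(r, base, e, Fm);        /* 3^((F_m-1)/2) mod F_m */
    mpz_sub_ui(Fm, Fm, 1);           /* -1 mod F_m, i.e. F_m - 1 */
    is_prime = (mpz_cmp(r, Fm) == 0);
    mpz_clear(Fm); mpz_clear(base); mpz_clear(e); mpz_clear(r);
    return is_prime;
}

int main(void)
{
    for (unsigned long m = 1; m <= 5; m++)   /* F_1..F_4 prime, F_5 composite */
        printf("F%lu: %s\n", m, pepin_is_prime(m) ? "prime" : "composite");
    return 0;
}[/code]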
[i]
[Text below is only-slightly-edited version of an e-mail I sent to Wilfrid Keller, Richard Crandall and Jason Papadopoulos on 13 November 2012]
[/i]
Available [url=http://hogranch.com/mayer/files.tgz]here[/url] in .tgz format (use 'tar xzf files.tgz' to unpack) are summary run-status files for my successful (in the sense that 2 runs at different FFT lengths yield matching residues) Pépin-test compositeness results for F24-26. (F24 was re-run to test the hybrid DWT code described above). I believe F25 and perhaps F26 have already been Pépin-tested by others (likely using some version of George's x86 assembler FFT), in which event those residues can be compared to mine.

These were run using a C/assembler hybrid of my own Fermat-mod convolution code, using SSE2 on the Intel x86 for the heavy computational work. This implements both the traditional power-of-2 FFT, which uses a negacyclic weighting to effect Fermat-mod multiply arithmetic, and the non-power-of-2 FFT alluded to in the Crandall/Mayer/Papadopoulos F24 paper, which combines a negacyclic weighting and a Mersenne-mod-style dual-base DWT. The machinery was very modest sitting-on-my-desk stuff: In each case the smaller non-power-of-2-FFT-length run was on 1 core of a Core2 dual-core Lenovo notebook (vintage 2008), built under Win32 using MS Visual Studio. This machine's processor is rated at 1.67 GHz, but due to a quirk of the power-management system, ever since the battery died 3 years ago, the system only runs at 2/3 speed, that is ~1.1 GHz. (I chose not to buy a new one, as the purchase of the MacBook for my 64-bit builds allowed me to convert the Lenovo to stay-at-home plugin status). Since F24 had already been proven composite via matching runs at FFT length 1024k (k meaning 1024 floating doubles), I include only the files for the re-run at 896k (human-readable checkpoint-status file: f24_896k.stat) on the above Lenovo notebook using the hybrid DWT code, said test performed in October of 2011. F25 was tested @1792k on the same machine from late October 2011 to January 2012 (checkpoint file: f25_1792k.stat); F26 ran from January to October of this year (checkpoint file: f26_3584k.stat).

The larger power-of-2 runlengths for F25 and F26 were done on 1 core of a 2 GHz Core2 Duo Apple MacBook, built under 64-bit OS X using GCC 4.2. The 64-bit builds include versions of various key assembly macros which take advantage of the doubled SSE register set (from 8 to 16 128-bit SSE registers) available in 64-bit mode; this yields ~10% added per-cycle throughput, on top of that afforded by the higher clock rate, larger caches and faster system memory of the newer machine. On the other hand, the MacBook is my personal-use mobile machine, so does not enjoy quite the 24x7 uptime of the Lenovo. That is reflected in the frequent restarts-from-interrupt and timing slowdowns visible in the respective checkpoint-status files, f25_2048k.stat [run from 02 Dec - 27 Jan 2012] and f26_4096k.stat [27 Jan - 08 Sep 2012; in there you can see the timing boost I got between 18-24 Jul, when my macbook's OEM fan finally got terminally dust-clogged/worn-out and I replaced it].

F27 and beyond will run in multithreaded mode on the above and faster/more-core hardware; it was only very recently that I achieved a first working multithreaded version of the SSE2-based convolution code. I am currently playing with various thread-management schemes (including a nice threadpool API Jason kindly has shared, which he adapted from some open-source code he found) to see what works best in the sense of efficiency and parallel-scalability.

I also include a (single) copy of the final bytewise residue files for all of the above exponents; in each case I first compared the full Pépin-test residues for same-exponent runs at different runlengths via diff of the respective final bytewise residue files.

The format for these residue files is slightly different than that described in the F24 paper, namely:

o 1 byte encoding an internal test type (other users of this file should ignore this);
o 1 byte encoding an internal modulus type (other users of this file should ignore this);
o 8 bytes encoding "number of mod squarings performed" in little-endian byte order (bytes stored in ascending order of significance);
o For modulus N, exactly ceiling(log2(N)/8) residue bytes, ordered from low to high, with residue R nonnegative-normalized, i.e. 0 <= R < N. For the Fermat number F_m = 2^2^m + 1, that translates to 2^(m-3) + 1 bytes;
o 8 bytes encoding the least-significant 64 bits of R (R modulo 2^64) in little-endian byte order;
o 5 bytes encoding R modulo 2^35 - 1 in little-endian byte order;
o 5 bytes encoding R modulo 2^36 - 1 in little-endian byte order;
o 1 byte for the C end-of-file (EOF) character.

Thus, e.g. for F29, such a savefile will be precisely 2^26 + 29 = 67108893 bytes in size.

The small C function attached below may be easily adapted to read the files, if desired. Anyone making use of the residues (e.g. for cofactor PRP testing) should validate their residue-reading code by directly computing the 3 Selfridge-Hurwitz residues from the bytewise residue and comparing those values to the S-H residues which are stored in the penultimate 18 bytes of the savefile. (And both of those sets should match the final residues printed in the respective run's f**.stat file).
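For anyone rolling their own reader, here is a hedged, self-contained sketch of that cross-check (my helper naming, not an Mlucas routine): it recomputes the three Selfridge-Hurwitz residues directly from the bytewise residue array via a byte-at-a-time Horner reduction.

[code]#include <stdint.h>
#include <stddef.h>

/* Compute the 3 Selfridge-Hurwitz residues directly from the bytewise residue R
   (stored low byte first; nbytes = 2^(m-3) + 1 for F_m), for comparison against
   the values stored in the savefile and printed in the .stat files. */
void selfridge_hurwitz(const uint8_t *r, size_t nbytes,
                       uint64_t *res64, uint64_t *res35m1, uint64_t *res36m1)
{
    const uint64_t m35 = (1ULL << 35) - 1, m36 = (1ULL << 36) - 1;
    uint64_t a = 0, b = 0, lo = 0;
    size_t i;
    /* R mod 2^64: just the 8 low-order bytes */
    for (i = 0; i < 8 && i < nbytes; i++)
        lo |= (uint64_t)r[i] << (8*i);
    /* R mod (2^35 - 1) and (2^36 - 1): Horner from the high byte down;
       intermediates stay below 2^44, so plain 64-bit arithmetic suffices. */
    for (i = nbytes; i-- > 0; ) {
        a = ((a << 8) + r[i]) % m35;
        b = ((b << 8) + r[i]) % m36;
    }
    *res64 = lo; *res35m1 = a; *res36m1 = b;
}[/code]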

Cheers
-Ernst

===============================

[b]Edit (13 Nov 2017):[/b] Here are links to tbz2'ed (use 'tar xjf *tbz2' to unpack) status-and-final-residue files for F24-F29. F24 has just the 896K-FFT .stat file, since that result matched the original 1M-FFT one published in the Crandall/Mayer/Papadopoulos [i]Math Comp[/i] paper. Each subsequent archive has a pair of such status files showing dates and interim Res64s for the Pépin tests at the respective FFT lengths used. F29 has a trio of such .stat files, one from the original (and fatally-hosed-by-undetected-hardware-error-between-the-25.63Miter-and-25.64Miter-checkpoints) 30M-FFT run on my haswell quad which I've given a '.orig' extension, and the subsequent 2 from the side-by-side runs at 30M and 32M FFT which I ran to iter ~230M on the Haswell while doing near-daily Res64 cross-checks, then moved to a pair of faster systems (32-core Xeon/avx2 and 64-core KNL/avx512) to finish, as described in my 13 Nov 2017 new post to this thread:

[url=http://www.mersenneforum.org/mayer/F24_FILES.tbz2]F24_FILES.tbz2[/url] (2MB, MD5 = 06025e5d6ca3a746b5c0be1b3fd12465)
[url=http://www.mersenneforum.org/mayer/F25_FILES.tbz2]F25_FILES.tbz2[/url] (4MB, MD5 = 6db5632defdb06e1d7fde47322102544)
[url=http://www.mersenneforum.org/mayer/F26_FILES.tbz2]F26_FILES.tbz2[/url] (8MB, MD5 = 83bc460c7eb29b69364e038c53e72710)
[url=http://www.mersenneforum.org/mayer/F27_FILES.tbz2]F27_FILES.tbz2[/url] (169MB, MD5 = b863de5ff203c82fa8b1611de1c5c627)
[url=http://www.mersenneforum.org/mayer/F28_FILES.tbz2]F28_FILES.tbz2[/url] (742MB, MD5 = 2df7884c7c0ddd92f722d31f4b7392b7)

The F27 and F28 archives include a set of irregularly-spaced roughly-every-1Miter interim savefiles in addition to the final Pépin-residue one.

[b]Edit (04 May 2022):[/b] For F29 and F30 I have every-10Miter savefiles but only uploaded the every-100Miter and final-residue ones due to server space constraints, and did not tar/zip the respective archives, since the residue files are effectively random-bitstrings and thus incompressible, and the resulting archive size quadruples with each increment in Fermat-number index. One can download just one or any desired subset of same using ftp or wget; here are the file names, sizes and, for the residue files, the MD5 hashes:
[code]F29_FILES:
Filename           Size(bytes)  Date         MD5
f29.100M           67108893     Sep 17 2017  0f7872a6de61f14da09c28aeee87cc9e
f29.200M           67108893     Sep 17 2017  4ad9f30c19d98026258cb148245ab9fc
f29.300M           67108893     Oct 31 2017  f2165c49b34c311379bb60829676500f
f29.400M           67108893     Oct 31 2017  6485f40256692e6c11d899eb0f785bf3
f29.500M           67108893     Oct 31 2017  d68cae02a296acce6e448cb500cd68bb
f29                67108893     Nov 13 2017  f7dfc69e905543b31a6591002e52e228
f29_30M.stat       4945113      Nov 12 2017
f29_30M.stat.orig  8329337      Sep 17 2017
f29_32M.stat       4517584      Nov 12 2017

F30_FILES:
Filename           Size(bytes)  Date         MD5
f30.100M           134217757    Apr 17 2019  b4bda3a28af98776c40aa3c9791467d5
f30.200M           134217757    Apr 17 2019  dff26f37fe35c23b9d45f7de115320a5
f30.300M           134217757    Apr 17 2019  2a81d511d44ac49433dad5ccf5863ef1
f30.400M           134217757    Apr 17 2019  2044d695f0d98f5573c3fe325e3ff38b
f30.500M           134217757    Apr 17 2019  8015a7f0112ecb355868a9428c4dba2e
f30.600M           134217757    Apr 17 2019  5a3bc91a80f02d640c4fb244b8f284f7
f30.700M           134217757    Jan 18 2020  05f797b977ca07d9c4af4a42b3100ec5
f30.800M           134217757    Jan 18 2020  65e2cb73af7c0c7ce3b89d3d486d8c9b
f30.900M           134217757    Jan 18 2020  d5bf62f54f2db1519638cd4e2889f20c
f30.1000M          134217757    Jan 18 2020  e6534f8ac299ff6cec89c4d2d1e5966e
f30                134217757    Jan 18 2020  674218092a73b72d0bb0e5a4ca9e3326
f30_60M.stat       1618448      Nov 27 2020
f30_64M.stat       7156472      Jul  1 2022[/code]
===============================

[code]/*** READ: Assumes a valid file pointer has been gotten via a call of the form
     fp = fopen(RESTARTFILE,"rb");
***/
int read_ppm1_savefiles(uint64 p, FILE*fp, uint32*ilo, uint8 arr_tmp[], uint64*Res64, uint64*Res35m1, uint64*Res36m1)
{
    uint32 i;
    uint32 nbytes;
    uint64 nsquares = 0;

    ASSERT(HERE, !(p >> 32), "read_ppm1_savefiles: p must be 32-bit or less!"); /* Future versions will need to loosen this p < 2^32 restriction: */

    if(!file_valid(fp))
    {
        sprintf(cbuf, "read_ppm1_savefiles: File pointer invalid for read!\n");
        return FALSE;
    }
    fprintf(stderr, " INFO: restart file %s found...reading...\n",RESTARTFILE);
    /* t: */
    if((i = fgetc(fp)) != TEST_TYPE)
    {
        sprintf(cbuf, "read_ppm1_savefiles: TEST_TYPE != fgetc(fp)\n");
        return FALSE;
    }
    /* m: */
    if((i = fgetc(fp)) != MODULUS_TYPE)
    {
        sprintf(cbuf, "ERROR: read_ppm1_savefiles: MODULUS_TYPE != fgetc(fp)\n");
        return FALSE;
    }
    /* s: */
    for(nbytes = 0; nbytes < 8; nbytes++)
    {
        i = fgetc(fp);
        nsquares += (uint64)i << (8*nbytes);
    }
    /* For now, just make sure nsquares < 2^32 and copy to ilo: */
    if(nsquares >= p)
    {
        sprintf(cbuf,"read_ppm1_savefiles: nsquares = %llu out of range, should be < p = %llu\n", nsquares, p);
        // return FALSE;
    }
    *ilo = nsquares;

    /* r: */
    /* Set the expected number of residue bytes, depending on the modulus: */
    if(MODULUS_TYPE == MODULUS_TYPE_MERSENNE)
    {
        nbytes = (p + 7)/8;
        TRANSFORM_TYPE = REAL_WRAPPER;
    }
    else if(MODULUS_TYPE == MODULUS_TYPE_FERMAT)
    {
        ASSERT(HERE, p % 8 == 0,"read_ppm1_savefiles: p % 8 == 0");
        nbytes = (p/8) + 1;
        TRANSFORM_TYPE = RIGHT_ANGLE;
    }

    i = fread(arr_tmp, sizeof(char), nbytes, fp); /* Read bytewise residue... */
    if(i != nbytes) { sprintf(cbuf, "read_ppm1_savefiles: Error reading bytewise residue array.\n") ; return FALSE; }
    if(ferror(fp)) { sprintf(cbuf, "read_ppm1_savefiles: Unknown Error reading bytewise residue array.\n") ; return FALSE; }
    if(feof(fp)) { sprintf(cbuf, "read_ppm1_savefiles: End-of-file encountered while attempting to read bytewise residue array.\n") ; return FALSE; }

    /* 8 bytes for Res64: */
    *Res64 = 0;
    for(nbytes = 0; nbytes < 8; nbytes++)
    {
        i = fgetc(fp);
        *Res64 += (uint64)i << (8*nbytes);
    }
    /* 5 bytes for Res35m1: */
    *Res35m1 = 0;
    for(nbytes = 0; nbytes < 5; nbytes++)
    {
        i = fgetc(fp);
        *Res35m1 += (uint64)i << (8*nbytes);
    }
    ASSERT(HERE, *Res35m1 <= 0x00000007FFFFFFFFull,"read_ppm1_savefiles: *Res35m1 <= 0x00000007ffffffff");
    /* 5 bytes for Res36m1: */
    *Res36m1 = 0;
    for(nbytes = 0; nbytes < 5; nbytes++)
    {
        i = fgetc(fp);
        *Res36m1 += (uint64)i << (8*nbytes);
    }
    ASSERT(HERE, *Res36m1 <= 0x0000000FFFFFFFFFull,"read_ppm1_savefiles: *Res36m1 <= 0x0000000fffffffff");
    /* Don't deallocate arr_tmp here, since we'll need it later for savefile writes. */
    return TRUE;
}[/code]

ewmayer 2013-10-24 00:43

1 Attachment(s)
F27 has one complete run, at FFT length 7*2[sup]20[/sup] doubles, from 7 Dec 2012 - 11 Mar 2013 on all 4 cores of my quad-core Sandy Bridge box, using the aforementioned 64-bit GCC-built multithreaded SSE2 code. The checkpoint-summary-status file (attached) shows a number of roundoff warnings, reflecting the fact that while FFT lengths of the form 7*2[sup]m-7[/sup] are perfectly adequate for F[sub]m[/sub] with m <= 26, F[sub]27[/sub] is the largest Fermat number testable using such a length and a double-float-based FFT.
[i]
[Mar 11 22:59:49] F27 Iter# = 134210000 clocks = 00:00:00.000 [ 0.0593 sec/iter] Res64: C1A1546143F5D665. AvgMaxErr = 0.257079724. MaxErr = 0.343750000
[Mar 11 23:07:28] F27 Iter# = 134217727 clocks = 00:00:00.000 [ 0.0593 sec/iter] Res64: 043A6C8B9272B297. AvgMaxErr = 0.198636938. MaxErr = 0.343750000
F27 is not prime. Res64: 043A6C8B9272B297. Program: E3.0x
R27 mod 2^36 = 49701630615
R27 mod 2^35 - 1 = 30710305624
R27 mod 2^36 - 1 = 39732934618
[/i]
The second run of F[sub]27[/sub], at a mix of FFT lengths 2[sup]23[/sup] and 15*2[sup]19[/sup], actually started earlier than the above one (08 Sep 2012), but has been crawling along on my now-quite-aged Core2-based macbook. That has reached iteration ~85m (with all interim residues matching those of the above run), but with the cooling fan again beginning to sound stressed, I have suspended it and will complete it on my Haswell quad once that completes the first run of F[sub]28[/sub] next month.

MattcAnderson 2013-10-25 17:42

Hi all,

Thank you ewmayer for your thorough analysis of Fermat numbers. I appreciate your post.

Regards,
Matt A.

ewmayer 2013-11-11 21:24

1 Attachment(s)
First run of F[sub]28[/sub], at FFT length 15360k = 15*2[sup]20[/sup] doubles, finished yesterday. The timings in the attached checkpoint-summary-status file reflect the first part of the run being done on my old 4-core Sandy Bridge system using the SSE2 Mlucas code, then switching to AVX once that code became available (my AVX port was driven by the desire to get this run online ASAP; other FFT radices and functionality like Mersenne-mod convolution came after that), and finally switching to my new Haswell quad in June.

The summaries for the last few checkpoints and the final Selfridge-Hurwitz residues are
[i]
[Nov 10 16:57:24] F28 Iter# = 268430000 clocks = 00:00:00.000 [ 0.0638 sec/iter] Res64: 360A9E1308A3165A. AvgMaxErr = 0.145900781. MaxErr = 0.218750000
[Nov 10 17:03:13] F28 Iter# = 268435455 clocks = 00:00:00.000 [ 0.0640 sec/iter] Res64: 468731C3E6F13E8F. AvgMaxErr = 0.079615625. MaxErr = 0.203125000
F28 is not prime. Res64: 468731C3E6F13E8F. Program: E3.0x
F28 mod 2^36 = 16759471759
F28 mod 2^35 - 1 = 30476727665
F28 mod 2^36 - 1 = 9104636564
Sun Nov 10 17:22:00 PST 2013
[/i]
Before beginning the DC run of F[sub]28[/sub] I will spend a few weeks to complete the above-mentioned DC run of F[sub]27[/sub] which I had been doing on my Core2-based macbook.

ewmayer 2013-11-27 00:17

1 Attachment(s)
Second run of F27, at a mix of FFT lengths 8192k and 7680k - I briefly used the larger length early on while working through some multithreaded performance issues - finished last night. Both this and the earlier run @7168k (which was on the ragged edge of roundoff-error-ness) yield matching results.

If you compare the latter parts of the above 7168k-run statfile to the one attached below, you see a more than 2x throughput boost between the 2 respective quadcore systems: Sandy Bridge + DDR-1333 versus Haswell + DDR-2400. (My code benefits far more from the larger caches and such of Haswell than does George's).

2nd run of F28, this one at 16384k, is now underway.

philmoore 2013-11-27 02:20

[QUOTE=ewmayer;360379]2nd run of F28, this one at 16384k, is now underway.[/QUOTE]

Ernst, have you performed the Suyama test of the cofactor of F28, also of F27 ? That not only tells you whether the cofactor is composite, but whether it is a prime power. I don't recall that anyone has yet reported that the cofactor of F28 is composite.

ewmayer 2013-11-27 03:26

[QUOTE=philmoore;360389]Ernst, have you performed the Suyama test of the cofactor of F28, also of F27 ? That not only tells you whether the cofactor is composite, but whether it is a prime power. I don't recall that anyone has yet reported that the cofactor of F28 is composite.[/QUOTE]

No, not yet - I am currently just accumulating the Pépin-test residues, will code up the Suyama post-test sometime next year. (I had a working version of this in my very-old Fortran90 code, but never implemented it in C/assembler.)

philmoore 2013-11-27 04:15

[QUOTE=ewmayer;360391]No, not yet - I am currently just accumulating the Pépin-test residues, will code up the Suyama post-test sometime next year. (I had a working version of this in my very-old Fortran90 code, but never implemented it in C/assembler.)[/QUOTE]

The GCD step is the most time-consuming part of the test, but perhaps even the GMP version could do these in a reasonable time now (?) (Not that I have ever had the need to test it!) And we could also test GMP against George's code for verification.

ewmayer 2013-11-27 04:38

[QUOTE=philmoore;360397]The GCD step is the most time-consuming part of the test, but perhaps even the GMP version could do these in a reasonable time now (?) (Not that I have ever had the need to test it!) And we could also test GMP against George's code for verification.[/QUOTE]

Isn't a multiword remainder algo more efficient than a GCD here?

ewmayer 2013-12-15 21:48

Latest code enhancements drop the per-iteration time of my ongoing DC run of F28 @16384K from a (disappointing) 68 ms to 58 ms. The reason I say 68 ms was disappointing is that based on my 25 ms timing for the F27 DC run @7680K I expected a time of ~55ms for the slightly-more-than-doubling of the FFT length.

The code change yielding the improved timing was to implement a larger leading-pass FFT radix (call it r0) of (complex-data) 128. Until yesterday the largest r0 I had was 64. Now one of the key performance-related trends I have observed in playing with the multithreaded SIMD code over the past year is that it really likes larger r0 values. I believe this is a result of r0 determining the per-thread dataset size, as (total main-array size)/r0. An especially critical threshold here seems to be 1 MB, which is interesting because it is not obviously related to any of the cache sizes - it is several times (more than 2x) smaller than L2, and many times larger than L1. Nonetheless, the threshold effect is striking, and persists across multiple generations of the Intel x86, at least from Core2 onward.

Anyhoo, for F28 @16384K, (total main-array size) = 128 MB, thus r0 = 64 yields per-thread dataset size of 2 MB; r0 = 128 gets us back to 1 MB.

Time to see if another doubling of r0 gives another boost - implementing these is less trivial than it sounds because:

o My code uses a fused final-iFFT-pass/carry/initial-fFFT-pass strategy, thus the DIF and DIT DFTs at any new radix r0 need to be combined with the carry macros, and (for r0 a power of 2 or nearly so) these need to handle the different carry propagation schemes used for (real-data) Mersenne-mod and (complex-data, "right angle" twisted transform) Fermat-mod;

o For each new r0 I need to implement 3 distinct code paths: Scalar C, SSE2 and AVX. These share much of the "ambient" C code, but the DFT macros are custom inline-asm for the SIMD paths.

For now (since F28 at the above FFT length is the focus) I can just stick to r0 a power of 2, but eventually I need to go back and "fill in" the various non-power-of-2 radices as well, for instance between r0 = 64 and 128 I also will need to support r0 = 72,80,88,96,104,112 and 120. Each such radix needs 2-3 weeks of work to code ... much of my 2014 will be consumed this way. One bonus, though, is that larger r0 is also well-geared for manycore, since r0 sets the maximum number of threads my code can use.

ewmayer 2014-03-14 06:18

Nice little timing breakthrough recently - short title is "compact-object-code schemes for large-radix DFTs". Briefly, if one is dealing with a blob of code - mainly "hot sections" such as loop bodies here - which exceeds the L1 I-cache size, instructions must be fetched from L2, i.e. compete with data fetches for L2-to-L1 bandwidth. In my case the main culprits are the routines which wrap the carry step, and are laid out with a code-intensive loop body as follows:

1) Do radix-R final-iFFT pass on R complex data;
2) Send each output of [1] to the applicable DWT-unweight/normalize/round/carry/DWT-reweight macro (distinct ones for Fermat and Mersenne-mod transform modes);
3) Do radix-R initial fFFT pass on the R complex normalized data.

For R > 16 each DFT can easily run to a thousand instructions or more, and each of the R carry-macro invocations in [2] is on the order of 50-100 instructions.

I first considered reducing the sheer amount of code involved in the context of the non-SIMD (scalar-double) builds, where the optimizer has to deal with all of the code resulting from 1-3 (as opposed to just inlining blocks of assembly code, as in SIMD builds). I found that simple steps like modifying [2] to a single parametrized carry-macro call wrapped in a loop not only slashed the compile time, it appreciably boosted the resulting code speed. That's when I also started trying the strategy on the SIMD code paths.
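Schematically, and with made-up names rather than the actual Mlucas macros, the change amounts to replacing R copy-pasted carry-macro invocations with a single parametrized invocation inside a loop. A toy, compilable sketch:

[code]#include <stdio.h>

#define RADIX0 64   /* leading radix r0; example value */

/* Stand-in for the DWT-unweight/round/carry/reweight macro - in the real code
   each invocation expands to 50-100 instructions, so RADIX0 unrolled copies
   blow out the L1 I-cache. */
#define CARRY_MACRO(x, wt, cy) do { *(x) = (*(x) + *(cy)) * (wt); *(cy) = 0.0; } while(0)

int main(void)
{
    static double x[RADIX0], wt[RADIX0], cy[RADIX0];
    int j;
    for(j = 0; j < RADIX0; j++) { x[j] = j; wt[j] = 1.0; cy[j] = 0.25; }

    /* Before: RADIX0 hand-unrolled invocations, CARRY_MACRO(&x[0],...),
       CARRY_MACRO(&x[1],...), and so on.
       After: one parametrized invocation wrapped in a loop, shrinking the
       emitted object code by roughly a factor of RADIX0: */
    for(j = 0; j < RADIX0; j++)
        CARRY_MACRO(&x[j], wt[j], &cy[j]);

    printf("x[1] = %.2f\n", x[1]);
    return 0;
}[/code]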

Upshot: ~30% speedup across the board [i.e. all FFT lengths, and for both 1-thread and multithreaded] for scalar-double and SSE2 builds on my aging Core2Duo Macbook. I am now within a factor of 2x of Prime95, as typified by the following Mlucas/Prime95 timing ratios (in mSec/iter) at FFT length 2560K:

1-thread: 128/70 = 1.83x
2-thread: 73/46 = 1.59x

[These are using p95 v25.10 - I don't think George has sped the old SSE2 code much in the interim, but perhaps the multithreaded performance has been upped.]

The one annoyance: I get no speedup from the object-code-shrinkage on my Haswell quad. But my code already got a much greater per-cycle boost than p95 on Haswell, which leads me to believe the various hardware improvements [doubled L1 cache sizes, doubled L2-to-L1 bandwidth] on Haswell mitigated the I-cache-overflow issue for me.

I believe this phenomenon also explains many of the seemingly-bizarre performance issues I previously encountered with my SSE2 builds on Core2.

jasonp 2014-03-15 15:11

Nice to hear!

I just saw your previous post about the 1MB limit to performance. Could it be a limit on the number of TLB's on your machine? A single translation lookaside buffer entry speeds up access to a single 4kB page, so if your machine has a TLB with 128 entries and you pull in data from more than 128 different widely separated addresses then you will probably get noticeable slowdowns. Usually a TLB miss takes much less time than an access from RAM, but if you would otherwise be running from L2 then the miss penalty should be noticeable. Yes, this means modern x86 machines don't have enough TLB entries to cover their own caches; those caches are huge now, and a TLB is logic that is in the critical path of every memory access.

The good news on linux is that you can prove whether this is happening, by allocating a partition somewhere of type 'hugetlbfs' that is backed by RAM. Accesses inside this partition are backed by 'second level TLB' entries, which on x86 cover 2MB or 4MB. If your code speeds up drastically in this case, you know that it's TLB thrashing. The bad news is that you can't do much about it, the TLB size is a fundamental machine limit and the only way to speed up is to respect that limit.

The worse news is that I have no idea how to do the test :)

ewmayer 2014-03-15 21:39

Thanks for the commentary and suggestions - I considered that the effect might be a TLB issue rather than L2 cache per se, but as you note, saw no easy way to test it. Perhaps George can provide some useful insight here, as I know he did much wrestling with TLB issues in days of yore.

From my perspective, as I now have a nice unified-code methodology [most code aside from the DFT cores is common across different radices] for implementing large-radix leading-pass DFTs (the ones that wrap the carry phase as noted above), absent a simpler direct-fiddle-with-TLB-usage fix I will simply add larger-radix DFTs as needed to keep the per-thread data chunksizes < 1 MB. Note that on pre-Haswell chips this is closely tied to the compact-object-code issue, because previously those L1 I-cache overflow issues were negating the chunksize-beneficial effects of the larger radices.
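For future reference, here is a minimal sketch (untested here) of how one might grab a huge-page-backed allocation for such an experiment on Linux without an explicit hugetlbfs mount, via mmap's MAP_HUGETLB flag - huge pages must first be reserved, e.g. via /proc/sys/vm/nr_hugepages:

[code]#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Try to back the main FFT arrays with (default-size, typically 2MB) huge
   pages; fall back to normal 4KB pages if none are reserved. If the run
   speeds up sharply with huge pages, 4KB-page TLB misses were the culprit. */
static void *big_alloc(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if(p == MAP_FAILED) {
        fprintf(stderr, "huge-page mmap failed (%s); falling back to 4KB pages\n",
                strerror(errno));
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p;   /* caller should still check for MAP_FAILED */
}[/code]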

Prime95 2014-03-16 03:55

[QUOTE=jasonp;369016]Could it be a limit on the number of TLB's on your machine? A single translation lookaside buffer entry speeds up access to a single 4kB page, so if your machine has a TLB with 128 entries and you pull in data from more than 128 different widely separated addresses then you will probably get noticeable slowdowns. Usually a TLB miss takes much less time than an access from RAM, but if you would otherwise be running from L2 then the miss penalty should be noticeable. Yes, this means modern x86 machines don't have enough TLB entries to cover their own caches; [/QUOTE]

I'm not convinced TLB accesses are a problem.

Prime95 in pass 1 where we are using a large stride reads scattered data from RAM and copies it to a scratch area, the various FFT steps and normalization are done in the contiguous scratch area where TLBs are not a problem before copying the final results back to scattered RAM. Yes, there are some TLB misses at the start and end, but the cost appears to be insignificant. I've implemented large pages FFTs in Windows and did not observe a measurable speed difference.

Now if Ernst is trying to do the entire large stride pass 1 in-place, then the TLB costs could be much more significant.

ewmayer 2014-03-16 06:07

[QUOTE=Prime95;369070]I'm not convinced TLB accesses are a problem.

Prime95 in pass 1 where we are using a large stride reads scattered data from RAM and copies it to a scratch area, the various FFT steps and normalization are done in the contiguous scratch area where TLBs are not a problem before copying the final results back to scattered RAM. Yes, there are some TLB misses at the start and end, but the cost appears to be insignificant. I've implemented large pages FFTs in Windows and did not observe a measurable speed difference.

Now if Ernst is trying to do the entire large stride pass 1 in-place, then the TLB costs could be much more significant.[/QUOTE]

No - The way I do things is very much as you describe (holy convergent evolution, Batman) - large-regular-stride main-array data go into the radix-R DFT preceding the carry step, outputs of the final-iFFT DFT go into a small block of local storage; after the carry step we read from that, do the initial-fFFT radix-R DFT and write results back to the large-stride main-array locations. I've been working to increase the pass 1 radix (R in my notation) to get the ensuing pass 2 (remaining fFFT passes/dyadic-squaring/all-but-last-iFFT-pass) data block sizes below the L2 size, and eventually to something which will fit in L1.

I can see why the local-array aspect of the DFT/carry/DFT stage is TLB-insensitive, but what about the scattered-data reads and writes bookending things?

One more note of possible relevance: For large radices I've found it expedient for the scalar-double code path to write things back to the scattered main array locs rather than to local memory. Interestingly I see no significant performance hit from doing so, contrary to my expectations. (This is not an option currently for the SIMD builds since those use carry macros which assume data are in contiguous memory.)

ldesnogu 2014-03-17 12:28

[QUOTE=ewmayer;368933]The one annoyance: I get no speedup from the object-code-shrinkage on my Haswell quad. But my code already got a much greater per-cycle boost than p95 on Haswell, which leads me to believe the various hardware improvements [doubled L1 cache sizes, doubled L2-to-L1 bandwidth] on Haswell mitigated the I-cache-overflow issue for me.[/QUOTE]
Another hypothesis: much better hardware prefetcher + branch prediction on Haswell :smile:

tServo 2014-03-17 23:49

Intel Vtune
 
If one had the time, one could download Intel's excellent utility, Vtune, to determine what exactly is going on. It's free for 30 days. Quite a few years ago, for a project, I had the chance to use an early version of it and was VERY impressed; it was quite amazing to actually look inside the cpu to see what's going on and make changes to improve response rather than doing the trial-and-error approach. Although there are lots of features requiring some dedicated time to learn, I also remember it had some "quick start" modes to get results quickly. I believe there are some YouTube videos that demonstrate its features.

ewmayer 2014-03-18 03:39

[QUOTE=tServo;369241]If one had the time, one could download Intel's excellent utility, Vtune, to determine what exactly is going on.[/QUOTE]

Does that require an Intel compiler install underneath?

tServo 2014-03-18 13:01

[QUOTE=ewmayer;369249]Does that require an Intel compiler install underneath?[/QUOTE]

It says it doesn't. With Visual Studio being so prevalent on Windoze boxes, it makes sense that they wouldn't limit it.
In examining its new features, one that I found VERY interesting (they have improved this since the time I used it) is that you can collect data from remote machines by running a collection piece on the other machine that doesn't require a separate license. Previously, if you had 3 different Intel cpu architectures you were targeting, you needed a license for each machine or an expensive floating license. Now, it looks like 1 will do the trick. Very cool.
BTW, its name is now Vtune Amplifier XE 2013. sheesh, I'll never understand marketing types.

ewmayer 2014-05-28 07:09

1 Attachment(s)
Second run of F[sub]28[/sub], at FFT length 16384 kdoubles, finished today, and the result matches the one for 15360k I posted last November in this thread. The timings in the attached checkpoint-summary-status file reflect the first part of the run being done using leading radix r0 = 64 (yielding a per-thread dataset size of ~2 MB) and then switching to r0 = 128 (1MB/thread, right at the 1 MB threshold which is critical on Intel x86) and later r0 = 256 (~0.5MB/thread) as the new code for those radices became available.

The summaries for the first few and last few checkpoints and the final Selfridge-Hurwitz residues are
[i]
F28: using FFT length 16384K = 16777216 8-byte floats.
this gives an average 16.000000000000000 bits per digit
Using complex FFT radices 64 16 16 16 32
[Nov 26 09:00:00] F28 Iter# = 10000 clocks = 00:00:00.000 [ 0.0695 sec/iter] Res64: 0F38D26B35C6794C. AvgMaxErr = 0.010993669. MaxErr = 0.013671875
[Nov 26 09:11:38] F28 Iter# = 20000 clocks = 00:00:00.000 [ 0.0696 sec/iter] Res64: 6E621D1C6A05BC46. AvgMaxErr = 0.011034805. MaxErr = 0.015625000
...
Using complex FFT radices 128 16 16 16 16
[Dec 14 14:50:10] F28 Iter# = 22690000 clocks = 00:00:00.000 [ 0.0580 sec/iter] Res64: 262012254B592782. AvgMaxErr = 0.010806998. MaxErr = 0.013671875
[Dec 14 14:59:50] F28 Iter# = 22700000 clocks = 00:00:00.000 [ 0.0579 sec/iter] Res64: 0002AF208FE85365. AvgMaxErr = 0.010797848. MaxErr = 0.013671875
...
Using complex FFT radices 256 8 16 16 16
[Jan 09 14:55:42] F28 Iter# = 61130000 clocks = 00:00:00.000 [ 0.0567 sec/iter] Res64: F3699600BCBFECE0. AvgMaxErr = 0.011603018. MaxErr = 0.015625000
[Jan 09 15:05:10] F28 Iter# = 61140000 clocks = 00:00:00.000 [ 0.0567 sec/iter] Res64: 2DD786952E04FC17. AvgMaxErr = 0.011608487. MaxErr = 0.015625000
...
[May 27 04:12:06] F28 Iter# = 268390000 clocks = 00:00:00.000 [ 0.0575 sec/iter] Res64: E529CBABAE1D8B17. AvgMaxErr = 0.011612073. MaxErr = 0.015625000
[May 27 04:21:42] F28 Iter# = 268400000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 7D195E1B0DD4A93E. AvgMaxErr = 0.011611240. MaxErr = 0.015625000
[May 27 04:31:17] F28 Iter# = 268410000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 840DBB498F57E870. AvgMaxErr = 0.011617642. MaxErr = 0.015625000
[May 27 04:40:53] F28 Iter# = 268420000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 64BE2132426CDD1C. AvgMaxErr = 0.011605388. MaxErr = 0.015625000
[May 27 04:50:28] F28 Iter# = 268430000 clocks = 00:00:00.000 [ 0.0575 sec/iter] Res64: 360A9E1308A3165A. AvgMaxErr = 0.011610274. MaxErr = 0.015625000
[May 27 04:55:43] F28 Iter# = 268435455 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 468731C3E6F13E8F. AvgMaxErr = 0.006339752. MaxErr = 0.015625000
F28 is not prime. Res64: 468731C3E6F13E8F. Program: E3.0x
R28 mod 2^36 = 16759471759
R28 mod 2^35 - 1 = 30476727665
R28 mod 2^36 - 1 = 9104636564[/i]

A tar archive of the 2 checkpoint-summary files and the final bytewise residue file (identical for both runs, thus just 1 copy) in the previously-described format is [url=http://www.hogranch.com/mayer/f28_runs.tgz]here[/url] (33 MB, so please do not download frivolously - the run-status file alone is in the much smaller attachment below, if you just want the summary data). The MD5 checksum on the extracted bytewise residue file is
[i]
md5sum f28 = 1554face46e633eb2fd773df2647cd2a[/i]

F29 (@30720K) is taking ~110 ms/iter on all 4 cores of my non-OCed Haswell 4670K; I'd really like to get that down to ~86ms (i.e. 1 Miter/day) which would still need 18 months for the complete run, so it may be time to consider [strike]performance-enhancing drugs[/strike] overclocking the hardware.

ewmayer 2014-10-10 04:07

[QUOTE=ewmayer;374437]F29 (@30720K) is taking ~110 ms/iter on all 4 cores of my non-OCed Haswell 4670K; I'd really like to get that down to ~86ms (i.e. 1 Miter/day) which would still need 18 months for the complete run, so it may be time to consider [strike]performance-enhancing drugs[/strike] overclocking the hardware.[/QUOTE]

My initial stab (summer 2013) at FMA-related speedups was restricted to the intermediate-FFT-pass radix-16 routines, which do roughly half the work in a run like my ongoing F29 one, where I am using complex FFT radices 240,[b]16[/b],[b]16[/b],[b]16[/b],16 (product = 15M, or 30 Mdoubles in real-data terms). The bolded radices are the previously-FMAized ones. That effort was somewhat disappointing, since it yielded an overall speedup of 4-5% over non-FMA AVX, i.e. perhaps 8-10% speedup for the FMA-optimized code macros. (I was hoping for something like 15% or better). Part of the disappointment is due to a very basic aspect of my FFT implementation, namely that I use a post-twiddles complex-multiply scheme for the various passes of the inverse FFT, in which the complex twiddles are applied to the outputs of the DFT used in each iFFT pass. This is nice from a dataflow-symmetry and auxiliary-data-indexing perspective, but is not favorable for an FMA-based code speedup because instead of leading to a nice "tangent structure" (as for the forward, pre-twiddle passes) in which the twiddle MULs can be absorbed nicely in with the ADD/SUBs, one instead ends up in the iFFT with a raft of "trivial" FMAs with one of the two multiplicands = 1 (or simply retains the original ADD/SUBs those replace), and at the end must do a batch of pure-MULs (i.e. wasting the fused mul/add aspect of FMA) as part of the post-twiddling. Here are the resulting opcounts for forward and inverse radix-16 complex FFT passes:

Fwd: 174 FMA [Breakdown: 87 FMA, 2 FMS, 85 FNMA, 0 FNMS.]
Inv: 174 FMA [a whopping 112 of which have a trivial unity multiplicand], 34 MUL.

Upshot: I probably get the desired FMA-speedup in the forward FFT routines restructured thusly, but get little speedup in the iFFT due to the above issues.

Partly as a result of the effort expended to achieve this mediocre gain I did not consider widening my FMA-related optimizations beyond the above 2 "heavy lifting" routines until recently. However, I will henceforth be focusing on more-localized "peephole" style FMA code opts, which can be done incrementally and in a localized fashion. For starters I FMAized the other code macros (radix-15 and twiddleless radix-16 assembled into radix-240, and fused radix-16 final-fwd-FFT/dyadic-square/initial-inv-FFT-pass) used in the above set of radices for my ongoing F29 run. That and a half-dozen other miscellaneous 1%-level tweaks have cut the per-iteration timing @30 Mdoubles to 95 msec, which is within 10% of my "stretch goal" of 86 msec.

(Still running at stock - if I added a ginormo-cooler and OCed up the wazoo I could probably hit 80 msec with the current code, but where's the challenge in that? :)
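To make the pre-vs-post-twiddle asymmetry concrete, here is a minimal scalar sketch (illustrative only - the real macros are SIMD inline-asm, and the full "tangent" scheme folds more of the multiplies) of a pre-twiddle radix-2 butterfly written with fma() from math.h, in which two of the four real multiplies of the complex twiddle product are absorbed into FMAs; a post-twiddle pass instead does the add/subs first and is left with pure MULs at the end:

[code]#include <math.h>

/* Pre-twiddle radix-2 butterfly on complex data held as (re,im) doubles:
   A' = A + w*B, B' = A - w*B. The complex product w*B costs 2 FMA + 2 MUL;
   in a post-twiddle (iFFT-style) pass the twiddle instead multiplies the
   butterfly outputs, leaving pure MULs that waste the add half of the FMA. */
void radix2_pretwiddle(double *ar, double *ai, double *br, double *bi,
                       double wr, double wi)
{
    double tr = fma(wr, *br, -wi * (*bi));  /* Re(w*B) = wr*br - wi*bi */
    double ti = fma(wr, *bi,  wi * (*br));  /* Im(w*B) = wr*bi + wi*br */
    double xr = *ar, xi = *ai;
    *ar = xr + tr;  *ai = xi + ti;
    *br = xr - tr;  *bi = xi - ti;
}[/code]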

Prime95 2014-10-10 14:45

In the radix-16 code, have you tried prefetching the next radix-16 data into the L1 cache?

ewmayer 2014-10-10 21:48

[QUOTE=Prime95;384890]In the radix-16 code, have you tried prefetching the next radix-16 data into the L1 cache?[/QUOTE]

My current prefetch scheme there allows for a prefetch with a compile-time "fetch-ahead" distance specified. Based on experiments I found the best value of that to be ~1kB, very different from the "fetch very next set of inputs" scheme you describe. But let me do a quick-revert to the latter when I get home this evening and have access to the Haswell, and I will let you know what the result is.

One of my ongoing frustrations is that I have not been able to get anywhere near the big speedups you describe with prefetch - mine have been at best 2-3%, an order of magnitude lower than expected. Getting another 20-30% speedup effectively "for free" would of course be huge, but I have never seen anything near that with my prefetch experiments over the years.

Prime95 2014-10-10 22:50

Prime95 v27 and before prefetched way ahead into the L2 cache. That is, say we're doing the first pass operating on 128KB of data. During the processing of the 128KB of data I slowly prefetch the next 128KB of data into the L2 cache.

In v28, I added the prefetch next few cache lines into the L1 cache. This moves the data from the L2 cache into the L1 cache.

How many total passes over main memory does your FFT take? Prime95 does the forward and inverse FFTs with reading and writing the FFT data only twice. If your code is doing more than that you will run into memory bandwidth limitations sooner.

ewmayer 2014-10-11 05:34

I just re-played with the prefetch params in the forward-radix-16 DFT macro, tried bytes-ahead for the address offset ranging from 8 (just one double ahead) to 4096 ... again only a few % variation in the resulting timings, and 1kB seems roughly the best, though the effect is weak. All these current prefetches are the t1 variety.

[QUOTE=Prime95;384941]Prime95 v27 and before prefetched way ahead into the L2 cache. That is, say we're doing the first pass operating on 128KB of data. During the processing of the 128KB of data I slowly prefetch the next 128KB of data into the L2 cache.

In v28, I added the prefetch next few cache lines into the L1 cache. This moves the data from the L2 cache into the L1 cache.

How many total passes over main memory does your FFT take? Prime95 does the forward and inverse FFTs with reading and writing the FFT data only twice. If your code is doing more than that you will run into memory bandwidth limitations sooner.[/QUOTE]
Well, you have to store the not-currently-in-registers data somewhere, so where do you cache the data if not in the main array? In some contiguous chunk of local memory?

I'll illustrate my data move strategy using the above set of complex radices (240,16,16,16,16) I'm using for F29. Call the overall runlength in terms of doubles n:

[1] Initial forward radix-240 decimation-in-frequency (DIF) pass on n/240-stride-separated data;
[2] Remaining 4 fFFT passes, dyadic square step, and initial 4 iFFT passes (iFFT runs radices in reverse order, thus again 4 x radix-16 here) done on one contiguous n/240-sized data chunk at a time;
[3] Final radix-240 decimation-in-time (DIT) pass on n/240-stride-separated data;
[4] Round and do carries;

Steps 1,3,4 (actual order is 3,4,1 here) are all fused into a single pass through the data. Step 2 involves multiple passes, but all on a contiguous block of data. (And I process 240 such blocks in turn). The size of each of these blocks relates to the "1 MB chunk threshold" I described earlier, which is crucial for good performance on the x86. 4 passes of complex radix16 means 16^4 = 2^16 complex doubles, i.e. 2^17 doubles, exactly 1MB. For F30 if I again used leading radix 240 the resulting contiguous-data chunks would be 2MB, and I would expect (and have already confirmed) a timing deterioration. I already have in place a leading-radix-960 routine, which is slightly worse than 240 when used for F29, but is better for F30, and better still for F31.

Prime95 2014-10-11 14:39

[QUOTE=ewmayer;384960]
[1] Initial forward radix-240 decimation-in-frequency (DIF) pass on n/240-stride-separated data;
[2] Remaining 4 fFFT passes, dyadic square step, and initial 4 iFFT passes (iFFT runs radices in reverse order, thus again 4 x radix-16 here) done on one contiguous n/240-sized data chunk at a time;
[3] Final radix-240 decimation-in-time (DIT) pass on n/240-stride-separated data;
[4] Round and do carries;[/QUOTE]

This is what I call 2 passes over main memory -- exactly what prime95 does.

The combined steps 3,4,1 work on 240*16 = 4KB chunks. These 4KB chunks easily fit in the L1 cache while being processed. Two comments on performance:

1) While processing the 4KB chunk in the L1 cache, you should be prefetching the next 4KB chunk into the L1 cache. My concern is that you are not doing enough FPU work to hide the latencies of reading the next 4KB chunk from main memory. If I'm right, you can improve the FPU code all you want but your actual timings won't change.

2) You likely break the 240 up into two steps, 16 by 15 (actually inverse 16,15,carries,forward 15,16), and you are working with horribly large strides. During the inverse radix-16 step in my example, Prime95 would read the horrible stride data and write it to a contiguous 4KB scratch area. The remaining steps then deal with the pleasant small stride scratch area until the final forward radix-16 step writes it back to the FFT data area with horrible strides.



What I call pass 2 over main memory processes chunks of 16^4 * 16 bytes = 1MB. This pass operates out of the L3 cache. Some comments on performance:

1) The L3 cache is a lot slower than the L2 cache. The CPU will not be able to hide the latencies without prefetching into L1 cache. You've found that prefetching 1KB ahead seems to be optimal for hiding this latency.

2) While you are processing the 1MB chunk, you can prefetch the next 1MB chunk into the L3 cache. You are doing plenty of FPU work to easily hide all main memory latency. Except---

3) I'm worried you may be exceeding the L3 cache's capabilities. If you operate on 1MB of data while prefetching the next 1MB of data and you have 4 cores going -- that's 8MB of data. It probably won't all fit. I also worry about the contention you've got going with 4 cores all placing very heavy read and write demands on the shared L3 cache. Is it possible to exceed the L3 cache's bandwidth capabilities? I don't know.


Here's what I'd recommend trying:

In pass 1, go with 240 x 16 -- 60KB. While processing, prefetch the next 60KB into the L2 cache (spread these prefetches out!). Total demand on the L2 cache is 180+KB (60KB FFT data, next 60KB FFT data, 60KB scratch area, some sin/cos and weights). A tight fit.

In pass 2, go with 16x16x16 -- 64KB. Should fit nicely in the L2 cache since you don't need a scratch area.


That's a lot of work. I hope I'm right and it will be worth the effort. No guarantees.
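To illustrate the "spread these prefetches out" idea (hypothetical shapes and names, using the _mm_prefetch intrinsic, whose T1 hint corresponds to prefetcht1): issue one cache-line prefetch of the next block per inner-loop iteration, so the whole next 60KB has trickled toward L2 by the time the current block is finished.

[code]#include <xmmintrin.h>   /* _mm_prefetch */
#include <stddef.h>

#define BLOCK 7680                /* 60KB worth of doubles per pass-1 block */

/* Process the current block while slowly pulling the next one toward L2:
   one 64-byte line per inner iteration; (BLOCK/8) iterations * 64 bytes = 60KB,
   so the entire next block has been touched by the time this one is done. */
void process_block(double *cur, const double *next)
{
    const char *pf = (const char *)next;
    for(int i = 0; i < BLOCK/8; i++) {
        _mm_prefetch(pf + (size_t)i*64, _MM_HINT_T1);
        /* ... the actual DFT + carry work on cur[8*i .. 8*i+7] goes here ... */
        cur[8*i] += 0.0;          /* placeholder for the real FPU work */
    }
}[/code]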

ewmayer 2014-10-12 01:54

George, many thanks for the thoughtful, detailed reply.

[QUOTE=Prime95;384980]This is what I call 2 passes over main memory -- exactly what prime95 does.[/QUOTE]
Holy convergent evolution, Bat-man! :) Actually, the key pieces of the scheme - the fused-radix/carry step and a similar fusion of the final fFFT pass, dyadic-square step and initial iFFT pass (same radix as the final-fFFT, and sharing the same twiddles) - were already in place in the Fortran90 code I used for the 1999 test of F24. The key pieces missing at that time (besides the transition to C and mass deployment of inline ASM) were that the 1999 code still used an out-of-place "ping pong" FFT (which performed well on at least one chip family, MIPS, but sucked hard on most others, from Alpha to x86), and instead of doing a fused step [2] to all but the final-iFFT-radix/carry/initial-fFFT-radix on one contiguous main-array data chunk at a time, I did the intermediate-radix (= all but the initial and final radix of each FFT) passes on the whole array, thus a set of 5 radices-for-each-FFT as I am using for F29 meant 8 passes through the whole array. That predictably scales badly to large FFT lengths.

[QUOTE]The combined steps 3,4,1 work on 240*16 = 4KB chunks. These 4KB chunks easily fit in the L1 cache while being processed.[/QUOTE]
I use a complex FFT, so radix-240 means 240*(16,32,64) = (3.75,7.5,15)KB in scalar-double, SSE2 and AVX build modes, respectively. Even the AVX-mode 15KB should fit nicely in L1 on Haswell with its 32KB level-1 D-cache, though.

However, note that for radices like 240 which decompose into 2 coprime factors (240 = 15*16 = odd * power-of-2, but could also be the product of 2 coprime odds or odd1 * (odd2 * power-of-2) with odd1,2 coprime), I typically use 2 such contiguous local stores, due to the index-permutation flexibility demanded by my twiddleless implementation of such coprime radix pairs. So e.g. pass 1 reads from large-stride-separated main-array locations and writes to store 1, pass 2 reads from store 1 and writes to store 2. But I still think the main-array step is the crucial one here, as you note below.

[QUOTE]Two comments on performance:

1) While processing the 4KB chunk in the L1 cache, you should be prefetching the next 4KB chunk into the L1 cache. My concern is that you are not doing enough FPU work to hide the latencies of reading the next 4KB chunk from main memory. If I'm right, you can improve the FPU code all you want but your actual timings won't change.[/QUOTE]
Yes, that sounds like a promising "try this first" step - I have done limited prefetch experiments for the large-stride step, but IIRC my attempts were influenced by my experience with prefetch within the contiguous Step [2] data blocks, where fetch-ahead distances ~1KB seem best for me. I don't think I ever made a concerted effort to fetch just the next set of complex data-to-be-used into L1.

[QUOTE]2) You likely break the 240 up into two steps, 16 by 15 (actually inverse 16,15,carries,forward 15,16), and you are working with horribly large strides. During the inverse radix-16 step in my example, Prime95 would read the horrible stride data and write it to a contiguous 4KB scratch area. The remaining steps then deal with the pleasant small stride scratch area until the final forward radix-16 step writes it back to the FFT data area with horrible strides.[/QUOTE]
Same here, but with the extra twist of 2 scratch areas for coprime radices.

The carry macros also operate on the last-used scratch area. Carries also need auxiliary data - DWT weights for Mersenne-mod, and complex-roots-of-(-1) (similar to the FFT twiddles, but those are (n/2)th roots of +1, as opposed to the (n/2)th roots of -1 (same as the first (n/2), that is upper-complex-half-plane, (n)th roots of +1) needed for the "acyclic twisting" scheme for moduli like Fermats with a "+" sign).

I use a 2-table multiply scheme for on-the-fly generation of both FFT twiddles and IBDWT weights. As I noted to you some years ago, for the IBDWT weights w_0,...,w_n-1 I make use of the identity

w_j * w_(n-j) = 2

to allow a single array to encode both the forward and inverse weights, at the price of some slightly-less-cache-friendly reverse-running indexing for the inverse-weights step.
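A quick numerical sanity check of that identity, assuming the standard Crandall-Fagin weight definition w_j = 2^(ceil(j*p/n) - j*p/n) (toy p and n values; the production code of course never computes the weights this naively):

[code]#include <math.h>
#include <stdio.h>

/* Verify w_j * w_(n-j) = 2 for j = 1..n-1, with w_j = 2^(ceil(j*p/n) - j*p/n).
   (w_0 = 1, so j = 0 is excluded.) */
int main(void)
{
    const long p = 86243, n = 4608;   /* example exponent / transform length */
    for(long j = 1; j < n; j++) {
        double wj  = pow(2.0, ceil((double)j*p/n)     - (double)j*p/n);
        double wnj = pow(2.0, ceil((double)(n-j)*p/n) - (double)(n-j)*p/n);
        if(fabs(wj*wnj - 2.0) > 1e-9) { printf("FAIL at j = %ld\n", j); return 1; }
    }
    printf("w_j * w_(n-j) = 2 holds for j = 1..%ld\n", n-1);
    return 0;
}[/code]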

My 2-table schemes have each pair of tables within a factor of 2 of each other in size, thus I have a pair of O(sqrt n)-sized tables of complex data for FFT twiddles, another such pair for acyclic-twisting weights in Fermat-mod mode, and a pair of similar-sized real-data tables for DWT weights in Mersenne-mod mode. Note that only one such pair of data tables competes for L1 cache with actual "live" transform (i.e. residue-related) data in any phase of the overall FFT-modmul scheme, though: The fused steps [1,3,4] need either IBDWT weights or acyclic-twisting weights depending on the kind of modulus; step [2] needs just the 2 subtables needed to generate the complex-FFT twiddles.
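The 2-table idea in a nutshell, as a hypothetical toy (the real tables hold SIMD-friendly packed data, and the FFT-twiddle case is exactly analogous): split the root index j as j = j1*T + j0 with T ~ sqrt(n), keep one table of the T "coarse" roots w^(j1*T) and one of the T "fine" roots w^j0, and reconstruct any of the n roots with a single complex multiply:

[code]#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Two-table on-the-fly generation of the n-th roots of unity:
   w^j = coarse[j/T] * fine[j%T], with T = sqrt(n), so only O(sqrt(n)) complex
   values need be stored instead of n. */
int main(void)
{
    const int n = 1 << 20, T = 1 << 10;            /* T*T = n */
    const double twopi = 8.0*atan(1.0);
    double complex *coarse = malloc(T * sizeof(double complex));
    double complex *fine   = malloc(T * sizeof(double complex));
    for(int k = 0; k < T; k++) {
        coarse[k] = cexp(I*twopi*((double)k*T/n));  /* w^(k*T) */
        fine[k]   = cexp(I*twopi*((double)k  /n));  /* w^k     */
    }
    int j = 123457;                                 /* arbitrary root index */
    double complex wj  = coarse[j/T] * fine[j%T];   /* 1 complex mul per root */
    double complex ref = cexp(I*twopi*((double)j/n));
    printf("|two-table - direct| = %.3e\n", cabs(wj - ref));
    free(coarse); free(fine);
    return 0;
}[/code]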

Aside: For non-power-of-2 FFT lengths (this case was not considered in the now-famous 1994 Crandall/Fagin DWT paper) the Fermat-mod scheme also needs Mersenne-style IBDWT weights in addition to the acyclic-twisting weights, but the IBDWTs are memory-negligible here because for an FFT length of form odd * 2^m the IBDWT weights cycle with sequence length (odd), and it makes little sense to use odd-radix components much larger than 15. (The largest such I have is 63, and it will be used specifically for the Fermat numbers F32 and F33, where an FFT length of form 15 * 2^m will no longer suffice due to ROE growth - I will use FFT length 63 * 2^m for one of the 2 independent tests, and a power-of-2 length 2^(m+6) for the other.)

[QUOTE]What I call pass 2 over main memory processes chunks of 16^4 * 16 bytes = 1MB. This pass operates out of the L3 cache. Some comments on performance:

1) The L3 cache is a lot slower than the L2 cache. The CPU will not be able to hide the latencies without prefetching into L1 cache. You've found that prefetching 1KB ahead seems to be optimal for hiding this latency.

2) While you are processing the 1MB chunk, you can prefetch the next 1MB chunk into the L3 cache. You are doing plenty of FPU work to easily hide all main memory latency. Except---

3) I'm worried you may be exceeding the L3 cache's capabilities. If you operate on 1MB of data while prefetching the next 1MB of data and you have 4 cores going -- that's 8MB of data. It probably won't all fit. I also worry about the contention you've got going with 4 cores all placing very heavy read and write demands on the shared L3 cache. Is it possible to exceed the L3 cache's bandwidth capabilities? I don't know.[/QUOTE]
So if I read you right the evidence for the L3-bandwidth issue you raise would be that the prefetch-into-L3 scheme helps for (say) 1 or 2-threaded runs, but hurts performance (vs no-L3-prefetch) above some thread count in the 4ish range?

[QUOTE]Here's what I'd recommend trying:

In pass 1, go with 240 x 16 -- 60KB. While processing, prefetch the next 60KB into the L2 cache (spread these prefetches out!). Total demand on the L2 cache is 180+KB (60KB FFT data, next 60KB FFT data, 60KB scratch area, some sin/cos and weights). A tight fit.[/QUOTE]
In "240 x 16", is the 16 a data-chunk byte count (as in your "combined steps 3,4,1 work on 240*16 = 4KB chunks" not above), or are you referring to radix-16, i.e. to prefetching data for the first 2 FFT passes into L2?

[QUOTE]In pass 2, go with 16x16x16 -- 64KB. Should fit nicely in the L2 cache since you don't need a scratch area.[/QUOTE]
On considering your comments in light of my experience, I'm not even sure prefetch in pass [2] is worth spending much time on, since the whole point of properly choosing the leading radix (e.g. 240 in our example here) is so that the system should have an easy time fetching the needed data from the higher-level caches ... but I will first attempt to tackle the large-stride-related data fetch issues for fused steps [1,3,4] in my breakdown.

[QUOTE]That's a lot of work. I hope I'm right and it will be worth the effort. No guarantees.[/QUOTE]
Well, as you know, this is a long-term project for me. :) It's nice to have some promising directional hints, for sure.

For the coming month or so, I'll probably continue with my ongoing work of FMAizing the remaining key AVX code macros which need it - don't expect huge gains from that on Haswell, but reliable gains and hopefully-even-better-on-Broadwell when it comes out - and do some large-stride-related prefetch experiments as a side project. Will let you know what I find as results become available.

ewmayer 2014-10-13 02:13

I just added some prefetcht1 calls to my AVX Fermat-mod carry macro, such that during processing of the 240 (or more generally, the R, where R is the leading radix) carries within the segmented main data array (which is not directly in the picture during carry processing, as this is all done using a small contiguous local storage area as described above), prefetches-into-L1 of the data at the R distinct n/R-stride-separated main-array locations get done while the carries are processed. The carry macros seem the logical place to do this prefetching, since even though they do not otherwise access the main array in any way, they make it easy to propagate such prefetches to all the various leading-radix/carry routines I have.

For these experiments I added code such that the bytes-ahead offset for the prefetches is set as a compile-time command-line define and played with various values: 64 is the obvious candidate in AVX mode since an "AVX complex double" consists of 2 adjacent quartets of doubles (4 Re, 4 Im), thus "fetch the next complex datum at each array location" means fetch 64 bytes ahead of the current data addresses. In practice, though, 32 bytes-ahead proves a smidge faster (and I also tried other values, like 16 and 128, all of which were clearly worse). Instant 4% speedup (95ms/iter --> 91 ms/iter for my ongoing F29 run), the clearest timing "signal" I have ever seen in all my playing-with-prefetch. Thanks, George! Hopefully this is the start of a trend - still lots of other things to play with here, as you lay out above.
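For concreteness, here is a sketch of the sort of prefetch now issued from within the carry macros (hypothetical C rendering with a made-up PFETCH_DIST define - the production version is an inline-asm prefetcht1, for which _MM_HINT_T1 is the corresponding intrinsic hint):

[code]#include <xmmintrin.h>

#ifndef PFETCH_DIST
#define PFETCH_DIST 32   /* bytes ahead of the current main-array address */
#endif

/* While the carry step works entirely inside the small contiguous local store,
   touch the j-th of the R (= leading radix) n/R-stride-separated main-array
   locations so its data are already cache-resident when the fused
   DIT/carry/DIF pass next needs them. */
static inline void carry_prefetch(const double *main_arr, long j, long stride)
{
    _mm_prefetch((const char *)(main_arr + j*stride) + PFETCH_DIST, _MM_HINT_T1);
}[/code]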

ewmayer 2016-01-23 01:52

First Pépin test of F29 completed late Wednesday - took slightly under 20 months at FFT length 30M. Start-of-run per-iteration time on my 3.3 GHz Haswell 4670 quad (running at stock) was ~115ms; several rounds of AVX2/FMA-oriented optimizations brought this down to ~90ms, which was representative of roughly the final 80% of the iterations.

On to run #2 (@FFT length 32M) ... our own Mike/Xyzzy was kind enough to run the first ~25Miter @32M over the last several months for me on his 2-core Haswell NUC, so I am proceeding where that run left off.

[UPDATE] I was going to post the final checkpoint-Res64 and the traditional Selfridge-Hurwitz residues along with some notes on run #2 after letting the latter run for a day or two on the same quad-Haswell system, but that cross-check has already turned up a 'bonk!' between iteration 25M and 26M. Not yet 100% sure whether the error is in my new or the old run, but I suspect the old one, because that entire run was 'fragile' and crash-at-random-intervals-ranging-from-an-hour-to-several-months prone. Said crashes were accompanied by either 'fatal roundoff error = 0.5' or 'nonzero exit carry' errors indicating some kind of data corruption, whereas the FFT = 32M run has been clean of any untowardness. I further note that the 30M run used a fast but 'preliminary' binary of my v14.1 code release ... when I compared that to the eventual stabilized release code it was 2-3% faster at the FFT length in question, so I continued using it. Big mistake, as I now suspect the aforementioned 'noisy' fatal bad-data errors were likely accompanied by silent fails ... note the residue discrepancy between the 30M and 32M runs appears at a checkpoint where there are no signs of 'distress' from either run.

So, likely I am looking at a massive waste of compute effort ... lesson learned, never again do such a run-pair in serial mode, always do the 2 side-by-side on systems yielding speeds as similar as possible. To that end, I am looking for a [strike]victim[/strike]volunteer to [strike]go on a suicide mission[/strike] help me re-do the 30M run in parallel with the 32M one. Best will be someone with a similar Linux Haswell-quad to my own ... should one prove slightly faster than the other it will get to finish the 32M-FFT run, which needs ~6% more cycles than the 30M one. Or we could consider running @30M using just 3 cores on the faster system, leaving one core free for other tasks.

Note that [url=http://www.imdb.com/title/tt0060009/?ref_=nm_knf_t1]should you choose to accept this mission[/url], it will be a significant runtime commitment - around 18 months and 0 chance of a prime, although should you change your mind (or blow up your system) at some point along the way you can hand things off (even if your savefiles get nuked, since we will be cross-checking those vs mine) to a fresh recruit. But you will be contributing to a small piece of math history and of course getting a grateful 'thanks!' in any publication that may eventually result.

ixfd64 2016-01-23 05:09

Just curious, did the system have ECC memory?

Xyzzy 2016-01-23 05:19

None of his (or ours) has ECC memory.

:mike:

retina 2016-01-23 07:24

18 month test without ECC RAM! :loco:

ramshanker 2016-01-23 15:05

I envy these algorithms!

ET_ 2016-01-23 16:31

Too bad I still don't owe a Haswell... :sad:

Luigi

LaurV 2016-01-23 16:58

[QUOTE=ET_;423748]Too bad I still don't owe a Haswell... :sad:

Luigi[/QUOTE]
Hm, do you want to owe it, or to own it? :razz:

Xyzzy 2016-01-23 17:07

[QUOTE=ewmayer;423664]…although should you change your mind (or blow up your system) at some point along the way you can hand things off (even if your savefiles get nuked, since we will be cross-checking those vs mine) to a fresh recruit.[/QUOTE]:bump:

ewmayer 2016-01-23 22:44

[QUOTE=ET_;423748]Too bad I still don't owe a Haswell... :sad:

Luigi[/QUOTE]

Now you have the perfect excuse to run out and buy one! You can probably get used systems/components fairly cheap now.

Re. ramshanker's "I envy these algorithms" comment -- you envy abject #FAILure?

ET_ 2016-01-24 11:43

[QUOTE=ewmayer;423829]Now you have the perfect excuse to run out and buy one! You can probably get used systems/components fairly cheap now.
[/QUOTE]

I will, probably for my birthday on March 7th. Will I be able to help then, or will it be too late? :boxer::spinner:

ewmayer 2016-01-24 22:49

[QUOTE=ET_;423867]I will, probably for my birthday on March 7th. Will I be able to help then, or will it be too late? :boxer::spinner:[/QUOTE]

Not at all!

ET_ 2016-01-25 12:59

[QUOTE=ewmayer;423925]Not at all![/QUOTE]

Fine, let's keep in touch then :smile:

ewmayer 2016-01-25 22:22

[QUOTE=ET_;423980]Fine, let's keep in touch then :smile:[/QUOTE]

While my 32M Haswell run continues - at ~900,000 iterations per day - I'll be using spare cycles on my slow Core2 Macbook to re-run the roughly 1 million iterations from Mike's last 32M savefile (whose last checkpoint data agreed with those of my just-completed 30M run at the iteration in question) to the point at which my 30M and 32M-FFT runs diverge, in order to ascertain which of the 2 runs went off the rails. (Like I said I'm almost certain it was the 30M one, but I need to confirm that via a 3rd run.)

ewmayer 2016-02-05 02:47

[QUOTE=ewmayer;424035]While my 32M Haswell run continues - at ~900,000 iterations per day - I'll be using spare cycles on my slow Core2 Macbook to re-run the roughly 1 million iterations from Mike's last 32M savefile (whose last checkpoint data agreed with those of my just-completed 30M run at the iteration in question) to the point at which my 30M and 32M-FFT runs diverge, in order to ascertain which of the 2 runs went off the rails. (Like I said I'm almost certain it was the 30M one, but I need to confirm that via a 3rd run.)[/QUOTE]

It's official - the re-run of iters 24.92M (where Mike left off his Haswell-NUC DC run @FFTlen 32M) through 25.64M (where my Haswell-quad continuation of same first mismatches my original run @FFTlen 30M), using the above-mentioned SSE2 build on my Core2Duo Macbook @FFTlen 30M, matches my (ongoing) Haswell 32M continuation run:

[b][1] Original Haswell-quad run @30M:[/b]
[i]
[Jun 30 12:51:11] F29 Iter# = 25630000 clocks = 00:17:59.766 [ 0.1080 sec/iter] Res64: C782D1DA346F61C4. AvgMaxErr = 0.118563074. MaxErr = 0.156250000
[Jun 30 13:09:15] F29 Iter# = 25640000 clocks = 00:18:00.266 [ 0.1080 sec/iter] Res64: [b]BCF83642C5E19A09[/b]. AvgMaxErr = 0.118375359. MaxErr = 0.148437500
[/i]
[b][2] DC Haswell-NUC/quad run @32M:[/b]
[i]
[Jan 21 14:55:02] F29 Iter# = 25630000 clocks = 00:15:58.792 [ 0.0959 sec/iter] Res64: C782D1DA346F61C4. AvgMaxErr = 0.023437500. MaxErr = 0.023437500
[Jan 21 15:11:06] F29 Iter# = 25640000 clocks = 00:15:59.235 [ 0.0959 sec/iter] Res64: [b]12AF4B8DD106DF85[/b]. AvgMaxErr = 0.023437500. MaxErr = 0.023437500
[/i]
[b][3] DC Core2Duo/SSE2-build run @30M[/b] (timings vary widely since only run when plugged in):
[i]
[Feb 03 23:55:27] F29 Iter# = 25630000 [ 4.77% complete] clocks = 02:57:20.963 [ 1.0641 sec/iter] Res64: C782D1DA346F61C4. AvgMaxErr = 0.102656999. MaxErr = 0.132812500.
[Feb 04 18:19:56] F29 Iter# = 25640000 [ 4.78% complete] clocks = 05:04:33.375 [ 1.8273 sec/iter] Res64: [b]12AF4B8DD106DF85[/b]. AvgMaxErr = 0.102699611. MaxErr = 0.140625000.
[/i]

Thus, everything beyond iter 25.63M in my original Haswell/AVX2 run @FFTlen 30M is garbage.

Luigi, you order your birthday present yet? My hope is that things will run sufficiently fast on your newer hardware that you might get throughput matching my 4-core speed using just 3 (or even 2, due to memory contention giving diminishing returns beyond 2-threads) threads, which will make me feel less guilty about monopolizing your cycles. :)

VBCurtis 2016-02-05 03:52

Mr Mayer-
I may be willing to donate a couple thread-months of haswell desktop until Luigi gets his machine up and running. If you might be so kind as to PM me instructions for Linux setup (I run ubuntu, if it matters), I can continue a double-check alongside your current run, with the caveat that I'll lose interest after 30-60 days.

ewmayer 2016-02-05 04:54

[QUOTE=VBCurtis;425276]Mr Mayer-
I may be willing to donate a couple thread-months of haswell desktop until Luigi gets his machine up and running. If you might be so kind as to PM me instructions for Linux setup (I run ubuntu, if it matters), I can continue a double-check alongside your current run, with the caveat that I'll lose interest after 30-60 days.[/QUOTE]

Thanks, Curtis - please, enough with the "Mr." - you're making me feel my age. :)

First step is to download the current Mlucas release from [url=http://hogranch.com/mayer/README.html]here[/url] and build it - please use the auto-build version at top of the 'Recent News' section. It should end with an 'Mlucas' binary linked to the executable which reflects your platform, e.g. the USE_AVX2 build for Haswell and later. Assuming you successfully build & install, run the medium self-tests using 'Mlucas -s m' in a dir created for your work. Fermat-testing needs a few further tweaks, but first let's get through the above.

ET_ 2016-02-05 09:40

[QUOTE=ewmayer;425270]
Luigi, you order your birthday present yet? My hope is that things will run sufficiently fast on your newer hardware that you might get throughput matching my 4-core speed using just 3 (or even 2, due to memory contention giving diminishing returns beyond 2-threads) threads, which will make me feel less guilty about monopolizing your cycles. :)[/QUOTE]

I completed the to-do list and cleared my doubts about the pieces to pick up: I will place my order next Monday, and get the new toy before the end of the week.

Luigi

LaurV 2016-09-14 05:08

Ernst, can we have a status on F29?
Just curious...

Xyzzy 2016-09-14 14:12

[QUOTE=LaurV;442492]Ernst, can we have a status on F29?[/QUOTE]It is probably still composite.

:razz:

ewmayer 2016-09-15 00:50

[QUOTE=LaurV;442492]Ernst, can we have a status on F29?
Just curious...[/QUOTE]

Well, I'm doing both the 30M and 32M-FFT runs side-by-side on my Haswell quad, so each is running half as fast as it would alone, but it makes it very easy to cross-check residues, since 30M runs only a few % faster than 32M (.1810 sec/iter vs .1860 sec/iter). I'm running each job 4-threaded, even though it gets only the equivalent of 2 cores compute power. Why am I doing this? Because my Haswell system, despite running at stock, has never been completely stable (whereas the very same binary has been running rock-solidly on my slower Mike-built Broadwell NUC for over a year, with the only interruptions being power outages ... but note the NUC is doing 1st-time LL tests, not Fermat testing). I average around 1 week between mystery-fails; roughly half the time both jobs start emitting fatal roundoff errors (sign of data corruption) at the same time; ~30% of the fails involve just one run - this is the reason for making both 4-threaded, so the surviving run can continue at full 4-core speed - and ~20% of the fails are BSOD-style (no UPS on this system, not worth the hassle).

When a run fails due to fatal-ROE-style data corruption it will attempt to auto-restart from the most-recent savefile; this succeeds perhaps 1/3rd of the time, suggesting that is the fraction of times the data-corruption occurred in the main residue array as opposed to the auxiliary data tables needed for the FFT-mul. (Adding code logic to auto-reinit those tables is on my to-do list, but not currently a high priority.)

Whenever I reboot/restart it gives me a chance to allow whichever of the 2 runs has fallen behind to catch up before restarting the other one.

There have been at least 2 (perhaps 3 - too lazy to dig thru the savefiles just now) occasions where one run has started emitting non-matching residues - in such cases I have to halt both and restart from the nearest savefile preceding the start of mismatching residues, since I have no [i]a priori[/i] way of determining which one went off the rails.

Anyhow, both runs passed 130M iters a few days ago, which is a smidge more than 1/4 done. Total time for the side-by-side runs will be ~3 years, so only 27 months to go. :(

Since I plan to do F30 on a manycore system (I'm lookin' at you, Knight's Landing), it is possible those parallel runs will complete before the F29 ones, or not much later. I'll have a better sense of things by EOY, roughly the timeframe I estimate for a halfway-decent initial AVX-512 assembly-code deployment.

LaurV 2016-09-15 05:32

Nice, thanks for sharing that!

That is what I (used to) do when running two LLs in parallel on two Titans - stop both and resume from the last match, as there was no way to know which one failed. I said "used to" because for the moment I have stopped my LL testing altogether (no resources), not because I would do it differently now. The old way is still the best. You can save a lot of time, actually (your time and others' time, compared with the version where one test would fail and you don't know and report a bad result, and others need to TC).

@Xyzzy: :yucky:
(as said before, this icon is improperly named, it has nothing to do with "yuck", I see a grumpy guy sticking out a tongue in defiance, or spite, of another. Long ago I wanted it for myself, but you gave me the lightning rabbit... hey, btw, now with Pokemon in vogue, maybe I can make some money if I sell it? Any takers?)

:laurv:

ET_ 2016-09-21 11:01

As you probably already noticed, "my" Skylake didn't come.

I had health issues, then had to sell my mom's home, my wife's home was too small and we had to look for another one, sell the old one, buy the new one in a city that is 450 miles away from where I live, prepare to relocate, look for a job and some other things I could put on the unhappy me thread.

I want to apologize to the forum and to Ernst for offering a helping hand and silently retracting it. The new PC should finally arrive after our relocation is complete, I hope by end of October (you will know by then).

If help is still needed in November, I hope I will be ready.

ewmayer 2016-09-21 21:09

[QUOTE=ET_;443146]As you probably already noticed, "my" Skylake didn't come.

I had health issues, then had to sell my mom's home, my wife's home was too small and we had to look for another one, sell the old one, buy the new one in a city that is 450 miles away from where I live, prepare to relocate, look for a job and some other things I could put on the unhappy me thread.

I want to apologize to the forum and to Ernst for offering a helping hand and silently retracting it. The new PC should finally arrive after our relocation is complete, I hope by end of October (you will know by then).

If help is still needed in November, I hope I will be ready.[/QUOTE]

No worries, Luigi, and I hope things have calmed down on the various fronts you mention.

In the meantime, the usefulness of having both runs executing locally on the same (or similar) hardware has become clear to me - If I had a 2nd *well quad, that would double my throughput, but can't justify the expense. So I plan to just have my Haswell complete F29, during which time I will produce AVX512 code which will hopefully bring F30 within reasonable reach, either on a KNL or one of the future desktop systems which will support that double-the-width-of-AVX2 instruction set.

Luigi, you go ahead and put your hardware, when it arrives, to use for GIMPS - you definitely have a better chance of discovering a prime that way. :)

ewmayer 2017-11-14 02:19

The 2 side-by-side runs at 30M and 32M FFT I mentioned in my Sep 2016 post both got to iter ~230M on my Haswell in late April with me doing near-daily Res64 cross-checks, but it was a major headache, the Haswell - since powered off except for occasional avx2 code tests and tweaks - being sufficiently unstable that it was often a challenge to get 24 hours of uptime. At that point I moved the respective runs to a pair of faster systems, both physically hosted by our own David Stanfill, a.k.a. user airsquirrels. One system is David's own GPU-filled 32-core Xeon/avx2, the other is the Intel Knights Landing box a bunch of us GIMPSers interested in doing/supporting AVX-512 code development crowdfunded this past spring. After doing the basic AVX-512 assembly code port and some experiments to figure out which FFT length should be run on each of these 2 servers, the 30M-FFT run was finished on the Xeon (best per-iter time there ~24 ms using 32 threads) and the 32M run on the 64-core KNL (best time ~35 ms/iter using 64 threads). The Xeon run completed in late July, and the KNL run finished on 2 Sep.; both gave these final Selfridge-Hurwitz residues (as well as matching full-length final residue files):
[i]
F29 is not prime. Res64: BECE1E5FFBE5EC12. Program: E17.0
R29 mod 2^36 = 68650658834
R29 mod 2^35 - 1 = 17796269382
R29 mod 2^36 - 1 = 21246329964
[/i]
Since my original Haswell runs started from a common 32M-FFT savefile (as opposed to being run completely independently @30M and 32M), and had numerous restarts as well as several residue mismatches needing backtracking to see which of the 2 runs (which were proceeding side-by-side on my quad Haswell at similar speeds, allowing for easier cross-checking) went "off the rails", data-integrity-wise, after the faster Xeon completed its run of the final ~300Miters I further used it to rerun the first ~230Miters, all @30M FFT length. That partial run just completed over the weekend, with the 230Miter results matching those of the Haswell-begun runs.

Many thanks to David for the generous gift of Xeon cycles and the IT support.

I've posted a tarball of the interim-checkpoint status files and the full final residue file via link added to the OP in this thread. I also have persistent full checkpoint files at 10Miter intervals, but as each such file is 2^29/8 = 64Mbyte, and there are 2^29/(10 million) = 53 such, the total archive is ~3.3GB. The residue files are more or less completely random byte data, i.e. cannot effectively be compressed. I have burned a copy of that archive to a thumb drive as a backup for the one on my HD, and will make it available on request should anyone wish to do e.g. a parallel run validation via rerun of said 10Miter intervals.

ewmayer 2018-01-12 07:32

1 Attachment(s)
Since I am currently in the process of adding PRP (as opposed to LL) testing for Mersenne numbers, as well as fast Suyama-style cofactor-PRP testing ('fast' in the 'when starting with a PRP-test residue for the modulus in question' sense) for both Mersenne and Fermat numbers to my Mlucas code, this seems worth sharing here. While going through a box of some old documents today, I came across a folder of overhead-projector transparencies for several talks I gave back in my 6-year stint in academia from '93-'99. One set of slides was for the following talk I gave at the 1997 West Coast NT conference at Asilomar conference center in Pacific Grove, near Monterey. (My Christmas-gift-to-self last month was a $60 Epson flatbed scanner, which is paying for itself in all kinds of useful ways.)

There is an obvious small typo on slide 4, the 2nd bullet point should read (boldface highlights the needed fix) "...since (M_p-1)/2 is just a string of (p-1) binary ones, the Euler base-3 test amounts to a series of [b](p-2)[/b] squarings and multiplications by 3...":

CRGreathouse 2018-01-22 06:45

I never realized you were at Case!

ewmayer 2018-06-24 03:30

Update on the side-by-side Pépin tests of F30:

Run #1 using AVX-512 build on all 64 processors of the 1.3 GHz "GIMPS KNL" @64M FFT, begun 25 Mar 2017, just passed the 1/3 mark at 360M iterations. At 67.7 ms/iter the ETC is 31 Dec 2019 (ugh).

Run #2 using AVX2 build on all 32 processors of David Stanfill's 2.3 GHz Xeon server [E5-2698 v3] @60M FFT, begun 12 Nov 2017, is 20M iterations behind the KNL one. The best runtime I've gotten on this system is 48 ms/iter, but (for reasons not fully known, best guess is some combo of system load from other low-CPU processes and possibly temps) things more typically run 10-15% slower than that. The Xeon also tends to have more outages than the KNL, so let's add another 5% to that, say 57 ms/iter, ETC is mid-October 2019.

Unlike my F29 runs on my Haswell quad, there have been no interim-residue mismatches between the above 2 runs. Hooray for ECC memory!

One interesting issue I ran into recently with regard to future tests of larger Fermat numbers relates to residue shift, such as George's Prime95 code uses to effect different-data first-time and double-check runs of a given Mersenne number at the same FFT length for both runs: I recently finished adding support for this to my Mlucas code.

In the context of the LL test for Mersennes, the trickiest part, especially for highly optimized codes which sandwich the carry step between the final pass of the inverse FFT for one iteration and the initial pass of the forward-FFT for the next iteration (and as in my case, combine all 3 steps into the body of a single processing loop), is efficiently figuring out where to inject the shifted -2, which in an unshifted-residue LL test simply gets subtracted from the 0-element of the residue array. In the context of shift, one needs to first efficiently compute which residue word gets the shifted -2 - and remember, this is in the context of the crazy mix of 2 different wordsizes (floor(p/n) and ceiling(p/n)) mandated by the Crandall-Fagin DWT. Then one is typically dealing with a multithreaded decomposition of the above fused FFT-pass/carry-step loop, so one needs to figure out which thread is processing said array word. Lots of fun.

But the update of the residue shift count each iteration is simple: it's just a doubling (mod p), and the fact that p is prime means the odds are small that the resulting sequence of per-iteration shift counts will repeat in some kind of short-cycle sequence (or even worse, fall into some kind of limit cycle which is independent of the initial shift, i.e. which would be the same for 2 runs starting out with different shift values).
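Going back to the which-word-gets-the-shifted--2 question, here is a purely illustrative Python toy (not Mlucas code, toy p and n, and a naive linear scan purely to show the variable-wordsize bookkeeping), assuming the standard Crandall-Fagin per-word bit counts ceil(p*(j+1)/n) - ceil(p*j/n):

[code]# Illustrative toy only: locate the residue word containing global bit s0 in a
# variable-wordsize Crandall-Fagin layout of a p-bit exponent split across n words,
# assuming per-word bit counts b[j] = ceil(p*(j+1)/n) - ceil(p*j/n).
def ceil_div(a, b):
    return -(-a // b)

def locate_bit(p, n, s0):
    """Return (word index, bit offset within that word) for global bit s0."""
    acc = 0
    for j in range(n):
        bj = ceil_div(p * (j + 1), n) - ceil_div(p * j, n)   # floor(p/n) or ceil(p/n)
        if s0 < acc + bj:
            return j, s0 - acc
        acc += bj
    raise ValueError("s0 >= p")

print(locate_bit(p=89, n=8, s0=37))   # toy-sized example: -> (3, 3)[/code]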

For Fermat-mod arithmetic we have the opposite problem: There is no -2 carry needing to be injected each iteration as there is for the LL test, but the per-iteration shift-count doubling is problematic, since our modulus = 2^(2^n) + 1, i.e. our shift count gets doubled mod (2^n), which is guaranteed to yield (and then stay at) shift count s = 0 after at most n iterations. Instead updating the shift each iteration via s = 2*s+1, which one can do by including a scalar multiplier equal to the initial seed of the Pépin test, is little better, e.g. for n = 4, consider the kinds of shift-count sequences that result from different values of the initial shift:

0,1,3,7,15,15,...
1,3,7,15,15,...
2,5,11,7,15,15,...
3,7,15,15,...
4,9,3,7,15,15,...
5,11,7,15,15,...
6,13,11,7,15,15,...
7,15,15,...
8,1,3,7,15,15,...

The most promising avenue I've found here is to update the shift count via s = 2*s + random[0,1].
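Purely illustrative Python toy (nothing to do with the actual Mlucas code) contrasting the three update rules at n = 4:

[code]# Shift-count update rules for Fermat-mod squaring chains, modulo 2^n with n = 4:
# plain doubling collapses to s = 0, s -> 2s+1 locks at 2^n - 1 = 15 (cf. the
# table above), while s -> 2s + random-bit keeps producing fresh shift values.
import random

def run(update, s0, n=4, iters=8):
    s, seq = s0, [s0]
    for _ in range(iters):
        s = update(s) % 2**n
        seq.append(s)
    return seq

random.seed(1)
print(run(lambda s: 2*s,     5))                        # [5, 10, 4, 8, 0, 0, ...]
print(run(lambda s: 2*s + 1, 5))                        # [5, 11, 7, 15, 15, ...]
print(run(lambda s: 2*s + random.getrandbits(1), 5))    # wanders indefinitely[/code]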

The reason this becomes relevant for Fermat numbers only a few indices above F30 is this: so far I've been able to use dual residue-shift-less runs at slightly different FFT lengths to ensure that all but the initial handful of iterations for each run involve different FFT data. For example for F24 we can do one run at 896K and another at 960K or 1M. For F30, as noted above, the minimal FFT length has crept up relative to the next-higher power of 2 due to roundoff error trends, and each doubling of the modulus length squeezes that lower bound that much closer to the power of 2 FFT length. For F33 I have already put in place a special FFT length of form 63*2^k, but that large-ish odd composite radix is relatively inefficient, so if it proves slower than the power of 2 just above it, it will be useful to be able to do both runs at the same power-of-2 FFT length, using the residue-shift scheme to ensure different-data-ness.

ewmayer 2019-05-03 19:51

Has anyone heard from David Stanfill, a.k.a. airsquirrels, of late? I had my 2 side-by-side tests of F30 running on a pair of servers he hosts, one the GIMPS KNL (doing the run @64M FFT, last check ~70% done) another a 32-core Intel Xeon server (doing the run @60M FFT). The Xeon started its run some months after the KNL but was running faster (~55 ms/iter versus 68 ms/iter) and had pulled ahead of the KNL early this year. But I lost ability to ssh into it (from the KNL is only option, the way David configured things) a couple months ago, sent David several e-mails to that effect, no reply. And as of several days ago I can no longer access the KNL, either. I'm starting to get a bit worried - his forum profile indicates last activity back in January.

paulunderwood 2019-05-03 20:04

[QUOTE=ewmayer;515671]Has anyone heard from David Stanfill, a.k.a. airsquirrels, of late? I had my 2 side-by-side tests of F30 running on a pair of servers he hosts, one the GIMPS KNL (doing the run @64M FFT, last check ~70% done) another a 32-core Intel Xeon server (doing the run @60M FFT). The Xeon started its run some months after the KNL but was running faster (~55 ms/iter versus 68 ms/iter) and had pulled ahead of the KNL early this year. But I lost ability to ssh into it (from the KNL is only option, the way David configured things) a couple months ago, sent David several e-mails to that effect, no reply. And as of several days ago I can no longer access the KNL, either. I'm starting to get a bit worried - his forum profile indicates last activity back in January.[/QUOTE]

Maybe you can find out where he is through this site:

[url]http://press.airsquirrels.com/2019/03/squirrels-llc-names-andrew-gould-new-ceo-announces-split-from-squirrels-research-labs/[/url]

I guess he is really busy!

ewmayer 2019-05-03 20:09

[QUOTE=paulunderwood;515672]Maybe you can find out where he is through this site:

[url]http://press.airsquirrels.com/2019/03/squirrels-llc-names-andrew-gould-new-ceo-announces-split-from-squirrels-research-labs/[/url]

I guess he is really busy![/QUOTE]

Thanks - I knew well how busy he was, but he never failed to reply to multiple e-mails over such a long span previously. (Including one more "you are still among the living, I trust?" missive last night). And I hate to nag, really don't want to have to resort to trying to reach him via phone.

paulunderwood 2019-05-03 20:17

[QUOTE=ewmayer;515673]Thanks - I knew well how busy he was, but he never failed to reply to multiple e-mails over such a long span previously. (Including one more "you are still among the living, I trust?" missive last night). And I hate to nag, really don't want to have to resort to trying to reach him via phone.[/QUOTE]

I did a "whois squirrelsresearch.com" and the closest I could get is [email]squirrelsresearch.com@domainsbyproxy.com[/email] You could also try this form:
[url]http://www.squirrelsresearch.com/contact/[/url]

HTH

ewmayer 2019-05-03 20:55

[QUOTE=paulunderwood;515674]I did a "whois squirrelsresearch.com" and the closest I could get is [email]squirrelsresearch.com@domainsbyproxy.com[/email] You could also try this form:
[url]http://www.squirrelsresearch.com/contact/[/url]

HTH[/QUOTE]

Just sent a missive via the contact form - thanks!

Tooryalai 2019-05-03 23:21

NO MORE FERMAT PRIMES
 
Here’s the proof that there are no more Fermat Primes:

The following website calculates the total number of integers that equal a given totient number:

[url]http://www.numbertheory.org/php/carmichael.html[/url]

Plug in n = 2k, k ≥ 33

Plug in e = 0

Plug in f = 0

All the solutions for n = 2k (k ≥ 33), contain exactly 32 solutions.

All 32 solutions are even and are equal to 2 * the solutions for the previous n.

Because there are no odd numbers, and that all subsequent solutions will just be a power of 2 times any of the 32 previous solutions, then there can’t be any more prime numbers.

Prime numbers are odd, and there’s only even solutions for n = 2k (k ≥ 33).

Therefore, there are no more Fermat prime numbers.

LaurV 2019-05-04 05:10

Disclaimer: "2k" in your post should be read as 2[SUP]k[/SUP].

Do you understand what that calculator does, and how?

ewmayer 2019-05-05 21:54

Still nary a peep from David ... I think I need to start looking for a new Xeon server to finish my F30 runs on ... what is the cheapest 32-core avx2-capable Xeon server in the several-years-old used-hardware market, I wonder. The other problem is that while I have recent local-download copies of the run-status file and every-10M-iter savefiles for the 64M-FFT run, my 60M-FFT run-status file is 5-6 months out of date, i.e. I have no proof that I did the last ~150M iters or so at that FFT length. Of course one can use the 10Miter savefiles to do a sped-up parallel verification (multiple machines, each reproducing a separate 10Miter subinterval), but rather a spot of bother in any event.

GP2 2019-05-07 00:45

Before Robert Gerbicz came up with Gerbicz error checking for PRP tests for Mersenne numbers, I think he invented [URL="https://www.mersenneforum.org/showthread.php?t=22510"]something similar for Pépin tests[/URL].

So I wonder if it would make sense to pause the tests on F30 in order to incorporate error checking, to increase the confidence in the final result.

ewmayer 2019-05-09 03:25

[QUOTE=GP2;515983]Before Robert Gerbicz came up with Gerbicz error checking for PRP tests for Mersenne numbers, I think he invented [URL="https://www.mersenneforum.org/showthread.php?t=22510"]something similar for Pépin tests[/URL].

So I wonder if it would make sense to pause the tests on F30 in order to incorporate error checking, to increase the confidence in the final result.[/QUOTE]

Good suggestion, but in my case the whole point of doing side-by-side runs at FFT lengths 60M and 64M, with the hardware/FFT-length matched up to give similar throughput, was to be able to do regular cross-checking of interim residues. Before I lost access to the 32-core Avx2-Xeon system that run was perhaps 2-3% ahead of the KNL run, i.e. a few tens of millions of iterations.

I've had a generous offer from a fellow forumite to use a 32-core Skylake Xeon system he owns (or has access to) to finish F30 - depending on what his detailed tuning-run timings look like, we may be able to use 16 cores for each of 2 runs at 60M and 64M, respectively, and get throughput not much less than I was getting before. The only issue is the missing run-status file documentation of the 60M run, where the last copy I ftp'ed over to my PC is iter ~530M, roughly 200Miter behind the last check of that run before the Avx2 Xeon went AWOL. I may simply start both 60M and 64M runs from the 740Miter savefile produced by the 64M run (last I copied to my PC before the KNL went AWOL), and hope to reach someone at Squirrels Inc who can help me recover the missing files between now and end of the year. Failing that I can re-do the missing 200Miter starting with the 530Miter savefile at a later date in order to fill in the gap.

Xyzzy 2019-05-09 15:01

For future runs, started from scratch, would Gerbicz error checking eliminate the need for parallel runs?

axn 2019-05-09 17:07

[QUOTE=Xyzzy;516242]For future runs, started from scratch, would Gerbicz error checking eliminate the need for parallel runs?[/QUOTE]

Yes. Even for the present run, you can start from a known good check point and continue with GEC.

ewmayer 2019-05-09 19:39

[QUOTE=axn;516266]Yes. Even for the present run, you can start from a known good check point and continue with GEC.[/QUOTE]

I'm not so sure about that - it's the same issue that's been debated in other threads hereabouts: "Does GEC means GIMPS can skip double-checks?" GEC can certainly give the tester much-greater confidence in the results of his run, but then there remains the issue of "now you need to convince the rest of the world".

My sense is that for academic-style comp-NT research like Fermat testing you're always going to need at least 2 independent runs. The academic-CS types are already gonna look askance at the fact of 2 runs done using floating-point rather than integer math. I will post the complete set of 10Miter interim residue files online after my runs complete so anyone who desires to do so can do a parallel DC using whatever code/algorithm they like, but personally I've no interest in (IMO) wasting time on the kind of pure-integer parallel verification we did for F24.

Mysticial 2019-05-09 20:00

Personally, I'm part of the "independent run" camp because that limits the number of failure points that can lead to an incorrect announcement.

What's the probability that GEC fails to detect an error? How many GEC checks are there? What about implementation bugs? Is it true that a single missed error among all the checks will result in an undetected wrong result?

The rule that I usually follow is: "Can an adversary cause a misleading result by injecting a single error into a flawless computation at just the right place and time?"

I don't know how GEC is implemented. But let's say your save file encounters a bit flip from when you wrote it to when you read it. Will it catch that?

ewmayer 2019-05-09 22:04

A bit of good news re. the missing data from my 60M-FFT F30 run ... that was running on a 32-core avx2 Xeon, the last time I was able to access said machine was late February (Feb 21, as I now can precisely date.) I knew that run was at iteration ~700M at that point, but my local copy of the run-status file was months older, and only covered through iteration 531.4M. But! - and this is one of the reasons I keep my macbook classic's OS frozen at the wonderfully stable OS X 10.6.8, despite the attendant lack-of-support headache, and why I set my xterms to save an effectively unlimited command history - it turns out the Xterm from which I would do my several-times-weekly run-status quick-checks has data going all the way back to the above file-download date, last October. And a check of when-did-I-last-need-to-reboot confirms:
[i]
MacBook:obj_sse2 ewmayer$ uptime
14:56 up 206 days, 15:51, 3 users, load averages: 0.22 0.24 0.39
[/i]
So by simply scrolling down through said window's long captured history I was able to recover a nicely granular snapshot of the 'missing' run history. My regular run-status checks consisted of 'tail -2 f30.stat && date', here the last portion of the thusly-recovered history:
[code][Feb 19 15:10:28] F30 Iter# = 681200000 [63.44% complete] clocks = 01:36:54.030 [ 0.0581 sec/iter] Res64: C120506BBDB97A13. AvgMaxErr = 0.212160316. MaxErr = 0.312500000.
[Feb 19 16:50:10] F30 Iter# = 681300000 [63.45% complete] clocks = 01:39:38.457 [ 0.0598 sec/iter] Res64: 63EA4A0A58D2DCFA. AvgMaxErr = 0.212213476. MaxErr = 0.312500000.

[Feb 21 20:10:37] F30 Iter# = 684600000 [63.76% complete] clocks = 01:35:04.862 [ 0.0570 sec/iter] Res64: D4F7D0654B95C2C3. AvgMaxErr = 0.212131626. MaxErr = 0.281250000.
[Feb 21 21:48:03] F30 Iter# = 684700000 [63.77% complete] clocks = 01:37:22.205 [ 0.0584 sec/iter] Res64: 9E0B4E5DD7C1C0A5. AvgMaxErr = 0.212171292. MaxErr = 0.312500000.[/code]
My KNL run @64M went up through just a little past iteration 730M, so assuming the avx-512 Skylake-Xeon timings @60M are, say, ~10% faster than @64M, I'll probably have my new cycle donor restart the 60M run from the iteration 680M checkpoint file and the 64M run from the iteration 730M checkpoint file, and let the former slowly catch up with the latter.

Prime95 2019-05-09 23:12

[QUOTE=Mysticial;516293]What's the probability that GEC fails to detect an error?[/quote]

Infinitesimal to a very large power.

[quote]How many GEC checks are there?[/quote]

Irrelevant.

[quote] What about implementation bugs?[/quote]

None known in prime95 29.6 (29.4 had one). Of course, if they were known we would fix them.

[quote] Is it true that a single missed error among all the checks will result in an undetected wrong result?[/quote]

A proper GEC implementation won't have a single missed error.

[quote]The rule that I usually follow is: "Can an adversary cause a misleading result by injecting a single error into a flawless computation at just the right place and time?"[/quote]

I've tried to make sure that can't happen, but now that you mention it there may be a way. If the loop counter is corrupted right after a GEC check there might be an issue. I'll have to go read the code. I think stack corruption that controls program flow (or instruction pointer corruption) is the biggest vulnerability.

[quote]I don't know how GEC is implemented. But let's say your save file encounters a bit flip from when you wrote it to when you read it. Will it catch that?[/QUOTE]

Yes. There are always two bignums in memory (and the save file). If either is bit-flipped a later GEC check will catch the error.


Taking all the above into account, I agree that double-checks are still necessary, albeit at a lower priority than LL double-checking.

axn 2019-05-10 02:47

[QUOTE=ewmayer;516290]I'm not so sure about that - it's the same issue that's been debated in other threads hereabouts: "Does GEC means GIMPS can skip double-checks?" GEC can certainly give the tester much-greater confidence in the results of his run, but then there remains the issue of "now you need to convince the rest of the world".

My sense is that for academic-style comp-NT research like Fermat testing you're always going to need at least 2 independent runs. The academic-CS types are already gonna look askance at the fact of 2 runs done using floating-point rather than integer math. I will post the complete set of 10Miter interim residue files online after my runs complete so anyone who desires to do so can do a parallel DC using whatever code/algorithm they like, but personally I've no interest in (IMO) wasting time on the kind of pure-integer parallel verification we did for F24.[/QUOTE]

Unfortunately, convincing the rest of the world is more of an expectation-management thing than an actual mathematical one. So you're probably right that an independent double check is necessary in these scenarios. However ...

A simple scheme: a single GEC run, plus saving all the GEC-checked full residues, should together constitute a verifiable proof. With these residues, you or anyone else can do an independent double check later. In fact, this could be a novel angle to an academic paper -- how you managed to do a single run to positively verify a number using GEC!
/IMO

Mysticial 2019-05-10 06:30

[QUOTE=Prime95;516316]
Taking all the above into account, I agree that double-checks are still necessary, albeit at a lower priority than LL double-checking.[/QUOTE] I imagine that some kind of double-check is needed anyway to protect against Byzantine faults.

ewmayer 2019-05-16 19:06

Thanks to a generous donation of cycles on a 32-core avx512 machine by a forumite who wishes to remain anonymous, the 60M-FFT F30 run has been restarted from the 680Miter savefile. At that FFT length this machine gets 60ms/iter using 16 cores, but the parallel scaling deteriorates beyond that: using all 32 cores we get 50ms/iter. In contrast, David Stanfill's 32-core avx2 Xeon which I had been using for that run was in the 80-90ms range using 16 cores, but continued to scale nicely up to the full 32, getting 50 ms/iter as its fastest timing. More typically, due to other system loads (mainly the GPU), I got ~55 ms/iter on that one.

I am still trying to find out whassup w/David and the GIMPS KNL; to that end I sent e-mail to several addresses (squirrelsresearch.com@domainsbyproxy.com, [email]press@airsquirrels.com[/email]) last weekend, but have received no reply. I was hoping to arrange to have the KNL ground-shipped to a GIMPS officer (maybe Aaron Blosser, who admins the Primenet server, has room) and then restart my 64M run on it.

ewmayer 2020-01-19 21:37

My anonymous cycle donor's continuation of my 2/3-complete run of F30 @60M FFT length finished on Friday; he has given me permission to thank him by name: it is Ryan Propper, who also did some yeoman's mass-DC work for GIMPS last year. Here are the final residues; again, the (mod 2^36) member of the now-traditional Selfridge-Hurwitz residue triplet is simply a decimal-form recapitulation of the low 9 hexits of the GIMPS-style Res64:
[code][Jan 15 11:28:26] F30 Iter# = 1069900000 [99.64% complete] clocks = 01:34:00.016 [ 0.0564 sec/iter] Res64: D4E1ADB4D56B90F5. AvgMaxErr = 0.184720270. MaxErr = 0.281250000.
[Jan 15 13:02:36] F30 Iter# = 1070000000 [99.65% complete] clocks = 01:34:05.574 [ 0.0565 sec/iter] Res64: E0A3C552712AD603. AvgMaxErr = 0.184693172. MaxErr = 0.281250000.
[Jan 15 14:36:48] F30 Iter# = 1070100000 [99.66% complete] clocks = 01:34:05.895 [ 0.0565 sec/iter] Res64: 44DE22333F576C20. AvgMaxErr = 0.184676185. MaxErr = 0.281250000.
[Jan 15 16:09:26] F30 Iter# = 1070200000 [99.67% complete] clocks = 01:32:33.138 [ 0.0555 sec/iter] Res64: C92DEFA95553316A. AvgMaxErr = 0.184739228. MaxErr = 0.265625000.
...
[Jan 17 23:54:10] F30 Iter# = 1073741823 [100.00% complete] clocks = 00:38:54.202 [ 0.0558 sec/iter] Res64: A70C2A3DB98D6D9D. AvgMaxErr = 0.184703967. MaxErr = 0.250000000.
F30 is not prime. Res64: A70C2A3DB98D6D9D. Program: E17.1
F30 mod 2^36 = 58947628445
F30 mod 2^35 - 1 = 26425548225
F30 mod 2^36 - 1 = 59810773698[/code]
It's kinda stunning to see those iteration counts over 1 billion in the stat-file checkpoint lines - one knows they're gonna occur, but still, seeing actual run data like that for the first time is a bit of a "wow" experience.

These still await confirmation by way of completion of the second run @64M FFT - I had been doing both the 60M and 64M runs on hardware hosted by David Stanfill, the 64M one was on the GIMPS-crowdfunded Intel Knights Landing machine which we used for early avx-512 code development. David went AWOL last spring, and Ryan has kindly agreed to also finish the 64M-FFT run on hardware available to him. Restarting that one @iteration 730M, at the 60 ms/iter he is getting on the same 32-core AVX-512 Intel Skylake virtual machine he used to finish the 60M run, the ETC is 8 months from now.

Ryan also sent me the every-10Miter persistent residue savefiles (in the same kind of FFT-length-independent bytewise format Mlucas uses for Mersenne-test savefiles) for his run, so - again pending cross-confirmation via the 64M run - I have a complete chain of 107 such files (128 Mbyte each) which can be used for any future independent confirmation of the F30 computation in distributed-computation fashion by multiple machines, each crunching one such 10Miter subinterval and verifying the next link in the residue chain.

JeppeSN 2020-01-25 19:01

Does this simply say that F30 is composite, or does it help to determine whether the cofactor F30 / ((149041*2^32 + 1)*(127589*2^33 + 1)) is composite ([URL="https://www.mersenneforum.org/showthread.php?t=19840"]Suyama[/URL])? /JeppeSN

ewmayer 2020-01-25 19:55

@Jeppe: This is simply the basic Pépin residue ... I have Suyama cofactor-PRP code based on my own custom GMP-style bigint library also in place, but the various short-length residues of said test for the smaller Fermats - where others have done the Pépin+Suyama and published e.g. the Res64 for the latter - do not agree with mine. It may simply be a difference in some convention re. generation of the residue, but I need to dig into things and determine its source before I make any public announcements re. the F30 cofactor. (And also one wants the 2nd double-check run at the slightly larger FFT length to finish and confirm the Pépin-test results.)

R. Gerbicz 2020-01-25 20:52

[QUOTE=ewmayer;535962]@Jeppe: This is simply the basic Pépin residue ... (And also one wants the 2nd double-check run at the slightly larger FFT length to finish and confirm the Pépin-test results.)[/QUOTE]

Have you checked that:
(3^(2^(2^m)) mod F_m) mod q = 3^(2^(2^m)) mod q
is true for the stored interim residues?

Since you have 3^(2^(2^m)) mod F_m, getting the left side is trivial, and you can quickly get the right side because q is small, so the check is cheap. Though it is not a very-very strong check, the two known factors could find a mismatch with probability = 1-(1e-10), which is still quite impressive. Note that we don't get the strength of 1/(q-1) or 1/znorder(Mod(3,q)) because the order is divisible by a "large" power of two, and that will be lost in the iterated squarings of a wrong residue.

ewmayer 2020-01-27 21:41

[QUOTE=R. Gerbicz;535966]Have you checked that:
(3^(2^(2^m)) mod F_m) mod q = 3^(2^(2^m)) mod q
is true for the stored interim residues?

Since you have 3^(2^(2^m)) mod F_m, getting the left side is trivial, and you can quickly get the right side because q is small, so the check is cheap. Though it is not a very-very strong check, the two known factors could find a mismatch with probability = 1-(1e-10), which is still quite impressive. Note that we don't get the strength of 1/(q-1) or 1/znorder(Mod(3,q)) because the order is divisible by a "large" power of two, and that will be lost in the iterated squarings of a wrong residue.[/QUOTE]

ITYM (3^(2^(2^m - 1)) mod Fm) mod q = 3^(2^(2^m - 1)) mod q, since the Pépin test is just an Euler-pseudoprime test (which happens to prove rigorous in the special case of the Fm), i.e. computes 3^((Fm - 1)/2) mod Fm = 3^(2^(2^m - 1)) mod Fm. But in any event, that's a good suggestion - fiddled my code to load the final Pépin-test residue R from the savefile and for the 2 known small prime factors p=640126220763137 and q=1095981164658689 computed

R == 367500396407933 (mod p) and
R == 95971534699540 (mod q).

I then separately computed 3^(2^(2^m - 1)) mod p and q via (2^m-1) = 1073741823 iterated squarings modulo p and q, separately, and the results match, which anyone can confirm on even quite modest hardware. So we have quite good confidence in the result, on top of the 2 runs agreeing through ~740 Miter (as of the latest 64M-FFT run update from Ryan), and said runs being done on high-end server systems with ECC memory.

For F31 I will also be using the now-famous periodic whole-residue check named in honor of the above-quoted forum member. :)

R. Gerbicz 2020-01-28 08:49

[QUOTE=ewmayer;536075]
I then separately computed 3^(2^(2^m - 1)) mod p and q via (2^m-1) = 1073741823 iterated squarings modulo p and q, separately, and the results match, which anyone can confirm on even quite modest hardware.[/QUOTE]

Trivial note: it is faster to reduce the exponent mod (p-1) [use Fermat's little theorem].
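To make the check concrete, a few lines of Python (illustrative only, not Mlucas code) compute the small-modulus side with the exponent reduced mod (p-1) as just noted; the printed values can be compared against the R (mod p) and R (mod q) values quoted above, assuming those are correct:

[code]# Small-modulus side of the Pepin-residue check for F30: compute
# 3^(2^(2^m - 1)) mod p for each known prime factor p, first reducing the
# exponent 2^(2^m - 1) mod (p-1) via Fermat's little theorem.
m = 30
for p in (640126220763137, 1095981164658689):
    e = pow(2, 2**m - 1, p - 1)   # reduced exponent
    print(p, e, pow(3, e, p))     # compare against R mod p / R mod q above[/code]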

ewmayer 2021-07-02 18:57

Fermat-modmul with circularly shifted residues
 
I spent an appreciable amount of time last year wrestling with basic concepts and implementation-in-code of Fermat-mod squaring chains in the presence of residue shift. This proves rather more interesting (a polite euphemism for "pain in the butt") than for the Mersenne-mod case. The Gerbicz check renders it less important, but it may still prove useful in some contexts, so figured I'd post a summary.

Let s0 = initial shift, i.e. instead of starting the Pépin test of Fm with initial seed x0, we start with (x0 << s0) (mod Fm).

[1] Data layout: for Mersenne-mod IBDWT, we have 2 different wordsizes differing by one bit, so need to loop over the elements in our variable-wordsize array until we find the one containing the (s0)th bit, then shift x0 by the appropriate small number of bits to place it on the proper bit boundary within that target word. For Fermat-mod, the easy case is that of power-of-2 FFT length, for which we have a fixed power-of-2 wordsize. More generally we will have a non-power-of-2 FFT length optimized for the target modulus - e.g. for F24 we can use 896 Kdoubles, and for F30 I did one run at 60M and the other (now 93% complete) at 64M. In both cases we use a special acyclic-transform-effecting DWT which multiplies our N input words by the first N (2N)th complex roots of unity, i.e. the (j)th input gets multiplied by DWT weight w[j] = exp(I*Pi*j/N). This means multiplying our original real input word by a complex number, but we notice that w[j+N/2] = I*w[j], making it natural to group the (j)th and (j+N/2)th input words together in memory and treat them as a single input to a complex FFT. This is the so-called "right-angle transform" trick, and is detailed in the original Crandall/Fagin 1994 DWT paper. (A toy numerical illustration of this weighting appears at the end of item [1], below.) In the Mersenne-mod IBDWT case we can also treat adjacent even/odd-indexed input elements as a single input to a complex FFT, but as the Mersenne-mod IBDWT weights are strictly real, these are simply pairs of adjacent input words.

For Fermat-mod arithmetic using a non-power-of-2-length transform we need to layer a Mersenne-style dual-wordlength base representation and similarly defined IBDWT weights atop our acyclic-effecting DWT. One simplification relative to the Mersenne case is that for FFT length [odd]*2^k, the Fermat-mod wordsizes and IBDWT weights recur in repeating length-[odd] patterns.

In the context of a right-angle Fermat-mod acyclic transform, finding the target word for a given initial-seed shift means looping first over the even-indexed elements of our residue vector (considered here as a length-N real array), then if the accumulated wordsizes are still less than the target shift s0, looping over the odd-indexed elements until we reach the target word. For the Fermat-mod case we need only do this for the initial seed of the Pépin test; for the Mersenne-mod case, specifically the Lucas-Lehmer test, we must also repeat the exercise on each subsequent iteration in order to find the proper bit offset for injection of the -2 term added to the output of each mod-squaring. This can be done very efficiently - specifically in O(1) time - by initializing a suitable bitmap encoding the variable wordsizes of the Mersenne-mod IBDWT and using that to effect a fast lookup scheme.
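Here is the promised toy illustration of the acyclic weighting (numpy, tiny digit vectors, no balanced digits or dual-base IBDWT - purely to show the mechanism): multiplying two numbers modulo base^N + 1, with each held as N base-`base` digits, amounts to a negacyclic convolution of the digit vectors, and pre-weighting the inputs by w[j] = exp(I*Pi*j/N) turns that into an ordinary length-N cyclic (FFT) convolution:

[code]# Toy demo: negacyclic (acyclic) convolution via the exp(I*Pi*j/N) weighting.
import numpy as np

N, base = 8, 10
rng = np.random.default_rng(1)
a = rng.integers(0, base, N)
b = rng.integers(0, base, N)

w = np.exp(1j * np.pi * np.arange(N) / N)        # first N (2N)th roots of unity
c = np.fft.ifft(np.fft.fft(a * w) * np.fft.fft(b * w)) / w
c = np.rint(c.real).astype(np.int64)             # un-carried negacyclic digits

def to_int(v):
    return sum(int(x) * base**i for i, x in enumerate(v))

assert to_int(c) % (base**N + 1) == (to_int(a) * to_int(b)) % (base**N + 1)
print("weighted cyclic convolution == multiplication mod base^N + 1")[/code]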

[2] For a basic squaring-chain-with-shift scheme as we do for Mersenne number M(p), each mod-squaring doubles the shift count (mod p). That means that for any s0 != 0 (mod p) the shift will remain nonzero and nonrepeating during the LL/PRP-test, since at squaring i, our shift s = s0.2^i (mod p) and the multiplicative group (mod p) is of maximal order p-1.

OTOH for squaring chains modulo a Fermat number Fm, each subsequent squaring doubles the shift count (mod 2^m), so we have the problem that any nonzero initial shift s0 will lead to a shift s = 0 after precisely m square-mods, and the shift will remain 0 throughout the remainder of the squaring chain. One way to get around this problem is to define an auxiliary pseudorandom-bit array, and on the (i)th square-mod, if bit[i] = 1, do a further mod-doubling of the residue, thus yielding the shift-count update s[i] = 2.s[i-1] + bit[i]. For a pair of Pépin tests of Fm such as are typically done to allow for real-time cross-validation of interim residues, it is also advisable to seed the bit-arrays differently for each run.

[3] For the random-bit-offset scheme in [2], we further need a postprocessing step to determine the correct sign to apply to the final residue; this relates to the shift-doubling modulo the binary exponent which occurs on each square-mod. For Mp, squaring a shifted residue R.2^s gives R^2.2^2s (mod Mp) and 2^2s == 2^(2s-p) (mod Mp), so at each iteration we simply double the shift count (mod p). For arithmetic modulo Fm, by way of contrast, each time the doubled shift count incurs a reduction (mod 2^m) the shifted residue undergoes a change of sign: R^2.2^2s (mod Fm) == -R^2.2^(2s-2^m) (mod Fm), so a shifted-residue scheme must keep track of both the shift count and the sign of the shifted residue, relative to the true unshifted one.

However: since - unlike LL - this PRP test involves just repeated mod-squaring, we don't care if the sign of the residue on any particular intermediate iteration is flipped, since the subsequent mod-squaring will flip it back to +. The only thing we care about is whether the sign of the final residue (either for the Pépin test or for a particular checkpoint subinterval thereof) has flipped w.r.to that of its immediate predecessor, the penultimate residue, since such a final-squaring sign-flip has no ensuing mod-squaring to autocorrect it. Thus, if there is a final-squaring sign flip we need to unflip the sign of the resulting residue manually. But that is not all, since generating a shift-removed final residue also involves a modular multiply or divide by a power of 2. I've always found it conceptually easier to multiply-mod rather than to divide-mod, so for both that and less-code-to-maintain reasons, my mi64 multiword int library has low-level code only for leftward circular shift, and effects s-bit rightward circular shift of a b-bit number by a (b-s)-bit leftward one. That is all that is needed working modulo a Mersenne number, as these are of form 2^p-1 and thus any bits off-shifted to the left get rewrapped from the right with + sign. Once again, Fermat moduli are trickier. Modulo Fm, a leftward circular shift must feed the bits off-shifted to the left back into the right end with a - sign, and the aforementioned shrc-done-as-shlc substitution implies a negation (mod Fm) of the resulting shift-removed residue.

Thus, if at end of our current checkpoint interval or Pépin test, our residue R needed a sign-flip due to the doubled-shift-count needing a reduction (mod 2^m), our shrc-done-as-shlc already took care of it. If not, the residue needs an explicit negation. Rather than doing an explicit Fm - R with the attendant digitwise negations and borrows, one can simply do a bitwise-complement of the residue vector and further += 2 of limb 0 in order to recover the unshifted residue.
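And a tiny numerical check of the minus-sign wraparound described above (toy-sized b, plain Python, nothing to do with the actual mi64/shlc code): modulo N = 2^b + 1, bits shifted off the top re-enter at the bottom with a minus sign, whereas modulo 2^b - 1 they would wrap with a plus sign:

[code]# For N = 2^b + 1, left-shifting by s and wrapping the off-shifted high bits
# back in with a minus sign reproduces multiplication by 2^s (mod N).
b, s = 16, 5
N = 2**b + 1
for x in range(N):
    t = x << s
    hi, lo = t >> b, t & (2**b - 1)
    assert (lo - hi) % N == (x * 2**s) % N
print("shl with negative wrap == multiply by 2^s (mod 2^b + 1)")[/code]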

ewmayer 2022-05-13 21:54

F25-F30 cofactor status, Part 1
 
[I PMed a more-or-less-identical draft of this to Prime95, ATH; yorix, jasonp, philmoore and R. Gerbicz yesterday]

[b]F25-F30 cofactor status:[/b]

In the context of testing the cofactor-PRP functionality which will be supported in Mlucas v21, I've at long last done Suyama-style PRP tests on the cofactors of F25-F30, using the Pépin-test residues I've generated for same over the past 10 years, as detailed in this thread - original primality-tests for F25-26 detailed in post #1 (recently updated with latest results), F27 in #2 and #5, F28 in #4 and #20, F29 in #29,#42,#52. For F30, the interim-residue files and logfile for the completed run @60 M FFT have been uploaded but the 99.5%-complete double-check run @FFT 64M is stalled due to my KNL having gone offline last week - I suspect the cheapie PSU installed by the vendor of this barebones system, but not yet had time to get under the hood, so to speak.

The cofactor-PRP results are tabulated in a "Part 2" followup to this note. My code only supports the Suyama cofactor-PRP method, which starts from the basic Pépin-test residue, does one mod-squaring to generate the Fermat-PRP residue A from that and an additional lg2(F) mod-squarings to generate the subtrahend B for the Suyama test, where F is the product of known prime factors; it makes little sense to spend roughly the same amount of work on a direct PRP test of such cofactors as is needed by the Pépin test of F_m, only to have to re-do a similar amount of work when a new factor is discovered.

I first added some basic cofactor-PRP code to Mlucas in early 2018, and used it to test the cofactors of F25-29 using the Pépin-test residues I had generated via my primality tests for same. Did not post the cofactor-PRP results at the time because the accompanying Res64 values mismatched ones posted by Andreas Höglund (a.k.a. ATH), who had done cofactor-PRP tests of F25 and F26 using George's code. At the time I was trying to figure out why said result mismatched Andreas' value for what I thought was the same quantity, and in the ensuing PM exchange it did emerge that George's code-at-the-time (2009) which Andreas used in fact was doing a direct-PRP-test of the cofactor, but I was unaware of the cofactor-PRP runs for F25 and F26 done by forumite Yar (a.k.a. yorix - details below) in late 2017 using the then-latest Prime95 version. Having more pressing concerns at the time I decided to put that aside and revisit later. Mlucas v21 will have 2 major feature adds, PRP-CF and PRP-cert support, so in the context of the first, 'later' is now.

In 2009-2010 Andreas Höglund used George's code to test the cofactors of F25-27; cf. Posts #51, #62, #64 in [url=http://mersenneforum.org/showthread.php?t=12168]this thread[/url]:
[quote]UID: athath, F25/known_factors is not prime. RES64: 44BFC8D231602007. Wd1: B9307E03,00000000
Known factors used for PRP test were: 25991531462657,204393464266227713,2170072644496392193

UID: athath, F26/76861124116481 is not prime. RES64: 6C433D4E3CC9522E.

UID: athath, F27/151413703311361/231292694251438081 is not prime. RES64: 481F26965DE16117.[/quote]

Those posts were not clear on precisely what *type* of cofactor-PRP test was run - direct Fermat-PRP test on the cofactor C, or a Pépin-test (R = 3^((N-1)/2) (mod N)) of the Fermat number N = F*C followed by the Suyama postprocessing step (cf. the post by Phil Moore [url=https://mersenneforum.org/showpost.php?p=210599&postcount=67]here[/url]), where one computes first the Fermat-PRP residue A = 3^(N-1) (mod N) = R^2 (mod N) via a single mod-squaring of the Pépin-test residue R, then uses the product of known factors F to compute B = 3^(F-1) (mod N), and checks if the difference (A - B) is divisible by the cofactor C. The Suyama version is preferable because it starts with a basic Pépin-test of the N, and as new factors are found that same residue can be used to quickly (#of squarings = bits in product-of-known-prime-factors) check each resulting cofactor for PRP-ness.
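To spell out the Suyama bookkeeping, here is a toy-sized, purely illustrative Python sketch using F6 = 2^64+1 and its known factor 274177 (that cofactor happens to be prime, so the test should report a base-3 PRP); for the large Fm the structure is identical, except that the Pépin residue R is read from a savefile rather than computed directly:

[code]# Suyama-style cofactor-PRP check, toy example on F6 = 2^64 + 1.
from math import gcd

m = 6
N = 2**(2**m) + 1
F = 274177                       # product of the known prime factors
assert N % F == 0
C = N // F                       # cofactor under test

R = pow(3, (N - 1) // 2, N)      # Pepin residue 3^((N-1)/2) mod N
A = R * R % N                    # Fermat-PRP residue 3^(N-1) mod N
B = pow(3, F - 1, N)             # subtrahend 3^(F-1) mod N
print("cofactor is a base-3 PRP" if (A - B) % C == 0 else "cofactor is composite")
print("GCD(A-B, C) =", gcd(A - B, C))   # prime-power check (meaningful when composite)[/code]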

In late 2017 user Yar (don't know last name, uid = yorix) [url=https://www.mersenneforum.org/showthread.php?t=22668]posted a thread[/url] about his own cofactor-PRP runs for F25-26, again using George's code, but now mentioning 2 different run types for each, which clarifies the types of cofactor-PRP tests used:
[quote]PRP test for F25 with known factors: PRP=N/A,1,2,33554432,1,"25991531462657,204393464266227713,2170072644496392193" gives 'composite' with res64: 7B6B087B84A45562

PRP test for F26 with known factors: PRP=N/A,1,2,67108864,1,"76861124116481" gives 'composite' with res64: FBB406B3A281838C[/quote]
We see that Res64 differs from ATH's above - Yar notes "This test with base 3 and returns the same residue independent of the number of known factors", which indicates a Suyama-style cofactor-PRP test.

For F25 and F26 my Fermat-PRP Res64 values match Yar's for the Suyama-style cofactor check, and his direct-cofactor results match ATH's (though that is a same-software-used match). But it would be nice if someone using George's code could confirm not just the Fermat-PRP residues (A) for the above, but for purposes of pronouncing the ensuing cofactor-PRP results "cross-verified using independent codes", we should also cross-check the (A - B) mod C ones, where C is the cofactor. The Fermat-PRP computation of course constitutes the overwhelming bulk of a PRP-CF run, but it is important to also have assurance that the B-residue and (A - B) mod C have been correctly computed.

[
@George, would it be possible for you to tweak your code to print the Res64 checksum for the latter quantity? If Yar still has the Fermat-PRP-test residues from his runs of F25 and F26 using your code, it would then be a simple matter to rerun the Suyama step from those and cross-check the (A - B) mod C residues. If not, the F25 and F26 full-length tests are now easily runnable on modest hardware.

Also, is there a different Prime95/mprime worktodo-entry syntax for a direct PRP test of a cofactor versus the 2-step approach (Fermat-PRP of N = F*C followed by the Suyama step), or did you switch from the former to the latter along the way? Yar only echoes the workfile entry for the 2-step run, and notes that he used v29.4; in a later post he notes retrying with v29.3 and getting results matching Andreas' earlier ones.
]

Lastly, I did not see any post by Yar regarding a Suyama-style cofactor-PRP test of F27 using Prime95. It would be nice for someone to run one so we can put that number to bed as well, at least until another factor is discovered. I could obviously do it myself, but I think the "different persons using different codes" optics would be better. Compute cost would be 20-30% more than a current GIMPS wavefront-PRP test.

Results for my Mlucas v21-prototype cofactor-PRP runs follow in Part 2.

ewmayer 2022-05-13 22:05

F25-F30 cofactor status, Part 2
 
In each case I begin with a quick Pépin-test residue sanity check, summarized here:
[quote]For PRP tests of N with known factors (e.g. as prelude to a cofactor-PRP postprocessing step),
R. Gerbicz [post #79 of [url]https://mersenneforum.org/showthread.php?t=18748&page=8[/url]] suggests, using
the specific case of the m-th Fermat number F_m to illustrate:
[quote]
Have you checked that:
(3^(2^(2^m)) mod F_m) mod q = 3^(2^(2^m)) mod q
is true for the stored interim residues?

Since you have 3^(2^(2^m)) mod F_m then to get the left side is trivial and you can get quickly
the right side, because q is small, the check is cheap. Though it is not that very-very strong
check, but the two known factors could find a mismatch with probability = 1-(1e-10) which is
still quite impressive. Note that we don't get the strength of 1/(q-1) or 1/znorder(Mod(3,q))
because the order is divisible by a "large" power of two, and that will be lost in the iterated
squarings of a wrong residue.
[/quote]
(For the Pépin-test residue, we'd actually check (3^(2^(2^m - 1)) mod F_m) mod q = 3^(2^(2^m - 1)) mod q.)
More generally, for the powermod residue R = a^pow (mod N) where N has prime factor q, we check whether

a^pow (mod N) = a^pow (mod q).

If pow > q, the RHS can be simplified using Fermat's little theorem, a^(q-1) == 1 (mod q), thus
a^pow == a^pow' (mod q), where pow' = pow (mod q-1).

Example: For the final Pépin-test residue R for F30 (N = 2^2^30 + 1, pow = (N-1)/2 = 2^(2^30 - 1)), with
2 known small prime factors p = 640126220763137 and q = 1095981164658689, we get

p: pow' = 2^(2^30 - 1) (mod p-1) = 433448099512320, R == 367500396407933 (mod p)
q: pow' = 2^(2^30 - 1) (mod q-1) = 96576634617856, R == 95971534699540 (mod q).[/quote]
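As a concrete illustration (a few lines of plain Python, not the Mlucas code), the small right-hand sides of the above F30 example can be reproduced as follows; the left-hand sides come from reducing the stored 2^30-bit Pépin residue mod each factor, and the two must agree:
[code]# Right-hand side of the sanity check for the F30 Pepin residue
# R = 3^(2^(2^30 - 1)) mod F30: for each known prime factor f, reduce the huge
# exponent mod (f-1) via Fermat's little theorem, then do a small powmod.
# The output should reproduce the pow' and residue values quoted above.
m = 30
for f in (640126220763137, 1095981164658689):
    e = pow(2, 2**m - 1, f - 1)            # pow' = 2^(2^m - 1) mod (f-1)
    print("factor", f, ": pow' =", e, ", R mod factor should be", pow(3, e, f))[/code]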
In the results below, for each prime factor, 'A: R ==' gives the full Pépin residue reduced (mod p) and 'B: R ==' gives the (mod p) squaring-chain checksum computed as above.

For all the key quantities I give the "enhanced Selfridge-Hurwitz residue" triplet, which consists of residues mod 2^64 (printed in hex), mod 2^35-1 and mod 2^36-1 (the latter two printed in decimal). Lastly, I computed GCD(A - B, C) to check whether each cofactor is possibly a prime power, as described in Phil Moore's Part-1-linked post [or cf. Crandall & Pomerance (2nd ed.), exercise 4.8]. Unsurprisingly, all cofactors come up "composite", and even more unsurprisingly, none is a prime power. The per-number results follow the short residue-triplet sketch below:
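A minimal Python sketch (not the Mlucas routine) of how the triplet is formed from an arbitrary-precision residue R:
[code]# "Enhanced Selfridge-Hurwitz residue" triplet for an arbitrary-precision residue R:
# low 64 bits in hex, plus R mod (2^35 - 1) and R mod (2^36 - 1) in decimal.
def sh_triplet(R):
    return "0x%016X,%d,%d" % (R & (2**64 - 1), R % (2**35 - 1), R % (2**36 - 1))

# e.g. sh_triplet((A - B) % C) gives the "Res64,35m1,36m1" values reported for each cofactor.[/code]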
[code][b]F25:[/b] (I also re-ran @2048K FFT and confirmed the same values):
Computing 33554431-squaring residue R (mod known prime q = 25991531462657)
A: R == 5785371429337 (mod q)
B: R == 5785371429337 (mod q)
Computing 33554431-squaring residue R (mod known prime q = 204393464266227713)
A: R == 45685916416586703 (mod q)
B: R == 45685916416586703 (mod q)
Computing 33554431-squaring residue R (mod known prime q = 2170072644496392193)
A: R == 991358595838356746 (mod q)
B: R == 991358595838356746 (mod q)
F25: using FFT length 1792K = 1835008 8-byte floats.
this gives an average 18.285714285714285 bits per digit
Doing one mod-F25 squaring of iteration-33554431 residue [Res64 = CB913B6E3B1FFE2A] to get Fermat-PRP residue
Fermat-PRP residue (A) = 0x7B6B087B84A45562,26994100025,66886679148
Processed 162 bits in binary modpow; MaxErr = 0.203125000
3^(F-1) residue (B) = 0x2909830729BFA110,30348546187,21153567739
(A - B) Res64 = 0x526185745AE4B453, C Res64 = 0x21CC1C03F0000001
(A - B)/C: Quotient = 10040206794592865905247433568023540557761720874501, Remainder Res64 = 0x7A551893ACF4BE4E
Suyama Cofactor-PRP test of F25 / 25991531462657 / 204393464266227713 / 2170072644496392193 with FFT length 1835008 = 1792 K:
(A - B) mod C has Res64,35m1,36m1: 0x7A551893ACF4BE4E,34178045549,65496009072.
This cofactor is COMPOSITE [C10100842]. Checking prime-power-ness via GCD(A - B,C) ...
GCD((A - B)[33554304 bits], C[33554270 bits]) = 1
Time for GCD = 00:00:10.857
Cofactor is not a prime power.

[b]F26:[/b] (I also re-ran @4096K FFT and confirmed the same values):
Computing 67108863-squaring residue R (mod known prime q = 76861124116481)
A: R == 10772922302714 (mod q)
B: R == 10772922302714 (mod q)
F26: using FFT length 3584K = 3670016 8-byte floats.
this gives an average 18.285714285714285 bits per digit
Doing one mod-F26 squaring of iteration-67108863 residue [Res64 = 2B58BDE2A0C14045] to get Fermat-PRP residue
Fermat-PRP residue (A) = 0xFBB406B3A281838C, 2193939191, 7371940776
Processed 46 bits in binary modpow; MaxErr = 0.187500000
3^(F-1) residue (B) = 0x25F5AB0FFC728C87,32173443562,20444894301
(A - B) Res64 = 0xD5BE5BA3A60EF705, C Res64 = 0x23FFBA1860000001
(A - B)/C: Quotient = 49561426973911, Remainder Res64 = 0xB7BD14979ACF222E
Suyama Cofactor-PRP test of F26 / 76861124116481 with FFT length 3670016 = 3584 K:
(A - B) mod C has Res64,35m1,36m1: 0xB7BD14979ACF222E, 6416457631,15466936163.
This cofactor is COMPOSITE [C20201768]. Checking prime-power-ness via GCD(A - B,C) ...
GCD((A - B)[67108864 bits], C[67108818 bits]) = 1
Time for GCD = 00:00:25.555
Cofactor is not a prime power.

[b]F27:[/b] (I also re-ran @7.5M and 8M FFT and confirmed the same values):
Computing 134217727-squaring residue R (mod known prime q = 151413703311361)
A: R == 36549260191848 (mod q)
B: R == 36549260191848 (mod q)
Computing 134217727-squaring residue R (mod known prime q = 231292694251438081)
A: R == 68921389296064753 (mod q)
B: R == 68921389296064753 (mod q)
F27: using FFT length 7168K = 7340032 8-byte floats.
this gives an average 18.285714285714285 bits per digit
Doing one mod-F27 squaring of iteration-134217727 residue [Res64 = 043A6C8B9272B297] to get Fermat-PRP residue
Fermat-PRP residue (A) = 0xC3B191C45CCD7820,21887650168,55240256170
Processed 104 bits in binary modpow; MaxErr = 0.281250000
3^(F-1) residue (B) = 0x485F292142F953EC,20036545122,26481669619
(A - B) Res64 = 0x7B5268A319D42434, C Res64 = 0xD8C9BECF60000001
(A - B)/C: Quotient = 13668745121613830518711853822333, Remainder Res64 = 0x79E89190C56372B7
Suyama Cofactor-PRP test of F27 / 151413703311361 / 231292694251438081 with FFT length 7340032 = 7168 K:
(A - B) mod C has Res64,35m1,36m1: 0x79E89190C56372B7, 2608954171,29330680430.
This cofactor is COMPOSITE [C40403531]. Checking prime-power-ness via GCD(A - B,C) ...
GCD((A - B)[134217664 bits], C[134217624 bits]) = 1
Time for GCD = 00:00:59.492
Cofactor is not a prime power.

[b]F28:[/b] (I also re-ran @16M FFT and confirmed the same values):
Computing 268435455-squaring residue R (mod known prime q = 1766730974551267606529)
A: R == 688455961638747199546 (mod q)
B: R == 688455961638747199546 (mod q)
F28: using FFT length 15360K = 15728640 8-byte floats.
this gives an average 17.066666666666666 bits per digit
Doing one mod-F28 squaring of iteration-268435455 residue [Res64 = 468731C3E6F13E8F] to get Fermat-PRP residue
Fermat-PRP residue (A) = 0x7CAE1275B362B2CA,20461105324,30894284803
Processed 70 bits in binary modpow; MaxErr = 0.296875000
3^(F-1) residue (B) = 0xBD496177DB5155A0,33633822046, 3778945069
(A - B) Res64 = 0xBF64B0FDD8115D2B, C Res64 = 0x39AEB33000000001
(A - B)/C: Quotient = 585290509330851034797, Remainder Res64 = 0x790BE051D73D667E
Suyama Cofactor-PRP test of F28 / 1766730974551267606529 with FFT length 15728640 = 15360 K:
(A - B) mod C has Res64,35m1,36m1: 0x790BE051D73D667E, 4159568998,55362539930.
This cofactor is COMPOSITE [C80807103]. Checking prime-power-ness via GCD(A - B,C) ...
GCD((A - B)[268435392 bits], C[268435386 bits]) = 1
Time for GCD = 00:02:16.895
Cofactor is not a prime power.

[b]F29:[/b] (I also re-ran @32M FFT and confirmed the same values):
Computing 536870911-squaring residue R (mod known prime q = 2405286912458753)
A: R == 2364493566098202 (mod q)
B: R == 2364493566098202 (mod q)
read_ppm1_savefiles: Hit EOF in read of FFT-kblocks ... assuming a pre-v18 savefile.
F29: using FFT length 30720K = 31457280 8-byte floats.
this gives an average 17.066666666666666 bits per digit
Doing one mod-F29 squaring of iteration-536870911 residue [Res64 = BECE1E5FFBE5EC12] to get Fermat-PRP residue
Fermat-PRP residue (A) = 0x9BD416DB3918C152,32748692453,58920206666
Processed 51 bits in binary modpow; MaxErr = 0.375000000
3^(F-1) residue (B) = 0x30AFC6010F62884B, 7473784009, 6544795868
(A - B) Res64 = 0x6B2450DA29B63908, C Res64 = 0x3FF7746780000001
(A - B)/C: Quotient = 1671099484324226, Remainder Res64 = 0xAB3DBCEFFE907786
Suyama Cofactor-PRP test of F29 / 2405286912458753 with FFT length 31457280 = 30720 K:
(A - B) mod C has Res64,35m1,36m1: 0xAB3DBCEFFE907786,13349243207,41564307636.
This cofactor is COMPOSITE [C161614233]. Checking prime-power-ness via GCD(A - B,C) ...
GCD((A - B)[536870912 bits], C[536870861 bits]) = 1
Time for GCD = 00:05:09.784
Cofactor is not a prime power.

[b]F30:[/b] (I also re-ran @64M FFT and confirmed the same values):
Computing 1073741823-squaring residue R (mod known prime q = 640126220763137)
A: R == 367500396407933 (mod q)
B: R == 367500396407933 (mod q)
Computing 1073741823-squaring residue R (mod known prime q = 1095981164658689)
A: R == 95971534699540 (mod q)
B: R == 95971534699540 (mod q)
F30: using FFT length 61440K = 62914560 8-byte floats.
this gives an average 17.066666666666666 bits per digit
Doing one mod-F30 squaring of iteration-1073741823 residue [Res64 = A70C2A3DB98D6D9D] to get Fermat-PRP residue
Fermat-PRP residue (A) = 0x1D7AA56A5428333C,30698378043,40327643421
Processed 99 bits in binary modpow; MaxErr = 0.406250000
3^(F-1) residue (B) = 0x6F856680482D3F4E, 9072042323,48656226334
(A - B) Res64 = 0xADF53EEA0BFAF3EE, C Res64 = 0xFFF9D50500000001
(A - B)/C: Quotient = 26538003597554197569639536298, Remainder Res64 = 0xEDFCE7FC50271D44
Suyama Cofactor-PRP test of F30 / 640126220763137 / 1095981164658689 with FFT length 62914560 = 61440 K:
(A - B) mod C has Res64,35m1,36m1: 0xEDFCE7FC50271D44, 9723857098,34401987830.
This cofactor is COMPOSITE [C323228467]. Checking prime-power-ness via GCD(A - B,C) ...
GCD((A - B)[1073741760 bits], C[1073741725 bits]) = 1
Time for GCD = 00:11:38.593
Cofactor is not a prime power.[/code]

ewmayer 2022-07-01 22:35

F30 primality test and matching double-check
 
[See posts #65, #70, #74 and #75 in this thread for previous notes re. the F30 Pépin tests.]

At looooooooooooong last my multiple-systems-hopping, on-again/off-again DC run of the Pépin test on F30 (the first-completed run was @60M FFT, the DC run @64M FFT) has finished. A rough-and-ready précis of the systems used along the way:[LIST][*] [b]25 Mar - 02 Apr 2017:[/b] By way of exploration, started a Pépin test of F30 on my Intel Haswell quad, 4c4t, using a 256-bit AVX2+FMA3 SIMD build of Mlucas 15.0. This ran at a throughput of ~100 ms/iter for a little over a week, completing 5.6M iterations (0.52% complete).

[*] [b]02 Sep 2017 - 17 Apr 2019:[/b] Resumed in 64-threaded mode on a 64-core KNL server crowdfunded by GIMPS members and physically hosted by David Stanfill, using an AVX-512 build of Mlucas 17.0 (the just-released version supporting AVX-512). That ran at a rate of 68 ms/iter, reaching iteration 734400000 (68.4% complete). Around that time the KNL went offline and David Stanfill went AWOL; I believe the 'net slang is "ghosted". [Aside: David had mentioned he was cryptocurrency "mining" in a major way; somehow that and such disappearing acts seem to go together, as one is seeing with the recent global crypto-ponzi collapse.]

[*] [b]28 Nov 2020:[/b] Resumed from the 730M-iter checkpoint in 64-threaded mode on a 68-core|272-thread KNL workstation bought used for $500, using an AVX-512 build of the Mlucas 17.1 codebase (same codebase as above, for consistency), and running one thread on each of physical cores 0-63. This crunched steadily at a throughput of 64 ms/iter for 11 months, reaching 1014M iterations, a little over 94% complete.

[*] [b]31 Oct 2021:[/b] Continued in 64c64t mode on the above KNL workstation, but now as a low-priority "backup" run to a 64c128t p-1 run of F33 (first to B1 = 10^7, then in 100M-prime-interval chunks to B2 = 400M), using the p-1 functionality added to Mlucas over the course of most of 2021. In backup-run mode throughput was ~300 ms/iter; this ran for ~6 months, reaching 1070M iterations, a little over 99.66% complete, until 4 May this year, when the KNL gave up the ghost.

[*] [b]12 Jun 2022:[/b] After various attempts at fault-tracing and system revival proved fruitless, I pulled the SSD out of the KNL and installed it in my desktop open-frame GPU crunching rig, and with some assistance from Paul Underwood successfully mounted the KNL filesystem (the KNL ran CentOS, vs the desktop rig's Ubuntu, so some FS-naming-convention sleuthing was needed). The filesystem was thankfully intact, so I copied the F30-run logfile and latest checkpoint file over to my 2c4t Intel NUC8i3CYS mini and completed the F30 double-check run @64M FFT using an AVX-512 build there. That ran at ~360 ms/iter and finished on 28 June; the Selfridge-Hurwitz residues match those of the earlier run @60M FFT:[/LIST][code][Jun 27 22:45:52] F30 Iter# = 1073690000 [100.00% complete] clocks = 01:00:38.498 [ 0.3638 sec/iter] Res64: C52F182238099986. AvgMaxErr = 0.018915548. MaxErr = 0.023437500.
[Jun 27 23:46:36] F30 Iter# = 1073700000 [100.00% complete] clocks = 01:00:37.609 [ 0.3638 sec/iter] Res64: B7D9DAA9B4FC17D8. AvgMaxErr = 0.018934628. MaxErr = 0.023437500.
[Jun 28 00:47:22] F30 Iter# = 1073710000 [100.00% complete] clocks = 01:00:39.309 [ 0.3639 sec/iter] Res64: D90E3A766F5D1B5C. AvgMaxErr = 0.018922704. MaxErr = 0.023437500.
[Jun 28 01:48:07] F30 Iter# = 1073720000 [100.00% complete] clocks = 01:00:38.582 [ 0.3639 sec/iter] Res64: 2D8B46F0E8274E4C. AvgMaxErr = 0.018944431. MaxErr = 0.023437500.
[Jun 28 02:48:53] F30 Iter# = 1073730000 [100.00% complete] clocks = 01:00:39.394 [ 0.3639 sec/iter] Res64: 1CFA1EBC9CEE1C12. AvgMaxErr = 0.018940419. MaxErr = 0.025390625.
[Jun 28 04:07:09] F30 Iter# = 1073740000 [100.00% complete] clocks = 01:18:08.842 [ 0.4689 sec/iter] Res64: 005D2E424C47F6A8. AvgMaxErr = 0.018948376. MaxErr = 0.023437500.
[Jun 28 04:27:00] F30 Iter# = 1073741823 [100.00% complete] clocks = 00:19:42.173 [ 0.6485 sec/iter] Res64: A70C2A3DB98D6D9D. AvgMaxErr = 0.018940398. MaxErr = 0.023437500.
F30 is not prime. Res64: A70C2A3DB98D6D9D. Program: E17.1
F30 mod 2^36 = 58947628445
F30 mod 2^35 - 1 = 26425548225
F30 mod 2^36 - 1 = 59810773698[/code]
So in total, a little over 5 calendar years from start to finish for this patchwork-quilt DC run. The complete-final-residue files also have matching MD5 checksums, and as described in the above posts re. cofactor status, for numbers such as this with one or more known prime factors, we can also use those factors to quickly compute modular-squaring-chain checksums whose bitness is roughly that of the factors.

I've posted links to and descriptions of the every-100Miter interim-checkpoint files, the full final-residue file and the 2 run logfiles (for the first run @60M FFT and the just-completed DC @64M) in the OP of this thread. I also have persistent full checkpoint files at 10Miter intervals, but as each such file is 2^30 bits / 8 = 128 MByte and there are ceiling(2^30/(10 million)) = 108 of them, the total archive is ~13GB. The residue files are more or less completely random byte data, i.e. cannot effectively be compressed. I have burned a copy of that archive to a thumb drive as a backup for the one on my HD, and will make it available via a snail-mailed 16GB microSD on request, should anyone wish to do e.g. a parallel-run validation via rerun of said 10Miter intervals.

I don't know with any certainty whether this constitutes the first gigabit primality-test-with-double-check; if anyone has heard of another such, please post a link.

For F31 and beyond: I have the Gerbicz check working for Fermat-mod arithmetic in the in-dev Mlucas v21 code; on top of that, the PRP-proof mechanism now being used for GIMPS work can also be applied to Pépin tests, since those likewise consist of a long chain of successive mod-squarings. In effect one gets a certificate of the Pépin-test result, independently checkable in a small fraction of the time of the original test.
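For the curious, here is a toy Python sketch (small modulus and block size purely for illustration; the production codes differ in many details such as residue shifts and error recovery) of the core identity behind the Gerbicz check for a pure squaring chain such as the Pépin test:
[code]# Gerbicz-style error check for an iterated-squaring chain x <- x^2 (mod N).
# Fold every L-th residue into a running product d; if all squarings so far are
# correct then d_new = d_old^(2^L) * x0 (mod N), which we verify every L^2
# iterations at a cost of only L extra squarings.
def checked_squaring_chain(N, x0=3, total=1 << 12, L=1 << 4):
    x, d, d_prev = x0, x0, x0
    for i in range(1, total + 1):
        x = x * x % N                    # the actual work: one mod-squaring
        if i % L == 0:
            d_prev, d = d, d * x % N     # fold this block's end-residue into d
            if (i // L) % L == 0:        # every L^2 iterations: verify the chain
                assert pow(d_prev, 1 << L, N) * x0 % N == d, "error detected"
    return x

print(hex(checked_squaring_chain((1 << 127) - 1)))   # toy modulus, 2^12 squarings[/code]
An error anywhere in the preceding chain (whether in x or in d) makes the identity fail with overwhelming probability at the next verification, which is what gives the check its strength.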

