Pépin tests of Fermat numbers beyond F24
Placeholder thread to summarize what has been done in this area and collect new results as they complete.
General math/algo background: As is well known, Fermat numbers admit both a very efficient rigorous primality test and a fast discrete "implicit mod" convolution method for effecting it, analogously to the Mersennes. I'm running the Pépin test for F[sub]28[/sub] as I type this on my quad-core [strike]Sandy Bridge[/strike]Haswell box. This number (in fact all of F25-F32) is already known to be composite by way of a small factor, but the Pépin residue is still useful for fast cofactor PRP testing, and the Fermats make a good code-development milestone for my in-development AVX and manycore codes, since the Fermat-mod discrete convolution has a somewhat simpler structure than does the Mersenne-mod. (Technically, it's a complex "right-angle" transform with a hybrid of acyclic weighting for the plus-sign mod and a Mersenne-style irrational-base DWT to allow use of non-power-of-2 transform lengths - my F[sub]28[/sub] run is using an FFT length of 15*2[sup]20[/sup] floating doubles.)

[i][Text below is an only-slightly-edited version of an email I sent to Wilfrid Keller, Richard Crandall and Jason Papadopoulos on 13 November 2012][/i]

Available [url=http://hogranch.com/mayer/files.tgz]here[/url] in .tgz format (use 'tar xzf files.tgz' to unpack) are summary run-status files for my successful (in the sense that 2 runs at different FFT lengths yield matching residues) Pépin-test compositeness results for F24-F26. (F24 was rerun to test the hybrid-DWT code described above.) I believe F25 and perhaps F26 have already been Pépin-tested by others (likely using some version of George's x86 assembler FFT), in which event those residues can be compared to mine. These were run using a C/assembler hybrid of my own Fermat-mod convolution code, using SSE2 on the Intel x86 for the heavy computational work.
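The test being run here is easy to state. Below is a minimal self-contained sketch (my own illustration, not Mlucas code): F_m is prime iff 3^((F_m-1)/2) == -1 (mod F_m), i.e. 2^m - 1 mod-squarings of 3. Plain 64-bit arithmetic limits it to F5 and below; the real code performs the identical squaring chain with the Fermat-mod convolution standing in for sqrmod. (Uses the GCC/Clang __int128 extension for the 128-bit product.)

```c
#include <stdint.h>

/* One mod-squaring; the expensive step the FFT-based convolution replaces. */
static uint64_t sqrmod(uint64_t a, uint64_t m) {
    return (uint64_t)((unsigned __int128)a * a % m);
}

/* Pepin test for F_m = 2^(2^m) + 1, valid for 1 <= m <= 5 in 64 bits. */
int pepin_is_prime(unsigned m) {
    uint64_t F = (1ULL << (1u << m)) + 1;
    uint64_t x = 3;
    for (unsigned i = 0; i < (1u << m) - 1; i++)
        x = sqrmod(x, F);          /* x = 3^(2^(2^m - 1)) mod F_m */
    return x == F - 1;             /* final residue == -1 <=> prime */
}
```

Running it on F1-F4 (prime) and F5 (composite, 641 * 6700417) reproduces the classical results.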
This implements both the traditional power-of-2 FFT, which uses a negacyclic weighting to effect the Fermat-mod multiply arithmetic, as well as the non-power-of-2 FFT alluded to in the Crandall/Mayer/Papadopoulos F24 paper, which combines a negacyclic weighting and a Mersenne-mod-style dual-base DWT.

The machinery was very modest sitting-on-my-desk stuff: In each case the smaller non-power-of-2-FFT-length run was on 1 core of a Core2 dual-core Lenovo notebook (vintage 2008), built under Win32 using MS Visual Studio. This machine's processor is rated at 1.67 GHz, but due to a quirk of the power-management system, ever since the battery died 3 years ago the system only runs at 2/3 speed, that is ~1.1 GHz. (I chose not to buy a new one, as the purchase of the MacBook for my 64-bit builds allowed me to convert the Lenovo to stay-at-home plug-in status.)

Since F24 had already been proven composite via matching runs at FFT length 1024k (k meaning 1024 floating doubles), I include only the files for the rerun at 896k (human-readable checkpoint-status file: f24_896k.stat) on the above Lenovo notebook using the hybrid-DWT code, said test performed in October of 2011. F25 was tested @1792k on the same machine from late October 2011 to January 2012 (checkpoint file: f25_1792k.stat); F26 ran from January to October of this year (checkpoint file: f26_3584k.stat).

The larger power-of-2 runlengths for F25 and F26 were done on 1 core of a 2 GHz Core2 Duo Apple MacBook, built under 64-bit OS X using GCC 4.2. The 64-bit builds include versions of various key assembly macros which take advantage of the doubled SSE register set (from 8 to 16 128-bit SSE registers) available in 64-bit mode; this yields ~10% added per-cycle throughput, on top of that afforded by the higher clock rate, larger caches and faster system memory of the newer machine. On the other hand, the MacBook is my personal-use mobile machine, so does not enjoy quite the 24x7 uptime of the Lenovo.
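For readers unfamiliar with why a negacyclic weighting effects the Fermat-mod multiply: with N digits in base B one has B^N == -1 (mod B^N + 1), so products which wrap around the end of the digit vector enter with a minus sign. A toy O(N^2) schoolbook version for the modulus 2^16 + 1 (purely illustrative; the real code applies the sign flip implicitly via the weighted FFT):

```c
/* Multiply a*b mod 2^16+1 via negacyclic convolution of 4-digit base-2^4
   vectors. Since B^N = 2^16 == -1 (mod 2^16+1), the wrapped products are
   subtracted; that minus sign is what the negacyclic weighting builds in. */
#define NDIGIT 4                                 /* digits per operand   */
#define DBITS  4                                 /* bits per digit, B=16 */
#define FMOD   ((1L << (NDIGIT*DBITS)) + 1)      /* 2^16 + 1 = 65537     */

long mul_fermat_negacyclic(long a, long b) {     /* requires a, b < 2^16 */
    long ad[NDIGIT], bd[NDIGIT], c[NDIGIT];
    for (int i = 0; i < NDIGIT; i++) {           /* split into digits    */
        ad[i] = (a >> (DBITS*i)) & ((1 << DBITS) - 1);
        bd[i] = (b >> (DBITS*i)) & ((1 << DBITS) - 1);
    }
    for (int k = 0; k < NDIGIT; k++) {           /* negacyclic convolution */
        c[k] = 0;
        for (int i = 0; i < NDIGIT; i++) {
            int j = k - i;
            if (j >= 0) c[k] += ad[i] * bd[j];            /* i+j == k          */
            else        c[k] -= ad[i] * bd[j + NDIGIT];   /* i+j == k+N: minus */
        }
    }
    long r = 0;                                  /* recombine digit values */
    for (int k = NDIGIT-1; k >= 0; k--) r = r*(1 << DBITS) + c[k];
    return ((r % FMOD) + FMOD) % FMOD;
}
```

Comparing its output against a direct (a*b) % 65537 for arbitrary 16-bit inputs confirms the folding is exact.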
That is reflected in the frequent restarts-from-interrupt and timing slowdowns visible in the respective checkpoint-status files, f25_2048k.stat [run from 02 Dec 2011 - 27 Jan 2012] and f26_4096k.stat [27 Jan - 08 Sep 2012; in there you can see the timing boost I got between 18-24 Jul, when my MacBook's OEM fan finally got terminally dust-clogged/worn-out and I replaced it].

F27 and beyond will run in multithreaded mode on the above and faster/more-core hardware; it was only very recently that I achieved a first working multithreaded version of the SSE2-based convolution code. I am currently playing with various thread-management schemes (including a nice threadpool API Jason kindly has shared, which he adapted from some open-source code he found) to see what works best in the sense of efficiency and parallel scalability.

I also include a (single) copy of the final bytewise residue files for all of the above exponents; in each case I first compared the full Pépin-test residues for same-exponent runs at different runlengths via diff of the respective final bytewise residue files. The format for these residue files is slightly different than that described in the F24 paper, namely:

o 1 byte encoding an internal test type (other users of this file should ignore this);
o 1 byte encoding an internal modulus type (other users of this file should ignore this);
o 8 bytes encoding "number of mod squarings performed" in little-endian byte order (bytes stored in ascending order of significance);
o For modulus N, exactly ceiling(log2(N)/8) residue bytes, ordered from low to high, with residue R nonnegative-normalized, i.e. 0 <= R < N. For the Fermat number F_m = 2^2^m + 1, that translates to 2^(m-3) + 1 bytes;
o 8 bytes encoding the least-significant 64 bits of R (R modulo 2^64) in little-endian byte order;
o 5 bytes encoding R modulo 2^35 - 1 in little-endian byte order;
o 5 bytes encoding R modulo 2^36 - 1 in little-endian byte order;
o 1 byte for the C end-of-file (EOF) character.
Thus, e.g. for F29, such a savefile will be precisely 2^26 + 29 = 67108893 bytes in size. The small C function attached below may be easily adapted to read the files, if desired. Anyone making use of the residues (e.g. for cofactor PRP testing) should validate their residue-reading code by directly computing the 3 Selfridge-Hurwitz residues from the bytewise residue and comparing those values to the S-H residues which are stored in the penultimate 18 bytes of the savefile. (And both of those sets should match the final residues printed in the respective run's f**.stat file.)

Cheers,
Ernst

===============================

[b]Edit (13 Nov 2017):[/b] Here are links to tbz2'ed (use 'tar xjf *tbz2' to unpack) status-and-final-residue files for F24-F29. F24 has just the 896K-FFT .stat file, since that result matched the original 1M-FFT one published in the Crandall/Mayer/Papadopoulos [i]Math. Comp.[/i] paper. Each subsequent archive has a pair of such status files showing dates and interim Res64s for the Pépin tests at the respective FFT lengths used.
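For anyone coding up that validation step, here is a hypothetical helper (names mine, not from my codebase) which computes all 3 Selfridge-Hurwitz residues directly from the little-endian residue byte array. Since 2^35 == 1 (mod 2^35 - 1) and 2^36 == 1 (mod 2^36 - 1), a Horner scan from the top byte down with a mod at each step gives the exact residues:

```c
#include <stdint.h>
#include <stddef.h>

/* Compute Res64 = R mod 2^64, Res35m1 = R mod 2^35-1, Res36m1 = R mod 2^36-1
   from the little-endian residue byte array res[0..nbytes-1]. */
void sh_residues(const uint8_t *res, size_t nbytes,
                 uint64_t *res64, uint64_t *res35m1, uint64_t *res36m1) {
    const uint64_t m35 = (1ULL << 35) - 1, m36 = (1ULL << 36) - 1;
    uint64_t r64 = 0, r35 = 0, r36 = 0;
    for (size_t i = 0; i < 8 && i < nbytes; i++)    /* low 8 bytes, LE */
        r64 += (uint64_t)res[i] << (8*i);
    /* Horner scan, most-significant byte first; r*256 + byte stays well
       below 2^64 since r < 2^36, so plain 64-bit modding is safe: */
    for (size_t i = nbytes; i-- > 0; ) {
        r35 = (r35 * 256 + res[i]) % m35;
        r36 = (r36 * 256 + res[i]) % m36;
    }
    *res64 = r64; *res35m1 = r35; *res36m1 = r36;
}
```

A quick self-check on the 8-byte value R = 2^40 + 123: R mod 2^35-1 = 2^5 + 123 = 155 and R mod 2^36-1 = 2^4 + 123 = 139.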
F29 has a trio of such .stat files: one from the original (and fatally hosed, by an undetected hardware error between the 25.63M-iter and 25.64M-iter checkpoints) 30M-FFT run on my Haswell quad, which I've given a '.orig' extension, and the subsequent 2 from the side-by-side runs at 30M and 32M FFT which I ran to iter ~230M on the Haswell while doing near-daily Res64 cross-checks, then moved to a pair of faster systems (32-core Xeon/avx2 and 64-core KNL/avx512) to finish, as described in my 13 Nov 2017 new post to this thread:

[url=http://www.mersenneforum.org/mayer/F24_FILES.tbz2]F24_FILES.tbz2[/url] (2MB, MD5 = 06025e5d6ca3a746b5c0be1b3fd12465)
[url=http://www.mersenneforum.org/mayer/F25_FILES.tbz2]F25_FILES.tbz2[/url] (4MB, MD5 = 6db5632defdb06e1d7fde47322102544)
[url=http://www.mersenneforum.org/mayer/F26_FILES.tbz2]F26_FILES.tbz2[/url] (8MB, MD5 = 83bc460c7eb29b69364e038c53e72710)
[url=http://www.mersenneforum.org/mayer/F27_FILES.tbz2]F27_FILES.tbz2[/url] (169MB, MD5 = b863de5ff203c82fa8b1611de1c5c627)
[url=http://www.mersenneforum.org/mayer/F28_FILES.tbz2]F28_FILES.tbz2[/url] (742MB, MD5 = 2df7884c7c0ddd92f722d31f4b7392b7)

The F27 and F28 archives include a set of irregularly-spaced, roughly-every-1-Miter interim savefiles in addition to the final Pépin-residue one.

[b]Edit (04 May 2022):[/b] For F29 and F30 I have every-10-Miter savefiles but only uploaded the every-100-Miter and final-residue ones due to server space constraints, and did not tar/zip the respective archives, since the residue files are effectively random bitstrings and thus incompressible, and the resulting archive size quadruples with each increment of the Fermat-number index.
One can download just one or any desired subset of same using ftp or wget; here are the file names, sizes and, for the residue files, the MD5 hashes:

[code]F29_FILES:
Filename           Size(bytes)  Date         MD5
f29.100M           67108893     Sep 17 2017  0f7872a6de61f14da09c28aeee87cc9e
f29.200M           67108893     Sep 17 2017  4ad9f30c19d98026258cb148245ab9fc
f29.300M           67108893     Oct 31 2017  f2165c49b34c311379bb60829676500f
f29.400M           67108893     Oct 31 2017  6485f40256692e6c11d899eb0f785bf3
f29.500M           67108893     Oct 31 2017  d68cae02a296acce6e448cb500cd68bb
f29                67108893     Nov 13 2017  f7dfc69e905543b31a6591002e52e228
f29_30M.stat       4945113      Nov 12 2017
f29_30M.stat.orig  8329337      Sep 17 2017
f29_32M.stat       4517584      Nov 12 2017

F30_FILES:
Filename           Size(bytes)  Date         MD5
f30.100M           134217757    Apr 17 2019  b4bda3a28af98776c40aa3c9791467d5
f30.200M           134217757    Apr 17 2019  dff26f37fe35c23b9d45f7de115320a5
f30.300M           134217757    Apr 17 2019  2a81d511d44ac49433dad5ccf5863ef1
f30.400M           134217757    Apr 17 2019  2044d695f0d98f5573c3fe325e3ff38b
f30.500M           134217757    Apr 17 2019  8015a7f0112ecb355868a9428c4dba2e
f30.600M           134217757    Apr 17 2019  5a3bc91a80f02d640c4fb244b8f284f7
f30.700M           134217757    Jan 18 2020  05f797b977ca07d9c4af4a42b3100ec5
f30.800M           134217757    Jan 18 2020  65e2cb73af7c0c7ce3b89d3d486d8c9b
f30.900M           134217757    Jan 18 2020  d5bf62f54f2db1519638cd4e2889f20c
f30.1000M          134217757    Jan 18 2020  e6534f8ac299ff6cec89c4d2d1e5966e
f30                134217757    Jan 18 2020  674218092a73b72d0bb0e5a4ca9e3326
f30_60M.stat       1618448      Nov 27 2020
f30_64M.stat       7156472      Jul  1 2022[/code]

===============================

[code]/*** READ: Assumes a valid file pointer has been gotten via a call of the form
fp = fopen(RESTARTFILE,"rb");
***/
int read_ppm1_savefiles(uint64 p, FILE*fp, uint32*ilo, uint8 arr_tmp[],
                        uint64*Res64, uint64*Res35m1, uint64*Res36m1)
{
    uint32 i;
    uint32 nbytes;
    uint64 nsquares = 0;

    ASSERT(HERE, !(p >> 32), "read_ppm1_savefiles: p must be 32-bit or less!");
    /* Future versions will need to loosen this p < 2^32 restriction: */
    if(!file_valid(fp)) {
        sprintf(cbuf, "read_ppm1_savefiles: File pointer invalid for read!\n");
        return FALSE;
    }
    fprintf(stderr, " INFO: restart file %s found...reading...\n", RESTARTFILE);
    /* t: test type */
    if((i = fgetc(fp)) != TEST_TYPE) {
        sprintf(cbuf, "read_ppm1_savefiles: TEST_TYPE != fgetc(fp)\n");
        return FALSE;
    }
    /* m: modulus type */
    if((i = fgetc(fp)) != MODULUS_TYPE) {
        sprintf(cbuf, "ERROR: read_ppm1_savefiles: MODULUS_TYPE != fgetc(fp)\n");
        return FALSE;
    }
    /* s: number of mod squarings performed, 8 bytes, little-endian */
    for(nbytes = 0; nbytes < 8; nbytes++) {
        i = fgetc(fp);
        nsquares += (uint64)i << (8*nbytes);
    }
    /* For now, just make sure nsquares < 2^32 and copy to ilo: */
    if(nsquares >= p) {
        sprintf(cbuf, "read_ppm1_savefiles: nsquares = %llu out of range, should be < p = %llu\n", nsquares, p);
    //  return FALSE;
    }
    *ilo = nsquares;
    /* r: residue. Set the expected number of residue bytes, depending on the modulus: */
    if(MODULUS_TYPE == MODULUS_TYPE_MERSENNE) {
        nbytes = (p + 7)/8;
        TRANSFORM_TYPE = REAL_WRAPPER;
    } else if(MODULUS_TYPE == MODULUS_TYPE_FERMAT) {
        ASSERT(HERE, p % 8 == 0, "read_ppm1_savefiles: p % 8 == 0");
        nbytes = (p/8) + 1;
        TRANSFORM_TYPE = RIGHT_ANGLE;
    }
    /* Read bytewise residue: */
    i = fread(arr_tmp, sizeof(char), nbytes, fp);
    if(i != nbytes) {
        sprintf(cbuf, "read_ppm1_savefiles: Error reading bytewise residue array.\n");
        return FALSE;
    }
    if(ferror(fp)) {
        sprintf(cbuf, "read_ppm1_savefiles: Unknown error reading bytewise residue array.\n");
        return FALSE;
    }
    if(feof(fp)) {
        sprintf(cbuf, "read_ppm1_savefiles: End-of-file encountered while attempting to read bytewise residue array.\n");
        return FALSE;
    }
    /* 8 bytes for Res64: */
    *Res64 = 0;
    for(nbytes = 0; nbytes < 8; nbytes++) {
        i = fgetc(fp);
        *Res64 += (uint64)i << (8*nbytes);
    }
    /* 5 bytes for Res35m1: */
    *Res35m1 = 0;
    for(nbytes = 0; nbytes < 5; nbytes++) {
        i = fgetc(fp);
        *Res35m1 += (uint64)i << (8*nbytes);
    }
    ASSERT(HERE, *Res35m1 <= 0x00000007FFFFFFFFull, "read_ppm1_savefiles: *Res35m1 <= 0x00000007ffffffff");
    /* 5 bytes for Res36m1: */
    *Res36m1 = 0;
    for(nbytes = 0; nbytes < 5; nbytes++) {
        i = fgetc(fp);
        *Res36m1 += (uint64)i << (8*nbytes);
    }
    ASSERT(HERE, *Res36m1 <= 0x0000000FFFFFFFFFull, "read_ppm1_savefiles: *Res36m1 <= 0x0000000fffffffff");
    /* Don't deallocate arr_tmp here, since we'll need it later for savefile writes. */
    return TRUE;
}[/code]
1 Attachment(s)
F27 has one complete run, at FFT length 7*2[sup]20[/sup] doubles, from 7 Dec 2012 - 11 Mar 2013 on all 4 cores of my quad-core Sandy Bridge box, using the aforementioned 64-bit GCC-built multithreaded SSE2 code. The checkpoint-summary-status file (attached) shows a number of roundoff warnings, reflecting the fact that while FFT lengths of the form 7*2[sup]m-7[/sup] are perfectly adequate for F[sub]m[/sub] with m <= 26, F[sub]27[/sub] is the largest Fermat number testable using such a length and a double-float-based FFT.
[i][Mar 11 22:59:49] F27 Iter# = 134210000 clocks = 00:00:00.000 [ 0.0593 sec/iter] Res64: C1A1546143F5D665. AvgMaxErr = 0.257079724. MaxErr = 0.343750000
[Mar 11 23:07:28] F27 Iter# = 134217727 clocks = 00:00:00.000 [ 0.0593 sec/iter] Res64: 043A6C8B9272B297. AvgMaxErr = 0.198636938. MaxErr = 0.343750000
F27 is not prime. Res64: 043A6C8B9272B297. Program: E3.0x
R27 mod 2^36     = 49701630615
R27 mod 2^35 - 1 = 30710305624
R27 mod 2^36 - 1 = 39732934618[/i]

The second run of F[sub]27[/sub], at a mix of FFT lengths 2[sup]23[/sup] and 15*2[sup]19[/sup], actually started earlier than the above one (08 Sep 2012), but has been crawling along on my now-quite-aged Core2-based MacBook. That has reached iteration ~85M (with all interim residues matching those of the above run), but with the cooling fan again beginning to sound stressed, I have suspended it and will complete it on my Haswell quad once that completes the first run of F[sub]28[/sub] next month.
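The roundoff pressure mentioned above is at bottom a bits-per-word computation: the average FFT-word size is 2^m divided by the FFT length in doubles, and somewhere above ~18 bits/word a double-float convolution starts living dangerously. A one-line back-of-envelope helper (my own, not program code) makes the contrast plain:

```c
/* Average bits per FFT word for the Pepin test of F_m at a given FFT
   length in doubles; ~18+ bits/double is the ragged roundoff edge. */
double bits_per_word(unsigned m, double fftlen_doubles) {
    return (double)(1ULL << m) / fftlen_doubles;
}
```

F27 @7168k gives 128/7 ~ 18.29 bits/word, right at the edge and hence the warnings; F28 @15360k gives 256/15 ~ 17.07, comfortably below it.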
Hi all,
Thank you ewmayer for your thorough analysis of the Fermat numbers. I appreciate your post.

Regards,
Matt A.
1 Attachment(s)
First run of F[sub]28[/sub], at FFT length 15360k = 15*2[sup]20[/sup] doubles, finished yesterday. The timings in the attached checkpoint-summary-status file reflect the first part of the run being done on my old 4-core Sandy Bridge system using the SSE2 Mlucas code, then switching to AVX once that code became available (my AVX port was driven by the desire to get this run online ASAP; other FFT radices and functionality like Mersenne-mod convolution came after that), and finally switching to my new Haswell quad in June.
The summaries for the last few checkpoints and the final Selfridge-Hurwitz residues are

[i][Nov 10 16:57:24] F28 Iter# = 268430000 clocks = 00:00:00.000 [ 0.0638 sec/iter] Res64: 360A9E1308A3165A. AvgMaxErr = 0.145900781. MaxErr = 0.218750000
[Nov 10 17:03:13] F28 Iter# = 268435455 clocks = 00:00:00.000 [ 0.0640 sec/iter] Res64: 468731C3E6F13E8F. AvgMaxErr = 0.079615625. MaxErr = 0.203125000
F28 is not prime. Res64: 468731C3E6F13E8F. Program: E3.0x
F28 mod 2^36     = 16759471759
F28 mod 2^35 - 1 = 30476727665
F28 mod 2^36 - 1 = 9104636564
Sun Nov 10 17:22:00 PST 2013[/i]

Before beginning the DC run of F[sub]28[/sub] I will spend a few weeks to complete the above-mentioned DC run of F[sub]27[/sub] which I had been doing on my Core2-based MacBook.
1 Attachment(s)
Second run of F27, at a mix of FFT lengths 8192k and 7680k - I briefly used the larger length early on while working through some multithreaded performance issues - finished last night. Both this and the earlier run @7168k (which was on the ragged edge of roundoff error) yield matching results.
If you compare the latter parts of the above 7168k-run statfile to the one attached below, you see a more than 2x throughput boost between the 2 respective quad-core systems: Sandy Bridge + DDR3-1333, versus Haswell + DDR3-2400. (My code benefits far more from the larger caches and such of Haswell than does George's.)

2nd run of F28, this one at 16384k, is now underway.
[QUOTE=ewmayer;360379]2nd run of F28, this one at 16384k, is now underway.[/QUOTE]
Ernst, have you performed the Suyama test on the cofactor of F28, and also of F27? That not only tells you whether the cofactor is composite, but whether it is a prime power. I don't recall that anyone has yet reported that the cofactor of F28 is composite.
[QUOTE=philmoore;360389]Ernst, have you performed the Suyama test of the cofactor of F28, also of F27 ? That not only tells you whether the cofactor is composite, but whether it is a prime power. I don't recall that anyone has yet reported that the cofactor of F28 is composite.[/QUOTE]
No, not yet - I am currently just accumulating the Pépin-test residues, and will code up the Suyama post-test sometime next year. (I had a working version of this in my very old Fortran 90 code, but never implemented it in C/assembler.)
[QUOTE=ewmayer;360391]No, not yet  I am currently just accumulating the Pépintest residues, will code up the Suyama posttest sometime next year. (I had a working version of this in my veryold Fortran90 code, but never implemented it in C/assembler.)[/QUOTE]
The GCD step is the most time-consuming part of the test, but perhaps even the GMP version could do these in a reasonable time now(?) (Not that I have ever had the need to test it!) And we could also test GMP against George's code for verification.
[QUOTE=philmoore;360397]The GCD step is the most timeconsuming part of the test, but perhaps even the GMP version could do these in a reasonable time now (?) (Not that I have ever had the need to test it!) And we could also test GMP against George's code for verification.[/QUOTE]
Isn't a multiword remainder algo more efficient than a GCD here? 
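To make the remainder-vs-GCD point concrete, here is a toy Suyama check on F5 = 2^32 + 1 = 641 * 6700417 (function names mine, purely illustrative; uses the GCC/Clang __int128 extension). With A = 3^(F-1) mod F (one squaring beyond the Pépin residue) and B = 3^(P-1) mod F, where P is the product of the known prime factors, the cofactor C = F/P is a base-3 Fermat PRP precisely when A == B (mod C), so a single multiword remainder per quantity suffices:

```c
#include <stdint.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((unsigned __int128)a * b % m);
}
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    for (b %= m; e; e >>= 1, b = mulmod(b, b, m))
        if (e & 1) r = mulmod(r, b, m);
    return r;
}

/* Suyama cofactor-PRP check on F5; returns 1 iff the cofactor passes. */
int suyama_cofactor_prp(void) {
    const uint64_t F = 4294967297ULL, P = 641, C = F / P;   /* C = 6700417 */
    uint64_t A = powmod(3, F - 1, F);   /* squared Pepin residue, mod F    */
    uint64_t B = powmod(3, P - 1, F);
    return (A % C) == (B % C);          /* remainder check, no GCD needed  */
}
```

Here the cofactor 6700417 is in fact prime, so the check passes; the separate prime-power (GCD) step philmoore mentions is a different matter.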
Latest code enhancements drop the per-iteration time of my ongoing DC run of F28 @16384K from a (disappointing) 68 ms to 58 ms. The reason I say 68 ms was disappointing is that based on my 25 ms timing for the F27 DC run @7680K I expected a time of ~55 ms for the slightly-more-than-doubling of the FFT length.
The code change yielding the improved timing was to implement a larger leading-pass FFT radix (call it r0) of (complex-data) 128. Until yesterday the largest r0 I had was 64. Now one of the key performance-related trends I have observed in playing with the multithreaded SIMD code over the past year is that it really likes larger r0 values. I believe this is a result of r0 determining the per-thread dataset size, as (total main-array size)/r0. An especially critical threshold here seems to be 1 MB, which is interesting because it is not obviously related to any of the cache sizes - it is several times (more than 2x) smaller than L2, and many times larger than L1. Nonetheless, the threshold effect is striking, and persists across multiple generations of the Intel x86, at least from Core2 onward. Anyhoo, for F28 @16384K, (total main-array size) = 128 MB, thus r0 = 64 yields a per-thread dataset size of 2 MB; r0 = 128 gets us back to 1 MB.

Time to see if another doubling of r0 gives another boost - implementing these is less trivial than it sounds because:

o My code uses a fused final-iFFT-pass/carry/initial-fFFT-pass strategy, thus the DIF and DIT DFTs at any new radix r0 need to be combined with the carry macros, and (for r0 a power of 2 or nearly so) these need to handle the different carry-propagation schemes used for (real-data) Mersenne-mod and (complex-data, "right-angle" twisted transform) Fermat-mod;

o For each new r0 I need to implement 3 distinct code paths: scalar C, SSE2 and AVX. These share much of the "ambient" C code, but the DFT macros are custom inline asm for the SIMD paths.

For now (since F28 at the above FFT length is the focus) I can just stick to r0 a power of 2, but eventually I need to go back and "fill in" the various non-power-of-2 radices as well; for instance between r0 = 64 and 128 I also will need to support r0 = 72, 80, 88, 96, 104, 112 and 120. Each such radix needs 2-3 weeks of work to code ... much of my 2014 will be consumed this way.
One bonus, though, is that larger r0 is also well-geared for manycore, since r0 sets the maximum number of threads my code can use.
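The chunk-size arithmetic above is trivial but worth pinning down, since the 1 MB threshold keeps recurring in this thread; a throwaway helper (illustrative, not Mlucas code) for the F28 @16384K case:

```c
/* Per-thread dataset size = (total main-array size)/r0, with the
   main array holding fft_kdoubles * 1024 doubles of 8 bytes each. */
long per_thread_bytes(long fft_kdoubles, int r0) {
    return fft_kdoubles * 1024 * 8 / r0;
}
```

For fft_kdoubles = 16384 this gives the 128 MB total quoted above, and 2 MB, 1 MB and 0.5 MB per thread at r0 = 64, 128 and 256 respectively.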
Nice little timing breakthrough recently - the short title is "compact-object-code schemes for large-radix DFTs". Briefly, if one is dealing with a blob of code - mainly "hot sections" such as loop bodies here - which exceeds the L1 I-cache size, instructions must be fetched from L2, i.e. compete with data fetches for L2-to-L1 bandwidth. In my case the main culprits are the routines which wrap the carry step, and are laid out with a code-intensive loop body as follows:
1) Do radix-R final-iFFT pass on R complex data;
2) Send each output of [1] to the applicable DWT-unweight/normalize/round/carry/DWT-reweight macro (distinct ones for Fermat- and Mersenne-mod transform modes);
3) Do radix-R initial fFFT pass on the R complex normalized data.

For R > 16 each DFT can easily run to a thousand instructions or more, and each of the R carry-macro invocations in [2] is on the order of 50-100 instructions. I first considered reducing the sheer amount of code involved in the context of the non-SIMD (scalar-double) builds, where the optimizer has to deal with all of the code resulting from 1-3 (as opposed to just inlining blocks of assembly code, as in SIMD builds). I found that simple steps like modifying [2] to a single parametrized carry-macro call wrapped in a loop not only slashed the compile time, it appreciably boosted the resulting code speed. That's when I also started trying the strategy on the SIMD code paths.

Upshot: ~30% speedup across the board [i.e. all FFT lengths, and for both 1-thread and multithreaded] for scalar-double and SSE2 builds on my aging Core2 Duo MacBook. I am now within a factor of 2x of Prime95, as typified by the following Mlucas/Prime95 timing ratios (in mSec/iter) at FFT length 2560K:

1-thread: 128/70 = 1.83x
2-thread: 73/46 = 1.59x

[These are using p95 v25.10 - I don't think George has sped up the old SSE2 code much in the interim, but perhaps the multithreaded performance has been upped.]

The one annoyance: I get no speedup from the object-code shrinkage on my Haswell quad. But my code already got a much greater per-cycle boost than p95 on Haswell, which leads me to believe the various hardware improvements [doubled L1 cache sizes, doubled L2-to-L1 bandwidth] on Haswell mitigated the I-cache-overflow issue for me. I believe this phenomenon also explains many of the seemingly-bizarre performance issues I previously encountered with my SSE2 builds on Core2.
Nice to hear!
I just saw your previous post about the 1 MB limit to performance. Could it be a limit on the number of TLB entries on your machine? A single translation-lookaside-buffer entry speeds up access to a single 4 kB page, so if your machine has a TLB with 128 entries and you pull in data from more than 128 different widely separated addresses then you will probably get noticeable slowdowns. Usually a TLB miss takes much less time than an access from RAM, but if you would otherwise be running from L2 then the miss penalty should be noticeable. Yes, this means modern x86 machines don't have enough TLB entries to cover their own caches; those caches are huge now, and a TLB is logic that is in the critical path of every memory access.

The good news on Linux is that you can prove whether this is happening, by allocating a partition somewhere of type 'hugetlbfs' that is backed by RAM. Accesses inside this partition are backed by 'second level TLB' entries, which on x86 cover 2 MB or 4 MB. If your code speeds up drastically in this case, you know that it's TLB thrashing. The bad news is that you can't do much about it; the TLB size is a fundamental machine limit and the only way to speed up is to respect that limit. The worse news is that I have no idea how to do the test :)
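As a concrete starting point for such a test, one hedged sketch (Linux-specific, function name mine): request the big arrays via mmap with the MAP_HUGETLB flag, which yields 2 MB pages if some have been reserved (e.g. via /proc/sys/vm/nr_hugepages), and fall back to normal 4 kB pages otherwise. Comparing timings between the two backings would confirm or rule out TLB thrashing:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* Try to back a buffer with huge pages; fall back to normal pages.
   On return, *used_huge tells which backing was obtained. */
void *alloc_fft_buffer(size_t bytes, int *used_huge) {
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    *used_huge = (p != MAP_FAILED);
    if (p == MAP_FAILED)           /* no huge pages reserved: 4 kB pages */
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Timing the same FFT run against both allocations is the experiment jasonp describes; the hugetlbfs-mount route he mentions is the other way to get the same page backing.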
Thanks for the commentary and suggestions - I considered that the effect might be a TLB issue rather than L2 cache per se, but as you note, saw no easy way to test it. Perhaps George can provide some useful insight here, as I know he did much wrestling with TLB issues in days of yore.
From my perspective, as I now have a nice unified-code methodology [most code aside from the DFT cores common for different radices] for implementing large-radix leading-pass DFTs (the ones that wrap the carry phase as noted above), absent a simpler direct-fiddle-with-TLB-usage fix I will simply add larger-radix DFTs as needed to keep the per-thread data chunksizes < 1 MB. Note that on pre-Haswell chips this is closely tied to the compact-object-code issue, because previously those L1 I-cache overflow issues were negating the chunksize-beneficial effects of the larger radices.
[QUOTE=jasonp;369016]Could it be a limit on the number of TLB's on your machine? A single translation lookaside buffer speeds up access to a single 4kB page, so if your machine has a TLB with 128 entries and you pull in data from more than 128 different widely separated addresses then you will probably get noticeable slowdowns. Usually a TLB miss takes much less time than an access from RAM, but if you would otherwise be running from L2 then the miss penalty should be noticeable. Yes, this means modern x86 machines don't have enough TLB entries to cover their own caches; [/QUOTE]
I'm not convinced TLB accesses are a problem. In pass 1, where we are using a large stride, Prime95 reads scattered data from RAM and copies it to a scratch area; the various FFT steps and normalization are done in the contiguous scratch area, where TLBs are not a problem, before copying the final results back to scattered RAM. Yes, there are some TLB misses at the start and end, but the cost appears to be insignificant. I've implemented large-pages FFTs in Windows and did not observe a measurable speed difference. Now if Ernst is trying to do the entire large-stride pass 1 in-place, then the TLB costs could be much more significant.
[QUOTE=Prime95;369070]I'm not convinced TLB accesses are a problem.
Prime95 in pass 1, where we are using a large stride, reads scattered data from RAM and copies it to a scratch area; the various FFT steps and normalization are done in the contiguous scratch area, where TLBs are not a problem, before copying the final results back to scattered RAM. Yes, there are some TLB misses at the start and end, but the cost appears to be insignificant. I've implemented large-pages FFTs in Windows and did not observe a measurable speed difference. Now if Ernst is trying to do the entire large-stride pass 1 in-place, then the TLB costs could be much more significant.[/QUOTE]

No - the way I do things is very much as you describe (holy convergent evolution, Batman): large-regular-stride main-array data go into the radix-R DFT preceding the carry step, outputs of the final-iFFT DFT go into a small block of local storage, and after the carry step we read from that, do the initial-fFFT radix-R DFT and write results back to the large-stride main-array locations. I've been working to increase the pass-1 radix (R in my notation) to get the ensuing pass-2 (remaining fFFT passes/dyadic squaring/all-but-last-iFFT-pass) data block sizes below the L2 size, and eventually to something which will fit in L1.

I can see why the local-array aspect of the DFT/carry/DFT stage is TLB-insensitive, but what about the scattered-data reads and writes bookending things?

One more note of possible relevance: For large radices I've found it expedient for the scalar-double code path to write things back to the scattered main-array locs rather than to local memory. Interestingly I see no significant performance hit from doing so, contrary to my expectations. (This is not an option currently for the SIMD builds, since those use carry macros which assume data are in contiguous memory.)
[QUOTE=ewmayer;368933]The one annoyance: I get no speedup from the objectcodeshrinkage on my Haswell quad. But my code already got a much greater percycle boost than p95 on Haswell, which leads me to believe the various hardware improvements [doubled L1 cache sizes, doubled L2toL1 bandwidth] on Haswell mitigated the Icacheoverflow issue for me.[/QUOTE]
Another hypothesis: much better hardware prefetcher + branch prediction on Haswell :smile: 
Intel Vtune
If one had the time, one could download Intel's excellent utility, VTune, to determine what exactly is going on. It's free for 30 days. Quite a few years ago, for a project, I had the chance to use an early version of it and was VERY impressed; it was quite amazing to actually look inside the CPU to see what's going on and make changes to improve response, rather than doing the trial-and-error approach. Although there are lots of features requiring some dedicated time to learn, I also remember it had some "quick start" modes to get results quickly. I believe there are some YouTube videos that demonstrate its features.

[QUOTE=tServo;369241]If one had the time, one could download Intel's excellent utility, Vtune, to determine what exactly is going on.[/QUOTE]
Does that require an Intel compiler install underneath? 
[QUOTE=ewmayer;369249]Does that require an Intel compiler install underneath?[/QUOTE]
It says it doesn't. With Visual Studio being so prevalent on Windoze boxes, it makes sense that they wouldn't limit it. In examining its new features, one that I found VERY interesting, and that they have improved since the time I used it, is that you can collect data from remote machines by running a collection piece on the other machine that doesn't require a separate license. Previously, if you had 3 different Intel CPU architectures you were targeting, you needed a license for each machine or an expensive floating license. Now, it looks like 1 will do the trick. Very cool. BTW, its name is now VTune Amplifier XE 2013. Sheesh, I'll never understand marketing types.
1 Attachment(s)
Second run of F[sub]28[/sub], at FFT length 16384 kdoubles, finished today, and the result matches the one for 15360k I posted last November in this thread. The timings in the attached checkpoint-summary-status file reflect the first part of the run being done using leading radix r0 = 64 (yielding a per-thread dataset size of ~2 MB), then switching to r0 = 128 (1 MB/thread, right at the 1 MB threshold which is critical on Intel x86) and later to r0 = 256 (~0.5 MB/thread) as the new code for those radices became available.
The summaries for the first few and last few checkpoints and the final Selfridge-Hurwitz residues are

[i]F28: using FFT length 16384K = 16777216 8-byte floats.
this gives an average 16.000000000000000 bits per digit
Using complex FFT radices 64 16 16 16 32
[Nov 26 09:00:00] F28 Iter# = 10000 clocks = 00:00:00.000 [ 0.0695 sec/iter] Res64: 0F38D26B35C6794C. AvgMaxErr = 0.010993669. MaxErr = 0.013671875
[Nov 26 09:11:38] F28 Iter# = 20000 clocks = 00:00:00.000 [ 0.0696 sec/iter] Res64: 6E621D1C6A05BC46. AvgMaxErr = 0.011034805. MaxErr = 0.015625000
...
Using complex FFT radices 128 16 16 16 16
[Dec 14 14:50:10] F28 Iter# = 22690000 clocks = 00:00:00.000 [ 0.0580 sec/iter] Res64: 262012254B592782. AvgMaxErr = 0.010806998. MaxErr = 0.013671875
[Dec 14 14:59:50] F28 Iter# = 22700000 clocks = 00:00:00.000 [ 0.0579 sec/iter] Res64: 0002AF208FE85365. AvgMaxErr = 0.010797848. MaxErr = 0.013671875
...
Using complex FFT radices 256 8 16 16 16
[Jan 09 14:55:42] F28 Iter# = 61130000 clocks = 00:00:00.000 [ 0.0567 sec/iter] Res64: F3699600BCBFECE0. AvgMaxErr = 0.011603018. MaxErr = 0.015625000
[Jan 09 15:05:10] F28 Iter# = 61140000 clocks = 00:00:00.000 [ 0.0567 sec/iter] Res64: 2DD786952E04FC17. AvgMaxErr = 0.011608487. MaxErr = 0.015625000
...
[May 27 04:12:06] F28 Iter# = 268390000 clocks = 00:00:00.000 [ 0.0575 sec/iter] Res64: E529CBABAE1D8B17. AvgMaxErr = 0.011612073. MaxErr = 0.015625000
[May 27 04:21:42] F28 Iter# = 268400000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 7D195E1B0DD4A93E. AvgMaxErr = 0.011611240. MaxErr = 0.015625000
[May 27 04:31:17] F28 Iter# = 268410000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 840DBB498F57E870. AvgMaxErr = 0.011617642. MaxErr = 0.015625000
[May 27 04:40:53] F28 Iter# = 268420000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 64BE2132426CDD1C. AvgMaxErr = 0.011605388. MaxErr = 0.015625000
[May 27 04:50:28] F28 Iter# = 268430000 clocks = 00:00:00.000 [ 0.0575 sec/iter] Res64: 360A9E1308A3165A. AvgMaxErr = 0.011610274. MaxErr = 0.015625000
[May 27 04:55:43] F28 Iter# = 268435455 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 468731C3E6F13E8F. AvgMaxErr = 0.006339752. MaxErr = 0.015625000
F28 is not prime. Res64: 468731C3E6F13E8F. Program: E3.0x
R28 mod 2^36     = 16759471759
R28 mod 2^35 - 1 = 30476727665
R28 mod 2^36 - 1 = 9104636564[/i]

A tar-archive of the 2 checkpoint-summary files and the final bytewise residue file (identical for both runs, thus just 1 copy) in the previously-described format is [url=http://www.hogranch.com/mayer/f28_runs.tgz]here[/url] (33 MB, so please do not download frivolously - the run-status file alone is in the much smaller attachment below, if you just want the summary data). The MD5 checksum on the extracted bytewise residue file is

[i]md5sum f28 = 1554face46e633eb2fd773df2647cd2a[/i]

F29 (@30720K) is taking ~110 ms/iter on all 4 cores of my non-OCed Haswell 4670K; I'd really like to get that down to ~86 ms (i.e. 1 Miter/day), which would still need 18 months for the complete run, so it may be time to consider [strike]performance-enhancing drugs[/strike] overclocking the hardware.
[QUOTE=ewmayer;374437]F29 (@30720K) is taking ~110 ms/iter on all 4 cores of my non-OCed Haswell 4670K; I'd really like to get that down to ~86 ms (i.e. 1 Miter/day), which would still need 18 months for the complete run, so it may be time to consider [strike]performance-enhancing drugs[/strike] overclocking the hardware.[/QUOTE]
My initial stab (summer 2013) at FMA-related speedups was restricted to the intermediate-FFT-pass radix-16 routines, which do roughly half the work in a run like my ongoing F29 one, where I am using complex FFT radices 240,[b]16[/b],[b]16[/b],[b]16[/b],16 (product = 15M, or 30 Mdoubles in real-data terms). The bolded radices are the previously-FMAized ones. That effort was somewhat disappointing, since it yielded an overall speedup of 4-5% over non-FMA AVX, i.e. perhaps 8-10% speedup for the FMA-optimized code macros. (I was hoping for something like 15% or better.) Part of the disappointment is due to a very basic aspect of my FFT implementation, namely that I use a post-twiddles complex-multiply scheme for the various passes of the inverse FFT, in which the complex twiddles are applied to the outputs of the DFT used in each iFFT pass. This is nice from a dataflow-symmetry and auxiliary-data-indexing perspective, but is not favorable for an FMA-based speedup: instead of leading to a nice "tangent structure" (as for the forward, pre-twiddle passes) in which the twiddle MULs can be absorbed nicely into the ADD/SUBs, one instead ends up in the iFFT with a raft of "trivial" FMAs with one of the two multiplicands = 1 (or simply retains the original ADD/SUBs those replace), and at the end must do a batch of pure MULs (i.e. wasting the fused mul/add aspect of FMA) as part of the post-twiddling. Here are the resulting opcounts for the forward and inverse radix-16 complex FFT passes:

Fwd: 174 FMA [Breakdown: 87 FMA, 2 FMS, 85 FNMA, 0 FNMS.]
Inv: 174 FMA [a whopping 112 of which have a trivial unity multiplicand], 34 MUL.

Upshot: I probably get the desired FMA speedup in the forward FFT routines restructured thusly, but get little speedup in the iFFT due to the above issues. Partly as a result of the effort expended to achieve this mediocre gain, I did not consider widening my FMA-related optimizations beyond the above 2 "heavy lifting" routines until recently.
However, I will henceforth be focusing on more-localized "peephole"-style FMA code opts, which can be done incrementally and in a localized fashion. For starters I FMAized the other code macros (radix-15 and twiddleless radix-16 assembled into radix-240, and the fused radix-16 final-fwd-FFT/dyadic-square/initial-inv-FFT pass) used in the above set of radices for my ongoing F29 run. That and a half-dozen other miscellaneous 1%-level tweaks have cut the per-iteration timing @30 Mdoubles to 95 msec, which is within 10% of my "stretch goal" of 86 msec. (Still running at stock; if I added a ginormo-cooler and OCed up the wazoo I could probably hit 80 msec with the current code, but where's the challenge in that? :)
In the radix-16 code, have you tried prefetching the next radix-16 data into the L1 cache?

[QUOTE=Prime95;384890]In the radix-16 code, have you tried prefetching the next radix-16 data into the L1 cache?[/QUOTE]
My current prefetch scheme there allows for a prefetch with a compile-time "fetch-ahead" distance specified. Based on experiments I found the best value of that to be ~1KB, very different from the "fetch very next set of inputs" scheme you describe. But let me do a quick revert to the latter when I get home this evening and have access to the Haswell, and I will let you know what the result is. One of my ongoing frustrations is that I have not been able to get anywhere near the big speedups you describe with prefetch; mine have been at best 2-3%, an order of magnitude lower than expected. Getting another 20-30% speedup effectively "for free" would of course be huge, but I have never seen anything near that with my prefetch experiments over the years.
Prime95 v27 and before prefetched way ahead into the L2 cache. That is, say we're doing the first pass operating on 128KB of data: during the processing of that 128KB I slowly prefetch the next 128KB of data into the L2 cache.
In v28, I added prefetching of the next few cache lines into the L1 cache. This moves the data from the L2 cache into the L1 cache. How many total passes over main memory does your FFT take? Prime95 does the forward and inverse FFTs while reading and writing the FFT data only twice. If your code is doing more than that, you will run into memory bandwidth limitations sooner.
I just played with the prefetch params in the forward radix-16 DFT macro, trying bytes-ahead values for the address offset ranging from 8 (just one double ahead) to 4096 ... again only a few % variation in the resulting timings, and 1KB seems roughly the best, though the effect is weak. All these current prefetches are of the t1 variety.
[QUOTE=Prime95;384941]Prime95 v27 and before prefetched way ahead into the L2 cache. That is, say we're doing the first pass operating on 128KB of data: during the processing of that 128KB I slowly prefetch the next 128KB of data into the L2 cache. In v28, I added prefetching of the next few cache lines into the L1 cache. This moves the data from the L2 cache into the L1 cache. How many total passes over main memory does your FFT take? Prime95 does the forward and inverse FFTs while reading and writing the FFT data only twice. If your code is doing more than that, you will run into memory bandwidth limitations sooner.[/QUOTE] Well, you have to store the not-currently-in-registers data somewhere, so where do you cache the data if not in the main array? In some contiguous chunk of local memory? I'll illustrate my data-move strategy using the above set of complex radices (240,16,16,16,16) I'm using for F29. Call the overall runlength in terms of doubles n:

[1] Initial forward radix-240 decimation-in-frequency (DIF) pass on n/240-stride-separated data;
[2] Remaining 4 fFFT passes, dyadic-square step, and initial 4 iFFT passes (the iFFT runs the radices in reverse order, thus again 4 x radix-16 here), done on one contiguous n/240-sized data chunk at a time;
[3] Final radix-240 decimation-in-time (DIT) pass on n/240-stride-separated data;
[4] Round and do carries.

Steps 1,3,4 (actual order is 3,4,1 here) are all fused into a single pass through the data. Step 2 involves multiple passes, but all on a contiguous block of data. (And I process 240 such blocks in turn.) The size of each of these blocks relates to the "1 MB chunk threshold" I described earlier, which is crucial for good performance on the x86: 4 passes of complex radix-16 means 16^4 = 2^16 complex doubles, i.e. 2^17 doubles, exactly 1MB. For F30, if I again used leading radix 240, the resulting contiguous-data chunks would be 2MB, and I would expect (and have already confirmed) a timing deterioration.
I already have in place a leading-radix-960 routine, which is slightly worse than 240 when used for F29, but is better for F30, and better still for F31.
[QUOTE=ewmayer;384960]
[1] Initial forward radix-240 decimation-in-frequency (DIF) pass on n/240-stride-separated data;
[2] Remaining 4 fFFT passes, dyadic-square step, and initial 4 iFFT passes (the iFFT runs the radices in reverse order, thus again 4 x radix-16 here), done on one contiguous n/240-sized data chunk at a time;
[3] Final radix-240 decimation-in-time (DIT) pass on n/240-stride-separated data;
[4] Round and do carries.[/QUOTE] This is what I call 2 passes over main memory: exactly what prime95 does. The combined steps 3,4,1 work on 240*16 = 4KB chunks. These 4KB chunks easily fit in the L1 cache while being processed. Two comments on performance:

1) While processing the 4KB chunk in the L1 cache, you should be prefetching the next 4KB chunk into the L1 cache. My concern is that you are not doing enough FPU work to hide the latencies of reading the next 4KB chunk from main memory. If I'm right, you can improve the FPU code all you want but your actual timings won't change.

2) You likely break the 240 up into two steps, 16 by 15 (actually inverse 16,15, carries, forward 15,16), and you are working with horribly large strides. During the inverse radix-16 step in my example, Prime95 would read the horrible-stride data and write it to a contiguous 4KB scratch area. The remaining steps then deal with the pleasant small-stride scratch area, until the final forward radix-16 step writes it back to the FFT data area with horrible strides.

What I call pass 2 over main memory processes chunks of 16^4 * 16 bytes = 1MB. This pass operates out of the L3 cache. Some comments on performance:

1) The L3 cache is a lot slower than the L2 cache. The CPU will not be able to hide the latencies without prefetching into the L1 cache. You've found that prefetching 1KB ahead seems to be optimal for hiding this latency.

2) While you are processing the 1MB chunk, you can prefetch the next 1MB chunk into the L3 cache. You are doing plenty of FPU work to easily hide all main memory latency.
Except 3) I'm worried you may be exceeding the L3 cache's capabilities. If you operate on 1MB of data while prefetching the next 1MB of data and you have 4 cores going, that's 8MB of data. It probably won't all fit. I also worry about the contention you've got going with 4 cores all placing very heavy read and write demands on the shared L3 cache. Is it possible to exceed the L3 cache's bandwidth capabilities? I don't know.

Here's what I'd recommend trying: In pass 1, go with 240 x 16 (60KB). While processing, prefetch the next 60KB into the L2 cache (spread these prefetches out!). Total demand on the L2 cache is 180+KB (60KB FFT data, next 60KB FFT data, 60KB scratch area, some sin/cos and weights). A tight fit. In pass 2, go with 16x16x16 (64KB). That should fit nicely in the L2 cache, since you don't need a scratch area.

That's a lot of work. I hope I'm right and it will be worth the effort. No guarantees.
George, many thanks for the thoughtful, detailed reply.
[QUOTE=Prime95;384980]This is what I call 2 passes over main memory: exactly what prime95 does.[/QUOTE] Holy convergent evolution, Batman! :) Actually, the key pieces of the scheme (the fused-radix/carry step and a similar fusion of the final fFFT pass, dyadic-square step and initial iFFT pass, the latter using the same radix as the final fFFT and sharing the same twiddles) were already in place in the Fortran90 code I used for the 1999 test of F24. The key pieces missing at that time (besides the transition to C and mass deployment of inline ASM) were that the 1999 code still used an out-of-place "ping pong" FFT (which performed well on at least one chip family, MIPS, but sucked hard on most others, from Alpha to x86), and that instead of doing a fused step [2] (everything except the final-iFFT-radix/carry/initial-fFFT-radix step) on one contiguous main-array data chunk at a time, I did the intermediate-radix passes (= all but the initial and final radix of each FFT) on the whole array; thus a set of 5 radices for each FFT, as I am using for F29, meant 8 passes through the whole array. That predictably scales badly to large FFT lengths. [QUOTE]The combined steps 3,4,1 work on 240*16 = 4KB chunks. These 4KB chunks easily fit in the L1 cache while being processed.[/QUOTE] I use a complex FFT, so radix-240 means 240*(16,32,64) bytes = (3.75,7.5,15)KB in scalar-double, SSE2 and AVX build modes, respectively. Even the AVX-mode 15KB should fit nicely in L1 on Haswell with its 32KB level-1 D-cache, though. However, note that for radices like 240 which decompose into 2 coprime factors (240 = 15*16 = odd * power-of-2, but could also be the product of 2 coprime odds, or odd1 * (odd2 * power-of-2) with odd1,2 coprime), I typically use 2 such contiguous local stores, due to the index-permutation flexibility demanded by my twiddleless implementation of such coprime radix pairs. So e.g. pass 1 reads from large-stride-separated main-array locations and writes to store 1; pass 2 reads from store 1 and writes to store 2.
But I still think the main-array step is the crucial one here, as you note below. [QUOTE]Two comments on performance: 1) While processing the 4KB chunk in the L1 cache, you should be prefetching the next 4KB chunk into the L1 cache. My concern is that you are not doing enough FPU work to hide the latencies of reading the next 4KB chunk from main memory. If I'm right, you can improve the FPU code all you want but your actual timings won't change.[/QUOTE] Yes, that sounds like a promising "try this first" step. I have done limited prefetch experiments for the large-stride step, but IIRC my attempts were influenced by my experience with prefetch within the contiguous step [2] data blocks, where fetch-ahead distances ~1KB seem best for me. I don't think I ever made a concerted effort to fetch just the next set of complex data-to-be-used into L1. [QUOTE]2) You likely break the 240 up into two steps, 16 by 15 (actually inverse 16,15, carries, forward 15,16), and you are working with horribly large strides. During the inverse radix-16 step in my example, Prime95 would read the horrible-stride data and write it to a contiguous 4KB scratch area. The remaining steps then deal with the pleasant small-stride scratch area, until the final forward radix-16 step writes it back to the FFT data area with horrible strides.[/QUOTE] Same here, but with the extra twist of 2 scratch areas for coprime radices. The carry macros also operate on the last-used scratch area. Carries also need auxiliary data: DWT weights for Mersenne-mod, and complex roots of -1 for the "acyclic twisting" scheme needed for moduli like the Fermats with a "+" sign. (The latter are similar to the FFT twiddles, but those are (n/2)th roots of +1, as opposed to the (n/2)th roots of -1, which are the same as the first (n/2), that is upper-complex-half-plane, (n)th roots of +1.) I use a 2-table multiply scheme for on-the-fly generation of both FFT twiddles and IBDWT weights.
As I noted to you some years ago, for the IBDWT weights w_0, ..., w_(n-1) I make use of the identity w_j * w_(n-j) = 2 to allow a single array to encode both the forward and inverse weights, at the price of some slightly-less-cache-friendly reverse-running indexing for the inverse-weights step. My 2-table schemes have each pair of tables within a factor of 2 of each other in size; thus I have a pair of O(sqrt(n))-sized tables of complex data for FFT twiddles, another such pair for acyclic-twisting weights in Fermat-mod mode, and a pair of similar-sized real-data tables for DWT weights in Mersenne-mod mode. Note that only one such pair of data tables competes for L1 cache with actual "live" transform (i.e. residue-related) data in any phase of the overall FFT-modmul scheme, though: the fused steps [1,3,4] need either IBDWT weights or acyclic-twisting weights, depending on the kind of modulus; step [2] needs just the 2 subtables used to generate the complex-FFT twiddles.

Aside: For non-power-of-2 FFT lengths (this case was not considered in the now-famous 1994 Crandall/Fagin DWT paper) the Fermat-mod scheme also needs Mersenne-style IBDWT weights in addition to the acyclic-twisting weights, but the IBDWTs are memory-negligible here, because for an FFT length of form odd * 2^m the IBDWT weights cycle with sequence length (odd), and it makes little sense to use odd-radix components much larger than 15. (The largest such I have is 63, which will be used specifically for the Fermat numbers F32 and F33, where an FFT length of form 15 * 2^m will no longer suffice due to ROE growth: I will use FFT length 63 * 2^m for one of the 2 independent tests, and a power-of-2 length 2^(m+6) for the other.)

[QUOTE]What I call pass 2 over main memory processes chunks of 16^4 * 16 bytes = 1MB. This pass operates out of the L3 cache. Some comments on performance: 1) The L3 cache is a lot slower than the L2 cache. The CPU will not be able to hide the latencies without prefetching into the L1 cache.
You've found that prefetching 1KB ahead seems to be optimal for hiding this latency. 2) While you are processing the 1MB chunk, you can prefetch the next 1MB chunk into the L3 cache. You are doing plenty of FPU work to easily hide all main memory latency. Except 3) I'm worried you may be exceeding the L3 cache's capabilities. If you operate on 1MB of data while prefetching the next 1MB of data and you have 4 cores going, that's 8MB of data. It probably won't all fit. I also worry about the contention you've got going with 4 cores all placing very heavy read and write demands on the shared L3 cache. Is it possible to exceed the L3 cache's bandwidth capabilities? I don't know.[/QUOTE] So if I read you right, the evidence for the L3-bandwidth issue you raise would be that the prefetch-into-L3 scheme helps for (say) 1- or 2-threaded runs, but hurts performance (vs. no L3 prefetch) above some thread count in the 4-ish range? [QUOTE]Here's what I'd recommend trying: In pass 1, go with 240 x 16 (60KB). While processing, prefetch the next 60KB into the L2 cache (spread these prefetches out!). Total demand on the L2 cache is 180+KB (60KB FFT data, next 60KB FFT data, 60KB scratch area, some sin/cos and weights). A tight fit.[/QUOTE] In "240 x 16", is the 16 a data-chunk byte count (as in your "combined steps 3,4,1 work on 240*16 = 4KB chunks" note above), or are you referring to radix-16, i.e. to prefetching data for the first 2 FFT passes into L2? [QUOTE]In pass 2, go with 16x16x16 (64KB). Should fit nicely in the L2 cache since you don't need a scratch area.[/QUOTE] Considering my experience in light of your comments, I'm not even sure prefetch in pass [2] is worth spending much time on, since the whole point of properly choosing the leading radix (e.g. 240 in our example here) is so that the system should have an easy time fetching the needed data from the higher-level caches ...
but I will first attempt to tackle the large-stride-related data-fetch issues for the fused steps [1,3,4] in my breakdown. [QUOTE]That's a lot of work. I hope I'm right and it will be worth the effort. No guarantees.[/QUOTE] Well, as you know, this is a long-term project for me. :) It's nice to have some promising directional hints, for sure. For the coming month or so, I'll probably continue with my ongoing work of FMAizing the remaining key AVX code macros which need it (I don't expect huge gains from that on Haswell, but reliable gains, and hopefully even better ones on Broadwell when it comes out) and do some large-stride-related prefetch experiments as a side project. Will let you know what I find as results become available.
I just added some prefetcht1 calls to my AVX Fermat-mod carry macro, such that during processing of the 240 (or more generally R, where R is the leading radix) carries within the segmented main data array (which is not directly in the picture during carry processing, as this is all done using a small contiguous local storage area as described above), prefetches-into-L1 of the data at the R distinct n/R-stride-separated main-array locations get done while the carries are processed. The carry macros seem the logical place to do this prefetching, since even though they do not otherwise access the main array in any way, they make it easy to propagate such prefetches to all the various leading-radix/carry routines I have.
For these experiments I added code such that the bytes-ahead offset for the prefetches is set via a compile-time command-line define, and played with various values: 64 is the obvious candidate in AVX mode, since an "AVX complex double" consists of a pair of adjacent quartets of doubles (4 Re, then 4 Im), thus "fetch the next complex datum at each array location" means fetching 64 bytes ahead of the current data addresses. In practice, though, 32 bytes ahead proves a smidge faster (and I also tried other values, like 16 and 128, all of which were clearly worse). Instant 4% speedup (95 ms/iter -> 91 ms/iter for my ongoing F29 run), the clearest timing "signal" I have ever seen in all my playing-with-prefetch. Thanks, George! Hopefully this is the start of a trend; still lots of other things to play with here, as you lay out above.
First Pépin test of F29 completed late Wednesday; it took slightly under 20 months at FFT length 30M. Start-of-run per-iteration time on my 3.3 GHz Haswell 4670 quad (running at stock) was ~115 ms; several rounds of AVX2/FMA-oriented optimizations brought this down to ~90 ms, which was representative of roughly the final 80% of the iterations.
On to run #2 (@FFT length 32M) ... our own Mike/Xyzzy was kind enough to run the first ~25 Miter @32M over the last several months for me on his 2-core Haswell NUC, so I am proceeding from where that run left off.

[UPDATE] I was going to post the final checkpoint Res64 and the traditional Selfridge-Hurwitz residues, along with some notes on run #2, after letting the latter run for a day or two on the same quad-Haswell system, but it has already turned up a 'bonk!' between iterations 25M and 26M. Not yet 100% sure whether the error is in my new run or the old one, but I suspect the latter, because that entire run was 'fragile' and prone to crashes at random intervals ranging from an hour to several months. Said crashes were accompanied by either 'fatal roundoff error = 0.5' or 'nonzero exit carry' errors indicating some kind of data corruption, whereas the FFT = 32M run has been clean of any untowardness. I further note that the 30M run used a fast but 'preliminary' binary of my v14.1 code release ... when I compared that to the eventual stabilized release code it was 2-3% faster at the FFT length in question, so I continued using it. Big mistake, as I now suspect the aforementioned 'noisy' fatal bad-data errors were likely accompanied by silent fails ... note the residue discrepancy between the 30M and 32M runs appears at a checkpoint where there are no signs of 'distress' from either run. So, likely I am looking at a massive waste of compute effort ... lesson learned: never again do such a run-pair in serial mode, always do the 2 side-by-side on systems yielding as similar speeds as possible. To that end, I am looking for a [strike]victim[/strike]volunteer to [strike]go on a suicide mission[/strike] help me redo the 30M run in parallel with the 32M one. Best will be someone with a similar Linux Haswell quad to my own ... should one prove slightly faster than the other, it will get to finish the 32M-FFT run, which needs ~6% more cycles than the 30M one.
Or we could consider running @30M using just 3 cores on the faster system, leaving one core free for other tasks. Note that [url=http://www.imdb.com/title/tt0060009/?ref_=nm_knf_t1]should you choose to accept this mission[/url], it will be a significant runtime commitment: around 18 months and 0 chance of a prime, although should you change your mind (or blow up your system) at some point along the way, you can hand things off (even if your savefiles get nuked, since we will be cross-checking those vs mine) to a fresh recruit. But you will be contributing to a small piece of math history, and of course getting a grateful 'thanks!' in any publication that may eventually result.
Just curious, did the system have ECC memory?

None of his (or ours) has ECC memory.
:mike: 
An 18-month test without ECC RAM! :loco:

I envy these algorithms!

Too bad I still don't owe a Haswell... :sad:
Luigi 
[QUOTE=ET_;423748]Too bad I still don't owe a Haswell... :sad:
Luigi[/QUOTE] Hm, do you want to owe it, or to own it? :razz: 
[QUOTE=ewmayer;423664]…although should you change your mind (or blow up your system) at some point along the way you can hand things off (even if your savefiles get nuked, since we will be crosschecking those vs mine) to a fresh recruit.[/QUOTE]:bump:

[QUOTE=ET_;423748]Too bad I still don't owe a Haswell... :sad:
Luigi[/QUOTE] Now you have the perfect excuse to run out and buy one! You can probably get used systems/components fairly cheap now. Re. ramshanker's "I envy these algorithms" comment: you envy abject #FAILure?
[QUOTE=ewmayer;423829]Now you have the perfect excuse to run out and buy one! You can probably get used systems/components fairly cheap now.
[/QUOTE] I will, probably for my birthday on March 7th. Will I be able to help then, or will it be too late? :boxer::spinner: 
[QUOTE=ET_;423867]I will, probably for my birthday on March 7th. Will I be able to help then, or will it be too late? :boxer::spinner:[/QUOTE]
Not at all! 
[QUOTE=ewmayer;423925]Not at all![/QUOTE]
Fine, let's keep in touch then :smile: 
[QUOTE=ET_;423980]Fine, let's keep in touch then :smile:[/QUOTE]
While my 32M Haswell run continues (at ~900,000 iterations per day), I'll be using spare cycles on my slow Core2 MacBook to rerun the roughly 1 million iterations from Mike's last 32M savefile (whose last-checkpoint data agreed with those of my just-completed 30M run at the iteration in question) to the point at which my 30M- and 32M-FFT runs diverge, in order to ascertain which of the 2 runs went off the rails. (Like I said, I'm almost certain it was the 30M one, but I need to confirm that via a 3rd run.)
[QUOTE=ewmayer;424035]While my 32M Haswell run continues (at ~900,000 iterations per day), I'll be using spare cycles on my slow Core2 MacBook to rerun the roughly 1 million iterations from Mike's last 32M savefile (whose last-checkpoint data agreed with those of my just-completed 30M run at the iteration in question) to the point at which my 30M- and 32M-FFT runs diverge, in order to ascertain which of the 2 runs went off the rails. (Like I said, I'm almost certain it was the 30M one, but I need to confirm that via a 3rd run.)[/QUOTE]
It's official: the rerun of iters 24.92M (where Mike left off his Haswell-NUC DC run @FFTlen 32M) through 25.64M (where my Haswell-quad continuation of same first mismatches my original run @FFTlen 30M), using the above-mentioned SSE2 build on my Core2 Duo MacBook @FFTlen 30M, matches my (ongoing) Haswell 32M continuation run:

[b][1] Original Haswell-quad run @30M:[/b]
[i][Jun 30 12:51:11] F29 Iter# = 25630000 clocks = 00:17:59.766 [ 0.1080 sec/iter] Res64: C782D1DA346F61C4. AvgMaxErr = 0.118563074. MaxErr = 0.156250000
[Jun 30 13:09:15] F29 Iter# = 25640000 clocks = 00:18:00.266 [ 0.1080 sec/iter] Res64: [b]BCF83642C5E19A09[/b]. AvgMaxErr = 0.118375359. MaxErr = 0.148437500[/i]

[b][2] DC Haswell-NUC/quad run @32M:[/b]
[i][Jan 21 14:55:02] F29 Iter# = 25630000 clocks = 00:15:58.792 [ 0.0959 sec/iter] Res64: C782D1DA346F61C4. AvgMaxErr = 0.023437500. MaxErr = 0.023437500
[Jan 21 15:11:06] F29 Iter# = 25640000 clocks = 00:15:59.235 [ 0.0959 sec/iter] Res64: [b]12AF4B8DD106DF85[/b]. AvgMaxErr = 0.023437500. MaxErr = 0.023437500[/i]

[b][3] DC Core2 Duo/SSE2-build run @30M[/b] (timings vary widely since only run when plugged in):
[i][Feb 03 23:55:27] F29 Iter# = 25630000 [ 4.77% complete] clocks = 02:57:20.963 [ 1.0641 sec/iter] Res64: C782D1DA346F61C4. AvgMaxErr = 0.102656999. MaxErr = 0.132812500.
[Feb 04 18:19:56] F29 Iter# = 25640000 [ 4.78% complete] clocks = 05:04:33.375 [ 1.8273 sec/iter] Res64: [b]12AF4B8DD106DF85[/b]. AvgMaxErr = 0.102699611. MaxErr = 0.140625000.[/i]

Thus, everything beyond iter 25.63M in my original Haswell/AVX2 run @FFTlen 30M is garbage.

Luigi, you order your birthday present yet? My hope is that things will run sufficiently fast on your newer hardware that you might get throughput matching my 4-core speed using just 3 (or even 2, due to memory contention giving diminishing returns beyond 2 threads) threads, which will make me feel less guilty about monopolizing your cycles. :)
Mr Mayer
I may be willing to donate a couple thread-months of Haswell desktop until Luigi gets his machine up and running. If you might be so kind as to PM me instructions for Linux setup (I run Ubuntu, if it matters), I can continue a double-check alongside your current run, with the caveat that I'll lose interest after 30-60 days.
[QUOTE=VBCurtis;425276]Mr Mayer
I may be willing to donate a couple thread-months of Haswell desktop until Luigi gets his machine up and running. If you might be so kind as to PM me instructions for Linux setup (I run Ubuntu, if it matters), I can continue a double-check alongside your current run, with the caveat that I'll lose interest after 30-60 days.[/QUOTE] Thanks, Curtis. Please, enough with the "Mr."; you're making me feel my age. :) First step is to download the current Mlucas release from [url=http://hogranch.com/mayer/README.html]here[/url] and build it; please use the autobuild version at the top of the 'Recent News' section. It should end with a binary link to the executable which reflects your platform, e.g. the USE_AVX2 build for Haswell+. Assuming you successfully build & install, run the medium selftests using 'Mlucas -s m' in a dir created for your work. Fermat-testing needs a few further tweaks, but first let's get through the above.
[QUOTE=ewmayer;425270]
Luigi, you order your birthday present yet? My hope is that things will run sufficiently fast on your newer hardware that you might get throughput matching my 4-core speed using just 3 (or even 2, due to memory contention giving diminishing returns beyond 2 threads) threads, which will make me feel less guilty about monopolizing your cycles. :)[/QUOTE] I completed the to-do list and cleared my doubts about the pieces to pick up: I will place my order next Monday, and get the new toy before the end of the week. Luigi
Ernst, can we have a status on F29?
Just curious... 
[QUOTE=LaurV;442492]Ernst, can we have a status on F29?[/QUOTE]It is probably still composite.
:razz: 
[QUOTE=LaurV;442492]Ernst, can we have a status on F29?
Just curious...[/QUOTE] Well, I'm doing both the 30M and 32MFFT runs sidebyside on my Haswell quad, so each is running half as fast as it would alone, but it makes it very easy to crosscheck residues, since 30M runs only a few % faster than 32M (.1810 sec/iter vs .1860 sec/iter). I'm running each job 4threaded, even though it gets only the equivalent of 2 cores compute power. Why am I doing this? Because my Haswell system, despite running at stock, has never been completely stable (whereas the very same binary has been running rocksolidly on my slower Mikebuilt Broadwell NUC for over a year, with the only interruptions being power outages ... but note the NUC is doing 1sttime LL tests, not Fermat testing). I average around 1 week between mysteryfails; roughly half the time both jobs start emitting fatal roundoff errors (sign of data corruption) at the same time; ~30% of the fails involve just one run  this is the reason for making both 4threaded, so the surviving run can continue at full 4core speed  and ~20% of the fails are BSODstyle (no UPS on this system, not worth the hassle). When a run fails due to fatalROEstyle data corruption it will attempt to autorestart for the mostrecent savefile; this succeeds perhaps 1/3rd of the time, suggesting that is the fraction of times the datacorruption occurred in the main residue array as opposed to the auxiliary data tables needed for the FFTmul. (Adding code logic to autoreinit those tables is on my todo list, but not currently a high priority.) Whenever I reboot/restart it gives me a chance to allow whichever of the 2 runs has fallen behind to catch up before restarting the other one. 
There have been at least 2 (perhaps 3 - too lazy to dig thru the savefiles just now) occasions where one run has started emitting non-matching residues - in such cases I have to halt both and restart from the nearest savefile preceding the start of the mismatching residues, since I have no [i]a priori[/i] way of determining which one went off the rails. Anyhow, both runs passed 130M iters a few days ago, which is a smidge more than 1/4 done. Total time for the side-by-side runs will be ~3 years, so only 27 months to go. :( Since I plan to do F30 on a manycore system (I'm lookin' at you, Knights Landing), it is possible those parallel runs will complete before the F29 ones, or not much later. I'll have a better sense of things by EOY, roughly the timeframe I estimate for a halfway-decent initial AVX-512 assembly-code deployment. 
Nice, thanks for sharing that!
That is what I (used to) do when running two LLs in parallel on two Titans - stop both and resume from the last match, as there was no way to know which one failed. I said "used to" because for the moment I have stopped my LL testing altogether (no resources), not because I would do it differently now. The old way is still the best. You can save a lot of time, actually (your time and others' time, compared with the scenario where one test fails, you don't know it and report a bad result, and others need to TC). @Xyzzy: :yucky: (as said before, this icon is improperly named; it has nothing to do with "yuck", I see a grumpy guy sticking out a tongue in defiance, or spite, of another. Long ago I wanted it for myself, but you gave me the lightning rabbit... hey, btw, now with Pokemon in vogue, maybe I can make some money if I sell it? Any takers?) :laurv: 
As you probably already noticed, "my" Skylake didn't come.
I had health issues, then had to sell my mom's home; my wife's home was too small and we had to look for another one, sell the old one, buy the new one in a city that is 450 miles away from where I live, prepare to relocate, look for a job and some other things I could put on the unhappy-me thread. I want to apologize to the forum and Ernst for offering a helping hand and silently retracting it. The new PC should finally arrive after our relocation is complete, I hope by end of October (you will know by then). If help is still needed in November, I hope I will be ready. 
[QUOTE=ET_;443146]As you probably already noticed, "my" Skylake didn't come.
I had health issues, then had to sell my mom's home; my wife's home was too small and we had to look for another one, sell the old one, buy the new one in a city that is 450 miles away from where I live, prepare to relocate, look for a job and some other things I could put on the unhappy-me thread. I want to apologize to the forum and Ernst for offering a helping hand and silently retracting it. The new PC should finally arrive after our relocation is complete, I hope by end of October (you will know by then). If help is still needed in November, I hope I will be ready.[/QUOTE] No worries, Luigi, and I hope things have calmed down on the various fronts you mention. In the meantime, the usefulness of having both runs executing locally on the same (or similar) hardware has become clear to me - if I had a 2nd *well quad, that would double my throughput, but I can't justify the expense. So I plan to just have my Haswell complete F29, during which time I will produce AVX-512 code which will hopefully bring F30 within reasonable reach, either on a KNL or one of the future desktop systems which will support that double-the-width-of-AVX2 instruction set. Luigi, you go ahead and put your hardware, when it arrives, to use for GIMPS - you definitely have a better chance of discovering a prime that way. :) 
The 2 side-by-side runs at 30M and 32M FFT I mentioned in my Sep 2016 post both got to iter ~230M on my Haswell in late April, with me doing near-daily Res64 cross-checks, but it was a major headache, the Haswell - since powered off except for occasional avx2 code tests and tweaks - being sufficiently unstable that it was often a challenge to get 24 hours of uptime. At that point I moved the respective runs to a pair of faster systems, both physically hosted by our own David Stanfill, a.k.a. user airsquirrels. One system is David's own GPU-filled 32-core Xeon/avx2, the other is the Intel Knights Landing box a bunch of us GIMPSers interested in doing/supporting AVX-512 code development crowdfunded this past spring. After doing the basic AVX-512 assembly code port and some experiments to figure out which FFT length should be run on each of these 2 servers, the 30M-FFT run was finished on the Xeon (best per-iter time there ~24 ms using 32 threads) and the 32M on the 64-core KNL (best time ~35 ms/iter using 64 threads). The Xeon run completed in late July, and the KNL run finished on 2 Sep.; both gave these final Selfridge-Hurwitz residues (as well as matching full-length final residue files): [i] F29 is not prime. Res64: BECE1E5FFBE5EC12. Program: E17.0 R29 mod 2^36 = 68650658834 R29 mod 2^35 - 1 = 17796269382 R29 mod 2^36 - 1 = 21246329964 [/i] Since my original Haswell runs were from a common 32M-FFT savefile (as opposed to being run completely independently @30 and 32M) and had numerous restarts as well as several residue mismatches needing backtracking to see which of the 2 runs (which were proceeding side-by-side on my quad Haswell at similar speeds, allowing for easier cross-checking) went "off the rails", data-integrity-wise, after the faster Xeon completed its run of the final ~300 Miters I further used it to re-run the first ~230 Miters, all @30M FFT length. That partial run just completed over the weekend, with the 230-Miter result matching those of the Haswell-begun runs. Many thanks to David for the generous gift of Xeon cycles and the IT support. I've posted a tarball of the interim-checkpoint status files and the full final residue file via link added to the OP in this thread. I also have persistent full checkpoint files at 10-Miter intervals, but as each such file is 2^29/8 = 64 Mbyte, and there are 2^29/(10 million) = 53 such, the total archive is ~3.3GB. The residue files are more or less completely random byte data, i.e. cannot effectively be compressed. I have burned a copy of that archive to a thumb drive as a backup for the one on my HD, and will make it available on request should anyone wish to do e.g. a parallel run validation via rerun of said 10-Miter intervals. 
1 Attachment(s)
Since I am currently in the process of adding PRP (as opposed to LL) testing for Mersenne numbers, as well as fast Suyama-style cofactor-PRP testing ('fast' in the 'when starting with a PRP-test residue for the modulus in question' sense) for both Mersenne and Fermat numbers to my Mlucas code, this seems worth sharing here. While going through a box of old documents today, I came across a folder of overhead-projector transparencies for several talks I gave back in my 6-year stint in academia from '93-'99. One set of slides was for the following talk I gave at the 1997 West Coast NT conference at the Asilomar conference center in Pacific Grove, near Monterey. (My Christmas-gift-to-self last month was a $60 Epson flatbed scanner, which is paying for itself in all kinds of useful ways.)
There is an obvious small typo on slide 4; the 2nd bullet point should read (boldface highlights the needed fix) "...since (M_p - 1)/2 is just a string of (p-1) binary ones, the Euler base-3 test amounts to a series of [b](p-2)[/b] squarings and multiplications by 3...": 
I never realized you were at Case!

Update on the side-by-side Pépin tests of F30:
Run #1, using the AVX-512 build on all 64 processors of the 1.3 GHz "GIMPS KNL" @64M FFT, begun 25 Mar 2017, just passed the 1/3rd mark at 360M iterations. At 67.7 ms/iter the ETC is 31 Dec 2019 (ugh.) Run #2, using the AVX2 build on all 32 processors of David Stanfill's 2.3 GHz Xeon server [E5-2698 v3] @60M FFT, begun 12 Nov 2017, is 20M iterations behind the KNL one. The best runtime I've gotten on this system is 48 ms/iter, but (for reasons not fully known; best guess is some combo of system load from other low-CPU processes and possibly temps) things more typically run 10-15% slower than that. The Xeon also tends to have more outages than the KNL, so let's add another 5% to that, say 57 ms/iter; ETC is mid-October 2019. Unlike my F29 runs on my Haswell quad, there have been no interim-residue mismatches between the above 2 runs. Hooray for ECC memory! One interesting issue I ran into recently with regard to future tests of larger Fermat numbers relates to residue shift, such as George's Prime95 code uses to effect different-data first-time and double-check runs of a given Mersenne number at the same FFT length for both runs: I recently finished adding support for this to my Mlucas code. In the context of the LL test for Mersennes, the trickiest part, especially for highly optimized codes which sandwich the carry step between the final pass of the inverse FFT for one iteration and the initial pass of the forward FFT for the next iteration (and, as in my case, combine all 3 steps into the body of a single processing loop), is efficiently figuring out where to inject the shifted -2 which in an unshifted-residue LL test simply gets subtracted from the 0-element of the residue array. In the context of shift, one needs to first efficiently compute which residue word gets the shifted -2, and remember, this is in the context of the crazy mix of 2 different wordsizes (floor(p/n) and ceiling(p/n)) mandated by the Crandall-Fagin DWT. 
Then one is typically dealing with a multithreaded decomposition of the above fused FFT-pass/carry-step loop, so one needs to figure out which thread is processing said array word. Lots of fun. But the update of the residue shift count each iteration is simple: it's simply a doubling (mod p), and the fact that p is prime means the odds are small that the resulting sequence of per-iteration shift counts will repeat in some kind of short-cycle sequence (or even worse, fall into some kind of limit cycle which is independent of the initial shift, i.e. which would be the same for 2 runs starting out with different shift values). For Fermat-mod arithmetic we have the opposite problem: There is no -2 needing to be injected each iteration as there is for the LL test, but the per-iteration shift-count doubling is problematic, since our modulus = 2^(2^n) + 1, i.e. our shift count gets doubled mod (2^n), which is guaranteed to yield (and then stay at) shift count s = 0 after at most n iterations. Instead updating the shift each iteration via s = 2*s+1, which one can do by including a scalar multiplier equal to the initial seed of the Pépin test, is little better; e.g. for n = 4, consider the kinds of shift-count sequences that result from different values of the initial shift: 0,1,3,7,15,15,... 1,3,7,15,15,... 2,5,11,7,15,15,... 3,7,15,15,... 4,9,3,7,15,15,... 5,11,7,15,15,... 6,13,11,7,15,15,... 7,15,15,... 8,1,3,7,15,15,... The most promising avenue I've found here is to update the shift count via s = 2*s + random[0,1]. The reason this becomes relevant for Fermat numbers only a few indices above F30 is this: so far I've been able to use dual residue-shiftless runs at slightly different FFT lengths to ensure that all but the initial handful of iterations for each run involve different FFT data. For example, for F24 we can do one run at 896K and another at 960K or 1M. 
For F30, as noted above, the minimal FFT length has crept up relative to the next-higher power of 2 due to roundoff error trends, and each doubling of the modulus length squeezes that lower bound that much closer to the power-of-2 FFT length. For F33 I have already put in place a special FFT length of form 63*2^k, but that largish odd composite radix is relatively inefficient, so if it proves slower than the power of 2 just above it, it will be useful to be able to do both runs at the same power-of-2 FFT length, using the residue-shift scheme to ensure different-data-ness. 
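The degenerate shift-count behavior described above is easy to reproduce; here is a tiny stdlib-only Python sketch of the three update rules under discussion (function and mode names are mine, purely illustrative):

```python
import random

def next_shift(s, n, mode, rng=random):
    """One per-iteration update of the residue-shift count mod 2^n (Fermat case)."""
    if mode == "double":          # s -> 2s: collapses to s = 0 within n steps
        return (2 * s) % (1 << n)
    if mode == "double_plus_1":   # s -> 2s+1: gets pinned at 2^n - 1
        return (2 * s + 1) % (1 << n)
    return (2 * s + rng.randrange(2)) % (1 << n)  # s -> 2s + random bit

# Reproduce one of the degenerate n = 4 sequences quoted above, initial shift 2:
n, s, seq = 4, 2, []
for _ in range(5):
    s = next_shift(s, n, "double_plus_1")
    seq.append(s)
print(seq)   # [5, 11, 7, 15, 15] -- pinned at 2^n - 1 = 15
```

Only the random-bit variant keeps producing fresh shift values indefinitely, which is why it is the promising avenue for same-FFT-length dual runs.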
Has anyone heard from David Stanfill, a.k.a. airsquirrels, of late? I had my 2 side-by-side tests of F30 running on a pair of servers he hosts, one the GIMPS KNL (doing the run @64M FFT, last check ~70% done), the other a 32-core Intel Xeon server (doing the run @60M FFT). The Xeon started its run some months after the KNL but was running faster (~55 ms/iter versus 68 ms/iter) and had pulled ahead of the KNL early this year. But I lost the ability to ssh into it (from the KNL is the only option, the way David configured things) a couple months ago, sent David several emails to that effect, no reply. And as of several days ago I can no longer access the KNL, either. I'm starting to get a bit worried - his forum profile indicates last activity back in January.

[QUOTE=ewmayer;515671]Has anyone heard from David Stanfill, a.k.a. airsquirrels, of late? I had my 2 side-by-side tests of F30 running on a pair of servers he hosts, one the GIMPS KNL (doing the run @64M FFT, last check ~70% done), the other a 32-core Intel Xeon server (doing the run @60M FFT). The Xeon started its run some months after the KNL but was running faster (~55 ms/iter versus 68 ms/iter) and had pulled ahead of the KNL early this year. But I lost the ability to ssh into it (from the KNL is the only option, the way David configured things) a couple months ago, sent David several emails to that effect, no reply. And as of several days ago I can no longer access the KNL, either. I'm starting to get a bit worried - his forum profile indicates last activity back in January.[/QUOTE]
Maybe you can find out where he is through this site: [url]http://press.airsquirrels.com/2019/03/squirrelsllcnamesandrewgouldnewceoannouncessplitfromsquirrelsresearchlabs/[/url] I guess he is really busy! 
[QUOTE=paulunderwood;515672]Maybe you can find out where he is through this site:
[url]http://press.airsquirrels.com/2019/03/squirrelsllcnamesandrewgouldnewceoannouncessplitfromsquirrelsresearchlabs/[/url] I guess he is really busy![/QUOTE] Thanks - I knew well how busy he was, but he never failed to reply to multiple emails over such a long span previously. (Including one more "you are still among the living, I trust?" missive last night.) And I hate to nag, and really don't want to have to resort to trying to reach him via phone. 
[QUOTE=ewmayer;515673]Thanks  I knew well how busy he was, but he never failed to reply to multiple emails over such a long span previously. (Including one more "you are still among the living, I trust?" missive last night). And I hate to nag, really don't want to have to resort to trying to reach him via phone.[/QUOTE]
I did a "whois squirrelsresearch.com" and the closest I could get is [email]squirrelsresearch.com@domainsbyproxy.com[/email] You could also try this form: [url]http://www.squirrelsresearch.com/contact/[/url] HTH 
[QUOTE=paulunderwood;515674]I did a "whois squirrelsresearch.com" and the closest I could get is [email]squirrelsresearch.com@domainsbyproxy.com[/email] You could also try this form:
[url]http://www.squirrelsresearch.com/contact/[/url] HTH[/QUOTE] Just sent a missive via the contact form - thanks! 
NO MORE FERMAT PRIMES
Here’s the proof that there are no more Fermat Primes:
The following website calculates the total number of integers that equal a given totient number: [url]http://www.numbertheory.org/php/carmichael.html[/url] Plug in n = 2k, k ≥ 33 Plug in e = 0 Plug in f = 0 All the solutions for n = 2k (k ≥ 33) contain exactly 32 solutions. All 32 solutions are even and are equal to 2 * the solutions for the previous n. Because there are no odd numbers, and all subsequent solutions will just be a power of 2 times any of the 32 previous solutions, there can't be any more prime numbers. Prime numbers are odd, and there are only even solutions for n = 2k (k ≥ 33). Therefore, there are no more Fermat prime numbers. 
Disclaimer: "2k" in your post should be read as 2[SUP]k[/SUP].
Do you understand what that calculator does, and how? 
Still nary a peep from David ... I think I need to start looking for a new Xeon server to finish my F30 runs on ... what is the cheapest 32-core avx2-capable Xeon server in the several-years-old used-hardware market, I wonder. The other problem is that while I have recent local-download copies of the run-status file and every-10-Miter savefiles for the 64M-FFT run, my 60M-FFT run-status file is 5-6 months out of date, i.e. I have no proof that I did the last ~150M iters or so at that FFT length. Of course one can use the 10-Miter savefiles to do a sped-up parallel verification (multiple machines, each reproducing a separate 10-Miter subinterval), but rather a spot of bother in any event.

Before Robert Gerbicz came up with Gerbicz error checking for PRP tests for Mersenne numbers, I think he invented [URL="https://www.mersenneforum.org/showthread.php?t=22510"]something similar for Pépin tests[/URL].
So I wonder if it would make sense to pause the tests on F30 in order to incorporate error checking, to increase the confidence in the final result. 
[QUOTE=GP2;515983]Before Robert Gerbicz came up with Gerbicz error checking for PRP tests for Mersenne numbers, I think he invented [URL="https://www.mersenneforum.org/showthread.php?t=22510"]something similar for Pépin tests[/URL].
So I wonder if it would make sense to pause the tests on F30 in order to incorporate error checking, to increase the confidence in the final result.[/QUOTE] Good suggestion, but in my case the whole point of doing side-by-side runs at FFT lengths 60M and 64M, with the hardware/FFT-length matched up to give similar throughput, was to be able to do regular cross-checking of interim residues. Before I lost access to the 32-core AVX2 Xeon system, that run was perhaps 2-3% ahead of the KNL run, i.e. a few tens of millions of iterations. I've had a generous offer from a fellow forumite to use a 32-core Skylake Xeon system he owns (or has access to) to finish F30 - depending on what his detailed tuning-run timings look like, we may be able to use 16 cores for each of 2 runs at 60M and 64M, respectively, and get throughput not much less than I was getting before. The only issue is the missing run-status-file documentation of the 60M run, where the last copy I ftp'ed over to my PC is iter ~530M, roughly 200 Miter behind the last check of that run before the AVX2 Xeon went AWOL. I may simply start both 60M and 64M runs from the 740-Miter savefile produced by the 64M run (the last I copied to my PC before the KNL went AWOL), and hope to reach someone at Squirrels Inc who can help me recover the missing files between now and end of the year. Failing that, I can redo the missing 200 Miter starting with the 530-Miter savefile at a later date in order to fill in the gap. 
For future runs, started from scratch, would Gerbicz error checking eliminate the need for parallel runs?

[QUOTE=Xyzzy;516242]For future runs, started from scratch, would Gerbicz error checking eliminate the need for parallel runs?[/QUOTE]
Yes. Even for the present run, you can start from a known good check point and continue with GEC. 
[QUOTE=axn;516266]Yes. Even for the present run, you can start from a known good check point and continue with GEC.[/QUOTE]
I'm not so sure about that - it's the same issue that's been debated in other threads hereabouts: "Does GEC mean GIMPS can skip double-checks?" GEC can certainly give the tester much greater confidence in the results of his run, but then there remains the issue of "now you need to convince the rest of the world". My sense is that for academic-style comp-NT research like Fermat testing you're always going to need at least 2 independent runs. The academic-CS types are already gonna look askance at the fact of 2 runs done using floating-point rather than integer math. I will post the complete set of 10-Miter interim residue files online after my runs complete so anyone who desires to do so can do a parallel DC using whatever code/algorithm they like, but personally I've no interest in (IMO) wasting time on the kind of pure-integer parallel verification we did for F24. 
Personally, I'm part of the "independent run" camp because that limits the number of failure points that can lead to an incorrect announcement.
What's the probability that GEC fails to detect an error? How many GEC checks are there? What about implementation bugs? Is it true that a single missed error among all the checks will result in an undetected wrong result? The rule that I usually follow is: "Can an adversary cause a misleading result by injecting a single error into a flawless computation at just the right place and time?" I don't know how GEC is implemented. But let's say your save file encounters a bit flip from when you wrote it to when you read it. Will it catch that? 
A bit of good news re. the missing data from my 60M-FFT F30 run ... that was running on a 32-core avx2 Xeon; the last time I was able to access said machine was late February (Feb 21, as I can now precisely date.) I knew that run was at iteration ~700M at that point, but my local copy of the run-status file was months older, and only covered through iteration 531.4M. But! - and this is one of the reasons I keep my macbook classic's OS frozen at the wonderfully stable OS X 10.6.8, despite the attendant lack-of-support headache, and why I set my xterms to save an effectively unlimited command history - it turns out the Xterm from which I would do my several-times-weekly run-status quick-checks has data going all the way back to the above file-download date, last October. And a check of when-did-I-last-need-to-reboot confirms:
[i] MacBook:obj_sse2 ewmayer$ uptime 14:56 up 206 days, 15:51, 3 users, load averages: 0.22 0.24 0.39 [/i] So by simply scrolling down through said window's long captured history I was able to recover a nicely granular snapshot of the 'missing' run history. My regular run-status checks consisted of 'tail -2 f30.stat && date'; here is the last portion of the thusly-recovered history: [code][Feb 19 15:10:28] F30 Iter# = 681200000 [63.44% complete] clocks = 01:36:54.030 [ 0.0581 sec/iter] Res64: C120506BBDB97A13. AvgMaxErr = 0.212160316. MaxErr = 0.312500000.
[Feb 19 16:50:10] F30 Iter# = 681300000 [63.45% complete] clocks = 01:39:38.457 [ 0.0598 sec/iter] Res64: 63EA4A0A58D2DCFA. AvgMaxErr = 0.212213476. MaxErr = 0.312500000.
[Feb 21 20:10:37] F30 Iter# = 684600000 [63.76% complete] clocks = 01:35:04.862 [ 0.0570 sec/iter] Res64: D4F7D0654B95C2C3. AvgMaxErr = 0.212131626. MaxErr = 0.281250000.
[Feb 21 21:48:03] F30 Iter# = 684700000 [63.77% complete] clocks = 01:37:22.205 [ 0.0584 sec/iter] Res64: 9E0B4E5DD7C1C0A5. AvgMaxErr = 0.212171292. MaxErr = 0.312500000.[/code] My KNL run @64M went up through just a little past iteration 730M, so assuming the AVX-512 Skylake-Xeon timings @60M are, say, ~10% faster than @64M, I'll probably have my new cycle donor restart the 60M run from the iteration-680M checkpoint file and the 64M run from the iteration-730M checkpoint file, and let the former slowly catch up with the latter. 
[QUOTE=Mysticial;516293]What's the probability that GEC fails to detect an error?[/quote]
Infinitesimal to a very large power. [quote]How many GEC checks are there?[/quote] Irrelevant. [quote]What about implementation bugs?[/quote] None known in prime95 29.6 (29.4 had one). Of course, if they were known we would fix them. [quote]Is it true that a single missed error among all the checks will result in an undetected wrong result?[/quote] A proper GEC implementation won't have a single missed error. [quote]The rule that I usually follow is: "Can an adversary cause a misleading result by injecting a single error into a flawless computation at just the right place and time?"[/quote] I've tried to make sure that can't happen, but now that you mention it there may be a way. If the loop counter is corrupted right after a GEC check there might be an issue. I'll have to go read the code. I think stack corruption that controls program flow (or instruction-pointer corruption) is the biggest vulnerability. [quote]I don't know how GEC is implemented. But let's say your save file encounters a bit flip from when you wrote it to when you read it. Will it catch that?[/QUOTE] Yes. There are always two bignums in memory (and the save file). If either is bit-flipped a later GEC check will catch the error. Taking all the above into account, I agree that double-checks are still necessary, albeit at a lower priority than LL double-checking. 
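For readers unfamiliar with the mechanics being discussed: with x_i = 3^(2^i) mod N and d_m the running product of every L-th residue, the Gerbicz invariant is d_m == d_{m-1}^(2^L) * 3 (mod N). A minimal stdlib-only Python sketch of that invariant follows (my own toy formulation, not Prime95's implementation: here the check runs every block, whereas production codes amortize its cost over many blocks and also keep both bignums in the savefile, as described above):

```python
def prp_gec(N, k, L=8):
    """Compute 3^(2^k) mod N, verifying the Gerbicz invariant
    d_m == d_{m-1}^(2^L) * 3 (mod N) after every block of L squarings."""
    x = 3 % N
    d = x                        # running product of residues at iterations 0, L, 2L, ...
    for i in range(1, k + 1):
        x = x * x % N            # one modular squaring (the "real" work)
        if i % L == 0:
            d_new = d * x % N
            if d_new != pow(d, 1 << L, N) * 3 % N:
                raise RuntimeError("Gerbicz check failed: hardware/software error")
            d = d_new
    return x

# toy demo on the (composite) F5 = 2^32 + 1: k = 2^5 - 1 = 31 squarings of 3
F5 = (1 << 32) + 1
assert prp_gec(F5, 31) == pow(3, 1 << 31, F5)
```

A single fault injected into x between checks corrupts d_new but not pow(d, 2^L, N)*3, so the comparison trips at the next block boundary; a bit-flip in d itself likewise breaks the equality. This is the self-correcting property being debated in the surrounding posts.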
[QUOTE=ewmayer;516290]I'm not so sure about that  it's the same issue that's been debated in other threads hereabouts: "Does GEC means GIMPS can skip doublechecks?" GEC can certainly give the tester muchgreater confidence in the results of his run, but then there remains the issue of "now you need to convince the rest of the world".
My sense is that for academic-style comp-NT research like Fermat testing you're always going to need at least 2 independent runs. The academic-CS types are already gonna look askance at the fact of 2 runs done using floating-point rather than integer math. I will post the complete set of 10-Miter interim residue files online after my runs complete so anyone who desires to do so can do a parallel DC using whatever code/algorithm they like, but personally I've no interest in (IMO) wasting time on the kind of pure-integer parallel verification we did for F24.[/QUOTE] Unfortunately, convincing the rest of the world is more of an expectation-management thing than an actual mathematical thing. So you're probably right that an independent double-check is necessary in these scenarios. However ... a simple scheme of a single GEC run plus saving all the GEC-checked full residues should together constitute a verifiable proof. With these residues, you or anyone else can do an independent double-check later. In fact, this could be a novel angle for an academic paper - how you managed to do a single run to positively verify a number using GEC! /IMO 
[QUOTE=Prime95;516316]
Taking all the above into account, I agree that double-checks are still necessary, albeit at a lower priority than LL double-checking.[/QUOTE] I imagine that some kind of double-check is needed anyway to protect against Byzantine faults. 
Thanks to a generous donation of cycles on a 32-core avx-512 machine by a forumite who wishes to remain anonymous, the 60M-FFT F30 run has been restarted from the 680-Miter savefile. At that FFT length this machine gets 60 ms/iter using 16 cores, but the scaling deteriorates beyond that; using all 32 cores we get 50 ms/iter. In contrast, David Stanfill's 32-core avx2 Xeon which I had been using for that run was in the 80-90 ms range using 16 cores, but continued to scale nicely up to the full 32, getting 50 ms/iter as its fastest timing. More typically, due to other system loads (mainly the GPU), I got ~55 ms/iter on that one.
I am still trying to find out whassup w/David and the GIMPS KNL; to that end I sent email to several addresses (squirrelsresearch.com@domainsbyproxy.com, [email]press@airsquirrels.com[/email]) last weekend, but have received no reply. I was hoping to arrange to have the KNL ground-shipped to a GIMPS officer (maybe Aaron Blosser, who admins the Primenet server, has room) and then restart my 64M run on it. 
My anonymous cycle donor's continuation of my 2/3-complete run of F30 @60M FFT length finished on Friday; he has given me permission to thank him by name: it is Ryan Propper, who also did some yeoman's mass-DC work for GIMPS last year. Here are the final residues; again, the (mod 2^36) one of the now-traditional Selfridge-Hurwitz residue triplet is simply a decimal-form recapitulation of the low 9 hexits of the GIMPS-style Res64: [code][Jan 15 11:28:26] F30 Iter# = 1069900000 [99.64% complete] clocks = 01:34:00.016 [ 0.0564 sec/iter] Res64: D4E1ADB4D56B90F5. AvgMaxErr = 0.184720270. MaxErr = 0.281250000.
[Jan 15 13:02:36] F30 Iter# = 1070000000 [99.65% complete] clocks = 01:34:05.574 [ 0.0565 sec/iter] Res64: E0A3C552712AD603. AvgMaxErr = 0.184693172. MaxErr = 0.281250000.
[Jan 15 14:36:48] F30 Iter# = 1070100000 [99.66% complete] clocks = 01:34:05.895 [ 0.0565 sec/iter] Res64: 44DE22333F576C20. AvgMaxErr = 0.184676185. MaxErr = 0.281250000.
[Jan 15 16:09:26] F30 Iter# = 1070200000 [99.67% complete] clocks = 01:32:33.138 [ 0.0555 sec/iter] Res64: C92DEFA95553316A. AvgMaxErr = 0.184739228. MaxErr = 0.265625000.
...
[Jan 17 23:54:10] F30 Iter# = 1073741823 [100.00% complete] clocks = 00:38:54.202 [ 0.0558 sec/iter] Res64: A70C2A3DB98D6D9D. AvgMaxErr = 0.184703967. MaxErr = 0.250000000.
F30 is not prime. Res64: A70C2A3DB98D6D9D. Program: E17.1
F30 mod 2^36 = 58947628445
F30 mod 2^35 - 1 = 26425548225
F30 mod 2^36 - 1 = 59810773698[/code] It's kinda stunning to see those iteration counts over 1 billion in the stat-file checkpoint lines - one knows they're gonna occur, but still, seeing actual run data like that for the first time is a bit of a "wow" experience. These still await confirmation by way of completion of the second run @64M FFT - I had been doing both the 60M and 64M runs on hardware hosted by David Stanfill; the 64M one was on the GIMPS-crowdfunded Intel Knights Landing machine which we used for early avx-512 code development. David went AWOL last spring, and Ryan has kindly agreed to also finish the 64M-FFT run on hardware available to him. 
Restarting that one @iteration 730M, at the 60 ms/iter he is getting on the same 32-core AVX-512 Intel Skylake virtual machine he used to finish the 60M run, the ETC is 8 months from now. Ryan also sent me the every-10-Miter persistent residue savefiles (in the same kind of FFT-length-independent bytewise format Mlucas uses for Mersenne-test savefiles) for his run, so - again pending cross-confirmation via the 64M run - I have a complete chain of 107 such files (128 Mbyte each) which can be used for any future independent confirmation of the F30 computation in distributed-computation fashion by multiple machines, each crunching one such 10-Miter subinterval and verifying the next link in the residue chain. 
Does this simply say that F30 is composite, or does it help to determine whether the cofactor F30 / ((149041*2^32 + 1)*(127589*2^33 + 1)) is composite ([URL="https://www.mersenneforum.org/showthread.php?t=19840"]Suyama[/URL])? /JeppeSN

@Jeppe: This is simply the basic Pépin residue ... I have Suyama cofactor-PRP code based on my own custom GMP-style bigint library also in place, but the various short-length residues of said test for the smaller Fermats, where others have done the Pépin+Suyama and published e.g. the Res64 for the latter, do not agree with mine. It may simply be a difference in some convention re. generation of the residue, but I need to dig into things and determine its source before I make any public announcements re. the F30 cofactor. (And also one wants the 2nd double-check run at the slightly larger FFT length to finish and confirm the Pépin-test results.)

[QUOTE=ewmayer;535962]@Jeppe: This is simply the basic Pépin residue ... (And also one wants the 2nd double-check run at the slightly larger FFT length to finish and confirm the Pépin-test results.)[/QUOTE]
Have you checked that: (3^(2^(2^m)) mod F_n) mod q = 3^(2^(2^m)) mod q is true for the stored interim residues? Since you have 3^(2^(2^m)) mod F_n, to get the left side is trivial, and you can quickly get the right side because q is small, so the check is cheap. Though it is not that very strong a check, the two known factors could find a mismatch with probability = 1 - (1e-10), which is still quite impressive. Note that we don't get the strength of 1/(q-1) or 1/znorder(Mod(3,q)), because the order is divisible by a "large" power of two, and that will be lost in the iterated squarings of a wrong residue. 
[QUOTE=R. Gerbicz;535966]Have you checked that:
(3^(2^(2^m)) mod F_n) mod q = 3^(2^(2^m)) mod q is true for the stored interim residues? Since you have 3^(2^(2^m)) mod F_n, to get the left side is trivial, and you can quickly get the right side because q is small, so the check is cheap. Though it is not that very strong a check, the two known factors could find a mismatch with probability = 1 - (1e-10), which is still quite impressive. Note that we don't get the strength of 1/(q-1) or 1/znorder(Mod(3,q)), because the order is divisible by a "large" power of two, and that will be lost in the iterated squarings of a wrong residue.[/QUOTE] ITYM (3^(2^(2^m - 1)) mod Fm) mod q = 3^(2^(2^m - 1)) mod q, since the Pépin test is just an Euler-pseudoprime test (which happens to prove rigorous in the special case of the Fm), i.e. it computes 3^((Fm - 1)/2) mod Fm = 3^(2^(2^m - 1)) mod Fm. But in any event, that's a good suggestion - I fiddled my code to load the final Pépin-test residue R from the savefile, and for the 2 known small prime factors p = 640126220763137 and q = 1095981164658689 computed R == 367500396407933 (mod p) and R == 95971534699540 (mod q). I then separately computed 3^(2^(2^m - 1)) mod p and mod q via (2^m - 1) = 1073741823 iterated squarings modulo p and q, separately, and the results match, which anyone can confirm on even quite modest hardware. So we have quite good confidence in the result, on top of the 2 runs agreeing through ~740 Miter (as of the latest 64M-FFT run update from Ryan), and said runs being done on high-end server systems with ECC memory. For F31 I will also be using the now-famous periodic whole-residue check named in honor of the above-quoted forum member. :) 
[QUOTE=ewmayer;536075]
I then separately computed 3^(2^(2^m - 1)) mod p and q via 2^m - 1 = 1073741823 iterated squarings modulo p and q, separately, and the results match, which anyone can confirm on even quite modest hardware.[/QUOTE] Trivial note: it is faster to reduce the exponent mod (p-1) [use Fermat's little theorem]. 
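To make this concrete, here is a tiny self-contained Python sketch - purely illustrative, using the small composite Fermat number F5 = 2^32 + 1 and its classical factor q = 641 rather than any of the big numbers discussed in this thread - showing that the brute-force iterated-squaring evaluation of the mod-q spot-check agrees with the faster evaluation that first reduces the exponent mod (q - 1) via Fermat's little theorem:

```python
# Spot-check the Pepin residue R = 3^(2^(2^m - 1)) mod Fm modulo a known
# small prime factor q, two ways. Toy case: m = 5, q = 641 | F5.
m = 5
q = 641  # known prime factor of F5 = 2^32 + 1 (Euler)

# Way 1: 2^m - 1 = 31 iterated squarings of 3, directly mod q.
r = 3 % q
for _ in range(2**m - 1):
    r = r * r % q

# Way 2 (the speedup noted above): since 3^(q-1) == 1 (mod q), first
# reduce the huge exponent 2^(2^m - 1) modulo q - 1.
e = pow(2, 2**m - 1, q - 1)
r2 = pow(3, e, q)

assert r == r2
```

The same two-liner scales to the real cases: for e.g. F30's factors the exponent 2^(2^30 - 1) reduces mod (q - 1) in microseconds via binary powering.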
Fermat-mod-mul with circularly-shifted residues
I spent an appreciable amount of time last year wrestling with basic concepts and implementation-in-code of Fermat-mod squaring chains in the presence of residue shift. This proves rather more interesting (a polite euphemism for "pain in the butt") than for the Mersenne-mod case. The Gerbicz check renders it less important, but it may still prove useful in some contexts, so I figured I'd post a summary.
Let s0 = initial shift, i.e. instead of starting the Pépin test of Fm with initial seed x0, we start with (x0 << s0) (mod Fm). [1] Data layout: for the Mersenne-mod IBDWT, we have 2 different wordsizes differing by one bit, so we need to loop over the elements in our variable-wordsize array until we find the one containing the (s0)th bit, then shift x0 by the appropriate small number of bits to place it on the proper bit boundary within that target word. For Fermat-mod, the easy case is that of power-of-2 FFT length, for which we have a fixed power-of-2 wordsize. More generally we will have a non-power-of-2 FFT length optimized for the target modulus - e.g. for F24 we can use 896 Kdoubles, and for F30 I did one run at 60M and the other (now 93% complete) at 64M. In both cases we use a special acyclic-transform-effecting DWT which multiplies our N input words by the first N (2N)th complex roots of unity, i.e. the (j)th input gets multiplied by DWT weight w[j] = exp(I*Pi*j/N). This means multiplying our original real input word by a complex number, but we notice that w[j+N/2] = I*w[j], making it natural to group the (j)th and (j+N/2)th input words together in memory and treat them as a single input to a complex FFT. This is the so-called "right-angle transform" trick, and is detailed in the original Crandall/Fagin 1994 DWT paper. In the Mersenne-mod IBDWT case we can also treat adjacent even/odd-indexed input elements as a single input to a complex FFT, but as the Mersenne-mod IBDWT weights are strictly real, these are simply pairs of adjacent input words. For Fermat-mod arithmetic using a non-power-of-2-length transform we need to layer a Mersenne-style dual-wordlength base representation and similarly defined IBDWT weights atop our acyclic-effecting DWT. One simplification relative to the Mersenne case is that for FFT length [odd]*2^k, the Fermat-mod wordsizes and IBDWT weights recur in repeating length-[odd] patterns. 
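For the curious, the acyclic-effecting weighting described above is easy to demonstrate in a few lines of Python. This is a sketch only - a naive O(N^2) DFT stands in for the FFT, and nothing here reflects Mlucas internals - but it shows that pre-weighting word j by w[j] = exp(I*Pi*j/N) turns an ordinary cyclic convolution into the negacyclic one needed for multiplication mod 2^N + 1:

```python
import cmath

def negacyclic_square_weighted(a):
    """Square the digit vector a mod (x^N + 1) via the acyclic-effecting DWT:
    weight by w[j] = exp(I*Pi*j/N), transform, square pointwise,
    inverse-transform, unweight, round. Naive DFT stands in for the FFT."""
    N = len(a)
    w = [cmath.exp(1j * cmath.pi * j / N) for j in range(N)]
    b = [a[j] * w[j] for j in range(N)]
    fwd = [sum(b[j] * cmath.exp(-2j * cmath.pi * j * k / N) for j in range(N))
           for k in range(N)]
    sq = [x * x for x in fwd]
    inv = [sum(sq[k] * cmath.exp(2j * cmath.pi * j * k / N) for k in range(N)) / N
           for j in range(N)]
    return [round((inv[j] / w[j]).real) for j in range(N)]

def negacyclic_square_direct(a):
    """Reference: schoolbook squaring with wrapped products negated,
    i.e. exactly the arithmetic of (sum a[j] x^j)^2 mod (x^N + 1)."""
    N = len(a)
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] += a[i] * a[j]
            else:
                c[i + j - N] -= a[i] * a[j]
    return c

a = [3, 1, 4, 1, 5, 9, 2, 6]
assert negacyclic_square_weighted(a) == negacyclic_square_direct(a)
```

One can also verify numerically that w[j + N/2] == I*w[j], which is what justifies the right-angle pairing of words j and j + N/2 as real/imaginary parts of one complex FFT input.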
In the context of a right-angle Fermat-mod acyclic transform, finding the target word for a given initial-seed shift means looping first over the even-indexed elements of our residue vector (considered here as a length-N real array), then, if the accumulated wordsizes are still less than the target shift s0, looping over the odd-indexed elements until we reach the target word. For the Fermat-mod case we need only do this for the initial seed of the Pépin test; for the Mersenne-mod case, specifically the Lucas-Lehmer test, we must also repeat the exercise on each subsequent iteration in order to find the proper bit offset for injection of the -2 term added to the output of each mod-squaring. This can be done very efficiently - specifically in O(1) time - by initializing a suitable bitmap encoding the variable wordsizes of the Mersenne-mod IBDWT and using that to effect a fast lookup scheme. [2] For a basic squaring-chain-with-shift scheme as we do for a Mersenne number M(p), each mod-squaring doubles the shift count (mod p). That means that for any s0 != 0 (mod p) the shift will remain nonzero and nonrepeating during the LL/PRP-test, since at squaring i, our shift s = s0*2^i (mod p) and the multiplicative group (mod p) is of maximal order p-1. OTOH for squaring chains modulo a Fermat number Fm, each subsequent squaring doubles the shift count (mod 2^m), so we have the problem that any nonzero initial shift s0 will lead to a shift s = 0 after at most m square-mods (exactly m of them for odd s0), and the shift will remain 0 throughout the remainder of the squaring chain. One way to get around this problem is to define an auxiliary pseudorandom-bit array and, on the (i)th square-mod, if bit[i] = 1, do a further mod-doubling of the residue, thus yielding the shift-count update s[i] = 2*s[i-1] + bit[i]. For a pair of Pépin tests of Fm such as are typically done to allow for real-time cross-validation of interim residues, it is also advisable to seed the bit-arrays differently for each run. 
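The shift-count behavior in [2] is easy to demonstrate numerically; a minimal sketch with illustrative parameters (not taken from any actual run):

```python
import random

m = 5        # toy case: shifts live mod 2^m = 32
s0 = 11      # any odd nonzero initial shift

# Plain doubling s_i = 2*s_{i-1} mod 2^m collapses to 0 after m steps:
s, trace = s0, []
for _ in range(m + 3):
    s = (2 * s) % (1 << m)
    trace.append(s)
# trace == [22, 12, 24, 16, 0, 0, 0, 0] - stuck at 0 from step m onward

# Folding in a pseudorandom bit each square-mod, s_i = (2*s_{i-1} + bit_i)
# mod 2^m, keeps the shift from sticking at 0 for the rest of the chain:
rng = random.Random(1)   # a second run would use a different seed
s = s0
for i in range(100):
    s = (2 * s + rng.getrandbits(1)) % (1 << m)
```

Seeding the RNG differently for the two side-by-side Pépin runs, as suggested above, ensures the two squaring chains carry distinct shift sequences throughout.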
[3] For the random-bit-offset scheme in [2], we further need a postprocessing step to determine the correct sign to apply to the final residue; this relates to the shift-doubling modulo the binary exponent which occurs on each square-mod. For Mp, squaring a shifted residue R*2^s gives R^2*2^(2s) (mod Mp), and 2^(2s) == 2^(2s-p) (mod Mp), so at each iteration we simply double the shift count (mod p). For arithmetic modulo Fm, by way of contrast, each time the doubled shift count incurs a reduction (mod 2^m) the shifted residue undergoes a change of sign: R^2*2^(2s) (mod Fm) == -R^2*2^(2s-2^m) (mod Fm), so a shifted-residue scheme must keep track of both the shift count and the sign of the shifted residue, relative to the true unshifted one. However: since - unlike LL - this PRP test involves just repeated mod-squaring, we don't care if the sign of the residue on any particular intermediate iteration is flipped to -, since the subsequent mod-squaring will flip it back to +. The only thing we care about is whether the sign of the final residue (either for the Pépin test or for a particular checkpoint subinterval thereof) has flipped w.r.to that of its immediate predecessor, the penultimate residue, since such a final-squaring sign-flip has no ensuing mod-squaring to auto-correct it. Thus, if there is a final-squaring sign flip, we need to un-flip the sign of the resulting residue manually. But that is not all, since generating a shift-removed final residue also involves a modular multiply or divide by a power of 2. I've always found it conceptually easier to multiply-mod rather than to divide-mod, so for both that and less-code-to-maintain reasons, my mi64 multiword int library has low-level code only for leftward circular shift, and effects s-bit rightward circular shift of a b-bit number by a (b-s)-bit leftward one. That is all that is needed working modulo a Mersenne number, as these are of form 2^p - 1 and thus any bits off-shifted to the left get re-wrapped from the right with + sign. 
Once again, Fermat moduli are trickier. Modulo Fm, a leftward circular shift must feed the bits off-shifted to the left back into the right end with a - sign, and the aforementioned shrc-done-as-shlc substitution implies a negation (mod Fm) of the resulting shift-removed residue. Thus, if at the end of our current checkpoint interval or Pépin test our residue R needed a sign-flip due to the doubled shift count needing a reduction (mod 2^m), our shrc-done-as-shlc already took care of it. If not, the residue needs an explicit negation. Rather than doing an explicit Fm - R with the attendant digitwise negations and borrows, one can simply do a bitwise-complement of the residue vector and a further += 2 of limb 0 in order to recover the unshifted residue. 
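A minimal sketch of the two wraparound rules just described, on the toy modulus F4 = 2^16 + 1; the helper names are hypothetical illustrations, not mi64/Mlucas routines, and big-int arithmetic stands in for the limb-level operations:

```python
b = 16                               # F4 = 2^b + 1 = 65537 as toy modulus
F = (1 << b) + 1

def mul_pow2_mod_fermat(R, s, b):
    """R * 2^s mod (2^b + 1): the s bits shifted off the left end re-enter
    at the right end with a - sign, since 2^b == -1 (mod 2^b + 1)."""
    hi = R >> (b - s)                # bits shifted off the top
    lo = R & ((1 << (b - s)) - 1)    # surviving low b-s bits
    return ((lo << s) - hi) % ((1 << b) + 1)

def negate_mod_fermat(R, b):
    """Fm - R for 0 < R < 2^b, done as bitwise-complement plus 2 to limb 0,
    avoiding an explicit long subtract-with-borrow chain."""
    mask = (1 << b) - 1
    return (~R & mask) + 2

# mul-by-2^s matches straightforward modular arithmetic:
for R in (1, 12345, 54321):
    for s in (1, 5, 15):
        assert mul_pow2_mod_fermat(R, s, b) == (R << s) % F
    # ...and the complement-plus-2 trick really is negation mod F:
    assert negate_mod_fermat(R, b) == F - R
```

Note the complement trick as written assumes R fits in b bits; the boundary value R = 2^b (== -1 mod F) would need separate handling, just as the special Fermat-mod carry cases do in an actual implementation.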
F25-F30 cofactor status, Part 1
[I PMed a more-or-less-identical draft of this to Prime95, ATH, yorix, jasonp, philmoore and R. Gerbicz yesterday]
[b]F25-F30 cofactor status:[/b] In the context of testing the cofactor-PRP functionality which will be supported in Mlucas v21, I've at long last done Suyama-style PRP tests on the cofactors of F25-F30, using the Pépin-test residues I've generated for same over the past 10 years, as detailed in this thread - original primality tests for F25-26 detailed in post #1 (recently updated with latest results), F27 in #2 and #5, F28 in #4 and #20, F29 in #29, #42, #52. For F30, the interim-residue files and logfile for the completed run @60M FFT have been uploaded, but the 99.5%-complete double-check run @64M FFT is stalled due to my KNL having gone offline last week - I suspect the cheapie PSU installed by the vendor of this bare-bones system, but have not yet had time to get under the hood, so to speak. The cofactor-PRP results are tabulated in a "Part 2" followup to this note. My code only supports the Suyama cofactor-PRP method, which starts from the basic Pépin-test residue, does one mod-squaring to generate the Fermat-PRP residue A from that, and an additional lg2(F) mod-squarings to generate the subtrahend B for the Suyama test, where F is the product of known prime factors; it makes little sense to spend roughly the same amount of work on a direct PRP test of such cofactors as is needed by the Pépin test of F_m, only to have to redo a similar amount of work when a new factor is discovered. I first added some basic cofactor-PRP code to Mlucas in early 2018, and used it to test the cofactors of F25-29 using the Pépin-test residues I had generated via my primality tests for same. I did not post the cofactor-PRP results at the time because the accompanying Res64 values mismatched ones posted by Andreas Höglund (a.k.a. ATH), who had done cofactor-PRP tests of F25 and F26 using George's code. 
At the time I was trying to figure out why said result mismatched Andreas' value for what I thought was the same quantity, and in the ensuing PM exchange it did emerge that George's code-at-the-time (2009), which Andreas used, in fact was doing a direct PRP test of the cofactor, but I was unaware of the cofactor-PRP runs for F25 and F26 done by forumite Yar (a.k.a. yorix - details below) in late 2017 using the then-latest Prime95 version. Having more pressing concerns at the time, I decided to put that aside and revisit later. Mlucas v21 will have 2 major feature adds, PRP-CF and PRP-cert support, so in the context of the first, 'later' is now. In 2009-2010 Andreas Höglund used George's code to test the cofactors of F25-27; cf. posts #51, #62, #64 in [url=http://mersenneforum.org/showthread.php?t=12168]this thread[/url]: [quote]UID: athath, F25/known_factors is not prime. RES64: 44BFC8D231602007. Wd1: B9307E03,00000000 Known factors used for PRP test were: 25991531462657,204393464266227713,2170072644496392193 UID: athath, F26/76861124116481 is not prime. RES64: 6C433D4E3CC9522E. UID: athath, F27/151413703311361/231292694251438081 is not prime. RES64: 481F26965DE16117.[/quote] Those posts were not clear on precisely what *type* of cofactor-PRP test was run - a direct Fermat-PRP test on the cofactor C, or a Pépin test (R = 3^((N-1)/2) (mod N)) of the Fermat number N = F*C followed by the Suyama postprocessing step (cf. the post by Phil Moore [url=https://mersenneforum.org/showpost.php?p=210599&postcount=67]here[/url]), where one computes first the Fermat-PRP residue A = 3^(N-1) (mod N) = R^2 (mod N) via a single mod-squaring of the Pépin-test residue R, then uses the product of known factors F to compute B = 3^(F-1) (mod N), and checks whether the difference (A - B) is divisible by the cofactor C. 
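The Suyama recipe just described can be illustrated end-to-end on the small case F5 = 2^32 + 1 = 641 * 6700417, whose cofactor happens to be prime, so the check necessarily comes out "PRP". This is a sketch in plain Python, not the Mlucas implementation:

```python
# Suyama-style cofactor-PRP check on F5 = 641 * 6700417.
m = 5
N = (1 << (1 << m)) + 1    # F5 = 2^32 + 1
F = 641                    # product of known prime factors (just 641 here)
C = N // F                 # the cofactor, 6700417 (prime, as it happens)

# Pepin-test residue R = 3^((N-1)/2) mod N via 2^m - 1 = 31 squarings.
R = 3
for _ in range((1 << m) - 1):
    R = R * R % N
assert R != N - 1          # Pepin: F5 itself is composite

A = R * R % N              # one more squaring: Fermat-PRP residue 3^(N-1) mod N
B = pow(3, F - 1, N)       # subtrahend 3^(F-1) mod N, ~lg2(F) squarings' work

# Cofactor C is a base-3 Fermat PRP iff C divides A - B:
# mod C, A/B = 3^(N-F) = 3^(F*(C-1)) = (3^(C-1))^F, which is 1 when C is prime.
assert (A - B) % C == 0    # so here the check reports "PRP", correctly
```

The economy the post describes is visible here: A comes from the already-available Pépin residue with one extra squaring, and B costs only about as many squarings as the known factors have bits, so a newly found factor never forces a from-scratch rerun.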
The Suyama version is preferable because it starts with a basic Pépin test of N, and as new factors are found, that same residue can be used to quickly (# of squarings = bits in product-of-known-prime-factors) check each resulting cofactor for PRP-ness. In late 2017 user Yar (don't know last name, uid = yorix) [url=https://www.mersenneforum.org/showthread.php?t=22668]posted a thread[/url] about his own cofactor-PRP runs for F25-26, again using George's code, but now mentioning 2 different run types for each, which clarifies the types of cofactor-PRP tests used: [quote]PRP test for F25 with known factors: PRP=N/A,1,2,33554432,1,"25991531462657,204393464266227713,2170072644496392193" gives 'composite' with res64: 7B6B087B84A45562 PRP test for F26 with known factors: PRP=N/A,1,2,67108864,1,"76861124116481" gives 'composite' with res64: FBB406B3A281838C[/quote] We see that the Res64 differs from ATH's above - Yar notes "This test with base 3 and returns the same residue independent of the number of known factors", which indicates a Suyama-style cofactor-PRP test. For F25 and F26 my Fermat-PRP Res64 values match Yar's for the Suyama-style cofactor check, and his direct-cofactor results match ATH's (though that is a same-software-used match). But it would be nice if someone using George's code could confirm not just the Fermat-PRP residues (A) for the above; for purposes of pronouncing the ensuing cofactor-PRP results "cross-verified using independent codes", we should also cross-check the (A - B) mod C ones, where C is the cofactor. The Fermat-PRP computation of course constitutes the overwhelming bulk of a PRP-CF run, but it is important to also have assurance that the B-residue and (A - B) mod C have been correctly computed. [ @George, would it be possible for you to tweak your code to print the Res64 checksum for the latter quantity? 
If Yar still has the Fermat-PRP-test residues from his runs of F25 and F26 using your code, it would then be a simple matter to rerun the Suyama step from those and cross-check the (A - B) mod C residues. If not, the F25 and F26 full-length tests are now easily runnable on modest hardware. Also, is there a different Prime95/mprime worktodo-entry syntax for a direct PRP test of a cofactor vs the 2-step approach (Fermat-PRP of N = F*C followed by the Suyama step), or did you switch from the former to the latter along the way? Yar only echoes the workfile entry for the 2-step run, but notes that he used v29.4, and then in a later post notes retrying with v29.3 and getting results matching Andreas' earlier ones. ] Lastly, I did not see any post by Yar regarding a Suyama-style cofactor-PRP test for F27 using Prime95. It would be nice for someone to do such, so we can put that one to bed as well, at least until another factor is discovered. I could obviously do it myself, but I think the "different persons using different codes" optics would be better. Compute cost would be 20-30% more than a current GIMPS wavefront-PRP test. Results for my Mlucas v21-prototype cofactor-PRP runs follow in Part 2. 
F25-F30 cofactor status, Part 2
In each case I begin with a quick Pépin-test residue sanity check, summarized here:
[quote]For PRP tests of N with known factors (e.g. as prelude to a cofactor-PRP postprocessing step), R. Gerbicz [post #79 of [url]https://mersenneforum.org/showthread.php?t=18748&page=8[/url]] suggests, using the specific case of the (m)th Fermat number F_m to illustrate: [quote] Have you checked that: (3^(2^(2^m)) mod F_m) mod q = 3^(2^(2^m)) mod q is true for the stored interim residues? Since you have 3^(2^(2^m)) mod F_m then to get the left side is trivial and you can get quickly the right side, because q is small, the check is cheap. Though it is not that very-very strong check, but the two known factors could find a mismatch with probability = 1-(1e-10) which is still quite impressive. Note that we don't get the strength of 1/(q-1) or 1/znorder(Mod(3,q)) because the order is divisible by a "large" power of two, and that will be lost in the iterated squarings of a wrong residue. [/quote] (For a Pépin-test residue, we'd actually check (3^(2^(2^m - 1)) mod F_m) mod q = 3^(2^(2^m - 1)) mod q.) More generally, for the powmod residue R = a^pow (mod N) where N has prime factor q, we check whether a^pow (mod N) = a^pow (mod q). If pow > q, the RHS can be simplified using Fermat's little theorem, a^(q-1) == 1 (mod q), thus a^pow == a^pow' (mod q), where pow' = pow (mod q-1). Example: For the final Pépin-test residue R for F30 (N = 2^2^30 + 1, pow = (N-1)/2 = 2^(2^30 - 1)), with 2 known small prime factors p = 640126220763137 and q = 1095981164658689, we get p: pow' = 2^(2^30 - 1) (mod p-1) = 433448099512320, R == 367500396407933 (mod p) q: pow' = 2^(2^30 - 1) (mod q-1) = 96576634617856, R == 95971534699540 (mod q).[/quote] In the results below, for each prime factor, 'A: R == ' is the full residue (mod p) and 'B: R ==' is the (mod p) squaring-chain checksum computed as above. For all the key quantities I give the "enhanced Selfridge-Hurwitz residue" triplet, which consists of residues mod 2^64 (hex), 2^35 - 1 and 2^36 - 1, the latter two printed in decimal form. 
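For reference, the two F30 exponent reductions quoted above take only a couple of lines of Python to reproduce; the asserted values are the ones reported in this thread, and the final compare against a stored Pépin residue is of course omitted here:

```python
# Reduce the Pepin exponent 2^(2^30 - 1) mod (p-1) and (q-1) for F30's
# two known prime factors, per the Fermat-little-theorem shortcut above.
e30 = (1 << 30) - 1
p = 640126220763137
q = 1095981164658689

pow_p = pow(2, e30, p - 1)   # reported as 433448099512320
pow_q = pow(2, e30, q - 1)   # reported as 96576634617856

# The corresponding checksum values reported for the final F30 residue:
assert pow(3, pow_p, p) == 367500396407933
assert pow(3, pow_q, q) == 95971534699540
```

This runs in microseconds, which is the point of the check: near-zero cost against a multi-year primality test.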
Lastly, I computed GCD(A - B, C) to check whether each cofactor is possibly a prime power, as described in Phil Moore's Part 1-linked post [or cf. Crandall & Pomerance (2nd ed.), exercise 4.8]. Unsurprisingly, all cofactors come up "composite", and even more unsurprisingly, none is a prime power: [code][b]F25:[/b] (I also reran @2048K FFT and confirmed the same values): Computing 33554431-squaring residue R (mod known prime q = 25991531462657) A: R == 5785371429337 (mod q) B: R == 5785371429337 (mod q) Computing 33554431-squaring residue R (mod known prime q = 204393464266227713) A: R == 45685916416586703 (mod q) B: R == 45685916416586703 (mod q) Computing 33554431-squaring residue R (mod known prime q = 2170072644496392193) A: R == 991358595838356746 (mod q) B: R == 991358595838356746 (mod q) F25: using FFT length 1792K = 1835008 8-byte floats. this gives an average 18.285714285714285 bits per digit Doing one mod-F25 squaring of iteration-33554431 residue [Res64 = CB913B6E3B1FFE2A] to get Fermat-PRP residue Fermat-PRP residue (A) = 0x7B6B087B84A45562,26994100025,66886679148 Processed 162 bits in binary modpow; MaxErr = 0.203125000 3^(F-1) residue (B) = 0x2909830729BFA110,30348546187,21153567739 (A - B) Res64 = 0x526185745AE4B453, C Res64 = 0x21CC1C03F0000001 (A - B)/C: Quotient = 10040206794592865905247433568023540557761720874501, Remainder Res64 = 0x7A551893ACF4BE4E Suyama Cofactor-PRP test of F25 / 25991531462657 / 204393464266227713 / 2170072644496392193 with FFT length 1835008 = 1792 K: (A - B) mod C has Res64,35m1,36m1: 0x7A551893ACF4BE4E,34178045549,65496009072. This cofactor is COMPOSITE [C10100842]. Checking prime-powerness via GCD(A - B, C) ... GCD((A - B)[33554304 bits], C[33554270 bits]) = 1 Time for GCD = 00:00:10.857 Cofactor is not a prime power. 
[b]F26:[/b] (I also reran @4096K FFT and confirmed the same values): Computing 67108863-squaring residue R (mod known prime q = 76861124116481) A: R == 10772922302714 (mod q) B: R == 10772922302714 (mod q) F26: using FFT length 3584K = 3670016 8-byte floats. this gives an average 18.285714285714285 bits per digit Doing one mod-F26 squaring of iteration-67108863 residue [Res64 = 2B58BDE2A0C14045] to get Fermat-PRP residue Fermat-PRP residue (A) = 0xFBB406B3A281838C, 2193939191, 7371940776 Processed 46 bits in binary modpow; MaxErr = 0.187500000 3^(F-1) residue (B) = 0x25F5AB0FFC728C87,32173443562,20444894301 (A - B) Res64 = 0xD5BE5BA3A60EF705, C Res64 = 0x23FFBA1860000001 (A - B)/C: Quotient = 49561426973911, Remainder Res64 = 0xB7BD14979ACF222E Suyama Cofactor-PRP test of F26 / 76861124116481 with FFT length 3670016 = 3584 K: (A - B) mod C has Res64,35m1,36m1: 0xB7BD14979ACF222E, 6416457631,15466936163. This cofactor is COMPOSITE [C20201768]. Checking prime-powerness via GCD(A - B, C) ... GCD((A - B)[67108864 bits], C[67108818 bits]) = 1 Time for GCD = 00:00:25.555 Cofactor is not a prime power. [b]F27:[/b] (I also reran @7.5M and 8M FFT and confirmed the same values): Computing 134217727-squaring residue R (mod known prime q = 151413703311361) A: R == 36549260191848 (mod q) B: R == 36549260191848 (mod q) Computing 134217727-squaring residue R (mod known prime q = 231292694251438081) A: R == 68921389296064753 (mod q) B: R == 68921389296064753 (mod q) F27: using FFT length 7168K = 7340032 8-byte floats. 
this gives an average 18.285714285714285 bits per digit Doing one mod-F27 squaring of iteration-134217727 residue [Res64 = 043A6C8B9272B297] to get Fermat-PRP residue Fermat-PRP residue (A) = 0xC3B191C45CCD7820,21887650168,55240256170 Processed 104 bits in binary modpow; MaxErr = 0.281250000 3^(F-1) residue (B) = 0x485F292142F953EC,20036545122,26481669619 (A - B) Res64 = 0x7B5268A319D42434, C Res64 = 0xD8C9BECF60000001 (A - B)/C: Quotient = 13668745121613830518711853822333, Remainder Res64 = 0x79E89190C56372B7 Suyama Cofactor-PRP test of F27 / 151413703311361 / 231292694251438081 with FFT length 7340032 = 7168 K: (A - B) mod C has Res64,35m1,36m1: 0x79E89190C56372B7, 2608954171,29330680430. This cofactor is COMPOSITE [C40403531]. Checking prime-powerness via GCD(A - B, C) ... GCD((A - B)[134217664 bits], C[134217624 bits]) = 1 Time for GCD = 00:00:59.492 Cofactor is not a prime power. [b]F28:[/b] (I also reran @16M FFT and confirmed the same values): Computing 268435455-squaring residue R (mod known prime q = 1766730974551267606529) A: R == 688455961638747199546 (mod q) B: R == 688455961638747199546 (mod q) F28: using FFT length 15360K = 15728640 8-byte floats. this gives an average 17.066666666666666 bits per digit Doing one mod-F28 squaring of iteration-268435455 residue [Res64 = 468731C3E6F13E8F] to get Fermat-PRP residue Fermat-PRP residue (A) = 0x7CAE1275B362B2CA,20461105324,30894284803 Processed 70 bits in binary modpow; MaxErr = 0.296875000 3^(F-1) residue (B) = 0xBD496177DB5155A0,33633822046, 3778945069 (A - B) Res64 = 0xBF64B0FDD8115D2B, C Res64 = 0x39AEB33000000001 (A - B)/C: Quotient = 585290509330851034797, Remainder Res64 = 0x790BE051D73D667E Suyama Cofactor-PRP test of F28 / 1766730974551267606529 with FFT length 15728640 = 15360 K: (A - B) mod C has Res64,35m1,36m1: 0x790BE051D73D667E, 4159568998,55362539930. This cofactor is COMPOSITE [C80807103]. Checking prime-powerness via GCD(A - B, C) ... 
GCD((A - B)[268435392 bits], C[268435386 bits]) = 1 Time for GCD = 00:02:16.895 Cofactor is not a prime power. [b]F29:[/b] (I also reran @32M FFT and confirmed the same values): Computing 536870911-squaring residue R (mod known prime q = 2405286912458753) A: R == 2364493566098202 (mod q) B: R == 2364493566098202 (mod q) read_ppm1_savefiles: Hit EOF in read of FFT-kblocks ... assuming a pre-v18 savefile. F29: using FFT length 30720K = 31457280 8-byte floats. this gives an average 17.066666666666666 bits per digit Doing one mod-F29 squaring of iteration-536870911 residue [Res64 = BECE1E5FFBE5EC12] to get Fermat-PRP residue Fermat-PRP residue (A) = 0x9BD416DB3918C152,32748692453,58920206666 Processed 51 bits in binary modpow; MaxErr = 0.375000000 3^(F-1) residue (B) = 0x30AFC6010F62884B, 7473784009, 6544795868 (A - B) Res64 = 0x6B2450DA29B63908, C Res64 = 0x3FF7746780000001 (A - B)/C: Quotient = 1671099484324226, Remainder Res64 = 0xAB3DBCEFFE907786 Suyama Cofactor-PRP test of F29 / 2405286912458753 with FFT length 31457280 = 30720 K: (A - B) mod C has Res64,35m1,36m1: 0xAB3DBCEFFE907786,13349243207,41564307636. This cofactor is COMPOSITE [C161614233]. Checking prime-powerness via GCD(A - B, C) ... GCD((A - B)[536870912 bits], C[536870861 bits]) = 1 Time for GCD = 00:05:09.784 Cofactor is not a prime power. [b]F30:[/b] (I also reran @64M FFT and confirmed the same values): Computing 1073741823-squaring residue R (mod known prime q = 640126220763137) A: R == 367500396407933 (mod q) B: R == 367500396407933 (mod q) Computing 1073741823-squaring residue R (mod known prime q = 1095981164658689) A: R == 95971534699540 (mod q) B: R == 95971534699540 (mod q) F30: using FFT length 61440K = 62914560 8-byte floats. 
this gives an average 17.066666666666666 bits per digit Doing one mod-F30 squaring of iteration-1073741823 residue [Res64 = A70C2A3DB98D6D9D] to get Fermat-PRP residue Fermat-PRP residue (A) = 0x1D7AA56A5428333C,30698378043,40327643421 Processed 99 bits in binary modpow; MaxErr = 0.406250000 3^(F-1) residue (B) = 0x6F856680482D3F4E, 9072042323,48656226334 (A - B) Res64 = 0xADF53EEA0BFAF3EE, C Res64 = 0xFFF9D50500000001 (A - B)/C: Quotient = 26538003597554197569639536298, Remainder Res64 = 0xEDFCE7FC50271D44 Suyama Cofactor-PRP test of F30 / 640126220763137 / 1095981164658689 with FFT length 62914560 = 61440 K: (A - B) mod C has Res64,35m1,36m1: 0xEDFCE7FC50271D44, 9723857098,34401987830. This cofactor is COMPOSITE [C323228467]. Checking prime-powerness via GCD(A - B, C) ... GCD((A - B)[1073741760 bits], C[1073741725 bits]) = 1 Time for GCD = 00:11:38.593 Cofactor is not a prime power.[/code] 
F30 primality test and matching double-check
[See posts #65, #70, #74 and #75 in this thread for previous notes re. the F30 Pépin tests.]
At looooooooooooong last my multiple-systems-hopping, on-again/off-again DC run of the Pépin test on F30 (the first-completed run was @60M FFT, the DC run @64M FFT) has finished. A rough-and-ready précis of the systems used along the way:[LIST][*] [b]25 Mar - 02 Apr 2017:[/b] By way of exploration, started a Pépin test of F30 on my Intel Haswell quad, 4c4t, using a 256-bit AVX2+FMA3 SIMD build of Mlucas 15.0. This ran at a throughput of ~100 ms/iter for a little over a week, completing 5.6M iterations (0.52% complete). [*] [b]02 Sep 2017 - 17 April 2019:[/b] Resumed in 64-threaded mode on a 64-core KNL server crowdfunded by GIMPS members and physically hosted by David Stanfill, using an AVX-512 build of Mlucas 17.0 (the just-released version supporting AVX-512). That ran at a rate of 68 ms/iter, reaching iteration 734400000 (68.4% complete). Around that time the KNL went offline and David Stanfill went AWOL; I believe the 'net slang is "ghosted". [Aside: David had mentioned he was cryptocurrency "mining" in a major way; somehow that and such disappearing acts seem to go together, as one is seeing with the recent global crypto-ponzi collapse.] [*] [b]28 Nov 2020:[/b] Resumed from the 730M-iter checkpoint in 64-threaded mode on a 68-core/272-thread KNL workstation bought used for $500, using an AVX-512 build of the same Mlucas 17.1 codebase used above, for consistency, and running one thread on each of physical cores 0-63. This crunched steadily at a throughput of 64 ms/iter for 11 months, reaching 1014M-iter, a little over 94% complete. [*] [b]31 Oct 2021:[/b] Continued in 64c64t mode on the above KNL workstation, but now as a low-priority "backup" run to a 64c128t p-1 run (first to B1 = 10^7, then in 100M-prime-interval chunks to B2 = 400M) of F33, using the p-1 functionality added to Mlucas over the course of most of 2021. In backup-run mode throughput was ~300 ms/iter; this ran for ~6 months, reaching 1070M-iter, a little over 99.66% complete, until 4 May this year, when the KNL gave up the ghost. 
[*] [b]12 Jun 2022:[/b] After various attempts at fault-tracing and system-revival proved fruitless, I pulled the SSD out of the KNL and installed it in my desktop open-frame GPU crunching rig, and with some assistance from Paul Underwood successfully mounted the KNL filesystem (the KNL ran CentOS, vs the desktop rig's Ubuntu, so some FS-naming-convention sleuthing was needed). The filesystem was thankfully intact, so I copied the F30-run logfile and latest checkpoint file over to my 2c4t Intel NUC8i3CYS mini and completed the F30 double-check run @64M FFT using an AVX-512 build there. That ran at ~360 ms/iter and finished on 28 June; the Selfridge-Hurwitz residues match those of the earlier run @60M FFT:[/LIST][code][Jun 27 22:45:52] F30 Iter# = 1073690000 [100.00% complete] clocks = 01:00:38.498 [ 0.3638 sec/iter] Res64: C52F182238099986. AvgMaxErr = 0.018915548. MaxErr = 0.023437500. [Jun 27 23:46:36] F30 Iter# = 1073700000 [100.00% complete] clocks = 01:00:37.609 [ 0.3638 sec/iter] Res64: B7D9DAA9B4FC17D8. AvgMaxErr = 0.018934628. MaxErr = 0.023437500. [Jun 28 00:47:22] F30 Iter# = 1073710000 [100.00% complete] clocks = 01:00:39.309 [ 0.3639 sec/iter] Res64: D90E3A766F5D1B5C. AvgMaxErr = 0.018922704. MaxErr = 0.023437500. [Jun 28 01:48:07] F30 Iter# = 1073720000 [100.00% complete] clocks = 01:00:38.582 [ 0.3639 sec/iter] Res64: 2D8B46F0E8274E4C. AvgMaxErr = 0.018944431. MaxErr = 0.023437500. [Jun 28 02:48:53] F30 Iter# = 1073730000 [100.00% complete] clocks = 01:00:39.394 [ 0.3639 sec/iter] Res64: 1CFA1EBC9CEE1C12. AvgMaxErr = 0.018940419. MaxErr = 0.025390625. [Jun 28 04:07:09] F30 Iter# = 1073740000 [100.00% complete] clocks = 01:18:08.842 [ 0.4689 sec/iter] Res64: 005D2E424C47F6A8. AvgMaxErr = 0.018948376. MaxErr = 0.023437500. [Jun 28 04:27:00] F30 Iter# = 1073741823 [100.00% complete] clocks = 00:19:42.173 [ 0.6485 sec/iter] Res64: A70C2A3DB98D6D9D. AvgMaxErr = 0.018940398. MaxErr = 0.023437500. F30 is not prime. Res64: A70C2A3DB98D6D9D. 
Program: E17.1 F30 mod 2^36 = 58947628445 F30 mod 2^35 - 1 = 26425548225 F30 mod 2^36 - 1 = 59810773698[/code] So in total, a little over 5 calendar years from start to finish for this patchwork-quilt DC run. The complete-final-residue files also have matching MD5 checksums, and as described in the above posts re. cofactor status, for such cases with one or more known factors, we can also use those to quickly compute modular-squaring-chain checksums of bitness roughly on the order of the known factors. I've posted links-to/descriptions-of the every-100M-iter interim-checkpoint files, the full final-residue file and the 2 run-logfiles (for the first run @60M FFT and the just-completed DC @64M) in the OP of this thread. I also have persistent full checkpoint files at 10M-iter intervals, but as each such file is 2^30/8 = 128 Mbyte, and there are ceiling(2^30/(10 million)) = 108 such, the total archive is ~13GB. The residue files are more or less completely random byte data, i.e. cannot effectively be compressed. I have burned a copy of that archive to a thumb drive as a backup for the one on my HD, and will make it available via a snail-mailed 16GB microSD on request, should anyone wish to do e.g. a parallel-run validation via rerun of said 10M-iter intervals. I don't know with any certainty whether this constitutes the first gigabit primality-test-with-double-check; if anyone has heard of another such, please post a link. For F31 and beyond: I have the Gerbicz check working for Fermat-mod arithmetic in the in-dev Mlucas v21 code; on top of that we now also have the PRP-proof mechanism being used for GIMPS work; that can also be used for Pépin-test work, as that inherently uses a long chain of successive mod-squarings. In effect, a certificate of primality, independently checkable in a small fraction of the time of the original primality test. 
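For completeness, the Gerbicz check for a pure squaring chain x -> x^2 (mod N) rests on a simple identity: if d_t is the running product of every L-th residue, then d_t = d_{t-1}^(2^L) * x0 (mod N), so the product can be recomputed independently every check interval and compared. A minimal sketch on the toy modulus F5, with illustrative parameters (a real implementation would do the verification via L extra squarings of d rather than a powmod, and would roll back to the last good checkpoint on mismatch):

```python
# Gerbicz-style consistency check for an iterated-squaring chain mod N.
N = (1 << 32) + 1     # F5 as a stand-in modulus
L = 8                 # check-block length (illustrative)
x0 = 3

x = x0
d = x0                # d_0 = x_0: running product of every L-th residue
saved_d = d
for i in range(1, 5 * L + 1):
    x = x * x % N     # the main squaring chain
    if i % L == 0:
        # Independent recomputation: d_t must equal d_{t-1}^(2^L) * x0.
        check = pow(saved_d, 1 << L, N) * x0 % N
        d = d * x % N
        assert d == check   # any mis-squared x in the last L iters would trip this
        saved_d = d
```

The per-interval overhead is roughly L squarings of d per L^2 chain squarings (when the verification is itself done by repeated squaring), which is why the check is nearly free for the billion-plus-squaring chains discussed here.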