20131024, 00:07  #1 
∂^{2}ω=0
Sep 2002
República de California
5×17×137 Posts 
Pépin tests of Fermat numbers beyond F24
Placeholder thread to summarize what has been done in this area and collect new results as they complete.
General math/algo background: As is well known, Fermat numbers admit both a very efficient rigorous primality test and a fast discrete "implicit mod" convolution method for effecting it, analogously to the Mersennes. I'm running the Pépin test for F_{28} as I type this on my quadcore [Text below is onlyslightlyedited version of an email I sent to Wilfrid Keller, Richard Crandall and Jason Papadopoulos on 13 November 2012] Available here in .tgz format (use 'tar xzf files.tgz' to unpack) are summary runstatus files for my successful (in the sense that 2 runs at different FFT lengths yield matching residues) Pépintest compositeness results for F2426. (F24 was rerun to test the hybrid DWT code described above). I believe F25 and perhaps F26 have already been Pépintested by others (likely using some version of George's x86 assembler FFT), in which event those residues can be compared to mine. These were run using a C/assembler hybrid of my own Fermatmod convolution code, using SSE2 on the Intel x86 for the heavy computational work. This implements both the traditional powerof2FFT which uses a negacyclicweighting to effect Fermatmod multiply arithmetic, as well as the nonpowerof2FFT alluded to in the Crandall/Mayer/Papadopoulos F24 paper, which combines a negacyclic weighting and a Mersennemodstyle dualbase DWT. The machinery was very modest sittingonmydesk stuff: In each case the smaller nonpowerof2FFTlength run was on 1 core of a Core2 dualcore Lenovo notebook (vintage 2008), built under Win32 using MS Visual Studio. This machine's processor is rated at 1.67 GHz, but due to a quirk of the powermanagement system, ever since the battery died 3 years ago, the system only runs at 2/3 speed, that is ~1.1 GHz. (I chose not to buy a new one, as the purchase of the MacBook for my 64bit builds allowed me to convert the Lenovo to stayathome plugin status). Since F24 had already been proven composite via matching runs at FFT length 1024k (k mean 1024 floating doubles), I include only the files for the rerun at 896k (humanreadable checkpointstatus file: f24_896k.stat) on the above Lenovo notebook using the hybrid DWT code, said test performed in October of 2011. F25 was tested @1792 on the same machine from late October 2011 to January 2012 (checkpoint file: f25_1792k.stat); F26 ran from January to October of this year (checkpoint file: f26_3584k.stat). The larger powerof2 runlengths for F25 and F26 were done on 1 core of a 2 GHz Core2 Duo Apple MacBook, built under 64bit OS X using GCC 4.2. The 64bit builds include versions of various key assembly macros which take advantage of the doubled SSE register set (from 8 to 16 128bit SSE registers) available in 64bit mode; this yields ~10% added percycle throughput, on top of that afforeded by the higher clock rate, larger caches and faster system memory of the newer machine. On the other hand, the MacBook is my personaluse mobile machine, so does not enjoy quite the 24x7 uptime of the Lenovo. That is reflected in the frequent restartsfrominterrupt and timing slowdowns reflected in the respective checkpointstatus files, f25_2048k.stat [run from 02 Dec  27 Jan 2012] and f26_4096k.stat [27 Jan  08 Sep 2012; in there you can see the timing boost I got between 1824 Jul, when my macbook's OEM fan finally got terminally dustclogged/wornout and I replaced it]. F27 and beyond will run in multithreaded mode on the above and faster/morecore hardware; it was only very recently that I achieved a first working multithreaded version of the SSE2based convolution code. I am currently playing with various threadmanagement schemes (including a nice threadpool API Jason kindly has shared, which he adapted from some opensource code he found) to see what works best in the sense of efficiency and parallelscalability. I also include a (single) copy of the final bytewise residue files for all of the above exponents; in each case I first compared the full Pépin'test residues for sameexponent runs at different runlengths to via diff of the respective final bytewise residue files. The format for these residue files is slightly different than that described in the F24 paper, namely: 1 byte encoding an internal test type (other users of this file should ignore this); 1 byte encoding an internal modulus type (other users of this file should ignore this); 8 bytes encoding "number of mod squarings performed" in littleendian byte order; n/8 bytes storing the Pépintest residue R of Fn (starting with seed 3) in littleendian byte order; 8 bytes encoding the leastsignificant 64 buts of R (R modulo 2^64) in littleendian byte order; 5 bytes encoding the (R modulo 2^351) in littleendian byte order; 5 bytes encoding the (R modulo 2^361) in littleendian byte order; 1 byte for the C endoffile (EOF) character. The small C function attached below may be easily adapted to read the files, if desired. Anyone making use of the residues (e.g. for cofactor PRP testing) should validate their residuereading code by directly computing the 3 SelfridgeHurwitz residues from the bytewise residue and comparing those values to the SH residues which are stored in the penultimate 18 bytes of the savefile. (And both of those sets should match the final residues printed in the respective run's f**.stat file). Cheers Ernst =============================== Edit (13 Nov 2017): Here are links to tbz2'ed (use 'tar xjf *tbz2' to unpack) statusandfinalresidue files for F24F29. F24 has just the 896KFFT .stat file, since that result matched the original 1MFFT one published in the Crandall/Mayer/Papadopoulos Math Comp paper. Each subsequent archive has a pair of such status files showing dates and interim Res64s for the Pe'pin tests at the respective FFT lengths used. F29 has a trio of such .stat files, one from the original (and fatallyhosedbyundetectedhardwareerrorbetweenthe25.63Miterand25.64Mitercheckpoints) 30MFFT run on my haswell quad which I've given a '.orig' extension, and the subsequent 2 from the sidebyside runs at 30M and 32M FFT which I ran to iter ~230M on the Haswell while doing neardaily Res64 crosschecks, then moved to a pair of faster systems (32core Xeon/avx2 and 64core KNL/avx512) to finish, as describe in my 13 Nov 2017 new post to this thread: F24_FILES.tbz2 (2MB, MD5 = 06025e5d6ca3a746b5c0be1b3fd12465) F25_FILES.tbz2 (4MB, MD5 = 6db5632defdb06e1d7fde47322102544) F26_FILES.tbz2 (8MB, MD5 = 83bc460c7eb29b69364e038c53e72710) F27_FILES.tbz2 (16MB, MD5 = fea03c9ac95668e90ab22b64b25c5882) F28_FILES.tbz2 (32MB, MD5 = b8d9d6a2136005f4e1dd5fbc819c1e1d) F29_FILES.tbz2 (64MB, MD5 = 7ac7a71eee658acd9f8fd8152882997a) =============================== Code:
/*** READ: Assumes a valid file pointer has been gotten via a call of the form fp = fopen(RESTARTFILE,"rb"); ***/ int read_ppm1_savefiles(uint64 p, FILE*fp, uint32*ilo, uint8 arr_tmp[], uint64*Res64, uint64*Res35m1, uint64*Res36m1) { uint32 i; uint32 nbytes; uint64 nsquares= 0; ASSERT(HERE, !(p >> 32), "read_ppm1_savefiles: p must be 32bit or less!"); /* Future versions will need to loosen this p < 2^32 restriction: */ if(!file_valid(fp)) { sprintf(cbuf, "read_ppm1_savefiles: File pointer invalid for read!\n"); return FALSE; } fprintf(stderr, " INFO: restart file %s found...reading...\n",RESTARTFILE); /* t: */ if((i = fgetc(fp)) != TEST_TYPE) { sprintf(cbuf, "read_ppm1_savefiles: TEST_TYPE != fgetc(fp)\n"); return FALSE; } /* m: */ if((i = fgetc(fp)) != MODULUS_TYPE) { sprintf(cbuf, "ERROR: read_ppm1_savefiles: MODULUS_TYPE != fgetc(fp)\n"); return FALSE; } /* s: */ for(nbytes = 0; nbytes < 8; nbytes++) { i = fgetc(fp); nsquares += (uint64)i << (8*nbytes); } /* For now, just make sure nsquares < 2^32 and copy to ilo: */ if(nsquares >= p) { sprintf(cbuf,"read_ppm1_savefiles: nsquares = %llu out of range, should be < p = %llu\n", nsquares, p); // return FALSE; } *ilo = nsquares; /* r: */ /* Set the expected number of residue bytes, depending on the modulus: */ if(MODULUS_TYPE == MODULUS_TYPE_MERSENNE) { nbytes = (p + 7)/8; TRANSFORM_TYPE = REAL_WRAPPER; } else if(MODULUS_TYPE == MODULUS_TYPE_FERMAT) { ASSERT(HERE, p % 8 == 0,"read_ppm1_savefiles: p % 8 == 0"); nbytes = (p/8) + 1; TRANSFORM_TYPE = RIGHT_ANGLE; } i = fread(arr_tmp, sizeof(char), nbytes, fp); /* Read bytewise residue... */ if(i != nbytes) { sprintf(cbuf, "read_ppm1_savefiles: Error reading bytewise residue array.\n") ; return FALSE; } if(ferror(fp)) { sprintf(cbuf, "read_ppm1_savefiles: Unknown Error reading bytewise residue array.\n") ; return FALSE; } if(feof(fp)) { sprintf(cbuf, "read_ppm1_savefiles: Endoffile encountered while attempting to read bytewise residue array.\n") ; return FALSE; } /* 8 bytes for Res64: */ *Res64 = 0; for(nbytes = 0; nbytes < 8; nbytes++) { i = fgetc(fp); *Res64 += (uint64)i << (8*nbytes); } /* 5 bytes for Res35m1: */ *Res35m1 = 0; for(nbytes = 0; nbytes < 5; nbytes++) { i = fgetc(fp); *Res35m1 += (uint64)i << (8*nbytes); } ASSERT(HERE, *Res35m1 <= 0x00000007FFFFFFFFull,"read_ppm1_savefiles: *Res35m1 <= 0x00000007ffffffff"); /* 5 bytes for Res36m1: */ *Res36m1 = 0; for(nbytes = 0; nbytes < 5; nbytes++) { i = fgetc(fp); *Res36m1 += (uint64)i << (8*nbytes); } ASSERT(HERE, *Res36m1 <= 0x0000000FFFFFFFFFull,"read_ppm1_savefiles: *Res36m1 <= 0x0000000fffffffff"); /* Don't deallocate arr_tmp here, since we'll need it later for savefile writes. */ return TRUE; } Last fiddled with by ewmayer on 20171114 at 01:20 Reason: Added links to tbz2'ed statusandfinalresidue files for F24F29 
20131024, 00:43  #2 
∂^{2}ω=0
Sep 2002
República de California
2D7D_{16} Posts 
F27 has one complete run, at FFT length 7*2^{20} doubles, from 7 Dec 2012  11 Mar 2013 on all 4 cores of my quadcore Sandy Bridge box, using the aforementioned 64bit GCCbuilt multithreaded SSE2 code. The checkpointsummarystatus file (attached) shows a number of roundoff warnings, reflecting the fact that while FFT lengths of the form 7*2^{m7} are perfectly adequate for F_{m} with m <= 26, F_{27} is the largest Fermat number testable using such a length and a doublefloatbased FFT.
[Mar 11 22:59:49] F27 Iter# = 134210000 clocks = 00:00:00.000 [ 0.0593 sec/iter] Res64: C1A1546143F5D665. AvgMaxErr = 0.257079724. MaxErr = 0.343750000 [Mar 11 23:07:28] F27 Iter# = 134217727 clocks = 00:00:00.000 [ 0.0593 sec/iter] Res64: 043A6C8B9272B297. AvgMaxErr = 0.198636938. MaxErr = 0.343750000 F27 is not prime. Res64: 043A6C8B9272B297. Program: E3.0x R27 mod 2^36 = 49701630615 R27 mod 2^35  1 = 30710305624 R27 mod 2^36  1 = 39732934618 The second run of F_{27}, at a mix of FFT lengths 2^{23} and 15*2^{19}, actually started earlier than the above one (08 Sep 2012), but has been crawling along on my nowquiteaged Core2based macbook. That has reached iteration ~85m (with all interim residues matching those of the above run), but with the cooling fan again beginning to sound stressed, I have suspended it and will complete it on my Haswell quad once that completes the first run of F_{28} next month. Last fiddled with by ewmayer on 20171115 at 00:23 Reason: SH: F27 > R27 
20131025, 17:42  #3 
"Matthew Anderson"
Dec 2010
Oregon, USA
3^{2}·89 Posts 
Hi all,
Thank you ewmayer for your thorough analysis of Fermat analysis. I appreciate your post. Regards, Matt A. 
20131111, 21:24  #4 
∂^{2}ω=0
Sep 2002
República de California
5×17×137 Posts 
First run of F_{28}, at FFT length 15360k = 15*2^{20} doubles, finished yesterday. The timings in the attached checkpointsummarystatus file reflect the first part of the run being done on my old 4core Sandy Bridge system using the SSE2 Mlucas code, then switching to AVX once that code became available (my AVX port was driven by the desire to get this run online ASAP; other FFT radices and functionality like Mersennemod convolution came after that), and finally switching to my new Haswell quad in June.
The summaries for the last few checkpoints and the final SelfridgeHurwitz residues are [Nov 10 16:57:24] F28 Iter# = 268430000 clocks = 00:00:00.000 [ 0.0638 sec/iter] Res64: 360A9E1308A3165A. AvgMaxErr = 0.145900781. MaxErr = 0.218750000 [Nov 10 17:03:13] F28 Iter# = 268435455 clocks = 00:00:00.000 [ 0.0640 sec/iter] Res64: 468731C3E6F13E8F. AvgMaxErr = 0.079615625. MaxErr = 0.203125000 F28 is not prime. Res64: 468731C3E6F13E8F. Program: E3.0x F28 mod 2^36 = 16759471759 F28 mod 2^35  1 = 30476727665 F28 mod 2^36  1 = 9104636564 Sun Nov 10 17:22:00 PST 2013 Before beginning the DC run of F_{28} I will spend a few weeks to complete the abovementioned DC run of F_{27} which I had been doing on my Core2based macbook. 
20131127, 00:17  #5 
∂^{2}ω=0
Sep 2002
República de California
5·17·137 Posts 
Second run of F27, at a mix of FFT lengths 8192k and 7680k  I briefly used the larger length early on while working through some multithreaded performance issues  finished last night. Both this and the earlier run @7168k (which was on the ragged edge of roundofferrorness) yield matching results.
If you compare the latter parts of the above 7168krun statfile to the one attached below, you see a more than 2x throughput boost between the 2 respective quadcoresystems: Sandy Bridge+ddr 1333, versus Haswell + ddr2400. (My code benefits far more from the larger caches and such of Haswell than does George's). 2nd run of F28, this one at 16384k, is now underway. 
20131127, 02:20  #6 
"Phil"
Sep 2002
Tracktown, U.S.A.
1119_{10} Posts 
Ernst, have you performed the Suyama test of the cofactor of F28, also of F27 ? That not only tells you whether the cofactor is composite, but whether it is a prime power. I don't recall that anyone has yet reported that the cofactor of F28 is composite.

20131127, 03:26  #7 
∂^{2}ω=0
Sep 2002
República de California
5·17·137 Posts 
No, not yet  I am currently just accumulating the Pépintest residues, will code up the Suyama posttest sometime next year. (I had a working version of this in my veryold Fortran90 code, but never implemented it in C/assembler.)

20131127, 04:15  #8 
"Phil"
Sep 2002
Tracktown, U.S.A.
1119_{10} Posts 
The GCD step is the most timeconsuming part of the test, but perhaps even the GMP version could do these in a reasonable time now (?) (Not that I have ever had the need to test it!) And we could also test GMP against George's code for verification.

20131127, 04:38  #9 
∂^{2}ω=0
Sep 2002
República de California
5·17·137 Posts 
Isn't a multiword remainder algo more efficient than a GCD here?

20131215, 21:48  #10 
∂^{2}ω=0
Sep 2002
República de California
2D7D_{16} Posts 
Latest code enhancements drop the periteration time of my ongoing DC run of F28 @16384K from a (disappointing) 68 ms to 58 ms. The reason I say 68 ms was disappointing is that based on my 25 ms timing for the F27 DC run @7680K I expected a time of ~55ms for the slightlymorethandoubling of the FFT length.
The code change yielding the improved timing was to implement a larger leadingpass FFT radix (call it r0) of (complexdata) 128. Until yesterday the largest r0 I had was 64. Now one of the key performancerelated trends I have observed in playing with the multithreaded SIMD code over the past year is that it really likes larger r0 values. I believe this is a result of r0 determining the perthread dataset size, as (total mainarray size)/r0. An especially critical threshold here seems to be 1 MB, which is interesting because it is not obviously related to any of the cache sizes  it is several times (more than 2x) smaller than L2, and many times larger than L1. Nonetheless, the threshold effect is striking, and persists across multiple generations of the Intel x86, at least from Core2 onward. Anyhoo, for F28 @16384K, (total mainarray size) = 128 MB, thus r0 = 64 yields perthread dataset size of 2 MB; r0 = 128 gets us back to 1 MB. Time to see if another doubling of r0 gives another boost  implementing these is less trivial than it sounds because: o My code uses a fused finaliFFTpass/carry/initialfFFTpass strategy, thus the DIF and DIT DFTs at any new radix r0 need to be combined with the carry macros, and (for r0 a power of 2 or nearly so) these need to handle the different carry propagation schemes used for (realdata) Mersennemod and (complexdata, "right angle" twisted transform) Fermatmod; o For each new r0 I need to implement 3 distinct code paths: Scalar C, SSE2 and AVX. These share much of the "ambient" C code, but the DFT macros are custom inlineasm for the SIMD paths. For now (since F28 at the above FFT length is the focus) I can just stick to r0 a power of 2, but eventually I need to go back and "fill in" the various nonpowerof2 radices as well, for instance between r0 = 64 and 128 I also will need to support r0 = 72,80,88,96,104,112 and 120. Each such radix needs 23 weeks of work to code ... much of my 2014 will be consumed this way. One bonus, though, is that larger r0 is also wellgeared for manycore, since r0 sets the maximum number of threads my code can use. Last fiddled with by ewmayer on 20131215 at 21:50 
20140314, 06:18  #11 
∂^{2}ω=0
Sep 2002
República de California
11645_{10} Posts 
Nice little timing breakthrough recently  short title is "compactobjectcode schemes for largeradix DFTs". Briefly, if one is dealing with a blob of code  mainly "hot sections" such as loop bodies here  which exceeds the L1 Icache size, instructions must be fetched from L2, i.e. compete with data fetches for L2toL1 bandwidth. In my case the main culprits are the routines which wrap the carry step, and are laid out with a codeintensive loop body as follows:
1) Do radixR finaliFFT pass on R complex data; 2) Send each output of [1] to the applicable DWTunweight/normalize/round/carry/DWTreweight macro (distinct ones for Fermat and Mersennemod transform modes); 3) Do radixR initial fFFT pass on the R complex normalized data. For R > 16 each DFT can easily run to a thousand instructions or more, and each of the R carrymacro invocations in [2] is on the order of 50100 instructions. I first considered reducing the sheer amount of code involved in the context of the nonSIMD (scalardouble) builds, where the optimizer has to deal with all of the code resulting from 13 (as opposed to just inlining blocks of assembly code, as in SIMD builds). I found that simple steps like modifying [2] to a single parametrized carrymacro call wrapped in a loop not only slashed the compile time, it appreciably boosted the resulting code speed. That's when I also started trying the strategy on the SIMD code paths. Upshot: ~30% speedup across the board [i.e. all FFT lengths, and for both 1thread and multithreaded] for scalardouble and SSE2 builds on my aging Core2Duo Macbook. I am now within a factor of 2x of Prime95, as typified by the following Mlucas/Prime95 timing ratios (in mSec/iter) at FFT length 2560K: 1thread: 128/70 = 1.83x 2thread: 73/46 = 1.59x [These are using p95 v25.10  I don't think George has sped the old SSE2 code much in the interim, but perhaps the multithreaded performance has been upped.] The one annoyance: I get no speedup from the objectcodeshrinkage on my Haswell quad. But my code already got a much greater percycle boost than p95 on Haswell, which leads me to believe the various hardware improvements [doubled L1 cache sizes, doubled L2toL1 bandwidth] on Haswell mitigated the Icacheoverflow issue for me. I believe this phenomenon also explains many of the seeminglybizarre performance issues I previously encountered with my SSE2 builds on Core2. Last fiddled with by ewmayer on 20140314 at 06:19 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
P1/P+1 on Fermat numbers  ATH  Operazione Doppi Mersennes  2  20150125 06:27 
What are the Primality Tests ( not factoring! ) for Fermat Numbers?  Erasmus  Math  46  20140808 20:05 
LLT numbers, linkd with Mersenne and Fermat numbers  T.Rex  Math  4  20050507 08:25 
Two Primality tests for Fermat numbers  T.Rex  Math  2  20040911 07:26 
Fermat Numbers  devarajkandadai  Math  8  20040727 12:27 