2014-10-10, 21:48  #23
∂^{2}ω=0
Sep 2002
República de California
2DE0_{16} Posts 
Quote:
One of my ongoing frustrations is that I have not been able to get anywhere near the big speedups you describe with prefetch: mine have been at best 2-3%, an order of magnitude lower than expected. Getting another 20-30% speedup effectively "for free" would of course be huge, but I have never seen anything near that with my prefetch experiments over the years.

2014-10-10, 22:50  #24
P90 years forever!
Aug 2002
Yeehaw, FL
2^{3}×3×5×67 Posts 
Prime95 v27 and before prefetched way ahead into the L2 cache. That is, say we're doing the first pass operating on 128KB of data. During the processing of that 128KB of data, I slowly prefetch the next 128KB of data into the L2 cache.
In v28, I added prefetching of the next few cache lines into the L1 cache. This moves the data from the L2 cache into the L1 cache. How many total passes over main memory does your FFT take? Prime95 does the forward and inverse FFTs while reading and writing the FFT data only twice. If your code is doing more than that, you will run into memory bandwidth limitations sooner.
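George's two-level prefetch scheme can be sketched roughly as follows. This is an illustrative C fragment using the `_mm_prefetch` intrinsic, not Prime95's actual code; the block size, prefetch spacing, and the stand-in "work" (a plain sum) are placeholders:

```c
/* Sketch (not Prime95's code) of two-level prefetching: while doing
 * the FPU work on the current block, issue prefetcht1 for the *next*
 * block (pulling it toward L2), spread evenly across the loop, plus
 * prefetcht0 a couple of cache lines ahead of the current read
 * position (pulling it from L2 into L1). */
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0, _MM_HINT_T1 */

#define CACHE_LINE 64    /* bytes; 8 doubles per line */

/* Process one block of n doubles while prefetching the next block.
 * The summation stands in for real FFT work; returns the sum. */
double process_block(const double *cur, const double *next, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* one prefetch per cache line's worth of work, so the
         * prefetches are spread out rather than issued in a burst */
        if ((i % 8) == 0) {
            if (next)
                _mm_prefetch((const char *)next + i * sizeof(double),
                             _MM_HINT_T1);
            _mm_prefetch((const char *)(cur + i) + 2 * CACHE_LINE,
                         _MM_HINT_T0);
        }
        sum += cur[i];
    }
    return sum;
}
```

Prefetch instructions never fault, so running a little past the end of an array is harmless; the only cost of a mistuned distance is wasted bandwidth.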
2014-10-11, 05:34  #25
∂^{2}ω=0
Sep 2002
República de California
2^{5}·367 Posts 
I just played with the prefetch params in the forward-radix-16 DFT macro, trying bytes-ahead values for the address offset ranging from 8 (just one double ahead) to 4096 ... again only a few % variation in the resulting timings, and 1KB seems roughly the best, though the effect is weak. All these current prefetches are of the t1 (prefetcht1) variety.
Quote:
I'll illustrate my data-move strategy using the above set of complex radices (240,16,16,16,16) I'm using for F29. Call the overall runlength in terms of doubles n:

[1] Initial forward radix-240 decimation-in-frequency (DIF) pass on n/240-stride-separated data;
[2] Remaining 4 forward-FFT passes, dyadic-square step, and initial 4 inverse-FFT passes (the iFFT runs the radices in reverse order, thus again 4 x radix-16 here), done on one contiguous n/240-sized data chunk at a time;
[3] Final radix-240 decimation-in-time (DIT) pass on n/240-stride-separated data;
[4] Round and do carries.

Steps 1,3,4 (actual order is 3,4,1 here) are all fused into a single pass through the data. Step 2 involves multiple passes, but all on a contiguous block of data. (And I process 240 such blocks in turn.) The size of each of these blocks relates to the "1 MB chunk threshold" I described earlier, which is crucial for good performance on x86: 4 passes of complex radix-16 means 16^4 = 2^16 complex doubles, i.e. 2^17 doubles, exactly 1MB. For F30, if I again used leading radix 240, the resulting contiguous-data chunks would be 2MB, and I would expect (and have already confirmed) a timing deterioration. I already have in place a leading-radix-960 routine, which is slightly worse than 240 when used for F29, but is better for F30, and better still for F31.
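The chunk-size arithmetic above can be checked mechanically. The following C fragment (function names are mine, purely illustrative) encodes the radix decomposition (240,16,16,16,16) and the 1MB-chunk claim:

```c
/* Arithmetic check of the blocking described above: the radices
 * (240,16,16,16,16) give a transform of 240 * 16^4 complex doubles
 * (the "30M-double" FFT used for F29), and each contiguous step-2
 * chunk is 16^4 complex doubles = 2^17 real doubles = 1MB. */
#include <stdint.h>

#define COMPLEX_BYTES 16u   /* one complex double = 2 * 8 bytes */

static uint64_t pow_u64(uint64_t b, unsigned e)
{
    uint64_t r = 1;
    while (e--) r *= b;
    return r;
}

/* total transform length, counted in real doubles */
static uint64_t fft_doubles(void)
{
    return 2 * 240 * pow_u64(16, 4);
}

/* bytes in one contiguous chunk processed by npass radix-`radix` passes */
static uint64_t chunk_bytes(unsigned npass, unsigned radix)
{
    return pow_u64(radix, npass) * COMPLEX_BYTES;
}
```

With leading radix 240 at the next FFT length (F30), the remaining-radix product doubles, which is exactly the 2MB-chunk regression described above.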

2014-10-11, 14:39  #26
P90 years forever!
Aug 2002
Yeehaw, FL
2^{3}×3×5×67 Posts 
Quote:
The combined steps 3,4,1 work on 240*16 = 4KB chunks (240 complex doubles). These 4KB chunks easily fit in the L1 cache while being processed. Two comments on performance:

1) While processing the 4KB chunk in the L1 cache, you should be prefetching the next 4KB chunk into the L1 cache. My concern is that you are not doing enough FPU work to hide the latencies of reading the next 4KB chunk from main memory. If I'm right, you can improve the FPU code all you want but your actual timings won't change.

2) You likely break the radix-240 up into two steps, 16 by 15 (actually inverse 16, 15, carries, forward 15, 16), and you are working with horribly large strides. During the inverse radix-16 step in my example, Prime95 would read the horrible-stride data and write it to a contiguous 4KB scratch area. The remaining steps then deal with the pleasant small-stride scratch area, until the final forward radix-16 step writes it back to the FFT data area with horrible strides.

What I call pass 2 over main memory processes chunks of 16^4 * 16 bytes = 1MB. This pass operates out of the L3 cache. Some comments on performance:

1) The L3 cache is a lot slower than the L2 cache. The CPU will not be able to hide the latencies without prefetching into the L1 cache. You've found that prefetching 1KB ahead seems to be optimal for hiding this latency.

2) While you are processing the 1MB chunk, you can prefetch the next 1MB chunk into the L3 cache. You are doing plenty of FPU work to easily hide all main memory latency. Except...

3) I'm worried you may be exceeding the L3 cache's capabilities. If you operate on 1MB of data while prefetching the next 1MB of data and you have 4 cores going, that's 8MB of data. It probably won't all fit. I also worry about the contention you've got going with 4 cores all placing very heavy read and write demands on the shared L3 cache. Is it possible to exceed the L3 cache's bandwidth capabilities? I don't know.

Here's what I'd recommend trying: In pass 1, go with 240 x 16 = 60KB. While processing, prefetch the next 60KB into the L2 cache (spread these prefetches out!). Total demand on the L2 cache is 180+KB (60KB FFT data, next 60KB FFT data, 60KB scratch area, some sin/cos and weights). A tight fit. In pass 2, go with 16x16x16 = 64KB. Should fit nicely in the L2 cache since you don't need a scratch area.

That's a lot of work. I hope I'm right and it will be worth the effort. No guarantees.
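The L2 budget behind these recommendations can be tallied explicitly. A small illustrative C fragment (the names and the 256KB-per-core L2 figure are my assumptions, matching Haswell):

```c
/* Budget check for the suggested blocking, assuming Haswell's 256KB
 * per-core L2. Pass 1 needs current chunk + prefetched next chunk +
 * scratch area; pass 2 needs neither scratch nor a next-chunk slot
 * of its own in L2. All sizes in bytes. */
#include <stdint.h>

#define KB        1024u
#define L2_BYTES  (256u * KB)

/* pass 1: 240 x 16 complex doubles (16 bytes each) per chunk */
static uint32_t pass1_chunk(void)  { return 240u * 16u * 16u; }      /* 60KB */

/* current chunk + next chunk + scratch area (sin/cos and weights extra) */
static uint32_t pass1_demand(void) { return 3u * pass1_chunk(); }    /* 180KB */

/* pass 2: 16 x 16 x 16 complex doubles per chunk */
static uint32_t pass2_chunk(void)  { return 16u * 16u * 16u * 16u; } /* 64KB */
```

180KB of 256KB before twiddles and weights is the "tight fit" above; the 64KB pass-2 chunk leaves far more headroom.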

2014-10-12, 01:54  #27
∂^{2}ω=0
Sep 2002
República de California
2^{5}×367 Posts 
George, many thanks for the thoughtful, detailed reply.
Quote:
However, note that for radices like 240 which decompose into 2 coprime factors (240 = 15*16 = odd * power-of-2, but could also be the product of 2 coprime odds, or odd1 * (odd2 * power-of-2) with odd1,2 coprime), I typically use 2 such contiguous local stores, due to the index-permutation flexibility demanded by my twiddleless implementation of such coprime radix pairs. So e.g. pass 1 reads from large-stride-separated main-array locations and writes to store 1; pass 2 reads from store 1 and writes to store 2. But I still think the main-array step is the crucial one here, as you note below. Quote:
Quote:
The carry macros also operate on the last-used scratch area. Carries also need auxiliary data: DWT weights for Mersenne-mod, and complex roots of (-1) (similar to the FFT twiddles, but those are (n/2)th roots of +1, as opposed to the (n/2)th roots of -1 (the same as the first (n/2), that is upper-complex-half-plane, (n)th roots of +1) needed for the "acyclic twisting" scheme for moduli like Fermats with a "+" sign).

I use a two-table multiply scheme for on-the-fly generation of both FFT twiddles and IBDWT weights. As I noted to you some years ago, for the IBDWT weights w_0,...,w_{n-1} I make use of the identity w_j * w_{n-j} = 2 to allow a single array to encode both the forward and inverse weights, at the price of some slightly-less-cache-friendly reverse-running indexing for the inverse-weights step. My two-table schemes have each pair of tables within a factor of 2 of each other in size; thus I have a pair of O(sqrt(n))-sized tables of complex data for FFT twiddles, another such pair for acyclic-twisting weights in Fermat-mod mode, and a pair of similar-sized real-data tables for DWT weights in Mersenne-mod mode.

Note that only one such pair of data tables competes for L1 cache with actual "live" transform (i.e. residue-related) data in any phase of the overall FFT-mod-mul scheme, though: the fused steps [1,3,4] need either IBDWT weights or acyclic-twisting weights, depending on the kind of modulus; step [2] needs just the 2 subtables needed to generate the complex-FFT twiddles.

Aside: For non-power-of-2 FFT lengths (this case was not considered in the now-famous 1994 Crandall/Fagin DWT paper) the Fermat-mod scheme also needs Mersenne-style IBDWT weights in addition to the acyclic-twisting weights, but the IBDWTs are memory-negligible here, because for an FFT length of form odd * 2^m the IBDWT weights cycle with sequence length (odd), and it makes little sense to use odd-radix components much larger than 15.
(The largest such I have is 63, and it will be used specifically for the Fermat numbers F32 and F33, where an FFT length of form 15 * 2^m will no longer suffice due to ROE growth: I will use FFT length 63 * 2^m for one of the 2 independent tests, and a power-of-2 length 2^(m+6) for the other.)
Quote:
For the coming month or so, I'll probably continue with my ongoing work of FMA-izing the remaining key AVX code macros which need it (don't expect huge gains from that on Haswell, but reliable gains, and hopefully even better ones on Broadwell when it comes out), and do some large-stride-related prefetch experiments as a side project. Will let you know what I find as results become available.

2014-10-13, 02:13  #28
∂^{2}ω=0
Sep 2002
República de California
2^{5}×367 Posts 
I just added some prefetcht1 calls to my AVX Fermat-mod carry macro, such that during processing of the 240 (or more generally R, where R is the leading radix) carries within the segmented main data array (which is not directly in the picture during carry processing, as this is all done using a small contiguous local storage area as described above), prefetches-into-L1 of the data at the R distinct n/R-stride-separated main-array locations get done while the carries are processed. The carry macros seem the logical place to do this prefetching, since even though they do not otherwise access the main array in any way, they make it easy to propagate such prefetches to all the various leading-radix carry routines I have.
For these experiments I added code such that the bytes-ahead offset for the prefetches is set as a compile-time command-line define, and played with various values: 64 is the obvious candidate in AVX mode, since an "AVX complex double" consists of 2 adjacent quartets of doubles, thus "fetch the next complex datum at each array location" means fetching 64 bytes ahead of the current data addresses. In practice, though, 32 bytes ahead proves a smidge faster (and I also tried other values, like 16 and 128, all of which were clearly worse). Instant 4% speedup (95 ms/iter -> 91 ms/iter for my ongoing F29 run), the clearest timing "signal" I have ever seen in all my playing with prefetch. Thanks, George! Hopefully this is the start of a trend; still lots of other things to play with here, as you lay out above.
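The compile-time-tunable prefetch distance can be sketched like this. It is illustrative only: the `PFETCH_DIST` define and the toy carry loop are mine, not Mlucas's actual macro:

```c
/* Sketch of a tunable bytes-ahead prefetch distance: PFETCH_DIST is
 * a compile-time define (e.g. -DPFETCH_DIST=32), and during carry
 * processing one prefetcht1 is issued at each of the R distinct
 * n/R-stride-separated main-array locations, PFETCH_DIST bytes past
 * the current address. Illustrative, not Mlucas's code. */
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T1 */

#ifndef PFETCH_DIST
#define PFETCH_DIST 32   /* bytes ahead; 32 beat 16/64/128 in the tests above */
#endif

/* Toy "carry step": touch the first word of each of the R blocks,
 * prefetching upcoming data at each stride-separated site. */
void do_carries(double *arr, size_t n, size_t R, double carry)
{
    size_t stride = n / R;
    for (size_t r = 0; r < R; r++) {
        double *base = arr + r * stride;
        _mm_prefetch((const char *)base + PFETCH_DIST, _MM_HINT_T1);
        base[0] += carry;   /* stand-in for the real carry macro */
    }
}
```

Because the constant is baked in at compile time, sweeping distances is just a rebuild per value, with no runtime branch in the hot loop.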
2016-01-23, 01:52  #29
∂^{2}ω=0
Sep 2002
República de California
2^{5}×367 Posts 
First Pépin test of F29 completed late Wednesday; it took slightly under 20 months at FFT length 30M. Start-of-run per-iteration time on my 3.3 GHz Haswell 4670 quad (running at stock) was ~115 ms; several rounds of AVX2/FMA-oriented optimizations brought this down to ~90 ms, which was representative of roughly the final 80% of the iterations.
On to run #2 (@FFT length 32M) ... our own Mike/Xyzzy was kind enough to run the first ~25M iterations @32M over the last several months for me on his 2-core Haswell NUC, so I am proceeding where that run left off.

[UPDATE] I was going to post the final checkpoint Res64 and the traditional Selfridge-Hurwitz residues, along with some notes on run #2, after letting the latter run for a day or two on the same quad-Haswell system, but it has already turned up a 'bonk!' between iterations 25M and 26M. Not yet 100% sure whether the error is in my new or the old run, but I suspect the latter, because that entire run was 'fragile' and prone to crashes at random intervals ranging from an hour to several months. Said crashes were accompanied by either 'fatal roundoff error = 0.5' or 'nonzero exit carry' errors indicating some kind of data corruption, whereas the FFT length = 32M run has been clean of any untowardness.

I further note that the 30M run used a fast but 'preliminary' binary of my v14.1 code release ... when I compared that to the eventual stabilized release code it was 2-3% faster at the FFT length in question, so I continued using it. Big mistake, as I now suspect the aforementioned 'noisy' fatal bad-data errors were likely accompanied by silent fails ... note the residue discrepancy between the 30M and 32M runs appears at a checkpoint where there are no signs of 'distress' from either run.

So, likely I am looking at a massive waste of compute effort ... lesson learned: never again do such a run-pair in serial mode, always do the 2 side-by-side on systems yielding as similar speeds as possible. To that end, I am looking for a volunteer. Note that should you choose to accept this mission, it will be a significant runtime commitment (around 18 months) with 0 chance of a prime, although should you change your mind (or blow up your system) at some point along the way, you can hand things off to a fresh recruit (even if your savefiles get nuked, since we will be cross-checking those vs mine).
But you will be contributing to a small piece of math history and of course getting a grateful 'thanks!' in any publication that may eventually result. 
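For reference, the test being cross-checked in these runs is Pépin's: F_m = 2^(2^m)+1 is prime iff 3^((F_m-1)/2) ≡ -1 (mod F_m). The giant runs implement the repeated modular squarings via FFT multiply; for the small Fermat numbers that fit in 64 bits the same test reduces to a few machine-word squarings, as in this sketch (it relies on gcc/clang's `__uint128_t` for the overflow-free modular multiply):

```c
/* Pépin test for small Fermat numbers F_m = 2^(2^m) + 1.
 * F_m is prime iff 3^((F_m-1)/2) == -1 (mod F_m); since
 * (F_m-1)/2 = 2^(2^m - 1), the test is just 2^m - 1 squarings
 * of 3 mod F_m. Valid for 1 <= m <= 5 (F_5 = 2^32+1 is the
 * largest Fermat number fitting in 64 bits). */
#include <stdint.h>

/* (a*b) mod m without overflow, via the 128-bit extension type */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)((__uint128_t)a * b % m);
}

/* returns 1 if F_m is prime by Pépin's criterion, else 0 */
int pepin(unsigned m)
{
    uint64_t F = (1ULL << (1u << m)) + 1;   /* F_m = 2^(2^m) + 1 */
    uint64_t r = 3;
    for (unsigned i = 0; i < (1u << m) - 1; i++)
        r = mulmod(r, r, F);                /* repeated squaring */
    return r == F - 1;                      /* -1 mod F_m */
}
```

F_1 through F_4 pass and F_5 = 4294967297 = 641 * 6700417 fails, matching Euler; for F29 the same squaring chain is ~2^29 FFT-based squarings, hence the 20-month runtime.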
2016-01-23, 05:09  #30
Bemusing Prompter
"Danny"
Dec 2002
California
9B2_{16} Posts 
Just curious, did the system have ECC memory?

2016-01-23, 05:19  #31
Aug 2002
8514_{10} Posts 
None of his (or ours) has ECC memory.

2016-01-23, 07:24  #32
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2^{2}×13×127 Posts 
An 18-month test without ECC RAM!

2016-01-23, 15:05  #33
"Ram Shanker"
May 2015
Delhi
2·19 Posts 
I envy these algorithms!
