[QUOTE=Prime95;384890]In the radix16 code, have you tried prefetching the next radix16 data into the L1 cache?[/QUOTE]
My current prefetch scheme there allows for a prefetch with a compile-time "fetchahead" distance specified. Based on experiments I found the best value of that to be ~1KB, very different from the "fetch very next set of inputs" scheme you describe. But let me do a quick revert to the latter when I get home this evening and have access to the Haswell, and I will let you know what the result is. One of my ongoing frustrations is that I have not been able to get anywhere near the big speedups you describe with prefetch -- mine have been at best 2-3%, an order of magnitude lower than expected. Getting another 20-30% speedup effectively "for free" would of course be huge, but I have never seen anything near that with my prefetch experiments over the years. 
Prime95 v27 and before prefetched way ahead into the L2 cache. That is, say we're doing the first pass operating on 128KB of data. During the processing of that 128KB of data I slowly prefetch the next 128KB of data into the L2 cache.
In v28, I added prefetching of the next few cache lines into the L1 cache. This moves the data from the L2 cache into the L1 cache. How many total passes over main memory does your FFT take? Prime95 does the forward and inverse FFTs while reading and writing the FFT data only twice. If your code is doing more than that you will run into memory bandwidth limitations sooner. 
I just played again with the prefetch params in the forward-radix-16 DFT macro, tried bytes-ahead values for the address offset ranging from 8 (just one double ahead) to 4096 ... again only a few % variation in the resulting timings, and 1KB seems roughly the best, though the effect is weak. All these current prefetches are of the prefetcht1 variety.
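For concreteness, here is a minimal sketch of the kind of compile-time bytes-ahead prefetch being swept here. The function and loop are hypothetical stand-ins, not my actual radix-16 DFT macro, and I use gcc's `__builtin_prefetch` with locality hint 2 as a portable stand-in for the prefetcht1 instruction:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compile-time fetchahead distance in bytes; the sweep
   described above tried values from 8 to 4096, with ~1KB about best. */
#ifndef FETCHAHEAD
#define FETCHAHEAD 1024
#endif

/* Stand-in workload: accumulate an array while issuing a prefetcht1-style
   hint FETCHAHEAD bytes past the current load address. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* rw = 0 (read), locality = 2 roughly corresponds to prefetcht1 */
        __builtin_prefetch((const char *)&a[i] + FETCHAHEAD, 0, 2);
        s += a[i];
    }
    return s;
}
```

Prefetching past the end of the array is harmless, since prefetch is only a hint and cannot fault.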
[QUOTE=Prime95;384941]Prime95 v27 and before prefetched way ahead into the L2 cache. That is, say we're doing the first pass operating on 128KB of data. During the processing of the 128KB of data I slowly prefetch the next 128KB of data into the L2 cache. In v28, I added prefetching of the next few cache lines into the L1 cache. This moves the data from the L2 cache into the L1 cache. How many total passes over main memory does your FFT take? Prime95 does the forward and inverse FFTs while reading and writing the FFT data only twice. If your code is doing more than that you will run into memory bandwidth limitations sooner.[/QUOTE] Well, you have to store the not-currently-in-registers data somewhere, so where do you cache the data if not in the main array? In some contiguous chunk of local memory? I'll illustrate my data-move strategy using the above set of complex radices (240,16,16,16,16) I'm using for F29. Call the overall runlength in terms of doubles n:

[1] Initial forward radix-240 decimation-in-frequency (DIF) pass on n/240-stride-separated data;
[2] Remaining 4 fFFT passes, dyadic-square step, and initial 4 iFFT passes (the iFFT runs the radices in reverse order, thus again 4 x radix-16 here), done on one contiguous n/240-sized data chunk at a time;
[3] Final radix-240 decimation-in-time (DIT) pass on n/240-stride-separated data;
[4] Round and do carries.

Steps 1,3,4 (actual order is 3,4,1 here) are all fused into a single pass through the data. Step 2 involves multiple passes, but all on a contiguous block of data (and I process 240 such blocks in turn). The size of each of these blocks relates to the "1MB chunk threshold" I described earlier, which is crucial for good performance on the x86: 4 passes of complex radix-16 means 16^4 = 2^16 complex doubles, i.e. 2^17 doubles, exactly 1MB. For F30, if I again used leading radix 240, the resulting contiguous-data chunks would be 2MB, and I would expect (and have already confirmed) a timing deterioration. 
I already have in place a leading-radix-960 routine, which is slightly worse than 240 when used for F29, but is better for F30, and better still for F31. 
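The chunk-size arithmetic above is easy to sanity-check; a standalone check of the F29 numbers (overall FFT length of 30M doubles and leading radix 240, as in the discussion):

```c
#include <assert.h>

/* Doubles per contiguous step-[2] chunk for FFT length n and leading radix r */
static unsigned long chunk_doubles(unsigned long n, unsigned long r)
{
    return n / r;
}

/* F29 numbers from the discussion: n = 30M doubles, leading radix 240 */
static void check_f29_chunks(void)
{
    const unsigned long n = 30UL << 20;
    const unsigned long chunk = chunk_doubles(n, 240);

    assert(chunk == 1UL << 17);                  /* 2^17 doubles */
    assert(chunk * sizeof(double) == 1UL << 20); /* exactly 1MB */
    assert(2UL * 16 * 16 * 16 * 16 == chunk);    /* 4 radix-16 passes cover it */

    /* the same leading radix at F30's doubled FFT length gives 2MB chunks */
    assert(chunk_doubles(2 * n, 240) * sizeof(double) == 2UL << 20);
}
```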
[QUOTE=ewmayer;384960]
[1] Initial forward radix-240 decimation-in-frequency (DIF) pass on n/240-stride-separated data; [2] Remaining 4 fFFT passes, dyadic-square step, and initial 4 iFFT passes (the iFFT runs the radices in reverse order, thus again 4 x radix-16 here), done on one contiguous n/240-sized data chunk at a time; [3] Final radix-240 decimation-in-time (DIT) pass on n/240-stride-separated data; [4] Round and do carries.[/QUOTE] This is what I call 2 passes over main memory -- exactly what prime95 does.

The combined steps 3,4,1 work on 240*16 = 4KB chunks. These 4KB chunks easily fit in the L1 cache while being processed. Two comments on performance:

1) While processing the 4KB chunk in the L1 cache, you should be prefetching the next 4KB chunk into the L1 cache. My concern is that you are not doing enough FPU work to hide the latencies of reading the next 4KB chunk from main memory. If I'm right, you can improve the FPU code all you want but your actual timings won't change.

2) You likely break the pass up into two steps, 16 by 15 (actually inverse 16, 15, carries, forward 15, 16), and you are working with horribly large strides. During the inverse radix-16 step in my example, Prime95 would read the horrible-stride data and write it to a contiguous 4KB scratch area. The remaining steps then deal with the pleasant small-stride scratch area until the final forward radix-16 step writes it back to the FFT data area with horrible strides.

What I call pass 2 over main memory processes chunks of 16^4 * 16 bytes = 1MB. This pass operates out of the L3 cache. Some comments on performance:

1) The L3 cache is a lot slower than the L2 cache. The CPU will not be able to hide the latencies without prefetching into the L1 cache. You've found that prefetching 1KB ahead seems to be optimal for hiding this latency.

2) While you are processing the 1MB chunk, you can prefetch the next 1MB chunk into the L3 cache. You are doing plenty of FPU work to easily hide all main memory latency. 
Except 3) I'm worried you may be exceeding the L3 cache's capabilities. If you operate on 1MB of data while prefetching the next 1MB of data and you have 4 cores going -- that's 8MB of data. It probably won't all fit. I also worry about the contention you've got going with 4 cores all placing very heavy read and write demands on the shared L3 cache. Is it possible to exceed the L3 cache's bandwidth capabilities? I don't know.

Here's what I'd recommend trying: In pass 1, go with 240 x 16 -- 60KB. While processing, prefetch the next 60KB into the L2 cache (spread these prefetches out!). Total demand on the L2 cache is 180+KB (60KB FFT data, next 60KB FFT data, 60KB scratch area, some sin/cos and weights). A tight fit. In pass 2, go with 16x16x16 -- 64KB. Should fit nicely in the L2 cache since you don't need a scratch area.

That's a lot of work. I hope I'm right and it will be worth the effort. No guarantees. 
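The "prefetch the next chunk while processing the current one, spreading the prefetches out" idea can be sketched as follows. This is a hypothetical double-buffering-style loop, not Prime95's or Mlucas's actual code; the multiply stands in for real FFT work, and `__builtin_prefetch` with locality 3 stands in for a prefetch toward L1:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64

/* Process data chunk by chunk; while working on chunk c, issue one
   spread-out prefetch per early inner iteration for chunk c+1, so its
   cache lines arrive before they are needed. */
void process_chunks(double *data, size_t nchunks, size_t chunk_doubles)
{
    const size_t lines = chunk_doubles * sizeof(double) / CACHE_LINE;
    for (size_t c = 0; c < nchunks; c++) {
        double *cur = data + c * chunk_doubles;
        const char *next = (const char *)(cur + chunk_doubles);
        const int have_next = (c + 1 < nchunks);
        for (size_t i = 0; i < chunk_doubles; i++) {
            if (have_next && i < lines)        /* spread the prefetches out */
                __builtin_prefetch(next + i * CACHE_LINE, 0, 3);
            cur[i] *= 2.0;                     /* stand-in for the FPU work */
        }
    }
}
```

The point of spreading the prefetches across the whole chunk's worth of FPU work, rather than issuing them in a burst, is to avoid saturating the load ports and memory queues.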
George, many thanks for the thoughtful, detailed reply.
[QUOTE=Prime95;384980]This is what I call 2 passes over main memory -- exactly what prime95 does.[/QUOTE] Holy convergent evolution, Batman! :) Actually, the key pieces of the scheme -- the fused-radix/carry step and a similar fusion of the final fFFT pass, dyadic-square step and initial iFFT pass (same radix as the final fFFT, and sharing the same twiddles) -- were already in place in the Fortran 90 code I used for the 1999 test of F24. The key pieces missing at that time (beside the transition to C and mass deployment of inline ASM) were that the 1999 code still used an out-of-place "ping-pong" FFT (which performed well on at least one chip family, MIPS, but sucked hard on most others, from Alpha to x86), and that instead of doing a fused step [2] applying all but the final-iFFT-radix/carry/initial-fFFT-radix to one contiguous main-array data chunk at a time, I did the intermediate-radix (= all but the initial and final radix of each FFT) passes on the whole array; thus a set of 5 radices for each FFT, as I am using for F29, meant 8 passes through the whole array. That predictably scales badly to large FFT lengths.

[QUOTE]The combined steps 3,4,1 work on 240*16 = 4KB chunks. These 4KB chunks easily fit in the L1 cache while being processed.[/QUOTE] I use a complex FFT, so radix 240 means 240*(16,32,64) = (3.75,7.5,15)KB in scalar-double, SSE2 and AVX build modes, respectively. Even the AVX-mode 15KB should fit nicely in L1 on Haswell with its 32KB level-1 D-cache, though. However, note that for radices like 240 which decompose into 2 coprime factors (240 = 15*16 = odd * power-of-2, but could also be the product of 2 coprime odds, or odd1 * (odd2 * power-of-2) with odd1,2 coprime), I typically use 2 such contiguous local stores, due to the index-permutation flexibility demanded by my twiddleless implementation of such coprime radix pairs. So e.g. pass 1 reads from large-stride-separated main-array locations and writes to store 1, then pass 2 reads from store 1 and writes to store 2. 
But I still think the main-array step is the crucial one here, as you note below.

[QUOTE]Two comments on performance: 1) While processing the 4KB chunk in the L1 cache, you should be prefetching the next 4KB chunk into the L1 cache. My concern is that you are not doing enough FPU work to hide the latencies of reading the next 4KB chunk from main memory. If I'm right, you can improve the FPU code all you want but your actual timings won't change.[/QUOTE] Yes, that sounds like a promising "try this first" step -- I have done limited prefetch experiments for the large-stride step, but IIRC my attempts were influenced by my experience with prefetch within the contiguous step [2] data blocks, where fetchahead distances ~1KB seem best for me. I don't think I ever made a concerted effort to fetch just the next set of complex data-to-be-used into L1.

[QUOTE]2) You likely break the pass up into two steps, 16 by 15 (actually inverse 16, 15, carries, forward 15, 16), and you are working with horribly large strides. During the inverse radix-16 step in my example, Prime95 would read the horrible-stride data and write it to a contiguous 4KB scratch area. The remaining steps then deal with the pleasant small-stride scratch area until the final forward radix-16 step writes it back to the FFT data area with horrible strides.[/QUOTE] Same here, but with the extra twist of 2 scratch areas for coprime radices. The carry macros also operate on the last-used scratch area. Carries also need auxiliary data: DWT weights for Mersenne-mod, and complex roots of -1 for the "acyclic twisting" scheme needed for moduli like Fermats with a "+" sign. (These are similar to the FFT twiddles, but the twiddles are (n/2)th roots of +1, whereas the twisting weights are (n/2)th roots of -1, i.e. the first (n/2) -- the upper-complex-half-plane ones -- of the (n)th roots of +1.) I use a 2-table multiply scheme for on-the-fly generation of both FFT twiddles and IBDWT weights. 
As I noted to you some years ago, for the IBDWT weights w_0,...,w_{n-1} I make use of the identity w_j * w_{n-j} = 2 to allow a single array to encode both the forward and inverse weights, at the price of some slightly-less-cache-friendly reverse-running indexing for the inverse-weights step. My 2-table schemes have each pair of tables within a factor of 2 of each other in size; thus I have a pair of O(sqrt(n))-sized tables of complex data for FFT twiddles, another such pair for acyclic-twisting weights in Fermat-mod mode, and a pair of similar-sized real-data tables for DWT weights in Mersenne-mod mode. Note that only one such pair of data tables competes for L1 cache with actual "live" transform (i.e. residue-related) data in any phase of the overall FFT-modmul scheme, though: the fused steps [1,3,4] need either IBDWT weights or acyclic-twisting weights, depending on the kind of modulus; step [2] needs just the 2 subtables needed to generate the complex-FFT twiddles.

Aside: For non-power-of-2 FFT lengths (this case was not considered in the now-famous 1994 Crandall/Fagin DWT paper) the Fermat-mod scheme also needs Mersenne-style IBDWT weights in addition to the acyclic-twisting weights, but the IBDWT weights are memory-negligible here, because for an FFT length of form odd * 2^m the IBDWT weights cycle with sequence length (odd), and it makes little sense to use odd-radix components much larger than 15. (The largest such I have is 63, and it will be used specifically for the Fermat numbers F32 and F33, where an FFT length of form 15 * 2^m will no longer suffice due to ROE growth -- I will use FFT length 63 * 2^m for one of the 2 independent tests, and a power-of-2 length 2^(m+6) for the other.)

[QUOTE]What I call pass 2 over main memory processes chunks of 16^4 * 16 bytes = 1MB. This pass operates out of the L3 cache. Some comments on performance: 1) The L3 cache is a lot slower than the L2 cache. The CPU will not be able to hide the latencies without prefetching into L1 cache. 
You've found that prefetching 1KB ahead seems to be optimal for hiding this latency. 2) While you are processing the 1MB chunk, you can prefetch the next 1MB chunk into the L3 cache. You are doing plenty of FPU work to easily hide all main memory latency. Except 3) I'm worried you may be exceeding the L3 cache's capabilities. If you operate on 1MB of data while prefetching the next 1MB of data and you have 4 cores going -- that's 8MB of data. It probably won't all fit. I also worry about the contention you've got going with 4 cores all placing very heavy read and write demands on the shared L3 cache. Is it possible to exceed the L3 cache's bandwidth capabilities? I don't know.[/QUOTE] So if I read you right, the evidence for the L3-bandwidth issue you raise would be that the prefetch-into-L3 scheme helps for (say) 1- or 2-threaded runs, but hurts performance (vs no L3 prefetch) above some thread count in the 4-ish range?

[QUOTE]Here's what I'd recommend trying: In pass 1, go with 240 x 16 -- 60KB. While processing, prefetch the next 60KB into the L2 cache (spread these prefetches out!). Total demand on the L2 cache is 180+KB (60KB FFT data, next 60KB FFT data, 60KB scratch area, some sin/cos and weights). A tight fit.[/QUOTE] In "240 x 16", is the 16 a data-chunk byte count (as in your "combined steps 3,4,1 work on 240*16 = 4KB chunks" note above), or are you referring to radix-16, i.e. to prefetching data for the first 2 FFT passes into L2?

[QUOTE]In pass 2, go with 16x16x16 -- 64KB. Should fit nicely in the L2 cache since you don't need a scratch area.[/QUOTE] On considering my experience in light of your comments, I'm not even sure prefetch in pass [2] is worth spending much time on, since the whole point of properly choosing the leading radix (e.g. 240 in our example here) is so that the system should have an easy time fetching the needed data from the higher-level caches ... 
but I will first attempt to tackle the large-stride-related data-fetch issues for fused steps [1,3,4] in my breakdown.

[QUOTE]That's a lot of work. I hope I'm right and it will be worth the effort. No guarantees.[/QUOTE] Well, as you know, this is a long-term project for me. :) It's nice to have some promising directional hints, for sure. For the coming month or so, I'll probably continue with my ongoing work of FMA-izing the remaining key AVX code macros which need it -- I don't expect huge gains from that on Haswell, but reliable gains, and hopefully even better ones on Broadwell when it comes out -- and do some large-stride-related prefetch experiments as a side project. Will let you know what I find as results become available. 
I just added some prefetcht1 calls to my AVX Fermat-mod carry macro, such that during processing of the 240 (or more generally R, where R is the leading radix) carries within the segmented main data array (which is not directly in the picture during carry processing, as this is all done using a small contiguous local storage area as described above), prefetches-into-L1 of the data at the R distinct n/R-stride-separated main-array locations get done while the carries are processed. The carry macros seem the logical place to do this prefetching, since even though they do not otherwise access the main array in any way, they make it easy to propagate such prefetches to all the various leading-radix/carry routines I have.
For these experiments I added code such that the bytes-ahead offset for the prefetches is set via a compile-time command-line define, and played with various values: 64 is the obvious candidate in AVX mode, since an "AVX complex double" consists of a pair of adjacent quartets of doubles (re and im), thus "fetch the next complex datum at each array location" means fetch 64 bytes ahead of the current data addresses. In practice, though, 32 bytes ahead proves a smidge faster (and I also tried other values, like 16 and 128, all of which were clearly worse). Instant 4% speedup (95 ms/iter -> 91 ms/iter for my ongoing F29 run), the clearest timing "signal" I have ever seen in all my playing with prefetch. Thanks, George! Hopefully this is the start of a trend -- still lots of other things to play with here, as you lay out above. 
First Pépin test of F29 completed late Wednesday -- it took slightly under 20 months at FFT length 30M. Start-of-run per-iteration time on my 3.3 GHz Haswell 4670 quad (running at stock) was ~115 ms; several rounds of AVX2/FMA-oriented optimizations brought this down to ~90 ms, which was representative of roughly the final 80% of the iterations.
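For reference, the Pépin test being run here: F_m = 2^(2^m)+1 is prime iff 3^((F_m-1)/2) ≡ -1 (mod F_m). A toy version for Fermat numbers small enough for 64-bit arithmetic (the real F29 computation needs the FFT-based multiply discussed in this thread, of course):

```c
#include <assert.h>
#include <stdint.h>

/* b^e mod m via square-and-multiply; 128-bit intermediates avoid overflow */
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m)
{
    __uint128_t r = 1, x = b % m;
    while (e) {
        if (e & 1) r = r * x % m;
        x = x * x % m;
        e >>= 1;
    }
    return (uint64_t)r;
}

/* Pepin's test: F_m is prime iff 3^((F_m-1)/2) == F_m - 1 (mod F_m) */
static int pepin_is_prime(uint64_t fermat_number)
{
    return powmod(3, (fermat_number - 1) / 2, fermat_number) == fermat_number - 1;
}
```

For F29 the modular squarings are done via the IBDWT-weighted FFT, and the exponentiation is just (F29-1)/2 = 2^(2^29 - 1) successive squarings of 3.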
On to run #2 (@ FFT length 32M) ... our own Mike/Xyzzy was kind enough to run the first ~25M iterations @ 32M over the last several months for me on his 2-core Haswell NUC, so I am proceeding where that run left off.

[UPDATE] I was going to post the final checkpoint Res64 and the traditional Selfridge-Hurwitz residues, along with some notes on run #2, after letting the latter run for a day or two on the same quad-Haswell system, but it has already turned up a 'bonk!' between iterations 25M and 26M. Not yet 100% sure whether the error is in the new or the old run, but I suspect the latter, because that entire run was 'fragile' and prone to crashes at random intervals ranging from an hour to several months. Said crashes were accompanied by either 'fatal roundoff error = 0.5' or 'nonzero exit carry' errors indicating some kind of data corruption, whereas the FFT = 32M run has been free of any untowardness. I further note that the 30M run used a fast but 'preliminary' binary of my v14.1 code release ... when I compared that to the eventual stabilized release code it was 2-3% faster at the FFT length in question, so I continued using it. Big mistake, as I now suspect the aforementioned 'noisy' fatal bad-data errors were likely accompanied by silent fails ... note the residue discrepancy between the 30M and 32M runs appears at a checkpoint where there are no signs of 'distress' from either run. So, likely I am looking at a massive waste of compute effort ... lesson learned: never again do such a run-pair in serial mode, always do the 2 side-by-side on systems yielding speeds as similar as possible. To that end, I am looking for a [strike]victim[/strike]volunteer to [strike]go on a suicide mission[/strike] help me redo the 30M run in parallel with the 32M one. Best will be someone with a similar Linux Haswell quad to my own ... should one prove slightly faster than the other, it will get to finish the 32M-FFT run, which needs ~6% more cycles than the 30M one. 
Or we could consider running @ 30M using just 3 cores on the faster system, leaving one core free for other tasks. Note that [url=http://www.imdb.com/title/tt0060009/?ref_=nm_knf_t1]should you choose to accept this mission[/url], it will be a significant runtime commitment -- around 18 months and 0 chance of a prime -- although should you change your mind (or blow up your system) at some point along the way, you can hand things off to a fresh recruit (even if your savefiles get nuked, since we will be cross-checking those vs mine). But you will be contributing to a small piece of math history, and of course getting a grateful 'thanks!' in any publication that may eventually result. 
Just curious, did the system have ECC memory?

None of his (or ours) has ECC memory.
:mike: 
An 18-month test without ECC RAM! :loco:

I envy these algorithms!
