#1 | Prime95 ("P90 years forever!") | Aug 2002 | Yeehaw, FL | 3²·907 Posts
I have a mystery that perhaps Mysticial can provide some insight into.

Some background info, the details of which aren't extremely important. I'm optimizing the polynomial multiplication code where poly coefficients are gwnums. As an example, let's say you have two degree-50000 polynomials to multiply, producing a degree-100000 polynomial; gwnum coefficients are a tad over 8KB in size. Poly multiplication takes 64 bytes from each input gwnum, zero-pads to a convenient FFT size, performs two length-100000 forward FFTs, a pointwise multiply, and a length-100000 inverse FFT, then outputs 64 bytes to each of the 100000 output gwnums.

The length-100000 inverse FFT is done in two passes on a 7MB buffer. One pass does 448 unit-stride length-256 inverse FFTs. The next pass does stride-256 reads into a scratch area, then a length-448 inverse FFT whose results are written back with stride 256 to the 7MB buffer. Finally, the 7MB buffer is copied (multi-threaded) in 64-byte chunks to the 100,000 gwnums (stride just over 8KB).

My optimization idea is to skip outputting to the 7MB buffer and instead write directly to every 256th gwnum of the 100,000 output gwnums.

The good news: on my i5-6500 (4 cores, 6MB L3 cache), ten percent faster - sweet.
The bad news: on my Skylake-X i7-7820X (8 cores, 11MB L3 cache), twenty percent slower - ugly. Another user reported similar bad news on some kind of server CPU setup: 14(?) cores, large L3 cache, AVX architecture.

The mystery is: how can the new code be slower? I've eliminated a strided write to a 7MB buffer. The writes to the 100000 gwnums are strided in both cases, albeit with a larger stride in my new code.

To summarize, the old code:
256 FFT pass 2 runs. Each copies 64B every 16KB to a 7MB buffer.
1 multithreaded call copying 64B every 8KB to the 100K gwnums.
The new code:
256 FFT pass 2 runs. Each copies 64B every 2MB to 100K/256 gwnums.

P.S. No thrashing. Unless I screwed up, both systems have 512 huge pages.

Last fiddled with by Prime95 on 2022-11-09 at 02:48
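To make the difference concrete, here is a minimal sketch of the two output paths as I read the description above. The names, the memcpy-based scatter, and the layout are my own illustration (assumptions), not gwnum's actual code.

```cpp
#include <cstring>
#include <cstddef>

constexpr size_t CHUNK = 64;       // bytes written per output gwnum per inverse FFT
constexpr size_t N     = 100000;   // number of output gwnums (poly coefficients)

// Old path: pass 2 writes into a contiguous 7MB buffer, then one multi-threaded
// sweep scatters 64B chunks to the gwnums, whose starts are a bit over 8KB apart.
void old_copy_out(const char *buf7mb, char *const *gwnum, size_t gw_offset) {
    for (size_t i = 0; i < N; i++)
        std::memcpy(gwnum[i] + gw_offset, buf7mb + i * CHUNK, CHUNK);
}

// New path: each pass-2 run skips the 7MB buffer and writes its 64B results
// straight into every 256th gwnum, so consecutive writes land ~256 * 8KB = ~2MB apart.
void new_copy_out(const char *pass2_results, char *const *gwnum, size_t gw_offset,
                  size_t column /* 0..255 */) {
    for (size_t i = column, k = 0; i < N; i += 256, k++)
        std::memcpy(gwnum[i] + gw_offset, pass2_results + k * CHUNK, CHUNK);
}
```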
#2 | Mysticial | Sep 2016 | 2·5·37 Posts
Is adjacent line prefetch on or off? Because that effectively doubles the cache line size to 128 bytes. So if you're only touching 64 bytes at a time, you're doubling up your bandwidth usage. (Not sure about the situation for writes using non-temporal stores.)

From my experience, adjacent line prefetch is always enabled by default on systems that expose the option. And the only systems that expose the option are server or HEDT ones. (Granted, my retail AM5 mobo has it.) Personally, I have never dared to attempt any stride larger than the associativity without:

I have tried some one-off experiments with single-cacheline blocks, but results were very volatile across different hardware. Though I never studied it too much. Most of y-cruncher's FFT-like code uses large and run-time adjustable block sizes.

The first thing that comes to my mind is that the larger stride can cause super-alignment if you're not padding things out.

Last fiddled with by Mysticial on 2022-11-09 at 10:43
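A minimal sketch of the padding idea, for concreteness: if the output gwnums end up an exact large power of two apart, a stride-256 walk can land every write in the same handful of cache sets. Adding one cache line of padding per element skews the set index. The sizes and the allocator below are illustrative assumptions, not how gwnum lays out memory.

```cpp
#include <cstdlib>
#include <cstddef>
#include <vector>

constexpr size_t ELEM_BYTES = 8 * 1024;  // nominal ~8KB gwnum
constexpr size_t PAD        = 64;        // one cache line of padding per element

// Lay out n elements so element i starts at i*(ELEM_BYTES + PAD) rather than
// i*ELEM_BYTES; a stride-256 walk then spreads across many set indices instead
// of hammering a few. std::free(ptrs[0]) releases the whole block.
std::vector<char*> alloc_padded(size_t n) {
    char *base = static_cast<char*>(std::aligned_alloc(64, n * (ELEM_BYTES + PAD)));
    std::vector<char*> ptrs(n);
    for (size_t i = 0; i < n; i++)
        ptrs[i] = base + i * (ELEM_BYTES + PAD);
    return ptrs;
}
```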
#3 | Prime95 ("P90 years forever!") | Aug 2002 | Yeehaw, FL | 3²·907 Posts
Conclusion: stride distance is not the issue.

The new code does its 100000 gwnum writes in small batches, whereas the old code waits until the entire FFT is done, then blasts all 100000 gwnum writes in one batch. Could there be some sort of stall associated with initiating a batch of scattered writes, such that the new code stalls 448 times vs. the old code stalling only once?
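A rough micro-benchmark one could use to poke at that theory: issue the same 100000 scattered 64-byte writes either as one in-order sweep (the old code's copy step) or as 256 column-strided batches (the new code's output pattern, per the summary in post #1). Everything here is my own harness, not gwnum code, and it needs roughly 800MB of RAM.

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

constexpr size_t N = 100000, CHUNK = 64, STRIDE = 8 * 1024 + 64;

// Scatter `count` 64-byte chunks to targets spaced STRIDE bytes apart,
// starting at target index `first` and stepping by `step` targets each time.
static void scatter(char *dst, const char *src, size_t first, size_t count, size_t step) {
    for (size_t i = first, k = 0; k < count; i += step, k++)
        std::memcpy(dst + i * STRIDE, src + k * CHUNK, CHUNK);
}

int main() {
    std::vector<char> targets(N * STRIDE);   // stand-in for the 100K gwnums (~800MB)
    std::vector<char> src(N * CHUNK, 1);     // stand-in for the FFT results
    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    scatter(targets.data(), src.data(), 0, N, 1);           // one big in-order batch
    auto t1 = clk::now();
    for (size_t col = 0; col < 256; col++)                  // 256 column-strided batches
        scatter(targets.data(), src.data(), col, N / 256, 256);
    auto t2 = clk::now();

    auto us = [](auto a, auto b) {
        return (long long)std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::printf("one batch: %lld us, 256 batches: %lld us\n", us(t0, t1), us(t1, t2));
}
```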
#4 | preda ("Mihai Preda") | Apr 2015 | 2645₈ Posts
You're doing the FFTs over 64 bytes from each gwnum, but in the "pointwise multiply" step don't you need to re-aggregate the 64-byte parts corresponding to a full gwnum (poly coefficient) before doing the pointwise mul? And the "pointwise mul" is then a MUL between full gwnums (not between 64-byte chunks). Do I misunderstand it? Is it correct to actually do the "pointwise mul" only over the FFT-transformed 64-byte chunks?

Last fiddled with by preda on 2022-11-09 at 18:34
#5 | Prime95 ("P90 years forever!") | Aug 2002 | Yeehaw, FL | 3²·907 Posts
Yes - the pointwise mul is done on the FFT-transformed 64-byte chunks. One does have to make sure each gwnum has enough "headroom" to avoid round-off errors. In my multiplication example that means adding 50000 sub-products, which will "use up" several bits of a double.

Prime95 differs from GMP-ECM and the ECM stage 2 papers. Those break down coefficients using small primes and reassemble with the Chinese Remainder Theorem.

Last fiddled with by Prime95 on 2022-11-09 at 21:02
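For a rough sense of where those bits go (a back-of-the-envelope estimate of mine, not Prime95's actual error analysis), assuming the round-off from the 50000 accumulated sub-products behaves like a random walk:

```latex
\[
  \text{worst-case magnitude growth: } \log_2 50000 \approx 15.6 \text{ bits}, \qquad
  \text{random-walk estimate: } \tfrac{1}{2}\log_2 50000 \approx 7.8 \text{ bits of a double's 53-bit mantissa}.
\]
```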
#6 | Mysticial | Sep 2016 | 101110010₂ Posts
From what I've read, Skylake-X's mesh cache takes longer to retire NT-stores, which can clog up your reorder buffer.

Also, about your multi-threaded copy: how does that work? Are you having different threads touch the data which was just written out by a single thread?
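For reference, here is what a 64-byte scatter using non-temporal (streaming) stores might look like - a sketch of mine with AVX2 intrinsics; whether gwnum's copy actually uses NT stores here is an assumption on my part.

```cpp
#include <immintrin.h>
#include <cstddef>

// Copy one 64-byte chunk per destination, destinations `stride` bytes apart.
// Destinations must be 32-byte aligned for the streaming stores.
void scatter_chunks_nt(char *dst, const char *src, size_t count, size_t stride) {
    for (size_t i = 0; i < count; i++) {
        __m256i lo = _mm256_loadu_si256((const __m256i*)(src + i * 64));
        __m256i hi = _mm256_loadu_si256((const __m256i*)(src + i * 64 + 32));
        _mm256_stream_si256((__m256i*)(dst + i * stride),      lo);  // bypasses the caches
        _mm256_stream_si256((__m256i*)(dst + i * stride + 32), hi);
    }
    _mm_sfence();   // make the streamed data globally visible before it is reused
}
```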
#7 | Chris Halsall ("If I May") | Sep 2002 | Barbados | 11085₁₀ Posts
Write lots of code. Test at scale. Move fast. Break stuff. Be comfortable failing!!!
#8 | Prime95 ("P90 years forever!") | Aug 2002 | Yeehaw, FL | 1FE3₁₆ Posts
To answer the multi-threaded copy question, there are two strategies:

Strategy 1) Each thread works on its own 7MB buffer filled with 64 bytes from the 100000 gwnums.
a) A single-threaded FFT writes results back to the 7MB buffer. The 7MB buffer is then copied to the 100,000 gwnums.
b) A single-threaded FFT writes results to the 100000 gwnums as they are calculated. 20% slower.

Strategy 2) All threads work on a single 7MB buffer.
a) A multi-threaded FFT writes results back to the 7MB buffer. The 7MB buffer is then multi-threaded copied to the 100,000 gwnums.
b) A multi-threaded FFT writes results to the 100000 gwnums as they are calculated. 20% slower.

BTW, strategy 1 generally works better for polynomial sizes below about 25000 (fits in L3 cache).

I'll now google Skylake's mesh cache. There must be an architectural reason why Skylake-X shows a 20% hit while my old Core i5 shows a 10% gain.
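To make strategy 2a's copy step concrete, here is a minimal sketch of a multi-threaded 7MB-buffer-to-gwnums copy, assuming each thread takes a contiguous slice of coefficients (the partitioning is my guess, not necessarily what gwnum does):

```cpp
#include <cstring>
#include <vector>
#include <thread>

constexpr size_t CHUNK = 64;

void mt_copy_out(const char *buf7mb, std::vector<char*>& gwnum,
                 size_t gw_offset, unsigned nthreads) {
    std::vector<std::thread> pool;
    size_t n = gwnum.size();
    for (unsigned t = 0; t < nthreads; t++) {
        size_t lo = n * t / nthreads, hi = n * (t + 1) / nthreads;
        pool.emplace_back([=, &gwnum] {
            // Each thread scatters its own slice: reads are sequential in the
            // 7MB buffer, writes land ~8KB apart (one 64B chunk per gwnum).
            for (size_t i = lo; i < hi; i++)
                std::memcpy(gwnum[i] + gw_offset, buf7mb + i * CHUNK, CHUNK);
        });
    }
    for (auto &th : pool) th.join();
}
```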
#9 | Prime95 ("P90 years forever!") | Aug 2002 | Yeehaw, FL | 1FE3₁₆ Posts
I can't see how the mesh bus vs. ring bus would adversely affect my code. My bet is the change from an inclusive L3 cache to an exclusive L3 cache. I haven't come up with a theory as to how this is hurting my code.

I'm currently gathering a lot of data (various gwnum sizes and polynomial sizes) to look for patterns (such as: does the penalty kick in when the amount of data being processed exceeds the L2 cache size? The L3 cache size? The L2+L3 cache size?).

I mentioned a 3rd machine was exhibiting the significant slowdown. Does anyone know if the E5-2680 v4 CPU has an inclusive or exclusive L3 cache?
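A sketch of the kind of sweep being described: time the new code's write pattern for growing working sets and look for where the penalty kicks in relative to the L2 / L3 / L2+L3 sizes. The harness, sizes, and single-threaded loop are my assumptions, not Prime95's actual measurement code.

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    constexpr size_t CHUNK = 64, STRIDE = 8 * 1024 + 64;
    for (size_t ngw = 1000; ngw <= 128000; ngw *= 2) {       // working set ~8MB .. ~1GB
        std::vector<char> dst(ngw * STRIDE), src(ngw * CHUNK, 1);
        auto t0 = std::chrono::steady_clock::now();
        for (size_t col = 0; col < 256; col++)               // new-code access pattern
            for (size_t i = col; i < ngw; i += 256)
                std::memcpy(&dst[i * STRIDE], &src[i * CHUNK], CHUNK);
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%zu gwnums (~%zu MB): %lld us\n",
                    ngw, ngw * STRIDE >> 20, (long long)us);
    }
}
```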
#11 | Mysticial | Sep 2016 | 101110010₂ Posts
The mesh cache is inferior to the ring cache in almost every performance aspect. Core-to-core latency is about double. Bandwidth is terrible (only twice the DRAM bandwidth).

It's bad enough that I tune y-cruncher to ignore Skylake-X's L3 and assume the L2 is the last-level cache. Obviously, the ring cache has scalability issues, so Intel had to move off it at some point.

--------

The reason I asked about how the multi-threading was done is to see if you're hitting the case of poor broadcast performance. I don't think it applies here, but I'll tell the story anyway since it's really interesting and infuriating.

Basically, if you write to an address, then immediately have multiple cores read it simultaneously, the performance is as if they were writing to it instead of just reading. This royally screws up low-latency code where you want to signal to a ton of spinning threads that data is ready for them to consume.

What happens is that when a core cache-misses on a cache line that is in another core's cache, it always pulls it in for ownership (evicting it from all other caches) as opposed to just shared - even if the access is a load. The effect is that if you have, say, 20 cores spinning on a variable waiting for it to flip before continuing, the cache line will bounce across the 20 cores many times each before the directory realizes what's going on and finally decides to stop the madness and give it to everyone in shared state.

The reason Intel does this is that they are optimizing for poorly written spin locks that do read + cmpxchg. Without this "optimization", both the read and the cmpxchg would result in coherency round trips - the read brings the line in as shared, the cmpxchg upgrades it to exclusive and evicts it from all other caches - thus double the latency. Good spin locks usually lead with the cmpxchg under the assumption of no contention. Only if it fails do they enter a read loop.

This has been the case since at least Haswell (never tested on anything earlier). And it got significantly worse upon switching to the mesh cache in Skylake SP.

Last fiddled with by Mysticial on 2022-11-12 at 11:47
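Here is a minimal sketch of the two spin-lock shapes being contrasted (my illustration in C++ atomics, not y-cruncher's code): the read-then-cmpxchg pattern Intel's heuristic caters to, and the cmpxchg-first version that assumes no contention and only falls back to a read loop on failure.

```cpp
#include <atomic>

std::atomic<int> lock_word{0};

// Read first, then cmpxchg: every waiter's read pulls the line, then the
// cmpxchg upgrades it - the pattern the coherency "optimization" caters to.
void lock_read_first() {
    for (;;) {
        while (lock_word.load(std::memory_order_relaxed) != 0) { /* spin */ }
        int expected = 0;
        if (lock_word.compare_exchange_weak(expected, 1, std::memory_order_acquire))
            return;
    }
}

// Lead with the cmpxchg (assume no contention); only on failure drop into a
// read-only spin before retrying.
void lock_cmpxchg_first() {
    int expected = 0;
    while (!lock_word.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
        while (lock_word.load(std::memory_order_relaxed) != 0) { /* spin */ }
        expected = 0;   // CAS failure overwrote `expected`; reset before retrying
    }
}

void unlock() { lock_word.store(0, std::memory_order_release); }
```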