2022-11-09, 02:47   #1
Prime95

Strided memory access

I have a mystery that perhaps Mysticial can provide some insight into.

Some background info, the details of which aren't extremely important. I'm optimizing the polynomial multiplication code where poly coefficients are gwnums. As an example, let's say you have two degree-50000 polynomials to multiply, producing a degree-100000 polynomial; gwnum coefficients are a tad over 8KB in size.
Poly multiplication takes 64 bytes from each input gwnum, zero-pads to a convenient FFT size, performs two length-100000 forward FFTs, a pointwise multiply, and a length-100000 inverse FFT, then outputs 64 bytes to each of the 100,000 output gwnums.

The length-100000 inverse FFT is done in two passes on a 7MB buffer. One pass does 448 unit-stride length-256 inverse FFTs; the next pass does stride-256 reads into a scratch area, then a length-448 inverse FFT whose results are written back to the 7MB buffer with stride 256.
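
Roughly, in C (placeholder kernel names and a plain array layout; this is only a sketch of the access pattern, not the actual gwnum code):

Code:
/* Sketch of the two-pass inverse FFT access pattern described above.
   Each element is a 64-byte chunk; 448 * 256 chunks = 7MB. */
#define N1 448
#define N2 256
typedef struct { double d[8]; } chunk64;        /* 64 bytes */

extern void inv_fft256(chunk64 *x);             /* assumed unit-stride length-256 kernel */
extern void inv_fft448(chunk64 *x);             /* assumed length-448 kernel on scratch  */

void two_pass_inverse_fft(chunk64 buf[N1][N2])
{
    /* Pass 1: 448 unit-stride length-256 inverse FFTs. */
    for (int i = 0; i < N1; i++)
        inv_fft256(buf[i]);

    /* Pass 2: stride-256 reads into a scratch area, a length-448
       inverse FFT, then stride-256 writes back to the 7MB buffer. */
    chunk64 scratch[N1];                        /* 448 * 64B = 28KB */
    for (int j = 0; j < N2; j++) {
        for (int i = 0; i < N1; i++)            /* stride-256 gather */
            scratch[i] = buf[i][j];
        inv_fft448(scratch);
        for (int i = 0; i < N1; i++)            /* stride-256 scatter */
            buf[i][j] = scratch[i];
    }
}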

Finally, the 7MB buffer is copied (multi-threaded) in 64-byte chunks to the 100,000 gwnums (stride just over 8KB).

My optimization idea is to skip outputting to the 7MB buffer, instead writing directly to every 256th gwnum of the 100,000 output gwnums.

The good news: On my i5-6500, 4 cores, 6MB L3 cache, ten percent faster - sweet.
The bad news: On my Skylake-X i7-7820X, 8 cores, 11MB L3 cache, twenty percent slower - ugly.
Another user reported similar bad news on some kind of server CPU setup. 14(?) cores, large L3 cache, AVX architecture.

The mystery is how can the new code be slower? I've eliminated a strided write to a 7MB buffer. The writes to the 100000 gwnums are strided in both cases, albeit with a larger stride in my new code.

To summarize (a rough C sketch of both copy patterns follows below):

The old code:
256 FFT pass-2 runs. Each copies 64B every 16KB to a 7MB buffer.
1 multithreaded call copying 64B every 8KB to the 100K gwnums.

The new code:
256 FFT pass-2 runs. Each copies 64B every 2MB to 100K/256 gwnums.
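
For concreteness, a minimal C sketch of the two destination write patterns (the layout and names are illustrative assumptions, not the actual gwnum data structures):

Code:
#include <string.h>

#define NUM_GWNUMS 100000
#define CHUNK      64

/* Old code: after the whole inverse FFT, one (multithreaded) pass copies
   64B chunks out of the 7MB buffer; consecutive destinations are one gwnum
   apart, i.e. a stride of just over 8KB. */
void copy_old(const char *buf7mb, char * const gwnum[], size_t slot)
{
    for (size_t k = 0; k < NUM_GWNUMS; k++)
        memcpy(gwnum[k] + slot * CHUNK, buf7mb + k * CHUNK, CHUNK);
}

/* New code: each of the 256 pass-2 FFT runs writes its 64B results straight
   to every 256th gwnum; consecutive destinations are 256 gwnums apart,
   i.e. a stride of roughly 256 * 8KB = 2MB. */
void copy_new(const char *fft_out, char * const gwnum[], size_t slot, size_t run)
{
    size_t i = 0;
    for (size_t k = run; k < NUM_GWNUMS; k += 256, i++)
        memcpy(gwnum[k] + slot * CHUNK, fft_out + i * CHUNK, CHUNK);
}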

P.S. No thrashing. Unless I screwed up, both systems have 512 huge pages.

2022-11-09, 10:22   #2
Mysticial

Is adjacent line prefetch on or off? Because that effectively doubles the cache line size to 128 bytes. So if you're only touching 64 bytes at a time, you're doubling up your bandwidth usage. (Not sure about the situation for writes using non-temporal stores.)

From my experience, adjacent line prefetch is always enabled by default on systems that expose the option. And the only systems that expose the option are server or HEDT ones. (Granted, my retail AM5 mobo has it.)

Personally, I have never dared to attempt any stride larger than the associativity without:
  1. Super-alignment padding.
  2. Accessing many cache lines contiguously before hopping to the next stride.
IOW, most of my code is stride 8 - 16, but running on contiguous blocks of multiple cache lines along with some very messy padding scheme that periodically inserts 1 or 2 cache lines. If I need larger reductions to eliminate an extra pass over memory, I implement them by breaking them up into smaller reductions of radix 8 or 16, using a small local buffer (thus recursing the FFT into the reductions themselves).
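
Not y-cruncher's actual code, but a rough C illustration of that "small stride over contiguous blocks" idea (the real version also carries the padding scheme and the local-buffer recursion mentioned above):

Code:
#include <stddef.h>

#define RADIX        8                       /* small stride: 8 streams          */
#define BLOCK_LINES  16                      /* cache lines walked contiguously  */
#define DBL_PER_LINE 8                       /* 8 doubles = one 64-byte line     */
#define BLOCK_DBLS   (BLOCK_LINES * DBL_PER_LINE)

/* One radix-8 reduction pass: the 8 input streams sit stride_dbls doubles
   apart, but each is consumed as long contiguous runs of BLOCK_LINES cache
   lines, so the hardware sees 8 sequential streams rather than isolated
   64-byte touches. */
void radix8_blocked_pass(double *data, size_t stride_dbls)
{
    for (size_t off = 0; off + BLOCK_DBLS <= stride_dbls; off += BLOCK_DBLS) {
        for (size_t j = 0; j < BLOCK_DBLS; j++) {
            double acc = 0.0;
            for (int r = 0; r < RADIX; r++)             /* 8 streams */
                acc += data[(size_t)r * stride_dbls + off + j];
            data[off + j] = acc;        /* stand-in for a real radix-8 butterfly */
        }
    }
}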

I have tried some one-off experiments with single cacheline blocks, but results were very volatile across different hardware. Though I never studied it too much. Most of y-cruncher's FFT-like code uses large and run-time adjustable block sizes.


Quote:
Originally Posted by Prime95
The mystery is how can the new code be slower? I've eliminated a strided write to a 7MB buffer. The writes to the 100000 gwnums are strided in both cases, albeit with a larger stride in my new code.
The first thing that comes to my mind is that the larger stride can cause super-alignment if you're not padding things out.

2022-11-09, 15:45   #3
Prime95

Quote:
Originally Posted by Mysticial
Is adjacent line prefetch on or off? Because that effectively doubles the cache line size to 128 bytes.
Adjacent line prefetch is likely on. However, on Skylake-X, where I'm getting the 20% performance drop, I actually read and write 128-byte chunks because it's using AVX-512 instructions.

Quote:
The first thing that comes to my mind is that the larger stride can cause super-alignment if you're not padding things out.
I tried the following experiment. Instead of storing the 100000 gwnums linearly in memory, I shuffled them. So instead of writing to the gwnums with large regular strides, I now have large but random strides. The 20% penalty is still there.

Conclusion: Stride distance is not the issue.

The new code does its 100000 gwnum writes in small batches, whereas the old code waits until the entire FFT is done, then blasts all 100000 gwnum writes in one batch. Could there be some sort of stall associated with initiating a batch of scattered writes and the new code is stalling 448 times vs. the old code only stalling once?
2022-11-09, 18:31   #4
preda ("Mihai Preda")

Quote:
Originally Posted by Prime95
I'm optimizing the polynomial multiplication code where poly coefficients are gwnums. As an example, let's say you have two degree-50000 polynomials to multiply, producing a degree-100000 polynomial; gwnum coefficients are a tad over 8KB in size.
Poly multiplication takes 64 bytes from each input gwnum, zero-pads to a convenient FFT size, performs two length-100000 forward FFTs, a pointwise multiply, and a length-100000 inverse FFT, then outputs 64 bytes to each of the 100,000 output gwnums.
Not on-topic, but I have a question about the poly multiplication, in particular the "pointwise multiply" step above.

You're doing the FFTs over 64 bytes from each gwnum, but in the "pointwise multiply" step don't you need to re-aggregate the 64-byte parts corresponding to a full gwnum (poly coefficient) before doing the pointwise mul? And the "pointwise mul" is then a MUL between full gwnums (not between 64-byte chunks).

Do I misunderstand it? Is it correct to actually do the "pointwise mul" only over the FFT-transformed 64-byte chunks?

2022-11-09, 21:02   #5
Prime95

Quote:
Originally Posted by preda
Do I misunderstand it? Is it correct to actually do the "pointwise mul" only over the FFT-transformed 64-byte chunks?
The gwnum coefficients are first FFTed using gwnum. Polymult (in AVX) now takes 4 complex values (64 bytes) from every gwnum and does a forward FFT, pointwise multiply, inverse FFT, and writes back to the gwnum. Repeat for however many 64-byte values are in the FFTed gwnum. Finally, perform a gwnum inverse FFT on every gwnum poly coefficient.

One does have to make sure each gwnum has enough "head room" to avoid round-off errors. In my multiplying example, that means adding 50000 sub-products, which will "use up" several bits of a double.
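
If it helps, here is a very rough C sketch of that flow. Everything here (function names, the chunk count, the FFT length) is a placeholder for illustration; it is not the gwnum/polymult API.

Code:
#include <string.h>
#include <complex.h>

#define POLY_FFT (448 * 256)              /* the "convenient FFT size" behind the 7MB buffer */
#define CHUNKS   129                      /* 64-byte chunks per gwnum ("a tad over 8KB"), assumed */

typedef double complex c64[4];            /* 4 complex doubles = one 64-byte chunk */

extern void gwfft(void *g);               /* placeholder: gwnum forward transform  */
extern void gwifft(void *g);              /* placeholder: gwnum inverse transform  */
extern void get_chunk(void *g, int slot, c64 out);       /* placeholder accessors  */
extern void put_chunk(void *g, int slot, const c64 in);
extern void poly_fft(c64 *x, int n);      /* placeholder length-n poly transforms  */
extern void poly_ifft(c64 *x, int n);

void polymult(void *a[], int dega, void *b[], int degb, void *out[])
{
    static c64 fa[POLY_FFT], fb[POLY_FFT];

    /* 1. Transform every input coefficient with gwnum's own FFT. */
    for (int i = 0; i <= dega; i++) gwfft(a[i]);
    for (int i = 0; i <= degb; i++) gwfft(b[i]);

    /* 2. For each 64-byte slot, run one convolution across the polynomial. */
    for (int slot = 0; slot < CHUNKS; slot++) {
        memset(fa, 0, sizeof fa);                  /* zero-pad to the FFT size */
        memset(fb, 0, sizeof fb);
        for (int i = 0; i <= dega; i++) get_chunk(a[i], slot, fa[i]);
        for (int i = 0; i <= degb; i++) get_chunk(b[i], slot, fb[i]);

        poly_fft(fa, POLY_FFT);
        poly_fft(fb, POLY_FFT);
        for (int i = 0; i < POLY_FFT; i++)         /* pointwise (lane-wise) multiply */
            for (int j = 0; j < 4; j++) fa[i][j] *= fb[i][j];
        poly_ifft(fa, POLY_FFT);

        for (int k = 0; k <= dega + degb; k++)     /* 64 bytes back to each output gwnum */
            put_chunk(out[k], slot, fa[k]);
    }

    /* 3. Inverse gwnum transform on every output coefficient. */
    for (int k = 0; k <= dega + degb; k++) gwifft(out[k]);
}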

Prime95 differs from GMP-ECM and the ECM stage 2 papers. Those break down coefficients using small primes and reassemble with the Chinese Remainder Theorem.

2022-11-09, 22:43   #6
Mysticial

Quote:
Originally Posted by Prime95
The new code does its 100000 gwnum writes in small batches, whereas the old code waits until the entire FFT is done, then blasts all 100000 gwnum writes in one batch. Could there be some sort of stall associated with initiating a batch of scattered writes and the new code is stalling 448 times vs. the old code only stalling once?
Are the writes normal writes or non-temporal?

From what I've read, Skylake X's mesh cache takes longer to retire NT-stores which can clog up your reorder buffer.

Also, about your multi-threaded copy. How does that work? Are you having different threads touch the data which was just written out by a single thread?
2022-11-09, 23:24   #7
chalsall ("Chris Halsall")

Quote:
Originally Posted by Mysticial
Also, about your multi-threaded copy. How does that work? Are you having different threads touch the data which was just written out by a single thread?
Please forgive me for this... But... It is wonderful watching wizards figure out how to do magic.

Write lots of code. Test at scale. Move fast. Break stuff.

Be comfortable failing!!!
2022-11-10, 02:01   #8
Prime95

Quote:
Originally Posted by Mysticial
Are the writes normal writes or non-temporal?

From what I've read, Skylake X's mesh cache takes longer to retire NT-stores which can clog up your reorder buffer.
Writes were normal writes. I just tried NT-stores and it helps by about 1% (all cases).
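
For reference, the difference being tested, assuming AVX-512 and 64-byte-aligned destinations (illustrative intrinsics, not the actual gwnum code):

Code:
#include <immintrin.h>

/* Normal store: the written line is allocated in the cache hierarchy. */
static void store64_normal(double *dst, const double *src) {
    _mm512_store_pd(dst, _mm512_load_pd(src));
}

/* Non-temporal store: write-combining, bypasses the caches.  Needs an
   _mm_sfence() before other threads are allowed to read the data. */
static void store64_nt(double *dst, const double *src) {
    _mm512_stream_pd(dst, _mm512_load_pd(src));
}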

Quote:
Also, about your multi-threaded copy. How does that work? Are you having different threads touch the data which was just written out by a single thread?
I have two different multi-threading options. The 20% hit takes place in both strategies.

Strategy 1) Each thread works on its own 7MB buffer filled with 64 bytes from the 100000 gwnums.
a) A single-threaded FFT writes results back to the 7MB buffer. The 7MB buffer is then copied single-threaded to the 100000 gwnums 64 bytes at a time.
b) A single-threaded FFT writes results to the 100000 gwnums as they are calculated. 20% slower.
Strategy 2) All threads work on a single 7MB buffer.
a) A multi-threaded FFT writes results back to the 7MB buffer. The 7MB buffer is then copied multi-threaded to the 100000 gwnums -- each thread copies a set of 512 64-byte chunks until there are no more sets (a sketch of this copy loop is below).
b) A multi-threaded FFT writes results to the 100000 gwnums as they are calculated. 20% slower.
BTW, strategy 1 generally works better for polynomial sizes below about 25000 (fits in L3 cache).
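
A rough sketch of the strategy 2a copy (C11 atomics; the names and the address computation are placeholders, not the real code):

Code:
#include <stdatomic.h>
#include <string.h>

#define NUM_CHUNKS 114688                 /* 7MB / 64B */
#define SET_SIZE   512                    /* chunks grabbed per work unit */

static atomic_int next_set;               /* reset to 0 before each copy */

extern char *gwnum_dest(int chunk);       /* placeholder: address of chunk k's
                                             destination inside gwnum k */

/* Each copy thread runs this until the sets are exhausted. */
void copy_worker(const char *buf7mb)
{
    for (;;) {
        int first = atomic_fetch_add(&next_set, 1) * SET_SIZE;
        if (first >= NUM_CHUNKS)
            return;                       /* no more sets */
        int last = first + SET_SIZE;
        if (last > NUM_CHUNKS)
            last = NUM_CHUNKS;
        for (int k = first; k < last; k++)            /* 64B every ~8KB */
            memcpy(gwnum_dest(k), buf7mb + (size_t)k * 64, 64);
    }
}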

I'll now google Skylake's mesh cache. There must be an architectural reason why Skylake-X shows a 20% hit while my old Core i5 shows a 10% gain.
2022-11-12, 03:45   #9
Prime95

I can't see how the mesh bus vs ring bus would adversely affect my code. My bet is the change from an inclusive L3 cache to an exclusive L3 cache. I haven't come up with a theory as to how this is hurting my code.

I'm currently gathering a lot of data (various gwnum sizes and polynomial sizes) to look for patterns (such as: does the penalty kick in when the amount of data being processed exceeds the L2 cache size? The L3 cache size? The L2+L3 cache size?).

I mentioned a 3rd machine was exhibiting the significant slowdown. Does anyone know if the v5-2680v4 CPU has an inclusive or exclusive L3 cache?
2022-11-12, 06:09   #10
Prime95

Quote:
Originally Posted by Prime95
I mentioned a 3rd machine was exhibiting the significant slowdown. Does anyone know if the v5-2680v4 CPU has an inclusive or exclusive L3 cache?
Grrr, it seems the v5-2680v4 has an inclusive L3 cache.
2022-11-12, 11:45   #11
Mysticial

The mesh cache is inferior to ring cache in almost every performance aspect. Core-to-core latency is about double. Bandwidth is terrible (only twice the DRAM bandwidth).

It's bad enough that I tune y-cruncher to ignore Skylake X's L3 and assume the L2 is the last level cache.

Obviously, ring cache has scalability issues. So Intel had to move off it at some point.

--------

The reason I asked about how the multi-threading was done is to see if you're hitting the case of poor broadcast performance. I don't think it applies here, but I'll tell the story anyway since it's really interesting and infuriating.

Basically, if you write to an address, then immediately have multiple cores read it simultaneously, the performance is as if they were writing to it instead of just reading. This royally screws up low-latency code where you want to signal to a ton of spinning threads that data is ready for them to consume.

What happens is that when a core cache-misses on a cache line that is in another core's cache, it always pulls it in for ownership (evicting it from all other caches) as opposed to just shared - even if the access is a load. The effect is that if you have, say, 20 cores spinning on a variable waiting for it to flip before continuing, the cache line will bounce across the 20 cores many times each before the directory realizes what's going on and finally decides to stop the madness and give it to everyone in shared state.

The reason Intel does this is that they are optimizing for poorly written spin locks that do read + cmpxchg. Without this "optimization", both the read and the cmpxchg would result in coherency round trips - the read brings the line in as shared, then the cmpxchg upgrades it to exclusive and evicts it from all other caches - thus double the latency. Good spin locks usually lead with the cmpxchg under the assumption of no contention. Only if it fails do they enter a read loop.

This has been the case since at least Haswell (never tested on anything earlier). And it got significantly worse upon switching to mesh cache in Skylake SP.
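
For illustration, a minimal C11 sketch of the two spin-lock styles being contrasted (not any particular library's implementation):

Code:
#include <stdatomic.h>

/* 0 = free, 1 = held. */
typedef struct { atomic_int word; } spinlock_t;

/* The "poorly written" style the heuristic targets: lead with a read, then
   cmpxchg.  Without read-for-ownership, the read pulls the line in shared
   and the cmpxchg then upgrades it to exclusive - two coherency round trips. */
static void spin_lock_read_first(spinlock_t *lk) {
    for (;;) {
        while (atomic_load_explicit(&lk->word, memory_order_relaxed) != 0)
            ;                                       /* wait until it looks free */
        int expected = 0;
        if (atomic_compare_exchange_weak(&lk->word, &expected, 1))
            return;
    }
}

/* The "good" style: lead with the cmpxchg, assuming no contention; only if
   it fails drop into a read-only loop, which keeps the line in shared state
   instead of bouncing it around in exclusive state. */
static void spin_lock_cmpxchg_first(spinlock_t *lk) {
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_weak(&lk->word, &expected, 1))
            return;                                 /* fast path: uncontended */
        while (atomic_load_explicit(&lk->word, memory_order_relaxed) != 0)
            ;                                       /* contended: spin read-only */
    }
}

static void spin_unlock(spinlock_t *lk) {
    atomic_store_explicit(&lk->word, 0, memory_order_release);
}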
