mersenneforum.org > Other Stuff > Open Projects > y-cruncher

2022-11-12, 12:04   #12
Mysticial

Just curious, what happens if you remove all the computation and test the raw access pattern?

This could be a good starting point for inserting hardware counters.
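
A minimal sketch of what "remove the computation and test the raw access pattern" could look like (my own illustration, not code from the thread): a load-only walk over a large buffer, timed on its own, so the result reflects only the memory system. The working-set size and stride are assumptions, and the run could be wrapped in something like perf stat to attach hardware counters.

Code:
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Load-only walk over a large buffer: no computation beyond the access
    // pattern itself, so the elapsed time (and any hardware counters sampled
    // around the run) reflect the memory system alone.
    const std::size_t bytes  = 1ull << 30;                 // 1 GiB working set (assumed)
    const std::size_t stride = 64 / sizeof(std::uint64_t); // one cache line per access
    std::vector<std::uint64_t> buf(bytes / sizeof(std::uint64_t), 1);

    std::uint64_t sink = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < buf.size(); i += stride)
        sink += buf[i];                                    // load only
    const auto t1 = std::chrono::steady_clock::now();

    const double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("touched %zu cache lines in %.3f s (sink=%llu)\n",
                buf.size() / stride, secs,
                static_cast<unsigned long long>(sink));
    return 0;
}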

2023-02-07, 16:01   #13
Mark Rose

Quote:
Originally Posted by Mysticial
The mesh cache is inferior to ring cache in almost every performance aspect. Core-to-core latency is about double. Bandwidth is terrible (only twice the DRAM bandwidth).

It's bad enough that I tune y-cruncher to ignore Skylake X's L3 and assume the L2 is the last level cache.

Obviously, ring cache has scalability issues. So Intel had to move off it at some point.

--------

The reason I asked how the multi-threading was done is to see whether you're hitting a known case of poor broadcast performance. I don't think it applies here, but I'll tell the story anyway since it's really interesting and infuriating.

Basically, if you write to an address and then immediately have multiple cores read it simultaneously, the performance is as if they were writing to it instead of just reading. This royally screws up low-latency code where you want to signal to a ton of spinning threads that data is ready for them to consume.

What happens is that when a core cache-misses on a cache line that is in another core's cache, it always pulls the line in for ownership (evicting it from all other caches) rather than shared - even if the access is a load. The effect is that if you have, say, 20 cores spinning on a variable waiting for it to flip before continuing, the cache line will bounce across the 20 cores many times each before the directory realizes what's going on and finally decides to stop the madness and give it to everyone in shared state.

The reason Intel does this is that they are optimizing for poorly written spin locks that do read + cmpxchg. Without this "optimization", both the read and the cmpxchg would result in coherency round trips - the read brings the line in as shared, the cmpxchg upgrades it to exclusive and evicts it from all other caches - thus double the latency. Good spin locks usually lead with the cmpxchg under the assumption of no contention; only if it fails do they enter a read loop.

This has been the case since at least Haswell (never tested on anything earlier). And it got significantly worse upon switching to mesh cache in Skylake SP.
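
To make the two locking patterns in the quote concrete, here is a minimal C++ sketch (my illustration, not Mysticial's code): the first struct does the read-then-cmpxchg pattern he calls poorly written, the second leads with the cmpxchg and falls back to a read-only spin loop only under contention.

Code:
#include <atomic>

// Read first, then cmpxchg: under contention the load pulls the line in as
// shared and the cmpxchg immediately upgrades it to exclusive, so every
// acquisition attempt costs two coherency round trips.
struct NaiveSpinLock {
    std::atomic<int> state{0};
    void lock() {
        for (;;) {
            while (state.load(std::memory_order_relaxed) != 0) { /* spin */ }
            int expected = 0;
            if (state.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire))
                return;
        }
    }
    void unlock() { state.store(0, std::memory_order_release); }
};

// Lead with the cmpxchg, assuming no contention; enter a read-only spin loop
// only after it fails.  The uncontended case needs a single round trip.
struct OptimisticSpinLock {
    std::atomic<int> state{0};
    void lock() {
        int expected = 0;
        while (!state.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire)) {
            while (state.load(std::memory_order_relaxed) != 0) { /* spin */ }
            expected = 0;
        }
    }
    void unlock() { state.store(0, std::memory_order_release); }
};
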
Interesting. What was your work-around for the broadcast situation?

2023-02-07, 16:15   #14
Mark Rose

Quote:
Originally Posted by Prime95
Poly multiplication takes 64 bytes from each input gwnum, zero-pads to a convenient FFT size, performs two length-100000 FFTs, a pointwise multiply, and a length-100000 inverse FFT, then outputs 64 bytes to 100000 output gwnums.
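
For readers unfamiliar with the structure being described in the quote, it follows the textbook FFT-based polynomial multiplication: two forward transforms, a pointwise multiply, and one inverse transform. The sketch below is mine, using a plain double-precision radix-2 FFT, and only illustrates that structure; gwnum's actual transforms and data layout are far more elaborate.

Code:
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Recursive radix-2 FFT; invert=true applies the inverse transform
// (including the 1/n scaling, accumulated as 1/2 per level).
static void fft(std::vector<cd>& a, bool invert) {
    const std::size_t n = a.size();           // n must be a power of two
    if (n == 1) return;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft(even, invert);
    fft(odd, invert);
    const double ang = 2.0 * std::acos(-1.0) / n * (invert ? -1.0 : 1.0);
    cd w(1.0), wn(std::cos(ang), std::sin(ang));
    for (std::size_t i = 0; i < n / 2; ++i) {
        a[i]         = even[i] + w * odd[i];
        a[i + n / 2] = even[i] - w * odd[i];
        if (invert) { a[i] /= 2.0; a[i + n / 2] /= 2.0; }
        w *= wn;
    }
}

// Polynomial multiplication: zero-pad to a power of two, two forward FFTs,
// a pointwise multiply, and one inverse FFT.
std::vector<double> poly_multiply(const std::vector<double>& x,
                                  const std::vector<double>& y) {
    std::size_t n = 1;
    while (n < x.size() + y.size()) n <<= 1;  // zero-pad to a convenient size
    std::vector<cd> fx(x.begin(), x.end()), fy(y.begin(), y.end());
    fx.resize(n);
    fy.resize(n);
    fft(fx, false);
    fft(fy, false);
    for (std::size_t i = 0; i < n; ++i) fx[i] *= fy[i];
    fft(fx, true);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = fx[i].real();
    return out;
}
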
Do you have a DDR5 system for testing on as well? I only mention this because DDR5 is capable of transferring 64 bytes of memory in a single burst, whereas DDR4 requires two. If you hit main memory, this may have an impact on your algorithm's performance.

2023-02-07, 18:51   #15
Prime95

Quote:
Originally Posted by Mark Rose
Do you have a DDR5 system for testing on as well? I only mention this because DDR5 is capable of transferring 64 bytes of memory in a single burst, whereas DDR4 requires two. If you hit main memory, this may have an impact on your algorithm's performance.
I do not have a DDR5 system.

I struggled with this for quite a while, trying various padding strategies, and I can't say I learned much about how memory layouts affect performance. I did find that streaming stores were beneficial once the data no longer fits in the L3 cache. In the end, the best results were obtained by allocating the array of gwnums linearly in memory and then just randomly scrambling the array of pointers.
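
A rough sketch of the layout described above (my illustration; the gwnum size, the AVX intrinsics, and all names are assumptions, not gwnum code): one contiguous allocation holding every gwnum, a shuffled array of pointers into it, and streaming (non-temporal) stores for the writes.

Code:
#include <immintrin.h>
#include <algorithm>
#include <cstdlib>
#include <random>
#include <vector>

constexpr std::size_t GWNUM_BYTES = 1 << 13;   // hypothetical size of one gwnum
constexpr std::size_t NUM_GWNUMS  = 100000;

int main() {
    // One contiguous, 64-byte-aligned block holding every gwnum.
    double* block = static_cast<double*>(
        std::aligned_alloc(64, GWNUM_BYTES * NUM_GWNUMS));

    // Array of pointers into the linear block, then randomly scrambled so the
    // processing order is decoupled from the physical layout.
    std::vector<double*> gwnums(NUM_GWNUMS);
    for (std::size_t i = 0; i < NUM_GWNUMS; ++i)
        gwnums[i] = block + i * (GWNUM_BYTES / sizeof(double));
    std::shuffle(gwnums.begin(), gwnums.end(), std::mt19937_64{12345});

    // Write results with streaming (non-temporal) stores, bypassing the
    // caches, which is useful once the working set no longer fits in L3.
    const __m256d v = _mm256_set1_pd(1.0);
    for (double* g : gwnums)
        for (std::size_t j = 0; j < GWNUM_BYTES / sizeof(double); j += 4)
            _mm256_stream_pd(g + j, v);
    _mm_sfence();   // make the streaming stores globally visible

    std::free(block);
    return 0;
}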

2023-02-07, 18:59   #16
chalsall

Quote:
Originally Posted by Prime95
I do not have a DDR5 system.
Is there any chance that could be provided to you?

"Virtual" could work. Translation of atoms can be very expensive.

Not to say knowledge is cheap...