mersenneforum.org > Other Stuff > Open Projects > y-cruncher

2023-07-29, 06:56   #12
Mysticial
Sep 2016
2·3·67 Posts

Quote:
Originally Posted by retina
I am.

Itanium.

'nuf said.

Oh, I don't mean replacing the existing stuff with something new. I was suggesting that they use the proposed REX2 prefix byte to delimit a new instruction format. Something sane like ARM or whatever. At least put your 5 bits for a register operand together instead of splitting them up across 3 different bytes with random inversions.

So the old x86 stuff would stay as is. The new byte would prefix a new encoding instead of trying to reuse as much of the old (crap) as possible.
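
To make the "split bits with inversions" complaint concrete, here is a minimal sketch using AVX-512's existing EVEX prefix as the example (not the proposed REX2 format, which this sketch does not model). The function name `decode_evex_registers` is hypothetical, and the sketch assumes a register-to-register instruction with a one-byte opcode, so the bytes are `62 P0 P1 P2 opcode ModRM`.

```python
def decode_evex_registers(insn: bytes):
    """Reassemble the three 5-bit register indices of a reg-reg EVEX instruction.

    Assumed layout: 0x62, P0, P1, P2, opcode, ModRM (hypothetical helper,
    register-direct form only).
    """
    if len(insn) < 6 or insn[0] != 0x62:
        raise ValueError("expected 0x62 followed by P0, P1, P2, opcode, ModRM")
    p0, p1, p2, modrm = insn[1], insn[2], insn[3], insn[5]
    if modrm >> 6 != 0b11:
        raise ValueError("this sketch only handles register-direct ModRM")

    def inverted_bit(byte, bit):
        # The prefix stores these extension bits in inverted form.
        return ((byte >> bit) & 1) ^ 1

    # reg: bit 4 from P0[4] (R'), bit 3 from P0[7] (R), low 3 bits from ModRM.reg
    reg = (inverted_bit(p0, 4) << 4) | (inverted_bit(p0, 7) << 3) | ((modrm >> 3) & 0b111)
    # vvvv: bit 4 from P2[3] (V'), low 4 bits from P1[6:3], all stored inverted
    vvvv = (inverted_bit(p2, 3) << 4) | (((p1 >> 3) & 0b1111) ^ 0b1111)
    # rm: bit 4 from P0[6] (X), bit 3 from P0[5] (B), low 3 bits from ModRM.rm
    rm = (inverted_bit(p0, 6) << 4) | (inverted_bit(p0, 5) << 3) | (modrm & 0b111)
    return reg, vvvv, rm


# 62 F1 6C 48 58 CB is vaddps zmm1, zmm2, zmm3
print(decode_evex_registers(bytes.fromhex("62f16c4858cb")))
```

Each operand needs bits gathered from two different bytes, and between them the three operands touch P0, P1, P2 and ModRM. A fixed-width format with contiguous 5-bit register fields would need one shift and one mask per operand.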

Last fiddled with by Mysticial on 2023-07-29 at 06:58
2023-07-29, 08:31   #13
mackerel
Feb 2016
UK
2³·3·19 Posts

Quote:
Originally Posted by Mysticial
it's not entirely unexpected since port5's 512-bit unit doesn't split up in 256-bit mode. So you lose the upper half with 256-bit code.

Thanks for the testing.

I'm not awake yet, but I thought Tiger Lake doesn't have the extra FMA unit on port 5? Or are you doing something else? Probably a bad habit for me that I associate AVX with how much FMA you can shove through it!
2023-07-29, 09:31   #14
Mysticial
Sep 2016
2×3×67 Posts

Quote:
Originally Posted by mackerel
Thanks for the testing.

I'm not awake yet, but I thought Tiger Lake doesn't have the extra FMA unit on port 5? Or are you doing something else? Probably a bad habit for me that I associate AVX with how much FMA you can shove through it!

Port5 can do integer SIMD and shuffles.
2023-07-29, 14:09   #15
retina
Undefined
"The unspeakable one"
Jun 2006
My evil lair
15310₈ Posts

Quote:
Originally Posted by Mysticial
I was suggesting that they use the proposed REX2 prefix byte to delimit a new instruction format. Something sane like ARM or whatever. At least put your 5 bits for a register operand together instead of splitting them up across 3 different bytes with random inversions.

Like this?

REX2 + 4-byte ARM64
REX3 + 4-byte RISC-V
REX4 + 4-byte Alpha

Four CPUs in one.

The PC alignment requirement is completely borked, but I'm sure Intel can figure it out.

Last fiddled with by retina on 2023-07-29 at 14:09
2023-07-29, 15:58   #16
henryzz
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
2²×1,553 Posts

Quote:
Originally Posted by Mysticial
Intel's E-cores already split 256-bit instructions into 2 x 128-bit. Assuming they don't widen them any time soon, 512-bit would mean quad-pumping. The way AMD double-pumped Zen4 is quite ingenious in that there's almost no cost to the splitting. So I don't really understand why Intel can't just do the same.

Still doesn't seem to be much disadvantage. Yes, it will take longer, but it still needs doing. IMO they should abandon fixed lengths and allow you to request 128*2^x bit instructions. Why fix the instructions to what the hardware can do? Allowing longer requests means free speedups with the same code if Intel later widens it in the CPU (which will inevitably happen eventually). I believe ARM took this approach.
2023-07-29, 17:51   #17
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
2·3·7·199 Posts

Quote:
Originally Posted by Mysticial
It looks like Zen4 cannot buffer 8 super-aligned NT-stores across loop iterations for write-combining.

Can you define "super-aligned"? Thanks.
2023-07-29, 18:01   #18
Mysticial
Sep 2016
402₁₀ Posts

Quote:
Originally Posted by henryzz
Still doesn't seem to be much disadvantage. Yes, it will take longer, but it still needs doing. IMO they should abandon fixed lengths and allow you to request 128*2^x bit instructions. Why fix the instructions to what the hardware can do? Allowing longer requests means free speedups with the same code if Intel later widens it in the CPU (which will inevitably happen eventually). I believe ARM took this approach.

I haven't studied flex width concepts too much, but I can already see non-trivial issues with it.

If you allow the program to specify the width, you'll run into an issue where the hardware just can't do it. Say the hardware is natively 128-bit, and I request 2048-bit just because I can. The CPU won't have the register state to hold 2048-bit x 32 registers. And if you try to pipe it through memory, you're effectively running without registers. (IOW, slow.)

If you allow the program to ask the hardware for its size so it can adapt, you've now burdened the program with being able to correctly and efficiently run on any vector length. This just isn't feasible for numerous reasons. For example, shuffles are inherently tied to the vector length. And while many applications have length-agnostic algorithms to handle simple things like a general NxN transpose, they are far from trivial and far from being optimal given a specific set of hardware. Then you have cases where the optimal data layout depends on the native vector size.

For example, complex FFTs like to use a semi-interleaved pattern like this:
Code:
[r0 r1 r2 r3][i0 i1 i2 i3][r4 r5 r6 r7][i4 i5 i6 i7]
where each vector consists entirely of real or imaginary parts. But the real and imaginary parts are kept together in adjacent vectors for locality purposes.

What happens when this length-dependent data layout has to cross an API abstraction layer? You can't possibly expect both sides of an API to be able to handle flex width.
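
A minimal sketch of why the layout above is tied to the vector width. The helper `semi_interleave` and its `lanes` parameter are hypothetical names for illustration; the code only reorders values in a flat list and makes no claim about how any real FFT library stores its data.

```python
def semi_interleave(values, lanes):
    """Pack complex values as alternating vectors of `lanes` reals, then `lanes` imaginaries."""
    if lanes <= 0 or len(values) % lanes != 0:
        raise ValueError("length must be a positive multiple of the lane count")
    out = []
    for start in range(0, len(values), lanes):
        block = values[start:start + lanes]
        out.extend(z.real for z in block)  # one vector of real parts
        out.extend(z.imag for z in block)  # followed by one vector of imaginary parts
    return out


# Real parts 0..7, imaginary parts 100..107, so the two are easy to tell apart.
data = [complex(k, 100 + k) for k in range(8)]

print(semi_interleave(data, 4))  # layout for 4 lanes per vector
print(semi_interleave(data, 8))  # same data, different memory order for 8 lanes
```

With 4 lanes the imaginary parts of the first four elements sit directly after their real parts; with 8 lanes they sit after all eight real parts. A buffer written by code assuming one width is therefore misread by code assuming another, which is the problem when the layout crosses an API boundary.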
2023-07-29, 18:07   #19
Mysticial
Sep 2016
2·3·67 Posts

Quote:
Originally Posted by Prime95
Can you define "super-aligned"? Thanks.

Offset by (multiples of) large powers-of-two. So the last N bits of the addresses are identical.

(FWIW, I made a conscious decision not to insert padding at the highest levels. Which would be a separate discussion.)
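
As a rough illustration of that definition, the sketch below counts how many low-order address bits a set of store streams has in common. The function name `shared_low_bits`, the base address, and the strides are all made-up values for the example, not measurements from y-cruncher or Zen4.

```python
def shared_low_bits(addresses):
    """Return N such that the last N bits of all addresses are identical."""
    addresses = list(addresses)
    if len(addresses) < 2:
        raise ValueError("need at least two addresses")
    diff = 0
    for addr in addresses[1:]:
        diff |= addr ^ addresses[0]  # collect every bit position that differs
    if diff == 0:
        raise ValueError("addresses are identical")
    # Index of the lowest differing bit == number of matching low bits.
    return (diff & -diff).bit_length() - 1


base = 0x10000040  # hypothetical start address

# 8 streams spaced exactly 64 KiB apart: "super-aligned"
aligned = [base + k * (1 << 16) for k in range(8)]
print(shared_low_bits(aligned))

# Same streams with a hypothetical 64 bytes of padding per stream
padded = [base + k * ((1 << 16) + 64) for k in range(8)]
print(shared_low_bits(padded))
```

The unpadded streams agree in their low 16 bits, while the padded ones agree only in the low 6 bits (the 64-byte cache-line offset).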
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.