#12 |
Sep 2016
2·3·67 Posts |
Oh, I don't mean replacing the existing stuff with something new. I was suggesting that they use the proposed REX2 prefix byte to delimit a new instruction format. Something sane like ARM or whatever. At least put your 5 bits for a register operand together instead of splitting them up across 3 different bytes with random inversions.
So the old x86 stuff would stay as is. The new byte would prefix a new encoding instead of trying to reuse as much of the old (crap) as possible.
Last fiddled with by Mysticial on 2023-07-29 at 06:58
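To make the "split up with inversions" complaint concrete, here is a rough Python sketch of how 5-bit register numbers are reassembled from a register-to-register instruction. It uses the existing EVEX (AVX-512) prefix as the illustration, not the APX/REX2 proposal itself, and it assumes the usual layout `62 P0 P1 P2 opcode ModRM` with no other prefixes. The function name `decode_evex_registers` is made up for this example.

```python
def decode_evex_registers(insn: bytes) -> dict:
    """Recover the 5-bit register numbers of a register-direct EVEX instruction.

    Assumed layout (illustrative only): 62 P0 P1 P2 opcode ModRM, ModRM.mod == 0b11.
    """
    if len(insn) < 6 or insn[0] != 0x62:
        raise ValueError("expected an EVEX-prefixed instruction (0x62 ...)")
    p0, p1, p2, modrm = insn[1], insn[2], insn[3], insn[5]
    if modrm >> 6 != 0b11:
        raise ValueError("this sketch only handles register-direct operands")

    # Every prefix-side register bit is stored inverted, hence the XORs.
    r = ((p0 >> 7) & 1) ^ 1         # bit 3 of the reg operand
    x = ((p0 >> 6) & 1) ^ 1         # bit 4 of the rm operand (register form)
    b = ((p0 >> 5) & 1) ^ 1         # bit 3 of the rm operand
    r_hi = ((p0 >> 4) & 1) ^ 1      # bit 4 of the reg operand
    vvvv = ((p1 >> 3) & 0xF) ^ 0xF  # bits 3..0 of the second source
    v_hi = ((p2 >> 3) & 1) ^ 1      # bit 4 of the second source

    # The low 3 bits of reg and rm are not inverted and live in ModRM.
    reg = (r_hi << 4) | (r << 3) | ((modrm >> 3) & 7)
    second_source = (v_hi << 4) | vvvv
    rm = (x << 4) | (b << 3) | (modrm & 7)
    return {"reg": reg, "vvvv": second_source, "rm": rm}


print(decode_evex_registers(bytes.fromhex("62f16c4858cb")))
```

As far as I can tell, those bytes are `vaddps zmm1, zmm2, zmm3`, so the sketch should report registers 1, 2 and 3. Each operand has to be stitched together from two different bytes, with some pieces inverted and others not.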
#13 | |
Feb 2016
UK
23·3·19 Posts |
Quote:
I'm not awake yet, but I thought Tiger Lake doesn't have the extra FMA unit on port 5? Or are you doing something else? It's probably a bad habit of mine to associate AVX with how much FMA you can shove through it!
#14 |
Sep 2016
2×3×67 Posts |
Port5 can do integer SIMD and shuffles.
#15 | |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
15310₈ Posts |
Quote:
REX2 + 4-byte ARM64
REX3 + 4-byte RISC-V
REX4 + 4-byte Alpha

Four CPUs in one. The PC alignment requirement is completely borked, but I'm sure Intel can figure it out.
Last fiddled with by retina on 2023-07-29 at 14:09
#16 | |
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
2²×1,553 Posts |
Quote:
#17 |
P90 years forever!
Aug 2002
Yeehaw, FL
2·3·7·199 Posts |
#18 | |
Sep 2016
402₁₀ Posts |
Quote:

If you allow the program to specify the width, you'll run into an issue where the hardware just can't do it. Say the hardware is natively 128-bit, and I request 2048-bit just because I can. The CPU won't have the register state to hold 2048-bit x 32 registers. And if you try to pipe it through memory, you're effectively running without registers. (IOW, slow.)

If you allow the program to ask the hardware for its size so it can adapt, you've now burdened the program with being able to correctly and efficiently run on any vector length. This just isn't feasible for numerous reasons. For example, shuffles are inherently tied to the vector length. And while many applications have length-agnostic algorithms to handle simple things like a general NxN transpose, they are far from trivial and far from being optimal given a specific set of hardware.

Then you have cases where the optimal data layout depends on the native vector size. For example, complex FFTs like to use a semi-interleaved pattern like this:

Code:
[r0 r1 r2 r3][i0 i1 i2 i3][r4 r5 r6 r7][i4 i5 i6 i7]

What happens when this length-dependent data layout has to cross an API abstraction layer? You can't possibly expect both sides of an API to be able to handle flex width.
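To illustrate why that layout is tied to the vector width, here is a small Python sketch that prints the memory order of the same 8 complex values for two different widths. The helper name `semi_interleaved_layout` and the `r`/`i` labels are just for illustration. It assumes the element count is a multiple of the width.

```python
def semi_interleaved_layout(n, width):
    """Return the memory order of n complex values as labels such as 'r0' or 'i5'.

    Each block holds `width` real parts followed by the matching `width`
    imaginary parts, so one vector load picks up a full vector of each.
    """
    if n % width != 0:
        raise ValueError("n must be a multiple of the vector width")
    order = []
    for start in range(0, n, width):
        lanes = range(start, start + width)
        order.extend(f"r{k}" for k in lanes)  # real parts of this block
        order.extend(f"i{k}" for k in lanes)  # imaginary parts of this block
    return order


four_wide = semi_interleaved_layout(8, 4)
eight_wide = semi_interleaved_layout(8, 8)
print(" ".join(four_wide))
print(" ".join(eight_wide))
```

With width 4 the fifth slot holds `i0`, with width 8 it holds `r4`. A buffer written by code built around one width is therefore scrambled from the point of view of code built around the other, which is the problem at an API boundary.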
#19 |
Sep 2016
2·3·67 Posts |