2020-07-05, 22:49   #9
ewmayer
Sep 2002
República de California

11,743 Posts

Specialized transform-doing hardware for DSPs is widespread; the problem for us is that the precision and convolution-size needs of the mobile-telecoms industry are rather different from ours.

Originally Posted by preda
Next, what would be a good elementary operation (a basic building block) for computing large FP-FFTs? For example, what we have right now (on CPUs/GPUs) is FMA ("fused multiply add"), which is generally useful but not particularly great for FFTs (especially in the "high register pressure" context of the GPUs).

I was thinking of having some giant "twiddle OP":
twiddle(A,B,C): return (A*B+C, A*B-C)

where A,B,C are complex values; such an OP may be great for FFTs.
I've written about such an op here on several occasions, though focusing on the real-operands case. In the floating-point realm it's especially attractive because only one MUL is needed, and the FP-operand-preprocessing pipeline stages (exponent and significand extraction, then relative-shifting the A*B and C significands by the difference in exponents so as to align them) can also be done just once before feeding the 2 data pairs to the adder and subtracter.
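
To make the real-operands case concrete, here is a rough Python sketch, not a description of any existing instruction. The names twiddle_fused and twiddle_unfused are made up for illustration, and exact rational arithmetic stands in for the hardware's wide internal product so that each output is rounded only once, as in an FMA.

```python
from fractions import Fraction

def twiddle_fused(a, b, c):
    """Hypothetical fused op: returns (a*b + c, a*b - c) for doubles.

    The product is formed once and kept exact (emulated with Fraction);
    each of the two outputs is then rounded to double a single time.
    """
    p = Fraction(a) * Fraction(b)      # the one MUL, kept exact
    fc = Fraction(c)
    return float(p + fc), float(p - fc)

def twiddle_unfused(a, b, c):
    """Same results from separate ops: the product is rounded first."""
    p = a * b                          # rounded to double here
    return p + c, p - c
```

With A = 1 + 2^-30, B = 1 - 2^-30, C = -1 the exact product is 1 - 2^-60, so the fused sketch returns -2^-60 for the sum, while the unfused version rounds the product to 1.0 first and returns 0.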

An even bigger "why no such hardware instruction?" for me is complex multiply. I recall having a huge "WTF?" moment when I first saw the x86 SSE2 instruction set specification, seeing instantly how potentially useful SIMD was for the scientific computing community, and seeing how Intel/AMD apparently completely disregarded the needs of said community in their instruction set design. And all these years later, they still omit CMUL. "Guys, we'd be OK if there were such a SIMD instruction and the latency was high, just give us enough registers to be able to hide the latency and we'll be in Happyville."
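
For reference, a minimal sketch of what a single CMUL would have to compute on one packed (re, im) pair of doubles, i.e. one 128-bit register's worth. The function name cmul and the tuple layout are assumptions for illustration, not any real instruction's semantics.

```python
def cmul(a, b):
    """Complex product of two packed (re, im) pairs.

    Done by hand this costs 4 real multiplies plus 1 subtract and 1 add,
    and in SIMD code it also needs shuffles to line up the cross terms.
    """
    ar, ai = a
    br, bi = b
    return (ar * br - ai * bi,   # real part
            ar * bi + ai * br)   # imaginary part
```

For example, cmul((1.0, 2.0), (3.0, 4.0)) gives (-5.0, 10.0), matching (1+2i)(3+4i) = -5+10i.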

Intel et al are fabulous (ha, made a punny on 'fabless') when it comes to hardware, but absolute shit at instruction set design. Having worked with both the old DEC Alpha ISA and the current ARMv8 one, the contrast with Intel's fumbling-in-the-dark long and painful road from MMX to SSE and beyond is massive. AVX-512 is actually halfway decent despite the lack of instructions like vector both-halves-of-128-bit product and CMUL, but it took them, what, 20 years to get there?

A specialized form of CMUL in which one operand is a root of unity would be really useful for convolutions. If there were some digital magic by which one could cheaply interconvert between Cartesian and polar form for a complex number, one could do such a twiddle mul by converting the 2 inputs to polar form and doing 1 real add of the 2 angles, then converting back to (x,y) representation.
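
A small sketch of that polar-form twiddle mul, using Python's standard cmath.polar and cmath.rect for the conversions. The function name twiddle_mul_polar and the (k, n) parameterization of the root of unity are assumptions for illustration.

```python
import cmath
import math

def twiddle_mul_polar(z, k, n):
    """Multiply z by the root of unity exp(2*pi*i*k/n) via polar form.

    The root has modulus 1, so the product keeps z's modulus and only
    the angles need adding: one real add between the two conversions.
    """
    r, theta = cmath.polar(z)                  # Cartesian -> polar
    theta += 2.0 * math.pi * k / n             # the single real add
    return cmath.rect(r, theta)                # polar -> Cartesian
```

In software the two conversions (an atan2 and a sin/cos pair) cost far more than the 4 multiplies they replace, which is exactly why the cheap interconversion would need to be "digital magic" in hardware.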

Last fiddled with by ewmayer on 2020-07-05 at 22:58