#12 | |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
3×37×61 Posts |
It seems to me that doing FFT on multi-megabit numbers was not a high priority for Intel/AMD/whoever. I would imagine video/audio encoding/decoding is very high on their list of things to consider when designing new instructions. How many people do multi-precision arithmetic compared to how many play games, listen to music and/or watch movies? I imagine the ratio is a rather small one. |
#13 | |
Jan 2008
France
2²·149 Posts |
But as you wrote, these are not written with MP (multi-precision) in mind. |
#14 | |
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
2×5×11×107 Posts |
Paul |
#15 | ||
Dec 2008
Boycotting the Soapbox
1320₈ Posts |
My guess is that it's the result of one serious rochambeau tournament that took the better part of the afternoon, too. The guy in the sales department won a lot: he got to add 20 new instructions by aliasing them to existing instructions, and the guy responsible for putting together the "Instruction Set Reference" probably has a template and "add redundant instruction" macroed to F11.
Somewhere there's a forum specializing in video encoding, with guys concluding that pmuludq must be for multi-megabit computations because it's useless for anything else. The rest of the thread is devoted to the missing 16-bit floating-point support, and wtf to do with the 205 instead of 200 idle cycles now that paddusw, psubsw, etc. have eliminated a few instructions from the code.
Last fiddled with by __HRB__ on 2009-02-21 at 16:46 |
#16 |
"Nancy"
Aug 2002
Alexandria
2,467 Posts |
While we're whining about instruction sets: aside from the general travesty of a CPU that is the x86 architecture, why-oh-why is there no setcc that sets a whole register to 0 or 1? With the current instructions, not only do you get a false dependency thanks to the partial register write, but you also can't store the carry flag for easy adding later, since the register will still contain garbage in bits 8 through 63. This drove me mad while writing the mulredc code for GMP-ECM.
Alex |
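The partial-write complaint above can be illustrated with a small model in plain C. This is only a sketch of the register contents, not x86 code and not anything taken from GMP-ECM; the helper names `setc_low_byte` and `zero_then_setc` are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `setc r8` on a 64-bit register: only bits 0..7 are written,
   bits 8..63 keep whatever the register held before (the "garbage"). */
static uint64_t setc_low_byte(uint64_t reg, int carry)
{
    return (reg & ~UINT64_C(0xFF)) | (uint64_t)(carry != 0);
}

/* Model of the usual workaround: clear the whole register first, then setc.
   On real hardware the clearing has to be a `mov reg, 0`, which leaves the
   flags alone; `xor reg, reg` would overwrite the carry we want to capture. */
static uint64_t zero_then_setc(int carry)
{
    return setc_low_byte(0, carry);
}
```

With a stale value in the register, `setc_low_byte(0xDEADBEEF00000000, 1)` yields `0xDEADBEEF00000001`, which is useless as an addend; `zero_then_setc(1)` yields a clean 1 at the cost of one extra instruction.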
#17 | |
Jan 2008
France
2²·149 Posts |
#18 | |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
3·37·61 Posts |
Last fiddled with by retina on 2009-02-21 at 17:16 |
#19 | |
"Nancy"
Aug 2002
Alexandria
100110100011₂ Posts |
Yes, "sbb reg, reg" (the same register for both operands) can be used to set a full register according only to the carry flag, and recent CPUs even know that this instruction does not depend on the previous value of reg, so no false dependency occurs here. However, I use three registers as a kind of ring buffer for the carry propagation and need to add to the register that holds the carry, so having the carry value negated (sbb leaves 0 or -1, not 0 or 1) was a bit of a problem. I thought about flipping the sign of the partial result in every pass so I could use sbb and then keep subtracting. I didn't do that, though; I may try it some time.
Alex |
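A rough C model of the sbb trick and of the "negated carry" wrinkle, for anyone who wants to see the arithmetic. This is a sketch only, not the actual mulredc code; `sbb_same` and `add_saved_carry` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `sbb reg, reg`: reg - reg - CF.
   Result is 0 when the carry is clear and all ones (-1) when it is set. */
static uint64_t sbb_same(int carry)
{
    return (uint64_t)0 - (uint64_t)(carry != 0);
}

/* Because the saved value is 0 or -1 rather than 0 or 1, the carry is
   applied later by subtracting the mask instead of adding it. */
static uint64_t add_saved_carry(uint64_t x, uint64_t mask)
{
    return x - mask; /* x - (-1) == x + 1 modulo 2^64 */
}
```

So `add_saved_carry(41, sbb_same(1))` gives 42 and `add_saved_carry(41, sbb_same(0))` gives 41. Whether a subtract fits into a given carry chain is exactly the problem described in the post; the model only shows that the value itself is recoverable.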
#20 |
Dec 2008
Boycotting the Soapbox
2D0₁₆ Posts |
That's clever. I quickly deleted my post with an inferior solution.
I was thinking about using rcr or rcl to create a carry cache and using clc to remove dependencies, so as to do several multiprecision adds in parallel. Your solution is much better. You da man!
Here's why I'm so excited: let's suppose we want to do 2 multiprecision adds in parallel, with eax and edx as temporaries. After sbb eax,eax, an add edx,edx will a) remove the dependency on the carry and b) restore the carry, so we can do load/adc/store N times, ending with sbb edx,edx and setting up the second stream with add eax,eax.
I count 4 + 2*3*N instructions, so if e.g. N==4 and Core 2 can do 1 add/clock, we're doing 8 adcs (4 per multiprecision add) in 12 cycles. This is 25% faster than the naive implementation.
If we're doing X+Y and X-Y and can reuse X and/or Y, Athlons might be able to do more than one 64-bit adc/clock. This would allow nice butterflies for medium-sized power-of-two moduli.
Edit: Two adds can be replaced with shifts. Duh. 10/8 = 1.25 clocks/adc for Core 2.
Last fiddled with by __HRB__ on 2009-02-21 at 18:39 Reason: I'm a moron |
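The interleaving idea can be checked for correctness (not for speed) with a portable C model. This is a sketch under assumptions: limbs are 64-bit and little-endian, `adc64` and `add2_interleaved` are made-up names, and the two `save_*` variables stand in for the eax/edx temporaries. It says nothing about cycle counts on any real CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One load/adc/store step: *r = a + b + cf, returns the carry out (0 or 1). */
static int adc64(uint64_t *r, uint64_t a, uint64_t b, int cf)
{
    uint64_t s = a + b;
    int c = s < a;               /* carry from a + b */
    uint64_t t = s + (uint64_t)cf;
    c |= t < s;                  /* carry from adding the incoming carry */
    *r = t;
    return c;
}

/* Two independent n-limb additions, rx = ax + bx and ry = ay + by,
   processed alternately in chunks of `chunk` limbs. Between chunks each
   stream's carry is parked as a 0 / all-ones mask (the `sbb reg,reg` step)
   and recovered from the mask's top bit (the carry out of `add reg,reg`).
   Returns the final carries: bit 0 for the x stream, bit 1 for the y stream. */
static int add2_interleaved(uint64_t *rx, const uint64_t *ax, const uint64_t *bx,
                            uint64_t *ry, const uint64_t *ay, const uint64_t *by,
                            size_t n, size_t chunk)
{
    uint64_t save_x = 0, save_y = 0;      /* stand-ins for eax and edx */
    if (chunk == 0)
        chunk = 1;                        /* avoid a loop that never advances */
    for (size_t i = 0; i < n; i += chunk) {
        size_t end = (n - i < chunk) ? n : i + chunk;
        int cf = (int)(save_x >> 63);     /* add eax,eax: restore x's carry */
        for (size_t j = i; j < end; j++)
            cf = adc64(&rx[j], ax[j], bx[j], cf);
        save_x = (uint64_t)0 - (uint64_t)cf;  /* sbb eax,eax: park x's carry */
        cf = (int)(save_y >> 63);         /* add edx,edx: restore y's carry */
        for (size_t j = i; j < end; j++)
            cf = adc64(&ry[j], ay[j], by[j], cf);
        save_y = (uint64_t)0 - (uint64_t)cf;  /* sbb edx,edx: park y's carry */
    }
    return (int)(save_x & 1) | (int)((save_y & 1) << 1);
}
```

The chunk size plays the role of N in the post. The result does not depend on it, which is the whole point: the carry survives the switch between streams.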
#21 | |
"Nancy"
Aug 2002
Alexandria
2,467 Posts |
Alex |
#22 | |
Jan 2008
France
2²×149 Posts |
I made that remark because I have restrictions on the order of instructions: my registers (inputs and destination) are allocated from above. In my case I was doing cmp + setcc + signextend, and I wrongly thought the only other option was mov 0 + cmp + setcc ("only" because I wrongly assumed that mov 0 clobbers the flags), which doesn't work if my allocated destination reg overlaps one of the input regs. Thanks for the hint! |