#1
P90 years forever!
Aug 2002
Yeehaw, FL
13×563 Posts
The following snippet is typical of prime95's FFT code. Read eight registers, do some math, write them back.
On the P4 the code below takes 41 and 45 clocks if the data is in the L1 and L2 cache respectively. The optimal is 40 clocks (20 addpd/subpd instructions). The Opteron takes 51 and 74 respectively. If you comment out the loads or stores you can get to the 40 clock optimum. Moving the stores up in the code or spreading out the loads did not help. I'm beginning to suspect the Opteron bottleneck is in loading and storing the XMM registers. More research is needed - I've only got one day invested. Dresdenboy, any insights???

[code:1]
x2cl_eight_reals_fft MACRO srcreg,srcinc,d1
	movapd	xmm0, [srcreg]
	movapd	xmm1, [srcreg+d1]
	movapd	xmm2, [srcreg+16]
	movapd	xmm3, [srcreg+d1+16]
	movapd	xmm4, [srcreg+32]
	movapd	xmm5, [srcreg+d1+32]
	movapd	xmm6, [srcreg+48]
	movapd	xmm7, [srcreg+d1+48]
	lea	srcreg, [srcreg+srcinc]
	x8r_fft
	movapd	[srcreg-srcinc], xmm7
	movapd	[srcreg-srcinc+16], xmm6
	movapd	[srcreg-srcinc+32], xmm4
	movapd	[srcreg-srcinc+48], xmm5
	movapd	[srcreg-srcinc+d1], xmm1
	movapd	[srcreg-srcinc+d1+16], xmm3
	movapd	[srcreg-srcinc+d1+32], xmm0
	movapd	[srcreg-srcinc+d1+48], xmm2
	ENDM

x8r_fft MACRO
	subpd	xmm3, xmm7		;; new R8 = R4 - R8
	multwo	xmm7
	addpd	xmm7, xmm3		;; new R4 = R4 + R8
	subpd	xmm1, xmm5		;; new R6 = R2 - R6
	multwo	xmm5
	addpd	xmm5, xmm1		;; new R2 = R2 + R6
	mulpd	xmm3, XMM_SQRTHALF	;; R8 = R8 * square root
	mulpd	xmm1, XMM_SQRTHALF	;; R6 = R6 * square root
	subpd	xmm0, xmm4		;; new R5 = R1 - R5
	multwo	xmm4
	addpd	xmm4, xmm0		;; new R1 = R1 + R5
	subpd	xmm5, xmm7		;; R2 = R2 - R4 (new & final R4)
	multwo	xmm7			;; R4 = R4 * 2
	subpd	xmm2, xmm6		;; new R7 = R3 - R7
	multwo	xmm6
	addpd	xmm6, xmm2		;; new R3 = R3 + R7
	subpd	xmm1, xmm3		;; R6 = R6 - R8 (Real part)
	multwo	xmm3			;; R8 = R8 * 2
	subpd	xmm4, xmm6		;; R1 = R1 - R3 (new & final R3)
	multwo	xmm6			;; R3 = R3 * 2
	addpd	xmm7, xmm5		;; R4 = R2 + R4 (new R2)
	addpd	xmm3, xmm1		;; R8 = R6 + R8 (Imaginary part)
	subpd	xmm0, xmm1		;; R5 = R5 - R6 (final R7)
	multwo	xmm1			;; R6 = R6 * 2
	addpd	xmm6, xmm4		;; R3 = R1 + R3 (new R1)
	subpd	xmm2, xmm3		;; R7 = R7 - R8 (final R8)
	multwo	xmm3			;; R8 = R8 * 2
	subpd	xmm6, xmm7		;; R1 = R1 - R2 (final R2)
	multwo	xmm7			;; R2 = R2 * 2
	addpd	xmm1, xmm0		;; R6 = R5 + R6 (final R5)
	addpd	xmm3, xmm2		;; R8 = R7 + R8 (final R6)
	addpd	xmm7, xmm6		;; R2 = R1 + R2 (final R1)
	ENDM
[/code:1]
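A minimal sketch of the kind of rdtsc loop such clock counts can be measured with (this is not prime95's actual harness; srcinc = 0 and d1 = 64 are placeholder values chosen so the same L1 lines get reused, and serialization and loop overhead are ignored):

[code:1]
	;; esi is assumed to point at a 16-byte aligned scratch buffer
	rdtsc
	mov	ebp, eax		;; start timestamp (low 32 bits)
	mov	ecx, 1000		;; repeat to amortize the rdtsc overhead
again:
	x2cl_eight_reals_fft esi, 0, 64
	dec	ecx
	jnz	again
	rdtsc
	sub	eax, ebp		;; total clocks; divide by 1000 for clocks per call
[/code:1]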
#2
Aug 2002
6516 Posts
What if you load the xmm registers in the order you need them in x8r_fft? Usually only the loads are the bottleneck, since you need the result to move forward; a store you can just post.
Maybe it is the address calculation that takes time. Are they always in one cache line (or two, since there is a d1)?
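Something like this, maybe (untested - same operands as the macro above, just with the loads issued in the order x8r_fft first consumes the registers, so the leading subpd/addpd pairs only have to wait for the first few loads):

[code:1]
	movapd	xmm3, [srcreg+d1+16]	;; R4 and R8: used by the very first subpd
	movapd	xmm7, [srcreg+d1+48]
	movapd	xmm1, [srcreg+d1]	;; R2 and R6: used next
	movapd	xmm5, [srcreg+d1+32]
	movapd	xmm0, [srcreg]		;; R1, R5, R3, R7: the math on the earlier
	movapd	xmm4, [srcreg+32]	;; registers can overlap these loads
	movapd	xmm2, [srcreg+16]
	movapd	xmm6, [srcreg+48]
	lea	srcreg, [srcreg+srcinc]
	x8r_fft
[/code:1]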
#3
Apr 2003
Berlin, Germany
5518 Posts
One quick idea I have here is that a movapd (which issues 2 ops to any 2 of the FADD/FMUL/FSTORE units) can't be paired well with computational instructions. It could help to split them up into movhpd and movlpd, although those have a higher load latency (4).
Is the code running in 32-bit mode? In 64-bit mode it could help to move XMM_TWO and XMM_SQRTHALF to some free SSE2 registers. Currently the code needs six 64-bit loads from the cache for the first two instructions of x8r_fft. I'll try the code piece in a pipeline simulator to check if there is some issue we can't think of that easily. On the weekend I'll test some modified code on SourceForge's compile farm.
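What I have in mind, as an untested sketch (the choice of xmm14/xmm15 for the constants is arbitrary):

[code:1]
	;; one 128-bit load split into two 64-bit halves
	movlpd	xmm0, QWORD PTR [srcreg]	;; low half of R1
	movhpd	xmm0, QWORD PTR [srcreg+8]	;; high half of R1

	;; 64-bit mode: keep the constants resident instead of reloading them
	movapd	xmm14, XMM_SQRTHALF
	movapd	xmm15, XMM_TWO
	;; ... later, inside x8r_fft:
	mulpd	xmm3, xmm14			;; instead of: mulpd xmm3, XMM_SQRTHALF
[/code:1]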
#4
Apr 2003
Berlin, Germany
192 Posts
Currently it looks like there are too many memory accesses. When I simply replace all muls by constants in memory with muls that multiply the register with itself, and shuffle the movapd's a bit, the clock count goes down to 44.
That could stay the same or even get smaller if the memory constants are kept in additional registers. And it seems the decoders aren't the limit, because aligning the code to 8-byte boundaries doesn't change anything. Maybe this effect is hidden due to FPU scheduler contention.
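The substitution was roughly this (it changes the results, of course - the point is only to take the constant loads out of the picture):

[code:1]
	mulpd	xmm3, xmm3	;; stand-in for: mulpd xmm3, XMM_SQRTHALF
	mulpd	xmm1, xmm1	;; stand-in for: mulpd xmm1, XMM_SQRTHALF
[/code:1]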
#5
Apr 2003
Berlin, Germany
192 Posts
After aligning the branch target of the test loop to a 16-byte boundary and some more reshuffling of loads/stores, I got it down to 47 cycles.
Interestingly, it can change by 5 or more cycles if just one memory source address is changed by some n*16 bytes, which could be bank conflict penalties: they occur if bits [5:3] are the same for two loads in the same cycle. An optimization tutorial by Tim Wilkens has shown some pipeline behaviour of SSE2 code for DGEMM. It seems a movapd from memory executes as 2 sequential loads which are scheduled to one of the 3 units. I have to run some tests to see whether this is true or whether it schedules the loads to 2 different units in the same cycle.
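For illustration (assuming srcreg is 64-byte aligned and d1 happens to be a multiple of 64 - I don't know the real values): two loads like these would land on the same L1 bank, because their addresses agree in bits [5:3]:

[code:1]
	movapd	xmm2, [srcreg+16]	;; offset 0x10 -> bits [5:3] = 010
	movapd	xmm3, [srcreg+d1+16]	;; also 0x10 mod 64 -> same bank
[/code:1]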
#6
Apr 2003
Berlin, Germany
192 Posts
Some investigations and known facts:
- code padding for SSE2 is not necessary
- loads of full SSE2 regs are executed in 2 consecutive cycles by one unit
- memory operands cause one cache access per cycle (here one has to take care of the increased latency of 7 cycles), but usually there is enough room for that; OTOH there are a lot of loads, stores and memory operands in the above code

So the reason for the 51 cycles seems to be L1 cache bandwidth plus the changed latency (compared to the P4) of some instructions, which makes the scheduler less effective. It only holds 12 lines of 3 macro-ops each.

The following disassembled code of my test loop executes in 32 cycles. It shows that it is possible to sustain 3 FPU ops and 2 loads per cycle. If we consider the availability of 64-bit mode, it would be easier to go for doing 2 iterations in ~82 cycles instead of one in 41.

[code:1]
  4005e0:  66 0f 58 46 20        addpd  0x20(%rsi),%xmm0
  4005e5:  66 0f 59 c9           mulpd  %xmm1,%xmm1
  4005e9:  66 44 0f 28 43 f0     movapd 0xfffffffffffffff0(%rbx),%xmm8
  4005ef:  66 0f 58 56 20        addpd  0x20(%rsi),%xmm2
  4005f4:  66 0f 59 db           mulpd  %xmm3,%xmm3
  4005f8:  66 44 0f 28 4b 10     movapd 0x10(%rbx),%xmm9
  4005fe:  66 0f 58 66 20        addpd  0x20(%rsi),%xmm4
  400603:  66 0f 59 ed           mulpd  %xmm5,%xmm5
  400607:  66 44 0f 28 53 20     movapd 0x20(%rbx),%xmm10
  40060d:  66 0f 58 46 20        addpd  0x20(%rsi),%xmm0
  400612:  66 0f 59 c9           mulpd  %xmm1,%xmm1
  400616:  66 44 0f 28 5b 30     movapd 0x30(%rbx),%xmm11
  40061c:  66 0f 58 56 20        addpd  0x20(%rsi),%xmm2
  400621:  66 0f 59 db           mulpd  %xmm3,%xmm3
  400625:  66 44 0f 28 63 d0     movapd 0xffffffffffffffd0(%rbx),%xmm12
  40062b:  66 0f 58 66 20        addpd  0x20(%rsi),%xmm4
  400630:  66 0f 59 ed           mulpd  %xmm5,%xmm5
  400634:  66 44 0f 28 6b e0     movapd 0xffffffffffffffe0(%rbx),%xmm13
  40063a:  66 0f 58 46 20        addpd  0x20(%rsi),%xmm0
  40063f:  66 0f 59 c9           mulpd  %xmm1,%xmm1
  400643:  66 44 0f 28 73 f0     movapd 0xfffffffffffffff0(%rbx),%xmm14
  400649:  66 0f 58 56 20        addpd  0x20(%rsi),%xmm2
  40064e:  66 0f 59 db           mulpd  %xmm3,%xmm3
  400652:  66 44 0f 28 7b d0     movapd 0xffffffffffffffd0(%rbx),%xmm15
  400658:  66 0f 58 66 20        addpd  0x20(%rsi),%xmm4
  40065d:  66 0f 59 ed           mulpd  %xmm5,%xmm5
  400661:  66 44 0f 28 43 10     movapd 0x10(%rbx),%xmm8
  400667:  66 0f 58 46 20        addpd  0x20(%rsi),%xmm0
  40066c:  66 0f 59 c9           mulpd  %xmm1,%xmm1
  400670:  66 44 0f 28 4b 20     movapd 0x20(%rbx),%xmm9
  400676:  66 0f 58 56 20        addpd  0x20(%rsi),%xmm2
  40067b:  66 0f 59 db           mulpd  %xmm3,%xmm3
  40067f:  66 44 0f 28 53 30     movapd 0x30(%rbx),%xmm10
  400685:  66 0f 58 66 20        addpd  0x20(%rsi),%xmm4
  40068a:  66 0f 59 ed           mulpd  %xmm5,%xmm5
  40068e:  66 44 0f 28 5b d0     movapd 0xffffffffffffffd0(%rbx),%xmm11
  400694:  66 0f 58 46 20        addpd  0x20(%rsi),%xmm0
  400699:  66 0f 59 c9           mulpd  %xmm1,%xmm1
  40069d:  66 44 0f 28 63 e0     movapd 0xffffffffffffffe0(%rbx),%xmm12
  4006a3:  66 0f 58 56 20        addpd  0x20(%rsi),%xmm2
  4006a8:  66 0f 59 db           mulpd  %xmm3,%xmm3
  4006ac:  66 44 0f 28 6b f0     movapd 0xfffffffffffffff0(%rbx),%xmm13
  4006b2:  66 0f 58 66 20        addpd  0x20(%rsi),%xmm4
  4006b7:  66 0f 59 ed           mulpd  %xmm5,%xmm5
  4006bb:  66 44 0f 28 73 30     movapd 0x30(%rbx),%xmm14
[/code:1]
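In sketch form, the two-iteration idea would look something like this (untested; the second butterfly lives entirely in xmm8-xmm15 and works on the data one srcinc further along):

[code:1]
	movapd	xmm0, [srcreg]			;; iteration i
	movapd	xmm8, [srcreg+srcinc]		;; iteration i+1
	movapd	xmm1, [srcreg+d1]
	movapd	xmm9, [srcreg+srcinc+d1]
	movapd	xmm2, [srcreg+16]
	movapd	xmm10, [srcreg+srcinc+16]
	;; ... remaining loads interleaved the same way, then two copies of
	;; x8r_fft (one on xmm0-xmm7, one on xmm8-xmm15) interleaved
	;; instruction by instruction, then both sets of stores
	lea	srcreg, [srcreg+2*srcinc]
[/code:1]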
#7
P90 years forever!
Aug 2002
Yeehaw, FL
1C97₁₆ Posts
As an aside, do you know if Microsoft MASM is going to support the extra XMM registers? Also, is it true that these extra registers are only available in 64-bit mode? If so, then MASM would have to output a whole new object file format, true?
Converting all that assembly code to some other syntax would be a horrendously tedious task.
#8
Apr 2003
Berlin, Germany
192 Posts
There is a version of MASM for AMD64 available:
(found at http://www.sandpile.org/post/msgs/20004230.htm) Quote:
In a discussion about the DDK it was mentioned that 64-bit mode only supports SSE and SSE2 - no more x87, 3DNow! or MMX. I don't know if this is true, but it wouldn't make sense to disable those: the instruction codes aren't used for anything else, and x87 is at least needed for transcendentals and other more complex functions.

Matthias
#9
Aug 2002
3×37 Posts

Quote:
[code:1]
	addpd	xmm0, xmm1	; r0 <- r0 + r1
	mul_minustwo	xmm1	; r1 <- -2*r1
	addpd	xmm1, xmm0	; r1 <- r0 - r1
[/code:1]
At exit, xmm0 is still the new r0 and xmm1 the new r1. The advantage here is that the same xmm register contains the same r at exit. That way it is easier to understand the code, I think. My $0.02. :)

Guillermo
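mul_minustwo would be something like this (assuming an XMM_MINUSTWO constant with two copies of -2.0, analogous to the existing XMM_TWO):

[code:1]
mul_minustwo MACRO reg
	mulpd	reg, XMM_MINUSTWO	;; assumed constant: packed {-2.0, -2.0}
	ENDM
[/code:1]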
#10
P90 years forever!
Aug 2002
Yeehaw, FL
13×563 Posts

Quote:
Anyway, the real reason it is a mul-by-two is that it lets you choose between "addpd reg, reg" and "mulpd reg, XMM_TWO" rather easily.
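In sketch form (not the exact prime95 source, but it boils down to a one-line macro, so switching is a single edit):

[code:1]
multwo MACRO reg
	addpd	reg, reg	;; variant 1: costs an FADD slot
;;	mulpd	reg, XMM_TWO	;; variant 2: costs an FMUL slot plus a constant load
	ENDM
[/code:1]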
#11
P90 years forever!
Aug 2002
Yeehaw, FL
13×563 Posts
Thanks for the MASM link, I'll play with it some. Already, I've noticed that some x86 instructions no longer exist. Like "push ebp" and "push OFFSET global_var". Looks like I'll have to download the x86-64 manual.
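The usual 64-bit replacements appear to be something like this (a sketch only - I haven't verified it against the 64-bit MASM yet):

[code:1]
	push	rbp			;; pushes of 32-bit registers are gone;
					;; the 64-bit forms still work
	lea	rax, global_var		;; a symbol's address no longer fits in a
	push	rax			;; push immediate, so go through a register
[/code:1]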
I sure hope the MASM output can be turned into a Linux-compatible object file. Does anyone know what format the MASM object file is?