#1 |
May 2003
3·97 Posts |
Does anyone here know how well the Athlon 64 executes SSE2 instructions? I haven't seen any benchmarks on mersenne.org, so I don't know how it compares to the P4. The reason I ask is that I'm looking at a laptop for med school next fall and have narrowed it down to two models, one with a P4 2.8GHz and the other with an Athlon 64 3000+. At this point the decision comes down to the processors.
Is anybody here already using an Athlon 64?
#2 |
Jun 2003
3²×17 Posts |
I don't have any direct experience, but I believe the Athlon 64 does SSE2 almost as well as the P4 (i.e. P4s are still better at Prime95). P4s, with their higher clock speeds, are faster at Prime95 than equivalent Athlon 64s with their lower clock speeds. That said, I still believe the Athlon 64 is a much better choice than the P4, especially for a laptop, where battery life and processor temperature are considerations. AMD has better power management software than Intel (i.e. longer battery life) and its CPUs run cooler (less heat inside the laptop to get rid of). The Athlon 64 is also a more "future proof" solution than the P4, as Microsoft will eventually offer a 64-bit version of Windows that the Athlon 64 can run and the P4 cannot.
#3 |
Nov 2003
3×5×11 Posts |
The last page of the Perpetual Benchmark Page has Athlon 64 benchmarks.
#4 |
May 2003
3·97 Posts |
Now, I have heard somewhere that George Woltman is currently working on Opteron and Athlon 64 optimizations for Prime95. Correct me if I'm wrong, but wouldn't this make the Athlon 64 a better choice than the P4 for Prime95?
#5 |
Apr 2004
148 Posts |
There is a bug in the SSE2 instructions in Athlon 64s. A couple of other DC projects have tried SSE2 with Athlon 64s, and they run slower than with SSE2 disabled. The only projects I know of that have tried SSE2 on Athlon 64s are Seventeen or Bust and MD5CRK, and both are slower with SSE2 enabled.
#6 |
Sep 2002
2×331 Posts |
I don't think Athlon 64s have a bug in SSE2, but rather an unoptimized implementation.
Some modes of accessing the SSE2 registers carry a significant time penalty, and the x87 FPU and SSE2 share math units. The P4, by contrast, has separate math units: an x87 FPU with reduced circuitry and an extensive SSE2 unit. Its x87 FPU implementation is very slow, inferior to the P3's, a tradeoff made to optimize the SSE2 circuitry.
#7 |
"William Garnett III"
Oct 2002
Langhorne, PA
2·43 Posts |
Yeah Bionic, these Athlon FX processors, even with dual-channel DDR400, have poor SSE2. Even at the same clock speed, the Pentium 4 completely destroys the Athlon FX. On Prime95, SSE2 provides only a very, very slight improvement in iteration time over running with SSE2 disabled; it's almost a complete waste.
I would have loved to have a processor that was best at old x87 floating-point code (which it still is) and at the same time provided killer SSE2. What happened? Is it really a bug? AMD obviously knew that SSE2 performance was basically TERRIBLE before releasing the chip. Why would they still release it? regards, william
#8 |
Sep 2002
1010010110₂ Posts |
AMD had a release deadline and needed SSE2 so that programs that use SSE2 instead of the x87 FPU would run. AMD added all the 64-bit circuitry and wanted the processor to run cool and to get a decent number of chips from each wafer. AMD can optimize its SSE2 implementation in a future version, i.e. add more math units or improve access times to SSE2. It is the same reason the P4 has a very slow x87 FPU with reduced circuitry: even with its optimized SSE2, it still had to run programs that call FPU instructions. Removing the FPU completely would have caused all the older programs that expect it to fail. (Yes, it is possible to use an emulator program or have emulation code in the OS, but the performance would have been 10 to 100 times slower than even the current reduced-circuitry version.)
#9 | |
P90 years forever!
Aug 2002
Yeehaw, FL
23×1,021 Posts |
Quote:
|
|
#10 |
Apr 2004
3 Posts |
From what I've heard, part of the reason for the Athlon 64's poor performance (relative to the P4) is the halved cache bandwidth from L1 when using explicit register loads (the other reason is fewer FLOPs).
Since the problem does not occur with memory operands, have you thought of using MAXPD/MAXSD instructions as a sneaky way to load things from memory? This does require you to set aside an XMM register to hold 0 and to use some MOVs at the end of the loop to zero the registers you load into, but the MOVs should be free, since register renaming allows them to be hoisted and done in parallel with the "real" work. Of course, you could also go the less hackish route of actually using mem operands "properly", but that might be more work.
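To make the MAXPD idea concrete, here is a minimal sketch in C with SSE2 intrinsics. The function names are hypothetical, and whether the load actually gets folded into a memory operand is up to the compiler, so the sketch only shows what values each form produces. One caveat follows from how MAXPD is defined: taking the max against 0 reproduces only non-negative inputs, so negative lanes come back as 0. The ORPD variant is a substitution added here for comparison, not something proposed in the thread; it reproduces the loaded bits exactly, and nothing below says anything about its speed on an Athlon 64.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Baseline: an explicit register load followed by a store. */
static inline void load_explicit(const double *src, double out[2])
{
    _mm_storeu_pd(out, _mm_loadu_pd(src));
}

/* MAXPD trick: start from a zeroed register and take the packed max
 * against the data. Each lane becomes max(0.0, src[i]), so only
 * non-negative values survive unchanged. */
static inline void load_via_max(const double *src, double out[2])
{
    __m128d zero = _mm_setzero_pd();
    _mm_storeu_pd(out, _mm_max_pd(zero, _mm_loadu_pd(src)));
}

/* ORPD variant (an assumption of this sketch, not from the thread):
 * OR-ing with all-zero bits returns the source bits unchanged,
 * including negative values. */
static inline void load_via_or(const double *src, double out[2])
{
    __m128d zero = _mm_setzero_pd();
    _mm_storeu_pd(out, _mm_or_pd(zero, _mm_loadu_pd(src)));
}
```

If the data being loaded can be negative, the max-against-zero form is not a faithful load, which is one more argument for the "proper" memory-operand route mentioned above.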
#11 | |
Apr 2003
Berlin, Germany
192 Posts |
Quote:
If it is possible, the code should be modified to use mem operands, because this doesn't require additional instructions (to clear registers and sneakily load values), which occupy the FADD/FMUL pipelines. There are also other ways to handle the SSE2 loads and stores more effectively on Opteron/Athlon 64, such as spreading these instructions over the calculating blocks so they execute in otherwise unused time slots of the FMISC pipeline.