mersenneforum.org  

Old 2004-04-18, 15:44   #1
ThomRuley
 
 
May 2003

3·97 Posts
Athlon 64 and SSE2

Does anyone here know how well the Athlon 64 implements SSE2 instructions? I haven't seen any benchmarks on mersenne.org, so I don't know how it compares to the P4. I ask because I am looking at laptops for med school next fall and have narrowed it down to two models, one with a 2.8 GHz P4 and the other with an Athlon 64 3000+. At the moment, the decision comes down to the processors.

Is anybody here already using an Athlon 64?
Old 2004-04-18, 16:27   #2
RMAC9.5
 
 
Jun 2003

3²×17 Posts

I don't have any direct experience, but I believe the Athlon 64 does SSE2 almost as well as the P4 (i.e. P4s are still better at Prime95). P4s, with their higher clock speeds, are faster at Prime95 than equivalent Athlon 64s with their lower clock speeds. That said, I still believe the Athlon 64 is a much better choice than the P4, especially for a laptop, where battery life and processor temperature are considerations. AMD has better power management software than Intel (i.e. longer battery life), and its CPUs run cooler (less heat inside the laptop to get rid of). The Athlon 64 is also a more "future-proof" solution than the P4, as Microsoft will eventually offer a 64-bit version of Windows that the Athlon 64 can run and the P4 cannot.
Old 2004-04-18, 17:35   #3
nfortino
 
 
Nov 2003

3×5×11 Posts

The last page of the Perpetual Benchmark Page has Athlon 64 benchmarks.
Old 2004-04-18, 22:46   #4
ThomRuley
 
 
May 2003

3·97 Posts

Now, I have heard somewhere that George Woltman is currently working on Opteron and Athlon 64 optimizations for Prime95. Correct me if I'm wrong, but wouldn't this make the Athlon 64 a better choice than the P4 for Prime95?
Old 2004-04-19, 02:53   #5
Bionic_Redneck
 
 
Apr 2004

148 Posts

There is a bug in the SSE2 instructions in Athlon 64s. A couple of other distributed computing (DC) projects have tried SSE2 with Athlon 64s, and they run slower than with SSE2 disabled. The only projects I know of that have tried SSE2 with Athlon 64s are Seventeen or Bust and MD5CRK, and both are slower with SSE2 enabled.
Old 2004-04-19, 03:57   #6
dsouza123
 
 
Sep 2002

2×331 Posts

I don't think Athlon 64s have a bug in SSE2, but rather an unoptimized implementation.
Some modes of accessing the SSE2 registers carry a significant time penalty.
The x87 FPU and SSE2 share math units.

The P4 has a very slow implementation of x87 FPU instructions, inferior to the P3's, as a tradeoff for optimizing the SSE2 circuitry.
It has separate math units: an x87 FPU with reduced circuitry and an extensive SSE2 unit.
Old 2004-04-19, 03:57   #7
wfgarnett3
 
 
"William Garnett III"
Oct 2002
Langhorne, PA

2·43 Posts

Yeah Bionic, these Athlon FX processors, even with dual-channel DDR400, have poor SSE2. And even at the same clock speed, the Pentium 4 completely destroys the Athlon FX. On Prime95, SSE2 provides only a very, very slight improvement in iteration time over running with SSE2 disabled; it's almost a complete waste.

I would have loved to have a processor that was best at old x87 floating point code (which it still is) and at the same time provided killer SSE2. What happened? Is it really a bug? AMD obviously knew that SSE2 performance was basically TERRIBLE before releasing the chip. Why would they still release it?

regards,
william
Old 2004-04-19, 04:25   #8
dsouza123
 
 
Sep 2002

1010010110₂ Posts

AMD had a release deadline and needed SSE2 so that programs that use SSE2 instead of the x87 FPU would run.
AMD added all the 64-bit circuitry and wanted the processor to run cool and to get a decent number of chips on each wafer.

AMD can optimize its SSE2 implementation in a future version, i.e. add more math units or improve access times to SSE2.

It is the same reason the P4 has a very slow x87 FPU with reduced circuitry: even with its optimized SSE2, it still had to run programs that call FPU instructions.
Removing the FPU completely would have caused all the older programs that expect it to fail.
(Yes, it is possible to use an emulator program or to have emulation code in the OS, but the performance would have been 10 to 100 times slower than even the current reduced-circuitry version.)
Old 2004-04-21, 03:56   #9
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

2³×1,021 Posts

Quote:
Originally Posted by ThomRuley
Now, I have heard somewhere that George Woltman is currently working on Opteron and Athlon 64 optimizations for Prime95? Correct me if I'm wrong, but wouldn't this make the Athlon 64 a better choice than the P4 for Prime95?
NO. The P4 and Athlon 64 have the same peak theoretical FPU throughput per clock cycle. The faster clocking of the P4 makes it the clear winner for now. This observation applies to prime95 only.
Old 2004-04-24, 02:51   #10
IlleglWpns
 
Apr 2004

3 Posts

From what I've heard, part of the reason for the Athlon 64's poor performance (relative to the P4) is the halved cache bandwidth from L1 when using explicit register loads (the other reason is fewer FLOPs).

Since the problem does not occur with memory operands, have you thought of using MAXPD/MAXSD instructions as a sneaky way to load things from memory? This does require you to set aside an XMM register to hold 0 and use some MOVs at the end of the loop to zero the registers you load into, but the MOVs should be free since register renaming allows them to be hoisted and done in parallel with the "real" work.

Of course you could also go the less hackish route of actually using mem operands "properly" but that might be more work.
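
A minimal C sketch of the MAXPD idea above, written with SSE2 intrinsics rather than hand-written assembly. The helper name `load_pair_via_maxpd` is hypothetical, and whether the load really ends up as a memory operand of `maxpd` is decided by the compiler, so this only illustrates the semantics under that assumption; it is not taken from Prime95.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: "load" two doubles from src by taking MAXPD
 * against a register that holds 0.0 in both lanes.
 * In hand-written assembly this would be roughly
 *     maxpd xmm1, [src]      ; xmm1 was zeroed beforehand
 * which needs a 16-byte aligned address. _mm_loadu_pd is used here
 * only so that the sketch also works on unaligned data. */
static void load_pair_via_maxpd(const double *src, double out[2])
{
    __m128d zero = _mm_setzero_pd();                 /* the XMM register set aside to hold 0 */
    __m128d v = _mm_max_pd(zero, _mm_loadu_pd(src)); /* max(0.0, x) per lane */
    _mm_storeu_pd(out, v);
}
```

For inputs that are zero or positive the result equals the value in memory, so the instruction behaves like a load. A negative input comes back as 0.0, so the trick only fits data known to be non-negative.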
Old 2004-04-24, 09:07   #11
Dresdenboy
 
 
Apr 2003
Berlin, Germany

192 Posts

Quote:
Originally Posted by IlleglWpns
Since the problem does not occur with memory operands, have you thought of using MAXPD/MAXSD instructions as a sneaky way to load things from memory? This does require you to set aside an XMM register to hold 0 and use some MOVs at the end of the loop to zero the registers you load into, but the MOVs should be free since register renaming allows them to be hoisted and done in parallel with the "real" work.
Yes, this is a nice idea. It is similar to using XORPD in combination with a zeroed register. But MAXPD would choose 0.0 if the value to be loaded is negative.

If it is possible, the code should be modified to use mem operands, because this doesn't require additional instructions (to clear registers and sneakily load values) which occupy the FADD/FMUL pipelines. And there are other ways to handle the SSE2 loads and stores more effectively on Opteron/Athlon 64, such as spreading these instructions over the calculating blocks so that they execute in otherwise unused time slots of the FMISC pipeline.
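
The difference between the two "sneaky load" variants can be shown with a short C sketch using SSE2 intrinsics. Both function names are hypothetical, and the compiler rather than the programmer chooses the final instruction form, so this demonstrates only the arithmetic behaviour, not the pipeline effects discussed above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: XORPD against a zeroed register.
 * x XOR 0 == x bit for bit, so the sign is preserved. */
static void load_pair_via_xorpd(const double *src, double out[2])
{
    __m128d zero = _mm_setzero_pd();
    _mm_storeu_pd(out, _mm_xor_pd(zero, _mm_loadu_pd(src)));
}

/* Hypothetical helper: MAXPD against a zeroed register.
 * max(0.0, x) returns 0.0 for any negative x. */
static void load_pair_via_maxpd_zero(const double *src, double out[2])
{
    __m128d zero = _mm_setzero_pd();
    _mm_storeu_pd(out, _mm_max_pd(zero, _mm_loadu_pd(src)));
}
```

With the input pair {-2.0, 3.0}, the XORPD version returns both values unchanged, while the MAXPD version returns 0.0 in place of -2.0.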

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
