#1 |
Mar 2003
2 Posts |
Just for grins, I ran benchmarks on my P4 at 2320 MHz with and without SSE2. To disable SSE2, I put "CpuSupportsSSE2=0" in local.ini. The end result: SSE2 is worth about a 3x speedup. At a similar clock rate, an Athlon crushes the P4 without SSE2. So it stands to reason that an Athlon with SSE2 would be a sweet processor.
WITH SSE2:
[code:1]
Intel(R) Pentium(R) 4 CPU 1.60GHz
CPU speed: 2319.69 MHz
CPU features: RDTSC, CMOV, PREFETCH, MMX, SSE, SSE2
L1 cache size: 8 KB
L2 cache size: 512 KB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 version 23.2, RdtscTiming=1
Best time for 384K FFT length: 16.327 ms.
Best time for 448K FFT length: 19.262 ms.
Best time for 512K FFT length: 21.948 ms.
Best time for 640K FFT length: 28.516 ms.
Best time for 768K FFT length: 34.684 ms.
Best time for 896K FFT length: 42.614 ms.
Best time for 1024K FFT length: 46.313 ms.
Best time for 1280K FFT length: 62.494 ms.
Best time for 1536K FFT length: 76.973 ms.
Best time for 1792K FFT length: 96.039 ms.
Best time for 2048K FFT length: 108.204 ms.
[/code:1]

WITHOUT SSE2:
[code:1]
Intel(R) Pentium(R) 4 CPU 1.60GHz
CPU speed: 2319.60 MHz
CPU features: RDTSC, CMOV, PREFETCH, MMX, SSE
L1 cache size: 8 KB
L2 cache size: 512 KB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 version 23.2, RdtscTiming=1
Best time for 384K FFT length: 49.217 ms.
Best time for 448K FFT length: 58.415 ms.
Best time for 512K FFT length: 65.181 ms.
Best time for 640K FFT length: 84.251 ms.
Best time for 768K FFT length: 106.694 ms.
Best time for 896K FFT length: 123.732 ms.
Best time for 1024K FFT length: 143.338 ms.
Best time for 1280K FFT length: 184.298 ms.
Best time for 1536K FFT length: 221.057 ms.
Best time for 1792K FFT length: 261.316 ms.
Best time for 2048K FFT length: 298.231 ms.
[/code:1]

ATHLON:
[code:1]
AMD Athlon(tm) XP 2600+
CPU speed: 2254.60 MHz
CPU features: RDTSC, CMOV, PREFETCH, MMX, SSE
L1 cache size: 64 KB
L2 cache size: 256 KB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
L1 TLBS: 32
L2 TLBS: 256
Prime95 version 23.2, RdtscTiming=1
Best time for 384K FFT length: 26.646 ms.
Best time for 448K FFT length: 30.303 ms.
Best time for 512K FFT length: 33.531 ms.
Best time for 640K FFT length: 43.511 ms.
Best time for 768K FFT length: 53.481 ms.
Best time for 896K FFT length: 62.682 ms.
Best time for 1024K FFT length: 71.481 ms.
Best time for 1280K FFT length: 92.451 ms.
Best time for 1536K FFT length: 111.819 ms.
Best time for 1792K FFT length: 137.076 ms.
Best time for 2048K FFT length: 153.008 ms.
[/code:1]
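For anyone who wants to check the "about 3x" figure, here is a rough sketch that computes the per-FFT-length ratios from the three runs above. The timings are copied from the benchmark output; the variable and function names (`p4_sse2`, `p4_no_sse2`, `athlon`, `ratios`) are made up for this illustration and are not part of Prime95.

```python
# Best times in ms per FFT length (in K), copied from the benchmark output above.
p4_sse2 = {384: 16.327, 448: 19.262, 512: 21.948, 640: 28.516,
           768: 34.684, 896: 42.614, 1024: 46.313, 1280: 62.494,
           1536: 76.973, 1792: 96.039, 2048: 108.204}
p4_no_sse2 = {384: 49.217, 448: 58.415, 512: 65.181, 640: 84.251,
              768: 106.694, 896: 123.732, 1024: 143.338, 1280: 184.298,
              1536: 221.057, 1792: 261.316, 2048: 298.231}
athlon = {384: 26.646, 448: 30.303, 512: 33.531, 640: 43.511,
          768: 53.481, 896: 62.682, 1024: 71.481, 1280: 92.451,
          1536: 111.819, 1792: 137.076, 2048: 153.008}


def ratios(slow, fast):
    """Return slow/fast time ratio for each FFT length (hypothetical helper)."""
    return {k: slow[k] / fast[k] for k in sorted(fast)}


sse2_speedup = ratios(p4_no_sse2, p4_sse2)        # P4: SSE2 off vs. on
athlon_vs_p4_plain = ratios(p4_no_sse2, athlon)   # Athlon vs. P4 without SSE2
p4_sse2_vs_athlon = ratios(athlon, p4_sse2)       # P4 with SSE2 vs. Athlon

for k in sse2_speedup:
    print(f"{k:5d}K  SSE2 on/off: {sse2_speedup[k]:.2f}x  "
          f"Athlon over plain P4: {athlon_vs_p4_plain[k]:.2f}x  "
          f"P4+SSE2 over Athlon: {p4_sse2_vs_athlon[k]:.2f}x")
```

These are raw wall-clock ratios; the two machines run at slightly different clocks (about 2320 MHz vs. 2255 MHz), so the Athlon comparison is not clock-normalized.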
#2 |
Aug 2002
2×7×13×47 Posts |
Yes, but the AMD at 2300MHz is at the limit of its design... The P4 can scale much farther... The reason the P4 is so crappy in IPC (its long pipeline) is precisely the reason it can scale so high...
In other words, life is (always) a compromise... |
#3 | |
Aug 2002
3·83 Posts |
![]() Quote:
|
|
#4 |
Aug 2002
2568 Posts |
The point of this post is zero, if you ask me.
SSE2 is part of what makes a P4 a P4. If you want to cripple a P4 to make an Athlon XP look better, then that's fine. But it doesn't mean anything. |
#5 |
Mar 2003
210 Posts |
The point of this post is to show that the only reason P4s are the GIMPS "CPU of choice" is SSE2. For every other distributed computing project, Athlons are the "CPU of choice" by quite a margin. All I'm saying is that if AMD released an Athlon with SSE2 (quite possible... and coming), it would eclipse the P4.
|
#6 | |
Aug 2002
205528 Posts |
![]() Quote:
|
|
#7 | |
Aug 2002
CA16 Posts |
![]() Quote:
|
|
#8 | |
Aug 2002
59 Posts |
![]() Quote:
It will still be interesting to see how the Opteron/Athlon 64 performs with Prime though. |
|
#9 |
Apr 2003
Berlin, Germany
192 Posts |
Some slides on optimization for Opteron even say that for double precision one should use x87, since its peak throughput is nearly the same as SSE2's, with the plus of having more instructions (log2, sin, cos, ...).
Athlon and AMD64 can do FADD+FMUL+FLD/FST per cycle with x87, or 1 FMUL and 1 FADD with SSE2 (AMD64 only). Unfortunately, an ADDPD can't run in parallel with a MULPD because each of them needs 2 of the 3 decoder ports. But an ADDPS could be issued, since it needs only one port. (2 ports are needed to issue double FP operations, one for each half.) On the P4, too, only one 64-bit FP op per cycle is possible with SSE2. But the P4 reaches higher clocks, which is its advantage here. SSE2 also lets you hold 32 double-precision FP numbers in the 16 registers at once, while x87 still has only 8 regs. DDB |
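The register-capacity figure in the post is simple arithmetic, sketched here for anyone who wants to see where the 32 comes from. The register counts are the ones stated above (16 XMM registers in AMD64 mode, 8 x87 registers); the 128-bit XMM width is standard SSE2. The constant names are made up for this illustration.

```python
# Register counts as stated in the post; XMM width is the standard SSE2 128 bits.
XMM_REGISTERS = 16   # XMM registers available in AMD64 mode
XMM_BITS = 128       # width of one XMM register
DOUBLE_BITS = 64     # width of one double-precision value
X87_REGISTERS = 8    # x87 stack registers, one value each

# Each XMM register holds two packed doubles.
doubles_per_xmm = XMM_BITS // DOUBLE_BITS
doubles_in_sse2 = XMM_REGISTERS * doubles_per_xmm
doubles_in_x87 = X87_REGISTERS

print(f"SSE2 (AMD64): {doubles_in_sse2} doubles in registers")
print(f"x87:          {doubles_in_x87} values in registers")
print(f"Ratio:        {doubles_in_sse2 // doubles_in_x87}x")
```

This only counts how many double-precision values fit in registers at once; x87 registers are 80 bits wide internally, so the comparison says nothing about precision.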