mersenneforum.org  

2008-04-16, 11:28   #1
James Heinrich
May 2004
ex-Northern Ontario

64-bit performance of v25.6

I just ran benchmarks of v25.6 on a fresh Vista64 install and found some odd results. The FFT timings are virtually identical between the 32- and 64-bit versions, but the trial factoring (TF) times vary. That's expected, but what I didn't expect is that the 64-bit version of Prime95 is actually slower than the 32-bit version (by ~7%) above 64 bits.
Obviously I should only TF <= 2^64 with benchmarks like that...
(see attachment for numbers)

2008-04-18, 12:34   #2
lycorn
Sep 2002
Oeiras, Portugal

That's strange...
Have you tried actually trial factoring some exponents and comparing the results? There may be some problem with the benchmark code, and the actual performance may be better.
I have tried version 24 on a 64-bit system (XP-64) and trial factoring was really great, nearly twice as fast as the 32-bit version! I can't see any reason for such poor performance under 25.6...

2008-04-19, 09:26   #3
Batalov
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

For TF, I use only 64-bit P95 (v25.6); it is approximately 1.6-1.7 times faster. But only for TF; everything else runs at the same speed while using more memory (the pointers are larger). This is on an AMD Opteron; it should be the same on Windows, but I haven't checked (I will, I have a box somewhere).

Something is odd with the benchmarks. Also, if one compares two running processes (a 32-bit and a 64-bit) by their periodic printouts, one might easily be confused: the 64-bit binary prints a progress line at intervals almost 1.8-2 times longer, BUT it is important to notice that its chunks of work are 3 times bigger. So, as a ballpark, it does 3x the work in 1.8x the time.
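
The ballpark in this post can be checked with a little arithmetic. The following is only a sketch: it assumes the 64-bit build does 3 times as much work per progress line and takes 1.8 to 2 times as long per line (the figures quoted above), and the function name is made up for illustration.

```python
def relative_speed(work_ratio: float, time_ratio: float) -> float:
    """Throughput of one build relative to another, given how much more
    work it does per progress line and how much longer each line takes."""
    return work_ratio / time_ratio


# 3x the work per progress line, at 1.8x to 2x the interval between lines
for time_ratio in (1.8, 2.0):
    speed = relative_speed(3.0, time_ratio)
    print(f"3x work in {time_ratio}x time -> {speed:.2f}x the throughput")
```

At the 1.8x end this lands in the 1.6-1.7x range quoted for the Opteron, so the longer gaps between progress lines do not by themselves mean the 64-bit build is slower.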

2008-04-19, 11:48   #4
S485122
Sep 2006
Brussels, Belgium

AMD processors do not have the problem; the Core2 processors do. This means that 64-bit Prime95 on a Core2 should trial factor to 64 bits only, leaving the rest to more efficient OS/processor combinations. There is not enough data to see if it is OS related (I suppose not).

Prime95 benchmarks are not very accurate (they measure BEST times over a few iterations). But I see the same kind of results in "real life" trial factoring.

Jacob

2008-04-20, 11:27   #5
James Heinrich
May 2004
ex-Northern Ontario

I benchmarked my Intel Core2 Q6600 (overclocked to 3465MHz) and my AMD X2 3600+ (overclocked to 2613MHz). See attachment for details of numbers and some graphs. Numbers in graphs are scaled to equivalent of 1000MHz for both CPUs for comparison. Some observations:
  • Core2 is faster than X2 at TF in Prime95-32 (12% faster <= 2^61; 23% faster 2^62-63; 68% faster >= 2^64)
  • X2 is faster than Core2 at TF in Prime95-64 (by 11-28%)
  • Core2 usually performs better at TF in Prime95-64 than Prime95-32, by 17%-33%, except 2^60/61/63 is almost tied, and >= 2^65 is actually 8% slower in 64-bit than 32-bit.
  • Multi-threaded FFT scales very well from 1 to 2 threads on Core2, with only modest gains going to 3 and 4 threads. One observed anomaly is that FFT <= 1024K performs significantly worse on 3 and 4 threads than it does on 2 threads.
  • X2 performance increase is less dramatic going from 1 to 2 threads than on Core2 but no unexpected results.
  • Overall the Core2 walks all over the X2 in FFT performance (no significant difference for either between Prime95-64 and Prime95-32). Numbers shown are scaled to 1000MHz equivalent for comparison, and the Core2 is at least 100% faster across the board (sometimes closer to 200%), and in most cases a single Core2 thread is still >= 100% faster than two-threaded X2.
Attached image: 32vs64.png (86.9 KB)
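
The "scaled to 1000MHz" comparison used in this post can be sketched as below. This assumes a simple linear scaling (time multiplied by clock speed in MHz, divided by 1000); the formula actually used in the spreadsheet is not shown in the thread. The clock speeds are the ones quoted above, but the timings are placeholders, not measured values.

```python
def scale_to_1000mhz(time_ms: float, clock_mhz: float) -> float:
    """Time the same work would take at 1000 MHz, assuming run time is
    inversely proportional to clock speed (an assumption, not a given)."""
    return time_ms * clock_mhz / 1000.0


def percent_faster(slow_ms: float, fast_ms: float) -> float:
    """How much faster the quicker result is, in percent (100 = twice as fast)."""
    return (slow_ms / fast_ms - 1.0) * 100.0


# Clock speeds from the post; the 10.0 ms timings are placeholders only.
core2 = scale_to_1000mhz(10.0, 3465.0)
x2 = scale_to_1000mhz(10.0, 2613.0)
print(f"Core2: {core2:.2f} ms at 1000 MHz, X2: {x2:.2f} ms at 1000 MHz")
print(f"Difference: {percent_faster(max(core2, x2), min(core2, x2)):.1f}%")
```

Note that linear scaling ignores memory bandwidth, which does not rise with the CPU clock, so normalized numbers for large FFT sizes should be read with some caution.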

2008-04-20, 11:31   #6
James Heinrich
May 2004
ex-Northern Ontario

And here is the Excel file I used for the above graphs. Raw benchmark data for both systems is also included in this file. Perhaps I made a calculation error somewhere?
Attached file: 32vs64.zip (96.5 KB)

2008-04-21, 09:37   #7
Batalov
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

approx. same results on my Win64 box, too

I second this concern. I looked at the timings on a Win64 box, as promised, and not just the benchmarks but actual run timings. (Note that I set "Options::Preferences::Iterations between screen outputs" [OutputIterations= in prime.txt] to 150000 on 32-bit and 50000 on 64-bit; this parameter is not taken literally by the binary, so you need to check that the percentage reports match in order to compare apples to apples.) The box is running 64-bit Windows XP Pro SP2, not Vista. Here are the specs, just in case: http://valid.x86-secret.com/show_oc.php?id=339904

The results are not so different from yours. I will not bore you with numbers. (I ran one- and two-thread TF on 37M-range exponents, deleting temp files and switching bit levels by editing worktodo.txt.) The difference is that 64-bit P95 on that box never quite gets slower than 32-bit: it is just 1% faster at TF 2^65 and higher (within the margin of error, so effectively equal), not 8% slower. Similar to your results, the lower bit levels give the 64-bit binary a somewhat larger advantage (~10-15% faster), but it never reaches the AMD-type difference (60-70% faster). I only hope that it is _just_ speed, not a bug.

Tomorrow I will need to take an exponent p with a known factor near 2^65 and check that both binaries find it. On the AMD I already did that with both the 32-bit and 64-bit binaries, using a pair of known cases for TF and P-1. (That happened when I had found no factors for a month. Both linux64/linux32 mprime (on the AMD) and P95 on Windows successfully found the known factors, and later found me some new ones. One thread found several factors within an hour or so, and then again nothing for a month.)

P.S. A note on the (GU)I programming: not a bug, really, but a feature... It would be less confusing if OutputIterations=NNNNN were divided by three internally in the 64-bit binary's TF branch, so that progress reports appear on screen at the same rate as with the 32-bit binary. I found the place in the benchmark code where the timing result is divided by three, but nothing equivalent in the progress code. Divided or multiplied, you will figure it out. The 64-bit assembly code does THREE times more work. Right?
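
The adjustment suggested in the P.S. amounts to the following. This is a hypothetical sketch of the idea only, not Prime95 code: the factor of three is the figure claimed in this post, and the function and constant names are invented for illustration.

```python
# Assumption from this post: the 64-bit TF code does 3x the work per iteration.
TF_WORK_FACTOR_64BIT = 3


def comparable_output_iterations(setting_32bit: int,
                                 work_factor: int = TF_WORK_FACTOR_64BIT) -> int:
    """OutputIterations value for the 64-bit build that yields progress
    lines covering the same amount of work as the 32-bit setting."""
    return setting_32bit // work_factor


print(comparable_output_iterations(150000))
```

With the 32-bit setting of 150000 used above, this gives the 50000 that was set by hand on the 64-bit side.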

2008-04-21, 09:58   #8
S00113
Dec 2003

These results are somewhat artificial. By overclocking a lot you become very dependent on memory bandwidth and may even have to clock your RAM down to keep your system stable. This favours factoring and small FFT sizes.

I recently added more RAM, and to my surprise I had to reduce my CPU overclock from 5% to 1% to make the new RAM work properly at its advertised speed. memtest+ didn't find anything wrong, but mprime complained after a few minutes, or after a few seconds once things had warmed up. The solution was either to reduce the RAM timings or to reduce the overclock. Keeping the RAM timings and reducing the overclock gave better results at large FFT sizes, while reducing the RAM timings and keeping the CPU overclocked gave better timings on factoring.

Also, an AMD processor with 1 MB of L2 cache would give much better results on FFT. A Q6600 has 8 MB of cache, while your X2 3600+ only has 512 KB. In the larger FFT tests you are actually measuring memory speed on your AMD box, not CPU speed.
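
A rough way to see why large FFTs end up measuring memory speed is to compare the FFT data size against the L2 size. The sketch below assumes each FFT element is an 8-byte double-precision value and ignores any auxiliary tables; both are simplifications, and the helper names are invented for illustration.

```python
BYTES_PER_ELEMENT = 8  # assumption: one double-precision value per FFT element


def fft_data_kb(fft_length_k: int) -> int:
    """Approximate FFT data size in KB for a length given in K elements."""
    return fft_length_k * BYTES_PER_ELEMENT


def fits_in_cache(fft_length_k: int, cache_kb: int) -> bool:
    """True if the FFT data alone would fit in a cache of the given size."""
    return fft_data_kb(fft_length_k) <= cache_kb


# Cache sizes mentioned in this post: 512 KB and 1 MB (1024 KB).
for length_k in (768, 1024, 8192):
    print(f"{length_k}K FFT: ~{fft_data_kb(length_k)} KB of data, "
          f"fits in 512 KB: {fits_in_cache(length_k, 512)}, "
          f"fits in 1024 KB: {fits_in_cache(length_k, 1024)}")
```

Under these assumptions even the smallest benchmarked length (768K) needs several megabytes, far more than either AMD cache size, so main memory bandwidth matters throughout.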

I'll be back with a benchmark from my machine (X2 4400+, Socket 939) later.

2008-04-21, 12:30   #9
James Heinrich
May 2004
ex-Northern Ontario

Quote:
Originally Posted by S00113
These results are somewhat artificial. By overclocking a lot you get very dependent on memory bandwidth and may even have to clock your RAM down to keep your system stable. This favours factoring and small FFT sizes.
I certainly won't deny a lack of RAM bandwidth. On the Q6600 the RAM is running at almost its rated speed: 770MHz. It would be 800MHz, but I could never quite get the system stable at 3.6GHz. Nevertheless, if I had money for 1066MHz RAM (or faster) I'm sure I'd see considerable benefit. On the 3600+ side, yes, it is starved for bandwidth again, thanks to a poor choice of overclocking board (Asus M2N-E); the RAM is running at 733MHz. The Q6600 system is a bit more flexible: I could push the RAM to ~900MHz, at least long enough to get some benchmark numbers, if that would be useful to anyone.

Quote:
Originally Posted by S00113
A Q6600 has 8 MB cache, while your X2 3600+ only has 512 KB.
A slight overstatement -- the Q6600 has 2MB/core, compared to the 512kB/core for the 3600+ (8MB and 1MB total, respectively).

2008-04-21, 12:43   #10
Cruelty
May 2005

Quote:
Originally Posted by James Heinrich
A slight overstatement -- the Q6600 has 2MB/core, compared to the 512kB/core for the 3600+ (8MB and 1MB total, respectively)
Actually, it has 2 x 4MB.

2008-04-21, 20:35   #11
S00113
Dec 2003

Quote:
Originally Posted by S00113
I'll be back with a benchmark from my machine (X2 4400+, Socket 939) later.

Here is a benchmark from my Athlon X2 4400+, as promised. It is only 1% overclocked, with reasonably fast DDR400 RAM (not overclocked). It beats your faster-clocked 3600+ at about 5120K FFT, and at 2048K when running dual-threaded, and at every size when normalized to 1000 MHz. The advantage of Core 2's larger L2 is hard to beat.

Code:
AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
CPU speed: 2207.46 MHz, 2 cores
CPU features: RDTSC, CMOV, Prefetch, 3DNow!, MMX, SSE, SSE2
L1 cache size: 64 KB
L2 cache size: 1024 KB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
L1 TLBS: 32
L2 TLBS: 512
Prime95 64-bit version 25.6, RdtscTiming=1
Best time for 768K FFT length: 33.281 ms.
Best time for 896K FFT length: 39.768 ms.
Best time for 1024K FFT length: 44.120 ms.
Best time for 1280K FFT length: 56.410 ms.
Best time for 1536K FFT length: 68.693 ms.
Best time for 1792K FFT length: 83.098 ms.
Best time for 2048K FFT length: 92.796 ms.
Best time for 2560K FFT length: 122.523 ms.
Best time for 3072K FFT length: 149.391 ms.
Best time for 3584K FFT length: 179.569 ms.
Best time for 4096K FFT length: 200.497 ms.
Best time for 5120K FFT length: 259.962 ms.
Best time for 6144K FFT length: 318.204 ms.
Best time for 7168K FFT length: 388.933 ms.
Best time for 8192K FFT length: 443.642 ms.
Timing FFTs using 2 threads.
Best time for 768K FFT length: 20.824 ms.
Best time for 896K FFT length: 24.822 ms.
Best time for 1024K FFT length: 28.336 ms.
Best time for 1280K FFT length: 37.422 ms.
Best time for 1536K FFT length: 45.127 ms.
Best time for 1792K FFT length: 53.983 ms.
Best time for 2048K FFT length: 60.583 ms.
Best time for 2560K FFT length: 80.743 ms.
Best time for 3072K FFT length: 97.034 ms.
Best time for 3584K FFT length: 115.233 ms.
Best time for 4096K FFT length: 129.359 ms.
Best time for 5120K FFT length: 143.909 ms.
Best time for 6144K FFT length: 181.758 ms.
Best time for 7168K FFT length: 234.895 ms.
Best time for 8192K FFT length: 292.915 ms.
Best time for 58 bit trial factors: 3.344 ms.
Best time for 59 bit trial factors: 3.337 ms.
Best time for 60 bit trial factors: 3.565 ms.
Best time for 61 bit trial factors: 3.560 ms.
Best time for 62 bit trial factors: 4.258 ms.
Best time for 63 bit trial factors: 4.985 ms.
Best time for 64 bit trial factors: 6.084 ms.
Best time for 65 bit trial factors: 7.255 ms.
Best time for 66 bit trial factors: 7.209 ms.
Best time for 67 bit trial factors: 7.194 ms.
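
Benchmark output in this format is easy to post-process. As a minimal sketch, the script below parses "Best time for ...K FFT length" lines and computes the 2-thread speedup per FFT size. It assumes that lines appearing before "Timing FFTs using N threads." are single-threaded; the sample contains four lines copied from the listing above, and the function name is made up for illustration.

```python
import re

# Four lines copied from the benchmark above (768K and 8192K, 1 and 2 threads).
SAMPLE = """\
Best time for 768K FFT length: 33.281 ms.
Best time for 8192K FFT length: 443.642 ms.
Timing FFTs using 2 threads.
Best time for 768K FFT length: 20.824 ms.
Best time for 8192K FFT length: 292.915 ms.
"""

FFT_LINE = re.compile(r"Best time for (\d+)K FFT length: ([\d.]+) ms\.")
THREADS_LINE = re.compile(r"Timing FFTs using (\d+) threads\.")


def parse_fft_times(text: str) -> dict:
    """Return {thread_count: {fft_length_k: best_time_ms}}."""
    threads = 1  # assumption: results before any "Timing FFTs" line are 1-thread
    times = {}
    for line in text.splitlines():
        match = THREADS_LINE.match(line)
        if match:
            threads = int(match.group(1))
            continue
        match = FFT_LINE.match(line)
        if match:
            times.setdefault(threads, {})[int(match.group(1))] = float(match.group(2))
    return times


times = parse_fft_times(SAMPLE)
for size in sorted(times[1]):
    speedup = times[1][size] / times[2][size]
    print(f"{size}K FFT: {speedup:.2f}x faster with 2 threads")
```

Pasting the full listing into SAMPLE would give the scaling for every FFT size; trial factoring lines are ignored by this parser.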

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.