2008-04-16, 11:28  #1 
"James Heinrich"
May 2004
ex-Northern Ontario
2·1,621 Posts 
64-bit performance of v25.6
I just ran benchmarks of v25.6 on a fresh Vista64 install and found some odd results. The FFT tests are virtually identical between the 32-bit and 64-bit versions, but the TF (trial factoring) times vary. That's expected, but what I didn't expect is that the 64-bit version of Prime95 is actually slower than the 32-bit version (by ~7%) above 64 bits.
Obviously I should only TF <= 2^64 with benchmarks like that... (see attachment for numbers) 
2008-04-18, 12:34  #2 
Sep 2002
Oeiras, Portugal
1441_{10} Posts 
That's strange...
Have you tried to actually trial factor some exponents and compare the results? It may be a problem with the benchmark code, and the actual performance may be better. I have tried version 24 on a 64-bit system (XP64) and trial factoring was really great, nearly twice as fast as the 32-bit version! I can't see any reason for such poor performance under 25.6... 
2008-04-19, 09:26  #3 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
2·3·1,543 Posts 
For TF, I use only the 64-bit P95 (v25.6): it is approximately 1.6-1.7 times faster. But only for TF; everything else runs at the same speed, while more memory is used (the pointers are larger). This is on an AMD Opteron; it should be the same on Windows, but I haven't checked (but I will, I have a box somewhere).
Something is odd with the benchmarks. Also, if one compares two running processes (a 32-bit and a 64-bit one) and their periodic printouts, one might easily be confused: the 64-bit version prints a progress line at intervals almost 1.8-2 times longer, BUT it is important to notice that its chunks of work are 3 times bigger. So, as a ballpark, 3x more work is done in 1.8x the time. 
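To make that ballpark concrete, here is a minimal Python sketch of the arithmetic. The function name is made up for illustration; the only inputs are the figures quoted above (3x the work per progress line, printed at 1.8-2x the interval).

```python
def effective_speedup(work_ratio, time_ratio):
    """Throughput of the 64-bit build relative to the 32-bit build.

    work_ratio: how much more work one progress line covers (3x per the post).
    time_ratio: how much longer the interval between progress lines is.
    """
    return work_ratio / time_ratio


# Interval ratios quoted in the post: 1.8x to 2x longer between printouts.
low = effective_speedup(3, 2.0)
high = effective_speedup(3, 1.8)
print(f"64-bit TF is roughly {low:.2f}x to {high:.2f}x as fast")
```

That range is in line with the "approximately 1.6-1.7 times faster" figure, so the longer gaps between printouts do not mean the 64-bit build is slower.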
2008-04-19, 11:48  #4 
Sep 2006
Brussels, Belgium
3^{3}·61 Posts 
AMD processors do not have the problem; the Core2 processors do. This means that 64-bit Prime95 on Core2 should trial factor to 64 bits only, leaving the rest to more efficient OS/processor combinations. There is not enough data to see whether it is OS-related (I suppose not).
Prime95 benchmarks are not very accurate (they measure BEST times over a few iterations). But I see the same kind of results in "real life" trial factoring. Jacob 
2008-04-20, 11:27  #5 
"James Heinrich"
May 2004
ex-Northern Ontario
2·1,621 Posts 
I benchmarked my Intel Core2 Q6600 (overclocked to 3465MHz) and my AMD X2 3600+ (overclocked to 2613MHz). See attachment for details of numbers and some graphs. Numbers in graphs are scaled to equivalent of 1000MHz for both CPUs for comparison. Some observations:

2008-04-20, 11:31  #6 
"James Heinrich"
May 2004
ex-Northern Ontario
2·1,621 Posts 
And here is the Excel file I used for the above graphs. Raw benchmark data for both systems is also included in this file. Perhaps I made a calculation flaw somewhere?

2008-04-21, 09:37  #7 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
242A_{16} Posts 
approx. same results on my Win64 box, too
I second this concern. I looked at the timings on a Win64 box, as promised. And not just benchmarks but real timings (note that I set "Options::Preferences::Iterations between screen outputs" [OutputIterations= in prime.txt] to 150000 on 32-bit and 50000 on 64-bit; this parameter is not taken literally by the binary; you need to check that the percentage reports are the same, so that we compare apples to apples). The box is running 64-bit Windows XP Pro SP2, not Vista... Here are the specs, just in case: http://valid.x86secret.com/show_oc.php?id=339904
The results are not so different from yours. I will not bore you with numbers. (I did one- and two-thread TFs of 37M-range exponents, deleting temp files and going to different bit levels by editing worktodo.txt.) The difference is that 64-bit P95 on that box never quite gets slower than 32-bit, but it is just 1% faster at TF 2^65 and higher (which is within the margin of error, so effectively equal), not 8% slower; and, similar to your results, the lower bit levels give the 64-bit version a little more advantage (~10-15% faster), but it never reaches the AMD-type difference (60-70% faster).
I only hope that it is _just_ speed, not a bug. Tomorrow I will need to take an exponent p with a known factor of ~2^65 and check that both binaries find it. On the AMD I already did that with both the 32-bit and 64-bit binaries and with a pair of known solvable cases of TF and P-1 (that happened when I had no factors for a month. But both linux64/linux32 mprime on the AMD and P95 on Windows successfully found the known solutions and then found me some factors later. One thread found me several factors within an hour or so, and then again nothing for a month.)
P.S. A note for the (GU)I programming: not a bug, really, but a feature... It would be less confusing if OutputIterations=NNNNN were divided by three internally in the TF branch of the 64-bit binary, so that the screen progress reports would be printed at the same rate as by the 32-bit binary. I've found the place in the benchmark code where the timing result is divided by three! But not in the progress code. Divided or multiplied, you will figure it out. The assembly code in the 64-bit program does THREE times more work per iteration. Right? 
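As a rough sketch of the adjustment suggested in the P.S., the Python below scales an OutputIterations value so that both builds report after the same amount of work. This is not Prime95 code: the helper name and the table are hypothetical, and the factor of three is simply the claim made in this thread.

```python
# Assumption taken from this thread, not from the Prime95 source:
# one TF "iteration" in the 64-bit build covers 3x the work of a 32-bit one.
WORK_PER_ITERATION = {"32-bit": 1, "64-bit": 3}


def matched_output_iterations(setting, from_build, to_build):
    """Return the OutputIterations value for to_build that covers the same
    amount of work between progress lines as `setting` does on from_build."""
    total_work = setting * WORK_PER_ITERATION[from_build]
    return total_work // WORK_PER_ITERATION[to_build]


# The pairing used in the post: 150000 on 32-bit corresponds to 50000 on 64-bit.
print(matched_output_iterations(150000, "32-bit", "64-bit"))
```

With the settings matched this way, the wall-clock time between progress lines can be compared directly between the two binaries.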
2008-04-21, 09:58  #8 
Dec 2003
216_{10} Posts 
These results are somewhat artificial. By overclocking a lot you become very dependent on memory bandwidth and may even have to clock your RAM down to keep your system stable. This favours factoring and small FFT sizes.
I recently added more RAM, and to my surprise I had to reduce the CPU overclock from 5% to 1% to make my new RAM work properly at its advertised speed. memtest+ didn't find anything wrong, but mprime complained after a few minutes, or after seconds once things had warmed up. The solution was either to relax the RAM timings or to reduce the overclock. Keeping the RAM timings and reducing the overclock gave better results at large FFT sizes, while relaxing the RAM timings and keeping the CPU overclocked gave better timings on factoring. Also, an AMD processor with 1 MB of L2 cache would give much better results on FFT. A Q6600 has 8 MB of cache, while your X2 3600+ only has 512 KB. In the larger FFT tests you are actually measuring memory speed on your AMD box, not the CPU. I'll be back with a benchmark from my machine (X2 4400+, Socket 939) later. 
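A back-of-the-envelope way to see the cache argument is sketched below in Python. It assumes each FFT element is one 8-byte double and ignores the extra tables a real FFT needs, as well as the fact that a cache may be split between cores, so treat the result as a lower bound on the working set. The helper names are made up; the cache sizes are the ones quoted in this post.

```python
BYTES_PER_DOUBLE = 8  # assumption: one double-precision value per FFT element


def fft_working_set_kb(fft_length_k):
    """Lower bound on the FFT data size in KB for a length given in K elements."""
    return fft_length_k * BYTES_PER_DOUBLE


def fits_in_cache(fft_length_k, cache_kb):
    """True if the raw FFT data alone would fit in a cache of cache_kb KB."""
    return fft_working_set_kb(fft_length_k) <= cache_kb


# Cache sizes as quoted in the post: 8 MB (Q6600) and 512 KB (X2 3600+).
for length_k in (768, 2048, 8192):
    size_mb = fft_working_set_kb(length_k) / 1024
    print(f"{length_k}K FFT: at least {size_mb:.0f} MB of data, "
          f"fits in 8 MB: {fits_in_cache(length_k, 8192)}, "
          f"fits in 512 KB: {fits_in_cache(length_k, 512)}")
```

Under these assumptions even the smallest benchmarked length is far larger than 512 KB, which is consistent with the point that the larger FFT tests are dominated by memory speed on the smaller-cache machine.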
2008-04-21, 12:30  #9  
"James Heinrich"
May 2004
ex-Northern Ontario
110010101010_{2} Posts 
Quote:
A slight overstatement: the Q6600 has 2MB/core, compared to 512kB/core for the 3600+ (8MB and 1MB total, respectively). 

2008-04-21, 12:43  #10 
May 2005
2×809 Posts 

2008-04-21, 20:35  #11 
Dec 2003
2^{3}·3^{3} Posts 
[quote=S00113;131952]I'll be back with a benchmark from my machine (X2 4400+, Socket 939) later.[/quote]
Here is a benchmark from my Athlon X2 4400+, as promised. It is only 1% overclocked with reasonably fast DDR400 RAM (not overclocked). It beats your faster clocked 3600+ at about 5120 K FFT, and at 2048 K when running dual threaded, and at every size when normalized to 1000 MHz. The advantage of Core 2's larger L2 is hard to beat. Code:
AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
CPU speed: 2207.46 MHz, 2 cores
CPU features: RDTSC, CMOV, Prefetch, 3DNow!, MMX, SSE, SSE2
L1 cache size: 64 KB
L2 cache size: 1024 KB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
L1 TLBS: 32
L2 TLBS: 512
Prime95 64-bit version 25.6, RdtscTiming=1
Best time for 768K FFT length: 33.281 ms.
Best time for 896K FFT length: 39.768 ms.
Best time for 1024K FFT length: 44.120 ms.
Best time for 1280K FFT length: 56.410 ms.
Best time for 1536K FFT length: 68.693 ms.
Best time for 1792K FFT length: 83.098 ms.
Best time for 2048K FFT length: 92.796 ms.
Best time for 2560K FFT length: 122.523 ms.
Best time for 3072K FFT length: 149.391 ms.
Best time for 3584K FFT length: 179.569 ms.
Best time for 4096K FFT length: 200.497 ms.
Best time for 5120K FFT length: 259.962 ms.
Best time for 6144K FFT length: 318.204 ms.
Best time for 7168K FFT length: 388.933 ms.
Best time for 8192K FFT length: 443.642 ms.
Timing FFTs using 2 threads.
Best time for 768K FFT length: 20.824 ms.
Best time for 896K FFT length: 24.822 ms.
Best time for 1024K FFT length: 28.336 ms.
Best time for 1280K FFT length: 37.422 ms.
Best time for 1536K FFT length: 45.127 ms.
Best time for 1792K FFT length: 53.983 ms.
Best time for 2048K FFT length: 60.583 ms.
Best time for 2560K FFT length: 80.743 ms.
Best time for 3072K FFT length: 97.034 ms.
Best time for 3584K FFT length: 115.233 ms.
Best time for 4096K FFT length: 129.359 ms.
Best time for 5120K FFT length: 143.909 ms.
Best time for 6144K FFT length: 181.758 ms.
Best time for 7168K FFT length: 234.895 ms.
Best time for 8192K FFT length: 292.915 ms.
Best time for 58 bit trial factors: 3.344 ms.
Best time for 59 bit trial factors: 3.337 ms.
Best time for 60 bit trial factors: 3.565 ms.
Best time for 61 bit trial factors: 3.560 ms.
Best time for 62 bit trial factors: 4.258 ms.
Best time for 63 bit trial factors: 4.985 ms.
Best time for 64 bit trial factors: 6.084 ms.
Best time for 65 bit trial factors: 7.255 ms.
Best time for 66 bit trial factors: 7.209 ms.
Best time for 67 bit trial factors: 7.194 ms.
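For anyone who wants to compare these numbers with the per-1000-MHz figures used earlier in the thread, here is a small Python sketch. It assumes "scaled to the equivalent of 1000 MHz" means multiplying the measured time by clock speed / 1000, i.e. that run time scales linearly with clock speed; that is my reading, not something stated in the thread, and the function name is made up.

```python
def normalize_to_1000mhz(time_ms, cpu_mhz):
    """Estimate the time the same work would take at 1000 MHz,
    assuming run time scales inversely with clock speed."""
    return time_ms * cpu_mhz / 1000.0


CPU_MHZ = 2207.46  # from the benchmark header above

# Two trial-factoring timings copied from the benchmark output above.
tf_ms = {64: 6.084, 65: 7.255}

for bits, measured in tf_ms.items():
    scaled = normalize_to_1000mhz(measured, CPU_MHZ)
    print(f"{bits}-bit TF: {measured:.3f} ms measured, {scaled:.3f} ms at 1000 MHz")

# Relative cost of stepping from 64-bit to 65-bit factors on this machine.
step = tf_ms[65] / tf_ms[64]
print(f"65-bit factors take {step:.2f}x as long as 64-bit factors")
```

The same helper can be applied to the FFT timings to compare machines running at different clock speeds.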