mersenneforum.org  

Old 2010-10-15, 01:30   #12
harlee
 
Prime95,

When you get a chance, could you please check the FFT timings on the Northwood P4? On my system the times are a lot worse starting at the 5120K FFT size compared to v25.11. Also, running two threads takes a large hit on the timings.

Intel(R) Pentium(R) 4 CPU 2.60GHz
CPU speed: 2593.64 MHz, with hyperthreading
CPU features: RDTSC, CMOV, Prefetch, MMX, SSE, SSE2
L1 cache size: 8 KB
L2 cache size: 512 KB
L1 cache line size: 64 bytes
L2 cache line size: 128 bytes
TLBS: 64
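
For reference, timings like these come from the built-in benchmark (Options > Benchmark in the Windows build, or the equivalent menu entry in mprime), with the RdtscTiming option enabled so the RDTSC counter is used for timing. A minimal snippet, assuming RdtscTiming is the undoc.txt option and belongs in prime.txt:
Code:
RdtscTiming=1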

Code:
Prime95 32-bit version 25.11, RdtscTiming=1
Best time for 2048K FFT length: 72.493 ms.
Best time for 2560K FFT length: 94.382 ms.
Best time for 3072K FFT length: 113.950 ms.
Best time for 3584K FFT length: 140.564 ms.
Best time for 4096K FFT length: 157.926 ms.
Best time for 5120K FFT length: 204.788 ms.
Best time for 6144K FFT length: 247.533 ms.
Best time for 7168K FFT length: 304.596 ms.
Best time for 8192K FFT length: 343.190 ms.

Timing FFTs using 2 threads on 1 physical CPUs.
Best time for 2048K FFT length: 72.402 ms.
Best time for 2560K FFT length: 98.310 ms.
Best time for 3072K FFT length: 116.630 ms.
Best time for 3584K FFT length: 143.320 ms.
Best time for 4096K FFT length: 161.859 ms.
Best time for 5120K FFT length: 209.678 ms.
Best time for 6144K FFT length: 252.714 ms.
Best time for 7168K FFT length: 319.826 ms.
Best time for 8192K FFT length: 373.945 ms.
Code:
Prime95 32-bit version 26.3, RdtscTiming=1
Best time for 2048K FFT length: 71.826 ms., avg: 71.990 ms.
Best time for 2560K FFT length: 94.154 ms., avg: 94.687 ms.
Best time for 3072K FFT length: 115.905 ms., avg: 116.100 ms.
Best time for 3584K FFT length: 146.346 ms., avg: 146.889 ms.
Best time for 4096K FFT length: 160.572 ms., avg: 161.003 ms.
Best time for 5120K FFT length: 269.619 ms., avg: 270.445 ms.
Best time for 6144K FFT length: 273.642 ms., avg: 274.084 ms.
Best time for 7168K FFT length: 381.843 ms., avg: 382.839 ms.
Best time for 8192K FFT length: 474.259 ms., avg: 475.534 ms.

Timing FFTs using 2 threads on 1 physical CPUs.
Best time for 2048K FFT length: 73.795 ms., avg: 74.213 ms.
Best time for 2560K FFT length: 101.121 ms., avg: 101.620 ms.
Best time for 3072K FFT length: 126.920 ms., avg: 127.683 ms.
Best time for 3584K FFT length: 199.918 ms., avg: 201.022 ms.
Best time for 4096K FFT length: 199.139 ms., avg: 199.910 ms.
Best time for 5120K FFT length: 273.639 ms., avg: 285.329 ms.
Best time for 6144K FFT length: 303.541 ms., avg: 310.955 ms.
Best time for 7168K FFT length: 411.566 ms., avg: 422.451 ms.
Best time for 8192K FFT length: 610.665 ms., avg: 614.937 ms.
Old 2010-10-16, 21:59   #13
Prime95

Quote:
Originally Posted by harlee
When you get a chance, could you please check the FFT timings on the Northwood P4? On my system the times are a lot worse starting at the 5120K FFT size compared to v25.11.
This is not surprising. In 26.3, I arbitrarily decided not to include Northwood-optimized FFTs above 4M. AMD K8s are only optimized through 4M as well. Later-model P4s are optimized through 6M. Only Core 2 and AMD K10s are optimized through 32M.
Old 2010-10-17, 00:18   #14
harlee
 
Quote:
Originally Posted by Prime95
This is not surprising. In 26.3, I arbitrarily decided not to include Northwood-optimized FFTs above 4M. AMD K8s are only optimized through 4M as well. Later-model P4s are optimized through 6M. Only Core 2 and AMD K10s are optimized through 32M.
Thanks for the info.
Old 2010-10-18, 17:23   #15
Ethan Hansen
 
Processor affinity in Linux x64 version

As far as I can ascertain, there appears to be no method to force CPU affinity for each thread on mprime x64, at least for dual-processor, hyperthreaded systems. This behavior predates V26.3, but it is still present in the current version. The AffinityScramble string in local.txt appears not to be honored.

System details: mprime is running double-checks on a dual Xeon E5520 CentOS 5 box with hyperthreading enabled. This system runs a fairly busy web server, and we noticed a hit to server responsiveness when mprime was running on all 8 physical/16 virtual cores. Dropping one core per CPU from mprime's run list proved sufficient to restore full responsiveness to the server. Two workers are running, using 6 multithreaded CPUs each. I assumed setting the AffinityScramble string in local.txt would do the trick, e.g.:
Code:
AffinityScramble=01234589ABCD67EF
No such luck. Running htop showed each mprime thread jumping from processor to processor, and taskset confirmed that the CPU mask for each thread was 0xFFFF. Manually forcing the threads to use the CPU affinity specified by AffinityScramble not only frees a physical core on each CPU, it also produces a higher system load average. I tested this by taking the web server offline for half an hour last night. When running in the default, promiscuous-CPU mode, the 12 mprime threads produced a load average of 8.6 (i.e. the 12 threads used 8.6 CPUs' worth of processor time). Using taskset to force the CPU affinity of each thread raised the load average to 10.7. I assume this corresponds to less memory swapping and correspondingly higher CPU utilization. I do not see any practical way to benchmark this, as there is no way to quickly set the processor affinity during a benchmark run.
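
For illustration, a rough sketch of how the threads can be pinned by hand with taskset — the CPU list below is only an example, and the order in which ps reports the thread IDs is not guaranteed to match mprime's worker/helper-thread layout:
Code:
#!/bin/bash
# Rough sketch: pin each mprime thread to one CPU from an example list.
# Adjust CPUS to your own topology.
CPUS=(0 1 2 3 4 5 8 9 10 11 12 13)
i=0
for tid in $(ps -Lo lwp= -p "$(pidof mprime)"); do
    taskset -cp "${CPUS[i]}" "$tid"
    i=$(( (i + 1) % ${#CPUS[@]} ))
done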

So, we appear to get more efficient CPU utilization by forcing thread affinity, but the gains are temporary. Each time mprime starts a new exponent or pauses LL testing to trial factor a newly assigned exponent, the default behavior of threads not being bound to a specific CPU resumes. I welcome any suggestions if I have mprime configured incorrectly.
Old 2010-10-18, 20:03   #16
Prime95

Quote:
Originally Posted by Ethan Hansen
No such luck.
Can you email me the complete prime.txt and local.txt files?
Old 2010-10-18, 23:30   #17
Xyzzy
 
Is there an ETA for "UseLargePages" for mprime?

Old 2010-10-19, 00:43   #18
Prime95

Quote:
Originally Posted by Ethan Hansen
As far as I can ascertain, there appears to be no method to force CPU affinity for each thread on mprime x64, at least for dual-processor, hyperthreaded systems.
"Smart affinity" when num_workers * threads_per_worker is not the same as the number of cores or number of logical processors is the same as run on any logical processor.

You'll need to set worker #1 to run on cpu #1 (the 5 helper threads will be assigned to cpu #2 - #6). Set the second worker to run on cpu #7 (helper threads will be assigned to cpu #8 - #12).
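
As an illustration only, that layout might look something like this in local.txt — the [Worker #n] section names, the ThreadsPerTest/Affinity option names, and whether the CPU numbering starts at 0 or 1 should all be checked against undoc.txt; treat this as a sketch rather than confirmed syntax:
Code:
[Worker #1]
ThreadsPerTest=6
Affinity=1

[Worker #2]
ThreadsPerTest=6
Affinity=7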
Old 2010-10-19, 00:44   #19
Prime95

Quote:
Originally Posted by Xyzzy
Is there an ETA for "UseLargePages" for mprime?
Probably never. Preliminary tests were not very promising (no noticeable speed increase).
Old 2010-10-19, 00:51   #20
Rhyled
 
Quote:
Originally Posted by petrw1
1. All that I have read here and experienced suggests that hyperthreading does NOT really double your number of CPUs and it is almost always best to only use 4 workers and then experiment with whether or not you want to use 2 CPUs with any of the workers.
I can confirm that statement running an i7 920, Windows 7 64-bit, and Prime95 v25.11. Total throughput actually dropped ~5% when hyperthreading was enabled, while running a single number on each core. That's not counting the extra overclocking headroom you get with HT off, so I net about 10% better throughput on 4 cores without HT.

I haven't tried it on v26.3 yet, but I'd expect the same. HT works best when the cores aren't completely saturated, and Prime95 really hits the floating-point units hard.
Old 2010-10-19, 04:27   #21
ixfd64

Are there any plans for a GUI version of Mprime?
Old 2010-10-19, 04:30   #22
Prime95

Quote:
Originally Posted by ixfd64
Are there any plans for a GUI version of Mprime?
Volunteers welcome :)

Is there a standard Linux GUI now?