mersenneforum.org  

Old 2011-11-04, 19:42   #826
bcp19
 
Oct 2011

7·97 Posts

Quote:
Originally Posted by Mr. P-1 View Post
There are two things happening here.

First, without changing the bounds, the algorithm runs faster with more memory. Secondly, both P-1 and ECM are memory-bandwidth-hungry algorithms. If both are running at the same time, they will compete for the available memory bandwidth, slowing down both.
So, if I understand what you are saying, when the extra memory was made available, the ECM grabbed more, causing the restarted P-1 to actually use less memory and therefore run slower?

Hmm, is there a recommended type of work for each core to minimize problems like this? I notice my other system seems to run faster with cores 1 and 3 doing LL/DC and cores 2 and 4 doing TF (with all 4 cores running LL the time per iteration was ~.090 on each; with the LL/TF/LL/TF setup the LLs are at ~.060). I know that when the P-1 finishes and core 1 moves on to LL it won't be as much of a memory hog (plus the ECM will be done), but would running a P-1 be better on core 2, 3 or 4? From my other system it seems 1/2 and 3/4 are kinda linked, so any thoughts would be appreciated.
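
The two timings quoted above can be turned into a rough LL throughput comparison. This is only a sketch in Python: it assumes the figures are seconds per iteration and that every LL core runs at the quoted rate, and the function name is made up for the example.

```python
# Rough LL throughput from the per-iteration timings quoted in the post.
# Assumption: timings are seconds per iteration, identical on every LL core.

def ll_iterations_per_second(ll_cores, seconds_per_iteration):
    """Combined LL iterations per second across all cores doing LL."""
    return ll_cores / seconds_per_iteration

all_ll = ll_iterations_per_second(4, 0.090)  # LL on all four cores
mixed = ll_iterations_per_second(2, 0.060)   # LL on cores 1 and 3, TF on 2 and 4

ratio = mixed / all_ll
print(f"all-LL: {all_ll:.1f} iter/s, LL/TF/LL/TF: {mixed:.1f} iter/s, ratio: {ratio:.2f}")
```

Under those assumptions the mixed setup keeps roughly three quarters of the LL throughput while leaving two cores free for TF.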
Old 2011-11-04, 19:45   #827
James Heinrich
 
"James Heinrich"
May 2004
ex-Northern Ontario

3186₁₀ Posts

Quote:
Originally Posted by Dubslow View Post
I remember reading somewhere that you aren't supposed to report no factor results from mfakto?
That would be kind of pointless... where did you read that?
Old 2011-11-04, 20:08   #828
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×2,399 Posts

Ah, the GPU FAQ PDF guide that Brain created, available in the GPU FAQ threads.
Old 2011-11-04, 20:16   #829
KyleAskine
 
Oct 2011
Maryland

2×5×29 Posts

Quote:
Originally Posted by James Heinrich View Post
That would be kind of pointless... where did you read that?
Agree... if this is really the case, that would probably mean it isn't worth anyone's time to run it, since 98% of the factors will have to be rechecked by mfaktc
Old 2011-11-04, 21:19   #830
delta_t
 
Nov 2002
Anchorage, AK

545₈ Posts

Quote:
Originally Posted by Dubslow View Post
I remember reading somewhere that you aren't supposed to report no factor results from mfakto?
That has since been resolved by bdot I believe. Check http://mersenneforum.org/showpost.ph...&postcount=132 and http://mersenneforum.org/showpost.ph...&postcount=136

So mfakto 0.09 should be the latest version, with this fixed.

Last fiddled with by delta_t on 2011-11-04 at 21:24 Reason: Added links to mfakto thread
Old 2011-11-05, 00:26   #831
Mr. P-1
 
Jun 2003

7×167 Posts

Quote:
Originally Posted by bcp19 View Post
So, if I understand what you are saying, when the extra memory was made available, the ECM grabbed more, causing the restarted P-1 to actually use less memory and therefore run slower?
I should clarify what I said before. With the same bounds, and other things being equal, stage 2 should run faster with more memory. However, the effect is minor unless you're close to the minimum. The available memory doesn't affect the speed of stage 1 once the B1 bound has been fixed at the start of the run.

The second issue I mentioned (which is probably the more significant one in causing the effects you have mentioned) doesn't depend much at all upon the specific amount of memory any particular thread is using. Rather, it depends upon the nature of the work that thread is doing.

Quote:
Hmm, is there a recommended type of work for each core to minimize problems like this?
How your system will perform under various workloads is dependent upon your hardware, including the type and speed of your processor and memory and whether and how you overclock.

GIMPS worktypes fall into three categories:

Low Bandwidth: TF - all types.
Medium Bandwidth: LL - all types, ECM Stage 1, P-1 Stage 1
High Bandwidth: ECM Stage 2, P-1 Stage 2.
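
As a rough illustration of the grouping above, the categories can be written down as a lookup table and used to count how many cores are doing bandwidth-heavy work. This is only a sketch in Python: the dictionary, the helper and the four-core layout are made up for the example and are not part of Prime95.

```python
# Bandwidth classes as listed in the post; names here are illustrative only.
BANDWIDTH = {
    "TF": "low",
    "LL": "medium",
    "ECM stage 1": "medium",
    "P-1 stage 1": "medium",
    "ECM stage 2": "high",
    "P-1 stage 2": "high",
}

def count_by_class(assignment):
    """Count how many cores are doing low/medium/high-bandwidth work."""
    counts = {"low": 0, "medium": 0, "high": 0}
    for worktype in assignment:
        counts[BANDWIDTH[worktype]] += 1
    return counts

# Hypothetical four-core layout, one work type per core.
layout = ["P-1 stage 2", "ECM stage 2", "LL", "TF"]
counts = count_by_class(layout)
print(counts)
```

A layout with two or more cores in the "high" class at once is the situation where the workers would be expected to compete for memory bandwidth.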

I wouldn't recommend doing TF at all on a CPU any more. GPUs are so much faster at this type of work that doing it with a CPU is a waste of a core. What I would recommend you do is put MaxHighMemWorkers=1 into local.txt. (You need to shut down P95 before you make changes to local.txt or they will be reverted.) Then run the program with P-1 on all four cores. As soon as you have one core running stage 2, note the timings of both the stage 2 core and the stage 1 cores.

Change MaxHighMemWorkers to 2. Wait for a second core to go to stage 2, and again note the timings. Decide if you are willing to take the hit. If yes, then run ECM/P-1 on all four cores, with MaxHighMemWorkers equal to 2. If not, then run ECM/P-1 on two cores and LLs/doublechecks on the other two, with MaxHighMemWorkers equal to 1.

This assumes you have high memory available all the time. If you don't, you are likely to quickly accumulate a backlog of uncompleted stage 2 work. Even if you do, with twice as many cores running P-1 as MaxHighMemWorkers, you will slowly accumulate uncompleted stage 2 work. Clear the backlog by occasionally running an LL test on one of your P-1 cores.

Quote:
would running a P-1 be better on core 2, 3 or 4? From my other system it seems 1/2 and 3/4 are kinda linked, so any thoughts would be appreciated.
I've never noticed it making a difference which core does a particular type of work.
Old 2011-11-05, 00:40   #832
James Heinrich
 
"James Heinrich"
May 2004
ex-Northern Ontario

2·3³·59 Posts

Quote:
Originally Posted by bcp19 View Post
From my other system it seems 1/2 and 3/4 are kinda linked, so any thoughts would be appreciated.
What CPU do you have? It almost sounds like you're describing a hyperthreaded dual-core.
Old 2011-11-05, 01:36   #833
bcp19
 
Oct 2011

7×97 Posts

Quote:
Originally Posted by James Heinrich View Post
What CPU do you have? It almost sounds like you're describing a hyperthreaded dual-core.
Intel Core 2 Quad Q8200 @2.33 GHz 64 bit Vista
Old 2011-11-05, 01:39   #834
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·2,399 Posts

Not hyperthreaded.
http://ark.intel.com/products/36547/...z-1333-MHz-FSB)
Old 2011-11-05, 01:43   #835
James Heinrich
 
"James Heinrich"
May 2004
ex-Northern Ontario

2·3³·59 Posts

Quote:
Originally Posted by bcp19 View Post
Intel Core 2 Quad Q8200
Ah, I thought so. The early Intel Quads (including your Q8200 and my slightly older Q6600) are actually dual-dual-core CPUs rather than true quad-cores:
Quote:
Analogous to the Pentium D branded CPUs, the Kentsfields comprise two separate silicon dies (each equivalent to a single Core 2 duo) on one MCM (multi-chip module)
Quote:
Yorkfield-6M ... are made from two Wolfdale-3M like cores, so they have a total of 6 MB of L2 cache, with 3 MB shared by two cores. They are used in Core 2 Quad Q8xxx with 4 MB cache enabled...
Old 2011-11-05, 02:35   #836
bcp19
 
Oct 2011

2A7₁₆ Posts

Quote:
Originally Posted by Mr. P-1 View Post
I wouldn't recommend doing TF at all on a CPU any more. GPUs are so much faster at this type of work that doing it with a CPU is a waste of a core. What I would recommend you do is put MaxHighMemWorkers=1 into local.txt. (You need to shut down P95 before you make changes to local.txt or they will be reverted.) Then run the program with P-1 on all four cores. As soon as you have one core running stage 2, note the timings of both the stage 2 core and the stage 1 cores.
I understand your arguments, but if I have 4 cores running LL, I can complete 8 LLs in approx. 80 days, whereas with LL/TF/LL/TF I can complete 6 LLs and ~160 TFs in the same 80 days (with the estimated 1% of TFs finding a factor, this saves 1.6 LLs and 1.6 DCs). It just seems to be a more efficient use of the CPUs.
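
The comparison above can be written out as a small calculation. This is a sketch in Python using only the figures quoted in the post; it assumes a DC costs about as much as an LL and that each factor found saves exactly one LL and one DC, and the variable names are made up for the example.

```python
# Figures quoted in the post, over the same ~80 days.
lls_all_ll = 8       # LLs completed with LL on all four cores
lls_mixed = 6        # LLs completed with LL/TF/LL/TF
tf_done = 160        # approximate TF assignments completed in the mixed setup
factor_rate = 0.01   # estimated share of TF assignments that find a factor

# Assumption: each factor found removes one LL and one DC from the queue,
# and a DC is counted as costing the same as an LL.
factors = tf_done * factor_rate
tests_saved = 2 * factors
mixed_total = lls_mixed + tests_saved

print(f"all-LL: {lls_all_ll} tests done")
print(f"LL/TF/LL/TF: {lls_mixed} done + {tests_saved:.1f} saved = {mixed_total:.1f}")
```

On these assumptions the mixed setup comes out ahead, although the calculation does not account for a GPU doing the same TF work much faster, which is the point raised in the quoted post.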

I tried testing an LL/ECM/LL/TF and an LL/P-1/LL/TF setup, but in each of those, core 1 would run at ~.090 per iteration while core 3 was at .063, and the TF was about 2 seconds slower (237 to 239 sec per .14%) during S1. Interestingly, during S2 (on the ECM), core 1 would drop to .084 while core 3 had a few extra blips at .064 and core 4 jumped to about 248 sec. The ECM took 11 hours to run, compared to 9 hours with all 4 cores doing ECM. I did not run the P-1 to completion and moved the assignment to another machine, as it was looking like a 3-4 day run. The only time all 4 cores ran P-1 was when I first started and got 4 LLs that needed P-1, and I have no idea how long they took compared with P-1 running alongside something else.


All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.