mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
2004-04-16, 13:31   #1
S00113

Relative speeds of hardware for different types of work

What are the relative speeds of different types of CPU/cache/memory combinations on different types of work?

o What is the maximum amount of memory that will be used in stage 2 of P-1 on different CPUs and exponent ranges?
o At what cache sizes do different FFT sizes "break down" and hit the memory-access bottleneck?
o How many cycles does each kind of CPU use on each iteration of
- factoring (to how many bits?)
- LL testing (different FFTs)
- P-1 testing (both stages)

My .ini files are generated automatically based on information about CPU and memory, and I would like each machine to be as productive as possible. Time is not a major problem, as I back up all changed files every second hour and complete the work on another machine if the first one breaks down. Memory use is not a problem either, since I shut down mprime when a user logs on. (I don't know how to do this on Windows machines, so I use them for factoring only.)

Are there plans to implement smarter work distribution in the next version of the server? I.e. if P4s are very effective for factoring up to 64 bits, but less effective to 65 bits, then P4s used for factoring should return their exponent to the server at 64 bits and get a new one, while the first is sent to a PIII or earlier for factoring to 65 bits or above.
2004-04-16, 21:39   #2
dsouza123

For trial factoring:

P4s are very slow at 64 bits and below but very quick at 65 and higher (SSE2 is used instead of the FPU).

For 64 bits and below, at the same clock speed (1 GHz), an Athlon is considerably faster, a PIII somewhat faster, and the P4 extremely slow.

A P4 at a sufficiently high clock speed will still be OK, because a lot of the time is spent at 65 bits and above using the SSE2 instructions.
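For context, the trial-factoring work being timed here is essentially one modular exponentiation per candidate factor: every factor q of 2^p - 1 must have the form q = 2kp + 1, and q divides 2^p - 1 exactly when 2^p = 1 (mod q). A minimal Python sketch of that loop (illustrative only; the function name is made up, and Prime95's real code does this in hand-tuned integer or SSE2 assembly):

```python
def trial_factor(p, max_k):
    """Search for a factor of the Mersenne number 2^p - 1.

    Candidate factors must have the form q = 2*k*p + 1;
    q divides 2^p - 1 exactly when 2^p = 1 (mod q).
    """
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        if pow(2, p, q) == 1:   # one modular exponentiation per candidate
            return q
    return None

# 2^11 - 1 = 2047 = 23 * 89, and 23 = 2*1*11 + 1
print(trial_factor(11, 10))  # → 23
```

The per-candidate cost of that `pow` is where the P4's integer-vs-SSE2 split shows up.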
2004-04-18, 13:40   #3
Mr. P-1

LL testing, double-checking, and P-1 are all basically the same kind of work: they use floating-point FFTs to do huge multiplications. A processor which is good at one of them will be equally good at the others. Trial factoring uses only integer arithmetic, except on P4s and recent Celerons, which use FP arithmetic (in a completely different way from the FFT) above 64 bits.
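As a sketch of what P-1 actually does (generic Pollard P-1 in plain Python, not Prime95's FFT-accelerated version; the helper name and the tiny numbers are illustrative): if some factor q of N has q - 1 made up only of prime powers up to a bound B1, then raising a base to a multiple of all those prime powers gives a^E = 1 (mod q), so gcd(a^E - 1, N) exposes q.

```python
from math import gcd

def pminus1_stage1(N, B1, a=3):
    """Pollard P-1, stage 1: finds a factor q of N when q - 1 is B1-smooth."""
    for p in range(2, B1 + 1):
        # raise to each prime power <= B1 (looping over composites too
        # is redundant but harmless for this illustration)
        pe = p
        while pe * p <= B1:
            pe *= p
        a = pow(a, pe, N)
    g = gcd(a - 1, N)
    return g if 1 < g < N else None

# 299 = 13 * 23, and 13 - 1 = 12 = 2^2 * 3 is smooth to bound 4
print(pminus1_stage1(299, 4))  # → 13
```

The repeated `pow` calls are exactly the huge multiplications that Prime95 performs with its FFT code, which is why P-1 throughput tracks LL throughput.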

P4s and recent Celerons are best at FFT work. They're OK for trial factoring above 64 bits, lousy below 64 bits.

Athlons and Durons and P3s are OK at everything.

Older processors are best at trial factoring.

A reasonable amount of cache is important for FFT work. Not sure how important it is for TF.

Large amounts of memory are used only during stage 2 of P-1. How much is actually used will depend upon the exponent and the amount you make available. Quite a complex algorithm is used to decide how much to use. Basically, if the optimal amount is larger than the available amount (very likely, because the optimal amount is in the GBs for exponents currently being tested), then it repeatedly decrements the amount until it fits. The amount by which it decrements is very large -- tens or even hundreds of MBs. Also, the client is programmed not to go right up to the limit you set.

The upshot of all this is that it will use quite a bit less than you allow.
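That decrement-until-it-fits behaviour can be caricatured in a few lines (the decrement and headroom figures below are invented for illustration; Prime95's actual heuristics are far more elaborate):

```python
def choose_stage2_memory(optimal_mb, allowed_mb,
                         decrement_mb=50, headroom_mb=20):
    """Shrink the optimal stage-2 allocation until it fits under the
    user's limit, staying a little below it.  Illustrative only."""
    usable = allowed_mb - headroom_mb      # never go right up to the limit
    mem = optimal_mb
    while mem > usable:
        mem -= decrement_mb                # large steps, tens of MB at a time
    return max(mem, 0)

print(choose_stage2_memory(2000, 512))  # → 450, noticeably under the 512 cap
```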
2004-04-19, 05:10   #4
cheesehead ("Richard B. Woods")

Quote:
Originally Posted by S00113
o At what cache sizes do different FFT sizes "break down" and hit the memory access bottleneck?
It's not really a question of cache sizes at which FFT sizes hit memory access bottlenecks.

Prime95 takes the L1 cache size into account and structures its memory accesses to maximize throughput accordingly. That is, if L1 cache size is 8KB then Prime95 will completely process an 8KB slice of the FFT before it processes the next 8KB slice of the FFT. Furthermore, it "touches" blocks of memory ahead of when it will perform arithmetic on their contents, so that those blocks will be prefetched into cache while Prime95 is processing something else. So in this regard Prime95 already optimizes memory accesses, and FFT size doesn't matter much to that.

(And everything above about Prime95 also applies to mprime.)
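The slice-at-a-time pattern described above is ordinary cache blocking; here is a toy Python sketch of the idea (nothing Prime95-specific -- the slice size and the two dummy passes are purely illustrative):

```python
def blocked_transform(data, l1_bytes=8 * 1024, item_bytes=8):
    """Process an array in L1-cache-sized slices: finish all work on one
    slice before touching the next, so each slice is loaded only once."""
    slice_len = l1_bytes // item_bytes   # e.g. 1024 doubles per 8 KB slice
    out = list(data)
    for start in range(0, len(out), slice_len):
        block = out[start:start + slice_len]
        # several passes over the same (cache-resident) block
        block = [x * 2.0 for x in block]
        block = [x + 1.0 for x in block]
        out[start:start + slice_len] = block
    return out
```

Doing both passes on a slice while it is still hot in L1, instead of two full sweeps over the whole array, is what keeps the FFT from being memory-bound at any reasonable FFT size.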
Quote:
I would like each machine to be as productive as possible.
Okay, but keep in mind that in some cases the differences in productivity (between different combinations of CPU and GIMPS work assignment types) are relatively minor, and forum participants sometimes make more of these differences than they deserve.

Last fiddled with by cheesehead on 2004-04-19 at 05:19
2004-04-19, 05:33   #5
cheesehead

Quote:
Originally Posted by S00113
I.e. if P4s are very effective for factoring up to 64 bits, but less effective to 65 bits, then P4s used for factoring should return their exponent to the server at 64 bits and get a new one, while the first is sent to a PIII or earlier for factoring to 65 bits or above.
As pointed out earlier, P4 trial-factoring efficiency is the other way around. But in either case, one needs to take into account the inefficiency created by splitting a factoring assignment into two parts -- twice the assignment management and communication overhead, more opportunity for mistakes, and so on. Finally, note that for numbers TFed to more than 2^64, the time spent on TF to 2^64 will be only a fraction of the time spent on TF above 2^64, so whatever inefficiency there is for TF below 2^64, it's only a small part of the total effort.
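The "only a fraction" point follows from simple arithmetic: each extra bit level roughly doubles the number of candidate factors to try, so under a 2^b-per-level cost model (an approximation that ignores the per-CPU effects discussed above) the work above 2^64 dwarfs everything before it:

```python
# Rough model: TF work at bit level b is proportional to 2**b.
cost_to_64 = sum(2**b for b in range(1, 65))      # all levels up to 2^64
cost_64_to_68 = sum(2**b for b in range(65, 69))  # levels 65 through 68
print(round(cost_64_to_68 / cost_to_64))  # → 15
```

So the sub-2^64 portion is roughly 1/16 of the total effort on an exponent TFed to 2^68.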
2004-04-19, 20:32   #6
S00113

Quote:
Originally Posted by cheesehead
As pointed out earlier, P4 trial-factoring efficiency is the other way around. But in either case, one needs to take into account the inefficiency created by splitting a factoring assignment into two parts -- twice the assignment management and communication overhead, more opportunity for mistakes, and so on. Finally, note that for numbers TFed to more than 2^64, the time spent on TF to 2^64 will be only a fraction of the time spent on TF above 2^64, so whatever inefficiency there is for TF below 2^64, it's only a small part of the total effort.
I think it still would be an improvement. Windows machines are reinstalled very often. When one of my factoring Windows machines reports progress, it would be nice if the amount of factoring done on that exponent were recorded. As of today, even if the machine had factored an exponent from 2^60 to 2^67, the report still says it is at 2^60 until factoring to 2^68 is complete. If this is an old Pentium, factoring to 2^68 will most likely never be completed before the machine is retired, because it is so inefficient at this task. A Pentium 4 would complete it in less than a day.

Progressive reporting of factoring is practically done already, and less work would have to be repeated. When progressive reporting is implemented, it would be a simple task to put the exponent back in line after factoring to a certain limit.

Similar could be done with stages 1 and 2 of P-1 factoring. Many exponents will never get a stage 2 run. I have a few fast machines with GBs of memory dedicated to P-1 factoring, and it would be best if I could get P-1 stage 2 work directly from the server. For now I allocate lots of work with SequentialWorkToDo=0 and move the Test= lines to machines with less memory when P-1 is done.

The simple fix is to divide P-1 factoring and trial factoring into different work types -- stage 1 vs. stage 2, and below 64 bits vs. above 64 bits -- and let mprime choose the work somewhat intelligently.
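For concreteness, the manual shuffling described above looks roughly like this (the INI option name comes from the post itself; the Test= line format -- exponent, bits trial-factored, P-1-done flag -- is from memory of the v4-era clients and should be treated as illustrative, with a made-up exponent):

```ini
; prime.ini on the big-memory machine: take queued work in any order
SequentialWorkToDo=0
```

A worktodo.ini entry such as `Test=33219281,67,0` would then stay on the big-memory machine until its P-1 flag flips to 1, at which point the line can be moved to a box with less RAM for the LL test.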
2004-04-19, 23:37   #7
Complex33

If you are concerned about partial factoring, you can manipulate PrimeNet to perform factoring up to whatever bit depth you desire and then dump the results back to the server. After a while, a script comes along behind the returned work that has become "stuck" and puts the exponents back on the market at the new bit depth.
2004-04-20, 19:58   #8
cheesehead

Quote:
Originally Posted by S00113
As of today, even if the machine had factored an exponent from 2^60 to 2^67, the report still says it is at 2^60 until factoring to 2^68 is complete. If this is an old Pentium, factoring to 2^68 will most likely never be completed until the machine is retired because it is so inefficient at this task. A Pentium 4 would complete it in less than a day.

Progressive reporting of factoring is practically done already, and less work would have to be repeated. When progressive reporting is implemented, it would be a simple task to put the exponent back in line after factoring to a certain limit.
Oh, so what you want is a systematic way of reporting the partial completion of a TF assignment in case the machine can't complete the assignment, along with automatic reassignment of the remaining portion. Well, that's certainly worthwhile.

What I was referring to was the deliberate splitting-up of assignments even though the machine was fully capable of completing the entire un-split factoring range. Your proposal refers, in effect, to an unplanned splitting of an assignment (though once it's implemented, it could also be used with deliberate intent).
Quote:
Similar could be done with stages 1 and 2 of P-1 factoring. Many expoments will never get a stage 2 run. I have a few fast machines with GBs of memory dedicated to P-1 factoring, and it would be best if I could get P-1 stage 2 work directly from the server.
Yeah, but stage 2 has to start with the save file (e.g. 4 MB for a 32M exponent) from the preceding stage 1. So you've got a noticeable (for dialup users, anyway) extra communication overhead, which the stage 1 assignee has to pay too when reporting his/her results.