#1 |
Feb 2005
210 Posts |
Hi there...
I have a simple question, and the thread title about says it all: are hardware failures (miscalculations) detected only when running the torture test, or also when doing "real" work such as factoring or LL testing? I hope someone can enlighten me on this. Thanks in advance! Last fiddled with by Jasmin on 2005-02-08 at 16:22 |
#2 |
"Patrik Johansson"
Aug 2002
Uppsala, Sweden
52×17 Posts |
Hardware failures are also detected when you run an LL-test, but not all errors are detected.
When an error is detected, the program restarts from the last save file, so a detected error should do no harm. The problem is that if errors are detected, there is a fair chance that there is at least one undetected error as well (and one is enough to spoil the whole computation). One way of making a better test is to run a few double-checks, wait until the status files are updated, and then see if they match the first tests that were done earlier. |
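To illustrate why a single undetected error spoils the whole computation, and why a double-check catches it, here is a minimal Python sketch of the Lucas-Lehmer test on a tiny exponent. It is not Prime95's code: the `fault_at` parameter and the `double_check` helper are hypothetical names invented for this example, and the "hardware error" is simulated by flipping one bit.

```python
def ll_residue(p, fault_at=None):
    """Run the Lucas-Lehmer test on M_p = 2**p - 1 and return the final residue.

    A residue of 0 means M_p is prime. `fault_at` is a hypothetical knob for
    this sketch only: it flips the lowest bit of the running value after that
    (0-based) iteration, imitating a single undetected hardware error.
    """
    m = (1 << p) - 1
    s = 4
    for i in range(p - 2):
        s = (s * s - 2) % m
        if i == fault_at:
            s ^= 1  # simulated one-bit miscalculation
    return s


def double_check(p, first_residue):
    """Re-run the test cleanly and report whether it matches the first result."""
    return ll_residue(p) == first_residue


clean = ll_residue(11)               # M_11 = 2047 is composite, residue is nonzero
faulty = ll_residue(11, fault_at=4)  # same test with one flipped bit
print(hex(clean), hex(faulty))
print(double_check(11, clean), double_check(11, faulty))
```

Because every iteration squares the previous value, the flipped bit propagates into a completely different final residue, so the faulty first test no longer matches an independent second run.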
#3 |
"6800 descendent"
Feb 2005
Colorado
11×67 Posts |
Sounds most likely heat related to me, especially if you have a fast P4 or AMD processor. Does your system have an option to monitor the CPU temperature?
My 3.4 GHz P4 Prescott runs under 30 degrees C at idle, but when running a torture test or LL test it averages 50 degrees C. And that is after a lot of work on making the CPU's heat sink very efficient. My first try at installing the heat sink resulted in much higher temperatures, and without monitoring those temperatures I would never have known there was a problem. -Phil |
#4 |
Aug 2002
2·33 Posts |
The torture test runs snippets of FFTs with known results, providing the best indication of hardware failure.
With LL, P-1, and trial factoring, the end result is unknown, so prime95 is hamstrung in detecting hardware failures outside of torture testing. That being said, with LL and P-1 factoring, even though the exact correct result is unknown, when you are expecting a result less than one (or some other bound) and it isn't, you can report with some certainty that a hardware error has occurred. This check would be expensive to perform on every iteration, but is worthwhile every so often, and that is what prime95 does. Notice that since a check doesn't happen every iteration, an error could have occurred even though a "no error found" status is returned. Because trial factoring is done totally in CPU/cache (very little RAM/bus access needed), machines that get stressed and fail doing LL/P-1 may still work satisfactorily there. Hope this helped. |
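The "expecting a result below some bound" idea can be sketched with a floating-point FFT multiplication, where every output should land very close to an integer. The Python below is only an illustration of that kind of sanity check, not Prime95's implementation (which is far more elaborate). The function names, the `limit` of 0.4, and the `glitch` parameter that simulates a fault are all assumptions made for this example.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def square_with_roundoff_check(digits, limit=0.4, glitch=0.0):
    """Square a number given as digits (least significant first) via FFT.

    Every output should be within rounding noise of an integer. If the worst
    distance from an integer exceeds `limit`, something went wrong.
    `glitch` is a simulated fault: it shifts every output by that amount.
    """
    n = 1
    while n < 2 * len(digits):
        n *= 2
    padded = [complex(d) for d in digits] + [0j] * (n - len(digits))
    product = [x * x for x in fft(padded)]
    product[0] += glitch * n  # corrupt the DC bin to imitate a bad result
    values = [x.real / n for x in fft(product, invert=True)]
    max_err = max(abs(v - round(v)) for v in values)
    if max_err > limit:
        raise ArithmeticError(f"round-off {max_err:.3f} exceeds {limit}")
    return [round(v) for v in values], max_err


coeffs, err = square_with_roundoff_check([5, 4, 3, 2, 1])  # the number 12345
print(sum(c * 10**i for i, c in enumerate(coeffs)), err)
```

The check needs no knowledge of the correct answer, which is why it can run during real work. Running it only every so often, as described above, would trade detection coverage for speed.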
#5 |
Feb 2005
28 Posts |
So if I understood you correctly, the torture test finds more errors because it can compare against known results, while LL testing, factoring etc. also find some errors, but not as many as the torture test does. Thanks for clearing that up...
That would mean it's possible to send wrong results back to PrimeNet if the client runs on a shaky system. I guess that's what the double-checks are good for? ;) @PhilF: Thanks, but I don't have a failure problem that I need help with. I know the procedures and difficulties of overclocking; I am just trying to find the limits of my system. Having reached a possibly stable point, at which I would still run the torture test for several days, I thought that since it is quite likely already stable, I might as well let it do some real prime work. Given these answers, though, I'd better stick to the torture test until it's really stable. I don't want to submit wrong results... Last fiddled with by Jasmin on 2005-02-09 at 08:40 |
#6 |
May 2003
3·13 Posts |
I've found out that running two torture tests simultaneously is a very good stress tester for HT-capable processors. Just set the affinity to 0 and 1 so that the load is maximal. I usually run the max FPU stress test + maximum heat&power consumption test.
#7 |
"6800 descendent"
Feb 2005
Colorado
11×67 Posts |
Ok Jasmin, I understand now. I misread your original post and thought you meant your computer was fine until you ran a torture test OR LL test.
Boulder, I don't think you are stressing your HT processor any more by running 2 tests simultaneously. I know that with only one test running you show approximately 50% utilization, but that is normal with XP running on a Hyper Threaded CPU. You really are using 100% of it. It is just that XP thinks you are using 100% of one CPU and very little of the other CPU, so it reports 50%. On the other hand, I just read your post again and I see now that you are running 2 different types of tests simultaneously. That might stress the system further, but probably not by much. -Phil Last fiddled with by PhilF on 2005-02-10 at 00:38 |
#8 | |
Aug 2002
2·33 Posts |
Quote:
My gut would have said that if one prime95 torture test is coded to stress the system, running two would not be as stressful, due to the possible cooling effect of swapping between processes. But I have heard of two simultaneous torture tests showing a failure when one would not. I'm running v23.7.1. Does a later version have separate 'max FPU' and 'max heat/power' options? I believe in stress testing for 24 hours. After that I'd run a couple of double checks (might as well benefit the project if your system is running well). I'd then throw in another double check every month or two. Anyone want to explain how fast the results of a double check get cleared, and how you can tell if it passed? We have super competent and gracious folks who do it for us over on the [www dot]teamprimerib[dot com] team. |
#9 | ||
"Patrik Johansson"
Aug 2002
Uppsala, Sweden
52·17 Posts |
Quote:
Quote:
After a first time test an exponent goes to hrf3. Here you also find non-matching double-checks (along with the first test). Either your double-check or the first time test can be faulty. When two matching tests are found, both are put/moved into lucas_v. Any non-matching test (which then is bad for certain) goes into bad. |
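The filing rules described above can be written down as a small Python sketch. This is only a reading of the post, not the server's actual logic: the `classify` function and the sample residues are made up for illustration, and only the file names `hrf3`, `lucas_v` and `bad` come from the thread.

```python
from collections import Counter, defaultdict


def classify(results):
    """Sort (exponent, residue) pairs into the three files described above."""
    by_exponent = defaultdict(list)
    for exponent, residue in results:
        by_exponent[exponent].append(residue)

    files = {"hrf3": [], "lucas_v": [], "bad": []}
    for exponent, residues in by_exponent.items():
        # The most frequent residue is the candidate for a verified result.
        residue, count = Counter(residues).most_common(1)[0]
        for r in residues:
            if count < 2:
                files["hrf3"].append((exponent, r))     # nothing matches yet
            elif r == residue:
                files["lucas_v"].append((exponent, r))  # matching tests
            else:
                files["bad"].append((exponent, r))      # proven wrong
    return files


# Hypothetical residues, shortened for readability.
sample = [(100, "AA"), (100, "BB"), (100, "AA"), (200, "CC"), (300, "DD"), (300, "EE")]
for name, entries in classify(sample).items():
    print(name, entries)
```

So to tell whether your double-check passed, look for your exponent in `lucas_v`. If it shows up in `hrf3` next to a different residue, the question of which test was faulty is still open until a third test settles it.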
#10 |
Aug 2002
2·33 Posts |
Thanks Patrik. I looked at my torture test menu and saw what you and Boulder were talking about - guess I haven't added a new machine in a while.
So it would take 1-2 weeks to get back results using my suggestion to run double checks instead of a couple day long torture test. Well that's what I do every month or two to make sure my machines are in top working order. I had to ask about those files because some good hearted folks on my team (Team Prime Rib) keep track of "latest results" for the rest of us. |
#11 | |
Mar 2003
New Zealand
13·89 Posts |
Quote:
If I was just doing LL testing then I would turn hyperthreading off and run at 3.0GHz, but for most other projects hyperthreading is worth an extra 20%-25% throughput. |