mersenneforum.org  

Old 2020-08-30, 14:27   #12
rgirard1
 
Jan 2019

2×7 Posts
"Hardware errors" problem: interesting fact

CPU: AMD 3950X with 32 GB RAM. OS: Ubuntu 18.04.5 LTS. Running Prime95 with the defaults, no overclocking and no Throttle=30. Using 4 workers. I resumed a previous calculation.

The data below suggest that the "hardware problems" are with Worker #3, which is doing a "PRP test of M103884401", yet Prime95 also gives the message "Confidence in final result is excellent." The other 3 workers are doing fine: no error messages.

[Worker #3 Aug 30 10:05] Resuming Gerbicz error-checking PRP test of M103884401 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads

[Worker #3 Aug 30 10:09] Hardware errors have occurred during the test!
[Worker #3 Aug 30 10:09] 1 Gerbicz/double-check error.
[Worker #3 Aug 30 10:09] Confidence in final result is excellent.

I will continue this calculation, ignoring these messages from Worker #3, and see what happens when a new set of exponents is assigned.

Also, are the calculations running into numerical difficulties specifically because of the value of the exponent (103884401)?

I would like to understand what is happening here.
Old 2020-08-30, 14:41   #13
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

2×3×1,471 Posts

Quote:
Originally Posted by rgirard1
Is there a way to start a "fresh" new calculation with new assigned exponents?
Yes. To do that, quit P95 and delete all temp files, but do not modify worktodo.txt; when you restart, you will have a fresh run of THE SAME exponents.
BUT DON'T DO THAT PLEASE!


It would be a pity to start from scratch, you will lose a lot of work!

The files are not "corrupted". That was an error in the past; GEC (the Gerbicz error check) caught it and corrected it. You will still be informed about it until the test finishes and the result is reported. Please bear with it, and do not throw away a lot of work by deleting the temporary checkpoints and starting from scratch. The confidence in the result is still high; the result is most probably correct.
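
For readers wondering how an error can be "caught and corrected" mid-test, here is a minimal sketch of the Gerbicz check idea in Python. It is not Prime95's actual code: the function name, the block size, the injected bit flip and the simple rollback policy are all assumptions made for illustration. The PRP test repeatedly squares a value modulo N = 2^p - 1; every `block` squarings the current value is multiplied into a running checksum d, and the identity d_new = base * d_old^(2^block) mod N must hold. If it does not, the run goes back to the last verified checkpoint.

```python
def prp_with_gerbicz(p, block=10, base=3, inject_error_at=None):
    """Toy PRP-style run: compute base^(2^p) mod (2^p - 1) with a
    Gerbicz-style checksum verified every `block` squarings.
    Returns (final_residue, errors_detected).
    Illustrative only; all names and parameters are hypothetical."""
    n = (1 << p) - 1
    x = base % n            # u_0
    d = x                   # checksum d_0 = u_0
    saved = (x, d, 0)       # last verified checkpoint: (x, d, iteration)
    errors = 0
    i = 0
    while i < p:
        x = x * x % n       # u_{i+1} = u_i^2 mod n
        i += 1
        if i == inject_error_at:
            x ^= 1                  # simulate a one-time hardware bit flip
            inject_error_at = None  # the fault does not recur after rollback
        if i % block == 0:
            d_new = d * x % n
            # Independent recomputation: base * d^(2^block) mod n
            expected = pow(d, 1 << block, n) * base % n
            if d_new == expected:
                d = d_new
                saved = (x, d, i)   # this block is verified
            else:
                errors += 1
                x, d, i = saved     # roll back and redo the bad block
    # Note: squarings after the last full block are unchecked in this sketch.
    return x, errors


if __name__ == "__main__":
    print(prp_with_gerbicz(127, block=10, inject_error_at=37))
```

With p = 127 (M127 is prime, so the residue after 127 squarings of 3 is 9 by Fermat's little theorem), the corrupted run still ends on the correct residue and reports one detected error, which is the same situation as the "1 Gerbicz/double-check error" message: the error is remembered and reported, but the bad block was redone.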

Moreover, whether you decide to continue and finish the test or to restart, it is not recommended to take a new exponent. First, that's a mess: you need to unreserve the old one, get a new assignment, etc. Second, from a probabilistic point of view, if there is an error in the software, or you have an issue with your system, the error is more likely to appear again if you repeat the same assignment that generated it. Walk the same path.

But my advice, same as before, is to continue the assignment and stick your fingers in your ears so you don't see the error till the test is finished. Unless more and more errors appear (not the same message for the former error, but new errors!), your system is OK. Bear with it for a while! (I know, my OCD tickles me too in such situations...)

Last fiddled with by LaurV on 2020-08-30 at 14:44
Old 2020-08-30, 20:17   #14
rgirard1
 
Jan 2019

2·7 Posts
As advised, will continue the current calculation.

Many thanks for your advice. I will continue the present calculation. I will ignore the error message from Worker #3 and see what happens with the next set of exponents.

I do not believe that my PC has a hardware problem because: (i) the PC is relatively new (January 2020); (ii) it is not under heavy usage, that is, it is only running Prime95, albeit 24/7/365; (iii) the error message is quite recent, and a hardware problem would have manifested itself much earlier.
Old 2020-09-03, 14:53   #15
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4,567 Posts

I've seen gpuowl GEC errors occur on more than one GPU model placed in the same PCIe slot of the same system. Since the system is a dual Xeon with ECC RAM, I doubt it's the system memory either. Something about that slot.
As long as it's not so frequent that it interferes with throughput, it's ok on PRP. Just don't run LLDC or P-1 there if PRP/GEC shows errors more than ~weekly.

Last fiddled with by kriesel on 2020-09-03 at 14:54
Old 2020-09-10, 20:08   #16
rgirard1
 
Jan 2019

2×7 Posts
Previous "Hardware error" has disappeared

In a previous post I reported that Prime95 was reporting hardware errors from the calculation done by Worker #3, as follows:

[Worker #3 Sep 5 11:36] Iteration: 83520000 / 103884401 [80.39%], ms/iter: 18.571, ETA: 4d 09:03
[Worker #3 Sep 5 11:36] Hardware errors have occurred during the test!
[Worker #3 Sep 5 11:36] 1 Gerbicz/double-check error.
[Worker #3 Sep 5 11:36] Confidence in final result is excellent.

That calculation by Worker #3 ended last night, and a new one was started on Worker #3 without displaying any error, as can be seen:

[Worker #3 Sep 10 15:53] Iteration: 3660000 / 107917471 [3.39%], ms/iter: 18.802, ETA: 22d 16:31
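
As a side note, the progress lines are internally consistent: the ETA is roughly (remaining iterations) × (ms/iter). The short Python sketch below parses a progress line and redoes that arithmetic. The regular expression and the function name are assumptions for illustration, and rounding to the nearest minute is a guess that happens to match the two lines quoted in this post; Prime95 may compute its ETA differently.

```python
import re

# Hypothetical pattern matching the progress lines quoted in this thread.
PROGRESS = re.compile(
    r"Iteration: (\d+) / (\d+) \[[\d.]+%\], ms/iter: ([\d.]+), ETA: (\S.*)$"
)


def estimate_eta(line):
    """Return (recomputed_eta, eta_reported_in_line) for a progress line."""
    m = PROGRESS.search(line)
    if m is None:
        raise ValueError("not a progress line")
    done, total = int(m.group(1)), int(m.group(2))
    ms_per_iter = float(m.group(3))
    remaining_ms = (total - done) * ms_per_iter
    minutes = round(remaining_ms / 60000)      # assumption: nearest minute
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours:02d}:{mins:02d}", m.group(4)


if __name__ == "__main__":
    line = ("[Worker #3 Sep 10 15:53] Iteration: 3660000 / 107917471 "
            "[3.39%], ms/iter: 18.802, ETA: 22d 16:31")
    print(estimate_eta(line))
```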

So, was the "hardware error" in the previous calculation caused by the specific value of the exponent being analysed, or is it something else? I do not know, but I am wondering why the "hardware error" was with Worker #3's calculations only and not with the other workers (there are three more). Anyone who can shed light on this is most welcome to comment.

Last fiddled with by rgirard1 on 2020-09-10 at 20:10
Old 2020-09-10, 22:16   #17
VBCurtis
 
 
"Curtis"
Feb 2005
Riverside, CA

2^2·1,091 Posts

Only one core had a hardware error. Why would you expect all cores to have an error, just because one did?

No, it had nothing to do with the specific exponent being tested.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.