mersenneforum.org  

2020-08-30, 14:27   #12
rgirard1
 
Jan 2019

208 Posts
"Hardware errors" problem: interesting fact

CPU: AMD 3950X with 32 GB RAM. OS: Ubuntu 18.04.5 LTS. Running Prime95 with the defaults, no overclocking and no Throttle=30. Using 4 workers. I resumed a previous calculation.

The data below suggest the "hardware problems" are with Worker #3, which is doing a PRP test of M103884401, yet Prime95 also reports "Confidence in final result is excellent." The other 3 workers are doing fine: no error messages.

[Worker #3 Aug 30 10:05] Resuming Gerbicz error-checking PRP test of M103884401 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads

[Worker #3 Aug 30 10:09] Hardware errors have occurred during the test!
[Worker #3 Aug 30 10:09] 1 Gerbicz/double-check error.
[Worker #3 Aug 30 10:09] Confidence in final result is excellent.

I will continue this calculation, ignoring these messages from Worker #3, and see what happens when a new set of exponents is assigned.

Also, are the calculations running into numerical difficulties specifically because of the value of the exponent, 103884401?

I would like to understand what is happening here.
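
For anyone who wants to track these messages across a long log, here is a minimal Python sketch that tallies the reported Gerbicz error count per worker. It assumes the line format shown in the excerpt above; the function name `gerbicz_errors_by_worker` is hypothetical, and the plural form "errors" is a guess, since only the singular appears here.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   [Worker #3 Aug 30 10:09] 1 Gerbicz/double-check error.
# The optional plural "errors" is an assumption; the excerpt only shows the singular.
GEC_LINE = re.compile(r"\[Worker #(\d+) [^\]]*\] (\d+) Gerbicz/double-check errors?\.")

def gerbicz_errors_by_worker(lines):
    """Return {worker number: highest Gerbicz error count seen in the log}."""
    counts = defaultdict(int)
    for line in lines:
        match = GEC_LINE.search(line)
        if match:
            worker, errors = int(match.group(1)), int(match.group(2))
            # The message repeats with a running total, so keep the maximum.
            counts[worker] = max(counts[worker], errors)
    return dict(counts)

log = [
    "[Worker #3 Aug 30 10:05] Resuming Gerbicz error-checking PRP test of M103884401 "
    "using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads",
    "[Worker #3 Aug 30 10:09] Hardware errors have occurred during the test!",
    "[Worker #3 Aug 30 10:09] 1 Gerbicz/double-check error.",
    "[Worker #3 Aug 30 10:09] Confidence in final result is excellent.",
]
print(gerbicz_errors_by_worker(log))  # {3: 1}
```

A count that stays at 1 is the same old error being repeated; a count that keeps climbing would point to new errors.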

2020-08-30, 14:41   #13
LaurV
Romulan Interpreter
 
"name field"
Jun 2011
Thailand

9961₁₀ Posts

Quote:
Originally Posted by rgirard1 View Post
Is there a way to start a "fresh" new calculation with new assigned exponents?
Yes, to do that you will need to quit P95 and delete all temp files, do not modify worktodo.txt, and when you restart, you will have a fresh run of THE SAME exponents.
BUT DON'T DO THAT PLEASE!


It would be a pity to start from scratch, you will lose a lot of work!

The files are not "corrupted", that was an error in the past, GEC got it, and corrected it. You will still be informed about it, until the test finishes and the result is reported. Please bear with it, and do not throw away a lot of work, by deleting the temporary checkpoints and starting from scratch. The confidence in the result is still high, the result is most probably correct.
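
To make the "GEC got it, and corrected it" part concrete, here is a toy Python sketch of a Fermat PRP test with a Gerbicz-style check. It is not Prime95's code; the names `prp_with_gerbicz`, `block` and `fault_at` are made up for illustration, and the injected bit flip stands in for a hardware error. The idea: while squaring, keep a running product `d` of the residues at block boundaries. After each block of `block` squarings the identity d_new = base * d_old^(2^block) (mod N) must hold; if it does not, roll back to the last verified state and redo the block.

```python
def prp_with_gerbicz(p, block=8, fault_at=None):
    """Toy base-3 PRP test of N = 2**p - 1 with a Gerbicz-style check.

    Returns (is_probable_prime, errors_detected).
    fault_at: iteration at which one bit flip is injected (hypothetical knob).
    """
    n = (1 << p) - 1
    base = 3
    u = base              # u == base^(2^i) mod n after i squarings
    d = base              # running product of u at block boundaries
    i = 0
    saved = (i, u, d)     # last state that passed the check
    errors = 0
    while i < p:
        if p - i < block:
            # Tail shorter than a block: left unchecked in this toy version.
            for _ in range(p - i):
                u = u * u % n
            i = p
            break
        prev_d = d
        for _ in range(block):
            u = u * u % n
            i += 1
            if fault_at is not None and i == fault_at:
                u ^= 1            # simulate a one-off hardware error
                fault_at = None
        d = d * u % n
        # Gerbicz identity: d_new == base * prev_d^(2^block) (mod n)
        if d != base * pow(prev_d, 1 << block, n) % n:
            errors += 1
            i, u, d = saved       # roll back and redo the block
        else:
            saved = (i, u, d)
    # For prime N, 3^(N+1) == 9 (mod N), and N + 1 == 2^p.
    return u == 9 % n, errors

print(prp_with_gerbicz(127))               # (True, 0)
print(prp_with_gerbicz(127, fault_at=50))  # (True, 1)
```

The second call shows the situation in this thread: one error was caught, the block was redone from a good checkpoint, and the final result is still correct. As far as I understand it, real implementations run the comparison much less often than every block, because the check itself costs about as much as a block of squarings.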

Moreover, whether you decide to continue and finish the test or to restart, it is not recommended to take a new exponent. First, that's a mess: you need to unreserve the old one, get a new assignment, etc. Second, probabilistically, if there is an error in the software, or you have an issue with your system, the error is more likely to appear again if you repeat the same assignment that generated it. Walk the same path.

But my advice, same as before, is to continue the assignment and stick your fingers in your ears so you don't see the error till the test is finished. Unless more and more errors appear (not the same message for the former error, but new errors!), your system is OK. Bear with it for a while! (I know, my OCD tickles me too in such situations...)

Last fiddled with by LaurV on 2020-08-30 at 14:44

2020-08-30, 20:17   #14
rgirard1
 
Jan 2019

10₁₆ Posts
As advised will continue the current calculation.

Many thanks for your advice. I will continue the present calculation. I will ignore the error message from Worker #3 and see what happens with the next set of exponents.

I do not believe that my PC has a hardware problem because: (i) the PC is relatively new (January 2020); (ii) it is not under heavy usage, that is, it is only running Prime95, though 24/7/365; (iii) the error message is quite recent, and a hardware problem would have manifested itself much earlier.

2020-09-03, 14:53   #15
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3²×719 Posts

I've seen gpuowl GEC errors occur on more than one GPU model placed in the same PCIe slot of the same system. Since it is a dual Xeon with ECC RAM, I doubt it's the system memory either. Something about that slot.
As long as it's not so frequent that it interferes with throughput, it's ok on PRP. Just don't run LLDC or P-1 there if PRP/GEC shows errors more than ~weekly.

Last fiddled with by kriesel on 2020-09-03 at 14:54

2020-09-10, 20:08   #16
rgirard1
 
Jan 2019

2⁴ Posts
Previous "hardware error" has disappeared

In a previous post I reported that Prime95 was reporting hardware errors from the calculation done by Worker #3, as follows:

[Worker #3 Sep 5 11:36] Iteration: 83520000 / 103884401 [80.39%], ms/iter: 18.571, ETA: 4d 09:03
[Worker #3 Sep 5 11:36] Hardware errors have occurred during the test!
[Worker #3 Sep 5 11:36] 1 Gerbicz/double-check error.
[Worker #3 Sep 5 11:36] Confidence in final result is excellent.

That calculation by Worker #3 ended last night, and a new one started on Worker #3 without displaying any error, as can be seen:

[Worker #3 Sep 10 15:53] Iteration: 3660000 / 107917471 [3.39%], ms/iter: 18.802, ETA: 22d 16:31

So, was the "hardware error" in the previous calculation caused by the specific value of the exponent being analysed, or is it something else? I do not know, but I am wondering why the "hardware error" occurred only in Worker #3's calculations and not in those of the other workers (there are three more). Anyone who can shed light on this is most welcome to comment.
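
As a side note on reading these progress lines, the ETA is essentially the remaining iterations times the ms/iter figure. A small Python sketch under that assumption (the function name `eta` is hypothetical, and Prime95 may round or average differently):

```python
def eta(iteration, exponent, ms_per_iter):
    """Format the remaining time as 'Nd HH:MM' from a Prime95 progress line."""
    remaining_s = (exponent - iteration) * ms_per_iter / 1000.0
    days, rest = divmod(int(remaining_s), 86400)
    hours, rest = divmod(rest, 3600)
    return f"{days}d {hours:02d}:{rest // 60:02d}"

# Sep 5 line: Iteration: 83520000 / 103884401, ms/iter: 18.571
print(eta(83520000, 103884401, 18.571))  # 4d 09:03
```

That reproduces the "ETA: 4d 09:03" in the Sep 5 line. Because the displayed ms/iter is rounded, the estimate for other lines can differ from the log by a minute or so.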

Last fiddled with by rgirard1 on 2020-09-10 at 20:10

2020-09-10, 22:16   #17
VBCurtis
 
"Curtis"
Feb 2005
Riverside, CA

149F₁₆ Posts

Only one core had a hardware error. Why would you expect all cores to have an error, just because one did?

No, it had nothing to do with the specific exponent being tested.

2021-05-29, 22:54   #18
MarkVanCoutren
 
May 2021

5 Posts
Similar issues

I've been having similar issues. I haven't tried to overclock but I left it running for a few days.

Iteration: 8280000 / 108671053 [7.61%], ms/iter: 10.106, ETA: 11d 17:48
Hardware errors have occurred during the test!
1 Gerbicz/double-check error.
Confidence in final result is excellent

I have an Intel Core i5-9600K 3.7 GHz 6-Core Processor running Windows 10. I think I got it on my last number too. Should I just let it finish the number and see if it goes away?

Last fiddled with by MarkVanCoutren on 2021-05-29 at 23:34 Reason: changed a typo (have to haven't)

2021-05-29, 23:08   #19
VBCurtis
 
"Curtis"
Feb 2005
Riverside, CA

5,279 Posts

You should reduce the overclock, since you have found a speed / setting combination that produces hardware errors.

2021-05-30, 08:35   #20
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

111000000₂ Posts

Depending on your operating system, and hardware, errors may be logged. With the exception of my laptop, all computers I use have error correcting (ECC) RAM. With that, most RAM errors get logged, and usually corrected, so the application doesn’t know about it. I think even standard RAM will detect errors, although not correct them. This might be logged. If you see errors about the same DIMM or same CPU, it would be wise to replace it, although it could be a motherboard fault.
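
On Linux with ECC RAM and an EDAC driver loaded, the kernel exposes per-memory-controller counters of corrected and uncorrected errors under sysfs. A minimal Python sketch for reading them follows; it is Linux-only, the function name `edac_counts` is hypothetical, and on systems without ECC or without the driver the directory is simply empty or absent.

```python
from pathlib import Path

def edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from Linux EDAC counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # counter missing or unreadable on this system
        counts[mc.name] = (ce, ue)
    return counts

if __name__ == "__main__":
    print(edac_counts() or "no EDAC memory controllers found")
```

A corrected count that keeps growing for the same controller is the kind of repeated error that would justify swapping a DIMM.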

There is RAM testing software. Passmark have a free tool that you can put on a USB stick and boot from it. I would run that for a few days. Prime95 or mprime have the ability to do this too. They will stress test your hardware.

Dave

2021-05-30, 09:30   #21
retina
Undefined
 
"The unspeakable one"
Jun 2006
My evil lair

1100100110010₂ Posts

Quote:
Originally Posted by drkirkby View Post
I think even standard RAM will detect errors, although not correct them.
Normal non-ECC RAM has no spare bits available, so there is no possibility for either detection or correction.

With no ECC you have to take your chances and hope for the best.

2021-05-30, 10:31   #22
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

2⁶×7 Posts

Quote:
Originally Posted by retina View Post
Normal non-ECC RAM has no spare bits available, so there is no possibility for either detection or correction.

With no ECC you have to take your chances and hope for the best.
I believe that some non-ECC RAM has a parity bit, so can detect errors. But perhaps it is rare.
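
To illustrate what a single parity bit can and cannot do, here is a small Python sketch of even parity over one byte. It is an illustration of the principle only, not a model of any particular memory module, and the function names are made up.

```python
def parity_bit(byte):
    """Even-parity bit: 1 if the byte has an odd number of set bits."""
    return bin(byte).count("1") % 2

def check(byte, stored_parity):
    """True if the byte still agrees with the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity

data = 0b10110010
p = parity_bit(data)
print(check(data, p))               # True: unchanged data passes
print(check(data ^ 0b00000100, p))  # False: a single flipped bit is detected
print(check(data ^ 0b00000110, p))  # True: two flipped bits go unnoticed
```

Parity needs one extra stored bit per word, so it can detect an odd number of flipped bits but cannot say which bit flipped, and therefore cannot correct anything.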

My IBM servers, which are pretty old, have the ability to have spare RAM modules, that are not used unless the system detects a DIMM failure. Obviously that limits the maximum capacity of RAM. I don’t know if my Dell workstation can do that or not, but given the price of 32 GB RDIMMs, there’s no way I would buy spares.

Dave

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
