#12 |
Jan 2019
20₈ Posts |
CPU: AMD 3950X with 32 GB RAM. OS: Ubuntu 18.04.5 LTS. Running Prime95 with the defaults: no overclocking and no Throttle=30. Using 4 workers. I resumed a previous calculation.
The data below suggest that the "hardware problems" are with Worker #3, which is doing a PRP test of M103884401, yet Prime95 also reports "Confidence in final result is excellent." The other 3 workers are doing fine: no error messages.
[Worker #3 Aug 30 10:05] Resuming Gerbicz error-checking PRP test of M103884401 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads
[Worker #3 Aug 30 10:09] Hardware errors have occurred during the test!
[Worker #3 Aug 30 10:09] 1 Gerbicz/double-check error.
[Worker #3 Aug 30 10:09] Confidence in final result is excellent.
I will continue this calculation, ignoring these messages from Worker #3, and see what happens when a new set of exponents is assigned. Also, is the calculation running into numerical difficulties specifically because of the value of the exponent, 103884401? I would like to understand what is happening here. |
#13 | |
Romulan Interpreter
"name field"
Jun 2011
Thailand
9961₁₀ Posts |
Quote:
BUT DON'T DO THAT PLEASE! It would be a pity to start from scratch; you would lose a lot of work! The files are not "corrupted". There was an error in the past; GEC (the Gerbicz error check) caught it and corrected it. You will still be informed about it until the test finishes and the result is reported. Please bear with it, and do not throw away a lot of work by deleting the temporary checkpoints and starting from scratch. The confidence in the result is still high; the result is most probably correct.
Moreover, whether you decide to continue and finish the test or to restart, it is not recommended to take a new exponent. First, that's a mess: you need to unreserve the old one, get a new assignment, etc. Second, probabilistically, if there is an error in the software, or you have an issue with your system, the error is more likely to appear again if you repeat the same assignment that generated the error. Walk the same path. But my advice, same as before, is to continue the assignment and stick your fingers in your ears so you don't see the error till the test is finished.
Last fiddled with by LaurV on 2020-08-30 at 14:44 |
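For readers wondering how an error can be "caught and corrected" mid-test, here is a minimal Python sketch of a Gerbicz-style check on a chain of modular squarings. It is an illustration under stated assumptions, not Prime95's actual code: the function name `checked_squarings`, the per-block check interval, and the `fault_at` hook that simulates a bit flip are all hypothetical.

```python
def checked_squarings(n, iters, block, fault_at=None, base=3):
    """Compute base^(2^iters) mod n with a Gerbicz-style check per block.

    fault_at is a hypothetical test hook: the residue's lowest bit is
    flipped once, right after that iteration, to mimic a transient error.
    Returns (final_residue, number_of_detected_errors).
    """
    if iters % block:
        raise ValueError("iters must be a multiple of block in this sketch")
    u0 = base % n
    x = u0                  # current residue u_i = base^(2^i) mod n
    d = u0                  # running product u_0 * u_L * u_2L * ...
    good = (0, x, d)        # last state that passed the check
    errors = 0
    i = 0
    while i < iters:
        d_prev = d
        for _ in range(block):
            x = x * x % n
            i += 1
            if i == fault_at:
                x ^= 1              # simulated bit flip (assumption)
                fault_at = None     # transient: does not recur on replay
        d = d_prev * x % n
        # Identity used by the check (L = block):
        #   u_0 * u_L * ... * u_tL == u_0 * (u_0 * ... * u_(t-1)L)^(2^L)  (mod n)
        if d != u0 * pow(d_prev, 1 << block, n) % n:
            errors += 1
            i, x, d = good          # roll back and redo this block
        else:
            good = (i, x, d)
    return x, errors
```

Calling `checked_squarings((1 << 61) - 1, 64, 8, fault_at=40)` injects one fault, which the check detects before rolling back to the last verified state. Checking after every block, as this sketch does, roughly doubles the work, so a practical implementation would space the full check much further apart. The sketch also does not protect the running product itself.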
|
#14 |
Jan 2019
10₁₆ Posts |
Many thanks for your advice. I will continue the present calculation. I will ignore the error message from Worker #3 and see what happens with the next set of exponents.
I do not believe that my PC has a hardware problem because: (i) the PC is relatively new (January 2020); (ii) it is not under heavy usage, that is, it is only running Prime95, albeit 24/7/365; (iii) the error message is quite recent, and a hardware problem would have manifested itself much earlier. |
#15 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3²×719 Posts |
I've seen gpuowl GEC errors occur on more than one GPU model placed in the same PCIe slot of the same system. Since the system is a dual Xeon with ECC RAM, I doubt it's the system memory either. Something about that slot.
As long as it's not so frequent that it interferes with throughput, it's OK on PRP. Just don't run LLDC or P-1 there if PRP/GEC shows errors more than ~weekly.
Last fiddled with by kriesel on 2020-09-03 at 14:54 |
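To judge whether errors are rarer than "~weekly", one option is to tally the Gerbicz error reports per worker from saved console output. The sketch below assumes the `[Worker #N date time] message` line layout quoted in this thread, and assumes the number in front of "Gerbicz/double-check error" is a running count for the current test (the same "1" is repeated on later status reports). The function name is hypothetical.

```python
import re

# Line layout as quoted in this thread: "[Worker #3 Aug 30 10:09] message"
WORKER_LINE = re.compile(r"\[Worker #(\d+) [^\]]*\] (.*)")
GEC_COUNT = re.compile(r"(\d+) Gerbicz/double-check error")

def gerbicz_errors_by_worker(log_text):
    """Return {worker_number: highest Gerbicz error count reported}."""
    highest = {}
    for line in log_text.splitlines():
        m = WORKER_LINE.match(line.strip())
        if m is None:
            continue                      # not a worker status line
        count = GEC_COUNT.search(m.group(2))
        if count is None:
            continue                      # some other message
        worker = int(m.group(1))
        # Assumption: the count is cumulative, so keep the maximum seen.
        highest[worker] = max(highest.get(worker, 0), int(count.group(1)))
    return highest
```

Workers that never report an error are simply absent from the returned dictionary.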
#16 |
Jan 2019
2⁴ Posts |
In a previous post I reported that Prime95 was reporting hardware errors from the calculation done by Worker #3, as follows:
[Worker #3 Sep 5 11:36] Iteration: 83520000 / 103884401 [80.39%], ms/iter: 18.571, ETA: 4d 09:03
[Worker #3 Sep 5 11:36] Hardware errors have occurred during the test!
[Worker #3 Sep 5 11:36] 1 Gerbicz/double-check error.
[Worker #3 Sep 5 11:36] Confidence in final result is excellent.
That calculation by Worker #3 ended last night, and a new one was started on Worker #3 without displaying any error, as can be seen:
[Worker #3 Sep 10 15:53] Iteration: 3660000 / 107917471 [3.39%], ms/iter: 18.802, ETA: 22d 16:31
So, was the "hardware error" in the previous calculation caused by the specific value of the exponent being analysed, or is it something else? I do not know, but I am wondering why the "hardware error" occurred in Worker #3's calculations only and not in those of the other workers (there are three more). Anyone who can shed light on this is most welcome to comment.
Last fiddled with by rgirard1 on 2020-09-10 at 20:10 |
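As a side note on reading these status lines, the ETA appears to be simply the remaining iterations multiplied by the ms/iter figure. The sketch below recomputes it from a quoted line; the regular expression and function name are assumptions based only on the line layout shown in this thread, and small differences from the printed ETA are possible because ms/iter is rounded.

```python
import re

STATUS = re.compile(r"Iteration: (\d+) / (\d+) \[[\d.]+%\], ms/iter: ([\d.]+)")

def eta_from_status(line):
    """Return (days, hours, minutes) remaining, or None if the line does not match."""
    m = STATUS.search(line)
    if m is None:
        return None
    done, total, ms_per_iter = int(m.group(1)), int(m.group(2)), float(m.group(3))
    seconds = (total - done) * ms_per_iter / 1000.0
    minutes = int(seconds // 60)            # truncate to whole minutes
    days, rest = divmod(minutes, 24 * 60)
    hours, mins = divmod(rest, 60)
    return days, hours, mins
```

For the Sep 5 line above, 20364401 remaining iterations at 18.571 ms each come to about 378187 seconds, which agrees with the printed "4d 09:03".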
#17 |
"Curtis"
Feb 2005
Riverside, CA
149F₁₆ Posts |
Only one core had a hardware error. Why would you expect all cores to have an error, just because one did?
No, it had nothing to do with the specific exponent being tested. |
#18 |
May 2021
5 Posts |
I've been having similar issues. I haven't tried to overclock, but I left it running for a few days.
Iteration: 8280000 / 108671053 [7.61%], ms/iter: 10.106, ETA: 11d 17:48
Hardware errors have occurred during the test!
1 Gerbicz/double-check error.
Confidence in final result is excellent
I have an Intel Core i5-9600K 3.7 GHz 6-core processor running Windows 10. I think I got it on my last number too. Should I just let it finish the number and see if it goes away?
Last fiddled with by MarkVanCoutren on 2021-05-29 at 23:34 Reason: changed a typo (have to haven't) |
#19 |
"Curtis"
Feb 2005
Riverside, CA
5,279 Posts |
You should reduce the overclock, since you have found a speed / setting combination that produces hardware errors.
|
#20 |
"David Kirkby"
Jan 2021
Althorne, Essex, UK
111000000₂ Posts |
Depending on your operating system, and hardware, errors may be logged. With the exception of my laptop, all computers I use have error correcting (ECC) RAM. With that, most RAM errors get logged, and usually corrected, so the application doesn’t know about it. I think even standard RAM will detect errors, although not correct them. This might be logged. If you see errors about the same DIMM or same CPU, it would be wise to replace it, although it could be a motherboard fault.
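As one concrete example of such logging, on a Linux machine with ECC RAM and a loaded EDAC driver the kernel exposes per-memory-controller error counters under sysfs. The sketch below reads them; it assumes the standard `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` files are present, which will not be the case on Windows or on systems without EDAC support. The function name is hypothetical.

```python
from pathlib import Path

def edac_error_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected_errors, uncorrected_errors)}.

    Returns an empty dict when the EDAC directory is absent, for example
    on non-Linux systems or when no EDAC driver is loaded.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue                      # skip controllers we cannot read
        counts[mc.name] = (corrected, uncorrected)
    return counts
```

A corrected-error count that keeps rising on one controller would point toward a particular DIMM or memory channel rather than toward the application.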
There is RAM testing software. PassMark has a free tool that you can put on a USB stick and boot from. I would run that for a few days. Prime95 and mprime can do this too; they will stress test your hardware.
Dave |
#21 |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
1100100110010₂ Posts |
#22 | |
"David Kirkby"
Jan 2021
Althorne, Essex, UK
2⁶×7 Posts |
Quote:
My IBM servers, which are pretty old, support spare RAM modules that are not used unless the system detects a DIMM failure. Obviously that limits the maximum capacity of RAM. I don’t know if my Dell workstation can do that or not, but given the price of 32 GB RDIMMs, there’s no way I would buy spares.
Dave |
|