mersenneforum.org  

2020-08-27, 00:46   #1
rgirard1

Jan 2019

2·7 Posts
Hardware errors have occurred during the test!

I am getting the following message when running mprime on an AMD Ryzen 9 3950X (16 cores, 32 threads) with 32 GB of memory and Ubuntu 18.04.5 LTS as the OS:

"Hardware errors have occurred during the test!
1 Gerbicz/double-check error.
Confidence in final result is excellent."

Can anyone help me understand what is happening? I have had this machine since January 2020, so it is relatively new. Have I an actual "hardware problem"?
2020-08-27, 00:53   #2
chalsall
If I May

"Chris Halsall"
Sep 2002
Barbados

22106_8 Posts

Quote:
Originally Posted by rgirard1 View Post
Have I an actual "hardware problem"?
Possibly.

How hard have you pushed it to the limits (read: overclocked)?

Is this a new build you are trying to test the limits of?

Or is this a machine you've been using for a while, and suddenly it's reporting this?

Perspiring minds want to know...
2020-08-27, 07:28   #3
LaurV
Romulan Interpreter

Jun 2011
Thailand

2^2×7×317 Posts

One error is not a big deal. Even the most stable hardware has errors sometimes (power glitches, cosmic rays, bad luck, etc.). During the test, the message repeats periodically until the test is finished, as a reminder, but it always refers to the same 1 (one) error that occurred in the past. Nothing to worry about.

The message will be gone once you finish the exponent, report the result, and start a new exponent.

More errors would be a reason to worry. If the count keeps growing, or errors appear regularly on subsequent tests, then yes, you may have a hardware issue. In the meantime, monitor the temperatures closely. If they rise, reduce the clocks, clean the dust out of the fans, or in the worst case, think about a re-seating of the CPU.

Right now, do nothing (besides monitoring the system).

Last fiddled with by LaurV on 2020-08-27 at 07:31
2020-08-27, 10:41   #4
Viliam Furik

Jul 2018
Martin, Slovakia

E51_16 Posts

Quote:
Originally Posted by LaurV View Post
or in the worst case, think about a re-seating of the CPU.
I don't think that could help, since AMD uses a PGA socket, which has pins on the CPU and holes in the socket. If something were wrong there, it would have to be one of these:

1. The CPU sits a tiny bit higher than it should. But that would most probably make the whole CPU inoperable.
2. One pin is missing and all the other pins are in place. THAT would be very interesting, but if the CPU is working, it would most probably cause some RAM to not be detected.
2020-08-27, 11:01   #5
De Wandelaar

"Yves"
Jul 2017
Belgium

2·5^2 Posts

Quote:
Originally Posted by rgirard1 View Post
"Hardware errors have occurred during the test!
1 Gerbicz/double-check error.
Confidence in final result is excellent."
I had the same problem when my CPU undervolt was too close to the edge of stability.
2020-08-27, 11:12   #6
LaurV
Romulan Interpreter

Jun 2011
Thailand

8876_10 Posts

Quote:
Originally Posted by Viliam Furik View Post
I don't think that could help
Re-seating has nothing to do with the pin side. Or, well, it has, but what we mean by it (and by "we" I mean overclocking geeks, hihi) is: take off the cooler, clean it, remove dust clogs, if any, and most of all clean off the dried thermal paste and apply new thermal paste. That's re-seating. Not to be confused with "resetting". Put everything back carefully. As said, this would be his last resort, and I don't believe it's needed for a computer bought this year, unless his cat sleeps inside the computer case and it's full of dust and hair. On the other hand, I also don't know much about making t-shirts... (speaking of which, I want to buy one! I will come back to it; hopefully you won't run out of the second lot before I get the time to measure myself and order it, hehe).

Last fiddled with by LaurV on 2020-08-27 at 18:49 Reason: s/both/bought/g
2020-08-27, 18:34   #7
rgirard1

Jan 2019

2·7 Posts
Have dusted the PC but still "Hardware errors have occurred during the test!"

I have dusted out the PC and restarted "mprime", but I am still getting the same error message. I am not overclocking the CPU (AMD 3950X) and "Throttle = 30" so CPU runs 30% of the time. The CPU temperature is about 50 degC.

I am using 4 workers and each has 4 threads as shown below:


Resuming primality test of M54111917 using FMA3 FFT length 2880K, Pass1=1280, Pass2=2304, clm=2, 4 threads
Resuming Gerbicz error-checking PRP test of M103884359 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads
Resuming Gerbicz error-checking PRP test of M103884401 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads
Resuming Gerbicz error-checking PRP test of M105836671 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads


Note that the first exponent, M54111917, uses an FFT length of 2880K, while the others (where I believe the "hardware error" comes from) use an FFT length of 5600K. Is this a problem?
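For readers comparing the worker lines above, here is a small Python sketch that pulls out the exponent, test type, and FFT length from each line. It is only an illustration: the regular expression is inferred from the four lines quoted in this post rather than from mprime documentation, the helper name parse_worker_line is made up, and it assumes "K" means 1024 FFT words. Under that assumption both FFT lengths come out at a similar number of bits per word, which suggests the longer FFT simply reflects the larger exponents rather than a problem.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official mprime log format specification).
WORKER_RE = re.compile(
    r"Resuming (?P<kind>.+?) of M(?P<exponent>\d+) using \S+ FFT length "
    r"(?P<fft_k>\d+)K.*?(?P<threads>\d+) threads"
)

def parse_worker_line(line):
    """Return a dict describing one 'Resuming ...' line, or None if it does not match."""
    m = WORKER_RE.search(line)
    if m is None:
        return None
    exponent = int(m["exponent"])
    fft_words = int(m["fft_k"]) * 1024  # assumes K = 1024 words
    return {
        "kind": m["kind"],
        "exponent": exponent,
        "fft_k": int(m["fft_k"]),
        "threads": int(m["threads"]),
        "bits_per_word": exponent / fft_words,
    }

lines = [
    "Resuming primality test of M54111917 using FMA3 FFT length 2880K, Pass1=1280, Pass2=2304, clm=2, 4 threads",
    "Resuming Gerbicz error-checking PRP test of M103884359 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads",
    "Resuming Gerbicz error-checking PRP test of M103884401 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads",
    "Resuming Gerbicz error-checking PRP test of M105836671 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads",
]

parsed = [parse_worker_line(line) for line in lines]
for info in parsed:
    print(f"M{info['exponent']}: {info['kind']}, {info['fft_k']}K FFT, "
          f"{info['bits_per_word']:.2f} bits per word")
```

Note that the first worker is running a plain primality test, so only the three PRP workers can report a Gerbicz error.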

Is there a way to stop this calculation and start a complete and fresh new one for a new set of 4 exponents?

The above "Hardware errors" message has appeared only recently.

Thanks in advance for any help you can provide.

Last fiddled with by rgirard1 on 2020-08-27 at 18:35
2020-08-27, 18:53   #8
Xyzzy

"Mike"
Aug 2002

2^2·3·643 Posts

What speed are you running your memory? What kind of memory is it? Have you tried a memory test?

https://www.memtest86.com/download.htm

2020-08-27, 18:54   #9
Xyzzy

"Mike"
Aug 2002

2^2×3×643 Posts

Also, have you run the torture test?

./mprime -m

Select the torture test option. The defaults are fine.
2020-08-27, 19:33   #10
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2^2·3·5·7·17 Posts

Quote:
Originally Posted by rgirard1 View Post
I am not overclocking the CPU (AMD 3950X) and "Throttle = 30" so CPU runs 30% of the time. The CPU temperature is about 50 degC.

Is there a way to stop this calculation and start a complete and fresh new one for a new set of 4 exponents?

Do not use Throttle. Nowadays heat is rarely the cause of hardware problems. Usually it is memory related.

Do not restart your calculations. The PRP error-checking has caught and corrected the problem. Your results will be just fine.

Right now you should do nothing. Just keep an eye on things. If you get more errors (do not worry about prime95 whining about the one error that has already occurred), then look at upping the memory voltage or reducing the RAM speed.
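To illustrate why a single caught error does not spoil the result, here is a toy Python sketch of the Gerbicz check on a chain of modular squarings. It is not prime95's code: the function name, the tiny block size, and the injected bit flip are all assumptions made for illustration, and the real program works with far larger numbers and block sizes. The idea is that a running product d of every block-th residue must satisfy d_new = u0 * d^(2^block) mod n. If it does not, the computation rolls back to the last verified state and redoes the block.

```python
def squarings_with_gerbicz(n, k, base=3, block=8, fault_at=None):
    """Compute base^(2^k) mod n by k squarings, verifying every `block` squarings.

    fault_at (hypothetical, for demonstration only): iteration at which a
    single bit flip is injected once to simulate a hardware error.
    Returns (final residue, number of errors caught).
    """
    if k % block:
        raise ValueError("k must be a multiple of block")
    u0 = base % n
    x = u0             # current residue u_i = base^(2^i) mod n
    d = u0             # running product of u_0, u_block, u_2*block, ...
    good = (0, x, d)   # last verified state: (iteration, x, d)
    errors = 0
    i = 0
    while i < k:
        for _ in range(block):
            x = x * x % n
            i += 1
            if i == fault_at:
                x ^= 1            # simulated hardware error
                fault_at = None   # inject it only once
        d_new = d * x % n
        # Gerbicz identity: the new product must equal u0 * d^(2^block) mod n.
        if d_new == u0 * pow(d, 1 << block, n) % n:
            d = d_new
            good = (i, x, d)
        else:
            errors += 1
            i, x, d = good        # roll back and redo the block
    return x, errors

M31 = 2**31 - 1
clean, clean_errors = squarings_with_gerbicz(M31, 32)
faulty, faulty_errors = squarings_with_gerbicz(M31, 32, fault_at=13)
print(clean == faulty, clean_errors, faulty_errors)
```

In this sketch the run with the injected fault reports one caught error and still ends with the same residue as the clean run, which mirrors the "1 Gerbicz/double-check error" plus "Confidence in final result is excellent" message.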
2020-08-30, 13:51   #11
rgirard1

Jan 2019

2×7 Posts
Ran torture test with default for over 48 hrs: all passed

I ran the torture test for over 48 hrs with default settings: no overclocking and no Throttle=30, basically the machine's normal state. In "results.txt" I got a very long listing like this:
.
.
.
[Sun Aug 30 09:36:01 2020]
Self-test 240K passed!
Self-test 256K passed!
Self-test 256K passed!
.
.
.
Self-test 256K passed!
[Sun Aug 30 09:41:11 2020]
Self-test 280K passed!

i.e. all self-tests passed and there were no error messages.
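A long results.txt like this can be checked with a short script instead of by eye. The Python sketch below is based only on the line formats shown in the excerpt above (bracketed timestamps and "Self-test NNNK passed!"). Failure lines do not appear anywhere in this thread, so rather than guess their wording, the hypothetical helper summarize_results simply collects every line it does not recognize for manual review.

```python
import re
from collections import Counter

# Formats taken from the excerpt in this post; anything else is reported as "other".
PASS_RE = re.compile(r"^Self-test (\d+)K passed!$")
STAMP_RE = re.compile(r"^\[.*\]$")

def summarize_results(lines):
    """Count passed self-tests per FFT size and collect unrecognized lines."""
    passed = Counter()
    other = []
    for raw in lines:
        line = raw.strip()
        if not line or STAMP_RE.match(line):
            continue  # skip blank lines and timestamps
        m = PASS_RE.match(line)
        if m:
            passed[int(m.group(1))] += 1
        else:
            other.append(line)  # worth reading by hand
    return passed, other

sample = """[Sun Aug 30 09:36:01 2020]
Self-test 240K passed!
Self-test 256K passed!
Self-test 256K passed!
[Sun Aug 30 09:41:11 2020]
Self-test 280K passed!
""".splitlines()

passed, other = summarize_results(sample)
print(dict(passed), other)
```

For a real file, the same function can be fed the lines of results.txt, for example with open("results.txt") as f: summarize_results(f).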

I conclude that hardware problems with my desktop are unlikely. I will stop the torture test and resume the prime95 calculations, and if there are error messages I will let them be until new exponents are assigned after the current calculations complete.

I wonder whether the "restart" files are somehow corrupted and are causing these error messages. I do stop prime95 whenever I have to run a Software Update for Ubuntu 18.04, and then resume the calculations after the update.

Is there a way to start a "fresh" new calculation with new assigned exponents?

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.