mersenneforum.org  

2004-10-29, 14:51   #1
S00113
 
 
Dec 2003

2^3·3^3 Posts
More RHEL WS 3.0 bugs?

On many (>10) previously error-free machines running Red Hat Enterprise Linux WS 3.0, I have started seeing ROUND OFF and SUM(INPUTS) != SUM(OUTPUTS) errors. Sometimes they even loop forever, like this:
[pre]
[Sat Oct 23 22:22:24 2004]
Iteration: 3705014/12654503, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
[Sat Oct 23 22:45:15 2004]
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 3705014/12654503, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 2.948298207980173e+17 != -455.9915635528858
Possible hardware failure, consult the readme.txt file.
Continuing from last save file.
[...]
[Fri Oct 29 16:05:00 2004]
Iteration: 3705014/12654503, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 3705014/12654503, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 2.948298207980173e+17 != -455.9915635528858
Possible hardware failure, consult the readme.txt file.
Continuing from last save file.
[Fri Oct 29 16:10:12 2004]
Iteration: 3705014/12654503, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 3705014/12654503, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 2.948298207980173e+17 != -455.9915635528858
Possible hardware failure, consult the readme.txt file.
Continuing from last save file.
[/pre]
I have some save files that reproduce the loop, in case anyone is interested.

This problem has been occurring a lot lately. I cannot reproduce the errors with mprime -t when the machines are idle, so I suspect a faulty driver that does not restore the FP context properly. Does anyone else have this problem on RHEL?
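
With more than ten machines involved, it may help to scan the logs for this pattern automatically. The following is only a minimal sketch: it assumes the error lines look exactly like the ones quoted above, and the names `ERROR_LINE`, `find_looping_iterations`, and `threshold` are made up for illustration; they are not part of mprime.

```python
import re
from collections import Counter

# Matches the start of error lines such as:
#   Iteration: 3705014/12654503, ERROR: ROUND OFF (0.40625) > 0.40
# Only the prefix is matched, so a line wrapped by the terminal still counts.
ERROR_LINE = re.compile(
    r"^Iteration: (\d+)/(\d+), ERROR: (ROUND OFF|SUM\(INPUTS\) != SUM\(OUTPUTS\))"
)

def find_looping_iterations(log_text, threshold=3):
    """Return {(iteration, total): count} for iterations that report
    errors at least `threshold` times (threshold is an arbitrary choice)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return {key: n for key, n in counts.items() if n >= threshold}

# Excerpt of the log quoted in this post, used as sample input.
SAMPLE = """\
[Sat Oct 23 22:22:24 2004]
Iteration: 3705014/12654503, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Iteration: 3705014/12654503, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 2.9482982079801
73e+17 != -455.9915635528858
[Fri Oct 29 16:05:00 2004]
Iteration: 3705014/12654503, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Iteration: 3705014/12654503, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 2.9482982079801
73e+17 != -455.9915635528858
"""

looping = find_looping_iterations(SAMPLE)
for (iteration, total), n in looping.items():
    print(f"iteration {iteration}/{total} reported {n} errors")
```

An iteration that keeps reappearing with errors is a candidate for the loop described above; an isolated error on one iteration is not flagged.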
2004-10-29, 19:47   #2
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

7679_10 Posts

This looks like a bug in the error recovery code. Very strange since you are the first to report such a problem.

The ROUND OFF (0.40625) > 0.40 "error" is not a problem. This is normal when testing near the limits of an FFT range. The SUM(INPUTS) != SUM(OUTPUTS) seems to be a bug.
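
For readers unfamiliar with the two checks, here is a toy illustration in plain Python. It is a sketch under stated assumptions, not mprime's actual code: it uses an unweighted cyclic convolution on a tiny digit vector, whereas the real program uses its own optimized transforms. The functions `fft` and `cyclic_square` and the sample `digits` are invented for the example. It shows (a) the round-off error, meaning how far the floating-point results land from the nearest integers, and (b) the identity that the sum of the outputs of a cyclic squaring equals the square of the sum of the inputs.

```python
import cmath

def fft(values, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def cyclic_square(digits):
    """Square a digit vector by cyclic convolution via FFT.

    Returns (rounded integer results, maximum round-off error)."""
    n = len(digits)
    spectrum = fft([complex(d) for d in digits])
    raw = [v.real / n for v in fft([s * s for s in spectrum], invert=True)]
    rounded = [round(x) for x in raw]
    roundoff = max(abs(x - r) for x, r in zip(raw, rounded))
    return rounded, roundoff

digits = [3, 1, 4, 1, 5, 9, 2, 6]          # arbitrary sample input
squared, roundoff = cyclic_square(digits)

# Check 1: results must be close to integers (0.40 is the limit in the log above).
print("max round-off:", roundoff, "ok" if roundoff < 0.40 else "TOO LARGE")

# Check 2: for a cyclic convolution, sum(outputs) == sum(inputs) ** 2.
print("sum check:", sum(squared), "vs", sum(digits) ** 2)
```

With digits this small the round-off is negligible. The intuition is that larger digits leave less floating-point precision to spare, which pushes the round-off toward the limit.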

Can you email the pNNNNNNN file to me for debugging?

To work around the problem, try this: Exit mprime. Add the line "CpuSupportsSSE2=0" to local.ini. Run mprime until you get past the loop. Exit mprime. Remove the local.ini line. Restart mprime.
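
If the workaround has to be applied on many machines, the local.ini edit can be scripted. This is a rough sketch only: the helper names `add_setting` and `remove_setting` are made up, the `SomeOtherSetting=1` line in the demo is a hypothetical placeholder, and the script does not stop or start mprime. Exiting mprime before editing and restarting it afterwards still has to be done as described above.

```python
import tempfile
from pathlib import Path

SETTING = "CpuSupportsSSE2=0"  # the line named in the workaround above

def add_setting(ini_path, line=SETTING):
    """Append `line` to the ini file unless it is already present."""
    path = Path(ini_path)
    lines = path.read_text().splitlines() if path.exists() else []
    if all(existing.strip() != line for existing in lines):
        lines.append(line)
    path.write_text("\n".join(lines) + "\n")

def remove_setting(ini_path, line=SETTING):
    """Remove every occurrence of `line` from the ini file."""
    path = Path(ini_path)
    lines = [l for l in path.read_text().splitlines() if l.strip() != line]
    path.write_text("\n".join(lines) + ("\n" if lines else ""))

# Demo on a throwaway copy; point the functions at the real local.ini
# only while mprime is not running.
with tempfile.TemporaryDirectory() as tmp:
    ini = Path(tmp) / "local.ini"
    ini.write_text("SomeOtherSetting=1\n")  # hypothetical existing content
    add_setting(ini)
    after_add = ini.read_text()
    remove_setting(ini)
    after_remove = ini.read_text()

print(after_add, end="")
print(after_remove, end="")
```

Calling `add_setting` twice leaves only one copy of the line, so the script is safe to rerun.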
2004-11-04, 20:40   #3
S00113
 
 
Dec 2003

D8_16 Posts

Quote:
Originally Posted by Prime95
This looks like a bug in the error recovery code. Very strange since you are the first to report such a problem.

The ROUND OFF (0.40625) > 0.40 "error" is not a problem. This is normal when testing near the limits of an FFT range. The SUM(INPUTS) != SUM(OUTPUTS) seems to be a bug.

Can you email the pNNNNNNN file to me for debugging?
I'll mail you a URL to some examples.
Quote:
To work around the problem, try this: Exit mprime. Add the line "CpuSupportsSSE2=0" to local.ini. Run mprime until you get past the loop. Exit mprime. Remove the local.ini line. Restart mprime.
It worked. No error when running without SSE2.

I've investigated further by re-running a failed exponent from the beginning on another machine. It still failed and looped, but on a different iteration.

Machine 1:
[pre][Tue Sep 14 18:07:51 2004]
Iteration: 21958186/24928889, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 21958186/24928889, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 2.362743189749471e+17 != -272.6337128144165
[/pre]
Machine 2:
[pre][Thu Nov 4 20:19:05 2004]
Iteration: 23733427/24928889, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 23733427/24928889, ERROR: SUM(INPUTS) != SUM(OUTPUTS), -4.761264529181626e+17 != -7305.938875061262
[/pre]

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.