#1 |
Aug 2002
2²·2,161 Posts |
Our goal is to ensure our gaming computer is 100% error-free, so we are running double-check work.
We have both "ErrorCheck=1" and "SumInputsErrorCheck=1" in our prime.txt file. Are these hardware errors that the software corrected? Our first job turned in a legit answer despite the error code.
Code:
[Main thread Nov 25 18:22] Mersenne number primality test program version 28.10
[Main thread Nov 25 18:22] Optimizing for CPU architecture: AMD Bulldozer, L2 cache size: 2 MB
[Main thread Nov 25 18:22] Starting worker.
[Work thread Nov 25 18:22] Worker starting
[Work thread Nov 25 18:22] Setting affinity to run worker on any logical CPU.
[Work thread Nov 25 18:22] Setting affinity to run helper thread 1 on any logical CPU.
[Work thread Nov 25 18:22] Setting affinity to run helper thread 2 on any logical CPU.
[Work thread Nov 25 18:22] Setting affinity to run helper thread 3 on any logical CPU.
[Work thread Nov 25 18:22] Resuming primality test of M43585261 using AMD K10 type-2 FFT length 2304K, Pass1=512, Pass2=4608, 4 threads
[Work thread Nov 25 18:22] Iteration: 30638610 / 43585261 [70.29%].
[Work thread Nov 25 18:22] Possible hardware errors have occurred during the test! 1 ROUNDOFF > 0.4.
[Work thread Nov 25 18:22] Confidence in final result is fair.
[Work thread Nov 25 18:22] Iteration: 30640000 / 43585261 [70.29%], roundoff: 0.219, ms/iter: 12.304, ETA: 44:14:37
[Work thread Nov 25 18:22] Possible hardware errors have occurred during the test! 1 ROUNDOFF > 0.4.
[Work thread Nov 25 18:22] Confidence in final result is fair.
Code:
[Fri Nov 11 19:34:36 2016] Iteration: 29900/43334623, Possible error: round off (0.5) > 0.40625 Continuing from last save file.
[Fri Nov 18 14:04:29 2016] Iteration: 37050967/43334623, Possible error: round off (0.5) > 0.40625 Continuing from last save file.
[Sat Nov 19 22:54:43 2016] UID: Xyzzy/880K, M43334623 is not prime. Res64: 493C534C8731CB21. We4: 93E37212,6750902,00000100, AID: F5D1C6F0BA73A811CF752C052922CB52
[Tue Nov 22 01:44:23 2016] Iteration: 11202950/43585261, Possible error: round off (0.5) > 0.40625 Continuing from last save file.
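To keep track of how often each exponent hit a roundoff error, entries like the ones in the second listing can be tallied with a short script. This is only a rough sketch: the regular expression is modeled on the three error lines quoted above, and the helper names (`roundoff_events`, `count_by_exponent`) are made up for illustration, not part of Prime95.

```python
import re

# Error lines taken from the listing above (other entries omitted).
RESULTS = """\
[Fri Nov 11 19:34:36 2016] Iteration: 29900/43334623, Possible error: round off (0.5) > 0.40625 Continuing from last save file.
[Fri Nov 18 14:04:29 2016] Iteration: 37050967/43334623, Possible error: round off (0.5) > 0.40625 Continuing from last save file.
[Tue Nov 22 01:44:23 2016] Iteration: 11202950/43585261, Possible error: round off (0.5) > 0.40625 Continuing from last save file.
"""

# Assumed line shape, based only on the samples above.
PATTERN = re.compile(
    r"Iteration: (\d+)/(\d+), Possible error: round off \(([\d.]+)\) > ([\d.]+)"
)


def roundoff_events(text):
    """Return (exponent, iteration, roundoff, limit) for each flagged iteration."""
    events = []
    for match in PATTERN.finditer(text):
        iteration, exponent, roundoff, limit = match.groups()
        events.append((int(exponent), int(iteration), float(roundoff), float(limit)))
    return events


def count_by_exponent(events):
    """Count flagged iterations per exponent."""
    counts = {}
    for exponent, *_ in events:
        counts[exponent] = counts.get(exponent, 0) + 1
    return counts


events = roundoff_events(RESULTS)
counts = count_by_exponent(events)
print(counts)  # {43334623: 2, 43585261: 1}
```

For these samples the tally shows two flagged iterations on M43334623 and one so far on M43585261.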
#2 |
P90 years forever!
Aug 2002
Yeehaw, FL
2²·5·7·59 Posts |
#3 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
2⁴×643 Posts |
Yes, they were corrected. Hardware errors that produce an error message are "safe", because P95 will retry, eventually with a different (slower) algorithm and/or a larger FFT, redoing the iteration until the results match and no error occurs. Hardware errors that go undetected (resulting in a bad residue) are more dangerous...
OTOH, this is a sign that you may need to clean out some dust clogs, reseat that heatsink, reduce the overclocking, increase the voltages, whatever....
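For anyone wondering what the "ROUNDOFF > 0.4" message is measuring: when integers are multiplied with a floating-point transform, every output should land on a whole number, and the roundoff error is the largest distance from the nearest integer. Below is a toy sketch of that idea, assuming a naive pure-Python DFT over base-10 digits. It is not how Prime95 does its arithmetic; the function names are invented for illustration, and only the 0.40625 limit is taken from the log in the first post.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (toy sizes only)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for j, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * cmath.pi * j * k / n)
        out.append(total / n if inverse else total)
    return out


def square_digits(digits):
    """Square a number given as little-endian base-10 digits.

    Returns the integer convolution coefficients and the maximum
    distance of any raw floating-point output from the nearest integer.
    """
    n = 2 * len(digits)                      # zero-pad so the cyclic product is exact
    padded = list(digits) + [0] * (n - len(digits))
    spectrum = dft(padded)
    raw = [c.real for c in dft([s * s for s in spectrum], inverse=True)]
    roundoff = max(abs(x - round(x)) for x in raw)
    return [round(x) for x in raw], roundoff


def to_int(coeffs, base=10):
    """Recombine (uncarried) coefficients into one integer."""
    return sum(c * base**i for i, c in enumerate(coeffs))


ROUNDOFF_LIMIT = 0.40625                     # limit shown in the log above

digits = [int(ch) for ch in reversed("43585261")]
coeffs, roundoff = square_digits(digits)
print(to_int(coeffs) == 43585261 ** 2)       # True
print(roundoff < ROUNDOFF_LIMIT)             # True
```

On healthy hardware with inputs this small the roundoff is many orders of magnitude below the limit. A value near 0.5 means the program can no longer tell which integer was intended, which is why such an iteration gets redone.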
#4 |
Aug 2002
2²×2,161 Posts |
The computer is only a few months old. We ran each variant of the torture test for 24 hours and it passed each time.
We had the computer set to run one worker with a total of four threads, so we reset it to run two workers, each with a single thread. Thus, we have gone from four active cores to two. We are fairly certain our CPU shares an FPU between "logical processors", so we now have the two worker threads on separate cores. The error recovery only worked because we had the optional roundoff checking enabled, right? We have attached some diagnostic info for your perusal.
#5 |
P90 years forever!
Aug 2002
Yeehaw, FL
2²·5·7·59 Posts |
Normally, error checking is done every 128(?) iterations (unless you are testing an exponent near the limit of an FFT length, in which case it is done every iteration). If you get a roundoff error in an unchecked iteration, it often "hangs around" until the 128th iteration, where it is detected and properly rolled back.
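The check-every-N-iterations idea can be sketched with a small Lucas-Lehmer loop. This is a simplified model under stated assumptions, not Prime95's code: exact integer arithmetic stands in for the FFT, an injected glitch sets a `tainted` flag that stands in for a roundoff error that "hangs around", and a checkpoint taken at each clean check stands in for the save file. The function name and parameters are hypothetical, and `p` is assumed to be an odd prime.

```python
def ll_test(p, check_interval=128, transient_faults=()):
    """Lucas-Lehmer test of 2**p - 1 with a simulated periodic error check.

    transient_faults lists iteration numbers where a one-off glitch
    corrupts the value. Returns (is_prime, rollbacks).
    """
    m = (1 << p) - 1
    pending = set(transient_faults)
    s, tainted = 4, False
    checkpoint = (0, 4)                  # (iterations completed, value)
    rollbacks = 0
    total = p - 2
    i = 0
    while i < total:
        i += 1
        s = (s * s - 2) % m
        if i in pending:
            pending.discard(i)           # transient: does not recur on the retry
            s ^= 1                       # corrupt the value
            tainted = True               # the error lingers until the next check
        if i % check_interval == 0 or i == total:
            if tainted:
                i, s = checkpoint        # roll back and redo from the last good state
                tainted = False
                rollbacks += 1
            else:
                checkpoint = (i, s)
    return s == 0, rollbacks


print(ll_test(127, check_interval=16, transient_faults=[37]))  # (True, 1)
print(ll_test(11))                                             # (False, 0)
```

In the first call the glitch at iteration 37 goes unnoticed until the check at iteration 48, the run resumes from iteration 32, and the final answer is still correct. The cost of checking less often is simply that more iterations have to be redone.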
#6 |
Aug 2002
2²·2,161 Posts |
We now have the computer working on two separate jobs without errors.
We would like to run one job with an additional helper thread instead. In both cases, we will be using 50% of our CPU. What do we alter in our configuration files to make this work? Because our CPU has only one FPU per core, we would like to lock the affinity to CPU #2 and #4, which Prime95 calls CPU #1 and #3.
local.txt:
Code:
OldCpuSpeed=3993
NewCpuSpeedCount=0
NewCpuSpeed=0
RollingAverage=1830
RollingAverageIsFromV27=1
ComputerGUID=××××××××××××××××××××××××××××××××
ComputerID=880K
ThreadsPerTest=1
SrvrUID=×××××××××
SrvrComputerName=××××××××××
SrvrPO2=1
SrvrPO3=3
SrvrPO4=8
SrvrPO5=8
SrvrPO6=450
SrvrPO7=1410
SrvrPO8=1
SrvrPO9=2
SrvrP00=6
LastEndDatesSent=1481402536
RollingHash=-1304581059
RollingStartTime=1481405473
RollingCompleteTime=1676485
WorkerThreads=2
SrvrPO1=101
[Worker #1]
Affinity=1
[Worker #2]
Affinity=3
Code:
V24OptionsConverted=1
WGUID_version=2
StressTester=0
UsePrimenet=1
Windows95Service=1
DialUp=0
V5UserID=Xyzzy
PauseWhileRunning=worldoftanks
MergeWindows=8
ErrorCheck=1
SumInputsErrorCheck=1
ErrorCountMessages=1
Priority=1
DaysOfWork=3
RunOnBattery=1
OutputIterations=100000
ResultsFileIterations=999999999
DiskWriteTime=30
NetworkRetryTime=2
NetworkRetryTime2=70
DaysBetweenCheckins= 0.25
NumBackupFiles=3
SilentVictory=0
AMPM=1
OutputRoundoff=1
MaxExponents=5
Left=9
Top=106
Right=1929
Bottom=1143
W2=0 448 2558 897 0 -1 -1 -8 -31
W1=0 0 2558 448 0 -1 -1 -1 -1
W3=0 897 2558 1346 0 -1 -1 -1 -1
WorkPreference=101
[PrimeNet]
Debug=0
ProxyHost=
ProxyUser=
[Worker #1]
[Worker #2]
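To double-check the off-by-one numbering mentioned above (CPU #2 and #4 versus Prime95's #1 and #3), the per-worker settings can be pulled out of local.txt with a few lines of Python. This is only a sketch: the parser is a hand-rolled reader for the key=value and [section] layout seen in the listing, the helper names are invented, and the "+1" conversion simply restates the zero-based numbering described in this post rather than anything confirmed about how Prime95 interprets `Affinity`.

```python
# Relevant lines copied from the local.txt listing above.
LOCAL_TXT = """\
ThreadsPerTest=1
WorkerThreads=2
[Worker #1]
Affinity=1
[Worker #2]
Affinity=3
"""


def parse_sections(text):
    """Parse key=value lines grouped under optional [section] headers.

    Keys that appear before any header go into the "" section.
    """
    sections = {"": {}}
    current = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
        elif "=" in line:
            key, value = line.split("=", 1)
            sections[current][key.strip()] = value.strip()
    return sections


def one_based_cpus(sections):
    """Map each worker to a 1-based CPU number (assumes Affinity counts from 0)."""
    return {
        name: int(settings["Affinity"]) + 1
        for name, settings in sections.items()
        if "Affinity" in settings
    }


sections = parse_sections(LOCAL_TXT)
print(one_based_cpus(sections))  # {'Worker #1': 2, 'Worker #2': 4}
```

With the values above, the two workers end up on what we would call CPU #2 and CPU #4.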
#7 |
Aug 2002
2²·2,161 Posts |
#8 |
Aug 2002
2²·2,161 Posts |
We have run into additional problems.
If we run one job on multiple cores, errors occur. If we run separate jobs on separate cores, no errors occur.