2020-10-21, 13:32   #1
LaurV
Romulan Interpreter
Jun 2011
Thailand

8932₁₀ Posts
proof errors

Let's forget for a few minutes the "random shift" (HEY! I said "shift", not...) that was bothering me so much, and talk about why storing ALL checkpoints is needed (YES, with the residue in the file name! But that's another story).

(my emphasis and blood staining below)
Code:
2020-10-21 10:01:25 gfx906-1 105223961 OK 100000000  95.03%;  865 us/it; ETA 0d 01:15; b18cc5ab366b9e56 (check 0.53s)
2020-10-21 10:08:39 gfx906-1 105223961 OK 100500000  95.51%;  866 us/it; ETA 0d 01:08; 54871704410f95af (check 0.54s)
2020-10-21 10:15:53 gfx906-1 105223961 OK 101000000  95.99%;  866 us/it; ETA 0d 01:01; 72f471019cbf9bf9 (check 0.54s)
2020-10-21 10:23:06 gfx906-1 105223961 OK 101500000  96.46%;  866 us/it; ETA 0d 00:54; c48653f37e373f1c (check 0.54s)
2020-10-21 10:30:21 gfx906-1 105223961 EE 102000000  96.94%;  867 us/it; ETA 0d 00:47; ef0cd8751fb0c9bf (check 0.52s)
2020-10-21 10:30:21 gfx906-1 105223961 EE 101500000 loaded: blockSize 400, 5f482a5abd968ff6 (expected c48653f37e373f1c)
2020-10-21 10:30:21 gfx906-1 Exiting because "error on load"
2020-10-21 10:30:21 gfx906-1 Bye
               (LaurV: automatic restart here, from batch, unattended)
2020-10-21 10:30:22 Note: not found 'config.txt'
2020-10-21 10:30:22 config: -device 1 -log 500000 -B1 1500000 -rB2 30 -nospin 
2020-10-21 10:30:22 device 1, unique id ''
2020-10-21 10:30:23 gfx906-1 105223961 FFT: 5.50M 1K:11:256 (18.25 bpw)
2020-10-21 10:30:23 gfx906-1 Expected maximum carry32: 537A0000
2020-10-21 10:30:24 gfx906-1 OpenCL args "-DEXP=105223961u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DPM1=0 -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0xa.fee4bc79511d8p-4 -DIWEIGHT_STEP_MINUS_1=-0xd.08b4483e8adf8p-5  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-10-21 10:30:24 gfx906-1 ASM compilation failed, retrying compilation using NO_ASM
2020-10-21 10:30:28 gfx906-1 OpenCL compilation in 3.91 s
2020-10-21 10:30:29 gfx906-1 105223961 OK 101500000 loaded: blockSize 400, c48653f37e373f1c
2020-10-21 10:30:29 gfx906-1 validating proof residues for power 8
2020-10-21 10:30:40 gfx906-1 checksum ccb7b496 (expected d89ff127) in '.\105223961\proof\45213520'
2020-10-21 10:30:40 gfx906-1 validating proof residues for power 9
2020-10-21 10:30:40 gfx906-1 Can't open '.\105223961\proof\205516' (mode 'rb')
(note: I have a file 2055160 in proofs, but not the one he looks for)
2020-10-21 10:30:40 gfx906-1 validating proof residues for power 8
2020-10-21 10:30:49 gfx906-1 checksum ccb7b496 (expected d89ff127) in '.\105223961\proof\45213520'
2020-10-21 10:30:49 gfx906-1 validating proof residues for power 7
2020-10-21 10:30:49 gfx906-1 Can't open '.\105223961\proof\822063' (mode 'rb')
(note: I have a file 822064 in proofs, as well as 8220640, but not the one he looks for)
2020-10-21 10:30:49 gfx906-1 validating proof residues for power 6
2020-10-21 10:30:49 gfx906-1 Can't open '.\105223961\proof\1644125' (mode 'rb')
(note: I have a file 1644128 in proofs, as well as 16448280, but not the one he looks for)
2020-10-21 10:30:49 gfx906-1 Proof disabled because of missing checkpoints
(note: GRRRRR !!!)
2020-10-21 10:30:50 gfx906-1 105223961 OK 101500800  96.46%;  795 us/it; ETA 0d 00:49; 2143028ff0dc13a0 (check 0.52s)
2020-10-21 10:38:02 gfx906-1 105223961 OK 102000000  96.94%;  864 us/it; ETA 0d 00:46; ea073e998fe1546c (check 0.54s)
2020-10-21 10:45:16 gfx906-1 105223961 OK 102500000  97.41%;  867 us/it; ETA 0d 00:39; 1c4a6f0c49b1aca3 (check 0.54s)
2020-10-21 10:52:30 gfx906-1 105223961 OK 103000000  97.89%;  868 us/it; ETA 0d 00:32; 71b5beb1aea034c0 (check 0.54s)
2020-10-21 10:59:46 gfx906-1 105223961 OK 103500000  98.36%;  871 us/it; ETA 0d 00:25; 3104ec825d70e48b (check 0.54s)
2020-10-21 11:07:03 gfx906-1 105223961 OK 104000000  98.84%;  872 us/it; ETA 0d 00:18; f0f551c657d8f527 (check 0.54s)
2020-10-21 11:14:19 gfx906-1 105223961 OK 104500000  99.31%;  871 us/it; ETA 0d 00:11; f940620e8f7bf81d (check 0.54s)
2020-10-21 11:21:35 gfx906-1 105223961 OK 105000000  99.79%;  871 us/it; ETA 0d 00:03; 14867a51579a32e8 (check 0.54s)
2020-10-21 11:24:50 gfx906-1 CC 105223961 / 105223961, c22f2d8e6c6f____
2020-10-21 11:24:51 gfx906-1 105223961 OK 105224000 100.00%;  872 us/it; ETA 0d 00:00; c860e91a1ae2afee (check 0.51s)
2020-10-21 11:24:51 gfx906-1 {"status":"C", "exponent":"105223961", "worktype":"PRP-3", "res64":"c22f2d8e6c6f____", "residue-type":"1", "errors":{"gerbicz":"0"}, "fft-length":"5767168", "program":{"name":"gpuowl", "version":"v6.11-380-g79ea0cc"}, "computer":"gfx906-1", "aid":"<bleh bleh bleh>", "timestamp":"2020-10-21 04:24:51 UTC"}
Well... so much for "no double check needed"...
Edit: this also raises the question of how the next guy who does the PRP will be credited. (Either the same test won't be accepted by the server and will be considered a duplicate of the first, or, if it is accepted, I could log in with a fake account and send it again.) Which brings us back to the "random shi[f]t" issue... hihi

Last fiddled with by LaurV on 2020-10-21 at 17:27 Reason: obfuscate res64

2020-10-21, 13:59   #2
preda
"Mihai Preda"
Apr 2015

2⁴×83 Posts

The problem is that one proof checkpoint is corrupted:

checksum ccb7b496 (expected d89ff127) in '.\105223961\proof\45213520'

Now I don't know *why* it is corrupted. This is supposed to become more solid in v7.
You also have some strange errors on that GPU; it would be interesting to understand why.
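
For readers unfamiliar with this kind of check, here is a minimal sketch of how a stored checksum can catch a corrupted checkpoint. It assumes a purely hypothetical layout (payload bytes followed by a 4-byte little-endian CRC32) and hypothetical helper names `write_with_crc` and `verify_crc`; gpuowl's real file format may well differ.

```python
import struct
import zlib
from pathlib import Path
from typing import Tuple


def write_with_crc(path: Path, payload: bytes) -> None:
    """Write payload followed by its CRC32 (hypothetical layout, not gpuowl's)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    path.write_bytes(payload + struct.pack("<I", crc))


def verify_crc(path: Path) -> Tuple[bool, int, int]:
    """Return (ok, stored_crc, computed_crc) for a file written by write_with_crc."""
    data = path.read_bytes()
    if len(data) < 4:
        raise ValueError(f"{path} is too short to hold a checksum")
    payload = data[:-4]
    (stored,) = struct.unpack("<I", data[-4:])
    computed = zlib.crc32(payload) & 0xFFFFFFFF
    return stored == computed, stored, computed
```

A single flipped bit anywhere in the payload makes the two values disagree, which is the kind of mismatch the log line reports. It says nothing about when or why the file went bad.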

The DC will be done in the normal way, by mprime, with a non-zero offset (shift), and probably with a proof too, so it's not all for nought (because that DC will be delayed, under very strong suspicion that this particular exponent ain't prime).

Last fiddled with by preda on 2020-10-21 at 14:05

2020-10-22, 08:35   #3
aheeffer
Aug 2020

25 Posts

I had the same error two days ago:

Code:
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 8
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\423938' (mode 'rb')
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 9
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\211969' (mode 'rb')
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 8
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\423938' (mode 'rb')
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 7
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\847875' (mode 'rb')
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 6
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\1695750' (mode 'rb')
2020-10-20 16:47:29 Rig-RadeonVII-01 Proof disabled because of missing checkpoints
2020-10-20 16:47:31 Rig-RadeonVII-01 108527987 OK 108400800  99.88%;  907 us/it; ETA 0d 00:02; a9825c3b5612b3c4 (check 0.53s)
2020-10-20 16:47:55 Rig-RadeonVII-01 108527987 P2 GCD: no factor
2020-10-20 16:47:55 Rig-RadeonVII-01 {"status":"NF", "exponent":"108527987", "worktype":"PM1", "B1":"1000000", "B2":"30000000", "fft-length":"6291456", "program":{"name":"gpuowl", "version":"v6.11-380-g79ea0cc"}, "user":"al", "computer":"Rig-RadeonVII-01", "aid":"15F3DD427BAC86126A8F2CA7BED4CBCA", "timestamp":"2020-10-20 14:47:55 UTC"}
2020-10-20 16:49:26 Rig-RadeonVII-01 CC 108527987 / 108527987, ead5bc32de4b____
2020-10-20 16:49:26 Rig-RadeonVII-01 108527987 OK 108528000 100.00%;  905 us/it; ETA 0d 00:00; b95ab348db62e1eb (check 0.52s)
2020-10-20 16:49:26 Rig-RadeonVII-01 {"status":"C", "exponent":"108527987", "worktype":"PRP-3", "res64":"ead5bc32de4b____", "residue-type":"1", "errors":{"gerbicz":"0"}, "fft-length":"6291456", "program":{"name":"gpuowl", "version":"v6.11-380-g79ea0cc"}, "user":"al", "computer":"Rig-RadeonVII-01", "aid":"15F3DD427BAC86126A8F2CA7BED4CBCA", "timestamp":"2020-10-20 14:49:26 UTC"}

Last fiddled with by preda on 2020-10-22 at 19:30 Reason: obfuscate res64

2020-10-22, 17:07   #4
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·3²·263 Posts

Laurv, aheefer, everyone, please indicate version and commit number in the same post as reporting an error.

2020-10-22, 19:33   #5
preda
"Mihai Preda"
Apr 2015

530₁₆ Posts

Quote:
Originally Posted by aheeffer
I had the same error two days ago:
Not the same error: while Laur had a checksum failure on one of the proof checkpoints, in your situation one checkpoint file is simply missing (but there's no bad checksum).

Laur had:
checksum ccb7b496 (expected d89ff127) in '.\105223961\proof\45213520'
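
To tell the two cases apart when skimming a long log, something like the sketch below could help. The regular expressions are modelled only on the two kinds of line quoted in this thread; the function name `classify` and the labels "corrupt" and "missing" are made up for illustration, and other gpuowl versions may word these messages differently.

```python
import re
from typing import Optional, Tuple

# Patterns copied from the two kinds of log line quoted in this thread.
CHECKSUM_RE = re.compile(
    r"checksum (?P<got>[0-9a-f]+) \(expected (?P<want>[0-9a-f]+)\) in '(?P<path>[^']+)'"
)
MISSING_RE = re.compile(r"Can't open '(?P<path>[^']+)'")


def classify(line: str) -> Optional[Tuple[str, str]]:
    """Return ("corrupt", path) or ("missing", path), or None for other lines."""
    match = CHECKSUM_RE.search(line)
    if match:
        return "corrupt", match.group("path")
    match = MISSING_RE.search(line)
    if match:
        return "missing", match.group("path")
    return None


if __name__ == "__main__":
    samples = [
        r"2020-10-21 10:30:40 gfx906-1 checksum ccb7b496 (expected d89ff127) in '.\105223961\proof\45213520'",
        r"2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\423938' (mode 'rb')",
    ]
    for sample in samples:
        print(classify(sample))
```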

2020-10-22, 23:04   #6
preda
"Mihai Preda"
Apr 2015

2⁴×83 Posts

Quote:
Originally Posted by aheeffer
I had the same error two days ago:

Code:
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 8
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\423938' (mode 'rb')
The value that is not found, 423938, is the very first in the set of proof checkpoints for power 8. Presumably none are present. Could you please check what is in the folder 108527987\proof\ ? Did you move, rename or remove anything, or was proof generation previously disabled for some reason?
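
The file names in the "Can't open" lines of this thread are all consistent with the first checkpoint for a given power being ceil(E / 2^power): for E = 108527987 that gives 423938, 211969, 847875 and 1695750 for powers 8, 9, 7 and 6. The relation is inferred from the numbers in the logs, not taken from the gpuowl source, so treat the sketch below as an assumption. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable


def first_checkpoint(exponent: int, power: int) -> int:
    """ceil(exponent / 2**power), computed with integer arithmetic only."""
    return -(-exponent // (1 << power))


def missing_first_checkpoints(
    proof_dir: Path, exponent: int, powers: Iterable[int] = (6, 7, 8, 9)
) -> Dict[int, int]:
    """Map power -> first checkpoint name for every power whose file is absent."""
    missing = {}
    for power in powers:
        name = first_checkpoint(exponent, power)
        if not (proof_dir / str(name)).is_file():
            missing[power] = name
    return missing


if __name__ == "__main__":
    # Lists which first checkpoints are absent from the folder named in the thread.
    print(missing_first_checkpoints(Path("108527987") / "proof", 108527987))
```

The same step size also fits Laur's log: for E = 105223961 and power 8 the step is 411032, and 110 × 411032 = 45213520, the name of the file with the bad checksum.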

2020-10-23, 02:40   #7
LaurV
Romulan Interpreter
Jun 2011
Thailand

2²·7·11·29 Posts

Quote:
Originally Posted by kriesel
Laurv, aheefer, everyone, please indicate version and commit number in the same post as reporting an error.
Grrrr ....

Scroll the code snippet horizontally. The version of the program still appears in the report line.
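
For anyone who would rather not scroll, the version can be pulled out of the report line mechanically. This is a small sketch that assumes the report line ends with a single JSON object shaped like the ones quoted in this thread; `program_version` is a hypothetical helper name.

```python
import json


def program_version(log_line: str) -> str:
    """Extract 'name version' from a report line that embeds a JSON object."""
    start = log_line.index("{")  # the timestamp and host name precede the JSON
    result = json.loads(log_line[start:])
    program = result["program"]
    return f'{program["name"]} {program["version"]}'


if __name__ == "__main__":
    # Shortened copy of the report line from the first post.
    line = (
        '2020-10-21 11:24:51 gfx906-1 {"status":"C", "exponent":"105223961", '
        '"worktype":"PRP-3", "program":{"name":"gpuowl", '
        '"version":"v6.11-380-g79ea0cc"}, "computer":"gfx906-1"}'
    )
    print(program_version(line))
```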

Last fiddled with by LaurV on 2020-10-23 at 02:48

2020-10-23, 09:49   #8
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×3²×263 Posts

Quote:
Originally Posted by LaurV
Grrrr ....

Scroll the code snippet horizontally. The version of the program still appears in the report line.
Yeah, sorry, didn't see that before I posted.
That likelihood of its being overlooked, plus saving the labor of N readers with a single copy-paste, and providing complete, convenient info, is why good writers of bug reports put the software name and version, the OS, and maybe the hardware involved at the front of a bug report, anticipating questions. And not all bug reports include report lines for readers to go sleuthing through, five code-box widths wide.

Last fiddled with by kriesel on 2020-10-23 at 09:53

2020-10-23, 10:11   #9
retina
Undefined
"The unspeakable one"
Jun 2006
My evil lair

5,879 Posts

Quote:
Originally Posted by kriesel
That likelihood of its being overlooked, plus saving the labor of N readers with a single copy-paste, and providing complete, convenient info, is why good writers of bug reports put the software name and version, the OS, and maybe the hardware involved at the front of a bug report, anticipating questions. And not all bug reports include report lines for readers to go sleuthing through, five code-box widths wide.
Great job of blaming the victim.

2020-10-23, 13:34   #10
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4734₁₀ Posts

Bug reports are intended at least partly for the software authors, yes?
The time of these rare few very talented programmers is precious.
Let's make an effort to use it well.
The post writer's effort to make information accessible is spent once per post. The readers' effort is spent by every reader, at every reading.

Uh, thanks, retina, I guess, for the nudge. I rewrote the part of https://www.mersenneforum.org/showth...664#post521664 that deals with this aspect, to read:
Quote:
Make an effort to provide an easily read complete set of the needed context information in the same post with a question. If you're asking why something is not working how you expect, tell us at the beginning what software you're asking about, what version of the software, what OS you're running it on, what OS version or flavor, what hardware, parameters it's having difficulties with, and any other pertinent information. If asking about Linux, what version of what distribution. In the case of a gpu related question, include the gpu model, driver name and version, and perhaps hardware specs that are relevant (gpu ram for example, or NVIDIA compute capability level). A little time spent once, providing that info can save many readers and the original poster a little time each, and reduce the need for Q&A that sometimes follows when such information is missing or hidden away somewhat in a very long code box line.
Finally, apologies to aheeffer for misspelling his forum name.

Last fiddled with by kriesel on 2020-10-23 at 14:04

2020-10-23, 13:57   #11
aheeffer
Aug 2020

25 Posts

Quote:
Originally Posted by preda
The value that is not found, 423938, is the very first in the set of proof checkpoints for power 8. Presumably none are present. Could you please check what is in the folder 108527987\proof\ ? Did you move, rename or remove anything, or was proof generation previously disabled for some reason?
I now understand what happened. It is the same thing as with another exponent I reported in the gpuowl thread.

There was a CR/LF missing at the end of the 'worktodo.txt' file in the 'pool' folder. This has happened to me before, and then gpuowl just stops. In this case, it looked at the 'worktodo.txt' file in the local folder and started the same exponent again. Having found the proof folder, it complained about the missing checkpoints.

In the other case, it found an old local 'worktodo.txt' with an expired exponent. I lost a few days' work.
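
A guard against the missing line ending could look like the sketch below. It only appends a final newline when the file is non-empty and lacks one, and it knows nothing about the worktodo format itself; `ensure_trailing_newline` is a hypothetical helper, not part of gpuowl.

```python
from pathlib import Path


def ensure_trailing_newline(path: Path) -> bool:
    """Append a newline if the file is non-empty and lacks one; return True if changed."""
    data = path.read_bytes()
    if not data or data.endswith(b"\n"):
        return False
    # Follow the file's existing line-ending style where one is visible.
    newline = b"\r\n" if b"\r\n" in data else b"\n"
    path.write_bytes(data + newline)
    return True
```

Running it on the pool 'worktodo.txt' before gpuowl picks the file up would, under that assumption, avoid the situation described here.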

Sorry about this.
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.