#12 |
∂²ω=0
Sep 2002
República de California
3²·1,303 Posts |
@Ken: stalling at the start of p-1 stage 2 sounds suspiciously like some kind of mem-alloc failure ... your run diagnostics show stage 2 trying to alloc ~11GB of memory, which should be OK for an R7 with its 16GB of HBM, but it is still close to 70% of the available memory. Were you running just this one gpuowl instance on the card, or was there a 2nd also running?
Under linux, the rocm-smi utility shows - among other things - VRAM usage of each installed card ... do the GPU utils under Windows have similar diagnostics? It would be interesting to keep an eye on the VRAM %, both under normal circumstances and the next time you hit this kind of stall/massive-slowdown. |
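For logging the VRAM percentage over time on Linux, one option is to read the counters that the amdgpu driver exposes through sysfs. This is a minimal sketch, assuming the card shows up as `card0` and that the `mem_info_vram_used` and `mem_info_vram_total` files are present. The function names and the default path are illustrative, not part of gpuowl or rocm-smi.

```python
from pathlib import Path


def vram_percent(used_bytes: int, total_bytes: int) -> float:
    """Return VRAM usage as a percentage of the total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def read_vram_percent(device_dir: str = "/sys/class/drm/card0/device") -> float:
    """Read used/total VRAM (in bytes) from amdgpu sysfs files.

    The default directory is an assumption; the card index differs
    between systems.
    """
    base = Path(device_dir)
    used = int((base / "mem_info_vram_used").read_text().strip())
    total = int((base / "mem_info_vram_total").read_text().strip())
    return vram_percent(used, total)


if __name__ == "__main__":
    # The ~11GB stage 2 allocation on a 16GB card discussed above:
    print(f"{vram_percent(11 * 2**30, 16 * 2**30):.2f}%")
```

Calling `read_vram_percent()` from a loop with a sleep, and appending the value to a file, would give a record to compare against the next stall.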
#13 |
"Mihai Preda"
Apr 2015
566₁₆ Posts |
Quite annoying. I don't know what's causing the stall. I can think of these possibilities:
1. it's stuck on OpenCL queue->finish(). Yet it is strange that it reacts correctly to Ctrl-C.
2. it's stuck waiting for the GCD future from the other thread. This is not supposed to happen, as the future is checked with a wait-time of zero, but who knows.
3. it's not stuck, it's progressing extremely slowly because OpenCL allocated the buffers in host memory.

It would be interesting to see, when it's stuck:
- is there any load on the GPU?
- what's the RAM allocation on the GPU (is it 100% full?)
- is the CPU 100% one-thread?

Probably more logging or running under the debugger may be needed to pinpoint it.

Anyway, I'm also revamping the P-1 as we speak. I would not recommend doing big P-1 right now, because stage 1 is just about to become *a lot* cheaper. Also, new bugs may appear and old ones go away as the code is reworked.
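To illustrate what "checked with a wait-time of zero" means in possibility 2, here is a small Python analogue. It is not gpuowl's code (gpuowl is C++); `poll_result`, `slow_task` and `demo` are hypothetical names, and the background task merely stands in for a GCD running on another thread.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional, Tuple


def poll_result(future: Future) -> Optional[int]:
    """Return the result if the background task has finished, else None.

    done() returns immediately in both cases, so the caller never
    blocks on a task that is still running.
    """
    if future.done():
        return future.result()
    return None


def demo() -> Tuple[Optional[int], Optional[int]]:
    release = threading.Event()

    def slow_task() -> int:
        release.wait()  # stands in for a long-running GCD
        return 1

    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(slow_task)
        before = poll_result(fut)  # task still running: no result yet
        release.set()              # let the task finish
        fut.result()               # block here only to make the demo deterministic
        after = poll_result(fut)   # task finished: result available
    return before, after
```

With this pattern the main loop keeps issuing GPU work between polls, which is why a hang on the future itself would be unexpected.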
#14 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·5·647 Posts |
Quote:
GPU 0 (radeonvii0) ran a prp proof build on a 35M exponent around 22:28; that memory demand completed ~20 minutes before the hang on radeonvii (the second gpu) occurred. When the second gpu hung at the 554M P-1 stage 1 gcd, the first gpu was early in a PRP run of a 36M exponent. The third gpu, a 5700XT, was in the midst of a 47M exponent prp at 22:48, and had a normal run. (SO glad that Preda agreed to build in gpuowl logging with 4-digit year, mm, dd, and time to the second, for examining questions such as this.)

GPU-Z can do real-time display and logging of various sensors, including gpu ram utilization. I did not have that logging active due to its overhead. It could take 3 sessions to log the 3 gpus. Currently, the 554M P-1 stage 2 that is still running shows 14913MB committed on the gpu.

On the system side, MS has annoyingly either removed the logging capabilities of Task Manager, Resource Monitor, and Performance Monitor or made them very difficult to find in Windows 10, possibly earlier. Those too would have overhead. They were not running when the stall occurred. Currently there's plenty of system ram available; 3GB committed out of 16GB physical ram.

Last fiddled with by kriesel on 2020-09-01 at 23:51
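Because every gpuowl log line carries a full timestamp, stalls can be located after the fact by looking for large gaps between consecutive lines. This is a rough sketch, assuming each line starts with `YYYY-MM-DD HH:MM:SS`; the function names and the 10-minute threshold are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def parse_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' of a log line.

    Raises ValueError for lines that do not start with a timestamp.
    """
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def find_gaps(
    lines: Iterable[str],
    threshold: timedelta = timedelta(minutes=10),
) -> List[Tuple[datetime, datetime]]:
    """Return (before, after) timestamp pairs separated by more than threshold."""
    stamps = [parse_timestamp(line) for line in lines if line.strip()]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]
```

Running `find_gaps` over each gpu's gpuowl.log and comparing the reported gaps across the three cards would show which instance stopped logging first.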
#15 | |
∂²ω=0
Sep 2002
República de California
3²·1,303 Posts |
Quote:
#16 | |
"CharlesgubOB"
Jul 2009
Germany
3³·23 Posts |
Quote:
104M / 107M exponent for this card. THX
#17 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·5·647 Posts |
Quote:
Note, none of the gpus have display driving duties. That's on the igp/on-motherboard vga jack, and only used when I can't access remotely, or rarely delving into BIOS setup.

The system stalled 1 gpu ~9:30 last night, and the other 2 about an hour later, so I lost more than a gpu-day overnight. Actually that was the same gpu as had stalled during the P-1 first, followed by a bugcheck 0x116 by 11pm, according to the system log. https://docs.microsoft.com/en-us/win...eo-tdr-failure. Should not be having such issues with my newest Radeon VII, but there it is.

Perhaps I'll hit one of those eventually. It has other things to do now.

Last fiddled with by kriesel on 2020-09-03 at 15:38
#18 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·5·647 Posts |
Another bugcheck 0x116 hit during proof calculation for 58M on the 5700xt, within a few hours of the restart of all gpu work.

Empty worktodo.txt (worktodo.txt.bak to the rescue). Gpuowl did not write the PRP test result to results.txt. Fortunately it did write a complete result to gpuowl.log before the system went down. (The one written to the log upon restart lacked md5 and perhaps more.)

Again, the second RadeonVII in the system hung; its last gpuowl log entry is ~75 minutes before the bugcheck. It was in the midst of running a PRP on a ~92M exponent, not demanding of gpu ram. Perfmon is running now, with an attempt at logging numerous counters. Possibly the Celeron is too slow to service these? Or the very new Radeon VII gpu is just flaky.

This 5700xt is doing 1200-1290 us/it on 3M PRP/GEC; has done 4320 us/it on 10M; 23350 us/it on 48M; 91 us/it on 128K; 936-952 us/it on 2.25M; 1086 us/it on 2.5M.

Last fiddled with by kriesel on 2020-09-03 at 16:42
#19 |
"CharlesgubOB"
Jul 2009
Germany
1155₈ Posts |
#20 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1100101000110₂ Posts |
Quote:
#21 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
14506₈ Posts |
Code:
2020-09-04 05:21:18 asr2/radeonvii0 63000061 OK 62800000 99.68%; 571 us/it; ETA 0d 00:02; 51ed11fe314283a4 (check 0.36s)
2020-09-04 05:23:12 asr2/radeonvii0 63000061 OK 63000000 100.00%; 570 us/it; ETA 0d 00:00; 38291a38c1ee58ea (check 0.37s)
2020-09-04 05:23:13 asr2/radeonvii0 CC 63000061 / 63000061, 29e72206ff10f85c
2020-09-04 05:23:13 asr2/radeonvii0 63000061 OK 63000400 100.00%; 1379 us/it; ETA 0d 00:00; e38f014ca61ea59a (check 0.35s)
2020-09-04 05:23:13 asr2/radeonvii0 proof: building level 1, hash d05118eb6e056555
2020-09-04 05:23:14 asr2/radeonvii0 proof: building level 2, hash deeea1d84b278de5
2020-09-04 05:23:14 asr2/radeonvii0 proof: building level 3, hash 03d2404ceb9d6c8c
2020-09-04 05:23:15 asr2/radeonvii0 proof: building level 4, hash c04e38efcc9776a6
2020-09-04 05:23:17 asr2/radeonvii0 proof: building level 5, hash bbff815aad25aa9e
2020-09-04 05:23:19 asr2/radeonvii0 proof: building level 6, hash a475da8b3558062d
2020-09-04 05:23:25 asr2/radeonvii0 proof: building level 7, hash fb304242a5b31aed
2020-09-04 05:23:35 asr2/radeonvii0 proof: building level 8, hash 703e0631438280d2
2020-09-04 05:23:39 asr2/radeonvii0 checksum de9da1a4 (expected 75062f10) in '.\63000061\proof\14519546'
2020-09-04 05:23:39 asr2/radeonvii0 Exception NSt10filesystem7__cxx1116filesystem_errorE: filesystem error: checksum mismatch: No error

Code:
2020-09-04 07:51:14 gpuowl v6.11-364-g36f4e2a
2020-09-04 07:51:14 config: -user kriesel -cpu asr2/radeonvii0 -d 0 -maxAlloc 15000 -proof 8
2020-09-04 07:51:14 device 0, unique id ''
2020-09-04 07:51:14 asr2/radeonvii0 63000061 FFT: 3.25M 256:13:512 (18.49 bpw)
2020-09-04 07:51:14 asr2/radeonvii0 Expected maximum carry32: 49920000
2020-09-04 07:51:15 asr2/radeonvii0 OpenCL args "-DEXP=63000061u -DWIDTH=256u -DSMALL_HEIGHT=512u -DMIDDLE=13u -DPM1=0 -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0xd.ad1ea7b4d5158p-5 -DIWEIGHT_STEP_MINUS_1=-0x9.94d31fb3b12ep-5 -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-09-04 07:51:15 asr2/radeonvii0 ASM compilation failed, retrying compilation using NO_ASM
2020-09-04 07:51:21 asr2/radeonvii0 OpenCL compilation in 6.17 s
2020-09-04 07:51:21 asr2/radeonvii0 63000061 OK 63000000 loaded: blockSize 400, 38291a38c1ee58ea
2020-09-04 07:51:21 asr2/radeonvii0 validating proof residues for power 8
2020-09-04 07:51:25 asr2/radeonvii0 checksum de9da1a4 (expected 75062f10) in '.\63000061\proof\14519546'
2020-09-04 07:51:25 asr2/radeonvii0 validating proof residues for power 9
2020-09-04 07:51:25 asr2/radeonvii0 Can't open '.\63000061\proof\123047' (mode 'rb')
2020-09-04 07:51:25 asr2/radeonvii0 validating proof residues for power 8
2020-09-04 07:51:28 asr2/radeonvii0 checksum de9da1a4 (expected 75062f10) in '.\63000061\proof\14519546'
2020-09-04 07:51:28 asr2/radeonvii0 validating proof residues for power 7
2020-09-04 07:51:35 asr2/radeonvii0 Proof using power 7 (vs 8) for 63000061
2020-09-04 07:51:35 asr2/radeonvii0 CC 63000061 / 63000061, 29e72206ff10f85c
2020-09-04 07:51:36 asr2/radeonvii0 63000061 OK 63000400 100.00%; 1344 us/it; ETA 0d 00:00; e38f014ca61ea59a (check 0.35s)
2020-09-04 07:51:36 asr2/radeonvii0 proof: building level 1, hash d05118eb6e056555
2020-09-04 07:51:36 asr2/radeonvii0 proof: building level 2, hash deeea1d84b278de5
2020-09-04 07:51:37 asr2/radeonvii0 proof: building level 3, hash 03d2404ceb9d6c8c
2020-09-04 07:51:38 asr2/radeonvii0 proof: building level 4, hash c04e38efcc9776a6
2020-09-04 07:51:39 asr2/radeonvii0 proof: building level 5, hash bbff815aad25aa9e
2020-09-04 07:51:42 asr2/radeonvii0 proof: building level 6, hash a475da8b3558062d
2020-09-04 07:51:47 asr2/radeonvii0 proof: building level 7, hash fb304242a5b31aed
2020-09-04 07:51:57 asr2/radeonvii0 PRP-Proof 'proof\63000061-7.proof' generated
2020-09-04 07:51:57 asr2/radeonvii0 Proof: cleaning up temporary storage
2020-09-04 07:51:58 asr2/radeonvii0 {"status":"C", "exponent":"63000061", "worktype":"PRP-3", "res64":"29e72206ff10f8__", "residue-type":"1", "errors":{"gerbicz":"0"}, "fft-length":"3407872", "proof":{"version":"1", "power":"7", "hashsize":"64", "md5":"5a2ca135f73dbdbf9f4c85f8fba7____"}, "program":{"name":"gpuowl", "version":"v6.11-364-g36f4e2a"}, "user":"kriesel", "computer":"asr2/radeonvii0", "timestamp":"2020-09-04 12:51:58 UTC"}

Last fiddled with by kriesel on 2020-09-04 at 13:16
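The progress lines in these logs are regular enough to parse for the us/it figure. The following sketch assumes the line layout stays exactly as in the lines shown here (the regular expression is fitted to them and is not an official gpuowl format specification) and that a PRP test needs roughly one iteration per bit of the exponent. `parse_progress` and `full_run_hours` are illustrative names.

```python
import re
from typing import Dict, Optional

# Fitted to progress lines such as:
# 2020-09-04 05:21:18 asr2/radeonvii0 63000061 OK 62800000 99.68%; 571 us/it; ...
PROGRESS = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<exponent>\d+) OK\s+(?P<iteration>\d+)\s+"
    r"(?P<percent>[\d.]+)%; (?P<us_per_it>\d+) us/it;"
)


def parse_progress(line: str) -> Optional[Dict[str, int]]:
    """Extract exponent, iteration and us/it from a progress line, else None."""
    m = PROGRESS.match(line)
    if m is None:
        return None
    return {
        "exponent": int(m["exponent"]),
        "iteration": int(m["iteration"]),
        "us_per_it": int(m["us_per_it"]),
    }


def full_run_hours(exponent: int, us_per_it: float) -> float:
    """Estimate hours for a whole PRP run, assuming about `exponent` iterations."""
    return exponent * us_per_it / 1e6 / 3600
```

At 571 us/it, `full_run_hours(63000061, 571)` comes out just under 10 hours, which puts the few seconds spent rebuilding the proof at power 7 into perspective.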
#22 |
Random Account
Aug 2009
2×1,051 Posts |
I had a slightly older version stall about half way through a P-1 Stage 2. My 1080 only has 8 GB of RAM and I was trying to use 6. I changed "maxAlloc" to 5500. gpuOwl has not stalled since. The caveat is that it takes longer to run Stage 2.
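To compare settings across cards, it can help to express `maxAlloc` as a fraction of the card's memory. This is a back-of-the-envelope sketch, assuming `-maxAlloc` is given in MB (which the `-maxAlloc 15000` config on a 16GB card earlier in the thread suggests) and that an "8 GB" card has 8192 MB. `alloc_fraction` is a hypothetical helper, not a gpuowl option.

```python
def alloc_fraction(max_alloc_mb: int, vram_gb: int) -> float:
    """Fraction of a card's memory that a -maxAlloc value (in MB) permits."""
    if vram_gb <= 0:
        raise ValueError("vram_gb must be positive")
    return max_alloc_mb / (vram_gb * 1024)


if __name__ == "__main__":
    # 1080 with 8 GB, maxAlloc lowered to 5500
    print(f"{alloc_fraction(5500, 8):.1%}")
    # Radeon VII with 16 GB, maxAlloc 15000
    print(f"{alloc_fraction(15000, 16):.1%}")
```

By this measure the 5500 setting leaves roughly a third of the 1080's memory free, while 15000 on the Radeon VII leaves under a tenth.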