mersenneforum.org  

2020-09-01, 20:43   #12
ewmayer
2ω=0
Sep 2002
República de California
3·7·19·29 Posts

@Ken: stalling at the start of p-1 stage 2 sounds suspiciously like some kind of mem-alloc failure ... your run diagnostics show stage 2 trying to alloc ~11GB of memory, which should be OK for an R7 with its 16GB of HBM, but it is still close to 70% of the available memory. Were you running just this one gpuowl instance on the card, or was there a 2nd also running?

Under linux, the rocm-smi utility shows - among other things - VRAM usage of each installed card ... do the GPU utils under Windows have similar diagnostics? It would be interesting to keep an eye on the VRAM %, both under normal circumstances and the next time you hit this kind of stall/massive-slowdown.
2020-09-01, 22:15   #13
preda
"Mihai Preda"
Apr 2015
3·443 Posts

Quote:
Originally Posted by kriesel
An example of gpuowl stalling
Quite annoying. I don't know what's causing the stall. I can think of these possibilities:

1. it's stuck on OpenCL queue->finish(). Yet strange that it reacts correctly to Ctrl-C.
2. it's stuck waiting for the GCD future from the other thread. This is not supposed to happen, as the future is checked with a wait-time of zero, but who knows.
3. it's not stuck, it's progressing extremely slowly because OpenCL allocated the buffers in host memory.
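
Possibility 2 relies on polling a future with a wait time of zero. As a rough illustration of that pattern only (gpuowl is written in C++ and this is not its code), a minimal Python sketch follows. The names `slow_gcd_stand_in`, `poll_once` and `run_demo` are made up for the example, and the sleep calls merely stand in for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_gcd_stand_in(delay_s):
    """Hypothetical stand-in for a long GCD computed on a helper thread."""
    time.sleep(delay_s)
    return "gcd done"


def poll_once(future):
    """Check the future with a wait time of zero, so the caller never blocks."""
    try:
        return future.result(timeout=0)
    except FutureTimeout:
        return None  # not ready yet; the caller carries on with other work


def run_demo(delay_s=0.2):
    """Poll the helper thread between units of main-thread work."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_gcd_stand_in, delay_s)
        polls = 0
        result = poll_once(future)
        while result is None:
            polls += 1
            time.sleep(0.01)  # stand-in for one unit of main-thread work
            result = poll_once(future)
        return polls, result


print(run_demo())
```

With this pattern the main loop keeps running whether or not the helper has finished, which is why a hang at this point would be unexpected.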

It would be interesting to see, when it's stuck:
- is there any load on the GPU?
- what's the RAM allocation on the GPU (is it 100% full?)
- is one CPU thread at 100%?

Probably more logging or running under the debugger may be needed to pinpoint it.

Anyway, I'm also revamping the P-1 as we speak. I would not recommend doing big P-1 right now, because stage 1 is just about to become *a lot* cheaper. Also, new bugs may appear and old ones go away as the code is reworked.
2020-09-01, 23:45   #14
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×5³×19 Posts

Quote:
Originally Posted by ewmayer
@Ken: stalling at the start of p-1 stage 2 sounds suspiciously like some kind of mem-alloc failure ...
Were you running just this one gpuowl instance on the card, or was there a 2nd also running?
System had 3 gpus at the moment; each was running a single instance of gpuowl.

GPU 0 (radeonvii0) ran a PRP proof build on a 35M exponent around 22:28; that memory demand completed ~20 minutes before the hang on radeonvii (the second gpu) occurred.
When the second gpu hung at 554M P-1 stage 1 gcd, the first gpu was early in a PRP run of a 36M exponent.
The third gpu, a 5700XT, was in the midst of a 47M exponent PRP at 22:48, and had a normal run.
(SO glad that Preda agreed to build in gpuowl logging with 4-digit year, mm, dd, and time to the second, for examining questions such as this.)

GPU-Z can do real-time display and logging of various sensors, including gpu ram utilization. I did not have that logging active due to its overhead. It could take 3 sessions to log the 3 gpus. Currently, the 554M P-1 stage 2 that is still running shows 14913MB committed on the gpu.

On the system side, MS has annoyingly either removed the logging capabilities of Task Manager, Resource Monitor, and Performance Monitor or made them very difficult to find in Windows 10, possibly earlier. Those too would have overhead. They were not running when the stall occurred. Currently there's plenty of system ram available; 3GB committed out of 16GB physical ram.
Attached thumbnail: 554M P-1 stage 2 normal operation.png

Last fiddled with by kriesel on 2020-09-01 at 23:51
2020-09-02, 21:19   #15
ewmayer
2ω=0
Sep 2002
República de California
11571₁₀ Posts

Quote:
Originally Posted by kriesel
On the system side, MS has annoyingly either removed the logging capabilities of Task Manager, Resource Monitor, and Performance Monitor or made them very difficult to find in Windows 10, possibly earlier. Those too would have overhead. They were not running when the stall occurred. Currently there's plenty of system ram available; 3GB committed out of 16GB physical ram.
Can you fiddle the dedicated-mem-used entry in that list to show as a %? Then next time you hit a stall - maybe do some p-1-only for a while on the R7 in an effort to trigger the problem - recheck that entry both in absolute and % terms. You're very close to the 16GB limit in normal-run mode, perhaps some kind of VRAM-allocation 'feature' in the card-mgmt is trying to reserve more for one reason or another, which could easily put you over the limit.
2020-09-02, 21:33   #16
moebius
Jul 2009
Germany
715₈ Posts

Quote:
Originally Posted by kriesel
The third gpu, a 5700XT, was in the midst of a 47M exponent prp at 22:48, and had a normal run.
And I thought OpenCL would not run on the 5700 XT because of driver problems. Can you please post a gpuowl benchmark for this card at FFT 5.0M, 5.5M and 6.0M, e.g. 104M / 107M exponents?
THX
2020-09-03, 15:26   #17
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×5³×19 Posts

Quote:
Originally Posted by ewmayer
Can you fiddle the dedicated-mem-used entry in that list to show as a %? Then next time you hit a stall - maybe do some p-1-only for a while on the R7 in an effort to trigger the problem - recheck that entry both in absolute and % terms. You're very close to the 16GB limit in normal-run mode, perhaps some kind of VRAM-allocation 'feature' in the card-mgmt is trying to reserve more for one reason or another, which could easily put you over the limit.
If you mean the GPU-Z screen capture, no, it does not support displaying as a percentage. It's confirming that gpuowl is approaching the -maxAlloc 15000 I set in gpuowl's config.txt for the 16GB gpu, but not exceeding it, and generally, more than a GB of margin is enough in my experience. Maybe long term I'll try it at 14000 maxAlloc, increasing nominal margin from 1384MB to 2384MB (8.45% to 14.55%). I don't see much utility in computing percentages.
Note, none of the gpus have display driving duties. That's on the igp/on-motherboard vga jack, and only used when I can't access remotely, or rarely delving into BIOS setup.
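
The margin figures above can be checked with a few lines of Python. This is only a sketch: it assumes that 16GB means 16384 MB, which is what the quoted 1384MB and 8.45% imply, and `alloc_margin` is an illustrative helper name, not anything from gpuowl.

```python
def alloc_margin(total_mb, max_alloc_mb):
    """Return (margin in MB, margin as a percentage of total, 2 decimals)."""
    margin_mb = total_mb - max_alloc_mb
    return margin_mb, round(100.0 * margin_mb / total_mb, 2)


TOTAL_MB = 16 * 1024  # assumption: the 16GB card exposes 16384 MB

for max_alloc in (15000, 14000):
    margin, pct = alloc_margin(TOTAL_MB, max_alloc)
    print(f"-maxAlloc {max_alloc}: margin {margin} MB ({pct:.2f}%)")
```

The same helper applies to any card size, as long as the total is given in the same MB units as -maxAlloc.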

The system stalled 1 gpu ~9:30 last night, and the other 2 about an hour later, so I lost more than a gpu-day overnight. The gpu that stalled first was the same one that had stalled during the P-1; a bugcheck 0x116 followed by 11pm, according to the system log. https://docs.microsoft.com/en-us/win...eo-tdr-failure. I should not be having such issues with my newest Radeon VII, but there it is.

Quote:
Originally Posted by moebius
And I thought OpenCL would not run on the 5700 XT because of driver problems. Can you please post a gpuowl benchmark for this card at FFT 5.0M, 5.5M and 6.0M, e.g. 104M / 107M exponents?
THX
Perhaps I'll hit one of those eventually. It has other things to do now.

Last fiddled with by kriesel on 2020-09-03 at 15:38
2020-09-03, 16:33   #18
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
4750₁₀ Posts
well that was quick (and ugly)

Another bugcheck 0x116 hit during proof calculation for 58M on the 5700xt, within a few hours of the restart of all gpu work.
Empty worktodo.txt (worktodo.txt.bak to the rescue).
Gpuowl did not write the PRP test result to results.txt. Fortunately it did write a complete result to gpuowl.log though before the system went down. (The one written to log upon restart lacked md5 & perhaps more.)

Again, the second Radeon VII in the system hung; its last gpuowl log entry is ~75 minutes before the bugcheck. It was in the midst of running a PRP on a ~92M exponent, which is not demanding of gpu ram.

Perfmon running now, with an attempt at logging numerous counters.
Possibly the Celeron is too slow to service these? Or the very new Radeon VII gpu is just flaky.


This 5700xt is doing 1200-1290 us/it on 3M PRP/GEC;
has done 4320 us/it on 10M;
23350 us/it on 48M;
91 us/it on 128K;
936-952 us/it on 2.25M;
1086 us/it on 2.5M.
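
For comparing timings like these, it can help to convert us/it into iterations per second and into wall-clock hours for a run. A minimal sketch follows; the helper names are made up, and the iteration count is whatever the caller supplies (for a PRP test it is roughly the exponent).

```python
def iterations_per_second(us_per_it):
    """Convert microseconds per iteration to iterations per second."""
    return 1_000_000 / us_per_it


def hours_for(iterations, us_per_it):
    """Wall-clock hours needed for the given number of iterations."""
    return iterations * us_per_it / 1_000_000 / 3600


# Two of the 5700xt timings quoted above.
for us in (1200, 91):
    print(f"{us} us/it is about {iterations_per_second(us):.0f} it/s")
```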

Last fiddled with by kriesel on 2020-09-03 at 16:42
2020-09-03, 22:25   #19
moebius
Jul 2009
Germany
715₈ Posts

Quote:
Originally Posted by kriesel
This 5700xt is doing
Thx, but very poor OpenCL performance. I'm happy that I didn't buy one.
2020-09-03, 23:39   #20
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·5³·19 Posts

Quote:
Originally Posted by moebius
Thx, but very poor OpenCL performance. I'm happy that I didn't buy one.
At better than 1/3 of Radeon VII gpuowl performance, it ranks fairly high in JVR value per dollar https://www.mersenne.ca/cudalucas.php?sort=jvr
2020-09-04, 13:13   #21
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·5³·19 Posts
Checksum mismatch exception after proof build

Code:
2020-09-04 05:21:18 asr2/radeonvii0 63000061 OK 62800000  99.68%;  571 us/it; ETA 0d 00:02; 51ed11fe314283a4 (check 0.36s)
2020-09-04 05:23:12 asr2/radeonvii0 63000061 OK 63000000 100.00%;  570 us/it; ETA 0d 00:00; 38291a38c1ee58ea (check 0.37s)
2020-09-04 05:23:13 asr2/radeonvii0 CC 63000061 / 63000061, 29e72206ff10f85c
2020-09-04 05:23:13 asr2/radeonvii0 63000061 OK 63000400 100.00%; 1379 us/it; ETA 0d 00:00; e38f014ca61ea59a (check 0.35s)
2020-09-04 05:23:13 asr2/radeonvii0 proof: building level 1, hash d05118eb6e056555
2020-09-04 05:23:14 asr2/radeonvii0 proof: building level 2, hash deeea1d84b278de5
2020-09-04 05:23:14 asr2/radeonvii0 proof: building level 3, hash 03d2404ceb9d6c8c
2020-09-04 05:23:15 asr2/radeonvii0 proof: building level 4, hash c04e38efcc9776a6
2020-09-04 05:23:17 asr2/radeonvii0 proof: building level 5, hash bbff815aad25aa9e
2020-09-04 05:23:19 asr2/radeonvii0 proof: building level 6, hash a475da8b3558062d
2020-09-04 05:23:25 asr2/radeonvii0 proof: building level 7, hash fb304242a5b31aed
2020-09-04 05:23:35 asr2/radeonvii0 proof: building level 8, hash 703e0631438280d2
2020-09-04 05:23:39 asr2/radeonvii0 checksum de9da1a4 (expected 75062f10) in '.\63000061\proof\14519546'
2020-09-04 05:23:39 asr2/radeonvii0 Exception NSt10filesystem7__cxx1116filesystem_errorE: filesystem error: checksum mismatch: No error
Which caused gpuowl to halt on that gpu. After I found that, I made a copy of the entire 63000061 folder. Then restarted the run.
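
When scanning gpuowl.log for this kind of failure, a small parser can help. The sketch below assumes the line layout is exactly as in the excerpt above (timestamp, the -cpu name, then `checksum X (expected Y) in 'path'`). The regex is inferred from that one sample, not from gpuowl's source, and `parse_checksum_mismatch` is a made-up name.

```python
import re

# Pattern inferred from the single log line shown above (an assumption).
CHECKSUM_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<cpu>\S+) "
    r"checksum (?P<got>[0-9a-f]+) \(expected (?P<expected>[0-9a-f]+)\) "
    r"in '(?P<path>.+)'$"
)


def parse_checksum_mismatch(line):
    """Return a dict of fields for a checksum line, or None for other lines."""
    match = CHECKSUM_RE.match(line.strip())
    return match.groupdict() if match else None


sample = (
    r"2020-09-04 05:23:39 asr2/radeonvii0 checksum de9da1a4 "
    r"(expected 75062f10) in '.\63000061\proof\14519546'"
)
print(parse_checksum_mismatch(sample))
```

Running it over each line of the log and keeping the non-None results gives a list of affected residue files with their timestamps.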
Code:
2020-09-04 07:51:14 gpuowl v6.11-364-g36f4e2a
2020-09-04 07:51:14 config: -user kriesel -cpu asr2/radeonvii0 -d 0 -maxAlloc 15000 -proof 8
2020-09-04 07:51:14 device 0, unique id ''
2020-09-04 07:51:14 asr2/radeonvii0 63000061 FFT: 3.25M 256:13:512 (18.49 bpw)
2020-09-04 07:51:14 asr2/radeonvii0 Expected maximum carry32: 49920000
2020-09-04 07:51:15 asr2/radeonvii0 OpenCL args "-DEXP=63000061u -DWIDTH=256u -DSMALL_HEIGHT=512u -DMIDDLE=13u -DPM1=0 -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0xd.ad1ea7b4d5158p-5 -DIWEIGHT_STEP_MINUS_1=-0x9.94d31fb3b12ep-5  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-09-04 07:51:15 asr2/radeonvii0 ASM compilation failed, retrying compilation using NO_ASM
2020-09-04 07:51:21 asr2/radeonvii0 OpenCL compilation in 6.17 s
2020-09-04 07:51:21 asr2/radeonvii0 63000061 OK 63000000 loaded: blockSize 400, 38291a38c1ee58ea
2020-09-04 07:51:21 asr2/radeonvii0 validating proof residues for power 8
2020-09-04 07:51:25 asr2/radeonvii0 checksum de9da1a4 (expected 75062f10) in '.\63000061\proof\14519546'
2020-09-04 07:51:25 asr2/radeonvii0 validating proof residues for power 9
2020-09-04 07:51:25 asr2/radeonvii0 Can't open '.\63000061\proof\123047' (mode 'rb')
2020-09-04 07:51:25 asr2/radeonvii0 validating proof residues for power 8
2020-09-04 07:51:28 asr2/radeonvii0 checksum de9da1a4 (expected 75062f10) in '.\63000061\proof\14519546'
2020-09-04 07:51:28 asr2/radeonvii0 validating proof residues for power 7
2020-09-04 07:51:35 asr2/radeonvii0 Proof using power 7 (vs 8) for 63000061
2020-09-04 07:51:35 asr2/radeonvii0 CC 63000061 / 63000061, 29e72206ff10f85c
2020-09-04 07:51:36 asr2/radeonvii0 63000061 OK 63000400 100.00%; 1344 us/it; ETA 0d 00:00; e38f014ca61ea59a (check 0.35s)
2020-09-04 07:51:36 asr2/radeonvii0 proof: building level 1, hash d05118eb6e056555
2020-09-04 07:51:36 asr2/radeonvii0 proof: building level 2, hash deeea1d84b278de5
2020-09-04 07:51:37 asr2/radeonvii0 proof: building level 3, hash 03d2404ceb9d6c8c
2020-09-04 07:51:38 asr2/radeonvii0 proof: building level 4, hash c04e38efcc9776a6
2020-09-04 07:51:39 asr2/radeonvii0 proof: building level 5, hash bbff815aad25aa9e
2020-09-04 07:51:42 asr2/radeonvii0 proof: building level 6, hash a475da8b3558062d
2020-09-04 07:51:47 asr2/radeonvii0 proof: building level 7, hash fb304242a5b31aed
2020-09-04 07:51:57 asr2/radeonvii0 PRP-Proof 'proof\63000061-7.proof' generated
2020-09-04 07:51:57 asr2/radeonvii0 Proof: cleaning up temporary storage
2020-09-04 07:51:58 asr2/radeonvii0 {"status":"C", "exponent":"63000061", "worktype":"PRP-3", "res64":"29e72206ff10f8__", "residue-type":"1", "errors":{"gerbicz":"0"}, "fft-length":"3407872", "proof":{"version":"1", "power":"7", "hashsize":"64", "md5":"5a2ca135f73dbdbf9f4c85f8fba7____"}, "program":{"name":"gpuowl", "version":"v6.11-364-g36f4e2a"}, "user":"kriesel", "computer":"asr2/radeonvii0", "timestamp":"2020-09-04 12:51:58 UTC"}
That's odd, because -proof 8 is specified in the config.txt. I guess it dropped to power 7 because of the checksum issue, which certainly is preferable to abandoning a proof entirely and as a result requiring another PRP test. I don't understand why it tries power 9 since it already knows it's a power 8 run. The power 7 proof file uploaded successfully. We'll see what an eventual cert run shows on 63000061.
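
As an aside on the restart log, the FFT line ("3.25M 256:13:512 (18.49 bpw)") and the "fft-length" in the JSON result are consistent with each other. The check below is a sketch that assumes the word count is 2 × WIDTH × MIDDLE × SMALL_HEIGHT and that bpw is the exponent divided by that count; both are inferences from the numbers in the log, not statements from gpuowl's documentation.

```python
def fft_words(width, middle, small_height):
    """Word count; the factor of 2 is inferred from the logged fft-length."""
    return 2 * width * middle * small_height


def bits_per_word(exponent, n_words):
    """Average number of exponent bits carried per FFT word."""
    return exponent / n_words


n = fft_words(256, 13, 512)
print(n, n / 2**20, round(bits_per_word(63000061, n), 2))
```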

Last fiddled with by kriesel on 2020-09-04 at 13:16
2020-09-04, 15:30   #22
storm5510
Random Account
Aug 2009
U.S.A.
2×7×11² Posts

Quote:
Originally Posted by ewmayer
@Ken: stalling at the start of p-1 stage 2 sounds suspiciously like some kind of mem-alloc failure...
I had a slightly older version stall about halfway through a P-1 stage 2. My 1080 only has 8 GB of RAM and I was trying to use 6. I changed "maxAlloc" to 5500. gpuOwl has not stalled since. The caveat is that it takes longer to run stage 2.
All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.