mersenneforum.org  


Old 2018-03-10, 17:11   #12
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·11·269 Posts

Quote:
Originally Posted by SELROC
Hello Mihai, have you tried to reproduce the error yet?

I have reinstalled debian-testing with amdgpu-pro and am still getting the same error: if two instances of gpuowl are launched, the first remains in a blocked state and we can only reboot to stop it.

However, a normal reboot does not work (it fails with the message "watchdog did not stop"), so we can only switch off the power to reboot.

My GPU hardware is a Radeon RX 580.

Yikes. Not a problem on Windows 7. If gpuowl and mfakto are accidentally run on the same RX 550 GPU at the same time, the system eventually reboots itself. Before that, gpuowl v1.9 simply stalls for a while.

Last fiddled with by kriesel on 2018-03-10 at 17:11
Old 2018-03-10, 18:49   #13
SELROC
 

2²×1,217 Posts

Quote:
Originally Posted by kriesel
Yikes. Not a problem on Windows 7. If gpuowl and mfakto are accidentally run on the same RX 550 GPU at the same time, the system eventually reboots itself. Before that, gpuowl v1.9 simply stalls for a while.

I run two instances with the -device parameter set differently: one instance on GPU 0 and the other on GPU 1.

The first instance stalls and cannot be stopped with ^C.
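
A process that ignores ^C can be stuck in uninterruptible sleep (state "D") inside a kernel or driver call. Whether that is what happens here is an assumption and has not been established in this thread. A minimal Python sketch for checking it on Linux reads the state field from /proc/<pid>/stat; the function names are illustrative only.

```python
from pathlib import Path


def parse_proc_state(stat_text):
    """Return the one-letter state field from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the LAST ')'.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid):
    """Read the current state letter of a live process (Linux only)."""
    return parse_proc_state(Path(f"/proc/{pid}/stat").read_text())
```

Calling process_state() with the PID of the stalled gpuowl instance and getting "D" back would point toward the kernel/driver layer rather than gpuowl itself; "R" or "S" would suggest looking elsewhere.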
Old 2018-03-10, 21:35   #14
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13436₈ Posts

Quote:
Originally Posted by SELROC
I run two instances with the -device parameter set differently: one instance on GPU 0 and the other on GPU 1.

The first instance stalls and cannot be stopped with ^C.

Right now I'm running mfakto, cllucas, and gpuowl on 3 RX 550s in the same system. That's using "in" loosely, since due to PCIe slot spacing one card is perched atop the tower case and connected via a 1x-to-16x extender. I see about a 3% speed penalty in gpuOwL v1.9 with the extender.

I have seen occasional gpuOwL stalls: GPU-Z displays the GPU load as 100%, but progress in the console and log has stopped. I don't recall whether that was v1.8 or v1.9.
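
Since a stall of this kind shows up as a log that stops advancing while the GPU still reports full load, one rough way to detect it automatically is to watch the log file's modification time. The Python sketch below assumes the instance writes progress to a log file; the path, the 10-minute threshold, and the function names are all illustrative assumptions, not gpuOwL behaviour confirmed in this thread.

```python
import os
import time


def seconds_since_update(log_path, now=None):
    """Seconds elapsed since the log file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def is_stalled(log_path, max_idle_s=600.0, now=None):
    """True if the log has not been written to for more than max_idle_s."""
    return seconds_since_update(log_path, now) > max_idle_s
```

A cron job or a small loop could call is_stalled() on each instance's log and raise an alert. The threshold needs to be comfortably longer than the normal interval between progress lines, or it will produce false alarms.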

Are you able to read the GPU sensor values OK on Linux? GPU core clock, GPU memory clock, and temperature readings get disabled in GPU-Z on the RX 550s under Windows 7 when using Windows Remote Desktop, but not when using the local console or VNC remote access.
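
On Linux, one way to read these values without a GUI tool is through sysfs. The Python sketch below assumes the amdgpu kernel driver is in use and exposes pp_dpm_sclk, pp_dpm_mclk, and a hwmon temp1_input file (in millidegrees Celsius) under /sys/class/drm/card0/device; those paths and the card name are assumptions that may differ by driver version and system, and the function names are illustrative.

```python
import re
from pathlib import Path


def active_clock_mhz(pp_dpm_text):
    """Return the clock in MHz of the level marked with '*', or None."""
    for line in pp_dpm_text.splitlines():
        if line.rstrip().endswith("*"):
            match = re.search(r"(\d+)\s*mhz", line, re.IGNORECASE)
            if match:
                return int(match.group(1))
    return None


def read_temperature_c(hwmon_dir):
    """Convert temp1_input (millidegrees Celsius) to degrees Celsius."""
    millideg = int((Path(hwmon_dir) / "temp1_input").read_text().strip())
    return millideg / 1000.0


def read_sensors(card="card0", sysfs_root="/sys/class/drm"):
    """Collect core clock, memory clock, and temperature for one card."""
    device = Path(sysfs_root) / card / "device"
    hwmon_dirs = sorted((device / "hwmon").glob("hwmon*"))
    return {
        "core_mhz": active_clock_mhz((device / "pp_dpm_sclk").read_text()),
        "memory_mhz": active_clock_mhz((device / "pp_dpm_mclk").read_text()),
        "temp_c": read_temperature_c(hwmon_dirs[0]) if hwmon_dirs else None,
    }
```

With two GPUs, read_sensors("card0") and read_sensors("card1") would be called separately. Which card number maps to which -device index is not guaranteed to match.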
Old 2018-03-11, 02:17   #15
preda
 
"Mihai Preda"
Apr 2015

5³×11 Posts

Quote:
Originally Posted by SELROC
Hello Mihai, have you tried to reproduce the error yet?

I have reinstalled debian-testing with amdgpu-pro and am still getting the same error: if two instances of gpuowl are launched, the first remains in a blocked state and we can only reboot to stop it.

However, a normal reboot does not work (it fails with the message "watchdog did not stop"), so we can only switch off the power to reboot.

My GPU hardware is a Radeon RX 580.

No, I haven't looked into this yet, sorry. It seems to be a driver problem. When this happens, if you're on Linux, could you look at the kernel message log with "dmesg", optionally filtering it like this: "dmesg | grep -i amd"? If you see errors there, that likely confirms a driver problem.
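
The dmesg check suggested above can also be scripted so that only driver-related lines that look like errors are kept. The Python sketch below is one way to do it; the keyword lists are guesses chosen for illustration (not an official list of amdgpu messages), and reading dmesg may require root on some systems.

```python
import re
import subprocess

# Keyword lists are illustrative assumptions, not an exhaustive catalogue.
DRIVER_WORDS = re.compile(r"amd|kfd|drm", re.IGNORECASE)
ERROR_WORDS = re.compile(r"error|fail|timeout|hang|fault|not supported",
                         re.IGNORECASE)


def driver_error_lines(log_text):
    """Keep lines that mention the GPU driver AND look like an error."""
    return [line for line in log_text.splitlines()
            if DRIVER_WORDS.search(line) and ERROR_WORDS.search(line)]


def read_dmesg():
    """Return the kernel log as text by running the dmesg command."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True,
                            check=True)
    return result.stdout
```

Running driver_error_lines(read_dmesg()) once before launching the second instance and once after the stall, then comparing the two lists, would show whether the driver logged anything new when the lock occurred.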
Old 2018-03-11, 09:37   #16
SELROC
 

3×2,207 Posts

Quote:
Originally Posted by preda
No, I haven't looked into this yet, sorry. It seems to be a driver problem. When this happens, if you're on Linux, could you look at the kernel message log with "dmesg", optionally filtering it like this: "dmesg | grep -i amd"? If you see errors there, that likely confirms a driver problem.

I have looked at the dmesg output. Only a couple of lines report something that looks like an error ("kfd not supported on this ASIC"); the rest look pretty normal. Just in case, I am attaching the output.
Attached Files
File Type: txt amd.txt (6.6 KB, 203 views)
Old 2018-03-11, 11:09   #17
preda
 
"Mihai Preda"
Apr 2015

5³×11 Posts

Yes, it looks fine. And you don't get anything new in dmesg when the lock happens? Interesting. I still can't really imagine how an OpenCL app can lock up the process, much less the whole OS, unless some problem occurs in the deeper layers (kernel/driver).

Quote:
Originally Posted by SELROC
I have looked at the dmesg output. Only a couple of lines report something that looks like an error ("kfd not supported on this ASIC"); the rest look pretty normal. Just in case, I am attaching the output.
Old 2018-03-11, 11:26   #18
SELROC
 

2×3×13×79 Posts

Quote:
Originally Posted by preda
Yes, it looks fine. And you don't get anything new in dmesg when the lock happens? Interesting. I still can't really imagine how an OpenCL app can lock up the process, much less the whole OS, unless some problem occurs in the deeper layers (kernel/driver).

I am running another test and waiting longer before shutting down the system, to see whether any other messages appear in dmesg.
Old 2018-03-11, 11:34   #19
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

27AE₁₆ Posts

Please forgive me if I am misunderstanding your description, but are you running two instances on a single GPU? I am not running AMD hardware, but I do run two mfaktc instances on an Nvidia card. In my circumstances, the instances run from different directories, but both are using -d 0. If I used -d 1 for one instance, it would run on my other GPU. I think with a single GPU, calling -d 1 would cause an error.

Could this be causing the difficulty?
Old 2018-03-11, 11:53   #20
SELROC
 

5³·149 Posts

Quote:
Originally Posted by kladner
Please forgive me if I am misunderstanding your description, but are you running two instances on a single GPU? I am not running AMD hardware, but I do run two mfaktc instances on an Nvidia card. In my circumstances, the instances run from different directories, but both are using -d 0. If I used -d 1 for one instance, it would run on my other GPU. I think with a single GPU, calling -d 1 would cause an error.

Could this be causing the difficulty?

I have 2 GPUs, 0 and 1.
Old 2018-03-11, 13:40   #21
SELROC
 

12056₈ Posts

Quote:
Originally Posted by SELROC
I am running another test and waiting longer before shutting down the system, to see whether any other messages appear in dmesg.

Well, nothing else showed up in dmesg.

Investigating further...
Old 2018-03-11, 21:16   #22
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

2·3·1,693 Posts

Quote:
Originally Posted by SELROC
I have 2 GPUs, 0 and 1.

Sorry, I did not catch the 2 GPU part. I guess that, out of habit, I use "instance" in this context when speaking of more than one on the same card, though I know this is too limited a meaning.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.