mersenneforum.org

gpuowl: runtime error (https://www.mersenneforum.org/showthread.php?t=23117)

kriesel 2018-03-10 17:11

[QUOTE=SELROC;481995]Hello Mihai, have you attempted to reproduce the error yet?

I have reinstalled debian-testing with amdgpu-pro and am still getting the same error: if two instances of gpuowl are launched, the first remains in a blocked state and we can only reboot to stop it.

However, the normal reboot will not work (with a message: "watchdog did not stop") and we can only switch off the power to reboot.

My GPU hardware is Radeon RX 580[/QUOTE]

Yikes. That is not a problem on Windows 7. There, if gpuowl and mfakto are accidentally run on the same RX 550 GPU at the same time, the system eventually reboots itself. Before that, gpuowl v1.9 simply stalls for a while.

SELROC 2018-03-10 18:49

[QUOTE=kriesel;482005]Yikes. That is not a problem on Windows 7. There, if gpuowl and mfakto are accidentally run on the same RX 550 GPU at the same time, the system eventually reboots itself. Before that, gpuowl v1.9 simply stalls for a while.[/QUOTE]

I run two instances with the -device parameter set differently: one instance on GPU 0 and the other on GPU 1.

The first instance will stall, and it cannot be stopped with ^C.

kriesel 2018-03-10 21:35

[QUOTE=SELROC;482012]I run two instances with the -device parameter set differently: one instance on GPU 0 and the other on GPU 1.

The first instance will stall, and it cannot be stopped with ^C.[/QUOTE]

Right now I'm running mfakto, cllucas, and gpuowl on 3 RX 550s in the same system. That's using "in" loosely, since due to PCIe slot spacing one card is perched atop the tower case and connected via a 1x-to-16x extender. I see about a 3% speed penalty in gpuOwL v1.9 with the extender.

I have seen occasional gpuOwL stalls; the gpu load is displayed as 100% via GPU-Z, and the progress in the console and log has stopped. I don't recall if that was v1.8 or 1.9.

Are you able to read the GPU sensor values OK on Linux? GPU core clock, GPU memory clock, and temperature get disabled in GPU-Z on Windows 7 for the RX 550s when using Windows Remote Desktop, but not when using the local console or VNC remote access.

preda 2018-03-11 02:17

[QUOTE=SELROC;481995]Hello Mihai, have you attempted to reproduce the error yet?

I have reinstalled debian-testing with amdgpu-pro and am still getting the same error: if two instances of gpuowl are launched, the first remains in a blocked state and we can only reboot to stop it.

However, the normal reboot will not work (with a message: "watchdog did not stop") and we can only switch off the power to reboot.

My GPU hardware is Radeon RX 580[/QUOTE]
No, I haven't looked into this yet, sorry. It seems to be a driver problem. When this happens, if you're on Linux, could you look at the system error log with
"dmesg", possibly filtering it with something like "dmesg | grep -i amd"? If you see errors there, that likely confirms a driver problem.
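
For anyone who prefers to script that check, here is a minimal Python sketch of the same filtering idea. It assumes `dmesg` is on the PATH and readable by the current user (some systems require root). The two keyword patterns and all function names are illustrative guesses made up for this example, not an official list of driver error strings.

```python
import re
import subprocess

# Assumption: illustrative patterns only, roughly a widened `grep -i amd`.
DRIVER_PATTERN = re.compile(r"amd|radeon|kfd", re.IGNORECASE)
PROBLEM_PATTERN = re.compile(r"error|fail|timeout|hung|not supported", re.IGNORECASE)


def driver_lines(log_text):
    """Return the kernel log lines that mention the AMD GPU driver."""
    return [line for line in log_text.splitlines() if DRIVER_PATTERN.search(line)]


def suspicious_lines(log_text):
    """Return the driver lines that also look like a problem report."""
    return [line for line in driver_lines(log_text) if PROBLEM_PATTERN.search(line)]


def read_dmesg():
    """Run `dmesg` and return its output as text."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `suspicious_lines(read_dmesg())` returns the candidate lines; an empty list does not prove the driver is healthy, only that none of the guessed keywords matched.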

SELROC 2018-03-11 09:37

1 Attachment(s)
[QUOTE=preda;482049]No, I haven't looked into this yet, sorry. It seems to be a driver problem. When this happens, if you're on Linux, could you look at the system error log with
"dmesg", possibly filtering it with something like "dmesg | grep -i amd"? If you see errors there, that likely confirms a driver problem.[/QUOTE]

I have looked at the dmesg output. Only a couple of lines report something that looks like an error ("kfd not supported on this ASIC"); the rest of the lines look pretty normal. Just in case, I attach the output.

preda 2018-03-11 11:09

Yes, it looks fine. And you don't get anything new in dmesg when the lock happens? Interesting. I still can't really imagine how an OpenCL app can lock up the process, much less the whole OS, unless some problem happens in deeper layers (kernel/driver).

[QUOTE=SELROC;482059]I have looked at the dmesg output. Only a couple of lines report something that looks like an error ("kfd not supported on this ASIC"); the rest of the lines look pretty normal. Just in case, I attach the output.[/QUOTE]

SELROC 2018-03-11 11:26

[QUOTE=preda;482061]Yes, it looks fine. And you don't get anything new in dmesg when the lock happens? Interesting. I still can't really imagine how an OpenCL app can lock up the process, much less the whole OS, unless some problem happens in deeper layers (kernel/driver).[/QUOTE]

I am running another test and waiting longer before shutting down the system, to see if any other messages appear in dmesg.
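
One way to make "any other messages" concrete is to save the kernel log before launching the second instance, save it again while the first instance is stalled, and compare the two. A rough Python sketch follows, assuming both snapshots were saved as plain text (for example with `dmesg > before.txt`); the function name is hypothetical and chosen only for this illustration.

```python
from collections import Counter


def new_log_lines(before_text, after_text):
    """Return lines present in the later snapshot but not in the earlier one.

    A multiset is used so that a message repeated more often in the later
    snapshot is reported once per extra occurrence.
    """
    remaining = Counter(before_text.splitlines())
    added = []
    for line in after_text.splitlines():
        if remaining[line] > 0:
            remaining[line] -= 1  # already present in the earlier snapshot
        else:
            added.append(line)
    return added
```

If the result is empty, the stall produced no kernel messages between the two snapshots, which matches a hang that the driver never reports.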

kladner 2018-03-11 11:34

Please forgive me if I am misunderstanding your description, but are you running two instances on a single GPU? I am not running AMD hardware, but I do run two mfaktc instances on an Nvidia card. In my setup, [U]the instances run from different directories, but both are using -d 0.[/U] If I used -d 1 for one instance, it would run on my other GPU. I think with a single GPU, calling -d 1 would cause an error.

Could this be causing the difficulty?

SELROC 2018-03-11 11:53

[QUOTE=kladner;482064]Please forgive me if I am misunderstanding your description, but are you running two instances on a single GPU? I am not running AMD hardware, but I do run two mfaktc instances on an Nvidia card. In my setup, [U]the instances run from different directories, but both are using -d 0.[/U] If I used -d 1 for one instance, it would run on my other GPU. I think with a single GPU, calling -d 1 would cause an error.

Could this be causing the difficulty?[/QUOTE]

I have 2 GPUs, 0 and 1.

SELROC 2018-03-11 13:40

[QUOTE=SELROC;482063]I am doing another test and waiting for more time before shutting down the system, to see if any other messages are generated in dmesg.[/QUOTE]

Well, nothing else showed up in dmesg.

Investigating further...

kladner 2018-03-11 21:16

[QUOTE=SELROC;482065]I have 2 GPUs, 0 and 1.[/QUOTE]
Sorry, I did not catch the 2 GPU part. Out of habit, I use "instance" in this context to mean more than one process on the same card, though I know that is too narrow a meaning.

All times are UTC.