mersenneforum.org  

Old 2020-01-02, 04:44   #1
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

7·1,069 Posts
Gpuowl / Linux question

For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:

Code:
[183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues
Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls?

I'll be leaving for an extended trip and hope to have a remedy in place before I go.
Old 2020-01-02, 05:50   #2
paulunderwood
 
Sep 2002
Database er0rr

2²×3×5×61 Posts

Quote:
Originally Posted by Prime95
For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:

Code:
[183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues
Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls?

I'll be leaving for an extended trip and hope to have a remedy in place before I go.
Passing amdgpu.noretry=0 to the kernel might help. Adding it to the appropriate GRUB line, GRUB_CMDLINE_LINUX_DEFAULT (options are space-delimited), and running update-grub will make it permanent from the next boot onward. See this for more details

I seem to have an old kernel:
Code:
uname -a
Linux honeypot9 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
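
For reference, a minimal sketch of what that change might look like on a Debian-style system. It assumes the GRUB defaults live in /etc/default/grub, and "quiet" merely stands in for whatever options are already on the line. The helper name has_kernel_param is made up for illustration; it only checks whether an option is present on the running kernel's command line, not whether the amdgpu driver in a given kernel actually honours it.

```shell
#!/bin/sh
# Config fragment (shown as comments; edit /etc/default/grub by hand):
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.noretry=0"
# then run "sudo update-grub" and reboot.

# has_kernel_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word in CMDLINE
# (default: the running kernel's command line from /proc/cmdline).
has_kernel_param() (
    set -f                                   # no globbing while word-splitting
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    for word in $cmdline; do
        [ "$word" = "$param" ] && exit 0
    done
    exit 1
)
```

After the reboot, something like `has_kernel_param amdgpu.noretry=0 && echo "option is on the command line"` would confirm that the edit took effect.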

Last fiddled with by paulunderwood on 2020-01-02 at 05:56
Old 2020-01-02, 05:50   #3
axn
 
Jun 2003

2²×3×7×59 Posts

Quote:
Originally Posted by Prime95
a way to detect the hung condition
Checking the age of the checkpoint / result file is probably the easiest way to detect a no-progress condition. A simple script can run every x minutes, check the date of the latest update in the folder, and, if it is too old, killall the process and restart it.
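
A rough sketch of the "latest update in the folder" check, assuming GNU find and a made-up function name, folder_is_stale. It looks only at regular files directly inside the folder and treats an empty folder as stale; the folder path in the usage comment is hypothetical.

```shell
#!/bin/sh
# folder_is_stale DIR MAX_AGE_SECONDS
# Succeeds if the newest regular file directly inside DIR was last modified
# more than MAX_AGE_SECONDS ago, or if DIR contains no regular files at all.
folder_is_stale() (
    dir=$1
    max_age=$2
    # %T@ prints the modification time as seconds since the epoch (GNU find).
    newest=$(find "$dir" -maxdepth 1 -type f -printf '%T@\n' 2>/dev/null | sort -n | tail -n 1)
    newest=${newest%%.*}                     # drop fractional seconds
    [ -n "$newest" ] || exit 0               # nothing found: treat as stale
    now=$(date +%s)
    [ $((now - newest)) -gt "$max_age" ]
)

# Intended use from a periodic job (path is hypothetical):
#   if folder_is_stale "$HOME/gpuowl" 3600; then
#       killall gpuowl
#       # ...restart commands go here...
#   fi
```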
Old 2020-01-02, 07:46   #4
preda
 
"Mihai Preda"
Apr 2015

3²·151 Posts

Quote:
Originally Posted by Prime95
For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:

Code:
[183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues
Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls?

I'll be leaving for an extended trip and hope to have a remedy in place before I go.
Is the OS responding when this happens? Is it possible to kill the gpuowl processes (e.g. with Ctrl-C, kill, kill -9)? If you [re]start a gpuowl in that state, does it work?
Old 2020-01-02, 11:38   #5
ATH
Einyen
 
Dec 2003
Denmark

2×7×223 Posts

Quote:
Originally Posted by Prime95
I'll be leaving for an extended trip and hope to have a remedy in place before I go.
Time to prepare for double checking a new prime

(we need a "hope" smiley or "cross your fingers" smiley)
Old 2020-01-02, 18:24   #6
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

7×1,069 Posts

Quote:
Originally Posted by preda
Is the OS responding when this happens? Is it possible to kill the gpuowl processes (e.g. with Ctrl-C, kill, kill -9)? If you [re]start a gpuowl in that state, does it work?
The OS does respond. gpuowl does not react to ^C. Killing the gpuowls and restarting does work.
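
Since the hung processes ignore ^C but can be killed, a watchdog would probably want to escalate rather than rely on a single signal. A sketch under that assumption, with a made-up helper name, stop_pid: it asks politely with SIGTERM first and falls back to SIGKILL (the "kill -9" mentioned above) after a grace period.

```shell
#!/bin/sh
# stop_pid PID [GRACE_SECONDS]
# Send SIGTERM, wait up to GRACE_SECONDS (default 10) for the process to
# disappear, then send SIGKILL if it is still there.
stop_pid() (
    pid=$1
    grace=${2:-10}
    kill -TERM "$pid" 2>/dev/null || exit 0  # already gone, or not ours to signal
    while [ "$grace" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        grace=$((grace - 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null        # last resort, cannot be ignored
    fi
    exit 0
)
```

To stop every instance, one option is to loop over the output of `pgrep -x gpuowl` and call stop_pid on each PID; the process name is assumed to match the binary name used on the box.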
Old 2020-01-02, 18:30   #7
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

7×1,069 Posts

Quote:
Originally Posted by axn
Checking the age of the checkpoint / result file is probably the easiest way to detect a no-progress condition. A simple script can run every x minutes, check the date of the latest update in the folder, and, if it is too old, killall the process and restart it.
This looks like a great option.

My linux-fu is poor and I'm a little too lazy to google right now. What would the syntax be for:

if (gpuowl.log has not been updated in the last hour) {
killall gpuowl
}

I can do the crontab entry and restarting is the same as the start-at-reboot code.
Old 2020-01-02, 19:25   #8
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11736₈ Posts

An application- and OS-agnostic general-case code snippet would be great. I have an instance of CUDAPm1 on Windows that restarts by batch file when it exits, but lately it sometimes produces no output in the redirected log file for nearly a day, not even the usual startup prints. Then I kill it and relaunch it manually, and it seems to be fine for a while. Preferably it would be perl that could be compiled in Indigostar's perl2exe. http://www.indigostar.com/perl2exe/

Something along these lines: if the last-modification date of a specifiable file path/name (such as a process's log file) is older than a settable age, kill the process that has it open for append and relaunch it. Or do the same if the last saved checkpoint file is older than a settable age.
It should support a list of files and folders to be checked separately and processed individually. (I set up a separate folder per running instance.)

I'm still working on a general monitor and results gathering application for several gpu apps. Will consider adding this functionality to it. More likely in the short term it will be a separate creation.
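
As a starting point for the list-driven part, here is a Linux-only shell sketch rather than the perl asked for above. The watch-list format (one "MAX_AGE_SECONDS PATH" pair per line, with # for comments) and the function name stale_from_list are invented for illustration, and it relies on GNU stat.

```shell
#!/bin/sh
# stale_from_list
# Read "MAX_AGE_SECONDS PATH" lines from standard input and print every PATH
# that is missing or was last modified more than MAX_AGE_SECONDS ago.
# Blank lines and lines starting with # are skipped; PATH may contain spaces.
stale_from_list() (
    now=$(date +%s)
    while read -r max_age path; do
        case $max_age in
            ''|'#'*) continue ;;
        esac
        # A file that cannot be stat'ed is reported as stale.
        mtime=$(stat -c %Y -- "$path" 2>/dev/null) || { echo "$path"; continue; }
        if [ $((now - mtime)) -gt "$max_age" ]; then
            echo "$path"
        fi
    done
)
```

Each printed path can then be handled individually, for example by finding the process that holds it open and restarting only that instance.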

Last fiddled with by kriesel on 2020-01-02 at 19:27
Old 2020-01-03, 01:38   #9
paulunderwood
 
Sep 2002
Database er0rr

2²×3×5×61 Posts

Quote:
Originally Posted by Prime95
This looks like a great option.

My linux-fu is poor and I'm a little too lazy to google right now. What would the syntax be for:

if (gpuowl.log has not been updated in the last hour) {
killall gpuowl
}

I can do the crontab entry and restarting is the same as the start-at-reboot code.
If you can hack this not-so-great code...

Code:
if (( $(date +%s) > $(stat -c %Y -- examplefile.txt ) + 3600 )); then killall gpuowl; echo "hi" ;  fi;
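
A cron-friendly variant of the same check, offered as a sketch only: it leans on find's -mmin test instead of doing the arithmetic by hand. The function name log_is_stale, the log path, and the restart script in the comments are placeholders, not anything gpuowl provides.

```shell
#!/bin/sh
# log_is_stale FILE MINUTES
# Succeeds if FILE exists and was last modified more than MINUTES minutes ago.
# A missing FILE is NOT reported as stale (find prints nothing for it).
log_is_stale() {
    [ -n "$(find "$1" -maxdepth 0 -mmin +"$2" 2>/dev/null)" ]
}

# Intended use inside a watchdog script (paths are hypothetical):
#   if log_is_stale "$HOME/gpuowl/gpuowl.log" 60; then
#       killall gpuowl
#       "$HOME/start-gpuowl.sh"    # same commands as the start-at-reboot code
#   fi
#
# Matching crontab entry, checking every 10 minutes:
#   */10 * * * * /home/user/bin/gpuowl-watchdog.sh
```

Whether a missing log should count as hung is a design choice; here it does not, so a freshly rebooted box is not killed before gpuowl writes its first line.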

Last fiddled with by paulunderwood on 2020-01-03 at 01:43
Old 2020-01-03, 13:25   #10
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13DE₁₆ Posts

On Windows, the tools are tasklist and taskkill, both of which support various filters.
In George's case it seems all gpuowl instances are to be killed and restarted.
It gets trickier if you want to determine which pid corresponds to the one hung gpu app to kill and restart, among multiple processes running the same app on different gpus or in different folders, or under different app names. Roll through a list and see which one has the relevant log file open?
Is there a Windows command line equivalent to linux lsof, that would show which process id has which log file open?
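
On the Linux side, the "see which one has the relevant log file open" step can be sketched even without lsof by walking /proc. The function name pids_with_open is made up, GNU readlink is assumed, and only processes whose file descriptors the current user may inspect are seen (run as root to see everything).

```shell
#!/bin/sh
# pids_with_open FILE
# Print the PID of every visible process that currently holds FILE open.
pids_with_open() (
    target=$(readlink -f -- "$1") || exit 1  # canonical absolute path
    for fd in /proc/[0-9]*/fd/*; do
        # Each entry under /proc/PID/fd is a symlink to the open file.
        if [ "$(readlink -- "$fd" 2>/dev/null)" = "$target" ]; then
            pid=${fd#/proc/}
            echo "${pid%%/*}"
        fi
    done | sort -un
)
```

With one folder per instance, a call such as `pids_with_open /path/to/instance/gpuowl.log` (path hypothetical) would then name only the PID of the hung instance, so the others can keep running.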
Old 2020-01-03, 14:44   #11
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·2,543 Posts

Quote:
Originally Posted by kriesel
Is there a Windows command line equivalent to linux lsof, that would show which process id has which log file open?
Yes, an addon.
https://docs.microsoft.com/en-us/sys...wnloads/handle
Run the program once interactively to read and accept the license terms, which pop up separately, before attempting to use it in any sort of script; the license prompt is a showstopper until accepted.

Code:
C:\Users\ken\Documents>handle64 gpuowl.log

Nthandle v4.22 - Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals - www.sysinternals.com

gpuowl-win.exe     pid: 5488   type: File            98: C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-99-gdd8527b\gpuowl.log
gpuowl-win.exe     pid: 5768   type: File            70: C:\msys64\home\ken\gpuowl-compile\v6.11-104-g91ef9a8\rx550\gpuowl.log

 C:\Users\ken\Documents>
And console redirection to a file suffices too, for those applications that don't have built-in logging:

Code:
C:\Users\User\My Documents\starfish>handle64 cudapm1.txt

Nthandle v4.22 - Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals - www.sysinternals.com

cmd.exe            pid: 12396  type: File            58: C:\Users\User\Documents\pm1-k4000\cudapm1.txt
CUDAPm1_win64_20130923_CUDA_55.exe pid: 1908   type: File            58: C:\Users\User\Documents\pm1-k4000\cudapm1.txt

C:\Users\User\My Documents\starfish>
I think that's the last piece needed for a general-purpose tool to identify, kill, and restart any GIMPS gpu app that has stalled, on Windows or on Linux.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.