mersenneforum.org Gpuowl / Linux question

2020-01-02, 04:44   #1
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7·1,069 Posts

Gpuowl / Linux question

For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:

Code:
[183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues

Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls? I'll be leaving for an extended trip and hope to have a remedy in place before I go.

2020-01-02, 05:50   #2
paulunderwood

Sep 2002
Database er0rr

2²×3×5×61 Posts

Quote:
 Originally Posted by Prime95 For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this: Code: [183465.976255] Restoring PASID 32768 queues [183465.976348] Restoring PASID 32768 queues [265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660 [265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0 [265226.254775] amdgpu 0000:04:00.0: GPU reset begin! [265226.254782] Evicting PASID 32769 queues Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls? I'll be leaving for an extended trip and hope to have a remedy in place before I go.
Passing amdgpu.noretry=0 to the kernel might help. Adding it to the appropriate grub line, GRUB_CMDLINE_LINUX_DEFAULT (space delimited), and running update-grub will make it permanent from the next boot onward. See this for more details
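
A minimal sketch of that edit, assuming a Debian-style grub defaults file in which GRUB_CMDLINE_LINUX_DEFAULT holds a double-quoted, space-delimited list, and assuming GNU sed and grep. The helper name add_kernel_param is made up for illustration and is not part of any GRUB tooling. It takes the file path as an argument so it can be tried on a copy first.

```shell
# add_kernel_param FILE PARAM
# Append PARAM to the double-quoted value of GRUB_CMDLINE_LINUX_DEFAULT in FILE,
# unless that line already contains it. If FILE has no such line, nothing changes.
# Assumption: PARAM contains no '|', '&' or backslash (they are special to sed here).
add_kernel_param() {
    akp_file=$1
    akp_param=$2
    # Fixed-string match, so the '.' in amdgpu.noretry is not treated as a wildcard.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$akp_file" | grep -qF -- "$akp_param"; then
        return 0
    fi
    # Re-insert the existing quoted list, then a space, the parameter, and the closing quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $akp_param\"|" "$akp_file"
}
```

On a real system this would be run as root against /etc/default/grub (for example add_kernel_param /etc/default/grub amdgpu.noretry=0), followed by update-grub and a reboot. Afterwards, cat /proc/cmdline shows whether the parameter reached the kernel. Running modinfo -p amdgpu | grep noretry is one way to check whether the installed amdgpu module knows the parameter at all, which may matter on older kernels.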

I seem to have an old kernel:
Code:
uname -a
Linux honeypot9 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

Last fiddled with by paulunderwood on 2020-01-02 at 05:56

2020-01-02, 05:50   #3
axn

Jun 2003

2²×3×7×59 Posts

Quote:
 Originally Posted by Prime95 a way to detect the hung condition
Checking the age of the checkpoint / result file is probably the easiest way to detect no-progress condition. A simple script that runs every x minutes, checks the date of latest update in the folder, and if it is too old, killall the process and restart it.
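
A rough sketch of the folder check, assuming GNU stat and assuming the checkpoint or result files sit directly in the folder (it does not descend into subfolders). The function names newest_mtime and dir_is_stale are invented for this example.

```shell
# newest_mtime DIR
# Print the most recent modification time (seconds since the epoch) among the
# regular files directly inside DIR, or 0 if there are none.
newest_mtime() {
    nm_newest=0
    for nm_f in "$1"/*; do
        [ -f "$nm_f" ] || continue
        nm_m=$(stat -c %Y -- "$nm_f") || continue
        if [ "$nm_m" -gt "$nm_newest" ]; then
            nm_newest=$nm_m
        fi
    done
    echo "$nm_newest"
}

# dir_is_stale DIR MAX_AGE
# Succeed (exit status 0) when no file in DIR was modified within the last
# MAX_AGE seconds. An empty folder counts as stale.
dir_is_stale() {
    dis_now=$(date +%s)
    dis_newest=$(newest_mtime "$1")
    [ $(( dis_now - dis_newest )) -gt "$2" ]
}
```

A cron job could then run something like: if dir_is_stale /path/to/workdir 3600; then killall gpuowl; fi, with the path and the one-hour limit adjusted to how often the program normally writes.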

2020-01-02, 07:46   #4
preda

"Mihai Preda"
Apr 2015

3²·151 Posts

Quote:
 Originally Posted by Prime95 For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this: Code: [183465.976255] Restoring PASID 32768 queues [183465.976348] Restoring PASID 32768 queues [265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660 [265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0 [265226.254775] amdgpu 0000:04:00.0: GPU reset begin! [265226.254782] Evicting PASID 32769 queues Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls? I'll be leaving for an extended trip and hope to have a remedy in place before I go.
Is the OS responding when this happens? Is it possible to kill the gpuowl processes (e.g. with ctrl-C, kill, kill -9)? If you [re]start a gpuowl in that state, does it work?

2020-01-02, 11:38   #5
ATH
Einyen

Dec 2003
Denmark

2×7×223 Posts

Quote:
 Originally Posted by Prime95 I'll be leaving for an extended trip and hope to have a remedy in place before I go.
Time to prepare for double checking a new prime

(we need a "hope" smiley or "cross your fingers" smiley)

2020-01-02, 18:24   #6
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7×1,069 Posts

Quote:
 Originally Posted by preda Is the OS responding when this happens? is it possible to kill the gpuowl processes? (e.g. with ctrl-C, kill, kill -9). If you [re]start a gpuowl in that state, does it work?
The OS does respond. gpuowl does not react to ^C. Killing the gpuowls and restarting does work.
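
Since ^C (SIGINT) is ignored in the hung state, an unattended stop may need to escalate. The sketch below assumes it is acceptable to fall back to SIGKILL (the kill -9 case) when SIGTERM is not honoured within a grace period. The thread does not say which signal was actually needed, and stop_pid is a made-up helper name.

```shell
# stop_pid PID [GRACE]
# Ask PID to exit with SIGTERM, poll once per second for up to GRACE seconds
# (default 10), then fall back to SIGKILL if it is still there.
stop_pid() {
    sp_pid=$1
    sp_grace=${2:-10}
    # kill fails if the process does not exist or is not ours: nothing to do.
    kill -TERM "$sp_pid" 2>/dev/null || return 0
    while [ "$sp_grace" -gt 0 ]; do
        # Signal 0 sends nothing; it only tests whether PID can still be signalled.
        kill -0 "$sp_pid" 2>/dev/null || return 0
        sleep 1
        sp_grace=$(( sp_grace - 1 ))
    done
    kill -KILL "$sp_pid" 2>/dev/null
    return 0
}
```

Assuming the processes are named gpuowl, as in the killall suggestion elsewhere in the thread, all of them could be stopped with: for p in $(pgrep -x gpuowl); do stop_pid "$p" 5; done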

2020-01-02, 18:30   #7
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7×1,069 Posts

Quote:
 Originally Posted by axn Checking the age of the checkpoint / result file is probably the easiest way to detect no-progress condition. A simple script that runs every x minutes, checks the date of latest update in the folder, and if it is too old, killall the process and restart it.
This looks like a great option.

My linux-fu is poor and I'm a little lazy to google right now. What would the syntax be for:

if (gpuowl.log has not been updated in the last hour) {
killall gpuowl
}

I can do the crontab entry and restarting is the same as the start-at-reboot code.

2020-01-02, 19:25   #8
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11736₈ Posts

An application and OS agnostic general case code snippet would be great. I have an instance of CUDAPm1 on Windows that restarts by batch file when it exits, but lately sometimes it produces no output into the redirected log file for nearly a day, not even the usual startup prints. Then I kill it and relaunch it manually and it seems to be fine for a while.

Preferably it would be perl that could be compiled in Indigostar's perl2exe. http://www.indigostar.com/perl2exe/

Something along the lines of: if this specifiable file path/name (such as a process's log file)'s last-modification date is older than this settable age, kill the process that has it open for append and relaunch. Or if the last saved checkpoint file is older than a settable age. Support for a list of files and folders to be separately checked and individually processed. (I set up with a separate folder per running instance.)

I'm still working on a general monitor and results gathering application for several gpu apps. Will consider adding this functionality to it. More likely in the short term it will be a separate creation.

Last fiddled with by kriesel on 2020-01-02 at 19:27

2020-01-03, 01:38   #9
paulunderwood

Sep 2002
Database er0rr

2²×3×5×61 Posts

Quote:
 Originally Posted by Prime95 This looks like a great option. My linux-fu is poor and I'm a little lazy to google right now. What would the syntax be for: if (gpuowl.log has not been updated in the last hour) { killall gpuowl } I can do the crontab entry and restarting is the same as the start-at-reboot code.
If you can hack this not-so-great code...

Code:
# bash + GNU stat: act if examplefile.txt was last modified more than 3600 seconds ago
if (( $(date +%s) > $(stat -c %Y -- examplefile.txt) + 3600 )); then killall gpuowl; echo "hi"; fi
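
The same test can be wrapped in a small function that takes the log path, the age limit, and the action as arguments, so the check can be tried with a harmless command before it is allowed to kill anything. This is only a sketch: file_age and run_if_stale are invented names, GNU stat is assumed as in the one-liner, and a missing log file is reported rather than treated as a hang.

```shell
# file_age FILE
# Print the number of seconds since FILE was last modified.
file_age() {
    echo $(( $(date +%s) - $(stat -c %Y -- "$1") ))
}

# run_if_stale FILE MAX_AGE CMD [ARGS...]
# Run CMD only when FILE has not been modified for more than MAX_AGE seconds.
# A missing FILE is reported on stderr and CMD is not run, so that a mistyped
# path cannot trigger the action.
run_if_stale() {
    ris_file=$1
    ris_max=$2
    shift 2
    if [ ! -e "$ris_file" ]; then
        echo "run_if_stale: $ris_file not found" >&2
        return 1
    fi
    if [ "$(file_age "$ris_file")" -gt "$ris_max" ]; then
        "$@"
    fi
}
```

A dry run would be run_if_stale gpuowl.log 3600 echo stale from the gpuowl working directory. Once that behaves as expected, the cron script can call run_if_stale gpuowl.log 3600 killall gpuowl and then restart the instances the same way the start-at-reboot code does.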

Last fiddled with by paulunderwood on 2020-01-03 at 01:43

2020-01-03, 13:25   #10
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13DE₁₆ Posts

In Windows, it's tasklist and taskkill. There are various filters in them. In George's case it seems all gpuowl instances are to be killed and restarted. It gets trickier if wanting to determine which pid corresponds to one hung gpu app to kill and restart, among multiple processes running the same app but on different gpus or in different folders, or different app names. Roll through a list and see which one has the relevant log file open? Is there a Windows command line equivalent to linux lsof, that would show which process id has which log file open?

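
On the Linux side, lsof -t FILE or fuser FILE answers the "which pid has this log open" question directly. Where neither is installed, the same information can be read from /proc. The following is a Linux-only sketch with a made-up function name. It assumes GNU readlink and only sees processes whose fd directory the caller is allowed to read (all of them when run as root).

```shell
# pids_with_open_file FILE
# Print, one per line, the PID of every visible process that has FILE open,
# found by comparing the /proc/<pid>/fd/<n> symlink targets with FILE's
# canonical path.
pids_with_open_file() {
    pof_target=$(readlink -f -- "$1") || return 1
    for pof_fd in /proc/[0-9]*/fd/*; do
        if [ "$(readlink -- "$pof_fd" 2>/dev/null)" = "$pof_target" ]; then
            pof_pid=${pof_fd#/proc/}     # strip the leading "/proc/"
            echo "${pof_pid%%/*}"        # keep only the PID component
        fi
    done | sort -un
}
```

With one folder per running instance, each stale log can then be mapped to its own process, and only that PID needs to be killed and relaunched instead of using killall on every instance.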
2020-01-03, 14:44   #11
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·2,543 Posts

Quote:
 Originally Posted by kriesel Is there a Windows command line equivalent to linux lsof, that would show which process id has which log file open?
Sysinternals Handle (handle64 in the examples below) can do this. Run the program once interactively to read and accept the license terms, which pop up separately, before attempting to use it in any sort of script; the license prompt is programmed to be a showstopper until accepted.

Code:
C:\Users\ken\Documents>handle64 gpuowl.log

Nthandle v4.22 - Handle viewer
Sysinternals - www.sysinternals.com

gpuowl-win.exe     pid: 5488   type: File            98: C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-99-gdd8527b\gpuowl.log
gpuowl-win.exe     pid: 5768   type: File            70: C:\msys64\home\ken\gpuowl-compile\v6.11-104-g91ef9a8\rx550\gpuowl.log

C:\Users\ken\Documents>
And console redirection to a file suffices too, for those applications that don't have built-in logging:

Code:
C:\Users\User\My Documents\starfish>handle64 cudapm1.txt

Nthandle v4.22 - Handle viewer
Sysinternals - www.sysinternals.com

cmd.exe            pid: 12396  type: File            58: C:\Users\User\Documents\pm1-k4000\cudapm1.txt
CUDAPm1_win64_20130923_CUDA_55.exe pid: 1908   type: File            58: C:\Users\User\Documents\pm1-k4000\cudapm1.txt

C:\Users\User\My Documents\starfish>
I think that's the last piece needed, for a general purpose tool to identify, kill, and restart any GIMPS gpu app that's stalled, on Windows or on linux.

All times are UTC.