mersenneforum.org  

2020-03-05, 00:36   #1
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
2¹⁰×7 Posts

Hardware repair odyssey

I've never failed so miserably at repairing a machine....

So, the machine was throwing read errors during boot up and normal operations. A Google search suggested the errors could come from a bad SATA cable. However, as this was a 9-year-old disk drive and the SATA cable had been working for years, I figured it was time to upgrade to an SSD.

1) Got the SSD and installed Ubuntu. Started to install stuff and got disk read errors followed by complete disk read failure. Hmmmm.
2) Replaced the SATA cable. Installed Ubuntu again. No better.
3) Tried moving to a different SATA port. Installed Ubuntu. It worked for a few minutes and I rebooted. Now it won't power on at all. A brief flash of lights and then nothing.
4) Yanked out the power supply and replaced it with a less powerful one. Installed Ubuntu. Worked briefly. Powered off and back on, and the internet port is dead.
5) Dragged out a USB to Ethernet dongle. Installed Ubuntu. Looked stable for a few minutes. Powered off to remove the GPU that cannot be run using the new power supply. Powered on -- nothing, completely dead.

I suppose the problem is the motherboard...

2020-03-05, 00:51   #2
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3⁴×5×11 Posts

Quote:
Originally Posted by Prime95
I've never failed so miserably at repairing a machine....
I have had experience getting more life out of vintage hardware by methodically disconnecting and reseating _everything_, one item at a time. If you're going to strap up with an antistatic tether, may as well make it worthwhile. And note that when that's needed, the system is on life support.

2020-03-05, 01:37   #3
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
2¹⁰·7 Posts

Not my day!!!!!

So, I take the GPU from the dead machine above over to the other machine that is down one GPU because it needs to be RMA'd. I power down, plug in the GPU, power up and all the GPUs start cranking. Sweet.

....except the internet port is now dead.

EDIT: I must have forgotten to toggle the power supply off after the shutdown. It's easy to make mistakes after hours of frustration.

Last fiddled with by Prime95 on 2020-03-05 at 01:57

2020-03-05, 03:14   #4
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7168₁₀ Posts

Quote:
Originally Posted by Prime95

....except the internet port is now dead.

Some progress. I've got the USB ethernet dongle working. Configuring Ubuntu networking by hand was an adventure.

I'm not sure I'm going to try and resurrect the dead machine. I may just add a PSU to my working machine and move the GPUs there.

2020-03-05, 03:16   #5
phillipsjk
Nov 2019
52₈ Posts

On a machine I got for "free", disk (and network?) IO would cause it to fail (I forget the exact errors).

Turns out the southbridge chip was faulty. Ended up getting an AM2 board with cheap "high density" RAM because Intel boards don't support "high density" DDR2 (4GB modules, only 2 slots).

2020-03-05, 03:51   #6
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1167₁₆ Posts

Quote:
Originally Posted by Prime95
EDIT: I must have forgotten to toggle the power supply off after the shutdown. It's easy to make mistakes after hours of frustration.
Maybe that's what the following drill is for, other than getting rid of stored charge:
1) orderly shutdown, which on most machines turns the power supply off; ancient hardware or OS may require manual shutoff
2) UNPLUG the power cord
3) Push and hold the power button for a few seconds to get rid of the residual charge still stored in the power supply
4) Check that fans have stopped rotating, all LEDs have gone out, etc.
5) Antistatic strap onto wrist and chassis. Look around for needed tools and components
6) Unsnap the tether from the wrist band, get the screwdriver and the necessary components
7) Reconnect tether to wrist band, remove cover, etc.

If the system starts up again as a result of #3, you must have skipped #2.
Some of us may resort to a checklist, coffee, or a nap for major maintenance tasks.
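
For anyone who does resort to a checklist, the drill above can be sketched as a small ordered checklist that refuses to let a step go ahead until every earlier step has been confirmed. This is only an illustration: the step wording is paraphrased from the list above, and the names `SHUTDOWN_DRILL`, `may_perform` and `first_unconfirmed` are made up for this sketch.

```python
# Steps paraphrased from the drill above; order matters.
SHUTDOWN_DRILL = [
    "Orderly shutdown (switch the PSU off by hand if the machine does not)",
    "Unplug the power cord",
    "Hold the power button for a few seconds to drain residual charge",
    "Check that fans have stopped and all LEDs are out",
    "Antistatic strap onto wrist and chassis; look for tools and components",
    "Unsnap the tether, fetch the screwdriver and components",
    "Reconnect the tether, remove the cover",
]


def may_perform(step_number, confirmed):
    """A step is allowed only when all earlier steps are confirmed."""
    return all(n in confirmed for n in range(1, step_number))


def first_unconfirmed(steps, confirmed):
    """Return the 1-based number of the first unconfirmed step, or None."""
    for number in range(1, len(steps) + 1):
        if number not in confirmed:
            return number
    return None


if __name__ == "__main__":
    done = {1}  # shut down, but the cord is still plugged in
    print("OK to press the power button?", may_perform(3, done))
    nxt = first_unconfirmed(SHUTDOWN_DRILL, done)
    print("Next step:", nxt, "-", SHUTDOWN_DRILL[nxt - 1])
```

With only step 1 confirmed, the sketch reports that step 3 is not yet allowed, which is exactly the "you must have skipped #2" case.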

Last fiddled with by kriesel on 2020-03-05 at 04:01

2020-03-05, 05:37   #7
LaurV
Romulan Interpreter
Jun 2011
Thailand
8753₁₀ Posts

Yuck, George, that sucks, it looks like you carried home from your Asian trip some type of corona that affects computers...
(sorry, I could not stop!)
Wish you luck...

Last fiddled with by LaurV on 2020-03-05 at 05:37

2020-03-07, 04:20   #8
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
2¹⁰×7 Posts

The fun continues:

Got the part that lets me run one machine with two PSUs. Moved the GPU from the dead machine, added the second power supply, boot it up. All GPUs fire up and start crunching!!!


Except.... gpuowl is now being a CPU hog. So mprime is now getting 1/9th of the CPU.

Strange console error messages too: Over-subscription is not allowed for SDMA.

Looks like I need to figure out a way to resurrect the dead machine...

2020-03-07, 09:00   #9
preda
"Mihai Preda"
Apr 2015
2²·5·61 Posts

Quote:
Originally Posted by Prime95
The fun continues:

Got the part that lets me run one machine with two PSUs. Moved the GPU from the dead machine, added the second power supply, boot it up. All GPUs fire up and start crunching!!!


Except.... gpuowl is now being a CPU hog. So mprime is now getting 1/9th of the CPU.

Strange console error messages too: Over-subscription is not allowed for SDMA.

Looks like I need to figure out a way to resurrect the dead machine...
Yes, gpuowl taking 100% of one thread is a known ROCm issue; it was reported here:
https://github.com/RadeonOpenCompute/ROCm/issues/963

At the time when it was affecting me, I saw that the framework (ROCm) was trying to create a large number of events, which hit a limit in the Linux kernel, after which the framework fell back to busy-waiting. I could sort of fix it in the framework by recompiling the ROCm libs to avoid the busy-wait. But in my situation it was fixed by removing one GPU from the Gen2 PCIe slot.

What may have happened in your case is that you added one GPU in the "forbidden slot" that triggers the ROCm issue. In that situation, all the instances on all GPUs become 100%-cpu threads.
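
To see why a fallback to busy-waiting shows up as a 100%-CPU thread, here is a minimal Python sketch that compares the CPU time spent in a blocking wait with the CPU time spent polling in a loop. It is a generic illustration of the two waiting styles, not ROCm's actual code; the function names and the 0.2 second delay are arbitrary choices for the example.

```python
import threading
import time


def blocking_wait(done: threading.Event) -> None:
    """Sleep inside the OS until the event is signalled."""
    done.wait()


def busy_wait(done: threading.Event) -> None:
    """Poll the event in a tight loop; this keeps one thread fully busy."""
    while not done.is_set():
        pass


def cpu_seconds_while_waiting(wait_fn, delay: float = 0.2) -> float:
    """Return the process CPU time consumed while waiting `delay` seconds."""
    done = threading.Event()
    timer = threading.Timer(delay, done.set)  # signals the event later
    start = time.process_time()
    timer.start()
    wait_fn(done)
    timer.join()
    return time.process_time() - start


if __name__ == "__main__":
    print("blocking wait:", cpu_seconds_while_waiting(blocking_wait))
    print("busy wait:    ", cpu_seconds_while_waiting(busy_wait))
```

The blocking wait should cost close to zero CPU time, while the busy wait burns CPU for roughly the whole delay, which is the same pattern as a gpuowl instance pinning one thread while it waits on the GPU.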

2020-03-07, 09:03   #10
preda
"Mihai Preda"
Apr 2015
2²·5·61 Posts

Quote:
Originally Posted by Prime95
The fun continues:

Got the part that lets me run one machine with two PSUs. Moved the GPU from the dead machine, added the second power supply, boot it up. All GPUs fire up and start crunching!!!


Except.... gpuowl is now being a CPU hog. So mprime is now getting 1/9th of the CPU.
You might also try to increase the priority of mprime and decrease (nice) the priority of gpuowl. This way, even if gpuowl wants 100%, it would yield to mprime.
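
A rough sketch of the "nice" half of that suggestion on Linux, using Python's standard `os.setpriority`. The process name `gpuowl` is an assumption about what the binary is called on a given system (check with `ps`), and the helper names `pids_by_name` and `set_niceness` are made up for this example. Raising a process's niceness needs no special rights for your own processes; lowering it (for example to boost mprime) normally requires root.

```python
import os


def pids_by_name(name: str) -> list[int]:
    """Return PIDs whose /proc/<pid>/comm equals `name` (Linux only).

    Note: the kernel truncates comm to 15 characters.
    """
    matches = []
    if not os.path.isdir("/proc"):
        return matches
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    matches.append(int(entry))
        except OSError:
            continue  # process exited or is not readable
    return matches


def set_niceness(pid: int, niceness: int) -> int:
    """Set the niceness of `pid` and return the value now in effect."""
    os.setpriority(os.PRIO_PROCESS, pid, niceness)
    return os.getpriority(os.PRIO_PROCESS, pid)


if __name__ == "__main__":
    # Assumed process name; 19 is the lowest scheduling priority.
    for pid in pids_by_name("gpuowl"):
        print(pid, "-> niceness", set_niceness(pid, 19))
```

With gpuowl at niceness 19, the scheduler should favour mprime whenever both want the same core, even if gpuowl keeps spinning.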

2020-03-07, 13:11   #11
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1167₁₆ Posts

Quote:
Originally Posted by preda
Yes, gpuowl taking 100% of one thread is a known ROCm issue; it was reported here:
https://github.com/RadeonOpenCompute/ROCm/issues/963
I've observed that the -yield option, which works on Windows 7 to control gpuowl's CPU usage, does not work on Windows 10. Fortunately the core counts on my Win10 systems are larger than the GPU counts, so the impact is not large as a percentage of hyperthreads.

All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.