mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Hardware repair odyssey (https://www.mersenneforum.org/showthread.php?t=25342)

Prime95 2020-03-05 00:36

Hardware repair odyssey
 
I've never failed so miserably at repairing a machine....

So, the machine was throwing read errors during boot-up and normal operation. Googling suggested the error could be a bad SATA cable. However, as this was a 9-year-old disk drive and the SATA cable had been working for years, I figured it was time to upgrade to an SSD.

1) Got the SSD and installed Ubuntu. Started to install stuff and got disk read errors followed by complete disk read failure. Hmmmm.
2) Replaced the SATA cable. Install Ubuntu again. No better.
3) Tried moving to a different SATA port. Install Ubuntu. Worked for a few minutes and I rebooted. Now it won't power on at all. A brief flash of lights and then nothing.
4) Yanked out the power supply and replaced it with a less powerful one. Install Ubuntu. Worked briefly. Powered off and on again -- the internet port is dead.
5) Dragged out a USB to Ethernet dongle. Install Ubuntu. Looked stable for a few minutes. Powered off to remove the GPU that cannot be run using the new power supply. Powered on -- nothing, completely dead.

I suppose the problem is the motherboard...

kriesel 2020-03-05 00:51

[QUOTE=Prime95;538911]I've never failed so miserably at repairing a machine....
[/QUOTE]I have had some success getting more life out of vintage hardware by methodically disconnecting and reseating _everything_, one connection at a time. If you're going to strap on an antistatic tether, you may as well make it worthwhile. And note that when that's needed, the system is on life support.

Prime95 2020-03-05 01:37

Not my day!!!!!

So, I take the GPU from the dead machine above over to the other machine that is down one GPU because it needs to be RMA'd. I power down, plug in the GPU, power up, and all the GPUs start cranking. Sweet.

....except the internet port is now dead.

EDIT: I must have forgotten to toggle the power supply off after the shutdown. It's easy to make mistakes after hours of frustration.

Prime95 2020-03-05 03:14

[QUOTE=Prime95;538917]

....except the internet port is now dead.

[/QUOTE]


Some progress. I've got the USB ethernet dongle working. Configuring Ubuntu networking by hand was an adventure.
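For anyone following along, hand configuration on recent Ubuntu usually means writing a netplan file. A minimal sketch, in which the file name and interface name are assumptions (USB NICs typically show up as enx followed by the MAC address):

```yaml
# /etc/netplan/01-usb-nic.yaml -- hypothetical file and interface names
network:
  version: 2
  renderer: networkd
  ethernets:
    enx001122334455:    # USB adapters usually appear as enx<MAC>
      dhcp4: true       # let DHCP handle addressing
```

followed by `sudo netplan apply`.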

I'm not sure I'm going to try and resurrect the dead machine. I may just add a PSU to my working machine and move the GPUs there.

phillipsjk 2020-03-05 03:16

On a machine I got for "free", disk (and network?) IO would cause it to fail (I forget the exact errors).

Turns out the southbridge chip was faulty. Ended up getting an AM2 board with cheap "high density" RAM because Intel boards don't support "high density" DDR2 (4GB modules, only 2 slots).

kriesel 2020-03-05 03:51

[QUOTE=Prime95;538917]
EDIT: I must have forgotten to toggle the power supply off after the shutdown. It's easy to make mistakes after hours of frustration.[/QUOTE]Maybe that's what the following drill is for, other than getting rid of stored charge:
1) orderly shutdown, which on most machines turns the power supply off; ancient hardware or OS may require manual shutoff
2) UNPLUG the power cord
3) Push and hold the power button for several seconds to get rid of the residual charge still stored in the power supply
4) Check that the fans have stopped rotating, all LEDs have gone out, etc.
5) Antistatic strap onto wrist and chassis. Look around for needed tools and components
6) Unsnap the tether from the wrist band, get the screwdriver and the necessary components
7) Reconnect tether to wrist band, remove cover, etc.

If the system starts powering up again as a result of #3, you must have skipped #2.
Some of us may resort to a checklist, coffee, or a nap for major maintenance tasks.

LaurV 2020-03-05 05:37

Yuck, George, that sucks, it looks like you carried home from your Asian trip some type of corona that affects computers...
(sorry, I could not stop! :razz:)
Wish you luck...

Prime95 2020-03-07 04:20

The fun continues:

Got the part that lets me run one machine with two PSUs. Moved the GPU from the dead machine, added the second power supply, boot it up. All GPUs fire up and start crunching!!!


Except.... gpuowl is now being a CPU hog. So mprime is now getting 1/9th of the CPU.

Strange console error messages too: Over-subscription is not allowed for SDMA.

Looks like I need to figure out a way to resurrect the dead machine...

preda 2020-03-07 09:00

[QUOTE=Prime95;539067]The fun continues:

Got the part that lets me run one machine with two PSUs. Moved the GPU from the dead machine, added the second power supply, boot it up. All GPUs fire up and start crunching!!!


Except.... gpuowl is now being a CPU hog. So mprime is now getting 1/9th of the CPU.

Strange console error messages too: Over-subscription is not allowed for SDMA.

Looks like I need to figure out a way to resurrect the dead machine...[/QUOTE]

Yes, gpuowl taking 100% of one thread is a known ROCm issue; it was reported here:
[url]https://github.com/RadeonOpenCompute/ROCm/issues/963[/url]

At the time it was affecting me, I saw that the framework (ROCm) was trying to create a large number of events, which was hitting a Linux kernel limit, after which the framework fell back to busy-waiting. I could sort of fix it in the framework by recompiling the ROCm libs to avoid the busy-wait, but in my situation it was fixed by removing one GPU from the Gen2 PCIe slot.

What may have happened in your case is that you added one GPU in the "forbidden slot" that triggers the issue. In that situation, all the instances on all GPUs become 100%-CPU threads.
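The busy-wait fallback is easy to see outside ROCm. Here is a toy Python sketch (nothing to do with ROCm's internals) contrasting a blocking wait on an event with spin-polling the same event -- the spinning waiter burns a full core's worth of CPU time while "doing nothing":

```python
import threading
import time

def wait_for(evt, busy):
    """Wait for an event either by blocking or by spin-polling."""
    if busy:
        while not evt.is_set():   # spin-polling: burns CPU the whole time
            pass
    else:
        evt.wait()                # blocking: sleeps in the kernel until signalled

def cpu_cost(busy):
    """CPU seconds consumed while waiting 0.2 s for a signal."""
    evt = threading.Event()
    threading.Timer(0.2, evt.set).start()   # signal arrives after 0.2 s
    t0 = time.process_time()
    wait_for(evt, busy)
    return time.process_time() - t0

print("blocking wait CPU:", cpu_cost(False))   # near zero
print("spinning wait CPU:", cpu_cost(True))    # roughly the full 0.2 s
```

Both waiters return at the same wall-clock moment; only the spinner shows up as a 100%-CPU thread in `top`, which matches the symptom described above.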

preda 2020-03-07 09:03

[QUOTE=Prime95;539067]The fun continues:

Got the part that lets me run one machine with two PSUs. Moved the GPU from the dead machine, added the second power supply, boot it up. All GPUs fire up and start crunching!!!


Except.... gpuowl is now being a CPU hog. So mprime is now getting 1/9th of the CPU.
[/QUOTE]

You might also try increasing the priority of mprime and decreasing (nicing) the priority of gpuowl. That way, even if gpuowl wants 100% of a core, it will yield to mprime.
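On Linux that's the nice/renice mechanism. A small Python sketch of the underlying call -- the `spawn_niced` helper is hypothetical, and in a shell you would normally just prefix the command with `nice -n 19`:

```python
import os
import subprocess

def spawn_niced(cmd, niceness=19):
    """Hypothetical helper: start cmd at lower scheduling priority."""
    return subprocess.Popen(cmd, preexec_fn=lambda: os.nice(niceness))

# Demonstrate the underlying call on the current process:
before = os.nice(0)   # an increment of 0 reads the current niceness
after = os.nice(5)    # raise niceness by 5 -> lower CPU priority
print(before, after)
```

An unprivileged process can only raise its niceness (lower its priority), so this is a one-way adjustment; lowering gpuowl's priority needs no root, while boosting mprime above the default does.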

kriesel 2020-03-07 13:11

[QUOTE=preda;539069]Yes, gpuowl taking 100% of one thread is a known ROCm issue; it was reported here:
[URL]https://github.com/RadeonOpenCompute/ROCm/issues/963[/URL]
[/QUOTE]I've observed that the -yield option, which controls gpuowl's CPU usage on Windows 7, does not work on Windows 10. Fortunately, the core counts on my Win10 systems are larger than the GPU counts, so the impact is not large on a percentage-of-hyperthreads basis.

Prime95 2020-03-08 01:08

[QUOTE=preda;539069]What may have happened in your case is that you added one GPU in the "forbidden slot" that triggers ROCm. In that situation, all the instances on all GPUs become 100%-cpu threads.[/QUOTE]

Sandy Bridge lives again. I revived this antique computer and moved the GPU to it for the time being. I can't see spending $100-150 on a Haswell motherboard when Haswell gives 1/10th the performance of a Radeon VII. I can move the memory to another Haswell box which might give mprime a small boost.

I'll eventually start disassembling one of the dream machines and move the GPU and the RMA'd GPU to these mini-itx motherboards.

preda 2020-03-08 13:03

I just spent a huge amount of time diagnosing why my system with 3x GPUs started freezing after a reboot when it had been working fine before. It seemed one GPU was causing the problem, so I started switching GPUs around, switching the PCIe slots they're connected to, etc., to see which component the problem stayed attached to.

And, very surprisingly, it seems the problem was attached to the power cable (the one connecting the PSU to the GPU)! Yet the GPU was starting up fine, just dying a few seconds after starting gpuowl on it, with very exotic errors. Anyway, I'm happy it seems fixed now.

dcheuk 2020-03-08 19:12

[QUOTE=preda;539159]I just spent a huge amount of time diagnosing why my system with 3x GPUs started freezing after a reboot when it had been working fine before. It seemed one GPU was causing the problem, so I started switching GPUs around, switching the PCIe slots they're connected to, etc., to see which component the problem stayed attached to.

And, very surprisingly, it seems the problem was attached to the power cable (the one connecting the PSU to the GPU)! Yet the GPU was starting up fine, just dying a few seconds after starting gpuowl on it, with very exotic errors. Anyway, I'm happy it seems fixed now.[/QUOTE]

Lol :smile:

Had this problem on a Mac Pro couple years ago, almost pulled my hair out figuring out what was wrong.

kriesel 2020-03-09 15:04

[QUOTE=preda;539159]I just spent a huge amount of time diagnosing why my system with 3x GPUs started freezing after a reboot ... it seems the problem was attached to the power cable (the one connecting the PSU to the GPU)! Yet the GPU was starting up fine, just dying a few seconds after starting gpuowl on it, with very exotic errors
...
anyway I'm happy it seems fixed now.[/QUOTE]I've seen it take 45 minutes or more for issues to develop.

I just redid the cabling a bit on my mini-miner, which currently has 5 NVIDIA GPUs of 3 different models. So far, arranging for no more than 2 connections per power cable to a GPU or PCIe-extender card, regardless of the connector count on the cable, seems to be working better. V=IR
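A back-of-the-envelope sketch of the V=IR point, with assumed numbers not taken from the thread (16 AWG wire at roughly 0.013 ohm per metre, a 0.6 m cable, and 150 W per 8-pin connector drawn on the 12 V rail):

```python
# Rough estimate of voltage drop on a daisy-chained PSU-to-GPU cable.
# All figures below are illustrative assumptions, not measured values.
OHM_PER_M = 0.013       # ~16 AWG copper
CABLE_M = 0.6           # cable length
WATTS_PER_CONN = 150.0  # load per 8-pin connector
RAIL_V = 12.0           # nominal rail voltage

def voltage_drop(connectors):
    """One-way V = I*R drop when `connectors` loads share one cable."""
    current = connectors * WATTS_PER_CONN / RAIL_V   # I = P / V
    resistance = OHM_PER_M * CABLE_M                 # R of the shared run
    return current * resistance                      # V = I * R

for n in (1, 2, 3):
    print(f"{n} connector(s): {voltage_drop(n):.2f} V drop")
```

The drop grows linearly with the number of connectors sharing the cable, and the return conductors roughly double it, which is why spreading heavy GPU loads across separate cables keeps the 12 V rail at the card much closer to nominal.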

