mersenneforum.org  

2011-11-29, 02:02   #1
Christenson

Dec 2010
Monticello

5·359 Posts
Computer Diet causes Machine Check Exception -- need heuristics help

My six-core beast is ill...it seems that six cores of work after six months has given it indigestion.

System, roughly:

Eric-AMD-6-Core:
AMD Phenom II x6 1090T
ASRock 880GM-LE Mobo (SB 710 Chipset, factory clocking)
GT440 GPU.
8 GB of RAM in 2 sticks
Antec Earthwatts Green 380W power supply. 200 W at full load.
Xubuntu.
Runs fine with mfaktc alone, without mprime.

Eric-AMD-6-Core Crash
1 month ago, it was crashing regularly and a crack was observed.
Ran mprime for about 1 hour on 5 cores. Kill-a-watt reports 200 W used. Resets suddenly without warning.
Ran sensors, then mprime for about 20 minutes. No temperatures near limits and voltages seem OK, but there are some peculiarities, like minima above maxima...
Get text console with
saned disabled: edit /etc/default/saned [OK]
[1045.373055] [Hardware Error]: CPU 4: Machine Check Exception: 4 Bank 0: b62bc000ea000135
[1045.373285] [Hardware Error]: TSC 31cd301d4d2 ADDR 1c9973b00
[1045.373425] [Hardware Error]: Processor 2: 100fa0 TIME 1322374257 SOCKET 0 APIC 4
[1045.373556] [Hardware Error]: MC0_STATUS[-|UE|-|PCC|AddrV|CECC]: 0xb62bc000ea000135
[1045.373692] [Hardware Error]: Data Cache Error: Data/Tag DRD error.
[1045.373810] [Hardware Error]: cache Level: L1, tx: DATA, mem-tx: DRD
[1045.373927] [Hardware Error]: Machine-Check: Processor context corrupt
[1045.374044] Kernel Panic - not synching: Fatal machine check on current CPU
[1045.374159] Pid: 2580, comm: mprime Tainted: P M 3.0.0-13-generic #22-Ubuntu
[1045.374295] Call Trace:
[1045.374338] <#MC> [.....
...
[1045.374828] <<EOE>>
[1045.374873] panic occurred, switching back to text console
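
For anyone reading along, the MC0_STATUS value in the log above can be unpacked by hand. What follows is a minimal Python sketch, not the kernel's own decoder. The flag positions are the architectural x86 machine-check status bits; the ECC field position (bits 46:45) and the three lookup tables are assumptions modelled on what the log itself prints. The function name `decode_mc_status` is made up for illustration.

```python
# Architectural MCi_STATUS flag bits shared by x86 machine-check banks.
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (shown as "UE" in the log)
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}

# Lookup tables for the low 16-bit error code in its memory/cache form
# (0000 0001 RRRR TTLL). Assumption: names mirror the strings in the log.
LEVELS = ["RESV", "L1", "L2", "L3/GEN"]
TX_TYPES = ["INSN", "DATA", "GEN", "RESV"]
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_mc_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and cache-error fields."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}

    # Assumption: bits 46:45 carry the ECC type on this CPU family.
    ecc = (status >> 45) & 0x3
    if ecc:
        result["ecc"] = "CECC" if ecc == 2 else "UECC"

    # Memory-hierarchy errors have bits 15:8 equal to 0000 0001.
    if (code >> 8) == 0x01:
        rrrr = (code >> 4) & 0xF
        result["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "RESV"
        result["tx"] = TX_TYPES[(code >> 2) & 0x3]
        result["level"] = LEVELS[code & 0x3]
    return result


if __name__ == "__main__":
    print(decode_mc_status(0xB62BC000EA000135))
```

Run on the value from the panic, this reports UC, PCC and ADDRV set and an L1 / DATA / DRD cache error, which agrees with the decoded lines in the log.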

Clearly, between power supply, mobo, ram, and CPU I have a major issue. I don't really want to end up with a lot of spare parts...but am considering a second/upgraded system, probably running Sandy Bridge.
0) I suppose step 1 is to run memtest86...I gave 6 GB to P-1...

1) The system has always been prone to crashing when the lights jump with the voltage in the house. Could I have damaged the power supply with the intermittent connection in the power strip?

2) A small regret with this system is that the Power supply is a bit small to run a truly high-end GPU. Is it worth buying a 600W or 800W power supply as an upgrade to see if that fixes the problem?

3) Should I go ahead and invest in a regulating UPS as a test, with enough capacity to run both systems?

Am I likely to get anywhere fiddling with the overclock settings?

2011-11-29, 02:12   #2
Dubslow
Basketry That Evening!

"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts

All those hardware errors seem to point to the CPU, specifically core 2 and/or 4 (or might be 3 and/or 5). It seems to me that Ubuntu Forums would be a better place to take this, with that output. Specifically:
Quote:
Kernel Panic - not synching: Fatal machine check on current CPU
Linux is shutting down because of something on the CPU, like its own little self tests are being failed (or maybe something more serious to cause a kernel panic). Maybe if there's a way on the mobo to disable some of the cores...? Either way, I don't think it's the power supply or mobo. You wouldn't happen to have another AMD proc lying around, would you? (Of course, it may have been the power fluctuations that caused the damage, but all evidence ATM points to the CPU.)

Also, what do you mean by "a crack was observed"?

Last fiddled with by Dubslow on 2011-11-29 at 02:13

2011-11-29, 02:21   #3
KyleAskine

Oct 2011
Maryland

2×5×29 Posts

I want to start by saying that all I have is guesses. There are people who would know better than me.

Anyway, my initial thoughts:
- 200W full load seems low. Those Phenoms are power hogs at full load, plus HDDs, RAM, and that video card. I don't know what a 440 should draw (I know much more about AMD's cards), but it has to be higher than the 50-70W your numbers suggest. So you get 100% GFX and CPU utilization at 200W?
- With power supplies generally, 12V amperage is more important than overall wattage, so I would check that. Garbage PSUs claim high wattage and put most of it somewhere worthless like the 5V line. Antec units are generally very high quality, though, so that probably isn't the issue.
- I agree: always check RAM first. RAM and HDDs are the two components most likely to go bad in my experience, as long as you run at normal temps.
- A quality PSU and a quality surge protector should protect all your parts from damage relating to power fluctuations. Rebooting unexpectedly from power drops shouldn't hurt anything too badly, in my opinion. Power spikes would be the larger concern.
- Even though you report normal temps, if you have a stock AMD cooler I would still suspect that cooling could be an issue (I have had faulty gauges before). So lowering/removing the overclocks could be fruitful, in my opinion.
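
The point about 12V amperage can be made concrete with a little arithmetic. The Python sketch below multiplies each rail's volts by its amps and reports what fraction of the labelled capacity sits on the 12V rail. Both labels are hypothetical numbers invented for illustration, not taken from any real PSU, and real units also have combined-rail limits that a simple sum like this ignores.

```python
def rail_watts(volts: float, amps: float) -> float:
    """Power available on one rail, in watts."""
    return volts * amps


def twelve_volt_share(label: dict) -> float:
    """Fraction of the summed label capacity that sits on the 12 V rail.

    `label` maps rail voltage to rated amps, e.g. {3.3: 20.0, 5.0: 20.0, 12.0: 28.0}.
    """
    total = sum(rail_watts(volts, amps) for volts, amps in label.items())
    return rail_watts(12.0, label[12.0]) / total


# Hypothetical label ratings (volts -> amps); made up for this example.
mostly_12v = {3.3: 20.0, 5.0: 20.0, 12.0: 28.0}
padded_5v = {3.3: 30.0, 5.0: 40.0, 12.0: 15.0}

if __name__ == "__main__":
    for name, label in (("mostly_12v", mostly_12v), ("padded_5v", padded_5v)):
        amps = label[12.0]
        share = twelve_volt_share(label)
        print(f"{name}: {amps:.0f} A on 12 V, {share:.0%} of labelled watts")
```

The two made-up labels add up to a similar total wattage, yet the second offers roughly half the 12V current of the first, which is the pattern the bullet above warns about.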

Again, just some random opinions. Someone who knows more about Linux than I do can probably shed more light on your specific error messages.

I was getting hard crashes in Debian (which Ubuntu is built on) on my box if I ran mprime on more than two cores (I was showing low 80s) with no overclock. I replaced the stock cooler with an aftermarket one and that issue completely went away.

2011-11-29, 03:29   #4
kladner

"Kieren"
Jul 2011
In My Own Galaxy!

10011110101110₂ Posts

Quote:
Originally Posted by Christenson
My six-core beast is ill...it seems that six cores of work after six months has given it indigestion.

System, roughly:

Eric-AMD-6-Core:
AMD Phenom II x6 1090T
ASRock 880GM-LE Mobo (SB 710 Chipset, factory clocking)
GT440 GPU.
8 GB of RAM in 2 sticks
Antec Earthwatts Green 380W power supply. 200 W at full load.
Xubuntu.
Runs fine with mfaktc alone, without mprime.
[snip]
My 1090T, with P95 on three cores (1-LL, 2-P-1), and three cores feeding three mfaktc instances on a GTX 460 is drawing about 365 watts. The CPU is running at 3.5GHz, and the 8GB RAM is OC'd to 1600 from 1333. This is running on a PC Power and Cooling 650 watt supply.

This is different from your load conditions, but it does make a 380 watt supply seem a bit below optimum, though a good PSU might hold up under that kind of load.

Last fiddled with by kladner on 2011-11-29 at 03:34

2011-11-29, 03:34   #5
Christenson

Dec 2010
Monticello

5·359 Posts

Quote:
Originally Posted by Dubslow
<snip>
Also, what do you mean by "a crack was observed"?
I mean, that when I flexed the surge protector/power strip, the computer would crash, and you could hear the connection being made or broken, due to the internal arc. This continued after plugging in the Kill-a-watt device, even at no load, the Kill-a-watt would come and go. I probably should have taken it back to Staples and claimed it damaged my equipment....I garbaged it instead...

As for the power usage...recall that it's only a GT440 (not the world's fastest GPU beast, just enough performance to make it interesting) and that I'm only running one HDD and not overclocking at all that I know of....and I have on-board AMD graphics, too, but I'm not really pushing performance except with mprime and mfaktc...


****************
So, ramtest first...
Second, remove the heatsink and replace it with an aftermarket cooler (in my parts kit; don't forget about the bent and partially unbent pin 997 on the CPU)....use the good (Arctic Silver) heatsink paste in case the original factory stuff has dried out and the thermostat is wrong...wonder if a hot spot under that is possible?
Third, upgrade the PS...or get a regulating UPS instead? More $$, but I want one anyway.
Is it worth a lapping kit if I get a CPU? Should I get a cheap (4-core) CPU for testing and an 8-core upgrade for real use?
********

Last fiddled with by Christenson on 2011-11-29 at 03:43

2011-11-29, 03:42   #6
kladner

"Kieren"
Jul 2011
In My Own Galaxy!

2×3×1,693 Posts

Quote:
Originally Posted by Christenson
I mean, that when I flexed the surge protector/power strip, the computer would crash, and you could hear the connection being made or broken, due to the internal arc. This continued after plugging in the Kill-a-watt device, even at no load, the Kill-a-watt would come and go. I probably should have taken it back to Staples and claimed it damaged my equipment....I garbaged it instead...

As for the power usage...recall that it's only a GT440 (not the world's fastest GPU beast, just enough performance to make it interesting) and that I'm only running one HDD and not overclocking at all that I know of....and I have on-board AMD graphics, too, but I'm not really pushing performance except with mprime and mfaktc...
It does seem that the power strip was pretty funky. I remember you talking about that. No telling what the by-products of arcing might have done to other components.

Also, your load is substantially less than mine. In addition to the things I listed above, I have 4 HDDs.

So I really don't know what to suggest.

2011-11-29, 15:35   #7
bcp19

Oct 2011

1247₈ Posts

Quote:
Originally Posted by KyleAskine
I want to start by saying that all I have is guesses. There are people who would know better than me.

Anyway, my initial thoughts:
- 200W full load seems low. Those Phenoms are power hogs at full load, plus HDDs, RAM, and that video card. I don't know what a 440 should draw (I know much more about AMD's cards), but it has to be higher than the 50-70W your numbers suggest. So you get 100% GFX and CPU utilization at 200W?
- With power supplies generally, 12V amperage is more important than overall wattage, so I would check that. Garbage PSUs claim high wattage and put most of it somewhere worthless like the 5V line. Antec units are generally very high quality, though, so that probably isn't the issue.
- I agree: always check RAM first. RAM and HDDs are the two components most likely to go bad in my experience, as long as you run at normal temps.
- A quality PSU and a quality surge protector should protect all your parts from damage relating to power fluctuations. Rebooting unexpectedly from power drops shouldn't hurt anything too badly, in my opinion. Power spikes would be the larger concern.
- Even though you report normal temps, if you have a stock AMD cooler I would still suspect that cooling could be an issue (I have had faulty gauges before). So lowering/removing the overclocks could be fruitful, in my opinion.

Again, just some random opinions. Someone who knows more about Linux than I do can probably shed more light on your specific error messages.

I was getting hard crashes in Debian (which Ubuntu is built on) on my box if I ran mprime on more than two cores (I was showing low 80s) with no overclock. I replaced the stock cooler with an aftermarket one and that issue completely went away.
GeForce GT 440 3GB 56 Watts (same as the 1.5 GB)
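
Putting the thread's numbers together: a Kill-a-watt measures AC draw at the wall, so the DC load the PSU actually delivers is lower by the PSU's efficiency. Here is a rough Python sketch, assuming an efficiency between 80% and 85% (an assumption, not a measurement of this supply), with the 200 W reading and the 380 W rating taken from the first post. The function names are made up for illustration.

```python
WALL_WATTS = 200.0   # Kill-a-watt reading reported in the first post
RATED_WATTS = 380.0  # label rating of the power supply


def dc_load_watts(wall_watts: float, efficiency: float) -> float:
    """Estimate the DC load from the AC draw measured at the wall."""
    return wall_watts * efficiency


def load_fraction(wall_watts: float, efficiency: float, rated_watts: float) -> float:
    """Estimated DC load as a fraction of the PSU's rated output."""
    return dc_load_watts(wall_watts, efficiency) / rated_watts


if __name__ == "__main__":
    # Assumed efficiencies; the real figure for this unit is not given in the thread.
    for eff in (0.80, 0.85):
        load = dc_load_watts(WALL_WATTS, eff)
        frac = load_fraction(WALL_WATTS, eff, RATED_WATTS)
        print(f"efficiency {eff:.0%}: about {load:.0f} W DC, {frac:.0%} of rating")
```

Under these assumptions the supply would be running below half of its total rating. That says nothing about how the load is split across the individual rails, which a wall meter cannot show.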

Last fiddled with by bcp19 on 2011-11-29 at 15:36

2011-11-29, 20:52   #8
lycorn

"GIMFS"
Sep 2002
Oeiras, Portugal

11·137 Posts

I wouldn't be surprised at all if the power supply turns out to be responsible for the crashes reported.
380 W is a low value. That is the overall rated power, and you may be stressing the PSU too much on some of the lines. And as you have reported some primary power instability, that may have caused some damage to the PSU.
If you have the chance, try replacing the PSU with a more powerful one for a start.
The next thing to check, if the problem doesn't go away, is the memory.
In any case, a regulating UPS is a good safeguard against power fluctuations, and you should get one.

2011-11-30, 02:26   #9
Christenson

Dec 2010
Monticello

5×359 Posts

The UPS, from Cyberpower at newegg.com, rated for PFC and regulated, is on its way...Model CP1500PFCLCD... I was impressed by Cyberpower's willingness to help when their stuff wasn't working, and to change the way they did stuff when it was causing problems, like setting units on top of cords in shipping.

I probably spent more than absolutely necessary...$219...maybe....I didn't quite see what an extra $100 bought for the fat, squat models, but I am considering a second system. Newegg had a stepped-sine-wave output model on sale for $149...decided I didn't want to fool with that.

I'm seriously considering a high-end (800W or more) Antec, as I could use that on system #2 and/or upgrade the GPU. Suggestions? Are there better brands without getting tremendously more pricey?

And while I'm at it, what's the best way to mount a small fan for spot cooling inside a case, preferably without drilling extra mounting holes? (the north bridge chip fins run a tad warm, the engineer wants to direct some cooling air at it).

2011-11-30, 03:09   #10
KyleAskine

Oct 2011
Maryland

2·5·29 Posts

Quote:
Originally Posted by Christenson
I'm seriously considering a high-end (800W or more) Antec, as I could use that on system #2 and/or upgrade the GPU. Suggestions? Are there better brands without getting tremendously more pricey?
Absolutely nothing wrong with Antec. Corsair is another good choice. Seasonic is fine. Plus a few more quality brands.

Good PSUs are significantly more expensive than cheap ones. With that said, that is the one part of the computer that you absolutely, positively don't want to go cheap with, in my opinion. The well-being of every component relies on it. Plus it will save you money over the long run if you get one with a decent energy rating.

Last fiddled with by KyleAskine on 2011-11-30 at 03:10

2011-11-30, 03:22   #11
Dubslow
Basketry That Evening!

"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts

[tangent]
How about Rosewill? Would you say they're a good company? My brother's looking into a new comp, and money isn't exactly easy to come by... 1000W PSUs.

He'd decided on this:
http://www.newegg.com/Product/Produc...82E16817171056
Which seems good quality, and has 5 eggs.

But this is cheaper (by a lot):
http://www.newegg.com/Product/Produc...82E16817182188

So I'm wondering: would the cheaper one be a safe buy, if we had to shave some money off?

Thanks.
[/tangent]

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.