2011-11-29, 02:02 #1 Christenson Dec 2010 Monticello 5·359 Posts

Computer Diet causes Machine Check Exception -- need heuristics help

My six-core beast is ill...it seems that six cores of work after six months has given it indigestion.

System, roughly:
Eric-AMD-6-Core:
AMD Phenom II x6 1090T
ASRock 880GM-LE Mobo (SB 710 Chipset, factory clocking)
GTX440 GPU.
8 Gig of Ram in 2 sticks
Antec Earthwatts Green 380W Power supply. 200 W at full load.
Xubuntu. Runs fine, mfaktc, no mprime.

Eric-AMD-6-Core Crash
1 Month ago, was crashing regularly and a crack was observed
run mprime about 1 hour, 5 cores. Kill-a-watt reports 200W used. Resets suddenly without warning.
run sensors, run mprime, about 20 minutes. No temperatures near limits, voltages seem OK, but there are some peculiarities -- like minima above maxima...
Get text console with saned disabled: edit /etc/default/saned [OK]

[1045.373055] [Hardware Error]: CPU 4: Machine Check Exception: 4 Bank 0: b62bc000ea000135
[1045.373285][Hardware Error]: TSC 31cd301d4d2 ADDR 1c9973b00
[1045.373425][Hardware Error]:Processor 2: 100fa0 TIME 1322374257 SOCKET 0 APIC 4
[1045.373556][Hardware Error]:MC0_STATUS[-|UE|-|PCC|AddrV|CECC]: 0xb62bc000ea000135
[1045.373692][Hardware Error]:Data Cache Error: Data/Tag DRD error.
[1045.373810][Hardware Error]:cache Level: L1, tx: DATA, mem-tx: DRD
[1045.373927][Hardware Error]:Machine-Check: Processor context corrupt
[1045.374044] Kernel Panic - not synching: Fatal machine check on current CPU
[1045.374159]Pid: 2580, comm: mprime Tainted: P M 3.0.0-13-generic#22-Ubuntu
[1045.374295]Call Trace:
[1045.374338]<#MC> [.....
...
[1045.374828]<>
[1045.374873] panic occurred, switching back to text console

Clearly, between power supply, mobo, ram, and CPU I have a major issue. I don't really want to end up with a lot of spare parts...but am considering a second/upgraded system, probably running Sandy Bridge.

0) I suppose step1 is to run memtest86...I gave 6 Gig to P-1...
1) The system has always been prone to crashing when the lights jump with the voltage in the house. Could I have damaged the power supply with the intermittent in the power strip?
2) A small regret with this system is that the Power supply is a bit small to run a truly high-end GPU. Is it worth buying a 600W or 800W power supply as an upgrade to see if that fixes the problem?
3) Should I go ahead and invest in a regulating UPS as a test, with enough capacity to run both systems?

Am I likely to get anywhere fiddling with the overclock settings?
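
The "minima above maxima" oddity in the sensors readings can be checked mechanically. The following is a minimal sketch, assuming the readings are printed in the usual lm-sensors shape of `label: value unit (min = ..., max = ...)`. The sample text and the function name `find_inverted_limits` are hypothetical and are not taken from the actual machine. Inverted limits may simply mean the limit registers were never configured, so a hit here is a hint, not a diagnosis.

```python
import re

# Matches lines shaped like: "in1:  +1.10 V  (min =  +1.30 V, max =  +0.90 V)"
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s*(?P<value>[+-]?\d+(?:\.\d+)?)\s*\S+\s*"
    r"\(min\s*=\s*(?P<min>[+-]?\d+(?:\.\d+)?)[^,]*,"
    r"\s*max\s*=\s*(?P<max>[+-]?\d+(?:\.\d+)?)"
)


def find_inverted_limits(text):
    """Return (label, min, max) for each reading whose min exceeds its max."""
    suspicious = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and float(m["min"]) > float(m["max"]):
            suspicious.append((m["label"].strip(), float(m["min"]), float(m["max"])))
    return suspicious


# Hypothetical sample, NOT real output from the machine in this thread.
SAMPLE = """\
in0:          +1.02 V  (min =  +0.00 V, max =  +4.08 V)
in1:          +1.10 V  (min =  +1.30 V, max =  +0.90 V)
in2:          +3.31 V  (min =  +2.97 V, max =  +3.63 V)
"""

for label, lo, hi in find_inverted_limits(SAMPLE):
    print(f"{label}: min {lo} is above max {hi}")
```

Readings flagged this way are better ignored when judging whether the voltages "seem OK", since their alarm thresholds carry no information.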
All those hardware errors seem to point to the CPU, specifically core 2 and/or 4 (or might be 3 and/or 5). It seems to me that Ubuntu Forums would be a better place to take this, with that output. Specifically:
Quote:
 Kernel Panic - not synching: Fatal machine check on current CPU
Linux is shutting down because of something on the CPU, like its own little self tests are being failed (or maybe something more serious to cause a kernel panic). Maybe if there's a way on the mobo to disable some of the cores...? Either way, I don't think it's the power supply or mobo. You wouldn't happen to have another AMD proc lying around, would you? (Of course, it may have been the power fluctuations that caused the damage, but all evidence ATM points to the CPU.)
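
The MC0_STATUS value in the log can be unpacked by hand to see why the kernel treats the error as fatal. This is a rough sketch, not the kernel's own decoder. It assumes the standard x86 MCi_STATUS flag positions (bits 63 down to 57), assumes bits 46:45 are the AMD ECC indicators, and assumes cache errors use the `0000 0001 RRRR TTLL` error-code layout. Only the names L1, DATA and DRD are confirmed by the log itself; the other table entries are from memory and should be treated as assumptions.

```python
# Architectural x86 MCi_STATUS flag bits (bit position -> short name).
FLAG_BITS = {63: "Val", 62: "Over", 61: "UC", 60: "En",
             59: "MiscV", 58: "AddrV", 57: "PCC"}

# Field tables; only "L1", "DATA" and "DRD" are confirmed by the quoted log.
CACHE_LEVEL = ["RESV", "L1", "L2", "L3/GEN"]
TRANSACTION = ["INSN", "DATA", "GEN", "RESV"]
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1]
    # Assumption: bits 46:45 are the AMD ECC indicators (2 -> CECC, 1 -> UECC).
    ecc = {1: "UECC", 2: "CECC"}.get((status >> 45) & 0x3)
    code = status & 0xFFFF
    decoded = {"flags": flags, "ecc": ecc, "error_code": code}
    # Assumption: cache/memory errors follow the pattern 0000 0001 RRRR TTLL.
    if (code & 0xFF00) == 0x0100:
        rrrr = (code >> 4) & 0xF
        decoded["cache_level"] = CACHE_LEVEL[code & 0x3]
        decoded["tx"] = TRANSACTION[(code >> 2) & 0x3]
        decoded["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "RESV"
    return decoded


info = decode_status(0xB62BC000EA000135)
print(info)
```

For the value in the log this yields the UC, PCC and AddrV flags plus CECC, and an L1 / DATA / DRD error code, which lines up with the kernel's own "UE|-|PCC|AddrV|CECC" and "cache Level: L1, tx: DATA, mem-tx: DRD" lines. PCC (processor context corrupt) is the bit that makes the panic unavoidable.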

Also, what do you mean by "a crack was observed"?

2011-11-29, 02:21 #3 KyleAskine Oct 2011 Maryland 2×5×29 Posts

I want to start by saying that all I have is guesses. There are people who would know better than me. Anyway, my initial thoughts:

- 200W full load seems low. Those Phenoms are power hogs at full load, plus HDDs, RAM, and that video card. I don't know what a 440 should draw (I know much more about AMD's cards) but it has to be higher than the 50-70W your numbers suggest. So you get 100% GFX and CPU utilization at 200W?
- With power supplies generally, 12V amperage is more important than overall wattage, so I would check that. Garbage PSUs claim high wattage, and throw it all somewhere worthless like the 5V line. Though Antecs are generally very high quality, so that probably isn't the issue.
- I agree - always check RAM first. It, along with the HDD, is one of the two things most likely to be corrupt in my experience, as long as you run at normal temps.
- A quality PSU and a quality surge protector should protect all your parts from damage relating to power fluctuations. Rebooting unexpectedly from power drops shouldn't hurt anything too badly, in my opinion. Power spikes would be the larger concern.
- Even though you report normal temps, if you have a stock AMD cooler I would still potentially suspect that cooling could be an issue (I have had faulty gauges before). So lowering/removing the overclocks could be fruitful, in my opinion.

Again, just some random opinions. Someone who knows more about Linux than I do can probably shed more light on your specific error messages. I was getting hard crashes in Debian (which Ubuntu is built off of) on my box if I ran mprime on over two cores (I was showing low 80s) with no overclock. I replaced the stock cooler with an aftermarket one and that issue completely went away.
Quote:
 Originally Posted by Christenson My six-core beast is ill...it seems that six cores of work after six months has given it indigestion. System, roughly: Eric-AMD-6-Core: AMD Phenom II x6 1090T ASRock 880GM-LE Mobo (SB 710 Chipset, factory clocking) GTX440 GPU. 8 Gig of Ram in 2 sticks Antec Earthwatts Green 380W Power supply. 200 W at full load. Xubuntu. Runs fine, mfaktc, no mprime.[snip]
My 1090T, with P95 on three cores (1-LL, 2-P-1), and three cores feeding three mfaktc instances on a GTX 460 is drawing about 365 watts. The CPU is running at 3.5GHz, and the 8GB RAM is OC'd to 1600 from 1333. This is running on a PC Power and Cooling 650 watt supply.

This is different from your load conditions, but it does make a 380 watt supply seem a bit below optimum, though a good PSU might hold up under that kind of load.
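
To put the two systems on the same footing, the wall readings can be converted into a rough fraction of each supply's rating. This is a back-of-the-envelope sketch: it assumes both the 200 W and the 365 W figures were measured at the wall, and the 80% efficiency is a placeholder, not a measured value for either PSU. The function name `psu_load_fraction` is made up for illustration.

```python
def psu_load_fraction(wall_watts, rated_watts, efficiency=0.80):
    """Estimate DC load as a fraction of the PSU's rated output.

    A wall meter reads AC input; the PSU rating is DC output, so the
    reading is scaled by an assumed efficiency before comparing.
    """
    dc_watts = wall_watts * efficiency
    return dc_watts / rated_watts


# Figures quoted in the thread; the efficiency is an assumption.
systems = {
    "380 W Antec, 200 W at the wall": (200, 380),
    "650 W PC Power and Cooling, 365 W at the wall": (365, 650),
}

for name, (wall, rated) in systems.items():
    print(f"{name}: about {psu_load_fraction(wall, rated):.0%} of rating")
```

Under that assumption both supplies sit below half of their overall rating, so total wattage alone does not explain the crashes. It says nothing about how the load is split across the individual voltage lines.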

Quote:
 Originally Posted by Dubslow Also, what do you mean by "a crack was observed"?
I mean, that when I flexed the surge protector/power strip, the computer would crash, and you could hear the connection being made or broken, due to the internal arc. This continued after plugging in the Kill-a-watt device, even at no load, the Kill-a-watt would come and go. I probably should have taken it back to Staples and claimed it damaged my equipment....I garbaged it instead...

As for the power usage...recall that it's only a GT440 (not the world's fastest GPU beast, just enough performance to make it interesting) and that I'm only running one HDD and not overclocking at all that I know of....and I have on-board AMD graphics, too, but I'm not really pushing performance except with mprime and mfaktc...

****************
So, ramtest first...
Remove heatsink and replace with aftermarket cooler second (it's in my parts kit; don't forget about the bent and partially unbent pin 997 on the CPU)...use the good (Arctic Silver) heatsink paste in case the original factory stuff has dried out and the temperature reading is wrong...wonder if a hot spot under that is possible?
Upgrade PS third...or get regulating UPS instead? -- more money, but I want one anyway
Is it worth a lapping kit if I get a CPU? Should I get a cheap (4-core) for testing, 8-core upgrade for real use?
********

Quote:
 Originally Posted by Christenson I mean, that when I flexed the surge protector/power strip, the computer would crash, and you could hear the connection being made or broken, due to the internal arc. This continued after plugging in the Kill-a-watt device, even at no load, the Kill-a-watt would come and go. I probably should have taken it back to Staples and claimed it damaged my equipment....I garbaged it instead... As for the power usage...recall that it's only a GT440 (not the world's fastest GPU beast, just enough performance to make it interesting) and that I'm only running one HDD and not overclocking at all that I know of....and I have on-board AMD graphics, too, but I'm not really pushing performance except with mprime and mfaktc...
It does seem that the power strip was pretty funky. I remember you talking about that. No telling what the by-products of arcing might have done to other components.

Also, your load is substantially less than mine. In addition to the things I listed above, I have 4 HDD's.

So I really don't know what to suggest.

Quote:
 Originally Posted by KyleAskine I want to start by saying that all I have is guesses. There are people who would know better than me. Anyway, my initial thoughts: - 200W full load seems low. Those Phenoms are power hogs at full load, plus HDDs, RAM, and that Video card. I don't know what a 440 should draw (I know much more about AMD's) but it has to be higher than the 50-70W your numbers suggest. So you get 100% GFX and CPU utilization at 200W? - With power supplies generally, 12V amperage is more imporant than overall wattage, so I would check that. Garbage PSU's claim high wattage, and throw it all somewhere worthless like the 5V line. Though Antec's are generally very high quality, so that probably isn't the issue. - I agree - always check RAM first. It, along with HDD, are the two most likely things to be corrupt in my experience as long as you run at normal temps. - A quality PSU and a quality surge protector should protect all your parts from damage relating to power fluctuations. Rebooting unexpectedly from power drops shouldn't hurt anything too badly, in my opinion. Power spikes would be the larger concern. - Even though you report normal temps, if you have a stock AMD cooler I would still potentially suspect that cooling could be an issue (I have had faulty gauges before). So lowering/removing the overclocks could be fruitful, in my opinion. Again, just some random opinions. Someone who knows more about linux than I can probably shed more light on your specific error messages. I was getting hard crashes in Debian (which ubuntu is built off of) on my box if I ran mprime on over two cores (I was showing low 80s) with no overclock. I replaced the stock cooler with an aftermarket one and that issue completely went away.
GeForce GT 440 3GB 56 Watts (same as the 1.5 GB)

2011-11-29, 20:52 #8 lycorn "GIMFS" Sep 2002 Oeiras, Portugal 11·137 Posts

I wouldn't be surprised at all if the power supply turns out to be responsible for the crashes reported. 380 W is a low rating. That is the overall rated power; you may be stressing the PSU too much on some of the lines. And as you have reported some primary power instability, that may have caused some damage to the PSU. If you have the chance, try replacing the PSU with a more powerful one for a start. The next thing to check, if the problem doesn't go away, is the memory. In any case, a regulating UPS is a good safeguard against power fluctuations, and you should get one.
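
The point about individual lines can be illustrated with a small per-line check: a supply can be well under its overall rating while one line is close to its own limit. This is only a sketch. Every amperage and load figure below is a hypothetical placeholder, not the Earthwatts 380's real specification, and the 80% threshold is an arbitrary comfort margin.

```python
from dataclasses import dataclass


@dataclass
class Rail:
    name: str
    volts: float
    max_amps: float    # from the PSU label (placeholder values here)
    load_watts: float  # estimated draw on this line (placeholder values here)

    @property
    def capacity_watts(self):
        # P = V * I
        return self.volts * self.max_amps

    @property
    def utilization(self):
        return self.load_watts / self.capacity_watts


def overloaded(rails, limit=0.80):
    """Names of the lines loaded beyond `limit` of their own capacity."""
    return [r.name for r in rails if r.utilization > limit]


# Hypothetical numbers for illustration only.
RAILS = [
    Rail("+12V", 12.0, 20.0, 200.0),
    Rail("+5V", 5.0, 20.0, 20.0),
    Rail("+3.3V", 3.3, 20.0, 10.0),
]

total = sum(r.load_watts for r in RAILS)
print(f"total load {total:.0f} W, stressed lines: {overloaded(RAILS)}")
```

With these made-up figures the total load is 230 W, comfortably under a 380 W rating, yet the +12V line is flagged. Reading the real per-line amperages off the PSU label is the way to replace the placeholders.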
2011-11-30, 02:26 #9 Christenson Dec 2010 Monticello 5×359 Posts

The UPS, from Cyberpower, at newegg.com, rated for PFC, and regulated, is on its way...Model CP1500PFCLCD. I was impressed by Cyberpower's willingness to help when their stuff wasn't working, and to change the way they did stuff when it was causing problems, like setting units on top of cords in shipping. I probably spent more than absolutely necessary...$219...maybe....I didn't quite see what an extra $100 bought for the fat, squat models, but I am considering a second system. Newegg had a stepped-sine-wave output model on sale for $149...decided I didn't want to fool with that.

I'm seriously considering a high-end (800W or more) Antec, as I could use that on system #2 and/or upgrade the GPU.

Suggestions? Are there better brands without getting tremendously more pricey?

And while I'm at it, what's the best way to mount a small fan for spot cooling inside a case, preferably without drilling extra mounting holes? (The north bridge chip fins run a tad warm; the engineer wants to direct some cooling air at them.)
Quote:
 Originally Posted by Christenson I'm seriously considering a high-end (800W or more) Antec, as I could use that on system #2 and/or upgrade the GPU. Suggestions? Are there better brands without getting tremendously more pricey?
Absolutely nothing wrong with Antec. Corsair is another good choice. Seasonic is fine. Plus a few more quality brands.

Good PSUs are significantly more expensive than cheap ones. With that said, that is the one part of the computer that you absolutely positively don't want to go cheap with, in my opinion. The well-being of every component relies on it. Plus it will save you money over the long run if you get one with a decent energy rating.
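
The long-run savings claim is easy to put rough numbers on for a machine that crunches around the clock. This sketch uses assumed figures throughout: the 160 W DC load, the 75% and 85% efficiencies, and the $0.12 per kWh price are placeholders, not measurements or quotes from this thread.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost(dc_watts, efficiency, price_per_kwh):
    """Yearly electricity cost of a constant DC load behind a PSU."""
    wall_watts = dc_watts / efficiency          # AC drawn from the outlet
    kwh_per_year = wall_watts * HOURS_PER_YEAR / 1000
    return kwh_per_year * price_per_kwh


def annual_savings(dc_watts, eff_low, eff_high, price_per_kwh):
    """Cost difference between a less and a more efficient PSU."""
    return (annual_cost(dc_watts, eff_low, price_per_kwh)
            - annual_cost(dc_watts, eff_high, price_per_kwh))


# All four inputs are assumptions for illustration.
savings = annual_savings(dc_watts=160, eff_low=0.75, eff_high=0.85,
                         price_per_kwh=0.12)
print(f"estimated savings: ${savings:.2f} per year")
```

With these inputs the difference comes to roughly $26 a year, so the payback on a better supply depends heavily on the local electricity price and on how many hours the machine really runs at load.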

