New GPU; new issues...
2013-05-30, 18:51 #1 chalsall If I May "Chris Halsall" Sep 2002 Barbados 2×5,021 Posts

Hey All. OK, some more guidance requested...

In a very kind and generous gesture, Jerry (AKA flashj) donated to me a brand new MSi Twin Frozr III 580GTX which he had "in inventory". As I currently only have access to a single machine which can support such kit (a Dell T7500), I swapped out my GTX560 and installed this.

Unfortunately, it did not pass the CUDALucas self test, nor Carl's CUDAmemtest. So I started using Carl's technique of lowering the clocks. First the memory, then the core. Still failing.

Having to leave the site at the end of the day, I gave up and figured I would simply let the card do TF work until I was next in front of the machine. But... I then discovered that when running mfaktc the machine will quickly hard reset. Like instantly; no kernel panic. Just suddenly a blank screen, and then a reboot (the machine is configured to auto-power up upon power loss-then-restore). The strange thing is this never has happened with CUDALucas running (for hours), nor Carl's memtest (again, for hours).

When I'm next in front of the machine I'm going to continue to fiddle with the BIOS clock settings, and possibly the voltages (but only slightly). I'm also planning on connecting a USB cable to the UPS which feeds this machine so I can monitor and graph the power consumption.

I still think the (1100W) power supply is good -- this behavior occurs regardless of whether the CPU is at 100% load or 0% -- but swapping in another is scheduled. Also, I will shortly have access to six Dell PowerEdge R720s, so I'll have additional machines to test this card (and the GTX 560) in -- I realize that testing with only a single host machine is not ideal.

Any additional thoughts or advice from anyone which might help in this analysis?
2013-05-30, 19:40 #2 kracker "Mr. Meeseeks" Jan 2012 California, USA 32×241 Posts

It sounds like a PSU issue. What PSU do you have, exactly? A no-name one or a good one? Also, if it is an older one, remember that they lose efficiency and output gradually.

Last fiddled with by kracker on 2013-05-30 at 19:40
Originally Posted by kracker It sounds like a PSU issue. What PSU do you have, exactly? A no-name one or a good one? Also, if it is an older one, remember that they lose efficiency and output gradually.
Dell 1100W. And I agree it may be the 12V "rail". That's why I have a new 1400W PSU on order for this machine.

But it strikes me as weird that CUDALucas and Carl's memtest run for hours without such an issue, while mfaktc won't run for more than five minutes without this (very hard) reset.

As discussed before, different software loads the hardware differently. I'm simply documenting my personal experience, and asking others to point out if they think I've overlooked anything obvious.

Originally Posted by chalsall Dell 1100W. And I agree it may be the 12V "rail". That's why I have a new 1400W PSU on order for this machine. But it strikes me as weird that CUDALucas and Carl's memtest run for hours without such an issue, while mfaktc won't run for more than five minutes without this (very hard) reset. As discussed before, different software loads the hardware differently. I'm simply documenting my personal experience, and asking others to point out if they think I've overlooked anything obvious.
Have you tested it and compared how much power they use?

Also, if a 560 worked, and you swapped in a higher-end model and it doesn't work, it's almost 100% the PSU (unless the card is bad).

EDIT: A free card? you lucky b**.... Anyways the power company will want more money from you. Yay!

 Originally Posted by kracker Have you tested it and compared how much power they use?
Still working on that. The power consumption delta needs to be inferred; it can't be measured directly since we're working with DC and one of the three feeds is from the bus. That's the reason for the UPS data-feed.
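
One rough way to infer the DC-side delta from UPS readings is to subtract an idle baseline from the reading under load and scale by the PSU's efficiency. The sketch below is only illustrative: the function name, the wattage readings, and the 85% efficiency figure are all assumptions, not measurements from this machine, and real PSU efficiency varies with load.

```python
def dc_delta_watts(idle_ac_watts, load_ac_watts, psu_efficiency=0.85):
    """Estimate the extra DC power drawn under load from two AC-side readings.

    psu_efficiency is an assumed constant; a real PSU's efficiency
    changes with load, so treat the result as a rough estimate only.
    """
    if not 0.0 < psu_efficiency <= 1.0:
        raise ValueError("psu_efficiency must be in (0, 1]")
    return (load_ac_watts - idle_ac_watts) * psu_efficiency


# Hypothetical UPS readings in watts (made up for illustration).
idle_reading = 180.0
load_readings = {"CUDALucas": 400.0, "mfaktc": 430.0}

for name, watts in load_readings.items():
    delta = dc_delta_watts(idle_reading, watts)
    print(f"{name}: roughly {delta:.0f} W of extra DC load")
```

Comparing the deltas for different programs, rather than trusting the absolute numbers, sidesteps most of the uncertainty in the efficiency figure.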

Originally Posted by kracker Also, if a 560 worked, and you swapped in a higher-end model and it doesn't work, it's almost 100% the PSU (unless the card is bad).
Agreed. Trying to figure out which of the two most likely possibilities is the truth.

 Originally Posted by kracker EDIT: A free card? you lucky b**....
Indeed. Jerry is a very kind gentleman.

 Originally Posted by kracker Anyways the power company will want more money from you. Yay!
You have no idea... Barbados has some of the most expensive electricity in the world....

2013-05-30, 20:42 #6 sdbardwick Aug 2002 North San Diego County 2×347 Posts

Try using different PCI-E power connectors. That PSU has (AFAICT) 6 virtual 12V (12VA - 12VF) rails, each with an 18A over-current trip point.
 Originally Posted by sdbardwick Try using different PCI-E power connectors. That PSU has (AFAICT) 6 virtual 12V (12VA - 12VF) rails, each with an 18A over-current trip point.
Thanks!

Useful.

2013-05-30, 21:08 #8 chalsall If I May "Chris Halsall" Sep 2002 Barbados 2·5,021 Posts

Code:
Iteration 9990000 M( 9999973 )C, 0xbc0245fae77c5faf, n = 576K, CUDALucas v2.05 Alpha err = 0.01660 (0:10 real, 1.0172 ms/iter, ETA 0:00)
M( 9999973 )C, 0x7da6ccf13a866e7f, n = 576K, CUDALucas v2.05 Alpha, estimated total time = 2:48:49

Bad card! Bad!
 Originally Posted by sdbardwick Try using different PCI-E power connectors. That PSU has (AFAICT) 6 virtual 12V (12VA - 12VF) rails, each with an 18A over-current trip point.

The MSi card's box talks about how each PCI-E connector powers a different subsystem.

This could explain why CUDALucas runs for hours (even with errors), while mfaktc doesn't for more than a few minutes.

Thanks again!!!

2013-05-31, 18:35 #10 TheMawn May 2013 East. Always East. 32778 Posts

Dell = Bad

This is a Dell box? Did the card you replaced come with the box? Note that Dell was notorious for using proprietary fans, headers, plugs, etc. That's why I replaced my passable Dell machine with a completely new computer I built myself.

I wanted a new video card, but I was worried the PSU wouldn't be happy with it (most manufacturers skimp on power supplies, since it's the device that's hardest to find and has the least impact on performance). I was worried about replacing the PSU because of the number of things it connects to that it could fry. Dell has been known to do things like swap around the 12V, 5V and ground pins, which means it immediately fries the entire motherboard if you plug a new PSU into it, or the old PSU into a new board.

Apparently they do less of that now, but it's entirely possible the power supply and video card were in a passionately romantic relationship and one refuses to function without the other.

The PCI-E slot actually can supply a small amount of power (AMD's HD 7750 is the fastest card that uses PCI-E power only: it doesn't plug into a PSU) so it's possible that a certain component of the card is not receiving power from the PSU but is getting just enough from the slot to barely function.

Else I have no clue. Just stay away from Dell like I learned to :P
 Originally Posted by chalsall The MSi card's box talks about how each PCI-E connector powers a different subsystem. This could explain why CUDALucas runs for hours (even with errors), while mfaktc doesn't for more than a few minutes.
Thanks (for a third time) for this idea -- I had stupidly assumed the PSU had a single 12V rail, but that was clearly wrong. It's actually printed on the PSU itself, in very fine print, once I looked at it with a magnifying glass. And, indeed, changing which PCI-E power connectors were used fixed the hard crashes with mfaktc.
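
As a back-of-the-envelope illustration of why the choice of connectors could matter: if two PCI-E power connectors share one virtual rail, their combined current can exceed the 18A trip point quoted earlier in the thread. The sketch assumes the card uses one 6-pin and one 8-pin connector and that each draws up to its PCI Express specification limit (75W and 150W). The thread confirms neither assumption, the actual connector-to-rail mapping on this PSU is not known here, and the helper names are hypothetical.

```python
RAIL_TRIP_AMPS = 18.0   # per-rail over-current trip point quoted in the thread
RAIL_VOLTS = 12.0

# PCI Express specification limits for auxiliary power connectors, in watts.
CONNECTOR_MAX_WATTS = {"6-pin": 75.0, "8-pin": 150.0}


def rail_amps(connectors, volts=RAIL_VOLTS):
    """Worst-case current on one rail feeding the given connectors."""
    total_watts = sum(CONNECTOR_MAX_WATTS[c] for c in connectors)
    return total_watts / volts


def would_trip(connectors, trip_amps=RAIL_TRIP_AMPS):
    """True if the worst-case draw exceeds the rail's trip point."""
    return rail_amps(connectors) > trip_amps


# Assumed layouts: both connectors on one rail vs. one connector per rail.
shared = ["6-pin", "8-pin"]
print("shared rail:", rail_amps(shared), "A, trips:", would_trip(shared))
for single in (["6-pin"], ["8-pin"]):
    print(single[0], "alone:", rail_amps(single), "A, trips:", would_trip(single))
```

Under these assumptions a shared rail sees 225W / 12V = 18.75A, just over the 18A limit, while separate rails see 6.25A and 12.5A. Anything else fed from the same rail would push the shared figure higher still.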

I still had to bring the memory clock down from 2100 MHz to 2000 MHz (thanks Carl!), but I'm happy to report that today the card survived 10 CUDALucas self tests, six hours of Carl's CUDAmemtest, and (so far) three hours of mfaktc TFing.

I'll still want to run a few CUDALucas DCs to have 99.999% confidence, but right now I'm one very happy camper!!!

Thanks to everyone for the advice (and the testing tools), and to Jerry for his generosity!

