#123 | "Mike" | Aug 2002
Quote: Is the card blowing air in both directions? (i.e. is the hot air blown back into the case?)
The crashes happen on a test bench as well as in the case.

Quote: Did you open it to see if the pads for the memory are thicker, or a different material/type/color/dryness/etc.?
No. While we are capable of doing that, we aren't interested in doing surgery on expensive cards. Maybe if we had gotten them used for a very good price we would be willing to, but for $700 each they should "just work".

Quote: Are there any water blocks available for it?
We do not know. We certainly aren't going to throw more money at the problem!

Quote: Have you tried using MSI Afterburner to downclock the memory?
Yes. The lowest setting is the 2,000 MT/s default. We also tried the tuning utility that ships with the AMD driver. We have an RX 560 that also throws errors when it gets hot; on that card there are no thermal pads on the memory at all, and we can make it 100% stable by downclocking the memory. We think maybe AMD is pushing this memory too hard: most GDDR6 runs at 1,750 MT/s (14 Gbps), so maybe 2,000 MT/s (16 Gbps) is too much?

Quote: Which is faster in gpuowl, the underclocked 6800 or the 3070?
The 6800. Another bug we didn't mention: when the cards are underclocked, the settings "reset" every time a new work unit is started, so we'd have to manually intervene/babysit the cards or find some way to force the settings to stay set.

Quote: What is preventing you from testing in Linux?
Nothing, except that we are exhausted. Plus, there is big demand for the cards from gamers, so we had no trouble flipping them. (Yes, we told the buyer that they had issues under compute loads. The buyer was okay with that.)

We hesitated to post details about the issue because we could spend several hours describing all the things we tested and tried. We are not easily flummoxed; what we lack in intelligence we make up for in dogged perseverance.

FWIW, the event manager never gave any indication of any error except noting that the last "reboot" was "unplanned". Did you all note the part where we mentioned that the hard crash would reset our BIOS? (This is a real PITA!) One other weird event: after one crash the BIOS complained, "USB device over-current detected. Will shut down in 15 seconds." (!)
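For anyone who does end up trying the Linux route discussed below, here is a minimal sketch of pinning a lower memory clock state through amdgpu's sysfs interface. It is an assumption-laden illustration, not something verified on big navi: the card index, the chosen state number, and the presence of pp_dpm_mclk for this GPU are all placeholders, and it needs root.

```python
# Sketch: pin a lower memory clock (DPM state) on an amdgpu card via sysfs.
# Assumptions: the RX 6800 shows up as card0 and the driver exposes pp_dpm_mclk
# for it. Must run as root; the states listed vary per card/driver version.
from pathlib import Path

DEV = Path("/sys/class/drm/card0/device")  # placeholder card index

def list_mem_states() -> str:
    # Lines look like "0: 96Mhz", "1: 456Mhz", "2: 1000Mhz *" (current state).
    return (DEV / "pp_dpm_mclk").read_text()

def pin_mem_state(index: int) -> None:
    # Manual performance level is required before forcing a specific state.
    (DEV / "power_dpm_force_performance_level").write_text("manual")
    (DEV / "pp_dpm_mclk").write_text(str(index))

if __name__ == "__main__":
    print(list_mem_states())
    pin_mem_state(1)  # pick a state below the 2,000 MT/s default
    print(list_mem_states())
```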
#124 | "Composite as Heck" | Oct 2017
The BIOS reset is very weird; that and the USB over-current you mentioned make me think there's a power issue, either a spike or the draw through the PCIe slot being wrong somehow. If it's bad enough to reset the BIOS, it could be bad enough to damage hardware if you trigger it too often, so I don't blame you for washing your hands of it. There are two BIOSes available for the card (the BIOS is signed, so it cannot be modified), but do you recall which BIOS versions the cards had?

The settings resetting every work unit, if it is a problem in general, is workable depending on what you mean by a work unit. In Linux you can change the settings via a script (assuming that's plumbed in for big navi); not ideal, but you could interleave work with setting resets, roughly as sketched below. Hanging when queuing might be an issue of poor cleanup after a job putting the card in a bad state. On an R7 (Radeon VII) you can run two jobs simultaneously, but a third compiles and then never runs: the job appears to hang, as gpuowl presumably submits it to the card and is never told that the job is in limbo.
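A rough sketch of that interleaving, purely illustrative: the gpuowl invocation below is a placeholder for however you launch exactly one work unit, and the sysfs paths assume amdgpu exposes them for big navi and that the card is card0 (both assumptions).

```python
# Sketch of "interleave work with setting resets": re-apply the forced memory
# state before each work unit, since the setting appears to reset between them.
# GPUOWL_CMD is a placeholder, not the actual gpuowl command line.
import subprocess
from pathlib import Path

DEV = Path("/sys/class/drm/card0/device")   # placeholder card index
GPUOWL_CMD = ["./gpuowl"]                   # placeholder invocation for one unit

def reapply_underclock(state_index: int = 1) -> None:
    # Force manual DPM control, then pin the chosen memory clock state again.
    (DEV / "power_dpm_force_performance_level").write_text("manual")
    (DEV / "pp_dpm_mclk").write_text(str(state_index))

while True:
    reapply_underclock()
    # Run one work unit; loop so the clock gets re-pinned before the next one.
    if subprocess.run(GPUOWL_CMD).returncode != 0:
        break
```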
#126 | "Composite as Heck" | Oct 2017
They've implemented a power-cycling mechanism for GPUs under heavy load when thermals/power get too high: https://www.phoronix.com/scan.php?pa...-Cycle-Scaling

Sounds like a massive hack to compensate for silicon that's being pushed too hard. It supports the idea that big navi is unreliable for heavy compute in its current state; maybe with DCS enabled it could be used reliably, albeit with lower throughput.
#127 | Feb 2016 | UK
As a "last resort" feature it could make some sense, but if it is something that could be run into regularly, I don't get it. Isn't a better solution to back off the boost some more before that situation happens?
Made up numbers for example: Case 1: Run at 100% power, but have to throttle 10% of the time. Case 2: Run at 90% power. Same power over time as above, but probably more compute efficient overall, as well as being more consistent. Or is it fixing something else? |
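To put rough numbers on that intuition, here is a toy comparison under made-up assumptions (throughput scales linearly with clock, power roughly with the cube of clock). None of this comes from AMD; it only illustrates why the steady lower clock tends to win at the same average power.

```python
# Toy model for the Case 1 vs Case 2 comparison above. Assumptions (not from
# the thread): throughput ~ clock, power ~ clock^3 (voltage scaling folded in).

def steady_clock_for_power(power_fraction: float) -> float:
    """Clock fraction that draws the given fraction of full power, P ~ f^3."""
    return power_fraction ** (1.0 / 3.0)

# Case 1: full clock, but duty-cycled off 10% of the time.
duty = 0.9
case1_avg_power = duty * 1.0          # relative to full power
case1_throughput = duty * 1.0         # relative to full-clock throughput

# Case 2: steady clock chosen to match Case 1's average power.
f2 = steady_clock_for_power(case1_avg_power)
case2_avg_power = f2 ** 3
case2_throughput = f2

print(f"Case 1: power {case1_avg_power:.2f}, throughput {case1_throughput:.2f}")
print(f"Case 2: power {case2_avg_power:.2f}, throughput {case2_throughput:.2f}")
# With these assumptions Case 2 gets ~0.97 of full throughput for the same
# average power as Case 1's 0.90, i.e. steadier and slightly more efficient.
```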
#128 | "Marv" | May 2009 | near the Tannhäuser Gate
FWIW Mike, here is something weird I noticed with a Radeon VII last year.
The system I put it in was "inexpensive", but it's been OK for other top-line cards I've run in it. The Radeon was a totally different story. The OS is Windoze 10. It would operate fine doing mundane tasks, but as soon as I started GPUOWL: BIF! POW! ZAP! Nothing in the log indicated a cause, and only once, out of several attempts, did I notice that the machine time had changed, but I never associated that with a BIOS problem.

I "fixed" it by setting the power limit in MSI Afterburner all the way up, which I think was +20 (percent?). I believe I also set the temp limit to max. Anyway, that did the trick. My only guess is that this might juice the available power via the PCIe bus for the card; these cards are known to be power hungry. At least it's cheap and easy to test. BTW, I had not touched voltage or clock.
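For the Linux/amdgpu rough equivalent of maxing out the power limit, a minimal sketch is below. It assumes the card is card0 and that the driver exposes the hwmon power1_cap files for it (an assumption, not verified here), and it needs root.

```python
# Sketch: raise the amdgpu power cap to its maximum via hwmon, the rough
# Linux counterpart of maxing the power limit slider in MSI Afterburner.
# Assumptions: the card is card0 and power1_cap/power1_cap_max are exposed.
from pathlib import Path

def find_hwmon(card: str = "card0") -> Path:
    # amdgpu registers a hwmon directory under the device node.
    return next(Path(f"/sys/class/drm/{card}/device/hwmon").iterdir())

if __name__ == "__main__":
    hwmon = find_hwmon()
    cap_max = int((hwmon / "power1_cap_max").read_text())  # microwatts
    print(f"Raising power cap to {cap_max / 1e6:.0f} W")
    (hwmon / "power1_cap").write_text(str(cap_max))        # needs root
```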
#129 | "Composite as Heck" | Oct 2017
https://www.youtube.com/watch?v=jjBqaGLRycc
tl;dr 6700XT specs: