mersenneforum.org  

2021-01-15, 09:45   #122
axn
Jun 2003
4861₁₀ Posts

Quote:
Originally Posted by Xyzzy
Over the past few days ...

Have you tried using MSI Afterburner to downclock memory?

Which is faster in gpuowl? Underclocked 6800 or 3070?

What is preventing you from testing in Linux?

2021-01-15, 13:24   #123
Xyzzy
"Mike"
Aug 2002
2⁴·499 Posts

Quote:
Is the card blowing air in both directions? (i.e. the hot air blown back into the case?)

The crashes happen on a test bench as well as in the case.

Quote:
Did you open it to see if the pads for memory are thicker, or different material/type/color/dryness/etc?

No. While we are capable of doing that, we aren't interested in doing surgery on expensive cards. Maybe if we got them used for a very good price we would be willing to do that, but for $700 each they should "just work".

Quote:
Is there any water blocks available for it?

We do not know. We certainly aren't going to throw more money at the problem!

Quote:
Have you tried using MSI Afterburner to downclock memory?

Yes. The lowest setting is the 2,000 MT/s default. We also tried using the tuning utility that ships with the AMD driver.

We have an RX 560 that also throws errors when it gets hot. On that card there are no thermal pads on the memory at all! We are able to make the card 100% stable by downclocking the memory. We think maybe AMD is pushing this memory too hard. Most GDDR6 is 1,750 MT/s (14 Gbps) so maybe 2,000 MT/s (16 Gbps) is too much?

Quote:
Which is faster in gpuowl? Underclocked 6800 or 3070?

The 6800.

Another bug we didn't mention is that when the cards are underclocked the settings "reset" every time a new work unit is started. So we'd have to manually intervene/babysit the cards or find some way to force the settings to stay set.

Quote:
What is preventing you from testing in Linux?

Nothing, except we are exhausted. Plus, there is a big demand for the cards for gamers so we had no trouble flipping them. (Yes, we told the buyer that they had issues in compute loads. The buyer was okay with that.)

We hesitated to post details about the issue because we could spend several hours describing all the things we tested and tried. We are not easily flummoxed. What we lack in intelligence we make up for in dogged perseverance.

FWIW, the event manager never gave any indication of any error except noting that the last "reboot" was "unplanned".

Did you all note the part where we mentioned that the hard crash would reset our BIOS? (This is a real PITA!)

One other weird event was after one crash the BIOS complained: "USB device over-current detected. Will shut down in 15 seconds." (!)

2021-01-15, 14:18   #124
M344587487
"Composite as Heck"
Oct 2017
1401₈ Posts

The BIOS reset is very weird. That and the USB over-current you mentioned make me think there's a power issue: either a spike, or the draw through the PCIe slot is wrong somehow. If it's bad enough to reset the BIOS, it could be bad enough to damage hardware if you trigger it too often, so I don't blame you for washing your hands of it. There are two BIOS versions available for the card. The BIOS is signed so it cannot be modified, but do you recall which BIOS versions the cards had?

The settings resetting every work unit, if it is a problem in general, is workable depending on what you mean by work unit. In Linux you can change the settings via script (assuming that's plumbed in for Big Navi). It's not ideal, but you could interleave work with re-applying the settings. Hanging when queuing might be an issue of poor cleanup of a job putting the card in a bad state. On R7 you can run two jobs simultaneously, but a third compiles and then doesn't run (the job appears to hang, presumably because gpuowl tries to submit the job to the card and is never told the job is in limbo).
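
For anyone who wants to try the script route, here is a rough Python sketch of re-applying a memory clock through the amdgpu sysfs overdrive files. Treat it as a sketch under assumptions: the `card0` path, the 900 MHz figure in the usage note, and the helper names `build_writes` / `apply_writes` are made up for illustration, and whether Big Navi exposes these files (or accepts a memory clock below its default) is not confirmed anywhere in this thread. Overdrive may also need to be enabled via the driver's feature mask first.

```python
from pathlib import Path

# Assumed sysfs location; the card index differs between systems.
DEVICE_DIR = Path("/sys/class/drm/card0/device")


def build_writes(mem_clock_mhz: int) -> list[tuple[str, str]]:
    """Return the (file name, text) pairs to write, in order."""
    return [
        # Switch from automatic to manual power management.
        ("power_dpm_force_performance_level", "manual"),
        # Request a new maximum memory clock (level 1), in MHz.
        ("pp_od_clk_voltage", f"m 1 {mem_clock_mhz}"),
        # Commit the pending overdrive changes.
        ("pp_od_clk_voltage", "c"),
    ]


def apply_writes(device_dir: Path, writes: list[tuple[str, str]]) -> None:
    """Write each value to its file under device_dir (needs root on real sysfs)."""
    for name, text in writes:
        (device_dir / name).write_text(text + "\n")
```

Calling `apply_writes(DEVICE_DIR, build_writes(900))` as root before each job starts would be one way to interleave work with setting resets. Read `pp_od_clk_voltage` first to see the range the driver actually allows.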

2021-01-15, 18:13   #125
Xyzzy
"Mike"
Aug 2002
1F30₁₆ Posts

Quote:
Originally Posted by M344587487
The BIOS reset is very weird. That and the USB over-current you mentioned make me think there's a power issue: either a spike, or the draw through the PCIe slot is wrong somehow. If it's bad enough to reset the BIOS, it could be bad enough to damage hardware if you trigger it too often, so I don't blame you for washing your hands of it.

The weird USB reset message happened with our Corsair SF600 PSU.

Quote:
Originally Posted by M344587487
There are two BIOS versions available for the card. The BIOS is signed so it cannot be modified, but do you recall which BIOS versions the cards had?

We have attached the BIOS "dump" to this post.

Attached Thumbnails: IMG_2665.jpg (538.5 KB), IMG_2666.jpg (494.7 KB), IMG_2667.jpg (503.8 KB)
Attached Files: Navi 21.rom.gz (251.3 KB)

2021-01-27, 14:45   #126
M344587487
"Composite as Heck"
Oct 2017
769 Posts

They've implemented a power cycle for GPUs under heavy load when thermals/power get too high: https://www.phoronix.com/scan.php?pa...-Cycle-Scaling

Sounds like a massive hack to compensate for silicon that's getting pushed too hard. It supports the idea that Big Navi is unreliable for heavy compute in its current state. Maybe with DCS (duty cycle scaling) enabled it could be used reliably, albeit with lower throughput.

Last fiddled with by M344587487 on 2021-01-27 at 14:47

2021-01-27, 15:26   #127
mackerel
Feb 2016
UK
13·31 Posts

As a "last resort" feature it could make some sense, but if it is something that could be run into regularly, I don't get it. Isn't a better solution to back off the boost some more before that situation happens?

Made up numbers for example:
Case 1: Run at 100% power, but have to throttle 10% of the time.
Case 2: Run at 90% power. Same power over time as above, but probably more compute efficient overall, as well as being more consistent.

Or is it fixing something else?
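
To put the two made-up cases side by side, here is a tiny Python sketch of the average-power arithmetic. It assumes the card draws nothing while throttled, which is a simplification chosen only to match the "same power over time" claim. The function name `average_power` is just for illustration.

```python
def average_power(on_power_pct: float, on_fraction: float,
                  throttled_power_pct: float = 0.0) -> float:
    """Time-weighted average power, as a percentage of the full power limit."""
    return (on_power_pct * on_fraction
            + throttled_power_pct * (1.0 - on_fraction))


# Case 1: 100% power, throttled 10% of the time.
case1 = average_power(100.0, 0.9)
# Case 2: a steady 90% power, never throttled.
case2 = average_power(90.0, 1.0)

print(f"case 1: {case1:.1f}%  case 2: {case2:.1f}%")
```

Under that assumption both cases average out the same, so the comparison comes down to which one gets more work done per watt. The sketch does not model that part.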

2021-01-27, 19:20   #128
tServo
"Marv"
May 2009
near the Tannhäuser Gate
613 Posts

FWIW Mike, here is something weird I noticed with a Radeon VII last year.
The system I put it in was "inexpensive" but it's been ok for other top-line cards I've run in it.
The Radeon was a totally different story. The OS is Windoze 10.
It would operate OK doing mundane tasks but as soon as I started GPUOWL, BIF ! POW ! ZAP !
Nothing in the log indicated a cause. One time only, out of several attempts, I did notice that the machine time had changed, but I never associated that with a BIOS problem.
I "fixed" it by setting the power factor in MSI Afterburner all the way up, which I think was +20 (percent?). I believe I also set the temp limit to max. Anyway, that did the trick.

My only guess is that this might juice the available power via the PCIe bus for the card; these cards are known to be power hungry.

Anyway, at least it's cheap and easy to test.

BTW, I had not touched voltage or clock.

Last fiddled with by tServo on 2021-01-27 at 19:21

2021-03-03, 16:45   #129
M344587487
"Composite as Heck"
Oct 2017
1100000001₂ Posts

https://www.youtube.com/watch?v=jjBqaGLRycc


tl;dr 6700XT specs:

  • 40 CUs as already known
  • 12GB of RAM as already known, implying a 192-bit bus and so 384GB/s bandwidth if RAM clocks stay the same as on other models
  • 96MB of infinity cache. It was expected to be less than the 128MB in the 6800/6900 models, but I think the actual number is new information. 96MB is still a healthy chunk of cache. I wonder how much impact that has and how much the die size shrinks relative to the big boys
  • 230W TDP
  • $480 fantasy land RRP with a street price approaching double that
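
For reference, the bus width and bandwidth figures in the list can be reproduced with a few lines of Python. This is a sketch under assumptions: it supposes 2GB GDDR6 chips with a 32-bit interface each, and the 16Gbps per-pin rate mentioned earlier in the thread for the 6800. The function names are only illustrative.

```python
def bus_width_bits(total_gb: int, chip_gb: int = 2, chip_bits: int = 32) -> int:
    """Bus width implied by the number of memory chips on the card."""
    return (total_gb // chip_gb) * chip_bits


def memory_bandwidth_gb_s(width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: pins times per-pin Gbit/s, divided by 8 bits per byte."""
    return width_bits * data_rate_gbps / 8


width = bus_width_bits(12)                     # 6 chips x 32 bits
bandwidth = memory_bandwidth_gb_s(width, 16)   # assumes 16Gbps memory
print(width, bandwidth)
```

If the memory ends up clocked differently from the other models, only the second argument to `memory_bandwidth_gb_s` changes.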

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.