mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2020-01-27, 22:05   #45
M344587487
"Composite as Heck"
Oct 2017

I tried the washer mod with minimal success, YMMV: https://www.mersenneforum.org/showpo...9&postcount=13


edit:
Quote:
Warning: the thermal pad that the GPU comes with is very good -- it's hard to replace it with anything of comparable performance. When taking the cooler apart, the thermal pad may be damaged and need to be replaced, which would be a net loss. Personally I would recommend against trying the washer hack.
The thermal pad is good enough and will need replacing if you remove the cooler, but even the moderately decent thermal paste I used has better conductivity (on paper the pad has better thermal properties, but paper is misleading: at equal thickness the pad would win, yet paste is applied much more thinly). That said, IMO neither a repaste nor the washer mod is worth doing.

Last fiddled with by M344587487 on 2020-01-27 at 22:27
Old 2020-01-27, 22:26   #46
ewmayer
2ω=0
Sep 2002
República de California

Quote:
Originally Posted by M344587487 View Post
I tried the washer mod with minimal success, YMMV: https://www.mersenneforum.org/showpo...9&postcount=13
Thanks - but did you try first just-the-washers before doing the other stuff? If you tried a bunch of different things at once you risk, using the terminology of clinical trials in medicine, "confounding effects".
Old 2020-01-31, 21:08   #47
ewmayer

After a fresh Ubuntu 19.10 install on my ~6-year-old Haswell system and several afternoons' work, including some awkward Dremel hackery of both the R7 mounting bracket and the back of my ATX case to resolve a geometric mismatch there, the R7 is in and recognized by the OS; lspci shows two R7 entries:

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon VII] (rev c1)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 HDMI Audio [Radeon VII]

In terms of needed drivers, Matt (a.k.a. M344587487) noted this:

"Ubuntu 19.10 uses kernel 5.3, which means the open-source AMD driver that's built into the kernel can handle the Vega 20. If you were on an earlier kernel you'd need to install the amdgpu-pro driver from AMD's site, but you should be good. Something you might need is the Vega 20 firmware; there was a strange period where the kernel had the right drivers but some distros hadn't caught up to providing Vega 20 firmware. To check if you have the firmware, open a terminal and run 'ls /lib/firmware/amdgpu/vega20*'."

That latter list command shows 13 vega20_*.bin files, so that seems good to go.
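For reference, the firmware check can be paired with a scan of the kernel log for any blobs the amdgpu driver failed to load at boot (the grep pattern follows the message format quoted later in this thread):

```shell
# List the Vega 20 firmware blobs the distro shipped, and count them
ls /lib/firmware/amdgpu/vega20*.bin
ls /lib/firmware/amdgpu/vega20*.bin | wc -l   # 13 on this system

# Surface any firmware files the amdgpu driver could not load
dmesg | grep -i 'Direct firmware load for amdgpu'
```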

But - and I was clued in to the problem by my usual Mlucas 4-thread job on the Haswell CPU running 3x slower than usual - there is some kind of misconfiguration/driver problem remaining. 'top' shows multiple cycle-eating 'systemd-udevd' and 'modprobe' processes. Invoking 'dmesg' shows what appears to be the problem - endless repeats of this message:

NVRM: No NVIDIA graphics adapter found!
nvidia-nvlink: Unregistered the Nvlink Core, major device number 238
nvidia-nvlink: Nvlink Core is being initialized, major device number 238

It's not clear to me which of the following 3 possible causes is the likely culprit:

1. Preparing to install the R7, I first removed an old nVidia GTX 430 card from the PCIe 2.0 slot (seems unlikely, because I quickly found the issue with the R7 mounting bracket after that, at which point I rebooted sans any gfx card, and had been running happily for several days like that).

2. The R7 needs some nVidia drivers and is not finding them;

3. The system is detecting *a* new video card - brand not important - and doing something nVidia-ish as a result.

Last fiddled with by ewmayer on 2020-01-31 at 21:13
Old 2020-01-31, 22:57   #48
paulunderwood
Sep 2002
Database er0rr

Maybe it is just easier to back up, reinstall afresh and, like most of us do, use ROCm drivers.

I will be interested in how the R7 performs on a PCIe 2.0 slot rather than a PCIe 3.0 one...

Last fiddled with by paulunderwood on 2020-01-31 at 23:08
Old 2020-01-31, 23:04   #49
ewmayer

Quote:
Originally Posted by paulunderwood View Post
Maybe it is just easier to back up, reinstall afresh and, like most of us do, use ROCm drivers.

I will be interested in how the R7 performs on a PCIe 2.0 slot rather than a PCIe 3.0 one...
My old GTX 430 was in the PCIe 2.0 slot ... the R7 is in the PCIe 3.0 slot, and uses both of the 8-pin power connectors on this system's PSU. It also needed me to use my Dremel with a small cutting wheel to chop out the metal bridge between the two back-of-case PCI cutouts used by the R7. Here's the gory post-surgery picture of the patient's innards:
Attached Thumbnails: radeon_install.jpg (306.6 KB)

Last fiddled with by ewmayer on 2020-01-31 at 23:04
Old 2020-02-01, 02:30   #50
ewmayer

Re. the nVidia-related dmesg errors in post #47, one additional possibility occurs to me ... the only nVidia drivers I ever explicitly installed were under the old headless Debian setup, which I blew away.

I removed the nVidia card a week ago, in prep. for trying to install the R7.

However, the nVidia card was still installed when I upgraded to Ubuntu 19.10 ... might the Ubuntu installer have auto-detected the nVidia card and installed/defaulted-to-use the appropriate drivers at that point, and now the kernel is throwing errors due to the mismatch between those initial-OS-install drivers and the new gfx card?

Last fiddled with by ewmayer on 2020-02-01 at 02:33
Old 2020-02-01, 10:06   #51
M344587487

Quote:
Originally Posted by ewmayer View Post
Thanks - but did you try first just-the-washers before doing the other stuff? If you tried a bunch of different things at once you risk, using the terminology of clinical trials in medicine, "confounding effects".
I did both at once and ruined the clinical trial; my understanding that the paste made a bigger difference than the washer mod comes from a tech YouTuber, so YMMV.

Quote:
Originally Posted by ewmayer View Post
...
However, the nVidia card was still installed when I upgraded to Ubuntu 19.10 ... might the Ubuntu installer have auto-detected the nVidia card and installed/defaulted-to-use the appropriate drivers at that point, and now the kernel is throwing errors due to the mismatch between those initial-OS-install drivers and the new gfx card?
Yes, Ubuntu installs non-free drivers by default when it needs to, unless you tell it not to -- including nvidia's blobs if an nvidia card is present. I'm inclined to blame nvidia's proprietary crap for your problems; people have trouble mixing vendors in the same system, and I believe it's because nvidia does things its own way via binary blob, which means they're not integrating properly with the Linux way of doing things.

The easiest/safest fix is probably to wipe and restart (after burning the nvidia card and burying it in a deep pit preferably, YMMV), but it can't hurt to try purging nvidia from the system if you feel like it (it's not critical but highly recommended that you change your wallpaper to Linus flipping off nvidia at this point, for luck). This is from an old guide but it seems reasonable:

This command should list all nvidia packages, there should be a few dozen of them:
Code:
dpkg -l | grep -i nvidia
Purge all packages beginning with nvidia-, which should also remove their dependencies:
Code:
sudo apt-get remove --purge '^nvidia-.*'
Reinstall ubuntu-desktop which was just erroneously removed:
Code:
sudo apt-get install ubuntu-desktop
Then reboot and see where you stand.
Old 2020-02-01, 20:59   #52
ewmayer

Quote:
Originally Posted by M344587487 View Post
Yes, Ubuntu installs non-free drivers by default when it needs to, unless you tell it not to -- including nvidia's blobs if an nvidia card is present. I'm inclined to blame nvidia's proprietary crap for your problems; people have trouble mixing vendors in the same system, and I believe it's because nvidia does things its own way via binary blob, which means they're not integrating properly with the Linux way of doing things.

The easiest/safest fix is probably to wipe and restart (after burning the nvidia card and burying it in a deep pit preferably, YMMV), but it can't hurt to try purging nvidia from the system if you feel like it (it's not critical but highly recommended that you change your wallpaper to Linus flipping off nvidia at this point, for luck). This is from an old guide but it seems reasonable:

This command should list all nvidia packages, there should be a few dozen of them:
Code:
dpkg -l | grep -i nvidia
Purge all packages beginning with nvidia-, which should also remove their dependencies:
Code:
sudo apt-get remove --purge '^nvidia-.*'
Reinstall ubuntu-desktop which was just erroneously removed:
Code:
sudo apt-get install ubuntu-desktop
Then reboot and see where you stand.
Thanks, Matt - I PMed you the 'before' and 'after' results of 'dpkg -l | grep -i nvidia' ... on reboot, I still quickly get a "system program problem detected" popup (but now only one, versus multiple before), which I dismiss, but 'dmesg' now shows no more of the repeating nVidia crud. I PMed you the shortlist of bold-highlighted warnings/errors I did find in the dmesg output, one of which involves a vega20*.bin firmware file, namely

[ 2.517924] amdgpu 0000:03:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2

I see 13 files in the /lib/firmware/amdgpu/vega20*.bin set which Ubuntu 19.10 auto-installed, but no vega20_ta.bin among them, so I probably just need to grab that one separately.
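A sketch of how that single blob might be pulled from the upstream linux-firmware tree (the URL layout is an assumption on my part; updating the distro's linux-firmware package would be the cleaner route):

```shell
# Fetch the missing PSP "trusted application" firmware from the upstream
# linux-firmware repository (URL layout assumed), then install it
wget -O /tmp/vega20_ta.bin \
  'https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/amdgpu/vega20_ta.bin'
sudo install -m 644 /tmp/vega20_ta.bin /lib/firmware/amdgpu/vega20_ta.bin
sudo update-initramfs -u   # rebuild the initramfs so the blob is available at early boot
```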

Most importantly, 'top' no longer shows any out-of-control system processes, and my Mlucas runs on the CPU are once again back at normal throughput. So, progress!

Last fiddled with by ewmayer on 2020-02-01 at 21:28
Old 2020-02-03, 19:42   #53
ewmayer

Now that Super Bowl Sunday (a quasi-holiday in the US revolving around the National Football League championship game) is behind us, an update - the card seems to be functioning properly. I've been following Matt's "quick and dirty setup guide" here, and am currently at the "Take the above [bash] init script [to set up for 2-gpuowl-instance running] and tweak it to suit your card" step. First I'd like to play with some basic single-instance running, but something is borked. The readme says "Self-test: simply start gpuowl with any valid exponent..." but does not say how to specify that expo via cmd-line flags. I tried just sticking a prime expo in there, then running without any arguments whatever; both gave the following kind of error:
Code:
ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 90110269
2020-02-01 18:43:36 gpuowl v6.11-142-gf54af2e
2020-02-01 18:43:36 Note: not found 'config.txt'
2020-02-01 18:43:36 config: 90110269 
2020-02-01 18:43:36 device 0, unique id ''
2020-02-01 18:43:36 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-02-01 18:43:36 Bye
ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl
2020-02-01 18:44:02 gpuowl v6.11-142-gf54af2e
2020-02-01 18:44:02 Note: not found 'config.txt'
2020-02-01 18:44:02 device 0, unique id ''
2020-02-01 18:44:02 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-02-01 18:44:02 Bye
Matt had noted to me, "If the PRP test starts we are good to go. If it fails with something along the lines of clGetDeviceId then gpuowl couldn't see the card." How do I debug that latter problem?
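For anyone hitting the same wall: a first step might be to check whether OpenCL can enumerate the GPU at all, and whether the user is in the groups ROCm expects (the clinfo package and the exact group names are assumptions about a typical ROCm setup, not something from the gpuowl readme):

```shell
# clinfo enumerates OpenCL platforms/devices; zero devices usually means
# the ROCm OpenCL runtime is missing or /dev/kfd is not accessible
sudo apt-get install -y clinfo
clinfo | grep -E 'Number of (platforms|devices)'

# ROCm traditionally requires membership in the 'video' group
# (newer setups also use 'render'); log out and back in after adding
groups "$USER"
sudo usermod -aG video,render "$USER"
```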

Looking ahead, the first 2 steps of the setup-for-2-instances script are these:
Code:
#Allow manual control
echo "manual" >/sys/class/drm/card0/device/power_dpm_force_performance_level
#Undervolt by setting max voltage
#               V Set this to 50mV less than the max stock voltage of your card (which varies from card to card), then optionally tune it down
echo "vc 2 1801 1010" >/sys/class/drm/card0/device/pp_od_clk_voltage
How do I find the max stock voltage? rocm-smi gives a bunch of things, but not that:
Code:
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 
1 31.0c 21.0W 809Mhz 351Mhz 21.96% auto 250.0W 0% 0%
...and fiddling with various values of "/opt/rocm/bin/rocm-smi --setfan [n]" to set a constant fan speed causes the Fan value in the above to rise and fall.
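On the max-stock-voltage question, my understanding (an assumption on my part, not something I've verified on a Radeon VII) is that the same sysfs file the init script writes to can simply be read back: its OD_VDDC_CURVE section lists the card's clock/voltage points and OD_RANGE its permitted limits, with the highest listed voltage being the stock max:

```shell
# Read back the overdrive table the undervolt script writes to;
# look for the highest voltage in the OD_VDDC_CURVE / OD_RANGE sections
cat /sys/class/drm/card0/device/pp_od_clk_voltage
```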

Thanks for any help from current gpuowl users.

Last fiddled with by ewmayer on 2020-02-03 at 20:16
Old 2020-02-03, 21:50   #54
paulunderwood

Quote:
Originally Posted by ewmayer View Post
Now that Super Bowl Sunday (a quasi-holiday in the US revolving around the National Football League championship game) is behind us, an update - the card seems to be functioning properly. I've been following Matt's "quick and dirty setup guide" here, and am currently at the "Take the above [bash] init script [to set up for 2-gpuowl-instance running] and tweak it to suit your card" step. First I'd like to play with some basic single-instance running, but something is borked. The readme says "Self-test: simply start gpuowl with any valid exponent..." but does not say how to specify that expo via cmd-line flags. I tried just sticking a prime expo in there, then running without any arguments whatever; both gave the following kind of error:
Code:
ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 90110269
2020-02-01 18:43:36 gpuowl v6.11-142-gf54af2e
2020-02-01 18:43:36 Note: not found 'config.txt'
2020-02-01 18:43:36 config: 90110269 
2020-02-01 18:43:36 device 0, unique id ''
2020-02-01 18:43:36 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-02-01 18:43:36 Bye
ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl
2020-02-01 18:44:02 gpuowl v6.11-142-gf54af2e
2020-02-01 18:44:02 Note: not found 'config.txt'
2020-02-01 18:44:02 device 0, unique id ''
2020-02-01 18:44:02 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-02-01 18:44:02 Bye
Run as root (or sudo) with the -user ewmayer switch (or is it --user? I just run as root.)

Start with fans at 170; monitor the temperatures and, depending on your overclock, undervolt and ambient temperature, you might be able to reduce the fan speed.
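Note the fan argument on this interface is a raw PWM level out of 255, not a percentage (my reading of rocm-smi's behavior, so treat it as an assumption):

```shell
# 170/255 is roughly a 67% duty cycle
/opt/rocm/bin/rocm-smi --setfan 170
# watch temperatures while tuning the level down
watch -n 5 /opt/rocm/bin/rocm-smi
```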

Last fiddled with by paulunderwood on 2020-02-03 at 22:57
Old 2020-02-03, 23:06   #55
ewmayer

Quote:
Originally Posted by paulunderwood View Post
Run as root (or sudo) with the -user ewmayer switch (or is it --user? I just run as root.)

Start with fans at 170; monitor the temperatures and, depending on your overclock and undervolt, you might be able to reduce the fan speed.
Thanks-
Per the readme, single minus sign ... from within a subdir 'run0' where I have created a worktodo.txt file containing a pair of PRP assignments, I tried 'sudo ../gpuowl -user ewmayer' ... after entering my sudo password the run echoed the same as the 2nd fail above, just with an added 'config: -user ewmayer' line. Trying instead to log in as root and run that way [this is the Ubuntu 19.10 setup I created last week] using the same pwd gives 'Authentication failure'. I don't recall entering any other pwd during the set-pwd phase of Ubuntu 19.10 setup.

Not needed yet since I can't run at all, but how do I determine the max stock voltage of my R7?