2020-01-27, 22:05   #45
M344587487

"Composite as Heck"
Oct 2017

2³×79 Posts

I tried the washer mod with minimal success, YMMV: https://www.mersenneforum.org/showpo...9&postcount=13

edit:
Quote:
 Warning: the thermal pad that the GPU comes with is very good -- it's hard to replace it with anything with comparable performance. When taking the cooler apart, the thermal pad may be damaged and need to be replaced, which would be a net loss. Personally I would recommend against trying out the washer hack.
The thermal pad is good enough, and it will need replacing if you remove the cooler, but even the moderately decent thermal paste I used conducts heat better in practice. The pad has better thermal properties on paper, but paper is misleading: at equal thickness the pad would win, but the paste is applied much more thinly. That said, IMO it's not worth doing a repaste or the washer mod.

2020-01-27, 22:26   #46
ewmayer
∂²ω=0

Sep 2002
República de California

11518₁₀ Posts

Quote:
 Originally Posted by M344587487 I tried the washer mod with minimal success, YMMV: https://www.mersenneforum.org/showpo...9&postcount=13
Thanks - but did you try first just-the-washers before doing the other stuff? If you tried a bunch of different things at once you risk, using the terminology of clinical trials in medicine, "confounding effects".

 Maybe it is just easier to back up, reinstall afresh and, like most of us do, use ROCm drivers. I will be interested to see how the R7 performs on a PCIE-2 rather than a PCIE-3...
2020-01-31, 23:04   #49
ewmayer
∂²ω=0

Sep 2002
República de California

2×13×443 Posts

Quote:
 Originally Posted by paulunderwood Maybe it is just easier to back up, reinstall afresh and, like most of us do, use ROCm drivers. I will be interested to see how the R7 performs on a PCIE-2 rather than a PCIE-3...
My old gtx430 was on the PCIE-2 slot ... the R7 is on the PCIE-3, and uses both of the 8-pin power connectors on the PSU in this system. It also needed me to use my Dremel with a small cutting wheel to chop out the metal bridge between the 2 back-of-case PCI cutouts used by the R7. Here is the gory post-surgery picture of the patient's innards:
Attached Thumbnails

2020-02-01, 02:30   #50
ewmayer
∂²ω=0

Sep 2002
República de California

2×13×443 Posts

Re. the nVidia-related dmesg errors in post #47, one additional possibility occurs to me ... the only nVidia drivers I ever explicitly installed were under the old headless Debian setup, which I blew away. I removed the nVidia card a week ago, in prep. for trying to install the R7. However, the nVidia card was still installed when I upgraded to Ubuntu 19.10 ... might the Ubuntu installer have auto-detected the nVidia card and installed/defaulted-to-use the appropriate drivers at that point, and now the kernel is throwing errors due to the mismatch between those initial-OS-install drivers and the new gfx card?

Last fiddled with by ewmayer on 2020-02-01 at 02:33

2020-02-01, 10:06   #51
M344587487

"Composite as Heck"
Oct 2017

1170₈ Posts

Quote:
 Originally Posted by ewmayer Thanks - but did you try first just-the-washers before doing the other stuff? If you tried a bunch of different things at once you risk, using the terminology of clinical trials in medicine, "confounding effects".
I did both at once and ruined the clinical trial. My understanding that the paste made a bigger difference than the washer mod comes from a tech YouTuber, so YMMV.

Quote:
 Originally Posted by ewmayer ... However, the nVidia card was still installed when I upgraded to Ubuntu 19.10 ... might the Ubuntu installer have auto-detected the nVidia card and installed/defaulted-to-use the appropriate drivers at that point, and now the kernel is throwing errors due to the mismatch between those initial-OS-install drivers and the new gfx card?
Yes, Ubuntu installs non-free drivers by default when it needs to unless you tell it not to, including nvidia's blobs if an nvidia card is present. I'm inclined to blame nvidia's proprietary crap for your problems; people have trouble mixing vendors in the same system, and I believe it's because nvidia does things its own way via binary blob, which means they're not integrating properly with the Linux way of doing things.

The easiest/safest fix is probably to wipe and restart (after burning the nvidia card and burying it in a deep pit preferably, YMMV), but it can't hurt to try purging nvidia from the system if you feel like it (it's not critical but highly recommended that you change your wallpaper to Linus flipping off nvidia at this point, for luck). This is from an old guide but it seems reasonable:

This command should list all nvidia packages, there should be a few dozen of them:
Code:
dpkg -l | grep -i nvidia
Purge all packages beginning with nvidia-, which should also remove their dependencies:
Code:
sudo apt-get remove --purge '^nvidia-.*'
Reinstall ubuntu-desktop which was just erroneously removed:
Code:
sudo apt-get install ubuntu-desktop
Then reboot and see where you stand.
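Before purging for real, it may be worth previewing what would go. The following is only a minimal sketch, assuming a POSIX shell and awk; `list_nvidia_pkgs` is a hypothetical helper name, not something from the guide, and unlike the plain grep above it matches on the package-name column only, so it can report fewer lines than the grep does.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print the names of
# packages whose name contains "nvidia" (case-insensitive).
# Only rows with status "ii" (installed) or "rc" (removed, config files remain)
# are considered; dpkg's header lines are skipped as a side effect.
list_nvidia_pkgs() {
    awk '$1 ~ /^(ii|rc)$/ && tolower($2) ~ /nvidia/ { print $2 }'
}

# Usage (not run here):
#   dpkg -l | list_nvidia_pkgs
#   sudo apt-get -s remove --purge '^nvidia-.*'   # -s only simulates the removal
```

The simulated run prints what apt-get would remove without touching the system, which makes it easier to spot collateral damage such as ubuntu-desktop before committing.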

2020-02-01, 20:59   #52
ewmayer
∂²ω=0

Sep 2002
República de California

2CFE₁₆ Posts

Quote:
 Originally Posted by M344587487 Yes, Ubuntu installs non-free drivers by default when it needs to unless you tell it not to, including nvidia's blobs if an nvidia card is present. I'm inclined to blame nvidia's proprietary crap for your problems; people have trouble mixing vendors in the same system, and I believe it's because nvidia does things its own way via binary blob, which means they're not integrating properly with the Linux way of doing things.
 The easiest/safest fix is probably to wipe and restart (after burning the nvidia card and burying it in a deep pit preferably, YMMV), but it can't hurt to try purging nvidia from the system if you feel like it (it's not critical but highly recommended that you change your wallpaper to Linus flipping off nvidia at this point, for luck). This is from an old guide but it seems reasonable:
 This command should list all nvidia packages, there should be a few dozen of them:
 Code:
 dpkg -l | grep -i nvidia
 Purge all packages beginning with nvidia-, which should also remove their dependencies:
 Code:
 sudo apt-get remove --purge '^nvidia-.*'
 Reinstall ubuntu-desktop which was just erroneously removed:
 Code:
 sudo apt-get install ubuntu-desktop
 Then reboot and see where you stand.
Thanks, Matt - I PMed you the 'before' and 'after' results of 'dpkg -l | grep -i nvidia' ... on reboot, I still quickly get a "system program problem detected" popup (but now only one, versus multiple ones before) which I dismiss, but 'dmesg' now shows no more of the repeating nVidia-crud. I PMed you the shortlist of bold-highlighted warnings/errors I did find in the dmesg output, one of which involves a vega20*bin firmware file, namely

[ 2.517924] amdgpu 0000:03:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2

I see 13 files among the /lib/firmware/amdgpu/vega20*.bin set which Ubuntu 19.10 auto-installed, but no vega20_ta.bin among those, probably just need to grab that one separately.
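To pull every such missing-firmware name out of the kernel log in one go, something along these lines might help. It is a sketch only: `missing_firmware` is a made-up helper name, and it assumes the message wording matches the dmesg line quoted above. The "error -2" there is the kernel's ENOENT, i.e. the file was simply not found.

```shell
# Hypothetical helper: read kernel log text on stdin and print each firmware
# path whose direct load failed (one per line, duplicates removed).
missing_firmware() {
    sed -n 's/.*Direct firmware load for \([^ ]*\) failed with error.*/\1/p' | sort -u
}

# Usage (not run here):
#   dmesg | missing_firmware
#   ls /lib/firmware/amdgpu/vega20*.bin    # compare against what is installed
```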

Most importantly, 'top' no longer shows any out-of-control system processes, and my Mlucas runs on the CPU are once again back at normal throughput. So, progress!

2020-02-03, 19:42   #53
ewmayer
∂²ω=0

Sep 2002
República de California

2·13·443 Posts

Now that Super Bowl Sunday (quasi-holiday in the US revolving around the National Football League championship game) is behind us, an update - the card seems to be functioning properly. I've been following Matt's "quick and dirty setup guide" here, and am currently at the "Take the above [bash] init script [to set up for 2-gpuowl-instance running] and tweak it to suit your card" step. First I'd like to play with some basic single-instance running, but something is borked. The readme says "Self-test: simply start gpuowl with any valid exponent..." but does not say how to specify that expo via cmd-line flags. I tried just sticking a prime expo in there, then running without any arguments whatever; both gave the following kind of error:
Code:
ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 90110269
2020-02-01 18:43:36 gpuowl v6.11-142-gf54af2e
2020-02-01 18:43:36 Note: not found 'config.txt'
2020-02-01 18:43:36 config: 90110269
2020-02-01 18:43:36 device 0, unique id ''
2020-02-01 18:43:36 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-02-01 18:43:36 Bye
ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl
2020-02-01 18:44:02 gpuowl v6.11-142-gf54af2e
2020-02-01 18:44:02 Note: not found 'config.txt'
2020-02-01 18:44:02 device 0, unique id ''
2020-02-01 18:44:02 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-02-01 18:44:02 Bye
Matt had noted to me, "If the PRP test starts we are good to go. If it fails with something along the lines of clGetDeviceId then gpuowl couldn't see the card." How to debug that latter problem?
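One possible line of attack on DEVICE_NOT_FOUND is to check whether the user is allowed to see the card at all. The sketch below rests on an assumption that is not confirmed anywhere in this thread: that the ROCm stack of this era expects the user to be in the 'video' group. `has_group` is a hypothetical helper name; clinfo is a separate tool that may need installing.

```shell
# Hypothetical helper: succeed if the group named in $2 appears in the
# space-separated group list given in $1.
has_group() {
    groups_list=$1
    wanted=$2
    case " $groups_list " in
        *" $wanted "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (not run here):
#   has_group "$(id -nG)" video || echo "not in 'video'; try: sudo usermod -aG video $USER"
#   clinfo | grep -i 'device name'    # devices OpenCL can see, if clinfo is installed
#   ls -l /dev/kfd /dev/dri/          # device nodes and their owning groups
```

If the group was missing, a logout and login is needed after usermod before the new membership takes effect.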
Looking ahead, the first 2 steps of the setup-for-2-instances script are these:
Code:
#Allow manual control
echo "manual" >/sys/class/drm/card0/device/power_dpm_force_performance_level
#Undervolt by setting max voltage
# V Set this to 50mV less than the max stock voltage of your card (which varies from card to card), then optionally tune it down
echo "vc 2 1801 1010" >/sys/class/drm/card0/device/pp_od_clk_voltage
How do I find the max stock voltage? rocm-smi gives a bunch of things, but not that:
Code:
GPU  Temp   AvgPwr  SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
1    31.0c  21.0W   809Mhz  351Mhz  21.96%  auto  250.0W  0%     0%
...and fiddling with various values of "/opt/rocm/bin/rocm-smi --setfan [n]" to set a constant fan speed causes the Fan value in the above to rise and fall. Thanks for any help from current gpuowl users.

Last fiddled with by ewmayer on 2020-02-03 at 20:16

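On the max-stock-voltage question, one place to look may be the same pp_od_clk_voltage file the script writes to: reading it back before changing anything should show the stock voltage curve. The sketch below assumes the file contains an OD_VDDC_CURVE section with lines of the form "2: 1801Mhz 1113mV"; that layout and the sample numbers in the comment are assumptions, not something taken from this thread, and `max_curve_mv` and `undervolt_target` are hypothetical helper names.

```shell
# Hypothetical helper: read pp_od_clk_voltage-style text on stdin and print the
# highest voltage (in mV) found in the OD_VDDC_CURVE section.
# Assumed layout (illustrative values only):
#   OD_VDDC_CURVE:
#   0: 808Mhz 724mV
#   2: 1801Mhz 1113mV
max_curve_mv() {
    awk '/^OD_VDDC_CURVE:/ { in_curve = 1; next }
         /^OD_/            { in_curve = 0 }
         in_curve && $3 ~ /mV$/ {
             v = $3; sub(/mV$/, "", v)
             if (v + 0 > max) max = v + 0
         }
         END { if (max) print max }'
}

# The guide's starting point: 50 mV below the stock maximum.
undervolt_target() {
    echo $(( $1 - 50 ))
}

# Usage (not run here):
#   stock=$(max_curve_mv < /sys/class/drm/card0/device/pp_od_clk_voltage)
#   undervolt_target "$stock"
```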
2020-02-03, 21:50   #54
paulunderwood

Sep 2002
Database er0rr

2·3·569 Posts

Quote:
 Originally Posted by ewmayer Now that Super Bowl Sunday (quasi-holiday in the US revolving around the National Football League championship game) is behind us, an update - the card seems to be functioning properly. I've been following Matt's "quick and dirty setup guide" here, and am currently at the "Take the above [bash] init script [to set up for 2-gpuowl-instance running] and tweak it to suit your card" step. First I'd like to play with some basic single-instance running, but something is borked. The readme says "Self-test: simply start gpuowl with any valid exponent..." but does not say how to specify that expo via cmd-line flags. I tried just sticking a prime expo in there, then running without any arguments whatever; both gave the following kind of error:
 Code:
 ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 90110269
 2020-02-01 18:43:36 gpuowl v6.11-142-gf54af2e
 2020-02-01 18:43:36 Note: not found 'config.txt'
 2020-02-01 18:43:36 config: 90110269
 2020-02-01 18:43:36 device 0, unique id ''
 2020-02-01 18:43:36 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
 2020-02-01 18:43:36 Bye
 ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl
 2020-02-01 18:44:02 gpuowl v6.11-142-gf54af2e
 2020-02-01 18:44:02 Note: not found 'config.txt'
 2020-02-01 18:44:02 device 0, unique id ''
 2020-02-01 18:44:02 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
 2020-02-01 18:44:02 Bye
Run as root (or sudo) with the -user ewmayer switch (or is it --user? I just run as root.)

Start with fans at 170; monitor the temperatures and, depending on your overclock, undervolt and ambient temperature, you might be able to reduce the fan speed.
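For keeping an eye on things while tuning, a couple of small helpers could be wrapped around rocm-smi. This is a rough sketch: it assumes the raw fan value is on a 0-255 scale (on that scale the 21.96% Fan reading quoted above would be a raw 56, and 170 comes to about two thirds), and it assumes the rocm-smi table layout shown in the quoted post. `fan_pct` and `gpu_temps` are hypothetical names.

```shell
# Convert a raw fan setting on an assumed 0-255 scale to a percentage.
fan_pct() {
    awk -v n="$1" 'BEGIN { printf "%.2f\n", n * 100 / 255 }'
}

# Read rocm-smi's table on stdin and print "GPU temperature" pairs,
# e.g. "1 31.0", for rows whose second column looks like "31.0c".
gpu_temps() {
    awk '$2 ~ /^[0-9.]+c$/ { t = $2; sub(/c$/, "", t); print $1, t }'
}

# Usage (not run here):
#   sudo /opt/rocm/bin/rocm-smi --setfan 170
#   /opt/rocm/bin/rocm-smi | gpu_temps
```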

2020-02-03, 23:06   #55
ewmayer
∂²ω=0

Sep 2002
República de California

10110011111110₂ Posts

Quote:
 Originally Posted by paulunderwood Run as root (or sudo) with the -user ewmayer switch (or is it --user? I just run as root.) Start with fans at 170; monitor the temperatures and, depending on your overclock and undervolt, you might be able to reduce the fan speed.
Thanks-
Per the readme, single minus sign ... from within a subdir 'run0' where I have created a worktodo.txt file containing a pair of PRP assignments, I tried 'sudo ../gpuowl -user ewmayer' ... after entering my sudo password the run echoed the same as the 2nd #fail above, just with an added 'config: -user ewmayer' line. Trying to instead log in as root and run that way [this is the Ubuntu 19.10 setup I created last week] using the same pwd gives 'Authentication failure'. I don't recall entering any other pwd during the set-pwd phase of Ubuntu 19.10 setup.
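The 'Authentication failure' on a root login is likely expected rather than a forgotten password: a stock Ubuntu install leaves the root account's password locked and relies on sudo. A small sketch for confirming that, assuming the usual `passwd -S` output where the second field is the status; `root_pw_state` is a made-up helper name.

```shell
# Hypothetical helper: read `passwd -S root` output on stdin and report the
# password state from the second field
# (L = locked, NP = no password, P = usable password).
root_pw_state() {
    awk '{ if ($2 == "L") print "locked"
           else if ($2 == "NP") print "no password"
           else if ($2 == "P") print "usable password"
           else print "unknown" }'
}

# Usage (not run here):
#   sudo passwd -S root | root_pw_state
#   sudo -i        # opens a root shell using your own sudo password
```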

Not needed yet since I can't run at all, but how do I determine the max stock voltage of my R7?

