mersenneforum.org Radeon VII @ newegg for 500 dollars US on 11-27

2020-05-20, 03:30   #232
preda

"Mihai Preda"
Apr 2015

2²×7×41 Posts

Quote:
 Originally Posted by ewmayer
 Haswell system successfully upgraded from ROCm 2.10 to 3.3 ... alas, the upgrade does not solve the unable-to-fiddle-MCLK problem on this system - note the following was as root:
 Code:
 root@ewmayer-haswell:/home/ewmayer/gpuowl# echo "manual" >/sys/class/drm/card0/device/power_dpm_force_performance_level
 bash: /sys/class/drm/card0/device/power_dpm_force_performance_level: Permission denied
Does the Haswell system by any chance have an integrated GPU, or any other GPU besides the RadeonVII? If so, it may be that the non-R7 sits as "card0".

Can you cat /sys/class/drm/card0/device/pp_od_clk_voltage ?
Do you have any other "cardN" in /sys/class/drm/ ? Do they have a pp_od_clk_voltage that you can cat?
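
A minimal sketch of the check asked for here, assuming the usual amdgpu sysfs layout under /sys/class/drm. The function name list_od_cards and its optional directory argument are illustrative only and are not part of gpuowl or ROCm. The vendor file holds the PCI vendor ID (0x1002 for AMD, 0x8086 for Intel, 0x10de for NVIDIA), which helps tell an integrated GPU apart from the Radeon VII.

```shell
# list_od_cards [DRM_DIR]
# For each cardN under DRM_DIR (default: /sys/class/drm), print the PCI vendor
# ID and whether an overdrive table (pp_od_clk_voltage) is exposed.
# The body runs in a subshell "( ... )" so its variables do not leak.
list_od_cards() (
    drm_dir="${1:-/sys/class/drm}"
    for card in "$drm_dir"/card*; do
        name="${card##*/}"
        num="${name#card}"
        # Keep only cardN; skip connector entries such as card0-HDMI-A-1
        case "$num" in
            ''|*[!0-9]*) continue ;;
        esac
        vendor="unknown"
        if [ -r "$card/device/vendor" ]; then
            vendor="$(cat "$card/device/vendor")"
        fi
        od="no"
        if [ -e "$card/device/pp_od_clk_voltage" ]; then
            od="yes"
        fi
        echo "$name vendor=$vendor pp_od_clk_voltage=$od"
    done
)
```

Calling list_od_cards with no argument scans /sys/class/drm. The line that reports pp_od_clk_voltage=yes names the cardN whose clocks can be adjusted.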

2020-05-20, 16:08   #233
chris2be8

Sep 2009

1860₁₀ Posts

Quote:
 Originally Posted by ewmayer
 The file ownership & permissions look the same on both systems:
 Haswell:
 Code:
 ewmayer@ewmayer-haswell:~$ ll /sys/class/drm/card0/device
 lrwxrwxrwx 1 root root 0 May 19 15:22 /sys/class/drm/card0/device -> ../../../0000:00:02.0/
 New Build:
 Code:
 ewmayer@ewmayer-gimp:~$ ll /sys/class/drm/card0/device
 lrwxrwxrwx 1 root root 0 May 18 15:58 /sys/class/drm/card0/device -> ../../../0000:03:00.0/
Those are symlinks so the permissions of the symlink don't control access. Either list the permissions of the files the symlinks point to or use ls -lL /sys/class/drm/card0/device (the L tells ls to show permissions of the file the symlink references).

Also try lsattr on the target files. The immutable attribute stops even root updating a file.

Chris
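
A small sketch of the symlink point above, assuming GNU stat as shipped with Ubuntu. The helper name show_link_perms is made up for illustration. It prints the mode and owner of the link itself and then of whatever the link resolves to, which is the same distinction as ls -l versus ls -lL.

```shell
# show_link_perms PATH
# Print the mode and owner of PATH itself, then of the file it resolves to.
# Uses GNU stat: -c selects the output format, -L follows symlinks.
show_link_perms() {
    printf 'link:   %s\n' "$(stat -c '%A %U:%G' "$1")"
    printf 'target: %s\n' "$(stat -L -c '%A %U:%G' "$1")"
}
```

For example, show_link_perms /sys/class/drm/card0/device would show lrwxrwxrwx for the link line regardless of how the target directory is protected.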

2020-05-20, 19:10   #234
ewmayer
2ω=0

Sep 2002
República de California

11386₁₀ Posts

Quote:
 Originally Posted by chris2be8 Those are symlinks so the permissions of the symlink don't control access. Either list the permissions of the files the symlinks point to or use ls -lL /sys/class/drm/card0/device (the L tells ls to show permissions of the file the symlink references). Also try lsattr on the target files. The immutable attribute stops even root updating a file.
Thanks - permissions show as root:root and "lsattr /sys/class/drm/card0/device/pp_od_clk_voltage" gives "No such file or directory", but read on: the real issue is that Mihai's surmise was right. This system has entries for 2 cards, which pushes the only physically installed card, the R7, to 'card1' instead of the expected card0.
Quote:
 Originally Posted by preda Does the Haswell system by any chance have an integrated GPU, or any other GPU besides the RadeonVII? If so, it may be that the non-R7 sits as "card0". Can you cat /sys/class/drm/card0/device/pp_od_clk_voltage ? Do you have any other "cardN" in /sys/class/drm/ ? Do they have a pp_od_clk_voltage that you can cat?
The R7 is currently the only card, but when I first upgraded the system to Ubuntu 19.10 in prep. for installing the R7, it had an old nVidia gtx430 installed. I removed that to make room for the R7, but it ended up causing problems due to Ubuntu 19.10 having seen it and installed a whole bunch of nVidia-driver crap, which I then had to rip out to get it to properly recognize the R7 as the sole (or at least main) card.

"cat /sys/class/drm/card0/device/pp_od_clk_voltage", both as regular-user and with sudo prepended, gives "No such file or directory".

"sudo touch /sys/class/drm/card0/device/pp_od_clk_voltage" gives "Permission denied".

But yes, now that I list all the contents of /sys/class/drm, I see that indeed there *are* entries for card0 and card1:
Code:
lrwxrwxrwx  1 root root    0 May 19 15:22 card0 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/
lrwxrwxrwx  1 root root    0 May 19 15:22 card0-HDMI-A-1 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-HDMI-A-1/
lrwxrwxrwx  1 root root    0 May 19 15:22 card0-HDMI-A-2 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-HDMI-A-2/
lrwxrwxrwx  1 root root    0 May 19 15:22 card0-VGA-1 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-VGA-1/
lrwxrwxrwx  1 root root    0 May 19 15:22 card1 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card1/
lrwxrwxrwx  1 root root    0 May 19 15:22 card1-DP-1 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card1/card1-DP-1/
lrwxrwxrwx  1 root root    0 May 19 15:22 card1-DP-2 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card1/card1-DP-2/
lrwxrwxrwx  1 root root    0 May 19 15:22 card1-DP-3 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card1/card1-DP-3/
lrwxrwxrwx  1 root root    0 May 19 15:22 card1-HDMI-A-3 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card1/card1-HDMI-A-3/
lrwxrwxrwx  1 root root    0 May 19 15:22 renderD128 -> ../../devices/pci0000:00/0000:00:02.0/drm/renderD128/
lrwxrwxrwx  1 root root    0 May 19 15:22 renderD129 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/renderD129/
lrwxrwxrwx  1 root root    0 May 19 15:22 ttm -> ../../devices/virtual/drm/ttm/
-r--r--r--  1 root root 4096 May 19 15:22 version
...and I always assumed that the rocm-smi listing began counting devices at 1, because on that system it always only listed the R7 that way. But now that I have the new 2-card build to compare to, I realize (*imaginary lightbulb goes off over cartoon Ernst's head*) that card0 should appear as GPU 0 in the rocm-smi listing. And indeed, "cat /sys/class/drm/card1/device/pp_od_clk_voltage" shows the expected data:
Code:
ewmayer@ewmayer-haswell:~$ sudo cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_SCLK:
0: 808Mhz
1: 1801Mhz
OD_MCLK:
1: 1000Mhz
OD_VDDC_CURVE:
0: 808Mhz 716mV
1: 1304Mhz 808mV
2: 1801Mhz 1102mV
OD_RANGE:
SCLK: 808Mhz 2200Mhz
MCLK: 801Mhz 1200Mhz
VDDC_CURVE_SCLK[0]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[0]: 738mV 1218mV
VDDC_CURVE_SCLK[1]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[1]: 738mV 1218mV
VDDC_CURVE_SCLK[2]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[2]: 738mV 1218mV
So the ghost of the old gtx430 is still haunting the place, biting me in the butt. And the short-term workaround is to simply use the MCLK script with 'card1' on this system ... did all that as root, success!

Reset SCLK to 4 and fan to its usual 120, and the per-iter timings for the 2 runs @5.5M FFT on this system just dropped to ... wait, I notice both of the 2 runs threw an 'EE' error-code in place of the usual 'OK' on the ensuing 200Kiter checkpoints, I'm assuming that's a Gerbicz-check error? I also saw a few such errors last night on the faster (sclk=4) of the 2 new-build R7s in late afternoon, when the sun hitting the west wall of our apt. warmed it up and both cards' temps went up over 80C at their normal fan levels. But those runs only threw a few and then continued OK ... the Haswell runs just both quit due to repeated errors. So it seems the same settings I used on the new-build cards are a bit too aggressive on the Haswell-system R7?

First, let's see if the errors repeat on restarting both runs at the new MCLK=1200 setting ... and they do, so let's reduce the mem-OCing to MCLK=1150 ... that looks better, per-iter timings in last 24 hours dropped from 1415us (under ROCm 2.10, no mem-OCing) to 1345us (ROCm 3.3, no mem-OCing) and now to 1265us with the added mem-OCing.

Longer-term, how do I rip out the card0 crud and get the system to properly recognize the R7 as card0?
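
To put the per-iteration timings quoted above in relative terms, here is a small sketch that computes the percentage reduction between two timings. The helper name pct_faster is hypothetical, and the only inputs are the three figures from the post (1415us, 1345us, 1265us).

```shell
# pct_faster OLD_US NEW_US
# Print the percentage reduction in per-iteration time going from OLD to NEW,
# rounded to one decimal place.
pct_faster() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", 100 * (old - new) / old }'
}

pct_faster 1415 1345   # ROCm 2.10 -> ROCm 3.3, no memory overclock
pct_faster 1345 1265   # ROCm 3.3, adding MCLK=1150
pct_faster 1415 1265   # overall
```

With those inputs the three calls print 4.9, 5.9 and 10.6, so the ROCm upgrade and the memory overclock each contribute roughly half of an overall reduction of about 10.6%.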
Last fiddled with by ewmayer on 2020-05-20 at 19:16

2020-05-20, 19:16   #235
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4,201 Posts

Quote:
 Originally Posted by ewmayer Longer-term, how do I rip out the card0 crud and get the system to properly recognize the R7 as card0?
Condolences. On Windows, I've seen it where recursive rounds of file and folder deletion, registry editing, and restart have been required to exorcise an NVIDIA spirit, that may hide out as display, audio x several, some addon application, or something else. Persistence, accuracy, and thoroughness mandatory.

Last fiddled with by kriesel on 2020-05-20 at 19:19

2020-05-20, 19:22   #236
ewmayer
2ω=0

Sep 2002
República de California

2·5,693 Posts

Quote:
 Originally Posted by kriesel Condolences. On Windows, I've seen it where recursive rounds of file and folder deletion, registry editing, and restart have been required to exorcise an NVIDIA spirit, that may hide out as display, audio x several, or something else. Persistence, accuracy, and thoroughness mandatory.
Well, worst-case I simply have to up-index the R7 device indexes on this system by 1. For a throughput boost nearly equivalent to adding a second Haswell-quad CPU to this system, I think I can handle that level of annoyance. :)

2020-05-20, 19:50   #237
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

3²×19×41 Posts

Quote:
 Originally Posted by ewmayer Longer-term, how do I rip out the card0 crud and get the system to properly recognize the R7 as card0?
My solution to most Linux problems usually ends up with: Erase disk. Reinstall OS.

2020-05-21, 00:13   #238
ewmayer
2ω=0

Sep 2002
República de California

10110001111010₂ Posts

Quote:
 Originally Posted by Prime95 My solution to most Linux problems usually ends up with: Erase disk. Reinstall OS.
Back in the early 2000s I worked at a SiVal startup and one of my colleagues was a huge early Linux proselytizer. Several of us were dabbling with Linux co-installations on one Windows PC or another, and as I recall it, his answer to most any technical-problem issue was "[do a bunch of stuff]" followed by "rebuild your kernel". I'm happy to say that things have gotten a lot better since. :)

Process-ownership question: there seem to be some root-versus-regular-user setup differences between my Haswell system and the new build. On both systems, the gpuowl build-dir and all its contents are owned by me as regular user ewmayer. But on the former I don't need to prepend 'sudo' to run gpuowl, and on the latter, I both need 'sudo' and have the annoying issue that every time a gpuowl PRP test completes, the subsequent auto-update of the worktodo file and newly-created dir for the just-begun next assignment leave both owned by root, requiring me to do a 'sudo chown -R ewmayer:ewmayer *' before I can fetch new work. What could be causing this? Note that on both systems, 'top' shows the owner of the gpuowl jobs to be root.

2020-05-21, 02:02   #239
preda

"Mihai Preda"
Apr 2015

1148₁₀ Posts

Quote:
 Originally Posted by ewmayer Back in the early 2000s I worked at a SiVal startup and one of my colleagues was a huge early Linux proselytizer. Several of us were dabbling with Linux co-installations on one Windows PC or another, and as I recall it, his answer to most any technical-problem issue was "[do a bunch of stuff]" followed by "rebuild your kernel". I'm happy to say that things have gotten a lot better since. :) Process-ownership question: there seem to be some root-versus-regular-user setup differences between my Haswell system and the new build. On both systems, the gpuowl build-dir and all its contents are owned by me as regular user ewmayer. But on the former I don't need to prepend 'sudo' to run gpuowl, and on the latter, I both need 'sudo' and have the annoying issue that every time a gpuowl PRP test completes, the subsequent auto-update of the worktodo file and newly-created dir for the just-begun next assignment leave both owned by root, requiring me to do a 'sudo chown -R ewmayer:ewmayer *' before I can fetch new work. What could be causing this? Note that on both systems, 'top' shows the owner of the gpuowl jobs to be root.
Verify two things:
1. that your user-id belongs to the video group
$ id -a

2. that the video group has access to kfd:
$ cat /etc/udev/rules.d/70-kfd.rules
SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"

With those two steps there should be no need to run gpuowl (or clinfo) as root.

2020-05-21, 03:20   #240
ewmayer
2ω=0

Sep 2002
República de California

26172₈ Posts

Quote:
 Originally Posted by preda
 Verify two things:
 1. that your user-id belongs to the video group
 $ id -a
 2. that the video group has access to kfd:
 $ cat /etc/udev/rules.d/70-kfd.rules
 SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"
 With those two steps there should be no need to run gpuowl (or clinfo) as root.
Yes, George and PaulU led me through that on page 19, posts 204,205 ... double-checking it, and bolding the relevant entries:
Code:
ewmayer@ewmayer-gimp:~$ id
ewmayer@ewmayer-gimp:~$ cat /etc/udev/rules.d/70-kfd.rules
SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"
So an oddity ... not a big deal, just a minor annoyance now that I know to precede all runs of primenet.py with the chown workaround.

Back on the hardware-setup front, looking at the new build, it occurs to me there is an obvious alternate location to put a third GPU: lying flat (fans pointing downward), between the PSU-nearest of the 2 currently installed cards and the PSU itself.

Pluses:
1. there is plenty of room;
2. the main hot-air venting would be in the direction of the top of the PSU, unlike blasting right at the 2 vertically-mounted cards as with my initial right-angle-pcie-adapter location;
3. leaves plenty of vertical space above the CPU fan, not that that needs great airflow in this build, with the CPU only running system processes.

Minuses:
1. Would need some custom-mounting-bracketage, of similar effort to the Alu. frame-end bracket I hacked to support GPU #3 in its original location (see pics in my 'bracket done' post a few pages back);
2. Would need a longish pcie riser cable (something along these lines? Or does the GPU end need to be full-width?) to hook into one of the available pci slots remaining in the board.

The full-length x16 slot (the one I tried before, with the 2 short up-and-out adapters, the right-angle and the straight one) would be very awkward, since the ribbon cable would need to run underneath the full length of the board, back up over the PSU and then 5-6" over to the card in order to not block airflow to the 2 installed cards. (Alternatively, I could make sure the ribbon cable runs away from GPU #2 for a few inches, then up and over both ... a full-x16-width ribbon routed thusly would not block any airflow, it would only cover the RADEON LEDs on the 2 installed cards ... length would need to be no more than a foot.) A ribbon cable plugging into the shorter x1 slot, if that is an option, would have the problem that said slot is right between the 2 installed cards, very narrow access space, right next to the leftmost intake fan of GPU #1.

Last fiddled with by ewmayer on 2020-05-21 at 03:24

2020-05-21, 04:38   #241
preda

"Mihai Preda"
Apr 2015

2²×7×41 Posts

I think something like this is what you want:
https://www.amazon.com/6-Pack-Graphi...0035671&sr=8-2
(I'm not endorsing this particular instance, you should shop around, this is just an example)

You get 6x for $30, and the USB cable is easy to route. The downside is, you need to power this adapter (more power cords). Also be sure to set the PCIe to Gen1 on the motherboard, as these extenders don't handle Gen3 usually.

2020-05-21, 15:30   #242
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4,201 Posts

Quote:
 Originally Posted by preda You get 6x for $30, and the USB cable is easy to route. The downside is, you need to power this adapter (more power cords). Also be sure to set the PCIe to Gen1 on the motherboard, as these extenders don't handle Gen3 usually.
Half the cost on eBay. Such extenders typically use a 75W rated power cable and some extenders are rated for 60W. (Direct-to-gpu cables handle the rest of the power draw.) Limit to 2 per SATA power cable in my experience, and I suggest avoiding the SATA power cable that powers your boot drive. These provide great flexibility in gpu location and spacing for air cooling and component clearance. In my experience gen1 vs gen2 does not seem to matter; gen3 may, as Preda states, or gen4 etc. in the future.

I'm not sure mingling components from different lots is functional. By that I mean keep cable and extender components on either end from getting mixed among lots. I've seen some become unusable, possibly due to component shuffling or early failure (capacitors & regulator suspected since they were running very hot), but at $2.50 a slot, if they last a year that's ok.

All times are UTC.