mersenneforum.org  

2020-05-20, 03:30   #232
preda
"Mihai Preda"
Apr 2015
1198₁₀ Posts

Quote:
Originally Posted by ewmayer View Post
Haswell system successfully upgraded from ROCm 2.10 to 3.3 ... alas, the upgrade does not solve the unable-to-fiddle-MCLK problem on this system - note the following was as root:
Code:
root@ewmayer-haswell:/home/ewmayer/gpuowl# echo "manual" >/sys/class/drm/card0/device/power_dpm_force_performance_level
bash: /sys/class/drm/card0/device/power_dpm_force_performance_level: Permission denied
Does the Haswell system by any chance have an integrated GPU? or any other GPU beside the RadeonVII? -- if so, it may be that the non-R7 sits as "card0".

Can you cat /sys/class/drm/card0/device/pp_od_clk_voltage ?
Do you have any other "cardN" in /sys/class/drm/ ? Do they have a pp_od_clk_voltage that you can cat?
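A minimal sketch of the check being asked for here: loop over the cardN nodes and report which ones expose pp_od_clk_voltage. The function name list_od_cards and its optional directory argument are illustrative assumptions, not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Report, for every DRM card node, whether it exposes the amdgpu overdrive
# table. The directory argument exists only so the sketch can be tried
# against a scratch tree; it defaults to the real sysfs location.
list_od_cards() (
    drm_root="${1:-/sys/class/drm}"
    for card in "$drm_root"/card[0-9]*; do
        [ -e "$card" ] || continue
        # Skip connector entries such as card0-HDMI-A-1.
        case "${card##*/}" in *-*) continue ;; esac
        if [ -e "$card/device/pp_od_clk_voltage" ]; then
            echo "${card##*/}: pp_od_clk_voltage present"
        else
            echo "${card##*/}: no pp_od_clk_voltage"
        fi
    done
)

list_od_cards
```

A card that reports "no pp_od_clk_voltage" is either not driven by amdgpu or does not have overdrive enabled, so it is probably not the one to point the clock scripts at.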

2020-05-20, 16:08   #233
chris2be8
Sep 2009
2²·11·43 Posts

Quote:
Originally Posted by ewmayer View Post
The file ownership & permissions look the same on both systems:
Haswell:
Code:
ewmayer@ewmayer-haswell:~$ ll /sys/class/drm/card0/device
lrwxrwxrwx 1 root root 0 May 19 15:22 /sys/class/drm/card0/device -> ../../../0000:00:02.0/
New Build:
Code:
ewmayer@ewmayer-gimp:~$ ll /sys/class/drm/card0/device
lrwxrwxrwx 1 root root 0 May 18 15:58 /sys/class/drm/card0/device -> ../../../0000:03:00.0/
Those are symlinks so the permissions of the symlink don't control access. Either list the permissions of the files the symlinks point to or use ls -lL /sys/class/drm/card0/device (the L tells ls to show permissions of the file the symlink references).

Also try lsattr on the target files. The immutable attribute stops even root updating a file.
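
To illustrate the point about symlinks, here is a small sketch that prints the mode of a link itself next to the mode of its target, using GNU stat in a scratch directory. show_link_modes is a hypothetical helper name; on the real system it could be pointed at /sys/class/drm/card0/device instead.

```shell
#!/bin/sh
# Show the mode of a symlink itself versus the mode of what it points to.
# Assumes GNU stat (coreutils); -L follows the link, like ls -lL does.
show_link_modes() (
    link="$1"
    echo "link:   $(stat -c '%A' "$link")"
    echo "target: $(stat -L -c '%A' "$link")"
)

# Self-contained demo in a scratch directory (no sysfs needed).
demo=$(mktemp -d)
touch "$demo/target"
chmod 444 "$demo/target"
ln -s "$demo/target" "$demo/link"
show_link_modes "$demo/link"
rm -rf "$demo"
```

The link line always shows lrwxrwxrwx on Linux regardless of the target, which is why the two listings in the quoted post look identical and say nothing about write access.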

Chris

2020-05-20, 19:10   #234
ewmayer
2ω=0
Sep 2002
República de California
2·13·443 Posts

Quote:
Originally Posted by chris2be8 View Post
Those are symlinks so the permissions of the symlink don't control access. Either list the permissions of the files the symlinks point to or use ls -lL /sys/class/drm/card0/device (the L tells ls to show permissions of the file the symlink references).

Also try lsattr on the target files. The immutable attribute stops even root updating a file.
Thanks - permissions show as root:root and "lsattr /sys/class/drm/card0/device/pp_od_clk_voltage" gives "No such file or directory", but read on: the real issue is that Mihai's surmise was right. This system has entries for 2 cards, which pushes the only physically installed card, the R7, to 'card1' instead of the expected card0.
Quote:
Originally Posted by preda View Post
Does the Haswell system by any chance have an integrated GPU? or any other GPU beside the RadeonVII? -- if so, it may be that the non-R7 sits as "card0".

Can you cat /sys/class/drm/card0/device/pp_od_clk_voltage ?
Do you have any other "cardN" in /sys/class/drm/ ? Do they have a pp_od_clk_voltage that you can cat?
The R7 is currently the only card, but when I first upgraded the system to Ubuntu 19.10 in prep. for installing the R7, it had an old nVidia gtx430 installed. I removed that to make room for the R7, but it ended up causing problems due to Ubuntu 19.10 having seen it and installed a whole bunch of nVidia-driver crap, which I then had to rip out to get it to properly recognize the R7 as the sole (or at least main) card.

"cat /sys/class/drm/card0/device/pp_od_clk_voltage", both as regular-user and with sudo prepended, gives "No such file or directory".

"sudo touch /sys/class/drm/card0/device/pp_od_clk_voltage" gives "Permission denied".

But yes, now that I list all the contents of /sys/class/drm/, I see that indeed there *are* entries for card0 and card1:
Code:
lrwxrwxrwx  1 root root    0 May 19 15:22 card0 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/
lrwxrwxrwx  1 root root    0 May 19 15:22 card0-HDMI-A-1 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-HDMI-A-1/
lrwxrwxrwx  1 root root    0 May 19 15:22 card0-HDMI-A-2 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-HDMI-A-2/
lrwxrwxrwx  1 root root    0 May 19 15:22 card0-VGA-1 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-VGA-1/
lrwxrwxrwx  1 root root    0 May 19 15:22 card1 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card1/
lrwxrwxrwx  1 root root    0 May 19 15:22 card1-DP-1 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card1/card1-DP-1/
lrwxrwxrwx  1 root root    0 May 19 15:22 card1-DP-2 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card1/card1-DP-2/
lrwxrwxrwx  1 root root    0 May 19 15:22 card1-DP-3 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card1/card1-DP-3/
lrwxrwxrwx  1 root root    0 May 19 15:22 card1-HDMI-A-3 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card1/card1-HDMI-A-3/
lrwxrwxrwx  1 root root    0 May 19 15:22 renderD128 -> ../../devices/pci0000:00/0000:00:02.0/drm/renderD128/
lrwxrwxrwx  1 root root    0 May 19 15:22 renderD129 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/renderD129/
lrwxrwxrwx  1 root root    0 May 19 15:22 ttm -> ../../devices/virtual/drm/ttm/
-r--r--r--  1 root root 4096 May 19 15:22 version
...and I always assumed that the rocm-smi listing began counting devices at 1, because on that system it always only listed the R7 that way. But now that I have the new 2-card build to compare to, I realize (*imaginary lightbulb goes off over cartoon Ernst's head*) that card0 should appear as GPU 0 in the rocm-smi listing. And indeed, "cat /sys/class/drm/card1/device/pp_od_clk_voltage" shows the expected data:
Code:
ewmayer@ewmayer-haswell:~$ sudo cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_SCLK:
0:        808Mhz
1:       1801Mhz
OD_MCLK:
1:       1000Mhz
OD_VDDC_CURVE:
0:        808Mhz        716mV
1:       1304Mhz        808mV
2:       1801Mhz       1102mV
OD_RANGE:
SCLK:     808Mhz       2200Mhz
MCLK:     801Mhz       1200Mhz
VDDC_CURVE_SCLK[0]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[0]:     738mV        1218mV
VDDC_CURVE_SCLK[1]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[1]:     738mV        1218mV
VDDC_CURVE_SCLK[2]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[2]:     738mV        1218mV
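
The OD_RANGE section above gives the MCLK limits the driver advertises (801 to 1200 MHz here). Below is a minimal sketch, assuming the table format shown above, of checking a requested memory clock against that range before handing it to whatever script does the actual write. mclk_in_range is a hypothetical helper; the MCLK script used in this thread is not shown and is not reproduced here.

```shell
#!/bin/sh
# Succeed (exit 0) if the requested MCLK in MHz lies inside the range that
# the OD_RANGE section of a pp_od_clk_voltage table advertises.
mclk_in_range() (
    table="$1"
    want="$2"
    awk -v want="$want" '
        # Only the OD_RANGE line starts with exactly "MCLK:"; the section
        # header "OD_MCLK:" does not match. "801Mhz" + 0 evaluates to 801.
        $1 == "MCLK:" { lo = $2 + 0; hi = $3 + 0; found = 1 }
        END { exit !(found && want >= lo && want <= hi) }
    ' "$table"
)

# Read-only check against the card used on the Haswell system in this thread.
table=/sys/class/drm/card1/device/pp_od_clk_voltage
if [ -r "$table" ] && mclk_in_range "$table" 1150; then
    echo "1150 MHz is inside the advertised MCLK range"
fi
```

Being inside the range only means the driver should accept the value; as the rest of this post shows, it says nothing about whether a given card is stable there.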
So the ghost of the old gtx430 is still haunting the place, biting me in the butt. And the short-term workaround is to simply use the MCLK script with 'card1' on this system ... did all that as root, success! Reset SCLK to 4 and fan to its usual 120, and the per-iter timings for the 2 runs @5.5M FFT on this system just dropped to ... wait, I notice both of the 2 runs threw an 'EE' error-code in place of the usual 'OK' on the ensuing 200Kiter checkpoints, I'm assuming that's a Gerbicz-check error?

I also saw a few such errors last night on the faster (sclk=4) of the 2 new-build R7s in late afternoon, when the sun hitting the west wall of our apt. warmed it up and both cards' temps went up over 80C at their normal fan levels. But those runs only threw a few and then continued OK ... the Haswell runs just both quit due to repeated errors. So it seems the same settings I used on the new-build cards are a bit too aggressive on the Haswell-system R7? First, let's see if the errors repeat on restarting both runs at the new MCLK=1200 setting ... and they do, so let's reduce the mem-OCing to MCLK=1150 ... that looks better, per-iter timings in last 24 hours dropped from 1415us (under ROCm 2.10, no mem-OCing) to 1345us (ROCm 3.3, no mem-OCing) and now to 1265us with the added mem-OCing.
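
For reference, the per-iteration timings quoted above work out to roughly a 4.9% reduction from the ROCm upgrade, a further 5.9% from the memory overclock, and about 10.6% overall. A small sketch of that arithmetic (speedup_pct is just an illustrative name):

```shell
#!/bin/sh
# Percentage reduction in per-iteration time going from $1 us to $2 us.
speedup_pct() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

speedup_pct 1415 1345   # ROCm 2.10 -> ROCm 3.3, no mem-OCing
speedup_pct 1345 1265   # ROCm 3.3, adding MCLK=1150
speedup_pct 1415 1265   # both changes combined
```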

Longer-term, how do I rip out the card0 crud and get the system to properly recognize the R7 as card0?
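
One hedged observation: PCI address 0000:00:02.0, where card0 lives in the listing above, is where Intel integrated graphics conventionally sits, so card0 may simply be the CPU's iGPU rather than a leftover of the old NVIDIA card. That is an assumption; lspci or the vendor file under the card's device directory would confirm it. Either way, scripts can avoid hard-coding the index by looking the card up by PCI vendor ID. A minimal sketch, with find_cards_by_vendor as a hypothetical helper name:

```shell
#!/bin/sh
# Print the cardN names whose PCI vendor ID matches the one requested.
# The second argument exists only for trying the sketch on a scratch tree.
find_cards_by_vendor() (
    vendor="$1"
    drm_root="${2:-/sys/class/drm}"
    for card in "$drm_root"/card[0-9]*; do
        # Skip connector entries such as card1-DP-1.
        case "${card##*/}" in *-*) continue ;; esac
        [ -r "$card/device/vendor" ] || continue
        if [ "$(cat "$card/device/vendor")" = "$vendor" ]; then
            echo "${card##*/}"
        fi
    done
)

# 0x1002 is AMD's PCI vendor ID (0x8086 is Intel, 0x10de is NVIDIA).
find_cards_by_vendor 0x1002
```

On a box with a single AMD card, the output can be substituted for the card0 or card1 part of the sysfs paths, so the same script works on both systems without renumbering anything.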

Last fiddled with by ewmayer on 2020-05-20 at 19:16

2020-05-20, 19:16   #235
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
4,421 Posts

Quote:
Originally Posted by ewmayer View Post
Longer-term, how do I rip out the card0 crud and get the system to properly recognize the R7 as card0?
Condolences. On Windows, I've seen it where recursive rounds of file and folder deletion, registry editing, and restart have been required to exorcise an NVIDIA spirit, that may hide out as display, audio x several, some addon application, or something else. Persistence, accuracy, and thoroughness mandatory.

Last fiddled with by kriesel on 2020-05-20 at 19:19

2020-05-20, 19:22   #236
ewmayer
2ω=0
Sep 2002
República de California
2×13×443 Posts

Quote:
Originally Posted by kriesel View Post
Condolences. On Windows, I've seen it where recursive rounds of file and folder deletion, registry editing, and restart have been required to exorcise an NVIDIA spirit, that may hide out as display, audio x several, or something else. Persistence, accuracy, and thoroughness mandatory.
Well, worst-case I simply have to up-index the R7 device indexes on this system by 1. For a throughput boost nearly equivalent to adding a second Haswell-quad CPU to this system, I think I can handle that level of annoyance. :)

2020-05-20, 19:50   #237
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
2⁴·3·149 Posts

Quote:
Originally Posted by ewmayer View Post
Longer-term, how do I rip out the card0 crud and get the system to properly recognize the R7 as card0?
My solution to most Linux problems usually ends up with: Erase disk. Reinstall OS.

2020-05-21, 00:13   #238
ewmayer
2ω=0
Sep 2002
República de California
10110011111110₂ Posts

Quote:
Originally Posted by Prime95 View Post
My solution to most Linux problems usually ends up with: Erase disk. Reinstall OS.
Back in the early 2000s I worked at a SiVal startup and one of my colleagues was a huge early Linux proselytizer. Several of us were dabbling with Linux co-installations on one Windows PC or another, and as I recall it, his answer to most any technical-problem issue was "[do a bunch of stuff]" followed by "rebuild your kernel". I'm happy to say that things have gotten a lot better since. :)

Process-ownership question: there seem to be some root-versus-regular-user setup differences between my Haswell system and the new build. On both systems, the gpuowl build-dir and all its contents are owned by me as regular user ewmayer. But on the former I don't need to prepend 'sudo' to run gpuowl, and on the latter, I both need 'sudo' and have the annoying issue that every time a gpuowl PRP test completes, the subsequent auto-update of the worktodo file and newly-created dir for the just-begun next assignment leave both owned by root, requiring me to do a 'sudo chown -R ewmayer:ewmayer *' before I can fetch new work. What could be causing this?

Note that on both systems, 'top' shows the owner of the gpuowl jobs to be root.

2020-05-21, 02:02   #239
preda
"Mihai Preda"
Apr 2015
2256₈ Posts

Quote:
Originally Posted by ewmayer View Post
Back in the early 2000s I worked at a SiVal startup and one of my colleagues was a huge early Linux proselytizer. Several of us were dabbling with Linux co-installations on one Windows PC or another, and as I recall it, his answer to most any technical-problem issue was "[do a bunch of stuff]" followed by "rebuild your kernel". I'm happy to say that things have gotten a lot better since. :)

Process-ownership question: there seem to be some root-versus-regular-user setup differences between my Haswell system and the new build. On both systems, the gpuowl build-dir and all its contents are owned by me as regular user ewmayer. But on the former I don't need to prepend 'sudo' to run gpuowl, and on the latter, I both need 'sudo' and have the annoying issue that every time a gpuowl PRP test completes, the subsequent auto-update of the worktodo file and newly-created dir for the just-begun next assignment leave both owned by root, requiring me to do a 'sudo chown -R ewmayer:ewmayer *' before I can fetch new work. What could be causing this?

Note that on both systems, 'top' shows the owner of the gpuowl jobs to be root.
Verify two things:

1. that your user-id belongs to the video group
$ id -a

2. that the video group has access to kfd:
$ cat /etc/udev/rules.d/70-kfd.rules
SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"

With those two steps there should be no need to run gpuowl (or clinfo) as root.
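
The two checks above can be sketched as a small script. in_group is a hypothetical helper; /dev/kfd is the device node the udev rule applies to, and stat here is assumed to be GNU stat.

```shell
#!/bin/sh
# Succeed if group $1 appears in the space-separated list $2
# (defaults to the current user's groups from `id -nG`).
in_group() (
    list="${2:-$(id -nG 2>/dev/null)}"
    for g in $list; do
        [ "$g" = "$1" ] && exit 0
    done
    exit 1
)

if in_group video; then
    echo "current user is in the video group"
else
    echo "current user is NOT in the video group"
fi

# Show which group owns the kfd node and with what mode, if it exists.
if [ -e /dev/kfd ]; then
    echo "/dev/kfd group: $(stat -c '%G' /dev/kfd), mode: $(stat -c '%A' /dev/kfd)"
fi
```

Note that a newly added group membership only shows up in sessions started after the change, so a fresh login may be needed before the first check passes.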

2020-05-21, 03:20   #240
ewmayer
2ω=0
Sep 2002
República de California
2×13×443 Posts

Quote:
Originally Posted by preda View Post
Verify two things:

1. that your user-id belongs to the video group
$ id -a

2. that the video group has access to kfd:
$ cat /etc/udev/rules.d/70-kfd.rules
SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"

With those two steps there should be no need to run gpuowl (or clinfo) as root.
Yes, George and PaulU led me through that on page 19, posts 204,205 ... double-checking it, and bolding the relevant entries:
Code:
ewmayer@ewmayer-gimp:~$ id
uid=1000(ewmayer) gid=1000(ewmayer) groups=1000(ewmayer),4(adm),24(cdrom),27(sudo),30(dip),44(video),46(plugdev),119(lpadmin),130(lxd),131(sambashare)
ewmayer@ewmayer-gimp:~$ cat /etc/udev/rules.d/70-kfd.rules
SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"
So an oddity ... not a big deal, just a minor annoyance now that I know to precede all runs of primenet.py with the chown workaround.
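
Files created by a process running under sudo are owned by root, which would account for the worktodo file and the new assignment directory changing owner; why sudo is needed on this system in the first place remains open. A minimal sketch for seeing exactly which entries are affected, so the chown can be limited to them. owned_by and the GPUOWL_DIR variable are hypothetical names, and the default path is an assumption.

```shell
#!/bin/sh
# List everything under a directory that belongs to a given user (name or
# numeric ID), so a later chown can be restricted to just those entries.
owned_by() (
    find "$1" -user "$2" -print
)

# Hypothetical location of the gpuowl run directory; override with GPUOWL_DIR.
run_dir="${GPUOWL_DIR:-${HOME:-.}/gpuowl}"
if [ -d "$run_dir" ]; then
    owned_by "$run_dir" root
fi
```

If the list looks right, the same find expression can drive the fix directly, e.g. 'sudo find <dir> -user root -exec chown ewmayer:ewmayer {} +', instead of a blanket 'chown -R' on everything.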

Back on the hardware-setup front, looking at the new build, it occurs to me there is an obvious alternate location to put a third GPU: lying flat (fans pointing downward), between the PSU-nearest of the 2 currently installed cards and the PSU itself.

Pluses:

1. there is plenty of room;
2. the main hot-air venting would be in the direction of the top of the PSU, unlike blasting right at the 2 vertically-mounted cards as with my initial right-angle-pcie-adapter location;
3. leaves plenty of vertical space above the CPU fan, not that that needs great airflow in this build, with the CPU only running system processes.

Minuses:

1. Would need some custom-mounting-bracketage, of similar effort to the Alu. frame-end bracket I hacked to support GPU #3 in its original location (see pics in my 'bracket done' post a few pages back);
2. Would need a longish pcie riser cable (something along these lines? Or does the GPU end need to be full-width?) to hook into one of the available pci slots remaining in the board. The full-length x16 slot (the one I tried before, with the 2 short up-and-out adapters, the right-angle and the straight one) would be very awkward, since the ribbon cable would need to run underneath the full length of the board, back up over the PSU and then 5-6" over to the card in order to not block airflow to the 2 installed cards. (Alternatively, I could make sure the ribbon cable runs away from GPU #2 for a few inches, then up and over both ... a full-x16-width ribbon routed thusly would not block any airflow, it would only cover the RADEON LEDs on the 2 installed cards ... length would need to be no more than a foot.)

A ribbon cable plugging into the shorter x1 slot, if that is an option, would have the problem that said slot is right between the 2 installed cards, very narrow access space, right next to the leftmost intake fan of GPU #1.

Last fiddled with by ewmayer on 2020-05-21 at 03:24

2020-05-21, 04:38   #241
preda
"Mihai Preda"
Apr 2015
2·599 Posts

I think something like this is what you want:

https://www.amazon.com/6-Pack-Graphi...0035671&sr=8-2

(I'm not endorsing this particular instance, you should shop around, this is just an example)

You get 6x for $30, and the USB cable is easy to route. The downside is, you need to power this adapter (more power cords). Also be sure to set the PCIe to Gen1 on the motherboard, as these extenders don't handle Gen3 usually.
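
After changing the PCIe generation in the BIOS, the negotiated link can be read back from sysfs. A minimal sketch, assuming a kernel that exposes the current_link_speed and current_link_width attributes for PCI devices; link_info is a hypothetical helper name. Gen1 corresponds to 2.5 GT/s, Gen2 to 5 GT/s and Gen3 to 8 GT/s.

```shell
#!/bin/sh
# Print the PCIe link attributes that sysfs exposes for one PCI device
# directory. Attributes that are missing or unreadable are skipped.
link_info() (
    dev="$1"
    for attr in current_link_speed current_link_width max_link_speed max_link_width; do
        if [ -r "$dev/$attr" ]; then
            echo "$attr: $(cat "$dev/$attr")"
        fi
    done
)

# card1 is the R7 on the Haswell box in this thread; adjust as needed.
link_info /sys/class/drm/card1/device
```

One caveat, stated as an assumption: the listing earlier in the thread shows the R7 sitting behind two bridge devices, so the value reported for the GPU itself may describe an internal link. In that case the slot-facing link would be the one reported for the upstream bridge.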

2020-05-21, 15:30   #242
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
4,421 Posts

Quote:
Originally Posted by preda View Post
You get 6x for $30, and the USB cable is easy to route. The downside is, you need to power this adapter (more power cords). Also be sure to set the PCIe to Gen1 on the motherboard, as these extenders don't handle Gen3 usually.
Half the cost on eBay. Such extenders typically use a 75W rated power cable and some extenders are rated for 60W. (Direct-to-gpu cables handle the rest of the power draw.) In my experience, limit these to 2 per SATA power cable, and I suggest avoiding the SATA power cable that powers your boot drive. These provide great flexibility in gpu location and spacing for air cooling and component clearance.
In my experience gen1 vs gen2 does not seem to matter; gen3 may as Preda states, or gen4 etc in the future.
I'm not sure mingling components from different lots is functional. By that I mean keep cable and extender components on either end from getting mixed among lots.

I've seen some become unusable, possibly due to component shuffling or early failure (capacitors & regulators suspected since they were running very hot) but at $2.50 a slot if they last a year that's ok.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.