#1 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1FA9₁₆ Posts |
This thread is intended as reference material for system management, not specific to Mersenne hunting, but important for it.
(Suggestions are welcome. Discussion posts in this thread are not encouraged. Please use the reference material discussion thread, http://www.mersenneforum.org/showthread.php?t=23383.)
Table of Contents
Last fiddled with by kriesel on 2021-09-01 at 16:29 Reason: added WSL, Linux, Windows - Linux dictionary |
#2 |
Some things to check if system uptime or overall reliability is less than satisfactory.
Last fiddled with by kriesel on 2019-11-17 at 15:01 |
#3 |
These are mostly from Windows experience.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-07-16 at 18:44 |
#4 |
Logging of application output normally directed to the console is encouraged, except for gpuowl, which has comprehensive built-in logging. Stdout and stderr are where error messages, warnings, and normal program output typically appear. Most applications don't log much of this to a file themselves. Errors not trapped by the program may scroll off screen before the user has a chance to see them.
Per Chalsall, in Linux the append option for the tee command is either "-a" or "--append".

Regarding Windows PowerShell and tee: I saw a warning somewhere that tee creates a new destination file even if a file by that name already exists. That would blow away the previously accumulated log every time the app halted, from the Windows display driver timeout or any other reason, and the batch wrapper restarted the application with tee redirecting a copy of screen output to the file, unless the alert user had incorporated the batch loop count or %date%%time% into the tee destination file name in the batch file. That first time could be a killer of months of logging.

The -append modifier for tee is not accepted at the command line in my test on Win7.

Win7 Pro PS (same comments except no -a or -append work)
Code:
PS C:\Users\Ken\documents> dir | tee-object -filepath tee-test.txt -append
Tee-Object : A parameter cannot be found that matches parameter name 'append'.
At line:1 char:48
+ dir | tee-object -filepath tee-test.txt -append <<<<
    + CategoryInfo          : InvalidArgument: (:) [Tee-Object], ParameterBindingException
    + FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.TeeObjectCommand
PS C:\Users\Ken\documents> dir | tee-object -filepath tee-test.txt -a
Tee-Object : A parameter cannot be found that matches parameter name 'a'.
At line:1 char:43
+ dir | tee-object -filepath tee-test.txt -a <<<<
    + CategoryInfo          : InvalidArgument: (:) [Tee-Object], ParameterBindingException
    + FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.TeeObjectCommand
PS C:\Users\Ken\documents>

The -append option is not present in command line tee help on Win7, but is in Win10. The status for Windows 8 and 8.1 is unknown. In the absence of tee -append, I use append redirection (>>). Some applications print part of a message to stderr and part to stdout, so the message gets only partially redirected.
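Where neither tee -a nor Tee-Object -Append is available, the same effect can be had from a small wrapper. The following is a minimal sketch in Python, not something taken from any GIMPS application; the function name run_and_log and the log file name in the usage note are hypothetical. It merges stderr into stdout, so that messages split between the two streams end up together, and it opens the log in append mode, so that a restart does not truncate earlier output.

```python
import subprocess
import sys
from datetime import datetime
from pathlib import Path


def run_and_log(command, log_path):
    """Run command, echo its output to the console, and append it to log_path.

    stderr is merged into stdout so both streams land in the log together.
    Returns the exit code of the command.
    """
    log_path = Path(log_path)
    # Mode "a" appends, so restarting the wrapper never overwrites the old log.
    with log_path.open("a", encoding="utf-8", errors="replace") as log:
        stamp = datetime.now().isoformat(timespec="seconds")
        log.write(f"--- started {stamp} ---\n")
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            errors="replace",
        )
        for line in process.stdout:
            sys.stdout.write(line)  # copy to the console
            log.write(line)         # copy to the log file
            log.flush()             # keep the log current if the app hangs
        return process.wait()
```

A call might look like run_and_log(["CUDALucas.exe"], "cudalucas.log"), with the executable and log names adjusted to the application in use. Each start adds a timestamped marker line, which makes restarts easy to spot in the accumulated log.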
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-17 at 15:03 Reason: added gpuowl as exception |
#5 |
Memory errors might occur in the gpu vram, or in the system ram, or if particularly unlucky, both. Either can affect the GIMPS calculation results of GPU applications. Ideally we would all use highly reliable hardware, with ECC present and turned on.
On the cpu side:
System ram can be tested with memtest86 or memtest86+ (https://www.memtest86.com/ or http://www.memtest.org/). Memtest86+ can prepare a table of bad physical locations. System ram is inexpensive, so bad modules can be detected and removed or replaced, and the system retested. Retesting periodically (annually?) is advisable.

For Linux systems, those badram tables from memtest86+ can be input to the Linux badram kernel patch, which allocates those bad physical locations and hangs on to them, so they don't get allocated to some application we care about whose results could be ruined by memory errors, such as GIMPS computations.

For Windows systems, there is no equivalent user-applicable patch available, to my knowledge. For at least some versions, there's a built-in alternative described in lots of detail at https://superuser.com/questions/4200...ive-ram#490522 Note the caution about possibly causing a boot failure if done incorrectly. This should be a temporary workaround while replacement RAM is on order.

For other OSes, there may be no alternative to RAM replacement or removal.

On the GPU side:
NVIDIA GPU memory can be tested with the -memtest option of CUDALucas. AMD or NVIDIA GPU memory can be tested with gpumemtest, https://sourceforge.net/projects/cudagpumemtest/ Also see https://www.raymond.cc/blog/having-p...st-its-memory/ Intel IGPs use system ram, so that gets tested on the system side. ECC is often not available, and if present and enabled, it reduces performance. (Only high end pro-quality card models included ECC in their design.)

Speculatively: The gpu memory may or may not be subject to the virtual memory management of the host OS. It may be possible to develop code to do bad-gpu-memory lockout at the application level, or at the driver level. Whether that results in gpu memory fragmentation that causes problems is to be determined.
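Lockout schemes of the kind described above generally work on whole memory pages rather than on single addresses, so a list of bad physical addresses from a memory tester usually has to be reduced to page frames first. The following Python sketch shows only that arithmetic. The function name bad_page_ranges is hypothetical, the 4096-byte page size is an assumption (the usual x86 value), and the output is not in the input format of the badram patch or of any Windows tool.

```python
def bad_page_ranges(bad_addresses, page_size=4096):
    """Group bad physical addresses into contiguous page-frame ranges.

    Returns a sorted list of (first_pfn, last_pfn) tuples, both inclusive.
    page_size=4096 is an assumption, not read from the system.
    """
    # Integer division maps each address to the page frame that contains it;
    # the set removes duplicates when several bad addresses share a page.
    pfns = sorted({address // page_size for address in bad_addresses})
    ranges = []
    for pfn in pfns:
        if ranges and pfn == ranges[-1][1] + 1:
            # Adjacent to the previous bad page: extend the current range.
            ranges[-1] = (ranges[-1][0], pfn)
        else:
            # Start a new range.
            ranges.append((pfn, pfn))
    return ranges
```

For instance, addresses 0x1000, 0x1FFF, and 0x2004 fall in two adjacent pages and collapse to one range, which keeps the resulting exclusion list short.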
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-07-16 at 18:45 |
#6 |
For Windows 10 setup for better privacy, there's https://www.reddit.com/r/conspiracy/...in_windows_10/ which may be useful.
Setting Windows classic theme in Win 10: https://www.youtube.com/watch?v=j2fL1RRsuTw
Stopping Cortana: https://www.pcworld.com/article/2949...assistant.html

On Windows 7 a while back, while benchmarking CUDALucas under different gpu driver versions, and before I had figured out how to reliably stop automatic driver updates, removing the network cable temporarily worked very well to block updates.

Controlling when updates occur can be useful. Windows 10 can be configured to require your consent for updates. See https://www.techradar.com/news/softw...ows-10-1307070 If you don't trust it, there are fallbacks. Firewalling off update servers is a possibility: in your firewall router, make a router entry that says the undesired server addresses are on the LAN side. If the update software's packets never cross the router, updates won't be downloaded or installed. If all else fails, or for simplicity, temporarily unplug the network cable.

Getting the Pro version of Windows provides remote desktop server capability. In some versions it also means better control of backups (more choice of destination, for example).

Windows limits its page file to no more than 3 times the physical memory present, so it is possible to exhaust virtual memory near 4 times physical ram (physical ram plus the maximum page file). That's a showstopper. Take care to run with no more committed ram than ~3.5 times physical system ram. Get a motherboard capable of sufficient RAM expansion that it does not become an issue.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-08-19 at 22:12 Reason: Add virtual memory/physical ratio |
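The virtual memory figures above are simple multiples of installed ram. As a worked example, here is a short Python sketch of the arithmetic. The function name commit_limits_gib is hypothetical, and the 3x page file limit and ~3.5x ceiling are taken from the post as defaults, not queried from Windows.

```python
def commit_limits_gib(physical_ram_gib, pagefile_multiple=3.0, safe_multiple=3.5):
    """Return (hard_limit, safe_ceiling) for committed memory, in GiB.

    hard_limit   = physical ram + largest page file (3x physical by default),
                   i.e. about 4x physical ram.
    safe_ceiling = the ~3.5x physical ram level suggested as a working maximum.
    """
    hard_limit = physical_ram_gib * (1.0 + pagefile_multiple)
    safe_ceiling = physical_ram_gib * safe_multiple
    return hard_limit, safe_ceiling
```

With 16 GiB installed, commit_limits_gib(16) gives a hard limit of 64 GiB and a suggested ceiling of 56 GiB. If the planned workload would commit more than the second figure, more physical ram is the remedy.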
#7 |
It gets complicated. The client management software that's available for one computation type generally does not support another, and some are specific to CUDA or OpenCL in addition. Not all GIMPS gpu applications are supported by separate client management software. None of the gpu apps have integrated Primenet API communication. App instances, folders, files, etc. proliferate quickly if running multiple computation types on multiple gpus, and more so if running multiple TF instances to extract higher performance. See the example system attached.
Bring them up one gpu and one application instance at a time.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-17 at 15:05 |
#8 |
There are multiple reasons to configure for power usage.
One is to ensure the system continues to work on assignments with no user interaction. Another is to optimize power efficiency. Another is to reduce total power to levels within the system's current cooling capacity or power supply capacity, which may diminish over time, or to improve running temperatures.

BIOS: Turn off what you are sure you don't need. This varies considerably by BIOS flavor. Then test for the maybes. Some candidates are onboard video, USB, PCIe, audio, serial or parallel ports if present, and extra NICs. Reducing maximum cpu clock rate can reduce power usage and improve power efficiency.

OS: To help the system stay up and running prime95 or mprime or whatever on the cpu, and any relevant GPU applications, at full tilt while unattended, modify the default OS power saving settings. Test by leaving the system on continuously after a restart. For Windows 10: click on the lower left Windows icon, the gear that will appear a bit above it, System in the pane that will appear, then "Power & sleep". Find "when plugged in, pc goes to sleep" and select "never". Scroll down and find "additional power settings", click on it, then in the pane that opens, click change plan settings, then click "change advanced power settings". Adjust the many settings in the resulting window to the speed/power tradeoffs you want after considering utility cost. Linux is reported to have lower overhead, so switching OS is a possibility. Disabling unneeded services or applications helps.

GPUs: Consider using the power limiting or frequency limiting capabilities of nvidia-smi for NVIDIA GPUs, or the corresponding AMD GPU utilities, to reduce GPU power from maximum to a lower level that is more power efficient, especially during the air conditioning season. At equal average power, frequency limiting is likely to be more efficient than capping power.

Misc. hardware: Remove any unnecessary hardware that's removable. Unused GPUs, excess RAM, DVD or CD drives, extra HDs, etc. all draw power at idle. (Although reducing the number of occupied RAM channels will reduce computing performance somewhat.)

Power supply: Use high efficiency power supplies. Select for an output rating ~1.7 times the expected usage, so that the system will run near the power supply's peak efficiency and have some room for growth in installed components. If using UPSes, select for high efficiency.

Placement: Locate the system in a cool area. Fans won't run as hard, saving some wattage. Lower temperatures improve both efficiency and component lifetime. Low locations (floor) tend to be cooler than elevated locations.

Cleaning: Dust, lint, etc. build up on grilles, fan blades and heat sinks, reducing cooling effectiveness over time. Use clean dry air to blow it out, or an old toothbrush or other brush to loosen and remove it. Use necessary antistatic precautions.

Application features: Power usage can be reduced by certain features of GIMPS applications. Mprime/prime95 can use fewer cpu cores than the maximum, or use "throttle=", or both. Reducing the frequency of saves to disk may help. Application tuning may increase total wattage while raising throughput by a greater amount, improving power efficiency. In CUDALucas and CUDAPm1, low values for "PoliteValue=" can be used as a form of GPU throttling.

(This post has been supplemented with content from https://mersenneforum.org/showthread.php?t=26773 and other sources.)
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2022-07-14 at 03:52 |
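Two of the items above reduce to a line of arithmetic or a single command, sketched here in Python. The function names are hypothetical. The nvidia-smi options used are -i (select GPU), -pl (power limit in watts), and -lgc (lock GPU clocks to a min,max range in MHz); whether a given GPU and driver accept them, and what values are allowed, is an assumption to check on the actual system, and they normally need administrator or root rights. The functions only build the command lists; nothing is executed.

```python
def psu_rating_watts(expected_load_watts, headroom_factor=1.7):
    """Suggested power supply output rating for a given expected load.

    headroom_factor=1.7 is the ~1.7x rule of thumb from the post.
    """
    return expected_load_watts * headroom_factor


def nvidia_power_limit_command(gpu_index, watts):
    """Build (but do not run) an nvidia-smi command capping GPU power."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def nvidia_lock_clocks_command(gpu_index, min_mhz, max_mhz):
    """Build (but do not run) an nvidia-smi command limiting GPU clock range."""
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]
```

A system expected to draw 500 W would call for a supply rated near 850 W by this rule. The command lists can be passed to subprocess.run once the values have been checked against what the card reports as its allowed range.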
#9 |
Windows Subsystem for Linux (WSL) seems to me to be named mostly backwards. It provides Linux capabilities for Windows systems: Linux support on Windows. (It is the opposite of something like WINE, which enables Windows capabilities on Linux.)
WSL is a way to make a computer dual-personality-simultaneously. Or more. More than one installation, more than one distribution, more than one distribution version, more than one WSL version, all possible. Ubuntu 18.04 LTS or 20.04 LTS, Kali, Debian, Alpine WSL, OpenSUSE Leap 15.3 or 15.2, all free. Fedora Remix ~$10.

Comparison of WSL1 and WSL2: https://docs.microsoft.com/en-us/win...mpare-versions
Installation of WSL: https://docs.microsoft.com/en-us/win.../install-win10
In a nutshell, if you want WSL1, which will install on systems lacking certain hardware support of virtualization required by WSL2/Hyper-V: Manual installation step 1, restart, step 6, done. Most of my WSL-installed test systems are hardware limited to WSL1.

On WSL1, with an explorer window launched from the Windows side, I found the well-hidden WSL user home directory's mlucas folder with explorer's search function. Windows could apparently delete files there. But they would reappear later! Enough later that I had already deleted 7 unneeded files before an issue became apparent. All attempts to date to clean that situation up have failed. The files indicate 0 hard links now; rm, rm -f, rm -f after chmod to all permissions for everyone (777) on the files and containing directory, and even sudo rm -f all respond permission denied. Perhaps a cold system restart followed by scandisk on the Windows side, or the Linux equivalent (fsck?) in Ubuntu/WSL, will handle it. After fresh backups...

Accessing Linux/WSL files from Windows 10: https://r00t4bl3.com/post/how-to-acc...rom-windows-10
This works in WSL1 or 2. From the Ubuntu-launched explorer session, drag and drop to a Windows-launched explorer works. The opposite way, from Windows to Linux-launched, had permissions problems when I tried it. The two explorer windows look identical. Adopt some convention, such as Linux on the left, to limit mistakes.

WSL2 FAQs: https://docs.microsoft.com/en-us/windows/wsl/wsl2-faq

Because WSL performs some virtualization, it makes the core number specification in Mlucas relatively ineffective. I observed many other cores in a variety of test hardware joining in the party, and what appeared to be substantial core-hopping overhead on large-core-count systems, consistent with considerable discrepancy between timings on Ernst's CentOS Xeon Phi 7250 and my Ubuntu/WSL1/Win10 Pro x64 Xeon Phi 7250, for the same fft length, Mlucas version, and cores specification. A dual-12-core x2HT system was somewhat better behaved but not immune to socket-straddling. This apparent core-hopping involved all logical processors, usually in one socket, for Mlucas core counts <= logical-processor-count/socket, and made attempts to test use of 2 or 4 separate cores versus 2 or 4 with x2HT use futile. One way of mitigating the effect may be to fully load all logical processors with assigned work. Causes, mitigation and eventual software solutions are yet to be investigated. The issue seemed particularly severe on Xeon Phi with 64 or 68 cores, 256 or 272 logical processors, and less so on lower-core-count and lower-hyperthreading-multiplier machines.

There's something about WSL1 that prevents using more than 64 logical processors in an Mlucas run. I think it's related to Windows handling many-core systems as if they were NUMA, with processor groups of no more than 64 logical processors, even when they're single-socket. Or it may be a design decision made for WSL1. For example, a KNL Xeon Phi 7210 presents in Task Manager as 4 NUMA groups of 64 logical processors. A KNL Xeon Phi 7250 (68 cores by 4-way HT = 272 logical processors) presents as 5 NUMA groups. WSL1 hosted Ubuntu indicates 16 cores by x4 HT = 64 logical processors. Even after launching and loading up 4 WSL sessions with 64-core mlucas workloads each, the 5th also indicates 64 processors. It appears that core count is subsetted per running Ubuntu window, and that the total across windows exceeds the hardware capacity: 5 x 64 = 320 vs. 272 logical processors supported in hardware. I'm unable to install or test WSL2 behavior on Xeon Phi, because KNL lacks the hardware virtualization support required for WSL2.

WSL2 by default will allow its VM half the installed ram of the system. If the intention is to use lots of ram for quicker execution of stage 2 in mprime or Mlucas, it will be necessary to override that default behavior with a .wslconfig file. See https://learn.microsoft.com/en-us/wi...wsl/wsl-config.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2023-05-16 at 15:41 Reason: add note on WSL2 default ram half |
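As an illustration of the .wslconfig override mentioned above, this Python sketch generates the file's text. It assumes the format described in Microsoft's WSL configuration documentation: a [wsl2] section with memory and processors keys, in a file named .wslconfig in the Windows user profile folder. The function name wslconfig_text and the values in the usage note are hypothetical.

```python
def wslconfig_text(memory_gb, processors=None):
    """Return the text of a minimal .wslconfig raising the WSL2 VM memory cap.

    memory_gb:  ram to allow the WSL2 VM, in GB.
    processors: optional number of logical processors to give the VM.
    """
    lines = ["[wsl2]", f"memory={memory_gb}GB"]
    if processors is not None:
        lines.append(f"processors={processors}")
    return "\n".join(lines) + "\n"
```

On a 64 GB system, wslconfig_text(48) would produce a file allowing the VM 48 GB instead of the default half. The setting is read when the VM starts, so WSL has to be stopped (wsl --shutdown from the Windows side) and restarted before a new limit applies.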
#10 |
(placeholder)
A link the fans will like: https://hostingtribunal.com/blog/linux-statistics/

System monitoring utilities include top, vmstat, and nmon. For top, the -d option (delay, expressed in seconds), the c option to show the full command line of a process, and the 1 option to show individual core activity are useful. A description of top options I've found useful is https://www.howtogeek.com/668986/how...nd-its-output/

I was surprised (shocked, really) to find remote desktop access unusable in current releases of Ubuntu. See https://www.mersenneforum.org/showpo...8&postcount=11

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-09-01 at 15:40 |
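For unattended systems it can help to capture top output in a file instead of watching it interactively. This Python sketch builds such a command using top's batch mode. The function name is hypothetical; the options -b (batch mode), -c (full command line), -d (delay in seconds), and -n (number of iterations) are those of the procps top found on common Linux distributions, which is an assumption about the installed version. The function only builds the command list and runs nothing.

```python
def top_snapshot_command(delay_seconds=5, iterations=12):
    """Build (but do not run) a top command suitable for logging to a file.

    -b  batch mode: plain text output instead of the interactive screen
    -c  show the full command line of each process
    -d  delay between updates, in seconds
    -n  number of updates before top exits
    """
    return ["top", "-b", "-c", "-d", str(delay_seconds), "-n", str(iterations)]
```

The resulting list can be passed to subprocess.run with stdout redirected to a log file, giving one minute of samples at the default settings.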
#11 |
(very early draft)
Code:
Task                             Windows  Linux
change current directory         cd       cd
change current directory up      cd..     cd ..
change file attributes/perms     attrib   chmod
check system time and date       time     date
copy a file                      copy     cp
delete a file                    del      rm
display contents of a text file  type     cat
executable file type             .exe     (nothing)
identify OS version              ver      cat /etc/os-release | grep PRETTY_NAME
make a directory                 mkdir    mkdir
provide help                     help     man
rename a file                    rename   mv
show contents of a directory     dir      ls

Last fiddled with by kriesel on 2021-09-01 at 16:22 |
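The same dictionary can be kept in a form a script can query. This is a small Python sketch holding the command rows of the table above (the "executable file type" row is omitted because it is not a command). The names WINDOWS_TO_LINUX and linux_equivalent are hypothetical, and like the table itself the mapping is an early draft, not a complete or exact equivalence.

```python
# Windows command -> rough Linux equivalent, copied from the table above.
WINDOWS_TO_LINUX = {
    "cd": "cd",
    "cd..": "cd ..",
    "attrib": "chmod",
    "time": "date",
    "copy": "cp",
    "del": "rm",
    "type": "cat",
    "ver": "cat /etc/os-release | grep PRETTY_NAME",
    "mkdir": "mkdir",
    "help": "man",
    "rename": "mv",
    "dir": "ls",
}


def linux_equivalent(windows_command):
    """Return the Linux counterpart of a Windows command, or None if unlisted.

    Windows commands are case-insensitive, so the lookup lowercases its input.
    """
    return WINDOWS_TO_LINUX.get(windows_command.strip().lower())
```

Returning None for unlisted commands keeps the lookup honest about the table's gaps; options and arguments differ between the two systems and are not translated.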