mersenneforum.org  

2021-03-10, 19:43   #78
preda
 
"Mihai Preda"
Apr 2015

2·691 Posts

Quote:
Originally Posted by kriesel View Post
Thanks for the response. I'll look at d1 voltage more carefully.
All the gpus in post 75 are running stock voltage curves IIRC.
D1, the most problematic, has been dialed back to below nominal memory clock and still has issues. (Was most recently at 937 MHz, currently at 919 MHz.) The rest are at 1120 MHz.
Junction temperatures are indicated as 88-93 C among the gpus on that system.
The system event log entries given ~midpost in 75 match well timewise with the end of the multigpu delay, and indicate a user-mode driver crash. This system has already had the usual Windows TDR related registry modifications applied.

A second system, which had the same 5700XT in it for a while, had several gpu -> host error occurrences. That was on an extender.

A third system, also Win 10 Pro x64, has multiple occurrences of zero read error with a Radeon VII. I think it also occurred with an RX550 in the same slot.
Lenovo D30, Dual-Xeon-e5-2697v2, ECC system ram, one gpu directly in a PCIe slot, no extender on the system. Gpuowl v7.2-21 seems to be doing a good job of keeping P-1 errors in check there.

Code:
GpuOwl VERSION v7.2-21-g28dbf88
...
As a practical approach, I would suggest reducing the P1 bounds on the GPUs with these errors (e.g. at the wavefront, B1=2M and B2=40M). Would be interesting to know if these GPUs, booted under ROCm, would display these errors in the same way.

Last fiddled with by preda on 2021-03-10 at 19:44
2021-03-12, 08:56   #79
preda
 
"Mihai Preda"
Apr 2015

2×691 Posts

Quote:
Originally Posted by preda View Post
As a practical approach, I would suggest reducing the P1 bounds on the GPUs with these errors (e.g. at the wavefront, B1=2M and B2=40M). Would be interesting to know if these GPUs, booted under ROCm, would display these errors in the same way.
I recently got a Radeon VII with Samsung memory (as an RMA replacement). Even without any RAM overclock, and without any undervolt, that memory consistently generates errors. This is in contrast with the other Radeon VIIs that I have with Hynix memory, which work 100% reliably even with some overclock+undervolt.

Ken, could you please investigate whether there's a correlation between the errors you see and the GPU RAM manufacturer?
2021-03-12, 12:57   #80
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·5·659 Posts
Samsung bad gpu ram

Thank you for the insight about Samsung vs. Hynix. In my opinion, Samsung memory correlates highly with the various sorts of problems I've recently reported. I ascribe much of the lower frequency of gpu -> host read errors etc. on the other (Hynix-containing) gpus to past attempts to find the max overclock for them. (Note to self: in the future, log gpu or gpu-ram clock changes to disk with date, time and details, for later comparison to gpu error rates in gpuowl logs.) I had dialed the problem gpu down even further on ram clock a couple of days ago, and it seems 919 MHz is ok but 937 is not; or maybe it was because I had also reduced gpu clock. For the first time in a long time it's gone 1.5 days without a new error. It's the only Samsung-ram gpu on that system. Another system has a Radeon VII with Samsung ram at 1010 MHz gpu ram clock that is generating an EE during PRP about every 2 hours; I just dialed that one back to 1000 and will continue lowering toward stability. Unfortunately the 919 MHz costs about 10% on performance. But all those whole-system stalls cost a lot too. Losing 10% on one gpu is better than losing 6% on all from regular stalls. This will help my total throughput. I may gently and cautiously tune the "problem child" gpu a little more.
The Hynix-based Radeon VIIs seem solid at ~1120 MHz. Since the whole-system stalls were causing errors on most gpus at times, the Hynix may be capable of going higher.
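
The "note to self" above, about logging clock changes to disk with date, time and details, could be sketched roughly as follows. This is only an illustration: the function name `log_clock_change`, the CSV column names, and the file name in the usage note are hypothetical, and the timestamp uses the local clock in the same `YYYY-MM-DD HH:MM:SS` layout as the gpuowl log lines so the two can be compared by eye.

```python
import csv
from datetime import datetime
from pathlib import Path

# Hypothetical column layout; nothing here is part of gpuowl itself.
HEADER = ["time", "device", "setting", "old_mhz", "new_mhz", "note"]

def log_clock_change(log_path, device, setting, old_mhz, new_mhz, note=""):
    """Append one timestamped tuning change to a CSV file and return the row."""
    row = [
        datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # same layout as gpuowl log lines
        device, setting, old_mhz, new_mhz, note,
    ]
    path = Path(log_path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(HEADER)  # write the header only once, on first use
        writer.writerow(row)
    return row
```

A call such as `log_clock_change("gpu_tuning_log.csv", "d1", "mem_clock", 937, 919, "Samsung ram, errors at 937")` would record the change described above; the resulting file can later be lined up against error timestamps in the gpuowl logs.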

The mapping between gpuowl device number order, Windows device manager display adapter list order, GPU-Z list order, physical PCIe slot order, & AMD Radeon Software (tuning utility) gpu list order is messed up. But it was simple enough to line up GPU-Z instances for each gpu, note one is Samsung, flip it to sensor display, stop the gpuowl most-frequent-problem d1, and note it was the Samsung-ram gpu that had been stopped; its gpu clock declines, utilization goes to zero.
And this is another demonstration of how good the Gerbicz error check and other checks included in gpuowl are, that we can use it to both detect hardware issues and home in on why, and determine what conditions allow reliable operation.

Last fiddled with by kriesel on 2021-03-12 at 13:35
2021-03-28, 15:31   #81
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×5×659 Posts
...when there's something weird, and it don't look good...

Windows 10 Pro x64, one of n Radeon VII gpus on an Asrock BTC Pro 2.0 motherboard based open frame rig, running gpuowl v7.2-53-ge27846f, P-1 at beginning of PRP run of wavefront exponent.
Stage 1 looked normal, stage 2 not so much.
Code:
2021-03-28 07:05:19 asr2/radeonvii4 103281593 OK   1400000   1.36% 9ae45f731c77a06e 1008 us/it + check 0.59s + save 2.56s; ETA 1d 04:32 | P1(1M) 97.1% ETA 00:01 615b83c78b18fc64
2021-03-28 07:05:29 asr2/radeonvii4 103281593      1410000   1.37% 6572d32bfae35042 1007 us/it
2021-03-28 07:05:40 asr2/radeonvii4 103281593      1420000   1.37% 257ee1ebee2d1603 1071 us/it
2021-03-28 07:05:50 asr2/radeonvii4 103281593      1430000   1.38% dbe2ed3098a5da79 1008 us/it
2021-03-28 07:06:00 asr2/radeonvii4 103281593      1440000   1.39% 9fb82146a4232b04 1012 us/it
2021-03-28 07:06:06 asr2/radeonvii4 103281593 P1(1M) releasing 682 buffers
2021-03-28 07:06:06 asr2/radeonvii4 103281593 Released memory lock 'memlock-4'
2021-03-28 07:06:06 asr2/radeonvii4 103281593 OK   1442400   1.40% 6c1c9580a66d8c19 1000 us/it + check 0.58s + save 3.21s; ETA 1d 04:17
2021-03-28 07:07:03 asr2/radeonvii4 103281593 P1 Jacobi OK @ 1442400 734117b415032270
2021-03-28 07:07:04 asr2/radeonvii4 103281593 OK   1445600   1.40% e84d4ae96299b994 17782 us/it + check 0.55s + save 0.48s; ETA 20d 23:01
2021-03-28 07:07:04 asr2/radeonvii4 103281593 P2(1M,30M) D=330, nBuf=338
2021-03-28 07:07:05 asr2/radeonvii4 103281593 P2(1M,30M) Generating P2 plan, please wait..
2021-03-28 07:07:14 asr2/radeonvii4 103281593 P2(1M,30M) D=330: 1779361 primes in [1000003, 29999999]: cost 1.21M (pair: 724946, single: 329469, (81% paired), blocks: 77915)
2021-03-28 07:07:14 asr2/radeonvii4 103281593 P2(1M,30M) 77915 blocks: 12991 - 90905; start from 12991
2021-03-28 07:07:14 asr2/radeonvii4 103281593 P2(1M,30M) Acquired memory lock 'memlock-4'
2021-03-28 07:07:15 asr2/radeonvii4 103281593 P2(1M,30M) Allocated 338 buffers
2021-03-28 07:07:16 asr2/radeonvii4 103281593 P2(1M,30M) Starting P1 GCD
2021-03-28 07:08:05 asr2/radeonvii4 103281593 P2(1M,30M) Setup 338 P2 buffers in 51.4s
2021-03-28 07:08:05 asr2/radeonvii4 103281593 P2(1M,30M) OK @12991: be692526e43e1516 (0.2s)
2021-03-28 07:08:05 asr2/radeonvii4 103281593 P2(1M,30M) MULs: done 0, left 1210245; 0.0%
2021-03-28 07:08:11 asr2/radeonvii4 103281593 P2(1M,30M) GCD : no factor
2021-03-28 07:08:40 asr2/radeonvii4 103281593 P2(1M,30M)   0.3%  3191 muls, 10741 us/mul, ETA 03:36
2021-03-28 07:09:49 asr2/radeonvii4 103281593 P2(1M,30M)   0.8%  6078 muls, 11401 us/mul, ETA 03:48
2021-03-28 07:10:49 asr2/radeonvii4 103281593 P2(1M,30M)   1.3%  6087 muls, 9830 us/mul, ETA 03:16
2021-03-28 07:11:55 asr2/radeonvii4 103281593 P2(1M,30M)   1.8%  6021 muls, 11057 us/mul, ETA 03:39
...
2021-03-28 10:02:19 asr2/radeonvii4 103281593 P2(1M,30M)  81.8%  6570 muls, 9700 us/mul, ETA 00:36
2021-03-28 10:03:25 asr2/radeonvii4 103281593 P2(1M,30M)  82.3%  6578 muls, 10139 us/mul, ETA 00:36
2021-03-28 10:04:35 asr2/radeonvii4 103281593 P2(1M,30M)  82.9%  6598 muls, 10603 us/mul, ETA 00:37
2021-03-28 10:05:15 asr2/radeonvii4 103281593 P2(1M,30M) OK @78541: 504f8443bf2149f5 (0.2s)
2021-03-28 10:05:15 asr2/radeonvii4 103281593 P2(1M,30M) Starting GCD
2021-03-28 10:05:58 asr2/radeonvii4 103281593 P2(1M,30M)  83.4%  6570 muls, 12652 us/mul, ETA 00:42
2021-03-28 10:06:08 asr2/radeonvii4 103281593 P2(1M,30M) GCD : no factor
2021-03-28 10:07:09 asr2/radeonvii4 103281593 P2(1M,30M)  84.0%  6590 muls, 10788 us/mul, ETA 00:35
2021-03-28 10:07:24 asr2/radeonvii4 103281593 P2(1M,30M)  84.1%  1312 muls, 11239 us/mul, ETA 00:36
2021-03-28 10:07:35 asr2/radeonvii4 103281593 P2(1M,30M) OK @79281: a48a8232ee1ee003 (0.2s)
2021-03-28 10:07:35 asr2/radeonvii4 103281593 P2(1M,30M) Starting GCD
2021-03-28 10:07:36 asr2/radeonvii4 103281593 P2(1M,30M) waiting for GCD..
2021-03-28 10:08:29 asr2/radeonvii4 103281593 P2(1M,30M) GCD : no factor
2021-03-28 10:08:30 asr2/radeonvii4 103281593 P2(1M,30M) Released memory lock 'memlock-4'
2021-03-28 10:08:30 asr2/radeonvii4 Exiting because "stop requested"
2021-03-28 10:08:30 asr2/radeonvii4 Bye
2021-03-28 10:08:39 GpuOwl VERSION v7.2-53-ge27846f
2021-03-28 10:08:39 config: -user kriesel -cpu asr2/radeonvii4 -d 4 -maxAlloc 15G -proof 9 -use NO_ASM -autoverify 10
2021-03-28 10:08:39 device 4, unique id ''
2021-03-28 10:08:39 asr2/radeonvii4 103281593 FFT: 5.50M 1K:11:256 (17.91 bpw)
2021-03-28 10:08:39 asr2/radeonvii4 103281593 OpenCL args "-DEXP=103281593u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0.065443487272705284 -DIWEIGHT_STEP_MINUS_1=-0.061423705766155495 -DIWEIGHTS={0,-0.061423705766155495,-0.11907453990226374,-0.17318424616522224,-0.2239703337515917,-0.27163695163704177,-0.31637570921062819,-0.3583664465026713,-0.39777795710238401,-0.43476866667122022,-0.46948726977941896,-0.004146655251375528,-0.065315658085456849,-0.12272743408744842,-0.17661276605278126,-0.22718826124236385,} -DNO_ASM=1  -cl-std=CL2.0 -cl-finite-math-only "
2021-03-28 10:08:43 asr2/radeonvii4 103281593 OpenCL compilation in 3.76 s
2021-03-28 10:08:43 asr2/radeonvii4 103281593 trig table : 65 points, cos 73.77 bits, sin 73.34 bits
2021-03-28 10:08:43 asr2/radeonvii4 103281593 trig table : 353 points, cos 72.91 bits, sin 73.05 bits
2021-03-28 10:08:44 asr2/radeonvii4 103281593 trig table : 360449 points, cos 72.51 bits, sin 72.42 bits
2021-03-28 10:08:45 asr2/radeonvii4 103281593 maxAlloc: 15.0 GB
2021-03-28 10:08:45 asr2/radeonvii4 103281593 P1(1M) 1442134 bits
2021-03-28 10:08:45 asr2/radeonvii4 103281593 OK   1445600 on-load: blockSize 400, e84d4ae96299b994
2021-03-28 10:08:45 asr2/radeonvii4 103281593 validating proof residues for power 9
2021-03-28 10:08:46 asr2/radeonvii4 103281593 Proof using power 9
2021-03-28 10:08:48 asr2/radeonvii4 103281593 OK   1446400   1.40% 77b0e72e3ca17ad4  872 us/it + check 0.56s + save 0.43s; ETA 1d 00:40
2021-03-28 10:08:48 asr2/radeonvii4 103281593 P2(1M,30M) D=330, nBuf=338
2021-03-28 10:08:48 asr2/radeonvii4 103281593 P2(1M,30M) Generating P2 plan, please wait..
2021-03-28 10:08:57 asr2/radeonvii4 103281593 P2(1M,30M) D=330: 1779361 primes in [1000003, 29999999]: cost 1.21M (pair: 724946, single: 329469, (81% paired), blocks: 77915)
2021-03-28 10:08:57 asr2/radeonvii4 103281593 P2(1M,30M) 77915 blocks: 12991 - 90905; start from 79281
2021-03-28 10:08:57 asr2/radeonvii4 103281593 P2(1M,30M) Acquired memory lock 'memlock-4'
2021-03-28 10:08:57 asr2/radeonvii4 103281593 P2(1M,30M) Allocated 338 buffers
2021-03-28 10:08:58 asr2/radeonvii4 103281593 P2(1M,30M) Starting P1 GCD
2021-03-28 10:09:03 asr2/radeonvii4 103281593 P2(1M,30M) Setup 338 P2 buffers in 5.8s
2021-03-28 10:09:03 asr2/radeonvii4 103281593 P2(1M,30M) OK @79281: a48a8232ee1ee003 (0.2s)
2021-03-28 10:09:03 asr2/radeonvii4 103281593 P2(1M,30M) MULs: done 1017453, left 192792; 84.1%
2021-03-28 10:09:09 asr2/radeonvii4 103281593 P2(1M,30M)  84.5%  5203 muls, 1245 us/mul, ETA 00:04
2021-03-28 10:09:18 asr2/radeonvii4 103281593 P2(1M,30M)  85.0%  6578 muls, 1232 us/mul, ETA 00:04
2021-03-28 10:09:26 asr2/radeonvii4 103281593 P2(1M,30M)  85.6%  6640 muls, 1268 us/mul, ETA 00:04
2021-03-28 10:09:41 asr2/radeonvii4 103281593 P2(1M,30M)  86.1%  6575 muls, 2299 us/mul, ETA 00:06
2021-03-28 10:09:49 asr2/radeonvii4 103281593 P2(1M,30M) GCD : no factor
Over 3 hours for a wavefront P2 on a Radeon VII seems uh, excessive. Note the large and fluctuating multiply timings. Stop and restart gpuowl on the one gpu seems to have set it right for now.
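
To put a number on the "large and fluctuating multiply timings", a small script can pull the us/mul figures out of the P2 progress lines and compare the run before the restart with the run after it. This is a minimal sketch: the regular expression is written against the line layout shown in the log above (not taken from gpuowl's source), and the `factor=3.0` slow-mode threshold is an arbitrary illustrative choice.

```python
import re
from statistics import median

# Matches P2 progress lines in the layout shown in the log above, e.g.
# "... P2(1M,30M)   0.3%  3191 muls, 10741 us/mul, ETA 03:36"
P2_LINE = re.compile(r"P2\(\S+\)\s+([\d.]+)%\s+(\d+) muls, (\d+) us/mul")

def p2_mul_times(lines):
    """Return the us/mul value from every P2 progress line; other lines are skipped."""
    return [int(m.group(3)) for line in lines if (m := P2_LINE.search(line))]

def is_slow(times, baseline_us, factor=3.0):
    """True if the median us/mul exceeds factor times a known-good baseline."""
    return bool(times) and median(times) > factor * baseline_us

# First four progress lines before the restart, copied from the log above.
BEFORE = [
    "2021-03-28 07:08:40 asr2/radeonvii4 103281593 P2(1M,30M)   0.3%  3191 muls, 10741 us/mul, ETA 03:36",
    "2021-03-28 07:09:49 asr2/radeonvii4 103281593 P2(1M,30M)   0.8%  6078 muls, 11401 us/mul, ETA 03:48",
    "2021-03-28 07:10:49 asr2/radeonvii4 103281593 P2(1M,30M)   1.3%  6087 muls, 9830 us/mul, ETA 03:16",
    "2021-03-28 07:11:55 asr2/radeonvii4 103281593 P2(1M,30M)   1.8%  6021 muls, 11057 us/mul, ETA 03:39",
]
# The four progress lines after the restart.
AFTER = [
    "2021-03-28 10:09:09 asr2/radeonvii4 103281593 P2(1M,30M)  84.5%  5203 muls, 1245 us/mul, ETA 00:04",
    "2021-03-28 10:09:18 asr2/radeonvii4 103281593 P2(1M,30M)  85.0%  6578 muls, 1232 us/mul, ETA 00:04",
    "2021-03-28 10:09:26 asr2/radeonvii4 103281593 P2(1M,30M)  85.6%  6640 muls, 1268 us/mul, ETA 00:04",
    "2021-03-28 10:09:41 asr2/radeonvii4 103281593 P2(1M,30M)  86.1%  6575 muls, 2299 us/mul, ETA 00:06",
]

before_us = median(p2_mul_times(BEFORE))
after_us = median(p2_mul_times(AFTER))
slowdown = before_us / after_us  # between 8 and 9 for the lines above
```

On these sample lines the median multiply time before the restart is between 8 and 9 times the median afterwards, and `is_slow(p2_mul_times(BEFORE), after_us)` flags the earlier run.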

...who you gonna call? Mihai!
2021-03-28, 18:58   #82
preda
 
"Mihai Preda"
Apr 2015

2546₈ Posts

Quote:
Originally Posted by kriesel View Post
Note the large and fluctuating multiply timings. Stop and restart gpuowl on the one gpu seems to have set it right for now.
The multiplication time was excessive before the restart. One possible cause would be the GPU RAM becoming over-allocated for some reason, which would slow everything down a lot.

If you catch it again in slow-mode, take a look at the amount of memory allocated on the GPU. On ROCm I can see this (total GPU RAM allocated), I don't know if there's a way on Windows.

Possible things affecting the RAM would be: running a monitor on the GPU, with some graphically-intensive apps (even a web browser).

If the RAM is confirmed as the reason, it might be fixed by lowering a bit the -maxAlloc, e.g. to 14G. Anyway, it's just a guess for now.
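
A rough back-of-envelope check of how close the P2 buffers come to the -maxAlloc limit can be done from the numbers in the log (FFT 5.50M 1K:11:256, nBuf=338). The sketch below rests on assumptions that are not confirmed in this thread: that each P2 buffer holds one 8-byte double per FFT word, that the word count is WIDTH × MIDDLE × SMALL_HEIGHT × 2 (which does agree with the "5.50M" label), and that "15G" means 15 GiB.

```python
# Values taken from the gpuowl log in post #81.
WIDTH, MIDDLE, SMALL_HEIGHT = 1024, 11, 256   # from the OpenCL args / "1K:11:256"
N_BUF = 338                                   # "Allocated 338 buffers"
MAX_ALLOC_GIB = 15                            # "-maxAlloc 15G", assumed to mean GiB

# Assumption: FFT size in words is width * middle * height * 2,
# which gives 5.5 * 2**20 and so matches the "5.50M" in the log.
fft_words = WIDTH * MIDDLE * SMALL_HEIGHT * 2

# Assumption: one 8-byte double per word in each P2 buffer.
bytes_per_buffer = fft_words * 8

total_gib = N_BUF * bytes_per_buffer / 2**30
headroom_gib = MAX_ALLOC_GIB - total_gib

print(f"{fft_words} words, {bytes_per_buffer / 2**20:.0f} MiB per buffer")
print(f"{N_BUF} buffers use about {total_gib:.2f} GiB, "
      f"{headroom_gib:.2f} GiB under the {MAX_ALLOC_GIB}G limit")
```

Under these assumptions the 338 buffers alone take roughly 14.5 GiB of the Radeon VII's 16 GB, which would leave little room for anything else on the card and is at least consistent with the over-allocation guess above.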
2021-03-28, 19:26   #83
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·5·659 Posts

Quote:
Originally Posted by preda View Post
Possible things affecting the RAM would be: running a monitor on the GPU, with some graphically-intensive apps (even a web browser).

If the RAM is confirmed as the reason, it might be fixed by lowering a bit the -maxAlloc, e.g. to 14G. Anyway, it's just a guess for now.
Thanks. No gpu on that system is connected to a monitor. Monitor attaches to the motherboard VGA connector. The monitor is rarely used. Usually it's Windows Remote Desktop. (MSTSC.exe) Maybe prime95 along with task manager, gpu-z, and notepad. Some explorer windows, one per running GIMPS instance. The rest is command prompt boxes, one per gpuowl instance, one per gpu. No web browser. Plain black screen background, no wallpaper image. Dozens of shortcut icons on screen. I usually leave the gpuowl console sessions maximized, everything else minimized.

-maxAlloc was 15G. I found it necessary to go that high earlier because 14G was not enough for 999.3M P-1 in stage 2 and 24 buffers in v7.2-21.

Last fiddled with by kriesel on 2021-03-28 at 19:29

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
