mersenneforum.org  

mersenneforum.org > New To GIMPS? Start Here! > Information & Answers
2022-09-27, 01:37   #12
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
Dual Xeon E5-2670, 8 cores x 2 HT each (16 cores, 32 hyperthreads total, AVX), Windows 7 Pro, 128 GiB ECC RAM, Prime95 v30.8b15: no issue seen in ten minutes with the default torture test, except that it made Remote Desktop very laggy.
Falk's 7900X is AVX512-capable, so we may be running different code branches. testcb00's E5-2648L V2 is AVX-capable.
If it is a hardware issue, note that even hardware-level diagnostics can be fooled. I had a system that regularly threw errors during the BIOS checks pointing to a particular DIMM slot. Swapping DIMMs did not move or affect the issue at all. When the ancient Tesla C2075 GPU finally failed and was removed, the "memory" issue was never seen again.
Attached Thumbnails: torture on emu.png

Last fiddled with by kriesel on 2022-09-27 at 04:10
2022-09-27, 02:15   #13
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

Quote:
Originally Posted by testcb00
Thank you for your explanation, kriesel.

I tried the a) option you specified, but I see only a little RAM being used. My understanding is that GIMPS is not going to use all the hardware. I would like to know if there are methods to increase the RAM usage. My target is to test loads > 81919MB of memory.
You're welcome. See https://mersenneforum.org/showthread.php?t=28038,
https://www.mersenneforum.org/showpo...8&postcount=31, https://www.mersenneforum.org/showpo...0&postcount=15, and try to find a known factor using lots of RAM in P-1 stage 2.
2022-09-27, 03:34   #14
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

Quote:
Originally Posted by falk
I am a second user who came here for the exact same issue; forum, please take us seriously.

I created another thread in the Software forum but I do now see that @testcb00 has the exact same issue. So, let me continue the discussion here.
Issues are taken seriously. Right now we are in a data gathering phase -- looking for patterns that indicate there really is a prime95 bug.

Your problem description and testcb00's are far from identical. Have you tried reproducing testcb00's problem exactly? That is, memory to use (in MB) < 81919MB on the 224K and 240K FFT sizes works, but memory to use (in MB) > 81919MB on the 224K and 240K FFT sizes fails. Actually, the 81919MB number itself is not critical: if you found that the 224K and 240K FFTs failed consistently at a different memory boundary, that would be significant.

Alas, I do not have a machine with 128GB memory.

Can you post screen shots of two or three runs where torture test failed shortly after starting the torture test?

Quote:
stress.txt says if Prime95 fails then it MUST be your hardware. Well, then ...
There have been maybe two or three cases in 25 years where the problem was not the user's machine -- one of these was a bug in Intel's CPUs.
2022-09-27, 08:06   #15
falk
Sep 2022
Munich, Germany

Quote:
Originally Posted by kriesel
Falk's 7900X is AVX512 capable so we may be running different code branches.
I switched AVX512 on and off; it makes no difference. Switching the default option "Blend (all of the above)" off does: the fault goes away. @testcb00 reported the same.

I'll try different memory limits to see whether I see the same limit as @testcb00. Thanks, Falk
2022-09-27, 13:15   #16
falk
Sep 2022
Munich, Germany

Quote:
Originally Posted by Prime95
There have been maybe two or three cases in 25 years where the problem was not the user's machine -- one of these was a bug in Intel's CPUs.
Now, you have four or five cases ...


Quote:
Originally Posted by Prime95
Actually, the 81919MB number itself is not critical: if you found that the 224K and 240K FFTs failed consistently at a different memory boundary, that would be significant.
As promised, I repeated the test. To customize memory, I needed to select "Custom" rather than "Default". Fortunately, the default custom settings seem to replicate the behaviour of "Default", i.e., a mix of sizes is run.

I found the exact same 81919MB boundary as @testcb00 did: it runs fine with an 81919MB limit but fails with an 81920MB limit (it does not crash then; it only stops all but one of the workers due to errors). I did not try other memory sizes, as @testcb00 already did that job for all of us. Thank you!
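As an aside (my arithmetic, not anything from prime95's code): if the "MB" in the memory setting are binary megabytes (MiB), then the first failing value, 81920MB, is exactly 80 GiB, so the failure begins precisely where the allocation crosses a round binary boundary, one MiB above the last passing setting:

```python
MIB = 2**20  # assuming prime95's "MB" setting means mebibytes
GIB = 2**30

passing = 81919 * MIB  # largest setting reported to work
failing = 81920 * MIB  # smallest setting reported to fail

print(failing == 80 * GIB)   # True: the boundary is exactly 80 GiB
print(failing - passing)     # 1048576: the two settings differ by one MiB
```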

Quote:
Originally Posted by Prime95
Can you post screen shots of two or three runs where torture test failed shortly after starting the torture test?
I attach 4 screenshots; I hope they stay in order:
  1. A normal run after 1 hour, with 81919MB, no errors reported
  2. A faulty run with 81920MB, shortly after start, only one worker left running, no hard crash
  3. A faulty run at the default 127756MB, fast errors, eventually crashes (therefore hard to screenshot)
  4. An actual segmentation fault screenshot; address -1 is always the same

I changed the window layout in between to better capture the worker messages.

I hope this helps with debugging Prime95. It definitely has an issue; I wouldn't trust its compute results in its current form, and maybe part of the workload should be redone. Of course, the error must be analyzed and its impact on past results understood first. I am glad I only tried a torture test, which my machine passes, as I replicate testcb00's results perfectly :)

Thanks everybody, my issue is solved; a software bug for the community here remains :(
Attached Thumbnails: Screenshot 2022-09-27 145159.png, Screenshot 2022-09-27 135028.png, Screenshot 2022-09-27 145423.png, Screenshot 2022-09-27 145446.png
2022-09-27, 13:52   #17
falk
Sep 2022
Munich, Germany

Just found a results.txt of the last few runs ...

The last 3 runs were started as described for the last 2 screenshots and behave the same qualitatively (they crash hard within a minute).

However, inspecting results.txt, I see numerical differences in the FATAL messages. This may help in debugging; it could be a race condition of some sort. I attach the prime.txt config file too.
Attached Files: results.txt (14.3 KB), prime.txt (677 Bytes)
2022-09-27, 14:54   #18
falk
Sep 2022
Munich, Germany

Race condition hypothesis

Following up on my previous assumption ...

I repeated the tests with only the 240K size configured under "Custom", but now varying the number of cores and the hyperthreading toggle. I kept the memory at its maximum (128GB).

My findings:
  • up to 10 physical cores (my max) but with hyperthreading switched OFF: the torture test runs stable for the default 6 minutes (I didn't try longer)
  • up to 4 physical cores but with hyperthreading switched ON: the same
  • 5 physical cores with hyperthreading switched ON fails within a minute
  • anything in excess of 10 logical threads fails within a minute or two

I assume that there is a non-linear stable run time for 240K with >82GB, which increases as the number of execution threads decreases and as the processor execution speed decreases. I.e., slower machines with fewer cores and less memory most likely don't see the problem. That would be typical of a race-condition bug, which is what makes them so hard to debug ...
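For readers unfamiliar with the term: a race condition is a bug whose outcome depends on thread timing, which is exactly why more threads and faster execution make it fire sooner. A generic sketch in Python (nothing to do with prime95's actual code): an unsynchronized read-modify-write loses updates once the threads' windows overlap.

```python
import threading
import time

counter = 0  # shared state, deliberately unprotected by any lock

def unsafe_increment():
    global counter
    tmp = counter        # read shared state
    time.sleep(0.01)     # widen the race window so the threads overlap
    counter = tmp + 1    # write back a now-stale value

threads = [threading.Thread(target=unsafe_increment) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # fewer than 8: most increments were lost to the race
```

Wrapping the read-modify-write in a `threading.Lock` makes the count come out to 8 every time; that is the kind of missing synchronization a race-condition bug amounts to.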

BTW, the two memory addresses displayed in the screenshots shown above are both constant! That should help with a powerful enough debug tool.
PC: 0x7FFCE1A883D6 (code at that location refers to ->)
Address: 0xFFFFFFFFFFFFFFFF (illegal address)
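That all-ones address is itself informative: 0xFFFFFFFFFFFFFFFF is the bit pattern of a signed -1 reinterpreted as a 64-bit pointer, so one common (hypothetical here, not a diagnosis of prime95) way to fault at that address is dereferencing a -1 error return without checking it:

```python
# The bit pattern of signed -1, viewed as an unsigned 64-bit pointer value,
# matches the constant faulting address shown in the screenshot.
addr = -1 & 0xFFFFFFFFFFFFFFFF
print(hex(addr))  # 0xffffffffffffffff
```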

Last fiddled with by falk on 2022-09-27 at 15:18
2022-09-27, 17:46   #19
falk
Sep 2022
Munich, Germany

Regression bug hypothesis

Now I have also checked prior versions of Prime95. Results:
  • Prime95 v30.7 Build 9: same failure behaviour as the current v30.8.
  • Prime95 v29.8 Build 6 (2019): all tests pass within 95 min of default testing, or within 17 min of custom testing for size 240K (cf. results.txt).

This looks to me like a regression bug within Prime95, introduced some time within the past 3 years.
Attached Files: results.txt (7.9 KB)
2022-09-27, 17:50   #20
falk
Sep 2022
Munich, Germany

Quote:
Originally Posted by Prime95
Alas, I do not have a machine with 128GB memory.
Are you the famous author of Prime95?

If yes, then of course it may be a problem that you are not able to debug this issue on your own hardware. I hope the community can help.
2022-09-27, 18:45   #21
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

Quote:
Originally Posted by falk
Are you the famous author of Prime95?
He is.

George has helped all in ways few can imagine. He is very welcoming of bug reports but is also a very busy person.

I'm quite sure everyone is going to get along splendidly. The truth is what matters. DWIM (Do What I Mean) is an instruction programmers claim to covet, but may regret one day...
2022-09-27, 19:56   #22
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

Quote:
Originally Posted by falk
Now, you have four or five cases ...
That remains to be seen :)

Quote:
However, inspecting results.txt, I see numerical differences in the FATAL messages.
Differing numerical errors are not helpful for debugging. These errors simply mean "something went wrong in the calculations". A stable machine running the same code over and over again should produce the same results every time. Historically, the fact that you are getting a variety of errors very strongly suggests a memory problem specific to your machine. Passing memtest means very little; I've seen memtest pass and prime95 find problems countless times.

Other less frequent causes are bad CPUs, memory controllers, caches, motherboards, power supplies, etc.

Quote:
My findings:
  • up to 10 physical cores (my max) but with hyperthreading switched OFF: the torture test runs stable for the default 6 minutes (I didn't try longer)
  • up to 4 physical cores but with hyperthreading switched ON: the same
  • 5 physical cores with hyperthreading switched ON fails within a minute
  • anything in excess of 10 logical threads fails within a minute or two
This is interesting. There may be cores that cannot take the extra stress introduced by hyperthreading. Try setting "AffinityVerbosityTorture=1" in prime.txt. The code for assigning affinity settings changed after v29.8 to handle e-cores and p-cores. Also, I believe the torture test dialog box changed somewhere between 29.8 and 30.8: in 29.8 you need to specify 20 threads to get the equivalent behaviour of 30.8's 10 cores with hyperthreading on.
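For concreteness, a minimal sketch of the suggested change: prime.txt lives in the prime95 directory and holds key=value settings, and only the AffinityVerbosityTorture line below comes from George's post; it goes alongside whatever settings your file already contains.

```
AffinityVerbosityTorture=1
```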

Note that no matter how prime95 sets affinities in 29.8 and 30.8, you should not get any errors. The first step to confirm a CPU/memory problem vs. a prime95 bug is to significantly underclock the CPU and the RAM, and perhaps even overvolt, then retry a torture test that was repeatedly failing. If the torture test then works, you'll know it is a hardware issue.

Quote:
Originally Posted by falk View Post
Are you the famous author of Prime95?
I don't know about "famous", but otherwise the answer is yes.