#12 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·29·127 Posts |
Dual Xeon E5-2670, 8 cores x2 HT each (16 cores, 32 hyperthreads total, AVX), Windows 7 Pro, 128 GiB ECC RAM, Prime95 V30.8b15: no issue seen in ten minutes with the default torture test, except that it made remote desktop very laggy.
Falk's 7900X is AVX512 capable so we may be running different code branches. testcb00's E5-2648L V2 is AVX capable.
If it is a hardware issue, even hardware-level diagnostics can be fooled. I had a system that regularly threw errors during the BIOS checks pointing to a particular DIMM slot. Swapping DIMMs did not move or affect the issue at all. When the ancient Tesla C2075 GPU finally failed and was removed, the "memory" issue went away and has not been seen since.
Last fiddled with by kriesel on 2022-09-27 at 04:10 |
#13 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·29·127 Posts |
Quote:
https://www.mersenneforum.org/showpo...8&postcount=31, https://www.mersenneforum.org/showpo...0&postcount=15 and try to find a known factor using lots of RAM in P-1 stage 2. |
#14 | ||
P90 years forever!
Aug 2002
Yeehaw, FL
3²×907 Posts |
Quote:
Your problem description and testcb00's are far from identical. Have you tried reproducing testcb00's problem exactly? That is, memory to use (in MB) at or below 81919MB on the 224K and 240K FFT sizes works, but memory to use (in MB) above 81919MB on the 224K and 240K FFT sizes fails.
Actually, the 81919MB number is not critical. If you found that the 224K and 240K FFTs failed consistently at a different memory boundary, that would be significant. Alas, I do not have a machine with 128GB memory.
Can you post screenshots of two or three runs where the torture test failed shortly after starting?
Quote:
#15 | |
Sep 2022
Munich, Germany
20₈ Posts |
Quote:
I'll try different memory limits to see if I see the same limit as @testcb00. Thanks, Falk |
#16 | |||
Sep 2022
Munich, Germany
2⁴ Posts |
Quote:
Quote:
I found the exact same 81919MB boundary as did @testcb00, i.e., it runs fine with 81919MB but fails with 81920MB (it does not crash then; it only stops all but one worker due to errors). I did not try other memory sizes as @testcb00 already did that job for all of us. Thank You!
Quote:
I changed the window layout in between to better capture the worker messages. I hope this helps debugging Prime95. It definitely has an issue; I wouldn't trust its compute results in its current form, and maybe part of the workload should be redone. Of course, the error must be analyzed and its impact on past results understood first. I am glad I only tried a torture test, which my machine passes, as I replicate the results of testcb00 perfectly :) Thanks everybody, my issue is solved, but a software bug for the community here remains :( |
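For anyone who wants to locate such a pass/fail boundary on their own machine without trying every value, a bisection over the memory setting is one option. The following is only a sketch: `passes` is a hypothetical callback that you would have to implement yourself (for example by running a short torture test by hand and recording the outcome), and it assumes there is a single boundary below which runs pass and above which they fail. `simulated_run` is a stand-in that merely mimics the 81919MB/81920MB behaviour reported above; it does not call Prime95.

```python
from typing import Callable


def first_failing_mb(passes: Callable[[int], bool], lo_mb: int, hi_mb: int) -> int:
    """Return the smallest failing memory setting in MB.

    Assumes passes(lo_mb) is True, passes(hi_mb) is False, and that there
    is exactly one pass/fail boundary between them.
    """
    while hi_mb - lo_mb > 1:
        mid_mb = (lo_mb + hi_mb) // 2
        if passes(mid_mb):
            lo_mb = mid_mb  # boundary is above mid_mb
        else:
            hi_mb = mid_mb  # boundary is at or below mid_mb
    return hi_mb


def simulated_run(memory_mb: int) -> bool:
    # Hypothetical stand-in for a real torture-test run: it only reproduces
    # the behaviour reported in this thread (81919 works, 81920 fails).
    return memory_mb < 81920


boundary = first_failing_mb(simulated_run, 1024, 131072)
print(boundary)               # first failing setting in MB
print(boundary == 80 * 1024)  # True if the boundary is exactly 80 * 1024 MB
```

With a real test behind `passes`, this needs about 17 runs to cover the range from 1024MB to 131072MB. The reported boundary of 81920MB equals 80 × 1024MB; whether that round number has anything to do with the failure is not established here.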
#17 |
Sep 2022
Munich, Germany
2⁴ Posts |
Just found a results.txt of the last few runs ...
The last 3 runs are started as described for the last 2 screenshots and behave the same, qualitatively (crash hard within a minute). However, inspecting results.txt, I see numerical differences in the FATAL messages. This may help in debugging, could be a race condition of some sort. I append the prime.txt config file too. |
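To compare the FATAL messages across runs, something like the sketch below could help. It assumes only that the relevant lines in results.txt contain the word "FATAL"; the sample text is invented for illustration and is not real Prime95 output. Numbers are masked so that messages which differ only numerically fall into the same group.

```python
import re
from collections import Counter


def fatal_lines(text: str) -> list[str]:
    """Return all lines that contain the word FATAL, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if "FATAL" in line]


def mask_numbers(line: str) -> str:
    """Replace decimal numbers with <n> so only the message template remains."""
    return re.sub(r"\d+(?:\.\d+)?(?:[eE][-+]?\d+)?", "<n>", line)


# Invented placeholder text, NOT actual results.txt content.
sample = (
    "run one started\n"
    "FATAL ERROR: placeholder value 1.25 expected 1.5\n"
    "FATAL ERROR: placeholder value 7 expected 1.5\n"
    "self-test passed\n"
)

found = fatal_lines(sample)
templates = Counter(mask_numbers(line) for line in found)

for template, count in templates.items():
    print(count, template)
for line in found:
    print(line)  # the unmasked lines show the numerical differences
```

To use it on a real file, read the file with `pathlib.Path("results.txt").read_text()` and pass the result to `fatal_lines`.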
#18 |
Sep 2022
Munich, Germany
10000₂ Posts |
Following up on my previous assumption ...
I repeated the tests with only the 240k size configured under "Custom" - but now varying the number of cores and hyperthread toggle. I kept the memory at max (128GB). My findings:
I assume that there is a non-linear stable run-time for 240k with >82GB, which increases as the number of execution threads and the processor execution speed are lowered. I.e., slower machines with fewer cores and less memory most likely don't see the problem. That would be typical for a race condition bug, which is what makes them so hard to debug ...
BTW, the two memory addresses displayed in the screenshot shown above are both constant! That should help with a powerful enough debug tool.
PC: 0x7FFCE1A883D6 (code at that location refers to ->)
Address: 0xFFFFFFFFFFFFFFFF (illegal address)
Last fiddled with by falk on 2022-09-27 at 15:18 |
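As a small aid for reading the two addresses quoted above, the sketch below reinterprets them as 64-bit values. It only does arithmetic on the numbers from the post; what they mean for Prime95 is not something this code can tell, and any reading of the all-ones address as a sentinel or uninitialized pointer would be a guess.

```python
import struct


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


PC = 0x7FFCE1A883D6             # program counter quoted in the post
FAULT = 0xFFFFFFFFFFFFFFFF      # faulting address quoted in the post

fault_signed = as_signed64(FAULT)
pc_bits = PC.bit_length()

print(fault_signed)  # all 64 bits set, read as a signed integer
print(pc_bits)       # number of significant bits in the PC value
```

The all-ones address is -1 when read as a signed 64-bit integer, and the PC value fits in 47 bits.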
#19 |
Sep 2022
Munich, Germany
20₈ Posts |
Now, I also checked prior versions of Prime95. Results:
This looks like a regression bug within Prime95 to me, introduced some time within the past 3 years. |
#20 |
Sep 2022
Munich, Germany
2⁴ Posts |
#21 |
If I May
"Chris Halsall"
Sep 2002
Barbados
10101101001101₂ Posts |
He is.
George has helped all in ways few can imagine. He is very welcoming of bug reports but is also a very busy person. I'm quite sure everyone is going to get along splendidly. The truth is what matters. DWIM (Do What I Mean) is an instruction programmers claim to covet, but may regret one day... |
#22 | ||
P90 years forever!
Aug 2002
Yeehaw, FL
1111111100011₂ Posts |
That remains to be seen :)
Quote:
Other less frequent causes are bad CPUs, memory controllers, caches, motherboards, power supplies, etc.
Quote:
Note that no matter how prime95 set affinities in 29.8 and 30.8 you should not get any errors.
The first step to confirm a CPU/memory problem vs. prime95 bug is significantly underclocking the CPU and underclocking the RAM and perhaps even overvolting, then retrying a torture test that was repeatedly failing. If the torture test then works, you'll know it is a hardware issue.
I don't know about "famous", but otherwise the answer is yes. |
Thread | Thread Starter | Forum | Replies | Last Post |
odd segmentation fault | ChristianB | YAFU | 4 | 2015-09-09 19:38 |
Segmentation fault in msieve. | include | Msieve | 4 | 2012-11-14 00:59 |
Segmentation fault | PhilF | Linux | 5 | 2006-01-07 17:12 |
Linux FC3 - mprime v23.9 : Segmentation fault (core dumped) nohup ./mp -d | T.Rex | Software | 5 | 2005-06-22 04:22 |
Segmentation Fault | sirius56 | Software | 2 | 2004-10-02 21:43 |