mersenneforum.org  

2022-07-16, 14:06   #1
potz13
Jul 2022
2·5 Posts
stress test - strange behavior with errors and seg faults

Hello,

I am sorry that my first post is a new thread about errors in the stress test. Yes, I tried to read up on known problems and solutions, and nonetheless here I am opening a new thread.

The core of my problem:
I ran mprime 30.8 b15 to stress-test one of my computers (built two years ago) using default parameters and got a rounding error right in the first test (AVX-512 FFT length 240K). Seconds later mprime crashed with a segmentation fault. The problem is reproducible in the sense that the crash occurs every time in the first test. The error message varies.

The machine:
Xeon W-2135
Supermicro X11SRM-VF
4* 32GB Samsung M393A4K40BB2-CTD (certified by board vendor)
Mellanox MCX312A-XCBT
Samsung SSD 830 128GB

OS: Fedora35 with current updates

What helps:
different RAM configurations with max 48GB in total.
- single DIMM 32GB works
- 4* 8GB works (ran for 15h)
- 1* 32GB + 2*8GB works
----
- 1* 32GB + 3*8GB does not work (the 32GB DIMM used here was a spare part I have on hand)
- 2* 32GB does not work (tried multiple combinations of slots and DIMMs)
----
the 4 original DIMMs seem to run perfectly well with an EPYC 7551 (running for 4h now)
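
To make the pattern in the list above easier to see, here is a minimal sketch that totals each configuration and finds where passing turns into failing. The pass/fail flags are copied from the list; the original 4* 32GB setup is assumed to count as failing because that is where the problem first appeared. Total capacity is only one possible reading: DIMM count, ranks and slot population change along with it.

```python
# DIMM sizes in GB per configuration, with the outcome reported in this post.
# The last entry (the original 4x32GB setup) is assumed to be a failing case.
CONFIGS = [
    ("1x32", [32], True),
    ("4x8", [8, 8, 8, 8], True),
    ("1x32 + 2x8", [32, 8, 8], True),
    ("1x32 + 3x8", [32, 8, 8, 8], False),
    ("2x32", [32, 32], False),
    ("4x32 (original)", [32, 32, 32, 32], False),
]


def summarize(configs):
    """Return (name, total_gb, dimm_count, works) rows sorted by total capacity."""
    rows = [(name, sum(dimms), len(dimms), works) for name, dimms, works in configs]
    return sorted(rows, key=lambda row: row[1])


def capacity_boundary(configs):
    """Return (largest passing total, smallest failing total) in GB."""
    rows = summarize(configs)
    largest_ok = max(total for _, total, _, works in rows if works)
    smallest_bad = min(total for _, total, _, works in rows if not works)
    return largest_ok, smallest_bad


for name, total, count, works in summarize(CONFIGS):
    print(f"{name:16} {total:4d} GB  {count} DIMMs  {'works' if works else 'fails'}")
print("boundary (GB):", capacity_boundary(CONFIGS))
```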

What did not help:
- running Ubuntu 21.10 from USB thumb drive
- downgrading BIOS from current 2.5 to 2.4
- boot flag: mitigations=off
- reducing RAM clock to 2133

Unfortunately I don't have another CPU with AVX512 nor a different CPU for the board or another board for the CPU.

Is anyone able to make any sense out of this mess? I am really unsure if this is a problem with board/CPU or with mprime. I'd appreciate any input for additional tests I could do.

2022-07-16, 14:32   #2
PhilF
"6800 descendent"
Feb 2005
Colorado
5²·29 Posts

Your problem may not have anything to do with the CPU or the motherboard. One thing we know for sure -- it isn't mprime.

How long does the test run before the first error? The reason I ask is that your issue may be related to heat. That particular type of small-FFT, AVX-512 test will heat the CPU more than any other application I know of, and it will do it quite rapidly. If you have a compromised cooling system, for example if the heat sink and the CPU are not making PERFECT contact, this could be your symptom.

My guess is that the problem is related either to that, or the power supply isn't able to handle the heavy load mprime puts on it. Even connectors between the power supply and the motherboard can be a problem under the heavy load of mprime. Try another power supply if you have one.

Phil

2022-07-16, 14:53   #3
potz13
Jul 2022
10₁₀ Posts

Thank you very much for your reply PhilF!

The time until the error appears varies, for example with the memory configuration. I just tested with 1*32GB plus 3*8GB, and with this combination mprime crashes very fast, after ~5s. That is too quick for cooling to be the cause.
For completeness: a Noctua NH-U12DX i4 is mounted and I never saw more than 75°C. (Stress tests other than the blend test never showed errors after running for a while, though I never ran them for many hours.)


Will change the PSU next! (Thought about that myself yesterday. Forgot it...)
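
Since the time to the first crash is the main data point here, it can help to measure it the same way for every configuration. Below is a minimal sketch, assuming a POSIX system, that runs an arbitrary command, times it until it exits, and reports whether it was killed by a signal such as SIGSEGV. The command comes from the command line, so nothing about mprime's own options is assumed by the code.

```python
import signal
import subprocess
import sys
import time


def time_until_exit(cmd, timeout_s=600.0):
    """Run cmd; return (elapsed_seconds, returncode). returncode is None on timeout."""
    start = time.monotonic()
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        returncode = None
    return time.monotonic() - start, returncode


def describe(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode is None:
        return "still running at timeout"
    if returncode < 0:
        # On POSIX a negative return code means the process died from that signal.
        return "killed by " + signal.Signals(-returncode).name
    return f"exited with status {returncode}"


if __name__ == "__main__":
    command = sys.argv[1:]
    if command:
        elapsed, rc = time_until_exit(command)
        print(f"{elapsed:.1f}s: {describe(rc)}")
```

Usage would be something like `python3 time_to_crash.py ./mprime -t`, assuming the script is saved under that hypothetical name and that `-t` starts the torture test in your mprime build.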

2022-07-16, 15:20   #4
potz13
Jul 2022
2·5 Posts

Changed to a different PSU (Seasonic Prime PX750) incl. all the cables, without changing the crashing behavior. :(

2022-07-16, 15:24   #5
paulunderwood
Sep 2002
Database er0rr
1000010111000₂ Posts

It is not unknown that sometimes a CPU needs reseating. This will also ensure you have perfect contact between the CPU and the heatsink, separated by the thinnest layer of paste, when you reinstall the heatsink.

Last fiddled with by paulunderwood on 2022-07-16 at 15:24

2022-07-16, 15:27   #6
chris2be8
Sep 2009
3²·5·53 Posts

Try running memtest86 (both single and multi-threaded) with memory configs that cause errors and configs that don't. That should tell you if the memory is faulty.

Also, can you run temperature monitoring on the system to see how hot it gets during testing under Linux? Running watch sensors in another terminal window should do the job if you don't have a GUI app to monitor temperatures.

2022-07-16, 15:32   #7
PhilF
"6800 descendent"
Feb 2005
Colorado
5²·29 Posts

Well, if I understand your post correctly, you had one RAM configuration that ran for 4h. Someone correct me if I'm wrong, but if it always crashes during a 240K FFT test, memory access should be very minimal, because the entire FFT fits in the CPU's cache. If so, you may be seeing crashes that are not really related to RAM.
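
As a rough check on the cache argument, here is a back-of-the-envelope sketch. It assumes "240K" means 240 * 1024 elements stored as 8-byte doubles, and it uses cache sizes for the Xeon W-2135 quoted from memory (1 MiB L2 per core, 8.25 MiB shared L3), so verify those against the spec sheet. It ignores lookup tables and the fact that several workers run at once, so it is a lower bound on the working set, not a measurement.

```python
MIB = 1024 ** 2

FFT_LENGTH = 240 * 1024          # "240K" read as 240 * 1024 elements (assumption)
BYTES_PER_ELEMENT = 8            # one double-precision value per element (assumption)

# Cache sizes assumed for the Xeon W-2135; check the official spec sheet.
L2_PER_CORE_BYTES = 1 * MIB
L3_SHARED_BYTES = int(8.25 * MIB)


def fft_data_bytes(length, bytes_per_element=BYTES_PER_ELEMENT):
    """Size in bytes of the FFT data array alone, without tables or scratch space."""
    return length * bytes_per_element


size = fft_data_bytes(FFT_LENGTH)
print(f"FFT data: {size} bytes = {size / MIB:.3f} MiB")
print("fits in one core's L2:", size <= L2_PER_CORE_BYTES)
print("fits in shared L3:   ", size <= L3_SHARED_BYTES)
```

Under these assumptions a single 240K FFT is a little under 2 MiB: larger than one core's L2 but well within L3, which is consistent with the idea that one such test generates little DRAM traffic.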

Does your motherboard's BIOS allow you to disable AVX-512? If so, try it to see if it stabilizes things. Or, maybe you can tell mprime to not use AVX-512? I've not looked for it, but that option might be there.

Also make sure your motherboard is updated to the latest BIOS version available.

Phil

2022-07-16, 15:49   #8
potz13
Jul 2022
1010₂ Posts

Quote:
Originally Posted by paulunderwood
It is not unknown that sometimes a CPU needs reseating. This will also ensure you have perfect contact between the CPU and the heatsink, separated by the thinnest layer of paste, when you reinstall the heatsink.

Will try that later! Thanks for the suggestion!

2022-07-16, 15:54   #9
potz13
Jul 2022
12₈ Posts

Quote:
Originally Posted by chris2be8
Try running memtest86 (both single and multi-threaded) with memory configs that cause errors and configs that don't. That should tell you if the memory is faulty.
Will try to find a usable free version of that. That used to be easier...


Quote:
Originally Posted by chris2be8
Also, can you run temperature monitoring on the system to see how hot it gets during testing under Linux? Running watch sensors in another terminal window should do the job if you don't have a GUI app to monitor temperatures.
I monitor temperatures both using ipmitool and a mini script reading data from /sys/class/hwmon/hwmon*
I don't see temps that seem alarming to me.
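
For anyone who wants to do the same, this is a minimal sketch of what such a mini script could look like (not the one actually used here). It relies on the standard Linux hwmon sysfs layout, where each `temp*_input` file holds a temperature in millidegrees Celsius; the function name and output format are just illustrative choices.

```python
from pathlib import Path


def read_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_celsius} for every readable temp*_input file."""
    readings = {}
    for sensor in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # some sensors exist but cannot be read

        name_file = sensor.parent / "name"
        chip = name_file.read_text().strip() if name_file.exists() else sensor.parent.name

        # Prefer the human-readable label (e.g. "Package id 0") when the driver provides one.
        label_file = sensor.with_name(sensor.name.replace("_input", "_label"))
        if label_file.exists():
            label = label_file.read_text().strip()
        else:
            label = sensor.name[: -len("_input")]

        readings[f"{chip}/{label}"] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for key, celsius in read_temps().items():
        print(f"{key:40} {celsius:6.1f} °C")
```

Running it under `watch -n 1` in a second terminal while the stress test runs gives a simple live view.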

2022-07-16, 16:14   #10
potz13
Jul 2022
2×5 Posts

Quote:
Originally Posted by PhilF
Well, if I understand your post correctly, you had one RAM configuration that ran for 4h. Someone correct me if I'm wrong, but if it always crashes during a 240K FFT test, memory access should be very minimal, because the entire FFT fits in the CPU's cache. If so, you may be seeing crashes that are not really related to RAM.
Sorry! You mixed up my chaotic post a little. (4h is the time the DIMMs that originally provoked the problem had been running fine in a different system. It's now 6-7h, still running.) But yes, I ran the problematic system without a problem for 15h with 4* 8GB.
Memory usage in the test is >90% leaving about 1GB free. This seems to be the case generally. I have no idea how much bandwidth is used.
No matter how much the test really stresses the memory, going from 48GB RAM to 56GB RAM (mprime does use the additional memory) triggers a different pattern in code or memory access that in turn triggers the problem.

Quote:
Originally Posted by PhilF
Does your motherboard's BIOS allow you to disable AVX-512? If so, try it to see if it stabilizes things. Or, maybe you can tell mprime to not use AVX-512? I've not looked for it, but that option might be there.
Did not find that in BIOS. Will search for an option in mprime.


Quote:
Originally Posted by PhilF
Also make sure your motherboard is updated to the latest BIOS version available.
See my first post. I even tried downgrading.

2022-07-16, 16:30   #11
potz13
Jul 2022
10₁₀ Posts

Things get interesting:
I disabled AVX512 by adding CpuSupportsAVX512F=0 to my local.txt file. The problem remains, but the crash now takes 50s instead of 5s in the same 1*32GB+3*8GB configuration. Commenting out the line brings it back to 5s. It's consistent. :/


Slightly different tests are run, first with AVX-512 disabled and then with it enabled:
68000 Lucas-Lehmer iterations of M4818591 using FMA3 FFT length 240K, Pass1=320, Pass2=768, clm=1
68000 Lucas-Lehmer iterations of M4818591 using AVX-512 FFT length 240K, Pass1=640, Pass2=384, clm=1
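
To switch between the two cases repeatably, the line can be toggled with a small helper instead of editing by hand. This is only a sketch: it treats local.txt as plain `key=value` lines, and it puts a new key at the top of the file on the assumption that options belong before any section headers. The helper names are hypothetical, and only the `CpuSupportsAVX512F=0` setting comes from this post.

```python
def _is_key(line, key):
    """True if line is a `key=value` assignment for key (ignoring surrounding spaces)."""
    return "=" in line and line.split("=", 1)[0].strip() == key


def set_option(text, key, value):
    """Return text with key set to value: replace the existing line, or add one at the top."""
    lines = text.splitlines()
    if any(_is_key(line, key) for line in lines):
        out = [f"{key}={value}" if _is_key(line, key) else line for line in lines]
    else:
        out = [f"{key}={value}"] + lines
    return "".join(line + "\n" for line in out)


def remove_option(text, key):
    """Return text with every assignment of key removed."""
    kept = [line for line in text.splitlines() if not _is_key(line, key)]
    return "".join(line + "\n" for line in kept)


if __name__ == "__main__":
    sample = "SomeOtherOption=1\n"  # placeholder content, not a real mprime option
    disabled = set_option(sample, "CpuSupportsAVX512F", 0)
    print(disabled, end="")
    print(remove_option(disabled, "CpuSupportsAVX512F"), end="")
```

Make a backup of local.txt before writing the result back, and stop mprime first so it does not rewrite the file on exit.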

Last fiddled with by potz13 on 2022-07-16 at 16:37
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
