mersenneforum.org  

Old 2022-09-27, 22:32   #23
falk
 
Sep 2022
Munich, Germany

20₈ Posts

Quote:
Originally Posted by Prime95 View Post
In 29.8 you need to specify 20 threads to get the equivalent behaviour of 30.8's 10 cores with hyperthreading on.
This is what I did, as it is the 29.8 default for my machine anyway. I cannot reproduce the problem with 29.8, as I said above.

I'll try the underclocking test as you mentioned, just to help sort this out. Your theory about cores and extra stress does not explain, though, why two different machines see the exact same limit on memory. And why only with the 240K size. And why only with the 30.x versions. And why always at the same PC, with a reference to always the same illegal address.

I am a rather experienced programmer myself. If it looks like a race condition, then it probably is one … I have seen race conditions pass undetected for years …

BTW, Prime95 doesn't even stress my machine much. Programs like Y-cruncher or Cinebench cause higher CPU temperatures. I know, that means little when accessing a lot of memory. But then, some sort of instability should emerge with v29.8, shouldn't it?

I added the option AffinityVerbosityTorture=1 in prime.txt, but v30.8 with the 240K FFT still fails like this and eventually crashes:
Quote:
[Wed Sep 28 00:46:12 2022]
FATAL ERROR: Rounding was 0.4999617766, expected less than 0.4
Hardware failure detected running 240K FFT size, consult stress.txt file.
FATAL ERROR: Rounding was 0.4999901984, expected less than 0.4
Hardware failure detected running 240K FFT size, consult stress.txt file.
FATAL ERROR: Rounding was 0.4999935824, expected less than 0.4
Hardware failure detected running 240K FFT size, consult stress.txt file.
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected running 240K FFT size, consult stress.txt file.
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected running 240K FFT size, consult stress.txt file.
FATAL ERROR: Resulting sum was -2.116212414775013e+170, expected: 2.080420007391802e+179
Hardware failure detected running 240K FFT size, consult stress.txt file.
EDIT:
I have now run the underclocking tests. I almost halved the clock (to 2 GHz, as monitored during execution) and also raised the DRAM voltage a notch. v30.8 still fails in the exact same manner. I also tried to overclock to 4 GHz (though it actually ran at about 3.5 GHz), and v29.8 ran with no issues.

Hope this helps to pin this down.
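
For context on the "Rounding was ..., expected less than 0.4" lines in the log above: Prime95 multiplies very large integers with floating-point FFTs, so every output value should land very close to an integer, and the round-off error is its distance from the nearest integer. The sketch below is a toy illustration of that kind of check in Python, not Prime95's actual code. The function names and the base-10 digit representation are assumptions made for the example; only the 0.4 limit is taken from the log.

```python
import cmath

ROUNDOFF_LIMIT = 0.4  # threshold quoted in the log above


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + twiddle * odd[k]
        out[k + n // 2] = even[k] - twiddle * odd[k]
    return out


def fft_multiply(a, b):
    """Multiply non-negative ints a and b; return (product, max_roundoff)."""
    # Base-10 digits, least significant first (an illustrative choice).
    x = [int(ch) for ch in reversed(str(a))]
    y = [int(ch) for ch in reversed(str(b))]
    n = 1
    while n < len(x) + len(y):
        n *= 2
    fx = fft([complex(d) for d in x] + [0j] * (n - len(x)))
    fy = fft([complex(d) for d in y] + [0j] * (n - len(y)))
    raw = fft([p * q for p, q in zip(fx, fy)], inverse=True)
    coeffs = [v.real / n for v in raw]
    # Distance of each convolution output from the nearest integer.
    max_roundoff = max(abs(c - round(c)) for c in coeffs)
    if max_roundoff >= ROUNDOFF_LIMIT:
        raise ArithmeticError(
            "Rounding was %.10f, expected less than %.1f"
            % (max_roundoff, ROUNDOFF_LIMIT)
        )
    product = sum(round(c) * 10 ** i for i, c in enumerate(coeffs))
    return product, max_roundoff
```

With correct arithmetic the error stays far below 0.4. A value approaching 0.5 means the nearest integer can no longer be determined reliably, which is why the log treats it as fatal.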

Last fiddled with by falk on 2022-09-27 at 23:23
Old 2022-09-27, 22:45   #24
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

25171₈ Posts

Quote:
Originally Posted by falk View Post
I am a rather experienced programmer myself.
Join the very large club.

Quote:
Originally Posted by falk View Post
If it looks like a race condition, then it probably is one …
Correlation does not *necessarily* mean causality.

This is why experienced programmers observe carefully the in situ.
Old 2022-09-27, 23:25   #25
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

17625₈ Posts

Quote:
Originally Posted by falk View Post
This is what I did, as it is the 29.8 default for my machine anyway. I cannot reproduce the problem with 29.8, as I said above.
Good to know -- reduces the set of variables.

Quote:
I'll try the underclocking test as you mentioned, just to help sort this out. Your theory about cores and extra stress does not explain, though, why two different machines see the exact same limit on memory. And why only with the 240K size. And why only with the 30.x versions.

I am a rather experienced programmer myself. If it looks like a race condition, then it probably is one …
My first instinct would be to look at the code that changed between 29.8 and 30.8. I don't think there were changes in locking/syncing threads where race conditions might occur (30.9 does make some changes). I'll be looking at the code for buffer overflows or arithmetic errors when dealing with numbers above 64G and sizeof(240K FFT) -- specifically, incorrectly allocating too many or too few buffers, or incorrectly indexing into the 64GB buffer for the FFT data. Unfortunately, eyeballing code is a pretty poor method for debugging.
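
The kind of arithmetic error described above can be sketched as follows. This is a generic Python illustration of an offset computation wrapping in a fixed-width integer, not the actual Prime95 bug. `wrapped_offset`, the 32-bit width, and the assumption that one 240K FFT buffer occupies 240 × 1024 eight-byte values are all hypothetical choices made for the example. The threshold reported in this thread is 64 GB, which the toy example does not try to reproduce.

```python
# Assumed footprint of one 240K FFT buffer in bytes (hypothetical).
FFT_BYTES = 240 * 1024 * 8


def true_offset(index, element_bytes=FFT_BYTES):
    """Byte offset of buffer `index`, computed with unlimited precision."""
    return index * element_bytes


def wrapped_offset(index, element_bytes=FFT_BYTES, bits=32):
    """The same offset as it would come out of a `bits`-wide unsigned multiply."""
    return (index * element_bytes) % (1 << bits)


def first_wrapping_index(element_bytes=FFT_BYTES, bits=32):
    """Smallest buffer index whose byte offset no longer fits in `bits` bits."""
    return -(-(1 << bits) // element_bytes)  # ceiling division
```

Below `first_wrapping_index()` the two functions agree. From that index on, the wrapped offset silently points back near the start of the memory block, so different workers could end up sharing a buffer even though the locking code contains no mistake.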

Quote:
BTW, Prime95 doesn't even stress my machine much. Programs like Y-cruncher or Cinebench cause higher CPU temperatures. I know, that means little when accessing a lot of memory. But then, some sort of instability should emerge with v29.8, shouldn't it?
Yes, prime95 is memory-bound which is why prime95 usually uncovers memory problems. Running small FFTs might produce similar stress to Y-cruncher.
Old 2022-09-27, 23:28   #26
falk
 
Sep 2022
Munich, Germany

2⁴ Posts

Quote:
Originally Posted by chalsall View Post
Correlation does not *necessarily* mean causality.
This is why experienced programmers observe carefully the in situ.
What do you think I am doing?

If you really wanted to help, you would run v30.8 on 128GB+ with 10+ cores, ideally Intel.

BTW, correlation does indeed not imply causality (which is why I said "probably"), but correlation is all you can measure and causality is always just theorized. According to the scientific method. In my last few posts above, I used "hypothesis" in my title for a reason.

The nastiest bug I ever chased cost me 3 months of my life, most of it spent convincing others that there was a bug in the first place (a fully valid program that, when run, made a supercomputer reboot without leaving any trace that my program had ever existed...). Fortunately, this time I don't depend on a fix.

Last fiddled with by falk on 2022-09-27 at 23:38
Old 2022-09-28, 00:05   #27
falk
 
Sep 2022
Munich, Germany

2⁴ Posts

Quote:
Originally Posted by Prime95 View Post
Unfortunately, eyeballing code is a pretty poor method for debugging.
If you have a debug version of 30.8 which prints more intermediate results, then I am ready to help. The crash at the end may be a follow-up error. But as it happens at a fixed code location, it may still be worth looking up the code line in question.

Race conditions often disappear when debug logging is added. So, it would be interesting if there were some fine-grained control of log verbosity.

But maybe let's first find a third machine with the same issue …
Tomorrow, I'll test v30.9. I saw it on your FTP server.

If I find the time, I may do a binary search for the exact version 30.x B.y in which the issue first emerged. Would that help?
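
Such a search could look like the sketch below. It assumes the builds can be listed in release order and that every build from the first faulty one onward fails. The function name and the `fails` callback (standing in for a manual torture-test run of one build) are hypothetical.

```python
def first_failing_build(builds, fails):
    """Return the earliest build for which fails(build) is True, or None.

    Assumes `builds` is in release order and that once a build fails,
    all later builds fail as well.
    """
    lo, hi = 0, len(builds)
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(builds[mid]):
            hi = mid        # the first failing build is at mid or earlier
        else:
            lo = mid + 1    # the first failing build is after mid
    return builds[lo] if lo < len(builds) else None
```

For 16 candidate builds this needs at most five test runs instead of up to sixteen, which matters when each run is a long torture test.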

Last fiddled with by falk on 2022-09-28 at 00:14
Old 2022-09-28, 01:05   #28
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

3×5×7²×11 Posts

Please try p95v308b17.win64.zip
Old 2022-09-28, 10:03   #29
preda
 
 
"Mihai Preda"
Apr 2015

2·3·5·47 Posts

Should be fixed in the new build posted. Was affecting systems with large (>64GB) RAM.
Old 2022-09-28, 16:08   #30
falk
 
Sep 2022
Munich, Germany

20₈ Posts

Quote:
Originally Posted by Prime95 View Post
Please try p95v308b17.win64.zip
I tried it. No faults, runs stable, even at 240k, even overclocked, for at least an hour now.

@preda seems to be fixed, thanks for your fix.

Do GIMPS participants need to be concerned about possibly wrong results having been posted?


The issue seems to be resolved. I learned a lot about Mersenne primes along the way and how to do integer multiplication right :)
Old 2022-09-28, 17:43   #31
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

17625₈ Posts

Quote:
Originally Posted by falk View Post
I tried it. No faults, runs stable, even at 240k, even overclocked, for at least an hour now.
Wonderful. Thank you to testcb00 for narrowing the problem down to specific FFT sizes and amounts of RAM. Thanks to falk for noting that 29.8 worked just fine. From those two clues I was able to greatly narrow down the code that could be responsible.

Quote:
Do GIMPS participants need to be concerned about possibly wrong results having been posted?
The bug is only in hyper-threaded torture testing, specifically FFT sizes below 256K where memory is set to more than #cores * 8GB. GIMPS results are A-OK.
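
The conditions above can be written as a small check. This is only a restatement of that sentence in Python, not code from Prime95. The function and parameter names are made up for the example, and reading "8GB" as 8 GiB is an assumption.

```python
GIB = 1024 ** 3  # assumption: "GB" in the post is read as GiB


def in_affected_range(fft_size_k, memory_bytes, cores, hyperthreaded,
                      torture_test=True):
    """True if a run matches the conditions stated for the bug."""
    return (
        torture_test                          # only torture testing
        and hyperthreaded                     # only with hyperthreading in use
        and fft_size_k < 256                  # FFT sizes below 256K
        and memory_bytes > cores * 8 * GIB    # more than #cores * 8GB
    )
```

For example, a 10-core machine torture testing the 240K FFT with hyperthreading and 128 GiB of memory falls inside the range, while the same run limited to 64 GiB does not.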
Old 2022-09-28, 19:11   #32
preda
 
 
"Mihai Preda"
Apr 2015

2×3×5×47 Posts

Quote:
Originally Posted by falk View Post
I tried it. No faults, runs stable, even at 240k, even overclocked, for at least an hour now.

@preda seems to be fixed, thanks for your fix.
Not my fix, George found the issue and fixed it. I was only affected by it, and I did a thorough hardware debug (replace CPU, replace all memory, replace subsets of DIMMs, etc.) to reach the conclusion that it's unlikely to be a HW issue. Afterwards I tested the candidate build, all fine, so it looked like the fix was good.
Old 2022-09-28, 19:25   #33
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

83×131 Posts

Quote:
Originally Posted by preda View Post
Not my fix, George found the issue and fixed it.
We all are standing on the shoulders of giants.

Thanks, guys (and gals). Seriously.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
