mersenneforum.org  

Old 2016-07-23, 14:31   #1
nucleon
 
 
Mar 2003
Melbourne

5×103 Posts
I'm having quality issues on Skylake as well

I'm hoping someone can help with suggestions on improving my results on this setup. I found this thread and it seems to closely align with what I have, but let me know if I should post elsewhere.

Of the last 12 double checks run on this machine, only 6 matched. This is probably my worst performance on a new machine I've assembled. Big dent to the ego :(

My Setup:
Intel Xeon E3-1270-V5 (Skylake)
Crucial 32GB (2x 16GB Kit) PC4-17000 ECC Unbuffered 288-pin EUDIMM ( 2xCT16G4WFD8213 )
MSI c236a Workstation Motherboard

I'm running latest BIOS from the motherboard site (2.4), so I'm assuming I have the latest microcode updates. For reference v2.2 of the bios had "Updated CPU microcode(0x7C)." listed.

I've tried the 768k torture test, so far 30mins in, and no hang or freeze.

As this is a xeon cpu - overclock options are extremely limited, and I'm running on defaults.

I've run memtest 86+ overnight with no issues.

I've thrown other torture tests at it and no issues. (IBT/Intel XTU/Prime95 torture test)

The only way I can cause problems on demand is to run Intel Burn-in Test over all memory, then load something memory intensive to use up the rest. That gave me a blue screen (it doesn't always happen).

I did notice that someone mentioned there were some incompatibilities between Crucial memory and Skylake CPUs?

Anyone have any more info on that?


-- Craig
Old 2016-07-23, 15:19   #2
ATH
Einyen
 
 
Dec 2003
Denmark

2·5²·67 Posts

Are you running any XMP profile on your RAM (eXtreme Memory Profile)?

On my Haswell-E 5960X I started getting bad results in Prime95 with an XMP profile running the RAM at 3000 MHz (and the processor at 3500 MHz, base clock 125 MHz, ratio 28), but 36 hours of Memtest86 and 45 hours of Prime95 stress test did not give any errors.

After switching to a lower XMP profile running the RAM at 2666 MHz, with the processor still at 3500 MHz (base clock 100 MHz, ratio 35), I have not had a single error since.
Old 2016-07-24, 03:15   #3
nucleon
 
 
Mar 2003
Melbourne

515₁₀ Posts

There are a lot fewer options with the Skylake Xeon.

I can only run the DDR4 memory at 2133; there's no XMP profile in the DIMM SPD.

-- Craig
Old 2016-07-24, 04:27   #4
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013

13·227 Posts

If you test a failed double check, do you get the same result?
Old 2016-07-24, 05:00   #5
nucleon
 
 
Mar 2003
Melbourne

5·103 Posts

Good question. I'll check that.

(takes about 30 hours).

I'll also add all voltages/timings are set to defaults.

Temperatures are the lowest I've seen for a new CPU running prime 95 on the intel cooler. Although it's winter here and quite cool (20degC), I haven't seen CPU temperatures above 70degC. 50-65degC seems pretty common when running prime95 in full.

Given I've seen blue screens when using up all physical memory, I thought the issue might be related (somehow) to the disk and the pagefile. So I've bought a new SSD and cloned the old volume, and I'll be testing for a while to see how it goes.

-- Craig
Old 2016-07-25, 15:00   #6
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

3×11×101 Posts

Quote:
Originally Posted by nucleon
Good question. I'll check that.

(takes about 30 hours).

I'll also add all voltages/timings are set to defaults.

Temperatures are the lowest I've seen for a new CPU running prime 95 on the intel cooler. Although it's winter here and quite cool (20degC), I haven't seen CPU temperatures above 70degC. 50-65degC seems pretty common when running prime95 in full.

Given I've seen blue screens when using up all physical memory, I thought the issue might be related (somehow) to the disk and the pagefile. So I've bought a new SSD and cloned the old volume, and I'll be testing for a while to see how it goes.

Since I have a bold plan (along with AirSquirrels) to tackle all of the exponents that need triple-checking anyway, I could run a few of yours that mismatched and see which side yours land on.

You mentioned that you have ECC memory so it would seem likely the memory is not the problem, but I'll add a big "but" to that...

I had one server with memory issues, and it was a pain in the behind for me. It was a node of a SQL cluster and when running as the passive node, no problems... this thing would run for weeks, months. But when I'd make it the active node, SQL would eventually use up more and more physical memory and then the thing would BSOD. Ouch. Not good on a production cluster, but at least the other node behaved itself.

It was actually this whole experience that got me back to doing Prime95 stuff because I fired it up as a stress testing tool.

It ran fine under Prime95 and memtest. I also used a few other esoteric mem tools that would go through the whole 36GB installed on there, and nothing would make it fail except running an actual SQL load.

Fortunately the HP server tools reported which mem module threw the uncorrectable error and I was able to remotely disable that module (by putting it into "spare module" mode since the bad module was fortunately in one of the "spare" slots) and then finally replace the module on my next visit.

But it was really frustrating, and especially annoying that nothing seemed to trigger it except the one thing this server was designed to do, and it ended up being an uncorrectable error, so even ECC didn't do more than let me know which module was bogus.

Point of all my rambling... I don't know if your system has any tools that show specific ECC related memory issues, like if it detected a correctable/uncorrectable error? You mentioned that you've had blue screens/crashing so it could definitely be related.

But then again it could be something else not mem related... funky power, a single "iffy" contact on the CPU socket, etc.

I'd lean towards a mem issue in your case if not for the ECC thing, but like I said, even with ECC you're not immune to those issues.
Old 2016-07-27, 22:50   #7
nucleon
 
 
Mar 2003
Melbourne

5·103 Posts

Thanks.

I've done further testing.

The exponent I'm using is M39295301. I've tested it on my Titan Black, and I get the residue reported by the previous tester. So I'm pretty confident my machine is the one in error.
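
For anyone following along, here is a minimal sketch of what a "matching residue" means. It assumes the standard Lucas-Lehmer test and assumes the reported residue is the low 64 bits of the final value. The function names are made up for illustration; this is not Prime95's code, and plain big-integer arithmetic like this is only practical for tiny exponents, nowhere near M39295301.

```python
def lucas_lehmer(p: int) -> int:
    """Return the final Lucas-Lehmer value s_(p-2) mod (2**p - 1) for an odd prime p.

    The Mersenne number 2**p - 1 is prime exactly when this value is 0.
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s


def res64(value: int) -> int:
    """Low 64 bits of the final value (assumed here to be the reported residue)."""
    return value & 0xFFFFFFFFFFFFFFFF


def residues_match(first_run: int, second_run: int) -> bool:
    """A double check agrees when both runs end on the same 64-bit residue."""
    return res64(first_run) == res64(second_run)


if __name__ == "__main__":
    for p in (7, 11, 13):
        final = lucas_lehmer(p)
        verdict = "prime" if final == 0 else "composite"
        print(f"M{p}: residue {res64(final):016X} ({verdict})")
```

Since every iteration feeds the next, a single bad squaring anywhere in the run changes the final residue, which is why a mismatch points at a hardware or software fault on one of the two machines.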

I've now run this exponent two more times with different FFT sizes.

With the default FFT size, even though I get 2x errors during the run:
Iteration: 30351457/39295301, Possible error: round off (0.5) > 0.40625
Continuing from last save file.
Iteration: 30348517/39295301, Possible error: round off (0.5) > 0.40625
Continuing from last save file.

I still get matching residue.
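
As background on that "round off" message, here is a rough sketch of the idea, not Prime95's actual implementation (which, as far as I know, uses a weighted transform in hand-tuned assembly). It assumes a plain floating-point FFT squaring: after the inverse transform every coefficient should be very close to an integer, and the largest distance from the nearest integer is the round-off error. The 0.40625 limit is taken from the log above; `fft`, `square_digits` and `roundoff_ok` are hypothetical names.

```python
import cmath

ROUNDOFF_LIMIT = 0.40625  # threshold shown in the log above


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. No 1/n scaling."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def square_digits(digits):
    """Square a number given as little-endian digits.

    Returns (coefficients before carry propagation, maximum round-off error).
    """
    n = 1
    while n < 2 * len(digits):
        n *= 2
    padded = [complex(d) for d in digits] + [0j] * (n - len(digits))
    spectrum = fft(padded)
    product = fft([x * x for x in spectrum], invert=True)
    coeffs, max_err = [], 0.0
    for x in product:
        value = x.real / n          # apply the 1/n scaling of the inverse FFT
        nearest = round(value)
        max_err = max(max_err, abs(value - nearest))
        coeffs.append(nearest)
    return coeffs, max_err


def roundoff_ok(max_err, limit=ROUNDOFF_LIMIT):
    """True when the result can be trusted under the chosen limit."""
    return max_err <= limit


if __name__ == "__main__":
    coeffs, err = square_digits([3, 2, 1])  # the number 123 in base 10
    print(sum(c * 10**i for i, c in enumerate(coeffs)), err, roundoff_ok(err))
```

An error of exactly 0.5 means the value sat halfway between two integers, so rounding could go either way. On healthy hardware that should essentially never happen at a properly sized FFT, which is why it is treated as a sign of corrupted data and the run restarts from the last save file.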

But when I did a second run at FFT=2240K, I got no round-off errors but an incorrect result, which didn't match my original run.

I think I might be hitting this error:

http://www.intel.com/content/www/us/...000020749.html

Basically this error boils down to using NCQ with intel AHCI drivers on c236 chipset. Which is what I'm doing.

So what I think is happening is that a pagefile read is triggered, the page is read from disk incorrectly, and it is placed back in memory, corrupting 'various' memory pages. ECC memory won't fix these.
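
To illustrate why ECC would not help in that scenario, here is a toy single-error-correct, double-error-detect (SECDED) code over 4 data bits. This is only a sketch: real ECC DIMMs protect 64 data bits with 8 check bits, and I am assuming the same principle applies. `encode` and `decode` are hypothetical helper names.

```python
def encode(nibble):
    """Encode 4 data bits as an extended Hamming(8,4) codeword (list of 8 bits).

    Index 0 holds the overall parity bit; indexes 1..7 are Hamming positions,
    with parity bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (data nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i           # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        fixed[syndrome] ^= 1        # single flip; syndrome 0 is the parity bit itself
        status = "corrected"
    else:
        status = "uncorrectable"    # two flips: detected but cannot be repaired
    nibble = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1
    print(decode(word))             # one flipped bit in memory is repaired
    print(decode(encode(0b0011)))   # wrong data written by a bad disk read
```

The last line is the point: if the data is already wrong when it is written to RAM, the check bits are computed over the wrong data, so the word decodes as "ok" and ECC has nothing to report.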

So I'm trying to find a way to disable NCQ.

I've found a registry hack to disable NCQ for MS AHCI drivers, but I haven't found one for intel AHCI drivers.

-- Craig
Old 2016-07-27, 22:54   #8
nucleon
 
 
Mar 2003
Melbourne

5×103 Posts

On ECC memory testing,

Memtest 86 Pro, which is the paid version, apparently does things like ECC inject testing.

I might have to pony up the money and do some of the advanced tests there. The free version passes on my machine.

I don't know of any utilities that can report on ECC stats for my platform. I might do some googling when I have time.
Old 2016-07-27, 23:30   #9
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

7·17·89 Posts

Quote:
Originally Posted by nucleon
I think I might be hitting this error:

http://www.intel.com/content/www/us/...000020749.html

Considering that Intel sells a LOT of kit to serious players, it seems a bit strange that they rely on their users to find bugs.
Old 2016-07-28, 04:46   #10
S485122
 
 
"Jacob"
Sep 2006
Brussels, Belgium

2⁵·3·19 Posts

Quote:
Originally Posted by nucleon
I think I might be hitting this error:
http://www.intel.com/content/www/us/...000020749.html
Basically this error boils down to using NCQ with intel AHCI drivers on c236 chipset. Which is what I'm doing.
So I'm trying to find a way to disable NCQ.
...

In the article you quote, Intel states that their new drivers do not have that problem. Why not just update the drivers?
Old 2016-07-28, 08:09   #11
nucleon
 
 
Mar 2003
Melbourne

5×103 Posts

Quote:
Originally Posted by S485122
In the article you quote, Intel states that their new drivers do not have that problem. Why not just update the drivers?

Those are the RAID drivers. My chipset is currently set to AHCI mode, not RAID mode.

I'm going to make an attempt where I set the chipset to RAID mode.

But I'm currently trying to develop a test case that I can run in a shorter time. :) 30hrs is a long time to wait between test cases.

-- Craig
All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
