2017-05-31, 03:08   #1475
ewmayer
2ω=0

Sep 2002
República de California

2×3×1,637 Posts

Quote:
 Originally Posted by GP2 OK. I'm monitoring each milestone and at the 4M mark, all four exponents match so far.
Looks like the issue was overheating. After an outright system freeze-up during the night, David replaced the 'decent-quality' fan-based cooler on the Ryzen with a water cooler. I restarted my octet of 3072K DCs 3 hours ago and have seen no sign of anything but steady crunching since; yesterday those 8 jobs were collectively throwing fatal-ROE-and-interval-retries more than once per hour.

The restart also gave me a chance to try one more throughput-related test - I queued up the 8 DCs I just grabbed above in addition to the original 8 (all @3072K) and assigned one to each of the 16 logical cores of the system [8 physical cores, each mapping to 2 logical]. That proved very bad - 8 jobs (on cores 0,2,4,6,8,10,12,14, i.e. 1 per physical core using AMD's core-numbering system) yield 0.042 sec/iter each for a total throughput of 8/0.042 ≈ 190 iter/sec, but 16 jobs (1 each on cores 0-15) push that up to 0.11 sec/iter for a total throughput of 16/0.11 ≈ 145 iter/sec, a massive drop of roughly 24%.
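
For anyone wanting to repeat this comparison on their own hardware, the arithmetic can be sketched as below. This is only an illustration of the calculation in the post; the function names are made up for the example and are not part of Mlucas.

```python
def total_throughput(jobs: int, sec_per_iter: float) -> float:
    """Aggregate iterations per second when `jobs` runs each take
    `sec_per_iter` seconds per iteration."""
    return jobs / sec_per_iter


def relative_drop(before: float, after: float) -> float:
    """Fractional loss in throughput going from `before` to `after`."""
    return (before - after) / before


# Timings quoted in the post: 8 jobs at 0.042 s/iter vs 16 jobs at 0.11 s/iter.
one_per_physical_core = total_throughput(8, 0.042)
one_per_logical_core = total_throughput(16, 0.11)

print(f"8 jobs:  {one_per_physical_core:.1f} iter/sec")
print(f"16 jobs: {one_per_logical_core:.1f} iter/sec")
print(f"drop:    {relative_drop(one_per_physical_core, one_per_logical_core):.1%}")
```

With the quoted timings the loss comes out at a little under a quarter of the 8-job throughput.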

Fingers crossed that all the overheating-data-corruption badness didn't hose any of the 8 original DCs - Gord, how are those 4 TC residues looking compared to the ones I posted?

2017-05-31, 14:45   #1476
kladner
"Kieren"

Jul 2011
In My Own Galaxy!

10040₁₀ Posts

45699887 matched

2017-06-02, 03:26   #1477
GP2

Sep 2003

29·89 Posts

Quote:
 Originally Posted by ewmayer Just finished these DCs [53647547,53648423,53648893,53648981] on the Ryzen system - all 4 final residues mismatch those of the first-test submission.
Very close to the halfway mark on the triple-check run: three exponents are at 26M and one laggard is at 25M.

All the interim residues are matching yours so far.

If there is some secret algorithm to make Mlucas robust on flaky hardware, you should let George know so he can implement it for mprime too.

Let me know if any exponents in the new batches of 8 + 8 end up mismatching.

Last fiddled with by GP2 on 2017-06-02 at 03:28

2017-06-02, 07:16   #1478
ewmayer
2ω=0

Sep 2002
República de California

2·3·1,637 Posts

Quote:
 Originally Posted by GP2 Very close to the halfway mark on the triple-check run: three exponents are at 26M and one laggard is at 25M. All the interim residues are matching yours so far. If there is some secret algorithm to make Mlucas robust on flaky hardware, you should let George know so he can implement it for mprime too.
Thanks! I confess I'm (pleasantly) shocked that none of the multiple fatal-ROE-retries in my runs have been accompanied by a silent-but-deadly data corruption. Wish I could claim credit for some 'secret sauce' behind that, but that would be dishonest. If and when I garner more than a trivial user base, we may get information on whether the alleged robustness is in any way systematic, or whether this particular instance of hardware flakiness is a lucky fluke.

FYI, here is the number of such occurrences in each of my 4 runs, based on grepping the exponent status file:

p53647547.stat:62
p53648423.stat:53
p53648893.stat:57
p53648981.stat:72
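
A rough Python equivalent of that kind of per-file count is sketched below. The marker string is a placeholder: the exact text searched for is not given in the post, so `"ROE"` here is purely an assumption, as are the function names.

```python
from pathlib import Path
from typing import Iterable, List


def count_matching_lines(path: str, marker: str) -> int:
    """Count the lines in `path` that contain `marker` (like `grep -c`)."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return sum(1 for line in fh if marker in line)


def report(paths: Iterable[str], marker: str) -> List[str]:
    """Return one 'filename:count' string per file, in the order given."""
    return [f"{Path(p).name}:{count_matching_lines(p, marker)}" for p in paths]


if __name__ == "__main__":
    # Hypothetical marker; substitute whatever string the status file
    # actually uses for the event being counted.
    stat_files = sorted(str(p) for p in Path(".").glob("p*.stat"))
    for line in report(stat_files, "ROE"):
        print(line)
```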

2017-06-05, 16:09   #1479
Madpoo
Serpentine Vermin Jar

Jul 2014

29·113 Posts

Weird list...

Here's a strange list of 5 exponents that have a very high chance of being done wrong the first time.

All 5 are currently assigned which means technically you'd be poaching them, but all of them are assigned to "anonymous" users back in Sep/Oct of 2014 and haven't been heard from in years. Since they wouldn't normally expire for many years to come (when the double-checking gets up in to the 70M range) we may as well check them now since the assignments have clearly been abandoned.

Code:
DoubleCheck=73964809,75,1
DoubleCheck=73965077,75,1
DoubleCheck=73919003,75,1
DoubleCheck=78181099,75,1
DoubleCheck=73681423,75,1

They have bad/good ratios from 4.5 up to 21.0 (that last one) for any given month... This user/computer has been on our target list for a while due to a period of several months when it just churned out one bad result after another, and a LOT of them (30+ in a single month somehow).

If there are other likely cases of a "very likely bad" result with an assignment that's so old it's probably abandoned, I'll put those up later as well, but these 5 have been bugging me, just sitting there...

EDIT: In addition to those 5, there were only a couple others that fall into the category of "assigned, but really really old". So, here are all 7 of those along with the relevant stats so you can see just what the ratios look like:

Code:
exponent	Bad	Good	Unk	Sus	Solo	Mis	worktodo
73681423	21	1	1	0	1	0	DoubleCheck=73681423,75,1
74207999	14	1	1	0	1	0	DoubleCheck=74207999,75,1
73964809	25	3	2	0	2	0	DoubleCheck=73964809,75,1
73965077	25	3	2	0	2	0	DoubleCheck=73965077,75,1
73919003	33	5	1	0	1	0	DoubleCheck=73919003,75,1
78181099	27	6	1	0	1	0	DoubleCheck=78181099,75,1
66882859	2	0	1	0	1	0	DoubleCheck=66882859,74,1

Those 2 new additions were last updated 1 to nearly 2 years ago.

Last fiddled with by Madpoo on 2017-06-05 at 17:26

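
The bad/good ratios mentioned in the post can be reproduced from the Bad and Good counts in its 7-row list. The sketch below is just that division, with the counts copied from the post; the `Row` type and function name are invented for the example, and a row with zero good results is reported as having no defined ratio.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    exponent: int
    bad: int
    good: int


def bad_good_ratio(row: Row) -> Optional[float]:
    """Bad results per good result; None when there are no good results."""
    if row.good == 0:
        return None
    return row.bad / row.good


# (exponent, Bad, Good) as listed in the post.
ROWS = [
    Row(73681423, 21, 1),
    Row(74207999, 14, 1),
    Row(73964809, 25, 3),
    Row(73965077, 25, 3),
    Row(73919003, 33, 5),
    Row(78181099, 27, 6),
    Row(66882859, 2, 0),
]

for row in ROWS:
    ratio = bad_good_ratio(row)
    shown = "n/a (no good results)" if ratio is None else f"{ratio:.1f}"
    print(f"{row.exponent}: {shown}")
```

The 27/6 and 21/1 rows give the 4.5 and 21.0 endpoints quoted for the original five exponents.
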
2017-06-05, 23:59   #1480
bgbeuning

Dec 2014

2²·3²·7 Posts

Queued up the first 5

Quote:
 DoubleCheck=73964809,75,1
DoubleCheck=73965077,75,1
DoubleCheck=73919003,75,1
DoubleCheck=78181099,75,1
DoubleCheck=73681423,75,1
But they do not show up as reserved and
my "manual comm" button in prime95 is disabled.

2017-06-06, 01:06   #1481
GP2

Sep 2003

29×89 Posts

Quote:
 Originally Posted by bgbeuning Queued up the first 5 But they do not show up as reserved and my "manual comm" button in prime95 is disabled.
They can't be reserved, since other (anonymous) users have already reserved them.

However, those anonymous users have not been heard from since 2014, so they are probably not going to complete the exponents.

You can just run the exponents yourself despite the lack of a reservation and the program will report the results to PrimeNet in the usual way.

2017-06-06, 07:59   #1482
GP2

Sep 2003

29·89 Posts

The triple checks on exponents 53648893, 53648981, 53648423, 53647547 will finish within a few hours, in that order. The first one is already at 51M and the slowest is almost at 49M. All interim residues are still matching.

Since it seems almost certain that the first-time LL tests (all by the same user) were wrong, here are some more exponents by that same user that are relatively close in time frame and in exponent value:

DoubleCheck=53643523,73,1
DoubleCheck=53643913,73,1
DoubleCheck=53644573,73,1
DoubleCheck=53644883,73,1
DoubleCheck=53647171,73,1
DoubleCheck=53648677,73,1
DoubleCheck=53648729,73,1
DoubleCheck=53679289,73,1

2017-06-06, 14:02   #1483
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2·3·1,193 Posts

i took: DoubleCheck=53643523,73,1

2017-06-06, 16:17   #1484
Madpoo
Serpentine Vermin Jar

Jul 2014

110011001101₂ Posts

Quote:
 Originally Posted by GP2 ... Since it seems almost certain that the first-time LL tests (all by the same user) were wrong, here are some more exponents by that same user that are relatively close in time frame and in exponent value...
FYI, that particular user/cpu has a mixed history with an overall track record of 52 bad and 178 good.

Here is the breakdown by year/month, so depending on when a result came in it may have better/worse odds:
Code:
YYYY-MM	Bad	Good	Unknown
2008-11	3	2	0
2008-12	2	0	0
2009-1	0	14	7
2009-3	0	2	1
2009-4	0	2	3
2009-5	0	4	7
2009-6	0	1	1
2009-7	0	1	4
2009-8	0	7	2
2009-9	0	1	1
2009-10	0	2	4
2009-11	0	2	6
2009-12	0	4	3
2010-1	0	0	4
2010-2	0	2	1
2010-3	0	2	6
2010-4	0	0	4
2010-5	0	1	7
2010-6	0	4	2
2010-7	0	5	3
2010-8	0	3	6
2010-9	0	2	0
2010-10	0	8	3
2010-11	0	8	1
2010-12	0	2	3
2011-1	0	1	14
2011-2	0	1	8
2011-3	0	1	7
2011-4	0	0	10
2011-5	0	3	8
2011-6	0	1	14
2011-7	0	4	10
2011-8	0	1	11
2011-9	0	2	12
2011-10	0	2	17
2011-11	0	0	8
2011-12	0	0	24
2012-1	0	1	2
2012-2	9	7	2
2012-3	0	1	22
2012-4	3	2	0
2012-5	1	3	17
2012-7	0	2	31
2012-8	1	2	17
2012-9	0	3	27
2012-10	1	1	14
2012-11	1	4	17
2012-12	0	0	23
2013-1	0	0	9
2013-2	0	0	17
2013-3	0	0	18
2013-4	0	0	24
2013-5	0	0	6
2013-6	0	0	22
2013-7	0	2	14
2013-8	0	1	17
2013-9	0	0	17
2013-10	0	0	28
2013-11	1	2	19
2013-12	0	1	13
2014-1	1	5	11
2014-2	7	6	11
2014-3	5	9	19
2014-4	6	4	7
2014-6	0	2	21
2014-7	2	2	16
2014-8	1	2	13
2014-9	0	0	15
2014-10	0	0	14
2014-11	0	0	10
2014-12	0	0	14
2015-2	0	0	17
2015-4	0	0	14
2015-6	0	0	19
2015-9	2	2	8
2015-10	2	2	4
2015-11	1	2	1
2015-12	0	0	4
2016-1	0	0	4
2016-3	0	0	12
2016-5	0	0	4
2016-6	0	0	8
2016-8	0	1	16
2016-10	0	0	6
2016-12	0	0	7
2017-1	3	2	0
2017-2	0	8	3
2017-3	0	3	4
2017-4	0	0	4
2017-6	0	1	3
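
To see how the odds shift over time, the monthly rows can be rolled up by year. The following is a minimal sketch that assumes the table is available as whitespace-separated text in the layout shown above; only a handful of rows are reproduced as sample input, and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

# A few rows copied from the table above (YYYY-MM, Bad, Good, Unknown).
SAMPLE = """\
YYYY-MM	Bad	Good	Unknown
2014-1	1	5	11
2014-2	7	6	11
2014-3	5	9	19
2014-4	6	4	7
2017-1	3	2	0
2017-2	0	8	3
"""


def totals_by_year(table_text: str) -> Dict[str, Tuple[int, int, int]]:
    """Sum the Bad, Good and Unknown columns per calendar year."""
    totals = defaultdict(lambda: [0, 0, 0])
    for line in table_text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a 4-column data row.
        if len(fields) != 4 or not fields[1].isdigit():
            continue
        year = fields[0].split("-")[0]
        for i, value in enumerate(fields[1:]):
            totals[year][i] += int(value)
    return {year: tuple(counts) for year, counts in totals.items()}


def bad_fraction(bad: int, good: int) -> Optional[float]:
    """Share of verified results that were bad; None if nothing is verified."""
    verified = bad + good
    return bad / verified if verified else None


for year, (bad, good, unknown) in sorted(totals_by_year(SAMPLE).items()):
    print(year, bad, good, unknown, bad_fraction(bad, good))
```

Applied to the overall record of 52 bad and 178 good, `bad_fraction` gives roughly 23% of verified results bad.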

2017-06-06, 16:52   #1485
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1101111110110₂ Posts

Quote:
 Originally Posted by Madpoo FYI, that particular user/cpu has a mixed history with an overall track record of 52 bad and 178 good.
I think it would be a good idea to strategically double-check at least half of this user's exponents. Based on the data that comes back, we might then double-check the other half.

