mersenneforum.org LLRnet servers for NPLB

2009-08-19, 16:02   #1189
gd_barnes

May 2007
Kansas; USA

2×5,279 Posts

Quote:
 Originally Posted by mdettweiler Ah, that would make sense. Both G8000 and G7000 crashed a couple of times after the outage, as always seems to happen after an outage. They seem to have stabilized now, though I've put them in a loop so that they'll restart if they do crash again.
They kept crashing, Max. Please don't use the phrase "couple of times" when you don't know how many times. Just look in the restart.txt file. There are multiple crashes. I looked at 3 AM CDT this morning and I saw that both had crashed again within the last few hours and were automatically restarted with the loop thing.

You are bound and determined to gloss over this whole issue without doing a detailed look at the exact times and matching up when the rejected results were originally handed out. I took 2 hours last night to do that for you now. How about looking into it this time please?

Please calculate when the 26 rejected results were originally handed out today. I saved them off under an obvious file name. Like I said, I only had time to look at the first 2-3 and those were handed out at 09:55-10:00 CDT on Aug. 18th. Simply take the time that the original result was returned and subtract the # of seconds that it took to return it.
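The subtraction described above can be sketched as follows. This is only an illustration: the function name, the timestamp format, and the sample values are assumptions, not anything LLRnet provides.

```python
from datetime import datetime, timedelta

def handed_out_at(returned: str, elapsed_sec: float) -> datetime:
    """Estimate when a pair was handed out: return time minus test duration."""
    returned_at = datetime.strptime(returned, "%Y-%m-%d %H:%M:%S")
    return returned_at - timedelta(seconds=elapsed_sec)

# Hypothetical values for illustration only (not taken from the results files).
estimate = handed_out_at("2009-08-18 11:30:00", 5400.0)
print(estimate)
```

The estimate is only as good as the reported test time, and it assumes the client started the test as soon as it received the pair.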

I'm not going to back off on this until we nail it down. I nailed down 10 rejected results to the original power outage. The other 46 still have no explanation. How do we know that they were as a result of yet another crash? We don't. We need to match up exact crash times with times in which the original pairs were handed out.

We seem to have gotten into this habit of glossing over these server problems and that habit needs to end.

I don't know if this will help but it can't hurt:

On port G8000 only, please increase the JobMaxTime to 2 days.

Please tell me how you safely stop the server to do this. If you can let me know how that is done, then I'll do it if it is needed in the future. I know how to change the JobMaxTime and to restart it but don't want to create a problem when I stop it.

Karsten, can we talk you into returning pairs normally instead of ~100 at a time about twice a day? If you need to do so many at a time, how about you write a script to do ~20 each hour for 5 hours or something like that? That may help some.
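A minimal sketch of such a script, assuming the finished results are available as a list and that some `send` function (hypothetical here) returns one batch to the server. How results are actually returned depends on the client setup, so `send` is left as a placeholder.

```python
import time
from typing import Callable, Iterator, List, Sequence

def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

def drip_feed(results: Sequence[str],
              send: Callable[[List[str]], None],
              size: int = 20,
              interval_sec: float = 3600.0,
              sleep: Callable[[float], None] = time.sleep) -> None:
    """Return results in small batches, waiting between batches.

    `send` is a hypothetical callback that returns one batch to the server.
    """
    pending = list(batches(results, size))
    for i, batch in enumerate(pending):
        send(batch)
        if i < len(pending) - 1:  # no need to wait after the last batch
            sleep(interval_sec)
```

With 100 results and the defaults, this sends 5 batches of 20 with a one-hour wait between batches.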

Gary

Last fiddled with by gd_barnes on 2009-08-19 at 16:06

2009-08-19, 19:27   #1190
kar_bon

Mar 2006
Germany

2⁴·3·61 Posts

i've sent the 2 outstanding pairs at n=957k for GB8000 some time ago and they are in the "last copy off"-file, but the "First unprocessed k/n-pairs" still show one of them!

why? was the prune-time not 1 hour?
please edit the stats-page for the GB ports to show those settings like the IB ports!

PS: the stats updated 14:45 CDT with n=971k!

Last fiddled with by kar_bon on 2009-08-19 at 20:08

2009-08-19, 20:52   #1191
MyDogBuster

May 2008
Wilmington, DE

2²×23×31 Posts

What has this http://nplb.ironbits.net/ been replaced with? You know, the one with the IB port, all the primes for the day, the first n to process, links to the rejects, results for the day, etc; in the vertical format.

Last fiddled with by MyDogBuster on 2009-08-19 at 20:56

2009-08-19, 21:10   #1192
Lennart

"Lennart"
Jun 2007

2⁵×5×7 Posts

http://noprimeleftbehind.net/index.php

Lennart

Quote:
 Originally Posted by MyDogBuster What has this http://nplb.ironbits.net/ been replaced with? You know, the one with the IB port, all the primes for the day, the first n to process, links to the rejects, results for the day, etc; in the vertical format.

2009-08-19, 21:11   #1193
AMDave

Jan 2006
deep in a while-loop

666₁₀ Posts

2009-08-19, 21:14   #1194
AMDave

Jan 2006
deep in a while-loop

1232₈ Posts

SNAP!

2009-08-19, 21:50   #1195
MyDogBuster

May 2008
Wilmington, DE

101100100100₂ Posts

try http://www.noprimeleftbehind.net http://noprimeleftbehind.net/index.php

Not the ones I'm looking for. The one I had in mind did not show the hourly progress. It did show all the primes found for that day listed by each port. It is similar to http://nplb-gb1.no-ip.org/llrnet/ but instead for IB. It did have a link to http://www.noprimeleftbehind.net, but also had links to all current results for the day, rejects, etc.

Last fiddled with by MyDogBuster on 2009-08-19 at 21:50

2009-08-19, 22:04   #1196
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3×2,083 Posts

Quote:
Okay. I see that G7000 has restarted a number of times, while G4000 and G8000 have not. Note that there's no timestamps on the restart log file, so there's no way to know exactly when the crashes occurred.

As for the rejected results, here's a tally of how many were handed out when:
- 10 rejected from marco.bs around 11:00 CDT, 8/18
- 20 rejected from kar_bon around 15:39 CDT, 8/18
- 21 rejected from kar_bon around 00:37 CDT, 8/19
- 5 rejected from marco.bs around 02:23 CDT, 8/19

Note that we can't tell exactly when these were handed out because they were (like many rejected results) listed with a time of 0.0 sec. Correlating with times on the same k/n pairs in the main results files is not helpful in this case, since those could very well have been assigned at a different time.

Note that all of these rejected results are from G8000, a server which did not crash at all. Thus, we can't even circumstantially correlate these with any particular crashes. Even if it had been known to crash, then we wouldn't be able to know when the crashes happened; the time and date of restarts aren't logged.
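For illustration, a restart loop that does record when each restart happened could look something like the sketch below. This is not the loop currently running on the servers; the function name, the log wording, and the `max_restarts` parameter are assumptions made for the example.

```python
import subprocess
from datetime import datetime
from typing import List, Optional

def run_with_restart_log(cmd: List[str],
                         log_path: str,
                         max_restarts: Optional[int] = None) -> int:
    """Rerun `cmd` every time it exits, appending one timestamped line per exit.

    With max_restarts=None the loop runs forever, like a plain restart loop.
    """
    exits = 0
    while max_restarts is None or exits < max_restarts:
        code = subprocess.call(cmd)  # blocks until the server process exits
        exits += 1
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        with open(log_path, "a") as log:
            log.write(f"[{stamp}] server exited with code {code}\n")
    return exits
```

Each line in the log file then gives a crash time that can be compared against the times at which pairs were handed out.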

I know it may seem like I'm glossing over this stuff, but quite frankly, LLRnet doesn't let me do much more than that. It just plain doesn't log enough info. Yes, we redirect the screen output to a file, but that's essentially useless since there's no timestamps on it. Because of this, most server glitches simply have to be glossed over, because any further investigation is just going to waste a lot of time on something that there's not enough information to pinpoint. The glitches usually (as is the case now) will be handled by the server through its normal processes of expiry and reassignment; there's just nothing more we can do except let it run its course.
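One workaround that needs no change to LLRnet itself is to pass the redirected screen output through a small filter that stamps each line as it arrives. The sketch below is an assumption about how that could be done, not an existing LLRnet feature; the function names are made up for the example.

```python
import sys
from datetime import datetime
from typing import Callable, Iterable, Iterator

def stamp_lines(lines: Iterable[str],
                now: Callable[[], datetime] = datetime.now) -> Iterator[str]:
    """Prefix every line with the time at which it was read."""
    for line in lines:
        stamp = now().strftime("%Y-%m-%d %H:%M:%S")
        text = line.rstrip("\n")
        yield f"[{stamp}] {text}\n"

def main() -> None:
    """Copy stdin to stdout, stamping each line; flush so the log stays current."""
    for stamped in stamp_lines(sys.stdin):
        sys.stdout.write(stamped)
        sys.stdout.flush()
```

Calling `main()` under an `if __name__ == "__main__":` guard turns this into a pipe filter that can sit between the server and its output file. The timestamps are only accurate if the server does not buffer its output before writing it.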

This is one of the reasons why PRPnet will be very, very nice when it's all ready for production use. It keeps very detailed logs that are of great help when tracing down problems of any sort.

As for the jobMaxTime, unfortunately it's a rather difficult process to stop the server once it's in the loop. I can do it, but in order to verify that the server's actually stopped correctly, I have to do a number of "geek things" that would be really, really hard to explain. Ditto for restarting with the whole loop thing. I've just now changed G8000 to 2 days jobMaxTime; if you need any such changes performed while the servers are in the loop thingy, let me know and I'll do it the absolute soonest that I can. I'd love to tell you how to do it so that it isn't dependent on my availability, but quite frankly, as I said that may be a bit difficult.

Max

2009-08-19, 22:41   #1197
gd_barnes

May 2007
Kansas; USA

2·5,279 Posts

I found calculating when the original pairs (that were later handed out a 2nd time) were handed out to be helpful, even though you couldn't determine when the duplicated pairs (that actually DID reject) were handed out. When the rejected results were returned doesn't help us much. Here's why:

Likely the originals and the duplicates were handed out at about the same time. As you could see from the calculated times above, those original pairs were all handed out at 2 distinct times, one of which I was able to correlate almost exactly to the power outage. In other words, this tells me that there was some distinct problem that occurred at those 2 times. Had the original pairs been handed out at more random times, we could not come to such a conclusion. Even if the duplicated pairs had been handed out at distinct times, we couldn't discern such because, as you said, the rejected results don't show how much time was taken.

Gleaning as much info. as possible through calculations such as this will hopefully allow us to cut down on such problems in the future.

Anyway, I agree, it's not easy to glean much info. from things on LLRnet. I guess we'll have to stop now.

One more question: Will getting David's code on to my servers mean that we can avoid the "loop thing" code to restart the servers? If so, that will prevent quite a bit of this "after outage" multiple crashes that we keep encountering.

Gary

Last fiddled with by gd_barnes on 2009-08-19 at 23:00

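The idea of checking whether hand-out times fall into a few distinct groups can be sketched as a simple gap-based grouping. The function name and the gap threshold are assumptions chosen for illustration, and any times fed into it would have to come from the calculated hand-out times.

```python
from datetime import datetime, timedelta
from typing import List

def cluster_times(times: List[datetime], max_gap: timedelta) -> List[List[datetime]]:
    """Group timestamps so that neighbours within `max_gap` share a cluster."""
    clusters: List[List[datetime]] = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)   # close to the previous time: same incident
        else:
            clusters.append([t])     # large gap: start a new cluster
    return clusters
```

A small number of tight clusters points to distinct incidents, while many single-entry clusters would suggest the hand-out times are effectively random.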
2009-08-19, 22:51   #1198
gd_barnes

May 2007
Kansas; USA

2·5,279 Posts

Quote:
 Originally Posted by MyDogBuster What has this http://nplb.ironbits.net/ been replaced with? You know, the one with the IB port, all the primes for the day, the first n to process, links to the rejects, results for the day, etc; in the vertical format.
Quote:
 Originally Posted by Lennart
Quote:
 Originally Posted by AMDave
Quote:
 Originally Posted by MyDogBuster try http://www.noprimeleftbehind.net http://noprimeleftbehind.net/index.php Not the ones I'm looking for. The one I had in mind did not show the hourly progress. It did show all the primes found for that day listed by each port. It is similar to http://nplb-gb1.no-ip.org/llrnet/ but instead for IB. It did have a link to http://www.noprimeleftbehind.net, but also had links to all current results for the day, rejects, etc.

Lennart and AMDave, both of these responses are incorrect and both link to the same incorrect page. Ian asked for the "noprimeleftbehind" link name version of http://nplb.ironbits.net/. If everything is going to roll over to the new server, we need a new link name with "noprimeleftbehind" in it that specifically has this web page in it.

David, are you just going to leave this one link on the old "ironbits" link name or can we expect a new link that has "noprimeleftbehind" in it?

This is an important page that we don't want to lose. I'll Email David with a link to this posting.

Thanks,
Gary

2009-08-19, 22:52   #1199
kar_bon

Mar 2006
Germany

B70₁₆ Posts

Quote:
 Originally Posted by mdettweiler I know it may seem like I'm glossing over this stuff, but quite frankly, LLRnet doesn't let me do much more than that. It just plain doesn't log enough info. Yes, we redirect the screen output to a file, but that's essentially useless since there's no timestamps on it.
so if you need timestamps for the output, try this: http://www.mersenneforum.org/showthread.php?t=10066

i added those timestamps on the client side, which now writes output like this (i'm still using it on my clients):

Code:
[2009-08-20 00:36:21] 2013*2^235548-1 is not prime.  Res64: 0BAAB87826667E2E  Time : 61.858 sec.
[2009-08-20 00:37:23] 2013*2^235595-1 is not prime.  Res64: A6ED66F8AA9036F5  Time : 61.854 sec.
[2009-08-20 00:38:24] 2013*2^235640-1 is not prime.  Res64: 8BCA8E2B12058E30  Time : 61.950 sec.
[2009-08-20 00:39:26]
note: a result and the timestamp on the following line are a pair (i couldn't get it to work in the other order).
so you have to read it as:

Code:
[2009-08-20 00:37:23] 2013*2^235548-1 is not prime.  Res64: 0BAAB87826667E2E  Time : 61.858 sec.
[2009-08-20 00:38:24] 2013*2^235595-1 is not prime.  Res64: A6ED66F8AA9036F5  Time : 61.854 sec.
[2009-08-20 00:39:26] 2013*2^235640-1 is not prime.  Res64: 8BCA8E2B12058E30  Time : 61.950 sec.
every line has its timestamp and result.
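The re-reading described above can also be done mechanically. The following is only a sketch, assuming every log line starts with a bracketed timestamp as in the sample; the function name is made up for the example.

```python
import re
from typing import List

# "[timestamp] rest-of-line", where the rest may be empty on the final line.
LINE = re.compile(r"^\[(?P<stamp>[^\]]+)\]\s*(?P<rest>.*)$")

def realign(lines: List[str]) -> List[str]:
    """Pair each result with the timestamp printed on the line after it."""
    parsed = [LINE.match(line.rstrip("\n")) for line in lines]
    parsed = [m for m in parsed if m]           # skip lines without a timestamp
    out = []
    for cur, nxt in zip(parsed, parsed[1:]):
        if cur.group("rest"):                   # only lines that carry a result
            out.append(f"[{nxt.group('stamp')}] {cur.group('rest')}")
    return out
```

A result on the very last line is dropped, since its completion timestamp has not been written yet.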

perhaps you can change the "server.lua" the same way.


All times are UTC.
