2009-08-19, 22:04   #1196
A Sunny Moo
mdettweiler
Aug 2007

3·2,083 Posts

Originally Posted by gd_barnes
They kept crashing, Max. Please don't use the phrase "couple of times" when you don't know how many times. Just look in the restart.txt file. There are multiple crashes. I looked at 3 AM CDT this morning and I saw that both had crashed again within the last few hours and were automatically restarted with the loop thing.

You are bound and determined to gloss over this whole issue without doing a detailed look at the exact times and matching up when the rejected results were originally handed out. I took 2 hours last night to do that for you now. How about looking into it this time please?

Please calculate when the 26 rejected results were originally handed out today. I saved them off under an obvious file name. Like I said, I only had time to look at the first 2-3 and those were handed out at 09:55-10:00 CDT on Aug. 18th. Simply take the time that the original result was returned and subtract the # of seconds that it took to return it.

I'm not going to back off on this until we nail it down. I nailed down 10 rejected results to the original power outage. The other 46 still have no explanation. How do we know that they were as a result of yet another crash? We don't. We need to match up exact crash times with times in which the original pairs were handed out.

We seem to have gotten into this habit of glossing over these server problems and that habit needs to end.

I don't know if this will help but it can't hurt:

On port G8000 only, please increase the JobMaxTime to 2 days.

Please tell me how you safely stop the server to do this. If you can let me know how that is done, then I'll do it if it is needed in the future. I know how to change the JobMaxTime and to restart it but don't want to create a problem when I stop it.

Karsten, can we talk you into returning pairs normally instead of ~100 at a time about twice a day? If you need to do so many at a time, how about you write a script to do ~20 each hour for 5 hours or something like that? That may help some.

Okay. I see that G7000 has restarted a number of times, while G4000 and G8000 have not. Note that there are no timestamps in the restart log file, so there's no way to know exactly when the crashes occurred.

As for the rejected results, here's a tally of how many were handed out when:
- 10 rejected from around 11:00 CDT, 8/18
- 20 rejected from kar_bon around 15:39 CDT, 8/18
- 21 rejected from kar_bon around 00:37 CDT, 8/19
- 5 rejected from around 02:23 CDT, 8/19
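
As a quick sanity check on the tally above, here is a minimal Python sketch that just adds up the four batches. The counts are copied from the list; the dictionary labels are my own shorthand, not anything the server writes out.

```python
# Counts copied from the tally above; labels are shorthand for "date ~time CDT".
rejected = {
    "8/18 ~11:00": 10,
    "8/18 ~15:39 (kar_bon)": 20,
    "8/19 ~00:37 (kar_bon)": 21,
    "8/19 ~02:23": 5,
}

# Total rejected results across all four batches.
total = sum(rejected.values())

# Subtotal for the two batches attributed to kar_bon.
from_kar_bon = sum(n for label, n in rejected.items() if "kar_bon" in label)

print(total)         # 56
print(from_kar_bon)  # 41
```

The total of 56 agrees with the 10 + 46 figure in the quoted post, so at least no rejected results are missing from the tally.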

Note that we can't tell exactly when these were handed out because they were (like many rejected results) listed with a time of 0.0 sec. Correlating with times on the same k/n pairs in the main results files is not helpful in this case, since those could very well have been assigned at a different time.
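
To make the limitation concrete, here is a rough Python sketch of the "return time minus elapsed seconds" calculation. The function name and the sample values are hypothetical and are not taken from the actual results files; the point is only that an elapsed time of 0.0 sec leaves nothing to subtract.

```python
from datetime import datetime, timedelta
from typing import Optional


def handout_time(returned_at: datetime, elapsed_seconds: float) -> Optional[datetime]:
    """Estimate when a pair was handed out: return time minus elapsed seconds.

    Returns None when the logged elapsed time is 0.0 (or negative),
    because in that case the hand-out time cannot be inferred.
    """
    if elapsed_seconds <= 0:
        return None
    return returned_at - timedelta(seconds=elapsed_seconds)


# Hypothetical values for illustration only.
returned = datetime(2009, 8, 18, 15, 39, 0)

# A result that took 5 h 39 min (20340 sec) would have been handed out at 10:00.
print(handout_time(returned, 20340.0))  # 2009-08-18 10:00:00

# A rejected result logged with 0.0 sec gives us nothing to work with.
print(handout_time(returned, 0.0))      # None
```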

Note that all of these rejected results are from G8000, a server which did not crash at all. Thus, we can't even circumstantially correlate these with any particular crashes. Even if it had crashed, we still wouldn't be able to tell when, because the time and date of restarts aren't logged.

I know it may seem like I'm glossing over this stuff, but quite frankly, LLRnet doesn't let me do much more than that. It just plain doesn't log enough info. Yes, we redirect the screen output to a file, but that's essentially useless since there are no timestamps in it. Because of this, most server glitches simply have to be glossed over, because any further investigation would just waste a lot of time on a problem we don't have enough information to pinpoint. The glitches usually (as is the case now) will be handled by the server through its normal processes of expiry and reassignment; there's just nothing more we can do except let it run its course.
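
For what it's worth, one possible workaround for the missing timestamps would be to run the redirected output through a small filter that stamps each line as it arrives. This is only a sketch of that idea in Python, not something we have set up; `stamp_lines` is a made-up name, and it assumes the output can be read line by line.

```python
from datetime import datetime
from typing import Callable, Iterable, Iterator


def stamp_lines(
    lines: Iterable[str],
    now: Callable[[], datetime] = datetime.now,
) -> Iterator[str]:
    """Yield each input line prefixed with the time at which it was read."""
    for line in lines:
        stamp = now().strftime("%Y-%m-%d %H:%M:%S")
        yield f"[{stamp}] {line.rstrip()}"


# Demo with made-up log lines; the text is illustrative, not real server output.
sample = ["server started\n", "server restarted\n"]
for stamped in stamp_lines(sample):
    print(stamped)
```

In real use the `lines` argument would be `sys.stdin`, with the server's screen output piped into the script and the script's output redirected to the log file. The `now` parameter is only there so the clock can be swapped out for testing.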

This is one of the reasons why PRPnet will be very, very nice when it's all ready for production use. It keeps very detailed logs that are of great help when tracking down problems of any sort.

As for the jobMaxTime, unfortunately it's a rather difficult process to stop the server once it's in the loop. I can do it, but in order to verify that the server has actually stopped correctly, I have to do a number of "geek things" that would be really, really hard to explain. Ditto for restarting with the whole loop thing. I've just now changed G8000 to 2 days jobMaxTime; if you need any such changes performed while the servers are in the loop thingy, let me know and I'll do it as soon as I possibly can. I'd love to tell you how to do it so that it isn't dependent on my availability, but quite frankly, as I said, that may be a bit difficult.
