#1178 |
May 2007
Kansas; USA
10740₁₀ Posts |
Karsten,

As you probably noticed, I haven't had time to look into the problems with the missing pairs on port G8000. I looked at it briefly last night but found it was going to take quite a while to determine exactly what happened. It is on my "short term" to-do list now. Once I verify exactly what is missing and see if I can figure out what happened, I'll stop the server, add the pairs in the correct n-value order to knpairs.txt, and restart it. They should then be handed out immediately.

I also want to see if I can figure out why all of those k/n pairs got rejected yesterday. I didn't immediately see them in the results.txt file. That is a different situation from the 2 rejected results for port G7000. For that server, I was quickly able to find that the pairs were already in results.txt, so the server must have handed the pairs out twice, with the short power blip the likely culprit.

Here is what I speculate may have happened on them. Please note that this is very much a guess. Something similar, though not identical, to what occurred on G7000 occurred on port G8000. But in the port G8000 case, I think it may have already removed the pairs from both knpairs.txt and joblist.txt because it was "in the process" of receiving the results right as the power blip occurred. Therefore the results will never get into results.txt...not a good scenario.

If that is the case, what we'll need to do is the following so that Karsten and whoever else receive credit for the already processed pairs but do not have to crunch them again:

1. Add the pairs back to knpairs.txt and to joblist.txt. In joblist.txt, make sure that the ID of the person who originally worked on the pairs is correct.
2. Remove the pairs from the rejected results file.
3. Convert the rejected results to the one-line "client format", that is, the format that the client sends to the server when it is done with a pair.
4. Send the client-formatted results from #3 back to the server.

I could do this myself. If the joblist.txt file is properly updated in #1, even if I send them to the server, the correct people will get the credit that they should without crunching the pairs again.

Max,

Can you do me a favor? Can you post the file here that you originally loaded into port G8000 up to n=970K? (I think you loaded in specifically the n=900K-970K range.) I want to check for oddball carriage control characters and other such nonsense that seems to occasionally cause the servers to simply skip over k/n pairs. I also want to make sure the missing pairs were in the file to begin with.

One good thing about having all servers on my machines that I'm sure Karsten will like...I can do specific tweaks like this to quickly account for and give credit for missing or rejected pairs and results.

Gary

Last fiddled with by gd_barnes on 2009-08-18 at 17:47
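The check for "oddball carriage control characters" could be sketched roughly as follows. This is only an illustration: it assumes each pair sits on its own line as two decimal integers separated by a single space (the shape of the pair lists posted in this thread), which may not match the real knpairs.txt layout, and the function name `find_suspect_lines` is made up. Note that it flags every carriage return, which may be perfectly normal if the file uses Windows line endings throughout.

```python
import re

# Assumed line shape: "k n", two decimal integers and one space.
# The real knpairs.txt layout (e.g. any header line) is not confirmed here.
PAIR_LINE = re.compile(rb"\d+ \d+")


def find_suspect_lines(data: bytes):
    """Return (line_number, raw_bytes) for lines that are not clean "k n" pairs."""
    suspects = []
    # Split on LF only, so a stray CR stays attached to its line and gets flagged.
    for number, raw in enumerate(data.split(b"\n"), start=1):
        if raw == b"":
            continue  # skip empty lines, including the one after a final newline
        if not PAIR_LINE.fullmatch(raw):
            suspects.append((number, raw))
    return suspects


# Hypothetical sample: line 2 ends in a carriage return, line 3 uses a tab.
sample = b"333 957578\n315 957595\r\n315\t971061\n339 971061\n"
for number, raw in find_suspect_lines(sample):
    print(number, repr(raw))
```

Working on raw bytes rather than decoded text keeps invisible characters visible in the `repr` output, which is the point of the exercise.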
#1179 | |
A Sunny Moo
Aug 2007
USA (GMT-5)
6249₁₀ Posts |
Quote:
I don't have the original file that I loaded in any longer, though I do have the master file I made for the 1st 6-k minidrive, covering all of 600K-1M. That's where I pulled the data from to load into the server, so if you'd like that, I can send it to you.

Actually, quite frankly, since there were only a couple of these rejected pairs, I figured it would be by far the easiest to simply let it go for now, and then, when the entire range is processed at the end, find out exactly which k/n pairs are missing and redo them at that time. Otherwise, there's an enormous potential for messing things up much, much more than they are now (currently only a minor problem that seems to have affected only two k/n pairs). Believe me, I've learned that lesson over and over again with some of the mess-ups we've had before on PRPnet G3000; even though that was PRPnet rather than LLRnet, the basic idea that trying to "fix" these things manually leads to a big mess still applies.

Max
#1180 | |
Mar 2006
Germany
2³·3²·41 Posts |
Quote:
Code:
user=gd_barnes [2009-08-17 16:31:04] 345*2^957466-1 is not prime. Res64: B3595697CA967AE7 Time : 1202.0 sec.

i just sent some results to port GB8000 but i got neither of the 2 remaining missing pairs. perhaps they've been reserved by marco.bs because he has done some in the last hour. could someone check this in the joblist, please?!
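For anyone matching result lines against the pair lists by script, the line quoted above can be split into fields with a regular expression. This is a minimal sketch based only on that one sample line; the pattern, the function name `parse_result_line`, and the field names are assumptions, and lines for primes or other outcomes are not covered.

```python
import re
from datetime import datetime

# Pattern inferred from the single "is not prime" line quoted in this thread.
RESULT_LINE = re.compile(
    r"user=(?P<user>\S+) "
    r"\[(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<k>\d+)\*2\^(?P<n>\d+)-1 is not prime\. +"
    r"Res64: (?P<res64>[0-9A-F]{16}) +"
    r"Time : (?P<secs>\d+(?:\.\d+)?) sec\."
)


def parse_result_line(line: str):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = RESULT_LINE.fullmatch(line.strip())
    if match is None:
        return None
    return {
        "user": match["user"],
        "returned": datetime.strptime(match["stamp"], "%Y-%m-%d %H:%M:%S"),
        "k": int(match["k"]),
        "n": int(match["n"]),
        "res64": match["res64"],
        "secs": float(match["secs"]),
    }


line = ("user=gd_barnes [2009-08-17 16:31:04] 345*2^957466-1 is not prime. "
        "Res64: B3595697CA967AE7 Time : 1202.0 sec.")
print(parse_result_line(line))
```

Returning `None` for anything unexpected makes it easy to list the lines the pattern does not understand instead of silently skipping them.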
#1181 | |
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts |
Quote:
Anyway, this would seem to confirm that there are in fact no pairs missing in the server; as long as they're listed in knpairs.txt (which is how they'd get on the status page), then we're good. All we have to do is let them expire naturally, and they'll be reassigned and dealt with.
#1182 | |||
May 2007
Kansas; USA
10100111110100₂ Posts |
Quote:
Quote:
Quote:
I have a question for you and I want you to answer it without looking at my servers: Is the JobMaxTime on my servers 1 day or 3 days? After answering, check the servers and see if you are right. If not, please post the correct JobMaxTime.

First, your 2 statements above could not possibly both be true. It's one or the other, but not both. Think about and look closely at the timing of things and you'll figure it out.

Second, there are 30 rejected results today (Tuesday)! The first was at 11:00 AM CDT and the last at 15:39 CDT, and you responded at 15:03 CDT with your post saying that there are only a "couple of rejected pairs". PLEASE STOP saying that there is nothing wrong and that it will work its way out!! There is definitely something wrong and I am looking into it now.

I've asked this before: Please slow down when responding. If you can't take 15-30 mins. to analyze a technical problem in great detail, then don't respond at all. It only confuses matters more. If you don't have time, ask me to do a detailed analysis on it and I will.

BTW, you are correct on one thing. I failed to return those pairs on port G8000. I'll do that now.

Thank you,
Gary
#1183 | ||||
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts |
Quote:
Quote:
Quote:
Quote:
As long as there aren't any pairs that both a) aren't in any results files and b) aren't in knpairs.txt, then everything is normal, since any possible "problems" are being handled through the server's normal process of expiry, reassignment, and rejection. Any interference in a situation like that is likely to make things much, much worse and end up dropping a large # of results through the cracks. The best way to proceed is to simply let the server do its thing.

Max
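The criterion above (a pair is only a real problem if it is neither in a results file nor in knpairs.txt) amounts to a set difference. A minimal sketch, assuming each file has already been reduced to (k, n) tuples; the function name `unaccounted_pairs` is made up, and the way the sample pairs are split between the sets is purely illustrative and does not reflect the real server state.

```python
def unaccounted_pairs(sieve_pairs, knpairs, results):
    """Pairs from the sieve file that are neither still pending nor completed."""
    return set(sieve_pairs) - set(knpairs) - set(results)


# Illustrative data only: (k, n) values borrowed from this thread,
# arranged arbitrarily to show one pair falling through the cracks.
sieve = [(345, 957466), (333, 957578), (315, 957595), (321, 971105)]
pending = [(333, 957578), (315, 957595)]      # stands in for knpairs.txt
completed = [(345, 957466)]                   # stands in for the results files

print(sorted(unaccounted_pairs(sieve, pending, completed)))
```

An empty result means every pair is either waiting to be handed out or already has a result, which is the "everything is normal" case described above.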
#1184 |
May 2007
Kansas; USA
2²×3×5×179 Posts |
Odd. Then why were my pairs still sitting in knpairs.txt from 30 hours ago as of right before I returned them to the server 15 mins. ago? They were all retrieved at 16:11 CDT on 8/17 and never returned.
This seems to imply a JobMaxTime of longer than 1 day. This is likely making a mountain out of a molehill, but it's more me trying to understand the quirks of these LLRnet servers.

Last fiddled with by gd_barnes on 2009-08-19 at 04:52
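The timing argument above can be checked with simple date arithmetic. The sketch below assumes, as a simplification, that a job counts as expired once JobMaxTime has fully elapsed since it was handed out; how the LLRnet server actually applies JobMaxTime is not confirmed here, and `has_expired` is a made-up helper name.

```python
from datetime import datetime, timedelta


def has_expired(assigned: datetime, checked: datetime, job_max_days: float) -> bool:
    """True if at least job_max_days have passed between hand-out and the check."""
    return checked - assigned >= timedelta(days=job_max_days)


# Pairs retrieved at 16:11 CDT on 8/17 and looked at roughly 30 hours later.
assigned = datetime(2009, 8, 17, 16, 11)
checked = assigned + timedelta(hours=30)

for days in (1, 3):
    print(f"JobMaxTime={days} day(s): expired={has_expired(assigned, checked, days)}")
```

Under this simplified model a 1-day limit would already have passed at the 30-hour mark while a 3-day limit would not, which is consistent with the reasoning in the post.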
#1185 |
May 2007
Kansas; USA
29F4₁₆ Posts |
I have matched up all results in port 8000 vs. the original sieve file for n>=900K as of Aug. 19th at 12:01 AM CDT (5 AM GMT).

Conclusion 1: All pairs are accounted for. That is, all pairs are either still in the knpairs.txt file or have had results returned for them.

Conclusion 2: We do not know why the pairs that I reserved > 30 hours ago have not been reassigned yet. We will keep an eye on that.

Conclusion 3: Pairs were handed out twice right after the time of the power outage around 16:00-16:30 CDT on Aug. 17th, causing 10 rejected results. This is understandable.

Conclusion 4: Pairs were handed out twice around 09:50-10:00 CDT on Aug. 18th for an unknown reason, causing at least 46 rejected results. This is bad and warrants further investigation.

Clarification: The hand-out times shown are for the 1st time each pair was handed out. It is not known when they were handed out the 2nd time, because the rejected.txt file does not show how much time was taken to return the pair.

Explanations: Pairs reserved by me but not tested, and returned to the server 30 hours later as of ~1-2 hours ago. These should be handed out at any time now. If not, then there is a problem with the server.

Code:
333 957578
315 957595
315 971061
339 971061
327 971068
333 971070
339 971075
345 971076
333 971079
345 971080
327 971082
339 971083
333 971094
321 971105
327 971106
345 971106
333 971355
345 971566
315 971575
345 971577

Code:
333 970326 10:59:52 marco.bs
315 970327 10:59:52 marco.bs
339 970327 15:38:58 kar_bon
315 970334 15:38:59 kar_bon
333 970334 15:38:59 kar_bon
321 970337 15:39:00 kar_bon
333 970343 10:59:53 marco.bs
315 970347 15:39:00 kar_bon
327 970352 15:39:01 kar_bon
327 970361 15:39:01 kar_bon
315 970366 15:39:02 kar_bon
327 970404 11:00:45 marco.bs
315 970419 11:00:47 marco.bs
339 970419 11:00:47 marco.bs
345 970419 11:00:47 marco.bs
321 970422 11:00:47 marco.bs
315 970426 11:00:47 marco.bs
327 970429 11:00:48 marco.bs
333 970410 15:39:02 kar_bon
333 970415 15:39:03 kar_bon
345 970996 15:39:20 kar_bon
321 970997 15:39:21 kar_bon
339 970999 15:39:21 kar_bon
333 971004 15:39:22 kar_bon
339 971005 15:39:22 kar_bon
315 971014 15:39:23 kar_bon
345 971019 15:39:23 kar_bon
315 971027 15:39:24 kar_bon
339 971029 15:39:24 kar_bon
315 971030 15:39:25 kar_bon

Code:
333 970326 16:34:25 Aug. 17th gd_barnes 1224 secs = handed out 16:14:01 on 17th
315 970327 16:34:28 Aug. 17th gd_barnes 1215 secs = handed out 16:14:13 on 17th
339 970327 10:59:52 Aug. 18th kar_bon 3855 secs = handed out 09:55:37 on 18th
315 970334 10:59:52 Aug. 18th kar_bon 3854 secs = handed out 09:55:38 on 18th
333 970334 10:59:53 Aug. 18th kar_bon 3855 secs = handed out 09:55:38 on 18th
321 970337 10:59:53 Aug. 18th kar_bon 3855 secs = handed out 09:55:38 on 18th
333 970343 16:34:46 Aug. 17th gd_barnes 1220 secs = handed out 16:14:26 on 17th
315 970347 10:59:53 Aug. 18th kar_bon 3854 secs = handed out 09:55:39 on 18th
327 970352 10:59:54 Aug. 18th kar_bon 3855 secs = handed out 09:55:39 on 18th
327 970361 10:59:54 Aug. 18th kar_bon 3854 secs = handed out 09:55:40 on 18th
315 970366 10:59:54 Aug. 18th kar_bon 3854 secs = handed out 09:55:40 on 18th
327 970404 16:35:09 Aug. 17th gd_barnes 1225 secs = handed out 16:14:44 on 17th
333 970410 11:00:46 Aug. 18th kar_bon 3906 secs = handed out 09:55:40 on 18th
333 970415 11:00:46 Aug. 18th kar_bon 3905 secs = handed out 09:55:41 on 18th
315 970419 09:55:48 Aug. 18th kar_bon 62788 secs = handed out 16:29:20 on 17th
339 970419 09:55:49 Aug. 18th kar_bon 62788 secs = handed out 16:29:21 on 17th
345 970419 09:55:49 Aug. 18th kar_bon 62788 secs = handed out 16:29:21 on 17th
321 970422 09:55:50 Aug. 18th kar_bon 62788 secs = handed out 16:29:22 on 17th
315 970426 09:55:51 Aug. 18th kar_bon 62789 secs = handed out 16:29:22 on 17th
327 970429 09:55:52 Aug. 18th kar_bon 62790 secs = handed out 16:29:22 on 17th
345 970996 10:59:56 Aug. 18th kar_bon 3805 secs = handed out 09:56:31 on 18th
321 970997 10:59:56 Aug. 18th kar_bon 3804 secs = handed out 09:56:32 on 18th
339 970999 10:59:57 Aug. 18th kar_bon 3805 secs = handed out 09:56:32 on 18th
333 971004 10:59:57 Aug. 18th kar_bon 3804 secs = handed out 09:56:33 on 18th
339 971005 10:59:58 Aug. 18th kar_bon 3805 secs = handed out 09:56:33 on 18th
315 971014 10:59:59 Aug. 18th kar_bon 3806 secs = handed out 09:56:33 on 18th
345 971019 10:59:59 Aug. 18th kar_bon 3805 secs = handed out 09:56:34 on 18th
315 971027 13:59:03 Aug. 18th kar_bon 14549 secs = handed out 09:56:34 on 18th
339 971029 13:59:03 Aug. 18th kar_bon 14549 secs = handed out 09:56:34 on 18th
315 971030 13:59:04 Aug. 18th kar_bon 14549 secs = handed out 09:56:35 on 18th

Is there a server log that shows what might have happened? It appears that something is not being saved off to the joblist.txt file correctly. There was no power outage at that time.

Gary

Last fiddled with by gd_barnes on 2009-08-19 at 07:32 Reason: done editing now
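The "handed out" column in the last listing above is the return timestamp minus the reported test time. A small sketch of that arithmetic, assuming the seconds figure is wall-clock time since hand-out (which is how the listing treats it) and taking the year 2009 from the post dates; `handed_out` is a made-up helper name.

```python
from datetime import datetime, timedelta


def handed_out(returned: datetime, elapsed_secs: int) -> datetime:
    """Estimate the hand-out time as return time minus reported test time."""
    return returned - timedelta(seconds=elapsed_secs)


# Three rows copied from the listing: (k, n, return time, seconds).
rows = [
    (333, 970326, datetime(2009, 8, 17, 16, 34, 25), 1224),
    (339, 970327, datetime(2009, 8, 18, 10, 59, 52), 3855),
    (315, 970419, datetime(2009, 8, 18, 9, 55, 48), 62788),
]

for k, n, returned, secs in rows:
    stamp = handed_out(returned, secs).strftime("%H:%M:%S on %b %d")
    print(f"{k} {n} handed out {stamp}")
```

The third row shows why the date matters: 62788 seconds is more than 17 hours, so the subtraction crosses midnight and lands on Aug. 17th, inside the power-outage window.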
#1186 |
Mar 2006
Germany
2³×3²×41 Posts |
just returned some results for GB8000 and was assigned the two outstanding pairs at n=957k!
so i can return those results in about 8 hours (home again).
#1187 |
May 2007
Kansas; USA
2²×3×5×179 Posts |
There are 26 more rejected results from Karsten on Aug. 19th as of 2:30 AM CDT (7:30 AM GMT).

I checked the original results for the first 2-3 pairs and they were all originally assigned the first time, once again, between 09:55 and 10:00 on Aug. 18th.

Karsten, for any pairs that your machines were assigned the FIRST TIME around 14:55-15:00 GMT on Aug. 18th, you'll likely wind up with a second assignment of the same pair that is rejected. I hope that you didn't cache 100 pairs at that time.

At the time of this post, we now have 56 rejected results in the last 24 hours. Only 10 of those can directly be associated with the power outage. The other 46 were originally assigned in the Aug. 18th 09:55-10:00 CDT time frame.

I suspect this is an unfixable LLRnet bug. The server may have simply become a little unstable after the outage. Hopefully it has worked its way through.

My battery power backup should be here by Friday. Hopefully these kinds of problems will be history after that.

Gary

Last fiddled with by gd_barnes on 2009-08-19 at 07:29
#1188 |
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts |
Ah, that would make sense. Both G8000 and G7000 crashed a couple of times after the outage, as always seems to happen after an outage. They seem to have stabilized now, though I've put them in a loop so that they'll restart if they do crash again.
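The actual restart loop used on these servers is not shown in the thread (it may well be a simple shell or batch script). As a rough sketch of the idea only, the following relaunches a command whenever it exits with an error; the function name `run_with_restart`, the restart cap, and the stand-in command are all assumptions, and the real server command line is not known here.

```python
import subprocess
import sys
import time


def run_with_restart(command, max_restarts=3, delay_secs=5.0):
    """Run command, relaunching it after each non-zero exit.

    Returns the number of launches. Stops on a clean exit (code 0) or once
    max_restarts relaunches have been used.
    """
    launches = 0
    while True:
        launches += 1
        exit_code = subprocess.call(command)
        if exit_code == 0 or launches > max_restarts:
            return launches
        time.sleep(delay_secs)  # brief pause so a crash loop does not spin


# Stand-in for the server command: a child process that always "crashes".
crashing = [sys.executable, "-c", "raise SystemExit(1)"]
print(run_with_restart(crashing, max_restarts=2, delay_secs=0))
```

The cap on restarts is a design choice for the sketch: an unbounded loop keeps the server up, but it can also hide a server that crashes immediately on every start.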