mersenneforum.org  

2010-08-21, 01:26   #1
kar_bon

Server machine Aug. 20th crash and backups

All LLRnet and PRPnet servers have been offline for 15 minutes now!

Max could not reach the server and Gary is not available, so we can only wait for now!

2010-08-21, 01:30   #2
mdettweiler

Quote:
Originally Posted by kar_bon
All LLRnet and PRPnet servers have been offline for 15 minutes now!

Max could not reach the server and Gary is not available, so we can only wait for now!

I just got hold of Gary via text message--my guess of a thunderstorm that I mentioned in a PM turned out to be right. As of 8 minutes ago he said he's checking to confirm that the internet is out. (Not that there's much he can do if it is out besides call the cable company and wait for them to come by...)

2010-08-21, 01:40   #3
gd_barnes

Hmm. It had nothing to do with the thunderstorm. All of my other machines are internet connected and running. Jeepford just spontaneously shut down. When I tried booting it back up, it appeared to have a few disk errors and would not boot. After a few attempts at the shell, or whatever the Linux equivalent of a DOS C-prompt is, I keep getting just a little further each time. The boot originally only got to 2-3% and the last attempt got to 34% before stopping. The fsck utility appears to have corrected some errors.

Thanks for the very quick notification, Max. I'm all over it now. If it's quickly fixable with fsck or another obvious utility, I'll have it fixed within the next 15-30 mins.

Edit: My perception now is that server machines are hard on hard drives. The last one developed a few errors here and there and is still working fine as a "normal" machine. All of the constant disk writes from the PRPnet server during the rally may have stressed it to the point that it got some errors that need to be written around.

Last fiddled with by gd_barnes on 2010-08-21 at 01:42

2010-08-21, 01:45   #4
gd_barnes

For reference, here is what I get when I run the fsck utility:

"Inodes that were part of a corrupted orphan linked list found. Fix(y)?"

I then tell it yes and it seems to fix some and then hesitates for an extended period, apparently looking for more of them. I think I wasn't patient enough before and just rebooted it after the first group of errors. I'll just keep letting it find them now and then hopefully when I try the next reboot, it will work completely correctly.

Edit: Does Karsten ever sleep? lol

Last fiddled with by gd_barnes on 2010-08-21 at 01:46

2010-08-21, 02:12   #5
gd_barnes

Everything is working now.

My take on the issue: A few small hard disk errors had crept in from the tremendous amount of reading and writing of PRPnet server messages during the rally, which we were logging in an attempt to isolate the cause of the "to many connections" error. One of the errors somehow crept into some root or system file, causing the machine to shut itself down.

It took the aforementioned utility (fsck) a couple of attempts to write the system files around the bad hard drive sectors. Thankfully Linux is robust in that regard.

Sorry about the problem. Fortunately the fix wasn't too bad. It was strange to see the server machine sitting there shut down while my modem/router continued to blink away, all of my other machines were still on, and with the battery backup still in good shape and apparently never used. There was no power flicker here.

2010-08-21, 03:27   #6
mdettweiler

Hmm...interesting. I hadn't ever really considered the impact on the hard drive of all the beating it takes on a daily basis running not only the PRPnet and LLRnet servers, but also the stats DB. For each pair that's handed out, it has to write to the disk once when it's sent to the client, once when it's returned, and probably a few more times when it's imported into the DB. Plus, there's various overhead not specifically tied to the number of pairs processed, such as LLRnet's pruning and PRPnet's every-10-minute checks to see if it needs to send out any new emails.
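
As a rough illustration of the arithmetic above, the sketch below counts disk writes per day. The function name `estimated_disk_writes`, the default of 3 database-import writes per pair, and the example load of 10,000 pairs per day are all assumptions made for illustration; none of them are measured from the server.

```python
def estimated_disk_writes(pairs, db_import_writes=3, periodic_checks=0):
    """Rough count of disk writes for `pairs` pairs handed out.

    Assumes one write when a pair is sent, one when it is returned,
    and `db_import_writes` (a guess) when the result is imported into the DB.
    `periodic_checks` adds overhead that does not depend on the load,
    such as timed server checks, counted here as one write each.
    """
    per_pair = 2 + db_import_writes
    return pairs * per_pair + periodic_checks


# A check every 10 minutes is 6 per hour, or 144 per day.
checks_per_day = 24 * 6

# Hypothetical load of 10,000 pairs in a day.
print(estimated_disk_writes(10_000, periodic_checks=checks_per_day))
```

Counting each periodic check as a single write is a simplification; the point is only that the per-pair cost multiplies quickly while the fixed overhead stays small.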

While in this case everything worked out OK since the errors were minor, it is definitely a striking reminder of the need to have a backup system. At this time, none of the server stuff is being backed up on a regular basis--at least not to a location outside of the server's primary hard drive. (I believe Dave does something to back up the DB, but it just puts it elsewhere on the same disk.)

I think it's high time I started looking into options for backing up the server to some kind of external location... Possibly an external (USB?) hard drive for regular backups, with a DVD or something similar for periodic (monthly?) backups of that.
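
A regular backup to an external drive along these lines could be sketched as follows. This is only a minimal sketch: the function name `backup_once`, the `snapshot-` folder naming, and the retention count are all hypothetical, and a live database would normally need to be dumped or stopped before its files are copied.

```python
import shutil
import time
from pathlib import Path


def backup_once(source: Path, dest_root: Path, keep: int = 7) -> Path:
    """Copy `source` into a new timestamped folder under `dest_root`.

    Keeps only the newest `keep` snapshots and returns the new snapshot path.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    dest_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_root / f"snapshot-{stamp}"
    # Avoid a name clash if two backups run within the same second.
    suffix = 1
    while target.exists():
        target = dest_root / f"snapshot-{stamp}-{suffix:03d}"
        suffix += 1
    shutil.copytree(source, target)

    # Snapshot names sort oldest first, so drop everything before the last `keep`.
    snapshots = sorted(
        p for p in dest_root.iterdir() if p.name.startswith("snapshot-")
    )
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return target
```

It could be run from a scheduled job, for example `backup_once(Path("/srv/stats"), Path("/mnt/usb/backups"))`, where both paths are placeholders. Keeping several dated snapshots rather than one overwritten copy means a backup taken after corruption set in does not destroy the last good one.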

2010-08-21, 03:50   #7
AMDave

Max, please refer to my email of 31-05-2010 and respond to that email.
You need to tell me which destination I can copy off to.
Thx.

2010-08-21, 04:14   #8
gd_barnes

You didn't mention the most obvious thing that might have caused the problem: the fact that we were doing a huge amount of logging for the PRPnet server, resulting in hundreds of MB of file writes.

2010-08-21, 04:43   #9
mdettweiler

Quote:
Originally Posted by gd_barnes
You didn't mention the most obvious thing that might have caused the problem: the fact that we were doing a huge amount of logging for the PRPnet server, resulting in hundreds of MB of file writes.

Yes, that too. At any rate, it accelerated the degeneration of the disk such that a few semi-critical errors (critical enough to keep the system from booting right away) showed up over the course of a few days. Still, even normal server operation is hard on a disk over the long term, as you observed on humpford (which didn't even have the stats DB on it).

Despite the huge log files it creates, I do prefer to keep all the PRPnet servers on maximum debug level (which, while verbose, is not quite as verbose as the special version we used during the rally). Those elusive high-load bugs that only show up once in a while are almost impossible to catch otherwise--to apply an analogy you used a few days ago in a PM, it's like a car problem that suddenly stops happening when you take the car into the shop.

And besides all this, we really are quite overdue to get a real external backup system set up for jeepford: all it would take is one dead hard drive (which is not an entirely uncommon scenario even in a non-server setting) for us to lose the entire stats DB (and of course the active contents of the servers as well, though that's not as hard to recover from).

Last fiddled with by mdettweiler on 2010-08-21 at 04:45

2010-08-21, 07:26   #10
kar_bon

Quote:
Originally Posted by gd_barnes
Edit: Does Karsten ever sleep? lol

I was just about to go to sleep when I heard the beep sound!
I had hoped I'd been lucky and found a prime, but it was 'only' that issue, so I stayed online a while, waiting for the server to come back.
I continued with another local effort in the meantime, so there was not much time left.

But now, after 5 hours of sleep, I've rested enough!

2010-08-21, 12:19   #11
Flatlander

Quote:
Originally Posted by mdettweiler
...
And besides all this, we really are quite overdue to get a real external backup system set up for jeepford: all it would take is one dead hard drive (which is not an entirely uncommon scenario even in a non-server setting) for us to lose the entire stats DB (and of course the active contents of the servers as well, though that's not as hard to recover from).

I know you guys are the experts but please tell me there are multiple backups of all the NPLB and CRUS results and sieve files.

I've had about 4 HD failures here in 10 years.

All my important data here is on 4 HDs on 3 PCs, then it is automatically backed up online. 678GB so far.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.