mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   No Prime Left Behind (https://www.mersenneforum.org/forumdisplay.php?f=82)
-   -   Server outrages (https://www.mersenneforum.org/showthread.php?t=13840)

AMDave 2014-04-02 09:01

The NPLB server IP address changed about 3 hours ago. The server's DNS updater correctly updated the DNS reference with the domain host within an hour.

However, the global DNS cache updates are taking some time to cascade - they can take 3 to 4 hours. As a result some of you may be experiencing connection issues.

These issues should be resolved automatically within the next 2 hours.
I am monitoring the update and will confirm when resolved.
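
For anyone who wants to watch a DNS change like this from the client side, a minimal sketch follows. It assumes Python is available; the function name, the retry count, and the delay are illustrative choices and not part of the NPLB setup. It only reports what the local resolver returns, so it says nothing about what other users' caches hold.

```python
import socket
import time
from typing import Callable, Optional


def wait_for_dns(hostname: str,
                 expected_ip: str,
                 resolve: Callable[[str], str] = socket.gethostbyname,
                 attempts: int = 12,
                 delay_seconds: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Poll until hostname resolves to expected_ip.

    Returns the 1-based attempt number on which the new address was seen,
    or None if it never appeared within the allowed attempts.
    """
    for attempt in range(1, attempts + 1):
        try:
            if resolve(hostname) == expected_ip:
                return attempt
        except OSError:
            # The lookup failed outright; treat it like a stale answer and retry.
            pass
        if attempt < attempts:
            sleep(delay_seconds)
    return None
```

The `resolve` and `sleep` parameters exist only so the loop can be exercised without touching the network; with the defaults it polls every 10 minutes for up to about 2 hours.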

AMDave 2014-04-02 09:17

RESOLVED:

NPLB server connectivity confirmed fixed by the automated processes in place.
I have made an optimal adjustment to reduce the minimum length of time required to complete the change from 3 hours down to 1.5 hours, for next time. ***

I have been monitoring the logs hourly since Nov-2013 to confirm this change-over. (see post on previous page).
Now that the process has proven itself, admin log-file and connectivity monitoring will return to daily.

Thank you for your patience.

AMDave

(*** allowing for S.T.P., how many fingers you can cross and whether I am monitoring the change at the time or not :P )

gd_barnes 2014-04-03 06:38

Thank you for the excellent work Dave! You have been indispensable on this project. This of course happened while I was out of town and so I could not have dealt with it until next week.

AMDave 2014-04-16 20:23

You are most welcome.

Complete backup set offsite download completed for 2014-04-16. (4.8GB)

AMDave 2014-05-01 11:48

The server appears offline - for me anyway.
Investigating.

AMDave 2014-05-01 11:54

Unplanned NPLB server outage.

Unable to connect to the server.
Not a DNS issue. The IP address has not changed.
It may be a local power fault.
Requires attention from gd_barnes.

gd_barnes 2014-05-01 19:44

Fixed. Equipment issue. Sorry about the problem.

AMDave 2014-05-04 01:32

Complete backup set offsite download completed for 2014-05-02. (4.8GB)

AMDave 2014-05-17 23:58

Complete backup set offsite download completed for 2014-05-16. (4.8GB)

AMDave 2014-06-16 08:36

Complete backup set offsite download completed for 2014-06-13. (4.8GB)

mdettweiler 2014-06-20 01:43

PRPnet port 9000 will be going down shortly, for up to an hour, while I upgrade it to PRPnet v4.3.6. :smile:

See [url=http://www.mersenneforum.org/showthread.php?p=376266#post376266]here[/url] for more details. This is a simple upgrade that we've done many times before and it'll only be putting this port on par with all the others. (Gary approved this upgrade more than a year ago, maybe two years, I just haven't done it yet. That'll be fixed now... :ermm:)

mdettweiler 2014-06-20 02:08

...and it's all done. :smile: As expected, no problems so far. Everything is fully backed-up, so if we run into any problems we can roll back. (I don't expect any problems, since we've been running this version without issue for upwards of 2.5 years on multiple other servers.)

@Gary: FYI, the backup files are located at:
[LIST]
[*]Database backup: [FONT="Courier New"]/home/max/prpnet9000-db-backup-20140619-2046.sql.lz[/FONT] (use "[font="Courier New"]plzip -d [filename][/font]" to decompress; use "[font="Courier New"]sudo su max[/font]" and enter your password to log in as me if it says you don't have permission to read the file)
[*]prpnet9000 directory backup: the folder "[font="Courier New"](DELETE AFTER 1 WEEK IF NO PROBLEMS)prpnet9000-backup-20140619[/font]" on your desktop - name is self-explanatory :smile:
[/LIST]

mdettweiler 2014-06-29 01:38

Server is down
 
All noprimeleftbehind.net services appear to be down presently - I can't reach the web site, any of the PRPnet servers, or SSH.

The most likely explanations are:
[LIST]
[*]Server IP change that needs to be updated in DNS (though I think Dave might have that set up automatically now)
[*]Power outage at Gary's house, in which case we probably won't hear from him until it's back up
[/LIST]

AMDave 2014-06-29 03:02

NPLB unplanned outage confirmed.
Either a power or comms outage at the server location.
Commenced shortly after Sat, 28 Jun 2014 19:11:59 -0500.

Lennart 2014-06-29 04:30

[QUOTE=AMDave;376975]NPLB unplanned outage confirmed.
Either a power or comms outage at the server location.
Commenced shortly after Sat, 28 Jun 2014 19:11:59 -0500.[/QUOTE]


It started with the servers: I could not reach them, but I could reach all other webpages.

Lennart

mdettweiler 2014-06-29 08:15

[QUOTE=Lennart;376977]It started with the servers: I could not reach them, but I could reach all other webpages.

Lennart[/QUOTE]
Maybe the web pages were still cached in your browser?

Given the time of year (thunderstorm season in the U.S.), my bet is on it being a power outage...but, I can only speculate until we hear from Gary. We've had all sorts of crazy stuff happen to the servers in the past (I can scarcely believe the number of motherboard blowouts Gary's had - both on his AMDs and Intels, which were bought at different times, and even many of the replacement boards*).

[SIZE="1"]*To be fair, the failure of the Intel boards can be attributed directly to my own bad advice; when I recommended the components to Gary for those builds, I had neglected to check the motherboard manufacturer's website to make sure it could handle the right wattage. Turns out, that model was only designed for Core 2 Duos and lower-wattage Core 2 Quads, of which the Q6600's we were using were not one. I'd put together one of my own builds from the same parts at the same time, and it blew out in the exact same fashion. Moral of the story: always check the manufacturer's website, the label "Core 2 Duo/Quad" on Newegg is not enough. :smile:

As for the AMD boards, I have no idea why we had such bad luck on them, except that both Gary and I have now learned well to avoid buying the [I]very[/I] cheapest motherboards that meet the requirements. :rolleyes:[/SIZE]

mdettweiler 2014-06-29 20:19

BTW @Gary: when you turn the PRPnet servers back on, you might want to set the reservation time limit to 1 week temporarily. IIRC, PRPnet refuses results returned after the deadline (even if they could otherwise be accepted) - this wasn't too big a deal in the past but now that we're doing much larger tests (particularly at CRUS) we'll want to watch this.
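
To make the deadline concern concrete, here is a minimal sketch of the arithmetic, assuming the server simply compares the time a pair has been out against the reservation limit. This is not PRPnet's actual code; the function name and the 3-day "normal" limit used in the example are hypothetical.

```python
from datetime import datetime, timedelta


def result_is_on_time(assigned_at: datetime,
                      returned_at: datetime,
                      limit: timedelta) -> bool:
    """True if a result comes back within the reservation time limit."""
    return returned_at - assigned_at <= limit


# Hypothetical example: a pair handed out just before an outage and
# returned 5 days later, once the server is reachable again.
assigned = datetime(2014, 6, 28, 12, 0)
returned = datetime(2014, 7, 3, 12, 0)

late_under_short_limit = not result_is_on_time(assigned, returned, timedelta(days=3))
fine_under_one_week = result_is_on_time(assigned, returned, timedelta(days=7))
```

Under these assumed numbers the result would be refused with a 3-day limit and accepted with a 1-week limit, which is the reason for raising the limit temporarily after a long outage.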

gd_barnes 2014-06-30 03:28

Wow. Bad timing. I'm out of town until Thursday morning. My home phone line is out too. I'll try contacting my internet provider to see if they see anything on their end.

When I get it reconnected, I'll set all of the servers to 2 weeks for a period of 2-3 days to allow everyone to return their work.

Sorry everyone.

AMDave 2014-06-30 09:31

ITMT I restored the backups from 2014-06-13 on my DRP server. The restore is confirmed successful.

DRP URL for viewing only - [url]http://nplb.no-ip.org/stats/index.php?content=port[/url]
To those not in the know: do not attempt to use DRP ports - like your brain-sucker you will starve ;)
The ports remain closed and untested on the DRP server and some - as yet untested - config would be required to make it active.

Have a safe trip, Gary. Hopefully all comes back up ok when you get home.

EDIT -
I got asked the last time I mentioned DRP:
DRP = Disaster Recovery Plan - [url]http://en.wikipedia.org/wiki/Disaster_recovery_plan[/url]
Due to the differences between the current server and the DRP server I have not yet been able to upgrade the DRP plan to a BCP plan as the DRP config is not backward compatible to the current host. I could not implement the automatic fail-over as it could not fail-back. This may be resolved at some point in the future if necessary.
Our DRP plan is tested about twice per year to keep it valid and up to date. Although the full administrative functionality has not yet been fully tested under the DRP, there is a high level of confidence that the 'automagic wand' of linux admin SMACK-FU will 'Make it so.' - Yes. I like Picard quotes too ;)

mdettweiler 2014-06-30 16:34

Ah, thanks for getting that set up Dave. As you mentioned, the PRPnet/LLRnet ports are obviously not open on the DRP server - that would indeed be somewhat tricky to implement, since there's not really a "clean" way to communicate assigned pairs and results back to the original ports when they come back up.

This got me thinking...Dave, what did you have in mind for implementing the automatic fail-back? In this case, the most recent backup we had available to restore onto the DRP server was from over 2 weeks ago (the last monthly backup). That's great if the main server were to fail completely (i.e. if the hard drive bailed and we lost everything on it) - rolling back 2 weeks is certainly better than losing it entirely - but, in situations like the present one where the main server is expected to come back online soon, with no loss of data, the two servers would be completely out of sync if we attempted to run PRP/LLRnet ports on the DRP server in the meantime. For a (relatively) short downtime like we anticipate this one to be, by the time the main server came back up we'd still be processing "historical" work on the DRP server that's long been completed.

The only way I can think of to do this practically (i.e. so we're not spending days on end spinning our wheels on historical work) would be to:
[LIST]
[*]Have the backup server running continuously, keeping its stats database updated daily from the master server's results files.
[*]Likewise, download a snapshot of the PRPnet port directories and databases to the DRP server on a daily basis.
[*]In case of main server failure, the DRP server would be activated, and clients could fall back to it. (This is easy for PRPnet since it allows backup servers to be configured; with LLRnet it would require manual intervention.)
[*]When the main server comes back online, instead of bringing its own ports online immediately, it would redirect (port forward) to the DRP server, meanwhile getting its own stats database back in sync by pulling down daily and hourly files from the DRP server.
[*]We'd then need to manually migrate the current port state back from the DRP server to the main server. This would involve shutting down the port, taking a mysqldump of the respective database and a tarball of the port directory, SCPing the files to jeepford, restoring the database and port directory, then restarting the server.
[/LIST]
Aside from the logistical hassles of keeping the second server "in touch" with the main one on a daily basis to be ready for fallback (which may or may not be practical), the last one would seem to be the kicker - sometimes those mysqldumps can get kind of big. The raw size of the dump file is not so much the issue; they can be compressed quite effectively, but it can take quite a while (>5 minutes) just to make the dump. When you include the time to do the SCP, and to restore the backup on the main server, the port could realistically be offline for an hour (and it does have to be offline, because you need to transfer the PRPnet database and directory together to keep them in sync and make sure no results are lost).
This would need to be done individually for each port, and would probably entail a fair amount of manual admin intervention to pull off smoothly. In this light, it [i]may[/i] not be worthwhile to even attempt to have the ports run on the DRP server.
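
The manual migration step described above can be sketched as a per-port command list. This is only an illustration under stated assumptions: the database name, directory path, and remote target are hypothetical, credentials are omitted, and the function just builds the commands rather than running them. It uses only standard `mysqldump`, `tar`, and `scp` options.

```python
from typing import List


def failback_commands(port: int, db_name: str, port_dir: str,
                      remote: str, remote_dir: str) -> List[List[str]]:
    """Build the DRP-side steps for one port, in order, without running them.

    The port must already be shut down so that the database dump and the
    directory tarball describe the same state.
    """
    dump = f"{db_name}.sql"
    tarball = f"prpnet{port}.tar.gz"
    return [
        # 1. Dump the port's database to a file.
        ["mysqldump", f"--result-file={dump}", db_name],
        # 2. Archive the port directory.
        ["tar", "-czf", tarball, port_dir],
        # 3. Copy both files to the main server in one transfer.
        ["scp", dump, tarball, f"{remote}:{remote_dir}"],
    ]


# Hypothetical values, for illustration only.
steps = failback_commands(9000, "prpnet9000", "/srv/prpnet9000",
                          "jeepford", "/tmp/failback")
```

Each entry could be passed to `subprocess.run(cmd, check=True)` so that a failed dump stops the sequence before anything is copied. Restoring the database and directory on the main server, and restarting the port, would follow as separate steps there.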

Just thinking out loud...obviously none of this is of immediate critical importance, but, since we were talking about DRP...

gd_barnes 2014-07-01 06:04

Well, I had a friend go over to my house. There is definitely internet access there because my main desktop machine was able to connect just fine. The server machine was a different story. It rebooted itself for some reason. I had him log onto it but he could not get internet access from the machine itself even after doing some recycling of routers and modems. So I don't know why it cannot connect and why it rebooted itself. Max or Dave, you might try remote access to the machine now and see if you can get to it.

I'll be home around 3 AM CDT Thursday morning. I'll look at it right away when I get home. Hopefully it's as simple as messing with the cable in the wall or tightening a loose connection.

Sorry again for the problems.

AMDave 2014-07-01 09:39

Negative from me unfortunately.
I had set up a 'call home' trace from the server via the log file emails, a while back, so that even if the IP address changed I could trace it back, but that message is not coming out.
The IP on the DNS provider has also not updated.
So I must conclude that the server cannot communicate outbound.
If I have the IP I can connect and fix things but not until then.
EDIT - yep. a loose CAT5 terminator would do it.
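
The reasoning here (no 'call home' message means no outbound connectivity) can be sketched as a simple staleness check. This is a hypothetical illustration and not the actual monitoring script; it assumes one message is expected per hour, and the function name is made up.

```python
from datetime import datetime, timedelta


def missed_heartbeats(last_seen: datetime, now: datetime,
                      interval: timedelta = timedelta(hours=1)) -> int:
    """Number of expected periodic messages that have not arrived."""
    if now <= last_seen:
        return 0
    # Whole intervals elapsed since the last message was received.
    return int((now - last_seen) / interval)
```

With an hourly interval, a last message at 19:00 and a check at 22:30 gives 3 missed messages, which is a reasonable point to suspect the link rather than a delayed email.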

odicin 2014-07-01 10:07

The backup website handled by no-ip is also unreachable now, because no-ip was taken down by MS: [URL]https://www.noip.com/blog/2014/06/30/ips-formal-statement-microsoft-takedown/[/URL]

Regards Odi

AMDave 2014-07-01 10:25

That is an interesting development.
We stopped using NO-IP as the secondary domain back on post [URL="http://www.mersenneforum.org/showthread.php?t=13840&page=24"]#257[/URL] in this thread, circa 14 Aug 2013.
"Decision made - [url]http://nplb-gb1.no-ip.org/[/url] will not be renewed.
Since the DNS update for [url]www.noprimeleftbehind.net[/url] has been fixed and has been more reliable, it will be the only address moving forward."

I test and observe that the DRP link via NO-IP is currently working and confirm that the NO-IP issue [I]should not[/I] be affecting us. However, that doesn't mean that it couldn't:
"In the meantime, NO-IP / Vitalwerks have published their answer online:
Apparently, the Microsoft infrastructure is not able to handle the billions of queries from our customers. Millions of innocent users are experiencing outages to their services because of Microsoft’s attempt to remediate hostnames associated with a few bad actors."
The un-seized dynamic dns ".net" domains may have become caught up in this filtering overload, although I am not clear on the 'how'.
So this is potentially a feasible hypothesis from Odi.

edit - the ".net" domain may well be caught up in this after further reading, even though we are using a different provider. I can't connect via the IP, so I still suspect something else.

edit -
ITMT (M$) are boasting about it - [url]http://blogs.technet.com/b/microsoft_blog/archive/2014/06/30/microsoft-takes-on-global-cybercrime-epidemic-in-tenth-malware-disruption.aspx[/url]
Reminds me of Judge Dredd ... "I AM the law!"
May the deities help us, because justice is out to lunch.
/me facepalms

Who the heck made M$ the Internet's policeman ... oh looky there. A judge did.
SlashDot thread - [url]http://yro.slashdot.org/story/14/07/01/0025220/microsoft-takes-down-no-ipcom-domains?utm_source=rss1.0moreanon&utm_medium=feed[/url]

AMDave 2014-07-01 12:02

update - the no-ip addresses to NPLB DRP site and also to the FDCPS project are no longer working.
They are hosted in Australia on a linux server.
{bad words}
If it were a car company, they'd have to issue a recall, but some US judge said sure, you can hijack the international highway so you can find and stop the defective vehicles you made and sold.
I have issues with the disregard for jurisdiction and for the incorrect placement of responsibility.

mdettweiler 2014-07-01 16:10

Yeah, this whole Microsoft/No-IP episode has been rather disappointing - Microsoft's other botnet takedowns prior to this had generally been against unequivocally shady providers, and they seem to be using the same tactics in this case, clueless to the fact that No-IP is actually a legit provider heavily used by good guys. The weird thing is that, over the last number of years I'd noticed Microsoft becoming [i]more[/i] clueful as a company in general, but this proves a shining exception. It doesn't help that the judges involved are often quite clueless about the technological ramifications, and thus inevitably base a large part of these subpoena/court order decisions on the reputation of the party asking for it rather than the technical and situational merits of the case...definitely a recipe for disaster if there ever was one.

Interestingly, I can still access the DRP server via its No-IP address, as well as one of my own boxes which also has an address at no-ip.org. I wonder if the "temporary infrastructure" bottlenecks are affecting international links more than the U.S. - that could be plausible, and would explain why it works for some but not others.

odicin 2014-07-01 19:44

Hmm... curious. I can't look up the No-IP address of the DRP site here from Germany. I tested it with different DNS servers from different providers.

Maybe I should use an American DNS ;)

Regards Odi

gd_barnes 2014-07-02 06:48

I just realized today that my home phone line is not working either. This may end up requiring a call to Time Warner after all. I'll try my best to get it working early Thursday morning after I get home. If I can't get my phone line and internet connection on the server machine working myself, it will have to wait until Thursday afternoon when I can put in the call.

gd_barnes 2014-07-03 07:34

I got home about an hour ago. A power blip caused the problems. I got my home phone line working by recycling the modem. After recycling everything else that I can think of including disconnecting and reconnecting the main cable line from the wall, my internet works on my main Windows desktop machine but does not work on any of my other machines including, unfortunately, the server machine. All of the lights are on or flashing as they should be on both my router and modem so I don't know what is causing the connection issue.

I will keep playing around for a little while longer. If I can't come up with something within the hour, I'll call Time Warner Thurs. afternoon.

gd_barnes 2014-07-03 09:31

I'm at a complete loss. I think there is something flaky with the router because my laptop is unable to connect either and it is the only wireless machine in my place. I have started all of the servers in case everything finally connects while I am sleeping. They are all temporarily set to 7 days to allow everyone to return their work.

I don't know if Time Warner will be able to fix the problem over the phone but I'll try them this afternoon. I hope my router has not crapped out. I wouldn't have time to get a new one until Saturday and trying to configure a new one of those things for all of these servers will surely be a nightmare.

gd_barnes 2014-07-03 10:09

I guess I did enough unplugs and replugs to get it to work. I decided to try out the idea that my router might be bad so I tried plugging the server machine as well as a couple of others directly into the modem and bypassing the router. No luck. Still only my Windows desktop machine worked. So I replugged everything back into the router and suddenly everything connected! Go figure. I've had power outages do flaky things to the connections before but nothing to this extent.

Bottom line: All servers are now back up. I will leave a 7-day window on the LLRnet/PRPnet servers for a couple of days to allow everyone to return their work.

I'm very sorry about the extended outage.

AMDave 2014-07-03 10:38

Just the comms were out then.
It looks like the server itself was running fine the whole time, aside from the manual reboots.
I just received every hourly log file email since the start of the episode.
The server is fine ;)

(in other news: the NO-IP addresses are back on line too)

mdettweiler 2014-07-03 15:45

Confirmed that port 1400 is back online and has accepted my backlogged pairs. :smile:

AMDave 2014-07-05 10:24

Complete backup set offsite download completed for 2014-07-04. (4.9GB)

mdettweiler 2014-07-18 05:26

Server is down again - I wonder what it is this time...

gd_barnes 2014-07-18 05:41

Temporary internet blip here. I don't know what caused it. Everything appears OK now.

mdettweiler 2014-07-18 05:53

Confirmed everything is back on my end. For what it's worth, I think this is the second time (that I know of) your internet has blipped out today - one of my PRPnet clients fell back to a PrimeGrid server earlier at 10:43 PM CDT. That too appeared to be a short-term outage, since by the time I noticed it things were back online again.

mdettweiler 2014-07-20 06:38

Gary, I think there's still some lingering issues with your internet connection - maybe your router is flaking out (dying perhaps?). It had another "extended short term" outage just an hour ago, around 12:00 - 12:20 AM CDT. (That's the minimum time window I was able to confirm - it may have been out longer, but that's when 3 of my clients fell back to PrimeGrid servers. One of the clients is in another state than the other 2, so this is definitely not an issue on my end.)

AMDave 2014-07-20 11:53

I don't think so.
That's the exact time that the daily refresh script ran the weekly result table optimize task.
This task has been in place since "# HISTORY : AMDave 20120524 original"
That keeps mysql very busy for more than a few minutes due to the size of the results table.
If mysql is very very busy it can impact the prpnet database I/O, as I think it may have done here.
I will turn the weekly task off for a while as I analyse and try to find a 'softer' solution.

AMDave 2014-07-20 13:50

I have plotted a likely solution.
Testing will be done over the next 2 weeks before committing to NPLB.

mdettweiler 2014-07-20 19:18

[QUOTE=AMDave;378637]I don't think so.
That's the exact time that the daily refresh script ran the weekly result table optimize task.
This task has been in place since "# HISTORY : AMDave 20120524 original"
That keeps mysql very busy for more than a few minutes due to the size of the results table.
If mysql is very very busy it can impact the prpnet database I/O, as I think it may have done here.
I will turn the weekly task off for a while as I analyse and try to find a 'softer' solution.[/QUOTE]
Ah, that's interesting. That would make sense to explain this "outage", since given the time of day, both the optimize task [i]and[/i] the daily stats run were in progress.

Would that also explain the two earlier outages on the 18th (reported a few posts up)?

AMDave 2014-07-20 20:10

Afraid not. That one is still a mystery.

AMDave 2014-07-22 10:41

NPLB DR server planned outage: 24:00hrs 2014-08-07 AEST - 2 to 5 days.
WHAT: There will be a planned outage of the NPLB DR server for relocation of the hardware. [COLOR="Red"][B]This outage will not affect any normal NPLB services.[/B][/COLOR]
WHEN: from 24:00hrs 2014-08-07 AEST .
DURATION: The planned relocation window is 5 days, target is 2 days. The shortest possible turnaround on the critical path is 24 hours, depending on network build tasks, lunar and astral alignments, the lucky number 16, the market price of Longjing in Zhejiang Province and whether the roaring forties happens to be blowing at the time.

gd_barnes 2014-07-23 08:43

[QUOTE=AMDave;378667]Afraid not. That one is still a mystery.[/QUOTE]

I'm confident that those outages were my provider monkeying around. I was online at my main Windows desktop machine when I saw the lights on the modem gradually disappear and then come back on a few minutes later as I waited for my pages to load for several minutes. I'm guessing that they occasionally do some random maintenance late at night.

AMDave 2014-07-28 10:36

[QUOTE=AMDave;378639]I have plotted a likely solution.
Testing will be done over the next 2 weeks before committing to NPLB.[/QUOTE]
Quick update:
The database patch script and refresh script changes are now designed and being tested.
Although the patch is simple it takes a long time to process the results table due to its size.
There is no simple alternative, so a stats outage will need to be scheduled to deploy the patch.
I am still confirming timings & additional tests to verify that the desired outcomes will be achieved.
I am quite optimistic & will post again when I have the full set of results, to schedule the change.

MyDogBuster 2014-07-28 20:39

Looks like we are down again.

mdettweiler 2014-07-29 03:09

The server itself is still up and running (and the [url]http://noprimeleftbehind.net/[/url] web page is working just fine), but the LLR/PRPnet servers are all out.

It appears that Gary's power company came by and switched out his electric meter, as we'd been warned of over in the "News" thread. According to jeepford, its current uptime is about 5 hours, which means it partially survived the power outage: the UPS must have held it, but for whatever reason, the computer rebooted itself. (That often happens on semi-short power flickers, which is probably what happened since consumer-grade UPSes usually take a second or so to switch between AC and battery.)

That means that the computer is up and running, but not logged in. I can restart the LLR/PRPnet servers remotely...I'll go do that now.

mdettweiler 2014-07-29 03:24

Well, after just attempting to get in through the "back door non-console-session VNC", I remembered why we never used that for jeepford. :smile: It's all coming back to me now...that solution worked fine on Gary's other (read: older OS) machines, but the magic incantations I'd put in the ~/.vnc/xstartup script didn't work for Ubuntu 9.04. Whenever you try to type anything, it comes out as gibberish because something's messed up the keyboard layout. (Yes, we're still on 9.04. Despite multiple non-starter attempts to upgrade to a newer version, and a totally overambitious plan I came up with to virtualize the entire server in a VM, we still have never actually finished the upgrade.)

So: long story short, the "back door style VNC" doesn't work on jeepford, and it never did. That's why we always used the "Ubuntu built-in console VNC" for jeepford. Trouble is, that one doesn't work until the system's been logged on locally from the console. I [i]thought[/i] we solved that problem by setting it to log in automatically on boot-up, but, apparently that's not working for whatever reason.

Never fear, though! I can still start the LLR/PRPnet servers. :smile: I just have to do it from the SSH console...stand by.

mdettweiler 2014-07-29 03:47

[b]Quick upshot: all the servers are back online. :smile:[/b]

Long version for Gary:

Since I couldn't get graphical access to jeepford through VNC (well, I [i]could[/i], but the keyboard didn't work...), I had to start the servers through the textual SSH console. In order to multiplex the console (to allow starting multiple LLR/PRPnet server processes that would keep running when I logged out), I needed to use a program called "screen", which is very handy, and very powerful, but has a huge learning curve. :ermm:

So, to make this easier (I have yet to learn screen properly myself), I used a handy little program called "screenie", which is a nice, simple wrapper around "screen" that doesn't require learning all those crazy keyboard shortcuts.

This is basically the old-fashioned (text-only) way of opening up multiple tabs on a terminal window like we usually do to start the servers.

In short, here's what you need to do to get to the servers I started (i.e., to shut them down, restart them, etc.):
[list]
[*]Open a terminal window. (Or connect remotely via SSH.)
[*]Run "./screenie". (This will only work from your home directory, so if you've cd'd to somewhere else, type "cd" to go back.)
[*]You will see a list of screens I've opened up - they're each labeled with the appropriate LLRnet/PRPnet port number.
[*]To go to one of the server screens, type the number next to it, and press Enter.
[*]You are now "inside" the terminal window associated with that screen. You can press Ctrl-C to stop the server, use the console as usual, restart the server, etc. If you type "exit", this screen window will go away, just like in the graphical terminal, and you'll be returned to the menu.
[*]To return to the menu [i]without[/i] closing the terminal window - for instance, to leave an LLR/PRPnet server running - press Ctrl-A, then the D key. That will take you back to the menu.
[*]To quit screenie and return to the "master" terminal (the original window you opened, or your initial SSH session if you connected remotely), type "q", then press Enter at the screenie menu. Any screen windows shown as open on the menu will remain open, even when you log out.
[/list]
BTW, if you want to restart your PRPnet clients on jeepford remotely, you can add additional terminal windows to screenie by typing "a" then pressing Enter at the menu. It will ask you to name the window, and then for a "job" - that's a bit confusing, but you should just enter "bash". (Don't ask. :smile: If you're wondering, though, that's the name of the Linux terminal, like cmd.exe on Windows.) From there, you have another terminal window you can use like any other - you can start up a PRPnet client (or sr2sieve, if that's what you have jeepford doing right now), and press "Ctrl-A d" to go back to the menu and leave it running, just like with the servers.

Confused yet? :smile: Sorry if this is a lot to take in...if you don't care about restarting whatever clients you had going on jeepford, don't worry, everything's up and running now. This is mainly for when you need to go in and [i]stop[/i] the LLR/PRP servers, i.e. if you're going to reboot jeepford at some point.
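
For reference, the same idea can be expressed with plain "screen" instead of the screenie menu. The sketch below only builds the command lines; it assumes "screen" is installed, and the session names and the launcher script are hypothetical placeholders, not the actual commands used on jeepford.

```python
from typing import List


def screen_start(session: str, command: List[str]) -> List[str]:
    """Command line that starts `command` in a new detached screen session."""
    # -d -m: start detached; -S: give the session a name to find it later.
    return ["screen", "-dmS", session] + command


def screen_attach(session: str) -> List[str]:
    """Command line that reattaches to a named session."""
    return ["screen", "-r", session]


# Hypothetical launcher script and session name, for illustration only.
start_9000 = screen_start("prpnet9000", ["./start-server.sh"])
attach_9000 = screen_attach("prpnet9000")
```

Once attached, Ctrl-A then D detaches again and leaves the server running, the same as in the screenie instructions above; "screen -ls" lists the sessions that exist.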

MyDogBuster 2014-07-29 03:55

Does all this mean they are up or down 'cause I still can't get to Port 1468.

mdettweiler 2014-07-29 05:29

[QUOTE=MyDogBuster;379275]Does all this mean they are up or down 'cause I still can't get to Port 1468.[/QUOTE]
Whoops, forgot that one. It's running now. :smile:

(So many servers, so many port numbers to keep track of... :rolleyes:)

MyDogBuster 2014-07-29 07:14

Thx Max

gd_barnes 2014-07-29 08:58

Nice work Max. It wasn't quite as it seemed as far as the outage. I don't have a UPS attached and the blip was only for less than a minute. Jeepford did not turn itself back on. I know this because my girlfriend was there at the time. She was kind enough to turn back on two of my machines but I just missed her call before she left the house so she did not know to log into Jeepford. I could have a friend go over and log into it if that is needed but it sounds like it's not. Thankfully there was not an internet issue this time around so you were able to take care of things remotely.

Do I need to do anything remotely before I get back? I'll be back a week from today. All of my cores were running a CRUS sieving job and I don't really care about trying to restart the 2 of them that were running on Jeepford.

mdettweiler 2014-07-29 16:37

No, there's nothing further that needs to be done remotely. Even once you get back, all the servers are up and running just fine, only on the text console side (make sure you don't try to start them again in a graphical terminal when you log in, that could cause problems! At the least it would error out binding to the port...)
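
The "error out binding to the port" behaviour can be checked ahead of time. Here is a minimal sketch, assuming Python on the server; the function name is made up, and the check is only a snapshot, since another process could take the port right afterwards.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": a server is already running.
            return False
    return True
```

If this returns False for a server's port, that server is most likely already running in one of the screen sessions and should not be started a second time.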

AMDave 2014-07-30 10:52

Patch testing completed.
Results were poor.
Patch cancelled.
The stats DB is already very well optimized.
I made a config change to run the optimization less frequently instead.
EDIT - added another tweak that should cut about 400 seconds out of the stats refresh
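
As an illustration of running a heavy task "less frequently", a refresh script could gate the optimize step on the time since it last ran. This is a hypothetical sketch, not the actual config change; the function name and the 28-day interval in the example are assumptions.

```python
from datetime import date


def optimize_due(last_run: date, today: date, min_days_between: int) -> bool:
    """True once at least min_days_between days have passed since last_run."""
    return (today - last_run).days >= min_days_between


# Hypothetical: the optimize last ran on 2014-07-20 and should now run
# no more often than every 28 days.
run_after_one_week = optimize_due(date(2014, 7, 20), date(2014, 7, 27), 28)
run_after_four_weeks = optimize_due(date(2014, 7, 20), date(2014, 8, 17), 28)
```

With these assumed values the task is skipped after one week and runs again after four.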

AMDave 2014-07-30 13:16

CONFIRMED - NPLB hourly stats processing time down to 337 seconds (was 780 − 900 seconds)
Apologies to gamer007, Max and sm5ymt who each received one accidental prime notification from the TEST server as I forgot to turn the emails off in the PVT run.
As per my follow-up emails, please disregard those as you have already reported and submitted them.

AMDave 2014-08-03 02:09

Complete backup set offsite download completed for 2014-08-02. (4.8GB)

Lennart 2014-08-05 08:41

NPLB is down here.


Lennart

AMDave 2014-08-05 08:55

server is up, websites & databases also.
I'll jump on and check it out.

CONFIRMED - looks like the power blipped 1.5 hrs ago.

I'll see if I can get the ports up.

AMDave 2014-08-05 09:27

I may or may not have prpnet 1468, 2000, 9000 running in screen
I may or may not have llrnet 3500 running in screen
YMMV
I'll watch for a bit :smile:

ed - I see port activity now and the stats have fetched results from the ports and updated. I will check back later.

AMDave 2014-08-05 10:56

based upon some deduction I have resumed prpnet 1400 and 1465 for CRUS

and prpnet 12000 and llrnet 12050 for TPS

gd_barnes 2014-08-05 20:50

Just to make sure everything is OK and for my own comfort level, I just now shut down the server machine for a couple of minutes and then restarted it. I then restarted all of the servers "on screen" so that I could see them running. I didn't feel comfortable not being able to see what was running.

There have been no pairs in LLRnet port 12050 for a long time. I did not start it.

For future reference, here are the ports that are currently running:
PRPnet:
1400 (CRUS)
1465 (private)
1468 (NPLB)
1470 (private)
2000 (NPLB)
9000 (NPLB)
12000 (TPS)
13000 (TPS)

LLRnet:
3500 (NPLB)
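
A port list like the one above lends itself to a quick check after a reboot. The following is a minimal Python sketch, not the project's actual tooling: it assumes each server accepts plain TCP connections on its port, and the names `SERVER_PORTS`, `is_port_open` and `report` are hypothetical.

```python
import socket

# Port list copied from the post above; labels are the projects named there.
SERVER_PORTS = {
    1400: "PRPnet (CRUS)",
    1465: "PRPnet (private)",
    1468: "PRPnet (NPLB)",
    1470: "PRPnet (private)",
    2000: "PRPnet (NPLB)",
    9000: "PRPnet (NPLB)",
    12000: "PRPnet (TPS)",
    13000: "PRPnet (TPS)",
    3500: "LLRnet (NPLB)",
}


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def report(host, ports=SERVER_PORTS):
    """Map each listed port to True (something is listening) or False."""
    return {port: is_port_open(host, port) for port in ports}


# Example use (hostname taken from the thread):
#   for port, up in report("noprimeleftbehind.net").items():
#       print(port, SERVER_PORTS[port], "up" if up else "DOWN")
```

A successful connection only shows that something is listening on the port, not that the right server is behind it, so this complements rather than replaces looking at the servers on screen.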

AMDave 2014-08-06 09:35

Thanks Gary.
I'll cast another incantation so they automajically start on your (real) screen after a reboot.
Running some tests for that here just now.
I'll let you know when done.

Lennart 2014-08-06 10:17

I have some big problems reaching the stats page on NPLB, and they started after the first outage when Gary was away.

It seems that there is some problem in the DB or Apache.

As I understand it, CRUS is on the same server and I have the same issue there.

Big problems on pages with lots of data. Most times they time out.
Sometimes I get 20-50% of the data.

50% of the time I get nothing, just a white page, and then it times out.

I think you need to optimize the DB and check Apache, MySQL, PHP.

Just some guesses :smile:

This is not criticism, just information :)

Lennart

AMDave 2014-08-06 10:41

Good news for us - everything is fine. Page & DB response is very snappy.

Bad news for you is that either:
somewhere between you and the server is a "poisoned" cache.
edit - you can test this by over-riding your primary and secondary DNS server addresses and trying a different ISP's DNS server - edit
OR
[STRIKE]you are having a problem with the Prime List page where there is a slight delay while your browser downloads 8000+ primes and then the javascript table client re-renders the list in a pretty looking table with pages. (The port status page however does not use that javascript plugin and should display very fast.) At a guess you may have a local issue with the javascript plug in. (I hope you are not using IE !?)[/STRIKE]
edit - no. you said you had the same problem on crus so this does not apply - edit.
OR
Maybe your browser is using the "hardware acceleration when available" which is a GPU/APU which is also crunching? That can result in broken page rendering also. A few recent versions of Chrome/ium have been bad at that.

It could be one of many reasons, but the good news is it is not a problem on the server or a response time issue - considering that I am on the other side of the world over 15 'hops' from the server. :)
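
The DNS cross-check suggested above can also be done without changing system settings, by sending the same query directly to two resolvers and comparing the answers. Below is a minimal Python sketch using only the standard library. The function names are hypothetical, and the resolver addresses in the usage comment (8.8.8.8 and 1.1.1.1, two public resolvers) are examples only, not anything the project uses.

```python
import secrets
import socket
import struct


def build_query(hostname, query_id):
    """Build a DNS query packet asking for the A record of hostname."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def skip_name(packet, offset):
    """Return the offset just past an encoded (possibly compressed) name."""
    while True:
        length = packet[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # compression pointer: 2 bytes, ends the name
            return offset + 2
        offset += 1 + length


def parse_a_records(packet):
    """Extract the IPv4 addresses from the answer section of a DNS response."""
    _id, _flags, qdcount, ancount, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(packet, offset) + 4  # skip QTYPE and QCLASS
    addresses = []
    for _ in range(ancount):
        offset = skip_name(packet, offset)
        rtype, _rclass, _ttl, rdlength = struct.unpack(
            "!HHIH", packet[offset:offset + 10]
        )
        offset += 10
        if rtype == 1 and rdlength == 4:  # A record
            addresses.append(socket.inet_ntoa(packet[offset:offset + 4]))
        offset += rdlength
    return addresses


def resolve_via(hostname, dns_server, timeout=3.0):
    """Ask one specific DNS server for hostname's A records over UDP."""
    query_id = secrets.randbelow(65536)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname, query_id), (dns_server, 53))
        response, _ = sock.recvfrom(512)
    if response[:2] != struct.pack("!H", query_id):
        raise ValueError("response id does not match the query")
    return parse_a_records(response)


# Example use: compare two resolvers; differing answers hint at a stale cache.
#   print(resolve_via("noprimeleftbehind.net", "8.8.8.8"))
#   print(resolve_via("noprimeleftbehind.net", "1.1.1.1"))
```

The sketch handles only plain A-record answers over UDP and ignores truncation and error codes, so an empty list means "no usable answer" rather than proof that the name does not exist.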

AMDave 2014-08-06 10:53

Gary,

I will not script that auto-start tonight.
I could but it's not quite finessed.
It doesn't quite behave like a terminal window that you opened yourself.
As soon as you stop the program running in it, the terminal window closes, which is not how you would want it to behave.
I'll do some more tests after I have moved and overcome that behaviour.

PS - I did fiddle with your server. It should log you in after a reboot now.

ITMT - Welcome back :)

AMDave 2014-08-19 13:27

[QUOTE=AMDave;378799]NPLB DR server planned outage: 24:00hrs 2014-08-07 AEST - 2 to 5 days.
WHAT: There will be a planned outage of the NPLB DR server for relocation of the hardware. [COLOR="Red"][B]This outage will not affect any normal NPLB services.[/B][/COLOR]
WHEN: from 24:00hrs 2014-08-07 AEST .
DURATION: The planned relocation window is 5 days, target is 2 days. The shortest possible turnaround on the critical path is 24 hours, depending on network build tasks, lunar and astral alignments, the lucky number 16, the market price of Longjing in Zhejiang Province and whether the roaring forties happens to be blowing at the time.[/QUOTE]

NPLB DR server is back online.
It was a super-moon. There were no lucky numbers so the pool jackpotted. The price of tea in China didn't budge. The roaring forties were stuck at twenty. Even the westerlies turned up a week late.
That was 11 long and frustrating days.
But we are back in business.

MyDogBuster 2014-08-19 14:31

Nice job Dave, as usual.:cool:

AMDave 2014-08-27 11:32

I have been getting connection failures to the NPLB server for about 1 hour, but the server contacted me successfully on schedule about 16 minutes ago.
The DNS and IP address are the same so I suspect an ISP / network issue is in progress.
Server is up, but the inter-webs is confuzzled.
Hopefully it's just a regional issue for me and not happening to you.

AMDave 2014-08-27 20:27

All fine this morning.

Lennart 2014-08-27 20:37

[QUOTE=AMDave;381562]All fine this morning.[/QUOTE]


Not working here now. Is noprimeleftbehind down?

Lennart

mdettweiler 2014-08-27 21:18

[QUOTE=Lennart;381563]Not working here now. Is noprimeleftbehind down?

Lennart[/QUOTE]
It does look to be down across the board now - all of my clients (at two different locations) have fallen back to PrimeGrid servers, and I can't reach any noprimeleftbehind.net services.

Dave, are you still getting pings from the server? If so, then this may be an IP/DNS issue (since that would mean the server still has internet access, but we can't reach it).
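
The reasoning above (heartbeat still arriving means the server has internet access, so the fault lies on the way in) can be written down as a small decision function. This is only a sketch of that logic in Python: the function name, the parameter names and the result labels are all hypothetical, and the inputs are assumed to be gathered by hand or by other checks.

```python
def diagnose(heartbeat_received, dns_ip, known_ip, reachable_by_ip):
    """Classify an outage from four observations.

    heartbeat_received: did the server's scheduled outbound message arrive?
    dns_ip:             address the hostname currently resolves to (or None)
    known_ip:           last known public address of the server
    reachable_by_ip:    does a connection to known_ip succeed?
    """
    if reachable_by_ip:
        if dns_ip != known_ip:
            # Server answers on its real address but the name points elsewhere.
            return "dns"
        return "ok"
    if heartbeat_received:
        # Server is up and can reach out, but we cannot reach in:
        # changed IP, router or ISP trouble on the inbound path.
        return "inbound-path"
    # Nothing in, nothing out: power, router or the server itself.
    return "server-or-link-down"


# Example: pings still arriving, but the known address does not answer.
#   diagnose(True, "203.0.113.5", "203.0.113.5", False)
```

The addresses in the example come from a documentation-only range and stand in for the real ones.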

mdettweiler 2014-08-28 05:34

Just noticed something weird in my PRPnet logs...it seems that noprimeleftbehind.net became "gradually" unreachable. At 5:28 this morning (server time), one of my clients tried to fetch new work, but just a couple hours later, at 7:40, it was able to return its last result. By 20:07, when it tried to fetch new work again (after finishing a fairly large PrimeGrid test), the server was fully unreachable.

Not that I could make a guess as to what this means for [i]why[/i] things are down, but, it did seem noteworthy, for whatever it's worth.

gd_barnes 2014-08-28 07:18

I'm out of town for ~1.5 more weeks. Since Dave has the server machine now set up to automatically log on if there is a power blip, I could have a friend shut it down and turn it back on and it should log itself on. Then Max or Dave could restart the servers. I'll check back here in about 2 hours and again about this time tomorrow morning to see if there is any update as to what is going on.

AMDave 2014-08-28 20:20

right after I said it was ok, access to the server stopped.
It still has not been able to get a log file email out to me.
So either the router needs to be reset or the server is down.
In this circumstance it has mostly been the router.
I am unable to access the server to apply a response.
The best outcome would be to reset the router, wait 30 mins & check the web page, if still not available then restart the server as well.

In typical 'Murphy' fashion, the DR server is also down due to a PSU failure. I'll either replace it or migrate the websites & databases to another machine tomorrow.

mdettweiler 2014-08-29 21:08

Just a thought: if the router [i]does[/i] turn out to be the issue, it might be time to look into getting a new one. We seem to be having router issues with increasing frequency lately, which may indicate it's "dying". :ermm:

A halfway-decent wireless router can usually be found these days for $20-$25; a name brand pushes it up to the $35-$50 range, but usually no more than that for the features we need. Gary, let me know if you'd like me to do some quick shopping around for one online; I can also assist with configuring it appropriately after installation (the easiest thing to do would be to reconfigure it to use IP addresses consistent with the old router, rather than changing the static IPs on each machine).

gd_barnes 2014-08-30 09:23

I had a power outage and the machine was off. I had a friend go over and turn it on and it has internet access. Dave or Max, if you can start the servers, everything should be OK. Sorry about the problems.

I will look into another UPS in the near future. I need something more reliable than I had before.

mdettweiler 2014-08-30 19:32

Okay, I've restarted all the servers, namely: 1400, 1465, 1468, 1470, 2000, 9000, 12000, and 13000. Please let me know if I've forgotten any. :smile:

I also set the deadlines on all of them to 7 days until the old assignments are returned.

Incidentally, graphical access through VNC is now working perfectly, so Gary, when you get back you'll find the servers running exactly how they usually are. I noticed someone (Dave?) left us a little present on the desktop background - is that you, Dave? :smile: (He looks jolly enough to fit the bill...)

Edit: Gary, I didn't restart any PRPnet clients, not knowing exactly which ones you'd want. Since VNC is working well now, you should be able to get full remote access yourself through the usual script (the 5900 version, not 5901).

AMDave 2014-08-31 06:16

No idea about desktop backgrounds on remote hosts as I only use the command line.

The DR server is up and running on alternate hardware until the server PSU is fixed or replaced.

AMDave 2014-08-31 07:27

@gary
It's probably not the UPS's fault if the power outage was long enough.
Don't forget to say thanks to your buddy from the rest of us.

AMDave 2014-09-14 20:19

Complete backup set offsite download completed for 2014-09-13. (4.7GB)

gd_barnes 2014-09-14 20:39

[QUOTE=mdettweiler;381761]Incidentally, graphical access through VNC is now working perfectly, so Gary, when you get back you'll find the servers running exactly how they usually are. I noticed someone (Dave?) left us a little present on the desktop background - is that you, Dave? :smile: (He looks jolly enough to fit the bill...)[/QUOTE]

That's actually my buddy who turned on my machine. While I am gone he feeds my cat. He is a total ham, always pulling pranks like that while I'm gone. His favorite thing to do is move random stuff around my place. I almost always have a surprise when I get back. lol

AMDave 2014-09-30 10:23

Complete backup set offsite download completed for 2014-09-28. (4.8GB)

AMDave 2014-09-30 13:36

Backups from 2014-09-28 restored successfully on the DRP server.

DRP URL (view only): [url]http://nplb.no-ip.org/stats/index.php[/url]

AMDave 2014-11-03 11:33

Backups from 2014-11-02 restored successfully on the DRP server.

AMDave 2014-11-16 08:33

Backups from 2014-11-15 restored successfully on the DRP server.

AMDave 2014-11-30 00:18

Backups from 2014-11-29 restored successfully on the DRP server.

AMDave 2014-12-14 00:06

Backups from 2014-12-12 restored successfully on the DRP server.

AMDave 2014-12-20 12:49

Backups from 2014-12-19 restored successfully on the DRP server.

Drive safe, be cheerful and pass it on.

AMDave 2015-01-04 02:29

Backups from 2015-01-03 restored successfully on the DRP server.

AMDave 2015-01-11 10:04

Backups from 2015-01-10 restored successfully on the DRP server.

AMDave 2015-01-12 01:50

NPLB server went down in the last hour.

Gary must have driven over a state line. The server just threw a hissy fit.
The outbound 'here-I-am' message is not coming through and I'm unable to connect on the host name or the IP address so at the very least, the connection is out.

AMDave 2015-01-12 05:08

Alive again.

AMDave 2015-02-06 11:30

Backups from 2015-02-05 restored successfully on the DRP server.

AMDave 2015-03-02 19:55

Backups from 2015-02-28 restored successfully on the DRP server.

AMDave 2015-04-11 11:05

Backups from 2015-04-09 02:05:57 restored successfully on the DRP server.

AMDave 2015-04-16 20:05

NPLB is online but the DNS provider has been having hiccups for the last 48 minutes, so most of us are unable to connect.
We have to wait for the provider to fix it, which should not take long.
(From my phone)


All times are UTC.