mersenneforum.org (https://www.mersenneforum.org/index.php)
-   No Prime Left Behind (https://www.mersenneforum.org/forumdisplay.php?f=82)
-   -   Server outrages (https://www.mersenneforum.org/showthread.php?t=13840)

AMDave 2015-04-17 07:57

Our logs show the DNS problem lasted for between 3 and 4 hours, which is about normal for a fresh DNS cascade.

AMDave 2015-04-25 00:47

Backups from 2015-04-24 02:05:28 restored successfully on the DR server.

AMDave 2015-05-08 12:38

In an alternate, but nearby, reality a plague of locusts and frogs has descended upon the NPLB server basement.
In this one, however, a temporary outage appears to be occurring.
We will closely monitor the level of the locusts relative to the ingestion rate of the frogs and advise of any further changes.
:)

AMDave 2015-05-24 02:54

NPLB server/router has been down since approx. 1 hr 43 mins ago.

AMDave 2015-05-28 11:02

NPLB server intermittent outages,

I did not close-out the previous issue because - for me at least - it has not gone away.
I seem to be observing ongoing intermittent outages between the server and the ISP (up to a few hours at a time).
The prime candidate is the router.

@Gary - when you get back to the server room could you check the router is not over-heating, please. Thanks.

gd_barnes 2015-05-28 18:42

I have observed them too and it is frustrating. I'm not sure what is causing them. I have shut everything down and rebooted. I've also changed which slot in the router that the cable is plugged into. We'll see if that makes a difference.

AMDave 2015-05-30 07:30

I was able to get the backups downloaded last night.
Backups from 2015-05-27 02:05:20 restored successfully on the DR server.

gd_barnes 2015-05-30 08:30

Everything seems stable now. About 30 hours ago I went so far as to disconnect the cable connection from the wall for several minutes and plug it back in. There have been no blips since then.

AMDave 2015-06-02 07:59

2 more long blips today, unfortunately. Currently down.
Still unclear on whether it is the ISP or local network.

gd_barnes 2015-06-02 09:17

I will be calling Time Warner later today. My router is good. I've done everything I can on my end.

AMDave 2015-06-02 10:18

I did some investigative work with the logs.
It would appear to be the ISP.
Behind the scenes you have had a lot of IP address re-allocations recently which indicates the service connection has been failing repeatedly.
Probably "OC out" as opposed to high attenuation.
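
One way to quantify the re-allocations is to count how often the observed external address changes between consecutive log entries. This is only a minimal sketch: it assumes the logs have already been reduced to (timestamp, address) pairs, and the sample entries are made-up values from the documentation address range, not the real NPLB logs.

```python
from datetime import datetime


def count_ip_changes(observations):
    """Count how many times the address differs from the previous entry."""
    changes = 0
    previous = None
    # Sort by timestamp so out-of-order log lines do not inflate the count.
    for _timestamp, address in sorted(observations):
        if previous is not None and address != previous:
            changes += 1
        previous = address
    return changes


# Hypothetical sample data (documentation address range, not real logs).
sample = [
    (datetime(2015, 6, 1, 8, 0), "203.0.113.10"),
    (datetime(2015, 6, 1, 12, 0), "203.0.113.10"),
    (datetime(2015, 6, 1, 16, 0), "203.0.113.27"),
    (datetime(2015, 6, 2, 3, 0), "203.0.113.41"),
]

print(count_ip_changes(sample))
```

A high count over a short period would be consistent with the connection dropping and being re-established, though it does not by itself say where the fault lies.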

gd_barnes 2015-06-03 17:00

I called time warner late yesterday. It is an area wide outage. To say that I'm hacked off would be an understatement. This has been going on for over a week. I'm leaving town for 6 days starting tomorrow. I'll keep my fingers crossed that they get it fixed in the next couple of days. We've had a lot of rain and thunderstorms here over the last several weeks and it's supposed to continue over the next several days. I don't know if that has caused the problem. This is the biggest problem that I've ever had with them. I'm paying for business class for the extra speed and service but I'm not getting it right now.

rebirther 2015-06-03 17:20

[QUOTE=gd_barnes;403458]I called time warner late yesterday. It is an area wide outage. To say that I'm hacked off would be an understatement. This has been going on for over a week. I'm leaving town for 6 days starting tomorrow. I'll keep my fingers crossed that they get it fixed in the next couple of days. We've had a lot of rain and thunderstorms here over the last several weeks and it's supposed to continue over the next several days. I don't know if that has caused the problem. This is the biggest problem that I've ever had with them. I'm paying for business class for the extra speed and service but I'm not getting it right now.[/QUOTE]

Yes, that's terrible. I know that problem here too. I need to reserve some more bases once it's online again, before I run dry.

rebirther 2015-06-05 05:25

The website is still down :(
I have only work for 2 days left.

MyDogBuster 2015-06-05 10:04

[QUOTE]The website is still down :(
I have only work for 2 days left. [/QUOTE]

Reb, I have 5 sieve files I keep in reserve for just such an occasion. They are all of the "long variety" but they are tests.

R152 250K-1M 12192 tests
R159 250K-1M 5397 tests
R160 300K-1M 11531 tests
R162 250K-1M 26584 tests
R173 250K-1M 17368 tests for a total of 73072 tests

They are all 1ker's except for R162 which is a 2ker. I am positive they are all still valid.

I can send you all or some if you are interested. You will need to PM your email address to me.

PS. I also have 6 sieve files for bases at n=10K-25K. Each is over 250K tests with hundreds of k's remaining.

Let me know what you want to do.

MyDogBuster

AMDave 2015-06-05 11:12

Nice move MyDogBuster!
or ... connection is up right now. U/L & D/L if you can.

MyDogBuster 2015-06-05 11:20

Connection is up and the servers are running but I still can't get to the Riesel and Sierp pages????????

EDIT: All seems to be working again. Thanks Dave.

AMDave 2015-06-05 12:37

Ooh. That was lucky then. Glad you got some work units.
Just as you were looking at it the connection disappeared again.
TimeWarner must have employed a Schrodinger's Cat for a maintenance technician.
I will pop down to the market gardens tomorrow for some alternate-reality-Cat-Nip.
Maybe we can coax it back !! :P

rebirther 2015-06-05 15:08

[QUOTE=MyDogBuster;403545]Reb, I have 5 sieve files I keep in reserve for just such an occasion. They are all of the "long variety" but they are tests.

R152 250K-1M 12192 tests
R159 250K-1M 5397 tests
R160 300K-1M 11531 tests
R162 250K-1M 26584 tests
R173 250K-1M 17368 tests for a total of 73072 tests

They are all 1ker's except for R162 which is a 2ker. I am positive they are all still valid.

I can send you all or some if you are interested. You will need to PM your email address to me.

PS. I also have 6 sieve files for bases at n=10K-25K. Each is over 250K tests with hundreds of k's remaining.

Let me know what you want to do.

MyDogBuster[/QUOTE]

I could use some 25-100k or 50-100k. If there is nothing left I will take 10-25k. A mirror of this site would be good.

MyDogBuster 2015-06-05 22:44

Looks like we are in yo-yo mode. Time your access accordingly. LOL

gd_barnes 2015-06-06 18:46

I'm out of town but called Time Warner. It's probably the modem. They are sending someone out on Monday afternoon to check it out. Fortunately I have a friend that will go over there so that they can hopefully get it taken care of.

Sorry about all of the problems. Good work Ian on having some backup files in reserve.

gd_barnes 2015-06-09 17:37

Time Warner came out yesterday but could not fully fix the problem. They are coming back out late Wednesday. I will be there to hold their feet to the fire this time.

gd_barnes 2015-06-10 03:15

Came home and power-cycled the router. Everything is working perfectly now. The pages and servers are back up.

I'm very sorry for the extended outage.

AMDave 2015-06-11 11:18

Hardly your fault. You deserve a pat on the back for being so patient and persistent with them.
In spite of the age, the logs show that the server didn't even blink. Trucking along inexorably like a sand dune ;)

Backups from 2015-06-09 02:10:27 restored successfully on the DRP server.

MyDogBuster 2015-06-11 12:45

Dave, I enjoy the quips in your posts, but when the words get to 4 syllables I have to pull out the dictionary. Ah, something lasting a long time. Persistent. :banana: Kind of like the banana. LOL

LaurV 2015-06-12 01:43

[nitpicking]ye can't count..."persistent" is 3 syllables only; the only words with 4 syllables there are "inexorable" and "successfully"(maybe also "hardly" and "blink" but here we are not convinced, still counting syllables) :razz:[/nitpicking]

AMDave 2015-06-28 00:15

Backups from 2015-06-26 restored successfully on the DRP server.

AMDave 2015-07-12 12:26

Backups from 2015-07-11 restored successfully on the DRP server.

AMDave 2015-07-14 09:40

NPLB's DR unplanned outage resolved.

The NPLB project's disaster recovery server was off-line for the last 20 hours due to an unplanned ISP PPPoE failure.
I was getting nowhere with the ISP's technical support line, so I vetoed our ISP, removed their customer-side hardware and replaced it with my "go-to" Cisco equipment.
The PPPoE connection to the ISP still fails but I was able to restore the connection via PPPoA for the time being.

No NPLB operations were impacted as it was the project's DR server, but it is worthy of note.

Always test your backups and have other recovery plans in-hand (and test those too).
EDIT - I just repaired and rebuilt another machine to stand-in if the DR server ever fails.

AMDave 2015-07-26 09:02

NPLB Backups from 2015-07-24 02:05:36 restored successfully on the DRP server.

mdettweiler 2015-08-05 16:30

To address an [url=http://www.mersenneforum.org/showthread.php?p=406834#post406695]issue raised in the CRUS forum[/url], I will be upgrading PRPnet ports 2000 and 9000 to version 5.4.0 later today. This will result in a brief downtime for each port in turn, hopefully less than an hour.

The databases and config folders will be backed up beforehand so if anything goes wrong, I can quickly roll back to the current version and troubleshoot "offline". Version 5.4.0 has been around and used at PrimeGrid since January and has no regressions from 5.3.2 that I'm aware of.

Edit: I will be also upgrading NPLB port 1468. Almost forgot about that one. :smile:

mdettweiler 2015-08-05 22:53

Port 2000 is going down right now for upgrade. I will edit this post when it's back up.

I will only take down one port at a time, to ensure that the other can serve as a backup for people's clients.

mdettweiler 2015-08-05 23:57

[b]Update (6:52 PM server time):[/b] Port 2000 is back up. Everything seems to be working fine, though there hasn't yet been much client activity to test it. I'll be keeping an eye on it and if anything goes terribly wrong, I can revert to the backups.

[B]Dave:[/B] FYI, I needed to update the "prpnet-to-llrnet.pl" script to handle the changes to PRPnet's completed_tests.log format in version 5.2.0. Ever since I upgraded the CRUS servers last year, this script has been jumbling the LLRnet-formatted output files, but nobody's noticed because we don't actually [i]use[/i] those files over at CRUS. :smile: At NPLB, however, the stats system is reading everything in LLRnet format, so this is important. I've tweaked the script so it should be able to handle both old- and new-format input files (and even both mixed in one file, as today's will be).

Now I'm taking port 9000 down for upgrade.

mdettweiler 2015-08-06 00:43

[b]Update (7:35 PM server time):[/b] Port 9000 is back up.

For 9000, I upgraded only to 5.3.2 (the second-newest version, same as we're using at CRUS). I initially tried to upgrade it to 5.4.0, but the server was immediately besieged by "client too old, dropping connection" messages as Gary's old clients tried to talk to it. :smile: Since the older clients are unaware of this mechanism, they unwittingly kept hammering the server [i]many times a second[/i], effectively DoSing it. Oops... :rolleyes: So I restored the backup I took before the upgrade, and re-upgraded to 5.3.2. Now it's working better. :smile:

5.3.2 is compatible with both the older and newer clients, so it will keep everyone happy until Gary can get his clients upgraded. 5.4.0 is a drop-in replacement for 5.3.2, so "finishing" the upgrade is very easy and quick.

Next, I will bring down port 1468 for upgrade. Like 9000, I will upgrade it to 5.3.2, not all the way to 5.4.0, until Gary has upgraded his clients. (Port 2000 is already on 5.4.0, but I guess Gary will just have to stay off that server until he's upgraded. :smile: I don't believe anyone else is running clients that old, but if they are, the same goes for them.)

mdettweiler 2015-08-06 00:52

[b]Update (7:50 PM server time):[/b] Port 1468 is back up. Now all NPLB and CRUS servers are running PRPnet 5.x.

To summarize, the server versions are:[LIST][*]Port 2000: 5.4.0 (only 5.x clients can connect)[*]Port 9000: 5.3.2 (all clients can connect)[*]Port 1468: 5.3.2 (all clients can connect)[*]CRUS 1300 and 1400: 5.3.2 (all clients can connect)[/LIST]
I will continue keeping an eye on them over the next couple of days.
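
The summary above can be turned into a small lookup that tells a user which servers a given client version may connect to. This is only a sketch of the rule as stated in the list (5.4.0 accepts only 5.x clients, 5.3.2 accepts everything); the function names are made up for illustration and are not part of PRPnet.

```python
def major(version):
    """Return the major version number of a dotted version string."""
    return int(version.split(".")[0])


# Server versions as listed in the summary above.
SERVERS = {
    "NPLB 2000": "5.4.0",
    "NPLB 9000": "5.3.2",
    "NPLB 1468": "5.3.2",
    "CRUS 1300": "5.3.2",
    "CRUS 1400": "5.3.2",
}


def accepts(server_version, client_version):
    """Apply the rule from the summary: 5.4.0 takes only 5.x clients."""
    if server_version == "5.4.0":
        return major(client_version) == 5
    return True  # 5.3.2 is listed as accepting all clients


def usable_servers(client_version):
    """List the servers that a client of this version may connect to."""
    return [name for name, ver in SERVERS.items() if accepts(ver, client_version)]


print(usable_servers("4.3.1"))
print(usable_servers("5.0.8"))
```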

Also, Gary, when you get back from your trip, we can discuss upgrading your clients to 5.4.0. :smile:

mdettweiler 2015-08-06 04:05

Port 9000 down for the moment
 
We are having some ongoing trouble with port 9000. Shortly after the upgrade, it was besieged by "The client is too old. The connection was dropped." messages repeating multiple times a second. This ultimately led to a server crash. Restarting the server doesn't seem to help, because it immediately gets stuck in the "The client is too old" loop again.

Please note that port 9000 is running PRPnet 5.3.2, the same version that has been performing stably and reliably at CRUS since last year. The only server running a [i]newer[/i] version (5.4.0, which has also been around since January and is used on all of PrimeGrid's servers) is port 2000, which has not exhibited this issue yet.

I have a backup of port 9000's database and configuration from before the outage, so we always have the option of restoring it to version 4.3.6 and picking up where we left off. However, since we have other servers still operating reliably, I am going to try to diagnose this issue (off forum) with Mark before rolling back the server again. If it continues for more than a few days, I'll restore the backup and we'll investigate this on a test server.

The other servers (2000 and 1468, as well as 1300 and 1400 which have been on 5.3.2 since last year) are not exhibiting this problem, so if you have not done so already, please set your clients to use them as backups. Port 2000 is testing candidates very similar to what's in 9000 right now. :smile:

(P.S.: Gary, sorry to do this to you while you're away. Everything is under control since it's easy to restore from the backups...just wanted to give everyone a heads-up. Let me know if you want me to restore the backup at any time.)

mdettweiler 2015-08-06 04:30

As a follow-up...further troubleshooting seems to confirm that this is [b]not[/b] an issue with Gary's old clients (which are running 4.3.1, not quite as old as I thought), or even necessarily with the PRPnet server code, since Gary's been running those clients without trouble on CRUS's v5.3.2 servers for a long time now. My hunch is that there's something especially "weird" going on with port 9000's database.

That said, I am just beginning debugging and can't say anything for sure. Long story short, all the other ports (both CRUS and NPLB) are doing fine and survived the upgrade apparently without issue. This is something weird with 9000.

mdettweiler 2015-08-06 16:54

Port 9000 is back and better than ever!
 
Hi all,

The problems with port 9000 have been fixed. Now all of NPLB's servers have been successfully upgraded to v5.3.2 or newer. :smile:

It turns out that my original instinct was on the money. One "runaway" client of Gary's running the very old version 4.2.0 didn't know what to do with the new server, and "crashed": prpclient had gotten stuck in a loop trying to contact the server, over and over again. It was trying to do this so fast that it effectively DoS'd the server, bringing it to a halt. (Because Gary's clients are on the server's LAN, the bombardment had the full effect of a 100Mbps connection.)
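
For readers wondering how a client can avoid this kind of accidental DoS: the usual remedy is to wait longer after each failed attempt instead of retrying immediately. The sketch below shows generic exponential backoff with random jitter. It is not how prpclient behaves in any version; the function name and the default values are assumptions chosen for illustration.

```python
import random


def backoff_delays(attempts, base=1.0, cap=300.0, rng=None):
    """Return one wait time (in seconds) per retry, growing exponentially.

    Each delay is drawn uniformly from [0, min(cap, base * 2**n)], so that
    many clients failing at once do not all retry at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Show the schedule for six consecutive failures.
for n, delay in enumerate(backoff_delays(6, rng=random.Random(0))):
    print(f"retry {n}: ceiling {min(300.0, 2.0 ** n):.0f}s, waiting {delay:.2f}s")
```

With a schedule like this, a client that cannot talk to the server settles at a few retries per hour rather than many per second.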

I checked all of Gary's machines remotely (except his personal Windows box, which I don't have access to :smile:), and this appears to be the only one running a really old ("broken") version. The rest of his machines are all on 4.3.1 or 5.0.8 and playing nicely with the upgraded servers. (Well, I don't know how they'll react to port 2000 running 5.4.0...hopefully a bit more gracefully than the 4.2.0 client did. Gary doesn't have any clients on port 2000 right now.)

Max

AMDave 2015-08-07 08:37

Good work. Thanks Max!

AMDave 2015-08-11 23:40

Backups from 2015-08-10 02:10:38 restored successfully on the DR server, including all of the port upgrades.

AMDave 2015-12-24 07:58

NPLB update 2015-12-24.

Today I rolled out some subtle upgrades to the live stats pages. No outage was required.

Last week, the backups from 2015-12-18 02:08:40 were restored successfully on the DR server. The backup and restore process is still working as expected to both the local backups and the DR server.

Note that the NPLB DR server has been upgraded to Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-65-generic x86_64) and everything is working well.

I am already at the point where I am happy that I can have the project operating on a replacement server anywhere in the world within a day or two. The only thing of concern would be the age of the last DR backup if a local backup could not be recovered from the previous day. I am still testing the DR restore manually every fortnight (or so) to make sure we are never too far behind.
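
The "age of the last DR backup" concern can be checked mechanically. A minimal sketch, assuming the backup timestamps keep the format used in this thread and that a fortnight (14 days) is the acceptable limit; the function names are hypothetical and this is not the actual NPLB tooling.

```python
from datetime import datetime, timedelta


def backup_age(last_backup, now):
    """Return how old the most recently restored backup is."""
    return now - datetime.strptime(last_backup, "%Y-%m-%d %H:%M:%S")


def is_stale(last_backup, now, limit=timedelta(days=14)):
    """True if the backup is older than the allowed limit (assumed 14 days)."""
    return backup_age(last_backup, now) > limit


# Backup timestamp from the post above; "now" is the time of that post.
now = datetime(2015, 12, 24, 7, 58)
print(backup_age("2015-12-18 02:08:40", now))
print(is_stale("2015-12-18 02:08:40", now))
```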

A couple of weeks ago I ran some operational recovery tests on the DR server. This did raise a remote connection config issue that I need to resolve at cut-over, but it looks like everything is working fine - although GB's remote desktop would look a bit different.

That's all the fiddling I am doing for 2015.

Best wishes to all for the holiday season

(PS - don't forget to clean out yer dust-bunnies)

Cheers.
AMDave

AMDave 2016-02-19 09:54

It seems the NPLB server fell off the interweb about 13 hours ago.
Since it is not back up yet, I conclude that Gary is away.

ed - DR server is at "2016-02-14 02:06:17" but I will be taking it off line for a couple of hours while a very large lightning storm passes through
[ATTACH]13922[/ATTACH]
This one will be welcome, however, as we have been waiting several weeks for a storm to break the heat and humidity.
(re the scale - yes that's a storm around 1,000 km long. We don't do things by half up here in Australia :P )

Dubslow 2016-02-19 18:21

[QUOTE=AMDave;426818]
This one will be welcome, however, as we have been waiting several weeks for a storm to break the heat and humidity.
(re the scale - yes that's a storm around 1,000 km long. We don't do things by half up here in Australia :P )[/QUOTE]

I've seen storm systems stretch from Minnesota to Texas (1,500+ km) before, though they are rather less common than smaller systems of course. Is this storm a typical storm for your region?

science_man_88 2016-02-19 19:03

[QUOTE=Dubslow;426867] from Minnesota to Texas (1,500+ km) before[/QUOTE]

Depending on what parts of those states you are talking about, try 1500+ miles (2400+ km) potentially. Edit: or as small as just 500+ miles (805+ km).

AMDave 2016-02-19 23:45

Fairly typical, yes. A welcome long period of heavy rain over night. This one moved slower than most, thankfully.

I brought DR server back up 5 hours ago.
Unfortunately the NPLB server still appears to be off line.

gd_barnes 2016-02-20 09:17

This is not good. I will be out of town for as much as 10 more days. I'll see if I can have a friend take a look at it in the next couple of days.

Rincewind 2016-02-28 16:05

archive.org
 
I tried archive.org to visit the site.
They have a snapshot of the main page from 11 February 2016.
Just search for the URL you need; maybe they can provide the information for you.
[URL="https://archive.org/web/"]https://archive.org/web/[/URL]

AMDave 2016-02-28 20:01

Thanks Rincewind.
I don't think any data will be lost.
We have nightly backups and our DR server is up to date as at 2016-02-14 02:06:17
I'm sure it is just a matter of time for Gary to reboot his network.

gd_barnes 2016-03-04 20:11

I just had Time Warner out. They replaced the modem but were unable to fix the issue. The problem is I have internet on my main Windows machine but none of my Linux machines will connect. So from their perspective it is not a Time Warner internet issue. To say that I am disgusted is an understatement. I have an older router but that is not the issue because I have tried bypassing the router directly through the new modem that they gave me.

Help! I am at a total loss now.

rogue 2016-03-04 21:54

[QUOTE=gd_barnes;428115]I just had Time Warner out. They replaced the modem but were unable to fix the issue. The problem is I have internet on my main Windows machine but none of my Linux machines will connect. So from their perspective it is not a Time Warner internet issue. To say that I am disgusted is an understatement. I have an older router but that is not the issue because I have tried bypassing the router directly through the new modem that they gave me.

Help! I am at a total loss now.[/QUOTE]

Can your linux boxes connect to the router? Once connected can they ping your Windows box and vice versa?

rebirther 2016-03-04 22:14

The router's IP address could have changed. I am using dyndns to get the VM connected to the internet. Maybe you also need to check whether the Linux box is whitelisted in the router's firewall (also check for a firmware update). I have no control over firmware updates on my Fritzbox either; if the provider wants to update, they will do it without asking.

gd_barnes 2016-03-04 23:14

Just to let everyone know: There will be no loss of data. Everything is intact on both my laptop and my Linux server machine, which are backed up fairly regularly. This is only an internet connectivity issue.

Setup:
Arris Time-Warner modem. Only my older Belkin router is plugged into it. Everything else is plugged into the router.

Here is what works:
My Windows7 desktop machine that I am typing from now.

Here is what does not work:
My iPhone will not connect.
My wireless laptop will not connect.
My wired Linux server machine will not connect.
My five wired Linux crunching boxes will not connect.

Here is what I have tried:
1. Plugging everything directly into the modem. It makes no difference. This tells me that it is not the router because my desktop machine works fine when plugged into the router.
2. Disconnecting all of the crunching boxes and connecting only the (important) Linux server machine.

This setup and router have worked for years. About two weeks ago, this issue happened. I have requested that Time Warner come back out. They will come back out on Monday afternoon.

rebirther 2016-03-04 23:21

Is the modem a standalone unit without a built-in router, so that you need an extra router? Do you have access from the Linux box to the router over wifi?

gd_barnes 2016-03-04 23:49

I need the router for my wireless devices...my laptop and my IPhone.

The Linux server box is wired. I've tried wiring it directly into the modem. No luck. I've tried wiring it into the router (the usual setup). No luck.

I've tried completely disconnecting the router from the modem and plugging everything directly into the modem (of course laptop and IPhone won't connect since they are wireless). No luck.

I've even tried a wired connection from my laptop directly to the modem. Still no luck.

Only my newest Windows7 machine connects. My older Windows Vista laptop does not. My older Linux server machine as well as the Linux crunching machines do not. I can't help but think that Time Warner did something that has effectively made the older connections obsolete. I can't believe that could be the case but you never know.

Time Warner has done something with the IP or the network. These devices did not just disconnect on their own. I just need to get them to own up to it so we can resolve the issue.

Dubslow 2016-03-04 23:51

What if you disconnect everything from the modem and attach everything to the router? Does the entire LAN function as it should?

AMDave 2016-03-05 09:43

The dyndns will not update unless the server gets outbound for at least 1 hour.
If only one machine gets outgoing then the modem may be in bridge-mode.
Assume the tech has had basic training: 1 router + 1 machine
Check NAT is enabled. It may not be the default.

pepi37 2016-03-05 15:13

[QUOTE=gd_barnes;428140]I need the router for my wireless devices...my laptop and my IPhone.

The Linux server box is wired. I've tried wiring it directly into the modem. No luck. I've tried wiring it into the router (the usual setup). No luck.

[/QUOTE]
Can you replace the LAN card on the Linux server?
Is the LAN cable OK? (Maybe it is a trivial thing to check...)
If you have an onboard LAN card, replace it with a PCIe or PCI LAN card.
Is the new modem on a different subnet? (192.168.0.1 -> 192.168.1.1)
If you have a fixed IP address in the old subnet, you will never connect to the Net.

gd_barnes 2016-03-05 20:11

[QUOTE=AMDave;428163]The dyndns will not update unless the server gets outbound for at least 1 hour.
If only one machine gets outgoing then the modem may be in bridge-mode.
Assume the tech has had basic training: 1 router + 1 machine
Check NAT is enabled. It may not be the default.[/QUOTE]

I don't know what any of this means. Assume that I have had no training. :-)

I need to know what to do in a basic sense. Are you saying that I should disconnect everything from the modem and router for one hour?

AMDave 2016-03-06 02:10

Not quite.
What we need to do is go through the config panels on the router config and check them.
I will PM you.

mdettweiler 2016-03-06 08:15

Dave - if it helps, here are a couple details on how we set up Gary's network way back:

1) Most of the Linux boxes (all of them except for a couple that Gary reinstalled more recently) are on static IPs within the LAN. Most importantly, jeepford has a static IP (192.168.2.110). The router is on 192.168.2.1, and is configured to hand out dynamic IPs in the range 192.168.2.2-192.168.2.100; the Linux boxes with static IPs start at .101, in the order that Gary built them.

2) All other computers on the network (Gary's Windows desktop, the laptop, the iPhone, and one or two of the Linux crunching boxes) are on dynamic IPs.

3) The modem has always been in bridge mode, and the Belkin router has provided NAT for all the computers on the LAN. The modem was not a combined unit with a router in it - just a simple bridge modem. I'm not sure about the new modem Time Warner just gave him - I believe they give out a few kinds, some of them have routing capabilities and some don't.
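
The address plan described in points 1) to 3) can be written down as a quick check of which category an address falls into. This is just an illustrative sketch using Python's standard ipaddress module; the /24 netmask is an assumption (it is not stated in the thread), and the classify function is made up for this example.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.2.0/24")  # assumed /24 netmask
ROUTER = ipaddress.ip_address("192.168.2.1")
DHCP_FIRST = ipaddress.ip_address("192.168.2.2")
DHCP_LAST = ipaddress.ip_address("192.168.2.100")


def classify(address):
    """Say which part of the LAN address plan an address belongs to."""
    ip = ipaddress.ip_address(address)
    if ip not in LAN:
        return "outside LAN"
    if ip in (LAN.network_address, LAN.broadcast_address):
        return "reserved"
    if ip == ROUTER:
        return "router"
    if DHCP_FIRST <= ip <= DHCP_LAST:
        return "dynamic (DHCP pool)"
    return "static"  # .101 and up, assigned by hand


for addr in ("192.168.2.1", "192.168.2.50", "192.168.2.110", "192.168.0.1"):
    print(addr, "->", classify(addr))
```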

Gary, just to confirm - your Windows desktop is connected directly to the back of your router, right? (not to one of the downstream Ethernet switches like the Linux boxes are?) And it's getting Internet right now, connected through the router?

I concur with Dave's suggestion - the only way to figure out what's going on is to look at the router config pages. You'll need to pull up [url]http://192.168.2.1[/url] in a web browser on your Windows desktop to access that. The specific things of interest I would suggest looking for are:

1. The router's external IP address and connection status - this should be on the front page after you log in. If you could take a screenshot of this and email it to Dave and me, we should be able to figure out something from that (or at least figure out what to ask you to do next :smile:).

2. The "DHCP Client List" page. This will tell us whether the laptop, the iPhone, and the dynamic-IP crunching boxes are getting through to the router at all. Again, a screenshot would be helpful here.

mdettweiler 2016-03-06 08:34

Thinking about this a bit more...I have a possible theory (though please take it with a grain of salt since it's just a wild guess).

I recently got set up with Time Warner myself at home, and they gave me a new Arris modem (just a simple modem, like Gary's - you need a separate router). There's one Ethernet port on the back; you plug into it, and it connects you directly to the Internet. Easy peasy. However, I believe I remember when I looked at Gary's router config page in the past, it showed he was using PPPoE (for the uninitiated - this is a type of username/password authentication over Ethernet used by some cable providers) to "dial in" to Time Warner. So I'm wondering if Time Warner favored PPPoE in the past, but more recently transitioned to a direct-connection setup. If they just now turned off the PPPoE system, then it's possible Gary's router is trying to "dial in" with PPPoE when it should be just asking for a regular dynamic IP from Time Warner.

This theory doesn't explain everything, but it's the best I can come up with at the moment to even come close to making sense...

If this theory's correct, it means that something Time Warner did on their end is what caused the problem, but as far as they're concerned, the problem is on your end, because the problem is actually that your router is not talking to their systems using the latest settings. Your router (despite its age) is perfectly capable of using the new settings, but I think we might need to change a setting to make that happen.

In any case, my previous advice stands: take a screenshot of the router config page and send it to Dave and me. We should be able to direct you more from there.

gd_barnes 2016-03-06 10:48

I sent all of the info. from my router config page to David via PM. I will forward it to you. What you said makes a little bit of sense to me. They did give me a new Arris modem but it didn't make any difference. When I went to whatismyip, I see that Time Warner's location is in Virginia and that some update was done to that on Feb. 23rd. That's about the date that all of this happened.

I hate that they can just do stuff like this and it's so difficult to figure out. Their tech would not take responsibility at all. He replaced the modem and wouldn't even check to see if my Linux box would connect. He's been out the last two times that I've had a problem and to say that he has been less than helpful would be an understatement. Both times I've had to call them back out a second time. Needless to say, I've requested a different tech this time.

If we or they cannot get this figured out, I am firing Time Warner business class and it may be an even longer outage. This has gotten ridiculous.

AMDave 2016-03-06 10:57

The new modem is a wireless router & modem, not a simple modem.
That's part of the problem then, now that I see your description.
I PM'd you both with the technical document for the device as I will be AFK for the work day
Please use PM to solve this offline.

gd_barnes 2016-03-06 10:57

[QUOTE=mdettweiler;428214]Gary, just to confirm - your Windows desktop is connected directly to the back of your router, right? (not to one of the downstream Ethernet switches like the Linux boxes are?) And it's getting Internet right now, connected through the router?[/QUOTE]

That is correct. That's the way it's always been and the way it is now. But...just to experiment with things, I tried plugging everything directly into the new modem since there are 4 slots in it. This made no difference whatsoever. The Windows7 desktop would connect but nothing else would. This confirmed to me that it isn't the router. After finding that it made no difference, I changed everything back to the way it was with one exception: For now I'm leaving the crunching boxes disconnected from the router. (In case it was some issue with bandwidth or something like that.) Only the Windows7 machine and Jeepford, our server, are currently plugged into the router.

S485122 2016-03-06 19:10

[QUOTE=gd_barnes;428218]...
But...just to experiment with things, I tried plugging everything directly into the new modem since there are 4 slots in it. This made no difference whatsoever. The Windows7 desktop would connect but nothing else would.
...[/QUOTE]I suppose your Windows desktop has DHCP and you said your Linux boxes have a fixed IP. If the "internal network" address of the modem has changed from 192.168.2.1 to, for instance, 192.168.0.1, only your desktop will be able to connect to the internet.
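
To illustrate why: a host with a fixed address can only reach a gateway that lies in its own subnet. The sketch below checks this with Python's standard ipaddress module, using the server's static address mentioned earlier in the thread (192.168.2.110). The /24 prefix is an assumption, and the function name is made up for this example.

```python
import ipaddress


def can_reach_gateway(host_ip, gateway_ip, prefix=24):
    """True if host and gateway are in the same subnet for the given prefix."""
    host_net = ipaddress.ip_interface(f"{host_ip}/{prefix}").network
    return ipaddress.ip_address(gateway_ip) in host_net


print(can_reach_gateway("192.168.2.110", "192.168.2.1"))  # old gateway
print(can_reach_gateway("192.168.2.110", "192.168.0.1"))  # changed gateway
```

A DHCP client does not have this problem because it simply picks up an address in whatever range the new gateway hands out.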

What would help is the network configuration of your desktop (IP address, netmask, default router...) You can get those by asking the properties of the connection or by opening a command prompt and issuing the command "ipconfig /all".

Jacob
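
The fixed-IP point above can be checked in a few lines of Python. This is only a minimal sketch: it assumes a /24 netmask (255.255.255.0), and the static address 192.168.2.50 is made up for illustration. Only the two gateway addresses come from the post.

```python
import ipaddress

def can_reach_gateway(host_ip: str, gateway_ip: str, prefix_len: int = 24) -> bool:
    """Return True if host_ip lies in the same subnet as gateway_ip."""
    # strict=False builds the network from a host address such as 192.168.0.1
    network = ipaddress.ip_network(f"{gateway_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(host_ip) in network

# Hypothetical static address of one Linux box (not taken from the thread).
static_host = "192.168.2.50"

print(can_reach_gateway(static_host, "192.168.2.1"))  # old gateway: True
print(can_reach_gateway(static_host, "192.168.0.1"))  # new gateway: False
```

A DHCP client simply picks up an address in the new range, which would be consistent with the Windows desktop still working while fixed-IP machines do not.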

gd_barnes 2016-03-06 22:32

[QUOTE=S485122;428232]I suppose your Windows desktop has DHCP and you said your Linux boxes have a fixed IP. If the "internal network" address of the modem has changed from 192.168.2.1 to, for instance, 192.168.0.1, only your desktop will be able to connect to the internet.

What would help is the network configuration of your desktop (IP address, netmask, default router...) You can get those by asking the properties of the connection or by opening a command prompt and issuing the command "ipconfig /all".

Jacob[/QUOTE]

It's interesting that you mention all of that. I think we're getting on the right track. Max (mdettweiler) requested my ipconfig page already and I have sent it to him. There does appear to be a change as you suggested above in the "DHCP server". Max has mentioned that it might be a change from IPv4 to IPv6 protocol.

I don't want to post a lot of specific info. publicly here. I have sent my router config and my IPconfig to Max and Dave and hopefully we can get it worked out today or on Monday.

mdettweiler 2016-03-06 23:18

Now that I've seen Gary's ipconfig dump, I should mention that IPv6 is definitely not the cause of the problem. That was a bit of a red herring. (I must say, if this really was an IPv6 issue, this would be the single weirdest network issue I've ever diagnosed! It's not, though.)

I concur that Jacob's idea is on the right track. I came to the same conclusion when I saw the ipconfig dump. I had missed the subtle difference between 192.168.2.1 and 192.168.0.1 in the info posted on the forum. :smile:

I think there's more to it than just the static IPs being in the wrong range, because the wireless gadgets also can't connect - and they all have dynamic IPs.

We should be able to get the rest of this worked out privately, without airing any more of the "network dirty laundry" out on the forum. Thanks everyone for the tips and suggestions!

rebirther 2016-03-06 23:53

If they changed the protocol from IPv4 to IPv6 (DS-Lite), you are no longer able to access your home network from outside. I had this happen once after an update, so I called my provider to fix it.

mdettweiler 2016-03-07 08:19

The server is back online! I spent the last couple of hours on the phone with Gary and we got his new Arris router set up to replace the Belkin.

I'm still not 100% sure what caused the problem in the first place, but at any rate, it's fixed now. :smile:

Since the noprimeleftbehind.net IP address has changed, it may take a couple of hours for the update to propagate through the DNS system. It's already updated for me, but it might take a bit longer for people on the other side of the globe.
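
For anyone who wants to check whether the update has reached their own resolver, here is a rough Python sketch. The function name `has_propagated` and the address 203.0.113.10 (a documentation-only placeholder) are assumptions for illustration; the real new IP is not given in this thread.

```python
import socket
from typing import Callable

def has_propagated(hostname: str, expected_ip: str,
                   resolve: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True if hostname currently resolves to expected_ip from this machine."""
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # socket.gaierror is a subclass of OSError: the name did not resolve at all.
        return False

# Demonstration with a stub resolver so the example runs without network access.
stub = lambda name: "203.0.113.10"
print(has_propagated("noprimeleftbehind.net", "203.0.113.10", resolve=stub))
```

Leaving out the `resolve` argument uses the system resolver, so the answer reflects whatever DNS cache sits between you and the authoritative server.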

AMDave 2016-03-07 10:34

Excellent work Max and Gary.
I'm monitoring.
DNS is updating again.
First new results have reported and gone through ok. Stats updated.
The server is busy handing out new work.
A few people are going to have some warm bedrooms tonight ;)

High-fives all round.

AMDave 2016-03-13 01:41

NPLB backups restored successfully to DR server - Last update: 2016-03-11 02:05:51
... and all is well.

gd_barnes 2016-04-04 00:23

The server machine went down about two hours ago. It may be a hard drive failure because it goes into a boot sequence but cannot complete it. I'll keep working on it tonight and tomorrow. It is backed up regularly so there should be little loss of data.

AMDave 2016-04-04 01:39

Time to upgrade that OS.

Just don't grab the HDD out of humpford - that's where a copy of yesterday's backup should be sitting :)

DR is running Ubuntu 14.04 LTS and the backup restore process works fine.
DR is currently at "2016-03-25 02:07:56" so that's a worst case of 10 days.
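
The "worst case" figure is just date arithmetic. A minimal sketch, assuming the failure time is roughly the time of Gary's report (2016-04-04 00:23) and ignoring any time zone difference between forum time and server time:

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

def data_at_risk(last_restored: str, failure: str) -> timedelta:
    """Time between the last restored backup and the failure."""
    return datetime.strptime(failure, FMT) - datetime.strptime(last_restored, FMT)

# The DR timestamp is quoted above; the failure time is an assumption.
gap = data_at_risk("2016-03-25 02:07:56", "2016-04-04 00:23:00")
print(gap)  # 9 days, 22:15:04
```

That is just under 10 days, which is where the rounded worst case comes from if the copy on humpford turns out to be unusable.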

Now is the time for that OS upgrade.
Max suggested you'd like Mint.
Add the sshd and create my account, then I can reconfig the web services, DynDNS etc, run all the setup scripts & then restore the backups including your desktop setup.
You'll be up and running with the latest and greatest real quick. (as long as I can get yesterday's backup copy from humpford)

mdettweiler 2016-04-04 03:33

As Dave said - if the hard drive is indeed dead and needs to be replaced, then this is definitely the time to install a [b][i]new[/i][/b] operating system, instead of just using one of the old Ubuntu CDs you have around. I already sent you a download link for this in an email last month, if you end up needing to do this.

Dave has the backup/restore procedure very well-oiled, so all you would need to do is run through the standard install from the live CD, create accounts for Dave and me, and install the SSH server (just run "sudo apt-get install openssh-server" from the terminal) so we can log in remotely.

This is where all the setup we did on the new router last month will pay off...you don't need to do [b][i]anything[/i][/b] special with IP addresses or network configuration on jeepford after a reinstall. As long as the motherboard hasn't changed (which means the MAC address of the Ethernet interface won't change), then the router will assign it the correct IP address and all the current port forwards will drop into place.

That said, it is also possible that the hard drive is not the point of failure here...could be the motherboard/CPU (though I think that is less likely given the symptoms you describe). You might want to try booting jeepford up using one of the Ubuntu live CDs you have lying around; if that works fine (i.e. you boot up into a desktop without problems), then the problem is almost certainly the hard drive.

Also - if you see any error messages come up on screen when it fails to boot, could you send those to Dave and me? (Maybe take a picture of the screen, since copy/paste is clearly not an option here?) Please send it by email to Dave and me rather than post in the forum.

This doesn't really come at a good time for me as I have a fair amount on my plate, but I'll try to keep up best I can. Between Dave and me we should be able to get you back up and running soon. :smile:

gd_barnes 2016-04-04 05:02

OK I had in mind that I would boot from an old Ubuntu disk that I have laying around. If it boots to a desktop then I know it's the hard drive. I'll edit this post as soon as I do that to let you know. I agree that the O.S. needs updating. If the hard drive is bad, I'll buy a new one sometime tomorrow and install the new O.S. (Mint) that Max sent me a few weeks ago.

For future record, here are some of the messages I am getting:
[code]
kinit: No resume image, doing normal boot...
mount: mounting /dev/disk/by-uuid/6d54e625-ec49-4b06-8c5b-609f615887f3 on /root
failed: Invalid argument
mount: mounting /dev on /root/dev failed: No such file or directory
mount: mounting /sys on /root/sys failed: No such file or directory
mount: mounting /proc on /root/proc failed: No such file or directory
Target filesystem doesn't have /sbin/init.
No init found. Try passing init= bootarg.

BusyBox v1.10.2 (Ubuntu 1:1.10.2-2ubuntu7) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs)
[/code]

gd_barnes 2016-04-04 05:21

I was able to boot it from a very old Ubuntu disk (version 8.04). I then was able to access the hard drive and see all of the files; PRPnet servers, the web pages, etc. So...it appears nothing is lost. My suspicion at this point is that there are a few bad sectors in the boot up part of the hard drive. I'll see what I can do with a newer version of Ubuntu.

Edit: And it works...lol. I pulled out the old C.D, and tried another reboot and it came right up normally with no C.D. and no new install. I had tried about 20 different reboots earlier this evening trying various things and it kept coming back to the same error messages.

So...everything will be back online shortly. We will see how long it lasts.

gd_barnes 2016-04-04 05:57

Everything is back up. There was no loss of data.

mdettweiler 2016-04-04 06:37

Good to hear. Based on the error messages you posted, I believe you are correct in surmising that there were some bad sectors on the disk that needed to be repaired. I can't give you a sure answer why it was suddenly "fixed", but perhaps Ubuntu was able to run a file-system check during that final boot - such a check can often "patch up" bad sectors without too much trouble. Or, maybe one of the live CD bootups did such a check and fixed things.

The hard drive may in fact be perfectly fine - sometimes bad sectors can be caused by power outages, if they strike at an inopportune time when the disk head can't stop safely. Modern filesystems keep enough error-correcting metadata that they can "patch around" bad sectors, and recover any lost data, if only a small section of the disk was damaged by this. This is a somewhat routine occurrence, and modern disks usually ship with some extra "hidden" sectors designed to "replace" damaged sectors transparently (i.e., they can do all this in hardware within the disk, instead of relying on the OS and filesystem to do it).

In fact, now that I think of it, if you didn't see Ubuntu doing any disk checks during any of the boot attempts, such a "transparent patch-up" internal to the hard drive may have been exactly what happened, which would explain why it "suddenly worked" without you seeing anything. (Perhaps the time you spent running the computer off the live CD gave the disk enough time to do all of this internally, while powered-on but without the OS trying to use the bad sectors. This is only educated speculation, though. :smile:)

Anyway, since we have good backups, there may not be any reason to replace the current hard drive on the off chance there's really an issue...if/when it does actually die, we'll be in no worse situation than we could have been in today, which would be to replace the disk and put Dave's well-oiled recovery plan into action. :smile:

All that said, we should definitely work on the OS upgrade regardless of the hard drive issues. My suggestion - if you're amenable to it and would be willing to make the purchase - would be to build a new computer to replace jeepford as the server. It would have a completely new hard drive (and a much bigger one, since they've come down in price), and all-new hardware which will give us a lot more capacity to handle newer software and a continually-growing database. You could install Mint on it, and Dave and I could bring it up from the backups at our leisure. Once it's all set we can transition the production systems over to it "seamlessly" with little or no downtime. Afterward, jeepford could join the rest of your full-time crunching boxes, and it would be no big deal if its hard drive ever failed.

Off the top of my head, I estimate we could build such a computer for about $500, if not less (especially if you could re-use some simple components, like a case, from some of your "dead" crunchers).

(Heck, for that matter, your crunching boxes don't even need hard drives...if you wanted to run them "bare-bones" you could boot them all from live CDs and run all the prime stuff off flash drives. :smile:)

gd_barnes 2016-04-04 07:04

lol on that final idea. I'm a little too old school for that.

I'm not quite willing to buy a new computer yet but I agree that we need to upgrade the O.S. I'm not going to give any timeline but it is something I'll keep in the back of my mind.

What you said about a power blip makes sense. I noticed a "dulling" of my lights on-and-off for a minute or more right about the time that it happened. No complete outage so my clocks weren't flashing or anything. Perhaps it was some sort of minor electrical log jam somewhere up the line. Usually the lights go completely out for a few secs and then come back on with all of the clocks flashing so this was a more unique occurrence. Anyway, when that has happened in the past, usually some or all my computers go down. If the blip is fast enough, most of them will not be affected. But with the sustained dulling of the lights, I suspected that they all had been affected, which they had. Oddly some had just rebooted while others completely shut down so the apparent "reduction" in electricity for a minute or so affected some more than others. Jeepford had just rebooted to the error messages that I posted.

Based on your explanation, I think it is very possible that the sustained reduction in electricity flow (or whatever it was) maybe messed with the booting sectors because it was likely trying to reboot itself while the dulling was going on since it lasted for a minute or more. Perhaps putting the old O.S. boot disk in there for a while allowed it to fix itself...very cool if that is what happened.

mdettweiler 2016-04-04 08:25

The sustained power reduction you describe [i]definitely[/i] sounds like it was responsible for the problem. Hard power-offs while attempting to reboot are by far the greatest cause of hard drive sector damage that I have seen. I have experienced this myself more times than I can remember.

This would also further support my supposition as to why the problem "fixed" itself. Presumably, the hard drive can do its automatic "patching around the problem" magic as long as the drive is powered on, but until that operation is done (and I doubt it would be instantaneous), the hard drive will still have to return an error when the computer tries to access the sectors under repair - which is why the OS could not boot, because it had critical files in those sectors. Since it wasn't booting, you kept rebooting it, interrupting power to the hard drive and preventing it from finishing the process. When you booted a different OS from the live CD, the computer stayed on long enough (and you were only accessing [i]other[/i] files on the disk - namely, confirming the servers/files/etc. were all still OK - which were not in the under-repair sectors) that the process could complete without interruption. Hence, on the next reboot, everything was hunky dory.

It's still just educated speculation, but it's my best guess and I'm sticking to it. :smile: If this happens in the future, I would suggest booting it into a live CD, letting it sit for a few minutes, then removing the CD and trying to boot it up normally.

Given this, I think your hard drive is probably fine going forward. Since the data has clearly been recovered without issue (that we're aware of), I see no reason why we should expect the drive to fail imminently.

I should note that because the drive has only a limited supply of "shadow sectors" with which to perform this behind-the-scenes repair, it can only do this a finite number of times. Once it runs out of "shadow sectors", new bad sectors cannot be transparently patched around when they arise. However, this is still not necessarily a deal-breaker, because the OS is perfectly capable of running its own disk check and patching around the bad sectors at the filesystem level. 15 years ago, this is what computers always did, because they didn't have "shadow sectors" - if you remember the Windows 98 days, you may recall good old Scandisk that would come up when your computer was shut down improperly; it would perform exactly this check and repair things if necessary. Obviously, if you rely on this, you are missing that extra layer of protection that modern hard drives provide, but with good backups it need not be a great concern.
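
To make the "finite supply" point concrete, here is a toy Python model. It is purely illustrative: the class name `ToyDisk`, the pool size of 2 and the sector numbers are all invented, and real drive firmware is far more involved than this.

```python
class ToyDisk:
    """Toy model of a drive with a fixed pool of spare ("shadow") sectors."""

    def __init__(self, spare_sectors: int) -> None:
        self.spares_left = spare_sectors
        self.remapped = {}        # bad sector -> index of the spare that replaced it
        self.unrepaired = set()   # bad sectors left for the OS/filesystem to handle

    def report_bad_sector(self, sector: int) -> bool:
        """Try to remap a bad sector; return True if the drive handled it itself."""
        if sector in self.remapped:
            return True
        if self.spares_left > 0:
            self.spares_left -= 1
            self.remapped[sector] = len(self.remapped)
            return True
        self.unrepaired.add(sector)
        return False

disk = ToyDisk(spare_sectors=2)
for bad in (1001, 2002, 3003):
    handled = disk.report_bad_sector(bad)
    print(bad, "remapped by drive" if handled else "left to filesystem check")
```

With two spares, the first two bad sectors are absorbed silently and the third falls through to the filesystem-level check described above.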

The only problem is that some hard drives try to be "extra smart" and send annoying warning messages to the computer when they run out of "shadow sectors". These warnings are good to know about, but depending on how aggressively the OS notifies you about them, the warnings can sometimes get in the way of normal computer use. On one of my computers, I had some "Intel Smart Drive Management" software installed that came with the motherboard drivers. It popped up a dialog box every 60 seconds once the hard drive issued an "out of shadow sectors" error, and I couldn't do anything to get rid of it, even though I'd have been perfectly happy to resort to the "old school" method of mapping around bad sectors without shadow sectors. :smile: (I probably could have gotten rid of this error by uninstalling the "Smart Management" software if I'd cared enough.)

gd_barnes 2016-04-04 10:42

I'm not sure that what you stated is completely true but I think it is mostly true. When I tried rebooting it many times, a couple of the times I waited 20-30 minutes in between just to maybe give it a chance to figure itself out. Nothing worked. Later in the night, I tried rebooting it a couple of more times. No luck. Right after that I used the boot CD and it only took a couple of minutes for it to boot from that. I then looked around for maybe 3 more minutes with the boot CD in to make sure all of the files were on the hard drive, which they were. So at that point, it had only been about 5 minutes since the last reboot. I then took the CD out and it rebooted fine directly from the hard drive. Regardless something clicked in those last 5 minutes that didn't previously click in two 20-30 minute attempts...maybe because it wasn't specifically accessing the bad sectors.

AMDave 2016-04-05 10:21

Ahh. Interesting. That's a feature I should have known about, but didn't.
Thanks Max. I learned something today which makes it a good day :)

I'm glad it worked out and the HDD is responding again.

Although, I thought you (Gary) would be using the same term as me, given our comparable age group.
We have always called it a "Brown out".
So called because, although the power does not completely fail, the voltage drops dramatically causing the old tungsten light filaments to die down to a yellow glow then brown as they cool.
However, there is a nasty consequence to this for modern equipment.
As the voltage drops the Amps increase and that's generally when my el-cheep-o power supply components overload and die, if they are not on the UPS (which handily cleans up the sine waves and regulates the current).
The more robust (and pricey) PSUs can last longer under these conditions, but will also eventually succumb if it happens repeatedly.
I have had it happen so many times that it calculated out as being better at the cheaper end of the scale.
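
The "voltage drops, Amps increase" effect follows from I = P / V for a load that keeps drawing the same power. A small Python sketch, assuming an idealised constant-power supply; the 300 W load and the voltages are hypothetical, and efficiency and power factor are ignored:

```python
def input_current(power_watts: float, voltage: float) -> float:
    """Current drawn by an idealised constant-power load: I = P / V."""
    return power_watts / voltage

load = 300.0  # watts, hypothetical
for volts in (120.0, 100.0, 80.0):
    print(f"{volts:.0f} V -> {input_current(load, volts):.2f} A")
```

This prints 2.50 A, 3.00 A and 3.75 A. A plain resistive load such as a tungsten bulb behaves the other way (less voltage, less current), which is why the lights dim while a regulated PSU works harder.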

As Max says, we're good to go on a new OS on either a new or existing machine whenever you are ready and at your own pace.
The software rebuild is a known factor and the DR server demonstrates that the daemons, databases and web sites all work on current OS and software versions, along with the built in "stim pack" of bug fixes and speed and security improvements.

We can make an email trail out of that, offline, when you are ready.

gd_barnes 2016-04-05 11:19

Interesting. I had heard the term brown out but I never knew what it meant. I've lived at my current residence for about 10 years and this is likely only the 2nd or 3rd time that I can recall a brown out. Usually it's a full blip where the lights go out for a few secs or mins and the digital clocks flash after power comes back on. So now I know what a brown out is and it is an unusual occurrence here.

mdettweiler 2016-04-19 02:23

[b]Brief server downtime around 2016-04-18, 20:00 CDT (server time) - [color=red]RESOLVED[/color][/b]

The noprimeleftbehind.net server apparently rebooted itself around 20:00 server time today. Nothing seems to be "broken" - it had just rebooted (possibly a power blip), which shut down all the PRPnet servers. I logged in around 21:00 and restarted the servers - everything is back up and running now.

Total downtime is just about an hour. Nothing to see here, move along people. :smile:

AMDave 2016-04-19 08:49

Excellent catch. I didn't even get around to noticing it!
I have been too busy watching that and other Frank Drebin quotes on You-Tube :P
It was worth the sojourn.

AMDave 2016-04-30 12:07

Backups backed up and successfully restored to DR server as at 2016-04-29 02:00hrs

AMDave 2016-08-17 20:01

NPLB email notification hiccup:
The server is running fine, but the email service has started queueing on the server (all admin notifications to me, so far, no prime notifications yet).
I will be back at the console in about 12 hours to investigate and resolve.

AMDave 2016-08-24 09:18

NPLB server is 404.

LaurV 2016-08-24 10:32

Wrong. NPLB server is \(2^2\cdot 101\).

VBCurtis 2016-08-24 16:50

[QUOTE=LaurV;440580]Wrong. NPLB server is \(2^2\cdot 101\).[/QUOTE]

:tu: :davar55:

gd_barnes 2016-08-24 17:04

Looks like it is down while I'm out of town. I'll be back late Friday and will look into the issue at that time. Sorry for the problem.

gd_barnes 2016-08-27 08:34

Servers are back up. Sorry for the outage.

AMDave 2016-08-28 01:29

It looks like the email queueing may have been a symptom because when you got us back online they all came through to me at once.
I have mail. Lots of mail :P

As often happens, the contents of the log files show me that the server was running fine the whole time. It was just the connection to it that was impaired, so the server was unable to contact the dynamic DNS server to update its location.
Router manufacturers could add a scheduled reboot feature. That'd be useful.

AMDave 2016-09-03 08:37

NPLB server just started getting error responses from the DNS host again in the last 2 hourly dynamic DNS refreshes.
Investigating...

AMDave 2016-09-04 02:21

It went away again not long after.
Just some random weirdness from the dynamic DNS provider host.

mdettweiler 2017-03-21 06:37

Looks like the noprimeleftbehind.net server is offline/down - I can't connect (either by web, PRPnet or SSH).

Based on my PRPnet clients it looks like it's been out since about midday (server time) on 3/15. Hard to get an exact fix on exactly when it went out since the clients I can readily check were doing multi-hour tests (CRUS).

AMDave 2017-03-21 08:51

According to my logs, it went down sometime between 20/03/2017 16:00:01 and 20/03/2017 17:00:00 (server time).


All times are UTC.
