mersenneforum.org Severe outrages

 2016-04-04, 05:02 #463 gd_barnes     May 2007 Kansas; USA 10,597 Posts OK I had in mind that I would boot from an old Ubuntu disk that I have laying around. If it boots to a desktop then I know it's the hard drive. I'll edit this post as soon as I do that to let you know. I agree that the O.S. needs updating. If the hard drive is bad, I'll buy a new one sometime tomorrow and install the new O.S. (Mint) that Max sent me a few weeks ago. For future record, here are some of the messages I am getting:
Code:
    kinit: No resume image, doing normal boot...
    mount: mounting /dev/disk/by-uuid/6d54e625-ec49-4b06-8c5b-609f615887f3 on /root failed: Invalid argument
    mount: mounting /dev on /root/dev failed: No such file or directory
    mount: mounting /sys on /root/sys failed: No such file or directory
    mount: mounting /proc on /root/proc failed: No such file or directory
    Target filesystem doesn't have /sbin/init.
    No init found. Try passing init= bootarg.
    BusyBox v1.10.2 (Ubuntu 1:1.10.2-2ubuntu7) built-in shell (ash)
    Enter 'help' for a list of built-in commands.
    (initramfs)
Last fiddled with by gd_barnes on 2016-04-04 at 05:03
 2016-04-04, 05:21 #464 gd_barnes     May 2007 Kansas; USA 10,597 Posts I was able to boot it from a very old Ubuntu disk (version 8.04). I then was able to access the hard drive and see all of the files: PRPnet servers, the web pages, etc. So...it appears nothing is lost. My suspicion at this point is that there are a few bad sectors in the boot up part of the hard drive. I'll see what I can do with a newer version of Ubuntu. Edit: And it works...lol. I pulled out the old C.D. and tried another reboot and it came right up normally with no C.D. and no new install. I had tried about 20 different reboots earlier this evening trying various things and it kept coming back to the same error messages. So...everything will be back online shortly. We will see how long it lasts. Last fiddled with by gd_barnes on 2016-04-04 at 05:26
 2016-04-04, 05:57 #465 gd_barnes     May 2007 Kansas; USA 10,597 Posts Everything is back up. There was no loss of data.
 2016-04-04, 06:37 #466 mdettweiler A Sunny Moo     Aug 2007 USA (GMT-5) 3×2,083 Posts Good to hear. Based on the error messages you posted, I believe you are correct in surmising that there were some bad sectors on the disk that needed to be repaired. I can't give you a sure answer why it was suddenly "fixed", but perhaps Ubuntu was able to run a file-system check during that final boot - such a check can often "patch up" bad sectors without too much trouble. Or, maybe one of the live CD bootups did such a check and fixed things. The hard drive may in fact be perfectly fine - sometimes bad sectors can be caused by power outages, if they strike at an inopportune time when the disk head can't stop safely. Modern filesystems keep enough error-correcting metadata that they can "patch around" bad sectors, and recover any lost data, if only a small section of the disk was damaged by this. This is a somewhat routine occurrence, and modern disks usually ship with some extra "hidden" sectors designed to "replace" damaged sectors transparently (i.e., they can do all this in hardware within the disk, instead of relying on the OS and filesystem to do it). In fact, now that I think of it, if you didn't see Ubuntu doing any disk checks during any of the boot attempts, such a "transparent patch-up" internal to the hard drive may have been exactly what happened, which would explain why it "suddenly worked" without you seeing anything. (Perhaps the time you spent running the computer off the live CD gave the disk enough time to do all of this internally, while powered-on but without the OS trying to use the bad sectors. This is only educated speculation, though.)
Anyway, since we have good backups, there may not be any reason to replace the current hard drive on the off chance there's really an issue...if/when it does actually die, we'll be in no worse situation than we could have been in today, which would be to replace the disk and put Dave's well-oiled recovery plan into action. All that said, we should definitely work on the OS upgrade regardless of the hard drive issues. My suggestion - if you're amenable to it and would be willing to make the purchase - would be to build a new computer to replace jeepford as the server. It would have a completely new hard drive (and a much bigger one, since they've come down in price), and all-new hardware which will give us a lot more capacity to handle newer software and a continually-growing database. You could install Mint on it, and Dave and I could bring it up from the backups at our leisure. Once it's all set we can transition the production systems over to it "seamlessly" with little or no downtime. Afterward, jeepford could join the rest of your full-time crunching boxes, and it would be no big deal if its hard drive ever failed. Off the top of my head, I estimate we could build such a computer for about $500, if not less (especially if you could re-use some simple components, like a case, from some of your "dead" crunchers). (Heck, for that matter, your crunching boxes don't even need hard drives...if you wanted to run them "bare-bones" you could boot them all from live CDs and run all the prime stuff off flash drives.) Last fiddled with by mdettweiler on 2016-04-04 at 06:39
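Editor's sketch: the "hidden" replacement sectors Max describes are normally reported by the drive itself as SMART attribute 5, Reallocated_Sector_Ct, which the smartmontools command `smartctl -A <device>` prints as one row of a table. The snippet below is only a rough illustration of reading that row; the sample text is made up for the example (it is not output from jeepford), and whether smartmontools is installed on that server is an assumption.

```python
from typing import Optional


def reallocated_sector_count(smartctl_output: str) -> Optional[int]:
    """Return the raw Reallocated_Sector_Ct value from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID, then the attribute name;
        # the raw value is the tenth whitespace-separated column.
        if (
            len(fields) >= 10
            and fields[0] == "5"
            and fields[1] == "Reallocated_Sector_Ct"
        ):
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


# Hypothetical sample, invented for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
"""

print(reallocated_sector_count(SAMPLE))
```

On a real machine the text would come from running `smartctl -A` against the actual device (the device path depends on the system). A raw value that keeps climbing between checks would suggest the drive is still using up spares, which is one way to test the "probably fine" guess over time.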
 2016-04-04, 07:04 #467 gd_barnes     May 2007 Kansas; USA 10100101100101₂ Posts lol on that final idea. I'm a little too old school for that. I'm not quite willing to buy a new computer yet but I agree that we need to upgrade the O.S. I'm not going to give any timeline but it is something I'll keep in the back of my mind. What you said about a power blip makes sense. I noticed a "dulling" of my lights on-and-off for a minute or more right about the time that it happened. No complete outage so my clocks weren't flashing or anything. Perhaps it was some sort of minor electrical log jam somewhere up the line. Usually the lights go completely out for a few secs and then come back on with all of the clocks flashing so this was a more unique occurrence. Anyway, when that has happened in the past, usually some or all my computers go down. If the blip is fast enough, most of them will not be affected. But with the sustained dulling of the lights, I suspected that they all had been affected, which they had. Oddly some had just rebooted while others completely shut down so the apparent "reduction" in electricity for a minute or so affected some more than others. Jeepford had just rebooted to the error messages that I posted. Based on your explanation, I think it is very possible that the sustained reduction in electricity flow (or whatever it was) maybe messed with the booting sectors because it was likely trying to reboot itself while the dulling was going on since it lasted for a minute or more. Perhaps putting the old O.S. boot disk in there for a while allowed it to fix itself...very cool if that is what happened. Last fiddled with by gd_barnes on 2016-04-04 at 07:06
 2016-04-04, 08:25 #468 mdettweiler A Sunny Moo     Aug 2007 USA (GMT-5) 3·2,083 Posts The sustained power reduction you describe definitely sounds like it was responsible for the problem. Hard power-offs while attempting to reboot are by far the greatest cause of hard drive sector damage that I have seen. I have experienced this myself more times than I can remember. This would also further support my supposition as to why the problem "fixed" itself. Presumably, the hard drive can do its automatic "patching around the problem" magic as long as the drive is powered on, but until that operation is done (and I doubt it would be instantaneous), the hard drive will still have to return an error when the computer tries to access the sectors under repair - which is why the OS could not boot, because it had critical files in those sectors. Since it wasn't booting, you kept rebooting it, interrupting power to the hard drive and preventing it from finishing the process. When you booted a different OS from the live CD, the computer stayed on long enough (and you were only accessing other files on the disk - namely, confirming the servers/files/etc. were all still OK - which were not in the under-repair sectors) that the process could complete without interruption. Hence, on the next reboot, everything was hunky dory. It's still just educated speculation, but it's my best guess and I'm sticking to it. If this happens in the future, I would suggest booting it into a live CD, letting it sit for a few minutes, then removing the CD and trying to boot it up normally. Given this, I think your hard drive is probably fine going forward. Since the data has clearly been recovered without issue (that we're aware of), I see no reason why we should expect the drive to fail imminently. I should note that because the drive has only a limited supply of "shadow sectors" with which to perform this behind-the-scenes repair, it can only do this a finite number of times. 
Once it runs out of "shadow sectors", new bad sectors cannot be transparently patched around when they arise. However, this is still not necessarily a deal-breaker, because the OS is perfectly capable of running its own disk check and patching around the bad sectors at the filesystem level. 15 years ago, this is what computers always did, because they didn't have "shadow sectors" - if you remember the Windows 98 days, you may recall good old Scandisk that would come up when your computer was shut down improperly; it would perform exactly this check and repair things if necessary. Obviously, if you rely on this, you are missing that extra layer of protection that modern hard drives provide, but with good backups it need not be a great concern. The only problem is that some hard drives try to be "extra smart" and send annoying warning messages to the computer when they run out of "shadow sectors". These warnings are good to know about, but depending on how aggressively the OS notifies you about them, the warnings can sometimes get in the way of normal computer use. On one of my computers, I had "Intel Smart Drive Management" software installed that came with the motherboard drivers, which popped up a dialog box every 60 seconds once the hard drive issued an "out of shadow sectors" error. I couldn't do anything to get rid of it, even though I'd have been perfectly happy to resort to the "old school" method of mapping around bad sectors without shadow sectors. (I probably could have gotten rid of this error by uninstalling the "Smart Management" software if I'd cared enough.) Last fiddled with by mdettweiler on 2016-04-04 at 08:26
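Editor's sketch: the "finite supply of shadow sectors" idea from the post above can be pictured with a toy model. This is not how any drive firmware is actually written; the class name, method names, and sector numbers are all hypothetical, chosen only to show the behaviour Max describes: bad sectors get remapped silently until the spare pool is empty, after which they are left for the OS and filesystem to work around.

```python
class ToyDisk:
    """Toy model of a drive with a fixed pool of spare ("shadow") sectors."""

    def __init__(self, spare_sectors: int) -> None:
        self.spares_left = spare_sectors
        self.remap = {}          # bad sector number -> index of the spare used
        self.unrepaired = set()  # bad sectors the drive could not remap

    def mark_bad(self, sector: int) -> bool:
        """Try to remap a bad sector; True means the drive handled it itself."""
        if sector in self.remap:
            return True  # already remapped earlier
        if self.spares_left > 0:
            self.spares_left -= 1
            self.remap[sector] = len(self.remap)
            return True
        # Pool exhausted: the OS/filesystem has to map around this one.
        self.unrepaired.add(sector)
        return False


disk = ToyDisk(spare_sectors=2)
results = [disk.mark_bad(s) for s in (100, 205, 3000)]
print(results)          # [True, True, False]
print(disk.unrepaired)  # {3000}
```

With two spares, the first two bad sectors are absorbed invisibly and the third is the point where a filesystem-level check (the Scandisk-style approach mentioned above) would have to take over.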
 2016-04-04, 10:42 #469 gd_barnes     May 2007 Kansas; USA 10,597 Posts I'm not sure that what you stated is completely true but I think it is mostly true. When I tried rebooting it many times, a couple of the times I waited 20-30 minutes in between just to maybe give it a chance to figure itself out. Nothing worked. Later in the night, I tried rebooting it a couple of more times. No luck. Right after that I used the boot CD and it only took a couple of minutes for it to boot from that. I then looked around for maybe 3 more minutes with the boot CD in to make sure all of the files were on the hard drive, which they were. So at that point, it had only been about 5 minutes since the last reboot. I then took the CD out and it rebooted fine directly from the hard drive. Regardless something clicked in those last 5 minutes that didn't previously click in two 20-30 minute attempts...maybe because it wasn't specifically accessing the bad sectors.
 2016-04-05, 10:21 #470 AMDave     Jan 2006 deep in a while-loop 1232₈ Posts Ahh. Interesting. That's a feature I should have known about, but didn't. Thanks Max. I learned something today which makes it a good day :) I'm glad it worked out and the HDD is responding again. Although, I thought you (Gary) would be using the same term as me, given our comparable age group. We have always called it a "Brown out". So called because, although the power does not completely fail, the voltage drops dramatically causing the old tungsten light filaments to die down to a yellow glow then brown as they cool. However, there is a nasty consequence to this for modern equipment. As the voltage drops the Amps increase and that's generally when my el-cheep-o power supply components overload and die, if they are not on the UPS (which handily cleans up the sine waves and regulates the current). The more robust (and pricey) PSUs can last longer under these conditions, but will also eventually succumb if it happens repeatedly. I have had it happen so many times that it calculated out as being better at the cheaper end of the scale. As Max says, we're good to go on a new OS on either a new or existing machine whenever you are ready and at your own pace. The software rebuild is a known factor and the DR server demonstrates that the daemons, databases and web sites all work on current OS and software versions, along with the built in "stim pack" of bug fixes and speed and security improvements. We can make an email trail out of that, offline, when you are ready.
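Editor's sketch: Dave's "as the voltage drops the Amps increase" follows from P = V × I if the power supply behaves roughly as a constant-power load, which is the assumption made here (a plain resistive load such as a tungsten filament does the opposite and draws less current as voltage falls). The 300 W load and the voltages below are hypothetical round numbers, and efficiency and power-factor effects are ignored.

```python
def input_current_amps(load_watts: float, line_volts: float) -> float:
    """Current drawn by an idealised constant-power load: I = P / V."""
    if line_volts <= 0:
        raise ValueError("line voltage must be positive")
    return load_watts / line_volts


# Hypothetical 300 W load at nominal and sagging line voltages.
for volts in (120.0, 100.0, 90.0):
    amps = input_current_amps(300.0, volts)
    print(f"{volts:.0f} V -> {amps:.2f} A")
```

Under these assumptions a sag from 120 V to 90 V raises the input current by a third, which is the extra stress on input-side components that the post is describing.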
 2016-04-05, 11:19 #471 gd_barnes     May 2007 Kansas; USA 10,597 Posts Interesting. I had heard the term brown out but I never knew what it meant. I've lived at my current residence for about 10 years and this is likely only the 2nd or 3rd time that I can recall a brown out. Usually it's a full blip where the lights go out for a few secs or mins and the digital clocks flash after power comes back on. So now I know what a brown out is and it is an unusual occurrence here. Last fiddled with by gd_barnes on 2016-04-05 at 11:20
 2016-04-19, 02:23 #472 mdettweiler A Sunny Moo     Aug 2007 USA (GMT-5) 3×2,083 Posts Brief server downtime around 2016-04-18, 20:00 CDT (server time) - RESOLVED The noprimeleftbehind.net server apparently rebooted itself around 20:00 server time today. Nothing seems to be "broken" - it had just rebooted (possibly a power blip), which shut down all the PRPnet servers. I logged in around 21:00 and restarted the servers - everything is back up and running now. Total downtime is just about an hour. Nothing to see here, move along people.
 2016-04-19, 08:49 #473 AMDave     Jan 2006 deep in a while-loop 1010011010₂ Posts Excellent catch. I didn't even get around to noticing it! I have been too busy watching that and other Frank Drebin quotes on You-Tube :P It was worth the sojourn.

All times are UTC.
