Friday, September 25, 2009

Fault protection take 2

I have received the additional hardware I needed to get the fault protection working...I think. I will be setting that up and running some tests on it today. It shouldn't affect RationalWiki at all as I can do my testing further down the network chain. If all goes well I will need to restart the server and that's about it. I will drop an intercom on RW when/if I do that.

Update: Okay, fault protection seems to be working well. I have implemented it live on the server now. I will continue to monitor everything and adjust as needed. But all appears well for now.

Wednesday, September 23, 2009

So just how bad has it been?

The recent downtime issues annoy me at least as much as (probably more than) every user of RationalWiki. As always, I strive to offer the best service I can as far as performance, availability, and protection against catastrophe. But I am not perfect, and I think it is important to keep in mind that this is a hobby, and my "real life" as a graduate student is as overworked and underpaid as most cliches depict it.

All that said and done, just how bad has it really been? After the great crash in August I started working with a service to help monitor RationalWiki's uptime and server performance. With close to three weeks' worth of data we can paint a picture of how bad things have been.

If you take a look at this analysis you can see that RW has actually been up 95 percent of the time. That is not bad, all things considered. Take into account a few points: first, a good chunk of our downtime is packed together, so it is mostly caused by one or two disasters that led to prolonged downtime; second, our nightly backups cause timeout errors for about 20-30 minutes. If you remove the few major disasters our uptime averages just over 99 percent a day, and without the backups you are looking at 100 percent coverage most days.
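For anyone who wants to play with the arithmetic, here is a rough Python sketch of how an uptime percentage falls out of a list of outages. The function name and the outage figures are made up for illustration (a 25-minute backup window each night plus one invented 16-hour outage); they are not the numbers from the monitoring service.

```python
def uptime_percent(total_minutes, outage_minutes):
    """Return uptime as a percentage of the monitored window.

    outage_minutes is a list of individual outage lengths in minutes.
    """
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    down = sum(outage_minutes)
    return 100.0 * (total_minutes - down) / total_minutes


MINUTES_PER_DAY = 24 * 60
DAYS = 21  # roughly three weeks of monitoring

# Hypothetical outages: one 25-minute backup window per night,
# plus a single invented 16-hour "disaster".
outages = [25] * DAYS + [16 * 60]

overall = uptime_percent(DAYS * MINUTES_PER_DAY, outages)
print(f"Overall uptime: {overall:.1f}%")
```

With these invented figures the overall number lands near 95 percent, and almost two thirds of the downtime comes from the single long outage, which is the same pattern described above.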

The key, then, is disaster recovery: being able to quickly handle issues that cause long, protracted downtime. Most of these are easily handled if I am awake and within walking distance of the server. The issues today with the cable going down are very rare. So we are left with one major issue: server cop-outs that prevent remote login and shut down the site, and that occur when I am either asleep or traveling.

I am actively working on a solution that I think will greatly increase the server's ability to auto-recover from failure, and expand the options for remote administration in the event of catastrophic failure when I am not present (à la what happened in August).

A lot of this is trial and error and learning as I go. I have never done a project like this before. All we can do is learn from our mistakes, and move forward with the goal of doing the best we can. That said, what we do have is pretty good, I think.

And we are back....

What appears to have happened is my next door neighbors got a cable hookup, and God put it into the mind of the cable technician to accidentally disconnect my cable while doing the hook up.

Where God failed in his plan was in not preventing another technician from showing up and fixing the problem. An even larger failure was that he allowed the technician to show up hours early! How often does that happen? I think we know who to thank for that!

On a more serious note, in my comment on the last post I discussed the monthly cost of running RW. The point was not to ask for more money. The cost stems from the dedicated IP address and extra bandwidth that RW requires. It is an extra $50 a month above what I would pay anyway. I get about $20 a month in donations from people who give every month. That means my out-of-pocket expense is $30 a month. I can totally handle that. The occasional need for new hardware that I can't afford is usually met almost immediately by a small donation drive. Every now and then I get an extra donation that I keep on the back burner for emergencies.

The RW accounts have about $80 sitting in them right now for emergency purchases. I like to keep that amount around $100-$150 to make sure I can get almost anything we need.

All of this is to say that at the moment the financial cost of RW is not really a burden on anyone. That was the point to moving to a privately hosted site.

If we were to try and move to a commercially hosted site the financial burden increases multiple fold for me and those people who are able and willing to donate. There is also substantial risk that if we fail to get enough cash we could lose the site.

So we put up with this less than perfect uptime because it means that RW is under no immediate risk of permanent shutdown and is not going to bankrupt anyone in the process.

When it rains...

So God's will has struck again: the site is down, but this time it has nothing to do with the server. My internet connection seems to have crapped out. Because I pay $200 a month for it, though, one of the perks is "emergency service," so I have a guy coming in sometime between 5pm and 8pm EST to take a look at it and hopefully get everything back up and running.

Tuesday, September 22, 2009

No sleep for the server admin

So I was up till 4 am trying to get the fault protection working. It appeared to be working beautifully so I went to bed. Only to be awoken 3 hours later with it going nuts. I have removed it from the system for now. I think I need another piece of hardware to get everything working together the way I want. So I am putting it all on hold for now. I will order the new hardware today so it could be upwards of a week before I go at this again.

My goal is to have it all in place before my trip to Chicago.

And now I think I am getting a cold. I blame the stress and lack of sleep damn it.

Okay I lied

As is probably obvious, I lied in the last update: an idea for a solution to my problem came to mind.

So I think I have everything set up. I am going to avoid going into specifics for security reasons, but we now have much greater remote administration abilities that are no longer dependent on the server being online. I have also set up a range of automated monitoring software and utilities that will aid in both keeping track of site uptime and doing some automated tasks that should allow the server to recover from all but the most serious of crashes automatically.

With that I am going to bed.

Monday, September 21, 2009

Extended maintenance

The site is going down this afternoon for extended maintenance. Running some tests, changing some options and installing some new hardware. All designed to try and help deal with some of the recent outages. Running the backup first, then I will get started.

Update: Screw it I am done for tonight. Got about half of what I wanted figured out. Luckily the last half lets me keep RW up most of the time I am working. I will have to come back and keep working on this probably tomorrow which means don't freak if there is intermittent downtime for a minute or so every now and then for the next day or so.

Repairing a table in the database

Things will be locked up for a few minutes while the repair is run. I am aware of the situation and working on it, and hope to have things back up shortly.

UPDATE: Repair is done and the site is back online. Let me know if there are further issues.

Sunday, September 13, 2009

Server crash post-mortem

Time for the official post-mortem of what happened as far as the server crash goes. The official cause of the crash shall be listed as a failing power supply unit.

About a month ago the power supply for the server went completely dead. In order to get the server back up and running as quickly as possible I swapped in a spare unit I had from an older computer. It did the job beautifully. Seeing as how everything appeared to be working fine and there were no substantial problems I didn't replace it with a new unit.

About 4 days before I left for my trip back home the server shut itself down. When I booted it back up I had some problems with the MySQL server and so chalked the problem up to that. Then I left and went away. We all know what happened next. When I got back to the server I found that it was in the same disabled state as the first crash. I got it back up and decided to watch and see what it would do. A day later the same thing happened.

So I ran some tests on the power supply and it was providing irregular power on the 12V rail. My guess is that was probably leading to a temperature-triggered shutdown. Anyway, regardless, 2 days ago I purchased a high-end power supply unit and swapped it in. Everything seems to be running fine now.

Thanks to the donations everyone at RW has given or promised to give, I have gone ahead and upgraded some of the networking hardware that was worrying me as well. I am also working on getting some hardware to allow for remote management of the server even if it is unresponsive, as well as automatic resets of the server if it becomes unresponsive.

So the whole thing is my fault for not swapping in a new power unit after the old one failed and instead relying on a spare one. Feel free to block me for some pi unit of time for my failing.

As a final note, if this is truly an act of God as more than one person posited, it is pretty convoluted and weak. A swarm of locusts munching on my power cords would have been far more effective, both in maintaining downtime and in the general "shock and awe" of it all.

And a few RationalWiki prods for the road:

No true Scotsman
Common descent

Saturday, September 12, 2009

The expanding face of RationalWiki

A while ago I posted a small picture of the RationalWiki server. It seemed to amuse people. Well, since that time the RW complex has expanded to take up more and more space. I decided to upload a new picture so everyone can see the new face of RationalWiki:

Woot! New network switch just arrived.

I am replacing the weird little neon hunk of plastic that I think is a network switch (but the Mandarin confuses me) with a solid Linksys switch I ordered from Newegg. Thanks to everyone that helped donate!

I think the switch was by far the "weakest" point in the network setup, and most likely to fail next. So this is a good upgrade.

A switch should take less than a minute to install so I am not bothering with an RW intercom message. But posting this here just in case I blow something up and the site stays down longer than expected.

After this just need to get some automatic/remote server monitoring hardware and we will be set!

Friday, September 11, 2009

New status widget and update on Google

New widget

So for fun I have set up a little widget on the blog to show the status of the RationalWiki servers.

If the server is down it just says server down. If the server is up and working it will display the number of hours and minutes that the server has been up "straight." That means no reboot or power down. I am also displaying the 15-minute running average for the CPU load so people can see how busy the server has been recently.
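For the curious, the logic behind a widget like this can be sketched in a few lines of Python. This is not the actual widget code. The function names and the exact output format are my own inventions for illustration, and the load reading assumes a Unix-like server where `os.getloadavg()` is available.

```python
import os


def format_status(is_up, uptime_seconds=0, load15=0.0):
    """Build the one-line status string (format is hypothetical)."""
    if not is_up:
        return "server down"
    # Split the uptime into whole hours and leftover minutes.
    hours, remainder = divmod(int(uptime_seconds), 3600)
    minutes = remainder // 60
    return f"up {hours}h {minutes:02d}m, 15-min load {load15:.2f}"


def local_load15():
    """Return the 15-minute load average of this machine (Unix only)."""
    # os.getloadavg() returns the 1-, 5- and 15-minute averages.
    return os.getloadavg()[2]


# Example with made-up readings: 93,784 seconds is 26 hours and 3 minutes.
print(format_status(True, 93784, 0.4237))
print(format_status(False))
```

Counting uptime in seconds since the last boot is what makes it "straight" uptime: any reboot or power down resets the counter to zero.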

Google update

So based on my searching we are back on top for Andrew Schlafly and Poe's Law which were two of our bigger hitters for search engine referred traffic. Not everything is reindexed yet but it looks like we should recover all right from our downtime.

Hardware replacement for real this time

Okay, so now that our magic new backup system appears to be working, I can actually do what I meant to do yesterday. So the site is down because I am replacing hardware.

Obligatory Google prod of the day:

Denyse O'Leary in honor of the first person to openly admit to considering a defamation lawsuit against us.
Esther Hicks just because.

UPDATE: It looks like everything went exactly as planned, smooth upgrade, site back online. I will continue to monitor the situation to make sure nothing weird happens.

Thursday, September 10, 2009

Hardware replacement, downtime

I think I have discovered the issue behind the random shutdowns that took the site offline. As is the case with everything in life, the solution to the problem is going to cost money and time.

I will be heading out to purchase some new hardware this afternoon and its installation will cause some downtime. If all goes as planned it will be less than an hour. I will give a 20-30 minute warning on RW before I take things down.

Google prodding still needed btw:

Gish Gallop

UPDATE: Okay I lied. I am trying to get a few things done at once, and the completion time on task 1 is taking longer than I anticipated so the hardware replacement has been delayed till late tonight or tomorrow.

Monday, September 7, 2009

Poking Google

We got dropped by the search engines so I am going to poke a few of the important pages, and hope it prods re-indexing of the site:

Poe's Law
Andrew Schlafly

If you have a blog or other dynamically updated website please consider poking these pages, or any of your favorite RationalWiki pages.

Sunday, September 6, 2009

All systems a go

Well RationalWiki should have all cylinders firing.

Sorry for the downtime, but there was naught I could do. I will be doing some thinking about how to expand remote administration of the site over the next few days. This probably means a mini-fund raiser will be in order but hopefully we can get something set up that will prevent this kind of disaster in the future.

Getting things back online now

Not a hundred percent yet.

Main thing to do is some troubleshooting and then apparently importing pages that have been made on other wikis.

Database is locked for now.

Stay tuned for updates.

Update: Everything is working. NX is going to import pages from other wikis and then open up the database for the grand reunion. I am going to get food and caffeine. We have some "rebuilding" to do though since Google dropped us. I will post on RW later tonight to poke people for help getting us back to awesome again.