The recent downtime issues annoy me at least as much as (and probably more than) they annoy any user of RationalWiki. As always, I strive to offer the best service I can in terms of performance, availability, and protection against catastrophe. But I am not perfect, and I think it is important to keep in mind that this is a hobby, and my "real life" as a graduate student is as overworked and underpaid as most cliches depict it.
All that said and done, just how bad has it really been? After the great crash in August I started working with a service to help monitor RationalWiki's uptime and server performance. With close to three weeks' worth of data we can paint a picture of how bad things have been.
If you take a look at this analysis you can see that RW has actually been up 95 percent of the time. That is not bad, all things considered. Two points are worth taking into account: first, a good chunk of our downtime is packed together, so most of it comes from one or two disasters that caused prolonged outages; second, our nightly backups cause timeout errors for about 20-30 minutes. If you remove the few major disasters, our uptime averages just over 99 percent a day, and without the backups you are looking at 100 percent coverage most days.
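For anyone who wants to see how figures like these are put together, here is a rough sketch of the arithmetic. It is not the monitoring service's actual method, and the outage minutes below are made-up placeholders chosen only so that the overall number lands at 95 percent over a three-week window; the names `Outage` and `uptime_percent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float  # minutes of failed checks attributed to this cause
    kind: str       # "disaster", "backup", or "other"


def uptime_percent(total_minutes, outages, exclude=()):
    """Percent of the window the site was up.

    Outages whose kind is listed in `exclude` are treated as if the
    site had been up, which is what "removing" them from the picture means.
    """
    down = sum(o.minutes for o in outages if o.kind not in exclude)
    return 100.0 * (total_minutes - down) / total_minutes


# Three weeks expressed in minutes.
TOTAL = 21 * 24 * 60

# Illustrative numbers only, NOT the real monitoring data.
outages = [
    Outage(780, "disaster"),
    Outage(480, "disaster"),
    Outage(210, "backup"),
    Outage(42, "other"),
]

overall = uptime_percent(TOTAL, outages)
no_disasters = uptime_percent(TOTAL, outages, exclude=("disaster",))
no_disasters_or_backups = uptime_percent(
    TOTAL, outages, exclude=("disaster", "backup")
)

print(f"overall:                       {overall:.2f}%")
print(f"without disasters:             {no_disasters:.2f}%")
print(f"without disasters and backups: {no_disasters_or_backups:.2f}%")
```

The point of the sketch is only that a couple of long outages dominate the total: dropping them moves the number far more than dropping many short ones.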
The key, then, is disaster recovery: being able to quickly handle issues that cause long, protracted downtime. Most of these are easily handled if I am awake and within walking distance of the server. Issues like today's, with the cable going down, are very rare. So we are left with one major issue: server cop-outs that prevent remote login and shut down the site, and that occur when I am either asleep or traveling.
I am actively working on a solution that I think will greatly increase the server's ability to auto-recover from failure, and will expand the options for remote administration in the event of a catastrophic failure when I am not present (à la what happened in August).
A lot of this is trial and error and learning as I go. I have never done a project like this before. All we can do is learn from our mistakes and move forward with the goal of doing the best we can. That said, what we do have is pretty good, I think.