Wednesday, September 23, 2009

So just how bad has it been?

The recent downtime issues annoy me at least as much as (probably more than) every user of RationalWiki. As always, I strive to offer the best service I can in terms of performance, availability, and protection against catastrophe. But I am not perfect, and I think it is important to keep in mind that this is a hobby, and my "real life" as a graduate student is as overworked and underpaid as most cliches depict it.

All that said and done, just how bad has it really been? After the great crash in August I started working with a service to help monitor RationalWiki's uptime and server performance. With close to three weeks' worth of data we can paint a picture of how bad things have been.



If you take a look at this analysis you can see that RW has actually been up 95 percent of the time. That is not bad, all things considered. Take into account a few points: first, a good chunk of our downtime is packed together, so most of it comes from one or two disasters that caused prolonged outages; and second, our nightly backups cause timeout errors for about 20-30 minutes. If you remove the few major disasters our uptime averages just over 99 percent a day, and without the backups you are looking at 100 percent coverage most days.
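
As a rough illustration of the arithmetic behind these figures, here is a minimal Python sketch. The function names and the 21-day monitoring window are assumptions made for the example ("close to three weeks" is all the data says), and the 10, 20, and 30 minute values are simply plausible lengths for the nightly backup disruption, not measurements from the monitoring service.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes, period_minutes=MINUTES_PER_DAY):
    """Return uptime as a percentage of the period, given minutes of downtime."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def downtime_minutes_for(uptime_pct, days):
    """Return total minutes of downtime implied by an uptime percentage over `days` days."""
    return (1 - uptime_pct / 100.0) * days * MINUTES_PER_DAY


if __name__ == "__main__":
    # Daily uptime if the only disruption is a nightly backup of a given length.
    for minutes in (10, 20, 30):
        print(f"{minutes} min of disruption per day -> {uptime_percent(minutes):.2f}% uptime")

    # Total downtime implied by 95% uptime over an assumed 21-day window.
    total = downtime_minutes_for(95, 21)
    print(f"95% uptime over 21 days -> about {total / 60:.1f} hours of downtime")
```

Under these assumptions, 95 percent uptime over three weeks works out to roughly a day of accumulated downtime, which fits the picture of one or two long outages dominating the total.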

The key, then, is disaster recovery: being able to quickly handle the issues that cause long, protracted downtime. Most of these are easily handled if I am awake and within walking distance of the server. Issues like today's, with the cable going down, are very rare. So we are left with one major issue: server cop-outs that prevent remote login and shut down the site, and that occur when I am either asleep or traveling.

I am actively working on a solution that I think will greatly increase the server's ability to auto-recover from failure, and will expand the options for remote administration in the event of catastrophic failure when I am not present (à la what happened in August).

A lot of this is trial and error and learning as I go. I have never done a project like this before. All we can do is learn from our mistakes and move forward with the goal of doing the best we can. That said, what we do have is pretty good, I think.

7 comments:

  1. We love you Trent!

  2. Can't you run the backup "later", so it misses us EST night owls without inconveniencing the Brits?

  3. It's not that bad... the database is locked for no more than 10 minutes, while the site might run a bit slow for another 20 minutes depending on a few things. So at most you are looking at 30 minutes of any disruption.

    No matter what time it is run, it's going to piss someone off. But better we have the nightly backup, yes?

  4. Yes, better we have it. Maybe we could discuss best time on saloon bar? Obviously, I'd like it when I am sleeping, but that's not very predictable.

  5. It's been a half hour and counting, I have a frustrating complex edit I'd like not to see conflicted....

    (just whinging!)

  6. The only thing that will cause issues with editing is the database being locked. Once the site becomes available again it will function normally. Because I am sending the backup to a remote location, though, I am using upload bandwidth, which can slow the site down a bit.

  7. Well, it took over a half hour for an edit to get "accepted" last night when I was whining. Just sayin' is all. Maybe we should have an "official" 30 minute downtime each day, on a predictable schedule, when the backup is run? And a 15-minute or so warning automatically on the intercom? 'Cause it's freaky when one's edits won't go through.
