Crisis MO

While running any sort of site, expect problems. Lots of them. They will range from trucks crashing into your data center, to bad releases, to people defacing your site. Whatever the problem may be, the same pattern seems to emerge in dealing with it:

  1. Fail over
  2. Diagnose
  3. Fix
  4. Fail back

This list makes a few assumptions. It assumes that there is an alternate server for you to fail over to, and a means to do so. Having an alternate server with a simple status page is really cheap and quick to set up. The easiest way to fail over is probably to use DNS. Dynect offers really great DNS service and has an interface that anyone in the company can use.
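
As a rough illustration, that alternate server really can be dead simple. Here’s a minimal sketch of a static status page served with Python’s built-in HTTP server; the port and the message are just placeholders, not anything from a real setup:

    import http.server
    import socketserver

    PORT = 8080  # placeholder port for the status host

    class StatusHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            # Every path gets the same static "we'll be right back" page.
            body = b"<html><body><h1>We'll be back shortly.</h1></body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    with socketserver.TCPServer(("", PORT), StatusHandler) as httpd:
        httpd.serve_forever()

Point DNS at whatever box runs this and you’ve bought yourself time to diagnose.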

Of course, failing over doesn’t necessarily mean taking your entire site down, either. It could mean pushing an update that disables a particular feature, or removing data that’s causing a problem. Failing over really means any sort of quick maneuver that gets you out of hot water.
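
One cheap way to keep that maneuver quick is a kill switch around risky features. The sketch below is hypothetical (the flag file path and the “comments” feature are made up), but it shows the general idea of turning off one feature instead of the whole site:

    import json

    FLAGS_FILE = "/etc/myapp/flags.json"  # hypothetical path, e.g. {"comments": false}

    def feature_enabled(name, default=True):
        """Return False only if the flag file explicitly disables the feature."""
        try:
            with open(FLAGS_FILE) as f:
                flags = json.load(f)
            return bool(flags.get(name, default))
        except (OSError, ValueError):
            # Missing or malformed flag file: fall back to the default.
            return default

    def comments_section():
        # Stand-in for the real (risky) feature.
        if feature_enabled("comments"):
            return "<div>...comments go here...</div>"
        return "<div>Comments are temporarily disabled.</div>"

Flipping the flag is a one-line config change rather than a deploy, which is exactly the kind of quick maneuver you want mid-crisis.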

Once you’ve failed over, it’s really important to understand what happened and why. Knowing what caused the site to break is the most important step in fixing it. In this situation it’s tempting to put up a quick fix without understanding what’s actually happening, and that can easily lead to thrashing through a series of quick fixes that each break something else. In situations like these, making decisions slowly and calmly is crucial. Anything that’s not well thought out can make a bad situation worse.

Once the fix is applied, turn the feature back on or point DNS back at the production site. When everything is back up, it’s well worth spending some time running sanity checks on your fixes.
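
Those sanity checks don’t need to be fancy. A hypothetical sketch, assuming you already know which pages matter (these URLs are made up):

    import urllib.request

    # Hypothetical list of pages that have to work after failing back.
    CHECKS = [
        "https://example.com/",
        "https://example.com/login",
        "https://example.com/search?q=test",
    ]

    def sanity_check(urls):
        """Return the URLs that failed to come back with a 200."""
        failures = []
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    if resp.status != 200:
                        failures.append((url, resp.status))
            except Exception as exc:  # HTTPError, URLError, timeouts, ...
                failures.append((url, exc))
        return failures

    if __name__ == "__main__":
        for url, problem in sanity_check(CHECKS):
            print("FAILED:", url, problem)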

Lastly, it’s also well worth the time to thoroughly document what went wrong and why. The aftermath of a crisis is a golden opportunity to identify problems with your architecture and your process. It’s also a great time to do a few shots…

May 26th, 2009