Zombie Projects

2009.06.30

A Zombie Project is any project that was completed a while ago but now needs to be changed or updated in some way. While it lay dormant, the team that developed it may have left, evolved into a different team, or changed its methodology entirely.

While the project was dormant, it is extremely likely, if not certain, that knee-jerk reactions to bugs have caused changes that no one remembers, much less committed. So when development starts on the project sandbox, if it still exists, it is more likely than not that everything is out of date. In this case, bite the bullet. FTP the whole codebase down, and check it against the repo. Your life will probably be miserable for a bit, but at least there won’t be any horrible surprises later. Also, if there’s a sandbox, triple check it. If not, be glad there’s nothing you’ll miss in an old one.
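Checking the downloaded copy against the repo by eye gets old fast. Here is a minimal sketch of one way to do it, assuming you have the FTP’d production copy and a fresh repo checkout sitting in two local directories. The function names are made up for illustration, and skipping .git and .svn folders is my assumption about what your checkout contains.

```python
import hashlib
from pathlib import Path

# Version-control folders to ignore (an assumption; adjust for your repo).
IGNORED_DIRS = {".git", ".svn"}


def file_digests(root):
    """Map each file's path (relative to root) to a SHA-256 digest of its contents."""
    root = Path(root)
    digests = {}
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if path.is_file() and not IGNORED_DIRS.intersection(relative.parts):
            digests[relative.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(production_dir, repo_dir):
    """Report files found only on one side, and files whose contents differ."""
    prod = file_digests(production_dir)
    repo = file_digests(repo_dir)
    shared = set(prod) & set(repo)
    return {
        "only_in_production": sorted(set(prod) - set(repo)),
        "only_in_repo": sorted(set(repo) - set(prod)),
        "changed": sorted(p for p in shared if prod[p] != repo[p]),
    }
```

Anything listed under "changed" or "only_in_production" is a candidate for one of those forgotten hotfixes, and worth reading before you overwrite it.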

If this project is only being opened for a short period, resist the urge to make small changes in methodology or framework. For every small change you make, you’ll want to make another, and update this, and modify that. Keep a close eye on the budget and the cost of each change, or you’re done for.

Lastly, be very, very afraid of the build to production. Even with fastidious notes, a sys admin with OCD, and Gillian the QA girl telling you how awesome you are, assume you missed something. An open tail of the error log is your best visibility into whatever has reared its ugly head. Your next best bet is your trusty Crisis MO.
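If plain tail isn’t handy on the box, the same idea can be sketched in a few lines. This is only an illustration: the keyword list and the one-second polling interval are assumptions, and the log path is whatever your server actually writes to.

```python
import time


def read_new_lines(handle):
    """Return the complete lines appended to the file since the last call."""
    lines = []
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line.endswith("\n"):
            # Nothing new, or a half-written line: rewind and pick it up next time.
            handle.seek(position)
            return lines
        lines.append(line.rstrip("\n"))


def follow(path, keywords=("error", "fatal", "warning"), poll_seconds=1.0):
    """Print new log lines that mention any keyword. Runs until interrupted."""
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, 2)  # Jump to the end: only lines written after the build matter.
        while True:
            for line in read_new_lines(handle):
                if any(word in line.lower() for word in keywords):
                    print(line)
            time.sleep(poll_seconds)
```

Start follow() on the error log just before the build goes out, and leave it running while you click through the site.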

Document failure

2009.06.07

When things go wrong, there is always a reason. Sometimes it’s a good one. (I had no idea the web request values in $_SERVER weren’t available from the CLI.) Sometimes it’s a really bad one. (I left 2 seconds after running the build script, and I didn’t test beforehand.) Whatever the reason is, the question needs to be asked: ‘How can we never, ever let this happen again?’ Once you arrive at a good answer to that question, it’s really important that you hold that answer in a vise-like grip, and never let it go. Because knowing how to prevent problems and actually taking the steps to prevent them are two very different things.

That is why I keep a Book of Fail. I write down the consequences and circumstances of a particular catastrophe as a permanent record. I make sure to physically write it down for two reasons. First, the act of writing with a pencil, unlike typing, requires significant effort, and helps to deeply ingrain what I have chosen to record. Second, keeping these records in a physical journal means that however many crappy Macs I burn through, or hosting plans I forget to pay for, I will still be able to carry my Moleskine around.

Periodically reading through the Book is also crucial to keeping these painful moments fresh. Many people will claim to have learned from a mistake, but will continually repeat the same doomed process over and over without a second’s thought. Once you have made a mistake a couple of times, it should never be repeated.

Architecture will Save You

2009.06.04

There are two things that cause sites to go down:

  • Something breaks.
  • Too much traffic.

When things break, there’s nothing to do but failover, examine, and fix.

But when there’s too much traffic, you’re screwed. You’re getting so much traffic because you’ve done something to earn it. Your site has made it onto Digg or Reddit or even TV. Failing over to a status page is admitting that you haven’t done your job. Staying up means that some people will see the site, maybe, but after a stupidly long wait. And then you become ‘that site’ that was taken down by too much traffic.

So as a web architect, this is your absolute worst moment. Everything you’ve done over the past months or years is falling down miserably in front of you, and there isn’t anything you can do at that moment that will make a meaningful difference. Unless, of course, your architecture is set up so you can increase capacity by just turning up more servers. Among software architects, this is known as being able to scale horizontally.

So when the site is being built, it’s extremely important that every choice you make supports scaling horizontally, with each layer able to be spread across many servers. Storage, session handling, database, Apache [...] are all different layers. Each one needs to be able to handle more capacity just by adding services to whatever layer needs it.
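To make that concrete for one layer, here is a toy sketch of session handling that scales horizontally. Everything in it is hypothetical: SharedSessionStore stands in for whatever external database or cache every web server can reach, and RoundRobinBalancer stands in for a real load balancer. The point is only that the web servers keep no session state of their own, so any of them can serve any request.

```python
class SharedSessionStore:
    """Stand-in for an external store (database or cache) shared by all servers."""

    def __init__(self):
        self._data = {}

    def load(self, session_id):
        return dict(self._data.get(session_id, {}))

    def save(self, session_id, session):
        self._data[session_id] = dict(session)


class WebServer:
    """Holds no session state itself; it reads and writes the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id):
        session = self.store.load(session_id)
        session["visits"] = session.get("visits", 0) + 1
        self.store.save(session_id, session)
        return self.name, session["visits"]


class RoundRobinBalancer:
    """Hands each request to the next server in turn."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0

    def add_server(self, server):
        # Adding capacity is just adding another server to the pool.
        self.servers.append(server)

    def handle(self, session_id):
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server.handle(session_id)
```

Because the visit count lives in the shared store, a user bounced between servers, including one added mid-spike with add_server, still sees a consistent session. If sessions lived in each server’s local memory instead, adding servers would break them.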

Building in horizontal scalability is really difficult when trying to get a site out quickly. Architecture is something that will often get de-prioritized because there is no immediate result. However, taking the extra time to build an application that follows horizontally scalable patterns is extremely important. Without spending that extra time, you can easily be wiped out by the Digg effect.

The biggest takeaway: when thinking about how to build your application, focus on making it scale horizontally. Then when the moment comes and you get on Digg, you can spin up some more servers, sit back, and have a drink.

