Japanese Inspection

2010.02.11

Everybody procrastinates. Some tasks that get pushed off don’t matter; they just get done later. Some tasks that go over deadline result in profanities and bloody noses. But every once in a while, a task comes along that has an expiration date. As in, if it doesn’t get done by a certain time, it no longer matters whether it gets done at all.

You ever heard of a “Japanese Inspection?” Japanese Inspection, you see, when the Japs take in a load of lettuce they’re not sure they wanna let in the country, why they’ll just let it sit there on the dock ’til they get good and ready to look at. But then of course, it’s all gone rotten… ain’t nothing left to inspect. You see, lettuce is a perishable item… like you two monkeys.

Big John, Days of Thunder

What Big John was referring to was the fact that all he had to do was ignore Cole and Rowdy until they didn’t have any fight left in them. Tasks can be just the same way. Eventually, the need for the task to be done just goes away, or starts to smell. The only thing that really matters is being able to tell the difference between the things that really need to get done, and the things that just aren’t important enough to get done.

Changing Criteria

2009.12.12

Occasionally, a project will come across my plate with the criteria, ‘Make sure this works everywhere, is completely template-able, and is something we can grow with.’ Normally this is coupled with ‘We need this to work with X *right now*, and Y and Z later.’ What I really hear is ‘Make it work for X, and ship the damn thing.’ After all, hitting those deadlines is really important.

Of course, this has a whole bunch of ugly assumptions tied to it. First: when I get to Y, everything I did for X will work. All I need to do is drop in a few config changes and tweak a few parameters, and I’m done. (Yeah, right.) Second: every case Y needs to cover is contained within X. (Not likely.) Third: all of this will be so well documented that any literate individual will be able to implement Y by osmosis.

So. Do we spend time now or later? Shipping X seems pretty simple, so why not just build X, satisfy the business dude, and call it a day? Spending time now means that deadlines may have to shift, and something that should be simple becomes complex. We have other, more important, projects to work on.

Eventually Y comes calling. So let me introduce you to…Future Web Dev Guy Person Girl! If you’re lucky, that person is you. If you’re not, it’s another dev. The assumptions we made back in paragraph 2 have reared their ugly heads. Since they were assumptions, you’re probably boned. If not, you’re probably one of these guys. If you’re the rest of us, Future Web Dev Guy Person Girl definitely hates your guts, because the groundwork that was supposed to be laid out is not there. They’re running through a lice-infested rat’s nest of procedural functions trying to pass the additional variable that will make this all work.

The best way to keep Future Web Dev Guy Person Girl from cursing like a sailor is to implement correctly, test thoroughly, and deal with Y before it’s due. Deadlines need to be managed according to project scope, and if project scope includes Y, it needs to be accounted for now, before you lose a friend in Future Web Dev Guy Person Girl.

Categories : Best Practices

Fix or Manage?

2009.10.24

Sometimes bugs come along that require significant work to fix. Depending on what project timelines are like at the moment, sometimes fixing the bug isn’t the best option. For example, a race condition in the caching architecture causes pages to be stale. The persistent data store is correct, but the cache is not. To the person who just triggered the update, there’s a bug. The information on the public side is not in sync with the information they just entered.

So, like any other bug, a report will eventually percolate down to the dev team. People scream, fortunes are lost, the svn blame command is used, and the devs who wrote the code pee their pants. Once the chaos dies down, the actual prognosis of this issue can turn out to be extremely grim.

A shortcoming of the caching architecture shows that there’s a race condition when the system is under heavy load. In order to fix it, the dev team needs to plumb the depths of the data access layer, and probably change some parameters. But that’ll probably break everything. Everywhere. Or the layer manipulating the data could be fixed to replace the cache entry instead of invalidating it. Except the methods to manipulate that entity live in 3 different codebases. It’ll probably break the editor. Either way, the actual solution doesn’t matter. The dev team certainly needs to do something, and it needs to be released three days ago.

The correct way to fix this issue will vary widely depending on circumstances. But in this particular case, the best answer was to not fix it, just manage it. Our team was busy, and there were other projects that were more pressing. Plus the codebase was being rewritten. So instead of flogging a dead horse, we threw together a simple script that compared the cache and the database. If they were out of sync, the cache would be cleared and then repopulated with the correct information the next time it was requested. Once the script was running, the bug was still there, but the cache stayed up to date as far as anyone could tell.
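The original script isn’t shown here, so what follows is only a rough sketch of the idea, in Python rather than whatever the real one was written in. Plain dicts stand in for the cache and the database, and the names reconcile_cache and read_through are made up for illustration.

```python
def reconcile_cache(cache, database):
    """Drop every cache entry that disagrees with the database.

    Both arguments are plain dicts standing in for the real cache and
    persistent store (an assumption; the original script is not shown).
    Returns the sorted list of keys that were cleared.
    """
    stale = [key for key, value in cache.items() if database.get(key) != value]
    for key in stale:
        del cache[key]
    return sorted(stale)


def read_through(key, cache, database):
    """Serve from the cache, repopulating from the database on a miss."""
    if key not in cache:
        cache[key] = database[key]
    return cache[key]


# Hypothetical data: page:1 was updated in the database but not in the cache.
database = {"page:1": "new title", "page:2": "about us"}
cache = {"page:1": "old title", "page:2": "about us"}

cleared = reconcile_cache(cache, database)
refreshed = read_through("page:1", cache, database)
print(cleared)
print(refreshed)
```

Run on a schedule, something like this never touches the race condition itself. It just limits how long a stale entry can survive.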

Every dev team will face bugs that have enormous costs to fix. The way to deal with these bugs will be different every time they come up. It’s important to remember that managing bugs can be almost as effective as fixing them.

Gearman

2009.07.25

Users have high expectations of web apps: performance, responsiveness, and tons of features. Normally, you’re only allowed two of any list of three really cool things, and web apps are no exception. Most will settle on some compromise between performance and responsiveness on one side and tons of features on the other. More features usually means less responsiveness, depending on the feature and the scale.

Enter Gearman. Gearman is a queuing system that allows work to be farmed out to other servers. Most importantly, it allows for intense tasks to be queued and performed in the background. This means that when a user performs an action that could potentially take a long time (sending notification emails, updating Full Text indexes, etc), that slow task can be queued to run in the background, and the page can be sent to the user, keeping things snappy.
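To make the pattern concrete, here is a small sketch of “queue it, answer the user, do the work later.” It does not use Gearman at all: Python’s standard queue and threading modules stand in for the job server and the worker, and handle_request, worker, and the job format are invented for the example.

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for the job server
sent = []              # record of background work, for the example only


def worker():
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        # Placeholder for the slow work (sending mail, reindexing, ...).
        sent.append("notified " + job["user"])
        jobs.task_done()


def handle_request(user):
    """Queue the slow task and return the page right away."""
    jobs.put({"user": user})
    return "page for " + user


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_request("alice")
jobs.put(None)   # tell the worker to stop once the queue is drained
jobs.join()      # only so the example can print a deterministic result
thread.join()
print(response)
print(sent)
```

In a real setup the worker would be a separate process, possibly on another server, and the request handler would never wait on it. That separation is the part Gearman gives you.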

Gearman is pretty simple to install on Red Hat.

download gearman from server
> wget http://launchpad.net/gearmand/trunk/0.8/+download/gearmand-0.8.tar.gz

unzip and move into the directory
> tar -xvzf gearmand-0.8.tar.gz
> cd gearmand-0.8

Red Hat didn’t have some dependencies. The next few steps will vary depending on your *nix distro.

Install the libevent developer library.
> yum install libevent-devel

Install the e2fsprogs developer library
> yum install e2fsprogs-devel

configure and install
> ./configure
> make
> make install

/** PHP extension (PECL gearman) **/

download php extension from the pecl repo
> wget http://pecl.php.net/get/gearman-0.4.0.tgz

untar and move into the directory
> tar -xvf gearman-0.4.0.tgz
> cd gearman-0.4.0

build the extension
> phpize
> ./configure
> make
> make test
> make install

Add the extension to the php.ini

[gearman]
extension=gearman.so

And you’re all set!

Integration will depend on whether you decide to use the PHP extension, and on how encapsulated the code base is. I highly recommend using the PECL extension, as it provides great implementations of the client and worker. Gearman will save you.

Document failure

2009.06.07

When things go wrong, there is always a reason. Sometimes it’s a good one. (I had no idea $_SERVER wasn’t available from the CLI.) Sometimes it’s a really bad one. (I left 2 seconds after running the build script, and I didn’t test beforehand.) Whatever the reason is, the question needs to be asked: ‘How can we never, ever let this happen again?’ Once we arrive at a good answer to that question, it’s really important that we hold that answer in a vise-like grip, and never let it go. Because knowing how to prevent problems and actually taking the steps to prevent them are two very different things.

That is why I keep a Book of Fail. I will write down the consequences and circumstances of a particular catastrophe as a permanent record. I make sure to physically write it down because the act of writing with a pencil, unlike typing, requires significant effort, and helps to deeply ingrain what I have chosen to record. Secondly, keeping these records in a physical journal means that however many crappy Macs I burn through, or hosting plans I forget to pay for, I will still be able to carry my Moleskine around.

Periodically reading through the Book is also crucial to keeping these painful moments fresh. Many people will claim to have learned from a mistake, but will continually repeat the same doomed process over and over without a second’s thought. Once a mistake has been made a couple of times, it should never be repeated.

Architecture will Save You

2009.06.04

There are two things that cause sites to go down:

  • Something breaks.
  • Too much traffic.

When things break, there’s nothing to do but fail over, examine, and fix.

But when there’s too much traffic, you’re screwed. The reason you’re getting so much traffic is because you’ve done something to earn it. Your site has made it on to Digg or Reddit or even TV. Failing over to a status page is admitting that you haven’t done your job. Staying up means that some people will see the site, maybe, but after a stupidly long wait. And then you become ‘that site’ that was taken down by too much traffic.

So as a web architect, this is your absolute worst moment. Everything you’ve done over the past months or years is falling down miserably in front of you, and there isn’t anything you can do at that moment that will make a meaningful difference. Unless, of course, your architecture is set up so you can increase capacity by just turning up more servers. Among software architects, this is known as being able to scale horizontally.

So when the site is being built, it’s extremely important that every choice you make supports scaling horizontally, with each layer able to be spread across many servers. Storage, session handling, database, apache [...] are all different layers. They all need to be able to handle more capacity just by adding additional services to whatever layer needs it.
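A toy sketch of what that looks like for one layer, under a couple of loud assumptions: requests are handed out round-robin, and session state lives in a shared store rather than on any single web server. Pool, shared_sessions, and the server names are all hypothetical.

```python
import itertools


class Pool:
    """A layer made of interchangeable servers behind a round-robin picker."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = itertools.cycle(self.servers)

    def add_server(self, name):
        """Scale out: add a box and restart the rotation to include it."""
        self.servers.append(name)
        self._cycle = itertools.cycle(self.servers)

    def route(self):
        """Return the server that should handle the next request."""
        return next(self._cycle)


# Session state lives in a shared store, not on any one web server,
# so every server in the pool can answer every request.
shared_sessions = {"abc123": {"user": "alice"}}


def handle(pool, session_id):
    """Pick a server and answer using only shared state."""
    server = pool.route()
    user = shared_sessions[session_id]["user"]
    return server + " served " + user


web = Pool(["web1", "web2"])
first = [handle(web, "abc123") for _ in range(2)]
web.add_server("web3")
after = [web.route() for _ in range(3)]
print(first)
print(after)
```

The point is the add_server call: because no server holds anything the others lack, adding capacity is one line and nothing else has to change.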

Building in horizontal scalability is really difficult when trying to get a site out quickly. Architecture is something that will often get de-prioritized because there is no immediate result. However, taking the extra time to build an application that follows horizontally scalable patterns is extremely important. Without spending that extra time, you can easily be wiped out by the Digg effect.

The biggest takeaway: when thinking about how to build your application, focus on making it scale horizontally. Then, when the moment comes and you get on Digg, you can spin up some more servers, sit back, and have a drink.

Categories : Best Practices  scalability

Crisis MO

2009.05.25

While running any sort of site, expect problems. Lots of them. They will range from trucks crashing into your data center, to bad releases, to people defacing your site. Whatever the problem may be, there always seems to be the same pattern in dealing with it.

  1. Fail over
  2. Diagnose
  3. Fix
  4. Fail back

This list makes a few assumptions. It assumes that there is an alternate server for you to fail over to, and a means to do so. Having an alternate server with a simple status page is really cheap and quick to set up. The easiest way to fail over is probably to use DNS. Dynect offers really great DNS service and has an interface that anyone in the company can use.

Of course, failing over doesn’t necessarily mean you take your entire site down either. It could mean pushing an update that closes a particular feature, or removing data that’s causing a problem. Failing over really means any sort of quick maneuver that will get you out of hot water.
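Closing a single feature can be as simple as a switch the code checks before rendering. The sketch below is one hypothetical way to do it in Python; the names close_feature, reopen_feature, and render are invented, and a real site would keep the switch somewhere shared (config, database) rather than in memory.

```python
disabled_features = set()


def close_feature(name):
    """Fail over: take one feature out of service."""
    disabled_features.add(name)


def reopen_feature(name):
    """Turn the feature back on once the fix is in."""
    disabled_features.discard(name)


def render(feature):
    """Return what the user sees for this feature right now."""
    if feature in disabled_features:
        return feature + ": temporarily unavailable"
    return feature + ": ok"


before = render("comments")
close_feature("comments")
during = render("comments")
reopen_feature("comments")
after = render("comments")
print(before, during, after, sep="\n")
```

The rest of the site keeps running the whole time, which is usually a better deal than pointing everything at a status page.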

Once you’ve failed over, it’s really important to understand what happened and why. Knowing what caused the site to break is the most important step in fixing it. When you’re in this situation, it’s really easy to put up a quick fix without understanding what is happening. This can easily lead to thrashing through a number of quick fixes that each break something else. In situations like these, making decisions slowly and calmly is crucial. Anything that’s not well thought out can make a bad situation worse.

Once the fix is applied, turn the feature back on or point DNS back to the production site. When everything is back up, it’s well worth spending some time making sanity checks on your fixes.

Lastly, it’s also well worth the time to thoroughly document what went wrong and why. The aftermath of a crisis is a golden opportunity to identify problems with architecture or process. It’s also a great time to do a few shots…

A tool to DRY off

2009.05.19

Every developer worth their bits knows that code repeated is a maintenance problem waiting to happen. However, code written by a group of devs under tight deadlines tends to get pretty ugly pretty quick, with lots of snippets being copy/pasted because ‘they work’. The allure of getting things up and running quickly is a siren call that constantly lures us away from the all-important refactoring and integration that makes code maintainable. But once the dust has settled, and there is a spare moment to re-read and consider what should be changed, the task of refactoring seems too daunting to even bother.

Thankfully, Sebastian Bergmann has created a tool that will find every dirty little Ctrl-V. It’s called the PHP Copy/Paste Detector (phpcpd), and it can be installed using PEAR. Or download the source from git.

What’s really interesting is when you play with the number of tokens and number of lines that constitute a copy-paste. For my purposes, I used a minimum of 5 lines. In quite a few cases, the copy/paste turned out to be declarations, or the same style sheets and scripts included on different pages. But when it was PHP, it was abundantly clear what needed to be refactored, and how.
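To get a feel for what the minimum-lines setting does, here is a deliberately naive sketch in Python. It is not how phpcpd works (phpcpd compares tokens, this compares whole lines), and find_duplicates plus the sample files are made up for illustration.

```python
from collections import defaultdict


def find_duplicates(files, min_lines=5):
    """Report blocks of min_lines consecutive lines that occur more than once.

    files maps a file name to its source text. Lines are compared after
    stripping whitespace, and blank lines are skipped. Returns a list of
    location lists, each location being (file name, 1-based start line).
    """
    seen = defaultdict(list)
    for name, text in files.items():
        lines = [(number, line.strip())
                 for number, line in enumerate(text.splitlines(), start=1)
                 if line.strip()]
        for i in range(len(lines) - min_lines + 1):
            window = tuple(line for _, line in lines[i:i + min_lines])
            seen[window].append((name, lines[i][0]))
    return [places for places in seen.values() if len(places) > 1]


# Two hypothetical PHP files sharing the same five-line snippet.
snippet = "\n".join(["$a = 1;", "$b = 2;", "$c = $a + $b;", "echo $c;", "log($c);"])
files = {
    "one.php": "<?php\n" + snippet + "\n",
    "two.php": "<?php\n// header\n" + snippet + "\nexit;\n",
}
duplicates = find_duplicates(files)
print(duplicates)
```

Raising min_lines makes the shared snippet disappear from the report, and lowering it starts flagging boilerplate, which matches what I saw when tuning the real tool.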

Just manage it for me

2009.04.26

I’m busy. Really busy. So busy that most of the time, I can’t be bothered with minor details that are so important they really shouldn’t be minor.

Hardware upgrades, hardware failures, kernel updates, Red Hat updates, PHP updates, MySQL updates: who actually has the expertise to make judgments on every single one of these? I certainly do not. However, at this point, I’m managing over 15 servers, and I need to be up to date and secure. So that’s why we host at Rackspace. They do it all for me.

Then there’s mail. Spam, ISP blacklists, whitelists, bounce management, mailbombs, DKIM, DomainKeys, SPF, WTF. The world of SMTP messages is one of black magic and voodoo. Hosting your own email offers infinite flexibility, and working with a single server with postfix lets you do pretty much whatever you want. But every time you do something nifty, you have to deal with the ongoing maintenance of said nifty hack. Then there’s spam. Good luck. Instead of fighting the good fight against spammers, I decided to wimp out and let someone smarter do it for me. In 20 minutes, I set up Google Apps and imported every user from my postfix server. 5 minutes later, after changing DNS, I gave my users a spam-free, flexible mail system.

Moral of these stories…don’t be too proud to let someone else manage your IT. Free up your time to do important things.