Architecture Will Save You

There are two things that cause sites to go down:

  • Something breaks.
  • Too much traffic.

When things break, there’s nothing to do but fail over, examine, and fix.

But when there’s too much traffic, you’re screwed. The reason you’re getting so much traffic is that you’ve done something to earn it. Your site has made it onto Digg or Reddit or even TV. Failing over to a status page is admitting that you haven’t done your job. Staying up means that some people will see the site, maybe, but after a stupidly long wait. And then you become ‘that site’ that was taken down by too much traffic.

So as a web architect, this is your absolute worst moment. Everything you’ve done over the past months or years is falling down miserably in front of you, and there isn’t anything you can do at that moment that will make a meaningful difference. Unless, of course, your architecture is set up so you can increase capacity by just turning up more servers. Among software architects, this is known as being able to scale horizontally.

So when the site is being built, it’s extremely important that every choice you make supports scaling horizontally, with each layer able to be spread across many servers. Storage, session handling, database, Apache […] are all different layers, and each needs to be able to handle more capacity just by adding servers to whatever layer needs it.
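
To make that concrete, here’s a minimal sketch of moving one layer, session handling, off local disk: with the PECL memcache extension, sessions live in memcached, so any web server can serve any request. The hostnames here are placeholders, not a real setup.

<?php
// Sketch: store PHP sessions in memcached instead of local files, so the
// web layer can grow by just adding more Apache boxes. Requires the PECL
// memcache extension; the hostnames below are made up.
ini_set('session.save_handler', 'memcache');
ini_set('session.save_path', 'tcp://10.0.0.10:11211,tcp://10.0.0.11:11211');
session_start();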

Keeping a site horizontally scalable is really difficult when trying to get it out quickly. Architecture is something that often gets deprioritized because there is no immediate result. However, taking the extra time to build an application that follows horizontally scalable patterns is extremely important. Without spending that extra time, you can easily be wiped out by the Digg effect.

The biggest takeaway: when thinking about how to build your application, focus on making it scale horizontally. Then, when the moment comes and you land on Digg, you can spin up some more servers, sit back, and have a drink.

Jun 4th, 2009

Success!

When making changes to a large site, it’s really helpful to have tools to measure how those changes affect performance. One of my favorite tools is Cacti. This is a graph of the load average of one of our database servers.

Database Load Average

We done good…

Crisis MO

While running any sort of site, expect problems. Lots of them. They will range from trucks crashing into your data center, to bad releases, to people defacing your site. Whatever the problem may be, the same pattern always seems to emerge in dealing with it.

  1. Fail over
  2. Diagnose
  3. Fix
  4. Fail back

This list makes a few assumptions. It assumes that there is an alternate server for you to fail over to, and a means to do so. Having an alternate server with a simple status page is really cheap and quick to set up. The easiest way to fail over is probably to use DNS. Dynect offers really great DNS service and has an interface that anyone in the company can use.

Of course, failing over doesn’t necessarily mean you take your entire site down either. It could mean pushing an update that closes a particular feature, or removing data that’s causing a problem. Failing over really means any sort of quick maneuver that will get you out of hot water.
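
As a rough sketch of the ‘close a particular feature’ flavor of failing over (the file path and flag name here are hypothetical), a check like this at the top of the feature’s entry point lets anyone flip it off in seconds:

<?php
// Sketch: kill switch for a single feature. /etc/myapp/flags.ini is a
// made-up path; anything ops can edit quickly will do.
$flags = parse_ini_file('/etc/myapp/flags.ini');
if (empty($flags['uploads_enabled'])) {
    // Feature is switched off while we diagnose the real problem.
    echo 'Uploads are temporarily disabled. Please check back shortly.';
    exit;
}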

Once you’ve failed over, it’s really important to understand what happened and why. Knowing what caused the site to break is the most important step in fixing it. When you’re in this situation, it’s really easy to put up a quick fix without understanding what is happening. This can easily lead to thrashing through a number of quick fixes that each break something else. In situations like these, making decisions slowly and calmly is crucial. Anything that’s not well thought out can make a bad situation worse.

Once the fix is applied, turn the feature back on or point DNS back to the production site. When everything is back up, it’s well worth spending some time making sanity checks on your fixes.

Lastly, it’s also well worth the time to thoroughly document what went wrong and why. The aftermath of a crisis is a golden opportunity to identify problems with your architecture and process. It’s also a great time to do a few shots…

May 26th, 2009

A Tool to DRY Off

Every developer worth their bits knows that code repeated is a maintenance problem waiting to happen. However, code written by a group of devs under tight deadlines tends to get pretty ugly pretty quick, with lots of snippets being copy/pasted because ‘they work’. The allure of getting things up and running quickly is a siren call that constantly lures us away from the all-important refactoring and integration that makes code maintainable. But once the dust has settled, and there is a spare moment to re-read and consider what should be changed, the task of refactoring seems too daunting to even bother.

Thankfully, Sebastian Bergmann has created a tool that will find every dirty little Ctrl-V. It’s called the PHP Copy/Paste Detector (phpcpd), and it can be installed using PEAR, or you can download the source from Git.
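
The install and a typical run look roughly like this (the PEAR channel, package name, and option names are as I remember them and the path is a placeholder, so check the project page if they’ve changed):

# Install via PEAR
pear channel-discover pear.phpunit.de
pear install phpunit/phpcpd

# Flag anything duplicated across at least 5 lines / 70 tokens
phpcpd --min-lines 5 --min-tokens 70 /path/to/webroot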

What’s really interesting is playing with the number of tokens and the number of lines that constitute a copy/paste. For my purposes, I used a minimum of 5 lines. In quite a few cases, the copy/paste turned out to be declarations, or the same style sheets and scripts being included on different pages. But when it was PHP, it was abundantly clear what needed to be refactored, and how.

May 19th, 2009

Framed / Hijacked

“There’s lots of bad stuff out there on the Internet.” – anonymous Rackspace tech

And bad people, too. On one occasion, I found that someone had hijacked a domain similar to ours, and framed our site. Once they had the frame set up, they started sending out spam from the hijacked domain. The legitimacy they gained from having our site at their domain was really, really scary. Users clicking links from their spam were sent to a site that was actually ours. Luckily, we were alerted to this issue before any real damage was done.

There wasn’t much I could do server-side to fight this without digging into a bunch of PHP. However, checking the location of window.top against what I knew to be my own domain turned out to be a decent defense.

if (top.location != self.location) {
    alert("someone is doing something bad…");
}

Of course, this particular defense will also complain about anyone who has framed your site for any other reason. For example, the Digg toolbar frames sites, so anyone hitting your site through the Digg toolbar will be treated to your warning as well. But whether Digg should be framing sites in this way is controversial enough in its own right.

Update: su.pr, which combines a URL shortener with the StumbleUpon toolbar, has complicated the framing situation even further. (Full disclosure: I am a user.) Clicking any su.pr link will open the shortened URL with the StumbleUpon toolbar. This has great potential for authors, who can use the toolbar to increase traffic to their sites, but at the cost of forcing the toolbar on users.

May 16th, 2009

Hard File Limits

A while ago, we ran up against a hard file limit on our storage server. Within the main images directory, we subdivide our folders for every user. Well, it turns out that on the ext3 filesystem, there is a hard limit of 32k folders within a folder. So when our 32,001st user signed up, they were welcomed by not being able to upload any content to their nonexistent user folder.

So there we were…

Several calls to our hosting provider, Rackspace, yielded no elegant solution where we could simply reconfigure the limit. If we wanted to keep our file structure the way it was, we would need to migrate our data off the server, reformat the filesystem as XFS, and migrate back. That would take us down for close to 36 hours. Needless to say, that was not an option.

The other option was to quietly close registration and start a mad dash to implement programmatic partitioning. We created new partition folders next to our root user content folder, numbered 2 through 5, and assigned existing users to partition 1. New users would be assigned a randomly generated partition number at registration. Then we wrote a few quick utility methods to determine which user belonged to which partition and began the quickest rewrite of a file upload system the world has ever seen.
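
The utility methods were roughly along these lines (names, the base path, and the partition count are illustrative, not the real code):

<?php
// Sketch of the partitioning helpers. Base path and partition count are
// placeholders for whatever your layout actually is.
define('USER_CONTENT_ROOT', '/data/user_content');
define('NUM_PARTITIONS', 5);

// New users get a random partition at registration, stored on the user
// record. Existing users were all assigned partition 1.
function assignPartition() {
    return rand(1, NUM_PARTITIONS);
}

// Every upload and read builds the path from the user's stored partition,
// so no single ext3 directory ever approaches the 32k subfolder limit.
function userContentPath($userId, $partition) {
    return USER_CONTENT_ROOT . '/' . $partition . '/' . $userId;
}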

Downtime: 0

In retrospect, someone (me) should have taken a scrutinizing look at the characteristics of the filesystem that would be responsible for storing all of our users' content.

Lesson learned, the hard way.

May 10th, 2009

Run Cron More Than Once a Minute

As far as Linux utilities go, cron is one of my favorite tools. It handles batch jobs, checks your database, cleans up files, whatever you need to do on a regular basis. One of my favorite uses for it has been keeping a watchful eye on MySQL for runaway queries.

The lowly crontab file

One of the problems with managing a website that has outgrown its database design is that as traffic piles up and tables grow, so does the length of time it takes to run queries. Even the simplest query, when faced with examining several million rows, can wind up locking a table for a scary amount of time. The real solution is to rewrite your app to not suck. However, when that luxury (necessity?) isn’t available, the next best thing is to periodically run a script that kills long-running queries.
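
Here’s a sketch of what such a script can look like (the connection details and the 30-second threshold are placeholders, and it only touches SELECTs so it can’t kill a write):

<?php
// Sketch: kill SELECTs that have been running longer than the threshold.
// Connection details are placeholders.
$db = new mysqli('localhost', 'monitor', 'secret');
$threshold = 30; // seconds

$result = $db->query('SHOW FULL PROCESSLIST');
while ($row = $result->fetch_assoc()) {
    $isSelect = stripos(trim((string) $row['Info']), 'SELECT') === 0;
    if ($row['Command'] === 'Query' && $row['Time'] > $threshold && $isSelect) {
        // Log it so you know exactly what ran away from you.
        error_log("Killing {$row['Id']} after {$row['Time']}s: {$row['Info']}");
        $db->query('KILL ' . (int) $row['Id']);
    }
}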

It is not the cleanest solution, or the safest, and care must be taken to avoid killing queries that do important things. It would be a really, really BAD THING if you wound up killing the query that logs that $500 transaction. That said, in most situations running this script every minute should solve all your woes. With decent error logging, it’ll also tell you exactly what is running away from you. But what happens when you need to run it more often? Say, every 30 seconds. And that’s where cron falls on its face.

After exhaustive research, I could not find a way to get cron to run a job more than once a minute. (If anyone knows differently, please say so…) Knowing that I needed to run this script more often than that, I had to find an alternative. That alternative turned out to be the humble, rarely used sleep command. Set up a second cron job that does nothing but sleep for 30 seconds and then call the same MySQL script.
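
The crontab ends up looking something like this (the script path is made up):

# Runs at the top of every minute
* * * * * /usr/local/bin/kill_long_queries.php
# Same script, delayed 30 seconds by sleep
* * * * * sleep 30; /usr/local/bin/kill_long_queries.php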

Voila, a cron job that runs more often than once a minute!

May 2nd, 2009

Just Manage It for Me

I’m busy. Really busy. So busy that most of the time, I can’t be bothered with minor details that are so important they really shouldn’t be minor.

Hardware upgrades, hardware failures, kernel updates, Red Hat updates, PHP updates, MySQL updates: who actually has the expertise to make judgments on every single one of these? I certainly do not. However, at this point I’m managing over 15 servers, and I need to be up to date and secure. That’s why we host at Rackspace. They do it all for me.

Then there’s mail. Spam, ISPs’ blacklists, whitelists, bounce management, mailbombs, DKIM, DomainKeys, SPF, WTF. The world of SMTP is one of black magic and voodoo. Hosting your own email offers infinite flexibility, and working with a single Postfix server lets you do pretty much whatever you want. But every time you do something nifty, you have to deal with the ongoing maintenance of said nifty hack. Then there’s spam. Good luck. Instead of fighting the good fight against spammers, I decided to wimp out and let someone smarter do it for me. In 20 minutes, I set up Google Apps and imported every user from my Postfix server. Five minutes later, after changing DNS, I gave my users a spam-free, flexible mail system.

Moral of these stories: don’t be too proud to let someone else manage your IT. Free up your time to do important things.