Rackspace Downtime

[Update] My team at Rackspace has sent me the fluffiest, most comfortable pillow I have ever had.

When a hosting provider goes down, there are lots of questions that get raised. Is my host reliable? Will they flake out during crucial times when my site needs the traffic? Will they double bill me?

Since I have been working with Rackspace, they have had less than stellar uptime, with issues mostly related to power. My company pays a lot for hosting with them, and downtime for a young company is deadly. But oddly enough, I’m still OK with Rackspace hosting my company’s myriad services. The benefits of hosting with them have been so great that a couple hours of downtime is nothing.

First off, their SLA has provisions for downtime when it happens. If your server has a legitimate issue, you’re entitled to ask for a credit. To me, this is a promise that they’ll put their money where their mouth is. And if you call them on it, they’ll be reasonable.

Second, their support during crises is still amazing. During the truck incident, I was able to get a tech to run fsck on my disks and hang out to watch, no questions asked. No, I am not on their intensive plan.

Third, their support culture is excellent across the board. Their Linux techs are always willing to dig deep into an issue to find a resolution, and they provide much of the basic infrastructure that is hard to come by for small companies. They’re also completely willing to educate their customers about the servers they maintain.

Rackspace has been the target of a lot of criticism over issues in their datacenters. The fact of the matter is that there will always be issues and downtime. Their SLA guarantees the impossible, which they seem to realize, as any failure on their part comes with a swift response. In the end, their SLA seems to be more of a way of setting standards than anything else.

[Full disclosure: I haven’t slept in 2 days because of their power issues]

Nov 4th, 2009

Fix or Manage?

Sometimes bugs come along that require significant work to fix. Depending on what project timelines look like at the moment, fixing the bug isn’t always the best option. For example, say a race condition in the caching architecture causes pages to go stale. The persistent data store is correct, but the cache is not. To the person who just triggered the update, there’s a bug: the information on the public side is not in sync with the information they just entered.

So, like any other bug, a report will eventually percolate down to the dev team. People scream, fortunes are lost, the svn blame command is used, and the devs who wrote the code pee their pants. Once the chaos dies down, the actual prognosis of this issue can turn out to be extremely grim.

The caching architecture turns out to have a race condition when the system is under heavy load. In order to fix it, the dev team needs to plumb the depths of the data access layer, and probably change some parameters. But that’ll probably break everything. Everywhere. Or the layer manipulating the data could be fixed to replace the cache instead of invalidating it. Except the methods to manipulate that entity live in 3 different codebases. It’ll probably break the editor. Either way, the actual solution doesn’t matter. The dev team certainly needs to do something, and it needs to be released three days ago.

The correct way to fix this issue will vary widely depending on circumstances. But in this particular case, the best answer was to not fix it, just manage it. Our team was busy, and there were more pressing projects. Plus, the codebase was being rewritten. So instead of flogging a dead horse, a simple script was thrown together that compared the cache and the database. If they were out of sync, the cache would be cleared and repopulated with the correct information the next time it was requested. Once the script was in place, the bug was still there, but the cache stayed up to date.
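As an illustration, a minimal sketch of that kind of reconciliation script might look like this, assuming a memcached-backed cache keyed by record ID and a products table with an updated_at column (the table, key scheme, and credentials are all invented):

<?php
// Hypothetical cache-vs-database sweep. Assumes the PECL memcached extension,
// a `products` table, and cache keys of the form 'product_<id>'.
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

foreach ($db->query('SELECT id, updated_at FROM products') as $row) {
    $cached = $cache->get('product_' . $row['id']);
    // If a cached copy exists but doesn't match the database row, clear it.
    // The next request will repopulate it with the correct information.
    if ($cached !== false && $cached['updated_at'] !== $row['updated_at']) {
        $cache->delete('product_' . $row['id']);
    }
}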

Every dev team will face bugs that have enormous costs to fix. The way to deal with these bugs will be different every time they come up. It’s important to remember that managing bugs can be almost as effective as fixing them.

What's in a Name?

It’s easy to get caught up in semantics. Figuring out the best names for variables, tables, columns, classes, etc. is something that can eat up hours or even days of a development schedule. The idea is that the more precise the name, the better. The arguments for precision naming are many:

  • Clear names help other developers read your code. New developers who come on board will immediately understand what’s happening.
  • Calling well-named methods of classes reads like a sentence, further increasing readability.
  • Clear names help developers relate things in the UI to the code.

Keep in mind, I’m not talking about naming conventions. Naming conventions are simply rules for choosing the character sequences. They don’t dictate what words you should assign to things in your code.

Whatever names developers choose, they will get strewn throughout the layers of the application. Database, table and column names will be impacted. Variables in server-side scripts. Organization of classes into folders. Javascript file names. Memcache keys. URLs. Just like sand at the beach, the labels the dev team decided on go everywhere you can think of. Invariably, the marketing team will bound down the hall and announce the product is being rebranded. Jobs will become Gigs. Friends will become Followers. Application code will become confusing.

New devs won’t get it anyway.

The fact of the matter is that overthinking naming is a good way to get nowhere fast. Keeping it simple and taking just enough time to make sure things make sense gives devs more time to focus on important stuff. Like being able to articulate the thought process behind the code.

Oct 14th, 2009

Blackberry OS Will Never Take Over the World

A few days ago, I installed Opera Mini on my Blackberry Curve. The experience offered by Opera Mini is really impressive. It retrieves and renders pages quickly and flawlessly. The interface is specifically designed to handle navigating long pages on a tiny screen. In short, it’s a great app, and a major improvement on the Blackberry Browser.

Here’s the catch (or catches). Links that appear in other applications won’t open in Opera; they’ll open in the Blackberry browser. What’s worse, Opera does not offer a way to paste links directly into the location bar. To paste a link, you need to hit the symbol key, which brings up an edit screen pre-populated with ‘www.’ You need to erase the www, and then you can paste your link in.

That lengthy process completely kills any satisfaction you may get out of having a workable browser on a Blackberry. And none of it is Opera’s fault. On most other OSes, clicking a URL in any application will fire up the default browser and retrieve it. Hell, on iPhone OS, you can set up protocol handlers that will open up other apps. Hopefully RIM will provide a means for applications to talk to each other soon. Creating seamless interaction between apps is probably even more important than pushing App World.

Sep 26th, 2009

Gearman

Users have high expectations of web apps in terms of performance, responsiveness and tons of features. Normally, you’re only allowed two of any list of three really cool things, and web apps are no exception. Most will find some compromise between performance / responsiveness and tons of features. More features usually equals less responsiveness, depending on the feature and scale.

Enter Gearman. Gearman is a queuing system that allows work to be farmed out to other servers. Most importantly, it allows intensive tasks to be queued and performed in the background. This means that when a user performs an action that could potentially take a long time (sending notification emails, updating full-text indexes, etc.), the slow task can be queued to run in the background and the page can be sent to the user, keeping things snappy.
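To make that concrete, here’s a rough sketch of the client side using the PECL extension covered below; the job name and payload are invented:

<?php
// Queue the slow work and return immediately, so the page can be sent to the
// user right away. Assumes a gearmand server on localhost and a worker that
// has registered the (made-up) 'send_notification_emails' job.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('send_notification_emails', json_encode(array('user_id' => 42)));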

Gearman is pretty simple to install on Red Hat.

download gearman from server
> wget http://launchpad.net/gearmand/trunk/0.8/+download/gearmand-0.8.tar.gz

unzip and move into the directory
> tar -xvzf gearmand-0.8.tar.gz
> cd gearmand-0.8

Red Hat didn’t have some dependencies. The next few steps will vary depending on your *nix distro.

Install the libevent developer library.
> yum install libevent-devel

Install the e2fsprogs developer library.
> yum install e2fsprogs-devel

configure and install
> ./configure
> make
> make install

/ Net Gearman /

download php extension from the pecl repo
> wget http://pecl.php.net/get/gearman-0.4.0.tgz

untar and move into the directory
> tar -xvf gearman-0.4.0.tgz
> cd gearman-0.4.0

build the extension
> phpize
> ./configure
> make
> make test
> make install

Add the extension to the php.ini

[gearman]
extension=gearman.so

And you’re all set!

Integration will depend on whether you decide to use the PHP extension, and on how encapsulated the code base is. I highly recommend using the PECL extension, as it provides great implementations of the client and worker. Use it, and Gearman will save you.
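For the worker side, a rough sketch with the PECL extension might look like this (the job name and the mail logic are placeholders):

<?php
// A long-running worker process. Assumes gearmand on localhost; the job name
// matches whatever the client queues with doBackground().
function send_notification_emails(GearmanJob $job)
{
    $data = json_decode($job->workload(), true);
    // ... look up the user and send the emails here ...
    return true;
}

$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('send_notification_emails', 'send_notification_emails');

// Block waiting for jobs, handling them one at a time.
while ($worker->work());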

Save MySQL

Runaway queries on MySQL can be a real problem. If a long-running query locks up important tables, other queries against those tables will be placed in a queue. Each new query is a new connection to MySQL. Once you hit max_connections, your MySQL connection code will start to fail. Depending on how errors are handled at this stage of the request, this could mean total disaster for a site.

Although there is no way to fix this within the MySQL server itself, a bit of clever scripting can be run via cron to check if there is a problem. Presenting: save_mysql

/usr/bin/mysql -e 'show full processlist \G;' 2> /dev/null |grep -A1 -B5 -E "Time: [1-9][0-9][0-9]?" |grep -E "Id:\ |State:\ " |/usr/bin/perl -n -e 'if( $. % 2 ) { chomp $_; print $_; } else { print $_; }' |grep -E "\ State:\ Sending\ data$|\ State:\ Sorting\ result$" |awk {'print $2'} |xargs -iTHREAD -r -n1 /usr/bin/mysqladmin kill THREAD &> /dev/null

/usr/bin/mysql -e 'show full processlist \G;' 2> /dev/null

This line will grab a list of all the currently running queries and commands from the MySQL server. It also redirects any error output to the blackhole. It produces output like so:

*************************** 1. row ***************************
     Id: 842863
   User: admin
   Host: localhost
     db: NULL
Command: Query
   Time: 0
  State: NULL
   Info: show full processlist

grep -A1 -B5 -E "Time: [1-9][0-9][0-9]?"

The grep here will grab the line directly below and the 5 above if the time is over 100 seconds. This line can be tweaked to grep for less time. My preference is between 30 seconds and a minute. So instead of [1-9][0-9][0-9] you’d have [3-9][0-9] (30 seconds) or [6-9][0-9] (60 seconds).

grep -E "Id:\ |State:\ "

This will filter out the other lines from the previous grep and just grab the MySQL process ID and its State.

/usr/bin/perl -n -e 'if( $. % 2 ) { chomp $_; print $_; } else { print $_; }'

A quick Perl script to put the id and state from the step above on the same line.

grep -E "\ State:\ Sending\ data$|\ State:\ Sorting\ result$"

This filters the output down to queries in the state ‘Sending data’ or ‘Sorting result’. These are both states where it’s safe to kill the query.

awk {'print $2'}

This line grabs the process ID from the output.

xargs -iTHREAD -r -n1 /usr/bin/mysqladmin kill THREAD &> /dev/null

Lastly, this passes the ID from above to the mysqladmin kill command, effectively killing the query.
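To schedule it via cron as mentioned above, the one-liner can be saved to a script and run every minute or so; something like this hypothetical crontab entry (the path is made up):

# kill runaway queries before connections pile up (hypothetical script path)
* * * * * /usr/local/bin/save_mysql.sh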

Do One Thing, but Do It Really Well

In life there are people who consider themselves jacks of all trades. But as the saying goes, they are masters of none.

Websites will always start out as a codebase that does everything. There will be a couple files that add users, encode video to Flash, pull rss feeds, assemble HTML, update products, charge users, manipulate images, redirect old links, handle file uploads, calculate shipping, delete categories, create rss feeds, search the database, etc. Sometimes the code to do these will be organized into files, sometimes it won’t. The whole site will run on a single server, or more likely, a slice of a single server.

None of the things in the above list will be done well. None. This is mostly because there is too little code and too little hardware focused on doing too much. Also, every piece of code will be tightly coupled. So any one of those features could potentially get a ton of traffic, or hit a bump, and consume a ton of resources. Once that happens, it’s safe to assume the whole thing will go down in flames.

So to avoid the Fail Whale, it’s really important to build sites as a group of components that work together. Architecture is key, and when carefully thought out, it can ensure that the most important parts of the site stay up. Even when your image manipulation script on the backend freaks out, the home page should continue to load flawlessly.

With database-driven apps (almost every major site on the web), particular attention needs to be paid to the caching layer. Again, since most sites start out with a jumbled codebase, it’s unlikely that all the code to manage data lives in the same place. Given the complexities of managing cache objects, making sure that objects are invalidated on update is crucial to making updates look seamless. So there needs to be a set of code that’s good at one thing: managing data and its cache.
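As a minimal sketch of what that set of code might look like, assuming memcached and a made-up jobs table (class and method names are hypothetical):

<?php
// One place that owns both the data and its cache, so nothing else in the
// codebase can forget to invalidate on update.
class JobStore
{
    private $db;
    private $cache;

    public function __construct(PDO $db, Memcached $cache)
    {
        $this->db = $db;
        $this->cache = $cache;
    }

    public function get($id)
    {
        $job = $this->cache->get('job_' . $id);
        if ($job === false) {
            $stmt = $this->db->prepare('SELECT * FROM jobs WHERE id = ?');
            $stmt->execute(array($id));
            $job = $stmt->fetch(PDO::FETCH_ASSOC);
            $this->cache->set('job_' . $id, $job);
        }
        return $job;
    }

    public function update($id, $title)
    {
        $stmt = $this->db->prepare('UPDATE jobs SET title = ? WHERE id = ?');
        $stmt->execute(array($title, $id));
        // Drop the cached copy right where the write happens; the next read
        // repopulates it from the database.
        $this->cache->delete('job_' . $id);
    }
}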

Search is another area that commonly relies on the database, and it can eat a ton of resources. If search is performed in SQL, difficult queries can lock tables and keep other queries from being answered. As good as some DBMSes have gotten at handling search (i.e. MySQL’s FULLTEXT), they still can’t fulfill the concurrency demands of a site with heavy traffic. So, again, the solution is to isolate a resource-intensive feature from the rest of the code. There are a few different ways to do this. One is running replication, which may not be possible in smaller hosting environments. Another is to use a full-text search engine (Lucene, Sphinx, etc.). Again, this may not be possible in smaller hosting environments.

Using code that’s already good at managing and retrieving data, an interface can be built to query your data. A second hosting environment that’s suitable for running the search tool of your choice can then query the data code for the updates it needs to keep its index current. In turn, this server will return search results without tying up any resources necessary for doing important stuff, like serving the home page.
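A bare-bones sketch of that interface might look something like this; the endpoint, table, and column names are invented, and a real version would go through the data-management code rather than hitting the database directly:

<?php
// updates.php - returns rows changed since the timestamp the search server
// last saw, so it can keep its index current by polling this endpoint.
$since = isset($_GET['since']) ? $_GET['since'] : '1970-01-01 00:00:00';

$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $db->prepare('SELECT id, title, body, updated_at FROM jobs WHERE updated_at > ? ORDER BY updated_at');
$stmt->execute(array($since));

header('Content-Type: application/json');
echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));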

So in these two short examples, we’ve created a theoretical architecture that can sustain heavy, site-breaking traffic to the search and still continue to serve the home page. Of course, that only holds until the Apache server becomes so inundated with requests that it can’t do anything. Then it’s time to get that load balancer in place…

Jul 8th, 2009

Explain Your Code

In my search to expand my dev team, I use a code sample as one of the main determining factors. During an interview, I will always make the same request:

“Give us a code sample. It can be something that you think is really great, or something you think really sucks. Most importantly, tell us why you think it’s great, or why you think it sucks.”

No one seems to be able to do it. I have received code samples that consist of stream wrappers, database wrappers, complete websites, etc. Some have been really good, and some have been outright scary. But very few candidates have been able to communicate what they think of their own code and why.

Which is very surprising, given most developers' proclivity to judge others' code as total crap without a second thought (guilty).

Jul 1st, 2009

Zombie Projects

A Zombie Project is any project that has been completed for a while but needs to be changed or updated in some way. After lying dormant for a while, the team that developed it may have left, or evolved into a different team, or changed its methodology entirely.

While the project was dormant, it is extremely likely, if not definite, that knee-jerk reactions to bugs have caused changes that no one remembers, much less committed. So when development starts on the project sandbox, if it still exists, it is more likely than not that everything is out of date. In this case, bite the bullet. FTP the whole codebase down and check it against the repo. Your life will probably be miserable for a bit, but at least there won’t be any horrible surprises later. Also, if there’s a sandbox, triple check it. If not, be glad there’s nothing you’ll miss in an old one.

If this project is only being opened for a short period, resist the urge to make small changes in methodology or framework. For every small change you make, you’ll want to make another, and update this, and modify that. Be extremely careful of the budget and the cost of each change, or you’re done for.

Lastly, be very, very afraid of the build to production. Even with fastidious notes, a sys admin with OCD, and Gillian the QA girl telling you how awesome you are, assume you missed something. An open tail of the error log is your best visibility into whatever has reared its ugly head. Your next best bet is your trusty Crisis MO.

Jun 30th, 2009

Document Failure

When things go wrong, there is always a reason. Sometimes it’s a good one. (I had no idea $_SERVER wasn’t available from the CLI.) Sometimes it’s a really bad one. (I left 2 seconds after running the build script, and I didn’t test beforehand.) Whatever the reason is, the question needs to be asked: ‘How can we never, ever let this happen again?’ Once you arrive at a good answer to that question, it’s really important to hold that answer in a vise-like grip and never let it go. Because knowing how to prevent problems and actually taking the steps to prevent them are two very different things.

That is why I keep a Book of Fail. I will write down the consequences and circumstances of a particular catastrophe as a permanent record. I make sure to physically write it down because the act of writing with a pencil, unlike typing, requires significant effort, and helps to deeply ingrain what I have chosen to record. Secondly, keeping these records in a physical journal means that however many crappy Macs I burn through, or hosting plans I forget to pay for, I will still be able to carry my Moleskine around.

Periodically reading through the Book is also crucial to keeping these painful moments fresh. Many people will claim to have learned from a mistake, but will continually repeat the same doomed process over and over without a second’s thought. Once you’ve made a mistake a couple of times, it should never be repeated.