Emergencies will audit the shit out of you

2010.10.22

Things never go wrong at convenient times, like when you’re auditing the latest, coolest version of your app and looking for bugs. Things have a funny way of working out fine then. However, as soon as you look the other way, a multitude of problems come out of the woodwork. It usually goes something like this:

One server goes down, and the system that was supposed to fail silently starts screaming. The application it was supporting goes down, because the proper timeouts and error handling were never written. You can’t fail over, because failing over will take down 2 other applications. When that first server comes back up, nothing works, because the proper startup scripts were never put in place. Once the right services start, if you can remember what the hell they were, you find the original application is configured wrong. Not only is it configured wrong, it’s always been configured wrong, and no one noticed. No one noticed because it only explodes in the exact set of horrible circumstances you have right now. Which is, by the way, being down.

It’s an all-too-familiar story, and one that even the most anal of admins has dealt with. The fact of the matter is that it is going to happen, and there’s not a whole lot you can do to prepare, other than randomly pulling plugs out of servers. But any mistake that causes downtime should only happen once. A proper postmortem is needed to figure out what went wrong where. Once all the variables are understood, the next step is to duplicate the same set of circumstances in your sandbox, and apply the necessary error handling.
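To make "the necessary error handling" a little more concrete, here is a minimal Python sketch of the fail-silently behaviour that was missing in the story above. It is not taken from any real system: `call_with_timeout`, `slow_lookup`, and the pool size are all hypothetical names and values chosen for illustration. The idea is only that a slow or broken dependency should produce a fallback value instead of taking the application down with it.

```python
import concurrent.futures
import logging
import time

log = logging.getLogger("failsafe")

# One shared pool (size is an arbitrary assumption). A hung call ties up
# a worker thread, but the caller gets control back after the timeout.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn(); return fallback if it raises or takes longer than timeout_s."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        log.warning("dependency timed out after %ss", timeout_s)
    except Exception:
        log.exception("dependency failed")
    return fallback


def slow_lookup():
    """Hypothetical stand-in for a supporting service that has stopped responding."""
    time.sleep(0.5)
    return "fresh"


# The application keeps running on the fallback value.
value = call_with_timeout(slow_lookup, timeout_s=0.05, fallback="cached")
```

The timed-out call keeps running in its worker thread, so this limits the damage rather than removing it. Reproducing the outage in a sandbox is still the only way to find out whether the fallback is actually acceptable.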

Downtime and emergencies are a part of running any site. What’s really important is to treat emergencies as an opportunity to learn about what happens when systems fail, for real.

Search is Hard

2010.10.10

The title of this post is a direct quote from a Facebook engineer presenting at the SXSW panel Beyond Lamp. Search is a critical function of any site, but it’s gotten much, much harder as Google has gotten better. To quote the Beyond Lamp panel one more time:

Search is always compared against Google, which is like comparing the canoe you just built to the QE2.

The difficulty of search is made apparent by the fact that the majority of sites, even major ones, get it wrong. A large factor in the success of search is relevancy. Google takes into account 500 million variables in determining how relevant content is. Not only that, but they also know who you are and what you’ve clicked, and can make decisions based on that to present pages that are more relevant to you. Facebook’s EdgeRank and LinkedIn’s Signal are other examples of search implementations that are vast in scale.

In a startup, where time is of the essence and resources need to be begged, borrowed or stolen, search is a huge challenge. It’s like trying to build the QE2 with nothing but a Swiss Army knife. Basic tools normally don’t cut it. MySQL’s FULLTEXT indexes are helpful, but start trying to implement basic IR techniques like booleans, and MySQL’s builtin functionality starts to lack the ability to get the results you want.
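For anyone unfamiliar with what "booleans" means in IR terms, here is a toy Python sketch of boolean retrieval over an inverted index. It has nothing to do with how MySQL or Sphinx work internally; `build_index`, `search`, and the sample documents are made up purely to show the technique.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, all_ids, must=(), should=(), must_not=()):
    """Boolean query: every `must` term, at least one `should` term,
    and no `must_not` term. Terms are expected in lowercase."""
    result = set(all_ids)
    for term in must:
        result &= index.get(term, set())
    if should:
        any_hit = set()
        for term in should:
            any_hit |= index.get(term, set())
        result &= any_hit
    for term in must_not:
        result -= index.get(term, set())
    return sorted(result)


# Made-up sample documents.
docs = {
    1: "Fast canoe rental",
    2: "Canoe and kayak repair",
    3: "Ocean liner tickets",
}
index = build_index(docs)
canoe_not_repair = search(index, docs, must=["canoe"], must_not=["repair"])
```

This only answers yes or no per document. Ranking the matches by relevancy is where the real difficulty starts.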

There are ways to simplify building search. Sphinx provides great matching capabilities and incredibly fast sorting. When combined with other data, Sphinx can be a great way to get users fast, meaningful results. The one downside with using a document based search engine is that there is little room for returning completely tailored results. Unlike MySQL, which allows you to slice and dice data in any way you choose, it is more difficult to return results that take into account relationships specific to users and documents. However, for most search tasks, it should function very well.

It happens to everyone…

2010.09.26

Through a combination of unhealthy fears, paranoid tendencies, and luck, I’ve been able to avoid that unavoidable situation that every sysadmin fears: completely nuking a system. Until last Tuesday, when I did something really, really dumb. On the server that hosts http://chr.ishenry.com, I had noticed a script, svcrack.py, running and consuming lots of resources, and bandwidth, as I would later find out from my hosting bill.

Since I sure as hell wasn’t running that, I could only assume that someone had exploited my server and was using it to look for unsecured VoIP installations. Initially, I assumed killing the scripts and changing some passwords would be sufficient. However, checking in on the server later, I found the same script running. All this is fair enough, as I am on WordPress, a few versions behind, and there are enough folders with unhealthy permissions that I kind of deserved it. So after a few days of trying to lock things down, I got a bit desperate.

Since svcrack is a Python script, there was a good chance the best way to discourage my assailant would be to remove Python. Great idea in theory, but it seemed my execution was a bit poor. It turns out running ‘yum remove python’ is a great way to destroy your entire system. yum runs on Python, which meant a reinstall would have to be done manually. Only problem: most of the basic shell commands stopped working as well. cp, mv, ls all resulted in a ‘command not found’ error. The best part of this situation: no backups. After all the hubbub about blogs and backups lately, it’s kind of amazing I missed this rather important detail.

I’ve always considered data loss the cardinal sin in development, web or otherwise. However, I also never considered my personal site to be mission critical, or worthy of taking the time for backups. But as they say, you never know what you have till it’s gone. I was lucky enough that MySQL and Apache were still running, and I was able to export everything, spin up a new server, and import. Even with no data loss, this is certainly a lesson learned. I am making a backup right now.

Gmail actually gets something really wrong.

2010.08.16

I’m a huge fan of Gmail and Google Apps for many reasons. I love the new redesign, and how they’re finally promoting consistency across their major webapps. It makes me feel like the web could really be a viable alternative to desktop software. I can even deal with slowness in Gmail, given the amount of work they need to do in order to keep your inbox snappy. They need to index every message, which means parsing every message, converting every attachment, and linking it to the search architecture. In real time. Not easy…

However, what I found today was completely inexcusable: Gmail’s clipping “feature”. This is definitely a feature that sounds a lot more like a bug than a helpful tool.

Gmail Message clipping

What should be here is a few more links, some mouse text that contains our mailing address and unsubscribe links. What I did not show in this screenshot is the capacity for destruction this feature has on HTML emails. When the email is ‘clipped’, the HTML is broken at a random place, and not displayed. If your message is clipped at an inopportune place, there goes your entire HTML layout. In the best case, your HTML is simply truncated, leaving users with only a piece of their email.

As the entity sending this email, the responsibility falls on me to make sure that I send emails that are accessible, conform to CAN-SPAM, and are pleasing to the eye. Gmail bones me on all three of these goals, thanks to a lack of documentation as to how long an email can be without invoking the clipping feature. Most importantly, my users have no clear way to unsubscribe from the list, since the most likely links to be clipped are the unsubscribe links.

I agree that performance is king, but never at the cost of the user.

Update: It seems like Gmail limits messages to around 102k characters before clipping. So the solution seems to be running HTML through a compressor. I found a pretty good one here
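A check like the following could run before a send to catch messages that are likely to be clipped. This is a rough Python sketch, not the compressor mentioned above. The threshold constant is an assumption based only on the "around 102k" figure, read here as 102 * 1024 bytes of UTF-8, and `will_probably_clip` and `collapse_whitespace` are hypothetical helper names.

```python
import re

# Assumption: the "around 102k" figure above, read as 102 * 1024 bytes.
# Gmail does not document this, so treat it as a guess and leave headroom.
GMAIL_CLIP_BYTES = 102 * 1024


def html_size_bytes(html):
    """Size of the HTML body once encoded as UTF-8."""
    return len(html.encode("utf-8"))


def will_probably_clip(html, limit=GMAIL_CLIP_BYTES):
    """True if the body is at or over the assumed clipping threshold."""
    return html_size_bytes(html) >= limit


def collapse_whitespace(html):
    """Crude size reduction: squeeze every run of whitespace to one space.
    This would mangle <pre> blocks, so it is no substitute for a real
    HTML compressor."""
    return re.sub(r"\s+", " ", html).strip()


# Made-up template fragment with typical indentation bloat.
template = "<td>\n        <a href='#'>Unsubscribe</a>\n    </td>"
smaller = collapse_whitespace(template)
```

If the body is still over the limit after squeezing, the safer move is probably to cut content or move the unsubscribe links higher up, so they are not the part that gets clipped.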

Rackspace Downtime

2009.11.03

[Update] My team at Rackspace has sent me the fluffiest, most comfortable pillow I have ever had.

When a hosting provider goes down, there are lots of questions that get raised. Is my host reliable? Will they flake out during crucial times when my site needs the traffic? Will they double bill me?

Since I have been working with Rackspace, they have had less than stellar uptime, with issues mostly related to power. My company pays a lot for hosting with them, and downtime for a young company is deadly. But oddly enough, I’m still OK with Rackspace hosting my company’s myriad services. The benefits of hosting with them have been so great that a couple hours of downtime is nothing.

First off, their SLA has provisions for downtime, when it happens. If your server has a legitimate issue, you’re entitled to ask for a credit. To me, this is a promise that they’ll put their money where their mouth is. And if you call them on it, they’ll be reasonable.

Secondly, their support during crises is still amazing. During the truck incident, I was able to get a tech to run fsck on my disks, and hang out to watch, no questions asked. No, I am not on their intensive plan.

Third, their support culture is simply amazing. Their Linux techs are always willing to look deep into an issue to find a resolution, and they provide much of the basic infrastructure that is hard to come by for small companies. They’re also completely willing to educate their customers about the servers they maintain.

In short, Rackspace has been the target of a lot of criticism over issues in their datacenters. The fact of the matter is that there will always be issues and downtime. Their SLA guarantees the impossible, which they seem to realize, as any failure on their part comes with swift response. In the end their SLA seems to be more of a way of setting standards than anything else.

[Full disclosure: I haven't slept in 2 days because of their power issues]

Categories : Horror Stories

Document failure

2009.06.07

When things go wrong, there is always a reason. Sometimes it’s a good one. (I had no idea $_SERVER wasn’t available from the CLI) Sometimes it’s a really bad one. (I left 2 seconds after running the build script, and I didn’t test beforehand.) Whatever the reason is, the question needs to be asked ‘How can we never, ever let this happen again?’ Once we arrive at a good answer to that question, it’s really important that you hold that answer in a vise-like grip, and never let it go. Because knowing how to prevent problems, and actually taking the steps to prevent them are two very different things.

That is why I keep a Book of Fail. I will write down the consequences and circumstances of a particular catastrophe as a permanent record. I make sure to physically write it down because the act of writing with a pencil, unlike typing, requires significant effort, and helps to deeply ingrain what I have chosen to record. Secondly, keeping these records in a physical journal means that however many crappy Macs I burn through, or hosting plans I forget to pay for, I will still be able to carry my Moleskine around.

Periodically reading through the Book is also crucial to keeping these painful moments fresh. Many people will claim to have learned from a mistake, but will continually repeat the same doomed process over and over without a second’s thought. Having made a mistake a couple of times, you should never repeat it.

Hard File Limits

2009.05.09

A while ago, we ran up against a hard file limit on our storage server. Within the main images directory, we subdivide our folders for every user. Well, it turns out that on the ext3 filesystem, there is a hard limit of 32k folders within a folder. So when our 32,001st user signed up, they were welcomed by not being able to upload any content to their nonexistent user folder.
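A check along these lines would have warned us long before that user signed up. It is a minimal Python sketch: the limit constant simply takes the "32k" above as 32,000 (an assumption, so check it against your own filesystem), and `count_subdirs` and `remaining_subdirs` are hypothetical helper names.

```python
import os

# Assumption: the "32k" figure above, taken as 32,000 subfolders.
EXT3_SUBDIR_LIMIT = 32000


def count_subdirs(path):
    """Count the immediate subdirectories of path (symlinks not followed)."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def remaining_subdirs(path, limit=EXT3_SUBDIR_LIMIT):
    """How many more subdirectories fit before the assumed limit is hit."""
    return limit - count_subdirs(path)
```

Run from cron against the user content folder, with an alert once the remaining count drops under some comfortable margin, this turns a surprise outage into a scheduled migration.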

So there we were…

Several calls to our hosting provider Rackspace yielded no elegant solution where we could simply reconfigure the limit. If we wanted to continue on with our file structure the way it was, we would need to migrate our data off the server, reformat the filesystem to xfs, and migrate back on. That would take us down for close to 36 hours. Needless to say that was not an option.

The other option was to quietly close registration, and start a mad dash to implement programmatic partitioning. We created new folders next to our root user content folder, numbered them 2-5, and assigned current users the partition of 1. New users would be assigned a randomly generated number at registration. Then we wrote a few quick utility methods to determine which user belonged to which partition and began the quickest rewrite of a file upload system the world has ever seen.
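The utility methods looked roughly like the sketch below, rewritten here in Python from memory of the idea rather than copied from the real code. The folder names (`users`, `users2` and so on), the function names, and the base path are all hypothetical; only the scheme itself (existing users in partition 1, new users randomly assigned to 2-5) comes from the description above.

```python
import os
import random

LEGACY_PARTITION = 1            # existing users stay where they already are
NEW_PARTITIONS = (2, 3, 4, 5)   # the new folders, numbered 2-5


def assign_partition(is_existing_user, rng=random):
    """Pick a partition once, at registration time."""
    if is_existing_user:
        return LEGACY_PARTITION
    return rng.choice(NEW_PARTITIONS)


def user_dir(base, partition, user_id):
    """Build a user's content folder. The layout is a guess for illustration:
    partition 1 is the original folder, the others sit next to it."""
    if partition == LEGACY_PARTITION:
        folder = "users"
    else:
        folder = "users%d" % partition
    return os.path.join(base, folder, str(user_id))
```

Because the number is random, it has to be saved with the user record at registration. Nothing about the user id tells you which partition to look in later.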

Downtime : 0

In retrospect, someone (me) should have taken a scrutinizing look at the characteristics of the filesystem that would be responsible for storing all of our users’ content.

Lesson learned, the hard way.

Categories : Horror Stories