Gmail actually gets something really wrong.

2010.08.16

I’m a huge fan of Gmail and Google Apps for many reasons. I love the new redesign, and how they’re finally promoting consistency across their major webapps. It makes me feel like the web could really be a viable alternative alternative to desktop software. I can even deal with slowness in Gmail, given the amount of work they need to do in order to keep your inbox snappy. They need to index every message, which means parsing every message, converting every attachment, and linking it the search architecture. In real time. Not easy…

However, what I found today, was completely inexcusable: Gmail’s clipping “feature”. This is definitely a feature that sounds a lot more like a bug than a helpful tool.

Gmail Message clipping

What should be here is a few more links, some mouse text that contains our mailing address and unsubscribe links. What I did not show in this screenshot is the capacity for destruction this feature has on HTML emails. When the email is ‘clipped’, the HTML is broken at a random place, and not displayed. If your message is clipped at an inopportune place, there goes your entire HTML layout. In the best case, your HTML is simply truncated, leaving users with only a piece of their email.

As the entity sending this email, the responsibility falls on me to make sure that I send emails that are accessible, conform to CAN-SPAM, and are pleasing to the eye. Gmail bones me on three of these goals. Thanks to a lack of documentation as to how long an email can be without invoking the clipping feature. Most importantly, my users have no clear to unsubscribe from the list, since the most likely links to be clipped are the unsubscribe links.

I agree that performance is king, but never at the cost of the user.

Rackspace Downtime

2009.11.03

[Update] My team at Rackspace has sent me the fluffiest, most comfortable pillow I have ever had.

When a hosting provider goes down, there are lots of questions that get raised. Is my host reliable? Will they flake out during crucial times when my site needs the traffic? Will they double bill me?

Since I have been working with Rackspace, they have had less than stellar uptime, with issues mostly related to power. My company pays a lot for hosting with them, and downtime for a young company is deadly. But oddly enough, I’m still OK with Rackspace hosting my company’s myriad services. The benefits of hosting with them have been so great that a couple hours of downtime is nothing.

First off, their SLA has provisions for downtime, when it happens. If your server has a legitimate issue, you’re entitled to ask for a credit. To me, this is a promise that they’ll put their money where their mouth is. And if you call them on it, they’ll be reasonable.

Secondly, their support during crises is still amazing. During the truck incident, I was able to get a tech to run fsck on my disks, and hang out to watch no questions asked. No, I am not on their intensive plan.

Third, their support culture is simply amazing. Their linux techs are always willing to look deep into an issue to find a resolution, and they provide much of the basic infrastructure that is hard to come by for small companies.. They’re also completely willing to educate their customers about the servers they maintain.

In short, Rackspace has been the target of a lot of criticism over issues in their datacenters. The fact of the matter is that there will always be issues and downtime. Their SLA guarantees the impossible, which they seem to realize, as any failure on their part comes with swift response. In the end their SLA seems to be more of a way of setting standards than anything else.

[Full disclosure: I haven't slept in 2 days because of their power issues]


Categories : Horror Stories

Document failure

2009.06.07

When things go wrong, there is always a reason. Sometimes it’s a good one. (I had no idea $_SERVER wasn’t available from the CLI) Sometimes it’s a really bad one. (I left 2 seconds after running the build script, and I didn’t test beforehand.) Whatever the reason is, the question needs to be asked ‘How can we never, ever let this happen again?’ Once we arrive at a good answer to that question, it’s really important that you hold that answer in a vise-like grip, and never let it go. Because knowing how to prevent problems, and actually taking the steps to prevent them are two very different things.

That is why I keep a Book of Fail. I will write down the consequences and circumstances of a particular catastrophe as a permanent record. I make sure to physically write it down because the act of writing with a pencil, unlike typing, requires significant effort, and helps to deeply ingrain what I have chosen to record. Secondly, keeping these records in a physical journal means that however many crappy Macs I burn through, or hosting plans I forget to pay for, I will still be able to carry my Moleskine around.

Periodically reading through the Book is also crucial to keeping these painful moments fresh. Many people will claim to have learned from a mistake, but will continually repeat the same doomed process over and over without a second’s thought. Once having a made a mistake a couple times, it should never be repeated.

Hard File Limits

2009.05.09

A while ago, we ran up against a hard file limit on our storage server. Within the main images directory, we subdivide our folders for every user. Well, it turns out that on the ext3 filesystem, there is a hard limit of 32k folders within a folder. So when our 32001th user signed up, they were welcomed by not being able to upload any content to their nonexistent user folder.

So there we were…

Several calls to our hosting provider Rackspace yielded no elegant solution where we could simply reconfigure the limit. If we wanted to continue on with our file structure the way it was, we would need to migrate our data off the server, reformat the filesystem to xfs, and migrate back on. That would take us down for close to 36 hours. Needless to say that was not an option.

The other option was to quietly close registration, and start a mad dash to implement programmatic partitioning. We created 5 folders next to our root user content folder, numbered them 2-5, and assigned current users the partition of 1. New users would be assigned a randomly generated number at registration. Then we wrote a few quick utility methods to determine which user belonged to which partition and began the quickest rewrite of file upload system the world has ever seen.

Downtime : 0

In retrospect, someone (me) should have taken a scrutinizing look at the characteristics of the filesystem that would be responsible for storing all of our users’ content.

Lesson learned, the hard way.


Categories : Horror Stories