SXSW Isn't for Backend Devs Anymore

During my time at the interactive portion of SXSW, I was looking for great technical panels on practical ways to improve my technical skills. While I found a bunch of panels that addressed some interesting issues, I don’t think I saw any server-side code the entire time I was there. There were a number of great CSS and HTML5 talks, but aside from the PHP workshop that was listed at the wrong time in the booklet, I found no practical talks for backend developers.

Over the past couple of years, there has definitely been a shift in how SXSW views technical talks. Two years ago, there were developers presenting great content on how to structure APIs, write great PHP, and develop iPhone / Android apps. This year, the PHP workshop had the incorrect time printed in the booklet. The technical talks this year seemed to be designed for the nontechnical. The discussions around scaling included big names quoting scaling statistics that are sure to be taken out of context and read as gospel by technical managers everywhere. The Android Developer meetup was almost completely devoid of Android devs, just sharks looking for them, myself included. The panel on the death of the RDBMS painted a rosy picture of what databases could be, but did not include a mention of a single technology that fit the presenter’s pipe dream.

The ubiquity of the term social media at SXSW leaves me with one conclusion: SXSW has changed from a gathering of people doing cool stuff to a group of people talking about stuff they think is cool right now. There has been a lot of discussion on conversation, but not much conversation on how to create things worth talking about.

Mar 16th, 2011

The Progress Bar Psych

A classic UX problem is communicating to users how long they’ll have to wait before their task completes. A spinner or progress bar provides feedback that the system is, in fact, doing something, and a progress bar also hints at how long that task may take. Psychologically, progress bars create tension while progressing, and resolution when completed.

From a technical standpoint, progress bars are black magic. The developer is attempting to estimate a task based on potentially thousands of variables. In the case of a file upload, the developer has to deal with differing network conditions, disk performance, and so on. Then they have to write the code to communicate what is happening to the browser. Not a trivial task. However, when executed well, a progress bar can provide the user with reasonable feedback about their task.
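To make the estimation problem a little more concrete, here is a minimal Python sketch of one common approach: smooth the observed throughput with an exponential moving average and divide the remaining bytes by it. The class name, the `smoothing` value, and the idea of feeding it per-chunk timings are all assumptions for illustration, not how any particular uploader works.

```python
class UploadProgress:
    """Tracks bytes sent and estimates time remaining from smoothed throughput."""

    def __init__(self, total_bytes, smoothing=0.3):
        self.total_bytes = total_bytes
        self.sent_bytes = 0
        self.smoothing = smoothing  # weight given to the newest throughput sample
        self.rate = None            # smoothed bytes per second, unknown at start

    def update(self, chunk_bytes, elapsed_seconds):
        """Record a chunk of chunk_bytes that took elapsed_seconds to send."""
        self.sent_bytes = min(self.total_bytes, self.sent_bytes + chunk_bytes)
        if elapsed_seconds <= 0:
            return  # no usable timing information for this chunk
        sample = chunk_bytes / elapsed_seconds
        if self.rate is None:
            self.rate = sample
        else:
            # Exponential moving average: recent chunks count more, old ones fade.
            self.rate = self.smoothing * sample + (1 - self.smoothing) * self.rate

    def percent(self):
        """Percentage of the upload completed so far."""
        if self.total_bytes == 0:
            return 100.0
        return 100.0 * self.sent_bytes / self.total_bytes

    def eta_seconds(self):
        """Estimated seconds remaining, or None if no rate is known yet."""
        if not self.rate:
            return None
        return (self.total_bytes - self.sent_bytes) / self.rate
```

The smoothing is the whole trick: a raw per-chunk rate makes the bar jump around with every network hiccup, while too much smoothing makes it slow to notice that conditions really changed.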

Lately, sites like LinkedIn, Mint.com, and OKCupid have used that same tension to motivate users to completely fill out their profiles. During profile creation, a progress bar is displayed indicating how far the user has come along. Once the user completely fills out their profile, the progress bar hits 100%, and what changes? In most cases, nothing. The progress bar is just a psychological hack to entice users to go through the entire process.

The question is: exactly how effective is the progress bar at enticing users to fully complete the task at hand? And is it actually worth it?

Mar 12th, 2011

Emergencies Will Audit the Shit Out of You

Things never go wrong at convenient times, like when you’re auditing the latest, coolest version of your app and looking for bugs. Things have a funny way of working out fine then. However, as soon as you look the other way, a multitude of problems come out of the woodwork. It usually goes something like this:

One server goes down, and the system that was supposed to fail silently starts screaming. The application it was supporting goes down, because the proper timeouts and error handling were never written. You can’t fail over, because failing over will take down two other applications. When that first server comes back up, nothing works, because the proper startup scripts were never put in place. Once the right services start, if you can remember what the hell they were, you find the original application is configured wrong. Not only is it configured wrong, it’s always been configured wrong, and no one noticed. No one noticed because it only explodes in the exact set of horrible circumstances you have right now. Which is, by the way, being down.

It’s an all-too-familiar story, and one that even the most anal of admins has dealt with. The fact of the matter is that it is going to happen, and there’s not a whole lot you can do to prepare, other than randomly pulling plugs out of servers. But any mistake that causes downtime should only happen once. A proper postmortem needs to be done here to figure out what went wrong where. Once all the variables are understood, the next step is to duplicate the same set of circumstances in your sandbox, and apply the necessary error handling.
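As a rough idea of what "the necessary error handling" can look like for a dependency that is supposed to fail silently, here is a minimal Python sketch. The function name `call_with_fallback` and the notion of a pre-computed fallback value are assumptions made up for this example; the real fix depends entirely on what failed in your postmortem.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

log = logging.getLogger("dependency")


def call_with_fallback(func, timeout_seconds, fallback):
    """Run func(); on timeout or error, log it and return fallback instead."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        log.warning("dependency timed out after %ss; serving fallback", timeout_seconds)
        return fallback
    except Exception:
        # The dependency is down or broken: record it, but keep the app up.
        log.exception("dependency raised; serving fallback")
        return fallback
    finally:
        # Do not block the caller waiting on a stuck worker thread.
        executor.shutdown(wait=False)
```

One caveat with this sketch: a timed-out call keeps running in its worker thread until it finishes on its own. The caller stops waiting, but the work is not cancelled, so the dependency's own client still needs sane socket timeouts.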

Downtime and emergencies are a part of running any site. What’s really important is to treat emergencies as an opportunity to learn about what happens when systems fail, for real.

Search Is Hard

The title of this post is a direct quote from a Facebook engineer presenting at the SXSW panel Beyond LAMP. Search is a critical function of any site, but it’s gotten much, much harder as Google has gotten better. To quote the Beyond LAMP panel one more time:

Search is always compared against Google, which is like comparing the canoe you just built to the QE2.

The difficulty of search is made apparent by the majority of sites: even major sites get it wrong. A large factor in the success of search is relevancy. Google takes into account 500 million variables in determining how relevant content is. Not only that, but they also know who you are and what you’ve clicked, and can make decisions based on that to present pages that are more relevant to you. Facebook’s EdgeRank and LinkedIn’s Signal are other examples of search implementations that are vast in scale.

In a startup, where time is of the essence and resources need to be begged, borrowed or stolen, search is a huge challenge. It’s like trying to build the QE2 with nothing but a Swiss Army knife. Basic tools normally don’t cut it. MySQL’s FULLTEXT indexes are helpful, but start trying to implement basic IR techniques like boolean queries, and MySQL’s builtin functionality starts to lack the ability to get the results you want.
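For anyone who hasn’t met the IR jargon, boolean matching boils down to set operations on an inverted index. The Python sketch below is a toy illustration of that idea only; the function names and sample documents are made up, and it is not how MySQL or any real search engine does it internally.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_search(index, must=(), must_not=()):
    """Return ids of documents with every `must` term and no `must_not` term."""
    if not must:
        return set()
    # AND: intersect the posting sets of all required terms.
    results = set.intersection(*(index.get(term.lower(), set()) for term in must))
    # NOT: subtract the posting sets of all excluded terms.
    for term in must_not:
        results -= index.get(term.lower(), set())
    return results


docs = {
    1: "fast canoe rental",
    2: "canoe building guide",
    3: "ocean liner guide",
}
index = build_index(docs)
matches = boolean_search(index, must=["guide"], must_not=["canoe"])
```

Matching is the easy part. Nothing here says which of the matching documents should come first, and that ranking problem is exactly where the canoe stops looking like the QE2.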

There are ways to simplify building search. Sphinx provides great matching capabilities and incredibly fast sorting. When combined with other data, Sphinx can be a great way to get users fast, meaningful results. The one downside of using a document-based search engine is that there is little room for returning completely tailored results. Unlike with MySQL, which allows you to slice and dice data in any way you choose, with Sphinx it is more difficult to return results that take into account relationships specific to users and documents. However, for most search tasks, it should function very well.

It Happens to Everyone...

Through a combination of unhealthy fears, paranoid tendencies, and luck, I’ve been able to avoid that unavoidable situation that every sysadmin fears: completely nuking a system. Until last Tuesday, when I did something really, really dumb. On the server that hosts http://chr.ishenry.com, I had noticed a script, svcrack.py, running and consuming lots of resources, and bandwidth, as I would later find out from my hosting bill.

Since I sure as hell wasn’t running that, I could only assume that someone had exploited my server and was using it to look for unsecured VoIP installations. Initially, I assumed killing the scripts and changing some passwords would be sufficient. However, checking in on the server later, I found the same script running. All this is fair enough, as I am on WordPress, a few versions behind, and there are enough folders with unhealthy permissions that I kind of deserved it. So after a few days of trying to lock things down, I got a bit desperate.

Since svcrack is a Python script, there was a good chance the best way to discourage my assailant would be to remove Python. Great idea in theory, but it seemed my execution was a bit poor. It turns out running ‘yum remove python’ is a great way to destroy your entire system. yum runs on Python, which meant a reinstall would have to be done manually. Only problem: most of the basic shell commands stopped working as well. cp, mv, ls all resulted in a ‘command not found’ error. The best part of this situation: no backups. After all the hubbub about blogs and backups lately, it’s kind of amazing I missed this rather important detail.

I’ve always considered data loss the cardinal sin in development, web or otherwise. However, I also never considered my personal site to be mission critical, or worthy of taking the time for backups. But as they say, you never know what you have till it’s gone. I was lucky enough that MySQL and Apache were still running, and I was able to export everything, spin up a new server, and import. Even with no data loss, this is certainly a lesson learned. I am making a backup right now.

Ode to the Environment

Consistency is key, in everything. People come to rely on trains, because they come on time. Devs rely on the environment which they develop in, because its stability allows them to be productive. Changing or upgrading that environment can be the same as changing the train schedule. Sometimes people get where they’re going faster, but sometimes, the change makes their life miserable.

Environment upgrades can be just like that. Speed, security, features. Everybody likes those. The ugly side to upgrades is that they have a tendency to break things that are already working, may be incompatible with current code, and disrupt the work of the team. As a struggling sysadmin / developer, all I really want in life is a stable platform that I can build my app on.

Hence, the Ode to the Environment:

The Environment is the basis for my business. Without it, and its consistency, there is uncertainty, chaos, and ultimately, failure.

I need to be able to replicate the Environment quickly, identify when issues are caused by it, sandbox it, and be comfortable building it from scratch, if it comes to that. (Hopefully, it never does.)

I need to be confident in the set of packages I’ve come to love, loathe, and rely on, and make sure they work for my business’s app.

I know that the Environment’s well-being will affect my application’s uptime, developers relying on it, and my business’s reputation.

I need to know the flaws and shortcomings in the Environment, and weigh how to fix them against the cost of change.

When it comes time to upgrade the Environment, there will be a damn good reason. I need to be thoroughly convinced that my business will see benefits immediately.

Once I upgrade the Environment, I need to love and loathe it the same as the old, embrace whatever change it brings, advocate for it, and fix whatever issues the change brings.

Above all, I will maintain the Environment that best suits my business, and ensure that it always meets the goals of my business, no matter the cost.

Gmail Actually Gets Something Really Wrong.

I’m a huge fan of Gmail and Google Apps for many reasons. I love the new redesign, and how they’re finally promoting consistency across their major webapps. It makes me feel like the web could really be a viable alternative to desktop software. I can even deal with slowness in Gmail, given the amount of work they need to do in order to keep your inbox snappy. They need to index every message, which means parsing every message, converting every attachment, and linking it into the search architecture. In real time. Not easy…

However, what I found today, was completely inexcusable: Gmail’s clipping “feature”. This is definitely a feature that sounds a lot more like a bug than a helpful tool.

[Screenshot: Gmail message clipping]

What should be here is a few more links and some fine print that contains our mailing address and unsubscribe links. What this screenshot does not show is the capacity for destruction this feature has on HTML emails. When the email is ‘clipped’, the HTML is broken at an arbitrary place, and the rest is not displayed. If your message is clipped at an inopportune place, there goes your entire HTML layout. In the best case, your HTML is simply truncated, leaving users with only a piece of their email.

As the entity sending this email, the responsibility falls on me to make sure that I send emails that are accessible, conform to CAN-SPAM, and are pleasing to the eye. Gmail bones me on all three of these goals, thanks to a lack of documentation on how long an email can be before the clipping feature kicks in. Most importantly, my users have no clear way to unsubscribe from the list, since the links most likely to be clipped are the unsubscribe links.

I agree that performance is king, but never at the cost of the user.

Update: It seems like Gmail limits messages to around 102k characters before clipping. So the solution seems to be running HTML through a compressor. I found a pretty good one here
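In the same spirit, here is a rough Python sketch of a pre-send check. It treats the ~102k figure above as a byte limit, which is an assumption on my part, and `squeeze_html` is a deliberately naive, made-up stand-in for a real compressor: it only drops ordinary comments and collapses whitespace.

```python
import re

# Assumption: the ~102k figure above, interpreted as bytes of UTF-8 HTML.
GMAIL_CLIP_BYTES = 102 * 1024


def squeeze_html(html):
    """Naively shrink HTML: drop plain comments and collapse runs of whitespace.

    Conditional comments (<!--[if ...]>) are left alone because some mail
    clients rely on them. Not safe for <pre> blocks or inline scripts.
    """
    html = re.sub(r"<!--(?!\[if).*?-->", "", html, flags=re.DOTALL)
    html = re.sub(r"\s+", " ", html)
    return html.strip()


def will_be_clipped(html, limit=GMAIL_CLIP_BYTES):
    """Return True if the encoded message body is larger than the assumed limit."""
    return len(html.encode("utf-8")) > limit


raw = "<table>\n    <tr>\n   <td>Hi</td>\n </tr>\n</table><!-- footer -->"
small = squeeze_html(raw)
```

Running a check like this before every send is cheap insurance: if the squeezed message is still over the limit, the content itself needs to shrink, starting with anything that sits above the unsubscribe links.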

Aug 16th, 2010

MySQL and Linux Swappiness

MySQL’s InnoDB engine is really great. Row-level locking is amazing in tables where there is heavy concurrency. Write buffering is also awesome for cases where a table needs to accept a lot of data. InnoDB’s use of memory to store indexes or sometimes the entire table can also make reads incredibly fast, especially on tables that need to support complex queries where even the best placed indexes do nothing.

However, when tables get large and the InnoDB buffer pool (innodb_buffer_pool_size) is set close to the amount of memory on the server, Linux has a tendency to remove your data from memory for no good reason. The symptoms are unmistakable: a query that was known to be pretty quick, but hasn’t run in a while, will take a long time. Too long. Run it again, and it becomes snappy. What’s happening is that when the query initially runs, the necessary data isn’t in memory, so it’s read in from disk, and the query is performed. Once it’s in memory, that second run is quick.

Actually there is a good reason Linux behaves like this:

“My point is that decreasing the tendency of the kernel to swap stuff out is wrong. You really don’t want hundreds of megabytes of BloatyApp’s untouched memory floating about in the machine. Get it out on the disk, use the memory for something useful.”

This all makes sense, as most systems need to reclaim memory from applications that aren’t doing anything. Except in the case where you have a large dataset in InnoDB that you’d really like to be in memory when you query it. Luckily, there is a tunable that you can change to dictate how aggressive Linux is in reclaiming memory from applications. /proc/sys/vm/swappiness stores a number from 0 to 100, where 100 means that Linux will be extremely aggressive in reclaiming memory, and 0 means that memory won’t be reclaimed all that much.

For servers that need to keep datasets in memory all the time, this variable can be extremely helpful. With an InnoDB table / indexes that consume ~80% of memory on the machine, a swappiness value of 30 is sufficient to allow MySQL to keep most of that in memory. Of course, I don’t recommend this for a machine that is not 100% dedicated to a single task. However, on dedicated MySQL machines, tuning this variable can be really helpful.
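For the curious, here is a small Python sketch of reading and changing the value. The helper names are made up for this example, the 0 to 100 check simply mirrors the range described above, and writing to the real file requires root.

```python
from pathlib import Path

SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def write_swappiness(value, path=SWAPPINESS_PATH):
    """Set swappiness. Needs root on the real path and is lost on reboot."""
    if not 0 <= value <= 100:
        raise ValueError("swappiness must be between 0 and 100")
    Path(path).write_text(f"{value}\n")
```

A value written this way only lasts until the next reboot. To make it stick, set vm.swappiness in /etc/sysctl.conf as well.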

Jul 30th, 2010

Accountability Is a Feedback Loop

Accountability is a word that’s getting tossed around a lot lately. You hear people saying things like:

  • That developer should be held accountable for the validation problems.

  • The tester should be accountable for not finding that bug.

  • BP needs to be accountable for destroying an ecosystem.

The term seems to be thrown around most often when parts of a system fail. BP is part of a larger industry that’s regulated. The government agency responsible for monitoring safety measures is supposed to ensure that companies like BP follow safety regulations. So when BP made their whoopsie daisy, the fingers were pointed squarely at them. However, where were the regulators? There were tons of opportunities for the government to push feedback to BP regarding the safety of their operation. But it seemed like no one was talking.

The development process is strikingly similar. Any development team worth their bits has a process that puts any issue in front of at least two parties at all times. Joel Spolsky’s infamous Bug 1203, a quick story about the interactions between a dev and a tester, is the picture of accountability, and shows that without active management and constant feedback being exchanged, things don’t get done.

A quick synopsis and commentary: Jill the tester finds a bug, and provides feedback to the dev team via the ticket system. In doing so, Jill has started the feedback loop, and made it the responsibility of the dev team to investigate the issue. The dev team, as they are prone to doing, deny responsibility for the issue, and mark the issue as ‘NOT A BUG’. Having done so, they’ve put the onus on Jill to prove it’s really a bug, which she does (probably in about 2 seconds). It’s again the responsibility of the dev team to fix the bug, which they do. Jill confirms the fix, and thereby closes the loop.

What’s important to realize is that in this type of process, it is the responsibility of anyone and everyone involved to be accountable for their role, and be focused on pushing feedback to the next person. Once there’s a break in the loop, the issue is likely to be dropped, and never fixed. The last person holding the ball is the screwup. I’m sure someone somewhere is really upset they didn’t ask BP about that little safety measure.

Jun 23rd, 2010

I'll Buy Time Any Day.

Many people value money as the most important thing in life, and will gladly trade time for it. The pursuit of saving money is an extremely American one. People will spend time in line for free stuff, just because it’s free. Motorists getting tickets will spend days in court, just to avoid a fine. Clipping coupons has become an art form, and even extended to the digital world in the form of sites like SlickDeals and Groupon.

Me, I like time. To me, time is way more valuable than the almighty dollar. Reason: I can’t get it back. Evar.

If I get a parking ticket, I know that if I pony up those 55 greenbacks, chances are there will be a check with my name on it in the next couple of weeks that puts those 55 American pesos back in my pocket. If I went to court, I’d never get back those 3-4 hours of my life. I’d also probably lose the case. I’d also probably spend that time sitting next to someone who smells like cheese. Paying up gives me a net gain of 3-4 hours of my life, which I could spend doing stuff I like.

Being on a development team in a startup is pretty much the same thing. Your team should be focusing as much time as possible on actually developing your product. That means doing the things unique to your business and focusing on what your company decides its core competency should be. However, there’s tons of work that is hard and generally unpleasant. Not only is it unpleasant, but it can be incredibly time-consuming, because chances are you’re not good at it, or find it kind of icky. Leave that stuff to someone else. Even better, find someone who likes doing that stuff and pay them to do it.

In the cloud-infested webscape that exists today, there are any number of companies that have decided that their core competency is something specialized that you probably need. Companies that specialize in IT management, video encoding, DNS, storage, billing, etc. all exist and are willing to accept a chunk of your cold hard cash to provide a service. The most important thing to realize is that if time is of the essence (and it always is), you’re not just buying a service, you’re also buying the time it would take you to build that service yourself. So don’t be a crafty coupon-clipper and build it yourself. Buy back that precious, precious time and spend it doing something you really like.

Jun 16th, 2010