2010.10.10
The title of this post is a direct quote from a Facebook engineer presenting at the SXSW panel Beyond LAMP. Search is a critical function of any site, but it’s gotten much, much harder as Google has gotten better. To quote the Beyond LAMP panel one more time:
Search is always compared against Google, which is like comparing the canoe you just built to the QE2.
The difficulty of search is made apparent by how many sites, even major ones, get it wrong. A large factor in the success of search is relevancy. Google takes into account 500 million variables in determining how relevant content is. Not only that, but they also know who you are, what you’ve clicked, and can make decisions based on that to present pages that are more relevant to you. Facebook’s EdgeRank and LinkedIn’s Signal are other examples of search implementations that are vast in scale.
In a startup, where time is of the essence and resources need to be begged, borrowed or stolen, search is a huge challenge. Like trying to build the QE2 with nothing but a Swiss Army knife. Basic tools normally don’t cut it. MySQL’s FULLTEXT indexes are helpful, but start trying to implement basic IR techniques like boolean queries, and MySQL’s built-in functionality starts to lack the ability to get the results you want.
There are ways to simplify building search. Sphinx provides great matching capabilities and incredibly fast sorting. When combined with other data, Sphinx can be a great way to get users fast, meaningful results. The one downside of using a document-based search engine is that there is little room for returning completely tailored results. Unlike MySQL, which allows you to slice and dice data in any way you choose, it is more difficult to return results that take into account relationships specific to users and documents. However, for most search tasks, it should function very well.
2010.09.26
Through a combination of unhealthy fears, paranoid tendencies, and luck, I’ve been able to avoid that unavoidable situation that every sysadmin fears: completely nuking a system. Until last Tuesday, when I did something really, really dumb. On the server that hosts http://chr.ishenry.com, I had noticed a script, svcrack.py, running and consuming lots of resources, and bandwidth, as I would later find out from my hosting bill.
Since I sure as hell wasn’t running that, I could only assume that someone had exploited my server and was using it to look for unsecured VoIP installations. Initially, I assumed killing the scripts and changing some passwords would be sufficient. However, checking in on the server later, I found the same script running. All this is fair enough, as I am on WordPress, a few versions behind, and there are enough folders with unhealthy permissions that I kind of deserved it. So after a few days of trying to lock things down, I got a bit desperate.
Since svcrack is a Python script, there was a good chance the best way to discourage my assailant would be to remove Python. Great idea in theory, but it seemed my execution was a bit poor. It turns out running ‘yum remove python’ is a great way to destroy your entire system. yum runs on Python, which meant a reinstall would have to be done manually. Only problem: most of the basic shell commands stopped working as well. cp, mv, ls all resulted in a ‘command not found’ error. The best part of this situation: no backups. After all the hubbub about blogs and backups lately, it’s kind of amazing I missed this rather important detail.
I’ve always considered data loss the cardinal sin in development, web or otherwise. However, I also never considered my personal site to be mission critical, or worthy of taking the time for backups. But as they say, you never know what you have till it’s gone. I was lucky enough that MySQL and Apache were still running, and I was able to export everything, spin up a new server, and import. Even with no data loss, this is certainly a lesson learned. I am making a backup right now.
2010.08.24
Consistency is key, in everything. People come to rely on trains, because they come on time. Devs rely on the environment which they develop in, because its stability allows them to be productive. Changing or upgrading that environment can be the same as changing the train schedule. Sometimes people get where they’re going faster, but sometimes, the change makes their life miserable.
Environment upgrades can be just like that. Speed, security, features. Everybody likes those. The ugly side to upgrades is that they have a tendency to break things that are already working, may be incompatible with current code, and disrupt the work of the team. As a struggling sysadmin / developer, all I really want in life is to build a stable platform that I can build my app in.
Hence, the Ode to the Environment:
The Environment is the basis for my business. Without it, and its consistency, there is uncertainty, chaos, and ultimately, failure.
I need to be able to replicate the Environment quickly, identify when issues are caused by it, sandbox it, and be comfortable building it from scratch, if it comes to that. (Hopefully, it never does.)
I need to be confident in the set of packages I’ve come to love, loathe, and rely on, and make sure they work for my business’s app.
I know that the Environment’s well-being will affect my application’s uptime, developers relying on it, and my business’s reputation.
I need to know the flaws and shortcomings in the Environment, and weigh how to fix them against the cost of change.
When it comes time to upgrade the Environment, there will be a damn good reason. I need to be horribly convinced that my business will see benefits immediately.
Once I upgrade the Environment, I need to love and loathe it same as the old, embrace whatever change it brings, advocate for it and fix whatever issues the change brings.
Above all, I will maintain the best Environment that suits my business, and ensure that it always meets the goals of my business, no matter the cost.
2010.08.16
I’m a huge fan of Gmail and Google Apps for many reasons. I love the new redesign, and how they’re finally promoting consistency across their major webapps. It makes me feel like the web could really be a viable alternative to desktop software. I can even deal with slowness in Gmail, given the amount of work they need to do in order to keep your inbox snappy. They need to index every message, which means parsing every message, converting every attachment, and linking it to the search architecture. In real time. Not easy…
However, what I found today was completely inexcusable: Gmail’s clipping “feature”. This is definitely a feature that sounds a lot more like a bug than a helpful tool.

What should be here is a few more links, some mouse text that contains our mailing address and unsubscribe links. What I did not show in this screenshot is the capacity for destruction this feature has on HTML emails. When the email is ‘clipped’, the HTML is broken at a random place, and not displayed. If your message is clipped at an inopportune place, there goes your entire HTML layout. In the best case, your HTML is simply truncated, leaving users with only a piece of their email.
As the entity sending this email, the responsibility falls on me to make sure that I send emails that are accessible, conform to CAN-SPAM, and are pleasing to the eye. Gmail bones me on all three of these goals, thanks to a lack of documentation as to how long an email can be without invoking the clipping feature. Most importantly, my users have no clear way to unsubscribe from the list, since the most likely links to be clipped are the unsubscribe links.
I agree that performance is king, but never at the cost of the user.
Update: It seems like Gmail limits messages to around 102k characters before clipping. So the solution seems to be running HTML through a compressor. I found a pretty good one here
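To catch this before sending, a rough pre-send check can compare the size of the rendered HTML against that limit. This is only a sketch: the 102,000 figure is the approximate number from the update above, not a documented value, and measuring in UTF-8 bytes rather than characters is my assumption. The function names are made up for illustration.

```python
# Assumption: the "around 102k" figure observed above, not an official limit.
GMAIL_CLIP_LIMIT = 102_000


def clipping_headroom(html: str, limit: int = GMAIL_CLIP_LIMIT) -> int:
    """Return how much room is left before the limit (negative if over).

    Size is measured in UTF-8 bytes, which is an assumption; for plain
    ASCII markup this is the same as the character count.
    """
    return limit - len(html.encode("utf-8"))


def will_probably_clip(html: str, limit: int = GMAIL_CLIP_LIMIT) -> bool:
    """True if the message is larger than the assumed clipping limit."""
    return clipping_headroom(html, limit) < 0
```

Running this on the final, post-template HTML (after any compressor) gives an early warning that the footer and unsubscribe links are at risk.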
2010.07.29
MySQL’s InnoDB engine is really great. Row-level locking is amazing in tables where there is heavy concurrency. Write buffering is also awesome for cases where a table needs to accept a lot of data. InnoDB’s use of memory to store indexes or sometimes the entire table can also make reads incredibly fast, especially on tables that need to support complex queries where even the best placed indexes do nothing.
However, when tables get large and the InnoDB buffer pool (innodb_buffer_pool_size) is set close to the amount of memory on the server, Linux has a tendency to remove your data from memory for no good reason. The symptoms are unmistakable: a query that was known to be pretty quick, but hasn’t run in a while, will take long. Too long. Run it again, and it becomes snappy. What’s happening is that when the query initially runs, the necessary data isn’t in memory, so it’s read in from disk, and the query is performed. Once it’s in memory, that second run is quick.
Actually there is a good reason Linux behaves like this:
“My point is that decreasing the tendency of the kernel to swap stuff out is wrong. You really don’t want hundreds of megabytes of BloatyApp’s untouched memory floating about in the machine. Get it out on the disk, use the memory for something useful.”
– http://kerneltrap.org/node/3000
This all makes sense, as most systems need to reclaim memory from applications that aren’t doing anything. Except in the case where you have a large dataset in InnoDB that you’d really like to be in memory when you query it. Luckily, there is a tunable that you can change to dictate how aggressive Linux is in reclaiming memory from applications. /proc/sys/vm/swappiness stores a number from 0 to 100, where 100 means that Linux will be extremely aggressive in reclaiming memory, and 0 means that memory won’t be reclaimed all that much.
For servers that need to keep datasets in memory all the time, this variable can be extremely helpful. With an InnoDB table / indexes that consume ~80% of memory on the machine, a swappiness value of 30 is sufficient to allow MySQL to keep most of that in memory. Of course, I don’t recommend this for a machine that is not 100% dedicated to a single task. However, on dedicated MySQL machines, tuning this variable can be really helpful.
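As a minimal sketch of inspecting and changing the tunable, the helpers below read and write the swappiness file directly. The helper names are my own, the 0 to 100 range check follows the description above, and writing to the real file requires root. A value written this way lasts only until the next reboot; to make it permanent, set vm.swappiness in /etc/sysctl.conf.

```python
from pathlib import Path

# The tunable discussed above; pass a different path to test safely.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path: Path = SWAPPINESS_PATH) -> int:
    """Return the current swappiness value as an integer."""
    return int(path.read_text().strip())


def write_swappiness(value: int, path: Path = SWAPPINESS_PATH) -> None:
    """Set swappiness. Needs root when pointed at the real /proc file."""
    # Range taken from the post (0 to 100); treat it as an assumption.
    if not 0 <= value <= 100:
        raise ValueError(f"swappiness must be between 0 and 100, got {value}")
    path.write_text(f"{value}\n")
```

On a dedicated MySQL box, `write_swappiness(30)` run as root would apply the value mentioned above.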
2010.07.20
Every once in a while I get to leave server-land and do some fun projects that involve doing something on the front end. The latest was building an embed script for the Behance Job List. Projects like this, that get me out of the terminal and into a space that requires a bit more interaction between domains, are particularly appealing. As much as I think the Same Origin Policy is a reasonable rule for security, I love looking at ways to get around it.
The technique I chose for this was JSONP, or JSON with Padding. I’m a huge fan of JSON as a transport, as I feel it is compact, flexible and stupidly simple to generate and consume. In fact, I’ve sworn to never touch another XML file as long as I live. JSONP is really convenient from an API implementation perspective, because when the request for the data is made (via the script tag), all the client has to do is pass a callback and it can use the data in any way it chooses. The server doesn’t have to be aware of what the callback actually does, although I do recommend checking against a list of pre-approved callbacks, just to make sure.
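The server side of that exchange can be sketched in a few lines. This is an illustration, not the actual Behance code: the callback name `renderJobList` and the payload shape are hypothetical, and the allowlist is the pre-approved callback check suggested above.

```python
import json

# Hypothetical allowlist; only these callback names will be emitted.
ALLOWED_CALLBACKS = {"renderJobList"}


def jsonp_response(callback: str, payload: dict) -> str:
    """Wrap a JSON payload in a call to the client's callback function."""
    # Refuse anything not pre-approved, so arbitrary script can't be injected.
    if callback not in ALLOWED_CALLBACKS:
        raise ValueError(f"callback not allowed: {callback!r}")
    return f"{callback}({json.dumps(payload)});"
```

A request with `callback=renderJobList` would get back a script body that calls `renderJobList(...)` with the data, and the embedding page decides what that function does with it.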
Like any semi-decent developer, I have dog-fooded my own work, and implemented the Behance Joblist embed code right here.
A little about the Behance Joblist:
Top global companies find and hire talent on Behance, the world’s leading network for creative professionals.
2010.06.22
Accountability is a word that’s getting tossed around a lot lately. You hear people saying things like:
– That developer should be held accountable for the validation problems.
– The tester should be accountable for not finding that bug.
– BP needs to be accountable for destroying an ecosystem.
The term seems to be thrown around most often when parts of a system fail. BP is part of a larger industry that’s regulated. The government agency responsible for monitoring safety measures is responsible for ensuring that companies follow safety regulations. So when BP made their whoopsie daisy, the fingers were pointed squarely at them. However, where were the regulators? There were tons of opportunities for the government to push feedback to BP regarding the safety of their operation. But it seemed like no one was talking.
The development process is strikingly similar. Any development team worth their bits has a process that puts any issue in front of at least two parties at all times. Joel Spolsky’s infamous Bug 1203, a quick story about the interactions between a dev and a tester, is the picture of accountability, and shows that without active management and constant feedback being exchanged, things don’t get done.
A quick synopsis and commentary: Jill the tester finds a bug, and provides feedback to the dev team via the ticket system. In doing so, Jill has started the feedback loop, and made it the responsibility of the dev team to investigate the issue. The dev team, as they are prone to doing, deny responsibility for the issue, and mark the issue as ‘NOT A BUG’. Having done so, they’ve put the onus on Jill to prove it’s really a bug, which she does (probably in about 2 seconds). It’s again the responsibility of the dev team to fix the bug, which they do. Jill confirms the fix, and thereby closes the loop.
What’s important to realize is that in this type of process, it is the responsibility of anyone and everyone involved to be accountable for their role, and be focused on pushing feedback to the next person. Once there’s a break in the loop, the issue is likely to be dropped, and never fixed. The last person holding the ball is the screwup. I’m sure someone somewhere is really upset they didn’t ask BP about that little safety measure.
2010.06.15
Many people value money as the most important thing in life, and will gladly trade time for it. The pursuit of saving money is an extremely American one. People will spend time in line for free stuff, just because it’s free. Motorists getting tickets will spend days in court, just to avoid a fine. Clipping coupons has become an art form, and even extended to the digital world in the form of sites like SlickDeals and Groupon.
Me, I like time. To me, time is way more valuable than the almighty dollar. Reason: I can’t get it back. Evar.
If I get a parking ticket, I know that if I pony up those 55 greenbacks, chances are there will be a check with my name on it in the next couple of weeks that puts those 55 American pesos back in my pocket. If I went to court, I’d never get back those 3-4 hours of my life. I’d also probably lose the case. I’d also probably spend that time sitting next to someone who smells like cheese. Paying up gives me a net gain of 3-4 hours of my life, which I could spend doing stuff I like.
Being on a development team in a startup is pretty much the same thing. Your team should be focusing as much time as possible on actually developing your product. That means doing the things unique to your business and focusing on what your company decides its core competency should be. However, there’s tons of work that is hard, time-consuming, and generally unpleasant. Not only is it unpleasant, but it can be incredibly time-consuming, because chances are, you’re not good at it, or find it kind of icky. Leave that stuff to someone else. Even better, find someone who likes doing that stuff and pay them to do it.
In the cloud-infested webscape that exists today, there are any number of companies that have decided that their core competency is something specialized that you probably need. Companies that specialize in IT management, video encoding, DNS, storage, billing, etc. all exist and are willing to accept a chunk of your cold hard cash to provide a service. The most important thing to realize is that if time is of the essence (and it always is), you’re not just buying a service, you’re also buying the time it would take you to build that service yourself. So don’t be a crafty coupon-clipper and build it yourself. Buy back that precious, precious time and spend it doing something you really like.
2010.05.31
Since the dawn of online advertising, the gold standard of effectiveness has been the CTR. This has made a lot of sense, since for the first time ever, advertisers could leverage technology to figure out exactly how well they were communicating. A user would click, and that click would be recorded. The total number of clicks is compared against the total number of ads that are put on the screen, and bingo, you know exactly how effective the campaign was. Combining that number with more advanced analytics, such as tracking the user past the initial click, onto the advertiser’s site, and onto really interesting places, like the confirmation page of an ecommerce site, gave advertisers really effective ways to quantify the effectiveness of a campaign.
What has been more difficult to quantify is the value of advertising in creating engagement and awareness. Simply seeing a brand association between host site and banner advertisement creates awareness of a brand. Seeing a banner ad doesn’t necessarily trigger immediate reactions (clicks), but can trigger actions of the user later. Users may be inclined to purchase products later on because of the brand awareness created by seeing banner ads. This is awesome for advertisers, who get a return, albeit indirect, from banner advertising, but it’s far less awesome for publishers. That publisher put a great deal of work into creating content that people want to see, and advertising fees are a very common way of monetizing that work. However, CPMs are commonly determined by CTR. If that publisher has a lousy CTR, as a result of something terrible, like having a savvy demographic that knows not to click on ads, then that publisher suffers.
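The arithmetic behind CTR is just clicks divided by impressions. A tiny sketch, with made-up numbers and a function name of my own choosing:

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Clicks divided by impressions (ads actually put on the screen)."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


# Illustrative numbers only: 25 clicks on 10,000 impressions.
rate = click_through_rate(25, 10_000)
print(f"CTR: {rate:.2%}")  # CTR: 0.25%
```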
Microsoft has long been a proponent of measuring engagement, and Google has recently mentioned rolling out tools that will track a user across sessions on multiple sites. It’s clear that the industry needs to move in this direction, although hopefully it will move slowly and find ways to avoid becoming a ubiquitous, Minority Report-style system where Skynet knows who you are, and will show you ads for Banana Republic after you’ve purchased khakis from the Gap. However, the rewards for publishers could be great if the larger players in the industry were able to track users across websites, and even devices.
2010.05.02
This post could alternately be titled: ‘How to make developers hate you.’
A very common criticism of MySQL is that there is no support for delayed replication. Delaying data flowing from master to slave can be very useful in certain cases. For example, a co-located slave run for backups is still susceptible to data problems caused by a DELETE with no WHERE clause or a mistakenly executed DROP. However, by running the slave anywhere from an hour to a day behind, you have the opportunity to catch whatever problems were caused and have a good copy of your data ready to go.
In sandbox environments, a consistent slave delay is a great way to reproduce race conditions. In fact, running slave delay gives you the opportunity to ensure that data will be out of sync between the master and slave. When you can count on this part of the environment, developers can test and write code against this condition. Of course, in reality, working in this type of environment is really annoying, but necessary.
Delayed MySQL replication can be accomplished by using a tool from the maatkit library. Documentation for the tool can be found at http://www.maatkit.org/doc/mk-slave-delay.html. What’s great about this tool is that it can be run as a daemon, so that it can easily run for an extended period of time without having to do any serious management.
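One way to think about what any delay tool has to do is as a small decision made on every check: hold the slave's SQL thread back while the slave is too close to the master, and let it run once it has fallen far enough behind. The function below is a simplified sketch of that idea only. It is not mk-slave-delay's actual algorithm, and the names and the way lag is measured are hypothetical.

```python
def sql_thread_action(seconds_behind: int, desired_delay: int,
                      sql_thread_running: bool) -> str:
    """Decide what to do with the slave's SQL thread on one check.

    seconds_behind is a hypothetical measure of how far the slave's
    applied data lags the master; how to obtain it is out of scope here.
    Returns "stop", "start", or "none".
    """
    if sql_thread_running and seconds_behind < desired_delay:
        return "stop"   # too close to the master: pause applying events
    if not sql_thread_running and seconds_behind >= desired_delay:
        return "start"  # far enough behind: resume applying events
    return "none"       # already in the right state
```

Run in a loop at a fixed interval, a rule like this keeps the slave hovering around the requested delay, which is what makes the out-of-sync condition dependable enough to test against.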