Out of the Terminal

2010.07.20

Every once and a while I get to leave server-land and get to do some fun projects that involve doing something on the front end. The latest was building an embed script for the Behance Job List. Projects like this, that get me out of the terminal and into a space that requires a bit more interaction between domains, are particularly appealing. As much as I think the Same Origin Policy is reasonable rule for security, I love looking at ways to get around it.

The technique I chose for this was JSONP, or JSON with Padding. I’m a huge fan of JSON as a transport, as I feel it is compact, flexible and stupidly simple to generate and consume. In fact, I’ve sworn to never touch another XML file as long as I live. JSONP is really convenient from a API implementation perspective, because when the request for the data is made (via the script tag), all the client has to do is pass a callback and it can use the data in any way it chooses. The server doesn’t have to be aware of what the callback actually does, although I do recommend checking against a list of pre-approved callbacks, just to make sure.

Like any semi-decent developer, I have dog-fooded my own work, and implemented the Behance Joblist embed code right here.

A little about the Behance Joblist:

Top global companies find and hire talent on Behance, the world’s leading network for creative professionals.

Sphinx Full Text Search Engine

2010.01.28

For a very long time, I was convinced that a FULLTEXT index in MySQL was the best solution for all your searching needs. Then I realized that it was horribly slow, and mixing with complex joins completely destroyed any chances of using MySQL indexes in any way that would make sense or get decent results. The solution to fast and scalable free text search on any website is, of course, a Full Text search engine.

There are a few different ones out there. After a brief affair with Lucene, I settled on Sphinx. Sphinx is easy to install, even on 64-bit machines, and is architected in a way that makes a lot of sense for the web. The following steps were performed on a Red Hat machine. Don’t skip the mysql-dev install, even if you already MySQL installed.

> yum install gcc-c++
> yum install mysql-dev*
> wget http://www.sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz
> tar xzvf sphinx-0.9.9.tar.gz
> mkdir /usr/local/sphinx
> ./configure –prefix /usr/local/sphinx –with-mysql
> ./make
> make install

Once installed, it’s fairly simple to start playing with the packaged example data and queries. The php APIs make integration easy, either to build a service, or use locally as a substitute for MySQL. In fact, as long as the index can be kept reasonably up to date, Sphinx is a better choice for complicated sorts than MySQL.

What’s in a Name?

2009.10.13

It’s easy to get caught up in semantics. Figuring out the best names for variables, tables, columns, classes, etc is something that can eat up hours or even days of a development schedule. The idea is that the more precise the name, the better it is. The arguments for precision naming are many :

* Clear names help other developers read your code.
* New developers who come on will immediately understand what’s happening
* Calling well named methods of classes will read like sentence, further increasing readability.
* Clear names will be able to help developers relate things in the UI to the code.

Keep in mind, I’m not talking about naming conventions. Naming conventions are simply rules for choosing the character sequences. They don’t dictate what words you should assign to things in your code.

Whatever names developers choose, they will get strewn throughout the layers of the application. Database, table and column names will be impacted. Variables in server-side scripts. Organization of classes into folders. Javascript file names. Memcache keys. URLs. Just like sand at the beach, the labels the dev team decided on goes everywhere you can think of. Invariably, the marketing team will bound down the hall, and announce the product is being rebranded. Jobs will become Gigs. Friends will become Followers. Application code will become confusing.

New devs won’t get it anyway.

The fact of the matter is overthinking naming is a good way to get nowhere fast. Keeping it simple and just take enough time to make sure that things make sense will give devs more time to focus on important stuff. Like being able to articulate the thought process behind code.


Categories : Development   Web Dev Teams

Do one thing, but do it really well

2009.07.07

In life there are people who will consider themselves Jacks of all Trades. But as the saying goes, they are master of none.

Websites will always start out as a codebase that does everything. There will be a couple files that add users, encode video to Flash, pull rss feeds, assemble HTML, update products, charge users, manipulate images, redirect old links, handle file uploads, calculate shipping, delete categories, create rss feeds, search the database, etc. Sometimes the code to do these will be organized into files, sometimes it won’t. The whole site will run on a single server, or more likely, a slice of a single server.

None of the things in the above list will be done well. None. This is mostly because there is too little code and too little hardware focused on doing too much. Also, every piece of code will be tightly coupled. So any one of those features could potentially get a ton of traffic, or hit a bump, and consume a ton of resources. Once that happens, it’s safe to assume the whole thing will go down in flames.

So to avoid the Fail Whale, its really important to build sites as a group of components that work together. Architecture is key, and when carefully thought out, can ensure that the most important parts of the site stay up. Even when your image manipulation script on the backend freaks out, the home page should continue to load flawlessly.

With database-driven apps (almost every major site on the web), there needs to be particular attention paid to a caching layer. Again, since most sites start out with a jumbled codebase, the likelihood that all the code to manage data is in the same place is unlikely. Given the complexities of managing cache objects, making sure that objects are invalidated on update is crucial to making updates look seamless. So there needs to be a set of code that’s good at one thing: managing data and its cache.

Search is another area that commonly relies on database, and can eat a ton of resources. If performing search in SQL, difficult queries can lock tables and keep other queries from being answered. As good as some DBMSes have gotten at handling search ( ie MySQL’s FULLTEXT ), they still can’t fulfill the concurrency demands of a site with heavy traffic. So, again, the solution is a change where a resource intensice feature needs to be isolated from other code. There are a few different ways to do this. One is running replication, which may not be possible in smaller hosting environments. Another is to use Full Text Search (Lucene, Sphinx, etc.) Again, this may not be possible in smaller hosting environments.

Using code that’s already good at managing and retrieving data, an interface can be built to query your data. A second hosting environment that’s suitable for running the search tool of your choice can then query the data code for updates it needs to keep itself updated. In turn, this server will return search results without tying up any resources necessary for doing important stuff, like serving the home page.

So in these two short examples, we’ve created a theoretical architecture that can sustain heavy, site-breaking traffic to the search, and still continue to serve the home page. Of course, until the Apache server becomes so inundated with requests that it can’t do anything. Then it’s time to get that load balancer in place…