Sphinx Full Text Search Engine

2010.01.28

For a very long time, I was convinced that a FULLTEXT index in MySQL was the best solution for all your searching needs. Then I realized that it was horribly slow, and that mixing it with complex joins completely destroyed any chance of using MySQL indexes in a way that would make sense or get decent results. The solution to fast and scalable free text search on any website is, of course, a full text search engine.

There are a few different ones out there. After a brief affair with Lucene, I settled on Sphinx. Sphinx is easy to install, even on 64-bit machines, and is architected in a way that makes a lot of sense for the web. The following steps were performed on a Red Hat machine. Don’t skip the mysql-dev install, even if you already have MySQL installed.

> yum install gcc-c++
> yum install mysql-dev*
> wget http://www.sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz
> tar xzvf sphinx-0.9.9.tar.gz
> cd sphinx-0.9.9
> mkdir /usr/local/sphinx
> ./configure --prefix /usr/local/sphinx --with-mysql
> make
> make install

Once installed, it’s fairly simple to start playing with the packaged example data and queries. The PHP APIs make integration easy, either to build a service or to use locally as a substitute for MySQL. In fact, as long as the index can be kept reasonably up to date, Sphinx is a better choice for complicated sorts than MySQL.

Google Short Links

2010.01.06

Having your own domain and server is a really fun thing. You can keep a blog, store files, post photos. All good ole down-home interwebz fun you can dream up. With the advent of micro-blogging, everyone now has an audience they need to communicate with in the briefest fashion.

Short links have become the norm for sharing, but with a huge price. Short links break the interconnectedness of the Internet. Search results that depend on the count of links become incorrect. Some services try to reconnect links between domains by using some DNS trickery. But the issue remains that there is a middleman that can’t help make the connection.

Unless, of course, that middleman is the search engine. Google’s recently announced URL shortener could solve many of the problems inherent in short links. By being the middleman, Google would have all the necessary information to put the pieces back together. It’s easy enough to set up, provided you have a Google Apps account. It does not currently have an API, but here’s hoping.

30 Seconds to Mars :: This is War

2009.12.21

30 Seconds to Mars recently released their third album, This is War, and it is quite the departure from their first album. Brand New Day was raw and angry, with amazing guitar sounds, great composition, and a real sense of urgency in the writing. The album was really exciting to listen to, and the live performances were great. I saw them at Avalon in NYC a few years ago, and it’s still one of my all-time favorite shows.

This is War is mild and boring in comparison. The effects-driven, distorted guitars characteristic of Brand New Day are missing. They seem to have been replaced with over-produced electronics. Jared Leto has less than half the intensity he had on Brand New Day, and the lyrics have lost their edge. The focus on deeply layered choruses of what sounds like children singing lacks impact. The collaboration with Kanye and his 808 didn’t really go anywhere, and just seemed to pull the band further from their roots. It’s sad to hear that a great band with such a unique sound has gone so far off track.


Categories : Music

Changing Criteria

2009.12.12

Occasionally, a project will come across my plate with the criteria, ‘Make sure this works everywhere, is completely template-able, and is something we can grow with.’ Normally this is coupled with ‘We need this to work with X *right now*, and Y and Z later.’ What I really hear is ‘Make it work for X, and ship the damn thing.’ After all, hitting those deadlines is really important.

Of course, this has a whole bunch of ugly assumptions tied to it. The first is: when I get to Y, everything I did for X will work. All I need to do is drop in a few config changes, and tweak a few parameters, and I’m done. (Yea, right) Secondly: that every case Y needs to cover is contained within X. (Not Likely) Third: All of this will be so well documented that any literate individual will be able to implement Y by osmosis.

So. Do we spend time now or later? Shipping X seems pretty simple, so why not just build X, satisfy the business dude, and call it a day? Spending time now means that deadlines may have to shift, and something that should be simple becomes complex. We have other, more important, projects to work on.

Eventually Y comes calling. So let me introduce you to…Future Web Dev Guy Person Girl! If you’re lucky, that person is you. If you’re not, it’s another dev. The assumptions we made back in paragraph 2 have reared their ugly heads. Since they were assumptions, you’re probably boned. If not, you’re probably one of these guys. If you’re the rest of us, Future Web Dev Guy Person Girl definitely hates your guts, because the groundwork that was supposed to be laid out is not there. They’re running through a lice-infested rat’s nest of procedural functions trying to pass the additional variable that will make this all work.

The best way to keep Future Web Dev Guy Person Girl from cursing like a sailor is to implement correctly, test thoroughly, and deal with Y before it’s due. Deadlines need to be managed according to project scope, and if project scope includes Y, it needs to be accounted for now, before you lose a friend in Future Web Dev Guy Person Girl.


Categories : Best Practices

The Technician, now on a Cloud Server

2009.11.29

I am pleased to announce that this site is now hosted on the Rackspace Cloud. It was a simple migration from MediaTemple, and has given me the level of control I want. I got to choose my OS (CentOS) and versions of PHP and MySQL, and set up Apache how I like it. I’m free of Plesk and the limitations therein.

The one thing I would really like to see from the Rackspace Cloud is DNS Support. My goal when migrating http://chr.ishenry.com was to move entirely off of MediaTemple. The one thing I really did like about hosting with them was that DNS was integrated directly into the service. With the Rackspace Cloud, there was no such convenience. However, a quick signup with DynDNS and a tweak to my domain registrar solved that.

Big thanks to Ryan Kearney’s video tutorial for the yum command that brought everything together.

Rackspace Downtime

2009.11.03

[Update] My team at Rackspace has sent me the fluffiest, most comfortable pillow I have ever had.

When a hosting provider goes down, there are lots of questions that get raised. Is my host reliable? Will they flake out during crucial times when my site needs the traffic? Will they double bill me?

Since I have been working with Rackspace, they have had less than stellar uptime, with issues mostly related to power. My company pays a lot for hosting with them, and downtime for a young company is deadly. But oddly enough, I’m still OK with Rackspace hosting my company’s myriad services. The benefits of hosting with them have been so great that a couple hours of downtime is nothing.

First off, their SLA has provisions for downtime, when it happens. If your server has a legitimate issue, you’re entitled to ask for a credit. To me, this is a promise that they’ll put their money where their mouth is. And if you call them on it, they’ll be reasonable.

Secondly, their support during crises is still amazing. During the truck incident, I was able to get a tech to run fsck on my disks, and hang out to watch, no questions asked. No, I am not on their intensive plan.

Third, their support culture is simply amazing. Their Linux techs are always willing to look deep into an issue to find a resolution, and they provide much of the basic infrastructure that is hard to come by for small companies. They’re also completely willing to educate their customers about the servers they maintain.

In short, Rackspace has been the target of a lot of criticism over issues in their datacenters. The fact of the matter is that there will always be issues and downtime. Their SLA guarantees the impossible, which they seem to realize, as any failure on their part comes with swift response. In the end their SLA seems to be more of a way of setting standards than anything else.

[Full disclosure: I haven't slept in 2 days because of their power issues]


Categories : Horror Stories

Fix or Manage?

2009.10.24

Sometimes bugs come along that require significant work to fix. Depending on what project timelines are like at the moment, sometimes fixing the bug isn’t the best option. For example, a race condition in the caching architecture causes pages to be stale. The persistent data store is correct, but the cache is not. To the person who just triggered the update, there’s a bug. The information on the public side is not in sync with the information they just entered.

So, like any other bug, a report will eventually percolate down to the dev team. People scream, fortunes are lost, the svn blame command is used, and the devs who wrote the code pee their pants. Once the chaos dies down, the actual prognosis of this issue can turn out to be extremely grim.

A shortcoming of the caching architecture shows that there’s a race condition when the system is under heavy load. In order to fix it, the dev team needs to plumb the depths of the data access layer, and probably change some parameters. But that’ll probably break everything. Everywhere. Or the layer manipulating the data could be fixed to replace the cache instead of invalidating. Except the methods to manipulate that entity live in 3 different codebases. It’ll probably break the editor. Either way, the actual solution doesn’t matter. The dev team certainly needs to do something, and it needs to be released three days ago.

The correct way to fix this issue will vary widely depending on circumstances. But in this particular case, the best answer was to not fix it, just manage it. Our team was busy, there were other projects that were more pressing. Plus the codebase was being rewritten. So instead of flogging a dead horse, a simple script was thrown together that compared the cache and the database. If they were out of sync, the cache would be cleared, and would be repopulated with the correct information the next time it was requested. Once it was implemented, the bug was still there, but the cache seemed to be up to date.
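A minimal sketch of that kind of script, in Python. The names here (`reconcile`, `load_from_db`) and the plain dictionary standing in for the cache are assumptions for illustration, not the code we actually ran. The idea is only to compare each cached value against the persistent store and drop whatever disagrees, so the next request repopulates it.

```python
def reconcile(cache, load_from_db, keys):
    """Clear cached entries that disagree with the database.

    cache        -- dict-like stand-in for the real cache (hypothetical)
    load_from_db -- callable returning the authoritative value for a key
    keys         -- iterable of keys to check
    Returns the list of keys that were cleared.
    """
    cleared = []
    for key in keys:
        if key not in cache:
            # Nothing cached, so nothing can be stale.
            continue
        if cache[key] != load_from_db(key):
            # Out of sync: drop the entry and let the next
            # request repopulate it from the database.
            del cache[key]
            cleared.append(key)
    return cleared
```

Run on a schedule, something like this narrows the window in which a stale page can survive, without touching the data access layer. It does not remove the race itself.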

Every dev team will face bugs that have enormous costs to fix. The way to deal with these bugs will be different every time they come up. It’s important to remember that managing bugs can be almost as effective as fixing them.

What’s in a Name?

2009.10.13

It’s easy to get caught up in semantics. Figuring out the best names for variables, tables, columns, classes, etc. is something that can eat up hours or even days of a development schedule. The idea is that the more precise the name, the better it is. The arguments for precision naming are many:

* Clear names help other developers read your code.
* New developers who come on will immediately understand what’s happening.
* Calling well-named methods of classes will read like a sentence, further increasing readability.
* Clear names will be able to help developers relate things in the UI to the code.

Keep in mind, I’m not talking about naming conventions. Naming conventions are simply rules for choosing the character sequences. They don’t dictate what words you should assign to things in your code.

Whatever names developers choose, they will get strewn throughout the layers of the application. Database, table and column names will be impacted. Variables in server-side scripts. Organization of classes into folders. JavaScript file names. Memcache keys. URLs. Just like sand at the beach, the labels the dev team decided on go everywhere you can think of. Invariably, the marketing team will bound down the hall, and announce the product is being rebranded. Jobs will become Gigs. Friends will become Followers. Application code will become confusing.

New devs won’t get it anyway.

The fact of the matter is that overthinking naming is a good way to get nowhere fast. Keeping it simple and taking just enough time to make sure that things make sense will give devs more time to focus on important stuff. Like being able to articulate the thought process behind code.


Categories : Development   Web Dev Teams

Blackberry OS will never take over the world

2009.09.26

A few days ago, I installed Opera Mini on my Blackberry Curve. The experience offered by Opera Mini is really impressive. It retrieves and renders pages quickly and flawlessly. The interface is specifically designed to handle navigating long pages on a tiny screen. In short, it’s a great app, and a major improvement on the Blackberry Browser.

Here’s the catch (or catches). Links that appear in other applications won’t open in Opera; they’ll open in the Blackberry browser. What’s worse is that Opera’s location bar does not have an option to paste links directly. In order to paste a link, you need to hit the symbol key, which brings up an edit screen pre-populated with ‘www.’ You need to erase the www, and then you can paste your link in.

That lengthy process completely kills any satisfaction you may get out of having a workable browser on a Blackberry. And none of it is Opera’s fault. On most other OSes, clicking URLs in any application will fire up the default browser and retrieve the URL. Hell, on iPhone OS, you can set up protocol handlers that will open up other apps. Hopefully RIM will provide a means for applications to talk to each other soon. Creating the seamless interaction between apps is probably even more important than pushing App World.


Categories : Mobile

Technorati needs to find a better way to do this.

2009.09.04

xq94dwsy2u