Distributed Updates

2011.06.25

Part of managing any large site involves writing scripts that will go through oyur data, make changes, merge things, remove things, do type transformations, etc. Most of the time, in PHP, iterating through rows or objects will do just fine. However, when there are lots of rows or objects, you could be faced with a script that takes hours or days to run. Depending on how often active the is, you may need to restrict access to ensure that the data before and after the transformation remains consistent. In other words, if someone tries to make a change to the data before the transformation, and the new feature only looks at data after the transformation, that user has just lost their changes. That is Very Bad.

As sites get larger and problems like this loom, taking the site offline becomes less and less of an option. This is what the business team calls a luxury problem, and what the ops team refers to simply as a problem. One option is to write a more efficient script. You can get pretty far by simply ensuring you’re reading from the fastest data source available, make good use of cache, etc. ensure that the tables being read for the transformation are properly indexed. All of these are great places to start. Additionally, making sure that data is grabbed in chunks can give the database time to breathe. There’s nothing worse than getting stuck in MySQL’s “sending data” phase simply because it needs to read several thousand rows from disk. MySQL configuration can also be your friend here. If using InnoDB, increasing the insert buffer is a great way to speed up writes.*

However, as much as you can do to speed up a single transaction, the fact remains that you have to execute each transformation serially, one after another. Your bottleneck is the transformation itself. It will take (# of transformations * # of objects to transform) to complete the job. No matter how well tuned the database is, it will only be performing one operation at a time, which means that the other (max connections – 1) connections are doing precisely crap. So the next logical step is to change your update script to distribute the update operations so a few can be run in parrallel.

Rewriting the update script does require thinking about your update differently, and will not work in every case. For example, if one is simply moving a large amount of data from one table to another, and there is no transformation, or the transformation can be accomplished via a builtin MySQL function, use that. However, just be prepared to deal with locking issues, and the source data potentially not being available while the transformation is taking place. However, if your transformation is complicated, and requires per-case logic, this is definitely a good route to take. The biggest difference is how the code for the update is organized. The update script needs to be separated out into code that will apply the transformation for exactly one entity, and code that will manage which entities get transformed and when. Ideally, the code for the transformation is idempotent, so failures can be handled by simply resubmitting the entity / object to be transformed again.

Accomplishing parallel processing in PHP can be kind of tricky. Php’s pcntl_exec function has always felt a bit finicky to me. Of course exec on its own it blocking, so that’s out. Additionally, neither of these solutions offer any sort of baked-in communication between the process that submitted the job, and the process carrying out the job. That leaves us with a queuing system. Popular systems include: RabbitMQ and Gearman. Personally, I’ve made great use of Gearman. It’s easy to install, as is the PHP module.

To sum up, performing large data updates via a distributed system is the way to go if you have complex requirements per transformation, and the option to perform these processes in parallel.

*If using MySQL’s MyISAM engine, this isn’t necessarily true, as writes will block, and the database could become the bottleneck. However, since MySQL is continuing to push InnnDB, this is getting increasingly unlikely. So if your tables are all InnoDB, you’re probably in good shape.

Categories : Best Practices  Ops  Process

Development without Internet Access

2011.06.13

While flying to Austin for sxsw, I had a small programming task. Take a string of a few search terms, break it apart and highlight those terms in another string. It’s a straightforward task, and probably a wheel that’s been reinvented thousands of time in the history of computer science. I approached it as an exercise, to see if I could add another squeaky wheel to the pile. My goal was to do it without using any 3rd party code or any resources. I had no access to documentation, google, stack overflow, or any of the other resources I use constantly to get my job done every day.

The code that I produced was bloated, naive, and horribly inefficient (I suspect). While writing it, i knew I wasn’t really on the right path. When I got back to New York, I took a look at it, and more or less decided I had wasted my time. Then I realized I had written it on a plane, and had nothing better to do. I simply got myself into the zone, and wanted to work through a problem until it was solved. After I got over my initial disgust, I wondered what aside from boredom and stubbornness had prompted me to complete the task.

I never really came to any conclusions until a few days later. I was going about my day normally, fixing bugs, writing emails, troubleshooting. As I hit a hard spot, something I couldn’t figure out, I gave up staring at the code, and turned to Google. Then I came across a builtin php function that was giving me a strange result. After puzzling for a few seconds, I dropped the function into Google. A little while later, I was examining the results of an EXPLAIN statement in MySQL, and the output was something I hadn’t seen before. I found the answer on StackOverflow a few minutes later.

Then it dawned on me. Maybe I don’t actually have the skills to be a web developer, and I’ve faked it all these years. Maybe I don’t know all that much about MySQL, and perhaps I only know enough about Linux to cause problems for Rackspace. Whether or not that’s true, I did realize that I’m pretty good at finding solutions to problems from the collective experiences, wisdom, and flames of the Internet. Maybe it’s not entirely fair to say that I faked my way through several years of a career. After all the code that I’ve put together over the years to answer various questions, or sift through or collect data serves a purpose, performs relatively well, and is serving people everyday. Also, that disgusting snippet of string highlighting code works pretty well, despite that fact that I hate its face and want it to die.

After I got myself out of my existential development funk, more questions came to mind. First, how the F did anyone get any answers to tough questions before the Internet? Secondly, how did programmers back in the day find any sort of direction? Books on technology and programming are great, don’t get me wrong, but you can’t get answers to complicated questions. After having these thoughts crop up, I spent a little bit of time looking over other devs’ shoulders at the office. What I saw was very reassuring, as the Google machine was often hard at work for the rest of the team. The php site, StackOverflow, and QuirksMode were in browsers constantly.

Which begs yet another question: what exactly does it take to be a web programmer? Based on my experience, it seems to boil down to an Internet connection, Google, tenacity to the point of stupidity, and decent search skills. To back up even further, is it possible to take on a job you know nothing about, and learn how do it via the Internet?