Distributed Updates

2011.06.25

Part of managing any large site involves writing scripts that will go through your data, make changes, merge things, remove things, do type transformations, etc. Most of the time, in PHP, iterating through rows or objects will do just fine. However, when there are lots of rows or objects, you could be faced with a script that takes hours or days to run. Depending on how active the site is, you may need to restrict access to ensure that the data before and after the transformation remains consistent. In other words, if someone makes a change to the old, pre-transformation data while the script is running, and the new feature only looks at the post-transformation data, that user has just lost their changes. That is Very Bad.

As sites get larger and problems like this loom, taking the site offline becomes less and less of an option. This is what the business team calls a luxury problem, and what the ops team refers to simply as a problem. One option is to write a more efficient script. You can get pretty far by simply ensuring you’re reading from the fastest data source available, making good use of cache, and ensuring that the tables being read for the transformation are properly indexed. All of these are great places to start. Additionally, making sure that data is grabbed in chunks can give the database time to breathe. There’s nothing worse than getting stuck in MySQL’s “sending data” phase simply because it needs to read several thousand rows from disk. MySQL configuration can also be your friend here. If using InnoDB, increasing the insert buffer is a great way to speed up writes.*
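A minimal sketch of grabbing data in chunks, written in Python against an in-memory SQLite database so it runs anywhere; the same query shape applies to MySQL from PHP. The `users` table, its columns, and the helper names are made up for illustration. It pages by primary key rather than by OFFSET, so each chunk is an indexed range read.

```python
import sqlite3


def iter_chunks(conn, chunk_size=1000):
    """Yield lists of (id, name) rows, at most chunk_size rows at a time."""
    last_id = 0
    while True:
        # Seek past the last id we saw; the primary key index makes this cheap.
        rows = conn.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


def make_demo_db(row_count):
    """Build a throwaway table (hypothetical schema) to run the sketch against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(i, f"user{i}") for i in range(1, row_count + 1)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db(2500)
    for chunk in iter_chunks(conn, chunk_size=1000):
        print(len(chunk), "rows, ids", chunk[0][0], "to", chunk[-1][0])
```

Between chunks the script can pause, commit, or check load, which is where the database gets its time to breathe.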

However, as much as you can do to speed up a single transaction, the fact remains that you have to execute each transformation serially, one after another. Your bottleneck is the transformation itself. It will take (time per transformation * # of objects to transform) to complete the job. No matter how well tuned the database is, it will only be performing one operation at a time, which means that the other (max connections – 1) connections are doing precisely crap. So the next logical step is to change your update script to distribute the update operations so a few can be run in parallel.
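To put rough numbers on that, here is a back-of-the-envelope estimate in Python. The figures (5 million objects, 20 ms per transformation, 8 workers) are invented for illustration, and the parallel estimate assumes perfect scaling, which a real database will not quite deliver.

```python
def serial_seconds(object_count, ms_per_transformation):
    """Total run time when every transformation runs one after another."""
    return object_count * ms_per_transformation / 1000


def parallel_seconds(object_count, ms_per_transformation, workers):
    """Best-case run time when the objects are split evenly across workers."""
    per_worker = -(-object_count // workers)  # integer ceiling division
    return per_worker * ms_per_transformation / 1000


if __name__ == "__main__":
    serial = serial_seconds(5_000_000, 20)
    parallel = parallel_seconds(5_000_000, 20, workers=8)
    print(f"serial:   {serial / 3600:.1f} hours")
    print(f"parallel: {parallel / 3600:.1f} hours")
```

Under those assumptions the serial run takes a bit over a day, and eight workers bring it down to a few hours.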

Rewriting the update script does require thinking about your update differently, and will not work in every case. For example, if you are simply moving a large amount of data from one table to another, and there is no transformation, or the transformation can be accomplished via a built-in MySQL function, use that. Just be prepared to deal with locking issues, and with the source data potentially not being available while the transformation is taking place. If your transformation is complicated and requires per-case logic, though, distributing it is definitely a good route to take. The biggest difference is how the code for the update is organized. The update script needs to be separated out into code that will apply the transformation for exactly one entity, and code that will manage which entities get transformed and when. Ideally, the code for the transformation is idempotent, so failures can be handled by simply resubmitting the entity / object to be transformed again.
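A rough sketch of that split, in Python with a plain dict standing in for the database. The record layout, the `schema_version` marker, and both function names are hypothetical; the point is only that `transform_one` knows about a single entity and is safe to run twice, while `run_update` decides what gets transformed and resubmits failures.

```python
def transform_one(record):
    """Transform exactly one record. Idempotent: already-migrated records pass through."""
    if record.get("schema_version") == 2:
        return record
    first, _, last = record["name"].partition(" ")
    return {
        "id": record["id"],
        "first_name": first,
        "last_name": last,
        "schema_version": 2,
    }


def run_update(store, entity_ids, max_attempts=3):
    """Manage which entities get transformed; return the ids that never succeeded."""
    failed = []
    for entity_id in entity_ids:
        for _attempt in range(max_attempts):
            try:
                store[entity_id] = transform_one(store[entity_id])
                break  # success, move on to the next entity
            except Exception:
                continue  # idempotent, so simply try the same entity again
        else:
            failed.append(entity_id)  # every attempt raised
    return failed


if __name__ == "__main__":
    store = {
        1: {"id": 1, "name": "Ada Lovelace"},
        2: {"id": 2},  # malformed on purpose, has no name to split
    }
    print("failed ids:", run_update(store, [1, 2]))
    print(store[1])
```

Because the transformation checks whether its work is already done, rerunning the whole update after a crash only touches the entities that still need it.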

Accomplishing parallel processing in PHP can be kind of tricky. PHP’s pcntl_exec function has always felt a bit finicky to me. Of course exec on its own is blocking, so that’s out. Additionally, neither of these solutions offers any sort of baked-in communication between the process that submitted the job and the process carrying out the job. That leaves us with a queuing system. Popular systems include RabbitMQ and Gearman. Personally, I’ve made great use of Gearman. It’s easy to install, as is the PHP module.
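To show the shape of the queue approach without requiring a job server, here is a small Python sketch that uses in-process threads and the standard library's `queue` module as a stand-in. This is not the Gearman API: with Gearman the workers would be separate processes, possibly on other machines, and the job server would sit where the two queues are. The function names and the job format are assumptions for illustration.

```python
import queue
import threading


def worker(jobs, results, transform):
    """Pull entity ids off the job queue until a None sentinel arrives."""
    while True:
        entity_id = jobs.get()
        if entity_id is None:
            jobs.task_done()
            return
        try:
            results.put((entity_id, "ok", transform(entity_id)))
        except Exception as exc:
            # Report the failure back to the submitter instead of dying quietly.
            results.put((entity_id, "failed", repr(exc)))
        finally:
            jobs.task_done()


def run_distributed(entity_ids, transform, num_workers=4):
    """Submit one job per entity and collect a status for each of them."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, transform))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for entity_id in entity_ids:
        jobs.put(entity_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so each one shuts down
    for thread in threads:
        thread.join()
    outcome = {}
    while not results.empty():
        entity_id, status, detail = results.get()
        outcome[entity_id] = (status, detail)
    return outcome


if __name__ == "__main__":
    def demo_transform(entity_id):
        if entity_id == 3:
            raise ValueError("boom")
        return entity_id * entity_id

    for entity_id, result in sorted(run_distributed(range(1, 6), demo_transform).items()):
        print(entity_id, result)
```

The results queue is the piece that plain exec lacks: the submitter learns which entities failed and can resubmit just those.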

To sum up, performing large data updates via a distributed system is the way to go if you have complex requirements per transformation, and the option to perform these processes in parallel.

*If using MySQL’s MyISAM engine, this isn’t necessarily true, as writes will block, and the database could become the bottleneck. However, since MySQL is continuing to push InnoDB, this is getting increasingly unlikely. So if your tables are all InnoDB, you’re probably in good shape.

Categories : Best Practices  Ops  Process

Emergencies will audit the shit out of you

2010.10.22

Things never go wrong at convenient times, like when you’re auditing the latest, coolest version of your app and looking for bugs. Things have a funny way of working out fine then. However, as soon as you look the other way, a multitude of problems come out of the woodwork. It usually goes something like this:

One server goes down, and the system that was supposed to fail silently starts screaming. The application it was supporting goes down, because the proper timeouts and error handling were never written. You can’t fail over, because failing over will take down 2 other applications. When that first server comes back up, nothing works, because the proper startup scripts were never put in place. Once the right services start, if you can remember what the hell they were, you find the original application is configured wrong. Not only is it configured wrong, it’s always been configured wrong, and no one noticed. No one noticed because it only explodes in the exact set of horrible circumstances you have right now. Which is, by the way, being down.

It’s an all-too-familiar story, and one that even the most anal of admins has dealt with. The fact of the matter is that it is going to happen, and there’s not a whole lot you can do to prepare, other than randomly pulling plugs out of servers. But any mistake that causes downtime should only happen once. A proper postmortem needs to be conducted to figure out what went wrong where. Once all the variables are understood, the next step is to duplicate the same set of circumstances in your sandbox, and apply the necessary error handling.

Downtime and emergencies are a part of running any site. What’s really important is to treat emergencies as an opportunity to learn about what happens when systems fail, for real.

Accountability is a Feedback Loop

2010.06.22

Accountability is a word that’s getting tossed around a lot lately. You hear people saying things like:

– That developer should be held accountable for the validation problems.

– The tester should be accountable for not finding that bug.

– BP needs to be accountable for destroying an ecosystem.

The term seems to be thrown around most often when parts of a system fail. BP is part of a larger industry that’s regulated. The government agency that monitors safety measures is responsible for ensuring that companies like BP follow safety regulations. So when BP made their whoopsie daisy, the fingers were pointed squarely at them. However, where were the regulators? There were tons of opportunities for the government to push feedback to BP regarding the safety of their operation. But it seemed like no one was talking.

The development process is strikingly similar. Any development team worth their bits has a process that puts any issue in front of at least two parties at all times. Joel Spolsky’s infamous Bug 1203, a quick story about the interactions between a dev and a tester, is the picture of accountability, and shows that without active management and constant feedback being exchanged, things don’t get done.

A quick synopsis and commentary: Jill the tester finds a bug, and provides feedback to the dev team via the ticket system. In doing so, Jill has started the feedback loop, and made it the responsibility of the dev team to investigate the issue. The dev team, as they are prone to doing, denies responsibility for the issue and marks it as ‘NOT A BUG’. Having done so, they’ve put the onus on Jill to prove it’s really a bug, which she does (probably in about 2 seconds). It’s again the responsibility of the dev team to fix the bug, which they do. Jill confirms the fix, and thereby closes the loop.

What’s important to realize is that in this type of process, it is the responsibility of anyone and everyone involved to be accountable for their role, and be focused on pushing feedback to the next person. Once there’s a break in the loop, the issue is likely to be dropped, and never fixed. The last person holding the ball is the screwup. I’m sure someone somewhere is really upset they didn’t ask BP about that little safety measure.

Categories : Best Practices  Process