MySQL Error 28

2011.11.02

Yesterday, I had to run a query for some statistics I needed. This was a query that I knew was going to be particularly nasty, as it required sorting 1.3M rows. Normally I run these sorts of queries on a reporting slave I keep around for this reason, but for some reason I chose to run this query on a production slave. When I ran my query, I got the following error:

ERROR 3 (HY000): Error writing file '/tmp/MYNcSyQ9' (Errcode: 28)

Oh. *&^%. After some Googling, a bit of shitting my pants, and a wild grep session through as many application logs as I could find, I was able to figure out that the problem seemed limited to this particular query. My Googling also turned up the fact that error code 28 indicates that the server is out of disk space.
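The number in "(Errcode: 28)" is an ordinary operating system error number, so you can decode it without Googling. MySQL ships a small perror utility for this, and the same lookup can be sketched in a few lines of Python. The function name describe_errcode is my own invention for illustration, and the exact message text depends on the platform.

```python
import errno
import os


def describe_errcode(code: int) -> str:
    """Return 'NAME: message' for an OS error number such as the 28 in (Errcode: 28)."""
    # errno.errorcode maps numbers to symbolic names on the current platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's human-readable message for the number.
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # On Linux, 28 is ENOSPC ("No space left on device").
    print(describe_errcode(28))
```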

As a rapidly growing company, we’ve had our fair share of issues with managing (or failing to manage) rapidly filling disks, failed RAID controllers, and the like. However, I had recently done audits of this particular cluster of servers, and ascertained that the situation with disks was nominal. I was confident the disk wasn’t full, and permissions were correct. Our particular disk layout puts /tmp on its own 2GB partition, and after running the query, that partition was 2% full.

It turns out that during the execution of the query, MySQL was creating a temporary table that grew to 2GB, which filled the partition, hence the error. Once the query failed, the temporary file was cleaned up, which is why the partition looked nearly empty when I checked afterwards. By default MySQL will write temporary tables to /tmp, which in many cases is its own small partition. The solution here was to set the tmpdir to a folder on the main partition adjacent to the MySQL datadir. This solution obviously has its own problems (i.e. you could fill your main partition, which is way worse than filling /tmp). However, for this type of ad hoc query, this was exactly what we needed.
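If I had checked free space against the size of the sort before running the query, I would have seen this coming. Here is a minimal sketch of that check in Python. The function names and the /var/lib/mysql-tmp path are hypothetical, and the 2GB figure is just the size from this incident, not something MySQL reports up front.

```python
import shutil


def has_room(path, needed_bytes):
    """True if the filesystem holding `path` has at least `needed_bytes` free."""
    try:
        return shutil.disk_usage(path).free >= needed_bytes
    except OSError:
        # Missing or unreadable directory: treat it as unusable.
        return False


def pick_tmpdir(candidates, needed_bytes):
    """Return the first candidate directory with enough free space, else None."""
    for path in candidates:
        if has_room(path, needed_bytes):
            return path
    return None


if __name__ == "__main__":
    two_gb = 2 * 1024 ** 3
    # Hypothetical candidates: the default /tmp, then a folder near the datadir.
    print(pick_tmpdir(["/tmp", "/var/lib/mysql-tmp"], two_gb))
```

Note that tmpdir is a startup option, so pointing MySQL at whichever directory wins means setting it in the server config and restarting the server.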

It happens to everyone…

2010.09.26

Through a combination of unhealthy fears, paranoid tendencies, and luck, I’ve been able to avoid that unavoidable situation that every sysadmin fears: completely nuking a system. Until last Tuesday, when I did something really, really dumb. On the server that hosts http://chr.ishenry.com, I had noticed a script, svcrack.py, running and consuming lots of resources (and bandwidth, as I would later find out from my hosting bill).

Since I sure as hell wasn’t running that, I could only assume that someone had exploited my server and was using it to look for unsecured VoIP installations. Initially, I assumed killing the scripts and changing some passwords would be sufficient. However, checking in on the server later, I found the same script running. All this is fair enough, as I am on WordPress, a few versions behind, and there are enough folders with unhealthy permissions that I kind of deserved it. So after a few days of trying to lock things down, I got a bit desperate.

Since svcrack is a Python script, there was a good chance the best way to discourage my assailant would be to remove Python. Great idea in theory, but it seemed my execution was a bit poor. It turns out running ‘yum remove python’ is a great way to destroy your entire system. yum runs on Python, which meant a reinstall would have to be done manually. Only problem: most of the basic shell commands stopped working as well. cp, mv, and ls all resulted in a ‘command not found’ error. The best part of this situation: no backups. After all the hubbub about blogs and backups lately, it’s kind of amazing I missed this rather important detail.

I’ve always considered data loss the cardinal sin in development, web or otherwise. However, I also never considered my personal site to be mission critical, or worthy of taking the time for backups. But as they say, you never know what you have till it’s gone. I was lucky enough that MySQL and Apache were still running, and I was able to export everything, spin up a new server, and import. Even with no data loss, this is certainly a lesson learned. I am making a backup right now.