Recently one of our clients reported disk space issues on the volume holding the binary logs on the Master and the relay logs on the slave. Our alerting caught it too, but before we could investigate, the slaves choked up (they could not add more relay logs due to lack of space). We were taken by surprise: what made the Master generate so many binlogs at such a rapid rate when the client hadn't changed a single line of code? (Initially we suspected the client's scripts had gone wrong.)
With the slaves out of hand and MySQL on the Master unresponsive because the binary log volume was full, we had to purge a few binlogs manually before we could continue the investigation. After observing the processlist for some time, we noticed something unusual on one of the tables (a pretty large one, 195G). The table is used by internal scripts and is not web facing, which gave us some room to breathe. The unusual thing was a series of UPDATEs on the same "ID" value, which, according to the client, should never happen.
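For the record, the manual purge is a one-liner; just make sure no slave still needs the files you remove. The file name and date below are illustrative, not the actual ones from the incident:

    SHOW SLAVE STATUS\G                              -- note Relay_Master_Log_File before purging
    PURGE BINARY LOGS TO 'mysql-bin.000123';         -- everything older than this file goes
    PURGE BINARY LOGS BEFORE '2014-01-15 00:00:00';  -- or purge by cutoff timestamp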
Next we analyzed the table structure (using "SHOW CREATE TABLE ...") and noticed that the AUTO_INCREMENT value was not incrementing. The last_insert_id() looked suspicious too: a very well-known number, something we had learnt back in our college days. BANG!!!
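The checks themselves were trivial (schema and table names anonymized here):

    SHOW CREATE TABLE appdb.big_table\G  -- the AUTO_INCREMENT=... clause in the output had stopped moving
    SELECT LAST_INSERT_ID();             -- 2147483647 on the writing connection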
We had hit the maximum value of the INT datatype, 2^31 - 1, since the column is SIGNED by default. 2147483647 was the magic number.
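For the curious, here is a throwaway demo of that ceiling (not from the incident itself). With strict mode off, MySQL clamps out-of-range values to the limit instead of rejecting them:

    CREATE TABLE demo_int (c INT);   -- INT is SIGNED unless declared UNSIGNED
    SET SESSION sql_mode = '';       -- non-strict: out-of-range values are clamped with a warning
    INSERT INTO demo_int VALUES (2147483648);
    SELECT c FROM demo_int;          -- 2147483647, i.e. 2^31 - 1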
We were depressed, as ALTERing such a huge table was going to be a nightmare. With some assistance from the client we were able to drop the historical partition data to shrink the table. We still had to deal with the ALTER and the removal of what now looked like garbage data with ID 2147483647.
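The cleanup boiled down to statements along these lines (table and partition names are made up, and the delete of the duplicate rows was of course done in batches):

    ALTER TABLE big_table DROP PARTITION p2011, p2012;        -- reclaims space immediately, unlike DELETE
    DELETE FROM big_table WHERE id = 2147483647 LIMIT 10000;  -- repeated until no rows remain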
The ALTER was done, but we were still unsure about the starting point of the auto_increment counter on the table. After checking the table structure again, we confirmed that IDs were now being generated as required.
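The ALTER itself was essentially a widening of the column; something along these lines (definition anonymized), after which the counter continues past the old ceiling:

    ALTER TABLE big_table
      MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;

    -- Optional: pin the counter explicitly rather than trusting the recalculation
    ALTER TABLE big_table AUTO_INCREMENT = 2147483648;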
Moving on to the investigation of the root cause, i.e. what caused such a huge pile-up of binary logs, I noticed that the auto_increment behavior was anything but subtle. Once the column reached the maximum of the INT type, it kept generating the same number over and over: in our client's case, for almost 1.7 million records. Then, as mentioned above, a series of UPDATEs ran against that same ID (with a single condition in the WHERE clause, "... ID=magic number"), so every run actually changed 1.7 million rows. The binlog format was set to "mixed", and under it these updates were logged in row-based format, one event per row. This started a chain reaction: the binary logs eventually filled the volume on the Master, the slave was overwhelmed and filled its 100G relay log volume too, and the updates started stalling (unable to write to the binlogs) and finally timing out.
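If you want to see such a chain for yourself, these are the generic knobs and a way to decode the logs (not our exact session):

    SHOW VARIABLES LIKE 'binlog_format';  -- 'MIXED' in our case
    SHOW BINARY LOGS;                     -- file sizes balloon with every such UPDATE
    -- From the shell, decode a suspect file and note one row event per updated row:
    --   mysqlbinlog -v --base64-output=decode-rows mysql-bin.000123 | less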
Now everything is back on track, and we decided to rebuild the slaves from scratch.
In hindsight, I realized how a unique constraint on the AUTO_INCREMENT column would have saved our day: a duplicate key error would have been enough to tell us that the INT had maxed out.
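This is easy to verify with a throwaway table: with a PRIMARY KEY (or any UNIQUE key) on the column, the first insert past the ceiling fails loudly instead of silently piling up rows:

    CREATE TABLE demo_ai (
      id INT NOT NULL AUTO_INCREMENT,
      PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    INSERT INTO demo_ai VALUES (2147483646);
    INSERT INTO demo_ai VALUES (NULL);  -- takes 2147483647, the INT ceiling
    INSERT INTO demo_ai VALUES (NULL);  -- ERROR 1062: Duplicate entry '2147483647' for key 'PRIMARY'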