Monday, 11 February 2013

Unique constraint and AUTO_INCREMENT. Need of the day!!!



Recently one of our clients reported disk space issues on the volumes holding the binary logs (on the Master) and the relay logs (on the slave). Our alerting caught it too, but before we could investigate, the slaves choked up (they could not write more relay logs due to the lack of space). We were taken by surprise as to what made the Master generate so many binlogs at such a rapid rate when the client hadn't changed a single line of code (initially we suspected the client's scripts had gone wrong).

So now, with the slaves out of hand and MySQL on the master unresponsive because the binary log volume was full, we had to purge a few binlogs manually to continue the investigation. After watching the processlist for some time, we noticed something unusual with one of the tables (a pretty large one, 195G). The table is used by internal scripts and is not web facing, which gave us some room to breathe. The unusual thing was a series of updates on the same "ID" value, which, according to the client, should not happen.
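For reference, this is roughly how the manual purge was done; the binlog file name below is illustrative, and you should confirm that no slave still needs the files being purged before running it:

mysql> SHOW BINARY LOGS;                          -- see which files are eating the volume
mysql> PURGE BINARY LOGS TO 'mysql-bin.000950';   -- removes every binlog file before this one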

Next we analysed the table structure (using "SHOW CREATE TABLE ...") and noticed that the AUTO_INCREMENT value was not incrementing. Also, last_insert_id() returned a suspicious, very well known number, something we had learnt back in our college days. BANG!!!

We had hit the maximum value of the INT datatype, which is SIGNED by default, i.e. 2^31 - 1. 2147483647 was the magic number.
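Here is a minimal sketch of the behaviour we ran into (table and column names are made up, and this reflects the MySQL version we were running). With only a non-unique index on the AUTO_INCREMENT column, MySQL keeps handing out the maximum value once the counter is exhausted; with a PRIMARY KEY or UNIQUE index on that column, the insert at the exhausted counter fails with a duplicate key error instead:

mysql> CREATE TABLE ai_demo (
    ->   id INT NOT NULL AUTO_INCREMENT,
    ->   payload VARCHAR(10),
    ->   KEY (id)                      -- non-unique index: no duplicate-key protection
    -> ) ENGINE=InnoDB AUTO_INCREMENT=2147483646;

mysql> INSERT INTO ai_demo (payload) VALUES ('a');   -- id = 2147483646
mysql> INSERT INTO ai_demo (payload) VALUES ('b');   -- id = 2147483647 (the max)
mysql> INSERT INTO ai_demo (payload) VALUES ('c');   -- id = 2147483647 again, silently

-- With PRIMARY KEY (id) instead of KEY (id), the last insert would fail with
-- a duplicate-key error (ERROR 1062) and make the problem visible immediately.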

We were depressed, as ALTERing such a huge table was going to be a nightmare. With some assistance from the client we were able to drop the historical partition data and shrink the table. We still had to deal with the ALTER and with removing what now looked like garbage data with ID 2147483647.

The ALTER was done, but we were still unsure about the starting point of the AUTO_INCREMENT attribute on the table. After checking the table structure again, we saw that the IDs were now being generated as expected.

Moving on to investigating what had caused such a huge pile-up of binary logs, I noticed that the AUTO_INCREMENT behaviour was anything but subtle. After reaching the maximum value of the INT type, the column kept generating the same number; in our client's case, for almost 1.7 million records. Then came the series of updates mentioned above, each with a single condition in the WHERE clause ("... ID=magic number"), so every run actually changed 1.7 million rows. The binlog format was set to "mixed", and these updates ended up being logged in row-based format. This started a chain reaction that generated binary logs until the volume filled up; the slave was overwhelmed and filled its 100G relay log volume too. The updates then started stalling (unable to write to the binary logs) and finally timed out.

Now everything is back on track, and we decided to rebuild the slaves from scratch.

Looking back, I realized how a unique constraint on the AUTO_INCREMENT column would have saved our day. A duplicate key error would have been enough to let us know that the INT had maxed out.

Tuesday, 5 February 2013

Restoring single or some tables from Xtrabackup!!! (Replication Gotchas)



There is a wonderful blog post from one of the Perconians, Miguel Angel Nieto, about restoring a single table from an Xtrabackup backup. Since this is a very useful feature for InnoDB, where a direct MyISAM-style file copy is not possible, I thought of giving it a try; plus one of our clients asked for it (it's not a good idea to restore 1TB of data just for a table of a few MB).

Moving forward, I won't list the steps already covered by Miguel, but I would like to add one thing to the process: after you import the backup .ibd file into the data directory, the restored row data won't be visible until you restart MySQL. For example, here's what I did:

1) Deleted some records from a table.

mysql> select count(*) from titles_back;
+----------+
| count(*) |
+----------+
|   443308 |
+----------+
1 row in set (0.99 sec)

mysql> delete from titles_back limit 20;
Query OK, 20 rows affected (0.15 sec)

mysql> select count(*) from titles_back;
+----------+
| count(*) |
+----------+
|   443288 |
+----------+
1 row in set (0.08 sec)

2) Prepared the backup using the --apply-log and --export options (I used the innobackupex script; --prepare can be used instead if you are using the xtrabackup binary).
3) Copied the tablespace file from the backup to the database schema directory and made sure the file permissions were correct.
4) Imported the tablespace back into the table.

mysql> alter table titles_back import tablespace;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from titles_back;
+----------+
| count(*) |
+----------+
|   443288 |
+----------+
1 row in set (0.08 sec)

As you can see above, the tablespace was imported without any problems, yet the deleted rows are still not visible, although they exist in the files. It looks like InnoDB is still serving the page for the deleted records from the buffer pool (the page IDs are the same), but I am not sure about that.

5) Next, the addition I mentioned above: restart the MySQL instance.
6) After the restart we have the missing rows back:
mysql> select count(*) from titles_back;
+----------+
| count(*) |
+----------+
|   443308 |
+----------+
1 row in set (1.03 sec)

The restart can actually be scheduled for a later time if it is not possible right away. InnoDB will allow data manipulation on all rows except the previously changed ones (the rows we needed to restore from the backup). Please look at the example below:

mysql> select * from titles_back limit 5;
+--------+-----------------+------------+------------+
| emp_no | title           | from_date  | to_date    |
+--------+-----------------+------------+------------+
|  10001 | Senior Engineer | 1986-06-26 | 9999-01-01 |
|  10002 | Staff           | 1996-08-03 | 9999-01-01 |
|  10003 | Senior Engineer | 1995-12-03 | 9999-01-01 |
|  10004 | Engineer        | 1986-12-01 | 1995-12-01 |
|  10004 | Senior Engineer | 1995-12-01 | 9999-01-01 |
+--------+-----------------+------------+------------+
5 rows in set (0.02 sec)

mysql> delete from titles_back where emp_no=10001;
Query OK, 1 row affected (0.21 sec)

mysql> select * from titles_back limit 5;
+--------+-----------------+------------+------------+
| emp_no | title           | from_date  | to_date    |
+--------+-----------------+------------+------------+
|  10002 | Staff           | 1996-08-03 | 9999-01-01 |
|  10003 | Senior Engineer | 1995-12-03 | 9999-01-01 |
|  10004 | Engineer        | 1986-12-01 | 1995-12-01 |
|  10004 | Senior Engineer | 1995-12-01 | 9999-01-01 |
|  10005 | Senior Staff    | 1996-09-12 | 9999-01-01 |
+--------+-----------------+------------+------------+
5 rows in set (0.00 sec)

mysql> alter table titles_back discard tablespace;
Query OK, 0 rows affected (0.06 sec)

/* Copied the files from the backup to the database directory. */

mysql> alter table titles_back import tablespace;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from titles_back where emp_no=10001;
Empty set (0.00 sec)

/* MySQL restart in between */

mysql> select * from titles_back where emp_no=10001;
+--------+-----------------+------------+------------+
| emp_no | title           | from_date  | to_date    |
+--------+-----------------+------------+------------+
|  10001 | Senior Engineer | 1986-06-26 | 9999-01-01 |
+--------+-----------------+------------+------------+
1 row in set (4.47 sec)

I hope this was clear. 

Now let's observe what happens when you carry out this partial restore on a Master-Slave replication setup:

I followed all the steps mentioned above, only to notice that the slave had stopped with an error:
                   Last_Errno: 1030
                   Last_Error: Error 'Got error -1 from storage engine' on query. Default database: 'employees'. Query: 'alter table titles_back import tablespace'
  
This is pretty obvious: the replication slave was able to execute the "discard tablespace" statement (which in fact deletes the table's .ibd file), but it could not execute the import, simply because the .ibd file does not exist there. Remember, we copied the backup file onto the master only, and replication does not transfer the .ibd file the way it handles "LOAD DATA INFILE ...". So all you need to do is restore the same file onto the slave and restart mysqld on the slave. Simple.
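In other words, something along these lines on the slave (a sketch of the fix described above; the file copy itself happens outside MySQL):

/* Copy the same .ibd file from the backup into the slave's schema directory,
   fix ownership/permissions, restart mysqld on the slave, and then: */

mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G     -- confirm Slave_SQL_Running: Yes and Last_Errno: 0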

This becomes important when you are discarding and importing a tablespace: people might think of turning off sql_log_bin for the session. That won't cause any errors or break the slave, but it will surely leave the slave inconsistent (the deleted records would be restored on the master but still missing on the slave).
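For illustration, this is the tempting but dangerous variant on the master, assuming the same titles_back table from above:

mysql> SET SESSION sql_log_bin = 0;                 -- nothing below reaches the binlog
mysql> ALTER TABLE titles_back DISCARD TABLESPACE;
/* copy the .ibd from the backup into place */
mysql> ALTER TABLE titles_back IMPORT TABLESPACE;
mysql> SET SESSION sql_log_bin = 1;

-- The slave never sees these statements, so replication keeps running,
-- but the rows restored on the master remain missing on the slave.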

Hope this was helpful. Xtrabackup is a great utility for backing up databases, and it can save a lot of time, especially during a restore.

Disclaimer: this was done strictly on Percona Server. To do the same on Oracle's MySQL server, some server configuration variables need to be set; refer to Miguel's blog post for that.

Sunday, 3 February 2013

How MySQL Replication works, a somewhat detailed pipeline approach



For a long time I had been meaning to write this post, but between some unanswered questions in my mind and the literal tons of blogs already available on this topic, I decided to look at how replication works at the code level and try to write something different. The MySQL code base is obviously overwhelming and not strictly suggested unless you truly know what to look for. BANG!!! I hit this road-block: "What to look for, and where?". With some help from Valeriy Kravchuk, one of the Principal Support Engineers at Percona, I got a starting point.

So to start, below are the files to look at for the replication code:
/sql/slave.cc
/sql/sql_repl.cc

Anyway, I'll write another post with a thorough source-code walkthrough, since this post would get long and we would drift from the topic.

This is what most books, the manual and other blog posts have to say about MySQL replication: "In a simple MySQL replication setup, an event (most often a DML or DDL statement) is written to the binary logs stored on the originating server, called the Master. Another server running MySQL fetches those events (generally SQL statements, sometimes the exact ones) from the Master's binary logs and stores them locally in a file called the relay log. A specialized thread then runs on this other server, usually called the Slave, and executes the SQL statements stored as events in the relay log, much as you would execute a SQL script file."
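For reference, this is the minimal wiring on the slave side that sets all of this in motion (host, user, password, binlog file and position below are illustrative values):

mysql> CHANGE MASTER TO
    ->   MASTER_HOST='master.example.com',
    ->   MASTER_USER='repl',
    ->   MASTER_PASSWORD='secret',
    ->   MASTER_LOG_FILE='mysql-bin.000042',
    ->   MASTER_LOG_POS=4;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G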




This being a very simple definition, it doesn't answer some of the annoying questions that could come up in a novice's mind, like:
1) How does the Slave get those events (statements) into its relay log? (Trust me, in my earlier days I used to think a file copy happens; yes, the manual makes it look something like that. Duh!!)
2) Why do we create a user on the Master and grant it replication privileges?
3) How does the Slave know when to get the data? MySQL replication being asynchronous, this seemed confusing.
4) Finally, who actually transfers the data, the Slave or the Master? (This really made me investigate the whole process.)

I'll try to answer these questions as per my understanding of the code and with some help from the debug process. (Disclaimer: I might not be accurate about the whole process, as this still needs to be worked on.)

Answer 1) Let's take the example of a simple INSERT statement on a MyISAM table. When the insert is executed on the master, the work forks into two parts: the first executes the statement (in short, changes the required pages at the storage engine level, blah blah blah), and the second adds the statement to a file cache (a memory area opened up for the binary logs, called the binlog cache). Once the first part is done, the statement waits until it has been written to the binlog cache, and only then does it complete. Next, the Binlog_Dump thread sends out a broadcast message that says, roughly, "Hey, all you slaves, I have a new statement in my cache, is anyone interested in copying it?". This broadcast is picked up by the Slave_IO thread, which answers "I am connected and I want that statement added to my own relay logs"; the Slave_IO thread reads from that same binlog cache. The copy is done at the packet level (every event is checked against the slave_max_allowed_packet variable) and is then written to the relay log, with the next binlog coordinates updated in the master info. There are several status messages for the replication threads; in our case, after the events are copied to the relay logs, they would be "Waiting for master to send event" on the Slave and "Master has sent all binlog to slave; waiting for binlog to be updated" on the Master.
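If you want to watch these states on a live setup, this is a quick (illustrative) way to do it:

-- On the master: the Binlog Dump thread shows up in the processlist.
mysql> SHOW PROCESSLIST;
-- Command: Binlog Dump
-- State:   Master has sent all binlog to slave; waiting for binlog to be updated

-- On the slave: the I/O thread state is reported here.
mysql> SHOW SLAVE STATUS\G
-- Slave_IO_State: Waiting for master to send event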

Answer 2) The replication user account is the bread and butter of the Slave_IO thread. Logically, what happens on the Master side is that the replication user logs in to mysqld on the Master (using the username, password and the special grant) and retrieves the events. Imagine running "SHOW BINLOG EVENTS IN ... FROM ..."; what you get back are the events actually contained in the Master's binary logs. The replication user relies on the same concept, although not the same code (since it has the extra work of reading from a different cache, writing to the relay logs and updating the master info). When there are no more events to copy, the Slave_IO thread waits for another broadcast from the Binlog_Dump thread.
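A typical account setup on the master looks like this (user name, host pattern, password and binlog file name are illustrative); the SHOW BINLOG EVENTS output is essentially what the Slave_IO thread fetches over that connection:

mysql> CREATE USER 'repl'@'10.0.0.%' IDENTIFIED BY 'secret';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.%';

mysql> SHOW BINLOG EVENTS IN 'mysql-bin.000042' FROM 4 LIMIT 10;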

Answer 3) So what if your replication was paused for a month and then you start it again? How does the slave know where to start? It simply looks into the master info. Does that mean the Binlog_Dump thread runs forever? Absolutely not; it disconnects just like every other thread. When we start replication, the Slave_IO thread first registers itself with the master, and if all goes well, the master spawns a Binlog_Dump thread for that slave. The Binlog_Dump thread does not store information like the position at which the Slave_IO thread stopped a month ago; it just checks the current binlog position and tells the Slave_IO thread "Hey, you have to copy all the events up to this position; I will let you know if any new events come in." So master.info, relay-log.index and the binlog index file play an important role in this situation.
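The resume coordinates the slave uses are the ones visible in SHOW SLAVE STATUS and persisted in master.info and relay-log.info (the file names and positions below are illustrative):

mysql> SHOW SLAVE STATUS\G
-- Master_Log_File:        mysql-bin.000101   <- last master binlog file read by Slave_IO
-- Read_Master_Log_Pos:    120
-- Relay_Master_Log_File:  mysql-bin.000100   <- last master binlog file executed by Slave_SQL
-- Exec_Master_Log_Pos:    4096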

Answer 4) Logically, the answer is damn simple: the slave pulls the latest binary log data. The code says that the Binlog_Dump thread keeps sending broadcast messages, and that is all it does. The Slave_IO thread is the one that actually copies those events and writes them to its relay logs.

After all of this completes, the Slave_SQL thread gets the signal to execute the events newly added to the relay logs, and once it has executed all of them, it shows the status "Slave has read all relay log; waiting for the slave I/O thread to update it".

The workings of the Slave_SQL thread would fill up a blog post of its own, and frankly there are several of those already available on the internet.

So MySQL replication is basically composed of three threads (you can spot them in the processlist, as shown after this list):
1) Binlog_Dump: runs on the Master and signals the corresponding slave about new events in the binary logs. I say corresponding because every Slave_IO thread gets its own Binlog_Dump thread on the Master (it shows up in the Master's processlist with the Command "Binlog Dump").
2) Slave_IO: copies the binary log events from the master and writes them to the relay log on the Slave; and no, this is strictly not a file copy :P
3) Slave_SQL: executes the events (usually SQL statements) contained in the relay logs and purges the relay logs once all of their events have been executed. This is one of the bottlenecks of MySQL replication, since all the statements are executed serially.
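On the slave, the two local threads appear in the processlist as "system user", with the states quoted above (a rough illustration):

mysql> SHOW PROCESSLIST;
-- system user | Waiting for master to send event                                  (Slave_IO)
-- system user | Slave has read all relay log; waiting for the slave I/O thread
--               to update it                                                       (Slave_SQL)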

Hope this new way of understanding MySQL replication was helpful.

More to come soon...

Tuesday, 8 January 2013

How does InnoDB write data? (Something's FUZZY out there!)


The internals of InnoDB are among the most challenging topics to learn in MySQL. In an earlier blog post I tried to simplify what the InnoDB disk and memory layout looks like. Once you have the basics down, questions start cropping up in your mind about how and when exactly InnoDB writes its data. Let's take our usual simplified approach to understanding this.

Let's assume users are executing some updates on some InnoDB tables. InnoDB will first check whether the records to be manipulated (contained in data pages) already exist in the InnoDB buffer pool; if they are found, InnoDB proceeds with updating the concerned records (in reality, the pages). If a page is not found in the buffer pool, an InnoDB read I/O thread will fetch the page from the concerned table's datafile and put it into the buffer pool. The appropriate record in the page is then manipulated, a redo record tagged with an LSN (Log Sequence Number) is written to the transaction logs (ib_logfile), and the block is marked as "dirty".
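You can see this LSN bookkeeping in the LOG section of the InnoDB status output (the values are placeholders here):

mysql> SHOW ENGINE INNODB STATUS\G
-- ---
-- LOG
-- ---
-- Log sequence number ...
-- Log flushed up to   ...
-- Last checkpoint at  ...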

This was straightforward to understand, wasn't it? Now what happens if someone kills the MySQL process? As usual, on restart InnoDB starts its recovery of committed and uncommitted transactions. Wait a minute: uncommitted transactions are fine, they can be rolled back using the rollback segments (the undo information kept in the ibdata file), but what about committed transactions? Isn't InnoDB supposed to write a dirty block as soon as a COMMIT occurs? (Spoiler: it isn't; with the default settings a commit only guarantees that the redo log record has been flushed, not the data page.) For your information, these two parts of recovery are known as the undo phase and the redo phase respectively.

So the instance recovery process raises an interesting question: when does InnoDB actually write data to the persistent files on disk? Looking at the recovery process, it doesn't seem to write immediately. That's something fishy... I mean "FUZZY". Going back to our update statement's dirty page, the actual page will not be flushed to disk (or the disk cache) until one of the following scenarios occurs:
1) The LRU (Least Recently Used) page list becomes full. InnoDB reads pages from disk into the buffer pool, either to manipulate them or as data pages for SELECT statements. If a page was modified, it lands in the dirty pages list; if it was only used for a SELECT, it stays in the LRU list. Since the InnoDB buffer pool is of finite size, the LRU list is bound to fill up, and at that point some dirty buffers are flushed to disk to make room for new read buffers. In layman's words, "a SELECT can flush a DML".
2) The second scenario is quite compelling: the "flush list" itself runs out of room to accommodate more dirty buffers. Oh yes, this happens, and it is controlled mainly by "innodb_max_dirty_pages_pct". To make room, the least recently manipulated pages are flushed (FUZZY!!! A page that was modified a long while ago but keeps being updated by other statements can stay in the flush list, while a page manipulated somewhat more recently may get flushed to disk; this is a kind of write optimization, which Percona counts using the "young-making rate").

BTW, if you are wondering which threads actually write these pages to their tables' datafiles, here is a hint: they are the write I/O threads, the counterparts of the read I/O threads, although all of this happens in the background and we don't need to worry about it.

Note: there are three ways in which a data page gets written: the two above (fuzzy checkpointing), plus a "sharp checkpoint" as explained by Baron Schwartz in his blog posts at Xaprb.

So are there any performance bottlenecks and improvements? Logically, dirty pages flushed because of the LRU list signify that more memory should be added, i.e. a bigger innodb_buffer_pool_size. Dirty pages flushed because of the flush list call for a more involved optimization of innodb_adaptive_flushing, innodb_max_dirty_pages_pct, innodb_io_capacity and innodb_log_file_size (courtesy mysqlperformanceblog.com).
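A quick, illustrative way to look at those knobs and at how dirty the buffer pool currently is:

mysql> SHOW VARIABLES WHERE Variable_name IN
    -> ('innodb_buffer_pool_size', 'innodb_max_dirty_pages_pct',
    ->  'innodb_adaptive_flushing', 'innodb_io_capacity', 'innodb_log_file_size');

mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_total';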

Just to leave you with something to chew on: all of this depends heavily on an important setting known as "innodb_flush_method". Keep reading...