Sunday 3 February 2013

How MySQL Replication works, somewhat a detailed pipeline approach



From many days I had been willing to write this blog, but due to some un-answered questions in my mind and literally tons of blogs already available on this topic, I went to see how it works at the code level and try to write something different. The MySQL code is obviously over-whelming and not strictly suggested unless you truly know what to look for. BANG!!! I hit this road-block "What to look for and where to ?". With some help from Valeriy Kravchuk one of the Prinicipal support engineer at Percona, I got a starting point.

So to start on, below are the files to look for the Replication code : 
/sql/slave.cc
/sql/sql_repl.cc

Anyways I'll write another post with thorough source code implementation since this post can get long and we will shift from the topic.

So this is what most of the books, manual and other blog posts have to explain about MySQL replication, "In a simple MySQL replication setup, an event (a DML, DDL statement most often) is written to binary logs which are stored on the originating server called Master. Then there is one more server running mysql which gets those events (generally some SQL or exact SQL statements) stored in Master's binary logs and stores them locally in a file called Relay-Log file. Further a specialized thread runs in MySQL on the another server which is usually named as a Slave which will execute the SQL statements stored in form of events in Relay-log, in simple terms as you would execute a SQL script file".




This being a very simple definition, doesnt answer some of annoying questions which could come up in a novice's mind, like :
1) How does the Slave gets those events (statements) in its Relay log (trust me in my earlier days I used a think file-copy happens, yes, the manual make it looks something like this. Duh!!)
2) Why do we create and grant replication privileges to a user on Master ?
3) How does the Slave knows when to get the data, MySQL replication being asynchronous, this seemed confusing.
4) Finally who actually transfers the data ? Is it the Slave or the Master. (This really made me investigate the whole process).

So I'll try to answer these questions as per my understanding of code and some help from the debug process. (Disclaimer : I might not be accurate with all the process as this still needs to worked on).

Answer 1) Lets take an example of a simple INSERT statement for a MyISAM table. An insert when executed on the master, opens up a bifurcated process, first which executes the statement (in short changes the required pages at the storage engine level, blah blah blah) and second adds the statement to a file-cache (a memory area opened up for Binary logs, named as Binlog cache). Now when the statement completes processing on the first part of the process, it waits until the statement is written to Binlog cache, once written the statement completes processing. Next the Binlog_Dump thread sends a broadcast message saying somewhat like "Hey, All the slaves I have a new statement in my cache, is anyone interested in copying it". This broadcast message is listened by the Slave_IO thread and it says "I am connected and I want that statement to be added in my own Relay-logs", the Slave_IO threads uses the same Binlog cache to copy the statement. The copy is done at packet level (every event would be checked with slave_max_allowed_packet size variable) and then it is written to relay log with the next Binlog file coordinates updated in the "master_info". There are several status messages for replication threads, which in our case after the events are copied to relay logs would be "Waiting for master to send event" on the Slave and "Master has sent all binlog to slave; waiting for binlog to be updated" on the Master.

Answer 2) The replication user account, is actually the whole bread-butter for the Slave_IO threads. Logically, what happens on the Master side is, the replication user logs in to the mysqld on Master (using the username, password and special grant privileges) and retrieves the events. Just imagine you doing a "Show binlog events in ... from ...", what you get in return is the events actually contained in the Binary logs on master. The replication user uses the same concept, although not the same code (since it has to do some extra work of accessing a totally different cache and writing to the relay-logs and updating the master-info). When there are no more events to be copied the Slave_IO thread waits for another broadcast from the Binlog_dump thread.

Answer 3) So what if your replication was paused for a month and then you start it. How does the slave knows where to start, well it simply looks into the master-info. So does it means the Binlog Dump thread will run forever, absolutely not. The Binlog Dump threads disconnects just like every other thread. However, when we start replication, the Slave_IO threads first registers himself with the master, if all goes well, the Master will spawn the Binlog Dump for the that slave. Now the Binlog_dump thread doesnt store the information like at what position the Slave_IO thread stopped a month ago, it just checks the current Binlog position and tells the Slave_IO thread "Hey you have to copy all the events till this position, I will let you know if there are any new events upcoming." So master-info, relay-log.index and bin-log.index plays an important role in this situation.

Answer 4) Logically, the answer is damn simple, the slaves pulls the latest Binary log data. So the code says that, the Binlog Dump thread will keep on sending broadcast messages, thats all it does. The Slave_IO thread will actually copy those events and write them to its relay logs.

So after all this process completes, the Slave_SQL threads get the signal to execute the events newly added in relay logs, and once it has executed all the events, the Slave_SQL will show the status "Slave has read all relay log; waiting for the slave I/O thread to update it".

The working of Slave_SQL threads would fill up its own blog post and frankly there are several of them available on the internet.

So MySQL replication is basically composed of three threads :
1) Binlog Dump : The one which runs on the Master and who signals the corresponding slave about new events in the binary logs. I mentioned corresponding since, every Slave_IO thread will have its own Binlog_Dump thread (Running as system_user in the Master).
2) Slave_IO : The one who copies the binary log events from the master, writes them to the relay log on Slave, and this is strictly not a file-copy :P
3) Slave_SQL : The one who executes the events (usually SQL statements) contained in the relay logs and purges the relay logs when all of the events have finished executing. This is one of the bottlenecks in MySQL replication being that all the statements are executed serially.

Hope this new way of understanding MySQL replication was helpful.

More to come soon...

1 comment: