peter_zaitsev (peter_zaitsev) wrote,

MySQL: Something to watch with replication

Last week I ran into this situation 3 times, so it must be common and I guess extremely upsetting as there is limited things you can do when it happens,
being informed in advance you however can plan accordantly and avoid the problem. What am I speaking about ? Just MySQL Replication in large databases with heavy writes. What is wrong with it you would ask and would be correct - in most cases Replication is working and performing great but there are some things to watch out.



So when does the problem happens ? In a few cases.

Most common is when you have huge database - significantly larger than your memory size, intensive write load and fancy IO subsytem with several drives. In this case you may find whatever number of hard drives you plug into the Slave it is still unable to keep up. Why ? Because to load many hard drives efficiently with large database you need many things going at the same time, which does not happen as replication has single thread executing the queries. This case is actually hardest to solve non intrusive way - the best solution is to get more memory or optimize database so load becomes CPU bound. If it is impossible use few higher performing disks - pair of SCSI 15000RPMs in RAID1 could work better than 4 of 10000RPMs in RAID10. Solid State disks is also theoretical solution but is a "bit" expensive.


Case number two. Database might be small but having many small write transactions with innodb_flush_logs_at_trx_commit=1. In this case slave might not be able to keep us as group commit (commiting several transactions with single disk write) does not work for the same reason - it requires multiple transactions commiting at the same time, while slave SQL thread always has only one. How serious difference could be ? Quite large. Order of magnitude sometimes. Single disk is able to physically do some 100-200 writes per seconds, while with group commit you can get thousands of transactions commited per second. This problem happily has easy solution - use innodb_flush_logs_at_trx_commit=2 on the slave as replication is asynchronous anyway or
get RAID with battery backed up cache.


Case number three. As you could already have guessed the same may happen with CPU bound load - if you have many CPUs on the system replication will be able to use only one effectively. If it is not enough to handle your slave updates you're in trouble. Moving from 2CPU box to 4CPU or dual core is unlikely to help. The same as with disk drives - few faster CPUs are better than more of slower ones in this case.


All of these issues are likely to be fixed in MySQL at some point, but I would not count on it anytime soon. Making replication parallel is major and rather complex task. Row level replication should reduce CPU load and reduce frequency of case number 3, Optimizations in MySQL query executions increase the load needed for this to happen but there is no silver bullet yet.

My advice for such cases would be - scale out. Plan your database so it can run on many commodity boxes with few CPUs and disks - so you would not need more than part of them for replication. This is also going to be cheaper easier to maintain (after you automate things) and safer. Many times I've hear "We're running without HA now as our 32 CPU 64GB RAM box is out for repair".
  • Post a new comment

    Error

    Comments allowed for friends only

    Anonymous comments are disabled in this journal

    default userpic
  • 2 comments