peter_zaitsev (peter_zaitsev) wrote,

Faking the sync

Someone could find it an old and boring story but it is still popping up and people are still loosing data, so it is
worth to go other it once again.

MySQL/Innodb as any other transactional database relies on OS guaranties to provide its own. Transaction will be durable
and database intact on the crash only if database will perform synchronous IO as synchronous - reporting it is done when
data is physically on the disk.



The fact however is, these exected guaranties are void from time to time. Depending on OS version, file system, hardware
configuration and even a way IO is performed. you may get a confirmaition when data is not on the disk yet. Two most well known cases are perhaps Linux 2.4 with ATA/SATA drives and OS X. In most 2.4 kernel versions (2.6 are told to be fixed in some cases, but I did not check) ATA/SATA will have drive cache enabled by default and it will not be flushed on fsync(), which means if you power down the box cache content will be lost leading to transaction loss or even database corruption. OS X just sellected fsync() not to be real fsync as users typically care about performance more than about
consistency :) It also will not flush cache on ATA/SATA drives. Apple however at least is clear about this and provided
special fcntl() call to perform real fsync. So latest MySQL versions can cope with it.

The more tricky cases could involve - software or hardware RAID which adds more possibilities to mess things up.

This problem also can't be fully solved for any hardware, as some hardware just does not have an option to disable or flush cache. For example external "firewire" devices. Rumors are some ATA drives also do not implement cache flush command or lie about their cache settings. I was lucky myself.

For a long fime I though this drive cache problem is relatively mild - you can loose transactions but will not corrupt your data. It felt like fsync() does not flush cache to the drive but at least likes as "barrier" so no operations submitted after fsync() can be executed before onessubmited before it. Modern drive caches seems to become smarter and this property is also lost. And well anyway if you store logs and data on different devices you could be in trouble as there is no syncronization in such case.


Resently this problem got some publicity because of LiveJournal and Wikipedia lost some data this way on power failure.
Of course there are backups to help but it takes some time to restore this.

Some people blame it as entirely MySQL problem - it is not the case. I've seen similar problems reported on PostgreSQL mailing list as well as by Oracle users. All disk based databases require pretty much same requirements from OS to function. You can add "paranoid" mode to be able to survive without it but it will come at the cost of performance,
so it is better to make your OS to work right.


So how to do check if your Hardware-OS configuration works as needed ?

1) You can measure fsync() rate. If you do not have battery backed up RAID cache you can't get more than
250 reliable fsyncs/sec for 15000RPM drive as you have just 250 drive rotations per second. You can do
it by for example 4KB sequential synchronous write test in sysbench.

2) We've added validation mode to sysbench: (http://sourceforge.net/projects/sysbench) You can run read-write test
with --validate option power down the box and when run sequential read test with --validate to make sure no corruptions
happened. It will not make sure you do not loose last transactions but will check for other IO bugs. You may with to use
4K page size as with 16K pages partial page writes on power failure are likely. Innodb has protection against these.

3) We're working on tool to run from your client to check your server behavior on power failure. It is not published yet
but if someone is interested you can mail me.
  • Post a new comment

    Error

    Comments allowed for friends only

    Anonymous comments are disabled in this journal

    default userpic
  • 17 comments