8.4 High Availability

So far we've concerned ourselves with the slaves. Using a proper heartbeat setup and load balancer, you can achieve a high degree of availability and transparency for MySQL-based applications. In its current state, MySQL doesn't offer a lot in the way of high availability support on the master, but that doesn't mean all hope is lost.

In this section, we'll look at several high-availability solutions (both commercial and free). Each of the options considered has pros and cons, which we've done our best to document.

NDB Cluster

As we were putting the finishing touches on this book, MySQL AB was completing the initial integration work on the newest storage engine: NDB. In 2003, MySQL AB acquired Alzato, a company started by Ericsson in 2000. The company developed NDB Cluster, a clustered database system designed for both high availability and scalability.

When the integration is complete, MySQL's NDB storage engine will provide an interface to a backend NDB cluster. For the first time, MySQL will have built-in clustering with automatic failover capabilities. See the MySQL web site and manual for more details on the NDB technology.

8.4.1 Dual-Master Replication

We looked at dual-master replication back in Chapter 7. While it doesn't help in scaling an application (both servers must handle the full write load), you can achieve much improved availability and transparency by putting a load balancer in the mix. Figure 8-5 illustrates this arrangement.

Figure 8-5. Dual-master replication for high availability

Aside from the downsides mentioned in Chapter 7 (mostly a lack of conflict resolution), there isn't a lot that can go wrong with this setup. The worst problem is the potential for data loss, but that's really no different from master/slave replication. After a query writes a record to master 1, MySQL records the query in the binary log, and the other master has a chance to read it. If master 1 happens to crash between the time that the record is written and when the binary log is updated, however, the other master (and any slaves) will never know about the query. As far as master 2 is concerned, the query never happened. The solution would be for MySQL to provide synchronous replication with two-phase commit, but it doesn't.

On the plus side, this solution is relatively easy to set up and understand. If you already know how to configure replication and have a working load balancer set up with good health checks, dual-master replication isn't much extra work. If you need to perform maintenance on the masters, you can simply take master 1 offline, do the work, bring it back online, and repeat the process on the other as soon as the first has caught up. Of course, it's best to do this gracefully. Set the health check to fail, and wait until clients are no longer accessing the master before shutting it down. Otherwise you risk interrupting in-progress transactions.

If your load balancer is sophisticated enough, you can virtually eliminate the problem of conflict resolution. Here's how it works: rather than having both masters active, configure the load balancer so that master 1 is active, and master 2 is on standby. Only when master 1 goes down should the load balancer send any traffic to master 2. Most load balancers provide a mechanism for doing this.

However, a wrinkle occurs when master 1 comes back online. What should the load balancer do? If it begins sending connections to master 1 again, you'll have a situation in which writes could be occurring to both masters at the same time. That's a recipe for conflict. Remember, MySQL connections can be long-running, so the load balancer can't assume that clients will suddenly disconnect from master 2. The load balancer needs to be configured so that the notion of the "live master" changes only when the current live master goes down.

8.4.2 Shared Storage with Standby

By increasing the cost and complexity of your infrastructure, you can eliminate the problem of lost updates described previously. Instead of two servers with their own copies of the data using replication to stay in sync, you can configure the active and standby masters to use shared storage.^[3] It's very important to realize that the standby master shouldn't mount the filesystem or start MySQL until the first is offline.

^[3] The exact type of shared storage isn't terribly important. You see greater performance from directly attached systems than network attached storage, however, due mainly to the reduced latency.

Figure 8-6 shows one implementation of shared storage. It's worth pointing out that a load balancer isn't strictly necessary in this scenario. All you really need is an agent running on each node to monitor the other. If the agent running on master 2 finds that master 1 is unavailable, it takes over master 1's IP address and starts up MySQL with an identical configuration (same data directory, log filenames, etc.). If the configuration is truly identical, starting up MySQL on master 2 is logically no different from fixing master 1 and bringing MySQL up there. However, in reality there is an important difference: time. Master 2 is already booted and ready to go. Starting up MySQL takes a matter of seconds. The only delay is imposed by consistency checks on the data. Shared storage means the possibility of share corruption if you're not using InnoDB or BDB tables.

Figure 8-6. A live master and a warm standby master using shared storage

Writing such an agent is a tricky undertaking. We don't recommend you try it unless you have a lot of time available for testing all the possible edge cases you're likely to encounter with flaky network equipment. Instead, spend some time looking at existing tools. There are numerous open source projects that can be adapted to do this for MySQL. The best candidate is keepalived (http://keepalived.sourceforge.net/), a keep-alive health check facility written to work in conjunction with LVS. There are also two commercial solutions on the market today, described in the next section.

8.4.3 Commercial Solutions

As of this writing, there are two commercial products worthy of consideration for high availability. Each takes a completely different approach to solving the problem, so different sites may find one or the other suitable, or neither. Keep an eye on this market: we expect to see a lot of new development in this area in the next year or so.

8.4.3.1 Veritas cluster a gent

Veritas has a well established reputation for providing the technology necessary to build many sorts of clusters. Their MySQL offering builds on the shared storage with standby model we just looked at. The cluster agent runs on both the active and standby nodes, monitoring the health of the primary master. When the agent detects a problem on the master, it brings the standby instance online and takes over the primary master's functionality.

8.4.3.2 EMIC Networks

EMIC Networks provides a full-blown clustering solution for MySQL. By combining a number of relatively inexpensive servers running EMIC's version of MySQL, you can create incredibly robust MySQL clusters without needing to worry about the single point of failure most other architectures have.

< Day Day Up >