Multimaster Replication


Multimaster replication is the "holy grail" of distributed databases. In this model, all data is located at more than one node, and all nodes are completely capable of processing transactions. Most systems do not require what multimaster replication offers. However, traditionally most nontechnical people (and even many technical people) assume this model when speaking of database replication. As time goes on and replicated databases become common in both small and large architectures, this interpretation will change. It is my hope that people will lean toward more descriptive names for their replication setups. For now, when we say replication, we'll refer to the whole kit and caboodle: multimaster replication.

Why is this problem so hard? Let's step back and look at what ACID requires:

  • Atomicity seems to be a simple thing to ensure. It simply means we do all or nothing, right? Yes; however, at the end of a transaction we must commit, and that is no longer as simple as closing out the transaction. With a single instance, we commit, and if anything goes horribly wrong we return an error that the transaction could not be committed. In a distributed system, we must commit on more than one node, and all must succeed or all must fail; otherwise, an inconsistency is introduced (the sketch following this list illustrates the failure mode).

  • Consistency requires that all copies (perspectives) of a datum are updated when a transaction is complete. In and of itself this isn't a problem. However, we have introduced a complication in the "scheduling" of database operations. When operating in a single instance, the database engine will arbitrarily order concurrent transactions so that one commit takes place before another. Concurrent transactions take on a whole new meaning when transactions operating on the same data can initiate and run their course at different nodes. It means that there must be a consistent global ordering of transaction commits, and that is not so easy.

  • Isolation is not really affected by the fact that transactions are happening at the same time in different places.

  • Durability requires that in the event of a crash, the copy must correctly reflect all information through the last committed transaction and not any information that was part of an uncommitted transaction. When multiple copies of the same data exist on separate machines, this task proves more difficult because agreeing on clusterwide commits is now necessary.
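
To make the atomicity problem concrete, here is a minimal, purely illustrative sketch (in Python, with an invented Node interface) of the naive approach: commit to each replica in turn. If a later node fails after an earlier one has already committed, the copies diverge and there is no clean way to take the earlier commits back.

    # Naive multinode commit: ask each node to commit in turn.  The Node
    # objects and their commit() method are hypothetical stand-ins for
    # real database connections.

    class CommitFailed(Exception):
        pass

    def naive_distributed_commit(nodes, transaction):
        committed = []
        for node in nodes:
            try:
                node.commit(transaction)   # may fail partway through...
                committed.append(node)
            except CommitFailed:
                # ...leaving `committed` holding nodes that already applied
                # the transaction while the rest did not.  Atomicity is
                # violated, the replicas are now inconsistent, and there is
                # no safe way to "uncommit" the nodes that succeeded.
                raise

This is precisely the failure mode that distributed commit protocols such as two-phase commit are designed to prevent.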

Although there are several technical approaches to the problem of multimaster database replication and several real-world implementations, they often require sacrifices (such as decreased performance, network topology limitations, and/or application restrictions) that are impossible to make.

Suppose that you have a database that performs 2,000 transactions per second. The obvious goal is to have that replicated in three locations around the world and be able to increase the transactional throughput to 6,000 transactions per second. This is a pipe dream, a pure impossibility for general database uses. Increasing a cluster's size by a factor of three and seeing a threefold increase in performance would be an optimal solution, and we know that optimal solutions don't exist for general problems. However, the true performance "speedup" may be alarming. Although read-only query performance can often be increased almost linearly, the total transactional throughput decreases if the transactions affect the same data.

There is a special case where an almost linear performance gain can be achieved. That case is when replication takes the pure form of data aggregation. If the operations on the datasets at each location are disjoint, and this is an assumption in the architecture of the replication technology itself, atomicity and consistency can be enforced without collaboration. Because this is an edge case, we'll ignore it for now.
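
Before setting it aside entirely, a brief, hypothetical sketch makes the disjointness assumption concrete: each site writes only rows tagged with its own site identifier, and rows replicated in from other sites are never modified locally, so no two sites ever contend for the same data and no cross-site commit protocol is required.

    # Data aggregation as replication: writesets from different sites are
    # disjoint by construction.  The db object, table, and column names are
    # invented for illustration.

    LOCAL_SITE_ID = "nyc"

    def insert_local_row(db, table, row):
        # Every locally produced row carries this site's identifier.
        db.insert(table, dict(row, site_id=LOCAL_SITE_ID))

    def apply_remote_rows(db, table, rows):
        # Rows arriving from other sites are applied verbatim and never
        # modified here, keeping the per-site operation sets disjoint.
        for row in rows:
            assert row["site_id"] != LOCAL_SITE_ID
            db.insert(table, row)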

Two-Phase Commit (2PC)

Two-phase commit (and to a lesser degree three-phase commit) has been used for decades to safely and reliably coordinate transactions between separate database engines. The basic idea is that the node attempting the transaction will notify its peers that it is about to commit, and they will react by preparing the transaction and notifying the originating node that they are ready to commit. The second phase is checking for any aborts during the prepare step and then, if none occurred, following through on the commit (hopefully everywhere).
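
As a rough illustration, here is a minimal sketch of a 2PC coordinator in Python. The participant objects and their prepare/commit/abort methods are hypothetical stand-ins for real database APIs (PostgreSQL, for instance, exposes the prepare step as PREPARE TRANSACTION and the second phase as COMMIT PREPARED); a production coordinator would also durably log its decision so it can recover from its own crash.

    # Minimal two-phase commit coordinator (sketch only).

    def two_phase_commit(participants, transaction):
        # Phase 1: ask every participant to prepare and vote.
        prepared = []
        for p in participants:
            if p.prepare(transaction):    # participant durably stages the work
                prepared.append(p)
            else:
                # Any "no" vote aborts the transaction everywhere.
                for q in prepared:
                    q.abort(transaction)
                return False

        # Phase 2: everyone voted yes, so instruct all participants to commit.
        # The weakness shows up here: if a prepared node is unreachable, the
        # coordinator must block (or retry indefinitely) until it returns.
        for p in participants:
            p.commit(transaction)
        return True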

The two-phase commit is not perfect, but it is considered sufficient for applications such as stock trading and banking, which are considered to be the most stringent industries when it comes to data consistency, aside from perhaps aerospace operations.

The glaring problem with 2PC is that it requires lock-step progress on each node for a transaction to finish. With a single node down, progress can be halted. If a node is marked offline to allow progress, resynchronization must occur. This approach simply isn't ideal for wide-area replication across hostile networks (the Internet is considered a hostile network in this case).

Despite its shortcomings, 2PC is the only common "correct" multimaster replication technique in use today, and, due to its overhead costs, it is not widely used for geographic replication.

EVS Engine

Extended virtual synchrony (EVS) is a concept from the world of group communications. The EVS Engine itself is a product of the group communication world and has been commercialized by Spread Concepts, LLC. The theoretical approach to this replication technique is well documented and proven correct in the PhD thesis of Yair Amir (one of the Spread Concepts guys).

If you just sit back and consider for a moment who would have the best ideas on how to get multiple computers to communicate, organize, and collaborate on an effort to accomplish some goal (say ACID), it would be the group communications people. On the other hand, the only people you trust your data to are the database people. The requirement that these two groups must collaborate poses a roadblock for the evolution of more advanced data replication approaches.

Techniques from the group communication world can help overcome the challenges posed by disconnected networks and network partitions. In typical multimaster replication configurations, all nodes must concur that an operation will happen and then execute it using 2PC. This might work well in two-node scenarios; however, when more nodes are involved and they are not local with respect to each other, complications are introduced that can seriously affect progress in these systems.

To take a conceptually simple architecture for multimaster replication, consider a company in the United States that has offices in six states. Each office wants a local copy of the enterprise's database. To modify content in its local copy, a database must know that the change will not produce a conflict with any of its peers. This sounds simple until one of the long-haul networking circuits used to transit information goes down. This will cause some partition in the working set of databases, and it is likely that there will be two or more views of the participating databases.

Assume that we have two West Coast, two East Coast, and two Midwest databases. After a networking disaster, a VPN failure, or some other unexpected but probable event, a partition in the working set of databases will occur. Those on the West Coast and in the Midwest can all see each other and believe that the eastern nodes have crashed, while the two eastern nodes can see each other and believe that the four nodes to their west have gone down. This is called a network partition. When network partitions occur, you have two options: manual magic and quorums.

Manual Magic

The first step is to stop all operations; after the partition, no data modification on any database node can effectively commit because you cannot form a unanimous vote to do so. In this model, an operator must intervene and instruct the participating machines that they only need a consensus across a newly defined set of nodes. For example, the eastern operators do nothing, and the western operators reconfigure their replication scheme to exclude the two nodes that are unavailable to them. This is often the case in local area replication schemes because the expectation is that network partitions will not occur and that if a node becomes unavailable it is because something disastrous happened to that node. Aside from the manual intervention required to allow progress in the event of a network partition, techniques and procedures must be developed to handle resynchronizing those machines excluded from the replication configuration during the time of the partition. This is hard; it's magic... black magic.

Quorums

The second option is to establish a quorum. Our friends in the group communication field have had to deal with this problem from the beginning of the exploration of their field. It isn't a concept related to databases at all but rather one related to group decision making. A quorum is defined as an improper subset of the total possible group that can make a decision. A correct quorum algorithm ensures that, regardless of the number of machines, as long as every machine follows that algorithm there will exist at most one quorum at any time. The simplest quorum algorithm to imagine is one in which a majority of the total configuration is represented in the working set. With the six-machine configuration just described, this would be any time a group of four or more machines can communicate; that grouping is the quorum. More complicated quorum algorithms exist that are more robust than the simple majority algorithm, but they are outside the scope of this book.
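
The simple majority rule is easy to express in code. The sketch below is illustrative only; the node names are invented, and a real system must also agree on the total configuration itself.

    # Simple majority quorum: a partition may make progress only if it can
    # see more than half of the total configured nodes.

    ALL_NODES = {"west1", "west2", "mid1", "mid2", "ny1", "east2"}

    def has_majority_quorum(visible_nodes):
        return len(visible_nodes & ALL_NODES) > len(ALL_NODES) // 2

    # With the six-node configuration just described, the four western and
    # midwestern nodes form a quorum, while the two eastern nodes do not:
    #   has_majority_quorum({"west1", "west2", "mid1", "mid2"})  -> True
    #   has_majority_quorum({"ny1", "east2"})                    -> False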

By establishing and executing a quorum algorithm, the overall replication configuration can guarantee progress in the event of arbitrary network configuration changes that could induce partitions and subsequent merges. Some believe the downside of this is that an administrator, who has access to outside influence and information, could make better business decisions about which partition should be allowed to make progress. However, quorum algorithms can be effective at preserving business continuity if the proper considerations are incorporated into the adoption or development of the quorum algorithm used in the system.

If we introduce the fact that our New York database (one of the two eastern databases) is at the corporate headquarters and that, if at all possible, it should be able to make progress, we can change our quorum algorithm to reflect that. For example, if the group of connected nodes contains the New York database, it is a quorum. Probably safer than that is a quorum defined as either five of the six machines, or any two machines provided one of them is New York.
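
That rule, too, can be written down directly. In this hedged sketch (again with invented node names, "ny1" standing in for the New York database), a group is a quorum if it contains at least five of the six machines, or at least two machines one of which is New York; any two groups satisfying the rule must overlap, so at most one quorum can exist at a time.

    # Quorum rule favoring the New York headquarters database: five of the
    # six machines, or any two machines provided one of them is "ny1".

    ALL_NODES = {"west1", "west2", "mid1", "mid2", "ny1", "east2"}

    def has_quorum(visible_nodes):
        visible = visible_nodes & ALL_NODES
        if len(visible) >= 5:
            return True
        return "ny1" in visible and len(visible) >= 2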

On top of this quorum system, EVS Engine allows reliable, ordered messaging among members without the cost of 2PC. A message is sent to the current group; an EVS system guarantees that when it is delivered, you will know the exact membership of the group at the time of delivery and that everyone else who received the message also saw the same membership at the time of delivery. Using this property, it can be shown that you can replicate ACID databases.
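
To give a feel for how this is used (a conceptual sketch, not the actual Spread or EVS Engine API), imagine each node receiving writesets in an agreed total order, with every delivery tagged by the membership view that all recipients share; a node applies a write only when that view constitutes a quorum, so every quorum member applies the same sequence of writes.

    # Conceptual sketch of applying replicated writes delivered by an
    # extended-virtual-synchrony layer.  The db object and has_quorum
    # function are hypothetical.

    def on_deliver(db, writeset, membership, has_quorum):
        # `writeset` arrives in the same total order at every member, and
        # `membership` is the view all recipients agree they saw at delivery.
        if not has_quorum(membership):
            # Minority partition: hold back the write so divergent commits
            # cannot occur; it will be resolved when the partitions merge.
            return False
        db.apply(writeset)   # every quorum member applies the same ordered write
        return True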

Although EVS Engine is available in some fashion or another from Spread Concepts, LLC., it has a long way to go before it is accepted as a standard approach to multimaster replication. It has considerable opposition from the database community for unsubstantiated reasons, but when that intergroup dynamic resolves, you can expect replication over extended virtual synchrony to be the de facto replication paradigm.

In the end, multimaster replication isn't ready for prime time in most environments for the sole reason that people tend to want multimaster replication for increased availability and performance, and it isn't possible to have both.

As an interesting aside, the PostgreSQL Slony-II project is implementing an algorithm similar in nature to EVS Engine. Its algorithm replaces some of the more challenging problems tackled by EVS Engine with simpler (and more practical) solutions. The Slony-II project is positioned to lead the industry with respect to achieving scalable and efficient geographically separated multimaster database configurations.



