Why Is Replication So Hard? | Scalable Internet Architectures

Why is replication such a challenging problem? The short answer is ACID. Although this isn't intended to be a database primer, you do need some background on how databases work and what promises they make to their users. Here is what ACID buys us:

Atomicity: All the data modifications that occur within a transaction must happen completely or not at all. No partial transaction can be recorded even in the event of a hardware or software failure.
Consistency: All changes to an instance of data must be reflected in all instances of that data. If $300 is subtracted from my savings account, my total aggregated account value should be $300 less.
Isolation: The elements of a transaction should be isolated to the user performing that transaction until it is completed (committed).
Durability: When a hardware or software failure occurs, the information in the database must be accurate up to the last committed transaction before the failure.
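As a minimal sketch of atomicity on a single machine, the following uses Python's built-in `sqlite3` module with an in-memory database. The `accounts` table, the account names, and the `transfer` helper are assumptions made up for illustration, and the `fail_midway` flag only simulates a failure between the two updates.

```python
import sqlite3

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if fail_midway:
                # Simulated failure after the debit but before the credit.
                raise RuntimeError("simulated failure between the two updates")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical schema and starting data, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("savings", 1000), ("checking", 200)],
)
conn.commit()
```

Calling `transfer(conn, "savings", "checking", 300, fail_midway=True)` leaves both balances unchanged, because the partial debit is rolled back. The same call without the flag moves the full $300.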

Databases have been providing these semantics for decades. However, enforcing these semantics on a single machine is different from enforcing them between two or more machines on a network. Doing so across machines is not a difficult technical challenge in itself, but the techniques used internally on a single system do not carry over well to distributed databases from a performance perspective.

Single database instances use facilities specific to a single system, such as shared memory, interthread and interprocess synchronization, and a shared and consistent file system buffer cache, to increase speed and reduce complexity. These facilities are fast and reliable on a single host but difficult to generalize and abstract across a networked cluster of machines, especially one spread over a wide-area network.
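To make the single-host case concrete, here is a small sketch of interthread synchronization using Python's standard `threading.Lock`. The `Account` class and the `run_withdrawals` helper are hypothetical names invented for this example. They stand in for the far more elaborate machinery inside a real database engine.

```python
import threading

class Account:
    """A balance guarded by an in-process lock (hypothetical illustration)."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        # The lock lives in memory shared by every thread in this process,
        # so acquiring it involves no network round trip.
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False

def run_withdrawals(account, n_threads, amount):
    """Attempt one withdrawal per thread; return how many succeeded."""
    results = []

    def worker():
        results.append(account.withdraw(amount))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True)

account = Account(1000)
succeeded = run_withdrawals(account, n_threads=8, amount=300)
```

The check and the update happen under one lock, so the balance can never go negative regardless of how the threads interleave. The same guarantee across two hosts has no shared memory to lean on, and every coordination step becomes a message on the network.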

Instead of attempting to build these facilities to be distributed (as single-system-image clustering solutions do), distributed databases use specific protocols to help ensure ACID across more than one instance. These protocols are not complicated, but because they communicate over the network they suffer from performance and availability fluctuations that are atypical within a single host.
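The text does not name a particular protocol at this point. As one familiar example of the kind, the sketch below simulates a simplified two-phase commit in a single process. The `Participant` class, its `reachable` flag, and the `two_phase_commit` function are assumptions for illustration only. A real implementation would also need durable logging, timeouts, and handling for coordinator failure, none of which are shown here.

```python
class Participant:
    """One database instance in a distributed transaction (hypothetical)."""

    def __init__(self, name, balance, reachable=True):
        self.name = name
        self.balance = balance
        self.reachable = reachable  # False simulates a network or host failure
        self._pending = None

    def prepare(self, delta):
        # Phase 1: vote yes only if reachable and the change can be applied.
        if not self.reachable or self.balance + delta < 0:
            return False
        self._pending = delta
        return True

    def commit(self):
        # Phase 2 (all voted yes): make the staged change permanent.
        self.balance += self._pending
        self._pending = None

    def abort(self):
        # Phase 2 (any vote was no): discard any staged change.
        self._pending = None


def two_phase_commit(participants, delta):
    """Apply `delta` on every participant or on none of them."""
    # In a real deployment each prepare, commit, and abort call below
    # would be a network round trip to another machine.
    votes = [p.prepare(delta) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

With two reachable participants, `two_phase_commit([a, b], -300)` applies the change on both. If one participant is marked unreachable, the call returns `False` and neither balance changes. Every step that is a plain method call here would be a message between hosts in practice, and that is where the performance and availability fluctuations come from.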

People have come to expect and rely on the performance of traditional RDBMS solutions, and thus have a difficult time accepting that they must compromise on either performance or functionality.