Aspects of Scalability

Scalability is the measure of how effective a cluster is. In fact, I would go as far as to say that it is the measure of whether a group of machines is, effectively, a cluster at all. These statements may sound a bit bold, but my definition of scalability below is quite broad and can be interpreted to encompass everything that has to do with combining multiple computers into a cluster.

In a nutshell, if a cluster of four machines does not provide more of anything (including reliability) than the corresponding monolithic server supplies, the cluster has achieved nothing; you may just as well leave machines two through four switched off. If that cluster of four machines provides four times the capability of a monolithic server, then the cluster scales very well. In fact, such ideal scalability is generally impossible to achieve in practice; the challenge is to see how close a cluster can get to this case.

Defining Scalability

The term scalability is used in many different contexts, and depending on context one can probably find quite a few different definitions. In terms of a message server, let's use the following definition:

Important 

Scalability is the degree to which some aspect of the messaging service is enhanced by adding more resources.

Specifically in the case of clustering, let's limit the discussion to an even more specific definition: scalability is the degree to which some aspect of the service is enhanced as more physical machines are added to the cluster.

In some cases scalability can be measured quantitatively. For instance, if you have a benchmark that measures the number of messages per second that your cluster can process, you can run it for a cluster of two machines and a cluster of four machines and compare the results. If doubling the number of machines doubles the message rate, then the scalability is linear, because a plot of message rate versus number of machines will form a straight line, as shown in the diagram overleaf.

We can be even more quantitative and divide the percent improvement in the benchmark by the percent increase in the number of machines to get a numerical value for the scalability:

Important 

Scalability = (% improvement in the benchmark) / (% increase in the number of machines)


A value of 1 corresponds to linear scalability. A value less than 1 indicates sublinear scalability, and a value greater than 1 indicates superlinear scalability. Superlinear scalability corresponds to a condition commonly called synergy, in which the whole is greater than the sum of its parts.
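To make the formula concrete, here is a minimal sketch of the calculation in Java; the benchmark numbers are purely illustrative, not measurements of any real server:

public class Scalability {

    // Scalability = (% improvement in the benchmark) / (% increase in machines).
    // 1.0 is linear, less than 1.0 sublinear, greater than 1.0 superlinear.
    public static double scalability(double oldRate, double newRate,
                                     int oldMachines, int newMachines) {
        double improvement = (newRate - oldRate) / oldRate;
        double increase = (double) (newMachines - oldMachines) / oldMachines;
        return improvement / increase;
    }

    public static void main(String[] args) {
        // Going from 2 to 4 machines (a 100% increase) while throughput rises
        // from 10,000 to 18,000 messages per second (an 80% improvement)
        // gives a scalability of 0.8 - somewhat less than linear.
        System.out.println(scalability(10000, 18000, 2, 4));
    }
}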

Synergy is generally considered to be the result of good teamwork, but in the world of server clustering, it doesn't happen. I will therefore stick to the rule that linear scalability is the best that one can hope to achieve in the world of clustering, and that superlinear scalability is impossible. In practical situations, scalability will always be somewhat less than linear because there is inevitably some extra overhead incurred when increasing the size of a cluster.

Scalability will seldom be constant as the number of nodes in a cluster increases. A very common scenario is that scalability will be close to linear for the first few nodes, and then become sublinear very quickly. At some point the scalability will saturate, or become 0, corresponding to the point where the extra overhead of communication within the cluster equals the gains brought by each extra machine. It is entirely possible that beyond some point, the scalability becomes negative and total capability of the cluster degrades as more machines are added.

A prerequisite to quantitatively measuring scalability is the ability to measure the feature that you are trying to scale. If this is performance, then you will need a quantitative performance benchmark before you can measure the effect of adding nodes. Benchmarking is very much a black art. It is common for two different benchmarks to show very different results for the same server. This is because a message server, like any complex software system, has many different potential performance bottlenecks. The limits imposed by each of these may behave very differently as the load and the number of nodes are increased.

Important 

It is very easy for two different benchmark programs to encounter different bottlenecks and thus show very different scalability behavior. As always, you need to understand what you are measuring. All the caveats that apply to performance benchmarking apply even more so to the quantitative measurement of scalability.

Scalability of Connections

There are many resource limits in a message server that could ultimately lead to scalability concerns. Foremost among them are CPU capacity, memory, and persistent storage (disk) space. It might seem strange to some that the number of client connections might receive more attention than things like message volume and storage capacity addressed below, but connections are a limited resource.

Some operating systems use file descriptors as the interface that gives applications access to network connections. This makes them look like files, even though they are not. Operating systems place a fixed limit on how many open file descriptors a process may have. If a server is implemented in Java, then the problem of connection limits is even more acute, as described below. Message servers by their very nature maintain a lot of client connections, and this is very often the first scalability limit encountered on a monolithic server.

As mentioned above, servers implemented in Java are limited to a number of connections less than that imposed by the file descriptor limit. The explanation for this is somewhat detailed, but since the issue of limited connections very often leads to the need for clustering, it may be worthwhile to understand the background.

The Java I/O API does not provide any means of doing non-blocking I/O. This means that the only way to determine whether there is data available to read from a network socket is actually to try to read it; the read() method call simply blocks, and does not return, until data arrives. This, of course, presents a problem if you want your program to do something else until there is data to be read from the socket. Creating a separate thread dedicated to waiting to read from each socket easily solves this problem. If a server has thousands of open connections, though, then it needs thousands of threads that do nothing except wait for incoming data.
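The following minimal sketch shows the thread-per-connection pattern that the blocking java.io API forces on a server; the class and the handleData() method are illustrative only, not taken from any particular product:

import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

// One dedicated thread per client connection: each thread spends most of its
// life blocked inside read(), doing nothing but waiting for incoming data.
public class BlockingConnectionServer {

    public static void main(String[] args) throws IOException {
        ServerSocket serverSocket = new ServerSocket(7777);
        while (true) {
            final Socket client = serverSocket.accept();
            new Thread(new Runnable() {
                public void run() {
                    try {
                        InputStream in = client.getInputStream();
                        byte[] buffer = new byte[4096];
                        int bytesRead;
                        // read() blocks until data arrives or the client disconnects
                        while ((bytesRead = in.read(buffer)) != -1) {
                            handleData(buffer, bytesRead);  // hypothetical message handling
                        }
                    } catch (IOException e) {
                        // connection dropped; the thread simply exits
                    }
                }
            }).start();
        }
    }

    private static void handleData(byte[] data, int length) {
        // placeholder for real message processing
    }
}

With a thousand clients, this program needs a thousand of these reader threads, which leads directly to the memory problem described next.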

Sun clearly designed the Java I/O libraries with the assumption that threads are "cheap". They are in terms of CPU cycles: a thread that is blocked waiting to read data does not consume any CPU, but it does use memory. Java reserves a fixed amount of memory for the call stack of each thread, regardless of how much of it the thread actually uses. This is not the same as the heap space reserved for allocating objects; it is in addition to the heap. Java VMs generally let you configure the actual stack size. For example, the default stack allocation for Sun's Java 1.3 on Solaris is 512KB, and it can be changed using the -Xss command-line option of the java command.

The ultimate problem is that there is a limit to how much memory a single process can use. On 32-bit processors, a pointer cannot address more than 4GB of memory, so it is impossible for a process to use more than this, even if the machine has more than 4GB of total virtual memory. (Total virtual memory is the sum of the amount of real memory and swap space.) When the combined memory of the heap, all the thread stacks, the program code, and the virtual machine itself reaches this limit, then the process is out of memory. 4GB is actually a best case for 32-bit operating systems. Some operating systems do not make the maximum possible memory space available to the application: Windows NT gives 2GB, Linux 3GB, and Solaris all 4GB.
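A rough, back-of-the-envelope calculation makes the limit clear. The sketch below simply divides the address space left over after the heap and the VM itself by the per-thread stack reservation; all of the numbers are assumptions chosen for illustration:

public class ThreadLimitEstimate {
    public static void main(String[] args) {
        long addressSpace   = 2L * 1024 * 1024 * 1024; // 2GB usable, e.g. Windows NT
        long heap           = 512L * 1024 * 1024;      // assumed 512MB heap
        long vmAndCode      = 256L * 1024 * 1024;      // rough guess for VM, code, native libraries
        long stackPerThread = 512L * 1024;             // 512KB default stack

        long maxThreads = (addressSpace - heap - vmAndCode) / stackPerThread;
        // About 2,500 threads with these numbers - and therefore roughly that
        // many blocking connections, before counting any other threads.
        System.out.println("Approximate thread limit: " + maxThreads);
    }
}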

With 64-bit processors and operating systems, this should cease to be a problem. More importantly, Sun is working on the introduction of non-blocking I/O in future versions of Java. (This is detailed in Java Specification Request #000051: New I/O APIs for the Java Platform.) This will alleviate the need to reserve massive amounts of memory for threads that perform a trivial function.
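For reference, the non-blocking API that eventually shipped as java.nio (in Java 1.4) lets a single thread wait on many sockets at once using a Selector. This is only a minimal sketch of the idea, not production code:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// A single thread multiplexes all connections via a Selector instead of
// dedicating one blocked thread (and its stack) to each socket.
public class NonBlockingConnectionServer {

    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.socket().bind(new InetSocketAddress(7777));
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();                      // blocks until some channel is ready
            Iterator keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = (SelectionKey) keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    buffer.clear();
                    ((SocketChannel) key.channel()).read(buffer);
                    // hand the bytes in the buffer over to message processing here
                }
            }
        }
    }
}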

Keep in mind that if a program is using a large number of threads for purposes other than blocking on connections, this will further reduce the total number of connections that are possible. The actual number of connections that a Java-based server can handle depends, in the end, on the actual Java VM implementation, the OS, and the default stack size. In practice, the limit usually falls somewhere in the range of 1000 to 5000 connections per process.

Note 

See the Volano report (http://www.volano.com/report.html) for more details on this.

Connections are a rare case in which we can make some sweeping generalizations, even with specific numbers, that apply to all Java-based message server clusters:

  • If you need to support more than 1000 simultaneous client connections, you should plan on using some kind of clustering solution. Most JVMs can support at least this number of connections, if the server does not require many more threads in addition to those associated with connections.

  • Any cluster should provide, as a minimum, good connection scalability.

Note that the limit on the number of connections, whether it is due to the thread limit or the file descriptor limit, is per process. This means that clustering could be used to alleviate the connection limit even with all of the process nodes running on the same machine. In most cases though, if you have a cluster you will probably want to take advantage of the ability to spread out the load over multiple machines.

Fortunately, it is relatively easy to ensure near-linear scalability for connections. Since the connection limit is more or less fixed per VM, each additional VM on an additional physical machine should provide the same number of additional connections. If all other parameters do not scale negatively, then the cluster should be able to handle any number of connections by simply adding machines.

Scalability of Message Throughput

Message throughput is the parameter that most people mean when referring to the performance of a message server. This means, specifically, the total number of messages per second that can be produced and consumed. When people refer to load balancing in a cluster, this is the primary factor that they are seeking to improve. It is clear that message throughput could never scale linearly; in addition to moving messages from producer to the server, and from server to consumer, there is the extra overhead of moving messages among the nodes of the cluster.

The actual scalability of message throughput will depend heavily on the architecture of the particular cluster. It is almost guaranteed to be different for topics and queues, even for the same JMS provider. Clusters that require every message to be unconditionally copied to every node will probably not scale well here, although they will be bound to get good scores for high availability. Shared disk and shared database architectures (more details on these later) are also unlikely to have scalable throughput, as all messages must be written to the same central store, regardless of the number of nodes in the cluster.

The extent to which throughput scales depends on the degree to which message processing can be performed in parallel, and the degree to which unnecessary message copying between nodes can be avoided. In the Pub/Sub domain, messages must be transferred to each node that has subscribers interested in that message. This depends a lot on the actual distribution of the producers and consumers, but in a worst case, when every node has at least one subscriber that is interested in every message, then it will be difficult for any architecture to scale.
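One common way to avoid unnecessary copying is for each node to track which other nodes have at least one subscriber for a given topic, and to forward each published message only to those nodes. The following is a minimal sketch of such a routing table; all of the names, including the ClusterTransport interface, are hypothetical:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Per-node routing table: topic name -> set of node IDs that currently have at
// least one subscriber for that topic. Published messages are forwarded only
// to the nodes in that set.
public class TopicRoutingTable {

    private final Map interestedNodesByTopic = new HashMap(); // String -> Set of node IDs

    public synchronized void subscriberAdded(String topic, String nodeId) {
        Set nodes = (Set) interestedNodesByTopic.get(topic);
        if (nodes == null) {
            nodes = new HashSet();
            interestedNodesByTopic.put(topic, nodes);
        }
        nodes.add(nodeId);
    }

    public synchronized void forward(String topic, byte[] message, ClusterTransport transport) {
        Set nodes = (Set) interestedNodesByTopic.get(topic);
        if (nodes == null) {
            return;   // no remote subscribers: nothing is copied across the network
        }
        for (Iterator it = nodes.iterator(); it.hasNext();) {
            transport.send((String) it.next(), topic, message);
        }
    }

    // Hypothetical abstraction over the cluster's inter-node communication.
    public interface ClusterTransport {
        void send(String nodeId, String topic, byte[] message);
    }
}

In the worst case described above, where every node has a subscriber interested in every message, the set contains every node and this optimization buys nothing.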

The best-case scenario is when all publishers and subscribers for a particular topic are attached to the same node. This case scales well, but if a cluster requires this, then it is not really a cluster at all, just a group of isolated monolithic servers and a repository where the clients can look up the location of each server. One implication of such a system is that if all of the clients needed to access all destinations, then each client would need to be connected to all of the servers. This would mean that this "cluster" also did not solve the problem of a limited number of connections.

In the PTP domain it is hardly useful to copy every message to every node. In fact, this would make it difficult to ensure that each message is delivered to only one receiver. If the nodes share information about their receivers, then it is possible to determine on which node a message will be consumed, and transfer it only to that node. If the network communication does not immediately become a bottleneck, then it is possible to get a reasonable amount of parallelism for queues.

In summary, it is safe to say that realizing a high degree of scalability for message throughput in a JMS message server cluster is not trivial.

Scalability of Storage Capacity

In both the PTP and Pub/Sub domains, JMS providers need to be able to store messages until consumers for those messages become available. The primary purpose of a message system is to move messages from producers to consumers, but message storage is also an important part of the loose coupling paradigm. This is what allows producers to send messages at any time, without being concerned about whether the consumer is ready to accept them or not. If message volume is large and the consumers are offline much of the time, the storage capacity of the message server can be quickly exhausted.

Flow Control

Flow control is essential to ensure that producers are stopped before the server's storage capacity is exceeded. It guarantees a graceful degradation of services instead of message loss, but it also detracts from the loose coupling of the system by forcing producers to block until consumers are ready (or messages expire). It is, of course, more desirable when enough storage capacity is available to accommodate the maximum number of messages that the server will ever need to store.
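The following is a minimal sketch of such a flow-control check, assuming a simple count-based capacity limit; a real server would more likely track capacity in bytes and per destination:

// Blocks producers when the store is full rather than discarding messages:
// graceful degradation at the price of tighter coupling to the consumers.
public class FlowControlledStore {

    private final int maxMessages;
    private int storedMessages = 0;

    public FlowControlledStore(int maxMessages) {
        this.maxMessages = maxMessages;
    }

    public synchronized void store(byte[] message) throws InterruptedException {
        while (storedMessages >= maxMessages) {
            wait();            // the producer blocks until a consumer frees capacity
        }
        storedMessages++;
        // ... persist the message here ...
    }

    public synchronized void messageConsumed() {
        storedMessages--;
        notifyAll();           // wake up any blocked producers
    }
}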

A natural result of all of this is to use clustering to scale the storage capacity of the server. Each additional machine in the cluster brings additional disk space. In addition these disks can operate in parallel and potentially increase the performance of the server.

Storage Algorithm

Realizing these benefits is not easy, though. It means that the cluster must use some algorithm to decide on which node each message should be stored. If redundancy is desired for high availability, then each message must be stored on multiple hardware nodes, but it must not be stored on all nodes, as this would eliminate any chance of scalability. This also introduces complexity in distribution, because the cluster must match up each message with a corresponding consumer no matter which node the message is stored on and which node the consumer is connected to.

One possible scheme designates a home node for each destination and all messages produced for a destination are stored on the destination's home node before distribution. This detracts from performance because each message must be moved between as many as three different nodes of the cluster: the node that the producer is connected to, the node that is the destination's home, and the node that the consumer is connected to.
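One simple way to assign home nodes is to hash the destination name over the list of cluster members, so that every node computes the same answer without any extra coordination. A minimal sketch, assuming a fixed cluster membership (if nodes join or leave, destinations would have to be reassigned):

import java.util.List;

// Deterministically maps each destination to its "home" node.
public class HomeNodeAssignment {

    private final List nodeIds;   // node identifiers, in the same order on every node

    public HomeNodeAssignment(List nodeIds) {
        this.nodeIds = nodeIds;
    }

    public String homeNodeFor(String destinationName) {
        int index = Math.abs(destinationName.hashCode() % nodeIds.size());
        return (String) nodeIds.get(index);
    }
}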

Improving the Algorithm

A possible improvement on this scheme is to always store a message on the node that the producer of the message is connected to. This eliminates one of the "hops" in the previous scheme, but introduces a new twist: the messages for one destination are not all stored on the same node. This could make it difficult to guarantee that the messages are distributed in the order that they arrive.

Fortunately, the JMS specification does not require a provider to guarantee anything about the relative ordering of messages from different producers, only that messages from one producer arrive in the order in which they were sent. This scheme does not, as described here, provide any redundancy for high availability. In general, scalability of storage capacity is a direct tradeoff against redundant storage; for a given number of hardware nodes, the more redundancy there is, the less the effective storage capacity.

As with throughput, cluster architectures that require all messages to be stored on all machines, as well as shared disk and shared database architectures, will not scale with respect to storage. It is important to note, though, that shared disk clusters may (and often do) use a RAID as the shared disk.

RAID subsystems provide the ability to scale storage independently of the server cluster by adding additional physical disks. Likewise, database servers often provide their own clustering implementations that can scale storage capacity over multiple machines without requiring the message server cluster to provide this. In these cases, the RAID controller or the database connections may ultimately limit the message throughput. A cleverly designed message server cluster has the potential to scale storage capacity without such bottlenecks, but realizing this potential is not an easy task.

Scalability of Storage Redundancy

Redundant message storage is necessary in order to provide high availability. It means that each message must be stored on at least two different hardware nodes of the cluster, so that if one of them fails, the message will not be lost. This is one of the most common reasons for employing clustering. Since I take the view that scalability is the ultimate measure of cluster effectiveness, I like to look at storage redundancy from this point of view also. Some may consider this to be stretching the scalability concept too far, but bear with me.

If we make the assumption that only one node of a cluster will fail at a time, then redundancy is a binary factor: either you have it or you don't. If we loosen the assumption and look at the possibility of multiple nodes failing simultaneously, or of additional nodes failing before the first failed node is replaced, then providing sufficient redundancy is more complex.

There are a number of reasons why multiple nodes could fail at one time. Natural disasters and long-term regional power outages will almost certainly take all machines at a site out of service. (Battery-powered uninterruptible power supplies (UPS) provide protection from short-term power outages, but their capacity is limited. True safety only comes with a local power generation plant.) A blown fuse or a broken UPS could cause the failure of a subset of machines at one site. Multiple hardware nodes of a cluster should only draw power from the same circuit if the cluster can survive the simultaneous loss of all of those nodes.

A network failure can cause several nodes of the cluster to become unreachable at one time. This situation can be particularly difficult to deal with, since the nodes that are unreachable continue to operate, and it appears to them as if the rest of the cluster has failed. The implications of this will be discussed below under "Network Partitioning".

Availability of a System Administrator

Another case that is often overlooked in clustering is two or more unrelated node failures that overlap. By "overlap", I mean that the nodes do not necessarily need to fail at exactly the same time, but if one node fails, say due to a disk failure, and is not replaced promptly, then another could fail, say due to an OS crash, before the first node is repaired. We hope that our cluster nodes are reliable, so the likelihood of this happening is extremely small. But how small is it? An important factor here is the time required to replace hardware after it has failed. No matter how automated the system may be, this step requires human intervention.

For example, suppose a data center is unmanned on weekends. If a cluster node bursts into flames and burns up on Friday evening, a backup node should spring into action and take over. The backup has switched over to become primary, but if there were no additional backups for this node, then the new primary runs without a backup until Monday morning, when a human is available to clean up the burnt remains of the failed node, replace it with new hardware, and configure it as the new backup node.

For two days, then, the cluster runs with compromised reliability. It is unlikely that the new primary node will fail during this time, but the actual risk should be assessed. If it is too high, then those responsible for the system should consider manning the data center over weekends, or adopting a cluster architecture that supports multiple backup nodes per primary.
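To get a feel for the size of that risk, a back-of-the-envelope estimate is enough. The figures used below (one failure per node per year, roughly 60 unattended hours between Friday evening and Monday morning) are assumptions for illustration only:

public class OverlapRiskEstimate {
    public static void main(String[] args) {
        double failuresPerNodePerYear = 1.0;     // assumed failure rate per node
        double hoursPerYear = 365.0 * 24.0;
        double unprotectedHours = 60.0;          // Friday evening until Monday morning

        // Chance that the new primary also fails while it runs without a backup
        double risk = failuresPerNodePerYear * (unprotectedHours / hoursPerYear);
        System.out.println("Risk of an overlapping failure: " + (risk * 100) + "%");
        // Roughly 0.7% per incident with these numbers - small, but not zero, and
        // it grows with the failure rate and the length of the unattended window.
    }
}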

Time Required to Restore the Node

After a new hardware node is brought online to replace the failed one, all of the data that should be stored on it must be restored. It is not enough to restore data from a backup tape, or even to match the state of the node that it is replacing, because there may be a constant stream of new messages being published that need to be stored on the node that is being replaced.

When the original node failed, a backup node presumably took over its responsibilities, in other words, the backup node became the new primary and took on the job of storing and distributing those messages that the failed node would have been responsible for. The new primary node is now the one that contains the current set of messages. The new replacement node must copy all of the messages and associated state data from the new primary, while the new primary is also handling new incoming messages. If there are Gigabytes of messages stored on the node, this will take some time. Furthermore, the act of copying data from the new primary to the new backup will slow down the performance of the new primary.

If handling new messages has a higher priority than restoring the state of the new backup node, then the restore operation will take even longer. If you are using a shared disk architecture, then there is no data copying required when replacing a failed node. However, shared disk architectures should use a RAID, and after a failed RAID disk is replaced, the same factors come into play: it takes a non-trivial amount of time to rebuild the contents of the replaced disk.

Additional Node Failure

The point here is that there are some situations in which it could take hours to restore a cluster with a failed hardware node back to its full complement of nodes. What happens if another node fails during this period? Things could get ugly. Still, we often expect cluster nodes to run for months at a time without a problem, so the chance of two nodes failing within a time interval of several hours is still small. It is important, though, to estimate the actual likelihood of this occurrence and decide whether or not it is acceptable. If it is not acceptable, then you need to stop thinking about storage redundancy as a binary "all or nothing" factor, and start thinking about it in terms of scalability.

Important 

Having a cluster that can keep operating after a node fails is not the end of the story. Any complete high availability scheme requires monitoring, a recovery plan, and spare hardware. This is especially true when the cluster can survive one node failure but not two: once a node has failed, the cluster has lower availability than a monolithic server, not higher.

Consider the following unlikely, but enlightening, example: you have a cluster of 365 hardware nodes and hardware failures occur, on average, once per year per node. A little math will show that you can expect a failure once per day on average (including weekends). If your cluster can tolerate exactly one node failure, then you are OK for the first day, but if you do not replace the node right away, a complete cluster failure is only one more day away.

Compare this to the single node case, where you will have no problems for roughly a year (statistically speaking). In fact, with the cluster described above, you will never be able to just sit back and enjoy the reliability of your cluster; you should count on replacing nodes every day. And you had better have plenty of extra hardware in stock.

Scalability Criteria

To come up with a scalability criterion, we actually have to look at how the probability of irreplaceably losing data changes as more nodes are added to the cluster. This probability asymptotically approaches zero as the number of redundant copies of data increases. So in order to be consistent with the other measures of scalability and have a value of 1 corresponding to linear scalability, I would have to come up with some complex function based on the probability of losing data. I could present such a formula, but it would probably not be particularly enlightening for most people, so instead I will present examples of clustering schemes that illustrate the extremes of the scalability range.

The first example is the trivial case of a cluster that provides no redundant message storage. This clearly corresponds to a scalability value of 0, since adding additional nodes to the cluster yields no increase in safety. This would be true of a cluster that is only intended to increase performance through load balancing or to provide message routing. In practice, reliability is one of the primary reasons for clustering, so this type of cluster is not so common.

Primary Backup Configuration

The most basic form of redundant storage in a cluster is a simple primary/backup scheme, where each node that is actively providing services (a primary node) has one corresponding backup node that should be a perfect replica of the primary at all times, at least in terms of the data that it stores. The backup must be able to detect the failure of the primary and take over in its place immediately.
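Failure detection is typically based on some form of heartbeat: the primary periodically announces that it is alive, and the backup promotes itself if it has heard nothing for a configured timeout. The sketch below shows only the backup's side of this, with hypothetical method names; note that a missed heartbeat cannot distinguish a dead primary from a network failure, which is exactly the problem discussed under "Network Partitioning" below:

// Runs on the backup node: promotes this node to primary if no heartbeat has
// been received from the primary within the configured timeout.
public class HeartbeatMonitor extends Thread {

    private final long timeoutMillis;
    private volatile long lastHeartbeat = System.currentTimeMillis();
    private volatile boolean primaryAlive = true;

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called by the networking layer whenever a heartbeat arrives from the primary.
    public void heartbeatReceived() {
        lastHeartbeat = System.currentTimeMillis();
    }

    public void run() {
        while (primaryAlive) {
            try {
                Thread.sleep(timeoutMillis / 2);
            } catch (InterruptedException e) {
                return;
            }
            if (System.currentTimeMillis() - lastHeartbeat > timeoutMillis) {
                primaryAlive = false;
                promoteToPrimary();   // hypothetical: start accepting client connections
            }
        }
    }

    private void promoteToPrimary() {
        // switch this node from backup to primary mode
    }
}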

Synchronization of the primary and backup may occur in several ways: they may share a common disk (which introduces a single point of failure; more on this later), the backup may perform all functions identically to the primary and just not produce output, or the primary may send messages directly to the backup that synchronize its data store. Although it is theoretically possible to have more than one backup for each primary, this is not common in practice.

The reasons for this vary from simple cost-effectiveness (the extra safety of multiple backups does not justify the extra hardware expense) to practical constraints, such as limits on how many nodes can be connected to a single disk, network bandwidth limitations, or the complexity involved in ensuring that exactly one of the backups takes over after a primary failure.

A cluster that allows only one backup per primary node will show near linear scalability of redundancy when increasing the cluster from 0 backups per primary to 1 backup per primary, as this represents a near ideal increase in reliability. The addition of a second or third backup will yield a further scalability of zero since, in this example, we are assuming the cluster is incapable of utilizing these extra nodes as live backups.

The other extreme is the case of full replication (described below), where every node in the cluster stores every single message. This would correspond to consistent near linear scalability with increasing number of nodes. This type of cluster provides the best possible safety, because even if only one node of the cluster survives, no messages are lost. This comes at the price of poor scalability for throughput and storage capacity.

Network Partitioning

Earlier I mentioned that network failures within the cluster present special problems. A failed or malfunctioning network component (or sometimes even a broken network cable) can cause the cluster to be split into two parts, where each part is still functional. This is referred to as network partitioning. The most fundamental part of the problem is that it is often impossible for the cluster nodes to detect the difference between network partitioning and the failure of a group of nodes.

When partitioning occurs, it appears to each partition that the nodes in the other partition have failed and each partition tries to keep operating. Each partition will have a different set of clients, and will see different messages. When the network partition is repaired, the nodes in the different partitions will have inconsistent states that are difficult or impossible to reconcile. Assuming that the cluster can detect that a partition has occurred (again this by itself is not trivial) the most basic solution is to force all but one of the partitions to shut down. It is difficult to automate this, since the partitions cannot communicate with each other to come to an agreement.

The most logical action is to kill off all partitions that do not contain a majority of nodes, but the cluster may be split evenly, or it may be split three ways and there is no majority. The common solution for this is to designate one node as the "tiebreaker" when the cluster is started. In the case of partitioning, all partitions that do not contain the tiebreaker shut down.
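The tiebreaker rule itself is trivial to express; the hard part in practice is detecting the partition and agreeing on the cluster membership in the first place. A minimal sketch, with hypothetical names (note that a real implementation must also handle the case where the tiebreaker node itself has failed, or this rule would shut the whole cluster down):

import java.util.Set;

// Decides whether the partition this node finds itself in may keep running.
public class PartitionPolicy {

    private final String tiebreakerNodeId;   // designated once, when the cluster starts

    public PartitionPolicy(String tiebreakerNodeId) {
        this.tiebreakerNodeId = tiebreakerNodeId;
    }

    public boolean mayKeepRunning(Set reachableNodeIds) {
        // At most one partition can contain the tiebreaker, so at most one
        // partition will ever decide to keep running.
        return reachableNodeIds.contains(tiebreakerNodeId);
    }
}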


This is all very complicated and any workable solution depends very much on the characteristics of the particular cluster and exactly how the nodes get partitioned. I will present a trivial example with two message server nodes in primary-backup configuration to demonstrate why this is such a difficult topic for cluster designers. The diagram shows the various stages of failure and recovery:

Stage 1 shows normal operation where all clients connect to node A, the primary, to produce and consume messages.

In Stage 2, the network connection between the machines is broken and node B, the backup, switches to primary mode and accepts client connections. Node A is also still primary and has its own clients. Each node handles different messages from this point onward.

In Stage 3, the network connection is restored. Now the two primary nodes can "see" each other, and each contains different messages. If one just shuts down, then messages will be lost. The only way to properly recover from this is to devise a process for the two nodes to reconcile their message stores, transfer all clients to one node, and revert the other node to backup mode. The nodes must also ensure that all topic subscribers receive all messages that they might have missed while the network connection was down.

It is not possible to rule out out-of-order message delivery. It suffices to say that if the cluster contains more than two nodes, reconciliation is much more complex. If you need your JMS cluster to be resilient to network partitions, it is important to know how your JMS provider behaves in this scenario.

Other Aspects of Scalability

There are many factors that can lead to limits in the scalability of a cluster. As a rule, a cluster that is designed to provide unlimited scalability with respect to some capacity will still saturate at some point, with scalability approaching zero. As scalability is pushed to extremes, even such trivial things as a list of the addresses of the other machines in the cluster become impossible to maintain. Bottlenecks can arise, for example, at the consumers, as a result of limited concurrency, and because of a lack of network bandwidth.

Consumers

If a message-server cluster tries to avoid storing every message on every node, then it needs to have a way to match up a consumer on one node with a message stored on another node. Often this will lead to each node maintaining a list of all the consumers interested in messages stored in the topics on that node. The data structure for one consumer may be small, but if there are thousands of consumers for each destination, then even this can cause the cluster to saturate because the nodes run out of room to store consumer information.

Concurrency

As described earlier in the context of connection scalability, each thread consumes a certain amount of memory. This leads to a limit to the number of threads that can exist in a process. There are other ways that concurrency can be limited, though. The operating system may place a limit on the number of threads that can actually execute simultaneously. Software architectures that require excessive thread synchronization can exhibit poor concurrency even though enough resources might otherwise be available.

The effect of poor concurrency on a cluster is that one overloaded node can cause other, less heavily loaded nodes to slow down. This happens because the nodes must interact, and if that interaction is accomplished using a blocking request/reply protocol, then a node initiating a request to another node relies on its other threads to make effective use of the CPU while one thread is waiting for a reply.

The same node that is waiting for a reply is relying on a sufficient amount of concurrency in the other node to allow it to service the request in a timely manner. Insufficient concurrency on the target node will mean that the reply is delayed because some other action on the target node, which has nothing to do with the reply, has to complete before the request can be serviced. Poor concurrency is characterized by low CPU usage on the nodes that are actively processing messages, despite poorer than expected overall performance.

Network Bandwidth

As we will see below, many cluster architectures rely exclusively on a LAN for inter-node communication. LANs have significantly less bandwidth than the various data conduits within an individual computer. For this reason, the network can easily become a performance bottleneck. Much can be achieved by using high-speed networks and active network components (switches, routers) to interconnect the cluster nodes. Cluster architectures that permit nodes to communicate over multiple independent networks have an advantage here.


