LVS Scheduling Methods


Having discussed three ways to forward packets to the nodes inside the cluster, let's look at how to distribute the workload across the cluster nodes. When the Director receives an incoming request from a client computer to access a cluster service on its VIP, it has to decide which cluster node should get the request. The scheduling methods the Director can use to make this decision fall into two basic categories: fixed scheduling and dynamic scheduling.

Note 

When a node needs to be taken out of the cluster for maintenance, you can set its weight to 0 using any of the weighted scheduling methods (fixed or dynamic). When a cluster node's weight is 0, no new connections will be assigned to it, so maintenance can be performed after all of the users log out normally at the end of the day. We'll discuss cluster maintenance in detail in Chapter 19.

Fixed (or Non-Dynamic) Scheduling Methods

In the case of fixed, or non-dynamic, scheduling methods, the Director selects the cluster node to use for the inbound request without checking to see how many of the previously assigned connections are active. Here is the current list of fixed scheduling methods:

Round-robin (RR)

  • When a new request is received, the Director picks the next server on its list of servers, rotating through them in an endless loop.

Weighted round-robin (WRR)

  • You assign each cluster node a weight or ranking, based on how much processing load it can handle. This weight is then used, along with the round-robin technique, to select the next cluster node to be used when a new request is received, regardless of the number of connections that are still active. A server with a weight of 2 will receive twice as many new connections as a server with a weight of 1. If you change the weight of a server to 0, no new connections will be allowed to the server (but currently active connections will not be dropped). A sketch of this weighted rotation appears after this list. We'll look at how LVS uses this weight to balance the incoming workload in the "Weighted Least-Connection (WLC)" section of this chapter.

Destination hashing

  • This method always sends requests for the same IP address to the same server in the cluster. Like the locality-based least-connection (LBLC) scheduling method (which will be discussed shortly), this method is useful when the servers inside the cluster are really cache or proxy servers.

Source hashing

  • This method can be used when the Director needs to be sure the reply packets are sent back to the same router or firewall that the requests came from. This scheduling method is normally only used when the Director has more than one physical network connection, so that the Director knows which firewall or router to send the reply packet back through to reach the proper client computer.
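
To make the weighted round-robin rotation concrete, here is a minimal user-space sketch of the interleaved weighting loop (the in-kernel scheduler lives in ip_vs_wrr.c; the structure and helper names here are hypothetical, and this is an illustration rather than the kernel code):

    /* Minimal sketch of interleaved weighted round-robin. */
    #include <stdio.h>

    struct node { const char *name; int weight; };

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

    /* Pick the next node; *i (position) and *cw (current weight)
     * persist across calls. */
    static struct node *wrr_next(struct node *nodes, int n, int *i, int *cw)
    {
        int g = 0, max = 0, k;

        for (k = 0; k < n; k++) {
            g = gcd(g, nodes[k].weight);
            if (nodes[k].weight > max)
                max = nodes[k].weight;
        }
        if (max == 0)
            return NULL;                /* every weight is 0: nothing eligible */
        for (;;) {
            *i = (*i + 1) % n;
            if (*i == 0) {              /* completed a pass: lower the bar */
                *cw -= g;
                if (*cw <= 0)
                    *cw = max;
            }
            if (nodes[*i].weight >= *cw)
                return &nodes[*i];
        }
    }

    int main(void)
    {
        struct node nodes[] = { { "A", 3 }, { "B", 1 } };
        int i = -1, cw = 0, r;

        for (r = 0; r < 8; r++)         /* prints: A A A B A A A B */
            printf("%s ", wrr_next(nodes, 2, &i, &cw)->name);
        printf("\n");
        return 0;
    }

With weights of 3 and 1, node A receives three of every four new connections, and a node whose weight is 0 is never selected.

Destination and source hashing reduce to a deterministic hash of a single address, so the same address always maps to the same node. A minimal sketch, assuming a simple multiplicative hash (the kernel schedulers are ip_vs_dh.c and ip_vs_sh.c; this only illustrates the idea):

    #include <stdint.h>
    #include <stdio.h>

    #define NSERVERS 3

    /* Destination hashing keys on the destination IP; source hashing is
     * the same idea keyed on the client's source IP instead. */
    static unsigned pick_server(uint32_t ip)
    {
        return (unsigned)((ip * 2654435761u) % NSERVERS);
    }

    int main(void)
    {
        uint32_t dest = 0xC0A80A0A;     /* 192.168.10.10 as a host-order integer */

        /* Both requests land on the same server. */
        printf("request 1 -> server %u\n", pick_server(dest));
        printf("request 2 -> server %u\n", pick_server(dest));
        return 0;
    }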

Dynamic Scheduling Methods

Dynamic scheduling methods give you more control over the incoming workload, at the cost of only a small amount of extra processing load on the Director. When dynamic scheduling methods are used, the Director keeps track of the number of active and inactive connections for each cluster node and uses this information to determine which cluster node to use when a new request arrives for a cluster service. An active connection is a TCP network session that remains open (in the ESTABLISHED state) while the client computer and cluster node are sending data to each other. In a Linux Enterprise Cluster, telnet or ssh sessions remain active as long as the user is logged on.[12]

An inactive connection, on the other hand, is any network connection that is not in the ESTABLISHED state. If a TCP inactivity timeout causes the connection to drop, or if the client computer sends a FIN packet to close the connection, LVS keeps the connection in the IPVS table for a brief period in case subsequent packets for the connection arrive to reestablish the TCP connection. This may happen, for example, when packets are resent due to transmission problems. The Director, in other words, attempts to protect the integrity of the connection between the client computer and the cluster node when there are minor network transmission problems.

Note 

This discussion is more theoretical than practical when using telnet or ssh for user sessions in a Linux Enterprise Cluster. The profile of the user applications (the way the CPU, disk, and network are used by each application) varies over time, and user sessions last a long time (hours, if not days). Thus, balancing connections as they arrive has only a limited effect on how evenly the total workload is spread over time.

As of this writing, the following dynamic scheduling methods are available:

  • Least-connection (LC)

  • Weighted least-connection (WLC)

  • Shortest expected delay (SED)

  • Never queue (NQ)

  • Locality-based least-connection (LBLC)

  • Locality-based least-connection with replication scheduling (LBLCR)

Least-Connection (LC)

With the least-connection scheduling method, when a new request for a service running on one of the cluster nodes arrives at the Director, the Director looks at the number of active and inactive connections to determine which cluster node should get the request.

The mathematical calculation performed by the Director to make this decision is as follows: For each node in the cluster, the Director multiplies the number of active connections the cluster node is currently servicing by 256, and then it adds the number of inactive connections (recently used connections) to arrive at an overhead value for each node. The node with the lowest overhead value wins and is assigned the new incoming request for service.[13]

If the mathematical calculation results in the same overhead value for all of the cluster nodes, the first node found in the IPVS table of cluster nodes is selected.[14]
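
Expressed as code, the decision looks something like the following minimal user-space sketch (the structure and helper names are hypothetical; the real scheduler is the ip_vs_lc.c file cited in note 13):

    #include <stdio.h>

    struct node { const char *name; int active, inactive; };

    /* overhead = active * 256 + inactive */
    static long overhead(const struct node *n)
    {
        return (long)n->active * 256 + n->inactive;
    }

    static const struct node *lc_pick(const struct node *nodes, int n)
    {
        const struct node *best = &nodes[0];
        int i;

        for (i = 1; i < n; i++)         /* strict <, so ties keep the earlier node */
            if (overhead(&nodes[i]) < overhead(best))
                best = &nodes[i];
        return best;
    }

    int main(void)
    {
        /* A: 5*256 + 100 = 1380;  B: 4*256 + 400 = 1424 -> A wins */
        struct node nodes[] = { { "A", 5, 100 }, { "B", 4, 400 } };

        printf("next request -> %s\n", lc_pick(nodes, 2)->name);
        return 0;
    }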

Weighted Least-Connection (WLC)

The weighted least-connection scheduling method combines the least-connection method and a specified weight or ranking for each server to select the cluster node. (This is the default selection method if you do not specify one.) This method was intended for use in clusters with nodes that have differing processing capabilities.

The Director determines which cluster node to assign to a new inbound request for a cluster service by first calculating the overhead value (as described earlier in the discussion of the LC scheduling method) for each cluster node and then dividing this value by the weight you have assigned to the cluster node to arrive at a WLC value for each cluster node. The cluster node with the lowest WLC value wins, and the incoming request is assigned to that node.[15]

If the WLC value for all of the nodes is the same, the first node found in the list of cluster nodes is selected. (We'll talk more about this list, which is called the IPVS table, in the next three chapters.)
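
As note 15 mentions, the kernel cannot use floating-point division, so it compares the overhead-to-weight ratios by cross-multiplying. A minimal user-space sketch of the same comparison (hypothetical names again; see ip_vs_wlc.c for the real code):

    #include <stdio.h>

    struct node { const char *name; int active, inactive, weight; };

    static long overhead(const struct node *n)
    {
        return (long)n->active * 256 + n->inactive;
    }

    /* Lowest overhead/weight wins; a/b < c/d is tested as a*d < c*b to
     * avoid division. Nodes with weight 0 are assumed to have been
     * filtered out already. */
    static const struct node *wlc_pick(const struct node *nodes, int n)
    {
        const struct node *best = &nodes[0];
        int i;

        for (i = 1; i < n; i++)
            if (overhead(&nodes[i]) * best->weight <
                overhead(best) * nodes[i].weight)
                best = &nodes[i];
        return best;
    }

    int main(void)
    {
        /* A: 768/1 = 768;  B: 1280/2 = 640 -> B wins despite more connections */
        struct node nodes[] = { { "A", 3, 0, 1 }, { "B", 5, 0, 2 } };

        printf("next request -> %s\n", wlc_pick(nodes, 2)->name);
        return 0;
    }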

The WLC scheduling method is a good choice for a Linux Enterprise Cluster because it does a good job of balancing the workload of a typical enterprise.

Shortest Expected Delay (SED)

SED is a recent addition to the LVS scheduling methods, and it may offer a slight improvement over the WLC method for services that use TCP and remain in an active state while the cluster node is processing each request (large batch jobs are a good example of this type of request).

The SED calculation is performed as follows: The overhead value for each cluster node is calculated by adding 1 to the number of active connections. The overhead value is then divided by the weight you assigned to each node to arrive at the SED value. The cluster node with the lowest SED value wins.

There are two things to notice about the SED scheduling method:

  • It does not use the number of inactive connections when determining the overhead of each cluster node.

  • It adds 1 to the number of active connections to anticipate what the overhead will look like after the new incoming connection has been assigned.
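
A minimal sketch of the SED choice (hypothetical names; like WLC, the division can be avoided by cross-multiplying):

    #include <stdio.h>

    struct node { const char *name; int active, weight; };

    /* Lowest (active + 1) / weight wins. */
    static const struct node *sed_pick(const struct node *nodes, int n)
    {
        const struct node *best = &nodes[0];
        int i;

        for (i = 1; i < n; i++)
            if ((long)(nodes[i].active + 1) * best->weight <
                (long)(best->active + 1) * nodes[i].weight)
                best = &nodes[i];
        return best;
    }

    int main(void)
    {
        /* The example that follows: 11/1 = 11 versus 31/3 = 10.33 */
        struct node nodes[] = { { "slow", 10, 1 }, { "fast", 30, 3 } };

        printf("next request -> %s\n", sed_pick(nodes, 2)->name);   /* fast */
        return 0;
    }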

For example, let's say you have two cluster nodes and one is three times faster than the other (one has a 1 GHz processor and the other has a 3 GHz processor[16]), so you decide to assign the slower machine a weight of 1 and the faster machine a weight of 3. Suppose the cluster has been up and running for a while, and the slower node has 10 active connections and the faster node has 30 active connections. When the next new request arrives, the Director must decide which cluster node to assign. If this new request is not added to the number of active connections for each of the cluster nodes, the SED values would be calculated as follows:

Slower node (1 GHz processor)

10 active connections / weight 1 = 10

Faster node (3 GHz processor)

30 active connections / weight 3 = 10

Because the SED values are the same, the Director will pick whichever node happens to appear first in its table of cluster nodes. If the slower cluster node happens to appear first in the table of cluster nodes, it will be assigned the new request even though it is the slower node.

If the new connection is first added to the number of active connections, however, the calculations look like this:

Slower node (1 GHz processor)

11 active connections / weight 1 = 11

Faster node (3 GHz processor)

31 active connections / weight 3 = 10.33

The faster node now has the lower SED value, so it is properly assigned the new connection request.

A side effect of adding 1 to the number of active connections is that a cluster node may sit idle even though multiple requests are assigned to another cluster node. For example, let's use our same two cluster nodes, but this time we'll assume the slower cluster node has no active connections and the faster node has one active connection. The SED calculation for each node looks like this (recall that 1 is added to the number of active connections):

Slower node (1 GHz processor)

1 active connection / weight 1 = 1

Faster node (3 GHz processor)

2 active connections / weight 3 = 0.67

So the new request gets assigned to the faster cluster node even though the slower cluster node is idle. This may or may not be desirable behavior, so another scheduling method was developed, called never queue.

Never Queue (NQ)

This scheduling method enhances the SED scheduling method by adding one new feature: if a cluster node has no active connections, it is always assigned the new incoming request for service, regardless of the result of the calculated SED values for each cluster node.
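
Reusing the struct node and sed_pick from the SED sketch above, the never-queue rule is a short early exit (again an illustrative sketch, not the kernel code in ip_vs_nq.c):

    /* NQ = SED plus one rule: an idle node is chosen immediately. */
    static const struct node *nq_pick(const struct node *nodes, int n)
    {
        int i;

        for (i = 0; i < n; i++)
            if (nodes[i].active == 0)
                return &nodes[i];       /* never queue behind a busy node */
        return sed_pick(nodes, n);      /* otherwise fall back to the SED choice */
    }

With the earlier numbers (slow node idle, fast node with one connection), nq_pick returns the idle slow node instead of the fast one.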

Locality-Based Least-Connection (LBLC)

Directors can also direct outbound traffic to a set of transparent proxy servers. In this configuration, the cluster nodes are transparent proxy or web cache servers that sit between client computers and the Internet.[17]

When the LBLC scheduling method is used, the Director attempts to send all requests destined for a particular IP address (a particular web server) to the same transparent proxy server (cluster node). In other words, the first time a request comes in for a web server on the Internet, the Director will pick one proxy server to service this destination IP address using a slightly modified version of the WLC scheduling method,[18] and all future requests for this same destination IP address will continue to go to the same proxy server. This method of load balancing, like the destination-hashing scheduling method described previously, is, therefore, a type of destination IP load balancing.

The Director will continue to send all requests for a particular destination IP address to the same cluster node (the same transparent proxy server) until it sees that another node in the cluster has a WLC value that is half of the WLC value of the assigned cluster node. When this happens, the Director will reassign the cluster node that is responsible for the destination IP address (usually an Internet web server) by selecting the least loaded cluster node using the modified WLC scheduling method.

In this method, the Director tries to associate only one proxy server to one destination IP address. To do this, the Director maintains a table of destination IP addresses and their associated proxy servers. This method of load balancing attempts to maximize the number of cache hits on the proxy servers, while at the same time reducing the amount of redundant, or replicated, information on these proxy servers.
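
The following minimal sketch illustrates the binding table and the rebinding rule (all structures and names here are hypothetical simplifications; the real scheduler is ip_vs_lblc.c, and as note 18 says, its overhead multiplies the active connections by 50 rather than 256):

    #include <stdint.h>
    #include <stdio.h>

    struct node { const char *name; int active, inactive, weight; };

    /* The modified WLC overhead: active * 50 + inactive (see note 18). */
    static long overhead(const struct node *n)
    {
        return (long)n->active * 50 + n->inactive;
    }

    static const struct node *least_loaded(const struct node *nodes, int n)
    {
        const struct node *best = &nodes[0];
        int i;

        for (i = 1; i < n; i++)
            if (overhead(&nodes[i]) * best->weight <
                overhead(best) * nodes[i].weight)
                best = &nodes[i];
        return best;
    }

    /* One binding per destination IP (a real table would be hashed). */
    struct binding { uint32_t dest; const struct node *server; };

    static const struct node *lblc_pick(struct binding *b,
                                        const struct node *nodes, int n)
    {
        const struct node *least = least_loaded(nodes, n);

        /* Stay with the bound server (to maximize cache hits) unless it
         * now carries more than twice the load of the least loaded node. */
        if (!b->server || overhead(b->server) > 2 * overhead(least))
            b->server = least;
        return b->server;
    }

    int main(void)
    {
        struct node nodes[] = { { "proxy1", 9, 0, 1 }, { "proxy2", 2, 0, 1 } };
        struct binding b = { 0xC0A80A0A, &nodes[0] };   /* currently bound to proxy1 */

        /* proxy1's overhead (450) is more than twice proxy2's (100): rebind. */
        printf("destination now served by %s\n", lblc_pick(&b, nodes, 2)->name);
        return 0;
    }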

Locality-Based Least-Connection with Replication Scheduling (LBLCR)

The LBLCR scheduling method (which is also a form of destination IP load balancing) attempts to improve on the LBLC scheduling method by maintaining a set of proxy servers that can service each destination IP address. When a new connection request comes in, the Director will select the proxy server with the fewest number of active connections from this set of servers.

Note 

See the file /proc/net/ip_vs_lblcr for the servers associated with each destination IP address on a Director that is using the LBLCR scheduling method.

At the time of a new connection request from a client computer, proxy servers are added to this set for a particular destination IP address when the Director notices a cluster node (proxy server) that has a WLC[19] value equal to half of the WLC value of the least loaded node in the set. When this happens, the cluster node with the lowest WLC value is added to the set and is assigned the new incoming request.

Proxy servers are removed from the set when the list of servers in the set has not been modified within the last six minutes (meaning that no new proxy servers have been added, and no proxy servers have been removed). If this happens, the Director will remove the server with the most active connections from the set. The proxy server (cluster node) will continue to service the existing active connections, but any new requests will be assigned to a different server still in the set.
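
A minimal sketch of the per-destination server set and its two maintenance rules (heavily simplified and hypothetical; the real scheduler, ip_vs_lblcr.c, ranks nodes by the modified WLC value from note 19, while this sketch uses raw active-connection counts):

    #include <stdio.h>
    #include <time.h>

    #define SET_MAX 8

    struct node { const char *name; int active; };

    struct dest_set {
        const struct node *members[SET_MAX];
        int count;
        time_t modified;                /* time of the last add or remove */
    };

    static int fewest(const struct node *const *v, int n)
    {
        int i, best = 0;

        for (i = 1; i < n; i++)
            if (v[i]->active < v[best]->active)
                best = i;
        return best;
    }

    static const struct node *lblcr_pick(struct dest_set *s,
                                         const struct node *outside,
                                         time_t now)
    {
        const struct node *least = s->members[fewest(s->members, s->count)];

        /* Grow the set when a node outside it is doing half the work of
         * our least loaded member; the newcomer takes this request. */
        if (outside && s->count < SET_MAX &&
            least->active >= 2 * outside->active) {
            s->members[s->count++] = outside;
            s->modified = now;
            return outside;
        }

        /* Shrink: if the set has gone unmodified for six minutes, drop
         * the busiest member (its active connections finish normally). */
        if (s->count > 1 && now - s->modified > 6 * 60) {
            int i, busiest = 0;

            for (i = 1; i < s->count; i++)
                if (s->members[i]->active > s->members[busiest]->active)
                    busiest = i;
            s->members[busiest] = s->members[--s->count];
            s->modified = now;
        }
        return s->members[fewest(s->members, s->count)];
    }

    int main(void)
    {
        struct node a = { "proxy1", 8 }, b = { "proxy2", 3 };
        struct dest_set s = { { &a }, 1, 0 };

        /* proxy2's load is under half of proxy1's, so it joins the set. */
        printf("next request -> %s\n", lblcr_pick(&s, &b, time(NULL))->name);
        return 0;
    }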

Note 

None of these scheduling methods takes into account the processing load or the disk or network I/O a cluster node is experiencing, nor does any of them anticipate how much load a new inbound request will generate when assigning the incoming request. You may see references to two projects that were designed to address this need, called feedbackd and LVS-KISS. As of this writing, however, these projects are not widely deployed, so you should evaluate them carefully before using them in production.

[12]Of course, it is possible that no data may pass between the cluster node and client computer during the telnet session for long periods of time (when a user runs a batch job, for example). See the discussion of the TCP session timeout value in the "LVS Persistence" section in Chapter 14.

[13]See the 143 lines of source code in the ip_vs_lc.c file in the LVS source code distribution (version 1.0.10).

[14]The first cluster node that is capable of responding for the service (or port number) the client computer is requesting. We'll discuss the IP Virtual Server or IPVS table in more detail in the next three chapters.

[15]The code actually uses a mathematical trick to multiply instead of divide to find the node with the best overhead-to-weight value, because floating-point numbers are not allowed inside the kernel. See the ip_vs_wlc.c file included with the LVS source code for details.

[16]For the moment, we'll ignore everything else about the performance capabilities of these two nodes.

[17]A proxy server stores a copy of the web pages, or server responses, that have been requested recently so that future requests for the same information can be serviced without the need to ask the original Internet server for the same information again. See the "Transparent Proxy with Linux and Squid mini-HOWTO," available at http://www.faqs.org/docs/Linux-mini/TransparentProxy.html among other places.

[18]The modification is that the overhead value is calculated by multiplying the active connections by 50 instead of by 256.

[19]Again the overhead value used to calculate the WLC value is calculated by multiplying the number of active connections by 50 instead of 256.


