Scaling Your Web Service

Maintaining High Availability

A Web service might be able to scale to handle the volume of requests received from clients, but it will be useful only if it is up and running. Ensuring that a Web service provides the necessary level of availability is just as important as ensuring that it can scale to meet the needs of its clients.

Availability is often defined as the percentage of time the system is up during its scheduled hours of operation. For example, if a Web service has 99.9 percent availability, the system is down for no more than 0.1 percent of the total time it is scheduled to be operational.

The percentage of uptime is meaningful only if you know the Web service's scheduled hours of operation. For example, a Web service that must be operational 24/7 and requires 99.999 percent uptime can have only 5.3 minutes of downtime per year, including downtime for maintenance. Compare that to a Web service that needs to be available only between 9 A.M. and 5 P.M. on weekdays. If the Web service requires 99.999 percent uptime, it can experience only about one minute of unscheduled downtime a year but can have a total of 6656 hours of maintenance per year.
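To make the arithmetic concrete, here is a minimal C# sketch (not from any product) that computes the annual downtime budget from an availability target and the scheduled hours of operation. The two figures above fall out directly:

using System;

class DowntimeBudget
{
    // Returns the allowed downtime in minutes per year, given an
    // availability target (e.g., 0.99999) and the number of hours
    // per year the service is scheduled to be operational.
    static double AllowedDowntimeMinutes(double availability,
        double scheduledHoursPerYear)
    {
        return scheduledHoursPerYear * (1.0 - availability) * 60.0;
    }

    static void Main()
    {
        // 24/7 operation: 365 days * 24 hours = 8760 scheduled hours.
        Console.WriteLine("24/7 at 99.999%: {0:F1} minutes/year",
            AllowedDowntimeMinutes(0.99999, 365 * 24));   // ~5.3

        // Weekdays 9-5: 52 weeks * 5 days * 8 hours = 2080 hours.
        Console.WriteLine("9-to-5 at 99.999%: {0:F1} minutes/year",
            AllowedDowntimeMinutes(0.99999, 52 * 5 * 8)); // ~1.2
    }
}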

One key factor in creating a highly available Web service is to ensure that there are no single points of failure. This encompasses every resource used by the Web service—including the server that hosts the Web service, the network elements responsible for routing the requests to the Web service, and the power for the network elements and servers.

Once you have determined that there are no single points of failure, you need to ensure that if one of the components should fail, the infrastructure supporting the Web service is still capable of carrying the entire load. For example, if the cluster hosting your Web service is front-ended by two network routers, you should ensure that a single router is capable of handling the network traffic.

When you are planning the maximum capacity that any one element within the system should carry, take into consideration the cumulative load that element must absorb if its peers fail. For example, suppose during normal operations you have two servers within the cluster that are actively servicing client requests. When determining the amount of memory that should be installed in each server, take into account issues such as memory fragmentation. If each node runs at 50 percent memory usage and one node fails, the surviving node might not be able to handle the additional requests due to memory fragmentation. (Note that this is less of an issue for managed applications, because the garbage collector can compact the heap.)
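As a rough planning aid, the following C# sketch computes the maximum utilization each node can sustain during normal operations if the cluster must survive the loss of one node. The 10 percent headroom figure is an illustrative allowance for effects such as memory fragmentation, not a measured value:

using System;

class CapacityPlanner
{
    // If the cluster must survive the loss of one node, the surviving
    // nodes must absorb the full load. This returns the maximum
    // utilization each node can run at during normal operations.
    // "headroom" reserves capacity for effects such as memory
    // fragmentation that keep a node from using 100 percent of its
    // nominal capacity.
    static double MaxNormalUtilization(int nodeCount, double headroom)
    {
        double usableCapacity = 1.0 - headroom;
        return usableCapacity * (nodeCount - 1) / nodeCount;
    }

    static void Main()
    {
        // Two nodes with 10 percent headroom: each node should run
        // at no more than 45 percent during normal operations.
        Console.WriteLine("{0:P0}", MaxNormalUtilization(2, 0.10));
    }
}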

More important, you must ensure that the necessary procedures are in place for administering and maintaining a highly available Web service. These include a solid disaster recovery plan, documentation of the server configuration, and a solid change management strategy.

In short, you should make sure that your Web service is managed by qualified administrators. You can find many excellent resources for administrators that examine best practices for managing highly available applications. One of them is the Microsoft Operations Framework (MOF); you can find information about it at http://www.microsoft.com/mof.

Next I provide a high-level overview of the software and hardware required to create a highly available Web service. Then I explain some of the paradigm shifts you should make when you program against a highly available resource.

Highly Available Scale-Up Resources

By definition, a resource that relies on the scale-up strategy is a single point of failure. If the server hosting the resource goes down, the resource is no longer available. To achieve high availability for a resource hosted on a single server, you can use a failover cluster.

A failover cluster is composed of multiple machines; one machine is active, and one or more machines serve as backups. If the active machine is unable to service requests, a backup machine is brought on line and client requests are automatically directed to it.

For Windows, the predominant failover cluster platform is Microsoft Clustering Service (MSCS). MSCS ships with both Windows 2000 Advanced Server and Windows 2000 Datacenter Server. The former supports two-node clusters, and the latter supports four-node clusters.

Any resource can be hosted on an MSCS cluster, but only MSCS-aware resources can take full advantage of the functionality provided by an MSCS cluster. A number of resources are cluster aware, including SQL Server, MSMQ, and NTFS file shares.

MSCS supports the “shared nothing” model in which each node in the cluster has its own system bus and access to disk subsystems and the network. In general, the active node in the cluster is given exclusive access to a particular disk subsystem. You can access specific data on a particular resource only through one node of the cluster at any given time.

If a disk subsystem contains data used by the resource to process client requests, it must be accessible by every node in the cluster. If the active node fails, MSCS will designate another node in the cluster to serve as the active node. As part of the failover process, the new active node will gain exclusive access to the disk subsystem.

The disk subsystem containing the data necessary to process client requests is a single point of failure. If the disk subsystem fails, none of the nodes in the cluster will be able to process client requests. Therefore, the disk subsystem is usually hosted on a RAID 5 disk array.

MSCS Components

MSCS has three primary components: the Cluster service, the Resource Monitor, and the Resource DLL.

The Cluster service is a Windows NT service that is responsible for the overall control of the cluster. It has the following responsibilities:

  • Monitoring the status of the nodes in the cluster

  • Coordinating the initialization and cleanup process when nodes are added to and removed from the cluster

  • Maintaining a database that contains information about the cluster, including the cluster's name and resource types installed on the cluster

The Resource Monitor enables communication between the Cluster service and one or more resources hosted on a node in the cluster. If the Cluster service fails, the Resource Monitor is responsible for taking the resources on a particular node off line.

The Resource Monitor is hosted within its own process. This prevents a misbehaving resource from taking down the cluster. In addition, multiple Resource Monitors can be hosted on a particular node. If you have a resource hosted on the MSCS cluster that is particularly unstable, you can configure it within its own Resource Monitor.

Cluster-aware resources have their own Resource DLL that is installed on each node in the cluster. The Resource DLL is loaded by the Resource Monitor and is accessed via a well-known set of interfaces defined by the Cluster API. These interfaces enable the Cluster service to obtain information about the resource and also allow the Resource Monitor to tell the Resource DLL to take the resource on line or off line.

If your Web service will leverage a clustered resource in production, it is often helpful to develop against a clustered resource within the development environment. However, installing production-quality clustering hardware in a development environment is often prohibitively expensive. An alternative is to install SCSI adapters in two servers and connect them to an external SCSI disk drive. After you get MSCS installed and running, you can install any number of MSCS resources on the cluster.

Highly Available Scale-Out Resources

Although scale-out resources are hosted on multiple servers, basic scale-out strategies do not inherently provide high availability. Ensuring that a resource deployed using the scale-out strategy is fault tolerant takes deliberate planning.

Partitioned resources are not fault tolerant. Each node hosts a portion of the resource, and if a node becomes unavailable, so does the portion of the resource it hosts. Each node within a partitioned resource is therefore a single point of failure.

To ensure that a partitioned resource is fault tolerant, you must create a failover cluster for each node. For example, if the resource is partitioned across five servers, you must create and maintain five failover clusters. This bolsters the argument for avoiding partitioned resources whenever possible.

What might be less obvious is that a network load–balanced cluster is not inherently fault tolerant. The NLB system must know whether the resource itself is on line. For example, if a node hosting the Banking Web service loses its connection to the database, it returns a SOAP exception to the user. Because IP traffic can still be routed to the node, the NLB system continues to route requests to the instance of the Web service that is unable to connect to the database.

NLB algorithms that route requests based on server utilization can worsen the problem by actually increasing the number of requests routed to the troubled node. For example, suppose the NLB system used to route requests to the Banking Web service sends each request to the node with the lowest CPU utilization, and suppose the HTTP service stops on one of the nodes in the cluster. Because that node is no longer processing requests, its CPU utilization will drop significantly. As a result, the NLB system will route even more requests to the disabled Web server because its CPU utilization is very low compared to that of the other nodes in the cluster.

The algorithm used to detect when a node is no longer able to process requests is usually specific to the resource that is being load balanced. You can use products such as Microsoft Application Center to monitor nodes in a cluster to ensure that they are capable of processing requests.

You can configure Microsoft Application Center to periodically send each node in the cluster an HTTP request and parse the response to ensure that a success message is returned. If a success message is not returned, Microsoft Application Center will communicate with the NLB system to remove the node from the cluster.
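Application Center's own configuration is beyond the scope of this discussion, but the general pattern is easy to sketch. The following C# HTTP handler is a minimal health-check endpoint that a monitor could poll; it verifies the node's database connection rather than merely whether IIS is responding. The connection string and the "OK"/"FAIL" tokens are illustrative assumptions, and registering the handler in web.config is omitted:

using System;
using System.Data.SqlClient;
using System.Web;

// A minimal health-check handler. A monitor polls this URL and
// looks for the literal text "OK"; any other response (or an HTTP
// error) indicates the node should be pulled from the cluster.
public class HealthCheckHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "text/plain";
        try
        {
            // Placeholder connection string; check the same database
            // the Web service depends on, so a lost database
            // connection fails the check even though IIS is still up.
            using (SqlConnection conn = new SqlConnection(
                "Server=dbserver;Database=Banking;Integrated Security=SSPI"))
            {
                conn.Open();
                context.Response.Write("OK");
            }
        }
        catch (Exception)
        {
            context.Response.StatusCode = 500;
            context.Response.Write("FAIL");
        }
    }

    public bool IsReusable
    {
        get { return true; }
    }
}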

If you are on a budget, you can use the HTTPMon utility that ships with the Windows 2000 Resource Kit to monitor nodes in a cluster. HTTPMon is not as feature rich or as easy to use as Microsoft Application Center, but it monitors Web servers within an NLB cluster by posting HTTP requests and parsing the results. If an unexpected result is received, the node is removed from the cluster.

Programming Against a Highly Available Resource

One common characteristic shared by load-balanced clusters and failover clusters is that they are generally invisible to the client. The client should not be able to tell the difference between a clustered resource and a stand-alone server. The client should use the same API regardless of whether the resource is clustered.

Even though the method of accessing a clustered resource does not change, you should take special measures to make sure that you maximize the benefits of programming against a highly available clustered resource. Here are some of these measures:

  • When a request made to the resource fails, retry the request. Most high-availability technologies are reactive and will remove a machine from the cluster only after a request made to that machine fails, so be sure you have the appropriate retry logic within your application. (A sketch of such retry logic follows this list.)

  • Take into account the time it takes for the clustered resource to recover from a failure. When you retry a request, be aware of how long the cluster needs to recover from a server failure. For example, if a server fails in an NLB cluster, by default it takes at least five seconds for the other servers in the cluster to start the convergence process. Depending on the resource, an MSCS cluster can take considerably longer to recover. If the application performs a single retry without regard to time, the request will fail and the application will not exploit the availability the cluster offers.

  • Take into account state that might not be automatically rehydrated on the new server. If a server fails while the client is in the middle of a session with a resource that maintains state, any state that cannot be failed over to another node in the cluster is lost. Any resource that requires server affinity falls into this category. If the resource is a Web service, it is an excellent candidate for refactoring so that requests made to it are atomic and any state persisted between requests is saved to a data store accessible by the other nodes in the cluster. (A sketch of such a shared state store follows the retry example below.)
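Here is a minimal C# sketch of the retry logic described in the first two bullets. The timeout and retry interval are illustrative; choose them to exceed the worst-case recovery time of the cluster you are targeting:

using System;
using System.Threading;

class RetryHelper
{
    public delegate void Operation();

    // Retries "operation" until it succeeds or the overall timeout
    // expires. The timeout should exceed the cluster's worst-case
    // recovery time (NLB convergence starts after about five seconds
    // by default; an MSCS failover can take considerably longer).
    public static void InvokeWithRetry(Operation operation,
        TimeSpan timeout, TimeSpan retryInterval)
    {
        DateTime deadline = DateTime.Now + timeout;
        for (;;)
        {
            try
            {
                operation();
                return;
            }
            catch (Exception)
            {
                // Give up only after the cluster has had time to recover.
                if (DateTime.Now + retryInterval > deadline)
                    throw;
                Thread.Sleep(retryInterval);
            }
        }
    }
}

A caller might wrap a Web service call as RetryHelper.InvokeWithRetry(new RetryHelper.Operation(CallBankingService), TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(5)), where CallBankingService is a placeholder for the actual proxy call.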
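And here is a minimal sketch of the refactoring described in the third bullet: per-session state is written to a database reachable from every node, keyed by a client-supplied session ID, so any node can pick up where a failed one left off. The table name and schema are hypothetical:

using System;
using System.Data.SqlClient;

class SessionStateStore
{
    // Persists state between requests to a database reachable from
    // every node, so a failover does not strand in-memory session
    // state on the failed server. The SessionState table and its
    // columns are illustrative.
    private string connectionString;

    public SessionStateStore(string connectionString)
    {
        this.connectionString = connectionString;
    }

    public void SaveBalance(Guid sessionId, decimal balance)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            conn.Open();
            SqlCommand cmd = new SqlCommand(
                "UPDATE SessionState SET Balance = @balance " +
                "WHERE SessionId = @id", conn);
            cmd.Parameters.Add(new SqlParameter("@balance", balance));
            cmd.Parameters.Add(new SqlParameter("@id", sessionId));
            cmd.ExecuteNonQuery();
        }
    }
}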


