Monitoring and Managing High Availability Solutions


After you have planned and implemented your new high availability server solution, you need to perform routine monitoring and management tasks on it to keep it operating correctly. For the most part, high availability solutions are prone to the same sorts of problems that plague regular servers: hardware failures, service stoppages, data corruption, and so on. Recall that the purpose of implementing a high availability solution is not to prevent these sorts of events from occurring (although it would be nice), but instead to minimize the impact of these occurrences on the client experience.

In this section we examine the process to recover from the failure of a cluster node, as well as some of the tools that can be used to monitor network load balancing implementations.

Recovering from Failed Cluster Nodes

Recover from cluster node failure.

If you've done all your planning and implementing right up to now, you are ready for the day when something, anything, happens that renders one of your cluster nodes inoperable. As mentioned previously, high availability servers are still subject to the same sorts of problems and failures that plague any server; the difference is that because you have implemented a high availability solution, your clients will continue to have access to the required applications and services and should not, under most circumstances, even notice that something terrible has happened behind the scenes.

It's a nice feeling knowing that even if disaster strikes, your clients can still carry on as if nothing ever happened. However, you cannot afford to rest on your laurels when disaster does strike; you need to get that failed cluster node back online and into the cluster in short order. How you do this depends on exactly what the problem at hand is.

In most cases, when an MSCS cluster node has failed, you either need to rebuild it (hardware failure) or restore it (software failure or corruption) from an earlier backup set. In either case, you need to first evict the node from the cluster. To evict a node from a cluster, open the Cluster Administrator and connect to the cluster in question. Locate the node to be evicted and right-click on it. From the context menu, select Evict Node, as shown in Figure 5.37. You cannot evict a node where the Cluster Service is still running.
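The same eviction can also be performed from the command line with the cluster.exe utility. The cluster and node names below are placeholders for your own environment; this is a sketch, not a required step:

```
REM Evict the failed node NODE2 from the cluster named CLUSTER1
C:\> cluster /cluster:CLUSTER1 node NODE2 /evict
```

This is handy when administering the cluster from a script or over a remote session where the Cluster Administrator GUI is not available.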

Figure 5.37. You may need to evict a cluster node for a variety of reasons.

Evicting the last remaining node in the MSCS cluster removes the entire cluster itself, so be careful not to do so unless this is your intention. The eviction process is fairly abrupt, but it poses no problem to an already failed node because it is no longer providing service to clients.

After the cluster node has been evicted, you can rebuild it or perform a restoration on it as required. Should you need to rebuild the cluster node, you need to ensure that its configuration exactly matches that of the node it is replacing. The IP address, local drive letters, computer name, and domain membership are all critical to successfully joining the newly created node to the cluster. In the event that you need to perform a restoration from a previous backup set, you can do so as discussed in Chapter 6, "Monitoring and Maintaining Server Availability."

In a worst-case scenario in which you cannot evict a cluster node that is still operating but is experiencing problems with the Cluster Service, you can initiate a manual removal of the Cluster Service from the node by issuing the command cluster node nodename /forcecleanup from the command line, as shown in Figure 5.38.

Figure 5.38. If nothing else works to evict the cluster node, you can initiate a manual removal of the Cluster Service.

After a cluster node has failed, you should also monitor the remaining cluster nodes to ensure that they are not adversely affected or overloaded as a result. This situation can easily occur when Active/Active clustering is being used. Chapter 6 discusses monitoring server performance in Windows Server 2003. Lastly, after a cluster node has failed, you should make sure that any failovers that were configured to occur have occurred properly. If they have not, you need to manually move the resource group by right-clicking on it and selecting Move Group, as shown in Figure 5.39.
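The resource group move can also be initiated from the command line with cluster.exe. The cluster, group, and node names shown here are hypothetical:

```
REM Move the resource group "SQL Group" to the surviving node NODE1
C:\> cluster /cluster:CLUSTER1 group "SQL Group" /moveto:NODE1
```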

Figure 5.39. You may have to manually move a resource group if the failover has not occurred properly for some reason.

Monitoring Network Load Balancing

Monitor network load balancing. Tools might include the Network Load Balancing Monitor Microsoft Management Console (MMC) snap-in and the WLBS cluster control utility.

When it comes to monitoring your NLB clusters, there is really not a whole lot to do. You should, as a standard administrative practice, perform basic performance monitoring on each of your NLB cluster nodes. You should monitor the following items:

  • CPU

  • Disk

  • Memory

  • Network

  • Service or application-specific items as required

Using the Performance console to monitor and baseline servers is discussed at length in Chapter 6 and is not discussed here.
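As a quick spot check outside the Performance console, you can also sample counters for these items with the typeperf command included with Windows Server 2003. The counter paths below are standard, but the sampling interval and count are arbitrary choices for illustration:

```
REM Sample key counters every 15 seconds, 10 samples total
C:\> typeperf "\Processor(_Total)\% Processor Time" ^
              "\Memory\Available MBytes" ^
              "\PhysicalDisk(_Total)\Avg. Disk Queue Length" ^
              "\Network Interface(*)\Bytes Total/sec" -si 15 -sc 10
```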

EXAM TIP

Using nlb.exe remotely The strength of the nlb.exe command is that it can be used to manage NLB clusters and cluster nodes remotely across a LAN or WAN if desired. To run the nlb.exe command from a remote computer, you must enable remote control for the NLB cluster.

Enabling remote control presents security risks to the NLB cluster, such as data tampering, denial of service (DoS), and unintentional data disclosure to attackers. Remote control should be used only from a trusted computer inside the same firewall as the NLB cluster or over a VPN if outside the firewall.

If you choose to enable remote control despite the risks associated with it, you should take steps to protect the NLB cluster from attack as a result. The default User Datagram Protocol (UDP) control ports for the cluster, 1717 and 2504 at the cluster VIP, should be protected by a firewall. Also, you must ensure that you have configured a strong remote control password.


You can, however, also perform some monitoring of your NLB cluster and NLB cluster hosts from the command line using the wlbs.exe command. For those of you screaming out that the Windows Load Balancing Service was retired with Windows NT 4.0, you are very much correct; Microsoft has kept the WLBS acronym around for good measure. In reality, wlbs.exe and nlb.exe are identical in every way; therefore, we discuss nlb.exe.

The nlb.exe command has the following basic syntax: nlb command [remote options]. A complete listing of all available NLB commands can be found in the Windows Server 2003 help files or online at www.microsoft.com/technet/prodtechnol/windowsserver2003/proddocs/entserver/nlb_command.asp. From a monitoring point of view, we focus only on the commands outlined in Table 5.2.
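For example, querying cluster state locally and remotely might look like the following. The cluster VIP, host name, and password are hypothetical, and the remote form works only if remote control has been enabled on the cluster as described earlier:

```
REM Query the state of the NLB cluster on the local host
C:\> nlb query

REM Query a specific host in a remote cluster (remote control must be enabled)
C:\> nlb query 192.168.1.200:host1 /passw MyStr0ngPass
```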

Table 5.2. The nlb.exe Monitoring-Specific Commands

Command

Description

query

Displays the current cluster state and the list of host priorities for the current members of the cluster. This command can be targeted at a specific cluster, a specific cluster on a specific host, all clusters on the local computer, or all computers that are part of a cluster.

The possible states are Unknown, Converging, Draining, and Converged.

queryport

Displays information about a given port rule. The command returns the following information:

  • Whether the specified port rule was found

  • The current state of the specified port rule

  • The number of packets accepted and dropped on the specified port rule

display

Displays extensive information about the current NLB parameters, cluster state, and past cluster activity.

params

Displays information about the current NLB configuration.

ip2mac

Displays the MAC address corresponding to the specified cluster name or IP address. If multicast support is enabled, Network Load Balancing uses the multicast MAC address for cluster operations; otherwise, it uses the unicast MAC address.
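To tie the commands in Table 5.2 together, a short session run on a cluster host might look like the following sketch; the cluster IP address is hypothetical:

```
C:\> nlb params                    REM show the current NLB configuration
C:\> nlb queryport 80              REM check the port rule covering port 80
C:\> nlb ip2mac 192.168.1.200      REM display the cluster MAC address
```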



MCSE Windows Server 2003 Network Infrastructure (Exam 70-293)
MCSE 70-293 Exam Prep: Planning and Maintaining a Microsoft Windows Server 2003 Network Infrastructure (2nd Edition)
ISBN: 0789736500
Year: 2003
Pages: 151
Authors: Will Schmied
