In the previous section I discussed the HeartbeatMonitor application as a design pattern for monitoring distributed applications. In this section I investigate why this specific monitoring pattern is particularly effective.
One of the most common ways to monitor individual components running as part of a distributed application is to use "heartbeats." The idea behind this is that each component transmits a regular heartbeat across the local network to an associated heartbeat monitor running on another machine. Each of the heartbeat monitors has a list of components that it's supposed to watch, and it displays a status page with a "smiley" face to represent a healthy component (i.e., a component transmitting a regular heartbeat) and a "frowny" face for any component that has missed more than one heartbeat.
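The bookkeeping behind such a status page is simple to sketch. Here is a minimal Python illustration (the class and method names are my own invention for this sketch, not part of any real monitoring product; "more than one missed heartbeat" is interpreted as more than two intervals of silence):

```python
import time

class RemoteHeartbeatMonitor:
    """Tracks the last heartbeat received from each watched component."""

    def __init__(self, interval, components):
        self.interval = interval  # expected seconds between heartbeats
        self.last_seen = {name: None for name in components}

    def record_heartbeat(self, name, now=None):
        """Called whenever a heartbeat message arrives from a component."""
        self.last_seen[name] = now if now is not None else time.time()

    def status(self, now=None):
        """Smiley for healthy; frowny once more than one heartbeat is missed."""
        now = now if now is not None else time.time()
        report = {}
        for name, seen in self.last_seen.items():
            if seen is None or now - seen > 2 * self.interval:
                report[name] = ":-("  # missed more than one heartbeat
            else:
                report[name] = ":-)"
        return report
```

Note that the monitor can only record silence; as the scenario below shows, it cannot say *why* a component has gone silent.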
So now a member of your application support team phones to tell you that component ABC on server XYZ has stopped broadcasting a heartbeat. This is an application-critical component, so you need to respond fairly fast. You use Terminal Services to log into machine XYZ, but after 5 minutes you're still waiting for the login to happen because the server doesn't seem to be responding. So you phone the server support team to report a nonfunctioning server. They take 5 more minutes to respond, but then they tell you that they can log into server XYZ without a problem. Having established that you're on a different network segment than the server support team, you ring the network support team to report a possible network switch issue. After another 10 minutes, they phone you back to say that the network switches appear to be functioning normally. Sure enough, when you try to log into server XYZ again, the login works perfectly and the heartbeat status page is now showing a smiley face again for component ABC. You never do establish what went wrong, or why.
If you look back at the list of possible failures in a distributed application presented at the beginning of this chapter, you can see the problem with the debugging scenario that I just described. When the heartbeat monitor component is remote from the components that it's monitoring, it has no way of knowing what's really wrong. The problem could be with a component that's being monitored or with any of the software or hardware sitting between that component and the monitoring component. Performing remote diagnosis of what's really wrong would require some very sophisticated software and hardware monitoring, and it would introduce serious complexity into the monitoring process.
There are two more problems with this remote-monitoring design pattern. The first is that it's not easy for a remote monitor to take any corrective action, such as restarting a component that appears to be dead in the water. The second is that when you have several distributed applications running on your local network, the heartbeat traffic can grow to a significant proportion of your total network traffic. At one company where I worked, more than 50% of the application network traffic was attributed to heartbeat messages. Although this isn't necessarily a problem, because heartbeat messages tend to be small and any good network should be optimized to handle many small messages, it's still a pain to explain to the network support team. And as you can see from the debugging scenario that I've just described, most of these heartbeat messages are useless.
To avoid all of these problems, one very good technique is to use the local monitor design pattern, as demonstrated by the HeartbeatMonitor application. Because each heartbeat monitor runs locally on the same machine as the components that it's monitoring, it's able to analyze a problem in much more detail than is possible with a remote monitor. When a heartbeat is missed, the local monitor can check for problems such as low memory, low disk space, or high processor utilization. It can sometimes determine whether a problematic component is completely dead or just hung. If necessary, it can kill a hung component, restart a dead one, or take some other corrective action. The local monitor can also watch the overall health of the machine on which it's running and provide advance warning about problems such as low disk space that might affect other components running on the machine. All this is possible because diagnosis of local failure is much easier and more reliable than diagnosis of remote failure.
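The local checks described above might be sketched as follows. This is a Python sketch under assumed conditions: `diagnose_missed_heartbeat`, its findings, and the 500 MB disk threshold are all hypothetical, and the liveness probe (signal 0) is POSIX-specific:

```python
import os

MIN_FREE_DISK_BYTES = 500 * 1024 * 1024  # hypothetical threshold

def process_alive(pid):
    """POSIX-style liveness probe: signal 0 checks existence without killing."""
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False

def diagnose_missed_heartbeat(pid, free_disk_bytes, restart):
    """Decide what a local monitor should do about a silent component.

    `restart` is a callable supplied by the caller that relaunches
    the component; a real monitor would also check memory and CPU.
    """
    findings = []
    if free_disk_bytes < MIN_FREE_DISK_BYTES:
        findings.append("low disk space")
    if process_alive(pid):
        findings.append("process alive but silent -- possibly hung")
    else:
        findings.append("process dead -- restarting")
        restart()
    return findings
```

The point of the sketch is that every one of these checks is cheap and reliable precisely because it runs on the same machine as the failing component.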
You still need one or more remote monitors to watch the local monitors and present the aggregated results, but these remote monitors won't generate anywhere near the amount of network traffic they did in the original scenario. As an added benefit, each remote monitor doesn't need to maintain a complex and ever-changing list of application components to watch, as this list can now be kept local to each machine. Instead, each remote monitor has a much smaller list of local monitors to watch, preferably one per machine. Because the local heartbeats are aggregated before being pushed to the remote monitors, your network is no longer flooded with (mainly useless) heartbeat messages and you have much more reliable diagnostics.
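The aggregation step might look something like this (a Python sketch; the summary-message format is invented for illustration). The key property is that the network now carries one periodic message per machine rather than one per component:

```python
import json

def aggregate_status(machine, component_status):
    """Roll many per-component heartbeats into one summary message.

    `component_status` maps component name -> healthy (True/False),
    as determined by the local monitor's own checks. The local monitor
    pushes the resulting summary to the remote monitor periodically,
    so network traffic scales with machine count, not component count.
    """
    return json.dumps({
        "machine": machine,
        "healthy": sorted(n for n, ok in component_status.items() if ok),
        "failed": sorted(n for n, ok in component_status.items() if not ok),
    })
```

A remote monitor receiving such summaries still shows the familiar smiley/frowny page, but each frowny face now arrives accompanied by a locally verified diagnosis.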
I should mention, of course, that heartbeats are only part of the solution when monitoring a distributed application. You should also ensure that the server support team monitors all of your application servers for hardware faults, hardware and software warnings, and server up/down status. You should make sure that the network support team monitors the network effectively and checks the type and volume of network traffic. Finally, you must ensure that there is proper documentation that describes what should be done in the event of specific faults or warnings, including support escalation procedures.