Maintaining High Availability


VoIP management is essential if you are committed to reducing downtime, and reducing downtime is the other side of keeping availability high. Availability is defined as follows:

Availability = 1 – (Total downtime)/(Total elapsed time)



where:

Total downtime = (Mean time to repair) * Number of problems



To reduce downtime, focus on both aspects of the right side of this equation. You want fewer problems—and when you do have problems, you want to minimize the time required to find and fix them.
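As a rough illustration of the two levers in the equation above, the following Python sketch computes availability from the mean time to repair and the number of problems. The function names and the sample figures (six problems a year, four hours each) are illustrative assumptions, not measurements from the text.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours of elapsed time


def total_downtime(mttr_hours: float, num_problems: int) -> float:
    """Total downtime = (mean time to repair) * (number of problems)."""
    return mttr_hours * num_problems


def availability(mttr_hours: float, num_problems: int,
                 elapsed_hours: float = HOURS_PER_YEAR) -> float:
    """Availability = 1 - (total downtime) / (total elapsed time)."""
    return 1.0 - total_downtime(mttr_hours, num_problems) / elapsed_hours


if __name__ == "__main__":
    # Hypothetical scenarios: a baseline, then each lever pulled on its own.
    scenarios = {
        "baseline (6 problems, 4 h each)": availability(4.0, 6),
        "fewer problems (3 problems, 4 h each)": availability(4.0, 3),
        "faster repair (6 problems, 2 h each)": availability(2.0, 6),
    }
    for label, value in scenarios.items():
        print(f"{label}: {value:.4%}")
```

Under these assumed numbers, halving the number of problems and halving the repair time give the same improvement, because total downtime is the product of the two factors.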

The section "Understanding Reliability" in Chapter 3 discussed the main reasons for downtime or system unavailability. You may remember that user errors and processes, software applications, and technology all contribute to downtime.

A good VoIP management system helps to reduce availability incidents caused by user errors and process problems. In addition, when a software application problem occurs, a VoIP management system can potentially restart the application to correct the problem immediately.

The following are the three areas to consider when approaching the management of VoIP availability:

  • Prevention— The best way to reduce the problems you encounter is to avoid the problems altogether—before they occur, before they lower the availability.

    Here is where good processes, careful management, and vigilant monitoring come into play. To avoid problems, you need to see them coming long before they occur. By analogy, you want to notice something turning warm before it turns hot. This means going deeper than basic event management: you monitor specific things important to VoIP, such as computer temperatures, internal counters, and failed logins, and respond proactively as the relative "temperature" increases.

  • Detection— If a problem does occur, its location and cause need to be isolated quickly. To make sure that the mean time to repair (MTTR) is short, your team needs to be efficient at isolating, diagnosing, and repairing the problem. Techniques for reducing the MTTR were discussed in the previous section, as part of day-to-day operations management.

  • Reaction— Reacting well means providing a short-term fix, plus doing the long-term things needed to avoid the problem in the future—or, at least, to lessen its effect or speed the isolation and repair time if it does recur.

A 1999 University of Michigan survey showed that router failure causes about 23 percent of IP network downtime.[5] As mentioned at the beginning of this chapter, this text is written with the assumption that you are already managing your existing network components well, so it does not digress into router management. Even more important for VoIP availability are the new boxes, in particular, the VoIP servers.

Monitoring VoIP Servers

No single definition covers all so-called VoIP servers. A single "VoIP server" may encompass such varied functions as IP PBX, PSTN gateway, call manager, application server, and accounting hub—or these functions may be distributed among multiple computers.

VoIP servers are such crucial components in a VoIP system that they must be monitored continuously. Most VoIP servers run off-the-shelf software on off-the-shelf hardware, and therein lies the problem. The software and hardware were not necessarily designed for five nines of availability. Yet keeping VoIP servers running well is at the core of maintaining high availability.

Monitoring a VoIP server means watching the hardware, the operating system, and the major applications running on the server. The hardware boxes used today are frequently Intel-based systems or Sun SPARC systems. The operating systems may include Windows, Linux, or Solaris. Applications include web servers, databases, and file transfer services.

What exactly needs to be monitored on VoIP servers? The list is long. Table 6-6 looks at some of the key elements.

Table 6-6. Examples of Elements to Be Monitored on a VoIP Server

| Monitored Element | Description of This Element |
|---|---|
| Hardware | The box temperature, cooling fan operation, disk errors, and network interface errors. |
| Phone calls | The number attempted, number completed, currently active, in progress, busy attempts. |
| Performance thresholds | For delay, jitter, lost packets. |
| CPU | The maximum CPU utilization threshold exceeded per application and across the entire system. Consider user and kernel modes. |
| Memory | The memory usage maximum, per application and across the entire system. Physical and virtual memory. Page file maximum. Paging rate. |
| Model, serial number | The vendor's model and serial number for the device. |
| Disk | The disk utilization percentage. Free disk space minimum. Disk operation maximum time, for reads and writes. Backup status. Disk failures or error reports in an event log. |
| Applications | The status of software applications necessary for successful operation. |
| Database | The blocked access: time and number of incidents. CPU, memory, and disk usage for database operations. Lock and connection utilization. |
| Network | The network interface usage. Bandwidth utilization. |
| Security | Intrusion detections. Invalid or failed login attempts. Denial-of-service attacks. |




Exercise care when monitoring critical servers. You don't want the monitoring to affect the performance of the server's main functions. In addition, you don't want the action of monitoring to adversely skew the statistics that you are trying to monitor.

There is a trade-off involved with real-time monitoring. To get real-time information about the status of the server, you want to collect the monitored statistics from the server, but not so frequently that the collection affects performance. You want to optimize the data collection so that batches are sent back to management consoles at different intervals. However, you never want the monitored information that is collected to consume a noticeable portion of your bandwidth.

It is difficult to achieve "true" real-time monitoring. Instead, depend upon thresholds and events. Set thresholds that, when exceeded by the collected statistics, trigger early warnings. Configure event generation to alert you when something specific occurs. Monitored elements work economically when they incorporate thresholds that determine when an event is logged. Define policies and actions for handling important events.
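The idea of thresholds that raise early warnings can be sketched as a two-level check: a warning level that fires while the "temperature" is only warm, and a critical level for when it is hot. This is a minimal sketch; the class names, metric names, and limits below are hypothetical and are not taken from any particular management product.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Threshold:
    metric: str
    warning: float   # early-warning level
    critical: float  # level that demands immediate action


@dataclass(frozen=True)
class Event:
    metric: str
    severity: str
    value: float


def evaluate(threshold: Threshold, value: float) -> Optional[Event]:
    """Return an event only when the value reaches a threshold."""
    if value >= threshold.critical:
        return Event(threshold.metric, "critical", value)
    if value >= threshold.warning:
        return Event(threshold.metric, "warning", value)
    return None  # nothing is logged for normal readings


def check_sample(thresholds: Dict[str, Threshold],
                 sample: Dict[str, float]) -> List[Event]:
    """Evaluate one batch of collected statistics against all thresholds."""
    events = []
    for metric, value in sample.items():
        threshold = thresholds.get(metric)
        if threshold is None:
            continue  # metric is collected but has no threshold defined
        event = evaluate(threshold, value)
        if event is not None:
            events.append(event)
    return events


if __name__ == "__main__":
    # Hypothetical limits for two of the elements discussed in this section.
    limits = {
        "cpu_percent": Threshold("cpu_percent", warning=75, critical=90),
        "disk_used_percent": Threshold("disk_used_percent", warning=80, critical=95),
    }
    for event in check_sample(limits, {"cpu_percent": 82, "disk_used_percent": 40}):
        print(event)
```

Because normal readings produce no event, only the exceptions need to travel to the management console, which keeps the monitoring traffic small.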

Most management systems allow for some degree of automation. For example, if a critical VoIP server software application goes down, an event can be generated. One automated response action could be to alert the IT administrator. An even better response (at least your IT administrator thinks so) is to log the failure, automatically restart the application, and then notify the IT staff about the problem.
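The "log, restart, then notify" response described above might look something like the following Python sketch. The `restart` and `notify` callables are hypothetical hooks that a real management system would supply (for example, a service-control call and a pager or e-mail gateway); nothing here reflects a specific product's API.

```python
import logging
from typing import Callable

logger = logging.getLogger("voip.monitor")


def handle_application_down(app_name: str,
                            restart: Callable[[str], bool],
                            notify: Callable[[str], None]) -> bool:
    """Log the failure, try an automatic restart, then tell the IT staff.

    Returns True if the restart hook reported success.
    """
    # Step 1: log the failure so there is an audit trail.
    logger.error("Application %s is down", app_name)

    # Step 2: attempt the automatic restart; a failing hook must not
    # prevent the notification from going out.
    try:
        restarted = bool(restart(app_name))
    except Exception:
        logger.exception("Restart of %s raised an error", app_name)
        restarted = False

    # Step 3: notify the IT staff, including the outcome of the restart.
    if restarted:
        outcome = "restarted automatically"
    else:
        outcome = "restart FAILED, manual action needed"
    notify(f"{app_name} went down; {outcome}")
    return restarted


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stand-in hooks for demonstration only.
    handle_application_down("call-manager", lambda name: True, print)
```

Notifying after the restart attempt, rather than before, lets the message say whether anyone actually needs to get out of bed.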

But how is the monitored information gathered? Because most VoIP servers run on standard operating systems, information can be gathered via standard application programming interfaces (APIs). Here are several sources of system information that are valuable for monitoring the health of VoIP servers:

  • Log files— These provide an audit trail of what is happening in a system. Applications and the operating system itself may write to log files when events occur. Log files are also an excellent source of early warning information. For example, Windows disk management services can log warnings before hard drive failures have a chance to wreak havoc.

  • Performance counters— Applications and the operating system periodically publish performance information and key statistics.

  • SNMP— SNMP is the standard way of gathering information published in device MIBs (collections of device configuration and utilization data). It is possible to set up SNMP monitoring so that it keeps very busy collecting and reporting information. Unfortunately, this can result in high CPU utilization and additional network flows, which may perturb the system you are trying to monitor.

  • Operating system APIs— Powerful, low-level APIs enable direct access to operating system information. These APIs let a management application take action and restart services or kick off other responses to events.
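Of the sources listed above, log files are the simplest to illustrate. The sketch below scans log lines for warnings and errors so they can be surfaced early. The line layout (`date time LEVEL source: message`) is an assumed format chosen for illustration; real operating system and application logs vary and would need their own parsing.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed layout: "2004-03-01 10:20:41 WARNING disk: message text"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<source>[^:\s]+): (?P<message>.*)$"
)


def early_warnings(lines: Iterable[str],
                   levels: Tuple[str, ...] = ("WARNING", "ERROR")) -> List[Dict[str, str]]:
    """Return the parsed entries whose level is worth an early look."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        if match.group("level") in levels:
            hits.append(match.groupdict())
    return hits


if __name__ == "__main__":
    sample = [
        "2004-03-01 10:15:02 INFO callmgr: service started",
        "2004-03-01 10:20:41 WARNING disk: bad block count rising on disk 0",
        "2004-03-01 10:21:07 ERROR callmgr: database lookup timed out",
    ]
    for entry in early_warnings(sample):
        print(entry["timestamp"], entry["level"], entry["source"], entry["message"])
```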

Security intrusions can often be identified using data from a mix of monitored data sources. For example, an event can be raised if the number of failed server logon attempts, as recorded in a security event log, exceeds a threshold. An unexpected spike in CPU utilization at the web server can indicate an attack by malicious software, such as the Code Red worm. The sooner you know there is a problem, the more easily you can prevent it from spreading and taking your system down.
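The failed-logon example can be sketched as a sliding-window counter. This is only one plausible way to apply such a threshold; the class name, the window length, and the limit are assumptions, and the timestamps are assumed to arrive in chronological order from whatever reads the security event log.

```python
from collections import deque


class FailedLoginMonitor:
    """Raise a flag when too many logon failures fall within a time window."""

    def __init__(self, max_failures: int, window_seconds: float) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, timestamp: float) -> bool:
        """Record one failed logon; return True if the threshold is exceeded."""
        self._failures.append(timestamp)
        # Drop failures that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) > self.max_failures


if __name__ == "__main__":
    # Hypothetical policy: more than 3 failures within 60 seconds is an event.
    monitor = FailedLoginMonitor(max_failures=3, window_seconds=60)
    for t in (0, 10, 20, 30):
        if monitor.record_failure(t):
            print(f"t={t}s: failed-logon threshold exceeded")
```

A window, rather than a running total, keeps occasional mistyped passwords spread over a day from looking like an attack.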

VoIP servers may be deployed in clusters to provide redundancy and scalability. Within the cluster, different servers perform different roles. Monitor these servers individually and as a single entity.
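Monitoring a cluster "individually and as a single entity" can be pictured as rolling the per-server health up into one cluster status. The roll-up rule in this sketch is an assumption made for illustration: the cluster is "down" if any role has no healthy server, "degraded" if every role is still covered but some server is unhealthy, and "up" otherwise. The server and role names are hypothetical.

```python
from typing import Dict, List, Tuple

# server name -> (role, is_healthy)
ServerMap = Dict[str, Tuple[str, bool]]


def unhealthy_servers(servers: ServerMap) -> List[str]:
    """Individual view: which servers need attention."""
    return sorted(name for name, (_, healthy) in servers.items() if not healthy)


def cluster_status(servers: ServerMap) -> str:
    """Single-entity view: can the cluster still perform every role?"""
    if not servers:
        raise ValueError("cluster has no servers")
    health_by_role: Dict[str, List[bool]] = {}
    for role, healthy in servers.values():
        health_by_role.setdefault(role, []).append(healthy)
    if any(not any(flags) for flags in health_by_role.values()):
        return "down"      # some role has no healthy server left
    if all(all(flags) for flags in health_by_role.values()):
        return "up"        # every server in every role is healthy
    return "degraded"      # redundancy is absorbing a failure


if __name__ == "__main__":
    cluster = {
        "cm1": ("call-manager", True),
        "cm2": ("call-manager", False),
        "gw1": ("gateway", True),
    }
    print(cluster_status(cluster), unhealthy_servers(cluster))
```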

NetIQ's Vivinet Manager (http://www.netiq.com/products/vm/) is an example of a software system that is designed to monitor the wide range of elements described above (and call quality and call setup) and to automatically respond when failures occur or thresholds are crossed.

Servers are not the only significant VoIP components to monitor, though. Monitor and manage your PSTN gateway function, as well. Many calls will pass through it to the PSTN. Increased CPU utilization, increased memory usage, and an increased percentage of PSTN lines in use could signal a capacity problem. Gateways can be viewed as a single point of failure: They are your connection to the PSTN, and they usually have many interfaces for incoming and outgoing traffic.

Application Management

Despite its complexity and capabilities, most software today is more fragile than people would like it to be. As a result, combining applications inside a computer increases the likelihood that the software may interact unexpectedly. A typical VoIP server runs many software applications to support VoIP calls. For example, VoIP servers use database applications to store everything from phone numbers to phone IP addresses. If lookups on this information are slow or blocked, VoIP calls cannot be completed. In addition, the database software on which VoIP applications and servers rely is complex and can consume large amounts of server resources: CPU, memory, and disk space.

Applications can be greedy at times, quickly consuming scarce resources. You should therefore consider monitoring the CPU usage of certain applications. Automated management tools let you set thresholds, with event generation when an application uses too much CPU. An appropriate response may be to stop the application or kill the thread it is running under.
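Because stopping an application is a drastic response, one reasonable refinement is to act only when the CPU limit is exceeded for several samples in a row rather than on a single spike. The sketch below shows that idea. The "consecutive samples" rule, the class name, and the numbers are assumptions, not something the text prescribes, and the CPU readings are assumed to come from whatever performance counter the platform provides.

```python
class SustainedCpuWatch:
    """Flag an application only after sustained high CPU usage."""

    def __init__(self, limit_percent: float, consecutive_samples: int) -> None:
        self.limit_percent = limit_percent
        self.consecutive_samples = consecutive_samples
        self._over_limit_run = 0  # length of the current run above the limit

    def add_sample(self, cpu_percent: float) -> bool:
        """Feed one CPU reading; return True when an event should be raised."""
        if cpu_percent > self.limit_percent:
            self._over_limit_run += 1
        else:
            self._over_limit_run = 0  # a normal reading resets the run
        return self._over_limit_run >= self.consecutive_samples


if __name__ == "__main__":
    # Hypothetical policy: above 80 percent for 3 samples in a row.
    watch = SustainedCpuWatch(limit_percent=80, consecutive_samples=3)
    for reading in (95, 85, 40, 90, 91, 92):
        if watch.add_sample(reading):
            print(f"CPU at {reading}%: sustained overload, raise an event")
```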

Continuous monitoring is required for VoIP servers to maintain high availability—five nines is the goal. With good management, this goal can be attained. Now it is time to look at another core component of VoIP management infrastructure—managing call quality.

Taking Charge of Your VoIP Project
ISBN: 1587200929
Year: 2004
Pages: 90
