Handling the day-to-day changes to the network, applications, and users can be the most difficult part of providing a good VoIP implementation. You are dealing with complex systems with many variables that change frequently and may even be beyond your control. An avalanche of information is available on the state of your VoIP implementation, but filtering out the irrelevant pieces can be a challenge. You need to have good processes in place to control the variables that you can control, and monitor the variables that you can't control. Feedback should be clear and concise. As a rule, good operations management begins with system configuration.
This chapter makes broad slices through the field of network and application management. Under the topic of operations management it covers configuration management, event management, and fault management in depth. These topics address how to manage changes to network and VoIP components and how to recognize the changes that indicate a problem or potential problem. Elements of security management are part of operations management, so those are discussed as well.
Configuration management concerns itself with the setup details (configuration) of the components in a VoIP system. Configuration information can be represented in files, reports, or diagrams. Because the current configuration forms the basis for any changes that are made in a VoIP implementation, it must be carefully managed and fully understood. Reliability and quality problems (and new security holes) are most likely to occur when something is changed, with the fault often located at the place where the change was made. Not all changes work out. In cases where they don't, you need to be able to back them out, returning the network to its previous working configuration. To successfully manage your configurations, you have to
- Make changes and test the changes.
- Track relentlessly what is changed.
- Keep tight control of who is authorized to make the changes.
Configuration management in a network means knowing all the components of the network and how each of them is set up. Each of these topics is discussed in detail in the following sections.
Knowing Your Network Components
Do you know how many routers and switches are in your data network? Do you know which vendors made the routers? What is the current location of the routers? Good configuration management means that all the information you need about your network devices is readily available.
Configuration management in a network starts with knowledge of the components in the network: what they are and what is in them. That knowledge is then enhanced with topology information, such as where the network components are and how they are linked together. After you have fairly complete knowledge of the physical layer, the next step is the logical layer. Each of the VoIP components (IP PBXs, servers, gateways, and IP phones) has one or more IP addresses that need to be managed.
You need not reinvent network management, so these topics are treated only briefly. You are likely doing excellent network management already, maintaining complete records of physical and logical layer configuration and topology, but if you are not, the added complexity of VoIP makes this level of network knowledge imperative.
Your records should include up-to-date documents listing the devices in your network and the hardware and software that comprise them. A network inventory contains information about the devices in your network, such as routers, switches, firewalls, and servers. Although the type and amount of information in a network inventory varies, Table 6-1 provides an example of information that is routinely collected (and extremely useful) from each device in a network inventory.
Table 6-1. Sample of Information Collected in a Network Inventory
Attribute: Description of Attribute for Network Inventory
Device name: The name assigned to this device. Many times this name is the DNS name.
Location: The physical location of the device, often a room number in a building.
IP address: The IP address of the device. Some devices may have multiple IP addresses.
Role: The role or function of the device. For example, this is a router acting as a gateway to the PSTN, or a router acting as a firewall.
Vendor: The vendor of the device. This is useful information when considering upgrades or new equipment purchases.
Model, serial number: The vendor's model and serial number for the device.
Operating system: The operating system, and its version number, currently running on the device.
Memory, disk space: The amount and type of physical memory in the device. Some devices have multiple types of memory, such as RAM and ROM (sometimes called Flash). Include the capacity of any disks, as well.
CPU type and speed: The processor type (RISC, Intel, and so on) and its operating speed.
Installed modules: The hardware cards or modules installed in this device. Modules may provide different functions, such as WAN links or gateway functionality.
A variety of tools can help you to create and maintain an up-to-date network inventory. Most tools discover network components by using the Simple Network Management Protocol (SNMP) to query Management Information Bases (MIBs) on the network devices. The devices respond to the SNMP requests with the information that was requested. You can either scan a range of IP addresses for devices or provide a starting router that offers its routing table information to initiate a discovery of devices in the network.
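As a rough illustration of the inventory and the address-range discovery described above, the following Python sketch stores one record per device and sweeps a range of addresses. The field names, the `sweep` and `fake_probe` functions, and the sample addresses are assumptions made for this example; `fake_probe` merely stands in for a real SNMP query, which a discovery tool would perform against each device's MIB.

```python
from dataclasses import dataclass, field
from ipaddress import ip_network
from typing import Callable, Optional


@dataclass
class InventoryRecord:
    """One device in the network inventory (field names are illustrative)."""
    name: str
    ip_addresses: list[str]
    location: str = ""
    role: str = ""
    vendor: str = ""
    model: str = ""
    serial_number: str = ""
    os_version: str = ""
    modules: list[str] = field(default_factory=list)


def sweep(cidr: str,
          probe: Callable[[str], Optional[InventoryRecord]]) -> list[InventoryRecord]:
    """Probe every host address in a range and collect the devices that answer."""
    found = []
    for host in ip_network(cidr).hosts():
        record = probe(str(host))
        if record is not None:
            found.append(record)
    return found


def fake_probe(address: str) -> Optional[InventoryRecord]:
    """Hypothetical stand-in for an SNMP query; knows about one sample device."""
    known = {
        "192.0.2.1": InventoryRecord("edge-rtr-1", ["192.0.2.1"], role="PSTN gateway"),
    }
    return known.get(address)


if __name__ == "__main__":
    for device in sweep("192.0.2.0/29", fake_probe):
        print(device.name, device.ip_addresses, device.role)
```

Passing the probe in as a function keeps the sweep logic independent of whichever SNMP library or discovery tool is actually used.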
After you have compiled or updated a network inventory, a network topology diagram is useful to graphically present the information. Where are the network devices you have just cataloged? What are the links among them? A network topology diagram shows the network devices (routers, firewalls, switches) and how they are interconnected. (See Figure 6-1.) The different addresses and subnets are shown with each associated router interface.
Figure 6-1. Excerpt from a Network Topology Diagram
A network topology diagram provides a quick, high-level overview of how the network is connected. Network topology diagrams often show WAN link information, such as link type, provider, and bandwidth. In VoIP implementations, you may want to add the locations of key VoIP servers and IP PBXs to the topology diagram. Excellent tools are available, such as Microsoft Visio (http://www.microsoft.com/office/visio/), to help you discover and diagram your network.
Knowledge of your equipment and topology is important; you need to know exactly what is physically deployed in the network to find and fix problems quickly. But the real magic occurs at the logical layer, where every component in the network has one or more IP addresses. For good operations management, you need accurate documentation of the ties between each IP address and the actual physical box that is its home. And because most VoIP networks rely on DNS and DHCP to keep IP phones in contact with critical servers, you need to monitor the DNS and DHCP services used in your network.
DNS provides a mapping from a name to an IP address. For example, www.netiq.com is a DNS name that maps to the IP address of the web server that runs the NetIQ website. You most likely already have a DNS server in your network that provides a mapping from names to IP addresses. The DNS server that lets you navigate the Internet and your own network also enables each VoIP phone to locate its VoIP server readily.
IP phones are generally DHCP-enabled. The DHCP server provides an IP address when a network host, in this case an IP phone, becomes active on the network. By using the DHCP service, you can move IP phones with relative ease. When you relocate an IP phone (by moving it to another subnet, for example), the DHCP server for that subnet should be able to supply it with a new address, unless you disabled the phone's DHCP capability and gave it a static address. In that case, you will have a configuration problem after the move. For similar reasons, if your DNS server goes down, you could lose your phone service.
The availability of the DNS and DHCP servers needs to be monitored. In addition, if it is possible with your current budget and if you have not already done so, you should reconsider your current DNS server redundancy plan. The software company Men & Mice, which specializes in DNS testing, conducted a survey in 2001 and found that around 250 large companies' websites "are still at risk of virtually shutting down if the single network segment housing their DNS servers fails."
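One simple way to watch DNS availability is to confirm periodically that the names your phones depend on still resolve. The sketch below is a minimal example under stated assumptions: the host names are hypothetical, the resolver is passed in so the check can be exercised without a live network, and it covers DNS only. DHCP monitoring needs a different kind of probe.

```python
import socket
from typing import Callable, Iterable


def resolves(name: str,
             resolver: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True if the name resolves to an address, False if the lookup fails."""
    try:
        resolver(name)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def unresolved(names: Iterable[str],
               resolver: Callable[[str], str] = socket.gethostbyname) -> list[str]:
    """Return the names that could not be resolved, in the order given."""
    return [name for name in names if not resolves(name, resolver)]


if __name__ == "__main__":
    # Hypothetical names of servers that IP phones must be able to locate.
    critical = ["callserver.example.com", "tftp.example.com"]
    for name in unresolved(critical):
        print(f"ALERT: {name} does not resolve")
```

Run on a schedule from more than one network segment, a check like this also gives an early hint when all DNS servers sit behind a single point of failure.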
Maintaining Control over Critical Files
The hardware components in your VoIP system are probably running advanced software programs that control their operation. The active management of this critical software is part of day-to-day operations management.
The critical software files to be managed are the executable program files that control how the devices operate and how they convey their health, and their essential configuration and data files, which control how the software is set up to do its job. These files are spread across many computers and devices in the network, often installed in remote locations.
The management of these files is explored soon in more depth in this chapter, but first consider these recommendations (which are just commonsense preventative-maintenance steps):
- Tightly control who can access these files
- Back up frequently
Configuration and Data Files
Many of the devices in a VoIP system run on off-the-shelf PC hardware with commonly available operating systems. One analysis puts a darker spin on this fact by noting that "much of the VoIP gear on the market is based on commodity operating systems and commonly hacked software." You are probably familiar with the critical data files that these VoIP devices are using: files with extensions .ini, .dat, and .sys, as well as Registry files. Critical VoIP configuration and data files also extend to applications: database index files, CDRs, and call routing files. Other network components have equally critical files, such as router and firewall configuration information. This discussion focuses on the most important protection measures, as follows:
Access control: Because configuration and data files really are critical to the successful operation of your VoIP telephone system, you need clear policies about who can read and change them. You want to make sure that no unexpected changes are made. Some of the files contain personal information; for example, CDRs contain information about who called whom and how long the call lasted. Personal information needs assured privacy.
In particular, apply strict change control to VoIP servers. Use access control to designate who can make changes and restrict the changes they can make.
Remote control: The ability to manage computers and their critical files remotely is a necessity, to avoid having to visit them each time something is amiss. Write and enforce policies that describe what can be remotely accessed, and what permissions are required to obtain remote access. Record all occasions when sanctioned remote access took place, and make sure you are alerted when a nonsanctioned access occurs.
Backups: You must be able to restore critical files if a failure occurs, or if you want to back out a change. Keeping a backup copy of these files is good configuration management. If the device should crash for some reason, a backup of the configuration enables you to restore it quickly. Backup configuration files also can be useful for planning configuration changes. The changes can be applied and tested offline, before being rolled out in a production network.
The obvious recommendation is to do frequent backups. Store some of the backups offsite. If a backup job fails, it should alert your staff.
Intrusion prevention: Viruses or other, similar invading programs may affect or damage your critical files. Where appropriate, run a vulnerability assessment on the target computers, and install, run, and update antivirus software on a scheduled basis.
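The backup and back-out recommendations above can be sketched in a few lines of Python. This is an illustration only, not a replacement for a real backup product: the file name is hypothetical, backups are kept in a local directory rather than offsite, and two backups of the same file taken within the same second would overwrite each other.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(path: Path, backup_dir: Path) -> Path:
    """Copy a file into backup_dir under a timestamped name and verify the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = backup_dir / f"{path.name}.{stamp}"
    shutil.copy2(path, target)
    if file_digest(target) != file_digest(path):
        # A failed backup should alert staff; raising is the simplest signal.
        raise OSError(f"backup of {path} failed verification")
    return target


def has_changed(path: Path, backup: Path) -> bool:
    """Report whether the live file differs from its backup copy."""
    return file_digest(path) != file_digest(backup)


def restore(backup: Path, path: Path) -> None:
    """Back out a change by copying the backup over the live file."""
    shutil.copy2(backup, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        config = Path(workdir) / "dialplan.ini"      # hypothetical file name
        config.write_text("lobby=local-only\n")
        saved = back_up(config, Path(workdir) / "backups")
        config.write_text("lobby=international\n")   # an unwanted change
        if has_changed(config, saved):
            restore(saved, config)
        print(config.read_text())
```

Comparing digests rather than timestamps means an unexpected change is detected even if someone resets the file's modification time.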
A clear-cut example of critical data files that you must protect are call routing tables, which describe dial plans and how each telephone call is routed as local, long-distance, and international calls are made. Routing tables control long-distance access, for example; you don't want people using the IP phone in the lobby of your offices to make calls to Nepal. Call routing tables also describe how incoming calls are forwarded. They are typically stored in a database and configured through an IP PBX or VoIP server.
Call routing files are an example of configuration files that are particularly fragile. Few tools are available for managing them well, and their configuration is something of a black art. Routing calls to the proper destinations can require very complex setup. VoIP servers must be configured to map the phone numbers dialed to their destinations, which may be an IP phone, or a phone in the PSTN.
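To make the idea of a call routing table concrete, here is a small sketch that maps dialed digits to a destination by longest matching prefix and refuses call types that a class of phone is not allowed to place. The prefixes follow North American dialing habits, and the route names, phone classes, and permissions are all assumptions for illustration; a real IP PBX stores and evaluates its dial plan in its own format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str       # leading digits that select this route
    call_type: str    # "internal", "local", "long_distance", or "international"
    destination: str  # hypothetical name of a phone group or PSTN gateway


# Illustrative dial plan; every entry here is an assumption.
ROUTES = [
    Route("4", "internal", "ip-phones"),
    Route("9", "local", "pstn-gateway"),
    Route("91", "long_distance", "pstn-gateway"),
    Route("9011", "international", "pstn-gateway"),
]

# Call types that each class of phone may place (also assumed).
ALLOWED = {
    "lobby": {"internal", "local"},
    "office": {"internal", "local", "long_distance", "international"},
}


def route_call(dialed: str, phone_class: str) -> Optional[Route]:
    """Pick the longest matching prefix; return None if nothing matches
    or if this class of phone may not place that type of call."""
    matches = [route for route in ROUTES if dialed.startswith(route.prefix)]
    if not matches:
        return None
    best = max(matches, key=lambda route: len(route.prefix))
    if best.call_type not in ALLOWED.get(phone_class, set()):
        return None
    return best


if __name__ == "__main__":
    # An international call attempted from the lobby phone is refused.
    print(route_call("90119771234567", "lobby"))
    print(route_call("90119771234567", "office"))
```

Even in this toy form, a single mistyped prefix or permission changes where calls go, which is why such files deserve tight change control.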
The fragility of call routing files highlights the points previously made: maintain tight control over who can access these files, and back them up frequently. If an invalid change is made, you can back it out by reverting to the last good backup copy.
The executable files for the operating system, and for relevant VoIP applications and management agents, are also critical to the operations of each computer. The same types of management requirements discussed for configuration and data files apply to executable files:
Access control: Access control deals with who has access to the physical keys, who can log on to the computer, and what changes they can make. Maintain strict control over who can read or modify executable and configuration files, as discussed above. However, you also need to maintain access control to determine who can execute the files and who can replace them. You also want to know each time anyone else accesses these files remotely.
Remote control: As with configuration files, you want to manage operating system and VoIP program files remotely. You also want to manage and monitor the applications remotely. Any given application may hang, run amuck, or consume the last of its available disk space. Less severely, application performance may simply degrade. In each case, you want remote-control access to the VoIP applications and program files, or you want your management software to take care of some of the problems when personnel are unavailable.
Install management software on all VoIP servers. But, look for management agents that are vendor-certified and consume few resources. You don't want a management program to create VoIP server performance problems by using large amounts of CPU time and memory.
Security: The VoIP software components are critical to your business, and you need to protect them with your highest level of security. Secure computing begins with physical security; where possible, the computers and network devices should be kept under lock and key. Access control, as discussed previously, is the next step.
The section "Software Reliability and Features" in Chapter 3, "Planning for VoIP," discusses working from clean computers. Install the operating systems and necessary applications from scratch. Then, run a vulnerability assessment and load the latest antivirus software. Enable intrusion-detection instrumentation wherever possible to prevent (or at least detect) unwanted security intrusions.
To avoid running into constraints on memory or other resources, and to avoid introducing unnecessary software vulnerabilities, keep the footprint of the operating system and applications as small as possible. Turn off unneeded services, and lock down the options on major applications such as web servers and database servers.
Update control: As time passes, software applications usually need version updates and patches. Keep careful records of server software versions and patches to reduce compatibility problems.
Establish a methodology and use automation tools to apply patches and roll out new software versions. You don't want to visit each computer or device each time there is a new version or patch for the operating system, one of its critical applications, or one of its management agents. Tools such as Microsoft's SMS enable you to distribute software updates from a central console, to selected computers, on a scheduled basis.
Modern operating systems keep a record of most of the events they see. They take note when key programs are started and stopped. They track application errors and suspect traffic arriving at the computer. Many computers and network devices also keep logs of these events, which are usually specific to a certain operating system, application, or network component. You can see the evidence of these events on your own computer; for example, open Windows Event Viewer or dump a UNIX syslog. (See Figure 6-2.)
Figure 6-2. Application Event Information In Event Viewer
In their simplest form, events come in three varieties: Error, Warning, and Information. You are probably aware of the significance of these on each system you are working with. When a failure occurs, when performance declines, or when intruders attack any part of a computer network, it is likely that telltale signs are left behind, written as events to the logs of computers near the problem.
Your goals for successful event management are to develop policies that describe which events are important for the health of your organization's network and to define what actions should be taken for each event. Appropriate actions include paging an administrator, sending e-mail, or calling someone, but, as much as possible, you would like the management system itself to execute the right corrective action automatically, in a timely manner.
In a VoIP system, many events are generated every second, every hour, every day. The sheer volume of events that are logged can make for a tremendous management task. Event management focuses on filtering the large amount of event information to find the important, relevant events and respond appropriately. You define rules for action when events of different severity occur.
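A rule table is one straightforward way to express "which events matter and what to do about them." The sketch below assumes a tiny event record and an ordered list of rules in which the first match wins; the rule contents and action names are invented for illustration and would come from your own response policies.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str    # system that logged the event
    severity: str  # "Error", "Warning", or "Information"
    message: str


# Ordered rules: (severity, keyword in message, action). First match wins.
# An empty keyword matches any message of that severity.
RULES = [
    ("Error", "disk", "page_admin"),
    ("Error", "", "send_email"),
    ("Warning", "threshold", "send_email"),
]


def action_for(event: Event) -> str:
    """Return the action for an event, or "log_only" if no rule applies."""
    text = event.message.lower()
    for severity, keyword, action in RULES:
        if event.severity == severity and keyword in text:
            return action
    return "log_only"


if __name__ == "__main__":
    sample = [
        Event("pbx1", "Error", "Out of disk space"),
        Event("gw2", "Warning", "Performance threshold exceeded"),
        Event("pbx1", "Information", "Service started"),
    ]
    for event in sample:
        print(event.source, event.severity, "->", action_for(event))
```

Keeping the rules as data rather than code makes it easier to review them against policy and to change them under the same control as any other configuration file.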
The following are some of the types of events and the places where they are recorded:
- Windows: system log, security log, application log, web browser log
- Application-specific event and log files
- Intrusion detection events
- SNMP events, which are often traps sent for monitored systems, covering a wide range of events, such as "performance threshold exceeded," "out of disk space," and so on
- Firewall log files, showing events such as connections established and connections refused
Managing all of these events and event logs is obviously not a job for a human to tackle alone. For one thing, trying to check every log fairly often is incredibly time-consuming. For another thing, you would like to see these events as they occur, rather than going back to each system after a problem and dumping its system logs. Clues to what is happening in a computer network are widely available; they are just spread out all over the place. Good event management means that you have the events from different systems correlated to isolate failures or detect broad attacks. And you have the log data consolidated and synchronized so that you can see what is happening and where. Most important, good event management generates automated responses to certain events, corresponding to the response policies you have established in your organization.
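Consolidating logs into a single timeline can be illustrated with Python's standard `heapq.merge`. The sketch assumes each system's log is already sorted by time and that all timestamps come from synchronized clocks in the same time zone; the log entries themselves are made up.

```python
import heapq
from datetime import datetime


def consolidate(*logs):
    """Merge per-system logs, each already sorted by time, into one timeline.

    Each entry is a (timestamp, source, message) tuple.
    """
    return list(heapq.merge(*logs, key=lambda entry: entry[0]))


if __name__ == "__main__":
    pbx_log = [
        (datetime(2024, 1, 1, 9, 0, 0), "pbx", "service started"),
        (datetime(2024, 1, 1, 9, 0, 5), "pbx", "call setup failed"),
    ]
    firewall_log = [
        (datetime(2024, 1, 1, 9, 0, 2), "firewall", "connection refused"),
    ]
    for timestamp, source, message in consolidate(pbx_log, firewall_log):
        print(timestamp.isoformat(), source, message)
```

Seen side by side, the refused connection at the firewall and the failed call setup a few seconds later suggest a relationship that neither log shows on its own.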
Event correlation in any large organization is a huge data-processing task, well beyond the capabilities of humans. It requires consolidating copious amounts of event data, eliminating redundancy in the data, discovering patterns in the events, and then initiating actions to respond to what is discovered. Despite the daunting nature and size of the task, some people continue to perform event correlation manually.
IT systems in a large organization can accumulate more than a terabyte of event data over a seven-day period. In addition, that data must be kept online for some period if the intent is to perform any forensics after a security intrusion has been detected.
When you can centralize event recording and handling, you also gain the ability to correlate events across an organization. For example, suppose someone attempting to crack a password at the VoIP server moves from workstation to workstation to avoid detection. Under normal circumstances, this method would not raise any alarms, because the only way to notice the moving intruder would be to look at event logs on a computer-by-computer basis. However, event correlation systems can see "the big picture" by gathering events from all of these locations. They can correlate these actions and detect a pattern that raises an alert, and then initiate an automated response, such as disabling a user ID for some period of time.
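The moving password cracker described above can be caught with a simple sliding-window count over centrally collected events. The following sketch is one possible approach, not a description of any particular product: the thresholds, the event tuple layout, and the function name are assumptions chosen for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_moving_attackers(events, max_failures=5,
                          window=timedelta(minutes=10), min_hosts=2):
    """Flag user IDs with many failed logons from several workstations.

    events: iterable of (timestamp, workstation, user_id) tuples for failed
    logons, sorted by timestamp. A user ID is flagged when at least
    max_failures failures from at least min_hosts workstations fall
    within the time window.
    """
    recent = defaultdict(deque)  # user_id -> recent (timestamp, workstation)
    flagged = set()
    for timestamp, workstation, user_id in events:
        history = recent[user_id]
        history.append((timestamp, workstation))
        # Drop failures that have aged out of the window.
        while history and timestamp - history[0][0] > window:
            history.popleft()
        hosts = {host for _, host in history}
        if len(history) >= max_failures and len(hosts) >= min_hosts:
            flagged.add(user_id)
    return flagged


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    failures = [(start + timedelta(minutes=i), f"ws{i % 3}", "jsmith")
                for i in range(6)]
    print(find_moving_attackers(failures))
```

An automated response, such as disabling the flagged user ID for some period, would be triggered from the returned set.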
Even after extensive data reduction, the task of correlation and pattern matching requires a strong analysis engine. In particular, you want the pattern analysis to identify points of failure across the networks, systems, and applications. Look for applications that are designed to perform the analytical and alerting tasks described here.
Finally, it makes sense to forward the summarized event to a central management console, where it can be consolidated with other events appearing across a broader range of components, including hardware.
Achieving high availability for your VoIP system can be viewed as a process of reducing downtime. The best way to avoid downtime is to avoid problems altogether, a core management process covered in more detail in the next section. In a network with real hardware and software, unexpected failures do inevitably occur, however. To reduce downtime, you want to find and isolate the failure quickly, and then minimize its impact by fixing it quickly. Finding and fixing problems quickly is part of your day-to-day operations.
When applications and networks consisted of terminals accessing mainframes, problem determination was much easier. Now, with a mix of protocols, applications, and dispersed intelligence, your job is much more difficult. If a user is unable to get a dial tone, is the server or the network at fault? You need to make this top-level diagnosis quickly, because you often have different teams who specialize in either network or application troubleshooting.
In which place is the source of the problem most likely located? That is, where should your team look first when doing fault isolation? The following are a few considerations:
- Places where the most recent changes were made
- Places where there have previously been failures
- Places where the monitoring trends show escalating trouble
In the diagnosis of a VoIP problem, it is important to know the network path between the phones. There could be many devices in between the two endpoints of a phone call. Each device and link in the path represents a potential point of failure. If you have good records from your network inventory and topology diagram, fault isolation is easier. But nowadays, tools are available that map a logical path between two phones. Once the path is mapped, each device and link along the path can be monitored to isolate the problem. Figure 6-3 shows the path between two devices on the network.
Figure 6-3. Path Between Two Points in a Network
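The idea of mapping a path and then checking each device along it can be sketched as follows. The topology is a hypothetical adjacency list, the path is simply the shortest one through that topology (a real tool would trace the route the call packets actually take), and the health check is passed in as a function because it depends on your monitoring system.

```python
from collections import deque

# Hypothetical topology: each device maps to the devices it links to.
TOPOLOGY = {
    "phone-a": ["switch-1"],
    "switch-1": ["phone-a", "router-1"],
    "router-1": ["switch-1", "router-2"],
    "router-2": ["router-1", "switch-2"],
    "switch-2": ["router-2", "phone-b"],
    "phone-b": ["switch-2"],
}


def find_path(topology, start, end):
    """Breadth-first search; returns the list of devices from start to end,
    or None if the two are not connected."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in topology.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def first_unhealthy(path, is_healthy):
    """Return the first device on the path that fails the health check."""
    for device in path:
        if not is_healthy(device):
            return device
    return None


if __name__ == "__main__":
    path = find_path(TOPOLOGY, "phone-a", "phone-b")
    print(" -> ".join(path))
    # Pretend that monitoring reports router-2 as down.
    print(first_unhealthy(path, lambda device: device != "router-2"))
```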
Best-practice fault management means applying the principles of incident tracking. Tracking an incident means that someone owns the problem at every step, and that the current status of the problem is always visible. With proper tracking, every incident should have the following:
- An author: The person who found it
- An owner: The person who is currently responsible for it
- A status indication: Open, assigned, under investigation, or resolved
- A resolution code: To identify problem areas and trends
- A severity: An indication by the author of how bad the problem appears
- A priority: A ranking by the owner, who prioritizes it among all of the owner's other problems
- A sizing: An estimate of the effort required to find and fix it
- A schedule: An estimate by the owner of when the problem should be fixed
- A problem description: A detailed description of the problem and re-creation scenarios
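The incident attributes listed above map naturally onto a small record type. The sketch below is one possible shape, with assumed field types, an assumed severity scale, and two illustrative state changes; a real incident-tracking system would add history, timestamps, and permissions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    ASSIGNED = "assigned"
    UNDER_INVESTIGATION = "under investigation"
    RESOLVED = "resolved"


@dataclass
class Incident:
    author: str                            # person who found the problem
    description: str                       # details and re-creation scenario
    severity: int                          # author's view; 1 = worst (assumed scale)
    owner: Optional[str] = None            # person currently responsible
    status: Status = Status.OPEN
    priority: Optional[int] = None         # owner's ranking among their problems
    sizing_hours: Optional[float] = None   # estimated effort to find and fix
    scheduled_fix: Optional[str] = None    # owner's estimate of the fix date
    resolution_code: Optional[str] = None  # used to spot problem areas and trends

    def assign(self, owner: str, priority: int) -> None:
        """Give the incident an owner, who sets its priority."""
        self.owner = owner
        self.priority = priority
        self.status = Status.ASSIGNED

    def resolve(self, resolution_code: str) -> None:
        """Close the incident; someone must own it at every step."""
        if self.owner is None:
            raise ValueError("an incident must have an owner before it is resolved")
        self.resolution_code = resolution_code
        self.status = Status.RESOLVED


if __name__ == "__main__":
    incident = Incident("alice", "No dial tone on third-floor phones", severity=1)
    incident.assign("bob", priority=1)
    incident.resolve("config-change")
    print(incident.status.value, incident.resolution_code)
```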
Often problems are reported using information that is based on their symptoms. When the same symptoms are seen again, you can go back and see what the fix was the last time this occurred. You will determine either that the problem has recurred or that the symptoms have been caused by something new. A really excellent system can capture the symptoms and the solutions, in order of likelihood for your location, making it straightforward to debug a problem given its problem description.
Often, when a severe system-wide failure or security attack occurs, you need to drop everything. Immediately, you must act to reduce the depth and breadth of the damage. Firefighting like this is a poor way to spend your IT budget.
But, in a larger sense, all firefighting is costly, because everything that is productive and proactive stops, sometimes throughout the entire organization. In lieu of making forward progress, you try to reduce the amount that you fall back. Schedules slip, people become stressed out and lose sleep, more accidents occur, morale declines, and rotten conditions prevail. Lots of collateral damage occurs as a result of a severe failure or attack, potentially including damage to your reputation or the reputation of the whole organization.
Develop a firefighting plan. Establish a set of processes to be followed when a system-wide failure does occur. Plan ahead for fighting fires, to reduce the chaos when they arrive. Hold "fire drills" by simulating problems that your team must handle. Everyone on the team should have clear assignments, and should be able to tell when one step is complete so that they can move to the next step. Let every incident become a lesson on how to prevent or reduce the size of the next fire. If necessary, change your management policies so that certain types of fire don't happen again.