11.8 Open Source Management Tools




There is a continuing debate about whether it is better to build or buy management tools. Advocates of the open-source model cite the high cost of management applications as reason enough to “build it yourself.” Like their counterparts in the operating systems market, management software vendors have priced their offerings so high that they have virtually invited the open-source response. Moreover, many of these management solutions have become far too complex to deploy and offer a range of features that buyers often do not want—either because they do not really need them or because they are just too complicated to implement. Consequently, organizations end up paying for functionality that goes unused. Open-source packages take the opposite approach, allowing users to start with a set of core functions and then add others as required at little or no expense.

Open-source software implies that the “source code” is freely available and may be enhanced by anyone who wants to offer new features. By comparison, “closed source” or proprietary software keeps its source code hidden so that customers cannot make changes. SNMP is an example of the open approach: its specification is freely available, and implementations may be extended to offer additional functionality. Hewlett-Packard’s OpenView is an example of a closed-source product that must be purchased and whose source code is not accessible to customers who might like to change it. With mixed results, the open-source movement is turning its attention to network management software, both to add needed functionality and to help users avoid the exorbitant costs big vendors charge for their outdated platforms.

True open-source products are governed by the GNU General Public License (GPL). Software published under the GPL may be used and modified by anyone. Among other things, this frees developers to share and improve code without concern for corporate intellectual property issues, allowing a community to focus on solving common problems and to leverage a much larger test environment, which can result in more robust software. The success of the Linux operating system in the corporate environment attests to the validity and effectiveness of this software development method.

From a strategic perspective, the open-source movement has compelling ramifications for network management because it offers organizations the ability to differentiate themselves based on the skill with which they can build highly customized management solutions. Since virtually all enterprise applications have been extended to remote locations, telecommuters, mobile professionals, and even customers and partners via IP networks, the ability to solve problems quickly and keep those applications running at peak performance offers a significant competitive advantage. Firms that lack the ability to custom-build management solutions must rely on expensive, often unwieldy, off-the-shelf packages. In today’s slow-growth economy, the availability of the right tool at the right time could spell the difference between profit and loss.

11.8.1 Sample Tools

There are a number of open-source network management tools that can be downloaded from developers’ Web pages, along with documentation and release histories. For additional support, there may be frequently asked questions (FAQs), invitations to e-mail the author, subscriptions to mailing lists, and Internet relay chat (IRC) discussion forums. Many of these open-source tools are text-based and provide very specific functions. Those that are graphical typically do not have the polished look of high-priced products from traditional suppliers. A sampling of open-source management tools follows.

OpenNMS

This open-source project is intended as the foundation of an enterprise-grade network management platform. It addresses two main aspects of network management: detecting when faults occur and measuring performance. It has pollers that simulate users’ access to a network service in order to detect faults and determine service levels, and it includes a collection engine for gathering performance data. Reporting and notification capabilities are accessed through a Web browser. Once installed and configured, much of the operation of the product is automated.

Since OpenNMS is built to simulate a user, its focus starts at the application layer (Layer 7) of the OSI reference model and works its way down. Polls are made to network services just as the users would access them, which dispenses with the need for application-specific agents.

After the system is started, the discovery process sends pings to every address configured for discovery. If a device responds, discovery generates a “suspect node” event. These events can also be generated manually or through a script, allowing the automatic discovery process to be bypassed entirely. By default, discovery repeats 24 hours after it has run, in order to determine whether new devices have appeared on the network.
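
For illustration, the following minimal Python sketch mimics the discovery pass described above. It is not OpenNMS code: the address range is hypothetical, and the Linux ping flags (-c for count, -W for timeout) are assumed.

import ipaddress
import subprocess

def discover(cidr="192.0.2.0/28"):          # hypothetical discovery range
    """Ping every address in the range; responders become 'suspect nodes'."""
    suspects = []
    for addr in ipaddress.ip_network(cidr).hosts():
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", str(addr)],   # one echo, 1 s timeout
            capture_output=True,
        )
        if result.returncode == 0:
            suspects.append(str(addr))      # would queue a "suspect node" event
    return suspects

print(discover())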

During a poll, a test is run to determine whether the monitored service is available; if it is not, an outage event is generated. When an outage is detected, the system can be configured for adaptive polling. By default, the poller checks a service such as HTTP on a node every 5 minutes. If the HTTP service is down, as determined by multiple failed attempts to reach it over a short period of time, a “node lost service” event is generated and the poller shortens the polling interval from 5 minutes to 30 seconds. If the service is still down after 5 minutes, the poller resumes normal 5-minute polls; if it is still down after 12 hours, the polling rate drops further to once every 10 minutes. If an outage lasts 5 days or more, the service is removed from the node. All of these polling intervals are configurable.
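
This schedule can be summarized in a few lines of Python. The sketch below captures only the default intervals quoted above; the real poller reads these values from its configuration.

def next_poll_interval(seconds_down):
    """Poll interval (seconds) for a service down this long; None = remove it."""
    if seconds_down == 0:
        return 300                   # service up: normal 5-minute polls
    if seconds_down < 5 * 60:
        return 30                    # just lost: poll every 30 seconds
    if seconds_down < 12 * 3600:
        return 300                   # down more than 5 minutes: back to 5-minute polls
    if seconds_down < 5 * 24 * 3600:
        return 600                   # down more than 12 hours: every 10 minutes
    return None                      # down 5 days or more: remove the service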

For any device that supports SNMP, data specific to that device can be collected and presented in reports. The data can be at the device level (e.g., CPU utilization, free memory) or at the interface level (e.g., traffic in/out, errors). Non-IP interfaces can be polled as long as the SNMP agent on the device can be reached via at least one IP address.
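
As a hedged illustration of such a poll, the sketch below reads one device-level value (sysUpTime) using the third-party pysnmp library. The host address and community string are placeholders; OpenNMS uses its own collection engine rather than this code.

from pysnmp.hlapi import (getCmd, SnmpEngine, CommunityData,
                          UdpTransportTarget, ContextData,
                          ObjectType, ObjectIdentity)

def poll_sysuptime(host, community="public"):
    """Fetch sysUpTime.0 from a device; returns None on timeout or error."""
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community),                         # SNMPv2c by default
        UdpTransportTarget((host, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.2.1.1.3.0")),  # sysUpTime.0
    ))
    if error_indication or error_status:
        return None                                       # treat as a missed poll
    return int(var_binds[0][1])                           # hundredths of a second

print(poll_sysuptime("192.0.2.1"))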

An event management system displays the events currently in the database via a Web-based interface. Events can be generated internally (for example, when there is a service outage or a data collection threshold is exceeded) or externally, via a command or an SNMP trap. Automatic actions can be taken based on the type of event and the information passed with it. For every event in the system, a notification can be sent, and events used in notifications can be filtered by source and service.

Through the Web browser, which requires a user name and a password, users can determine the current system state and current service levels via the main console. They can list and manage events and notifications and generate reports.

Admin-level users can perform system management tasks in addition to all of the functions available to other users.

Multirouter Traffic Grapher

The multirouter traffic grapher (MRTG) copes gracefully with a weakness in the way SNMP data is collected and delivered to the management station for analysis. SNMP relies on UDP to send request messages to a target device, asking for the performance information its agent has collected. But UDP is connectionless, meaning that the delivery of request messages, and of the information the target device sends back, is not guaranteed. Unlike TCP, which tracks packets, acknowledges their receipt, and, if needed, requests retransmission of missing or corrupt packets, UDP simply issues packets into the network on the assumption that they will arrive at their proper destination. Sometimes the packets are dropped along the way.

Some likely causes of lost SNMP data are path congestion and busy routers; links with high error rates are another possibility. In such cases, where some SNMP packets are lost but traffic is still flowing, the resulting graph is filled with gaps. MRTG interpolates the lost data to produce a smoother graph that is more accurate in cases of intermittent packet loss. When an SNMP query goes out and no response comes back, MRTG must assume something to put in the graph; by default, it assumes that the last answer it received is probably closer to the truth than zero. This trade-off does not hold up during a total outage, but the user can choose whether to apply it.
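
A few lines of Python convey the default behavior. This is a sketch of the idea only, not MRTG’s Perl internals.

def fill_gaps(samples):
    """samples: numbers, with None marking a lost SNMP response."""
    filled, last = [], 0
    for value in samples:
        if value is None:
            value = last        # the last answer is likelier than zero
        filled.append(value)
        last = value
    return filled

print(fill_gaps([120, 130, None, None, 125]))   # -> [120, 130, 130, 130, 125]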

Round Robin Database

The Round Robin Database Tool (RRDTool) reimplements MRTG’s graphing and logging features with more speed and flexibility. RRDTool provides a system to store and display time-series data such as network bandwidth, machine-room temperature, and server load average. It stores the data in a very compact way and presents it in the form of useful graphs by processing the data to enforce a certain data density. It can be driven by simple wrapper scripts written in shell or Perl, or by a front end that polls network devices and presents the collected data in a user-friendly graphical format.
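
The following sketch shows the wrapper-script approach from Python, shelling out to the rrdtool command line to create a database and log one sample. The file name, data-source name, and step are illustrative.

import subprocess
import time

# One 5-minute data point; GAUGE values; 600 s heartbeat; keep 576 averages (2 days).
subprocess.run(["rrdtool", "create", "rtt.rrd", "--step", "300",
                "DS:rtt:GAUGE:600:0:U",
                "RRA:AVERAGE:0.5:1:576"], check=True)

# Log one sample in timestamp:value form.
subprocess.run(["rrdtool", "update", "rtt.rrd",
                f"{int(time.time())}:42"], check=True)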

MRTG can be used to graph the round-trip time and packet loss of ping echoes sent from the data center to each remote location on the network. The historical network information MRTG provides is important in situations where the network manager gets phone calls from users stating that they were not able to connect to the servers a few minutes ago, but everything seems to be working fine now. The collected data can show if there was packet loss or long round-trip times on the Internet during the period when users could not connect.

However, MRTG cannot combine round-trip time (milliseconds) and packet loss (percentage) on the same graph, forcing network managers to graph them separately. Using RRDTool’s infinity value (INF), network managers can put both sets of data on a single graph (Figure 11.6). And with RRDTool’s fetch command, information can be pulled out of the RRD files for other uses. For example, a current-status page can be provided to the organization’s help desk so that staff can easily check whether a major network outage has taken place; this information can be useful when answering trouble calls.
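
One way to realize the INF technique is sketched below: a CDEF turns any nonzero loss sample into an infinitely tall area, which rrdtool clips to the full height of the graph behind the round-trip-time line. The data-source and file names are assumptions for illustration.

import subprocess

subprocess.run(["rrdtool", "graph", "status.png",
                "DEF:rtt=ping.rrd:rtt:AVERAGE",
                "DEF:loss=ping.rrd:loss:AVERAGE",
                "CDEF:lossband=loss,0,GT,INF,UNKN,IF",  # full-height band where loss > 0
                "AREA:lossband#FFC0C0:packet loss",
                "LINE2:rtt#0000FF:round-trip time (ms)"], check=True)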

Figure 11.6: Tobi Oetiker’s RRDTool showing packet loss and round-trip delay.

Big Sister

This tool provides a simple Web browser view of current network status, generates alarms on status changes, and provides a history of status changes. The tool interoperates with other Big Sister instances and commercial network monitors such as HP’s OpenView.

Big Sister provides detailed status information per tested host or service. A network administrator can use Big Sister to set up a custom view of the network. Big Sister supports grouping, summarizing, graphical displays, and other features. It also logs performance data and visualizes it via RRDTool. The way Big Sister displays its information is configurable by the user.

Hyperlinks on the entry page guide the administrator through more detailed views. Status changes are logged, and history data is available for immediate viewing or can be stored for later reference. The tool can send alarms on status changes via mail or pager. An alarm summary page lists pending alarms (see Figure 11.7) that are viewed through a Web browser, which eases its use in collaborative environments. Alarm details can be viewed, acknowledged, or deleted.

Figure 11.7: Big Sister by Aeby Graeff showing alarms.

mon Service Monitoring Daemon

Developed under Linux, mon is a client-server scheduler and alert management tool used for monitoring service availability and triggering alerts upon failure detection. These functions are implemented by two separate programs. Monitors and alerts are not a part of the core mon server, but the distribution comes with samples to get the user started. This means that if a new service needs monitoring, or if a new alert is necessary, the mon server does not need to be changed, making mon easily extensible. Adding a test for a new service entails writing a monitor in any language and putting it in the monitor directory.
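
A monitor can be as simple as the Python sketch below. Mon passes the hosts of a group as command-line arguments; a nonzero exit status signals failure, and by convention the first line of output names the hosts that failed. The TCP port checked here is just an example.

#!/usr/bin/env python3
import socket
import sys

failed = []
for host in sys.argv[1:]:           # mon passes hostnames as arguments
    try:
        socket.create_connection((host, 80), timeout=5).close()  # sample check
    except OSError:
        failed.append(host)

if failed:
    print(" ".join(failed))         # summary line reported by mon
    sys.exit(1)                     # nonzero exit = failure detected
sys.exit(0)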

Mon includes support for asynchronous events communicated to the mon server. Traps generated by remote entities can be programmed to behave in the same manner as failures identified by local polling monitors. This makes it possible to build a distributed monitoring architecture. For example, remote monitoring domains, such as sites separated by slow WAN lines, can collect their own data locally and report significant events to a centralized location, such as a NOC.

Alert scripts send a message or otherwise act on a failure that mon detects, and repetitive alerts can be suppressed. These alerts, like the monitors, are not part of mon but are easy to add. So-called “upalerts,” which trigger an alert when a server comes back up after being down for a long period of time, are also supported. Mon keeps a history of failures and alerts that clients can query.

Failure of any monitor can trigger one or more alerts to different people at different times, making it possible to construct “on-call” schedules. For example, a page can be sent to all system administrators if a resource goes down before 8:00 p.m., but after that time, a page can be sent to the specific system administrator on duty while an e-mail is sent to everyone else.
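
The sketch below mirrors that example in Python. In practice mon expresses such schedules in its configuration file, so the addresses and the 8:00 p.m. cutoff here are purely illustrative.

from datetime import datetime

ADMINS = ["alice@example.net", "bob@example.net"]   # hypothetical staff
ON_CALL = "bob@example.net"                         # tonight's duty admin

def route_alert(now=None):
    """Decide who gets paged and who gets e-mail for a failure right now."""
    now = now or datetime.now()
    if now.hour < 20:                               # before 8:00 p.m.
        return {"page": ADMINS, "mail": []}
    return {"page": [ON_CALL],                      # after 8:00 p.m.
            "mail": [a for a in ADMINS if a != ON_CALL]}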

Mon supports parallel tasking: services on different hosts or groups of hosts can be checked simultaneously. Hosts can be grouped together, and each host or group can have multiple services. For example, a ping of routers can occur at the same time as a ping of Web servers; there is no queue that can postpone the scheduled testing of other services. Interservice dependencies and event correlation prevent the system administrator from becoming overwhelmed by cascading alerts, which happens when some critical resource is not accessible. A service failure can be acknowledged so that alerts are suppressed until the problem is fixed. Also, alerts for particular hosts, groups, or services can be temporarily disabled and re-enabled by the client without stopping and restarting the server. If a particular server is being upgraded, for example, the alert can be disabled while the work is being done and re-enabled on completion.

To help with large configurations, “views” can be generated to simplify reports for those who do not need to know the status of all services being monitored. For example, a network view can be generated that includes the status of all networking gear, just as a servers view can show all information pertaining to servers. Views can be configured on a per-user basis if needed, with each system administrator able to control his or her own views.

11.8.2 Risk Factors

There is some risk in relying too much on open-source network management software. Since nobody owns it, technical support may be difficult to obtain. This is one of the hurdles that had to be cleared before Linux could make significant headway in the enterprise environment as an alternative operating system to Windows and UNIX. Although vendors like Red Hat, SuSE, and Caldera (now known as The SCO Group) built their Linux distributions on an open-source kernel, they quickly realized that corporate acceptance hinged on the ready availability of technical support. Likewise, IBM, Hewlett-Packard, Computer Associates, Oracle, and other vendors had to build extensive support infrastructures to position themselves for enterprise sales of Linux products.

Mitigating the support risk, however, is the fact that open-source products are peer-reviewed and thoroughly tested before distribution. This is how SNMP and TCP/IP were developed over the years, and nobody dismisses them for lack of development discipline or support availability. It can be argued, however, that these successes were due primarily to government funding.

It is recommended that open-source network management tools be tested extensively before relying on them exclusively. These tools are developed by individuals or teams of people whose contributions to the source code may be few and far between. Web pages may not be up to date, and the content of some is a few years old, raising doubt about the viability of any downloaded code. Support is often spotty, and projects seem to be abandoned as many times as new ones start up. Other tools seem to be in eternal beta development and are acknowledged by the author as incomplete or buggy. There is less risk, however, if the user is technically savvy and can add code to enhance functionality and fix bugs.

Traditional NMS vendors have not yet embraced the open-source movement the way application and hardware vendors have for Linux. Currently, there is no all-encompassing project that seeks to tie together all the independent open-source utilities that have sprung up over the years. Consequently, there is no credible threat to traditional NMS vendors and certainly no incentive for them to participate in developing an open-source platform. This may happen in the future, as it did for Linux, but only when a groundswell of support materializes from enterprise users. To date, open-source network management has not progressed beyond the small-project stage. Nevertheless, some of these tools show promise and are worth trying.


