Network-Monitoring Fundamentals | Network Administrators Survival Guide

Network monitoring is a vast field comprising a wide array of techniques and protocols. In addition, commercial and regulatory bodies coin terminology that becomes part of the industry jargon. A basic awareness of the common terms and techniques used for network monitoring helps the Netadmin understand and deploy network-monitoring systems.

Network-Monitoring Terms

Network monitoring has been overhyped by the use of marketing jargon such as SLA, MTTR, five nines, and others. The precise definitions of these terms are subject to interpretation by different organizations. The following sections describe some of the most common terms with reference to their accepted use within the industry.

Service-Level Agreement

A service-level agreement (SLA) describes the services that the telco (or ISP) provides to the consumer. This agreement or contract typically specifies the following details:

Responsibilities and expectations of the parties involved
Network metrics, such as latency, jitter, and packet loss
Durations of scheduled or unscheduled downtimes and the minimum notice period before scheduling maintenance downtimes
Time to respond to and resolve help desk calls

To set expectations, organizations also define SLAs for the services offered by their IT departments to other internal departments and employees.

Mean Time to Repair

The mean time to repair (MTTR) is the average downtime for a device or service over a predefined time period. The downtime is measured from the instant the device fails to the moment when it is brought back to a fully functional state.

Mean Time to Respond

The time to respond is measured from the instant a user reports a failure to the moment a service representative responds. The mean time to respond is the average of all the instances of time to respond.

Mean Time Between Failures

The mean time between failures (MTBF) is the average uptime for a device or service over a predefined time period. Each uptime interval is measured from the instant the device becomes fully functional to the moment it fails.
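As a rough illustration of how MTTR and MTBF could be derived from outage records, the following Python sketch averages downtime and uptime over the number of failures in a period. The function name `mttr_and_mtbf` and the record layout (a list of failure and restoration timestamps) are assumptions made for this example, not part of any particular monitoring product.

```python
from datetime import datetime, timedelta


def mttr_and_mtbf(period_start, period_end, outages):
    """Return (MTTR, MTBF) as timedeltas for one device.

    outages is a list of (failed_at, restored_at) datetime pairs.
    Assumption: the pairs do not overlap and all fall inside the period.
    """
    if not outages:
        raise ValueError("at least one outage is needed to compute a mean")

    # Total downtime is the sum of every failure-to-restoration interval.
    downtime = sum(
        (restored - failed for failed, restored in outages), timedelta()
    )
    # Everything else in the period counts as uptime.
    uptime = (period_end - period_start) - downtime

    failures = len(outages)
    return downtime / failures, uptime / failures


# Hypothetical 100-day period with a single 10-day outage at the end.
start = datetime(2024, 1, 1)
end = start + timedelta(days=100)
mttr, mtbf = mttr_and_mtbf(start, end, [(start + timedelta(days=90), end)])
print(mttr.days, mtbf.days)
```

With several outages in the list, both values become averages per failure, which is how the terms are used in this section.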

Availability

For a given period, availability is the percentage of time that the device is up or available. If a device is up for 90 days (MTBF = 90 days) and down for 10 days (MTTR = 10 days), over a 100-day period, its availability is 90 percent. This can be written as any of the following equations:

Availability = 100 * [Uptime / (Total time)]

Availability = 100 * [Uptime / (Uptime + Downtime)]

Availability = 100 * [MTBF / (MTTR + MTBF)]
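The equations above can be sketched as a small Python function. The name `availability_percent` is an assumption made for this example; the inputs can be total uptime and downtime, or MTBF and MTTR, as long as both use the same unit.

```python
def availability_percent(uptime, downtime):
    """Availability = 100 * uptime / (uptime + downtime).

    Both arguments must use the same unit (days, hours, minutes, ...).
    """
    total = uptime + downtime
    if total <= 0:
        raise ValueError("total time must be positive")
    return 100 * uptime / total


# The 100-day example from the text: MTBF = 90 days, MTTR = 10 days.
print(availability_percent(90, 10))
```

Because the formula is a ratio, the unit cancels out, so the same function works whether the figures come from a daily report or a yearly one.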

Five Nines

A system with an availability of 99.999 percent is called a five nines system. This is perhaps one of the most popular terms among executives and managers when discussing the performance of a network. A system with an availability of five nines over one calendar year can be expressed as follows:

99.999% = 100 * [Uptime / (1 year)]

Uptime = 0.99999 year

Because Downtime = Total time - Uptime, the following equations apply:

Downtime = 1 year - 0.99999 year = 0.00001 year

Downtime = 0.00001 year * (365 * 24 * 60) minutes/year

Downtime = 5.256 minutes

To maintain a network so that it is down for only 5.256 minutes in an entire year is quite impressive; hence the hype attached to five nines.
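The same arithmetic can be repeated for other availability targets. The following Python sketch is only an illustration: the function name `allowed_downtime_minutes` is an assumption for this example, and it uses the same 365-day year as the calculation above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def allowed_downtime_minutes(availability_percent,
                             period_minutes=MINUTES_PER_YEAR):
    """Downtime = Total time - Uptime, expressed in minutes."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * period_minutes


# Compare three, four, and five nines over one year.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.3f} minutes")
```

Each additional nine cuts the yearly downtime budget by a factor of ten: roughly 525.6 minutes at three nines, 52.56 minutes at four nines, and 5.256 minutes at five nines.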

Network-Monitoring Techniques

Network monitoring, in its simplest form, involves probing a network device by sending a ping (Internet Control Message Protocol [ICMP] Echo Request) packet to a destination machine. Successful receipt of the response (the ICMP Echo Reply) confirms network reachability. It is also a reasonable indication that the Transmission Control Protocol/Internet Protocol (TCP/IP) stack on the destination is working properly.

Most network-monitoring systems (both open source and proprietary) use this technique as their primary means of monitoring network availability. This technique is also called agentless monitoring because it eliminates the need to install a software agent on the monitored node. The round-trip time (RTT) delay of the packet provides information regarding the network conditions. Comparing the new RTT value with historic values can help the Netadmin identify bottlenecks and performance issues, or alert the Netadmin to communication failures.
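A minimal Python sketch of this approach follows. It is not how any particular monitoring product works; the function names, the threshold of three times the historic median, and the status labels are all assumptions made for illustration. It also assumes a Unix-like `ping` command that accepts `-c 1` (Windows uses `-n 1` instead) and that prints the RTT in the common `time=11.2 ms` form.

```python
import re
import statistics
import subprocess

# Matches the RTT field in typical ping output, for example "time=11.2 ms".
_RTT_PATTERN = re.compile(r"time[=<](\d+(?:\.\d+)?)\s*ms")


def parse_rtt_ms(ping_output):
    """Extract the first RTT in milliseconds, or None if there is none."""
    match = _RTT_PATTERN.search(ping_output)
    return float(match.group(1)) if match else None


def ping_once(host, timeout_s=3):
    """Send one Echo Request; return the RTT in ms, or None on failure."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],  # assumption: Unix-like ping
            capture_output=True, text=True, timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    if result.returncode != 0:
        return None
    return parse_rtt_ms(result.stdout)


def classify(rtt_ms, history_ms, factor=3.0):
    """Compare a new RTT with historic values (hypothetical threshold)."""
    if rtt_ms is None:
        return "unreachable"
    if history_ms and rtt_ms > factor * statistics.median(history_ms):
        return "degraded"
    return "ok"


sample = "64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=11.2 ms"
print(classify(parse_rtt_ms(sample), [10.0, 11.0, 12.0]))
```

In practice a monitor would call `ping_once` on a schedule, append each result to a per-device history, and raise an alert when `classify` reports anything other than "ok" for several consecutive probes.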

Advantages of network monitoring using ICMP are as follows:

Allows agentless monitoring, which requires no installation or configuration on the monitored devices
Provides the ability to monitor any network device irrespective of operating system, vendor, or device type
Provides a quick bird's-eye view of the entire network
Is simple to implement

Some disadvantages are as follows:

Is limited to monitoring device status at the network layer only
Cannot monitor services such as HTTP, FTP, and SMTP
Can fail because many networks block ICMP traffic
Can congest slower WAN links because the monitoring traffic itself consumes bandwidth
Can produce false results in redundant networks by reporting standby interfaces as being down

Caution

ICMP traffic is often assigned a lower priority. If the router CPU utilization is high, the router might not respond to ICMP messages even though it continues to process the network traffic. Additionally, many ISPs block ICMP packets. These factors can result in high round-trip times or timeouts, which can lead to inaccurate results when monitoring networks.