Notification Processing | Network Management, MIBs and MPLS: Principles, Design and Implementation

Notification processing is an important part of network fault management. This is the F (fault) part of the FCAPS areas and arguably the most critical part of any NMS, because faults generally reflect problems in the network. Network problems can in turn affect end users. Notifications are the means by which SNMP agents asynchronously communicate problems to their NMS. From a scalability perspective, notifications provide a cue for remedial action from the NMS in response to some change in the network. This reduces the need for polling by the NMS. A number of issues arise in relation to SNMP notifications:

Notifications are not acknowledged by the NMS (unless they are informs).
Notifications are transported over UDP and hence delivery is unreliable.
Faulty NEs can generate many notifications.
Aggregated services that become faulty can result in notification storms.
New hardware being added to (or reconfigured in) a network can produce notification storms.

When an NMS receives an SNMP trap over an unreliable transport, it never acknowledges it. This is in the interests of scalability and keeping the management protocol as lightweight as possible. It also helps avoid exacerbating situations such as network congestion. When an agent detects a problem, it sends a best-effort notification message and delegates resolution of the underlying problem to the NMS. Networks are often designed to leave an absolute minimum of about 25 percent of bandwidth free so that routing, signaling, and management protocols can continue to operate at all times. If this is adhered to, then in theory agent notifications should always get through to the NMS. This enables the latter to carry out some meaningful remedial action.

Faulty NEs can generate large numbers of notifications; for example, if a node interface is flapping up and down, then each status transition results in a new notification. The NMS user should quickly try to resolve this by downing the associated link or resolving the underlying problem with the interface.

Aggregated services, such as layer 2 VPNs (as we saw in Chapter 3, "The Network Management Problem," Figure 3-2), may have thousands of underlying connections. If a major fault occurs, such as a fiber cut, then the originating node for each affected connection may legitimately emit a notification. This can result in a great many notifications, particularly for the increasingly dense next-generation NEs (described in Chapter 3). If the NEs are aware they are participating in a VPN, then it should be possible to intelligently reduce the number of notifications, as discussed next.

MIB Note: Scalable Aggregated Services

Managed objects that are constituents of an aggregated service (such as all the virtual circuits in a layer 2 VPN) can be logically grouped by an NMS. This allows the network operator to view the service as a whole rather than as a collection of objects. The NEs are not generally aware of such a grouping. This means that the associated NEs cannot act in concert in the event of faults. This can lead to problems like notification storms.

A MIB table that expressed membership of aggregated services like VPNs could help prevent such notification storms. MIB indexes of members (e.g., virtual circuits) could be entered in the table, and the NEs could then negotiate overall service status before issuing notifications. This would have the effect of pushing more intelligence into the network and reducing the burden on the NMS. Given the trend towards increasingly dense NEs with more complex component objects (such as layer 2 and layer 3 VPNs), this type of issue may become more important.
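The grouping idea can be illustrated with a small sketch. This is not a real MIB or agent implementation: the membership table, the service names, and the summarize_faults function are all hypothetical, and the "negotiation" between NEs is reduced to a single lookup. It only shows how a membership table lets many per-circuit faults collapse into one summary per service.

```python
from collections import defaultdict

# Hypothetical membership table: service name -> indexes of member circuits.
# The names and layout are illustrative and not taken from any standard MIB.
SERVICE_MEMBERS = {
    "vpn-blue": {101, 102, 103},
    "vpn-red": {201, 202},
}


def summarize_faults(failed_circuits):
    """Collapse per-circuit faults into one summary per affected service."""
    # Invert the table so each circuit index maps to its owning service.
    circuit_to_service = {
        circuit: service
        for service, members in SERVICE_MEMBERS.items()
        for circuit in members
    }
    affected = defaultdict(set)
    for circuit in failed_circuits:
        service = circuit_to_service.get(circuit)
        if service is not None:
            affected[service].add(circuit)

    summaries = []
    for service, failed in sorted(affected.items()):
        total = len(SERVICE_MEMBERS[service])
        status = "down" if len(failed) == total else "degraded"
        summaries.append((service, status, len(failed), total))
    return summaries


# Four circuit faults become two service-level summaries.
for summary in summarize_faults([101, 102, 103, 201]):
    print(summary)
```

With thousands of member circuits per service, the same reduction would turn a storm of individual notifications into a handful of service-level ones.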

NNM Notification Processing

NNM uses the term event to describe NE notifications as well as messages from other sources (e.g., external applications). NNM provides an alarm browser for all such events. Important (service-affecting) events can then be configured to show up as alarms so that operator intervention is prompted.

NNM distinguishes between SNMP notifications and events. The lifecycle for a notification is as follows:

An NE sends a notification.
The notification is received by NNM and logged.
NNM then distributes the notification to applications that registered for it.
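The three lifecycle steps above (receive, log, distribute to registered applications) follow a familiar publish-and-subscribe pattern. The sketch below shows that pattern in general terms only. It is not NNM's API: the NotificationDispatcher class, its method names, and the notification fields are assumptions made for illustration.

```python
from collections import defaultdict


class NotificationDispatcher:
    """Hypothetical receiver that logs notifications and fans them out."""

    def __init__(self):
        self.log = []                          # every notification received
        self.subscribers = defaultdict(list)   # type -> registered callbacks

    def register(self, notification_type, callback):
        """An application registers interest in one notification type."""
        self.subscribers[notification_type].append(callback)

    def receive(self, notification_type, source, detail=""):
        """Log the notification, then pass it to registered applications."""
        record = (notification_type, source, detail)
        self.log.append(record)
        for callback in self.subscribers.get(notification_type, []):
            callback(record)
        return record


dispatcher = NotificationDispatcher()
seen_by_app = []
dispatcher.register("linkDown", seen_by_app.append)

dispatcher.receive("linkDown", "lsr1")    # logged and delivered
dispatcher.receive("coldStart", "lsr2")   # logged only, nobody registered
print(len(dispatcher.log), seen_by_app)
```

Logging before distribution means that every notification is recorded even when no application has registered for its type.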

NNM allows notifications to be paired so that notification A indicates a problem (e.g., link down) and notification B indicates problem rectification (e.g., link up). Not all notifications are symmetric like this; for example, if an LSR receives an MPLS-encapsulated packet with an invalid (or unknown) label (this is called a label fault), then there is no correction for this. It is a once-off, hopefully transient type of error. Paired notifications assist a network operator because they reflect those situations when the network self-heals. Likewise, when the corrective notification does not occur, then the fault remains active.
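The pairing logic can be sketched as a small fault tracker. This is a minimal illustration under stated assumptions, not a description of how NNM implements pairing: the PAIRS table, the FaultTracker class, and the labelFault notification name are hypothetical, while linkDown and linkUp are the standard SNMP trap names.

```python
# Hypothetical pairing table: problem notification -> clearing notification.
PAIRS = {"linkDown": "linkUp"}
# Reverse lookup: clearing notification -> the problem it rectifies.
CLEARS = {clear: problem for problem, clear in PAIRS.items()}


class FaultTracker:
    """Track which paired faults are still active."""

    def __init__(self):
        self.active = set()    # (source, problem) pairs not yet cleared
        self.transient = []    # once-off notifications with no partner

    def handle(self, source, name):
        if name in PAIRS:
            # Problem notification: the fault becomes active.
            self.active.add((source, name))
        elif name in CLEARS:
            # Corrective notification: the matching fault is cleared.
            self.active.discard((source, CLEARS[name]))
        else:
            # Unpaired notification, such as a label fault.
            self.transient.append((source, name))


tracker = FaultTracker()
tracker.handle("lsr1:if3", "linkDown")
tracker.handle("lsr2:if1", "linkDown")
tracker.handle("lsr1:if3", "linkUp")     # lsr1:if3 has self-healed
tracker.handle("lsr1", "labelFault")     # hypothetical unpaired notification
print(tracker.active)                    # only lsr2:if1 remains active
```

Anything left in the active set corresponds to a fault whose corrective notification has not arrived, which is where operator attention is needed.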

NNM also supports event correlation in which a given notification is processed before it is forwarded to one of the applications. This helps in situations where the same notification keeps recurring. As mentioned in the previous chapter, a very useful NE facility would be one that allows for notifications to be staggered or paced in order to avoid flooding the network with unnecessary traffic. This is particularly relevant during network reconfigurations. Some MIBs support notification throttling (RFC 1224) by using a sliding window of a specific duration (in seconds) and limiting the number of notifications allowed in this window.
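The sliding-window idea can be sketched as follows. This is a generic rate limiter written for illustration, assuming a window length in seconds and a maximum count per window; it is not the mechanism of any particular MIB or of RFC 1224, and the NotificationThrottle class and its parameter names are hypothetical. Timestamps are passed in explicitly so that the behavior is easy to follow.

```python
from collections import deque


class NotificationThrottle:
    """Allow at most max_notifications in any window of window_seconds."""

    def __init__(self, window_seconds, max_notifications):
        self.window_seconds = window_seconds
        self.max_notifications = max_notifications
        self.sent_times = deque()   # timestamps of notifications sent

    def allow(self, now):
        """Return True if a notification may be sent at time now."""
        # Discard timestamps that have slid out of the window.
        while self.sent_times and now - self.sent_times[0] >= self.window_seconds:
            self.sent_times.popleft()
        if len(self.sent_times) < self.max_notifications:
            self.sent_times.append(now)
            return True
        return False   # window is full, so this notification is suppressed


throttle = NotificationThrottle(window_seconds=10, max_notifications=2)
decisions = [throttle.allow(t) for t in (0, 1, 2, 10, 11, 12)]
print(decisions)   # [True, True, False, True, True, False]
```

A flapping interface would hit the limit quickly, so only the first few transitions in each window would produce traffic towards the NMS.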