Fault Server


The purpose of the Fault Server is to process network element (NE) notifications. It faces into the network and seeks to maintain parity between the NMS picture of network faults and the real situation in the network. In many ways this is the most critical element of an NMS because it ultimately determines whether real problems exist. A Fault Server will generally provide the following features:

  • Listening for notifications

  • Determining the underlying problem (root-cause analysis)

  • Updating persistent repositories and any GUI visual indicators

The first of these is straightforward enough: Notifications from the network are received, processed, and stored. Getting to the root of the associated problem may not be so simple; this is the task of root-cause analysis, in which notifications are analyzed and processed to determine what exactly is causing the problem. An example of this is a steady stream of link-up and link-down notifications that involve the same link. Without root-cause analysis, the NMS will process each notification, attempt to reflect it in the fault database, and propagate it to the other applications. This is an expensive operation, particularly when the actual problem may be a faulty interface on one end of the link. Root-cause analysis would attempt to examine the notifications and apply such reasoning; the best action might be to disable the link in question and investigate why the interface is faulty. In effect, root-cause analysis tries to impose semantics upon the data arriving from the network. A useful side effect of this remedial action is that it reduces the number of notifications received.
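The link-flap scenario above can be sketched in code. The following is a minimal, illustrative analyzer (the class name, window, and threshold are assumptions, not from the text) that collapses a stream of up/down notifications for one link into a single root-cause fault:

```python
from collections import defaultdict, deque
import time

FLAP_WINDOW_SECS = 60   # look-back window (assumed policy value)
FLAP_THRESHOLD = 6      # transitions within the window before declaring a flap

class LinkFlapAnalyzer:
    """Collapses repeated link-up/link-down notifications for the same
    link into a single root-cause fault instead of many raw events."""

    def __init__(self):
        self._history = defaultdict(deque)  # link_id -> transition timestamps

    def process(self, link_id, state, now=None):
        """Returns 'raw' for an ordinary event, or 'flapping' once the
        transition rate crosses the threshold, at which point the NMS
        could disable the link and investigate the faulty interface."""
        now = time.time() if now is None else now
        events = self._history[link_id]
        events.append(now)
        # Discard transitions that have fallen out of the window.
        while events and now - events[0] > FLAP_WINDOW_SECS:
            events.popleft()
        if len(events) >= FLAP_THRESHOLD:
            return "flapping"   # one root-cause fault replaces many raw events
        return "raw"
```

With this in place, the sixth up/down transition inside the window yields a single "flapping" verdict rather than six separate fault-table updates.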

Once the Fault Server has determined that a real problem has been identified, it must then record it. This consists of at least two actions:

  • Updating the database

  • Updating registered clients

Database update takes the form of either inserting a new record or updating an existing record in a fault table. Updating registered clients consists of ensuring that the new fault is propagated to any users viewing network faults. This can take the form of a topology section changing color and/or a new entry appearing in a fault listing. This is illustrated in Figure 6-3 with a combined topology view and a fault listing. As faults occur on the network, they appear in the listing window and in the topology view. Some systems may also provide a geographical map background for the topology.

Figure 6-3. Client topology view with a fault listing.

graphics/06fig03.jpg

As we've observed before, the faster clients are updated, the better. Updating registered clients can be nontrivial (it may involve temporarily locking out some database changes) if many clients are viewing the faults.

Figure 6-4 illustrates a possible Fault Server with its constituent software components.

Figure 6-4. Fault Server components.

graphics/06fig04.gif

Notable items in Figure 6-4 are:

  • A multilingual SNMP stack

  • Various (possibly optional) server application components and features

  • Specific database tables for use in fault processing

The SNMP stack component is capable of accepting:

  • SNMPv1 traps

  • SNMPv3 notifications

Incoming messages pass into the SNMP stack, which listens on port 162 for all such messages. Once received, messages are processed by the stack and passed upward into the Fault Server. If the fault is new (for example, if it indicates that LSP x has become operational), then a new entry is inserted into the fault table to this effect. Other affected tables may also be updated at this point, in this case the LSP table, as illustrated in Figure 6-4. Alternatively, a message could be sent to the monitoring server (described later) to expedite rediscovery of LSP x. If the LSP has become operational, then it is ready to receive IP traffic; this can be communicated to an external application that wishes to send IP traffic, or the IP traffic may start to flow across the LSP immediately with no need for external communication. So, the simple case of an LSP becoming operational can result in IP traffic landing at our MPLS network boundary. This in turn can result in the sender being billed for the MPLS resources used.
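The front end of this flow can be sketched as a UDP listener on port 162 that peeks at the BER-encoded version field to tell SNMPv1 traps from SNMPv3 notifications. This is a deliberately rough sketch (a real multilingual stack would fully decode the message, and the short-form length assumption below does not hold for all messages):

```python
import socket

SNMP_TRAP_PORT = 162  # standard trap/notification port (binding needs privileges)

def snmp_version(datagram):
    """Rough peek at an SNMP message header: SEQUENCE tag (0x30),
    short-form length, then INTEGER(1) holding the version.
    Returns 0 for SNMPv1, 3 for SNMPv3, or None if it doesn't match."""
    if (len(datagram) >= 5 and datagram[0] == 0x30
            and datagram[2] == 0x02 and datagram[3] == 0x01):
        return datagram[4]
    return None

def listen(handler, port=SNMP_TRAP_PORT):
    """Receive datagrams and pass (version, payload, sender) upward to
    the Fault Server's handler, mirroring the flow in Figure 6-4."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, peer = sock.recvfrom(65535)
        handler(snmp_version(data), data, peer)
```

In practice an SNMP library would replace the hand-rolled decoding; the point is only that one listener accepts both message dialects and hands them to a single processing engine.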

The preceding discussion again shows the way in which the different areas of network management are often inextricably interwoven. All of our examples are carefully chosen to illustrate simple network changes and their aftereffects; real networks may present hundreds of distributed changes occurring in quick succession, and the NMS in general struggles to keep up. One additional point: a change in LSP operating status from down to up is perhaps not exactly a "fault"; it is more of an event. For this discussion, we assume that faults and events are treated in essentially the same way, even though in practice this is unlikely to be the case.

Two other items in Figure 6-4 relate to fault and duplicate suppression. During certain periods of reconfiguration or fault, the operator may wish to inhibit processing of notifications in order to avoid overwhelming the NMS or filling up database fault tables. Also, if a given fault is recurring at an unreasonable rate (for instance, a given pair of link up/down notifications), then it may be desirable not to process (i.e., to suppress) the faults until the problem is resolved.
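Both suppression modes described above can be sketched as a small filter sitting in front of the processing engine. The class and method names here are hypothetical:

```python
class SuppressionFilter:
    """Drops notifications during operator-inhibited periods, and
    suppresses individual fault keys the operator has marked as
    known-recurring until the underlying problem is resolved."""

    def __init__(self):
        self.inhibit_all = False   # set during planned reconfiguration
        self._suppressed = set()   # fault keys, e.g. (node_id, fault_type)

    def suppress(self, key):
        """Operator marks a recurring fault for suppression."""
        self._suppressed.add(key)

    def resolve(self, key):
        """Underlying problem fixed; resume processing this fault."""
        self._suppressed.discard(key)

    def allow(self, key):
        """True if the Fault Server should process this notification."""
        return not self.inhibit_all and key not in self._suppressed
```

The engine consults `allow()` before touching the database, so suppressed notifications never reach the fault tables.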

Another class of fault is the paired variety in which only two states are possible, such as a power supply that is either outside or inside its allowed operating temperature range. If a fault occurs to indicate that a power supply has exceeded its allowed temperature range, then when the device returns to the normal operating range, a second fault should be issued. This second fault should clear the first one; there should not be two unrelated faults.
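The raise/clear pairing might be handled as follows. The pairing table and notification names are invented for illustration:

```python
# Hypothetical pairing of clear notifications to the raise notifications
# they cancel: the second member of a pair clears the first rather than
# creating a second, unrelated fault.
PAIRED = {
    "psuTempNormal": "psuTempExceeded",
    "linkUp": "linkDown",
}

def apply_fault(active_faults, origin, fault_type):
    """active_faults maps (origin, raise_type) -> fault record.
    Returns 'cleared' if this notification closed an earlier fault,
    else 'raised' after recording a new one."""
    raise_type = PAIRED.get(fault_type)
    if raise_type and (origin, raise_type) in active_faults:
        del active_faults[(origin, raise_type)]   # clear, don't duplicate
        return "cleared"
    active_faults[(origin, fault_type)] = {"status": "active"}
    return "raised"
```

A temperature-exceeded notification followed by a return-to-normal notification from the same power supply leaves the fault list empty, which is the behavior the text calls for.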

Fault Server Database Tables

Basic fault storage can take the form of one or more relational tables keyed by node ID (a unique number attached to every node in the NMS). Examples of columns in a fault table are:

  • Node ID (the key)

  • Description: A text string embedded in the notification explaining the fault

  • Origin: The originating NE component (processor, card, fabric, etc.) for the fault

  • Status: Active, cleared, or acknowledged (the user knows about the fault but has not cleared it)

  • Color: Red for active, blue for acknowledged, green for clear

As described in the previous section, rows containing all of the above columns are created or updated as incoming faults are processed.
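A minimal version of such a table and its create-or-update logic might look like the following. The schema, status strings, and helper name are illustrative, not taken from the book (the upsert syntax requires SQLite 3.24 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fault (
        node_id     INTEGER PRIMARY KEY,  -- unique number for each node
        description TEXT,   -- text string embedded in the notification
        origin      TEXT,   -- originating NE component (card, fabric, ...)
        status      TEXT,   -- active / acknowledged / cleared
        color       TEXT    -- red / blue / green
    )""")

STATUS_COLOR = {"active": "red", "acknowledged": "blue", "cleared": "green"}

def record_fault(conn, node_id, description, origin, status):
    """Insert a new row, or update the existing row for this node."""
    conn.execute(
        "INSERT INTO fault VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT(node_id) DO UPDATE SET "
        "description=excluded.description, origin=excluded.origin, "
        "status=excluded.status, color=excluded.color",
        (node_id, description, origin, status, STATUS_COLOR[status]))
```

Keying on node ID means a node's fault row is updated in place as its status moves from active through acknowledged to cleared, rather than accumulating one row per notification.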

Fault Server Software Structure

The software components in Figure 6-4 can be viewed as separate modules all using the central database. The main Fault Server component is the engine that processes incoming notifications. It stores faults in the database and signals to the rest of the NMS using the other components, such as topology update. The Fault Server can be hosted on its own platform (e.g., a Solaris, HP-UX, Windows 2000 system) or shared with other FCAPS servers.

Topology Update

Clients looking at faults generally want to see new notifications propagated into their views as quickly as possible. As mentioned earlier, there are many ways of achieving this:

  • CORBA

  • J2EE

  • Java RMI

  • RPC

  • Database update

With a CORBA-based solution, the Fault Server can simply alert registered clients by calling an appropriate remote object method. The topology update object can reside on each registered client and provide a method (or function call) invoked by the Fault Server to indicate the arrival of new faults. The client can then synchronize with the database for the new faults. It is even possible for the faults to be provided as parameters in the object method. Similar facilities can be provided using J2EE when, unlike with CORBA-based systems, there is no need to bridge different programming languages and environments. Java RMI and RPC provide a lower level interface for achieving remote synchronization. Alternatively, the clients can be relied upon to regularly poll the database for newly updated and inserted faults.
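The register-then-notify pattern common to all of these mechanisms can be sketched in-process. Here a plain callback stands in for the remote topology-update call; with CORBA or RMI, `notify` would be a method on a remote object stub rather than a local function:

```python
class FaultServer:
    """Sketch of the notify side of the Fault Server (names invented).
    In a CORBA deployment, each entry in _clients would be a remote
    topology-update object rather than a local callable."""

    def __init__(self):
        self._clients = []   # registered topology-update callbacks

    def register(self, callback):
        """A client GUI registers to be told about new faults."""
        self._clients.append(callback)

    def new_fault(self, fault):
        # Persist first, then push the update, so a client can
        # synchronize with the database as soon as it is notified.
        self._store(fault)
        for notify in self._clients:
            notify(fault)

    def _store(self, fault):
        pass   # database insert/update as described earlier
```

Passing the fault itself as the parameter corresponds to the "faults as parameters in the object method" option; passing nothing and having the client re-read the database corresponds to the synchronize-on-notify option.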



Network Management, MIBs and MPLS: Principles, Design and Implementation
ISBN: 0131011138
Year: 2003
Pages: 150
