MSPP Fault Management


One fault-management architecture that service providers commonly use is an EMS-based architecture. The MSPP vendor supplies the EMS to manage the services on the MSPP network. In this case, the EMS provides a user interface and a northbound interface (NBI), as shown in Figure 14-1.

Figure 14-1. EMS with Architecture Example


Most service providers prefer this architecture when the EMS is used to interface the MSPP network into the service provider's Operational Support System (OSS) network. In other words, the service provider's OSS sends messages, commonly based on Common Object Request Broker Architecture (CORBA), to the EMS's NBI, which instructs the EMS to carry out certain tasks. Provisioning an Ethernet circuit between two ONS 15454 nodes is an example of such a task. In the other direction, the EMS forwards alarms and events to the OSS through the NBI whenever an MSPP node on a ring generates them. This architecture is commonly used for large-scale networks; it is further explained in Chapter 15, "Large-Scale Management."

Note

An OSS is developed in-house (commonly referred to as a homegrown system), by a single vendor, or as an integration of commercial off-the-shelf systems. The last of these is sometimes referred to as a best-of-breed system. The Telcordia OSS is an example of a single-vendor OSS. It consists of NMA, TIRKS, and TEMS applications, which are used for monitoring and provisioning large-scale networks using TL1.


In other smaller-scale networks, an enterprise or service provider might choose not to use an EMS, but instead use an OSS that contains commercial network-management software and homegrown software to monitor and provision services on MSPP rings. In the case of the ONS 15454, these existing systems would interface through an Ethernet/Internet Protocol (IP) network and retrieve alarms and events through SNMP or TL1, as shown in Figure 14-2.

Figure 14-2. Non-EMS Interface Architecture Example


When a problem occurs in the network, the SNMP monitoring application receives messages from the MSPP network and raises the appropriate alarms. Much of today's fault-management software allows complex logic to be programmed into it, which eliminates some of the manual troubleshooting.

Note

A person's logic and years of experience and knowledge can be useful in solving a particular problem.


This fault-management software is useful in MSPP networks when the underlying technology is SONET or DWDM and the end-to-end service is Ethernet and storage networking. In this scenario, the SONET expert might not be the same person as the Ethernet expert, so multiple organizations must work together.

A collection engine, also known as a probe, is software that can periodically poll devices for statistics and receive SNMP traps when faults occur. However, when a fault occurs, the service provider has no idea which customers are impacted. Fault-management software can help resolve this problem and also automate the fault-resolution process by doing the following:

  • Detecting the problem by receiving the traps and correlating the traps from each endpoint into one alarm for the connectivity problem

  • Localizing and diagnosing the problem in general

  • Reconfiguring the MSPP if it is misconfigured

  • Confirming that service has been re-established
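
The first step in the list above, correlating the traps from each endpoint into one alarm, can be sketched roughly as follows. This is only an illustration: the Trap fields, the node names, and the circuit_of lookup table are hypothetical and do not come from any particular fault-management product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trap:
    node: str         # node that sent the trap (hypothetical naming)
    object_name: str  # for example "FAC-4-1"
    condition: str    # for example "CARLOSS"


def correlate(traps, circuit_of):
    """Group traps by the circuit (or customer) their endpoint belongs to.

    circuit_of maps (node, object_name) to a circuit label.
    Traps whose endpoint is not in the table are grouped under "unknown".
    """
    alarms = defaultdict(list)
    for trap in traps:
        circuit = circuit_of.get((trap.node, trap.object_name), "unknown")
        alarms[circuit].append(trap)
    return dict(alarms)
```

With such a table in place, two CARLOSS traps from the two ends of the same Ethernet circuit end up under one circuit label, which is what lets the operator see which customer is affected.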

Using SNMP MIBs for Fault Management

You can use different management protocols for fault management in an MSPP network. The Cisco ONS 15454 supports two management protocols: SNMP and TL1. Certain service providers, such as incumbent local exchange carriers (ILECs), traditionally use TL1 to monitor networks; large corporate networks and Internet service providers (ISPs) traditionally use SNMP to monitor networks. SNMP has been applied to telecom interfaces such as DS1, DS3, and SONET interfaces. In other words, SNMP MIBs have been developed for these telecom interfaces to allow fault-management and performance-management systems to monitor them.

What Is an SNMP MIB?

An SNMP MIB defines the management information that is collected and stored on communications equipment, such as an MSPP. The SNMP MIB contains a list of parameters, commonly referred to as objects, that defines what information can be collected. Standard MIBs have been written by the Internet Engineering Task Force (IETF), a standards body. Equipment vendors also write their own proprietary MIBs for their equipment to supply additional information that is not covered by standard MIBs. These proprietary MIBs are necessary to completely monitor an end-to-end service such as Ethernet traffic on an MSPP ring.

Vendors decide which MIBs to implement on their communications equipment. Reading SNMP MIBs can be overwhelming because of the large number of objects defined by the MIB. Service providers and other companies might choose to collect only the MIB objects needed to help prevent services from degrading or failing.

An SNMP-managed network has two components: SNMP agents and an SNMP manager. An SNMP agent is software that resides on the MSPP and is responsible for keeping track of information on the MSPP inventory, fault management, and performance management. The MIB provides a blueprint of this information in the form of objects. The SNMP manager can access all or a subset of these SNMP objects.

Note

An SNMP monitoring application contains the SNMP Manager component and is part of the fault-management software. Fault-management software is typically included in a network management system.


The SNMP manager accesses this information either by retrieving it (commonly referred to as polling) or by having the SNMP agent autonomously send notifications of events to it. These notifications, called SNMP traps, are used to notify the network-management system when certain important events occur, such as Ethernet frames being dropped. The SNMP agent sends these unsolicited traps to the network-management system. For example, traps can alert the SNMP monitoring application to a certain condition on the network, such as a change in Ethernet link status (up or down) or another significant event. Figure 14-3 shows the SNMP agent architecture and messages.

Figure 14-3. SNMP Agent on an MSPP


Some data and telecom equipment allows SNMP network-management systems to configure and monitor the equipment. You cannot provision services on the Cisco 15454 using SNMP, but you can poll information and receive SNMP traps.

Note

Because SNMP traps use an unreliable transport protocol, User Datagram Protocol (UDP), delivery of these messages cannot be guaranteed. However, SNMP Version 2 uses an Inform message to issue a notification that requires an acknowledgment from the SNMP manager. The ONS 15454 does not support the Inform message, yet it can support SNMPv1 and SNMPv2c simultaneously.


SNMP Traps

SNMP traps allow an MSPP to report the occurrence of an event to one or more network-management systems; the ONS 15454 can send traps to up to 10 different SNMP-management systems.

The network-management system can use events that occur on an MSPP to carry out fault management, capacity management, performance management, or configuration management.

If a non-EMS architecture is being used, SNMP traps are needed to avoid excessive polling of MIB variables to monitor the health of the network, as discussed in the beginning of this chapter. Extensive SNMP polling results in the following:

  • Additional configuration needed on the fault-management and performance-management software

  • Increased traffic on the data communications network and SONET Data Communications Channel (DCC), especially in large networks

  • Load on the fault-management and performance-management software

  • Additional computing hardware required as a network grows

Designing and configuring the fault-management and performance-management software to use SNMP traps and limited polling eliminates the issues previously listed and enables faster notification of faults in the network.

ONS 15454 Traps

The ONS 15454 generates all alarms, conditions, and threshold crossing alerts (TCAs) as SNMP traps. These traps are defined in the specific Cisco MIB for the ONS 15454, called CERENT-454-MIB.

Note

ONS 15454 alarms are provisionable. Alarms that are configured to be suppressed using the CTC Alarm-Provisioning tool are reported as a "Nonreported," or "NR," severity condition. Conditions defined as NR are not reported as SNMP traps.


This MIB defines more than 400 different traps. The ONS 15454 supports additional traps that are defined in other MIBs. Examples of some important ONS 15454 traps follow:

  • entConfigChange (ENTITY-MIB-rfc2737) This trap reports changes in the ports and interfaces that occur through CTC or TL1.

  • risingAlarm (RMON-MIB-rfc2819) This trap is generated when an alarm entry crosses the rising threshold.

  • fallingAlarm (RMON-MIB-rfc2819) This trap is generated when an alarm entry crosses the falling threshold.

  • performanceMonitorThresholdCrossingAlert (CERENT-454-MIB) This trap is generated when a threshold is exceeded for a defined performance parameter.

The current alarms and conditions associated with these traps are accessible using SNMP get requests on the 15454 Alarm Table, called cerent454AlarmTable, defined in the CERENT-454-MIB, as shown in Figure 14-4.

Figure 14-4. ONS 15454 SNMP Alarm Table


Figure 14-4 does not show the complete list of trap values because nearly 700 trap values are defined. You can use the 15454 Alarm Table to validate the traps received from the ONS 15454. Additionally, an SNMP manager can poll this alarm table for the current alarms and conditions instead of listening to SNMP traps from the ONS 15454.

Trap Example: Improper Removal

The ONS 15454 generates traps that contain a unique ID to identify the alarm. This ID is referred to as an object ID (OID). The trap contains information that identifies where the alarm originated. For example, an "Improper Removal" alarm is generated when a card from a 15454 shelf is unseated without deleting it from the 15454 database, as shown in Figure 14-5.

Figure 14-5. Improper Removal of 15454 Card


This SNMP trap is automatically issued to the IP addresses assigned as the SNMP manager in CTC. The trap message contains key information about the event that just occurred. This information is contained in the data field of the trap, which is formally called the Variable Bindings or Varbind List. In this example, the message identifies the following items:

  • Time of event This is the time stamp of the alarm, indicating the date and time when the alarm occurred locally. This is labeled as the CerentNodeTime in the trap message; Cerent is the original name of the ONS 15454. The time is displayed in the following format: YYYYMMDDhhmmss{S/D}. In this example, 20050705184318D is the time stamp, which converts to 6:43:18 PM EDT on July 5, 2005. If the last octet were S, this would indicate standard time.

  • Alarm state This is the severity of the alarm (critical, major, minor, and so on), indicating whether the alarm is service affecting or nonservice affecting. In this case, unseating a card produces a critical alarm, which is service affecting. This is labeled as cerent454AlarmState in the trap message.

  • Entity that raised the alarm The entity is the equipment that raised this alarm. This is labeled as cerent454AlarmObjectType in the trap message.

  • Slot associated with the alarm In this example, the card was unseated from Slot 3. This is labeled as cerent454AlarmSlotNumber in the trap message.

  • Port associated with the alarm If port information is not relevant to the alarm, the value is 0. This is labeled as cerent454AlarmPortNumber in the trap message.

  • Line associated with the alarm If line information is not relevant to the alarm, the value is 0. This is labeled as cerent454AlarmLineNumber in the trap message.

  • TL1 name This value provides more detailed information about what entity generated the trap. Entities can include Slot, Port, Synchronous Transport Signal (STS), Virtual Tributary (VT), and Bidirectional Line Switched Ring (BLSR). For example, this value provides STS or VT numbers if an STS or VT circuit triggers the trap. Or, if a BLSR event triggers a trap, the ring number is provided here. In this example, SLOT-3 is the value. If a VT circuit associated with Slot=6, STS=192, VT Group=7, and VT=4 raises a trap, the value is VT1-6-192-7-4.
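
As a minimal sketch of how a monitoring script might decode the time-of-event field described above, the following function splits a YYYYMMDDhhmmss{S/D} string into a date-time value and a daylight-saving flag. The function name is hypothetical, and the sketch assumes the field always has exactly 15 characters; it does not attach a time zone because the trap carries only the S/D indicator.

```python
from datetime import datetime


def parse_node_time(value):
    """Split a YYYYMMDDhhmmss{S|D} string into (naive datetime, is_daylight)."""
    if len(value) != 15 or value[-1] not in ("S", "D"):
        raise ValueError(f"unexpected time stamp: {value!r}")
    # The first 14 characters are the local date and time on the node.
    local_time = datetime.strptime(value[:14], "%Y%m%d%H%M%S")
    # The last character is D for daylight time or S for standard time.
    is_daylight = value[-1] == "D"
    return local_time, is_daylight
```

Applied to the example value 20050705184318D, this yields July 5, 2005, 18:43:18 with the daylight flag set.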

Note

This TL1 name is identical to the access identifier (AID) value contained in every TL1 alarm or event generated from the ONS 15454. In this example, the AID is SLOT-3. Other AIDs that can be contained in the trap message are Facility (FAC), VT1, STS, and VT Group (VTG). The following examples help you further understand the TL1 name. In these examples, s stands for "slot," p stands for "port," sts stands for "STS number," vtg stands for "VT group number," and vt stands for "VT number." Table 14-1 contains examples of other AID formats.


Table 14-1. AIDs Format Examples

Format             Example
FAC-s-p            FAC-5-1 for port 1 of OC12 in Slot 5.
SLOT-s             SLOT-16 for OC12 in Slot 16.
VT1-s-sts-vtg-vt   VT1-6-4-6-3 for VT in module located in Slot 6 (sts=4, VTGroup=6, vt=3).
STS-s-sts          STS-2-5 for STS object in module located in Slot 2 associated with STS 5 (that is, it corresponds to the second STS on Port 2).
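
The AID formats in Table 14-1 are regular enough to be split mechanically. The following is a minimal sketch, covering only the four formats shown in the table; the function name and the returned dictionary layout are assumptions made for illustration.

```python
# Field names for each AID prefix, taken from the formats in Table 14-1.
AID_FIELDS = {
    "SLOT": ("slot",),
    "FAC": ("slot", "port"),
    "STS": ("slot", "sts"),
    "VT1": ("slot", "sts", "vtg", "vt"),
}


def parse_aid(aid):
    """Split an AID such as 'FAC-4-1' into its type and numeric fields."""
    prefix, *parts = aid.split("-")
    fields = AID_FIELDS.get(prefix)
    if fields is None or len(parts) != len(fields):
        raise ValueError(f"unsupported AID: {aid!r}")
    # int() raises ValueError if a field is not numeric.
    return {"type": prefix, **{name: int(p) for name, p in zip(fields, parts)}}
```

For example, parse_aid("VT1-6-192-7-4") returns the slot, STS, VT group, and VT numbers as separate integer fields, which is convenient when matching a trap against an inventory database.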


The ONS 15454 also generates a trap for each alarm when the alarm clears, as shown in Figure 14-6.

Figure 14-6. 15454 Card Reseated


As you will notice, the trap generated is nearly identical to the previous trap, except that the alarm state has changed to "cleared" and the time of the event has a new time stamp.
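
Because the raise and clear traps differ only in alarm state and time stamp, a monitoring application can keep a table of standing alarms by pairing them. The sketch below assumes the application has already extracted a trap name, the TL1 name, and the alarm state as strings; the function name, the trap name string, and the severity string in the usage example are placeholders, not values taken from the MIB.

```python
def apply_trap(active, trap_name, object_name, state):
    """Update the table of standing alarms from one received trap.

    active is a dict keyed by (trap_name, object_name).
    A state of "cleared" removes the entry; any other state stores it.
    """
    key = (trap_name, object_name)
    if state == "cleared":
        # Clearing an alarm that was never seen is silently ignored.
        active.pop(key, None)
    else:
        active[key] = state
    return active


# Usage with placeholder values: a raise followed by its clear.
standing = {}
apply_trap(standing, "improperRemoval", "SLOT-3", "critical")
apply_trap(standing, "improperRemoval", "SLOT-3", "cleared")
```

After both calls the table is empty again, which mirrors the card being reseated.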

Trap Example: Carrier Loss on G1K Port

Before talking about SNMP traps that are issued for a carrier loss on a G1K port, it is important to discuss a feature on the G1K card called link integrity. Link integrity means that the entire end-to-end Ethernet link fails if any point of the end-to-end path is not working. The Ethernet ports and the SONET circuit path interconnecting these ports must be up and working before Ethernet traffic can flow. In other words, the link integrity feature ensures that the transmit lasers on the G1K card at each end are not turned on until all SONET and Ethernet errors along the path are cleared. For example, if one of the G1K ports is manually disabled using CTC or is in a Carrier Loss state (CARLOSS) because the attached switch or router's Ethernet port is down, this port is considered to be in a failure state because the end-to-end Ethernet path is unavailable. As a result of this port failure, the G1K card at each end disables its transmit laser, which prevents the port from going into service.

Note

The G1K card can be set to Automatic In-Service state (AINS). AINS is supported in Release 5.0 and later. This means that the G1K port is initially in a state that suppresses alarm reporting, but traffic is carried and loopbacks are allowed. After a predefined period, referred to as the soak period, the port changes to in-service and alarm reporting is no longer suppressed.


Figure 14-7 shows a case in which a router's Ethernet port is shut down. As a result, Port 1 on the G1K card issues a CARLOSS alarm.

Figure 14-7. Carrier Loss on G1K Port


The trap name is carrierLossOnTheLAN, defined in the Cisco MIB. After the router's Ethernet port is activated, the CARLOSS is cleared on the G1K port and the same trap is sent, but this time with the Alarm state set to "cleared." Figure 14-8 shows a carrier loss cleared on a G1K port.

Figure 14-8. Carrier Loss Cleared on G1K Port


Trap Example: Protection Switch

A protection switch raises a condition called Working Switched To Protection (WKSWPR). This is the case for all protection types. This occurs when a loss of signal (LOS) or signal degrade (SD) occurs on an electrical or optical circuit. When a revertive protection group is established, a WKSWPR condition is raised when the working card switches to the protect card. As soon as the traffic is switched back to the working card, the WKSWPR is cleared. In the case of a nonrevertive protection group, a Switched Back to Working (WKSWBK) condition is issued when the traffic is manually switched back to the working card.

Note

1:N protection groups in the system are always revertive.


Figure 14-9 shows the SNMP trap that indicates a DS1 working card has switched to a DS1N protect card.

Figure 14-9. Working Card Switch to Protection Card


Figure 14-10 shows the SNMP trap that indicates the traffic has switched back to the DS1 working card.

Figure 14-10. Traffic Switched Back to Working Card


Trap Example: Audit Trail Full

To facilitate security monitoring, the ONS 15454 generates a trap when the audit log reaches 80 percent of its capacity and again when it reaches 100 percent. The audit trail keeps track of user activities on the ONS 15454.

ONS 15454 MIBs

The ONS 15454 has its own enterprise-specific MIBs in addition to standard MIBs. Supported MIBs are listed at Cisco.com at http://www.cisco.com/public/sw-center/netmgmt/cmtk/mibs.shtml.

Refer to Cisco documentation to determine whether the MIB file is backward compatible. For example, if a customer has ONS 15454 Release 5.0 and ONS 15454 Release 4.6 running in the network, all MIB objects defined in Release 4.6 will still be supported in Release 5.0. If the MIB file is backward compatible, this allows the operations center to continue to monitor all the nodes in the network by polling the same MIB objects, even after certain 15454 nodes are upgraded to a newer software release.

Note

The version of the MIB file is indicated in the Description field at the beginning of the file. For example, the following lines were extracted from the CERENT-454-MIB:

Description

"This file can be used with R5.0 release."


MIB Object Example: Retrieving DS3 Port Name

The MIB object that identifies the DS3 port name is dsx3CircuitIdentifier. This MIB object is from the standard MIB called DS3-MIB-rfc2496. This corresponds to the port name "Customer ABC" on the DS3 card shown in Figure 14-11.

Figure 14-11. DS3 Port Name Displayed in CTC


Up to 12 different port names can be assigned per DS3 card. Therefore, 12 different rows of the dsx3CircuitIdentifier exist in the MIB for every DS3 card. These port names can be retrieved using the SNMP get and getNext commands.

MIB Object Example: Retrieving Serial Numbers

Another standard MIB is the ENTITY-MIB (RFC 2737). This MIB has an object called entPhysicalSerialNum that contains the serial numbers for each ONS 15454 card.

As an example, if a DS3 card is in Slot 2 in an ONS 15454 chassis, you need to translate Slot 2 to an index number that represents Slot 2 in the table. This is calculated as follows:

Table index = (Slot# x 4096) + 1

In this example, the table index would be (2 x 4096) + 1, which equals 8193.
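
The index calculation above is simple enough to express as a one-line helper. The function name is hypothetical; the formula is the one given in the text.

```python
def ent_physical_index(slot):
    """Return the ENTITY-MIB table index for a card in the given slot."""
    if slot < 1:
        raise ValueError("slot numbers start at 1")
    return slot * 4096 + 1


print(ent_physical_index(2))  # prints 8193, matching the worked example
```

The resulting number is the row index to append when requesting entPhysicalSerialNum for that slot.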

Setting Up SNMP on the ONS 15454

Setting up SNMP in the ONS 15454 is a simple task. First, assign the community names in the ONS 15454. The community name is like an assigned username that is carried by each SNMP request message that the fault-management software issues. The community name in the SNMP request message must match the community name assigned in the ONS 15454. Community names are provisioned using CTC, as shown in Figure 14-12.

Figure 14-12. Provisioning Community Names Using CTC


An SNMP request is accepted if its community name matches any community name listed in the SNMP trap destinations table. In this example, the community name is public and the trap destinations are 10.1.1.40 and 10.1.1.50. An SNMP request is dropped if its community name does not match any of the community names identified in this table.
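
The acceptance rule described above amounts to a membership test against the trap destinations table. The sketch below models that table as a list of (IP address, community name) rows; this representation and the function name are assumptions for illustration and say nothing about how the node stores the table internally.

```python
def accept_request(community, trap_destinations):
    """Return True if the request's community name matches any table row."""
    return any(community == row_community
               for _ip, row_community in trap_destinations)


# The example from the text: community "public", two trap destinations.
destinations = [("10.1.1.40", "public"), ("10.1.1.50", "public")]
```

With this table, a request carrying the community name public is accepted, and a request carrying any other name is dropped.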

The second step is to assign the IP address of where the SNMP traps need to be delivered. Up to 10 trap destinations can be assigned.

Loading MIBs

You must load the ONS 15454 enterprise-specific MIB files and standard MIB files as one of the first steps in setting up your fault-management and performance-management systems. You must follow two rules:

1. Ensure that the right versions of MIB files are loaded. If the release of the ONS 15454 is Release 5.0 or later, be sure that the ONS 15454 MIBs are from the same version. Be aware that two versions of MIBs are supported on the ONS 15454: SNMPv1 and SNMPv2 MIBs.

2. Load the MIB files in the order specified in the MIB dependency table for the ONS 15454 release you are working with. Otherwise, the SNMP manager might not compile one or more of the MIB files.
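
If the dependency table is available in machine-readable form, a valid load order can be derived with a topological sort. The following sketch uses the graphlib module from the Python standard library (Python 3.9 or later). The dependency entries shown are hypothetical placeholders; the real table must come from the documentation for the release in use.

```python
from graphlib import TopologicalSorter


def load_order(dependencies):
    """Return MIB names ordered so that every MIB follows its prerequisites.

    dependencies maps a MIB name to the set of MIBs that must be loaded first.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency table, for illustration only.
example_dependencies = {
    "EXAMPLE-REGISTRY": set(),
    "EXAMPLE-TC": {"EXAMPLE-REGISTRY"},
    "CERENT-454-MIB": {"EXAMPLE-REGISTRY", "EXAMPLE-TC"},
}
order = load_order(example_dependencies)
```

A circular dependency in the table makes static_order raise graphlib.CycleError, which is a useful early warning that the table was transcribed incorrectly.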

Using TL1 for Fault Management

Transaction Language 1 (TL1) is a standard set of ASCII messages that an OSS application uses to manage network elements, such as MSPPs. Similar to SNMP, two key types of messages exist:

  • TL1 commands

  • TL1 autonomous messages

TL1 commands can configure the MSPP or retrieve information. Setting up an STS cross-connect is one example of using the TL1 command. You construct the TL1 command as follows:

ENT-CRS-<STS_PATH>


Retrieving the current status of all active alarms and conditions from an ONS 15454 is another example of using the TL1 command. You construct the TL1 command as follows:

RTRV-ALM-ALL


If a TL1 command is successful, the ONS 15454 issues a response message that contains a completion code indicating success (COMPLD). If a TL1 command fails, the ONS 15454 issues an error response that contains the DENY completion code.
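
A script that sends TL1 commands usually needs to pull the completion code out of the response. The sketch below assumes the conventional TL1 response layout, in which one line begins with the letter M followed by the correlation tag and a completion code such as COMPLD or DENY. The function name and the sample text in the usage example are illustrative, not captured from a node.

```python
import re

# Matches a response identifier line such as "M  123 COMPLD".
_RESPONSE_LINE = re.compile(r"^M\s+(\S+)\s+([A-Z]+)\s*$", re.MULTILINE)


def completion_code(response):
    """Return (ctag, completion_code) from the text of a TL1 response."""
    match = _RESPONSE_LINE.search(response)
    if match is None:
        raise ValueError("no TL1 response identifier line found")
    ctag, code = match.groups()
    return ctag, code


# Illustrative response text; the header line is a made-up example.
sample = "\r\n\n   NODE-A 05-07-05 18:43:18\r\nM  123 COMPLD\r\n;"
```

The caller can then compare the returned code with COMPLD to decide whether the command succeeded.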

TL1 autonomous messages are like SNMP trap messages: They are used to report alarms, conditions, and configuration changes. In other words, the OSS application does not have to request that these messages be sent; the ONS 15454 issues these messages without any intervention.

TL1 Versus SNMP

TL1 and SNMP have similar characteristics. Both TL1 and SNMP use autonomous messages to notify management systems of alarms and conditions from a network element, such as the ONS 15454. In addition, TL1 and SNMP both have mechanisms to retrieve the current alarms and conditions from a network element. TL1 uses the commands RTRV-ALM-ALL and RTRV-COND-ALL, and SNMP uses get/getNext request messages to retrieve alarms and conditions from 15454 MIB tables. In the case of the ONS 15454, the cerent454AlarmTable contains all the active 15454 alarms and conditions that an SNMP-management system can retrieve.

Thus, SNMP and TL1 can be used to monitor the health of the 15454 by using autonomous messages or retrieving alarms and conditions. One key difference, however, is that SNMP cannot provision the ONS 15454, whereas TL1 can provision electrical, SONET, Layer 1 Ethernet, and DWDM services on the ONS 15454.

TL1 has traditionally been used to manage telecommunications equipment, such as SONET add-drop multiplexers (ADMs), whereas SNMP has traditionally been used to manage data communications devices, such as Ethernet switches and routers. However, SNMP is now being used in telecommunications equipment, such as MSPPs. In certain large-scale service-provider networks, in which thousands of network elements (NEs) are deployed, TL1 is commonly used as the management protocol for transport SONET and DWDM networks; SNMP is commonly used as the management protocol for Ethernet and IP networks. In the latter case, some NEs are managed directly via SNMP and other NEs are managed using an EMS, which, in turn, interfaces with a newer generation of OSS. This architecture was discussed in the beginning of this chapter.

Everything in TL1 and SNMP can be accomplished in CTC. However, be aware that certain configuration tasks can be carried out only in CTC and cannot be set up by TL1. Automatic provisioning of an end-to-end circuit by selecting only the two endpoints of the circuits (for example, DS3 port in Node A and DS3 port in Node Z) can be accomplished only using CTC. The equivalent action in TL1 is setting up a series of cross-connects in each 15454 that the end-to-end circuit traverses.

Using TL1 for Ethernet Services

TL1 cannot provision certain parameters. These mainly consist of Layer 2 Ethernet configuration, such as setting up a resilient packet ring (RPR), virtual LAN (VLAN) IDs, and quality of service (QoS). TL1 does not support these parameters because no TL1 commands exist for configuring them. Instead, a Layer 2-capable EMS such as Cisco Transport Manager (CTM) can be used to set them up.

The ML 100 and ML 1000 Ethernet cards are considered Layer 2 cards. These are the Ethernet cards that support RPR and QoS features used in setting up Multipoint Ethernet service. You can use TL1 on the ML Ethernet cards for card inventory, alarm management, Layer 1 provisioning, and the retrieval of Ethernet performance information. In addition, you can use TL1 to provision SONET circuits and transfer a Cisco IOS startup configuration file to the memory on the Timing Communications Control 2 (TCC2) card. Each ML card runs its own instance of Cisco IOS software and must have a Cisco IOS configuration file.

The G1K and the CE100 Ethernet cards are considered Layer 1 cards, on which QoS and VLAN parameters do not exist. These cards map the Ethernet frames coming into the Ethernet port directly onto a point-to-point SONET circuit. As a result, you can use TL1 to provision a point-to-point Ethernet service using G1K and CE100 cards.

Using CTC for Fault Management

Every alarm and event that the ONS 15454 generates is captured in the CTC alarm and conditions tables. A condition is a fault or status that the ONS 15454 detects. In CTC, the conditions window shows all conditions that occur, including those that are superseded, unless the root-cause filter is turned on. For example, if loss of frame (LOF) and loss of signal (LOS) are present in the network, CTC shows both the LOF and LOS conditions in this window; LOS occurs when the port on the card is in service but no signal is being received.

The CTC alarm window would show only LOS because LOS supersedes and replaces LOF. Having all conditions visible can be helpful when troubleshooting the ONS 15454. Fault conditions include reported alarms and Not Reported or Not Alarmed conditions.
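
The difference between the two windows can be illustrated with a small filter. The sketch below is a simplification: the SUPERSEDED_BY table contains only the LOS/LOF pair from the example above, the function name is hypothetical, and the filter ignores which object each condition was raised on.

```python
# Maps a condition to the condition that supersedes it (example pair only).
SUPERSEDED_BY = {"LOF": "LOS"}


def alarm_view(conditions):
    """Return the conditions that remain after superseded ones are hidden."""
    present = set(conditions)
    return [c for c in conditions if SUPERSEDED_BY.get(c) not in present]
```

Given both LOF and LOS, the function keeps only LOS, as the alarm window does; given LOF alone, it keeps LOF, because nothing is present to supersede it.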

Note

Alarms that are suppressed are normally found on the conditions table in CTC. The conditions table is used to diagnose system problems, without any filtering of alarms or conditions. As a result, any alarm that normally would be suppressed is viewable in the conditions table as an NR severity condition.


CTC is a useful and practical tool that can be used to monitor a 15454 network. Keep in mind, however, that it is not intended as a tool to monitor a network of many 15454s. CTC is typically used to view and monitor one ring of 15454s at a time. It is a useful tool when troubleshooting is required on a 15454 ring.

SNMP traps can be correlated with the associated alarm shown in the CTC Alarm table. The value of the TL1 name (labeled as cerent454AlarmObjectName in the trap), which is also called the AID (as discussed in the trap section), is identical to the value in the Object column of the Alarm window in CTC. Therefore, you can use CTC to look further into the cause of the alarm when the SNMP-monitoring application receives a trap. Figure 14-13 shows where the object is located in the Alarm window. In this example, a CARLOSS alarm is present on a G1K. The Object field has the value FAC-4-1, which translates to Port 1 on the card in Slot 4.

Figure 14-13. CTC Alarm Objects


CTC supplies a graphical representation of the 15454 network topology, with current standing alarms and conditions, and depicts the current 15454 shelf configuration. CTC is an easy-to-use application because it automatically discovers all DCC-enabled 15454s that are interconnected to the 15454 node that CTC logs into. The DCC is created by using the bytes in the SONET signal's overhead to create a communications channel among multiple 15454s. CTC uses this communications channel, along with the built-in routing protocol of the 15454, to discover each 15454. After the network map of all the 15454s is created in CTC, the user can log into any of the 15454s identified in the network map and view all the alarms, as shown in Figure 14-14.

Figure 14-14. CTC Alarm Table in Network Map View


The user can then drill down into a single 15454 node and view the alarms associated with that particular 15454. This allows you to quickly filter out alarms from other 15454 nodes in the network. You can then filter again by double-clicking the 15454 card, which will show only the alarms associated with that card.




Building Multiservice Transport Networks
ISBN: 1587052202
Year: 2004
Pages: 140
