Monitoring Network Interface Card and Cable Failures | UNIX Fault Management: A Guide for System Administrators


The Mean Time Between Failures (MTBF) for a single Network Interface Card (NIC) can be many years. However, in a large computing environment, you are still likely to experience LAN card failures. This section shows how you can monitor LAN card failures, verify that physical cables are set up properly, and discover related problems, such as a link disconnection. You can also see how to identify the type of network link being used.

This section refers to monitoring the networking components that correspond roughly to the physical and data link layers of the Open Systems Interconnection (OSI) model. This section also shows how to obtain a LAN's status and statistics, the LAN station addresses used by each UNIX server, and information on LAN packet errors. This section concludes with some of the monitoring tools that are specific to certain network links, such as X.25, Asynchronous Transfer Mode (ATM), and Fiber Distributed Data Interface (FDDI).

Using SNMP Instrumentation

As mentioned earlier in this chapter, important monitoring information can be stored in standard MIBs, and can then be queried by using a MIB Browser. The MIB Browsers are publicly available and are also included with the NNM, IT/O, and other network management products. The MIB data can also be obtained by using a custom SNMP application, but you need to know the Object Identifiers (OIDs) of the fields you are interested in querying, which are provided in the MIB specifications. Standard MIB documents are provided as Requests For Comments (RFCs) from the Internet Engineering Task Force (IETF), and are available on the Web at http://www.ietf.org.

The MIB-II (defined in RFCs 1213 and 1214) contains numerous useful fields for monitoring network links. Information about each network interface is available, including its current status. You can compare the total number of input packets to the number of input packets received in error. Similar fields are available for output packets. Together, these fields can be an indication of how well a network is behaving.

You may also want to check the number of inbound packets that were discarded despite being received without error. If packets are being discarded, this could indicate a lack of system memory reserved for network buffers.

Each network interface has both an administration status and an operation status defined in the MIB-II. Administration status is up, down, or testing. It refers to the desired state of the interface. Operation status is also up, down, or testing, and it corresponds to the current operational state of the interface.
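The comparison of the two status fields can be sketched in a few lines of Python. This is only an illustration: the integer encodings up(1), down(2), and testing(3) come from the MIB-II definitions, but the function names and the sample interface values are hypothetical, and the step that actually polls an SNMP agent is left out.

```python
# MIB-II integer encodings for ifAdminStatus and ifOperStatus.
STATUS_NAMES = {1: "up", 2: "down", 3: "testing"}


def interface_needs_attention(admin_status: int, oper_status: int) -> bool:
    """Return True when the desired state is up but the current state is not."""
    return admin_status == 1 and oper_status != 1


def describe(admin_status: int, oper_status: int) -> str:
    """Render both status values by name, e.g. 'admin=up oper=down'."""
    admin = STATUS_NAMES.get(admin_status, f"unknown({admin_status})")
    oper = STATUS_NAMES.get(oper_status, f"unknown({oper_status})")
    return f"admin={admin} oper={oper}"


if __name__ == "__main__":
    # Hypothetical (ifAdminStatus, ifOperStatus) pairs, as if read from an agent.
    samples = {"lan0": (1, 1), "lan1": (1, 2), "lan2": (2, 2)}
    for name, (admin, oper) in samples.items():
        flag = "CHECK" if interface_needs_attention(admin, oper) else "ok"
        print(f"{name}: {describe(admin, oper)} -> {flag}")
```

An interface that is administratively down is not flagged, because its desired state and current state agree.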

Network statistics are accumulated from the time the system was booted, or since the last manual reset of the statistics. Because this usually is a long period of time, a single MIB query is unlikely to be useful. You will probably want to set up ongoing monitoring and calculate the change in value over time to get an idea of network traffic over a NIC and the current error rate. Output errors are likely to indicate a problem getting to the link from the local system.
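The change-over-time calculation might look like the following Python sketch. It assumes that two samples of the same counters have already been collected a known number of seconds apart, and that the counters are 32 bits wide, as the MIB-II interface counters are. The function names are illustrative, and the packet totals are whatever total the caller chooses to compare against the error count.

```python
COUNTER32_MODULUS = 2 ** 32  # MIB-II interface counters are 32-bit and wrap


def counter_delta(previous: int, current: int) -> int:
    """Change in a 32-bit counter, allowing for one wrap past 2**32 - 1."""
    return (current - previous) % COUNTER32_MODULUS


def packets_per_second(previous: int, current: int, seconds: float) -> float:
    """Average packet rate over the sampling interval."""
    return counter_delta(previous, current) / seconds


def error_rate(prev_packets: int, cur_packets: int,
               prev_errors: int, cur_errors: int) -> float:
    """Errors per packet during the interval (0.0 if there was no traffic)."""
    packets = counter_delta(prev_packets, cur_packets)
    errors = counter_delta(prev_errors, cur_errors)
    return errors / packets if packets else 0.0


if __name__ == "__main__":
    # Hypothetical samples taken 60 seconds apart.
    print(packets_per_second(1000, 4000, 60))
    print(error_rate(1000, 3000, 10, 30))
```

The wrap handling assumes the counter was not reset between the two samples. After a reboot or a manual reset of the statistics, the first delta is meaningless and should be discarded.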

Using Standard Commands and Tools

Each UNIX operating system includes a variety of commands that can be helpful in determining the status of a network interface. The more commonly used HP-UX commands are described in this section. You may want to check the online man pages for additional information about each command.

In addition to these commands, you may also want to check the system log file, /var/adm/syslog/syslog.log, for LAN-related error messages if your system is experiencing network problems.

ioscan

The ioscan command is used to provide information about all the I/O paths on a system. It can show the hardware paths to your LAN cards and help you to distinguish between a built-in LAN card and additional LAN cards. You can also see the specific type of LAN card being used. ioscan shows errors if the hardware or software has not been properly installed. Errors may be due to the wrong network driver being installed, meaning that you would need to rebuild the kernel with the correct driver. Listing 6-1 shows an HP-UX system with one built-in LAN card and three EISA cards.

Listing 6-1 The ioscan command with grep, which shows only network information.

chacha#ioscan | grep lan
8/16/6      lan    Built-in LAN
8/20/5/1    lan    EISA card HWP1990
8/20/5/7    lan    EISA card HWP1990
8/20/5/8    lan    EISA card HWP1990
chacha#

lanscan

The lanscan command shows the hardware path to each LAN device, the station address, link status, network management ID, and link technology being used (for example, Ethernet or FDDI). The station address is the unique identifier for a LAN card and is also referred to as a physical address or Media Access Control (MAC) address. You can compare the information from lanscan, as shown in Listing 6-2, to that provided by ioscan, shown in Listing 6-1. With this additional information, you can uniquely identify your LAN card so that you can configure it for the appropriate corporate network. Also, if you are using an analyzer to look at network packets on a LAN, lanscan will help you to map the station address in the packet to a network interface on a system.
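Packet analyzers often print station addresses in a different notation from lanscan (for example, colon-separated octets), so mapping a captured address to an interface usually starts with normalizing both forms. The Python sketch below shows one way to do that. The helper names are hypothetical, and the address table is copied by hand from the lanscan output in Listing 6-2 rather than read from a live system.

```python
def normalize_station_address(address: str) -> str:
    """Return a station address as 12 lowercase hex digits, no separators."""
    text = address.strip().lower()
    if text.startswith("0x"):
        text = text[2:]
    if ":" in text or "-" in text:
        # Separator-delimited forms may drop leading zeros in each octet.
        octets = text.replace("-", ":").split(":")
        text = "".join(octet.zfill(2) for octet in octets)
    text = text.zfill(12)  # restore leading zeros dropped from a 0x form
    if len(text) != 12 or any(c not in "0123456789abcdef" for c in text):
        raise ValueError(f"not a 48-bit station address: {address!r}")
    return text


def interface_for(address: str, table: dict):
    """Look up the interface name whose station address matches, or None."""
    wanted = normalize_station_address(address)
    for station, name in table.items():
        if normalize_station_address(station) == wanted:
            return name
    return None


# Station addresses and interface names taken from Listing 6-2.
LANSCAN_ADDRESSES = {
    "0x080009C3FA7D": "lan0",
    "0x080009ACC83C": "lan1",
    "0x080009ACD860": "lan2",
    "0x080009ACF11E": "lan3",
}

if __name__ == "__main__":
    print(interface_for("08:00:09:ac:c8:3c", LANSCAN_ADDRESSES))
```

Padding to 12 digits also copes with displays that drop a leading zero from the address.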

Note that an interface has both a hardware state and a network interface state. The hardware state corresponds to the configured state of the device, and the network interface state corresponds to the current operational state. In the example in Listing 6-2, lan1 has been configured UP, but is not operating properly.
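A check for this mismatch can be automated by parsing lanscan output. The Python sketch below assumes the ten-column layout shown in Listing 6-2 (other HP-UX releases may format the output differently), and the sample text is copied from that listing rather than captured from a running system. The names LanEntry and parse_lanscan are illustrative.

```python
from typing import List, NamedTuple


class LanEntry(NamedTuple):
    path: str
    station_address: str
    hardware_state: str
    name: str
    interface_state: str
    nm_id: int


def parse_lanscan(output: str) -> List[LanEntry]:
    """Parse data rows in the ten-column layout of Listing 6-2."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have ten columns and a hex station address; skip headers.
        if len(fields) != 10 or not fields[1].startswith("0x"):
            continue
        entries.append(LanEntry(fields[0], fields[1], fields[3],
                                fields[4], fields[5], int(fields[6])))
    return entries


SAMPLE = """\
Hardware Station        Crd Hardware Net-Interface  NM MAC   HP DLPI Mjr
Path     Address        In# State    NameUnit State ID Type  Support Num
8/16/6   0x080009C3FA7D 0   UP       lan0     UP    4  ETHER Yes     52
8/20/5/1 0x080009ACC83C 1   UP       lan1     DOWN  5  ETHER Yes     176
8/20/5/7 0x080009ACD860 2   DOWN     lan2     DOWN  6  ETHER Yes     176
8/20/5/8 0x080009ACF11E 3   DOWN     lan3     DOWN  7  ETHER Yes     176
"""

if __name__ == "__main__":
    for entry in parse_lanscan(SAMPLE):
        if entry.hardware_state == "UP" and entry.interface_state == "DOWN":
            print(f"{entry.name} (NM ID {entry.nm_id}) is configured UP but is DOWN")
```

The parsed network management ID is the value that lanadmin and linkloop expect when you follow up on a flagged interface.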

lanadmin

The lanadmin command can be used to display or change the station address, Maximum Transmission Unit (MTU), or speed setting of each configured network interface. It can also be used to display or reset the MIB-II network interface statistics. The statistics include the interface name, administrative status, operational status, number of error packets, and other information. For Ethernet links, lanadmin also displays statistics from RFC 1284 (Ethernet-like interface information), which can indicate the number of collisions happening on the link.

Some overlap exists with other commands. For example, the station address is also available from lanscan, and the MTU can also be obtained from netstat (discussed later).

lanadmin can be used to reinitialize a LAN card. For example, if networking fails during system startup, you may be able to enable networking by using lanadmin to reset the LAN card, and then use ifconfig (described later) to bring the card up with the proper IP address.

Numerous statistics are available from lanadmin, as shown by using the lanadmin menu-driven interface in Listing 6-3. Note that to use the lanadmin command, you need to specify the network management ID number of the LAN card, which can be obtained from the lanscan command.
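Because the display output in Listing 6-3 uses a simple "name = value" layout, it is easy to scan for counters that deserve a closer look. The Python sketch below is one way to do so. It assumes that layout, uses an excerpt copied from Listing 6-3 as sample input, and picks out problem counters with a keyword list that is purely illustrative.

```python
def parse_statistics(display_text: str) -> dict:
    """Collect 'name = value' lines into a dictionary of strings."""
    stats = {}
    for line in display_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() and value.strip():
            stats[name.strip()] = value.strip()
    return stats


# Illustrative keywords for counters that suggest trouble when nonzero.
PROBLEM_WORDS = ("Errors", "Discards", "Collision", "Too Long")


def nonzero_problem_counters(stats: dict) -> dict:
    """Return the numeric problem counters whose value is greater than zero."""
    flagged = {}
    for name, value in stats.items():
        if any(word in name for word in PROBLEM_WORDS) and value.isdigit():
            if int(value) > 0:
                flagged[name] = int(value)
    return flagged


# Excerpt copied from the display output in Listing 6-3.
SAMPLE_DISPLAY = """\
Administration Status (value)   = up(1)
Operation Status (value)        = up(1)
Inbound Discards                = 0
Inbound Errors                  = 0
Outbound Discards               = 5
Outbound Errors                 = 0
Single Collision Frames         = 0
Carrier Sense Errors            = 67445
Frames Too Long                 = 0
"""

if __name__ == "__main__":
    for name, count in nonzero_problem_counters(parse_statistics(SAMPLE_DISPLAY)).items():
        print(f"{name}: {count}")
```

A nonzero counter is only a starting point. Some collisions, for example, are normal on a shared Ethernet segment, so the change in a counter between two readings is usually more telling than its absolute value.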

Listing 6-2 Output from lanscan command showing the list of LAN cards.

# lanscan
Hardware Station        Crd Hardware Net-Interface  NM MAC   HP DLPI Mjr
Path     Address        In# State    NameUnit State ID Type  Support Num
8/16/6   0x080009C3FA7D 0   UP       lan0     UP    4  ETHER Yes     52
8/20/5/1 0x080009ACC83C 1   UP       lan1     DOWN  5  ETHER Yes     176
8/20/5/7 0x080009ACD860 2   DOWN     lan2     DOWN  6  ETHER Yes     176
8/20/5/8 0x080009ACF11E 3   DOWN     lan3     DOWN  7  ETHER Yes     176
#

Listing 6-3 Output from lanadmin execution.

cancan#lanadmin

          LOCAL AREA NETWORK ONLINE ADMINISTRATION, Version 1.0
                       Sun, Oct 4,1998  16:49:08

               Copyright 1994 Hewlett-Packard Company.
                       All rights are reserved.

Test Selection mode.

        lan      = LAN Interface Administration
        menu     = Display this menu
        quit     = Terminate the Administration
        terse    = Do not display command menu
        verbose  = Display command menu

Enter command: lan

LAN Interface test mode. LAN Interface Net Mgmt ID = 4

        clear    = Clear statistics registers
        display  = Display LAN Intf status and statistics registers
        end      = End LAN Interface Admin, return to Test Selection
        menu     = Display this menu
        nmid     = Network Management ID of the LAN Interface
        quit     = Terminate the Administration, return to shell
        reset    = Reset LAN Interface to execute its selftest

Enter command: display

                      LAN INTERFACE STATUS DISPLAY
                       Sun, Oct 4,1998  16:49:12

Network Management ID           = 4
Description                     = lan0 Hewlett-Packard LAN Interface Hw Rev 0
Type (value)                    = ethernet-csmacd(6)
MTU Size                        = 1500
Speed                           = 10000000
Station Address                 = 0x80009e72436
Administration Status (value)   = up(1)
Operation Status (value)        = up(1)
Last Change                     = 14829
Inbound Octets                  = 30754920
Inbound Unicast Packets         = 0
Inbound Non-Unicast Packets     = 0
Inbound Discards                = 0
Inbound Errors                  = 0
Inbound Unknown Protocols       = 0
Outbound Octets                 = 46132380
Outbound Unicast Packets        = 202335
Outbound Non-Unicast Packets    = 0
Outbound Discards               = 5
Outbound Errors                 = 0
Outbound Queue Length           = 0
Specific                        = 655367

Press <Return> to continue

Ethernet-like Statistics Group

Index                           = 4
Alignment Errors                = 0
FCS Errors                      = 0
Single Collision Frames         = 0
Multiple Collision Frames       = 0
Deferred Transmissions          = 0
Late Collisions                 = 0
Excessive Collisions            = 0
Internal MAC Transmit Errors    = 0
Carrier Sense Errors            = 67445
Frames Too Long                 = 0
Internal MAC Receive Errors     = 0

LAN Interface test mode. LAN Interface Net Mgmt ID = 4

        clear    = Clear statistics registers
        display  = Display LAN Intf status and statistics registers
        end      = End LAN Interface Admin, return to Test Selection
        menu     = Display this menu
        nmid     = Network Management ID of the LAN Interface
        quit     = Terminate the Administration, return to shell
        reset    = Reset LAN Interface to execute its selftest

Enter command: quit
cancan#

Note that the lanadmin command has made the landiag command, available on earlier HP-UX releases, obsolete. Also note that you should not use the lanadmin command with an Asynchronous Transfer Mode (ATM) adapter. ATM provides other commands for use with ATM adapters. These are listed later in this chapter.

linkloop

The linkloop command is used to verify LAN connectivity. You specify a local network management ID and a local or remote station address, and connectivity is then checked by sending test packets. In the example shown in Listing 6-4, connectivity is checked between lan2 (with a network management ID of 6) and a network interface on a remote system.

Because linkloop tests only the physical and data link layers, the remote station must be on the same network segment. The test packets are not routable using IP addresses; you must specify the station address of the destination system instead.

If you are unsure of the station address of a remote host, one command you can use is arp, which shows the current Address Resolution Protocol (ARP) cache entries. ARP provides the mapping between an IP address and a station address. The arp command also shows the host name associated with each IP address. This command is helpful only if connectivity has previously been established or a static ARP entry is being used. You may need to run lanscan on the remote system to determine its station address. The arp command is discussed in more detail later in this chapter.

Listing 6-4 Output showing lanscan and linkloop, which are used to test connectivity.

cancan#lanscan
Hardware Station        Crd Hardware Net-Interface   NM  MAC     HP DLPI Mjr
Path     Address        In# State    NameUnit State  ID  Type    Support Num
8/16/6   0x080009E72436 0   UP       lan0     DOWN   4   ETHER   Yes     52
8/20/5/1 0x0800096BBED4 1   UP       lan1     DOWN   5   ETHER   Yes     176
8/20/5/7 0x0800096BEE5A 2   UP       lan2     UP     6   ETHER   Yes     176
8/20/5/8 0x0800096BBEDE 3   DOWN     lan3     DOWN   7   ETHER   Yes     176
cancan#
cancan#linkloop -i 6 0x0800094A1861
Link connectivity to LAN station: 0x0800094A1861
 --OK
cancan#

Using Additional Products to Monitor Network Links

The standard commands shipped with the UNIX operating system provide substantial capabilities for monitoring your network links. However, you can also buy products that extend these capabilities. A few examples are given in this section.

EMS HA Monitors

The Event Monitoring Service (EMS) is a free software package for HP-UX systems to monitor system hardware components. EMS provides libraries so that additional hardware and software monitors can be added. Hewlett-Packard provides the HA Monitors product, which can be used to detect LAN interface status changes. HA Monitors discovers all the network interfaces. Monitoring can be enabled for one or all network interfaces. This status information is similar to that provided by the MIB-II or lanscan, but EMS allows for information to be communicated automatically through a variety of notification methods. For example, EMS can send an e-mail message, log an event to the system log or an arbitrary text file, or report a problem to MC/ServiceGuard, HP's high availability cluster software. Events can also be reported via TCP, User Datagram Protocol (UDP), SNMP, or opcmsg (a proprietary communication protocol used by IT/O), as long as the notification method does not require the use of a failed NIC.

To configure EMS network monitoring, you use SAM. From the Resource Management functional area, you select the Event Monitoring Service and then select the action to add a new monitoring request. Resources are grouped into categories, and by selecting /net, you can see the interfaces available to be monitored. You can choose an interface to be monitored, or select all of them, and then enter the monitoring parameters. A portion of the EMS configuration is shown in Figure 6-7.

Figure 6-7. Configuring network status monitoring in EMS.


ATM and HyperFabric adapters can also be monitored by using EMS. An EMS monitor is included with each of these products.

MC/ServiceGuard

MC/ServiceGuard is a high availability software product that detects system failures, network or LAN card failures, and the failures of critical applications. MC/ServiceGuard can be configured to handle these failures automatically. For example, a failed critical application can be restarted. The failure of a LAN card can result in the automatic configuration of a backup LAN card to take over the network load.

MC/ServiceGuard is much more than a monitoring product, because it also provides transparent application restart when failures are detected. If you are responsible for monitoring a network, and MC/ServiceGuard is being used in your environment, then you should become familiar with how the MC/ServiceGuard product uses the network.

Although MC/ServiceGuard can be used for a single system to provide transparent LAN failover, it is more commonly used in a cluster environment. In this case, if a system failure occurs, an application can be restarted on another system. This capability requires MC/ServiceGuard to send packets over the network to verify that its cluster members are still alive. You may want to have a dedicated network for this high availability communication, to ensure that high network utilization doesn't cause an unnecessary migration of your application. MC/ServiceGuard monitors subnetworks and can send an SNMP trap when a subnetwork goes up or down. If you are using an SNMP-based management application, this can be a quick way to learn that you are having network problems.

MC/ServiceGuard takes advantage of the ability to have multiple IP addresses bound to a single network interface card. This allows an application to be migrated to another system without the clients needing to learn a new address; the IP address moves with the application. MC/ServiceGuard distinguishes between a "stationary" IP address, which is associated with one computer system, and "floating" IP addresses, which are associated with applications and are migrated with them to other systems.

MC/ServiceGuard sends SNMP traps when its IP addresses are added to or removed from the system. To see whether these floating IP addresses are being used on your system, you can use netstat -in, which shows all of your configured IP addresses. The ifconfig command shows only the stationary IP address for a specified network interface. These commands are discussed in more detail later in the chapter.

Note that MC/ServiceGuard is supported only on HP 9000 Series 800 systems running HP-UX 10.x or later operating systems. However, other system vendors provide similar high availability networking products.

MeasureWare

MeasureWare is a software product that collects system performance information. Collected data can be used by performance tools such as HP PerfView.

MeasureWare monitors numerous metrics on HP-UX, Sun Solaris, and other platforms. Although the information is collected primarily for performance reasons, it can also be helpful in inferring the status of network components. For example, if the rate of inbound LAN packets suddenly dropped to zero on a busy network interface, it might signal a link problem. After you detect a potential problem, you should use other tools to diagnose where the problem might be happening.
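The "busy interface suddenly goes quiet" inference can be expressed as a small check over a series of collected packet rates. The Python sketch below is only an illustration of the idea. It does not use any MeasureWare interface, and the function name, the threshold for what counts as busy, and the number of quiet samples are all assumptions you would tune for your own network.

```python
from typing import Sequence


def traffic_stopped(rates: Sequence[float],
                    busy_threshold: float = 100.0,
                    quiet_samples: int = 3) -> bool:
    """True if the latest samples are all zero after a busy earlier period.

    rates          -- inbound packets per second, oldest first
    busy_threshold -- average rate that counts as a busy interface (assumed)
    quiet_samples  -- consecutive zero samples required before flagging
    """
    if len(rates) <= quiet_samples:
        return False  # not enough history to judge
    recent = rates[-quiet_samples:]
    earlier = rates[:-quiet_samples]
    was_busy = sum(earlier) / len(earlier) >= busy_threshold
    return was_busy and all(rate == 0 for rate in recent)


if __name__ == "__main__":
    # Hypothetical inbound packet rates sampled at regular intervals.
    print(traffic_stopped([850, 910, 780, 0, 0, 0]))
    print(traffic_stopped([5, 3, 0, 0, 0, 0]))
```

Requiring several consecutive zero samples avoids flagging a single idle interval, and the busy threshold keeps normally quiet interfaces from raising alarms.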

MeasureWare is discussed in more detail in "Collecting Network Performance Data," later in this chapter.

Using Link-Specific Commands

You should be able to diagnose the health of Ethernet and Token Ring links by using only the tools already described. Additional commands are available for other network links, such as FDDI, X.25, and ATM. The remaining portion of this section describes some of the link-specific commands that are available to help with fault diagnosis and troubleshooting.

Using FDDI Status Commands

Fiber Distributed Data Interface (FDDI) is a high-speed LAN technology that is capable of data transfer rates of 100 megabits per second. FDDI supports dual fiber-optic rings, which provide a level of redundancy. Systems can connect to an FDDI network through an FDDI device called a concentrator, which can provide some fault tolerance by ensuring that the network won't be impacted by a failed system. NICs with dual-attach capability can directly connect into separate concentrators, for additional protection.

You can check whether your system is using FDDI by executing the lanscan command. The link type is shown for each LAN card. In addition to lanscan, other common commands can be used with FDDI links, such as ioscan, linkloop, and lanadmin.

You can execute fddiinit with the device file of your FDDI adapter, to initialize the adapter and connect the system to the network. If this command fails, the cables may not be connected properly, the card may not be operational, or the concentrator may not be up.

If the adapter is initialized successfully, you can get additional information about the card by using fddistat, which also requires the device file (for example, /dev/lan1). If the card is set up properly, the output should indicate that the ring is operational and that the line state of the station connection is WRAP_S.

An additional command, fddinet, can be used to display connection information about the local ring, a specified remote node, or all nodes connected to the same FDDI ring.

X.25 Status Commands

X.25 is a packet-switching network standardized by CCITT. x25stat can be used to detect problems with an X.25 link. It provides the status of the X.25 device and card, configuration information, and virtual circuit statistics. Executing x25stat -f -d /devicefile shows you the number of errors on the X.25 link. If the link state is not listed as NORMAL, then a problem exists.

Statistical information shown through x25stat includes the current number of virtual circuits configured and the bytes transferred over each subnetwork. x25stat can also show statistics for a specific virtual circuit, such as the number of data packets sent or received.

If IP is not working over X.25, you can check that your network address has been configured correctly and see the mapping between IP and X.25 addresses by using x25stat -a. You may need to reinitialize X.25 using x25init.

The x25check and x25server commands are used together to test the X.25 network. x25server is a background process that waits for call requests from x25check. x25check sends packets to a remote system running x25server, and the packets are echoed back. This can be used to test your X.25 link to the network or the X.25 switch. If successful, x25check also shows the time spent setting up the virtual circuit.

ATM Status Commands

Asynchronous Transfer Mode (ATM) is a high-speed network technology. Some network commands, such as lanscan and netstat, are generic and can be used with ATM links. netstat -i, for example, shows the statistics for ATM network interfaces. However, some ATM-specific status commands are also available, and are described next.

To get information about an ATM adapter, its MAC address, and associated virtual circuits, you can use the atmmgr command. The command requires a card instance number to be specified. Additional information can also be shown, depending on the command options specified. For example, atmmgr 0 show -p shows the status, statistics, and MAC address.

To do an external loopback test, you can use the atmloop command. You must first use atmstop to stop the ATM adapter, and then replace the cable with a loopback hood. After the test, you can issue an atmstop again and then use atminit to initialize the adapter after removing the hood. The atminit command can be used to reset or initialize an adapter, but all Switched Virtual Circuits (SVCs) will be lost.

Listing 6-5 Output from atmmgr showing AAL5 statistics.

cancan: atmmgr 0 show -a
AAL5 statistics
CRC-32 errors in received SDUs      : 0
SDUs lost due to reassembly timeout : 0
SDUs too big                        : 0

To get information about ATM network interfaces and ARP cache entries, use the elstat and elarp commands, respectively. These commands require a network interface name to be specified, such as el101.

To check ATM Adaptation Layer (AAL) interface statistics, use a card instance number of 0, as shown in Listing 6-5.

If the AAL5 interface is working correctly, the number of errors should be low.
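What counts as "low" depends on your traffic, but the check itself is easy to script. The Python sketch below parses the "name : value" lines shown in Listing 6-5 and reports any counter above a limit. The sample text is copied from that listing, and the function names and the default limit of zero are hypothetical choices, not values recommended by the product.

```python
def parse_aal5_statistics(text: str) -> dict:
    """Collect 'name : number' lines into a dictionary of integers."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Lines without a colon, or without a numeric value, are skipped.
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def counters_over_limit(counters: dict, limit: int = 0) -> dict:
    """Return the counters whose value exceeds the (assumed) limit."""
    return {name: value for name, value in counters.items() if value > limit}


# Output copied from Listing 6-5.
SAMPLE_AAL5 = """\
cancan: atmmgr 0 show -a
AAL5 statistics
CRC-32 errors in received SDUs      : 0
SDUs lost due to reassembly timeout : 0
SDUs too big                        : 0
"""

if __name__ == "__main__":
    problems = counters_over_limit(parse_aal5_statistics(SAMPLE_AAL5))
    print(problems if problems else "AAL5 counters are within the limit")
```

As with the other interface counters, comparing successive readings tells you whether errors are still occurring or are left over from an earlier problem.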

Additional ATM commands include atmcheck and atmserver, which can be used to check ATM hardware and end-to-end connectivity, and atminit_net, which resets or initializes a specific network interface on an ATM adapter. This does not stop the ATM adapter, so activity on other network interfaces is not impacted.

HA ATM

HA ATM is a product from Hewlett-Packard that protects ATM links in the event of a failure. HA ATM is integrated with EMS and MC/ServiceGuard. Local recovery is aided through its EMS integration, and remote recovery involves both EMS and MC/ServiceGuard.

The HP ATM adapter enables you to run one Classical IP (CIP) interface and up to 32 Emulated LAN (ELAN) interfaces simultaneously. High availability is provided only for ATM with LAN emulation.

HA ATM protects against a single failure in the ATM network. It detects local ATM ELAN interface failures and can perform local recovery by switching to another local ELAN interface. HA ATM can detect and recover locally from problems with the core ATM software, or with the ATM adapter, link, or ATM switch. EMS events can be sent when a failure is detected. MC/ServiceGuard is notified via EMS when HA ATM is unable to provide local recovery.

MC/ServiceGuard remote recovery is used when a system failure occurs or the system loses all of its adapters, switches, or cables. To process a remote failover, MC/ServiceGuard moves all ATM applications defined as packages to the remote system, and configures any package IP addresses on the remote system's ELAN interface on an ATM adapter.
