Recovering from Network Problems | UNIX Fault Management: A Guide for System Administrators


This section discusses how to diagnose and then correct network problems. Networking is a broad area, spanning geographical distances and a variety of software and hardware components. Faults may be due to a NIC, transceiver, connector, cable, or other network device. Network communication problems may also be due to a software configuration error or misbehaving network software. You need to use a combination of the tools discussed throughout this chapter to isolate the problem.

Problems related to performance were discussed earlier, in the section "Using Network Performance Data." Distinguishing between performance problems and network faults is often difficult. For example, a failed network device stops responding to packets, which may lead to a flood of retransmissions. A variety of network faults can be detected by using network performance tools. netstat and nfsstat are examples of commands that can be used for both performance monitoring and fault diagnosis.

An early step in troubleshooting should be to identify what is unique about the problem. For example, if only one user is experiencing a problem, then something could be wrong with that user's configuration or access rights. If the problem involves only a specific network service, you may want to check whether the service has recently been updated. Similarly, if a communication problem involves a specific computer system, you should check whether that system has recently had a software or hardware configuration change.

The following sections provide some troubleshooting steps that can help you to locate a network problem.

Isolating the Fault

If a user is unable to use a network service, a good first step is to use ping to see whether you can reach both the user's system and the network server. ping is an easy command to remember and can isolate the problem to the appropriate part of the networking stack, because it exercises the physical, data link, and network layers. If ping succeeds, check the transport layer or higher layers for the problem. If ping fails, you have isolated the problem to the network or lower layers.

Depending on the network topology in your computing environment, a possible scenario is that you can access the user's system and network server from your management station, but the user can't access the network server directly. If the user's topology is different from your own, you may need to ping from the user's system, or use the remote ping capability of NNM. If the ping behavior differs between the two vantage points, it may indicate a routing problem.
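The reasoning in the two paragraphs above can be written down as a small decision table. The following Python sketch is only an illustration: the function name triage_ping and the wording of the suggestions are hypothetical, and the two boolean inputs are assumed to come from ping runs you have already made from the management station and from the user's system.

```python
def triage_ping(station_to_server: bool, user_to_server: bool) -> str:
    """Suggest where to look next, given ping results from two vantage points.

    Both arguments are True when ping to the network server succeeded
    from that vantage point. The returned text is illustrative only.
    """
    if station_to_server and user_to_server:
        # ping covers the physical, data link, and network layers.
        return "reachable from both: check transport and higher layers"
    if station_to_server and not user_to_server:
        return "results differ: suspect routing between the user's system and the server"
    if not station_to_server and user_to_server:
        return "results differ: suspect routing between the management station and the server"
    return "unreachable from both: verify name resolution, then network and lower layers"


if __name__ == "__main__":
    print(triage_ping(station_to_server=True, user_to_server=False))
```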

If a ping failure occurs, you should verify name resolution before continuing. You can use nslookup to see whether the name resolves to the correct IP address. Another option is to use ping and specify the IP address instead of the system name. You may try pinging from different subnetworks to eliminate some of the network components as the source of the problem. You can also check the ARP cache to see whether the IP address is being resolved to the correct MAC address.
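The name resolution check can also be scripted. The sketch below is a minimal example, not a replacement for nslookup: it uses Python's standard socket.gethostbyname, which follows whatever resolver order the local system is configured with. The function name check_resolution, the host name, and the address 192.0.2.10 are placeholders chosen for illustration.

```python
import socket
from typing import Callable


def check_resolution(name: str, expected_ip: str,
                     resolver: Callable[[str], str] = socket.gethostbyname) -> str:
    """Report whether `name` resolves to the address you expect.

    `resolver` is injectable so the check can be exercised without a
    live name service; by default the system resolver is used.
    """
    try:
        actual = resolver(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"{name}: lookup failed ({exc})"
    if actual == expected_ip:
        return f"{name}: resolves to {actual} as expected"
    return f"{name}: resolves to {actual}, expected {expected_ip}"


if __name__ == "__main__":
    # "server1" and 192.0.2.10 are hypothetical values.
    print(check_resolution("server1", "192.0.2.10"))
```

If the lookup fails or returns an unexpected address, pinging the IP address directly, as described above, separates a naming problem from a connectivity problem.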

Network and Lower Layers

If you suspect a problem with the lower network protocol layers, you should first verify the network configuration by using ioscan, lanscan, and ifconfig. The lanscan command should display the status of all the LAN cards shown through ioscan. ifconfig shows you whether the IP address has been properly bound to the network interface and whether the subnet mask is set correctly. If these commands do not return the correct information, then a link or software configuration error may exist. You should also check the system log file for link-level errors.
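One consequence of an incorrect subnet mask is that a host misjudges which addresses are on its own LAN. The following sketch shows how the address and mask reported by ifconfig determine that decision, using Python's standard ipaddress module. The function name same_subnet and all addresses are illustrative assumptions.

```python
import ipaddress


def same_subnet(local_ip: str, remote_ip: str, netmask: str) -> bool:
    """Return True if remote_ip falls inside the subnet implied by
    local_ip and netmask (the values ifconfig reports for the interface)."""
    # strict=False lets us pass a host address rather than a network address.
    network = ipaddress.ip_network(f"{local_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(remote_ip) in network


if __name__ == "__main__":
    # With the intended mask, both hosts are treated as local.
    print(same_subnet("192.0.2.10", "192.0.2.200", "255.255.255.0"))
    # With a mistyped, narrower mask, the same peer appears to be remote
    # and traffic would be sent to a gateway instead.
    print(same_subnet("192.0.2.10", "192.0.2.200", "255.255.255.128"))
```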

You may be able to confirm a LAN card error by checking the LEDs on the card itself. Usually, green is good and red or yellow is bad. Check the appropriate hardware user manual for the detailed interpretation of these LEDs.

You can test the data link layers of the source and destination systems by using linkloop. If this command fails, the link layer is the likely source of the problem, and you should check the LAN cables, transceivers, and other network hardware. lanadmin can show you LAN statistics, so that you can see whether excessive packet errors or collisions are occurring. If the linkloop test succeeds, you should next check the network configuration, using netstat -in.

If network problems started when a new system was added to the LAN, the new system may be using a duplicate IP address. You can use netstat -in on each system on the LAN to see whether duplicates are present.
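Comparing the output from many systems by eye is error-prone, so the comparison can be sketched in a few lines of Python. This example assumes you have already collected the interface listing from each system as text, and that the address appears in the fourth whitespace-separated column under a header row beginning with "Name". That column layout is an assumption; check it against the output on your own systems. The host names and the function name find_duplicate_ips are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_ips(outputs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each IP address seen on more than one host to those hosts.

    `outputs` maps a host name to the interface listing captured on it.
    Assumed columns: Name Mtu Network Address ... (address is column 4).
    """
    owners = defaultdict(set)
    for host, text in outputs.items():
        for line in text.splitlines():
            fields = line.split()
            if len(fields) < 4 or fields[0] == "Name":
                continue  # skip blank lines and the header row
            address = fields[3]
            if address.startswith("127."):
                continue  # every host has a loopback address
            owners[address].add(host)
    return {ip: sorted(hosts) for ip, hosts in owners.items() if len(hosts) > 1}


if __name__ == "__main__":
    header = "Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll\n"
    sample = {
        "hostA": header + "lan0 1500 192.0.2.0 192.0.2.10 100 0 90 0 0\n",
        "hostB": header + "lan0 1500 192.0.2.0 192.0.2.11 100 0 90 0 0\n",
        "hostC": header + "lan0 1500 192.0.2.0 192.0.2.10 5 0 4 0 0\n",
    }
    print(find_duplicate_ips(sample))
```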

If you suspect that packets are being routed inappropriately, you can use traceroute to identify the route UDP packets are taking from the source system to the destination. The command shows the route and time taken between gateways. This can help you to determine whether a router has failed or a routing table is configured incorrectly.

If netstat shows bad checksums under the IP, ICMP, TCP, or UDP protocols, it is an indication that packets are getting corrupted, possibly on their way through a gateway. If the server is attached to multiple gateways, you may need to send test packets through each gateway to isolate the culprit. Just check the number of bad checksums after sending the packets through each gateway to determine which one is faulty. You then need to repair the faulty gateway device.
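Because the bad-checksum counters are cumulative, what matters is how much the counter grows during each test, not its absolute value. The sketch below illustrates that bookkeeping. It assumes you record the counter once before testing and once after sending test packets through each gateway in turn; the gateway names, the counts, and the function name checksum_increase_per_gateway are invented for illustration.

```python
from typing import Dict, List, Tuple


def checksum_increase_per_gateway(baseline: int,
                                  readings: List[Tuple[str, int]]) -> Dict[str, int]:
    """Attribute growth in a cumulative bad-checksum counter to gateways.

    `baseline` is the counter before any test traffic was sent.
    `readings` lists, in test order, each gateway and the counter value
    read immediately after sending test packets through it.
    """
    increases = {}
    previous = baseline
    for gateway, counter in readings:
        increases[gateway] = counter - previous
        previous = counter
    return increases


if __name__ == "__main__":
    result = checksum_increase_per_gateway(
        12, [("gw-a", 12), ("gw-b", 47), ("gw-c", 47)]
    )
    print(result)  # {'gw-a': 0, 'gw-b': 35, 'gw-c': 0}
```

A gateway whose test accounts for most of the increase is the one to examine first. Other traffic on the system can also raise the counter during a test, so repeat the measurement before drawing a conclusion.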

The number of input and output errors reported by netstat should be very low, unless a power failure recently occurred. If netstat reports an increasing number of input errors, it could be an indication of corrupt packets being received from another system on the network, or a damaged LAN cable. Try to isolate the source of the remote packets to identify the culprit. If you see an increasing number of output errors, then your system's NIC could be bad.
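The key word above is "increasing": a single reading says little, so compare two snapshots taken some time apart. The following sketch encodes the interpretation given in this paragraph. The dictionary keys ierrs and oerrs and the function name error_trend are assumptions made for the example; the counts themselves would come from your own netstat readings.

```python
from typing import Dict, List


def error_trend(earlier: Dict[str, int], later: Dict[str, int]) -> List[str]:
    """Compare two snapshots of an interface's error counters.

    Each snapshot is a dict with the keys "ierrs" (input errors) and
    "oerrs" (output errors). Returns a list of suggested suspects.
    """
    findings = []
    if later["ierrs"] > earlier["ierrs"]:
        findings.append(
            "input errors rising: suspect corrupt packets from a remote "
            "system or a damaged LAN cable"
        )
    if later["oerrs"] > earlier["oerrs"]:
        findings.append("output errors rising: suspect the local NIC")
    return findings


if __name__ == "__main__":
    for finding in error_trend({"ierrs": 3, "oerrs": 0}, {"ierrs": 41, "oerrs": 0}):
        print(finding)
```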

When a network cable problem is suspected, a LAN analyzer can be used to isolate the problem. The analyzer is attached to the problematic network and can determine the distance to the failed component. A cable may be improperly terminated, or a network segment may need to be replaced.

Transport and Higher Layers

If the network and link layers seem to be working properly, you should next check the transport layer to the destination system by using a network application such as telnet or FTP. The netstat command can be used on the server to see which network applications are currently running. If a desired service has failed, it may need to be restarted. You may want to look in /var/adm/syslog/syslog.log for potential causes of the failure. netstat can also show useful protocol statistics.
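When an interactive telnet or FTP session is inconvenient, a plain TCP connection attempt gives a similar first answer about the transport layer. The sketch below uses Python's standard socket.create_connection. The function name tcp_port_open, the host name, and the timeout are illustrative; port 21 is the well-known FTP control port. A successful connection shows only that something accepted the connection on that port, not that the service behind it is healthy.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # "server1" is a hypothetical host name.
    print(tcp_port_open("server1", 21))
```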

If connectivity to the network server seems okay, but a service is not responding, performance problems may exist with the server or services, or the service may have failed. You may also be running the wrong version of the service. Compare the version of the service on a working system with that on a system where the service is failing. To check the version of a service such as FTP, type what ftpd.

A service may exist on the system, but it may not be running for security reasons. If a service is not being started, check whether the optional file /var/adm/inetd.sec is being used. This file can be used to deny or allow access to each network service. If the user is authorized, make sure that this file is not configured to deny access. Some services have their own specific security files. For example, FTP denies access to local accounts listed in /etc/ftpusers.
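A check of /etc/ftpusers is easy to script. The sketch below assumes the file lists one account name per line; ignoring blank lines and lines beginning with # is an additional assumption of this example, not something stated above. The function name ftp_denied_by_ftpusers and the account names are hypothetical.

```python
def ftp_denied_by_ftpusers(username: str, ftpusers_text: str) -> bool:
    """Return True if `username` is listed in the given ftpusers content.

    Assumes one account name per line. Blank lines and lines starting
    with "#" are skipped (an assumption of this sketch).
    """
    for line in ftpusers_text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        if entry.split()[0] == username:
            return True
    return False


if __name__ == "__main__":
    sample = "root\nuucp\n\njsmith\n"
    print(ftp_denied_by_ftpusers("jsmith", sample))
```

On a live system you would read the text from /etc/ftpusers and pass it to the function.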
