9.2 Network troubleshooting tools

This section is concerned with how you maintain an operational network. Specifically we will look at the tools and techniques you will need to identify problems on your network and remedy them. Essentially, these fall into the following categories:

The NMS
Device statistics and log output
Software diagnostic tools such as ping, traceroute, tcpdump, netstat, and dig
Advanced protocol analyzers and network testers

We are almost at the end of this book. Readers who require comic relief during this discussion should refer to [47].

9.2.1 Using the NMS

Your first port of call should be the network management system itself. If properly set up, it should enable you to focus on specific problems to which you can then apply more sophisticated techniques. If you suspect a problem, the first thing you will want to do is get a fix on its location and isolate the nature of the problem using the following tools:

Network map—Examine the network status map and see if there are any obvious clues regarding the location of the problem. Many NMS GUIs use color-coded icons to indicate the status of a device or link.
Event logs—Examine network management event logs for any specific Traps, alarms, or events. Examine generic event logs on the management station, such as Syslog, if available.
Monitoring—You could have proactively configured the NMS to monitor specific MIB objects on key nodes in advance (e.g., interface status, retransmission counters, etc.).

A naive implementation of the management application may indicate many failures, and this could make life difficult if you are trying to isolate the problem (e.g., failure of a single WAN link may cause a number of icons to indicate failure status). A more sophisticated management application may use some form of artificial intelligence or a rule base to home in more precisely on the problem, or it may implement Trap-directed polling. However, remember that the NMS sees only objects that are manageable and in its object database. If the problem is due to an intermediate device, such as an unmanaged modem, it may not show up on the NMS in any obvious way.
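The rule-base idea can be illustrated with a very small sketch. It assumes the NMS knows, for each node, the upstream node through which it is reached; the topology, the node names, and the root_causes function are hypothetical and are not taken from any particular product. A failed node is reported only if the node above it is still healthy, so a single WAN failure is not buried under the failures it causes downstream.

```python
def root_causes(upstream, down):
    """Return the failed nodes whose upstream neighbor is still reachable.

    upstream maps each node to the node through which the NMS reaches it
    (None for the node nearest the management station).
    down is the set of nodes currently failing their polls.
    """
    causes = set()
    for node in down:
        parent = upstream.get(node)
        # A node is a probable root cause only if the path above it is healthy.
        if parent is None or parent not in down:
            causes.add(node)
    return causes


# Hypothetical topology: site B is reachable only across one WAN link.
upstream = {
    "router-a": None,
    "wan-link": "router-a",
    "router-b": "wan-link",
    "switch-b1": "router-b",
    "server-b1": "switch-b1",
}
down = {"wan-link", "router-b", "switch-b1", "server-b1"}
print(sorted(root_causes(upstream, down)))  # prints ['wan-link']
```

Real products use far richer rules, but the principle of suppressing symptoms that sit behind a known failure is the same.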

In the standard MIB-II there are many objects that are good candidates for proactive monitoring, including the following:

ifOperStatus in the interfaces table. Reports up, down, or testing status for a specific interface.
ipRouteEntry in the IP routing table. You could monitor mission-critical routing entries.
icmpInRedirects in the ICMP group. You could monitor for excessive redirects, indicating some type of system, routing, or interface failure.
tcpRetransSegs in the TCP group. You could monitor for excessive retransmissions.

You should explore the MIB contents and see if there are any objects of particular relevance for your environment. Bear in mind the previous discussion on performance with SNMP polling: keep the number of objects to a minimum and set a reasonable polling interval (one that does not create too much traffic but does give you enough warning about a potential problem). To reduce management traffic in large internetworks it may be wise to configure your own enterprise Traps or use RMON probes.
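When polling a counter such as tcpRetransSegs, what matters is the rate of change between two polls rather than the absolute value. The following sketch shows one way to turn two samples into a rate and compare it with a threshold. It assumes a 32-bit counter that wraps at most once between polls; the function names, the sample values, and the threshold are illustrative only, and no SNMP library is involved.

```python
def counter_delta(previous, current, bits=32):
    """Difference between two samples of a wrapping counter (one wrap at most)."""
    modulus = 1 << bits
    return (current - previous) % modulus


def rate_per_second(previous, current, interval_s, bits=32):
    """Average increments per second over the polling interval."""
    return counter_delta(previous, current, bits) / interval_s


def exceeds_threshold(previous, current, interval_s, limit_per_s):
    """True if the counter grew faster than limit_per_s between the two polls."""
    return rate_per_second(previous, current, interval_s) > limit_per_s


# Two hypothetical samples taken 60 seconds apart; the counter wrapped in between.
print(counter_delta(4294967290, 50))               # prints 56
print(exceeds_threshold(4294967290, 50, 60, 5.0))  # prints False
```

A longer polling interval reduces traffic but also increases the chance that a busy counter wraps more than once, which this sketch cannot detect.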

9.2.2 Router troubleshooting features

Most internetworking devices provide built-in diagnostic features. For example, you would expect a router to provide basic statistics on packet or protocol behavior, interface status, routing topology status, and some protocol event messages (either logged locally in RAM or sent to Syslog). These tools can be invaluable for debugging network or connectivity problems, and you should familiarize yourself with the facilities available. For example, on a Xyplex router it is possible to view the status of a wide area interface via the CLI, as shown in the following code segment:

 SHOW LINK W1 STATUS
 Link: W1 (A–B)                           Type: WAN
 Active State:                Running                       Current   High  Average
 Active Circuit:              Default     % Utilization:          0      4        0
 Compression:                Disabled     Error Rate:             0      0        0
 Dial State:                      N/A
 Monitor State:                   N/A                       Current   High  Maximum
 Monitored Utilization:             0     Output Queue:           0      1     1082
             DCE DEVICE                                     Current   High    Total
 Cable:                           V35     Link Downtime:          0      2        2
 Transmit Link Speed (KBPS):   55.900     Link Down Count:                        0
 Receive Link Speed (KBPS):    55.900     Last Occurred:
                Current       Changes
 CTS:          Observed             1
 RTS:          Asserted             0
 DCD:          Observed             1
 DSR:          Observed             0
 DTR:          Asserted             0
 Ring:              N/A             0

Here we can see link utilization, link errors, transitions in interface signals, and whether the link has gone down at any point. Farther up the protocol stack we might, for example, monitor the status of OSPF neighbors to find out why routing databases are not synchronizing, as shown in the following code segment:

 MONITOR OSPF NEIGHBOR SUMMARY
 08-00-87-01-FF-82 (ROUTER-C)               BR/460        Uptime: 000 01:13:27
   Area/Interface   Router ID      Nbr IP Addr           State   Mode Priority
   193.128.4.2      0.0.0.0        1 (E1)                 Down   None        0
   193.128.4.2      193.128.4.1    193.128.4.1            Full Master        1

There may be a large number of displays to get familiar with, some more relevant than others. Often the vendor may be able to advise you what the most useful commands and displays for troubleshooting are.

Some router vendors allow you to configure SNMP Traps for key events, and this enhances the standard Trap functionality described earlier in this chapter. For example, ICMP network unreachable events could trigger specific Traps, indicating the offending source and destination network addresses. These Traps can be sent to one or more nominated Trap hosts configured by the system administrator and could be particularly useful if configured on a default router to indicate the presence of Martian Hosts [1].

9.2.3 Software diagnostic tools

In this section, we will investigate a number of software tools and services used for testing various aspects of internetwork operation. The main tools of interest include the following:

Ping—A tool that performs basic reachability tests of network operation.
TraceRoute—A tool that can help diagnose routing problems.
Netstat—A tool that is useful for examining network or routing status.
TCPdump—A tool that can be used as a basic protocol analyzer.
Nslookup and Dig—Tools that are useful for studying DNS.

The following standard services are not described here, but you may wish to explore these further in [48]:

Echo—A service that echoes characters back to the sender (port = 7).
Chargen—A service that sends a complete ASCII character set continuously (port = 19).
Finger—A service that discovers the real name of a user from his or her UserID (port = 79).
Discard—A service that acts as a sink point for data (port = 9).
Daytime—A service that gives the date and time (port = 13).

Although these diagnostic tools are extremely useful, many are not advanced enough to give expert interpretation or even identification of the problems on your network. They give basic packet event and timing data, which you must interpret. You need to invest time in understanding the protocols involved so that you know what to expect and can devise a testing strategy that quickly homes in on possible problems. Experience and a methodical approach help a great deal, and anybody working in this area is strongly recommended to keep a notebook and write down anything important while building up knowledge.

Ping

Ping (packet internet groper) is probably the most useful debugging tool for internetworks and the first thing most engineers will use as part of the problem-solving strategy. Ping takes its name from a submarine sonar search, in which a short sound burst is transmitted underwater. If there are other objects present, a reflection is echoed back (which makes a sound like a ping). Ping is implemented over IP using the standard ICMP echo function [49]. Ping is primarily used to test reachability, but it can also be used to gather basic statistics about network performance and reliability. Ping performs the following operations:

Ping sends one or more probe packets (an ICMP echo request) to a specified destination IP address and then waits for a reply. If you understand the topology of your network, you can easily backtrack from the destination, testing reachability along the expected delivery path until you find where the problem is.
It can generate multiple echo requests (with a specified data size) in batches, allowing you to test for percentage packet loss.
It inserts a unique sequence number in each probe packet and reports back the sequence numbers received. By running ping for several minutes you can determine if packets have been dropped, duplicated, or reordered.
It checksums each packet sent and received, enabling you to detect if damage or corruption has occurred.
It inserts a timestamp in each packet, which is echoed back. This can be used to compute the time taken for each packet exchange (i.e., the Round-Trip Time [RTT]). Run ping for several minutes to determine if the round-trip delays are consistent. Ping normally reports round-trip times in milliseconds (ms), although the resolution may be limited to the nearest 10 or 20 ms.
It reports other useful ICMP error messages (such as target host or subnet unreachable).
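To make the probe format concrete, the following sketch builds an ICMP echo request of the kind described above: a type and code, a checksum, an identifier, a sequence number, and a payload that begins with a timestamp. It is an illustration rather than a replacement for ping. The function names and the padding pattern are assumptions of the sketch, and nothing is transmitted (sending ICMP normally requires a raw socket and suitable privileges).

```python
import struct
import time

ICMP_ECHO_REQUEST = 8  # ICMP type for echo request; an echo reply is type 0


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as used by ICMP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload_size: int = 56) -> bytes:
    """Return an echo request with payload_size data bytes (at least 8)."""
    # The first 8 data bytes carry the send time so the RTT can be derived on reply.
    timestamp = struct.pack("!d", time.time())
    padding = bytes(i & 0xFF for i in range(payload_size - len(timestamp)))
    payload = timestamp + padding
    # The checksum is computed with the checksum field set to zero, then filled in.
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, identifier, sequence)
    return header + payload


packet = build_echo_request(identifier=0x1234, sequence=0)
print(len(packet))                # prints 64 (56 data bytes plus the 8-byte ICMP header)
print(internet_checksum(packet))  # prints 0, which is how a receiver verifies the packet
```

The 64-byte result corresponds to the "64 bytes from" lines that ping prints when it is asked to send 56 data bytes.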

The following are some of the typical problems you may experience when debugging with ping:

Discarded packets
Fluctuating Round-Trip Times (RTT)
Unstable connectivity
Ping works but some applications fail
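The first three symptoms can be quantified from the sequence numbers and round-trip times that ping reports. The sketch below is one possible way to summarize a run. It assumes the replies have already been collected as (icmp_seq, rtt_ms) pairs in arrival order; the function name and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def summarize_replies(sent_count, replies):
    """Summarize loss, duplication, reordering, and RTT spread.

    replies is a list of (icmp_seq, rtt_ms) tuples in the order they arrived.
    """
    seen = set()
    duplicates = 0
    reordered = 0
    highest = -1
    for seq, _rtt in replies:
        if seq in seen:
            duplicates += 1  # the same sequence number was answered twice
            continue
        seen.add(seq)
        if seq < highest:
            reordered += 1  # arrived after a later-numbered reply
        highest = max(highest, seq)
    rtts = [rtt for _seq, rtt in replies]
    return {
        "loss_percent": 100.0 * (sent_count - len(seen)) / sent_count,
        "duplicates": duplicates,
        "reordered": reordered,
        "rtt_min": min(rtts) if rtts else None,
        "rtt_avg": mean(rtts) if rtts else None,
        "rtt_max": max(rtts) if rtts else None,
        "rtt_spread": pstdev(rtts) if rtts else None,
    }


# Hypothetical run: 5 probes sent, probe 4 lost, probe 2 late, probe 3 duplicated.
sample = [(0, 30.0), (1, 31.0), (3, 29.0), (2, 35.0), (3, 29.5)]
summary = summarize_replies(5, sample)
print(summary["loss_percent"], summary["duplicates"], summary["reordered"])
```

A large spread between the minimum and maximum RTT, or a loss figure that changes markedly from run to run, points toward the fluctuating RTT and unstable connectivity cases listed above.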

Ping options

Under MS-DOS version 4.10.2222, ping supports the following options.

MS-DOS Option Definitions

-t—Ping the specified host until stopped.
-a—Resolve addresses to hostnames.
-n count—Number of echo requests to send.
-l size—Send buffer size.
-f—Set Don't Fragment flag in packet.
-i TTL—Time To Live.
-v TOS—Type of Service.
-r count—Record route for count hops.
-s count—Timestamp for count hops.
-j host-list—Loose source route along host-list.
-k host-list—Strict source route along host-list.
-w timeout—Timeout in milliseconds to wait for each reply.

BSD UNIX implements a much richer version of ping. The following are some of the most useful ping options under BSD.

BSD UNIX option definitions

-c Count—Send count packets and then stop. To stop manually type Control-C. This option is useful for scripts that periodically check network behavior.
-f Flood—Send packets as fast as the receiving host can handle them, at least 100 per second. This is a useful way to stress test a production network. Exercise with caution on a live network, since high-performance workstations can consume large amounts of bandwidth.
-l Preload—Send preload packets as fast as possible, and then change to normal behavior. Good for finding out how many packets your routers can handle as a burst.
-n—Numeric output only. Use this when, in addition to everything else, you've got nameserver problems and ping is hanging trying to give you a nice symbolic name for the IP addresses.
-p Pattern—Pattern is a string of hexadecimal digits with which to pad the end of the packet. This can be useful if you suspect data-dependent problems, since links have been known to fail only when certain bit patterns are presented to them.
-R Record—Use IP's Record Route option to determine what route the ping packets are using. This can be problematic, since the target host is under no obligation to place a corresponding option on the reply, so consider this a bonus if it works.
-r—Bypass the routing tables. Use this when, in addition to everything else, you've got routing problems and ping can't find a route to the target host. This works only for hosts that can be directly reached without using any routers.
-s Size—Set the size of the test packets. You should check large packets, small packets (the default), very large packets that must be fragmented, and packets that are not a power of two. Read the manual to find out exactly what you're specifying here—BSD ping doesn't count either IP or ICMP headers in the packet size.
-v—Verbose output. Displays other ICMP packets not normally considered interesting.

Ping is available on practically all variations of UNIX, LINUX, and Windows 95/NT and is often implemented on internetworking devices as part of the diagnostic suite. BSD UNIX offers a fully featured ping, freely available for many host systems, whereas most Windows implementations provide only the basic features. Ping is also a nonprivileged command on most systems. The options available are very much platform dependent: syntax varies markedly, and on networking hardware (such as a router CLI) ping may be implemented with very limited functionality, so refer to the documentation for your platform. The general format for using ping is as follows:

 ping [options] <destinationIPaddrs>

Using ping

Many TCP/IP systems have a special IP address called the localhost or loopback interface (specifically, 127.0.0.1; packets sent to this address must never appear outside the host). Pinging the loopback interface is a good way to exercise the internal network configuration and IP stack, although the extent of testing is implementation dependent (good implementations test all the way down to the hardware interface, but many implementations test only down to the IP layer). Problems found in a loopback test are rare but are cause for further investigation (if you cannot ping your local address, it is unlikely you will be pinging anywhere else either). The following ping test shows a sequence of ten echo responses over the loopback interface on a UNIX host (the reports will vary by OS). Note the sequence number reported with each reply. The TTL values and round-trip times are also reported, and both are very consistent. At the end of the session, summary statistics are reported.

 bozo$ ping -c10 localhost
 PING localhost (127.0.0.1): 56 data bytes
 64 bytes from 127.0.0.1: icmp_seq=0 ttl=255 time=2 ms
 64 bytes from 127.0.0.1: icmp_seq=1 ttl=255 time=2 ms
 64 bytes from 127.0.0.1: icmp_seq=2 ttl=255 time=2 ms
 64 bytes from 127.0.0.1: icmp_seq=3 ttl=255 time=2 ms
 64 bytes from 127.0.0.1: icmp_seq=4 ttl=255 time=2 ms
 64 bytes from 127.0.0.1: icmp_seq=5 ttl=255 time=2 ms
 64 bytes from 127.0.0.1: icmp_seq=6 ttl=255 time=2 ms
 64 bytes from 127.0.0.1: icmp_seq=7 ttl=255 time=2 ms
 64 bytes from 127.0.0.1: icmp_seq=8 ttl=255 time=2 ms
 64 bytes from 127.0.0.1: icmp_seq=9 ttl=255 time=2 ms
 --- localhost ping statistics ---
 10 packets transmitted, 10 packets received, 0% packet loss
 round-trip min/avg/max = 2/2/2 ms
 bozo$

The next ping test illustrates how a WAN link is performing by pinging a remote router on the other side of a 128-Kbps link.

 laurell 8 >ping hardy
 PING hardy (193.128.56.2): 56 data bytes
 64 bytes from 193.128.56.2: icmp_seq=0 ttl=254 time=36.134 ms
 64 bytes from 193.128.56.2: icmp_seq=1 ttl=254 time=27.473 ms
 64 bytes from 193.128.56.2: icmp_seq=2 ttl=254 time=29.243 ms
 64 bytes from 193.128.56.2: icmp_seq=3 ttl=254 time=39.151 ms
 64 bytes from 193.128.56.2: icmp_seq=4 ttl=254 time=28.922 ms
 64 bytes from 193.128.56.2: icmp_seq=5 ttl=254 time=39.181 ms
 64 bytes from 193.128.56.2: icmp_seq=6 ttl=254 time=31.221 ms
 ...
 64 bytes from 193.128.56.2: icmp_seq=30 ttl=254 time=816.691 ms
 64 bytes from 193.128.56.2: icmp_seq=31 ttl=254 time=36.105 ms
 64 bytes from 193.128.56.2: icmp_seq=32 ttl=254 time=853.323 ms
 64 bytes from 193.128.56.2: icmp_seq=33 ttl=254 time=678.253 ms
 64 bytes from 193.128.56.2: icmp_seq=34 ttl=254 time=331.213 ms
 64 bytes from 193.128.56.2: icmp_seq=35 ttl=254 time=27.931 ms
 64 bytes from 193.128.56.2: icmp_seq=36 ttl=254 time=273.661 ms
 64 bytes from 193.128.56.2: icmp_seq=37 ttl=254 time=131.990 ms
 64 bytes from 193.128.56.2: icmp_seq=38 ttl=254 time=29.141 ms
 ...
 laurell 9 >

The initial timings show consistent link behavior with an average RTT of approximately 31 ms. However, about 30 seconds into the trace, we see large RTT fluctuations (nearly a whole second for some packets). Since we are transferring 56 bytes of data, plus an 8-byte ICMP header, plus a 20-byte IP header, plus link encapsulation (assume 10 bytes), this gives 94-byte packets. At 128 Kbps, 94 bytes should require approximately 5.875 ms to transfer (94 × 8/128,000 seconds). For a two-way exchange (i.e., an ICMP echo request and echo response) we would expect at least twice this delay plus some host processing time (say 10–20 ms in total). We can assume the initial RTT values indicate other traffic on the link and possible queuing latency, but the values from packet 30 onward indicate serious link congestion.
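The arithmetic in this analysis is easy to repeat for other packet sizes and link speeds. The short sketch below reproduces the calculation; the function name is made up for the illustration, and the 10-byte link encapsulation is the same assumption used in the text.

```python
def serialization_delay_ms(frame_bytes, link_bps):
    """Time in milliseconds to clock frame_bytes onto a link of link_bps bits per second."""
    return frame_bytes * 8 * 1000 / link_bps


# 56 data bytes + 8-byte ICMP header + 20-byte IP header + assumed 10 bytes of framing.
frame = 56 + 8 + 20 + 10
one_way = serialization_delay_ms(frame, 128_000)
print(frame)        # prints 94
print(one_way)      # prints 5.875
print(2 * one_way)  # prints 11.75, the floor for a request plus a reply
```

Note that this covers only the time needed to place the bits on the wire. Propagation delay, queuing in the routers, and host processing are all added on top, which is why the observed figures are higher even when the link is behaving well.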

Potential problems

In some situations ping may be unable to help. These situations include the following:

Some routers silently discard undeliverable packets. Others may believe a packet has been transmitted successfully when it has not been. (This is especially common over Ethernet, which does not provide link-layer acknowledgments.) Therefore, ping may not always provide reasons why packets go unanswered.
Ping cannot tell you why a packet was damaged, delayed, or duplicated. It cannot tell you where this happened either, although you may be able to deduce it.
Ping cannot give you a blow-by-blow description of every host that handled the packet and everything that happened at every step of the way. It is an unfortunate fact that no software can reliably provide this information for a TCP/IP network.
Application-level faults may not be detected by ping, since it tests only the IP layer.
In secure environments ICMP echo may be disabled on sensitive hosts or devices to avoid potential hacking techniques.

Traceroute

TCP/IP provides very limited capabilities for tracing routes, restricted to the IP record route options. These are poorly specified, not reliably implemented, and often disabled for security reasons; hence, they cannot be relied upon as diagnostic tools. Traceroute is, therefore, a best-effort tool hacked together to work around these limitations. It may not work in all circumstances; nevertheless, it can be surprisingly useful.

The traceroute program attempts to trace the path a packet takes through the network by transmitting a series of UDP probe packets to a specified IP address (starting at UDP port 33434) and then waiting for ICMP replies. A group of probe packets (usually three) is initially sent with the minimum valid TTL value (i.e., one). The TTL (within the IP header) is then incremented for each subsequent group, usually up to a value of 30. In an internetwork, every router that forwards these packets subtracts one from the packet's current TTL. If the TTL reaches zero, the packet lifetime has expired and the packet must be discarded. Traceroute relies on the fact that routers normally send an ICMP time exceeded message back to the sender whenever they discard a packet due to a zero TTL value. By starting with small TTL values that quickly expire, traceroute forces routers along the active route to generate these ICMP messages, so we can identify which routers are in the path and in which order. For example, a packet sent with TTL = 1 should produce a message from the first router in the path; the source address of that ICMP time exceeded message is normally the IP address of the interface on which the router received the probe. A packet with TTL = 2 generates a message from the second router and so on, as illustrated in Figure 9.11. If the packet eventually reaches the specified destination, the receiving node will return an ICMP port-unreachable message (since no application should be listening on the probe port). Refer to [49] for full details of ICMP error codes.

Figure 9.11: Traceroute recording example.

For each batch of probe packets traceroute displays the IP addresses reported back, and DNS is used to convert this into a symbolic domain address. Round-trip times are also reported for each packet in the group. Traceroute reports any additional ICMP messages (such as destination unreachable) using a rather cryptic syntax (!N means network unreachable; !H means host unreachable). Once a group of packets has been processed (this could take several seconds), the next group (TTL + 1) begins transmitting, and the whole process repeats.
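The probing logic can be seen more clearly in a small simulation. The sketch below sends no packets at all: it takes a known list of router addresses and works out which node would answer each group of probes, and with which ICMP message. The function name, the addresses, and the tuple layout are assumptions made for the illustration; the defaults (three probes per TTL, a maximum TTL of 30, a base port of 33434) follow the description in the text.

```python
def simulate_traceroute(path, destination, max_ttl=30, base_port=33434, queries=3):
    """Simulate traceroute over a known path; nothing is transmitted.

    path is the ordered list of router addresses between sender and destination.
    Returns a list of (ttl, responder, icmp_message, udp_ports) tuples.
    """
    hops = []
    port = base_port
    for ttl in range(1, max_ttl + 1):
        ports = list(range(port, port + queries))
        port += queries  # the UDP destination port increments with each probe
        if ttl <= len(path):
            # Each router subtracts one from the TTL; the router at which it
            # reaches zero discards the probe and returns ICMP time exceeded.
            hops.append((ttl, path[ttl - 1], "time-exceeded", ports))
        else:
            # The probe survived every router and reached the target host,
            # which answers with ICMP port unreachable, ending the trace.
            hops.append((ttl, destination, "port-unreachable", ports))
            break
    return hops


# Hypothetical two-router path.
for hop in simulate_traceroute(["10.0.0.1", "10.0.1.1"], "10.0.2.9"):
    print(hop)
```

If the path is longer than max_ttl, the loop ends without ever recording a reply from the destination, which is the situation the -m option exists to address.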

Traceroute options

Here's a list of common traceroute options.

Traceroute option definitions

-m max-ttl—At some TTL value, traceroute expects to get a reply from the target host. Of course, if the host is unreachable for some reason, this may never happen, so max-ttl (default 30) sets a limit on how long traceroute keeps trying. If the target host is farther than 30 hops away, you'll need to increase this value.
-n—Numerical output only. Use if you're having nameserver problems and traceroute hangs trying to do inverse DNS lookups.
-p port—Base UDP port. The packets traceroute sends are UDP packets targeted at strange port numbers that nothing will be listening on (we hope). The target host should ignore the packets after generating port-unreachable messages. Port is the UDP port number that traceroute uses on its first packet, and increments by one for each subsequent packet. My traceroute uses 33434 (yours probably does too). Change this if a program on the target host might be using ports in roughly the 33434-33534 range.
-q queries—How many packets should be sent for each TTL value. The default is 3, which is fine for finding out the route. If you're more interested in seeing RTT values from each hop, I'd suggest increasing this number to 10.
-w wait—Wait is the number of seconds packets have to generate replies before traceroute assumes they never will and moves on. The default is 3. Increase this if pings to the target host show round-trip times longer than this.

Potential problems

Traceroute has already been described as potentially unreliable, not because of any fault in the application itself but due to external issues that could occur and make its findings questionable. These issues include the following:

Routing policy—Increasingly features such as policy-based routing are being used on routers to support quality-of-service requirements. Routing policy may cause traceroute to generate routes that have no relevance to the forwarding paths used by applications.
Lack of or bad responses—Some routers do not send back ICMP time-exceeded packets, or they manipulate the TTL field incorrectly. Some end systems do not return ICMP port-unreachable messages, causing traceroute to take a long time to time out.
Routing oscillations—During the running of a traceroute test, probe packets should follow a consistent route, but this is not always the case. If the network is unstable, packets may be routed differently and the resulting reports may be confusing. Run the test several times to be sure.
No forwarding addresses—Traceroute reports only one IP address from each router (normally the receiving interface address). The forwarding interface is not explicitly exposed, and with unnumbered WAN links this could be impossible to deduce.
Routing problems—The router does not have a route back to the sender or may have a route via a different interface than the one on which it received the probe packet. In these cases, you may receive no reply at all or a response from an IP address that is inconsistent (i.e., some other interface on the same router or even the loopback address).

The bottom line is that traceroute can be very useful, but treat its output with some degree of scepticism. Use other tools to validate its reports if you are unsure.

Netstat

Netstat is a useful status tool, available on most TCP/IP implementations. Netstat can be used to display a number of protocol or interface statistics. It can also be used to display the contents of the routing table (so you can see if the host node is learning routes as expected). Again the syntax and facilities will vary between platforms. The options available for MS-DOS version 4.10.2222 are shown in the following list.

MS-DOS NetStat option definitions

-a—Displays all connections and listening ports.
-e—Displays Ethernet statistics. This may be combined with the -s option.
-n—Displays addresses and port numbers in numerical form.
-p proto—Shows connections for the protocol specified by proto; proto may be TCP or UDP. If used with the -s option to display per protocol statistics, proto may be TCP, UDP, or IP.
-r—Displays the routing table.
-s—Displays per protocol statistics. By default, statistics are shown for TCP, UDP, and IP; the -p option may be used to specify a subset of the default.
interval—Redisplays selected statistics, pausing interval seconds between each display. Press CTRL+C to stop redisplaying statistics. If omitted, Netstat will print the current configuration information once.

Using Netstat

We can examine the routing table by using the -r option, as shown in the following code segment.

 C:\WINDOWS>netstat -r
 Route Table
 Active Routes:
   Network Address           Netmask   Gateway Address          Interface   Metric
         127.0.0.0         255.0.0.0         127.0.0.1          127.0.0.1        1
      192.168.32.0     255.255.255.0    192.168.32.200     192.168.32.200        1
    192.168.32.200   255.255.255.255         127.0.0.1            127.0.0        1
    192.168.32.255   255.255.255.255    192.168.32.200     192.168.32.200        1
         224.0.0.0         224.0.0.0    192.168.32.200     192.168.32.200        1
   255.255.255.255   255.255.255.255    192.168.32.200            0.0.0.0        1
 Active Connections
   Proto  Local Address          Foreign Address        State

In this example there are no active connections.
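When reading a route table like this one, it helps to remember that the most specific matching entry normally wins. The following sketch applies that rule to the network, netmask, and gateway columns of the table above using Python's standard ipaddress module. The lookup function is illustrative only; a real IP stack also takes the metric and interface state into account.

```python
import ipaddress

# Rows copied from the route table above: (network, netmask, gateway).
ROUTES = [
    ("127.0.0.0", "255.0.0.0", "127.0.0.1"),
    ("192.168.32.0", "255.255.255.0", "192.168.32.200"),
    ("192.168.32.200", "255.255.255.255", "127.0.0.1"),
    ("192.168.32.255", "255.255.255.255", "192.168.32.200"),
    ("224.0.0.0", "224.0.0.0", "192.168.32.200"),
    ("255.255.255.255", "255.255.255.255", "192.168.32.200"),
]


def lookup(destination, routes=ROUTES):
    """Return the gateway of the most specific matching route, or None."""
    address = ipaddress.ip_address(destination)
    best = None
    for network, netmask, gateway in routes:
        candidate = ipaddress.ip_network(f"{network}/{netmask}")
        if address in candidate:
            # Prefer the longest prefix (the netmask with the most one bits).
            if best is None or candidate.prefixlen > best[0].prefixlen:
                best = (candidate, gateway)
    return best[1] if best else None


print(lookup("192.168.32.7"))    # prints 192.168.32.200 (the local subnet entry)
print(lookup("192.168.32.200"))  # prints 127.0.0.1 (the host's own address)
print(lookup("10.1.1.1"))        # prints None (no matching entry)
```

The last result highlights something worth checking on a real host: this table contains no default route (0.0.0.0), so the host has no path to destinations outside its own subnet.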

In our next example we can combine the -s and -p options to display a subset of the available protocol statistics, as shown in the following code segment.

 C:\WINDOWS>netstat -s -p icmp
 ICMP Statistics
                              Received     Sent
   Messages                   69           69
   Errors                     0            0
   Destination Unreachable    0            0
   Time Exceeded              0            0
   Parameter Problems         0            0
   Source Quenches            0            0
   Redirects                  0            0
   Echos                      33           33
   Echo Replies               33           33
   Timestamps                 0            0
   Timestamp Replies          0            0
   Address Masks              0            0
   Address Mask Replies       0            0

Tcpdump

Tcpdump is a basic packet analyzer, originally released on UNIX and subsequently supported on SunOS, Ultrix, and most BSD revisions. For the following examples I used a version ported onto DOS (there are also reports of a port to LINUX). It was originally written by Van Jacobson to analyze TCP performance problems but can now be used for analyzing a variety of IP protocols and encapsulations, such as TCP, DNS, NFS, SLIP, or AppleTalk. It may be difficult to get tcpdump to work on some UNIX systems, since it requires the interface to be in promiscuous mode.

Using tcpdump

The easiest way to use tcpdump is to run it in interactive mode and use the -i switch to specify the network interface to be used. Summary information for every Internet packet received or transmitted on the interface will be displayed on the screen. The following example trace shows VRRP traffic in normal operation. Since we are looking at all traffic, there is also ICMP traffic generated by a ping request.

 bozo[admin]# tcpdump -i eth-s1p2
 tcpdump: listening on eth-s1p2
 06:12:00.630249 192.168.32.12 > 224.0.0.18:  VRRPv2-adver 20: vrid 32 pri 255
 06:12:01.630144 192.168.32.12 > 224.0.0.18:  VRRPv2-adver 20: vrid 32 pri 255
 06:12:02.630132 192.168.32.12 > 224.0.0.18:  VRRPv2-adver 20: vrid 32 pri 255
 06:12:03.630107 192.168.32.12 > 224.0.0.18:  VRRPv2-adver 20: vrid 32 pri 255
 06:12:03.822585 192.168.32.100 > 192.168.32.12:  icmp: echo request
 06:12:03.822659 192.168.32.12 > 192.168.32.100:  icmp: echo reply
 06:12:04.630083 192.168.32.12 > 224.0.0.18:  VRRPv2-adver 20: vrid 32 pri 255
 06:12:04.831643 192.168.32.100 > 192.168.32.12:  icmp: echo request
 06:12:04.831684 192.168.32.12 > 192.168.32.100:  icmp: echo reply
 06:12:05.630082 192.168.32.12 > 224.0.0.18:  VRRPv2-adver 20: vrid 32 pri 255
 ^C
 10 packets received by filter
 0 packets dropped by kernel
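Because the summary output is plain text, it is straightforward to post-process. As an illustration, the sketch below parses lines in the format shown in this trace and measures the gap between successive VRRP advertisements, which should be steady in normal operation. The regular expression and function names are assumptions that match this particular output format only; other tcpdump versions print lines differently.

```python
import re
from datetime import datetime

# timestamp, source, destination (up to the colon), then the protocol summary
LINE = re.compile(r"^(\d\d:\d\d:\d\d\.\d+) (\S+) > (\S+?):\s+(.*)$")


def parse_line(line):
    """Return (time, source, destination, summary) or None if the line does not match."""
    match = LINE.match(line.strip())
    if not match:
        return None
    stamp, source, destination, summary = match.groups()
    return datetime.strptime(stamp, "%H:%M:%S.%f"), source, destination, summary


def advert_intervals(lines):
    """Seconds between consecutive VRRP advertisements in a summary trace."""
    times = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[3].startswith("VRRPv2-adver"):
            times.append(parsed[0])
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


trace = [
    "06:12:03.630107 192.168.32.12 > 224.0.0.18:  VRRPv2-adver 20: vrid 32 pri 255",
    "06:12:03.822585 192.168.32.100 > 192.168.32.12:  icmp: echo request",
    "06:12:03.822659 192.168.32.12 > 192.168.32.100:  icmp: echo reply",
    "06:12:04.630083 192.168.32.12 > 224.0.0.18:  VRRPv2-adver 20: vrid 32 pri 255",
]
print(advert_intervals(trace))
```

A gap that is noticeably longer than the others would suggest lost advertisements or a struggling router, and is the kind of detail that is easy to miss when reading a long trace by eye.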

The following example trace illustrates tcpdump in verbose mode. Again this trace shows VRRP traffic in normal operation plus ICMP traffic generated by a ping request. Verbose mode is useful for more detailed packet data, but it may swamp the display, making it difficult to analyze and debug problems.

 bozo [admin]# tcpdump -ev -i eth-s1p2
 tcpdump: listening on eth-s1p2
 06:14:10.630215 0:0:5e:0:1:20 1:0:5e:0:0:12 0800 54: 192.168.32.12 > 224.0.0.18: VRRPv2-adver 20: vrid 32 pri 255 int 1 sum ff27 naddrs 1 192.168.32.12 (ttl 255, id 1845)
 06:14:10.807724 0:80:c7:bf:52:9c 0:0:5e:0:1:20 0800 74: 192.168.32.100 > 192.168.32.12: icmp: echo request (ttl 32, id 16643)
 06:14:10.807800 0:0:5e:0:1:20 0:80:c7:bf:52:9c 0800 74: 192.168.32.12 > 192.168.32.100: icmp: echo reply (ttl 255, id 1846)
 06:14:11.630110 0:0:5e:0:1:20 1:0:5e:0:0:12 0800 54: 192.168.32.12 > 224.0.0.18: VRRPv2-adver 20: vrid 32 pri 255 int 1 sum ff27 naddrs 1 192.168.32.12 (ttl 255, id 1847)
 06:14:11.810506 0:80:c7:bf:52:9c 0:0:5e:0:1:20 0800 74: 192.168.32.100 > 192.168.32.12: icmp: echo request (ttl 32, id 16899)
 06:14:11.810551 0:0:5e:0:1:20 0:80:c7:bf:52:9c 0800 74: 192.168.32.12 > 192.168.32.100: icmp: echo reply (ttl 255, id 1848)
 ^C
 9 packets received by filter
 0 packets dropped by kernel

Tcpdump provides several other important options, as well as the ability to specify an expression to restrict the range of packets you wish to see. Refer to the tcpdump "man" page under UNIX or the relevant help or Readme file under DOS or Windows.

Potential problems

Potential problems include the following:

No output—Check to make sure you're specifying the correct network interface with the -i option, which I suggest you always use explicitly. If you're having DNS problems, tcpdump might hang trying to look up DNS names for IP addresses; try the -f or -n options to disable this feature. If you still see nothing, check the kernel interface; tcpdump might be misconfigured for your system.
Dropped packets—At the end of its run, tcpdump will inform you if any packets were dropped in the kernel. If this becomes a problem, it's likely that your host can't keep up with the network traffic and decode it at the same time. Try using the tcpdump -w option to bypass the decoding and write the raw packets to a file; then come back later and decode the file with the -r switch. You can also try using -s to reduce the capture snapshot size.
Messages that end with [|rip] and [|domain]—Messages ending with [|proto] indicate that the packet couldn't be completely decoded, because the capture snapshot size (the so-called snarf length) was too small. Increase it with the -s switch.
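If you have saved a capture with the -w option, you can check whether the snapshot size is causing truncation before you try to decode it. The sketch below scans a capture held in memory and counts the packets whose captured length is shorter than their length on the wire. It assumes the classic libpcap file layout (a 24-byte file header followed by a 16-byte header per packet), which is not described in this chapter, and the function name is hypothetical. Check the documentation for your tcpdump version before relying on it.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic libpcap file with microsecond timestamps


def scan_pcap(data: bytes):
    """Return (snaplen, packet_count, truncated_count) for a classic pcap file."""
    if struct.unpack("<I", data[:4])[0] == PCAP_MAGIC:
        order = "<"  # written on a little-endian host
    elif struct.unpack(">I", data[:4])[0] == PCAP_MAGIC:
        order = ">"  # written on a big-endian host
    else:
        raise ValueError("not a classic pcap file")
    # File header: magic, version major/minor, time zone, sigfigs, snaplen, link type.
    _, _, _, _, _, snaplen, _ = struct.unpack(order + "IHHiIII", data[:24])
    offset, packets, truncated = 24, 0, 0
    while offset + 16 <= len(data):
        # Packet header: seconds, microseconds, captured length, original length.
        _sec, _usec, captured, original = struct.unpack(
            order + "IIII", data[offset:offset + 16]
        )
        packets += 1
        if captured < original:
            truncated += 1  # the snapshot size cut this packet short
        offset += 16 + captured
    return snaplen, packets, truncated
```

To use it, read the capture file in binary mode and pass the bytes to scan_pcap. If the truncated count is high and you are seeing [|proto] messages, recapture with a larger -s value.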

Nslookup and dig

Nslookup is a comprehensive tool for diagnosing DNS problems and is included in the BIND distribution. There are too many applications for nslookup to be covered here, so rather than do it an injustice the interested reader is directed to [50], where a whole chapter is dedicated to its use. Domain information groper (Dig) is another useful diagnostic tool that works in a way similar to nslookup and is also included in the BIND distribution. Some engineers prefer dig's user interface to that of nslookup. In short, both dig and nslookup are command-line tools that enable you to issue queries to DNS servers; the replies are then reported on the display (nslookup can be run interactively or noninteractively, depending on the scope of action required). In effect they allow you to act as a DNS resolver. For example, you can retrieve a list of root servers by typing the following command line:

    % dig @a.root-servers.net . ns > db.cache
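To make the idea of "acting as a resolver" concrete, the following minimal sketch builds the wire-format question that a query for the root zone's NS records carries, following the message layout in RFC 1035. It is not how dig itself is implemented; the function names and the fixed query ID are assumptions made for illustration, and the sketch stops short of sending anything on the network.

```python
import struct

QTYPE_NS = 2       # record type for name servers (RFC 1035)
QCLASS_IN = 1      # Internet class
FLAG_RD = 0x0100   # "recursion desired" bit in the header flags

def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    encoded = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels)
    return encoded + b"\x00"   # the root is the empty label

def build_query(name, qtype=QTYPE_NS, query_id=0x1234):
    """Return a DNS query message containing a single question."""
    # Header fields: ID, flags, then question, answer, authority, additional counts.
    header = struct.pack("!HHHHHH", query_id, FLAG_RD, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", qtype, QCLASS_IN)
    return header + question

if __name__ == "__main__":
    # The root zone is written "."; its encoded name is a single zero byte.
    print(build_query(".").hex())
```

Sent as a UDP datagram to port 53 of a name server, these bytes ask the same question as the dig command line shown above; parsing the reply is where the real work of a resolver lies.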

9.2.4 Advanced diagnostic tools

There are many commercial tools on the market for diagnosing network problems. They broadly fall into the following categories:

Media and cable testers
Multiprotocol protocol analyzers
Traffic generators and application simulators

We will not dwell on the first category of products; there are many vendors of media testers—see [51] for representative examples such as the DSP-2000 handheld cable analyzer.

Protocol analyzers

Network protocol analyzers are one of the key tools used by network engineers to identify subtle network, application, or protocol issues. They range in complexity from handheld or PC-based packet capture tools (see Figure 9.12) to sophisticated custom hardware offering simultaneous multiport packet capture, expert analysis, and simulation/playback. Each packet is typically timestamped with millisecond resolution and checked for CRC errors and media errors, such as being too short or too long. So-called expert analyzers will check for errors or notable events based on analysis of multiple packet streams by identifying which stream each packet belongs to and analyzing each flow individually. This requires data structures to be maintained to keep track of items such as source and destination addresses, port numbers, protocol IDs, timestamps, and flags. Expert analysis requires inherent knowledge of how various protocols and applications work, as well as tracking of the key state machine events.

Figure 9.12: A PC-based network analyzer, showing detailed protocol decoding of a TCP/IP frame over an Ethernet interface.

Historically the prime vendors of this equipment include companies such as Network Associates, Inc. (Santa Clara, CA), Wandel & Goltermann Technologies, Inc. (Triangle Park, NC), and Hewlett-Packard Co. (Palo Alto, CA). Examples of widely used network analyzers today include the following:

Network Associates, Inc.—Sniffer, NetXRay [52]
HP range of network analyzers (e.g., 4950) [45]
Wandel & Goltermann Technologies network analyzers and testing suites [53]
Radcom Equipment, Inc. (e.g., Enlite, Wirespeed 622 ATM Analyzer, Prismelt Analyzer) [54]
Fluke range of handheld protocol analyzers (e.g., 680 Series Enterprise LANMeter) [51]
Tekelec range of network analyzers
Novell LANAlyzer
IBM's DatagLANce Network Analyzer

The Sniffer product from Network Associates, Inc. has become something of a benchmark product, used by many companies because of its robustness, its broad protocol decoding abilities, and the ability to add custom decodes. HP also has a history of producing quality products, and W&G offers high-end, high-performance analyzers. Radcom is relatively new to the market but offers a number of products, including wire-speed ATM analyzers. At the basic level these devices will give you useful statistics on the major performance parameters, such as the following:

Network utilization
Symbol and receive length errors
Average and peak frame rates
Bytes transmitted
Hardware flow control
Unicast and multicast distribution
Protocol distribution
Bad frames (too-short frames, too-long frames, and frames with a bad CRC)
Total and current frames/cells sent and received by VPI/VCI or DLCI

For serious work the key features to look out for are as follows:

A full suite of protocol decodes
A wide range of WAN options (Frame Relay, T1/T3, E1/E3, ATM, X.25, PPP, etc.)
Extensive custom filtering capabilities
The capability to add custom decodes
Real-time capture and translation
Expert protocol or application analysis
The ability to edit traces and play them back for simulation work
Multi-interface capture (typically LAN and WAN)
High-speed media support (100-Mbps Ethernet, Gigabit Ethernet, etc.)
The ability to distribute multiple capture devices and centrally manage them
Trace file export options (ASCII and other trace formats)

Some of these features are discussed in the following text.

Expert or no expert

Without an expert mode option the engineer needs to be extremely well acquainted with the protocols and services of interest. A basic analyzer will decode frames and provide timestamps, flag analysis, and rudimentary error checking (long and short frames, CRC errors, etc.). Some protocol suites may not be decoded fully, depending upon the level of sophistication of the analyzer. These basic devices are effectively stateless in that the more subtle protocol state issues are not observed, since the analysis is essentially based on discrete frames. More sophisticated expert analyzers are more protocol aware and more stateful in operation. These devices are aware of the state machines used by common protocol stacks, so they know that FTP uses a fixed control session and dynamic data sessions on different port numbers and can associate several related flows. They may even include very high level application support for protocols such as SQLnet and have a much broader set of decodes. Network Associates, Inc., for example, claims support for more than 400 protocol decodes. This level of support can save valuable hours digging through traces and mapping out protocol interactions by hand.
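The difference between frame-by-frame decoding and flow-aware analysis can be illustrated with a small flow table. The sketch below is a simplified assumption of how such a table might look, not a description of any vendor's implementation: the Packet and Flow types, their fields, and the function names are all hypothetical. Packets are grouped by protocol, addresses, and ports so that both directions of a conversation land in the same entry.

```python
from collections import namedtuple
from dataclasses import dataclass, field

# Hypothetical decoded-packet summary; flags is a tuple of flag names.
Packet = namedtuple("Packet", "ts src dst proto sport dport flags")

@dataclass
class Flow:
    first_seen: float
    last_seen: float
    packets: int = 0
    flags_seen: set = field(default_factory=set)

def flow_key(pkt):
    """Return a direction-independent key for the conversation a packet belongs to."""
    a = (pkt.src, pkt.sport)
    b = (pkt.dst, pkt.dport)
    # Order the two endpoints so A->B and B->A produce the same key.
    return (pkt.proto,) + (a + b if a <= b else b + a)

def track(packets):
    """Group packets into flows and keep simple per-flow state."""
    flows = {}
    for pkt in packets:
        key = flow_key(pkt)
        flow = flows.get(key)
        if flow is None:
            flow = flows[key] = Flow(first_seen=pkt.ts, last_seen=pkt.ts)
        flow.last_seen = pkt.ts
        flow.packets += 1
        flow.flags_seen.update(pkt.flags)
    return flows
```

A real expert analyzer would go much further, for example by driving a TCP state machine per flow or by linking an FTP control session to the data sessions it negotiates, but the per-flow table is the starting point.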

Some devices may offer automated fault and performance management functions, which can be integrated with your network management. These tools automatically detect and pinpoint problems while monitoring the network and forwarding alarms with events to an NMS as required. Bottlenecks, protocol violations, and even problems such as duplicate addresses and misconfigured routers can be identified automatically and flagged to the NMS.

Edit, playback, and simulation

For any serious work you will also need the ability to modify and play back traces. This lets you simulate either failure or heavy load conditions in the lab and get a better handle on the problem. You may also want to stress test a live network as part of the commissioning stage. If your analyzer lets you edit frames, you can adjust a trace from a live network to suit your test environment. For example, you may need to change IP addresses in a whole range of packets. Note that if you did this, you would also have to recompute the IP header checksum and any TCP or UDP checksums (which cover the IP addresses through the pseudo-header), so any tool that can do this automatically is extremely useful.
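To show why edited addresses force a checksum update, here is a minimal sketch that rewrites the source address in an IPv4 header and recomputes the header checksum using the one's-complement sum described in RFC 1071. It assumes a 20-byte header with no options, and the function names are illustrative only. TCP and UDP checksums would need the same treatment over the pseudo-header and payload, which is not shown.

```python
import socket
import struct

def internet_checksum(data):
    """Return the 16-bit one's-complement checksum of data (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                              # pad to a whole number of words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                               # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def rewrite_source_ip(header, new_src):
    """Return a copy of a 20-byte IPv4 header with a new source address and checksum."""
    hdr = bytearray(header)
    hdr[12:16] = socket.inet_aton(new_src)           # source address field
    hdr[10:12] = b"\x00\x00"                         # zero the checksum before summing
    struct.pack_into("!H", hdr, 10, internet_checksum(bytes(hdr)))
    return bytes(hdr)

if __name__ == "__main__":
    # Sample header: UDP datagram from 192.168.0.1 to 192.168.0.199.
    original = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
    edited = rewrite_source_ip(original, "10.0.0.1")
    print(edited.hex())
```

A quick sanity check is that the checksum computed over a complete, valid header (checksum field included) comes out as zero.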

Real-time capture, multiple interfaces

Another useful feature of more advanced devices is multiport real-time capturing. This allows you to capture a LAN and WAN port concurrently. You can then map the two traces together (both are timestamped), and this can resolve subtle issues when diagnosing problems over wide area links.
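Mapping two traces together amounts to interleaving them by timestamp. The sketch below assumes each trace is already sorted by time, that both ports were stamped from the same clock, and that each entry is a simple (timestamp, summary) pair; these structures and the function name are assumptions made for illustration.

```python
import heapq

def merge_traces(lan, wan):
    """Merge two time-sorted traces into one list of (timestamp, port, summary)."""
    tagged_lan = ((ts, "LAN", summary) for ts, summary in lan)
    tagged_wan = ((ts, "WAN", summary) for ts, summary in wan)
    # heapq.merge preserves ordering as long as each input is already sorted.
    return list(heapq.merge(tagged_lan, tagged_wan))

if __name__ == "__main__":
    # Invented sample data: the same exchange seen on each side of a router.
    lan = [(0.001, "SYN out"), (0.120, "SYN-ACK in")]
    wan = [(0.004, "SYN out"), (0.117, "SYN-ACK in")]
    for ts, port, summary in merge_traces(lan, wan):
        print(f"{ts:8.3f}  {port}  {summary}")
```

Reading the merged view makes it easier to see, for instance, how much of a delay is incurred on the wide area link rather than inside the local router.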

High-speed, real-time capture

As network media increase in bandwidth (e.g., Gigabit Ethernet or ATM OC-3 and OC-12), analyzers have a much tougher job keeping up. Not only do they have to capture all frames comfortably (missing frames equals missed problems), but they also have to do at least some measure of translation and analysis in real time. For example, when capturing ATM you may wish to display open and closed Switched Virtual Circuit (SVC) connections in real time while capturing cells. Gigabit Ethernet can carry up to approximately 1.49 million minimum-size (64-byte) frames per second, so real-time capture and translation is no small task. Software-based analyzers, which run on standard PCs using off-the-shelf adapter cards, are unlikely to keep up with these kinds of speeds for the foreseeable future; you will need dedicated hardware. Vendors that claim gigabit-rate operation for their custom-designed analyzers include Network Associates, Inc. [52], Wandel & Goltermann Technologies, Inc. [53], and Hewlett-Packard Co. [45].
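The worst-case frame rate for Gigabit Ethernet can be checked with a little arithmetic. The sketch below assumes standard Ethernet framing overheads (an 8-byte preamble and start-of-frame delimiter, a 64-byte minimum frame including the CRC, and a 12-byte interframe gap); the function name is illustrative only.

```python
PREAMBLE_AND_SFD = 8   # bytes sent ahead of every frame
MIN_FRAME = 64         # minimum frame size, including the 4-byte CRC
INTERFRAME_GAP = 12    # idle byte times required between frames

def max_frames_per_second(line_rate_bps, frame_bytes=MIN_FRAME):
    """Return the theoretical maximum frame rate for frames of a given size."""
    bits_on_wire = (PREAMBLE_AND_SFD + frame_bytes + INTERFRAME_GAP) * 8
    return line_rate_bps / bits_on_wire

if __name__ == "__main__":
    for size in (64, 512, 1518):
        rate = max_frames_per_second(1_000_000_000, size)
        print(f"{size:>5}-byte frames: {rate:,.0f} frames/s")
```

With minimum-size frames the analyzer has roughly 672 nanoseconds to deal with each frame, which is why the capture path is usually implemented in dedicated hardware.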

Trace format export and import

The lowest common denominator here might be ASCII or Comma Separated Value (CSV) format. This at least allows you to read a trace from different sources via a standard word processor (ASCII) or spreadsheet (CSV). Of course, you will not be able to perform any further protocol analysis in this format; you are at the mercy of the tool that exported the file. Network Associates was probably the first major vendor in this field to publish its internal (proprietary) trace file format. They chose to implement a simple but elegant <Tag><Length><Value> schema for storing records, and this is much easier to translate than a bulk memory dump. Subsequently many other vendors supported the import and export of this file format. The Sniffer file format is the closest thing to a standard among competing manufacturers, and this is important for any support organization, where traces may be coming in from the field in different file formats. It is far easier to convert traces and analyze on a common platform than to have an array of different analyzers back at the main office just to read different file formats.
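The appeal of a tag-length-value layout is that a reader can walk the file record by record and skip any tag it does not understand. The sketch below illustrates the general technique only: the 2-byte tag and 2-byte length fields in network byte order are assumptions, not the actual Sniffer record layout, and the function names are hypothetical.

```python
import struct

HEADER = struct.Struct("!HH")   # assumed layout: 16-bit tag, 16-bit length

def encode_record(tag, value):
    """Serialize one record as tag, length, then the value bytes."""
    return HEADER.pack(tag, len(value)) + value

def iter_records(blob):
    """Yield (tag, value) pairs from a byte string of concatenated records."""
    offset = 0
    while offset < len(blob):
        if offset + HEADER.size > len(blob):
            raise ValueError("truncated record header")
        tag, length = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        if offset + length > len(blob):
            raise ValueError("truncated record value")
        yield tag, blob[offset:offset + length]
        offset += length            # the length field tells us where the next record starts

if __name__ == "__main__":
    # Tags 1, 2, and 99 are invented values used only for this demonstration.
    blob = encode_record(1, b"abc") + encode_record(99, b"") + encode_record(2, b"\x01\x02")
    known = {1, 2}
    for tag, value in iter_records(blob):
        if tag in known:
            print(tag, value)       # unknown tags are skipped without breaking the parse
```

Contrast this with a bulk memory dump, where a converter must know the exact size and position of every structure before it can read anything.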

9.2.5 Selecting the right tools for the job

Before you dust off the analyzer, the most useful starting points for locating problems are the NMS and tools such as ping, traceroute, and tcpdump. You can waste considerable time running around your network taking traces at exactly the wrong spots, so the first job in the event of a network failure is to isolate the problem. Once you have narrowed it down to a particular LAN or WAN segment, you can focus in on the problem. In many cases you will resolve the problem without taking a trace.

For small networks or day-to-day use a cheaper PC-based analyzer can be extremely useful. These tools are essentially software applications running over standard NICs and may be restricted in the interface types supported (often LAN only, sometimes only Ethernet). A good example is illustrated in Figure 9.12. The tcpdump utility is useful but quite limited; if you cannot afford even a PC-based analyzer, however, it may be your only option. For serious debugging you will need a more sophisticated LAN/WAN analyzer, and probably more than one. If you have taken a proactive approach with your management strategy by deploying remote monitoring agents or RMON probes, a network analyzer will complement these tools well. Much as your finance department may balk at the price of these devices, the cost of failing to diagnose a problem quickly enough may dwarf the product cost. Choose a high-performance tool that lets you add more modules as your network grows or new media types are added. You need a wide range of protocol decodes, even on an IP-only internetwork; remember that there will be Windows devices potentially churning out Novell, NetBIOS, and who knows what else. To lock in quickly on problems choose an analyzer with some form of expert capability; playback mode is a must.