THE MECHANICS OF NETWORK TROUBLESHOOTING | Cisco: A Beginners Guide, Fourth Edition

In internetworks, trouble is often caused either by failing device hardware or a configuration problem. The location of most problems can be identified remotely, and to some extent, the problems can also be diagnosed and even fixed remotely (but the hardware must still be running for that). By "fixing remotely," we mean without walking over and actually inspecting and touching the device; we don't necessarily mean being geographically removed. If, say, an enterprise's campus internetwork is experiencing a problem, network administrators usually do most troubleshooting tasks without even leaving their desks.

In Cisco environments, remote work can be done through a network management console or by logging directly into a device's IOS command-line environment through Telnet or Secure Shell (SSH). As you learned, the Cisco NMS consoles (CiscoWorks Resource Manager Essentials and CiscoWorks Campus Manager) use their graphical interfaces to indirectly manipulate IOS commands inside the remote device. Thus, most of the real troubleshooting work takes place inside the device's IOS environment. Here are the major IOS commands used to perform most troubleshooting tasks:

ping Indicates whether "echo" packets are reaching a destination and returning. For example, if you enter ping 10.1.1.1, IOS will return the percentage of packets that echoed back from the 10.1.1.1 interface.
traceroute Reports the actual path taken to a destination. For example, if you enter traceroute ip 10.1.1.1, IOS will list every hop the message takes to reach the destination 10.1.1.1 interface.
show Reports configuration and status information on devices and networks. For example, the show memory command displays how much of the device's memory is in use and how much is free.

The source of a problem must be either a device or the network media (cabling, connectors, and so on). Even if the trouble is in a cable, the way to find it is through IOS. The ping and traceroute commands are used to locate problems. If the device is still running, the show and debug commands are employed to diagnose them. Actual fixes are done by changing either the hardware or its configuration. The debug command is similar to show, except it generates far more detailed information on device operations, so much so that running debug may greatly slow down device performance.

Network Troubleshooting Methods

Problems are usually brought to a network administrator's attention by users. They want to know why they can't access a service within the enterprise's internetwork, or they complain that performance is slow. The location and nature of the complaint are themselves strong clues as to what's causing the problem. Many times, the administrator immediately knows what's wrong and how to fix it, but oftentimes an investigation must be launched to figure out which device is the source of the trouble, what's causing it, and what the best way to fix it is. The network administrator must find answers by methodical troubleshooting. As you might imagine, troubleshooting largely works by a process of elimination, as in the following:

What are the symptoms? Usually, this boils down to users not being able to reach a destination. Knowing both endpoints of a network problem (the source and destination addresses) is the base information in most troubleshooting situations.
Where do I start looking? Does the scenario fit a known pattern that suggests probable causes? For example, if a server isn't responding to service requests from a client, there could be a problem with the server or the client itself. If the server is working okay for other clients, then it might be the client device. If not that, then the problem must reside somewhere between the two.
Where do I start? There are rules of thumb that short-list what's most likely causing a certain type of symptom. The administrator should diagnose "best candidate" causes first. For example, if a server accessed over a WAN link seems slow to remote dial-in users, the link could be going bad, usage could be up, there could be a shortage of buffer memory in the router interface servicing the link, or the hosts could be misconfigured. One of these probable causes will explain the problem 95 percent of the time.
What's the action plan? Finding the exact cause of a problem in a malfunctioning device means dealing with one variable at a time. For example, it wouldn't make sense to replace all network interface modules in a router before rebooting. Doing so might fix the problem, but it wouldn't define the exact source or even what fixed it. In science, this is called changing one variable at a time. The best practice is to zero in on the source by cutting variables down one by one. That way, the problem can be replicated, the fix validated as a good one, and the exact cause recorded for future reference. An action plan also allows you to undo changes that don't fix the problem (or may even make it worse).

Note

Before you get too bogged down in trying to isolate the problem with checking IP addresses or other configuration information, check the cables. You might save yourself hours of trouble and effort by reconnecting a loose cable or a power cord that has come undone.

Most internetwork problems manifest themselves as either seriously degraded performance or as "destination unreachable" timeout messages. Sometimes, the problem is widespread; other times, it's limited to a LAN segment or even to a specific host. Let's take a look at some typical problems mapped to their probable causes. Table 15-1 outlines problems with host connectivity. (Hosts are usually single-user PCs, but not always.)

Table 15-1: Typical Host Access Problems and Causes
Symptoms	Probable Causes
Host can't access networks beyond local LAN segment.	Misconfigured settings in host device, such as bad default gateway IP address or bad subnet mask.
	The gateway router is malfunctioning.
Host can't access certain services beyond local LAN segment.	Misconfigured extended access list on a router between the host and the server.
	Misconfigured firewall, if the server is beyond the autonomous system.
	The application itself may be down.

Unfortunately, most internetwork problems aren't limited to a single host. If a problem exists in a router or is spread throughout an area, many users and servers are affected. Table 15-2 outlines a couple of typical network problems that are more widespread.

Table 15-2: Typical Router Problems and Causes
Symptoms	Probable Causes
Most users can't access a server.	Misconfigured default gateway specification in the remote server.
	Misconfigured access list in the remote server.
	Hosts unable to obtain IP addresses through DHCP.
Connections to an area can't be made when one path is down.	Routing protocol not converging within the routing domain.
	All interfaces on router handling alternative path not configured with secondary IP addresses (discontinuous addressing).
	Static routes incorrectly configured.

Many times, networks and services are reachable, but performance is unacceptably slow. Table 15-3 outlines factors that can affect performance within a local network. It doesn't address WAN links, however. They're covered separately later in this chapter because serial lines involve a slightly different set of technologies and problems.

Table 15-3: Campus LAN Performance Problems and Causes
Symptoms	Probable Causes
Poor server response; hard to make and keep connections.	Bad network link, usually caused by a malfunctioning network interface module or LAN segment medium.
	Mismatched access lists (in meshed internetwork with multiple paths).
	Congested link, overwhelmed by too much traffic.
	Poorly configured load balancing (routing protocol metrics).
	Misconfigured speed or duplex settings.

Troubleshooting Host IP Configuration

If a user is having trouble accessing services and the overall network seems to be okay, a good place to start looking for the cause of the problem is inside that person's computer. There are a few things that could be misconfigured in the user's host computer:

Incorrect IP information The IP address or subnet mask information could be missing or incorrect.
Incorrect default gateway The default gateway router could be misconfigured.
Nonfunctioning name resolution DNS or WINS could be misconfigured.

To refresh on the subject, every host has a default gateway specified in the host's network settings. A default gateway is an interface on a local router that is used for passing messages sent by the host to addresses beyond the LAN. A default gateway (also called a gateway of last resort) is configured, because it makes sense for one router to handle most of a host's outbound traffic in order to keep an updated cache on destination IP addresses and routes to them. A host must have at least one gateway, and a second one is sometimes configured for redundancy in case the primary gateway goes down.

Checking the Host IP Address Information

Misconfigured network parameters in desktop hosts are usually attributable to a mistake by the end user. Keep in mind that (on Windows computers, at least) administrators and power users can easily access and modify network settings. To check the host's IP address information in Windows XP, for example, click the Start button on the menu bar, choose Run, type cmd, and then click OK. This will open the Windows XP command prompt window. At the command prompt, type ipconfig /all and then press Enter. This will display the host's IP address configuration.

If you need to configure the host's IP address settings, this is done through the Network Properties screen. To access the network properties in Windows XP, for example, click the Start button on the menu bar, and then choose Connect To | Show All Connections. From the resulting list of network connections, right-click the appropriate network icon, and then select Properties. Click the General tab, and in the This Connection Uses the Following Items box, click Internet Protocol (TCP/IP), and then click the Properties button. In Windows NT/2000, click the Start button on the menu bar, and then choose Control Panel | Network | Configuration, and, finally, TCP/IP Properties. This will allow you to set the IP address and default gateway.

The protocol will usually point to a network interface card (NIC) connecting the host to the LAN, as is the case with the TCP/IP Ethernet PC card highlighted in Figure 15-1. (If the host dials into the internetwork, the protocol that points to the dial-up adapter should be selected instead.)

Figure 15-1: To troubleshoot a host, the place to start is the network interface card

Once you're pointed at the right NIC, start by making sure that the host is identifying itself correctly to the network. The example in Figure 15-2 shows a statically defined IP address and subnet. These must match what's on file for the host in the config file of the router serving as the default gateway. If the Obtain An IP Address Automatically check box is selected, the host's IP address is dynamically assigned by a Dynamic Host Configuration Protocol (DHCP) server. While DHCP can make address assignment much easier and somewhat foolproof, rogue DHCP servers can be problematic, so use the ipconfig /all command to verify which DHCP server is assigning the IP address.

Figure 15-2: The host's IP address settings and those in the default gateway must match

Next, make sure the host's declared IP address is the right one by logging into its gateway router and entering the show arp command. You'll remember that ARP stands for Address Resolution Protocol, which maps a device's assigned IP (layer 3) address to its physical media access control (layer 2) address in order to handle the final stage of delivery between the gateway router and the host. Figure 15-3 shows the ARP table of our example gateway router. The shaded line shows that the Ethernet interface indeed has an address 10.1.13.12 on file, as was declared in the host's IP Address tab. The MAC address can also be verified using the ipconfig /all command.

Figure 15-3: The host's IP address must match the one for the gateway router in the ARP file

Another potential host problem is the config settings for the default gateway itself. In other words, you have to make sure the host has the correct IP address configured as its default gateway router, as shown in Figure 15-4.

Figure 15-4: Check to make sure the correct default gateway IP address is configured

The host's default gateway IP address must match the one set for the network interface module on the gateway router. To check that this is the case, go to the gateway router and enter the show interfaces command, as shown here:

 MyRouter# show interfaces
 . . .
 Ethernet1 is up, line protocol is up
   internet address is 10.1.13.1/28
   ip accounting output-packets
   ip nat inside
   ip ospf priority 255
   media-type 10BaseT
 . . .

As you can see, our example interface, Ethernet1, is indeed addressed 10.1.13.1, as declared in the host's Gateway tab.

This is also where you can check to make sure the host's declared subnet mask matches the one on file in the gateway router. The /28 mask is also correct, because it matches the one (255.255.255.240) declared in the host's IP Address tab.
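The same comparison can be done programmatically when many hosts need checking. The following is a minimal sketch in Python using the standard ipaddress module; the function name check_host_settings is hypothetical, and the addresses are the ones used in this example (host 10.1.13.12, mask 255.255.255.240, gateway interface 10.1.13.1/28).

```python
import ipaddress

def check_host_settings(host_ip, host_mask, gateway_cidr):
    """Return a list of mismatches between a host's settings and its gateway interface.

    check_host_settings is a hypothetical helper, not part of IOS or Windows.
    """
    problems = []
    gateway = ipaddress.ip_interface(gateway_cidr)
    host = ipaddress.ip_interface(f"{host_ip}/{host_mask}")
    # The host's mask must describe the same prefix length as the gateway's.
    if host.network.prefixlen != gateway.network.prefixlen:
        problems.append("subnet mask mismatch")
    # The host's address must fall inside the gateway interface's subnet.
    if host.ip not in gateway.network:
        problems.append("host is not on the gateway's subnet")
    return problems

print(check_host_settings("10.1.13.12", "255.255.255.240", "10.1.13.1/28"))
print(check_host_settings("10.1.13.12", "255.255.255.0", "10.1.13.1/28"))
```

An empty list means the host's address and mask agree with the gateway interface; any entries name the setting to correct on the host.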

Obviously, if any of the host's network settings are incorrect, the administrator should adjust them to match the gateway router's settings, reboot the PC, and try to make a network connection. On the other hand, if the PC's settings are okay, the troubleshooter must work outward from the operable host to identify the source of the problem.

Isolating Connectivity Problems

Most network problems have to do with the inability to connect to a desired host or service. Connectivity problems (also called "reachability problems") come in many forms, such as attempted HTTP connections timing out, attempted terminal connections getting no response from the host, and so on. As just outlined, the troubleshooter should first make sure the host reporting the problem is itself properly configured, and then work outward. To draw an analogy, the troubleshooter must work the neighborhood door to door, much like a cop searching for clues.

Checking Between the Host and Its Gateway Router

If the host's network settings are configured properly, the next step is to work outward from the host to the gateway router. This should be done even if the host's problem is failing to connect to a remote server. Before working far afield, the best practice is to first check the link between the host and its gateway router.

Using the ping Command The easiest way to check a link is to use the ping command. This command sends echo packets to a specific network device to see if it's reachable. In technical terms, ping sends its packets using the Internet Control Message Protocol (ICMP) rather than UDP or TCP. It actually sends several packets, as shown here:

 MyRouter# ping 10.1.1.100
 Type escape sequence to abort.
 Sending 5, 100-byte ICMP Echos to 10.1.1.100, timeout is 2 seconds:
 !!!!!
 Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/1 ms
 MyRouter#

Host computers and network devices both have ping commands. The preceding example was taken from a Cisco router, and the ping successfully reached the destination. But one could just as well use the ping command available in the command line of the host. We're using a Windows host for our examples, but other platforms (such as Macs, the various UNIX platforms, IBM's OS/400, and other proprietary server architectures) all have ping and other basic network commands built into their operating systems.

Usually, the first ping test from a host is the link to its gateway router. On a computer running Windows XP, check this by clicking the Start button in the menu bar and choosing All Programs | Accessories | Command Prompt to open a command prompt window. Then check to see if the gateway router is responding by entering the ping command, as shown in the following code snippet:

 Microsoft Windows XP [Version 5.1.2600]
 (C) Copyright 1985-2001 Microsoft Corp.
 C:\Documents and Settings\Tony>ping 10.1.13.1
 Pinging 10.1.13.1 with 32 bytes of data:
 Request timed out.
 Request timed out.
 Request timed out.
 Request timed out.
 Ping statistics for 10.1.13.1:
     Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),
 C:\Documents and Settings\Tony>

The preceding example shows that four ping packets were sent to the gateway router, which failed to respond. This tells the troubleshooter a few things:

The host PC's NIC is good; otherwise, the operating system would have generated an error message when the card failed to respond to the ping command.
The Ethernet LAN segment might be down, a condition often referred to as a "media problem." (The shared medium is apparently not working.)
The network interface module on the gateway router might be faulty.
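When the same test is run from many hosts, the loss figures can be pulled out of the ping report automatically. The following is a minimal sketch in Python, assuming the Windows-style statistics line shown in the preceding listing; the function name packet_loss is hypothetical.

```python
import re

def packet_loss(ping_output):
    """Extract (sent, received, lost) from the statistics line of Windows ping output.

    packet_loss is a hypothetical helper; it assumes the English-language
    "Sent = ..., Received = ..., Lost = ..." wording shown in the listing.
    """
    match = re.search(r"Sent = (\d+), Received = (\d+), Lost = (\d+)", ping_output)
    if match is None:
        raise ValueError("no ping statistics found")
    return tuple(int(n) for n in match.groups())

sample = """Pinging 10.1.13.1 with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Ping statistics for 10.1.13.1:
    Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),"""

sent, received, lost = packet_loss(sample)
print(f"{lost} of {sent} packets lost")
```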

If the host checks out okay, the troubleshooter must move outward. As mentioned, the investigation should start with the link to the gateway router.

Extended Ping As useful and utilitarian as the ping command is for troubleshooting, it does have its limits. When using ping, the source address of the ping is the IP address of the interface that the packet uses as it exits the router. If you need more precision out of your ping, you can upgrade to the extended ping command.

Extended ping performs a more advanced check of your system's ability to reach a particular host. This command works only at the privileged EXEC command line, whereas a regular ping command works in both user EXEC and privileged EXEC modes.

Usage of extended ping on Cisco routers is fairly straightforward. Simply enter ping at the command prompt, and then press Enter. You will be prompted with a number of conditions and variables. The default setting is enclosed in brackets. If you like the default, simply press Enter; otherwise, enter your preferred setting.

The following shows an example of an extended ping at work:

 Router# ping
 Protocol [ip]:
 Target IP address: 64.66.150.248
 Repeat count [5]: 100
 Datagram size [100]:
 Timeout in seconds [2]:
 Extended commands [n]:
 Sending 100, 100-byte ICMP Echos to 64.66.150.248, timeout is 2 seconds:
 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 Success rate is 100 percent (100/100), round-trip min/avg/max = 12/19/280 ms

Note

The ping command also exists in its own form in Windows and UNIX/Linux environments. Simply add the switches -s (UNIX/Linux) or -t (Windows) after the ping command.

Using the show interfaces Command To check whether the problem is the gateway router's interface or the LAN segment's medium, log into the gateway router, and enter the show interfaces command to obtain the following report:

 MyRouter# show interfaces
 Ethernet1 is up, line protocol is down
   Hardware is MyRouter, address is 0060.2fa3.fabd (bia 0060.2fa3.fabd)
   Internet address is 10.1.13.1/28
 . . .

In the preceding example, the router reports that the Ethernet1 network interface module is up but the line protocol is down. The term line protocol denotes both the cable into the router and the LAN protocol running over it. A line protocol reported as down probably indicates that the LAN segment's shared medium (a hub, an access switch, or a cable) is faulty. From there, you would physically check the medium to identify the hardware problem. (How to do that is covered later in this chapter in the section "Troubleshooting Cisco Hardware.")

Another potential condition could be that a network administrator has turned off the interface or the line, or both. This is routinely done while a piece of equipment is being repaired, upgraded, or replaced. Notifying IOS that a piece of equipment is down for maintenance avoids having needless error messages generated by the router. The following example shows the report when a network interface module is administratively down:

 MyRouter# show interfaces
 Ethernet1 is administratively down, line protocol is down
   Hardware is MyRouter, address is 0060.2fa3.fabd (bia 0060.2fa3.fabd)
   Internet address is 10.1.13.1/28
 . . .

Whether a piece of equipment is down by design or because of a malfunction, it still stops traffic. So it's important to know when a piece of equipment is being worked on in order to make sure an alternative path is available to handle traffic.

If both the gateway router interface and line protocol are up and running fine, the cause of the connectivity problem probably resides in a link to another network.
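The interpretations above can be summarized as a small lookup on the first line of the show interfaces report. The following is a minimal sketch in Python; the function name diagnose_interface and the wording of the returned hints are hypothetical, and the hint for an interface that is simply down is an assumption rather than something stated in the text.

```python
import re

def diagnose_interface(status_line):
    """Map a 'show interfaces' status line to a rough troubleshooting hint.

    diagnose_interface is a hypothetical helper for illustration only.
    """
    match = re.match(
        r"(\S+) is (administratively down|up|down), line protocol is (up|down)",
        status_line.strip(),
    )
    if match is None:
        raise ValueError("not an interface status line")
    name, interface_state, protocol_state = match.groups()
    if interface_state == "administratively down":
        return f"{name}: shut down by an administrator"
    if interface_state == "up" and protocol_state == "down":
        return f"{name}: interface up, check the LAN segment medium"
    if interface_state == "up" and protocol_state == "up":
        return f"{name}: interface and line protocol up, look at links to other networks"
    # Assumption: a plain "down" interface points at the interface hardware.
    return f"{name}: interface down, check the interface module"

print(diagnose_interface("Ethernet1 is up, line protocol is down"))
print(diagnose_interface("Ethernet1 is administratively down, line protocol is down"))
```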

Troubleshooting Problems Connecting to Other Networks

Things get a little more complicated beyond the home LAN segment. If the host can't connect beyond the gateway router, there are at once both more potential sources and more types of trouble to check out. What's meant by potential problem sources here is that many more hardware devices must be considered as potential causes of the reachability problem. What's meant by potential problem types is that such things as access lists, routing protocols, and other factors beyond hardware must now also be considered.

Using the trace Command to Pinpoint Trouble Spots Instead of pinging outward from the host one link at a time, the route between the host and the unreachable server can be analyzed all at once using the trace route command. In our example Windows host, do this by choosing Start | Run, and then type cmd to access the command prompt. Once there, enter the tracert command, Microsoft's version of the trace route command. The example in Figure 15-5 shows the route being traced from the host PC to http://www.PayrollServer.AcmeEnterprises.com, which is a fictional internal server several hops away. It's optional to use either the domain name or the IP address. Each line in the tracert command represents a hop along the path to the destination.

Figure 15-5: The trace route command is a great way to pinpoint the source of a problem

In TCP/IP internetworks, trace route commands work by sending three "trace" packets toward each router along the path and recording the echo response times. As with the ping command, the packets sent by the Windows tracert command use ICMP. What makes the technique work is the time-to-live (TTL) field in each packet's IP header, which is used to step outward from the host one hop at a time. Every router that forwards a packet decrements the TTL, and the packet dies when the counter hits zero; the router that discards it reports back to the sender. The trace route command uses this by sending the first trace packets with a TTL of 1, so that they reach only the nearest router, the next packets with a TTL of 2, so that they reach the next router, and so on. This process is repeated until the destination host is reached, if it's reachable. The network administrator can put a limit on how many hops the trace may take to automatically stop the process if the destination proves unreachable.
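The TTL stepping can be illustrated with a short simulation. The following is a minimal sketch in Python that sends no real packets; the function name trace_route, the three-hop topology, and the destination address 10.1.60.5 are hypothetical, while 10.1.13.1 and 10.1.49.12 are the routers used in this chapter's examples.

```python
def trace_route(path, destination, max_hops=30):
    """Simulate TTL-stepped probing along a fixed path (no packets are sent).

    path: ordered addresses a probe would cross (hypothetical topology).
    Returns (ttl, responder) pairs; responder is None when nothing came
    back, which a real trace shows as asterisks.
    """
    results = []
    for ttl in range(1, max_hops + 1):
        remaining = ttl
        responder = None
        for hop in path:
            if hop == destination:
                responder = hop   # the destination answers the probe itself
                break
            remaining -= 1        # each router decrements the TTL
            if remaining == 0:
                responder = hop   # TTL expired here, so this router reports back
                break
        results.append((ttl, responder))
        if responder == destination:
            break                 # stop once the destination has been reached
    return results

path = ["10.1.13.1", "10.1.49.12", "10.1.60.5"]
for ttl, responder in trace_route(path, "10.1.60.5"):
    print(ttl, responder)

# With the last hop missing, the trace runs until the hop limit is hit.
print(trace_route(path[:2], "10.1.60.5", max_hops=4))
```

The max_hops parameter plays the role of the administrator's hop limit: without it, a trace toward an unreachable destination would never stop.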

The ms readings are milliseconds, and you can see that nearby routers naturally tend to echo back faster. Under 10 ms is fast; anything over 100 ms or so is getting slow, but one must always adjust the timings according to how many hops removed the router is. As you can see, the router in the shaded line in Figure 15-5 is the likely suspect for the slow service because of its slow response times. The probable explanation is that the router's interface or the LAN segment attached to it is either congested or experiencing hardware faults. The next step would be to Telnet into router 10.1.49.12 (if possible) and diagnose the system, the involved network interface, and so on. If making a Telnet connection isn't possible, the troubleshooter must go in through the Console or AUX port, which, of course, requires that somebody be physically present at the device, unless a dial-in maintenance solution has been configured beforehand.

Sometimes, a trace route will locate a node that's stopping traffic altogether. An example of this is shown in Figure 15-6, where 10.1.49.12 now is dropping trace packets instead of merely returning them slowly. The asterisks indicate a null timing result because nothing came back, and the message "request timed out" is inserted. Take note that this does not necessarily mean the entire router is down. It could be that only the network interface or LAN segment connecting the suspect router is down, or that the router is configured not to respond to pings.

Figure 15-6: Here's what happens if a traced route finds a router stopping traffic

If possible, first try to Telnet into the router through one of its other interfaces. If this doesn't work, the next move depends on the router's proximity. If it's nearby, go to it and log in through the Console or AUX port. If it's remote, you should contact the person responsible for dealing with it and walk that person through the diagnostic steps.

Note

Troubleshooting almost always takes place within the enterprise's internetwork. This is because the network team can control events only within its autonomous system. The trace route command is a good example of this. If you traced a route through the Internet (say, to troubleshoot a VPN connection), many lines between your gateway router and the destination node will return asterisks instead of timings and "request timed out" messages instead of IP addresses. This is because almost all edge routers are configured by their network teams not to respond to trace routes. This is done as a security precaution. The point here is to highlight the trade-off a VPN must incur: Loss of control is exchanged for very low-cost WAN links; you generally can't troubleshoot somebody else's network.

Using the show interfaces Command Once the suspect network interface module has been identified, the troubleshooter must diagnose what's causing the problem. The best way to do that is to run the show interfaces command and review the latest statistics on the interface's operations. Remember, this information not only reflects on the interface module itself, but also gives a rich set of clues as to what's happening out on the network.

An example show interfaces report is given in Figure 15-7. Don't let its size and cryptic terminology intimidate you. There is indeed a lot of information in it, but nothing that takes a rocket scientist to understand.

Figure 15-7: The show interfaces command is one of the troubleshooter's best tools

This report is a snapshot of the interface at a particular instant in time. To check for trends, the troubleshooter must run the show interfaces command intermittently to look for changes. The interface is identified by private IP address 10.1.49.12/28. Remember, usually only devices on the edge of an autonomous system (firewalls, Web servers, FTP servers, and the like) use public Internet addresses. The /28 notation lets other routers know that LAN segments attached to RemoteRouter are subnetted using the 255.255.255.240 subnet mask. The notation uses 28 because the 255.255.255.240 mask has 28 bits available for network addressing (as opposed to hosts). As mentioned earlier, mismatched subnets often cause problems.
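The relationship between the dotted-decimal mask and the /28 notation can be verified by counting bits. The following is a minimal sketch in Python; the function name prefix_length is hypothetical.

```python
def prefix_length(mask):
    """Count the network bits in a dotted-decimal subnet mask.

    prefix_length is a hypothetical helper for illustration.
    """
    bits = "".join(f"{int(octet):08b}" for octet in mask.split("."))
    # A valid mask is a run of 1s followed by a run of 0s, so a 0 followed
    # by a 1 anywhere in the string means the mask is malformed.
    if "01" in bits:
        raise ValueError("mask bits must be contiguous")
    return bits.count("1")

prefix = prefix_length("255.255.255.240")
host_bits = 32 - prefix
# Two addresses per subnet are reserved (network and broadcast).
print(prefix, 2 ** host_bits - 2)  # prints: 28 14
```

With 28 network bits, only 4 bits remain for hosts, which is why each such LAN segment holds at most 14 usable host addresses.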

The first thing to look at is the seventh line of the show interfaces report that reads "Last clearing of show interfaces counters never" (highlighted in Figure 15-7). The example states that nobody has reset the report's counters to zero since the last time the router was rebooted. The length of time since the statistics were last cleared is important, because most of the statistics are absolute numbers, not relative values, such as percentages. In other words, the longer IOS has been compiling the totals, the less weight the statistics should be given. For example, ten lost carriers in a day is a lot, but the same total over six months is not. To see when the last reboot was, use the show version command, as shown here:

 RemoteRouter# show version
 Cisco Internetwork Operating System Software
 IOS (tm) 4500 Software (C4500-IS-M), Version 11.2(17), RELEASE SOFTWARE (fc1)
 Copyright (c) 1986-1999 by Cisco Systems, Inc.
 Compiled Mon 04-Jan-99 18:18 by etlevynot
 Image text-base: 0x600088A0, data-base: 0x60604000
 ROM: System Bootstrap, Version 5.3(10) [tamb 10], RELEASE SOFTWARE (fc1)
 BOOTFLASH: 4500 Bootstrap Software (C4500-BOOT-M), Version 10.3(10), RELEASE SOFTWARE (fc1)
 RemoteRouter uptime is 2 weeks, 3 days, 13 hours, 32 minutes
 System restarted by power-on
 . . .

The uptime line in the preceding example shows that the router has been up for about two and a half weeks. Knowing this lets the troubleshooter more accurately judge whether certain error types are normal or excessive.
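Because the counters are absolute totals, it can help to convert them into a rate over the reporting period. The following is a minimal sketch in Python, assuming the uptime wording shown in the preceding show version listing; the helper names uptime_days and events_per_day are hypothetical, and "six months" is approximated as 182 days.

```python
import re

# Hours in each unit that can appear in the uptime line (a year taken as 365 days).
HOURS_PER_UNIT = {"year": 8760, "week": 168, "day": 24, "hour": 1, "minute": 1 / 60}

def uptime_days(uptime_line):
    """Convert an 'uptime is ...' line from show version into days (hypothetical helper)."""
    pairs = re.findall(r"(\d+) (year|week|day|hour|minute)s?\b", uptime_line)
    hours = sum(int(count) * HOURS_PER_UNIT[unit] for count, unit in pairs)
    return hours / 24

def events_per_day(count, days):
    """Turn an absolute counter into an average daily rate."""
    return count / days

days_up = uptime_days("RemoteRouter uptime is 2 weeks, 3 days, 13 hours, 32 minutes")
print(f"uptime: {days_up:.1f} days")
print(f"10 lost carriers in one day: {events_per_day(10, 1):.2f} per day")
print(f"10 lost carriers in six months: {events_per_day(10, 182):.2f} per day")
```

Comparing daily rates rather than raw totals makes it easier to see why the same count can be alarming over one day and unremarkable over six months.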

Note

Historically, there have been a wide range of interoperability issues identified between different versions of IOS. When troubleshooting, one should make note of the IOS versions running in the environment and assess the impact of running different versions.

The exception to this sampling window is the two lines sitting in the middle of Figure 15-7. These report input and output to the interface over the five minutes prior to the report having been run. A troubleshooter trying to discern a trend in traffic patterns would periodically generate the show interfaces report and look at these numbers.

What constitutes excessive differs from one statistic to another. For example, Ethernet arbitrates media access control by collisions, so it's normal for them to occur to some degree in a shared media environment (one with a hub, for example). The count of 3,421 collisions in Figure 15-7 is okay for a period of two weeks or so, but a figure of 50,000 would indicate congested bandwidth. Broadcast packets are also normal, because they perform positive functions, such as alerting routers of topology changes and providing other useful updates (again, within limits). There are over one and a half million in Figure 15-7, which might be excessive. However, what's considered excessive is subject to so many variables that it must be left to the judgment of the troubleshooter. That's where experience comes into play. For a properly configured switched environment, there should be almost no collisions on a single port. If you are seeing collisions, it's quite possible that you have a speed or duplex mismatch.

Many statistics should ideally be low, or even at zero (depending on the time period reported). For example, runts and giants are malformed packets sometimes caused by a poorly functioning network interface card or an improperly configured VLAN. In a WAN link, lost carrier events probably indicate a dirty line or a failing telecommunications component.

Table 15-4 defines many of the items reported using the show interfaces command. Knowing the items will help you understand how they can be used to diagnose problems.

Table 15-4: Definitions of Useful Ethernet Statistics
Statistic	Explanation
Five-minute rates (input or output)	The average number of bits and packets passing through the interface each second, as sampled over the last five-minute interval.
Aborts	Sudden termination of a message transmission's packets.
Buffer failures	Packets discarded for lack of available router buffer memory.
BW	Bandwidth of the interface in kilobits per second (Kbps). This can be used as a routing protocol metric.
Bytes	Total number of bytes transmitted through the interface.
Carrier transitions	A carrier is the electromagnetic signal modulated by data transmissions over serial lines (like the sound your modem makes). Carrier transitions are events where the signal is interrupted, often caused when the remote NIC resets.
Collisions	The number of messages retransmitted due to an Ethernet collision.
CRC	Cyclic redundancy check, a common technique for detecting transmission errors. CRC works by treating a frame's contents as one large binary number, dividing it by a fixed, predetermined divisor, and comparing the remainder with that stored in the frame by the sending node.
DLY	Delay of the interface's response time, measured in microseconds (μs), not milliseconds (ms).
Dribble conditions	Frames that are slightly too long, but are still processed by the interface.
Drops	The number of packets discarded for lack of space in the queue.
Encapsulation	The encapsulation method assigned to an interface (if any). Works by wrapping data in the header of a protocol to "tunnel" otherwise incompatible data through a foreign network. For example, Cisco's Inter-Switch Link (ISL) encapsulates frames from many protocols.
Errors (input or output)	A condition in which it is discovered that a transmission does not match what's expected, usually having to do with the size of a frame or packet. Errors are detected using various techniques such as CRC.
Frame	The number of packets having a CRC error and a partial frame size. Usually indicates a malfunctioning Ethernet device.
Giants	Packets larger than the LAN technology's maximum packet size-more than 1,518 bytes in Ethernet networks. All giant packets are discarded.
Ignored	Number of packets discarded by the interface for lack of available interface buffer memory (as opposed to router buffer memory).
Interface resets	When the interface clears itself of all packets and starts anew. Resets usually occur when it takes too long for expected packets to be transmitted by the sending node.
Keepalives	Messages sent by one network device to another to notify it that the virtual circuit between them is still active.
Last input or output	Hours, minutes, and seconds since the last packet was successfully transmitted or received by the interface. A good tool for determining when the trouble started.
Load	The load on the interface as a fraction of the number 255. For example, 64/255 is a 25 percent load. This counter can be used as a routing protocol metric.
Loopback	Whether loopback is set to on. Loopback is where signals are sent from the interface and then directed back toward it from some point along the communications path; used to test the link's usability.
MTU	The maximum transmission unit for packets passing through the interface, expressed in bytes.
Output hang	How long since the interface was last reset. Takes its name from the fact that the interface "hangs" because a transmission takes too long.
Overruns	The number of times the interface's receiver hardware could not hand received data to a hardware buffer because the input rate exceeded its ability to handle the data. It takes its name from the fact that the sender "overran" the router interface.
Queues (input and output)	Number of packets in the queue. The number behind the slash is the queue's maximum size.
Queuing strategy	FIFO stands for "first in, first out," which means the router handles packets in that order. LIFO stands for "last in, first out." FIFO is the default.
Rely	The reliability of the interface as a fraction of the number 255. For example, 255/255 is 100-percent reliability. This counter can be used as a routing protocol metric.
Runts	Packets smaller than the LAN technology's minimum packet size-fewer than 64 bytes in Ethernet networks. All runt packets are discarded.
Throttles	The number of times the interface advised a sending NIC that it was being overwhelmed by packets being sent and to slow the pace of delivery. It takes its name from the fact that the interface asks the NIC to "throttle" back.
Underruns	The number of times the interface's transmitter ran faster than the router could supply it with data. Takes its name from the fact that the router "underran" the transmitter.
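The CRC entry in Table 15-4 can be made concrete with a short sketch. This is only an illustration written in Python, not something IOS runs. It uses the standard library's zlib.crc32 function (a 32-bit CRC) to stand in for the check value a sending node stores in a frame. The frame contents and the helper names are made up for the example.

```python
import zlib


def check_value(contents: bytes) -> int:
    """Compute a 32-bit CRC over the frame contents."""
    return zlib.crc32(contents) & 0xFFFFFFFF


def arrived_intact(contents: bytes, stored_value: int) -> bool:
    """Recompute the CRC on receipt and compare it with the stored value."""
    return check_value(contents) == stored_value


# Hypothetical frame contents; the sender computes and stores the CRC.
frame = b"example frame contents"
stored = check_value(frame)

# Flip a single bit to imitate line noise corrupting the frame in transit.
damaged = bytearray(frame)
damaged[0] ^= 0x01

print(arrived_intact(frame, stored))           # undamaged frame passes
print(arrived_intact(bytes(damaged), stored))  # damaged frame is counted as a CRC error
```

A receiving interface that finds a mismatch like the second case would add to its CRC counter and discard the frame.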

Now that we've been introduced to the various statistics compiled in the show interfaces report, let's review how to read it. Figure 15-8 shows the Ethernet statistics portion of the report, this time with some of the more important variables highlighted. These are the variables an experienced network administrator would scan first for clues.

Figure 15-8: Each interface item likely has reasons for its statistic being high

More often than not, connectivity problems are caused by some type of configuration problem, not by a piece of failing equipment. Depending on the Ethernet statistic that is high, the interface may be overwhelmed by incoming traffic, have insufficient queue size configured, have insufficient buffer memory, or be mismatched with the speed of a network sending input.

Checking Access Lists for Proper Configuration

The classic example of a device malfunctioning even though its hardware is running fine is the misconfigured access list. You'll recall that access lists are used to restrict what traffic may pass through a router's interface, thereby cutting off access to the LAN segment attached to it. The access list does this by inspecting for source and destination IP addresses-a way of controlling who may go where. The extended access list also uses port numbers to further restrict which applications may be run once you're admitted. Indeed, access lists are the most rudimentary form of internetwork security, used as a kind of internal firewall. Not all use of access lists has to do with security; sometimes they're used to steer traffic along certain routes in order to "shape" traffic to best fit the internetwork's resources.

The first step in checking for access list problems is to determine whether a suspect router-or a suspect interface on a router-is even configured with an access list. To find this out, log into the router and enter the show access-lists command to see which access lists are configured:

 RemoteRouter# show access-lists
 Extended IP access list 100
     deny ip any host 206.107.120.17
     permit ip any any (5308829 matches)
 Extended IP access list 101
     permit tcp any host 209.98.208.33 established
     permit udp host 209.98.98.98 host 209.98.208.59
     permit icmp any host 209.98.208.59 echo-reply
     permit tcp any host 209.98.208.59 eq smtp
     permit tcp any host 209.98.208.59 eq pop3
     permit tcp any host 209.98.208.59 eq 65
     permit tcp any host 209.98.208.59 eq telnet
     permit tcp host 209.98.208.34 host 209.98.208.59
     permit tcp any 209.98.208.32 0.0.0.15 established
     permit icmp any 209.98.208.32 0.0.0.15 echo-reply
 . . .

Looking at the preceding example, access list 100 explicitly denies traffic to a certain IP address. This is frequently done to stop outbound traffic to a known undesirable IP address or some other type of router that could allow hackers a crack at the enterprise's edge router. Access list 101 is more sophisticated, with a series of permit rules to control which applications may be used between hosts. The application's TCP or UDP port is defined behind each eq modifier, such as eq smtp for e-mail or eq 65 for the TACACS database service. (Certain ports can be identified by a keyword; others must be identified by a number.) Also note that access list 101 has only permit rules. This is possible because if a packet's request for service isn't explicitly permitted, it will be denied by the "implicit deny" rule when it reaches the bottom of the access list.
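The last two rules of access list 101 pair the address 209.98.208.32 with the value 0.0.0.15. That second value is a wildcard mask: bits set to 1 are ignored during the comparison, and bits set to 0 must match. The following Python sketch shows the idea. The function name is our own, and the code only imitates the matching logic; it is not how IOS implements it.

```python
import ipaddress


def wildcard_match(address: str, base: str, wildcard: str) -> bool:
    """Return True if address matches base once the wildcard bits are ignored."""
    addr_bits = int(ipaddress.IPv4Address(address))
    base_bits = int(ipaddress.IPv4Address(base))
    ignore = int(ipaddress.IPv4Address(wildcard))
    care = ~ignore & 0xFFFFFFFF  # bits that must be identical
    return (addr_bits & care) == (base_bits & care)


# 209.98.208.32 with wildcard 0.0.0.15 covers 209.98.208.32 through 209.98.208.47.
for host in ("209.98.208.34", "209.98.208.47", "209.98.208.48", "209.98.208.59"):
    print(host, wildcard_match(host, "209.98.208.32", "0.0.0.15"))
```

Run against the listing, this shows that 209.98.208.34 falls inside the range covered by those two rules, while 209.98.208.59 does not and is governed only by the host rules above them.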

It could be that an inadvertent deny rule, a missing permit rule, or a simple typo is causing the problem. The troubleshooter would scan the access lists for any rules that might be causing the problem at hand. For example, if a person can't connect to the mail server, the troubleshooter would look for statements containing eq smtp or the mail server's IP address. The next step would be to go to the interface connecting the network experiencing the problem to see if the access-group command was used to apply the questionable access list to it. To do this, enter privileged EXEC mode and display the running configuration for the interface in question, as shown here:

 MyRouter> enable
 Password:
 MyRouter# show running-config
 . . .
 interface Ethernet1
  ip address 10.1.13.1 255.255.255.240
  ip access-group 100 in
  ip access-group 101 out
 . . .

If the questionable access list is in force, double-check that the access list is being applied in the correct direction for the interface. If that checks out, temporarily disable it to see if traffic can pass the router without it. There are two access lists in our example, so we would disable them both to see if the problem is being caused by access lists. Disable access lists on the interface as follows:

 MyRouter# config terminal
 MyRouter(config)# interface ethernet1
 MyRouter(config-if)# no ip access-group 100 in
 MyRouter(config-if)# no ip access-group 101 out

In case you forgot, the in modifier at the end of an access-group statement applies the access list to inbound packets only, and the out modifier applies it to outbound packets only. A single access-group statement covers one direction, so filtering both inbound and outbound traffic takes two statements, as in our example.

Once the access lists are disabled, attempt to make the connection between the host and the server reported as nonresponding. If the traffic goes through with the access lists disabled, then a statement somewhere in one of the access lists is probably the cause. The next step is to see which list contains the problem by reenabling one of the two. Access list 101, with all its rules, is the most likely culprit. To find out if this is the case, put it back into force with the following command:

 MyRouter(config-if)# ip access-group 101 out

Now try to connect to the server again. If the problem has returned, you've established that the problem resides somewhere inside access list 101.

To debug the access list, carefully review it to find the offending rule. It could be a misplaced deny rule, but a missing TCP or UDP port in a permit rule could also be the problem. Don't forget to check for any typos in your ACL. A simple mistyped IP address can easily be the source of your problem.

Remember, each extended access list rule must declare the protocol to which it applies, such as IP, TCP, UDP, or ICMP. Most often, however, offending application ports are the source of the problem, simply because there are so many of them and the network applications being used change frequently. For example, if users are having a problem making a connection to a Web server, look to make sure that HTTP port number 80 is permitted between the host and server addresses.

It's also possible that the traffic is being denied before getting to the permit rule designed to let it through. Remember that access control lists read from the top down until a match is found. If this is the case, the sequence in which rules are listed should be adjusted accordingly by putting the priority rules nearer the top.
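The top-down, first-match behavior and the implicit deny can be sketched in a few lines of Python. This is a simplified model under our own assumptions: it matches only on protocol, destination host, and destination port, the Rule type and evaluate function are invented for the illustration, and the two rule lists are hypothetical rather than taken from a real router.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    action: str               # "permit" or "deny"
    protocol: str             # "ip" matches every protocol
    dst_host: Optional[str]   # None stands for "any"
    dst_port: Optional[int]   # None stands for "any port"


def evaluate(rules, protocol, dst_host, dst_port):
    """Return the action of the first matching rule, reading from the top down."""
    for rule in rules:
        if rule.protocol not in ("ip", protocol):
            continue
        if rule.dst_host is not None and rule.dst_host != dst_host:
            continue
        if rule.dst_port is not None and rule.dst_port != dst_port:
            continue
        return rule.action
    return "deny"  # the implicit deny at the bottom of every list


# Hypothetical list with the rules in the wrong order: the deny hides the permit.
misordered = [
    Rule("deny", "ip", "209.98.208.59", None),
    Rule("permit", "tcp", "209.98.208.59", 25),
]
# The same rules with the priority rule moved nearer the top.
reordered = [misordered[1], misordered[0]]

print(evaluate(misordered, "tcp", "209.98.208.59", 25))
print(evaluate(reordered, "tcp", "209.98.208.59", 25))
print(evaluate(reordered, "tcp", "209.98.208.33", 80))
```

The first call is denied because the broad deny rule is reached before the SMTP permit; after reordering, the same mail traffic is permitted. The third call matches nothing and falls through to the implicit deny.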

Redirecting Traffic from Congested Areas

Sometimes traffic becomes congested in a particular router. This could be the result of new hosts having been added in the area, new network applications coming online, or other causes. When this happens, log into the congested router and enter the show ip traffic command to generate the following report:

 MyRouter# show ip traffic
 IP statistics:
   Rcvd:  7596385 total, 477543 local destination
          0 format errors, 0 checksum errors, 96 bad hop count
          0 unknown protocol, 1 not a gateway
          0 security failures, 0 bad options, 0 with options
   Opts:  0 end, 0 nop, 0 basic security, 0 loose source route
          0 timestamp, 0 extended security, 0 record route
          0 stream ID, 0 strict source route, 0 alert, 0 cipso
          0 other
   Frags: 0 reassembled, 0 timeouts, 0 couldn't reassemble
          0 fragmented, 0 couldn't fragment
   Bcast: 53238 received, 280 sent
   Mcast: 205899 received, 521886 sent
   Sent:  738759 generated, 6113405 forwarded
          13355 encapsulation failed, 374852 no route
 . . .

In addition to reporting IP traffic, the show ip traffic command reports traffic generated by transport protocols, routing protocols, ARP translation requests, and even packet errors. The report also breaks out broadcast and multicast messages. It's a quick way to understand the loads being put on a router and what options you might have to lighten the load.
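When a report like this has been captured to a text file, the counters can be pulled out and compared with a few lines of Python. The sketch below is one possible approach and not a Cisco tool. It works on a handful of lines copied from the report above, and the counters function and its simple pattern are our own assumptions, which would need adjusting for other sections of the report.

```python
import re

# A few lines copied from the show ip traffic report above.
REPORT = """\
  Rcvd:  7596385 total, 477543 local destination
  Bcast: 53238 received, 280 sent
  Mcast: 205899 received, 521886 sent
  Sent:  738759 generated, 6113405 forwarded
         13355 encapsulation failed, 374852 no route
"""


def counters(report: str) -> dict:
    """Map (section, label) pairs to their integer counts."""
    found = {}
    section = None
    for line in report.splitlines():
        heading = re.match(r"\s*(\w+):", line)
        if heading:
            section = heading.group(1)
            line = line[heading.end():]
        # Each counter looks like "<number> <label>" and ends at a comma or line end.
        for value, label in re.findall(r"(\d+) ([a-z' ]+?)(?:,|$)", line):
            found[(section, label.strip())] = int(value)
    return found


stats = counters(REPORT)
received = stats[("Rcvd", "total")]
broadcast_share = stats[("Bcast", "received")] / received
no_route_share = stats[("Sent", "no route")] / received
print(f"broadcasts: {broadcast_share:.1%} of received packets")
print(f"no route:   {no_route_share:.1%} of received packets")
```

Expressing a counter as a share of total traffic makes it easier to judge whether it is excessive than reading the raw number alone.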

For example, if broadcast traffic seems excessive, you might look into tightening restrictions in the access lists governing the surrounding routers. But, if it appears that all or most of the heavy traffic is legitimate, traffic affecting neighboring routers should also be analyzed. If there is an inequity in loads between routers of similar power, perhaps load balancing is in order. In most cases, this makes more sense than buying more powerful hardware.

One way to balance traffic loads between routers is to log into the congested router and enter config-router mode by calling up the routing protocols; then set individual distance metrics for each router to steer traffic away from the congested router to its less congested neighbor.

Troubleshooting WAN Links

Troubleshooting WAN problems entails using a slightly different set of tools. This is because most connections into WAN links must go through a serial line. To refresh on the subject, a serial line connects a CSU/DSU unit to a router. Telephone networks don't transmit signals using a data-link layer (layer 2) network technology such as Ethernet. Routers aren't telephone switches, so the transitions between the two technologies must somehow be made. The CSU/DSU-to-serial-line interface gives the router signals it can understand.

Note

A CSU/DSU is like a modem, but it works with digital lines instead of analog ones. CSU stands for channel service unit, an interface connecting to a local digital telephone line, such as a T1 (instead of a modem connecting to an analog phone line). DSU stands for data service unit, a device that adapts to the customer end of the connection, usually into a router or LAN switch.

Serial links have an obvious importance because they extend internetworks beyond the office campus to remote locations. A remote link of any size requires using a digital telephone circuit of some kind, ranging from a fractional T1 up to a full T3 (DS3) line.


A serial line provides a window through which its entire WAN link can be diagnosed. In other words, not only can you analyze the serial line and its interfaces, but by looking at the traffic it carries, you can also diagnose the digital phone loop and, to some extent, what's happening at the remote end of the link.

Differences in the show interfaces serial Report

Cisco provides a special tool for troubleshooting serial links in the show interfaces serial command. It's largely the same as the normal show interfaces command, but with some important differences, as highlighted in Figure 15-9. Specifically, it shows information for the serial port.

Figure 15-9: Most WAN links still use serial lines to connect routers to phone loops

One way serial links differ is the type of encapsulation used over digital telephone loops. The High-Level Data-Link Control (HDLC) encapsulation protocol is indicated in the top shaded box in Figure 15-9. Encapsulation is necessary to carry packets over the digital telephone link. Sometimes, encapsulation may have been inadvertently turned off, so the Encapsulation field should be checked.

Another difference is that conversations (sessions) are reported in the show interfaces serial interface_number report. WAN links have less bandwidth than local shared media. To wit, a T1 (DS1) circuit has a data rate of 1.544 Mbps, and a T3 (DS3) has a rate of 45 Mbps. Most enterprises use fractional T1 or T3 by purchasing channels within them (T1 has 24 channels; T3 has 672). WAN bandwidth, therefore, is limited compared to, say, a 100-Mbps LAN segment, and sometimes a particular user session takes more than its share. Therefore, when troubleshooting a WAN link, it helps to know how many conversations are going on. In case you're wondering, the Reserved Conversation field has to do with the Resource Reservation Protocol (dubbed RSVP). RSVP is an industry standard designed for use in QoS (Quality of Service) tools to help guarantee service levels.
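The channel arithmetic behind these figures is easy to check. The short Python sketch below assumes the standard values of 64 Kbps per channel (a DS0) and 8 Kbps of framing overhead on a T1. The constant and function names are our own.

```python
DS0_KBPS = 64           # one voice-grade channel
T1_CHANNELS = 24
T1_FRAMING_KBPS = 8     # framing overhead carried on a full T1
T3_CHANNELS = 672


def fractional_kbps(channels: int) -> int:
    """Payload bandwidth of a fractional circuit built from whole channels."""
    return channels * DS0_KBPS


def t1_line_rate_kbps() -> int:
    """Full T1 line rate: 24 channels plus framing."""
    return fractional_kbps(T1_CHANNELS) + T1_FRAMING_KBPS


print(t1_line_rate_kbps())           # 1544 Kbps, the 1.544 Mbps quoted above
print(fractional_kbps(6))            # a six-channel fractional T1
print(T3_CHANNELS // T1_CHANNELS)    # number of T1s' worth of channels in a T3
```

A six-channel fractional T1, for example, offers 384 Kbps, which is a small pipe next to a 100-Mbps LAN segment and explains why a single greedy session can matter.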

The box at the bottom of the figure shows a third difference in the show interfaces serial interface_number report. These five fields are the same as the blinking lights you may have noticed on external modems. For example, DTR stands for Data Terminal Ready, an EIA/TIA-232 (née RS-232) circuit that is activated to notify the data communications equipment at the other end that the host is ready to send and receive data. DCD stands for Data Carrier Detect, which is important because it senses the actual carrier signal (the modem noise you hear when making a modem connection). The five modem circuits are included in the show interfaces serial interface_number report for troubleshooting serial links that run over analog/modem lines instead of digital lines.

Key Diagnostic Fields in the show interfaces serial Report

Serial links differ by nature from LAN segments, so diagnosing them takes a different focus. Certain things that are, to some extent, taken for granted in LAN segment links are often the cause of performance problems or even failures in serial links. Figure 15-10 highlights the items that troubleshooters look at first in a serial interface.

Figure 15-10: Certain fields are usually the focus when troubleshooting a serial link

As you can see, troubleshooting serial links emphasizes looking at errors and line activity. This is natural, given that the middle part of a WAN link-the telephone circuit-is basically invisible to networking equipment.

Looking at Figure 15-10, we see a case in which input traffic seems to be going okay, but a lot of output packets are being dropped. Given that the serial line is being pushed hard, running at about 80 percent of available bandwidth, we can conclude that the drops are being caused by overuse, not by faulty hardware in the link.
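Load and reliability are reported as fractions of 255, so it can help to convert them to percentages when judging how hard a line is being pushed. The following Python sketch does the conversion. The function name is our own, and the value 204/255 is a hypothetical reading chosen because it works out to 80 percent; it is not copied from the figure.

```python
def fraction_to_percent(field: str) -> float:
    """Convert a show interfaces value such as '64/255' to a percentage."""
    numerator, denominator = field.split("/")
    return 100.0 * int(numerator) / int(denominator)


for reading in ("64/255", "204/255", "255/255"):
    print(reading, f"{fraction_to_percent(reading):.1f} percent")
```

The same conversion applies to the Rely field, where 255/255 means 100-percent reliability.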

The first line of output in Figure 15-10 can also help your troubleshooting efforts. In the show interfaces serial display, the first line will give one of several status indications. Ideally, as Figure 15-10 shows, you want this line to read "Serial x is up, line protocol is up." However, if there is a problem, that status might be one of the following:

Serial x is down, line protocol is down. This is an indication that the router is not sensing a signal from the WAN connection, there is a problem with the cabling, or even a problem at the telephone company.
Serial x is up, line protocol is down. This is an indication that a local or remote router has been misconfigured, that keepalives are not being transmitted by the remote router, or that a local or remote channel service unit/data service unit (CSU/DSU) has failed.
Serial x is up, line protocol is up (looped). This is an indication that there is a loop in the circuit.
Serial x is up, line protocol is down (disabled). This is an indication that there is a high error rate because of a problem with the telephone carrier, that a channel service unit/data service unit (CSU/DSU) is experiencing a problem, or that the router interface is faulty.
Serial x is administratively down, line protocol is down. This is an indication that the router configuration includes the shutdown command or that a duplicate IP address exists.
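The status combinations listed above lend themselves to a simple lookup. The Python sketch below parses a status line and returns the likely causes as summarized in this section. The pattern, the table, and the interpret function are our own illustration, and the interface names in the example are made up.

```python
import re

STATUS = re.compile(
    r"^(?P<name>\S+) is (?P<intf>administratively down|up|down), "
    r"line protocol is (?P<proto>up|down)"
    r"(?: \((?P<note>looped|disabled)\))?\.?$"
)

# Likely causes, keyed by (interface state, line protocol state, qualifier).
MEANINGS = {
    ("up", "up", None): "normal operation",
    ("down", "down", None): "no signal from the WAN connection, cabling problem, "
                            "or telephone company problem",
    ("up", "down", None): "router misconfigured, no keepalives from the remote "
                          "router, or failed CSU/DSU",
    ("up", "up", "looped"): "loop in the circuit",
    ("up", "down", "disabled"): "high error rate: carrier problem, CSU/DSU problem, "
                                "or faulty router interface",
    ("administratively down", "down", None): "shutdown command in the configuration "
                                             "or duplicate IP address",
}


def interpret(line: str) -> str:
    """Translate the first line of show interfaces serial into likely causes."""
    match = STATUS.match(line.strip())
    if match is None:
        return "unrecognized status line"
    key = (match.group("intf"), match.group("proto"), match.group("note"))
    return MEANINGS.get(key, "combination not covered in this section")


print(interpret("Serial0 is up, line protocol is down"))
print(interpret("Serial1 is administratively down, line protocol is down"))
```

A lookup like this is handy when scanning status lines collected from many routers, because it points straight at the short list of causes worth checking first.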

Troubleshooting Serial-Line Input Errors

One of the most common causes of serial-line problems is input errors-in other words, data inbound from the remote site. Probable causes of serial-line input errors, with suggested actions, are outlined in Table 15-5.

Table 15-5: Input Error Causes and Actions
Input Error Symptoms	Probable Causes and Suggested Actions
Input errors along with CRC or frame errors	A dirty line, where electrical noise interferes with the data signal. Serial cable exceeds maximum length specified for the type of phone circuit. Serial cable is unshielded. The phone circuit itself may be malfunctioning.
	Actions: Reduce cable length. Install shielded cable. Check phone loop with a line analyzer.
	Clocking jitter in line where data signal varies from reference timing positions, or clocking skew where device clocks are set differently.
	Actions: Make sure all devices are configured to use a common-line clock.
Input errors along with aborts	The transfer of a packet terminated in midtransmission. Usually caused by an interface reset on the router being analyzed. Can also be caused by a reset on the remote router, a bad phone circuit, or a bad CSU/DSU.
	Action: Check local hardware, then remote hardware. Replace faulty equipment.

Troubleshooting Serial-Line Input and Output Errors

Another clue to serial-line problems is an increase in dropped packets at the interface. A drop occurs when too many packets are being processed in the system and insufficient buffer memory is available to handle the packet. This applies to both input and output drops, as outlined in Table 15-6.

Table 15-6: Dropped Packet Causes and Actions
Packet Drop Symptoms	Probable Causes and Suggested Actions
Increase in dropped input packets	Input drops usually occur when traffic is being routed from a local interface (Ethernet, Token Ring, FDDI) that is faster than the serial interface. The problem usually emerges during periods of high traffic.
	Actions: Increase the interface's input hold queue size in the router's config file.
Increase in dropped output packets	Output drops happen when no system buffer is available at the time the router is attempting to hand the packet off to the transmit buffer during high traffic.
	Actions: Increase the interface's output hold queue size. Turn off fast switching. Implement priority queuing.

Drops taking place in one direction but not the other (input versus output) can point the troubleshooter toward the problem's source. If they're happening both ways, the router or its serial interface is probably the culprit.

Troubleshooting Serial Links

Most of us have used modems long enough to know that sometimes an established connection can falter, or even be broken. This goes for serial lines, too, usually because of interface resets or carrier transitions, as outlined in Table 15-7.

Table 15-7: Serial Line Error Causes and Actions
Line Error Symptoms	Probable Causes and Suggested Actions
Increasing carrier transitions	Interruption in the carrier signal. Usually due to interface resets at the remote end of the link. Resets can be caused by external sources such as electrical storms, T1 or T3 overuse alerts, or faulty hardware.
	Actions: Use breakout box or serial analyzer to check hardware at both ends. Then check router hardware. Replace faulty hardware as necessary. No action required if problem was due to external cause.
Increasing interface resets	Interface resets result from missed keepalive messages. They usually result from carrier transitions, lack of buffer, or a problem with CSU/DSU hardware. Coincidence with increased carrier transitions or input errors indicates a bad link or bad CSU/DSU hardware.
	Actions: Use breakout box or serial analyzer to check hardware at both ends. Contact leased-line vendor if hardware is okay.
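Table 15-7 speaks of counters that are increasing, which means comparing two readings taken some time apart. The Python sketch below shows one way to do that. The counter values are invented for the example, and the function names and the wording of the hints are our own, drawn from the table's descriptions.

```python
def increases(before: dict, after: dict) -> dict:
    """Return the counters that grew between two snapshots, with the growth."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value > before.get(name, 0)
    }


def line_error_hint(grown: dict) -> str:
    """Apply the reasoning of Table 15-7 to the counters that increased."""
    resets = "interface resets" in grown
    related = "carrier transitions" in grown or "input errors" in grown
    if resets and related:
        return "bad link or bad CSU/DSU hardware"
    if resets:
        return "missed keepalives; check for lack of buffer or CSU/DSU problems"
    if "carrier transitions" in grown:
        return "carrier interruptions, often resets at the remote end"
    return "no line-error symptom from Table 15-7"


# Hypothetical readings of the same serial interface, taken an hour apart.
earlier = {"interface resets": 4, "carrier transitions": 10, "input errors": 120}
later = {"interface resets": 9, "carrier transitions": 31, "input errors": 120}

grown = increases(earlier, later)
print(grown)
print(line_error_hint(grown))
```

Here resets and carrier transitions rose together, which the table treats as a sign of a bad link or bad CSU/DSU hardware.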

Although they're not LAN segments per se, serial links are integral to geographically distributed internetworks. Don't forget to consider them, even when a serial-line problem is not initially apparent. For example, when evaluating performance problems, it could be that a faulty serial link is shifting traffic loads elsewhere within the internetwork.

Client-Server VPNs

As we've discussed already, VPNs are a cost-effective way to use the Internet as your own private WAN. If you're having trouble getting a VPN to work, the problem generally falls into one of four areas:

Blocked VPN traffic
Bad Internet connections
Configuration errors
Network Address Translation (NAT) tunneling problems

At the risk of insulting anyone's intelligence, when problems arise (not only VPN issues), the first thing to do is to check for loose cables. Wiggle the cables on the client's modem, the router, and firewall to ensure they are seated properly. It's also a good idea to make sure you're using straight-through Cat 6 or 7 cabling and didn't pick up a length of crossover cable.

Blocked Traffic

The next step is to ensure your Internet service provider (ISP) allows IPSec VPN traffic. If your provider does not, it will not matter if your VPN is properly configured, because the packets won't be going anywhere. If your ISP does not allow IPSec VPN traffic, you might have to consider changing ISPs.

Check your firewall to ensure that it isn't blocking IPSec or PPTP traffic. To make a VPN connection, it is necessary to configure outbound IPSec traffic on the firewall. To do this, you must configure your firewall to enable IPSec, and then create a rule allowing the passage of traffic between the LAN and WAN. If that's not possible, it might be necessary to locate the client in the DMZ or consider investing in a different router or firewall.
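When reviewing a firewall rule set, it helps to know exactly what must be allowed through. IPSec normally needs UDP port 500 for key exchange and IP protocol 50 (ESP) for the encrypted traffic, with UDP port 4500 added when NAT traversal is in use; PPTP needs TCP port 1723 and IP protocol 47 (GRE). The Python sketch below compares those requirements with a list of permitted items. The data layout, the names, and the sample rule set are our own assumptions, not the format of any particular firewall.

```python
# What each VPN type needs passed, as (kind, number) pairs.
REQUIRED = {
    "ipsec": {("udp", 500), ("ip-protocol", 50)},
    "ipsec-nat-traversal": {("udp", 500), ("udp", 4500)},
    "pptp": {("tcp", 1723), ("ip-protocol", 47)},
}


def missing_rules(vpn_type: str, allowed) -> list:
    """List the items the VPN type needs that the firewall does not allow."""
    return sorted(REQUIRED[vpn_type] - set(allowed))


# Hypothetical firewall that passes web traffic and IKE, but not ESP.
allowed_out = [("tcp", 80), ("tcp", 443), ("udp", 500)]

print(missing_rules("ipsec", allowed_out))
print(missing_rules("pptp", allowed_out))
```

An empty list means the firewall is not the obstacle for that VPN type, and attention can move on to the other problem areas.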

Not only can hardware firewalls block traffic, but so can software firewalls. This is another easy place to check, especially for clients that are traveling or trying to connect from locations that aren't equipped with hardware firewalls, but that are set up with software firewalls. Just disable the software firewall and see if that works. Some software firewalls will ask you if they should allow VPN traffic to be passed and you can add the desired destination IP address to the trusted-zone setting.

NAT

Make sure your NAT is tunneling correctly. A good place to start is by making certain you have the most current firmware updates and software. NAT changes a packet's source IP address to the firewall's WAN address so that the packet can properly navigate the Internet. Unfortunately, this causes problems with IPSec: when IPSec tries to verify the packet's integrity, the packet fails the check because its address no longer matches what the sender protected.
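Why a translated address breaks an integrity check can be shown with a simplified model in Python. The sketch below is not IPSec itself: it uses a keyed hash (HMAC) from the standard library over the addresses and payload to imitate an integrity check that covers the source address. The key, the addresses, and the function names are all invented for the illustration.

```python
import hashlib
import hmac


def integrity_tag(key: bytes, src: str, dst: str, payload: bytes) -> bytes:
    """Compute a keyed hash over the addresses and the payload."""
    message = f"{src}|{dst}|".encode() + payload
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, src: str, dst: str, payload: bytes, tag: bytes) -> bool:
    """Recompute the tag at the receiver and compare it with the one sent."""
    return hmac.compare_digest(integrity_tag(key, src, dst, payload), tag)


shared_key = b"example shared secret"   # hypothetical
payload = b"application data"
client, server = "192.168.1.10", "198.51.100.20"
firewall_wan = "203.0.113.5"

# The client protects the packet using its private source address.
tag = integrity_tag(shared_key, client, server, payload)

# Without NAT the server sees the original source address and the check passes.
print(verify(shared_key, client, server, payload, tag))

# With NAT the source address has been rewritten in transit, so the check fails.
print(verify(shared_key, firewall_wan, server, payload, tag))
```

The payload and the key are untouched in the second case; the rewritten source address alone is enough to make the check fail.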

You can get a listing of your NAT translations and an overview of your NAT statistics by using two simple EXEC commands:

show ip nat translations verbose This displays the active NAT translations with additional information for each translation table entry, including how long ago the entry was created and last used.
show ip nat statistics This displays a variety of NAT statistics, including the number of active translations, interfaces, and total translations.

Configuration

Configuration can also be the culprit when trying to track down VPN problems. Ensure that the correct IP addresses are being used. For client VPNs, checking the IP address in Windows is accomplished by opening a command prompt and then entering ipconfig /all.

If the client's IP address does not fall within the range issued by the network administrator for connecting to the VPN, then the IP address is not valid. To correct this, renew the lease. This is accomplished by opening a command prompt window and typing ipconfig /renew, optionally followed by the name of the adapter.
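Checking whether an address falls within a range is easy to get wrong by eye, especially with ranges that don't end on a familiar boundary. Python's standard ipaddress module can do the comparison. In the sketch below, the range and the client addresses are hypothetical, and the function name is our own.

```python
import ipaddress


def in_vpn_range(client_address: str, vpn_range: str) -> bool:
    """Return True if the client's address lies inside the administrator's range."""
    return ipaddress.ip_address(client_address) in ipaddress.ip_network(vpn_range)


# Hypothetical range handed out by the network administrator.
VPN_RANGE = "192.168.50.0/28"   # covers 192.168.50.0 through 192.168.50.15

print(in_vpn_range("192.168.50.9", VPN_RANGE))
print(in_vpn_range("192.168.50.20", VPN_RANGE))
```

The second address looks close but lies outside the 16-address block, which is the kind of mismatch that calls for renewing the lease.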

Note

If the client is using PPPoE to connect to the ISP-which will be the case if a static IP address has not been assigned-make sure the client is connecting to the Internet using whatever connection application is needed.

Send a ping command to your VPN server's IP address. If you get a response, then you know the client is connected to the Internet and able to see the VPN server. Next, you should rule out any DNS configuration problems. This time, conduct a ping test, but use the domain name (for example, www.velte.com). If you get a response, name resolution is working fine. If not, it means DNS is misconfigured either at the client or on the DNS server itself.

The client and the VPN server must be able to speak the same language to get the job done. As such, it's important to make sure that the encryption settings on both the client and VPN server are the same. Authentication algorithms must be configured properly on both the client and VPN server. Both devices will need the shared secret, or, if using certificates, the correct public key is necessary.

Bad Connections

Next, check whether the client is trying to connect over a slow connection. Latency can cause VPN connections to fail, because they like consistent traffic; otherwise, they tend to drop off. You're most likely to see this as an issue with satellite connections where latency can run from half a second to several seconds.

Connection speed can be checked with the ping tool. Using the "-t" switch, you can get a continuous test of connection speeds between the client and the VPN server. For instance:

 ping 68.93.44.123 -t

This produces a list of the test's efforts to send packets to the address. The test is ended by pressing CTRL-C. Take a look at the results. If you see any stray "Request timed out" error messages, try increasing the timeout value so that you can accurately gauge how much latency your connection suffers. This value can be changed by using the "-w" switch. For instance:

 ping 68.93.44.123 -t -w 7000

This increases the timeout value to 7,000 ms. This should be enough to indicate how much latency is present on your link. Connection times at 1,500 ms and above will cause the VPN link to fail.
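If the ping results are saved to a text file, the round-trip times can be summarized instead of read line by line. The Python sketch below assumes the usual Windows reply format; the sample lines and their timings are invented for the example, and the summarize function is our own. The 1,500-ms limit comes from the figure quoted above.

```python
import re

# Hypothetical lines in the style of Windows ping output.
SAMPLE = """\
Reply from 68.93.44.123: bytes=32 time=640ms TTL=52
Request timed out.
Reply from 68.93.44.123: bytes=32 time=1720ms TTL=52
Reply from 68.93.44.123: bytes=32 time=810ms TTL=52
"""


def summarize(output: str, limit_ms: int = 1500) -> dict:
    """Count replies and timeouts and flag round trips at or above the limit."""
    times = [int(ms) for ms in re.findall(r"time[=<](\d+)ms", output)]
    return {
        "replies": len(times),
        "timeouts": output.count("Request timed out"),
        "worst_ms": max(times, default=None),
        "over_limit": sum(1 for t in times if t >= limit_ms),
    }


summary = summarize(SAMPLE)
print(summary)
```

A nonzero over_limit count, or timeouts that persist even with the longer timeout value, suggests the link is too slow to hold a VPN connection reliably.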

Note

It might not sound important at first blush, but if the VPN server and client do not have the correct time zones and time settings, they might not be able to hook up. This is because correct time settings are necessary for key expiration.