10-1. Checking Firewall Vital Signs

After a firewall is put into production, it is important to know how to check its health. You can proactively monitor various resources and statistics in an effort to determine when the firewall has a problem or becomes underpowered. As well, when users complain of problems or slow response times, you should be able to look into some of the firewall's inner workings to quickly spot issues.

How a firewall will behave under the load of a production network with real users and real applications is difficult to predict. When a problem is reported, you can rely on many of the available firewall statistics to see what is wrong. However, those numbers don't mean much if you don't have anything to compare them against.

You should make every effort to determine a baseline or average for several firewall parameters. Do this while the firewall is operating normally and all users are satisfied with its performance. Monitor the following statistics periodically so that you get an idea about their values and how they change during a business day or week:
The following sections show you how to monitor these functions.

Using the Syslog Information

Firewalls normally stay busy silently inspecting traffic, denying some packets and forwarding others. The only way to "feel the pulse" of your firewall, to see what it's been doing, is to look back through the Syslog messages it has generated.

You can view the most recent Syslog information, whether or not you have a Syslog server set up. You can enable buffered logging with the logging buffered level configuration command and view the buffer with the show logging command. However, only the most recent messages are kept in the buffer. On a busy firewall, the buffer can be completely overwritten in only a few seconds, so you might not be able to find the information you need.

Instead, you should seriously consider setting up machines to collect Syslog messages from every firewall in your network. Syslog information gives a very useful historic record in the following ways:
TIP When you set up a Syslog server, make sure you have a system (both hardware and the Syslog software) that can handle the appropriate Syslog message rate sent by the firewall(s). This rate varies according to the firewall's connection load. More importantly, it varies according to the Syslog severity level configured on the firewall. Refer to Chapter 9, "Firewall Logging," for detailed information about Syslog planning and analysis.

After you configure a firewall to send Syslog information to a server or memory buffer, you should verify its operation. You can use the show logging command for this, paying attention to the first dozen lines of output. Look at each type of logging to see if the appropriate ones are enabled and if they are set to the correct severity level. In particular, you should see a count of the total number of messages sent to each enabled destination. In the following example, both buffer and trap (Syslog server) logging are enabled and active:

    Firewall# show logging
    Syslog logging: enabled
        Facility: 20
        Timestamp logging: disabled
        Standby logging: disabled
        Console logging: disabled
        Monitor logging: disabled
        Buffer logging: level warnings, 5526712 messages logged
        Trap logging: level warnings, 6933062 messages logged
            Logging to outside 192.168.2.142
        History logging: disabled
        Device ID: disabled
    21.68.65/47808 to 1.0.0.0/0
    500004: Invalid transport field for protocol=17, from 172.29.68.65/47808 to 1.0.0.0/0
    [logging buffer output omitted]

Notice that the total numbers of buffer and trap logging messages are different. This is because each logging method is configured independently. For example, trap logging might have been enabled at an earlier date than buffer logging. Alternatively, each logging method can be configured with a different severity level threshold, causing each one to generate a different type and volume of message.
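If you collect show logging output with a script, the per-destination counters can be pulled out automatically. The following Python sketch is only an illustration: the function name logging_counters and the regular expression are assumptions based on the sample output above, not on any documented output format, so adjust them to match what your firewall actually prints.

```python
import re

# Sample text copied from the show logging example; only the counter lines matter.
SAMPLE = """\
Syslog logging: enabled
    Facility: 20
    Timestamp logging: disabled
    Console logging: disabled
    Buffer logging: level warnings, 5526712 messages logged
    Trap logging: level warnings, 6933062 messages logged
        Logging to outside 192.168.2.142
"""

# Assumed line shape: "<Destination> logging: level <severity>, <N> messages logged"
COUNTER_LINE = re.compile(
    r"^\s*(?P<dest>\w+) logging: level (?P<level>\w+), (?P<count>\d+) messages logged",
    re.MULTILINE,
)


def logging_counters(show_logging_output):
    """Return {destination: (severity_level, messages_logged)} for enabled destinations."""
    return {
        m.group("dest"): (m.group("level"), int(m.group("count")))
        for m in COUNTER_LINE.finditer(show_logging_output)
    }


counters = logging_counters(SAMPLE)
for dest, (level, count) in counters.items():
    print(f"{dest}: level={level}, messages={count}")
```

Destinations that are disabled never match the pattern, so a destination missing from the result is a quick hint that it is not enabled.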
Checking System Resources

A firewall inspects traffic and performs its functions by using a combination of system resources. From a hardware standpoint, these resources are very straightforward and include the CPU and system memory. The following sections analyze these resources.

Firewall CPU Load

You can get a general idea about the processing load on a Cisco firewall by using the show cpu usage command. For example, the following firewall appliance has a 5-second average of 27 percent. The command output also shows that the CPU is under a consistent load of about 24 percent:

    Firewall# show cpu usage
    CPU utilization for 5 seconds = 27%; 1 minute: 25%; 5 minutes: 24%
    Firewall#

As a rule of thumb, the CPU utilization should stay below an average of 80 percent. The utilization might spike or temporarily peak at a greater value, as seen in the short-term 5-second utilization. This is normal behavior, because the CPU could be processing a periodic task or a short burst of traffic.

In extreme situations, you might see Syslog message ID 211003 (PIX-3-211003, for example) being generated. This message is sent when the firewall CPU has consistently been at 100 percent.

If the CPU stays above 80 percent during a time you consider to be a normal traffic load, without a significant attack occurring, you should consider upgrading the firewall or lightening its traffic load. Some of the possibilities for doing so are as follows:
TIP If your firewall is configured to run multiple security contexts, remember that these "virtual firewalls" are all being emulated by one firewall platform. If the CPU usage is running high, you might have too many different contexts configured or too many contexts that are heavily used. You can get an idea of the breakdown of CPU resources across the contexts with the show cpu usage context all command, as in the following example:

    Firewall# show cpu usage
    CPU utilization for 5 seconds = 86%; 1 minute: 84%; 5 minutes: 83%
    Firewall# show cpu usage context all
    5 sec   1 min   5 min   Context Name
     2.2%    2.5%    2.4%   system
     1.0%    1.1%    1.0%   admin
    54.6%   52.2%   53.1%   CustomerA
    27.3%   27.2%   25.6%   CustomerB
     0.9%    1.0%    0.9%   Test
    Firewall#

The utilization values from the first command are the sums of the per-context values from the second command. For example, the 5-second values add up to 2.2 + 1.0 + 54.6 + 27.3 + 0.9 = 86.0 percent. In this example, the CustomerA context is using the most CPU resources and is a good candidate for being moved onto its own firewall platform.

To find out more about which activities are taxing the CPU, you can try to track down the most-used processes. The firewall's CPU continuously performs a number of different tasks, such as processing inbound packets; inspecting ICMP, UDP, and TCP traffic; interacting with the console and user sessions; maintaining failover communication; operating routing protocols; and so on.

In addition, a large number of tasks involve maintaining timers. When xlate and conn entries are created, timers must be started; when the various timers expire, those entries must be deleted. You also can time and control authenticated user sessions. Each of these timer functions has its own process, requiring periodic attention from the CPU. In fact, a process runs continuously just to keep the CPU utilization values computed and updated!

To see the entire list of active firewall processes, you can use the show processes command.
Unless you are a Cisco Technical Assistance Center (TAC) engineer, the only interesting information is found in the following columns:
NOTE Each line of the show processes command begins with three flag characters that give more information about the process. The first character denotes the process priority; it can be one of the following: D (dead), L (low), M (medium), or C (critical). The second character denotes the process state: r (ready to run), s (sleeping), x (dead), w (idle), or * (running). For example, the following processes are running at medium and critical priority, respectively:

    Mrd 003cb154 013ab118 00d9d580 9130720 013a9190 7660/8192 557poll
    Csi 0071b139 01d48580 00d9d490       0 01d46618 7340/8192 update_cpu_usage

Now suppose you see that the CPU utilization is running a consistent 25 percent, and you want to know why. Although it is somewhat involved, you can discover which processes are "hogging" the CPU by following these steps:
You can use a spreadsheet application such as Microsoft Excel to quickly compute the runtime difference on all the processes. Table 10-1 shows an example from a PIX 535 firewall. The actual runtime difference is shown in the rightmost column.
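If you would rather script the comparison than use a spreadsheet, the same runtime difference can be computed in a few lines of Python. This is a minimal sketch under stated assumptions: the column layout (flags, PC, SP, STATE, Runtime, SBASE, Stack, Process) is inferred from the sample process lines in the preceding NOTE, the function names are hypothetical, and the second snapshot's runtime value is made up purely for illustration.

```python
def parse_runtimes(show_processes_output):
    """Map process name -> Runtime counter from saved `show processes` text."""
    runtimes = {}
    for line in show_processes_output.splitlines():
        # Assumed layout: flags, PC, SP, STATE, Runtime, SBASE, Stack, Process
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[4].isdigit():
            continue  # header line or anything that is not a process row
        runtimes[fields[7].strip()] = int(fields[4])
    return runtimes


def runtime_deltas(first_snapshot, second_snapshot):
    """Return (process, runtime difference) pairs, busiest first, idle ones dropped."""
    before = parse_runtimes(first_snapshot)
    after = parse_runtimes(second_snapshot)
    deltas = [(name, after[name] - before[name]) for name in after if name in before]
    return sorted((d for d in deltas if d[1] > 0), key=lambda d: d[1], reverse=True)


# First snapshot: the two sample lines from the NOTE.
SNAPSHOT_1 = """\
PC SP STATE Runtime SBASE Stack Process
Mrd 003cb154 013ab118 00d9d580 9130720 013a9190 7660/8192 557poll
Csi 0071b139 01d48580 00d9d490 0 01d46618 7340/8192 update_cpu_usage
"""

# Second snapshot: hypothetical values, as if taken some time later.
SNAPSHOT_2 = """\
PC SP STATE Runtime SBASE Stack Process
Mrd 003cb154 013ab118 00d9d580 9130920 013a9190 7660/8192 557poll
Csi 0071b139 01d48580 00d9d490 0 01d46618 7340/8192 update_cpu_usage
"""

for name, delta in runtime_deltas(SNAPSHOT_1, SNAPSHOT_2):
    print(f"{name}: +{delta}")
```

Only processes whose runtime counter increased between the two snapshots are listed, which mirrors the rightmost column of the spreadsheet approach.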
From this information, you can see that the only processes that have used CPU time are i82543_timer, 557_poll, pix/intf0, and pix/intf1. Although some of the process names are intuitive, these aren't particularly obvious:
During this 60-second example, the CPU has only been involved in retrieving packets from the outside and inside interfaces. The actual ASA algorithm is being performed too, although it isn't broken out in the list of processes.

Notice that almost all the processes are related to I/O, where data must be moved into or out of the firewall in some fashion. The internal processes simply aren't shown because a firewall can become overloaded only by interacting with external things.

Firewall Memory

Cisco firewalls base their entire operation on the use of internal RAM. This memory can be broken up and allocated for many processes and other uses. Some examples of memory usage are as follows:
From a monitoring standpoint, firewall memory is organized in two ways:
Likewise, you have two ways to query a firewall's memory usage. To see the utilization of the firewall's main memory, you can use the show memory command. The following example shows a PIX 525 with 256 MB of RAM, which is seen to have about 85 MB in use, whereas about 183 MB of memory is still free. The percentages are values shown by PIX 7.x as a quick gauge.

    Firewall# show memory
    Free memory:       183131072 bytes (68%)
    Used memory:        85304384 bytes (32%)
    -------------     ----------------
    Total memory:      268435456 bytes (100%)
    Firewall#

You can use the show memory detail command to display statistics about how the firewall memory has been fragmented during allocations. However, this information is useful mainly to Cisco TAC engineers.

You can also see how the firewall is managing its block memory with the show blocks command. Consider the following block statistics for a PIX 535:

    Firewall# show blocks
      SIZE    MAX    LOW    CNT
         4    100     93     99
        80    100     98    100
       256   1100   1025   1036
      1550   2688   1912   1920
      2560     40     40     40
      4096     30     30     30
      8192     60     60     60
     16384    100    100    100
     65536     10     10     10
    Firewall#

Notice that statistics are shown for each different block size, ranging from 4-byte blocks up to 65,536-byte blocks. Table 10-2 lists the possible block sizes and their purposes.
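Both outputs are easy to check from a script when you are building a baseline. The following Python sketch is illustrative only: the function names memory_usage and parse_blocks are hypothetical, and the parsing assumes the output looks exactly like the show memory and show blocks samples above.

```python
import re

MEMORY_SAMPLE = """\
Free memory:       183131072 bytes (68%)
Used memory:        85304384 bytes (32%)
-------------     ----------------
Total memory:      268435456 bytes (100%)
"""

BLOCKS_SAMPLE = """\
  SIZE    MAX    LOW    CNT
     4    100     93     99
    80    100     98    100
   256   1100   1025   1036
  1550   2688   1912   1920
  2560     40     40     40
  4096     30     30     30
  8192     60     60     60
 16384    100    100    100
 65536     10     10     10
"""


def memory_usage(show_memory_output):
    """Return the byte counts and the used-memory percentage."""
    values = {
        m.group(1).lower(): int(m.group(2))
        for m in re.finditer(r"(Free|Used|Total) memory:\s+(\d+) bytes", show_memory_output)
    }
    used_pct = 100.0 * values["used"] / values["total"]
    return values, used_pct


def parse_blocks(show_blocks_output):
    """Return one dict per block size; non-numeric lines (header, prompt) are skipped."""
    rows = []
    for line in show_blocks_output.splitlines():
        fields = line.split()
        if len(fields) == 4 and all(f.isdigit() for f in fields):
            size, maximum, low, cnt = map(int, fields)
            rows.append({"size": size, "max": maximum, "low": low, "cnt": cnt})
    return rows


values, used_pct = memory_usage(MEMORY_SAMPLE)
print(f"used memory: {used_pct:.1f}% of {values['total']} bytes")

# Block sizes whose CNT column currently reads 0 deserve a closer look.
exhausted = [row["size"] for row in parse_blocks(BLOCKS_SAMPLE) if row["cnt"] == 0]
print("block sizes with CNT at 0:", exhausted)
```

Running a check like this periodically, and recording the results, gives you the baseline figures to compare against when a problem is reported.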
A firewall begins by allocating a default number of each block size when it boots up. The default number of blocks in each size varies with the firewall model, the number and type of interfaces, and the amount of available memory. For example, a PIX 535 might begin by allocating 1,444 1550-byte blocks. During operation, a firewall might use most or all of the blocks of a certain size. This might occur if Ethernet packets arrive faster than they can be inspected and processed. When the number of blocks approaches 0, the firewall attempts to allocate more blocks of that size from the available memory. The output from the show blocks command reports the current state of each block size. The count labels, MAX, LOW, and CNT, are not very intuitive; you should think of these in relation to blocks that are available, not blocks that are used. The count labels have the following meanings:
You can use the clear blocks command to return the maximum number of available blocks to the system defaults and to set the low block counts to the currently available values. This command can be useful if you notice that an extraordinary number of blocks have been allocated and you want to bring the firewall block allocation closer to its default state without a reboot.

Normally, you shouldn't see any of the CNT values staying at 0. If some values do tend to remain at 0, the firewall cannot allocate any more memory to the block size shown.

TIP Notice that some firewalls set aside memory blocks for Ethernet (1550-byte) and Gigabit Ethernet (16384-byte). Why is there a difference? The 1550-byte blocks are used for traditional Ethernet, for both 10/100 and lower-performance Gigabit Ethernet interfaces. The maximum transmission unit (MTU) for Ethernet is 1500 bytes.

The 16384-byte blocks are reserved for firewalls that have higher-performance Gigabit Ethernet interfaces (based on the Intel i82543 controller) installed. These interfaces also use an MTU of 1500 bytes. However, the firewall can achieve better performance by moving the 1500-byte packets into and out of the interface in large numbers. In other words, the best performance is obtained when about ten packets are moved at a time.

To see what types of interface controllers your firewall has, you can use the show interface command.
You can focus on the controller hardware by filtering the output with the show interface | include (line protocol | Hardware) command, as shown in this example:

    Firewall# show interface | include (line protocol | Hardware)
    interface gb-ethernet0 "outside" is up, line protocol is up
      Hardware is i82543 rev02 gigabit ethernet, address is 0003.47df.8580
    interface gb-ethernet1 "inside" is up, line protocol is up
      Hardware is i82543 rev02 gigabit ethernet, address is 0003.47df.85fc
    interface ethernet0 "stateful" is up, line protocol is up
      Hardware is i82559 ethernet, address is 0002.b3ad.8466
    interface ethernet1 "dmz" is up, line protocol is up
      Hardware is i82559 ethernet, address is 0002.b3ad.7f4b
    Firewall#

With PIX 7.x, the results are very similar:

    Firewall# show interface | include (line protocol | Hardware)
    Interface GigabitEthernet0 "outside", is up, line protocol is up
      Hardware is i82542 rev03, BW 1000 Mbps
    Interface Ethernet0 "", is administratively down, line protocol is up
      Hardware is i82559, BW 100 Mbps
    [output omitted]

Checking Stateful Inspection Resources

As a firewall inspects and passes traffic, it maintains two tables of entries: address translations (xlates) and connections (conns). You can get an idea of the inspection load by looking at the size of these tables.

Xlate Table Size

To see the translation table size, use the following command:

    Firewall# show xlate count

The output from this command shows the current number of xlates in use and the maximum number that have been built since the firewall was booted. The Cisco PIX Firewall in the following example currently has built 15,273 translations. At some time in the past, a maximum of 22,368 xlates were in use:

    Firewall# show xlate count
    15273 in use, 22368 most used
    Firewall#

The xlate count is the sum of both static xlates (from the static command) and dynamic xlates (from nat and global commands). When you see the xlate count, it might not be obvious whether there are too many translations.
When a host passes through the firewall, it uses only one translation if it falls in a static mapping. If a host triggers a dynamic translation, the firewall creates one xlate for every unique PAT connection. In other words, it isn't possible to estimate or calculate ahead of time the number of xlates that will be used.

Instead, be sure to look at the xlate count periodically so that you can get an average baseline value. Then, if you see the xlate count jump to a much higher value later, you can assume that something is wrong. In that case, your firewall could be building xlates in response to an attack or other malicious activity.

Conn Table Size

A firewall creates and tears down connection entries for UDP and TCP connections between pairs of hosts. You can get connection statistics with the following command:

    Firewall# show conn count

The output of this command shows the current number of connections in use and the maximum number of connections that have been built since the firewall was booted. The firewall in the following example has 4,495 connections currently in its conn table, and it has had up to 577,536 simultaneous connections in the past!

    Firewall# show conn count
    4495 in use, 577536 most used
    Firewall#

The conn count is the sum of all types of connections. The count fluctuates as connections are built and torn down. Again, you should periodically use this command to get a feel for the average number of connections your network uses. If the conn count is excessive, as in the maximum number in the preceding example, an attack is likely in progress.

If you have a failover pair of firewalls, shouldn't the xlate and conn counts be identical in each unit? The xlate and conn entries are replicated only from the active unit to the standby unit if you have configured stateful failover. Otherwise, the active unit shows positive counts for both xlates and conns, but the standby unit shows counts of 0.
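The baseline comparison recommended above for xlate and conn counts can be automated once you have a typical value in hand. The following Python sketch is only an illustration: the helper names and the factor of 3 are assumptions chosen for the example, not recommended thresholds, and the baseline should come from your own periodic measurements.

```python
import re

# Matches the count line printed by show xlate count and show conn count.
COUNT_LINE = re.compile(r"(\d+) in use, (\d+) most used")


def parse_count(output):
    """Return (in_use, most_used) from show xlate count or show conn count output."""
    match = COUNT_LINE.search(output)
    if match is None:
        raise ValueError("no 'in use, most used' line found")
    return int(match.group(1)), int(match.group(2))


def exceeds_baseline(current, baseline, factor=3.0):
    """True when the current count is more than `factor` times the baseline."""
    return current > baseline * factor


in_use, most_used = parse_count("Firewall# show conn count\n4495 in use, 577536 most used")
baseline = 4495  # hypothetical baseline taken from an ordinary business day

print("current count unusual:", exceeds_baseline(in_use, baseline))
print("historic peak unusual:", exceeds_baseline(most_used, baseline))
```

With the sample values, the current count is unremarkable, but the historic peak of 577,536 connections is far above the assumed baseline, which matches the observation that an attack was likely in progress at some point.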
Also, it is not unusual for the active and standby table counts to be quite different, even when stateful failover is being used. The entire xlate and conn tables are not replicated between the units. Rather, only new entries (either created or torn down) are replicated.

If the stateful failover link goes down, the standby unit misses new table entries. When the link comes back up, only new entries from that point on are received.

If the standby unit loses power or reboots, its entire xlate and conn tables are lost. When it comes back up, it receives the continual flow of only new entries, not the entire contents of the tables. Existing conn table entries from the active unit are replicated to the standby unit only if they are actively passing data. Existing connections that are idle are not replicated.

Also, the two units might show a disproportionate number of conn table entries if HTTP replication is not enabled between the failover pair. In this case, the active unit maintains HTTP connection entries but does not replicate those to the standby unit.

A failover pair has been configured for stateful failover in the following example. Somewhere along the way, the standby unit lost power and was rebooted. Notice that the table counts are quite different; the standby unit lost its tables and has received only new table changes, and HTTP replication has not been configured on the failover pair. The active unit is shown first, followed by the standby unit.

    [Active unit]
    Firewall# show conn count
    4263 in use, 577536 most used
    Firewall# show xlate count
    15166 in use, 22368 most used
    Firewall#
    ________________________________________________________________
    [Standby unit]
    Firewall# show conn count
    686 in use, 690 most used
    Firewall# show xlate count
    659 in use, 665 most used
    Firewall#

Checking Firewall Throughput

Many of the firewall statistics that you might display are based on incrementing counters or "snapshot" values.
These give you an idea of the volume of activity over a long period of time, but not of the rate. For example, to gauge your firewall's throughput, you might want to see the number of bytes per second being forwarded on an interface or the number of TCP connections per second that are being inspected.

A Cisco firewall keeps several running statistics that you can display. You also can use an external application to perform some analysis on firewall counters and messages. The following sections describe several ways to show the firewall throughput.

PDM

The PIX Device Manager (PDM) default view shows several useful throughput calculations. Figure 10-1 shows a sample PDM display, where you can determine the following throughput measures:
Figure 10-1. A Sample PDM Display of Firewall Throughput

Syslog

Some Syslog analysis applications can parse the history of Syslog messages generated by a firewall. If the firewall is configured to report about connections being set up and torn down (logging severity level 6, informational, by default), the Syslog analyzer can calculate the number of connections per second, the interface data rate per second, and so on.

For example, a Syslog analysis tool is used to present information about the bandwidth being passed through firewall interfaces. Figure 10-2 shows a graph of utilization per unit time, as "bandwidth usage per hour." In this case, rather than showing the throughput for an individual interface, the total aggregate bandwidth for all interfaces is shown over time.

Figure 10-2. An Example of Firewall Throughput Reporting by Syslog Analysis

Traffic Counters

Cisco firewalls can report traffic throughput on each interface through the command-line interface (CLI). This can be handy if you are connected to a firewall over a console, Telnet, or SSH session, and you want to check the throughput.

The firewall keeps running counters of input and output data on each interface while it is operational. These counters begin at bootup or at the last counter reset, and they accumulate until you issue the command to display them. The firewall also computes the average throughput in bytes per second, but this is based on the total time elapsed since the counters were last reset.

To see the traffic counters and throughput information, follow these steps:
Aggregate interface performance also comes into play for firewalls that are configured for multiple-context security mode. Each context has its own set of logical interfaces (inside and outside, for example) that are mapped from a physical interface or subinterface in the system execution space. From any user context, show traffic shows only the logical interfaces used by that context. However, the system execution space shows a breakdown that includes the aggregate physical interfaces.

Perfmon Counters

A Cisco firewall also keeps statistics about its stateful inspection performance. These values are called performance monitors or perfmon. From this information, you can get a good idea about the types of traffic passing through the firewall. You can also use the displayed rates to see a load distribution of the inspection or fixup processes.

To configure and view the performance statistics, follow these steps:
TIP If you are unsure of the perfmon settings, you can use the perfmon settings command to see them. Notice that this command doesn't use a show keyword, like most other EXEC firewall commands. The perfmon interval and reporting mode are shown.

Checking Inspection Engine and Service Policy Activity

Beginning with PIX 7.x, application inspection is performed by independent inspection engines that are referenced in service policies. You can get information about the activity of the various inspection engines by displaying the active service policies.

One service policy can be applied to a firewall interface to define the actions to take on matching traffic in the inbound and outbound directions. A global service policy is also configured by default and is applied to all firewall interfaces. Any traffic not matched by an interface service policy is matched by the default global service policy.

You can use the following command to display all active service policy statistics:

    Firewall# show service-policy

The modular policy configuration of each service policy is shown. This includes the target interface, the service policy name, the class map used to match traffic, and each policy action. For each inspection engine, the number of packets inspected, dropped, and dropped with reset are shown.

Packets are counted only in the service policy where they are matched and inspected. The default global service policy matches all packets that aren't matched elsewhere. In the following example, notice the packet counts for HTTP traffic and how each service policy has matched a different set of packets to inspect. The output of this command gives you a quick snapshot of all the inspections and actions that have been configured on the firewall and are actually being used.
    Firewall# show service-policy
    Global policy:
      Service-policy: asa_global_fw_policy
        Class-map: inspection_default
          Inspect: dns maximum-length 512, packet 363, drop 0, reset-drop 0
          Inspect: ftp, packet 0, drop 0, reset-drop 0
          Inspect: h323 h225, packet 0, drop 0, reset-drop 0
          Inspect: h323 ras, packet 0, drop 0, reset-drop 0
          Inspect: rsh, packet 0, drop 0, reset-drop 0
          Inspect: rtsp, packet 26601, drop 0, reset-drop 0
          Inspect: esmtp, packet 1668, drop 0, reset-drop 0
          Inspect: sqlnet, packet 0, drop 0, reset-drop 0
          Inspect: skinny, packet 0, drop 0, reset-drop 0
          Inspect: sunrpc, packet 0, drop 0, reset-drop 0
          Inspect: xdmcp, packet 0, drop 0, reset-drop 0
          Inspect: sip, packet 0, drop 0, reset-drop 0
          Inspect: netbios, packet 0, drop 0, reset-drop 0
          Inspect: tftp, packet 0, drop 0, reset-drop 0
          Inspect: http, packet 36614, drop 0, reset-drop 0
          Inspect: icmp, packet 3911, drop 0, reset-drop 0
          Inspect: icmp error, packet 171, drop 0, reset-drop 0
        Class-map: asa_class_tftp
          Inspect: tftp, packet 0, drop 0, reset-drop 0
    Interface outside:
      Service-policy: test-policy
        Class-map: test
          Inspect: http, packet 369, drop 0, reset-drop 0
          Priority:
            Interface outside: aggregate drop 0, aggregate transmit 0
    Interface inside:
      Service-policy: PolicyA
        Class-map: http_class
          Inspect: http test_http, packet 99400, drop 41, reset-drop 0
        Class-map: ftp_class
          Inspect: ftp strict Filter_ftp, packet 696, drop 0, reset-drop 0
        Class-map: test
          Priority:
            Interface inside: aggregate drop 0, aggregate transmit 0
    Firewall#

Checking Failover Operation

If you have a failover pair of firewalls, you should periodically check to see that the failover mechanisms are actually working properly. Use the techniques described in the following sections to gauge the failover performance.

Verifying Failover Roles

First, you should verify that the active failover unit is indeed the one you are expecting. When a failover pair is initially configured for failover, only one of them becomes the active unit.
The other assumes the standby (passive) mode. If a failover occurs, the two units swap roles. Recall that the two failover units also have a "primary" and "secondary" designation. This has nothing to do with the actual failover operation, other than to distinguish one unit from the other. Usually, the secondary unit is purchased with a failover license at a lower price. Why are the failover units hard to distinguish from each other? The active unit always uses the active IP addresses on its interfaces, and the standby unit uses the standby IP addresses. As soon as a failover happens, the two units swap IP addresses. (Keep in mind that the same thing happens with MAC addresses.) This means if you open a Telnet or SSH session to the active IP address on an interface, you won't know which physical unit answers. To make matters more difficult, both failover units also have the same host name and command-line prompt! After you open a session, you can use the show failover command to learn the identity of the physical unit (primary or secondary), as well as its current failover role (active or standby). For example, suppose a failover pair is configured with an inside active address of 192.168.254.1 and an inside standby address of 192.168.254.2. When failover is first enabled, the primary unit takes on the active failover role. An SSH session is opened to the active address 192.168.254.1. 
You would use the show failover command to see which unit is currently active:

    Firewall# show failover
    Failover On
    Cable status: Normal
    Reconnect timeout 0:00:00
    Poll frequency 15 seconds
    Last Failover at: 08:34:03 EST Sun Dec 28 2003
            This host: Primary - Active
                    Active time: 7304955 (sec)
                    Interface stateful (192.168.199.1): Normal
                    Interface dmz2 (127.0.0.1): Link Down (Shutdown)
                    Interface outside (172.16.110.65): Normal
                    Interface inside (192.168.254.1): Normal
            Other host: Secondary - Standby
                    Active time: 2770785 (sec)
                    Interface stateful (192.168.199.2): Normal
                    Interface dmz2 (0.0.0.0): Link Down (Shutdown)
                    Interface outside (172.16.110.66): Normal
                    Interface inside (192.168.254.2): Normal
    [output deleted]

From the Failover On and This host: Primary - Active lines, it's easy to see that failover is enabled (on), that the host to which you're connected is the primary unit, and that it has the active role.

Notice that the pair of units is polling each other every 15 seconds (the default). This means that it takes up to two or three poll intervals (30 to 45 seconds) for one unit to detect that the other unit has failed.

You can also use this command to quickly see the status of every firewall interface on both units at the same time. For each interface that is being used, you should see the Normal status listed. If you see Waiting, one of the units has missed one hello message from the other and suspects there might be a problem. Testing means that three hellos have been missed, so the interface is currently going through a series of tests. If the tests fail, the interface is marked as Failed.

Verifying Failover Communication

If failover is enabled, you might want to verify that the two units are communicating properly over the failover links. Cisco firewalls can exchange failover information in three ways:
If a failover cable is being used, look for the cable status in the show failover command output:

    Firewall# show failover
    Failover On
    Cable status: Normal
    Reconnect timeout 0:00:00
    Poll frequency 15 seconds
    Last Failover at: 08:34:03 EST Sun Dec 28 2003
            This host: Primary - Active
    [output deleted]

In this example, the cable is in place and the status is Normal. This means the two units are exchanging failover information successfully. With the failover cable, each unit can also determine if the other unit is powered on. If the companion unit has lost power, the cable status shows other side is powered off.

TIP Keep in mind that the failover cable itself determines which unit is primary and which is secondary. One end of the cable is labeled "primary" and should be plugged into the unit that has the primary firewall license. Obviously, the other end of the cable plugs into the secondary unit, usually the one with the failover license.

If you are using LAN-based failover, the status of the failover cable is irrelevant. The status of the interface dedicated to LAN-based failover communication, however, is important. The primary and secondary failover units are identified through configuration commands.

With PIX 6.3, rather than sifting through output showing the status of the LAN-based interface, you can quickly see the status with the show failover lan command. In the following example, the dmz interface is being used for failover communication. The failover peer at each end of the LAN connection is seen to be Normal.

    Firewall# show failover lan
    Lan Based Failover is Active
            interface dmz (192.168.1.1): Normal, peer (192.168.1.2): Normal

PIX 7.x and later releases don't have this command.
Instead, look at the first few lines of the show failover command output:

    Firewall# show failover
    Failover On
    Cable status: N/A - LAN-based failover enabled
    Failover unit Primary
    Failover LAN Interface: Failover Ethernet2 (up)
    Unit Poll frequency 3 seconds, holdtime 9 seconds
    Interface Poll frequency 15 seconds
    [output omitted]

NOTE Unlike with the failover cable, it isn't possible to detect the power status of the other unit with LAN-based failover. The cable carries a power signal from each unit so that it is easy to sense a loss of power. No power signals can be carried over a LAN-based failover connection, simply because only IP packets can be exchanged between the two units. If one unit has lost power, nothing in the IP failover packets indicates that. The other unit notices only the absence of failover packets.

In all releases except PIX 7.x, you can also use the show failover lan detail command to add a generous amount of debugging information. Most of the output messages are coded values that aren't intuitive. However, you can see failover message counters and retransmission queue statistics that show how congested the LAN-based failover link has been.

If you are concerned about two firewalls failing over with little impact to a production network, you have likely configured stateful failover. This type of failover works in conjunction with a failover cable or LAN-based failover. The basic housekeeping functions are communicated over the cable or the LAN interface, and the stateful interface is reserved for sending dynamic updates about connection or translation entries.

If it works properly, stateful failover keeps the standby unit fully informed about the state of every active TCP and UDP connection in the active unit. In addition, the xlate table entries and ARP table entries are replicated. Should the standby unit need to take over, it already has the stateful information and can preserve existing connections during the failover transition.
A failover pair keeps detailed statistics about the stateful information exchange. You'll find these at the end of the show failover command output. You should only be concerned about verifying effective stateful updates so that the two firewall units stay synchronized at all times.

For example, the active unit shows the following output from the show failover command in PIX 7.x:

    Firewall# show failover
    Failover On
    Cable status: N/A - LAN-based failover enabled
    Failover unit Primary
    Failover LAN Interface: Failover Ethernet2 (up)
    Unit Poll frequency 3 seconds, holdtime 9 seconds
    Interface Poll frequency 15 seconds
    Interface Policy 2
    Monitored Interfaces 3 of 250 maximum
    Group 1 last failover at: 10:29:18 EST Jan 30 2005
    Group 2 last failover at: 10:29:27 EST Jan 30 2005

    Stateful Failover Logical Update Statistics
    Link : Failover Ethernet2 (up)
    Stateful Obj     xmit     xerr     rcv      rerr
    General          0        0        0        0
    sys cmd          13531    0        13531    0
    up time          0        0        0        0
    RPC services     0        0        0        0
    TCP conn         0        0        0        0
    UDP conn         0        0        0        0
    ARP tbl          29       0        0        0
    Xlate_Timeout    0        0        0        0

    Logical Update Queue Information
                Cur    Max    Total
    Recv Q:     0      1      13531
    Xmit Q:     0      1      13573
    Firewall#

Here, counters are shown for each of the following types of replicated stateful messages:
The xmit and rcv columns show how many of each message type have been transmitted or received by this firewall, respectively. While the unit is in active mode, you should see the transmit counters increasing much more than the receive counters. This is because the active unit is tasked with keeping the standby unit updated.

The xerr and rerr columns show the number of transmit and receive errors encountered while exchanging messages. If you find a large number of transmit errors, the sending firewall unit could not successfully send failover messages because of network congestion or a slow LAN interface. Receive errors indicate failover messages that arrived corrupted.

Determining if a Failover Has Occurred

If a failover pair of firewalls is operating correctly, and stateful failover is being used to synchronize the state information, you may never realize when a failover takes place. How can you determine if the units have failed over?

You can use the show failover command to see a record of the last failover event. The output from this command displays the date and time, along with the total amount of time that each unit has assumed the active role. In the following example, the failover occurred on December 28 at 8:34:03 a.m.

Be aware that you must have already set the firewall clock (both active and standby units) or have configured the units to use NTP. Otherwise, the failover time stamp will be incorrect, and you'll have no idea when it actually occurred.
    Firewall# show failover
    Failover On
    Cable status: Normal
    Reconnect timeout 0:00:00
    Poll frequency 15 seconds
    Last Failover at: 08:34:03 EST Sun Dec 28 2003
    This host: Primary - Active
    Active time: 7304955 (sec)
    Interface stateful (192.168.199.1): Normal
    Interface dmz2 (127.0.0.1): Link Down (Shutdown)
    Interface outside (172.16.110.65): Normal
    Interface inside (192.168.254.1): Normal
    Other host: Secondary - Standby
    Active time: 2770785 (sec)
    Interface stateful (192.168.199.2): Normal
    Interface dmz2 (0.0.0.0): Link Down (Shutdown)
    Interface outside (172.16.110.66): Normal
    Interface inside (192.168.254.2): Normal
    [output deleted]

At this point, you should also take note of the failover roles. When this command was entered, the primary unit was in the active role. This means that before the failover occurred, the other unit (the secondary) was active.

The Active time values only serve to give you an idea of how much time each unit has spent in the active role. This is the total elapsed active time since the last reboot, not the amount of time the unit has been active since the last failover.

In the sample output, the two units may have failed over more than once. Only the last failover event is noted, and you have no knowledge of any previous ones. Therefore, these units might have traded roles and accumulated active duty time on several occasions.

If Syslog messages have been generated and recorded, you can find a detailed record of each failover and the symptoms surrounding it, complete with time stamps.

Determining the Cause of a Failover

Now consider the importance of knowing why a failover has happened. For a failover to be triggered, one or both firewalls must have detected a problem: either the other unit was unresponsive to failover polls, or an interface had a problem. If a problem exists, you should try to identify it and get it fixed; otherwise, you might be left with only one working firewall out of the pair.
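The Last Failover time stamp and the Active time counters in the show failover output are easier to judge when the raw seconds are converted into days and hours. The following Python sketch shows one way to do that. It is an illustration only: it assumes the command output has been captured as text with one field per line, and the function name summarize_failover is hypothetical.

```python
import re
from datetime import timedelta

# Excerpt of the sample output shown earlier, one field per line (assumed layout).
SAMPLE = """\
Last Failover at: 08:34:03 EST Sun Dec 28 2003
This host: Primary - Active
Active time: 7304955 (sec)
Other host: Secondary - Standby
Active time: 2770785 (sec)
"""


def summarize_failover(show_failover_output):
    """Return the last failover time stamp and per-unit active time (hypothetical helper)."""
    summary = {"last_failover": None, "active_time": {}}
    unit = None
    for raw in show_failover_output.splitlines():
        line = raw.strip()
        m = re.match(r"Last Failover at:\s*(.+)", line)
        if m:
            summary["last_failover"] = m.group(1)
            continue
        # "This host: Primary - Active" or "Other host: Secondary - Standby"
        m = re.match(r"(?:This|Other) host:\s*(\w+)", line)
        if m:
            unit = m.group(1)
            continue
        m = re.match(r"Active time:\s*(\d+)\s*\(sec\)", line)
        if m and unit:
            summary["active_time"][unit] = timedelta(seconds=int(m.group(1)))
    return summary


if __name__ == "__main__":
    result = summarize_failover(SAMPLE)
    print("Last failover:", result["last_failover"])
    for name, active in result["active_time"].items():
        print(name, "active for", active)
```

For the sample values, the primary unit has accumulated a little over 84 days of active time and the secondary a little over 32 days since their last reboots. The time stamp is kept as a string because it is only as trustworthy as the firewall clock that produced it.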
TIP For proper failover operation, each firewall unit must be able to send failover messages to the other unit on every usable or monitored interface. This means you should make sure each interface on the primary unit can reach each corresponding interface on the secondary unit. A failure of just one interface can trigger a failover condition. (Beginning with PIX 7.x, you can configure a failover policy that triggers a failover when the number of failed monitored interfaces increases above a threshold.)

If you have firewall interfaces that aren't being used, make sure you shut down those interfaces with the interface hardware_id shutdown configuration command (PIX 6.3) or the shutdown interface configuration command (PIX 7.x). You should also make sure that unused interfaces don't have a valid IP address configured. You can use the ip address interface 0.0.0.0 255.255.255.255 configuration command.

If any unused interface is left enabled, it could inadvertently trigger a failover just from a lack of connectivity between the failover units.

You can diagnose the cause of a failover event using one of the following two methods:
If a firewall interface fails for some reason, it might trigger a failover. However, if the interface becomes usable again, it is shown as Normal in the show failover output. If it fails and stays failed, you can use that command to find the broken interface, even at a later date. The most detailed way to track down the failure is to sift through the Syslog message history. A Syslog server is a must here, because the failover event might be buried within many thousands or millions of other Syslog messages. The firewall logging buffer simply isn't large enough to store a long history of messages. Usually, only the active failover unit is configured to generate "trap" (Syslog) logging to a Syslog server. The standby unit can generate its own Syslog messages, but that causes both units to send duplicate messages to the server. That doubles the amount of message storage required and is usually considered redundant. Therefore, begin by finding the date and time of the failover event with the show failover command. This gives you a window of time to use when searching through the archived Syslog data. Failover messages are generated with the identity of the sending firewall unit embedded in the message text: (Primary) or (Secondary). This is important, because it uniquely identifies which physical firewall unit is reporting the failover activity. If logging from the standby unit has not been enabled, you can conclude that any messages found must have come from the unit that was active at that time. The active unit messages found on the Syslog server tell only half of the failover story; to find the standby unit's testimony, you have to look elsewhere. For this, it is handy to enable buffered logging in addition to trap (Syslog) logging on the active unit. Unlike trap logging, when buffered logging is enabled, it is enabled on both the active and standby units. Therefore, the standby unit (after failover) has in its logging buffer a brief record of the failover. 
In fact, the failover messages are the very last messages recorded in the buffer. Why? This is because that unit was active up until the failover. It would have recorded all sorts of Syslog messages in its buffer, because they were also sent to the Syslog server. Any failover messages would be recorded there, up until the unit switched to standby mode. Then the unit becomes passive and doesn't really generate any further Syslog messages.

Therefore, use the search terms Primary or Secondary when you search through the Syslog messages. Then go to the standby unit (after failover) and get a record of its logging buffer with the show logging command.

An Example of Finding the Cause of a Failover

A failover has occurred within a pair of Cisco PIX Firewalls. From the show failover command, you find this time stamp:

    Firewall# show failover
    Failover On
    Cable status: Normal
    Reconnect timeout 0:00:00
    Poll frequency 15 seconds
    Last Failover at: 08:34:03 EST Mon Feb 28 2005
    This host: Primary - Active
    Active time: 7319775 (sec)

On the Syslog server, you perform a search for the terms Primary and Secondary in the message text around that time frame. You're looking for any failover messages that might have been sent by the primary or secondary firewall units. Here are the results of the search:

    Feb 28 2005 8:34AM Firewall LOCAL4 ALERT %PIX-1-104001: (Primary) Switching to ACTIVE - mate want me Active.
    Feb 28 2005 8:34AM Firewall LOCAL4 ALERT %PIX-1-105003: (Primary) Monitoring on interface 3 waiting
    Feb 28 2005 8:34AM Firewall LOCAL4 ALERT %PIX-1-105003: (Primary) Monitoring on interface 0 waiting
    Feb 28 2005 8:34AM Firewall LOCAL4 ALERT %PIX-1-105004: (Primary) Monitoring on interface 3 normal
    Feb 28 2005 8:34AM Firewall LOCAL4 ALERT %PIX-1-105004: (Primary) Monitoring on interface 0 normal
    Feb 28 2005 9:09AM Firewall LOCAL4 ALERT %PIX-1-105008: (Primary) Testing Interface 2
    Feb 28 2005 9:09AM Firewall LOCAL4 ALERT %PIX-1-105009: (Primary) Testing on interface 2 Passed
    Feb 28 2005 9:09AM Firewall LOCAL4 ALERT %PIX-1-105008: (Primary) Testing Interface 2
    Feb 28 2005 9:09AM Firewall LOCAL4 ALERT %PIX-1-105009: (Primary) Testing on interface 2 Passed
    Feb 28 2005 9:10AM Firewall LOCAL4 ALERT %PIX-1-105008: (Primary) Testing Interface 2
    Feb 28 2005 9:10AM Firewall LOCAL4 ALERT %PIX-1-105009: (Primary) Testing on interface 2 Passed
    Feb 28 2005 9:10AM Firewall LOCAL4 ALERT %PIX-1-105008: (Primary) Testing Interface 2
    Feb 28 2005 9:10AM Firewall LOCAL4 ALERT %PIX-1-105009: (Primary) Testing on interface 2 Passed
    Feb 28 2005 9:10AM Firewall LOCAL4 ALERT %PIX-1-105004: (Primary) Monitoring on interface 2 normal

From this record, it's evident that something began to happen at 8:34 a.m. on February 28. The primary unit reports that the other unit ("mate") has decided that it should assume the active role. Obviously, before this time, the primary unit had been in the standby role.

When the primary unit becomes active, it begins to monitor on several interfaces, determining if the interfaces are working properly and if it can detect the other unit on them too. Notice that at 9:09 a.m., it begins several testing phases on interface 2. Evidently, interface 2 (missing from the earlier monitor tests) had a problem from 8:34 until 9:09.

You still don't know what triggered the failover, because you have a record only from the unit that had just become active at that time.
That also means that the other unit, which had been active before the failover, was generating other Syslog messages before the failover. However, the failover caused it to enter the standby mode, so no further Syslog information was sent to the server.

As a last step, you connect to that unit (currently in standby mode) and have a look at its logging buffer:

    Firewall# show logging
    Syslog logging: enabled
    Facility: 20
    Timestamp logging: disabled
    Standby logging: disabled
    Console logging: disabled
    Monitor logging: disabled
    Buffer logging: level warnings, 553574 messages logged
    Trap logging: level warnings, 553574 messages logged
    Logging to inside 192.168.100.100
    History logging: disabled
    Device ID: disabled
    305006: Dst IP is network/broadcast IP, translation creation failed for icmp src outside:64.170.37.34 dst inside:169.163.69.0 (type 8, code 0)
    305006: Dst IP is network/broadcast IP, translation creation failed for icmp src outside:64.170.37.34 dst inside:169.163.69.0 (type 8, code 0)
    500004: Invalid transport field for protocol=17, from 172.21.68.65/47808 to 1.0.0.0/0
    [output deleted]
    411002: Line protocol on Interface outside, changed state to down
    105007: (Secondary) Link status 'Down' on interface 2
    104002: (Secondary) Switching to STNDBY - interface check, mate is healthier
    105003: (Secondary) Monitoring on interface 3 waiting
    105003: (Secondary) Monitoring on interface 0 waiting
    105004: (Secondary) Monitoring on interface 3 normal
    105004: (Secondary) Monitoring on interface 0 normal
    105006: (Secondary) Link status 'Up' on interface 2
    105003: (Secondary) Monitoring on interface 2 waiting
    105004: (Secondary) Monitoring on interface 2 normal
    Firewall#

Aha! The buffered record shows that the "outside" interface went down, triggering the failover. This unit decided that because its own interface went down hard, it should immediately relinquish the active role.
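Searching for the (Primary) and (Secondary) tags by hand works for a single event, but the same filter can be scripted when the Syslog archive is large. The following Python sketch is one possible approach, not a feature of the firewall or of any particular Syslog server. It assumes the messages are available as plain text lines in either of the two formats shown in this example, and the names failover_events and role_changes are hypothetical.

```python
import re

# Matches both archived lines ("... %PIX-1-104001: (Primary) ...") and
# buffered lines ("104002: (Secondary) ..."), as seen in the example above.
FAILOVER_MSG = re.compile(r"(?:%PIX-\d-)?(\d{6}): \((Primary|Secondary)\) (.*)")

# Message IDs taken from the example: 104001 reports a switch to active,
# 104002 a switch to standby.
ROLE_CHANGE_IDS = {"104001", "104002"}


def failover_events(lines):
    """Return failover messages tagged with the reporting unit (hypothetical helper)."""
    events = []
    for line in lines:
        m = FAILOVER_MSG.search(line)
        if m:
            events.append(
                {"msg_id": m.group(1), "unit": m.group(2), "text": m.group(3).strip()}
            )
    return events


def role_changes(events):
    """Keep only the messages that report a unit changing roles."""
    return [e for e in events if e["msg_id"] in ROLE_CHANGE_IDS]


if __name__ == "__main__":
    sample = [
        "Feb 28 2005 8:34AM Firewall LOCAL4 ALERT %PIX-1-104001: (Primary) "
        "Switching to ACTIVE - mate want me Active.",
        "411002: Line protocol on Interface outside, changed state to down",
        "105007: (Secondary) Link status 'Down' on interface 2",
        "104002: (Secondary) Switching to STNDBY - interface check, mate is healthier",
    ]
    for event in role_changes(failover_events(sample)):
        print(event["unit"], event["msg_id"], event["text"])
```

Note that messages without a unit tag, such as the 411002 line protocol message, are skipped by this filter even though they can point to the root cause, so it is worth reading the untagged lines just before a role change as well.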
Unfortunately, logging time stamps were not configured, so the firewall didn't add any date and time information to the messages in its own buffer. (This is true only for PIX 7.x or later; earlier releases don't add time stamps to buffered logging messages.) This doesn't really matter, because these are the final recorded messages and must have occurred right before the last time the unit entered the standby role.

Intervening in a Failover Election

Cisco firewalls do not toggle their roles as failures come and go. For example, if an active unit fails, it automatically enters the standby role. Even if its failure is cured, it does not resume its former active role. The idea is that after a failure has taken place, a network administrator needs to diagnose the problem and fix it.

After a failed firewall unit is repaired, you have to manually inform the pair that they need to reset the failed condition in their failover status. When a problem is resolved on a failed standby unit, you can use the failover reset command on the active unit. Both units recognize that the standby unit has become "unfailed" and resume normal failover communication.

Sometimes you might need to manually toggle the roles. This might be necessary if you need to perform some maintenance on one unit, but it is unfortunately already in the active role. You can approach this in two ways:
Checking Firewall Interfaces

You can use the show traffic command to see throughput information about firewall interfaces, but you can monitor other interface statistics as well. You can use the show interface command to see a wealth of information about the interface operation, many types of error conditions, and packet buffering.

As with the Cisco IOS software, the show interface command can produce such a condensed dump of interface parameters that it becomes difficult to interpret. To make this easier, think of the command output as being broken into various sections. Figure 10-3 shows an example of the show interface command in PIX 7.x. Other software releases show similar information but are organized slightly differently. Only the "inside" interface is shown, for clarity. If you just glance through the lines of output, looking for any glaring error condition, you are unlikely to find anything.

Figure 10-3. A Breakdown of Information Presented by the show interface Command

As the figure shows, the lines of output are organized into groups of related information. Each of these groups is explained in detail in the following sections to correspond with the figure. The parameters in each group appear in table format for quick reference. The sample values in Figure 10-3 are shown, along with an explanation of the parameter.

Interface Name and Status

Table 10-3 describes the interface name and status information displayed by the show interface command as depicted in Figure 10-3.
Interface Control

Table 10-4 describes the interface control information displayed by the show interface command as depicted in Figure 10-3.
The interface bandwidth or speed shown should match that of the network device on the other end of the connection. However, be aware of problems caused by duplex mode configuration. Duplex mode must be configured identically on the firewall and the network device to which it connects.

Duplex mode can be autonegotiated only if the other end of the connection is also set to autonegotiate. Otherwise, autonegotiation is unable to "sense" what mode the other side is using, and duplex mode defaults to half-duplex.

TIP You can use the interface information to troubleshoot duplex mismatch problems. First, look at the duplex setting that is reported. If the interface is set to Auto-duplex and the actual mode is half-duplex, most likely the far end of the connection is not configured for autonegotiation. The firewall then attempts to negotiate duplex mode and falls back to half-duplex as a default.

In addition, nonzero values for collisions, late collisions, and output errors can also indicate a duplex mismatch. If one end is using full-duplex mode and the other end half-duplex mode, there is a good chance that the two devices will attempt to transmit at the same time and cause a frame collision. The resulting data becomes scrambled and appears as an output error.

Best practice is to hard-code or configure specific interface speed and duplex settings to avoid any problems with misconfiguration or autonegotiation.

Interface Addresses

Table 10-5 describes the interface address information displayed by the show interface command as depicted in Figure 10-3.
The MAC address shown is the burned-in address (BIA) that is preprogrammed on the interface. If failover is enabled, the MAC address changes according to the firewall's failover role. The active unit always takes on the MAC address of the primary unit's interface BIA. The standby unit always takes on the MAC address defined for the secondary unit. You can also configure both units to have specific MAC addresses with the failover mac address command.

The IP address shown is the address configured for the interface. If failover is enabled, the active unit takes on the IP address configured for the primary unit's interface. The standby unit takes on the IP address configured for the secondary unit's interface.

The MTU value shown defaults to 1500 bytes for Ethernet interfaces and can be configured with the mtu command. In PIX 7.x, the interface MTU is shown as MTU not set if the interface is shut down or if it has not been configured with a logical name with the nameif command.

Inbound Packet Statistics

Table 10-6 describes the packet statistics information displayed by the show interface command as depicted in Figure 10-3.
Each of these counters is accumulated from the time the firewall was booted or from the time the interface counter was last cleared with the clear interface [if_name] [stats] command.

The last three input errors (overrun, ignored, and abort) are traditional values that have been carried over from the Cisco IOS software on routers. You typically don't see these error counts increase because of the firewall queuing strategy. This is described further in the "Packet Queue Status" section a bit later.

Outbound Packet Statistics

Table 10-7 describes the outbound packet statistical information displayed by the show interface command as depicted in Figure 10-3.
Each of these outbound counters is accumulated from the time the firewall was booted or from the time the interface counter was last cleared with the clear interface [if_name] [stats] command. You might also see the counters listed in Table 10-8 displayed, depending on the firewall platform and the type of connection.
Beginning with PIX 7.x in multiple-context security mode, the show interface command presents a shortened amount of information when it is used from a user context. For example, the following output was produced from the admin context:

    Firewall/admin# show interface outside
    Interface Ethernet0 "outside", is up, line protocol is up
    MAC address 00a0.c900.0101, MTU 1500
    IP address 192.168.93.138, subnet mask 255.255.255.128
    Received 7299 packets, 787753 bytes
    Transmitted 7398 packets, 790589 bytes
    Dropped 0 packets
    Firewall/admin#

Notice that user context interfaces are known only by their logical names (outside, for example). As well, only the addressing and data counter information is shown. The interface error counters are reserved for the system execution space, where the actual physical interfaces are configured.

When an interface is configured as a VLAN trunk, you might see some additional information from the show interface command. PIX 7.x, for example, might produce the following information:

    Received 214095 VLAN untagged packets, 174551017 bytes
    Transmitted 103055 VLAN untagged packets, 10106195 bytes
    Dropped 1456 VLAN untagged packets

Here, VLAN untagged packets represents packets that are sent or received over the native (untagged) VLAN on the 802.1Q trunk.

PIX 6.3 can produce the following output for a trunk interface, including a breakdown of traffic activity on specific VLANs:

    279 aggregate VLAN packets input, 110463 bytes
    87 aggregate VLAN packets output, 6412 bytes
    8 vlan41 packets input, 540 bytes
    0 vlan41 packets output, 0 bytes
    0 invalid VLAN ID errors, 6 native VLAN errors

Beginning with PIX 7.x, if an interface is not completely configured, you might see some additional information.
If the nameif command has not yet been used to assign a logical name to an interface, the following line is shown from the show interface command:

    Available but not configured via nameif

Finally, in multiple-context security mode, physical interfaces must be mapped from the system execution space to the appropriate user contexts. If an interface has not been mapped with the allocate-interface context configuration command, the following line is shown:

    Available for allocation to a context

Packet Queue Status

A Cisco firewall uses several different buffers or queues as it handles packets in a network. To monitor a firewall's performance, it is a good idea to become familiar with the buffering process and statistics.

Figure 10-4 illustrates how packets arriving on an interface are queued for inspection and how other inspected packets are queued for transmission on the interface.

Figure 10-4. Firewall Interface Packet Queues

Each firewall interface has its own inbound and outbound queues, arranged as a hardware queue and a software queue in each direction. Beginning with PIX 7.x, the outbound software queue is known as the best-effort queue (BEQ). Each interface also has an outbound priority queue, the low-latency queue (LLQ), that can be used to forward high-priority traffic. The LLQ is always serviced before the BEQ.

Basically, incoming packets arrive from the physical interface and go into a hardware interface queue (if one is present) on the firewall platform. If that queue overflows before packets can be emptied for inspection, new packets are pushed into the input software queue.

In the outbound direction, the process is similar but reversed. As packets are inspected and approved for forwarding, they are moved into an output queue. Before PIX 7.x, packets were copied right into the output hardware queue if there was room. If not, they went into the output software queue. All packet delivery was done on a best-effort basis, with no quality of service possible.
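The rule that the LLQ is always serviced before the BEQ can be pictured with a small model. The following Python sketch is a toy illustration only and does not represent the firewall's internal implementation. It ignores the hardware queue, queue size limits, and packet drops, and the class name OutboundQueues is hypothetical.

```python
from collections import deque


class OutboundQueues:
    """Toy model of one interface's outbound LLQ and BEQ (hypothetical)."""

    def __init__(self):
        self.llq = deque()  # low-latency (priority) queue
        self.beq = deque()  # best-effort queue

    def enqueue(self, packet, priority=False):
        """Place a packet in the LLQ if it is priority traffic, else in the BEQ."""
        if priority:
            self.llq.append(packet)
        else:
            self.beq.append(packet)

    def dequeue(self):
        """Return the next packet to transmit: LLQ first, then BEQ, else None."""
        if self.llq:
            return self.llq.popleft()
        if self.beq:
            return self.beq.popleft()
        return None


if __name__ == "__main__":
    queues = OutboundQueues()
    queues.enqueue("web-1")
    queues.enqueue("voice-1", priority=True)
    queues.enqueue("web-2")
    packet = queues.dequeue()
    while packet is not None:
        print(packet)
        packet = queues.dequeue()
```

In this model, the priority packet leaves first even though it arrived after a best-effort packet, and the best-effort packets keep their original order among themselves.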
Beginning with PIX 7.x, outbound packets can be moved into an output BEQ or an output LLQ, depending on the results of a service policy. If priority queuing is enabled, packets are always pulled from the LLQ before the BEQ is serviced. If priority queuing is not enabled, all outbound packets go into the BEQ.

Hardware and software queue statistics are reported through the show interface command, as in the following example:

    Firewall# show interface gigabitethernet0
    Interface GigabitEthernet0 "outside", is up, line protocol is up
    Hardware is i82542 rev03, BW 1000 Mbps (Full-duplex), Auto-Speed(1000 Mbps)
    Description: Outside Public Network
    MAC address 0003.4708.ec54, MTU 1500
    IP address 10.1.1.1, subnet mask 255.255.255.0
    478686 packets input, 591360505 bytes, 0 no buffer
    Received 828 broadcasts, 0 runts, 0 giants
    0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
    332357 packets output, 23589813 bytes, 0 underruns
    0 output errors, 0 collisions, 0 interface resets
    0 babbles, 0 late collisions, 0 deferred
    0 lost carrier, 0 no carrier
    input queue (curr/max blocks): hardware (0/25) software (0/0)
    output queue (curr/max blocks): hardware (3/122) software (0/0)

Each firewall interface has both input and output queues; the current state of the queues is displayed as a ratio of current/maximum blocks used. Table 10-9 lists the values reported in the preceding command output and describes each value.
It is common to see the hardware queue (either input or output) reported to have activity. You should see a nonzero number for the maximum queue level when the hardware queue has been used. However, you should see nonzero values for the software queue only when the hardware queue has been full sometime in the past. This isn't necessarily a bad thing.

If you see large values reported for a software queue, and the current number consistently stays close to the maximum number, your firewall CPU is having trouble keeping up with the interface load.

TIP Fast Ethernet firewall interfaces (10/100) always report an inbound hardware queue statistic of 128/128. As well, you always see some inbound software queue activity. This indicates that this type of interface doesn't use a hardware queue. Instead, all inbound packets are copied into the software queue directly. This is not true for the Fast Ethernet outbound queues, which use both hardware and software queues.

Outbound priority queues are available only beginning with PIX software release 7.0. The show interface command doesn't report on the outbound priority queue. Instead, you can use the following command to view output queue statistics:

    Firewall# show priority-queue statistics [if_name]

This command displays the current statistics about both the priority queue (LLQ) and the best-effort queue (BEQ) of a firewall interface. These are shown as the queue type in the command output. If no interface name is given, all interfaces are shown.

For example, the following statistics resulted from a firewall's outside interface:

    Firewall# show priority-queue statistics outside
    Priority-Queue Statistics interface outside
    Queue Type = BE
    Packets Dropped = 0
    Packets Transmit = 132213
    Packets Enqueued = 0
    Current Q Length = 0
    Max Q Length = 0
    Queue Type = LLQ
    Packets Dropped = 0
    Packets Transmit = 1826
    Packets Enqueued = 5
    Current Q Length = 0
    Max Q Length = 32
    Firewall#

Table 10-10 lists the fields displayed in the command output.
The descriptions pertain to the priority queue.
Sometimes you might see a larger number in the Packets Transmit counter than the Packets Enqueued counter. That might seem odd, because outbound priority packets should be put into the LLQ before being transmitted. The difference is that some firewall platforms can write priority packets into the output hardware queue directly, so they don't actually pass through the LLQ first. |
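The gap between the Packets Transmit and Packets Enqueued counters can be computed from captured show priority-queue statistics text. The following Python sketch is offered as an illustration under stated assumptions: the output has been saved as text with one "field = value" pair per line, and the function names parse_priority_queue_stats and transmitted_without_enqueue are hypothetical.

```python
import re

# Counter values copied from the example output for the outside interface.
# Assumption: each "field = value" pair is on its own line.
SAMPLE = """\
Priority-Queue Statistics interface outside
Queue Type = BE
Packets Dropped = 0
Packets Transmit = 132213
Packets Enqueued = 0
Current Q Length = 0
Max Q Length = 0
Queue Type = LLQ
Packets Dropped = 0
Packets Transmit = 1826
Packets Enqueued = 5
Current Q Length = 0
Max Q Length = 32
"""


def parse_priority_queue_stats(text):
    """Group the counters by queue type (hypothetical helper)."""
    queues = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s*=\s*(\S+)\s*$", line)
        if not m:
            continue  # skip the header line and any blank lines
        field, value = m.group(1), m.group(2)
        if field == "Queue Type":
            current = queues.setdefault(value, {})
        elif current is not None:
            current[field] = int(value)
    return queues


def transmitted_without_enqueue(counters):
    """Packets counted as transmitted that never passed through the software queue."""
    return max(0, counters["Packets Transmit"] - counters["Packets Enqueued"])


if __name__ == "__main__":
    stats = parse_priority_queue_stats(SAMPLE)
    for queue_type, counters in stats.items():
        print(queue_type, "dropped:", counters["Packets Dropped"],
              "bypassed queue:", transmitted_without_enqueue(counters))
```

For the sample LLQ counters, 1826 packets were transmitted but only 5 were enqueued, so 1821 packets appear to have been written to the output hardware queue directly. A nonzero Packets Dropped value is the counter that deserves closer attention when tracking queue health over time.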
|