Using Network Performance Data | UNIX Fault Management: A Guide for System Administrators


This section provides a brief introduction to using the network performance monitoring tools to detect and address network performance problems. Various types of network performance problems are possible: an overloaded server, a congested network, or a faulty network masquerading as an overloaded network. You can use the performance tools to avoid these problems, if possible, or at least to correct them quickly when they occur.

Avoiding Performance Issues

To avoid network performance issues, key networks should be monitored regularly to look for trends. You should first use tools such as PerfView and NetMetrix to collect baseline data from when the computing environment is behaving normally. You may want to put some NFS commands into scripts to create your own benchmarks. Be sure that your mixture of NFS commands reflects the typical usage for your environment. In addition to benchmarking utilization rates, you should determine what collision rates are typical by using lanadmin or MeasureWare. Data should be collected over a period of time to get an accurate view of the network load. As loads increase, you should be prepared to segregate network traffic, add more LAN cards, or restrict the number of new applications being used on the network.

Network performance problems can be caused by an increase in the number of users or the introduction of a new application into the production environment. By proactively collecting performance information, you can identify when a jump in usage occurred and you may be able to trace it to a new application or a user's inappropriate use of the network. Network analyzers and tools such as NetMetrix can be used to identify the applications and users dominating network traffic. A new network application initially should be deployed in a test environment so that its network load can be characterized. If its network usage is unexpectedly high, it can be redesigned before deployment into the production environment, or the production environment can be modified to support the new workload.
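One way to spot a jump in usage is to compare recent samples against the baseline you collected during normal operation. The sketch below is only an illustration: the function name first_jump, the sample values, and the rule of flagging anything more than k standard deviations above the baseline mean are all assumptions, not something the monitoring tools do for you.

```python
from statistics import mean, stdev

def first_jump(baseline, recent, k=3.0):
    """Return the index of the first recent sample that exceeds the
    baseline mean by more than k standard deviations, or None.

    baseline and recent are lists of utilization samples (percent).
    The k-sigma rule is an assumed heuristic, not a tool default.
    """
    threshold = mean(baseline) + k * stdev(baseline)
    for i, value in enumerate(recent):
        if value > threshold:
            return i
    return None

# Hypothetical utilization samples, in percent of link capacity.
baseline = [10, 12, 11, 9, 10, 12]
recent = [11, 10, 13, 25, 27]
jump_at = first_jump(baseline, recent)  # index of the first unusual sample
```

Knowing which sample first crossed the threshold gives you a time to correlate with application rollouts or changes in user activity.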

Detecting Overloaded Network Servers

If the server performing a network service, such as NFS, is overloaded, it won't be able to handle network traffic effectively. The problem could be a CPU, memory, or disk I/O bottleneck, or an inability of the server's NIC to handle the load.

To check for a CPU bottleneck on the network server, check the CPU utilization and run queue length. GlancePlus/UX can provide this information. CPU utilization greater than 95 percent and a high processor run queue length may indicate a CPU bottleneck. One option is to increase the capacity of the server by adding more processors, but this is not always possible. Adding processors also may not fix the problem if the service is single-threaded.

To check for a real memory bottleneck on the network server, you can first check the amount of free memory. It should not drop below 5 percent of the total available. If the system can't keep up with the demands for memory, it will start paging and swapping. Excessive paging and swapping, viewed from GlancePlus, may be a sign of a memory bottleneck. Increasing the capacity of the system by adding more memory may eliminate the bottleneck.

You can check for an I/O imbalance on the network server by using tools such as iostat. High activity on the system disk is normal, but delays among nonsystem disks should be roughly the same. You may want to move files around to balance the disk load if an imbalance exists. You may also want to check your system's buffer cache hit ratio to see whether your buffer cache size is too small. Tools such as MeasureWare and BMC PATROL can provide information about your system buffer cache. If all of your disks are more than 75-percent utilized, then you are disk-bound and may need faster or additional disks.
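The server checks described above can be collected into one routine. The following is a minimal sketch that assumes you have already read the figures from tools such as GlancePlus and iostat; the ServerSample type and the run_queue_limit value are hypothetical, because the text says only that the run queue should not be "high".

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ServerSample:
    cpu_util_pct: float          # CPU utilization, percent
    run_queue: float             # processor run queue length
    free_mem_pct: float          # free memory, percent of total
    disk_util_pcts: List[float]  # utilization of each disk, percent

def find_bottlenecks(sample, run_queue_limit=3.0):
    """Return the resources that look like bottlenecks.

    Thresholds of 95, 5, and 75 percent come from the text;
    run_queue_limit is an assumed value for a "high" run queue.
    """
    findings = []
    if sample.cpu_util_pct > 95 and sample.run_queue > run_queue_limit:
        findings.append("cpu")
    if sample.free_mem_pct < 5:
        findings.append("memory")
    # Disk-bound only when every disk is above 75 percent utilized.
    if sample.disk_util_pcts and all(u > 75 for u in sample.disk_util_pcts):
        findings.append("disk")
    return findings
```

A single sample is only a hint; a finding should persist over several samples before you act on it.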

Detecting Network Congestion

For Ethernet links, you can use the collision rate to determine whether a performance problem is due to congestion. The counters reported by lanadmin can be used to calculate the collision rate: divide the number of collisions by the number of outbound packets. A collision rate consistently greater than 5 to 10 percent indicates a congested network. You can also calculate the average collision rate for a network by totaling the number of collisions on all systems on the network and then dividing by the total number of output packets of those systems. If this average collision rate is greater than 10 percent, it is another sign that you need to take some action.
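The arithmetic is simple enough to script. The sketch below assumes you have copied the collision and outbound packet counts from each system by hand; it does not parse any tool's output, and the function names are illustrative. The network-wide average is taken as total collisions over total output packets, so that it is a rate in the same sense as the per-system figure.

```python
def collision_rate(collisions, outbound_packets):
    """Collision rate in percent for one interface."""
    if outbound_packets <= 0:
        return 0.0  # no traffic sent, so no meaningful rate
    return 100.0 * collisions / outbound_packets

def network_collision_rate(per_host):
    """Average collision rate in percent for a whole network.

    per_host is a list of (collisions, outbound_packets) pairs,
    one pair per system on the network.
    """
    total_collisions = sum(c for c, _ in per_host)
    total_outbound = sum(o for _, o in per_host)
    return collision_rate(total_collisions, total_outbound)

# Hypothetical counters from two systems on the same segment.
hosts = [(50, 1000), (250, 1000)]
average = network_collision_rate(hosts)  # above 10 percent calls for action
```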

Network utilization averaging more than 35 to 40 percent of capacity on a shared medium is another sign of network congestion. If the utilization rate persists, it is likely to lead to collisions on the network.
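Utilization can be estimated from two readings of a byte counter taken a known interval apart. This is a rough sketch under stated assumptions: the counter covers all traffic seen on the shared segment, it did not wrap between readings, and you know the nominal link speed. The function name and parameters are hypothetical.

```python
def utilization_pct(octets_start, octets_end, interval_s, link_bps):
    """Average utilization, in percent of capacity, over an interval.

    octets_start, octets_end: byte counter readings (assumed not to wrap)
    interval_s: seconds between the two readings
    link_bps: nominal link speed in bits per second
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_sent = (octets_end - octets_start) * 8
    return 100.0 * bits_sent / (interval_s * link_bps)

# Hypothetical example: 30,000,000 bytes in 60 seconds on 10 Mbit/s Ethernet.
busy = utilization_pct(0, 30_000_000, 60, 10_000_000) >= 35.0
```

Because the guideline refers to average utilization, compute the figure over many intervals rather than relying on one reading.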

As mentioned, lanadmin can be used to identify network congestion. Compare the current collision rate to the rate that you benchmarked during normal operation. Specifying netstat -a shows statistics for each open connection, including the send queue, which indicates the number of packets waiting to be sent. This should be 0 for most connections. If it is a large number, it could be an indication of network congestion. You should check this over a period of time to see whether the problem persists.
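To judge whether a send queue backlog persists, you can record the send queue value for a connection at regular intervals and look at how often it is nonzero. The sketch below assumes the values have already been collected; the function name and the min_fraction cutoff are hypothetical choices, and no netstat output is parsed because its layout varies between systems.

```python
def persistent_backlog(send_q_samples, min_fraction=0.8):
    """Return True if the send queue was nonzero in at least
    min_fraction of the samples taken for one connection.

    min_fraction is an assumed cutoff, not a documented threshold.
    """
    if not send_q_samples:
        return False
    nonzero = sum(1 for q in send_q_samples if q > 0)
    return nonzero / len(send_q_samples) >= min_fraction

# Hypothetical send queue readings taken a minute apart.
brief_spike = persistent_backlog([0, 0, 512, 0])          # one busy moment
stuck_queue = persistent_backlog([1024, 2048, 512, 4096]) # never drains
```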

If you determine that a network is congested, and the network is being used appropriately, then you may need to partition the network into subnetworks and use gateways to isolate independent streams of data. (Techniques for identifying the appropriate location for a LAN bridge or gateway are beyond the scope of this book.) If configured properly, this can reduce the collisions on shared media and thus improve network performance. You should take advantage of any planned downtime to reconfigure networks, add LAN cards, or move applications to other systems on different networks.

Information about NFS usage can be shown by using nfsstat. Specifying nfsstat -c shows the client's NFS statistics. If the number of bad XID packets is approximately equal to the number of retransmissions, then one of the client's NFS servers is having trouble keeping up with the workload and is forcing the client to resend requests. Check each of the NFS servers to see whether any have system resource bottlenecks, as discussed earlier in this section. The disk is the most common resource bottleneck for an NFS server. If the nfsstat output shows that only the number of retransmissions is high, then the network is congested or faulty.
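The rule of thumb for reading the client statistics can be written as a small decision function. This is a sketch under assumptions: the three counts are copied by hand from the client statistics, and the high_retrans_pct and tolerance values are illustrative numbers for "high" and "approximately equal", which the text does not quantify.

```python
def diagnose_nfs_client(calls, retrans, badxid,
                        high_retrans_pct=5.0, tolerance=0.25):
    """Classify NFS client statistics.

    calls, retrans, badxid: total calls, retransmissions, and bad XID
    counts. high_retrans_pct and tolerance are assumed cutoffs.
    """
    if calls <= 0:
        return "no data"
    if 100.0 * retrans / calls < high_retrans_pct:
        return "ok"
    # Bad XIDs roughly equal to retransmissions: a server is too slow
    # and its late replies arrive after the client has already resent.
    if abs(badxid - retrans) <= tolerance * retrans:
        return "server"
    # Only retransmissions are high: requests or replies are being lost.
    if badxid < retrans:
        return "network"
    return "inconclusive"
```

A result of "server" points you back to the server bottleneck checks, while "network" points to congestion or a fault on the path between client and server.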
