Maximizing the Performance of an Ethernet NIC Interface


There are many ways to maximize the performance of your Ethernet NIC interface, and a few tools are valuable in achieving that. The ndd parameters and kernel statistics provide a means to get the best out of your NIC, while other tools let you examine overall system behavior and establish whether further tuning can make better use of the system as well as the NIC.

The starting point for this discussion is the physical layer because that layer is the most important with respect to creating the link between two systems. At the physical layer, failures can prevent the link from coming up. Or worse, the link comes up and the duplex is mismatched, giving rise to less-visible problems. Then the discussion will move to the data link layer, where most problems are performance related. During that discussion, the architecture features described above can be used to address many of these performance problems.

Ethernet Physical Layer Troubleshooting

The range of possible problems at the physical layer is large, from a missing cable to a duplex mismatch. The key tool for examining the physical layer is the kstat command. See the kstat man page.

The first step in checking the physical layer is to check if the link is up.

 kstat ce:0 | grep link_
     link_asmpause                   0
     link_duplex                     2
     link_pause                      0
     link_speed                      1000
     link_up                         1

If the link_up variable is set, then things are positive, and a physical connection is present. But also check that the speed matches your expectation. For example, if the interface is a 1000BASE-T interface and you expect it to run at 1000 Mbit/sec, then the link_speed statistic should indicate 1000. If it does not, a check of the link partner capabilities might be required to establish whether they are the limiting factor. The following kstat command line shows the link partner capabilities:

 kstat ce:0 | grep lp_cap
     lp_cap_1000fdx             1
     lp_cap_1000hdx             1
     lp_cap_100T4               1
     lp_cap_100fdx              1
     lp_cap_100hdx              1
     lp_cap_10fdx               1
     lp_cap_10hdx               1
     lp_cap_asmpause            0
     lp_cap_autoneg             1
     lp_cap_pause               0

If the link partner appears to be capable of the desired speed, then the problem might be local. There are two possibilities: either the NIC itself is not capable of the desired speed, or the configuration has no shared capabilities that can be agreed on, in which case the link will not come up. You can check this using the following kstat command line.

 kstat ce:0 | grep cap_
     cap_1000fdx               1
     cap_1000hdx               1
     cap_100T4                 1
     cap_100fdx                1
     cap_100hdx                1
     cap_10fdx                 1
     cap_10hdx                 1
     cap_asmpause              0
     cap_autoneg               1
     cap_pause                 0
     .....

If all the required capabilities are available for the desired speed and duplex, yet there remains a problem with achieving the desired speed, the only remaining possibility is an incorrect configuration. You can check this by looking at individual ndd adv_cap_* parameters or you can use the kstat command:

 kstat ce:0 | grep adv_cap_
     adv_cap_1000fdx              1
     adv_cap_1000hdx              1
     adv_cap_100T4                1
     adv_cap_100fdx               1
     adv_cap_100hdx               1
     adv_cap_10fdx                1
     adv_cap_10hdx                1
     adv_cap_asmpause             0
     adv_cap_autoneg              1
     adv_cap_pause                0

Configuration issues are where most problems lie. Most of them can be addressed by using the kstat commands above to establish the local and remote configuration, and then adjusting the adv_cap_* parameters with ndd to correct the problem.

The most common configuration problem is duplex mismatch, which is induced when one side of a link is auto-negotiating and the other is not. Running with auto-negotiation disabled is known as Forced mode, and it can only be guaranteed for 10/100 Mode operation. For 1000BASE-T UTP operation, Forced mode (auto-negotiation disabled) is not guaranteed because not all vendors support it.

If Auto-negotiation is turned off, you must ensure that both ends of the connection are in Forced mode and that the speed and duplex match exactly.
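For example, to force a link to 100 Mbit/sec full duplex, the Solaris side needs something like the following, with the matching forced settings applied on the switch port. The adv_*_cap parameter names shown follow the convention used in the ge examples later in this section; the exact names vary by driver, so list your driver's parameters first with ndd /dev/ce \? before setting anything.

 hostname# ndd -set /dev/ce instance 0
 hostname# ndd -set /dev/ce adv_autoneg_cap 0
 hostname# ndd -set /dev/ce adv_100fdx_cap 1
 hostname# ndd -set /dev/ce adv_100hdx_cap 0

Any other advertised speed capabilities should also be set to 0 so that only the forced speed and duplex remain.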

If you fail to match Forced mode in gigabit operation, the link will not come up at all. Note that this result is quite different from the 10/100 Mode case: in 10/100 Mode operation, if only one end of the connection is auto-negotiating (with full capabilities advertised), the link comes up at the correct speed, but the auto-negotiating end always falls back to half duplex, creating the potential for a duplex mismatch if the forced end is set to full duplex.

If both sides are set to Forced mode and you fail to match speeds, the link will never come up.

If both sides are set to Forced mode and you fail to match duplex, the link will come up, but you will have a duplex mismatch.

Duplex mismatch is a silent failure. From an upper-layer point of view it appears as very poor performance: many packets are lost to collisions and late collisions on the half-duplex end of the connection, caused by the full-duplex end violating the Ethernet protocol.

The half-duplex end experiences collisions and late collisions, while the full-duplex end experiences all manner of corrupted packets, causing the MIB counters for CRC errors, runts, giants, and alignment errors to increment.

If the node experiencing poor performance is the half duplex end of the connection, you can look at the kstat values for collisions and late_collisions.

 kstat ce:0 | grep collisions
     collisions                      22332
     late_collisions                 15432

If the node experiencing poor performance is the full duplex end of the connection, you can look at the packet corruption counters, for example, crc_err, alignment_err.

 kstat ce:0 | grep crc_err
     crc_err                         22332

 kstat ce:0 | grep alignment_err
     alignment_err                   224532

Depending on the capability of the switch end or remote end of the connection, it may be possible to do similar measurements there.

While Forced mode has the problem of creating a potential duplex mismatch, it also has the drawback of hiding the link partner capabilities from the local station. In Forced mode, you cannot view the lp_cap_* values to determine the capabilities of the remote link partner locally.

Where possible, use the default of Auto-negotiation with all capabilities advertised and avoid tuning the physical link parameters.

Given the maturity of the Auto-negotiation protocol and its requirement in the 802.3 specification for gigabit twisted-pair (1000BASE-T) physical implementations, ensure that Auto-negotiation is enabled.

Deviation from General Ethernet MII/GMII Conventions

We must address some remaining deviations from the general Ethernet MII/GMII kernel statistics.

In the case of the ge interface, all of the statistics for local capabilities and link partner capabilities are available only as read-only ndd parameters, so they cannot be read using the kstat command described previously, although the debug mechanism is still valid.

To read the corresponding link partner capabilities on ge, use the following commands:

 hostname# ndd -set /dev/ge instance 0
 hostname# ndd -get /dev/ge lp_1000fdx_cap

Or you could use the interactive mode, described previously. The mechanism used for enabling Ethernet Flow control on the ge interface is also different, using the parameters in the table below.

Table 34. Physical Layer Configuration Properties

 Parameter      Values   Description
 adv_pauseTX    0-1      Transmit a pause frame if the Rx buffer is full.
 adv_pauseRX    0-1      When a pause frame is received, slow down the Tx.


There is also a deviation in ge for adjusting ndd parameters. For example, when you modify ndd parameters such as adv_1000fdx_cap, the changes do not take effect until the adv_autoneg_cap parameter is toggled (from 0 to 1 or from 1 to 0). This is a deviation from the general Ethernet MII/GMII convention that ndd changes take effect immediately.
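For example, to stop advertising 1000 Mbit/sec full duplex on ge, the change must be followed by a toggle of adv_autoneg_cap before it takes effect (this assumes auto-negotiation was enabled, that is, adv_autoneg_cap was 1):

 hostname# ndd -set /dev/ge instance 0
 hostname# ndd -set /dev/ge adv_1000fdx_cap 0
 hostname# ndd -set /dev/ge adv_autoneg_cap 0
 hostname# ndd -set /dev/ge adv_autoneg_cap 1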

Ethernet Performance Troubleshooting

Ethernet performance troubleshooting is device specific because not all devices have the same architecture capabilities. Therefore, troubleshooting performance issues has to be tackled on a per-device basis.

The following Solaris™ tools aid in the analysis of performance issues:

  • kstat to view device-specific statistics

  • mpstat to view system utilization information

  • lockstat to show areas of contention

You can use the information from these tools to tune specific parameters. The tuning examples that follow describe where this information is most useful.

You have two options for tuning: using the /etc/system file or the ndd utility.

Using the /etc/system file to modify the initial value of the driver variables requires a system reboot for the changes to take effect.

If you use the ndd utility for tuning, the changes take effect immediately. However, any modifications you make using the ndd utility will be lost when the system goes down. If you want the ndd tuning properties to persist through a reboot, add these properties to the respective driver.conf file.
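For example, the two persistence styles might look like the following. The /etc/system syntax of set driver:variable=value is standard, while the exact property names accepted in a driver.conf file are driver specific, so treat the ce.conf line as an assumption to verify against your driver's documentation.

 * /etc/system entry: initial value of a driver variable (takes effect after reboot)
 set ce:ce_taskq_disable=1

 # /kernel/drv/ce.conf entry: persistent ndd-style property (assumed form)
 infinite-burst=1;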

Parameters that have kernel statistics but have no capability to tune for improvement are omitted from this discussion because no troubleshooting capability is provided in those cases.

ge Gigabit Ethernet

The ge interface provides some kstats that can be used to measure performance bottlenecks in the driver on the Tx or Rx side. These kstats allow you to decide what corrective tuning to apply, based on the tuning parameters described previously. The useful statistics are shown in TABLE 5-72.

Table 5-72. List of ge Specific Interface Statistics

 kstat name        Type      Description
 rx_overflow       counter   Number of times the hardware is unable to receive a packet because the internal FIFOs are full.
 no_free_rx_desc   counter   Number of times the hardware is unable to post a packet because no Rx descriptors are available.
 no_tmds           counter   Number of times transmit packets are posted on the driver streams queue for later processing by the queue's service routine.
 nocanput          counter   Number of times a packet is dropped by the driver because the module above the driver cannot accept it.
 pci_bus_speed     value     The PCI bus speed that is driving the card.


When rx_overflow is incrementing, packet processing is not keeping up with the packet arrival rate. If rx_overflow is incrementing and no_free_rx_desc is not, the PCI bus or SBus is presenting an issue to the flow of packets through the device. This could be because the ge card is plugged into a slower I/O bus. You can confirm the bus speed by looking at the pci_bus_speed statistic. An SBus speed of 40 MHz or a PCI bus speed of 33 MHz might not be sufficient to sustain full bidirectional one-gigabit Ethernet traffic.

Another scenario that can lead to rx_overflow incrementing on its own is sharing the I/O bus with another device that has similar bandwidth requirements to those of the ge card.

These scenarios are hardware limitations. There is no solution for SBus. For the PCI bus, a first step in addressing them is to enable the infinite burst capability on the PCI bus, using the /etc/system tuning parameter ge_dmaburst_mode.
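A sketch of the corresponding /etc/system entry follows, assuming that a value of 1 selects infinite burst mode; verify the supported values for your ge driver release.

 * /etc/system: enable infinite burst DMA for ge (reboot required)
 set ge:ge_dmaburst_mode=1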

Alternatively, you can reorganize the system to give the ge interface a 66-MHz PCI slot, or you can separate devices that contend for a shared bus segment by giving each of them a bus segment.

The probability that an incrementing rx_overflow is the only problem is small. Typically, Sun systems have a fast PCI bus and memory subsystem, so delays are seldom induced at that level. It is more likely that the protocol stack software falls behind, leaving the Rx descriptor ring exhausted of free elements with which to receive more packets. If this happens, the kstat parameter no_free_rx_desc begins to increment; on a single-CPU system, this means the CPU cannot absorb the incoming packets. If more than one CPU is available, it is still possible to overwhelm a single CPU. But given that the Rx processing can be split using the alternative Rx data delivery models provided by ge, it might be possible to distribute the processing of incoming packets to more than one CPU. You can do this by first ensuring that ge_intr_mode is not set to 1, and by tuning ge_put_cfg to enable the load-balancing worker thread or streams service routine.
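A sketch of the corresponding /etc/system entries follows. The value that makes ge_put_cfg select the load-balancing worker thread rather than the streams service routine is driver specific, so the value shown is only illustrative.

 * /etc/system: take ge out of interrupt-only Rx processing
 set ge:ge_intr_mode=0
 * illustrative value -- confirm which ge_put_cfg setting enables the worker thread
 set ge:ge_put_cfg=1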

Another possible scenario is that the ge device adequately handles the rate of incoming packets, but the upper layer cannot deal with the packets at that rate. In this case, the nocanput kstat will be incrementing. Tuning for this condition lies in the upper layer protocols. If you are running the Solaris 8 operating system or an earlier version, upgrading to the Solaris 9 operating system might reduce nocanput errors, thanks to its improved multithreading and IP scalability.

While the Tx side is also subject to an overwhelmed condition, this is less likely than any Rx-side condition. If the Tx side is overwhelmed, the no_tmds parameter begins to increment. The Tx descriptor ring size can be increased with the /etc/system tunable parameter ge_nos_tmd.
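For example, a hedged /etc/system sketch with an illustrative ring size; check the default and maximum supported by your ge driver before changing it.

 * /etc/system: illustrative Tx descriptor ring size for ge
 set ge:ge_nos_tmd=1024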

ce Gigabit Ethernet

The ce interface provides a far more extensive list of kstats that can be used to measure performance bottlenecks in the driver on the Tx or Rx side. These kstats allow you to decide what corrective tuning to apply, based on the tuning parameters described previously. The useful statistics are shown in TABLE 5-73.

Table 5-73. List of ce Specific Interface Statistics

 kstat name        Type      Description
 rx_ov_flow        counter   Number of times the hardware is unable to receive a packet because the internal FIFOs are full.
 rx_no_buf         counter   Number of times the hardware is unable to receive a packet because no Rx buffers are available.
 rx_no_comp_wb     counter   Number of times the hardware is unable to receive a packet because there is no space in the completion ring to post a received-packet descriptor.
 ipackets_cpuXX    counter   Number of packets directed to load-balancing thread XX.
 mdt_pkts          counter   Number of packets sent using the Multidata interface.
 rx_hdr_pkts       counter   Number of packets arriving that are less than 252 bytes in length.
 rx_mtu_pkts       counter   Number of packets arriving that are greater than 252 bytes in length.
 rx_jumbo_pkts     counter   Number of packets arriving that are greater than 1522 bytes in length.
 rx_nocanput       counter   Number of times a packet is dropped by the driver because the module above the driver cannot accept it.
 rx_pkts_dropped   counter   Number of packets dropped because the service FIFO queue is full.
 tx_hdr_pkts       counter   Number of packets using the small-packet transmission method, in which packets are copied into a pre-mapped DMA buffer.
 tx_ddi_pkts       counter   Number of packets using the mid-range DDI DMA transmission method.
 tx_dvma_pkts      counter   Number of packets using the top-range DVMA fast-path DMA transmission method.
 tx_jumbo_pkts     counter   Number of packets being sent that are greater than 1522 bytes in length.
 tx_max_pend       counter   Maximum number of packets ever queued on a Tx ring.
 tx_no_desc        counter   Number of times a packet transmit was attempted with no Tx descriptor elements available; the packet is postponed until later.
 tx_queueX         counter   Number of packets transmitted on a particular queue.
 mac_mtu           value     The maximum packet size allowed past the MAC.
 pci_bus_speed     value     The PCI bus speed that is driving the card.


When rx_ov_flow is incrementing, packet processing is not keeping up with the packet arrival rate. If rx_ov_flow is incrementing while rx_no_buf or rx_no_comp_wb is not, the PCI bus is presenting an issue to the flow of packets through the device. This could be because the ce card is plugged into a slower PCI bus, which you can establish by looking at the pci_bus_speed statistic. A bus speed of 33 MHz might not be sufficient to sustain full bidirectional one-gigabit Ethernet traffic.

Another scenario that can lead to rx_ov_flow incrementing on its own is sharing the PCI bus with another device that has bandwidth requirements similar to those of the ce card.

These scenarios are hardware limitations. A first step in addressing them is to enable the infinite burst capability on the PCI bus. Use the ndd tuning parameter infinite-burst to achieve that.

Infinite burst gives ce more bandwidth, but the Tx and Rx sides of the ce device still compete for that PCI bandwidth. Therefore, if the traffic profile shows a bias toward Rx traffic and this condition is leading to rx_ov_flow, you can bias PCI transactions in favor of the Rx DMA channel relative to the Tx DMA channel, using the ndd parameters rx-dma-weight and tx-dma-weight.
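For example, a hedged ndd sketch; a value of 1 is assumed to enable infinite burst, and the DMA weight value is purely illustrative, so read the current settings first and consult the driver documentation for the valid range.

 hostname# ndd -set /dev/ce instance 0
 hostname# ndd -set /dev/ce infinite-burst 1
 hostname# ndd -get /dev/ce rx-dma-weight
 hostname# ndd -get /dev/ce tx-dma-weight
 hostname# ndd -set /dev/ce rx-dma-weight 2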

Alternatively, you can reorganize the system by giving the ce interface a 66-MHz PCI slot, or you can separate devices that contend for a shared bus segment by giving each of them a bus segment.

If this doesn't do much to reduce the problem, consider using Random Early Detection (RED) to minimize the impact of dropping packets, keeping alive connections that would otherwise be terminated by regular overflow. The parameters that enable RED are configurable using ndd: red-dv4to6k, red-dv6to8k, red-dv8to10k, and red-dv10to12k.
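The semantics and value ranges of these RED thresholds are driver specific, so a safe first step is simply to read the current settings before experimenting with them:

 hostname# ndd -set /dev/ce instance 0
 hostname# ndd -get /dev/ce red-dv4to6k
 hostname# ndd -get /dev/ce red-dv6to8k
 hostname# ndd -get /dev/ce red-dv8to10k
 hostname# ndd -get /dev/ce red-dv10to12k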

The probability that an incrementing rx_ov_flow is the only problem is small. Typically, Sun systems have a fast PCI bus and memory subsystem, so delays are seldom induced at that level. It is more likely that the protocol stack software falls behind, leaving the Rx buffers or completion descriptor ring exhausted of free elements with which to receive more packets. If this happens, the kstat parameters rx_no_buf and rx_no_comp_wb begin to increment. This can mean that there is not enough CPU power to absorb the packets, but it can also be due to a bad balance of the buffer ring size versus the completion ring size, leading to rx_no_comp_wb incrementing without rx_no_buf incrementing.

The default configuration is one buffer to four completion elements, which works well provided that the packets arriving are larger than 256 bytes. If they are not, and that traffic dominates, then 32 packets will be packed into a buffer, increasing the probability of this configuration imbalance. In that case, more completion elements need to be made available. You can address this with the /etc/system tunables ce_ring_size, which adjusts the number of available Rx buffers, and ce_comp_ring_size, which adjusts the number of Rx packet completion elements. To understand the Rx traffic profile before tuning these parameters, use kstat to look at the distribution of Rx packets across rx_hdr_pkts and rx_mtu_pkts.
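For example, check the Rx packet-size profile first and, if small packets dominate, grow the completion ring relative to the buffer ring. The ring sizes below are illustrative only; the supported values are driver specific.

 kstat ce:0 | grep rx_hdr_pkts
 kstat ce:0 | grep rx_mtu_pkts

 * /etc/system: illustrative ring sizes for a small-packet-dominated load
 set ce:ce_ring_size=256
 set ce:ce_comp_ring_size=2048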

If ce is being run on a single CPU system and rx_no_buf and rx_no_comp_wb are incrementing, then you will have to resort again to RED or enable Ethernet flow control.

If more than one CPU is available, it is still possible to overwhelm a single CPU. Given that the Rx processing can be split using the alternative Rx data delivery models provided by ce, it might be possible to distribute the processing of incoming packets to more than one CPU, described earlier as Rx load balancing. This happens by default if the system has four or more CPUs, in which case four load-balancing worker threads are enabled. The CPU threshold and the number of load-balancing worker threads can be managed using the /etc/system tunables ce_cpu_threshold and ce_inst_taskqs.
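A sketch of the /etc/system entries follows; the values simply restate the defaults described above (a four-CPU threshold and four worker threads) and are a starting point rather than a recommendation.

 * /etc/system: CPU threshold for Rx load balancing and number of worker threads
 set ce:ce_cpu_threshold=4
 set ce:ce_inst_taskqs=4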

The number of load-balancing worker threads, and how evenly the Rx load is distributed to each, can be viewed with the ipackets_cpuXX kstats. The highest XX tells you how many load-balancing worker threads are running, while the values of these counters show the spread of the work across the instantiated worker threads. This, in turn, indicates whether load balancing is yielding a benefit. For example, if all ipackets_cpuXX kstats show approximately the same number of packets, the load balancing is optimal. On the other hand, if only one is incrementing and the others are not, then the benefit of Rx load balancing is nullified.

It is also possible to measure whether the system is experiencing an even spread of CPU activity using mpstat. Ideally, if you see good load balancing in the ipackets_cpuXX kstats, mpstat should also show the workload evenly distributed across multiple CPUs.
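For example, to compare the per-thread packet counts with the per-CPU utilization:

 kstat ce:0 | grep ipackets_cpu
 mpstat 5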

If none of this benefit is visible, then disable the load balancing capability completely, using the /etc/system variable ce_taskq_disable.

The Rx load balancing provides packet queues, also known as service FIFOs, between the interrupt threads that fan out the workload and the service FIFO worker threads that drain the FIFOs and complete the workload. These service FIFOs are of a fixed size, controlled by the /etc/system variable ce_srv_fifo_depth. A service FIFO can overflow and drop packets when the rate of packet arrival exceeds the rate at which the draining thread can complete the post-processing. These drops are counted by the rx_pkts_dropped kstat. If drops are occurring, you can increase the size of the service FIFOs or increase the number of service FIFOs, allowing more Rx load balancing.

In some cases it may be possible to eliminate increments in rx_pkts_dropped, only for the problem to move to rx_nocanput, which is generally addressable only by tuning the upper layer protocols. If you are running the Solaris 8 operating system or an earlier version, upgrading to the Solaris 9 operating system might reduce nocanput errors, thanks to its improved multithreading and IP scalability.
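For example, check the drop counter and, if it is climbing, deepen the service FIFOs; the depth shown is illustrative only.

 kstat ce:0 | grep rx_pkts_dropped

 * /etc/system: illustrative service FIFO depth
 set ce:ce_srv_fifo_depth=4096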

There is a difficulty in maximizing the Rx load balancing, and it is contingent on the Tx ring processing. This is measurable with the lockstat command, which will show the ce_start routine at the top as the most contended driver function. This contention cannot be eliminated, but it is possible to employ a Tx method known as Transmit serialization, which keeps contention to a minimum by forcing the Tx processing onto a fixed set of CPUs. Keeping the Tx process on a fixed CPU reduces the risk of CPUs spinning while waiting for other CPUs to complete their Tx activity, ensuring CPUs are always kept busy doing useful work. This transmission method is enabled by setting the /etc/system variable ce_start_cfg to 1. When you enable Transmit serialization, you trade added transmit latency for the avoidance of mutex spins induced by contention.
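For example, you can sample lock contention while the load is running and then enable Transmit serialization; the lockstat invocation simply gathers contention statistics for the ten seconds that the sleep command runs.

 hostname# lockstat sleep 10

 * /etc/system: enable Transmit serialization
 set ce:ce_start_cfg=1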

The Tx side is also subject to an overwhelmed condition, although this is less likely than any Rx-side condition. It becomes visible when the tx_max_pend value matches the size of the /etc/system variable ce_tx_ring_size. If this occurs, you know that packets are being postponed because Tx descriptors are being exhausted, and the ce_tx_ring_size should be increased.
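For example, compare tx_max_pend against the current ring size and, if they match, grow the ring; the size shown is illustrative only.

 kstat ce:0 | grep tx_max_pend

 * /etc/system: illustrative Tx descriptor ring size
 set ce:ce_tx_ring_size=4096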

The tx_hdr_pkts, tx_ddi_pkts, and tx_dvma_pkts kstats are useful for establishing the traffic profile of an application and matching it with the capabilities of a system. For example, many small systems have very fast memory access times, making the cost of setting up DMA transactions more expensive than transmitting directly from a pre-mapped DMA buffer. In that case, you can adjust the DMA thresholds programmable via /etc/system to push more packets through the pre-mapped DMA buffer rather than per-packet DMA programming. Once the tuning is complete, view these statistics again to see whether the tuning took effect.

The tx_queueX kstats give a good indication of whether Tx load balancing matches the Rx side. If no load balancing is visible, meaning all the packets appear to be counted by only one tx_queue, it may make sense to switch this feature off using the /etc/system variable ce_no_tx_lb.

The mac_mtu indicates the maximum packet size that will make it through the ce device. It is useful for checking whether jumbo frames are enabled at the DLPI layer below TCP/IP. If jumbo frames are enabled, the MTU indicated by mac_mtu will be 9216.

This is helpful, as it will show if there's a mismatch between the DLPI layer MTU and the IP layer MTU, allowing troubleshooting to occur in a layered manner.

Once jumbo frames are successfully configured at both the driver layer and the TCP/IP layer, use the rx_jumbo_pkts and tx_jumbo_pkts kstats to verify that jumbo frame packets are actually being received and transmitted.
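For example, the following kstat checks confirm that the DLPI layer accepts jumbo frames and that jumbo packets are flowing in both directions:

 kstat ce:0 | grep mac_mtu
 kstat ce:0 | grep jumbo_pkts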
