Ethernet Technology


This section discusses low-level network interface controller (NIC) architecture features. It explains the elements that make up a NIC adapter, breaking the adapter down into the transmit (Tx) data path and the receive (Rx) data path, followed by the acceleration features available with more modern NICs.

The components are broken down in this manner to provide the necessary high-level understanding required to discuss the finer details of the Sun NIC devices available.

These broad concepts plus the finer details included in this explanation will help you understand the operation of these devices and how to tune them for maximum benefit in throughput, request/response performance, and level of CPU utilization. These concepts are also useful in explaining the development path that Sun took for its NIC technology. Each concept is retained from one Sun NIC to the next as each new product builds on the strengths of the last.

With the NIC architecture concepts in place, the next area of discussion is by far the largest source of customer discomfort with Ethernet technology: the physical layer. The original ubiquitous Ethernet technology was 10 Mbit/sec. Ethernet technology has been improved continuously over the years, going from 10 Mbit/sec to 100 Mbit/sec and most recently to 1 Gbit/sec. Along the way Ethernet technology always promised to be backward compatible and accomplished this using a technology called auto-negotiation, which allows new Ethernet arrivals to connect to the existing infrastructure and establish the correct speed to operate with and be part of that infrastructure. On the whole, the technology works very well, but there are some difficulties with understanding the Ethernet physical layer. Hopefully, our explanation of this layer will facilitate better use of this feature.

The last addition to the Ethernet technology is network congestion control using pause flow control. This is a useful but under-utilized feature of Ethernet that we hope to demystify.

Software Device Driver Layer

This section discusses the low-level NIC architecture features required to understand the tuning capabilities of each NIC. To discuss this we will divide the process of communication into the software device driver layer relative to TCP/IP and then further into Transmit and Receive.

The software device driver layer conforms to the Data Link Provider Interface (DLPI). The DLPI interface layer is how protocols like TCP/IP, AppleTalk, and so on talk to the software driving the Ethernet device. This is illustrated further in FIGURE 5-7.

Figure 5-7. Communication Process between the NIC Software and Hardware


Transmit

The Transmit portion of the software device driver level is the simpler of the two and basically is made up of a Media Access Control module (MAC), a direct memory access (DMA) engine, and a descriptor ring and buffers. FIGURE 5-8 illustrates these items in relation to the computer system.

Figure 5-8. Transmit Architecture


The key element to this transmit architecture is the descriptor ring. This is the part of the architecture where the transmit hardware and the device driver transmit software share information required to move data from the system memory to the Ethernet network connection.

The transmit descriptor ring is a circular array of descriptor elements that are constantly being used by the hardware to find data to be transmitted from main memory to the Ethernet media. At a minimum, the transmit descriptor element contains the length of the Ethernet packet data to be transmitted and a physical pointer to a location in system physical memory to find the data.

The transmit descriptor element is created by the NIC device driver as a result of a request at the DLPI interface layer to transmit a packet. That element is placed on the descriptor ring at the next available free location in the array. Then the hardware is notified that a new element is available. The hardware fetches the new descriptor, and using the pointer to the packet data physical memory, moves the data from the physical memory to the Ethernet media for the given length of the packet provided in the Tx descriptor.
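As a concrete illustration, a minimal C sketch of a transmit descriptor and its ring appears below. The field names, the flag layout, and the 512-entry ring size are illustrative only, not those of any particular Sun NIC:

    #include <stdint.h>

    #define TX_RING_SIZE 512              /* hypothetical ring size */

    typedef struct tx_desc {
        uint64_t buf_addr;                /* physical address of packet data */
        uint32_t buf_len;                 /* length of the packet in bytes   */
        uint32_t flags;                   /* e.g., a "descriptor ready" bit  */
    } tx_desc_t;

    typedef struct tx_ring {
        tx_desc_t desc[TX_RING_SIZE];
        uint32_t  head;                   /* next slot the driver fills   */
        uint32_t  tail;                   /* next slot the hardware reads */
    } tx_ring_t;

    /* Returns 0 on success, -1 if the ring is full and the send must wait. */
    int
    tx_post(tx_ring_t *r, uint64_t paddr, uint32_t len)
    {
        uint32_t next = (r->head + 1) % TX_RING_SIZE;

        if (next == r->tail)
            return (-1);                  /* ring full */

        r->desc[r->head].buf_addr = paddr;
        r->desc[r->head].buf_len  = len;
        r->desc[r->head].flags    = 1;    /* mark descriptor ready */
        r->head = next;
        /* Notify the hardware, typically by writing a doorbell register. */
        return (0);
    }

The full-ring check in tx_post is exactly the producer-consumer boundary discussed next.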

Note that requests for more packets to be transmitted by the DLPI interface continue while the hardware is transmitting the packets already posted on the descriptor ring. Sometimes the arrival rate of the transmit packets at the DLPI interface exceeds the rate of transmission of the packets by the hardware to the media. In that case, the descriptor ring fills up and further attempts to transmit must be postponed until previously posted transmissions are completed by the hardware and more descriptor elements are made available by the device driver software. This is a typical producer-consumer effect where the producer is the DLPI interface producing requests for the transmit descriptor ring and the hardware is the consumer consuming those requests and moving data to the media.

This producer-consumer effect can be reduced by increasing the size of the transmit descriptor ring to accommodate the delay that the hardware or the underlying media imposes on the movement of the data. This delay is also known as transmission latency.

Later sections describe how many of the device drivers give a measurement of how often the transmission latency becomes so large that data transmission is postponed, awaiting transmit descriptor ring space. The aim is to avoid this situation. In some cases, NIC hardware allows you to increase the size of the descriptor ring, allowing a larger transmit latency. In other cases, the hardware has a fixed upper limit for the size of the transmit descriptor ring. In those cases, there's a hard limit to how much latency the transmit can endure before postponing packets is inevitable.

Transmit DMA Buffer Method Thresholds

The packets staged for transmission are buffers present in the kernel virtual address space. A mapping is created that provides a physical address that the hardware uses as the base address of the bytes to be fetched from main memory for transmission. The minimum granularity of a buffer is an 8-kilobyte page, so if an Ethernet packet crosses an 8-kilobyte page boundary in the virtual address space, there's no guarantee that the two pages will also be adjacent in the physical address space. To make sure the pages appear adjacent, SPARC systems provide an input/output memory management unit (IOMMU), which is designed to make sure that the device view of main memory matches that of the CPU and hence simplifies DMA mapping.

The IOMMU simplifies the mapping of virtual to physical address space and provides a level of error protection against rogue devices that exceed their allowed memory regions, but it does so at some cost to mapping setup and teardown. The generic device driver interface for creating this mapping is known as ddi_dma. On SPARC platforms, a newer set of mapping functions known as fast dvma is now available. With fast dvma it is possible to further optimize the mapping functions. The fast dvma resource is limited, so when it is unavailable, falling back to ddi_dma is necessary.

Another aspect of DMA is CPU cache coherence. DMA data buffered in bridges on the data path between the NIC device and main memory must be synchronized with the CPU cache before the CPU reads or writes that data. The two modes of maintaining DMA-to-CPU cache coherency form two types of DMA transaction, known as consistent mode and streaming mode.

  • Consistent mode uses a consistency protocol in hardware, which is common in both x86 platforms and SPARC platforms.

  • Streaming mode uses a software synchronization method.

The trade-offs between consistent and streaming modes are largely due to the pre-fetch capability of the DMA transaction. In the consistent mode there's no pre-fetch, so when a DMA transaction is started by the device, each cache line of data is requested individually. In streaming mode a few extra cache lines can be pre-fetched in anticipation of being required by the hardware, hence reducing per cache line re-arbitration costs.
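In Solaris driver code, this choice surfaces as flags to the generic DDI mapping routines and as explicit synchronization calls. A minimal sketch follows (error handling elided; dma_handle is assumed to have been allocated earlier with ddi_dma_alloc_handle):

    ddi_dma_cookie_t cookie;
    uint_t           ccount;

    /* Bind a transmit buffer for device access, requesting streaming
     * mode; the system may grant consistent mode instead. */
    (void) ddi_dma_addr_bind_handle(dma_handle, NULL, (caddr_t)buf,
        len, DDI_DMA_WRITE | DDI_DMA_STREAMING, DDI_DMA_SLEEP, NULL,
        &cookie, &ccount);

    /* cookie.dmac_laddress is the physical address that the Tx
     * descriptor element points at. */

    /* On receive, before the CPU examines data the device wrote,
     * synchronize the buffer for the CPU: */
    (void) ddi_dma_sync(dma_handle, 0, len, DDI_DMA_SYNC_FORCPU);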

All of these trade-offs lead to the following rules for using ddi_dma, fast dvma, and consistent versus streaming mode:

  • If the packets are small, avoid setting up a mapping on a per-packet basis. This means that small packets are copied out of the message and passed down from the upper layer to a pre-mapped buffer. That pre-mapped buffer is usually a consistent mode buffer, as the benefits of streaming mode are difficult to realize for small packets.

  • Large packets should use the fast dvma mapping interface. Streaming mode is assumed in this mode. On x86 platforms, streaming mode is not available.

  • Mid-range packets should use the ddi_dma mapping interface. This range applies to all cases where fast dvma is not available. The mid-range can be further split, as one can control explicitly whether the DMA transaction uses consistent mode or streaming mode. Given that streaming mode pre-fetch capability works best for larger transactions, the upper half should use streaming mode while the lower half uses consistent mode.

Setting the thresholds for these rules requires clear understanding of the memory latencies of the system and the distance between the I/O Expander card and the CPU card in a system. The rule of thumb here is the larger the system, the larger the memory latency.
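On the transmit path, these rules typically reduce to a size-based dispatch, as in the hedged sketch below. The two thresholds are placeholders only; suitable values depend on the memory latency of the particular system and should be found by the experimentation described next:

    #define TX_BCOPY_MAX    256     /* hypothetical small-packet threshold */
    #define TX_STREAM_MIN  1024     /* hypothetical streaming threshold    */

    if (pkt_len <= TX_BCOPY_MAX) {
        /* Small packet: copy it into a pre-mapped consistent-mode
         * buffer; a per-packet mapping would cost more than the copy. */
    } else if (fast_dvma_available) {
        /* Large packet: use the fast dvma interface (streaming mode
         * assumed; not available on x86 platforms). */
    } else if (pkt_len >= TX_STREAM_MIN) {
        /* Upper mid-range: ddi_dma mapping with DDI_DMA_STREAMING so
         * pre-fetch can amortize per-cache-line arbitration. */
    } else {
        /* Lower mid-range: ddi_dma mapping with DDI_DMA_CONSISTENT. */
    }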

Once the coarse-grained tuning is applied, more fine-grained tuning is required. The best tuning is established by experimentation. A good way to do this is by running FTP or NFS transfers of large files and measuring the throughput.

Multi-data Transmit Capability

A newer development in the transmission process of the Solaris TCP/IP protocol stack is known as multi-data transmission (MDT). Delivering one packet at a time to the transmit driver incurred many traversals of the networking layers, and the cost of setting up the network device DMA hardware to transmit one packet at a time was too high. Therefore, a method was devised that allows multiple packets to be passed to the driver for transmission in a single call. At the same time, every effort was made to ensure that the data for those packets remained in one contiguous buffer that could be set up for transmission with a single programming of the network device DMA hardware.

This feature requires a new interface to the driver, so only the most recent devices have implemented it. Furthermore, it can only be enabled if TCP/IP is configured to allow it. Even with that, it will only attempt to build an MDT transaction to the driver if the TCP connection is operating in a Bulk transfer mode such as FTP or NFS.

The Multi-Data Transmit capability is also included as part of the performance enhancements provided in the ce driver. This feature is negotiated between the ce driver and the upper layer protocol, so it must be enabled in both; if there's no negotiation, the feature is disabled. The TCP/IP protocol began supporting the Multi-Data Transmit capability in the Solaris 9 8/03 operating system, but by default it will not negotiate with the driver to enable it. The first step to making this capability available is to enable the negotiations through an /etc/system tunable parameter.

Table 5-15. Multi-Data Transmit Tunable Parameter

  Parameter      Value  Description
  -------------  -----  --------------------------------------------------
  ip_use_dl_cap  0-1    Enables the ability to negotiate special hardware
                        accelerations with a lower layer.
                        1 = Enable, 0 = Disable. Default: 0.


  • To enable the multi-data transmit capability, add the following line to the /etc/system file:

 set ip:ip_use_dl_cap = 1 

Receive

The receive side of the interface looks much like the transmission side, but it requires more from the device driver to ensure that packets are passed to the correct stream. There are also multithreading techniques to ensure that the best advantage is made of multiprocessor environments. FIGURE 5-9 shows the basic Rx architecture.

Figure 5-9. Basic Receive Architecture


The receive descriptor plays a key role in the process of receiving packets. Unlike transmission, receive packets originate from remote systems. Therefore, the Rx descriptor ring refers to buffers where those incoming packets can be placed.

At a minimum, the receive descriptor element provides a buffer length and a pointer to an available buffer. When a packet arrives, it's received first by the PHY device and then passed to the MAC, which notifies the Rx DMA engine of an incoming packet. The Rx DMA takes that notification and uses it to initiate a Rx descriptor element fetch. The descriptor is then used by the Rx DMA to post the data from the MAC device internal FIFOs to system main memory. The length provided by the descriptor ensures the Rx DMA doesn't exceed the buffer space provided for the incoming packet.

The Rx DMA continues to move data until the packet is complete. Then it places in the current descriptor location a new completion descriptor containing the size of the packet that was just received. In some cases, depending on the hardware capability, there might be more information in the completion descriptor associated with the incoming packet (for example, a TCP/IP partial checksum).

When the completion descriptor is placed back onto the Rx descriptor ring, the hardware advances its pointer to the next free Rx descriptor. Then the hardware interrupts the CPU to notify the device driver that it has a packet that needs to be passed to the DLPI layer.

Once the device driver receives the packet, it is responsible for replenishing the Rx descriptor ring. That process requires the driver to allocate and map a new buffer for DMA and post it to the ring. When the new buffer is posted to the ring, the hardware is notified that this new descriptor is available. Once the buffer is replenished, the current packet can be passed up for classification to the stream expecting that packet's arrival.
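The following is a hedged sketch of that replenish-then-deliver loop; all of the names are hypothetical, and a production driver would also bound the work done per interrupt:

    /* Drain completed Rx descriptors: replenish each slot, then pass
     * the received packet up toward the DLPI layer. */
    while (ring->desc[ring->tail].completed) {
        rx_desc_t *d    = &ring->desc[ring->tail];
        void      *pkt  = ring->buf[ring->tail];
        void      *nbuf = rx_buf_alloc_and_map(ring);

        if (nbuf == NULL) {
            /* Allocation failed: recycle the current buffer so the
             * hardware can keep receiving; this packet is dropped. */
            rx_post_buffer(ring, ring->tail, pkt);
        } else {
            rx_post_buffer(ring, ring->tail, nbuf);
            deliver_to_stream(pkt, d->pkt_len);  /* classification */
        }
        ring->tail = (ring->tail + 1) % RX_RING_SIZE;
    }
    /* Notify the hardware that descriptors have been replenished. */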

It is possible for the allocation and mapping to fail. In that case, the current packet cannot be passed up; instead, its buffer is reposted to the ring to allow the hardware to continue to receive packets. This condition is not very likely, but it is an example of an overflow condition.

Other overflow conditions can occur on the Rx path starting from the DLPI layer:

  • Overflow can be caused when the DLPI layer cannot receive the incoming packet. In that case, packets are typically dropped, even though they were successfully received by the hardware.

  • Overflow can be caused when the device driver software is unable to replenish the Rx descriptor elements faster than the NIC hardware consumes them. This usually occurs because the system doesn't have enough CPU performance to keep up with the network traffic.

Overflow also occurs within the NIC device, between the MAC and the Rx DMA interface. This is known as MAC overflow. It occurs when a high-latency system bus causes the MAC to overflow its internal buffer while it waits for the Rx DMA engine to gain access to the bus and move data from the MAC buffer to main memory; it can also occur as backfill when the Rx descriptor ring itself overflows. While a MAC overflow condition exists, incoming packets cannot be received, and each such packet is counted as missed.

In some cases, overflow conditions can be avoided by careful tuning of the device driver software. The extent of available tuning depends on the NIC hardware. In cases where the Rx descriptor ring is overflowed, many devices allow increases in the number of descriptor elements available. This will be discussed further with respect to example NIC cards in later sections.

You can avoid the MAC overflow condition by careful system configuration, which can require more memory, faster CPUs, or more CPUs. It might also require that NIC cards not share the system bus with other devices. Newer devices have the ability to adjust the priority of the Rx DMA versus the Tx DMA, giving one a more favorable opportunity to access the system bus than the other. Therefore, if the MAC overflow condition occurs, it might be possible to adjust the Rx DMA priority to make Rx accesses to the system bus more favorable than the Tx DMA, thus reducing the likelihood of MAC overflow.

The overflow condition from the DLPI layer is caused by an overwhelmed CPU. There are a few new hardware features that help reduce this effect. Those features include hardware checksumming, interrupt blanking, and CPU load balancing.

Checksumming

The hardware checksumming feature accelerates the one's complement checksum applied to TCP/IP packets. The TCP checksum is computed over each packet sent by the TCP/IP protocol and is made up of a one's complement addition of the bytes in the pseudo header plus all the bytes in the TCP header and payload. The pseudo header is made up of the source and destination IP addresses, the protocol number, and the length of the TCP segment.

The hardware checksumming feature is merely an acceleration. Most hardware designs don't implement the TCP/IP checksumming directly. Instead, the hardware does the bulk of the one's complement additions over the data and allows the software to take that result and mathematically adjust it to make it appear the complete hardware checksum was calculated. On transmission, the TCP checksum field is filled with an adjustment value that is considered just another two bytes of data that the hardware is applying during the one's complement addition of all the bytes of the packet. The end result of that sequence is a mathematically correct checksum that can be placed in the TCP header on transmission by the MAC to the network.

Figure 5-10. Hardware Transmit Checksum


On the Rx path, the hardware completes the one's complement checksum based on a starting point in the packet. That same starting point is passed to TCP/IP along with the one's complement checksum from the bytes in the incoming packet. The TCP/IP software again does a mathematical fix-up, using this information before it finally compares the result with the TCP/IP checksum bytes that arrived as part of the packet.

The main advantage of hardware checksumming is the reduction in cost of requiring the system CPU to calculate the checksum for large packets by allowing the majority of the checksum calculation to be completed by the NIC hardware. Because the hardware does not do the complete TCP/IP checksum calculation, this form of TCP/IP checksum acceleration is called partial checksumming.
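The software fix-up is ordinary one's complement arithmetic. The sketch below shows the receive-side finish; the names are illustrative, with hw_partial standing for the sum the hardware computed from its starting point in the packet:

    #include <stdint.h>

    /* Add two 16-bit quantities with end-around carry
     * (one's complement addition). */
    static uint16_t
    ones_add(uint32_t a, uint32_t b)
    {
        uint32_t sum = a + b;

        while (sum >> 16)
            sum = (sum & 0xFFFF) + (sum >> 16);
        return ((uint16_t)sum);
    }

    /* Combine the hardware's partial sum over the payload with the
     * software-computed pseudo-header sum, then complement to get
     * the value compared against the TCP checksum field. */
    uint16_t
    tcp_cksum_finish(uint16_t hw_partial, uint16_t pseudo_sum)
    {
        return ((uint16_t)(~ones_add(hw_partial, pseudo_sum) & 0xFFFF));
    }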

Figure 5-11. Hardware Receive Checksum


Interrupt Blanking

Interrupt blanking is another hardware acceleration. Typically, a regular NIC device interrupts the CPU when a receive packet arrives, so the CPU is interrupted on a per-packet basis. While this is reasonable for transactional requests, where you would expect an immediate response to a request, it is not always required, especially in large bulk data transfers. In the single-interrupt-per-packet case, each packet arrival adds the overhead of interrupt processing to the cost of processing that individual packet. The interrupt blanking technique allows a set number of packets to arrive before the next receive interrupt is generated, so the overhead of interrupt processing is distributed, or amortized, across that number of received packets. If that number of packets is not reached, the packets that have arrived so far would not generate an interrupt and hence would not be processed; a timeout ensures that the receive packet interrupt is eventually generated and those packets are processed. The best setting for interrupt blanking depends on the type of traffic (transactional versus bulk data transfers) and the speed of the system. These parameters are best tuned empirically, when the traffic characteristics are well known and the interrupt blanking can be tuned dynamically to match. This will be discussed further in the context of individual NICs that provide this feature.
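Conceptually, the driver programs two values, a packet-count threshold and a timeout, and whichever is reached first raises the interrupt. A sketch with invented register names and values:

    #define RX_BLANK_PKTS   8     /* hypothetical: interrupt every 8 packets */
    #define RX_BLANK_TICKS  4     /* hypothetical: ... or after this timeout */

    /* One interrupt now amortizes over up to RX_BLANK_PKTS packets;
     * the timeout bounds the added latency for sparse, transactional
     * traffic.  write_reg is a hypothetical register-write helper. */
    write_reg(dev, RX_CFG_BLANK_COUNT, RX_BLANK_PKTS);
    write_reg(dev, RX_CFG_BLANK_TIME,  RX_BLANK_TICKS);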

CPU Load Balancing

CPU load balancing is the latest hardware acceleration to become available. It is designed to take maximum advantage of the large number of CPUs available in many UltraSPARC-based systems. There are two forms of CPU load balancing: software load balancing and hardware load balancing.

Software load balancing can be enhanced with hardware support, but it can also be implemented without it. Essentially, it requires the ability to separate the workload of different connections from the same protocol stack into flows that can then be processed on different CPUs. The interrupt thread is now required only to replenish buffers for the descriptor ring, allowing more packets to arrive. Packets taken off the receive rings are then load balanced into packet flows based on connection information from the packet. A packet flow has a circular array that is updated with receive packets from the interrupt service routine while packets posted earlier are being removed and post-processed in the protocol stack by a kernel worker thread. Usually more than one flow, each made up of a circular array and a corresponding kernel worker thread, is set up within a system. The more CPUs available, the more flows can be allowed. The kernel worker threads are eligible to run whenever packet data is available in their flow's array. The system scheduler participates using its own CPU load-balancing technique to ensure a fair distribution of workload for incoming data.
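A sketch of the flow selection step appears below, with hypothetical names. Hashing the connection 4-tuple guarantees that all packets of one connection land in the same flow and are therefore post-processed in order by the same worker thread:

    #include <stdint.h>

    /* Pick a flow for a packet from its connection 4-tuple. */
    static uint32_t
    flow_hash(uint32_t src_ip, uint32_t dst_ip,
        uint16_t src_port, uint16_t dst_port, uint32_t nflows)
    {
        uint32_t h;

        h  = src_ip ^ dst_ip;
        h ^= ((uint32_t)src_port << 16) | dst_port;
        h ^= h >> 16;
        return (h % nflows);   /* same connection, same flow */
    }

    /* In the interrupt thread (pseudo-outline):
     *   idx = flow_hash(sip, dip, sport, dport, nflows);
     *   enqueue(&flows[idx].ring, pkt);
     * The flow's kernel worker thread dequeues the packet and runs
     * the protocol stack processing on its own CPU. */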

FIGURE 5-12 demonstrates the architecture of software load balancing.

Figure 5-12. Software Load Balancing


Hardware load balancing requires that the hardware provide built-in load balancing capability. The PCI bus enables receive hardware load balancing by using its four available interrupt lines together with the ability of the UltraSPARC III systems to allow each of those four interrupt lines to be serviced by different CPUs within the system. The advantage of having the four lines receive interrupts running on different CPUs is that it allows not only the protocol post processing to happen in parallel, as in the case of software load balancing, but it also allows the processing of the descriptor rings in the interrupt service routines to run in parallel, as shown in FIGURE 5-13.

Figure 5-13. Hardware Load Balancing


It is possible to combine the concept of software load balancing with the concept of hardware load balancing if enough CPUs are available to allow all the parallel Rx processing to happen.

However, there is a gotcha with this load balancing capability: To realize its benefit, you must have multiple connections in order to provide the load balancing in the first place.

Received Packet Delivery Method

The received packet delivery method is the way packets are posted to the upper layer for protocol processing and refers to which thread of execution takes responsibility for that last leg of the journey for the received packet.

CPU software load balancing is an example of a received packet delivery method where the interrupt processing is decoupled from the protocol stack processing. A hint provided by the hardware helps decide which worker thread completes the delivery of the packet to the protocol. In this model, many CPUs get to participate in the protocol stack processing.

The Streams Service Queue model also requires the driver to decouple interrupt processing from protocol stack processing. In this model, there's no requirement to provide a hint, because there is only one protocol processing thread per queue open to the driver; with respect to TCP/IP, that means only one stream. This method works best on systems with a small number of CPUs, but greater than one. Like CPU load balancing, it compromises on latency.

The most common received packet delivery method is to do all the interrupt processing and protocol processing in the interrupt thread. This is a widely accepted method, but it is restricted by the available CPU bandwidth taking all the NIC driver interrupts. This is really the only option on a single CPU system. In a multi-CPU system, you can choose one of the other two methods if it's established that the CPU taking the NIC interrupts is being overwhelmed. That situation becomes apparent when the system starts to become unresponsive.

Random Early Discard

The Random Early Discard feature was introduced recently to try to reduce the ill effects of a network card going into an overflow state.

Under normal circumstances, there are a couple of overflow possibilities:

  • The internal device memory is full and the adapter is unable to get timely access to the system bus in order to move data from that device memory to system memory.

  • The system is so busy servicing packets that the descriptor rings fill up with inbound packets and no further packets can be received. This overflow condition is very likely to also trigger the first overflow condition at the same time.

When these overflow conditions occur, the upper layer connections effectively stop receiving packets, and a connection appears to have stalled, at least for the duration of the overflow condition. With TCP/IP in particular, this leads to many packets being lost. The connection state is modified to assume a less reliable connection, and in some cases the connections might be lost completely.

The impact of a lost connection is obvious, but if the TCP/IP protocol assumes a less reliable connection it will further contribute to the congestion on the network by reducing the number of packets outstanding without an ACK from the regular eight to a smaller value. A technique that can avoid this scenario can take advantage of the TCP/IP ability to allow for the occasional single packet loss associated with a connection and still maintain the same number of packets outstanding without an ACK. The lost packet is simply requested again, and the transmitting end of the connection will perform a retry.

Completely avoiding the overflow scenario is impossible, but you can reduce its likelihood by beginning to drop random packets already received in the device memory, avoiding propagating them further into the system and adding to the workload already piled up for the system. This technique, known as Random Early Discard (RED), has the desired effect of avoiding overwhelming the system, while at the same time having minimal negative effect on the TCP/IP connections.

The rate of random discard is set relative to how many bytes of packet data occupy the device internal memory. The internal memory is split into regions. As one region fills up with packet data, it spills into the next, until all regions of memory are filled and overflow occurs. Each spill from one region to the next is the trigger to randomly discard. The number of packets discarded is based on the number of regions filled: the more regions filled, the more you need to discard, as you're getting closer to the overflow state.
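A sketch of such a discard policy follows; the region count and drop probabilities are invented for illustration:

    #include <stdlib.h>

    #define DEV_MEM_REGIONS 4    /* hypothetical: memory split 4 ways */

    /* Hypothetical discard table: the fuller the device memory,
     * the more aggressively arriving packets are dropped. */
    static const int drop_percent[DEV_MEM_REGIONS] = { 0, 12, 25, 50 };

    int
    red_should_drop(int regions_filled)
    {
        if (regions_filled >= DEV_MEM_REGIONS)
            return (1);                  /* full: overflow, must drop */
        return ((rand() % 100) < drop_percent[regions_filled]);
    }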

Jumbo Frames

Jumbo frames technology allows the size of an Ethernet data packet to be extended past the standard 1514-byte limit, which is the norm for Ethernet networks. The typical size of jumbo frames has been set to 9000 bytes when viewed from the IP layer. Once the Ethernet header is applied, that grows by 14 bytes for the regular case or 18 bytes for VLAN packets.
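The frame-size arithmetic works out as follows (the 4-byte FCS is excluded, as in the 1514-byte figure above):

    #define JUMBO_MTU      9000   /* IP-layer payload size            */
    #define ETHER_HDR_LEN    14   /* destination + source + type      */
    #define VLAN_TAG_LEN      4   /* additional 802.1Q tag            */

    /* On-wire jumbo frame sizes:
     *   regular: JUMBO_MTU + ETHER_HDR_LEN                = 9014 bytes
     *   VLAN:    JUMBO_MTU + ETHER_HDR_LEN + VLAN_TAG_LEN = 9018 bytes */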

When jumbo frames are enabled on a subnet or VLAN, every member of that subnet or VLAN should be enabled to support jumbo frames. To ensure that this is the case, configure each node for jumbo frames. The details of how to set and check a node for jumbo frames capability tend to be NIC device/driver-specific and are discussed below for interfaces that support them. If any one node in the subnet is not enabled for jumbo frames, no members of the subnet can operate in jumbo frame mode, regardless of their preconfiguration to support jumbo frames.

The big advantage of jumbo frames is similar to that provided by MDT. They provide a huge improvement in bulk data transfer throughput with corresponding reduction in CPU utilization, but with the addition that the same level of improvement is also available in the receive direction. Therefore, best bulk transfer results can be achieved using this mode.

The jumbo frames mode should be used with care because not all switches or networking infrastructure elements are jumbo frames capable. When you enable jumbo frames, make sure that they're contained within a subnet or VLAN where all the components in that subnet or VLAN are jumbo frames capable.

Ethernet Physical Layer

The Ethernet physical layer has developed along with the Ethernet technology. When Ethernet moved from 10 Mbit/sec to 100 Mbit/sec, there were a number of technologies and media available to provide the 100 Mbit/sec line rate. To allow those media and technologies to develop without altering the long-established Ethernet protocol, a partition was made between the media-specific portion and the Ethernet protocol portion of the overall Ethernet technology. At that partition was placed the Media Independent Interface (MII). The MII allowed Ethernet to operate over fiber-optic cables to switches built to support fiber. It also allowed the introduction of a new twisted-pair copper technology, 100BASE-T4. These differing technologies for supporting 100 Mbit/sec ultimately did not survive the test of time, leaving 100BASE-T as the standard 100 Mbit/sec media type.

The existing widespread adoption of 10 Mbit/sec Ethernet brought with it a requirement that Ethernet media for 100 Mbit/sec should allow for backward compatibility with existing 10 Mbit/sec networks. Therefore, the MII was required to support 10 Mbit/sec operation as well as 100 Mbit/sec and allow the speed to be user selectable or automatically detected or negotiated. Those requirements led to three modes of operation: forcing a particular speed setting for a link, known as Forced mode; setting the speed based on link-speed signaling, known as auto-sensing; and having both sides of the link share information about their speed and duplex capabilities and negotiate the best speed and duplex for the link, known as Auto-negotiation.

Basic Mode Control Register

For the benefit of this discussion, the MII is restricted to the registers and bits used by software to allow the Ethernet physical layer to operate in Forced or Auto-negotiation mode.

The first register of interest is the Basic Mode Control Register (BMCR). This register controls whether the link will auto-negotiate or use Forced mode. If Forced mode is chosen, then auto-negotiation is disabled and the remaining bits in the register become meaningful.

Figure 5-14. Basic Mode Control Register


  • The Reset bit is a self-clearing bit that allows the software to reset the physical layer. This is usually the first bit touched by the software in order to begin the process of synchronizing the software state with the hardware link state.

  • Speed Selection is a single bit, and it is only meaningful in Forced mode. Forced mode of operation is available when auto-negotiation is disabled. If this bit is set to 0, then the speed selected is 10 Mbit/sec. If set to 1, then the speed selected is 100 Mbit/sec.

  • When the Auto-negotiation Enable bit is set to 1, auto-negotiation is enabled, and the speed selection and duplex mode bits are no longer meaningful. The speed and Duplex mode of the link are established based on auto-sensing or auto-negotiation advertisement register exchange.

  • The Restart Auto-negotiation bit is used to restart auto-negotiation. This is required during the transition from Forced mode to Auto-negotiation mode or when the Advertisement register has been updated to a different set of auto-negotiation parameters.

  • The Duplex mode bit is only meaningful in Forced mode. When set to 1, the link is set up for full duplex mode. When set to 0, the link operates in Half-duplex mode.
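In code, these controls are bits in MII register 0. The sketch below uses the standard MII bit positions; mii_write is a hypothetical helper for a PHY register write:

    #define MII_BMCR            0          /* register address               */
    #define BMCR_RESET      (1 << 15)      /* self-clearing PHY reset        */
    #define BMCR_SPEED100   (1 << 13)      /* 1 = 100 Mbit/sec, 0 = 10       */
    #define BMCR_ANENABLE   (1 << 12)      /* 1 = auto-negotiate             */
    #define BMCR_ANRESTART  (1 << 9)       /* restart auto-negotiation       */
    #define BMCR_FULLDPLX   (1 << 8)       /* 1 = full duplex (Forced mode)  */

    /* Force 100 Mbit/sec full duplex (auto-negotiation bit cleared): */
    mii_write(phy, MII_BMCR, BMCR_SPEED100 | BMCR_FULLDPLX);

    /* Or enable and (re)start auto-negotiation: */
    mii_write(phy, MII_BMCR, BMCR_ANENABLE | BMCR_ANRESTART);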

Basic Mode Status Register

The next register of interest is the Basic Mode Status Register (BMSR). This read-only register provides the overall capabilities of the MII physical layer device. From these capabilities you can choose a subset to advertise, using the Auto-negotiation Advertisement register during the auto-negotiation process.

Figure 5-15. Basic Mode Status Register


  • When the 100BASE-T4 bit is set to 1, it indicates that the physical layer device is capable of 100BASE-T4 networking. When set to 0, it is not.

  • When the 100BASE-T Full-duplex bit is set to 1, it indicates that the physical layer device is capable of 100BASE-T full-duplex networking. When set to 0, it is not capable.

  • When the 100BASE-T Half-duplex bit is set to 1, it indicates that the physical layer device is capable of 100BASE-T half-duplex networking. When set to 0, it is not capable.

  • When the 10BASE-T Full-duplex bit is set to 1, it indicates that the physical layer device is capable of 10BASE-T full-duplex networking. When set to 0, it is not.

  • When the 10BASE-T Half-duplex bit is set to 1, it indicates that the physical layer device is capable of 10BASE-T half-duplex networking. When set to 0, it is not.

  • The Auto-negotiation Complete bit is only meaningful when the physical layer device is capable of auto-negotiation and it is enabled. Auto-negotiation Complete indicates that the auto-negotiation process has completed and that the information in the Link-Partner Auto-negotiation Advertisement register accurately reflects the link capabilities of the link partner.

  • When the Link Status bit is set to 1, it indicates that the physical link is up. When set to 0, the link is down. When used in conjunction with auto-negotiation, this bit must be set together with the Auto-negotiation Complete before the software can establish that the link is actually up. In Forced mode, as soon as this bit is set to 1, the software can assume the link is up.

  • When the Auto-negotiation Capable bit is set to 1, it indicates that the physical layer device is capable of auto-negotiation. When set to 0, the physical layer device is not capable. This bit is used by the software to establish any further auto-negotiation processing that should occur.
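A link-up test following the rules just listed might look like the sketch below, using the standard BMSR bit positions; mii_read is a hypothetical helper, and BMCR_ANENABLE is from the earlier BMCR sketch:

    #define MII_BMSR             1         /* register address            */
    #define BMSR_ANEGCOMPLETE (1 << 5)     /* auto-negotiation finished   */
    #define BMSR_ANEGCAPABLE  (1 << 3)     /* PHY can auto-negotiate      */
    #define BMSR_LSTATUS      (1 << 2)     /* link is up                  */

    uint16_t bmcr    = mii_read(phy, MII_BMCR);
    uint16_t bmsr    = mii_read(phy, MII_BMSR);
    int      autoneg = (bmcr & BMCR_ANENABLE) != 0;

    /* Auto-negotiating links need both Link Status and Auto-negotiation
     * Complete; Forced mode needs only Link Status. */
    int link_up = autoneg ?
        ((bmsr & (BMSR_LSTATUS | BMSR_ANEGCOMPLETE)) ==
                 (BMSR_LSTATUS | BMSR_ANEGCOMPLETE)) :
        ((bmsr & BMSR_LSTATUS) != 0);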

Link-Partner Auto-negotiation Advertisement Register

The next registers of interest are the Auto-negotiation Advertisement Register (ANAR) and the Link-Partner Auto-negotiation Advertisement Register (LPANAR). These two registers are at the heart of the auto-negotiation process, and they both share the same bit definitions.

The ANAR is a read/write register that can be programmed to control the link partner's view of the local link capabilities. The LPANAR is a read-only register that is used to discover the remote link capabilities. Using the information in the LPANAR together with the information available in the ANAR register, the software can establish what shared link capability has been established once auto-negotiation has completed.

Figure 5-16. Link Partner Auto-negotiation Advertisement


When the 100BASE-T4 bit is set to 1 in the ANAR, it advertises the intention of the local physical layer device to use 100BASE-T4. When set to 0, this capability is not advertised. The same bit, when set to 1 in the LPANAR, indicates that the link partner physical layer device has advertised 100BASE-T4 capability. When set to 0, the link partner is not advertising this capability.

The 100BASE-T Full-duplex, 100BASE-T Half-duplex, 10BASE-T Full-duplex, and 10BASE-T Half-duplex bits all have the same functionality as 100BASE-T4 and provide the ability to decide what link capabilities should be shared for the link. The decision process is made by the physical layer hardware and is based on priority, as shown in FIGURE 5-17. It is the result of logically ANDing the ANAR and the LPANAR on completion of auto-negotiation.

Figure 5-17. Link Partner Priority for Hardware Decision Process


Auto-negotiation in the purest sense requires that both sides participate in the exchange of ANAR. This allows both sides to complete loading of the LPANAR and establish a link that operates at the best negotiated value.
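The resolution can be sketched as an AND of the two registers followed by a priority scan. The bit positions are the standard ANAR/LPANAR layout, and the priority order shown is the conventional one (highest first); anar and lpanar are assumed to have been read from the PHY:

    #define MII_ANAR        4
    #define MII_LPANAR      5
    #define ANAR_T4       (1 << 9)    /* 100BASE-T4               */
    #define ANAR_100FULL  (1 << 8)    /* 100BASE-T full duplex    */
    #define ANAR_100HALF  (1 << 7)    /* 100BASE-T half duplex    */
    #define ANAR_10FULL   (1 << 6)    /* 10BASE-T full duplex     */
    #define ANAR_10HALF   (1 << 5)    /* 10BASE-T half duplex     */

    uint16_t anar   = mii_read(phy, MII_ANAR);
    uint16_t lpanar = mii_read(phy, MII_LPANAR);
    uint16_t common = anar & lpanar;   /* capabilities both ends share */
    int speed = 0, full_duplex = 0;

    /* Highest-priority shared capability wins. */
    if (common & ANAR_100FULL)      { speed = 100; full_duplex = 1; }
    else if (common & ANAR_T4)      { speed = 100; full_duplex = 0; }
    else if (common & ANAR_100HALF) { speed = 100; full_duplex = 0; }
    else if (common & ANAR_10FULL)  { speed = 10;  full_duplex = 1; }
    else if (common & ANAR_10HALF)  { speed = 10;  full_duplex = 0; }
    else { /* no common capability: link cannot be established */ }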

It is possible that one side, or even both sides, of the link might be operating in Forced mode instead of Auto-negotiation mode. This can happen because the new device is connected to an existing 10/100 Mbit/sec link that was never designed to support auto-negotiation or because the auto-negotiation is switched off on one or both sides.

If both sides are in Forced mode, you need to set the correct speed and duplex for both sides. If the speed is not matched, the link will not come up, so speed mismatches can be easily tracked down once the physical connection is checked and considered good. If the duplex is not matched yet the speed is matched, the link will come up, but there's an often-unnoticed gotcha in that. If one side is set to half duplex while the other is set to full duplex, the half-duplex side will operate with the Ethernet protocol Carrier Sense Multiple Access with Collision Detection (CSMA/CD) while the full-duplex side will not. This means that the full-duplex side does not adhere to the half-duplex CSMA/CD protocol and will not back off if the other side is currently transmitting. To the half-duplex side of the connection, this appears as a collision, and its transmit is stopped. These collisions will occur frequently, preventing the link from operating at its best capacity.

If one side of the connection is running Auto-negotiation mode and the other is running Forced mode and the auto-negotiating side is capable and advertising all available MII speeds and duplex settings, the link speed will always be negotiated successfully by the auto-sensing mechanism provided as part of the auto-negotiation protocol. Auto-sensing uses physical layer signaling to establish the operating speed of the Forced side of the link. This allows the link to at least come up at the correct speed. The link duplex, on the other hand, needs the Advertisement register exchange and cannot be established by auto-sensing. Therefore, if the link duplex setting on the Forced mode side of the link is full duplex, then the best guess the auto-negotiating side of the link can make is half duplex. This gives rise to the same effect discussed when both sides are in Forced mode and there's a duplex mismatch. The only solution to the issue of duplex mismatch is to be aware that it can happen and make every attempt to configure both sides of the link to avoid it.

In most cases, enabling auto-negotiation on both sides wherever possible will eliminate the duplex mismatch issue. The alternative is Forced mode, which should only be employed in infrastructures that have full-duplex configurations. Where possible, those configurations should be replaced with an auto-negotiation configuration.

There's one more MII register worthy of note. The Auto-negotiation Expansion register (ANER) can be useful in establishing whether a link partner is capable of auto-negotiation or not, and providing information about the auto-negotiation algorithm.

Figure 5-18. Auto-negotiation Expansion Register


The Parallel Detection Fault bit indicates that the auto-sensing part of the auto-negotiation protocol was unable to establish the link speed, and the regular ANAR exchange was also unsuccessful in establishing a common link parameter. Therefore auto-negotiation failed. If this condition happens, the best course of action is to check each side of the link manually and ensure that the settings are mutually compatible.

Gigabit Media Independent Interface

As time progressed, Ethernet was increased in speed by another multiple of 10 to give 1000 Mbit/sec or 1 Gbit/sec. The MII remained and was extended to support the new 1 Gbit/sec operation, giving rise to the Gigabit Media Independent Interface (GMII).

The GMII was first implemented using a fiber-optic physical layer known as 1000BASE-X and was later extended to support twisted-pair copper, known as 1000BASE-T. Those extensions led to additional bits in the registers of the MII specification and some completely new registers, giving a GMII register set definition.

The first register to be extended was the BMCR, because it can be used to force speed, and the ability to force 1-gigabit operation had to be added. All existing bit definitions were maintained, with one bit added from the existing reserved bits to allow the enumeration of the different speeds that can now be forced with GMII devices.

Figure 5-19. Extended Basic Mode Control Register


The next register of interest was the BMSR. This register was extended to indicate to the driver software that there are more registers that apply to 1-gigabit operation.

Figure 5-20. Basic Mode Status Register


When the 1000BASE-T Extended Status bit is set, that's the indication to the driver software to look at the new 1-gigabit operating registers. These function similarly to the Basic Mode Status Register and the ANAR.

The Gigabit Extended Status Register (GESR) is the first of the gigabit operating registers. Like the BMSR, it gives an indication of the types of gigabit operation the physical layer device is capable of.

Figure 5-21. Gigabit Extended Status Register


The 1000BASE-X full duplex indicates that the physical layer device is capable of operating with 1000BASE-X fiber media with full-duplex operation.

The 1000BASE-X half duplex indicates that the physical layer device is capable of operating with 1000BASE-X fiber media with half-duplex operation.

The 1000BASE-T full duplex indicates that the physical layer device is capable of operating with 1000BASE-T twisted-pair copper media with full-duplex operation.

The 1000BASE-T half duplex indicates that the physical layer device is capable of operating with 1000BASE-T twisted-pair copper media with half-duplex operation.

The information provided by the GESR gives the possible 1-gigabit capabilities of the physical layer device. From that information you can choose the gigabit capabilities that will be advertised through the Gigabit Control Register (GCR). In the case of twisted-pair copper physical layer, there is also the ability to advertise the Clock Mastership.

Figure 5-22. Gigabit Control Status


Clock mastership is a new concept that only applies to copper media running at 1 gigabit. At such high signaling frequencies, it becomes increasingly difficult to continue to have separate clocking for the remote and the local physical layer devices. Hence, a single clocking domain was introduced, which the remote and the local physical layer devices share while a link is established. To achieve the single clocking domain, only one end of the connection provides the clock (the link master), and the other (the link slave) simply uses it. The Gigabit Control Register bits Master/Slave Manual Config Enable and Master/Slave Config Value control how your local physical layer device behaves in this master/slave relationship.

When the Master/Slave Manual Config Enable bit is set, the master/slave configuration is controlled by the Master/Slave Config Value bit. When it is cleared, the master/slave configuration is established during auto-negotiation by a clock-learning sequence, which automatically establishes a clock master and slave for the link.

Typically in a network, the master is the switch port and the slave is the end port or NIC.

The Master/Slave Config Value setting is only meaningful when the Master/Slave Manual Config Enable bit is set. If set to 1, it forces the local clock mastership setting to be Master. If set to 0, the local clock becomes the Slave.

When using the Master/Slave manual configuration, take care to ensure that the link partner is set accordingly. For example, if 1-gigabit Ethernet switches are set up to operate as link masters, then the computer system attached to the switches should be set up as a slave.

Figure 5-23. Gigabit Status Register


When the driver fills in the bits in the GCR, it's equivalent to filling in the ANAR in MII: It controls the 1-gigabit capabilities that are advertised. Likewise, the GSR is like the LPANAR, providing the capabilities of the link partner. The register definition for the GSR is similar to that of the GCR.

With GMII operation, once auto-negotiation is complete, the contents of the GCR are compared with those in the GSR and the highest-priority shared capability is used to decide the gigabit speed and duplex.

It is possible to disable 1-gigabit operation. In that case, the shared capabilities must be found in the MII registers as described above.

In GMII mode, at the end of auto-negotiation, once the GCR and GSR are compared and the ANAR and LPANAR are compared, then the choice of the operating speed and duplex is established by the hardware based on the following descending priority:

Figure 5-24. GMII Mode Link Partner Priority


Once the correct setting is established, the device software makes that setting known to the user through kernel statistics. It is also possible to manipulate the configuration using the ndd utility.

Ethernet Flow Control

One area of MII/GMII that appeared after the initial definition of MII but before GMII was the introduction of Ethernet Flow Control. Ethernet Flow Control is a MAC Layer feature that controls the rate of packet transmission in both directions.

Figure 5-25. Flow Control Pause Frame Format


The key to this feature is the use of MAC control frames known as pause frames, which have the following format:

  • The Destination Address is a 6-byte address defined for Ethernet Flow Control as the multicast address 01:80:C2:00:00:01.

  • The Source Address is a 6-byte address that is the same as the Ethernet station address of the producer of the pause frame.

  • The Protocol Type Field is a 2-byte field set to the MAC control protocol type, 0x8808. Pause capability is one example of the usage of the MAC control protocol.

  • The MAC Control Pause Opcode is a 2-byte value, 0x0001, that indicates the type of MAC control feature to be used, in this case pause.

  • The MAC Control Pause Parameter is a 2-byte value that indicates whether flow control is being started (referred to as XOFF) or stopped (XON). When the MAC Control Pause Parameter is nonzero, you have an XOFF pause frame; when the value is 0, you have an XON pause frame. The value of the parameter is in units of slot time.
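Laid out as a C structure, the pause frame described in this list looks like the following; all values are carried in network byte order on the wire, and the pad brings the frame up to the 60-byte minimum (the 4-byte FCS then completes the 64-byte minimum frame):

    #include <stdint.h>

    typedef struct pause_frame {
        uint8_t  dst[6];        /* 01:80:C2:00:00:01 multicast         */
        uint8_t  src[6];        /* sender's Ethernet station address   */
        uint16_t type;          /* 0x8808: MAC control protocol        */
        uint16_t opcode;        /* 0x0001: pause                       */
        uint16_t pause_time;    /* slot times; 0 = XON, nonzero = XOFF */
        uint8_t  pad[42];       /* pad to the 60-byte minimum (no FCS) */
    } pause_frame_t;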

To understand the Flow Control capability, consider symmetric flow control first. With symmetric flow control, a network node can generate flow control frames or react to flow control frames.

Generating a flow control frame is known as Transmit Pause capability and is triggered by congestion on the Rx side. The Transmit Pause sends an XOFF flow control message to the link partner, which should react to pause frames (Receive Pause capability). By reacting to pause frames, the link partner uses the transmitted pause parameter as a duration that its transmitter should remain silent while the Rx congestion clears. If the Rx congestion clears within that pause parameter period, an XON flow control message can be transmitted, telling the link partner that the congestion has cleared and transmission can continue as normal.

In many cases, Flow Control capability is available in only one direction. This is known as Asymmetric Flow Control. This might be a configuration choice or simply a result of a hardware design.

Therefore, the MII/GMII specification was altered to allow Flow Control capability to be advertised to a link partner along with the best type of flow control to be used for the shared link. The changes were applied to the ANAR along with two new bits: Pause Capability and Asymmetric Pause Capability. FIGURE 5-26 shows the updated register.

Figure 5-26. Link Partner Auto-negotiation Advertisement Register


Starting with the Asymmetric Pause Capability bit: if this bit is set to 0, the ability to pause is managed by the Pause Capability bit alone, and setting Pause Capability to 1 indicates the local ability to pause in both the Rx and Tx directions. If the Asymmetric Pause Capability bit is set to 1, it indicates the local ability to pause in either the Rx or the Tx direction, and the Pause Capability bit selects which. When Pause Capability is set to 1, the local setting is Receive flow control; in other words, reception of XOFF can stop the local transmitter, and XON can restart it. When set to 0, the local setting is Transmit flow control, which means that when the local Rx becomes congested, it will transmit XOFF, and once the congestion clears, it can transmit XON.

Figure 5-27. Rx/Tx Flow Control in Action
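In terms of the ANAR, this behavior is advertised with two bits, conventionally bits 10 and 11 in the standard layout. The sketch below shows the three configurations just described:

    #define ANAR_PAUSE     (1 << 10)   /* Pause Capability            */
    #define ANAR_ASMPAUSE  (1 << 11)   /* Asymmetric Pause Capability */

    uint16_t anar;

    /* Symmetric flow control: pause in both Rx and Tx directions. */
    anar = ANAR_PAUSE;

    /* Asymmetric, Receive flow control: honor incoming XOFF/XON. */
    anar = ANAR_ASMPAUSE | ANAR_PAUSE;

    /* Asymmetric, Transmit flow control: send XOFF on Rx congestion. */
    anar = ANAR_ASMPAUSE;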


Now that the Pause Capability and Asymmetric Pause Capability are established, these parameters must be advertised to the link partner so that the pause setting to be used for the link can be negotiated.

TABLE 5-16 enumerates all the possibilities for resolving the pause capabilities for a link.

Table 5-16. Possibilities for Resolving Pause Capabilities for a Link

  Local Device             Remote Device                    Link Resolution
  ----------------------   ------------------------------   -------------------------
  cap_pause  cap_asmpause   lp_cap_pause  lp_cap_asmpause    link_pause  link_asmpause
  0          0              X             X                  0           X
  0          1              0             X                  0           0
  0          1              1             0                  0           0
  1          0              0             X                  0           0
  1          0              1             X                  1           0
  1          1              0             0                  0           0
  1          1              1             X                  1           0


The link_pause and link_asmpause parameters have the same meanings as the cap_pause and cap_asmpause parameters and enumerate meaningful information for a link given the pause capabilities available for both sides of the link.

Example 1

cap_asmpause = 1

The device is capable of asymmetric pause.

cap_pause = 0

The device will send pauses if the Receive side becomes congested.

lp_cap_asmpause = 0

The link partner is capable of symmetric pause only.

lp_cap_pause = 1

The link partner will send pauses if its receive side becomes congested, and it will respond to a pause by disabling transmit.

link_asmpause = 0

Because both the local and remote partners are set to send a pause on congestion, but only the remote partner will respond to one, the link resolves to no flow control: alleviating Rx congestion requires both ends to stop transmitting when asked.

link_pause = 0

Further indication that no meaningful flow control is happening on the link.


Example 2

cap_asmpause = 1

The device is capable of asymmetric pause.

cap_pause = 1

The device will respond to received pause frames by halting its transmitter.

lp_cap_asmpause = 0

The link partner is capable of symmetric pause only.

lp_cap_pause = 1

The link partner will send pauses if its receive side becomes congested, and it will respond to a pause by disabling transmit.

link_asmpause = 1

Because the local setting is to stop sending on arrival of a flow control message, and the remote end is set to send flow control messages when it becomes congested, we have flow control in the receive direction of the link. Hence it's asymmetric.

link_pause = 1

The direction of the pauses is incoming.


There are more examples of flow control from the table that can be discussed in terms of flow control in action. We'll return to this topic when discussing individual devices that support this feature.

This concludes all the options for controlling the configuration of the Ethernet physical layer MII/GMII. The preceding information should prove useful when configuring your network and making sure that each Ethernet link comes up as required by the configuration.

Finally, there is a perception that auto-negotiation has difficulties, but most of these were cleared up with the introduction of Gigabit Ethernet technology. Therefore it is no longer required to disable auto-negotiation to achieve reliable operation with available Gigabit switches.
