Background: I/O Subsystem Bottlenecks

New I/O buses are typically developed in response to changing system requirements and to promote lower-cost implementations. Current-generation I/O buses such as PCI are rapidly falling behind the capabilities of other system components such as processors and memory. Some of the reasons why the I/O bottlenecks are becoming more apparent are described below.

Server Or Desktop Computer: Three Subsystems

A server or desktop computer system consists of three major subsystems:

  1. Processor (in servers, there may be more than one)

  2. Main DRAM Memory. There are a number of different synchronous DRAM types, including SDRAM, DDR, and Rambus.

  3. I/O (Input/Output devices). Generally, all components that are not processors or DRAM are lumped together in this subsystem group. This would include such things as graphics, mass storage, legacy hardware, and the buses required to support them: PCI, PCI-X, AGP, USB, IDE, etc.

CPU Speed Makes Other Subsystems Appear Slow

Because of improvements in CPU internal execution speed, processors are more demanding than ever when they access external resources such as memory and I/O. Each external read or write by the processor represents a huge performance hit compared to internal execution.

Multiple CPUs Aggravate The Problem

In systems with multiple CPUs, such as servers, the problem of accessing external devices becomes worse because of competition for access to system DRAM and the single set of I/O resources.

DRAM Memory Keeps Up Fairly Well

Although it is external to the processor(s), system DRAM memory keeps up fairly well with the increasing demands of CPUs for a couple of reasons. First, the performance penalty for accessing external memory is mitigated by the use of internal processor caches. Modern processors generally implement multiple levels of internal caches that run at the full CPU clock rate and are tuned for high "hit rates". Each fetch from an internal cache eliminates the need for an external bus cycle to memory.
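
To make the effect of caching concrete, the sketch below computes an average memory access time from an assumed hit rate and assumed cache and DRAM latencies; the numbers are illustrative only and are not taken from the text. Even a small miss rate dominates the average, which is why processors are tuned for high hit rates.

```c
/* Illustrative sketch (assumed example numbers, not from the text):
 * average memory access time with a single cache level.
 *   AMAT = hit_rate * cache_latency + (1 - hit_rate) * memory_latency
 */
#include <stdio.h>

int main(void) {
    double hit_rate      = 0.95;    /* assumed cache hit rate                 */
    double cache_cycles  = 2.0;     /* assumed on-chip cache latency (cycles) */
    double memory_cycles = 200.0;   /* assumed external DRAM latency (cycles) */

    double amat = hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles;
    printf("Average access time: %.1f CPU cycles\n", amat);   /* ~11.9 cycles */
    return 0;
}
```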

In addition, in cases where an external memory fetch is required, DRAM technology and the use of synchronous bus interfaces to it (e.g. DDR, RAMBUS, etc.) have allowed it to maintain bandwidths comparable with the processor external bus rates.

I/O Bandwidth Has Not Kept Pace

While the processor internal speed has raced forward, and memory access speed has managed to follow along reasonably well with the help of caches, I/O subsystem evolution has not kept up.

This Slows Down The Processor

Although external DRAM accesses by processors can be minimized through the use of internal caches, there is no way to avoid external bus operations when accessing I/O devices. The processor must perform small, inefficient external transactions which then must find their way through the I/O subsystem to the bus hosting the device.

It Also Hurts Fast Peripherals

Similarly, bus master I/O devices using PCI or other subsystem buses to reach main memory are also hindered by the lack of bandwidth. Some modern peripheral devices (e.g. SCSI and IDE hard drives) are capable of running much faster than the buses they live on. This represents another system bottleneck, and it is a particular problem for applications that emphasize time-critical movement of data through the I/O subsystem over CPU processing.

Reducing I/O Bottlenecks

Two important schemes have been used to connect I/O devices to main memory. The first is the shared bus approach, as used in PCI and PCI-X. The second involves point-to-point component interconnects, and includes some proprietary buses as well as open architectures such as HyperTransport. These are described here, along with the advantages and disadvantages of each.

The Shared Bus Approach

Figure 1-1 on page 12 depicts the common "North-South" bridge PCI implementation. Note that the PCI bus acts as both an "add-in" bus for user peripheral cards and as an interconnect bus to memory for all devices residing on or below it. Even traffic to and from the USB and IDE controllers integrated in the South Bridge must cross the PCI bus to reach main memory.

Figure 1-1. Typical PCI North-South Bridge System

Until recently, the topology shown in Figure 1-1 on page 12 has been very popular in desktop systems for a number of reasons, including:

  1. A shared bus reduces the number of traces on the motherboard to a single set.

  2. All of the devices located on the PCI bus are only one bridge interface away from the principal target of their transactions: main DRAM memory.

  3. A single, very popular protocol (PCI) can be used for all embedded devices, add-in cards, and chipset components attached to the bus.

Unfortunately, some of the things that made this topology so popular also have made it difficult to fix the I/O bandwidth problems which have become more obvious as processors and memory have become faster.

A Shared Bus Runs At Limited Clock Speeds

The fact that multiple devices (including PCB connectors) attach to a shared bus means that trace lengths and electrical complexity will limit the maximum usable clock speed. For example, a generic PCI bus has a maximum clock speed of 33MHz; the PCI Specification permits increasing the clock speed to 66MHz, but the number of devices/connectors on the bus is very limited.
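
These clock and width limits translate directly into peak bandwidth. The sketch below applies the simple width-times-clock arithmetic to the conventional PCI configurations mentioned above; these are theoretical peaks, and real throughput is lower once arbitration, wait states, and retries are taken into account.

```c
/* Peak (theoretical) bandwidth of a shared parallel bus: width x clock rate,
 * one transfer per clock. Real throughput is lower due to arbitration,
 * wait states, and retries. */
#include <stdio.h>

static double peak_mb_per_s(int width_bits, double clock_mhz) {
    return (width_bits / 8.0) * clock_mhz;   /* bytes per transfer x MT/s = MB/s */
}

int main(void) {
    /* PCI clock is nominally 33.33MHz (66.66MHz for the 66MHz variant). */
    printf("PCI 32-bit @ 33MHz: %.0f MB/s\n", peak_mb_per_s(32, 33.33));  /* ~133 MB/s */
    printf("PCI 64-bit @ 66MHz: %.0f MB/s\n", peak_mb_per_s(64, 66.66));  /* ~533 MB/s */
    return 0;
}
```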

A Shared Bus May Be Host To Many Device Types

The requirements of devices on a shared bus may vary widely in terms of bandwidth needed, tolerance for bus access latency, typical data transfer size, etc. All of this complicates arbitration on the bus when multiple masters wish to initiate transactions.

Backward Compatibility Prevents Upgrading Performance

If a critical shared bus is based on an open architecture, especially one that defines user "add-in" connectors, then another obstacle to upgrading bus bandwidth is the need to maintain backward compatibility with all of the devices and cards already in existence. If the bus protocol is enhanced and a user installs an "older generation" card, the bus must either revert to the earlier protocol or lose compatibility with that card.

Special Problems If The Shared Bus Is PCI

As popular as it has been, PCI presents additional problems that contribute to performance limits:

  1. PCI doesn't support split transactions, resulting in inefficient retries.

  2. Because there is no limit on transaction size, a target cannot know in advance how much data will be transferred; this makes buffer sizing difficult and leads to frequent disconnects by targets. Devices are also allowed to insert numerous wait states during each data phase.

  3. All PCI transactions by I/O devices targeting main memory generally require a "snoop" cycle by CPUs to assure coherency with internal caches. This impacts both CPU and PCI performance.

  4. Its data bus scalability is very limited (32/64-bit data).

  5. Because of the PCI electrical specification (low-power, reflected-wave signaling), each PCI bus is physically limited in the number of ICs and connectors it can support at a given clock speed.

  6. PCI bus arbitration is vaguely specified. Access latencies can be long and difficult to quantify. If a second PCI bus is added (using a PCI-PCI bridge), arbitration for the secondary bus typically resides in the new bridge. This further complicates PCI arbitration for traffic moving vertically to memory.

A Note About PCI-X

Other than data bus scalability and the limited number of devices possible on each bus, the PCI-X protocol resolves many of the problems just described for PCI. For third-party manufacturers of high-performance add-in cards and embedded devices, shared-bus PCI-X is a straightforward extension of PCI that yields large bandwidth improvements (up to about 2GB/s with PCI-X 2.0).
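
Applying the same width-times-transfer-rate arithmetic used for PCI above gives a rough sense of where the "about 2GB/s" figure comes from, assuming a 64-bit bus moving data on both clock edges at an effective 266 MT/s (the PCI-X 266 mode of PCI-X 2.0); again, these are theoretical peaks.

```c
/* Same peak-bandwidth arithmetic applied to the 64-bit PCI-X data bus. */
#include <stdio.h>

int main(void) {
    double bytes = 64 / 8.0;                                      /* 64-bit bus   */
    printf("PCI-X 133 (133 MT/s): %.2f GB/s\n", bytes * 133.0 / 1000.0);  /* ~1.06 GB/s */
    printf("PCI-X 266 (266 MT/s): %.2f GB/s\n", bytes * 266.0 / 1000.0);  /* ~2.13 GB/s */
    return 0;
}
```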

The Point-to-Point Interconnect Approach

An alternative to the shared I/O bus approach of PCI or PCI-X is having point-to-point links connecting devices. This method is being used in a number of new bus implementations, including HyperTransport technology. A common feature of point-to-point connections is much higher bandwidth capability; to achieve this, point-to-point protocols adopt some or all of the following characteristics:

  • only two devices per connection.

  • low voltage, differential signaling on the high speed data paths

  • source-synchronous clocks, sometimes using double data rate (DDR)

  • very tight control over PCB trace lengths and routing

  • integrated termination and/or compensation circuits embedded in the two devices that maintain signal integrity and account for voltage and temperature effects on timing.

  • dual simplex interfaces between the devices rather than one bi-directional bus; this enables duplex operations and eliminates "turn-around" cycles (see the bandwidth sketch following this list).

  • sophisticated protocols that eliminate retries, disconnects, wait-states, etc.
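
As a rough illustration of why these characteristics raise bandwidth, the sketch below computes the peak rate of a narrow, dual simplex link with DDR clocking. The link width and clock rate are assumptions chosen for illustration (a 16-bit link at an 800MHz clock, figures HyperTransport permits), not a statement about any particular implementation.

```c
/* Peak bandwidth of a dual simplex, source-synchronous link with DDR clocking.
 * Parameters below are assumed for illustration only. */
#include <stdio.h>

int main(void) {
    int    width_bits = 16;       /* link width per direction (assumed)   */
    double clock_mhz  = 800.0;    /* link clock (assumed)                 */
    double transfers  = 2.0;      /* DDR: data moves on both clock edges  */

    double per_dir_gb = (width_bits / 8.0) * clock_mhz * transfers / 1000.0;
    printf("Per direction:      %.1f GB/s\n", per_dir_gb);         /* 3.2 GB/s */
    printf("Aggregate (duplex): %.1f GB/s\n", 2.0 * per_dir_gb);   /* 6.4 GB/s */
    return 0;
}
```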

A Note About Connectors

While connectors may or may not be defined in a point-to-point link specification, they may be designed into some implementations for board-to-board connections or for the attachment of diagnostic equipment. There is no definition of a peripheral add-in card connector for HyperTransport as there is in PCI or PCI-X.


