3.5 Peripheral Interconnects

The last means of interconnection we need to discuss is the one between the bus or crossbar that connects processors to main memory (see Section 3.4.1 earlier in this chapter) and the buses that service peripheral devices.

There are two bus architectures in widespread use: Sun's SBus and the industry-standard Peripheral Component Interconnect, or PCI. SBus is being phased out in favor of the PCI standard, but I discuss it here in light of its very large installed base.

3.5.1 SBus

SBus and PCI are fairly similar architecturally: both are designed to be I/O buses (rather than more general-purpose interconnects), both have small form factors that require a high degree of integration, and both have roughly similar performance characteristics. If PCI had been available in 1988 when Sun was developing SBus for the SPARCstation 1, SBus probably would not have been developed. Every Sun system implements SBus slightly differently, making a different set of trade-offs among performance, cost, ease of implementation, and so on. In practice, however, all current SBus implementations can meet almost all requirements.

SBus is a parallel-transfer (many bits are transferred concurrently) bus architecture. In the original specification, there were 32 address lines and 32 data lines; the Rev B.0 SBus specification permits the address lines to be used during the data cycle to transfer 64 bits; this is implemented in the SPARCstation 10SX, the SPARCstation 20, and all the UltraSPARC-based systems. This arrangement means that 64-bit and 32-bit SBus implementations share the same form factor. However, bus width (while an important factor) is often overestimated in its relevance to performance; very few devices can exceed the 32-bit bus capacity. These are largely very high performance graphics framebuffers and very fast network interfaces (faster than about 250 Mbits/second).

3.5.1.1 Clock speed

In most older Sun systems, the SBus clock is derived from the system clock, either directly or through a divider. Most SPARCstation 10 systems drive the system clock at 40 MHz, and the SBus clock is set at 20 MHz. [18] SPARCstation 20 systems operate the system clock at 50 MHz and the SBus at 25 MHz. The SPARCserver 1000 and SPARCcenter 2000 run the SBus at 20 MHz, independent of any other factors. The microSPARC and microSPARC-II systems run the SBus by dividing the processor clock. This is summarized in Table 3-6.

[18] The exception is the early models (the SPARCstation 10 Model 20 and Model 30), which had system clocks of 33 MHz and 36 MHz; their SBus clocks are therefore set to 16.5 MHz and 18 MHz, respectively. Most of these systems have been upgraded to the newer revisions of the SPARCstation 10 boards.

Table 3-6. SBus clock rates for microSPARC systems

  CPU clock    Divider    SBus clock
  50 MHz       2:1        25 MHz
  70 MHz       3:1        23.3 MHz
  85 MHz       4:1        21.25 MHz
  110 MHz      5:1        22 MHz

Most newer Sun systems (SPARCserver 1000E, SPARCcenter 2000E, and the UltraSPARC-based systems) drive the SBus at 25 MHz, regardless of any other factors.

3.5.1.2 Burst transfer size

The SBus architecture allows data to be transferred in bursts, which are defined as the amount of data that the bus can accept each time a device arbitrates for the bus; a burst is roughly analogous to the maximum transmission unit on a network. Every time a data transfer is initiated, the bus interface hardware accepts one burst's worth of data; it then arbitrates for control of the bus, transmits the target address, and transfers as much data per clock cycle as the bus width allows. This influences the efficiency of bus transactions: on a platform that implements a 16-byte burst size, transferring 2 KB on the bus requires 128 arbitration/address/data transfer cycles, compared to just 32 such cycles with a 64-byte burst size.
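
A quick back-of-the-envelope check of those cycle counts (the 2 KB transfer is simply an illustrative figure; each burst costs one arbitrate/address/data sequence):

 % expr 2048 / 16
 128
 % expr 2048 / 64
 32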

SBus bursts are defined to be 1, 2, 4, 8, 16, 32, and 64 bytes in length; UltraSPARC systems and the later MBus-based desktop systems also support a 128-byte burst length. The maximum supported burst size, however, depends on the hardware platform. The SPARCstation 2, SPARCstation 10, SPARCstation 20, and the SPARCsystem 600MP all support 32-byte bursts; all other non-UltraSPARC systems support 16-byte bursts. The UltraSPARC systems support all burst sizes. The burst size is negotiated at boot time, on a per-slot basis, to the smaller of the maximum size supported by the host SBus and the maximum size supported by the card's controller.
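
On Solaris, the burst sizes supported by a device are usually visible as properties in the OpenBoot device tree. A minimal sketch; my reading is that burst-sizes is a bitmask in which bit n indicates support for 2^n-byte bursts, and the property name can vary by device:

 % prtconf -pv | egrep -i 'name:|burst-sizes'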

3.5.1.3 Transfer mode

Every SBus implementation offers at least programmed I/O (PIO) and consistent mode direct virtual memory access (DVMA). [19] In PIO, the processor uses load and store instructions to transfer data on the bus. This is primarily used for managing control and status registers on the SBus boards, but a handful of boards implement PIO for data transfer. Because it requires the processor's intervention and because the burst sizes are necessarily smaller (non-UltraSPARC processors define load and store instructions in quantities up to 8 bytes only; UltraSPARC processors have 64-byte block load/store instructions), PIO is almost always more expensive than DVMA. DVMA, however, has a fairly complex setup and teardown process for each I/O operation; therefore, small transfers are more efficiently handled by PIO.

[19] DVMA is very similar to the more common direct memory access (DMA) method. They differ only in that DVMA specifies the address of the memory transaction in virtual space, rather than the physical address.

Some SBus implementations provide an additional transfer mode, called streaming mode DVMA. Streaming mode DVMA provides significantly faster transfers, especially for reads; however, it requires explicit support in the device driver to keep the transfer hardware consistent with shared memory. Most device drivers implement streaming mode support.

3.5.1.4 Summary of SBus implementations

Table 3-7 briefly summarizes the SBus implementations in many SPARC-based systems.

Table 3-7. SBus implementation details

  Platform                                Burst size   Bus width    Clock          Streaming?   Speed (read/write)
  SPARCstation 1                          16 bytes     32 bits      20 MHz         No           12/20 MB/sec
  SPARCstation 1+, IPC                    16 bytes     32 bits      25 MHz         No           15/25 MB/sec
  SPARCstation 2, IPX                     32 bytes     32 bits      20 MHz         No           15/32 MB/sec
  SPARCstation 10, SPARCstation 600       32 bytes     32 bits      20 MHz         No           32/52 MB/sec
  LX, Classic                             16 bytes     32 bits      25 MHz         No           17/27 MB/sec
  SPARCstation 4, SPARCstation 5          16 bytes     32 bits      21.25-25 MHz   No           36/55 MB/sec
  SPARCstation 10 SX                      128 bytes    32/64 bits   20 MHz         No           40/95 MB/sec
  SPARCstation 20                         128 bytes    32/64 bits   25 MHz         No           62/100 MB/sec
  SPARCserver 1000, SPARCcenter 2000      64 bytes     32 bits      20 MHz         Yes          45/50 MB/sec
  SPARCserver 1000E, SPARCcenter 2000E    64 bytes     32 bits      25 MHz         Yes          56/62 MB/sec
  All UltraSPARC systems                  128 bytes    32/64 bits   25 MHz         Yes          90/120 MB/sec

The SBus I/O boards used for the Ultra Enterprise series of systems are separate from the processor/memory boards. There are two types of SBus I/O boards; one is designed for graphics cards, whereas the other is more general purpose.

The general-purpose SBus I/O board has a dual-ported Fibre Channel interface, a Fast Ethernet port, and a single-ended Fast Wide SCSI-2 port, as well as three 64-bit SBus slots. The board implements two independent SBuses: one is responsible for the Fast Ethernet and SCSI interfaces, along with one of the SBus slots; the other services the Fibre Channel interfaces and the remaining two SBus slots. [20] The graphics I/O board replaces one of the SBus slots with a UPA slot, which means there is only a single SBus controller. Both SBus slots are attached to this SBus, along with all the devices integrated into the I/O board; the UPA port is completely independent.

[20] There are actually three types of general-purpose SBus I/O cards. One type (501-4287) has a pair of 25 MB/s Fibre Channel interfaces, and supports only backplane speeds up to 83 MHz. Both of the other types have 100 MB/s Fibre Channel interfaces, implemented via GBIC slots; the only difference between these two is that one (501-4266) runs at 83 MHz, and the other (501-4883) can go up to 100 MHz. However, they all have two independent SBuses, which are broken up as described.

3.5.1.5 SBus card utilization

Table 3-8 provides a general idea of SBus cards and how much bandwidth they typically consume. Of particular interest is the fact that SBus framebuffers are very significant consumers of SBus bandwidth.

Table 3-8. SBus bandwidth utilization by card

  Description                               Typical bandwidth   Transfer type   Width (bits)
  Fast/Wide SCSI-2                          8-10 MB/sec         DVMA            32
  25 MB/s Fibre Channel                     16 MB/sec           DVMA            64
  Fast/Wide SCSI-2 + Fast Ethernet (hme)    12 MB/sec           DVMA            64
  155 Mb/s ATM (Revision 1)                 10 MB/sec           DVMA            32
  155 Mb/s ATM (Revision 2)                 10 MB/sec           DVMA            64
  655 Mb/s ATM                              50-100 MB/sec       DVMA            64
  Fast Ethernet (Revision 2) [21]           4 MB/sec            DVMA            64
  Quad Fast Ethernet                        20 MB/sec           DVMA            64
  Gigabit Ethernet                          40 MB/sec           DVMA            64
  100 Mb/s FDDI                             6-8 MB/sec          DVMA            32
  16 Mb/s Token Ring                        1 MB/sec            PIO             32
  GX, GX+ framebuffer                       4-12 MB/sec         PIO             32
  TurboGX, TurboGX+ framebuffers            8-20 MB/sec         PIO             32
  ZX framebuffers                           20-30 MB/sec        DVMA            32

[21] Revision 1 Fast Ethernet cards, which use the be driver, are not supported past Solaris 2.6. If possible, replace them with Revision 2 cards, which use the hme driver.

An SBus should typically not be more than 80 percent utilized, in order to avoid bus contention and arbitration problems that may degrade performance.

3.5.2 PCI

PCI, or the Peripheral Component Interconnect bus, is probably the most widespread bus design in modern computer systems. It was initially introduced into the Intel-based PC world in 1994, and has largely replaced all the older peripheral bus architectures. It is now widely supported by workstation vendors as well. PCI is a standard under the auspices of the PCI Special Interest Group, or PCI SIG (http://www.pcisig.com).

PCI is a synchronous bus architecture, which means that all data transfers are performed relative to a system clock. The initial PCI specification permitted a maximum clock rate of 33 MHz, but the later Revision 2.1 specification extended this to 66 MHz. Most personal computers support only the 33 MHz standard; the primary application of 66 MHz slots is in very high speed networking (e.g., full-duplex Gigabit Ethernet, which greatly benefits from a 66 MHz slot speed). PCI supports a 32-bit multiplexed address and data bus, [22] as well as having architectural support for a 64-bit data bus through a longer connector slot; as with 66 MHz support, 64-bit cards are not widely available in the PC market (see Figure 3-6).

[22] This multiplexing allows for a reduced pin count on the PCI connector, which translates to a smaller physical size and lower cost. Typical 32-bit PCI boards use about 50 pins.

Figure 3-6. PCI connector types
figs/spf2_0306.gif

The high speed of the PCI bus (up to 528 MB/second with a 64-bit data path and a 66 MHz clock) limits the number of expansion slots on a single bus to no more than three or four, due to electrical concerns. To permit an expansion bus to have more slots, the PCI SIG defined a PCI-to-PCI bridge mechanism in PCI Revision 2.1. These bridges are chips that electrically isolate two PCI buses while still allowing bus transfers to be forwarded from one bus to the other. Multiple bridge devices can be cascaded to create a system with many PCI buses.
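
The peak bandwidth figures follow directly from the data path width and the clock rate; a quick sketch of the arithmetic (peak numbers only, ignoring arbitration and wait states): a 32-bit (4-byte) path at 33 MHz yields 132 MB/sec, and a 64-bit (8-byte) path at 66 MHz yields 528 MB/sec.

 % expr 4 \* 33
 132
 % expr 8 \* 66
 528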

Multiboard Sun Enterprise systems support a PCI I/O board, which holds two PCI cards. Each PCI slot is on a separate PCI bus, and the Fast Ethernet and Fast/Wide SCSI interfaces are also on separate PCI buses.

3.5.2.1 PCI bus transactions

In PCI terminology, data is transferred between an initiator , which is the bus master , and a target , which is the bus slave . After the initiator arbitrates for the bus, the transfer itself consists of one address phase and any number of data phases. During the address phase of the transfer, the initiator signals what type of transfer is being done (read from memory, write to memory, I/O read, I/O write, etc.). Either the initiator or the target can insert wait states into the transfer; in addition, the initiator or target may terminate the transfer at any time.

PCI also supports a well-defined automatic configuration mechanism. Each PCI device includes a set of registers that contain configuration data. These registers identify the type of the card (SCSI, Ethernet, framebuffer, etc.), as well as its manufacturer, its interrupt level, and so on.
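
On a Linux system, you can inspect this configuration data with lspci; a brief sketch (the device address 00:0d.0 is just a placeholder for whatever lspci reports on your system):

 % lspci                  # one line per device: bus address, class, vendor, and model
 % lspci -v -s 00:0d.0    # verbose details for one device, including its IRQ
 % lspci -x -s 00:0d.0    # hex dump of the standard portion of its configuration space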

Finally, PCI supports both 5-volt and 3.3-volt signaling levels. However, most early PCI implementations were 5-volt only and did not provide power on the 3.3-volt pins. Over time, the 3.3-volt interface will be used more heavily. A "keying" scheme is incorporated in the PCI connector standard to prevent inserting a board into a slot that doesn't supply the correct voltage.

3.5.2.2 CompactPCI

One variant of PCI, CompactPCI, was designed as an industrial bus for applications such as real-time machine control, data acquisition, instrumentation, and telecommunications. Compared to the standard desktop PCI interface, CompactPCI supports twice as many slots (eight versus four) and offers a significantly more robust form factor. CompactPCI cards are designed to front-load into a card cage; this, along with more reliable pin-and-socket connectors, takes full advantage of CompactPCI's hot-swap capability. CompactPCI devices should perform equivalently to comparable PCI devices. However, as CompactPCI cards are just beginning to be introduced into the wider workstation/server marketplace, their long-term future is unclear.

3.5.3 A Summary of Peripheral Interconnects

Table 3-9 provides a summary of peripheral interconnects; the emphasis is on the PC hardware arena.

Table 3-9. Peripheral interconnect comparison

  Bus type   Year   Width           Clock rate      Peak bandwidth       Burst transfers?
  ISA        1984   16 bits         8 MHz           8 MB/sec             No
  MCA        1987   32 bits         10 MHz          40 MB/sec            Yes
  EISA       1988   32 bits         8 MHz           32 MB/sec            Yes
  SBus       1988   32 or 64 bits   Up to 25 MHz    Up to 200 MB/sec     Yes
  VESA       1992   32 bits         33 MHz          132 MB/sec           Yes
  PCI        1992   32 or 64 bits   33 or 66 MHz    132 to 528 MB/sec    Yes

3.5.4 Interrupts in Linux

You can get a list of all the interrupts in use on a Linux system by inspecting the file /proc/interrupts:

 % cat /proc/interrupts
            CPU0
   0:  216651906    XT-PIC  timer
   1:       1965    XT-PIC  keyboard
   2:          0    XT-PIC  cascade
   8:          1    XT-PIC  rtc
   9:   96370488    XT-PIC  eth0
  14:  381136978    XT-PIC  ide0
  15:   60751177    XT-PIC  ide1
 NMI:          0
 ERR:          0
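
The counts in /proc/interrupts are cumulative since boot. To see which devices are interrupting right now, take two snapshots a few seconds apart and compare them; a minimal sketch:

 % cat /proc/interrupts > /tmp/irq.1; sleep 10; cat /proc/interrupts > /tmp/irq.2
 % diff /tmp/irq.1 /tmp/irq.2    # lines that differ are the IRQs that fired during the interval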

The most useful way to adjust the interrupt distribution on a Linux system is the irqtune application, available through http://www.best.com/~cae/irqtune/ at the time of this writing. It allows you to specify interrupts that should be serviced at higher priority. Interrupt priority determines which interrupt is handled first when multiple interrupts occur concurrently. In general, priority decreases as the IRQ number increases; the system timer (IRQ 0) therefore has priority over all other IRQs. irqtune addresses this by giving the specified IRQs priority 0, the highest. To invoke it, run irqtune and list the interrupts to prioritize. For example, to prioritize IRQs 9 and 14 (the eth0 and ide0 interfaces in the system described previously):

 # irqtune 9 14

3.5.5 Interrupts in Solaris

Sun enterprise-class systems (the SPARCserver 1000, the SPARCcenter 2000, and the Ultra Enterprise series) have many options for managing SBus devices. Since Solaris 2.3, the operating system has used a static interrupt distribution mechanism: each I/O device is assigned to a specific processor, and every time that device raises an interrupt, the assigned CPU handles the work. The interrupt assignment table is set up at boot time and whenever a processor is put into, or removed from, the no-interrupt state (see Section 3.6.7 later in this chapter). Currently, there is no way to manually assign the interrupts from a specific device to a specific processor. The alternative, known as round robin distribution, causes each interrupt to be handled by the "next" processor; it is enabled by setting the do_robin kernel tunable to 1.
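
A minimal sketch of enabling round robin distribution, assuming do_robin can be set from /etc/system like other kernel tunables (the change takes effect at the next boot):

 # echo 'set do_robin = 1' >> /etc/system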

Round robin distribution causes the processor caches to be utilized much less efficiently, because a copy of the device-specific interrupt handling code must reside in each processor's cache, but it also results in interrupts being scheduled faster.

Static interrupt processing generally means that high-performance boards should be spread across as many processors and SBus interfaces as possible. The Quad Fast Ethernet (QFE) boards are particularly known for generating very large numbers of interrupts (up to about 20,000 per second under heavy load). Within an SBus, each slot is treated identically. Because of this, placing multiple instances of an SBus card on the same SBus can sometimes improve performance: identical boards share the same interrupt handling code, which improves the chances that the code is resident in the processor's cache. On the other hand, you can easily overload the bandwidth capacity of the SBus. Also note that because interrupts are statically assigned, heavily loading a processor can restrict I/O performance from the devices assigned to that processor. If one CPU is taking a large number of interrupts and is regularly at or near 100% utilization, moving SBus boards onto another SBus may provide a significant boost; a simpler solution to try first is enabling round robin distribution.
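
mpstat is a quick way to spot a processor that is saturated with interrupt work; a sketch of how I would look for one (column names as reported by the Solaris mpstat):

 % mpstat 5

Each report covers the previous five-second interval (the first report is the average since boot); a CPU whose intr and ithr columns are far above its peers and whose idl column is near zero is a candidate for having some of its devices moved elsewhere.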

One common approach to managing interrupt distributions in multiple-CPU systems is to use a small number of CPUs to handle all interrupts, and forbid other processors from taking interrupts. This is accomplished by using the psradm command to put processors into no-intr mode (see Section 3.6.7 later in this chapter). This command lets you segment your workload at a fairly fine level.
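
A minimal sketch using psradm and psrinfo (the processor IDs are placeholders):

 # psrinfo              # list processors and their current states
 # psradm -i 1 2 3      # put processors 1-3 into no-intr mode; they keep running threads
 # psradm -n 1          # return processor 1 to the normal online (interruptible) state

Note that psradm requires root privilege and that the setting does not persist across a reboot unless it is reapplied from a startup script.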


