Section 3.2. The G5: Lineage and Roadmap


3.2. The G5: Lineage and Roadmap

As we saw earlier, the G5 is a derivative of IBM's POWER4 processor. In this section, we will briefly look at how the G5 is similar to and different from the POWER4 and some of the POWER4's successors. This will help us understand the position of the G5 in the POWER/PowerPC roadmap. Table 32 provides a high-level summary of some key features of the POWER4 and POWER5 lines.

Table 32. POWER4 and Newer Processors
 

POWER4

POWER4+

POWER5

POWER5+

Year introduced

2001

2002

2004

2005

Lithography

180 nm

130 nm

130 nm

90 nm

Cores/chip

2

2

2

2

Transistors

174 million

184 million

276 million/chip[a]

276 million/chip

Die size

415 mm2

267 mm2

389 mm2/chip

243 mm2/chip

LPAR[b]

Yes

Yes

Yes

Yes

SMT[c]

No

No

Yes

Yes

Memory controller

Off-chip

Off-chip

On-chip

On-chip

Fast Path

No

No

Yes

Yes

L1 I-cache

2x64KB

2x64KB

2x64KB

2x64KB

L1 D-cache

2x32KB

2x32KB

2x32KB

2x32KB

L2 cache

1.41MB

1.5MB

1.875MB

1.875MB

L3 cache

32MB+

32MB+

36MB+

36MB+


[a] A chip includes two processor cores and L2 cache. A multichip module (MCM) contains multiple chips and usually L3 cache. A four-chip POWER5 MCM with four L3 cache modules is 95 mm2.

[b] LPAR stands for (processor-level) Logical Partitioning.

[c] SMT stands for simultaneous multithreading.

Transcribing Transistors

In light of the technical specifications of modern processors, it is interesting to see how they compare with some of the most important processors in the history of personal computing.

  • Intel 4004 1971, 750kHz clock frequency, 2,300 transistors, 4-bit accumulator architecture, 8 μm pMOS, 3x4 mm2, 816 cycles/instruction, designed for a desktop printing calculator

  • Intel 8086 1978, 8MHz clock frequency, 29,000 transistors, 16-bit extended accumulator architecture, assembly-compatible with the 8080, 20-bit addressing through a segmented addressing scheme

  • Intel 8088 1979 (prototyped), 8-bit bus version of the 8086, used in the IBM PC in 1981

  • Motorola 68000 1979, 8MHz clock frequency, 68,000 transistors, 32-bit general-purpose register architecture (with 24 address pins), heavily microcoded (even nanocoded), eight address registers, eight data registers, used in the original Macintosh in 1984


3.2.1. Fundamental Aspects of the G5

All POWER processors listed in Table 32, as well as the G5 derivatives, share some fundamental architectural features. They are all 64-bit and superscalar, and they perform speculative, out-of-order execution. Let us briefly discuss each of these terms.

3.2.1.1. 64-bit Processor

Although there is no formal definition of what constitutes a 64-bit processor, the following attributes are shared by all 64-bit processors:

  • 64-bit-wide general-purpose registers

  • Support for 64-bit virtual addressing, although the physical or virtual address spaces may not use all 64 bits

  • Integer arithmetic and logical operations performed on all 64 bits of a 64-bit operandwithout being broken down into, say, two operations on two 32-bit quantities

The PowerPC architecture was designed to support both 32-bit and 64-bit computation modesan implementation is free to implement only the 32-bit subset. The G5 supports both computation modes. In fact, the POWER4 supports multiple processor architectures: the 32-bit and 64-bit POWER; the 32-bit and 64-bit PowerPC; and the 64-bit Amazon architecture. We will use the term PowerPC to refer to both the processor and the processor architecture. We will discuss the 64-bit capabilities of the 970FX in Section 3.3.12.1.

Amazon

The Amazon architecture was defined in 1991 by a group of IBM researchers and developers as they collaborated to create an architecture that could be used for both the RS/6000 and the AS/400. Amazon is a 64-bit-only architecture.


3.2.1.2. Superscalar

If we define scalar to be a processor design in which one instruction is issued per clock cycle, then a superscalar processor would be one that issues a variable number of instructions per clock cycle, allowing a clock-cycle-per-instruction (CPI) ratio of less than 1. It is important to note that even though a superscalar processor can issue multiple instructions in a clock cycle, it can do so only with several caveats, such as whether the instructions depend on each other and which specific functional units they use. Superscalar processors typically have multiple functional units, including multiple units of the same type.

VLIW

Another type of multiple-issue processor is a very-large instruction-word (VLIW) processor, which packages multiple operations into one very long instruction. The compilerrather than the processor's instruction dispatcherplays a critical role in selecting which instructions are to be issued simultaneously in a VLIW processor. It may schedule operations by using heuristics, traces, and profiles to guess branch directions.


3.2.1.3. Speculative Execution

A speculative processor can execute instructions before it is determined whether those instructions will need to be executed (instructions may not need to be executed because of a branch that bypasses them, for example). Therefore, instruction execution does not wait for control dependencies to resolveit waits only for the instruction's operands (data) to become available. Such speculation can be done by the compiler, the processor, or both. The processors in Table 32 employ in-hardware dynamic branch prediction (with multiple branches "in flight"), speculation, and dynamic scheduling of instruction groups to achieve substantial instruction-level parallelism.

3.2.1.4. Out-of-Order Execution

A processor that performs out-of-order execution includes additional hardware that can bypass instructions whose operands are not availablesay, due to a cache miss that occurred during register loading. Thus, rather than always executing instructions in the order they appear in the programs being run, the processor may execute instructions whose operands are ready, deferring the bypassed instructions for execution at a more appropriate time.

3.2.2. New POWER Generations

The POWER4 contains two processor cores in a single chip. Moreover, the POWER4 architecture has features that help in virtualization. Examples include a special hypervisor mode in the processor, the ability to include an address offset when using nonvirtual memory addressing, and support for multiple global interrupt queues in the interrupt controller. IBM's Logical Partitioning (LPAR) allows multiple independent operating system images (such as AIX and Linux) to be run on a single POWER4-based system simultaneously. Dynamic LPAR (DLPAR), introduced in AIX 5L Version 5.2, allows dynamic addition and removal of resources from active partitions.

The POWER4+ improves upon the POWER4 by reducing its size, consuming less power, providing a larger L2 cache, and allowing more DLPAR partitions.

The POWER5 introduces simultaneous multithreading (SMT), wherein a single processor supports multiple instruction streamsin this case, twosimultaneously.

Many Processors . . . Simultaneously

IBM's RS 64 IV, a 64-bit member of the PowerPC family, was the first mainstream processor to support processor-level multithreading (the processor holds the states of multiple threads). The RS 64 IV implemented coarse-grained two-way multithreadinga single thread (the foreground thread) executed until a high-latency event, such as a cache miss, occurred. Thereafter, execution switched to the background thread. This was essentially a very fast hardware-based context-switching implementation. Additional hardware resources allowed two threads to have their state in hardware at the same time. Switching between the two states was extremely fast, consuming only three "dead" cycles.

The POWER5 implements two-way SMT, which is far more fine-grained. The processor fetches instructions from two active instruction streams. Each instruction includes a thread indicator. The processor can issue instructions from both streams simultaneously to the various functional units. In fact, an instruction pipeline can simultaneously contain instructions from both streams in its various stages.

A two-way SMT implementation does not provide a factor-of-two performance improvementthe processor effectively behaves as if it were more than one processor, but not quite two processors. Nevertheless, the operating system sees a symmetric-multiprocessing (SMP) programming mode. Typical improvement factors range between 1.2 and 1.3, with a best case of about 1.6. In some pathological cases, the performance could even degrade.

A single POWER5 chip contains two cores, each of which is capable of two-way SMT. A multichip module (MCM) can contain multiple such chips. For example, a four-chip POWER5 module has eight cores. When each core is running in SMT mode, the operating system will see sixteen processors. Note that the operating system will be able to utilize the "real" processors first before resorting to SMT.


The POWER5 supports other important features such as the following:

  • 64-way multiprocessing.

  • Subprocessor partitioning (or micropartitioning), wherein multiple LPAR partitions can share a single processor.[19] Micropartitioned LPARs support automatic CPU load balancing.

    [19] A single processor may be shared by up to 10 partitions, with support for up to 160 partitions total.

  • Virtual Inter-partition Ethernet, which enables a VLAN connection between LPARsat gigabit or even higher speedswithout requiring physical network interface cards. Virtual Ethernet devices can be defined through the management console. Multiple virtual adapters are supported per partition, depending on the operating system.

  • Virtual I/O Server Partition,[20] which provides virtual disk storage and Ethernet adapter sharing. Ethernet sharing connects virtual Ethernet to external networks.

    [20] The Virtual I/O Server Partition must run in either a dedicated partition or a micropartition.

  • An on-chip memory controller.

  • Dynamic firmware updates.

  • Detection and correction of errors in transmitting data courtesy of specialized circuitry.

  • Fast Path, the ability to execute some common software operations directly within the processor. For example, certain parts of TCP/IP processing that are traditionally handled within the operating system using a sequence of processor instructions could be performed via a single instruction. Such silicon acceleration could be applied to other operating system areas such as message passing and virtual memory.

Besides using 90-nm technology, the POWER5+ adds several features to the POWER5's feature set, for example: 16GB page sizes, 1TB segments, multiple page sizes per segment, a larger (2048-entry) translation lookaside buffer (TLB), and a larger number of memory controller read queues.

The POWER6 is expected to add evolutionary improvements and to extend the Fast Path concept even further, allowing functions of higher-level softwarefor example, databases and application serversto be performed in silicon.[21] It is likely to be based on a 65-nm process and is expected to have multiple ultra-high-frequency cores and multiple L2 caches.

[21] The "reduced" in RISC becomes not quite reduced!

3.2.3. The PowerPC 970, 970FX, and 970MP

The PowerPC 970 was introduced in October 2002 as a 64-bit high-performance processor for desktops, entry-level servers, and embedded systems. The 970 can be thought of as a stripped-down POWER4+. Apple used the 970followed by the 970FX and the 970MPin its G5-based systems. Table 33 contains a brief comparison of the specifications of these processors. Figure 33 shows a pictorial comparison. Note that unlike the POWER4+, whose L2 cache is shared between cores, each core in the 970MP has its own L2 cache, which is twice as large as the L2 cache in the 970 or the 970FX.

Table 33. POWER4+ and the PowerPC 9xx
 

POWER4+

PowerPC 970

PowerPC 970FX

PowerPC 970MP

Year introduced

2002

2002

2004

2005

Lithography

130 nm

130 nm

90 nm[a]

90 nm

Cores/chip

2

1

1

2

Transistors

184 million

55 million

58 million

183 million

Die size

267 mm2

121 mm2

66 mm2

154 mm2

LPAR

Yes

No

No

No

SMT

No

No

No

No

Memory controller

Off-chip

Off-chip

Off-chip

Off-chip

Fast Path

No

No

No

No

L1 I-cache

2x64KB

64KB

64KB

2x64KB

L1 D-cache

2x32KB

32KB

32KB

2x32KB

L2 cache

1.41MB shared[b]

512KB

512KB

2x1MB

L3 cache

32MB+

None

None

None

VMX (AltiVec[c])

No

Yes

Yes

Yes

PowerTune[d]

No

No

Yes

Yes


[a] The 970FX and 970MP use 90 nm lithography, in which copper wiring, strained silicon, and silicon-on-insulator (SOI) are fused into the same manufacturing process. This technique accelerates electron flow through transistors and provides an insulating layer in silicon. The result is increased performance, transistor isolation, and lower power consumption. Controlling power dissipation is particularly critical for chips with low process geometries, where subthreshold leakage current can cause problems.

[b] The L2 cache is shared between the two processor cores.

[c] Although jointly developed by Motorola, Apple, and IBM, AltiVec is a trademark of Motorola, or more precisely, Freescale. In early 2004, Motorola spun out its semiconductor products sector as Freescale Semiconductor, Inc.

[d] PowerTune is a clock-frequency and voltage-scaling technology.

Another noteworthy point about the 970MP is that both its cores share the same input and output busses. In particular, the output bus is shared "fairly" between cores using a simple round-robin algorithm.

Figure 33. The PowerPC 9xx family and the POWER4+


3.2.4. The Intel Core Duo

In contrast, the Intel Core Duo processor line used in the first x86-based Macintosh computers (the iMac and the MacBook Pro) has the following key characteristics:

  • Two cores per chip

  • Manufactured using 65-nm process technology

  • 90.3 mm2 die size

  • 151.6 million transistors

  • Up to 2.16GHz frequency (along with a 667MHz processor system bus)

  • 32KB on-die I-cache and 32KB on-die D-cache (write-back)

  • 2MB on-die L2 cache (shared between the two cores)

  • Data prefetch logic

  • Streaming SIMD[22] Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3)

    [22] Section 3.3.10.1 defines SIMD.

  • Sophisticated power and thermal management features




Mac OS X Internals. A Systems Approach
Mac OS X Internals: A Systems Approach
ISBN: 0321278542
EAN: 2147483647
Year: 2006
Pages: 161
Authors: Amit Singh

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net