Overview of x86-Compatible Server Processors


Servers designed to run x86-compatible operating systems (such as Windows 2000 Server, Windows Server 2003, and Linux) and applications have used three different categories of processors:

  • Processors originally made for desktop computers

  • Processors derived from desktop designs but optimized for use in servers and workstations

  • Processors designed from the ground up for use in servers

Until the mid-1990s, there were no processors designed especially for use in servers running x86 software. Instead, these servers used the fastest x86 desktop processors available at the time. For example, desktop processors such as the 80286 (the first x86 processor to directly access more than 1MB of memory), the 80386 (the first x86 processor to introduce a 32-bit data bus), the 80486 (the first x86 processor with an integrated math coprocessor and an integrated L1 cache), and the Pentium (the first x86 processor with a superscalar design, able to execute two instructions at the same time) have all been used as server processors.

The Pentium II was the first x86-compatible desktop processor to spawn an offshoot specifically designed as a server processor, the Pentium II Xeon. Ironically, the Pentium II was an offshoot of Intel's first server-oriented processor, the Pentium Pro. Successors to the Pentium II from both Intel and AMD have continued to inspire server-optimized versions:

  • The Pentium III Xeon was based on the Pentium III.

  • The Xeon was based on the Pentium 4.

  • The AMD Athlon MP was based on the Athlon.

Intel's Itanium family is the first x86-compatible processor family designed from the ground up for use in servers. However, as described in detail later in this chapter, the Itanium and Itanium 2 were not intended to be used primarily as x86-compatible processors. x86 compatibility is a convenience feature of the Itanium's design, and Itanium systems will not achieve peak performance when running x86 operating systems or applications.

Note

x86-compatible processors are those that can run the same operating systems and applications as Intel's 8088 and newer models: MS-DOS, Windows, and so on.


The AMD Opteron was the first x86-compatible processor designed specifically for use in 64-bit servers while maintaining full-speed compatibility with existing 32-bit x86 applications.

Factors to Consider for Desktop Processors Used in Servers

Although some x86 processors used in servers, such as the Intel Xeon and Itanium series, and the AMD Athlon MP and Opteron, are designed especially for use in servers and workstations, many x86-based servers use the same processors that have been used in desktop computers.

Although processors originally designed for desktop use have been, and continue to be, used successfully in servers designed for light to moderate use, several design factors enable some desktop processors to be more successful as server processors than others:

  • L2 cache size: Servers must provide data to, and may perform active processing for, multiple workstations. As with desktop processors, a larger L2 memory cache enables processor-intensive tasks to be performed more quickly.

  • The presence of L3 cache (optional): To further improve the performance of processor-intensive tasks, some recent server processors incorporate L3 memory cache in addition to L1 and L2 caches. Some current high-performance desktop processors also incorporate L3 cache.

  • The amount of system memory the L2 cache supports: Servers have traditionally used larger amounts of main memory than desktop computers. The extra memory in a server enables it to process information destined for multiple workstations more quickly than through paging to and from the hard disk or disk array. Some processors used in servers have had relatively low limits on the amount of memory the integrated L2 cache supported, while others can cache larger amounts of memory. When all other factors are equal, the processor model whose L2 cache supports more main memory is the preferred processor for a server application.

  • The reliability of L2 cache: Some older processors used in servers feature L2 cache that does not support error correcting code (ECC) error correction. ECC support, whether in main memory or in the processor cache, enables the correction of single-bit memory errors and the reporting of multibit memory errors. Because servers are responsible for the reliable delivery and storage of information used by multiple workstations, ECC support, both for main memory and for L2 cache, is very important. (A sketch of single-bit error correction follows this list.)

  • High-speed connections to the motherboard chipset: Much of a server's work involves data transfer between the server's own storage and workstation storage or memory. To enable this data transfer to take place as quickly as possible, a dedicated high-speed bus between the processor and the North Bridge/South Bridge (or MCH/I/O controller hub) chips on the motherboard is an important feature. Some chips used in servers have only one high-speed bus connection, while others have multiple bus connections for greater I/O and memory performance. Note that although the speed of the connection between the processor and chipset is a function of the chipset, if you are selecting between processors with similar FSB speeds but one processor has a faster connection to the chipset than another, the one with the faster connection is preferred.

  • Multiple-processor support: Although some servers use only a single processor, many servers use motherboards and operating systems that can support two or more processors for better system performance. It's best to use processors that are specifically designed to support multiple-processor applications on such motherboards. Note that some processors, such as the Pentium 4, can be used only in single-processor configurations.

  • Dual-processor cores: Processors with dual cores provide virtually every advantage of a dual-processor system while occupying only one processor socket. When two or more dual-core processors are installed in a two-way or larger server, the server has twice the number of logical CPUs and offers increased performance to match.
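Because ECC matters so much to server reliability, it is worth seeing the principle in miniature. The following sketch implements a Hamming(7,4) code in C, correcting a single flipped bit in a 4-bit value; real cache and memory ECC uses wider SECDED codes over 64-bit words, and nothing here reflects any particular vendor's circuit:

    /* A minimal sketch of the idea behind ECC: a Hamming(7,4) code that
     * corrects any single-bit error in a 4-bit value. Processor caches use
     * wider SECDED codes over whole words, but the principle is the same. */
    #include <stdio.h>

    /* Encode 4 data bits into a 7-bit codeword (bit positions 1..7). */
    static unsigned encode(unsigned d)
    {
        unsigned d1 = (d >> 0) & 1, d2 = (d >> 1) & 1,
                 d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
        unsigned p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
        unsigned p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
        unsigned p3 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
        return p1 << 0 | p2 << 1 | d1 << 2 | p3 << 3 | d2 << 4 | d3 << 5 | d4 << 6;
    }

    /* Recompute parity; a nonzero syndrome is the 1-based position of the bad bit. */
    static unsigned syndrome(unsigned c)
    {
        unsigned s1 = ((c >> 0) ^ (c >> 2) ^ (c >> 4) ^ (c >> 6)) & 1;
        unsigned s2 = ((c >> 1) ^ (c >> 2) ^ (c >> 5) ^ (c >> 6)) & 1;
        unsigned s3 = ((c >> 3) ^ (c >> 4) ^ (c >> 5) ^ (c >> 6)) & 1;
        return s1 | s2 << 1 | s3 << 2;
    }

    int main(void)
    {
        unsigned word = 0xB;                 /* 4 data bits: 1011 */
        unsigned code = encode(word);
        unsigned flipped = code ^ (1 << 4);  /* simulate a single-bit memory error */
        unsigned s = syndrome(flipped);
        printf("error at bit position %u\n", s);    /* prints 5 */
        unsigned fixed = flipped ^ (1 << (s - 1));  /* correct the bad bit */
        printf("corrected ok: %s\n", fixed == code ? "yes" : "no");
        return 0;
    }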

Note

Before the development of true dual-core processors by AMD and Intel, Intel added a sort of virtual dual-core technology to some of its Pentium 4 and Xeon processors. This technology, known as HT Technology, enables a single-processor system to emulate two processors and a two-processor system to emulate four processors. HT Technology is designed to boost performance, but whether you see a performance boost or a performance drop when HT Technology is enabled depends on the software being run on the server. Dual-core processors and multiple processors boost performance when multiple applications are in use at one time. Some of Intel's dual-core processors also feature HT Technology, but as with HT Technology implementations on single-core processors, your results may vary. When HT Technology-compatible processors are installed, HT Technology is enabled or disabled through the system BIOS.


The following sections describe which processors in a given family are the most suitable for use in servers, based on these features.

x86 Processor Modes

All Intel and Intel-compatible 32-bit and 64-bit processors, also known as x86 processors, can run in several modes. Processor modes refer to the various operating environments and affect the instructions and capabilities of the chip. The processor mode controls how a processor sees and manages the system memory and the tasks that use it.

As described in the following sections, there are three possible modes of operation:

  • Real mode (16-bit software)

  • Virtual real mode (16-bit programs within a 32-bit environment)

  • Protected mode (32-bit software)

Real Mode and Virtual Real Mode

The original IBM PC included an 8088 processor that could execute 16-bit instructions using 16-bit internal registers and could address only 1MB of memory, using 20 address lines. All original PC software was created to work with this chip and was designed around the 16-bit instruction set and 1MB memory model. For example, DOS and all DOS software, Windows 1.x through 3.x, and all Windows 1.x through 3.x applications are written using 16-bit instructions. These 16-bit operating systems and applications are designed to run on an original 8088 processor.

Servers do not use real mode because it does not permit access to memory above 1MB and does not permit multitasking. Instead, if a server needs to run software that requires real mode, it does so by emulating real mode, a mode known as virtual real mode.
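For reference, the 1MB ceiling falls directly out of real mode's segment:offset addressing: a 16-bit segment value is shifted left 4 bits and added to a 16-bit offset, forming a 20-bit physical address. A minimal sketch of the arithmetic:

    /* A minimal sketch of real-mode (segment:offset) address arithmetic:
     * physical = segment * 16 + offset, a 20-bit (1MB) address space. */
    #include <stdio.h>

    int main(void)
    {
        unsigned segment = 0xB800;  /* the classic text-mode video segment */
        unsigned offset  = 0x0000;
        unsigned long physical = ((unsigned long)segment << 4) + offset;
        printf("%04X:%04X -> physical address %05lXh\n", segment, offset, physical);
        printf("20 address lines -> %lu bytes (1MB)\n", 1UL << 20);
        return 0;
    }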

See "Internal Registers," p. 35.


See "The Address Bus," p. 34.


About the only time that a server processor runs in real mode is if a technician boots the computer with a diagnostic disk and performs tests on the computer's hardware. A diagnostic disk usually contains a DOS-based operating system and XMS memory drivers, which enable the diagnostics to test all memory in the system.

Protected (32-Bit) Mode

The Intel 386, the PC industry's first 32-bit processor, introduced an entirely new 32-bit instruction set. To take full advantage of the 32-bit instruction set, a 32-bit operating system and a 32-bit application were required. This new 32-bit mode was referred to as protected mode, which alludes to the fact that software programs running in that mode are protected from overwriting one another in memory. Such protection helps make the system much more crash-proof because an errant program can't very easily damage other programs or the operating system. In addition, a crashed program can be terminated while the rest of the system continues to run unaffected. Protected mode is also the native mode of subsequent x86-compatible processors up through the Pentium 4, Xeon, and Athlon MP. Server-oriented operating systems for these processors, such as Windows Server, Novell NetWare, Linux, and others, use protected mode.
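The isolation that protected mode provides is easy to demonstrate on any modern protected-mode operating system. In this small POSIX sketch (illustrative only), a child process that writes through a bad pointer is terminated by the OS while its parent keeps running:

    /* A small demonstration of protected-mode isolation: the errant child
     * process is killed by the OS; the parent continues unaffected. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            volatile int *bad = NULL;
            *bad = 42;              /* page-level protection traps this write */
            exit(0);                /* never reached */
        }
        int status;
        waitpid(pid, &status, 0);
        if (WIFSIGNALED(status))
            printf("child killed by signal %d; parent unaffected\n",
                   WTERMSIG(status));
        return 0;
    }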

64-Bit Processor Modes

When used with a 64-bit operating system, many recent x86-compatible server and desktop processors, starting with the Intel Itanium, AMD Opteron, and others, support one of two 64-bit processor modes as well as the 32-bit modes discussed previously:

  • EPIC

  • AMD64 (originally known as x86-64)

When operating in the native 64-bit mode, 64-bit processors can address much more memory and disk space than 32-bit processors, making them much more suitable for large databases and other applications that need large amounts of these resources.
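For a sense of scale, compare the address spaces involved (a sketch; actual limits depend on how many address bits a given processor implements and on the operating system):

    /* The difference in addressable memory is the core of the 64-bit story:
     * 2^32 bytes versus up to 2^64 bytes. Real processors implement fewer
     * physical address bits (40 is used below as an illustrative figure). */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long space32 = 1ULL << 32;
        unsigned long long space40 = 1ULL << 40;  /* 40 physical address bits */
        printf("32-bit address space:  %llu bytes (4GB)\n", space32);
        printf("40-bit physical space: %llu bytes (1TB)\n", space40);
        printf("full 64-bit space: 2^64 bytes (16 exabytes)\n");
        return 0;
    }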

The following sections discuss the differences between these modes.

EPIC 64-Bit Mode (Intel)

Intel's first 64-bit processor was the Itanium. The Itanium uses a processor architecture known as EPIC (Explicitly Parallel Instruction Computing), which is designed to better support multiple-processor-based systems than x86-based designs.

See "Itanium and Itanium 2 Specifications," p. 121.


Although the Itanium family also features a backward-compatible x86 mode, the original Itanium and early versions of the Itanium 2 ran x86 operating systems and applications much more slowly than a comparable x86 server processor. The Itanium 2 "Madison," introduced in 2003, features improved x86 code compatibility for better performance with existing 32-bit applications.

Itanium-based servers have not been very popular thus far, perhaps because of the cost and the need to rewrite applications to support the EPIC architecture.

Note

The Intel Itanium family and 64-bit x86 processors such as the AMD Athlon 64/Opteron and Intel Pentium D and EM64T-compatible Pentium 4 and Xeon processors use different 64-bit architectures. Thus, 64-bit software written for one will not work on the other without being recompiled by the software vendor. This means that software written specifically for the Intel EPIC 64-bit architecture will not run on x86 64-bit processors and vice versa.


The Itanium family runs all existing 32-bit software, but to fully take advantage of the processor, a 64-bit operating system and applications are required. Microsoft has released 64-bit versions of Windows XP and Windows Server, and various versions of Linux also support the Itanium family. Current Linux distributions with Itanium support include BioBrew Linux (http://bioinformatics.org/biobrew/), White Box Linux for IA64 (http://gelato.uiuc.edu/projects/whitebox/), Debian GNU for IA64 (www.debian.org/ports/ia64/), Red Hat Enterprise Linux 4 (www.redhat.com), and SUSE Linux Enterprise 9. Several companies have released 64-bit applications for networking and workstation use.

Tip

If you use (or are considering) Linux on Itanium processors, be sure to visit the Gelato Community, at www.gelato.org, and LinuxIA64, at www.ia64-linux.org. These websites provide valuable support for Itanium servers running Linux.


AMD64 64-Bit Mode (AMD)

AMD64 (originally known as x86-64) is AMD's extension of x86 architecture into the 64-bit world. Unlike EPIC-based processors, such as the Itanium family, AMD64-based processors can run existing 32-bit x86 applications at full speed by adding two new operating modes (64-bit mode and compatibility mode) to the operating modes used by 32-bit x86 processors. Together, these two modes are known as long mode.

When an AMD64-compatible processor runs a 32-bit operating system, it uses protected mode, as described earlier in this chapter. Protected mode, virtual 8086 mode, and real mode are collectively known as legacy mode on an AMD64 processor.

Legacy mode runs about as fast on an AMD64-compatible processor as on a 32-bit processor with similar features. Thus, an AMD64-compatible processor is a much better choice for a mixed 64-bit and 32-bit software environment than an Itanium 2, which runs 32-bit x86 applications much more slowly than its native IA-64 applications. Another benefit of AMD64 is the ability to run a 32-bit operating system and switch to 64-bit operation at your own pace.

The advantages of the AMD64 architecture over the Itanium series IA-64 architecture include the following:

  • You get immediate speed increases with current software.

  • The instruction set is an extension of the x86 architecture, making it easier to recompile existing code to support AMD64.

  • The design of AMD64 permits a gradual movement to 64-bit processing.

Intel uses a virtually identical 64-bit instruction set known as EM64T (discussed in the next section) in a wide variety of desktop and server processors. This suggests that AMD64's approach to 64-bit compatibility will continue to be much more popular in the marketplace than IA-64.
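Because AMD64 and EM64T expose the same architectural flag, software can detect 64-bit (long mode) capability on either vendor's processors the same way: CPUID extended function 8000_0001h reports long-mode support in EDX bit 29. A minimal sketch using GCC's cpuid.h:

    /* Detect AMD64/EM64T (long mode) support via CPUID.
     * Extended function 0x80000001, EDX bit 29 = LM. Illustrative only;
     * compile with GCC on an x86 system. */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
            printf("extended CPUID functions not supported\n");
            return 1;
        }
        printf("64-bit long mode: %s\n", (edx & (1u << 29)) ? "yes" : "no");
        return 0;
    }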

The AMD Opteron is a server-optimized implementation of AMD64, and the AMD Athlon 64, Athlon 64 FX, and Athlon 64 X2 are desktop-optimized implementations of AMD64.

See "AMD Opteron Processors," p. 127.


Note

For more information on the Athlon 64 family of processors, see Upgrading and Repairing PCs, 17th edition.


EM64T 64-Bit Mode (Intel)

Intel's EM64T 64-bit mode, introduced in 2005, is virtually identical to AMD64, except that it adds support for Intel-specific features such as SSE3 (which Opterons built on a 90-nanometer die also support) and HT Technology.

The following server processors include EM64T:

  • Pentium 4 (6xx, 5x1, and 506 models)

  • Pentium D (all models)

  • Pentium Extreme Edition (all models)

  • Xeon MP 7xxx series

  • Xeon DP with 800MHz CPU bus

Intel originally used the term "Clackamas Technology" for these processors but now uses the term EM64T.

Endianness and Server Processors

Endianness refers to the order in which a processor stores the bytes of numbers and other multibyte values.

Server processors discussed in this chapter fall into one of three categories:

  • Big-endian: These processors store data in order, left to right, from the most significant byte (MSB) to the least significant byte (LSB). Big-endian processors discussed in this book include Sun SPARC and the PowerPC G5 (970) family.

  • Little-endian: These processors store data in order, left to right, from LSB to MSB. x86 processors use the little-endian method.

  • Bi-endian: These processors can store data in either order. Bi-endian processor families include Power, most PowerPC processors (except for the G5), Alpha, MIPS, PA-RISC, and Itanium.

Endianness is a concern primarily for programmers, who need to take into account the order in which data is stored in a system, and for situations in which data will be exchanged between systems that use different endian methods.

It is easier to recompile applications or move data to another processor that uses the same endian method as the application or data's original target processor. Note that most RISC server processors and the Itanium (IA-64) family, which is being positioned as a replacement for the Alpha and PA-RISC processor families, feature a bi-endian design.
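Endianness is also easy to test for in software. This minimal C sketch inspects the byte at the lowest address of a known 32-bit value:

    /* Detect the endianness of the host processor by examining the first
     * byte (lowest address) of a known multibyte value. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int value = 0x01020304;
        unsigned char *first = (unsigned char *)&value;
        if (*first == 0x04)
            printf("little-endian (LSB first), as on x86\n");
        else if (*first == 0x01)
            printf("big-endian (MSB first), as on SPARC or the PowerPC G5\n");
        return 0;
    }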

x86 Processor Speed Ratings

A common misunderstanding about processors is their different speed ratings. This section covers processor speed in general and then provides more specific information about Intel and AMD processors used in servers.

A computer system's clock speed is measured as a frequency, usually expressed as a number of cycles per second. A crystal oscillator controls clock speeds, using a sliver of quartz sometimes contained in what looks like a small tin container. Newer systems include the oscillator circuitry in the motherboard chipset, so it might not be a visible separate component on newer boards. As voltage is applied to the quartz, it begins to vibrate (oscillate) at a harmonic rate dictated by the shape and size of the crystal (sliver). The oscillations emanate from the crystal in the form of a current that alternates at the harmonic rate of the crystal. This alternating current is the clock signal that forms the time base on which the computer operates. A typical computer system runs millions of these cycles per second, so speed is measured in megahertz. (1Hz is equal to one cycle per second.)

Note

The Hertz was named for the German physicist Heinrich Rudolf Hertz. In 1885, Hertz confirmed the electromagnetic theory, which states that light is a form of electromagnetic radiation and is propagated as waves.


A single cycle is the smallest element of time for a processor. Every action requires at least one cycle and usually multiple cycles. To transfer data to and from memory, for example, a modern processor such as a Pentium 4 needs a minimum of three cycles to set up the first memory transfer and then only a single cycle per transfer for the next three to six consecutive transfers. The extra cycles on the first transfer typically are called wait states. A wait state is a clock tick in which nothing happens. This ensures that the processor isn't getting ahead of the rest of the computer.
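To see what wait states cost, here is a small worked example; the 3-1-1-1 burst pattern and the 800MHz, 8-byte-wide bus are assumptions chosen to resemble a Pentium 4-class FSB:

    /* Worked example: effective bandwidth of a 3-1-1-1 burst (4 transfers in
     * 6 cycles) on a hypothetical 800MHz, 64-bit-wide front-side bus. */
    #include <stdio.h>

    int main(void)
    {
        double bus_mhz = 800.0;          /* effective FSB clock (assumption) */
        int bytes_per_xfer = 8;          /* 64-bit data bus (assumption) */
        int cycles = 3 + 1 + 1 + 1;      /* first transfer + 3 more */
        double mb_per_s = 4.0 * bytes_per_xfer * bus_mhz / cycles;
        printf("4 transfers in %d cycles -> %.0f MB/s "
               "(vs. %.0f MB/s with no wait states)\n",
               cycles, mb_per_s, bytes_per_xfer * bus_mhz);
        return 0;
    }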

See "SIMMs, DIMMs, and RIMMs," p. 368.


Current server-class processors can execute from one to six instructions per cycle, thanks to multiple pipelines and other advances.

Different instruction execution times (in cycles) make it difficult to compare systems based purely on clock speed or number of cycles per second. How can two processors that run at the same clock rate perform differently, with one running "faster" than the other? The answer is simple: efficiency. For example, although the AMD Opteron processors have clock speeds about one-third slower than Intel Xeon processors, they perform more instructions in the same clock cycle. Thus, a "slower" Opteron is able to keep pace or even outperform a "faster" Xeon.

Although dual-core processor designs improve multitasking performance, different designs offer different levels of efficiency. The dual-core AMD Opterons incorporate a crossbar controller in the processor itself to handle communications between the cores. See Figure 2.6.

Figure 2.6. The AMD dual-core Opteron processors use the crossbar controller to transfer data between the processor cores.


However, the initial versions of Intel's dual-core processors (Pentium D and Xeon) use the MCH (North Bridge) to handle communications between the cores. See Figure 2.7.

Figure 2.7. Intel's initial dual-core design uses the MCH (North Bridge equivalent) chip to manage transfers between the processor cores.


The difference is roughly comparable to walking across a hallway to talk to a co-worker (AMD's method) compared to taking an elevator up one floor, walking to another elevator, riding down one floor, and then walking to the co-worker's office (Intel's method). Because of the inefficiencies inherent in the initial Intel design, Intel will switch to a design more similar to AMD's in dual-core processors introduced in 2006 and beyond.

Although RISC-based processors might feature slower clock speeds than recent Intel x86 or EPIC-based processors, their use of fewer processor instructions and a focus on server tasks enables them to be more efficient at handling very large numbers of clients.

As you can see from these examples, evaluating CPU performance can be tricky. CPUs with different internal architectures execute their instructions differently and can be relatively faster at certain processes and slower at others.

Keep in mind that, unlike with PCs, server performance is less about how quickly a server performs internal operations than it is about how quickly it provides services to clients.

For more information about server benchmarking, see Chapter 21, "Server Testing and Maintenance."


Processor Speeds Versus Motherboard Speeds

Virtually all processors used in servers, including x86, Itanium, and RISC-based processors, run the system bus and the processor core at different speeds. As mentioned earlier, the system bus is also referred to as the FSB. The effective speed of the FSB is a multiple of the actual system bus speed. For example, AMD processors perform two accesses per clock cycle, making their FSB twice the actual system bus speed. Intel processors perform four accesses per clock cycle, making their FSB four times the actual system bus speed. This is important to note for purposes of configuring the processor in the system BIOS.
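The arithmetic is simple; the following sketch works it through for an assumed 200MHz base clock:

    /* Worked example: effective FSB speed as a multiple of the base bus
     * clock. AMD processors transfer twice per clock; the Pentium 4/Xeon
     * bus transfers four times per clock ("quad-pumped"). */
    #include <stdio.h>

    int main(void)
    {
        int base_mhz = 200;  /* actual system bus clock (assumption) */
        printf("AMD (2 accesses/clock):   %d MHz effective FSB\n", base_mhz * 2);
        printf("Intel (4 accesses/clock): %d MHz effective FSB\n", base_mhz * 4);
        /* An 800MHz effective FSB with an 8-byte data path moves
           800 million x 8 bytes = 6.4GB/s peak. */
        printf("peak bandwidth at 800MHz x 8 bytes: %.1f GB/s\n",
               base_mhz * 4 * 8 / 1000.0);
        return 0;
    }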

See "BIOS Setup Menus," p. 296.


RISC-based processors also use a dual-speed design, running the processor core and FSB at different speeds.

x86 Processor Features

As new processors are introduced, new features are continually added to their architectures to help improve everything from performance in specific types of applications to the reliability of the CPU as a whole. The next few sections take a look at some of these technologies: superscalar execution, MMX, SSE, 3DNow!, and HT Technology.

Superscalar Execution

The fifth-generation Pentium and newer processors feature multiple internal instruction execution pipelines, which enable them to execute multiple instructions at the same time. The 486 and all preceding chips can perform only a single instruction at a time. Intel calls the capability to execute more than one instruction at a time superscalar technology.

RISC and CISC Chips

Superscalar architecture is usually associated with high-output RISC chips. A RISC chip has a less complicated instruction set with fewer and simpler instructions. Although each instruction accomplishes less, the clock speed can be higher, which can usually increase performance. The Pentium is one of the first CISC chips to be considered superscalar. A CISC chip uses a richer, fuller-featured instruction set that has more complicated instructions. As an example, say you wanted to instruct a robot to screw in a light bulb. Using CISC instructions, you would say this:

1. Pick up the bulb.

2. Insert it into the socket.

3. Rotate clockwise until tight.

Using RISC instructions, you would say something more along the lines of this:

1. Lower hand.

2. Grasp bulb.

3. Raise hand.

4. Insert bulb into socket.

5. Rotate clockwise one turn.

6. Is bulb tight? If not, repeat step 5.

7. End.

Overall, many more RISC instructions are required to do the job because each instruction is simpler (reduced) and does less. The advantage is that the robot (or processor) has to deal with fewer distinct commands and can execute the individual commands more quickly, and thus in many cases it can execute the complete task (or program) more quickly as well. The debate goes on whether RISC or CISC is really better, but in reality, there is no such thing as a pure RISC or CISC chip; it is all just a matter of definition, and the lines are somewhat arbitrary.

Intel and compatible processors have generally been regarded as CISC chips, although starting with the Pentium Pro, more recent and current Intel server processors have many RISC attributes and internally break CISC instructions down into RISC versions. Intel's Itanium processors are RISC processors.

MMX, SSE, SSE2, SSE3, and 3DNow! Technologies

Intel and AMD have developed several extensions to basic x86 processor instructions. While these instructions are not important for servers, if you are considering standardizing on a particular processor for use in both server and workstation uses or are writing applications, you may want to consider which processors support particular extensions.

All these extensions are based on the concept of single instruction, multiple data (SIMD). SIMD enables one instruction to perform the same function on multiple pieces of data, similarly to a teacher telling an entire class to sit down rather than addressing each student one at a time. SIMD enables the chip to reduce processor-intensive loops common with video, audio, graphics, and animation. Table 2.13 provides an overview of these processor extensions and the processors that support them. For details about the instructions, see the documentation available for each processor at the Intel and AMD websites.

Table 2.13. Extensions to x86 Processor Instructions

Instruction Set | Vendor | Features | Server Processors Supporting Instruction Set
MMX Technology | Intel | 57 instructions for processing graphics, video, and audio data | Pentium MMX and all subsequent processors, including Itanium and Itanium 2
SSE | Intel | 70 instructions for processing graphics, video, and audio data; incorporates floating-point support; incorporates MMX | Pentium III, Pentium III Xeon, Itanium, Itanium 2
3DNow! Professional | AMD | Incorporates SSE commands and 3DNow! enhanced multimedia commands | AMD Athlon MP
SSE2 | Intel | Enhanced version of SSE with support for 64-bit double-precision floating-point and 8-bit through 64-bit integer operations; incorporates MMX and SSE | Pentium 4, Itanium 2 (32-bit software only, when the IA-32 execution layer is used), Xeon, AMD Opteron
SSE3 | Intel | 13 additional instructions for processing graphics, video, and audio data; incorporates MMX, SSE, and SSE2 | Pentium 4 Prescott, Pentium D, Pentium Extreme Edition, Xeon (Nocona core and newer), AMD Opteron (all 90-nanometer process)
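To make the SIMD idea concrete, the following sketch uses the SSE intrinsics supported by common C compilers to add four packed single-precision floats with one instruction (illustrative; it assumes an SSE-capable processor):

    /* SIMD in practice: one SSE instruction adds four packed floats at
     * once, instead of looping over them one at a time. */
    #include <stdio.h>
    #include <xmmintrin.h>  /* SSE intrinsics */

    int main(void)
    {
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 sum = _mm_add_ps(a, b);  /* single instruction, four additions */

        float out[4];
        _mm_storeu_ps(out, sum);
        printf("%.0f %.0f %.0f %.0f\n",
               out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
        return 0;
    }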


Dynamic Execution

First used in the P6, or sixth-generation, processors, dynamic execution enables a processor to execute more instructions in parallel, so tasks are completed more quickly. This technology innovation comprises three main elements:

  • Branch prediction: Branch prediction is a feature formerly found only in high-end mainframe processors. It enables the processor to keep the instruction pipeline full while running at a high rate of speed. A special fetch/decode unit in the processor uses a highly optimized branch prediction algorithm to predict the direction and outcome of the instructions being executed through multiple levels of branches, calls, and returns. It is similar to a chess player working out multiple strategies in advance of game play by predicting the opponent's strategy several moves into the future. By predicting the instruction outcome in advance, the instructions can be executed with no waiting.

  • Dataflow analysis: Dataflow analysis studies the flow of data through the processor to detect any opportunities for out-of-order instruction execution. A special dispatch/execute unit in the processor monitors many instructions and can execute these instructions in an order that optimizes the use of the multiple superscalar execution units. The resulting out-of-order execution of instructions can keep the execution units busy even when cache misses and other data-dependent instructions might otherwise hold things up.

  • Speculative execution: Speculative execution is the processor's capability to execute instructions in advance of the actual program counter. The processor's dispatch/execute unit uses dataflow analysis to execute all available instructions in the instruction pool and store the results in temporary registers. A retirement unit then searches the instruction pool for completed instructions that are no longer data dependent on other instructions to run or that have unresolved branch predictions. If any such completed instructions are found, the retirement unit commits their results to memory or to the appropriate standard Intel architecture registers, in the order in which they were originally issued. They are then retired from the pool.

Dynamic execution essentially removes the constraint and dependency on linear instruction sequencing. By promoting out-of-order instruction execution, it can keep the instruction units working rather than waiting for data from memory. Even though instructions can be predicted and executed out of order, the results are committed in the original order so as not to disrupt or change program flow.
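The value of branch prediction can even be observed from ordinary software: a branch that always goes the same way runs markedly faster than one that flips at random, because the predictor keeps the pipeline full. A rough sketch (results vary by processor and compiler settings):

    /* A rough demonstration of branch prediction: the same comparison is
     * far cheaper when its outcome is predictable than when it is random. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static long count_big(const int *v, int n)
    {
        long hits = 0;
        for (int i = 0; i < n; i++)
            if (v[i] >= 128)   /* the branch the predictor must guess */
                hits++;
        return hits;
    }

    int main(void)
    {
        enum { N = 20000000 };
        int *random_v = malloc(N * sizeof(int));
        int *sorted_v = malloc(N * sizeof(int));
        for (int i = 0; i < N; i++)
            random_v[i] = rand() % 256;          /* unpredictable branch */
        for (int i = 0; i < N; i++)
            sorted_v[i] = (i < N / 2) ? 0 : 255; /* perfectly predictable */

        clock_t t0 = clock();
        long a = count_big(random_v, N);
        clock_t t1 = clock();
        long b = count_big(sorted_v, N);
        clock_t t2 = clock();

        printf("random: %ld hits, %.2fs; predictable: %ld hits, %.2fs\n",
               a, (double)(t1 - t0) / CLOCKS_PER_SEC,
               b, (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(random_v);
        free(sorted_v);
        return 0;
    }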

The Dual Independent Bus Architecture

The dual independent bus (DIB) architecture was first implemented by Intel and AMD in their sixth-generation processors (Pentium Pro, Pentium II/Xeon, Pentium III/Xeon, and Athlon MP). DIB was created to improve processor bus bandwidth and performance. Having two (dual) independent data I/O buses enables the processor to access data from either of its buses simultaneously and in parallel, rather than in a singular, sequential manner (as in a single-bus system). The main processor bus (the FSB) is the interface between the processor and the motherboard or chipset. The second (back-side) bus in a processor with DIB is used for the L2 cache, enabling it to run at much greater speeds than if it were to share the main processor bus.

The DIB architecture is explained more fully in Chapter 3, "Server Chipsets."


Two buses make up the DIB architecture: the L2 cache bus and the main CPU bus, often called the FSB. Both buses can be used at the same time, eliminating a bottleneck there. The dual-bus architecture enables the L2 cache of the newer processors to run at full speed inside the processor core on an independent bus, leaving the main CPU bus (FSB) to handle normal data flowing in and out of the chip. The two buses run at different speeds. The FSB, or main CPU bus, is coupled to the speed of the motherboard, whereas the back-side, or L2 cache, bus is coupled to the speed of the processor core. As the frequency of processors increases, so does the speed of the L2 cache.

The key to implementing DIB was to move the L2 cache memory off the motherboard and into the processor package. L1 cache always has been a direct part of the processor die, but L2 was larger and originally had to be external. Moving the L2 cache into the processor meant that the L2 cache could run at speeds more like those of the L1 cache, much faster than the motherboard or processor bus.

DIB also enables the system bus to perform multiple simultaneous transactions (instead of singular sequential transactions), accelerating the flow of information within the system and boosting performance. Overall, DIB architecture offers up to three times the bandwidth performance of a single-bus-architecture processor.

Hyper-Threading Technology

Computers with two or more physical processors have long had a performance advantage over single-processor computers when the operating system has supported multiple processors, as with Windows NT 4.0, 2000 Server, Windows Server 2003, Linux, and Novell NetWare 6.x.

See "Multiple CPUs," p. 37.


However, dual-processor motherboards and systems have always been more expensive than otherwise-comparable single-processor systems, and upgrading a dual-processor-capable system to dual-processor status can be difficult because of the need to match processor speeds and specifications. Intel's HT Technology, by contrast, allows a single processor to handle two independent sets of instructions at the same time. In essence, HT Technology converts a single physical processor into two virtual processors.

Intel originally introduced HT Technology in its line of Xeon processors for servers in early 2002. HT Technology enables multiprocessor servers to act as if they have twice as many processors installed. HT Technology was introduced on Xeon workstation-class processors with a 533MHz system bus and later found its way into PC processors with the Pentium 4 3.06GHz processor in late 2002.

How Hyper-Threading Works

Internally, an HT-enabled processor has two sets of general-purpose registers, control registers, and other architecture components, but both logical processors share the same cache, execution units, and buses. During operations, each logical processor handles a single thread (see Figure 2.8).

Figure 2.8. A processor with HT Technology enabled can fill otherwise-idle time with a second process, improving multitasking and the performance of single multithreaded applications.


Although the sharing of some processor components means that the overall speed of an HT-enabled system isn't as high as that of a true dual-processor system, speed increases of 25% or more are possible when multiple applications or a single multithreaded application is being run.
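A processor advertises Hyper-Threading capability through the standard CPUID instruction: function 1 sets EDX bit 28, the HTT flag. A minimal sketch using GCC's cpuid.h (note that the flag indicates the package can present multiple logical processors; the operating system reports how many are actually enabled):

    /* Check the CPUID HTT flag (function 1, EDX bit 28), which indicates
     * the processor package can present multiple logical processors. */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;
        printf("HTT flag: %s\n", (edx & (1u << 28)) ? "set" : "clear");
        return 0;
    }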

Hyper-Threading Requirements

The first HT-enabled processor was the Intel Pentium 4 3.06GHz. All faster Pentium 4 models also support HT Technology, as do all processors 2.4GHz and faster that use the 800MHz bus. However, an HT-enabled P4 processor by itself can't bring the benefits of HT Technology to a system. You also need the following:

  • A compatible motherboard (chipset): Your system might need a BIOS upgrade.

  • BIOS support to enable/disable HT Technology: If your operating system doesn't support HT Technology, you should disable this feature. Application performance varies (some faster, some slower) when HT Technology is enabled. You should perform application-based benchmarks with HT Technology enabled and disabled to determine whether the applications you use benefit from using HT Technology.

  • A compatible operating system, such as Windows Server 2003: When HT Technology is enabled on these operating systems, the Device Manager shows two processors on systems with a single HT-compatible processor or four processors on systems with dual HT-compatible processors.

Most of Intel's recent server chipsets for the Pentium 4 and Xeon support HT Technology.

See "Intel Pentium 4 Chipsets for Single-Processor Servers," p. 180, and "Intel Xeon DP and Xeon MP Chipsets," p. 188, for details.


If your motherboard or server was released before HT Technology was introduced, you need a BIOS upgrade from the motherboard or system vendor to be able to use HT Technology. Although Windows NT 4.0 and Windows 2000 are designed to use multiple physical processors, HT Technology requires specific operating system optimizations in order to work correctly. Linux distributions based on kernel 2.4.18 and higher also support HT Technology.

While HT Technology is designed to simulate two processors in a single physical unit, it also needs properly written software to improve application performance. Unfortunately, many applications do not support HT Technology, and some even slow down when HT Technology is enabled. If you find that enabling HT Technology does not benefit server performance, you should disable it.

Dual-Core Technology

Dual-core processors include two processor cores in the same physical package, providing virtually all the advantages of a multiple-processor computer at a cost lower than that of two matched processors. Unlike HT Technology, dual-core processors require no support from applications.

Another advantage of dual-core processors is their ability to boost performance while maintaining the ability to use standard-size motherboards and cases for servers. Two-way and larger servers based on multiple single-core processors must use extended ATX or larger case designs to provide adequate space for the additional processor socket and support circuitry. If using extended ATX or proprietary case designs is a concern for your server environment, a dual-core CPU might fit your needs nicely.

A number of RISC-based processors developed by IBM have used dual-core designs for several years, including IBM's Power4 (introduced in 2001) and its successors, as well as the PowerPC 970MP. AMD's dual-core Opterons were introduced in 2005, and Intel announced dual-core Xeon MP processors for shipment in early 2006. Dual-core Opterons can even be used as replacements for existing single-core Opteron processors. A BIOS upgrade might be necessary on some Opteron systems. Dual-core Xeons use the Intel E8500 chipset; older Xeon chipsets do not support dual-core Xeons.
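Because the operating system simply sees additional logical processors, whether they come from a second socket, a second core, or HT Technology, the quickest sanity check is to ask the OS. A minimal POSIX sketch:

    /* Report the number of logical processors the OS can schedule on. On a
     * two-way server with dual-core, HT-enabled processors, this can be 8. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long online = sysconf(_SC_NPROCESSORS_ONLN);
        printf("logical processors online: %ld\n", online);
        return 0;
    }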



