Servers designed to run x86-compatible operating systems (such as Windows 2000 Server, Windows Server 2003, and Linux) and applications have used three different categories of processors:
Until the mid-1990s, there were no processors designed especially for use in servers running x86 software. Instead, these servers used the fastest x86 desktop processors available at the time. For example, desktop processors such as the 80286 (the first x86 processor to directly access more than 1MB of memory), the 80386 (the first x86 processor to introduce a 32-bit data bus), the 80486 (the first x86 processor with an integrated math coprocessor and an integrated L1 cache), and the Pentium (the first processor with superscalar design, running two instructions at the same time) have all been used as server processors. The Pentium II was the first x86-compatible desktop processor to spawn an offshoot specifically designed as a server processor, the Pentium II Xeon. Ironically, the Pentium II was an offshoot of Intel's first server-oriented processor, the Pentium Pro. Successors to the Pentium II from both Intel and AMD have continued to inspire server-optimized versions:
Intel's Itanium family is the first x86-compatible processor family designed from the ground up for use in servers. However, as described in detail later in this chapter, the Itanium and Itanium 2 were not intended to be used primarily as x86-compatible processors. x86 compatibility is a convenience feature of the Itanium's design, and Itanium systems will not achieve peak performance when running x86 operating systems or applications.

Note: x86-compatible processors are those that can run the same operating systems and applications as Intel's 8088 and newer models: MS-DOS, Windows, and so on.

The AMD Opteron was the first x86-compatible processor designed specifically for use in 64-bit servers while maintaining full-speed compatibility with existing 32-bit x86 applications.

Factors to Consider for Desktop Processors Used in Servers

Although some x86 processors used in servers, such as the Intel Xeon and Itanium series and the AMD Athlon MP and Opteron, are designed especially for use in servers and workstations, many x86-based servers use the same processors that have been used in desktop computers. Although processors originally designed for desktop use have been, and continue to be, used successfully in servers designed for light to moderate use, several design factors enable some desktop processors to be more successful as server processors than others:
Note: Before the development of true dual-core processors by AMD and Intel, Intel added a sort of virtual dual-core technology to some of its Pentium 4 and Xeon processors. This technology, known as HT Technology, enables a single-processor system to emulate two processors and a two-processor system to emulate four processors. HT Technology is designed to boost performance, but whether you see a performance boost or a performance drop when HT Technology is enabled depends on the software being run on the server. Dual-core processors and multiple processors boost performance when multiple applications are in use at one time. Some of Intel's dual-core processors also feature HT Technology, but as with HT Technology implementations on single-core processors, your results may vary. When HT Technology-compatible processors are installed, HT Technology is enabled or disabled through the system BIOS.

The following sections describe which processors in a given family are the most suitable for use in servers, based on these features.

x86 Processor Modes

All Intel and Intel-compatible 32-bit and 64-bit processors, also known as x86 processors, can run in several modes. Processor modes refer to the various operating environments and affect the instructions and capabilities of the chip. The processor mode controls how a processor sees and manages the system memory and the tasks that use it. As described in the following sections, there are three possible modes of operation:
Real Mode and Virtual Real Mode

The original IBM PC included an 8088 processor that could execute 16-bit instructions using 16-bit internal registers and could address only 1MB of memory, using 20 address lines. All original PC software was created to work with this chip and was designed around the 16-bit instruction set and 1MB memory model. For example, DOS and all DOS software, Windows 1.x through 3.x, and all Windows 1.x through 3.x applications are written using 16-bit instructions. These 16-bit operating systems and applications are designed to run on an original 8088 processor. Servers do not use real mode because it does not permit access to memory above 1MB and does not permit multitasking. Instead, if a server needs to run software that requires real mode, it does so by emulating real mode, a mode known as virtual real mode.
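The 1MB real-mode limit follows directly from the 8088's segmented addressing scheme. The short sketch below (an illustration, not code from this book) shows how a 16-bit segment and a 16-bit offset combine into a 20-bit physical address; the segment value 0xB800 used in the example is the well-known text-mode video memory segment.

```python
# Real-mode segmented addressing: physical = (segment << 4) + offset.
# With 20 address lines, the 8088 can reach only 2^20 = 1MB of memory,
# and any address beyond that wraps around.

def real_mode_address(segment: int, offset: int) -> int:
    """Compute the 20-bit physical address for a segment:offset pair."""
    return ((segment << 4) + offset) & 0xFFFFF  # 20 address lines wrap at 1MB

print(hex(2 ** 20))                              # 0x100000 -- the 1MB limit
print(hex(real_mode_address(0xB800, 0x0000)))    # 0xb8000 -- text-mode video memory
# The highest pair, FFFF:FFFF, would be 0x10FFEF; the 20-bit bus wraps it to 0xFFEF.
```

This wraparound behavior is why later processors added an "A20 gate" to control the 21st address line, although that detail is beyond the scope of this section.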
About the only time that a server processor runs in real mode is if a technician boots the computer with a diagnostic disk and performs tests on the computer's hardware. A diagnostic disk usually contains a DOS-based operating system and XMS memory drivers, which enable the diagnostics to test all memory in the system.

Protected (32-Bit) Mode

The Intel 386, the PC industry's first 32-bit processor, introduced an entirely new 32-bit instruction set. To take full advantage of the 32-bit instruction set, a 32-bit operating system and a 32-bit application were required. This new 32-bit mode was referred to as protected mode, which alludes to the fact that software programs running in that mode are protected from overwriting one another in memory. Such protection helps make the system much more crash-proof because an errant program can't very easily damage other programs or the operating system. In addition, a crashed program can be terminated while the rest of the system continues to run unaffected. Protected mode is also the native mode of subsequent x86-compatible processors up through the Pentium 4, Xeon, and Athlon MP. Server-oriented operating systems for these processors, such as Windows Server, Novell NetWare, Linux, and others, use protected mode.

64-Bit Processor Modes

When used with a 64-bit operating system, many recent x86-compatible server and desktop processors, starting with the Intel Itanium, AMD Opteron, and others, support one of two 64-bit processor modes as well as the 32-bit modes discussed previously:
64-bit processors can address much more memory and disk space than 32-bit processors when operating in the native 64-bit mode, making them much more suitable for large databases and other applications that need large amounts of these resources. The following sections discuss the differences between these modes.

EPIC 64-Bit Mode (Intel)

Intel's first 64-bit processor was the Itanium. The Itanium uses a processor architecture known as EPIC (Explicitly Parallel Instruction Computing), which is designed to better support multiple-processor-based systems than x86-based designs.
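The memory-capacity advantage of 64-bit addressing mentioned above is easy to quantify: each additional address bit doubles the addressable space. A quick illustrative calculation:

```python
# Addressable memory grows exponentially with address width.
def addressable_bytes(bits: int) -> int:
    """Bytes addressable with the given number of address bits."""
    return 2 ** bits

GB = 2 ** 30  # one gigabyte
EB = 2 ** 60  # one exabyte

print(addressable_bytes(32) // GB)  # 4  -- a 32-bit address space tops out at 4GB
print(addressable_bytes(64) // EB)  # 16 -- a full 64-bit space is 16 exabytes
```

In practice, processors of this era implement fewer than 64 physical address lines, so the real hardware limit is well below 16EB, but it still dwarfs the 4GB ceiling of 32-bit addressing.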
Although the Itanium family also features a backward-compatible x86 mode, the original Itanium and early versions of the Itanium 2 ran x86 operating systems and applications much more slowly than a comparable x86 server processor. The Itanium 2 "Madison," introduced in 2003, features improved x86 code compatibility for better performance with existing 32-bit applications. Itanium-based servers have not been very popular thus far, perhaps because of the cost and the need to rewrite applications to support the EPIC architecture.

Note: The Intel Itanium family and 64-bit x86 processors such as the AMD Athlon 64/Opteron and Intel Pentium D and EM64T-compatible Pentium 4 and Xeon processors use different 64-bit architectures. Thus, 64-bit software written for one will not work on the other without being recompiled by the software vendor. This means that software written specifically for the Intel EPIC 64-bit architecture will not run on x86 64-bit processors and vice versa.

The Itanium family runs all existing 32-bit software, but to take full advantage of the processor, a 64-bit operating system and applications are required. Microsoft has released 64-bit versions of Windows XP and Windows Server, and various versions of Linux also support the Itanium family. Current Linux distributions with Itanium support include BioBrew Linux (http://bioinformatics.org/biobrew/), White Box Linux for IA64 (http://gelato.uiuc.edu/projects/whitebox/), Debian GNU for IA64 (www.debian.org/ports/ia64/), Red Hat Enterprise Linux 4 (www.redhat.com), and SUSE Linux Enterprise 9. Several companies have released 64-bit applications for networking and workstation use.

Tip: If you use (or are considering) Linux on Itanium processors, be sure to visit the Gelato Community, at www.gelato.org, and LinuxIA64, at www.ia64-linux.org. These websites provide valuable support for Itanium servers running Linux.
AMD64 64-Bit Mode (AMD)

AMD64 (originally known as x86-64) is AMD's extension of the x86 architecture into the 64-bit world. Unlike EPIC-based processors, such as the Itanium family, AMD64-based processors can run existing 32-bit x86 applications at full speed by adding two new operating modes, 64-bit mode and compatibility mode, to the operating modes used by 32-bit x86 processors. Together, these two modes are known as long mode. When an AMD64-compatible processor runs a 32-bit operating system, it uses protected mode, as described earlier in this chapter. Protected mode, virtual 8086 mode, and real mode are collectively known as legacy mode on an AMD64 processor. Legacy mode runs about as fast on an AMD64-compatible processor as on a 32-bit processor with similar features. Thus, an AMD64-compatible processor is a much better choice for a mixed 64-bit and 32-bit software environment than an Itanium 2, which runs 32-bit x86 applications much more slowly than its native IA-64 applications. Another benefit of AMD64 is the ability to run a 32-bit operating system and switch to 64-bit operation at your own pace. The advantages of the AMD64 architecture over the Itanium series IA-64 architecture include the following:
Intel uses a virtually identical 64-bit instruction set known as EM64T (discussed in the next section) in a wide variety of desktop and server processors. This suggests that AMD64's approach to 64-bit compatibility will continue to be much more popular in the marketplace than IA-64. The AMD Opteron is a server-optimized implementation of AMD64, and the AMD Athlon 64, Athlon 64 FX, and Athlon 64 X2 are desktop-optimized implementations of AMD64.
Note: For more information on the Athlon 64 family of processors, see Upgrading and Repairing PCs, 17th edition.

EM64T 64-Bit Mode (Intel)

Intel's EM64T 64-bit mode, introduced in 2005, is virtually identical to AMD64, except that it adds support for Intel-specific features such as SSE3 (which Opterons built on a 90-nanometer die also support) and HT Technology. The following server processors include EM64T:
Intel originally used the term "Clackamas Technology" for these processors but now uses the term EM64T.

Endianness and Server Processors

Endianness refers to the method a processor uses to sequence numbers and other values. Server processors discussed in this chapter fall into one of three categories:
Endianness is a concern primarily for programmers, who need to take into account the order in which data is stored in a system, and for situations in which data will be exchanged between systems that use different endian methods. It is easier to recompile applications, or move data, to another processor that uses the same endian method as the application or data's original target processor. Note that most RISC server processors and the Itanium (IA-64) family, which is being positioned as a replacement for the Alpha and PA-RISC processor families, feature a bi-endian design.

x86 Processor Speed Ratings

A common misunderstanding about processors is their different speed ratings. This section covers processor speed in general and then provides more specific information about Intel and AMD processors used in servers. A computer system's clock speed is measured as a frequency, usually expressed as a number of cycles per second. A crystal oscillator controls clock speeds, using a sliver of quartz sometimes contained in what looks like a small tin container. Newer systems include the oscillator circuitry in the motherboard chipset, so it might not be a visible separate component on newer boards. As voltage is applied to the quartz, it begins to vibrate (oscillate) at a harmonic rate dictated by the shape and size of the crystal (sliver). The oscillations emanate from the crystal in the form of a current that alternates at the harmonic rate of the crystal. This alternating current is the clock signal that forms the time base on which the computer operates. A typical computer system runs millions of these cycles per second, so speed is measured in megahertz. (1Hz is equal to one cycle per second.)

Note: The hertz was named for the German physicist Heinrich Rudolf Hertz. In 1885, Hertz confirmed the electromagnetic theory, which states that light is a form of electromagnetic radiation and is propagated as waves.

A single cycle is the smallest element of time for a processor.
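Returning for a moment to the endianness discussion above: the difference between little-endian (x86, AMD64) and big-endian (many RISC processors) byte ordering can be demonstrated with Python's standard struct module, as in this sketch.

```python
import struct

# The same 32-bit value packed in the two byte orders.
value = 0x12345678

little = struct.pack('<I', value)  # least-significant byte first (x86 style)
big    = struct.pack('>I', value)  # most-significant byte first (big-endian RISC style)

print(little.hex())  # 78563412
print(big.hex())     # 12345678

# Reading little-endian bytes as if they were big-endian garbles the value,
# which is why exchanging binary data between endian architectures needs care.
misread = struct.unpack('>I', little)[0]
print(hex(misread))  # 0x78563412
```

A bi-endian processor such as the Itanium can be configured to use either ordering, which eases migration of data and applications from either camp.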
Every action requires at least one cycle and usually multiple cycles. To transfer data to and from memory, for example, a modern processor such as a Pentium 4 needs a minimum of three cycles to set up the first memory transfer and then only a single cycle per transfer for the next three to six consecutive transfers. The extra cycles on the first transfer typically are called wait states. A wait state is a clock tick in which nothing happens. This ensures that the processor isn't getting ahead of the rest of the computer.
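The burst-transfer timing just described can be put in concrete numbers. The sketch below assumes the pattern given above (three setup cycles, then one cycle per transfer); the exact cycle counts vary by processor and memory type.

```python
# Rough sketch of burst-transfer timing: the first memory transfer costs
# extra setup cycles (wait states), and each subsequent transfer in the
# burst completes in a single cycle.
def burst_cycles(transfers: int, setup_cycles: int = 3) -> int:
    """Total clock cycles for a burst of memory transfers (illustrative)."""
    return setup_cycles + transfers  # setup once, then 1 cycle per transfer

# A 4-transfer burst with 3 setup cycles (a "3-1-1-1" style pattern):
print(burst_cycles(4))  # 7 cycles total

# At an 800MHz effective bus speed, each cycle takes 1/0.8 = 1.25ns, so:
cycle_time_ns = 1 / 0.8
print(round(burst_cycles(4) * cycle_time_ns, 2))  # 8.75 ns for the burst
```

The same arithmetic explains why wait states matter: every extra setup cycle adds a full clock tick in which no data moves.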
Current server-class processors can output one to six instructions per cycle, thanks to multiple pipelines and other advances. Different instruction execution times (in cycles) make it difficult to compare systems based purely on clock speed or number of cycles per second. How can two processors that run at the same clock rate perform differently, with one running "faster" than the other? The answer is simple: efficiency. For example, although the AMD Opteron processors have clock speeds about one-third slower than Intel Xeon processors, they perform more instructions in the same clock cycle. Thus, a "slower" Opteron is able to keep pace or even outperform a "faster" Xeon. Although dual-core processor designs improve multitasking performance, different designs offer different levels of efficiency. The dual-core AMD Opterons incorporate a crossbar controller in the processor itself to handle communications between the cores. See Figure 2.6. Figure 2.6. The AMD dual-core Opteron processors use the crossbar controller to transfer data between the processor cores.
However, the initial versions of Intel's dual-core processors (Pentium D and Xeon) use the MCH (North Bridge) to handle communications between the cores. See Figure 2.7. Figure 2.7. Intel's initial dual-core design uses the MCH (North Bridge equivalent) chip to manage transfers between the processor cores.
The difference is roughly comparable to walking across a hallway to talk to a co-worker (AMD's method) compared to taking an elevator up one floor, walking to another elevator, riding down one floor, and then walking to the co-worker's office (Intel's method). Because of the inefficiencies inherent in the initial Intel design, Intel will switch to a design more similar to AMD's in dual-core processors introduced in 2006 and beyond. Although RISC-based processors might feature slower clock speeds than recent Intel x86 or EPIC-based processors, their use of fewer processor instructions and a focus on server tasks enables them to be more efficient at handling very large numbers of clients. As you can see from these examples, evaluating CPU performance can be tricky. CPUs with different internal architectures execute their instructions differently and can be relatively faster at certain processes and slower at others. Keep in mind that, unlike with PCs, server performance is less about how quickly a server performs internal operations than it is about how quickly it provides services to clients.
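The efficiency argument above (clock speed alone does not determine throughput) reduces to simple arithmetic: effective instruction throughput is roughly clock speed times instructions per cycle (IPC). The numbers below are hypothetical, chosen only to illustrate the point, not measured figures for any real Opteron or Xeon.

```python
# Throughput depends on both clock speed and instructions per cycle (IPC).
def mips(clock_mhz: int, ipc: float) -> float:
    """Millions of instructions per second = clock (MHz) x IPC (illustrative)."""
    return clock_mhz * ipc

xeon_like    = mips(3600, 1.0)  # hypothetical: higher clock, lower IPC
opteron_like = mips(2400, 1.5)  # hypothetical: lower clock, higher IPC

print(xeon_like, opteron_like)  # 3600.0 3600.0 -- the "slower" chip keeps pace
```

This is why benchmarks that exercise realistic server workloads are a far better basis for comparison than raw clock speed.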
Processor Speeds Versus Motherboard SpeedsVirtually all processors used in servers, including x86, Itanium, and RISC-based processors, run the system bus and the processor core at different speeds. As mentioned earlier, the system bus is also referred to as the FSB. The effective speed of the FSB is a multiple of the actual system bus speed. For example, AMD processors perform two accesses per clock cycle, making their FSB twice the actual system bus speed. Intel processors perform four accesses per clock cycle, making their FSB four times the actual system bus speed. This is important to note for purposes of configuring the processor in the system BIOS.
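The relationship described above is a simple multiplication: effective FSB speed equals the actual bus clock times the number of transfers per clock cycle. A quick sketch:

```python
# Effective FSB speed = actual bus clock x transfers per clock cycle.
def effective_fsb_mhz(bus_clock_mhz: int, transfers_per_clock: int) -> int:
    """Effective front-side bus speed in MHz."""
    return bus_clock_mhz * transfers_per_clock

# AMD processors: two transfers per clock (double data rate).
print(effective_fsb_mhz(200, 2))  # 400MHz effective from a 200MHz bus clock
# Intel processors: four transfers per clock ("quad-pumped").
print(effective_fsb_mhz(200, 4))  # 800MHz effective from a 200MHz bus clock
```

So an "800MHz FSB" Intel system and a "400MHz FSB" AMD system can both be running a 200MHz actual bus clock, which is the figure the system BIOS typically asks for.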
RISC-based processors also use a dual-speed design, running the processor core and FSB at different speeds.

x86 Processor Features

As new processors are introduced, new features are continually added to their architectures to help improve everything from performance in specific types of applications to the reliability of the CPU as a whole. The next few sections take a look at some of these technologies: superscalar execution; MMX, SSE, and 3DNow!; and HT Technology.

Superscalar Execution

The fifth-generation Pentium and newer processors feature multiple internal instruction execution pipelines, which enable them to execute multiple instructions at the same time. The 486 and all preceding chips can perform only a single instruction at a time. Intel calls the capability to execute more than one instruction at a time superscalar technology.

RISC and CISC Chips

Superscalar architecture is usually associated with high-output RISC chips. A RISC chip has a less complicated instruction set with fewer and simpler instructions. Although each instruction accomplishes less, the clock speed can be higher, which can usually increase performance. The Pentium is one of the first CISC chips to be considered superscalar. A CISC chip uses a richer, fuller-featured instruction set that has more complicated instructions. As an example, say you wanted to instruct a robot to screw in a light bulb. Using CISC instructions, you would say this:
Using RISC instructions, you would say something more along the lines of this:
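The light-bulb analogy can be sketched in code: one complex, CISC-style operation decomposes into a sequence of simpler, RISC-style steps, much as the Pentium Pro and later x86 processors internally break CISC instructions into RISC-like micro-operations. The operation names below are invented for illustration.

```python
# Illustrative sketch: one complex (CISC-style) operation decomposed into
# the simpler (RISC-style) steps that actually get executed.
CISC_TO_MICRO_OPS = {
    "screw_in_light_bulb": [
        "grasp_bulb",
        "raise_arm",
        "insert_bulb_into_socket",
        "rotate_clockwise",
        "rotate_clockwise",
        "rotate_clockwise",
        "release_bulb",
        "lower_arm",
    ],
}

def decode(cisc_instruction: str) -> list:
    """Break one complex instruction into its simpler micro-operations."""
    return CISC_TO_MICRO_OPS[cisc_instruction]

steps = decode("screw_in_light_bulb")
print(len(steps))  # 8 simple steps for 1 complex instruction
```

Each simple step can complete in fewer clock cycles, which is the core of the RISC argument summarized next.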
Overall, many more RISC instructions are required to do the job because each instruction is simpler (reduced) and does less. The advantage is that there are fewer overall commands the robot (or processor) has to deal with, and it can execute the individual commands more quickly, and thus in many cases it can execute the complete task (or program) more quickly as well. The debate goes on whether RISC or CISC is really better, but in reality, there is no such thing as a pure RISC or CISC chip; it is all just a matter of definition, and the lines are somewhat arbitrary. Intel and compatible processors have generally been regarded as CISC chips, although starting with the Pentium Pro, more recent and current Intel server processors have many RISC attributes and internally break CISC instructions down into RISC versions. Intel's Itanium processors, although built on the EPIC architecture rather than a traditional RISC design, share many RISC characteristics.

MMX, SSE, SSE2, SSE3, and 3DNow! Technologies

Intel and AMD have developed several extensions to basic x86 processor instructions. Although these instructions are not critical for servers, if you are considering standardizing on a particular processor for both server and workstation use, or are writing applications, you may want to consider which processors support particular extensions. All these extensions are based on the concept of single instruction, multiple data (SIMD). SIMD enables one instruction to perform the same function on multiple pieces of data, similarly to a teacher telling an entire class to sit down rather than addressing each student one at a time. SIMD enables the chip to reduce processor-intensive loops common with video, audio, graphics, and animation. Table 2.13 provides an overview of these processor extensions and the processors that support them. For details about the instructions, see the documentation available for each processor at the Intel and AMD websites.
Dynamic Execution

First used in the P6, or sixth-generation, processors, dynamic execution enables a processor to execute more instructions in parallel, so tasks are completed more quickly. This technology innovation comprises three main elements:
Dynamic execution essentially removes the constraint and dependency on linear instruction sequencing. By promoting out-of-order instruction execution, it can keep the instruction units working rather than waiting for data from memory. Even though instructions can be predicted and executed out of order, the results are committed in the original order so as not to disrupt or change program flow.

The Dual Independent Bus Architecture

The dual independent bus (DIB) architecture was first implemented by Intel and AMD in their sixth-generation processors (Pentium Pro, Pentium II/Xeon, Pentium III/Xeon, and Athlon MP). DIB was created to improve processor bus bandwidth and performance. Having two (dual) independent data I/O buses enables the processor to access data from either of its buses simultaneously and in parallel, rather than in a singular, sequential manner (as in a single-bus system). The main processor bus (the FSB) is the interface between the processor and the motherboard or chipset. The second (back-side) bus in a processor with DIB is used for the L2 cache, enabling it to run at much greater speeds than if it were to share the main processor bus.
Two buses make up the DIB architecture: the L2 cache bus and the main CPU bus, often called the FSB. Both buses can be used at the same time, eliminating a bottleneck there. The dual-bus architecture enables the L2 cache of the newer processors to run at full speed inside the processor core on an independent bus, leaving the main CPU bus (FSB) to handle normal data flowing in and out of the chip. The two buses run at different speeds. The FSB, or main CPU bus, is coupled to the speed of the motherboard, whereas the back-side, or L2 cache, bus is coupled to the speed of the processor core. As the frequency of processors increases, so does the speed of the L2 cache. The key to implementing DIB was to move the L2 cache memory off the motherboard and into the processor package. L1 cache has always been a direct part of the processor die, but L2 was larger and originally had to be external. Moving the L2 cache into the processor meant that the L2 cache could run at speeds more like those of the L1 cache, much faster than the motherboard or processor bus. DIB also enables the system bus to perform multiple simultaneous transactions (instead of singular sequential transactions), accelerating the flow of information within the system and boosting performance. Overall, DIB architecture offers up to three times the bandwidth performance of a single-bus-architecture processor.

Hyper-Threading Technology

Computers with two or more physical processors have long had a performance advantage over single-processor computers when the operating system has supported multiple processors, as with Windows NT 4.0, Windows 2000 Server, Windows Server 2003, Linux, and Novell NetWare 6.x.
However, dual-processor motherboards and systems have always been more expensive than otherwise-comparable single-processor systems, and upgrading a dual-processor-capable system to dual-processor status can be difficult because of the need to match processor speeds and specifications. Intel's HT Technology, by contrast, allows a single processor to handle two independent sets of instructions at the same time. In essence, HT Technology converts a single physical processor into two virtual processors. Intel originally introduced HT Technology in its line of Xeon processors for servers in early 2002. HT Technology enables multiprocessor servers to act as if they have twice as many processors installed. HT Technology was introduced on Xeon workstation-class processors with a 533MHz system bus and later found its way into PC processors with the Pentium 4 3.06GHz processor in late 2002.

How Hyper-Threading Works

Internally, an HT-enabled processor has two sets of general-purpose registers, control registers, and other architecture components, but both logical processors share the same cache, execution units, and buses. During operations, each logical processor handles a single thread (see Figure 2.8). Figure 2.8. A processor with HT Technology enabled can fill otherwise-idle time with a second process, improving multitasking and the performance of multithreaded single applications.
Although the sharing of some processor components means that the overall speed of an HT-enabled system isn't as high as that of a true dual-processor system, speed increases of 25% or more are possible when multiple applications or a single multithreaded application is being run.

Hyper-Threading Requirements

The first HT-enabled processor was the Intel Pentium 4 3.06GHz. All faster Pentium 4 models also support HT Technology, as do all processors 2.4GHz and faster that use the 800MHz bus. However, an HT-enabled Pentium 4 processor by itself can't bring the benefits of HT Technology to a system. You also need the following:
Most of Intel's recent server chipsets for the Pentium 4 and Xeon support HT Technology.
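On Linux, one way to verify that a processor advertises HT Technology is to look for the "ht" flag in /proc/cpuinfo. The helper below is a small sketch (not from this book) that parses a flags line, so it can be exercised without reading the live file; the sample string is an abbreviated, hypothetical flags line.

```python
# Check a /proc/cpuinfo "flags" line for the "ht" CPU feature flag (Linux).
def supports_ht(flags_line: str) -> bool:
    """Return True if the 'ht' flag appears in a /proc/cpuinfo flags line."""
    _, _, flags = flags_line.partition(":")
    return "ht" in flags.split()

sample = "flags\t\t: fpu vme de pse tsc msr pae mce sse sse2 ht"
print(supports_ht(sample))  # True

# To check the running system (Linux only), something like:
# with open("/proc/cpuinfo") as f:
#     print(any(supports_ht(line) for line in f if line.startswith("flags")))
```

Note that the flag only indicates that the processor implements HT Technology; it must still be enabled in the BIOS and supported by the operating system, as described next.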
If your motherboard or server was released before HT Technology was introduced, you need a BIOS upgrade from the motherboard or system vendor to be able to use HT Technology. Although Windows NT 4.0 and Windows 2000 are designed to use multiple physical processors, HT Technology requires specific operating system optimizations in order to work correctly. Linux distributions based on kernel 2.4.18 and higher also support HT Technology. Although HT Technology is designed to simulate two processors in a single physical unit, it also needs properly written software to improve application performance. Unfortunately, many applications do not support HT Technology, and some even slow down when HT Technology is enabled. If you find that enabling HT Technology does not benefit server performance, you should disable it.

Dual-Core Technology

Dual-core processors include two processor cores in the same physical package, providing virtually all the advantages of a multiple-processor computer at a cost lower than that of two matched processors. Unlike HT Technology, dual-core processors require no special support from applications. Another advantage of dual-core processors is their ability to boost performance while maintaining the ability to use standard-size motherboards and cases for servers. Two-way and larger servers based on multiple single-core processors must use extended ATX or larger case designs to provide adequate space for the additional processor socket and support circuitry. If using extended ATX or proprietary case designs is a concern for your server environment, a dual-core CPU might fit your needs nicely. A number of RISC-based processors developed by IBM have used dual-core designs for several years, including IBM's Power4 (introduced in 2001) and its successors, as well as the PowerPC 970MP. AMD's dual-core Opterons were introduced in 2005, and Intel announced dual-core Xeon MP processors for shipment in early 2006.
Dual-core Opterons can even be used as replacements for existing single-core Opteron processors. A BIOS upgrade might be necessary on some Opteron systems. Dual-core Xeons use the Intel E8500 chipset; older Xeon chipsets do not support dual-core Xeons.