Examining the x86/IA-64/x86-64 Platform (Intel and AMD) | Maximizing Performance and Scalability with IBM WebSphere

Examining the x86/IA-64/x86-64 Platform (Intel and AMD)

The following sections discuss the x86 family.

Platform Overview

The x86 platform is probably the best-known processor architecture. Since the 8086 debuted in 1978, the x86 platform has been manufactured in numerous guises by many companies, most notably Intel and AMD.

Intel currently has a larger install base for the x86 platform, but AMD's recent high-quality processor families have taken a bite out of Intel's market share. AMD was founded in 1969 and used to produce various non-CPU based products. In 1979, AMD commenced manufacturing 8086 and 8088 processors for Intel and continued doing so with the 80286 through 1986 when Intel and AMD parted ways.

Up until the mid-1990s, AMD produced clones of the 80386 and 80486 processors. In 1995, via a series of corporate purchases, AMD commenced manufacturing its own x86-compatible CPU family. The first product off the line was the AMD-K5, a direct competitor to the original Intel Pentium.

Most of you know the history of the x86, with its entry into the mass market thanks to the 8080 in the mid-1970s, followed by the 8086/8088 processors that were used to power the first IBM PCs.

Note	Previous Intel-manufactured processors, the 4004 and 8008, were available but not deemed "mass market" in the early 1970s. These processors were available as part of computer kits.

The x86 is quite possibly the driving reason for the computer revolution because it brought low cost (to the consumer) and microprocessing capacity to end users.

Platform Architecture

You can find the x86 architecture in many flavors. Three companies have produced x86 processors in the past 30 years: Intel, AMD, and Cyrix, with some input from NEC, National Semiconductor, and more recently Transmeta.

Unlike a SPARC-based processor, there isn't a standard for the x86. What makes the x86 a leading processor architecture is its low entry cost. You can pick up basic-level Intel and AMD CPUs for less than $35, which makes them a bargain for anyone wanting to set up test or low-end systems.

Although you can't compare a Power4 or UltraSPARC III to a lower-end x86 processor, the cost indicates why x86 is so popular. Some have compared it to Linux versus the commercial Unix systems. That is, Linux, being more or less free, means that anyone (including those without big budgets) can download Linux, install it, use it, and perform pretty much the same tasks as someone who operates a commercial Unix such as Tru64, Solaris, or AIX.

To focus on the most relevant x86 processor line-up, I'll compare only the processors released since January 2002 from Intel and AMD. This in no way represents a view on other manufacturers' processors; however, AMD and Intel are the most commonly used x86 vendors for WebSphere implementations on the Linux and Windows NT/200x/XP platforms.

You'll now see some of the key processors commonly found in x86 server platforms.

Note	Over the years, the nomenclature of x86 platforms has been fairly well publicized. The IA-32 term is essentially Intel's name for the x86 platform architecture and refers to Intel Architecture, 32 bit. The term IA-64 stands for, as you can probably guess, Intel Architecture, 64 bit.

Note	x86-64 is AMD's answer to IA-64, with the important and clever distinction of the x86 preceding the 64-bit term. In other words, x86-64 means an x86 platform on a 64-bit architecture. Incidentally, AMD has code-named the x86-64 the AMD Opteron.

What Makes an x86/x86-64/IA-64 Processor Fast?

As with all processors, many factors determine why one processor is fast and another is slow.

Back in the 16-bit versus 32-bit days, it was fairly straightforward. The AMD 80386DX-40 processor was a step ahead of the market, with true 32-bit in, 32-bit out bus-processing lanes. You may recall that some manufacturers dabbled in hybrid 32/16-bit processors, which were termed SX processors, such as the 80386SX-33.

Nowadays, the processors are an order of magnitude more complex. They include internal threading, dual-core engines, superscalar technologies, multipipeline approaches, and preemptive processing. The clock rate is still in there, but it's masked significantly by the quality and architectural implementation of the previous factors.

Cache is another big hitter in these types of processors (as it is in most processor architectures).

You'll revisit this question of what makes a fast processor after you've explored all the recent CPUs in this group of processor architecture.

Intel Pentium III Xeon: 900MHz Version

The arrival of the Pentium III Xeon 900 megahertz (MHz) processor was possibly the first sign of Intel's real entry into the high-end workstation/server market. The initial Pentium III Xeon processor was designed to operate in systems with up to eight-way processing. Each CPU came with 2MB of advanced transfer level 2 cache, which was a large upgrade from the standard 256KB level 2 cache found in other Pentium III-based processors.

The core CPU was interconnected to the system bus at 100MHz with an address width of 64 bits. This processor also directly supported up to 64GB of memory.

Intel Pentium 4: 1.7GHz-2GHz Version

Although not the first release of the Pentium 4 processor, the 1.7GHz-2GHz versions were the sign of things to come with the Intel line-up. Pentium 4 was the first new CPU architecture for Intel since its Pentium Pro, which was made available in mid-1995.

This range of Pentium 4 processors operated on a 400MHz system bus speed, and level 2 cache was supported by 256KB cache memory. This initial batch of Pentium 4 processors was based on what's commonly known as the Willamette core.

Intel Pentium 4: 2GHz-2.6GHz Version

This second family of Pentium 4 processors arrived in January 2002, all based on the new Northwood "A" processor core. Among other things, the new core is developed with a 0.13 micron process as opposed to the 0.18 micron in the previous Pentium 4 processors.

Intel Pentium 4: 2.26GHz-3.2GHz Version

Pentium 4 underwent another change in April 2002 when the Northwood "A" processor was replaced with a "B" model. The biggest change for the newer post-April 2002 Pentium 4 processor was with the system bus interconnect or, as Intel has coined, the Front Side Bus (FSB). The previous Pentium 4s came with a 400MHz FSB, and the newer Northwood "B" models came with a 533MHz FSB.

What's more, the 3.06GHz version of the Pentium 4 introduced a feature known as hyper-threading into the mainstream Intel product line-up.

Previous Intel CPUs such as Xeon processors came equipped with hyper-threading technology, which essentially presents a single physical processor as two logical processors. Although hyper-threading doesn't provide 100-percent twin-CPU capabilities, it does come close. Nonexecution areas within the CPU, such as the processor scheduler, are duplicated, which allows multiple system threads to be scheduled simultaneously. Symmetric Multiprocessor (SMP) operating systems will see the logically split processors, and multithreaded applications (such as WebSphere-deployed applications) will be able to take advantage of the feature.
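Because the operating system sees each hyper-threaded processor as separate logical processors, a JVM running on that system normally sees them too. The following is a minimal sketch of how you might check this from Java; the class and method names are hypothetical, and the value reported depends on what the operating system exposes to the JVM (it can be reduced by processor affinity or similar restrictions).

```java
// Minimal sketch: report the number of logical processors the JVM can see.
// LogicalProcessors and logicalProcessorCount are illustrative names only.
class LogicalProcessors {

    /** Returns the number of logical processors available to this JVM. */
    static int logicalProcessorCount() {
        // On a hyper-threaded system the operating system usually reports
        // logical processors, so this may exceed the physical CPU count.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Logical processors visible to the JVM: "
                + logicalProcessorCount());
    }
}
```

Comparing this figure with the number of physical CPUs installed is a quick way to confirm whether hyper-threading is enabled and visible to your WebSphere JVMs.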

If you're looking to use an Intel x86 platform architecture, the Pentium 4 3.2GHz is a high-quality and high-performing processor.

Note	At the time of writing this book, Intel has announced a pending 1,000MHz FSB version of the Intel Pentium 4 processor family.

Intel Xeon

As well as the Intel Pentium III Xeon processor, Intel provides the Xeon processor in its Pentium 4 family. Although the latest Xeon is quite different from the Pentium 4 processor, there are some similarities. The Xeon is probably the closest, architecturally speaking, to the Pentium 4. It's a server-targeted processor in several forms. To continue with the generational breakdown of processors, you'll see that the Intel Xeon comes in a number of flavors.

The first generation of the current family of Xeon processors is supplied with 256KB of level 2 cache, interfacing with the system bus at 400MHz. The clock speeds range from 1.4GHz-2GHz. This Xeon doesn't support hyper-threading (Intel's dual-execution CPU core technology) but does directly support dual-processor configurations.

The second generation of Xeon processors supports a slightly larger level 2 cache at 512KB and retains the 400MHz system bus interconnect speed. This generation of Xeons does support hyper-threading, and like the first-generation Xeon, it directly supports dual processors. Clock speeds for this Xeon range from 1.8GHz-3GHz.

The third generation of Xeon processors introduces a faster 533MHz system bus interconnect speed and similar characteristics to the second-generation Xeon, such as hyper-threading, native dual-processor support, 512KB of level 2 cache, and clock rates from 2GHz-3.06GHz.

Intel Xeon MP

The Xeon MP is Intel's high-performance (computing bandwidth) x86 processor. Similar to the Xeon in some respects, the Xeon MP is targeted at implementations and computing requirements of more than two Xeon CPUs, with direct support for up to eight-way configuration. Proprietary chipsets can support greater than eight-way configuration for Xeon MP processors.

The key point about the Xeon MP is that it comes with a level 3 cache, as well as the standard level 1 and 2 caches. The level 3 cache is touted as the integrated cache, effectively placing an abstraction caching layer between the CPU (level 2 cache) and the off-CPU activities.

The Xeon MP comes with clock rates of 1.4GHz-2GHz. The level 2 cache comes with up to 512KB of cache memory, and the level 3 cache can be supplied with up to 2MB of cache memory, with all cache memory operating at the clock rate frequency.

The other key design aspect of the Xeon MP is that the I/O bandwidth is faster than the standard Xeon processors, capable of up to 4.8GB per second using the PCI-X architecture standard.

As you'd expect, hyper-threading comes standard. It's important to note that the Xeon MP also is limited in its chipset usage, so selection of main boards is important. The Xeon MP supports only the ServerWorks GC-HE chipset or any proprietary chipset that a main board manufacturer chooses to design and implement.

Intel Itanium and Itanium 2

The Itanium is Intel's flagship 64-bit computing platform. It, unlike previous generational CPU architecture upgrades from Intel, doesn't provide native backward compatibility with the x86 architecture and instead uses an IA-64 architecture.

This is unlike the AMD Opteron, which follows the same path that the x86 architecture has for the past 20 years. That is, as each evolutional change in data and CPU sizing has taken place (8-bit to 16-bit to 32-bit and so on), the next processor model in line is always backward compatible with the previous data set.

The Itanium does in fact provide limited non-native support for legacy x86 code by means of an emulation mode in which the processor can be run. This is implemented through firmware capability on the processor.

The Itanium and Itanium 2 processors are Intel's new-world server processor platforms. The Itanium 2 supports a three-layer cache system: layer 1, layer 2, and layer 3 cache of up to 6MB.

Intel markets the Itanium in a "workstation" and a "server" flavor. The key difference between the two target platform offerings essentially boils down to better support for dual or more processors in the server incarnation of the Itanium 2.

The Itanium 2 DP processor is primarily a dual processor-optimized Itanium 2 offering. The Itanium 2 MP is a multiprocessor version, which is targeted to servers with more than two processors. A third flavor exists that's targeted toward high-density computing platforms such as supercomputer-type implementations. This third flavor of the Itanium 2 is powered by a smaller cache size and processor clock speed, driving down the voltage level.

AMD Duron

The AMD Duron is a lighter-weight processor than the Athlon and somewhat analogous to Intel's Celeron family of processors.

Duron is soon to be retired by AMD, but you can still order systems and parts for the Duron.

The Duron is essentially the same processor as the Athlon. It uses the same core, with the main exception being the amount of level 2 cache. The Duron comes with a level 2 cache of 64KB, and the Athlon's are 256KB and upward.

The Duron has been produced based on three cores, first commencing in 2000. The original core, the Spitfire, was based on the Athlon's Thunderbird core and provided clock speeds of 600-950MHz. The more recent, or second-generation, core (known as the Morgan core) was based on the Athlon XP Palomino core and is what today's Durons are based on.

Duron processors are available in speeds up to 1.3GHz.

AMD Athlon, Athlon XP, and Athlon MP

The AMD Athlon processors are the primary competing models of the Intel Pentium III and Intel Pentium 4 processors. Over the past few years, Pentium and Athlon have each been the leader for the fastest (in processing terms) x86-based CPU.

Athlon

The original Athlon, operating on the AMD-K7 core, was limited to speeds of 1GHz. These initial Athlons were more in line with the Intel Pentium II processor family than the Pentium III.

Athlon started to improve its place in the microprocessor world in mid-2000 when AMD released the Thunderbird core. Interestingly, the Thunderbird core-based Athlon's level 2 cache is half the size of the original K7-based Athlon. The Thunderbird has 256KB of level 2 cache whereas the K7 has 512KB of level 2 cache. The key difference, other than the size of the cache memory, is that the Thunderbird moved to having the cache physically on the processor die whereas the K7 has the memory externally mounted. Comounting, or cohousing, the level 2 cache on the processor die improved the performance of the level 2 cache; and in cache memory terms, faster is typically better than more!

Athlon XP

The Athlon XP is essentially AMD's third generation of Athlon processor. Released in mid-2001, the Athlon XP included a number of new capabilities, as well as performance improvements. The key architectural differences of the third-generation Palomino core-based Athlon were an additional two instruction sets, SSE and 3DNow, and an increase in the initial clock rate up to 1.73GHz.

A reduction in power consumption of the Palomino core meant that the stock processor can be clocked faster.

Two additional generations of Athlon XP are available: one based on the Thoroughbred core and one based on the more recently released Barton core. These processors are essentially clock rate and instruction core optimized versions of the Palomino-based Athlon XP. The clock rates of the two processors weren't the same as the offerings from Intel Pentium 4, however. Depending on who you talked to and the type of processing you were undertaking, the lower clock rate Barton and Thoroughbred-based AMD processors were faster than the Intel Pentium 4 processors.

You'll notice that AMD now names its processors in the form of Athlon XP 2800+ and Athlon XP 3000+. The number at the end of the processor name (for example, 2800+ and 3000+) is AMD's rating of each processor's speed relative to the Intel Pentium 4 equivalents.

Athlon MP

The Athlon MP is a modified Athlon processor that has been designed specifically for multiprocessor-based systems. Several features are present that allow better utilization of pipelining and cross-CPU cache coherency. These concepts are quite sound, and many of the off-CPU functions that are associated with SMP-based main boards are incorporated into the CPU itself. The more manufacturers can pack onto a CPU rather than traversing a bus, the faster the overall performance of the CPU will be.

AMD Athlon 64

Code-named ClawHammer, the Athlon 64 is one of two CPU branches that AMD is manufacturing for 64-bit computing. Although not available commercially until late 2003, there's a fair amount of information available about the new family of AMD engines.

Although I could (happily) talk about CPU architecture all day, I'll try to highlight some of the key points of the new Athlon 64. Obviously, it's a 64-bit CPU that's capable of running 32-bit (and 16-bit) applications natively. AMD has cleverly taken the x86-32 specification and extended it to allow for seamless 64-bit computing. Additional registers and some extensions to the existing ones in the core provide the extra 32 bits.

The Athlon 64 also incorporates an on-CPU memory controller and a new bus known as HyperTransport. The on-CPU memory controller greatly reduces latency when communicating with memory because the memory controller operates at the same frequency, or clock rate, as the CPU. The HyperTransport bus is a concept that allows multiple AMD CPUs and/or other bus-connected technologies, such as I/O interfaces (for example, Universal Serial Bus (USB) and Integrated Device Electronics (IDE) devices) and graphics devices, to be interconnected. Similar in theory to the on-CPU memory controller, having the HyperTransport bus technology colocated on the CPU means that the HyperTransport bus is operating at the ultra-high frequencies at which the core CPU is operating. A 32-bit HyperTransport bus interface with a 1600MHz clock rate will allow it to operate at 6.4GB per second.
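The 6.4GB per second figure follows from simple arithmetic: a 32-bit link moves 4 bytes per transfer, and 4 bytes multiplied by 1,600 million transfers per second is 6,400MB per second. The sketch below works through that calculation. It assumes the quoted 1600MHz is the effective transfer rate and that a gigabyte is counted as 1,000MB; the class and method names are hypothetical.

```java
// Minimal sketch of the peak-bandwidth arithmetic for a parallel bus.
// Assumes one transfer of busWidthBits per effective clock tick and
// decimal units (1GB = 1,000MB). Names are illustrative only.
class HyperTransportBandwidth {

    /** Peak bandwidth in GB per second for a given width and transfer rate. */
    static double gigabytesPerSecond(int busWidthBits, int megaTransfersPerSecond) {
        long bytesPerTransfer = busWidthBits / 8;                  // 32 bits -> 4 bytes
        long megabytesPerSecond = bytesPerTransfer * megaTransfersPerSecond;
        return megabytesPerSecond / 1000.0;                        // MB/s -> GB/s
    }

    public static void main(String[] args) {
        // The example from the text: 32-bit interface at 1600MHz.
        System.out.println("Peak bandwidth (GB/s): " + gigabytesPerSecond(32, 1600));
    }
}
```

This is a theoretical peak only; sustained throughput on a real system will be lower.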

AMD Opteron

Similar to the Athlon XP, the Opteron is the server, or high-end workstation, version of the x86-64 AMD offering (in other words, the Athlon 64). Based on the SledgeHammer core, the Opteron provides some clever performance features. Essentially, the Opteron is AMD's competitor to Intel's Xeon server processor family.

The Opteron comes in three models: the Opteron 1xx, 2xx, and 8xx. The first digit represents the validation certification of the CPU, or what its platform design is intended for: one-, two-, or up to eight-way systems. The next two digits represent the AMD performance index of the particular processor.
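To make the numbering scheme concrete, here is a minimal sketch that splits a three-digit Opteron model number into the two parts described above. The class, field, and method names are hypothetical, and the sketch assumes only the 1xx, 2xx, and 8xx series exist.

```java
// Minimal sketch: decode an Opteron model number such as 244 into its
// maximum supported "way" count and its performance index.
// OpteronModel and its members are illustrative names only.
class OpteronModel {
    final int maxWay;            // 1, 2, or 8 (one-, two-, or up to eight-way)
    final int performanceIndex;  // the last two digits of the model number

    private OpteronModel(int maxWay, int performanceIndex) {
        this.maxWay = maxWay;
        this.performanceIndex = performanceIndex;
    }

    /** Splits a three-digit model number into series digit and index. */
    static OpteronModel parse(int modelNumber) {
        if (modelNumber < 100 || modelNumber > 999) {
            throw new IllegalArgumentException("Expected a three-digit model number");
        }
        int series = modelNumber / 100;
        if (series != 1 && series != 2 && series != 8) {
            throw new IllegalArgumentException("Unknown Opteron series: " + series + "xx");
        }
        return new OpteronModel(series, modelNumber % 100);
    }

    public static void main(String[] args) {
        OpteronModel model = parse(244);
        System.out.println("Up to " + model.maxWay + "-way, performance index "
                + model.performanceIndex);
    }
}
```

The performance index is only meaningful when comparing processors within the Opteron family; it isn't a clock rate.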

Where Opteron is different from AMD's Athlon 64 CPU powerhouse is primarily in the cache and number of HyperTransport interfaces. Opteron has been designed using the SledgeHammer core to provide more and potentially faster on-CPU cache than the Athlon 64. The Athlon 64 also supports one HyperTransport interface whereas the Opteron supports up to three HyperTransport interfaces.

Quite simply, a workstation or desktop probably doesn't require vast amounts of cache and on-CPU hyperlevel interconnects (given the typical use of a desktop or workstation). Obviously, in a server configuration, more cache and more meshed interconnects for SMP are important.

Comparison Chart: x86/IA-64/x86-64 Processors

Now that you've gotten an overview of the most common server-oriented x86 processors that are used for WebSphere implementations, Table 4-1 provides recommendations for different CPUs.

Table 4-1: CPU Comparison Chart: x86/x86-64/IA-64 Platform
CPU Name	Workstation	Midrange Server	High-End Server
Intel Pentium III Xeon	-	Yes	Yes
Intel Pentium 4 Generation 1	Yes	Yes	-
Intel Pentium 4 Generation 2	-	Yes	Yes
Intel Pentium 4 Generation 3	-	Yes	Yes
Intel Xeon Generation 1	-	Yes	-
Intel Xeon Generation 2	-	Yes	-
Intel Xeon Generation 3	-	Yes	Yes
Intel Xeon MP	-	-	Yes
Intel Itanium	-	-	Yes
AMD Duron	Yes	-	-
AMD Athlon Classic	Yes	-	-
AMD Athlon Generation 2	-	Yes	-
AMD Athlon XP	-	Yes	Yes
AMD Athlon MP	-	-	Yes
AMD Athlon 64	-	Yes	Yes
AMD Opteron	-	Yes	Yes

The definition of each system type is fairly high level but indicates the suggested use. Workstation could be either a development workstation or a development server for environments where budgets aren't as large and therefore lower-end systems must suffice. Although there's no rule for what constitutes a production or development server, I've used the guide of one to two CPUs constituting a workstation or a midrange server environment where horizontal scaling may be more extensively used. Furthermore, high-end server refers to a larger production environment where more than two CPUs will be required with an emphasis on vertical scaling.

32-Bit or 64-Bit Computing?

The computing industry is at the crossroads of 32-bit versus 64-bit computing. Many, if not all, of the high-end server platforms such as SPARC, Alpha, and PowerPC are 64-bit architectures. The x86 world is slowly but surely catching up with AMD's Opteron and Athlon 64 and Intel's Itanium processors.

The argument of 32-bit versus 64-bit computing is also an old one. The 16-bit versus 32-bit argument was definitely not as prevalent as the 64-bit one, but over the next two to five years, desktop computing will demand 64-bit. Right now, people can get away with 32-bit desktops; who has a personal computer with the need for more than 4GB of memory? Intel's Xeon processor is a 32-bit CPU but actually addresses memory through a masked 36-bit address space, thus supporting up to 64GB of memory.
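The 4GB and 64GB limits both come from the number of address bits: each additional bit doubles the addressable memory, so 32 bits reach 2^32 bytes (4GB) and 36 bits reach 2^36 bytes (64GB). The following minimal sketch shows the arithmetic, assuming byte-addressable memory and binary gigabytes (2^30 bytes); the class and method names are hypothetical.

```java
// Minimal sketch: addressable memory for a given number of address bits.
// Assumes byte-addressable memory and 1GB = 2^30 bytes. Names are illustrative.
class AddressSpace {

    /** Returns the addressable memory in GB for the given address width. */
    static long addressableGigabytes(int addressBits) {
        if (addressBits < 30 || addressBits > 62) {
            throw new IllegalArgumentException("Address width out of range for this sketch");
        }
        long bytes = 1L << addressBits;   // 2^addressBits bytes
        return bytes >> 30;               // divide by 2^30 to get GB
    }

    public static void main(String[] args) {
        System.out.println("32-bit: " + addressableGigabytes(32) + "GB");
        System.out.println("36-bit: " + addressableGigabytes(36) + "GB");
    }
}
```

Note that these are hardware limits. How much of that memory a single process, such as a JVM, can actually use depends on the operating system.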

Typically, at least in the past, if someone required more than 4GB of memory, 64-bit CPUs (and operating systems) were available from IBM, Sun, and Digital/Compaq/Hewlett-Packard (HP). Sun now can supply 64-bit processors for less than $2,000 with a decent amount of memory and internal hard disk space.

The bottom line is that the choice of 32-bit versus 64-bit computing is a nonissue for the Reduced Instruction Set Computer (RISC) players such as the SPARC, PowerPC, and Alpha systems. They're all 64 bits. The issue arises for managers of x86-based systems. The question therefore needs to be asked: do you require bigger than 2GB files, and do you require more than 4GB of memory?

As you'll explore in later chapters, different WebSphere topologies allow you to get around the 4GB memory issue if you decide to use a commodity CPU system architecture. And if there's a need for large database file and/or memory support, then an Intel Xeon configuration may be the obvious choice (getting you to 64GB of memory).

My personal recommendation is this: Take a serious look at the AMD Opteron and Intel Itanium processors. Both are 64-bit, and if you're implementing a new environment where no Java/Java 2 Enterprise Edition (J2EE) code porting will be taking place to migrate from 32-bit to 64-bit, then Itanium is a breeze. AMD Opteron, on the other hand, natively supports x86 instructions, with no translation or abstraction layers. If you're porting from a 32-bit to 64-bit environment and want your application code to access more than the stock amount of memory, then Opteron may be the easiest approach.

Note	At the time of this writing, Itanium supported 32-bit computing through a firmware-based emulation layer. Intel has been hinting that some form of direct or native support for legacy 32-bit instructions may become available later.

That all said, the Java Virtual Machine (JVM) runtime should take care of this memory referencing issue for you. As the Opteron and Itanium processors become more prevalent and more support is made for them, investigate the available JVMs that may take better advantage of the 64-bit platform for you.
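When you evaluate a JVM on one of these platforms, it helps to confirm what the JVM itself reports. The following minimal sketch prints the architecture, the data model (32-bit or 64-bit), and the maximum heap the JVM will attempt to use. The property names sun.arch.data.model and com.ibm.vm.bitmode are vendor-specific and are assumptions here; either may be absent on a given JVM, in which case the sketch reports "unknown". The class and method names are hypothetical.

```java
// Minimal sketch: describe the platform as seen from inside the JVM.
// The data-model properties are vendor-specific assumptions and may be unset.
class JvmBitness {

    /** Returns a one-line summary of architecture, data model, and max heap. */
    static String describe() {
        String arch = System.getProperty("os.arch", "unknown");

        // Try a Sun-style property first, then an IBM-style one (both optional).
        String dataModel = System.getProperty("sun.arch.data.model");
        if (dataModel == null) {
            dataModel = System.getProperty("com.ibm.vm.bitmode", "unknown");
        }

        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024L * 1024L);
        return "os.arch=" + arch
                + ", data model=" + dataModel
                + ", max heap=" + maxHeapMb + "MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this on a candidate system shows quickly whether the JVM you installed is actually a 64-bit build or a 32-bit build running on 64-bit hardware.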

You'll explore JVMs more in later chapters.

So, Which CPU?

You'll now look at the best CPUs. Using Table 4-1, you can break it down to low-end requirements such as desktop machines or development nodes, production servers, and high-end mission-critical servers.

Development and Desktop Needs

AMD Athlon and Intel Pentium 4 processors are the best choice here. If your budget allows, get the maximum amount of cache you can and, as always, the fastest processor rating.

What you'll find, however, is that if you're running a low-end environment for personal development needs, basic environments typically suffice. In my home lab, I still have a number of 266MHz Intel Celerons and Pentium III machines running a mixture of operating systems (Linux, BSD, and x86 Solaris). As long as they have enough memory (typically for personal or development use, 384MB is probably the least you want to have for a WebSphere and a database server environment), you'll find that almost any processor will perform the task.

Generally, it's a bit of a "how long is a piece of string?" question. If you're writing, testing, integrating, or developing a new Java/J2EE-based application that's somewhat large with multiple JVMs and so on, then it may pay to have more grunt.

In later chapters, you'll compare JVM threads to CPUs to Kernel threads; this will also help you understand the CPU requirements.

That said, if you're running a fairly complex application with several JVMs, investigate the hyper-threading features of the Xeons or Pentium 4 3.06+GHz processors. Busy JVMs typically perform best with their own CPUs. Again, you'll investigate this further in later chapters when I discuss operating systems.

Server and High-End Server Needs

As I hinted in the previous section, one of the key drivers of selecting a CPU, other than pure performance, is the number of threads that are active on it, driven by the JVM.

As a guide, JVMs can have an operating system thread to JVM thread ratio of anywhere from 1:1 to 1:25. Depending on how your Java applications are transaction weighted (for example, how many threads are used per client or user transaction), your JVM threads may be long running and require more Kernel threads. Kernel threads are basically what drive the need for additional CPUs or additional CPU performance.
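As a rough illustration of what that ratio range means, the sketch below turns a JVM thread count into a range of kernel threads. It reads the ratio as one kernel thread serving between 1 and 25 JVM threads, which is an assumption about how the guide figure is meant to be applied. The class and method names are hypothetical, and the result is a back-of-the-envelope estimate, not a measurement.

```java
import java.lang.management.ManagementFactory;

// Minimal sketch: estimate the kernel-thread range implied by a 1:1 to 1:25
// operating system thread to JVM thread ratio. Names are illustrative only.
class KernelThreadEstimate {

    /** Fewest kernel threads: 25 JVM threads share each one (rounded up). */
    static int minKernelThreads(int jvmThreads) {
        return (jvmThreads + 24) / 25;
    }

    /** Most kernel threads: every JVM thread has its own (1:1). */
    static int maxKernelThreads(int jvmThreads) {
        return jvmThreads;
    }

    public static void main(String[] args) {
        // Live thread count of this JVM, as reported by the platform MXBean.
        int liveThreads = ManagementFactory.getThreadMXBean().getThreadCount();
        System.out.println("Live JVM threads: " + liveThreads);
        System.out.println("Estimated kernel threads: "
                + minKernelThreads(liveThreads) + " to " + maxKernelThreads(liveThreads));
    }
}
```

For example, an application server running 200 JVM threads would fall somewhere between 8 and 200 kernel threads under this reading, which shows how widely CPU demand can vary with the threading model.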

For servers, cache, memory interconnect speed, interleaving factors, and general memory type are the four key factors. All of these factors typically correlate to the speed of the CPU itself. The faster the CPU, the faster the cache and memory interconnect speed (because of architectural CPU releases and general CPU performance). That is, you're not going to find 533MHz Double Data Rate (DDR) memory on a Pentium 4 1.7GHz processor. The main reason is that 533MHz DDR wasn't available when the Pentium 4 1.7GHz became available.

That's not to say that the Pentium 4 1.7GHz isn't a good choice of CPU. For lower-end or horizontally scaled environments, the Pentium 4 1.7GHz may be a good choice. You'll look at this point in the next chapter when I discuss topologies.

In summary, if your WebSphere applications are referencing large amounts of memory from the Java heap (the memory allocated to the JVM), the DDR memory or the newer Intel 800MHz FSB memory interconnect technologies are a good choice. Large and fast cache will improve the overall performance of any application, but the effect is most noticeable in calculation-intensive applications.

Table 4-2 summarizes and rates each of the more prominent CPU features from low to high in terms of preference or overall performance improvement for the example implementations.

Table 4-2: Example Application Implementation CPU Choice
Application Type	Memory Interconnect	CPU Speed	Cache Size	Threading/Dual CPU
Java Server Page (JSP)/Hypertext Markup Language (HTML)-based WebSphere application	Medium	Medium	Medium	Low
JSP/servlet-based WebSphere application	High	Medium	Medium	Medium
JSP/servlet/Java Database Connectivity (JDBC)-based WebSphere application	High	Medium	Medium	Medium
Small Enterprise JavaBean (EJB)-based WebSphere application	High	High	High	Medium
Multi-JVM EJB-based WebSphere application	High	High	High	High

Note	Of course, if a large budget is available, get the fastest, largest cache-based CPU available!