Examples of Server Systems

Several server systems are available, and their performance is critical to the success of the evolving information infrastructure. Here, we'll discuss a few examples of what is available.

IBM Mainframes: zSeries

Mainframe servers are the ideal platform for some Linux server consolidation efforts. Mainframe servers are large, extremely reliable platforms. These platforms have high memory and I/O bandwidth, low memory latencies, shared large L2 caches, dedicated I/O processors, and very advanced RAS capabilities. IBM zSeries platforms are the mainstay of this technology.

Mainframes support logical partitioning, which is the capability to carve up the machine memory, processor, and I/O resources into multiple machine images, each capable of running an independent operating system. The memory allocated to a partition is dedicated physical memory, whereas the processors can be dedicated or shared, and the I/O channels can also be either dedicated or shared among partitions. When I/O or processor resources are shared among partitions, this sharing is handled by firmware controlling the partitioning and is invisible to the operating systems running in the partitions.

Although Linux can run in a partition on a mainframe, the real advantage is to run large numbers of Linux server images in virtual machines under z/VM. z/VM is an operating system that provides virtual machine environments to its "guests," thus giving each guest the view of an entire machine. This virtual machine implementation allows for supporting a large number of Linux images limited only by system resources. Real deployments have provided thousands of Linux servers running as guests in virtual machines hosted by z/VM on a single mainframe.

Resources can be shared among the Linux images, and high-speed virtual networks are possible. These virtual networks essentially run at memory speed, because there is no need to send packets onto a wire to reach other Linux guests.

New instances of Linux virtual machines are created completely through software control and can be readily scripted. This simplifies system management and allows deployment of a new Linux server (as a guest to z/VM) in a matter of minutes, rather than the hours it would take to install a new physical server. Physical resources do not have to be reserved and dedicated for each guest; rather, under control of z/VM, all the system resources are shared.

z/VM understands the hardware and can exploit the advanced RAS capabilities on behalf of the guest servers. It provides a robust workload monitoring and control facility that supplies advanced resource management. z/VM also provides many debugging tools to assist in diagnosing software problems with a Linux guest and aids in testing new servers. For details on the IBM mainframe technology circa 2000, see http://www.research.ibm.com/journal/rd43-56.html.

These machines are optimal for server consolidation. They can support hundreds to thousands of discrete Linux images. Workload types suited to run in this environment exhibit the following characteristics:

  • I/O-intensive operations (for example, serving web pages)

  • Lightly loaded servers

Computation-intensive, graphic-oriented workloads are not good matches for mainframe servers, nor are workloads that check the system clock often. Heavily loaded servers also are not good candidates for consolidation on a mainframe server.

Older mainframes did not support IEEE-format floating-point operations. Thus, all floating-point operations under Linux had to be converted to the IBM proprietary format, executed, and then converted back to IEEE format. Newer mainframes support IEEE format and do not pay this performance penalty.

Advanced mainframe system management and programming skills are needed for planning and installing a Linux deployment under z/VM. However, after the system is installed and configured, adding new virtual servers is fairly simple.

Hardware Design Overview

The IBM zSeries architecture implements either 12 or 20 processors on a processor module. There are multiple types of processors, determined by the microcode loaded to control each processor. The principal types are the processor units, which are equivalent to CPUs on most servers, and the System Assist Processor (SAP), which controls I/O operations. Each machine has at least two SAPs and could have more, depending on its I/O capacity needs.

The modules have two large L2 caches, with each L2 cache shared among half the processors on the module. This differs from standard servers that have an L2 cache associated with each processor. The shared L2 cache allows a process to migrate between processors without losing its cache warmth.

Memory bandwidth is 1.5GBps per processor, with an aggregate system bandwidth of 24GBps. This very high memory bandwidth favors applications whose working sets or access patterns are not cache-friendly.

Normal Linux servers usually run at no more than 50% average utilization to provide headroom for workload spikes. The design of the mainframes allows systems to run at 80% to 90% utilization.

Reliability

A significant feature of the mainframe servers is their extremely high reliability. Reliability on these systems is measured in terms of overall availability. Five-nines of reliability (that is, 99.999% availability) is a common target that is achievable at the hardware level. This amounts to roughly 5 minutes of downtime per year. This level of reliability is achieved through advanced hardware design techniques referred to as Continuous Reliable Operation (CRO).
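
As a quick check of that figure, the following small program (an illustrative calculation only, not part of the text) computes the downtime budget implied by a given availability target; the 99.999% value is the one quoted above.

    /* Rough downtime budget for a given availability target.
     * The 99.999% figure comes from the text; everything else is
     * just arithmetic. */
    #include <stdio.h>

    int main(void)
    {
        double availability = 0.99999;              /* "five nines" */
        double minutes_per_year = 365.25 * 24 * 60;
        double downtime = (1.0 - availability) * minutes_per_year;

        printf("Allowed downtime per year: %.2f minutes\n", downtime);
        return 0;
    }

Running this prints approximately 5.26 minutes per year.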

The goal of continuous reliable operation is to keep a customer's workload running without interruptions for error conditions, maintenance, or system change. While remaining available, the machine must also provide reliable results (data integrity) and stable performance. These requirements are met through constant error checking, hot-swap capabilities, and other advanced methods.

IBM Mainframe I/O Design

Each system has two or more processors dedicated to performing I/O operations. These SAPs run specialized microcode and manage I/O operations, offloading that work from the main processors. The following I/O cards are supported:

  • ESCON-16 channel cards

  • FICON channel cards

  • Open Systems Adapter (OSA) Express

  • Gigabit Ethernet (GbE)

  • Asynchronous Transfer Mode (ATM)

  • Fast Ethernet (FENET)

  • PCI-Cryptographic Coprocessor (PCI-CC) cards

ESCON is a zSeries technology that supports 20MBps half-duplex serial bit transmission over fiber-optic cables.

FICON is a newer zSeries technology capable of supporting 100MBps full-duplex serial interface over fiber. It supports multiple outstanding I/O operations at the same time to different channel control units. FICON provides the same I/O concurrency as up to eight ESCON channels.

Open Systems Adapter cards (Ethernet, token ring, or ATM) provide network connectivity. The OSA-Express card implements Queued Direct I/O (QDIO), which uses shared memory queues and a signaling protocol to exchange data directly with the TCP/IP stack.

Disk storage is connected via ESCON or FICON. The recommended storage device is the Enterprise Storage Server (ESS), commonly known as Shark. Shark is a full-featured disk-storage array that supports up to nearly 14 terabytes of disk storage. The ESS has large internal caches and multiple RISC processors to provide high-performance disk storage. Multiple servers can be connected to a single ESS using Fibre Channel, ESCON, FICON, or UltraSCSI technology. More information about the ESS is available at http://www.storage.ibm.com/hardsoft/products/ess/ess.htm.

Blades

Blades are computers implemented in a small form factor, usually an entire computer, including one or two disk drives, on a single card. Multiple cards (blades) reside in a common chassis that provides power, cooling, and cabling to support network and system management connectivity. This packaging allows significant computing power to be provided in dense packaging, thus saving space. Blades are used heavily in data centers that need large quantities of relatively small independent computers, such as large web server environments. Blades are also candidates for use in clusters.

Because size is a significant factor for blade designs, most blades are limited to single processors, although blades with dual processors are available. As processor technology and packaging continue to advance, it is likely that blades with high processor counts will become available. However, the practicality of large processor count blades is somewhat hampered by the amount of I/O connectivity available.

NUMA

Demand for greater computing capacity has led to the increased use of multiprocessor computers. Most multiprocessor computers are considered Symmetric Multiprocessors (SMPs) because each processor is equal and has equal access to all system resources (such as memory and I/O buses). SMP systems are generally built around a system bus to which all system components are connected and over which they communicate. As SMP systems have increased their processor counts, the system bus has increasingly become a bottleneck. One solution that is gaining use by hardware designers is Non-Uniform Memory Architecture (NUMA).

NUMA systems colocate a subset of the system's overall processors and memory into nodes and provide a high-speed, high-bandwidth interconnect between the nodes, as shown in Figure 3-2. Thus, there are multiple physical regions of memory, but all memory is tied together into a single cache-coherent physical address space. In the resulting system, some processors are closer to a given region of physical memory than are other processors. Conversely, for any processor, some memory is considered local (that is, it is close to the processor) and other memory is remote. Similar characteristics can also apply to the I/O buses; that is, I/O buses can be associated with nodes.

Figure 3-2. NUMA's high-bandwidth interconnect.


Although the key characteristic of NUMA systems is the variable distance of portions of memory from other system components, there are numerous NUMA system designs. At one end of the spectrum are designs where all nodes are symmetrical: they all contain memory, CPUs, and I/O buses. At the other end of the spectrum are systems with different types of nodes, the extreme case being separate CPU nodes, memory nodes, and I/O nodes. All NUMA hardware designs are characterized by regions of memory being at varying distances from other resources, and thus having different access speeds.

To maximize performance on a NUMA platform, Linux takes into account the way the system resources are physically laid out. This includes information such as which CPUs are on which node, which range of physical memory is on each node, and what node an I/O bus is connected to. This type of information describes the system's topology.
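
The sketch below shows one way a program can read part of this topology information itself. It assumes a kernel that exposes the /sys/devices/system/node directory (present in the 2.6 series and later); the exact file names (cpulist versus the older cpumap) vary with kernel version.

    /* Sketch: print which CPUs belong to each NUMA node by reading sysfs.
     * Assumes a kernel that exposes /sys/devices/system/node with a
     * cpulist file per node; file names vary with kernel version. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char path[128], line[256];
        int node;

        for (node = 0; node < 1024; node++) {
            FILE *f;
            snprintf(path, sizeof(path),
                     "/sys/devices/system/node/node%d/cpulist", node);
            f = fopen(path, "r");
            if (!f)
                break;                      /* no more nodes */
            if (fgets(line, sizeof(line), f)) {
                line[strcspn(line, "\n")] = '\0';
                printf("node %d: CPUs %s\n", node, line);
            }
            fclose(f);
        }
        return 0;
    }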

Linux running on a NUMA system obtains optimal performance by keeping memory accesses to the closest physical memory. For example, processors benefit by accessing memory on the same node (or closest memory node), and I/O throughput gains by using memory on the same (or closest) node to the bus the I/O is going through. At the process level, it is optimal to allocate all of a process's memory from the node containing the CPU(s) the process is executing on. However, this also requires keeping the process on the same node.
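
As a rough illustration of node-local placement, the following sketch uses the user-space libnuma library (shipped with the numactl package, which is an assumption here rather than something covered by the text) to bind the process to node 0 and allocate its working buffer from node 0's memory. Build with -lnuma.

    /* Sketch: pin the current process to node 0 and allocate its working
     * buffer from node 0's memory, so accesses stay node-local.
     * Assumes libnuma is installed; build with: cc prog.c -lnuma */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        size_t size = 64 * 1024 * 1024;     /* 64MB working buffer */
        void *buf;

        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        /* Restrict execution to node 0 so the CPUs and the memory
         * allocated below stay on the same node. */
        if (numa_run_on_node(0) < 0) {
            perror("numa_run_on_node");
            return 1;
        }

        /* Allocate physical pages on node 0. */
        buf = numa_alloc_onnode(size, 0);
        if (!buf) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }

        /* ... use buf; accesses are local to node 0 ... */

        numa_free(buf, size);
        return 0;
    }

The same placement can usually be requested without code changes by launching an unmodified program under numactl, for example numactl --cpunodebind=0 --membind=0 ./app.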

Hardware Implementations

Many design and implementation choices result in a wide variety of NUMA platforms. This section describes several examples of NUMA hardware implementations; it is not an exhaustive survey.

Types of Nodes

The most common implementation of NUMA systems consists of interconnecting symmetrical nodes. In this case, the node itself is an SMP system that has some form of high-speed and high-bandwidth interconnect linking it to other nodes. Each node contains some number of processors, physical memory, and I/O buses. Typically, there is a node-level cache. This type of NUMA system is depicted in Figure 3-3.

Figure 3-3. Hierarchical NUMA design.


A variant on this design is to put only the processors and memory on the main node, and then have the I/O buses be separate. Another design option is to have separate nodes for processors, memory, and I/O buses, which are all interconnected.

It is also possible to have nodes that contain nodes, resulting in a hierarchical NUMA design. This is depicted in Figure 3-3.

Types of Interconnects

There is no standardization of interconnect technology. More relevant to Linux, however, is the topology of the interconnect. NUMA machines can use the following interconnect topologies:

  • Ring topology, in which each node is connected to the node on either side of it. Memory access latencies can be nonsymmetric; that is, accesses from node A to node B might take longer than accesses from node B to node A.

  • Crossbar interconnect, where all nodes connect to a common crossbar.

  • Point-to-point, where each node has a number of ports to connect to other nodes. The number of nodes in the system is limited to the number of connection ports plus one, and each node is directly connected to each other node. This type of configuration is depicted in Figure 3-3.

  • Mesh topologies, which are more complex topologies that, like point-to-point topologies, are built on each node having a number of connection ports. Unlike point-to-point topologies, however, there is no direct connection between each node. Figure 3-4 depicts a mesh topology for an 8-node NUMA system with each node having three interconnects. This allows direct connections to three "close" nodes. Access to other nodes requires an additional "hop," passing through a close node.

    Figure 3-4. 8-node NUMA configuration.


The topology provided by the interconnect affects the distance between nodes. This distance affects the access times for memory between the nodes.

Latency Ratios

An important measurement for determining a system's "NUMA-ness" is the latency ratio: the ratio of the memory latency for off-node (remote) access to that for on-node (local) access. Depending on the topology of the interconnect, there might be multiple off-node latencies. This ratio is used to analyze the cost of memory references to different parts of the physical address space and thus influences decisions affecting memory usage.
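
One rough way to see these ratios on a running system is to read the kernel's node distance table, which mirrors the firmware's SLIT table (where the local distance is 10 by convention). The sketch below assumes the kernel exposes /sys/devices/system/node/node0/distance; the values are relative distances, not measured latencies.

    /* Sketch: estimate latency ratios from the kernel's node distance
     * table. Assumes /sys/devices/system/node/node0/distance exists;
     * the numbers are firmware-provided relative distances. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/devices/system/node/node0/distance", "r");
        int local = 0, d, i = 0;

        if (!f) {
            perror("open distance");
            return 1;
        }
        while (fscanf(f, "%d", &d) == 1) {
            if (i == 0)
                local = d;                  /* node 0 to itself */
            else if (local > 0)
                printf("node 0 -> node %d: ratio %.1f:1\n",
                       i, (double)d / local);
            i++;
        }
        fclose(f);
        return 0;
    }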

Specific NUMA Implementations

Several hardware vendors are building NUMA machines that run the Linux operating system. This section briefly describes some of these machines, but it is not an all-inclusive survey of the existing implementations.

One of the earlier commercial NUMA machines is the IBM NUMA-Q. This machine is based on nodes that contain four processors (i386), memory, and PCI buses. Each node also contains a management module to coordinate booting, monitor environmentals, and communicate with the system console. The nodes are interconnected using a ring topology. Up to 16 nodes can be connected for a maximum of 64 processors and 64GB of memory. Remote-to-local memory latency ratios range from 10:1 to 20:1. Each node has a large remote cache that helps compensate for the large remote memory latencies. Much of the Linux NUMA development has been on these systems because of their availability.

NEC builds NUMA systems using the Intel Itanium processor. The most recent system in this line is the NEC TX7. The TX7 supports up to 32 Itanium2 processors in nodes of four processors each. The nodes are connected by a crossbar and grouped in two supernodes of four nodes each. The crossbar provides fast access to nonlocal memory with low latency and high bandwidth (12.8GBps per node). The memory latency ratio for remote-to-local memory in the same supernode is 1.6:1. The remote-to-local memory latency ratio for outside the supernode is 2.1:1. There is no node-level cache. I/O devices are connected through PCI-X buses to the crossbar interconnect and thus are all the same distance to any CPU/node.

The large IBM xSeries boxes use Intel processors and the IBM XA-32 chipset. This chipset provides an architecture that supports four processors, memory, PCI buses, and three interconnect ports. These interconnect ports allow point-to-point connection of up to four nodes for a 16-processor system. The system depicted in Figure 3-2 approximates the configuration of a four-node x440. Also supported is a connection to an external system with additional PCI slots to increase the system's I/O capacity. The IBM eServer xSeries 440 is built on this architecture with Intel Xeon processors.

MPIO on NUMA Systems

Multipath I/O, as mentioned earlier in this chapter, can provide additional I/O bandwidth for servers. However, on NUMA platforms, it can provide even larger benefits. MPIO involves using multiple I/O adaptors (SCSI cards, network cards) to gain multiple paths to the underlying resource (hard disks, the network), thus increasing overall bandwidth. On SMP platforms, potential speedups due to MPIO are limited by the fact that all CPUs and memory typically share a bus, which has a maximum bandwidth. On NUMA platforms, however, different groups of CPUs, memory, and I/O buses have their own distinct interconnects. Because these distinct interconnects allow each node to independently reach its maximum bandwidth, larger I/O aggregate throughput is likely.

An ideal MPIO on NUMA setup consists of an I/O card (SCSI, network, and so on) on each node connected to every I/O device, so that no matter where the requesting process runs, or where the memory is, there is always a local route to the I/O device. With this hardware configuration, it is possible to saturate several PCI buses with data. This is even further assisted by the fact that many machines of this size use RAID or other MD devices, thus increasing the potential bandwidth by using multiple disks.
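
Building such a configuration is easier when you know which node each adapter is attached to. The sketch below assumes a kernel that exposes the per-device numa_node attribute in sysfs (a value of -1 means the platform did not report a node); the /sys/class/net/eth0/device path mentioned in the comment is only an example, not something taken from the text.

    /* Sketch: report which NUMA node a PCI adapter is attached to, so
     * I/O can be driven from CPUs and memory on that node. Assumes a
     * kernel that exposes the per-device numa_node attribute. Pass a
     * sysfs device directory, e.g. /sys/class/net/eth0/device. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char path[256];
        int node = -1;
        FILE *f;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <sysfs-device-dir>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "%s/numa_node", argv[1]);
        f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }
        if (fscanf(f, "%d", &node) == 1)
            printf("%s is on NUMA node %d\n", argv[1], node);
        fclose(f);
        return 0;
    }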

Timers

On uniprocessor systems, the processor has a time source that is easily and quickly accessible, typically implemented as a register. On SMP systems, the processors' time source is usually synchronized because all the processors are clocked at the same rate. Therefore, synchronization of the time register between processors is a straightforward task.

On NUMA systems, synchronizing the processors' time sources is not practical: each node has its own crystal providing the clock frequency, and minute differences between these frequencies lead to time skew between nodes.

On multiprocessor systems, it is imperative that there be a consistent system time. Otherwise, time stamps provided by different processors cannot be relied on for ordering, and a process dispatched on a different processor might observe unexpected jumps (backward or forward) in time.

Ideally, the hardware provides one global time source with quick access times. Unfortunately, global time sources tend to require off-chip access and often off-node access, which tend to be slow. Clock implementations are very architecture-specific, with no clear leading implementation among the NUMA platforms. On the IBM eServer xSeries 440, for example, the global time source is provided by node 0, and all other nodes must go off-node to get the time.

In Linux 2.6, the i386 timer subsystem has an abstraction layer that simplifies the addition of a different time source provided by a specific machine architecture. For standard i386 architecture machines, the timestamp counter (TSC) is used, which provides a very quick time reference. For NUMA machines, a global time source is used (for example, on the IBM eServer xSeries 440, the global chipset timer is used).
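
For illustration, the following sketch reads the TSC directly with the rdtsc instruction; it assumes an i386/x86-64 processor and a compiler that accepts GCC-style inline assembly. As described above, TSC values read on different processors (and especially on different NUMA nodes) are not guaranteed to be synchronized, so they cannot be compared across CPUs.

    /* Sketch: read the per-CPU timestamp counter (TSC) on i386/x86-64.
     * The TSC is fast to read, but counters on different processors
     * are not guaranteed to be synchronized. */
    #include <stdio.h>

    static inline unsigned long long read_tsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        unsigned long long t1 = read_tsc();
        unsigned long long t2 = read_tsc();

        printf("back-to-back rdtsc delta: %llu cycles\n", t2 - t1);
        return 0;
    }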
