1.1 An Introduction to Computer Architecture

A full discussion of computer architecture is far beyond the level of this text. Periodically, we'll go into architectural matters in order to provide the conceptual underpinnings of the system under discussion. However, if this sort of thing interests you, there are a great many excellent texts on the topic. Perhaps the most commonly used are two textbooks by John Hennessy and David Patterson: Computer Organization and Design: The Hardware/Software Interface and Computer Architecture: A Quantitative Approach (both published by Morgan Kaufmann).

In this section, we'll focus on the two most important general concepts of architecture: the general means by which we approach a problem (the levels of transformation) and the essential model around which computers are designed (the von Neumann model).

1.1.1 Levels of Transformation

When we approach a problem, we must reduce it to something that a computer can understand and work with: this might be anything from a set of logic gates (solving the fundamental problem of "How do we build a general-purpose computing machine?") to a few million bits' worth of binary code. As we proceed through these logical steps, we transform the problem into a "simpler" one (at least from the computer's point of view). These steps are the levels of transformation.

1.1.1.1 Software: algorithms and languages

When faced with a problem where we think a computer will be of assistance, we first develop an algorithm for completing the task in question. Algorithms are, very simply, a well-defined, repeatable set of instructions for performing a particular task -- for example, a clerk inspecting and routing incoming mail follows an algorithm for how to properly sort the mail.
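
To make the idea concrete, here is one way the clerk's routine might be sketched in C. The struct letter type, the route_mail function, and the department names are all invented for this illustration.

    #include <stdio.h>
    #include <string.h>

    /* A minimal sketch of the mail clerk's algorithm. The departments and
     * the notion of a "recipient" field are invented for illustration. */
    struct letter {
        const char *recipient;
    };

    static const char *route_mail(const struct letter *l)
    {
        /* Inspect each piece of mail and decide where it should go. */
        if (strstr(l->recipient, "Accounts"))
            return "accounting office";
        if (strstr(l->recipient, "Engineering"))
            return "engineering floor";
        return "front desk";            /* anything unrecognized */
    }

    int main(void)
    {
        struct letter inbox[] = {
            { "Accounts Payable" },
            { "Engineering, Bldg. 3" },
            { "Occupant" },
        };
        for (size_t i = 0; i < sizeof inbox / sizeof inbox[0]; i++)
            printf("\"%s\" -> %s\n", inbox[i].recipient, route_mail(&inbox[i]));
        return 0;
    }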

This algorithm must then be translated by a programmer into a program written in a language. [1] Generally, this is a high-level language, such as C or Perl, although it might be a low-level language, such as assembler. The language layer exists to make our lives easier: the structure and grammar of high-level languages lets us easily write complex programs. This high-level language program, which is usually portable between different systems, is then transformed by a compiler into the low-level instructions required by a specific system. These instructions are specified by the Instruction Set Architecture.

[1] It has been conjectured that mathematicians are devices for transforming coffee into theorems. If this is true, then perhaps programmers are devices for transforming caffeine and algorithms into source code.

1.1.1.2 The Instruction Set Architecture

The Instruction Set Architecture, or ISA, is the fundamental language of the microprocessor: it defines the basic, indivisible instructions that we can execute. The ISA serves as the interface between software and hardware. Examples of instruction set architectures include IA-32, which is used by Intel and AMD CPUs; MIPS, which is implemented in the Silicon Graphics/MIPS R-series microprocessors (e.g., the R12000); and the SPARC V9 instruction set used by the Sun Microsystems UltraSPARC series.
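
To make this concrete, here is a trivial C function together with the sort of instruction sequence a compiler might produce for it. The mnemonics in the comment are generic, RISC-flavored pseudo-assembly invented for illustration; the actual instructions depend entirely on the target ISA and the compiler.

    /* A one-line C function, with an invented pseudo-assembly rendering in
     * the comment below; no real ISA is being quoted here. */
    int add(int a, int b)
    {
        return a + b;
        /* A compiler for a load/store ISA might hypothetically emit:
         *     add   r3, r1, r2    ; r3 <- r1 + r2 (arguments arrive in r1, r2)
         *     ret                 ; return to the caller, result in r3
         */
    }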

1.1.1.3 Hardware: microarchitecture, circuits, and devices

At this level, we are firmly in the grasp of electrical and computer engineering. We concern ourselves with functional units of microarchitecture and the efficiency of our design. Below the microarchitectural level, we worry about how to implement the functional units through circuit design: the problems of electrical interference become very real. A full discussion of the hardware layer is far beyond us here; tuning the implementations of microprocessors is not something we are generally able to do.

1.1.2 The von Neumann Model

The von Neumann model has served as the basic design model for all modern computing systems: it provides a framework upon which we can hang the abstractions and flesh generated by the levels of transformation. [2] The model consists of four core components:

[2] A good book to read more about the von Neumann model is William Aspray's John von Neumann and the Origins of Modern Computing (MIT Press).

  • A memory system , which stores both instructions and data; this design is known as a stored-program computer. The memory is accessed by means of the memory address register (MAR), where the system puts the address of a location in memory, and a memory data register (MDR), where the memory subsystem puts the data stored at the requested location. I discuss memory in more detail in Chapter 4.

  • At least one processing unit , often known as the arithmetic and logic unit (ALU); the processing unit is more commonly called the central processing unit (CPU). [3] It is responsible for the execution of all instructions. The processor also has a small amount of very fast storage space, called the register file. I discuss processors in detail in Chapter 3.

    [3] In modern implementations, the "CPU" includes both the central processing unit itself and the control unit.

  • A control unit , which is responsible for controlling cross-component operations. It maintains a program counter , which holds the address of the next instruction to be fetched, and an instruction register , which holds the instruction currently being executed; a brief sketch of this fetch-execute interplay appears after this list. The peculiarities of control design are beyond the scope of this text.

  • The system needs a nonvolatile way to store data, as well as ways to present it to the user and to accept input. This is the domain of the input/output (I/O) subsystem. This book primarily concerns itself with disk drives as a mechanism for I/O; I discuss them in Chapter 5. I also discuss network I/O in Chapter 7.
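
To see how these components cooperate, here is a minimal C sketch of the fetch-execute cycle. The three-instruction "ISA" (LOAD, ADD, HALT), the word layout, and the two-register file are invented solely for this illustration and do not correspond to any real processor.

    #include <stdio.h>
    #include <stdint.h>

    enum { OP_LOAD, OP_ADD, OP_HALT };

    struct word { uint8_t op, reg, addr; };   /* one toy memory word */

    int main(void)
    {
        /* Memory holds both instructions and data: a stored-program computer.
         * Data words keep their value in the first byte of the word. */
        struct word memory[16] = {
            { OP_LOAD, 0, 8 },                /* r0 <- memory[8]   */
            { OP_LOAD, 1, 9 },                /* r1 <- memory[9]   */
            { OP_ADD,  0, 0 },                /* r0 <- r0 + r1     */
            { OP_HALT, 0, 0 },
            [8] = { .op = 40 },               /* data word: 40     */
            [9] = { .op = 2  },               /* data word: 2      */
        };

        uint8_t regfile[2] = { 0, 0 };        /* the register file */
        uint8_t pc = 0;                       /* program counter   */
        uint8_t mar;                          /* memory address register */
        struct word mdr, ir;                  /* memory data and instruction registers */

        for (;;) {
            /* Fetch: the control unit puts the PC in the MAR; the memory
             * system returns the addressed word in the MDR, which is copied
             * into the instruction register. */
            mar = pc;
            mdr = memory[mar];
            ir  = mdr;
            pc++;

            /* Decode and execute. */
            if (ir.op == OP_HALT)
                break;
            if (ir.op == OP_LOAD) {           /* load a data word into a register */
                mar = ir.addr;
                mdr = memory[mar];
                regfile[ir.reg] = mdr.op;
            } else if (ir.op == OP_ADD) {     /* the ALU adds r0 and r1 */
                regfile[0] = regfile[0] + regfile[1];
            }
        }
        printf("r0 = %d\n", regfile[0]);      /* prints 42 */
        return 0;
    }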

Despite all the advances in computing over the last sixty years, modern systems still fit into this framework. That is a very powerful statement: despite the fact that computers are orders of magnitude faster now, and are being used in ways that weren't even imaginable at the end of the Second World War, the basic ideas, as formulated by von Neumann and his colleagues, are still applicable today.

1.1.3 Caches and the Memory Hierarchy

As you'll see in Section 1.2 later in this chapter, one of the principles of performance tuning is that there are always trade-offs. This problem was recognized by the pioneers in the field, and we still do not have a perfect solution today. In the case of data storage, we are often presented with a choice between cost, speed, and size. (Physical parameters, such as heat dissipation, also play a role, but for this discussion, they're usually subsumed into the other variables.) It is possible to build extremely large, extremely fast memory systems -- for example, the Cray 1S supercomputer used very fast static RAM exclusively for memory. [4] This approach, however, cannot be adapted across the spectrum of computing devices.

[4] Heat issues with memory were the primary reason that the system was liquid cooled. The memory subsystem also comprised about three-quarters of the cost of the machine in a typical installation.

The problem we are trying to solve is that storage size tends to be inversely proportional to performance: the larger a storage medium, the slower it generally is, particularly relative to the next level up in price/performance. A modern microprocessor might have a cycle time measured in fractions of a nanosecond, while making the trip to main memory can easily be fifty times slower.

To work around this problem, we employ something known as the memory hierarchy, which arranges storage into a pyramid of levels (Figure 1-1). At the top of the pyramid, we have very small areas of storage that are exceedingly fast. As we progress down the pyramid, things become increasingly slow, but correspondingly larger. At the foundation of the pyramid, we might have storage in a tape library: many terabytes of capacity, but it might take minutes to access the information we are looking for.

Figure 1-1. The memory hierarchy

From the point of view of the microprocessor, main memory is very slow. Anything that makes us go to main memory is bad -- unless we're going to main memory to prevent going to an even slower storage medium (such as disk).

The function of the pyramid is to cache the most frequently used data and instructions in the higher levels. For example, if we keep accessing the same file on tape, we might want to store a temporary copy on the next fastest level of storage (disk). We can similarly store a file we keep accessing from disk in main memory, taking advantage of main memory's substantial performance benefit over disk.
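
As a concrete illustration of the idea at the disk/memory boundary, here is a minimal C sketch: the first read of a file goes to disk, but the contents are kept in a buffer so that later reads are served from memory. The read_file_cached function, the /etc/hosts path, and the 1 MB limit are arbitrary choices for this sketch, not part of any real caching subsystem.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Keep the most recently read file in memory so that repeated accesses
     * avoid the slower medium (disk). Purely illustrative. */
    static char  cached_name[256];
    static char *cached_data;
    static long  cached_len = -1;

    static const char *read_file_cached(const char *name, long *len)
    {
        if (cached_len >= 0 && strcmp(name, cached_name) == 0) {
            *len = cached_len;                  /* hit: served from main memory */
            return cached_data;
        }
        FILE *f = fopen(name, "rb");            /* miss: go down a level, to disk */
        if (f == NULL)
            return NULL;
        free(cached_data);
        cached_data = malloc(1 << 20);          /* up to 1 MB for this sketch */
        if (cached_data == NULL) {
            fclose(f);
            cached_len = -1;
            return NULL;
        }
        cached_len = (long)fread(cached_data, 1, 1 << 20, f);
        fclose(f);
        snprintf(cached_name, sizeof cached_name, "%s", name);
        *len = cached_len;
        return cached_data;
    }

    int main(void)
    {
        long len;
        if (read_file_cached("/etc/hosts", &len))       /* first read: disk */
            printf("read %ld bytes from disk\n", len);
        if (read_file_cached("/etc/hosts", &len))       /* second read: memory */
            printf("read %ld bytes from the in-memory copy\n", len);
        return 0;
    }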

1.1.4 The Benefits of a 64-Bit Architecture

Companies that produce computer hardware and software often make a point of mentioning the size of their systems' address space (typically 32 or 64 bits). In the last five years, the shift from 32-bit to 64-bit microprocessors and operating systems has generated a great deal of hype from various marketing departments. The truth is that although in certain cases 64-bit architectures run significantly faster than 32-bit architectures, in general, performance is equivalent.

1.1.4.1 What does it mean to be 64-bit?

The number of "bits" refers to the width of a data path. However, what this actually means depends on context. For example, we might refer to a 16-bit data path (for example, UltraSCSI). This means that the interconnect can transfer 16 bits of information at a time. With all other things held constant, it would be twice as fast as an interconnect with an 8-bit data path.

The "bitness" of a memory system refers to how many wires are used to transfer a memory address. For example, if we had an 8-bit path to the memory address, and we wanted the 19th location in memory, we would turn on the appropriate wires (1, 2, and 5, numbering from 1 at the least significant bit; we derive this from writing 19 in binary, which gives 00010011 -- everywhere there is a one, we turn on that wire). Note, however, that since we only have 8 bits' worth of addressing, we are limited to 256 (2^8) addresses in memory. 32-bit systems are, therefore, limited to 4,294,967,296 (2^32) locations in memory. Since memory is typically addressed in 1-byte units, this means that the system can't directly access more than 4 GB of memory. The shift to 64-bit operating systems and hardware means that the maximum amount of addressable memory is about 16 exabytes (2^64 bytes, or roughly 17 billion GB), which is probably sufficient for the immediately foreseeable future.
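
The short C sketch below runs through the same arithmetic: given an address and a bus width, it prints which wires are asserted and how many locations that width can reach. The wire numbering (from 1 at the least significant bit) matches the example above; the program is purely illustrative.

    #include <stdio.h>

    /* Show which address wires must be asserted to select a given location,
     * and how many locations an n-bit address bus can reach. */
    static void show_address(unsigned long address, int bus_width)
    {
        printf("address %lu (%d-bit bus) asserts wires:", address, bus_width);
        for (int bit = 0; bit < bus_width; bit++)
            if (address & (1UL << bit))
                printf(" %d", bit + 1);     /* wires numbered from 1 at the LSB */
        printf("\n%d bits can address %lu locations\n",
               bus_width, 1UL << bus_width);
    }

    int main(void)
    {
        show_address(19, 8);    /* 19 decimal = 00010011 binary -> wires 1, 2, 5 */
        return 0;
    }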

Unfortunately, it's often not quite this simple in practice. A 32-bit SPARC system is actually capable of having more than 4 GB of memory installed, but, in Solaris, no single process can use more than 4 GB. This is because the hardware that controls memory management actually uses a 44-bit addressing scheme, but the Solaris operating system can only give any one process the amount of memory addressable in 32 bits.

1.1.4.2 Performance ramifications

The change from 32-bit to 64-bit architectures, then, expanded both the amount of memory a system can install and the amount a single process can address. An obvious question is, how did applications benefit from this? Here are some kinds of applications that benefited from larger memory spaces:

  • Applications that could not use the most time-efficient algorithm for a problem because that algorithm would use more than 4 GB of memory.

  • Applications where caching large data sets is critically important; the more memory available to the process, the more it can cache.

  • Applications where the system is short on memory due to overwhelming utilization (many small processes). Note that in SPARC systems, this was not a problem: each process could only see 4 GB, but the system could have much more installed.

In general, the biggest winners from 64-bit systems are high-performance computing and corporate database engines. For the average desktop workstation, 32 bits is plenty.

Unfortunately, the change to 64-bit systems also meant that the underlying operating system and system calls had to be modified, and that pointers doubled in size, so more data must be moved during pointer operations. As a result, there may be a very slight performance penalty associated with running in 64-bit mode.
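
As a rough illustration, consider how a pointer-heavy structure grows when the same C source is built for a 64-bit data model instead of a 32-bit one (for example, with a compiler's -m32/-m64 options, where supported). The struct node type here is invented for the example.

    #include <stdio.h>

    /* A linked-list node: one pointer plus a small payload. Under a 32-bit
     * data model the pointer occupies 4 bytes; under a 64-bit model it
     * occupies 8, so the node (and the memory traffic it generates) grows
     * even though the payload is unchanged. */
    struct node {
        struct node *next;
        int          value;
    };

    int main(void)
    {
        printf("sizeof(void *)      = %zu bytes\n", sizeof(void *));
        printf("sizeof(struct node) = %zu bytes\n", sizeof(struct node));
        return 0;
    }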


