A.1 Processor Architecture

I have found that too often people ignore and underestimate the importance of appreciating the workings of the processor. After all, it is the processor that executes instructions to do the work our systems were designed to accomplish. Since the first computer was designed, engineers have striven to push processor performance ever higher. I remember being taught a principle about getting more work done:

  • Work harder: In terms of a processor, this can mean increasing the clock speed of the processor and/or increasing the density of transistors on the processor itself. Both of these solutions pose problems in terms of heat dissipation, purity of raw materials, and production costs. Processor architects are now finding that packing transistors ever more densely brings quantum-level effects into play, notably quantum tunneling, which precludes significant further miniaturization using current materials and fabrication techniques. As always, processor design is a trade-off between what the architects would like to do and what will make a profit for the organization.

  • Work smarter: Here, we need to look at the overall architecture of the processor. We have just noted that processors may be reaching some fundamental brick walls in terms of quantum-level effects in the materials used to construct them. Over the last few decades, we have seen different processor families emerge that take different approaches to how the processor operates. We are thinking of CISC, RISC, Vector, and VLIW processors.

  • Get someone else to help you: I am thinking of two ways in which we can get someone else to help. First, we can have help on the processor itself. This commonly takes the form of a co-processor, nowadays at least a floating-point co-processor. Many computations involve floating-point numbers, i.e., fractional numbers like 3.5. The other kind of help is having more than one processor. The design of a multi-processor machine involves many design criteria. Some of the more common multi-processor architectures include SMP, NUMA, CC-NUMA, NORMA, MPP, and COWs. The decision to choose one architecture over another can be based on cost, flexibility of configuration, the type of computing problems expected, what the competition is doing, as well as conformance with current industry standards.

A.1.1 The basic processor

We can think of the basic functioning of a simple processor by considering what it is designed to do: execute instructions. The closest we can come to seeing an instruction is via a language known as assembly language, sometimes known as assembler. Instructions can take many forms: arithmetic, shift, logical, and floating-point instructions, to name a few. Considering this basic functionality, we can further break it down into a basic architecture. Here goes. The processor performs these functions:

  • Fetches an instruction

  • Decodes that instruction

  • Fetches any data that instruction refers to

  • Executes the instruction

  • Stores/preserves the result for further processing

We call this the fetch-execute cycle. Most of the computers we deal with in the business world use this basic philosophy and are known as stored-program computers. Let's take another step in further defining this architecture by drawing a diagram of a simple processor and explaining what the basic components are intended for (Figure A-1):

Note: Not all connection paths and components are necessarily shown. This is a simplified model of a processor. In today's processor architectures, you may have many more esoteric components such as shift/merge, multiply, cache, TLB, and branch management circuitry. This simplified model is used to ensure that we understand the basics.

  • High-speed memory: Instructions and data (simple data items or the results of previous instructions) are all stored in memory before being acted on by the processor.

  • Data bus: This connects two or more parts of the processor. The connections between individual components and the data bus are indicated by black arrows over which several bits (usually a data word) can move simultaneously.

  • Arithmetic/logical unit (ALU): This is the brain that performs arithmetic and/or logical operations, e.g., ADD, NEGATE, AND, OR, XOR, and so on.

  • Floating-point coprocessor: This performs similar operations as the ALU but does so on floating-point numbers. On decoding an instruction, the control unit decides whether we are using integers or floating-point numbers.

  • Memory data/address registers: Processors may use special-purpose registers to temporarily hold the address and/or the data of the next expected datum. Some architectures use this philosophy heavily for prefetch and branch prediction technologies.

  • Next instruction pointer (NIP or program counter): This is another special-purpose register. Again, architectures may or may not use this idea heavily. The idea is that with the NIP we know where in high-speed memory the next instruction is physically located.

  • Instruction register: This is the special-purpose register holding the instruction currently being executed.

  • Control unit: This directs and controls the various components of the processor in performing their tasks, e.g., instructing a general-purpose data register to open its output path in preparation for a data item to be transferred to the ALU. The unidirectional arrows that appear to go nowhere lead to other parts of the processor, memory, and other components such as the IO subsystem. They carry control signals to coordinate activities between the processor and other components, e.g., control signals known as interrupts coming from external devices such as the IO subsystem.

  • Clock: This determines the speed of operations. It normally operates at clock frequencies measured in megahertz, and the reciprocal of that frequency is referred to as the clock period. The ideal for a processor is to execute an instruction every clock period; an easy way to remember this is "every tick of the clock." However, this is seldom the case, except with possibly the most primitive of operations, i.e., arithmetic and logic operations on integers. The time to execute an individual instruction is commonly measured in multiples of the clock period, e.g., it is not uncommon for a floating-point instruction to take three ticks to complete. The latency (the amount of time it takes) to perform different types of operations goes some way toward determining the overall performance of a system and helping to demonstrate the efficiencies of the underlying architecture. The clock usually has some form of external repeating pulse generator, such as a quartz crystal.

  • Data registers: I have left these for last to include a brief description of the registers themselves. A register is a bistable device, sometimes known as a "flip-flop": an electronic device made up of elements such as transistors and capacitors, capable of holding one of two states (on = 1, off = 0) in a consistent fashion. A register of n bistables can store a word of length n bits: hence, a 64-bit register can store a 64-bit value. This doesn't mean that it's necessarily a 64-bit integer. Some instructions may be expecting 32-bit integers and, hence, we could achieve a form of parallelism by performing a single data load but in doing so gaining access to two data elements (a rough sketch of this packing idea appears after Figure A-1). This is sometimes referred to as an SIMD (Single Instruction Multiple Data, from Michael Flynn's 1972 Flynn's Classification) architecture. In an ideal world, all our data and instructions would be stored in registers. Because the physical space on the processor is limited, we are usually limited to tens of registers, some special purpose, some general purpose. A simple way to distinguish between the two types is that general-purpose registers can be named explicitly in instructions, while special-purpose registers cannot; they are controlled explicitly by the control unit.

Figure A-1. A basic processor.
graphics/ap01fig01.gif
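
As a rough illustration of that last point about registers, the following C fragment packs two 32-bit values into one 64-bit word and unpacks them again. It is only a sketch of the idea that a single 64-bit load can carry two smaller data elements; it does not show how any real SIMD instruction set works.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t a = 1234, b = 5678;

        /* One 64-bit "register" carrying two 32-bit data elements. */
        uint64_t packed = ((uint64_t)a << 32) | b;

        /* A single 64-bit load of 'packed' gives us access to both values. */
        uint32_t hi = (uint32_t)(packed >> 32);
        uint32_t lo = (uint32_t)(packed & 0xFFFFFFFFu);

        printf("hi = %u, lo = %u\n", (unsigned)hi, (unsigned)lo);   /* prints 1234 and 5678 */
        return 0;
    }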

So let's revisit our basic architecture and flesh out the fetch-execute cycle.

  • The next instruction to be executed is fetched from memory into the special-purpose register designed to hold it. At this point, the NIP is updated to point to the next instruction to be executed.

  • The control unit decodes the instruction and, if necessary, instructs other components on the processor to perform certain tasks. Implicit in this step is the fact that in decoding the instruction we now know whether we need any operands (the data to be operated on) and how many, e.g., an ADD operation will normally require two operands as well as a target register to store the result.

  • The control unit causes the operands to be fetched from memory and stored in the appropriate registers, if they are not already there.

  • The ALU is sent a signal from the control unit to carry out the operation.

  • The result is temporarily stored in the ALU before being written out to the target register.

  • The next instruction is fetched and executed.

  • And the cycle continues.

Even at system boot-up time, we can see how this fetch-execute cycle would operate at a primitive level:

  • A special-purpose processor performs Power On Self Test (POST) operations.

  • The NIP is loaded with the first instruction to be executed to get the operating system up and running, i.e., the starting address of the operating system kernel.

Obviously, this is a simplified description, but once we understand this basic operation, we can then move on to discuss more complicated models and appreciate the technological choices and challenges faced by processor designers.
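
To make the fetch-execute cycle concrete, here is a minimal sketch of a toy stored-program machine in C. The opcodes, the three-word instruction encoding, and the memory layout are all invented for illustration; they do not correspond to any real processor.

    #include <stdio.h>

    enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

    int main(void)
    {
        /* "High-speed memory" holding both instructions and data (the
         * stored-program model). Each instruction is encoded as three
         * words: opcode, register number, memory address/register. */
        int memory[32] = {
            OP_LOAD,  0, 20,   /* fetch memory[20] into register 0     */
            OP_LOAD,  1, 21,   /* fetch memory[21] into register 1     */
            OP_ADD,   0,  1,   /* register 0 = register 0 + register 1 */
            OP_STORE, 0, 22,   /* write register 0 back to memory[22]  */
            OP_HALT,  0,  0,
        };
        memory[20] = 3;        /* data */
        memory[21] = 4;

        int reg[2] = { 0, 0 }; /* general-purpose data registers        */
        int nip = 0;           /* next instruction pointer               */

        for (;;) {
            /* Fetch and decode. */
            int opcode = memory[nip], a = memory[nip + 1], b = memory[nip + 2];
            nip += 3;          /* point at the next instruction          */

            /* Execute and store the result. */
            if (opcode == OP_HALT)  break;
            if (opcode == OP_LOAD)  reg[a] = memory[b];
            if (opcode == OP_ADD)   reg[a] = reg[a] + reg[b];
            if (opcode == OP_STORE) memory[b] = reg[a];
        }

        printf("memory[22] = %d\n", memory[22]);  /* prints 7 */
        return 0;
    }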

A.1.2 More complex architectures

We have spoken of a simple processor exhibiting a simple architecture. The design of our Instruction Set Architecture needs to accommodate all the design goals we laid out in our initial processor design. In reality, we need to think about more complex architectural considerations in order to maximize our processor performance. For example, while the control unit is decoding an instruction, the electronics that make up the ALU are sitting idle. Wouldn't it be helpful if each individual component could be busy all the time, coordinated in a kind of harmonic symphony of calculatory endeavors?

A further motivation for conducting this orchestra of components dates back to the 1940s and 1950s. Two machines developed during and after the Second World War laid down the design of most current machines. The underlying architecture born from these ideas is known as the von Neumann architecture. John von Neumann designed EDVAC (Electronic Discrete Variable Automatic Computer). This was one of the first [1] computers to store instructions and data as essentially a single data stream. This model makes for a simplified and succinct design and has been the cornerstone of most computer architectures since then (other architectures do exist, for example, reduction and dataflow machines; however, they are not used widely in the business arena). The von Neumann architecture can have an impact on overall processor performance because, with data and instruction elements being viewed as similar data, they have to follow essentially the same data path to get to the processor. Being on the same data path means that we are dealing with either an instruction or a data item, but not both simultaneously. The problem of only doing one thing at a time has become known as the von Neumann bottleneck. If we can do only one thing at a time, this has a massive impact on the overall throughput of a processor. One well-trodden path to alleviate the effects of the von Neumann bottleneck is simply to increase the speed of the processor. With more instructions being executed per clock period, we hope to get more work done overall. As mentioned previously, there is only so far we can take that work harder ethic with current materials and fabrication methods. To try to work around the von Neumann bottleneck, we need to also work smarter. This is where processor designers have some interesting design decisions to make.

[1] There is much intellectual pie throwing between the EDVAC and ENIAC camps. John W. Mauchly and J. Presper Eckert, Jr., engineered ENIAC (Electronic Numerical Integrator and Calculator) around the same time as EDVAC. There is a long-running debate among academics as to which machine really came first. To this day, the architecture is still named after John von Neumann.

A.1.3 A bag of tricks

While we could simply work harder and continue to increase the clock speed of our processors, eventually effects such as quantum tunneling will limit any further development in that particular field of study. Not to be outdone, there are some cunning ways in which processor architects can extract more and more performance without resorting to mega-megahertz. Let's look at some of the tricks that processor architects have up their sleeves to alleviate the problems imposed by the von Neumann bottleneck.

A.1.3.1 SUPERSCALAR PROCESSORS

A scalar processor has the ability to start only one instruction per cycle. It follows that superscalar processors have the ability to start more than one instruction per cycle. This allows the different components of the processor to function simultaneously, e.g., the ALU and the control unit can be doing their own things while not interfering with each other. In order to utilize this idea, we need an advanced compiler that can generate a generous mix of instruction types, e.g., floating point, integer, memory, multiply (if you have an independent Multiply Unit), so that the processor can be seen to be scheduling more than one instruction every clock period. Even though it appears to be scheduling more than one instruction per clock period, each component is still limited by the von Neumann bottleneck; each component is doing only one thing. The difference is that collectively the entire processor is getting more work done in a given time period. This introduces a level of parallelism into the instruction stream. Parallelism is a fundamental benefit to any processor. Some advanced processors also have built-in circuitry such as branch prediction and/or instruction reorder buffers to help implement a superscalar architecture. The idea behind this additional circuitry is to allow the processor to operate in a wider set of circumstances where an advanced compiler may not be available. Having both an advanced processor and advanced intelligent compilers can produce phenomenal results.

When migrating a program from one particular architecture to another, you will probably have to recompile your program. If the new architecture offers backward compatibility, you need to ask yourself, "Do I still want to do it the old way?" It is always sound advice to recompile a program from an older scalar architecture when you move it to a superscalar architecture, or even between versions of an existing superscalar architecture. This ensures that the new intelligent compiler generates an optimum mix of instructions to take advantage of the new processor's features. Processors that can schedule n instructions every clock period are said to be n-way superscalar, e.g., 4 instructions per cycle = 4-way superscalar. Several HP processors have used a superscalar architecture in the recent past. Table A-1 shows some of them.

Table A-1. HP Scalar and Superscalar Processors*

PA-RISC version | Models | Characteristics
PA 1.0  | PN5, PN10, PCX | Scalar implementation; one integer or floating-point instruction per cycle; 32-bit instruction path
PA 1.1a | PA70000, PCX/S | Scalar implementation; one integer or floating-point instruction per cycle; 32-bit instruction path
PA 1.1b | PA71000, PA7150, PCX/T | Limited superscalar implementation; one integer and one floating-point instruction per cycle; 64-bit instruction path (two instructions per fetch)
PA 1.1c | PA7100L, PCX/L | Superscalar implementation; two integer or one integer and one floating-point instruction per cycle; 64-bit instruction path
PA 1.2  | PA7200, PCX/T | Limited superscalar implementation; one integer or floating-point instruction per cycle; 64-bit instruction path (two instructions per fetch)
PA 2.0  | PA8000, PA8200, PA8500, PA8600, PA8700, PA8700+ | Two integer and two floating-point and two loads or two stores per cycle; 64-bit extensions; 128-bit data and instruction path

* HP-UX Tuning and Performance, Prentice Hall, 2000.
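
As a back-of-the-envelope illustration of what n-way issue buys (not a model of any of the PA-RISC implementations above), the following sketch estimates the cycles needed to issue an instruction stream on a scalar versus a 4-way superscalar processor, assuming a perfect mix of independent instructions.

    #include <stdio.h>

    /* Cycles needed to issue 'instructions' instructions when the processor
     * can start at most 'issue_width' of them per clock period. */
    static long cycles_needed(long instructions, int issue_width)
    {
        return (instructions + issue_width - 1) / issue_width;  /* round up */
    }

    int main(void)
    {
        long n = 1000000;
        printf("scalar (1-way):    %ld cycles\n", cycles_needed(n, 1));
        printf("4-way superscalar: %ld cycles\n", cycles_needed(n, 4));
        return 0;
    }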


A.1.3.2 PIPELINED PROCESSORS

An easy analogy for processor pipelining is a manufacturing production line. To make a widget, you need to break down the construction process into discrete production stages. Ideally, each production stage is of a similar length so that workers in subsequent stages are not left hanging around waiting for a previous stage to finish. If everyone is kept busy with his or her particular part of the process, the overall number of widgets produced will increase. In a similar way, we can view the pipeline of instructions within a processor. If a large, complex procedure can be broken down into easily definable stages and executing each stage takes the same amount of time, we can interleave each stage in such a way that individual parts of the processor circuitry are performing their own stage of the instruction. Let's break an instruction down into four stages.

  1. Instruction Fetch (IF): Fetch the instruction.

  2. Instruction Decode (ID): Decode the instruction; this may include fetching data elements referenced in the instruction.

  3. Execute (EX): Execute the instruction.

  4. Write Back (WB): Write back the result.

If we assume that each stage takes one unit of time to execute, then this instruction would take 4 units of time to complete. Breaking down the instruction into distinct stages means that we can interleave the commencement of the next instruction before the previous instruction completes. Figure A-2 shows the effect of pipelining.

Figure A-2. The effects of a pipelined architecture.

graphics/ap01fig02.gif


In a non-pipelined architecture, executing the instruction stream shown in Figure A-2 would have taken 6 x 4 = 24 units of time. With pipelining in place, we have reduced the overall time needed to execute the instruction stream to an amazing 9 units of time. It is apparent from Figure A-2 that adding an additional instruction adds only 1 additional unit of time to the overall execution time. With a small instruction stream like the one in Figure A-2, we have achieved a speedup of 24/9, i.e., roughly 2.7 times! With a large number of instructions, we actually approach the situation where

speedup = (n x k) / (k + (n - 1)), which approaches k as n grows large

where n is the number of instructions and k is the number of pipeline stages (here, 4).

This ideal maximum where the speedup equates to the number of stages in the pipeline is again a nirvana we seldom see. It does rely on the factors that we mentioned previously:

  • Every instruction is broken down to exactly the same number of stages.

  • All stages take exactly 1 unit of time to execute.

  • All stages can operate in parallel.

  • The instruction stream is always full of useful instructions.

This ideal scenario seldom exists with a real-world instruction stream. For this to be the case, every program ever executed would have to be entirely linear, i.e., no branches, loops, or breaks in execution. Another factor that can work against the maximum possible speedup is the fact that pipelining an instruction into discrete stages requires additional logic within the processor itself. This takes the form of additional circuitry, in the form of logic gates, which, in turn, are made up of individual electronic components. With the real estate on a processor being so expensive (remember, the density of transistors, data paths, and so on, on a processor increases the cost of manufacture and increases the probability of errors, which push the price of processors ever higher), the number of discrete stages in the pipeline is another design decision that the processor architect needs to make.
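
To put numbers on the speedup formula above, the short sketch below evaluates it for a four-stage pipeline. The assumptions are the ideal ones just listed: equal-length stages and a stream that is always full of useful instructions.

    #include <stdio.h>

    int main(void)
    {
        int k = 4;                                 /* pipeline stages */
        for (int n = 6; n <= 1000000; n *= 10) {
            double unpiped = (double)n * k;        /* n x k units of time */
            double piped   = k + (n - 1.0);        /* k + (n - 1) units   */
            printf("n = %7d  speedup = %.2f\n", n, unpiped / piped);
        }
        return 0;                                  /* speedup tends toward k = 4 */
    }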

As mentioned above, an aspect of pipelines that can prevent achieving maximum speedup is the nature of a real-life instruction stream. Programs commonly loop and branch based on well-known programming constructs, e.g., if-then-else, while-do, and so on. This poses problems for the compiler as well as the processor. Loops and branches interrupt the sequential flow of a program and require a jump to some other memory location. Knowing when this is going to happen and, more difficult, where we are going to jump to is not the easiest thing for the architecture to predict. However, many optimizing compilers as well as specially designed circuitry on the processors themselves go a long way toward circumventing problems known as pipeline hazards. We see essentially three types of pipeline hazards:

  1. Control hazards: Some hazards are due to instructions and/or branches changing the sequential flow of the program. When we have such a situation, e.g., an if-then-else construct, we may find that we need to branch to a previous location to restart the pipeline. This would require flushing the pipeline and starting again. During this flushing, the processor is not executing useful instructions. One technique used to get around this is called a delayed branch, where the compiler predicts a branch (it can see it coming in the instruction stream, and it knows a branch is going to slow things down) and performs it earlier than expected because it knows on average how long the branch will take. In the time it takes the processor to jump to the new memory location, the processor can be executing unrelated but otherwise useful instructions. The ability of the processor to reorganize the instruction stream on the fly is sometimes referred to as out-of-order execution. This capability can be hardwired into processors themselves, at a cost. Should the reorganization introduce other hazards, the processor needs to be able to flush the completion queue in order to back out the previously executed instructions. Such additional circuitry reduces the available real estate on the processor itself. The idea of predictive intelligence can also be built into optimizing compilers. We can also assist the processor in predicting what happens next by running applications many times with typical input streams. In this way, we build a profile of a typical instruction stream that the compiler can use to further optimize the application; this is known as profile-based optimization.

  2. Data hazards: It is not uncommon for one instruction to require the result from a previous instruction, as in these two equations:

    x + y = z

    z + a = b

    The second equation is dependent on the result from the first. The processor would need to ensure that the value of z has been written back (that the WB, or Write Back, stage has completed) before calculating the second equation. A number of methods can be used to alleviate these problems. Similar to the way in which branch delay slots work, a designer may use load delay slots to alleviate data hazards. As an example, take the instruction LOAD r2, locationA, where we are loading register r2 from memory location locationA. A LOAD instruction will take an amount of time for the memory bus to locate the datum in memory and transfer the value to the register; possibly several clock periods. While this happens, the register r2 may have a previous value still stored in it. If the next instruction to be executed also references r2, for example SUB r2, r1, r2, the instruction stream has inadvertently introduced a data hazard; while the LOAD is completing, we have an instruction that is in effect using the old value of r2. This particular hazard is known as a read-after-write (RAW) hazard and can cause the pipeline to stall while the instruction stream is rolled back to the point where the LOAD instruction was executed (a small sketch of spotting such a hazard appears after this list). We sometimes see write-after-write (WAW) and write-after-read (WAR) hazards that are caused by similar issues in the instruction stream. One solution to get around this type of problem would be to use a load delay slot. As the name suggests, whenever a LOAD is executed, a delay is inserted (usually a NOOP instruction) into the instruction stream. This is designed to allow the LOAD to complete before the next instruction starts executing. This is not very practical. The architect would have to insert enough load delay slots to accommodate the worst-case scenario, when a LOAD takes an inordinate amount of time to complete. An alternative solution would be to stall the processor: Immediately following a LOAD, the processor is effectively frozen until the LOAD completes. This has been seen to be a better solution than inserting lots of NOOPs.

  3. Structural hazards: Structural hazards can be caused by limitations in the underlying architecture itself, i.e., the architecture is not able to support all permutations of instructions. Remember that we are probably dealing with a von Neumann machine where data and instructions are treated the same. Attempting to LOAD a data element and an instruction simultaneously runs straight into the von Neumann bottleneck because both are treated as equals. Both data and instructions are stored in main memory as a stream of data points; they are both effectively seen as data. As a result, both data and instructions share the same data path to arrive at the processor. Designers can alleviate this by utilizing two separate memory buses for data and instruction elements. In these situations, designers may also employ separate data and instruction caches (more on cache memory later).
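
The following fragment is a toy illustration of the read-after-write hazard described in item 2 above. The instruction encoding is invented purely for the example; the check simply asks whether the second instruction reads a register that the first one writes.

    #include <stdio.h>

    /* A toy instruction: one destination register and two source registers.
     * Register numbers and the instructions themselves are invented for
     * illustration; this is not PA-RISC assembly. */
    struct instr {
        const char *text;
        int dest;
        int src1, src2;      /* -1 means "no source register" */
    };

    /* Returns 1 if 'second' reads a register that 'first' has not yet
     * finished writing - a read-after-write (RAW) hazard. */
    static int raw_hazard(const struct instr *first, const struct instr *second)
    {
        return second->src1 == first->dest || second->src2 == first->dest;
    }

    int main(void)
    {
        struct instr load = { "LOAD r2, locationA", 2, -1, -1 };
        struct instr sub  = { "SUB  r2, r1, r2",    2,  1,  2 };

        if (raw_hazard(&load, &sub))
            printf("hazard: '%s' reads r%d before '%s' has written it\n",
                   sub.text, load.dest, load.text);
        return 0;
    }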

A word of warning: Some people think that pipelining and superscalar are essentially the same, if not identical, then quite similar. They may appear similar, but they are definitely not the same. Superscalar, simply put, is the ability to start more than one instruction simultaneously. Many architectures employ superscalar processors. Not all architectures can easily employ pipelining. As we have seen, to employ pipelining to its maximum benefit requires that all instructions be decomposed into the same number of stages, that all stages take the same amount of time to execute, and that all stages can operate in parallel. This is not necessarily achievable in every situation. What we are aiming to achieve with pipelining is an improvement in the overall cycle time for the execution stream. With superscalar architectures, we are hoping to achieve the same improvements in instruction throughput. We may have a four-way superscalar processor, but if each instruction is not easily decomposed into discrete individual stages, then each instruction still takes a significant length of time to complete. Statistically, we should achieve more throughput simply because more instructions are being executed simultaneously as a result of the superscalar architecture. If we can achieve both superscalar execution and pipelining, we can see even more significant improvements in performance. Some newer architectures even claim to be super-pipelining, whereby individual instructions are decomposed even further to try to ensure that as many individual circuits in the processor as possible are working simultaneously.

A.1.3.3 INSTRUCTION SIZE: "HOW BIG IS YOURS?"

The size of objects on a processor is an important feature in design. Accommodating 64-bit instructions requires significantly more processor hardware than a 32-bit instruction simply because we need more bits to store a 64-bit instruction than a 32-bit instruction. Remember, instructions will be represented by binary digits: 1s and 0s. Having a larger instruction does give you more flexibility as far as instruction format, the number of different instructions in your instruction set, as well as how many operands an instruction can operate on. Before we go any further, let's remind ourselves again about some numbers:

  • 32-bit = 2^32

    A 32-bit instruction gives us 4,294,967,296 permutations of instruction format and number of instructions. For most architectures, an instruction size of 32 bits yields more than enough instructions. However, limiting an architecture to 32 bits also limits the size of data elements to 32 bits. This means that the largest address we can use is 2^32 = 4GB. An address space of 2^32, i.e., 4GB, limits the size of individual operating system processes/threads. This is a major problem facing applications today. Many applications require access to a larger address space. This requires the underlying hardware to support larger objects. Today's processors support 64-bit objects.

  • 64-bit = 2^64

    A 64-bit instruction set gives us quite a few more permutations. In fact, the number of permutations goes up to 18,446,744,073,709,551,616, which equates to 16 exabytes (EB). Most 64-bit operating systems do not require the full 16EB of address space available to them. We now have machines that have more physical memory than can be accommodated by a 2^32 (4GB) operating system. In order to utilize more than 4GB of RAM, the operating system must be able to support larger objects. In turn, for the operating system to support larger objects, the underlying architecture must support these larger objects as well. Although many architectures support 64-bit objects, it is not uncommon for instructions to be 32 bits in size. Remember, a 32-bit instruction still gives the instruction set designers 2^32 different instructions. We can also achieve additional parallelism with 32-bit instructions in a 64-bit architecture; we can fetch 2 instructions per LOAD.

  • 8-bit = 2^8. Commonly known as a byte.

    Almost every machine now uses an 8-bit byte. One reason for using 8 bits could rest with the notion of what we are trying to represent inside the computer: characters. Take the letter A. By convention, this is represented by the decimal number 65. So our letter A in binary looks like this:

    2^7   2^6   2^5   2^4   2^3   2^2   2^1   2^0
    128    64    32    16     8     4     2     1
      0     1     0     0     0     0     0     1

    This comes from the ASCII (American Standard Code for Information Interchange) character-encoding scheme developed in 1963. ASCII is actually a 7-bit code. If we were to use all 8 bits, we could represent the natural numbers 0 to 255, and the character encoding could then represent 256 letters and symbols. But experience has shown that we don't need that diversity of symbols: 128 codes, i.e., 7 bits, are enough for the Western character set. (The short C fragment below prints this encoding for the letter A.)
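
Here is that encoding demonstrated in a few lines of C, printing the decimal code and the 8-bit pattern for the letter A.

    #include <stdio.h>

    int main(void)
    {
        unsigned char c = 'A';

        printf("'%c' = %d = ", c, c);            /* 'A' = 65 */
        for (int bit = 7; bit >= 0; bit--)       /* most significant bit first */
            printf("%d", (c >> bit) & 1);
        printf("\n");                            /* prints 01000001 */
        return 0;
    }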

The additional eighth bit could be used to represent negative numbers: this is known as the sign-bit, with 1 representing a negative value. In reality, computers don't use this sign-and-magnitude representation of negative numbers. A simple explanation as to why the eighth bit is not used as the sign-bit is to take this example:

(-71) + (-6) = -77

If we were to use our sign-and-magnitude representation above, our calculations would come unstuck. One thing to remember with binary arithmetic is that 1 + 1 = 0 carry 1. Let's perform our simple calculation using sign-and-magnitude binary representation:

  1. Represent -71 in sign-and-magnitude binary, with the most significant bit (leftmost in this case) representing the sign:

    Sign-bit   2^6   2^5   2^4   2^3   2^2   2^1   2^0
               64    32    16     8     4     2     1
       1        1     0     0     0     1     1     1

  2. Do the same for -6:

    Sign-bit   2^6   2^5   2^4   2^3   2^2   2^1   2^0
               64    32    16     8     4     2     1
       1        0     0     0     0     1     1     0

  3. Now add them together:

    Sign-bit   2^6   2^5   2^4   2^3   2^2   2^1   2^0
               64    32    16     8     4     2     1
       1        1     0     0     0     1     1     1      (-71)
       1        0     0     0     0     1     1     0      (-6)
       0        1     0     0     1     1     0     1      (result = +77, with a carry out of the sign bit)

The answer comes out as +77. Although this may seem strange to our logical minds, the rules of addition dictate that in binary 1 + 1 = 0 carry 1. The carry 1 is carried to the next position. Somehow, we would need to inform the processor not to process the sign bit; this is what we do mentally, but it would not be straightforward to program a processor to do the same. In this case, we would need a carry bit just in case we meet the situation we see above. The alternative is to use a number representation known as two's complement. This is how computers store signed integers. The name two's complement doesn't give any hint as to how this works, so I'll try my best to explain.

As we saw in the binary numbers above, each bit in a binary number has what we call a weighting: bit 0 has a weighting of 2^0 (1), while bit 7 has a weighting of 2^7 (128). We can use this to convert a decimal number into binary. First, we take the largest power of 2 that is less than or equal to the original value. We set this bit to 1. We subtract the decimal value of that power of 2 from our original value. The remainder becomes our new value. We then subtract successive lower powers of 2 in a similar fashion until we reach 0. Look at decimal 77 as an example:

Weighting   Value   Decimal weight   Bit   Remainder
2^8                 256              0
2^7                 128              0
2^6         77       64              1     13
2^5                   32             0
2^4                   16             0
2^3         13         8             1      5
2^2          5         4             1      1
2^1                    2             0
2^0          1         1             1      0

Reading the Bit column from the top:

77 (decimal) = 1001101 (binary)
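
The repeated-subtraction procedure shown in the table is easy to mechanize. The following C sketch converts 77 (or any value from 0 to 255) to its 8-bit pattern using exactly that method.

    #include <stdio.h>

    int main(void)
    {
        int value = 77;

        printf("%d = ", value);
        /* Work down through the weightings 128, 64, 32, ..., 1.  If the
         * weighting fits, set the bit and subtract it from the remainder. */
        for (int weight = 128; weight >= 1; weight /= 2) {
            if (value >= weight) {
                printf("1");
                value -= weight;
            } else {
                printf("0");
            }
        }
        printf("\n");    /* prints 01001101 */
        return 0;
    }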


This is one way we convert decimal numbers into binary. In two's complement, bit 7 has a weighting of -2^7 (-128). We then construct our numbers using these new weightings. Let's take our original example of -71 and represent it using two's complement:

-2^7   2^6   2^5   2^4   2^3   2^2   2^1   2^0
-128    64    32    16     8     4     2     1
  1      0     1     1     1     0     0     1


When we consider the weightings, -2^7 + 2^5 + 2^4 + 2^3 + 2^0 = -128 + 32 + 16 + 8 + 1 = -71. Similarly, -6 becomes:

-2^7   2^6   2^5   2^4   2^3   2^2   2^1   2^0
-128    64    32    16     8     4     2     1
  1      1     1     1     1     0     1     0


-2^7 + 2^6 + 2^5 + 2^4 + 2^3 + 2^1 = -128 + 64 + 32 + 16 + 8 + 2 = -6

We can now perform our original calculation using the two's complement representation:

-2^7   2^6   2^5   2^4   2^3   2^2   2^1   2^0
-128    64    32    16     8     4     2     1
  1      0     1     1     1     0     0     1      (-71)
  1      1     1     1     1     0     1     0      (-6)
  1      0     1     1     0     0     1     1      (result, with the carry out of bit 7 ignored)


When we look at the weightings, we see -2^7 + 2^5 + 2^4 + 2^1 + 2^0 = -128 + 32 + 16 + 2 + 1 = -77.

The clever part of two's complement arithmetic is that any carry out of the most significant bit can simply be ignored.
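
To see the whole argument in action, the sketch below performs the same -71 + -6 calculation using C's signed 8-bit type, which holds values in two's complement, and prints the resulting bit patterns.

    #include <stdio.h>
    #include <stdint.h>

    static void print_bits(uint8_t byte)
    {
        for (int bit = 7; bit >= 0; bit--)
            printf("%d", (byte >> bit) & 1);
    }

    int main(void)
    {
        int8_t a = -71, b = -6;
        int8_t sum = (int8_t)(a + b);   /* any carry out of bit 7 is simply lost */

        printf("-71 = "); print_bits((uint8_t)a);   putchar('\n');  /* 10111001 */
        printf(" -6 = "); print_bits((uint8_t)b);   putchar('\n');  /* 11111010 */
        printf("sum = "); print_bits((uint8_t)sum);                 /* 10110011 */
        printf(" = %d\n", sum);                                     /* -77      */
        return 0;
    }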

  • A word:

    A word does not have a single, fixed size. It can take many forms, but its size is usually the natural processing unit used by the processor, e.g., a 64-bit processor will work with a 64-bit word. A word is usually big enough to store either a single instruction or an integer. It is most often an integer multiple of bytes.

  • Some other numbers to consider:

    1 kilobyte (KB) = 1024 bytes

    1 megabyte (MB) = 1024 KB

    1 gigabyte (GB) = 1024 MB

    1 terabyte (TB) = 1024 GB

    1 petabyte (PB) = 1024 TB

    1 exabyte (EB) = 1024 PB

    1 zettabyte (ZB) = 1024 EB

    1 yottabyte (YB) = 1024 ZB

Our original question about size was related to the size of an instruction itself. Do we really need 2^64 possible permutations for the format of a single instruction? Probably not. There is more than enough flexibility in a 32-bit instruction. Consequently, a designer could elect to have variable-size instructions. The benefit would be that for simple, smaller instructions the processor only has to fetch a few bits, speeding up the fetch cycle. However, somewhere in the logic of the processor, it would need to know how big the next instruction was. This would be some other architectural feature that would need to be built into the processor hardware and logic circuitry: again, that's a design tradeoff. Registers that can accommodate 64-bit values can be seen as a bonus because data elements stored in the register can represent big numbers. A register that can accommodate a 64-bit value also means that the value stored in a register could actually be the address of a memory location. This in turn means that we can have 2^64 bytes' worth of physical memory in our machine. Having a data path (a microscopic wire or connector inside the processor) that can transfer 64 or even 128 bits simultaneously is a desirable feature because you can move lots of data around more quickly. As we can see, design tradeoffs have to be made in many aspects of the processor design.

A.1.3.4 ADDRESSING MODES

Before an instruction can actually do something, it needs to locate any data elements (called operands) on which it is supposed to be working. The instruction will use an address to locate its operands. The complicating factor is that an address may not be a real memory location, but rather a relative address, relative to the current memory location. This interpretation of an address is coded into the instruction and is known as the addressing mode. Architects can choose to use different addressing modes. The type and number of addressing modes used will have an impact on the actual design of the instructions themselves. Here are some of the more common addressing modes used (a rough C analogy follows the list):

  • Immediate addressing: This is where the data element itself follows immediately after the instruction, e.g., ADDIMMEDIATE 5. The operand 5 is located immediately after the instruction. This means "add 5 to whatever is currently stored in the ALU/register."

  • Direct addressing: This is where the operand following the instruction is not the actual data itself but the memory location of where to find the real data, e.g., ADDDIRECT 0x00ef2311 means "add the contents of the ALU/register to the data you find at address 0x00ef2311."

  • Indirect addressing: This is similar to direct addressing except that the address we pass to the instruction is not the real address of the data, but simply a reference to it; in other words, it behaves like a forwarding point, e.g., LOADINDIRECT 0x000ca330 means "go to address 0x000ca330 and there you will find the real address of the data to LOAD." This can be useful in programming when we don't currently know where in memory a datum will actually be located, but we can use a reference to it in this indirect fashion. The concept known as pointers utilizes indirect addressing.

  • Indexed addressing: When we are dealing with collections of data elements, possibly in an array, it can be useful if we can quickly reference a specific element in that array. Our instruction will be passed the starting address of the collection of data elements and will use the content of what we will call an index register to reference the actual data element in question. The index register may be a special-purpose register or simply a general register. The instruction will have been designed by the architect to know how the indexing works.

  • Inherent addressing: This occurs where an instruction does not use an address, or the address is implicit in the instruction itself, e.g., STOP or CLEAR.
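
The addressing modes above have loose analogies in a high-level language. The fragment below is only an analogy (the compiler, not the programmer, chooses the real addressing modes), but it may help fix the ideas of immediate values, directly addressed variables, pointers, and indexed arrays.

    #include <stdio.h>

    int main(void)
    {
        int data = 10;
        int *ref = &data;                 /* a reference holding an address       */
        int table[4] = { 5, 10, 15, 20 };
        int index = 2;

        int a = data + 5;                 /* immediate: the constant 5 is part of
                                             the instruction stream               */
        int b = data;                     /* direct: fetch from a known location  */
        int c = *ref;                     /* indirect: follow an address to the
                                             real data (a pointer)                */
        int d = table[index];             /* indexed: base address plus the
                                             contents of an index register        */

        printf("%d %d %d %d\n", a, b, c, d);   /* prints 15 10 10 15 */
        return 0;
    }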

We have looked at a number of tricks that the processor architect has to choose from when designing a processor. Next, we look at three of the most prevalent families of processors in the marketplace today.


