Engineers have constantly searched for ways to improve processor performance. One way to remove performance bottlenecks is to make the processor do more in each clock cycle. This has been accomplished through pipelined processors, superscalar processors, hyper-pipelined processors, dynamic execution, and the Explicitly Parallel Instruction Computing (EPIC) model.

3.2.1 Pipelined Processors

In April 1989, Intel introduced the i486 processor, its first pipelined x86 processor. A pipelined processor does not wait for one instruction to be completed before it begins another. Instead, when the prefetch unit completes the first stage and sends an instruction to the decode unit, it immediately starts working on the next instruction. As soon as the decode unit finishes one instruction, it immediately begins working on the next, and so on. This is illustrated in Figure 3-3.

Figure 3-3. How a pipelined processor works.

3.2.2 Superscalar Processors

The Pentium processor, introduced in 1993, added another execution unit to the processor. This innovation also allowed the Pentium to have two pipelines. When both pipelines were operating simultaneously, the processor could execute two instructions during one clock cycle. This is also known as having "two instructions in flight," and is illustrated in Figure 3-4. A processor that can execute more than one instruction per clock cycle is a superscalar processor. The Pentium II and Pentium III processors are superscalar processors, each having five execution units.

Figure 3-4. "Two instructions in flight" with a superscalar processor.

3.2.3 Hyper-Pipelined Processors

Adding execution units is one way to increase the number of instructions a processor handles in one clock cycle. Expanding the number of steps, or stages, in the pipeline is another. The Pentium processor has a 5-stage pipeline, the Pentium Pro processor has a 12-stage pipeline, and the Pentium 4 processor, introduced in 2000, has a 20-stage pipeline, as shown in Figure 3-5.

Figure 3-5. 20-stage pipeline of the Pentium 4 processor.

Increasing the number of stages in a pipeline might at first seem illogical: an instruction that previously took only 5 cycles to complete now might take as many as 20 cycles. By breaking the work into smaller steps, however, the processor has to do less work in each step, so the system clock speed can be increased significantly and more instructions can be completed in the same amount of time.

Example

A hyper-pipelined processor with a 10-stage pipeline can complete 11 instructions in 20 cycles: the first instruction finishes after 10 cycles, and one more finishes in each of the following 10 cycles. Without the pipeline, the processor could complete only two such instructions in 20 cycles, making the hyper-pipelined processor 5.5 times faster.

3.2.4 Dynamic Execution

In 1995, Intel introduced the Pentium Pro, the first processor based on the P6 architecture. Part of the P6 architecture is a set of technologies known collectively as dynamic execution. Dynamic execution consists of out-of-order execution, branch prediction, and speculative execution. These technologies are designed to help overcome pipeline hazards, performance problems in hyper-pipelined processors that can be caused by pipelining itself in some situations. A pipeline hazard blocks the pipeline until the hazard is removed; this is called a pipeline stall. The deeper the pipeline, the more costly (in performance terms) a pipeline stall can be. A 20-stage pipeline might have to wait as many as 19 cycles to process the correct branch, whereas a 10-stage pipeline would have to wait only 9 cycles in the same type of situation. This concept is illustrated in Figure 3-6.

Figure 3-6. An example of a pipeline stall.
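The arithmetic behind both numbers, the 11-instructions-in-20-cycles example and the cost of refilling a deep pipeline, can be checked with a minimal sketch. It assumes an idealized pipeline that starts one instruction per cycle and never stalls; the function and its name are invented for this illustration.

    #include <stdio.h>

    /* Idealized pipeline model: the first instruction completes after
       `stages` cycles, then one more instruction completes every cycle.
       Hazards, superscalar issue, and memory latency are ignored. */
    unsigned instructions_completed(unsigned cycles, unsigned stages) {
        return (cycles >= stages) ? (cycles - stages + 1) : 0;
    }

    int main(void) {
        /* 10-stage pipeline over 20 cycles: 11 instructions, as in the example above. */
        printf("%u\n", instructions_completed(20, 10));
        /* 20-stage pipeline right after a flush: only 1 instruction completes in
           20 cycles, i.e., up to 19 cycles are lost refilling the pipeline. */
        printf("%u\n", instructions_completed(20, 20));
        return 0;
    }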
Two common types of pipeline hazards are dependence problems and branching problems. Dependence problems are handled by a processor function known as out-of-order execution. Branching problems are handled by a processor function known as branch prediction. A final processor function that helps resolve pipeline hazards is known as speculative execution. Each of these functions is explained in the following subsections.

3.2.4.1 OUT-OF-ORDER EXECUTION

Dependence problems occur when one instruction needs data from another instruction to complete its operation. If the first instruction is not complete before the second instruction needs the data, the pipeline must wait until the first instruction is complete. To handle dependence problems, processors perform out-of-order execution. Instead of holding up the pipeline while it waits for the first instruction to finish, the processor can process other instructions that do not depend on the first instruction. The results of these instructions are stored until they are needed later. When the first instruction has finished processing, the dependent instruction is introduced into the pipeline. The result is that fewer cycles are wasted.

Example

Here is an example of out-of-order execution. Suppose a processor needs to execute the following three instructions:
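The specific operations below are invented for this sketch and written as C statements; what matters is only the dependence pattern between the three instructions.

    /* Hypothetical instruction sequence; variable names are invented for this sketch. */
    int example(int x, int y, int p, int q) {
        int a = x + y;   /* Instruction 1: computes a                               */
        int b = a * 2;   /* Instruction 2: needs a, so it depends on Instruction 1  */
        int c = p - q;   /* Instruction 3: uses neither a nor b; independent        */
        return b + c;
    }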
Notice that Instruction 2 depends on Instruction 1. A processor that can perform out-of-order execution could introduce Instruction 3 (which has no dependencies on previous instructions) into the pipeline before Instruction 2. When Instruction 1 is complete (and Instruction 3 is already far down the pipeline), Instruction 2 is introduced into the pipeline, as illustrated in Figure 3-7.

Figure 3-7. An example of out-of-order execution.

3.2.4.2 BRANCH PREDICTION

Branching instructions check a condition and then execute one branch of instructions if the condition is true, or another branch of instructions if the condition is false. The branching instruction might take up to 20 cycles to complete, but the prefetch unit keeps filling the pipeline, so it could send instructions from the wrong branch into the pipeline before the branching instruction is complete. When this happens, all the instructions in the pipeline are stopped until the instructions from the wrong branch are flushed from the pipeline, as illustrated in Figure 3-8.

Figure 3-8. An example of a branching problem.

Branch prediction helps mitigate branching problems. With branch prediction, the first time a branch instruction is executed, its address and that of the next instruction executed (the correct branch) are stored in the branch target buffer (BTB). The processor uses this information to predict which way the instruction will branch the next time the processor executes it. Branch prediction is correct more than 90% of the time. When the branch instruction is executed again, the processor prefetches the branch that was correct the last time, so the wait is minimized and incorrect guesses need to be flushed less frequently. When branch prediction is correct, executing a branch does not cause a pipeline stall.
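The BTB behavior described above can be sketched roughly as follows. The table size, the direct-mapped indexing, and the simple policy of predicting the same target as last time are assumptions made for this illustration only; real branch predictors are considerably more sophisticated.

    #include <stdint.h>

    #define BTB_ENTRIES 256

    /* One BTB entry: the branch's address and the address that followed it last time. */
    typedef struct {
        uint64_t branch_addr;
        uint64_t last_target;
        int      valid;
    } btb_entry;

    static btb_entry btb[BTB_ENTRIES];

    /* Predict the next fetch address for a branch at branch_addr.
       With no history, fall through to the next sequential instruction. */
    uint64_t predict(uint64_t branch_addr, uint64_t fallthrough_addr) {
        btb_entry *e = &btb[branch_addr % BTB_ENTRIES];
        if (e->valid && e->branch_addr == branch_addr)
            return e->last_target;    /* guess the same path as last time */
        return fallthrough_addr;
    }

    /* After the branch resolves, remember where it actually went. */
    void update(uint64_t branch_addr, uint64_t actual_target) {
        btb_entry *e = &btb[branch_addr % BTB_ENTRIES];
        e->branch_addr = branch_addr;
        e->last_target = actual_target;
        e->valid      = 1;
    }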
3.2.4.3 SPECULATIVE EXECUTION

Speculative execution is the third component of dynamic execution. It is a blend of branch prediction and out-of-order execution. A processor that can perform speculative execution predicts the path that a branching instruction will take and begins executing nonconditional instructions along that path. (This is different from branch prediction alone, which only prefetches the instructions but does not execute them.) Instead of writing the instruction results to the general-purpose registers as it normally would, the processor stores these speculative results in temporary registers. When the branching instruction is resolved, the precalculated results can be taken from the temporary registers and written to the general registers in the order they were issued.

3.2.5 EPIC

In May 2001, Intel introduced the Itanium processor, which is based on the Explicitly Parallel Instruction Computing (EPIC) model. In the EPIC model, the compiler bundles instructions together in groups of three. Each bundle includes a few bits, called a template, that tell the execution units whether the instructions are dependent or independent. Intel calls this explicitly parallel because the compiler tells the execution units which instructions can be processed in parallel. The Itanium processor does not have to determine the processing order the way a processor with out-of-order capabilities must.

The EPIC architecture also handles branching in a new way. Each instruction has a 1-bit predicate. The predicates for the instructions in one branch are set to true; the predicates for the instructions in the other branch are set to false. A branching instruction is sent to the processor, followed quickly by both branches. The processor determines whether the branching condition is true or false. If it is true, only the instructions with a true predicate are processed; if it is false, only the instructions with a false predicate are processed. There is no need to flush the pipeline for mispredicted branches.
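To make the predication idea concrete, here is a minimal C sketch of the kind of transformation a predicating compiler performs. This is only an analogy in C, not Itanium code: both paths become straight-line instructions guarded by a predicate, and only the result whose predicate matches the condition is kept.

    /* Conventional branching code: the hardware must predict this branch
       and flush the pipeline if the prediction is wrong. */
    int abs_branching(int x) {
        if (x < 0)
            return -x;
        else
            return x;
    }

    /* Predicated version: both paths are computed as straight-line work,
       and the predicate selects which result survives, so there is no
       branch to mispredict and no pipeline flush. */
    int abs_predicated(int x) {
        int p   = (x < 0);    /* the 1-bit predicate                        */
        int neg = -x;         /* instruction from the "true" branch         */
        int pos =  x;         /* instruction from the "false" branch        */
        return p ? neg : pos; /* keep only the result whose predicate holds */
    }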