The superscalar processor provides a single control flow with instructions that operate on scalar operands and are executed in parallel. Figure 2.2 depicts this architecture. In general, the superscalar processor has several instruction execution units (IEUs) that execute instructions in parallel. Except for a small number of special instructions for data transfer between main memory and registers, the instructions operate on scalar operands held in scalar registers.
Figure 2.2: Superscalar processor architecture.
Two successive instructions can be executed in parallel by two different IEUs if they do not have conflicting operands, that is, if they do not write to the same register and neither instruction uses a register to which the other writes. In any case, the parallel execution should not change the functional semantics of a serial execution of the successive instructions.
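The operand-conflict condition above can be sketched as a small check. This is only an illustration of the rule as stated in the text, not a model of any real processor's issue logic; the `Instr` type, its fields, and the register names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical register-to-register instruction: dest <- op(srcs)."""
    dest: str
    srcs: Tuple[str, ...]


def conflicts(first: Instr, second: Instr) -> bool:
    """Return True if two successive instructions have conflicting operands."""
    # Both instructions write to the same register.
    same_dest = first.dest == second.dest
    # The second instruction uses a register to which the first writes.
    second_uses_first = first.dest in second.srcs
    # The first instruction uses a register to which the second writes.
    first_uses_second = second.dest in first.srcs
    return same_dest or second_uses_first or first_uses_second


add = Instr(dest="r3", srcs=("r1", "r2"))
mul = Instr(dest="r6", srcs=("r4", "r5"))
sub = Instr(dest="r7", srcs=("r3", "r1"))

print(conflicts(add, mul))  # independent operands
print(conflicts(add, sub))  # sub uses r3, which add writes
```

Under this rule, `add` and `mul` could be sent to two different IEUs at the same time, while `add` and `sub` could not.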
One can imagine a dispatcher unit directing instructions to the relevant IEUs based on their type. Each IEU is characterized by the set of instructions that it executes. Thus the entire instruction set is divided into a number of nonintersecting subsets, each associated with one IEU.
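The dispatcher idea can be sketched as a lookup table built from the partition of the instruction set. The IEU names and opcodes below are hypothetical and chosen only for illustration; the check in `build_dispatch_table` enforces the requirement that the subsets do not intersect.

```python
# Hypothetical partition of an instruction set into nonintersecting subsets,
# one subset per IEU.
IEU_INSTRUCTIONS = {
    "integer_alu": {"add", "sub", "and", "or"},
    "multiplier": {"mul", "div"},
    "load_store": {"load", "store"},
}


def build_dispatch_table(ieu_instructions):
    """Map every opcode to the single IEU that executes it."""
    table = {}
    for ieu, opcodes in ieu_instructions.items():
        for opcode in opcodes:
            if opcode in table:
                # The subsets must not intersect.
                raise ValueError(
                    f"opcode {opcode!r} assigned to both {table[opcode]!r} and {ieu!r}"
                )
            table[opcode] = ieu
    return table


def dispatch(table, opcode):
    """Return the name of the IEU to which an instruction of this type is directed."""
    return table[opcode]


table = build_dispatch_table(IEU_INSTRUCTIONS)
print(dispatch(table, "mul"))
```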
Additionally, each IEU can be a pipelined unit; that is, it can simultaneously execute several successive instructions, each at its own stage of execution. This parallel execution also should not change the functional semantics of a serial execution of the successive instructions. Consider the work of a pipelined unit executing a number of successive independent instructions. Let the pipeline of the unit consist of $m$ stages. Let $n$ successive instructions of the program, $I_1, \dots, I_n$, be performed by the unit. Instruction $I_k$ takes its operands from registers $a_k$, $b_k$ and puts the result in register $c_k$ ($k = 1, \dots, n$). Let no two instructions have conflicting operands. Then the work of the unit can be summarized as follows:
At the first step, the unit performs stage 1 of instruction $I_1$.
At the $i$th step ($i = 2, \dots, m - 1$), the unit performs in parallel stage 1 of instruction $I_i$, stage 2 of instruction $I_{i-1}$, etc., and stage $i$ of instruction $I_1$.
At the $m$th step, the unit performs in parallel stage 1 of instruction $I_m$, stage 2 of instruction $I_{m-1}$, etc., as well as the final stage $m$ of instruction $I_1$; after completion of this step, register $c_1$ contains the result of instruction $I_1$.
At the $(m + j)$th step ($j = 1, \dots, n - m$), the unit performs in parallel stage 1 of instruction $I_{m+j}$, stage 2 of instruction $I_{m+j-1}$, etc., as well as the final stage $m$ of instruction $I_{j+1}$; after completion of this step, register $c_{j+1}$ contains the result of instruction $I_{j+1}$.
At the $(n + k - 1)$th step ($k = 2, \dots, m - 1$), the unit performs in parallel stage $k$ of instruction $I_n$, stage $k + 1$ of instruction $I_{n-1}$, etc., as well as the final stage $m$ of instruction $I_{n-m+k}$; after completion of this step, register $c_{n-m+k}$ contains the result of instruction $I_{n-m+k}$.
At the $(n + m - 1)$th step, the unit only performs the final stage $m$ of instruction $I_n$; after completion of this step, register $c_n$ contains the result of instruction $I_n$.
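The step-by-step description above can be reproduced with a short simulation. The sketch below relies on one observation that follows from the description: at step t, instruction number k is at stage t - k + 1, provided that value lies between 1 and m. The function name `pipeline_schedule` and the sample values n = 5, m = 3 are assumptions made for illustration.

```python
def pipeline_schedule(n: int, m: int):
    """For each step 1..n+m-1, list the (instruction, stage) pairs in progress.

    n is the number of independent instructions and m is the number of
    pipeline stages.
    """
    schedule = []
    for step in range(1, n + m):  # steps 1 .. n + m - 1
        active = []
        for instr in range(1, n + 1):
            stage = step - instr + 1
            if 1 <= stage <= m:
                active.append((instr, stage))
        schedule.append(active)
    return schedule


n, m = 5, 3
schedule = pipeline_schedule(n, m)
for step, active in enumerate(schedule, start=1):
    print(step, active)

# Steps at which every one of the m stages is busy.
fully_loaded = [step for step, active in enumerate(schedule, start=1)
                if len(active) == m]
```

For these sample values the schedule has n + m - 1 = 7 steps, and the pipeline is fully loaded exactly at steps 3 through 5, that is, from step m to step n.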
In total, it takes $n + m - 1$ steps to execute $n$ instructions. The pipeline of the unit is fully loaded only from the $m$th to the $n$th step of the execution. Strictly serial execution of $n$ successive instructions by the unit takes $n \times m$ steps. Thus the maximal speedup provided by this pipelined unit is
$$S_{IEU} = \frac{n \times m}{n + m - 1}.$$
If $n$ is large enough, the speedup is approximately equal to the length of the unit's pipeline, $S_{IEU} \approx m$.
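To see how quickly the speedup approaches the pipeline length, the ratio of the two step counts given above (n × m serial steps against n + m - 1 pipelined steps) can be evaluated directly. The function name `pipeline_speedup` and the sample pipeline length m = 5 are assumptions made for illustration.

```python
def pipeline_speedup(n: int, m: int) -> float:
    """Ratio of serial to pipelined step counts for n instructions, m stages."""
    serial_steps = n * m
    pipelined_steps = n + m - 1
    return serial_steps / pipelined_steps


m = 5
for n in (1, 5, 50, 500, 5000):
    print(n, round(pipeline_speedup(n, m), 3))
```

With a single instruction the ratio is exactly 1, since nothing overlaps; as n grows, the m - 1 steps spent filling and draining the pipeline become negligible and the ratio tends to m.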
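The maximal speedup provided by the entire superscalar processor having $K$ parallel IEUs is approximately equal to the sum of the lengths of the pipelines of the IEUs:
$$S \approx \sum_{i=1}^{K} m_i,$$
where $m_i$ denotes the pipeline length of the $i$th IEU.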
CDC 6600 (1964) was the first processor with several IEUs functioning in parallel. CDC 7600 (1969) was the first processor with several pipelined IEUs functioning in parallel. Nowadays practically all manufactured microprocessors have many IEUs, each of which is normally a pipelined unit, and we can hardly imagine other microprocessor architectures.
Superscalar architecture includes the serial scalar architecture as a particular case (K = 1, m = 1).