Pipelining is discussed in chapter 6 of your book.
This is obviously better than waiting for an instruction to complete the whole sequence before pushing the next one. In the case where everything goes well, and assuming all stages are of equal duration, the speedup is the depth of the pipeline: here five. More generally, the maximum throughput is dictated by the longest stage.
However, everything does not go well all the time.
lw $t0, array($t1) add $t2, $t0, 1There might be a problem with the availability of the loaded value: the lw is in EX (computing the address) when the add tries to read the value of $t0 from the register file in the RD stage. The add might be reading the previous value of $t0, which was not what we intended ! This a data hazard . An instruction's execution depends on the result of a previous one, which has not completed yet.
sub $t1, $t0, $a0 add $t2, $t1, 1The result of the sub is not needed by the add before it is actually available: the sub finishes the EX stage, and the result could be given to the add entering the EX stage. This is called forwarding.
To implement this, we need some special hardware. It will have to detect that $t1 is produced by the sub and is going to be used right away by the add.
Note that the "before" is very important: in the case of the lw/add pair, there is one stall we cannot avoid, because the data is simply not available when it is needed. There are other kinds of hazards.
sub $t1, $t0, 1 beq $t1, $zero, loop add $t1, $t1, 1In the simple pipeline, we need to wait until the second stage (RD) to determine whether or not we will take the branch. But meanwhile, the add has been fetched. But we might not want to execute the add, since it looks like something we would like to do only when exiting the loop. If we want to enforce this, and declare the add as not executed when the branch is taken, then we have one stall (at least).
On more complex pipelines, we could be unable to decide whether the branch is taken without getting to a later stage; the penalty would be even larger. Think of more complex tests in branches.
As processors become more and more complex, the delay slot technique becomes less important. It doesn't help much to be helped by the instruction set on the instruction that follows the branch when you actually have six instructions in flight.
bsy@cse.ucsd.edu, last updated