Branch Target Buffer, Return Address Stack

February 25 2004

branch prediction review

there are two steps to branch prediction:

  1. generate prediction
  2. update state

we generate predictions before we know the branch outcome - this means that prediction happens early in the pipeline. we update the state of our predictors after we know the actual branch outcome - this means that update happens later in the pipeline. let's review what happens in these two steps for the types of branch predictors we saw last week:
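as a reminder of how the two steps fit together, here is a minimal sketch of one predictor of the kind mentioned in the questions below: a global history shift register that indexes a pattern history table of 2-bit saturating counters. the class name, the method names, and the default of 4 history bits are assumptions made for illustration, not something fixed by these notes.

```python
class GlobalHistoryPredictor:
    """sketch: a global history register indexes a pattern history table (pht)."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0  # shift register holding the most recent outcomes
        # one 2-bit counter per possible history pattern
        self.pht = [1] * (1 << history_bits)  # 1 = weakly not-taken

    def predict(self):
        # step 1: generate prediction (early in the pipeline)
        return self.pht[self.history] >= 2  # counter values 2 and 3 mean "taken"

    def update(self, taken):
        # step 2: update state (later, once the real outcome is known)
        # the counter is updated while self.history still holds the
        # pattern that was used to make the prediction
        counter = self.pht[self.history]
        if taken:
            self.pht[self.history] = min(counter + 1, 3)
        else:
            self.pht[self.history] = max(counter - 1, 0)
        # only then is the outcome shifted into the history register
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

calling predict() and then update(outcome) for each branch, in that order, mirrors the early-prediction / late-update split described above.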

questions

why is the number of local/global history bits determined by the size of the pattern history table?

why do we always update the saturating counters before we update the shift registers?

branch target buffer [btb]

branch target buffers are very simple: they're just fixed-size hash tables that map from pcs to branch target addresses.

how do you use a branch target buffer? if our pc is 12, we search our branch target buffer, looking for an entry marked with pc 12. if we find a matching entry [this is called a "btb hit"], we know two things: we know that the instruction at this pc is a branch instruction, and we know the target address of the branch. if we don't find a matching entry in the branch target buffer [a "btb miss"], it doesn't tell us anything: the instruction might not be a branch, or it might be a branch whose entry was never added or has since been replaced.
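the lookup described above can be sketched as a small direct-mapped table. this is only an illustration: the class name, the 16-entry default, and the index function [drop the two byte-offset bits, keep the low bits] are assumptions, and real btbs differ in size, associativity, and how much of the pc they store as a tag.

```python
class BranchTargetBuffer:
    """sketch of a fixed-size, direct-mapped btb (hypothetical layout)."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        # each slot holds (pc, target address), or None if it is empty
        self.entries = [None] * num_entries

    def _index(self, pc):
        # assumed hash: instructions are 4 bytes, so skip the low 2 bits
        return (pc >> 2) % self.num_entries

    def lookup(self, pc):
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]  # btb hit: this pc is a branch, here is its target
        return None          # btb miss: we learn nothing about this pc

    def update(self, pc, target):
        # called once the branch target is known; replaces whatever
        # branch previously occupied this slot
        self.entries[self._index(pc)] = (pc, target)
```

because the table is fixed-size, two branches whose pcs map to the same slot evict each other, which is one way a real branch can still miss in the btb.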

before, we had to wait until the decode stage to perform branch prediction, because branch prediction requires two pieces of information that are usually not available during fetch:

  1. is the current instruction a branch instruction?
  2. if so, what is the branch target address?

when we have a btb hit, the branch target buffer tells us these two important pieces of information - and the only input it needed was the current pc. this means that we can perform branch prediction in the fetch stage, if we have a branch target buffer.

why is this good? draw a pipeline timing diagram. if we predict branches in the decode stage, there is one cycle of wasted fetch after every branch that we predict taken, because we don't know what we're supposed to be fetching until the branch finishes decode. if we predict branches in the fetch stage and hit in the btb, there are no cycles of wasted fetch.
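here is a rough sketch of how the fetch stage could pick the next pc when it has a btb. a plain dict stands in for the btb, and predict_taken is a hypothetical callback standing in for whatever direction predictor is in use; both are assumptions made to keep the example short.

```python
def next_fetch_pc(pc, btb, predict_taken):
    """choose the pc to fetch next cycle, using only the current pc."""
    target = btb.get(pc)  # btb lookup: needs nothing but the pc
    if target is not None and predict_taken(pc):
        return target     # hit and predicted taken: fetch from the target
    return pc + 4         # miss or predicted not-taken: fetch sequentially


btb = {12: 40}                    # the branch at pc 12 jumps to 40
always_taken = lambda pc: True    # trivial stand-in predictor

redirected = next_fetch_pc(12, btb, always_taken)   # btb hit
sequential = next_fetch_pc(16, btb, always_taken)   # btb miss
```

nothing here depends on decoding the instruction, which is why this choice can be made during fetch.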

return address stack [ras]

return address stacks are also very simple: they're fixed-size stacks of return addresses.

to use a return address stack, we push pc+4 onto the stack when we execute a procedure call instruction. this pushes the return address of the call instruction onto the stack - when the call is finished, it will return to pc+4 of the procedure call instruction. when we execute a return instruction, we pop an address off the stack, and predict that the return instruction will return to the popped address.

since a return instruction almost always returns to the instruction right after the most recent procedure call that hasn't returned yet, return address stacks are highly accurate.

remember that return address stacks only generate predictions for return instructions. they don't help at all for procedure call instructions [we use the btb to predict calls].
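the push-on-call, pop-on-return behavior can be sketched in a few lines. the class name, the method names, the depth of 8, and the choice to drop the oldest entry when the stack is full are all assumptions for illustration; pc+4 assumes 4-byte instructions, as in the text above.

```python
from collections import deque


class ReturnAddressStack:
    """sketch of a fixed-size return address stack."""

    def __init__(self, depth=8):
        # a deque with maxlen silently drops its oldest entry when full
        self.stack = deque(maxlen=depth)

    def on_call(self, pc):
        # push the return address of the call instruction at this pc
        self.stack.append(pc + 4)

    def on_return(self):
        # pop the predicted return target; None means "no prediction"
        return self.stack.pop() if self.stack else None
```

with a depth of 2, three nested calls followed by three returns would predict the first two returns correctly and have nothing left for the third, since the oldest return address was overwritten.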

problems

suppose i have a standard 5-stage pipeline where branches and jumps are resolved in the execute stage. 20% of my instructions are branches, and they're taken 60% of the time. 5% of my instructions are return instructions [return instructions are not branches!].

what's the cpi of my system if i always stall my processor on branches and jumps? what if i always predict that branches and jumps are taken?
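one way to set up this kind of problem is cpi = base cpi + sum over instruction types of [fraction of instructions x average penalty cycles]. the sketch below encodes that formula and applies it to the always-stall case only. it assumes a base cpi of 1, a penalty of 2 cycles when resolution happens in execute, and that returns are the only jumps; none of these is stated in the problem, so check them against your own timing diagram.

```python
def cpi(base_cpi, penalties):
    """penalties: list of (fraction of instructions, average penalty cycles)."""
    return base_cpi + sum(frac * cycles for frac, cycles in penalties)


# assumption: stalling until execute costs 2 cycles per branch or return
ASSUMED_STALL_CYCLES = 2

stall_cpi = cpi(1.0, [
    (0.20, ASSUMED_STALL_CYCLES),  # branches: 20% of instructions
    (0.05, ASSUMED_STALL_CYCLES),  # returns: 5% of instructions
])
```

for the predict-taken case, the same function applies once you decide the average penalty for each instruction type [for example, weighting the branch penalty by how often the prediction is wrong].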