# Branch Target Buffer, Return Address Stack

February 25, 2004

## branch prediction review

there are two steps to branch prediction:

1. generate prediction
2. update state

we generate predictions before we know the branch outcome - this means that prediction happens early in the pipeline. we update the state of our predictors after we know the actual branch outcome - this means that update happens later in the pipeline. let's review what happens in these two steps for the types of branch predictors we saw last week:

• pattern history table with N entries
1. generate prediction
1. compute the index into the pattern history table. since our pattern history table contains N entries, we need log2(N) bits. we take the bottom log2(N) bits of (pc >> 2).
2. look at the value of the two-bit saturating counter at the index we just computed in the pattern history table. if our index is 2, look at the counter in entry 2, etc.
3. use the value of the counter to generate a prediction. if the counter is 0 or 1, predict not taken; if it is 2 or 3 predict taken.
2. update state
1. use the index we calculated when we were generating a prediction
2. if the branch was actually taken, increment the two-bit saturating counter at the index we computed in the pattern history table. if the branch was actually not taken, decrement the counter.
• local history predictor with N-entry local history table, and M-entry pattern history table. we keep log2(M) bits in each entry of our local history table.
1. generate prediction
1. compute the index into the local history table. this is exactly like the index calculation we just did for the pattern history table. since our local history table contains N entries, we need log2(N) bits. we take the bottom log2(N) bits of (pc >> 2).
2. look at the value of the shift register at the index we just computed in the local history table. if our index is 2, look at the shift register in entry 2, etc.
3. use the value of the shift register as an index into the pattern history table. for example, if the shift register contains the bits 10, look at the value of the saturating counter at index 2 in the pht [10 in binary is 2].
4. if the value of the saturating counter is 0 or 1, predict not taken; if it is 2 or 3 predict taken.
2. update state
1. use the index into the pattern history table that we computed in the prediction step.
2. update the saturating counter. if the branch was actually taken, increment the two-bit saturating counter at the index we computed in the pattern history table. if the branch was actually not taken, decrement the counter.
3. use the index into the local history table that we computed in the prediction step.
4. update the shift register. if the branch was taken, shift left and insert a one; if the branch was not taken, shift left and insert a zero.
• global history predictor with N-entry pattern history table. we keep log2(N) history bits in our global history register.
1. generate prediction
1. use the value of the global history register as an index into the pattern history table. for example, if the global history register contains the bits 10, look at the value of the saturating counter at index 2 in the pht.
2. if the value of the saturating counter is 0 or 1, predict not taken; if it is 2 or 3 predict taken.
2. update state
1. use the value of the ghr as an index into the pattern history table.
2. update the saturating counter. if the branch was actually taken, increment the two-bit saturating counter at the index we computed in the pattern history table. if the branch was actually not taken, decrement the counter.
3. update the global history register. if the branch was taken, shift left and insert a one; if the branch was not taken, shift left and insert a zero.

## questions

why is the number of local/global history bits determined by the size of the pattern history table?

why do we always update the saturating counters before we update the shift registers?

## branch target buffer [btb]

branch target buffers are very simple: they're just fixed-size hash tables that map pcs to branch target addresses.

how do you use a branch target buffer? if our pc is 12, we search our branch target buffer, looking for an entry marked with pc 12. if we find a matching entry [this is called a "btb hit"], we know two things: we know that we are executing a branch instruction, and we know the target address of the branch. if we don't find a matching entry [a "btb miss"], we learn nothing - the instruction might still be a branch that simply isn't in the buffer yet.
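one way to picture a btb is as a small dictionary with a capacity limit. this is just an illustrative sketch - the capacity and the eviction policy here [drop an arbitrary entry] are made up; real btbs are set-associative hardware structures:

```python
# toy branch target buffer: maps pc -> branch target address.
# capacity and eviction policy are assumptions for illustration.

class BTB:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}

    def lookup(self, pc):
        # returns the target on a btb hit, None on a miss.
        # a miss tells us nothing - the instruction may still be a branch.
        return self.entries.get(pc)

    def insert(self, pc, target):
        # called when a taken branch resolves, so future fetches can hit
        if pc not in self.entries and len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # evict arbitrarily
        self.entries[pc] = target

btb = BTB()
btb.insert(12, 100)    # branch at pc 12 jumps to 100
print(btb.lookup(12))  # hit: 100 - pc 12 is a branch with target 100
print(btb.lookup(16))  # miss: None
```

the key property is that lookup only needs the pc - which is exactly what's available in the fetch stage.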

before, we had to wait until the decode stage to perform branch prediction, because branch prediction requires two pieces of information that are usually not available during fetch:

1. is the current instruction a branch instruction?
2. if so, what is the branch target address?

when we have a btb hit, the branch target buffer tells us these two important pieces of information - and the only input it needed was the current pc. this means that we can perform branch prediction in the fetch stage, if we have a branch target buffer.

why is this good? draw a pipeline timing diagram. if we predict branches in the decode stage, there is one cycle of wasted fetch after every branch instruction, because we don't know what we're supposed to be fetching until the branch finishes decode. if we predict branches in the fetch stage, there are no cycles of wasted fetch.

return address stacks are also very simple: they're fixed-size stacks of return addresses.

to use a return address stack, we push pc+4 onto the stack when we execute a procedure call instruction. this pushes the return address of the call instruction onto the stack - when the call is finished, it will return to pc+4 of the procedure call instruction. when we execute a return instruction, we pop an address off the stack, and predict that the return instruction will return to the popped address.

since return instructions almost always return to the instruction after the most recent procedure call, return address stacks are highly accurate.

remember that return address stacks only generate predictions for return instructions. they don't help at all for procedure call instructions [we use the btb to predict calls].
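the push/pop behavior can be sketched like this. the capacity and the overflow behavior [drop the oldest entry when full] are assumptions - real designs handle overflow in different ways:

```python
# toy return address stack with a fixed capacity.

class RAS:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.stack = []

    def call(self, pc):
        # on a procedure call at address pc, push the return address pc+4
        if len(self.stack) == self.capacity:
            self.stack.pop(0)  # overflow: drop the oldest entry (an assumption)
        self.stack.append(pc + 4)

    def ret(self):
        # on a return, predict the address on top of the stack
        if self.stack:
            return self.stack.pop()
        return None  # empty stack: no prediction

ras = RAS()
ras.call(100)     # call at pc 100 -> should return to 104
ras.call(200)     # nested call at pc 200 -> should return to 204
print(ras.ret())  # 204
print(ras.ret())  # 104
```

the stack discipline is what makes this work: nested calls return in last-in, first-out order, which is exactly what a stack gives us.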

## problems

suppose i have a standard 5-stage pipeline where branches and jumps are resolved in the execute stage. 20% of my instructions are branches, and they're taken 60% of the time. 5% of my instructions are return instructions [return instructions are not branches!].

what's the cpi of my system if i always stall my processor on branches and jumps? what if i always predict that branches and jumps are taken?
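here's a sketch of how to set the arithmetic up, under two assumptions: resolving a branch in the execute stage wastes 2 cycles [the fetch and decode slots behind it], and return instructions always pay the full penalty, since without a return address stack neither scheme can supply the return target:

```python
# cpi = 1 + (penalty cycles per instruction from control hazards)
# assumption: resolving in execute costs 2 wasted cycles whenever
# we stall or mispredict.

penalty = 2
f_branch, f_return = 0.20, 0.05
p_taken = 0.60

# always stall: every branch and return pays the full penalty
cpi_stall = 1 + (f_branch + f_return) * penalty

# always predict taken: branches only pay on a misprediction
# (the 40% not-taken case); returns still pay the full penalty.
cpi_taken = 1 + f_branch * (1 - p_taken) * penalty + f_return * penalty

print(round(cpi_stall, 2))  # 1.5
print(round(cpi_taken, 2))  # 1.26
```

check the second assumption against your own lecture's convention - if "predict taken" is interpreted as also predicting a target from a btb, the return term changes.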