Branch Target Buffer, Return Address Stack
February 25 2004
branch prediction review
there are two steps to branch prediction:
- generate prediction
- update state
we generate predictions before we know the branch outcome - this means
that prediction happens early in the pipeline. we update the
state of our predictors after we know the actual branch outcome - this
means that update happens later in the pipeline. let's review
what happens in these two steps for the types of branch predictors we
saw last week:
- pattern history table with N entries
- generate prediction
- compute the index into the pattern history table. since our
pattern history table contains N entries, we need log2(N)
bits. we take the bottom log2(N) bits of (pc >> 2).
- look at the value of the two-bit saturating counter at the index
we just computed in the pattern history table. if our index is 2, look
at the counter in entry 2, and so on.
- use the value of the counter to generate a prediction. if the
counter is 0 or 1, predict not taken; if it is 2 or 3 predict taken.
- update state
- use the index we calculated when we were generating a prediction
- if the branch was actually taken, increment the two-bit saturating
counter at the index we computed in the pattern history table. if
the branch was actually not taken, decrement the counter.
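the steps above can be sketched in code. this is a minimal software model, not a hardware description; the class and method names are illustrative, not from these notes:

```python
class PatternHistoryTable:
    """an N-entry table of 2-bit saturating counters."""

    def __init__(self, n):
        assert n & (n - 1) == 0, "n must be a power of two"
        self.n = n
        self.counters = [0] * n  # 2-bit counters, initialized to 0

    def index(self, pc):
        # bottom log2(n) bits of (pc >> 2)
        return (pc >> 2) & (self.n - 1)

    def predict(self, pc):
        # counter 0 or 1 -> predict not taken; 2 or 3 -> predict taken
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        # increment on taken, decrement on not taken, saturating at 0 and 3
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

note that both predict and update compute the same index from the pc, which is why the hardware can remember the index from the prediction step and reuse it at update time.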
- local history predictor with N-entry local history table and M-entry
pattern history table. we keep log2(M) bits in each entry
of our local history table.
- generate prediction
- compute the index into the local history table. this is exactly
like the index calculation we just did for the pattern history table.
since our local history table contains N entries, we need
log2(N) bits. we take the bottom log2(N) bits of
(pc >> 2).
- look at the value of the shift register at the index we just
computed in the local history table. if our index is 2, look
at the shift register in entry 2, and so on.
- use the value of the shift register as an index into the
pattern history table. for example, if the shift register contains the
bits 10, look at the value of the saturating counter at index 2 in
the pht.
- if the value of the saturating counter is 0 or 1, predict not
taken; if it is 2 or 3 predict taken.
- update state
- use the index into the pattern history table that we computed in the
prediction step.
- update the saturating counter. if the branch was actually taken,
increment the two-bit saturating counter at the index we computed in
the pattern history table. if the branch was actually not taken,
decrement the counter.
- use the index into the local history table that we computed in the
prediction step.
- update the shift register. if the branch was taken, shift left and
insert a one; if the branch was not taken, shift left and insert a
zero.
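putting the two tables together, a minimal sketch of the local history predictor (illustrative names, and using the usual convention that a taken branch shifts in a one):

```python
class LocalHistoryPredictor:
    """n-entry local history table feeding an m-entry pattern history table.
    each history entry holds log2(m) bits."""

    def __init__(self, n, m):
        assert n & (n - 1) == 0 and m & (m - 1) == 0
        self.n, self.m = n, m
        self.lht = [0] * n  # per-branch history shift registers
        self.pht = [0] * m  # 2-bit saturating counters

    def lht_index(self, pc):
        # bottom log2(n) bits of (pc >> 2), as for the plain pht predictor
        return (pc >> 2) & (self.n - 1)

    def predict(self, pc):
        history = self.lht[self.lht_index(pc)]
        return self.pht[history] >= 2

    def update(self, pc, taken):
        i = self.lht_index(pc)
        history = self.lht[i]
        # update the counter first, indexed by the pre-update history...
        if taken:
            self.pht[history] = min(3, self.pht[history] + 1)
        else:
            self.pht[history] = max(0, self.pht[history] - 1)
        # ...then shift the new outcome into the history register
        self.lht[i] = ((history << 1) | (1 if taken else 0)) & (self.m - 1)
```

the ordering inside update matters: the counter we trained during prediction was selected by the old history, so we must update that counter before we shift the new outcome in.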
- global history predictor with N-entry pattern history table. we keep
log2(N) history bits in our global history register.
- generate prediction
- use the value of the global history register as an index into the
pattern history table. for example, if the global history register
contains the bits 10, look at the value of the saturating counter at
index 2 in the pht.
- if the value of the saturating counter is 0 or 1, predict not
taken; if it is 2 or 3 predict taken.
- update state
- use the value of the ghr as an index into the pattern history
table.
- update the saturating counter. if the branch was actually taken,
increment the two-bit saturating counter at the index we computed in
the pattern history table. if the branch was actually not taken,
decrement the counter.
- update the global history register. if the branch was taken, shift
left and insert a one; if the branch was not taken, shift left and
insert a zero.
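the global predictor is the same structure with one shared shift register instead of a table of them. a minimal sketch, with illustrative names:

```python
class GlobalHistoryPredictor:
    """one global history register indexing an n-entry pattern history table."""

    def __init__(self, n):
        assert n & (n - 1) == 0
        self.n = n
        self.ghr = 0        # log2(n) bits of global branch history
        self.pht = [0] * n  # 2-bit saturating counters

    def predict(self):
        # the ghr alone picks the counter; the pc is not used
        return self.pht[self.ghr] >= 2

    def update(self, taken):
        # update the counter first, indexed by the pre-update ghr...
        if taken:
            self.pht[self.ghr] = min(3, self.pht[self.ghr] + 1)
        else:
            self.pht[self.ghr] = max(0, self.pht[self.ghr] - 1)
        # ...then shift the outcome into the ghr
        self.ghr = ((self.ghr << 1) | (1 if taken else 0)) & (self.n - 1)
```

notice that predict takes no pc at all in this sketch, which makes the first question below concrete: the ghr is the entire index, so its width must be exactly log2(N).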
questions
why is the number of local/global history bits determined by the size
of the pattern history table?
why do we always update the saturating counters before we update the
shift registers?
branch target buffer [btb]
branch target buffers are very simple: they're just fixed-size
hashtables that map from pcs to branch target addresses.
how do you use a branch target buffer? if our pc is 12, we search our
branch target buffer, looking for an entry marked with pc 12. if we
find a matching entry [this is called a "btb hit"], we know two
things: we know that we are executing a branch instruction, and we
know the target address of the branch. if we don't find a matching
entry in the branch target buffer, it doesn't tell us anything.
before, we had to wait until the decode stage to perform branch
prediction, because branch prediction requires two pieces of
information that are usually not available during fetch:
- is the current instruction a branch instruction?
- if so, what is the branch target address?
when we have a btb hit, the branch target buffer tells us these two
important pieces of information - and the only input it needed was the
current pc. this means that we can perform branch prediction in the
fetch stage, if we have a branch target buffer.
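a minimal sketch of the lookup/update interface, using a plain dict (a real btb is a fixed-size, tagged hardware table with evictions; this sketch ignores capacity):

```python
btb = {}  # pc -> branch target address

def btb_lookup(pc):
    """returns (hit, target). on a hit we know the instruction at pc is a
    branch and we know its target; on a miss we know nothing."""
    if pc in btb:
        return True, btb[pc]
    return False, None

def btb_update(pc, target):
    # install an entry once the branch resolves and its target is known
    btb[pc] = target
```

the key point is the interface: lookup needs only the pc, which is exactly what fetch has, so the prediction can be made before decode.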
why is this good? draw a pipeline timing diagram. if we predict
branches in the decode stage, there is one cycle of wasted fetch after
every branch instruction, because we don't know what we're supposed to
be fetching until the branch finishes decode. if we predict branches
in the fetch stage, there are no cycles of wasted fetch.
return address stack [ras]
return address stacks are also very simple: they're fixed size stacks
of return addresses.
to use a return address stack, we push pc+4 onto the stack when we
execute a procedure call instruction. this pushes the return address
of the call instruction onto the stack - when the call is finished, it
will return to pc+4 of the procedure call instruction. when we execute
a return instruction, we pop an address off the stack, and predict
that the return instruction will return to the popped address.
since a return instruction almost always returns to the instruction
after the most recent procedure call, return address stacks are highly
accurate.
remember that return address stacks only generate predictions for
return instructions. they don't help at all for procedure call
instructions [we use the btb to predict calls].
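a minimal sketch of the push/pop behavior (illustrative names; a real ras is fixed-size hardware and typically overwrites the oldest entry on overflow, while this sketch just drops the push):

```python
class ReturnAddressStack:
    """a fixed-size stack of predicted return addresses."""

    def __init__(self, size):
        self.size = size
        self.stack = []

    def on_call(self, pc):
        # push the return address (pc + 4) of the call instruction
        if len(self.stack) < self.size:
            self.stack.append(pc + 4)

    def on_return(self):
        # pop and predict that the return goes to the popped address
        if self.stack:
            return self.stack.pop()
        return None  # empty stack: no prediction
```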
problems
suppose i have a standard 5-stage pipeline where branches and jumps
are resolved in the execute stage. 20% of my instructions are
branches, and they're taken 60% of the time. 5% of my instructions are
return instructions [return instructions are not branches!].
what's the cpi of my system if i always stall my processor on branches
and jumps? what if i always predict that branches and jumps are taken?