Control Hazards, Branch Prediction

February 18, 2004

control hazards

branch instructions can cause a lot of trouble for pipelined processors. like any other instruction, they can cause data hazards, since they require register values. they also introduce a new type of hazard: the control hazard. we don't know which instructions to execute after the branch until we know whether the branch was taken or not. consider the following piece of code [no branch delay slots for now]:

  lw  $6, 0($10)
  lw  $7, 0($11)
  add $1, $2, $3
  beq $0, $1, label
  sub $2, $3, $4
label:
  or  $3, $4, $5

we can't push instructions into our pipeline until we know whether the branch is taken or not, because we don't know which instructions we're supposed to be executing. in the example above, until we know the outcome of the branch instruction, we don't know whether we should be fetching the subtract instruction, or the 'or' instruction.

as we discussed last week, all hazards can be resolved by waiting. how long do we have to wait before we can start fetching instructions after a branch? in other words, how long does it take for us to figure out if the branch was taken or not?

we can reduce the branch delay by moving the branch-related datapath logic [logic to determine taken/not taken and to calculate branch target address] into the decode stage. if we make this change, how long do we have to wait before we can start fetching instructions after a branch?

assume we have moved branch logic into the decode stage. draw a pipeline timing diagram for the code fragment above, assuming that the branch is taken. don't forget about data hazards.

branch delay slots

even after moving the branch logic into the decode stage, we still have to wait one cycle before we know where we should be fetching instructions. we can eliminate this stall cycle if we make a small adjustment to the definition of a branch instruction. if we declare that the instruction immediately following a branch is always executed, even if the branch is taken, then the instruction we fetch during that cycle is always useful work, so we can keep our pipeline filled even when we have branch instructions.

how does the code fragment above need to change if we have branch delay slots?

with branch delay slots, an instruction in the delay slot is always executed, regardless of the outcome of the branch instruction. branch delay slots are only useful if you can find an instruction to put in the slot - if you can't find anything, you have to fill it with a no-op, and that's just as bad as stalling the processor.

branch delay slots are one of those ideas that seem really great at the time, but, as time passes and things change, they become dead weight that's difficult to leave behind.

why? because branch delay slots require us to change the way branch instructions work. as we learn about branch prediction, you'll see that we don't need branch delay slots... and think about how annoying it will be for future processors to keep mimicking the behavior of branch delay slots even when they don't need them.

of course, we could redefine branch instructions again by declaring that the instruction immediately following a branch instruction is not always executed... but this change would break old programs. everyone would need to recompile their code, and people don't like to recompile things. this is one of the main reasons why intel and microsoft are so dominant - they go to great lengths to maintain backwards compatibility.

branch prediction

the idea behind branch prediction is simple: if we can correctly predict whether branches are taken or not, we can reduce the stalls due to control hazards.

fortunately, branches are highly predictable. remember that there are only two possibilities: taken and not taken. in addition, think about how branches work... consider the branches involved in the following piece of code, for example:

for(i=0; i < n; i++) {
  for(j=0; j < m; j++) {
    sum += a[i][j];
  }
}

as you can imagine, branches almost always exhibit regular behavior... the branch that closes the inner loop above, for example, is taken over and over, then not taken exactly once each time the loop finishes. if there is regular behavior, once you see the pattern, it's easy to figure out what will happen next. that's really all there is to branch prediction: we look at the past to predict the future.

the tricky part is building something in hardware that can analyze the past behavior of branches, and generate predictions.

one of the most frequently used components in branch prediction is the two-bit saturating counter. it's really simple: it's just a two-bit counter that can never go below 0 or above 3. you increment the counter when branches are taken [but remember, the counter can't go above 3], and decrement it when branches are not taken [can't go below 0]. to generate a prediction, you just look at the value of the counter. if it's 2 or 3, the branch will probably be taken. if it's 0 or 1, the branch will probably not be taken.
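
here's a rough sketch of a two-bit saturating counter in c [the names `counter_t`, `predict`, and `update` are made up for this sketch; real predictors do this in hardware, not software, but the algorithm is the same]:

/* a two-bit saturating counter: holds a value from 0 to 3.
   0 and 1 mean "probably not taken", 2 and 3 mean "probably taken". */
typedef unsigned char counter_t;

/* generate a prediction: 1 = taken, 0 = not taken */
int predict(counter_t c) {
    return c >= 2;
}

/* train the counter: increment on taken, decrement on not taken,
   but saturate at 3 and 0 instead of wrapping around */
counter_t update(counter_t c, int taken) {
    if (taken)
        return (c < 3) ? c + 1 : 3;
    else
        return (c > 0) ? c - 1 : 0;
}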

suppose we have a two bit counter initialized to zero. show the states for our two bit counter given the following branch outcomes: nt, nt, nt, nt, t, nt, nt, nt, nt. what predictions will we generate? how many mispredictions will we have?

we can make our first predictor by making a big array of these two-bit saturating counters. for now, we'll say that we have one counter for each branch instruction in the program. so, whenever we decode a branch instruction and want a prediction, we look in our big array of saturating counters, find the counter that corresponds to that branch instruction, and look at its value.

later, when we know whether the branch was actually taken or not, we need to update that counter: if the branch was actually taken, we increment the counter, otherwise we decrement the counter. this 'update' step is very important! it's what allows our branch predictor to adapt to changes in the program's behavior.

the predictor described above assumes that we have one counter for each branch in the program we're running. a more reasonable thing to do is to have some fixed number of counters, and use a hash function to determine which branch instruction will use which counter. this is usually done by using a power-of-two number of counters, and using bits from the pc of the branch instruction as a hash.

for example, suppose we have 8 saturating counters. we could use the bottom 3 bits of the pc of each branch instruction to determine which counter that branch will use. this is not a very good scheme, since the bottom two bits of each branch instruction's pc are always zero [remember, we are dealing with word-aligned 32-bit instructions]. a better scheme is to first drop the lowest two bits, then take the bottom 3 bits [in other words, index=(pc>>2) & 7].

these arrays of two-bit saturating counters are called "pattern history tables", commonly abbreviated "pht".
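
here's a sketch of this simple pht predictor in c, reusing the `counter_t`, `predict`, and `update` pieces from the counter sketch above [the table size and the function names are just made up for illustration]:

#define PHT_SIZE 8                    /* must be a power of two */

counter_t pht[PHT_SIZE];              /* all counters start at 0 */

/* hash the branch pc down to a table index: drop the two
   always-zero low bits, then keep the bottom 3 bits */
unsigned pht_index(unsigned pc) {
    return (pc >> 2) & (PHT_SIZE - 1);
}

/* at decode time: look up the counter and make a prediction */
int pht_predict(unsigned pc) {
    return predict(pht[pht_index(pc)]);
}

/* once the branch resolves: train the counter so the predictor
   can adapt to the program's behavior */
void pht_update(unsigned pc, int taken) {
    unsigned i = pht_index(pc);
    pht[i] = update(pht[i], taken);
}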

with our first branch predictor, each branch instruction got a single counter. with a local history predictor, we generate predictions based on the recent outcome pattern of each branch instruction. to do this, we introduce a "local history table" [lht], a table that keeps track of the outcomes of each branch instruction for the last N times we executed it. we index the local history table the same way we indexed the pattern history table before, so index=(pc>>2) & 7 if we have 8 entries in our local history table. the trick is to use the data found in the local history table as an index into the pattern history table. so, to generate a prediction, we look up the value of pht[lht[index]].
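
here's a sketch of a local history predictor in c, again reusing `counter_t`, `predict`, and `update` from above [the table sizes and names are made up; a real design would pick them based on the hardware budget]:

#define LHT_SIZE   8                  /* entries in the local history table */
#define HIST_BITS  2                  /* outcomes remembered per branch */
#define LPHT_SIZE  (1 << HIST_BITS)   /* one counter per possible history */

unsigned  lht[LHT_SIZE];              /* recent outcomes of each branch */
counter_t local_pht[LPHT_SIZE];

unsigned lht_index(unsigned pc) {
    return (pc >> 2) & (LHT_SIZE - 1);
}

/* prediction: this branch's recent history picks the counter */
int local_predict(unsigned pc) {
    return predict(local_pht[lht[lht_index(pc)]]);
}

/* update: train the counter that the history selected, then shift
   the actual outcome into this branch's history */
void local_update(unsigned pc, int taken) {
    unsigned i = lht_index(pc);
    unsigned h = lht[i];
    local_pht[h] = update(local_pht[h], taken);
    lht[i] = ((h << 1) | (taken ? 1 : 0)) & (LPHT_SIZE - 1);
}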

local branch predictors don't consider context - they generate predictions for each branch instruction based only on the history of that branch instruction. but the outcome of one branch instruction is often related to the outcome of another branch instruction [this is very common in nested 'if' statements]. global history predictors try to take advantage of this correlation between branches. the design is fairly simple: we keep track of the outcomes of the last N branches [we do this with a shift register called the "global history register" (ghr)], and we use this data as an index into our pattern history table. so, to generate a prediction, we look up the value of pht[ghr].
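
a sketch of a global history predictor in c, with the same caveats as before [sizes and names are just for illustration]:

#define GHIST_BITS 2                  /* branches remembered in the ghr */
#define GPHT_SIZE  (1 << GHIST_BITS)

unsigned  ghr;                        /* outcomes of the last GHIST_BITS branches */
counter_t global_pht[GPHT_SIZE];

/* prediction: the recent outcomes of *all* branches pick the counter */
int global_predict(void) {
    return predict(global_pht[ghr]);
}

/* update: train the selected counter, then shift the actual outcome
   into the global history register */
void global_update(int taken) {
    global_pht[ghr] = update(global_pht[ghr], taken);
    ghr = ((ghr << 1) | (taken ? 1 : 0)) & (GPHT_SIZE - 1);
}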

problem

consider the following code [no branch delay slots]:

  add  $2, $0, $0
  addi $11, $0, 5
  addi $12, $0, 3
outer:
  addi $2, $2, 1
  add  $1, $0, $0
inner:
  addi $1, $1, 1
  bne  $1, $11, inner
  bne  $2, $12, outer

the first add instruction is located at address 0. what is our misprediction rate if we just use a pattern history table with 2 entries? [misprediction rate = # mispredictions / # predictions]

what if we use a local history predictor with a 2-entry local history table, and a 4-entry pattern history table?

what if we use a global history predictor with a 4-entry pattern history table?