Final Review

March 13, 2004

problem 1

52-bit virtual address, 16k cache, 64 byte blocksize. byte addressable machine.

part a

assume direct mapped cache. how many bits for tag, index, offset? how much memory (in bytes) is needed for tags in the cache?

we need log2(blocksize) bits for the offset. blocksize is 64 bytes, so we need 6 bits for the offset.

we need log2(# sets) bits for the index. how many sets do we have? it's direct mapped (associativity=1), it holds 16k of data, and our blocksize is 64 bytes. so we have 16k/64 = (16*1024)/64 = 256 sets. log2(256) = 8, so we need 8 bits for the index.

all the remaining bits in each address are the tag. so each tag is 52-8-6 = 38 bits.

we need to store a tag for each piece of data in our cache. we have 256 sets in our direct mapped cache, so we are storing 256 pieces of data, so we need 256 tags. 256 * 38 bits = 9728 bits = 1216 bytes of memory to store tags.
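this arithmetic is easy to check with a quick python sketch (the function name and layout here are my own, not part of the problem) that reproduces the numbers for parts a, b, and c:

```python
def cache_tag_params(addr_bits, cache_bytes, block_bytes, assoc):
    """return (tag_bits, index_bits, offset_bits, tag_storage_bytes)."""
    offset_bits = (block_bytes - 1).bit_length()      # log2(blocksize)
    num_sets = cache_bytes // (block_bytes * assoc)   # sets = size / (blocksize * ways)
    index_bits = (num_sets - 1).bit_length()          # log2(# sets)
    tag_bits = addr_bits - index_bits - offset_bits   # everything left over is tag
    num_tags = num_sets * assoc                       # one tag per cache block
    return tag_bits, index_bits, offset_bits, num_tags * tag_bits // 8

# part a: 52-bit addresses, 16k direct mapped cache, 64-byte blocks
print(cache_tag_params(52, 16 * 1024, 64, 1))     # (38, 8, 6, 1216)
# part b: same cache, 8-way set associative
print(cache_tag_params(52, 16 * 1024, 64, 8))     # (41, 5, 6, 1312)
# part c: physically tagged, 512mb of memory -> 29-bit addresses
print(cache_tag_params(29, 16 * 1024, 64, 8)[0])  # 18
```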

part b

assume the cache is 8-way set associative. how many bits for tag, index, offset? how much memory is needed to store the tags?

we need log2(blocksize) bits for the offset. blocksize is 64 bytes, so we still need 6 bits for the offset.

we need log2(# sets) bits for the index. how many sets do we have? it's 8-way (associativity=8), it holds 16k of data, and our blocksize is 64 bytes. so we have 16k/(64*8) = (16*1024)/(64*8) = 32 sets. log2(32) = 5, so we need 5 bits for the index.

all the remaining bits in each address are the tag. so each tag is 52-5-6 = 41 bits.

we need to store a tag for each piece of data in our cache. we have 32 sets in our 8-way cache, so we are storing 32*8 pieces of data, so we need 256 tags. 256 * 41 bits = 10496 bits = 1312 bytes of memory to store tags.

part c

suppose the cache is physically tagged. the processor supports at most 512mb of memory. if our cache is 16k 8-way set associative with 64 byte blocks, how many tag bits do we need for each cache block?

we need one tag for each cache block. so we need to figure out how many bits are needed for each tag.

we need log2(blocksize) bits for the offset. blocksize is 64 bytes, so we still need 6 bits for the offset.

we need log2(# sets) bits for the index. how many sets do we have? it's 8-way (associativity=8), it holds 16k of data, and our blocksize is 64 bytes. so we have 16k/(64*8) = (16*1024)/(64*8) = 32 sets. log2(32) = 5, so we need 5 bits for the index.

all the remaining bits in each address are the tag. how many bits do we need for each physical address? if we have at most 512mb of memory, we need log2(512mb) = log2(512*1024*1024) = 29 bits. so we need 29-5-6 = 18 bits for each tag.

problem 2

we have a 64-byte cache with 16 bytes per block. memory is byte addressable. cache is initially empty. use the following address stream: 80,111,60,94,112,35,60,45,112,10,70,80

part a

how many cache hits are there for a direct mapped cache? which addresses are hits and which are misses?

our cache is 64 bytes, 16 byte blocks, direct mapped (associativity=1). this means that we have 64/16 = 4 sets.

so we'll need log2(blocksize) = log2(16) = 4 bits for offset, and log2(# sets) = log2(4) = 2 bits for index.

now let's take a look at what happens for each of our accesses: [address (binary) is just the address written out in binary, with spaces inserted to separate the tag, index, and offset. "t i o" = "tag index offset"]

      |   address | 
      |  (binary) | 
 addr | t i  o    | status 
------+-----------+-------------------------------------------------
   80 | 1 01 0000 | miss. store in set 1, tag=1
  111 | 1 10 1111 | miss. store in set 2, tag=1
   60 | 0 11 1100 | miss. store in set 3, tag=0
   94 | 1 01 1110 | hit.  we have the block with tag=1 in set 1. 
  112 | 1 11 0000 | miss. store in set 3, tag=1. data with tag=0 evicted
   35 | 0 10 0011 | miss. store in set 2, tag=0. data with tag=1 evicted
   60 | 0 11 1100 | miss. store in set 3, tag=0. data with tag=1 evicted
   45 | 0 10 1101 | hit.  we have the block with tag=0 in set 2. 
  112 | 1 11 0000 | miss. store in set 3, tag=1. data with tag=0 evicted
   10 | 0 00 1010 | miss. store in set 0, tag=0
   70 | 1 00 0110 | miss. store in set 0, tag=1. data with tag=0 evicted
   80 | 1 01 0000 | hit.  we have the block with tag=1 in set 1. 

there are 3 hits.

part b

how many cache hits are there for a 2-way set associative cache with LRU replacement? which addresses are hits and which are misses?

our cache is 64 bytes, 16 byte blocks, 2-way set associative. this means that we have 64/(16*2) = 2 sets.

so we'll need log2(blocksize) = log2(16) = 4 bits for offset, and log2(# sets) = log2(2) = 1 bit for index.

now let's take a look at what happens for each of our accesses: [address (binary) is just the address written out in binary, with spaces inserted to separate the tag, index, and offset. "t i o" = "tag index offset"]

      |   address | 
      |  (binary) | 
 addr | t  i o    | status 
------+-----------+-------------------------------------------------
   80 | 10 1 0000 | miss. store in set 1, way 0, tag=10
  111 | 11 0 1111 | miss. store in set 0, way 0, tag=11
   60 | 01 1 1100 | miss. store in set 1, way 1, tag=01
   94 | 10 1 1110 | hit.  we have the block with tag=10 in set 1, way 0
  112 | 11 1 0000 | miss. store in set 1, way 1, tag=11. evict tag=01
   35 | 01 0 0011 | miss. store in set 0, way 1, tag=01
   60 | 01 1 1100 | miss. store in set 1, way 0, tag=01. evict tag=10
   45 | 01 0 1101 | hit.  we have the block with tag=01 in set 0, way 1
  112 | 11 1 0000 | hit.  we have the block with tag=11 in set 1, way 1
   10 | 00 0 1010 | miss. store in set 0, way 0, tag=00. evict tag=11
   70 | 10 0 0110 | miss. store in set 0, way 1, tag=10. evict tag=01
   80 | 10 1 0000 | miss. store in set 1, way 0, tag=10. evict tag=01

there are 3 hits

part c

how many cache hits are there for a fully associative cache with LRU replacement? which addresses are hits and which are misses?

our cache is 64 bytes, 16 byte blocks, fully associative. fully associative means that the associativity = # blocks, and # sets = 1. so we have 1 set, and 4 ways.

so we'll need log2(blocksize) = log2(16) = 4 bits for offset, and log2(# sets) = log2(1) = 0 bits for index. this means there is no index [no bits for the index].

now let's take a look at what happens for each of our accesses: [address (binary) is just the address written out in binary, with a space inserted to separate the tag and offset. "t o" = "tag offset"]

      |  address | 
      | (binary) | 
 addr | t   o    | status 
------+----------+-------------------------------------------------
   80 | 101 0000 | miss. store in set 0, way 0, tag=101
  111 | 110 1111 | miss. store in set 0, way 1, tag=110
   60 | 011 1100 | miss. store in set 0, way 2, tag=011
   94 | 101 1110 | hit.  we have the data with tag=101 in set 0, way 0
  112 | 111 0000 | miss. store in set 0, way 3, tag=111
   35 | 010 0011 | miss. store in set 0, way 1, tag=010. evict tag=110
   60 | 011 1100 | hit.  we have the data with tag=011 in set 0, way 2
   45 | 010 1101 | hit.  we have the data with tag=010 in set 0, way 1
  112 | 111 0000 | hit.  we have the data with tag=111 in set 0, way 3
   10 | 000 1010 | miss. store in set 0, way 0, tag=000. evict tag=101
   70 | 100 0110 | miss. store in set 0, way 2, tag=100. evict tag=011
   80 | 101 0000 | miss. store in set 0, way 1, tag=101. evict tag=010

there are 4 hits
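all three parts of this problem can be checked with a small python simulator (my own sketch, assuming true LRU replacement, which also covers direct mapped as the degenerate 1-way case):

```python
def simulate(addresses, cache_bytes, block_bytes, assoc):
    """run an address trace through an LRU cache; return a list of True (hit) / False (miss)."""
    num_sets = cache_bytes // (block_bytes * assoc)
    sets = [[] for _ in range(num_sets)]   # each set: list of tags, most recently used last
    results = []
    for addr in addresses:
        block = addr // block_bytes        # strip the offset bits
        index = block % num_sets           # index bits select the set
        tag = block // num_sets            # remaining bits are the tag
        ways = sets[index]
        if tag in ways:
            ways.remove(tag)               # hit: move tag to the MRU position
            ways.append(tag)
            results.append(True)
        else:
            if len(ways) == assoc:         # set full: evict the LRU tag (front)
                ways.pop(0)
            ways.append(tag)
            results.append(False)
    return results

trace = [80, 111, 60, 94, 112, 35, 60, 45, 112, 10, 70, 80]
print(sum(simulate(trace, 64, 16, 1)))   # 3  (part a: direct mapped)
print(sum(simulate(trace, 64, 16, 2)))   # 3  (part b: 2-way)
print(sum(simulate(trace, 64, 16, 4)))   # 4  (part c: fully associative)
```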

problem 3

assume standard 5-stage pipeline, branches resolve in decode, branch delay slot is used. consider the following code: [numbers in brackets like [1] are endpoints for arrows] [sorry it's kinda hard for me to draw arrows here :)]

  lw   [00] r3, 0( [01] r2)
loop:
  sw   [02] r3, 0( [03] r2)
  sub  [04] r1, [05] r3, [06] r4
  lw   [07] r3, 0( [08] r1)
  sw   [09] r1, 0( [10] r3)
  subi [11] r2, [12] r2, 4
  bnez [13] r2, loop
  lw   [14] r3, 0( [15] r2)
  sub  [16] r5, [17] r6, [18] r7
  add  [19] r7, [20] r8, [21] r9
  sub  [22] r2, [23] r3, [24] r2

part a

draw arrows and label all raw, waw, and war non-loop carried dependencies in the above code example.

a * indicates dependencies that are hazards for this architecture. in other words, our processor will not perform correctly unless we forward or stall for the * dependencies.

part b

draw a pipeline cycle timing diagram for the above code. start with the first lw before the loop, run one loop iteration, and stop at the first store of the second iteration. show all forwarding needed to eliminate stalls. what cycle does the first sw in the second loop iteration enter the execute stage?

 instruction     | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15
-----------------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--
 lw   r3, 0(r2)  | f| d| x| m| w|  |  |  |  |  |  |  |  |  |
 sw   r3, 0(r2)  |  | f| d| x| m| w|  |  |  |  |  |  |  |  |
 sub  r1, r3, r4 |  |  | f| d| x| m| w|  |  |  |  |  |  |  |
 lw   r3, 0(r1)  |  |  |  | f| d| x| m| w|  |  |  |  |  |  |
 sw   r1, 0(r3)  |  |  |  |  | f| d| s| x| m| w|  |  |  |  |
 subi r2, r2, 4  |  |  |  |  |  | f| s| d| x| m| w|  |  |  |
 bnez r2, loop   |  |  |  |  |  |  |  | f| s| d| x| m| w|  |
 lw   r3, 0(r2)  |  |  |  |  |  |  |  |  |  | f| d| x| m| w|
 sw   r3, 0(r2)  |  |  |  |  |  |  |  |  |  |  | f| d| x| m| w|

[s = stall]

forwarding arrows: the arrows are hard to draw in text, so the forwarding needed in each cycle is described in the notes below.

the first sw in the second loop iteration enters the execute stage in cycle 13

notes:

 cycle 5:  the value of r3 was loaded in cycle 4. sw needs r3 for its store in the memory stage, and sub needs r3 for the subtraction in the execute stage, so both get forwards.
 cycle 6:  the value of r1 was computed in cycle 5. lw needs r1 for its effective address calculation in cycle 6.
 cycle 7:  stall, because sw needs the value of r3 for effective address calculation, and it won't be loaded in time. we don't need to forward the value of r1 to sw, because we re-run the decode stage in this cycle.
 cycle 8:  the value of r3 was loaded in cycle 7; sw needs r3 for effective address calculation in cycle 8.
 cycle 9:  stall, because bnez needs the value of r2 to resolve the branch in decode, but r2 won't be computed in time.
 cycle 10: the value of r2 was computed in cycle 9; bnez needs r2 to resolve the branch in cycle 10. we also fetch lw this cycle because it is in the branch delay slot.
 cycle 14: the value of r3 was loaded in cycle 13; sw needs r3 to perform its store in cycle 14.

problem 4

part a

three pieces of information are needed to predict the next pc of a branch instruction that we're currently fetching. what are these three pieces of information?

  1. is this a branch instruction? remember, we're fetching - we don't know what kind of instruction we have yet.
  2. is this branch taken or not taken?
  3. what is the branch target address? if the branch is taken, we won't know where to fetch next unless we know the target address

part b

see "Advanced Pipelining" lecture notes, bottom of page 8

part c

sequence 1 [part c uses the 1-bit predictor: state is 0 or 1, initial state 0, predict t when state = 1. the state row shows the state before each branch, plus the final state]

 actual:     nt t  nt t  nt t  nt t  nt t  nt t  nt t
 state:      0  0  1  0  1  0  1  0  1  0  1  0  1  0  1
 prediction: nt nt t  nt t  nt t  nt t  nt t  nt t  nt
 correct:    c  m  m  m  m  m  m  m  m  m  m  m  m  m

sequence 2

 actual:     t  t  nt t  t  nt t  t  nt t  t  nt t  t
 state:      0  1  1  0  1  1  0  1  1  0  1  1  0  1  1
 prediction: nt t  t  nt t  t  nt t  t  nt t  t  nt t
 correct:    m  c  m  m  c  m  m  c  m  m  c  m  m  c

sequence 3

 actual:     nt nt t  t  t  nt t  nt nt nt t  nt t  t
 state:      0  0  0  1  1  1  0  1  0  0  0  1  0  1  1
 prediction: nt nt nt t  t  t  nt t  nt nt nt t  nt t
 correct:    c  c  m  c  c  m  m  m  c  c  m  m  m  c

part d

sequence 1 [part d uses a 2-bit saturating counter: states 0-3, initial state 2, predict t when state >= 2]

 actual:     nt t  nt t  nt t  nt t  nt t  nt t  nt t
 state:      2  1  2  1  2  1  2  1  2  1  2  1  2  1  2
 prediction: t  nt t  nt t  nt t  nt t  nt t  nt t  nt
 correct:    m  m  m  m  m  m  m  m  m  m  m  m  m  m

sequence 2

 actual:     t  t  nt t  t  nt t  t  nt t  t  nt t  t
 state:      2  3  3  2  3  3  2  3  3  2  3  3  2  3  3
 prediction: t  t  t  t  t  t  t  t  t  t  t  t  t  t
 correct:    c  c  m  c  c  m  c  c  m  c  c  m  c  c

sequence 3

 actual:     nt nt t  t  t  nt t  nt nt nt t  nt t  t
 state:      2  1  0  1  2  3  2  3  2  1  0  1  0  1  2
 prediction: t  nt nt nt t  t  t  t  t  nt nt nt nt nt
 correct:    m  c  m  m  c  m  c  m  m  c  m  c  m  m

part e

sequence 1 [part e uses the same 2-bit saturating counter, but with initial state 0]

 actual:     nt t  nt t  nt t  nt t  nt t  nt t  nt t
 state:      0  0  1  0  1  0  1  0  1  0  1  0  1  0  1
 prediction: nt nt nt nt nt nt nt nt nt nt nt nt nt nt
 correct:    c  m  c  m  c  m  c  m  c  m  c  m  c  m

sequence 2

 actual:     t  t  nt t  t  nt t  t  nt t  t  nt t  t
 state:      0  1  2  1  2  3  2  3  3  2  3  3  2  3  3
 prediction: nt nt t  nt t  t  t  t  t  t  t  t  t  t
 correct:    m  m  m  m  c  m  c  c  m  c  c  m  c  c

sequence 3

 actual:     nt nt t  t  t  nt t  nt nt nt t  nt t  t
 state:      0  0  0  1  2  3  2  3  2  1  0  1  0  1  2
 prediction: nt nt nt nt t  t  t  t  t  nt nt nt nt nt
 correct:    c  c  m  m  c  m  c  m  m  c  m  c  m  m
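the tables in parts c, d, and e can be generated with a small python sketch of an n-bit saturating-counter predictor (my own code, assuming "predict taken when the counter is in its upper half"; a 1-bit counter then just remembers the last outcome):

```python
def run_predictor(outcomes, n_bits, init_state):
    """simulate an n-bit saturating-counter branch predictor.
    outcomes is a list of 't'/'nt'; returns a list of 'c' (correct) / 'm' (mispredict)."""
    state, max_state = init_state, (1 << n_bits) - 1
    results = []
    for actual in outcomes:
        prediction = 't' if state > max_state // 2 else 'nt'
        results.append('c' if prediction == actual else 'm')
        # saturating increment on taken, decrement on not taken
        state = min(state + 1, max_state) if actual == 't' else max(state - 1, 0)
    return results

seq1 = ['nt', 't'] * 7
seq2 = ['t', 't', 'nt'] * 4 + ['t', 't']
seq3 = 'nt nt t t t nt t nt nt nt t nt t t'.split()

print(run_predictor(seq1, 1, 0).count('m'))   # part c, sequence 1: 13 mispredictions
print(run_predictor(seq2, 2, 2).count('m'))   # part d, sequence 2: 4 mispredictions
print(run_predictor(seq3, 2, 0).count('m'))   # part e, sequence 3: 8 mispredictions
```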

problem 5

assume standard 5-stage pipeline. we have separate instruction and data caches. 25% of instrs are loads, 15-cycle cache miss penalty. 30% of instrs are branches, 60% of branches are taken, branches resolve in execute. btb has 80% hit rate, btb generates predictions with 70% accuracy.

part a

if we have 15% instruction cache miss rate, and 20% data cache miss rate, what is our overall cpi?

first, let's figure out the average penalty for branches. there are six cases:

  1. branch was taken, btb hit and btb predicts taken. this happens 60% * 80% * 70% = 33.6% of the time
  2. branch was taken, btb hit and btb predicts not taken. this happens 60% * 80% * 30% = 14.4% of the time
  3. branch was taken, btb miss. this happens 20% * 60% = 12% of the time
  4. branch was not taken, btb hit and btb predicts taken. this happens 40% * 80% * 30% = 9.6% of the time
  5. branch was not taken, btb hit and btb predicts not taken. this happens 40% * 80% * 70% = 22.4% of the time
  6. branch was not taken, btb miss. this happens 20% * 40% = 8% of the time

now let's figure out how long we need to stall in each case.

  1. branch was taken, btb hit and btb predicts taken. we use our btb in fetch, so there is no penalty for this case - we correctly predict the next pc of the branch instruction. so the penalty is zero cycles
  2. branch was taken, btb hit and btb predicts not taken. we generate an incorrect prediction for the next pc of the branch instruction, so we spend two cycles fetching the wrong instructions. after the branch resolves, we realize we made a mistake, and we undo. so the penalty is two cycles
  3. branch was taken, btb miss. we will fetch pc+4 next cycle, and pc+8 the cycle after that. these are not the right instructions to fetch. after the branch resolves, we realize we made a mistake, and we undo. so the penalty is two cycles
  4. branch was not taken, btb hit and btb predicts taken. we generate an incorrect prediction for the next pc of the branch instruction, so we spend two cycles fetching the wrong instructions. after the branch resolves, we realize we made a mistake, and we undo. so the penalty is two cycles
  5. branch was not taken, btb hit and btb predicts not taken. we use our btb in fetch, so there is no penalty for this case - we correctly predict the next pc of the branch instruction. so the penalty is zero cycles
  6. branch was not taken, btb miss. we will fetch pc+4 next cycle, and pc+8 the cycle after that. these are the right instructions to fetch. when the branch resolves, we realize that we were fetching the right instructions, so there is no penalty. so the penalty is zero cycles

so the average number of penalty cycles for branch instructions is 33.6% * 0 + 14.4% * 2 + 12% * 2 + 9.6% * 2 + 22.4% * 0 + 8% * 0 = 0.72 cycles

we need to figure out the average number of penalty cycles. there are three main sources of penalty cycles for this problem:

  1. instruction cache - there's a 15% chance we'll stall for 15 cycles when fetching an instruction. note that every instruction must be fetched, so this penalty applies to all instructions [100% of instructions]
  2. data cache - there's a 20% chance we'll stall for 15 cycles when executing a load instruction. this penalty only applies to load instructions, which are 25% of our instructions
  3. branches - the average penalty for a branch instruction is 0.72 cycles [we just computed this]. this penalty applies only to branch instructions, which are 30% of our instructions

so our overall cpi is 1 + 100% * 15% * 15 + 25% * 20% * 15 + 30% * 0.72 = 4.216
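here's a quick python sanity check of this arithmetic (just the numbers from this problem, laid out my own way):

```python
# branch penalty: weighted average over the six cases above
cases = [
    (0.60 * 0.80 * 0.70, 0),  # taken,     btb hit, predicted taken
    (0.60 * 0.80 * 0.30, 2),  # taken,     btb hit, predicted not taken
    (0.60 * 0.20,        2),  # taken,     btb miss
    (0.40 * 0.80 * 0.30, 2),  # not taken, btb hit, predicted taken
    (0.40 * 0.80 * 0.70, 0),  # not taken, btb hit, predicted not taken
    (0.40 * 0.20,        0),  # not taken, btb miss
]
branch_penalty = sum(prob * cycles for prob, cycles in cases)
print(round(branch_penalty, 2))   # 0.72

cpi = (1                          # base cpi
       + 1.00 * 0.15 * 15         # i-cache misses (every instruction is fetched)
       + 0.25 * 0.20 * 15         # d-cache misses (loads only)
       + 0.30 * branch_penalty)   # branch penalty (branches only)
print(round(cpi, 3))              # 4.216
```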

part b

suppose we decrease the instruction cache miss rate to 10%, decrease the data cache miss rate to 15%, split decode into two stages, and increase cycle time by 25%. what is the speedup of this new processor?

first let's recompute the average branch penalty. the only change that affects branches is the split decode. since our branches resolve in execute, this means it now takes us three cycles to figure out what a branch instruction really does. this means that whenever we predict incorrectly, the penalty increases from two cycles to three cycles.

so the new average number of penalty cycles for branch instructions is 33.6% * 0 + 14.4% * 3 + 12% * 3 + 9.6% * 3 + 22.4% * 0 + 8% * 0 = 1.08 cycles

the changes to the cache miss rates affect our overall cpi in fairly obvious ways... our new overall cpi is 1 + 100% * 10% * 15 + 25% * 15% * 15 + 30% * 1.08 = 3.3865

now we need to compute speedup.

            extime_old     ic_old * cpi_old * ct_old 
 speedup = ------------ = ---------------------------
            extime_new     ic_new * cpi_new * ct_new

for this problem, the instruction count does not change, so ic_old == ic_new. also, we know that ct_new = ct_old + 25% * ct_old [cycle time increases by 25%]. so:

            cpi_old * ct_old     4.216 * ct_old
 speedup = ------------------ = -----------------------------------
            cpi_new * ct_new     3.3865 * (ct_old + 25% * ct_old)

                 4.216
 speedup = ---------------- = 0.996
            3.3865 * 1.25

so our new processor is actually about 0.4% slower
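a quick python check of both cpis and the speedup (the 36% weight below is the fraction of branches that mispredict, from the six cases in part a):

```python
mispredict_frac = 0.144 + 0.12 + 0.096   # cases 2, 3, and 4 from part a

# old machine: 2-cycle mispredict penalty; new machine: 3-cycle penalty
cpi_old = 1 + 0.15 * 15 + 0.25 * 0.20 * 15 + 0.30 * (mispredict_frac * 2)
cpi_new = 1 + 0.10 * 15 + 0.25 * 0.15 * 15 + 0.30 * (mispredict_frac * 3)

# instruction count is unchanged and ct_new = 1.25 * ct_old, so ct cancels
speedup = cpi_old / (cpi_new * 1.25)
print(round(cpi_old, 4), round(cpi_new, 4), round(speedup, 3))   # 4.216 3.3865 0.996
```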

problem 6

we want to implement the following instruction in the multicycle datapath:

maxstore rt, rs, immed

executing this instruction will have the following effect:

if rt > M[imm + rs]
  M[imm + rs] = rt

i'm going to go through this kind of fast, because most of you seem to understand the multicycle cpu. this instruction is a load followed by a conditional store: we need to load M[imm+rs], and compare it to rt. if rt is bigger than M[imm+rs], we need to store rt into M[imm+rs].

let's figure out what we need to add to our datapath. our datapath can load M[imm+rs] just fine. the value of M[imm+rs] will end up in the memory data register [mdr]. we need to compare this loaded value to rt. we have to use the alu to do this comparison. rt is already one of the options into the second input of the alu, but mdr is not. so we'll need to make the value in the mdr one of the options to the first input of the alu.

next, we need to compare the values of mdr and rt, and use the result of the comparison to figure out if we want to store rt. a tricky point here is that our effective address is currently in aluout, and we'd really like that value to stay there - we'll need it if we want to store rt. so i'm going to add an aluoutwrite signal that decides whether we're going to write to aluout or not.

all that's left is essentially a conditional store, and an easy way to do this is to replace the memwrite signal with some logic that chooses whether we actually want to write to memory.

here's the three changes in more detail:

  1. add a third input to the alusrca mux [adding input number 2], and add a wire from the mdr output to this new input on alusrca
  2. add a write enable signal to aluout. we want to always write aluout unless we're doing a comparison for our maxstore instruction, so i'm going to negate the write enable, and call the signal "DoNotWriteALUOut"
  3. replace memwrite with some logic to figure out if we really want to write to memory. we want the usual memwrite signal to determine if we're writing to memory or not, except when we're running our maxstore instruction. the logic will look something like "MemWrite AND (!MaxStore OR (MaxStore AND ALU31))", where ALU31 is the high bit of the alu's output. note that i am not reading from the aluout register - i'm reading directly from the output of the alu
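here's a tiny python sanity check of that write-enable logic (my own sketch, modeling the control signals as booleans and the alu as 32-bit subtraction; it's not part of the answer, just a way to convince yourself the gating is right):

```python
def mem_write_enable(memwrite, maxstore, alu31):
    """the replacement logic for the memory write enable:
    MemWrite AND (!MaxStore OR (MaxStore AND ALU31))."""
    return memwrite and ((not maxstore) or (maxstore and alu31))

def alu31_for_maxstore(mdr, rt):
    """execute3 runs mdr - rt; alu31 is the sign bit of the 32-bit result,
    so it's 1 exactly when rt > mdr (ignoring overflow)."""
    return (((mdr - rt) & 0xFFFFFFFF) >> 31) & 1

# ordinary store: maxstore=0, memory is written whenever memwrite is asserted
assert mem_write_enable(True, False, 0)
# maxstore with rt > M[imm+rs]: alu31 = 1, so the store happens
assert mem_write_enable(True, True, alu31_for_maxstore(mdr=5, rt=9))
# maxstore with rt <= M[imm+rs]: alu31 = 0, so the store is suppressed
assert not mem_write_enable(True, True, alu31_for_maxstore(mdr=9, rt=5))
```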

we need to figure out how the control signals need to be set on each cycle of execution for our new instruction. the first four cycles are the same as a load instruction. the state machine looks like this:

  1. fetch:
    iord=0
    alusrca=0
    alusrcb=1
    memread=1
    irwrite=1
    aluop=add
    pcwrite=1
  2. decode:
    alusrca=0
    alusrcb=3
    aluop=add
  3. execute1: [compute effective address]
    alusrca=1
    alusrcb=2
    aluop=add
  4. execute2: [load M[imm + rs]]
    iord=1
    memread=1
  5. execute3: [compare M[imm + rs] and rt]
    alusrca=2
    alusrcb=0
    aluop=sub
    DoNotWriteALUOut=1
  6. execute4: [conditional store - we need to make sure we push the same inputs to the alu so the value of ALU31 is stable]
    alusrca=2
    alusrcb=0
    aluop=sub
    DoNotWriteALUOut=1
    iord=1
    memwrite=1
    maxstore=1