52-bit virtual address, 16k cache, 64 byte blocksize. byte addressable machine.
assume direct mapped cache. how many bits for tag, index, offset? how much memory (in bytes) is needed for tags in the cache?
we need log2(blocksize) bits for the offset. blocksize is 64 bytes, so we need 6 bits for the offset.
we need log2(# sets) bits for the index. how many sets do we have? it's direct mapped (associativity=1), it holds 16k of data, and our blocksize is 64 bytes. so we have 16k/64 = (16*1024)/64 = 256 sets. log2(256) = 8, so we need 8 bits for the index.
all the remaining bits in each address are the tag. so we need 52-8-6 = 38 bits for the tags.
we need to store a tag for each piece of data in our cache. we have 256 sets in our direct mapped cache, so we are storing 256 pieces of data, so we need 256 tags. 256 * 38 bits = 9728 bits = 1216 bytes of memory to store tags.
assume the cache is 8-way set associative. how many bits for tag, index, offset? how much memory is needed to store the tags?
we need log2(blocksize) bits for the offset. blocksize is 64 bytes, so we still need 6 bits for the offset.
we need log2(# sets) bits for the index. how many sets do we have? it's 8-way (associativity=8), it holds 16k of data, and our blocksize is 64 bytes. so we have 16k/(64*8) = (16*1024)/(64*8) = 32 sets. log2(32) = 5, so we need 5 bits for the index.
all the remaining bits in each address are the tag. so we need 52-5-6 = 41 bits for the tags.
we need to store a tag for each piece of data in our cache. we have 32 sets in our 8-way cache, so we are storing 32*8 pieces of data, so we need 256 tags. 256 * 41 bits = 10496 bits = 1312 bytes of memory to store tags.
suppose the cache is physically tagged. the processor supports at most 512mb of memory. if our cache is 16k 8-way set associative with 64 byte blocks, how many tag bits do we need for each cache block?
we need one tag for each cache block. so we need to figure out how many bits are needed for each tag.
we need log2(blocksize) bits for the offset. blocksize is 64 bytes, so we still need 6 bits for the offset.
we need log2(# sets) bits for the index. how many sets do we have? it's 8-way (associativity=8), it holds 16k of data, and our blocksize is 64 bytes. so we have 16k/(64*8) = (16*1024)/(64*8) = 32 sets. log2(32) = 5, so we need 5 bits for the index.
all the remaining bits in each address are the tag. how many bits do we need for each physical address? if we have at most 512mb of memory, we need log2(512mb) = log2(512*1024*1024) = 29 bits. so we need 29-5-6 = 18 bits for each tag.
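the bit bookkeeping for all three configurations can be sketched in python (the cache_bits helper is mine, not from the original problem):

```python
import math

# splits an address into tag/index/offset widths and totals the tag
# storage (one tag per block) for a cache of the given size/associativity
def cache_bits(addr_bits, cache_bytes, block_bytes, assoc):
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // assoc
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(num_sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits, num_blocks * tag_bits

# 16k direct mapped, 52-bit virtual addresses
print(cache_bits(52, 16 * 1024, 64, 1))  # (38, 8, 6, 9728) -> 1216 bytes of tags
# 16k 8-way, 52-bit virtual addresses
print(cache_bits(52, 16 * 1024, 64, 8))  # (41, 5, 6, 10496) -> 1312 bytes of tags
# 16k 8-way, physically tagged, 29-bit physical addresses
print(cache_bits(29, 16 * 1024, 64, 8))  # (18, 5, 6, 4608)
```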
we have a 64-byte cache with 16 bytes per block. memory is byte addressable. cache is initially empty. use the following address stream: 80,111,60,94,112,35,60,45,112,10,70,80
how many cache hits are there for a direct mapped cache? which addresses are hits and which are misses?
our cache is 64 bytes, 16 byte blocks, direct mapped (associativity=1). this means that we have 64/16 = 4 sets.
so we'll need log2(blocksize) = log2(16) = 4 bits for offset, and log2(# sets) = log2(4) = 2 bits for index.
now let's take a look at what happens for each of our accesses: [address (binary) is just the address written out in binary, with spaces inserted to separate the tag, index, and offset. "t i o" = "tag index offset"]
      |  address  |
      | (binary)  |
 addr |   t i o   | status
------+-----------+-------------------------------------------------
   80 | 1 01 0000 | miss. store in set 1, tag=1
  111 | 1 10 1111 | miss. store in set 2, tag=1
   60 | 0 11 1100 | miss. store in set 3, tag=0
   94 | 1 01 1110 | hit. we have the block with tag=1 in set 1.
  112 | 1 11 0000 | miss. store in set 3, tag=1. data with tag=0 evicted
   35 | 0 10 0011 | miss. store in set 2, tag=0. data with tag=1 evicted
   60 | 0 11 1100 | miss. store in set 3, tag=0. data with tag=1 evicted
   45 | 0 10 1101 | hit. we have the block with tag=0 in set 2.
  112 | 1 11 0000 | miss. store in set 3, tag=1. data with tag=0 evicted
   10 | 0 00 1010 | miss. store in set 0, tag=0
   70 | 1 00 0110 | miss. store in set 0, tag=1. data with tag=0 evicted
   80 | 1 01 0000 | hit. we have the block with tag=1 in set 1.
there are 3 hits.
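as a sanity check, here's a small python sketch (mine, not part of the original solution) that replays the stream through this direct mapped cache:

```python
# replay the address stream through a direct mapped cache:
# 64 bytes total, 16-byte blocks -> 4 sets, 1 block per set
def simulate_direct_mapped(addresses, num_sets=4, block_bytes=16):
    cache = {}  # index -> tag currently stored in that set
    results = []
    for addr in addresses:
        index = (addr // block_bytes) % num_sets
        tag = addr // (block_bytes * num_sets)
        if cache.get(index) == tag:
            results.append('hit')
        else:
            results.append('miss')
            cache[index] = tag  # fill (and implicitly evict)
    return results

stream = [80, 111, 60, 94, 112, 35, 60, 45, 112, 10, 70, 80]
r = simulate_direct_mapped(stream)
print(r.count('hit'))  # 3
```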
how many cache hits are there for a 2-way set associative cache with LRU replacement? which addresses are hits and which are misses?
our cache is 64 bytes, 16 byte blocks, 2-way set associative. this means that we have 64/(16*2) = 2 sets.
so we'll need log2(blocksize) = log2(16) = 4 bits for offset, and log2(# sets) = log2(2) = 1 bit for index.
now let's take a look at what happens for each of our accesses: [address (binary) is just the address written out in binary, with spaces inserted to separate the tag, index, and offset. "t i o" = "tag index offset"]
      |  address  |
      | (binary)  |
 addr |   t i o   | status
------+-----------+-------------------------------------------------
   80 | 10 1 0000 | miss. store in set 1, way 0, tag=10
  111 | 11 0 1111 | miss. store in set 0, way 0, tag=11
   60 | 01 1 1100 | miss. store in set 1, way 1, tag=01
   94 | 10 1 1110 | hit. we have the block with tag=10 in set 1, way 0
  112 | 11 1 0000 | miss. store in set 1, way 1, tag=11. evict tag=01
   35 | 01 0 0011 | miss. store in set 0, way 1, tag=01
   60 | 01 1 1100 | miss. store in set 1, way 0, tag=01. evict tag=10
   45 | 01 0 1101 | hit. we have the block with tag=01 in set 0, way 1
  112 | 11 1 0000 | hit. we have the block with tag=11 in set 1, way 1
   10 | 00 0 1010 | miss. store in set 0, way 0, tag=00. evict tag=11
   70 | 10 0 0110 | miss. store in set 0, way 1, tag=10. evict tag=01
   80 | 10 1 0000 | miss. store in set 1, way 0, tag=10. evict tag=01
there are 3 hits
how many cache hits are there for a fully associative cache with LRU replacement? which addresses are hits and which are misses?
our cache is 64 bytes, 16 byte blocks, fully associative. fully associative means that the associativity = # blocks, and # sets = 1. so we have 1 set, and 4 ways.
so we'll need log2(blocksize) = log2(16) = 4 bits for offset, and log2(# sets) = log2(1) = 0 bits for index. this means there is no index [no bits for the index].
now let's take a look at what happens for each of our accesses: [address (binary) is just the address written out in binary, with a space inserted to separate the tag and offset. "t o" = "tag offset"]
      | address  |
      | (binary) |
 addr |   t o    | status
------+----------+-------------------------------------------------
   80 | 101 0000 | miss. store in set 0, way 0, tag=101
  111 | 110 1111 | miss. store in set 0, way 1, tag=110
   60 | 011 1100 | miss. store in set 0, way 2, tag=011
   94 | 101 1110 | hit. we have the data with tag=101 in set 0, way 0
  112 | 111 0000 | miss. store in set 0, way 3, tag=111
   35 | 010 0011 | miss. store in set 0, way 1, tag=010. evict tag=110
   60 | 011 1100 | hit. we have the data with tag=011 in set 0, way 2
   45 | 010 1101 | hit. we have the data with tag=010 in set 0, way 1
  112 | 111 0000 | hit. we have the data with tag=111 in set 0, way 3
   10 | 000 1010 | miss. store in set 0, way 0, tag=000. evict tag=101
   70 | 100 0110 | miss. store in set 0, way 2, tag=100. evict tag=011
   80 | 101 0000 | miss. store in set 0, way 1, tag=101. evict tag=010
there are 4 hits
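all three traces can be checked with one small LRU simulator sketch in python (my own helper, assuming the same 16-byte blocks as above). note it tracks recency with list order rather than fixed way slots, so way numbers may differ from the tables, but the hit/miss sequence and counts match:

```python
# LRU set associative cache simulator: each set is a list of tags,
# ordered least recently used (front) to most recently used (back)
def simulate_lru(addresses, num_sets, assoc, block_bytes=16):
    sets = [[] for _ in range(num_sets)]
    hits = 0
    for addr in addresses:
        index = (addr // block_bytes) % num_sets
        tag = addr // (block_bytes * num_sets)
        ways = sets[index]
        if tag in ways:
            hits += 1
            ways.remove(tag)   # pull it out so we can re-append as MRU
        elif len(ways) == assoc:
            ways.pop(0)        # set is full: evict the LRU tag
        ways.append(tag)       # insert/refresh as most recently used
    return hits

stream = [80, 111, 60, 94, 112, 35, 60, 45, 112, 10, 70, 80]
print(simulate_lru(stream, num_sets=2, assoc=2))  # 3
print(simulate_lru(stream, num_sets=1, assoc=4))  # 4
```

with num_sets=4, assoc=1 the same function reproduces the direct mapped answer (3 hits) as well.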
assume standard 5-stage pipeline, branches resolve in decode, branch delay slot is used. consider the following code: [numbers in brackets like [1] are endpoints for arrows] [sorry it's kinda hard for me to draw arrows here :)]
      lw   [00] r3, 0( [01] r2)
loop: sw   [02] r3, 0( [03] r2)
      sub  [04] r1, [05] r3, [06] r4
      lw   [07] r3, 0( [08] r1)
      sw   [09] r1, 0( [10] r3)
      subi [11] r2, [12] r2, 4
      bnez [13] r2, loop
      lw   [14] r3, 0( [15] r2)
      sub  [16] r5, [17] r6, [18] r7
      add  [19] r7, [20] r8, [21] r9
      sub  [22] r2, [23] r3, [24] r2
draw arrows and label all raw, waw, and war non-loop carried dependencies in the above code example.
a * indicates dependencies that are hazards for this architecture. in other words, our processor will not perform correctly unless we forward or stall for the * dependencies.
draw a pipeline cycle timing diagram for the above code. start with the first lw before the loop, run one loop iteration, and stop at the first store of the second iteration. show all forwarding needed to eliminate stalls. what cycle does the first sw in the second loop iteration enter the execute stage?
| instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| lw r3, 0(r2) | f | d | x | m | [1] w | | | | | | | | | | |
| sw r3, 0(r2) | | f | d | x | [2] m | w | | | | | | | | | |
| sub r1, r3, r4 | | | f | d | [3] x | [4] m | w | | | | | | | | |
| lw r3, 0(r1) | | | | f | d | [5] x | m | [6] w | | | | | | | |
| sw r1, 0(r3) | | | | | f | d | s | [7] x | m | w | | | | | |
| subi r2, r2, 4 | | | | | | f | s | d | x | [8] m | w | | | | |
| bnez r2, loop | | | | | | | f | s | s | [9] d | x | m | w | | |
| lw r3, 0(r2) | | | | | | | | | | f | d | x | m | [10] w | |
| sw r3, 0(r2) | | | | | | | | | | | f | d | x | [11] m | w |
forwarding arrows:
- [1] → [2]: r3 from the lw's writeback to the sw's memory stage (cycle 5)
- [1] → [3]: r3 from the lw's writeback to the sub's execute stage (cycle 5)
- [4] → [5]: r1 from the sub's memory stage to the lw's execute stage (cycle 6)
- [6] → [7]: r3 from the lw's writeback to the sw's execute stage (cycle 8)
- [8] → [9]: r2 from the subi's memory stage to the bnez's decode stage (cycle 10)
- [10] → [11]: r3 from the lw's writeback to the sw's memory stage (cycle 14)
the first sw in the second loop iteration enters the execute stage in cycle 13
notes:
| cycle number | actions |
|---|---|
| 5 | value of r3 was loaded in cycle 4. sw needs value of r3 for storage in memory stage, and sub needs value of r3 for subtraction in execute stage. |
| 6 | value of r1 was computed in cycle 5. lw needs value of r1 for effective address calculation in cycle 6. |
| 7 | stall because sw needs value of r3 for effective address calculation, and it won't be loaded in time. we don't need to forward the value of r1 to sw because we re-run the decode stage in this cycle |
| 8 | value of r3 was loaded in cycle 7, sw needs value of r3 for effective address calculation in cycle 8 |
| 9 | stall because bnez needs value of r2 to resolve branch in decode, but r2 won't be computed in time |
| 10 | value of r2 was computed in cycle 9, bnez needs value of r2 to resolve branch in cycle 10. fetch lw this cycle because it is in the branch delay slot |
| 14 | value of r3 loaded in cycle 13, sw needs value of r3 to perform store in cycle 14 |
three pieces of information are needed to predict the next pc of a branch instruction that we're currently fetching. what are these three pieces of information?
see "Advanced Pipelining" lecture notes, bottom of page 8. in short, to predict the next pc while fetching we need to know: (1) whether the instruction we're fetching is a branch at all, (2) whether that branch will be taken, and (3) the branch's target address.
for a 1-bit predictor (state 0 = predict not taken, state 1 = predict taken), starting in state 0:

sequence 1
| actual: | nt | t | nt | t | nt | t | nt | t | nt | t | nt | t | nt | t | |||||||||||||||
| state: | 0 | → | 0 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 |
| prediction: | nt | nt | t | nt | t | nt | t | nt | t | nt | t | nt | t | nt | |||||||||||||||
| correct: | c | m | m | m | m | m | m | m | m | m | m | m | m | m |
sequence 2
| actual: | t | t | nt | t | t | nt | t | t | nt | t | t | nt | t | t | |||||||||||||||
| state: | 0 | → | 1 | → | 1 | → | 0 | → | 1 | → | 1 | → | 0 | → | 1 | → | 1 | → | 0 | → | 1 | → | 1 | → | 0 | → | 1 | → | 1 |
| prediction: | nt | t | t | nt | t | t | nt | t | t | nt | t | t | nt | t | |||||||||||||||
| correct: | m | c | m | m | c | m | m | c | m | m | c | m | m | c |
sequence 3
| actual: | nt | nt | t | t | t | nt | t | nt | nt | nt | t | nt | t | t | |||||||||||||||
| state: | 0 | → | 0 | → | 0 | → | 1 | → | 1 | → | 1 | → | 0 | → | 1 | → | 0 | → | 0 | → | 0 | → | 1 | → | 0 | → | 1 | → | 1 |
| prediction: | nt | nt | nt | t | t | t | nt | t | nt | nt | nt | t | nt | t | |||||||||||||||
| correct: | c | c | m | c | c | m | m | m | c | c | m | m | m | c |
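a minimal python sketch of the 1-bit predictor (my own helper, not from the notes) reproduces the sequence 1 row above:

```python
# 1-bit predictor: the state is simply the last outcome
# (0 = not taken, 1 = taken), and we predict that outcome repeats
def one_bit(outcomes, state=0):
    results = []
    for taken in outcomes:
        prediction = (state == 1)
        results.append('c' if prediction == taken else 'm')
        state = 1 if taken else 0  # remember the actual outcome
    return results

seq1 = [False, True] * 7  # nt t nt t ...
print(''.join(one_bit(seq1)))  # cmmmmmmmmmmmmm
```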
for a 2-bit saturating counter (states 0-3; states 2 and 3 predict taken), starting in state 2:

sequence 1
| actual: | nt | t | nt | t | nt | t | nt | t | nt | t | nt | t | nt | t | |||||||||||||||
| state: | 2 | → | 1 | → | 2 | → | 1 | → | 2 | → | 1 | → | 2 | → | 1 | → | 2 | → | 1 | → | 2 | → | 1 | → | 2 | → | 1 | → | 2 |
| prediction: | t | nt | t | nt | t | nt | t | nt | t | nt | t | nt | t | nt | |||||||||||||||
| correct: | m | m | m | m | m | m | m | m | m | m | m | m | m | m |
sequence 2
| actual: | t | t | nt | t | t | nt | t | t | nt | t | t | nt | t | t | |||||||||||||||
| state: | 2 | → | 3 | → | 3 | → | 2 | → | 3 | → | 3 | → | 2 | → | 3 | → | 3 | → | 2 | → | 3 | → | 3 | → | 2 | → | 3 | → | 3 |
| prediction: | t | t | t | t | t | t | t | t | t | t | t | t | t | t | |||||||||||||||
| correct: | c | c | m | c | c | m | c | c | m | c | c | m | c | c |
sequence 3
| actual: | nt | nt | t | t | t | nt | t | nt | nt | nt | t | nt | t | t | |||||||||||||||
| state: | 2 | → | 1 | → | 0 | → | 1 | → | 2 | → | 3 | → | 2 | → | 3 | → | 2 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 | → | 2 |
| prediction: | t | nt | nt | nt | t | t | t | t | t | nt | nt | nt | nt | nt | |||||||||||||||
| correct: | m | c | m | m | c | m | c | m | m | c | m | c | m | m |
for a 2-bit saturating counter (states 0-3; states 2 and 3 predict taken), starting in state 0:

sequence 1
| actual: | nt | t | nt | t | nt | t | nt | t | nt | t | nt | t | nt | t | |||||||||||||||
| state: | 0 | → | 0 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 |
| prediction: | nt | nt | nt | nt | nt | nt | nt | nt | nt | nt | nt | nt | nt | nt | |||||||||||||||
| correct: | c | m | c | m | c | m | c | m | c | m | c | m | c | m |
sequence 2
| actual: | t | t | nt | t | t | nt | t | t | nt | t | t | nt | t | t | |||||||||||||||
| state: | 0 | → | 1 | → | 2 | → | 1 | → | 2 | → | 3 | → | 2 | → | 3 | → | 3 | → | 2 | → | 3 | → | 3 | → | 2 | → | 3 | → | 3 |
| prediction: | nt | nt | t | nt | t | t | t | t | t | t | t | t | t | t | |||||||||||||||
| correct: | m | m | m | m | c | m | c | c | m | c | c | m | c | c |
sequence 3
| actual: | nt | nt | t | t | t | nt | t | nt | nt | nt | t | nt | t | t | |||||||||||||||
| state: | 0 | → | 0 | → | 0 | → | 1 | → | 2 | → | 3 | → | 2 | → | 3 | → | 2 | → | 1 | → | 0 | → | 1 | → | 0 | → | 1 | → | 2 |
| prediction: | nt | nt | nt | nt | t | t | t | t | t | nt | nt | nt | nt | nt | |||||||||||||||
| correct: | c | c | m | m | c | m | c | m | m | c | m | c | m | m |
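the 2-bit tables can be checked the same way; here's a python sketch of the saturating counter (my own helper), replaying sequence 2 with both initial states used above:

```python
# 2-bit saturating counter: states 0-3, where 2 and 3 predict taken;
# taken increments (saturating at 3), not-taken decrements (saturating at 0)
def two_bit(outcomes, state):
    results = []
    for taken in outcomes:
        prediction = (state >= 2)
        results.append('c' if prediction == taken else 'm')
        if taken:
            state = min(state + 1, 3)
        else:
            state = max(state - 1, 0)
    return results

seq2 = [True, True, False] * 4 + [True, True]  # t t nt t t nt ...
print(''.join(two_bit(seq2, state=2)))  # ccmccmccmccmcc
print(''.join(two_bit(seq2, state=0)))  # mmmmcmccmccmcc
```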
assume standard 5-stage pipeline. we have separate instruction and data caches. 25% of instrs are loads, 15-cycle cache miss penalty. 30% of instrs are branches, 60% of branches are taken, branches resolve in execute. btb has 80% hit rate, btb generates predictions with 70% accuracy.
if we have 15% instruction cache miss rate, and 20% data cache miss rate, what is our overall cpi?
first, let's figure out the average penalty for branches. there are six cases:
- btb hit, prediction correct, branch taken: 80% * 70% * 60% = 33.6% of branches
- btb hit, prediction wrong, branch taken: 80% * 30% * 60% = 14.4% of branches
- btb miss, branch taken: 20% * 60% = 12% of branches
- btb hit, prediction wrong, branch not taken: 80% * 30% * 40% = 9.6% of branches
- btb hit, prediction correct, branch not taken: 80% * 70% * 40% = 22.4% of branches
- btb miss, branch not taken: 20% * 40% = 8% of branches

now let's figure out how long we need to stall in each case. branches resolve in execute, so every time we fetch down the wrong path we lose 2 cycles: that's every mispredicted branch, plus every taken branch that missed in the btb. correctly predicted branches cost nothing, and a btb miss on a not-taken branch also costs nothing, because on a miss we just keep fetching sequentially.
so the average number of penalty cycles for branch instructions is 33.6% * 0 + 14.4% * 2 + 12% * 2 + 9.6% * 2 + 22.4% * 0 + 8% * 0 = 0.72 cycles
we need to figure out the average number of penalty cycles per instruction. there are three main sources of penalty cycles for this problem: instruction cache misses (every instruction fetches), data cache misses (loads only), and branch penalties (computed above).
so our overall cpi is 1 + 100% * 15% * 15 + 25% * 20% * 15 + 30% * 0.72 = 4.216
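spelling the cpi arithmetic out in python (only numbers from the problem statement; nothing else is assumed):

```python
# problem parameters
i_miss, d_miss, miss_penalty = 0.15, 0.20, 15
loads, branches = 0.25, 0.30

# average branch penalty: only the mispredicted / unknown-taken cases
# (14.4% + 12% + 9.6% of branches) pay the 2-cycle resolve-in-execute cost
branch_penalty = 0.144 * 2 + 0.12 * 2 + 0.096 * 2

cpi = (1
       + 1.0 * i_miss * miss_penalty      # every instruction hits the i-cache
       + loads * d_miss * miss_penalty    # only loads hit the d-cache
       + branches * branch_penalty)
print(cpi)  # ≈ 4.216
```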
suppose we decrease the instruction cache miss rate to 10%, decrease the data cache miss rate to 15%, split decode into two stages, and increase cycle time by 25%. what is the speedup of this new processor?
first let's recompute the average branch penalty. the only change that affects branches is the split decode. since our branches resolve in execute, this means it now takes us three cycles to figure out what a branch instruction really does. this means that whenever we predict incorrectly, the penalty increases from two cycles to three cycles.
so the new average number of penalty cycles for branch instructions is 33.6% * 0 + 14.4% * 3 + 12% * 3 + 9.6% * 3 + 22.4% * 0 + 8% * 0 = 1.08 cycles
the changes to the cache miss rates affect our overall cpi in fairly obvious ways... our new overall cpi is 1 + 100% * 10% * 15 + 25% * 15% * 15 + 30% * 1.08 = 3.3865
now we need to compute speedup.
              extime_old     ic_old * cpi_old * ct_old
    speedup = ------------ = ---------------------------
              extime_new     ic_new * cpi_new * ct_new

for this problem, the instruction count does not change, so ic_old == ic_new. also, we know that ct_new = ct_old + 25% * ct_old [cycle time increases by 25%]. so:

              cpi_old * ct_old     4.216 * ct_old
    speedup = ------------------ = ---------------------------
              cpi_new * ct_new     3.3865 * (1.25 * ct_old)

                  4.216
    speedup = ---------------- = 0.996
              3.3865 * 1.25

so our new processor is actually about 0.4% slower
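as a quick check in python (the cycle time cancels out, so only the ratio matters):

```python
# speedup = (cpi_old * ct_old) / (cpi_new * ct_new), with ct_new = 1.25 * ct_old
cpi_old, cpi_new = 4.216, 3.3865
ct_ratio = 1.25  # ct_new / ct_old

speedup = cpi_old / (cpi_new * ct_ratio)
print(round(speedup, 3))  # 0.996
```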
we want to implement the following instruction in the multicycle datapath:
maxstore rt, rs, immed
executing this instruction will have the following effect:
if rt > M[imm + rs]
    M[imm + rs] = rt
i'm going to go through this kind of fast, because most of you seem to understand the multicycle cpu. this instruction is a load followed by a conditional store: we need to load M[imm+rs], and compare it to rt. if rt is bigger than M[imm+rs], we need to store rt into M[imm+rs].
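as a behavioral sketch (python, with memory and the register file modeled as dicts; the names here are just for illustration, not part of the datapath):

```python
# maxstore rt, rs, imm: load M[imm + rs], compare against rt,
# and store rt back only if rt is bigger
def maxstore(mem, reg, rt, rs, imm):
    addr = imm + reg[rs]      # effective address (kept in aluout on the real datapath)
    if reg[rt] > mem[addr]:   # load + compare
        mem[addr] = reg[rt]   # conditional store

mem = {100: 5}
reg = {'r1': 100, 'r2': 7}
maxstore(mem, reg, 'r2', 'r1', 0)
print(mem[100])  # 7
```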
let's figure out what we need to add to our datapath. our datapath can load M[imm+rs] just fine. the value of M[imm+rs] will end up in the memory data register [mdr]. we need to compare this loaded value to rt. we have to use the alu to do this comparison. rt is already one of the options into the second input of the alu, but mdr is not. so we'll need to make the value in the mdr one of the options to the first input of the alu.
next, we need to compare the values of mdr and rt, and use the result of the comparison to figure out if we want to store rt. a tricky point here is that our effective address is currently in aluout, and we'd really like that value to stay there - we'll need it if we want to store rt. so i'm going to add a aluoutwrite signal that decides whether we're going to write to aluout or not.
all that's left is essentially a conditional store, and an easy way to do this is to replace the memwrite signal with some logic that chooses whether we actually want to write to memory.
here are the three changes in more detail:
1. make the mdr one of the options to the first input of the alu, so the loaded value can be compared against rt
2. add an aluoutwrite control signal, so the effective address sitting in aluout isn't clobbered during the comparison
3. replace the memwrite signal with logic that only asserts a memory write when the comparison says rt > M[imm+rs]
we need to figure out how the control signals need to be set on each cycle of execution for our new instruction. the first four cycles are the same as a load instruction. the state machine looks like this: