Midterm Review

February 2, 2004

problem 1

part a: a program executes 20,000,000 instructions on a 100 MHz single cycle processor. what's the execution time?

for single cycle processors, CPI=1. so:

 2e7 instructions     1 cycle        1 second
                  * ------------- * ---------- = .2 seconds
                    1 instruction   1e8 cycles

part b: we run the same program on a 400 MHz multicycle processor. what average CPI do we need to get the same performance as in part a?

we want the execution time of the program on our new processor to be the same as it was on the old processor. another way to look at it is that we're looking for a speedup of 1. so the setup looks like this:

 2e7 instructions     X cycles       1 second
                  * ------------- * ---------- = .2 seconds
                    1 instruction   4e8 cycles

solving for X, we need a CPI of 4.
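the same arithmetic as a small python sketch. the function names here are made up for illustration; the only thing taken from the problem is the relation time = instructions * cpi / clock rate.

```python
def execution_time(instructions, cpi, clock_hz):
    """seconds = instructions * (cycles / instruction) / (cycles / second)"""
    return instructions * cpi / clock_hz


def cpi_for_time(instructions, seconds, clock_hz):
    """solve the same equation for cpi instead of time"""
    return seconds * clock_hz / instructions


# part a: 2e7 instructions, cpi = 1, 100 MHz
t_single = execution_time(20_000_000, 1, 100e6)

# part b: same program, same time, 400 MHz
cpi_multi = cpi_for_time(20_000_000, t_single, 400e6)

print(t_single, cpi_multi)
```

the clock got 4x faster, so we can afford 4x as many cycles per instruction before we lose ground.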

problem 2

we have a multicycle cpu. our workload is as follows: 30% alu instrs, 30% mem instrs, 10% float instrs, 30% branch instrs. it takes us 1 cycle per alu instr, 10 cycles per mem instr, 20 cycles per float instr, and 5 cycles per branch instr. what's the average cpi?

this is just a weighted average of the individual CPI's:

.30 * 1 + .30 * 10 + .10 * 20 + .30 * 5 = 6.8

another way to think about this problem is like this: suppose we have a completely average program containing 100 instructions. how many cycles do we need to execute the program? well, we know there will be 30 alu instrs, 30 mem instrs, 10 float instrs, and 30 branch instrs, so:

30 alu instrs * 1 cycle / alu instr + 
30 mem instrs * 10 cycles / mem instr + 
10 float instrs * 20 cycles / float instr + 
30 branch instrs * 5 cycles / branch instr 
= 680 cycles

now, what was our CPI for our completely average program?

680 cycles / 100 instructions = 6.8 
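both ways of looking at it can be checked with a few lines of python. this is just a sketch; the dictionary layout and the name average_cpi are my own, the numbers come straight from the problem.

```python
# instruction class -> (fraction of workload, cycles per instruction)
mix = {
    "alu":    (0.30, 1),
    "mem":    (0.30, 10),
    "float":  (0.10, 20),
    "branch": (0.30, 5),
}


def average_cpi(mix):
    """weighted average of the per-class cpi values"""
    return sum(fraction * cycles for fraction, cycles in mix.values())


# second view: count cycles for an "average" 100-instruction program
counts = {"alu": 30, "mem": 30, "float": 10, "branch": 30}
total_cycles = sum(counts[name] * mix[name][1] for name in counts)

print(round(average_cpi(mix), 2), total_cycles, total_cycles / 100)
```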

problem 3

we divide all instructions into integer instrs and fp instrs. our program takes 100 seconds to execute on our old processor. we have a new processor that runs fp instrs 10x faster. our program executes in 70 seconds on the new processor. how much of our execution time on the old processor was spent running fp instrs?

for problems involving amdahl's law, i find it helpful to draw diagrams like these:

amdahl's law figure

let x be the number of seconds the old processor spent executing fp instrs. from the diagram, it's pretty clear that we can solve for x: the amount of time we spend executing integer instructions [100-x] will be the same on our new processor, because our improvement only affects floating point instructions, while the fp time shrinks to x/10. after we solve for x, the fraction of time we spent on the old processor doing fp instrs is just x/100.

to solve for x:

(100-x) + (x/10) = 70
...
x = 33.3 [call it 33]

to find the fraction of time we spent on the old processor doing fp instrs:

33/100 = 33%
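here's a small python sketch that solves the same equation for any old time, new time, and speedup. the function name enhanced_time is hypothetical; it just rearranges (old - x) + x/speedup = new.

```python
def enhanced_time(old_total, new_total, speedup):
    """time x the old machine spent in the part that got faster.

    rearranges (old_total - x) + x / speedup = new_total for x.
    """
    return (old_total - new_total) / (1 - 1 / speedup)


x = enhanced_time(100, 70, 10)   # seconds of fp work on the old processor
fraction = x / 100               # share of the old execution time

# sanity check: plug x back into the original equation
new_time = (100 - x) + x / 10

print(round(x, 2), round(fraction, 4), round(new_time, 2))
```

note that x works out to 33 and a third, which is where the rounded 33% comes from.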

problem 4

part a: given that 01000101000110001110000000000000 is a 32-bit ieee fp number, convert to decimal.

  1. locate the sign bit, the exponent, and the mantissa. our sign bit is 0, our exponent is 10001010, and our mantissa is 00110001110...
  2. write out the mantissa in binary. don't forget about the leading 1 that is not represented. our mantissa is 00110001110..., so we write down 1.0011000111
  3. convert the exponent to decimal. the exponent is excess-127, so you need to subtract 127 from it. our exponent is 138, subtracting 127 we get 11.
  4. the number computed in the last step tells you how far you need to shift the binary point in the mantissa, and in which direction. positive means right, negative means left. we shift the binary point in the mantissa right eleven places, and we get 100110001110.0
  5. after you've shifted the binary point, convert the mantissa to decimal. 100110001110.0 binary is 2446 decimal
  6. write down the negative sign if the sign bit was 1. our sign bit was zero, so we're done, our value is just 2446.
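the six steps above can be sketched in python and cross-checked against the standard struct module. the function name is made up, and this sketch only handles normalized numbers [no zeros, denormals, infinities, or NaNs].

```python
import struct


def decode_ieee754_single(bits):
    """decode a 32-character bit string as a normalized ieee single."""
    sign = int(bits[0])                # step 1: split the fields
    exponent = int(bits[1:9], 2)
    fraction = int(bits[9:], 2)
    significand = 1 + fraction / 2**23  # step 2: restore the hidden leading 1
    shift = exponent - 127              # step 3: remove the excess-127 bias
    value = significand * 2.0**shift    # steps 4 and 5: shift and convert
    return -value if sign else value    # step 6: apply the sign


bits = "01000101000110001110000000000000"
value = decode_ieee754_single(bits)

# cross-check: let struct reinterpret the same 4 bytes as a float
check = struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]

print(value, check)
```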

part b: give the 32-bit ieee fp representation of -43.265625

  1. write out the number in binary. -43.265625 is -101011.010001
  2. count how many times you need to shift the binary point to get 1.something. if you need to shift right, negate your count. given -101011.010001, we need to shift five places to the left to get -1.01011010001, so our count is 5.
  3. take the shifted binary number, and drop the negative sign [if any] and the leading 1. this is your mantissa. we had -1.01011010001, so we get 01011010001
  4. add 127 to your count. we shifted 5, so we get 132
  5. convert this number to binary. 132 is 10000100
  6. write down sign bit, exponent, and mantissa. encode in hex if desired [it'll make the grader's life a little easier :)]. we have
    1 10000100 010110100010... which is 0xc22d1000
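to check an answer like this one, python's struct module can pack a float into its 32-bit pattern. this is a sketch for checking your work, not the hand method above; the function name encode_ieee754_single is my own.

```python
import struct


def encode_ieee754_single(x):
    """return (sign, exponent, mantissa, hex) strings for x as an ieee single."""
    # pack as a big-endian float, then reread the same bytes as an unsigned int
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return bits[0], bits[1:9], bits[9:], f"0x{word:08x}"


sign, exponent, mantissa, hex_word = encode_ieee754_single(-43.265625)
print(sign, exponent, mantissa, hex_word)
```

-43.265625 has an exact binary expansion, so nothing gets rounded here. for a number like 0.1, the packed bits would be the nearest single, not the exact value.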

problem 5

i'm using the following single cycle datapath diagram from the class webpage:

single cycle cpu datapath

we want to add an instruction loop r1, r2, offset. this new instruction has the same effect as the following two instructions:

addi r1, r1, 1
bne r1, r2, offset

to answer this question, (1) give the above sequence of instructions using the rs, rt, rd, and immediate fields from the immediate format, (2) draw the parts of the datapath that have been changed, and (3) give the state for the control for this instruction in the modified datapath.

for part (1), recall that i-type instructions use the rs, rt, and immediate fields of the instruction. so:

addi rs, rs, 1
bne rs, rt, immediate

for part (2), we need to look at our single cycle datapath, and figure out what we need to add to support this new instruction.

the new instruction that we're adding is a modified branch instruction. so to begin, let's take a look at how the datapath is used for branch instructions.

when we execute a normal bne instruction (like bne $1, $2, 7), we use the main alu to figure out if $1 and $2 are equal or not. we do this by subtracting $2 from $1. in other words, we set alusrc=1, aluop=subtract, and we check the "zero" output on the alu. the "zero" output tells us if the alu result is zero or not: zero=1 means the output was zero, zero=0 means the output is not zero. if zero=0, then we know that registers $1 and $2 are not equal, and therefore we know that the branch is taken.

remember that there is some additional logic required for branch instructions that is not shown on the datapath [it's discussed in the book]. the problem is that the control logic doesn't know the value of pcsrc, because it depends on whether the branch was taken or not. if the branch is taken, we want pcsrc=0, otherwise we want pcsrc=1.

the figure below shows the two additional gates needed to support bne instructions. try pushing a few values of "bne", "zero", and "pcsrc" through these two gates. if bne=0, the value of pcsrc determines which way the mux will go. if bne=1 and pcsrc=1, then the value of zero determines which way the mux will go.

single cycle cpu with bne support

so to summarize, when we execute a normal bne instruction, we set bne=1, pcsrc=1, alusrc=1, aluop=subtract, regwrite=0, memread=0, memwrite=0.

now we need to figure out how our new instruction is different from normal bne instructions. we need to compute rs+1 before comparing with rt, and we need to store rs+1 back into the register file.

to compute rs+1, we're going to need another adder. we can't use the main alu because we need that to do the comparison. it needs to take the value of rs, add 1, and its output needs to go into the main alu to be compared with rt. we need to mux the output of this adder and the original value of rs, because we only want rs+1 going into the alu for our new instruction. we'll call the control signal on this new mux "addone".

we also need to write the value of rs+1 back into register rs. to do this, we need to be able to write to register rs [our datapath can currently only write to registers rt or rd]. so we need to make the "regdst" mux bigger... rs must be one of our options. we also need to get the value of rs+1 into the "write data" port on the register file. to do this, we can make the "memtoreg" mux bigger, and make rs+1 one of our options.

these changes are shown below:

modified single cycle cpu

how do we set the control signals for this instruction? we need to choose pc+4 or the branch target [the old pcsrc signal] based on the value of the "zero" output of the alu, so we set bne=1 and pcsrc=1. we need rs+1 [not rs] going into the alu, so we set addone=1. the other alu input must be rt, so we set alusrc=1. the comparison is a subtraction, so we set aluop=subtract. we need to write rs+1 into rs, so we set memtoreg=2, regdst=2, and regwrite=1. we don't touch memory, so memread=0 and memwrite=0.
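to convince yourself of what the instruction is supposed to do, here's a behavioral sketch in python. it models only the architectural effect [the addi followed by the bne], not the gates or muxes. the function name, the dictionary register file, and the 32-bit wraparound are my assumptions; the branch target is taken as pc+4+(immediate*4), as in the standard mips branch.

```python
def loop_instr(regs, pc, rs, rt, immediate):
    """behavioral model of: loop rs, rt, immediate. returns the next pc."""
    # the extra adder: rs+1, written back into rs
    regs[rs] = (regs[rs] + 1) & 0xFFFFFFFF
    # the main alu compares rs+1 with rt; zero=0 means "not equal"
    if regs[rs] != regs[rt]:
        return pc + 4 + (immediate << 2)   # branch taken
    return pc + 4                          # fall through


# a one-instruction loop: immediate = -1 branches back to the loop itself
regs = {1: 0, 2: 3}
pc = 0x100
count = 0
while pc == 0x100:
    pc = loop_instr(regs, pc, rs=1, rt=2, immediate=-1)
    count += 1

print(count, regs[1], hex(pc))
```

the instruction executes three times here: r1 becomes 1, then 2 [both branch back], then 3, which equals r2, so we fall through.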

problem 6

i'm using the following multi-cycle datapath from the class webpage:

multi-cycle cpu datapath

we want to add a MemIndAdd r1,offset(r2) instruction which does the following:

tmp=memory[offset+r2]
tmp=memory[tmp]
r1=r1+tmp

we need to (1) show the code sequence using immediate field, rs, rt, rd, and the multi-cycle hardware registers, (2) modify the datapath to execute the new instruction, and (3) show the fsm for the control for this instruction.

the first step is to figure out how many cycles we need to execute this instruction, and what needs to be done on each cycle.

it will take us one cycle to compute the effective address [offset+r2], one cycle to do the first memory read [memory[offset+r2]], one cycle to do the second memory read [memory[memory[offset+r2]]], one cycle to add r1 to that value [r1+memory[memory[offset+r2]]], and one cycle to store this mess into r1. in the immediate format, r2 is rs and r1 is rt.

so we're looking at 5 cycles of execution. including fetch and decode, it will take us a total of 7 cycles.

let's look at each cycle in a little more detail. let's describe how data needs to move across our datapath in each cycle, using rs, rt, rd, immediate, etc.

  1. fetch: read current instruction from memory, pc = pc+4
  2. decode: decode current instruction, read register file, precompute branch target address
  3. execute1: compute immediate+rs
  4. execute2: read memory[immediate+rs]
  5. execute3: read memory[memory[immediate+rs]]
  6. execute4: compute rt + memory[memory[immediate+rs]]
  7. writeback: write (rt + memory[memory[immediate+rs]]) into register rt

now let's look at each cycle in even more detail, using the registers on our datapath to store temporary values. ir = "instruction register", mdr = "memory data register".

  1. fetch:
    ir <- mem[pc]
    pc <- pc+4
  2. decode:
    a <- rs
    b <- rt
    aluout <- pc+(immediate*4)   [pc already holds the old pc+4 after fetch]
  3. execute1:
    aluout <- immediate+a
  4. execute2:
    mdr <- memory[aluout]
  5. execute3:
    mdr <- memory[mdr]
  6. execute4:
    aluout <- b + mdr
  7. writeback:
    rt <- aluout

the above is what i'd write down for part (1) of this question.
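the register transfers above can be traced in python, one assignment per transfer. this is only a sketch: the dictionary-based register file and memory are my assumptions, memory is indexed by plain address with one value per entry, and the instruction fetch itself [ir <- mem[pc]] is not modelled because the fields are passed in directly.

```python
def mem_ind_add(regs, memory, pc, rs, rt, immediate):
    """trace MemIndAdd rt, immediate(rs) cycle by cycle. returns the new pc."""
    # 1. fetch (ir <- mem[pc] is not modelled here)
    pc = pc + 4
    # 2. decode
    a = regs[rs]
    b = regs[rt]
    aluout = pc + (immediate << 2)   # branch target, computed but unused
    # 3. execute1: effective address
    aluout = immediate + a
    # 4. execute2: first memory read
    mdr = memory[aluout]
    # 5. execute3: second memory read, addressed by the mdr
    mdr = memory[mdr]
    # 6. execute4: add
    aluout = b + mdr
    # 7. writeback
    regs[rt] = aluout
    return pc


regs = {1: 10, 2: 100}
memory = {108: 200, 200: 32}   # memory[8 + r2] holds a pointer to 32
new_pc = mem_ind_add(regs, memory, pc=0, rs=2, rt=1, immediate=8)
print(regs[1], new_pc)
```

with these made-up values, the first read returns the pointer 200, the second read returns 32, and r1 ends up as 10 + 32.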

for part (2), we need to figure out what changes we need to make to the datapath. let's go through each cycle, and figure out if the datapath can handle the operations we want to perform.

  1. fetch:
    no problem. same old stuff happens in this cycle.
  2. decode:
    no problem. same old stuff happens in this cycle.
  3. execute1:
    no problem. this is just effective address calculation.
  4. execute2:
    no problem. this is just like cycle 4 of a load.
  5. execute3:
    we can't do this. we need some way to get the data in the mdr into the "address" input of memory. an easy way to do this is to extend the "IorD" mux with another input for the value in the mdr.
  6. execute4:
    we can't do this. the data in the mdr doesn't go anywhere near the alu. so we'll have to extend one of the ALUSrc muxes with another input for the data in the mdr. which mux should we extend? well, we want to compute b+mdr, and b conveniently goes into input 0 of ALUSrcB, so it'll make our lives easier if we extend ALUSrcA.
  7. writeback:
    no problem. this is just like writeback of any arithmetic instruction

okay, we need to extend some muxes. i'm going to add a "pcwrite" signal to the pc also [because we don't want to be writing the alu's output into the pc on every cycle]. when we're done, the datapath will look like this:

modified multi cycle cpu

part (3) wants us to show the fsm for the control of this instruction. if you've come this far, this is the easy part :)

we need to figure out how the control signals need to be set on each cycle of execution of our new instruction, to achieve the effects we described in part (1) of this problem. here we go:

  1. fetch:
    iord=0
    alusrca=0
    alusrcb=1
    memread=1
    irwrite=1
    aluop=add
    pcwrite=1
  2. decode:
    alusrca=0
    alusrcb=3
    aluop=add
  3. execute1:
    alusrca=1
    alusrcb=2
    aluop=add
  4. execute2:
    iord=1
    memread=1
  5. execute3:
    iord=2
    memread=1
  6. execute4:
    alusrca=2
    alusrcb=0
    aluop=add
  7. writeback:
    regdst=0
    memtoreg=0
    regwrite=1
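one way to check a control table like this is to drive a tiny datapath model with it and see whether the right value lands in rt. here's a sketch in python. the mux encodings are assumptions on my part, since the diagram isn't reproduced here: iord 0/1/2 = pc/aluout/mdr, alusrca 0/1/2 = pc/a/mdr, alusrcb 0/1/2/3 = b/4/immediate/immediate*4, regdst=0 picks rt, memtoreg=0 picks aluout. the model also simplifies latching: a and b are loaded in decode only, and the instruction fetch is skipped because the fields are passed in directly.

```python
# control signals per state, copied from the list above
STATES = [
    ("fetch", dict(iord=0, alusrca=0, alusrcb=1, memread=1,
                   irwrite=1, aluop="add", pcwrite=1)),
    ("decode", dict(alusrca=0, alusrcb=3, aluop="add")),
    ("execute1", dict(alusrca=1, alusrcb=2, aluop="add")),
    ("execute2", dict(iord=1, memread=1)),
    ("execute3", dict(iord=2, memread=1)),
    ("execute4", dict(alusrca=2, alusrcb=0, aluop="add")),
    ("writeback", dict(regdst=0, memtoreg=0, regwrite=1)),
]


def run(regs, memory, pc, rs, rt, imm):
    """step through STATES once; returns the final pc."""
    s = {"pc": pc, "a": 0, "b": 0, "aluout": 0, "mdr": 0}
    for name, sig in STATES:
        nxt = dict(s)   # all reads use start-of-cycle values in s
        if "aluop" in sig:   # every alu use in this table is an add
            in_a = (s["pc"], s["a"], s["mdr"])[sig["alusrca"]]
            in_b = (s["b"], 4, imm, imm << 2)[sig["alusrcb"]]
            if sig.get("pcwrite"):
                nxt["pc"] = in_a + in_b
            else:
                nxt["aluout"] = in_a + in_b
        if sig.get("memread") and not sig.get("irwrite"):
            addr = (s["pc"], s["aluout"], s["mdr"])[sig["iord"]]
            nxt["mdr"] = memory[addr]
        if name == "decode":
            nxt["a"], nxt["b"] = regs[rs], regs[rt]
        if sig.get("regwrite"):
            regs[rt] = s["aluout"]   # regdst=0 -> rt, memtoreg=0 -> aluout
        s = nxt
    return s["pc"]


regs = {1: 10, 2: 100}
memory = {108: 200, 200: 32}
final_pc = run(regs, memory, pc=0, rs=2, rt=1, imm=8)
print(regs[1], final_pc)
```

if you change iord in execute3 back to 1, the second read uses the stale address in aluout instead of the mdr, and r1 comes out wrong. that's the reason the IorD mux needed a third input.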