CSE 240A -- Principles in Computer Architecture

Homework Assignments


 

Readings.   I will cover the text (H&P) in pretty much this order.  So if you know where I am, you should be able to stay ahead of me on the reading. 1-1.3 (skim), 1.5-1.6, 1.7 (skim), 1.8, 1.9 (okay, just always read fallacies and pitfalls, and always at least skim the historical perspective, but I’ll quit typing them in).  2.1-2.13 (skim signal processing sections).  A.1-A.7, A.8 (skim well), 3.1-3.12.  4.1-4.2 (skim), 4.3,  4.4 (skim), 4.5-4.7, 4.8 (skim).  5.1-5.7, 5.8-5.9 (skim), 5.10, 5.11 (skim), 5.12-5.13.  6.1, 6.3, 6.9.

 

Homework #1.  Due Thursday, October 6.  Homework should be typed.  Please put your name only on the first page.

· H&P 1.2b,c,e, 1.3a, P1, P2, 1.16, 2.1(note 2), 2.5 (note 1), 2.6 (note 1), 2.9, 2.10 (note 1), 2.11 (note 1), 2.16a

· P1:  Program A runs 20 billion instructions on a 2 GHz processor, and achieves a CPI of 1.5.   Introduction of a new instruction to the ISA (and recompiling the code) would allow a reduction in the instruction count to 19 billion instructions, resulting in a speedup of 1.2.  What is the CPI for the new code on the improved processor?

· P2:  Machine B runs at 2 GHz and has a CPI of 1.3 for a particular program.  Machine C runs at 5 GHz and has a CPI of 2.4 for that program, while executing 20% more instructions.  Which machine is faster?  What is the speedup over the slower machine?

· Note 1 — Ignore figure 2.32 and assume 15% of instructions are conditional branches, 20% are loads, 5% are stores, and 2% are jumps.

· Note 2 — assume A, B, C, D, and E are 32-bit integers.

 

Homework #2.  Due Tuesday, October 18.

· H&P A.1 (note 1), A.3 (note 2), A.5 (note 3), P3, P4, A.6a,c (note 4)

· note 1: for a and b, show the pipeline timing diagram as we did in class (IF ID EX ...), with arrows showing forwarding.  For c, if you can find the number of cycles without the timing diagram, you don't need to show it.

· note 2: assume no branch delay slot.  Also, ignore the phrase beginning “assuming that only the first pipe stage…”.  For the fun of it, rather than ignoring other stalls, assume a CPI of 1.4 when data hazards (but not control hazards) are accounted for. Also, assume a predict-not-taken strategy for conditional branches.

· note 3: reference the discussion in Problem A.5, but only answer the following questions:

· (a) show all situations where forwarding is required, by example.  Consider only the case where an “add” is the data source, and consider the case of adds, loads, stores, or branches as data sinks.

· (b) show all situations where a hazard results in one or more bubbles, by example.

· (c) show what happens with the control hazards after a branch.  Assume no branch delay slot, that the processor predicts all branches not taken.  Show the taken and the not-taken case.

· P3:  Assuming the pipeline from problem A.5, show the pipeline timing diagram for the following code:

add    r5, r1, 1000(r10)

sub    r7, r1, r5

add    r10,  r9, 2000(r5)

add    r1, r2, r7

add    r8, r7, 1000(r1)

beq    r8, r4, label

· P4:  Consider the following code:

dadd     r5, r1, r2

lw    r7, 1000(r5)

sw   r7, 2000(r10)

dadd     r9, r7, r8

beqz     r9, label

assuming the MIPS pipeline from figure A.18, and the forwarding paths from figure A.23, what forwarding paths needed to support the above code are not included in this figure?  Describe the extra paths needed.

· note 4:  by “hazards”, they mean those that require a bubble.

 

Homework #3:  Due Thursday, November 3 (but please work the problems before the midterm).

· H&P 3.9, P5, 3.1a,b, 3.3 (note 1), P6, P7

· P5 — Consider 2 branch predictors, a 2-level local history predictor (local history table indexed by PC, branch history table of predictors indexed by local history table entry), and a correlating predictor (branch history table indexed by global history register).  In each case the BHT is the same size (indexed by the same number of bits).  We also discover that for the following code, the 2-level local predictor predicts the loop back branch B perfectly (after it gets warmed up), while the correlating predictor does not.  What do we know about the size of the BHT?  Assume the branches do not conflict with each other in the prediction tables.

                          for (j=0;j<10,000;j++) {

                                       for (i=0;i<7;i++) {    // seven iterations

                                                 

                                                    if (A[i] = = 0) {             // branch A

                                                                

                                                    }

                                       }                                              // loop back branch B

                          }                                                          // loop back branch C

 

                                       (all branches shown)

· Note 1: assume branch resolved in ID stage (A.24 with forwarding)

· P6 — Consider a scalar processor.  I’ll define latency as the number of stalls required if a dependent instruction immediately follows the register-writing instruction, and loads have a latency of 1, other integer instructions have a latency of 0, and fp adds have a latency of 4.  For the following code, what is the approximate speedup of an out-of-order processor vs an in-order processor.  Assume the loop is executed many times.  Assume branch prediction eliminates all branch hazards.

                               loop:l.d F2, 1000(r2)

                                       add.d F6, F2, F4

                                       add.d F10, F6, F8

                                       addi   r2, r2, #8

                                       add.d F12, F2, F14

                                       lw  r3, 2000(r5)

                                       beqz r3, loop

· P7 — Indicate all instructions in the following code which could legally be used to replace the nop in the branch delay slot.

                          add  $5, $3, $7

                          sub  $6, $1, $4

                          and  $7, $8, $5

                          beqz $7, there

                          nop   /* branch delay slot */

                          add  $9, $1, $2

                          sub  $2, $9, $5

                          ...

 

             there:     add $2, $10, $11

                          sub $9, $7, $9

                         

 

Homework #4:  Due Thursday, November 17.

· H&P 3.6 b (note 1), 3.7a, 3.17, 3.18 (note 2), 3.21

· Note 1: Five things. First, ignore "cycles in EX" in Figure 3.62 -- it is inconsistent with 3.63. Second, the definition of latency is sometimes unclear. Here, assume that latency L means that if the producing instruction begins execution in cycle 0, the dependent instruction can begin execution in cycle L+1 (that is, with L cycles between the instructions). Third, assume the load latency is 1 cycle instead of 0.  Fourth, assume the multiply latency is 6 and the add latency is 4, regardless of the consumer. Fifth, in order to have a latency of zero for integer adds, you need to start EX and use CDB in same cycle — assume this is possible.

· Note 2:  assume the same latencies in figure 3.27 — instructions dependent on a load begin execution no earlier than T+3 if the load begins in T.  Instructions dependent on an add begin no earlier than T+2.  Assume single-issue (one issue per clock).

 

Homework #5:  Due Tuesday, November 29:

· H&P 5.3 a-c (note 1), 5.4d, e, 5.7, P8-P10.

· Note 1:  assume the units in the graph are bytes of data.  So 2 on the x axis represents 2^2=4 bytes.  The numbers in the legend are the total size of the array in bytes.

· P8-P10 are here, from old tests.