CSE 240A -- Principles in Computer Architecture
Homework Assignments Readings. I will cover the text (H&P) in pretty much this order. So if you know where I am, you should be able to stay ahead of me on the reading. 1-1.3 (skim), 1.5-1.6, 1.7 (skim), 1.8, 1.9 (okay, just always read fallacies and pitfalls, and always at least skim the historical perspective, but I’ll quit typing them in). 2.1-2.13 (skim signal processing sections). A.1-A.7, A.8 (skim well), 3.1-3.12. 4.1-4.2 (skim), 4.3, 4.4 (skim), 4.5-4.7, 4.8 (skim). 5.1-5.7, 5.8-5.9 (skim), 5.10, 5.11 (skim), 5.12-5.13. 6.1, 6.3, 6.9. Homework #1. Due Thursday, October 6. Homework should be typed. Please put your name only on the first page. · H&P 1.2b,c,e, 1.3a, P1, P2, 1.16, 2.1(note 2), 2.5 (note 1), 2.6 (note 1), 2.9, 2.10 (note 1), 2.11 (note 1), 2.16a · P1: Program A runs 20 billion instructions on a 2 GHz processor, and achieves a CPI of 1.5. Introduction of a new instruction to the ISA (and recompiling the code) would allow a reduction in the instruction count to 19 billion instructions, resulting in a speedup of 1.2. What is the CPI for the new code on the improved processor? · P2: Machine B runs at 2 GHz and has a CPI of 1.3 for a particular program. Machine C runs at 5 GHz and has a CPI of 2.4 for that program, while executing 20% more instructions. Which machine is faster? What is the speedup over the slower machine? · Note 1 — Ignore figure 2.32 and assume 15% of instructions are conditional branches, 20% are loads, 5% are stores, and 2% are jumps. · Note 2 — assume A, B, C, D, and E are 32-bit integers. Homework #2. Due Tuesday, October 18. · H&P A.1 (note 1), A.3 (note 2), A.5 (note 3), P3, P4, A.6a,c (note 4) · note 1: for a and b, show the pipeline timing diagram as we did in class (IF ID EX ...), with arrows showing forwarding. For c, if you can find the number of cycles without the timing diagram, you don't need to show it. · note 2: assume no branch delay slot. Also, ignore the phrase beginning “assuming that only the first pipe stage…”. For the fun of it, rather than ignoring other stalls, assume a CPI of 1.4 when data hazards (but not control hazards) are accounted for. Also, assume a predict-not-taken strategy for conditional branches. · note 3: reference the discussion in Problem A.5, but only answer the following questions: · (a) show all situations where forwarding is required, by example. Consider only the case where an “add” is the data source, and consider the case of adds, loads, stores, or branches as data sinks. · (b) show all situations where a hazard results in one or more bubbles, by example. · (c) show what happens with the control hazards after a branch. Assume no branch delay slot, that the processor predicts all branches not taken. Show the taken and the not-taken case. · P3: Assuming the pipeline from problem A.5, show the pipeline timing diagram for the following code: add r5, r1, 1000(r10) sub r7, r1, r5 add r10, r9, 2000(r5) add r1, r2, r7 add r8, r7, 1000(r1) beq r8, r4, label · P4: Consider the following code: dadd r5, r1, r2 lw r7, 1000(r5) sw r7, 2000(r10) dadd r9, r7, r8 beqz r9, label assuming the MIPS pipeline from figure A.18, and the forwarding paths from figure A.23, what forwarding paths needed to support the above code are not included in this figure? Describe the extra paths needed. · note 4: by “hazards”, they mean those that require a bubble. Homework #3: Due Thursday, November 3 (but please work the problems before the midterm). · H&P 3.9, P5, 3.1a,b, 3.3 (note 1), P6, P7 · P5 — Consider 2 branch predictors, a 2-level local history predictor (local history table indexed by PC, branch history table of predictors indexed by local history table entry), and a correlating predictor (branch history table indexed by global history register). In each case the BHT is the same size (indexed by the same number of bits). We also discover that for the following code, the 2-level local predictor predicts the loop back branch B perfectly (after it gets warmed up), while the correlating predictor does not. What do we know about the size of the BHT? Assume the branches do not conflict with each other in the prediction tables. for (j=0;j<10,000;j++) { for (i=0;i<7;i++) { // seven iterations … if (A[i] = = 0) { // branch A … } } // loop back branch B } // loop back branch C (all branches shown) · Note 1: assume branch resolved in ID stage (A.24 with forwarding) · P6 — Consider a scalar processor. I’ll define latency as the number of stalls required if a dependent instruction immediately follows the register-writing instruction, and loads have a latency of 1, other integer instructions have a latency of 0, and fp adds have a latency of 4. For the following code, what is the approximate speedup of an out-of-order processor vs an in-order processor. Assume the loop is executed many times. Assume branch prediction eliminates all branch hazards. loop:l.d F2, 1000(r2) add.d F6, F2, F4 add.d F10, F6, F8 addi r2, r2, #8 add.d F12, F2, F14 lw r3, 2000(r5) beqz r3, loop · P7 — Indicate all instructions in the following code which could legally be used to replace the nop in the branch delay slot. add $5, $3, $7 sub $6, $1, $4 and $7, $8, $5 beqz $7, there nop /* branch delay slot */ add $9, $1, $2 sub $2, $9, $5 ... there: add $2, $10, $11 sub $9, $7, $9 … Homework #4: Due Thursday, November 17. · H&P 3.6 b (note 1), 3.7a, 3.17, 3.18 (note 2), 3.21 · Note 1: Five things. First, ignore "cycles in EX" in Figure 3.62 -- it is inconsistent with 3.63. Second, the definition of latency is sometimes unclear. Here, assume that latency L means that if the producing instruction begins execution in cycle 0, the dependent instruction can begin execution in cycle L+1 (that is, with L cycles between the instructions). Third, assume the load latency is 1 cycle instead of 0. Fourth, assume the multiply latency is 6 and the add latency is 4, regardless of the consumer. Fifth, in order to have a latency of zero for integer adds, you need to start EX and use CDB in same cycle — assume this is possible. · Note 2: assume the same latencies in figure 3.27 — instructions dependent on a load begin execution no earlier than T+3 if the load begins in T. Instructions dependent on an add begin no earlier than T+2. Assume single-issue (one issue per clock). Homework #5: Due Tuesday, November 29: · H&P 5.3 a-c (note 1), 5.4d, e, 5.7, P8-P10. · Note 1: assume the units in the graph are bytes of data. So 2 on the x axis represents 2^2=4 bytes. The numbers in the legend are the total size of the array in bytes. · P8-P10 are here, from old tests. |