cse141-a: Assignments


Homework Policy

Integrity Policy

Assignments

Assignment 1: Secret Username and Clicker Registration

Part 1: Secret Username

Due: April 3

We would like to make all grading transparent and available as soon as possible. To that end, we ask each student to choose a secret username, which will appear on the grade sheets.

Submit your secret username via this Google Form. Please choose a name that does not make it easy to identify you. In the rare case of a name conflict, you will be notified and asked to resubmit.

Deliverable

Your secret username submitted via the webform.

Due: April 3

Part 2: Register your Clicker

Due: April 3

Register your clicker on Ted.

  1. Log into Ted.
  2. Choose "Course tools" from the blue menu on the left.
  3. Choose "i>Clicker Remote Registration"

Deliverable

Successfully take the reading quiz on Thursday.

Due: April 3

Assignment 2: Instruction Set Architecture, Performance and Other ISAs

Changelog

April 22: Solutions are up.

Required Problems

Due: April 17

Unless otherwise noted, the following problems are from the Patterson & Hennessy textbook (5th ed.).

  1. Translation between C and MIPS:

    (1) C to MIPS. Assume that the variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base addresses of arrays A and B are in registers $s6 and $s7, respectively.
    C Code:
    f = g + A[B[4]-B[3]];


    (2) MIPS to C. Assume that the variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base addresses of arrays A and B are in registers $s6 and $s7, respectively.
    MIPS Code:
    sll $t0, $s0, 2
    add $t0, $s6, $t0
    sll $t1, $s1, 2
    add $t1, $s7, $t1
    lw $s0, 0($t0)
    addi $t2, $t0, 4
    lw $t0, 0($t2)
    add $t0, $t0, $s0
    sw $t0, 0($t1)


  2. Pseudo-instructions: Show how to implement the following pseudo-instructions as they would be implemented in a real machine (e.g., using $at):
    1. nop # Do nothing (show three different ways)
    2. li $s0, <32-bit constant> # Load 32-bit immediate value into s0
    3. div $s0, $s1, $s2 # Integer division: s0 = s1/s2

  3. New instructions: Implement register-indirect conditional branches (beqr and bner) as pseudo-instructions. Give a proposal for adding them to the ISA: describe how they could be encoded in the I-Type, R-Type, or J-Type format, and propose an opcode for them (use the MIPS opcode map in Figure A.10.2). Give a brief argument for or against their inclusion.

  4. Performance:
    Consider three different processors, P1, P2, and P3, executing the same instruction set. P1 has a 3 GHz clock rate and a CPI of 1.5. P2 has a 2.5 GHz clock rate and a CPI of 1.0. P3 has a 4.0 GHz clock rate and a CPI of 2.2.
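
Note: these performance problems rely on the relation execution time = instruction count x CPI / clock rate. Below is a minimal C sketch of that bookkeeping using the P1/P2/P3 parameters above; the instruction count is an arbitrary value chosen purely for illustration.

    #include <stdio.h>

    /* Execution time = (instruction count * CPI) / clock rate. */
    static double exec_time(double insts, double cpi, double clock_hz) {
        return insts * cpi / clock_hz;
    }

    int main(void) {
        double insts = 1e9;  /* assumed instruction count, for illustration only */
        printf("P1: %.3f s\n", exec_time(insts, 1.5, 3.0e9));
        printf("P2: %.3f s\n", exec_time(insts, 1.0, 2.5e9));
        printf("P3: %.3f s\n", exec_time(insts, 2.2, 4.0e9));
        return 0;
    }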

Deliverable

Place your typed (or neatly handwritten) solutions to the problems in Andiry's campus mailbox (EBU3B, room 2237, labeled "Xu, Jian") before 9 am.

Due: April 17

Assignment 3: Instruction Set Architecture, Performance and Other ISAs (continued)

Changelog

May 5: Solutions are up.

Required Problems

Due: May 1

  1. Processor Performance: The table below describes the performance of two processors, the rAlpha and the c86, and two compilers on a common 'benchmark' program.

                                Compiler A               Compiler B
               GHz   Cost       Instructions  Avg. CPI   Instructions  Avg. CPI
    rAlpha     3.4   $100       7000          1.2        5000          1.5
    c86        2.6   $100       1500          2.2        1000          4.0

    a. Which is the best compiler-machine combination?

    b. Both hardware vendors release quad-core versions of their processors for 1.7x the cost of the single-core version. The results on the two new systems are:

                                Compiler A               Compiler B
               GHz   Cost       Instructions  Avg. CPI   Instructions  Avg. CPI
    rAlphaX    3.4   $170       2250/core     1.5        1750/core     1.8
    c86x4      2.6   $170       625/core      3.2        500/core      5.0

    Assume the benchmarks used are perfectly parallelizable, i.e., the cores process equal parts of the program independently and at the same time. Now which combination performs best?

    c. Using the metric (Cycles/s)*Cores/(Dollars^2), which is the best machine? Is this metric useful?

    d. The four processors have the following power ratings:
              rAlpha  20 W
              c86     35 W
              rAlphaX 50 W
              c86x4   80 W
            
    If you wanted to show that the first processor had the best power efficiency for performance, what metric would you use? For this part, use values only from Compiler A.

  2. Amdahl's Law (a sketch of the speedup arithmetic appears after this problem list):

    The following table shows the execution time of five routines of a program running on 16 processors.

    # Processors   Routine A (ms)   Routine B (ms)   Routine C (ms)   Routine D (ms)   Routine E (ms)
    16             4                14               2                12               2

    (1) Find the total execution time, and by how much it is reduced if the time of routines A, C, and E is improved by 15%.

    (2) By how much is the total time reduced if routine B is improved by 10%?

    (3) By how much is the total time reduced if routine D is improved by 10%?

    (4) Execution time in a multiprocessor system can be split into computing time for the routines plus routing time spent sending data from one processor to another. Consider the execution time and routing time given in the following table.

    # Processors   Routine A (ms)   Routine B (ms)   Routine C (ms)   Routine D (ms)   Routine E (ms)   Routing (ms)
    2              20               78               9                65               4                11
    4              12               44               4                34               2                13
    8              1                23               3                19               3                17
    16             4                13               1                10               2                22
    32             2                5                1                5                1                23
    64             1                3                0.5              1                1                26

    For each doubling of the number of processors, determine the ratio of new to old computing time and the ratio of new to old routing time.

  3. MIPS ISA: Exercise 2.18.
    Assume that we would like to expand the MIPS register file to 128 registers and expand the instruction set to contain four times as many instructions.

    (1) How would this affect the size of each of the bit fields in the R-type instructions?

    (2) How would this affect the size of each of the bit fields in the I-type instructions?

    (3) How could each of the two proposed changes decrease the size of a MIPS assembly program? On the other hand, how could they increase the size of a MIPS assembly program?
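
Note on problem 2: Amdahl's Law says that improving one routine only shrinks that routine's share of the total time. Below is a minimal C sketch of that calculation; the routine times and the 1.15x reading of "improved by 15%" are illustrative assumptions, not the values or interpretation you must use in your answer.

    #include <stdio.h>

    /* Amdahl's Law: only the improved portion of the runtime shrinks.
       new_time = unaffected_time + affected_time / speedup            */
    static double improved_time(double affected, double unaffected, double speedup) {
        return unaffected + affected / speedup;
    }

    int main(void) {
        /* Hypothetical example: 30 ms of a 100 ms program gets a 1.15x speedup. */
        double old_total = 100.0, affected = 30.0;
        double new_total = improved_time(affected, old_total - affected, 1.15);
        printf("old = %.1f ms, new = %.1f ms\n", old_total, new_total);
        return 0;
    }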

Deliverable

Place your typed (or neatly handwritten) solutions to the problems in Qi's campus mailbox (2nd floor of the CSE building) before 9 am.

Due: May 1

Assignment 4: Other ISAs, continued

Changelog

May 5: Solutions are up.

Required Problems

Due: May 2

References

Deliverable

Place your typed (or neatly handwritten) solutions to the problems in Andiry's campus mailbox (EBU3B, room 2237, labeled "Xu, Jian") before 4 pm.

Due: May 2

Assignment 5: Pipelining and Hazards

Changelog

May 14: Solutions posted here.

Required Problems

Due: May 9

  1. Simple Processor Performance
    The critical path latencies for the 7 major blocks in a simple processor are given below.
         I-Mem    Add      Mux      ALU      Regs     D-Mem     Control
    a.   400ps    100ps    30ps     140ps    200ps    350ps     100ps
    b.   500ps    150ps    100ps    180ps    320ps    1000ps    65ps

    For each part, answer the following questions:
    1. What is the critical path for a MIPS ADD instruction? Explain your break-up.
    2. If the number of registers is doubled, this increases Regs by 100ps and Control by 20ps. This results in 10% fewer instructions due to fewer load/stores. What is the new critical path for a MIPS ADD instruction?
  2. Pipelining I
    The 5 stages of the processor have the following latencies (see the cycle-time sketch after this problem list):
         Fetch    Decode   Execute   Memory   Writeback
    a.   300ps    400ps    350ps     550ps    100ps
    b.   200ps    150ps    100ps     190ps    140ps

    Assume that when pipelining, each pipeline stage costs 20ps extra for the registers between pipeline stages.
    1. Non-pipelined processor: what is the cycle time? What is the latency of an instruction? What is the throughput?
    2. Pipelined processor: What is the cycle time? What is the latency of an instruction? What is the throughput?
    3. If you could split one of the pipeline stages into 2 equal halves, which one would you choose? What is the new cycle time? What is the new latency? What is the new throughput?
    4. Assume the distribution of instructions that run on the processor is:
        50%: ALU
        25%: BEQ
        15%: LW
        10%: SW
                      
      Assuming there are no stalls or hazards, what is the utilization of the data memory? What is the utilization of the register block's write port? (Utilization in percentage of clock cycles used)
  3. Pipelining II
    You are given a non-pipelined processor design which has a cycle time of 10ns and average CPI of 1.4. Calculate the latency speedup in the following questions.
    1. What is the best speedup you can get by pipelining it into 5 stages?
    2. If the 5 stages are 1ns, 1.5ns, 4ns, 3ns, and 0.5ns, what is the best speedup you can get compared to the original processor?
    3. If each pipeline stage added also adds 20ps due to register setup delay, what is the best speedup you can get compared to the original processor?
    4. The pipeline from 3.3 stalls 20% of the time for 1 cycle and 5% of the time for 2 cycles (these occurrences are disjoint). What is the new CPI? What is the speedup compared to the original processor?
  4. Data Hazards
        Sequence of instructions:
            lw  $s2, 0($s1)
            lw  $s1, 40($s3)
            sub $s3, $s1, $s2
            add $s3, $s2, $s2
            or  $s4, $s3, $zero
            sw  $s3, 50($s1)
                
    1. List the Read-After-Write data dependencies.
    2. Assume the 5-stage MIPS pipeline with no forwarding, and each stage takes 1 cycle. Instead of inserting nops, you let the processor stall on hazards. How many times does the processor stall? How long is each stall (in cycles)? What is the execution time (in cycles) for the whole program?
    3. Assume the 5-stage MIPS pipeline with full forwarding. Write the program with nops to eliminate the hazards. Assume there is no load delay slot for the purpose of this problem.
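
Note on problem 2: the usual bookkeeping is that a non-pipelined cycle must fit the sum of all stage latencies, while a pipelined cycle is set by the slowest stage plus the register overhead; latency is then stages x cycle time and throughput is 1 / cycle time. A minimal C sketch of that arithmetic, using made-up stage latencies rather than the values in the table:

    #include <stdio.h>

    #define NSTAGES 5

    int main(void) {
        /* Hypothetical stage latencies in ps (not the values from the table). */
        double stage_ps[NSTAGES] = {250, 300, 200, 400, 150};
        double reg_overhead_ps = 20;   /* per-stage pipeline register cost */

        double single_cycle = 0;       /* non-pipelined: sum of all stages */
        double slowest = 0;            /* pipelined: slowest stage wins    */
        for (int i = 0; i < NSTAGES; i++) {
            single_cycle += stage_ps[i];
            if (stage_ps[i] > slowest)
                slowest = stage_ps[i];
        }
        double pipe_cycle = slowest + reg_overhead_ps;

        printf("non-pipelined: cycle = latency = %.0f ps, throughput = %.4f inst/ns\n",
               single_cycle, 1000.0 / single_cycle);
        printf("pipelined: cycle = %.0f ps, latency = %.0f ps, throughput = %.4f inst/ns\n",
               pipe_cycle, NSTAGES * pipe_cycle, 1000.0 / pipe_cycle);
        return 0;
    }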

Deliverable

Place your typed (or neatly handwritten) solutions to the problems in Qi's campus mailbox before 4 pm.

Due: May 9

Assignment 6: Branch Prediction

Changelog

May 27: Solutions are up.

Required Problems

Due: May 27

  1. Measuring branch prediction speedup: You'd like to add a branch predictor to your Baseline processor, and you're considering two options: PikaChooser and CharWizard. Evaluate the speedup of each relative to your Baseline if branches are 15% of all instructions. Assume normal CPI is 1, but the branch mispredict penalty is 2 extra stall cycles.

    PikaChooser:
    CharWizard:
    1. Which predictor would you choose?
    2. If branches are instead 25% of all instructions, which predictor would you choose?
  2. Measuring branch predictor accuracy: This exercise examines the accuracy of various branch predictors for the following repeating patterns (e.g., in a loop) of branch outcomes. Accuracy is defined as the percentage of guesses that are correct. Answer each question (1-4) for each sequence (a-b):


    Static branch prediction:
    1. What is the accuracy of always-taken and always-not-taken predictors for this sequence of branch outcomes?

    Dynamic branch prediction:
    2. What is the accuracy of a predict-last-taken predictor for the first 5 branches in this sequence? Assume this predictor starts in the "Predict not taken" state.
    3. What is the accuracy of a 2-bit predictor for the first 5 branches in this sequence? Using the diagram in slide 131 ("Slides from class.pdf"), assume this predictor starts in the "Strongly not taken" state.
    4. What is the accuracy of a 2-bit predictor if this pattern is repeated forever? Using the diagram in slide 131 ("Slides from class.pdf"), assume this predictor starts in the "Strongly not taken" state.
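
Note on the 2-bit predictor: it is a saturating counter that must mispredict twice in a row before it changes its prediction. A minimal C sketch of that behavior follows; the state encoding is one common convention and the outcome pattern is made up, so check it against the diagram on slide 131 rather than treating it as the reference.

    #include <stdio.h>

    /* 2-bit saturating counter: 0 = strongly not taken, 1 = weakly not taken,
       2 = weakly taken, 3 = strongly taken. Predict taken when state >= 2.    */
    static int predict(int state) { return state >= 2; }

    static int update(int state, int taken) {
        if (taken)
            return state < 3 ? state + 1 : 3;
        return state > 0 ? state - 1 : 0;
    }

    int main(void) {
        /* Hypothetical outcome pattern (1 = taken); not one of the homework sequences. */
        int outcomes[] = {1, 1, 0, 1, 1, 0, 1, 1};
        int n = sizeof(outcomes) / sizeof(outcomes[0]);
        int state = 0, correct = 0;    /* start in "strongly not taken" */

        for (int i = 0; i < n; i++) {
            if (predict(state) == outcomes[i])
                correct++;
            state = update(state, outcomes[i]);
        }
        printf("accuracy: %d/%d\n", correct, n);
        return 0;
    }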

Deliverable

Place your typed (or neatly handwritten) solutions to the problems in Andiry's campus mailbox (EBU3B, room 2237, labeled "Xu, Jian") before 9 am.

Due: May 27

Assignment 7: Caches and Virtual Memory

Changelog

June 5: Solutions are up.

Required Problems

Due: June 2

  1. Introduction to caches: For a direct-mapped cache design with a 32-bit address and byte-addressable memory, the following bits of the address are used to access the cache:

        Tag      Index     Offset
    a.  31-10     9-5      4-0
    b.  31-12    11-6      5-0
    
    For each configuration (a and b), answer the following (a short bit-field sketch appears after this problem list):
    1. What is the cache line size (in words)?
    2. How many entries (cache lines) does the cache have? Cache lines are also called blocks.
  2. Too Small a Backpack

    You have a 2-way set-associative L1 cache that is 8 KB, with 4-word cache lines. Writing data to or reading data from L2 takes 10 cycles. You get the following sequence of writes to the cache; each is a 32-bit address in hexadecimal:
    0x1000
    0x1004
    0x1010
    0x11c0
    0x2000
    0x21c0
    0x3400
    0x3404
    0x3f00
    0x2004
    0x1004
              
    1. How many cache misses occur with an LRU policy?
    2. How many cache misses occur with a most-recently-used (MRU) replacement policy?
    3. Would the miss rate increase or decrease if the cache were the same size but direct-mapped? Explain.
    4. How long does a read-miss eviction take if the cache is write-back, write-allocate? What about a write-miss? (Assume the cache line is dirty)
    5. How long does a read-miss eviction take if the cache is write-through, write-allocate? What about a write-miss?

  3. Virtual and Physical Addresses. For each configuration (a-c), state how many bits are needed for each of the following:

    a. 32-bit operating system, 4-KB pages, 1 GB of RAM

    b. 32-bit operating system, 16-KB pages, 2 GB of RAM

    c. 64-bit operating system, 16-KB pages, 16 GB of RAM

    1. You can fill out a table like this:
      Config. V. Addr. P. Addr. V. Page # P. Page # Offset
      a
      b
      c
    2. What are some advantages of using a larger page size?
    3. What are some disadvantages of using a larger page size?

  4. Using the TLB. As described in your textbook, virtual memory uses a page table to track the mapping of virtual addresses to physical addresses. To speed up this translation, modern processors implement a cache of the most recently used translations, called the translation-lookaside buffer (TLB). This exercise shows how the page table and TLB must be updated as addresses are accessed.

    The following list is a stream of virtual addresses as seen on a system. Assume 4-KB pages and a four-entry fully associative TLB with LRU replacement policy. If pages must be brought in from disk, give them the next largest unused page number (i.e., starting at 13).

    Address   Result (H, M, or PF)
    0x0FFF
    0x7A28
    0x3DAD
    0x3A98
    0x1C19
    0x1000
    0x22D0

    Initial TLB (values are in base 10):

    Valid   Tag   Physical Page #   LRU
    1       11    12                2
    1       7     4                 3
    1       3     6                 4
    0       4     9                 1
    The LRU column works as follows: The older an entry is, the lower its LRU number will be. Each time an entry is used, its LRU number is set to 4 (since it is now the most-recently used), and the other numbers are adjusted downward accordingly.

    Initial Page Table (values are in base 10):

    Index   Valid   Physical Page or On Disk
    0       1       5
    1       0       Disk
    2       0       Disk
    3       1       6
    4       1       9
    5       1       11
    6       0       Disk
    7       1       4
    8       0       Disk
    9       0       Disk
    10      1       3
    11      1       12


    Given the address stream, initial TLB, and initial page table, show the final state of the system (TLB and page table). Also show for each memory access whether it is a hit in the TLB (H), a hit in the page table (TLB miss, or M), or a page fault (PF).
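
Note on the cache and virtual-memory problems above: each comes down to slicing an address into bit fields. A minimal C sketch using the field boundaries from configuration (a) of problem 1 and a 4-KB page; the example address is arbitrary.

    #include <stdio.h>

    int main(void) {
        unsigned addr = 0x12345678u;            /* arbitrary example address */

        /* Cache fields for configuration (a): offset = bits 4-0,
           index = bits 9-5, tag = bits 31-10.                      */
        unsigned offset = addr & 0x1F;          /* low 5 bits        */
        unsigned index  = (addr >> 5) & 0x1F;   /* next 5 bits       */
        unsigned tag    = addr >> 10;           /* remaining 22 bits */
        printf("cache: tag=0x%x index=%u offset=%u\n", tag, index, offset);

        /* Virtual memory with 4-KB pages (as in problem 4): 12-bit page offset,
           and the rest of the address is the virtual page number.              */
        unsigned page_offset = addr & 0xFFF;
        unsigned vpn         = addr >> 12;
        printf("vm: vpn=%u page_offset=0x%x\n", vpn, page_offset);
        return 0;
    }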

Deliverable

Place your typed (or neatly handwritten) solutions to the problems in Qi's campus mailbox before 9 am.

Due: June 2