cse141-a: Assignments


Homework Policy

Integrity Policy

Assignments

Assignment 1: Secret Username

Secret Username

Due: April 9

We would like to make all grading transparent and available as soon as possible. Because of this, we ask each student to choose a secret username, which will appear on the grade sheets.

Submit your secret username via this Google Form. Please do not choose a username that makes it easy to identify you. In the rare case of a name conflict, you will be notified and asked to resubmit.

Deliverable

Your secret username submitted via the webform.

Due: April 9

Assignment 2: Instruction Set Architecture, Performance, Spim, and Other ISAs

Changelog

May 5 #7 solutions updated. If I marked you wrong on the homework, please take a picture and email me.
April 18 Solutions are up.
April 15 Here's a link to the PCSpim tutorials on Google Drive since we're still having permission errors on the website.
April 10 It has come to my attention that these posted problems are from a previous edition of the textbook. Please do the problems as posted on this page instead of from the Revised 4th edition textbook assigned to the current class. Thanks.

Required Problems

Due: April 18

Click here for a tutorial on PCSpim

Unless otherwise noted, the following problems are from the Patterson & Hennessy textbook (4th ed.).

  1. Chapter 2: Exercise 2.4. Part (b) only (i.e., 2.4.1b-2.4.6b):

    Parts 2.4.1-3 deal with translating from C to MIPS. Assume that the variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base addresses of the arrays A and B are in registers $s6 and $s7, respectively.
    Code:
    f = g - A[B[4]];


    Parts 2.4.4-6 deal with translating from MIPS to C. Assume that the variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base addresses of the arrays A and B are in registers $s6 and $s7, respectively.
    Code:
    lw $s0, 4($s6)


  2. Pseudo-instructions: Show how to implement the following pseudo-instructions as they would be implemented in a real machine (e.g., using $at):
    1. nop # Do nothing (show three different ways)
    2. li $s0, <32-bit constant> # Load 32-bit immediate value into s0
    3. div $s0, $s1, $s2 # Integer division: s0 = s1/s2

  3. New instructions: Implement register indirect conditional branches (beqr and bner) as pseudo-instructions. Give a proposal for adding them to the ISA (i.e., describe how they could be encoded in the I-Type, R-Type, or J-Type format, and propose an opcode for them (use Figure B.10.2)). Give a (brief) argument for or against their inclusion.

  4. Fibonacci sequence: Write MIPS assembly to compute the Nth Fibonacci number. Assume N is passed to your function in register $a0. Your output should be in register $v0 at the end of your function. Submit your code and a screenshot of Spim that shows the registers with correct output value for N=10, i.e., Fib(10) = 55 = 0x37. Note: You must implement Fibonacci recursively. The purpose of this assignment is to learn how to manipulate the stack correctly in MIPS.

  5. Performance Processors: The table below describes the performance of two processors, the rAlpha and the c86, and two compilers on a common 'benchmark' program.
              GHz   Cost   Compiler A                 Compiler B
                           Instructions  Average CPI  Instructions  Average CPI
    rAlpha    3.4   $100   7000          1.2          5000          1.5
    c86       2.6   $100   1500          2.2          1000          4.0

    a. Which is the best compiler-machine combination?

    b. Both hardware vendors release quad-core versions of the processor for 1.7x the cost of the single core version. The results on the two new systems:

              GHz   Cost   Compiler A                 Compiler B
                           Instructions  Average CPI  Instructions  Average CPI
    rAlphaX   3.4   $170   2250/core     1.5          1750/core     1.8
    c86x4     2.6   $170   625/core      3.2          500/core      5.0

    Assume the benchmarks used are perfectly parallelizable, i.e., the cores process equal parts of the program independently and at the same time. Now which combination performs best?

    c. Using the metric (Cycles/s)*Cores/(Dollars^2), which is the best machine? Is this metric useful?

    d. The four processors have the following power ratings:
              rAlpha  20 W
              c86     35 W
              rAlphaX 50 W
              c86x4   80 W
            
    If you wanted to show that the first processor had the best power efficiency for performance, what metric would you use? For this part, use values only from Compiler A.

  6. Chapter 1: Exercise 1.5. Part (b) only (i.e., 1.5.1b-1.5.6b):

    Consider two different implementations, P1 and P2, of the same instruction set. There are five classes of instructions (A, B, C, D, and E) in the instruction set. The clock rate and CPI of each class is given below.
    Table:
    Machine  Clock rate   CPI A   CPI B   CPI C   CPI D   CPI E
    P1         1.0 GHz      1       1       2       3       2
    P2         1.5 GHz      1       2       3       4       3


    The table below shows instruction-type breakdown for different programs. Using this data, you will be exploring the performance tradeoffs with different changes made to a MIPS processor.
    Table:
                            # Instructions
    Program       Compute  Load  Store  Branch  Total
    Program 4     1500     300   100    100     1750

  7. Amdahl's Law: Exercises 1.16.1b - 1.16.3b, 1.16.4.

    The following table shows the execution time of five routines of a program running on different numbers of processors.

    # Processors  Routine A (ms)  Routine B (ms)  Routine C (ms)  Routine D (ms)  Routine E (ms)
    16            4               14              2               12              2

    1.16.1: Find the total execution time, and how much it is reduced if the time of routines A, C, & E is improved by 15%.

    1.16.2: By how much is the total time reduced if routine B is improved by 10%?

    1.16.3: By how much is the total time reduced if routine D is improved by 10%?

    1.16.4: Execution time in a multiprocessor system can be split into computing time for the routines plus routing time spent sending data from one processor to another. Consider the execution time and routing time given in the following table.

    # Processors  Routine A (ms)  Routine B (ms)  Routine C (ms)  Routine D (ms)  Routine E (ms)  Routing (ms)
    2             20              78              9               65              4               11
    4             12              44              4               34              2               13
    8             1               23              3               19              3               17
    16            4               13              1               10              2               22
    32            2               5               1               5               1               23
    64            1               3               0.5             1               1               26

    For each doubling of the number of processors, determine the ratio of new to old computing time and the ratio of new to old routing time.

  8. MIPS ISA: Exercises 2.12.1 - 2.12.3. Parts c and d are not from the book!
    In the following problems, you will investigate the impact of certain modifications to the MIPS ISA.

    a. 8 Registers
    b. 10 bit immediate constants
    c. 128 Registers
    d. All arithmetic instructions can use base addressing mode with the last argument (Example: add $a0, $a2, 0[$t1])

    2.12.1: If the instruction set of the MIPS processor is modified, the instruction format must also be changed. For each of the suggested changes, show the size of the bit fields of an R-type format instruction. What is the total number of bits needed for each instruction?

    2.12.2: If the instruction set of the MIPS processor is modified, the instruction format must also be changed. For each of the suggested changes, show the size of the bit fields of an I-type format instruction. What is the total number of bits needed for each instruction?

    2.12.3: Why could the suggested change decrease the size of a MIPS assembly program? Why could the suggested change increase the size of a MIPS assembly program?
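Several of the problems above reduce to a handful of arithmetic identities: the iron law of performance (time = instructions × CPI / clock rate), a new/old ratio per doubling, and the Fibonacci recurrence. The Python sketch below is a hedged self-check, not a solution; the rAlpha/Compiler A figures come from the table in problem 5, the routing times from problem 7, and Fib(10) = 55 matches problem 4.

```python
def exec_time(instructions, cpi, clock_hz):
    """Iron law of performance: time = instructions * CPI / clock rate."""
    return instructions * cpi / clock_hz

def fib(n):
    """Recursive Fibonacci with Fib(1) = Fib(2) = 1, so Fib(10) = 55 = 0x37.
    In MIPS, each recursive call must save $ra (and its argument) on the stack."""
    return 1 if n <= 2 else fib(n - 1) + fib(n - 2)

# Problem 5a: rAlpha with Compiler A (3.4 GHz, 7000 instructions, average CPI 1.2).
print(f"{exec_time(7_000, 1.2, 3.4e9) * 1e6:.3f} us")

# Problem 7 (1.16.4): new/old ratio for each doubling, routing-time column only.
routing_ms = [11, 13, 17, 22, 23, 26]   # 2, 4, 8, 16, 32, 64 processors
print([round(new / old, 2) for old, new in zip(routing_ms, routing_ms[1:])])

# Problem 4 sanity check:
print(fib(10), hex(fib(10)))  # 55 0x37
```

None of these prints are graded answers by themselves; they are just a way to sanity-check your hand arithmetic.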

Deliverable

Place your typed (or well-written) solutions to the problems in Alice's campus mailbox (2nd floor of the CSE building) before 9am.

Due: April 18

Assignment 3: Other ISAs, continued

Changelog

May 1 Solutions are up.
January 27 Updated link to "Table of instructions you are likely to encounter in this class".

Required Problems

Due: April 25

In this assignment, you may need gcc and gdb. If you are not familiar with these tools, here is a short tutorial covering their basic use.

References

Deliverable

Place your typed (or well-written) solutions to the problems in Alice's campus mailbox (2nd floor of the CSE building) before 9am. Include your code and screenshots. Remember that your screenshot must show all of Spim, not just the registers.

Due: April 25

Assignment 4: Pipelining and Hazards

Changelog

May 7 Fixed mistake in solutions for 2.2 and 2.3. Forgot to add register delay to cycle time and throughput.
May 3 Solutions are up.
April 30 Data Hazards problem (3) moved to next homework.

Required Problems

Due: May 2

  1. Processor Performance
    The critical path latencies for the 7 major blocks in a simple processor are given below.
        I-Mem  Add    Mux    ALU    Regs   D-Mem   Control
    a.  400ps  100ps  30ps   140ps  200ps  350ps   100ps
    b.  500ps  150ps  100ps  180ps  320ps  1000ps  65ps

    For each part, answer the following questions:
    1. What is the critical path for a MIPS ADD instruction? Explain your break-up.
    2. If the number of registers is doubled, this increases Regs by 100ps and Control by 20ps. This results in 10% fewer instructions due to fewer load/stores. What is the new critical path for a MIPS ADD instruction?
  2. Pipelining
    The 5 stages of the processor have the following latencies:
        Fetch  Decode  Execute  Memory  Writeback
    a.  300ps  400ps   350ps    550ps   100ps
    b.  200ps  150ps   100ps    190ps   140ps

    Assume that when pipelining, each pipeline stage costs 20ps extra for the registers between pipeline stages.
    1. Non-pipelined processor: what is the cycle time? What is the latency of an instruction? What is the throughput?
    2. Pipelined processor: What is the cycle time? What is the latency of an instruction? What is the throughput?
    3. If you could split one of the pipeline stages into 2 equal halves, which one would you choose? What is the new cycle time? What is the new latency? What is the new throughput?
    4. Assume the distribution of instructions that run on the processor is:
      	50%: ALU
      	25%: BEQ
      	15%: LW
      	10%: SW
                      
      Assuming there are no stalls or hazards, what is the utilization of the data memory? What is the utilization of the register block's write port? (Utilization in percentage of clock cycles used)
  3. Data Hazards
  4. More Pipelines
    You are given a non-pipelined processor design which has a cycle time of 10ns and average CPI of 1.4. Calculate the latency speedup in the following questions.
    1. What is the best speedup you can get by pipelining it into 5 stages?
    2. If the 5 stages are 1ns, 1.5ns, 4ns, 3ns, and 0.5ns, what is the best speedup you can get compared to the original processor?
    3. If each pipeline stage added also adds 20ps due to register setup delay, what is the best speedup you can get compared to the original processor?
    4. The pipeline from Q4.3 stalls 20% of the time for 1 cycle and 5% of the time for 2 cycles (these occurrences are disjoint). What is the new CPI? What is the speedup compared to the original processor?
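The pipelining arithmetic above can be sketched in a few lines. This assumes the pipelined cycle time is the slowest stage plus the 20 ps register overhead, and that disjoint stall events add directly to CPI; the stage latencies come from problem 2, configuration (a), but the stall rates in the sketch are made up, not the ones from Q4.4.

```python
stages_ps = [300, 400, 350, 550, 100]     # problem 2, configuration (a)
reg_ps = 20                               # per-stage pipeline register overhead

cycle = max(stages_ps) + reg_ps           # pipelined: slowest stage + register delay
latency = cycle * len(stages_ps)          # one instruction traverses all 5 stages
print(cycle, latency)                     # 570 2850

# Stall-weighted CPI with made-up stall rates (NOT the ones in Q4.4):
cpi = 1.0 + 0.30 * 1 + 0.10 * 2           # disjoint stall events add directly
print(cpi)
```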

Question data and statements are in part derived from Patterson & Hennessy, 4th edition.

Deliverable

Submit your solutions via TED by 9am. Only PDF submissions accepted. No other formats!

Due: May 2

Assignment 5: Branch Prediction and Intro to Caches

Changelog

June 6 Updated solutions for #1.3 and #2.
May 22 Solutions are up.
May 14 Removed #4 cache problem from homework.

Required Problems

Due: May 16

  1. Data Hazards
        Sequence of instructions:
    	    lw  $s2, 0($s1)
    	    lw  $s1, 40($s3)
    	    sub $s3, $s1, $s2
    	    add $s3, $s2, $s2
    	    or  $s4, $s3, $zero
    	    sw  $s3, 50($s1)
                
    1. List the Read-After-Write data dependencies.
    2. Assume the 5-stage MIPS pipeline with no forwarding, and each stage takes 1 cycle. Instead of inserting nops, you let the processor stall on hazards. How many times does the processor stall? How long is each stall (in cycles)? What is the execution time (in cycles) for the whole program?
    3. Assume the 5-stage MIPS pipeline with full forwarding. Write the program with nops to eliminate the hazards. (Hint: time travel is not possible!)
  2. Measuring branch prediction speedup: You'd like to add a branch predictor to your Baseline processor, and you're considering two options: PikaChooser and CharWizard. Evaluate the speedup of each relative to your Baseline if branches are 15% of all instructions. Assume normal CPI is 1, but the branch mispredict penalty is 2 extra stall cycles.

    PikaChooser:
    CharWizard:
    1. Which predictor would you choose?
    2. If branches are instead 25% of all instructions, which predictor would you choose?
  3. Measuring branch predictor accuracy: This exercise examines the accuracy of various branch predictors for the following repeating patterns (e.g., in a loop) of branch outcomes. Accuracy is defined as the percentage of guesses that are correct. Answer each question (1-4) for each sequence (a-b):


      Static branch prediction:
    1. What is the accuracy of always-taken and always-not-taken predictors for this sequence of branch outcomes?

      Dynamic branch prediction:
    2. What is the accuracy of a predict-last-taken predictor for the first 5 branches in this sequence? Assume this predictor starts in the "Predict not taken" state.
    3. What is the accuracy of a 2-bit predictor for the first 5 branches in this sequence? Using the diagram in slide 115 ("ImplementingMIPS-e" PDF page 196), assume this predictor starts in the "Strongly not taken" state.
    4. What is the accuracy of a 2-bit predictor if this pattern is repeated forever? Using the diagram in slide 115 ("ImplementingMIPS-e" PDF page 196), assume this predictor starts in the "Strongly not taken" state.
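Since the actual outcome sequences (a-b) are not reproduced on this page, here is a hedged Python sketch of the machinery: a 2-bit saturating-counter simulator and the CPI formula from problem 2. The pattern and mispredict rate below are hypothetical placeholders, not the assignment's data.

```python
def two_bit_accuracy(outcomes, state=0):
    """2-bit saturating counter: states 0 (strongly not taken) .. 3 (strongly
    taken); predict taken when state >= 2. Returns the fraction of correct guesses."""
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(outcomes)

def cpi_with_predictor(branch_frac, mispredict_rate, penalty=2, base_cpi=1.0):
    """Effective CPI when mispredicted branches each cost `penalty` extra stalls."""
    return base_cpi + branch_frac * mispredict_rate * penalty

# Hypothetical pattern T, T, NT, T, T -- not one of the assignment's sequences:
print(two_bit_accuracy([True, True, False, True, True]))  # 0.2
# 15% branches, hypothetical 10% mispredict rate, 2-cycle penalty:
print(cpi_with_predictor(0.15, 0.10))
```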

Deliverable

In your PDF file, please bold, highlight, or circle your final answers to make grading easier. Type your solutions and upload them as a PDF file attachment to TED. No other file types accepted. Upload by 9am.

Due: May 16

Assignment 6: Caches

Changelog

June 10 Updated 2.1 solutions for average CPI of stores vs loads.
June 8 Solutions are up.
May 22 Added info to #3. Reading data to L2 takes 10 cycles.
May 21 Due date pushed back a week to 5/30.

Required Problems

Due: May 30

  1. Introduction to caches: For a direct-mapped cache design with a 32-bit address and byte-addressable memory, the following bits of the address are used to access the cache:

        Tag      Index     Offset
    a.  31-10     9-5      4-0
    b.  31-12    11-6      5-0
    
    For each configuration (a and b):
    1. What is the cache line size (in words)?
    2. How many entries (cache lines) does the cache have? Cache lines are also called blocks.
  2. Your Backpack

    The Ca$h processor has a very strange ISA. Popular programs on it have the following instruction breakdown:
    45% ALU
     5% Branch
    10% Store
    40% Load
              
    Assume there is no branch prediction. The following questions deal with the L1 cache behavior of the Ca$h processor.
    1. For each of these pairs, identify which is likely to be higher, and why.
      1. Spatial locality of instruction cache references and spatial locality of data cache references.
      2. Temporal locality of instruction cache references and spatial locality of instruction cache references.
      3. Instruction cache hit rate and instruction cache miss rate.
      4. Data cache hit rate and instruction cache hit rate.
      5. Data cache miss rate and data cache hit rate.
      6. Average CPI for loads and average CPI for stores. (assume that on a cache hit, the CPI = 1 for both).
    2. Both your L1 instruction and data caches separate a 32-bit address as follows:
      bits  0 -  3 = offset
      bits  4 - 14 = index
      bits 15 - 31 = tag
                    
      What is the size of each cache? How much space is required to store the tags for the L1 instruction cache?
    3. Refer to the cache structure from 2.2 above. For each of the following changes, identify whether it might increase or decrease the hit rate (or whether it's impossible to tell) for the instruction cache and the data cache.
      1. Increase the offset to bits 0-9, decrease the index to 10-14.
      2. Decrease the offset to bit 0, shift the index to bits 1-11, increase the tag to bits 12-31.

  3. Too Small a Backpack

    You have a 2-way set associative L1 cache that is 8KB, with 4-word cache lines. Writing and reading data to L2 takes 10 cycles. You get the following sequence of writes to the cache -- each is a 32-bit address in hexadecimal:
    0x1000
    0x1004
    0x1010
    0x11c0
    0x2000
    0x21c0
    0x3400
    0x3404
    0x3f00
    0x2004
    0x1004
              
    1. How many cache misses occur with an LRU policy?
    2. How many cache misses occur with a most-recently used policy?
    3. Would the miss rate increase or decrease if the cache were the same size but direct-mapped? Explain.
    4. How long does a read-miss eviction take if the cache is write-back, write-allocate? What about a write-miss? (Assume the cache line is dirty)
    5. How long does a read-miss eviction take if the cache is write-through, write-allocate? What about a write-miss?

  4. Los Angeles

    You have an L1 data cache, L2 cache, and main memory. The hit rates and hit times for each are:
    50%  hit rate, 2   cycle hit time to L1.
    70%  hit rate, 15  cycle hit time to L2.
    100% hit rate, 200 cycle hit time to main memory.
              
    1. What fraction of accesses are serviced from L2? From main memory?
    2. What is the miss rate and miss time for the L2 cache?
    3. What is the miss rate and miss time for the L1 cache? (Hint: depends on previous answer).
    4. If main memory is improved by 10%, what is the improvement in L1 miss time?
    5. You remove the L2 to add more L1. As a result, the new L1 hit rate is 75%. What is the improvement in L1 miss time?

  5. Cache Design

    You have 3 cache designs for a 16-bit address machine.
    Dictator:
    Direct-mapped cache.
    Each cache line is 1 byte.
    10-bit index, 6-bit tag.
    1 cycle hit time.
              
    Oligarch:
    2-way set associative cache.
    Each cache line is 1 word (4 bytes).
    7-bit index, 7-bit tag.
    2 cycle hit time.
              
    Democle:
    Fully associative cache with 256 cache lines.
    Each cache line is 1 word.
    14-bit tag.
    5 cycle hit time.
              
    1. What is the size of each cache?
    2. How much space does each cache need to store tags?
    3. Which cache design has the most conflict misses? Which has the least?
    4. If someone told you the hit rate for the 3 caches is 50%, 70% and 90% but did not tell you which hit rate corresponds to which cache, which cache would you guess corresponded to which hit rate? Why?
    5. Assuming the miss time for each is 20 cycles, what is the average service time for each? (Service Time = (hit rate)*(hit time) + (miss rate)*(miss time)).
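Much of the cache arithmetic in this assignment follows from three small computations: splitting an address into tag/index/offset fields, the fraction of accesses each level services, and the service-time formula from problem 5. Here is a hedged sketch; the address split and hit rates below are hypothetical, not taken from the problems.

```python
def split_address(addr, offset_bits, index_bits):
    """Decompose a byte address into (tag, index, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

def service_time(hit_rate, hit_cycles, miss_cycles):
    """Service Time = (hit rate)*(hit time) + (miss rate)*(miss time)."""
    return hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles

# Hypothetical 4-bit offset / 8-bit index split (not one of the configs above):
print(split_address(0x1234, 4, 8))

# Fraction of accesses serviced by L1, L2, and memory (hypothetical 90%/80% hit rates):
l1, l2 = 0.90, 0.80
print(l1, (1 - l1) * l2, (1 - l1) * (1 - l2))

print(service_time(0.90, 1, 20))  # about 2.9 cycles
```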

Deliverable

Type your solutions and upload them as a PDF file attachment to TED. Only a PDF file will be accepted (not Word, not OpenOffice)! Upload by 9am. Only the last submission on TED will be evaluated.

Due: May 30

Assignment 7: Virtual Memory

Changelog

June 8 Solutions are up.

Required Problems

Due: June 6

  1. Virtual and Physical Addresses. For each configuration (a-c), state how many bits are needed for each of the following:

    a. 32-bit operating system, 4-KB pages, 1 GB of RAM

    b. 32-bit operating system, 16-KB pages, 2 GB of RAM

    c. 64-bit operating system, 16-KB pages, 16 GB of RAM

    1. You can fill out a table like this:
      Config.  V. Addr.  P. Addr.  V. Page #  P. Page #  Offset
      a
      b
      c
    2. What are some advantages of using a larger page size?
    3. What are some disadvantages of using a larger page size?

  2. Using the TLB. As described in your textbook, virtual memory uses a page table to track the mapping of virtual addresses to physical addresses. To speed up this translation, modern processors implement a cache of the most recently used translations, called the translation-lookaside buffer (TLB). This exercise shows how the page table and TLB must be updated as addresses are accessed.

    The following list is a stream of virtual addresses as seen on a system. Assume 4-KB pages and a four-entry fully associative TLB with LRU replacement policy. If pages must be brought in from disk, give them the next largest unused page number (i.e., starting at 13).

    Address   Result (H, M, PF)
    0x0FFF
    0x7A28
    0x3DAD
    0x3A98
    0x1C19
    0x1000
    0x22D0

    Initial TLB: (Note: Values are in base-10)
    Valid  Tag  Physical Page #  LRU
    1      11   12               2
    1      7    4                3
    1      3    6                4
    0      4    9                1
    The LRU column works as follows: The older an entry is, the lower its LRU number will be. Each time an entry is used, its LRU number is set to 4 (since it is now the most-recently used), and the other numbers are adjusted downward accordingly.

    Initial Page Table: (Note: Values are in base-10)
    Index  Valid  Physical Page or On Disk
    0      1      5
    1      0      Disk
    2      0      Disk
    3      1      6
    4      1      9
    5      1      11
    6      0      Disk
    7      1      4
    8      0      Disk
    9      0      Disk
    10     1      3
    11     1      12


    Given the address stream, initial TLB, and initial page table, show the final state of the system (TLB and page table). Also show for each memory access whether it is a hit in the TLB (H), a hit in the page table (TLB miss, or M), or a page fault (PF).
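As a hedged illustration of the bookkeeping described above (not the graded answer), here is a small Python sketch: the bit-count arithmetic from problem 1 for a hypothetical configuration, and the LRU update rule from the TLB description, applied to the initial TLB's tags.

```python
from math import log2

def vm_bits(va_bits, page_bytes, ram_bytes):
    """Bits for each column of the problem-1 table."""
    offset = int(log2(page_bytes))
    pa_bits = int(log2(ram_bytes))        # physical addresses need only cover RAM
    return {"V. Addr.": va_bits, "P. Addr.": pa_bits,
            "V. Page #": va_bits - offset, "P. Page #": pa_bits - offset,
            "Offset": offset}

def touch(tlb, tag):
    """Apply the LRU rule above: the touched entry gets the highest LRU number,
    and every entry that was newer shifts down by one."""
    used = next(e for e in tlb if e["tag"] == tag)
    old = used["lru"]
    for e in tlb:
        if e["lru"] > old:
            e["lru"] -= 1
    used["lru"] = len(tlb)
    return tlb

# Hypothetical config (not a-c above): 32-bit OS, 8-KB pages, 4 GB of RAM.
print(vm_bits(32, 8 * 1024, 4 * 1024**3))

tlb = [{"tag": 11, "lru": 2}, {"tag": 7, "lru": 3},
       {"tag": 3, "lru": 4}, {"tag": 4, "lru": 1}]
touch(tlb, 7)                             # an access that hits the tag-7 entry
print([(e["tag"], e["lru"]) for e in tlb])
```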

Deliverable

In your PDF file, please highlight your final answers to make grading easier. Type your solutions and upload them as a PDF file attachment to TED. Only a PDF file will be accepted (not Word, not OpenOffice). Upload by 9am.

Due: June 6