cse141-a: Assignments


Assignments

Assignment 1: Secret Username


Due: January 12

We would like to make all grading transparent and available as soon as possible. To that end, we are asking each student to choose a secret username that will appear on the grade sheets.

Submit your secret username via this Google Form. Please do not choose a username that makes it easy to identify you. In the rare case of a name conflict, you will be notified and asked to resubmit.

Deliverable

Your secret username submitted via the webform.

Due: January 12

Assignment 2: Instruction Set Architecture, Performance, and Spim

Changelog

January 14 Fixed a bug (Fib(10) should be 55) and added link to Spim tutorial.
January 15 Added the text of the book exercises for students without the 4th ed.
January 18 Clarified that you must implement Fibonacci recursively.
January 26 Solutions posted here. You can also download sample Fibonacci programs: fib.s and fib-iter.s

Required Problems

Due: January 19

Click here for a tutorial on PCSpim

Unless otherwise noted, the following problems are from the Patterson & Hennessy textbook (4th ed.). Nathan is the point of contact for this assignment.

  1. Chapter 2: Exercise 2.4. Part (a) only (i.e., 2.4.1a-2.4.6a):

    Parts 2.4.1-3 deal with translating from C to MIPS. Assume that the variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base address of the arrays A and B are in registers $s6 and $s7, respectively.
    Code:
    f = g + h + B[4];


    Parts 2.4.4-6 deal with translating from MIPS to C. Assume that the variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base address of the arrays A and B are in registers $s6 and $s7, respectively.
    Code:

    add $s0, $s0, $s1
    add $s0, $s0, $s2
    add $s0, $s0, $s3
    add $s0, $s0, $s4

  2. Pseudo-instructions: Show how the following pseudo-instructions would be implemented on a real machine (e.g., using $at):
    1. nop # Do nothing (show three different ways)
    2. li $s0, <32-bit constant> # Load 32-bit immediate value into s0
    3. mulou $s0, $s1, $s2 # Unsigned multiply: s0 = low32bits(s1*s2)

  3. New instructions: Implement register indirect conditional branches (beqr and bner) as pseudo-instructions. Give a proposal for adding them to the ISA (i.e., describe how they could be encoded in the I-Type, R-Type, or J-Type format, and propose an opcode for them (use Figure B.10.2)). Give a (brief) argument for or against their inclusion.

  4. Fibonacci sequence: Write MIPS assembly to compute the Nth Fibonacci number. Assume N is passed to your function in register $a0. Your output should be in register $v0 at the end of your function. Submit your code and a screenshot of Spim that shows the registers with correct output value for N=10, i.e., Fib(10) = 55 = 0x37. Note: You must implement Fibonacci recursively. The purpose of this assignment is to learn how to manipulate the stack correctly in MIPS.
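Before wrestling with the stack in MIPS, it can help to pin down the expected value with a quick reference implementation. The Python sketch below is an illustration only (not a substitute for the required recursive MIPS code); it uses the Fib(0) = 0, Fib(1) = 1 convention, under which Fib(10) = 55 = 0x37.

```python
def fib(n):
    # Recursive Fibonacci with fib(0) = 0 and fib(1) = 1,
    # the convention under which fib(10) = 55.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(10), hex(fib(10)))  # 55 0x37
```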

  5. Chapter 1: Exercise 1.5. Part (a) only (i.e., 1.5.1a-1.5.6a):

    Consider two different implementations, P1 and P2, of the same instruction set. There are five classes of instructions (A, B, C, D, and E) in the instruction set. The clock rate and the CPI of each class are given below.
    Table:
    Machine  Clock rate   CPI A   CPI B   CPI C   CPI D   CPI E
    P1         1.0 GHz      1       2       3       4       3
    P2         1.5 GHz      2       2       2       4       4


    The table below shows instruction-type breakdown for different programs. Using this data, you will be exploring the performance tradeoffs with different changes made to a MIPS processor.
    Table:
                            # Instructions
    Program       Compute  Load  Store  Branch  Total
    Program 1     1000     400   100    50      1550
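The machinery behind exercise 1.5 is the basic CPU time equation: cycles = Σ (count of class i) × (CPI of class i), and time = cycles / clock rate. The sketch below applies it to the P1/P2 table with hypothetical per-class instruction counts; the real counts come from the exercise itself.

```python
# Hypothetical instruction counts per class; substitute the exercise's values.
counts = {"A": 500, "B": 200, "C": 100, "D": 150, "E": 50}

cpi = {  # per-class CPI, from the table above
    "P1": {"A": 1, "B": 2, "C": 3, "D": 4, "E": 3},
    "P2": {"A": 2, "B": 2, "C": 2, "D": 4, "E": 4},
}
clock_hz = {"P1": 1.0e9, "P2": 1.5e9}

exec_time = {}
for machine in ("P1", "P2"):
    cycles = sum(counts[c] * cpi[machine][c] for c in counts)
    exec_time[machine] = cycles / clock_hz[machine]
    print(f"{machine}: {cycles} cycles, {exec_time[machine] * 1e9:.0f} ns")
```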

Deliverable

Place your typed (or neatly written) solutions to the problems in Nathan's campus mailbox (2nd floor of the CSE building) before 3:30pm.

Due: January 19

Assignment 3: Measuring Performance and Other ISAs

Changelog

January 22 Questions 4 & 5 were removed. If you already did them, they will be on the next assignment, so hold onto your answers.
January 22 Parts c. and d. were added to Question 3.
February 5 Solutions posted here.

Required Problems

Due: January 26

The problems from the textbook are from Patterson & Hennessy (4th ed). The contact point for this assignment is Manoj.

  1. Processor Performance: The table below describes the performance of two processors, the rAlpha and the c86, with two compilers on a common 'benchmark' program.

              GHz   Cost   Compiler A               Compiler B
                           Instructions  Avg. CPI   Instructions  Avg. CPI
    rAlpha    3.4   $100   7000          1.2        5000          1.5
    c86       2.6   $100   1500          2.2        1000          4.0

    a. Which is the best compiler-machine combination?

    b. Both hardware vendors release quad-core versions of the processor for 1.7x the cost of the single core version. The results on the two new systems:

              GHz   Cost   Compiler A               Compiler B
                           Instructions  Avg. CPI   Instructions  Avg. CPI
    rAlphaX   3.4   $170   2250/core     1.5        1750/core     1.8
    c86x4     2.6   $170   625/core      3.2        500/core      5.0

    Assume the benchmarks used are perfectly parallelizable, i.e., the cores process equal parts of the program independently and at the same time. Now which combination performs best?

    c. Using the metric (Cycles/s)*Cores/(Dollars^2), which is the best machine? Is this metric useful?

    d. The four processors have the following power ratings:
              rAlpha  20 W
              c86     35 W
              rAlphaX 50 W
              c86x4   80 W
            
    If you wanted to show that the first processor had the best power efficiency for performance, what metric would you use? For this part, use values only from Compiler A.

  2. Amdahl's Law: Exercises 1.16.1a - 1.16.3a, 1.16.4.

    The following table shows the execution time of five routines of a program running on different numbers of processors.

    # Processors  Routine A (ms)  Routine B (ms)  Routine C (ms)  Routine D (ms)  Routine E (ms)
    2                  20              80              10              70               5

    1.16.1a: Find the total execution time, and how much it is reduced if the time of routines A, C, & E is improved by 15%.

    1.16.2a: By how much is the total time reduced if routine B is improved by 10%?

    1.16.3a: By how much is the total time reduced if routine D is improved by 10%?

    1.16.4: Execution time in a multiprocessor system can be split into computing time for the routines plus routing time spent sending data from one processor to another. Consider the execution time and routing time given in the following table.

    # Processors  Routine A (ms)  Routine B (ms)  Routine C (ms)  Routine D (ms)  Routine E (ms)  Routing (ms)
    2                  20              78               9              65               4              11
    4                  12              44               4              34               2              13
    8                   1              23               3              19               3              17
    16                  4              13               1              10               2              22
    32                  2               5               1               5               1              23
    64                  1               3             0.5               1               1              26

    For each doubling of the number of processors, determine the ratio of new to old computing time and the ratio of new to old routing time.
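Amdahl's Law here is mostly bookkeeping: an X% improvement to a routine removes X% of that routine's time from the total. A short sketch for 1.16.1a using the 2-processor row above (a worked check under that reading, not the full solution):

```python
times = {"A": 20, "B": 80, "C": 10, "D": 70, "E": 5}  # ms, 2-processor row

total = sum(times.values())           # total execution time

# Routines A, C, and E each get 15% faster, so each sheds 15% of its time.
reduction = 0.15 * (times["A"] + times["C"] + times["E"])
print(total, "ms total;", total - reduction, "ms after the improvement")
```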

  3. MIPS ISA: Exercises 2.12.1 - 2.12.3. Parts c and d are not from the book!
    In the following problems, you will investigate the impact of certain modifications to the MIPS ISA.

    a. 8 Registers
    b. 10 bit immediate constants
    c. 128 Registers
    d. All arithmetic instructions can use base addressing mode for the last argument (example: add $a0, $a2, 0($t1))

    2.12.1: If the instruction set of the MIPS processor is modified, the instruction format must also be changed. For each of the suggested changes, show the size of the bit fields of an R-type format instruction. What is the total number of bits needed for each instruction?

    2.12.2: If the instruction set of the MIPS processor is modified, the instruction format must also be changed. For each of the suggested changes, show the size of the bit fields of an I-type format instruction. What is the total number of bits needed for each instruction?

    2.12.3: Why could the suggested change decrease the size of a MIPS assembly program? Why could the suggested change increase the size of a MIPS assembly program?

Deliverable

Place your typed (or neatly written) solutions to the problems in Manoj's campus mailbox (room 2237 of the CSE building) before 3:15pm.

Due: January 26

Assignment 4: Other ISAs, continued

Changelog

January 27 Submit this assignment online, via TED. Do not submit it to either TA's mailbox.
January 30 Fixed a bug in the x86 code for problem 2. (The instructions before main were never executed, leading to undefined behavior.) Also clarified Spim screenshot submission.
February 9 Solutions posted here. You can also download sample MIPS code for question 2: a4q2.s

Required Problems

Due: February 2

In this assignment, you may need gcc and gdb. If you are not familiar with these tools, here is a short tutorial on their basic use. Nathan is the point of contact for this assignment.

Deliverable

Submit this assignment online via TED before 3:15pm. Upload your MIPS code and full Spim screenshot as separate attachments. Remember that your screenshot must show all of Spim, not just the registers.

Due: February 2

Assignment 5: Pipelining and Hazards

Changelog

February 7 Clarified wording on Question 3.1
February 15 Solutions posted here.

Required Problems

Due: February 9

  1. Processor Performance
    The critical path latencies for the 7 major blocks in a simple processor are given below.
    I-Mem Add Mux ALU Regs D-Mem Control
    a. 400ps 100ps 30ps 120ps 200ps 350ps 100ps
    b. 500ps 150ps 100ps 180ps 220ps 1000ps 65ps

    For each part, answer the following questions:
    1. What is the critical path for a MIPS ADD instruction? Explain your break-up.
    2. If the number of registers is doubled, this increases Regs by 100ps and Control by 20ps. This results in 10% fewer instructions due to fewer load/stores. What is the new critical path for a MIPS ADD instruction?
  2. Pipelining
    The 5 stages of the processor have the following latencies:
    Fetch Decode Execute Memory Writeback
    a. 300ps 400ps 350ps 500ps 100ps
    b. 200ps 150ps 120ps 190ps 140ps

    Assume that when pipelining, each pipeline stage costs 20ps extra for the registers between pipeline stages.
    1. Non-pipelined processor: what is the cycle time? What is the latency of an instruction? What is the throughput?
    2. Pipelined processor: What is the cycle time? What is the latency of an instruction? What is the throughput?
    3. If you could split one of the pipeline stages into 2 equal halves, which one would you choose? What is the new cycle time? What is the new latency? What is the new throughput?
    4. Assume the distribution of instructions that run on the processor is:
      	50%: ALU
      	25%: BEQ
      	15%: LW
      	10%: SW
                      
      Assuming there are no stalls or hazards, what is the utilization of the data memory? What is the utilization of the register block's write port? (Utilization in percentage of clock cycles used)
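As a sanity check on questions 2.1-2.2, the arithmetic for latency set (a) can be sketched as follows. A non-pipelined cycle must cover every stage; a pipelined cycle is set by the slowest stage plus the 20 ps register overhead, and once the pipeline is full, throughput is one instruction per cycle.

```python
stages = {"Fetch": 300, "Decode": 400, "Execute": 350,
          "Memory": 500, "Writeback": 100}        # ps, latency set (a)
REG_OVERHEAD = 20                                  # ps per pipeline register

# Non-pipelined: one cycle covers all stages; latency = cycle time.
single_cycle = sum(stages.values())

# Pipelined: cycle time = slowest stage + register overhead.
pipelined_cycle = max(stages.values()) + REG_OVERHEAD
latency = pipelined_cycle * len(stages)            # one instruction, end to end

# Throughput once full: one instruction per cycle, i.e. 1/pipelined_cycle.
print(single_cycle, pipelined_cycle, latency)
```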
  3. Data Hazards
    Consider the following sequence of instructions:
      lw  $s2, 0($s1)
      lw  $s1, 40($s6)
      sub $s6, $s1, $s2
      add $s6, $s2, $s2
      or  $s3, $s6, $zero
      sw  $s6, 50($s1)
                
    1. List the Read-After-Write data dependencies.
    2. Assume the 5-stage MIPS pipeline with no forwarding, and each stage takes 1 cycle. Instead of inserting nops, you let the processor stall on hazards. How many times does the processor stall? How long is each stall (in cycles)? What is the execution time (in cycles) for the whole program?
    3. Assume the 5-stage MIPS pipeline with full forwarding. Write the program with nops to eliminate the hazards. (Hint: time travel is not possible!)
  4. More Pipelines
    You are given a non-pipelined processor design which has a cycle time of 10ns and average CPI of 1.4. Calculate the latency speedup in the following questions.
    1. What is the best speedup you can get by pipelining it into 5 stages?
    2. If the 5 stages are 1ns, 1.5ns, 4ns, 3ns, and 0.5ns, what is the best speedup you can get compared to the original processor?
    3. If each pipeline stage added also adds 20ps due to register setup delay, what is the best speedup you can get compared to the original processor?
    4. The pipeline from Q4.3 stalls 20% of the time for 1 cycle and 5% of the time for 2 cycles (these occurrences are disjoint). What is the new CPI? What is the speedup compared to the original processor?
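The stall arithmetic in Q4.4 composes with the cycle time from Q4.3. A hedged sketch of that composition, under the assumption that the effective CPI is the base CPI plus the expected stall cycles per instruction:

```python
# Original, non-pipelined processor: 10 ns cycle, average CPI of 1.4.
orig_time = 10.0 * 1.4                    # ns per instruction

# Q4.3 pipeline: slowest stage is 4 ns, plus 20 ps register setup delay.
cycle = 4.0 + 0.020                       # ns

# Q4.4: 20% of instructions stall 1 cycle, 5% stall 2 cycles (disjoint).
stall_cpi = 1.0 + 0.20 * 1 + 0.05 * 2

speedup = orig_time / (stall_cpi * cycle)
print(f"CPI = {stall_cpi:.2f}, speedup = {speedup:.2f}")
```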

Question data and statements are in part derived from Patterson & Hennessy, 4th edition.

Deliverable

Submit your solutions via TED by 3:15pm.

Due: February 9

Assignment 6: Branch Prediction and Intro to Caches

Changelog

February 26 Clarified that branch predictor accuracy means percentage of correct guesses.
February 27 Clarified wording of last question: Assume byte-addressable memory, and "cache entries" means the same thing as cache lines or cache blocks.
March 6 Solutions posted here.

Required Problems

Due: March 1

  1. Measuring branch prediction speedup: You'd like to add a branch predictor to your Baseline processor, and you're considering two options: PikaChooser and CharWizard. Evaluate the speedup of each relative to your Baseline if branches are 15% of all instructions. Assume normal CPI is 1, but the branch mispredict penalty is 2 extra stall cycles.

    PikaChooser:
    CharWizard:
    1. Which predictor would you choose?
    2. If branches are instead 25% of all instructions, which predictor would you choose?
  2. Measuring branch predictor accuracy: This exercise examines the accuracy of various branch predictors for the following repeating patterns (e.g., in a loop) of branch outcomes. Accuracy is defined as the percentage of guesses that are correct. Answer each question (1-4) for each sequence (a-b):


    Static branch prediction:
    1. What is the accuracy of always-taken and always-not-taken predictors for this sequence of branch outcomes?

    Dynamic branch prediction:
    2. What is the accuracy of a predict-last-taken predictor for the first 5 branches in this sequence? Assume this predictor starts in the "Predict not taken" state.
    3. What is the accuracy of a 2-bit predictor for the first 5 branches in this sequence? Using the diagram in slide 115 ("ImplementingMIPS-e" PDF page 196), assume this predictor starts in the "Strongly not taken" state.
    4. What is the accuracy of a 2-bit predictor if this pattern is repeated forever? Using the diagram in slide 115 ("ImplementingMIPS-e" PDF page 196), assume this predictor starts in the "Strongly not taken" state.
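For checking your 2-bit answers by hand, the sketch below simulates a standard 2-bit saturating counter starting from "Strongly not taken". This is an assumption: verify its transitions against the diagram on slide 115, since not every 2-bit scheme is a pure saturating counter.

```python
def two_bit_accuracy(outcomes, state=0):
    """state: 0 = strongly not taken, 1 = weakly not taken,
              2 = weakly taken,       3 = strongly taken.
    outcomes: iterable of booleans (True = branch taken).
    Returns the fraction of predictions that were correct."""
    correct = 0
    for taken in outcomes:
        prediction = state >= 2              # predict taken in the upper states
        correct += (prediction == taken)
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(outcomes)

# Example: five taken branches, starting from strongly not taken.
print(two_bit_accuracy([True] * 5))  # 0.6 (first two predictions are wrong)
```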
  3. Introduction to caches: For a direct-mapped cache design with a 32-bit address and byte-addressable memory, the following bits of the address are used to access the cache:

        Tag      Index     Offset
    a.  31-10     9-5      4-0
    b.  31-12    11-6      5-0
    
    For each configuration (a and b):
    1. What is the cache line size (in words)?
    2. How many entries (cache lines) does the cache have? Cache lines are also called blocks.
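The decoding for question 3 follows directly from the bit widths: the offset selects a byte within a line, so the line size is 2^offset bytes, and the index selects a line, so there are 2^index entries. A small sketch, assuming byte-addressable memory and 4-byte words as stated:

```python
def cache_geometry(offset_bits, index_bits):
    line_bytes = 2 ** offset_bits        # offset selects a byte within a line
    lines = 2 ** index_bits              # index selects one of the lines
    return line_bytes // 4, lines        # (words per line, number of lines)

print(cache_geometry(5, 5))  # config (a): bits 4-0 offset, 9-5 index
print(cache_geometry(6, 6))  # config (b): bits 5-0 offset, 11-6 index
```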

Deliverable

Type your solutions and upload them as a PDF file attachment to TED. Only a PDF file will be accepted (not Word, not OpenOffice). In your PDF, please highlight your final answers to make grading easier. Upload by 3:15pm.

Due: March 1

Assignment 7: Caches

Changelog

March 2 Q1 instruction mix corrected to add up to 100%.
March 5 Q2 cache size corrected to 8KB. Hint: with the incorrect 32KB cache size, you would end up with no misses since the tag is always 0. Use this to check your cache geometry.
March 15 Solutions posted here.

Required Problems

Due: March 8

  1. Your Backpack

    The Ca$h processor has a very strange ISA. Popular programs on it have the following instruction breakdown:
    45% ALU
     5% Branch
    10% Store
    40% Load
              
    The processor also has some interesting properties; in particular, assume there is no branch prediction. The following questions deal with the L1 cache behavior of the Ca$h processor.
    1. For each of these pairs, identify which is likely to be higher, and why.
      1. Spatial locality of instruction cache references and spatial locality of data cache references.
      2. Temporal locality of instruction cache references and spatial locality of instruction cache references.
      3. Instruction cache hit rate and instruction cache miss rate.
      4. Data cache hit rate and instruction cache hit rate.
      5. Data cache miss rate and data cache hit rate.
      6. Average CPI for loads and average CPI for stores. (assume that on a cache hit, the CPI = 1 for both).
    2. Both your L1 instruction and data caches separate a 32-bit address as follows:
      bits  0 -  3 = offset
      bits  4 - 14 = index
      bits 15 - 31 = tag
                    
      What is the size of each cache? How much space is required to store the tags for the L1 instruction cache?
    3. Refer to the cache structure from 1b. For each of the following changes, identify whether the hit rate for the instruction cache and the data cache is likely to increase or decrease, or whether it's impossible to tell.
      1. Increase the offset to bits 0-9, decrease the index to 10-14.
      2. Decrease the offset to bit 0, shift the index to bits 1-11, increase the tag to bits 12-31.
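For part 1.2, the bit split determines the geometry: 4 offset bits give 16-byte lines, and 11 index bits give 2048 indexed entries. The sketch below assumes a direct-mapped cache (one line per index; multiply by the associativity otherwise) and counts only tag bits, not valid or dirty bits.

```python
offset_bits, index_bits, tag_bits = 4, 11, 17   # from the split above (4+11+17 = 32)

line_bytes = 2 ** offset_bits                   # 16-byte lines
lines = 2 ** index_bits                         # 2048 lines if direct-mapped
size_kb = line_bytes * lines // 1024            # total data capacity
tag_store_bits = lines * tag_bits               # tag storage, excluding valid/dirty

print(size_kb, "KB of data;", tag_store_bits, "bits of tags")
```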

  2. Too Small a Backpack

    You have a 2-way set associative L1 cache that is 8KB, with 4-word cache lines. Writing data to L2 takes 10 cycles. You get the following sequence of writes to the cache -- each is a 32-bit address in hexadecimal:
    0x1000
    0x1004
    0x1010
    0x11c0
    0x2000
    0x21c0
    0x3400
    0x3404
    0x3f00
    0x2004
    0x1004
              
    1. How many cache misses occur with an LRU policy?
    2. How many cache misses occur with a most-recently used policy?
    3. Would the miss rate increase or decrease if the cache were the same size, but direct-mapped? Explain.
    4. How long does a read-miss eviction take if the cache is write-back, write-allocate? What about a write-miss? (Assume the cache line is dirty)
    5. How long does a read-miss eviction take if the cache is write-through, write-allocate? What about a write-miss?
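A small simulator is a handy way to double-check an LRU miss count worked out by hand. The sketch below models the stated cache (8 KB, 2-way, 16-byte lines, hence 256 sets); it is a checking aid under those assumptions, not the required worked answer.

```python
def simulate(addresses, sets=256, ways=2, line_bytes=16):
    """Count misses in a set-associative cache with LRU replacement.
    Defaults model the cache above: 256 sets * 2 ways * 16 B = 8 KB."""
    cache = [[] for _ in range(sets)]   # per set: tags, least recently used first
    misses = 0
    for addr in addresses:
        index = (addr // line_bytes) % sets
        tag = addr // (line_bytes * sets)
        tags = cache[index]
        if tag in tags:
            tags.remove(tag)            # hit: refresh to most-recently-used
        else:
            misses += 1
            if len(tags) == ways:
                tags.pop(0)             # evict the least-recently-used line
        tags.append(tag)
    return misses
```

Swapping `tags.pop(0)` for `tags.pop()` evicts the most-recently-used tag instead, which models part 2.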

  3. Los Angeles

    You have an L1 data cache, L2 cache, and main memory. The hit rates and hit times for each are:
    50%  hit rate, 2   cycle hit time to L1.
    70%  hit rate, 15  cycle hit time to L2.
    100% hit rate, 200 cycle hit time to main memory.
              
    1. What fraction of accesses are serviced from L2? From main memory?
    2. What is the miss rate and miss time for the L2 cache?
    3. What is the miss rate and miss time for the L1 cache? (Hint: depends on previous answer).
    4. If main memory is improved by 10%, what is the improvement in L1 miss time?
    5. You remove the L2 to add more L1. As a result, the new L1 hit rate is 75%. What is the improvement in L1 miss time?
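The hierarchy questions above chain together: a level's miss time is, under one common convention (assumed here; check it against the lecture's definitions), the average service time of the level below it. A sketch of that chain:

```python
l1_hit, l1_time = 0.50, 2     # L1: 50% hit rate, 2-cycle hit time
l2_hit, l2_time = 0.70, 15    # L2: 70% hit rate, 15-cycle hit time
mem_time = 200                # main memory always hits

# Q3.1: fraction of all accesses serviced by each level.
from_l2 = (1 - l1_hit) * l2_hit
from_mem = (1 - l1_hit) * (1 - l2_hit)

# Assumed convention: a level's miss time is the average service
# time of the level below it.
l2_avg = l2_hit * l2_time + (1 - l2_hit) * mem_time   # L2 average service time
l1_avg = l1_hit * l1_time + (1 - l1_hit) * l2_avg     # L1 average service time

print(from_l2, from_mem, l2_avg, l1_avg)
```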

  4. Cache Design

    You have 3 cache designs for a 16-bit address machine.
    Dictator:
    Direct-mapped cache.
    Each cache line is 1 byte.
    10-bit index, 6-bit tag.
    1 cycle hit time.
              
    Oligarch:
    2-way set associative cache.
    Each cache line is 1 word (4 bytes).
    7-bit index, 7-bit tag.
    2 cycle hit time.
              
    Democle:
    Fully associative cache with 256 cache lines.
    Each cache line is 1 word.
    14-bit tag.
    5 cycle hit time.
              
    1. What is the size of each cache?
    2. How much space does each cache need to store tags?
    3. Which cache design has the most conflict misses? Which has the least?
    4. If someone told you the hit rate for the 3 caches is 50%, 70% and 90% but did not tell you which hit rate corresponds to which cache, which cache would you guess corresponded to which hit rate? Why?
    5. Assuming the miss time for each is 20 cycles, what is the average service time for each? (Service Time = (hit rate)*(hit time) + (miss rate)*(miss time)).
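The service-time formula in Q5 is mechanical once the hit rates are fixed. The sketch below evaluates it; note the pairing of hit rates to designs is hypothetical, since Q4 asks you to argue for a pairing yourself.

```python
MISS_TIME = 20  # cycles, given in Q5

def service_time(hit_rate, hit_time):
    """Service Time = (hit rate)*(hit time) + (miss rate)*(miss time)."""
    return hit_rate * hit_time + (1 - hit_rate) * MISS_TIME

# One hypothetical pairing of hit rates to designs:
print(service_time(0.50, 1))  # e.g., Dictator at 50%
print(service_time(0.70, 2))  # e.g., Oligarch at 70%
print(service_time(0.90, 5))  # e.g., Democle at 90%
```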

Deliverable

Type your solutions and upload them as a PDF file attachment to TED. Only a PDF file will be accepted (not Word, not OpenOffice)! Upload by 3:15pm. Only the last submission on TED will be evaluated.

Due: March 8

Assignment 8: Virtual Memory

Changelog

March 12 Clarified that TLB and page table entries are base-10 values.
March 16 Solutions posted here.

Required Problems

Due: March 15

  1. Virtual and Physical Addresses. For each configuration (a-c), state how many bits are needed for each of the following:

    a. 32-bit operating system, 4-KB pages, 1 GB of RAM

    b. 32-bit operating system, 16-KB pages, 2 GB of RAM

    c. 64-bit operating system, 16-KB pages, 16 GB of RAM

    1. You can fill out a table like this:
      Config. V. Addr. P. Addr. V. Page # P. Page # Offset
      a
      b
      c
    2. What are some advantages of using a larger page size?
    3. What are some disadvantages of using a larger page size?
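The table in question 1.1 can be filled in mechanically: the page offset is log2(page size), the virtual page number is the remaining virtual-address bits, and likewise on the physical side. The sketch below assumes the physical address is exactly wide enough to address all of RAM.

```python
import math

def vm_bits(va_bits, page_bytes, ram_bytes):
    """Bit widths for one row of the table in Q1.1.
    Assumes physical address width = bits needed to address all of RAM."""
    offset = int(math.log2(page_bytes))
    pa_bits = int(math.log2(ram_bytes))
    return {"V. Addr.": va_bits, "P. Addr.": pa_bits,
            "V. Page #": va_bits - offset, "P. Page #": pa_bits - offset,
            "Offset": offset}

# Configuration (a): 32-bit OS, 4-KB pages, 1 GB of RAM
print(vm_bits(32, 4 * 1024, 1 << 30))
```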

  2. Using the TLB. As described in your textbook, virtual memory uses a page table to track the mapping of virtual addresses to physical addresses. To speed up this translation, modern processors implement a cache of the most recently used translations, called the translation-lookaside buffer (TLB). This exercise shows how the page table and TLB must be updated as addresses are accessed.

    The following list is a stream of virtual addresses as seen on a system. Assume 4-KB pages and a four-entry fully associative TLB with LRU replacement policy. If pages must be brought in from disk, give them the next largest unused page number (i.e., starting at 13).

    Address   Result (H, M, or PF)
    0x0FFF
    0x7A28
    0x3DAD
    0x3A98
    0x1C19
    0x1000
    0x22D0

    Initial TLB: (Note: values are in base 10.)

    Valid  Tag  Physical Page #  LRU
    1      11   12               2
    1      7    4                3
    1      3    6                4
    0      4    9                1

    The LRU column works as follows: the older an entry is, the lower its LRU number. Each time an entry is used, its LRU number is set to 4 (since it is now the most recently used), and the other numbers are adjusted downward accordingly.

    Initial Page Table: (Note: values are in base 10.)

    Index  Valid  Physical Page or On Disk
    0      1      5
    1      0      Disk
    2      0      Disk
    3      1      6
    4      1      9
    5      1      11
    6      0      Disk
    7      1      4
    8      0      Disk
    9      0      Disk
    10     1      3
    11     1      12


    Given the address stream, initial TLB, and initial page table, show the final state of the system (TLB and page table). Also show for each memory access whether it is a hit in the TLB (H), a hit in the page table (TLB miss, or M), or a page fault (PF).
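To check a hand simulation, the update rules can be sketched as a small model: a fully associative, 4-entry TLB with LRU replacement in front of a page table, where a faulting page receives the next unused physical page number (starting at 13). The list ordering below stands in for the 1-4 LRU numbers in the table (least recently used first); seed `tlb` and `page_table` from the initial state above.

```python
TLB_ENTRIES = 4

def access(tlb, page_table, vpn, next_page):
    """One virtual-page access.
    tlb:        list of [vpn, ppn] pairs, least recently used first.
    page_table: dict mapping vpn -> ppn, or None if the page is on disk.
    Returns (result, next_page) where result is 'H' (TLB hit),
    'M' (TLB miss, page-table hit), or 'PF' (page fault)."""
    for entry in tlb:
        if entry[0] == vpn:
            tlb.remove(entry)            # hit: move to most-recently-used slot
            tlb.append(entry)
            return "H", next_page
    ppn = page_table.get(vpn)
    if ppn is None:                      # page fault: bring the page in from disk
        ppn = next_page
        page_table[vpn] = ppn
        next_page += 1
        result = "PF"
    else:
        result = "M"
    if len(tlb) == TLB_ENTRIES:
        tlb.pop(0)                       # evict the least-recently-used entry
    tlb.append([vpn, ppn])
    return result, next_page

# With 4-KB pages, the vpn of an address is simply address // 4096.
```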

Deliverable

Type your solutions and upload them as a PDF file attachment to TED. Only a PDF file will be accepted (not Word, not OpenOffice). In your PDF, please highlight your final answers to make grading easier. Upload by 3:15pm.

Due: March 15