cse141-b: Assignments


Homework Policy

Integrity Policy

Assignments

Assignment 1: Secret Username and Clicker Registration

Part 1: Secret Username

Due: January 8

We would like to make all grading transparent and available as soon as possible. Because of this, we are requesting students to choose a secret username which will appear in grade sheets.

Submit your secret username via this Google Form. Please do not make it obvious to identify you. In the rare case of name conflicts, you will be notified to resubmit.

Deliverable

Your secret username submitted via the webform.

Due: January 8

Part 2: Register for the class on Piazza

Due: January 8

Login into http://www.piazza.com and register for this class.

Deliverable

None.

Due: January 8

Part 3: Register your Clicker

Due: January 8

Register your clicker on Ted.

  1. Log into Ted.
  2. Choose "Course tools" from the blue menu on the left.
  3. Choose "i>Clicker Remote Registration"

Deliverable

Successfully take the reading quiz on Thursday.

Due: January 8

Assignment 2: Instruction Set Architecture, Performance and Other ISAs

Required Problems

Due: January 15

Unless otherwise noted, the following problems are from the Patterson & Hennessy textbook (5th ed.).

  1. Translation between C and MIPS:

    (1) C to MIPS. Assume that the variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base address of the arrays A and B are in registers $s6 and $s7, respectively.
    C Code:
    f = g + A[B[4]-B[3]];


    (2) MIPS to C. Assume that the variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base address of the arrays A and B are in registers $s6 and $s7, respectively.
    MIPS Code:
    sll $t0, $s0, 2
    add $t0, $s6, $t0
    sll $t1, $s1, 2
    add $t1, $s7, $t1
    lw $s0, 0($t0)
    addi $t2, $t0, 4
    lw $t0, 0($t2)
    add $t0, $t0, $s0
    sw $t0, 0($t1)


  2. Pseudo-instructions: Show how to implement the following pseudo-instructions as they would be implemented in a real machine (e.g., using $at):
    1. nop # Do nothing (show three different ways)
    2. li $s0, <32-bit constant> # Load 32-bit immediate value into s0
    3. div $s0, $s1, $s2 # Integer division: s0 = s1/s2

  3. New instructions: Implement register indirect conditional branches (beqr and bner). Give a proposal for adding them to the ISA (i.e., describe how they could be encoded in the I-Type, R-Type, or J-Type format, and propose an opcode for them (use Figure A.10.2 MIPS opcode map)). Give a (brief) argument for or against their inclusion.

Deliverable

Place your typed (or well written) solutions to the problems in the TA's mailbox (Nam, Soohyun) 15 minutes before the class starts. It is located in the mailroom on the 2nd floor of CSE building.

Due: January 15

Assignment 3: Instruction Set Architecture, Performance and Other ISAs (continued)

Required Problems

Due: January 22

  1. MIPS ISA: Exercise 2.18.
    Assume that we would like to expand the MIPS register file to 128 registers and expand the instruction set to contain four times as many instructions.

    (1) How would this affect the size of each of the bit fields in the R-type insructions?

    (2) How would this affect the size of each of the bit fields in the I-type instructions?

    (3) How could each of two propose changes decrease the size of an MIPS assembly program? On the other hand, how could the proposed change increase the size of an MIPS assembly program?


  2. Recursive C Procedure to MIPS Exercise 2.31.
    Implement the the Fibonacci in MIPS assembly given the following C code. What is the total number of MIPS instructions needed to execute this function?

    Note that this code contains two recursive calls.


  3. Delay Slot
    Determine all places in the following code that a delay slot is required. Fill those slots with nop. Then try to make this code more efficient by rearranging the instructions and replacing the nops with useful instructions. Note that after reordering the functionality of code must be preserved. Assume you know nothing about the missing code (filled with ....). Explain why your arrangement works.

Deliverable

Place your typed (or well written) solutions to the problems in Atieh's mailbox (EBU3B, room 2237 (Grad mailroom), labled "Lotfi, Atieh") 15 minutes before the class on Jaunary 22 starts.

Due: January 22

Assignment 4: Performance Measurement

Required Problems

Due: January 29

  1. Performance Processors: The table below describes the performance of two processors, the rAlpha and the c86, and two compilers on a common 'benchmark' program.
    GHz Cost Compiler A Compiler B
    Instructions Average CPI Instructions Average CPI
    rAlpha 3.4 $100 7000 1.2 5000 1.5
    c86 2.6 $100 1500 2.2 1000 4.0

    a. Which is the best compiler-machine combination?

    b. Both hardware vendors release quad-core versions of the processor for 1.7x the cost of the single core version. The results on the two new systems:

    GHz Cost Compiler A Compiler B
    Instructions Average CPI Instructions Average CPI
    rAlphaX 3.4 $170 2250/core 1.5 1750/core 1.8
    c86x4 2.6 $170 625/core 3.2 500/core 5.0
    Assume the benchmarks used are perfectly parallelizable i.e. the cores process equal parts of the program independently and at the same time. Now which combination performs best?

    c. Using the metric (Cycles/sec)*Cores/(Dollars^2), which is the best machine? Is this metric useful?

    d. The four processors have the following power ratings:
              rAlpha  20 W
              c86     35 W
              rAlphaX 50 W
              c86x4   80 W
            
    If you wanted to show that the first processor had the best power efficiency for performance, what metric would you use? For this part, use values only from Compiler A.

  2. Performance:
    Consider three different processors, P1 P2 and P3, executing the same instruction set. P1 has a 3 GHz clock rate and a CPI of 1.5. P2 has a 2.5 GHz clock rate and a CPI of 1.0. P3 has a 4.0 GHz clock rate and a CPI of 2.2.

Deliverable

Place your typed (or well written) solutions to the problems in Soohyun's mailbox (EBU3B, room 2237 (Grad mailroom), lable "Nam, Soohyun") 15 minutes before the class on Jaunary 29 starts.

Due: January 29

Assignment 5: Amdahl's law, Other ISAs

Required Problems

Due: February 5

  1. Amdahl's Law:

    The following table shows the execution time of five routines of a program running on 16 processors.

    # Processors Routine A (ms) Routine B (ms) Routine C (ms) Routine D (ms) Routine E (ms)
    16 4 14 2 12 2

    (1) Find the total execution time, and how much it is reduced if the time of routines A, C, & E is improved by 15%.

    (2) By how much is the total time reduced if routine B is improved by 10%?

    (3) By how much is the total time reduced if routine D is improved by 10%?

  2. x86: The following is x86 assembly. Write equivalent code in C and MIPS assembly. (It will probably be easier for you to write the C code first, and then the MIPS assembly equivalent.) For C code, include a mapping between the variables you use and the x86 registers; for MIPS code, include a mapping between the MIPS registers you use and the x86 registers.
    Note: Be careful about the syntax; this code is in AT&T format.
    	.text
    	.globl main
    main:
    	movl $0x0, %ebx
    	movl $0x0, %ecx
    	jmp  check
    loop:
    	mov  %ecx, %eax
    	add  %eax, %ebx
    	addl $0x1, %ecx
    check:
    	cmpl $0x63, %ecx
    	jle  loop
    	ret
    
    To figure out what the assembly code does, use the three documents listed below under References. The first one should have most of what you need and includes common addressing modes as well, so start there. However, depending on which version of gcc your machine has installed and which particular processor you are running, you may encounter additional instructions. The other documents or Google searches will be helpful in this case. Unfortunately, we have not found a good, concise listing of x86 instructions.

  3. VLIW: In this problem, you are given a tool to transform machine code that is written for the MIPS ISA to code in a VLIW ISA.
    The VLIW ISA is identical to MIPS except that multiple instructions can be grouped together into one VLIW instruction. Up to N MIPS instructions can be grouped together (N is the machine width, which depends on the particular machine). The transformation tool cannot reorder MIPS instructions to fill VLIW instructions. You are given the following MIPS program (we have numbered the instructions for reference below):
    (01) lw $t0, 0($a0)
    (02) lw $t2, 8($a0)
    (03) lw $t1, 4($a0)
    (04) add $t6, $t0, $t1
    (05) lw $t3, 12($a0)
    (06) sub $t7, $t1, $t2
    (07) lw $t4, 16($a0)
    (08) lw $t5, 20($a0)
    (09) srlv $s2, $t6, $t7
    (10) sub $s1, $t4, $t5
    (11) add $s0, $t3, $t4
    (12) sllv $s4, $t7, $s1
    (13) srlv $s3, $t6, $s0
    (14) sllv $s5, $s0, $s1
    (15) add $s6, $s3, $s4
    (16) add $s7, $s4, $s6
    (17) srlv $t0, $s6, $s7
    (18) srlv $t1, $t0, $s7
    
    a. When you run the tool with its settings targeted for a particular VLIW machine, you find that the resulting VLIW code has 9 VLIW instructions. What minimun value of N must the target VLIW machine have?
    b. Write the MIPS instruction numbers (from the code above) corresponding to each VLIW instruction, for this value of N.
    MIPS Inst. No. MIPS Inst. No. MIPS Inst. No. MIPS Inst. No. MIPS Inst. No. MIPS Inst. No. MIPS Inst. No. MIPS Inst. No. MIPS Inst. No.
    VLIW Instruction 1:
    VLIW Instruction 2:
    VLIW Instruction 3:
    VLIW Instruction 4:
    VLIW Instruction 5:
    VLIW Instruction 6:
    VLIW Instruction 7:
    VLIW Instruction 8:
    VLIW Instruction 9:
  4. References

Deliverable

Place your typed (or well written) solutions to the problems in Atieh's mailbox (EBU3B, room 2237 (Grad mailroom), lable "Lotfi, Atieh") 15 minutes before the class on February 5 starts.

Due: February 5

Assignment 6: Pipelining and Hazards

Required Problems

Due: February 19

  1. Simple Single-cycle Processor Performance
    The critical path latencies for the 7 major blocks in two simple processors are given below.
    I-Mem Add Mux ALU Regs D-Mem Control
    a. 400ps 100ps 30ps 140ps 200ps 350ps 100ps
    b. 500ps 150ps 100ps 180ps 320ps 1000ps 65ps

    For each part, answer the following questions:
    1. What is the critical path for a MIPS ADD instruction? Explain your break-up.Which processor between a and b will finish ADD instruction faster?
    2. If the number of registers is doubled, this increases Regs by 10ps and Control by 150ps. This results in 10% fewer instructions due to fewer load/stores. What is the new critical path for a MIPS ADD instruction?
  2. Pipelining I
    The 5 stages of the processor have the following latencies:
    Fetch Decode Execute Memory Writeback
    a. 300ps 400ps 350ps 550ps 100ps
    b. 200ps 150ps 100ps 190ps 140ps

    Assume that when pipelining, each pipeline stage costs 20ps extra for the registers between pipeline stages.
    1. Non-pipelined processor: what is the cycle time? What is the latency of an instruction? What is the throughput?
    2. Pipelined processor: What is the cycle time? What is the latency of an instruction? What is the throughput?
    3. If you could split one of the pipeline stages into 2 equal halves, which one would you choose? What is the new cycle time? What is the new latency? What is the new throughput?
    4. Assume the distribution of instructions that run on the processor is:
      	50%: ALU
      	25%: BEQ
      	15%: LW
      	10%: SW
                      
      Assuming there are no stalls or hazards, what is the utilization of the data memory? What is the utilization of the register block's write port? (Utilization in percentage of clock cycles used)
  3. Pipelining II
    You are given a non-pipelined processor design which has a cycle time of 10ns and average CPI of 1.4. Calculate the latency speedup in the following questions.Ignore latch time
    1. What is the best speedup you can get by pipelining it into 5 stages?
    2. If the 5 stages are 1ns, 1.5ns, 4ns, 3ns, and 0.5ns, what is the best speedup you can get compared to the original processor?
    3. If each pipeline stage added also adds 20ps due to register setup delay, what is the best speedup you can get compared to the original processor?
    4. The pipeline from 3.3 stalls 20% of the time for 1 cycle and 5% of the time for 2 cycles (these occurences are disjoint). What is the new CPI? What is the speedup compared to the original processor?

Deliverable

Place your typed (or well written) solutions to the problems in Soohyun's mailbox (EBU3B room 2237, labeled as "Nam, Soohyun") before 15 minutes before the class starts. It is located in the mailroom on the 2nd floor of CSE building.

Due: February 19

Assignment 7: Hazard,Branch Prediction, Register Renaming

Required Problems

Due: February 26

  1. Data Hazards
        Sequence of instructions:
    	    lw  $s2, 0($s1)
    	    lw  $s1, 40($s3)
    	    sub $s3, $s1, $s2
    	    add $s3, $s2, $s2
    	    or  $s4, $s3, $zero
    	    sw  $s3, 50($s1)
                
    1. List all RAW and WAW data dependencies in the code.
    2. Assume the 5-stage MIPS pipeline with no forwarding, and each stage takes 1 cycle. Instead of inserting nops, you let the processor stall on hazards. How many times does the processor stall? How long is each stall (in cycles)? What is the execution time (in cycles) for the whole program?
    3. Assume the 5-stage MIPS pipeline with full forwarding. Write the program with nops to eliminate the hazards. (Hint: time travel is not possible!)
  2. Measuring branch predictor accuracy: This exercise examines the accuracy of various branch predictors for the following repeating patterns (e.g., in a loop) of branch outcomes. Accuracy is defined as the percentage of guesses that are correct. Answer each question (1-4) for each sequence (a-b):


      Static branch prediction:
    1. What is the accuracy of always-taken and always-not-taken predictors for this sequence of branch outcomes?

    2. Dynamic branch prediction:
    3. What is the accuracy of a 1-bit predictor for the first 5 branches in this sequence? Assume this predictor starts in the "Predict not taken" state.
    4. What is the accuracy of a 2-bit predictor for the first 5 branches in this sequence? Using the diagram in slide 115 (Slides from "ImplementingMIPS_final.pdf"), assume this predictor starts in the "Strongly not taken (00)" state.
    5. What is the accuracy of a 2-bit predictor if this pattern is repeated forever? Using the diagram in slide 115 (Slides from "ImplementingMIPS_final.pdf"), assume this predictor starts in the "Strongly not taken (00)" state.

  3. Register Renaming
     Consider the following piece of code. 
    	     
    	    add $s1, $s2, $s3
    	    sub $s5, $s2, $s3
    	    and $s7, $s5, $s2
     	    lw $s8, 1000($s5)
    	    beq $s7, $s15, label
    	    nop
    	    add $s6, $s7, $s5
    	    sub $s5, $s7, $s5
    	    lw $s9, 2000($s2)
    	    ...
    	    label: add $s9, $s7, $s5
    	    addi $s5, $s0, 0
                
    To enable greater performance, the compiler should determine what instruction could be safely moved into the branch delay slot (replacing the nop). What choice does the compiler have in the following scenarios?
    Circle all instructions that could be moved into the Branch delay slot.
    1. Software register renaming is disabled
    2. Software register renaming is enabled
    (Assume you know nothing about the missing code (in ellipses), and there is no direct path from the fall through code to the taken path.)

Deliverable

Place your typed (or well written) solutions to the problems in Atieh's mailbox (EBU3B, room 2237 (Grad mailroom), lable "Lotfi, Atieh") 15 minutes before the class on February 26 starts.

Due: February 26

Assignment 8: Introduction to Caches

Required Problems

Due: March 5

  1. Introduction to caches: For a direct-mapped cache design with a 32-bit address and byte-addressable memory, the following bits of the address are used to access the cache:

        Tag      Index     Offset
    a.  31-13    12-5       4-0
    b.  31-12    11-6       5-0
    
    For each configuration (a and b):
    1. What is the cache line size (in words)?
    2. How many entries (cache lines) does the cache have? Cache lines are also called blocks.
  2. Your Backpack: The Ca$h processor has a very strange ISA. Popular programs on it have the following instruction breakdown: 45% ALU 05% Branch 10% Store 40% Loads They also have some interesting properties:

    • all loads are to distinct, often random, memory locations.
    • all stores are to locations that have been loaded from before. 
    • there are no forward branches.
    • branch targets are never more than 5 instructions back.
    
    Assume there is no branch prediction. The following questions deal with the L1 cache behavior for the Ca$h processor.

    1. For each of these pairs, identify which is likely to be higher, and why.
      (a) Spatial locality of instruction cache references vs. spatial locality of data cache references
      (b) Temporal locality of instruction cache references vs. spatial locality of instruction cache references
      (c) Instruction cache hit rate vs. instruction cache miss rate
      (d) Data cache hit rate vs. instruction cache hit rate
      (e) Average CPI for loads vs. average CPI for stores (assume that on a cache hit, the CPI = 1 for both)

    2. Both your L1 instruction and data caches separate a 32-bit address as follows:
          Tag      Index     Offset
         31-15     14-6       5-0
        
      What is the size of entire cache?

Deliverable

Place your typed (or well written) solutions to the problems in Soohyun's mailbox (EBU3B room 2237, labeled as "Nam, Soohyun") before 15 minutes before the class starts. It is located in the mailroom on the 2nd floor of CSE building.

Due: March 5

Assignment 9: Cache, Virtual Memory

Required Problems

Due: March 12

  1. Cache

    You have a 2-way set associative L1 cache that is 8KB, with 4-word cache lines. Writing and reading data to L2 takes 10 cycles. You get the following sequence of writes to the cache each is a 32-bit address in hexadecimal:
    0x1000
    0x1004
    0x1010
    0x11c0
    0x2000
    0x21c0
    0x3400
    0x3404
    0x3f00
    0x2004
    0x1004
              
    1. How many cache misses occur with an LRU policy?
    2. How many cache misses occur with a most-recently used policy?
    3. Would the miss-rate increase or decrease if the cache was the same size, but direct-mapped? Explain.
    4. How long does a read-miss eviction take if the cache is write-back, write-allocate? What about a write-miss? (Assume the cache line is dirty)
    5. How long does a read-miss eviction take if the cache is write-through, write-allocate? What about a write-miss?

  2. Virtual and Physical Addresses. For each configuration (a-c), state how many bits are needed for each of the following:

    a. 32-bit operating system, 4-KB pages, 1 GB of RAM

    b. 32-bit operating system, 16-KB pages, 2 GB of RAM

    c. 64-bit operating system, 16-KB pages, 16 GB of RAM

    1. You can fill out a table like this:
      Config. V. Addr. P. Addr. V. Page # P. Page # Offset
      a
      b
      c
    2. What are some advantages of using a larger page size?
    3. What are some disadvantages of using a larger page size?

Deliverable

Place your typed (or well written) solutions to the problems in Atieh's mailbox (EBU3B, room 2237 (Grad mailroom), lable "Lotfi, Atieh") 15 minutes before the class on March 12 starts.

Due: March 12