cse240a: Assignments



Assignment 1: Paper summaries

Changelog

January 8 The WebCT system for paper summary submission has been set up.

Due: 1 hour before every class meeting

This course is discussion-based, so it is essential that you read the papers and come to class prepared to discuss. The purpose of these summaries is to ensure that you do so and to provide some structure to how you approach the papers.

Reading research papers critically is an essential skill that you need to develop in graduate school. It will serve you well regardless of what kind of job you end up with. There have been several articles written about how to go about this. As your first exercise (in this course, anyway) in reading and synthesizing several articles on a subject, take a look at these three: one, two, three.

You will submit your paper reviews via the "Assessments" section in the CSE240A WebCT. To access WebCT, first log in at http://webctweb.ucsd.edu. Then select "CSE 240A - 2010 Winter Swanson".

You do not need to submit summaries for papers that you present as part of assignment 2.

Deliverable

Your paper review submitted via the web form.

Due: 1 hour before every class meeting

Assignment 2: Class presentations

In place of the prefetcher project/contest you can prepare a presentation for one of the subjects we will be covering in class. There are several motivations for this:

  1. Preparing and delivering presentations is a very important skill. Ideas do not propagate themselves. If you want your ideas to have an impact, you must be able to convey them clearly to others.
  2. Preparing the presentation will give you in-depth exposure to one of the topics we cover. You should choose the topics you present to match your own interests.
  3. By replacing the exams with presentations, you can decide where your work for this class occurs during the quarter. For instance, you can work around conference deadlines and exams in other classes.

Since your presentation will cover an important part of the course's material, it is important that you do a good job. To help you accomplish this, the presentation is broken down into several parts spread out over several weeks.

A note about the deadlines. Once you have been assigned a day, it is your job to track the deadlines for the milestones that precede your presentation. I am not going to track you down asking to see your slides. If you do not get them to me by the deadline, your grade will reflect it.

Attendance during classes with a student presenter is extremely important (very nearly mandatory).

Important: The goal here is to provide you experience in becoming an expert on a topic. This is a difficult task, and this is a learning experience. If you get stuck looking for some key piece of information, come talk to me. I can probably point you in the right direction.

Part 1: Topic selection

Due: First week of class

Look at the course schedule and select a day/topic you would like to present. Assignments are first-come, first-served.

Deliverable

none.

Due: First week of class

Part 2: Meet with me to discuss your topic.

Due: Two weeks prior to your presentation

Make an appointment with me no less than two weeks before your presentation day to discuss the topic. I will point you in the direction of extra readings, etc. that you should look at.

To ensure that we meet early enough, you should drop me a line about the meeting at least several days in advance.

Deliverable

none.

Due: Two weeks prior to your presentation

Part 3: Draft of your slides

Due: One week before your presentation

To prepare your slides, you should become an expert on the subject you are presenting. This means you should read papers in addition to the ones listed on the course web site and in addition to the ones I give to you during our meeting.

A goal to strive for is to be able to answer any question that someone might ask after reading the assigned papers. At the very least, this means you should have tried to find answers to all the questions that arose in your mind while reading those papers. You should also have good answers to most of the questions suggested in the guides to reading papers that you read at the beginning of the quarter.

Once you have become an expert on the subject, you need to condense all the information about the papers into about 45 minutes worth of slides. Typically, slides take about 1-2 minutes each. Use that value as a guide, but also practice presenting the slides to see if that guideline holds for you.

Spend time thinking about the slides and pay attention to the following issues:

Deliverable

Your slides in either PowerPoint (OpenOffice is fine), Keynote, or pdf format.

Due: One week before your presentation

Part 4: The Presentation

Due: The day of your presentation

You will give your presentation during the second half of class. As usual, we will collect a set of questions about the material that you will look into and present answers to during the next class meeting.

Deliverable

A copy of your final slides emailed to me.

Due: The day of your presentation

Part 5: Questions answered

Due: The class period following your presentation

Find answers to as many of the questions we raise in class as possible. This may mean reading some extra papers.

Deliverable

A copy of the slides covering the answers you present.

Due: The class period following your presentation

Assignment 3: Memory Hierarchy

Changelog

January 20 Remove problem 5(4).
January 25 For problem 2, you may assume it's write-allocate.
January 26 Solution is published.

Part 1: E-Mail TA your code name!

Due: January 19

Please send the TA a code name for the grade sheet (it will appear in the grade sheet instead of your real name).

Deliverable

Please send the TA a code name for the grade sheet. Your email subject should be "[CSE240A] Assignment 3-1, your_name, code_name".

Due: January 19

Part 2: Required Problems

Due: January 26

You are strongly encouraged to work on this assignment in groups of two. If you do work in a group of two, submit only one report (but make sure both names are on it). While you may have your partner do all the work, this will only hurt you when the midterm and final come around, so don't do it.

  1. Consider the matrix_add function shown below:

        int matrix_add(int a[128][128], int b[128][128], int c[128][128])
        {
          int i, j;
          for (i = 0; i < 128; i++)
            for (j = 0; j < 128; j++)
              c[i][j] = a[i][j] + b[i][j];
          return 0;
        }

    In each iteration, the compiled code loads a[i][j] first and then loads b[i][j]. After adding a[i][j] and b[i][j], the result is stored to c[i][j]. The processor has a 64KB, 2-way, 64-byte-block L1 data cache, and the cache uses an LRU policy once a set is full. The L1 data cache is write-back and write-allocate. If the addresses of arrays a, b, and c are 0x10000, 0x20000, and 0x30000, respectively, and the cache is currently empty, please answer the following questions:
  2. You are building a system around a single-issue in-order processor that runs at 1 GHz and has a CPI of 0.7 excluding memory accesses. The only instructions that read or write data from memory are loads (20% of all instructions) and stores (5% of all instructions).
    The memory system for this computer is composed of a split L1 cache that imposes no penalty on hits. Both the I-cache and D-cache are direct-mapped and hold 32KB each. The I-cache has a 2% miss rate and 32-byte blocks, and the D-cache is write-through with a 5% miss rate and 16-byte blocks. There is a write buffer on the D-cache that eliminates stalls for 95% of all writes. You may assume the caches use write-allocate.
    The 512KB write-back, unified L2 cache has 64-byte blocks and an access time of 10ns. It is connected to the L1 cache by a 128-bit data bus that runs at 200 MHz and can transfer one 128-bit word per bus cycle. Of all memory references sent to the L2 cache in this system, 80% are satisfied without going to main memory. Also, 50% of all blocks replaced are dirty.
    The 128-bit-wide main memory has an access latency of 60ns, after which any number of bus words may be transferred at the rate of one per cycle on the 128-bit-wide 133 MHz main memory bus.
  3. Smith and Goodman found that for a given small size, a direct-mapped instruction cache consistently outperformed a fully associative instruction cache using LRU replacement.
  4. Way prediction allows an associative cache to provide the hit time of a direct-mapped cache. The MIPS R10K processor uses way prediction to achieve a different goal: reduce the cost of a chip package.
    The R10K hardware includes an on-chip L1 cache, on-chip L2 tag comparison circuitry, and an on-chip L2 way prediction table. L2 tag information is brought on chip to detect an L2 hit or miss. The way prediction table contains 8K 1-bit entries, each corresponding to two L2 cache blocks. L2 cache storage is built externally to the processor package, must be two-way associative, and may have one of several block sizes.
  5. You are interested in calculating the access time of a 128KB, 4-way set associative write-back cache with 32 byte blocks. As you learned in class, there are various ways to index and tag your cache in relation to your virtual memory system. Consider that you have the following information about your system:

    Assume that all other parts are insignificant to the access time of the cache.

    For the following configurations, calculate the amount of time it takes to get data from the cache on a load hit. Assume that your page size is 4KB. Be sure to explain your answers; just writing down numbers won't get you credit even if they end up being correct.

    1. A physically-indexed, physically-tagged cache
    2. A virtually-indexed, virtually-tagged cache
    3. A virtually-indexed, physically-tagged cache

Deliverable

Place your typed (or neatly handwritten) solutions to the problems in Hung-Wei's campus mailbox (room 2237 of the CSE building) or hand them in to Hung-Wei before 2pm.

Due: January 26

Assignment 4: Branch prediction and Advanced Pipelining

Changelog

February 8 Change the code used in problem 1.
February 9 Change the code/branch predictor used in problem 2 to make the problem simpler.
February 12 Add some more description in problem 1-3 and problem 2.
February 16 Solution is published.
February 18 Solution is updated.

Problems

Due: February 16

You are strongly encouraged to work on this assignment in groups of two. If you do work in a group of two, submit only one report (but make sure both names are on it). While you may have your partner do all the work, this will only hurt you when the midterm and final come around, so don't do it.

  1. Use the following code fragment:
    Loop: L.D    F1,0(R3)
          ADD.D  F2,F1,F4
          MUL.D  F1,F2,F6
          ADD.D  F1,F1,F5
          ADD.D  F7,F7,F1
          DADDI  R2,R2,-1
          BNEZ   R2,Loop
          DADDI  R6,R6,#4
          L.D    F3,0(R6)

    The initial value of R2 is 2. Assume the branch instruction is in the BTB and the branch is always predicted taken.

  2. You are asked to evaluate the performance of the following branch prediction schemes on the code segment given below. Assume that each branch PC never conflicts/aliases with other branches and that the predictors are initialized to all zeros. Please evaluate the branch prediction accuracy for all of the given prediction schemes.

Deliverable

The solutions to the above problems, either typed up (highly recommended) or nicely handwritten. Include your name (or names, if working in a group) on all pages.

Due: February 16

Assignment 5: Multithreading/Multiprocessor

Changelog

March 4 Update the description of the CMP processor used in problem 1.
March 9 Add some descriptions of problem 1.
March 9 In problem 3, you may assume the cache is coherent.
March 12 Solution is published.

Problems

Due: March 11

You are strongly encouraged to work on this assignment in groups of two. If you do work in a group of two, submit only one report (but make sure both names are on it). While you may have your partner do all the work, this will only hurt you when the midterm and final come around, so don't do it.

  1. Consider the following three CPU organizations:
    I. CPU CMP: 2-core superscalar processor with out-of-order issue capabilities. Each core has two functional units. Only a single thread can run on each core at a time.
    II. CPU FMT: A fine-grained multithreaded processor that allows instructions from two threads to be run concurrently on the two functional units, though only instructions from a single thread can be issued to run on any cycle.
    III. CPU SMT: An SMT processor that allows instructions from two threads to be run concurrently on the two functional units, and instructions from either or both threads can be issued to run on any cycle.
    Assume we have two threads, A and B, to run on these CPUs, consisting of the following operations:

    Thread A:
    A1- takes 2 cycles to execute
    A2- depends on the result of A1
    A3- conflicts for a functional unit with A2 (they must use the same one)
    A4- depends on the result of A2

    Thread B:
    B1- no dependencies
    B2- conflicts for a functional unit with B1 (they must use the same one)
    B3- no dependencies
    B4- depends on the result of B2
    How many cycles will it take to execute these two threads on each of the above three processors? How many issue slots are wasted due to hazards? (You may assume these processors are fully capable of eliminating false dependencies and that all the instructions are already in the instruction queues. Unless specified otherwise, an instruction takes one cycle to execute.)
  2. A simulation study shows that the average branch misprediction rate of selected SPEC92 benchmarks nearly doubles from 5% to 9.1% while going from 1 thread to 8 threads in an SMT processor. However, the wrong-path instructions fetched drops from 24% on a single-threaded processor to 7% on an 8-thread SMT processor.
    1. Please give the reason why the branch misprediction rate increases.
    2. Why does the number of fetched wrong-path instructions decrease? (Hint: This is related to the scheduling of instructions.)
    3. Another simulation study, for an online transaction-processing (OLTP) workload, shows a decrease in instruction cache miss rate from 14% to 9%. When is such a decrease, called constructive cache interference, likely to happen?
  3. Assume that you are running the following code segment on a symmetric multicore processor, where the initial values of both x and y are 0.
    Core 1:
    x = 2;
    w = x + y + 1;

    Core 2:
    y = 2;
    z = x + y;
    1. Please list all the possible resulting values of w, x, y, and z. For each possible outcome, explain how we might arrive at those values. You may need to examine all possible interleavings of instructions. (If you feel that there are too many possibilities, you may assume the cache is coherent.)
    2. If you wanted to make the system sequentially consistent, what are the key constraints you would need to impose?
    3. Assume the processor supports a "relaxed memory consistency" model, which requires synchronization to be explicitly identified. To support this model, the processor provides a "barrier" instruction, which ensures that all memory operations preceding the barrier complete before any memory operations following the barrier are allowed to begin. Where would you place barrier instructions in the above code segment to get the final result x = 2, y = 2, w = 5, and z = 4?

Deliverable

The solutions to the above problems, either typed up (highly recommended) or nicely handwritten. Include your name (or names, if working in a group) on all pages.

Due: March 11