CSE.240b Final Project

    



The final project has four possible options. All four options may be done in groups of 1 or 2. In all cases, the forum may be used to discuss the project. However, code may not be posted. In the case of Project (I), your writeup cannot depend on any results posted to the forum. I.e., if you use the cost of a branch prediction, you must measure it yourself, even if somebody has posted the value to the forum.

All four proposals require a write-up, which may include code. The projects are worth 30% of your grade and will be evaluated both on the results (50%) and on the writeup (50%).

I will be giving "best of class" prizes for both parts of Project I, and possibly for Projects II, III, and IV.

Each group must post on the Forum by Monday, May 15, with your choice of project. The projects are due the last day of class. (If this violates any UCSD policy, then students must bring this to my attention before Monday, May 15.)

If you choose Projects 2-4, you must include a few paragraphs on what you propose to do.

I will give you feedback within 24 hours, either ok'ing the proposal, ok'ing with modifications,
or recommending that you do Project I instead.



Project I. DataStar Contest


This option has two mandatory components:


Part 1. 5-tuple measurements on the Power4 machine.

The goal is to determine the 5-tuples for the following communications on the Power4 machines:

a. communication between processors on the same chip.
b. communication between processors on different chips.
c. communication between processors on different MCMs.
     (you'll have to use a 32-processor machine for this.)
d. communication between ALUs fed by the same issue queue on the same processor
e. communication between ALUs fed by different issue queues on the same processor


In order to do this, you will have to experiments to determine the most efficient way to implement scalar operand network-style communication on the Power4. Essentially you are writing producer-consumer code that transmits individual operands between instructions on different threads. You want to do this in a way that minimizes the sender latency, sender occupancy, hop latency receive occupancy, and receive latency components. You can refer to the Scalar Operand Network paper in class to determine what exactly this means.

Remember, you want to minimize these components. So, for instance, instead of looping 100 times waiting for data coming in, you could have the receiving processor do a bunch of work before it starts to check if it has received anything. This effectively reduces the measured receive-side occupancy.

During and after you figure out what this code is, you'll have to write and time a number of smaller code snippets to try to figure out how to attribute the costs across the 5 components of the 5-tuple. To do this, you'll have to write a number of small assembly snippets that measure different aspects of the architecture (e.g., the branch misprediction cost, the total cost of a communication, etc).
In your writeup, you will be writing a report that:

a) argues that your methodology is valid for measuring the 5-tuples. note, I say argue because it will not necessarily be clear-cut (e.g., how does one handle superscalar issue in the 5-tuple schema?)

b) describes the methodology and experiments in detail.

c) for the most efficient SON code you were able to find, describes what you believe happens on the Power4 when your code is executing. This would include a description of what happens to cache lines, reservations, networks, etc. You will need to refer to the Power4 paper to do this.

d) makes a recommmendation for the Power4 architects on how to improve the 5-tuple value for their machine. Be detailed. For instance, if you say that they should implementing register-mapped static networks, then you need to describe how the machine would have to be changed.


In the spirit of scientific methodology, part b) of your writeup must document the sub-experiments you did. You should
include:


1. the sub-experiment you were trying to run (e.g., measuring branch mispredict penalty by ...), and what you were trying to learn from it.

2. what the expected outcome was (the "hypothesis")

3. what the actual outcome was, and your explanation of the outcome.

4. what your conclusion is, and what the potential interfering effects (sources of error) might be.

5. in the appendix, a copy of the code you tried for the sub-experiment. Please omit support code (such as Makefiles, timing routines, etc) if the code snippet is short enough, you may include it inline in the writeup.


Part 2. Parallelizing and performance-tuning Matrix Multiply.

For this component, you will optimize and parallelize matrix multiply (previously, the assignment specified face detection, however matrix multiply was chosen to keep the project in-scope.) The code, including test harness (in mxm.c) is in:

/rmount/users02/ucsd/mbtaylor/prog2

You will rewrite the function multiply_optimized for optimized, parallel execution on an 8-processor Power4 machine. You will also want to tune the serial performance, by using blocking techniques, unrolling, compiler flags and using algorithms like Strassen's matrix multiply algorithm. Keep in mind that a lot of the cost of performing matrix multiply comes down to optimizing cache behaviour and serial execution time. (You should be able to get well under 42M cycles for a 256x256 matrix.)

The performance of each submission will be evaluated across a range of square matrix sizes
(at least to matrices of size 4096x4096).

You will need to submit:

1. A description of the techniques you employed to optimize your code, including those that you tried that didn't work out.

2. A .tar.gz file containing all of your code. I should be able to type "gmake run" and it should build, run, and time your code automatically. Please leave the harness (mxm.c) in-tact so that I can easily replace it with another piece of code that times your programs.

3. The running times of your routine, for square matrices of sizes 32x32, 64x64, 128x128, 256x256, 512x512, 1024x1024, 2048x2048 ( ... and on up as far it goes and still runs in reasonable time.)

Policy: There are many references online for matrix multiply. You may refer to these. I ask only that you do not copy (or recode) anybody's code, or examine any code optimized specifically for PowerPC/Power4/Power5, since that detracts from the spirit of the assigment. You also should not call any third-party subroutines or libraries for doing matrix or vector multiplications, or transposes, corner-turns, etc. (Any "stretching" of the rules will be seen as such.) You may feel free to post ideas, performance numbers, and questions to the forum, but no code.


Project II. Architectural Proposal

In this project, you will write a 6-12 page proposal for an interesting new architecture, or architecture-related mechanism/technique. The basic model is that it should look like an IEEE Micro paper (e.g., the IEEE Micro Raw paper, or the Alpha 21264 paper.) It should describe the basic idea, describe how it works in a fair degree of detail (focusing on the issues related to what's unique about the mechanism), and discuss the related work. You do not need to evaluate the idea through simulation. You should also discuss the possible benefits and drawbacks of the ideas, as well as any challenges with implementing it. Finally, you should discuss how you would go about evaluating the proposed idea. The 6-12 pages is a rough guideline; the goal is to have a significant amount of thought.

One posibility is to propose a system that merges or extends the best parts of Raw, GRID, and Wavescalar. Here, you would have to be detailed and describe the benefits of your approaches.

Wild ideas are great as well. For instance you can use this to develop an idea for the ASPLOS XII Wild and Crazy Idea session or improve upon one from (see ASPLOS XI).

Another possibility is to write a proposal for something that mixes your own research area with architecture.

Project III. Evaluation Project


a. Write a 10-12 page paper the evaluates and compares Raw, Wavescalar and GRID. What are the benefits and drawbacks of each? Show example code sequences that highlight the benefits of one architecture over the others. (Unless these items have been implemented, take items like frequencies, 5-tuples, power, and area estimates with a grain of salt.) What recommendations would you make to the three groups?

b. Use simplescalar, or some other standard infrastructure to evaluate an idea that interests you.

c. Propose something.

Project IV. DataStar hacking.


a. Maybe you're really dying to take advantage of the opportunity to use the DataStar cluster in some way. Propose something.