|
     CSE.240b Final Project      |
Part 1. 5-tuple measurements on the Power4 machine.
The goal is to determine the 5-tuples for the following communications on the Power4 machines:
a. communication between processors on the same chip.
b. communication between processors on different chips.
c. communication between processors on different MCMs.
(you'll have to use a 32-processor machine for this.)
d. communication between ALUs fed by the same issue queue on the same processor
e. communication between ALUs fed by different issue queues on the same processor
In order to do this, you will have to experiments to determine the most efficient way to implement scalar operand network-style communication on the Power4.
Essentially you are writing producer-consumer code that transmits individual operands between
instructions on different threads. You want to do this in a way that minimizes the sender latency, sender occupancy, hop latency
receive occupancy, and receive latency components. You can refer to the Scalar Operand Network paper in class to determine what exactly this means.
Remember, you want to minimize these components. So, for instance, instead of looping 100 times waiting for data coming in, you could have the receiving
processor do a bunch of work before it starts to check if it has received anything. This effectively reduces the measured receive-side occupancy.
During and after you figure out what this code is, you'll have to write and time a number of smaller code snippets to try to figure out how to attribute
the costs across the 5 components of the 5-tuple. To do this, you'll have to write a number of small assembly snippets that measure different aspects of
the architecture (e.g., the branch misprediction cost, the total cost of a communication, etc).
In your writeup, you will be writing a report that:
a) argues that your methodology is valid for measuring the 5-tuples. note, I say argue because it will not necessarily be clear-cut
(e.g., how does one handle superscalar issue in the 5-tuple schema?)
b) describes the methodology and experiments in detail.
c) for the most efficient SON code you were able to find, describes what you believe happens on the Power4 when your code is executing.
This would include a description of what happens to cache lines, reservations, networks, etc. You will need to refer to the Power4 paper to do this.
d) makes a recommmendation for the Power4 architects on how to improve the 5-tuple value for their machine. Be detailed. For instance, if you
say that they should implementing register-mapped static networks, then you need to describe how the machine would have to be changed.
In the spirit of scientific methodology, part b) of your writeup must document the sub-experiments you did. You should
include:
1. the sub-experiment you were trying to run (e.g., measuring branch mispredict penalty by ...), and what you were trying to learn from it.
2. what the expected outcome was (the "hypothesis")
3. what the actual outcome was, and your explanation of the outcome.
4. what your conclusion is, and what the potential interfering effects (sources of error) might be.
5. in the appendix, a copy of the code you tried for the sub-experiment. Please omit support code (such as Makefiles, timing routines, etc)
if the code snippet is short enough, you may include it inline in the writeup.
For this component, you will optimize and parallelize matrix multiply (previously, the assignment specified face detection, however matrix multiply was chosen to keep the project in-scope.) The code, including test harness (in mxm.c) is in:
/rmount/users02/ucsd/mbtaylor/prog2
You will rewrite the function multiply_optimized for optimized, parallel execution on an 8-processor Power4 machine.
You will also want to tune the serial performance, by using blocking techniques, unrolling, compiler flags and using algorithms like Strassen's matrix multiply algorithm.
Keep in mind that a lot of the cost of performing matrix multiply comes down to optimizing cache behaviour and serial execution time.
(You should be able to get well under 42M cycles for a 256x256 matrix.)
The performance of each submission will be evaluated across a range of square matrix sizes
(at least to matrices of size 4096x4096).
You will need to submit:
1. A description of the techniques you employed to optimize your code, including those that you tried that didn't work out.
2. A .tar.gz file containing all of your code. I should be able to type "gmake run" and it should build, run, and time your code automatically.
Please leave the harness (mxm.c) in-tact so that I can easily replace it with another piece of code that times your programs.
3. The running times of your routine, for square matrices of sizes 32x32, 64x64, 128x128, 256x256, 512x512, 1024x1024, 2048x2048 ( ... and on up as far it goes and still runs in reasonable time.)
Policy: There are many references online for matrix multiply. You may refer to these. I ask only that you do not copy (or recode) anybody's code, or examine any
code optimized specifically for PowerPC/Power4/Power5, since that detracts from the spirit of the assigment. You also should not call any third-party subroutines or libraries
for doing matrix or vector multiplications, or transposes, corner-turns, etc. (Any "stretching" of the rules will be seen as such.)
You may feel free to post ideas, performance numbers, and questions to the forum, but no code.