Date | Description |
29-Sep-09 | Original posting |
06-Oct-09 | Source code made available here |
In this assignment you'll implement the blocked matrix multiplication algorithm discussed in class, and carry out a set of experiments to assess the performance benefits of the algorithm compared with the unblocked algorithm, which is provided to you on Triton as $PUB/cse260-fa09/A1, and also here as the gzipped tar file mm.tgz. This assignment runs on a single processor.
Run on Triton and at least one other machine: a laptop, or any other machine that comes handy. On Triton, be sure to collect your timings in batch mode, using the provided batch submission script run.sh
Restructure the provided unblocked code mm.c to implement the blocked algorithm (Algorithm 3) described in the Demmel on-line reader. Your code will now be able to run both blocked and unblocked variants of matrix multiplication. Be sure that your code is correct.
You will know that
you have reached the desired performance goal
when the blocked algorithm significantly outperforms (by at least a factor of 105)
the unblocked algorithm.
The code includes a microsecond timer, as well as verification code for a matrix product that can be expressed in closed form. Note that the code enables you to repeatedly execute matrix multiplication. This is necessary with smaller matrices, since it might not be possible to measure reliably short running times for a single matrix product.
Test your code for different values of n, repetition count, and blocking factor in order to build confidence in the correctness of your code. Note that, these values are input as positional command line arguments; see the code for the details. You may assume that the blocking factor divides n evenly.
Next, conduct a performance study. To do this, sweep over different values of n, starting, say, from 32 and working your way up in powers of 2 until the running time of the unblocked code takes about 10 seconds. Once you have done this, look at intermediate values in order to get a clearer picture of how performance changes with n and the blocking factor. To start with, set the blocking factor to n/2. You may reduce the blocking factor by halving it. Note the effect on performance. Is there a "sweet spot" (or spots) that minimize the running time for a fixed n?
Take timings with the provided timer code (it uses gettimeofday()). Take at least 3 timings for each combination of n and block size, and report the minimum timing. If you are having trouble getting consistent timings, especially if you are using a shared machine, try running when the machine is quieter, and repeat the runs several more times. (If you are working on a personal computer, shut down all open applications.)
You'll need to use much a large repetition count for smaller matrices As you increase n, you'll need to figure out a schedule for decreasing the repetition count so that each run takes a second or two. As you increase n still further, you eventually set the repetition count to one and you will no longer be able to limit the running time to a second or two (Don’t run for more than 5 minutes on Triton, as you will be using only 1 of 8 processors, but will be charged for all 8).
Once you have conducted your experiments, plot the comparative grind
times Gigaflop rate for the blocked and unblocked algorithm.
Why are the times comparable for smaller values of n?
At what point do they diverge? Sample the values of n
closely within the region where they diverge, and use this
critical value of n
determine the size of the processor’s
highest level cache.
Discuss any other performance
variations that will enable you to estimate the size(s) of any other
level(s) of cache and estimate them.
(As described below, there are resources
that provide this information directly, but try and come up with
you own estimates before looking).
Remember that there are 3 arrays to be cached: A, B and C.
When reporting performance results, it is important to identify the hardware and software testbeds used. To this end, run a suite of commands to identify the hardware and the environment you used to develop and run your code. (This is set up in the batch file run.sh for Triton.)
Place all output from these commands in a file called config.out, e.g. by using the script command. If you found the hardware characteristics of your processor, that is the clock speed, TLB, L1, L2, and (if present) L3 cache, list the information in the report, and cite the source, e.g. a URL or a file. On Linux systems you can examine /cpu/procinfo for some of this information. (Be sure to do this on the processor you actually ran the job on, that is in batch, and not on the front end.)
model : 26 model name : Intel(R) Xeon(R) CPU E5530 @ 2.40GHz stepping : 5 cpu MHz : 2395.000
The contents of procinfo enables us to identify the specific processor and the identifying information may be used to obtain other details from on-line resources. For example, ours is a 2.4GHz Intel Xeon E5530 processor. We may then learn more by accessing Intel's ProcessorFinder (There is also a site to explain about processor numbers). Using the information obtained from procinfo select the processor family (Xeon) CPU Speed (2.4 GHz), Processor Number (E5530). You will be left with one choice, a clickable link to the datasheet. Follow a short sequence of links to "Multi-Core Intel Xeon 5000 Sequence" and then "5530 Datasheet." The introduction describes the hardware in more detail. (There is additional information in procinfo: family, model, and stepping. So in this case the family is 6, the model is 26, and the stepping is 5. This will allow you to explore smaller scale changes to the processor’s design that occurred over time.)
If you are using a non-Intel processor see if you can locate the information, and I will post links to the resources here.
Document your work in a well-written report which discusses your findings carefully. Your report should present a clear evaluation of the design of your code, including any bottlenecks in the implementation. Describe any special coding or tuned parameters. Provide pseudo-code listings for code you wrote. There is no need to provide pseudo-code for the provided unblocked matrix multiply code.
Provide all data you plotted in tabular form. This can appear, for example, as an appendix in your report.
Your report should cite any written works or software as appropriate in the text of the report. Citations should include any software you used that was written by someone else, or that was written by you for purposes other than this class.
Your turnin will consist of two parts. Turn in in a hard copy of your report in class on Monday 10/12. Transmit your report electronically by the 9PM deadline on 10/8. Email a single archive file, e.g. .zip, .tgz, containing your writeup and source code. Be sure and delete all .o files and executables before creating the tar file. Email the archive file as an attachment to the following address, with cse260:A1 in the subject line
baden | @ | ucsd. |
edu |
The name of your archive file should start with your login ID, e.g. baden.tgz. When your archive file is extracted, a new directory should be created with the name of your login ID. The directory should contain two subdirectories
src: | All source code including your makefile, so we can build the software. |
report | Your report including any image files, screen shots, or other output you wish us to see. Also put the config.out file in this subdirectory. |
At the top level of your directory there should be an ASCII file called README.txt, listing all the parts of your electronic submission. If you like, include a .html file called index.html at the top level. From there you can then link to a README file as well as the src and report directories.