Programming Lab #3, Implementing Cannon's algorithm
(Due Friday February 24th, 2012 at 6PM)


Date Description
11-Feb-12 Original posting


Your assignment is to implement Canon's Matrix multiplication algorithm in MPI and to conduct scaling studies on Trestles. If working in a group of 3, you must also implement the SUMMA algorithm and compare yours results with Cannon's algorithm. Instructions for using MPI are found here. The requirements for your report are described at the end of this writeup.

You've been provided with some MPI starter code in $PUB/HW/A4/Canon_Starter. This code manages communicators for handling processor rows and columns, but includes no message passing calls, e.g. Sends and Receives. Add the required message passing calls, trying to make the code go as fast as you can. The code uses an efficient serial implementation of DGEMM (double precision general matrix multiplication) which is provided by the Intel Math Kernel Library, MKL. The MKL implementation has been autotuned to the hardware, and will vastly improve performance over most "home made" matrix multiplication versions.

The provided starter code

Be sure to look over the code so you are familiar with the various modules. All of your changes will made in one file: cannon.cpp. The local matrix multiply code resides in utils.cpp, which makes the calls to the CBLAS version of dgemm. As with previous assignments, command line arguments are parsed in cmdLine.cpp

To ensure that your program is correctly graded, do not change any of the provided source code files, except for cannon.cpp. Also, don't change the signature of two functions defined in cannon.cpp: cannon() and Verify(), as certain routines in utils.cpp make callbacks to these functions. You may modify the Makefile if you need to add new source code files.

The Assignment

  1. Conduct a weak scaling study. Details are described below.

  2. Measure communication costs using the indirect method discussed in class. There is a command line option -k that shuts the sends and receives used to implement the shift operations required by the algorithm. By shutting off communication you can then assess the net impact of communication. While results will be incorrect, there will be no numerical overflows so the running times won't be affected. Plot the communication cost as a function of the number of cores, and measure it as a fraction of total running time. Be sure to include the plotted data in tabular form in your report.

  3. Speed up verification. The code provides an option (-c) to verify results, as the product C = A*B is known in closed form for the provided input. Be sure to check your results, as incorrect results will lose significant credit. Though it is parallelized, the verification code uses an inefficient algorithm and it's running time significantly exceeds that of the local matrix multiplies used in Cannon's algorithm (The code reports the verification time)! Improve the verification time by optimizing the code. You should be able to bring the running time close to that of the computed matrix product.

  4. Groups of 3 must also implement the SUMMA matrix multiplication algorithm, and conduct the same study as for parts (A) and (B). Do you observe any performance differences? If so, what is your explanation?
    When implementing your SUMMA code, put your code in a new file called summa.cpp. The provided Makefile includes a target called summa. Modify as needed if you want to add other source code files. We will permit you to modify summa.h representation of the matrix datatype, so you'll be allowed to make changes to cannon.h

Weak Scaling study

In this assignment, we'll be performing weak scaling studies, that is, we'll grow the workload in proportion to the number of cores so that it remains constant as we increase P. It follows that if our code is scaling perfectly, the running time will also remain constant, independent of P.

In practice, our codes won't scale perfectly. The peak GFlop rate to multiply two matrices on one core of Trestles is slightly more than 8.5 GFlops/sec. You'll measure this base rate and then use it as the basis for reporting parallel efficiency, that is, we'll normalize the measured GFlop rate for a given value of N, and then divided by the following quantity: the base rate times the number of cores. (Note that the peak speed of a Trestles core is somewhat higher than the base rate: 9.6 GFlops/sec.)

Since the running time of matrix multiplication is cubic in N (we'll assume square matrices in this assignment) we will vary N such that the following relationship is maintained, where N0 is the value of N we use on 1 core:

N3 /  P = N0

We'll conduct our weak scaling studies in two rounds.

Though Trestles has many more than 1024 cores, it is not possible for us to use more than 1024 at a time, except by special arrangement. To avoid exhausting our bank account, start with 4 cores (1 node), and quadruple the number of cores until the parallel efficiency drops below 50% or you have reached the processor limit of the round, whichever comes first. Since Cannon's algorithm requires a square array of processors, we quadruple the number of cores each time we double the level of parallelism.

If your efficiency drops below 50% before reaching the limit of the round, re-assess your implementation. Don't progress to round 2 until you have completed round 1. Leave time for authoring your report; if you can't scale to 1024 cores, it is better to tell us why (and in detail) rather than working until the bitter end and handing in a "thin" report.

Setup for MPI

You may develop your code on Lilliput or any of the cseclass machines, but take measurements on Trestles using batch mode. Don't use the shared queue when collecting measurements, as you may experience contention from other jobs.


  1. On Trestles, do not attempt to run on more then 16 cores in interactive mode, i.e. 16 MPI processes.
  2. On Lilliput or the CSEclass machines, do not attempt to run on more then 4 cores, i.e. 4 MPI processes.

Before using MPI, be sure to set up your environment on Trestles and on Lilliput, as explained here.

Things you should turn in

Document your work in a well-written report of about 4-6 pages (at the longer end of the range for groups of 3) which discusses your findings carefully and offers insight into your observations. Your report should present a clear evaluation of the design of your code, including bottlenecks of the implementation, and describe the performance programming you performed. Explain your design decisions affected the observed performance. Negative results are also valid and especially valued if you carefully document them, with the underlying causes, or best guesses if the causes are not clear. Provide pseudo code listings to illustrate how you parallelized the code, but do not include full code listings as these will be in your turnin. Be sure to include any plotted data in tabular form.

Cite any written works or software as appropriate in the text of the report. Citations should include any software you used that was written by someone else, or that was written by you for purposes other than this class.

Your turnin will consist of two parts.

  1. Turn in your source code and your lab report electronically not later than the 9pm deadline on the day the assignment is due. To hand in your assignment electronically, copy your code over to Lilliput and then run turninA4 script located in $PUB/turnin/turninA4. We will announce when the script has been enabled via the A4 Moodle forum.
  2. Turn in hard copy of your report not later than the 6pm deadline. You may leave your report in Professor Baden's mailbox on the 2nd floor of EBU3B. To avoid lateness penalties, don't miss these two deadlines.

Your turnin must include the provided Makefile so that we can build and test your code. We will run your code to test it and will check correctness using the residual measurement.

Your turnin must also include three important documents:

  1. A completed team self-evaluation discussing the division of labor and other aspects of how you worked together as a team. A copy of the form is available in $PUB/turnin/teameval.txt.
  2. A MEMBERS file that identifies your team members, with your full names and email addresses. A template showing the format can be found in $PUB/turnin/MEMBERS.
  3. Your report, a file called report.pdf

These documents are required, as the turnin process will not complete without them. Copies of these files are available in the turnin directory, $PUB/turnin Empty forms, or forms not properly filled out, will result in a 10% penalty being assessed, in addition to holding up the grading process.

A note about the turnin process: Since the turnin script will try to compile your project before letting you submit your assignment, it's important that you turn in all the files you have, including a Makefile. However, when grading your assignment, we will substitute all the files that you are not allowed to modify (as noted in the list above) with the original copies of the version we provided you. This is why it is important that you do not change those files in your own turnin, as they will be lost. The turnin script will have an option to do some basic testing, and we strongly suggest you avail yourself of this option.

Don't wait until the last minute to attempt a turnin for the first time. The turnin program will ask you various questions, and it will check for the presence of certain required files, as well as your source code. It will also build the source code if you like, to catch any obvious build errors. Try out the turnin process early on so you are sure that you you are familiar with the process, and that all the required additional files are present. The turnin procedure will be available about a week before the deadline.

Copyright © 2012 Scott B. Baden   [Sun Feb 12 17:07:00 PST 2012]