CSE 260 Homework Assignment #3

High-Performance Implementation of an Iterative Solver

Due: Friday 10/30/09 at 5pm

Changelog

Date Description
21-Oct-09 Original posting
26-Oct-09 Added a clarification about establishing baseline performance
27-Oct-09 Added a clarification about scaling the workload to 64 processors



Overview

In this laboratory you'll convert the threaded implementation of Jacobi's method that you implemented in the last assignment to run under MPI or CUDA, and conduct various experiments to assess performance. If you are using MPI, run on Triton; if CUDA, run on Lincoln.

MPI implementation

Your implementation should have the same capabilities as that of the previous assignment, except that it will run with message passing under MPI. As before, your code should handle 1, 2, and 3-dimensional geometries, and you may assume that the processor geometry evenly divides the mesh size N. Verify that the code is correct by comparing the final error (residual) for runs performed with different numbers of processors and with different processor geometries. Print out the value of the residual to 6 decimal places. To what degree are the results independent of the number of processors and of the processor geometry? Explain your findings.
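
The heart of the parallelization is the exchange of ghost (boundary) planes between neighboring processes before each relaxation sweep. The following is a minimal sketch for a 1-dimensional decomposition along one axis; the names (u, plane, nz_local, exchange_ghosts) are placeholders and not part of the provided code. For 2-D and 3-D geometries the same pattern applies along each partitioned dimension, with non-contiguous faces packed into buffers (or described with MPI derived datatypes) before sending.

    #include <mpi.h>

    /* Sketch of a ghost-plane exchange for a 1-D decomposition along one axis.
       u holds the local slab in planes 1..nz_local, with ghost planes at
       indices 0 and nz_local+1; 'plane' is the number of points in one plane.
       These names are illustrative only. */
    void exchange_ghosts(double *u, int plane, int nz_local, MPI_Comm comm)
    {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);
        int up   = (rank + 1 < nprocs) ? rank + 1 : MPI_PROC_NULL;
        int down = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;

        /* Send my top interior plane up; receive my bottom ghost plane from below. */
        MPI_Sendrecv(&u[nz_local * plane], plane, MPI_DOUBLE, up,   0,
                     &u[0],                plane, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);

        /* Send my bottom interior plane down; receive my top ghost plane from above. */
        MPI_Sendrecv(&u[1 * plane],              plane, MPI_DOUBLE, down, 1,
                     &u[(nz_local + 1) * plane], plane, MPI_DOUBLE, up,   1,
                     comm, MPI_STATUS_IGNORE);
    }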

Implement the -nocomm command line option to disable all calls to MPI send and receive; you'll need it to use the indirect method of measuring communication costs.
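
One simple way to honor -nocomm is to guard every exchange with a flag set while parsing the command line. The sketch below is illustrative only: the flag name and the exchange_ghosts() and sweep() helpers stand in for whatever your code already uses.

    #include <mpi.h>

    /* Placeholders for routines your code already has. */
    void exchange_ghosts(double *u, int plane, int nz_local, MPI_Comm comm);
    void sweep(const double *u, double *unew, int plane, int nz_local);

    /* Hypothetical flag: set to 1 when -nocomm appears on the command line.
       With the flag set, every MPI send/receive is skipped, so the run
       measures computation only. */
    int no_comm = 0;

    void solve(double *u, double *unew, int plane, int nz_local,
               int niters, MPI_Comm comm)
    {
        for (int it = 0; it < niters; it++) {
            if (!no_comm)
                exchange_ghosts(u, plane, nz_local, comm);
            sweep(u, unew, plane, nz_local);   /* purely local Jacobi update */
            double *tmp = u; u = unew; unew = tmp;
        }
    }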

Performance Measurement

Report the grind time as the measure of performance. Recall that the grind time is the total running time divided by the total number of updated mesh points. The latter quantity is the product of the number of iterations and N^3.
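
For example, with names that are only illustrative (elapsed is the measured wall clock time for the timed portion of the run):

    /* Grind time = wall clock time / total number of mesh point updates. */
    double grind_time(double elapsed, int niters, int N)
    {
        double updates = (double) niters * N * N * N;   /* iterations x N^3 */
        return elapsed / updates;                       /* seconds per update */
    }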

To take timings, use MPI's timer function MPI_Wtime(), which measures wall clock time (like a stopwatch). Your timings will reflect switch contention from other jobs. In a true dedicated run you would be the only user on the machine, contention would be minimal, and running times would be shorter and reproducible. Since it is unrealistic to exclude other users, you'll use a different technique that filters out the variations. Run the program at least 3 times (for each geometry, input size, number of processors, or any other variable that affects the running time) until a consistent timing emerges, such that a majority of the runs agree to within 10%. If you observe much more variation, try doubling the number of runs to see whether you can obtain a better consensus. Exclude outlying timings, but note them in your report. You may take the average, or simply report the minimum running time.

To measure time, take the difference in times sampled by two successive calls to MPI_Wtime():

        double t0   = MPI_Wtime(); 
        // some code....
        double t1   = MPI_Wtime();
        double time = (t1-t0);

The man page for MPI_Wtime() can be found online. Note, however, that all times are reported in seconds, so there is no need to divide by MPI_Wtick() to convert them. (You can obtain the resolution of MPI_Wtime() using MPI_Wtick().)

Recall that the running time of the application is that of the last processor to finish. Don't time individual iterations; rather, time the entire run, excluding initialization, the calls to MPI_Init() and MPI_Finalize(), and any other activities that are not representative of a run with many iterations.
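
A sketch of how the timed region might be bracketed, with the run time taken as the maximum over all ranks (the loop body is whatever your solver already does, and the variable names are illustrative):

        MPI_Barrier(MPI_COMM_WORLD);          // line the ranks up before timing
        double t0 = MPI_Wtime();

        // ... the iteration loop goes here ...

        double t_local = MPI_Wtime() - t0;
        double t_run;                          // time of the slowest rank
        MPI_Reduce(&t_local, &t_run, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);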


Experiments

For your first experiment, run on 8 processors and start with a 320^3 cubical problem mesh. Run for about 10 iterations so that the run completes in 5 or 10 seconds.

a. Start with the three 1-dimensional geometries: 1 x 1 x 8, 1 x 8 x 1, and 8 x 1 x 1. Since it can be difficult to measure communication times accurately, owing to the short time durations involved, we will use an indirect measurement technique: run with communication disabled (using the -nocomm option that you were asked to implement) and subtract the resultant time from that obtained from a run that includes communication.

b. Compare the running times for the 3 different 1D geometries and discuss the differences in computation and communication costs.

c. Next repeat with the 2D and 3D geometries. Which is optimal?

d. Next, run on 64 processors (8 nodes), but use only 3D geometries. Adjust the problem size so that the amount of work per processor remains constant: double N rather than scaling it by a factor of 2^(1/3) ≈ 1.26. [10/27/09]

When you’ve determined the optimal geometry, report the grind time. Also report the fraction of time spent communicating.
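
With the indirect method from part (a), the communication estimate is a simple subtraction; for example, with illustrative variable names:

        /* t_full   : run time with communication enabled
           t_nocomm : run time of the same configuration with -nocomm      */
        double t_comm    = t_full - t_nocomm;   // estimated communication time
        double comm_frac = t_comm / t_full;     // fraction of time communicating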

Teams of 3

If you are working in a team of 3 you must implement the convergence acceleration scheme that was used in the previous assignment.

Building and Running MPI Programs

To build your code, you'll need a correctly configured Makefile and a batch submission file. You'll find these in the two application directories, Basic and Ring, which are installed in $PUB.

To run an MPI job interactively, you must specify a sufficient number of nodes and processes per node. For an 8 process job:

qsub -A djy -I -V -l walltime=00:05:00,nodes=1:ppn=8
Note that it may be difficult to obtain interactive access to more than one node.

Once you have access to the node, start the MPI daemon by typing

mpd&
If you don't start the daemon, you will receive an error message:
mpiexec_abe1088: cannot connect to local mpd (/tmp/mpd2.console_baden); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MVAPICH2 User Guide.

You only need to do this once per interactive session. To run your job, use the mpirun command

mpirun -np 4 ring -t 10000 -s 16
The mpirun command provides the -np flag to specify the number of processes. (Since Lincoln has 8-way nodes, you'll allocate them accordingly.) The value of this parameter should not exceed the number of available physical processors; if it does, the program will still run, but very slowly, since processes are multiplexed onto the physical resources. The next parameter (ring) is the name of the executable. Any command line arguments come in the usual position, after the name of the executable.

To establish that your environment has been set up correctly, copy, compile, and run the parallel "hello world" program. This program prints "Hello World" from each process along with the process ID. It also reports the total number of processes in the run. The hello world program is found in $(PUB)/Examples/Basic/hello. To run the program, use mpirun as follows:

mpirun -np 2 hello

Here is some sample output:

# processes: 2
Hello world from node 0
Hello world from node 1

MPI Documentation

I have put together a web page with links to MPI documentation at http://www-cse.ucsd.edu/users/baden/Doc/mpi.html. Man pages for the MPI calls may be accessed with the man command, and man pages for the MPI calls used in the example programs described here are also available online.


CUDA Implementation

Implement the iterative method in CUDA in single precision. Note that the serial code provided in the last assignment is set up to enable you to switch between single and double precision arithmetic via a compile-line macro: simply remove the definition of the DOUBLE macro in the Makefile. [10/26/09]
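
The exact macro arrangement in the provided code may differ, but the usual pattern is a compile-time typedef along these lines:

    /* Illustration of a compile-time precision switch; the provided code's
       actual macro and type names may differ. */
    #ifdef DOUBLE
    typedef double real;    /* -DDOUBLE on the compile line */
    #else
    typedef float  real;    /* single precision when DOUBLE is not defined */
    #endif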

Experiment with different grid and thread block geometries, and report the optimal configuration. Compare the running time with that of the threaded implementation you built in the last assignment, running on a single core of Lincoln, and discuss any significant differences. If you are working in a team of 2, you should be able to demonstrate a performance improvement of at least a factor of 10 for a problem size of 512^3; if in a group of 3, a factor of 20. Note that the performance of your code will be a contributing factor in grading your assignment. To keep a competitive but friendly spirit, post your latest performance differences in the A3 Moodle forum. You need not implement acceleration, but you may if you wish.
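
As a starting point for the geometry experiments, a launch might look like the sketch below. The kernel shown is a plain single-precision 2-D Jacobi-style stencil, not the provided solver, and the 16 x 16 block shape is just one configuration to try; vary the block dimensions (and, for a 3-D mesh, how planes are mapped to blocks) and measure.

    // Illustrative single-precision Jacobi-style sweep over an n x n mesh.
    // d_u and d_unew are device arrays of n*n floats; names are placeholders.
    __global__ void jacobi2d(const float *d_u, float *d_unew, int n)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (i > 0 && i < n - 1 && j > 0 && j < n - 1)
            d_unew[i * n + j] = 0.25f * (d_u[(i - 1) * n + j] + d_u[(i + 1) * n + j] +
                                         d_u[i * n + j - 1]   + d_u[i * n + j + 1]);
    }

    // One grid/thread-block geometry to try; report the best one you find.
    void sweep_on_device(const float *d_u, float *d_unew, int n)
    {
        dim3 block(16, 16);
        dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
        jacobi2d<<<grid, block>>>(d_u, d_unew, n);
    }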

I have posted instructions for compiling and running CUDA applications on the Lincoln web page.

It is important to establish an appropriate baseline: run with single precision arithmetic; use the same compiler and optimizations employed in the previous assignment (via the provided Makefile).


Things you should turn in

Document your work in a well-written report of about 7 to 8 pages-- a bit longer if you are working in a team of 3, and hence implementing convergence acceleration. Your writeup should discuss decisions you made in parallelizing the algorithm. Provide pseudocode to help describe how your code works, but do not include any source code. That will appear in your electronic turnin. Read over these Tips for your writeup.

Include a self-evaluation discussing the division of labor and other aspects of how you worked together as a team. A copy of the form is available here.

Please put a hard-copy of your report in my mailbox on the 2nd floor of the CSE Building (EBU3B).

Also transmit an electronic copy by emailing a single archive file, e.g. .zip, .tgz, containing your writeup and source code-- and the self-evaluation. Note: submitting the electronic copy establishes the turnin time of your assignment.

Be sure to delete all .o files and executables before creating the tar file. Email the archive file as an attachment to the following address, and put the string cse260:A3 in the Subject: line:
baden @ ucsd. edu

The name of your archive file should start with your login ID, followed by a colon (:), e.g. baden:a3.tgz. When your archive file is extracted, a new directory should be created with the same name as the archive file, excluding the filename extension, e.g. baden:a3. The directory should contain two subdirectories:
src: All source code, including your makefile, so we can build the software.
report: Your report, including any image files, screen shots, or other output you want me to see.



Copyright © 2009 Scott B. Baden

Maintained by baden @ ucsd. edu   [Mon Oct 26 15:36:16 PDT 2009]