CSE 260 Homework Assignment #4

MPI Implementation of an Iterative solver for Solving Poisson's Equation

Due: Tuesday 10/28/08 at 5pm

Changelog

Date Description
21-Oct-08 Original posting
24-Oct-08 Added a tarball containing need parts of public direcory
28-Oct-08 Addendum to assignment for running on Abe.



Overview

In this laboratory you'll convert the threaded implementation of Jacobi's method that you implemented in the last assignment to run under MPI and conduct various experiments to asses performance. Develop and run your code on Valkyrie.

Note: A copy of the public files need to complete the assignment, /home/cs260f/public-cs260f contained in this tarball:

cs260f-public.tgz

MPI implementation

Parallelize the code using MPI. Implement 1, 2, and 3-dimensional geometries, and properly handle the case when N3 is not divisible by the thread geometry. Verify that the code is correct by comparing the final error (residual) for runs performed with different numbers of processors and with different processor geometries. Print out the value of the residual to 6 decimal places. To what degree are the results independent of the number of threads and the processor geometry? Be sure to provide an explanation.

The solver handles cubical meshes of size N × N × N. Choose a value of N such that you can run 100 iterations in under 10 seconds on 1 processor.

If the number of processors Pkdoes not divide Nevenly along some axis k, you must make sure that all processors are evenly loaded to the extent limited by the in ability of the processor geometry to divide the computational box evenly. In particular, you must make sure that no processor is without a work assignment unless the mesh is too small to permit them all to receive work (It is not necessary to handle that case.)

Implement the -nocomm command line option to disable all calls to MPI send and receive; you'll need it to indirectly measure communication.

Install timers

To take timings, use MPI's  timer function MPI_Wtime() to measure wall clock time (like a stopwatch). Your timings will reflect switch contention from other jobs. In a true dedicated run, you would be the only user on the machine, and contention would be minimal; running times would  be shorter and reproducible. Since it is unrealistic to exclude other users, you’ll use a different technique, that will filter out variations. Run the program program several times until you see a consistent pattern emerge, that is, that agree within 10%. Exclude the other timings, but note them in your report. You may take the average, or simply report the minimal running time.

To measure time, take the difference in times sampled by two successive calls to MPI_Wtime():

        double t0   = MPI_Wtime(); 
        // some code....
        double t1   = MPI_Wtime();
        double time = (t1-t0);

The man-page for MPI_Wtime() is found on line. Note, however, that all times are reported in seconds, and there is no need to divide by MPI_Wtick() to convert to seconds. (You can obtain the resolution of MPI_Wtime using MPI_Wtick())

Recall that the running time of the application is that of the last processor to finish. Don't time individual iterations, but rather, time an entire run, but excluding initialization, calls to MPI_Init() and MPI_Finalize() and any other activities that are not representative of a run with many iterations.

 


Experiments

For your first experiment run on 4 processors and start with a 4003 cubical problem mesh. Run for about 20 iterations so that the runs completes in a few seconds.

a. How many different geometries are there on 4 processors?

b. Start with the three 1 dimensional geometries, 1 x 1 x 4, 1 x 4 x 1, and 4 x 1 x 1. Since it can be difficult to measure communication times accurately,owing to the short time durations involved, we will use an indirect measurement technique: run with communication disabled (using the -nocomm option that you were asked to implement) and subtract the resultant time from that obtained from a run that includes communication.

Compare the running times for the 3 different 1D geometries and discuss the differences in computation and communication costs.

c. Next repeat with the 2D geometries. Which is optimal?

d. If you have time, run on 8 processors, but use only the 2D and 3D geometries. Adjust the problem size so that the amount of work per processor is a constant (i.e. to double the computational work, increase N by a factor of 21/3 ~= 1.26)

Teams of 2

If you are working in a team of 2 you must implement the convergence acceleration scheme that was used in the previous assignment.

Running on Abe

Install your code on Abe, and run on at least 32 processors (4 nodes). Modify the provide batch script in the "hello world" example to run in batch mode. Report the optimal processor geometry; as discussed in class, you need not test 1D geometries. When you’ve determined the optimal geometry, report the grind time, which is the total running time divided by the total number of updated mesh points. The latter quantity is the product of the number of iterations times N3. Also report the fraction of time spent communicating.

Run at least 3 times for each different geometry, and report the minimum running time. The running times should be highly repeatable; if you see a lot of variation try doubling the number of runs in order to observe a consensus.

You may use the problem size you used on 8 processors of Valkyrie. If you have the time, vary the size, noting any impact on the fraction of time spent communicating.

Turn in your amended report in class on Tuesday 11/3, both electronic and hard copy. If you have already turned in A4, you may submit in addendum if you wish, or submit an updated version of your lab report.


Things you should turn in

Document your work in a well-written report of about 5 pages --not including code listings or plots)-- a little bit shorter if you are working alone, and hence not implementing the accelerator. Your writeup should discuss decisions you made in parallelizing the algorithm. Provide pseudocode to help describe how your code works. Be sure to include any plotted data in tabular form. Read over these Tips for your writeup.

If you are working in a team, include a self-evaluation discussing the division of labor and other aspects of how you worked together as a team team. A copy of the form is available here.

Your turnin will consist of two parts. Submit a hard copy of your report in class on Thursday 10/30. Transmit your report electronically by the 5pm deadline on 10/28. Transmit your report by email in a single archive file, e.g. .zip, .tgz, containing your writeup and source code (and the self-evaluation, if applicable). Be sure and delete all .o files and executables before creating the tar file. Email the archive file as an attachment to the following address:
baden +260_a4 @ ucsd.
edu

Be sure to put the string cse260:A4 in the Subject: line and to use the email address above.

The name of your archive file should start with your login ID, e.g. baden.tgz. When your archive file is extracted, a new directory should be created with the name of your login ID. The directory should contain two subdirectories
src: All source code including your makefile, so I can build the software.
report Your report including any image files, screen shots, or other output you wish me to see.



Copyright © 2008 Scott B. Baden

Maintained by baden @ ucsd.
edu   [Tue Oct 28 20:05:58 PDT 2008]