CSE 260 Homework Assignment #2: Iterative solver for Poisson's Equation

Due: Tuesday 10/20/09 at 5pm

Changelog

Date Description
06-Oct-09 Original posting
16-Oct-09 Fixed discussion about the impact of the -px, -py, and -pz flags. Added discussion about roundoff errors in the residual. Discussed a strategy for setting the convergence threshold for the accelerated computation.



In this laboratory you'll parallelize an existing serial implementation of Jacobi's Method in 3 dimensions using pthreads. The serial C code is available as the compressed archive file j3d_serial.tgz and, on Triton, in ~baden/cse260-fa09/j3d_serial. Matlab code is also available as the compressed archive jac_matlab.tgz. For this assignment, use the Triton cluster but not the PDAF nodes. You may take measurements using the interactive queue, or with the batch queue. Do not take measurements on the "head" node (front-end).

The program provides various command-line options, which are parsed in the source file cmdLine.C, including the following options you'll need when parallelizing the code:

 -N      <integer>       domain size, the bounds of the computational mesh
 -i      <integer>       # iterations
 -err                    report error (residual in the L2 norm)
 -px     <integer>       number of processors on the x axis
 -py     <integer>       number of processors on the y axis
 -pz     <integer>       number of processors on the z axis

Thus, the command line

   Jacobi3D -N 128 -i 5 -px 1 -py 2 -pz 4

will run Jacobi3D on 8 processors (threads) over a 128^3 mesh for 5 iterations. The processors will be configured in a 1 x 2 x 4 geometry, though the -px, -py, and -pz flags have no effect in the serial code. Specifying a value other than 1 for any of these flags will produce incorrect results until you have correctly parallelized the code. [10/16/09]

Threaded implementation

Your code should support 1-, 2-, and 3-dimensional geometries. Verify that your code is correct by comparing the final error (residual) with single-processor runs. For a correct parallel code, the residual should be independent of the number of threads and the processor geometry, or "nearly so." We say "nearly so" because the parallel code will accumulate the residual in a different order than the serial code. Due to roundoff errors, you might see a slight difference in the least significant digits, corresponding to the accuracy of machine arithmetic (double precision if you used the provided Makefile).

This is a "feature" of parallel computation. We usually assume that things happen in the same order every time we run a program, and that we know the specific order in which they are done. We do not have these guarantees in a parallel algorithm. We might not even have them in a serial program, owing to optimizations and differences in how compilers treat floating-point arithmetic.

Since double precision carries roughly 15 significant digits, you'll need to increase the printing precision to see the effect. Another way is to compute in single precision, where the effect of the reordering is more pronounced. To run in single precision, remove the definition of the DOUBLE macro in the Makefile. (You will see that hooks have been added to the code to let you switch between single and double precision.) [10/16/09]

Is the reported residual independent of the number of threads and the processor geometry? If not, provide an explanation. [10/16/09]

The solver handles cubical meshes of size N × N × N. When debugging, choose a value of N such that 50 iterations take between 5 and 10 seconds on 1 processor. This means that runs will complete in a few seconds on 4 and 8 cores.

Next, do a scaling study, running a fixed problem size on 1, 2, 4, and 8 processors. On a single graph, plot the speedup as a function of P, the number of processors, for different problem sizes N. To double the computational workload, increase N by a factor of 2^(1/3) ≈ 1.26. For 4 and 8 processors, experiment with differing processor geometries. Is performance sensitive to the geometry? If so, explain your observation and report the optimal geometry that minimizes the running time (there may be more than one for a given number of processors).

Teams of 3

If you are working in a team of 3, I expect a more ambitious assignment. In particular, you must implement the convergence acceleration scheme discussed in class, that uses a coarse grid to improve the initial guess. If you implement this option, you need only generate a speedup curve for a single value of N.

The technique we'll use is to coarsen the solution and iterate several times to obtain an approximate solution, which serves as a good initial guess. This strategy contains the essential machinery for implementing multigrid, which is an algorithm for accelerating convergence.

Modify your code so that, once the RHS and initial guess have been generated, a new coarsened version of the initial guess and RHS are produced, and subsequently smoothed by a Jacobi solver. This coarsened solution is then projected (via interpolation) back onto the original mesh and is used as an improved initial guess.

Examine the error (L2 residual) in order to demonstrate the improved running time of your two-level solver when delivering a given level of accuracy. A convenient way to do this is to set the convergence threshold to the value of a previously reported residual. That way you will get consistent results. It's possible that you'll run for a slightly different number of iterations, since the printout suppresses some digits in the reported residual, but the effect can be rendered harmless by running for a sufficient number of iterations. [10/16/09]

You’ll need to iterate several times at the coarse level so that you see a significant improvement in the time needed to converge. To coarsen a mesh (multigrid restriction), average the block of fine mesh points that maps to each coarse point. Thus, if C is the coarsening factor, each coarsened mesh point (I,J,K) will be obtained by averaging the fine points ((I*C)+i, (J*C)+j, (K*C)+k), where i, j, and k range from 0 to C-1. A coarsening factor of 2 should be sufficient, but larger values may help speed up the convergence rate.

Once we have the initial guess on the coarsened mesh, we must transfer it to the original (fine) mesh. We do this by upsampling the coarse mesh via interpolation (multigrid prolongation). The details were discussed in class, and are also described in Jim Demmel's notes.

With the fine mesh in hand, we may then iterate on it to solve the problem. Note that our goal is to reduce the number of iterations on the fine mesh, which are C^3 times more costly than coarse mesh iterations. Vary the number of coarse iterations in order to significantly reduce the running time needed to obtain a solution with an error meeting a given error threshold. I suggest that you choose a threshold that roughly equals 1/N.

Structure your code so that you use a single function to solve both the coarsened and refined problems.


Things you should turn in

Document your work in a well-written report of about 3 pages-- a bit longer if you are working in a team of 3, and hence implementing convergence acceleration. Your writeup should discuss decisions you made in parallelizing the algorithm. Provide pseudocode to help describe how your code works, but do not include any source code. That will appear in your electronic turnin. Read over these Tips for your writeup.

Include a self-evaluation discussing the division of labor and other aspects of how you worked together as a team. A copy of the form is available here.

Please put a hard-copy of your report in my mailbox on the 2nd floor of the CSE Building (EBU3B).

Also transmit an electronic copy by emailing a single archive file, e.g. .zip, .tgz, containing your writeup and source code-- and the self-evaluation. Note: submitting the electronic copy establishes the turnin time of your assignment.

Be sure to delete all .o files and executables before creating the tar file. Email the archive file as an attachment to the following address, and put the string cse260:A2 in the Subject: line:
baden @ ucsd . edu

The name of your archive file should start with your login ID, followed by a colon (:), e.g. baden:a2.tgz. When your archive file is extracted, a new directory should be created with the same name as the archive file, excluding the filename extension, e.g. baden:a2. The directory should contain two subdirectories:
src: All source code including your Makefile, so we can build the software.
report: Your report, including any image files, screen shots, or other output you wish us to see.



Copyright © 2009 Scott B. Baden

Maintained by baden @ ucsd . edu   [Fri Oct 16 18:46:34 PDT 2009]