CSE 260 Homework Assignment #2: Getting started with MPI

Due: Tuesday 10/22/02 in Class

Computing environment

The hardware platform for the course is a Beowulf cluster named Valkyrie. Valkyrie is managed by the ACS and runs the Rocks software developed at the San Diego Supercomputer Center. Valkyrie consists of 16 nodes, each with two 1 GHz Pentium III CPUs and 1 GB of RAM, and each running Linux. A Myrinet switch provides low-latency connectivity between the nodes.

Valkyrie should be used only for parallel program development and measurement. If you need a Linux or UNIX system for other work, please use your student account on the Solaris machines in the Advanced Programming Environment (APE) lab, located in APM 6426.

Software for the course will be placed in a public directory, which you may access from your home directory as ~/../public. The software for this assignment lives in various subdirectories of ~/../public/examples.

Please see two important web pages with extensive listings of software available in the course, and additional information to locate research papers and so on. These are found at:

ACS has set up a web page providing documentation on how to use Valkyrie: http://www-acs.ucsd.edu/offerings/userhelp/HTML/rocks,d.html


Logging into Valkyrie for the first time

The first time you log in to Valkyrie you will be asked for some input related to ssh authentication. I suggest that you hit return in response to all 3 questions. (One consequence is that you'll have an empty pass phrase.)

On Valkyrie you will run with the bash shell. To get the correct compilers, you will need to set your path. Add the following line to your .bash_profile file, which you should find in your home directory:

export PATH=/usr/mpich/c/bin:$PATH


The Assignment

This assignment consists of 3 parts. In all three parts you'll compile and run code supplied to you, and in Parts B and C you'll also modify the code.

Part A

This part is not to be handed in. It will help acquaint you with the process of running an MPI program. Compile and run the two programs in the subdirectory called Basic. Be sure to use the Makefile that we've supplied so you'll get the correct compiler and loader flags. The Makefile includes an "arch" file that defines appropriate command line flags for the compiler. You currently have arch.valkyrie, which provides the appropriate settings for Valkyrie.

Your first program is a parallel "hello world" program. This program prints "Hello world" from each processor, along with the process ID. It also reports the total number of processors in the run. The hello world program is found in ~/../public/examples/Basic/hello. Run it as follows:

mpi-launch -m compute-0-0,compute-0-1 /home/cs260f/public/examples/Basic/hello

Here is some sample output:

# processes: 2
Hello world from node 0
Hello world from node 1

Do the node IDs come in any particular order? Why is this?

In the second program each processor sends two integers to node 0, which prints out the values. A common convention is to call node 0 the manager or boss. Run program 2 several times on 4 processors. Do the node IDs come in any particular order? Why is this the case? Modify the program so that the nodes report to the manager in increasing order of node rank. Note the use of the MPI_Barrier() call, which provides barrier synchronization. Remove this call and note the effect.

Part B

The purpose of this part is to show you how to set up the nodes to communicate as if they were connected in a ring, and to measure some aspects of communication performance. You'll be asked to make some changes to this program. Bring the plots for your runs to class on Tuesday, as described below.

The program we've supplied you, Ring.C, is found in the Ring/ subdirectory of ~/../public/examples. This program treats the processors as if they were connected in a ring. Node 0 sends a message to its successor, node 1, and so on until node P-1, which then sends the message back to node 0, which is P-1's successor in mod P arithmetic. The cycle then repeats. Messages of various sizes are passed around the ring, starting with 2-byte messages and progressing in powers of two up to a user-specified limit.
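The successor rule described above can be sketched in plain C. The function names ring_successor and ring_predecessor are made up for illustration and are not taken from Ring.C, which may compute its neighbors differently.

```c
#include <assert.h>

/* Rank of the node that receives from `rank` in a ring of `nprocs` nodes.
   Hypothetical helper; not part of the supplied Ring.C. */
int ring_successor(int rank, int nprocs)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    return (rank + 1) % nprocs;
}

/* Rank of the node that sends to `rank`. Adding nprocs before taking the
   remainder keeps the operand non-negative when rank is 0. */
int ring_predecessor(int rank, int nprocs)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    return (rank + nprocs - 1) % nprocs;
}
```

With P = 4, ring_successor(3, 4) is 0 and ring_predecessor(0, 4) is 3, which is the wraparound described above.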

Ring extracts up to 2 arguments from the command line, which must appear in the order specified:

  • NNN, the maximum size of the message to be passed around the ring (in kilobytes), and
  • TTT, the "trips" parameter, giving the number of times the message will be passed around the ring. If you want to specify this parameter, you must also specify the first one.

By default NNN=1024 and TTT=5. Under default settings Ring passes messages of 2, 4, 8, ..., 1K, ..., 256K five times around the processor ring. It calls MPI's built-in timer MPI_Wtime() to measure message passing times, and reports statistics based on these times as a function of increasing message length.

    Experiments

    Run Ring on 2, 4, and 8 processors, and experiment with messages in the megabyte range. Note any differences in message passing times on differing numbers of processors, and attempt to explain them.  Run several times and report the best timings. Did you notice any variations in the timings?  Compare results when you run with 1 CPU per node and 2 CPUs per node and account for the differences (see below for how to achieve these effects).

    To get accurate timings, you may need to increase the repetitions ("trips") parameter for short messages below about 1KB in length. Use the message size parameter to restrict the maximum message length to 1KB, and successively double the repetitions parameter until the timings stabilize. For longer lengths (1KB to 16KB) you may be able to use a smaller repetition factor.

    Prior to collecting timings, Ring "warms up" the machine by passing messages around the ring once before it turns on the timer. This helps to eliminate transient program behavior from your timing runs. Ring divides the timings by the number of repetitions around the ring, and reports three quantities: message length, bandwidth (MB/s), and message time in microseconds (μs). Using this information:

  • Plot the cost of message passing as a function of the message size. Use your favorite plotting package, e.g. gnuplot, or plot in Matlab. Explain the shape of the curve.
  • Plot the bandwidth as a function of the number of bytes in the message. You should observe that this curve levels off after a certain point is reached, which is called the peak bandwidth. Why does this occur?
  • Report the message startup time, which is the overhead of initiating a message. A good approximation to this time is simply the message passing time for the shortest messages. Pick the smallest time you observed as the startup time.
  • Compute the half power point n_1/2, which was discussed in class.
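The quantities in the list above can be related with a few lines of C. This is only a sketch under stated assumptions: it takes 1 MB to be 10^6 bytes (Ring may use 2^20), it assumes the usual definition of the half power point as the message length at which half of the peak bandwidth is achieved, and it assumes the linear cost model T(n) = t0 + n / r_peak. The function names are hypothetical.

```c
/* Bandwidth in MB/s for a message of `bytes` bytes that took `microseconds`.
   Assumes 1 MB = 10^6 bytes, so bytes per microsecond equals MB/s. */
double bandwidth_mb_per_s(double bytes, double microseconds)
{
    return bytes / microseconds;
}

/* Half power point in bytes under the assumed model T(n) = t0 + n / r_peak.
   The achieved bandwidth n / T(n) equals r_peak / 2 exactly when
   n = t0 * r_peak. With t0 in microseconds and r_peak in MB/s
   (bytes per microsecond), the product is in bytes. */
double half_power_point_bytes(double startup_us, double peak_mb_per_s)
{
    return startup_us * peak_mb_per_s;
}
```

For example, a startup time of 50 μs and a peak bandwidth of 100 MB/s would give a half power point of 5000 bytes under this model. Measured curves need not follow the linear model closely, so it is worth comparing the estimate against your bandwidth plot.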

    Additional experiments

    Message passing time depends on a number of factors, including the amount of copying needed to transmit and receive the data. Modify the Ring program to assess the effect of additional memory copying. In particular, copy each incoming message into a buffer before sending the data to the next processor. You might also try a second copy and note any effects observed.

    You may have noticed that some of the output in the Ring program is garbled. In particular, one or more node names appear later in the output than is consistent with their appearance in the code. What do you think is causing this behavior? Try to fix it if you can. Note the use of MPI_Barrier() and I/O flushing, which was a fix to the problem that worked correctly on a different installation of MPI.


    Part C

    You've been provided with a program called rb that implements the red-black Gauss-Seidel method. The code is on Valkyrie in the directory ~/../public/examples/rb. In this code, the right-hand side is hardwired to the constant function 8.0. Note that the method converges very slowly: as mentioned in class, it takes O(n^2) iterations to converge. So try running with a smaller value of n so you can check that the solution converges. Note that you may need to adjust the error tolerance, too. I would not let the error tolerance drop below 1/n. Print out the solution to make sure that it converges to the exact solution. Do this on 1 processor, and verify that the code is correct (to within machine roundoff) as you vary the number of processors.
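For orientation, here is a minimal serial sketch of one red-black Gauss-Seidel sweep in C. It is not the rb code. It assumes a 2D five-point discretization of laplacian(u) = f on an n x n interior with a fixed one-cell boundary and a constant right-hand side; the stencil, sign convention, storage layout, and the name red_black_sweep are all assumptions made for illustration.

```c
#include <stddef.h>

/* One red-black Gauss-Seidel sweep (illustrative; the rb code may differ).
   u is row-major with (n + 2) entries per row, including boundary cells.
   h is the mesh spacing and f the constant right-hand side.
   All points with (i + j) % 2 == 0 are updated first, then the others;
   each point depends only on neighbors of the opposite color. */
void red_black_sweep(double *u, size_t n, double h, double f)
{
    size_t stride = n + 2;
    for (size_t color = 0; color < 2; ++color) {
        for (size_t i = 1; i <= n; ++i) {
            /* First interior column j with (i + j) % 2 == color. */
            size_t j0 = 1 + ((i + 1 + color) % 2);
            for (size_t j = j0; j <= n; j += 2) {
                double neighbors = u[(i - 1) * stride + j]
                                 + u[(i + 1) * stride + j]
                                 + u[i * stride + j - 1]
                                 + u[i * stride + j + 1];
                u[i * stride + j] = 0.25 * (neighbors - h * h * f);
            }
        }
    }
}
```

Because points of one color never read points of the same color, every update within a color is independent, which is what makes the method convenient to parallelize.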

    The directory contains a README file explaining how to use the program, along with a subdirectory containing the source. Perform experiments to determine bottlenecks of the application. Run on 1, 2, and 4 processors--and more if you are able to get repeatable results--with a fixed workload. Repeat the run several times and report the best timings, but also report the worst time and the distribution of times (you may use a scatter plot if you wish). Did you notice any variations in the timings? You'll probably need to repeat each run 3 or 4 times, but try to continue until at least 2 runs agree to within 10%.
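The reporting rule above (best time, worst time, and at least two runs that agree to within 10%) can be sketched in C as follows. The function names are hypothetical, and measuring the 10% relative to the faster run of each pair is an assumption; other readings of "agree to within 10%" are possible.

```c
#include <stddef.h>

/* Fastest of `count` measured running times (count must be at least 1). */
double best_time(const double *t, size_t count)
{
    double best = t[0];
    for (size_t i = 1; i < count; ++i)
        if (t[i] < best) best = t[i];
    return best;
}

/* Slowest of `count` measured running times (count must be at least 1). */
double worst_time(const double *t, size_t count)
{
    double worst = t[0];
    for (size_t i = 1; i < count; ++i)
        if (t[i] > worst) worst = t[i];
    return worst;
}

/* Returns 1 if some pair of runs differs by at most `tol` (e.g. 0.10)
   relative to the faster run of that pair, and 0 otherwise. */
int two_runs_agree(const double *t, size_t count, double tol)
{
    for (size_t i = 0; i < count; ++i) {
        for (size_t j = i + 1; j < count; ++j) {
            double lo = t[i] < t[j] ? t[i] : t[j];
            double hi = t[i] < t[j] ? t[j] : t[i];
            if (hi - lo <= tol * lo) return 1;
        }
    }
    return 0;
}
```

With times of 5.0, 6.2, and 5.3 seconds, the best time is 5.0, the worst is 6.2, and the runs at 5.0 and 5.3 agree to within 10%, so you could stop repeating that experiment.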

    Your runs should last several seconds or longer, say, 5 seconds. (This implies that the runs on 1 processor will last for tens of seconds.) Report the running time T(P,N), and the parallel efficiency, E(P,N).
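As a reminder of how the reported quantities relate, the sketch below computes speedup and parallel efficiency for a fixed workload in C. It assumes the common definitions S = T(1,N) / T(P,N) and E(P,N) = S / P; if the definitions given in class differ, use those instead. The function names are hypothetical.

```c
/* Speedup for a fixed problem size: T(1,N) / T(P,N). */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Parallel efficiency for a fixed problem size: T(1,N) / (P * T(P,N)).
   A value of 1.0 means perfect scaling under this definition. */
double parallel_efficiency(double t1, double tp, int nprocs)
{
    return t1 / ((double)nprocs * tp);
}
```

For example, if the run takes 40 seconds on 1 processor and 20 seconds on 4 processors, the speedup is 2 and the efficiency is 0.5.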

    One difficulty with speedup curves is that small problems often do not have enough parallelism to utilize a large machine. An alternative measure is to let the problem size grow proportionately to the machine, so that in our case each processor gets a constant amount of work. This seems reasonable since one reason parallel computers are used is to solve more ambitious problems, that might offer new insight. So in some cases we want the problem size to grow with the machine.

    Report the running time T(P,N) and the parallel efficiency E(P,N), keeping W/P fixed. (What is W for this computation?) Plot E(P,N) as a function of P, the number of processors. Determine whether or not the overheads of your program are small when P=1, i.e. whether or not it is the "best serial program." Scale N such that running times are on the order of 5 seconds, as before. Do you notice an increase in running time as you increase the work with P?
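When W/P is held fixed, the ideal running time stays constant as P grows, so one common way to express efficiency is the ratio of the 1-processor time to the P-processor time on the proportionally larger problem. The sketch below assumes that definition; the name scaled_efficiency is hypothetical, and you should use the definition from class if it differs.

```c
/* Scaled efficiency with W/P fixed (assumed definition):
   t1 is the time on 1 processor for the base problem, and tp is the time
   on P processors for the problem grown so that each processor keeps the
   same amount of work. A value of 1.0 means the running time did not grow. */
double scaled_efficiency(double t1, double tp)
{
    return t1 / tp;
}
```

For example, 5 seconds on 1 processor and 10 seconds on P processors for the scaled problem gives an efficiency of 0.5.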

    The program performs convergence checking according to a frequency value passed via the command line flag -f. However, the code that carries out the convergence check was intentionally left blank. Implement the convergence check.

    To compute the error, you may use any of the functions discussed in class including: the change in the solution, the residual, or any other function you deem appropriate. The metric of the error may be computed using any of the norms discussed in class, e.g. L2, L infinity.
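The two norms named above can be computed over an array of values (for instance, the pointwise change in the solution, or the residual) with the C sketch below. The function names are hypothetical. The L2 routine returns the squared norm so that it can be compared against the square of the tolerance without calling sqrt; scale by the number of points first if you prefer a normalized norm.

```c
#include <stddef.h>

/* Sum of squares of the entries: the square of the (unnormalized) L2 norm. */
double norm_l2_squared(const double *v, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; ++i)
        sum += v[i] * v[i];
    return sum;
}

/* Largest absolute value among the entries: the L-infinity norm. */
double norm_linf(const double *v, size_t count)
{
    double largest = 0.0;
    for (size_t i = 0; i < count; ++i) {
        double a = v[i] < 0.0 ? -v[i] : v[i];
        if (a > largest) largest = a;
    }
    return largest;
}
```

In the parallel code each process holds only part of the data, so these local values still have to be combined across processes. One option is MPI_Allreduce, with MPI_SUM for the sum of squares and MPI_MAX for the L-infinity norm.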

    The frequency is in units of outer iterations. The smallest value is 1; the largest value is the length of the run in iterations, which is given by the -i flag. Vary the frequency f and plot the grind time as a function of f. Run on 8 processors or more. Is convergence checking costly when done at each iteration? At what value of f does convergence checking become inexpensive? You may need to vary N and P to perceive the effect. Report any patterns you observe and explain the underlying behavior.

    For each experiment state whether or not you think the program is scalable. What factors limit performance as you increase the number of processors P?


    Things you should turn in

    Document your work in a well-written report. Your report should present a clear evaluation of the performance, including bottlenecks of the implementation, and describe any special coding or tuned parameters. Do this for each of Parts B and C. Be sure to include the plots described above, along with any plotted data in tabular form.

    Your report should cite any written  works or software as appropriate in the text of the report. Citations should include any software you used that was written by someone else, or that was written by you for purposes other than this class.

    Provide sample output demonstrating correct operation of your code. Be sure to attach this output, along with your code listings and plots. Provide hard copy in class. Also transmit the report electronically. If possible, send a compressed archive file containing the report and attachments, along with an HTML file with an explanation of each image.


    How the program takes timings

    MPI_Wtime() reports wall clock times, so your timings will reflect interference from other jobs. On a dedicated run, such interference will be minimal; running times will be shorter and reproducible. We aren't yet able to make dedicated runs, however. So run the program several times until you see consistent results. If you are still having difficulties getting consistent results, try running on other nodes. You should use no more than 4 nodes for this experiment.

    To get an absolute time take the difference in times sampled by two successive calls to MPI_Wtime:

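            double t0 = MPI_Wtime();   // wall clock time before the timed region

            // some code....

            double t1 = MPI_Wtime();   // wall clock time after the timed region
            double time = t1 - t0;     // elapsed time in seconds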
    

    The man page for MPI_Wtime() is available online. Note, however, that all times are reported in seconds, and there is no need to divide by MPI_Wtick() to convert to seconds. (You can obtain the resolution of MPI_Wtime() using MPI_Wtick().)


    Compiling and Running your MPI programs

    To compile, use the makefiles provided for you. These makefiles include an architecture file (arch.valkyrie) containing the appropriate compiler settings. (We use a special version of the GNU C++ compilers that incorporates the MPI libraries.)

    To run your program, use the mpi-launch command. mpi-launch uses ssh to run programs remotely on the compute nodes. In order to run with mpi-launch, you must automate the ssh authentication procedure, by executing:
    $ ssh-agent $SHELL
    $ ssh-add
    
    Do this once per login session.

    mpi-launch provides the -m flag so you can specify which nodes to run on. The nodes are numbered from 0 to 15, and are named compute-0-0 through compute-0-15. Each node contains 2 CPUs, but for the moment you can run on only 1 CPU per node (more on this later).

    I'll be setting up a script to select nodes for you automatically. In the meantime, you'll need to specify the nodes with -m. (To get you started, I've put run files along with some of the programs you'll need to run.)

    The simplest usage of mpi-launch is:

    $ mpi-launch -m [ comma separated list of compute nodes ] command args

    Caution: You must use the full pathname for the command!

    For example, if user yoda wants to launch the program /home/yoda/hello-world on compute-0-4 and compute-0-5, yoda would execute:

    $ mpi-launch -m compute-0-4,compute-0-5 /home/yoda/hello-world
    

    Don't use mpirun, as that utilizes the slower 100 Mbit Ethernet interconnect. (You would also have to set a different path to use a different version of the compiler.)

    To run the executable program Ring (found in ~/../public/examples/Ring) on 4 processors  with command line arguments 5 and 1024, you  type:

    mpi-launch -m compute-0-0,compute-0-1,compute-0-2,compute-0-3 /home/cs260f/public/examples/ring/ring 5 1024

    Be sure to specify the full path of the executable.


    MPI

    MPI documentation is found at http://www.cse.ucsd.edu/users/baden/classes/cse260_fa02/testbeds.html#MPI. You can obtain man pages for the MPI calls used in the example programs described here:



     Copyright 2002 Scott B. Baden. Last modified: 10/08/02 09:31:42 PM