Valkyrie should only be used for parallel program development and measurement. If you need to use a Linux or UNIX system, please use your student account on the Solaris machines in the Advanced Programming Environment (APE) lab, located in APM 6426.
Software for the course will be placed in a public directory, which you may access from your home directory as ~/../public. The software for this assignment lives in various subdirectories of ~/../public/examples.
Please see the two important web pages with extensive listings of software available in the course, along with additional information on locating research papers and other resources.
ACS has set up a web page providing documentation on how to use Valkyrie:
http://www-acs.ucsd.edu/offerings/userhelp/HTML/rocks,d.html
On Valkyrie you will run with the bash shell. To get the correct compilers, you will need to set your path. Add the following line to your .bash_profile file, which you should find in your home directory:
export PATH=/usr/mpich/c/bin:$PATH
This assignment consists of 3 parts. In all three parts you'll compile and run code supplied to you, and in Part 2 you'll make modifications to the code.
This part is not to be handed in. It will help acquaint you with the process of running an MPI program. Compile and run the two programs in the subdirectory called Basic. Be sure to use the Makefile that we've supplied so you'll get the correct compiler and loader flags. The Makefile includes an "arch" file that defines appropriate command line flags for the compiler. You currently have arch.valkyrie, which provides the settings needed for Valkyrie.
Your first program is a parallel "hello world" program. It prints "Hello world" from each processor, along with the process ID, and also reports the total number of processors in the run. The hello world program is found in ~/public/examples/Basic/hello. Run it as follows:
mpi-launch -m compute-0-0,compute-0-1 /home/cs260f/public/examples/Basic/hello
Here is some sample output:
# processes: 2
Hello world from node 0
Hello world from node 1

Do the node IDs come in any particular order? Why is this?
In the second program each processor sends two integers to node 0, which prints out the values. A common convention is to call node 0 the manager or boss. Run program 2 several times on 4 processors. Do the node IDs come in any particular order? Why is this the case? Modify the program so that the nodes report to the manager in increasing order of node rank; one possible approach is sketched below. Note the use of the MPI_Barrier() call, which provides barrier synchronization. Remove this call and note the effect.
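One possible approach is for the manager to post its receives in increasing rank order. The following is a minimal sketch only, under the assumption that each worker sends two integers to node 0; the payload, tag, and variable names are illustrative and are not those of the supplied program.

#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    int vals[2] = { rank, 2 * rank };            // illustrative payload
    if (rank == 0) {
        // Post the receives in increasing rank order; node 0 blocks on
        // each source in turn, so the reports appear in rank order.
        for (int src = 1; src < nproc; src++) {
            MPI_Status status;
            MPI_Recv(vals, 2, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
            printf("node %d sent %d %d\n", src, vals[0], vals[1]);
        }
    } else {
        MPI_Send(vals, 2, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

The point is that naming a specific source rank in MPI_Recv(), rather than MPI_ANY_SOURCE, is what imposes the ordering.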
The purpose of this part is to show you how to set up the nodes to communicate as if they were connected in a ring, and to measure some aspects of communication performance. You'll be asked to make some changes to this program. Bring the plots for your runs to class on Tuesday, as described below.
The program we've supplied you, Ring.C, is found in the Ring/ subdirectory of ~/../public/examples. This program treats the processors as if they were connected in a ring. Node 0 sends a message to its successor, node 1, and so on until node P-1, which then sends the message back to node 0, P-1's successor in mod P arithmetic. The cycle then repeats. Messages of various sizes are passed around the ring, starting with 2-byte messages and progressing in powers of two up to a user-specified limit.
Ring extracts up to 2 arguments from the command line, which must appear in the order specified. By default NNN=1024 and TTT=5. Under the default settings, Ring passes messages of 2, 4, 8, ..., 1K, ..., 256K five times around the processor ring. It calls MPI's built-in timer MPI_Wtime() to measure message passing times, and reports statistics based on these times as a function of increasing message length.
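The core of the ring exchange looks roughly like the following sketch. This is not the supplied Ring.C; the function name, the message tag, and the use of MPI_CHAR buffers are assumptions made for illustration only.

#include <mpi.h>
#include <vector>

// One trip of a len-byte message around the ring.  Returns the trip
// time on node 0 and 0.0 elsewhere.
double ring_trip(int rank, int P, int len) {
    std::vector<char> buf(len);
    int next = (rank + 1) % P;            // successor in mod P arithmetic
    int prev = (rank + P - 1) % P;        // predecessor
    MPI_Status status;
    if (rank == 0) {
        double t0 = MPI_Wtime();
        MPI_Send(&buf[0], len, MPI_CHAR, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&buf[0], len, MPI_CHAR, prev, 0, MPI_COMM_WORLD, &status);
        return MPI_Wtime() - t0;
    } else {
        MPI_Recv(&buf[0], len, MPI_CHAR, prev, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&buf[0], len, MPI_CHAR, next, 0, MPI_COMM_WORLD);
        return 0.0;
    }
}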
Run Ring on 2, 4, and 8 processors, and experiment with messages in the megabyte range. Note any differences in message passing times on differing numbers of processors, and attempt to explain them. Run several times and report the best timings. Did you notice any variations in the timings? Compare results when you run with 1 CPU per node and 2 CPUs per node and account for the differences (see below for how to achieve these effects).
To get accurate timings, you may need to increase the repetitions parameter for short messages below about 1KB in length. Use the message size parameter to restrict the maximum message length to 1KB, and successively double the repetitions parameter until the timings stabilize. For longer lengths (1KB to 16KB) you may be able to use a smaller repetition factor.
Prior to collecting timings, Ring "warms up" the machine by passing messages around the ring once before it turns on the timer. This helps to eliminate transient program behavior from your timing runs. Ring divides the timings by the number of repetitions around the ring, and reports three quantities: the message length, the bandwidth (MB/s), and the message time in microseconds. Using this information, characterize how communication time and bandwidth vary with message length.
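One common way to summarize such measurements (an assumption on our part, not a requirement stated in the assignment) is the linear model T(n) ≈ α + n/β, where α is the per-message start-up cost and β the asymptotic bandwidth. The timings for the shortest messages are dominated by α, while the longest messages approach β.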
You may have noticed that some of the output in the Ring program is garbled. In particular, one or more node names appear later in the output than is consistent with their appearance in the code. What do you think is causing this behavior? Try to fix it if you can.
Note the use of MPI_Barrier() and I/O flushing, a fix that worked correctly on a different installation of MPI.
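For reference, that fix looks roughly like the sketch below. This is illustrative only: the MPI standard does not guarantee how stdout from different nodes is interleaved, so flushing plus a barrier may or may not help on a given installation.

#include <mpi.h>
#include <cstdio>

// Each node prints in turn, flushing stdout and synchronizing with a
// barrier between turns.
void ordered_print(const char* msg) {
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    for (int turn = 0; turn < nproc; turn++) {
        if (rank == turn) {
            printf("node %d: %s\n", rank, msg);
            fflush(stdout);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }
}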
The directory contains a README file explaining how to use the program, along with a subdirectory containing the source. Perform experiments to determine bottlenecks of the application. Run on 1, 2, and 4 processors--and more if you are able to get repeatable results--with a fixed workload. Repeat the run several times and report the best timings, but also report the worst time and the distribution of times (you may use a scatter plot if you wish). Did you notice any variations in the timings? You'll probably need to repeat each run 3 or 4 times, but try to continue until at least 2 runs agree to within 10%.
Your runs should last several seconds or longer, say, 5 seconds. (This implies that the runs on 1 processor will last for tens of seconds.) Report the running time T(P,N), and the parallel efficiency, E(P,N).
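As a reminder, one standard definition of parallel efficiency for a fixed problem size (assuming T(1,N) is measured with the best serial program) is E(P,N) = T(1,N) / (P * T(P,N)), so that perfect speedup corresponds to E(P,N) = 1.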
One difficulty with speedup curves is that small problems often do not have enough parallelism to utilize a large machine. An alternative measure is to let the problem size grow proportionately to the machine, so that in our case each processor gets a constant amount of work. This seems reasonable since one reason parallel computers are used is to solve more ambitious problems, that might offer new insight. So in some cases we want the problem size to grow with the machine.
Report the running time T(P,N) and the parallel efficiency E(P,N), keeping W/P fixed (what is W for this computation?). Plot E(P,N) as a function of P, the number of processors. Determine whether or not the overheads of your program are small when P=1, i.e. whether or not it is the "best serial program." Scale N so that running times are on the order of 5 seconds, as before. Do you notice an increase in running time as you increase the work with P?
The program performs convergence checking according to a frequency value passed via the command line flag -f. However, the code that carries out the convergence check was intentionally left blank. Implement the convergence check.
To compute the error, you may use any of the measures discussed in class, including the change in the solution, the residual, or any other function you deem appropriate. The error may be measured in any of the norms discussed in class, e.g. L2 or L-infinity.
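As one example, a sketch of a global L2 norm of the change in the solution appears below. The array names, the assumption that each node owns nlocal points, and the reduction over MPI_COMM_WORLD are placeholders for illustration; adapt them to the actual data structures in the code.

#include <mpi.h>
#include <cmath>

// Global L2 norm of the change in the solution.  Each node contributes
// the sum of squared differences over its nlocal locally owned points.
double change_l2(const double* unew, const double* uold, int nlocal) {
    double local = 0.0;
    for (int i = 0; i < nlocal; i++) {
        double d = unew[i] - uold[i];
        local += d * d;
    }
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return std::sqrt(global);
}

Note that MPI_Allreduce() is a collective operation, which is one reason checking convergence at every iteration can be costly.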
The frequency is in units of outer iterations. The smallest value is 1; the largest is the length of the run in iterations, which is given by the -i flag. Vary the frequency f and plot the grind time as a function of f. Run on 8 or more processors. Is convergence checking costly when done at each iteration? At what value of f does convergence checking become inexpensive? You may need to vary N and P to perceive the effect. Report any patterns you observe and explain the underlying behavior.
For each experiment, state whether or not you think the program is scalable. What factors limit performance as you increase the number of processors P?
Document your work in a well-written report. Your report should present a clear evaluation of the performance, including bottlenecks of the implementation, and describe any special coding or tuned parameters. Do this for each of Parts B and C. Be sure to include the plots described above, along with the plotted data in tabular form.
Your report should cite any written works or software as appropriate in the text of the report. Citations should include any software you used that was written by someone else, or that was written by you for purposes other than this class.
Provide sample output demonstrating correct operation of your code. Be sure to attach this output, along with your code listings and plots. Provide hard copy in class, and also transmit the report electronically. If possible, send a compressed archive file containing the report and attachments, along with an html file with an explanation of each image.

MPI_Wtime() reports wall clock times, so your timings will reflect interference from other jobs. On a dedicated run, such interference would be minimal; running times would be shorter and reproducible. We aren't yet able to make dedicated runs, however, so run the program several times until you see consistent results. If you are still having difficulty getting consistent results, try running on other nodes. You should use no more than 4 nodes for this experiment.
To measure an elapsed time, take the difference of the values returned by two successive calls to MPI_Wtime():
double t0 = MPI_Wtime();
// some code....
double t1 = MPI_Wtime();
double time = t1 - t0;
The man page for MPI_Wtime() is available online. Note, however, that all times are reported in seconds; there is no need to divide by MPI_Wtick() to convert to seconds. (You can obtain the resolution of MPI_Wtime() using MPI_Wtick().)
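For example, the following fragment (assuming MPI has already been initialized and <cstdio> is included) prints the clock resolution:

printf("MPI_Wtime resolution: %g seconds\n", MPI_Wtick());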
$ ssh-agent $SHELL
$ ssh-add

Do this once per login session.
mpi-launch provides the -m flag so you can specify which nodes to run on. The nodes are numbered from 0 to 15 and are named compute-0-0 through compute-0-15. Each node contains 2 CPUs; for the moment you can run on only 1 CPU per node (more on this later).
I'll be setting up a script to select nodes for you automatically. In the meantime, you'll need to specify the nodes yourself with -m. (To get you started, I've put run files alongside some of the programs you'll need to run.)
The simplest usage of mpi-launch is:
$ mpi-launch -m [ comma separated list of compute nodes ] command args
Caution: You must use the full pathname for the command!
For example, if user yoda wants to launch the program /home/yoda/hello-world on compute-0-4 and compute-0-5, yoda would execute:
$ mpi-launch -m compute-0-4,compute-0-5 /home/yoda/hello-world
Don't use mpirun, as it utilizes the slower 100 Mbit ethernet interconnect. (You would also need to set a different path to use a different version of the compiler.)
To run the executable program Ring (found in ~/../public/examples/Ring) on 4 processors with command line arguments 5 and 1024, you type:

mpi-launch -m compute-0-0,compute-0-1,compute-0-2,compute-0-3 /home/cs260f/public/examples/ring/ring 5 1024

Be sure to specify the full path of the executable.
MPI documentation is found at http://www.cse.ucsd.edu/users/baden/classes/cse260_fa02/testbeds.html#MPI, where you can obtain man pages for the MPI calls used in the example programs described here.
Copyright © 2002 Scott B. Baden. Last modified: 10/08/02 09:3142 PM