Date | Description |
26-Sep-06 | Original posting |
03-Oct-06 | The final part (modifying ring to communicate with nearest neighbors) is intended to demonstrate the impact of increased communication traffic on performance. In lieu of communicating with both neighbors, I suggest the following alternative. Have the processors communicate with just one neighbor, in matched even-odd pairs. That is, have processors 0 and 1 exchange messages, processors 2 and 3, and so on. (Thus, the number of processors must be even.) Use MPI_Sendrecv in lieu of asynchronous (immediate) calls. |
04-Oct-06 | Fixed typo in previous note. Number of processors must be an even number. |
04-Oct-06 | There is also MPI_Sendrecv_replace, which uses just one buffer. Either way is fine. |
This assignment contains two parts involving MPI code provided to you. You'll make functional modifications to the code and conduct some experiments. All the code you need exists as subdirectories of the publicly readable directory
Your first program is a parallel "hello world" program. It prints "Hello world" from each processor, along with the process ID, and reports the total number of processors in the run. Run the program on 3 processors. Here is some sample output:
# processes: 3
Hello world from process 2
Hello world from process 0
Hello world from process 1
Do the process ID's come in any particular order? Why is this?
In the second program, message, each process sends two integers to process 0, which prints out the values. A common convention is to call process 0 the root. Run program 2 several times on 4 processes. Do the process IDs come in any particular order? Why is this the case? Modify the program so that the processes report to the root in increasing order of rank.
The program we've supplied you, ring, is found in $(A1)/Ring. This program treats the processors as if connected in a ring. Process 0 sends a message to successor process 1, and so on until process P-1, which then sends the message back to process 0, which is P-1's successor in mod P arithmetic. The cycle then repeats. Messages of various sizes are passed around the ring, starting with 2-byte messages and progressing in powers of two up to a user-specified limit.
Ring extracts up to 2 arguments from the command line, which must appear in the order specified. By default, NNN=1024 and TTT=5. Under the default settings, Ring passes messages of 2, 4, 8, ..., 1K, ..., 256K bytes five times around the processor ring. Ring measures the total time taken to pass messages around the ring, and reports statistics based on these times as a function of increasing message length. (See below for how Ring collects timing information.)
Run ring on 4 processors, and experiment with messages into the megabyte range. Note any differences in message passing times on differing numbers of processors, and attempt to explain them. Run several times and report the best timings. Did you notice any variations in the timings?
To get accurate timings, you may need to increase the repetitions parameter for short messages below about 1KB in length. Use the message size parameter to restrict the maximum message length to 1KB, and successively double the repetitions parameter until the timings stabilize. For longer lengths (1KB to 16KB) you may be able to use a smaller repetition factor.
Prior to collecting timings, ring "warms up" the machine by passing messages around the ring once before it turns on the timer. This helps to eliminate transient program behavior from your timing runs. Ring divides the timings by the number of repetitions around the ring, and reports three quantities: message length, bandwidth (x10^6 bytes/sec), and message time in microseconds (us). Using this information, plot the reported quantities as a function of message length.
You may have noticed that some of the output in the Ring program is garbled. In particular, one or more process names appear later in the output than is consistent with their appearance in the code. What do you think is causing this behavior?
Finally, modify the ring program so that each process exchanges messages with both neighbors. Have each process post two asynchronous receives and then send out messages to each neighbor. Repeat the timing measures you did for ring, and explain any differences.
Ring calls MPI's timer function MPI_Wtime() to measure message passing times (as wall clock time). Your timings will reflect switch contention from other jobs. In a true dedicated run, you would be the only user on the machine, and contention would be minimal; running times would be shorter and reproducible. Since it is unrealistic to exclude other users while we collect runs, we use a technique that helps filter out spurious timings: run the program several times until you see consistent results. You should use no more than 4 processes for this experiment.
To get an absolute time take the difference in times sampled by two successive calls to MPI_Wtime():
double t0 = MPI_Wtime();
// some code....
double t1 = MPI_Wtime();
double time = (t1 - t0);
The man page for MPI_Wtime() is available online. Note that all times are reported in seconds, so there is no need to divide by MPI_Wtick() to convert to seconds. (You can obtain the resolution of MPI_Wtime() using MPI_Wtick().)
Also see two other web pages with links to software available in the course, as well as references and other valuable online sources of information. These are found at:
Man pages for the MPI calls may be accessed with the man command. More extensive documentation can be found at http://www-cse.ucsd.edu/users/baden/Doc/mpi.html. You can obtain man pages online for the MPI calls used in the example programs described here:
Document your work in a well-written report which discusses your findings carefully. Your report should present a clear evaluation of the design of your code, including bottlenecks of the implementation, and describe any special coding or tuned parameters. Provide code listings for files you wrote, or that you modified based on versions we gave to you (there is no need to provide listings of code we provided, that you did not modify). Be sure to include the plots described above, along with any plotted data in tabular form.
Your report should cite any written works or software as appropriate in the text of the report. Citations should include any software you used that was written by someone else, or that was written by you for purposes other than this class.
Turn in hard copy: you can either put it in my mailbox on the 2nd floor or put it in an envelope and slip it under my door if I'm not in. Also transmit the report electronically along with your code; instructions are discussed below. Your code will be run on Valkyrie, so be sure to report results from that machine. Transmit an electronic copy of your report as a compressed archive file (e.g. .zip, .tgz, .tar.gz). Organize the source code and plots into subdirectories, and include a README file explaining the contents. If you like, include a .html file called index.html. From there you can link to the README file and to the various components of your report, including the source code and any other files you want to include.
The name of your tar file should begin with your login ID, followed by the underscore character '_', followed by "hw1"; i.e. mine would be scott-b_hw1.tar.gz. Your archive file should create a directory with the name of your login ID. Be sure to delete all object (.o) files and executables before creating the archive file. Send the file to my CS email address, baden (at) ucsd (dot) edu, with the string cse260:hw2 in the subject: line.
Copyright © 2006 Scott B. Baden. Last modified: 09/26/06 07:48 PM