CSE 260 Assignment 2
Due Thursday, 10/5/06 in class

Changelog

Date Description
26-Sep-06 Original posting
03-Oct-06 The final part (modifying ring to communicate with nearest neighbors) is intended to demonstrate the impact of increased communication traffic on performance. In lieu of communicating with both neighbors, I suggest the following alternative. Have the processors communicate with just one neighbor, in matched even-odd pairs. That is, have processors 0 and 1 exchange messages, processors 2 and 3, and so on. (Thus, the number of processors must be an even number.) Use MPI_Sendrecv in lieu of asynchronous (immediate) calls.
4-Oct-06 Fixed typo in previous note. Number of processors must be an even number.
4-Oct-06 There is also MPI_Sendrecv_replace, which uses just one buffer. Either way is fine.

 

This assignment contains two parts involving MPI code provided to you. You'll make functional modifications to the code and conduct some experiments. All the code you need exists as subdirectories of the publicly readable directory
/export/home/cs260x-public/A1.
Henceforth this directory will be referred to as $(A1).
 
  1. Simplest MPI.  This part will acquaint you with the process of running an MPI program on Valkyrie. Copy the directory
    $(A1)/Basic
    to your home directory. Compile and run the two programs contained in Basic. Be sure to use the supplied Makefile so that you get the correct compiler and loader flags. The Makefile includes an architecture file, arch, that defines the appropriate command-line flags for compiling on Valkyrie. (If you modify this file, be sure to discuss your modifications, along with any impact, in your report.)

    Your first program is a parallel "hello world" program. It prints "Hello world" from each processor, along with the process ID, and reports the total number of processes in the run. Run the program on 3 processors. Here is some sample output:

    # processes: 3
    Hello world from process 2
    Hello world from process 0
    Hello world from process 1

    Do the process IDs come in any particular order? Why is this?

    In the second program, each process sends two integers to process 0, which prints out the values. A common convention is to call process 0 the root. Run the program several times on 4 processes. Do the process IDs come in any particular order? Why is this the case? Modify the program so that the processes report to the root in increasing order of rank.
     

  2. The purpose of this part is to show you how to set up the processes to communicate as if they were connected in a ring, and to measure some aspects of communication performance. You'll be asked to make some changes to this program as well.

    The program we've supplied you, ring, is found in $(A1)/Ring. This program treats the processors as if they were connected in a ring. Process 0 sends a message to its successor, process 1, and so on up to process P-1, which then sends the message back to process 0, P-1's successor in mod P arithmetic. The cycle then repeats. Messages of various sizes are passed around the ring, starting with 2-byte messages and progressing in powers of two up to a user-specified limit.

    Ring extracts up to 2 arguments from the command line, which must appear in the order specified:

    * -s NNN (NNN is the maximum size of message to be passed around the ring, in kilobytes.)
    * -t TTT (TTT is the "trips" parameter, giving the number of times the message will be passed around the ring.)

    By default, NNN=1024 and TTT=5. Under the default settings, Ring passes messages of 2, 4, 8, ..., 1K, ..., 256K five times around the processor ring. Ring measures the total time taken to pass messages around the ring, and reports statistics based on these times as a function of increasing message length. (See below for how Ring collects timing information.)
     

    Experiments

    Run ring on 4 processors, and experiment with messages into the megabyte range. Note any differences in message passing times on differing numbers of processors, and attempt to explain them. Run several times and report the best timings. Did you notice any variation in the timings?

    To get accurate timings, you may need to increase the repetitions parameter for short messages below about 1KB in length. Use the message size parameter to restrict the maximum message length to 1KB, and successively double the repetitions parameter until the timings stabilize. For longer lengths (1KB to 16KB) you may be able to use a smaller repetition factor.

    Prior to collecting timings, ring "warms up" the machine by passing messages around the ring once before it turns on the timer. This helps eliminate transient program behavior from your timing runs. ring divides the timings by the number of repetitions around the ring, and reports three quantities: message length, bandwidth (×10^6 bytes/sec), and message time in microseconds (µs). Using this information:
     

  * Plot the cost of message passing as a function of the message size. Use your favorite plotting package, e.g. gnuplot, or plot in matlab. Explain the shape of the curve. (For some documentation on gnuplot, go to http://www.duke.edu/~hpgavin/gnuplot.html.)
  * Plot the bandwidth as a function of the number of bytes in the message. You should observe that this curve levels off after a certain point is reached; this limiting value is called the peak bandwidth. Why does this occur? Use logarithmic scales as necessary to improve readability.
  * Also report the message startup time, which is the overhead of initiating a message. A good approximation to this time is simply the message passing time for the shortest messages. Pick the smallest time you observed as the startup time.
  * Determine the half-power point n_{1/2}, the message length at which the startup time α accounts for one-half of the total transmission time.
  * You may have noticed that some of the output of the Ring program is garbled. In particular, one or more process names appear later in the output than is consistent with their appearance in the code. What do you think is causing this behavior?
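One common way to make the half-power point precise, assuming the usual linear cost model (an assumption on our part; the handout does not pin down the model):

```latex
T(n) = \alpha + \frac{n}{B_\infty},
\qquad
T(n_{1/2}) = 2\alpha
\;\Longrightarrow\;
n_{1/2} = \alpha\, B_\infty
```

Here α is the startup time and B∞ the peak bandwidth. At n = n_{1/2} the startup term equals the transfer term, so startup accounts for half the total transmission time and the delivered bandwidth is half of B∞.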

    Additional experiments

    Message passing time depends on a number of factors, including the amount of copying needed to transmit and receive the data. Modify the Ring program to assess the effect of additional memory copying. In particular, copy each incoming message into a buffer before sending the data to the next process. You might also try a second copy and note any effects observed.

    Finally, modify the ring program so that each process exchanges messages with both of its neighbors. Have each process post two asynchronous receives and then send a message to each neighbor. Repeat the timing measurements you made for ring, and explain any differences. (See the changelog above for a simpler alternative: exchange with just one neighbor, in matched even-odd pairs, using MPI_Sendrecv.)


How the ring program takes timings

Ring calls MPI's timer function MPI_Wtime() to measure message passing times as wall-clock time. Your timings will reflect switch contention from other jobs. In a true dedicated run you would be the only user on the machine, and contention would be minimal; running times would be shorter and reproducible. Since it is unrealistic to exclude other users while we collect runs, use a technique that helps filter out spurious timings: run the program several times until you see consistent results. You should use no more than 4 processes for this experiment.

To get an absolute time take the difference in times sampled by two successive calls to MPI_Wtime():

        double t0   = MPI_Wtime(); 
        // some code....
        double t1   = MPI_Wtime();
        double time = (t1-t0);

The man page for MPI_Wtime() is available online. Note, however, that all times are reported in seconds, so there is no need to divide by MPI_Wtick() to convert to seconds. (You can obtain the resolution of MPI_Wtime() using MPI_Wtick().)

 


Computing environment

The hardware platform for the course is a Beowulf cluster named Valkyrie, which is managed by ACS. To learn how to use Valkyrie, be sure to read this very important web page, which tells you how to compile and run MPI programs: Getting Started with Valkyrie.

Also see two other web pages with links to software available in the course, as well as references and other valuable on-line sources of information.



MPI

Man pages for the MPI calls may be accessed with the man command. More extensive documentation can be found at http://www-cse.ucsd.edu/users/baden/Doc/mpi.html, including online man pages for the MPI calls used in the example programs described here.

 


Things you should turn in

Document your work in a well-written report that discusses your findings carefully. Your report should present a clear evaluation of the design of your code, including bottlenecks of the implementation, and describe any special coding or tuned parameters. Provide code listings for files you wrote, or that you modified based on versions we gave you (there is no need to provide listings of code we provided that you did not modify). Be sure to include the plots described above, along with any plotted data in tabular form.

Your report should cite any written  works or software as appropriate in the text of the report. Citations should include any software you used that was written by someone else, or that was written by you for purposes other than this class.

Turn in a hard copy: either put it in my mailbox on the 2nd floor, or put it in an envelope and slip it under my door if I'm not in. Also transmit the report electronically along with your code; instructions are discussed below. Your code will be run on Valkyrie, so be sure to report results from that machine.

Electronic turnin instructions

Transmit an electronic copy of your report as a compressed archive file (e.g. .zip, .tgz, .tar.gz). Organize the source code and plots into subdirectories, and include a README file explaining the contents. If you like, include a .html file called index.html. From there you can link to the README file and to the various components of your report, including the source code and any other files you want to include.

The name of your tar file should begin with your login ID, followed by the underscore character '_', followed by "hw2"; i.e., mine would be scott-b_hw2.tar.gz. Your archive file should create a directory with the name of your login ID. Be sure to delete all object (.o) files and executables before creating the archive file. Send the file to my CS email address, baden (at) ucsd (dot) edu, with the string cse260:hw2 in the subject line.



Copyright © 2006  Scott B. Baden. Last modified: 09/26/06 07:48 PM