Date | Description |
22-Sep-13 | Original posting |
MPI has been set up on Bang using a publicly available implementation called OpenMPI. Our cluster runs Rocks Cluster Linux 6.0, and we are using the Rocks-bundled "SGE" batch job scheduler. The latest OFED (OpenFabrics Enterprise Distribution), the InfiniBand software stack (v 1.5.4.1), is installed and supports the 20 Gbps 4X DDR InfiniBand low-latency interconnect network.
To use MPI, you first need to set up your environment. Add the following line to the end of your .bash_profile in $HOME:
module load openmpi_openib
You may run your job in batch, via qsub, or interactively, via qlogin. To submit a batch job, use the qsub command with a batch file:
qsub batch_bang.sh
If the job was successfully submitted, you'll get back a message of the following form:
Your job 1883 ("RING") has been submitted
You may view the progress of your job with the qstat command:
job-ID  prior    name  user   state  submit/start at      queue                        slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  1883  0.55500  RING  baden  r      10/28/2012 09:45:23  normal.q@compute-0-10.local  12
  1884  0.55500  RING  baden  r      10/28/2012 09:45:38  normal.q@compute-0-4.local   12
If you wish to see only your job(s), use "qstat -u <your login>". The "r" value tells you that the job is running. A "qw" value tells you that your job is waiting in the queue.
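For example, for the login shown in the listing above:
qstat -u baden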
When the job finishes, you'll get a job output file of the form RING.o1885. There will also be a .po file, but you may ignore it. Note that, after a job completes, a minute or two may elapse before the system frees up the resources for other users. Until the resources have been freed, you may see messages like the following; you may ignore them, and we are looking into how to remove them.
$ qstat -j
scheduling info:  queue instance "normal.q@compute-0-13.local" dropped because it is full
                  queue instance "normal.q@compute-0-11.local" dropped because it is full
If you need to remove a job, use the qdel command, specifying the job number listed in column 1 of the qstat output. (This is the same job number that was returned when you submitted the job.)
qdel 1883
The job script will specify the number of cores you wish to use. The script also specifies the queue. Remember that debug queue jobs can run for at most 30 minutes on 8 cores, and that jobs on the normal queue may run on at most 16 cores for 10 minutes. (If you are interested in running on more cores, let us know.)
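For reference, here is a minimal sketch of what such an SGE batch script can look like; the job name, queue, and core count are illustrative, and the batch_bang.sh provided for the course may use different directives:

#!/bin/bash
# Sketch of an SGE batch script; the course-provided batch_bang.sh may differ.
#$ -N RING        # job name (appears in qstat and in the output file names)
#$ -q normal.q    # queue to submit to
#$ -pe orte 16    # parallel environment and number of cores to allocate
#$ -cwd           # run the job from the directory it was submitted from
#$ -V             # export the current environment (e.g. loaded modules) to the job
mpirun -np 16 ring -t 10000 -s 16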
Some sample MPI applications have been installed in $(PUB)/Examples/MPI. The Makefiles for all MPI codes used in the course use an MPI version of the "arch" file, arch.gnu.mpi, which has been installed in $(PUB)/Arch. The MPI compilers (e.g. mpicc) are actually wrappers around the back-end compiler (e.g. gcc).
To launch an MPI program, use the mpirun command:
mpirun -np 4 ring -t 10000 -s 16
The mpirun command provides the -np flag to specify the number of processes. The value of this parameter should not exceed the number of available physical cores; otherwise the program will run, but very slowly, as it is multiplexing resources. Thus, in interactive mode, you should not run on more than 8 cores, and in batch jobs you shouldn't attempt to run with more cores than were allocated to the job. The next parameter (ring) is the name of the executable. Any command line arguments come in the usual position, after the name of the executable.
To establish that your environment has been set up correctly, copy, compile and run the parallel "hello world" program. This program prints "Hello World" from each process along with the process ID. It also reports the total number of processes in the run.
The hello world program is found in $(PUB)/Examples/Basic/hello. To run the program use mpirun as follows:
mpirun -np 2 hello
Here is some sample output:
# processes: 2
Hello world from node 0
Hello world from node 1
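The installed program may differ in detail, but a minimal MPI "hello world" that produces output of this form looks roughly as follows:

#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);                    // enter the MPI environment

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // this process's ID (rank)
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);    // total number of processes

    if (rank == 0)
        printf("# processes: %d\n", nprocs);
    printf("Hello world from node %d\n", rank);

    MPI_Finalize();                            // leave the MPI environment
    return 0;
}

Note that the per-process lines may appear in any order, a point taken up in the discussion of I/O below.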
In addition to specifying the number of cores you'll run with on the mpirun command line, you also need to specify the number of cores to be allocated to your job in the batch file. For example, to allocate 16 cores, add this line to your batch file:
#$ -pe orte 16
In general, the number of cores specified in the batch file must be no smaller than the number of cores specified on the mpirun command line. If it is smaller, you'll experience a slowdown, as MPI will multiplex multiple processes onto the same core, as could happen with a command line like this one:
mpirun -np 64 ./apf -n 1132 -i 3000 -x 4 -y 16
On the other hand, if you allocate many more cores to the job than you actually need, you are wasting resources, which are reserved for your exclusive use for the duration of your job. Use care, and avoid allocating more cores than you require in a job. In a computing system with allocation budgets (as at NERSC), this practice will help conserve your allocation.
While the number of cores in the batch file can be any number in the range of 1 to 256, the job scheduler assigns resources in units of whole nodes (8 cores), so there is no advantage to specifying a number of cores that is not divisible by 8. (In CSE 160, Fall 2013, the maximum number of cores is 16 at the time of this writing. This number could increase; watch Moodle for announcements.)
To take timings, use MPI's timer function MPI_Wtime(), which measures wall-clock time (like a stopwatch). On some systems, interactive mode shares the resources among multiple users, so timings will reflect contention from other jobs. On Bang, you'll have exclusive access to the processing nodes in interactive or batch mode, but not to the InfiniBand switch. To filter out the variations caused by switch contention, run the program at least 3 times (for each input size, number of cores, or any other variable that affects the running time) until you see a consistent timing emerge, such that a majority agree within 10%. If you observe a lot more variation, try doubling the number of runs to see if you can obtain a better consensus. Exclude outlying timings, but note them in your report. Report the minimal running time rather than the average, as previously discussed in class.
To measure time, take the difference in times sampled by two successive calls to MPI_Wtime():
double t0 = MPI_Wtime();
// some code....
double t1 = MPI_Wtime();
double time = (t1 - t0);
MPI_Wtime() reports time in seconds, so there is no need to scale the value returned by the timer. (You can query the resolution of MPI_Wtime() with MPI_Wtick().) The man page for MPI_Wtime() is available online.
Recall that the running time of the application is that of the last processor to finish. Don't time individual iterations; rather, time an entire run, excluding initialization, the calls to MPI_Init() and MPI_Finalize(), and any other activities that are not representative of a run with many iterations.
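A sketch of this pattern follows; the barrier and the MPI_MAX reduction onto process 0 are one reasonable way to report the time of the last process to finish, though your code may organize this differently:

#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);               // start all processes together
    double t0 = MPI_Wtime();

    // ... the full set of iterations you want to time goes here ...

    double tLocal = MPI_Wtime() - t0;

    // The application's running time is that of the last process to finish,
    // so take the maximum over all processes and report it from process 0.
    double tMax;
    MPI_Reduce(&tLocal, &tMax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Running time: %f seconds\n", tMax);

    MPI_Finalize();
    return 0;
}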
It can be difficult to measure communication times accurately, owing to the short durations involved. Therefore, we will use an indirect measurement technique: disable communication and subtract the resulting time from the time of a run that includes communication. To this end, you'll need to add an option to your program that disables communication calls like MPI_Send.
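One way to structure that option, sketched here, is to guard each communication call with a flag; the helper and flag names are hypothetical, and how the flag is parsed from the command line is up to you:

#include <mpi.h>

// Hypothetical helper: exchange data with a neighbor only when communication
// is enabled. When commEnabled is false, the exchange is skipped entirely,
// so a timed run measures computation only.
void exchange(double *sendBuf, double *recvBuf, int count, int neighbor,
              bool commEnabled) {
    if (!commEnabled)
        return;
    const int tag = 0;
    MPI_Request req;
    MPI_Irecv(recvBuf, count, MPI_DOUBLE, neighbor, tag, MPI_COMM_WORLD, &req);
    MPI_Send(sendBuf, count, MPI_DOUBLE, neighbor, tag, MPI_COMM_WORLD);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

Posting the receive before the send is one way to avoid the deadlock that can occur when two neighbors both issue a blocking MPI_Send first.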
While the PPF package discussed in class can help sort out some I/O problems, it can only synchronize a single line of output. When you have multiple lines, and you are using PPF_Print() or just printf(), you will find that fflush() does not sort out the problem of garbled I/O.
The only sure way of getting I/O to stdout (or stderr) to synchronize properly is to use sprintf() to generate string output, and then collect the strings via MPI_Gatherv() on a distinguished process (say, process 0). In other words, only one process should actually perform the I/O.
There is an example of how to manage I/O in the Ring example code:
$PUB/Examples/MPI/Ring/getHost.C
This example shows how to gather string output using an MPI function that returns the host name as a string.
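A minimal sketch of the same idea, assuming each process contributes one fixed-length line: getHost.C demonstrates the general approach, and variable-length output calls for MPI_Gatherv(), but the fixed-length case below gets by with MPI_Gather():

#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Each process formats its output into a fixed-size buffer...
    const int LEN = 128;
    char line[LEN];
    snprintf(line, LEN, "Hello world from node %d\n", rank);

    // ...and process 0 gathers all the buffers and performs the I/O.
    char *all = (rank == 0) ? new char[LEN * nprocs] : NULL;
    MPI_Gather(line, LEN, MPI_CHAR, all, LEN, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int p = 0; p < nprocs; p++)
            fputs(all + p * LEN, stdout);    // output appears in rank order
        delete [] all;
    }

    MPI_Finalize();
    return 0;
}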
It is possible to debug your code with gdb. Follow the instructions at the OpenMPI debugging FAQ site, and read the section "Attach to individual MPI processes after they are running." The example program messages.cpp (in $PUB/Examples/MPI/Basic) has been set up so you can use gdb with the --pid flag, as discussed in that OpenMPI writeup. Be sure to include the two header files shown at the top of the code.
#include <stdio.h>
#include <unistd.h>
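The attach technique described in that FAQ amounts to having each process print its PID and host name and then spin until you attach with gdb and release it; a sketch (messages.cpp may differ in its details):

// Hold this MPI process until a debugger attaches and sets i to a nonzero
// value (for example, "set var i = 1" inside gdb), after which it continues.
void waitForDebugger() {
    volatile int i = 0;
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("PID %d on %s ready for attach\n", (int)getpid(), hostname);
    fflush(stdout);
    while (i == 0)
        sleep(5);
}

You would then log into the node shown in the message, run gdb with the --pid flag on the printed process ID, set i to a nonzero value, and continue.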
The web pages also discuss a mechanism for checking that MPI calls are set up correctly, as well as memchecker. (But be warned that valgrind will slow your program down a lot.) MPI routines have many arguments, and these can be messy to sort out unless you have the documentation at your fingertips. You can access the MPI man pages on Bang once you set up your MANPATH to include them. Just put the following in your .bash_profile:
export MANPATH="/opt/openmpi/share/man:$MANPATH"
There is a nice tutorial with some worked examples. Man pages for the MPI calls may be accessed with the man command, and they are also available online. More extensive documentation can be found at http://www-cse.ucsd.edu/users/baden/Doc/mpi.html. The on-line man pages for the MPI calls used in this assignment appear below.
MPI_Init | Initialize the MPI environment. Call this prior to making any MPI calls. |
MPI_Finalize | Exit the MPI environment. Call this function before exiting your program. |
MPI_Comm_size | Queries the total number of processes used in this program invocation. |
MPI_Comm_rank | Queries this process's unique identifier, an integer called the rank. |
MPI_Send | Sends a message to another process; a return signifies that the message buffer may be reused. |
MPI_Recv | Receives a message from another process; a return signifies that the message has been received. |
MPI_Irecv | Immediate form of Receive. A return does not signify that the message has been received. To receive this assurance, you need to call MPI_Wait(). |
MPI_Isend | Non-blocking variant of Send(). A return does not signify that the message buffer may be reused. To receive this assurance, you need to call MPI_Wait(). |
MPI_Wait | A return indicates that a previously posted Irecv (or Isend) has completed. |
MPI_Wtime | Timer; returns wall-clock time in seconds. |
MPI_Gather | Gathers data from all processes onto the designated root process. All processes contribute the same amount of data. A collective routine, which all processes in the communicator must call. This may be useful in debugging. |
MPI_Barrier | Blocks calling process until all processes have checked in. A collective routine, which all processes in the communicator must call. |