Date      | Description
5-Oct-06  | Original posting.
6-Oct-06  | Discussed strategy for gaining dedicated access to nodes.
6-Oct-06  | Linked in the Leighton handout.
20-Oct-06 | Mentioned the USE_SPRNG definition for MPI.
In this laboratory you'll parallelize and experiment with an existing connected component labeling algorithm provided to you. You should also attempt to speed up single-processor performance. Refer to the Leighton and Schroeder handouts for background.
Although ccl is a serial program, it contains some hooks to help get you started with the MPI implementation: broadcasting command line arguments and taking timings with MPI_Wtime(). Enable the hooks by uncommenting the definition of MPI_ON in the Makefile. The code may be configured to use the built-in Linux random number generator, or to use SPRNG, a random number generator installed on Valkyrie in PUB/lib/SPRNG. SPRNG may also be used in parallel mode to ensure multiple independent random number streams; examples of serial and parallel (MPI) invocations of SPRNG are in PUB/examples/sprng_ex. To use SPRNG, uncomment the macro USE_SPRNG in the Makefile; if you are using SPRNG together with MPI, you must also define USE_MPI.
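As a rough illustration of what hooks of this kind look like, here is a minimal sketch in which rank 0 parses the arguments, the parsed values are broadcast, and MPI_Wtime() brackets the timed region. The variable names (n, p, seed) and placeholder values are assumptions, not ccl's actual identifiers:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = 0, seed = 0;
    double p = 0.0;
    if (rank == 0) {
        // In ccl, the command line parser (cmdLine.C) would fill these in;
        // the values here are placeholders.
        n = 1000; p = 0.5; seed = 12345;
    }
    // Broadcast the parsed arguments so that every rank agrees on them.
    MPI_Bcast(&n,    1, MPI_INT,    0, MPI_COMM_WORLD);
    MPI_Bcast(&p,    1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(&seed, 1, MPI_INT,    0, MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    // ... the labeling computation would run here ...
    double t1 = MPI_Wtime();
    if (rank == 0) printf("elapsed time: %f s\n", t1 - t0);

    MPI_Finalize();
    return 0;
}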
The program provides various command line options, which are parsed in the source file cmdLine.C. In particular, you may seed the random number generator with an arbitrary value; otherwise the seed is based on the time of day. The program outputs the seed so that you can reproduce a given run.
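A minimal sketch of that seeding policy, with a hypothetical chooseSeed() helper (the real logic lives in ccl's command line handling):

#include <cstdio>
#include <ctime>

// Use the seed supplied on the command line if one was given; otherwise
// derive a seed from the time of day. Either way, echo the seed so the
// run can be reproduced later.
long chooseSeed(long userSeed) {   // userSeed <= 0 means "none given"
    long seed = (userSeed > 0) ? userSeed : (long) time(NULL);
    printf("random number seed: %ld\n", seed);
    return seed;
}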
The code considers points to be adjacent only if they are nearest neighbors in the Manhattan coordinate directions: left, right, up, and down. This differs from the stencil used in the Schroeder excerpt, which also includes the corners. If you like, experiment with the 9-point stencil, but be sure to present results for the Manhattan stencil.
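Expressed as neighbor-offset tables (a sketch only; ccl's internal representation may differ):

// Manhattan (4-point) stencil used by the assignment.
const int manhattan[4][2] = { {-1, 0}, {1, 0}, {0, -1}, {0, 1} };

// Schroeder's stencil adds the four corner neighbors.
const int withCorners[8][2] = { {-1,-1}, {-1, 0}, {-1, 1},
                                { 0,-1},          { 0, 1},
                                { 1,-1}, { 1, 0}, { 1, 1} };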
First determine pc, the critical value of the independent site probability p that maximizes the running time. It may be easier to pinpoint pc by looking at tabular data; sample p more closely in the neighborhood of criticality so that you can determine pc to 3 decimal places. Include the tabular data with your report, and also plot the running time as a function of p.
Since the initial conditions are randomized, you'll need to take steps to ensure that your timings are statistically significant. For each value of p, make several (5 to 10) runs, each with a different random number seed. Plot the mean value as a curve, but also plot the maximum and minimum values for each value of p you measured. You should also repeat a few of your runs with the same random number seed, to check that your timings are consistent. To do this, use the -s flag to set the random number seed to the value reported by the reference run you wish to reproduce.
To determine an appropriate value of N, experiment by starting with N=100 and doubling N until the run completes in about 5 to 10 seconds at criticality.
Parallelize the code. Label components in two phases. In the first phase, each processor labels its assigned part of the domain. (You may use ghost cells to hold labels from cells in neighboring processes.) In the second phase, processes re-label their clusters using the updated labels obtained from neighboring processes; this phase may take several iterations before all labels are finalized (a sketch follows below). Try to make the code go as fast as you can. If you are working in a team you must implement 2D partitioning; otherwise 1D is sufficient. But we will be using 2D partitioning in the next assignment, so try to implement it now as well.
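Here is a minimal sketch of the phase-2 iteration for a 1D row decomposition. Everything in it is an assumption rather than ccl's actual structure: the array layout, the EMPTY sentinel, and the names (labelClusters, nLocalRows, nCols) are illustrative, and a tuned implementation would restrict the sweep to cells near the partition boundary rather than re-scanning the whole subgrid:

#include <mpi.h>
#include <vector>
#include <algorithm>

// label holds (nLocalRows + 2) x nCols ints; rows 0 and nLocalRows+1 are
// ghost rows. Occupied cells start with unique labels (e.g., the global
// cell index); unoccupied cells and edge-of-domain ghost rows hold EMPTY.
const int EMPTY = -1;

void labelClusters(std::vector<int>& label, int nLocalRows, int nCols,
                   int up, int down) {  // neighbor ranks, MPI_PROC_NULL at edges
    int globalChanged = 1;
    while (globalChanged) {
        // Refresh the ghost rows from the neighboring processes.
        MPI_Sendrecv(&label[1 * nCols], nCols, MPI_INT, up, 0,
                     &label[(nLocalRows + 1) * nCols], nCols, MPI_INT, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&label[nLocalRows * nCols], nCols, MPI_INT, down, 1,
                     &label[0], nCols, MPI_INT, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Sweep: each occupied cell adopts the minimum label among itself
        // and its occupied Manhattan neighbors.
        int changed = 0;
        for (int i = 1; i <= nLocalRows; i++) {
            for (int j = 0; j < nCols; j++) {
                int c = label[i * nCols + j];
                if (c == EMPTY) continue;
                int m = c, nb;
                nb = label[(i - 1) * nCols + j];
                if (nb != EMPTY) m = std::min(m, nb);
                nb = label[(i + 1) * nCols + j];
                if (nb != EMPTY) m = std::min(m, nb);
                if (j > 0) {
                    nb = label[i * nCols + j - 1];
                    if (nb != EMPTY) m = std::min(m, nb);
                }
                if (j < nCols - 1) {
                    nb = label[i * nCols + j + 1];
                    if (nb != EMPTY) m = std::min(m, nb);
                }
                if (m < c) { label[i * nCols + j] = m; changed = 1; }
            }
        }

        // Iterate until no process changed any label.
        MPI_Allreduce(&changed, &globalChanged, 1, MPI_INT, MPI_LOR,
                      MPI_COMM_WORLD);
    }
}

A 2D partitioning follows the same pattern, except that each process exchanges ghost cells with up to four neighbors (and would also need corner ghost cells if you adopt the 9-point stencil).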
It is important that you have dedicated access to the nodes you've chosen to run on. Observe the following protocol to help improve the consistency of your results. If you see that others are on the machine, use nodes that appear unoccupied. The ganglia command can help identify the nodes that aren't being used:

ganglia load_one | sort -n -k 2

Use nodes with load averages close to zero. Stick with those nodes and make all your timing runs from a single script; this way, if someone sees that a node is busy, they'll stay away for the duration of your runs. Recheck node availability each time you begin a run.
Include two appendices. The first should contain performance data for the clustering algorithm on one processor, as well as your scaling experiments. Any plotted data should also be included in tabular form. In the second appendix, submit a listing of your software (only in electronic form, please).
Your writeup should discuss decisions you made in parallelizing the algorithm. Present a clear evaluation of performance, including bottlenecks of the implementation. Describe any special coding or tuned parameters. What factors limit performance?
Verify correctness; in particular, you must show that results are independent of the number of processors. To do this, make several runs with different random number seeds at a fixed value of N, showing that the average number of clusters (and the average cluster size) is independent of P. Do not attempt to compare raw output, as labels may vary with the number of processors. (The provided code outputs results for small clusters, but this is only a convenience.)