|20-Oct-06||Installed a new version of the serial code in the A3 directory, including some minor bug fixes discussed on the web board. If you are using SPRNG together with MPI, you must define USE_MPI.|
|28-Oct-06||Some of you may have noticed in A3 that the convergence rate was sensitive to the number of processors, and perhaps even to the processor geometry. Thus, parallel efficiency may not be a reliable measure of performance. This is common in algorithms where convergence is sensitive to how the problem is subdivided. It even suggests a better serial algorithm: one that reorders the updates to mimic the ordering used by the parallel algorithm. If you observe that the convergence rate is sensitive to task subdivision, you can either improve the serial algorithm or account for machine and numerical efficiencies separately. To do the latter, plot the convergence rate as a function of P and N. Also plot the grind rate, computed by dividing the total number of updates by the total running time. This accounts for convergence, since total # updates = # sweeps × total points.|
In this laboratory you'll improve the connected component algorithm you implemented in the last assignment. You may use any strategy you wish, though it is suggested you use a method with two phases: (1) label local clusters, (2) relabel the local clusters by exchanging labels globally. Phase (1) does not require any communication.
The diffusion method provided to you in the last assignment converges slowly. The running time varies according to the complexity of the clusters. One way to characterize complexity is the diameter measure: the maximum path length between any two points in a cluster. It can be seen that the time to converge increases with the diameter. One approach to coping with this problem is to employ a union-find algorithm; see, for example, Suzuki et al. and Wu et al., the latter of which discusses a variant of the Suzuki algorithm. Another approach is to employ depth-first search, for example Dewar and Harris' "ants in the labyrinth" algorithm. These are direct methods that perform just one sweep over the local mesh.
However you choose to handle local clustering, you'll want an efficient union-find algorithm to quickly recognize connected clusters. This is discussed in the Suzuki and Wu papers. (For a more detailed discussion, see "Two linear time Union-Find strategies for image processing", by Fiorio and Gustedt, Theor. Comput. Sci., 154(2):165-181 (1996).)
There are various strategies for the second phase of clustering. The simplest approach is to have processors repeatedly exchange labels on their border regions, updating local labels until there are no more changes. Another approach is to collect all the labels along the border regions on one process, perform the relabeling there, and then broadcast the updated labels to the other processes which subsequently complete the relabeling process locally. In both cases, you'll notice that as you increase the number of processors, the cost of the global relabeling will increase. Fink et al., and the references contained therein, describe an approach to coping with this problem.
Compared to the diffusion algorithm used in the last assignment, you may observe dramatically different program behavior with respect to the occupation probability p. Discuss this phenomenon and illustrate it with performance measurements. You may find that labeling time has improved to the extent that you will need to attempt much larger runs!
Report both weak and strong scaling results; that is, let N² grow linearly with P (weak scaling) and hold N fixed as P varies (strong scaling). Run on 4 and 8 processors. Report the optimal processor geometry for each number of processors, and present results using that geometry only. Explain your results. Do you observe any trends as you vary N and p? For a fixed value of P, how does the running time scale as a function of N? For a fixed value of N, how does the running time scale as a function of P? Consider both overhead (i.e. communication or serial bottlenecks) and convergence effects in your analysis. Where are the bottlenecks that compromise scalability?
Repeat the runs several times to ensure that your statistics are representative. To do this, first perform 5 runs with different random seeds (SPRNG allows you to do this automatically). Then double the number of runs until the averaged quantities converge: define convergence as the point where doubling the number of runs changes the averages by no more than a few percent.
Verify correctness; in particular, you must show that the results are independent of the number of processors. To do this, make several runs with different random number seeds for a fixed value of N, showing that the average number of clusters (and the average cluster size) is independent of P.
If you have the time, implement the 3D algorithm.
Don't include code listings in your writeup, but do include pseudocode illustrating the principles of operation of the algorithm your code implements. Also include plots and tabular data.
Turn in an electronic copy of your report, including code, plot files, and tabular data.