Date | Description

19-Oct-06 | Original posting |

20-Oct-06 | Installed a new version of the serial code in the A3 directory, including some minor bug fixes discussed on the web board. If you are using SPRNG together with MPI, you must assert USE_MPI. |

28-Oct-06 | Some of you may have noticed in A3 that the convergence rate was sensitive to the number of processors and perhaps even to the processor geometry. Thus, parallel efficiency may not be a reliable measure of performance. This is common in algorithms where convergence is sensitive to how the problem is subdivided. This even suggests a better serial algorithm that reorders the updates in a way that mimics that of the parallel algorithm. If you are observing that the convergence rate is sensitive to task subdivision, you can either improve the serial algorithm or separately account for machine and numerical efficiencies. To do the latter, plot the convergence rate as a function of P and N. Also plot the grind rate, which is computed by dividing the total number of updates by the total running time. This takes account of the convergence, since the total # updates = # sweeps * total points. |


In this laboratory you'll improve the connected component algorithm you implemented in the last assignment. You may use any strategy you wish, though it is suggested that you use a method with two phases: (1) label local clusters, (2) relabel the local clusters by exchanging labels globally. Phase (1) does not require any communication.
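One way to sketch the two-phase strategy in pseudocode (the labeling scheme shown is just one possibility):

```
phase 1 (local, no communication):
    run a labeling pass over the local subgrid
    assign each local cluster a label unique across processors
        (e.g. local_id * P + rank)

phase 2 (global relabeling):
    exchange labels along subgrid borders
    record equivalences between labels that meet at a border
    reduce each equivalence class to one canonical label
    apply the canonical labels to the local subgrid
```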

The diffusion method provided to you in the last assignment
converges slowly, and its running time varies
with the complexity of the clusters. One way to characterize
complexity is to use the *diameter* measure, the
maximum path length between any two points in a cluster:
the number of sweeps needed to converge grows with the diameter.
One approach to coping with this problem is to
employ a union-find algorithm; see, for example,
Suzuki et al. and
Wu et al.,
the latter of which discusses a variant of the Suzuki algorithm.
Another approach is to employ
depth-first search, for example
Dewar and Harris'
"ants in the labyrinth" algorithm.
These are direct methods that
perform just one sweep over the local mesh.

However you choose to handle local clustering, you'll want to use
an efficient union-find algorithm to quickly
recognize connected clusters. This is discussed in the
Suzuki and Wu papers.
(For a more detailed discussion, see
"Two linear time Union-Find strategies for image processing",
by Fiorio and Gustedt,
*Theor. Comput. Sci.,* **154**(2):165-181 (1996).)

There are various strategies for the second phase of clustering. The simplest approach is to have processors repeatedly exchange labels on their border regions, updating local labels until there are no more changes. Another approach is to collect all the labels along the border regions on one process, perform the relabeling there, and then broadcast the updated labels to the other processes, which then complete the relabeling locally. In both cases, you'll notice that as you increase the number of processors, the cost of the global relabeling increases. Fink et al., and the references contained therein, describe an approach to coping with this problem.

Compared to the diffusion algorithm
used in the last assignment, you may observe
dramatically different program behavior
with respect to the independent probability *p*.
Discuss this phenomenon and illustrate
with performance measurements.
You may find that labeling time has improved to the extent that you will need to attempt much larger runs!

Report both weak and strong scaling results; that is,
let N^{2} grow linearly with P (weak scaling) and remain fixed
as P varies (strong scaling).
Run on 4 and 8 processors. Report the optimal processor geometry
for each number of processors, and present
results using that geometry only.
Explain your results.
Do you observe any trends as you vary N and p?
For a fixed value of P, how does the running time scale as
a function of N?
For a fixed value of N, how does the running time scale as
a function of P?
Consider both
overhead (i.e. communication or
serial bottlenecks) and convergence effects in your analysis.
Where are the bottlenecks that compromise scalability?

Repeat the runs several times to ensure that your statistics are representative. To do this, first perform 5 runs with different random seeds (SPRNG allows you to do this automatically), then double the number of runs until the averaged quantities converge. Define convergence in terms of the relative change upon doubling the number of existing runs: stop when the change is no more than a few percent.

Verify correctness; in particular, you must show that the results are independent of the number of processors. To do this, make several runs with different random number seeds for a fixed value of N, showing that the average number of clusters (and the average cluster size) is independent of P.

If you have the time, implement the 3D algorithm.

Don't include code listings in your writeup, but do include pseudocode illustrating the principles of operation of your algorithm implemented by your code. Also include plots and tabular data.

Turn in an electronic copy of your report, including code, plot files, and tabular data.

Copyright © 2006 Scott B. Baden. Thu Oct 19 10:40:47 PDT 2006