Work sharing and data decomposition

Partitioning is the process of dividing up a computation among processors. There are often many ways of doing this, and the appropriate scheme depends on our application's requirements as well as on the hardware. Architectural factors include the granularity of communication and whether or not memory is shared. We ask three questions in determining the general partitioning requirements of an application:

In the finest-grain execution, at most a few CPs occur between synchronization points. SPMD programs generally employ a much coarser granularity, on the order of millions or even billions of CPs. It is very important that the granularity of computation match the speed of the processor interconnect. If the interconnect is slow, then the granularity must be large enough to offset the cost of communication.

The granularity selection process is affected by considerations of locality. We improve locality when we make the grains larger, and we disrupt locality when we make them smaller. However, as we will see later on, excessive increases in granularity can introduce severe load imbalance, where only a few processors effectively carry the workload.

A large class of applications is organized around a mesh, and entails updating each point of the mesh as a function of its nearest neighbors. Moreover, the work required to compute each element of the solution (a single point [i,j]) is constant. Consider a 2D image smoothing application, where each element Y[i,j] is a function of the neighbors of X[i,j]:

    for i = 0 : N-1
        for j = 0 : N-1
            Y[i,j] = (X[i-1,j] + X[i+1,j] + X[i,j-1] + X[i,j+1])
        end
    end

For such an application we may employ a *uniform partitioning*, which
we've previously discussed.
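As a reminder of what a uniform partitioning looks like in one dimension, here is a minimal sketch. The function name `block_range` and the rule for spreading leftover elements over the lowest-numbered processors are assumptions made for illustration, not something fixed by these notes.

```python
def block_range(n, nprocs, rank):
    """Half-open index range [lo, hi) owned by `rank` under a uniform (block) partitioning.

    Assumption: when nprocs does not divide n, the first (n % nprocs)
    processors each take one extra element.
    """
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


if __name__ == "__main__":
    # 10 mesh rows over 4 processors
    for rank in range(4):
        print(rank, block_range(10, 4, rank))
```

For a 2D mesh the same function can be applied independently to each coordinate.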

Consider the following loop

    for i = 0 : N-1
        if (x[i] > 0)
            x[i] = sqrt(x[i]);
        end
    end

with the distribution of negative numbers in x[ ] as follows:

    +----------+----------+----------+----------+
    |    <0    |    <0    |    <0    |    >0    |
    +----------+----------+----------+----------+
         p0         p1         p2         p3

The load is carried by only one processor, and therefore no parallel speedup is realized. We say that the workloads are *imbalanced*.

In general, the distribution of negative values of x[i] may not be known *a
priori* (in advance) and may also change dynamically. There are two ways of
dealing with this problem:

    +---------------------------------+
    |012301230123....                 |
    +---------------------------------+

Cyclic partitioning can evenly assign the work so long as the pattern of
negative and positive values of `x[ ]` doesn't correlate with the
repeating pattern of the wrapping strategy. It would fail, for example, if
every 4th value of the array in our square root calculation were negative
and we wrapped over 4 processors.
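To make the comparison concrete, the following sketch counts how many square roots each processor evaluates under a block and a cyclic assignment. The helper names and the small test arrays are assumptions chosen for illustration.

```python
import math


def work_per_processor(x, nprocs, owner):
    """Count the sqrt evaluations each processor performs in the square root loop."""
    counts = [0] * nprocs
    for i, value in enumerate(x):
        if value > 0:
            counts[owner(i)] += 1
    return counts


def block_owner(n, nprocs):
    """Owner function for a uniform (block) partitioning of n elements."""
    chunk = math.ceil(n / nprocs)
    return lambda i: i // chunk


def cyclic_owner(nprocs):
    """Owner function for a cyclic (wrapped) partitioning."""
    return lambda i: i % nprocs


if __name__ == "__main__":
    N, P = 16, 4
    # First three quarters negative, last quarter positive.
    x = [-1.0] * 12 + [4.0] * 4
    print(work_per_processor(x, P, block_owner(N, P)))   # [0, 0, 0, 4]
    print(work_per_processor(x, P, cyclic_owner(P)))     # [1, 1, 1, 1]

    # Adversarial case: every 4th value is negative, which correlates with P = 4.
    y = [-1.0 if i % 4 == 0 else 4.0 for i in range(N)]
    print(work_per_processor(y, P, cyclic_owner(P)))     # [0, 4, 4, 4]
```

In the adversarial case processor 0 receives every negative value and sits idle while the others work.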

We use the notation `CYCLIC` to denote a cyclic decomposition. Cyclic
mappings also apply to higher dimensional arrays, e.g. `(CYCLIC, CYCLIC)`:

    -------------------------------
    |  P0 P1  |  P0 P1  |  P0 P1  |
    |  P2 P3  |  P2 P3  |  P2 P3  |
    -------------------------------
    |  P0 P1  |  P0 P1  |  P0 P1  |
    |  P2 P3  |  P2 P3  |  P2 P3  |
    -------------------------------

`(*, CYCLIC)`, and so on:

    -------------------------------
    |  P0 P1  |  P2 P3  |  P0 P1  |
    |  P0 P1  |  P2 P3  |  P0 P1  |
    |  P0 P1  |  P2 P3  |  P0 P1  |
    |  P0 P1  |  P2 P3  |  P0 P1  |
    -------------------------------
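The two mappings pictured above can be written as owner functions. This is a sketch; the function names and the row-major numbering of the processor grid are assumptions made for illustration.

```python
def cyclic_cyclic_owner(i, j, prow, pcol):
    """(CYCLIC, CYCLIC) on a prow x pcol processor grid, numbered row-major (assumed)."""
    return (i % prow) * pcol + (j % pcol)


def star_cyclic_owner(i, j, nprocs):
    """(*, CYCLIC): rows are not distributed; columns are dealt out cyclically."""
    return j % nprocs


def owner_map(nrows, ncols, owner):
    """Table of owning processor numbers for an nrows x ncols array."""
    return [[owner(i, j) for j in range(ncols)] for i in range(nrows)]


if __name__ == "__main__":
    for row in owner_map(4, 6, lambda i, j: cyclic_cyclic_owner(i, j, 2, 2)):
        print(row)
    print()
    for row in owner_map(4, 6, lambda i, j: star_cyclic_owner(i, j, 4)):
        print(row)
```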

Wrapping can incur excessive communication overhead in computations where each element depends on its neighbors. For example:

    for i = 1 : N-1
        if (x[i] < 0)
            y[i] = x[i] - x[i-1];
        else
            y[i] = sqrt(x[i]) - x[i-1];
        end
    end

Observe that `x[i]` and `x[i-1]` will lie on separate
processors for every value of `i`. Compare this with regular partitioning,
where `x[i]` and `x[i-1]` lie on separate processors only 1 time in
`N/P`, that is, for only the leftmost element of each partition, which
contains `N/P` values. In practice we can expect `N >> P`, so
communication is far less costly than for wrapped partitioning.
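The difference is easy to count. The sketch below tallies the iterations in which `x[i-1]` lives on a different processor than `x[i]`. The function name and the sizes are assumptions for illustration, and `P` is assumed to divide `N`.

```python
def remote_neighbor_reads(n, owner):
    """Iterations i in 1..n-1 for which x[i-1] is owned by a different processor than x[i]."""
    return sum(1 for i in range(1, n) if owner(i) != owner(i - 1))


if __name__ == "__main__":
    N, P = 1024, 4
    print(remote_neighbor_reads(N, lambda i: i % P))         # cyclic: 1023
    print(remote_neighbor_reads(N, lambda i: i // (N // P)))  # block:  3
```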

We refer to the difficulty with wrapped partitioning as a
*surface-to-volume* effect: the surface of each partition
represents the amount of communication that must be done, and the volume
represents the work. With wrapped partitioning the ratio is 1:1, whereas with
regular partitioning the surface to volume ratio is `1:N/P`.

We may reduce the surface to volume ratio by *chunking*: we apply the
wrap pattern over larger units of work, wrapping K consecutive elements instead
of just 1. This is shown for the case of K=2 on 4 processors, which is
designated as `CYCLIC(2)`:

    +---------------------------------+
    |0011223300112233....             |
    +---------------------------------+
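A `CYCLIC(K)` mapping can be sketched as a one-line owner function. The name `block_cyclic_owner` is an assumption for illustration.

```python
def block_cyclic_owner(i, k, nprocs):
    """Owner of element i under CYCLIC(k): chunks of k consecutive elements are wrapped."""
    return (i // k) % nprocs


if __name__ == "__main__":
    # CYCLIC(2) on 4 processors
    print("".join(str(block_cyclic_owner(i, 2, 4)) for i in range(16)))  # 0011223300112233
```

With `k = 1` this reduces to plain `CYCLIC`. With chunks of `k` elements, only one neighbor access in `k` crosses a processor boundary.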

There is, however, a limit to how far this technique can be taken, for as the
chunks get larger, so does load imbalance. This *granularity tradeoff* is
ubiquitous in load balancing, and is shown in the following graph: as we
decrease granularity we improve load balance, but at the cost of disrupting
locality.

      decreased load            increased load
      imbalance, but            imbalance, but
      increased comm.           decreased comm.
         |
         |\                             /
    time | \                           /
         |  \                         /
         |   \_                     _/
         |     \__               __/
         |        \___       ___/
         |            \_____/
         |
         +-----------------------------------
                         ^        chunk size
                      optimal

Optimal chunk size balances communication overhead against load imbalance
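The two competing effects can be measured on a toy workload. The sketch below is only a count of work and boundary crossings, not a timing model. The workload (all of the work clustered in the last quarter of the array) and the function name `sweep` are assumptions chosen for illustration.

```python
def sweep(work, nprocs, chunk_sizes):
    """For each chunk size K return (K, heaviest processor load, remote neighbor reads)."""
    n = len(work)
    rows = []
    for k in chunk_sizes:
        owner = [(i // k) % nprocs for i in range(n)]   # CYCLIC(K) mapping
        load = [0] * nprocs
        for i, w in enumerate(work):
            load[owner[i]] += w
        remote = sum(1 for i in range(1, n) if owner[i] != owner[i - 1])
        rows.append((k, max(load), remote))
    return rows


if __name__ == "__main__":
    work = [0] * 48 + [1] * 16   # 64 elements, work clustered at the end
    for k, max_load, remote in sweep(work, 4, [1, 2, 4, 8, 16]):
        print(f"K={k:2d}  max load={max_load:2d}  remote reads={remote:2d}")
```

For this workload the heaviest load stays at 4 units up to K=4 while the remote reads fall from 63 to 15. Beyond that the load on the busiest processor doubles with each doubling of K.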

In some cases load sharing with wrapped partitioning can be ineffective
because of the high communication overheads. On a shared memory architecture one
can instead employ *processor self-scheduling*, in which processors access a
shared structure like a counter or queue for work assignments.
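A minimal sketch of self-scheduling with a shared counter follows. It uses Python threads only to illustrate the protocol (claim the next chunk under a lock, then work on it without the lock); the function names and the chunk size are assumptions, and Python threads will not show real speedup for this kind of loop.

```python
import math
import threading


def self_schedule(items, nworkers, chunk, task):
    """Workers repeatedly claim the next `chunk` indices from a shared counter."""
    next_index = 0
    lock = threading.Lock()
    results = [None] * len(items)
    claimed = [[] for _ in range(nworkers)]   # ranges claimed by each worker

    def worker(rank):
        nonlocal next_index
        while True:
            with lock:                         # the only synchronized step
                lo = next_index
                next_index = min(lo + chunk, len(items))
            if lo >= len(items):
                return
            hi = min(lo + chunk, len(items))
            for i in range(lo, hi):
                results[i] = task(items[i])
            claimed[rank].append((lo, hi))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, claimed


def sqrt_if_positive(v):
    """Body of the square root loop."""
    return math.sqrt(v) if v > 0 else v


if __name__ == "__main__":
    data = [float(v) for v in range(-5, 20)]
    results, claimed = self_schedule(data, 4, 3, sqrt_if_positive)
    print(claimed)
```

The chunk size plays the same role as K in chunked wrapping: larger chunks mean fewer trips to the shared counter but coarser load balancing.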

Load sharing is a passive technique in that no attempt is made to actively respond to workload variations. An alternative is to actively measure workload variations and to partition the workloads non-uniformly so that each processor gets a fair piece of the work. The pieces of work assigned to processors have different sizes according to the spatial density distribution of the work. In effect this is an off-line algorithm.

In general non-uniform partitioning will only work if we have a good idea of
the workload distribution. If the distribution changes with time, then we may
have to periodically repartition the work. In most physical problems, the
solution changes gradually enough that the cost of repartitioning won't be too
great.

Non-uniform partitioning strategies are numerous, and we'll discuss
one common approach:

*Orthogonal Recursive Bisection*

Orthogonal Recursive Bisection (ORB), also known as Recursive
Coordinate Bisection (RCB), is a useful non-uniform partitioning strategy
(M. J. Berger, S. H. Bokhari: "A Partitioning Strategy for
Nonuniform Problems on Multiprocessors." *IEEE Transactions on Computers* **36**(5): 570-580 (1987)).

It works by splitting the computation into two equal parts (of work, not space), and recursively splitting each part until there is one part per processor. ORB also works when the number of processors is not a power of two, and it can be applied to multidimensional problems.

Consider the following loop:

    for (i = 1; i < N-1; i++)
        if (x[i] < 0)
            y[i] = f1( x[i], x[i-1] );
        else
            y[i] = f2( x[i], x[i-1] );

Assume that the work per element, which depends on the branch taken, is 1 or 2 units and is distributed as follows:

    1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1

We split the above array at the point where the sum of the elements on either side of the split is as equal as possible.

    1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
                 ^
                 |
             Cut here

If we are running on 4 processors, then we split again, once for each part:

    1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
           ^     ^           ^
           |     |           |
       Cut #2 Cut #1      Cut #3

We now have 4 partitions with 5, 6, 6, and 5 units of work respectively.
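The 16-element example can be reproduced with a short recursive sketch. The function name `orb_1d`, the proportional split used when the processor count is odd, and the rule of preferring the later cut on ties are all assumptions made for illustration; the tie rule was chosen so that the sketch reproduces the cuts shown above.

```python
def orb_1d(work, nprocs, lo=0, hi=None):
    """Recursively bisect work[lo:hi] into nprocs contiguous ranges of near-equal work.

    Returns a list of half-open (start, end) index ranges, one per processor.
    Assumes the range holds at least nprocs elements.
    """
    if hi is None:
        hi = len(work)
    if nprocs == 1:
        return [(lo, hi)]
    p_left = nprocs // 2
    p_right = nprocs - p_left
    # Work the left part should receive; proportional when nprocs is odd.
    target = sum(work[lo:hi]) * p_left / nprocs
    best_cut, best_err = None, None
    # Leave at least one element per processor on each side of the cut.
    for cut in range(lo + p_left, hi - p_right + 1):
        err = abs(sum(work[lo:cut]) - target)
        if best_err is None or err <= best_err:   # ties go to the later cut (assumed)
            best_cut, best_err = cut, err
    return orb_1d(work, p_left, lo, best_cut) + orb_1d(work, p_right, best_cut, hi)


if __name__ == "__main__":
    work = [1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1]
    parts = orb_1d(work, 4)
    print(parts)                                   # [(0, 4), (4, 7), (7, 13), (13, 16)]
    print([sum(work[a:b]) for a, b in parts])      # [5, 6, 6, 5]
```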

ORB reduces the higher dimensional partitioning problem to a set of simpler 1-dimensional partitioning problems, successively partitioning along orthogonal coordinate directions. Generally the cutting direction is varied at each level of the recursion to avoid elongated partitions that could lead to poorly-balanced workloads.
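A two-dimensional version might look like the following sketch, which collapses the current rectangle to a 1-D work profile along the cutting direction, bisects it, and alternates direction at each level. The function name, the rectangle representation, and the tie rule (earliest cut) are assumptions for illustration, and the sketch assumes every rectangle is large enough to be cut.

```python
def orb_2d(work, nprocs, rows=None, cols=None, axis=0):
    """Partition a 2-D work array into nprocs rectangles (r0, r1, c0, c1), half-open."""
    if rows is None:
        rows, cols = (0, len(work)), (0, len(work[0]))
    (r0, r1), (c0, c1) = rows, cols
    if nprocs == 1:
        return [(r0, r1, c0, c1)]
    # Collapse the rectangle to a 1-D profile along the cutting direction.
    if axis == 0:
        profile = [sum(work[r][c0:c1]) for r in range(r0, r1)]
        base = r0
    else:
        profile = [sum(work[r][c] for r in range(r0, r1)) for c in range(c0, c1)]
        base = c0
    p_left = nprocs // 2
    target = sum(profile) * p_left / nprocs
    best_cut, best_err, prefix = None, None, 0
    for k in range(1, len(profile)):
        prefix += profile[k - 1]
        err = abs(prefix - target)
        if best_err is None or err < best_err:
            best_cut, best_err = base + k, err
    nxt = 1 - axis   # alternate the cutting direction
    if axis == 0:
        return (orb_2d(work, p_left, (r0, best_cut), cols, nxt)
                + orb_2d(work, nprocs - p_left, (best_cut, r1), cols, nxt))
    return (orb_2d(work, p_left, rows, (c0, best_cut), nxt)
            + orb_2d(work, nprocs - p_left, rows, (best_cut, c1), nxt))


def rect_load(work, rect):
    """Total work inside a rectangle."""
    r0, r1, c0, c1 = rect
    return sum(sum(work[r][c0:c1]) for r in range(r0, r1))


if __name__ == "__main__":
    lumpy = [[1, 1, 1, 1],
             [1, 1, 1, 1],
             [3, 3, 1, 1],
             [3, 3, 1, 1]]
    rects = orb_2d(lumpy, 4)
    print(rects)
    print([rect_load(lumpy, r) for r in rects])   # [4, 4, 6, 10]
```

On this small mesh the loads come out as 4, 4, 6, and 10: because each cut must follow a whole row or column, the balance is only as good as the granularity of the profile allows.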

The following figure shows some relatives of ORB:

**The Orthogonal Sectioning family of partitionings, from left to right: (a) ORB, (b) ORB-H, (c) rectilinear, and (d) ORB-MM.**

In some cases we may want to section a coordinate
into more than two pieces.
ORB-H, or hierarchical ORB, is a technique that employs *multi-sectioning*:
it splits the problem into `p > 2` strips and recurses on each strip.
This technique is also referred to as **multilevel ORB**.

The principal disadvantage of the orthogonal sectioning algorithms is the
geometric constraint that all cuts must be straight-edged.
Hence, a cut in `d` dimensions introduces a workload imbalance
that is carried by a `d-1` dimensional hyper-plane.
A way around the difficulty is to section the dividing hyper-plane, too;
this strategy is called ORB-MM, or ORB-"Median of Medians."
(Shown above.)
However, ORB-MM generates highly irregular partitionings
in three or more space dimensions, which for all practical purposes
are unstructured. Since unstructured partitionings
do not have a compact representation, they require additional software
bookkeeping to manage the non-rectangular iteration spaces that appear
in application software.
In addition, distributed mapping tables must be maintained on message
passing architectures, at the cost of additional communication overheads
and a further increase in the application's complexity.

As we mentioned earlier, computations often exhibit temporal locality, whereby the solution changes gradually, for example, due to a time-step constraint. An incremental approach to orthogonal sectioning may effectively exploit temporal locality by computing work gradients across the partition boundaries and shifting the boundaries accordingly. However, complications arise when a boundary moves, since all other boundaries introduced later on in the recursion will be affected. In effect, we trade off temporal locality against spatial locality, and the trade-off may not be favorable.

An approach that exploits temporal locality without the drawback of disrupting spatial locality is Nicol's rectilinear partitioning strategy. (Shown above.) This strategy avoids the difficulty of awkward workload adjustments by imposing a fixed connectivity on the partitionings. Partitionings are formed by the tensor product of strips taken across orthogonal coordinate directions. Load balance is improved by iteratively adjusting cuts independently along each problem dimension as necessary.

Inverse space-filling partitioning (ISP) is an alternative technique
that shares the desirable fine-grained load balancing characteristic
of ORB-MM but with simplified bookkeeping. ISP is fast (it does not
incur the heavy time penalties of optimization techniques) and is
therefore appropriate for dynamic problems. ISP works by drawing a Hamiltonian
path through the mesh and subdividing the resultant 1-d path.
This path is also referred to as an Inverse Spacefilling
curve, from which the method "Inverse Spacefilling Partitioning"
takes its name.
Because ISP partitionings are logically 1-dimensional,
they are easier to manipulate than the unstructured partitionings
rendered by strategies such as ORB-MM.
(If you are interested in learning more about these techniques, see
"Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling
Curves," by J. R. Pilkington and S. B. Baden,
*IEEE Trans. on Parallel and Distributed Systems*,
**7**(3), March 1996, pp. 288-300.)
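The idea can be sketched with a Hilbert curve standing in for the Hamiltonian path. This is an illustration under stated assumptions, not the algorithm of the cited paper: the choice of a Hilbert curve, the square power-of-two mesh, the function names, and the rule for cutting the 1-d path (by where each cell's midpoint falls in the cumulative work) are all assumptions.

```python
def hilbert_d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid, n a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def isp_partition(work, nprocs):
    """Order the cells of work[y][x] along the curve, then cut the 1-d path by work.

    Returns (owner, path): owner[y][x] is the processor for each cell and
    path lists the cells in curve order. Assumes total work is positive.
    """
    n = len(work)
    path = [hilbert_d2xy(n, d) for d in range(n * n)]
    total = sum(work[y][x] for x, y in path)
    owner = [[0] * n for _ in range(n)]
    prefix = 0
    for x, y in path:
        w = work[y][x]
        # Assign by where the midpoint of this cell's work falls in the running total.
        owner[y][x] = min(nprocs - 1, int((prefix + w / 2) * nprocs / total))
        prefix += w
    return owner, path


if __name__ == "__main__":
    work = [[1, 1, 1, 1],
            [1, 1, 1, 1],
            [3, 3, 1, 1],
            [3, 3, 1, 1]]
    owner, _ = isp_partition(work, 4)
    for row in owner:
        print(row)
```

Because the partitioning is just a set of cut points along one path, rebalancing amounts to moving those cut points.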

Software support for irregular decompositions is usually found in an application library. In addition to the decomposition utilities, we will also need some help in handling communication in message passing implementations, because processors need to determine which neighbors they will communicate with. Data will generally be non-contiguous, and must be packed into contiguous messages for transmission and unpacked on the receiving side. One way of dealing with this is to use a library like KeLP.
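The packing step itself is simple to picture. The sketch below copies one column of a row-major 2-D array (non-contiguous in memory in a language like C) into a contiguous buffer and back out on the receiving side. The function names are hypothetical and the actual send and receive are omitted; this does not use or depict KeLP's interface.

```python
def pack_column(grid, col):
    """Gather one column of a row-major 2-D array into a contiguous buffer."""
    return [row[col] for row in grid]


def unpack_column(grid, col, buffer):
    """Scatter a received buffer into one column of the local array."""
    for row, value in zip(grid, buffer):
        row[col] = value


if __name__ == "__main__":
    sender = [[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]]
    receiver = [[0, 0, 0] for _ in range(3)]
    message = pack_column(sender, 2)      # sender's right edge
    unpack_column(receiver, 0, message)   # receiver's left ghost column
    print(message)                        # [3, 6, 9]
    print(receiver)                       # [[3, 0, 0], [6, 0, 0], [9, 0, 0]]
```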

Copyright © 2010 Scott B. Baden

Maintained by baden @ ucsd.edu [Mon Mar 1 10:40:02 PST 2010]