Partitioning is the process of dividing up a computation among processors. There are often many ways of doing this, and the appropriate scheme depends on our application's requirements as well as the hardware. Architectural factors include the granularity of communication, and whether or not memory is shared. We ask three questions in determining the general partitioning requirements of an application:

Although managing locality is an important part of building a high
performance parallel program, **task granularity** is significant, too. The two
issues are related. Task granularity designates the amount of "useful"
work done between synchronization points or between successive message
transmissions. We usually quantify granularity in terms of the number of
processor cycles (CPs) or the amount of floating point work that can be done
between consecutive communication points or barrier synchronizations.

In the finest grain execution, only a few CPs occur between synchronization points. SPMD programs generally employ a much coarser granularity, on the order of millions or even billions of CPs. It is very important that the granularity of computation match the speed of the processor interconnect: if the interconnect is slow, then the granularity must be large enough to offset the cost of communication.
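To make the relationship between grain size and interconnect speed concrete, here is a rough sketch under a deliberately simple model: each grain performs `grain_cycles` of useful work and then pays a fixed `comm_cycles` of communication, with no overlap between the two. The model and both function names are assumptions made for illustration only.

```python
def efficiency(grain_cycles, comm_cycles):
    """Fraction of time spent on useful work (assumed non-overlapping model)."""
    return grain_cycles / (grain_cycles + comm_cycles)


def min_grain(comm_cycles, target):
    """Smallest grain (in cycles) that reaches the target efficiency.

    Derived from g / (g + c) >= t, i.e. g >= t * c / (1 - t).
    """
    return target * comm_cycles / (1.0 - target)


if __name__ == "__main__":
    # A slower interconnect (larger comm_cycles) demands a larger grain.
    for comm in (100, 10_000, 1_000_000):
        print(comm, round(min_grain(comm, 0.9)))
```

Under this model, a grain equal to the communication cost gives 50% efficiency, and reaching 90% efficiency requires a grain about nine times the communication cost.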

The choice of granularity is also affected by considerations of locality: we improve locality when we make the grains larger, and we disrupt locality when we make the grains smaller. However, as we will see later on, excessive increases in granularity can introduce severe load imbalance, where only a few processors effectively carry the workload.

A large class of applications is organized around a mesh, and entails updating each point of the mesh as a function of its nearest neighbors. Moreover, the work required to compute each element of the solution (a single point [i,j]) is constant. Consider a 2D image smoothing application, where each element Y[i,j] is a function of the values of X that neighbor [i,j]:

    for i = 0 : N-1
        for j = 0 : N-1
            Y[i,j] = (X[i-1,j] + X[i+1,j] + X[i,j-1] + X[i,j+1])
        end
    end

For such an application we may employ a *uniform partitioning*,
which we've previously discussed.
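As a reminder of what a uniform partitioning looks like in practice, the sketch below computes the contiguous block of mesh rows owned by each processor. The function name `block_range` and the policy of giving leftover rows to the lowest-numbered processors are assumptions chosen for illustration.

```python
def block_range(n, p, rank):
    """Half-open range [lo, hi) of the n rows owned by processor `rank` of p.

    When p does not divide n, the first (n mod p) processors get one extra row.
    """
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


if __name__ == "__main__":
    for rank in range(4):
        print(rank, block_range(10, 4, rank))
```

Because the work per mesh point is constant, equal-sized blocks carry equal work, which is why this simple scheme suffices for the smoothing example.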

Consider the following loop

    for i = 0 : N-1
        if (x[i] > 0) x[i] = sqrt(x[i]);
    end

with the distribution of negative numbers in x[ ] as follows:

    +----------+----------+----------+----------+
    |    <0    |    <0    |    <0    |    >0    |
    +----------+----------+----------+----------+
         p0         p1         p2         p3

The load is carried by only one processor, and therefore no parallel speedup is realized. We say that the workloads are *imbalanced*.

In general, the distribution of negative values of x[i] may not be known *a
priori* (in advance) and may also change dynamically. There are two ways of
dealing with this problem: load sharing and non-uniform partitioning.

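One form of load sharing is cyclic (or wrapped) partitioning, which deals the array elements out to the processors in round-robin fashion, shown here for 4 processors:

    +---------------------------------+
    |012301230123....                 |
    +---------------------------------+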

Cyclic partitioning can evenly assign the work so long as the pattern of
negative and positive values of `x[ ]` doesn't correlate with the
repeating pattern of the wrapping strategy. It would fail, for example, if
every 4th value of the array in our square root calculation were negative,
since one of the 4 processors would then receive no work.
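The following sketch counts how many square roots each processor performs under regular (block) and cyclic partitioning. The helper names and the 16-element test arrays are assumptions invented for the illustration.

```python
def block_owner(i, n, p):
    """Owner of element i when n elements are split into p contiguous blocks."""
    return i * p // n


def cyclic_owner(i, n, p):
    """Owner of element i when elements are dealt out round-robin."""
    return i % p


def work_per_processor(x, p, owner):
    """Number of iterations with real work (x[i] > 0) on each processor."""
    work = [0] * p
    for i, value in enumerate(x):
        if value > 0:
            work[owner(i, len(x), p)] += 1
    return work


if __name__ == "__main__":
    clustered = [-1.0] * 12 + [4.0] * 4      # positives only in the last quarter
    print(work_per_processor(clustered, 4, block_owner))    # [0, 0, 0, 4]
    print(work_per_processor(clustered, 4, cyclic_owner))   # [1, 1, 1, 1]
    correlated = [4.0, 4.0, 4.0, -1.0] * 4   # every 4th value is negative
    print(work_per_processor(correlated, 4, cyclic_owner))  # [4, 4, 4, 0]
```

The last case shows the correlation problem: the data pattern has the same period as the wrapping, so one processor receives none of the work.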

We use the notation `CYCLIC` to denote a cyclic decomposition. Cyclic
mappings also apply to higher dimensional arrays, e.g. `(CYCLIC, CYCLIC)`:

    -------------------------------
    |  P0 P1  |  P0 P1  |  P0 P1  |
    |  P2 P3  |  P2 P3  |  P2 P3  |
    -------------------------------
    |  P0 P1  |  P0 P1  |  P0 P1  |
    |  P2 P3  |  P2 P3  |  P2 P3  |
    -------------------------------

or `(*, CYCLIC)`, and so on:

    -------------------------------
    |  P0 P1  |  P2 P3  |  P0 P1  |
    |  P0 P1  |  P2 P3  |  P0 P1  |
    |  P0 P1  |  P2 P3  |  P0 P1  |
    |  P0 P1  |  P2 P3  |  P0 P1  |
    -------------------------------
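The two mappings pictured above can be written as small owner functions. This is a sketch only: the function names, and the row-major numbering of processors on a `prows` by `pcols` grid, are assumptions inferred from the diagrams.

```python
def cyclic_cyclic_owner(i, j, prows, pcols):
    """(CYCLIC, CYCLIC): wrap both rows and columns over a prows x pcols grid."""
    return (i % prows) * pcols + (j % pcols)


def star_cyclic_owner(i, j, p):
    """(*, CYCLIC): rows are not distributed; columns wrap over p processors."""
    return j % p


if __name__ == "__main__":
    for i in range(4):
        print([cyclic_cyclic_owner(i, j, 2, 2) for j in range(6)])
    print()
    for i in range(4):
        print([star_cyclic_owner(i, j, 4) for j in range(6)])
```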

Wrapping can incur excessive communication overhead in computations requiring interprocessor communication. For example:

    for i = 1 : N-1
        if (x[i] < 0)
            y[i] = x[i] - x[i-1];
        else
            y[i] = sqrt(x[i]) - x[i-1];
        end
    end

Observe that `x[i]` and `x[i-1]` will lie on separate
processors for every value of `i`. Compare this with regular partitioning,
where `x[i]` and `x[i-1]` lie on separate processors only 1 time in `N/P`,
that is, for only the leftmost element of the partition, which contains `N/P`
values. In practice, we can expect `N >> P`, so communication is far
less costly than for wrapped partitioning.

We refer to the difficulty with wrapped partitioning as a *surface-to-volume*
effect: the surface of each partition represents the amount of
communication that must be done, and the volume represents the work. With
wrapped partitioning the surface to volume ratio is 1:1, whereas with regular
partitioning it is `1:N/P`.
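A quick way to see the difference is to count the iterations in which `x[i]` and `x[i-1]` have different owners. The sketch below does this for both schemes; the helper names are assumptions made for the illustration.

```python
def block_owner(i, n, p):
    """Owner of element i under regular (contiguous block) partitioning."""
    return i * p // n


def cyclic_owner(i, n, p):
    """Owner of element i under wrapped (cyclic) partitioning."""
    return i % p


def remote_references(n, p, owner):
    """Count iterations i in 1..n-1 where x[i-1] lives on another processor."""
    return sum(1 for i in range(1, n) if owner(i, n, p) != owner(i - 1, n, p))


if __name__ == "__main__":
    n, p = 16, 4
    print("cyclic:", remote_references(n, p, cyclic_owner))  # every iteration
    print("block: ", remote_references(n, p, block_owner))   # one per boundary
```

With `N = 16` and `P = 4`, wrapping makes all 15 references remote, while regular partitioning makes only 3 remote (one at each interior partition boundary).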

We may reduce the surface to volume ratio by *chunking:* we apply the
wrap pattern over larger units of work. We wrap K consecutive elements instead
of just 1. This is shown for the case of K=3 on 2 processors, which is
designated as `CYCLIC(3)`:

    +--------------------------------+   processors get
    |000111000111.....               |   alternating,
    +--------------------------------+   small partitions
     |-|  chunk size
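The chunked wrap pattern amounts to a one-line owner function. In this sketch the name `block_cyclic_owner` is an assumption; `k` is the chunk size and `p` the number of processors.

```python
def block_cyclic_owner(i, k, p):
    """Owner of element i under CYCLIC(k): chunks of k elements are wrapped."""
    return (i // k) % p


if __name__ == "__main__":
    # CYCLIC(3) on 2 processors, as in the diagram above.
    print("".join(str(block_cyclic_owner(i, 3, 2)) for i in range(12)))
```

Setting `k = 1` recovers plain `CYCLIC`, and setting `k = N/P` recovers regular partitioning.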

There is, however, a limit to how far this technique can be taken, for as the
chunks get larger, so does load imbalance. This *granularity tradeoff* is
ubiquitous in load balancing, and is shown in the following graph: as we
decrease granularity we reduce load imbalance, but at the cost of disrupting
locality.

            decreased load          increased load
            imbalance, but          imbalance, but
            increased comm.         decreased comm.

    time |  \                           /
         |   \                         /
         |    \_                     _/
         |      \__               __/
         |         \___       ___/
         |             \_____/
         |
         +------------------------------------
                          ^          chunk size
                       optimal

    Optimal chunk size balances communication overhead against load imbalance
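The tradeoff can be observed on a toy problem. The sketch below sweeps the chunk size for a 16-element array whose positive values are clustered at the end, and reports the largest per-processor workload (a proxy for load imbalance) together with the number of neighbor references that cross processors (a proxy for communication). The data and the two proxies are assumptions chosen for illustration, not a performance model.

```python
def owner(i, k, p):
    """Owner of element i under CYCLIC(k) on p processors."""
    return (i // k) % p


def tradeoff(x, p, k):
    """Return (max work on any processor, number of remote x[i-1] references)."""
    work = [0] * p
    for i, value in enumerate(x):
        if value > 0:                      # only positive entries cost a sqrt
            work[owner(i, k, p)] += 1
    remote = sum(1 for i in range(1, len(x)) if owner(i, k, p) != owner(i - 1, k, p))
    return max(work), remote


if __name__ == "__main__":
    x = [-1.0] * 12 + [4.0] * 4
    for k in (1, 2, 4):
        print(k, tradeoff(x, 4, k))
```

As the chunk size grows from 1 to 4, the maximum workload rises from 1 to 4 while the remote references fall from 15 to 3, which is the two-sided curve sketched in the graph.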

In some cases load sharing with wrapped partitionings can be ineffective
because of the high communication overheads. On shared memory architectures one
can employ *processor self-scheduling*, in which processors access a
shared structure like a counter or queue for work assignments.
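Here is a minimal sketch of self-scheduling with a shared counter, applied to the square root loop. Python threads stand in for processors purely to show the structure (in CPython they will not yield a real speedup for this loop); the function name `self_schedule` and the `chunk` parameter are assumptions made for the illustration.

```python
import math
import threading


def self_schedule(x, num_workers, chunk=1):
    """Each worker repeatedly claims the next `chunk` indices from a counter."""
    n = len(x)
    next_index = 0                  # the shared counter
    lock = threading.Lock()         # protects the counter
    claims = [0] * num_workers      # chunks claimed by each worker

    def worker(rank):
        nonlocal next_index
        while True:
            with lock:              # atomically grab the next chunk
                lo = next_index
                next_index = min(n, lo + chunk)
            if lo >= n:
                return              # no work left
            for i in range(lo, min(n, lo + chunk)):
                if x[i] > 0:
                    x[i] = math.sqrt(x[i])
            claims[rank] += 1       # each worker writes only its own slot

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claims


if __name__ == "__main__":
    data = [-1.0] * 12 + [4.0] * 4
    print(self_schedule(data, 4, chunk=2), data[-4:])
```

The `chunk` parameter plays the same role as the chunk size in `CYCLIC(K)`: larger chunks mean fewer trips to the shared counter, at the risk of a coarser and less even division of the work.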

One approach is to dedicate 1 processor to the task of handing out
work. We often refer to this dedicated processor as the manager. The other
processors, the servers, wait for a message containing some input; upon receipt
of that input they compute, and then return their result to the manager when
done (or they may simply do some I/O). The manager sends out additional work as
the servers complete their tasks. When there is no more work, the manager
informs each server in turn. When the last piece of work is returned to the
manager, all exit. This approach works quite well if the tasks require no
communication (such computations are often called **embarrassingly parallel**)
or if the cost of handling a task's input and output is small compared to the
amount of work performed by the task.
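The manager protocol described above can be sketched as follows. To keep the example self-contained, threads play the processors and `queue.Queue` objects stand in for messages; in a real message passing program these would be sends and receives. The function name `manager_worker` and the use of `None` as the "no more work" message are assumptions made for the illustration.

```python
import queue
import threading


def manager_worker(tasks, num_servers, compute):
    """Hand out tasks one at a time; return results in task order."""
    to_server = [queue.Queue() for _ in range(num_servers)]
    to_manager = queue.Queue()

    def server(rank):
        while True:
            msg = to_server[rank].get()        # wait for input
            if msg is None:                    # "no more work" message
                return
            task_id, data = msg
            to_manager.put((rank, task_id, compute(data)))

    threads = [threading.Thread(target=server, args=(r,)) for r in range(num_servers)]
    for t in threads:
        t.start()

    results = [None] * len(tasks)
    next_task = 0
    outstanding = 0
    for rank in range(num_servers):            # prime each server with one task
        if next_task < len(tasks):
            to_server[rank].put((next_task, tasks[next_task]))
            next_task += 1
            outstanding += 1
    while outstanding:                         # refill servers as they finish
        rank, task_id, value = to_manager.get()
        results[task_id] = value
        outstanding -= 1
        if next_task < len(tasks):
            to_server[rank].put((next_task, tasks[next_task]))
            next_task += 1
            outstanding += 1
    for q in to_server:                        # inform each server in turn
        q.put(None)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(manager_worker(list(range(10)), 3, lambda v: v * v))
```

Because a server receives new work only when it returns a result, faster servers naturally take on more tasks, which is how the scheme shares load without measuring it.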

Load sharing is a passive technique in that no attempt is made to actively respond to workload variations. An alternative is to actively measure workload variations and to partition the workloads non-uniformly so that each processor gets a fair piece of the work. The pieces of work assigned to processors have different sizes according to the spatial density distribution of the work. In effect this is an off-line algorithm.

In general, non-uniform partitioning will only work if we have a good idea of the workload distribution. If the distribution changes with time, then we may have to periodically repartition the work. In most physical problems, the solution changes gradually enough that the cost of repartitioning won't be too great.

*Orthogonal Recursive Bisection (ORB)*,
also known as Recursive Coordinate Bisection (RCB),
is a useful non-uniform
partitioning strategy. It works by splitting the computation
into two equal parts (of work, not space), and recursively splitting
each part until done. ORB also works when the number of processors
is not a power of two, and it can be applied to multidimensional
problems.
(We will talk more about load balancing later in the course.)

Consider the following loop:

    for (i = 1; i < N-1; i++)
        if (x[i] < 0)
            y[i] = f1( x[i], x[i-1] );
        else
            y[i] = f2( x[i], x[i-1] );
        end if

Assume that the work done in each iteration, in arbitrary units, is as follows:

    1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1

We split the above array at the point where the sum of the elements on either side of the split is as equal as possible.

    1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
                 ^
                 |
              Cut here

If we are running on 4 processors, then we split again, once for each part:

    1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
           ^     ^           ^
           |     |           |
        Cut #2 Cut #1     Cut #3

We now have 4 partitions with 5, 6, 6, and 5 units of work respectively.
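The splitting procedure can be sketched in a few lines. This is an illustrative version, not a definitive implementation: the function name is an assumption, the region is assumed to contain at least as many elements as parts, and ties between equally good cuts are broken toward the later cut, an arbitrary choice that happens to reproduce the split shown above.

```python
def recursive_bisection(work, lo, hi, parts):
    """Split work[lo:hi] into `parts` contiguous pieces of roughly equal work.

    Returns a list of half-open index ranges (lo, hi).
    """
    if parts == 1:
        return [(lo, hi)]
    left_parts = parts // 2
    # The left side should carry its proportional share of the work,
    # which also handles part counts that are not powers of two.
    target = sum(work[lo:hi]) * left_parts / parts
    best_cut, best_err, prefix = lo + 1, None, 0
    for cut in range(lo + 1, hi):
        prefix += work[cut - 1]            # work in work[lo:cut]
        err = abs(prefix - target)
        if best_err is None or err <= best_err:
            best_cut, best_err = cut, err
    return (recursive_bisection(work, lo, best_cut, left_parts)
            + recursive_bisection(work, best_cut, hi, parts - left_parts))


if __name__ == "__main__":
    work = [1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1]
    pieces = recursive_bisection(work, 0, len(work), 4)
    print(pieces)
    print([sum(work[a:b]) for a, b in pieces])   # 5, 6, 6, 5
```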

ORB reduces the higher dimensional partitioning problem to a set of simpler 1-dimensional partitioning problems, successively partitioning along orthogonal coordinate directions. Generally the cutting direction is varied at each level of the recursion to avoid elongated partitions that could lead to poorly-balanced workloads.
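A two-dimensional version follows the same pattern: sum the work along one coordinate to obtain a 1-dimensional problem, cut it, and recurse on each side with the cutting direction switched. The sketch below assumes the workload is given as a small list of lists and that every region is large enough to be cut; the names `orb` and `region_work` are hypothetical.

```python
def region_work(work, r0, r1, c0, c1):
    """Total work in rows [r0, r1) and columns [c0, c1)."""
    return sum(sum(work[r][c0:c1]) for r in range(r0, r1))


def orb(work, r0, r1, c0, c1, parts, cut_rows=True):
    """Partition a rectangle into `parts` rectangles of roughly equal work."""
    if parts == 1:
        return [(r0, r1, c0, c1)]
    left = parts // 2
    # Collapse the region to 1-D along the cutting direction.
    if cut_rows:
        strip = [sum(work[r][c0:c1]) for r in range(r0, r1)]
    else:
        strip = [sum(work[r][c] for r in range(r0, r1)) for c in range(c0, c1)]
    target = sum(strip) * left / parts
    best, best_err, prefix = 1, None, 0
    for k in range(1, len(strip)):
        prefix += strip[k - 1]
        err = abs(prefix - target)
        if best_err is None or err < best_err:
            best, best_err = k, err
    # Recurse on both sides, alternating the cutting direction.
    if cut_rows:
        m = r0 + best
        return (orb(work, r0, m, c0, c1, left, False)
                + orb(work, m, r1, c0, c1, parts - left, False))
    m = c0 + best
    return (orb(work, r0, r1, c0, m, left, True)
            + orb(work, r0, r1, m, c1, parts - left, True))


if __name__ == "__main__":
    mesh = [[3, 1, 1, 1],
            [1, 1, 1, 1],
            [1, 1, 1, 1],
            [1, 1, 1, 3]]
    rects = orb(mesh, 0, 4, 0, 4, 4)
    print(rects)
    print([region_work(mesh, *r) for r in rects])
```

On this tiny mesh the four rectangles carry 4, 6, 4, and 6 units of work; the residual imbalance reflects the straight-edge constraint on such a coarse grid.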

We next discuss relatives of ORB, some of which
are shown below:

**The Orthogonal Sectioning family of partitionings, from left to right:
(a) ORB,
(b) ORB-H,
(c) rectilinear, and (d) ORB-MM.**

In some cases we may want to section a coordinate
into more than two pieces.
ORB-H, or hierarchical ORB,
is a technique
that employs *multi-sectioning*:
it splits the problem into `p > 2` strips and
recurses on each strip.
This technique is also referred to as
multilevel ORB.
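The multi-sectioning step on its own can be sketched as below: each of the `p - 1` cuts is placed where the running sum of work comes closest to the corresponding fraction of the total. Only the sectioning of a single coordinate is shown, not the recursion on each strip, and the function name `multisection` is an assumption. In degenerate cases two cuts may coincide, which this sketch does not guard against.

```python
from itertools import accumulate


def multisection(work, p):
    """Cut positions that split `work` into p strips of roughly equal work.

    A cut position c separates work[c-1] from work[c].
    """
    prefix = list(accumulate(work))      # prefix[i] = total of work[0..i]
    total = prefix[-1]
    cuts = []
    for k in range(1, p):
        target = total * k / p           # ideal work to the left of cut k
        cut = min(range(1, len(work)), key=lambda c: abs(prefix[c - 1] - target))
        cuts.append(cut)
    return cuts


if __name__ == "__main__":
    work = [1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1]
    cuts = multisection(work, 3)
    bounds = [0] + cuts + [len(work)]
    print(cuts, [sum(work[a:b]) for a, b in zip(bounds, bounds[1:])])
```

For the 16-element workload used earlier and `p = 3`, the cuts fall after elements 5 and 11, giving strips with 7, 8, and 7 units of work.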

The principal disadvantage of the
orthogonal sectioning algorithms is the geometric
constraint that all cuts must be straight-edged.
Hence, a cut in `d` dimensions introduces a workload imbalance
that is carried by a `d-1` dimensional hyper-plane.
A way around the difficulty is to section the dividing hyper-plane, too;
this strategy is called ORB-MM, or ORB-"Median of Medians"
(shown above).
However, ORB-MM generates
highly irregular partitionings
in three or more space dimensions,
which for all practical purposes
are unstructured.
Since unstructured partitionings
do not have a compact representation, they require additional software
bookkeeping to
manage the non-rectangular iteration spaces appearing in application
software.
In addition, distributed mapping
tables must be maintained on message
passing architectures, at the cost of additional communication overheads
and a further increase in the application's complexity.

As we mentioned earlier, computations often exhibit temporal locality, whereby the solution changes gradually, for example, due to a time-step constraint. An incremental approach to orthogonal sectioning may effectively exploit temporal locality by computing work gradients across the partition boundaries and shifting the boundaries accordingly. However, complications arise when a boundary moves, since all other boundaries introduced later on in the recursion will be affected. In effect, we trade off temporal locality against spatial locality, and the trade-off may not be favorable. An approach which exploits temporal locality without the drawback of disrupting spatial locality is Nicol's rectilinear partitioning strategy. (Shown above.) This strategy avoids the difficulty of awkward workload adjustments by imposing a fixed connectivity on the partitionings. Partitionings are formed by the tensor product of strips taken across orthogonal coordinate directions. Load balance is improved by iteratively adjusting cuts independently along each problem dimension as necessary.

Inverse space-filling partitioning (ISP) is an alternative technique
that shares the desirable fine-grained load balancing characteristic
of ORB-MM but
with simplified bookkeeping. ISP is fast
(it does not incur the heavy time penalties of optimization techniques) and is
therefore appropriate for dynamic problems.
ISP works by drawing a Hamiltonian
path through the mesh and subdividing the resultant 1-d path.
This path is also referred to as an Inverse Spacefilling
curve, from which the method "Inverse Spacefilling Partitioning"
takes its name.
Because ISP partitionings are logically 1-dimensional,
they are easier to manipulate than the unstructured partitionings
rendered by strategies
such as ORB-MM.
We will also talk about these later in the course.
(If you are interested in learning more in the meantime,
see
"Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling
Curves,"
by J. R. Pilkington and S. B. Baden,
*IEEE Trans. on Parallel and
Distributed Systems*, 7(3), March 1996, pp. 288-300.)
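The idea of subdividing a path through the mesh can be illustrated with a much simpler path than a true space-filling curve. The sketch below uses a row-by-row "snake" ordering, which is a Hamiltonian path but is only a stand-in for the curves used by ISP; the function names and the rule for assigning cells to processors are assumptions made for the illustration.

```python
def snake_path(rows, cols):
    """Visit every mesh cell once, always stepping to an adjacent cell."""
    path = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        path.extend((r, c) for c in cs)
    return path


def path_partition(work, p):
    """Assign each cell to a processor by cutting the 1-D path by work.

    A cell goes to processor floor(W * p / total), where W is the work
    that precedes the cell along the path.
    """
    path = snake_path(len(work), len(work[0]))
    total = sum(work[r][c] for r, c in path)
    owner, before = {}, 0
    for r, c in path:
        owner[(r, c)] = min(p - 1, before * p // total)
        before += work[r][c]
    return owner


if __name__ == "__main__":
    mesh = [[2, 2, 1, 1],
            [2, 2, 1, 1],
            [1, 1, 1, 1],
            [1, 1, 1, 1]]
    owner = path_partition(mesh, 4)
    loads = [0] * 4
    for (r, c), rank in owner.items():
        loads[rank] += mesh[r][c]
    print(loads)   # [5, 5, 5, 5]
```

Because the partitioning is just a set of cut points along a 1-dimensional sequence, rebalancing amounts to sliding those cut points, which is far simpler than adjusting an unstructured partitioning.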

In addition to the above techniques there are also optimization
methods.
Typical approaches are based on simulated annealing,
neural networks, and genetic algorithms.
Approaches based on optimization are attractive because they
balance workloads particularly well (they are fine-grained, mapping
each point individually to a processor) and they also
minimize communication.
However, the benefits
must be weighed against the cost:
long and sometimes unpredictable running times. For
this reason, optimization techniques are restricted
to static problems.
Later in the course, we will look at an optimization-type partitioning
technique known as
*spectral graph decomposition*. Optimization techniques have also
been applied to coarse-grain applications, known as multiblock methods. In
these applications, we represent the communication structure of the blocks as a
graph, and apply graph partitioning. These are discussed in the on-line Demmel
reader.

Software support for irregular decompositions is usually found in an application library. In addition to the decomposition utilities, we will also need some help in handling communication in message passing implementations, because processors need to determine which neighbors they will communicate with. Data will generally be non-contiguous, and must be packed into contiguous messages for transmission and unpacked on the receiving side. One way of dealing with this is to use a library like KeLP, which has been developed at UCSD.

Having now discussed various load balancing approaches, we can appreciate the differences between load sharing and non-uniform partitioning. With load sharing, no effort is or can be made to measure the workload or to reduce imbalance. With non-uniform partitioning, the workload distribution is assessed, and the information is used to reduce workload imbalance. The subject of non-uniform partitioning is rich and diverse, and we have only just touched upon it in this course.

Copyright © 2002 Scott B. Baden. Last modified: 01/27/02 07:24 PM