Partitioning is the process of dividing up a computation among processors. There are often many ways of doing this, and the appropriate scheme depends on our application's requirements as well as the hardware. Architectural factors include the granularity of communication and whether or not memory is shared. Three issues guide the general partitioning requirements of an application: task granularity, locality, and load balance.
Although managing locality is an important part of building a high performance parallel program, task granularity is significant, too, and the two issues are related. Task granularity designates the amount of "useful" work done between synchronization points or between successive message transmissions. We usually quantify granularity as the number of processor cycles (CPs), or the amount of floating point work, that can be done between consecutive communication points or barrier synchronizations.
In the finest grain execution, at most a few CPs occur between synchronization points. SPMD programs generally employ a much coarser granularity, on the order of millions or even billions of CPs. It is very important that the granularity of the computation match the speed of the processor interconnect: if the interconnect is slow, then the granularity must be large enough to offset the cost of communication.
The choice of granularity is also affected by considerations of locality: we improve locality when we make the grains larger, and we disrupt locality when we make them smaller. However, as we will see later on, excessive increases in granularity can introduce severe load imbalance, in which only a few processors effectively carry the workload.
A large class of applications is organized around a mesh and entails updating each point of the mesh as a function of its nearest neighbors. Moreover, the work required to compute each element of the solution--a single point [i,j]--is constant. Consider a 2D image smoothing application, where each element Y[i,j] is a function of the neighboring values of X[i,j]:
for i = 1 : N-2
    for j = 1 : N-2
        Y[i,j] = (X[i-1,j] + X[i+1,j] + X[i,j-1] + X[i,j+1]) / 4
    end
end
For such an application we may employ a uniform partitioning, which we've previously discussed.
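To make this concrete, here is a minimal C sketch (not taken from the notes) of how one processor in an SPMD program might perform its share of the smoothing under a uniform partitioning by rows. The parameters myid and P, and the flat row-major layout of X and Y, are assumptions of this sketch, and the ghost-cell exchange with neighboring processors is omitted.

/* Each of P processors owns a contiguous band of rows of the N x N mesh.
   Ghost-row exchange with neighboring processors is omitted. */
void smooth_block(const double *X, double *Y, int N, int myid, int P)
{
    int ilo = myid * N / P;            /* first row owned by this processor */
    int ihi = (myid + 1) * N / P;      /* one past the last row it owns     */
    if (ilo < 1)     ilo = 1;          /* stay off the physical boundary    */
    if (ihi > N - 1) ihi = N - 1;

    for (int i = ilo; i < ihi; i++)
        for (int j = 1; j < N - 1; j++)
            Y[i*N + j] = 0.25 * (X[(i-1)*N + j] + X[(i+1)*N + j] +
                                 X[i*N + j-1]   + X[i*N + j+1]);
}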
Consider the following loop:

for i = 0 : N-1
    if (x[i] > 0)
        x[i] = sqrt(x[i]);
    end
end

with the distribution of negative numbers in x[ ] as follows:
+----------+----------+----------+----------+
|    <0    |    <0    |    <0    |    >0    |
+----------+----------+----------+----------+
     p0         p1         p2         p3

The load is carried by only one processor, and therefore no parallel speedup is realized. We say that the workloads are imbalanced. The load imbalance arises because no work is done on the negative numbers: on modern microprocessors, the time to take a square root is roughly an order of magnitude higher than making a comparison.
In general, the distribution of negative values in x[ ] may not be known a priori (in advance), and it may also change dynamically. There are two ways of dealing with this problem:
+---------------------------------+
|012301230123....                 |
+---------------------------------+
Cyclic partitioning assigns the work evenly so long as the pattern of negative and positive values in x[ ] doesn't correlate with the repeating pattern of the wrapping strategy (it would fail, for example, if every 4th value of the array in our square root calculation were negative, since all of the cheap elements would then land on the same processor).
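In an SPMD setting, the cyclic mapping amounts to having processor myid execute iterations myid, myid+P, myid+2P, and so on. The following C sketch illustrates this; the SPMD parameters myid and P are assumptions of the sketch, not part of the original loop:

#include <math.h>

/* Cyclic (wrapped) assignment: the owner of iteration i is i % P,
   so processor myid strides through the array with step P. */
void sqrt_cyclic(double *x, int N, int myid, int P)
{
    for (int i = myid; i < N; i += P)
        if (x[i] > 0)
            x[i] = sqrt(x[i]);
}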
We use the notation CYCLIC to denote a cyclic decomposition. Cyclic mappings also apply to higher dimensional arrays, e.g. (CYCLIC, CYCLIC):
-------------------------------
| P0 P1 | P0 P1 | P0 P1 |
| P2 P3 | P2 P3 | P2 P3 |
-------------------------------
| P0 P1 | P0 P1 | P0 P1 |
| P2 P3 | P2 P3 | P2 P3 |
-------------------------------
(*, CYCLIC) and so on:
-------------------------------
| P0 P1 | P2 P3 | P0 P1 |
| P0 P1 | P2 P3 | P0 P1 |
| P0 P1 | P2 P3 | P0 P1 |
| P0 P1 | P2 P3 | P0 P1 |
-------------------------------
Wrapping can incur excessive communication overhead in computations requiring interprocessor communication. For example:
for i = 1 : N-1
    if (x[i] < 0)
        y[i] = x[i] - x[i-1];
    else
        y[i] = sqrt(x[i]) - x[i-1];
    end
end
Observe that with wrapped partitioning, x[i] and x[i-1] lie on separate processors for every value of i. Compare this with regular partitioning, where x[i] and x[i-1] lie on separate processors only 1 time in N/P, that is, only for the leftmost element of each partition, which contains N/P values. In practice we can expect N >> P, so communication is far less costly than for wrapped partitioning.
We refer to the difficulty with wrapped partitioning as a surface-to-volume effect: the surface of each partition represents the amount of communication that must be done, and the volume represents the work. With wrapped partitioning the ratio is 1:1, whereas with regular partitioning the surface-to-volume ratio is 1:(N/P).
We may reduce the surface-to-volume ratio by chunking: we apply the wrap pattern over larger units of work, wrapping K consecutive elements instead of just 1. This is shown below for the case of K=3 on 2 processors, which is designated as CYCLIC(3):
+--------------------------------+     processors get
|000111000111.....               |     alternating,
+--------------------------------+     small partitions
|-|
chunk size
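A sketch of the corresponding CYCLIC(K) loop follows; as before, myid and P are assumed SPMD parameters, and K is the chunk size:

#include <math.h>

/* Block-cyclic (CYCLIC(K)) assignment: iteration i belongs to chunk i/K, and
   chunks are dealt out cyclically, so the owner of i is (i/K) % P.  Processor
   myid therefore starts at chunk myid and advances P chunks at a time. */
void sqrt_block_cyclic(double *x, int N, int myid, int P, int K)
{
    for (int start = myid * K; start < N; start += P * K)
        for (int i = start; i < start + K && i < N; i++)
            if (x[i] > 0)
                x[i] = sqrt(x[i]);
}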
There is, however, a limit to how far this technique can be taken, for as the chunks get larger, so does the load imbalance. This granularity tradeoff is ubiquitous in load balancing, and is shown in the following graph: as we decrease granularity we reduce load imbalance, but at the cost of disrupting locality.
[Graph: execution time vs. chunk size. Small chunks give decreased load imbalance but increased communication; large chunks give decreased communication but increased load imbalance. The optimal chunk size balances communication overhead against load imbalance.]
In some cases load sharing with wrapped partitionings can be ineffective because of the high communication overheads. On a shared memory architecture one can instead employ processor self-scheduling, in which processors obtain work assignments from a shared structure such as a counter or a queue.
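A minimal sketch of self-scheduling for our square root loop, expressed here with OpenMP's dynamic schedule (the chunk size of 64 is an arbitrary assumption):

#include <math.h>
#include <omp.h>

/* Self-scheduling: schedule(dynamic, 64) maintains a shared counter of
   unassigned iterations; an idle thread atomically claims the next block
   of 64 iterations when it finishes its current block. */
void sqrt_self_scheduled(double *x, int N)
{
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < N; i++)
        if (x[i] > 0)
            x[i] = sqrt(x[i]);
}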
On a message passing architecture, a common approach is to dedicate one processor to the task of handing out work. We often refer to this dedicated processor as the manager. The other processors (the servers) wait for a message containing some input; upon receipt of that input they compute, and then return their result to the manager when done--or they may simply do some I/O. The manager sends out additional work as the servers complete their tasks. When there is no more work, the manager informs each processor in turn, and when the last piece of work is returned to the manager, all exit. This approach works quite well if the tasks require no communication--often called embarrassingly parallel--or if the cost of handling a task's input and output is small compared to the amount of work performed by the task.
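Below is a sketch of such a manager/server scheme in C with MPI. The task representation (an integer index per task), the stand-in do_task() function, and the assumption that there are at least as many tasks as servers are all illustrative choices, not part of the original notes.

#include <mpi.h>

#define TAG_WORK 1     /* message carries a piece of work */
#define TAG_DONE 2     /* message means "no more work"    */

static double do_task(int item) { return (double) item * item; }  /* stand-in task */

static void manager(int nprocs, int ntasks)   /* runs on processor 0 */
{
    int next = 0, active = 0;
    /* prime each server with one task (assumes ntasks >= nprocs-1) */
    for (int w = 1; w < nprocs; w++, next++, active++)
        MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

    /* hand out the remaining work as results come back */
    while (active > 0) {
        double result;
        MPI_Status status;
        MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        if (next < ntasks) {
            MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
            next++;
        } else {
            MPI_Send(&next, 0, MPI_INT, status.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
            active--;
        }
    }
}

static void server(void)                      /* runs on processors 1..P-1 */
{
    for (;;) {
        int item;
        MPI_Status status;
        MPI_Recv(&item, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == TAG_DONE) break;
        double result = do_task(item);
        MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
    }
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (rank == 0) manager(nprocs, 1000);     /* 1000 tasks, chosen arbitrarily */
    else           server();
    MPI_Finalize();
    return 0;
}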
Load sharing is a passive technique in that no attempt is made to actively respond to workload variations. An alternative is to actively measure the workload variations and to partition the workload non-uniformly so that each processor gets a fair share of the work. The pieces of work assigned to the processors then have different sizes, according to the spatial density distribution of the work. In effect this is an off-line algorithm.
In general, non-uniform partitioning will only work if we have a good idea of the workload distribution. If the distribution changes with time, then we may have to periodically repartition the work. In most physical problems the solution changes gradually enough that the cost of repartitioning won't be too great.
Orthogonal Recursive Bisection (ORB), also known as Recursive Coordinate Bisection (RCB), is a useful non-uniform partitioning strategy. It works by splitting the computation into two equal parts (of work, not space) and recursively splitting each part until done. ORB also works when the number of processors is not a power of two, and it can be applied to multidimensional problems. (We will talk more about load balancing later in the course.)
Consider the following loop:
for (i = 1; i < N-1; i++)
    if (x[i] < 0)
        y[i] = f1( x[i], x[i-1] );
    else
        y[i] = f2( x[i], x[i-1] );

Assume that f1() takes twice as long to compute as f2(). Now, if x[ ] changes very slowly, then we can build a workload density mapping giving the relative cost of updating each value of x[ ]. Let's assume that the mapping is as follows:
1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1

We split the above array at the point where the sum of the elements on either side of the split is as equal as possible:
1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
             ^
             |
          Cut here

If we are running on 4 processors, then we split again, once for each part:
1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
       ^     ^           ^
       |     |           |
    Cut #2 Cut #1      Cut #3
We now have 4 partitions with 5, 6, 6, and 5 units of work respectively.
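The splitting itself is easy to express in code. The following C sketch recursively bisects a workload density array, recording where each partition ends; it assumes the number of processors is a power of two and is an illustration, not the exact algorithm from the notes. Applied to the 16-element map above with 4 processors, it produces cuts after elements 4, 7, and 12; since the second half sums to 11, pieces of 5 and 6 units (in either order) are equally well balanced.

/* Recursive bisection of a 1-D workload density array w[lo..hi-1].
   cuts[] receives the (exclusive) right edge of each partition, in order. */
static int find_cut(const double *w, int lo, int hi)
{
    double total = 0, left = 0;
    int i;
    for (i = lo; i < hi; i++) total += w[i];
    /* advance the cut while at most half of the work lies to its left */
    for (i = lo; i < hi && left + w[i] <= total / 2; i++) left += w[i];
    return i;
}

static void orb1d(const double *w, int lo, int hi, int nproc, int *cuts, int *ncuts)
{
    if (nproc == 1) {                     /* leaf: this span is one partition */
        cuts[(*ncuts)++] = hi;
        return;
    }
    int cut = find_cut(w, lo, hi);        /* split the work as evenly as possible */
    orb1d(w, lo, cut, nproc / 2, cuts, ncuts);
    orb1d(w, cut, hi, nproc / 2, cuts, ncuts);
}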
ORB reduces the higher dimensional partitioning problem to a set of simpler 1-dimensional partitioning problems, successively partitioning along orthogonal coordinate directions. Generally the cutting direction is varied at each level of the recursion to avoid elongated partitions, which have a poor surface-to-volume ratio and hence higher communication costs.
We next discuss relatives of ORB, some of which are shown in the figure below.

[Figure: The Orthogonal Sectioning family of partitionings, from left to right: (a) ORB, (b) ORB-H, (c) rectilinear, and (d) ORB-MM.]
In some cases we may want to section a coordinate
into more than two pieces.
ORB-H or hierarchical ORB
is a technique
that employs multi-sectioning:
it splits the problem into p > 2 strips and
recurses on each strip.
This technique is also referred to as
multilevel ORB.
The principal disadvantage of the
orthogonal sectioning algorithms is the geometric
constraint that all cuts must be straight-edged.
Hence, a cut in d dimensions introduces a workload imbalance
that is carried by a d-1 dimensional hyper-plane.
A way around the difficulty is to section the dividing hyper-plane, too;
this strategy is called ORB-MM, or ORB-``Median of Medians.''
(Shown above.)
However, ORB-MM generates
highly irregular partitionings
in three or more space dimensions
which for all practical purposes
are unstructured.
Since unstructured partitionings
do not have a compact representation, they require additional software
bookkeeping to
manage non-rectangular iteration spaces appearing in application
software.
In addition, distributed mapping
tables must be maintained on message
passing architectures, at the cost of additional communication overheads
and a further increase in the application's complexity.
As we mentioned earlier,
computations often exhibit temporal locality, whereby the solution
changes gradually, for example,
due to a time-step constraint.
An incremental approach to orthogonal sectioning may effectively
exploit temporal locality by computing
work gradients across the partition boundaries and shifting
the boundaries accordingly. However, complications
arise when a boundary moves, since all other boundaries
introduced later on in the recursion will be affected.
In effect, we trade off temporal locality against spatial locality,
and the trade-off may not be favorable.
An approach which exploits temporal locality without the drawback of
disrupting spatial locality
is Nicol's
rectilinear partitioning strategy.
(Shown above.)
This strategy
avoids the difficulty of awkward workload adjustments by
imposing a fixed connectivity on the partitionings.
Partitionings are formed by the tensor product
of strips taken across orthogonal coordinate directions.
Load balance is improved by
iteratively adjusting cuts independently along each problem dimension
as necessary.
Inverse space-filling partitioning (ISP) is an alternative technique
that shares the desirable fine-grained load balancing characteristic
of ORB-MM but
with simplified bookkeeping. ISP is fast---
it does not incur the heavy time penalties of optimization techniques---and is
therefore appropriate for dynamic problems.
ISP works by drawing a Hamiltonian
path through the mesh and subdividing the resultant 1-d path.
This path is also referred to as an Inverse Spacefilling
curve, from which the method ``Inverse Spacefilling Partitioning''
takes its name.
Because ISP partitionings are logically 1-dimensional,
they are easier to manipulate than unstructured partitionings
rendered by strategies
such as ORB-MM.
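The 1-d subdivision step is essentially the same computation as in the 1-d ORB sketch shown earlier. Assuming the traversal order along the curve has already been computed (curve[k] gives the k-th cell visited; generating the curve itself is not shown), a minimal sketch of the partitioning step might look like this:

/* Partition ncells mesh cells among nproc processors by cutting the 1-d
   ordering induced by the space-filling curve into pieces of nearly equal
   work.  work[c] is the workload density of cell c; owner[c] receives its
   assigned processor. */
void isp_partition(const int *curve, const double *work,
                   int ncells, int nproc, int *owner)
{
    double total = 0, acc = 0;
    for (int k = 0; k < ncells; k++) total += work[curve[k]];

    int p = 0;
    for (int k = 0; k < ncells; k++) {
        owner[curve[k]] = p;                       /* assign cell to processor p */
        acc += work[curve[k]];
        /* advance to the next processor once its share of the work is filled */
        if (p < nproc - 1 && acc >= (p + 1) * total / nproc)
            p++;
    }
}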
We will also talk about these later in the course.
(If you are interested in learning more in the meantime,
see
Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling
Curves,
by J. R. Pilkington and S. B. Baden.
IEEE Trans. on Parallel and
Distributed Systems , 7(3), March 1996, pp. 288-300.)
III.b.2. Optimization techniques

In addition to the above techniques there are also optimization
methods.
Typical approaches are based on simulated annealing,
neural networks, and genetic algorithms.
Approaches based on optimization are attractive because they
balance workloads particularly well---they are fine-grained, mapping
each point individually to a processor---and they also
minimize communication.
However, the benefits
must be weighed against the cost:
long and sometimes unpredictable running times. For
this reason, optimization techniques are restricted
to static problems.
Later in the course, we will look at an optimization-type partitioning
technique known as
spectral graph decomposition. Optimization techniques have also
been applied to coarse-grain applications, known as multiblock methods. In
these applications, we represent the communication structure of the blocks as a
graph, and apply graph partitioning. These are discussed in the on-line Demmel
reader.
Software support for irregular decompositions is usually
found in an application library. In addition to the decomposition
utilities, we will also need some help in handling communication
in message passing implementations. This is true because processors
need to determine with which neighbors they will communicate. Data
will generally be non-contiguous, and must be packed into contiguous
messages for transmission, and unpacked on the receiving side. One way of
dealing with this is to use a library like KeLP,
which has been developed at UCSD.
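As an illustration of the packing step, here is a sketch only: the column-of-a-row-major-array layout and the message tag are assumptions, and a real code would more likely use an MPI derived datatype or a library such as KeLP to hide this bookkeeping.

#include <mpi.h>
#include <stdlib.h>

/* Send column j of a row-major N x N array: the column is non-contiguous
   (stride N), so gather it into a contiguous buffer before transmission. */
void send_column(const double *X, int N, int j, int dest)
{
    double *buf = (double *) malloc(N * sizeof(double));
    for (int i = 0; i < N; i++)
        buf[i] = X[i*N + j];                     /* pack */
    MPI_Send(buf, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    free(buf);
}

/* Receive a column and scatter it back into the strided locations. */
void recv_column(double *X, int N, int j, int src)
{
    double *buf = (double *) malloc(N * sizeof(double));
    MPI_Recv(buf, N, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (int i = 0; i < N; i++)
        X[i*N + j] = buf[i];                     /* unpack */
    free(buf);
}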
Having now discussed various load balancing approaches, we can appreciate the
differences between load sharing and non-uniform partitioning. With load
sharing, no effort is made to measure the workload; imbalance is reduced only
passively. With non-uniform partitioning, the workload distribution is assessed,
and that information is used to reduce the imbalance. The subject of
non-uniform partitioning is rich and diverse, and we have only just touched upon
it in this course.
III.b.3. An application for non-uniform decomposition in multiple dimensions
Let's consider a different type of solver for Laplace's Equation, in which
we are to compute the solution within a disk
instead of a square. Now, we may choose to represent the circular
domain in various ways. The simplest (and, as we will see,
not necessarily the most efficient) is to embed a disk of radius R in a
square, but compute only over the points (i,j)
such that i*i + j*j <= R^2. It should become clear
that we don't want to compute the points outside the circle,
since we would waste time computing on them (not to mention
that we would have to keep resetting them to the appropriate boundary
condition). More generally, the computation could be much
more involved than computing a few floating point values.
In a 3D problem, we could save as much as a factor of two in time
if we restrict computation to the actual domain.

[Figure: a disk of radius R embedded in a square mesh; only the points (i,j) with i*i + j*j <= R^2 lie inside the disk and are computed.]
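A sketch of one relaxation sweep over this embedded-disk representation follows. The array layout, the centering of the disk at (R, R), and the simple four-point update are illustrative assumptions; points outside the disk are assumed to hold the boundary condition.

/* One sweep over a (2R+1) x (2R+1) mesh that embeds a disk of radius R
   centered at (R, R).  Only interior points inside the disk are updated;
   roughly 1 - pi/4 of the square is skipped. */
void sweep_disk(const double *U, double *Unew, int R)
{
    int n = 2*R + 1;
    for (int i = 1; i < n-1; i++) {
        for (int j = 1; j < n-1; j++) {
            int ii = i - R, jj = j - R;          /* coordinates relative to the center */
            if (ii*ii + jj*jj <= R*R)            /* skip points outside the disk */
                Unew[i*n + j] = 0.25 * (U[(i-1)*n + j] + U[(i+1)*n + j] +
                                        U[i*n + j-1]   + U[i*n + j+1]);
        }
    }
}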
Copyright © 2002 Scott B. Baden.
Last modified: 01/27/02 07:24 PM