Load balancing and data decomposition

Partitioning is the process of dividing up a computation among the processors. There are many ways of doing this, and the appropriate scheme depends on the application's requirements as well as the hardware. Architectural factors include the granularity of communication and whether or not memory is shared. To determine the general partitioning requirements of an application, we ask three questions:

- Can the application effectively employ **data parallelism**, in which we divide up the data, or **task parallelism**, in which we divide up the code into separate tasks, or both?

- Does the application spend equal amounts of time updating each piece or
element of the solution?

- If the application spends unequal amounts of time updating each element of the solution, is the variation static or dynamic?

In what follows, we focus on data parallelism rather than task parallelism.

As we know, there is
a large difference in the cost of accessing cache and main memory.
The point of managing cache locality is to ensure that
frequently accessed data are likely to reside in cache.
Otherwise, we need to move data from main memory into cache,
a process that we consider to be *communication*.
As we shall see, work partitioning strategies impact locality as well as
**task granularity**. The two
issues are related. Task granularity designates the amount of "useful" work done
between successive communications. We
usually quantify granularity in terms of the number of processor cycles (CPs) or
the amount of floating point work that could be done between data transfers.

In the finest grain execution, only a few CPs occur between synchronization points. SPMD programs generally employ a much higher level of granularity, on the order of millions of CPs or even billions. It is very important that the granularity of computation match the communication cost. If the communication cost is high, then the granularity must be large enough to offset that cost.

The granularity selection process is affected by considerations of locality. We improve locality when we make the grains larger, and we disrupt locality when we make them smaller. However, as we will see later on, excessive increases in granularity can introduce severe load imbalance, where only a few processors effectively carry the workload.

A large class of applications is organized around a mesh, and entails updating each point of the mesh as a function of its nearest neighbors. Moreover, the work required to compute each element of the solution, a single point [i,j], is constant. Consider a 2D image smoothing application, where each element Y[i,j] is a function of neighboring values of X[i,j]:

    for i = 0 : N-1
        for j = 0 : N-1
            Y[i,j] = (X[i-1,j] + X[i+1,j] + X[i,j-1] + X[i,j+1])
        end
    end

For such an application we may employ a *uniform partitioning*, which
we've previously discussed.
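A minimal sketch of such a uniform partitioning, assuming a row-block decomposition where each of `P` workers updates a contiguous band of `N/P` rows (the names `smooth_rows`, `N`, and `P` are illustrative, not from any real library):

```python
# Uniform (block) row partitioning of the 2D smoothing stencil:
# each of P workers updates a contiguous band of N/P rows.

def smooth_rows(X, Y, row_lo, row_hi, N):
    """Update rows [row_lo, row_hi) of Y from the 4 nearest neighbors in X."""
    for i in range(max(row_lo, 1), min(row_hi, N - 1)):
        for j in range(1, N - 1):
            Y[i][j] = X[i-1][j] + X[i+1][j] + X[i][j-1] + X[i][j+1]

N, P = 8, 4
X = [[1.0] * N for _ in range(N)]
Y = [[0.0] * N for _ in range(N)]
rows_per = N // P
for p in range(P):                      # one block of rows per "processor"
    smooth_rows(X, Y, p * rows_per, (p + 1) * rows_per, N)
```

Here the loop over `p` stands in for the parallel workers; because each point costs the same, equal-sized blocks give a balanced load.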

Consider the following loop:

    for i = 0 : N-1
        if (x[i] > 0) x[i] = sqrt(x[i]);
    end

with the distribution of negative numbers in x[ ] as follows:

    +----------+----------+----------+----------+
    |    <0    |    <0    |    <0    |    >0    |
    +----------+----------+----------+----------+
        p0         p1         p2         p3

The load is carried by only one processor, and therefore no parallel
speedup is realized. We say that the workloads are *imbalanced*. The load
imbalance arises because no work is done on the negative numbers: on modern
microprocessors, the time to take a square root is roughly an order of magnitude
higher than that of a comparison.
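We can illustrate the imbalance by counting the expensive square-root operations each processor performs under a uniform block partitioning of the distribution sketched above (the setup here, with negatives filling the first three quarters of `x[]`, is an illustrative assumption):

```python
# Count of expensive sqrt operations per processor under a uniform
# block partitioning, with negatives in the first three quarters of x.

N, P = 16, 4
x = [-1.0] * (3 * N // 4) + [1.0] * (N // 4)

block = N // P
work = [sum(1 for v in x[p * block:(p + 1) * block] if v > 0)
        for p in range(P)]
print(work)          # only the last processor does any sqrt work
```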

In general, the distribution of negative values of x[i] may not be known *a
priori* (in advance) and may also change dynamically. There are two ways of
dealing with this problem:

- Load sharing
- Non-uniform partitioning

With load sharing we do not attempt to measure the workload imbalance. Rather, we rely on statistical properties of the workload distribution to let each processor obtain a fair share of the workload. When memory isn't shared, we require an alternative to processor self-scheduling.

An effective technique which is used,
for example, to solve systems of linear equations (e.g. Gaussian
Elimination) is called
*cyclic* partitioning
(another strategy is client-server, which we discussed earlier in the course).
This
strategy assigns data to processors in round robin fashion:
element i of array x[ ] is assigned to processor
i mod P. This is shown for the case of P=4 processors:

    +---------------------------------+
    | 0 1 2 3 0 1 2 3 0 1 2 3 ...     |
    +---------------------------------+

Cyclic partitioning can evenly assign the work so long as the pattern of
negative and positive values of `x[ ]` doesn't correlate with the
repeating pattern of the wrapping strategy (it would fail, for example, if every
4th value of the array in our square root calculation were negative).
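A sketch of the round-robin assignment, reusing the skewed sign pattern from the square-root example above (the setup is illustrative):

```python
# Cyclic (round-robin) partitioning: element i goes to processor i mod P.
# The skewed sqrt workload is now spread evenly across processors.

N, P = 16, 4
x = [-1.0] * (3 * N // 4) + [1.0] * (N // 4)

work = [sum(1 for i in range(N) if i % P == p and x[i] > 0)
        for p in range(P)]
print(work)          # each processor gets an equal share of the sqrt work
```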

We use the notation `CYCLIC` to denote a cyclic decomposition. Cyclic
mappings also apply to higher dimensional arrays, e.g. `(CYCLIC, CYCLIC)`:

    -------------------------------
    | P0 P1 | P0 P1 | P0 P1 |
    | P2 P3 | P2 P3 | P2 P3 |
    -------------------------------
    | P0 P1 | P0 P1 | P0 P1 |
    | P2 P3 | P2 P3 | P2 P3 |
    -------------------------------

`(*, CYCLIC)`, and so on:

    -------------------------------
    | P0 P1 | P2 P3 | P0 P1 |
    | P0 P1 | P2 P3 | P0 P1 |
    | P0 P1 | P2 P3 | P0 P1 |
    | P0 P1 | P2 P3 | P0 P1 |
    -------------------------------
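The owner computation behind these two mappings can be sketched as follows, assuming a 2x2 logical processor grid for P=4 (the function names are illustrative):

```python
# Owner computation for 2-D cyclic decompositions on a 2x2 processor grid.
# (CYCLIC, CYCLIC): both row and column indices wrap.
# (*, CYCLIC): only the column index selects the owner; rows map identically.

PR, PC = 2, 2           # 2x2 logical processor grid, P = PR * PC = 4

def owner_cyclic_cyclic(i, j):
    return (i % PR) * PC + (j % PC)

def owner_star_cyclic(j, P=4):
    return j % P

print(owner_cyclic_cyclic(0, 0), owner_cyclic_cyclic(1, 1))  # → 0 3
print(owner_star_cyclic(4))                                  # → 0
```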

Wrapping can incur high data transfer overheads.
For example in the following loop, we see that
`x[i]` and `x[i-1]` will be computed by different cores for every value of `i`:

    for i = 1 : N-1
        if (x[i] < 0)
            y[i] = x[i] - x[i-1];
        else
            y[i] = sqrt(x[i]) - x[i-1];
        end
    end

Thus, every core must load the left neighbor value into its local cache,
and it will not *reuse* that value since the left neighbor will
compute the value. The result is that all values must be loaded P times,
where there are P processors, for a total of N*P loads.
Compare with regular partitioning,
where `x[i]` and `x[i-1]` are computed by different processors only 1 time in
`N/P`, that is, only for the leftmost element of the partition, which
contains `N/P` values.
Thus, a core will compute a consecutive set of values and reuse those values;
it won't have to load the neighbors for every value in `x[]`.
In practice, we can expect `N >> P`, so
the data transfer cost is far lower than for wrapped partitioning.
In particular, each processor communicates `N/P+1` values,
and so a total of `N+P` values are loaded, a much smaller number
than `N*P`.
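A sketch comparing the two partitionings by counting how often consecutive elements `x[i-1]` and `x[i]` land on different processors, since each such boundary forces a remote load of the left neighbor (function names are illustrative):

```python
# Count the i for which x[i-1] and x[i] have different owners,
# under block vs. cyclic partitioning.

N, P = 64, 4

def owner_block(i):
    return i // (N // P)      # contiguous blocks of N/P elements

def owner_cyclic(i):
    return i % P              # round-robin assignment

block_crossings  = sum(owner_block(i)  != owner_block(i - 1)  for i in range(1, N))
cyclic_crossings = sum(owner_cyclic(i) != owner_cyclic(i - 1) for i in range(1, N))
print(block_crossings, cyclic_crossings)   # P-1 crossings vs. N-1 crossings
```

With block partitioning only the P-1 partition boundaries cross owners; with cyclic partitioning every single pair does.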

We refer to the difficulty with wrapped partitioning as a
*surface-to-volume* effect: the surface of each partition
represents the amount of communication that must be done, and the volume
represents the work. With wrapped partitioning the ratio is 1:1, whereas with
regular partitioning the surface-to-volume ratio is `1:N/P`.

We may reduce the surface-to-volume ratio by *chunking*: we apply the
wrap pattern over larger units of work, wrapping K consecutive elements instead
of just 1. This is shown for the case of K=2 on 4 processors, which is
designated as `CYCLIC(2)`:

    +---------------------------------+
    | 0 0 1 1 2 2 3 3 0 0 1 1 ...     |
    +---------------------------------+
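The block-cyclic owner function can be sketched in one line (the name `owner_block_cyclic` is illustrative):

```python
# Block-cyclic CYCLIC(K) assignment: chunks of K consecutive elements
# are dealt round-robin to the P processors.

def owner_block_cyclic(i, K, P):
    return (i // K) % P

# K=2, P=4 reproduces the pattern 0 0 1 1 2 2 3 3 0 0 1 1 ...
pattern = [owner_block_cyclic(i, 2, 4) for i in range(12)]
print(pattern)
```

Note that K=1 recovers plain `CYCLIC`, and K=N/P recovers the regular block partitioning, so K interpolates between the two extremes of the tradeoff.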

There is, however, a limit to how far this technique can be taken, for as the
chunks get larger, so does the load imbalance. This *granularity tradeoff* is
ubiquitous in load balancing, and is shown in the following graph: as we
decrease granularity we improve load balance, but at the cost of disrupting
locality.
There is usually a "sweet spot" where a
chunk size (or range of sizes)
optimally trades off communication overhead against the cost of load imbalance.
        decreased load            increased load
        imbalance, but            imbalance, but
        increased comm.           decreased comm.
              |                         |
              |\                       /
        time  | \                     /
              |  \_                 _/
              |    \__           __/
              |       \___   ___/
              |           \_/
              |            |
              +--------------------------------
                           ^ chunk size
                          "Sweet spot"

In some cases load sharing with wrapped partitioning can be ineffective
because of the high communication overheads. On shared memory architectures one
can employ *processor self-scheduling*, in which processors access a
shared structure like a counter or queue for work assignments.
OpenMP implements this strategy using *dynamic scheduling*.
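A minimal sketch of self-scheduling with a shared counter, using threads to stand in for processors (the chunk size, worker count, and variable names are illustrative; OpenMP's `schedule(dynamic)` clause realizes the same idea inside the runtime):

```python
# Processor self-scheduling: workers claim chunk start indices from a
# shared counter, so faster workers automatically take more chunks.

import threading

N, CHUNK = 100, 10
counter = 0
lock = threading.Lock()
done = [0] * 4          # chunks completed per worker

def worker(wid):
    global counter
    while True:
        with lock:                      # atomically claim the next chunk
            lo, counter = counter, counter + CHUNK
        if lo >= N:
            return
        # ... process elements lo .. min(lo+CHUNK, N)-1 here ...
        done[wid] += 1

threads = [threading.Thread(target=worker, args=(w,)) for w in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(done))        # every chunk is claimed exactly once
```

Because each chunk index is handed out exactly once under the lock, the total work is covered with no duplication, regardless of how the threads interleave.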

Copyright © 2013 Scott B. Baden [Sun Oct 27 19:35:04 PDT 2013]