CSE 260 Lecture 16 (11/28/2012)
Load balancing and data decomposition

Partitioning

Partitioning is the process of dividing up a computation among processors. There are often many ways of doing this, and the appropriate scheme depends on our application's requirements as well as the hardware. Architectural factors include the granularity of communication and whether or not memory is shared. We ask three questions in determining the general partitioning requirements of an application:

  • Can the application effectively employ data parallelism, in which we divide up the data, or function parallelism, in which we divide up the code into separate tasks, or both?
  • Does the application spend equal amounts of time updating each piece or element of the solution?
  • If the application spends unequal amounts of time updating each element of the solution, is the variation static or dynamic?

    I. The granularity-locality tradeoff

    In a parallel computer there is a large difference in the cost of accessing local and remote memory. The point of managing locality is to ensure that frequently accessed data are likely to be located in local memory, for much the same reason that we want to manage the on-chip cache effectively. As we shall see, work partitioning strategies impact locality as well as task granularity. Both issues are related. Task granularity designates the amount of "useful" work done between synchronization points or between successive message transmissions. We usually quantify granularity in terms of the number of processor cycles (CP) or floating point work that could be done between consecutive communication points or barrier synchronizations.

    In the finest-grained executions, only a few CPs occur between synchronization points. SPMD programs generally employ a much coarser granularity, on the order of millions or even billions of CPs. It is very important that the granularity of the computation match the speed of the processor interconnect: if the interconnect is slow, then the granularity must be large enough to offset the cost of communication.

    The granularity selection process is affected by considerations of locality. We improve locality when we make the grains larger, and we disrupt it when we make them smaller. However, as we will see later on, excessive increases in granularity can introduce severe load imbalance, where only a few processors effectively carry the workload.

    II. Uniform Static partitioning

    A large class of applications is organized around a mesh, and entails updating each point of the mesh as a function of its nearest neighbors. Moreover, the work required to compute each element of the solution--a single point [i,j]--is constant. Consider a 2D image smoothing application, where each element Y[i,j] is a function of the neighboring values of X:

    for i = 1 : N-2
        for j = 1 : N-2
            Y[i,j] = (X[i-1,j] + X[i+1,j] + X[i,j-1] + X[i,j+1]) / 4   % average of the four nearest neighbors
        end
    end
    

    For such an application we may employ a uniform partitioning, which we've previously discussed.
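
    As a concrete illustration, here is a minimal C sketch of a uniform block partition of the smoothing loop by rows. The names myRank and P (this process's id and the total number of processes), and the assumption that the N-2 interior rows divide evenly by P, are inventions for the sketch and not part of the lecture's pseudocode.

        /* Uniform block partition: process myRank updates one contiguous band of rows. */
        void smooth_block(int myRank, int P, int N, const double *x, double *y)
        {
            int rowsPerProc = (N - 2) / P;      /* interior rows are 1 .. N-2 */
            int i0 = 1 + myRank * rowsPerProc;  /* first row owned by this process */
            int i1 = i0 + rowsPerProc;          /* one past the last owned row */
            for (int i = i0; i < i1; i++)
                for (int j = 1; j < N - 1; j++)
                    y[i * N + j] = 0.25 * (x[(i - 1) * N + j] + x[(i + 1) * N + j]
                                         + x[i * N + j - 1] + x[i * N + j + 1]);
        }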

    III. Non-uniform decompositions and load balancing

    Consider the following loop:

        for i = 0 : N-1
            if (x[i] > 0)
                x[i] = sqrt(x[i]);
            end
        end

    with the distribution of negative and positive numbers in x[ ] as follows:
    		+----------+----------+----------+----------+
    		|    <0    |    <0    |    <0    |    >0    |
    		+----------+----------+----------+----------+
    
    		     p0         p1         p2         p3
    

    The load is carried by only one processor, and therefore no parallel speedup is realized. We say that the workloads are imbalanced. The load imbalance arises because essentially no work is done on the negative numbers: on modern microprocessors, the time to take a square root is roughly an order of magnitude higher than the time to make a comparison.

    In general, the distribution of negative values of x[i] may not be known a priori (in advance) and may also change dynamically. There are two ways of dealing with this problem:

  • Load sharing
  • Non-uniform partitioning

    III.a Load Sharing

    With load sharing we do not attempt to measure the workload imbalance. Rather, we rely on statistical properties of the workload distribution to let each processor obtain a fair share of the workload. When memory isn't shared, we require an alternative to processor self-scheduling.

    An effective technique, used for example to solve systems of linear equations (e.g. Gaussian elimination), is called cyclic partitioning (another strategy is client-server, which we discussed earlier in the course). This strategy assigns data to processors in round-robin fashion: element i of array x[ ] is assigned to processor i mod P. This is shown for the case of P=4 processors:

        +---------------------------------+
        |012301230123....                 |
        +---------------------------------+
    

    Cyclic partitioning can evenly assign the work so long as the pattern of negative and positive values of x[ ] doesn't correlate with the repeating pattern of the wrapping strategy (as it would, for example, if every 4th value of the array in our square root calculation were negative).
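
    As an illustration, here is a minimal C sketch of the owner-computes rule under a cyclic (wrapped) partitioning of the square-root loop. The names myRank and P (this process's id and the number of processes) are assumptions made for the sketch.

        #include <math.h>

        /* Each process updates the elements it owns: element i belongs to processor i mod P. */
        void sqrt_cyclic(int myRank, int P, int N, double *x)
        {
            for (int i = myRank; i < N; i += P)
                if (x[i] > 0.0)
                    x[i] = sqrt(x[i]);
        }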

    We use the notation CYCLIC to denote a cyclic decomposition. Cyclic mappings also apply to higher dimensional arrays, e.g. (CYCLIC, CYCLIC):

    		-------------------------------	
    		|  P0 P1  |  P0 P1  |  P0 P1  |
    		|  P2 P3  |  P2 P3  |  P2 P3  |    
    		-------------------------------	    
    		|  P0 P1  |  P0 P1  |  P0 P1  |
    		|  P2 P3  |  P2 P3  |  P2 P3  |  
    		-------------------------------	
    

    (*, CYCLIC) and so on:

    		-------------------------------	
    		|  P0 P1  |  P2 P3  |  P0 P1  |
    		|  P0 P1  |  P2 P3  |  P0 P1  |    
    		|  P0 P1  |  P2 P3  |  P0 P1  |
    		|  P0 P1  |  P2 P3  |  P0 P1  |  
    		-------------------------------	
    

    Wrapping can incur excessive overhead in computations that require interprocessor communication. For example:

        for i = 1 : N-1
            if (x[i] < 0)
                y[i] = x[i] - x[i-1];
            else
                y[i] = sqrt(x[i]) - x[i-1];
            end
        end
    

    Observe that x[i] and x[i-1] will lie on separate processors for every value of i. Compare this with regular partitioning, where x[i] and x[i-1] lie on separate processors only once in every N/P values, that is, only for the leftmost element of each partition, which contains N/P values. In practice we can expect N >> P, so communication is far less costly than for wrapped partitioning.

    We refer to the difficulty with wrapped partitioning as a surface-to-volume effect: the surface of each partition represents the amount of communication that must be done, and the volume represents the work. With wrapped partitioning the ratio is 1:1, whereas with regular partitioning the surface-to-volume ratio is 1:(N/P).

    We may reduce the surface-to-volume ratio by chunking: we apply the wrap pattern over larger units of work, wrapping K consecutive elements instead of just 1. This is shown below for the case of K=2 on 4 processors, which is designated as CYCLIC(2):
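
        +---------------------------------+
        |001122330011223300112233....     |
        +---------------------------------+

    In general, CYCLIC(K) assigns element i to processor (i / K) mod P, where the division is integer division.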

    There is, however, a limit to how far this technique can be taken, for as the chunks get larger, so does load imbalance. This granularity tradeoff is ubiquitous in load balancing, and is shown in the following graph; as we decrease granularity we reduce load imbalance, but at the cost of increased communication (disrupted locality).

    	     decreased load              increased load
    	     imbalance, but              imbalance, but
    	     increased comm.             decreased comm.
    
    	     |
    	     |
    	     |\                             /
    	time | \                           /
    	     |  \                         /
                 |   \_                     _/
                 |     \__               __/
                 |        \___       ___/
    	     |            \_____/
    	     |
    	     |
             +--------------------------------
    
    	                     ^       chunk size
                             |
    	                  optimal
    
    		Optimal chunk size balances communication
    		overhead against load imbalance
    

    In some cases load sharing with wrapped partitioning can be ineffective because of the high communication overheads. On a shared memory architecture one can employ processor self-scheduling, in which processors obtain work assignments from a shared structure such as a counter or queue.
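
    As an illustration, here is a minimal C/OpenMP sketch of self-scheduling the square-root loop with a shared counter; the use of OpenMP atomics and a chunk size of one element are assumptions made for the sketch.

        #include <math.h>

        void sqrt_self_scheduled(int n, double *x)
        {
            int next = 0;                       /* shared work counter */
            #pragma omp parallel
            {
                for (;;) {
                    int i;
                    #pragma omp atomic capture
                    i = next++;                 /* atomically claim the next element */
                    if (i >= n) break;
                    if (x[i] > 0.0)             /* only positive entries do the expensive sqrt */
                        x[i] = sqrt(x[i]);
                }
            }
        }

    In practice one would claim chunks of several elements at a time to reduce contention on the shared counter.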

    III.b. Irregular Decomposition

    Load sharing is a passive technique in that no attempt is made to actively respond to workload variations. An alternative is to actively measure workload variations and to partition the workloads non-uniformly so that each processor gets a fair piece of the work. The pieces of work assigned to processors have different sizes according to the spatial density distribution of the work. In effect this is an off-line algorithm.

    In general, non-uniform partitioning will only work if we have a good idea of the workload distribution. If the distribution changes with time, then we may have to periodically repartition the work. In most physical problems, the solution changes gradually enough that the cost of repartitioning won't be too great. Non-uniform partitioning strategies are numerous, and we'll discuss one common approach: Orthogonal Recursive Bisection (ORB), also known as Recursive Coordinate Bisection (RCB), a useful non-uniform partitioning strategy (M. J. Berger, S. H. Bokhari: "A Partitioning Strategy for Nonuniform Problems on Multiprocessors." IEEE Transactions on Computers 36(5): 570-580 (1987)).

    It works by splitting the computation into two equal parts (of work, not space), and recursively splitting each part until done. ORB works even when the number of processors is not a power of two, and it can be applied to multidimensional problems.

    Consider the following loop:

        for i = 1 : N-1
            if (x[i] < 0)
                y[i] = f1( x[i], x[i-1] );
            else
                y[i] = f2( x[i], x[i-1] );
            end
        end
    
    Assume that f1() takes twice as long to compute as f2( ). Now, if x[ ] changes very slowly, then we can build a workload density mapping giving the relative cost of updating each value of x[ ]. Let's assume that the mapping is as follows:
      1  1  2  1  2  2  2  1  1  1  1  1  1  2  2  1
    
    We split the above array at the point where the sum of the elements on either side of the split is as equal as possible.
      1  1  2  1  2  2  2  1  1  1  1  1  1  2  2  1
    		      ^
    		      |
    		   Cut here
    
    If we are running on 4 processors, then we split again, once for each part:
      1  1  2  1  2  2  2  1  1  1  1  1  1  2  2  1
    	     ^	      ^                 ^
                 |	      |		        |
    	   Cut #2   Cut #1	     Cut #3
    

    We now have 4 partitions with 5, 6, 6, and 5 units of work respectively.
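
    The following is a minimal C sketch of this one-dimensional bisection. It assumes positive integer weights and a power-of-two processor count; the names (orb1d, owner, and so on) are invented for the sketch.

        /* Recursively split w[lo..hi-1] so that each half receives about half of
           the total work, assigning processors p0 .. p0+np-1 to the pieces. */
        static void orb1d(const int *w, int lo, int hi, int p0, int np, int *owner)
        {
            if (np == 1) {                      /* leaf: one processor owns the whole range */
                for (int i = lo; i < hi; i++)
                    owner[i] = p0;
                return;
            }
            long total = 0;
            for (int i = lo; i < hi; i++)
                total += w[i];
            long left = 0;                      /* work accumulated to the left of the cut */
            int cut = lo;
            while (cut < hi && left + w[cut] <= (total + 1) / 2)
                left += w[cut++];               /* move the cut right until the halves balance */
            orb1d(w, lo, cut, p0, np / 2, owner);
            orb1d(w, cut, hi, p0 + np / 2, np - np / 2, owner);
        }

    Applied to the 16 weights above with 4 processors, this sketch reproduces the partitions of 5, 6, 6, and 5 units of work.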

    ORB reduces the higher dimensional partitioning problem to a set of simpler 1-dimensional partitioning problems, successively partitioning along orthogonal coordinate directions. Generally the cutting direction is varied at each level of the recursion to avoid elongated partitions, which have poor surface-to-volume ratios.

    The following figure shows some relatives of ORB:

    The Orthogonal Sectioning family of partitionings, from left to right: (a) ORB, (b) ORB-H, (c) rectilinear, and (d) ORB-MM.

    In some cases we may want to section a coordinate into more than two pieces. ORB-H or hierarchical ORB is a technique that employs multi-sectioning: it splits the problem into p > 2 strips and recurses on each strip. This technique is also referred to as multilevel ORB.

    The principal disadvantage of the orthogonal sectioning algorithms is the geometric constraint that all cuts must be straight-edged. Hence, a cut in d dimensions introduces a workload imbalance that is carried by a d-1 dimensional hyper-plane. A way around the difficulty is to section the dividing hyper-plane, too; this strategy is called ORB-MM, or ORB-``Median of Medians.'' (Shown above.) However, ORB-MM generates highly irregular partitionings in three or more space dimensions which for all practical purposes are unstructured. Since unstructured partitionings do not have a compact representation, they require additional software bookkeeping to manage non-rectangular iteration spaces appearing in application software. In addition, distributed mapping tables must be maintained on message passing architectures, at the cost of additional communication overheads and a further increase in the application's complexity.

    As we mentioned earlier, computations often exhibit temporal locality, whereby the solution changes gradually, for example, due to a time-step constraint. An incremental approach to orthogonal sectioning may effectively exploit temporal locality by computing work gradients across the partition boundaries and shifting the boundaries accordingly. However, complications arise when a boundary moves, since all other boundaries introduced later on in the recursion will be affected. In effect, we trade off temporal locality against spatial locality, and the trade-off may not be favorable. An approach which exploits temporal locality without the drawback of disrupting spatial locality is Nicol's rectilinear partitioning strategy. (Shown above.) This strategy avoids the difficulty of awkward workload adjustments by imposing a fixed connectivity on the partitionings. Partitionings are formed by the tensor product of strips taken across orthogonal coordinate directions. Load balance is improved by iteratively adjusting cuts independently along each problem dimension as necessary.

    Inverse space-filling partitioning (ISP) is an alternative technique that shares the desirable fine-grained load balancing characteristic of ORB-MM but with simplified bookkeeping. ISP is fast--- it does not incur the heavy time penalties of optimization techniques---and is therefore appropriate for dynamic problems. ISP works by drawing a Hamiltonian path through the mesh and subdividing the resultant 1-d path. This path is also referred to as an Inverse Spacefilling curve, from which the method ``Inverse Spacefilling Partitioning'' takes its name. Because ISP partitionings are logically 1-dimensional, they are easier to manipulate than unstructured partitionings rendered by strategies such as ORB-MM. (If you are interested in learning more about these see Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves, by J. R. Pilkington and S. B. Baden. IEEE Trans. on Parallel and Distributed Systems , 7(3), March 1996, pp. 288-300.)

    An application

    Let's consider a different type of solver for Laplace's Equation, in which we are to compute the solution within a disk instead of a square. Now, we may choose to represent the circular domain in various ways. The simplest (and, as we will see, not necessarily the most efficient) is to embed a disk of radius R in a square, but compute only over the points (i,j) such that i*i + j*j <= R^2. It should be clear that we don't want to compute the points outside the circle, since we would waste time computing on them (not to mention that we would have to keep resetting them to the appropriate boundary condition). More generally, the computation could be much more involved than computing a few floating point values. In a 3D problem, we could save as much as a factor of two in time if we restrict computation to the actual domain:

    { (i,j) | i*i + j*j <= R^2 }
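
    As an illustration, here is a minimal C sketch that sweeps only the points inside the disk. The (2R+1) x (2R+1) row-major layout with indices i, j running from -R to R is an assumption made for the sketch; the lecture specifies only the predicate i*i + j*j <= R^2.

        void smooth_disk(int R, const double *u, double *unew)
        {
            int w = 2 * R + 1;                  /* row length of the embedding square */
            for (int i = -R + 1; i < R; i++)
                for (int j = -R + 1; j < R; j++)
                    if (i * i + j * j <= R * R) /* update points inside the disk only */
                        unew[(i + R) * w + (j + R)] =
                            0.25 * (u[(i - 1 + R) * w + (j + R)] + u[(i + 1 + R) * w + (j + R)]
                                  + u[(i + R) * w + (j - 1 + R)] + u[(i + R) * w + (j + 1 + R)]);
        }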

    Software support for irregular decompositions is usually found in an application library. In addition to the decomposition utilities, we will also need some help in handling communication in message passing implementations. This is true because processors need to determine with which neighbors they will communicate. Data will generally be non-contiguous, and must be packed into contiguous messages for transmission, and unpacked on the receiving side. One way of dealing with this is to use a library like KeLP.
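
    As an illustration of the packing step, here is a minimal MPI sketch that gathers one non-contiguous column of a row-major array into a contiguous buffer before sending it. The function name send_column, the neighbor rank nbr, and the use of a plain MPI_Send are assumptions made for the sketch; a library like KeLP automates this bookkeeping.

        #include <mpi.h>

        /* Pack column j of an N x N row-major array into a contiguous buffer
           and send it to neighbor nbr. */
        void send_column(const double *u, int N, int j, int nbr)
        {
            double buf[N];                      /* C99 variable-length array */
            for (int i = 0; i < N; i++)
                buf[i] = u[i * N + j];          /* gather the strided column */
            MPI_Send(buf, N, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD);
        }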


