Lecture 7 (1/29/02): Partitioning and Load Balancing


Partitioning is the process of dividing up a computation among processors.  There are often many ways of doing this, and the appropriate scheme depends on our application's requirements as well as the hardware.  Architectural factors include the granularity of communication and whether or not memory is shared.  We ask three questions in determining the general partitioning requirements of an application:

  • Can the application effectively employ data parallelism, in which we divide up the data, or function parallelism, in which we divide up the code into separate tasks, or both?
  • Does the application spend equal amounts of time updating each piece or element of the solution?
  • If the application spends unequal amounts of time updating each element of the solution, is the variation static or dynamic?

    I. The granularity-locality tradeoff

    In a parallel computer there is a large difference between the cost of going to local memory and the cost of making a remote access, such as passing a message. The differential can be 100:1 or even as high as 1000:1. In a sense the central problem in parallel computing is managing locality, and we will return to this theme many times throughout the course. The point of managing locality is to ensure that frequently accessed data are likely to be located in local memory, for much the same reason that we want to manage on-chip cache effectively.

    Although managing locality is an important part of building a high performance parallel program, task granularity is significant, too, and the two issues are related. Task granularity designates the amount of "useful" work done between synchronization points or between successive message transmissions. We usually quantify granularity in terms of the number of processor cycles (CP) or the amount of floating point work that could be done between consecutive communication points or barrier synchronizations.

    In the finest-grain execution, only a few CPs occur between synchronization points. SPMD programs generally employ a much coarser granularity, on the order of millions or even billions of CPs. It is very important that the granularity of computation match the speed of the processor interconnect. If the interconnect is slow, then the granularity must be large enough to offset the cost of communication.

    The granularity selection process is affected by considerations of locality. We improve locality when we make the grains larger, and we disrupt locality when we make them smaller. However, as we will see later on, excessive increases in granularity can introduce severe load imbalance, where only a few processors effectively carry the workload.

    II. Uniform static partitioning

    A large class of applications is organized around a mesh, and entails updating each point of the mesh as a function of its nearest neighbors. Moreover, the work required to compute each element of the solution--a single point [i,j]--is constant. Consider a 2D image smoothing application, where each element Y[i,j] is a function of neighboring values of X[i,j]:

    for i = 1 : N-2
        for j = 1 : N-2
    	Y[i,j] = (X[i-1,j] + X[i+1,j] + X[i,j-1] + X[i,j+1]) / 4

    For such an application we may employ a uniform partitioning, which we've previously discussed.
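
A uniform partitioning of the smoothing loop can be sketched as follows: each of P processors receives a contiguous band of rows. This is a minimal illustration; the function name and the row-band scheme are illustrative, not code from the lecture.

```python
# Uniform (block) row partitioning for an N x N mesh over P processors.
# Each processor gets a contiguous band of rows, as even as possible.
def block_rows(N, P):
    """Return (lo, hi) half-open row ranges, one per processor."""
    base, extra = divmod(N, P)
    ranges, lo = [], 0
    for p in range(P):
        hi = lo + base + (1 if p < extra else 0)  # first `extra` procs get one more row
        ranges.append((lo, hi))
        lo = hi
    return ranges

print(block_rows(16, 4))  # -> [(0, 4), (4, 8), (8, 12), (12, 16)]
```

Because the work per mesh point is constant, equal-sized bands yield equal workloads, which is exactly why uniform partitioning suffices here.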

    III. Non-uniform decompositions and load balancing

    Consider the following loop:

    	for  i = 0 : N-1
    	    if (x[i] > 0)
    		 x[i] = sqrt(x[i]);
    with the distribution of negative numbers in x[ ] as follows:
    		|    <0    |    <0    |    <0    |    >0    |
    		     p0         p1         p2         p3
    The load is carried by only one processor, and therefore no parallel speedup is realized. We say that the workloads are imbalanced. The load imbalance arises because no work is done on the negative numbers: on modern microprocessors, the time to take a square root is roughly an order of magnitude greater than that of a comparison.

    In general, the distribution of negative values of x[i] may not be known a priori (in advance) and may also change dynamically. There are two ways of dealing with this problem:

  • Load sharing
  • Non-uniform partitioning

    III.a Load Sharing

    With load sharing we do not attempt to measure the workload imbalance. Rather, we rely on statistical properties of the workload distribution to let each processor obtain a fair share of the workload.

    III.a.1 Cyclic partitioning

    The simplest load sharing strategy is called cyclic or wrapped partitioning. With cyclic partitioning we assign the elements round robin to the processors: element i of array x[ ] is assigned to processor i mod P. This is shown as follows for P=4 processors:
        |012301230123....                 |

    Cyclic partitioning can evenly assign the work so long as the pattern of negative and positive values of x[ ] doesn't correlate with the repeating pattern of the wrapping strategy (as it would, for example, if every 4th value of the array in our square root calculation were negative).
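
We can check this claim on the square-root workload from above. In the following sketch the negative values fill the first three quarters of x[ ], yet the i mod P mapping hands each processor an equal count of expensive (positive) elements. The array contents are illustrative.

```python
# Cyclic partitioning of the square-root loop's workload: negatives
# occupy the first 3/4 of x[ ], positives the last 1/4, as in the
# block-partitioned diagram where only p3 carried any load.
N, P = 64, 4
x = [-1.0] * (3 * N // 4) + [1.0] * (N // 4)

work = [0] * P
for i in range(N):
    if x[i] > 0:           # only positive elements incur the sqrt
        work[i % P] += 1   # element i belongs to processor i mod P

print(work)  # -> [4, 4, 4, 4]
```

Under block partitioning the same workload would have been [0, 0, 0, 16]; the wrap spreads it evenly because the positive run does not correlate with the period-P pattern.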

    We use the notation CYCLIC to denote a cyclic decomposition. Cyclic mappings also apply to higher dimensional arrays, e.g. (CYCLIC, CYCLIC):

    		|  P0 P1  |  P0 P1  |  P0 P1  |
    		|  P2 P3  |  P2 P3  |  P2 P3  |    
    		|  P0 P1  |  P0 P1  |  P0 P1  |
    		|  P2 P3  |  P2 P3  |  P2 P3  |  

    (*, CYCLIC) and so on:

    		|  P0 P1  |  P2 P3  |  P0 P1  |
    		|  P0 P1  |  P2 P3  |  P0 P1  |    
    		|  P0 P1  |  P2 P3  |  P0 P1  |
    		|  P0 P1  |  P2 P3  |  P0 P1  |  

    Wrapping can incur excessive communication overhead in computations requiring interprocessor communication. For example:

    for i = 1 : N-1
        if (x[i] < 0)
            y[i] = x[i] - x[i-1];
        else
            y[i] = sqrt(x[i]) - x[i-1];

    Observe that x[i] and x[i-1] will lie on separate processors for every value of i. Compare this with regular partitioning, where x[i] and x[i-1] lie on separate processors only 1 time in N/P, that is, for only the leftmost element of each partition, which contains N/P values. In practice we can expect N >> P, so communication is far less costly than for wrapped partitioning.

    We refer to the difficulty with wrapped partitioning as a surface-to-volume effect: the surface of each partition represents the amount of communication that must be done, and the volume represents the work. With wrapped partitioning the surface-to-volume ratio is 1:1, whereas with regular partitioning it is 1:(N/P).
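
The surface-to-volume claim is easy to quantify: count how often neighboring elements x[i] and x[i-1] land on different owners under each mapping. The sketch below uses illustrative values of N and P.

```python
# Count processor-boundary crossings between neighbors x[i], x[i-1]
# under block vs. cyclic ownership. Block crosses only P-1 times;
# cyclic crosses at every step.
N, P = 32, 4

def block_owner(i):   return i // (N // P)   # contiguous chunks of N/P
def cyclic_owner(i):  return i % P           # round-robin wrap

block_cuts  = sum(block_owner(i)  != block_owner(i - 1)  for i in range(1, N))
cyclic_cuts = sum(cyclic_owner(i) != cyclic_owner(i - 1) for i in range(1, N))

print(block_cuts, cyclic_cuts)  # -> 3 31
```

Every neighbor pair that straddles an ownership boundary costs a message, so cyclic partitioning communicates for nearly all N-1 pairs while block partitioning communicates only P-1 times.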

    We may reduce the surface-to-volume ratio by chunking: we apply the wrap pattern over larger units of work, wrapping K consecutive elements instead of just 1. This is shown for the case of K=3 on 2 processors, which is designated CYCLIC(3):

    	+--------------------------------+  processors get
    	|000111000111.....               |  alternating, 
    	+--------------------------------+  small partitions
    	chunk size
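
The block-cyclic ownership rule behind CYCLIC(K) is a one-liner: element i belongs to processor (i div K) mod P. A quick check reproduces the 000111... pattern in the diagram above.

```python
# Block-cyclic mapping CYCLIC(K): consecutive chunks of K elements
# are dealt round robin to the P processors.
def owner(i, K, P):
    return (i // K) % P

# K=3, P=2 reproduces the alternating chunks from the diagram.
pattern = ''.join(str(owner(i, 3, 2)) for i in range(12))
print(pattern)  # -> 000111000111
```

Note that K=1 recovers plain CYCLIC, and K=N/P recovers the regular block partitioning, so the chunk size interpolates between the two extremes.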

    There is, however, a limit to how far this technique can be taken, for as the chunks get larger, so does the load imbalance. This granularity tradeoff is ubiquitous in load balancing, and is shown in the following graph: as we decrease granularity we reduce load imbalance, but at the cost of disrupting locality.

    	     decreased load              increased load
    	     imbalance, but              imbalance, but
    	     increased comm.             decreased comm.
    	          \                          /
    	time |     \                        /
    	     |      \_                    _/
    	     |        \__              __/
    	     |           \___      ___/
    	     |               \____/
    	     +------------------^---------- chunk size
    	        Optimal chunk size balances communication
    	        overhead against load imbalance

    In some cases load sharing with wrapped partitionings can be ineffective because of high communication overheads. On shared memory architectures one can employ processor self-scheduling, in which processors access a shared structure, such as a counter or queue, for work assignments.

    III.a.2 Master/Slave

    In some applications we may also employ a centralized load sharing algorithm which employs some form of work scheduling. Work scheduling relies on shared work queues or counters, or employs a single processor to hand out the work to the other processors.

    With the latter approach we dedicate one processor to the task of handing out work. We often refer to this dedicated processor as the manager. The other processors wait for a message containing some input; upon receipt of that input they compute, and then return their result to the manager when done--or they may simply do some I/O. The manager sends out additional work as the workers complete their tasks. When there is no more work, the manager informs each processor in turn. When the last piece of work is returned to the manager, all exit. This approach works quite well if the tasks require no communication—often called embarrassingly parallel—or if the cost of handling a task's input and output is small compared to the amount of work performed by the task.
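
The manager/worker protocol can be sketched with threads and a shared queue standing in for message passing. This is a minimal illustration, not the lecture's own code: the task (a square root) and the None-as-"no more work" sentinel are assumptions of the sketch.

```python
# Minimal manager/worker sketch: the manager hands out tasks through
# a queue; a None sentinel tells each worker there is no more work.
import math
import queue
import threading

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:               # "no more work" message
            break
        i, value = item                # compute on the input ...
        results.put((i, math.sqrt(value)))  # ... and return the result

def manager(data, nworkers=4):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(nworkers)]
    for t in threads: t.start()
    for i, v in enumerate(data):       # hand out one task per element
        tasks.put((i, v))
    for _ in threads:                  # inform each worker in turn
        tasks.put(None)
    for t in threads: t.join()
    out = [0.0] * len(data)
    while not results.empty():
        i, y = results.get()
        out[i] = y
    return out

print(manager([1.0, 4.0, 9.0, 16.0]))  # -> [1.0, 2.0, 3.0, 4.0]
```

Because workers pull tasks as they finish, fast and slow tasks balance out automatically, which is the point of the centralized scheme.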

    III.b Non-uniform partitioning

    Load sharing is a passive technique in that no attempt is made to actively respond to workload variations. An alternative is to actively measure workload variations and to partition the workloads non-uniformly so that each processor gets a fair share of the work. The pieces of work assigned to processors have different sizes according to the spatial density distribution of the work. In effect this is an off-line algorithm.

    In general, non-uniform partitioning will only work if we have a good idea of the workload distribution. If the distribution changes with time, then we may have to periodically repartition the work. In most physical problems, the solution changes gradually enough that the cost of repartitioning won't be too great.

    III.b.1 Orthogonal Recursive Bisection

    Orthogonal Recursive Bisection (ORB), also known as Recursive Coordinate Bisection (RCB), is a useful non-uniform partitioning strategy. It works by splitting the computation into two equal parts (of work, not space), and recursively splitting each part until done. ORB works even when the number of processors is not a power of two, and it can be applied to multidimensional problems. (We will talk more about load balancing later in the course.)

    Consider the following loop:

      for (i = 1; i < N-1; i++)
    	if (x[i] < 0)
    	    y[i] = f1( x[i], x[i-1] );
    	else
    	    y[i] = f2( x[i], x[i-1] );
    	end if
    Assume that f1() takes twice as long to compute as f2( ). Now, if x[ ] changes very slowly, then we can build a workload density mapping giving the relative cost of updating each value of x[ ]. Let's assume that the mapping is as follows:
      1  1  2  1  2  2  2  1  1  1  1  1  1  2  2  1
    We split the above array at the point where the sum of the elements on either side of the split is as equal as possible.
      1  1  2  1  2  2  2  1  1  1  1  1  1  2  2  1
    		   Cut here
    If we are running on 4 processors, then we split again, once for each part:
      1  1  2  1  2  2  2  1  1  1  1  1  1  2  2  1
    	     ^	      ^                 ^
                 |	      |		        |
    	   Cut #2   Cut #1	     Cut #3

    We now have 4 partitions with 5, 6, 6, and 5 units of work respectively.
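
The 1-D splitting step above can be sketched directly: cut where the two sides are as equal as possible, then recurse on each half. One assumption of this sketch is the tie-breaking rule (ties prefer the later cut), which reproduces the partition sums 5, 6, 6, and 5 worked out above.

```python
# 1-D ORB over a workload density array: find the cut that makes the
# two sides' sums as equal as possible, then recurse on each side.
def best_cut(w):
    total, run = sum(w), 0
    best_k, best_diff = 1, None
    for k in range(1, len(w)):
        run += w[k - 1]
        diff = abs(run - (total - run))
        if best_diff is None or diff <= best_diff:  # ties prefer later cut
            best_k, best_diff = k, diff
    return best_k

def orb1d(w, parts):
    if parts == 1:
        return [w]
    k = best_cut(w)
    return orb1d(w[:k], parts // 2) + orb1d(w[k:], parts // 2)

weights = [1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1]
print([sum(p) for p in orb1d(weights, 4)])  # -> [5, 6, 6, 5]
```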

    ORB reduces the higher-dimensional partitioning problem to a set of simpler 1-dimensional partitioning problems, successively partitioning along orthogonal coordinate directions. Generally the cutting direction is varied at each level of the recursion to avoid elongated partitions that could lead to poorly balanced workloads.

    We next discuss relatives of ORB, some of which are shown in the figure below:

    [Figure: The Orthogonal Sectioning family of partitionings, from left to right: (a) ORB, (b) ORB-H, (c) rectilinear, and (d) ORB-MM.]

    In some cases we may want to section a coordinate into more than two pieces. ORB-H or hierarchical ORB is a technique that employs multi-sectioning: it splits the problem into p > 2 strips and recurses on each strip. This technique is also referred to as multilevel ORB.

    The principal disadvantage of the orthogonal sectioning algorithms is the geometric constraint that all cuts must be straight-edged. Hence, a cut in d dimensions introduces a workload imbalance that is carried by a d-1 dimensional hyper-plane. A way around the difficulty is to section the dividing hyper-plane, too; this strategy is called ORB-MM, or ORB-``Median of Medians.'' (Shown above.) However, ORB-MM generates highly irregular partitionings in three or more space dimensions which for all practical purposes are unstructured. Since unstructured partitionings do not have a compact representation, they require additional software bookkeeping to manage non-rectangular iteration spaces appearing in application software. In addition, distributed mapping tables must be maintained on message passing architectures, at the cost of additional communication overheads and a further increase in the application's complexity.

    As we mentioned earlier, computations often exhibit temporal locality, whereby the solution changes gradually, for example, due to a time-step constraint. An incremental approach to orthogonal sectioning may effectively exploit temporal locality by computing work gradients across the partition boundaries and shifting the boundaries accordingly. However, complications arise when a boundary moves, since all other boundaries introduced later on in the recursion will be affected. In effect, we trade off temporal locality against spatial locality, and the trade-off may not be favorable. An approach which exploits temporal locality without the drawback of disrupting spatial locality is Nicol's rectilinear partitioning strategy. (Shown above.) This strategy avoids the difficulty of awkward workload adjustments by imposing a fixed connectivity on the partitionings. Partitionings are formed by the tensor product of strips taken across orthogonal coordinate directions. Load balance is improved by iteratively adjusting cuts independently along each problem dimension as necessary.

    Inverse space-filling partitioning (ISP) is an alternative technique that shares the desirable fine-grained load balancing characteristic of ORB-MM but with simplified bookkeeping. ISP is fast--- it does not incur the heavy time penalties of optimization techniques---and is therefore appropriate for dynamic problems. ISP works by drawing a Hamiltonian path through the mesh and subdividing the resultant 1-d path. This path is also referred to as an Inverse Spacefilling curve, from which the method ``Inverse Spacefilling Partitioning'' takes its name. Because ISP partitionings are logically 1-dimensional, they are easier to manipulate than unstructured partitionings rendered by strategies such as ORB-MM. We will also talk about these later in the course. (If you are interested in learning more in the meantime, see Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves, by J. R. Pilkington and S. B. Baden. IEEE Trans. on Parallel and Distributed Systems , 7(3), March 1996, pp. 288-300.)
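
The ISP idea can be sketched in a few lines: order the mesh cells along a space-filling curve, then slice the resulting 1-D list into equal-work pieces. For brevity this sketch uses a Morton (Z-order) key rather than the Hilbert curve the method is usually built on; the grid size, weights, and processor count are all illustrative.

```python
# ISP sketch: linearize a 2-D mesh along a space-filling curve
# (Morton order here, for simplicity), then split the 1-D list at
# equal-work thresholds.
def morton(i, j, bits=4):
    """Interleave the bits of (i, j) into a single curve index."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b + 1) | ((j >> b) & 1) << (2 * b)
    return key

n, P = 16, 4
cells = sorted(((i, j) for i in range(n) for j in range(n)),
               key=lambda c: morton(*c))
weight = lambda c: 2 if c[0] < n // 2 else 1   # work concentrated in top half
total = sum(weight(c) for c in cells)

parts, cum, p = [0] * P, 0, 0
for c in cells:
    if p < P - 1 and cum >= (p + 1) * total / P:  # current piece is full
        p += 1
    parts[p] += weight(c)
    cum += weight(c)

print(parts)  # -> [96, 96, 96, 96]
```

Because the partitions are contiguous runs of a single 1-D ordering, moving a boundary between neighboring processors is trivial, which is what makes ISP attractive for dynamic problems.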

    III.b.2 Optimization techniques

    In addition to the above techniques there are also optimization methods. Typical approaches are based on simulated annealing, neural networks, and genetic algorithms. Approaches based on optimization are attractive because they balance workloads particularly well---they are fine-grained, mapping each point individually to a processor---and they also minimize communication. However, the benefits must be weighed against the cost: long and sometimes unpredictable running times. For this reason, optimization techniques are restricted to static problems. Later in the course, we will look at an optimization-type partitioning technique known as spectral graph decomposition. Optimization techniques have also been applied to coarse-grain applications known as multiblock methods. In these applications, we represent the communication structure of the blocks as a graph, and apply graph partitioning. These are discussed in the on-line Demmel reader.

    III.b.3 An application for non-uniform decomposition in multiple dimensions

    Let's consider a different type of solver for Laplace's Equation, in which we are to compute the solution within a disk instead of a square. Now, we may choose to represent the circular domain in various ways. The simplest (and, as we will see, not necessarily the most efficient) is to embed a disk of radius R in a square, but compute only over the points (i,j) such that i*i + j*j <= R^2. It should be clear that we don't want to compute the points outside the circle, since we would waste time computing on them (not to mention that we would have to keep resetting them to the appropriate boundary condition). More generally, the computation could be much more involved than computing a few floating point values. In a 3D problem, we could save as much as a factor of two in time if we restrict computation to the actual domain.
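
Counting the mesh points inside the disk makes the savings concrete: the disk covers about pi/4 (roughly 79%) of its bounding square, so computing only the interior points avoids about a fifth of the 2D work. The radius below is illustrative.

```python
# Fraction of the bounding square's mesh points that lie inside the
# disk i*i + j*j <= R^2. For large R this approaches pi/4 ~ 0.785.
import math

R = 200
inside = sum(1 for i in range(-R, R + 1)
               for j in range(-R, R + 1)
               if i * i + j * j <= R * R)
total = (2 * R + 1) ** 2

print(inside / total)  # close to pi/4
```

The 3D analogue is a sphere in a cube, whose volume fraction is pi/6 (about 52%), which is where the "factor of two" savings comes from.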

    Software support for irregular decompositions is usually found in an application library. In addition to the decomposition utilities, we will also need some help in handling communication in message passing implementations. This is true because processors need to determine which neighbors they will communicate with. Data will generally be non-contiguous, and must be packed into contiguous messages for transmission and unpacked on the receiving side. One way of dealing with this is to use a library like KeLP, which has been developed at UCSD.

    Having now discussed various load balancing approaches, we can appreciate the differences between load sharing and non-uniform partitioning. With load sharing, no attempt is made to measure the workload or to explicitly reduce imbalance. With non-uniform partitioning, the workload distribution is assessed, and that information is used to reduce workload imbalance. The subject of non-uniform partitioning is rich and diverse, and we have only just touched upon it in this course.

    Copyright © 2002 Scott B. Baden. Last modified: 01/27/02 07:24 PM