CSE 160 (Fall 2013)
Load balancing and data decomposition

Partitioning

Partitioning is the process of dividing up a computation among the processors. There are many ways of doing this, and the appropriate scheme depends on the application's requirements as well as on the hardware. Architectural factors include the granularity of communication and whether or not memory is shared. To determine the general partitioning requirements of an application we ask three questions: how is the work divided into tasks, how are the tasks assigned to processors, and when must the processors communicate or synchronize?

In what follows, we focus on data parallelism rather than task parallelism.

I. The granularity-locality tradeoff

As we know, there is a large difference between the cost of accessing cache and the cost of accessing main memory. The point of managing cache locality is to ensure that frequently accessed data are likely to be found in cache. Otherwise, we must move data from main memory into cache, a process that we consider to be communication. As we shall see, work partitioning strategies affect locality as well as task granularity, and the two issues are related. Task granularity designates the amount of "useful" work done between successive communications. We usually quantify granularity in terms of the number of processor cycles (CPs), or the amount of floating point work, that could be done between data transfers.

In the finest grain execution, at most a few CPs elapse between synchronization points. SPMD programs generally employ a much coarser granularity, on the order of millions or even billions of CPs. It is very important that the granularity of computation match the communication cost: if the cost is high, then the granularity must be large enough to offset it.

The choice of granularity is also affected by considerations of locality. We improve locality when we make the grains larger, and we disrupt locality when we make them smaller. However, as we will see later on, excessive increases in granularity can introduce severe load imbalance, in which only a few processors effectively carry the workload.

II. Uniform static partitioning

A large class of applications is organized around a mesh, and entails updating each point of the mesh as a function of its nearest neighbors. Moreover, the work required to compute each element of the solution--a single point [i,j]--is constant. Consider a 2D image smoothing application, where each element Y[i,j] is the average of the values neighboring X[i,j]:

for i = 1 : N-2
    for j = 1 : N-2
        Y[i,j] = (X[i-1,j] + X[i+1,j] + X[i,j-1] + X[i,j+1]) / 4
    end
end

For such an application we may employ a uniform partitioning, which we've previously discussed: each processor receives an equal-sized, contiguous block of the mesh, and since every point costs the same to update, the workload is balanced.
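To make this concrete, here is a minimal sketch of a uniform partitioning of the smoothing loop using C++11 threads. The names (N, P, smoothRows) and the row-wise split are illustrative assumptions, not part of the notes; each thread updates a contiguous block of interior rows.

    #include <thread>
    #include <vector>

    const int N = 1024, P = 4;
    std::vector<std::vector<double>> X(N, std::vector<double>(N, 1.0)), Y = X;

    // Each thread smooths a contiguous block of interior rows [lo, hi).
    void smoothRows(int lo, int hi) {
        for (int i = lo; i < hi; i++)
            for (int j = 1; j < N - 1; j++)
                Y[i][j] = (X[i-1][j] + X[i+1][j] + X[i][j-1] + X[i][j+1]) / 4;
    }

    int main() {
        std::vector<std::thread> threads;
        int rowsPer = (N - 2) / P;                  // interior rows only
        for (int p = 0; p < P; p++) {
            int lo = 1 + p * rowsPer;
            int hi = (p == P - 1) ? N - 1 : lo + rowsPer;
            threads.emplace_back(smoothRows, lo, hi);
        }
        for (auto& t : threads) t.join();
    }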

III. Non-uniform decompositions and load balancing

Consider the following loop:

    for i = 0 : N-1
        if (x[i] > 0)
            x[i] = sqrt(x[i]);
        end
    end

with the distribution of negative numbers in x[ ] as follows:

		+----------+----------+----------+----------+
		|    <0    |    <0    |    <0    |    >0    |
		+----------+----------+----------+----------+

		     p0         p1         p2         p3

The load is carried by only one processor, and therefore no parallel speedup is realized. We say that the workload is imbalanced. The load imbalance arises because essentially no work is done on the negative numbers: on modern microprocessors, taking a square root costs roughly an order of magnitude more than making a comparison.

In general, the distribution of negative values in x[ ] may not be known a priori (in advance) and may also change dynamically. There are two ways of dealing with this problem:

III.a  Load Sharing

With load sharing we do not attempt to measure the workload imbalance. Rather, we rely on statistical properties of the workload distribution to let each processor obtain a fair share of the workload. When memory isn't shared, we require an alternative to processor self-scheduling.

An effective technique, used, for example, to solve systems of linear equations (e.g. Gaussian elimination), is called cyclic partitioning (another strategy is client-server, which we discussed earlier in the course). This strategy assigns data to processors in round-robin fashion: element i of array x[ ] is assigned to processor i mod P. This is shown for the case of P = 4 processors:

    +---------------------------------+
    |012301230123....                 |
    +---------------------------------+

Cyclic partitioning assigns the work evenly so long as the pattern of negative and positive values in x[ ] doesn't correlate with the repeating pattern of the wrapping strategy. (The strategy would fail, for example, if every 4th value of the array in our square root calculation were negative: all the cheap elements would then land on a single processor.)
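As a concrete illustration, here is a minimal sketch of the cyclic decomposition applied to the square root loop, again using C++11 threads; the names (N, P, cyclicWork) are illustrative assumptions. Thread rank handles exactly the elements with i mod P == rank:

    #include <cmath>
    #include <thread>
    #include <vector>

    const int N = 1 << 20, P = 4;
    std::vector<double> x(N, 1.0);

    // Thread `rank` owns elements rank, rank+P, rank+2P, ..., so runs of
    // expensive (positive) values are dealt out evenly across threads.
    void cyclicWork(int rank) {
        for (int i = rank; i < N; i += P)
            if (x[i] > 0) x[i] = std::sqrt(x[i]);
    }

    int main() {
        std::vector<std::thread> threads;
        for (int p = 0; p < P; p++) threads.emplace_back(cyclicWork, p);
        for (auto& t : threads) t.join();
    }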

We use the notation CYCLIC to denote a cyclic decomposition. Cyclic mappings also apply to higher dimensional arrays, e.g. (CYCLIC, CYCLIC), shown here for 4 processors arranged as a 2 x 2 grid:

		-------------------------------	
		|  P0 P1  |  P0 P1  |  P0 P1  |
		|  P2 P3  |  P2 P3  |  P2 P3  |    
		-------------------------------	    
		|  P0 P1  |  P0 P1  |  P0 P1  |
		|  P2 P3  |  P2 P3  |  P2 P3  |  
		-------------------------------	

(*, CYCLIC) and so on:

		-------------------------------	
		|  P0 P1  |  P2 P3  |  P0 P1  |
		|  P0 P1  |  P2 P3  |  P0 P1  |    
		|  P0 P1  |  P2 P3  |  P0 P1  |
		|  P0 P1  |  P2 P3  |  P0 P1  |  
		-------------------------------	
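The owner of an element under these mappings is easy to compute. Below is a minimal sketch, assuming a 2 x 2 processor grid as in the first figure, that prints the (CYCLIC, CYCLIC) assignment; the name `owner` is illustrative:

    #include <cstdio>

    const int Pr = 2, Pc = 2;   // 2 x 2 grid: P0 P1 / P2 P3

    // Element (i,j) maps to grid position (i mod Pr, j mod Pc).
    int owner(int i, int j) { return (i % Pr) * Pc + (j % Pc); }

    int main() {
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 6; j++)
                printf("P%d ", owner(i, j));
            printf("\n");
        }
    }

Running it reproduces the repeating P0 P1 / P2 P3 pattern shown above.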

Wrapping can incur high data transfer overheads. For example, in the following loop, x[i] and x[i-1] will have been computed by different cores for every value of i:

    for i = 1 : N-1
        if (x[i] < 0)
            y[i] = x[i] - x[i-1];
        else
            y[i] = sqrt(x[i]) - x[i-1];
        end
    end

Thus, every core must load its left neighbor's value into its local cache, and it will not reuse that value, since the left neighbor computes it. The result is that all values must be loaded P times, where P is the number of processors, for a total of N*P loads. Compare with regular partitioning, where x[i] and x[i-1] are computed by different processors only 1 time in N/P, that is, only for the leftmost element of each partition of N/P values. A core thus computes a consecutive set of values and reuses them; it doesn't have to load a neighbor's value for every element of x[ ]. In practice we can expect N >> P, so the data transfer cost is far lower than for wrapped partitioning: each processor communicates N/P + 1 values, so a total of N + P values are loaded, a much smaller number than N*P.

We refer to the difficulty with wrapped partitioning as a surface-to-volume effect: the surface of each partition represents the amount of communication that must be done, and the volume represents the work. With wrapped partitioning the ratio is 1:1, whereas with regular partitioning the surface-to-volume ratio is 1:(N/P).

We may reduce the surface-to-volume ratio by chunking: we apply the wrap pattern over larger units of work, wrapping K consecutive elements instead of just 1. This is shown for the case of K=2 on 4 processors, which is designated CYCLIC(2):

    +---------------------------------+
    |001122330011223300112233....     |
    +---------------------------------+
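A minimal sketch of the chunked (block-cyclic) assignment follows, again using C++11 threads; the names (N, P, K, chunkedWork) are illustrative assumptions. Thread rank takes its first chunk at rank*K and then strides by P*K:

    #include <algorithm>
    #include <cmath>
    #include <thread>
    #include <vector>

    const int N = 1 << 20, P = 4, K = 2;   // K is the chunk size
    std::vector<double> x(N, 1.0);

    // CYCLIC(K): thread `rank` owns chunks of K consecutive elements,
    // spaced P*K apart, so neighbor loads occur only at chunk boundaries.
    void chunkedWork(int rank) {
        for (int start = rank * K; start < N; start += P * K)
            for (int i = start; i < std::min(start + K, N); i++)
                if (x[i] > 0) x[i] = std::sqrt(x[i]);
    }

    int main() {
        std::vector<std::thread> threads;
        for (int p = 0; p < P; p++) threads.emplace_back(chunkedWork, p);
        for (auto& t : threads) t.join();
    }

Increasing K lowers the communication, since each thread loads only one neighbor value per chunk, but as noted below, overly large chunks reintroduce load imbalance.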

There is, however, a limit to how far this technique can be taken, for as the chunks get larger, so does the load imbalance. This granularity tradeoff is ubiquitous in load balancing and is shown in the following graph: as we decrease the chunk size we reduce load imbalance, but at the cost of increased communication from disrupted locality. There is usually a "sweet spot" where some chunk size (or range of sizes) optimally trades off communication overhead against the cost of load imbalance.

         decreased load              increased load
         imbalance, but              imbalance, but
         increased comm.             decreased comm.

         |
         |\                             /
    time | \                           /
         |  \                         /
         |   \_                     _/
         |     \__               __/
         |        \___       ___/
         |            \_____/
         |
         +--------------------------------
                         ^        chunk size
                   "sweet spot"

In some cases load sharing with wrapped partitioning can be ineffective because of the high communication overheads. On shared memory architectures one can instead employ processor self-scheduling, in which processors obtain work assignments from a shared structure such as a counter or queue. OpenMP implements this strategy with dynamic scheduling.
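A minimal sketch of this in OpenMP is shown below; the chunk size of 64 is an illustrative assumption. With schedule(dynamic, chunk), idle threads repeatedly grab the next chunk of iterations from a shared counter maintained by the runtime:

    #include <cmath>
    #include <vector>

    int main() {
        const int N = 1 << 20, chunk = 64;
        std::vector<double> x(N, 1.0);

        // Iterations are handed out in chunks of 64, on demand, so threads
        // that draw cheap (negative) elements simply come back for more work.
        #pragma omp parallel for schedule(dynamic, chunk)
        for (int i = 0; i < N; i++)
            if (x[i] > 0) x[i] = std::sqrt(x[i]);
    }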


Copyright © 2013 Scott B. Baden   [Sun Oct 27 19:35:04 PDT 2013]
