CSE 262, Lecture 15 (3/5/02): Data parallel languages

(Note: these notes are based in part on on-line lecture notes written by Jim Demmel at UC Berkeley)

So far we have been programming with message passing. This is often called ``SPMD programming'' (SPMD = Single Program, Multiple Data). We will now look at a different kind of parallel programming model, called Data Parallelism. With SPMD programming a single program text is executed by all processors. This is an explicit programming model in the sense that the programmer has to orchestrate data motion and synchronization, and to contend with system-dependent issues concerning the hardware:
  • How is parallelism expressed?
  • How is communication between processors expressed?
  • How is synchronization between processors expressed?
  • What global/local data structures can be constructed?
  • How do you optimize for performance?

Data Parallel Programming

    In data parallel programming, the programmer has the illusion of a single program running in a single address space. Parallelism arises when operating on arrays (or sometimes lists or sequences), with each array element conceptually being processed in parallel on its own virtual processor. Any interprocessor communication is implicit in the array operation. This is a natural model for an SIMD computer, but can also be implemented under MIMD.

    Data parallel languages go back many years, to the days of the ILLIAC IV (1960s), and saw a resurgence in the 1980s with Connection Machine Fortran (CMF was targeted to Thinking Machines' CM-2, an SIMD machine, and CM-5, an MIMD machine). High Performance Fortran (HPF) was later devised by committee. Today, data parallelism arises in Fortran 90 and 95 and in Matlab. To see how it works, let us look at an example we've seen earlier in the course: the solution of an ordinary differential equation.

    Let U[ ] and UNew[ ] be arrays. The following statement computes the difference between the new and old solutions at all points.

          dU = ( UNew - U )
    The arrays must be conformable, i.e. have the same size and shape, in order to be operated on together in this way. The above array expression has an implicit loop. The sequential equivalent of this would be
         for i = 0 to n+1
             dU[i] = UNew[i]  - U[i]
    The compiler will automatically assume that array operations are to be done in parallel, provided the arrays are declared to reside on more than one processor, as we explain below.
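    The semantics of the implicit loop can be sketched in Python (made-up data; plain lists stand in for the Fortran arrays):

```python
# Elementwise difference dU = UNew - U, written out as the implicit loop.
U    = [1.0, 2.0, 4.0, 8.0]
UNew = [1.5, 2.5, 4.5, 8.5]

# Conceptually every iteration runs at once on its own virtual processor;
# the arrays must be conformable (same size and shape).
dU = [un - u for un, u in zip(UNew, U)]
```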

    Continuing our comparison of the codes, consider the convergence computation, which measures the size of the error:

      Err = sqrt ( sum ( dU*dU ) )
    We ignore the fact that Err is a scalar and concentrate on the computation on the right-hand side. As before, the computation dU*dU involves an implicit loop. The intrinsic function sum() is applied to the entire vector quantity dU*dU and reduces the vector to a scalar. Operations like sum(), prod(), max(), min() are called reduction operations because they reduce arrays to scalars.
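    The reduction can be sketched in Python (made-up residual values):

```python
import math

# sum() plays the role of the reduction intrinsic: it collapses the
# elementwise product dU*dU to a scalar, which sqrt() turns into Err.
dU = [0.5, -0.5, 0.5, -0.5]
Err = math.sqrt(sum(d * d for d in dU))
```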

    HPF provides a Forall statement, which is a parallel for loop. We can compute the sweep in the solution mesh using the following statement:

            forall (i=1:N) UNew(i) = (U(i+1) + U(i-1) - 8*h^2)/2.0
    With Forall all operations are executed concurrently. Forall extends to multiple dimensions:
        forall (i=1:N, j=1:N) X(i,j) = mod ( i+j , 2)
        forall (i=1:N, j=1:N, i .eq. j) X(i,j) = 0
    With the second form we can employ a guard which selects only certain iterations. The last form is identical to:
            for i=1 to N
    	    for j=1  to N
    		if (i == j) X(i,j) = 0
    	    end for
    	end for
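    The two Forall forms above can be sketched in Python (the nested comprehension plays the role of the doubly-indexed forall, and the explicit loop applies the guarded assignment):

```python
N = 4

# forall (i=1:N, j=1:N) X(i,j) = mod(i+j, 2)
X = [[(i + j) % 2 for j in range(1, N + 1)] for i in range(1, N + 1)]

# forall (i=1:N, j=1:N, i .eq. j) X(i,j) = 0
# The guard i == j selects only the diagonal iterations.
for i in range(N):
    X[i][i] = 0
```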

    Array Sections. A portion of an array is defined by a triplet in each dimension. It may appear wherever an array is used.

           A(3)              ! third element
       A(1:5)            ! first 5 elements
       A(:5)             ! same
       A(1:10:2)         ! odd-indexed elements in order
       A(10:2:-2)        ! even-indexed elements in reverse order
           A(1:2,3:5)        ! 2-by-3 subblock
           A(1,:)            ! first row
           A(:,j)            ! jth column
    We may use array sections to handle the (Jacobi) updates for our ODE solver:
        UNew(1:N) = ( U(2:N+1) + U(0:N-1) - 8*h^2 ) / 2
    Which form is more appropriate? That depends on the situation, and sometimes on the programmer's preference.
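    Python slices play the role of triplets, so the array-section form of the Jacobi update can be sketched as follows (h and the mesh values are made up):

```python
h = 0.1
N = 5

# U holds N interior points plus two boundary values U[0] and U[N+1].
U = [float(i) for i in range(N + 2)]
UNew = U[:]

# UNew(1:N) = ( U(2:N+1) + U(0:N-1) - 8*h^2 ) / 2, via slices:
UNew[1:N + 1] = [(up + um - 8 * h**2) / 2
                 for up, um in zip(U[2:N + 2], U[0:N])]
```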

    Implicit Communication. Operations on conformable array sections may require interprocessor communication. In theory, each array element is assigned its own virtual processor, and so the expression

       B( 1:10 ) + B( 11:20 )
    involves implicit communication between the elements in B(1:10), which are on virtual processors 1:10, and B(11:20) which are on virtual processors 11:20. However, as we will see below, virtual processors get mapped onto physical processors, and much of the implicit communication will involve local memory accesses.
       A( 1:10 ) = B( 1:10 ) + B( 11:20 )
       C( 1:5:2 ) = C( 2:6:2 )            ! shift noncontiguous sections
       D = D( 10:1:-1 )                   ! permutation (reverse)
       A = [1,0,2,0,0,0,4]
       I = [1,3,7]
       B = A(I)                           ! B = [1,2,4]   "gather"
       C(I) = B                           ! C = A "scatter", no duplicates in I
       D = A([ 1, 1, 3, 3 ])              ! replication
       B = CSHIFT( A, 1 )                 ! B = [0,2,0,0,0,4,1], circular shift
       B = TRANSPOSE( H )                 ! matrix transpose
       B = SPREAD(A,1,3)                  ! if B is 3-by-7 then
    				      ! B = [ 1,0,2,0,0,0,4 ]
    				      !     [ 1,0,2,0,0,0,4 ]
    				      !     [ 1,0,2,0,0,0,4 ]
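    The gather, scatter, and circular-shift operations can be mimicked in Python (the cshift helper is my stand-in for Fortran's CSHIFT, whose positive shift moves elements toward lower indices):

```python
A = [1, 0, 2, 0, 0, 0, 4]
I = [1, 3, 7]                          # 1-based indices, as in the text

B = [A[i - 1] for i in I]              # gather: B = [1, 2, 4]

C = [0] * len(A)
for i, b in zip(I, B):                 # scatter (safe because I has no duplicates)
    C[i - 1] = b

def cshift(v, s):
    # Fortran-style circular shift: element i of the result is v[(i+s) mod n].
    s %= len(v)
    return v[s:] + v[:s]

D = cshift(A, 1)
```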

    Conditional Operation. HPF provides a where statement, which is like a conditional. Where's may not be nested, and only assignment is permitted in the body of a where. Consider the following loop, where we must protect against illegal square roots and division by 0:

       for i = 1 to N
           if (X(i) < 0) then
               X(i) = sqrt(abs(X(i)))
               X(i) = 1.0/X(i)
           end if
       end for
    This may be written more succinctly using the where clause:
       where (X < 0)
           X = sqrt(abs(X))
           X = 1.0/X
       end where
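    The masked semantics of where can be sketched in Python (made-up data; the strict test x < 0 also guards the division, since sqrt(abs(x)) is then nonzero):

```python
import math

X = [-4.0, 9.0, -0.25]

# where (X < 0): the mask is evaluated once, then each assignment in the
# body applies only at the selected positions; other elements are untouched.
mask = [x < 0 for x in X]
X = [math.sqrt(abs(x)) if m else x for x, m in zip(X, mask)]
X = [1.0 / x if m else x for x, m in zip(X, mask)]
```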

    Data Layout. Whether or not a statement will be executed in parallel depends on how the data it accesses is arranged on the machine. An array spread across all p processors can hope to enjoy p-fold parallelism in its operations, whereas arrays that aren't distributed will be processed sequentially. Briefly, scalar data, and arrays which do not participate in any array operations (like A=B+C), will be replicated on all processors. All processors will perform identical operations on such scalar data, which is equivalent to serial execution. There are compiler directives to help ensure that arrays are laid out the way the user wants them to be.

          REAL A(64,8), B(64,8), C(64,8)
    !HPF$ DISTRIBUTE A( BLOCK, : ), B( BLOCK , : ), C( : , BLOCK )
          A = B
          A = C
    In this code fragment, the !HPF$ DISTRIBUTE compiler directive is used to tell HPF where to store the arrays A, B, C. Suppose we have a 64 processor machine. Here are two natural ways to store a 64-by-8 array:
    Mem 0     Mem 1            ...                Mem 63
        A(1,1)    A(2,1)           ...                A(64,1)
        A(1,2)    A(2,2)           ...                A(64,2)
         ...       ...             ...
        A(1,8)    A(2,8)           ...                A(64,8)
    Mem 0     Mem 1    ...   Mem 7   Mem 8  ...   Mem 63
        C(1,1)    C(1,2)   ...   C(1,8)   xxx   ...    xxx
        C(2,1)    C(2,2)   ...   C(2,8)   xxx   ...    xxx
         ...       ...            ...
        C(64,1)   C(64,2)  ...   C(64,8)   xxx   ...    xxx
    The former is specified by A( BLOCK, : ), which says that different values of the first subscript should be stored in different parallel memories, while different values of the second subscript are stored sequentially in the same processor memory. Thus, there is parallelism available in the BLOCK direction, but not in the : direction. The latter is specified by C( :, BLOCK ), which says that different values of the second subscript should be stored in different parallel memories, while different values of the first subscript are stored sequentially in the same processor memory. Again, there is parallelism available in the BLOCK direction, but not in the : direction.

    Since A and B have the same layout, the statement A=B is perfectly parallel. On the other hand, A=C essentially performs a matrix transpose, which incurs a large amount of communication. When optimizing programs for time, it is essential to choose layouts that minimize communication.
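    The two layouts can be summarized by a hypothetical owner function giving the memory that stores element (i,j) of the 64-by-8 arrays on 64 processors:

```python
P = 64                   # number of processor memories, Mem 0 .. Mem 63

def owner_A(i, j):
    # A(BLOCK, :): the first subscript selects the memory, and the second
    # subscript walks sequentially within that one memory.
    return i - 1         # 1-based subscripts

def owner_C(i, j):
    # C(:, BLOCK): the second subscript selects the memory instead,
    # so only memories 0..7 are used for a 64-by-8 array.
    return j - 1
```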

    Libraries and Intrinsics. HPF has a variety of intrinsics available for moving data, including scan (not discussed). If you can find an appropriate intrinsic routine, it will often run faster than coding the operation yourself.

    More information about HPF can be found in Chapter 7 of Ian Foster's text.

    Programming Example: Many body simulations

    Reminder about many body simulations.

    We talked about many body simulations earlier in the course. In many body problems, a collection of bodies, called particles (in fact the Greek word somatidion literally means "little body"), interact according to a force law. The force law depends on the application, for example, gravitational interaction. In particle computations we compute the trajectory of the particles at discrete instants of time called timesteps. At each timestep we compute the force induced by each particle on every other particle, and then we advance the position of each particle under that force for the duration of the timestep. The basic algorithm is

         t = 0
         while t < t_final
             for i = 1 to n                  ... n = number_of_particles
                 compute f(i) = force on particle i
                 move particle i under force f(i) for time dt
             end for
             compute interesting properties of particles
             t = t + dt                      ... dt = time increment
         end while
    The forces are expensive to compute because the force on each particle depends on all the other particles. The simplest expression for the force f(i) on particle i is
         for i = 1 to n
             f(i) = sum_{j=1, j != i}^n f(i,j)
         end for
    where f(i,j) is the force on particle i due to particle j. Thus, the cost grows as O(n^2). We will look at a simple force: gravitational interaction. Assuming that each body has a mass m(i), and is located at position ( x(i), y(i), z(i) ), then f(i,j) is:
                                  x(i) - x(j)   y(i) - y(j)   z(i) - z(j)
    force = G * m(i) * m(j) *   ( ----------- , ----------- , ----------- )
                                     r^3           r^3            r^3
    r = sqrt ( (x(i) - x(j))^2 + (y(i) - y(j))^2 + (z(i) - z(j))^2)
    and G is the gravitational constant.
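    The pairwise force can be written directly from this formula (a sketch; the function name and the SI value of G are my additions):

```python
import math

G = 6.674e-11            # gravitational constant, SI units

def f_ij(xi, yi, zi, mi, xj, yj, zj, mj):
    # f(i,j) from the formula above: each component scales as 1/r^3.
    r = math.sqrt((xi - xj)**2 + (yi - yj)**2 + (zi - zj)**2)
    s = G * mi * mj / r**3
    return (s * (xi - xj), s * (yi - yj), s * (zi - zj))
```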

    To solve this problem in HPF, we use the following program, where we use the notation [x, y, z] to pack 3 elements into an array (this is not standard HPF, but it simplifies our discussion):

    real x(N), y(N), z(N), m(N), force(N)
    for k = 1 to n
        force = force +  F([x, y, z], m, [x(k), y(k), z(k)], m(k))
    end for
    where we have used an intuitive vector notation, and F() is appropriately defined to ignore the self-induced force:
    	vector3 function F ( [x , y , z], m, [xk, yk, zk], mk )
    	d = sqrt ( (x-xk)^2 + (y-yk)^2 + (z-zk)^2 )
    	where (d != 0)
    	    F = G * m * mk * [ x-xk , y-yk, z-zk ] / d^3
    	end where
    Is this algorithm feasible? Let's look at the communication cost. For each iteration of the k loop we must compute n forces (ignoring the degenerate self-force of 0). However, because the multiplication within F() involves scalar values (the components xk, yk, zk, and mk), these scalars must be broadcast to all other processors. If we are running on p < n processors, this costs log(p) message starts per iteration of the k loop.

    Alternatively, let's configure the processors in a ring and keep a second copy of the x, y, z, and m vectors. Then, we pass the copies around the ring using the CSHIFT operation, 1 CSHIFT per iteration of the k loop:

    real x(N), y(N), z(N), m(N), force(N)
    real xc(N), yc(N), zc(N), mc(N)
    [xc,yc,zc] = [x,y,z]
    mc = m
    for k = 1 to n
        force = force +  F([x, y, z], m, [xc, yc, zc], mc )
        xc = CSHIFT (xc, 1)
        yc = CSHIFT (yc, 1)
        zc = CSHIFT (zc, 1)
        mc = CSHIFT (mc, 1)
    end for
    where we now have
    	vector3 function F ( [x , y , z], m, [xc, yc, zc], mc )
    	d = sqrt ( (x-xc)^2 + (y-yc)^2 + (z-zc)^2 )
    	where (d != 0)
    	    F = G * m * mc * [ x-xc , y-yc, z-zc ] / d^3
    	end where
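    The ring scheme can be checked against the direct double loop with a small 1-D sketch (toy data, a simplified inverse-square force with G = 1, and a hypothetical cshift helper standing in for CSHIFT):

```python
def cshift(v, s):
    # Circular shift, mimicking Fortran's CSHIFT intrinsic.
    s %= len(v)
    return v[s:] + v[:s]

def f(xi, mi, xj, mj):
    # Toy 1-D inverse-square force with G = 1; zero for the self-pair.
    d = xi - xj
    return mi * mj * d / abs(d)**3 if d != 0 else 0.0

x = [0.0, 1.0, 3.0, 6.0]             # made-up positions
m = [1.0, 2.0, 3.0, 4.0]             # made-up masses
n = len(x)

# Direct O(n^2) double loop.
direct = [sum(f(x[i], m[i], x[j], m[j]) for j in range(n)) for i in range(n)]

# Ring version: rotate copies of the positions and masses past every
# particle, accumulating one contribution per rotation.
xc, mc = x[:], m[:]
ring = [0.0] * n
for _ in range(n):
    ring = [ring[i] + f(x[i], m[i], xc[i], mc[i]) for i in range(n)]
    xc, mc = cshift(xc, 1), cshift(mc, 1)
```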
    Now we have only vector operations, so there is no need to broadcast: there is only 1 message start per step vs. log(P) before. Of course, we are now moving a chunk of the position and mass vectors at each step, so we must include the cost of moving the data. The communication cost for one step is
    	Ts + (N/P)*W*Tw for the ring algorithm, vs.
    	log(P)*Ts for the broadcast algorithm
    	(for the broadcast, we can ignore the time to move the data)
    where W is the number of bytes in the position components and the mass. Assuming 4-byte real numbers, W = 16. When will our ring algorithm be faster? When
    	Ts + (N/P)*16*Tw < log(P)*Ts
    or when
    	N < (Ts/Tw) * (P/16) * log(P/2)
    Consider Valkyrie, where (Ts/Tw) = 2000. If we use 16 processors, then the ring algorithm is faster for N up to approximately 6000.
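    Plugging these numbers into the crossover bound confirms the figure (a sketch of the arithmetic, taking Ts/Tw = 2000 and P = 16):

```python
import math

Ts_over_Tw = 2000.0      # startup-to-per-word cost ratio (Valkyrie figure)
P = 16

# The ring beats the broadcast when Ts + (N/P)*16*Tw < log2(P)*Ts,
# i.e. when N < (Ts/Tw) * (P/16) * log2(P/2).
N_crossover = Ts_over_Tw * (P / 16) * math.log2(P / 2)
```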

    © 2002 Scott B. Baden. Last modified: 03/05/02 06:19 AM