Data parallel languages go back many years to the days of the ILLIAC IV (1960s), and they saw a resurgence in the 1980s with Connection Machine Fortran (CMF targeted Thinking Machines' CM-2, a SIMD machine, and CM-5, a MIMD machine). High Performance Fortran (HPF) was later devised by committee. Today, data parallelism arises in Fortran 90 and 95 and in Matlab. To see how it works, let us look at an example we've seen earlier in the course: the solution of an ordinary differential equation.

Let U[ ] and UNew[ ] be arrays. The following statement computes the difference between the new and old solutions at all points.

    dU = ( UNew - U )

The arrays must be conformable, i.e. have the same shape. The statement is equivalent to the serial loop

    for i = 0 to n+1
        dU[i] = UNew[i] - U[i]
    end

The compiler will automatically assume that array operations are to be done in parallel, provided the arrays are declared to reside on more than one processor, as we explain below.
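The elementwise array statement can be sketched in NumPy, a data-parallel array notation in the same spirit as CMF and Fortran 90 (the NumPy setting and the sample values are assumptions of this sketch, not part of the course's HPF examples):

```python
import numpy as np

# The array statement dU = (UNew - U), sketched in NumPy.
n = 6
U = np.linspace(0.0, 1.0, n + 2)       # solution at points 0..n+1 (made-up values)
UNew = U + 0.01                        # a slightly different "new" solution

dU = UNew - U                          # one statement; the loop is implicit

# the equivalent explicit serial loop
dU_loop = np.empty(n + 2)
for i in range(n + 2):
    dU_loop[i] = UNew[i] - U[i]

assert np.allclose(dU, dU_loop)
```

As in HPF, the single array statement expresses the same computation as the loop, leaving the system free to execute it in parallel.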

Continuing our comparison of the codes, consider the convergence computation, the error norm:

    Err = sqrt ( sum ( dU*dU ) )

We ignore the fact that Err is a scalar and concentrate on the computation on the right-hand side. As before, the computation dU*dU involves an implicit loop. The intrinsic function sum() is applied to the entire vector quantity dU*dU and reduces the vector to a scalar. Operations like sum(), prod(), max(), and min() are called *reductions*.
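The reduction can be sketched in NumPy as well (the sample dU values below are made up):

```python
import numpy as np

# dU * dU is an implicit elementwise loop; sum() reduces the vector to a
# scalar, which sqrt() turns into the error value.
dU = np.array([0.1, -0.2, 0.05, 0.0])
Err = np.sqrt(np.sum(dU * dU))

assert np.isclose(Err, 0.0525 ** 0.5)
```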

HPF provides a **forall** statement, which is a parallel for loop.
We can compute the sweep in the solution mesh using the following
statement:

    forall (i=1:N) UNew(i) = (U(i+1) + U(i-1) - 8*h^2)/2.0

With a forall we may also iterate over more than one index:

    forall (i=1:N, j=1:N) X(i,j) = mod ( i+j , 2)

or

    forall (i=1:N, j=1:N, i .eq. j) X(i,j) = 0

With the second form we can employ a *mask*; it is equivalent to the serial loop

    for i=1 to N
        for j=1 to N
            if (i == j) X(i,j) = 0
        end for
    end for
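The forall statements can be sketched in NumPy (an analogue chosen for these notes, not HPF itself): index grids play the role of the forall indices, and a boolean mask plays the role of the forall mask.

```python
import numpy as np

N = 4
i = np.arange(1, N + 1).reshape(N, 1)    # i varies down the rows
j = np.arange(1, N + 1).reshape(1, N)    # j varies across the columns

X = (i + j) % 2                          # forall (i=1:N, j=1:N) X(i,j) = mod(i+j,2)

Y = np.ones((N, N))
Y[i == j] = 0                            # masked forall: assign only where i == j

assert np.trace(Y) == 0 and Y.sum() == N * N - N
```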

**Array Sections.** A portion of an array is defined by a triplet in
each dimension. It may appear wherever an array is used.

    A(3)        ! third element
    A(1:5)      ! first 5 elements
    A(:5)       ! same
    A(1:10:2)   ! odd elements in order
    A(10:2:-2)  ! even elements in reverse order
    A(1:2,3:5)  ! 2-by-3 subblock
    A(1,:)      ! first row
    A(:,j)      ! jth column

We may use array sections to handle the (Jacobi) updates for our ODE solver:

    UNew(1:N) = ( U(2:N+1) + U(0:N-1) - 8*h^2 ) / 2

Which form is more appropriate? That depends on the situation, and sometimes on the programmer's preference.
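The array-section sweep has a direct NumPy counterpart using slices (0-based here, so U[1:N+1] plays the role of UNew(1:N); the data is made up):

```python
import numpy as np

N = 8
h = 1.0 / (N + 1)
U = np.zeros(N + 2)                    # interior plus two boundary points
UNew = np.zeros_like(U)

# UNew(1:N) = ( U(2:N+1) + U(0:N-1) - 8*h^2 ) / 2
UNew[1:N+1] = (U[2:N+2] + U[0:N] - 8 * h**2) / 2

# the same update as an explicit loop
UNew_loop = np.zeros_like(U)
for i in range(1, N + 1):
    UNew_loop[i] = (U[i+1] + U[i-1] - 8 * h**2) / 2

assert np.allclose(UNew, UNew_loop)
```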

**Implicit Communication.** Operations on conformable array sections
may require interprocessor communication. In theory, each array element
is assigned its own virtual processor, and so the expression

    B( 1:10 ) + B( 11:20 )

involves implicit communication between the elements in B(1:10), which are on virtual processors 1:10, and B(11:20), which are on virtual processors 11:20. However, as we will see below, virtual processors get mapped onto physical processors, and much of the implicit communication will involve local memory accesses.

    A( 1:10 ) = B( 1:10 ) + B( 11:20 )
    C( 1:5:2 ) = C( 2:6:2 )        ! shift noncontiguous sections
    D = D( 10:1:-1 )               ! permutation (reverse)
    A = [1,0,2,0,0,0,4]
    I = [1,3,7]
    B = A(I)                       ! B = [1,2,4]  "gather"
    C(I) = B                       ! C = A  "scatter", no duplicates in I
    D = A([ 1, 1, 3, 3 ])          ! replication
    B = CSHIFT( A, -1 )            ! B = [4,1,0,2,0,0,0], circular shift
    B = TRANSPOSE( H )             ! matrix transpose
    B = SPREAD( A, 1, 3 )          ! if B is 3-by-7 then
                                   ! B = [ 1,0,2,0,0,0,4 ]
                                   !     [ 1,0,2,0,0,0,4 ]
                                   !     [ 1,0,2,0,0,0,4 ]
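Most of these section-moving operations have NumPy analogues (a sketch; np.roll with a positive shift rotates toward higher indices, reproducing the rotated vector shown above):

```python
import numpy as np

A = np.array([1, 0, 2, 0, 0, 0, 4])
I = np.array([0, 2, 6])                  # 0-based version of I = [1,3,7]

B = A[I]                                 # gather
C = np.zeros(7, dtype=int)
C[I] = B                                 # scatter, no duplicates in I
D = A[[0, 0, 2, 2]]                      # replication
R = A[::-1]                              # permutation (reverse)
S = np.roll(A, 1)                        # circular shift
T = np.tile(A, (3, 1))                   # SPREAD-like replication into 3 rows

assert list(B) == [1, 2, 4]
assert np.array_equal(C, A)              # scatter reproduces A (rest was zero)
assert list(S) == [4, 1, 0, 2, 0, 0, 0]
assert T.shape == (3, 7)
```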

**Conditional Operation.**
HPF provides a **where** statement, which is like a conditional.
where statements may not be nested, and only assignment
is permitted in the body of a where.
Consider the following loop, where we must protect against
illegal square roots and division by 0:

    for i = 1 to N
        if (X(i) <= 0) then
            X(i) = sqrt(abs(X(i)))
        else
            X(i) = 1.0/X(i)
        end if
    end for

This may be written more succinctly using the where statement:

    where (X <= 0)
        X = sqrt(abs(X))
    elsewhere
        X = 1.0/X
    end where
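The where/elsewhere construct can be sketched with np.where (the values of X are made up). One caveat of the analogue: np.where evaluates both branch expressions everywhere, whereas HPF's where guards the assignments, so protecting against division by zero needs extra care in NumPy.

```python
import numpy as np

X = np.array([-4.0, 2.0, -9.0, 0.5])
X = np.where(X <= 0, np.sqrt(np.abs(X)), 1.0 / X)

assert np.allclose(X, [2.0, 0.5, 3.0, 2.0])
```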

**Data Layout.** Whether or not a statement will be executed in parallel
depends on how the data it accesses
is arranged on the machine. An array spread across all
p processors can hope to enjoy p-fold parallelism in operations, whereas
arrays that aren't distributed will run sequentially.
Briefly, scalar data and arrays which do not participate in any array
operations (like A=B+C) will be replicated on all processors.
All processors will perform the identical operations on such scalar data,
which is equivalent to serial execution.
There are compiler directives to help ensure that arrays are
laid out the way the user wants them to be.

    REAL A(64,8), B(64,8), C(64,8)
    !HPF DISTRIBUTE A( BLOCK, : ), B( BLOCK , : ), C( : , BLOCK )
    A = B
    A = C

In this code fragment, the !HPF DISTRIBUTE compiler directive is used to tell HPF where to store the arrays A, B, and C. Suppose we have a 64 processor machine. Here are two natural ways to store a 64-by-8 array:

    Mem 0     Mem 1     ...   Mem 63
    A(1,1)    A(2,1)    ...   A(64,1)
    A(1,2)    A(2,2)    ...   A(64,2)
     ...       ...             ...
    A(1,8)    A(2,8)    ...   A(64,8)

and

    Mem 0     Mem 1     ...   Mem 7     Mem 8   ...   Mem 63
    C(1,1)    C(1,2)    ...   C(1,8)    xxx     ...   xxx
    C(2,1)    C(2,2)    ...   C(2,8)    xxx     ...   xxx
     ...       ...             ...
    C(64,1)   C(64,2)   ...   C(64,8)   xxx     ...   xxx

The former is specified by A( BLOCK , : ), which says the first subscript should be stored with different values corresponding to different parallel memories, and different values of the second subscript should be stored sequentially in the same processor memory. Thus, there is parallelism available in the BLOCK direction, but not in the : direction. The latter is specified by C( : , BLOCK ), which says the opposite: different values of the second subscript go to different parallel memories, while the first subscript varies sequentially within each memory.

Since A and B have the same layout, the statement A=B is perfectly parallel. On the other hand, A=C essentially performs a matrix transpose, which incurs a large amount of communication. When optimizing programs for time, it is essential to make sure layouts are chosen to minimize communication.
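A small executable model makes the difference concrete (the owner functions below are this sketch's assumptions): element A(i,j) lives in memory i-1 under A(BLOCK,:), and C(i,j) lives in memory j-1 under C(:,BLOCK). Counting elements whose source and destination owners differ shows why A=B needs no communication while A=C behaves like a transpose.

```python
def owner_A(i, j):          # A( BLOCK, : ) -- rows spread across memories
    return i - 1

def owner_C(i, j):          # C( :, BLOCK ) -- columns spread across memories
    return j - 1

owner_B = owner_A           # B( BLOCK, : ) has the same layout as A

# count elements of a 64-by-8 array that change processors in each assignment
moves_AB = sum(owner_A(i, j) != owner_B(i, j)
               for i in range(1, 65) for j in range(1, 9))
moves_AC = sum(owner_A(i, j) != owner_C(i, j)
               for i in range(1, 65) for j in range(1, 9))

print(moves_AB, moves_AC)   # -> 0 504
```

Of the 512 elements, only the 8 "diagonal" elements with i = j stay put in A=C; every other element must move between processors.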

**Libraries and Intrinsics.** HPF has a variety of intrinsics
available for moving data, including scan (not discussed).
If you can find an appropriate intrinsic routine, it will often run
faster than coding the operation yourself.

More information about HPF can be found by consulting Ian Foster's text, Chapter 7.

We talked about many body simulations
earlier in the course.
In many body problems, a collection of bodies, called *particles*
(in fact the Greek word *somati'dion* literally means
"little body"),
interact according to a *force law*. The force law depends
on the application; gravitational interaction is one example.
In particle computations we compute the trajectory of the particles
at discrete instants of time called *timesteps*.
To compute the trajectory at each timestep we compute the force
induced by each particle on every other particle, and then
we advance the position of each particle according to that force
applied over the length of the timestep.
The basic algorithm is
The basic algorithm is

    t = 0
    while t < t_final
        for i = 1 to n                    ... n = number_of_particles
            compute f(i) = force on particle i
            move particle i under force f(i) for time dt
        end for
        compute interesting properties of particles
        t = t + dt                        ... dt = time increment
    end while

The forces are expensive to compute because the force on each particle depends on all the other particles. The simplest expression for the force f(i) on particle i is

    for i = 1 to n
        f(i) = sum_{j=1, j != i}^n f(i,j)
    end for

where f(i,j) is the force on particle i due to particle j. Thus, the cost grows as O(n^2). We will look at a simple force: gravitational interaction. Assuming that each body has a mass m(i), and is located at position ( x(i), y(i), z(i) ), then f(i,j) is:

    f(i,j) = G * m(i) * m(j) * ( (x(i)-x(j))/r^3 , (y(i)-y(j))/r^3 , (z(i)-z(j))/r^3 )

where

    r = sqrt ( (x(i) - x(j))^2 + (y(i) - y(j))^2 + (z(i) - z(j))^2 )

and G is the gravitational constant.
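The pairwise force law transcribes directly into code (a sketch: `pair_force` is a name invented here, and G defaults to 1.0 to keep the arithmetic simple; the physical constant is about 6.674e-11 in SI units):

```python
import math

# A direct transcription of f(i,j) above; G = 1.0 is assumed for simplicity.
def pair_force(xi, yi, zi, mi, xj, yj, zj, mj, G=1.0):
    r = math.sqrt((xi - xj)**2 + (yi - yj)**2 + (zi - zj)**2)
    s = G * mi * mj / r**3
    return (s * (xi - xj), s * (yi - yj), s * (zi - zj))

# two unit masses one unit apart along the x axis
print(pair_force(0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0))   # -> (-1.0, 0.0, 0.0)
```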

To solve this problem in HPF, we use the following program, where we use the
notation `[x, y, z]` to pack 3 elements into an array (this is not
standard HPF, but it simplifies our discussion):

    real x(N), y(N), z(N), m(N), force(N)
    for k = 1 to n
        force = force + F([x, y, z], m, [x(k), y(k), z(k)], m(k))
    end for

where we have used an intuitive vector notation, and F() is appropriately defined to ignore the self-induced force:

    vector3 function F ( [x , y , z], m, [xk, yk, zk], mk )
        d = sqrt ( (x-xk)^2 + (y-yk)^2 + (z-zk)^2 )
        where (d != 0)
            F = m * mk * [ x-xk , y-yk, z-zk ] / d^3
        end where

Is this algorithm feasible? Let's look at the communication cost. For each iteration of the k loop, the values x(k), y(k), z(k), and m(k) must be broadcast to all processors, at a cost of log(P) message starts.
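A NumPy sketch of this data-parallel loop follows: for each k, F computes, in vector operations, the force of particle k on every particle, with a (d != 0) mask playing the role of HPF's where to suppress the self-induced force. As in the notes' F, the gravitational constant is omitted; the 1/d^3 factor follows the force law given earlier, and the sample data is random.

```python
import numpy as np

def F(x, y, z, m, xk, yk, zk, mk):
    d = np.sqrt((x - xk)**2 + (y - yk)**2 + (z - zk)**2)
    f = np.zeros((3, x.size))
    mask = d != 0                       # skip the self-induced force
    s = m[mask] * mk / d[mask]**3
    f[0, mask] = s * (x[mask] - xk)
    f[1, mask] = s * (y[mask] - yk)
    f[2, mask] = s * (z[mask] - zk)
    return f

rng = np.random.default_rng(0)
n = 4
x, y, z = rng.random(n), rng.random(n), rng.random(n)
m = np.ones(n)
force = np.zeros((3, n))
for k in range(n):
    force += F(x, y, z, m, x[k], y[k], z[k], m[k])

# pairwise forces cancel, so the total force on the system is zero
assert np.allclose(force.sum(axis=1), 0.0)
```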

Alternatively, let's configure the processors in a ring and keep a second copy of the x, y, z, and m vectors. Then we pass the copies around the ring using the CSHIFT operation, 1 CSHIFT per iteration of the k loop:

    real x(N), y(N), z(N), m(N), force(N)
    real xc(N), yc(N), zc(N), mc(N)
    [xc,yc,zc] = [x,y,z]
    mc = m
    for k = 1 to n
        force = force + F([x, y, z], m, [xc, yc, zc], mc )
        xc = CSHIFT (xc, 1)
        ...                    ! similarly shift yc, zc, and mc
    end for

where we now have

    function F ( [x , y , z], m, [xc, yc, zc], mc )
        d = sqrt ( (x-xc)^2 + (y-yc)^2 + (z-zc)^2 )
        where (d != 0)
            F = m * mc * [ x-xc , y-yc, z-zc ] / d^3
        end where

Now we have vector operations only, so there is no need to broadcast. There is only 1 message start per step vs log(P) before. Of course, we are now moving a chunk of the position and mass vectors at each step, so we must include the cost of moving the data; the communication cost for one step is

    Ts + (N/P)*W*Tw        for the ring algorithm
    log(P)*Ts              for the broadcast algorithm
                           (we can ignore the time to move the data)

where W is the number of bytes in the position vector and the mass. Assuming 4 byte real numbers, W = 16. When will our ring algorithm be faster? When

    Ts + (N/P)*16*Tw < log(P)*Ts

or when

    N < (Ts/Tw) * (P/16) * log(P/2)

Consider Valkyrie, where (Ts/Tw) = 2000. If we use 16 processors, then the ring algorithm is faster for N up to approximately 6000.
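The crossover estimate can be checked in a few lines (a sketch; `crossover_N` is a name invented here, and Ts/Tw = 2000 with P = 16 are the values assumed in the text):

```python
import math

# N < (Ts/Tw) * (P/16) * log2(P/2); the 16 in P/16 is W, the bytes moved
# per particle per step.
def crossover_N(Ts_over_Tw, P):
    return Ts_over_Tw * (P / 16) * math.log2(P / 2)

print(crossover_N(2000, 16))   # -> 6000.0
```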

© 2002 Scott B. Baden. Last modified: 03/05/02 06:19 AM