RB3D

The rb3D program is a 3D iterative solver which solves Laplace's equation using Gauss-Seidel's method with Red/Black ordering. The rb3D source is found in the directory http:./rb3D and you may also download a gzipped tar file in http:./rb3D.tar.gz

rb3D supports various command line options to configure processor geometries and so on, as follows:

-l ni -m nj -n nk	Integer bounds of the computational mesh: ni x nj x nk
-N nijk			The single bound of a cubic mesh
			(These options are mutually exclusive)
-gi pi -gj pj		Leading dimensions of the processor geometry are pi x pj
-i niter		(Integer) number of iterations
-r reps			(Integer) number of repetitions of the run
-f freq			frequency of convergence check in iterations
-e epsilson		convergence threshold (double)
-si bi -sj bj	    	cache blocking factors for i & j axes

Thus, the command line

     mpprun -n 16 rb3D -l 32 -n 64 -n 128 -gi 4 -gj 2 -i 5 -r 3
will run rb3D on 16 processors over a 32 x 64 x 128 mesh for 5 iterations. (Note that the -n nprocs option of mpprun determines the number of processors to run on, and that the command line option to rb3D defines the problem size.) The program will be run 3 times. The processors will be configured in a 4 x 2 x 2 geometry, as the trailing dimension = nprocs / (pi * pj). If you are running with a cubic mesh, and this is perfectly reasonable, then you need only specify the -n option in -l -m -n. The command line
     mpprun -n 8 rb3D -n 100 -gi 2 -gj 2 -i 4 -r 2
runs a problem of size 100 x 100 x 100 as will the command line
     mpprun -n 8 rb3D -N 100 -gi 2 -gj 2 -i 4 -r 2

Running on SDSC's T3E using NQS

An NQS script has also been provided as nqs_job. You must edit this file to change the working directory, set the number of processors you will run on, and the command line options for rb3D. Submit the job as described on the 3rd line of the file, making sure that the number of processors you run on is specified consistently within the script file and as the qsub command line argument -l mpp_p=4. You can specify the output file on the command line using the -o option. Be sure that the two environment variables SCACHE_D_STREAMS and MPI_BUFFER_MAX are cleared as shown or your message passing performance may suffer.

You can do this by putting the following lines in your .cshrc file:

# Be sure and run with the next two environment variables cleared to 0
# Turn off streams
setenv SCACHE_D_STREAMS 0

# Don't internally buffer messages
setenv MPI_BUFFER_MAX 0

The sample nqs_job file appears below:

! #/bin/csh
#Submit this job as
#qsub -s /bin/csh -eo -x -q normal -l mpp_p=4 -l mpp_t=0:10:00 -o output nqs_job
#QSUB -q normal           # Run job at low priority
#QSUB -l mpp_t = 00:10:00 # Maximum run time = 10 min
#QSUB -l mpp_p = 4        # Run job on 4 processors
#QSUB -o file.out         # Send output to file.out if you like
#QSUB -eo                 # Combine stderr and stdout
#QSUB -me                 # Mail notification at completion of job if you like
#
# Be sure and change this directory to the one that you will be running from
cd 
# Run on 4 processors in a 70^3 mesh for 5 iterations and 3 repetitions
mpprun -n 4 rb3D -n 70 -i 5 -r 3
# Try different processor geometries
mpprun -n 4 rb3D -gi 4 -gj 1 -n 70 -i 5 -r 3
mpprun -n 4 rb3D -gi 1 -gj 4 -n 70 -i 5 -r 3

You can monitor the status of your job by typing:

qstat -a
The last column tells you the status of your job. To check the status of all running jobs (including interactive ones as well as NQS jobs), type:
ps -MPel

If you have any problems, see the URL http://www.npaci.edu/News for up-to-date T3E info, or the URL http://www.npaci.edu/T3E for online T3E documentation.

The output file

rb3D reports the various options you have selected along with timings. You can ignore most of the output except for a few essential lines:

                                                               Local
       Blocking   Processors    Grind T   MF/s   Time  Comm  MF/s  Time
N= 70 [ 32x 32] P=  4{ 1x 1x 4} 6.613e-08  91    0.11  0.08  99    0.10

Timings (in sec.)
Total, Compute only

0.135   0.104
0.113   0.105
0.113   0.104
In what follows we look at the third line, which begins with N=. Column 1 tells us that the problem size was 70 x 70 x 70. (Column 2 gives us blocking factors for cache (32 x 32) which you may ignore, but experiment with these if you like. You may notice a difference in running time.) Column 3 tells us that we ran with 4 processors configured as a 1 x 1 x 4 processor array. This means that we employed a 1D decomposition, cutting the Z axis into 4 parts. The grind time was 6.613e-08 seconds ("Grind T") and the equivalent megaflop rate 91 Megaflops/second. The total running time was 0.11 seconds.

The program also runs the computation with communication shut off, and uses the timings with and without communication to indirectly measure the cost of communication by taking differences. How is this different from measuring the cost directly? The last 3 columns pertain to this portion of the run. The last of these columns reports the total running with communication shut off: 0.10 seconds. The equivalent megaflop rate is 99 megaflops/second. The fraction of total time that was spent in communication was 0.08, or 8% (1st of the three columns).

The above statistics are reported for the best of nreps runs, that is, for the fastest total running time. The last bit of the output shows us the total and communication-free running times, permitting us to gauge our confidence in our measurements. We note that the timing for the first run is a bit slow, and this is often the case on the Cray T3E. (The program also warms itself up by running a few untimed iterations.)

Note that you may use non-cubical domains to also carry out a sanity check. You should be able to reproduce the parallel running time with communication turned off by running the reduced problem on one processor, which corresponds to what a single processor is assigned when the full problem is run in parallel. For example, if you are running with a 126^3 mesh on 16 processors, then you should run a 126 x 126 x 8 problem on 1 processor and observe in theory the same running time as the 16-processor run with communication shut off.

Beware that if the number of processors P does not divide N evenly, then some processors may get assigned no work, and the program may fail. In particular, for a 1x1xP strip partitioning, you need to ensure that P and N satisfy the following condition:

   N > ceil[ N/P ] * (P-1)

To learn more about why this is so, read page 36 of the KeLP user's guide. (This program uses DOCK BLOCK1 type decompositions.) Take a look at the FloorPlan in the output if you are in doubt. The program will not detect whether any local sub-array has at least one trivial extent of 0 (This bug should be fixed.)

Measurements

For various combinations of number of processors P and problem size N measure the grind time and plot the times as a function of P for
  • fixed amount of work and
  • scaled workload (work/P = constant)
  • You should run on at least 16 processors; running with larger numbers of processors (especially P > 32) could leave you waiting in the batch queue but you may want to try your luck! Note that this program is also the "best serial program"

    Choose your runs carefully! As you will see when you read below, there are many possibilities regarding the appropriate problem sizes. You should think your choices over before beginning your runs. you are in doubt.

    You will have two sets of plots and one table. For the first plot, pick at least 3 problem sizes with N0, N1, N2 such that N2^3 is at least twice as large as N1^3 and so on for succeeding problem sizes. This plot will have 1 curve for each value of N. Do you note any change in grind time as P, N increase? Explain your answer. See if you can observe a super-linear speedup.

    Now, make a second plot in which you show amortized communication cost per point. You can determine this value by scaling the grind time according to the ratio

        (total time - computation only time)
        ------------------------------------
    		total time
    
    So, for the above timing, the cost of communication is 6.613*1/11 = 0.601e-8. Report on your observations about communication cost.

    For the second set of plots you will need to keep N^3/P a constant. Choose 3 different values of N^3/P and plot the timings for each value as a separate curve. Also plot amortized communication timings. How do these curves compare with the previous set? Explain your findings.

    For the table, experiment with different processor geometries, keeping the problem size fixed. Run with 4 processors, and note the effect of processor geometry on performance. Repeat for 16 processors. Explain any differences. I like to sort the runs on running time (fastest to slowest) to get a sense of any patterns which may emerge. You may notice that some geometries produce nearly identical running times.

    Performance

    The measurement process is complicated by the fact that

    1. performance is sensitive to processor geometry;
    2. communication time and local computation time vary independently with the geometry; and
    3. the optimal geometry depends on N.

    Thus, when comparing runs across different number of processors, or different problem sizes, it is necessary to use the optimal geometry in each case. Moreover, we should remember that unless the extents of the processor geometry divides N evenly, then there will be some loss of performance due to load imbalance. The first step is to observe just how much the geometry impacts performance. Examining the running times for P=4 processors, we clearly see this. Consider N=128^3:

    				       ---- Local ---
       Processors    Grind T   MF/s   Time  Comm  MF/s  Time
    P=  4{ 1x 4x 1} 7.357e-08  82    0.77  0.11  92    0.69 
    P=  4{ 1x 1x 4} 7.926e-08  76    0.83  0.03  78    0.81 
    P=  4{ 1x 2x 2} 8.205e-08  73    0.86  0.04  76    0.83 
    P=  4{ 2x 1x 2} 8.247e-08  73    0.86  0.11  82    0.77 
    P=  4{ 2x 2x 1} 8.749e-08  69    0.92  0.12  78    0.81 
    P=  4{ 4x 1x 1} 9.608e-08  62    1.01  0.30  89    0.71 
    
    The {1x4x1} geometry is the most effective. Performance is is about 30% better than the worst case of a {4x1x1} geometry. However, the optimal geometry for overall running time does not correspond with the optimal geometry with respect to communication alone--{1x1x4}--and we note a dramatic range of performance across the various 1D decompositions: 20 milliseconds, 80 msec, and 300 msec for, respectively, {1x1x4}, {1x4x1}, {4x1x1}.

    The optimal geometry is also sensitive to N. Consider N=120:

    				       ---- Local ---
       Processors    Grind T   MF/s   Time Comm  MF/s  Time
    P=  4{ 1x 2x 2} 6.691e-08  90    0.58  0.05  94    0.55 
    P=  4{ 1x 1x 4} 6.775e-08  89    0.59  0.03  91    0.57 
    P=  4{ 1x 4x 1} 6.899e-08  87    0.60  0.13 100    0.52 
    P=  4{ 2x 1x 2} 7.474e-08  80    0.65  0.14  94    0.55 
    P=  4{ 2x 2x 1} 7.651e-08  78    0.66  0.15  92    0.56 
    P=  4{ 4x 1x 1} 8.494e-08  71    0.73  0.36 110    0.47 
    

    Finally, we consider the effects of load imbalance, where P doesn't divide N evenly. Consider N=126:

    				       ---- Local ---
       Processors    Grind T   MF/s   Time Comm  MF/s  Time
    P=  4{ 2x 1x 2} 8.664e-08  69    0.87  0.12  79    0.76 
    P=  4{ 2x 2x 1} 8.964e-08  67    0.90  0.12  76    0.79 
    P=  4{ 1x 2x 2} 9.472e-08  63    0.95  0.04  66    0.91 
    P=  4{ 1x 4x 1} 1.051e-07  57    1.05  0.04  60    1.01 
    P=  4{ 4x 1x 1} 1.120e-07  54    1.12  0.13  62    0.98 
    P=  4{ 1x 1x 4} 1.125e-07  53    1.13  0.02  54    1.10 
    

    Now we increase the number of processors to P=16. For N=256^3 we find that a 2-D decomposition is optimal, but this geometry does not minimize local computation time. (It comes close to the minimal local computation time.)

    				       ---- Local ---
       Processors    Grind T  MF/s   Time  Comm MF/s  Time
    P= 16{ 1x 8x 2} 1.751e-08 343    1.47  0.12 389   1.29 
    P= 16{ 1x 4x 4} 1.776e-08 338    1.49  0.07 362   1.39 
    P= 16{ 2x 8x 1} 1.923e-08 312    1.61  0.15 368   1.37 
    P= 16{ 2x 1x 8} 1.965e-08 305    1.65  0.07 329   1.53 
    P= 16{ 1x16x 1} 1.996e-08 301    1.67  0.21 379   1.33 
    P= 16{ 4x 1x 4} 2.037e-08 294    1.71  0.18 360   1.40 
    P= 16{ 2x 2x 4} 2.067e-08 290    1.73  0.07 312   1.61 
    P= 16{ 1x 1x16} 2.098e-08 286    1.76  0.05 300   1.68 
    P= 16{ 1x 2x 8} 2.112e-08 284    1.77  0.03 294   1.71 
    P= 16{ 2x 4x 2} 2.172e-08 276    1.82  0.09 305   1.65 
    P= 16{ 4x 2x 2} 2.210e-08 272    1.85  0.18 329   1.53 
    P= 16{ 8x 1x 2} 2.269e-08 264    1.90  0.33 393   1.28 
    P= 16{ 4x 4x 1} 2.387e-08 251    2.00  0.19 312   1.62 
    P= 16{ 8x 2x 1} 2.440e-08 246    2.05  0.31 355   1.42 
    P= 16{16x 1x 1} 2.911e-08 206    2.44  0.50 409   1.23 
    

    Now for N=512. We note that the ordering of the geometries with respect to performance is different than in the previous case.

    				       ---- Local ---
       Processors    Grind T   MF/s   Time  Comm  MF/s  Time
    P= 16{ 2x 8x 1} 1.790e-08 335   12.02  0.08 363   11.08 
    P= 16{ 1x 8x 2} 1.813e-08 331   12.17  0.06 351   11.48 
    P= 16{ 4x 1x 4} 1.841e-08 326   12.35  0.10 362   11.12 
    P= 16{ 8x 1x 2} 1.887e-08 318   12.66  0.19 393   10.24 
    P= 16{ 2x 1x 8} 1.958e-08 306   13.14  0.03 317   12.68 
    P= 16{ 1x 4x 4} 1.981e-08 303   13.29  0.03 312   12.92 
    P= 16{ 1x16x 1} 1.982e-08 303   13.30  0.10 337   11.95 
    P= 16{ 4x 2x 2} 2.000e-08 300   13.42  0.10 332   12.15 
    P= 16{ 2x 2x 4} 2.045e-08 293   13.72  0.03 303   13.27 
    P= 16{ 2x 4x 2} 2.131e-08 282   14.30  0.04 295   13.67 
    P= 16{ 4x 4x 1} 2.141e-08 280   14.37  0.10 313   12.86 
    P= 16{ 8x 2x 1} 2.021e-08 297   13.56  0.18 362   11.12 
    P= 16{16x 1x 1} 2.087e-08 287   14.01  0.35 443    9.09 
    P= 16{ 1x 2x 8} 2.350e-08 255   15.77  0.01 259   15.55 
    P= 16{ 1x 1x16} 2.366e-08 254   15.88  0.02 259   15.56 
    

    Keeping the above findings in mind, we are now ready to design our experiments for measuring fixed-size and scaled-size speedups.

    First we begin with the scaled measurements. We choose

    		N / P = 128, 256, 512
    

    We will carry out runs on P=1 to P=64 processors, though we won't run with all possible combinations of N and P. For example, the largest problems (N=512) won't fit into the memory of a single T3E 21164 processor (available user memory is approximately 110 megabytes.) We also note that problem sizes cannot be a nice power of two when P is not a perfect cube.

    It is interesting to look at grind time T_g normalized to the number of processors. Under ideal circumstances this should remain constant. Deviations (below) this ideal line indicate performance losses.

    Now, we are ready to take fixed-size measurements. We pick 3 values of N that we used in the scaled measurements. Again we report T_g and amortized communication time.

    At last we can conduct our communication analysis.

    It is important to correctly model communication overheads, which can depend on the access strides. Note that communication rates will be slightly different than that reported by an MPI program, due to some operating overheads in KeLP. To correctly ascertain the effects of access stride, take a look at the timings you generated for the three 1-dimensional decompositions on 4 processors. Consider the case when you are partitioning across the last axis. Note that we are assuming column major ordering such that the last axis varies the least rapidly, and the first axis the most. This is the opposite of C/C++ which uses row-major ordering.

    For this simple case, communication is simply

     
    Alpha + beta_z *  W
    
    where beta_z is the peak bandwidth of the machine, Alpha is the startup cost and W corresponds to the number of 8 byte, double precision words transmitted across the face of the split grid. You may determine Alpha and beta_z by fitting a curve to timings produced with different values of W. Now, repeat for the other two axes. Assume that the values of Alpha is the same is in the first case, and determine the different values of beta_x and beta_y which correspond to cutting the x and y axis, respectively.

    You may have to refine your model slightly to account for another effect: the number of cache lines touched by the processor in the process of communicating a face. Assume that the cache line size is 32 bytes for the T3E's 21164 processor.

    We can now see that there are in fact 3 variants of the communication cost function. Alpha is constant, but there are three beta values--beta_z, beta_y, and beta_x---which are found to increase roughly in the ratio 1 : 3.5 : 9. We estimate these parameters using the 1D case on 4 processors. Since communication is presumed to be additive across the different axes, we can get to within 10% or less of the actual communication time. Note, however, that we did not attempt to model computation time, which effect is significant.

    Now, verify your model against the empirically observed data. You need only look at communication costs. First try 1D decompositions involving different problem sizes. Now, try 2D and 3D decompositions. In the latter case you will need to employ the 3 different formulas you derived to handle each of the faces. How well do your predictions fit the data?

    Getting access to the code

    Included in the rb3D directory is a sample NQS batch file called nqs_job to get you started. Your program will be run with mpprun but you should not normally run your jobs interactively since interactive times are charged twice the rate of jobs run in the "normal" queue.

    There is also a directory called Ring/ containing an executable to measure message start time and bandwidth for a stride of 1. This program measures message latency in a ring and is written in MPI. The nqs_job file is self-documenting and will show you how to run the code. The source is also available if you are interested.


    Maintained by Scott B. Baden . Last modified: Thu Nov 20 18:56:09 PST 1997 Copyright (c) 1997, Scott B. Baden.