RB3D

The rb3D program is a 3D iterative solver which solves Poisson's equation using Gauss-Seidel's method with Red/Black ordering. The rb3D source is found in the directory http:./rb3D and you may also down-load a gzipped tar file in http:./rb3D.tar.gz

rb3D supports various command line options to configure processor geometries and so on, as follows:

-l ni -m nj -n nk	Integer bounds of the computational mesh: ni x nj x nk
-N nijk			The single bound of a cubic mesh
			(These options are mutually exclusive)
-gi pi -gj pj		Leading dimensions of the processor geometry are pi x pj
-i niter		(Integer) number of iterations
-r reps			(Integer) number of repetitions of the run
-f freq			frequency of convergence check in iterations
-e epsilson		convergence threshold (double)
-si bi -sj bj	    	cache blocking factors for i & j axes

Thus, if you are using mpirun to execute MPI jobs, then the command line

     mpirun -np 16 rb3D -l 32 -n 64 -n 128 -gi 4 -gj 2 -i 5 -r 3
will run rb3D on 16 processors over a 32 x 64 x 128 mesh for 5 iterations. (Note that the -np nprocs option of mpirun determines the number of processors to run on, and that the command line option to rb3D defines the problem size.) The computation will be run 3 times. The processors will be configured in a 4 x 2 x 2 geometry, as the trailing dimension = nprocs / (pi * pj). If you are running with a cubic mesh, and this is perfectly reasonable, then you need only specify the -l option in -l -m -n:
     mpirun -np 8 rb3D -l 100 -gi 2 -gj 2 -i 4 -r 2
runs a problem of size 100 x 100 x 100

The output file

rb3D reports the various options you have selected along with timings.

N: 512; Blocking: [32 x 32]; Processors: 16 = 2 x 8 x 1
Tgrind=   8.664e-09 s., 693 MF/s., Ttotal: 11.6 s. 
Compute Only: 823 MF/s., Ttotal: 9.79s. 
Added cost of commun.: 1.84 s.

                                                                  Local
       Blocking   Processors    Grind T   MF/s   Time  Comm    MF/s  Time
N=512 [ 32x 32] P= 16{ 2x 8x 1} 8.664e-09 693   11.63    1.84 823    9.79 

Timings (in sec.)
Total, Compute only

11.7	9.81
11.7	9.79
11.6	9.82

The first line of output reports the problem size (cubical), cache blocking factors (32 x 32), the number of processors (16), and the processor geometry (2 x 8 x 1). We employed a 2-dimensional partitioning, cutting the X axis into 2 parts, and the Y axis into 8 parts.

Line 2 reports performance. The grind time was 8.664e-09 seconds; this is the total time divided by the total number of mesh point updates. The megaflop rate was 823 Megaflops/second. The total running time was 11.6 seconds.

The program also runs the computation with communication shut off, and uses the timings with and without communication to indirectly measure the cost of communication by taking differences. (How is this different from measuring the cost directly?) The megaflop rate and compute time with communication shut off (823 MF/s., 9.79 s., respectively). Thus, communication added 1.84 seconds to the running time.

The above information is also reported in tabular form, and you can use grep(1) to conveniently collect the summary reports by looking for the pattern N=

The above statistics are reported for the best of nreps trials, that is, for the fastest total running time. The last bit of the output shows us the total and communication-free running times for all trials, permitting us to gauge our confidence in our measurements. (The program warms itself at the beginning of each trial up by running a few untimed iterations.)

Performance

Performance is complicated by the fact that

  1. it is sensitive to processor geometry;
  2. communication time and local computation time vary independently with the geometry; and
  3. the optimal geometry depends on N.

Thus, when comparing runs across different number of processors, or different problem sizes, it is necessary to use the optimal geometry in each case. Moreover, we should remember that unless the extents of the processor geometry divides N evenly, then there will be some loss of performance due to load imbalance. The first step is to observe how changing the processor geometry affects performance. Here are the running times for P=16 processors for the remaining geometries

				         ---- Local ---
Processors    Grind T   MF/s   Time    Comm  MF/s   Time
 1x 1x16      1.177e-08 510   15.80    2.41  601   13.39 
 1x16x 1      9.159e-09 655   12.29    2.43  816    9.87 
16x 1x 1      1.097e-08 547   14.72    8.33 1260    6.39 
 1x 4x 4      9.817e-09 611   13.18    0.90  656   12.27 
 4x 1x 4      8.898e-09 674   11.94    2.71  872    9.23 
 4x 4x 1      9.654e-09 621   12.96    2.77  790   10.19 
 1x 2x 8      1.083e-08 554   14.54    1.43  614   13.11 
 2x 1x 8      9.688e-09 619   13.00    1.21  683   11.79 
 2x 2x 4      9.542e-09 629   12.81    1.10  688   11.71 
 2x 4x 2      9.545e-09 629   12.81    1.07  686   11.74 
 4x 2x 2      8.940e-09 671   12.00    2.48  846    9.52 
 1x 8x 2      8.964e-09 669   12.03    1.39  757   10.64 
 8x 1x 2      9.727e-09 617   13.06    4.40  931    8.65 
 8x 2x 1      1.071e-08 560   14.37    4.43  810    9.95 
 2x 8x 1      8.664e-09 693   11.63    1.84  823    9.79 
The {2x8x1} geometry leads to the highest performance. Performance is is about 36% better than the worst case {1x1x16} geometry. As a side note, we observe that when we shut off communication, the optimal geometry changes to {16x1x1}.

The optimal geometry is also sensitive to N. Consider N=360, where the optimal decomposition is {1x4x4}, and we achieve 727 MFlops. If we use the optimal geometry for the N=500 case, performance drops to 626 Mflops.

				          --- Local ---
 Processors    Grind T   MF/s   Time    Comm  MF/s  Time
 1x 1x16       1.423e-08 422    6.64    1.19  513    5.45 
 1x16x 1       9.739e-09 616    4.54    1.26  853    3.28 
16x 1x 1       1.291e-08 465    6.02    3.64 1177    2.38 
 1x 4x 4       8.253e-09 727    3.85    0.64  872    3.21 
 4x 1x 4       9.959e-09 602    4.65    1.26  827    3.38 
 4x 4x 1       1.092e-08 549    5.10    1.30  737    3.80 
 1x 2x 8       9.270e-09 647    4.32    0.81  796    3.52 
 2x 1x 8       8.967e-09 669    4.18    0.77  819    3.42 
 2x 2x 4       9.857e-09 609    4.60    0.74  726    3.86 
 2x 4x 2       9.650e-09 622    4.50    0.56  711    3.94 
 4x 2x 2       1.030e-08 583    4.80    1.29  797    3.51 
 1x 8x 2       8.520e-09 704    3.97    0.72  859    3.26 
 8x 1x 2       1.128e-08 532    5.26    2.21  918    3.05 
 8x 2x 1       1.199e-08 500    5.60    2.17  816    3.43 
 2x 8x 1       9.587e-09 626    4.47    0.67  735    3.81 

Beware that if the number of processors P does not divide N evenly, then some processors may get assigned no work. Some programs (such as rb3D) may fail under certain circumstances. For example, if N=360, then 1-dimensional partitionings will be unable to balance the workloads evenly since 16 does not divide 360 evenly. More generally, for each dimension k you need to ensure that Pk and Nk satisfy the following condition:

   Nk > ceil[ Nk/Pk ] * (Pk-1)
where Pk is the extent of the processor geometry along dimension k, and Nk is the extent of the solution array along dimension k. (Too see the effect, you can take a look at the FloorPlan in rb3D's output.)

Note that you may use non-cubical domains to carry out a sanity check. You should be able to reproduce the parallel running time with communication turned off by running the reduced problem on one processor, which corresponds to what a single processor is assigned when the full problem is run in parallel. 126^3 mesh on 16 processors, then you should run a 126 x 126 x 8 problem on 1 processor and observe in theory the same running time as the 16-processor run with communication shut off.>

Running on the SP2 using batch

A set of load leveler files have also been provided in the subdirectory SP2_scripts, to run on 8, 16, 32, and 64 nodes. These files begin with the capital letter 'P' and end with the number of processors the job will be run on. Each load leveler file in turn invokes an executable script file. If the load leveler file is PXX, where XX is the number of processors, then the corresponding script file is pxx.script. You must modify this file to change the working directory and the command line options for rb3D appropriately. Submit the job using the llsubmit command.

If you have any problems, see the NPACI HotPage at http://hotpage.sdsc.edu for up-to-date machine information, or the URL http://www.npaci.edu/SP2 for online SP2 documentation.


Copyright © 1999, Scott B. Baden . Last modified: Mon Apr 5 21:59:13 PDT 1999