rb3D supports various command line options to configure processor geometries and so on, as follows:
-l ni -m nj -n nk Integer bounds of the computational mesh: ni x nj x nk -N nijk The single bound of a cubic mesh (These options are mutually exclusive) -gi pi -gj pj Leading dimensions of the processor geometry are pi x pj -i niter (Integer) number of iterations -r reps (Integer) number of repetitions of the run -f freq frequency of convergence check in iterations -e epsilson convergence threshold (double) -si bi -sj bj cache blocking factors for i & j axes
Thus, if you are using mpirun to execute MPI jobs, then the command line
mpirun -np 16 rb3D -l 32 -n 64 -n 128 -gi 4 -gj 2 -i 5 -r 3
will run rb3D on 16 processors over a 32 x 64 x 128
mesh for 5 iterations. (Note that the -np nprocs option of
mpirun determines the number of processors to run on, and that
the command line option to rb3D defines the problem size.)
The computation will be run 3 times.
The processors will be configured in a 4 x 2 x 2
geometry, as the trailing dimension = nprocs / (pi * pj).
If you are running with a cubic mesh, and this is perfectly reasonable,
then you need only specify the -l option in -l -m -n:
mpirun -np 8 rb3D -l 100 -gi 2 -gj 2 -i 4 -r 2
runs a problem of size 100 x 100 x 100
rb3D reports the various options you have selected along with timings.
N: 512; Blocking: [32 x 32]; Processors: 16 = 2 x 8 x 1
Tgrind= 8.664e-09 s., 693 MF/s., Ttotal: 11.6 s.
Compute Only: 823 MF/s., Ttotal: 9.79s.
Added cost of commun.: 1.84 s.
Local
Blocking Processors Grind T MF/s Time Comm MF/s Time
N=512 [ 32x 32] P= 16{ 2x 8x 1} 8.664e-09 693 11.63 1.84 823 9.79
Timings (in sec.)
Total, Compute only
11.7 9.81
11.7 9.79
11.6 9.82
The first line of output reports the problem size (cubical), cache blocking factors (32 x 32), the number of processors (16), and the processor geometry (2 x 8 x 1). We employed a 2-dimensional partitioning, cutting the X axis into 2 parts, and the Y axis into 8 parts.
Line 2 reports performance.
The grind time was 8.664e-09 seconds;
this is the total time divided by the total number of mesh point
updates. The megaflop rate was 823 Megaflops/second. The total
running time was 11.6 seconds.
The program also runs the computation with communication shut off,
and uses the timings with and without communication to indirectly
measure the cost of communication by taking differences.
(How is this different from measuring the cost directly?)
The megaflop rate and compute time with communication shut off
(823 MF/s., 9.79 s., respectively). Thus, communication
added
1.84 seconds to the running time.
The above information is also reported in tabular form,
and you can use grep(1) to conveniently collect the
summary reports by looking for the pattern N=
The above statistics are reported for the best of nreps
trials, that is, for the fastest total running time.
The last bit of the output shows us the total and communication-free
running times for all trials, permitting us to gauge our confidence in our
measurements.
(The program warms itself at the beginning
of each trial up by running a few untimed iterations.)
Performance is complicated by the fact that
Thus, when comparing runs across different number of processors,
or different problem sizes, it is necessary to use the optimal
geometry in each case. Moreover, we should remember that unless
the extents of the processor geometry divides N evenly, then
there will be some loss of performance due to load imbalance.
The first step is to observe how changing the processor geometry
affects performance. Here are the running times for P=16 processors
for the remaining geometries
The optimal geometry is also sensitive to N. Consider N=360, where
the optimal decomposition is {1x4x4}, and we achieve
727 MFlops.
If we use the optimal geometry for the N=500 case, performance drops
to 626 Mflops.
Beware that if the number of processors P does not divide
N evenly, then some processors may get assigned no work.
Some programs (such as rb3D) may fail under certain circumstances.
For example, if N=360, then 1-dimensional partitionings will be
unable to balance the workloads evenly since
16 does not divide 360 evenly.
More generally, for each dimension k
you need to ensure that
Pk and Nk
satisfy the following condition:
Note that you may use non-cubical domains to carry out a sanity
check. You should be able to reproduce the parallel running
time with communication turned off by running the reduced problem
on one processor, which corresponds to what a single processor
is assigned when the full problem is run in parallel. 126^3 mesh on 16 processors, then
you should run a 126 x 126 x 8 problem on 1 processor and observe in
theory the same running time as the 16-processor run with communication
shut off.>
A set of load leveler files have also been provided in the subdirectory
SP2_scripts, to run on 8, 16, 32, and 64 nodes.
These files begin with the capital letter 'P' and end
with the number of processors the job will be run on. Each load leveler
file in turn invokes an executable script file. If the
load leveler file is PXX, where XX is the number of processors,
then the corresponding script file is pxx.script.
You must modify this file to change the working directory
and the command line options for rb3D appropriately. Submit the job
using the llsubmit command.
If you have any problems, see the NPACI HotPage at
http://hotpage.sdsc.edu
for up-to-date machine information,
or the URL http://www.npaci.edu/SP2
for online SP2 documentation.
Performance
---- Local ---
Processors Grind T MF/s Time Comm MF/s Time
1x 1x16 1.177e-08 510 15.80 2.41 601 13.39
1x16x 1 9.159e-09 655 12.29 2.43 816 9.87
16x 1x 1 1.097e-08 547 14.72 8.33 1260 6.39
1x 4x 4 9.817e-09 611 13.18 0.90 656 12.27
4x 1x 4 8.898e-09 674 11.94 2.71 872 9.23
4x 4x 1 9.654e-09 621 12.96 2.77 790 10.19
1x 2x 8 1.083e-08 554 14.54 1.43 614 13.11
2x 1x 8 9.688e-09 619 13.00 1.21 683 11.79
2x 2x 4 9.542e-09 629 12.81 1.10 688 11.71
2x 4x 2 9.545e-09 629 12.81 1.07 686 11.74
4x 2x 2 8.940e-09 671 12.00 2.48 846 9.52
1x 8x 2 8.964e-09 669 12.03 1.39 757 10.64
8x 1x 2 9.727e-09 617 13.06 4.40 931 8.65
8x 2x 1 1.071e-08 560 14.37 4.43 810 9.95
2x 8x 1 8.664e-09 693 11.63 1.84 823 9.79
The {2x8x1} geometry leads to the highest performance. Performance is
is about 36% better than the worst case {1x1x16} geometry.
As a side note, we observe that when we shut off communication,
the optimal geometry changes to
{16x1x1}.
--- Local ---
Processors Grind T MF/s Time Comm MF/s Time
1x 1x16 1.423e-08 422 6.64 1.19 513 5.45
1x16x 1 9.739e-09 616 4.54 1.26 853 3.28
16x 1x 1 1.291e-08 465 6.02 3.64 1177 2.38
1x 4x 4 8.253e-09 727 3.85 0.64 872 3.21
4x 1x 4 9.959e-09 602 4.65 1.26 827 3.38
4x 4x 1 1.092e-08 549 5.10 1.30 737 3.80
1x 2x 8 9.270e-09 647 4.32 0.81 796 3.52
2x 1x 8 8.967e-09 669 4.18 0.77 819 3.42
2x 2x 4 9.857e-09 609 4.60 0.74 726 3.86
2x 4x 2 9.650e-09 622 4.50 0.56 711 3.94
4x 2x 2 1.030e-08 583 4.80 1.29 797 3.51
1x 8x 2 8.520e-09 704 3.97 0.72 859 3.26
8x 1x 2 1.128e-08 532 5.26 2.21 918 3.05
8x 2x 1 1.199e-08 500 5.60 2.17 816 3.43
2x 8x 1 9.587e-09 626 4.47 0.67 735 3.81
Nk > ceil[ Nk/Pk ] * (Pk-1)
where
Pk is the extent of the processor geometry
along dimension k, and Nk is the extent
of the solution array along dimension k.
(Too see the effect, you can take a look at the FloorPlan in rb3D's
output.)
Running on the SP2 using batch
Copyright © 1999,
Scott B. Baden .
Last modified:
Mon Apr 5 21:59:13 PDT 1999