Technological factors are continuing to raise the cost of communication relative to computation, and will heavily influence how future scalable systems are programmed. Reformulating an algorithm to mask communication delays can be crucial to maintaining scalability, and will play an important role in using future Petascale systems. Unfortunately, the programming required to implement overlap is a challenge, even for the expert.
We are developing Thyme, a run time library and programming methodology for implementing latency-tolerant algorithms using a graph-based formulation. Thyme avoids the classic problems of split-phase code and built-in scheduling assumptions that hamper performance robustness in the presence of technological change. Thyme employs a task precedence graph, similar to a dataflow graph, to represent a partial ordering over a logical grouping of the application's loop iteration spaces. This model naturally accommodates the asynchronous behavior of latency tolerance strategies. Results from three applications show that Thyme reduces communication costs significantly.
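Thyme's actual interface is not reproduced here; the following Python sketch (all names are invented for illustration) conveys the dataflow-style execution that a task precedence graph enables: a task fires only once all of its predecessors have completed.

```python
from collections import deque

def run_task_graph(tasks, deps):
    """Execute tasks respecting a precedence graph.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> set of prerequisite task names
    Returns the order in which tasks fired.
    """
    indegree = {t: len(deps.get(t, ())) for t in tasks}
    successors = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            successors[p].append(t)

    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                 # fire: all of t's inputs are available
        order.append(t)
        for s in successors[t]:
            indegree[s] -= 1
            if indegree[s] == 0:   # last prerequisite just completed
                ready.append(s)
    if len(order) != len(tasks):
        raise RuntimeError("cycle in task graph")
    return order

# Example: one step of a halo exchange. The boundary update must wait
# for the incoming halo; the interior update has no prerequisites.
fired = []
tasks = {
    "recv_halo": lambda: fired.append("recv_halo"),
    "interior":  lambda: fired.append("interior"),
    "boundary":  lambda: fired.append("boundary"),
}
deps = {"boundary": {"recv_halo"}}
order = run_task_graph(tasks, deps)
```

Because "interior" has no prerequisites, a run time is free to execute it while the transfer that "boundary" depends on is still in flight; this is the asynchrony that latency tolerance strategies exploit.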
Thyme relies on background run time services to realize dataflow semantics. The library is lightweight, and avoids the heavy machinery of thread migration, an alternative technique for supporting latency tolerance. These run time services commandeer one CPU per processing node. Despite this loss of a CPU to so-called "non-productive work," Thyme is still able to realize a net gain in performance.
We ran on two large-scale systems: DataStar, located at the San Diego Supercomputer Center, and Thunder, located at Lawrence Livermore National Laboratory. We used DataStar's 8-way Symmetric Multiprocessor (SMP) IBM P655+ nodes, with 1.5 GHz Power-4+ processors sharing 16 Gigabytes of memory. The Thunder system contains "Madison Tiger4" nodes, each with four Itanium2 CPUs running at 1.4 GHz and sharing 8 Gigabytes of memory.
We report results with three applications: Jacobi3D, a three-dimensional iterative Poisson solver (Dirichlet boundary conditions); SUMMA (van de Geijn and Watts), a parallel matrix multiply; and the NAS-FT parallel benchmark (version 3), which is dominated by a 3D Fast Fourier Transform.
For each application, we compared the Thyme implementation against explicit MPI coding. The baseline variant (BASE) uses blocking communication and makes no attempt to overlap communication with computation. The explicit overlap variant (OLAP) uses asynchronous, non-blocking communication in a split-phase algorithm that overlaps communication with computation. The THYME variant uses Thyme to achieve overlap. To put the results in perspective, we compare against the IDEAL running time, the time required to perform the computational work only. This timing was obtained by disabling communication in BASE, and represents a "not to exceed" figure of merit.
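As background, the split-phase structure that OLAP hand-codes with non-blocking MPI primitives (initiate a transfer, do independent work, complete the transfer when its result is needed) can be sketched in Python; the function bodies and timings below are invented stand-ins for illustration, not the paper's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_halo():
    time.sleep(0.2)       # stand-in for network transfer latency
    return [0.0] * 8      # boundary values from a neighbor

def compute_interior():
    time.sleep(0.2)       # stand-in for interior stencil updates
    return sum(range(1000))

# BASE: blocking -- wait for the data, then compute.
t0 = time.perf_counter()
halo = fetch_halo()
interior = compute_interior()
base_time = time.perf_counter() - t0

# OLAP: split-phase -- initiate the transfer, overlap independent work,
# and complete the transfer only when its result is needed.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch_halo)   # "initiate" phase (cf. MPI_Irecv)
    interior = compute_interior()       # independent work proceeds
    halo = pending.result()             # "complete" phase (cf. MPI_Wait)
olap_time = time.perf_counter() - t0
```

Since the transfer and the independent computation proceed concurrently, the split-phase version finishes in roughly the time of the longer of the two, rather than their sum; the cost is the code restructuring that Thyme aims to automate.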
Results for Jacobi3D and SUMMA appear in Fig. 1; Fig. 2 shows the results for NAS-FT. For Jacobi3D and SUMMA we ran on a fixed configuration of 256 processors. For NAS-FT we ran only on Thunder and give results on 64 and 128 processors; we are investigating scaling issues that we believe are due to memory hierarchy interactions. Owing to the complexity of hand coding, we did not implement an OLAP version of NAS-FT.
We conclude from these results that Thyme is able to improve performance by realizing overlap, and that performance may exceed that of hand coding.
Acknowledgments: This work is supported by the United States Department of Energy under Award Number DE-FG02-05ER25706 and Prime Contract No. W-7405-ENG-48.