Hiding Communication Latency with Thyme

Jacob Sorensen and Scott B. Baden
Department of Computer Science and Engineering
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0404 USA

Technological factors are continuing to raise the cost of communication relative to computation, and will heavily influence how future scalable systems are programmed. Reformulating an algorithm to mask communication delays can be crucial in maintaining scalability, and will play an important role in using future Petascale systems. Unfortunately, the programming required to implementing overlap is a challenge, even for the expert.

We are developing Thyme, a run time library and programming methodology for implementing latency-tolerant algorithms using a graph-based formulation. Thyme avoids the classic problems of split-phase code and built-in scheduling assumptions that hamper performance robustness in the presence of technological change. Thyme employs a task precedence graph to represent a partial ordering over a logical grouping of the application's loop iteration spaces and is similar to dataflow. This model naturally accommodates the asynchronous behavior of latency tolerance strategies. Results from three applications show that Thyme reduces communication costs significantly.

Thyme relies on background run time services to realize dataflow semantics. The library is lightweight, and avoids the heavy machinery of thread migration, an alternative technique for supporting latency tolerance. These run time services commandeer one CPU per processing node. Despite this loss of a CPU to so called "non-productive work," Thyme is still able to realize a net gain in performance.

We ran on two large-scale systems: DataStar, located at the San Diego Supercomputer Center, and Thunder located at Lawrence Livermore National Laboratory. We used DataStar's 8-way Symmetric Multiprocessor (SMP) IBM P655+ nodes, with 1.5 GHz Power-4+ processors sharing 16 Gigabytes of memory. The Thunder system contains "Madison Tiger4" nodes, each with four Itanium2 CPUs running at 1.4 GHz and sharing 8 Gigabytes of memory.

We report results with 3 applications: Jacobi3D, a three-dimensional iterative Poisson solver (Dirichlet Boundary Conditions); SUMMA (van de Geijn and Watts), a parallel matrix multiply; and the NAS-FT parallel benchmark (version 3), which is dominated by a 3D Fast Fourier Transform.

For each application, we compared the Thyme implementation against explicit MPI coding. The baseline variant (BASE) uses blocking communication and does not attempt to overlap communication with computation. The explicit overlap variant (OLAP) uses asynchronous non-blocking communication to implement a split-phase algorithm to overlap communication with computation. The THYME variant uses Thyme to achieve overlap. To put results in perspective, we compare against the IDEAL running time, the time required to perform computational work only. This timing was obtained by disabling communication in BASE, and represents a "not to exceed" figure of merit.

Results for Jacobi3D and SUMMA appear as Fig. 1. Fig. 2 shows the results for NAS-FT. For Jacobi3D and SUMMA we ran on a fixed number of processors: 256. For NAS-FT we ran only on Thunder and give results on 64 and 128 processors. We are investigating scaling issues which we believe are due to memory hierarchy interactions. Due to the complexity of hand coding, we did not implement an OLAP version of NAS-FT.

We conclude from these results that Thyme is able to improve performance by realizing overlap, and that performance may exceed that of hand coding.

Thyme commandeers one CPU per processing node to perform its run time services. This CPU is not available to carry out computation. Despite this loss of a CPU to so called "non-productive work," Thyme is still able to improve performance.

Image jac_3d_ds_A Image jac_3d_th_A
Image summa_ds_256_A Image summa_th_256_A
Figure 1: Plots showing the benefits of using Thyme to overlap communication with computation in Jacobi3D and SUMMA, on Datastar and Thunder. The Thyme implementation of Jacobi3D used a 4×4 × 4 processor geometry on Thunder, and a 4 × 2 × 4 geometry on Datastar. The MPI Implementations used an 8 × 4 × 8 processor geometry. SUMMA used an 16 × 16 processor geometry. The OLAP variant of Jacobi3D was significantly slower than the BASE implementation, so we do not show those results in the plot. For SUMMA, the OLAP and BASE variants realize nearly identical performance, so we show the results for BASE only.

Image ft_64 Image ft_128
Figure 2: Floating point rate (GigaFlops) for the Ideal, Baseline, and Thyme variants of the NAS-FT benchmark for meshes of size 5123 (Class C), 1024 × 1024 × 512, 10243$ In all cases a 1-dimensional (non-optimal) processor geometry was used.

Acknowledgments: This work is supported by the United States Department of Energy under Award Number DE-FG02-05ER25706 and^ Prime Contract No. W-7405-ENG-48.