Tarragon


Parallel programming is necessary to solve large and computation-intensive problems. Parallelism is also the way an application achieves optimal performance on multi-core processors. However, parallel programming poses more challenges than serial programming. The goal of my research is to investigate a parallel programming model that eases the development of scientific and engineering applications and helps achieve optimal performance.

Tarragon is a parallel library that raises the level of abstraction by providing classes for tasks, inter-task communication, and dependency descriptions. The run-time structure of an application is described by a graph. The graph helps separate performance from correctness and other application-specific concerns. Tarragon encapsulates run-time services which are able to analyze the graph and to discover opportunities for dynamic optimizations, such as overlap of computation with communication. Applications that would benefit from dynamic scheduling, dynamic load balancing, and fine-grain communication are the target of Tarragon.

Tarragon applications
I am collaborating with neuroscientists in the Computational Neurobiology Lab at the Salk Institute (Tom Bartol and Terry Sejnowski) and the Pittsburgh Supercomputing Center (Joel Stiles) on MCell, a cell microphysiology simulator that models molecular diffusion using random walks. The current parallel version, called MCell-K, runs under bulk synchronous parallelism (BSP) and is the result of previous collaborations between Prof. Baden's research group and the MCell group. By collecting traces on test cases of interest, I was able to reveal the heavy load imbalance that often characterizes MCell models. In presence of load imbalance, MCell-K frequent synchronizations are a performance bottleneck and exacerbate the effect of uneven load distribution. By running a load balancer offline I also discovered that redistributions greatly improve performance but also increase communication. A combination of load balancing and techniques for overlapping communication with computation is the most desirable solution for this case. The first figure shows the load trace and compares measured load against load balanced (every 100 steps) and the average load (optimal load distribution). The second figure shows the communication trace and the increase in communication when load balancing is active.

load distributioncommunication trace graph

The first prototype of Tarragon that I developed supports asynchronous fine-grain communication. Experiments with Cells, a synthetic benchmark simulating Brownian motion in two dimensions that resembles the algorithms of MCell showed that Tarragon overlaps communication with computation and so hides a significant amount of the communication cost. The following graph shows the comparison between Cells running with Tarragon in synchronous and asynchronous mode.

Tarragon prototype and Cells

I am also collaborating Dr.Xiaoye Sherry Li, a computer scientist of the Computational Research Division at the Lawrence Berkeley National Lab. During a summer internship at Lawrence Berkeley National Lab I had the opportunity to work with her and study the implementation of SuperLU, a direct solver for sparse linear systems based on LU factorization. In that time I also developed a simulator that accurately reproduces and predicts the cost of factorization. The goal is to use the simulator to evaluate possible optimizations and the benefits of graph-based scheduling before applying Tarragon to SuperLU_DIST. The following table shows preliminary results when simulating the current SuperLU_DIST implementation on several input matrixes and running on one processor.

Matrix
Sim time (sec)
Real time (sec)
bbmat
64
67
circuit
0.45
0.45
ecl32
90
100
g7kac200
137
179
inv-extrusion
66
80
mixing-tank
101
113
rajat24
7.9
8.2
stomach
187
208
torso1
30
32
twotone
141
197