Parallel programming is necessary to solve large
and computation-intensive problems. Parallelism is also the way an
application achieves optimal performance on multi-core processors.
However,
parallel programming poses more challenges than serial programming. The
goal of
my research is to investigate a parallel programming model that eases
the
development of scientific and engineering applications and helps
achieve optimal performance.
Tarragon is a parallel library that raises the
level of
abstraction by providing classes for tasks, inter-task communication,
and dependency descriptions. The run-time structure of an application
is
described
by a graph. The graph helps separate performance from correctness and
other
application-specific concerns. Tarragon encapsulates run-time services
which
are able to analyze the graph and to discover opportunities for dynamic
optimizations, such as overlap of computation with communication.
Applications
that would benefit from dynamic scheduling, dynamic load balancing,
and fine-grain communication are the target of Tarragon.
Tarragon applications
I am collaborating with neuroscientists in the Computational Neurobiology Lab
at the Salk Institute (Tom
Bartol and
Terry
Sejnowski)
and the Pittsburgh Supercomputing Center (Joel
Stiles) on MCell, a
cell microphysiology simulator that
models molecular diffusion using random walks.
The current parallel version, called MCell-K,
runs under bulk synchronous parallelism (BSP) and is the result of
previous
collaborations
between Prof. Baden's research group and the MCell
group. By collecting traces on test cases of interest, I was able to
reveal the
heavy load imbalance that often characterizes MCell
models. In presence of load imbalance, MCell-K
frequent synchronizations are a performance bottleneck and exacerbate
the
effect of uneven load distribution. By running a load balancer offline
I also
discovered that redistributions greatly improve performance but also
increase
communication. A combination of load balancing and techniques for
overlapping
communication with computation is the most desirable solution for this
case. The first figure shows the load trace and compares measured load
against load balanced (every 100 steps) and the average load (optimal
load distribution). The second figure shows the communication trace and
the increase in communication when load balancing is active.
The first prototype of Tarragon that I developed
supports asynchronous fine-grain communication.
Experiments with Cells, a synthetic benchmark simulating Brownian
motion in two dimensions that resembles the algorithms
of MCell
showed that Tarragon overlaps communication with computation and so
hides a
significant amount of the communication cost. The following graph shows
the comparison between Cells running with Tarragon in synchronous and
asynchronous mode.
I am also collaborating Dr.Xiaoye
Sherry Li,
a computer scientist of the Computational
Research Division
at the Lawrence Berkeley National Lab.
During a summer internship at Lawrence Berkeley
National Lab I had the
opportunity to work with her and study the implementation of
SuperLU, a direct solver
for sparse linear systems based on LU factorization. In that time I
also developed a simulator that accurately
reproduces and predicts the cost of factorization.
The goal is to use the simulator to evaluate possible optimizations
and the benefits of graph-based scheduling
before applying Tarragon to SuperLU_DIST. The following table shows
preliminary results when simulating the current SuperLU_DIST
implementation on several input matrixes and running on one processor.