CSE 260 Project ideas
This list will grow, so check back frequently.
- Multigrid (MPI or CUDA).
Multigrid is a multi-level method that uses a progression
of meshes to accelerate the solution of equations.
Reference material includes:
- Multigrid tutorial slides by William Briggs
- MGNet, a repository for information related to multigrid, multilevel, multiscale,
aggregation, defect correction, and domain decomposition methods.
In implementing multigrid in MPI, keep a couple of points in mind.
First, a 3D multigrid solver should spend most of its running time on the finest mesh,
so you'll want to be able to ascribe computational costs on a per-level basis to verify this.
At some point the mesh will become sufficiently coarse that you will need
to perform all computation on a single core: there simply isn't enough work
to keep processors busy, and communication overheads will be too high
for parallelism to reduce the running time.
More generally, as you recurse to coarser levels, reduce the amount of
parallelism incrementally.
The reason is that as the levels get coarser, there is less and less
computational work per core, and communication overhead can become a problem.
This strategy makes the communication patterns more interesting: when you coarsen a mesh you
must send it to the subset of processors that will work at the next coarser level,
and similarly, when you perform coarse grid correction you'll distribute
the data back to a larger subset of processors before
interpolating the correction onto the finer mesh.
A simple strategy for reducing parallelism
is to use 1/8 the number of processors
at each successively coarser level, and to increase the number
of processors by a factor of 8 each time you return to a finer level;
a sketch of this scheme appears below.
You can experiment with other strategies if your communication overheads
are too high.
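What follows is a minimal sketch, in C with MPI, of the factor-of-8 reduction scheme just described. It assumes a power-of-8 number of ranks and groups children by consecutive rank number; a real 3D code would instead group the 8 ranks owning each 2x2x2 block of adjacent subdomains. The numerical work (smoothing, restriction, interpolation) is left as comments.

    /* Sketch: shrink the active rank set by 8x per coarser level.
     * Assumes nprocs is a power of 8; ranks 0..active-1 work on the
     * current level, and each surviving rank collects the coarsened
     * subdomains of 8 children. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int active = nprocs;   /* ranks 0..active-1 own this level */
        int level = 0;
        while (active > 1) {
            if (rank < active) {
                /* ...smooth, form the residual, restrict locally... */
                int parent = rank / 8;   /* owner at the coarser level */
                if (parent != rank) {
                    /* MPI_Send(coarse_block, nc, MPI_DOUBLE, parent,
                     *          level, MPI_COMM_WORLD); */
                }
            }
            active /= 8;
            if (rank < active) {
                for (int i = 0; i < 8; i++) {
                    int child = 8 * rank + i;
                    if (child != rank) {
                        /* MPI_Recv(coarse_of[child], nc, MPI_DOUBLE,
                         *          child, level, MPI_COMM_WORLD,
                         *          MPI_STATUS_IGNORE); */
                    }
                }
            }
            level++;
        }
        if (rank == 0) {
            /* Solve the remaining coarse levels serially, then reverse
             * the process: scatter corrections to 8x more ranks at each
             * finer level and interpolate. */
            printf("serial coarse solve after %d parallel levels\n", level);
        }
        MPI_Finalize();
        return 0;
    }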
- Communication-avoiding matrix multiplication (MPI or UPC).
Teams of 3 should also implement latency hiding; the double-buffering
sketch below shows the basic idiom.
For references, visit the Berkeley BeBOP project web site: http://bebop.cs.berkeley.edu/#pubs
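The following is not the communication-avoiding algorithm itself, only a minimal sketch of the latency-hiding idiom: double buffering with nonblocking MPI, shown on a 1D ring shift of A blocks (block size N and the dummy data are illustrative). A 2.5D or SUMMA-style code would apply the same pattern to each of its shifts or broadcasts.

    /* Overlap the shift of the next A block with the multiply of the
     * current one, using two buffers and nonblocking sends/receives. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 128   /* assumed local block dimension */

    static void local_multiply(const double *A, const double *B, double *C) {
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)
                    C[i*N + j] += A[i*N + k] * B[k*N + j];
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, P;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        int left = (rank + P - 1) % P, right = (rank + 1) % P;

        double *A[2], *B, *C;
        A[0] = calloc(N*N, sizeof(double));   /* block being multiplied */
        A[1] = calloc(N*N, sizeof(double));   /* block being prefetched */
        B    = calloc(N*N, sizeof(double));
        C    = calloc(N*N, sizeof(double));

        int cur = 0;
        for (int step = 0; step < P; step++) {
            MPI_Request req[2];
            if (step + 1 < P) {
                /* Start moving the next block while we compute. */
                MPI_Irecv(A[1-cur], N*N, MPI_DOUBLE, right, 0,
                          MPI_COMM_WORLD, &req[0]);
                MPI_Isend(A[cur],   N*N, MPI_DOUBLE, left,  0,
                          MPI_COMM_WORLD, &req[1]);
            }
            local_multiply(A[cur], B, C);   /* overlaps the shift */
            if (step + 1 < P) {
                MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
                cur = 1 - cur;              /* swap buffers */
            }
        }
        free(A[0]); free(A[1]); free(B); free(C);
        MPI_Finalize();
        return 0;
    }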
- Perform a performance qualification of Fermi, and compare with published
results for the 200-series GPUs.
For example, study cache behavior. (There is a cache benchmark available;
a sketch of the underlying technique follows.)
If you implement matrix multiplication, consider the previous results described in
V. Volkov and J. Demmel,
"Benchmarking GPUs to Tune Dense Linear Algebra,"
Proc. 2008 ACM/IEEE Conference on Supercomputing.
There are many other applications; consult these
GPU resources
for your options.
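The provided cache benchmark is not reproduced here; below is a generic sketch of the pointer-chasing technique such benchmarks use, written for the host CPU in plain C. A GPU cache study would move the same dependent-load loop into a CUDA kernel and time it on the device; buffer sizes and iteration counts are illustrative.

    /* For each buffer size, chase a chain of dependent loads; the time
     * per load jumps when the buffer stops fitting in a cache level. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        for (size_t kb = 4; kb <= 32 * 1024; kb *= 2) {
            size_t n = kb * 1024 / sizeof(void *);
            void **buf = malloc(n * sizeof(void *));
            size_t *idx = malloc(n * sizeof(size_t));

            /* Build one random cycle through all n slots (Sattolo's
             * algorithm) so hardware prefetchers can't guess the walk. */
            for (size_t i = 0; i < n; i++) idx[i] = i;
            for (size_t i = n - 1; i > 0; i--) {
                size_t j = (size_t)rand() % i;
                size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
            }
            for (size_t i = 0; i < n; i++)
                buf[idx[i]] = &buf[idx[(i + 1) % n]];

            /* Each load depends on the previous one, so the loop time
             * is dominated by memory latency at this footprint. */
            void **p = buf;
            const long iters = 20 * 1000 * 1000;
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < iters; i++)
                p = (void **)*p;
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (double)(t1.tv_nsec - t0.tv_nsec);
            /* Printing p keeps the compiler from deleting the loop. */
            printf("%7zu KB: %5.2f ns/load  (%p)\n",
                   kb, ns / iters, (void *)p);
            free(buf); free(idx);
        }
        return 0;
    }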
- Scalable 3D Fast Fourier Transform (MPI or UPC).
The Fast Fourier Transform is widely used in signal
processing and also in solving numerical problems.
Implement the 3D FFT using a 2D data decomposition;
a sketch of the decomposition and its transpose steps appears below.
Groups of 3 should also overlap communication with computation.
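Here is a minimal sketch, in C with MPI, of the 2D "pencil" decomposition: the two transposes are all-to-alls within row and column communicators, while the 1D FFT stages are local and left as comments (a real code would call a library such as FFTW). Grid sizes, the process-grid factorization, and the divisibility of the counts are all assumptions.

    /* An Nx x Ny x Nz grid split over a Pr x Pc process grid, so each
     * rank owns whole pencils along one axis at a time. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Choose a Pr x Pc grid (assumed: nprocs factors cleanly). */
        int dims[2] = {0, 0};
        MPI_Dims_create(nprocs, 2, dims);
        int Pr = dims[0], Pc = dims[1];
        int myrow = rank / Pc, mycol = rank % Pc;

        MPI_Comm row_comm, col_comm;   /* for the two transposes */
        MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

        /* Local pencil: all of x, Ny/Pr of y, Nz/Pc of z (assumed
         * divisible); factor of 2 for complex doubles. */
        const int Nx = 64, Ny = 64, Nz = 64;
        int local = Nx * (Ny / Pr) * (Nz / Pc) * 2;
        double *a = calloc(local, sizeof(double));
        double *b = calloc(local, sizeof(double));

        /* 1. FFT along x on each local pencil (library call here). */

        /* 2. Transpose x<->y pencils: all-to-all among the Pc ranks in
         *    this row, then a local reshuffle into y-pencil order. */
        MPI_Alltoall(a, local / Pc, MPI_DOUBLE,
                     b, local / Pc, MPI_DOUBLE, row_comm);

        /* 3. FFT along y, then transpose y<->z within the column. */
        MPI_Alltoall(b, local / Pr, MPI_DOUBLE,
                     a, local / Pr, MPI_DOUBLE, col_comm);

        /* 4. FFT along z; the result is in a permuted layout that a
         *    real code would track or undo. */

        free(a); free(b);
        MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
        MPI_Finalize();
        return 0;
    }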
- Scalable particle simulation (MPI, MPI+OpenMP, UPC).
Various inefficient implementations are available.
The provided implementation won't scale; you'll need to spatially
sort the particles into a mesh of cells to avoid O(N^2)
searches for neighbors (see the binning sketch below). You'll also have to balance the workloads,
likely using a cyclic mapping. These techniques
are described in
Lectures 14 and 15 of CSE 260, Fall 2009. There is also a reader on
work sharing and data decomposition (CSE 260, Fall 2010; the
link to the reader from the CSE 260, Fall 2009 course is broken).
If you have time, implement orthogonal recursive bisection.
Show that you can scale to 64 cores.
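Below is a minimal serial sketch of the binning technique: particles are hashed into cells whose width equals the interaction cutoff, so each particle's candidate partners lie in its own cell or the adjacent ones. Shown in 2D for brevity (3D scans 27 cells); all names are illustrative, and in the MPI version each rank would bin only its own subdomain.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { double x, y; } particle_t;

    /* Linked-list bins: head[b] is the first particle in bin b (or -1)
     * and next[i] chains the rest. */
    typedef struct {
        int    nbins;   /* bins per side */
        double cell;    /* cell width = interaction cutoff */
        int   *head;
        int   *next;
    } bins_t;

    /* Rebuild the bins in O(N); done once per timestep. */
    static void bin_particles(bins_t *g, const particle_t *p, int n) {
        for (int b = 0; b < g->nbins * g->nbins; b++) g->head[b] = -1;
        for (int i = 0; i < n; i++) {
            int bx = (int)(p[i].x / g->cell);
            int by = (int)(p[i].y / g->cell);
            int b  = by * g->nbins + bx;
            g->next[i] = g->head[b];     /* push i onto bin b's list */
            g->head[b] = i;
        }
    }

    /* Visit candidate partners of particle i by scanning only the 3x3
     * block of cells around it: O(N) total per step at uniform density,
     * versus O(N^2) for all pairs. The actual distance check belongs in
     * the visit callback (i.e., the force routine). */
    static void for_each_neighbor(const bins_t *g, const particle_t *p,
                                  int i, void (*visit)(int, int)) {
        int bx = (int)(p[i].x / g->cell);
        int by = (int)(p[i].y / g->cell);
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                int cx = bx + dx, cy = by + dy;
                if (cx < 0 || cy < 0 || cx >= g->nbins || cy >= g->nbins)
                    continue;
                for (int j = g->head[cy * g->nbins + cx]; j >= 0;
                     j = g->next[j])
                    if (j != i) visit(i, j);
            }
    }

    static void print_pair(int i, int j) { printf("%d near %d\n", i, j); }

    int main(void) {
        particle_t p[] = {{0.1, 0.1}, {0.2, 0.15}, {0.9, 0.9}};
        int n = 3;
        bins_t g = {4, 0.25, NULL, NULL};  /* 4x4 cells, cutoff 0.25 */
        g.head = malloc(g.nbins * g.nbins * sizeof(int));
        g.next = malloc(n * sizeof(int));
        bin_particles(&g, p, n);
        for_each_neighbor(&g, p, 0, print_pair);  /* finds particle 1 */
        free(g.head); free(g.next);
        return 0;
    }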
- Performance comparison of OpenMP and pthreads, using various
benchmarks; a sketch of one candidate benchmark kernel appears below.
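As an illustration of what such a comparison might time, here is a minimal sketch of the same dot-product kernel run under OpenMP and under pthreads, so any gap reflects the threading machinery rather than the arithmetic. Thread count and problem size are arbitrary choices; compile with -fopenmp -pthread.

    #include <omp.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)   /* illustrative problem size */
    #define T 4           /* illustrative thread count */

    static double *x, *y;
    static double partial[T];

    /* pthreads version: each thread reduces a contiguous slice. */
    static void *dot_slice(void *arg) {
        long t = (long)arg;
        double s = 0.0;
        for (long i = t * (N / T); i < (t + 1) * (N / T); i++)
            s += x[i] * y[i];
        partial[t] = s;
        return NULL;
    }

    int main(void) {
        x = malloc(N * sizeof(double));
        y = malloc(N * sizeof(double));
        for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        /* OpenMP version: one pragma does the split and reduction. */
        double s_omp = 0.0;
        double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+ : s_omp) num_threads(T)
        for (long i = 0; i < N; i++)
            s_omp += x[i] * y[i];
        double t_omp = omp_get_wtime() - t0;

        /* pthreads version: explicit create/join, manual reduction. */
        pthread_t tid[T];
        t0 = omp_get_wtime();
        for (long t = 0; t < T; t++)
            pthread_create(&tid[t], NULL, dot_slice, (void *)t);
        double s_pth = 0.0;
        for (long t = 0; t < T; t++) {
            pthread_join(tid[t], NULL);
            s_pth += partial[t];
        }
        double t_pth = omp_get_wtime() - t0;

        printf("OpenMP:   %.3f s (sum %.0f)\n", t_omp, s_omp);
        printf("pthreads: %.3f s (sum %.0f)\n", t_pth, s_pth);
        free(x); free(y);
        return 0;
    }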