## CSE 262, Lecture 8 (1/31/02): Structured Adaptive Mesh Refinement

As we have seen from our study of partial differential equations, diverse physical phenomena exhibit spatial and temporal locality. Included are gravitation attraction, coulombic interaction of charged particles, diffusion of energy, and the propagation of waves. These phenomena often exhibit fine structures that are unpredictable and irregular, and their accurate modeling is important economically. Turbulence, for example, that occurs in the wake of a Boeing 747, or the flow of oil and water in an oil field are examples of such phenomena. Shock waves that accompany a sonic boom are another.

Irregular phenomena are often highly localized, and exhibit large derivatives that are difficult to capture numerically. From our earlier study of finite methods we know that the error terms in a finite different approximation are polynomials in the mesh spacing h , and the coefficients are the derivatives of the function we are computing. Up to this point, we have implicitly assumed that the function is "nice;" that is it has finite, bounded derivatives of all orders. However, this is not always the case. A shock wave emanating from an aircraft moving at supersonic speeds actually exhibits discontinuities or jumps were strictly speaking, the derivatives are not defined. In other cases, we may singularities due to the presence of point charges; the potential function varies as 1/r, and the derivative, ln r is unbounded as r -> 0.

Consider, for example, the 1D Poisson equation

```u''(x) = 1 / | x | on the intervals [-1.0, 0.0) and (0.0,1.0],
u''(x)  = C at x=0, where C is an appropriately defined constant

u(-1.0) = u(1.0) = K, where K is an appropriately defined constant
```
The various derivatives of u(x) are exceedingly large at x=h the mesh spacing, and grow as h -> 0. Although our solution might be accurate to O(h2), the factor in front of the h2 term is so large that we cannot get an accurate answer in practice.

To capture functions that do not have "nice" derivatives we must reduce the spacing of the finite difference mesh. Unfortunately, we might have to refine the mesh by as much as a factor of 50 or 100 (or even more depending on the initial data). While this requirement may not be problematic in one dimension it almost certainly will be in two or more spatial dimensions. A factor of 100 increase in the linear mesh spacing would require that the number of unknowns in a 3-dimensional equation grow by a factor of 106, and for a time dependent problem we would also have to decrease the time-step by a factor of 100 resulting in a 100 million-fold increase in the computational effort!

But this is only part of the story. Clearly we are wasting our effort if the solution is smooth over large portions of the problem domain. In fact, for a given amount of resources we can improve accuracy dramatically by adjusting the mesh spacing according to the estimate error of the solution, which is often measured in terms of its derivatives. Thus, a non-uniform mesh not only reduces the cost of the computation, but often more importantly it reduces memory consumption significantly.

In this lecture we will discuss an adaptive mesh refinement algorithm that was developed by Berger and Oliger ( J. Computational Physics, 53:484-512, 1984) and later refined by Berger and Colella ( J. Computational Physics, 82(1):64-84, May 1989.) It is based in part on Bolstad's earlier work in one dimension. This algorithm is known by various names including the "Berger-Oliger-Colella" algorithm or "patch based AMR." Another important type of AMR, based on quad- or oct-trees, won't be discussed here. We'll consider tree-like refinement structures in the context of particle methods instead later in this lecture.

#### The Berger-Oliger-Colella algorithm

Here is a picture of a not-so-nice function. On close examination,  the mesh spacing varies according to an estimate of the error. Note how the mesh points are spaced close together in the vicinity of the cusp.

Two issues arise in implementing a non-uniform mesh

1. How do we decide on the local mesh spacing?
2. How do we handle a point that sees different spacings for its left and right neighboring points?

Our solution to #1 entails two sub-steps:

1. Error estimation
2. Grid generation

Our solution to #2 entails some special fix-up procedures to ensure that the solution seen at such transitional points is consistent as viewed by the two neighboring points. In the following mesh for example

```         o--------o--------o----o----o
x-4h     x-2h      x   x+h  x+2h
```
we want the slope (derivative) in the solution at u(x), as viewed from u(x-2h) to match the slop as viewed from u(x+h). Otherwise, the refined region to the right of and including x will be "polluted" by a less accurate solution computing on the coarse mesh to the left of and including x .

#### Implementation

In structured AMR we don't represent each point on the mesh with its own value of h. Rather, we form a composite mesh organized into levels, each succeeding level with a finer spacing than the previous one. This is shown in the following figure

The spacing at each level is uniform, and the refinement factor R gives us the ratio between the mesh spacings at adjacent levels. Usually R is a constant; typical values are two or four. ( R is usually restricted to a power of 2).

In general we will have several levels. We may fix the number of levels arbitrarily, or we may simply add levels until we run out of memory. For the time being, we consider a 2-level hierarchy comprising a fine grid uf and a coarse grid uc:

Now, how do we solve on the composite structure (The procedure is usually accelerated by multigrid, which we leave out to simplify the discussion).? In an elliptic equation it turns out that u is defined only at the finest level appearing in the representation. Thus, we don't represent uc at positions covered by uf.

To compute on the composite mesh we first compute on the base mesh uc. We are careful to exclude the points covered by uf. Thus, we must solve Poisson's equation for two separate right hand sides, to the left and to the right of the overlapping refined region. Each equation take Dirichlet boundary conditions. The "left" equation takes it left boundary condition from the physical boundary conditions. The right boundary condition is obtained from the fine mesh; we average two adjacent fine mesh cells onto the coarse mesh. This extra coarse mesh cell is like ghost cell, except that we must also carry out numerical computation (this is identical to the multigrid restriction operations.

We next solve on uf . Generally uf will be divided into multiple disjoint regions but in our simple example we have only 1 region. uf expects boundary conditions which it obtains from uc by interpolation (multigrid prolongation ). Again we add ghost cells to hold these boundary conditions.

When setting up boundary conditions we must take care that the slopes match. This is nicely described in the tutorial "Solving Poisson's Equation using Adaptive Mesh Refinement" by Dan Martin and Keith Cartright (see the URL http://barkley.me.berkeley.edu/~martin/AMRPoisson.html.)

In higher dimensions, grids communicate numerical information in a more elaborate manner, as shown in the following figure. This figure shows a coarsening inter-level communication operation in which grids communicate coarsened data to their parent. Notice how the fine grids cast shadows on the coarse grids, and that the coarse grids break these shadows into subregions. This bookkeeping is tedious and must be written carefully. Grids also communicate within a level, which is known as intra-level communication. This entails ghost cell information as in rb2D. We note, however, that owing to the presence of coarse-fine interfaces grids must also employ inter-level communication to fill in ghost cells.

#### Error estimation and grid generation

We haven't yet said how we generated the refinements. Generally the refinement structure will change according to the evolving solution. Let's consider the initial coarse mesh uc. We apply a user-supplied error estimation procedure which tells us where the error exceeds some pre-specified threshold. (This estimation procedure generally depends on the physical problem). We then cluster the flagged points, filling in gaps between islands as shown to coalesce the points into an efficient set of refinements. We try to minimize the number of filled in gaps, noting that some fill-in will likely reduce the overhead of processing the refinements.

In one dimension, clustering is straight forward. In two and more dimensions, however, the procedure gets more complicated. In the basic Berger-Oliger-Colella algorithm each level is refined as a single logical unit. This implies that refinements are not aware of the breakdown of a level into grid components. As a result, a refinement can have more than one parent as shown in the figure. Refinement structures may not generally be represented as a tree, e.g. a quad-tree or oct-tree, which are employed in adaptive quadrature .

A 3 level refinement structure. The base mesh is pink, and is refined by two blue level-2 grids. The red spots mark the location of flagged points in the second level, which are covered by green level-3 refinements. Note that one of the level-3 refinements has more than one parent grid component.

Refinements are generated recursively, stopping when some user-specified criteria is met. Refinements must be properly nested , that is, there must be some padding around each refinements such that it doesn't abut the edge of the coarse mesh. This constraint bounds the number of levels that must be spanned when interpolating coarse cells onto fine ghost cells, and simplifies the implementation.

The Berger-Rigoutsos algorithm is a standard technique used to handle clustering; it is based on an edge detection algorithm from image understanding. We'll give a brief summary here; consult Scott Kohn's Ph.D. Dissertation or Berger and Rigoutsos' original article (IEEE Transactions Systems, Man and Cybernetics, 21(5):1278-1286,1991.) The aim of clustering is to minimally cover all points requiring refinement. That is, we want to avoid covering unflagged points, but we are willing to cover some of these points if the overall cost of handling the refinement structures is reduced. This effectively boils down to a granularity tradeoff. In one extreme, we could exactly cover only the flagged points. But, we would likely have many small refinements., with a large communicating surface area. At the other extreme, we could have only a few refinements, but would do considerable unnecessary work. Somewhere in between, we do a little bit of extra work, in exchange for lower operating overheads.

Clustering algorithms for patch-based AMR tend to be based on a few primitives: slicing the flagged points at a position(s) where the density of points is low, wrapping a bounding box around an area of interest to fit snugly around the flagged points, splitting the bounding box if the fraction of unflagged points is too high. Using these basic operations, we can express a variety of interesting strategies.

In some higher-dimensional variants, refinements may be rotated with respect to one another. This complicates boundary condition transfers between grids at the same level. When the grids are not rotated, then these intra-level transfers are handled just like ghost cells in uniform mesh methods, except that some ghost cells will get filled in by interpolating from a coarse mesh. Note that this added complexity of intra-level transfers does not arise in one dimensional computations.

In some applications, refinement structures evolve according to the solution we are computing. Thus, we will discard some refined points, and add others. Often we pad refinements to permit us to re-grid the meshes less frequently. Clearly there is a tradeoff; if we don't pad enough, then we must refine too often, but if we pad too much to avoid refining too frequently, we increase the computational cost of numerical integration.

To refine a point we interpolate it onto a finer mesh. We never throw away refined points that are to remain refined; we always use the best information available. Thus we copy old refined points where they intersect, into corresponding positions of the new refinement structure. If we threw away such points, then we would be forced to reconstruct them by interpolation, which would pollute the fine meshes with less accurate coarse grid data.

We note that the exact details of error estimation, interpolation, coarsening, and smoothing are problem-dependent. Grid generation, however, is largely application-neutral since it doesn't know anything about the problem. This tells us that library support for adaptive mesh methods is likely to be as diverse as the problems themselves. However, by layering the library support we may be able to reuse some software components.

### Parallel implementation of AMR

On a serial computer, and AMR algorithm shows a strong resemblance to the parallel SPMD implementation of a single-mesh, iterative finite difference solver. The grid patch is the basic unit of work, and each patch contains local state which is computed in parallel with other patches at the same level. Grid patches communicate information across their interfaces. Ghost cells are thus employed by the numerical algorithm proper and are not an artifact of parallelization. Note that unlike the single mesh finite difference method, AMR introduces coarse-fine grid communication which incurs computation as well. Let's consider the simple case of a single coarse grid Wcoarse refined by a single fine patch Wfine, as shown in the left side of the following figure:

We may express the ghost cell region as the set difference between the fine patch including the ghost region and the fine patch excluding the ghost region.

It should come as no surprise that the effort required to construct even a serial adaptive code is substantial and requires appropriate library support. After all, we are carrying out many of the activities that would occur in a parallel implementation: decomposition (grid generation) and inter-grid transfers. The implementation issues introduced on a parallel computer, e.g. load balancing and inter-processor communication, compound the difficulties entailed in a serial implementation, but the major intellectual leap is in dealing with the issues of a single processor implementation.

Consider, for example, grid generation. Recall that a grid patch may have more than one parent. This is true because error estimation is carried out on an irregular region that makes up a level. The specific grouping of points in the parent level aren't relevant to refinement decisions. In fact, one grouping is as good as any other as far as the error estimator is concerned. (Of course, there are externally imposed constraints that remove some groupings from consideration.)

To raise the level of abstraction from grid patch to irregular grid level we need to build an appropriate application level interface. Support for adaptive mesh refinement is usually specialized.

An important abstraction is the IrregularGrid, which hides the exact assignment of grid components to processors.  This abstraction enables us to treat irregularly shaped region as if it were a single logical unit, as shown on the right half of the previous Figure. In particular, communication of ghost cells across grid interfaces is handled exactly in the same way as for a uniform mesh method. We use the set difference construct as before, but interpreted over irregular regions. Coarse/fine interfaces are handled somewhat differently. As with fine-fine interfaces, we don't know the number of processors we are running on. However, we must also carry out computation.

Load balancing arises because the grids have different sizes. Some processors may obtain more than one grid, and some larger grids may have to be split over multiple processors. This process is complicated by the fact that communication workloads may also be imbalanced. In the interests of managing software complexity, we prefer to treat load balancing as a separate post-processing step after grid generation. This separation of load balancing from grid generation simplifies software design by permitting application and hardware-specific knowledge to be handled separately.

To simplify the debugging process we want to constrain the load balancer so that the points in the resultant refinement structure will be the same regardless of the number of processors employed. In other words, there exists a canonical refinement structure and load balancing decisions may neither add nor remove refined points to or from this canonical refinement structure, but may only regroup them. This ensures that refinement decisions are unaffected by parallelization, which otherwise could introduce subtle debugging problems. In a correctly operating code, answers will differ only by roundoff, which is easier to check than in the more general case when there may be different refinement structures.

A variety of different approaches are taken to balancing workloads. Perhaps the simplest approach relies on the use of uniform size refinements. Such an approach was taken by Berger and Saltzman using Connection Machine Fortran on the CM-2. (See Saltzman's Patch Based Methods for Adaptive Mesh Refinement Solutions of Partial Differential Equations (Los Alamos Report Number LA-UR-97-3595).) Note, however, that the constraint that all refinements have the same size imposes computation and memory overheads in the form of additional refined points that would not be present were we to permit variable size refinements. Kohn observed that the cost of these refinements can be substantial, as much as 40%.

Another approach is to employ the knapsack algorithm (see Saltzman's tech report, pp 58-59). We start by sorting the refinements in decreasing order of size, which is measured in number of points. We then assign each grid in reverse sorted order to the most lightly loaded processor at the time the grid is encountered. Finally, we swap grids between the most heavily loaded processor and another processor if doing so improves load imbalance. This strategy is not sensitive to communication overheads and may produce a non-optimal data layout across processors.

Kohn explores an approach based on processor affinity. First, he subdivides any grid that exceeds the mean grid size, and then sorts the grids in reverse sorted order as with the knapsack algorithm. Each patch is assigned a processor preference based on interactions with patches assigned to other processors. In particular, a patch will have various affinities to other processors, based on the size of the intersections. This technique reduces inter-level communication costs, but does not deal with intra-level communication costs. In addition, it does not deal with communication load imbalances.

#### Performance

Adaptive mesh methods can reduce space and time requirements by as much as two orders of magnitude. Kohn observed a 160-fold decrease in the amount of storage required in an electronic structure computation. In particular, he employed about 800,000 unknowns in a computation that provided the effective resolution of uniform 5123 mesh.

Kohn also observed that 4 processors of an IBM/SP2 (with Power 2 chips) delivered comparable performance to a single processor of a Cray C90 (actually the SP2 slightly outperformed the Cray C90). On paper, both these configurations deliver roughly the same peak performance: about 1 Gigaflop/second. In practice, however, the rates can be much lower. The SP2 delivers about about one-sixth this rate for adaptive mesh refinement (150 Megaflops/second) and the Cray about the same. The primary difficulties on the SP2 are communication overhead and poor reuse of cached data. On the Cray the difficulty is with short vector lengths. Adaptive mesh refinement structures tend to be fragmented to avoid unnecessary over-refinement. In the application considered, refinements were 10 to 30 cells on a side, which result in short vector lengths. Modern super-scalar processors are nearly insensitive to short vector lengths.

#### Software

A variety of software libraries have built in support for Srucured AMR. These libraries have primarily been written in C++ and includE
• SAMRAI (Scott Kohn and Robert Hornung at the Center for Applied Scientific Computing, Lawrence Livermore National Laboratory)
• CHOMBO (Applied Numerical Algorithms Group, Lawrence Berkeley National Laboratory)
• PARMESH
• CACTUS
• See the web page on adaptive mesh refinement for structured Grids at http://www.camk.edu.pl/~tomek/AMRA/amr.html for further information. Libraries that support multiblock meshes (which also arise in AMR) include:
• KeLP
• ACE and DAGH
• Overture (Center for Applied Scientific Computing, Lawrence Livermore National Laboratory)
• #### Acknowledgment

The work described here in parallel AMR was part of the Ph.D. dissertation research of Scott Kohn, when a Ph.D. student in the Department of Computer Science and Engineering at UCSD.