Performance Programming

Columbia, Fall 1994
Professor: Bowen Alpern

Course Outline

1. Introduction

What is performance programming? When is it justified? Review of RAM and PRAM analysis. Overview of examples to be used in the course. Poisson distributions. Extended example: seismic migration.

2. The High Cost of Data Movement

Extended example: unblocked matrix multiplication. The Two Level model of computation. The memory hierarchy of an RS/6000. Locality: local, semilocal, and nonlocal data passes. Why Quicksort is better than Heapsort. Communication cost on SP-1 and SP-2.

3. The Parallel Memory Hierarchy Model

The PMH model of computation and its parameters. Visualizing computers. Programming a PMH. Performance programming techniques. Hierarchical tiling. Extended example: a PDE-like code.

4. Linear Algebra and the Inner Loop

LAPACK and the BLAS. daxpy vs. ddot. Eliminating sequential dependencies in ddot. Computation "sticks." Prefetching. LU decomposition. The importance of proper parenthesization.
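As an illustration of the ddot item above, here is a minimal sketch of one common way to break the sequential dependency in a dot product: keep several independent partial sums so that successive multiply-adds do not wait on each other. The function names and the choice of four accumulators are assumptions made for illustration, not taken from the course materials, and Python is used only to show the structure (the payoff appears in compiled inner loops).

```python
def ddot_serial(x, y):
    """Dot product with one accumulator: every add depends on the previous add."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += a * b
    return acc


def ddot_four_way(x, y):
    """Dot product with four independent accumulators (hypothetical helper).

    The four partial sums have no dependencies on one another, so a pipelined
    floating-point unit could overlap them. Reassociating the sum can change
    the rounding of the result slightly.
    """
    n = len(x)
    s0 = s1 = s2 = s3 = 0.0
    limit = n - n % 4  # largest multiple of 4 not exceeding n
    for i in range(0, limit, 4):
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
    for i in range(limit, n):  # leftover elements when n is not a multiple of 4
        s0 += x[i] * y[i]
    return (s0 + s1) + (s2 + s3)
```

Both functions compute the same mathematical sum; only the order of the additions differs.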

5. Practical Localization: The Memory Hierarchy

Cache. Translation Lookaside Buffer (TLB). Disk. Blocking. Associativity: what it is and how to fight it. Blocking multiple access patterns. Examples: ranking integers, LU decomposition, and FFT. Poisson distribution (again).
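As an illustration of the blocking item above, the following is a minimal sketch of a blocked (tiled) matrix multiplication on square matrices stored as lists of rows. The function name, the loop order, and the default block size are assumptions chosen for illustration; in practice the block size would be tuned so that the working tiles fit in the targeted level of the memory hierarchy.

```python
def matmul_blocked(a, b, n, block=2):
    """Return c = a * b for n-by-n matrices, visiting the work tile by tile.

    Hypothetical helper: each (ii, kk, jj) triple selects a block-by-block
    tile of a, b, and c, so the inner three loops reuse a small working set.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Multiply one tile of a by one tile of b into one tile of c.
                for i in range(ii, min(ii + block, n)):
                    row_a = a[i]
                    row_c = c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c
```

The `min(...)` bounds handle matrices whose size is not a multiple of the block size.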

6. Practical Parallelization

Distributed and shared address space parallelism. Latency and bandwidth. Collective Communication. Choreography. 1D and 2D LU decomposition.

7. Bringing It All Together

Example: NAS/CG

8. Scalability and Portable High-Performance

Scalability analysis. Performance landscapes. Examples: LU decomposition and sparse matrix-vector product. What would a portable performance program look like? Program variants. Generic Models. Deriving a PMH model for a machine. Modeling the CM-5. Distortions of PMH models.

9. Message Compression

Message Compression: fewer messages, fewer values, or fewer bits. Hardware support for message compression. Example: NAS/IS. Natural sequence compression. Standard techniques: run-length coding, Huffman codes, arithmetic coding.
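Of the standard techniques listed above, run-length coding is the simplest to show. The sketch below is a generic illustration, with hypothetical function names, of replacing each run of repeated values by a (value, count) pair; it is not meant to reproduce the specific scheme used in the course.

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```

For example, `rle_encode([0, 0, 0, 5, 5, 1])` yields `[(0, 3), (5, 2), (1, 1)]`. The encoding only pays off when the message contains long runs; a sequence with no repeats grows rather than shrinks.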

10. Exploiting the Processor and FFTs

IEEE floating-point format. Rounding modes. Integer arithmetic in the floating-point unit. Examples: pseudo-random number generation and data compression. Microparallelism. Inner-loop tricks. Table lookup and the NAS/EP benchmark. Approximate (inverse) square roots.
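As one illustration of how the IEEE format can be exploited for an approximate inverse square root, the sketch below uses a widely known bit-level trick for 32-bit floats: reinterpret the bits as an integer, halve and negate the exponent with integer arithmetic, then refine with Newton's method. This is an assumption about the kind of technique meant, not the course's own method; the function name and the default step count are hypothetical, and the input is assumed to be a positive, normal single-precision value.

```python
import struct


def approx_rsqrt(x, newton_steps=2):
    """Approximate 1/sqrt(x) for positive x (hypothetical helper)."""
    # Reinterpret the IEEE single-precision encoding of x as an unsigned integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Integer subtraction and shift give a rough estimate of the result's bits.
    bits = 0x5F3759DF - (bits >> 1)
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    # Each Newton step for f(y) = 1/y**2 - x roughly squares the relative error.
    for _ in range(newton_steps):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

The initial estimate alone is accurate to a few percent; each Newton step costs only multiplications and a subtraction, with no division or square root call.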

Fast Fourier Transforms.

11. Some Theory

Uniformity and the UMH model. Some theorems and their proofs. Threshold functions and candidate threshold functions. Examples: transpose, matrix multiplication, FFT and parallel matrix multiplication.

12. Dynamic Programming and Review

Rules of thumb. Dynamic programming. Extended example: optimal Steiner trees.

In addition, Rick Lawrence taught one class on his experience tuning the NAS/MG benchmark for the IBM SP-2.