CSE 260 Schedule (Winter 2012)

This schedule is subject to change, so check back frequently.

When the texts are cited in the readings below, we use the following naming conventions:

Lecture 1, 1/10/12 (Tue): Introduction
  • Lecture slides
  • Today's reading:
    • Pacheco: Chapter 1 (all)
    • On-line reader
    • For background reading on memory hierarchies (optional), including virtual memory, see §2.2.1-2.2.4 of Pacheco. Also consult one of the following:
      • John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach,
        3rd Ed., Morgan Kaufmann, 2003, Chapter 5, esp. §5.1-5.4 and §5.9. Other editions have similar chapters. This book is on reserve at the S&E library.
      • What Every Programmer Should Know About Memory (Ulrich Drepper)
Lecture 2, 1/12/12 (Thu): Address space organization, shared memory, and memory locality optimization.

Lecture 3, 1/17/12 (Tue): Stencil methods; multicore programming with OpenMP
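  • To make the lecture topic concrete, here is a minimal sketch (mine, not the course's) of a 3-point Jacobi-style stencil sweep parallelized with OpenMP; the array names, problem size, and use of omp_get_wtime for timing are illustrative assumptions.

      #include <stdio.h>
      #include <omp.h>

      #define N 1000000   /* assumed problem size */

      static double u[N], unew[N];

      int main(void) {
          for (int i = 0; i < N; i++) u[i] = (double)i;   /* arbitrary input */

          double t0 = omp_get_wtime();
          /* Each interior point becomes the average of its two neighbors;
             the updates are independent, so the loop parallelizes directly. */
          #pragma omp parallel for
          for (int i = 1; i < N - 1; i++)
              unew[i] = 0.5 * (u[i-1] + u[i+1]);
          double t1 = omp_get_wtime();

          printf("sweep took %g s, unew[1] = %g\n", t1 - t0, unew[1]);
          return 0;
      }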

Lecture 4, 1/19/12 (Thu): More OpenMP; performance measurement and characterization.

Lecture 5, 1/24/12 (Tue): Performance programming of stencil methods; vectorization (SIMD and SSE); GPU architecture
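  • As a taste of the SIMD material, a hedged sketch (my example, not the lecture's) of adding two float arrays four lanes at a time with SSE intrinsics; the function and variable names are assumptions.

      #include <stdio.h>
      #include <xmmintrin.h>   /* SSE intrinsics */

      /* c[i] = a[i] + b[i], four floats per instruction. */
      static void add4(const float *a, const float *b, float *c, int n) {
          int i = 0;
          for (; i + 4 <= n; i += 4) {
              __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats, unaligned ok */
              __m128 vb = _mm_loadu_ps(b + i);
              _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
          }
          for (; i < n; i++)                     /* scalar remainder */
              c[i] = a[i] + b[i];
      }

      int main(void) {
          float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
          add4(a, b, c, 8);
          for (int i = 0; i < 8; i++) printf("%g ", c[i]);   /* all 9s */
          printf("\n");
          return 0;
      }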

Lecture 6, 1/26/12 (Thu): Programming with CUDA
  • Lecture slides (posted)
  • Today's reading:
    • Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu, Morgan Kaufmann Publishers (2010). Chapter 3 (all), Chapter 4 (pp. 59-68)
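  • A minimal sketch in the spirit of the Chapter 3 example (the code itself is mine, not Kirk & Hwu's): the canonical vector add, showing the allocate/copy/launch/copy-back pattern and the guarded global thread index.

      #include <cstdio>
      #include <cstdlib>
      #include <cuda_runtime.h>

      __global__ void vadd(const float *a, const float *b, float *c, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
          if (i < n) c[i] = a[i] + b[i];                  // guard the ragged tail
      }

      int main() {
          const int n = 1 << 20;
          const size_t bytes = n * sizeof(float);
          float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
                *hc = (float *)malloc(bytes);
          for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

          float *da, *db, *dc;
          cudaMalloc((void **)&da, bytes);
          cudaMalloc((void **)&db, bytes);
          cudaMalloc((void **)&dc, bytes);
          cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
          cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

          const int threads = 256;
          vadd<<<(n + threads - 1) / threads, threads>>>(da, db, dc, n);
          cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

          printf("hc[0] = %g (expect 3)\n", hc[0]);
          cudaFree(da); cudaFree(db); cudaFree(dc);
          free(ha); free(hb); free(hc);
          return 0;
      }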

Lecture 7, 1/31/12 (Tue): Under the hood of the device
  • Lecture slides (posted)
  • Today's reading:
    • Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu, Morgan Kaufmann Publishers (2010). Chapter 4 (pp. 71-74)
    • "Benchmarking GPUs to tune dense linear algebra," by V. Volkov and J. Demmel. Proc. 2008 ACM/IEEE Conf. on Supercomputing, Austin, TX, Nov. 15-21, 2008. (PDF) [Read through §3.6] (4 pp.)
  • To probe further (optional): "NVIDIA Tesla: A Unified Graphics and Computing Architecture," by Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. IEEE Micro 28(2):39-55, March 2008. DOI [This paper is a little outdated, but it discusses warp scheduling.]
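  • To illustrate the warp-scheduling discussion, an assumed example (not from the readings): a branch that varies within a 32-thread warp forces the warp to execute both paths serially, while a branch that is uniform across each warp does not.

      #include <cuda_runtime.h>

      // Divergent: even and odd lanes of the same warp take different paths.
      __global__ void divergent(float *out) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (threadIdx.x % 2 == 0) out[i] = 2.0f * i;
          else                      out[i] = 0.5f * i;
      }

      // Uniform: the condition is constant across each 32-thread warp,
      // so no warp diverges even though both kernels do similar work.
      __global__ void uniform(float *out) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if ((threadIdx.x / 32) % 2 == 0) out[i] = 2.0f * i;
          else                             out[i] = 0.5f * i;
      }

      int main() {
          const int n = 1 << 10;
          float *d;
          cudaMalloc((void **)&d, n * sizeof(float));
          divergent<<<n / 256, 256>>>(d);
          uniform<<<n / 256, 256>>>(d);
          cudaDeviceSynchronize();
          cudaFree(d);
          return 0;
      }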

Lecture 8, 2/2/12 (Thu): Matrix multiplication using shared memory
  • Lecture slides (posted)
  • Today's reading:
    • Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu, Morgan Kaufmann Publishers (2010). Chapter 5 (all).
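  • A sketch of the tiled multiply that Chapter 5 builds up to (my reconstruction of the standard pattern, with an assumed 16x16 tile and n a multiple of the tile size): each block stages tiles of A and B in shared memory, so each element is read from global memory n/TILE times instead of n times.

      #include <cstdio>
      #include <cstdlib>
      #include <cuda_runtime.h>

      #define TILE 16

      // C = A * B for n x n row-major matrices; n must be a multiple of TILE.
      __global__ void matmul(const float *A, const float *B, float *C, int n) {
          __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
          __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

          int row = blockIdx.y * TILE + threadIdx.y;
          int col = blockIdx.x * TILE + threadIdx.x;
          float sum = 0.0f;

          for (int t = 0; t < n / TILE; t++) {
              // Each thread loads one element of each tile; the barrier
              // makes the whole tile visible to every thread in the block.
              As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
              Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
              __syncthreads();

              for (int k = 0; k < TILE; k++)
                  sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
              __syncthreads();   // don't overwrite tiles still in use
          }
          C[row * n + col] = sum;
      }

      int main() {
          const int n = 512;
          const size_t bytes = n * n * sizeof(float);
          float *h = (float *)malloc(bytes);
          for (int i = 0; i < n * n; i++) h[i] = 1.0f;

          float *A, *B, *C;
          cudaMalloc((void **)&A, bytes);
          cudaMalloc((void **)&B, bytes);
          cudaMalloc((void **)&C, bytes);
          cudaMemcpy(A, h, bytes, cudaMemcpyHostToDevice);
          cudaMemcpy(B, h, bytes, cudaMemcpyHostToDevice);

          dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
          matmul<<<grid, block>>>(A, B, C, n);
          cudaMemcpy(h, C, bytes, cudaMemcpyDeviceToHost);
          printf("C[0] = %g (expect %d)\n", h[0], n);

          cudaFree(A); cudaFree(B); cudaFree(C); free(h);
          return 0;
      }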

Lecture 9, 2/3/12 (Fri, Room 1202): Performance programming
  • Lecture slides (posted)
  • Today's reading:
    • Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu, Morgan Kaufmann Publishers (2010). Chapter 6 (all).
    • Read through §5 of "Benchmarking GPUs to tune dense linear algebra," by V. Volkov and J. Demmel. Proc. 2008 ACM/IEEE Conf. on Supercomputing, Austin, TX, Nov. 15-21, 2008. (PDF)
  • Presentation on Reduction (NVIDIA Developer)
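  • In the spirit of the reduction presentation above, a basic shared-memory tree reduction (a simplified sketch; the NVIDIA slides develop several progressively faster variants).

      #include <cstdio>
      #include <cstdlib>
      #include <cuda_runtime.h>

      // Each block reduces blockDim.x inputs to one partial sum; the
      // per-block partials still need a second pass (here, a host loop).
      __global__ void reduce(const float *in, float *out, int n) {
          extern __shared__ float sdata[];
          unsigned int tid = threadIdx.x;
          unsigned int i = blockIdx.x * blockDim.x + tid;
          sdata[tid] = (i < n) ? in[i] : 0.0f;
          __syncthreads();

          // Tree reduction: halve the number of active threads each step.
          for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
              if (tid < s) sdata[tid] += sdata[tid + s];
              __syncthreads();
          }
          if (tid == 0) out[blockIdx.x] = sdata[0];
      }

      int main() {
          const int n = 1 << 20, threads = 256, blocks = n / threads;
          float *hin = (float *)malloc(n * sizeof(float));
          float *hout = (float *)malloc(blocks * sizeof(float));
          for (int i = 0; i < n; i++) hin[i] = 1.0f;

          float *din, *dout;
          cudaMalloc((void **)&din, n * sizeof(float));
          cudaMalloc((void **)&dout, blocks * sizeof(float));
          cudaMemcpy(din, hin, n * sizeof(float), cudaMemcpyHostToDevice);

          reduce<<<blocks, threads, threads * sizeof(float)>>>(din, dout, n);
          cudaMemcpy(hout, dout, blocks * sizeof(float), cudaMemcpyDeviceToHost);

          double total = 0.0;
          for (int b = 0; b < blocks; b++) total += hout[b];  // finish on host
          printf("sum = %g (expect %d)\n", total, n);

          cudaFree(din); cudaFree(dout); free(hin); free(hout);
          return 0;
      }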

Lecture 10, 2/7/12 (Tue): Floating point; GPUs in context
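  • A tiny assumed example of why floating point matters for parallel codes: addition is not associative, so a parallel reduction that regroups terms need not reproduce the serial sum bit for bit.

      #include <stdio.h>

      int main(void) {
          double a = 1.0e16, b = -1.0e16, c = 1.0;
          printf("(a + b) + c = %g\n", (a + b) + c);  /* 1: exact cancellation first */
          printf("a + (b + c) = %g\n", a + (b + c));  /* 0: c is absorbed into b */
          return 0;
      }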

Lecture 11, 2/9/12 (Thu): Parallel programming languages: Cilk and UPC

Lecture 12, 2/14/12 (Tue): Message passing; introduction to MPI.
  • Lecture slides (posted)
  • Today's reading: Pacheco, Chapter 6, pp. 83-94.
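  • A minimal sketch of the message-passing model from this reading (the example is mine, not Pacheco's): every rank greets rank 0 with a point-to-point send, and rank 0 receives the messages in order.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char *argv[]) {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          if (rank != 0) {                       /* workers: send my rank */
              MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
          } else {                               /* rank 0: collect messages */
              for (int src = 1; src < size; src++) {
                  int msg;
                  MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
                  printf("rank 0 received %d\n", msg);
              }
          }
          MPI_Finalize();
          return 0;
      }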

Lecture 13, 2/16/12 (Thu): A first MPI application, Cannon's Matrix Multiplication Algorithm, basic collectives, managing communicators.
  • Lecture slides (posted)
  • Today's reading:
    • Pacheco, Chapter 2: pp. 37-40; Chapter 3: pp. 97-109.
    • A User's Guide to MPI, by Peter Pacheco, pp. 29-36. (PDF)
      (Or from Peter Pacheco's Parallel Programming with MPI, pp. 111-121.)
    • Lecture Notes on Parallel Matrix Multiplication, by Jim Demmel, UC Berkeley. Read the Introduction and Cannon's algorithm on a 2D mesh.
  • Parallel print function: PPF is the Parallel Tools Consortium's parallel print facility. For more information, consult the PPF web page. The software is installed on Triton in $(PUB)/lib/PPF, with examples in $(PUB)/examples/PPF (see the README file for important information about using the software).
  • More about the trapezoidal rule
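  • In the spirit of Pacheco's trapezoidal-rule example, a hedged sketch (the integrand, interval, and variable names are my assumptions): each rank integrates its own subinterval, and MPI_Reduce combines the pieces at rank 0.

      #include <mpi.h>
      #include <stdio.h>

      static double f(double x) { return x * x; }      /* example integrand */

      /* Composite trapezoidal rule on [a, b] with n trapezoids. */
      static double trap(double a, double b, int n) {
          double h = (b - a) / n;
          double sum = 0.5 * (f(a) + f(b));
          for (int i = 1; i < n; i++) sum += f(a + i * h);
          return sum * h;
      }

      int main(int argc, char *argv[]) {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          const double a = 0.0, b = 1.0;               /* integrate x^2 on [0,1] */
          const int n = 1 << 20;                       /* total trapezoids */
          int local_n = n / size;                      /* assumes size divides n */
          double h = (b - a) / n;
          double local_a = a + rank * local_n * h;
          double local_b = local_a + local_n * h;

          double local = trap(local_a, local_b, local_n);
          double total;
          MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0) printf("integral ~= %.10f (exact 1/3)\n", total);

          MPI_Finalize();
          return 0;
      }
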
Lecture 14, 2/17/12 (Fri): Advanced collectives; the SUMMA Matrix Multiplication Algorithm.
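  • A skeleton of SUMMA as usually presented (my sketch, not the lecture's code; NB and the all-ones test data are assumptions, and the process count must be a perfect square): row and column communicators from MPI_Comm_split carry the block broadcasts, which is where the advanced collectives come in.

      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <math.h>

      #define NB 64   /* local block dimension (an assumption) */

      int main(int argc, char *argv[]) {
          MPI_Init(&argc, &argv);
          int rank, p;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &p);
          int q = (int)(sqrt((double)p) + 0.5);        /* q x q process grid */
          int row = rank / q, col = rank % q;

          MPI_Comm rowc, colc;                         /* split the grid */
          MPI_Comm_split(MPI_COMM_WORLD, row, col, &rowc);
          MPI_Comm_split(MPI_COMM_WORLD, col, row, &colc);

          size_t sz = (size_t)NB * NB * sizeof(double);
          double *A = malloc(sz), *B = malloc(sz);
          double *C = calloc((size_t)NB * NB, sizeof(double));
          double *Abuf = malloc(sz), *Bbuf = malloc(sz);
          for (int i = 0; i < NB * NB; i++) { A[i] = 1.0; B[i] = 1.0; }

          for (int k = 0; k < q; k++) {
              /* Owner of block column k of A broadcasts along its row;
                 owner of block row k of B broadcasts down its column. */
              if (col == k) memcpy(Abuf, A, sz);
              MPI_Bcast(Abuf, NB * NB, MPI_DOUBLE, k, rowc);
              if (row == k) memcpy(Bbuf, B, sz);
              MPI_Bcast(Bbuf, NB * NB, MPI_DOUBLE, k, colc);

              for (int i = 0; i < NB; i++)             /* C += Abuf * Bbuf */
                  for (int j = 0; j < NB; j++) {
                      double s = 0.0;
                      for (int l = 0; l < NB; l++)
                          s += Abuf[i * NB + l] * Bbuf[l * NB + j];
                      C[i * NB + j] += s;
                  }
          }
          if (rank == 0) printf("C[0][0] = %g (expect %d)\n", C[0], q * NB);

          free(A); free(B); free(C); free(Abuf); free(Bbuf);
          MPI_Comm_free(&rowc); MPI_Comm_free(&colc);
          MPI_Finalize();
          return 0;
      }
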
Lecture 15, 2/23/12 (Thu): Communication-avoiding matrix multiplication; stencil methods

Lecture 16, 2/28/12 (Tue): NUMA architectures and programming
  • Lecture slides (posted)
  • Today's reading:
    • "The SGI Origin: a ccNUMA highly scalable server," J. Laudon and D. Lenoski, Proc. 24th ISCA, pp 241-251, 1997. DOI
    • Notes on shared memory
    • For background material, see Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Ed., Morgan Kaufmann: Chapter 4, esp. §4.1, §4.4, and §4.6. On reserve in the S&E library.
  • Supplemental reading
    • Origin 2000 and Onyx2 Performance Tuning and Optimization Guide, Document No. 007-3430-003, SGI, 2001.
      • Chapter 1. Understanding SN0 Architecture
      • Chapter 2. SN0 Memory Management
      • Chapter 8. Tuning for Parallel Processing: read the sections "Tuning Parallel Code for SN0," "Scalability and Data Placement," and "Using Data Distribution Directives," but only read through "Understanding the AFFINITY clauses for threads" (Example 8-11). There is a convenient table of contents at the beginning of the section.
    • Presentation Materials on the Origin 2000 (David Culler, UC Berkeley)
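  • The placement directives in the SGI guide are machine-specific; as a generic stand-in, here is a sketch of the first-touch idea (an assumed example, not from the readings): pages are typically placed on the node of the thread that first writes them, so initializing in parallel with the same schedule as the compute loop keeps each thread's data local.

      #include <stdio.h>
      #include <stdlib.h>
      #include <omp.h>

      #define N (1 << 24)

      int main(void) {
          double *x = malloc(N * sizeof *x);

          /* First touch: same static schedule as the compute loop below,
             so each thread's pages land on its own NUMA node. */
          #pragma omp parallel for schedule(static)
          for (int i = 0; i < N; i++) x[i] = 1.0;

          double sum = 0.0;
          #pragma omp parallel for schedule(static) reduction(+:sum)
          for (int i = 0; i < N; i++) sum += x[i];

          printf("sum = %g (expect %d)\n", sum, N);
          free(x);
          return 0;
      }
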
Lecture 17, 3/1/12 (Thu): Reflections on Performance

Lectures 18 & 19, Progress report presentations

Lecture 20, 3/13/12 (Tue): Exascale computing