CSE 260 Schedule

This schedule is subject to change, so check it frequently.

Lecture 1, 9/24/09 (Thu): Introduction
  • Lecture slides
  • Homework #0 (due Monday)
  • Today's reading:
  • Text. Introduction to Parallel Computing, 2nd Ed. Grama, Gupta, Karypis, and Kumar, Benjamin-Cummings, 2003
    Chapter 1: ALL.
  • For background reading on memory hierarchies, including virtual memory, see
    John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach,
    3rd Ed., Morgan Kaufmann, 2003, Chapter 5, esp. §5.1-4 and §5.9. Other editions have similar chapters. This book is on reserve at the S&E library (available starting Monday, 9/28).

  • Lecture 2, 9/29/09 (Tue): Motivating applications; multiprocessors
  • Lecture slides updated at 12:29 PM, 9/29/09
  • Homework #1 (due Thurs 10/8)
  • Today's reading:
  • Text. Chapter 2: pp. 11-21.

  • Lecture 3, 10/6/09 (Tue): Address space organization, performance, cache coherence and consistency
  • Lecture slides posted
  • Today's reading: Text. Chapter 2: pp. 24-30, 45-53, 61-3. Chapter 5: pp. 197-203, 205-212.
  • For additional background material on shared memory architecture, consult Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann: 3rd Ed., Chapter 6, §6.1 and §6.3; or 4th Ed., Chapter 4, §4.1-2 and §4.6.

  • Lecture 4, 10/7/09 (Weds at 11AM, EBU3B 1202): Threads Programming
  • Lecture slides re-posted with revisions to code examples (Code is also available on Triton in $PUB/Examples/Lec04)
  • Today's reading: Text. Chapter 7: pp. 279-294, 307-308.

  • Lecture 5, 10/8/09 (Thu): A first application; OpenMP
  • Homework #2 (due Tues 10/20) posted
  • Lecture slides re-posted with OpenMP code for Jacobi3D
  • Today's reading:
  • On-line reader
  • Text. Chapter 7: pp. 311-333.
  • For reference: Solving the Discrete Poisson Equation (Jim Demmel, UC Berkeley). Read through the section on Successive Overrelaxation, skipping "The Complexity of Solving the Poisson Equation."
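For orientation before the Demmel reading, here is a minimal C sketch of one Jacobi sweep for the discrete Poisson equation. It is a 2D simplification of the lecture's Jacobi3D code, not that code itself; the function name jacobi_sweep and the row-major layout with a one-cell boundary are assumptions for illustration. The OpenMP pragma splits the rows among threads; a compiler without OpenMP enabled ignores it and the loop runs serially.

```c
/* One Jacobi sweep for the 2D discrete Poisson equation -lap(u) = f on an
   n x n interior grid with mesh spacing h (hypothetical helper).
   Arrays are row-major with a one-cell boundary, i.e. (n+2) x (n+2);
   boundary cells are held fixed (Dirichlet) and are never written. */
void jacobi_sweep(int n, double h, const double *f,
                  const double *u_old, double *u_new)
{
    int ld = n + 2;  /* row length including the two boundary cells */

    /* Each interior point reads only u_old, so iterations are
       independent and the rows can be divided among threads. */
    #pragma omp parallel for
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            u_new[i * ld + j] = 0.25 * (u_old[(i - 1) * ld + j] +
                                        u_old[(i + 1) * ld + j] +
                                        u_old[i * ld + j - 1] +
                                        u_old[i * ld + j + 1] +
                                        h * h * f[i * ld + j]);
}
```

Because only interior points are written, both arrays should start with the same boundary values; a solver would swap the two pointers after each sweep and stop once successive iterates differ by less than a chosen tolerance.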

  • Lecture 6, 10/13/09 (Tue): Message passing
  • Lecture slides posted
  • Today's reading:
  • Text. Chapter 2, §2.5 (pp. 53-60); Chapter 6, §6.1-6.3 (pp. 233-250)
  • A User's Guide to MPI, by Peter Pacheco, pp. 1-10 (PDF)
    (Or Chapters 2 and 3 of Peter Pacheco's Parallel Programming with MPI.)
  • For reference: Chapter 8 of Ian Foster's Designing and Building Parallel Programs (Addison-Wesley)
  • The Ring code is available on Triton in $PUB/Examples/Ring
  • Code examples from the Pacheco textbook are available in $PUB/Examples/Pacheco

  • Lecture 7, 10/15/09 (Thurs): A first application with message passing
  • Lecture slides posted
  • Today's reading:
  • Chapter 2, §2.4.2 through 2.4.4: pp. 32-44; Chapter 6: §6.6.1-6.6.3: pp. 260-263.
  • A User's Guide to MPI, by Peter Pacheco, pp. 11-17 (PDF)
    Or: Parallel Programming with MPI, by Peter Pacheco: Chapter 4: ALL, Chapter 5: §5.1-5: pp. 65-78
  • Modeling the performance of an iterative method (PDF)
  • For reference: Chapter 8, §8.1 to 8.3, of Ian Foster's Designing and Building Parallel Programs (Addison-Wesley, on-line). A nice introduction to MPI.
  • Code
  • Quadrature code (trapezoidal rule): available in $PUB/Pacheco/ppmpi_c/hap04
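The quadrature example approximates an integral with the composite trapezoidal rule. A minimal serial C sketch follows; the integrand f(x) = x*x and the function name trap are assumptions for illustration, and Pacheco's code may be organized differently. In a message-passing version, each process would apply the same rule to its own subinterval of [a, b], and the partial results would be combined, for example with MPI_Reduce.

```c
/* Integrand, chosen for illustration: f(x) = x^2 */
static double f(double x) { return x * x; }

/* Composite trapezoidal rule: approximate the integral of f over [a, b]
   using n subintervals of equal width h = (b - a) / n. */
double trap(double a, double b, int n)
{
    double h = (b - a) / n;
    double sum = (f(a) + f(b)) / 2.0;   /* endpoints get weight 1/2 */
    for (int i = 1; i < n; i++)
        sum += f(a + i * h);            /* interior points get weight 1 */
    return sum * h;
}
```

For example, trap(0.0, 1.0, 1000) returns a value close to the exact integral 1/3; the error shrinks roughly as h squared.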

  • Lecture 8, 10/20/09 (Tue): Vectorization, SIMD, GPUs
  • Lecture slides posted
  • Today's reading:
  • Programming Guidelines for Vectorizing C/C++ Compilers, A. Bik et al. Dr. Dobb’s Journal, 2/1/03. (6 pp.) (Alternatively: Wikipedia article on Vectorization)
  • Intel SSE (2 pp.)
  • NVIDIA Tesla: A Unified Graphics and Computing Architecture, by E. Lindholm et al., IEEE Micro, March-April 2008, Vol. 28, Issue 2, pp. 39-55. (IEEE Digital Library, accessible from any campus machine. If you are off campus, you may use the UCSD Web proxy, which enables access to restricted content from non-UCSD Internet service providers; UCSD Active Directory login required.)
  • Vectorization code examples available in $PUB/Examples/Vectorization

  • Supplemental reading - to probe further
  • Streaming SIMD Extensions, SSE (Wikipedia)
  • "The CRAY-1 Computer System," R. M. Russell, Comm. ACM 21(1): 63-72 (Jan. 1978). DOI
  • R. L. Sites, "An analysis of the Cray-1 computer." Proc. 5th Annual Symp. Computer Architecture (ISCA ’78), ACM, pp. 101-106. (Esp. sections 4 and 5) DOI
  • Vectorizing Loops, §6.1 in Optimizing Applications on Cray X1 (TM) Series Systems, Cray.
  • "Vector Processors," Computer Architecture: A Quantitative Approach, 4th Ed., Appendix F, by J. L. Hennessy and D. A. Patterson, Morgan Kaufmann, 2007. (45 pages) Also available as Appendix G in the 3rd Edition, if you have the CD.

  • Lecture 9, 10/22/09 (Thu): Performance Programming with CUDA
  • Homework #3 (due Friday 10/30) posted
  • Lecture slides posted
  • Today's reading:
  • Scalable Parallel Programming with CUDA, J. Nickolls et al., ACM Queue, 6(2):40-53, March/April 2008. (Both PDF and HTML)
  • CUDA, Supercomputing for the Masses, by Rob Farber, parts I and II (of a 14-part article; especially helpful if you are on the GPU Technology track).

  • To enquire further
  • CUDA Zone, http://www.nvidia.com/object/cuda_home.html: more articles, the CUDA download, and documentation. If you are on the GPU track, you'll want to print out the documentation, including the CUDA Programming Guide and the Best Practices Guide.
  • GPU Resources

  • Lecture 10, 10/26/09 (Monday at 1pm in EBU3B (CSE) Room 1202): Accelerator Programming
  • Guest speaker: Michael Wolfe
  • Today's reading:
  • The PGI Accelerator Programming Model on NVIDIA GPUs Part I and Part II

  • Lecture 11, 10/27/09: Memory System Performance, Collective Communication
  • Lecture slides posted
  • Today's reading:
  • Optimization of Collective Communication Operations in MPICH, by R. Thakur, R. Rabenseifner, and W. Gropp. Int’l. J. of High Performance Computing Applications, 19(1):49-66, Spring 2005
  • Supplemental reading: Text. Introduction to Parallel Computing, 2nd Ed. Grama et al., Benjamin-Cummings, 2003.
    Chapter 4: pp. 149-160, 166-179 (skip §4.5.2); pp. 184-189. Chapter 6: pp. 260-272.
  • Detailed discussions about MPI collective communication (For Reference). MPI: The Complete Reference, by Marc Snir et al.

  • Lecture 12, 10/29/09: Performance Measurement, Advanced Communication
  • Lecture slides posted
  • Today's reading: R. van de Geijn and J. Watts, SUMMA: Scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience, 9:255-274 (1997) (also on Netlib)
  • For Reference:
  • MPI: The Complete Reference, by Marc Snir et al. Detailed reference material on collective communications.
  • Parallel print function. PPF is the Parallel Tools consortium's parallel print facility. For more information, consult the PPF web page. The software is installed on Triton in $(PUB)/lib/PPF, examples in $(PUB)/examples/PPF (See the README file for important information about using the software).

  • Lecture 13, 11/3/09: Numerical Linear Algebra on the GPU
  • Lecture slides posted
  • As explained in class, there will be no formal lecture (other than some background material on Gaussian Elimination), and you are expected to contribute to the class discussions.
  • Today's reading:
  • Benchmarking GPUs to tune dense linear algebra, by V. Volkov and J. Demmel. Proc. 2008 ACM/IEEE Conf. on Supercomputing, Austin, TX, Nov. 15 - 21, 2008.
  • For background on Gaussian Elimination:
  • Notes on Gaussian Elimination, "Quick review of Gaussian Elimination," Jim Demmel, UC Berkeley
  • A detailed explanation with worked examples: http://mathworld.wolfram.com/GaussianElimination.html
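As a companion to the Gaussian Elimination background, here is a minimal serial C sketch of elimination with partial pivoting followed by back substitution. The function name gauss_solve, the row-major storage, and the in-place overwriting of A and b are assumptions for illustration; this is not the blocked GPU algorithm studied in the Volkov and Demmel paper.

```c
#include <math.h>

/* Solve A x = b by Gaussian elimination with partial pivoting (hypothetical
   helper). A is n x n, row-major, and is overwritten by the upper triangular
   factor; b is overwritten by the solution x.
   Returns 0 on success, -1 if a pivot is exactly zero (A singular). */
int gauss_solve(int n, double *A, double *b)
{
    for (int k = 0; k < n; k++) {
        /* Partial pivoting: pick the row with the largest |A[i][k]|, i >= k */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i * n + k]) > fabs(A[p * n + k]))
                p = i;
        if (A[p * n + k] == 0.0)
            return -1;
        if (p != k) {  /* swap rows k and p of A and entries k and p of b */
            for (int j = 0; j < n; j++) {
                double t = A[k * n + j];
                A[k * n + j] = A[p * n + j];
                A[p * n + j] = t;
            }
            double t = b[k];
            b[k] = b[p];
            b[p] = t;
        }
        /* Eliminate column k below the diagonal */
        for (int i = k + 1; i < n; i++) {
            double m = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; j++)
                A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    /* Back substitution on the upper triangular system */
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++)
            s -= A[i * n + j] * b[j];
        b[i] = s / A[i * n + i];
    }
    return 0;
}
```

The triple loop in the elimination step is where nearly all of the O(n^3) work lies, which is why parallel and GPU versions focus on how that update is distributed and blocked.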

  • Lecture 14, 11/5/09: Load Balancing
  • Lecture slides reposted
  • Today's reading:
  • Reader on Data Decomposition
  • Text. Chapter 3: pp. 85-86, 115-142; Chapter 8: pp. 352-372. (Skim the discussions of the pipelined algorithm.)
  • Notes on Gaussian Elimination, "How to Layout Matrices on Distributed Memory Machines," Jim Demmel, UC Berkeley

  • Lecture 15, 11/10/09: More Irregular Problems; GPU applications
  • Lecture slides posted
  • Jim Demmel's notes on Simulating Particle Systems.
  • S. Sengupta et al., Scan Primitives for GPU Computing, Graphics Hardware 2007, pp. 97-106.
  • For further reading
  • N. Bell and M. Garland, Efficient Sparse Matrix-Vector Multiplication on CUDA, NVIDIA Technical Report NVR-2008-004, Dec. 2008.
  • M. Harris, S. Sengupta, and J. D. Owens, Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39 (2008).
  • S. Chatterjee, G. E. Blelloch, and M. Zagha, Scan primitives for vector computers, Proc. 1990 ACM/IEEE Conf. on Supercomputing, New York, pp. 666-675.
  • L. Nyland, M. Harris, and J. Prins, Fast N-Body Simulation with CUDA, GPU Gems 3, Chapter 31 (2008).
  • For further reading on spacefilling curves
  • Using Space-filling Curves for Multi-dimensional Indexing by J. K. Lawder and P. J. H. King, 4th July 2000.
  • Space-filling Curves. Applet by V. B. Balayoghan, U. Texas at Austin.
  • Plane filling curves
  • Space-filling Curves, by Zbigniew Fiedorowicz, Ohio State University.
  • Hilbert Curve (Wolfram Research).
  • The Peano Curve and Fractal Curves. Excerpt from the 30th Edition of the CRC Standard Mathematical Tables and Formulas, 1995, CRC (transcribed by Silvio Levy).
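The scan readings above center on the work-efficient up-sweep/down-sweep (Blelloch) algorithm for an exclusive prefix sum. The serial C sketch below mirrors that structure, assuming the array length is a power of two; the function name blelloch_scan is illustrative. Within each level, the iterations of the inner loop touch disjoint elements, which is what a GPU implementation would run in parallel, one thread per iteration.

```c
/* In-place exclusive prefix sum using the up-sweep / down-sweep scheme
   (hypothetical helper). Assumes n is a power of two, n >= 1.
   Afterwards a[i] holds the sum of the original a[0..i-1], and a[0] == 0. */
void blelloch_scan(int *a, int n)
{
    /* Up-sweep (reduce): build partial sums in a balanced binary tree.
       After this phase a[n-1] holds the total. */
    for (int d = 1; d < n; d *= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];

    /* Clear the root, then push prefix sums back down the tree. */
    a[n - 1] = 0;
    for (int d = n / 2; d >= 1; d /= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d) {
            int t = a[i - d];   /* left child's partial sum */
            a[i - d] = a[i];    /* left child gets the parent's prefix */
            a[i] += t;          /* right child gets prefix + left sum */
        }
}
```

For example, scanning {3, 1, 7, 0, 4, 1, 6, 3} yields {0, 3, 4, 11, 11, 15, 16, 22}. Both phases perform O(n) additions in total, in O(log n) levels.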

  • Lecture 16, 11/12/09: Multi-tier computing
  • Project Progress report due in class
  • Lecture slides posted
  • There will be no formal lecture, and you are expected to contribute to the class discussions.
  • Today's reading:
  • S. B. Baden and S. J. Fink, Communication overlap in multi-tier parallel algorithms, Proc. of the 1998 ACM/IEEE Conference on Supercomputing, San Jose, CA, Nov 1998, pp. 1-20.
  • M. Kistler, J. Gunnels, D. Brokenshire, and B. Benton, Petascale computing with accelerators. SIGPLAN Not. 44(4):241-250 (Feb. 2009).
  • To read further about Cell
  • J. Kahle et al., Introduction to the Cell multiprocessor, IBM J. Res. Dev. 49(4/5): 589-604, July 2005.
  • Kevin Krewell, Cell Moves Into the Limelight, Microprocessor Report

  • Lecture 17, 11/20/09 (Fri 3:00 to 4:20, EBU3B 1202): Programming Language Support
  • Lecture slides posted
  • Today's reading:
  • M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. SIGPLAN Not. 33(5):212-223 (May 1998). DOI (Also see HPCC06 Challenge Class 2 in Cilk)
  • K. Fatahalian et al., "Sequoia: Programming the Memory Hierarchy," Proc. ACM/IEEE SC 2006 Conf. PDF
  • For reference: Cilk web page

  • Lecture 18, 12/1: CSE 260 Symposium
  • Schedule

  • Lecture 19, 12/2 (Wed 5:00 to 7:00 PM, Rm. 4140 EBU3B): CSE 260 Symposium
  • Schedule

  • Lecture 20, 12/3: CSE 260 Symposium
  • Schedule

  • Maintained by baden @ ucsd.edu [Wed Nov 18 23:52:15 PST 2009]