CSE 268C - Performance Programming
A systematic approach to the design, writing and
tuning of programs to sustain near-peak
performance, with particular emphasis on RISC
processors and massively-parallel computers.
Instructor:
Larry Carter.
Class times: Tuesdays and Thursdays, 2:20-3:40, AP&M 4218.
Office hours: Tuesday & Thursdays, 3:45-5:00, or by
appointment. AP&M 4101.
Assignments
We'll begin with short, easy assignments to get things started, but
later in the course there will be fewer, longer projects.
Related material
Slides used in NPACI Parallel Computing Institute in
PostScript or
PDF Format.
(I've used some of these in class.)
Information about the
CPU's of various computers.
My papers
about performance programming.
Homework #1
Due Sept 29
Read the handout, be prepared to discuss it in class Tuesday.
Homework #2
Profiling exercise
(due October 1)
- Choose your favorite computer (be sure it has a Fortran compiler).
- Compile the NAS 1 ("pencil and paper") ft.f
using a reasonable set of compiler options. You may have to change
the function "timer" to something that works on your machine.
- Time the benchmark, getting wallclock time, CPU time, and the
time reported by the benchmark (which is for the "timed section" of
the code).
- Profile it using prof, gprof, or your favorite profiler.
- Compare the times of the unprofiled versus the profiled code.
- Find the procedure that takes the largest percentage of runtime.
- Make a reasonable guess about what the most time-consuming inner
loop in that procedure is. Generate the assembly code for the
program and find the inner loop in the assembly listing.
- Calculate how much faster the program would run if that most
time-consuming procedure were made to run ten times faster.
Important: if one run takes time t1 and a second run takes
time t2 < t1, then we say "the second run was t1/t2 times
faster". We can also say, "the second run was 100((t1/t2) - 1) percent
faster".
- Summarize your results (just the facts - no "writeup" needed).
E.g. "Computer: Sparc 20, 50 MHz clock. Compiler: g77. Compiler
options: -fast. Timing: (figures for CPU, wallclock, and timed section,
with and without profiling) ... Etc ..."
Homework #3
Compute "cycles per iteration" for the inner loop of
transx in the ft benchmark. Spend 1 hour improving its
performance.
Homework #4
Read paper, "A comparison of
the floating-point performance of current computers" by Steven
H. Langer from July/Aug 1998 issue of Computers in Physics.
Project #1
Choose a program that spends
most of its execution time in a few relatively simple loops. Improve
the loops. In most cases, this will involve
- looking for opportunities for register reuse,
- insuring there is enough instruction-level parallelism
to keep the processor happy, and
- improving the cache and/or TLB behavior of the program.
Your goal is
to be able to explain how long the improved code is taking and why it
would be hard to get it to go much faster. Depending on the program
you choose, I may want you to be more or less ambitious. I strongly
encourage you to talk with me at least twice about your program, so I
can give you some idea of the type of improvements and explanations I
am or am not expecting.
Some possible programs:
- Matrix multiply, specifically LAPACK's DGEMM.
- LU decomposition, which is very
similar to matrix multiplication.
- FT, CG,
IS, or MG
from the NAS 1 benchmark suite.
- Protein string matching via the Smith-Waterman algorithm. Perhaps
you could update this to use MMX instructions.
- A couple of the Livermore
Fortran Kernels. They're in loops.tar.gz (Fortran) and
cloops.tar.gz (C). These don't have cache problems, but it might
be interesting to compare the C and Fortran compilations.
- Anything else that appeals to you (check with me first!)
Project #2
Do something interesting (after
checking with Larry that he thinks it's interesting too!)