Programming Lab #2, part II: Performance programming in CUDA
(Due February 14, 2012 at 6PM)


Date Description
02-Feb-12 Original posting

The purpose of this assignment is to experiment with Fermi's fast on-chip L1 cache/shared memory and to understand the tradeoffs in using this configurable store. Your starting point is your CUDA implementation of the Aliev-Panfilov simulation from Assignment #2. You'll optimize this code to use shared memory. You'll also add the capability to selectively prefer L1 cache over shared memory, or vice versa. Add this capability both to your code from A2 and to your optimized code.

To enable you to switch preferences between L1 cache and shared memory, the starter code from A2 has been modified; the code resides in $PUB/HW/A3 on Lilliput and on Forge. There is a new option to specify a preference for L1 cache (-l); without it, the code sets the preference for shared memory. This capability necessitated changes in the following files:

cmdLine.cpp   Report.cpp

Since cmdLine.cpp and Report.cpp are not to be changed, you should focus on the two CUDA C files and compare them with the versions supplied for A2.
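As a rough illustration of how a cache preference can be requested, the sketch below uses the CUDA runtime call cudaFuncSetCacheConfig. The kernel solveKernel, the helper names, and the preferL1 flag are hypothetical placeholders, not names from the provided code; how the -l option is actually wired into the starter code may differ.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; it stands in for your solver kernel.
__global__ void solveKernel(float *E, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) E[i] += 1.0f;
}

// Map a boolean (assumed to be set when -l is given) to a CUDA cache preference.
cudaFuncCache cachePreference(bool preferL1) {
    return preferL1 ? cudaFuncCachePreferL1 : cudaFuncCachePreferShared;
}

// Request the preference for this kernel; call this before launching it.
// The setting is a preference: the runtime may use a different configuration
// if the kernel cannot run with the requested one.
cudaError_t applyCachePreference(bool preferL1) {
    return cudaFuncSetCacheConfig(solveKernel, cachePreference(preferL1));
}
```

The preference is set per kernel, so if you have both a basic and an optimized kernel, each one needs its own call.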

To collect performance results, we'll use the Forge GPU cluster located at NCSA. Since Lilliput is based on earlier 200 series devices (C1060 Tesla), it lacks some capabilities, such as a reconfigurable L1 cache/shared memory. However, you can still do development work on that machine. Like Forge, the cseclass machines use Fermi devices, though Forge's device is a higher-end Tesla M2070 (capability 2.0). Cseclass 01-02 have a single GTX 580 (capability 2.0) with 512 cores configured as 16 multiprocessors of 32 cores each, whereas 03-07 have a GTX 460 (capability 2.1) with 336 cores configured as 7 multiprocessors of 48 cores each.
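Because the machines differ in capability, it can help to check at run time which device you are on. The following is a minimal sketch using cudaGetDeviceProperties; the helper names isFermi and reportDevice are illustrative and do not come from the provided code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fermi-class devices report compute capability 2.x.
bool isFermi(int major) { return major == 2; }

// Print a few properties of device `dev`.
// Returns the major capability number, or -1 if the query fails.
int reportDevice(int dev) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;
    printf("%s: capability %d.%d, %d multiprocessors, "
           "%zu bytes of shared memory per block\n",
           prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, prop.sharedMemPerBlock);
    return prop.major;
}
```

Calling isFermi(reportDevice(0)) on a pre-Fermi device such as the C1060 would return false, which is a hint that the L1/shared memory preference will have no effect there.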

Try to make the code go as fast as you can. In the spirit of friendly competition, post your performance results to the A3 Moodle forum. We will post performance goals soon after the deadline for part I has passed.

The Assignment

Your current CUDA code accesses all simulation state via global memory. As a result, it accesses each mesh element multiple times. You can improve performance significantly by buffering frequently accessed data in fast on-chip memory. Modify the code to use shared memory, and try other performance programming techniques to improve performance still further. Be sure that your global memory accesses are coalesced.
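One possible way to buffer mesh data in shared memory is sketched below for a generic 5-point stencil. This is not the Aliev-Panfilov kernel from the starter code: the update formula, the names stencilShared, launchStencil and blocksFor, the BX/BY macros, and the (n+2) x (n+2) ghost-cell layout are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

#ifndef BX
#define BX 16   // assumed block width, normally chosen at compile time
#endif
#ifndef BY
#define BY 16   // assumed block height
#endif

// Number of blocks of width b needed to cover n mesh points.
int blocksFor(int n, int b) { return (n + b - 1) / b; }

// in, out: (n+2) x (n+2) row-major arrays with a one-cell ghost border.
// Must be launched with blockDim == (BX, BY).
__global__ void stencilShared(const float *in, float *out, int n, float alpha) {
    __shared__ float tile[BY + 2][BX + 2];

    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;  // position inside the tile
    int col = blockIdx.x * BX + tx;                  // global column, 1..n
    int row = blockIdx.y * BY + ty;                  // global row, 1..n
    int stride = n + 2;
    bool inside = (row <= n && col <= n);

    if (inside) {
        // Consecutive threadIdx.x values read consecutive addresses.
        tile[ty][tx] = in[row * stride + col];
        // Threads on the edge of the block (or mesh) also load the halo.
        if (threadIdx.x == 0)                tile[ty][0]      = in[row * stride + col - 1];
        if (threadIdx.x == BX - 1 || col == n) tile[ty][tx + 1] = in[row * stride + col + 1];
        if (threadIdx.y == 0)                tile[0][tx]      = in[(row - 1) * stride + col];
        if (threadIdx.y == BY - 1 || row == n) tile[ty + 1][tx] = in[(row + 1) * stride + col];
    }
    __syncthreads();  // every thread in the block reaches this barrier

    if (inside) {
        out[row * stride + col] = tile[ty][tx] +
            alpha * (tile[ty][tx - 1] + tile[ty][tx + 1] +
                     tile[ty - 1][tx] + tile[ty + 1][tx] - 4.0f * tile[ty][tx]);
    }
}

// Host-side launch; dIn and dOut are device pointers.
void launchStencil(const float *dIn, float *dOut, int n, float alpha) {
    dim3 block(BX, BY);
    dim3 grid(blocksFor(n, BX), blocksFor(n, BY));
    stencilShared<<<grid, block>>>(dIn, dOut, n, alpha);
}
```

In this sketch each interior value is read from global memory once per block and then reused by up to five threads from the tile. The halo loads along the left and right edges are strided, which is one of the costs you may want to measure.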

Perform the following experiment using single precision arithmetic.

  1. Set N=1024, t=50.0, and run with a thread geometry of 16 × 16. Compare the performance against your basic CUDA implementation from Lab 2. What speedup do you observe? Note that this run will prefer shared memory to L1 cache.
  2. Vary the block geometry and determine the geometry that maximizes performance. When doing these experiments run for short time intervals (t=10.0) as some geometries may slow down the code significantly, wasting precious computer time.
  3. Set the preference to L1 cache using the -l option. Repeat steps (1) and (2). Note and explain any change in performance and again choose the optimal tile size.
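When explaining how block geometry and the cache preference interact, a back-of-the-envelope estimate of resident blocks per multiprocessor may be useful. The sketch below is host-side arithmetic only. It assumes one shared tile with a one-cell halo per block, and it assumes limits of 1536 resident threads and 8 resident blocks per multiprocessor for capability 2.x devices; it ignores register usage, so treat the result as an upper bound. All names are illustrative.

```cuda
#include <algorithm>
#include <cstddef>

// Assumed per-multiprocessor limits for capability 2.x devices.
const int kMaxThreadsPerSM = 1536;
const int kMaxBlocksPerSM  = 8;

// Shared memory for one (bx+2) x (by+2) tile of elemSize-byte values.
size_t tileBytes(int bx, int by, size_t elemSize) {
    return (size_t)(bx + 2) * (size_t)(by + 2) * elemSize;
}

// Upper bound on blocks resident on one multiprocessor, given the
// shared memory available to it (e.g. 48 KB or 16 KB on Fermi).
int residentBlocks(int bx, int by, size_t elemSize, size_t sharedBytesPerSM) {
    int byThreads = kMaxThreadsPerSM / (bx * by);
    int byShared  = (int)(sharedBytesPerSM / tileBytes(bx, by, elemSize));
    return std::min(kMaxBlocksPerSM, std::min(byThreads, byShared));
}
```

For example, under these assumptions a 16 × 16 single precision tile needs 18 × 18 × 4 = 1296 bytes, and the thread limit (6 blocks of 256 threads) is reached before the shared memory limit under either preference.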

Be sure to explain all your results. In particular, analyze performance and discuss the bottlenecks.

Groups of 3

If you are working in a group of 3, your assignment will be somewhat more ambitious. In addition to what has been described above, you also need to complete the following.

Collecting timing data

Since Forge's login node has 8 GPUs which may be used interactively, you may do your preliminary experiments there. However, your timings will be affected by contention from other users. When you collect the numbers for your report, run in batch mode and be sure to specify the debug queue in order to run at a high priority. You'll need to modify the provided batch script file slightly; in particular, set the mailing address so you'll be notified when the job has completed.

More about the new A3 code module

As in the previous assignment, specify the arithmetic precision and the thread geometry on the make command.

To facilitate testing, the code outputs some essential information into a file called Log.txt: N, the thread block geometry, the running time, the L2 and L∞ norms, whether L1 or shared memory was preferred, and the numerical precision of floating point numbers (float or double). We will use this file in autograding, so do not change the file Report.cpp, as doing so could cause your assignment to be graded incorrectly. Any previous copy of Log.txt will be overwritten, so if you are doing automated testing, be sure to rename the file between invocations.

Things you should turn in

Document your work in a well-written report of about 4-6 pages (at the longer end of the range for groups of 3) which discusses your findings carefully. Your report should present a clear evaluation of the design of your code, including bottlenecks of the implementation, and describe the performance programming you performed. Negative results are also valid and especially valued if you carefully document them, with the underlying causes, or best guesses if the causes are unclear. Provide pseudocode listings that illustrate how you parallelized the code, but do not include full code listings as these will be in your turnin. Be sure to include any plotted data in tabular form.

Cite any written works or software as appropriate in the text of the report. Citations should include any software you used that was written by someone else, or that was written by you for purposes other than this class.

Your turnin will consist of two parts.

  1. Turn in your source code and your lab report electronically no later than the 9pm deadline on the day the assignment is due. To hand in your assignment electronically, copy your code over to Lilliput and then run the turninA3 script located in $PUB/turnin/turninA3. We will announce when the script has been enabled via the A3 Moodle forum.
  2. Turn in a hard copy of your report no later than 9am the next morning. You may leave your report in Professor Baden's mailbox on the 2nd floor of EBU3B. To avoid lateness penalties, don't miss these two deadlines.

Your turnin must include the provided Makefile so that we can build and test your code. We will run your code to test it and will check correctness using the residual measurement.

Your turnin must also include three important documents:

  1. A completed team self-evaluation discussing the division of labor and other aspects of how you worked together as a team. A copy of the form is available in $PUB/turnin/teameval.txt.
  2. A MEMBERS file that identifies your team members, with your full names and email addresses. A template showing the format can be found in $PUB/turnin/MEMBERS.
  3. Your report, a file called report.pdf

These documents are required, as the turnin process will not complete without them. Copies of these files are available in the turnin directory, $PUB/turnin. Empty forms, or forms not properly filled out, will result in a 10% penalty being assessed, in addition to holding up the grading process.

A note about the turnin process: Since the turnin script will try to compile your project before letting you submit your assignment, it's important that you turn in all the files you have, including a Makefile. However, when grading your assignment, we will substitute all the files that you are not allowed to modify (as noted in the list above) with the original copies of the version we provided you. This is why it is important that you do not change those files in your own turnin, as they will be lost. The turnin script will have an option to do some basic testing, and we strongly suggest you avail yourself of this option.

Don't wait until the last minute to attempt a turnin for the first time. The turnin program will ask you various questions, and it will check for the presence of certain required files, as well as your source code. It will also build the source code if you like, to catch any obvious build errors. Try out the turnin process early on so you are sure that you are familiar with it, and that all the required additional files are present. The turnin procedure will be available about a week before the deadline.

Copyright © 2012 Scott B. Baden   [Thu Feb 2 21:55:56 PST 2012]