Programming Lab #2: CUDA Implementation of a Cardiac Electrophysiology Simulation


Changelog

Date Description
25-Jan-12 Original posting
29-Jan-12 Remark about setting the arithmetic precision
30-Jan-12 Groups of 3 option for optimizing the OpenMP basis


In this assignment we'll explore CUDA implementations of the Aliev-Panfilov method from Assignment #1. We'll study performance programming techniques and compare against the OpenMP implementation from the previous assignment.

To collect performance results, we'll use the Forge GPU cluster located at NCSA. Refer to these instructions for how to access Forge and how to set up your programming environment. Like the CSEClass machines, Forge uses Fermi Nvidia GPUs.

Since Lilliput is based on earlier 200-series devices (C1060 Tesla), it lacks some capabilities, such as the reconfigurable L1 cache/shared memory. However, you can still do development work on that machine. Like Forge, the CSEClass machines use Fermi devices, though Forge's device is a higher-end Tesla M2070 (capability 2.0). CSEClass 01-02 each have a single GTX 580 (capability 2.0), with 512 cores configured as 16 multiprocessors of 32 cores each, whereas 03-07 (capability 2.1) have a GTX 460, with 336 cores configured as 7 multiprocessors of 48 cores each.

This programming assignment contains two parts; there is one intermediate milestone, with its own due date, to ensure progress.

In part I of the assignment, try to improve the performance of the code. Your options are limited, and you may not be able to improve performance by more than about 25%. However you realize this performance improvement, you are not permitted to explicitly manage shared memory (that will come in part II). Once you have optimized your CUDA code, conduct a performance study and compare against your OpenMP implementation from Assignment #1.

In part II of the assignment, enhance your code to use shared memory. Try to make the code go as fast as you can. We will post performance goals after the deadline for part I has passed.

To learn more about the device and performance programming issues, refer to the paper by Vasily Volkov and James W. Demmel: Benchmarking GPUs to tune dense linear algebra, Proc. 2008 ACM/IEEE Conf. on Supercomputing (SC '08). Though this paper doesn't discuss stencil methods, and it appeared before Fermi was introduced, it provides a wealth of detail (and insight) about the inner workings of Nvidia GPUs.

The milestones (with due dates) are as follows:

Part I 2-Feb-12
Part II 14-Feb-12

The Assignment

A working, but inefficient, CUDA version of the code has been provided on forge.ncsa.illinois.edu in $PUB/HW/A2 (and in the same place on Lilliput), where $PUB is defined as

/uf/ac/baden/cse260-wi12/

This "naive" implementation will reference solution arrays residing in global memory. Memory accesses may or may not coalesce, and there may be costly conditional statements. Perform the following experiments, using single precision arithmetic (see below for how to compile in the precision.)

  1.   Compare the performance of your basic CUDA implementation against your OpenMP implementation from the previous assignment, running with the number of OpenMP threads that maximizes performance (this is likely to be 4 threads). For your CUDA implementation, set N=1023, t=5.0, and run with a thread geometry of 16 × 16. Determine the speedup (or slowdown) that your CUDA implementation achieves over the OpenMP implementation. Then experiment with the thread block geometry and choose the geometry that maximizes performance.

    Note that this result assumes you are using global memory to access the solution arrays, that you haven't optimized memory locality by using shared memory, and that you have set the program to favor shared memory over L1, which effectively reduces the size of the cache (see the sketch following this list).

  2. Run for a sequence of -t parameter values, in the range of 0.125 to 16.0, successively doubling the value starting from 0.125. Normalize the running time to the -t parameter, so that you get a per-time-unit value. Plot the variation in per-time-unit value as a function of the duration of the simulation, plotting both the GPU and OpenMP implementations on the same graph. Also include your data in tabulated form.
  3. Explain your results in (2). Using these results, estimate the data transfer costs into and out of the GPU.
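As noted in item 1, on Fermi you can configure the on-chip memory to favor shared memory over L1 for a given kernel. One way to request this split, using the hypothetical pde_kernel from the sketch above:

cudaFuncSetCacheConfig(pde_kernel, cudaFuncCachePreferShared);  // 48KB shared / 16KB L1
// cudaFuncCachePreferL1 requests the opposite split (48KB L1 / 16KB shared);
// cudaFuncCachePreferNone leaves the choice to the driver.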

In the spirit of friendly competition, post your performance results to the A2 Moodle forum.

Groups of 3

If you are working in a group of 3, your assignment will be somewhat more ambitious. In addition to what has been described above, you also need to complete the items marked "groups of 3" below (e.g., the precision experiments and the plotting capability).

CUDA Coding Conventions

A CUDA application may contain code in standard C/C++, devoid of CUDA, as well as code in CUDA C. CUDA code resides in .cu files. We use the coding convention that there is a routine called solve() that is invoked from main(), and that the two are kept in separate files. Moreover, the Makefile does not specify the object file containing solve() on the loader line. Rather, the file containing main(), apf.cu, #includes the CUDA source containing solve(), solve.cu. This is a common pattern in architecting CUDA code.
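Schematically, the convention looks like this (a minimal sketch; argument lists elided):

// apf.cu -- contains main() and #includes the CUDA source with solve()
#include "solve.cu"

int main(int argc, char **argv) {
    // ... parse the command line, allocate and initialize the meshes ...
    solve(/* solution arrays, N, t, ... */);
    // ... report the norms and running time ...
    return 0;
}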

Makefile command line flags

Setting the thread block geometry

The makefile is set up to enable you to set the thread block geometry at compile time, via the Make command line, e.g. for 16 × 16 thread blocks:

make  bx=16 by=16

To obtain different geometries, change the values after the equal signs. Be sure to run make clean before rebuilding the code with different thread block parameters.
When you specify the bx= and by= make command line arguments, your code will be compiled with correspondingly defined C preprocessor command line arguments

DIM_X = -DBLOCKDIM_X=$(bx)
DIM_Y = -DBLOCKDIM_Y=$(by)

where $(bx) and $(by) are the values you specified on the make line. When you specify your thread geometry in your CUDA source, you'll refer to these C preprocessor macros as follows:

dim3 threads(BLOCKDIM_X,BLOCKDIM_Y);

Do not modify the portions of the provided Makefile that set the block geometry. It should build the application with a 16 × 16 geometry by default. We will test other geometries, using our own Makefile to do so. It's important that you stick to the above convention so that your application builds correctly for the arbitrary geometries that our Makefiles specify (you may assume all values are integer powers of two), including geometries that don't divide the problem evenly.
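For geometries that don't divide the mesh evenly, the usual approach is to round the grid dimensions up with a ceiling division and guard out-of-range threads inside the kernel, as in the naive kernel sketch earlier. A sketch, where n+1 is the number of points to cover per axis and the kernel arguments are the hypothetical names from that sketch:

dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
int points = n + 1;                               // loops run from 1 to N+1
dim3 grid((points + threads.x - 1) / threads.x,   // ceiling division
          (points + threads.y - 1) / threads.y);
pde_kernel<<<grid, threads>>>(E_d, E_prev_d, alpha, n, m);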

If you are experimenting with the effect of numerical precision on performance (groups of 3), we have included hooks to enable you to do so; see the file types.h and note the comments at the top of the Makefile.

Setting the numerical precision

To compile for single precision floating point, specify the argument

single=1

on the make command line. This option, which is also accepted by the code in Assignment #1, adds -DFLOAT to the compiler command line, which defines the _DOUBLE_ macro in types.h so that all values designated as type _DOUBLE_ will have type float. Be sure to use the generic type specifier _DOUBLE_ wherever you need the precision to be settable. If you don't specify single=1, _DOUBLE_ defaults to double precision, i.e. double. Certain intermediate scalar quantities should be computed in double precision even when single precision is used to compute the solution; these involve the computation of the timestep dt, and they are set up in the code. Don't change them, as they can have a significant impact on accuracy.
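The mechanism in types.h presumably amounts to something like the following sketch; consult the provided file for the actual definitions:

// types.h (sketch): pick the working precision at compile time.
// Building with "make single=1" adds -DFLOAT to the compile line.
#ifdef FLOAT
typedef float  _DOUBLE_;    // single precision
#else
typedef double _DOUBLE_;    // default: double precision
#endif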

How to test your code

Test your code by comparing the results of your CUDA implementation against the serial one. If you observe any differences, note them in your report and discuss what you think the causes may be.

Be sure that, while collecting performance data, you have not enabled statistics reporting (-s) or, if in a group of 3, plotting (-p).

Performance Measurement

Timings on a dedicated GPU should be highly reproducible, since you have sole access to the device while your job runs. This isn't the case on interactive resources, where timings on the host feel contention from other users and can vary much more. As a result, the time to transfer data between host and device, and the time to run code on the host, can vary far more than the time to run a kernel on the device.

There may also be interference from system activities. It is possible to take timings on the device only; this ensures that inconsistent host/device transfer times won't taint your results. There is code for this in the incrArray example, enabled by adding the following line to your Makefile:

NVCCFLAGS += -DCUDA_TIMER

GPU timers are discussed in the CUDA C Best Practices Guide, Version 4.0, § 2.1.2, page 16. (§ 4.6.2.6 of the CUDA API Reference Manual, Version 4.0, discusses cudaEventRecord and cudaEventSynchronize on page 36, and is the definitive source of information about these routines.)
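The event-based pattern described there looks like this (kernel and argument names are the hypothetical ones from the earlier sketches):

// Time the kernel on the device, excluding host/device transfers
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
pde_kernel<<<grid, threads>>>(E_d, E_prev_d, alpha, n, m);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);       // block until the stop event completes

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);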

Building and Running CUDA Programs

A nice feature of CUDA is that it is free, so you may install it on your own machine and develop your GPU code off-line.

Here are Fred Lionetti's notes (somewhat outdated) on how to install CUDA under MacOS or Ubuntu. The definitive source of information and software is the Developer Zone, mentioned above.

CUDA code examples reside in $PUB/Examples/Cuda on both Forge and the CSEClass machines. The CUDA SDK includes a number of examples as well; consult this directory on Lilliput:

/home/baden/NVIDIA_GPU_Computing_SDK/C/src

You may develop code on another NVIDIA GPU if you like, but your assignment will be graded against behavior on Forge. CUDA Version 4.0 has been installed on the CSEClass workstations.

We will be grading your code under the assumption that you are using the "arch" file convention rather than the CUDA SDK. Thus, to build your code, use the Makefile we’ve provided, as it has the environment properly set up to build CUDA applications that include C and C++ code.



Things you should turn in

Document your work in a well-written report of about 3 pages (or somewhat longer for groups of 3) that discusses your findings carefully. Your report should present a clear evaluation of the design of your code, including bottlenecks of the implementation, and describe any special coding or tuned parameters. Provide pseudo code listings that illustrate how you parallelized the code, but do not include full code listings, as these will be in your turnin.

Cite any written works or software as appropriate in the text of the report. Citations should include any software you used that was written by someone else, or that was written by you for purposes other than this class.

Your turnin will consist of two parts.

  1. Turn in your source code and your lab report electronically not later than the 9pm deadline on the day the assignment is due. We will provide instructions about the electronic turnin procedure via Moodle next week.
  2. Also turn in a hard copy of your report not later than 9am the next morning. You may leave your report in Professor Baden's mailbox on the 2nd floor of EBU3B. To avoid lateness penalties, don't miss these two deadlines.

To hand in your assignment electronically, follow instructions to be discussed in the Moodle A2 forum.

Your turnin must include the provided Makefile, so that we can build and test your code. We will run your code to test it and will check correctness using the residual measurement.

Your turnin must include a self-evaluation discussing the division of labor and other aspects of how you worked together as a team (a copy of the form is available in $PUB/turnin/teameval.txt).

At the top level of your directory, place an ASCII file called README.txt that lists all the parts of your electronic submission. Alternatively, you may create an html file called index.html and link from there to a README file as well as the src and report directories.



Bookkeeping

Recall the bookkeeping detail from Assignment #1. We divide each axis into N subintervals using N+1 points positioned uniformly along the line at positions x=i×h, i=0, 1, 2, ..., N, where the mesh spacing h = 1/N. We also pad the boundaries with a cell on each side, in order to properly handle updates next to a boundary. (This padding simplifies the code, and you'll see comments about it in solve.cpp.) This is shown below for one axis of the mesh. The points labeled 'o' correspond to the physical boundary; the points labeled 'P' are padding.

i=0   i=1    i=2                    i=N     i=N+1  i=N+2
P------o------|------| ...... |------|------|------o------P
       ^                                           ^
       |                                           |
      x=0.0  x=h    x=2h                 x=1.0-h   x=1.0

Consider the case where we set N=64. We allocate an array of size 67 × 67, but we solve the problem on a central 65 × 65 submesh. Thus, our loops range from 1 to N+1, that is, 1 to 65. Points 0 and 66 correspond to buffer points.

It follows that when the device subdivides the data to partition it for massively multithreaded execution, it is actually subdividing N+1 points per axis rather than N. The device will handle this case, though there may be some load imbalance. This imbalance will likely be minor unless N is small.
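In code, the bookkeeping for N = 64 amounts to the following sketch (the array and variable names are illustrative):

#include <stdlib.h>
#include "types.h"               // defines the generic type _DOUBLE_

// Bookkeeping sketch for N = 64: padded allocation and loop bounds.
void setup(void) {
    int N = 64;
    int m = N + 3;               // padded width: 67
    _DOUBLE_ *E = (_DOUBLE_ *) malloc(m * m * sizeof(_DOUBLE_));
    // The update touches rows and columns 1 through N+1 (65 points per
    // side); rows and columns 0 and N+2 (here, 0 and 66) are padding.
    for (int i = 1; i <= N + 1; i++)
        for (int j = 1; j <= N + 1; j++)
            E[i * m + j] = 0;    // stand-in for the real stencil update
    free(E);
}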

Testing your code

We will grade your assignment on Forge. Be sure to do your final testing on that machine, to avoid any surprises at turnin time.

Your code will be tested for correctness and performance. When grading your assignment, we will test your code against one public input (in the script file run-tests.sh) plus other inputs that will not be made public. Be sure to test against various combinations of the -n and -t command line parameters. For the purposes of this assignment, there is no need for your program to run for more than 15 or 20 seconds.

Test your code by comparing against the provided implementation. Unless you implement plotting (Groups of 3), you'll need to rely on the residual to catch errors.
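For reference, the residual quantities reported by the code are the max (L∞) and L2 norms of the solution. They are computed along these lines (a sketch; the exact normalization shown here is an assumption, so defer to the provided code):

#include <math.h>
#include "types.h"                // defines the generic type _DOUBLE_

// Sketch: max and L2 norms over the (N+1) x (N+1) interior of the mesh.
void stats(const _DOUBLE_ *E, int N, int m, double *mx, double *l2) {
    *mx = 0.0; *l2 = 0.0;
    for (int i = 1; i <= N + 1; i++)
        for (int j = 1; j <= N + 1; j++) {
            double v = fabs((double) E[i * m + j]);
            if (v > *mx) *mx = v;
            *l2 += v * v;
        }
    *l2 = sqrt(*l2 / ((double)(N + 1) * (N + 1)));  // normalization assumed
}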

To test your code for correctness, vary the block size and verify that the results are independent of it. To validate correctness, use the output shown below. In particular, for the command line options

 -n 50 -t 1000

your output should contain the following two consecutive lines:

End at time: 1000, iteration 11495
Max:   9.46128e-01, L2norm:   6.15442e-01

Be sure that, while collecting performance data, you have not enabled plotting (-p) or statistics reporting (-s).

To facilitate testing, the simulator outputs some essential information into a file called Log.txt: N, the thread block geometry, the running time, and the L2 and L∞ norms. We will use this file in autograding, so do not change the file Report.cpp, as doing so could cause your assignment to be graded incorrectly. Any previous copy of Log.txt will be overwritten, so if you are doing automated testing, be sure to rename the file between invocations.

Performance Measurement

Use the provided timer to take timings. Run the program at least 3 times -- for each input size or any other variable that affects the running time -- until you see a consistent timing emerge, that is, until a majority of the times agree to within 10%. If you do not observe a consensus, try doubling the number of measurements. Exclude outlying timings (but note how large they are). Report the minimum running time rather than the average, as discussed in class.

Since Forge's login node has 8 GPUs which may be used interactively, you may do your preliminary experiments there. However, your timings will feel contention from other users. When you collect the numbers for your report, be sure to run in batch mode, modifying the provided batch script file as needed.

We have provided a high resolution timer based on gettimeofday(). This timer, which resides in Timer.cpp, collects timings from the host.
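A gettimeofday()-based timer typically looks like the following sketch; the provided Timer.cpp is the authoritative version:

#include <sys/time.h>

// Return wall-clock time in seconds, with microsecond resolution
double getTime() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (double) tv.tv_sec + 1.0e-6 * (double) tv.tv_usec;
}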

Plotting capability (groups of 3)

The code comes with a plotting capability for observing the simulation dynamics, which can be helpful when debugging. The plotter (splot.c) interfaces with gnuplot, which is freeware. Note that two entry points are provided: one for the provided multidimensional array implementation, and another supporting the flattened arrays you will need in order to plot the data in your implementation.


Turnin

Turn in your source code and your lab report electronically not later than the 9pm deadline on the day the assignment is due. We will provide instructions about the electronic turnin procedure via Moodle next week. Also turn in a hard copy not later than 9am the next morning. You may leave your report in Professor Baden's mailbox on the 2nd floor of EBU3B. To avoid lateness penalties, don't miss these two deadlines.

A 1-2 page lab report will be sufficient. Be sure to do the scaling study requested above. In addition to the problem instance described above, discuss what you did to the code to improve performance, that is, performance programming. If you improved performance over a series of steps, describe each step and the resulting performance. If you like, summarize the information in a list or a table. If necessary, provide pseudo code to illustrate the concepts.

Your turnin must also include three important documents:

  1. A valid team self-evaluation discussing the division of labor and other aspects of how you worked together as a team. A copy of the form is available in $PUB/turnin/teameval.txt.
  2. A MEMBERS file that identifies your team members, with your full names and email addresses. A template showing the format can be found in $PUB/turnin/MEMBERS.
  3. Your report, a file called report.pdf

These documents are mandatory; the turnin process will not complete without them. Copies of these files are available in the turnin directory, $PUB/turnin. Empty forms, or forms not properly filled out, will result in a 10% penalty, in addition to holding up the grading process.

A note about the turnin process: since the turnin script will try to compile your project before letting you submit your assignment, it's important that you turn in all the files you have, including a Makefile. However, when grading your assignment, we will substitute the original copies of any files you are not allowed to modify (as noted in the list above). This is why it is important that you not change those files in your turnin, as any changes will be lost. The turnin script will have an option to do some basic testing, and we strongly suggest you avail yourself of this option.

Don't wait until the last minute to attempt a turnin for the first time. The turnin program will ask you various questions, and it will check for the presence of certain required files, as well as your source code. It will also build the source code if you like, to catch any obvious build errors. Try out the turnin process early on so you are sure you are familiar with it, and that all the required additional files are present. The turnin procedure will be available about a week before the deadline.


Copyright © 2012 Scott B. Baden   [Thu Feb 2 21:54:40 PST 2012]