Date | Description |
13-Jan-12 | Original posting |
19-Jan-12 | Note added about Host file |
In this assignment you'll apply multithreading (via OpenMP) to a simulator used to study the electrophysiology of the heart, and observe parallel speedups. You'll also do some performance programming in an effort to speed up the code. A working serial version of the code has been provided on Lilliput in $PUB/HW/A1, where $PUB is defined as
/class/public/cse260-wi12
Owing to its high memory bandwidth demands, this application will not scale linearly. Thus, it provides ample motivation for performance programming. Try to make the code go as fast as you can.
Choose a CSE workstation according to the last digit of your student ID number, using the mapping described in Moodle. You may develop code on another machine if you wish, but be sure to confirm all results on your assigned machine, as your assignment will be graded against behavior on that machine.
Simulations play an important role in science, medicine, and engineering. A simulator like the cardiac electrophysiology simulator we are using in this assignment can be used for clinical diagnostic and therapeutic purposes. To reduce computational costs, simplifying assumptions are often made to make the simulations feasible. In our clinical example, time may be of the essence in determining an appropriate life-saving treatment.
Cell simulators entail solving a coupled set of equations: a system of Ordinary Differential Equations (ODEs) together with a Partial Differential Equation (PDE). If you are interested in learning more about cardiac modeling, peruse www.labheart.org or thevirtualheart.org, or visit the web page of the UCSD Cardiac Mechanics Research Group.
Our simulator models the propagation of electrical signals in the heart, and it incorporates a cell model describing the kinetics of the membrane of a single cell. There can be different cell models, of varying degrees of complexity. We will use a model known as the Aliev-Panfilov model, which maintains two state variables and solves one PDE (the PDE couples multiple cells into a system). This simple model can account for complex behavior such as how spiral waves break up and form elaborate patterns. Spiral waves can lead to life-threatening situations such as ventricular fibrillation.
Our simulator models electrical signals in an idealized biological system in which voltages vary at discrete points in time, called timesteps, at the discrete positions of a mesh. The simulator is an iterative procedure that updates the simulation state variables (on the mesh) many times over a sequence of discrete timesteps. A simulation will run for some amount of simulated time, which is a real number. For our purposes, the units of time are unimportant. The timesteps subdivide the simulation time line that starts at t=0 and ends at some specified time t=T. The unit of subdivision is called Δt, or the time discretization. It is important to keep in mind that when we use the term "time," we need to specify whether we are talking about real-valued simulation time, or the discrete timestep. We will use the fully qualified name to avoid ambiguity. By convention, we number the discrete timesteps with integers. Thus, if we say that we are at timestep k of the simulation, then the current simulation time is kΔt.
To discretize space, we'll use a uniformly spaced two-dimensional mesh, shown in the left side of the figure (Irregular meshes are also possible, but are more difficult to parallelize).
[Figure: the uniformly spaced two-dimensional mesh (left) and the four nearest neighbors in the Manhattan directions, north, south, east, and west (right)]
At each time step, the simulator updates the voltage according to nearest neighboring positions in space and time. This is done first in space, and next in time. Nearest neighbors in space are defined on the 4 Manhattan directions only: north, south, east and west as shown on the right side of the above figure.
In general, the finer the mesh (and hence the more numerous the points) the more accurate the solution, but at an increased computational cost. In a similar fashion, the smaller the timestep, the more accurate the solution, but at a higher computational cost. Thus, if h is the mesh spacing and Δt is the time discretization, then the running time is proportional to 1/(Δt × h²). For example, halving both h and Δt increases the running time by roughly a factor of 8: 4× from the additional mesh points and 2× from the additional timesteps.
The simulator maintains two state variables characterizing the cell electrophysiology model: excitation (E) and recovery (R). Since we are using the method of finite differences to solve the problem, we discretize the variables E and R by considering the values only at a regularly spaced set of discrete points. We represent the variables as 2D arrays: E[][] and R[][], respectively. The PDE solver also maintains the voltage at the previous timestep, Eprev.
Here is the formula for solving the PDE, where the superscripts t and t-1 refer to the current and previous timestep, respectively, and the constant α is defined in the simulator (see apf.cpp and solve.cpp):
E^t[i,j] = E^(t-1)[i,j] + α*(E^(t-1)[i+1,j] + E^(t-1)[i-1,j] + E^(t-1)[i,j+1] + E^(t-1)[i,j-1] - 4*E^(t-1)[i,j])
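As a concrete illustration, one timestep of the PDE update might be written as the loop nest below. This is a sketch only: the function name, the array names E and E_prev, and the loop bounds (which follow the padded-mesh convention described below) are assumptions, not the exact code in solve.cpp.

    // Sketch of the PDE update for one timestep (not the exact code in solve.cpp).
    // E and E_prev are padded 2D arrays; the interior points run from 1 to n+1.
    void solvePDE(double **E, double **E_prev, double alpha, int n) {
        for (int i = 1; i <= n + 1; i++)
            for (int j = 1; j <= n + 1; j++)
                E[i][j] = E_prev[i][j] +
                          alpha * (E_prev[i + 1][j] + E_prev[i - 1][j] +
                                   E_prev[i][j + 1] + E_prev[i][j - 1] -
                                   4 * E_prev[i][j]);
    }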
Here is the formula for solving the ODE, where references to E and R correspond to whole arrays. Expressions involving whole arrays are pointwise operations on corresponding array elements, and the constants k, a, b, ε, μ1 and μ2 are defined in the simulator:
E = E - Δt*(k*E*(E-a)*(E-1) + E*R)
R = R + Δt*(ε + μ1*R/(E+μ2)) * (-R - k*E*(E-b-1))
We interpret a pointwise whole array operation as follows. If A, B and C are 2D arrays of size N × N, then A = B + C is equivalent to the following loop nest:
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        A[i][j] = B[i][j] + C[i][j];
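By the same convention, the ODE update above expands into explicit loops. The sketch below is for illustration only; the function name and the spellings of the constants (dt, k, a, b, eps, M1, M2) are assumptions, while the arithmetic matches the formulas given above, applied over the interior points of the padded mesh described next.

    // Sketch of the ODE update for one timestep, written as pointwise loops
    // (not the exact code in solve.cpp).
    void solveODE(double **E, double **R, double dt, double k, double a,
                  double b, double eps, double M1, double M2, int n) {
        for (int i = 1; i <= n + 1; i++)
            for (int j = 1; j <= n + 1; j++) {
                E[i][j] -= dt * (k * E[i][j] * (E[i][j] - a) * (E[i][j] - 1)
                                 + E[i][j] * R[i][j]);
                R[i][j] += dt * (eps + M1 * R[i][j] / (E[i][j] + M2))
                              * (-R[i][j] - k * E[i][j] * (E[i][j] - b - 1));
            }
    }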
There are a few bookkeeping details needed to help us organize our work. We divide each axis into N-1 subintervals using N points positioned uniformly along the line at positions x=i×h, i=0, 1, 2, ..., N-1. We also pad the boundaries with a cell on each side, in order to properly handle updates next to a boundary. (This padding will simplify the code and you'll see comments about it in solve.cpp.) This is shown below for one axis of the mesh drawn above, where the mesh spacing h = 1/N. The points labeled 'o' correspond to the physical boundary, and the points labeled 'P' are padding.
i=0    i=1    i=2                          i=N-1  i=N    i=N+1
P------o------|------| ...... |------|------|------o------P
       ^                                            ^
       |                                            |
     x=0.0   x=h    x=2h                x=1.0-h   x=1.0
Consider the case where we set N=64. We allocate an array of size 67 × 67, but we solve the problem on a central 65 × 65 submesh. Thus, our loops range from 1 to N+1, that is, 1 to 65. Points 0 and 66 correspond to buffer points.
It follows that when we subdivide the data to partition it for parallel execution, we are actually subdividing N+1 rather than N. OpenMP will handle this case, assigning slightly more work to some processors. So long as N >> P, the extra work won't be noticed.
Most (or all) of your changes will be made in two files: apf.cpp, which is the driver program, and solver.cpp, containing the solve() function. The solver is computationally intensive, so most of your effort will go into improving its running time through multithreading, and possibly some performance programming.
The organization of the code is described in a README file. Be sure to look over this file so you are familiar with the various code modules. To ensure that your program is correctly graded, do not change any of the following files, as doing so may cause your turnin to be graded incorrectly:
cmdLine.cpp Report.cpp Timer.cpp splot.c apf.h types.h
The simulator driver is found in apf.cpp, but the numerical computation is handled in solver.cpp. Command line arguments are parsed in cmdLine.cpp. (Additional arguments have also been provided to support other types of parallelization, but may be ignored in this assignment: the -x and -y arguments). We have also provided a script file, run-tests.sh, with the public tests that we will use to test your code.
To build your code, use the provided Makefile, as it sets up the appropriate flags for the Intel compiler via the "arch" file located in $(PUB)/Arch/arch.intel.omp. (There is also an arch file for the gnu compiler suite, arch.gnu.omp in case you need gcc, say, on your own machine). The Makefile produces a target called apf. Do not change the name of this target. You may add additional code files if you wish, but don't change the names of the files mentioned above.
The simulator includes a plotting capability (using gnuplot) which you can use to debug your code and also to observe the simulation dynamics. Do not enable plotting when you are measuring running times for your performance studies. The simulator has various options as follows.
./apf [-t <simulation time>] [-n < box size >] [-s] [-p <plot freq>]
The default parameters are described in cmdLine.cpp:
-t <float> | Duration of simulation |
-n <integer> | Number of mesh points in the x and y dimensions |
-s | Print statistics, L2 and L∞ norms of the excitation variable. This feature is useful when debugging. |
-p <integer> | Plot the excitation variable at regular intervals. This feature is also useful in debugging. |
Thus, the following command line will conduct the simulation on a 150 × 150 box, run to 400 units of simulated time, plot the excitation variable every 20 units of simulated time, and periodically print out summary values of the excitation variable (The code is set up to print out the statistics every 100 units of simulated time, and the frequency can be changed by appropriately setting the value of the STATS_FREQ variable as noted in apf.cpp).
./apf -n 150 -t 400 -s -p 20
If we leave off the -s and -p flags, then the output is as follows:
./apf -n 150 -t 400
[Simulation begins] Sun Jan 16 21:05:37 2011
Using Double precision arithmetic
dt= 0.0636729, T = 400
The code will run for approximately 6283 timesteps
m x n = 150 x 150
thread geometry: 1 x 1
[Simulation completes] Sun Jan 16 21:05:46 2011
End at time: 400, iteration 6283
Max norm: 9.41928e-01, L2norm: 6.11817e-01
Running Time: 9.00787 sec.
GFlop rate: 0.445
Parallelize the code with OpenMP, letting one of the threads handle the plotting. This introduces an obvious bottleneck, but we can live with it. In exchange you will have output that is useful when testing and debugging.
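One natural starting point is sketched below: parallelize the outer loop of each mesh sweep, which yields the 1D (row-wise) decomposition discussed next. The function and variable names are assumptions carried over from the earlier sketches, and further restructuring (for example, keeping a single parallel region open across the PDE and ODE sweeps, or across timesteps) will likely be needed for good performance.

    // Sketch: a row-wise OpenMP decomposition of one mesh sweep (illustration only).
    void solvePDE_omp(double **E, double **E_prev, double alpha, int n) {
        #pragma omp parallel for
        for (int i = 1; i <= n + 1; i++)
            for (int j = 1; j <= n + 1; j++)
                E[i][j] = E_prev[i][j] +
                          alpha * (E_prev[i + 1][j] + E_prev[i - 1][j] +
                                   E_prev[i][j + 1] + E_prev[i][j - 1] -
                                   4 * E_prev[i][j]);
    }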
OpenMP implementations generally support 1-dimensional partitionings only (though changes are coming). Since we aren't using many threads, we will live with the limitation of 1D decompositions. However, on Trestles, with its 32-core nodes, you may need to run with larger problems to avoid generating subproblems that are too thin.
You may assume that
n > 2 × the number of threads
Try to make the code go as fast as you can. For example, the solution arrays are represented as 2-dimensional arrays, but you may find that you can improve performance by converting them to 1-dimensional arrays and performing your own indexing.
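For example, a flattened layout might look like the sketch below; the helper name at() and the stride parameter are illustrative assumptions, not part of the provided code.

    // Sketch: a flat 1D array with manual 2D indexing.
    // 'stride' is the padded row length (e.g. n + 3 for the padded mesh above).
    inline double &at(double *A, int i, int j, int stride) {
        return A[i * stride + j];
    }

    // A reference such as E[i][j] then becomes at(E, i, j, stride), so the whole
    // mesh lives in one contiguous allocation, which may make it easier for the
    // compiler to vectorize the sweeps and for you to reason about cache behavior.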
In the spirit of friendly competition, post your performance results to the A1 Moodle forum.
In addition to submitting code, you must also provide a short report describing your experience with performance programming, and the following scaling study.
Be sure to discuss what you did to the code to improve performance, that is, what performance programming you carried out. If you improved performance over a series of steps, describe each step and the resultant performance. If you like, summarize the information in a list or a table. If necessary, provide pseudo code to illustrate the concepts.
If you are working in a group of 3, you will also need to run on at least 1 machine besides your designated machine, and explain your observations. Note that if your "home machine" is Lilliput, you can run on any of the cseclass machines. If it is cseclass 01 or 02, you should run on any of cseclass 03-07 or on Lilliput, and so on. You may also run on another machine if you like, but it must have at least 8 cores.
When reporting performance results as a group of 3, it is important to identify the hardware and software testbeds used. If you are running on a machine other than what is provided in the course, you must also run a suite of commands to identify the hardware and the environment you used to develop and run your code.
Place all output from these commands in a file called config.out, e.g. by using the script command. Include this in your turnin. If you can find the hardware characteristics of your processor, i.e., the clock speed, TLB, L1, L2, and (if present) L3 cache, list this information in the report, and cite the source, e.g. a URL or a file. We have provided instructions for how to explore your processor's hardware here.
We will grade your assignment on the machine you have been assigned. Be sure to do your final testing on that machine, to avoid any surprises at turnin time.
Your code will be tested for correctness and performance. When grading your assignment, we will test your code against one public input (in the script file run-tests.sh) plus other inputs that will not be made public. Be sure to test against various combinations of the -n and -t command line parameters. For the purposes of this assignment, there is no need for your program to run for more than 15 or 20 seconds. Sticking to this practice will decrease the likelihood of job interference, which will slow you down, and affect your timing results.
Test your code by comparing against the provided implementation. You can catch obvious errors using the graphical output, but a more precise test is to compare two summary quantities reported by the simulator: the L2 norm and the maximum value (magnitude) of the excitation variable. The former is obtained by summing the squares of all solution values on the mesh, dividing by the number of points in the mesh (called normalization), and taking the square root of the result. For the latter measure, we take the "absolute max," that is, the maximum of the absolute value of the solution, taken over all points. While there may be slight variations in the last digits of the reported quantities, variations in the first few digits indicate an error.
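In code, the two summary quantities might be computed as in the sketch below. The simulator already reports them; the function and variable names here are assumptions for illustration.

    #include <algorithm>
    #include <cmath>

    // Sketch: L2 norm and max (infinity) norm of the excitation variable,
    // computed over the (n+1) x (n+1) interior mesh points.
    void computeNorms(double **E, int n, double &l2, double &maxNorm) {
        double sumSq = 0.0;
        maxNorm = 0.0;
        for (int i = 1; i <= n + 1; i++)
            for (int j = 1; j <= n + 1; j++) {
                sumSq += E[i][j] * E[i][j];
                maxNorm = std::max(maxNorm, std::fabs(E[i][j]));
            }
        double points = (double)(n + 1) * (n + 1);
        l2 = std::sqrt(sumSq / points);   // normalize by the number of mesh points
    }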
To test your code for correctness, vary the number of threads and verify that the results are independent of the number of threads. To validate correctness, compare your output against the reference values below. In particular, for the command line options
-n 50 -t 1000
your output should contain the following two consecutive lines:
End at time: 1000, iteration 11495
Max: 9.46128e-01, L2norm: 6.15442e-01
You are also required to speed up the serial performance of the application. Try larger values of the n parameter. When you do this you'll need to adjust the -t parameter downward; the simulation will take much longer to complete, not only because it is doing more work on a larger mesh, but also because it uses a smaller discrete timestep Δt to ensure accuracy. Thus, your simulator is performing a larger number of mesh sweeps on larger meshes. Do you notice any change in scaling as you increase n?
Be sure that, while collecting performance data, you have not enabled plotting (-p) or statistics reporting (-s).
To facilitate testing, the simulator outputs some essential information into a file called Log.txt: the number of threads, their geometry, the running time, and the L2 and L∞ norms. We will use this file in autograding so do not change the file Report.cpp as doing so could cause your assignment to be graded incorrectly. Any previous copy of Log.txt will be overwritten, so if you are doing automated testing, be sure to rename the files between invocations.
Use the provided timer to take timings. Run the program at least 3 times -- for each input size or any other variable that affects the running time -- until you see a consistent timing emerge, that is, where a majority of the times agree to within 10%. If you do not observe a consensus, try doubling the number of measurements. Exclude outlying timings (but note how large they are). Report the minimal running time rather than the average, as discussed in class.
Timings will be affected by contention from other users, so if you are still unable to get consistent measurements, try collecting measurements at another time of day. If the system is heavily loaded, as on the day the assignment is due, you may observe performance anomalies, including non-linear speedups.
We have provided a high resolution timer based on gettimeofday(). This timer, which resides in Timer.cpp, allows you to time individual threads, and you can see how it is used in the provided serial code. Recall that the running time of the application is determined by the last core to finish, so be sure to collect timings from the master thread, outside of any OpenMP parallel region, and run for enough iterations so that you amortize the cost of measuring thread startup time.
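The measurement pattern is sketched below using gettimeofday() directly; the provided Timer.cpp wraps the same system call, and the helper name wallTime() is an assumption for illustration only.

    #include <stdio.h>
    #include <sys/time.h>

    // Sketch: wall-clock timing taken on the master thread, outside any
    // OpenMP parallel region (not the interface of the provided Timer.cpp).
    static double wallTime() {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    // Usage (in the driver, around the full sequence of timesteps):
    //     double t0 = wallTime();
    //     solve(...);                       // all timesteps
    //     double t1 = wallTime();
    //     printf("Running Time: %g sec.\n", t1 - t0);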
As mentioned previously, the code comes with a plotting capability to observe the simulation dynamics, which can be helpful when debugging. The plotter (splot.c) interfaces with gnuplot, which is freeware. Note that two entry points are provided, one for the provided multidimensional array implementation, and another to support flattened arrays.
Remote visual display can be slow, so use a frequency of at least 20 (-p 20). The summary statistics may be more useful than plotting in some cases. If a display window does not appear on your screen, then your ssh display settings are not set correctly. If you are connecting to the remote machine from a Linux platform using ssh, use the -X option:
% ssh -X yourlogin@cseclass06.ucsd.edu
If you are connecting from a Windows platform, and are using a 'secure shell client' AND 'exceed' (or xwin32, or cygwin) to log into a remote machine, you must ensure that 'X11 tunneling' is set. To do this, you must have installed Cygwin/X with X11.
The simulations will produce a spiral pattern that evolves with time. However, if you set the simulation time parameter (-t) to a sufficiently large value, the organized spiral pattern will break up. This indicates non-physical behavior, and the length of time that the simulator produces physically meaningful results depends on the mesh size n.
Turn in your source code and your lab report electronically no later than the deadline at 9pm on the day the assignment is due. We will provide instructions about the electronic turnin procedure via Moodle next week. Also turn in a hard copy of your report no later than 9am the next morning. You may leave your report in Professor Baden's mailbox on the 2nd floor of EBU3B. To avoid lateness penalties, don't miss these two deadlines.
A 1-2 page Lab report will be sufficient. Be sure to do the scaling study requested above, which varies according to which machine you were assigned. In addition to the problem instance described above, I strongly suggest you do a scaling study for a problem that does not fit into L3 cache.
As mentioned in class, Lilliput has a 4MB L3 cache. The other Intel boxes (cseclass 03-06) have an 8MB cache. The AMD boxes (cseclass 01-02), with Phenom II X6 1090T processors, have a 6MB L3 cache (each core also has a 512KB L2 cache, for a total of 3MB). Here is a comparison of different Phenom processors, and here is a summary of basic architectural features. You can find more information at the AMD Developer site.
Your turnin must also include three important documents.
These documents are mandatory, as the turnin process will not complete without them.
Copies of these files are available in the TURNIN directory,
$PUB/turnin.
In addition, include a file called HOST that identifies your assigned machine; you can create it with:

uname -n > HOST
Empty forms, or forms not properly filled out, will result in a 10% penalty being assessed, in addition to holding up the grading process.
A note about the turnin process: since the turnin script will try to compile your project before letting you submit your assignment, it's important that you turn in all the files you have, including a Makefile. However, when grading your assignment, we will substitute all the files that you are not allowed to modify (as noted in the list above) with the original copies of the versions we provided you. This is why it is important that you do not change those files in your own turnin, as any changes to them will be lost. The turnin script will have an option to do some basic testing, and we strongly suggest you avail yourself of this option.
Don't wait until the last minute to attempt a turnin for the first time. The turnin program will ask you various questions, and it will check for the presence of certain required files, as well as your source code. It will also build the source code if you like, to catch any obvious build errors. Try out the turnin process early on so you are sure that you are familiar with the process, and that all the required additional files are present. The turnin procedure will be available about a week before the deadline, likely on Wednesday Jan 18th.
Copyright © 2012 Scott B. Baden [Fri Jan 20 08:07:21 PST 2012]