Date | Description
---|---
13-Jan-12 | Original posting
25-Jan-12 | Added Forge documentation and "getting started"
5-Feb-12 | Added MPI documentation and "getting started"
Various hardware platforms are available in the course. The following resources are located in the CSE Department.
hostname | CPU | GPU | GPU Compute Capability
---|---|---|---
lilliput | 4-core Intel Xeon E5504 | C1060 x 4 | 1.3 |
cseclass01-02 | 6-core AMD Phenom II X6 1090T | GTX 570 (Fermi) | 2.0 |
cseclass03-07 | 4-core 8-thread Intel Core i7 950 | GTX 460 (Fermi) | 2.1 |
Lilliput is a server with a quad-core Intel Nehalem processor (E5504, "Gainestown") running at 2.0 GHz, 16 GB of memory, and 4 NVIDIA C1060 Tesla GPUs. Here is documentation about Lilliput. Lilliput runs Ubuntu Linux, and it is possible to program in MPI, pthreads, OpenMP, CUDA, or some combination of these models. We have installed some basic documentation at
http://cseweb.ucsd.edu/groups/hpcl/scg/guide.html
The cseclass machines come in two types: 01-02 are AMD-based, while 03-07 are Intel-based. Both have Fermi GPUs, but of different types. In particular, 01-02 have GTX 570s with 480 cores configured as 15 multiprocessors of 32 cores each, whereas the GTX 460s in 03-07 have 336 cores configured as 7 multiprocessors of 48 cores each.
The XSEDE resources include Trestles and Forge. Trestles is a supercomputer located at the San Diego Supercomputer Center. It has 324 compute nodes each containing 32 cores packaged as four 8-core sockets. This machine enables us to explore multicore and MPI scaling to a higher degree than Lilliput or the cseclass machines.
Forge is a GPU cluster located at NCSA, and we will use it to conduct performance studies for our GPU codes. This machine enables us to gain exclusive access to hardware resources and should be used to conduct performance studies. For this reason, Lilliput and the cseclass machines should be used for code development.
On Lilliput (as well as most other Linux systems, Trestles excepted) the lscpu command reports the processor's memory hierarchy. Note that for the per-core caches (such as L1 and L2) you need to multiply the reported cache size by the number of cores to obtain the total cache capacity at a given level. The output doesn't tell you which caches are private to each core and which are shared, so you'll need to dig deeper to get this information (see below).
In addition to the lscpu command, there is also a file called /proc/cpuinfo, which provides a variety of other information. For example, here is part of Lilliput's cpuinfo file:
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
stepping : 5
cpu MHz : 1600.000
The contents of cpuinfo enable you to identify the specific processor, and the identifying information may be used to obtain other details from on-line resources. This information confirms that Lilliput has a quad-core 2.0 GHz Intel Xeon E5504 processor. There is additional information in cpuinfo: family, model, and stepping. In this case the family is 6, the model is 26, and the stepping is 5. This allows you to track the smaller-scale changes to the processor's design that occurred over time.
We may learn more by accessing Intel's ARK database. Once at the ARK web site, search on "E5504" in the search window at the top of the page. You'll obtain various information; for example, the processor has 4 cores and a 4 MB L3 cache shared among all cores. (The 2.0 GHz clock speed matches that listed in the cpuinfo file.) The L1 and L2 cache sizes, not listed, are 32KB (per core; separate instruction and data caches) and 256KB (per core; unified), respectively.
Unfortunately, Intel dropped its handy processor finder web site (CPU World maintains similar functionality if you know the S-Number). However, you may obtain more detailed information about the processor in another way: there is a set of files providing detailed characteristics of all levels of cache (though there are many files):
/sys/devices/system/cpu/cpu*/cache/index*/*
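If you'd rather collect this information programmatically, here is a small sketch (assuming the usual Linux sysfs layout; not part of the course materials) that prints the level, type, size, and sharing list of each of cpu0's caches. The shared_cpu_list field answers the question of which caches are private to a core and which are shared.

```
#include <stdio.h>
#include <string.h>

/* Sketch: dump the level, type, size, and sharing of each of cpu0's caches
   by reading the /sys files listed above. Change "cpu0" in the path to
   inspect another core. */
int main(void)
{
    const char *fields[] = { "level", "type", "size", "shared_cpu_list" };
    char path[256], buf[64];
    FILE *fp;
    int idx, f;

    for (idx = 0; idx < 8; idx++) {              /* index0, index1, ...   */
        for (f = 0; f < 4; f++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cache/index%d/%s",
                     idx, fields[f]);
            if ((fp = fopen(path, "r")) == NULL) {
                if (f == 0) return 0;            /* no more cache levels  */
                continue;
            }
            if (fgets(buf, sizeof buf, fp)) {
                buf[strcspn(buf, "\n")] = '\0';
                printf("index%d %-16s %s\n", idx, fields[f], buf);
            }
            fclose(fp);
        }
    }
    return 0;
}
```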
To learn more about the Nehalem processor, consult one of the following sources:
More generally, the definitive source of information about Intel processors is the Intel 64 and IA-32 Architectures Optimization Reference Manual (June 2011), where you can find detailed information about your processor.
This document is accessible from another page that Intel maintains with pointers to several documents, the Intel 64 and IA-32 Architectures Software Developer Manuals.
Intel also publishes a Quick Reference Guide for the many available compiler options.
As mentioned in class, Lilliput has a 4MB L3 cache. The other Intel boxes (cseclass03-07) have an 8 MB L3 cache. The AMD boxes (cseclass01-02), with Phenom II X6 1090T processors, have a 6MB L3. (Each core also has a 512KB L2 cache, for a total of 3MB.) Here is a comparison of different Phenom processors, and here is a summary of basic architectural features. You can find more information at the AMD Developer site.
To access XSEDE resources, generate a public/private key pair on each machine you'll use to access those resources, using these instructions.
All the CSE machines used in this course share the same file system, which is served by Lilliput. To get started, log in to any of the CSE systems and modify your .bashrc and .bash_profile files: append the contents of the files BASHRC and BASH_PROFILE, located in /class/public/cse260-wi12, to the .bashrc and .bash_profile in your home directory. Then log out and log back in.
You will now have environment variables set up for the Intel compilers, and certain handy commands at your disposal, such as setting the number of OMP threads. You'll also have a pre-defined environment variable, called PUB:
/class/public/cse260-wi12
We have installed the Intel C and C++ compilers, icc and icpc, respectively. These are located in /opt/intel/bin, along with the Intel Debugger, idb. To get help, use the -help option or the man pages.
We have installed Intel's C++ Composer XE 2011, which includes the compilers and a number of associated libraries and tools.
There is an extensive set of Intel documentation describing these tools, accessible from the CSE machines or via the Internet.
Consult this web page.
You've been provided with various example codes in $PUB/Examples. (This directory structure is set up on all of the course hardware testbeds.) You'll notice that the Makefile provided with each example includes an "arch" file of the form $(PUB)/Arch/arch.* (or arch.*.*). This arch file contains various settings needed to compile and build executables, and generally it should not be changed. Avoid making a private copy unless, in rare cases, you need to patch an arch file. (Many of the Makefiles in the provided codes use conditional statements that enable the Makefile to be portable across the platforms identified in the course.) Arch files for serial, OpenMP, MPI, and CUDA (for machines with Nvidia GPUs) are provided in $(PUB)/Arch. The CUDA compiler is called nvcc, and your path should already be set up to find it.
A test code, incrArray, resides in
$PUB/Examples/CUDA/incrArr
To establish that your environment has been set up correctly, make a copy of the incrArray program, found in $(PUB)/Examples/CUDA. Compile and run the program in your home directory. This program increments an array. It shows you the basics of CUDA, such as how to manage storage on the device (the GPU) and how to take timings. It also provides some utilities in the source file utils.cu. One of the utilities, selectAndReport(), chooses a GPU device and reports its characteristics. Be sure to make one call to selectAndReport() at the beginning of your program, as shown in incrArray. When we set up the system to provide dedicated GPU access, we will give you a new version of this routine to enable you to run under the new hardware setup.
The output of incrArr includes information about the hardware and should look like this:
Trips: 10
4 Devices
Device # 0 has 30 cores
Device # 1 has 30 cores
Device # 2 has 30 cores
Device # 3 has 30 cores
Choosing device 0
Device 0 has 30 cores
Device is a Tesla C1060, capability: 1.3
CUDA Driver version: 3020, runtime version: 3020
N = 8388480, block size = 128
Success!
Host times: Total (0.091742), Compute (0.076029)
Device times: Total (0.079212), Compute (0.008758)
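For orientation, here is a stripped-down sketch of the pattern incrArray follows: allocate an array on the host and on the device, launch a kernel that increments each element, synchronize, and copy the result back. It is illustrative only and is not the course's incrArray source; the kernel and variable names are made up, and the course utilities (e.g. selectAndReport()) are omitted.

```
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Sketch of the incrArray pattern (not the course source). */
__global__ void incr(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += 1.0f;                       /* one element per thread */
}

int main(void)
{
    const int N = 1 << 20, blockSize = 128;
    size_t bytes = N * sizeof(float);
    int i, nBlocks;

    float *h_a = (float *) malloc(bytes);   /* host storage */
    for (i = 0; i < N; i++) h_a[i] = (float) i;

    float *d_a;                             /* device storage */
    cudaMalloc((void **) &d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    nBlocks = (N + blockSize - 1) / blockSize;
    incr<<<nBlocks, blockSize>>>(d_a, N);   /* launch the kernel */
    cudaDeviceSynchronize();                /* wait, so host-side timings are meaningful */

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    printf("h_a[0] = %g (expect 1)\n", h_a[0]);

    cudaFree(d_a);
    free(h_a);
    return 0;
}
```

Compile a file like this with nvcc; incrArray adds timing calls around the kernel launch and the memory copies, which is where the "Host times" and "Device times" lines above come from.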
Lilliput and the cseclass machines are running version 4.0 of the CUDA SDK.
For documentation, consult these resources:
Of note:
Be sure to use the documentation for the Linux port. The web site provides (free) downloads for the drivers, toolkit, and SDK, together with downloadable applications.
There isn't a lot published specifically about Fermi. The CUDA C Programming Guide, version 4.0, describes many of the details. (The documentation is here, as described above.) Here is an article that gives an overview: Nvidia's 'Fermi' GPU architecture revealed, Scott Wasson, techreport.com, Sept 2009.
These NVIDIA documents are also useful.
A nice feature of CUDA is that it is free, so you may install it on your own machine and develop your GPU code off-line. Here are Fred Lionetti's notes (somewhat outdated) on how to install CUDA under MacOS or Ubuntu. The definitive source of information and software is the Developer Zone, mentioned above.
To test out the environment, first run some simple OpenMP test codes. From within your home directory, run the following commands
cp -p -r $PUB/Examples/OpenMP .
cd OpenMP
make all
There are several programs, and you can run them from the script runinter.sh. But let's start with the hello world program. Enter the directory containing the source code, set the number of OMP threads to 4 using the predefined bash shell function 'mp', and run omp_hello:
cd OpenMP
mp 4
./omp_hello
You will see output similar to this. Examine the source code. Why does the "Number of threads" message come after the "Hello World" message? Do the threads always report in the same numerical order? What is the cause?
Hello World from thread = 0
Number of threads = 4
Hello World from thread = 2
Hello World from thread = 3
Hello World from thread = 1
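For reference, here is a minimal sketch of an OpenMP hello-world in the spirit of omp_hello; it is not the course's source, and the details may differ.

```
#include <stdio.h>
#include <omp.h>

/* Minimal OpenMP hello-world sketch (not the course's omp_hello source). */
int main(void)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();            /* this thread's ID */
        printf("Hello World from thread = %d\n", tid);

        if (tid == 0)                              /* only one thread reports the team size */
            printf("Number of threads = %d\n", omp_get_num_threads());
    }
    return 0;
}
```

Compile with the -openmp option (icc) or -fopenmp (gcc), and compare this structure with the course's source when answering the questions above.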
When you run a program compiled with the OpenMP option, it runs with a default number of threads equal to the number of hardware threads provided by the resource you are using. Thus cseclass01 and 02, with hex-core AMD processors, run with 6 OpenMP threads by default, while cseclass03-07, whose Intel i7s provide 4 cores and 8 hardware threads, run with 8 OpenMP threads by default. Lilliput runs with 4 threads. The one exception to the default rule is Trestles; while the machine has 32-core nodes, the default is 16 threads.
To set the number of threads, assign the OMP_NUM_THREADS environment variable to the desired value. In the bash shell (our default shell) use the following command.
export OMP_NUM_THREADS=3
We have also set up a bash shell function for you, mp, as described above.
Log in to trestles.sdsc.edu and then run
cd /home/baden/cse260-wi12/Profiles
Copy the files BASHRC and BASH_PROFILE to the end of the .bashrc and .bash_profile files you found in your home directory when you logged in. Once you've copied these files into your own profiles, you'll have some handy commands at your disposal, such as setting the number of OMP threads, or accessing the interactive queue.
We'll next test out our environment by running a simple OpenMP program. Log out and log in. If you want to test your MPI installation, jump here instead.
From within your home directory, run the following commands
cp -p -r $PUB/Examples/OpenMP .
cd OpenMP
make omp_hello
qsubI 4
Entering this command will allocate 4 processors on an interactive node, which is shared with other users. Our account will be charged for the cores used, so allocate only the number of cores you need. (This is not true in batch mode, where you get dedicated access to all 32 cores of a node on the normal queue.) You'll see the following output:
qsub: waiting for job 22955.trestles-fe1.sdsc.edu to start
qsub: job 22955.trestles-fe1.sdsc.edu ready
Wait until you receive a command prompt. You will now be in a new shell, within your $HOME directory. Enter the directory containing the source code, set the number of OMP threads to 4, and run omp_hello:
cd OpenMP
mp 4
./omp_hello
You will see output similar to this. Examine the source code. Why does the "Number of threads" message come after the "Hello World" message? Do the threads always report in the same numerical order? What is the cause?
Hello World from thread = 0
Number of threads = 4
Hello World from thread = 2
Hello World from thread = 3
Hello World from thread = 1
When an OpenMP thread waits at a synchronization point (say, a barrier), the default behavior is to busy-wait. That can lead to a form of livelock: the threads already at the barrier occupy the available CPUs, consuming processor cycles while they wait for the others to arrive, while the threads that have not yet reached the barrier are competing for those same computing resources to finish their work. The thread scheduler will eventually force the spinning threads to yield, but on long enough time scales that the program will crawl along very slowly.
If you run into difficulties with long running times on OpenMP programs, especially on larger numbers of cores, set the PASSIVE mode wait policy:
export OMP_WAIT_POLICY=PASSIVE
For documentation and other information about Forge, consult the Forge web site.
Unlike the other platforms, Forge uses the tcsh shell. Shell syntax and the profile files are handled differently.
Log in to forge.ncsa.illinois.edu and then run the following commands.
cd /uf/ac/baden/cse260-wi12/Profiles
cp -p ALIASES $HOME/.aliases
cp -p TCSHRC $HOME/.tcshrc
cp -p TCSHRC_LOCAL $HOME/.tcshrc_local
Your home directory should not have any of the files you are generating with the above commands. If it does, you'll be asked if you want to overwrite the file. Do not overwrite the files, but cancel out of the dialog. Report this to the Forge Moodle web board.
Once you've copied the files, log out and log in. $PUB will be defined as /uf/ac/baden/cse260-wi12. You'll have certain handy commands at your disposal, which we'll describe soon.
To establish that your environment has been set up correctly, make a copy of the incrArray program, found in $(PUB)/Examples/CUDA. Compile and run the program in your home directory using these instructions.
You may run jobs interactively from the Forge front end. You may also run GPU jobs interactively via the interactive queue; request an interactive session via the "inter" alias set up in your .aliases file. The procedure for using an interactive node is more or less as described for Trestles, and will not be discussed here. If you want to submit a batch job, use the qsub command, as on Trestles. A sample batch file called "run-forge.sh" has been provided with the incrArr application.
To collect timings you should always submit a non-interactive batch job. All such jobs have dedicated access to the node(s) allocated. The debug queue (maximum wall clock time of 30 mins) is set up with a higher priority and a couple of nodes to facilitate quick turnaround for debugging. Our jobs are short so we can afford the higher priority.
I have installed a copy of the NVIDIA SDK, which includes many CUDA applications. Of note is the deviceQuery app, which shows how to report the various attributes of the hardware. Note that these codes use a different convention than I do for building code, so you won't be able to use the arch file convention that we use in the course.
The SDK is in
~baden/NVIDIA_GPU_Computing_SDK
The pre-compiled binaries are in the subdirectory C/bin/linux/release and the sources are in C/src. If you are interested in seeing the details of how the low-level device driver API works, look at the vectorAddDrv app and compare it against vectorAdd, which doesn't use the lower-level API. The driver version explicitly packs kernel arguments and sends them over to the device, and so on: details that are hidden if you don't rely on the device driver interface.
The Intel compiler (and GNU's) will generate SSE code to vectorize inner loops. To see whether the compiler is vectorizing those time-consuming inner loops, examine the output of the compiler. The output also tells you whether OpenMP is parallelizing the outer loop (why do we associate vectorization with the inner loop and multithreading with the outer loop?). The following two command line options generate this information for you:
-openmp-report2 -vec-report2
icc -O2 -openmp -openmp-report2 -vec-report2 -DFLOAT -c solve.c
solve.c(62): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
solve.c(78): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
solve.c(49): (col. 9) remark: loop was not vectorized: vectorization possible but seems inefficient.
solve.c(55): (col. 9) remark: LOOP WAS VECTORIZED.
solve.c(63): (col. 9) remark: loop was not vectorized: not inner loop.
solve.c(65): (col. 13) remark: LOOP WAS VECTORIZED.
solve.c(79): (col. 9) remark: loop was not vectorized: not inner loop.
solve.c(82): (col. 13) remark: LOOP WAS VECTORIZED.
solve.c(29): (col. 5) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
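To make the distinction concrete, here is a small illustrative loop nest (not the course's solve.c; the function and file names are made up) of the kind these reports refer to: OpenMP divides the outer loop across cores, and the compiler vectorizes the unit-stride inner loop.

```
/* Illustrative loop nest (not the course's solve.c). Compile with something
   like: icc -O2 -openmp -std=c99 -openmp-report2 -vec-report2 -c saxpy2d.c */
void saxpy2d(int n, float *restrict a, const float *restrict b, float c)
{
    #pragma omp parallel for            /* outer loop: rows divided among threads   */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)     /* inner loop: unit stride, an SSE candidate */
            a[i*n + j] += c * b[i*n + j];
}
```

The restrict qualifiers tell the compiler the arrays don't alias, which is often what makes the difference between "LOOP WAS VECTORIZED" and a "vectorization possible but seems inefficient" remark.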
To read more about vectorization, consult the following resources.
It is also possible to generate SSE code with processor-specific optimizations.
Some sample MPI applications have been installed in $(PUB)/Examples/MPI, and an "arch" file arch.intel.mpi has been installed in $(PUB)/Arch.
The Makefiles for all the MPI codes used in the course use the MPI version of the "arch" file, as we will use special versions of the compiler commands set up for MPI (these commands are actually wrappers around a back-end compiler).
The MPI compilers (e.g. mpicc) are wrappers around a back-end compiler (e.g. icc). On Lilliput (and the cseclass machines) the back-end compilers are the GNU compilers by default. To use the Intel compilers, and to launch the binaries they build, be sure to add the following to your .bashrc file in $HOME on Lilliput (this is not necessary on Trestles, but see these instructions for setting up your environment on Trestles):
# Set the Intel compilers as the back end for mpicc and mpicxx
export OMPI_CC=icc
export OMPI_CXX=icpc
# MPI
export PATH="/opt/mpi/openmpi/bin:$PATH"
export LD_LIBRARY_PATH="/opt/mpi/openmpi/lib:$LD_LIBRARY_PATH"
To launch an MPI program, use the mpirun command on the CSE machines:
mpirun -np 4 ring -t 10000 -s 16
The mpirun command provides the -np flag to specify the number of processes. The value of this parameter should not exceed the number of available physical cores; otherwise the program will run, but very slowly, as it is multiplexing resources. The next parameter (ring) is the name of the executable. Any command line arguments come in the usual position, after the name of the executable.
To establish that your environment has been set up correctly, copy, compile, and run the parallel "hello world" program. This program prints "Hello World" from each process along with the process ID. It also reports the total number of processes in the run.
The hello world program is found in $(PUB)/Examples/Basic/hello. To run the program use mpirun as follows:
mpirun -np 2 hello
Here is some sample output:
# processes: 2
Hello world from node 0
Hello world from node 1
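For reference, here is a minimal sketch of an MPI hello-world in the spirit of the provided program; the course source may differ in its details.

```
#include <stdio.h>
#include <mpi.h>

/* Minimal MPI hello-world sketch (not necessarily identical to the course code). */
int main(int argc, char *argv[])
{
    int rank, nproc;

    MPI_Init(&argc, &argv);                    /* enter the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);     /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's rank       */

    if (rank == 0)
        printf("# processes: %d\n", nproc);
    printf("Hello world from node %d\n", rank);

    MPI_Finalize();                            /* leave the MPI environment */
    return 0;
}
```

Compile it with mpicc and launch it with mpirun -np 2, as shown above.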
Be sure that your environment is set up properly on Trestles, as described here. To launch an MPI program on Trestles, use the mpirun_rsh command; here ring is the binary, and its own command line arguments follow it on the command line:
mpirun_rsh -np 4 -hostfile $PBS_NODEFILE ring -s 8192
The general form is as follows, where anything not shown within angle brackets must be typed exactly as shown:
mpirun_rsh -np <# processors> -hostfile $PBS_NODEFILE <program> <program arguments>
Be sure to run the Basic example codes described above to ensure that your environment is set up correctly. A batch submission file for Trestles, run-trestles.sh, has also been set up in each directory.
To take timings, use MPI's timer function MPI_Wtime(), which measures wall clock time (like a stopwatch). On Lilliput, your timings will reflect contention from other jobs. On Trestles, you'll have dedicated access to processing nodes, but not to the switch. Thus, to avoid the effects of switch contention on Trestles, we'll use a technique to filter out the variations. Run the program at least 3 times (for each input size, number of cores, or any other variable that affects the running time) until you see a consistent timing emerge, such that a majority agree within 10%. If you observe a lot more variation, try doubling the number of runs to see if you can obtain a better consensus. Exclude outlying timings, but note them in your report. Report the minimal running time rather than the average, as previously discussed in class.
To measure time, take the difference in times sampled by two successive calls to MPI_Wtime():
double t0 = MPI_Wtime();
// some code....
double t1 = MPI_Wtime();
double time = (t1-t0);
MPI_Wtime() reports time in seconds, so there is no need to scale the value returned by the timer. (You can query the resolution of MPI_Wtime() with MPI_Wtick().) The man page for MPI_Wtime() is found on-line.
Recall that the running time of the application is that of the last process to finish. Don't time individual iterations; rather, time an entire run, excluding initialization, the calls to MPI_Init() and MPI_Finalize(), and any other activities that are not representative of a run with many iterations.
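Here is a minimal sketch of that timing pattern: synchronize, time the whole run, and report the time of the slowest rank. The arithmetic loop is only a stand-in for real work, not code from the course.

```
#include <stdio.h>
#include <mpi.h>

/* Sketch: time a whole run (excluding MPI_Init/MPI_Finalize) and report the
   time of the slowest rank, since the run is only as fast as the last
   process to finish. */
int main(int argc, char *argv[])
{
    int rank;
    double t0, tLocal, tMax, s = 0.0;
    long i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);              /* start all ranks together */
    t0 = MPI_Wtime();

    for (i = 0; i < 100000000L; i++)          /* stand-in for the real computation */
        s += 1e-9 * (double) i;

    tLocal = MPI_Wtime() - t0;
    MPI_Reduce(&tLocal, &tMax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Wall clock time: %f s (s = %g)\n", tMax, s);

    MPI_Finalize();
    return 0;
}
```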
It can be difficult to measure communication times accurately, owing to the short time durations involved. Therefore, we will use an indirect measurement technique: disable communication and subtract the resulting running time from that of a run that includes communication. To this end, you'll need to add an option to your program that disables communication calls such as MPI_Send: implement the -k command line option.
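A minimal sketch of the idea appears below; the flag and wrapper names (noComm, sendBlock) are hypothetical, and your own code will likely guard its communication calls in place rather than through a wrapper.

```
#include <mpi.h>

/* Sketch: guard communication with a flag so the -k option can disable it. */
static int noComm = 0;        /* set to 1 while parsing argv if -k is present */

void sendBlock(double *buf, int n, int dest, int tag)
{
    if (noComm)               /* -k given: skip the communication entirely */
        return;
    MPI_Send(buf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
}
```

The matching receives and waits need the same guard, or a run with -k will hang; the difference between the timings with and without -k then estimates the communication cost.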
While the PPF package to be discussed in class can help sort out some I/O problems, it can only synchronize a single line of output. When you have multiple lines, and you are using PPF_Print() or just printf(), you will find that fflush() does not sort out the problem of garbled I/O.
The only sure way of getting I/O to stdout (or stderr) to synchronize properly is to use sprintf() to generate string output and then collect the strings with MPI_Gatherv() at the root. In other words, only one process should actually perform the I/O.
There is an example of how to manage I/O in the Ring example code:
$PUB/Examples/MPI/Ring/getHost.C
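The Ring example above shows the course's own approach; here is a minimal, hedged sketch of the same sprintf-plus-MPI_Gatherv pattern, with made-up variable names.

```
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch: each rank formats its output into a string; rank 0 gathers the
   strings with MPI_Gatherv and is the only rank that prints. */
int main(int argc, char *argv[])
{
    int rank, nproc, len, total = 0, i;
    int *lens = NULL, *displs = NULL;
    char msg[128], *all = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* include the trailing '\0' so the root can print each string directly */
    len = 1 + snprintf(msg, sizeof msg, "rank %d reporting\n", rank);

    if (rank == 0) {
        lens   = (int *) malloc(nproc * sizeof(int));
        displs = (int *) malloc(nproc * sizeof(int));
    }
    MPI_Gather(&len, 1, MPI_INT, lens, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < nproc; i++) { displs[i] = total; total += lens[i]; }
        all = (char *) malloc(total);
    }
    MPI_Gatherv(msg, len, MPI_CHAR, all, lens, displs, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {                          /* only the root performs the I/O */
        for (i = 0; i < nproc; i++)
            fputs(all + displs[i], stdout);
        free(lens); free(displs); free(all);
    }
    MPI_Finalize();
    return 0;
}
```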
There is a nice tutorial with some worked examples. Man pages for the MPI calls may be accessed with the man command, and they are also available on-line. More extensive documentation can be found at http://www-cse.ucsd.edu/users/baden/Doc/mpi.html. The on-line man pages for the MPI calls used in this assignment appear below.
MPI_Init | Initializes the MPI environment. Call this prior to making any MPI calls. |
MPI_Finalize | Exits the MPI environment. Call this function before exiting your program. |
MPI_Comm_size | Queries the total number of processes used in this program invocation. |
MPI_Comm_rank | Queries this process's unique identifier, an integer called the rank. |
MPI_Send | Sends a message to another process; a return signifies that the message buffer may be reused. |
MPI_Recv | Receives a message from another process; a return signifies that the message has been received. |
MPI_Irecv | Immediate form of receive. A return does not signify that the message has been received. To receive this assurance, you need to call MPI_Wait(). |
MPI_Isend | Non-blocking variant of send. A return does not signify that the message buffer may be reused. To receive this assurance, you need to call MPI_Wait(). |
MPI_Wait | A return indicates that a previously posted Irecv (or Isend) has completed. |
MPI_Wtime | Wall clock timer. |
MPI_Gather | Gathers data from all processes onto the designated root process. All processes contribute the same amount of data. A collective routine, which all processes in the communicator must call. This may be useful in debugging. |
MPI_Barrier | Blocks the calling process until all processes have checked in. A collective routine, which all processes in the communicator must call. |