Lincoln is a GPU cluster located at the NCSA. The machine consists of a collection of Dell PowerEdge 1950 servers connected with InfiniBand SDR. Each server contains 8 cores constructed from dual-socket, quad-core Intel 64 (Clovertown, E5345) 2.33GHz processors, together with 2 Tesla processors connected via PCI-e Gen2 X8 slots. The hardware thus provides a platform for writing applications that run under message passing, threads, or hardware acceleration, or any combination of these models.
The purpose of this Primer is to get you started quickly. However, NCSA’s extensive User Documentation provides considerably more information, and is the authoritative source.
Lincoln
You may obtain detailed information about Lincoln’s processors by looking at the file /proc/cpuinfo.
To use Lincoln, log in to the front-end machine called lincoln.ncsa.uiuc.edu (the host name used in the ssh command later in this Primer).
Use the front-end node to develop code and to submit jobs via the batch subsystem.
The front-end should not be used to run jobs interactively; such jobs should be run using the procedure described below.
Compiling and Running on Lincoln
A public directory has been set up for you on Lincoln:
~baden/cse260-fa09
The public directory contains source code you’ll use in your assignments and other goodies. From now on we’ll refer to this directory as $(PUB).
To establish that your environment has been set up correctly, compile and run the Jacobi code used in Assignment 2, which has been set up in $(PUB)/j3d_serial. Be sure to use the Makefile that we've supplied so you'll get the correct compiler and loader flags for Lincoln.
You’ll notice that the Makefile includes a file called arch.serial. This file configures the Makefile to use the Intel compiler suite (C/C++/Fortran), with various compiler and loader flags set appropriately to build a pthread job. The "arch" file should not normally be changed. If you do modify it, be sure to document any changes you make.
To run interactively, use the qsub command with the -I option as follows:
qsub -I -V -l walltime=00:10:00,nodes=1:ppn=1
The above command will reserve 1 interactive node for 10 minutes. You will not have access to the GPUs.
You’ll notice that the qsub -I command forks off a new shell. This shell will exit after your reservation has exhausted its time. If you are done before then, be sure to use the exit command to free up resources for other users.
Depending on the activity on the machine, you may have to wait for resources to become available:
qsub: waiting for job 1244.abem5.ncsa.uiuc.edu to start
qsub: job 1244.abem5.ncsa.uiuc.edu ready
This is likely to happen as assignment deadlines draw near, so the preferred method of debugging your code is to develop it on other resources, as with the previous assignments.
One advantage of CUDA is that it is free, so you may install it on your own machine and develop your GPU code off-line. You may even compile the code in emulation mode, which will enable you to run without a GPU. (Applications run more slowly under emulation mode than on real GPU hardware, and there are some limitations, but this capability is handy and convenient. Watch this space for more information.)
See Fred Lionetti’s notes on how to install CUDA under MacOS or Ubuntu. The definitive source of information and software is the CUDA Zone, which is hosted by Nvidia, and which includes user forums.
Code examples will be placed in $(PUB)/Examples/Cuda. To use CUDA, be sure to read NCSA’s instructions on the Tesla Compilers.
To establish that you can compile, build, and run a CUDA application, make a copy in your home directory of the application increment from $(PUB)/Examples/Cuda and type make.
If you see error messages about include files or libraries not found, your environment is not set up correctly. Be sure that you set up the two entries in the .soft file as instructed by the NCSA documentation:
+cuda
+nvidia-sdk
Once you have set these up correctly, log out and log in again so that the settings in the .soft file take effect. There is no need to set up paths explicitly as under Linux or Darwin (MacOS). The .soft file takes care of this for you.
Having built the code, you are now ready to run it. To run interactively on the Tesla processors, specify the lincoln queue with ppn=2 (there are 2 GPUs on a node):
qsub -I -V -q lincoln -lwalltime=00:10:00,nodes=1:ppn=2
If successful, the job will print a single line
Success!
You may also submit batch jobs, as described below, but running this simple job is sufficient to get you started.
I have set up Moodle forums called "CUDA" and "Lincoln". Direct questions pertaining to coding issues to CUDA, and machine or installation issues to Lincoln.
You may also use the compiler for MPI, openmp or threads codes,
and the compiler offers
Cudaprof provides an interactive graphical interface,
so it is accessible via Lincoln’s interactive queue only.
You also need to enable X11 forwarding on the
qsub command line, which you specify with
the -X option:
When you are ready to collect measurements,
make your production runs using one of the batch queues.
These batch queues provide
dedicated access to resources—though the switch
will be shared with other jobs. They implement a technique
called space sharing, whereby each user gets
a portion of the hardware until they are done using it.
Compare this with time sharing, in which users
get time slices of the available resources.
We will use space sharing when collecting performance measurements,
since the mechanism provides dedicated access.
While batch jobs provide dedicated access, the caveat
is that you will have to wait for them to return. This method
of submitting jobs may be new to you, and it takes some getting used to.
However, if you think about it, it would be very expensive to
give each user their own dedicated resources for an indefinite
period of time. Most
heavily subscribed high performance computing platforms employ
space sharing.
We will use the PBS batch system.
Here is some documentation on qsub, and here is some
documentation on PBS.
Submit your
batch job using the
qsub command:
where run.sh is a job submission script
containing the appropriate environment settings and one or more
runs that you wish to make.
A message will be printed on the screen
We’ve set up a script file run.sh in $(PUB)/j3d_serial,
which will run the application.
You will need to make only minimal changes to the script,
as noted in the file, in particular, to set up the correct address
for email notification that the job has completed. (Qsub scripts indicate options
with a # mark, as in lines beginning with #PBS. These are not comments unless there are two or more # in a row.)
Once your job has been submitted, it will wait in the queue until
the specified resources are available. You may check
on the status of your job using the qstat command.
If you specify your user ID as in qstat -u baden
you'll see only your jobs.
You may also monitor the machine via
this url.
You can remove jobs with the qdel command:
The provided script is set up to write all output to a file
whose name contains a unique identifier, as in
Increment.o2196961. You also get an error output
file with the same prefix, i.e. Increment.e2196961.
You may also have the output mailed to you; see the documentation.
Note that if there are problems with the job you’ll be notified.
Some output appears in the "error" file, other output in the
"output" file.
For example, the following message came from the "error" file,
and it indicates that
the program could not be found
This might have been a result of not setting the correct directory
in the cd command.
By default, the batch subsystem puts your job in your home directory
(this includes interactive mode). So, you must cd to the correct working directory, as set up in the
provided job submission script:
To get an idea of when your job will finish, use
the showstart command.
To run showstart, provide a job ID as the single command line argument:
You can also determine when your job will start via the TeraGrid portal
(https://portal.teragrid.org): go to the resource tab at
https://portal.teragrid.org/gridsphere/gridsphere and then drill down to
HPC queue prediction.
CUDA provides its own version of the BLAS libraries, but
when running on the CPUs rather than the GPUs,
a particularly useful library is the Intel Math Kernel Library (MKL),
which provides high performance implementations for BLAS,
Fast Fourier Transforms and a variety of other mathematical kernels.
These have been heavily optimized, and it will be difficult
(though not impossible) for you to build more efficient implementations.
If you are using MKL, you need to
add the following to the end of your loader command line
You can find documentation and sample code for the Intel compilers on Lincoln in the directory tree
rooted at
/usr/local/intel/10.0.017/samples (The latest version of the compiler
is version 11, and the samples are located in a different place:
/usr/local/intel/11.1.038/Samples/en_US)
Also see Intel’s
C++ Compiler for Linux Systems User’s Guide (hosted at NERSC) with additional information about
vectorization and OpenMP.
There are various tools for measuring performance. (These do not apply to the Tesla processors.)
A simple way to measure times in your program is to use
a high resolution timer.
The gettimeofday() routine will give you approximately
1 microsecond resolution, but has a 1 microsecond overhead.
If you are measuring timings of 100 microseconds or more,
this method should give you reasonable accuracy.
Code has been posted in $(PUB) showing how to use the timer.
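As a rough illustration of the gettimeofday() approach, here is a minimal sketch in C. It is not the code posted in $(PUB); the names wall_time, busy_sum, and time_busy_sum are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Return the current wall-clock time in seconds, using gettimeofday().
   (Hypothetical helper name, not taken from the course code.) */
double wall_time(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);   /* gettimeofday() returns 0 on success */
    (void)rc;          /* avoid an unused-variable warning under NDEBUG */
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* Hypothetical workload: sum the integers 1..n in double precision. */
double busy_sum(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += (double)i;
    return s;
}

/* Time one call of busy_sum(n) and return the elapsed seconds.
   The sum is stored through 'result' so the compiler cannot
   discard the loop being timed. */
double time_busy_sum(long n, double *result)
{
    double t0 = wall_time();
    *result = busy_sum(n);
    return wall_time() - t0;
}
```

Because the timer itself costs about a microsecond, choose n large enough that the timed region lasts well over 100 microseconds, or time several repetitions and divide.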
Another method is to use PerfSuite, a set of tools which can measure performance within
the memory hierarchy, e.g. cache.
For example, you may run a program with psrun,
which produces an xml file containing various statistics
about the program.
Here is a book published by Intel Press:
The Software Optimization Cookbook.
The Intel compilers process the OpenMP pragmas provided the -openmp
compiler flag has been set:
To run an OpenMP program, you must
set the environment variable OMP_NUM_THREADS to the number of threads you require, e.g.
Here is a Tutorial on OpenMP, complete with code examples.
Also see
C++ Examples by Jon Burkardt
Example OpenMP programs taken from the tutorial
have been installed on Lincoln in $(PUB)/Examples/OpenMP
complete with Makefile and batch submission script.
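For orientation, the following is a minimal sketch of an OpenMP loop in C. It is not one of the installed examples; the function name dot is hypothetical. When compiled with the -openmp flag, the iterations are divided among the threads requested through OMP_NUM_THREADS; without the flag the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of x and y (hypothetical example function).
   The reduction clause gives each thread a private partial sum
   and combines the partial sums when the loop finishes. */
double dot(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    long len = (long)n;   /* signed loop bound, as older OpenMP versions require */
    long i;
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < len; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Note that a parallel reduction may add the terms in a different order than a serial loop, so floating-point results can differ slightly between runs with different thread counts.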
Here is a short document on Using OpenMP, including a pointer to the use of environment variables.
Also look at Optimizing Applications (found
within the above Intel compiler document).
On the left hand pane of the window, drill down to Optimizing Applications. The outline tree will expand at this point;
from there drill down to Increasing Performance Using Parallelism
→
OpenMP Support. (You can also perform this
navigation from within the right hand pane, but the titles are slightly
different).
If you prefer, you can
obtain a PDF
of Intel® C++ Compiler User and Reference Guides (255 pages)
Here are some articles to help get you started, all
published by Intel.
Here are two tutorial presentations from the Ohio Supercomputing Center
We’ll be using an implementation
of MPI known as MVAPICH2.
Before you get started, create the file
$HOME/.mpd.conf if you do not already have one.
The mpd daemons (MPI) require this file.
To establish that your environment has been set up correctly to run with
MPI,
compile and run the provided
parallel "hello world" program, which
prints "Hello World" from each process
along with the process ID.
The code for hello world
is found in $(PUB)/Examples/Basic/hello.
Be sure to use the Makefile that we've supplied so you'll get
the correct compiler and loader flags for Lincoln.
Code built to run with MPI must use the arch.mpi
version of the arch file.
Note that any command line arguments come in the usual position,
after the name of the executable, which in turn follows the arguments
to mpirun.
Thus,
to
run the Ring program (found in $(PUB)/Examples/Ring)
on 16 CPUs with command line arguments
-lin 0 1024 64,
enter the following:
You may vary the qsub parameters nodes and
ppn
up to values limited by the machine configuration (Recall that nodes have 8 CPUs).
You may also set up the configuration with environment variables;
consult the on-line instructions for the details.
You may run your jobs using various
batch queues (http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64TeslaCluster/Doc/Jobs.html#Queues), which vary according to factors such
as expected job length and the
number of nodes requested.
When debugging,
your jobs should not normally use more than
a few minutes of wallclock time.
In this case, specify
the debug queue.
Note that the debug queue is limited to 16 nodes, or 128
processors.
If you do need to run for more than a few minutes, or use
more nodes, use the normal queue.
Keep the job time limit low until you understand the
performance of your application. Then adjust
the time limit carefully by modifying your script.
Longer jobs may have to wait longer in the queue, but depending on
other activity your wait times may vary.
Since the computer time charges are based on a processor-time product,
specify
the minimal number of nodes needed to demonstrate the desired effect.
If your job is scaling poorly, don’t double the number of nodes
without a specific reason. Similarly, when debugging, try to use
just a single node whenever possible.
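To make the processor-time product concrete, here is a small worked sketch in C. The formula (nodes × processors per node × wall-clock hours) is an assumption made for illustration; NCSA's actual accounting rules may differ, and the function name processor_hours is hypothetical.

```c
#include <assert.h>

/* Processor-hours consumed by a job, assuming the charge is simply
   nodes * processors-per-node * wall-clock hours (illustrative only). */
double processor_hours(int nodes, int ppn, double walltime_seconds)
{
    assert(nodes > 0 && ppn > 0 && walltime_seconds >= 0.0);
    return (double)nodes * (double)ppn * walltime_seconds / 3600.0;
}
```

Under this assumption, a 30-minute job on 4 nodes with ppn=8 costs 16 processor-hours, and doubling the node count to 8 doubles the cost to 32 processor-hours unless the run time drops accordingly.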
When collecting timings, you may use the gettimeofday()
timer as stated above, but you may also use
the built in timer MPI_Wtime().
A variety of tools are available on Lincoln. In addition,
I’ve installed fpmpi,
which tallies various message passing statistics.
To use this tool, simply link with the fpmpi library.
To do this, add the following line to your Makefile:
Run your program like you would any other MPI program.
Your profile will appear
by default in the file named fpmpi_profile.txt
The output file includes an explanation of its contents.
You can override the default name by setting the
FPMPI_FILENAME environment variable.
For example, using the bash shell:
As mentioned previously, Lincoln uses the Intel E5345 (Clovertown)
processor. To optimize for SSE3 instructions
use the -axT option
as
recommended by the Intel documentation (you may also use -msse3).
Be sure to generate a vectorization report with the -vec_report3
command line option in order to learn about what the compiler
was not able to vectorize.
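As a rough illustration of what a vectorization report distinguishes, the sketch below contrasts two loops in C. The function names are hypothetical, and the comments describe what compilers typically do rather than a guaranteed result on Lincoln. The restrict qualifier is a C99 feature, so the compiler may need to be told to accept C99 source.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. Iterations are independent, and restrict
   promises that x and y do not overlap, so a vectorizing compiler
   can usually turn this loop into SSE instructions. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Running sum in place. Each iteration reads the value written by the
   previous one (a loop-carried dependence), which is the kind of loop
   a vectorization report typically flags as not vectorized. */
void prefix_sum(size_t n, float *y)
{
    assert(y != NULL);
    for (size_t i = 1; i < n; i++)
        y[i] = y[i] + y[i - 1];
}
```

Comparing the report entries for two such loops is a quick way to learn how to read the compiler's messages before applying them to your own code.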
Here are additional sources to consult on vectorization and SSE
instructions
If you have difficulty compiling code, post to Moodle.
There are no specific known problems at this time, but watch
this space for any developments.
Using Cudaprof
Cudaprof is a handy GUI-based tool for profiling your code, whether
written in CUDA or with PGI's accelerator extensions.
When you use ssh to connect to Lincoln, be sure
to enable X11 forwarding:
ssh -X lincoln.ncsa.uiuc.edu
qsub -I -V -X -q lincoln -lwalltime=00:30:00,nodes=1:ppn=2
Submitting Batch Jobs
abem5.ncsa.uiuc.edu:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
2196961.abem5.nc     baden    lincoln  Increment            --     1   1     -- 00:05 R    --
qdel 2196961
showstart 607580
job 607580 requires 32 procs for 00:30:00
Estimated Rsv based start in 00:00:00 on Fri Oct 31 21:40:35
Estimated Rsv based completion in 00:30:00 on Fri Oct 31 22:10:35
The job ID is the
string of numbers reported by qsub
qsub run.sh
This job will be charged to account: kmi (TG-CCR070001)
607580.abem5.ncsa.uiuc.edu
The job ID is 607580
Programming Environment for MPI and other models
You may implement your applications with the PGI accelerator extensions,
CUDA, OpenMP, pthreads, or MPI.
Lincoln provides various debugging tools, along with other tools
for checking for memory leaks, measuring cache performance, and many others.
Many of these do not apply to CUDA jobs, but they may prove
useful in hybrid applications (e.g. MPI + CUDA).
Look here for documentation. (This documentation is for Abe, which is the subset
of Lincoln that excludes the Tesla processors.)
-L/usr/local/intel/mkl/9.0/lib/em64t -lmkl -lguide -lpthread -lg2c -lm
To use the MKL library, add the following line to your
.soft file:
+intel-mkl
Additional documentation can be found HERE.
psrun Jacobi3D
7-Point Point Jacobi with N = 128
Process geometry 1 x 1 x 1
iterations: 10
...
N Px Py Pz It Time Gflops
#> 128 1 1 1 10 5.67385e-01 0.29569
honest3 [135]: ls -l *.xml
-rw------- 1 baden kmi 4922 Oct 19 01:32 Jacobi3D.16547.honest3.xml
You then run the psprocess command to get a report:
psprocess Jacobi3D.16547.honest3.xml > report.txt
However, this method is restricted to single core (serial) runs.
OpenMP
icc -c -openmp -openmp-report2 file.c
For more details,
see the NCSA’s posting of general information on compilers,
including
Intel C/C++ compiler documentation
setenv OMP_NUM_THREADS 4
./a.out
Since OpenMP assumes the existence of shared memory,
OpenMP runs are restricted to a single node (8 processors).
Thus, the
effective upper bound on
OMP_NUM_THREADS is 8. In fact, this value
might be
smaller for some applications, depending
on the memory access patterns.
PDF, Troy Baer, April 2007
PDF, Jim Giuliani, November 2003
MPI
$ echo "MPD_SECRETWORD=mywordabc" > $HOME/.mpd.conf
$ chmod 600 $HOME/.mpd.conf
where mywordabc is a "secret word" of your choosing.
See this web page for more information:
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/CommonDoc/mpich2_gdb.html
LDLIBS += -L/u/ac/baden/cse260-fa09/lib/fpmpi-2.1f -lfpmpi
Vectorization and SSE instructions
Porting code to Lincoln
Maintained by baden@ucsd.edu [Wed Oct 21 13:50:42 PDT 2009]