A Brief Guide to Lincoln

Changelog

Date        Description
28-Aug-09   Original posting
21-Oct-09   Update on using Tesla

 

Introduction

Lincoln is a GPU cluster located at the NCSA. The machine consists of a collection of Dell PowerEdge 1950 servers connected with InfiniBand SDR. Each server contains 8 cores constructed from dual-socket, quad-core Intel 64 (Clovertown, E5345) 2.33GHz processors, together with 2 Tesla processors connected via PCI-e Gen2 X8 slots. The hardware thus provides a platform for writing applications that run under message passing, threads, or hardware acceleration, or any combination of these models.

The purpose of this Primer is to get you started quickly. However, NCSA’s extensive User Documentation will provide considerably more information, and is the authoritative source.

Lincoln

You may obtain detailed information about Lincoln’s processors by looking at the file /proc/cpuinfo.
To use Lincoln, log in to the front-end machine called

lincoln.ncsa.uiuc.edu

Use the front-end node to develop code, and to submit jobs via the batch subsystem. The front-end should not be used to run jobs interactively; such jobs should be run using the procedure described below.

Compiling and Running on Lincoln

A public directory has been set up for you on Lincoln:

~baden/cse260-fa09

The public directory contains source code you’ll use in your assignments and other goodies. From now on we’ll refer to this directory as $(PUB).

To establish that your environment has been set up correctly, compile and run the Jacobi code used in Assignment 2, which has been set up in $(PUB)/j3d_serial. Be sure to use the Makefile that we've supplied so you'll get the correct compiler and loader flags for Lincoln.

You’ll notice that the Makefile includes a file called arch.serial. This file configures the Makefile to use the Intel compiler suite (C/C++/Fortran), with various compiler and loader flags set appropriately to build a pthread job. The "arch" file should not normally be changed. If you do modify it, be sure to document any changes you make.

To run interactively, use the qsub command with the -I option as follows:

qsub -I -V -l walltime=00:10:00,nodes=1:ppn=1

The above command will reserve 1 interactive node for 10 minutes. You will not have access to the GPUs.

You’ll notice that the qsub -I command forks off a new shell. This shell will exit after your reservation has exhausted its time. If you are done before then, be sure to use the exit command to free up resources for other users.

Depending on the activity on the machine, you may have to wait for resources to become available:

qsub: waiting for job 1244.abem5.ncsa.uiuc.edu to start
qsub: job 1244.abem5.ncsa.uiuc.edu ready

This is likely to happen as assignment deadlines draw near, so the preferred method of debugging your code is to develop it on other resources, as with the previous assignments.  

GPU Programming with Tesla

There are two ways of programming the Tesla GPUs on Lincoln. The first is to use Nvidia’s CUDA environment, which provides hands-on control over the hardware and is hence low level. The second is to use the new PGI compiler, which understands an accelerator interface provided by pragmas, or program annotations, that are reminiscent of OpenMP. This interface is higher level than CUDA’s.

One advantage of CUDA is that it is free, so you may install it on your own machine and develop your GPU code off-line. You may even compile the code in emulation mode, which will enable you to run without a GPU. (Applications run more slowly under emulation mode than on real GPU hardware, and there are some limitations, but this capability is handy and convenient. Watch this space for more information.)

See Fred Lionetti’s notes on how to install CUDA under MacOS or Ubuntu. The definitive source of information and software is the CUDA Zone, which is hosted by Nvidia and includes user forums.

Code examples will be placed in $(PUB)/Examples/Cuda. To use CUDA, be sure to read NCSA’s instructions on the Tesla Compilers.

To establish that you can compile, build, and run a CUDA application, copy the increment application from $(PUB)/Examples/Cuda into your home directory and type make.

If you see error messages about include files or libraries not found, your environment is not set up correctly. Be sure that you set up the two entries in the .soft file as instructed by the NCSA documentation:

+cuda
+nvidia-sdk

If you did set these up correctly, then log out and log in again so that the settings in the .soft file will take effect. There is no need to set up paths explicitly as under Linux or Darwin (MacOS); the .soft file takes care of this for you.

Having built the code, you are now ready to run it. To run interactively on the Tesla processors, specify the lincoln queue with ppn=2 (there are 2 GPUs on a node):

qsub -I -V -q lincoln -lwalltime=00:10:00,nodes=1:ppn=2

If successful, the job will print a single line:

Success!

You may also submit batch jobs, as described below, but running this simple job is sufficient to get you started.

I have set up Moodle forums called "CUDA" and "Lincoln". Direct questions pertaining to coding issues to CUDA, and machine or installation issues to Lincoln.

Using the PGI Compiler in accelerator mode

There is not a lot of documentation at this time. The most complete sources I am aware of are two white papers authored by Michael Wolfe of PGI:
  • The PGI Accelerator Programming Model on NVIDIA GPUs, Part 1 (June 2009)
  • The PGI Accelerator Programming Model on NVIDIA GPUs, Part 2: Performance Tuning (August 2009). This article also discusses the handy CUDA profiling tool, cudaprof, which is described below.

You may also use the compiler for MPI, OpenMP or threads codes, and the compiler offers performance advice. Contact me if you are interested in using it.

    Using Cudaprof

    Cudaprof is a handy GUI-based tool for profiling your code, whether written in CUDA or with PGI's accelerator extensions. When you use ssh to connect to Lincoln, be sure to enable X11 forwarding:
    ssh -X lincoln.ncsa.uiuc.edu

    Cudaprof provides an interactive graphical interface, thus it is accessible via Lincoln’s interactive queue only. You also need to enable X11 forwarding on the qsub command line, which you specify with the -X option:

    qsub -I -V -X -q lincoln -lwalltime=00:30:00,nodes=1:ppn=2

    Submitting Batch Jobs

    When you are ready to collect measurements, make your production runs using one of the batch queues. These batch queues provide dedicated access to resources—though the switch will be shared with other jobs. They implement a technique called space sharing, whereby each user gets a portion of the hardware until they are done using it. Compare this with time sharing, in which users get time slices of the available resources. We will use space sharing when collecting performance measurements, since the mechanism provides dedicated access.

    While batch jobs provide dedicated access, the caveat is that you will have to wait for them to return. This method of submitting jobs may be new to you, and it takes some getting used to. However, if you think about it, it would be very expensive to give each user their own dedicated resources for an indefinite period of time. Most heavily subscribed high performance computing platforms employ space sharing.

    We will use the PBS batch system. Here is some documentation on qsub, and here is some documentation on PBS. Submit your batch job using the qsub command:

    qsub run.sh

    where run.sh is a job submission script containing the appropriate environment settings and one or more runs that you wish to make.

    A message will be printed on the screen:

    579997.abem5.ncsa.uiuc.edu
    where 579997 is the job number.

    We’ve set up a script file run.sh in $(PUB)/j3d_serial, which will run the application. You will need to make only minimal changes to the script, as noted in the file, in particular, to set up the correct address for email notification that the job has completed. (In qsub scripts, lines beginning with #PBS specify options. They are not comments unless they begin with two or more # marks in a row.)

    Once your job has been submitted, it will wait in the queue until the specified resources are available. You may check on the status of your job using the qstat command. If you specify your user ID, as in qstat -u baden, you'll see only your jobs:

    abem5.ncsa.uiuc.edu: 
                                                                             Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
    -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
    2196961.abem5.nc     baden    lincoln  Increment           --      1   1    --  00:05 R   -- 
    

    You may also monitor the machine via this url. You can remove jobs with the qdel command:

    qdel 2196961
    

    The provided script is set up to write all output to a file whose name contains a unique identifier, e.g. Increment.o2196961. You also get an error output file with the same prefix, e.g. Increment.e2196961.

    You may also have the output mailed to you; see the documentation.

    Note that if there are problems with the job you’ll be notified. Some output appears in the "error" file, and some in the "output" file. For example, the following message came from the "error" file, and it indicates that the program could not be found:

    increment: command not found

    This might have been a result of not setting the correct directory in the cd command. By default, the batch subsystem puts your job in your home directory (this includes interactive mode). So, you must cd to the correct working directory, as set up in the provided job submission script:

    cd $PBS_O_WORKDIR

    More on Queues

    To get an idea of when your job will start and finish, use the showstart command. To run showstart, provide a job ID as the single command line argument:

    showstart 607580
    job 607580 requires 32 procs for 00:30:00
    Estimated Rsv based start in 00:00:00 on Fri Oct 31 21:40:35
    Estimated Rsv based completion in 00:30:00 on Fri Oct 31 22:10:35
    
    The job ID is the string of numbers reported by qsub:
    qsub run.sh
    
    This job will be charged to account: kmi (TG-CCR070001)
    607580.abem5.ncsa.uiuc.edu
    
    The job ID is 607580

    You can also determine when your job will start via the TeraGrid portal (https://portal.teragrid.org): go to the resource tab at https://portal.teragrid.org/gridsphere/gridsphere and then drill down to HPC queue prediction.

    Programming Environment for MPI and other models

    You may implement your applications with the PGI accelerator extensions, CUDA, OpenMP, pthreads, or MPI. Lincoln provides various debugging tools, along with other tools for checking for memory leaks, measuring cache performance, and many others. Many of these do not apply to CUDA jobs, but they may prove useful in hybrid applications (e.g. MPI + CUDA). Look Here for documentation. (This documentation is for Abe, which is the subset of Lincoln that excludes the Tesla processors.)

    CUDA provides its own version of the BLAS libraries for the GPUs, but when running on the CPUs a particularly useful library is the Intel Math Kernel Library (MKL), which provides high performance implementations for BLAS, Fast Fourier Transforms and a variety of other mathematical kernels. These have been heavily optimized, and it will be difficult (though not impossible) for you to build more efficient implementations. If you are using MKL, you need to add the following to the end of your loader command line:

    -L/usr/local/intel/mkl/9.0/lib/em64t -lmkl -lguide -lpthread -lg2c -lm
    
    To use the MKL library, add the following line to your .soft file:
    +intel-mkl
    
    Additional documentation can be found HERE.

    You can find documentation and sample code for the Intel compilers on Lincoln in the directory tree rooted at /usr/local/intel/10.0.017/samples. (The latest version of the compiler is version 11, and its samples are located in a different place: /usr/local/intel/11.1.038/Samples/en_US.) Also see Intel’s C++ Compiler for Linux Systems User’s Guide (hosted at NERSC), which has additional information about vectorization and OpenMP.

    Performance Measurement

    There are various tools for measuring performance. (These do not apply to the Tesla processors.) A simple way to measure times in your program is to use a high resolution timer. The gettimeofday() routine will give you approximately 1 microsecond resolution, but has a 1 microsecond overhead. If you are measuring timings of 100 microseconds or more, this method should give you reasonable accuracy. Code has been posted in $(PUB) showing how to use the timer.

    Another method is to use PerfSuite, a set of tools which can measure performance within the memory hierarchy, e.g. cache. For example, you may run a program with psrun, which produces an XML file containing various statistics about the program.

    psrun Jacobi3D
    
    7-Point Point Jacobi with N = 128
    Process geometry 1 x 1 x 1
    iterations: 10
    
    ...
    
         N    Px   Py   Pz   It       Time         Gflops
    #>   128    1    1    1   10    5.67385e-01    0.29569
    honest3 [135]: ls -l *.xml
    -rw-------  1 baden kmi 4922 Oct 19 01:32 Jacobi3D.16547.honest3.xml

    You then run the psprocess command to get a report:

    psprocess Jacobi3D.16547.honest3.xml > report.txt
    
    However, this method is restricted to single core (serial) runs.

    Here is a book published by Intel Press: The Software Optimization Cookbook.

    OpenMP

    The Intel compilers process the OpenMP pragmas provided the -openmp compiler flag has been set:

    icc -c -openmp -openmp-report2 file.c

    For more details, see the NCSA’s posting of general information on compilers, including Intel C/C++ compiler documentation.

    To run an OpenMP program, you must set the environment variable OMP_NUM_THREADS to the number of threads you require, e.g.

    setenv OMP_NUM_THREADS 4
    ./a.out
    
    Since OpenMP assumes the existence of shared memory, OpenMP runs are restricted to a single node--8 processors. Thus, the effective upper bound on OMP_NUM_THREADS is 8. In fact, the best value might be smaller for some applications, depending on the memory access patterns.

    Here is a Tutorial on OpenMP, complete with code examples. Also see C++ Examples by John Burkardt. Example OpenMP programs taken from the tutorial have been installed on Lincoln in $(PUB)/Examples/OpenMP, complete with Makefile and batch submission script.

    Here is a short document on Using OpenMP, including a pointer to the use of environment variables. Also look at Optimizing Applications (found within the above Intel compiler document). On the left-hand pane of the window, drill down to Optimizing Applications. The outline tree will expand at this point; from there drill down to Increasing Performance Using Parallelism, then OpenMP Support. (You can also perform this navigation from within the right-hand pane, but the titles are slightly different.) If you prefer, you can obtain a PDF of Intel® C++ Compiler User and Reference Guides (255 pages).

    Here are some articles to help get you started, all published by Intel.

  • Getting Started with OpenMP
  • More work sharing with OpenMP
  • Advanced OpenMP Programming
  • 32 OpenMP traps for C++ Developers

    Here are two tutorial presentations from the Ohio Supercomputer Center:

  • PDF, Troy Baer, April 2007
  • PDF, Jim Giuliani, November 2003

    MPI

    We’ll be using an implementation of MPI known as MVAPICH2.

    Before you get started, create the file $HOME/.mpd.conf if you do not already have one. The mpd daemons (MPI) require this file.

        $ echo "MPD_SECRETWORD=mywordabc" > $HOME/.mpd.conf
        $ chmod 600 $HOME/.mpd.conf
    
    where mywordabc is a "secret word" of your choosing. See this web page for more information: http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/CommonDoc/mpich2_gdb.html

    To establish that your environment has been set up correctly to run with MPI, compile and run the provided parallel "hello world" program, which prints "Hello World" from each process along with the process ID. The code for "hello world" is found in $(PUB)/Examples/Basic/hello. Be sure to use the Makefile that we've supplied so you'll get the correct compiler and loader flags for Lincoln. Code built to run with MPI must use the arch.mpi version of the arch file.

    Note that any command line arguments for your program come in the usual position, after the name of the executable, and follow the arguments to mpirun. Thus, to run the Ring program (found in $(PUB)/Examples/Ring) on 16 CPUs with command line arguments -lin 0 1024 64, enter the following, where ring stands for the name of the executable you built:

    mpirun -np 16 ./ring -lin 0 1024 64

    You may vary the qsub parameters nodes and ppn up to values limited by the machine configuration (Recall that nodes have 8 CPUs). You may also set up the configuration with environment variables; consult the on-line instructions for the details.

    You may run your jobs using various batch queues (http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64TeslaCluster/Doc/Jobs.html#Queues), which vary according to factors such as expected job length and the number of nodes requested. When debugging, your jobs should not normally use more than a few minutes of wallclock time. In this case, specify the debug queue. Note that the debug queue is limited to 16 nodes, or 128 processors. If you do need to run for more than a few minutes, or use more nodes, use the normal queue. Keep the job time limit low until you understand the performance of your application. Then adjust the time limit carefully by modifying your script. Longer jobs may have to wait longer in the queue, but depending on other activity your wait times may vary.

    Since the computer time charges are based on a processor-time product, specify the minimal number of nodes needed to demonstrate the desired effect. If your job is scaling poorly, don’t double the number of nodes without a specific reason. Similarly, when debugging, try to use just a single node whenever possible.

    When collecting timings, you may use the gettimeofday() timer as stated above, but you may also use the built-in timer MPI_Wtime().

    A variety of tools are available on Lincoln. In addition, I’ve also installed fpmpi which tallies various message passing statistics.

    To use this tool, simply link with the fpmpi library. To do this, add the following line to your Makefile:

     LDLIBS          += -L/u/ac/baden/cse260-fa09/lib/fpmpi-2.1f -lfpmpi 

    Run your program like you would any other MPI program. Your profile will appear by default in the file named fpmpi_profile.txt. The output is self-explanatory. You can override the default name by setting the FPMPI_FILENAME environment variable. For example, using the bash shell:

    export FPMPI_FILENAME="OhYesNow.txt"

    Other environment variable settings are explained in the doc subdirectory of fpmpi-2.1f.

    Vectorization and SSE instructions

    As mentioned previously, Lincoln uses the Intel E5345 (Clovertown) processor. To optimize for SSE3 instructions, use the -axT option as recommended by the Intel documentation (you may also use -msse3). Be sure to generate a vectorization report with the -vec_report3 command line option in order to learn what the compiler was not able to vectorize. Here are additional sources to consult on vectorization and SSE instructions:

  • Intel’s Vectorization with the Intel® Compilers (Part I), by A.J.C. Bik, 5/9/08.
  • Language Support and Directives (C++ Compiler for Linux Systems User’s Guide). A handy set of examples discussing directives that affect vectorization.
  • SSE Performance Programming
  • The Software Vectorization Handbook (Intel Press)
  • Introduction to the Streaming SIMD Extensions in the Pentium III, by Bipin Patwardhan part I, part II, part III
  • Become Familiar with Streaming SIMD Extensions 3 Instructions
  • Prescott New Instructions Developer Guide
  • Intel Compiler for Linux Intrinsics Reference
  • Streaming SIMD Extensions (Wikipedia)
  • Intel C++ Compiler (Wikipedia)

    Porting code to Lincoln

    If you have difficulty compiling code, post to Moodle. There are no specific known problems at this time, but watch this space for any developments.


    Maintained by baden @ ucsd.edu [Wed Oct 21 13:50:42 PDT 2009]