A Brief Guide to Triton

Changelog

Date       Description
25-Sep-09  Original posting
19-Oct-09  Added some references to Nehalem literature; installed fpmpi2 and updated documentation

 

Introduction

The Triton Resource has two components that are available to this course.

The Triton Compute Cluster provides 256 Appro gB222X Blade Servers, each with dual-socket quad-core Intel Xeon E5530 (Nehalem, 2.40 GHz) processors and 24 GBytes of memory, for a total of 2048 cores. (To read more about the Nehalem processor see: Next Generation Intel® Microarchitecture Nehalem, by Paul G. Howard, www.microway.com; Inside Nehalem: Intel's Future Processor and System, by David Kanter, realworldtech.com; and First Look at Nehalem Microarchitecture, by Ilya Gavrichenkov, xbitlabs.com.)

The Triton Resource Petascale Data Analysis Facility (PDAF) comprises 28 Sun x4600M2 eight-socket quad-core nodes. Each node has eight AMD 8380 Shanghai 4-core processors running at 2.5 GHz. Each of these 32-core nodes has at least 256 GBytes of memory, and some have 512 GBytes. Four of the 256-GByte nodes are connected to large local database servers.

The PDAF will be available for the class project only; all other work will be done on the Cluster.

The purpose of this Primer is to get you started quickly. However, SDSC's documentation provides more information and is the authoritative source.

Triton

You may obtain detailed information about Triton’s processors by looking at the file /proc/cpuinfo.
To use Triton, log in to the front-end machine called

triton-login.sdsc.edu

Next, follow the instructions for Setting Up Your First-time Login using SSH. You are now ready to go.

SDSC has published a Quick Start Guide to get you started. Additional information will be posted here throughout the quarter.

Use the front-end node to develop code, and to submit jobs via the batch subsystem. The front-end should not be used to run jobs interactively; such jobs should be run using the procedure described below.

Compiling and Running on Triton

A public directory has been set up for you on Triton:

~sbaden/cse260-fa09
The public directory contains source code you’ll use in your assignments and other goodies. From now on we’ll refer to this directory as $PUB.

To establish that your environment has been set up correctly, compile and run two pthreads programs which have been placed in $PUB/Examples/simpleThread. Be sure to use the Makefile that we've supplied so you'll get the correct compiler and loader flags for Triton.

You’ll notice that the Makefile includes a file called arch.pgi. This file configures the Makefile to use the PGI compiler (C/C++/Fortran), with the compiler and loader flags set appropriately to build the executable. The "arch" file should not normally be changed. If you do modify it, be sure to document any changes you make.

To run interactively, use the qsub command with the -I option as follows:

qsub -I -V -l walltime=00:10:00,nodes=1:ppn=1

This will give you access to 1 interactive node for 10 minutes and is appropriate for running single-processor or multithreaded jobs.

Depending on the activity on the machine, you may have to wait for resources to become available.

qsub: waiting for job 2509.triton-42.sdsc.edu to start
qsub: job 2509.triton-42.sdsc.edu ready

This is likely to happen as assignment deadlines draw near, so the preferred method of debugging your code is to develop it on other resources.

You’ll notice that the qsub -I command forks off a new shell. This shell will exit after your reservation has exhausted its time. If you are done before then, be sure to use the exit command to free up resources for other users.

Here is some more documentation on qsub (see also the on-line man page).

Running Batch Jobs

When you are ready to collect measurements, make your production runs using one of the batch queues. These batch queues provide dedicated access to resources, though the switch will be shared with other jobs. They implement a technique called space sharing, whereby each user gets a portion of the hardware until they are done using it. Compare this with time sharing, in which users get time slices of the available resources. We will use space sharing when collecting performance measurements, since the mechanism provides dedicated access.

While batch jobs provide dedicated access, the caveat is that you will have to wait for them to return. This method of submitting jobs may be new to you, and it takes some getting used to. However, if you think about it, it would be very expensive to give each user their own dedicated resources for an indefinite period of time. Most heavily subscribed high performance computing platforms employ space sharing.

We will use the Torque resource manager (also known as PBS, its historical name). Submit your batch job using the qsub command:

qsub run.sh

where run.sh is a job submission script containing the appropriate environment settings and one or more runs that you wish to make.

A message will be printed on the screen, for example

2357.triton-42.sdsc.edu
where 2357 is the job number.

We’ve set up a script file in $PUB/Examples/simpleThread, which will run the two programs contained therein. You will need to make only minimal changes to the script, as noted in the file. (qsub scripts indicate options with a # mark. These are not comments unless there are two or more # in a row.)

Once your job has been submitted, it will wait in the queue until the specified resources are available. You may check on the status of your job using the qstat command. If you specify your user ID, as in qstat -u sbaden, you'll see only your jobs. The command output shows the job IDs and queues, for example:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.triton-46              PBStest          hocks                  0 R batch
91.triton-46              PBStest          hocks                  0 Q batch
92.triton-46              PBStest          hocks                  0 Q batch
2367.triton-42            SimpleThreads    sbaden          00:00:00 C batch

You may also monitor the machine via this URL. You can remove jobs with the qdel command; you only need to specify the leading number in the job ID:

qdel 2367

The provided script is set up to write all output to a file whose name contains a unique identifier, e.g. SimpleThreads.o2367. You also get an error output file with the same prefix, e.g. SimpleThreads.e2367.

You can ask to be notified by email when your job has finished. See the provided batch script and be sure to set the correct email address. You may also have the output mailed to you; see the documentation.

Note that if there are problems with the job you’ll be notified. Some output appears in the "error" file, some in the "output" file. For example, the following message came from the "output" file, and it indicates that the program could not be found:

./e1: Command not found.

This might have been a result of not setting the correct directory in the cd command which appears in the provided job submission script. By default, the batch subsystem puts your job in your home directory (this includes interactive mode). So, you must cd to the correct working directory:

cd cse260-fa09/Examples/Basic

More on Queues

Unless you are using the PDAF nodes (which won’t be needed for the programming assignments), there is only one queue, called batch. For pthreads, OpenMP, and serial jobs, you’ll run on just 1 node. For the assignments in this course, you should not be using more than a few minutes of time.

You can also determine when your job will start via the TeraGrid portal (https://portal.teragrid.org): select the resource tab, then drill down to HPC queue prediction.

Programming Environment

You may implement your applications with OpenMP, pthreads, or MPI. Triton provides various debugging tools, along with tools for checking for memory leaks, measuring cache performance, and many others. These will be discussed in class, so watch this space for more information.

A simple way to measure times in your program is to use a high resolution timer. The gettimeofday() routine will give you approximately 1 microsecond resolution, but each call also has about 1 microsecond of overhead. If you are measuring intervals of 100 microseconds or more, this method should give you reasonable accuracy. Code will be posted in $PUB showing how to use the timer; again, watch this space for more information.



Programming



OpenMP

The PGI compilers process the OpenMP pragmas provided the -mp compiler flag has been set; see the arch file in $PUB/Examples/OpenMP.

To run an OpenMP program, you must set the environment variable OMP_NUM_THREADS to the number of threads you require, e.g.

export OMP_NUM_THREADS=2
Since OpenMP assumes the existence of shared memory, OpenMP runs are restricted to a single node, that is, 8 cores. Thus, the effective upper bound on OMP_NUM_THREADS is 8. In fact, the best value might be smaller for some applications, depending on the memory access traffic.

Here is a Tutorial on OpenMP, complete with code examples. Also see C++ Examples by John Burkardt. Example OpenMP programs taken from the tutorial have been installed on Triton in $PUB/Examples/OpenMP, complete with Makefile and batch submission script.

Here are some articles to help get you started, all published by Intel, though note that the compiler flags are different from those expected by the PGI compilers.

  • Getting Started with OpenMP
  • More work sharing with OpenMP
  • Advanced OpenMP Programming
  • 32 OpenMP traps for C++ Developers

Here are two tutorial presentations from the Ohio Supercomputer Center:

  • PDF, Troy Baer, April 2007
  • PDF, Jim Giuliani, November 2003

    MPI Users

    You may also use the PGI compiler for MPI. The compiler offers performance advice. Contact me if you are interested in using it.

    To establish that your environment has been set up correctly to run with MPI, compile and run the provided parallel "hello world" program, which prints "Hello World" from each process along with the process ID. The code for hello world is found in $PUB/Examples/Basic/hello. Be sure to use the Makefile that we've supplied so you'll get the correct compiler and loader flags for Triton. Code built to run with MPI must use the arch.mpi.pgi version of the arch file.

    Note that any command line arguments come in the usual position, after the name of the executable, which in turn follows the arguments to mpirun. Thus, to run the Ring program (found in $PUB/Examples/Ring) on 16 CPUs with command line arguments -lin 0 1024 64, enter the following, substituting the name of the Ring executable:

    mpirun -np 16 <executable> -lin 0 1024 64

    When collecting timings, you may use the gettimeofday() timer as stated above, but you may also use the built-in timer MPI_Wtime().

    A variety of development tools are available on Triton; consult the documentation.

    I have installed a tool called fpmpi, which tallies various message passing statistics.

    To use this tool, simply link with the fpmpi library. To do this, add the following line to your Makefile:

     LDLIBS          += -L$PUB/lib/fpmpi-2.1f -lfpmpi 

    Run your program as you would any other MPI program. By default, your profile will appear in the file named fpmpi_profile.txt. The output is self-explanatory. You can override the default name by setting the FPMPI_FILENAME environment variable. For example, using the bash shell:

    export FPMPI_FILENAME="OhYesNow.txt"

    Consult the file $PUB/fpmpi2.txt for more options that you can set via environment variables.

    Vectorization and SSE instructions

    Watch this space.

    Porting code to the Triton Resource

    There are no specific known problems at this time, but watch this space for any developments.


    Maintained by baden @ ucsd.edu   [Sat Aug 14 21:55:47 PDT 2010]