The Triton Resource contains two components that are available to the course.
The Triton Compute Cluster provides 256 Appro gB222X Blade Servers, each with dual-socket quad-core Intel Xeon E5530 (Nehalem, 2.40 GHz) processors and 24 GBytes of memory, for a total of 2048 cores. (To read more about the Nehalem processor see: Next Generation Intel® Microarchitecture Nehalem, by Paul G. Howard, www.microway.com; Inside Nehalem: Intel's Future Processor and System, by David Kanter, realworldtech.com; and First Look at Nehalem Microarchitecture, by Ilya Gavrichenkov, xbitlabs.com.)
The Triton Resource Petascale Data Analysis Facility (PDAF) comprises 28 Sun x4600M2, eight-socket quad-core nodes. Each node has eight AMD 8380 Shanghai 4-core processors running at 2.5 GHz. Each of these 32-core nodes has at least 256 GBytes of memory, and some have 512 GBytes. Four of the 256-GByte nodes are connected to large local database servers.
The PDAF will be available for the class project only; all other work will be done on the Cluster.
The purpose of this Primer is to get you started quickly. However, SDSC's documentation provides more information and is the authoritative source.
Triton
You may obtain detailed information about Triton’s processors by looking at the file /proc/cpuinfo.
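As a rough sketch of how to summarize that file (the variable names are illustrative, and the layout of /proc/cpuinfo is assumed to be the usual Linux one), you can count the logical CPUs and print the processor model once:

```shell
# Summarize the processors that Linux reports in /proc/cpuinfo.
cpuinfo=/proc/cpuinfo
if [ -r "$cpuinfo" ]; then
    # Each logical CPU has one entry that starts with "processor".
    ncpus=$(grep -c '^processor' "$cpuinfo" || true)
    # The model name repeats for every entry, so show only the first.
    model=$(grep -m 1 '^model name' "$cpuinfo" | cut -d: -f2-)
    echo "logical CPUs: $ncpus"
    echo "model:$model"
else
    ncpus=0
    echo "$cpuinfo is not readable on this system"
fi
```

Run this on the machine you want to inspect; the front-end and the compute nodes may report different values.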
To use Triton, log in to the
front-end machine called
SDSC has published a Quick Start Guide to get you started. Additional information will be posted here throughout the quarter.
Use the front-end node to develop code and to submit jobs via the batch subsystem. The front-end should not be used to run jobs interactively; such jobs should be run using the procedure described below.
Compiling and Running on Triton
A public directory has been set up for you on Triton: ~sbaden/cse260-fa09
The public directory contains source code you’ll use in your assignments and other goodies. From now on we’ll refer to this directory as $PUB.
To establish that your environment has been set up correctly, compile and run two pthreads programs which have been placed in $PUB/Examples/simpleThread. Be sure to use the Makefile that we've supplied so you'll get the correct compiler and loader flags for Triton.
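One possible way to do this is sketched below. It assumes that you define a shell variable named PUB yourself and that you build in a private copy under your home directory; the copy location is hypothetical, not something the course requires.

```shell
# Assumption: PUB is not predefined, so point it at the public directory.
# It is exported so that child processes such as make can see it.
PUB=${PUB:-~sbaden/cse260-fa09}
export PUB
src="$PUB/Examples/simpleThread"
work="$HOME/simpleThread"    # hypothetical location for your private copy

if [ -d "$src" ]; then
    cp -r "$src" "$work"     # work on a copy, not in the public directory
    (cd "$work" && make)     # build with the supplied Makefile
    built=yes
else
    echo "Cannot find $src; are you logged in to Triton?"
    built=no
fi
```

The subshell around cd keeps your current directory unchanged after the build.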
You’ll notice that the Makefile includes a file called arch.pgi. This file configures the Makefile to use the PGI compiler (C/C++/Fortran), with various compiler and loader flags set appropriately to build the executable. The "arch" file should not normally be changed. If you do modify it, be sure to document any changes you make.
To run interactively, use the qsub command with the -I option as follows:
qsub -I -V -l walltime=00:10:00,nodes=1:ppn=1
This will give you access to 1 interactive node for 10 minutes and is appropriate for running single processor or multithreaded jobs.
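The walltime field in the request above has the form HH:MM:SS. As a small sketch (the variables minutes and ppn are illustrative inputs, not course requirements), the resource string can be assembled from a time limit given in minutes:

```shell
# Build the -l resource string for an interactive request.
minutes=10    # requested time limit, in minutes
ppn=1         # processors per node

# Convert minutes to HH:MM:SS with zero padding.
walltime=$(printf '%02d:%02d:00' $((minutes / 60)) $((minutes % 60)))
resources="walltime=$walltime,nodes=1:ppn=$ppn"

# Print the command instead of running it, so this is safe anywhere.
echo "qsub -I -V -l $resources"
```

With the values shown, this prints the same command as the example above.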
Depending on the activity on the machine, you may have to wait for resources to become available:
qsub: waiting for job 2509.triton-42.sdsc.edu to start
qsub: job 2509.triton-42.sdsc.edu ready
Such waits are likely as assignment deadlines draw near, so the preferred method of debugging your code is to develop it on other resources.
You’ll notice that the qsub -I command forks off a new shell. This shell will exit after your reservation has exhausted its time. If you are done before then, be sure to use the exit command to free up resources for other users.
Here is some more documentation on qsub (see also the on-line man page).
Running Batch Jobs
When you are ready to collect measurements, make your production runs using one of the batch queues. These batch queues provide dedicated access to resources—though the switch will be shared with other jobs. They implement a technique called space sharing, whereby each user gets a portion of the hardware until they are done using it. Compare this with time sharing, in which users get time slices of the available resources. We will use space sharing when collecting performance measurements, since the mechanism provides dedicated access.
While batch jobs provide dedicated access, the caveat is that you will have to wait for them to return. This method of submitting jobs may be new to you, and it takes some getting used to. However, if you think about it, it would be very expensive to give each user their own dedicated resources for an indefinite period of time. Most heavily subscribed high performance computing platforms employ space sharing.
We will use the Torque resource manager (also known as PBS, its historical name). Submit your batch job using the qsub command:
qsub run.sh
where run.sh is a job submission script containing the appropriate environment settings and one or more runs that you wish to make.
A message containing the job ID will be printed on the screen.
We’ve set up a script file in $PUB/Examples/simpleThread, which will run the two programs contained therein. You will need to make only minimal changes to the script, as noted in the file. (qsub scripts indicate options with a # mark. These are not comments unless there are two or more # in a row.)
Once your job has been submitted, it will wait in the queue until the specified resources are available. You may check on the status of your job using the qstat command. If you specify your user ID, as in qstat -u sbaden, you'll see only your jobs. The command output shows the job IDs and queues, for example:
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.triton-46              PBStest          hocks           0        R batch
91.triton-46              PBStest          hocks           0        Q batch
92.triton-46              PBStest          hocks           0        Q batch
2367.triton-42            SimpleThreads    sbaden          00:00:00 C batch
You may also monitor the machine via this url. You can remove jobs with the qdel command, and you only need to specify the leading number in the job ID:
qdel 2367
The provided script is set up to write all output to a file whose name contains a unique identifier, e.g. SimpleThreads.o2367. You also get an error output file with the same prefix, e.g. SimpleThreads.e2367.
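The relationship between the job ID, the number passed to qdel, and the output file names can be sketched as follows. The job ID is written in the form shown in the qsub messages earlier, and the variable names are illustrative.

```shell
# Example ID in the form that qsub reports, and the job's name.
jobid="2367.triton-42.sdsc.edu"
jobname="SimpleThreads"

jobnum=${jobid%%.*}          # drop everything from the first dot onward
outfile="$jobname.o$jobnum"  # file receiving the job's standard output
errfile="$jobname.e$jobnum"  # file receiving the job's standard error

echo "cancel with: qdel $jobnum"
echo "output in:   $outfile and $errfile"
```

Only string manipulation happens here, so the snippet can be tried on any machine.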
You can ask to be notified by email when your job has finished. See the provided batch script and be sure to set the correct email address. You may also have the output mailed to you; see the documentation.
Note that if there are problems with the job, you’ll be notified. Some output appears in the "error" file, the rest in the "output" file. For example, the following message came from the "output" file, and it indicates that the program could not be found
This might have been a result of not setting the correct directory in the cd command which appears in the provided job submission script. By default, the batch subsystem puts your job in your home directory (this includes interactive mode). So, you must cd to the correct working directory:
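A minimal sketch of a job submission script along these lines follows. It is not the provided script: the resource values, the email address, and the executable name my_program are placeholders, and PBS_O_WORKDIR is the variable Torque sets to the directory from which qsub was invoked.

```shell
#!/bin/bash
# Lines starting with "#PBS" are qsub options; "##PBS" would be ignored.
#PBS -q batch
#PBS -N SimpleThreads
#PBS -l nodes=1:ppn=1,walltime=00:05:00
#PBS -M your_name@example.edu
#PBS -m e
#PBS -V

# The job starts in your home directory, so change to the directory
# from which qsub was run. Outside the batch system, stay where we are.
cd "${PBS_O_WORKDIR:-$PWD}" || echo "could not change directory"

prog=./my_program            # hypothetical executable name
if [ -x "$prog" ]; then
    "$prog"
    status=$?
else
    echo "$prog not found in $PWD"
    status=127
fi
```

If the cd is missing or points to the wrong place, the else branch reports the kind of "not found" problem described above.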
More on Queues
Unless you are using the PDAF nodes, which won’t be needed for the programming assignments, there is only one queue, called batch.
For pthreads, OpenMP, and serial jobs, you’ll run on just 1 node. For the assignments in this course, you should not be using more than a few minutes of time.
You can also estimate when your job will start via the TeraGrid portal (https://portal.teragrid.org): select the resource tab and then drill down to HPC queue prediction.
Programming Environment
You may implement your applications with
OpenMP, pthreads, or MPI.
Triton provides various debugging tools, along with other tools
for checking for memory leaks, measuring cache performance, and many others.
These will be discussed in class, so watch this space for more information.
A simple way to measure times in your program is to use
a high resolution timer.
The gettimeofday() routine will give you approximately
1 microsecond resolution, but has a 1 microsecond overhead.
If you are measuring timings of 100 microseconds or more,
this method should give you reasonable accuracy.
Code will be posted in $PUB showing how to use the timer; again, watch this space for more information.
The PGI compilers process the OpenMP pragmas provided the -mp compiler flag has been set; see the arch file in $PUB/Examples/OpenMP.
To run an OpenMP program, you must set the environment variable OMP_NUM_THREADS to the number of threads you require, e.g.
export OMP_NUM_THREADS=2
Since OpenMP assumes the existence of shared memory, OpenMP runs are restricted to a single node, i.e. 8 cores. Thus, the effective upper bound on OMP_NUM_THREADS is 8. In fact, the best value might be smaller for some applications, depending on the memory access traffic.
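To see how a program behaves as the thread count grows, one approach is to sweep OMP_NUM_THREADS up to the single-node limit. This is only a sketch; the executable name my_openmp_program is hypothetical.

```shell
prog=./my_openmp_program     # hypothetical executable name

# Try each thread count up to the 8 cores of one cluster node.
for t in 1 2 4 8; do
    export OMP_NUM_THREADS=$t
    if [ -x "$prog" ]; then
        "$prog"
    else
        # Nothing to run here, so just show what would happen.
        echo "would run $prog with OMP_NUM_THREADS=$OMP_NUM_THREADS"
    fi
done
```

In a batch script, the same loop lets one job collect timings for all thread counts.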
Here is a Tutorial on OpenMP, complete with code examples. Also see C++ Examples by John Burkardt. Example OpenMP programs taken from the tutorial have been installed on Triton in $PUB/Examples/OpenMP, complete with Makefile and batch submission script.
Here are some articles to help get you started, all published by Intel, though note that the compiler flags are different from those expected by the PGI compilers.
Here are two tutorial presentations from the Ohio Supercomputer Center:
PDF, Troy Baer, April 2007
PDF, Jim Giuliani, November 2003
You may also use the PGI compiler for MPI.
To establish that your environment has been set up correctly to run with MPI, compile and run the provided parallel "hello world" program, which prints "Hello World" from each process along with the process ID. The code for hello world is found in $PUB/Examples/Basic/hello. Be sure to use the Makefile that we've supplied so you'll get the correct compiler and loader flags for Triton.
Code built to run with MPI must use the arch.mpi.pgi
version of the arch file.
Note that any command line arguments come in the usual position, after the name of the executable; the arguments to mpirun come before it. Thus, to run the Ring program (found in $PUB/Examples/Ring) on 16 CPUs with command line arguments -lin 0 1024 64, enter the following:
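One plausible form of such a command is sketched below. It assumes that the executable is named ring and that the common -np option of mpirun selects the process count; the launcher options actually required on Triton may differ, so check the provided batch scripts.

```shell
nprocs=16
prog=./ring                  # assumed name of the Ring executable
args="-lin 0 1024 64"        # program arguments go after the executable

# mpirun's own options (here -np) come before the executable name.
cmd="mpirun -np $nprocs $prog $args"
echo "$cmd"

# Launch only where mpirun and the program are actually available.
if command -v mpirun >/dev/null 2>&1 && [ -x "$prog" ]; then
    $cmd
fi
```

The command is echoed first so that the argument order is visible even when nothing is launched.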
When collecting timings, you may use the gettimeofday() timer as stated above, but you may also use the built-in timer MPI_Wtime().
A variety of development tools are available on Triton; consult the documentation.
I have installed a tool called
fpmpi, which tallies various message passing statistics.
To use this tool, simply link with the fpmpi library. To do this, add the following line to your Makefile:
LDLIBS += -L$(PUB)/lib/fpmpi-2.1f -lfpmpi
(Inside a Makefile the variable must be written as $(PUB), and PUB must be set either in your environment or in the Makefile itself.)
Run your program as you would any other MPI program. Your profile will appear by default in the file named fpmpi_profile.txt. The output is self-explanatory.
You can override the default name by setting the FPMPI_FILENAME environment variable, for example with export in the bash shell.
Consult the file $PUB/fpmpi2.txt for more options that you can set via environment variables.
There are no specific known problems at this time, but watch this space for any developments.
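As a sketch of the FPMPI_FILENAME override described above, using the bash shell: the naming scheme is hypothetical, chosen here so that profiles from runs with different process counts do not overwrite one another.

```shell
nprocs=16                                     # illustrative process count
export FPMPI_FILENAME="fpmpi_np$nprocs.txt"   # hypothetical naming scheme
echo "fpmpi will write its profile to $FPMPI_FILENAME"
```

Set the variable before launching the MPI program, for instance in your batch script.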
Vectorization and SSE instructions
Watch this space.
Porting code to the Triton Resource
Maintained by baden@ucsd.edu [Sat Aug 14 21:55:47 PDT 2010]