A Brief Guide to DataStar

Changelog

Date       Description
1-Nov-06   Original posting
2-Nov-06   Added discussion about how to take timings using the high resolution timer.
12-Nov-06  Added notes about porting code to DataStar.

Introduction

DataStar is an IBM supercomputer located at the San Diego Supercomputer Center. It is organized as a heterogeneous collection of 8-way and 32-way SMP nodes, interconnected by a high performance switch. The nodes all contain IBM Power4 CPUs, but differ in the amount of memory they contain and in the clock speed of their processors. See the System Configuration for more information.

Here we'll get an overview of the environment, and some tips to get started. But be sure to look over the DataStar User's Guide carefully in order to appreciate DataStar's full range of capabilities. Here is documentation on the IBM C++ and Fortran Compilers and on the BLAS libraries.

Environment

DataStar has a front end called dslogin.sdsc.edu. From this node you may develop code and submit batch jobs, but you should never run jobs on it. There is also an interactive node. In batch mode you obtain access to dedicated nodes, although the switch is still shared with other jobs. Batch mode is therefore appropriate for collecting performance measurements, where reproducibility is important.

A copy of Valkyrie's public directory has been set up in ~baden/cse260_fa06. At the top level of this directory you'll find a version of the arch file set up for DataStar. This file, called arch.dstar, configures your Makefile to use IBM's compilers. These compilers incorporate the MPI library. We'll use the thread-safe versions of the compilers, which are distinguished by names ending in the suffix _r, e.g. mpCC_r for the thread-safe C++ compiler, and so on. (For serial code you may use xlC. As with parallel code, do not run it on the front end.)

Running jobs

Use the interactive node, dsdirect.sdsc.edu, for code development. This node is a p690 server with 32 processors and 64 GB of memory, and you may use all of the node's memory if necessary. However, use memory with care, as our class account is charged for the number of processors as well as the amount of memory used.

To run interactively you use the poe32 command. For example, to run the Ring program on 16 CPUs with command line arguments -lin 0 1024 64, enter the following:

poe32 ring -lin 0 1024 64 -nodes 1 -tasks_per_node 16

You may also set up the configuration with environment variables:

setenv MP_NODES 1
setenv MP_TASKS_PER_NODE 16
poe32 ring -lin 0 1024 64

For more details, see the on-line instructions.

When you are ready to collect measurements, make your production runs using the batch subsystem. There are many batch queues, and they vary according to factors such as expected job length, maximum number of nodes, and cost. If your job requires 4 nodes (32 CPUs) or fewer, and can live within 16 GB of memory per node, use the Express queue. (The maximum time limit is two hours, but in this course we'll run in far less time. See me if you need to make longer runs.) The provided script specifies a normal queue, but a high priority queue is also specified in a commented line of the script.

Express queue jobs must be submitted from the special front end dspoe.sdsc.edu. Larger jobs should be submitted from dslogin.sdsc.edu and use one of the other queues, which will give you access to up to 265 nodes (2120 processors). Be sure to use the "normal" queue unless we've discussed the matter, as the higher priority queues drain our bank account more quickly. Similarly, be careful when running on more than 16 nodes (128 processors). Consult the documentation for more information.

Keep the job time limit low until you understand the performance of your application. Then adjust the limit carefully by editing the value specified in the provided scripts. Longer jobs may have to wait longer in the queue, although wait times also vary with other activity on the machine.

Submit batch jobs with the llsubmit command:

llsubmit p4_4

where the file p4_4 contains the appropriate environment settings and one or more runs that you wish to make. A copy of p4_4 is found in A1/Ring_new. You will need to make some changes in order to use the script, and these are noted in the script.

Once your job has been submitted, it will wait in the queue until the appropriate resources are available. You may check on the status of your job using the llq command. If you specify your user ID, as in llq -u baden, you'll see only your jobs.

You can remove a job with the llcancel command, giving it the job's Id as reported by llq:

ds100 [50]: llq -u baden
Id                       Owner      Submitted   ST PRI Class        Running On 
------------------------ ---------- ----------- -- --- ------------ -----------
ds100.240242.0           baden      11/1  22:03 C  50  express
ds100.240243.0           baden      11/1  22:04 C  50  express
ds100.240244.0           baden      11/1  22:07 C  50  express

0 job step(s) in query, 0 waiting, 0 pending, 0 running, 0 held, 0 preempted
ds100 [51]: llcancel ds100.240242.0
llcancel: Cancel command has been sent to the central manager.

Once your job has been dispatched, you'll get an email confirmation. You'll get another email confirmation when your job has completed. The provided script is set up to place all output in a file which contains a unique identifier in its name, allowing you to run the job several times while getting unique output, as in Ring_4_4_240241.out. You also get an error output file with the same prefix. Normally the error file will show successful outcomes, as in

ATTENTION: 0031-408 4 tasks allocated by LoadLeveler, continuing...

but if there are errors in submitting or running the job, e.g. executable not found, exceeding resource limits, you'll be notified. You may also have the output mailed to you. See the documentation for the details.

Performance Measurement

There are various tools for measuring single processor performance. You may also use the high resolution timer to collect sub-microsecond times. Look at the dotslow example to see how these are used, and consult the IBM AIX documentation on high resolution clocks, as well as the document on CPU monitoring and tuning.

Porting code to DataStar

You'll find that the IBM compilers are a bit fussier than the GNU compilers. In some cases this is for the better; in other cases it is due to differences in the version of the language recognized. After you've ported the code to the IBM, you may find it convenient to set things up so you can move the code freely between IBM's and Valkyrie's compilers, or any other(s) that you are using.

On the IBM, you'll need to use gmake rather than make, as the native make won't accept some conventions accepted by GNU Make.

If you run out of memory, try compiling in 64 bit mode. Look here for more information.


Maintained by Scott B. Baden, Last modified: Sun Nov 12 15:25:10 PST 2006