Running Cactus on the PSC Terascale Compaq

This is a short introduction to getting Cactus running on the Compaq (lemieux) at the Pittsburgh Supercomputing Center. The machine has 750 4-processor nodes; each node has 4 GB of memory and each processor runs at 1 GHz.

The PSC web pages for lemieux are at http://www.psc.edu/machines/tcs/lemieux.html

Getting an Account

We have a 500000 SU allocation on lemieux. Fill out the form linked here to get a new account. To see the account details, type xbanner.

Logging in to LeMieux

ssh lemieux.psc.edu -l <user_name>

Filesystems

Your user account will have a 1 GB quota on $HOME.

Temporary data should be stored on the available scratch filesystems which are globally visible on all service and compute nodes. Use $SCRATCH and $SCRATCH2 to point to your own scratch space.

Data on $SCRATCH and $SCRATCH2 is subject to purging if the filesystem becomes full. For longer-term storage you should move your data to PSC's mass storage system (see this page for details).

Each compute node also has a $LOCAL filesystem which is shared only between the 4 processors of that node. This filesystem is very fast and should be used in your runs for the output of chunked Cactus data: change into $LOCAL before starting Cactus in your qsub batch script, and use a relative directory name for IOHDF5::out_dir in your parameter file.
Once your Cactus job has finished, you need to copy its output files back from $LOCAL to a global filesystem using the tcscp command:

tcscp -v -r -p ${RMS_NODES} '{compute}:$LOCAL' $SCRATCH

This recursively copies all the files created in the current batch job's $LOCAL directory into a subdirectory (with the job ID as its name) in your scratch space using multiple I/O nodes in parallel.

Compiling Cactus

The standard configuration options can be found on the Cactus configurations page. Just put these options in your ${HOME}/.cactus/config configuration file on lemieux.

Interactive Jobs

Single-processor (non-MPI) jobs can be run interactively as usual. For MPI jobs you first have to request processors using

qsub -I [-q <queue>] [-l rmsnodes=<#nodes>:<total #procs>]

with the appropriate number of compute nodes and processors (default is one node with 4 processors on the standard queue). Once the qsub command gives you an interactive shell on the requested nodes you can then run jobs using

prun [-N <#nodes>] [-n <total #procs>] cactus_<config> <parameter file>

(default is to run the job on all requested nodes and processors).
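
As a rough illustration of how the <#nodes>:<total #procs> pair fits together, the following sketch (plain /bin/sh) derives the node count from a processor count, assuming the 4 processors per node quoted at the top of this page. The helper name rms_spec is made up for this example and is not a PSC tool; the script only prints the commands, it does not run them.

```shell
#!/bin/sh
# Processors per compute node on lemieux (from the machine description above).
PROCS_PER_NODE=4

# rms_spec <total #procs>: print "<#nodes>:<total #procs>" as expected by
# "-l rmsnodes=...". Hypothetical helper, not part of the PSC software.
rms_spec() {
    procs=$1
    # Round up to whole nodes.
    nodes=$(( (procs + PROCS_PER_NODE - 1) / PROCS_PER_NODE ))
    echo "${nodes}:${procs}"
}

# Print (do not execute) the commands for an interactive run on 10 processors.
spec=$(rms_spec 10)
echo "qsub -I -l rmsnodes=${spec}"
echo "prun -N ${spec%%:*} -n ${spec##*:} cactus_<config> <parameter file>"
```

For 10 processors this gives rmsnodes=3:10, i.e. three nodes with the last one only partly used.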

Note that it can sometimes take a long time for a request for interactive nodes to be granted. If so, try both the standard and the debug queue.

Submitting to Queues

Check the status of queues using qstat (you can use -a, -f, or -u <login name> for more info).

Check the status of the machine using rinfo.

To submit to the queues you need to create a qsub script following the example below:

#!/bin/csh
# your job's runtime in HH:MM:SS
#PBS -l walltime=0:05:00
# the number of nodes:processors requested
#PBS -l rmsnodes=1:4
# use the projects command to find out your project name
#PBS -l rmsproject=<projectname>
# notify by email when the job has finished
#PBS -m e

set cactus=${HOME}/cactus/exe/cactus_hdf5
set parfile=${HOME}/cactus/par/hdf5.par

# Cactus is started by a temporary shell script which also cd's into $LOCAL
echo 'cd ${LOCAL}' > ${SCRATCH}/shell.$$
echo "${cactus} ${parfile}" >> ${SCRATCH}/shell.$$
prun /bin/sh ${SCRATCH}/shell.$$
/bin/rm ${SCRATCH}/shell.$$

# recursively copy all files on $LOCAL back to $SCRATCH, using parallel I/O
tcscp -v -r '{compute}:$LOCAL' $SCRATCH

Important: Make sure in your parameter file that your Cactus job terminates before your batch job has used up all of its walltime. Otherwise there will be no time left to copy your data files back from $LOCAL to $SCRATCH and, since $LOCAL is subject to purging, you will lose all of your job's output data!
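
One way to pick a safe run length is to subtract a copy-back margin from the requested walltime. The sketch below (plain /bin/sh with awk) does this arithmetic; the helper names and the 120-second margin are assumptions for illustration only, and the time tcscp really needs depends on how much data you write. The resulting number of minutes could then be used for a runtime-based termination condition in your parameter file (for example Cactus::max_runtime, if your Cactus version provides it).

```shell
#!/bin/sh
# walltime_to_seconds HH:MM:SS: convert a PBS walltime string to seconds.
# Hypothetical helper; awk treats fields such as "08" as decimal numbers.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# cactus_budget_minutes <walltime> <margin in seconds>: whole minutes left
# for Cactus after reserving the margin for copying data off $LOCAL.
cactus_budget_minutes() {
    total=$(walltime_to_seconds "$1")
    echo $(( (total - $2) / 60 ))
}

# Example: the 5-minute walltime from the script above, with an assumed
# 120-second margin for tcscp.
cactus_budget_minutes 0:05:00 120
```

With these numbers the job has 300 seconds in total, so Cactus should stop after at most 3 minutes.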

Open Ports for HTTPD Connections

Port numbers 7770-7790 have been opened up as inbound ports for Cactus network services like HTTPD.
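
A small sanity check along these lines can catch a port outside the opened range before a job is submitted. This is only a sketch in plain /bin/sh: the helper port_is_open is made up for this example, the chosen port is arbitrary, and the parameter name HTTPD::port is an assumption that should be checked against the HTTPD thorn of your Cactus checkout.

```shell
#!/bin/sh
# Inbound port range opened on lemieux for Cactus network services.
PORT_MIN=7770
PORT_MAX=7790

# port_is_open <port>: succeed if the port lies within the opened range.
# Hypothetical helper, not part of Cactus or the PSC software.
port_is_open() {
    [ "$1" -ge "$PORT_MIN" ] && [ "$1" -le "$PORT_MAX" ]
}

port=7770   # arbitrary choice for illustration
if port_is_open "$port"; then
    # Line to put into the parameter file (parameter name is an assumption).
    echo "HTTPD::port = $port"
else
    echo "port $port is outside ${PORT_MIN}-${PORT_MAX}" >&2
fi
```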

Documentation

A complete technical description and user documentation can be found on PSC's lemieux pages.

This page last modified: $Date: 2004/02/09 12:07:28 $