Running Cactus on the Garching SP3 (PSI)

This is a short introduction on how to get Cactus running on the IBM Regatta at the Rechenzentrum Garching (RZG).

We currently have dedicated access to one node of this machine. The node has 32 processors (SP4) and, in principle, 96 GB of memory, though it seems that only 64 GB are accessible at the current time.

The RZG web pages for the Regatta system are at:

http://www.rzg.mpg.de/computing/IBM_P/

Getting an Account

Fill out the form linked from this page in order to get an account.

Logging in to psi and psi19

You can log in directly to psi from any machine in the world using
ssh <user_name>@psi.rzg.mpg.de
Psi19 is behind a firewall. Direct external access is only possible from our origin and from machines on AEI's virtual private network (172.16.*.*), which includes all Xeons and most people's laptops. Otherwise you have to log in to psi first, and from there ssh to psi19.
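
The two-step login can be collapsed into a single command. The following is a minimal sketch, assuming that the node is reachable from psi under the short host name psi19. The function name psi19_login and the SSH override variable are hypothetical and exist only for illustration.

```shell
# Log in to psi19 by hopping through psi in a single command.
# Assumption: from psi, the node is reachable under the short name "psi19".
# psi19_login and the SSH variable are illustrative names, not site tools.
psi19_login() {
    user="$1"
    if [ -z "$user" ]; then
        echo "usage: psi19_login <user_name>" >&2
        return 1
    fi
    # -t forces a terminal so that the inner ssh gets an interactive shell.
    # Set SSH=echo to print the arguments instead of connecting.
    ${SSH:-ssh} -t "${user}@psi.rzg.mpg.de" ssh psi19
}
```

Calling psi19_login <user_name> would then open a session on psi19 via psi.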

Setup

Download Cactus as usual, using GetCactus.

Create a ~/.cactus directory, and create a file called config there, which contains the lines listed for the Regatta on the configurations page. You should be able to compile your Cactus source tree using these options.

The job manager on the Regatta is poe (Parallel Operating Environment). In order to use it, you will need to set up a host.list file in your home directory listing all of the nodes on which you would like to run. You will also need to add the host to your local .rhosts file.
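
As a sketch of the host.list step: poe normally reads one host entry per task from this file, so a run with four tasks on a single node lists that node four times. The helper name write_host_list and the node name psi19 are assumptions; substitute the node you were given, and check the poe documentation if your setup uses a different host list layout.

```shell
# Write a poe host list with one line per task.
# write_host_list is a hypothetical helper, not part of poe or Cactus.
write_host_list() {
    node="$1"      # host name of the node to run on (e.g. psi19, assumed)
    ntasks="$2"    # number of tasks, one line is written for each
    outfile="${3:-$HOME/host.list}"    # defaults to ~/host.list
    : > "$outfile"                     # create or truncate the file
    i=0
    while [ "$i" -lt "$ntasks" ]; do
        echo "$node" >> "$outfile"
        i=$((i + 1))
    done
}
```

Calling write_host_list psi19 4 would then write four psi19 lines to ~/host.list.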

Monitoring CPU usage and processes on the nodes

The command
octop -c -n <node name>
will display the CPU usage on the node <node name>.

Type octop -h to see more options.

Interactive job submission

You can run an interactive job using a command such as:
poe ./cactus_test test.par -procs 4

Batch job submission

Four queues have been set up on psi19, which differ in the length of jobs they will allow:

Queue name    Time limit
short         1 hour
huge          12 hours
lhuge         24 hours
infinite      14 days

You can use the llclass command to list the available queues.
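
Choosing a queue from these limits can be sketched as a small shell function. This is only an illustration: pick_queue is a hypothetical helper, and it assumes the time limits listed above are still current (check with llclass).

```shell
# Pick the shortest queue whose time limit covers a requested run time.
# pick_queue is a hypothetical helper; the limits are those listed above.
pick_queue() {
    hours="$1"    # requested wall-clock time in whole hours
    if   [ "$hours" -le 1 ];   then echo short
    elif [ "$hours" -le 12 ];  then echo huge
    elif [ "$hours" -le 24 ];  then echo lhuge
    elif [ "$hours" -le 336 ]; then echo infinite    # 14 days = 336 hours
    else
        echo "no queue allows ${hours} hours" >&2
        return 1
    fi
}
```

For a 6-hour run, pick_queue 6 selects the huge queue.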

Manual batch job submission

Batch jobs are submitted to the LoadLeveler system. A simple submission script looks like this:

#!/bin/sh
# @ output  = test-192.out
# @ error   = test-192.err
# @ initialdir = /afs/ipp-garching.mpg.de/home/p/pollney/runs/test/
# @ class = huge
# @ job_type = parallel
# @ environment= COPY_ALL
# @ node_usage= shared
# @ node = 1
# @ tasks_per_node = 8
# @ resources = ConsumableCpus(1)
# @ queue

poe ./cactus_bhrun test-192.par
You can submit this script using the llsubmit command:
llsubmit test.ll
where test.ll is the name of the submission script containing the above lines.
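
If you submit many similar runs, a script like the one above can be generated rather than edited by hand. The following is a minimal sketch: make_ll_script and its arguments are hypothetical, and the LoadLeveler keywords are simply copied from the example script rather than taken from a complete reference.

```shell
# Emit a LoadLeveler script like the one above for a given run.
# make_ll_script is a hypothetical helper; the keywords are copied from
# the example script and only the run-specific values are substituted.
make_ll_script() {
    name="$1"      # base name for the .par, .out and .err files
    class="$2"     # queue name: short, huge, lhuge or infinite
    tasks="$3"     # number of tasks on the single node
    rundir="$4"    # directory holding the executable and parameter file
    cat <<EOF
#!/bin/sh
# @ output  = ${name}.out
# @ error   = ${name}.err
# @ initialdir = ${rundir}
# @ class = ${class}
# @ job_type = parallel
# @ environment= COPY_ALL
# @ node_usage= shared
# @ node = 1
# @ tasks_per_node = ${tasks}
# @ resources = ConsumableCpus(1)
# @ queue

poe ./cactus_bhrun ${name}.par
EOF
}
```

For example, make_ll_script test-192 huge 8 "$PWD" > test.ll writes a script that can then be passed to llsubmit as before.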

Submitting jobs with qs2

The qs2 script can be used to automate job submission. Use
qs2 16 cactus_test test.par 2:00:00
to submit a job on 16 nodes for 2 hours. If you leave out the time, then the job will be submitted to the infinite queue with a two-week time limit.

Monitoring and controlling batch jobs

Use the command llq to see the queue.

Use the command llcancel job_# to cancel a job.


Filesystem

The local /batch filesystem has about 350 GB of space, and the local /scratch filesystem has 70 GB. You can create your own directories under these.


Accessing jobs via the Portal and HTTPD

In order to access a Cactus job using a web browser:

If you include the thorn
AEIDevelopment/Announce
then you will also be able to connect to the run via the ASC-portal.


This page last modified: $Date: 2004/03/02 14:14:38 $