COMPASS

The MPE COMPASS User Manual


This document is aimed at local MPE Compass Users, in particular beginners. General concepts are covered elsewhere in the comprehensive COMPASS Introductory Guide so the emphasis here is on the details of the MPE implementation.

  1. Getting Started
  2. Submitting a Compass Job
  3. Monitoring a Compass Job
  4. Compass Data Environment
  5. A Compass Job Running
  6. Compass Test Domains
  7. Command Line Interface

1. Getting Started

We make use of six machines: mpesun4, mpesun6, mpesun10, mpesun13, mpesun15 and bsc. Of these, mpesun4 and mpesun6 are reserved for interactive work and bsc is reserved for batch jobs. The batch system LoadLeveler is running on the four batch machines, and Compass jobs can be submitted from any of the mpesuns. It should not matter which machine the job runs on, as the results should end up in the same place.

What you need in your Sun environment:

  1. Path to sbm/usr/bin (and Oracle, which you should have automatically)
    In your .login file you should have the line
        setenv PATH "${PATH}:/afs/ipp/mpe/compass/sbm/usr/bin" 
  2. Correct keyboard-mapping (if you have an X-Terminal).
    In your .Xdefaults file you should have the line
        XTerm*sunFunctionKeys:          off 
  3. Permission to yourself to access your files from other machines.
    In your .rhosts file you should have the lines
        mpesun4.mpe-garching.mpg.de
        mpesun6.mpe-garching.mpg.de
        mpesun10.mpe-garching.mpg.de
        mpesun13.mpe-garching.mpg.de
        mpesun15.mpe-garching.mpg.de
        bsc.rzg.mpg.de  

  4. A working directory on each machine.
    Each user has a /batch/u/$USER directory on each machine, which each job enters with cd in order to run and create temporary files etc.

You should then be able to type compass on any mpesun and a window will open showing the Compass Menu. At the same time a directory temp will be created in your home directory (if not already there). This temp directory will be used for job output and temporary files.


2. Submitting a Compass Job

Option 1 on the Compass Menu opens a window for the User Shell, where you have to use the arrow and tab keys to move the cursor around (i.e. you cannot use the mouse, but maybe we can improve that later). The forms of the User Shell are standard, until you come to the job submission form. Most fields are self-explanatory, with help text (obtained by entering ?) and menu items (obtained by entering !). Some important fields are:

Batch job identifier: You have to enter a job identifier in this form, but a default is suggested, based on the task id and TPT number. It is the first 1 and last 3 letters of the task id, followed by the last (up to) 4 digits of the TPT number.

 e.g.    for task USHTZZ with TPT id MPE-USHTPT-12345 : utzz2345
     or for task EVPRNN with TPT id MPE-EVPRN3-9577  : ernn9577 
By this you recognise your job in the job utilities and the corresponding filenames. The job identifier which you enter here is used when constructing the names of many of the temporary files and log files, e.g. if you have esel781 for a job with TPT id MPE-EVPSEL-781, then the following files may appear in your temp directory:
     esel781.LOG.EVPSEL-781   ...   log file, similar to job output on CMS
     esel781.MON.EVPSEL-781   ...   monitor file, a simplified log file
     esel781.TPT.EVPSEL-781   ...   print file, with TPT and task-related info
     esel781.PLT.EVPSEL-781   ...   plot file 
     esel781.JOB.EVPSEL-781   ...   the script which was submitted to LoadLeveler
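
The naming rules above can be written out as a short sketch. The function names default_job_id and temp_file_name are made up for this illustration and are not part of Compass. The lower-case conversion and the dropping of the "MPE-" prefix are read off the examples above.

```python
def default_job_id(task_id: str, tpt_id: str) -> str:
    """Suggested job identifier: first letter and last 3 letters of the
    task id, followed by the last (up to) 4 digits of the TPT number."""
    tpt_number = tpt_id.rsplit("-", 1)[-1]   # "MPE-USHTPT-12345" -> "12345"
    letters = task_id[0] + task_id[-3:]      # "USHTZZ" -> "UTZZ"
    return (letters + tpt_number[-4:]).lower()


def temp_file_name(job_id: str, kind: str, tpt_id: str) -> str:
    """File name in the temp directory; kind is LOG, MON, TPT, PLT or JOB."""
    tpt_suffix = tpt_id.split("-", 1)[1]     # "MPE-EVPSEL-781" -> "EVPSEL-781"
    return f"{job_id}.{kind}.{tpt_suffix}"


print(default_job_id("USHTZZ", "MPE-USHTPT-12345"))         # utzz2345
print(default_job_id("EVPRNN", "MPE-EVPRN3-9577"))          # ernn9577
print(temp_file_name("esel781", "LOG", "MPE-EVPSEL-781"))   # esel781.LOG.EVPSEL-781
```

Remember that the identifier is only a suggested default. If you type in your own, the file names follow whatever you entered.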
Batch queue: You specify a queue requirement by a short code consisting of letters. Look carefully at the help text (?) to see the meanings of the letters. The codes are given as menu items, so you can see the features of the different queues by entering '!'.

Host machine: You can specify where the job should run. If you leave this field blank (recommended) then LoadLeveler itself will see which machines support your chosen queue and will decide which one to use, based on machine-load.

Zephyr messages welcome?: By default, small windows will open in the upper-left corner of your screen giving a message when your job starts, ends, or appears to be stalled. These can be suppressed if you enter NO here. They come from a process which runs in parallel to your job, observing its status for you. (Ah, but this feature is currently disabled, sorry)


3. Monitoring a Compass Job

The job exists only as a JOB file in your temp directory until it starts running. Then a correspondingly named LOG file is opened, and for each task in the job (normally only one) a MON (monitor) file is opened. The monitor file is simpler and intended to mean something to the user, whereas the more detailed LOG file is intended more for system diagnostics. You can follow the progress of your jobs by looking at these files. Also (unless suppressed by you at job submission, or disabled, as it is at the moment) a process is started which runs in parallel to your job, observing its progress and sending Zephyr messages to your screen (if you are logged on) to inform you that the job has (a) started, (b) ended, or (c) stopped consuming CPU time, i.e. stalled in some way.

Also you can check progress by choosing option 8 of the Compass Public User Utilities. This gives you the Job Utilities menu.

Option 1 - Status - This gives an overview of the LoadLeveler queues (LoadLeveler deals in machines and classes - strictly speaking a queue is a class/machine combination). It shows the LoadLeveler queues, what is running in them, and what jobs are pending. There are low limits to the number of jobs allowed to run simultaneously for a given machine - or class - or user - or queue. The jobs wait in the pending list until an appropriate queue becomes free. Apart from this, some jobs may be put on Hold temporarily in order to improve on LoadLeveler's scheduling priorities.

Option 2 - Re-Submit - You may re-submit a killed or crashed job (choose from a list of job files), specifying again the class and/or host.

You can also use the LoadLeveler overview window, obtained by typing 'xll' or 'xloadl &'.

When the job has finished, you can look in your temp directory for corresponding LOG and MON files to tell you what has happened. If successful, there should also be a TPT file and maybe a PLT or TEM file. The filenames always begin with the job identifier and end like the TPT id.


4. Compass Data Environment

The data areas which are accessed by a Compass job are shown in the following table.
Machine      Type of files/directories    Location
-------      -------------------------    ----------------------------
mpesun4      local working directories    /batch/u/xxx/
             local Compass datasets       /batch/data/loc/

mpesun6      local working directories    /batch/u/xxx/
             local Compass datasets       /batch/data/loc/

mpesun10     home directories             /u/c/xxx/temp
             local Compass datasets       /batch/data/loc/
             Compass dataset pool         /batch/data/prd/    

mpesun13     local working directories    /batch/u/xxx/
             local Compass datasets       /batch/data/loc/

mpesun15     local working directories    /batch/u/xxx/
             local Compass datasets       /batch/data/loc/

bsc          local working directories    /batch/u/xxx/
             local Compass datasets       /batch/data/loc/ with links to:
             high-quality datasets        /batch/data/hq1/
                                          /batch/data/hq2/
sunrz2       Oracle database

(afs)        Compass object libraries,    /afs/ipp/mpe/compass/sbm/lib/
             executables                  /afs/ipp/mpe/compass/sbm/exe/
             Oracle libraries,executables /afs/ipp/@sys/soft/oracle/

(MR-AFS)     The Compass Production Datasets

Usually a data area is a directory on a disk attached to a machine, but we also make use of the Andrew File System (AFS), which is described in the Archive and File Systems of RZG (RechenZentrum Garching). In particular we use Multi-Resident-AFS (MR-AFS) as the mass storage medium for dataset archival.

Compass Datasets

The definitive versions of the datasets are in Multi-Resident-AFS. There is also a central data storage area (/batch/data/prd - readable from all machines, but attached to mpesun10) and a local data storage area for each of the 6 machines ( /batch/data/loc - same name, but 6 different disks).

When a Compass job runs, the input datasets are first brought into the local /batch/data/loc. A dataset that is not already there is copied from /batch/data/prd. If it is not there either, it is remote-copied from another machine's /batch/data/loc. If it is nowhere on-line, it is (a) copied from AFS directly to the local /batch/data/loc, which may take some time, and then (b) copied from /batch/data/loc to /batch/data/prd.
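
The search order for an input dataset can be sketched as follows. This is only an illustration of the order described above, not the actual Compass code. The function staging_plan, its arguments and the dataset name are assumptions made for the example, and the directory contents are modelled as plain Python sets.

```python
from typing import Dict, List, Set


def staging_plan(dataset: str,
                 local_loc: Set[str],
                 prd: Set[str],
                 other_locs: Dict[str, Set[str]]) -> List[str]:
    """Return the copy steps needed to bring one input dataset into the
    local /batch/data/loc, trying the nearest copy first."""
    if dataset in local_loc:
        return []                                   # already in place
    if dataset in prd:
        return ["copy /batch/data/prd -> local /batch/data/loc"]
    for host, contents in other_locs.items():
        if dataset in contents:
            return [f"remote-copy {host}:/batch/data/loc -> local /batch/data/loc"]
    # Nowhere on-line: fetch from AFS, then refill the central pool.
    return ["copy AFS -> local /batch/data/loc",
            "copy local /batch/data/loc -> /batch/data/prd"]


# Hypothetical dataset name, present only on another machine's local area.
steps = staging_plan("COMPASS.FBY.M0023342",
                     local_loc=set(),
                     prd=set(),
                     other_locs={"mpesun13": {"COMPASS.FBY.M0023342"}})
print(steps)
```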

The Compass program then runs, reading/writing the input/output datasets in the local /batch/data/loc, and reading/writing temporary files in the local /batch/u/xxx (for user xxx). This is intended to minimise network problems.

The output datasets, when closed, are copied within 5-10 minutes to /batch/data/prd, and within a further 5-10 minutes they should be stored in AFS.

Thus there are 7 possible places where a dataset might be, apart from AFS - in /batch/data/prd or one of the 6 /batch/data/loc areas. Each of these copies of the dataset is liable to be deleted at any time, if the area is running out of space, so long as it is not needed for a running job, and as long as it is safely stored in AFS. Certain datasets however are protected from deletion from /batch/data/loc on bsc, in particular the links to the high-quality datasets.

The above refers only to so-called 'production' datasets. Analogous to /batch/data/prd there is a directory /batch/data/dom for datasets which have been generated by test domain jobs. These datasets are recognisable by the name, which has a domain-id instead of 'COMPASS'. They are simply copied from /batch/data/loc into /batch/data/dom, and they are not stored in AFS.


5. A Compass Job Running

The processes involved in the run of a Compass Task are shown in the following diagram, for which you need a wide window.

A Compass Job consists of one or more Compass Tasks, running in succession but otherwise more or less independent of one another. The Compass Task itself consists of four main activities, and if any one of these fails then the task run is aborted:

Compass program build: The latest version of the required program is linked as an executable, taking the latest object libraries as guided by the Oracle database. The executable file is stored in the local /batch/u/xxx directory (where xxx is the user id) to be run as the third activity of the task. Its name ('progname' in the diagram) is derived from a number of identifiers, for example

             mpesun6.9682.c9828.PRG.EVPRN3-10509 

         where mpesun6.9682 is the LoadLeveler job-id,
              c9828 is the Compass job id 
         and  MPE-EVPRN3-10509 is the TPT id
Also consulting the database, lists of input and output datasets are written to /batch/u/xxx. The names of these lists are prefixed with the program name.
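
For anyone writing their own scripts to look through /batch/u/xxx, the program name can be split back into its parts. The sketch below assumes that every name has exactly the five dot-separated fields of the example above, and that the host part contains no dots. The names ProgramName and parse_program_name are made up for this illustration.

```python
from typing import NamedTuple


class ProgramName(NamedTuple):
    loadleveler_job_id: str   # e.g. "mpesun6.9682"
    compass_job_id: str       # e.g. "c9828"
    tpt_id: str               # e.g. "MPE-EVPRN3-10509"


def parse_program_name(progname: str) -> ProgramName:
    """Split a name like mpesun6.9682.c9828.PRG.EVPRN3-10509 into its parts."""
    host, ll_number, compass_job, kind, tpt_suffix = progname.split(".")
    if kind != "PRG":
        raise ValueError(f"not a program name: {progname}")
    return ProgramName(f"{host}.{ll_number}", compass_job, "MPE-" + tpt_suffix)


parts = parse_program_name("mpesun6.9682.c9828.PRG.EVPRN3-10509")
print(parts.loadleveler_job_id, parts.compass_job_id, parts.tpt_id)
```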

Dataset pre-processor: Reading the list of input datasets, a check is made that they are available in the local /batch/data/loc directory. If not, a list of missing datasets is written to /batch/u/xxx and a process is spawned to copy these from /batch/data/prd (or /batch/data/dom) or from another /batch/data/loc or, if necessary, directly from AFS. By the end of this activity, all input datasets must be available in /batch/data/loc.

Compass program: The executable runs in /batch/u/xxx, reading its input datasets from, and writing its output datasets to, the local /batch/data/loc directory. Each output dataset, as soon as it has been closed for write, is accompanied by a small file with the same name suffixed by '.outfile', which is used to guide the archiving processes.
The program also writes log files to the home temp directory and a number of other files to its current directory /batch/u/xxx. If the run is successful, some of the latter are moved to the home temp directory as well.

Tidy: If the executable has run successfully, all relevant files in /batch/u/xxx are deleted, some being copied first to the home temp directory. In particular the executable itself is deleted. The Compass Task Run is then complete.

Archiving

Independent of the task itself, background tasks are running on each of the machines to see, every 5 minutes, if a file has appeared in /batch/data/loc with the suffix '.outfile'. This indicates that a dataset has been closed for write, and can now be copied to /batch/data/prd. If the copy is successful, the '.outfile' file is also copied to /batch/data/prd and deleted from /batch/data/loc. The dataset is now available to all machines, but has not yet been archived.

Every 5 minutes on mpesun10, a further background task is running to see if a file has appeared in /batch/data/prd with the suffix '.outfile'. The corresponding dataset is copied to AFS (possibly tarred), and the '.outfile' file is deleted.
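
The first of these two steps (local area to central pool) can be sketched as one pass of such a background task. This is a simplified illustration and not the real archiving script. It assumes that each dataset is a single plain file, and the name publish_closed_datasets and the dataset names in the demonstration are invented. The demonstration runs in a scratch directory, not in the real /batch/data areas.

```python
import shutil
import tempfile
from pathlib import Path

MARKER = ".outfile"


def publish_closed_datasets(loc_dir: Path, prd_dir: Path) -> list:
    """One pass: copy every dataset that has a '.outfile' marker from
    loc_dir to prd_dir, then move the marker along with it."""
    published = []
    for marker in sorted(loc_dir.glob("*" + MARKER)):
        dataset = loc_dir / marker.name[:-len(MARKER)]
        if not dataset.is_file():
            continue
        try:
            shutil.copy2(dataset, prd_dir / dataset.name)
        except OSError:
            continue          # copy failed: keep the marker, retry next pass
        shutil.copy2(marker, prd_dir / marker.name)
        marker.unlink()       # the dataset itself stays in loc_dir
        published.append(dataset.name)
    return published


# Demonstration with hypothetical dataset names.
with tempfile.TemporaryDirectory() as tmp:
    loc, prd = Path(tmp, "loc"), Path(tmp, "prd")
    loc.mkdir()
    prd.mkdir()
    (loc / "COMPASS.FBY.M0023342").write_text("closed dataset")
    (loc / "COMPASS.FBY.M0023342.outfile").write_text("")
    (loc / "COMPASS.FBY.M0023343").write_text("still being written")
    published = publish_closed_datasets(loc, prd)
    left_in_prd = sorted(p.name for p in prd.iterdir())
    left_in_loc = sorted(p.name for p in loc.iterdir())

print(published)
```

The marker is removed from the local area only after the copy has succeeded, so a failed copy is simply retried on the next pass.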

Deleting

Independent of the above, background tasks are running on each of the machines to see, every 5 minutes, if the size of the /batch/data/loc directory has exceeded a certain fixed number, or if the disk has exceeded a certain percentage full. In either case, datasets are deleted in order of age (oldest first) (unless protected) until both disk percentage and directory size are below the required limits. Datasets accompanied by a '.outfile', and datasets on any of the lists of input datasets for running programs, are of course excluded from being deleted.

Every 5 minutes on mpesun10, an analogous background task is running to see if /batch/data/prd or /batch/data/dom is too full, in which case datasets are deleted as necessary, oldest first.
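
The deletion rule for a local area can be sketched as below. The limits, sizes and dataset names are invented for the example, and the real background task may differ in detail. The sketch only shows the order of deletion (oldest first), the exclusions, and the stopping condition (both limits satisfied).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dataset:
    name: str
    size: int                  # in arbitrary units, e.g. MB
    mtime: float               # smaller value = older
    protected: bool = False    # e.g. links to high-quality datasets on bsc
    has_outfile: bool = False  # not yet copied on by the archiving task
    in_use: bool = False       # on the input list of a running program


def datasets_to_delete(datasets: List[Dataset], dir_limit: int,
                       disk_used: int, disk_size: int,
                       max_percent: float) -> List[str]:
    """Pick datasets to delete, oldest first, until the directory size and
    the disk percentage are both within their limits."""
    dir_used = sum(d.size for d in datasets)
    candidates = sorted(
        (d for d in datasets
         if not (d.protected or d.has_outfile or d.in_use)),
        key=lambda d: d.mtime)
    doomed = []
    for d in candidates:
        if dir_used <= dir_limit and 100 * disk_used / disk_size <= max_percent:
            break
        doomed.append(d.name)
        dir_used -= d.size
        disk_used -= d.size
    return doomed


area = [Dataset("a", 40, mtime=1),
        Dataset("b", 30, mtime=2, protected=True),
        Dataset("c", 20, mtime=3),
        Dataset("d", 10, mtime=4, has_outfile=True)]
doomed = datasets_to_delete(area, dir_limit=50, disk_used=100,
                            disk_size=200, max_percent=90)
print(doomed)
```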


6. Compass Test Domains

To use a Compass test domain you need, in addition to the above, a file in your home directory called .compass.domain which will be executed as soon as you enter Compass. This sets the environment variables for your domain directories. For example, if user cr2 has a directory xx with sub-directories src, inc and obj for his source libraries, include libraries and object files, then the .compass.domain file would be:
    setenv DOMAIN_SRC /mpe/u/c/cr2/xx/src 
    setenv DOMAIN_OBJ /mpe/u/c/cr2/xx/obj  
    setenv DOMAIN_INCL /mpe/u/c/cr2/xx/inc  
The environment variable names are fixed. Without a .compass.domain file you cannot choose the Developer Utilities option.

Domain datasets are named after the domain which generated them, and are stored in the local /batch/data/loc and copied (not tarred) into a central storage area (shared by all domains) called /batch/data/dom. They are not stored in AFS, and are liable to be deleted when space runs out in these areas.

e.g. if domain DOMDOM generated MPE-FBY-23342, it is stored as

           /batch/data/loc/DOMDOM.FBY.M0023342   and copied to          
           /batch/data/dom/DOMDOM.FBY.M0023342 
More details on test domains can be given later when required.
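
The file name in the example above can be reproduced with the following sketch. The pattern (the letter M followed by the number padded with zeros to 7 digits) is inferred from that single example and should be treated as an assumption. The function name domain_dataset_file is made up for this illustration.

```python
def domain_dataset_file(domain: str, dataset_id: str) -> str:
    """Build the file name for a dataset id like MPE-FBY-23342 generated
    in the given domain (pattern inferred from one example only)."""
    _site, dataset_type, number = dataset_id.split("-")
    return f"{domain}.{dataset_type}.M{int(number):07d}"


name = domain_dataset_file("DOMDOM", "MPE-FBY-23342")
print("/batch/data/loc/" + name)
print("/batch/data/dom/" + name)
```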


7. Command Line Interface

There is an alternative way of calling the User Shell, whereby the input is given in a line-by-line dialogue instead of screen-by-screen. The (Unix-like) command is
                     ushc  
which means "User SHell Command line interface". It works much the same way as the normal User Shell, except that the Form Control Character comes first (so that control characters like 'x' can be obeyed immediately). It still deals with each whole form as a unit, after reading in the parameters (but now there is only one page, and the work zone, if you need it, appears completely).

For each field, the prompt and default (or last-typed) value appear, then it waits for you to (a) type in a value or (b) hit Enter to leave the value unchanged, then it echoes back the prompt with the new value (unless it is your password). For routine job-submission, a file of prepared replies can be redirected as input, and the output can be redirected as a log file.

Please note the following:

1. A line of spaces is taken as an empty line, so if you wish to actually set the field to blank, enter a line consisting of blanks terminated by a non-blank character beyond the field-width. Any characters beyond the field-width will be ignored, so you can place comments there.

2. The following Form Control Characters are obeyed without waiting for, or requiring, the rest of the form: x b c d f h l s w m ?

3. x takes you out completely, not back to the logon screen.

4. Input datasets with quality < 100 are accepted without confirmation.

If OK (and this means that each form in the session is OK) ushc exits with status 0; otherwise error-messages appear for each error in the bad form (not just the first one, as now) and ushc exits with status 1.
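
For routine submission from a script, the points above might be combined as in the following sketch. It assumes only what is stated here: ushc reads its replies from standard input, writes its dialogue to standard output, and exits with status 0 or 1. The helper names blank_reply and run_ushc, the '#' terminator and the field width of 8 are chosen for the example and are not part of Compass.

```python
import subprocess


def blank_reply(field_width: int, comment: str = "") -> str:
    """Reply line that sets a field to blank: blanks up to the field width,
    then a non-blank character (and optional comment) beyond it."""
    return " " * field_width + "#" + comment


def run_ushc(replies_path: str, log_path: str) -> int:
    """Run ushc with a prepared reply file as input and a log file as
    output. Returns the exit status: 0 if every form was OK, otherwise 1."""
    with open(replies_path) as replies, open(log_path, "w") as log:
        return subprocess.run(["ushc"], stdin=replies, stdout=log).returncode


# A reply line that blanks a (hypothetical) 8-character field.
line = blank_reply(8, " host left blank")
print(repr(line))
```

A calling script would build the reply file line by line, call run_ushc, and look in the log file for the error messages if the returned status is 1.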


We are still working on this information sheet

cmg@mpe-garching.mpg.de