This document is aimed at local MPE Compass Users, in particular beginners. General concepts are covered elsewhere in the comprehensive COMPASS Introductory Guide so the emphasis here is on the details of the MPE implementation.
What you need in your Sun environment:
setenv PATH "${PATH}:/afs/ipp/mpe/compass/sbm/usr/bin"
XTerm*sunFunctionKeys: off
mpesun4.mpe-garching.mpg.de
mpesun6.mpe-garching.mpg.de
mpesun10.mpe-garching.mpg.de
mpesun13.mpe-garching.mpg.de
mpesun15.mpe-garching.mpg.de
bsc.rzg.mpg.de
Batch job identifier: You have to enter a job identifier in this form, but a default is suggested, based on the task id and TPT number. It consists of the first letter and the last 3 letters of the task id, followed by the last (up to) 4 digits of the TPT number.
e.g. for task USHTZZ with TPT id MPE-USHTPT-12345 : utzz2345
or for task EVPRNN with TPT id MPE-EVPRN3-9577 : ernn9577
This identifier lets you recognise your job in the job utilities and in the corresponding filenames. The job identifier which you enter here is used when constructing the names of many of the temporary files and log files, e.g. if you have esel781 for a job with TPT id MPE-EVPSEL-781, then the following files may appear in your temp directory:
esel781.LOG.EVPSEL-781 ... log file, similar to job output on CMS
esel781.MON.EVPSEL-781 ... monitor file, a simplified log file
esel781.TPT.EVPSEL-781 ... print file, with TPT and task-related info
esel781.PLT.EVPSEL-781 ... plot file
esel781.JOB.EVPSEL-781 ... the script which was submitted to LoadLeveler
Batch queue: You specify a queue requirement by a short code consisting of letters. Look carefully at the help text (?) to see the meanings of the letters. The codes are given as menu items, so you can see the features of the different queues by entering '!'.
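The default job identifier rule and the temp-file naming described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the lower-casing is inferred from the examples, and the file suffix is assumed to be the TPT id without its leading "MPE-", as in the esel781 listing.

```python
def default_job_id(task_id: str, tpt_id: str) -> str:
    """First letter + last 3 letters of the task id, then the last
    (up to) 4 digits of the TPT number. Lower case is an assumption
    taken from the examples in the text."""
    number = tpt_id.rsplit("-", 1)[-1]          # e.g. "12345"
    if not number.isdigit():
        raise ValueError(f"no TPT number found in {tpt_id!r}")
    prefix = (task_id[0] + task_id[-3:]).lower()
    return prefix + number[-4:]


def temp_file_names(job_id: str, tpt_id: str) -> list:
    """Names of the files that may appear in the temp directory.
    Assumes the suffix is the TPT id with the first component removed."""
    suffix = tpt_id.split("-", 1)[1]            # "MPE-EVPSEL-781" -> "EVPSEL-781"
    kinds = ("LOG", "MON", "TPT", "PLT", "JOB")
    return [f"{job_id}.{kind}.{suffix}" for kind in kinds]


if __name__ == "__main__":
    job_id = default_job_id("EVPSEL", "MPE-EVPSEL-781")
    for name in temp_file_names(job_id, "MPE-EVPSEL-781"):
        print(name)
```

You remain free to type a different identifier in the form; the sketch only reproduces the suggested default.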
Host machine: You can specify where the job should run. If you leave this field blank (recommended) then LoadLeveler itself will see which machines support your chosen queue and will decide which one to use, based on machine-load.
Zephyr messages welcome?: By default, small windows will open in the upper-left corner of your screen giving a message when your job starts, ends, or appears to be stalled. These can be suppressed if you enter NO here. They come from a process which runs in parallel to your job, observing its status for you. (Ah, but this feature is currently disabled, sorry)
Also you can check progress by choosing option 8 of the Compass Public User Utilities. This gives you the Job Utilities menu.
Option 1 - Status - This gives an overview of the LoadLeveler queues (LoadLeveler deals in machines and classes - strictly speaking a queue is a class/machine combination). It shows the LoadLeveler queues, what is running in them, and what jobs are pending. There are low limits to the number of jobs allowed to run simultaneously for a given machine - or class - or user - or queue. The jobs wait in the pending list until an appropriate queue becomes free. Apart from this, some jobs may be put on Hold temporarily in order to improve on LoadLeveler's scheduling priorities.
Option 2 - Re-Submit - You may re-submit a killed or crashed job (choose from a list of job files), specifying again the class and/or host.
You can also use the LoadLeveler overview window, obtained by typing 'xll' or 'xloadl &'.
When the job has finished, you can look in your temp directory for corresponding LOG and MON files to tell you what has happened. If successful, there should also be a TPT file and maybe a PLT or TEM file. The filenames always begin with the job identifier and end like the TPT id.
Machine   type of files/directories      directories
-------   ----------------------------   ----------------------------
mpesun4   local working directories      /batch/u/xxx/
          local Compass datasets         /batch/data/loc/
mpesun6   local working directories      /batch/u/xxx/
          local Compass datasets         /batch/data/loc/
mpesun10  home directories               /u/c/xxx/temp
          local Compass datasets         /batch/data/loc/
          Compass dataset pool           /batch/data/prd/
mpesun13  local working directories      /batch/u/xxx/
          local Compass datasets         /batch/data/loc/
mpesun15  local working directories      /batch/u/xxx/
          local Compass datasets         /batch/data/loc/
bsc       local working directories      /batch/u/xxx/
          local Compass datasets         /batch/data/loc/
          with links to:
          high-quality datasets          /batch/data/hq1/
                                         /batch/data/hq2/
sunrz2    Oracle database
(afs)     Compass object libraries,      /afs/ipp/mpe/compass/sbm/lib/
          executables                    /afs/ipp/mpe/compass/sbm/exe/
          Oracle libraries,executables   /afs/ipp/@sys/soft/oracle/
(MR-AFS)  The Compass Production Datasets
Usually a data area is a directory on a disk attached to a machine, but we also make use of the Andrew File System (AFS), which is described in the Archive and File Systems of RZG (RechenZentrum Garching). In particular we use Multi-Resident-AFS (MR-AFS) as the mass storage medium for dataset archival.
When a Compass job runs, the input datasets are first brought into the local /batch/data/loc. If not there already, they are copied from /batch/data/prd, and if not there then they are remote-copied from another /batch/data/loc, but if nowhere on-line then they are (a) copied from AFS directly to /batch/data/loc, which may take some time, and then (b) copied from /batch/data/loc to /batch/data/prd.
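The search order described above can be written down as a small decision function. This is a sketch only: the function name and the use of plain sets of dataset names to stand for directory contents are assumptions made for illustration, not part of the Compass scripts.

```python
from typing import Iterable, List, Set


def staging_plan(dataset: str,
                 local_loc: Set[str],
                 prd: Set[str],
                 other_locs: Iterable[Set[str]]) -> List[str]:
    """Return the copy steps needed to make `dataset` available in the
    local /batch/data/loc, following the order given in the text."""
    if dataset in local_loc:
        return []                                   # already in place
    if dataset in prd:
        return ["copy /batch/data/prd -> local /batch/data/loc"]
    if any(dataset in loc for loc in other_locs):
        return ["remote-copy another /batch/data/loc -> local /batch/data/loc"]
    # Nowhere on-line: fetch from AFS (slow), then refresh the pool.
    return ["copy AFS -> local /batch/data/loc",
            "copy local /batch/data/loc -> /batch/data/prd"]


if __name__ == "__main__":
    print(staging_plan("COMPASS.FBY.M0023342", set(), set(), [set()]))
```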
The Compass program then runs, reading/writing the input/output datasets in the local /batch/data/loc, and reading/writing temporary files in the local /batch/u/xxx (for user xxx). This is intended to minimise network problems.
The output datasets, when closed, are copied within 5-10 minutes to /batch/data/prd, and within a further 5-10 minutes they should be stored in AFS.
Thus there are 7 possible places where a dataset might be, apart from AFS - in /batch/data/prd or one of the 6 /batch/data/loc areas. Each of these copies of the dataset is liable to be deleted at any time, if the area is running out of space, so long as it is not needed for a running job, and as long as it is safely stored in AFS. Certain datasets however are protected from deletion from /batch/data/loc on bsc, in particular the links to the high-quality datasets.
The above refers only to so-called 'production' datasets. Analogous to /batch/data/prd there is a directory /batch/data/dom for datasets which have been generated by test domain jobs. These datasets are recognisable by the name, which has a domain-id instead of 'COMPASS'. They are simply copied from /batch/data/loc into /batch/data/dom, and they are not stored in AFS.
Compass program build: The latest version of the required program is linked as an executable, taking the latest object libraries as guided by the Oracle database. The executable file is stored in the local /batch/u/xxx directory (where xxx is the user id) to be run as the third activity of the task. Its name ('progname' in the diagram) is derived from a number of identifiers, for example
mpesun6.9682.c9828.PRG.EVPRN3-10509
where mpesun6.9682 is the LoadLeveler job-id, c9828 is the Compass job id and MPE-EVPRN3-10509 is the TPT id.
Also consulting the database, lists of input and output datasets are written to /batch/u/xxx. The names of these lists are prefixed with the program name.
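To make the structure of the program name concrete, the example above can be split back into its identifiers. The sketch below assumes that the literal marker "PRG" always separates the job ids from the TPT part and that the TPT id is recovered by putting "MPE-" back in front; both are inferred from the single example, and the names used are hypothetical.

```python
from typing import NamedTuple


class ProgramName(NamedTuple):
    loadleveler_job_id: str
    compass_job_id: str
    tpt_id: str


def parse_program_name(name: str) -> ProgramName:
    """Split e.g. 'mpesun6.9682.c9828.PRG.EVPRN3-10509' into its parts."""
    head, marker, tail = name.partition(".PRG.")
    if not marker:
        raise ValueError(f"not a program name: {name!r}")
    # The Compass job id is the last dot-separated field before 'PRG';
    # everything before it is the LoadLeveler job-id.
    ll_job, _, compass_job = head.rpartition(".")
    return ProgramName(ll_job, compass_job, "MPE-" + tail)


if __name__ == "__main__":
    print(parse_program_name("mpesun6.9682.c9828.PRG.EVPRN3-10509"))
```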
Dataset pre-processor: Reading the list of input datasets, a check is made that they are available in the local /batch/data/loc directory. If not, a list of missing datasets is written to /batch/u/xxx and a process is spawned to copy these from /batch/data/prd (or /batch/data/dom) or from another /batch/data/loc or, if necessary, directly from AFS. By the end of this activity, all input datasets must be available in /batch/data/loc.
Compass program: The executable runs in /batch/u/xxx, reading its input datasets from, and writing its output dataset to, the local /batch/data/loc directory. Each output dataset, as soon as it has been closed for write, is accompanied by a small file with the same name suffixed by '.outfile', which is used to guide the archiving processes.
The program also writes log files to the home temp directory and a number of files to its current directory /batch/u/xxx, some of which are also moved to the home temp directory if the run is successful.
Tidy: If the executable has run successfully, all relevant files in /batch/u/xxx are deleted, some being copied first to the home temp directory. In particular the executable itself is deleted. The Compass Task Run is then complete.
Every 5 minutes on mpesun10, a further background task checks whether a file with the suffix '.outfile' has appeared in /batch/data/prd. If so, the corresponding dataset is copied to AFS (possibly tarred), and the '.outfile' file is deleted.
Every 5 minutes on mpesun1, an analogous background task checks whether /batch/data/prd or /batch/data/dom is too full, in which case datasets are deleted as necessary, oldest first.
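The "oldest first" clean-up can be pictured as follows. This is a minimal sketch, not the actual background task: the record type, the use of modification time as the measure of age, and the `bytes_to_free` parameter are all assumptions, and the checks mentioned earlier (dataset not needed by a running job, safely stored in AFS) are left out.

```python
from typing import List, NamedTuple


class Dataset(NamedTuple):
    name: str
    mtime: float   # modification time, seconds since the epoch (assumed age measure)
    size: int      # size in bytes


def choose_deletions(datasets: List[Dataset], bytes_to_free: int) -> List[str]:
    """Pick datasets to delete, oldest first, until enough space is freed."""
    chosen = []
    freed = 0
    for ds in sorted(datasets, key=lambda d: d.mtime):
        if freed >= bytes_to_free:
            break
        chosen.append(ds.name)
        freed += ds.size
    return chosen


if __name__ == "__main__":
    pool = [Dataset("a", 300.0, 10), Dataset("b", 100.0, 10), Dataset("c", 200.0, 10)]
    print(choose_deletions(pool, 15))
```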
setenv DOMAIN_SRC /mpe/u/c/cr2/xx/src
setenv DOMAIN_OBJ /mpe/u/c/cr2/xx/obj
setenv DOMAIN_INCL /mpe/u/c/cr2/xx/inc
The environment variable names are fixed. Without a .compass.domain file you cannot choose the Developer Utilities option.
Domain datasets are named after the domain which generated them, and are stored in the local /batch/data/loc and copied (not tarred) into a central storage area (shared by all domains) called /batch/data/dom. They are not stored in AFS, and are liable to be deleted when space runs out in these areas.
e.g. if domain DOMDOM generated MPE-FBY-23342, it is stored as
/batch/data/loc/DOMDOM.FBY.M0023342
and copied to
/batch/data/dom/DOMDOM.FBY.M0023342
More details on test domains can be given later when required.
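The naming example above suggests a simple mapping from dataset id to file name. The sketch below is a guess generalised from that single example: it assumes the letter before the number is the first letter of the "MPE" prefix and that the number is zero-padded to 7 digits. Neither rule is stated in the text, and the function name is hypothetical.

```python
def domain_dataset_paths(domain: str, dataset_id: str) -> tuple:
    """Map e.g. ('DOMDOM', 'MPE-FBY-23342') to its local and central paths.
    The 'M' + 7-digit padding is inferred from one example only."""
    site, dtype, number = dataset_id.split("-")
    filename = f"{domain}.{dtype}.{site[0]}{int(number):07d}"
    return (f"/batch/data/loc/{filename}", f"/batch/data/dom/{filename}")


if __name__ == "__main__":
    for path in domain_dataset_paths("DOMDOM", "MPE-FBY-23342"):
        print(path)
```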
ushc
which means "User SHell Command line interface". It works much the same way as the normal User Shell, except that the Form Control Character comes first (so that control characters like 'x' can be obeyed immediately). It still deals with each whole form as a unit, after reading in the parameters (but now there is only one page, and the work zone, if you need it, appears completely).
For each field, the prompt and default (or last-typed) value appear, then it waits for you to (a) type in a value or (b) hit Enter to leave the value unchanged, then it echoes back the prompt with the new value (unless it is your password). For routine job submission, a file of prepared replies can be redirected as input, and the output can be redirected as a log file.
Please note the following:
1. Spaces are taken as an empty line, so if you wish to actually set the field to blank, then enter a line consisting mostly of blanks but terminated by a non-blank character beyond the field-width. Any characters beyond the field-width will be ignored, so you can place comments there.
2. The following Form Control Characters are obeyed without waiting for, or requiring, the rest of the form: x b c d f h l s w m ?
3. x takes you out completely, not back to the logon screen.
4. Input datasets with quality < 100 are accepted without confirmation.
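When preparing a file of replies, note 1 above is the easiest to get wrong. The helpers below sketch how such reply lines could be built; the function names are hypothetical, and the field width must be supplied by you since it depends on the form.

```python
def blank_reply(field_width: int, comment: str = "!") -> str:
    """A reply line that sets a field to blank: blanks across the whole
    field, terminated by a non-blank character beyond the field width."""
    if field_width < 1:
        raise ValueError("field_width must be positive")
    if not comment or comment[0].isspace():
        raise ValueError("comment must start with a non-blank character")
    return " " * field_width + comment


def reply_with_comment(value: str, field_width: int, comment: str) -> str:
    """A normal reply padded to the field width, followed by a comment.
    Characters beyond the field width are ignored by ushc."""
    if len(value) > field_width:
        raise ValueError("value is longer than the field")
    return value.ljust(field_width) + comment


if __name__ == "__main__":
    print(repr(blank_reply(8, "! host left blank")))
    print(repr(reply_with_comment("esel781", 8, "! job identifier")))
```

An entirely empty line, by contrast, leaves the field at its default or last-typed value.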
If OK (and this means that each form in the session is OK) ushc exits with status 0; otherwise error-messages appear for each error in the bad form (not just the first one, as now) and ushc exits with status 1.
cmg@mpe-garching.mpg.de