The Care and Feeding of SLDPM

SLDPM is the automated server on SLACVM that processes data after a run is completed, running it through various checks and the PASS1/PASS2 filters. SLDPM also runs a nightly job that examines all the day's data using vertexing information. This is usually referred to as the 'daily summary job'.

SLDPM was written by Len Moss. The server and how to maintain it are partly documented. This document is directed to the lucky folks who get to mind SLDPM as it runs and deal with the day-to-day trials and tribulations.

Manning the Watch

Watching SLDPM is broken down into two period types: weekdays, and evenings & weekends. Typically one person does the weekdays and another person covers the rest. Obviously, the goal is good turnaround time. Quick turnaround on the first stage jobs (AIS) gives the 'official' Z count. This is watched closely by both SLD and SLC folks. It seems to be psychologically important that this number be up to date. The following stages (PASS1 & PASS2) feed into the data quality checking.

It is recommended that spot checks of the various monitoring tools be done every few hours during the weekdays, a couple of times during the evening, and 3 or 4 times per day on weekends. The evening and weekend checks should include one in the late evening to ensure that things can run smoothly through the owl shift.

Tools

The main tools to watch SLDPM with are:

What	Description
Missing Runs	which runs have not been processed, have failed or are still active; by category.
Run Status p1	run summary from online, indicating current run number and tape in use
Status p15	Status display showing batch queues of active and recently completed jobs grouped by category
Status p12	Status display showing a summary of the last job run in each category
Recent Runs	Webtables display of the various run categories for the past three days. This gives you access to the LOG files from the web.
Recent Days	Webtables display of the daily summary information for the past three days. This gives you access to the LOG files from the web.
Status p14	Status display showing the Z/day summary and SLDPM last alive times

SLDPM tells its watchers of problems via the mailing list SLDPM-NOTIFY. This is an archived list that the various watchers also use to keep each other abreast of the status of problems and their fixes. Watchers communicate with SLDPM via Bitnet messages, either from VM or from one of the VAXes. "SEND SLDPM@SLACVM ?" will give a list of the commands that it recognizes.

Watchdogs

Several watchdogs, which can generate pages, are in place to keep an eye on operations. These are run as cron jobs on AIXCRON from the account richard. The idea is that these watchdogs will relieve you of the burden of worrying about these failure modes (until they happen!). All these watchdog alerts are recorded on the SLACVX system log page.

SLDPM alive: half-hourly checks that SLDPM is operational. SLDPM writes a file every 10 minutes to VMS. This is checked for age. If SLDPM is logged on, no alarm is raised.
SLDPM output disk has space: checks hourly that there are at least 500k blocks on the DPA2 output disk.
Offline Oracle mirror up to date: 3-hourly checks that the unix online Oracle database matches the SLD VAX version.
User disk has space: checks twice per day that $usr has > 50k blocks available
Production tapes: checks daily that there are at least 50 PROD tapes in the pool.

SLDPM itself will issue pages when:
- there is a tape problem
- a run occurs in which any subsystem sends no data at all.
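
The watchdog conditions listed above can be summarised in a short Python sketch. This is not the actual cron code run from the richard account; it only restates the thresholds given in the text. The function and key names are hypothetical, and the 30-minute maximum file age is an assumption (the text says only that SLDPM writes the file every 10 minutes and that the check runs half-hourly).

```python
# Illustrative sketch only: names are hypothetical, thresholds are from the text.

def sldpm_alive_alarm(file_age_minutes, logged_on, max_age_minutes=30):
    """Return True if the 'SLDPM alive' check should raise an alarm.

    max_age_minutes=30 is an assumed cutoff, not a documented value.
    """
    if logged_on:
        # Per the text: if SLDPM is logged on, no alarm is raised.
        return False
    return file_age_minutes > max_age_minutes


# name -> (limit, strict); strict=True means the value must exceed the limit.
RESOURCE_LIMITS = {
    "dpa2_blocks": (500_000, False),  # at least 500k blocks on DPA2
    "usr_blocks": (50_000, True),     # more than 50k blocks on $usr
    "prod_tapes": (50, False),        # at least 50 PROD tapes in the pool
}


def low_resources(readings):
    """Return the names of resources that are below their limits."""
    low = []
    for name, (limit, strict) in RESOURCE_LIMITS.items():
        value = readings[name]
        ok = value > limit if strict else value >= limit
        if not ok:
            low.append(name)
    return low
```

For example, low_resources({"dpa2_blocks": 600000, "usr_blocks": 40000, "prod_tapes": 80}) reports only the user disk as low.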

How to Fix

Here is a list of currently typical failure modes and what one does about them:

fetch failures

Fails to fetch files from VAX

Problem	there is an intermittent failure mode wherein the ftp transfer of result files from VMS to VM fails.
Fix	refetch the job: SEND SLDPM@SLACVM xxxFET job-name. Legal jobnames are AISrrrrr, CVDrrrrr, FLTrrrrr, RECrrrrr, DSPrrrrr, where rrrrr is the run number.
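
Since a mistyped job name is an easy slip, here is a minimal Python sketch that checks a job name against the legal forms listed above and builds the refetch message text. The helper names are hypothetical, and the sketch assumes that rrrrr is always exactly five digits. The xxx prefix is passed in by the caller rather than guessed, and the run number in the usage note is made up.

```python
import re

# Legal job names per the text: a job type followed by the run number rrrrr.
# Assumption: the run number is exactly five digits.
JOB_NAME = re.compile(r"(AIS|CVD|FLT|REC|DSP)(\d{5})")


def parse_job_name(job_name):
    """Split a job name into (job_type, run_number); raise ValueError if illegal."""
    match = JOB_NAME.fullmatch(job_name.strip().upper())
    if match is None:
        raise ValueError(f"not a legal SLDPM job name: {job_name!r}")
    return match.group(1), int(match.group(2))


def refetch_command(prefix, job_name):
    """Build the refetch message; prefix is the 'xxx' in xxxFET, supplied by the caller."""
    parse_job_name(job_name)  # validate before building the command
    return f"SEND SLDPM@SLACVM {prefix}FET {job_name.strip().upper()}"
```

With a made-up run number, refetch_command("REC", "REC12345") gives the text "SEND SLDPM@SLACVM RECFET REC12345".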

Unrecognized error message in log file

Problem

there is a message in the log file, with syntax like that of the official error messages, that SLDPM has not seen before (and so has not been blessed as innocuous).

Fix

If the message is indeed innocuous, refetch the job: SEND SLDPM@SLACVM xxxFET job-name ( IGNORE MSGS

If the message involves a tape error that you believe can be ignored, the ignore option is DMOUNTE.
If there is a fatal IDA return code in the job that is to be ignored, the option is IDASGLISS.
If the error is 'cannot connect to SLDTMS', the option is NOTMS.
These options can be combined.

job crashed in SLD code

Try to identify from the job log file which code failed (e.g. in KAL) and notify the appropriate person to pursue it:

Subsystem	Who to contact
VTX	Su Dong
CDC	Leon Rochester
EDC	Sal Fahey
KAL	Richard Dubois
WIC	Giampiero Mancinelli
LUM	Matt Langston

Jobs can also fail if they cannot get access to the SLDTMS servers. If the servers are dead and no one is around, there are instructions for how to get them going again.

once the code problem is fixed, the job is deleted from the database and resubmitted.

SEND SLDPM@SLACVM deljob job-name
- jobnames are AISrrrrr, CVDrrrrr, FLTrrrrr, RECrrrrr, DSPrrrrr

SEND SLDPM@SLACVM xxxsub dataset_name

this can get a little tricky to remember:

Job type	xxx	Input dataset
AIS	AQI	ACQS
FLT	FLT	ACQS (or RAWS if still on disk)
REC	REC	HAD
DSP	DSP	REC
CVD	CVD	ACQS(?)

Note: if the code problem looks like it is due to noisy conditions or is not a fundamental problem, we often resubmit the job to the VAX by appending the option VAX to the submit command, e.g.
- SEND SLDPM@SLACVM xxxsub dataset_name (VAX
Another ploy is to edit the .COM files for the jobs and have them skip the offending event. Each of the .COM files already contains code to delete past bad events. The files are ACQIN.COM, FILTER.COM and RECON.COM, all in $usr:[sldpm.dp.ida].
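
Because the job type to xxx mapping is easy to forget, the delete-and-resubmit step can be written down as a small Python lookup. This is a sketch only: the function name is hypothetical, the table is copied from the one above (including its uncertain CVD entry), and the dataset name is whatever the watcher supplies.

```python
# Job type -> (xxx prefix, input dataset type), copied from the table in the text.
RESUBMIT = {
    "AIS": ("AQI", "ACQS"),
    "FLT": ("FLT", "ACQS"),  # or RAWS if still on disk
    "REC": ("REC", "HAD"),
    "DSP": ("DSP", "REC"),
    "CVD": ("CVD", "ACQS"),  # marked "(?)" in the table, so treat as uncertain
}


def resubmit_commands(job_name, dataset_name, on_vax=False):
    """Return the deljob and xxxsub message texts for a failed job.

    dataset_name is supplied by the caller; this sketch does not derive it.
    """
    job_type = job_name[:3].upper()
    if job_type not in RESUBMIT:
        raise ValueError(f"unknown job type in job name: {job_name!r}")
    prefix, _input_type = RESUBMIT[job_type]
    submit = f"SEND SLDPM@SLACVM {prefix}sub {dataset_name}"
    if on_vax:
        submit += " (VAX"  # the VAX option described in the note above
    return [f"SEND SLDPM@SLACVM deljob {job_name}", submit]
```

For an AIS job the prefix comes out as AQI, which is the case most likely to be mistyped by hand.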

SLDPM server hung or crashed
- The server is dead or hung if the watchdog bleats or SLDPM doesn't respond to commands. Recall that p14 of Status shows the last time SLDPM was alive.
  - Log on to SLDPM on VM (the password is in the unix passwords area, $SLDROOT/etc/accounts). If it is in a DISConnected state, log it off and back on, then issue SLDPM (DISC to fire the server back up again. Each time it is logged off, it sends the console file to the reader of the VM account SLDLAC. Again, the password is in the unix area.

Tips & Tricks

This is a compendium of tricks and seldom used (and hence readily forgotten) techniques.

Last Modified: 11/14/01 17:24