SLDPM is the automated server on SLACVM that processes data after a run is completed and runs it through various checks and the PASS1/PASS2 filters. SLDPM also runs a nightly job that examines all of the day's data using vertexing information. This is usually referred to as the 'daily summary job'.
SLDPM was written by Len Moss. The server itself and how to maintain it are partly documented. This document is directed to the lucky folks who get to mind SLDPM as it runs and deal with the day-to-day trials and tribulations.
Watching SLDPM is split into two kinds of shift: weekdays, and evenings plus weekends. Typically one person does the weekdays and another person covers the rest. The goal is good turnaround time. Quick turnaround on the first-stage jobs (AIS) gives the 'official' Z count. This is watched closely by both SLD and SLC folks, and it seems to be psychologically important that this number be up to date. The following stages (PASS1 & PASS2) feed into the data quality checking.
Spot-check the various monitoring tools every few hours during weekdays, a couple of times during the evening, and 3 or 4 times per day on weekends. On evenings and weekends, include a check in the late evening to ensure that things can run smoothly through the owl shift.
The main tools to watch SLDPM with are:
| What | Description |
| Missing Runs | which runs have not been processed, have failed or are still active; by category. |
| Run Status p1 | run summary from online, indicating current run number and tape in use |
| Status p15 | Status display showing batch queues of active and recently completed jobs grouped by category |
| Status p12 | Status display showing a summary of the last job run in each category |
| Recent Runs | Webtables display of the various run categories for the past three days. This gives you access to the LOG files from the web. |
| Recent Days | Webtables display of the daily summary information for the past three days. This gives you access to the LOG files from the web. |
| Status p14 | Status display showing the Z/day summary and SLDPM last alive times |
SLDPM tells its watchers of problems via the mailing list SLDPM-NOTIFY. This is an archived list that the various watchers also use to keep each other abreast of the status of problems and their fixes. Watchers communicate with SLDPM via Bitnet messages, either from VM or from one of the VAXes. "SEND SLDPM@SLACVM ?" will give a list of the commands that it recognizes.
Several watchdogs, which can generate pages, are in place to keep an eye on operations. These are run as cron jobs on AIXCRON from the account richard. The idea is that these watchdogs will relieve you of the burden of worrying about these failure modes (until they happen!). All these watchdog alerts are recorded on the SLACVX system log page.
Here is a list of currently typical failure modes and what one does about them:
| Problem | there is an intermittent failure mode wherein the ftp transfer of result files from VMS to VM fails. |
| Fix | refetch the job: SEND SLDPM@SLACVM xxxFET job-name. Legal jobnames are AISrrrrr, CVDrrrrr, FLTrrrrr, RECrrrrr, DSPrrrrr, where rrrrr is the run number. |
| Problem | there is a message in the log file whose syntax looks like an official error message, but which SLDPM has not seen before (and which has not been blessed as innocuous). |
| Fix | if the message is indeed innocuous, refetch the job: SEND SLDPM@SLACVM xxxFET job-name ( IGNORE MSGS |
| Subsystem | Who to contact |
| VTX | Su Dong |
| CDC | Leon Rochester |
| EDC | Sal Fahey |
| KAL | Richard Dubois |
| WIC | Giampiero Mancinelli |
| LUM | Matt Langston |
| Job type | xxx | Input dataset |
| AIS | AQI | ACQS |
| FLT | FLT | ACQS (or RAWS if still on disk) |
| REC | REC | HAD |
| DSP | DSP | REC |
| CVD | CVD | ACQS(?) |
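The refetch commands in the failure-mode list combine the xxx value from this table with the job name. As a minimal sketch of how a watcher might assemble the command string without typos, the Python helper below builds it from a job name. The helper name `refetch_command` is hypothetical, and two things are assumptions rather than documented behavior: that the xxx column is the prefix used in xxxFET, and that the run number rrrrr is exactly five digits. It only builds the text of the message; it does not send anything to SLDPM.

```python
import re

# Job type -> "xxx" prefix, copied from the table above.
# Assumption: this is the prefix that goes in front of "FET".
FETCH_PREFIX = {
    "AIS": "AQI",
    "FLT": "FLT",
    "REC": "REC",
    "DSP": "DSP",
    "CVD": "CVD",
}

# Legal job names are a job type followed by the run number.
# Assumption: "rrrrr" means exactly five digits.
JOB_NAME = re.compile(r"(AIS|CVD|FLT|REC|DSP)(\d{5})")


def refetch_command(job_name, ignore_msgs=False):
    """Return the message text for refetching one job (hypothetical helper)."""
    name = job_name.strip().upper()
    match = JOB_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a legal job name: {job_name!r}")
    prefix = FETCH_PREFIX[match.group(1)]
    command = f"SEND SLDPM@SLACVM {prefix}FET {name}"
    if ignore_msgs:
        # Only for log messages already judged innocuous.
        command += " ( IGNORE MSGS"
    return command


if __name__ == "__main__":
    print(refetch_command("AIS12345"))
    print(refetch_command("REC12345", ignore_msgs=True))
```

Rejecting anything that is not one of the five listed job types followed by a run number is a deliberate choice here: a malformed name is caught before it is ever typed into a message to the server.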
This is a compendium of tricks and seldom used (and hence readily forgotten) techniques.
Last Modified: 11/14/01 17:24