OFFLINE SHIFT TRAINING: THE SHUTTLE – MONITORING AND DEBUGGING
19 February 2010
Chiara Zampolli, Jan Fiete Grosse-Oetringhaus
https://aloshi.cern.ch
Outline
Offl Shift Training C. Zampolli 2
The Shuttle is the ALICE Online-Offline software framework dedicated to extracting conditions data (calibration and alignment) during data taking. It does so by running detector-specific procedures called preprocessors.
Outline:
‒Monitoring Web Page
‒How to read the Logs
‒The History
‒The Detectors Preprocessors Flow
‒How to handle Errors
‒The SHUTTLE Status
‒The OCDB
‒Contacts
The Shuttle General Schema
[Figure: the Shuttle general schema. Information providers feeding the SHUTTLE: ECS (Run Logbook, DIM trigger), DAQ (FXS, FXS DB), DCS (Archive DB, FXS, FXS DB) and HLT (FXS, FXS DB). The SHUTTLE runs the detector preprocessors (TRD, HMP, SPD, TPC, ...) and stores their output in the OCDB (Grid File Catalog).]
No alternative system to extract data (especially online calibration results) between data-taking and first reconstruction pass!
The Shuttle Data Flow – Schema per Detector
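[Figure: data flow per detector. On the online side, the DAs on the DAQ/DCS/HLT machines feed the DAQ/DCS/HLT FXS, and the DCS PVSS project feeds the DCS database. The SHUTTLE runs the Detector Preprocessor on these inputs; the results go to the OCDB and to the Reference Data, both via the Shuttle.]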
MonALISA Web Page
http://pcalimonitor.cern.ch/shuttle.jsp?instance=PROD
How to get there:
• Start from the MonALISA web page: http://pcalimonitor.cern.ch/map.jsp
• Open the SHUTTLE menu
• Click on the Production@P2 keyword
MonALISA Web Page – an Overview
[Screenshot: monitoring page for P2, with callouts:]
• AliROOT version
• SHUTTLE status
• DCS/FXS errors
• GRP failures
• Link to the test setup monitoring page
MonALISA Web Page – the Test Setup
Monitoring for the Test Setup
A Look at the Table
[Screenshot: the monitoring table, with callouts:]
• General information SHUTTLE + Detector status
• Status and access to the log
• Access to the history
• Number of retries
The SHUTTLE Log
Every entry carries a timestamp expressed in UTC
Geneva time is UTC +1h in winter (CET), +2h in summer (CEST)
The log contains information about the detectors that participated in the run and for which the corresponding preprocessor has been called
E.g.: http://pcalishuttle02.cern.ch/logs_PROD/7/76448/SHUTTLE.log
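The conversion from a UTC log timestamp to Geneva local time can be sketched in Python. This is a minimal sketch and not part of the Shuttle: the function names are hypothetical, and it applies the European summer-time rule directly (summer time runs from the last Sunday of March to the last Sunday of October, switching at 01:00 UTC) instead of relying on a time-zone database.

```python
from datetime import datetime, timedelta, timezone


def last_sunday(year, month):
    """Return 01:00 UTC on the last Sunday of March or October (31-day months)."""
    day = datetime(year, month, 31, 1, 0, tzinfo=timezone.utc)
    # weekday(): Monday=0 ... Sunday=6; step back to the most recent Sunday.
    return day - timedelta(days=(day.weekday() + 1) % 7)


def utc_to_geneva(stamp):
    """Convert a 'YYYY-MM-DD HH:MM:SS' UTC log timestamp to Geneva local time."""
    utc = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    # Summer time applies between the two switch-over instants of that year.
    summer = last_sunday(utc.year, 3) <= utc < last_sunday(utc.year, 10)
    offset = timedelta(hours=2 if summer else 1)
    return utc.astimezone(timezone(offset))


if __name__ == "__main__":
    print(utc_to_geneva("2009-07-20 16:38:07").strftime("%Y-%m-%d %H:%M:%S"))
```

For the timestamp 2009-07-20 16:38:07 UTC this gives 18:38:07 local time (summer, +2h). Where a time-zone database is available, a library-based conversion is the more robust choice.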
The Detector Log
Every entry carries a timestamp expressed in UTC
Geneva time is UTC +1h in winter (CET), +2h in summer (CEST)
Messages can come from either the detector or the SHUTTLE
E.g.: http://pcalishuttle02.cern.ch/logs_PROD/7/76451/MCH.log
I-AliShuttle::Log: 2009-07-20 16:38:07 UTC (17478): MCH - run 76451 - ProcessCurrentDetector - The preprocessor requested to skip the retrieval of DCS values
I-AliShuttle::Log: 2009-07-20 16:38:07 UTC (17478): SHUTTLE - run 76451 - UpdateShuttleStatus - MCH: Changing state from Started to PPStarted
In case of failure, the email addresses of the responsible persons to whom the notification has been sent can be found at the end of the log
Key words: DCS, FXS, GetFile, StoreOCDB
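The structure of the log lines quoted on this slide can be illustrated with a small Python parser. This is only a sketch based on the two example lines: the field names are hypothetical, the meaning of the leading letter and of the number in parentheses is not stated in the slides, and real logs may contain lines of other shapes (for which the function returns None).

```python
import re

# Pattern inferred from the two example lines on the slide (an assumption).
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z])-AliShuttle::Log: "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC "
    r"\((?P<number>\d+)\): "
    r"(?P<source>\S+) - run (?P<run>\d+) - (?P<function>\S+) - (?P<message>.*)$"
)

# Key words the slide suggests searching for.
KEYWORDS = ("DCS", "FXS", "GetFile", "StoreOCDB")


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["run"] = int(entry["run"])
    # Record which of the suggested key words occur anywhere in the line.
    entry["keywords"] = [word for word in KEYWORDS if word in line]
    return entry


if __name__ == "__main__":
    example = (
        "I-AliShuttle::Log: 2009-07-20 16:38:07 UTC (17478): SHUTTLE - run 76451 - "
        "UpdateShuttleStatus - MCH: Changing state from Started to PPStarted"
    )
    print(parse_log_line(example))
```

The `source` field distinguishes messages issued by a detector (e.g. MCH) from those issued by the SHUTTLE itself.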
The History
[Figure: two history views, each with a time axis.]
Failed preprocessor: 3 retries
Successful preprocessor: 1 retry
Preprocessor Status Flow
[Figure: state diagram of the preprocessor status flow. States, with the conditions labelled in the diagram:]
• Started
• DCS started
• DCS error: failure retrieving the DCS DPs
• FXS Error: failure to connect to the FXS
• PreprocessorStarted
• Preprocessor Error: preprocessor failure, retry
• Preprocessor OutOfMemory: allowed memory exceeded
• Preprocessor OutOfTime: allowed time exceeded
• PreprocessorDone: preprocessor success
• Store Started
• Store Delayed: waiting for previous runs
• Store Error: failure storing data in the OCDB
• Done
• Failed: retry count exceeded
• Skipped: run type not requested
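The conditions attached to the statuses in the diagram can be kept at hand as a simple lookup table. The Python sketch below is an illustration only: the pairing of each status with its condition is read off the diagram labels, and the helper name `explain_status` is hypothetical.

```python
# Status -> condition, as labelled in the status flow diagram.
# The pairing is an interpretation of the diagram, not taken from the Shuttle code.
STATUS_CONDITION = {
    "DCS error": "failure retrieving the DCS DPs",
    "FXS Error": "failure to connect to the FXS",
    "Preprocessor Error": "preprocessor failure, retry",
    "Preprocessor OutOfMemory": "allowed memory exceeded",
    "Preprocessor OutOfTime": "allowed time exceeded",
    "Store Delayed": "waiting for previous runs",
    "Store Error": "failure storing data in the OCDB",
    "Failed": "retry count exceeded",
    "Skipped": "run type not requested",
}


def explain_status(status):
    """Return the condition labelled for a status, or a note if none is labelled."""
    return STATUS_CONDITION.get(status, "no condition labelled in the diagram")


if __name__ == "__main__":
    for name in ("Store Delayed", "Failed", "Done"):
        print(f"{name}: {explain_status(name)}")
```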
Possible ERRORs
When to intervene:
• GRP Errors
• DCS Errors
• FXS Errors
• Detector Errors (*)
When not to intervene:
• Store Errors
• Detector Errors (*)
(*) Depending on the error and on its frequency
Possible ERRORs – What to Do
GRP “PPError”, “Failed”:
• Immediate action required: without the GRP, the run cannot be reconstructed!
• Search the last lines of the corresponding log for the cause of the error, e.g. “GRP Preprocessor FAILS!!! (Trigger Configuration ERROR)” or “GRP Preprocessor FAILS!!! (DCS ERROR)”
• Contact the person responsible for the corresponding system (Trigger or DCS, in the example)
• Inform the shift leader of the problem: no reconstruction will be possible for that run
Possible ERRORs – What to Do
DCS “DCSError” (could happen for every detector):
• It means that the communication between the DCS AMANDA server (DP retrieval) and the Shuttle is broken
• Inform the DCS shifter, indicating the detector for which the problem is happening
FXS “FXSError” (could happen for every detector):
• It means that the communication between one FXS (DAQ, DCS or HLT) and the Shuttle is broken
• Check in the log which system is involved (DAQ/DCS/HLT) by finding the FXS entry for which the retrieval failed (search, e.g., for “GetFile” in the log)
• Inform the shifter of the online system involved (DAQ/DCS/HLT), indicating the detector for which it is happening
Other Errors
“StoreError” (could happen for every detector):
• It means that some problems occurred while storing the output files of the detector preprocessor
• It could be related to instabilities of the GRID
• Inform the experts if the error is persistent
“StoreDelayed” (could happen for every detector):
• This is NOT an error! (not in red)
• It occurs when a previous run still has to be closed for this detector
Other Errors – II
“PPError”, “Failed”:
• An error occurred in the processing of the data by the corresponding detector preprocessor
• In general, no action has to be taken (except in the case of GRP): the detector experts are notified automatically
• See the end of the corresponding log to find out who was notified
• BUT: if the error is persistent, inform the detector shifter
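The rules from the error slides can be summarised as a small decision function. The Python sketch below is a memory aid under stated assumptions, not an official tool: the function name `triage` and its parameters are hypothetical, the status strings are the ones quoted in the slides, and whether an error counts as "persistent" is left to the shifter's judgement.

```python
def triage(detector, status, persistent=False):
    """Return the action suggested by the slides for a detector in a given status."""
    if status == "StoreDelayed":
        return "No action: not an error, a previous run still has to be closed."
    if status == "DCSError":
        return f"Inform the DCS shifter, indicating the detector ({detector})."
    if status == "FXSError":
        return (
            "Find the system involved (DAQ/DCS/HLT) in the log, e.g. by searching "
            f"for GetFile, and inform its shifter, indicating the detector ({detector})."
        )
    if status == "StoreError":
        if persistent:
            return "Inform the experts."
        return "No action unless the error persists."
    if status in ("PPError", "Failed"):
        if detector == "GRP":
            return (
                "Immediate action: find the cause in the last lines of the log, contact "
                "the person responsible for that system and inform the shift leader."
            )
        if persistent:
            return "Inform the detector shifter."
        return "No action: the detector experts are notified automatically."
    # Anything else is not covered by the error slides.
    return "Not covered by the slides: contact the SHUTTLE experts."


if __name__ == "__main__":
    print(triage("GRP", "PPError"))
    print(triage("SPD", "PPError"))
    print(triage("MCH", "FXSError"))
```

Checking the GRP case before the generic detector case mirrors the slides: a GRP failure blocks reconstruction of the run, while other preprocessor failures are first left to the automatically notified experts.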
SHUTTLE Status
If, according to the MonALISA page, the Shuttle is OFFLINE, and ONLY in that case, log in to the Shuttle machine (pcalishuttle02) from the offline console:
[aldaqacr10] ssh shuttle@pcalishuttle02
Check the SHUTTLE status:
[pcalishuttle02] ./shuttle status
If the SHUTTLE IS RUNNING, check whether MonALISA gets updated; if not, contact the MonALISA experts ([email protected])
If the SHUTTLE IS NOT RUNNING, type:
[pcalishuttle02] ./shuttle restart
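The order of the checks on this slide can be written down as a short decision function. This Python sketch only encodes the decision logic: the function and parameter names are hypothetical, and the three inputs are what the shifter observes (the MonALISA page, the answer of `./shuttle status`, and whether MonALISA gets updated), since the output format of the status command is not given in the slides.

```python
def shuttle_status_action(monalisa_says_offline, shuttle_running=False,
                          monalisa_updating=False):
    """Return the next step for the shifter, following the slide's procedure."""
    if not monalisa_says_offline:
        # The slide asks to log in ONLY if MonALISA reports the Shuttle as OFFLINE.
        return "Do not log in to the Shuttle machine."
    if not shuttle_running:
        return "Run ./shuttle restart on pcalishuttle02."
    if not monalisa_updating:
        return "Contact the MonALISA experts."
    return "No further action: MonALISA is being updated again."


if __name__ == "__main__":
    print(shuttle_status_action(True, shuttle_running=False))
    print(shuttle_status_action(True, shuttle_running=True, monalisa_updating=False))
```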
OCDB
The conditions data and reference data produced by the preprocessors while running within the SHUTTLE are stored in the OCDB and Reference folders in AliEn:
/alice/data/<current_year>/OCDB/*/*/*
/alice/data/<current_year>/Reference/*/*/*
Important Remarks
The run types for which the detector preprocessors are run depend on the implementation of their preprocessor code
Only runs taken within the ECS framework (not from the DAQ Run Control of the detectors!!) can be processed by the Shuttle
The GRP preprocessor is run only for a subset of run types; see http://aliceinfo.cern.ch/Offline/Activities/Shuttle/RunTypesForGRP.html
A successful preprocessor exits with code 1, a failing preprocessor exits with code > 1
Whom to Contact
For any SHUTTLE related issues not mentioned in the slides, please contact:
[email protected] (165459) (*)
[email protected] (160906) (*)
[email protected]
(*) based at CERN
Hands-on…
What is wrong here?
• Shuttle OFFLINE
• Check on the Shuttle machine
• Inform the experts
What is wrong here?
[Screenshot callouts: ONLINE1, FXS Error]
• FXS Error
• Go to the log
• Inform the online system experts
What do you see from the log?
• DAQ FXS problem
What is wrong here?
• GRP Error
• Check the log
What do you see from the log?
• DCS FXS problem
• Trigger scalers missing
• Inform Trigger and DCS
What is wrong here?
• SPD Error
• Go to the log
What do you see from the log?
• Problem with file from DAQ DA
• Inform SPD expert (see bottom of the log)
Back-Ups
Sequence Diagram
[Figure: sequence diagram. Participants: ECS, DAQ, DCS (Arch. DB), DCS (FXS), HLT, Shuttle. Timeline: Start of Run, End of Data Taking, End of Run. Labels on the Shuttle side:]
• Interfaces with info providers
• Loop over all detectors (+ GRP and HLT)
• Detector preprocessors: ACORDE, EMCAL, HMPID, FMD, ITS (*), MUON (**), PHOS (***), PMD, T0, TOF, TPC, TRD, V0, ZDC, GRP, HLT
• Registration of conditions data files in AliEn
No interference with data taking!
(*) ITS = SPD + SDD + SSD
(**) MUON = MCH + MTR
(***) PHOS = PHS + CPV