OFFLINE SHIFT TRAINING: THE SHUTTLE – MONITORING AND DEBUGGING 19 February 2010 Chiara Zampolli , Jan Fiete Grosse-Oetringhaus https://aloshi.cern.ch


Page 1: Offline Shift Training:  the shuttle  – monitoring and debugging

OFFLINE SHIFT TRAINING: THE SHUTTLE – MONITORING AND DEBUGGING

19 February 2010

Chiara Zampolli, Jan Fiete Grosse-Oetringhaus

https://aloshi.cern.ch

Page 2: Offline Shift Training:  the shuttle  – monitoring and debugging

Outline

Offl Shift Training C. Zampolli 2

The Shuttle is the ALICE Online-Offline software framework dedicated to the extraction of conditions data (calibration and alignment) during data taking, running detector-specific procedures called preprocessors

Outline:

‒Monitoring Web Page

‒How to read the Logs

‒The History

‒The Detectors Preprocessors Flow

‒How to handle Errors

‒The SHUTTLE Status

‒The OCDB

‒Contacts

Page 3: Offline Shift Training:  the shuttle  – monitoring and debugging

The Shuttle General Schema

[Diagram labels: OCDB, Grid File Catalog; DIM trigger, ECS; Run Logbook; DAQ (FXS DB, FXS); DCS (Archive DB, FXS DB, FXS); HLT (FXS DB, FXS); SHUTTLE; detector preprocessors TRD, HMP, SPD, TPC, ...]

No alternative system to extract data (especially online calibration results) between data-taking and first reconstruction pass!

Page 4: Offline Shift Training:  the shuttle  – monitoring and debugging

The Shuttle Data Flow – Schema per Detector

[Diagram labels: DAQ/DCS/HLT machines (DA); DCS PVSS project; DCS database; DAQ/DCS/HLT FXS; SHUTTLE; Detector Preprocessor; OCDB (via Shuttle); Reference Data (via Shuttle)]

Page 5: Offline Shift Training:  the shuttle  – monitoring and debugging

MonALISA Web Page

http://pcalimonitor.cern.ch/shuttle.jsp?instance=PROD

how to get there...

‒Start from the MonALISA web page http://pcalimonitor.cern.ch/map.jsp

‒Open the SHUTTLE menu

‒Click on the Production@P2 keyword

Page 6: Offline Shift Training:  the shuttle  – monitoring and debugging

MonALISA Web Page – an Overview

Monitoring for P2

AliROOT version

SHUTTLE status

DCS/FXS errors

GRP failures

Link to the monitoring page of the test setup

Page 7: Offline Shift Training:  the shuttle  – monitoring and debugging

MonALISA Web Page – the Test Setup


Monitoring for the Test Setup

Page 8: Offline Shift Training:  the shuttle  – monitoring and debugging

A Look at the Table

General information SHUTTLE + Detector status

Status & access to the log

Access to the history

Number of retries

Page 9: Offline Shift Training:  the shuttle  – monitoring and debugging

The SHUTTLE Log

Every entry carries a timestamp expressed in UTC

Geneva time is UTC +1h in winter (CET), +2h in summer (CEST)

Contains information about the detectors participating in the run, and for which the corresponding preprocessor has been called

E.g.: http://pcalishuttle02.cern.ch/logs_PROD/7/76448/SHUTTLE.log
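When comparing log timestamps with the wall clock in the control room, the UTC-to-Geneva conversion can be sketched as below. This is only an illustration, not part of the Shuttle tools: the function names are made up, and it assumes the standard European summer-time rule (switch at 01:00 UTC on the last Sunday of March and of October). Where time-zone data is installed, Python's `zoneinfo` module would do the same job.

```python
from datetime import datetime, timedelta, timezone


def last_sunday_0100_utc(year: int, month: int) -> datetime:
    """Return 01:00 UTC on the last Sunday of a 31-day month (March or October)."""
    last_day = datetime(year, month, 31, 1, 0, tzinfo=timezone.utc)
    # weekday(): Monday is 0 and Sunday is 6, so this steps back to the Sunday.
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)


def utc_to_geneva(utc_dt: datetime) -> datetime:
    """Convert an aware UTC datetime to Geneva wall-clock time (CET or CEST)."""
    summer_start = last_sunday_0100_utc(utc_dt.year, 3)
    summer_end = last_sunday_0100_utc(utc_dt.year, 10)
    offset_hours = 2 if summer_start <= utc_dt < summer_end else 1
    return utc_dt.astimezone(timezone(timedelta(hours=offset_hours)))


# A summer timestamp in the format used by the logs (UTC).
stamp = datetime(2009, 7, 20, 16, 38, 7, tzinfo=timezone.utc)
print(utc_to_geneva(stamp).strftime("%Y-%m-%d %H:%M:%S"))  # 2009-07-20 18:38:07
```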


Page 10: Offline Shift Training:  the shuttle  – monitoring and debugging

The Detector Log

Every entry carries a timestamp expressed in UTC

Geneva time is UTC +1h in winter (CET), +2h in summer (CEST)

Messages can come from either the detector, or the SHUTTLE

E.g.: http://pcalishuttle02.cern.ch/logs_PROD/7/76451/MCH.log

I-AliShuttle::Log: 2009-07-20 16:38:07 UTC (17478): MCH - run 76451 - ProcessCurrentDetector - The preprocessor requested to skip the retrieval of DCS values

I-AliShuttle::Log: 2009-07-20 16:38:07 UTC (17478): SHUTTLE - run 76451 - UpdateShuttleStatus - MCH: Changing state from Started to PPStarted
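For shifters who want to filter a log by source or run, the two example lines can be split into fields with a regular expression. The sketch below is inferred only from those two lines. The field names are illustrative, the meaning of the number in parentheses is not stated on the slide (it is kept as an opaque `number`), and other log lines may not follow this layout.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the two example lines only (an assumption).
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z])-AliShuttle::Log: "
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC "
    r"\((?P<number>\d+)\): "
    r"(?P<source>\S+) - run (?P<run>\d+) - "
    r"(?P<function>\S+) - (?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["run"] = int(entry["run"])
    # Log timestamps are in UTC, so attach that time zone explicitly.
    entry["stamp"] = datetime.strptime(
        entry["stamp"], "%Y-%m-%d %H:%M:%S"
    ).replace(tzinfo=timezone.utc)
    return entry


line = (
    "I-AliShuttle::Log: 2009-07-20 16:38:07 UTC (17478): SHUTTLE - run 76451 - "
    "UpdateShuttleStatus - MCH: Changing state from Started to PPStarted"
)
entry = parse_log_line(line)
print(entry["source"], entry["run"], entry["message"])
```

Here `source` tells whether the message comes from the detector (e.g. MCH) or from the SHUTTLE itself.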

In case of failure, the email addresses of the responsible persons to whom the notification was sent can be found at the end of the log

Key words: DCS, FXS, GetFile, StoreOCDB

Page 11: Offline Shift Training:  the shuttle  – monitoring and debugging

The History

[Plot: preprocessor status history as a function of time]

Failed preprocessor: 3 retries

Successful preprocessor: 1 retry

Page 12: Offline Shift Training:  the shuttle  – monitoring and debugging

Preprocessor Status Flow

[Diagram: status flow of a detector preprocessor. Main path: Started, DCS started, Preprocessor Started, Preprocessor Done, Store Started, Done.]

Statuses and the conditions that lead to them:

‒DCS error: failure retrieving the DCS DPs

‒FXS Error: failure to connect to the FXS

‒Preprocessor OutOfMemory: allowed memory exceeded

‒Preprocessor OutOfTime: allowed time exceeded

‒Preprocessor Error: preprocessor failure, retry

‒Preprocessor Done: preprocessor success

‒Store Delayed: waiting for previous runs

‒Store Error: failure storing data in the OCDB

‒Failed: retry count exceeded

‒Skipped: run type not requested

Page 13: Offline Shift Training:  the shuttle  – monitoring and debugging

Possible ERRORs

When to intervene:

‒GRP Errors

‒DCS Errors

‒FXS Errors

‒Detector Errors (*)

When not to intervene:

‒Store Errors

‒Detector Errors (*)

(*) Depending on the error, on the frequency...

Page 14: Offline Shift Training:  the shuttle  – monitoring and debugging

Possible ERRORs – What to Do

GRP “PPError”, “Failed”:

‒Immediate action required: without the GRP, the run cannot be reconstructed!

‒Search the last lines of the corresponding log for the cause of the error, e.g.: “GRP Preprocessor FAILS!!! (Trigger Configuration ERROR)” or “GRP Preprocessor FAILS!!! (DCS ERROR)”

‒Contact the person responsible for the corresponding system (Trigger or DCS, in the example)

‒Inform the shift leader of the problem: no reconstruction will be possible for that run

Page 15: Offline Shift Training:  the shuttle  – monitoring and debugging

Possible ERRORs – What to Do

DCS “DCSError” (can happen for any detector):

‒The communication between the DCS AMANDA server (DP retrieval) and the Shuttle is broken

‒Inform the DCS shifter, indicating the detector for which the problem is happening

FXS “FXSError” (can happen for any detector):

‒The communication between one FXS (DAQ, DCS or HLT) and the Shuttle is broken

‒Check in the log which system is involved (DAQ/DCS/HLT) by finding the FXS entry for which the retrieval failed (search, e.g., for “GetFile” in the log)

‒Inform the shifter of the online system involved (DAQ/DCS/HLT), indicating the detector for which it is happening
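The search for “GetFile” in a downloaded log can be sketched as follows. The helper names and the sample log lines are invented for illustration. In particular, it is an assumption that the matching line names the online system (DAQ, DCS or HLT) as a separate word, so the real log should always be read by eye as well.

```python
import re


def find_keyword_lines(log_text, keywords=("GetFile",)):
    """Return (line_number, line) pairs for lines containing any keyword."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(keyword in line for keyword in keywords):
            hits.append((number, line))
    return hits


def systems_mentioned(lines):
    """Return which of DAQ, DCS, HLT appear as whole words in the given lines."""
    found = []
    for system in ("DAQ", "DCS", "HLT"):
        if any(re.search(rf"\b{system}\b", line) for line in lines):
            found.append(system)
    return found


# Invented sample text, NOT real Shuttle output.
sample_log = "\n".join(
    [
        "SPD - run 76451 - GetFile - could not retrieve file from DAQ FXS",
        "SPD - run 76451 - StoreOCDB - nothing to store",
    ]
)

hits = find_keyword_lines(sample_log)
print(hits)
print(systems_mentioned([line for _, line in hits]))
```

The same function can be called with the other key words of the logs, e.g. `keywords=("StoreOCDB",)`.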

Page 16: Offline Shift Training:  the shuttle  – monitoring and debugging

Other Errors

“StoreError” (can happen for any detector):

‒Some problem occurred while storing the output files of the detector preprocessor

‒Could be related to instabilities of the Grid

‒Inform the experts if the error is persistent

“StoreDelayed” (can happen for any detector):

‒This is NOT an error! (not shown in red)

‒It appears when a previous run still has to be closed for this detector

Page 17: Offline Shift Training:  the shuttle  – monitoring and debugging

Other Errors – II

“PPError”, “Failed”:

‒An error occurred while the corresponding detector preprocessor was processing the data

‒In general, no action has to be taken (except in the case of GRP): the detector experts are automatically notified...

‒See the end of the corresponding log to find out who was notified

...BUT! If the error is persistent, inform the detector shifter
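The rules from the error slides can be collected into one lookup, sketched below as a memory aid. The function name and the `persistent` flag are made up for illustration. The slides do not define when an error counts as persistent, so that judgement stays with the shifter.

```python
def shifter_action(detector, status, persistent=False):
    """Return the action the slides suggest for a detector status (a sketch)."""
    if status == "StoreDelayed":
        return "No action: not an error, a previous run still has to be closed."
    if detector == "GRP" and status in ("PPError", "Failed"):
        return (
            "Immediate action: find the cause in the last lines of the log, "
            "contact the responsible for that system, inform the shift leader."
        )
    if status == "DCSError":
        return f"Inform the DCS shifter, indicating detector {detector}."
    if status == "FXSError":
        return (
            "Find the failing FXS (DAQ/DCS/HLT) in the log, then inform "
            f"that system's shifter, indicating detector {detector}."
        )
    if status == "StoreError":
        if persistent:
            return "Inform the experts."
        return "No action unless the error becomes persistent."
    if status in ("PPError", "Failed"):
        if persistent:
            return "Inform the detector shifter."
        return "No action: the detector experts are notified automatically."
    return "Not covered by the slides: contact the Shuttle experts."


print(shifter_action("GRP", "PPError"))
print(shifter_action("MCH", "PPError"))
print(shifter_action("SPD", "FXSError"))
```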


Page 18: Offline Shift Training:  the shuttle  – monitoring and debugging

SHUTTLE Status

If, according to the MonALISA page, the Shuttle is OFFLINE, and ONLY in that case, log in to the Shuttle machine (pcalishuttle02) from the offline console:

[aldaqacr10] ssh shuttle@pcalishuttle02

Check the SHUTTLE status:

[pcalishuttle02] ./shuttle status

If the SHUTTLE IS RUNNING, check whether MonALISA gets updated; if not, contact the MonALISA experts ([email protected])

If the SHUTTLE IS NOT RUNNING, type:

[pcalishuttle02] ./shuttle restart
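The order of the checks on this slide can be written as a small decision function. This is only a sketch of the procedure: the function and parameter names are invented, and it does not run any command itself. `None` stands for "not checked yet".

```python
def shuttle_recovery_step(monalisa_shows_offline, shuttle_running=None,
                          monalisa_updating=None):
    """Return the next step of the procedure, given what is known so far."""
    if not monalisa_shows_offline:
        # The login is foreseen ONLY when MonALISA reports OFFLINE.
        return "Do not log in to the Shuttle machine."
    if shuttle_running is None:
        return "Log in to pcalishuttle02 and run: ./shuttle status"
    if not shuttle_running:
        return "Run: ./shuttle restart"
    if monalisa_updating is None:
        return "Check whether MonALISA gets updated."
    if monalisa_updating:
        return "No further action."
    return "Contact the MonALISA experts."


print(shuttle_recovery_step(True))
print(shuttle_recovery_step(True, shuttle_running=False))
print(shuttle_recovery_step(True, shuttle_running=True, monalisa_updating=False))
```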


Page 19: Offline Shift Training:  the shuttle  – monitoring and debugging

OCDB

The conditions data produced by the preprocessors while running within the SHUTTLE are stored in the OCDB and Reference folders in AliEn:

/alice/data/<current_year>/OCDB/*/*/*

/alice/data/<current_year>/Reference/*/*/*
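To look up the folders for a given year, the `<current_year>` placeholder can be filled in as sketched below. The helper name is hypothetical, and the three further path levels shown as wildcards on the slide are left out because the slide does not name them.

```python
from datetime import date


def conditions_folders(year=None):
    """Return the AliEn base folders for OCDB and Reference data of a year."""
    if year is None:
        year = date.today().year  # default to the current year
    base = f"/alice/data/{year}"
    return {"OCDB": f"{base}/OCDB", "Reference": f"{base}/Reference"}


print(conditions_folders(2010))
# {'OCDB': '/alice/data/2010/OCDB', 'Reference': '/alice/data/2010/Reference'}
```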


Page 20: Offline Shift Training:  the shuttle  – monitoring and debugging

Important Remarks

The run types for which the detector preprocessors are run depend on the implementation of their preprocessor code

Only runs taken within the ECS framework (not from the DAQ Run Control of the detectors!!) can be processed by the Shuttle

The GRP preprocessor is run only for a subset of run types, see http://aliceinfo.cern.ch/Offline/Activities/Shuttle/RunTypesForGRP.html

A successful preprocessor exits with code 1, a failing preprocessor exits with code > 1


Page 21: Offline Shift Training:  the shuttle  – monitoring and debugging

Whom to Contact

For any SHUTTLE related issues not mentioned in the slides, please contact:

[email protected] (165459) (*)

[email protected] (160906) (*)

[email protected]

(*) based at CERN

Page 22: Offline Shift Training:  the shuttle  – monitoring and debugging

Hands-on…

Page 23: Offline Shift Training:  the shuttle  – monitoring and debugging

What is wrong here?

• Shuttle OFFLINE

• Check on the Shuttle machine

• Inform the experts

Page 24: Offline Shift Training:  the shuttle  – monitoring and debugging

What is wrong here?


ONLINE1

FXS Error

• FXS Error

• Go to the log

• Inform the online system experts

Page 25: Offline Shift Training:  the shuttle  – monitoring and debugging

What do you see from the log?


DAQ FXS problem

Page 26: Offline Shift Training:  the shuttle  – monitoring and debugging

What is wrong here?

• GRP Error

• Check the log

Page 27: Offline Shift Training:  the shuttle  – monitoring and debugging

What do you see from the log?

• DCS FXS problem

• Trigger scalers missing

• Inform Trigger and DCS

Page 28: Offline Shift Training:  the shuttle  – monitoring and debugging

What is wrong here?

• SPD Error

• Go to the log

Page 29: Offline Shift Training:  the shuttle  – monitoring and debugging

What do you see from the log?

• Problem with file from DAQ DA

• Inform SPD expert (see bottom of the log)

Page 30: Offline Shift Training:  the shuttle  – monitoring and debugging

Back-Ups

Page 31: Offline Shift Training:  the shuttle  – monitoring and debugging

Sequence Diagram

[Sequence diagram: ECS, DAQ, DCS (Arch. DB), DCS (FXS), HLT and Shuttle, with the events Start of Run, End of Data Taking and End of Run]

Interfaces with info providers

Loop over all detectors (+ GRP and HLT)

Detector preprocessors: ACORDE, EMCAL, HMPID, FMD, ITS (*), MUON (**), PHOS (***), PMD, T0, TOF, TPC, TRD, V0, ZDC, GRP, HLT

Registration of conditions data files in AliEn

No interference with data taking!

(*) ITS = SPD + SDD + SSD

(**) MUON = MCH + MTR

(***) PHOS = PHS + CPV