Grid Monitoring and Operations
SAM Development Team
CERN IT/GD
Tier2 Admin Workshop
03 Dec. 2006, Mumbai
Grid Operations, Tier2 Admin Workshop, 03 Dec. 2006, Mumbai
Outline
• Monitoring and Operational tools
  – SAM
    • framework
    • sensors
    • availability metrics
  – FCR
  – gstat, GOCDB, SAM Admin Portal, COD Dashboard
• Grid Operations (COD)
Monitoring tools
Service Availability Monitoring (SAM)
SAM -- Overview
• Grid service-level monitoring framework
• successor of SFT
• used in Grid Operations
• basis for Availability Metrics
• VO-based submissions
  – VO-specific tests
• services tested currently:
  • CE, gCE
  • SE
  • RB
  • sBDII
  • BDII
  • FTS
  • LFC
  • JobWrapper tests
Central SAM submissions
• Official CERN submissions
  – Production and Certified sites
  – ops (+ dteam) VO
  – jobs submitted every hour
  – basis of COD alarms
  – https://lcg-sam.cern.ch:8443/sam/sam.py
• PPS
  – ops VO
  – hourly
  – https://lcg-sam.cern.ch:8443/sam-pps/sam.py
• SAM Admin Portal
  – ops VO
  – on-demand
  – Certified + Uncertified sites
VO specific tests submission
• LHCb
  – successfully migrated to SAM (only CE, gCE)
  – VO-specific test (Dirac installation)
• Atlas
  – all sensors
  – submitted from SAM UI
• CMS
  – set up, but no regular submission yet
SAM Portal
SAM Internals
• framework structure
  – client
    • submission framework
      – (developed by CERN team)
    • sensors
      – developed by different contributors + CERN team
      – tests: plug-in modules
  – server
    • web services
    • portal
• Oracle DB accessed by web services
• static (GOCDB) + dynamic (BDIIs) info
Sensors – CE, gCE, SE, SRM
• CE, gCE
  – job submission
    • UI → RB → CE → WN chain
  – CA certificates (on WN)
  – software middleware version (WN)
  – replica management
    • lcg-utils
    • default SE + 3rd-party replication
  – RGMA, Apel, etc.
• SE, SRM
  – UI ↔ SE/SRM
    • lcg-utils (LFC)
Sensors – LFC, FTS
• LFC
  – lfc-ls + create file in /grid/<VO>
• FTS
  – BDII entry check
  – listing channels
    • glite-transfer-channel-list (ChannelManagement service)
  – transfer test (in development):
    • submitting transfer jobs between SRMs in all Tier0 and Tier1 sites (N-N testing)
    • checking the status of jobs
    • Note! The test relies on the availability of the SRMs at the sites
Standalone sensors – BDII, RB
• sBDII (Gstat)
  – accessibility
  – sanity checks
• top-level BDIIs (Gstat)
  – accessibility
  – reliability of data (number of entries)
• RB
  – job submission
    • UI → important RBs → “reliable” CEs
  – time of matchmaking
JobWrapper tests
• JobWrapper
  – requested by experiments, also useful in operations
  – testing all WNs
    • SAM always tests just an arbitrary one
  – tests executed by CE wrapper script
    • executed with every production job
  – test results
    • passed to the job
    • published to the SAM DB
  – test code
    • core scripts in the release
    • tests on software area (signed tarball)
  – soon in production
Availability metrics - algorithm
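Status of node N = $\bigwedge_{t \in \mathrm{CriticalTests}} \mathrm{TestResult}(N, t)$

Status of service C = $\bigvee_{N \in \mathrm{instances}(C)} \mathrm{Status}(N)$

Status of site S = $(\mathrm{CE}_1 \vee \mathrm{CE}_2 \vee \dots \vee \mathrm{CE}_n) \wedge (\mathrm{SRM}_1 \vee \mathrm{SRM}_2 \vee \dots \vee \mathrm{SRM}_n) \wedge \mathrm{site\ BDII}$

∧ = boolean AND, ∨ = boolean OR

• Everything is calculated for each VO that defined critical tests in FCR
• Results make sense only if the VO submits tests!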
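The AND/OR rules on this slide can be sketched in a few lines of Python. This is only an illustration, not the SAM implementation: the test names, host names and data layout are hypothetical, and treating a missing test result as a failure is an assumption the slide does not state.

```python
from typing import Dict, List

# Hypothetical layout: test name -> True if the test passed.
Results = Dict[str, bool]


def node_status(results: Results, critical_tests: List[str]) -> bool:
    """AND over the critical tests of one node.

    Assumption: a critical test with no result counts as failed.
    """
    return all(results.get(test, False) for test in critical_tests)


def service_status(nodes: Dict[str, Results], critical_tests: List[str]) -> bool:
    """OR over all instances of one service type (e.g. all CEs of a site)."""
    return any(node_status(r, critical_tests) for r in nodes.values())


def site_status(site: Dict[str, Dict[str, Results]],
                critical: Dict[str, List[str]]) -> bool:
    """AND over the service types of the site (CE, SRM, site BDII).

    Assumption: a service type without critical tests counts as up.
    """
    return all(service_status(nodes, critical.get(service, []))
               for service, nodes in site.items())


# Critical tests as one VO might define them (names are made up).
critical = {"CE": ["job-submit", "ca-certs"], "SRM": ["put-get"], "sBDII": ["sanity"]}

site = {
    "CE": {
        "ce1.example.org": {"job-submit": True, "ca-certs": False},
        "ce2.example.org": {"job-submit": True, "ca-certs": True},
    },
    "SRM": {"srm1.example.org": {"put-get": True}},
    "sBDII": {"bdii.example.org": {"sanity": True}},
}

print(site_status(site, critical))
```

In this example one CE fails a critical test, but the site still counts as up because the second CE passes all of them. The result is per VO: a different set of critical tests can give a different status for the same site.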
Availability metrics - algorithm II
• service and site status every hour
• daily, weekly, monthly availability
• scheduled downtime information from GOCDB
• details of the algorithm on GOC: http://goc.grid.sinica.edu.tw/gocwiki/SAME_Metrics_calculation
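As a rough illustration of how hourly statuses could be rolled up into a daily, weekly or monthly figure, the sketch below computes two numbers: the fraction of all hours in which the site was up, and the same fraction with scheduled-downtime hours left out. Both formulas are assumptions made for this example; the slide does not say how scheduled downtime enters the calculation, and the authoritative definition is the one on the GOC wiki page above.

```python
from typing import Optional, Sequence, Tuple


def availability(hourly_up: Sequence[bool],
                 scheduled_down: Sequence[bool]
                 ) -> Tuple[Optional[float], Optional[float]]:
    """Return (fraction of all hours up, fraction of unscheduled hours up).

    hourly_up[i]      -- site status in hour i (True = up)
    scheduled_down[i] -- True if hour i was scheduled downtime (from GOCDB)
    Either value is None if there are no hours to average over.
    """
    if len(hourly_up) != len(scheduled_down):
        raise ValueError("both sequences must cover the same hours")

    total = len(hourly_up)
    over_all_hours = sum(hourly_up) / total if total else None

    # Keep only the hours that were not announced as downtime.
    unscheduled = [up for up, sd in zip(hourly_up, scheduled_down) if not sd]
    over_unscheduled = sum(unscheduled) / len(unscheduled) if unscheduled else None

    return over_all_hours, over_unscheduled


# One day: up for 20 hours, down for 4, of which the last 2 were scheduled.
day_up = [True] * 20 + [False] * 4
day_scheduled = [False] * 22 + [True] * 2
print(availability(day_up, day_scheduled))
```

The same function works for a week or a month by passing 168 or more hourly values.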
Availability metrics - GridView
Availability metrics - data export
VO tools
Freedom of Choice for Resources (FCR)
FCR -- Overview
• Freedom of Choice for Resources
• https://lcg-fcr.cern.ch:8443/fcr/fcr.cgi
• VO policy enforcement tool
• critical test and resource selection for VOs by manipulating top-level BDII information
• goal is to be able to
  – select which aspects of site functionality are important for the VO
  – blacklist unreliable sites
  – always use stable, "important" sites
  – use less reliable sites based on their SAM results
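The selection goals listed above can be sketched as a small filter. This is a hypothetical illustration of the policy only: the function, its parameters and the resource names are invented for the example, the precedence of the blacklist over the "always use" list is an assumption, and nothing here reproduces how FCR actually modifies the top-level BDII.

```python
from typing import Dict, List, Set


def select_resources(sam_ok: Dict[str, bool],
                     always_use: Set[str],
                     blacklist: Set[str]) -> List[str]:
    """Pick the resources a VO would keep visible.

    sam_ok     -- resource -> True if it passes the VO's critical SAM tests
    always_use -- stable, "important" resources kept regardless of SAM results
    blacklist  -- resources never used (assumed to override always_use)
    """
    candidates = set(sam_ok) | always_use
    selected = []
    for resource in sorted(candidates):
        if resource in blacklist:
            continue                      # blacklisted: never selected
        if resource in always_use or sam_ok.get(resource, False):
            selected.append(resource)     # trusted, or currently passing
    return selected


sam_ok = {"ce.a.example": True, "ce.b.example": False,
          "ce.c.example": False, "ce.d.example": True}

chosen = select_resources(sam_ok,
                          always_use={"ce.b.example"},
                          blacklist={"ce.d.example"})
print(chosen)
```

Here `ce.b.example` is kept although it currently fails, because the VO marked it as important, while `ce.d.example` is dropped although it passes, because it is blacklisted.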
FCR -- Overview
• integrated with SAM
  – sharing the same DB
• optional usage
  – BDII configuration parameter
  – FCR output: ldif file
• information from GOCDB + BDII
• DN-based authentication (2 levels)
FCR Admin Portal
FCR User Pages
• read-only view of VO settings
• tells if the resource is available at the moment
• grouping selection
FCR User Portal
Monitoring tools
gstat, SAM Admin Portal, COD dashboard
gstat (Sinica)
– http://goc.grid.sinica.edu.tw/gstat/
– Information System (BDII) monitoring
– response time, consistency (sanity), completeness
– site-BDII + top-level BDII
– aggregated and detailed views
– plots (history)
– refreshed every 5 mins (non-intrusive)
gstat
SAM Admin Portal
– https://monitoring.egee.man.poznan.pl/admin2
– on-demand SAM submission
– easy to use
– target site selection
– used by:
  • ROCs: certification of a site
  • ROCs, site admins, CODs: speed up debugging
SAM Admin Portal
GOCDB
– https://goc.grid-support.ac.uk/gridsite/gocdb2/index.php
– central database to store static site information
– all EGEE sites have to register
– contact, security contact, certification status, site type
– scheduled maintenance
– used by
  • script that generates top-level BDII config file
  • monitoring tools
  • SAM DB → SAM, FCR, Availability calc.
  • operations management tools
GOCDB
CIC Operations Portal
• COD management
  – schedule for rotations
  – COD dashboard
  – COD handover notes
• ROC management
  – ROC contacts
  – weekly reports
• VO management
  – VO ID cards (VO contacts, etc.)
• EGEE broadcast
CIC Operations Portal
GGUS (FZK)
• Global GRID User Support
• http://ggus.org
• ticketing system for the EGEE GRID
• based on Remedy
• tickets created by
  – individual users (manually)
  – Grid Operators (via COD Dashboard)
• news, documentation
GGUS Portal
Operations
Grid Operations
EGEE Operations Structure
• Regional Operations Centres (ROC)
  – One in each region (incl. Asia-Pacific)
  – Front-line support for user and operations issues
    • point of contact for sites in the region
  – Provide local knowledge and adaptations
  – Manage daily Grid operations – oversight, troubleshooting
  – Run infrastructure services
• for Asia-Pacific region
  – Asia-Pacific
    • [email protected]
    • Jason Shih, Min-Hong Tsai, Shu-Ting Liao
  – CERN (catch-all ROC)
    • [email protected]
    • Nicholas Thackray
COD
• COD is Operator on Duty – was: CIC-on-Duty
• global LCG/EGEE GRID monitoring
• 1 (2) ROCs responsible for the whole GRID operations at a time
  – 12 ROCs involved
  – weekly rotation
• weekly WLCG-OSG-EGEE Operations meeting
  – ROCs, Tier1, VOs
  – all sites invited
COD Procedures
• https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures
• Looking at monitoring tools
  – SAM, Certificate Monitoring pages
• Open tickets using COD Dashboard
• Escalate expired tickets
• Process site responses (update tickets accordingly)
• End of duty: hand-over notes
• Update the GOC wiki pages
COD Dashboard
• summary of necessary monitoring information + tools for ticket processing
• tickets linked to GGUS tickets
• GOCDB information
  – site downtime information!
• SAM alarms
• ticket creation and management tool
• tools for related e-mail
COD Dashboard
Connection between the used tools
[Diagram. Boxes: Grid Operators (COD); COD dashboard; Monitoring tools; SAM; GGUS. Arrow labels: Problem tracking and reporting; Ticket follow-up; Modifications on the tickets.]
Escalation Procedure
• defines the steps to be taken during the lifetime of a ticket
  – tickets don't get forgotten!
• available on CIC Portal
  – (https://edms.cern.ch/document/701575)
• prioritization of alarms depending on the amount of resources at the site
Escalation Steps
1. ticket creation
2. first mail (to: site + ROC)
3. second mail (to: site + ROC)
4. suspension from the GRID
• before 4.:
  a) mail to ROC
  b) mail to OCC for validation
  c) site is invited to the weekly operations meeting
Escalation Procedure -- Quarantine
• site categories
  – low: CPU < 20
  – normal: 20 < CPU < 100
  – high: 100 < CPU
• between escalation steps 2.-3. and 3.-4.
  – low + normal: 3 days
  – high: 1 day
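The quarantine rule can be written down as a short sketch. Two points are assumptions, because the slide does not settle them: sites with exactly 20 or exactly 100 CPUs are placed in the higher category here, and the waiting periods are counted in calendar days. The function names are illustrative only.

```python
from datetime import date, timedelta
from typing import Dict


def site_category(cpus: int) -> str:
    """Classify a site by its number of CPUs.

    Assumption: the boundary values 20 and 100, which the slide leaves
    unassigned, fall into "normal" and "high" respectively.
    """
    if cpus < 20:
        return "low"
    if cpus < 100:
        return "normal"
    return "high"


def quarantine_days(cpus: int) -> int:
    """Days to wait between two escalation steps (2.-3. and 3.-4.)."""
    return 1 if site_category(cpus) == "high" else 3


def escalation_dates(first_mail: date, cpus: int) -> Dict[str, date]:
    """Earliest dates for the second mail and the suspension.

    Assumption: calendar days, and no deadline extensions in between.
    """
    wait = timedelta(days=quarantine_days(cpus))
    second_mail = first_mail + wait
    return {"second mail": second_mail, "suspension": second_mail + wait}


# A large site is escalated faster than a small one.
print(escalation_dates(date(2006, 12, 4), cpus=150))
print(escalation_dates(date(2006, 12, 4), cpus=50))
```

Large sites get the shorter period because a failure there affects more resources.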
COD Escalation Procedure
[Flowchart. Nodes: Create ticket (mail); When deadline reached; Problem solved?; Last escalation?; Escalate (mail); Extend deadline (site responds); Suspend site; Close ticket. Branch labels: yes, no, no.]
What a site is expected to do
• Look at the monitoring tools (SAM)
  – try to notice & fix failures before the CODs
• COD notification about a failure
  – fix it ASAP
  – contact the ROC for help if needed
• Scheduled downtime
  – enter it in GOCDB
  – broadcast it in advance
  – broadcast when it's finished
• weekly site reports (at COD portal)
  – input to weekly Operations meeting
What a site could do
• problems → contact the ROC
  – best way: GGUS ticket
• question → ask the ROC
• open a ticket if there is a failure in Central Services
  – LFC, SAM, etc.
Happy End
Thanks for your attention :)