WLCG Monitoring – An overview
LHCOPN Meeting
Madrid, 11th March 2008
James Casey
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
The WLCG Monitoring Vision
Show stakeholders the state of the global WLCG infrastructure, and its historical evolution, in order to improve the availability and reliability of this infrastructure.
What is monitoring for us?
• Service Availability/Reliability
  – Service status provided
  – Availability + Reliability calculated
• Usage records
  – GridFTP, SRM, Grid job execution
  – One record per task, or one record per state change
• Accounting information
  – Daily rollups of usage
• Right now, distributed debugging/service management is not in scope
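To make the step from usage records to accounting information concrete, here is a minimal sketch of a daily rollup. The record fields (`service`, `site`, `finished`, `bytes_moved`), the site name and the sample values are assumptions made for illustration only; they are not a WLCG schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UsageRecord:
    """One record per completed task (hypothetical fields)."""
    service: str        # e.g. "gridftp", "srm", "job"
    site: str           # hypothetical site name
    finished: datetime  # completion time of the task
    bytes_moved: int    # payload size; 0 for tasks that move no data


def daily_rollup(records):
    """Sum task counts and bytes per (day, site, service)."""
    totals = defaultdict(lambda: {"tasks": 0, "bytes": 0})
    for rec in records:
        key = (rec.finished.date().isoformat(), rec.site, rec.service)
        totals[key]["tasks"] += 1
        totals[key]["bytes"] += rec.bytes_moved
    return dict(totals)


# Made-up sample records: two transfers and one job on one day, one transfer on the next.
records = [
    UsageRecord("gridftp", "SITE-A", datetime(2008, 3, 10, 9, 15), 100),
    UsageRecord("gridftp", "SITE-A", datetime(2008, 3, 10, 17, 40), 200),
    UsageRecord("job", "SITE-A", datetime(2008, 3, 10, 23, 59), 0),
    UsageRecord("gridftp", "SITE-A", datetime(2008, 3, 11, 0, 5), 50),
]
rollup = daily_rollup(records)
```

Keeping the rollup keyed by day, site and service means the same totals can feed both a site view and a project-level view.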
Double vision
• Sites and the experiments
  – Both have a view of this data
  – Very complicated stack to trace through
• Try to connect the two perspectives
  – Important for site managers and project management
  – Especially sites which support more than one experiment
Complexity!
Strategy to simplify
• Operations
  – Delegate to regional entities
  – Provide some tools to have a “global” view
• Tools
  – Delegate to experts
  – Standardize on information interchange schemas + protocols
• Reporting
  – Lightweight metric collection
    • Of metrics that are useful for site managers or project management
  – Reporting on top of this
    • For project management
Infrastructure and tools - Message Bus
• WLCG Monitoring Working Group
  – Aim to consolidate the current monitoring effort
• Single message bus for data interchange
  – With reliable message delivery
  – Message persistence
• Isolate producers and consumers from each other
  – Define the message schemas and protocols
• Provide bridges/adaptors as needed
  – NMWG?
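The idea of isolating producers from consumers behind an agreed schema can be sketched in a few lines. This is a toy in-memory broker, not the messaging system used by WLCG: the `Broker` class, the `STATUS_SCHEMA` fields and the topic name are all hypothetical, and a Python list stands in for durable message storage.

```python
import json
from collections import defaultdict

# Hypothetical message schema: required field name -> expected type.
STATUS_SCHEMA = {"service": str, "site": str, "status": str, "timestamp": str}


class Broker:
    """Toy message bus: validates, stores, then fans out each message."""

    def __init__(self):
        self._log = defaultdict(list)          # topic -> stored JSON payloads
        self._subscribers = defaultdict(list)  # topic -> consumer callbacks

    def publish(self, topic, message, schema=STATUS_SCHEMA):
        # Reject messages that do not follow the agreed schema.
        for field, field_type in schema.items():
            if not isinstance(message.get(field), field_type):
                raise ValueError(f"bad or missing field: {field}")
        payload = json.dumps(message, sort_keys=True)
        self._log[topic].append(payload)       # store before delivering
        for callback in self._subscribers[topic]:
            callback(json.loads(payload))

    def subscribe(self, topic, callback, replay=True):
        # A consumer that connects late can replay what it missed.
        if replay:
            for payload in self._log[topic]:
                callback(json.loads(payload))
        self._subscribers[topic].append(callback)


broker = Broker()
broker.publish("status", {"service": "SRM", "site": "SITE-A",
                          "status": "ok", "timestamp": "2008-03-11T10:00:00Z"})

received = []
broker.subscribe("status", received.append)   # consumer arrives after the first message
broker.publish("status", {"service": "SRM", "site": "SITE-A",
                          "status": "down", "timestamp": "2008-03-11T11:00:00Z"})
```

The producer never learns who consumes its messages, and the late consumer still sees both of them, which is the property the bullets above ask of the bus.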
Broker at the centre ..
Leverage the underlying infrastructures
• WLCG is a virtual infrastructure built on top of other physical infrastructures
• Added value?
  – From interoperation and exchange of information between the systems
  – Provide information not available in only one of them
• Don’t add too many layers
  – Enough exist already!
  – E.g. our MoUs should be defined in relation to the SLA/MoU of the infrastructures
LHCOPN and monitoring
• Availability/Reliability
  – Provide E2E link status
  – Create bridge from LHCOPN monitoring to WLCG monitoring
• Usage records
  – At the individual flow level is too detailed
  – Summary statistics should be ok
    • Aggregate rates as seen E2E
    • No need to expose internal complexity
  – Always ask “How could a site admin use this?”
• Reporting
  – Operational statistics
  – MoU reporting
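As a rough illustration of "summary statistics" in place of per-flow records, the sketch below collapses individual flows into one aggregate rate per end-to-end link. The flow fields (`src`, `dst`, `bytes`), the endpoint names and the one-hour window are assumptions for the example, not an LHCOPN data format.

```python
from collections import defaultdict


def summarise_flows(flows, window_seconds):
    """Aggregate per-flow byte counts into one average rate per E2E link."""
    total_bytes = defaultdict(int)
    for flow in flows:
        total_bytes[(flow["src"], flow["dst"])] += flow["bytes"]
    return {
        link: {
            "bytes": nbytes,
            # Average rate over the whole window, in Mbit/s.
            "avg_mbit_per_s": nbytes * 8 / window_seconds / 1e6,
        }
        for link, nbytes in total_bytes.items()
    }


# Made-up flows observed during one hour; endpoint names are hypothetical.
flows = [
    {"src": "T0", "dst": "T1-A", "bytes": 4_500_000_000},
    {"src": "T0", "dst": "T1-A", "bytes": 4_500_000_000},
    {"src": "T0", "dst": "T1-B", "bytes": 900_000_000},
]
summary = summarise_flows(flows, window_seconds=3600)
```

A site admin reading `summary` sees one number per link and nothing of the routing or per-flow detail behind it.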
LHCOPN and Operations
• What are the requirements of LHCOPN?
  – Notification of grid ‘users’ of
    • Service interruptions
    • Status of problem investigations
  – Mechanism for grid users to raise problems against LHC OPN
• GGUS is too complicated for the problem
  – ‘300 supporters’, TPM in the loop
• Perhaps a simpler solution works for notifications
  – “Dashboard” (from Dan’s presentation)
  – Good experiences in CCRC’08
MoU compliance reporting
• We agreed to try and measure MoU metrics during CCRC’08
  – To evaluate if we can actually do it!
Max. delay = maximum delay in responding to operational problems. Availability = average availability measured on an annual basis.

Service | Max. delay: service interruption | Max. delay: capacity degraded by more than 50% | Max. delay: capacity degraded by more than 20% | Availability: during accelerator operation | Availability: at all other times
Acceptance of data from the Tier-0 Centre | 12 hours | 12 hours | 24 hours | 99% | n/a
Networking service to the Tier-0 Centre during accelerator operation | 12 hours | 24 hours | 48 hours | 98% | n/a
Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres | 24 hours | 48 hours | 48 hours | 98% | 98%
All other services – prime service hours | 2 hours | 2 hours | 4 hours | 98% | 98%
All other services – other times | 24 hours | 48 hours | 48 hours | 97% | 97%
https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/
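One way to evaluate whether the MoU metrics can be measured is to encode the targets as data and compare measurements against them. The sketch below transcribes the targets from the table in this section; the function names, the problem-kind labels and the measured values are hypothetical.

```python
# Targets transcribed from the MoU table in this section.
# delay_hours: (service interruption, capacity down > 50%, capacity down > 20%)
# availability_pct: (during accelerator operation, at all other times); None = n/a
MOU_TARGETS = {
    "Acceptance of data from the Tier-0 Centre":
        {"delay_hours": (12, 12, 24), "availability_pct": (99, None)},
    "Networking service to the Tier-0 Centre during accelerator operation":
        {"delay_hours": (12, 24, 48), "availability_pct": (98, None)},
    "Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres":
        {"delay_hours": (24, 48, 48), "availability_pct": (98, 98)},
    "All other services - prime service hours":
        {"delay_hours": (2, 2, 4), "availability_pct": (98, 98)},
    "All other services - other times":
        {"delay_hours": (24, 48, 48), "availability_pct": (97, 97)},
}

# Hypothetical labels for the three delay columns, in table order.
PROBLEM_KINDS = ("interruption", "degradation_50", "degradation_20")


def availability_ok(service, measured_pct, accelerator_running=True):
    """True/False against the target, or None where the MoU says n/a."""
    target = MOU_TARGETS[service]["availability_pct"][0 if accelerator_running else 1]
    return None if target is None else measured_pct >= target


def response_ok(service, problem_kind, response_hours):
    """True if the response came within the maximum delay for this problem kind."""
    limit = MOU_TARGETS[service]["delay_hours"][PROBLEM_KINDS.index(problem_kind)]
    return response_hours <= limit


# Made-up measurements for illustration.
tier0_acceptance = availability_ok("Acceptance of data from the Tier-0 Centre", 98.7)
prime_hours_reply = response_ok("All other services - prime service hours", "interruption", 3)
```

Availability can be checked from service status alone, while `response_ok` needs a record of when a problem was raised and when someone responded, which is the harder part to collect.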
Mapping to MoU Services
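[Table: Tier-1 grid services mapped to MoU categories; only the number of marks per row is recoverable]
Tier-1 grid services (columns): ArcCE, BDII, CE, FTS, LFC, MYPX, OSGCE, RB, RGMA, SE, SRM, SRMv2, VOBOX, gCE, gRB, sBDII
MoU categories (rows):
  Acceptance of data from Tier-0: * • •
  Networking Services to Tier-0: *
  Data-intensive analysis service, including networking to Tier-0: • • • • • •
  All Other Services: • • • • • • •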
• Current availability is per-service
• Map grid services status (from SAM) to MoU categories
  – These are “custom” service availability calculations
• LHC OPN
  – Can provide “Networking services to/from T1”
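A "custom" availability calculation of this kind could look like the sketch below. The service-to-category mapping shown is purely illustrative (it is not the mapping from the slide's table), and the rule that a category counts as up only when all of its services pass their tests is an assumption.

```python
# Illustrative mapping only: which services feed which MoU category is an assumption here.
CATEGORY_SERVICES = {
    "Acceptance of data from Tier-0": ["SRMv2", "FTS"],
    "Networking Services to Tier-0": ["LHCOPN"],
}


def category_availability(category, samples):
    """Fraction of intervals in which every service of the category was up.

    samples: one dict per equal-length interval, mapping service name to
    True (tests passed) or False (tests failed).
    """
    services = CATEGORY_SERVICES[category]
    up_intervals = sum(all(sample[svc] for svc in services) for sample in samples)
    return up_intervals / len(samples)


# Made-up per-interval test results for four intervals.
samples = [
    {"SRMv2": True,  "FTS": True,  "LHCOPN": True},
    {"SRMv2": True,  "FTS": False, "LHCOPN": True},
    {"SRMv2": False, "FTS": True,  "LHCOPN": True},
    {"SRMv2": True,  "FTS": True,  "LHCOPN": False},
]
acceptance = category_availability("Acceptance of data from Tier-0", samples)
networking = category_availability("Networking Services to Tier-0", samples)
```

With this rule a category is never more available than its weakest service, so adding LHC OPN status as one more input fits the same calculation.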
ServiceMap
• What’s a ServiceMap?
  – It’s a gridmap with many different maps, showing different aspects of the WLCG infrastructure
• What’s the CCRC’08 ServiceMap?
  – Service ‘readiness’
  – Service availability
    • For VO critical services
  – VO Functional blocks
• A single place to see both the VO and the infrastructure view of the grid
  – For all stakeholders
CCRC’08 ServiceMap
…Demo…
http://gridmap.cern.ch/ccrc08/servicemap.html
Julia Andreeva, CERN, 04.03.2008 F2F meeting
What are the VO functional blocks?
• Functional blocks for LHC experiments are similar to a large extent
  – Allows for a site to compare the service they provide for different experiments
  – e.g. functional blocks for ATLAS and CMS for CCRC08
[Figure: functional blocks for CMS and ATLAS, by tier]
T0: Data archiving at T0; Data processing at T0; Data transfer from T0; CAF (one experiment only)
T1: Data archiving at Tier1 (one experiment only); Processing at Tier1; Data transfer T1-T1; Data transfer T1-T2
T2: MC production at T2; Analysis at T2; Data transfer T2-T1
Summary
• WLCG monitoring needs for LHCOPN are modest
• Providing service status information would satisfy MoU availability requirements
  – We calculate availability/reliability according to our algorithms
  – Need downtime information too for this
• How to satisfy MoU response time?
  – This is still a wider problem for us
• Test some simpler notification systems
  – elogger + RSS feed?
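To show why downtime information is needed alongside service status, here is a minimal sketch of an availability/reliability calculation. The slides do not spell out the algorithms, so the definitions used here are an assumption: availability is up time over total time, and reliability is up time over total time excluding scheduled downtime. The status labels are hypothetical.

```python
def availability_reliability(statuses):
    """Compute (availability, reliability) from equal-length status intervals.

    statuses: list of "UP", "DOWN" or "SCHEDULED_DOWN" (hypothetical labels).
    Assumed definitions:
      availability = up / total
      reliability  = up / (total - scheduled downtime)
    """
    total = len(statuses)
    up = statuses.count("UP")
    scheduled = statuses.count("SCHEDULED_DOWN")
    availability = up / total
    # Reliability is undefined if the whole period was scheduled downtime.
    reliability = up / (total - scheduled) if total > scheduled else None
    return availability, reliability


# Made-up day of hourly samples: 18 up, 2 unscheduled down, 4 scheduled down.
day = ["UP"] * 18 + ["DOWN"] * 2 + ["SCHEDULED_DOWN"] * 4
availability, reliability = availability_reliability(day)
```

For this sample day the availability is 0.75 and the reliability is 0.9; without the scheduled-downtime flag the two numbers could not be told apart.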