WLCG Monitoring – An overview

CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/
LHCOPN Meeting, Madrid, 11th March 2008
James Casey



Page 1: WLCG Monitoring – An overview

CERN IT Department

CH-1211 Geneva 23

Switzerland

www.cern.ch/it

LHCOPN Meeting

Madrid, 11th March 2008

James Casey

WLCG Monitoring – An overview

Page 2: WLCG Monitoring – An overview


The WLCG Monitoring Vision

Show stakeholders the state of the global WLCG infrastructure, and its historical evolution, in order to improve the availability and reliability of this infrastructure.

Page 3: WLCG Monitoring – An overview


What is monitoring for us?

• Service Availability/Reliability
  – Service status provided
  – Availability + Reliability calculated

• Usage records
  – GridFTP, SRM, Grid Job execution
  – One record per task, or one record per state change

• Accounting information
  – Daily rollups of usage

• Right now, distributed debugging/service management not in scope
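The slide says availability and reliability are calculated from service status but does not give the algorithm. A minimal sketch, assuming the commonly quoted definitions (availability = time up / time with known status; reliability = time up / time with known status excluding scheduled downtime). The status labels and the one-sample-per-hour layout are illustrative assumptions, not taken from the talk.

```python
from collections import Counter

# Hypothetical status labels; the slides do not define them.
UP, DOWN, SCHEDULED, UNKNOWN = "UP", "DOWN", "SCHEDULED_DOWNTIME", "UNKNOWN"


def availability_reliability(hourly_status):
    """Return (availability, reliability) for a list of per-hour status labels.

    Assumed definitions:
      availability = up / (total - unknown)
      reliability  = up / (total - unknown - scheduled downtime)
    None is returned for a ratio whose denominator would be zero.
    """
    counts = Counter(hourly_status)
    known = len(hourly_status) - counts[UNKNOWN]
    unscheduled = known - counts[SCHEDULED]
    availability = counts[UP] / known if known else None
    reliability = counts[UP] / unscheduled if unscheduled else None
    return availability, reliability


# One made-up day: 20 h up, 2 h scheduled downtime, 1 h down, 1 h unknown.
day = [UP] * 20 + [SCHEDULED] * 2 + [DOWN] + [UNKNOWN]
avail, rel = availability_reliability(day)
print(f"availability={avail:.3f} reliability={rel:.3f}")
```

Scheduled downtime lowers availability but not reliability, which is why downtime information has to be collected alongside service status.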

Page 4: WLCG Monitoring – An overview


Double vision

• Sites and the experiments
  – Both have a view of this data
  – Very complicated stack to trace through

• Try to connect the two perspectives
  – Important for site managers and project management
  – Especially sites which support more than one experiment

Page 5: WLCG Monitoring – An overview


Complexity !


Page 6: WLCG Monitoring – An overview


Strategy to simplify

• Operations
  – Delegate to regional entities
  – Provide some tools to have a “global” view

• Tools
  – Delegate to experts
  – Standardize on information interchange schemas + protocols

• Reporting
  – Lightweight metric collection
    • Of metrics that are useful for site managers or project management
  – Reporting on top of this
    • For project management

Page 7: WLCG Monitoring – An overview


Infrastructure and tools - Message Bus

• WLCG Monitoring Working Group
  – Aim to consolidate the current monitoring effort

• Single message bus for data interchange
  – With reliable message delivery
  – Message persistence

• Isolate producers and consumers from each other
  – Define the message schemas and protocols

• Provide bridges/adaptors as needed
  – NMWG?
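The idea of isolating producers from consumers behind a shared schema can be illustrated with a toy in-memory bus. This is only a sketch: the `StatusMessage` fields, the topic name and the `MessageBus` class are hypothetical, and a real deployment would use a message broker rather than a Python list for persistence.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass
import json


@dataclass(frozen=True)
class StatusMessage:
    """Hypothetical schema agreed between producers and consumers."""
    service: str    # e.g. "SRM"
    site: str       # illustrative site name
    status: str     # e.g. "OK", "CRITICAL"
    timestamp: str  # ISO 8601 string


class MessageBus:
    """Toy topic-based bus: persists every message, then delivers it."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks
        self._log = []                         # persisted (topic, JSON payload)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        payload = json.dumps(asdict(message))  # serialise according to the schema
        self._log.append((topic, payload))     # persist before delivery
        for callback in self._subscribers[topic]:
            callback(json.loads(payload))      # consumers only ever see the payload

    def replay(self, topic):
        """Return every persisted message for a topic, oldest first."""
        return [json.loads(p) for t, p in self._log if t == topic]


bus = MessageBus()
received = []
bus.subscribe("service.status", received.append)
bus.publish(
    "service.status",
    StatusMessage("SRM", "EXAMPLE-SITE", "OK", "2008-03-11T10:00:00Z"),
)
print(received[0]["status"])
```

The producer never references the consumer: both depend only on the topic name and the schema, so a bridge or adaptor for another monitoring system is just one more subscriber.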

Page 8: WLCG Monitoring – An overview


Broker at the centre ..


Page 9: WLCG Monitoring – An overview


Leverage the underlying infrastructures

• WLCG is a virtual infrastructure built on top of other physical infrastructures

• Added value?
  – From interoperation and exchange of information between the systems
  – Provide information that is not available from any single system

• Don’t add too many layers
  – Enough exist already!
  – E.g. our MoUs should be defined in relation to the SLAs/MoUs of the infrastructures

Page 10: WLCG Monitoring – An overview


LHCOPN and monitoring

• Availability/Reliability
  – Provide E2E link status
  – Create bridge from LHCOPN monitoring to WLCG monitoring

• Usage records
  – At the individual flow level is too detailed
  – Summary statistics should be ok
    • Aggregate rates as seen E2E
    • No need to expose internal complexity
  – Always ask “How could a site admin use this?”

• Reporting
  – Operational statistics
  – MoU reporting
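As an illustration of publishing summary statistics instead of individual flows, the sketch below collapses per-flow records into one average end-to-end rate per site pair. The record layout, the site names and the one-hour window are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical flow records: (source site, destination site, bytes transferred).
flows = [
    ("T0", "T1-A", 4_000_000_000),
    ("T0", "T1-A", 6_000_000_000),
    ("T0", "T1-B", 1_000_000_000),
]


def summarise(flow_records, window_seconds):
    """Return the average rate in MB/s per (source, destination) pair.

    Individual flows are summed, so only the end-to-end aggregate is exposed.
    """
    totals = defaultdict(int)
    for src, dst, nbytes in flow_records:
        totals[(src, dst)] += nbytes
    return {pair: nbytes / window_seconds / 1e6 for pair, nbytes in totals.items()}


rates = summarise(flows, window_seconds=3600)
for (src, dst), rate in sorted(rates.items()):
    print(f"{src} -> {dst}: {rate:.2f} MB/s")
```

A site admin gets one number per link and time window, which answers "is my link carrying data?" without exposing the internals of the network.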

Page 11: WLCG Monitoring – An overview


LHCOPN and Operations

• What are the requirements of LHCOPN?
  – Notification of grid ‘users’ of
    • Service interruptions
    • Status of problem investigations
  – Mechanism for grid users to raise problems against LHC OPN

• GGUS is too complicated for the problem
  – ‘300 supporters’, TPM in the loop

• Perhaps a simpler solution works for notifications
  – “Dashboard” (from Dan’s presentation)
  – Good experiences in CCRC’08

Page 12: WLCG Monitoring – An overview


MoU compliance reporting

• We agreed to try and measure MoU metrics during CCRC’08
  – To evaluate if we can actually do it!

Maximum delay in responding to operational problems (columns 2 to 4); average availability measured on an annual basis (columns 5 and 6):

Service | Service interruption | Degradation of the capacity of the service by more than 50% | Degradation of the capacity of the service by more than 20% | During accelerator operation | At all other times
Acceptance of data from the Tier-0 Centre | 12 hours | 12 hours | 24 hours | 99% | n/a
Networking service to the Tier-0 Centre during accelerator operation | 12 hours | 24 hours | 48 hours | 98% | n/a
Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres | 24 hours | 48 hours | 48 hours | 98% | 98%
All other services – prime service hours | 2 hours | 2 hours | 4 hours | 98% | 98%
All other services – other times | 24 hours | 48 hours | 48 hours | 97% | 97%

https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/
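A compliance check against these targets could look like the sketch below. The targets are the "during accelerator operation" availability figures from the MoU table on this slide; the shortened category names and the measured values are made up for illustration, and the function name is hypothetical.

```python
# Availability targets during accelerator operation, in percent (from the MoU table).
MOU_TARGETS = {
    "Acceptance of data from the Tier-0 Centre": 99.0,
    "Networking service to the Tier-0 Centre": 98.0,
    "Data-intensive analysis services": 98.0,
    "All other services (prime service hours)": 98.0,
    "All other services (other times)": 97.0,
}


def compliance_report(measured):
    """Return {category: (measured, target, compliant)} for each measured category."""
    return {
        name: (value, MOU_TARGETS[name], value >= MOU_TARGETS[name])
        for name, value in measured.items()
    }


# Made-up measurements, for illustration only.
measured = {
    "Acceptance of data from the Tier-0 Centre": 99.4,
    "All other services (other times)": 96.1,
}
report = compliance_report(measured)
for name, (value, target, ok) in report.items():
    print(f"{name}: {value}% (target {target}%) {'OK' if ok else 'BELOW TARGET'}")
```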

Page 13: WLCG Monitoring – An overview


Mapping to MoU Services

Tier-1 grid services (table columns): ArcCE, BDII, CE, FTS, LFC, MYPX, OSGCE, RB, RGMA, SE, SRM, SRMv2, VOBOX, gCE, gRB, sBDII

MoU categories (table rows, with the marks that survive; their column positions are not recoverable):
  – Acceptance of data from Tier-0: * • •
  – Networking Services to Tier-0: *
  – Data-intensive analysis service, including networking to Tier-0: • • • • • •
  – All Other Services: • • • • • • •

Table label: LHC OPN

• Current availability is per-service

• Map grid services status (from SAM) to MoU categories
  – These are “custom” service availability calculations

• LHC OPN
  – Can provide “Networking services to/from T1”
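One way to picture a "custom" per-category calculation is to combine the per-service status samples of every service mapped to a category. The sketch below assumes a category counts as up in a time bin only when all of its services are up; both that rule and the service-to-category mapping shown are hypothetical, since the real assignment is the matrix on the slide.

```python
# Hypothetical mapping, for illustration; the real one is the slide's matrix.
CATEGORY_SERVICES = {
    "Acceptance of data from Tier-0": ["SRMv2", "FTS"],
    "All Other Services": ["CE", "BDII"],
}


def category_availability(samples, services):
    """Fraction of time bins in which every listed service was up.

    samples: list of {service_name: bool} dicts, one per time bin.
    A service missing from a bin is treated as down (an assumption).
    """
    if not samples:
        return None
    up_bins = sum(all(sample.get(s, False) for s in services) for sample in samples)
    return up_bins / len(samples)


# Four made-up time bins of per-service status, as a SAM-like source might report.
samples = [
    {"SRMv2": True, "FTS": True, "CE": True, "BDII": True},
    {"SRMv2": True, "FTS": False, "CE": True, "BDII": True},
    {"SRMv2": True, "FTS": True, "CE": True, "BDII": True},
    {"SRMv2": False, "FTS": True, "CE": True, "BDII": True},
]
result = {
    category: category_availability(samples, services)
    for category, services in CATEGORY_SERVICES.items()
}
print(result)
```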

Page 14: WLCG Monitoring – An overview


ServiceMap

• What’s a ServiceMap?
  – It’s a gridmap with many different maps, showing different aspects of the WLCG infrastructure

• What’s the CCRC’08 ServiceMap?
  – Service ‘readiness’
  – Service availability
    • For VO critical services
  – VO Functional blocks

• A single place to see both the VO and the infrastructure view of the grid
  – For all stakeholders

Page 15: WLCG Monitoring – An overview


CCRC’08 ServiceMap

…Demo…

http://gridmap.cern.ch/ccrc08/servicemap.html


Page 16: WLCG Monitoring – An overview


Julia Andreeva, CERN, 04.03.2008 F2F meeting

What are the VO functional blocks?

• Functional blocks for LHC experiments are similar to a large extent
  – Allows for a site to compare the service they provide for different experiments
  – e.g. functional blocks for ATLAS and CMS for CCRC08

[Figure: functional blocks for CMS and ATLAS, grouped by tier]

T0: Data archiving at T0; Data processing at T0; Data transfer from T0; CAF (listed for one experiment only)

T1: Data archiving at Tier1 (listed for one experiment only); Processing at Tier1; Data transfer T1-T1; Data transfer T1-T2

T2: MC production at T2; Analysis at T2; Data transfer T2-T1

Page 17: WLCG Monitoring – An overview


Summary

• WLCG monitoring needs for LHCOPN are modest

• Providing service status information would satisfy MoU availability requirements
  – We calculate availability/reliability according to our algorithms
  – Need downtime information too for this

• How to satisfy MoU response time?
  – This is still a wider problem for us

• Test some simpler notification systems
  – elogger + RSS feed?