WLCG: Experience from Real Data Taking at the LHC
Jamie.Shiers@cern.ch
WLCG Service Coordination & EGI InSPIRE SA3 / WP6
TNC2010, Vilnius, Lithuania
More information: LHC Physics Centre at CERN


Page 1:

WLCG: Experience from Real Data Taking at the LHC

Jamie.Shiers@cern.ch

WLCG Service Coordination & EGI InSPIRE SA3 / WP6

TNC2010, Vilnius, Lithuania

More information: LHC Physics Centre at CERN

Page 2:

Overview

• Summarize the Worldwide LHC Computing Grid:
  – In terms of scale & scope;
  – In terms of service.

• Highlight the main obstacles faced in arriving at today’s situation

• Issues, concerns & directions for the future

• As this presentation will be given remotely, I have deliberately kept it as free as possible from any graphics or animations


Page 3:

The Worldwide LHC Grid (WLCG)

• Simply put, this is the distributed processing and storage system deployed to handle the data from the world's largest scientific machine – the Large Hadron Collider (LHC)

• Based today on grid technology – including the former EGEE infrastructure in Europe plus the Open Science Grid in the US

• WLCG is more than simply a customer of EGEE: it has been and continues to be a driving force not only in the grid domain but also in others, such as storage and data management

• WLCG has always been about a production service – one that is needed 24x7 on most days (~362) of the year
  – Much activity – particularly at the Tier0 and in Tier0–Tier1 transfers – takes place at night and over weekends (accelerator cycle)


Page 4:

The WLCG Deployment Model

• WLCG is the convergence of grid technology with a specific deployment model, elaborated in the late 1990s in the "Models of Networked Analysis at Regional Centres" (MONARC) project

• This defined the well-known Tier0/Tier1/Tier2 hierarchy that is now common to several disciplines and maps well to International Centre / National Centres / Local Institutes

• MONARC originally foresaw limited networking between Tier0/Tier1/Tier2s – with air freight as a possible backup to (best case) 622 Mbps links (cost!) – as well as a smaller number of centres than we have today
  – We now have redundant 10 Gbps links, T0–T1 and also T1–T1, some of which are on occasion maxed out!

• These base assumptions are currently being re-discussed


Page 5:

The WLCG Scale

• Deployed worldwide: Americas, Europe & Asia-Pacific
• Computational requirements: O(10^5) cores
• Networking requirements: routinely move 1 PB of data per day between grid sites – with significant intra-site requirements
• A single VO transfers >4 GB/s from CERN to the Tier1s over sustained periods (~days)
• Annual growth in stored data: 15 PB
  – Old calculation: the number of copies & location(s) of data may well be revised in coming months, as may trigger rates & event sizes
• Sum of resources at each tier approximately equal
  – 1 Tier0, ~10 Tier1s, ~100 Tier2s
• Sum of tickets at each tier (service metric) also ~equal!
  – A few: rarely as many as 5 per VO per day (OPS meeting)
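The "1 PB per day" figure above translates directly into a sustained average rate; a one-line sanity check (illustrative arithmetic only):

```python
# 1 PB/day expressed as an average sustained transfer rate.
PB = 1e15  # bytes

avg_rate_gbs = PB / 86_400 / 1e9
print(f"1 PB/day ≈ {avg_rate_gbs:.1f} GB/s sustained")
```

i.e. roughly 11.6 GB/s averaged around the clock, which puts the single-VO >4 GB/s CERN–Tier1 transfers in context.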


Page 6:

The WLCG Service

• The WLCG Service is "held together" through (week)daily operations calls / meetings – remote participation is a requirement in all meetings!
  – Attended by representatives from all main LHC experiments, the Tier1 sites and (typically) OSG & GridPP, and chaired by a WLCG Service Coordinator on Duty
• Follow-up on key service issues, site problems & accelerator / experiment news
  – Calls typically last 30' with notes – based on pre-minutes – circulated shortly after
• Longer-term issues, including roll-out of new releases, are handled at fortnightly WLCG Service Coordination meetings
• In addition, regular "Collaboration" and topical workshops
  – Collaboration workshops are attended by 200–300 people, many of whom (e.g. from Tier2s) do not attend the more frequent events
• Experiments also hold regular meetings with their sites, as well as "Jamborees" – WLCG adds value in helping to identify / provide common solutions and/or when dealing with sites supporting multiple VOs
  – Typically the case for Tier1 sites outside of North America


Page 7:

WLCG: Service Reporting

• Regular service reports are made to the WLCG Management Board based on Key Performance Indicators, together with reports on any major service incidents ("SIRs")

• The KPIs: Site Usability based on VO tests; GGUS summaries (user, team & alarm tickets); # of SIRs & their nature

• This can be summarized in one (colour) slide: drill-down provided when not A-OK (usually…)


Page 8:

WLCG Operations Report – Summary


KPI                        Status                                           Comment
GGUS tickets               1 "real" alarm, 4 test alarms; normal # of       Drill-down on real alarms;
                           team and user tickets                            comment on tests
Site Usability             Minor issues                                     Drill-down (hidden)
SIRs & Change assessments  Four incident reports                            Drill-down on each

VO       User   Team   Alarm   Total
ALICE       6      1       1       8
ATLAS      21     67       1      89
CMS         2      3       1       6
LHCb        0     25       2      27
Totals     29     96       5     130
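The weekly GGUS summary above is just per-VO counts; a small sketch of how the Totals row could be recomputed (the per-VO counts are copied from the table, the totals are derived):

```python
# Per-VO ticket counts from the weekly GGUS summary; totals are recomputed.
tickets = {
    "ALICE": {"user": 6,  "team": 1,  "alarm": 1},
    "ATLAS": {"user": 21, "team": 67, "alarm": 1},
    "CMS":   {"user": 2,  "team": 3,  "alarm": 1},
    "LHCb":  {"user": 0,  "team": 25, "alarm": 2},
}

totals = {kind: sum(vo[kind] for vo in tickets.values())
          for kind in ("user", "team", "alarm")}
grand_total = sum(totals.values())
print(totals, grand_total)  # matches the Totals row: 29 / 96 / 5, 130 overall
```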

Page 9:

WLCG: Team & Alarm tickets

• These are two features – along with the weekly ticket summaries – that were introduced for WLCG into GGUS

• Team tickets – shared by “shifters” and passed on from one shift to another

• Alarm tickets – when you need to get someone out of bed: only named, authorized people may use them; only for agreed “critical services”

• Targets for expert intervention & problem resolution: the statistics are fortunately rather low


Page 10:

WLCG: Recent Problems

• Three real problems from the most recent MB report: 2 network-related and 1 data / storage management

1. Loss of DNS in Germany(!) – GGUS black-out
2. Network performance on the CERN – Taiwan link: delays in problem resolution (the problem turned out to be near Frankfurt) due to a holiday weekend
3. Major data issue: a combination of events led to data loss, which IMHO cannot be tolerated


Page 11:

WLCG Networking Issues

• The network works well most of the time: adequate bandwidth, backup links

• Occasional problems are often lengthy (days) to resolve: the future direction is to be more network-centric (breaking the MONARC model, a more flexible "hierarchy", remote data access)

• Network problems: lines cut by motorway construction, fishing trawlers & tsunamis

Tighter integration of network operations with WLCG operations is already needed


Page 12:

WLCG Service: Challenges

• The "Service Challenge" programme was initiated in 2004 to deliver a full production service capable of handling data-taking needs well in advance of real data taking

• Whilst we have recently been congratulated for the status and stability of the service, it has been much more difficult and taken much longer than anyone anticipated

• We should have been where we are today 2-3 years ago!

• The service works, with most problems being addressed sufficiently quickly – but there are still many improvements needed and the operations cost is high (sustainable?)

• These include adapting to constantly evolving technology (multi-core, virtualization, changes in the data & storage management landscape, cloud computing, …)

• As well as the move from EGEE to EGI…


Page 13:

(Some) Data-Related Issues

• Data Preservation
• Storage Management
• Data Management
• Data Access

Page 14:

WLCG & Data: Futures

• Just as our MONARC+grid choice was made some time ago and is now being reconsidered, so too were our data management strategies

• Whilst for a long time HEP's data volumes and rates were special, they no longer are; nor are our requirements along other axes, such as Data Preservation

• "Standard building blocks" now look close to production readiness for at least some of our needs (tape access & management layer, cluster/parallel filesystems, data access…), still requiring some "integration glue" which can hopefully be shared with other disciplines

• "Watch this space" – data & storage management in HEP look about to change (but not too much: it takes a long time to roll out new services and/or migrate multiple PB of data…)


Page 15:

WLCG & EGEE

• The hardening of the WLCG service took place throughout the entire lifetime of the EGEE project series: 2004 – 2010

• The service continues to evolve: one recent change was in the scheduling of site / service interventions with respect to LHC machine stops: it was agreed that no more than 3 sites, or 30% of the total Tier1 resources available to a given VO, can be (scheduled) down at the same time

• Another was the introduction of Change / Risk assessments prior to service interventions: still not sufficiently rigorous

• Constant service evolution is clearly measurable over a time scale of months, e.g. via the Quarterly Reports
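The intervention-scheduling rule above (no more than 3 sites, or 30% of a VO's Tier1 resources, scheduled down at once) can be sketched as a simple check. The site names and resource shares below are invented for illustration; this is not the actual WLCG tooling:

```python
# Hypothetical checker for the downtime-scheduling rule; site names and
# resource shares are invented for illustration.
def downtime_allowed(down_sites: list[str],
                     resource_share: dict[str, float],
                     max_sites: int = 3,
                     max_fraction: float = 0.30) -> bool:
    """True if the proposed simultaneous Tier1 downtimes respect the rule."""
    if len(down_sites) > max_sites:
        return False
    # Small epsilon guards against float rounding at the 30% boundary.
    return sum(resource_share[s] for s in down_sites) <= max_fraction + 1e-9

shares = {"T1-A": 0.25, "T1-B": 0.20, "T1-C": 0.15, "T1-D": 0.10}
print(downtime_allowed(["T1-C", "T1-D"], shares))  # 25% down -> allowed
print(downtime_allowed(["T1-A", "T1-B"], shares))  # 45% down -> refused
```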


Page 16:

EGI InSPIRE SA3 / WP6

• Another significant change or transition that we are now facing is the move from EGEE to EGI

• WP6 "Services for Heavy Users" is the only WP in InSPIRE led by CERN; it addresses a variety of disciplines: HEP, LS, ES, A&A, F, CCMST

• Intent: continue to seek commonalities and synergies, not only within but also across disciplines

• It also "binds" related projects in which CERN is involved in the LS & ES areas, plus links with A&A + F


Page 17:

SA3 – Long-Term Goals

• One of the key issues that needs to be addressed by this Work Package is a model for long-term sustainability

• This could have multiple elements:
  – Some services moving to standard infrastructure;
  – Some of the current effort coming from the VOs themselves: this has been the standard model in HEP for many years!
  – A nucleus of expertise to assist with migrations to new service versions and/or larger changes should nevertheless be foreseen: for HEP this needs to be higher than today

• Expect to elaborate on these issues through regular workshops and conferences, including the EGI Technical and User Fora plus other multi-disciplinary events
  – IEEE NSS & MIC brings together a number of those in SA3


Page 18:

WLCG, the LHC & Experiments

• Metric for WLCG's success:
  – WLCG seen and acknowledged as delivering a service that is an essential part of the LHC physics programme

• Metric for EGI InSPIRE SA3's success:
  – SA3 seen and acknowledged both within and across the disciplines that it supports, but also beyond: the longer-term goals of RoI to science and society

Investment in Research & Education is essential: the motivation clearly goes beyond direct returns


Page 19:

Summary

• WLCG delivers a production service at the tera- & peta-scale; it permits local investment and exploitation, solving the "brain-drain" and related problems of previous solutions

• It has taken longer & been a lot harder than foreseen

• Many of the basic ingredients work well and are applicable beyond grids; the complexity has not always been justified

• Expect some changes, e.g. in Data Management + a more network-centric deployment model, in the coming years: these changes need to be adiabatic


Page 20:

LHCC Referees…

• “First experience of the World-wide LHC Computing Grid (WLCG) with the LHC has been positive. This is very much due to the substantial effort invested over several years during the intensive testing phase and all Tier centres must take credit for this success. The LHCC congratulates the WLCG on the achievements.”

May 2010

Page 21:

BACKUP

Page 22:

Intervention & Resolution Targets

• Targets (not commitments) proposed for Tier0 services
  – Similar targets requested for Tier1s/Tier2s
  – Experience from the first week of CCRC'08 suggests targets for problem resolution should not be set too high (if they are to be ~achievable)

• The MoU lists targets for responding to problems (12 hours for Tier1s)
  – Tier1s: 95% of problems resolved in <1 working day?
  – Tier2s: 90% of problems resolved in <1 working day?

• A post-mortem is triggered when targets are not met!


Time Interval   Issue (Tier0 Services)                         Target
End 2008        Consistent use of all WLCG Service Standards   100%
30'             Operator response to alarm / phone call         99%
1 hour          Operator response to alarm / phone call        100%
4 hours         Expert intervention in response to above        95%
8 hours         Problem resolved                                90%
24 hours        Problem resolved                                99%
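The resolution targets above lend themselves to a simple compliance check; a sketch with invented sample durations (not real WLCG data), compared against the proposed 95% Tier1 target:

```python
# Fraction of problems resolved within one working day, compared against
# the proposed 95% Tier1 target; the sample durations are invented.
def fraction_within(durations_h, limit_h=24.0):
    """Fraction of resolution times (hours) at or under the limit."""
    return sum(d <= limit_h for d in durations_h) / len(durations_h)

sample = [2, 5, 30, 8, 1, 12, 26, 4, 6, 3]  # hours to resolution
frac = fraction_within(sample)
print(f"{frac:.0%} resolved within 1 working day; Tier1 target met: {frac >= 0.95}")
```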