CCRC Status & Readiness for Data Taking [email protected] [email protected] ~~~ 27th May 2008 ~~~ FIO Group Meeting, DM Technical Meeting


Page 1:

CCRC Status & Readiness for Data Taking

[email protected] [email protected] ~~~

27th May 2008 ~~~

FIO Group Meeting, DM Technical Meeting

Page 2:

CCRC ’08 Post-Mortem

• Will be held at CERN June 12-13 in the IT amphitheatre

• To some extent, previews the mini-LHCC review of 1st July

• Will include a much more detailed review of all aspects of CCRC ’08 than is possible / meaningful here!
• Sessions on Tier0/1/2, Databases, Storage / Data Management

• Physical attendance limited by capacity of amphi – registration closes at end of this week (54/80 at 08:30)

• But there’s also EVO…

• There’s also a Communal Curry Repast Challenge (50SF)

2

Page 3:

CCRC’08 – Conclusions (LHCC referees)

• The WLCG service is running (reasonably) smoothly

• The functionality matches what has been tested so far – and what is (known to be) required

We have a good baseline on which to build

• (Big) improvements over the past year are a good indication of what can be expected over the next!

• (Very) detailed analysis of results compared to up-front metrics – in particular from experiments!

3

Page 4:

CCRC ’08 – How Does it Work?

• Experiment “shifters” use Dashboards, experiment-specific SAM-tests (+ other monitoring, e.g. PhEDEx) to monitor the various production activities

• Problems spotted are reported through the agreed channels (ticket + elog entry) – see the sketch after this slide

• Response is usually rather rapid – many problems are fixed in < 1 hour!

• A small number of problems are raised at the daily (15:00) WLCG operations meeting

Basically, this works!
• We review on a weekly basis if problems were not spotted by the above; fix [ + MB report ]
 With time, increase automation, decrease eye-balling

4
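The reporting flow above (a ticket plus an elog entry per problem, rapid response, and escalation of a small number of problems to the daily 15:00 operations meeting) can be illustrated with a minimal sketch. This is not the actual CCRC tooling: the Problem class, its field names and the one-hour threshold are hypothetical illustrations of the workflow described on this slide.

```python
# Minimal sketch (not the real CCRC tooling) of the problem-reporting flow:
# each problem carries a ticket reference and an elog entry, and anything
# still open after a chosen threshold is flagged for the 15:00 meeting.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class Problem:
    ticket_id: str                 # e.g. a GGUS ticket reference (hypothetical field)
    elog_entry: str                # reference to the corresponding elog post
    opened: datetime
    resolved: Optional[datetime] = None

def needs_escalation(p: Problem, now: datetime,
                     threshold: timedelta = timedelta(hours=1)) -> bool:
    """True if the problem is still open beyond the threshold
    (most problems are fixed in under an hour, per the slide)."""
    return p.resolved is None and (now - p.opened) > threshold

def daily_meeting_agenda(problems: List[Problem], now: datetime) -> List[str]:
    """Collect the (usually small) set of tickets to raise at the daily meeting."""
    return [p.ticket_id for p in problems if needs_escalation(p, now)]
```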

Page 5:

WLCG Operations

I am sure that most – if not all – of you are familiar with:
• Daily “CERN” operations meetings at 09:00 in 513 R-068
• Daily “CCRC” operations meetings at 15:00 in 28 R-006 (will move to 513 R-068 if the phone is ever connected)

• In addition, the experiments have (twice-)daily operations meetings and continuous operations in control rooms

• These are pretty light-weight and a (long) proven way of ensuring information flow / problem dispatching

5

Page 6:

The WLCG Service

• Is a combination of services, procedures, documentation, monitoring, alarms, accounting, … and is based on the “WLCG Service Model”.

• It includes both “experiment-specific” as well as “baseline” components

• No-one (that I’ve managed to find…) has a complete overview of all aspects

It is essential that we work together and feel joint ownership – and pride – in the complete service

• It is on this that we will be judged – and not individual components…

• (I’m sure that I am largely preaching to the choir…)

6

Page 7:

CCRC ’08 – Areas of Opportunity

• Tier2s: MC well run in, distributed analysis still to be scaled up to (much) larger numbers of users
• Tier1s: data transfers (T0-T1, T1-T1, T1-T2, T2-T1) now well debugged and working sufficiently well (most of the time…); reprocessing still needs to be fully demonstrated for ATLAS (includes conditions!!!)

Tier0: best reviewed in terms of the experiments’ “Critical Services” lists
• These strongly emphasize data/storage management and database services!
 We know how to run stable, reliable services
• IMHO – these take less effort to run than ‘unreliable’ ones… But they require some minimum amount of discipline…

7

Page 8:

Data / Storage Management

There is no doubt that this is a complex area
• And one which is still changing
• We don’t really know what load and / or access patterns ‘real’ production will give
• But if there is one area that can be expected to be ‘hot’ it is this one…
• IMHO (again) it is even more important that we remain calm and disciplined – “fire-fighting” mode doesn’t work well – except possibly for very short term issues

In the medium term, we can expect (significant?) changes in the way we do data management

Given our experience e.g. with SRM v2.2, how can this be achieved in a non-disruptive fashion, at the same time as (full-scale?) data-taking and production???

8

Page 9:

CCRC ’09 – Outlook

• SL(C)5
• CREAM
• Oracle 11g
• SRM v2.2++
• Other DM fixes…
• SCAS
• [ new authorization framework ]
• …

• 2009 resources
• 2009 hardware
• Revisions to Computing Models
• EGEE III transitioning to more distributed operations
• Continued commissioning, 7+7 TeV, transitioning to normal(?) data-taking (albeit low luminosity?)
• New DG, …

9

Page 10:

Count-down to LHC Start-up!

• Experiments have been told to be ready for beam(s) at any time from the 2nd half of July
• CCRC ’08 continues until the end of this week…
• Basically, that means we have ONE MONTH to make any necessary remaining preparations / configuration changes for first data taking (which won’t have been tested in CCRC…)

There are quite a few outstanding issues – not clear if these are yet fully understood
There must be some process for addressing these
• If the process is the ‘daily CASTOR+SRM operations meeting’, then each sub-service absolutely must be represented!

Almost certainly too many / too complex to address them all – prioritize and address systematically.
• Some will simply remain as “features” to be addressed later – and tested in CCRC ’09?

10

Page 11:

CCRC’08 – Conclusions

• The WLCG service is running (reasonably) smoothly

• The functionality matches what has been tested so far – and what is (known to be) required & the experiments are happy!

We have a good baseline on which to build

• (Big) improvements over the past year are a good indication of what can be expected over the next!

• (Very) detailed analysis of results compared to up-front metrics – in particular from experiments!

12

Page 12:

WLCG BBQ

• The barbecue area in Prévessin has been booked from 17:30 – 23:59 on Wednesday 25th June for a WLCG BBQ

• More details will follow…

13

CCRC ’09 Planning

• Will start at the latest after the summer (if it hasn’t already…)

• A 2-day workshop is foreseen for November 13-14 at CERN

• Next WLCG Collaboration workshop w/e prior to CHEP

Page 13:

Summary of February phase, preparations for May and

[email protected] [email protected]

~~~ LHCC Referees Meeting, 5th May 2008

Page 14:

Agenda

• Status of CCRC’08

• Summary of February run

• Preparations for May run

• and later…

• Conclusions

15

Page 15:

Status of CCRC’08

• CCRC’08 was proposed during the WLCG Collaboration workshop held in Victoria prior to CHEP’07
• It builds on many years of work preparing and hardening the services, the experiments’ computing models and offline production chains
• Data challenges, Service challenges, Dress rehearsals, …

The name says it all:

Common Computing Readiness Challenge

• Are we (together) ready?

16

Page 16:

February run

• Preparations started with “kick-off” F2F meeting in October 2007
• Followed by weekly planning con-calls and monthly F2Fs
• The weekly calls were suspended end-January 2008 – it is not foreseen to restart them!
• Daily “operations-style” con-calls started late January

• Except Mondays (weekly operations) and during F2Fs / workshops

• The February run was generally considered to be successful – although the overlap in terms of experiment activities (particularly Tier0-Tier1 export) was lower than foreseen

• SRM v2.2 production deployment, middleware services, operations aspects – generally better than expected!

Much still left to do in May: more resources, more services, more of everything…

17

Page 17:

What Did We Achieve? (High Level)

• Even before the official start date of the February challenge, it had proven an extremely useful focusing exercise, in helping understand missing and / or weak aspects of the service and in identifying pragmatic solutions

• Although later than desirable, the main bugs in the middleware were fixed (just) in time and many sites upgraded to these versions

• The deployment, configuration and usage of SRM v2.2 went better than had been predicted, with a noticeable improvement during the month

• Despite the high workload, we also demonstrated (most importantly) that we can support this work with the available manpower, although with essentially no effort remaining for longer-term work (more later…)

• If we can do the same in May – when the bar is placed much higher – we will be in a good position for this year’s data taking

• However, there are certainly significant concerns around the available manpower at all sites – not only today, but also in the longer term, when funding is unclear (e.g. post EGEE III)

18

Page 18:

CCRC’08 – Objectives

• Primary objective (next) was to demonstrate that we (sites, experiments) could run together at 2008 production scale
 This includes testing all “functional blocks”:

• Experiment to CERN MSS; CERN to Tier1; Tier1 to Tier2s etc.

• Two challenge phases were foreseen:
1. February: not all 2008 resources in place – still adapting to new versions of some services (e.g. SRM) & experiment s/w
2. May: all 2008 resources in place – full 2008 workload, all aspects of experiments’ production chains

N.B. failure to meet target(s) would necessarily result in discussions on de-scoping!

• Fortunately, the results suggest that this is not needed, although much still needs to be done before, and during, May!

19

Page 19:

Main Lessons Learned

Generally, things worked reasonably well…
 Still improvements in communication are needed!
• Tools still need to be streamlined (e.g. elog-books / GGUS), and reporting automated
• Service dashboards – should be in place before May…
• F2Fs and other meetings working well in this direction!

• Pre-established metrics extremely valuable!
• As well as careful preparation and extensive communication!

Now in continuous production mode – this will continue – as will today’s infrastructure & meetings

20

Page 20:

Recommendations – mid-February!

To improve communications with Tier2s and the DB community, 2 new mailing lists have been set up, as well as regular con-calls with Asia-Pacific sites (time zones…)

• Follow-up on the lists of “Critical Services” must continue, implementing not only the appropriate monitoring, but also ensuring that the WLCG “standards” are followed for Design, Implementation, Deployment and Operation
 We are having some (small) success in convincing others in the Grid….

• Clarify reporting and problem escalation lines (e.g. operator call-out triggered by named experts, …) and introduce (light-weight) post-mortems when MoU targets not met

We must continue to improve on open & transparent reporting, as well as further automations in monitoring, logging & accounting

We should foresee “data taking readiness” challenges in future years – probably with a similar schedule to this year – to ensure that the full chain (new resources, new versions of experiment + AA s/w, middleware, storage-ware) is ready [maybe also July 2008!]

21

Page 21:

22

Page 22:

27

Page 23:

28

May Run – Introduction

• 2nd phase of CCRC’08 is from Monday 5th May to Friday 30th.
• This was immediately preceded by a long w/e (Thu-Sun) at CERN
• May is also a month with quite a few public holidays
• ~1 per week…

 Everyone is keen that we improve on February in terms of being ready well in time
• And in terms of defining what “ready” actually means…

• Detailed presentations on middleware, storage-ware, databases, experiments’ plans & requirements at April’s F2F
• This can be summarized in a couple of tables, on the CCRC’08 wiki, as was done for February – very small updates wrt February

 Further discussions and clarifications during the WLCG Collaboration Workshop, 21 – 25 April at CERN

Page 24:

Critical Services / Service Reliability

• Summary of key techniques (D.I.D.O. – Design, Implementation, Deployment, Operation) and status of deployed middleware
• Techniques widely used – in particular for Tier0 (“WLCG” & “experiment” services)
 Quite a few well identified weaknesses – action plan?

• Measuring what we deliver – service availability targets (see the sketch after this slide)

• Testing experiment services
• Update on work on ATLAS services
• To be extended to all 4 LHC experiments
• Top 2 ATLAS services: “ATLONR” & DDM catalogs
• Request(?) to increase availability of latter during DB sessions
• Possible use of DataGuard to another location in CC?

29
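The bullet on “measuring what we deliver – service availability targets” boils down to simple arithmetic over recorded downtime. Below is a hedged sketch, not any existing WLCG tool: the outage list, the 99% target in the usage example and the function name are illustrative assumptions.

```python
# Sketch: compute a service's availability over a period from downtime
# intervals and compare it against a target (placeholder values, not
# official WLCG numbers).
from datetime import datetime
from typing import List, Tuple

def availability(period_start: datetime, period_end: datetime,
                 downtimes: List[Tuple[datetime, datetime]]) -> float:
    """Fraction of the period during which the service was up."""
    total = (period_end - period_start).total_seconds()
    down = sum(
        (min(end, period_end) - max(start, period_start)).total_seconds()
        for start, end in downtimes
        if end > period_start and start < period_end)   # clip to the period
    return 1.0 - down / total

if __name__ == "__main__":
    start, end = datetime(2008, 5, 1), datetime(2008, 6, 1)
    outages = [(datetime(2008, 5, 10, 9), datetime(2008, 5, 10, 13))]  # made-up outage
    avail = availability(start, end, outages)
    print(f"availability = {avail:.4%}, 99% target met: {avail >= 0.99}")
```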

Page 25:

Storage Issues

• Covered not only baseline versions of storage-ware for the May run of CCRC’08, but also the timeline for phase-out of SRM v1.1 services, the schedule (and some details…) of the SRM v2.2 MoU Addendum, and …

• Storage Performance / Optimization / Monitoring
• Some (new) agreed metrics for the May run (see the sketch after this slide), e.g.

1. Sites should split the monitoring of the tape activity by production and non-production users;

2. Sites should measure the number of files read/written per mount (this should be much greater than 1);

3. Sites should measure the amount of data transferred during each mount.

• Clear willingness to work to address problems
• e.g. ALICE plan files of ~10GB

30
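The per-mount tape metrics agreed for the May run (files per mount, data per mount, split by production vs non-production users) could be computed along the lines of the sketch below. The MountRecord format and function name are hypothetical; real sites would derive the inputs from their own tape-system logs.

```python
# Sketch, under an assumed input format, of the per-mount tape metrics:
# average files per mount (should be well above 1) and average GB moved
# per mount, grouped by production vs non-production users.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MountRecord:
    user_class: str      # "production" or "non-production"
    files: int           # files read or written during this mount
    bytes_moved: int     # bytes transferred during this mount

def per_mount_metrics(mounts: List[MountRecord]) -> Dict[str, Tuple[float, float]]:
    """Return {user_class: (avg files per mount, avg GB per mount)}."""
    grouped: Dict[str, List[MountRecord]] = defaultdict(list)
    for m in mounts:
        grouped[m.user_class].append(m)
    result: Dict[str, Tuple[float, float]] = {}
    for user_class, recs in grouped.items():
        n = len(recs)
        avg_files = sum(r.files for r in recs) / n
        avg_gb = sum(r.bytes_moved for r in recs) / n / 1e9
        result[user_class] = (avg_files, avg_gb)
    return result
```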

Page 26:

CCRC08 F2F meeting – Flavia Donno 31

Storage Baseline Versions for CCRC08 in May

Implementation – Addresses

CASTOR (SRM v1.3-21, b/e 2.1.6-12):
Possible DB deadlock scenarios; srmLs return structure now conforms; various minor DB fixes; fix for leaking sockets when srmCopy attempted; correct user mappings in PutDone; log improvements; better bulk deletion(?)

dCache (1.8.0-15, p1, p2, p3 cumulative; http://trac.dcache.org/trac.cgi/wiki/manuals/CodeChangeLogs):
T1D1 can be set to T1D0 if the file is NOT in a token space; there is a new PinManager available (improved stability); Space Tokens can be specified in a directory, to become the token used for writing if no other information is provided; dCache provides to the HSM script: Directory/file, Space Token, Space Token Description (new)

DPM (1.6.7-4):
Issues fixed for 1.6.7-4: pool free space correctly updated after filesystem drain and removal; SRM2.2 srmMkdir will now create directories that are group writable (if the default ACL of the parent gives that permission). Known issues or points: no srmCopy is available; only round robin selection of filesystems within a pool; no transfer stream limit per node

StoRM (1.3.20; http://storm.forge.cnaf.infn.it/documentation/storm_release_plan):
Upgraded information providers, both static and dynamic; fix on the file size returned by srmPrepareToGet/srmStatusOfPtG for files > 2GB; support for ROOT protocol; “default ACLs” on Storage Areas that are automatically set for newly created files; srmGetSpaceMetaData bound with quota information; improved support for Tape1Disk1 Storage Class with GPFS 3.2 and TSM

Page 27:

SRM v1.1 Services – R.I.P. ??

• When can sites de-commission SRM v1.1 services?

• Pre-May is presumably not realistic… (i.e. end-April)

• Experience from May will presumably (re-)confirm that SRM v2.2 is (extremely fully) ready for business…

• Decide at June post-mortem workshop?

Set tentative target of end June?

• Is this too aggressive? Too mild?

Looks like by end-2008 WLCG can be declared an “SRM v1.1-free zone!”

32

Page 28:

SRM v2.2 MoU Addendum - Timeline

• 19 May – technical agreement with implementers
• 26 May – experiment agreement
• 2 June – manpower estimates / implementation timelines
• 10 June – MB F2F: approval!

Essential that representatives have authority to sign-off on behalf of their site / experiment!

• Target for production versions is that of CCRC’09

Test versions must be available well beforehand!

33

Page 29:

Middleware Baseline

Component Patch # Status

LCG CE Released gLite 3.1 Update 20

FTS (T0) Released gLite 3.0 Update 42

Released gLite 3.1 Update 18

gFAL/lcg_utils Released gLite 3.1 Update 20

DPM 1.6.7-4

Patch #1752

Patch #1740

Patch #1738

Patch #1706

Patch #1671

FTS (T1) Released gLite 3.0 Update 41

CCRC May recommended versions

Page 30:

CCRC’08 – Actions for May

• Middleware & storage-ware versions defined
• Numerous “operational” improvements to be implemented
 (Consistent) Tier0 operator alarm mailing list for all experiments, co-owned by IT & experiments
 [email protected]
• NL-T1 alarm mechanism for Tier1s – sites who cannot implement this should say so asap
• Possible GGUS improvements – on-hold until after May…
• Better follow-up on “MoU targets” and other issues at ‘daily’ meetings
• Monitoring improvements – still on-going…
• Clarification of roles / responsibilities of Tier2 coordinators / contacts…
 New – metrics (from experiments) for DB services

35

Page 31:

Databases (1/2)

• Databases are behind pretty much all of the services classified as “Very Critical” by the experiments – including data / storage management services!
• Oracle for very critical services, also MySQL, PostgreSQL for others
 But problems – such as “DB house-keeping” – are often similar!

• A set of (Oracle) techniques – first identified in 2004(!) – are in full production to cover a variety of needs
 See next slide for a more complete list of DB services at Tier0
• Some more or less equivalent techniques also exist for other implementations, but CASTOR, FTS, LFC(*) are primarily Oracle…
 “3D” deployment is now complete – full service mode
• Which naturally includes agreed service evolution…
• Responsibility for coordination passes from Dirk Düllmann to Maria Girone (+ Tier1 DBAs)
 A “DB track” has long been held at WLCG Collaboration workshops – and will continue
 Tier0 DB support model remains 8x5, otherwise “best-effort”
• This includes support for Oracle Streams issues

36

* Some LFC backends on MySQL – migrations will take time! (2009 for TRIUMF)

Page 32:

(Some) Critical Services Using DBs

Experiment – Service – Hosts / Affects

(all) – Conditions, book-keeping, data & storage management, file transfer, …
ATLAS – ATONR – PVSS, DCS, Run Control, and Luminosity Block data
ATLAS – DDM catalogs – No access to data catalogues for production or analysis. All activities stop.
CMS – Oracle – Data taking & T0 production stops
CMS – SRM, CASTOR – T0 production stops
LHCb – LFC, VOMS – …

38

Page 33:

Databases (2/2)

• At the Tier0, the move to 2008 h/w was successfully completed prior to the WLCG Collaboration workshop (using DataGuard)
• Some concerns with the rate of storage controller failures led to a proposal of retaining the old hardware as “standby” to the new…
• Latest Oracle release (10.2.0.4) and April patches (“cpu”) not yet deployed for production, but for test & integration
• Schedule too tight to allow sufficient testing prior to May
• Arguments for upgrading to this version prior to data taking
• Foresee a slot in June? (Other upgrades also likely…)

 R/O replicas of the LFC at LHCb Tier1 sites now complete
 Some sites still migrating from MySQL to Oracle for ATLAS LFCs – main requirement is operational stability – deployment of RAC + DNS load-balancing middle-tier also relevant here (failure = cloud failure)

40

Page 34:

41

Alexandre Vaniachine, WLCG Workshop, CERN, April 23, 2008

Proposed Targets for 3D Milestones in CCRC08-2

The last LCG Management Board requested to collect any remaining ATLAS milestones for the 3D setup

Reprocessing during CCRC08-2 in May provides an opportunity to validate that the 3D deployment is ready for LHC data taking

Robust Oracle RAC operation at the Sessions/Node target demonstrates that the 3D setup at the site is ready for ATLAS data taking
– Achieving 50% of the target shows that the site is 50% ready (see the sketch after this slide)

 In addition, during the ATLAS reprocessing week DBAs at the Tier-1 sites should collect Oracle performance metrics: CPU, I/O and cache hit ratios
– Details will be presented to DBAs at the Database Track

Tier-1 site – Sessions/Node
TRIUMF – 40
FZK – 80
IN2P3 – 60
CNAF – 40
SARA – 140
NDGF – 100
ASGC – 80
RAL – 90
BNL – 220
PIC – 40

Unclear if the requested resources are OK for these targets (they came kinda late…)
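The “percent ready” arithmetic above (a site sustaining 50% of its Sessions/Node target counts as 50% ready) is shown in the small sketch below. The targets are taken from the table on this slide; the observed session counts and the function name are made-up illustrations.

```python
# Sketch of the 3D readiness arithmetic: fraction of the Sessions/Node
# target achieved per Tier-1 site, capped at 100%.
TARGETS = {"TRIUMF": 40, "FZK": 80, "IN2P3": 60, "CNAF": 40, "SARA": 140,
           "NDGF": 100, "ASGC": 80, "RAL": 90, "BNL": 220, "PIC": 40}

def readiness(observed_sessions_per_node: dict) -> dict:
    """Fraction of the 3D target achieved per Tier-1 site (sites with no
    measurement count as 0)."""
    return {site: min(observed_sessions_per_node.get(site, 0) / target, 1.0)
            for site, target in TARGETS.items()}

if __name__ == "__main__":
    observed = {"BNL": 110, "RAL": 90}          # hypothetical measurements
    for site, frac in readiness(observed).items():
        print(f"{site:7s} {frac:.0%} ready")
```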

Page 35:


DB metrics at Tier0 and Tier1

• At Tier0 we collect and show (summarised in the sketch after this slide):
– Database resource usage per node
• OS Load
• CPU Utilization
• Physical Reads/Writes (MB reads per second)
– Application resource usage per node
• DB CPU (time spent on CPU)
• DB Time (time on CPU + Wait times)
• User I/O time (time spent on I/O)
• Physical Reads (MB reads per second)
• Application wait time (waits on commits, etc.)
• Execute count (number of queries executed)

• For each Tier1 we collect and show:
– Resource usage at the Tier1 database
• CPU usage, Streams pool usage
– Streams activity
• LCRs (changes) captured, enqueued, propagated, dequeued, applied
• Latency
• State of each step (capture, propagation, apply)
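As one way of turning the metrics listed above into summary numbers, the sketch below computes the fraction of DB Time spent on CPU for a Tier0 database and the LCR throughput plus worst apply latency for Streams at a Tier1. The sample classes are a hypothetical schema, not the actual Tier0/Tier1 monitoring.

```python
# Hypothetical-schema sketch of summarising the DB metrics on this slide.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Tier0Sample:
    db_cpu_s: float            # DB CPU: time spent on CPU
    db_time_s: float           # DB Time: CPU time + wait time
    physical_reads_mb_s: float # physical reads in MB per second

@dataclass
class StreamsSample:
    lcrs_applied: int          # logical change records applied in the interval
    latency_s: float           # capture-to-apply latency

def cpu_fraction(samples: List[Tier0Sample]) -> float:
    """Share of DB Time spent on CPU (the remainder is waits, e.g. user I/O).
    Assumes a non-empty sample list."""
    return sum(s.db_cpu_s for s in samples) / sum(s.db_time_s for s in samples)

def streams_summary(samples: List[StreamsSample]) -> Tuple[int, float]:
    """Total LCRs applied and the worst observed apply latency."""
    return (sum(s.lcrs_applied for s in samples),
            max(s.latency_s for s in samples))
```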

Page 36:

Running The Challenge

• The daily / weekly / monthly meetings are working reasonably well and are regularly optimized
• e.g. tweaks to agenda and / or representation
 Suppression of meetings no longer needed

• More work needs to be done on:
• Awareness of where to find information
• Awareness of existing summaries
• Better usage by sites of the above!

• Automated reporting and stream-lined monitoring could ease the burden on those involved

Most likely: incremental changes driven by production needs for many months to come!

• Strong motivation for common solutions
• Across experiments, sites, Grids…

46

Page 37:

Julia Andreeva, CERN – WLCG Workshop

47

Gridmap for monitoring of the workflows of the Experiments

Roll-up “Functional Blocks”, “Critical Services” & MoU Targets into one view with drill-down?

Page 38:

Critical Service Follow-up

• Targets (not commitments) proposed for Tier0 services
• Similar targets requested for Tier1s/Tier2s
• Experience from the first week of CCRC’08 suggests targets for problem resolution should not be too high (if ~achievable)
• The MoU lists targets for responding to problems (12 hours for T1s)
• Tier1s: 95% of problems resolved < 1 working day? (see the sketch after the table below)
• Tier2s: 90% of problems resolved < 1 working day?

 Post-mortem triggered when targets not met!

48

Time Interval – Issue (Tier0 Services) – Target
End 2008 – Consistent use of all WLCG Service Standards – 100%
30’ – Operator response to alarm / call to x5011 / alarm e-mail – 99%
1 hour – Operator response to alarm / call to x5011 / alarm e-mail – 100%
4 hours – Expert intervention in response to above – 95%
8 hours – Problem resolved – 90%
24 hours – Problem resolved – 99%
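The proposed Tier1/Tier2 resolution targets (95% / 90% of problems resolved in under one working day) can be checked mechanically. The sketch below assumes resolution times already extracted from the ticketing system and an eight-hour working day; both are illustrative assumptions rather than anything defined in the MoU.

```python
# Sketch of checking "X% of problems resolved within one working day"
# against a list of resolution times from a ticketing system.
from datetime import timedelta
from typing import List

WORKING_DAY = timedelta(hours=8)   # assumption: one working day = 8 hours

def fraction_within(resolution_times: List[timedelta],
                    limit: timedelta = WORKING_DAY) -> float:
    """Fraction of problems resolved within the limit (1.0 if no problems)."""
    if not resolution_times:
        return 1.0
    return sum(t <= limit for t in resolution_times) / len(resolution_times)

def meets_target(resolution_times: List[timedelta], target: float) -> bool:
    """e.g. target=0.95 for Tier1s, 0.90 for Tier2s."""
    return fraction_within(resolution_times) >= target
```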

Page 39:

Outlook – 2008

 Stable baseline service with (small) incremental improvements throughout the year
• Detailed post-mortem of May CCRC’08 in June (12-13)
• Still need to converge on additional service issues:
• SRM v2.2 MoU addendum, monitoring improvements, better use of storage resources (file size, batching of requests, …)

 It is inevitable that both the May run and first pp data taking will bring with them new issues – and solutions!

• Continued (late) deployment of 2008 resources over several months
• A further CCRC’08 phase in July should be expected to test these resources and any further service enhancements

 First pp data taking in 2nd half of the year…

49

Page 40:

Outlook – Beyond

 We can still expect some significant increases in resources, plus changes in some of the services / experiment-ware, for the foreseeable future

• Some examples of possible changes:
• Computing Models – based on real-life experience;
• LCG OPN (recent proposal rejected);
• Operations infrastructure – move from EGEE III to EGI(?)
• …

 Expecting everything to work without a large-scale test of all experiments and all sites is contrary to our experience

 As previously discussed (February), a “readiness challenge” for each year’s data taking should be foreseen
• And scheduled to be consistent with needs of reprocessing!

50

50

Page 41:

CCRC’08 – Conclusions

• The WLCG service is running (reasonably) smoothly

• The functionality matches what has been tested so far – and what is (known to be) required

We have a good baseline on which to build

• (Big) improvements over the past year are a good indication of what can be expected over the next!

• (Very) detailed analysis of results compared to up-front metrics – in particular from experiments!

51