TRANSCRIPT
The WLCG Service Starts Here…
SC4 Production == WLCG Pilot
---Jamie Shiers
IT-GD Group Meeting, July 7th 2006
My (CERN) Background
Started at CERN in CO group back in 1984
Started at CERN as student in 1978…
Since then, we’ve had the following major accelerator startups:
pp collider at CERN; LEP; FNAL collider runs I & II; SLC at SLAC; (others too…)
Enjoy the calm, relaxing environment you currently have…
(The quiet before the storm…)
The Worldwide LHC Computing Grid
Purpose: Develop, build and maintain a distributed computing environment for the storage and analysis of data from the four LHC experiments
Ensure the computing service … and common application libraries and tools
Phase I – 2002-05 – Development & planning
Phase II – 2006-2008 – Deployment & commissioning of the initial services
The solution!
July 2006 WLCG Service Challenges: Overview and Outlook
Overview
• SC4 Phases:
– Throughput Phase (April)
• May was reserved for gLite 3.0 upgrades
– Service Phase (June – September inclusive)
– Experiment production activities / requirements
• WLCG Production Service
– In principle October on…
– ATLAS CSC / CMS CSA06 start early / mid September
• Some comments on Tier2 workshop
– Much more complete review at Wednesday’s GDB
The Evolution of Databases in HEP
CHEP 92 – the Birth of OO in HEP?
Wide-ranging discussions on the future of s/w development in HEP
A number of proposals presented, leading to (DRDC/LCRB/LCB):
RD41 – MOOSE [ Kors Bos ] – the applicability of OO to offline particle physics code
RD44 – GEANT4 [ Simone Giani ] – produce a global object-oriented analysis and design of an improved GEANT simulation toolkit for HEP
RD45 – A Persistent Object Manager for HEP [ JDS ] (and later also LHC++ (subsequently ANAPHE)) [ JDS ]
ROOT [ René ]
Started working on LHC Computing full-time!
LCG Service Deadlines
[Timeline: 2006 – cosmics; 2007 – first physics; 2008 – full physics run]
Pilot Service – stable service from 1 June 06, i.e. we have already taken off!
LCG Service in operation – 1 Oct 06; over the following six months, ramp up to full operational capacity & performance
LCG service commissioned – 1 Apr 07, ~6 months prior to first collisions
Updated LHC schedule coming…
The LHC Machine
• Some clear indications regarding LHC startup schedule and operation are now available
– Press release issued two weeks ago
• Comparing our (SC) actual status with ‘the plan’, we are arguably one year late!
– Some sites cheerfully claim two…
– We were supposed to test all offline Use Cases of the experiments during the SC3 production phase (Sep 2005)
• We still have an awful lot of work to do
Not the time to relax!
Press Release - Extract
CERN confirms LHC start-up for 2007
• Geneva, 23 June 2006. First collisions in the … LHC … in November 2007 said … Lyn Evans at the 137th meeting of the CERN Council ...
• A two-month run in 2007, with beams colliding at an energy of 0.9 TeV, will allow the LHC accelerator and detector teams to run in their equipment ready for a full 14 TeV energy run starting in Spring 2008
– Service Challenge ’07?
• The schedule announced today ensures the fastest route to a high-energy physics run with substantial quantities of data in 2008, while optimising the commissioning schedules for both the accelerator and the detectors that will study its particle collisions. It foresees closing the LHC’s 27 km ring in August 2007 for equipment commissioning. Two months of running, starting in November 2007, will allow the accelerator and detector teams to test their equipment with low-energy beams. After a winter shutdown in which commissioning will continue without beam, the high-energy run will begin. Data collection will continue until a pre-determined amount of data has been accumulated, allowing the experimental collaborations to announce their first results.
LHC Commissioning
Expect to be characterised by:
Poorly understood detectors, calibration, software, triggers etc.
Lower than design luminosity & energy (~injection energy)
Most likely no AOD or TAG from first pass – but ESD will be larger?
Possible large impact on Tier2s – RAW and ESD samples to Tier2s?
The pressure will be on to produce some results as soon as possible!
There will not be sufficient resources at CERN to handle the load
We need a fully functional distributed system - ENTER THE GRID
There are many Use Cases we did not yet clearly identify
Nor indeed test --- this remains to be done in the coming months!
R. Bailey, Chamonix XV, January 2006
Breakdown of a normal year
~140-160 days for physics per year (not forgetting ion and TOTEM operation)
Leaves ~100-120 days for proton luminosity running; efficiency for physics ~50%?
~50 days ~ 1200 h ~ 4 × 10^6 s of proton luminosity running per year
- From Chamonix XIV -
Service upgrade slots?
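The slide's arithmetic can be checked in a few lines, taking the lower end of the ~100-120 day estimate together with the quoted ~50% physics efficiency:

```python
# Cross-check of the "breakdown of a normal year" arithmetic from Chamonix.
days_for_luminosity = 100        # lower end of the ~100-120 day estimate
physics_efficiency = 0.5         # "efficiency for physics 50%?" from the slide

effective_days = days_for_luminosity * physics_efficiency
hours = effective_days * 24
seconds = hours * 3600

print(effective_days)  # 50.0 days
print(hours)           # 1200.0 hours
print(seconds)         # 4320000.0 s, i.e. ~4 x 10^6 s of luminosity running
```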
P. Sphicas – LHC experiments’ software
Startup physics (ALICE)
Multiplicity paper outline:
• Introduction
• Detector system – Pixel (& TPC)
• Analysis method
• Presentation of data – dN/dη and multiplicity distribution (s dependence)
• Theoretical interpretation – ln²(s) scaling?, saturation, multi-parton interactions…
• Summary
pT paper outline:
• Introduction
• Detector system – TPC, ITS
• Analysis method
• Presentation of data – pT spectra and pT-multiplicity correlation
• Theoretical interpretation – soft vs hard, mini-jet production…
• Summary
Can publish two papers 1-2 weeks after LHC startup
LCG Service Model
Tier0 – the accelerator centre (that’s us): data acquisition & initial processing; long-term data curation; distribution of data to the Tier1s
This is where FTS comes in…
Tier1s:
Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Didcot); US – FermiLab (Illinois), Brookhaven (NY)
Tier1 – “online” to the data acquisition process; high availability; managed mass storage – grid-enabled data service; data-intensive analysis; national, regional support; continual reprocessing activity (or is that continuous?)
Tier2 – ~100 centres in ~40 countries: simulation; end-user analysis – batch and interactive
Les Robertson
Summary of Computing Resource Requirements
All experiments – 2008. From LCG TDR – June 2005

                       CERN   All Tier-1s   All Tier-2s   Total
CPU (MSPECint2000s)     25        56            61         142
Disk (PetaBytes)         7        31            19          57
Tape (PetaBytes)        18        35            –           53

Shares: CPU – CERN 18%, Tier-1s 39%, Tier-2s 43%. Disk – CERN 12%, Tier-1s 55%, Tier-2s 33%. Tape – CERN 34%, Tier-1s 66%.
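As a sanity check, the tier shares shown in the accompanying pie charts can be re-derived from the TDR 2008 totals; they agree with the quoted percentages to within a percent:

```python
# Re-derive each tier's share of the 2008 resource totals (LCG TDR, June 2005).
resources = {
    "CPU (MSPECint2000s)": {"CERN": 25, "Tier-1s": 56, "Tier-2s": 61},
    "Disk (PB)":           {"CERN": 7,  "Tier-1s": 31, "Tier-2s": 19},
    "Tape (PB)":           {"CERN": 18, "Tier-1s": 35},
}

shares = {}
for resource, by_tier in resources.items():
    total = sum(by_tier.values())
    shares[resource] = {tier: round(100 * v / total) for tier, v in by_tier.items()}

print(shares["CPU (MSPECint2000s)"])  # {'CERN': 18, 'Tier-1s': 39, 'Tier-2s': 43}
print(shares["Tape (PB)"])            # {'CERN': 34, 'Tier-1s': 66}
```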
Nominal Tier0 – Tier1 Data Rates (pp)
(ATLAS column: share, with nominal MB/s in parentheses)

Tier1 Centre                 ALICE   ATLAS         CMS      LHCb    Target (MB/s)
IN2P3, Lyon                   9%     13% (90)      10%      27%     200
GridKA, Germany              20%     10% (75)       8%      10%     200
CNAF, Italy                   7%      7% (60)      13%      11%     200
FNAL, USA                     -       -            28%       -      200
BNL, USA                      -      22% (200)      -        -      200
RAL, UK                       -       7% (60)       3%      15%     150
NIKHEF, NL                   (3%)    13% (90)       -       23%     150
ASGC, Taipei                  -       8% (60)      10%       -      100
PIC, Spain                    -       4% (5) (50)   6% (5)   6.5%   100
Nordic Data Grid Facility     -       6% (50)       -        -       50
TRIUMF, Canada                -       4% (50)       -        -       50
TOTAL                                                               1.6 GB/s
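A quick cross-check that the per-site targets sum to the quoted 1.6 GB/s total:

```python
# Sum the nominal Tier0 -> Tier1 per-site targets (MB/s) from the table above.
targets_mb_s = {
    "IN2P3": 200, "GridKA": 200, "CNAF": 200, "FNAL": 200, "BNL": 200,
    "RAL": 150, "NIKHEF": 150, "ASGC": 100, "PIC": 100,
    "NDGF": 50, "TRIUMF": 50,
}

total_mb_s = sum(targets_mb_s.values())
print(total_mb_s)         # 1600 MB/s
print(total_mb_s / 1000)  # 1.6 GB/s
```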
SC4 T0-T1: Results
Target: sustained disk – disk transfers at 1.6 GB/s out of CERN at full nominal rates for ~10 days
Result: just managed this rate on Good Sunday (1 day out of 10)
Easter weekend; target 10-day period
Easter Sunday: > 1.6 GB/s including DESY
GridView reports 1614.5 MB/s as the daily average
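To put the GridView figure in volume terms, a small helper (the function name and the decimal MB-to-TB conversion are my own, not from the talk) converts a sustained rate into data moved per day:

```python
# Convert a sustained transfer rate (MB/s) into daily data volume (TB/day).
SECONDS_PER_DAY = 86400

def daily_volume_tb(rate_mb_s: float) -> float:
    """Data volume in TB moved in one day at a sustained rate, in MB/s (decimal TB)."""
    return rate_mb_s * SECONDS_PER_DAY / 1e6

# The Easter Sunday daily average quoted above:
print(round(daily_volume_tb(1614.5), 1))  # 139.5 TB out of CERN in one day
```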
Service Challenges - Reminder
Purpose: understand what it takes to operate a real grid service – run for weeks/months at a time (not just limited to experiment Data Challenges)
Trigger and verify Tier-1 & large Tier-2 planning and deployment – tested with realistic usage patterns
Get the essential grid services ramped up to target levels of reliability, availability, scalability, end-to-end performance
Four progressive steps from October 2004 through September 2006:
End 2004 – SC1 – data transfer to subset of Tier-1s
Spring 2005 – SC2 – include mass storage, all Tier-1s, some Tier-2s
2nd half 2005 – SC3 – Tier-1s, >20 Tier-2s – first set of baseline services
Jun-Sep 2006 – SC4 – pilot service
Autumn 2006 – LHC service in continuous operation – ready for data taking in 2007
SC4 – Executive Summary
We have shown that we can drive transfers at full nominal rates to:
most sites simultaneously; all sites in groups (modulo network constraints – PIC); at the target nominal rate of 1.6 GB/s expected in pp running
In addition, several sites exceeded the disk – tape transfer targets
There is no reason to believe that we cannot drive all sites at or above nominal rates for sustained periods.
But
There are still major operational issues to resolve – and, most importantly, a full end-to-end demo under realistic conditions
Experiment Plans for SC4
• All 4 LHC experiments will run major production exercises during WLCG pilot / SC4 Service Phase
• These will test all aspects of the respective Computing Models plus stress Site Readiness to run (collectively) full production services
• These plans have been assembled from the material presented at the Mumbai workshop, with follow-up by Harry Renshall with each experiment, together with input from Bernd Panzer (T0) and the Pre-production team, and summarised on the SC4 planning page.
• We have also held a number of meetings with representatives from all experiments to confirm that we have all the necessary input (all activities: PPS, SC, Tier0, …) and to spot possible clashes in schedules and / or resource requirements. (See “LCG Resource Scheduling Meetings” under LCG Service Coordination Meetings).
• The conclusions of these meetings have been presented to the weekly operations meetings and the WLCG Management Board in written form (documents, presentations)
– See the SC4 Combined Action List for more information…
Summary of Experiment Plans
All experiments will carry out major validations of both their offline software and the service infrastructure during the next 6 months
There are significant concerns about the state of readiness (of everything…) – not to mention manpower at ~all sites and in the experiments
I personally am considerably worried – seemingly simple issues, such as setting up LFC/FTS services, publishing SRM end-points etc., have taken O(1 year) to be resolved (across all sites)
and that is without even mentioning basic operational procedures (some big improvements here recently…)
And all this despite heroic efforts across the board
But – oh dear – your planet has just been blown up by the Vogons
[ So long and thanks for all the fish ]
Mini Computer
Microcomputer
Cluster
Mainframe
ATLAS SC plans/requirements
• Running now till 7 July to demonstrate the complete ATLAS DAQ and first-pass processing, with distribution of raw and processed data to Tier 1 sites at the full nominal rates. Will also include data flow to some Tier2 sites and full usage of the ATLAS Distributed Data Management system, DQ2. Raw data to go to tape, processed data to disk only. Sites to delete from disk and tape.
• After the summer, investigate scenarios of recovery from failing Tier 1 sites and deploy cleanup of pools at Tier 0.
• Later, test distributed production, analysis and reprocessing.
• DQ2 has a central role with respect to ATLAS Grid tools
– ATLAS will install local DQ2 catalogues and services at Tier 1 centres
– ATLAS will define a region of a Tier 1 and well-network-connected sites that will depend on the Tier 1 DQ2 catalogue
– Expect such (volunteer) Tier 2s to join SC when T0/T1 runs stably
– ATLAS will delete DQ2 catalogue entries
• Require VO box per Tier 0 and Tier 1 – done
• Require LFC server per Tier 1 – done, must be monitored
• Require FTS server and validated channels per Tier 0 and Tier 1 – close
• Require ‘durable’ MSS disk area at Tier 1 – few sites have it. To be followed up by ATLAS and the SC team.
• ATLAS would like their T1 sites to attend (via VRVS) their weekly (Wed at 14.00) SC review meeting during this running phase. No commitments were made.
ALICE SC Plans
• Validation of the LCG/gLite workload management services: ongoing
– Stability of the services is fundamental for the entire duration of the exercise
• Validation of the data transfer and storage services
– 2nd phase: end July/August, T0 to T1 (recyclable tape) at 300 MB/sec
– The stability and support of the services have to be assured during and beyond these throughput tests
• Validation of the ALICE distributed reconstruction and calibration model: August/September, reconstruction at Tier 1
• Integration of all Grid resources within one single – interfaces to different Grids (LCG, OSG, NDGF) will be done by ALICE
• End-user data analysis: September/October
CMS SC Plans/Requirements
• In September/October run CSA06, a 50 million event exercise to test the workflow and dataflow associated with the data handling and data access model of CMS
• Now till end June
– Continue to try to improve file transfer efficiency. Low rates and many errors now.
– Attempt to hit 25k batch jobs per day and increase the number and reliability of sites, aiming to obtain 90% efficiency for job completion
• In July
– Demonstrate the CMS analysis submitter in bulk mode with the gLite RB
• In July and August
– 25M events per month with the production systems
– Second half of July: participate in multi-experiment FTS Tier-0 to Tier-1 transfers at 150 MB/sec out of CERN
– Continue through August with transfers
• Requirements:
– Improve Tier-1 to Tier-2 transfers and the reliability of the FTS channels. CMS are exercising the channels available to them, but there are still issues with site preparation and reliability – the majority of sites are responsive, but there is a lot of work for this summer
– Require deployment of the LCG-3D infrastructure
– From late June, deploy Frontier for SQUID caches
– All participating sites should be able to complete the CMS workflow and metrics (as defined in the CSA06 documentation)
LHCb SC Plans/Requirements
• Will start the DC06 challenge at the beginning of July using LCG production services and run till end August:
– Distribution of raw data from CERN to Tier 1s at 23 MB/sec
– Reconstruction/stripping at Tier 0 and Tier 1
– DST distribution to CERN and Tier 1s
– Job prioritisation will be dealt with by LHCb, but it is important jobs are not delayed by other VO activities
• Preproduction for this is ongoing, with 125 TB of MC data at CERN
• Production will go on throughout the year for an LHCb physics book due in 2007
• Require SRM 1.1 based SEs, separated for disk and MSS, at all Tier 1s as agreed in Mumbai, and FTS channels for all CERN-T1s
– Data access directly from SE to ROOT/POOL (not just GridFTP/srmcp). For NIKHEF/SARA (firewall issue) this could perhaps be done via GFAL.
• Require VO boxes at Tier 1 – so far at CERN, IN2P3, PIC and RAL. Need CNAF, NIKHEF and GridKa
• Require central LFC catalogue at CERN and read-only copy at certain T1s (currently being set up at CNAF)
• DC06-2 in Oct/Nov requires T1s to run COOL and 3D database services
Experiment Summary
• All experiments will be ramping up their activity between now and first collisions
• The period of ‘one experiment having priority’ – as was done in SC3 and for ATLAS until this weekend – is over
It is full, concurrent production from now on!
Workshop Feedback
• >160 people registered and (a few more) participated!
– This is very large for a workshop – about the same as Mumbai
• Some comments related directly to this (~40 replies received so far)
• Requests for more:
– Tutorials, particularly “hands-on”
– Direct Tier2 involvement
– Feedback sessions, planning concrete actions etc.
• Active help from Tier2s in preparing / defining future events would be much appreciated
– Please, not just the usual suspects…
• See also Duncan Rand’s talk to GridPP16 – some slides included below
Tutorial Rating – 10 = best
[Bar chart of individual tutorial ratings, one bar per reply]
Workshop Rating
[Bar chart of individual workshop ratings, one bar per reply]
Workshop Comments
• Many positive comments on all sessions of the workshop and tutorials
• Possibility to discuss with other sites and the developers also much appreciated
• Sessions which some liked least others liked most!
• I hope that the people who didn’t reply also feel the same!
“Very very inspiring” “Hope to do it again soon”
“Tutorials were very useful”
“The organisation was excellent”
“Discussions were very enlightening”
“Information collected together in one place”
Workshop Summary
• Workshops have been well attended and received
– Feedback will help guide future events
• Need to improve on Tier1+Tier2 involvement
– Preparing agendas / chairing sessions / giving talks etc.
• Strong demand for more tutorials
– Hands-on where possible / appropriate
• Thanks to everyone for their contribution to both workshop and tutorials!
HEPiX Rome 05apr06
LCG
The Service Challenge programme this year must show that we can run reliable services
Grid reliability is the product of many components – middleware, grid operations, computer centres, …
Target for September: 90% site availability, 90% user job success
Requires a major effort by everyone to monitor, measure, debug
First data will arrive next year – NOT an option to get things going later
Too modest?
Too ambitious?
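The reliability-as-a-product point above can be sketched numerically; the component list and the ~98-99% figures below are illustrative assumptions, not quoted targets, but they show why a 90% end-to-end goal demands much better than 90% from each layer:

```python
# End-to-end grid reliability as the product of its components' reliabilities.
# Component names and values are illustrative assumptions.
components = {
    "middleware": 0.98,
    "grid operations": 0.98,
    "computer centre": 0.98,
    "storage": 0.98,
    "network": 0.99,
}

end_to_end = 1.0
for reliability in components.values():
    end_to_end *= reliability

print(round(end_to_end, 3))  # 0.913: five ~98-99% components barely clear 90%
```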