“Deploying and operating the LHC Computing Grid 2 during Data Challenges”, Markus Schulz, IT-GD, CERN, the CERN IT-GD group. CHEP’04 – Interlaken, Switzerland, 27 September – 1 October 2004. EGEE is a project funded by the European Union under contract IST-2003-508833.


Page 1: EGEE is a project funded by the European Union under contract IST-2003-508833

EGEE is a project funded by the European Union under contract IST-2003-508833

“Deploying and operating the LHC Computing Grid 2 during Data Challenges”

Markus Schulz, IT-GD, CERN

the CERN IT-GD group

CHEP’04 – Interlaken, Switzerland, 27 September – 1 October 2004

Page 2: EGEE is a project funded by the European Union under contract IST-2003-508833


Outline

• LCG overview

• Short History of LCG-2

• Data Challenges

• Operating LCG

• Problems

• Summary

Page 3: EGEE is a project funded by the European Union under contract IST-2003-508833


The LCG Project (and what it isn’t)

• Mission: to prepare, deploy and operate the computing environment for the experiments to analyze the data from the LHC detectors
• Two phases:
– Phase 1 (2002 – 2005): build a prototype, based on existing grid middleware; deploy and run a production service; produce the Technical Design Report for the final system
– Phase 2 (2006 – 2008): build and commission the initial LHC computing environment

LCG is NOT a development project for middleware, but problem fixing is permitted (even if writing code is required)

LCG-2 is the first production service for EGEE

Page 4: EGEE is a project funded by the European Union under contract IST-2003-508833


LCG Grid Deployment Area

• Scope:
– Integrate a set of middleware and coordinate and support its deployment to the regional centres
– Provide operational services to enable running as a production-quality service
– Provide assistance to the experiments in integrating their software and deploying in LCG
– Provide direct user support
• Deployment Goals for LCG-2:
– Production service for Data Challenges in 2004 (focused on batch production work)
– Experience in close collaboration between the Regional Centres
– Learn how to maintain and operate a global grid
– Focus on building a production-quality service: robustness, fault-tolerance, predictability, and supportability
– Understand how LCG can be integrated into the sites’ computing

Page 5: EGEE is a project funded by the European Union under contract IST-2003-508833


LCG Deployment Area

• A core team at CERN – Grid Deployment group (~30)
• Collaboration of the regional centres through the Grid Deployment Board
• Partners take responsibility for specific tasks (e.g. GOCs, GUS)
• Focussed task forces as needed
• Collaborative joint projects – via JTB, grid projects, etc.
• CERN deployment group:
– Core preparation, (re)certification, deployment, and support activities
– Integration, packaging, debugging, development of missing tools
– Deployment, coordination & support, security & VO management
– Experiment integration and support
• GDB: country representatives for regional centres
– Address policy, operational issues that require general agreement
– Brokered agreements on: security, what is deployed, …

Page 6: EGEE is a project funded by the European Union under contract IST-2003-508833


LCG-2 Software

• LCG-2 core packages:
– VDT (Globus2, Condor)
– EDG WP1 (Resource Broker, job submission tools)
– EDG WP2 (Replica Management tools) + lcg tools: one central RMC and LRC for each VO, located at CERN, ORACLE backend
– Several bits from other WPs (Config objects, InfoProviders, Packaging…)
– GLUE 1.1 (Information schema) + few essential LCG extensions
– MDS based Information System with significant LCG enhancements (replacements, simplified (see poster))
– Mechanism for application (experiment) software distribution
• Almost all components have gone through some reengineering: robustness, scalability, efficiency, adaptation to local fabrics
• The services are now quite stable and the performance and scalability have been significantly improved (within the limits of the current architecture)

Page 7: EGEE is a project funded by the European Union under contract IST-2003-508833


History

• Jan 2003: GDB agreed to take VDT and EDG components
• March 2003: LCG-0
– existing middleware, waiting for EDG-2 release
• September 2003: LCG-1
– 3 months late -> reduced functionality
– extensive certification process -> improved stability (RB, Information system)
– integrated 32 sites, ~300 CPUs
– first use for production
• December 2003: LCG-2
– Full set of functionality for DCs, first MSS integration
– Deployed in January to 8 core sites
– DCs started in February -> testing in production
– Large sites integrate resources into LCG (MSS and farms)
– Introduced a pre-production service for the experiments
– Alternative packaging (tool based and generic installation guides)
• May 2004 -> now: monthly incremental releases
– Not all releases are distributed to external sites
– Improved services, functionality, stability and packaging step by step
– Timely response to experiences from the data challenges

Page 8: EGEE is a project funded by the European Union under contract IST-2003-508833


LCG-2 Status 22 09 2004

Total: 78 sites, ~9400 CPUs, ~6.5 PByte

[Figure not preserved; surviving label: Cyprus]

new interested sites should look here: release

Page 9: EGEE is a project funded by the European Union under contract IST-2003-508833


Integrating Sites

• Sites contact GD Group or Regional Center
• Sites go to the release page
• Sites decide on manual or tool based installation
– documentation for both available
– from the next release on, WN and UI come as a tar-ball based release
• Sites provide security and contact information
• Sites install and use provided tests for debugging
– support from regional centers or CERN
• CERN GD certifies site and adds it to the monitoring and information system
– sites are re-certified daily and problems traced in SAVANNAH
• Large sites have integrated their local batch systems in LCG-2
• Adding new sites is now quite smooth
– the problem is keeping a large number of sites correctly configured

Page 10: EGEE is a project funded by the European Union under contract IST-2003-508833


Data Challenges

• Large scale production effort of the LHC experiments
– test and validate the computing models
– produce needed simulated data
– test the experiments’ production frameworks and software
– test the provided grid middleware
– test the services provided by LCG-2

• All experiments used LCG-2 for part of their production

Page 11: EGEE is a project funded by the European Union under contract IST-2003-508833


Data Challenges

• Phase I
– 120k Pb+Pb events produced in 56k jobs
– 1.3 million files (26 TByte) in Castor@CERN
– Total CPU: 285 MSI-2k hours (2.8 GHz PC working 35 years)
– ~25% produced on LCG-2
• Phase II (underway)
– 1 million jobs, 10 TB produced, 200 TB transferred, 500 MSI2k hours CPU
– ~15% on LCG-2

• Phase I
– 7.7 million events fully simulated (Geant 4) in 95,000 jobs
– 22 TByte
– Total CPU: 972 MSI-2k hours
– >40% produced on LCG-2 (used LCG-2, GRID3, NorduGrid)
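
The parenthetical “2.8 GHz PC working 35 years” makes it possible to back out the per-PC rating implied by the slide and apply it to the other CPU totals. The sketch below assumes that MSI-2k means millions of SPECint2000 units and that a year has 365.25 × 24 hours; the function names are illustrative, and the resulting figures are rough estimates rather than numbers quoted in the talk.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average calendar year


def implied_si2k_rating(msi2k_hours, years):
    """SI2k rating of one PC that would need `years` to deliver msi2k_hours."""
    return msi2k_hours * 1e6 / (years * HOURS_PER_YEAR)


def single_pc_years(msi2k_hours, si2k_rating):
    """Years a single PC with the given SI2k rating needs for msi2k_hours."""
    return msi2k_hours * 1e6 / (si2k_rating * HOURS_PER_YEAR)


# Figures quoted on the slide: 285 MSI-2k hours ~ one 2.8 GHz PC for 35 years.
rating = implied_si2k_rating(285, 35)

if __name__ == "__main__":
    print(f"implied rating: about {rating:.0f} SI2k per PC")
    for label, total in [("500 MSI2k hours", 500), ("972 MSI-2k hours", 972)]:
        print(f"{label}: about {single_pc_years(total, rating):.0f} PC-years")
```

Under these assumptions the implied rating is a little over 900 SI2k per PC, so the other totals scale linearly from the 35-year figure.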

Page 12: EGEE is a project funded by the European Union under contract IST-2003-508833


Data Challenges

• ~30 M events produced
• 25 Hz reached
– (only once for a full day)

Page 13: EGEE is a project funded by the European Union under contract IST-2003-508833


Data Challenges

[Figure not preserved; surviving annotations, in order: “DIRAC alone”, “LCG in action”, “1.8 × 10^6/day”, “LCG paused”, “3-5 × 10^6/day”, “LCG restarted”]

• Phase I
– 186 M events, 61 TByte
– Total CPU: 424 CPU years (43 LCG-2 and 20 DIRAC sites)
– Up to 5600 concurrent running jobs in LCG-2

Page 14: EGEE is a project funded by the European Union under contract IST-2003-508833


Problems during the data challenges

• All experiments encountered similar problems on LCG-2
• LCG sites suffering from configuration and operational problems
– not adequate resources on some sites (hardware, human..)
– this is now the main source of failures
– there is a discrepancy between the failure rate on LCG-2 and on the C&T testbed
• Load balancing between different sites is problematic
– jobs can be “attracted” to sites that have no adequate resources
– modern batch systems are too complex and dynamic to summarize their behavior in a few values in the IS
• Identification and location of problems in LCG-2 is difficult
– distributed environment, access to many logfiles needed…
– status of monitoring tools
• Handling thousands of jobs is time consuming and tedious
– Support for bulk operation is not adequate
• Performance and scalability of services
– storage (access and number of files)
– job submission
– information system
– file catalogues
• Services suffered from hardware problems (no fail over services)

DC summary
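
The “attraction” effect in the load-balancing bullet can be illustrated with a toy broker. This is a minimal sketch, not the LCG-2 Resource Broker: it assumes the broker ranks sites only by a published estimated wait and that a misconfigured site keeps publishing zero. `Site`, `pick_site`, `submit_all`, and `stranded` are hypothetical names, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    published_wait_s: int  # estimated wait as published in the IS
    usable_slots: int      # slots that can really run the job (unknown to the broker)


def pick_site(sites):
    """Rank purely on the published value: smallest estimated wait wins."""
    return min(sites, key=lambda s: s.published_wait_s)


def submit_all(sites, n_jobs):
    """Submit n_jobs one by one; the published values are never refreshed."""
    placed = {s.name: 0 for s in sites}
    for _ in range(n_jobs):
        placed[pick_site(sites).name] += 1
    return placed


def stranded(sites, placed):
    """Jobs sent to each site beyond what it can actually run."""
    return {s.name: max(0, placed[s.name] - s.usable_slots) for s in sites}


# Invented example: two healthy sites and one that publishes a wait of zero.
sites = [
    Site("healthy-large", published_wait_s=600, usable_slots=200),
    Site("healthy-small", published_wait_s=900, usable_slots=50),
    Site("misconfigured", published_wait_s=0, usable_slots=0),
]

if __name__ == "__main__":
    placed = submit_all(sites, 100)
    print(placed)
    print(stranded(sites, placed))
```

Because the ranking sees only the single published number, every job in this toy run goes to the site that cannot run any of them, which is the failure mode the slide describes.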

Page 15: EGEE is a project funded by the European Union under contract IST-2003-508833


Running Services

• Multiple instances of core services for each of the experiments
– separates problems, avoids interference between experiments
– improves availability
– allows experiments to maintain individual configuration (information system)
– addresses scalability to some degree
• Monitoring tools for services currently not adequate
– tools under development to implement control system
• Access to storage via load balanced interfaces
– CASTOR
– dCache
• Services that carry “state” are problematic to restart on new nodes
– needed after hardware problems, or security problems
• “State Transition” between partial usage and full usage of resources
– required change in queue configuration (fair share, individual queues/VO)
– next release will come with description for fair share configuration (smaller sites)

DC summary

Page 16: EGEE is a project funded by the European Union under contract IST-2003-508833


Support during the DCs

• User (Experiment) Support:
– GD at CERN worked very closely with the experiments’ production managers
– Informal exchange (e-mail, meetings, phone)
– “No Secrets” approach, GD people on experiments’ mail lists and vice versa: ensured fast response
– Tracking of problems tedious, but both sides have been patient
– Clear learning curve on BOTH sides
– LCG GGUS (grid user support) at FZK became operational after start of the DCs; due to the importance of the DCs the experiments switch slowly to the new service
– Very good end user documentation by GD-EIS
– Dedicated testbed for experiments with next LCG-2 release: rapid feedback, influenced what made it into the next release
• Installation (Site) Support:
– GD prepared releases and supported sites (certification, re-certification)
– Regional centres supported their local sites (some more, some less)
– Community style help via mailing list (high traffic!!)
– FAQ lists for troubleshooting and configuration issues: Taipei, RAL

Dear Experiments DC Staff, [LCG-ROLLOUT]-friends,

Site Admins and GD-Group

Thank you very much for the supportive attitude, patience and energy during the data challenges!!!!


You made DCs almost enjoyable

Page 17: EGEE is a project funded by the European Union under contract IST-2003-508833


Support during the DCs

• Operations Service:
– RAL (UK) is leading sub-project on developing operations services
– Initial prototype http://www.grid-support.ac.uk/GOC/: basic monitoring tools; mail lists for problem resolution; working on defining policies for operation, responsibilities (draft document); working on grid wide accounting
• Monitoring:
– GridICE (development of DataTag Nagios-based tools)
– GridPP job submission monitoring
– Information system monitoring and consistency check http://goc.grid.sinica.edu.tw/gstat/
• CERN GD daily re-certification of sites (including history)
– escalation procedure under development
– tracing of site specific problems via problem tracking tool
– tests core services and configuration
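
A daily re-certification history makes it straightforward to spot sites whose problems persist. The following is a sketch under stated assumptions, not the CERN GD tooling: each site’s history is modelled as a list of daily pass/fail results, oldest first, and the three-day escalation threshold, the site names, and the function names are all invented for illustration.

```python
def failing_streak(history):
    """Number of consecutive failed daily runs at the end of the history."""
    streak = 0
    for passed in reversed(history):
        if passed:
            break
        streak += 1
    return streak


def sites_to_escalate(histories, threshold_days=3):
    """Sites whose most recent runs failed for at least threshold_days in a row."""
    return sorted(
        name
        for name, history in histories.items()
        if failing_streak(history) >= threshold_days
    )


# Hypothetical histories: True = passed that day's tests, oldest result first.
example_histories = {
    "site-a": [True, True, False],
    "site-b": [True, False, False, False],
    "site-c": [False, True, True],
}

if __name__ == "__main__":
    print(sites_to_escalate(example_histories))
```

Counting only the trailing run of failures means a site that has recovered is not escalated, however bad its earlier history was.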

Page 18: EGEE is a project funded by the European Union under contract IST-2003-508833


Operational issues (selection)

• Slow response from sites
– Upgrades, response to problems, etc.
– Problems reported daily – some problems last for weeks
• Lack of staff available to fix problems
– Vacation period, other high priority tasks
• Various mis-configurations (see next slide)
• Lack of configuration management – problems that are fixed reappear
• Lack of fabric management (mostly smaller sites)
– scratch space, single nodes drain queues, incomplete upgrades, …
• Lack of understanding
– Admins reformat disks of SE …
• Firewall issues
– often less than optimal coordination between grid admins and firewall maintainers
• PBS problems
– Scalability, robustness (switching to torque helps)
• Provided documentation often not read (carefully)
– new activity started to develop “hierarchical” adaptive documentation (see G2G poster)
– simpler way to install middleware on farm nodes (even remotely in user space); will be included in October release

Page 19: EGEE is a project funded by the European Union under contract IST-2003-508833


Site (mis) - configurations

• Site mis-configuration was responsible for most of the problems that occurred during the experiments’ Data Challenges. Here is an incomplete list of problems:
– The variable VO_<VO>_SW_DIR points to a non-existent area on WNs.
– The ESM is not allowed to write in the area dedicated to the software installation
– Only one certificate allowed to be mapped to the ESM local account
– Wrong information published in the information system (Glue Object Classes not linked)
– Queue time limits published in minutes instead of seconds and not normalized
– /etc/ld.so.conf not properly configured. Shared libraries not found.
– Machines not synchronized in time
– Grid-mapfiles not properly built
– Pool accounts not created but the rest of the tools configured with pool accounts
– Firewall issues
– CA files not properly installed
– NFS problems for home directories or ESM areas
– Services configured to use the wrong/no Information Index (BDII)
– Wrong user profiles
– Default user shell environment too big

• Partly related to middleware complexity
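
Several items on this list are mechanical enough to test for before a site joins production. The sketch below shows two such checks under stated assumptions: the software-area variable is written as VO_<VO>_SW_DIR (assuming the underscores were lost in extraction), the write check is meant to be run under the ESM account, and the 1000 SI2k reference rating used for normalization is invented for illustration. `check_sw_dir` and `normalized_limit_seconds` are hypothetical helpers, not part of the LCG-2 test suite.

```python
import os


def check_sw_dir(vo, environ=None):
    """Return a list of problems found with VO_<VO>_SW_DIR for the given VO."""
    environ = os.environ if environ is None else environ
    var = f"VO_{vo.upper()}_SW_DIR"
    path = environ.get(var)
    if not path:
        return [f"{var} is not set"]
    if not os.path.isdir(path):
        return [f"{var} points to a non-existent area: {path}"]
    # Meaningful when run as the ESM, who must be able to install software here.
    if not os.access(path, os.W_OK):
        return [f"{var} is not writable by this account: {path}"]
    return []


def normalized_limit_seconds(limit_minutes, node_si2k, reference_si2k=1000):
    """Convert a queue limit from minutes to seconds, scaled to a reference CPU.

    The reference rating is an assumption made for this sketch.
    """
    return round(limit_minutes * 60 * node_si2k / reference_si2k)


if __name__ == "__main__":
    for problem in check_sw_dir("myvo"):  # "myvo" is a placeholder VO name
        print("PROBLEM:", problem)
    # A 48-hour limit on nodes half as fast as the reference CPU.
    print(normalized_limit_seconds(48 * 60, node_si2k=500))
```

Passing the environment in as a parameter keeps the check testable without touching the real worker-node environment.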

Page 20: EGEE is a project funded by the European Union under contract IST-2003-508833


Outstanding Middleware Issues

• Collection: Outstanding Middleware Issues
– Important: 1st systematic confrontation of required functionalities with capabilities of the existing middleware
– Some can be patched or worked around
– Those related to fundamental problems with underlying models and architectures have to be input as essential requirements to future developments (EGEE)
• Middleware is now not perfect but quite stable
– Much has been improved during DCs
• A lot of effort still going into improvements and fixes
• Big hole is missing space management on SEs
– especially for Tier 2 sites

Page 21: EGEE is a project funded by the European Union under contract IST-2003-508833


EGEE Impact on Operations

• The available effort for operations from EGEE is now ramping up:
– LCG GOC (RAL) EGEE CICs and ROCs, + Taipei
• Hierarchical support structure
– Regional Operations Centres (ROC): one per region (9); front-line support for deployment, installation, users
– Core Infrastructure Centres (CIC): four (+ Russia next year); evolve from GOC – monitoring, troubleshooting, operational “control”; “24x7” in an 8x5 world ????; also providing VO-specific and general services
• EGEE NA3 organizes training for users and site admins
• “Grid operations day” at HEPiX in October
– Address common issues, experiences
• “Operations and Fabric Workshop” CERN 1-3 Nov

Page 22: EGEE is a project funded by the European Union under contract IST-2003-508833


Summary

• LCG-2 services have been supporting the data challenges
– Many middleware problems have been found – many addressed
– Middleware itself is reasonably stable
• Biggest outstanding issues are related to providing and maintaining stable operations
• Has to be addressed in large part by management buy-in to providing sufficient and appropriate effort at each site
• Future middleware has to take this into account:
– Must be more manageable, trivial to configure and install
– Must be easy to deploy in a failsafe mode
– Must be easy to deploy in a way that allows building scalable load balancing services
– Management and monitoring must be built into services from the start

Page 23: EGEE is a project funded by the European Union under contract IST-2003-508833


Screen Shots

Page 24: EGEE is a project funded by the European Union under contract IST-2003-508833


Screen Shots