Operation of CASTOR at RAL Tier1 Review November 2007 Bonny Strong


Page 1

Operation of CASTOR at RAL

Tier1 Review November 2007

Bonny Strong

Page 2

History

Jan 2005 Castor1 installed at RAL for evaluation

Jan 2006 Castor2 first available to external institutes, installation begun at RAL

Aug 2006 Castor2 running after resolving problems for deployment outside CERN, version 2.1.0

Sep 2006 CSA06 ran successfully

Mar 2007 Upgrade to version 2.1.2

- Major problems and instability causing frequent meltdowns

Sep 2007 Deployed separate instances per VO and CASTOR version 2.1.3

- Much better stability

Page 3

Production Architecture

[Diagram: four stager instances (CMS, Atlas, LHCb, and Repack and Small User), each with its own stager, DLF, and LSF plus Oracle stager and Oracle DLF databases. Diskservers and six TapeServers are attached to the instances.]

Diskserver groups shown: 1 diskserver (9 TB), 22 diskservers (133 TB), 7 diskservers (48 TB), 20 diskservers (144 TB)

Shared Services: NameServer 1 + vmgr, NameServer 2, Oracle NS + vmgr, repack, Oracle repack

Page 4

Test Architecture

[Diagram: three test environments (Development, Preproduction, and Certification Testbed). Each is shown with a stager, DLF, and LSF, one diskserver of variable size, and Oracle stager and Oracle DLF databases.]

Shared Services (NameServer + vmgr, Oracle NS + vmgr, repack, Oracle repack, TapeServer) appear twice in the diagram: once alongside Development and Preproduction, and once alongside the Certification Testbed.

Page 5

Operational Management

• Change management
• System manager on duty
• Helpdesk
• Monitoring: Nagios, Ganglia, CASTOR-specific
• Team
– Bonny Strong: service manager
– Shaun de Witt: developer
– Tim Folkes (about 50%): tape operations
– Chris Kruk: LSF manager, diskservers, sys admin
– Cheney Ketley (50%): sys admin, LSF backup

Page 6

Working with VOs

• Weekly meeting with all VOs to discuss issues and plans

• Meetings individually with VOs to model data flow and plan CASTOR configuration

Page 7

Atlas Data Flow Model

[Diagram: Atlas data flow between T0, the partner T1, other T1s, T2s, and the farm, passing through boxes labelled T0Raw, StripInput, simRaw, and simStrip, with storage classes D0T1, D1T0, D1T1, and D0T0. Data types shown on the flows: RAW, ESD, ESD1, ESD2, AODm, AODm1, AODm2, AOD2, TAG.]

Page 8

Key Improvements Planned Over Next 6 Months

• Resilience
– Oracle clusters (RAC) with Dataguard DB replication
– Redundant stagers for each VO
– Encouraging development for additional redundancy
• Monitoring improvements
• Development of administrative tools
• Deployment and configuration management procedures
• Disaster recovery documentation and testing

Page 9

SRMv2

• In production at RAL by 1 Dec 2007
• Separate endpoints for each VO
• Front end clusters for redundancy
• Will run in parallel with SRMv1 until VOs approve v1 decommissioning

Page 10

Major Problems and Issues

• Software reliability
• Heavy operational cost
• CERN-specific development
• Repack delayed
• Lack of administrative tools
• Performance to tape
• Staffing for 24/7 coverage

Page 11

Working with CERN

• External institutes conference call every 2 weeks to review development progress and operational issues

• Twice yearly face-to-face meetings of external institutes

• Once monthly deployment conference call to plan development priorities

• Management level meetings over last year to address problems of CASTOR for Tier1s
– Improved release procedures and planning
– More involvement of Tier1s in development planning
– Improved testing with development of certification testbed and testsuite at RAL

Page 12

Conclusions

• Has not been a smooth road
• Have taken or plan significant steps to overcome problems
• Major concerns for 2008:
– 24/7 operation
– Improving tape performance
• Expect system reliability to be much better in 2008 than 2007