Storage Review, David Britton, 21/Nov/08.


Page 1:

Storage Review

David Britton, 21/Nov/08.

Page 2:

One Year Ago

[Time line figure: axis from Oct-07 to Apr-09 in three-month steps; markers: OC, 2.1.4, Data?]

Oversight Committee – Oct 2007.

Data was expected in early summer 2008.

CASTOR was broken (2.1.2 and 2.1.3) and a serious concern.

Alternatives to CASTOR (dCache and HPSS/Enstore) had been considered and rejected.

Page 3:

OC Feedback

[Time line figure: axis from Oct-07 to Apr-09 in three-month steps; markers: OC, 2.1.4, Data?]

NOTES FROM THE OCTOBER 2007 OC ON CASTOR:

The main concern was progress towards fixing CASTOR at the Tier-1. It was understood that various actions were ongoing, but that it was necessary to manage this (and the associated expectations on each side). We were asked to make all deadlines as clear as possible to all those involved in the project (since delays in this area inevitably have a large impact across the project).

We need to agree, where necessary, sets of milestones and deadlines from CERN, the Tier-1, ATLAS, CMS and LHCb for end-December, February (prior to CCRC-1) and May (prior to CCRC-2) in anticipation of the next OC meeting in mid-May.

Page 4:

Tier-1 Review

[Time line figure: axis from Oct-07 to Apr-09 in three-month steps; markers: OC, Tier-1 Review, 2.1.4, Data?]

NOTES ON CASTOR FROM THE NOVEMBER 2007 Tier-1 Review:

Concerns: "2.1 CASTOR: The effort required over the next 12 months on CASTOR may be larger than planned." This was about 5 FTE (half funded by GridPP) compared to a plan of 1.5 FTE.

Recommendations: 3.1 The CASTOR level of effort is appropriate for steady-state operation, but given the current status, it needs to be monitored. Based on current input, we do not believe that a long-term redistribution of manpower in this area would lead to an optimum overall plan. In the short term, it is recognised that dedicated effort is required for testing. This should be regarded as transitionary. (Point-2.1)


Page 5:

2008

[Time line figure: axis from Oct-07 to Apr-09 in three-month steps; markers: OC, Tier-1 Review, 2.1.4, 2.1.6, 2.1.7, CCRC08 (twice), CASTOR S.I.R.’s, Data?]

Page 6:

2008 – The Present

[Time line figure: axis from Oct-07 to Apr-09 in three-month steps; markers: OC, Tier-1 Review, 2.1.4, 2.1.6, 2.1.7, CCRC08 (twice), CASTOR S.I.R.’s, Data?, Storage Review, OC, followed by a row of question marks]

Page 7:


Where do we go from here?

• At the review last year the feedback noted: We were also pleased to see signs of improvement w.r.t. CASTOR, following dedicated efforts from several individuals from a potentially disastrous situation.

• A year later, it is clear that: The CASTOR and Database teams have put in an enormous amount of work and achieved many successes. They have significantly improved the infrastructure, monitoring, and management processes.

BUT … we have not yet established a stable, reliable, load-tolerant mass storage service that is adequate for data.

• At this point we need to take a step back and look at the big picture to ensure that over the next 6 months we can address this.


Page 8:

(Sample) Questions

– Can we benefit by making our CASTOR setup mimic CERN’s more closely?

Cost issues?

Knowledge issues?

Manpower issues?

Other non-CERN CASTOR sites?

Page 9:

(Sample) Questions

– Is the main problem actually the database, and is the RAC set-up a large part of most problems?

Licences and hardware costs?

Oracle Expertise?

CERN / Oracle Support?

Other non-CERN CASTOR sites?

Page 10:

(Sample) Questions

– What effort is needed on CASTOR/databases over the next 6 months and the next 2 years, and can we provide it?

Backdrop:

2 FTE funded by GridPP in this area.

11 FTE total effort reported by Tier-1 against 17 FTE funded

Page 11:

(Sample) Questions

– Have we optimised the management, operation and internal and external interfaces of the Database and CASTOR teams?

Do we have the right skill mixture?

Is there enough agility?

How do we interface to CERN? To the Experiments?

Page 12:

(Sample) Questions

– Is our hardware resilient (enough) and is our architecture optimal?

Disk failures (correlations; replacement process)?

Load levels?

RAC?

Page 13:

(Sample) Questions

– How do we approach future CASTOR upgrades?

Is our test-bed sufficient (RAC?)?

Can we/do we generate representative loads?

Do we have enough (the right sort of) manpower?

How do we make the decision to deploy?

Page 14:

(Sample) Questions

– Does or will the changing (relative) costs of disk and tape (infrastructure) change the usage model?

5 FTE = £350k/p.a.

Tape infrastructure FY08 £694k (+ £75k media).
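As a rough aid to reading the two figures above, the arithmetic can be sketched as follows. This is only an illustration: it assumes the FY08 tape figures can be set against the annual staff cost on a like-for-like basis, which the slide does not state, and the function and variable names are hypothetical.

```python
# Figures quoted on the slide, in GBP.
STAFF_FTE = 5
STAFF_COST_PA = 350_000      # "5 FTE = £350k/p.a."
TAPE_INFRA_FY08 = 694_000    # "Tape infrastructure FY08 £694k"
TAPE_MEDIA_FY08 = 75_000     # "(+ £75k media)"


def cost_summary():
    """Derive simple comparison figures from the slide's numbers.

    Assumption (not from the slide): one year of tape spend is
    compared directly with one year of staff cost.
    """
    per_fte = STAFF_COST_PA / STAFF_FTE
    tape_total = TAPE_INFRA_FY08 + TAPE_MEDIA_FY08
    return {
        "per_fte": per_fte,                   # cost of one FTE per year
        "tape_total": tape_total,             # infrastructure plus media
        "tape_in_fte": tape_total / per_fte,  # tape spend in FTE-years
    }


if __name__ == "__main__":
    for name, value in cost_summary().items():
        print(f"{name}: {value:,.1f}")
```

On these numbers one FTE costs about £70k per year, and the FY08 tape spend including media is about £769k, or roughly 11 FTE-years.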

Page 15:

(Sample) Questions

– Is there light at the end of the CASTOR tunnel on the timescale of data?

Fundamentally, are we in a different position this year?

What are the key indicators that show this?

How do we monitor/measure/present this?

Page 16:

(Sample) Questions

– Are there alternatives to CASTOR that we should start to look at more seriously?

Options (from AS):

Keep running CASTOR;

Switch to dCache, either with DMF or some other HSM;

Switch to dCache with Enstore;

Write our own tapestore interface for dCache;

Buy a commercial HSM and rewrite either DPM or the CASTOR SRM to interface to it, or write our own SRM interface;

Run BeStMan or JASMINE;

Stop providing tape storage and switch to a disk-only Tier 1.

Page 17:

(Sample) Questions

– Do we (deployers and users) still believe CASTOR is the right mid- and long-term solution?

Are the experiment mid/long term plans evolving?

Archival storage on spin-on-demand disks or other technologies?

Is CASTOR appropriate for disk (only) storage (at any level)?

Can we/should we reduce our exposure/dependence on CASTOR?

Page 18:

(Sample) Questions

– Can we benefit by making our CASTOR setup mimic CERN’s more closely?

– Is the main problem actually the database, and is the RAC set-up a large part of most problems?

– What effort is needed on CASTOR/databases over the next 6 months and the next 2 years, and can we provide it?

– Have we optimised the management, operation and internal and external interfaces of the Database and CASTOR teams?

– Is our hardware resilient (enough) and is our architecture optimal?

– How do we approach future CASTOR upgrades?

– Does or will the changing (relative) costs of disk and tape (infrastructure) change the usage model?

– Is there light at the end of the CASTOR tunnel on the timescale of data?

– Are there alternatives to CASTOR that we should start to look at more seriously?

– Do we (deployers and users) still believe CASTOR is the right mid- and long-term solution?

– Are ATLAS’s problems due to lack of embedded ATLAS effort at RAL and/or their file sizes?

– We’ve seen lots of load-related problems: need for a CCRC09?

– Database load: can we reduce it for a modest cost?

– 0.5% data loss: What would (spin-on-demand) disk give us?

– CNAF model of CASTOR only for tape (and STORM for disk)?

– Is there a training issue for DB experts on CASTOR architecture/operation?

– Oracle/RAC architecture optimisation and what are reasonable/expected loads (by VO)?