Page 1: LCG – Databases - Meeting

CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it

LCG – Databases - Meeting

25 March 2008

Page 2: LCG – Databases - Meeting

• Present: Miguel Anjo, John Shade, Paolo Tedesco, Phool Chand, David Collados, Judit Novak, James Casey, Steve Traylen

Page 3: LCG – Databases - Meeting

Outline

Power cut problem
• Current config
• What happened

Service changes
• Move Jobs to Scheduler
• End of synonyms

Tasks on users / DBA
• Division of tasks

Points to improve
• ServiceMap account
• Gridmap service
• Cleanup/partitioning of SAM
• Gridview Merge/partitioning
• Weekly report checkup

AOB
• Lemon alarm for DB availability
• Next meeting

Page 4: LCG – Databases - Meeting

Main issue during the power cut

• Ethernet network switches in RAC6 were not connected to the critical power (wrong connection of the power bar)
  – The public and cluster interconnect networks went down

Page 5: LCG – Databases - Meeting

Impact of the power cut on the DB Services

ATLAS Online RAC – RAC2 – no downtime
• Services relocated to the 2 surviving nodes until manual reboot

ATLAS Offline RAC – RAC5 – no downtime
• Streams processes moved to available nodes

Downstream servers – RAC5 – no downtime
• Services unavailable from 6:30 till 7:30
• Services available on a single cluster node from 7:30 till 10:30

CMS RAC – RAC6 – 1h downtime + 3h of reduced performance
• Services unavailable from 6:30 till 7:30
• Services available on a single cluster node from 7:30 till 9:45, except LCG_SAM (wrongly allocated to the servers which went down)
• Further downtime (9:45-11:00) while fixing other nodes to support load

LCG RAC – RAC6 – 2h downtime + 2h of severely reduced performance
• Services unavailable from 6:30 till 7:30

LHCb RAC – RAC6 – 1h downtime

Page 6: LCG – Databases - Meeting

Service Changes

• Announce and schedule interventions
  – Have a main contact who keeps the plan and tracks progress, contacts all parties and announces the restart of all services
• Move from dbms_job to dbms_scheduler
    EGEE_PPS_SAM      rmTDLOneWeekOld;
    LCG_FTS_PROD      begin fts_history.movedata; end;
    LCG_FTS_PROD      begin fts_servicestate.runjob; end;
    LCG_FTS_PROD_T2   begin fts_history.movedata; end;
    LCG_SAM_PPS       p_testdef_autodel;
    LCG_SAM_PPS       rmTDLOneWeekOld;
  – http://oracle-documentation.web.cern.ch/oracle-documentation/10gr2doc/server.102/b14231/jobtosched.htm
• End synonyms
  – Use Schema.TABLE_NAME
    • Select from LCG_GRIDVIEW.SITES (from lcg_gridview_r or lcg_same_w or …)
  – Only need to grant the privileges
  – Check usage outside CERN (Miguel Anjo)
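
The dbms_job entries listed on this slide could be turned into dbms_scheduler definitions along the following lines. This is only a sketch that prints PL/SQL text, it does not connect to any database. The job names (`MIGRATED_JOB_n`), the daily repeat interval and the helper functions are assumptions made for illustration, since the slide gives neither job names nor schedules; `DBMS_SCHEDULER.CREATE_JOB` with `job_type => 'PLSQL_BLOCK'` is the standard Oracle call.

```python
# Jobs listed on the slide, as (schema, dbms_job "what" string) pairs.
JOBS = [
    ("EGEE_PPS_SAM", "rmTDLOneWeekOld;"),
    ("LCG_FTS_PROD", "begin fts_history.movedata; end;"),
    ("LCG_FTS_PROD", "begin fts_servicestate.runjob; end;"),
    ("LCG_FTS_PROD_T2", "begin fts_history.movedata; end;"),
    ("LCG_SAM_PPS", "p_testdef_autodel;"),
    ("LCG_SAM_PPS", "rmTDLOneWeekOld;"),
]


def as_plsql_block(what):
    """Wrap a bare procedure call in begin/end so it is a complete PL/SQL block."""
    text = what.strip()
    if text.lower().startswith("begin"):
        return text
    return "begin " + text + " end;"


def scheduler_block(schema, what, job_name, repeat_interval="FREQ=DAILY"):
    """Return PL/SQL text creating one scheduler job.

    job_name and repeat_interval are placeholders (assumptions), not values
    taken from the slides.
    """
    action = as_plsql_block(what).replace("'", "''")  # escape quotes for the SQL literal
    return "\n".join([
        "BEGIN",
        "  DBMS_SCHEDULER.CREATE_JOB(",
        f"    job_name        => '{schema}.{job_name}',",
        "    job_type        => 'PLSQL_BLOCK',",
        f"    job_action      => '{action}',",
        "    start_date      => SYSTIMESTAMP,",
        f"    repeat_interval => '{repeat_interval}',",
        "    enabled         => TRUE);",
        "END;",
        "/",
    ])


for number, (schema, what) in enumerate(JOBS, start=1):
    print(scheduler_block(schema, what, f"MIGRATED_JOB_{number}"))
```

The real repeat intervals would have to be read from the existing dbms_job definitions before the old jobs are removed.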

Page 7: LCG – Databases - Meeting

Developer tasks

Manage partitions
• create/drop/move

Clean up old data

Monitor space usage

Defragment tables

Check request to production (improve docs)

?

Page 8: LCG – Databases - Meeting

Points to improve

• ServiceMap account
  – Need reader/writer
• LCG_Gridmap service
  – User using LCG_SAM service
• Cleanup/partitioning of SAM (sam meeting?)
• Gridview Merge/partitioning (gridview meeting?)
• Weekly report checkup

Page 9: LCG – Databases - Meeting

AOB

• Lemon alarm for DB availability
  – Create lemon metric for DB services
• Next meeting
  – 29th July?

Page 10: LCG – Databases - Meeting

Why and what was done

• Space pressure on LCGR storage arrays
• Of ~2750GB, only 175GB are available
• Not possible to shrink datafiles
• 650GB space not used in datafiles
• Solution: move segments

Page 11: LCG – Databases - Meeting

Why and what was done

• Overview

TABLESPACE_NAME              GB    USED
LCG_GRIDVIEW_DATA01          858   311
LCG_SAME_DATA01              387   240
LCG_GRIDVIEW_DATA02          240   224
LCG_SAME_TESTDATA_1H2007     185   184
LCG_SAME_TESTDATA_2H2007     122   98
LCG_FTS_PROD_DATA01          113   77
LCG_GRIDVIEW_JOBSTATUSRAW    92    82
LCG_SAME_TESTDATA_2H2006     42    41
LCG_SAME_TESTDATA_1H2008     28    25

• Backup system will appreciate <200GB datafiles
  – A datafile is the smallest unit to back up; its backup can be neither parallelized nor resumed
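
The figures in the overview table can be checked with a few lines of Python. This is a minimal sketch: it reads the GB column as allocated size and the USED column as used GB, and it treats each tablespace as a single datafile when comparing against the 200GB backup preference. Both readings are assumptions, since the slide does not state the units of USED or the number of datafiles per tablespace.

```python
# (tablespace, allocated GB, used GB) as listed in the overview table.
TABLESPACES = [
    ("LCG_GRIDVIEW_DATA01", 858, 311),
    ("LCG_SAME_DATA01", 387, 240),
    ("LCG_GRIDVIEW_DATA02", 240, 224),
    ("LCG_SAME_TESTDATA_1H2007", 185, 184),
    ("LCG_SAME_TESTDATA_2H2007", 122, 98),
    ("LCG_FTS_PROD_DATA01", 113, 77),
    ("LCG_GRIDVIEW_JOBSTATUSRAW", 92, 82),
    ("LCG_SAME_TESTDATA_2H2006", 42, 41),
    ("LCG_SAME_TESTDATA_1H2008", 28, 25),
]


def unused_gb(rows):
    """Allocated-but-unused space per tablespace, in GB."""
    return {name: size - used for name, size, used in rows}


def over_limit(rows, limit_gb=200):
    """Tablespaces whose allocated size exceeds the backup-friendly limit."""
    return [name for name, size, _ in rows if size > limit_gb]


unused = unused_gb(TABLESPACES)
print(unused["LCG_GRIDVIEW_DATA01"])  # 858 - 311 = 547
print(over_limit(TABLESPACES))
```

Under these assumptions most of the unused space sits in LCG_GRIDVIEW_DATA01.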

Page 12: LCG – Databases - Meeting

Why and what was done

• Partitioned tables
  – LCG_SAME.TESTDATA (April 2007)
    • Monthly partitions up to “2008”, indexed, clob
    • Data since July 2007
    • Work to do (data move during CPUJan08 not finished - see later)
  – LCG_SAME.TESTDATA_HISTORY (March 2008)
    • Half-yearly partitions/tablespaces, no indexes
    • Data between July 2006 and July 2007
    • Created during CPUJan2008
  – LCG_GRIDVIEW.JOBSTATUSRAW (March 2008)
    • Monthly partitions up to Dec/2010, indexed
    • Created during CPUJan2008

Page 13: LCG – Databases - Meeting

Why and what was done

• LCG_GRIDVIEW_DATA01 space waste
  – Not possible to shrink datafiles
  – Solution: move data to a different datafile
• ALTER TABLE MOVE + ALTER INDEX REBUILD
  – Copy table, constraints online
  – Copy indexes online
  – Made some cursors invalid (need to restart app)
  – Done for tables <1GB (Thursday 6 March)
• DBMS_REDEFINITION does it online
  – Copy table, indexes, constraints, keep synchronized
  – Rename tables, copy privileges
  – Done successfully for 7 tables (Monday 10 March)
  – Failed for table VO (but reported successful)
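
The "ALTER TABLE MOVE + ALTER INDEX REBUILD" approach can be sketched as a small statement generator. A table move leaves the table's indexes unusable, which is why each index is rebuilt afterwards. The table name, index names and target tablespace below are hypothetical placeholders, not objects named in the slides, and the script only prints SQL text.

```python
def move_statements(owner, table, indexes, target_tablespace):
    """Return the SQL statements for moving one table and rebuilding its indexes."""
    statements = [
        f"ALTER TABLE {owner}.{table} MOVE TABLESPACE {target_tablespace};"
    ]
    for index in indexes:
        # Indexes become UNUSABLE after the move and must be rebuilt.
        statements.append(
            f"ALTER INDEX {owner}.{index} REBUILD TABLESPACE {target_tablespace} ONLINE;"
        )
    return statements


# Hypothetical example objects (assumptions for illustration only).
example = move_statements(
    owner="LCG_GRIDVIEW",
    table="SMALL_TABLE",
    indexes=["SMALL_TABLE_PK", "SMALL_TABLE_IDX1"],
    target_tablespace="NEW_DATA_TS",
)
for statement in example:
    print(statement)
```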

Page 14: LCG – Databases - Meeting

Why and what was done

• Why did it fail (table VO)?
  – Service request opened with Oracle
  – Table VO is heavily used (several users, synonyms, views, procedures)
  – Oracle failed to get a lock but did not report an error
    • “ORA-4020 Deadlock when trying to lock xxx” reported for other tables when moving
  – Similar problem for tables SITES and NODES
  – Currently difficult to create/drop tables referencing those tables
    • (tables in bad state? Service Request)

Page 15: LCG – Databases - Meeting

Missing operations

• LCG_GRIDVIEW
  – Recreate tables SITES, NODES, VO
  – Move 10 tables off DATA01 (120GB)
  – Possible “exp/imp” or “table move + index rebuild”
  – 8 hours??
• LCG_SAME
  – Move partitions 2H2007 to correct tablespace
  – Split 2008 partitions
  – Create partitions up to Dec2010
  – >1 day, “transparent”

Page 16: LCG – Databases - Meeting

Outline

Last week operations
• Why and what was done
• Missing operations

Space usage and cleanup
• Current situation
• What can be done
• Division of tasks

Interventions in production system
• Transparent interventions
• Applications resilience to interventions

Database meetings with developers
• Next meeting

Page 17: LCG – Databases - Meeting

Current situation

LCG_SAME
• TESTDATA_HISTORY is 224GB
  – Data: July 06 to July 07
  – Needed? For how long?
• TESTDATA is 286GB + indexes
  – Data >= July 07

LCG_GRIDVIEW
• JOBSTATUSRAW is partitioned
  – Agreed to drop partitions > 3 months old
  – About 30GB/month
• JOBSTATUS, GRIDFTPMONITORRAW
  – > 30GB + indexes
  – Partition? Regular clean up?
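
The agreed rule for JOBSTATUSRAW (drop partitions more than 3 months old) can be sketched as follows. The partition names (`P200710` and so on) and the way age is counted in whole calendar months are assumptions; the slides only say the table is partitioned monthly. The script prints the `ALTER TABLE ... DROP PARTITION` statements instead of running them.

```python
from datetime import date


def months_between(older, newer):
    """Whole calendar months from the month of `older` to the month of `newer`."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def partitions_to_drop(partition_months, today, keep_months=3):
    """Names of partitions whose month is more than keep_months before today.

    partition_months maps a partition name to the first day of the month it holds.
    """
    return sorted(
        name
        for name, month in partition_months.items()
        if months_between(month, today) > keep_months
    )


def drop_statements(table, names):
    return [f"ALTER TABLE {table} DROP PARTITION {name};" for name in names]


# Hypothetical partition names for October 2007 to March 2008.
partitions = {
    "P200710": date(2007, 10, 1),
    "P200711": date(2007, 11, 1),
    "P200712": date(2007, 12, 1),
    "P200801": date(2008, 1, 1),
    "P200802": date(2008, 2, 1),
    "P200803": date(2008, 3, 1),
}
old = partitions_to_drop(partitions, today=date(2008, 3, 25))
for statement in drop_statements("LCG_GRIDVIEW.JOBSTATUSRAW", old):
    print(statement)
```

At about 30GB per month, each dropped partition would free roughly that much space within the tablespace.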

Page 18: LCG – Databases - Meeting

What can be done

• Partitioning
  – Some maintenance work
  – No space gain
• Aggregates
  – After aggregation, delete row data
  – Space gain and performance boost
• History table (no indexes, compressed)
  – Little space gain
  – Heavy maintenance work

Page 19: LCG – Databases - Meeting

Expected growth

• Start monitoring of space growth per table
• What are expectations?
• How much aggregate data will be kept?
• What about aggregation of aggregates?
• LCG_SAM?
• LCG_GRIDVIEW?
• LCG_FTS?

Page 21: LCG – Databases - Meeting

Transparent interventions

• Huge database (for SAM, GridView)
  – Impossible to perform full scale tests
• Some operations ‘with risk’ for long periods
• How to schedule?
• Possible to do with downtime? (less risk)
• Notification flow?

Type                          Flow
“With risk” selected users
“With risk” all users
“Downtime” selected users
“Downtime” all users

Page 22: LCG – Databases - Meeting

Applications resilience to interventions

• Resilient: adj.
  – Marked by the ability to recover readily, as from misfortune;
  – Capable of returning to an original shape or position, as after having been compressed.

Application   Resilient?   Grid consequence?   Acceptable downtime?
FTS
Gridview
LFC
SAM
VOMS

Page 24: LCG – Databases - Meeting

Next meeting

• Main developers of LCG
• Weekly report, interventions planned, SQL optimization, share solutions
• Schedule: Monday after the 15th at 14:00
• Next meeting 21st April – 14:00

FTS        Gavin Mccance
GridView   James Casey
LFC
SAM
VOMS       Steve Murray
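
The scheduling rule ("Monday after the 15th") can be computed directly. This short sketch reads the rule as the first Monday strictly after the 15th of the month, which is an assumption; for April 2008 it gives the 21st April date quoted on the slide.

```python
from datetime import date, timedelta


def meeting_date(year, month):
    """First Monday strictly after the 15th of the given month."""
    day = date(year, month, 15) + timedelta(days=1)
    while day.weekday() != 0:  # Monday is 0
        day += timedelta(days=1)
    return day


print(meeting_date(2008, 4))  # 2008-04-21
```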
