
Page 1: Implementing ASM Without HW RAID, A User’s Experience

CERN IT Department
CH-1211 Genève 23, Switzerland
www.cern.ch/it

Implementing ASM Without HW RAID, A User’s Experience

Luca Canali, CERN
Dawid Wojcik, CERN

UKOUG, Birmingham, December 2008

Page 2: Implementing ASM Without HW RAID, A User’s Experience


Outlook

• Introduction to ASM
  – Disk groups, fail groups, normal redundancy
• Scalability and performance of the solution
• Possible pitfalls, sharing experiences
• Implementation details, monitoring, and tools to ease ASM deployment

Page 3: Implementing ASM Without HW RAID, A User’s Experience


Architecture and main concepts

• Why ASM?
  – Provides the functionality of a volume manager and a cluster file system
  – Raw access to storage for performance
• Why ASM-provided mirroring?
  – Allows the use of lower-cost storage arrays
  – Allows mirroring across storage arrays
    • Arrays are not single points of failure
    • Array (HW) maintenance can be done in a rolling way
  – Stretch clusters

Page 4: Implementing ASM Without HW RAID, A User’s Experience


ASM and cluster DB architecture

• Oracle architecture built from redundant low-cost components

[Diagram: database servers connected through a redundant SAN to the storage arrays]

Page 5: Implementing ASM Without HW RAID, A User’s Experience


Files, extents, and failure groups

[Diagrams: files and extent pointers; failgroups and ASM mirroring]

Page 6: Implementing ASM Without HW RAID, A User’s Experience


ASM disk groups

• Example: HW = 4 disk arrays with 8 disks each
• An ASM diskgroup is created using all available disks (see the sketch below)
  – The end result is similar to a file system on RAID 1+0
  – ASM allows mirroring across storage arrays
  – Oracle RDBMS processes access the storage directly (raw disk access)

[Diagram: one ASM diskgroup; extents striped across the disks within each failgroup and mirrored between Failgroup1 and Failgroup2]
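A minimal creation sketch for such a diskgroup, with invented disk paths rather than the exact CERN commands:

-- NORMAL redundancy diskgroup mirroring across two arrays.
-- Each FAILGROUP lists the disks of one array, so ASM never
-- places both copies of an extent inside the same array.
CREATE DISKGROUP data1 NORMAL REDUNDANCY
  FAILGROUP array1 DISK
    '/dev/mpath/array1_1', '/dev/mpath/array1_2',
    '/dev/mpath/array1_3', '/dev/mpath/array1_4'
  FAILGROUP array2 DISK
    '/dev/mpath/array2_1', '/dev/mpath/array2_2',
    '/dev/mpath/array2_3', '/dev/mpath/array2_4';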

Page 7: Implementing ASM Without HW RAID, A User’s Experience


Performance and scalability

• ASM with normal redundancy
  – Stress tested for CERN’s use
  – Scales and performs well

Page 8: Implementing ASM Without HW RAID, A User’s Experience


Case Study: the largest cluster I have ever installed, RAC5

• The test used 14 servers

Page 9: Implementing ASM Without HW RAID, A User’s Experience


Multipathed fiber channel

• 8 FC switches: 4Gbps (10Gbps uplink)

Page 10: Implementing ASM Without HW RAID, A User’s Experience


Many spindles

• 26 storage arrays (16 SATA disks each)

Page 11: Implementing ASM Without HW RAID, A User’s Experience


Case Study: I/O metrics for the RAC5 cluster

• Measured, sequential I/O
  – Read: 6 GB/s
  – Read-write: 3+3 GB/s
• Measured, small random I/O
  – Read: 40K IOPS (8 KB read ops)
• Note:
  – 410 SATA disks, 26 HBAs on the storage arrays
  – Servers: 14 x 4+4 Gbps HBAs, 112 cores, 224 GB of RAM

Page 12: Implementing ASM Without HW RAID, A User’s Experience


How the test was run

• A custom SQL-based DB workload:
  – IOPS: probe a large table (several TB) randomly via several parallel query slaves, each reading a single block at a time
  – MBPS: read a large table (several TB) with parallel query
• The test table used for the RAC5 cluster was 5 TB in size, created inside a diskgroup of 70 TB
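A hedged sketch of what the two workload patterns can look like in SQL; the table, index, and parallelism degree are illustrative, not the actual test code:

-- IOPS: random single-block probes; run many copies concurrently,
-- each indexed lookup reads one 8 KB block at a time.
SELECT /*+ INDEX(t big_table_pk) */ payload
FROM   big_table t
WHERE  id = :random_id;

-- MBPS: sequential throughput via a parallel full scan.
SELECT /*+ FULL(t) PARALLEL(t, 16) */ COUNT(*)
FROM   big_table t;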

Page 13: Implementing ASM Without HW RAID, A User’s Experience


Possible pitfalls

• Production stories
  – Sharing experiences: 3 years in production, 550 TB of raw capacity

Page 14: Implementing ASM Without HW RAID, A User’s Experience


Rebalancing speed

• Rebalancing is performed (and mandatory) after space management operations
  – Typically after HW failures, to restore the mirror
  – Goal: balanced space allocation across disks
  – Not based on performance or utilization
  – ASM instances are in charge of rebalancing
• Scalability of rebalancing operations?
  – In 10g, serialization wait events can limit scalability
  – Even at maximum speed, rebalancing is not always I/O bound
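For reference, a rebalance is triggered implicitly by disk add/drop operations, and its parallelism can be set explicitly; a generic sketch with invented names:

-- Replacing a failed disk starts a rebalance; POWER sets its
-- degree of parallelism (0 to 11 in 10g).
ALTER DISKGROUP data1
  DROP DISK data1_0012
  ADD FAILGROUP array2 DISK '/dev/mpath/array2_5'
  REBALANCE POWER 8;

-- The power of a running rebalance can be changed on the fly.
ALTER DISKGROUP data1 REBALANCE POWER 11;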

Page 15: Implementing ASM Without HW RAID, A User’s Experience


Rebalancing, an example

[Chart: ASM rebalancing performance (RAC); rebalance rate in MB/min (0 to 7000) versus diskgroup rebalance parallelism (0 to 12), comparing Oracle 10g and Oracle 11g]

Page 16: Implementing ASM Without HW RAID, A User’s Experience


VLDB and rebalancing

• Rebalancing operations can move more data than expected
• Example:
  – 5 TB allocated on ~100 disks of 200 GB each
  – A disk is replaced (diskgroup rebalance)
  – The total I/O workload is 1.6 TB (8x the disk size!)
• How to see this: query v$asm_operation; the EST_WORK column keeps growing during the rebalance (see the query below)
• The issue: excessive repartnering
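The growth can be watched with a query along these lines:

-- During a rebalance, compare EST_WORK (estimated allocation
-- units to move) against SOFAR; EST_WORK growing well beyond the
-- size of the replaced disk is the repartnering effect above.
SELECT group_number, operation, state, power,
       sofar, est_work, est_rate, est_minutes
FROM   v$asm_operation;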

Page 17: Implementing ASM Without HW RAID, A User’s Experience


Rebalancing issues wrap-up

• Rebalancing can be slow
  – Many hours for very large diskgroups
• Associated risk
  – A 2nd disk failure while rebalancing
  – Worst case: loss of the diskgroup because partner disks fail

Page 18: Implementing ASM Without HW RAID, A User’s Experience


Fast Mirror Resync

• ASM 10g with normal redundancy does not allow part of the storage to be taken offline
  – A transient error in a storage array can cause several hours of rebalancing to drop and re-add the disks
  – It is a limiting factor for scheduled maintenance
• 11g has the new feature ‘fast mirror resync’
  – A great feature for rolling interventions on HW
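A sketch of how the 11g feature is used; the attribute value and failgroup name are illustrative, and it requires compatible.asm >= 11.1:

-- Let offlined disks survive up to 4h before ASM drops them.
ALTER DISKGROUP data1 SET ATTRIBUTE 'disk_repair_time' = '4h';

-- Take one array's failgroup offline for rolling maintenance ...
ALTER DISKGROUP data1 OFFLINE DISKS IN FAILGROUP array2;

-- ... then resync only the extents changed while it was offline.
ALTER DISKGROUP data1 ONLINE DISKS IN FAILGROUP array2;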

Page 19: Implementing ASM Without HW RAID, A User’s Experience


ASM and filesystem utilities

• Only a few tools can access ASM files
  – asmcmd, dbms_file_transfer, XDB FTP
  – Limited operations (no copy, rename, etc.)
  – They require open DB instances
  – File operations are difficult in 10g
• 11g asmcmd has the copy (cp) command
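In 10g, one of the few ways to move a file out of ASM is DBMS_FILE_TRANSFER through an open instance; a hedged sketch, with illustrative directory objects and file names:

CREATE DIRECTORY asm_dir AS '+DATA1/orcl/datafile';
CREATE DIRECTORY fs_dir  AS '/tmp/asm_copies';

BEGIN
  -- Copy an ASM file to a regular filesystem; the source file
  -- name below is invented for illustration.
  DBMS_FILE_TRANSFER.COPY_FILE(
    source_directory_object      => 'ASM_DIR',
    source_file_name             => 'users.259.123456789',
    destination_directory_object => 'FS_DIR',
    destination_file_name        => 'users01.dbf');
END;
/

From 11g onwards, asmcmd cp achieves the same from the command line.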

Page 20: Implementing ASM Without HW RAID, A User’s Experience


ASM and corruption

• ASM metadata corruption
  – Can be caused by bugs
  – One case in production, after a disk eviction
• Physical data corruption
  – ASM automatically fixes most corruption on the primary extent, typically when it is read during a full backup
  – Secondary extent corruption goes undetected until a disk failure or rebalance exposes it
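One generic way to surface such corruption proactively rather than waiting for a rebalance to expose it (standard Oracle practice, not necessarily the CERN procedure): have RMAN read and check every block with BACKUP VALIDATE CHECK LOGICAL DATABASE, then inspect the view it populates:

-- Rows appear here for blocks found corrupt during the
-- validation run.
SELECT file#, block#, blocks, corruption_type
FROM   v$database_block_corruption;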

Page 21: Implementing ASM Without HW RAID, A User’s Experience


Disaster recovery

• Corruption issues were fixed by using a physical standby to move to ‘fresh’ storage
• For HA, our experience is that disaster recovery is needed
  – Standby DB
  – On-disk (flash) copy of the DB

Page 22: Implementing ASM Without HW RAID, A User’s Experience


Implementation details

Page 23: Implementing ASM Without HW RAID, A User’s Experience


Storage deployment

• Current storage deployment for Physics Databases at CERN
  – SAN: FC (4 Gb/s) storage enclosures with SATA disks (8 or 16 per enclosure)
  – Linux x86_64; no ASMLib, device mapper instead (naming persistence + HA)
  – Over 150 FC storage arrays (production, integration, and test) and ~2000 LUNs exposed
  – Biggest DB over 7 TB (more to come when the LHC starts; estimated growth up to 11 TB/year)

Page 24: Implementing ASM Without HW RAID, A User’s Experience


Storage deployment

• ASM implementation details
  – Storage in JBOD configuration (1 disk -> 1 LUN)
  – Each disk partitioned at the OS level
    • 1st partition: 45% of the disk size, on the faster outer sectors (short stroke)
    • 2nd partition: the rest, on the slower inner sectors (full stroke)

[Diagram: outer sectors, short stroke; inner sectors, full stroke]

Page 25: Implementing ASM Without HW RAID, A User’s Experience



Storage deployment

• Two diskgroups created for each cluster (creation sketched below)
  – DATA: data files and online redo logs, on the outer part of the disks
  – RECO: flash recovery area destination (archived redo logs and on-disk backups), on the inner part of the disks
• One failgroup per storage array

[Diagram: DATA_DG1 and RECO_DG1 diskgroups spanning Failgroup1 through Failgroup4, one failgroup per storage array]
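Putting the last two slides together, the creation could look roughly like this; names are illustrative, the p1/p2 suffixes are the outer and inner partitions described earlier, and ASM discovery strings accept wildcards:

-- DATA on the fast outer partitions (p1), RECO on the slower
-- inner partitions (p2); one failgroup per storage array.
CREATE DISKGROUP data_dg1 NORMAL REDUNDANCY
  FAILGROUP rstor401 DISK '/dev/mpath/rstor401_*p1'
  FAILGROUP rstor402 DISK '/dev/mpath/rstor402_*p1'
  FAILGROUP rstor403 DISK '/dev/mpath/rstor403_*p1'
  FAILGROUP rstor404 DISK '/dev/mpath/rstor404_*p1';

CREATE DISKGROUP reco_dg1 NORMAL REDUNDANCY
  FAILGROUP rstor401 DISK '/dev/mpath/rstor401_*p2'
  FAILGROUP rstor402 DISK '/dev/mpath/rstor402_*p2'
  FAILGROUP rstor403 DISK '/dev/mpath/rstor403_*p2'
  FAILGROUP rstor404 DISK '/dev/mpath/rstor404_*p2';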

Page 26: Implementing ASM Without HW RAID, A User’s Experience


Storage management

• SAN setup in JBOD configuration: many steps, can be time consuming
  – Storage level
    • logical disks
    • LUNs
    • mappings
  – FC infrastructure: zoning
  – OS: creating the device mapper configuration
    • multipath.conf: name persistency + HA

Page 27: Implementing ASM Without HW RAID, A User’s Experience


Storage management

• Storage manageability
  – DBAs set up the initial configuration
  – ASM means extra maintenance in case of storage incidents (disk failure)
  – Problems
    • How to quickly set up the SAN configuration
    • How to manage disks and keep track of the mappings: physical disk -> LUN -> Linux disk -> ASM disk

Example mapping: SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor901_3 -> ASM TEST1_DATADG1_0016

Page 28: Implementing ASM Without HW RAID, A User’s Experience


Storage management

• Solution
  – Configuration DB: a repository of FC switches, port allocations, and all SCSI identifiers for all nodes and storage arrays
    • Big initial effort
    • Easy to maintain
    • High ROI
  – Custom tools (shown on the next slides)
    • Tools to identify
      – SCSI (block) device <-> device mapper device <-> physical storage and FC port
      – Device mapper device <-> ASM disk
    • Automatic generation of the device mapper configuration

Page 29: Implementing ASM Without HW RAID, A User’s Experience


Storage management

[ ~]$ lssdisks.py

The following storages are connected:

* Host interface 1:

Target ID 1:0:0: - WWPN: 210000D0230BE0B5 - Storage: rstor316, Port: 0

Target ID 1:0:1: - WWPN: 210000D0231C3F8D - Storage: rstor317, Port: 0

Target ID 1:0:2: - WWPN: 210000D0232BE081 - Storage: rstor318, Port: 0

Target ID 1:0:3: - WWPN: 210000D0233C4000 - Storage: rstor319, Port: 0

Target ID 1:0:4: - WWPN: 210000D0234C3F68 - Storage: rstor320, Port: 0

* Host interface 2:

Target ID 2:0:0: - WWPN: 220000D0230BE0B5 - Storage: rstor316, Port: 1

Target ID 2:0:1: - WWPN: 220000D0231C3F8D - Storage: rstor317, Port: 1

Target ID 2:0:2: - WWPN: 220000D0232BE081 - Storage: rstor318, Port: 1

Target ID 2:0:3: - WWPN: 220000D0233C4000 - Storage: rstor319, Port: 1

Target ID 2:0:4: - WWPN: 220000D0234C3F68 - Storage: rstor320, Port: 1

SCSI Id Block DEV MPath name MP status Storage Port

------------- ---------------- -------------------- ---------- ------------------ -----

[0:0:0:0] /dev/sda - - - -

[1:0:0:0] /dev/sdb rstor316_CRS OK rstor316 0

[1:0:0:1] /dev/sdc rstor316_1 OK rstor316 0

[1:0:0:2] /dev/sdd rstor316_2 FAILED rstor316 0

[1:0:0:3] /dev/sde rstor316_3 OK rstor316 0

[1:0:0:4] /dev/sdf rstor316_4 OK rstor316 0

[1:0:0:5] /dev/sdg rstor316_5 OK rstor316 0

[1:0:0:6] /dev/sdh rstor316_6 OK rstor316 0

. . .

. . .

(Output of a custom-made script: maps SCSI id (host,channel,id) to storage name and FC port, and SCSI ID -> block device -> device mapper name and status -> storage name and FC port.)

Page 30: Implementing ASM Without HW RAID, A User’s Experience


Storage management

[ ~]$ listdisks.py

DISK NAME GROUP_NAME FG H_STATUS MODE MOUNT_S STATE TOTAL_GB USED_GB

---------------- ------------------ ------------- ---------- ---------- ------- -------- ------- ------ -----

rstor401_1p1 RAC9_DATADG1_0006 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5

rstor401_1p2 RAC9_RECODG1_0000 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 119.9 1.7

rstor401_2p1 -- -- -- UNKNOWN ONLINE CLOSED NORMAL 111.8 111.8

rstor401_2p2 -- -- -- UNKNOWN ONLINE CLOSED NORMAL 120.9 120.9

rstor401_3p1 RAC9_DATADG1_0007 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.6

rstor401_3p2 RAC9_RECODG1_0005 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8

rstor401_4p1 RAC9_DATADG1_0002 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5

rstor401_4p2 RAC9_RECODG1_0002 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8

rstor401_5p1 RAC9_DATADG1_0001 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5

rstor401_5p2 RAC9_RECODG1_0006 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8

rstor401_6p1 RAC9_DATADG1_0005 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5

rstor401_6p2 RAC9_RECODG1_0007 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8

rstor401_7p1 RAC9_DATADG1_0000 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.6

rstor401_7p2 RAC9_RECODG1_0001 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8

rstor401_8p1 RAC9_DATADG1_0004 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.6

rstor401_8p2 RAC9_RECODG1_0004 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8

rstor401_CRS1

rstor401_CRS2

rstor401_CRS3

rstor402_1p1 RAC9_DATADG1_0015 RAC9_DATADG1 RSTOR402 MEMBER ONLINE CACHED NORMAL 111.8 59.9

. . .

. . .

(Output of a custom-made script: maps device mapper name to ASM disk and status.)

Page 31: Implementing ASM Without HW RAID, A User’s Experience


Storage management

[ ~]$ gen_multipath.py

# multipath default configuration for PDB

defaults {

udev_dir /dev

polling_interval 10

selector "round-robin 0"

. . .

}

. . .

multipaths {

multipath {

wwid 3600d0230006c26660be0b5080a407e00

alias rstor916_CRS

}

multipath {

wwid 3600d0230006c26660be0b5080a407e01

alias rstor916_1

}

. . .

}

(Output of a custom-made script: the device mapper alias provides naming persistency and multipathing (HA), e.g. SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor916_1.)

Page 32: Implementing ASM Without HW RAID, A User’s Experience


Storage monitoring

• ASM-based mirroring means
  – Oracle DBAs need to be alerted of disk failures and evictions
  – Dashboard for a global overview: a custom solution, RACMon
• ASM level monitoring
  – Oracle Enterprise Manager Grid Control
  – RACMon: alerts on missing disks and failgroups, plus a dashboard
• Storage level monitoring
  – RACMon: LUN health and storage configuration details on a dashboard
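A minimal sketch of the kind of ASM-level check such alerting can build on (run against the ASM instance; thresholds and notification plumbing omitted):

-- Any row returned means a disk is missing or not fully online,
-- so a DBA should be alerted.
SELECT g.name AS diskgroup, d.failgroup, d.path,
       d.mount_status, d.mode_status, d.state
FROM   v$asm_disk d
LEFT JOIN v$asm_diskgroup g ON g.group_number = d.group_number
WHERE  d.mount_status = 'MISSING'
   OR  d.mode_status <> 'ONLINE'
   OR  d.state <> 'NORMAL';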

Page 33: Implementing ASM Without HW RAID, A User’s Experience


Storage monitoring

• ASM instance level monitoring
• Storage level monitoring

[Screenshots: monitoring dashboard showing a new failing disk on RSTOR614 and a new disk installed on RSTOR903, slot 2]

Page 34: Implementing ASM Without HW RAID, A User’s Experience


Conclusions

• Oracle ASM diskgroups with normal redundancy
  – Used at CERN instead of HW RAID
  – Performance and scalability are very good
  – Allows the use of low-cost HW
  – Requires more admin effort from the DBAs than high-end storage
  – 11g has important improvements
• Custom tools ease administration

Page 35: Implementing ASM Without HW RAID, A User’s Experience


Q&A

Thank you

• Links:
  – http://cern.ch/phydb
  – http://www.cern.ch/canali