Tier1 - Disk Failure stats and Networking

Martin Bly, Tier1 Fabric Manager

11/06/2010

Page 1:

Tier1 - Disk Failure stats and Networking

Martin Bly, Tier1 Fabric Manager

Page 2:

Overview

• Tier1 Disk Storage Hardware
• Disk Failure Statistics
• Network configuration

Page 3:

Tier1 Storage Model

• Tier1 provides disk storage for Castor
  – Particle Physics data (mostly)
• Storage is:
  – Commodity components
  – Storage-in-a-box
  – Lots of units

Page 4:

Storage Systems I

• Typically 16- or 24-disk chassis
  – Mostly SuperMicro
• One or more hardware RAID cards
  – 3ware/AMCC, Areca, Adaptec, LSI
  – PCI-X or PCI-e
  – Some active backplanes
• Configured as RAID5 or, more commonly, RAID6
  – RAID1 for system disks, on a second controller where it fitted
• Dual multi-core CPUs
• 4GB, 8GB or 12GB RAM
• 1 GbE NIC
• IPMI

Page 5:

Storage Systems II

Year    Chassis  CPU         CPUs x Cores  RAM/GB  NIC    Units
2005    24-bay   AMD 275     2 x 2         4       1 GbE  21
2006    16-bay   AMD 275     2 x 2         4       1 GbE  86
2007 A  16-bay   AMD 2220    2 x 2         8       1 GbE  91
2007 B  16-bay   Intel 5310  2 x 4         8       1 GbE  91
2008 A  16-bay   L5410       2 x 4         8       1 GbE  50
2008 B  16-bay   E5405       2 x 4         8       1 GbE  60
2009 A  16-bay   E5504       2 x 4         12      1 GbE  60
2009 B  16-bay   E5504       2 x 4         12      1 GbE  38

Page 6:

Storage Systems III

Year    Controller         Data Drive  Data Drives  RAID  Capacity  Units  Drives
2005    Areca 1170         WD5000      22           6     10TB      21     462
2006    3ware 9550SX-16ML  WD5000      14           5     6TB       86     1204
2007 A  3ware 9650SE-16    WD7500      14           6     9TB       91     1274
2007 B  3ware 9650SE-16    WD7500      14           6     9TB       91     1274
2008 A  Areca 1280         WD RE       22           6     20TB      50     1100
2008 B  3ware 9650SE-24    WD RE       22           6     20TB      60     1320
2009 A  LSI 8888-ELP       WD RE4-GP   22           6     38TB      60     1320
2009 B  Adaptec 52445      Hitachi     22           6     40TB      38     836
Total                                                               497    8790
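The capacity column follows from simple RAID arithmetic: usable space is (number of data drives minus parity drives) times the drive size. A minimal sketch of that calculation; the per-drive sizes are assumptions read off the model names (WD5000 = 500 GB, WD7500 = 750 GB), not stated explicitly on the slide.

```python
# Minimal sketch: usable capacity of a RAID array in TB.
# Drive sizes below are assumptions inferred from the model names in the table.
def usable_tb(data_drives: int, drive_tb: float, raid_level: int) -> float:
    """RAID5 sacrifices one drive's worth of space for parity, RAID6 two."""
    parity = {5: 1, 6: 2}[raid_level]
    return (data_drives - parity) * drive_tb

print(usable_tb(22, 0.5, 6))   # 2005: 22 x 500 GB, RAID6  -> 10.0 TB
print(usable_tb(14, 0.75, 6))  # 2007: 14 x 750 GB, RAID6  ->  9.0 TB
print(usable_tb(22, 2.0, 6))   # 2009 B: 22 x 2 TB, RAID6  -> 40.0 TB
```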


Page 7:

Storage Use

• Servers are assigned to Service Classes
  – Each Virtual Organisation (VO) has several Service Classes
  – Usually between 2 and many servers per Service Class
• Service Classes are either:
  – D1T1: data is on both disk and tape
  – D1T0: data is on disk but NOT tape
    • If lost, VO gets upset :-(
  – D0T1: data is on tape – disk is a buffer
  – D0T0: ephemeral data on disk
• Service Class type is assigned to a VO depending on their data model and what they want to do with a chunk of storage (the naming is sketched after this list)
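The service-class names encode how many copies of the data live on disk (first digit) and on tape (second digit). A small illustrative sketch of that naming only, not the actual Castor configuration:

```python
# Illustrative sketch of the DxTy service-class naming:
# first digit = copies on disk, second digit = copies on tape.
SERVICE_CLASSES = {
    "D1T1": {"disk": 1, "tape": 1},  # data on both disk and tape
    "D1T0": {"disk": 1, "tape": 0},  # disk only - loss upsets the VO
    "D0T1": {"disk": 0, "tape": 1},  # tape copy, disk is just a buffer
    "D0T0": {"disk": 0, "tape": 0},  # ephemeral data on disk
}

def disk_is_only_copy(name: str) -> bool:
    """True when the disk copy is the only permanent copy of the data."""
    sc = SERVICE_CLASSES[name]
    return sc["disk"] >= 1 and sc["tape"] == 0

print(disk_is_only_copy("D1T0"))  # True  -> keep on RAID6 servers
print(disk_is_only_copy("D0T1"))  # False -> tape holds the master copy
```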


Page 8:

System uses

• Want to make sure data is as secure as possible
• RAID5 is not secure enough:
  – Only 1 disk failure can put data at risk
  – Longer period of risk as rebuild time increases due to array/disk sizes
  – Even with a hot spare, risk of double failure is significant (a rough estimate follows after this list)
• Keep D1T0 data on RAID6 systems
  – Double parity information
  – Can lose two disks before data is at risk
• Tape buffer systems need throughput (network bandwidth) not space
  – Use smaller capacity servers in these Service Classes
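A rough way to quantify the RAID5 concern: during a rebuild, one more failure in the same array puts data at risk. The sketch below is a back-of-the-envelope estimate only; the annual failure rate and the rebuild time are assumed figures for illustration, not measurements from these systems.

```python
# Back-of-the-envelope estimate of a second failure during a RAID5 rebuild.
# AFR and rebuild time are assumed illustrative figures, not measured values.
HOURS_PER_YEAR = 24 * 365

def p_second_failure(surviving_drives: int, afr: float, rebuild_hours: float) -> float:
    """Probability that at least one surviving drive in the array fails
    during the rebuild window, assuming independent failures."""
    p_one = afr * rebuild_hours / HOURS_PER_YEAR
    return 1 - (1 - p_one) ** surviving_drives

# 13 surviving drives (a 14-drive RAID5 set), 9.4% AFR, 24-hour rebuild:
print(f"{p_second_failure(13, 0.094, 24):.2%}")  # ~0.33% per rebuild event
```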


Page 9:

Disk Statistics

Page 10:

2009 High(Low!)lights

• Drives failed / changed: 398 (9.4%!)
• Multiple failure incidents: 21
• Recoveries from multiple failures: 16
• Data copied to another file system: 1
• Lost file systems: 4

Page 11:

2009 Data

Month          Clustervision 05  Viglen 06  Viglen 07a  Viglen 07i  WD 500GB  WD 750GB  Seagate 750GB  All
Jan            6                 18         3           2           24        3         2              29
Feb            1                 16         4           5           17        4         5              26
Mar            3                 17         4           3           20        4         3              27
Apr            1                 21         1           2           22        1         2              25
May            4                 13         6           6           17        6         6              29
Jun            3                 14         6           5           17        6         5              28
Jul            6                 30         6           5           36        6         5              47
Aug            1                 31         10          6           32        10        6              48
Sep            13                13         3           5           26        3         5              34
Oct            7                 28         1           4           35        1         4              40
Nov            12                19         0           3           31        0         3              34
Dec            9                 16         2           4           25        2         4              31
Total          66                236        46          50          302       46        50             398
Average/month  5.50              19.67      3.83        4.17        25.17     3.83      4.17           33.17
Base           462               1204       1274        1274        1688      1274      1274           4214
% fail         14.3%             19.6%      3.6%        3.9%        17.9%     3.6%      3.9%           9.4%
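The bottom rows of the table are simple arithmetic: the yearly percentage is the number of failed drives divided by the installed base. A minimal sketch reproducing the per-generation figures:

```python
# Reproduce the "% fail" row: 2009 failures divided by installed base.
failures = {"Clustervision 05": 66, "Viglen 06": 236,
            "Viglen 07a": 46, "Viglen 07i": 50}
installed = {"Clustervision 05": 462, "Viglen 06": 1204,
             "Viglen 07a": 1274, "Viglen 07i": 1274}

for gen, failed in failures.items():
    print(f"{gen}: {failed / installed[gen]:.1%}")   # e.g. Viglen 06: 19.6%

print(f"All: {sum(failures.values()) / sum(installed.values()):.1%}")  # 398/4214 -> 9.4%
```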

Page 12:

Failure by generation

[Chart: Drive Failures by Generation - 2009. Number of failures per month, Jan to Dec, by hardware generation (Clustervision 05, Viglen 06, Viglen 07a, ...).]

Page 13:

By drive model

[Chart: Drive Failures by Drive Type - 2009. Number of failures per month, Jan to Dec, for WD 500GB, WD 750GB and Seagate 750GB drives.]

Page 14:

Normalised Drive Failure Rates

[Chart: Failure Rate Normalised by Installed Base, 2009-10. Normalised failure rate per month, Jan 2009 to Apr 2010, for Clustervision 05, Viglen 06, Viglen 07a, Viglen 07i and Streamline 08.]

In 2009, 10% of drives failed in storage servers

Page 15:

2009 Multiple failures data

Multiple-disk failure incidents in 2009, by generation and outcome:

Generation        Saved  Copied  Lost  Total
Clustervision 05  0      0       0     0
Viglen 06         11     1       4     16
Viglen 07a        3      0       0     3
Viglen 07i        2      0       0     2
All               16     1       4     21

Average per month: Clustervision 05 0.00, Viglen 06 1.33, Viglen 07a 0.25, Viglen 07i 0.17; All: Saved 1.33, Copied 0.08, Lost 0.33, Total 1.75

Incidents per month (all generations): Jan 1, Feb 2, Mar 2, Apr 2, May 0, Jun 0, Jul 2, Aug 7, Sep 1, Oct 2, Nov 2, Dec 0

Page 16:

Lost file systems (arrays)

[Chart: Multiple-disk failure events 2009. Number of events per month, Jan to Dec, split into All, Saved, Copied and Lost.]

Page 17:

Closer look: Viglen 06

• Why are Viglen 06 servers less reliable?
• RAID5?
  – More vulnerable to a double disk failure causing a problem
• Controller issues?
  – Yes: failure to start rebuilds after a drive fails
  – Yes: some cache RAM issues
• System RAM?
  – Yes: ECC not set up correctly in BIOS
• Age?
  – Possibly: difficult to assert with confidence
• Disks?
  – Hmmm....

Page 18:

Viglen 06 – Failures per server

[Chart: Disk failures - Viglen 06 - June '08 to April '10. Number of failures per server, gdss87 through gdss171, scale 0 to 14.]

Page 19:

Viglen 06 failures - sorted

[Chart: Viglen Disk Data - Sorted. Per-server failure counts for gdss87 through gdss171 in sorted order, scale 0 to 14, with the servers grouped into Batch 1 and Batch 2.]

Page 20:

Summary

• Overall failure rate of drives is high (9.4%)
• Dominated by WD 500GB drives (19.6%)
• Rate for 750GB drives much lower (3.75%)
• Spikes in failures:
  – at the time of the Tier1 migration to R89
  – when the air conditioning in R89 failed
• Batch effect for the Viglen 06 generation

Page 21:

Networking

Page 22:

Tier1 Network - 101

• Data network
  – Central core switch: Force10 C300 (10GbE)
  – Edge switches: Nortel 55xx/56xx series stacks
  – Core-to-Edge: 10GbE or 2 x 10GbE
  – All nodes connected at 1 x GbE for data
• Management network
  – Various 10/100MbE switches: NetGear, 3Com
    • New and salvaged from the old data network
• Uplinks
  – 10Gb/s to CERN (+ failover)
  – 10Gb/s to Site LAN
  – 10Gb/s + 10Gb/s failover to SJ5 (to be 20Gb/s each soon)
  – Site routers: Nortel 8600 series

Page 23:

Page 24:

Tier1 Network 102

• Tier1 has a /21 network within the site range
  – ~2000 addresses
• Need a dedicated IP range for the LHCOPN
  – /23 network within the existing subnet
  – ~500 addresses, Castor disk servers only (address arithmetic is sketched below the diagram)

[Diagram: the 130.246.176.0 to 130.246.191.0 address range divided into blocks labelled T1-Extra, Tier1, SSD, OPN, T1-Original and T1 Dev.]
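The address counts follow from the prefix lengths: a /21 holds 2048 addresses and a /23 holds 512, i.e. roughly the ~2000 and ~500 quoted above. A small sketch with Python's ipaddress module; the exact prefixes used here are illustrative picks from the 130.246.176-191 range in the diagram, not the real allocation boundaries.

```python
import ipaddress

# Illustrative prefixes from the 130.246.176.x-191.x range shown above;
# the real allocation boundaries are not given on the slide.
tier1_range = ipaddress.ip_network("130.246.176.0/21")  # "~2000 addresses"
opn_range = ipaddress.ip_network("130.246.182.0/23")    # "~500 addresses"

print(tier1_range.num_addresses)         # 2048
print(opn_range.num_addresses)           # 512
print(opn_range.subnet_of(tier1_range))  # True: the OPN /23 sits inside the /21
```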

Page 25:

Tier1 Network 103 - routing

• General nodes
  – Gateway address on site router
  – All traffic goes to Router A
    • From there to site, or off site via the firewall and Site Access Router (SAR)
• OPN nodes
  – Gateway address on UKLight router
  – Special routes to Router A for site-only traffic
  – All off-site traffic to the UKLight router
• UKLight router
  – BGP routing information from CERN
  – T0, T1<->T1 traffic directed to the CERN link
  – Lancaster traffic -> link to Lancaster
  – Other traffic from RAL Tier1 up to the SAR (the bypass) – see the sketch after this list
• SAR
  – Filters inbound traffic for the Tier1 subnet
    • OPN machines to Core via UKLight router
    • Others to Router A via firewall
    • Non-Tier1 traffic not affected
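Read as a forwarding decision for traffic leaving an OPN disk server, the policy above boils down to: site-only destinations go via Router A, destinations learned over the OPN BGP session go out the CERN/OPN link, and everything else takes the bypass up to the SAR. A much-simplified sketch of that decision; the prefix lists are placeholders, not the real routing table or BGP feed.

```python
import ipaddress

# Much-simplified sketch of the forwarding decision for an OPN disk server.
# Prefixes are placeholders, not the real site ranges or OPN BGP feed.
SITE = ipaddress.ip_network("130.246.0.0/16")
OPN_PREFIXES = [ipaddress.ip_network("128.142.0.0/16")]  # e.g. a T0/T1 range

def next_hop(destination: str) -> str:
    dst = ipaddress.ip_address(destination)
    if dst in SITE:
        return "Router A"        # site-only traffic
    if any(dst in prefix for prefix in OPN_PREFIXES):
        return "OPN/CERN link"   # T0 and T1<->T1 traffic learned via BGP
    return "SAR (bypass)"        # all other off-site traffic

print(next_hop("130.246.10.1"))  # Router A
print(next_hop("128.142.5.5"))   # OPN/CERN link
print(next_hop("8.8.8.8"))       # SAR (bypass)
```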


Page 26:

Logical network

[Diagram: logical network layout linking the Firewall, SAR, UKLR (UKLight router), Router A, the OPN and Tier1 subnets, SJ5 and the Site.]

Page 27:

Tier1 Network 104 - firewalling

• Firewall is a NetScreen
  – 10Gb/s linespeed
• Site FW policy:
  – Closed inbound, except by request on a port/host (local/remote) basis
  – Open outbound
  – Port 80 + kin redirected via web caches...
• Tier1 subnet:
  – As Site, except port 80 + kin do not use web caches
• OPN subnet:
  – Traffic is filtered at the SAR – only Castor ports open (see the sketch after this list)
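The three zone policies can be summarised as data plus a single check, as in the hedged sketch below; the specific Castor port numbers are placeholders, they are not listed on the slide.

```python
# Hedged sketch of the inbound policy per zone; port numbers are placeholders.
CASTOR_PORTS = {5010, 5015, 8443}  # illustrative only

def inbound_allowed(zone: str, port: int, exception_granted: bool = False) -> bool:
    """Very rough model of the inbound filtering described above."""
    if zone == "opn":
        return port in CASTOR_PORTS          # filtered at the SAR
    if zone in ("site", "tier1"):
        return exception_granted             # closed except by request
    raise ValueError(f"unknown zone: {zone}")

print(inbound_allowed("opn", 8443))                          # True
print(inbound_allowed("tier1", 22))                          # False
print(inbound_allowed("site", 443, exception_granted=True))  # True
```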
