Transcript
Page 1:

PetaByte Storage Facility at RHIC

Razvan Popescu - Brookhaven National Laboratory

Page 2:

Who are we?

Relativistic Heavy-Ion Collider @ BNL
– Four experiments: Phenix, Star, Phobos, Brahms.
– 1.5 PB per year.
– ~500 MB/sec.
– >20,000 SpecInt95.

Startup in May 2000 at 50% capacity and ramp up to nominal parameters in 1 year.

Page 3:

Overview

Data Types:
– Raw: very large volume (1.2 PB/yr.), average bandwidth (50 MB/s) (see the rate check below).
– DST: average volume (500 TB), large bandwidth (200 MB/s).
– mDST: low volume (<100 TB), large bandwidth (400 MB/s).
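
As a quick consistency check of these numbers (an illustration only, assuming the raw volume accumulates over roughly one calendar year):

\[
\frac{1.2\,\mathrm{PB/yr}}{3.15\times 10^{7}\,\mathrm{s/yr}} \approx 38\ \mathrm{MB/s},
\]

which is compatible with the quoted ~50 MB/s average once the machine's live time (less than a full calendar year) is taken into account.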

Page 4:

Data Flow (generic)

[Diagram: generic data flow among RHIC, the Archive, the Reconstruction Farm (Linux), the Analysis Farm (Linux), and the File Servers (DST/mDST), with raw, DST, and mDST streams labeled at 35, 50, 10, 10, 200, and 400 MB/s.]

Page 5:

The Data Store

HPSS (ver. 4.1.1, patch level 2)
– Deployed in 1998.
– After overcoming some growth difficulties, we consider the present implementation successful.
– One major/total reconfiguration to adapt to new hardware (and improved system understanding).
– Flexible enough for our needs. One shortcoming: no preemptable priority schema.
– Very high performance.

Page 6:

The HPSS Archive

Constraints - large capacity & high bandwidth:
– Two types of tape technology: SD-3 (best $/GB) & 9840 (best $/MB/s) (see the cost sketch below).
– Two-tape-layer hierarchies; easy management of the migration.

Reliable and fast disk storage:
– FC-attached RAID disk.

Platform compatible with HPSS:
– IBM, SUN, SGI.
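
The mix of the two tape technologies can be read as a schematic cost argument (symbolic prices only, no actual figures are implied): capacity is bought where $/GB is lowest and bandwidth where $/(MB/s) is lowest, so the total tape cost is roughly

\[
\mathrm{Cost} \;\approx\; V \left(\frac{\$}{\mathrm{GB}}\right)_{\text{SD-3}} \;+\; B \left(\frac{\$}{\mathrm{MB/s}}\right)_{9840},
\]

with $V$ the archived volume and $B$ the required aggregate tape bandwidth.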

Page 7:

Present Resources

Tape Storage:
– (1) STK Powderhorn silo (6000 cart.)
– (11) SD-3 (Redwood) drives.
– (10) 9840 (Eagle) drives.

Disk Storage:
– ~8 TB of RAID disk.
  • 1 TB for HPSS cache.
  • 7 TB Unix workspace.

Servers:
– (5) RS/6000 H50/70 for HPSS.
– (6) E450 & E4000 for file serving and data mining.

Page 8:

Phenix Data Flow

[Diagram: Phenix data flow among RHIC, HPSS (RAW), HPSS (DST), a File Server, the Reconstruction Farm, and the Analysis Farm, with Redwood and 9840 tape transports, 150 GB @ 80 MB/s disk caches in front of the HPSS hierarchies, a 3 TB @ 100 MB/s file-server disk, labeled rates from 1 to 65 MB/s, and a calibration stream.]

Page 9:

Page 10:

Page 11:

HPSS Structure

(1) Core Server:
– RS/6000 Model H50
– 4x CPU
– 2 GB RAM
– Fast Ethernet (control)
– OS-mirrored storage for metadata (6 pv.)

Page 12:

HPSS Structure

(3) Movers:
– RS/6000 Model H70
– 4x CPU
– 1 GB RAM
– Fast Ethernet (control)
– Gigabit Ethernet (data) (1500 & 9000 MTU)
– 2x FC-attached RAID - 300 GB - disk cache
– (3-4) SD-3 “Redwood” tape transports
– (3-4) 9840 “Eagle” tape transports

Page 13:

HPSS Structure

Guaranteeing availability of resources for a specific user group requires separate resources: separate PVRs & movers.

One mover per user group means total exposure to single-machine failure.

Guaranteeing availability of resources for the Data Acquisition stream requires separate hierarchies.

Result: 2 PVRs & 2 COS & 1 mover per group (see the sketch below).
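
To make the partitioning concrete, here is a minimal sketch in Python (the resource names are invented for this illustration; this is not actual HPSS configuration syntax):

# Hypothetical illustration of the per-group partitioning described above:
# each experiment gets its own PVRs, classes of service, and mover, so one
# group's load cannot starve another's - at the price of full exposure to a
# single-mover failure.
EXPERIMENTS = ["phenix", "star", "phobos", "brahms"]

def resources_for(group):
    """Nominal HPSS resources dedicated to one user group (2 PVRs, 2 COS, 1 mover)."""
    return {
        "pvrs": [group + "-pvr-sd3", group + "-pvr-9840"],               # 2 PVRs
        "classes_of_service": [group + "-cos-raw", group + "-cos-dst"],  # 2 COS
        "movers": [group + "-mover-1"],                                  # 1 mover: single point of failure
    }

for exp in EXPERIMENTS:
    print(exp, resources_for(exp))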

Page 14:

HPSS Structure

Page 15:

HPSS Topology

[Diagram: HPSS topology. The Core server and movers M1-M3 sit on Net 1 - Data (1000baseSX) and Net 2 - Control (100baseT); the STK silo is attached via 10baseT to N x PVR; pftpd client access with routing between the networks.]

Page 16:

HPSS Performance

80 MB/sec for the disk subsystem.

~1 CPU per 40 MB/sec of TCP/IP Gbit traffic @ 1500 MTU, or per 90 MB/sec @ 9000 MTU (see the sizing sketch below).

>9 MB/sec per SD-3 transport.

~10 MB/sec per 9840 transport.
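
As an illustration of what these figures imply for mover sizing, a back-of-the-envelope sketch using only the per-CPU rates quoted above (the example target rate is taken from the Overview slide):

# Back-of-the-envelope mover CPU estimate from the figures quoted above:
# ~1 CPU per 40 MB/s of Gigabit TCP/IP at 1500 MTU, ~1 CPU per 90 MB/s at 9000 MTU.
import math

PER_CPU_MB_S = {1500: 40.0, 9000: 90.0}

def cpus_needed(target_mb_s, mtu=1500):
    """CPUs required just for TCP/IP processing of target_mb_s at the given MTU."""
    return math.ceil(target_mb_s / PER_CPU_MB_S[mtu])

# Example: sustaining the 200 MB/s DST stream.
print(cpus_needed(200, mtu=1500))   # 5 CPUs at 1500 MTU
print(cpus_needed(200, mtu=9000))   # 3 CPUs with jumbo frames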

Page 17:

I/O Intensive Systems

Mining and analysis systems: high I/O & moderate CPU usage.

To avoid large network traffic, merge the file servers with the HPSS movers:
– Major problem with HPSS support on non-AIX platforms.
– Several (Sun) SMP machines or a large (SGI) modular system.

Page 18:

Problems

Short life cycle of the SD-3 heads:
– ~500 hours, i.e. less than 2 months at average usage (6 of 10 drives in 10 months).
– Built a monitoring tool to try to predict transport failure (based on soft-error frequency); see the sketch below.

Low-throughput interface (F/W) for SD-3: high slot consumption.

SD-3 production discontinued?! 9840 ???
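
A minimal sketch of the kind of monitoring described above (the window, threshold, and bookkeeping are assumptions of this example, not the actual BNL tool): flag a transport when its recent soft-error rate climbs.

# Sketch of a soft-error-frequency monitor for tape transports (thresholds,
# window, and record format are assumed for illustration only).
from collections import defaultdict, deque

WINDOW_MOUNTS = 20            # how many recent mounts to consider (assumed)
ALERT_ERRORS_PER_GB = 0.05    # alert threshold (assumed)

recent = defaultdict(lambda: deque(maxlen=WINDOW_MOUNTS))   # drive -> (soft_errors, GB) per mount

def record_mount(drive, soft_errors, gigabytes):
    """Call once per dismount with the drive's soft-error count and data moved."""
    recent[drive].append((soft_errors, gigabytes))

def error_rate(drive):
    errors = sum(e for e, _ in recent[drive])
    gigabytes = sum(g for _, g in recent[drive]) or 1.0
    return errors / gigabytes

def drives_to_watch():
    """Drives whose recent soft-error rate suggests an imminent head failure."""
    return [d for d in recent if error_rate(d) > ALERT_ERRORS_PER_GB]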

Page 19:

Issues

Tested the two-tape-layer hierarchies:
– Cartridge-based migration.
– Manually scheduled reclaim.

Work with large files: preferably ~1 GB, tolerable >200 MB (a bundling sketch follows below).
– Is this true with 9840 tape transports?

Don't think of NFS. Wait for DFS/GPFS?
– We use exclusively pftp.
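
Since small-file access is to be avoided (see also the Summary), one common workaround is to aggregate small files into ~1 GB bundles before they reach the store. A minimal sketch (hypothetical paths and threshold; not the facility's actual tooling):

# Sketch: aggregate small files into ~1 GB tar bundles before archiving,
# so the store only ever sees large files.
import os
import tarfile

TARGET_BUNDLE_BYTES = 1024**3        # ~1 GB, the preferred size noted above

def bundle(files, out_prefix="bundle"):
    """Group files into tar archives of roughly TARGET_BUNDLE_BYTES each."""
    index, batch, batch_size = 0, [], 0
    for path in files:
        size = os.path.getsize(path)
        if batch and batch_size + size > TARGET_BUNDLE_BYTES:
            _write(batch, "%s-%04d.tar" % (out_prefix, index))
            index, batch, batch_size = index + 1, [], 0
        batch.append(path)
        batch_size += size
    if batch:
        _write(batch, "%s-%04d.tar" % (out_prefix, index))

def _write(paths, tar_name):
    # No compression: keeps the tape transports streaming instead of stalling.
    with tarfile.open(tar_name, "w") as tar:
        for p in paths:
            tar.add(p)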

Page 20:

Issues

Guarantee availability of resources for specific user groups:
– Separate PVRs & movers.
– Total exposure to single-machine failure!

Reliability:
– Distribute resources across movers, i.e. share movers (acceptable?).
– Inter-mover traffic:
  • 1 CPU per 40 MB/sec of TCP/IP per adapter: expensive!!!

Page 21:

Inter-Mover Traffic - Solutions

Affinity.
– Limited applicability.

Diskless hierarchies (not for DFS/GPFS).
– Not for SD-3; not enough tests on 9840.

High-performance networking: SP switch. (This is your friend.)
– IBM only.

Lighter protocol: HIPPI.
– Expensive hardware.

Multiply attached storage (SAN). Most promising! See STK's talk. Requires HPSS modifications.

Page 22:

Summary

HPSS works for us.

Buy an SP2 and the SP switch.
– Simplified administration. Fast interconnect. Ready for GPFS.

Keep an eye on STK's SAN/RAIT.

Avoid SD-3 (not a risk anymore).

Avoid small-file access, at least for the moment.

Page 23:

Thank you!

Razvan Popescu, [email protected]

