San Diego Supercomputer Center & National Partnership for Advanced Computational Infrastructure

Storage Resource Broker

Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE/


Page 2: Data Management Objectives

• Automate all aspects of data management
  – Discovery (without knowing the file name)
  – Access (without knowing its location)
  – Retrieval (using your preferred API)
  – Control (without having a personal account at the remote storage system)
  – Performance (use latency management mechanisms to minimize the impact of wide-area networks)

Page 3: Collections Replicated via SRB onto TeraGrid

• 2MASS
  – 10 TB, 5 million images
• DPOSS
  – 3 TB, 6000 images
• USNO-B
  – In progress
• SDSS
  – In progress
• MACHO
  – In negotiation

Page 4: SRB Implementations

• Data collecting
  – Sensor systems, object ring buffers and portals
• Data organization
  – Collections, manage data context
• Data sharing
  – Data grids, manage heterogeneity
• Data publication
  – Digital libraries, support discovery
• Data preservation
  – Persistent archives, manage technology evolution
• Data analysis
  – Processing pipelines, manage knowledge extraction

Page 5: NSF Infrastructure Projects Using SRB

• Partnership for Advanced Computational Infrastructure - PACI
  – Data grid - Storage Resource Broker
• Distributed Terascale Facility - DTF/ETF
  – Compute, storage, network resources
• Digital Library Initiative, Phase II - DLI2
  – Publication, discovery, access
• Information Technology Research projects - ITR
  – SCEC - Southern California Earthquake Center
  – GEON - GeoSciences Network
  – SEEK - Science Environment for Ecological Knowledge
  – GriPhyN - Grid Physics Network
  – NVO - National Virtual Observatory
• National Science Digital Library - NSDL
  – Support for education curricula modules

Page 6: Federal Infrastructure Projects Using SRB

• NASA
  – Information Power Grid - IPG
  – Advanced Data Grid - ADG
  – Data Management System - Data Assimilation Office
    • Integration of DODS with Storage Resource Broker data grid
  – Earth Observing System (EOS) data pools
  – Committee on Earth Observation Satellites (CEOS) data grid
• Library of Congress
  – National Digital Information Infrastructure and Preservation Program - NDIIPP
• National Archives and Records Administration and National Historical Publications and Records Commission
  – Prototype persistent archives
• NIH
  – Biomedical Informatics Research Network data grid
• DOE
  – Particle Physics Data Grid - BaBar, CMS

Page 7: SDSC Collaborations

• Hayden Planetarium Simulation & Visualization
• Knowledge Network for BioComplexity (NSF)
• Mol Science – JCSG, AfCS
• Visual Embryo Project (NLM)
• RoadNet (NSF)
• Earth System Sciences – CEED, Bionome, SIO Explorer
• Hyper LTER
• Grid Portal (NPACI)
• Tera Scale Computing (NSF)
• Long Term Archiving Project (NARA)
• Education – Transana (NPACI)
• NSDL – National Science Digital Library (NSF)
• Digital Libraries – ADL, Stanford, UMichigan, UBerkeley, CDL
• … 31 additional collaborations

Page 8: Approach

• Use collections to organize digital entities
  – Digital entity - file, URL, SQL, directory, table, …
• Create logical name space
  – Location-independent naming convention
  – Map state information created by data access services to the logical name space
  – Manage consistency constraints on the metadata update
• Build an interoperability mechanism
  – Map from storage repository protocols to preferred APIs

Page 9: Basic Concepts

• Logical name space
  – Map administrative, descriptive, authenticity, consistency metadata onto the logical name
• Storage repository abstraction
  – Standard operations performed at remote storage
• Information repository abstraction
  – Standard operations to manage collection in a database
• Access abstraction
  – Standard operations supported for metadata and data access
• Authentication abstraction
  – Collection-owned data, ACLs for data and metadata
• Latency management mechanisms
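
The storage repository abstraction can be illustrated with a small Python sketch. This is not the SRB driver interface: the class names, the three operations, and the in-memory back end are all hypothetical, chosen only to show how a fixed set of standard operations lets client code stay the same across archives, file systems, and databases.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical interface: the standard operations every repository offers."""

    @abstractmethod
    def write(self, physical_path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, physical_path: str) -> bytes: ...

    @abstractmethod
    def delete(self, physical_path: str) -> None: ...


class InMemoryDriver(StorageDriver):
    """Stand-in for a real archive, file system, or database driver."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: dict[str, bytes] = {}

    def write(self, physical_path: str, data: bytes) -> None:
        self._objects[physical_path] = data

    def read(self, physical_path: str) -> bytes:
        return self._objects[physical_path]

    def delete(self, physical_path: str) -> None:
        del self._objects[physical_path]


def copy_object(src: StorageDriver, src_path: str,
                dst: StorageDriver, dst_path: str) -> None:
    """Written once against the abstraction, not once per repository type."""
    dst.write(dst_path, src.read(src_path))
```

A real driver would translate each call into the native protocol of the storage system it fronts; the caller of `copy_object` would not change.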

Page 10: SDSC Storage Resource Broker & Meta-data Catalog

[Figure: layered architecture diagram. Labels recovered from the figure:]

• Application
• Access APIs: C, C++ libraries; Unix shell; Linux I/O; DLL / Python; Java, NT browsers; OAI, WSDL; GridFTP
• Consistency Management / Authorization-Authentication; Prime Server
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction: Databases (DB2, Oracle, Postgres, SQLServer, Informix)
• Storage Abstraction: Archives (HPSS, ADSM, UniTree, DMF); Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OS X); HRM; ORB
• Servers

Page 11: Production Data Grid

• SDSC Storage Resource Broker
  – Federated client-server system, managing
    • Over 70 TB of data at SDSC
    • Over 10 million files
  – Manages data collections stored in
    • Archives (HPSS, UniTree, ADSM, DMF)
    • Hierarchical Resource Managers
    • Tapes, tape robots
    • File systems (Unix, Linux, Mac OS X, Windows)
    • FTP sites
    • Databases (Oracle, DB2, Postgres, SQLServer, Sybase, Informix)
    • Virtual Object Ring Buffers

Page 12: Federated SRB server model

[Figure: a read application sends a logical name or an attribute condition to an SRB server. SRB servers, SRB agents, and the MCAT catalog take part in numbered steps 1 to 6; replicas R1 and R2 are reached in steps 5/6.]

MCAT:
1. Logical-to-physical mapping
2. Identification of replicas
3. Access & audit control

Other labels in the figure: peer-to-peer brokering; server(s) spawning; data access; parallel data access.
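
The read path in the federated server model can be sketched in Python as follows. This is a toy model under stated assumptions, not SRB code: `Catalog` stands in for MCAT, each server is modeled as a plain dictionary, and the names `Replica`, `CatalogEntry`, `resolve`, and `read` are hypothetical. It shows only the three catalog duties named on the slide, followed by a fallback from one replica to the next.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    server: str          # hypothetical name of the server fronting the storage
    physical_path: str


@dataclass
class CatalogEntry:
    replicas: list[Replica]
    readers: set[str] = field(default_factory=set)


class Catalog:
    """Toy stand-in for the MCAT metadata catalog."""

    def __init__(self) -> None:
        self.entries: dict[str, CatalogEntry] = {}
        self.audit_log: list[tuple[str, str]] = []

    def resolve(self, user: str, logical_name: str) -> list[Replica]:
        entry = self.entries[logical_name]            # 1. logical-to-physical mapping
        replicas = list(entry.replicas)               # 2. identification of replicas
        if user not in entry.readers:                 # 3. access control ...
            raise PermissionError(f"{user} may not read {logical_name}")
        self.audit_log.append((user, logical_name))   #    ... and audit
        return replicas


def read(catalog: Catalog, servers: dict[str, dict[str, bytes]],
         user: str, logical_name: str) -> bytes:
    """Return the data from the first replica whose server can supply it."""
    for replica in catalog.resolve(user, logical_name):
        store = servers.get(replica.server)
        if store is not None and replica.physical_path in store:
            return store[replica.physical_path]
    raise FileNotFoundError(logical_name)
```

The caller supplies only a logical name; which server and which physical path end up serving the request is decided from catalog state.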

Page 13: Logical Name Space Example - Hayden Planetarium

• Generate “fly-through” of the evolution of the solar system

• Access data distributed across multiple administration domains

• Gigabyte files, total data size was 7 TBytes

• Very tight production schedule - 3 months

Page 15: Hayden Data Flow

[Figure: data flow diagram. Labels recovered from the figure:]

• Locations and other labels: NCSA, SDSC, AMNH (NYC), UVa, NY, CalTech, BIRN
• Systems and storage: IBM SP2, SGI, GPFS 7.5 TB, HPSS 7.5 TB, 2.5 TB UniTree
• Flows: production parameters, movies, images; data simulation; visualization

Page 16: Logical Name Space

• Global, location-independent identifiers for digital entities
  – Organized as collection hierarchy
  – Attributes mapped to logical name space
• Attributes managed in a database
• Types of system metadata
  – Physical location of file
  – Owner, size, creation time, update time
  – Access controls
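
A minimal Python sketch of a logical name space, assuming path-like logical names and an in-memory table in place of the database. The class and field names are hypothetical; the fields of `SystemMetadata` follow the system metadata listed above. The point it illustrates is that a file can change physical location while its logical name, and its place in the collection hierarchy, stay fixed.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class SystemMetadata:
    physical_location: str
    owner: str
    size: int
    created: str
    updated: str


class LogicalNameSpace:
    """Logical names form a collection hierarchy; attributes hang off each name."""

    def __init__(self) -> None:
        self._objects: dict[PurePosixPath, SystemMetadata] = {}

    def register(self, logical_name: str, meta: SystemMetadata) -> None:
        self._objects[PurePosixPath(logical_name)] = meta

    def list_collection(self, collection: str) -> list[str]:
        """Return the logical names directly inside one collection."""
        parent = PurePosixPath(collection)
        return sorted(str(p) for p in self._objects if p.parent == parent)

    def relocate(self, logical_name: str, new_location: str, when: str) -> None:
        """Move the data; the logical name is untouched."""
        meta = self._objects[PurePosixPath(logical_name)]
        meta.physical_location = new_location
        meta.updated = when

    def locate(self, logical_name: str) -> str:
        return self._objects[PurePosixPath(logical_name)].physical_location
```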

Page 17: Mappings on Name Space

• Define logical resource name
  – List of physical resources
• Replication
  – Write to a logical resource completes when all physical resources have a copy
• Load balancing
  – Write to a logical resource completes when a copy exists on the next physical resource in the list
• Fault tolerance
  – Write to a logical resource completes when copies exist on “k” of “n” physical resources
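
The three completion rules can be compared side by side in a short Python sketch. It is an illustration under assumptions rather than SRB behavior: each physical resource is modeled as a dictionary, "next in the list" is read as round-robin, the fault-tolerant write stops as soon as k copies exist, and the `available` argument is a hypothetical way to simulate unreachable resources.

```python
from itertools import count


class LogicalResource:
    """A logical resource name standing for an ordered list of physical resources."""

    def __init__(self, physical_resources: list[dict]) -> None:
        self.physical = physical_resources   # one dict per storage system
        self._cursor = count()               # round-robin position (assumption)

    def write_replicated(self, name: str, data: bytes) -> int:
        """Replication: complete only when every physical resource has a copy."""
        for store in self.physical:
            store[name] = data
        return len(self.physical)

    def write_balanced(self, name: str, data: bytes) -> int:
        """Load balancing: complete when the next resource in the list has a copy."""
        index = next(self._cursor) % len(self.physical)
        self.physical[index][name] = data
        return index

    def write_fault_tolerant(self, name: str, data: bytes,
                             k: int, available: set[int]) -> int:
        """Fault tolerance: complete once copies exist on k of the n resources."""
        copies = 0
        for index, store in enumerate(self.physical):
            if index not in available:       # simulated outage, skip this resource
                continue
            store[name] = data
            copies += 1
            if copies == k:
                return copies
        raise OSError(f"only {copies} of the required {k} copies could be written")
```

With three resources, `write_replicated` touches all three, successive `write_balanced` calls land on resources 0, 1, 2, 0, and `write_fault_tolerant(..., k=2, available={1, 2})` succeeds even though resource 0 is unreachable.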

Page 18: Latency Management Example - Digital Sky Project

• 2MASS (2 Micron All Sky Survey):
  – Bruce Berriman, IPAC, Caltech; John Good, IPAC, Caltech; Wen-Piao Lee, IPAC, Caltech
• NVO (National Virtual Observatory):
  – Tom Prince, Caltech; Roy Williams, CACR, Caltech; John Good, IPAC, Caltech
• SDSC – SRB:
  – Arcot Rajasekar, Mike Wan, George Kremenek, Reagan Moore

Page 19: Digital Sky - 2MASS

• http://www.ipac.caltech.edu/2mass
• The input data was originally written to DLT tapes in the order seen by the telescope
  – 10 TBytes of data, 5 million files
• Ingestion took nearly 1.5 years - almost daily reading of tapes, one at a time
• Images aggregated into 147,000 containers by SRB
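
The figures on this slide imply some useful averages, worked out below in Python. The arithmetic assumes decimal units (1 TByte = 10^12 bytes) and treats "nearly 1.5 years" as a full 1.5 years, so the per-day rates are rough lower bounds; the variable names are only for illustration.

```python
TOTAL_BYTES = 10 * 10**12     # 10 TBytes, read as decimal terabytes (assumption)
TOTAL_FILES = 5_000_000       # 5 million files
CONTAINERS = 147_000          # containers created by SRB
INGEST_DAYS = 1.5 * 365       # "nearly 1.5 years", taken as an upper bound

avg_file_mb = TOTAL_BYTES / TOTAL_FILES / 10**6
files_per_container = TOTAL_FILES / CONTAINERS
avg_container_mb = TOTAL_BYTES / CONTAINERS / 10**6
files_per_day = TOTAL_FILES / INGEST_DAYS
gb_per_day = TOTAL_BYTES / INGEST_DAYS / 10**9

print(f"average file size:      {avg_file_mb:.1f} MB")
print(f"files per container:    {files_per_container:.0f}")
print(f"average container size: {avg_container_mb:.0f} MB")
print(f"ingest rate:            {files_per_day:.0f} files/day, {gb_per_day:.1f} GB/day")
```

Under these assumptions the average image is about 2 MB, and a container holds about 34 images and 68 MB, so the archive sees roughly 34 times fewer names than it would with one file per image.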

Page 20: Digital Sky Data Ingestion

[Figure: ingestion data flow between IPAC Caltech and SDSC. Labels recovered from the figure: input tapes from telescopes; star catalog; Informix; SUN; SRB; SUN E10K; data cache; HPSS; 800 GB; 10 TB.]

Page 22: SRB Latency Management

[Figure: latency management mechanisms placed along the path Source, Network, Destination. Labels recovered from the figure:]

• Replication; server-initiated I/O
• Streaming; parallel I/O
• Caching; client-initiated I/O
• Remote proxies, staging
• Data aggregation; containers
• Prefetch

Page 23: Containers

• Images sorted by spatial location
  – Retrieving one container accesses related images
• Minimizes impact on archive name space
  – HPSS stores 680 TBytes in 17 million files
• Minimizes distribution of images across tapes
• Bulk unload by transport of containers
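
One way to picture spatially sorted containers is the Python sketch below. The tiling scheme is an assumption made for illustration: the slides do not say how images were assigned to containers, and `container_key`, `pack`, and the one-degree tile size are hypothetical. Images whose sky coordinates fall in the same tile share a container, so fetching that container brings neighboring images along with it.

```python
import math
from collections import defaultdict


def container_key(ra_deg: float, dec_deg: float, tile_deg: float = 1.0) -> tuple[int, int]:
    """Map a sky position to the tile (and therefore container) that holds it."""
    return (math.floor(ra_deg / tile_deg), math.floor(dec_deg / tile_deg))


def pack(images: list[tuple[str, float, float]],
         tile_deg: float = 1.0) -> dict[tuple[int, int], list[str]]:
    """Group (name, ra, dec) records into containers keyed by sky tile."""
    containers: dict[tuple[int, int], list[str]] = defaultdict(list)
    # Sorting by tile keeps images that are neighbors on the sky adjacent.
    ordered = sorted(images, key=lambda im: container_key(im[1], im[2], tile_deg))
    for name, ra, dec in ordered:
        containers[container_key(ra, dec, tile_deg)].append(name)
    return dict(containers)


if __name__ == "__main__":
    demo = [("a", 10.2, -5.3), ("c", 200.1, 30.0), ("b", 10.7, -5.9)]
    for key, names in pack(demo).items():
        print(key, names)
```

Because the archive stores one file per container rather than one per image, its name space and tape layout see far fewer, larger objects.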

Page 24: SRB Development

• Peer-to-peer federation
  – Support multiple independent MCAT catalogs
  – Replicate metadata
• mySQL/BerkeleyDB port
• OGSA/OGSI compliant interface
• GridFTP interfaces
  – Waiting for next release of the software (4th Q)

Page 25: MySRB Features

• Data & File Management

• Collection Creation and Management

• Collection of Varied Objects
  – Files, SQL Objects, Databases, URLs, directories, archives, …

• Metadata Handling

• Browsing & Querying Interface

• Access Control

• Version Control (soon)

• Support proxy (remote) operations

Page 26: MySRB

• Web-based Access to the SRB
• Secure HTTP
• Uses Cookies for Session Control
• Self Registration of Users Supported
  – Currently limited to SDSC users
• Self Registration of Resources (soon)
• Access to Both Data and Metadata

Page 27: Data Management

• Browse in Hierarchical Collections
• Registration of (remote) Legacy Files & Directories
• Registration of SQL Objects
• Registration of URLs
• Data Movement Operations
  – Ingest & Re-Ingest, Delete, Unlink
  – Replicate, Copy, Move, S-Link
• Access Control Operations
  – Read, Write, Own, Curate, Annotate, …
  – Ticket-based Access
• Version Control Operations (soon)
  – Read Lock, Write Lock, Unlock
  – Check In, Check Out

Page 28: Types of Metadata

• System-level Metadata
  – Size, resource, owner, date, access control, …
• User-defined Metadata
  – For data & collections
  – <name,value,unit> triples
  – No limit on the number of metadata items
  – Support for Collection-level schemas
    • Comments, default values, drop-down lists
  – Support for Standardized Schemas
    • (e.g., Dublin Core)
• Annotations
  – Supports textual annotations
  – Annotator, date, context also registered
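
User-defined metadata as <name,value,unit> triples, combined with a collection-level schema that supplies default values, might look like the following Python sketch. The `Triple` class, the `apply_schema` helper, and the rule that attributes outside the schema are simply appended are assumptions for illustration, not the MySRB data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    name: str
    value: str
    unit: str = ""


def apply_schema(schema: dict[str, Triple], supplied: list[Triple]) -> list[Triple]:
    """Merge user-supplied triples with a collection-level schema's defaults."""
    by_name = {t.name: t for t in supplied}
    # Schema attributes come first; a missing one falls back to its default value.
    merged = [by_name.get(name, default) for name, default in schema.items()]
    # Attributes outside the schema are kept too: there is no cap on their number.
    merged += [by_name[name] for name in sorted(set(by_name) - set(schema))]
    return merged
```

For example, a schema with defaults for `title` and `exposure` applied to a file that supplies only `exposure` and an extra `band` attribute yields three triples: the default title, the supplied exposure, and the band.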

Page 29: Metadata Management

• Insert, Update and Delete of Metadata

• Access Control for Metadata (soon in mySRB)

• Querying across system-level, user-defined metadata and annotations
  – Query under collections & across collections

• Browsing on user-defined metadata

• Metadata supported for legacy files & directories

• Extract Metadata (using proxy operations)
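
The difference between querying under a collection and querying across collections can be shown with a small Python sketch. The catalog here is a plain dictionary from logical name to attributes, and the `query` function with its `under` parameter is hypothetical; a real catalog would run the equivalent search in its database.

```python
from typing import Optional


def query(catalog: dict[str, dict[str, str]],
          conditions: dict[str, str],
          under: Optional[str] = None) -> list[str]:
    """Return logical names whose attributes satisfy every condition.

    With `under` set, only names inside that collection are considered;
    with `under` left as None, the search runs across all collections.
    """
    prefix = None if under is None else under.rstrip("/") + "/"
    matches = []
    for logical_name, attrs in catalog.items():
        if prefix is not None and not logical_name.startswith(prefix):
            continue
        if all(attrs.get(key) == value for key, value in conditions.items()):
            matches.append(logical_name)
    return sorted(matches)
```

The same attribute condition therefore gives a narrower or wider result depending only on the collection scope, and the caller never needs to know a file name in advance.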