data management systems richard marciano reagan w. moore wayne schroeder arcot rajasekar mike wan...

33
Data Management Data Management Systems Systems Richard Marciano Richard Marciano Reagan W. Moore Reagan W. Moore Wayne Schroeder Wayne Schroeder Arcot Rajasekar Arcot Rajasekar Mike Wan Mike Wan San Diego Supercomputer Center San Diego Supercomputer Center [email protected] [email protected]

Upload: cody-rooney

Post on 27-Mar-2015

222 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Data Management SystemsData Management Systems

Richard MarcianoRichard Marciano

Reagan W. MooreReagan W. Moore

Wayne SchroederWayne Schroeder

Arcot RajasekarArcot Rajasekar

Mike WanMike Wan

San Diego Supercomputer CenterSan Diego Supercomputer Center

[email protected]@sdsc.edu

http://irods.sdsc.edu

http://www.sdsc.edu/srb/

Page 2: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Key Data Management StandardsKey Data Management Standards

• Data sharing - data grids• Storage Resource Broker - SRB

• Enables each community to use their preferred API• Supports federation between data grids

• Integrated Rule Oriented Data System - iRODS• Automates execution of management policies

• Data publication - digital libraries• DSpace and Fedora support metadata standards

• OAI-PMH• METS

• Data preservation - persistent archives• Integrated Rule Oriented Data System - iRODS

• RLG/NARA Trusted Repository Assessment criteria• OAIS• ISADG metadata• DFDL data format characterizatioin

Page 3: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Data StandardsData Standards

• What are your plans of supporting these?• International collaboration on the development

of iRODS.• Integration with other data management

solutions including LOCKSS, IBP/Lstore, DataVerse

• Which is in your opinion the area which lacks standardization most (either because of the absence of standardization or because of insufficient standards)?• Need progress in representation information• Need DFDL standard

Page 4: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Data Management StandardsData Management Standards

• Which (future) standards are seen as important?• Virtual Machine technology• Workflow virtualization

• How close is your implementation to the (published) standard(s)?• Most standards provide access methods, but not data

management• Port access standards as they become heavily used

• Which are the open source extensions to employed standards you had to make and why?• Generic interface for manipulating structured information

• Need the ability to query a structured information resource to read and write internal information

• Generic interface for characterizing data management policies

• Need the ability to virtualize management policies across administrative domains

Page 5: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Grid InfrastructureGrid Infrastructure

• Interoperation Challenges• Virtualization of trust

• Support for multiple uthentication and authorization mechanisms

• Data management virtualization• Characterization of management policies as server-side workflows

• Virtualization of workflows• Ability to migrate workflows between client-side and server-side

• What do you suggest are the best ways to tackle these problem areas?• Bottom-up interoperability development by the principal

software system developers

• Roadmap document• Wiki - http://irods.sdsc.edu

• What is your funding status in a mid- and long-term perspective?• Sustained funding for the next three years (NSF, NARA)

Page 6: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

William Charles WentworthWilliam Charles Wentworth(1790-1872)(1790-1872)

• Noted Australian explorer and statesman

• Ancestry• 8,979 ancestors• 84,628 descents from Charlemagne

• Cousins• Queen Elizabeth 10th cousin, 3 removes• George Washington 11th cousin, 3 removes• Reagan Wentworth Moore 10th cousin, 4 removes

Page 7: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Data Grid EvolutionData Grid Evolution

• Data grids• Infrastructure independence • Data sharing through data and trust virtualization

• SRB - Storage Resource Broker

• Rule-based data grids• Automation of management policies

Management virtualization• Open source software

• iRODS - integrated Rule-Oriented Data System

Page 8: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Data Management ApplicationsData Management Applications

• Data grids• Share data - organize distributed data as a collection

• Digital libraries• Publish data - support browsing and discovery

• Persistent archives• Preserve data - manage technology evolution

• Real-time sensor systems• Federate sensor data - integrate across sensor streams

• Workflow systems• Analyze data - integrate client- & server-side workflows

Page 9: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Generic InfrastructureGeneric Infrastructure

• Data grids organize distributed data into shared collections • Persistent name spaces for files, users, storage• Collection attributes

• Provenance, descriptive, system metadata

• Data grids manage heterogeneous storage systems• Standard operations across file systems, tape archives,

object ring buffers• Enable technology evolution

• At the point in time when new technology is available, both the old and new systems can be integrated

Page 10: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Data GridData Grid

Using a Data Grid – Using a Data Grid – in Abstractin Abstract

Ask for d

ata

•User asks for data from the data grid

Data d

elivere

d

•The data is found and returned•Where & how details are hidden

Page 11: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Using a Data Grid - Using a Data Grid - DetailsDetails

iRODS Server

•Data request goes to iRODS Server

iRODS Server Metadata Catalog

DB

•Server looks up information in catalog

•Catalog tells which iRODS server has data

•1st server asks 2nd for data

•The 2nd iRODS server applies rules

•User asks for data

Page 12: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Extremely SuccessfulExtremely Successful• Storage Resource Broker (SRB) manages 2 PBs of data in

internationally shared collections• Data collections for NSF, NARA, NASA, DOE, DOD, NIH, LC,

NHPRC, IMLS; APAC, UK e-Science, IN2P3, KEK, …• Astronomy Data grid• Bio-informatics Digital library• Earth Sciences Data grid• Ecology Collection• Education Persistent archive• Engineering Digital library• Environmental science Data grid• High energy physics Data grid• Humanities Data Grid• Medical community Digital library• Oceanography Real time sensor data, persistent archive • Seismology Digital library, real-time sensor data

• Goal has been generic infrastructure for distributed data

Page 13: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Date

ProjectGBs of

data stored1000’s of

filesGBs of

data stored1000’s of

filesUsers with

ACLsGBs of

data stored1000’s of

filesUsers with

ACLs

Data Grid NSF / NVO 17,800 5,139 51,380 8,690 80 88,216 14,550 100 NSF / NPACI 1,972 1,083 17,578 4,694 380 38,147 7,715 380 Hayden 6,800 41 7,201 113 178 8,013 161 227 Pzone 438 31 812 47 49 27,914 16,106 68 NSF / LDAS-SALK 239 1 4,562 16 66 202,312 166 67 NSF / SLAC-JCSG 514 77 4,317 563 47 21,644 2,330 55 NSF / TeraGrid 80,354 685 2,962 280,247 7,235 3,267 NIH / BIRN 5,416 3,366 148 21,000 35,301 445 NCAR 36,689 268 2 LCA 3,445 74 2Digital Library NSF / LTER 158 3 233 6 35 260 42 36 NSF / Portal 33 5 1,745 48 384 2,620 53 460 NIH / AfCS 27 4 462 49 21 733 94 21 NSF / SIO Explorer 19 1 1,734 601 27 2,750 1,202 27 NSF / SCEC 15,246 1,737 52 168,931 3,545 73 LLNL 16,931 1,895 5 CHRON 12,634 6,299 5Persistent Archive NARA 7 2 63 81 58 4,989 6,390 58 NSF / NSDL 2,785 20,054 119 7,188 77,479 136 UCSD Libraries 127 202 29 5,158 1,319 29 NHPRC / PAT 2,576 966 28 RoadNet 3,174 1,321 30 UCTV 7,140 2 5 LOC 6,644 192 8 Earth Sci 5,869 647 5TOTAL 28 TB 6 mil 194 TB 40 mil 4,635 975 TB 185 mil 5,539

5/17/02 6/30/04 9/4/07

Page 14: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

BaBar High-Energy PhysicsBaBar High-Energy Physics

• Stanford Linear Accelerator

• IN2P3• Lyon, France• Rome, Italy• San Diego• RAL, UK

• A functioning international Data Grid for high-energy physics

Manchester-SDSC mirror

Moved over 300 TBs of dataMoved over 300 TBs of data

Increasing to 5 TBs per dayIncreasing to 5 TBs per day

Page 15: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Requirements Driving EvolutionRequirements Driving Evolution

• Observe that as the size of the shared collections grow, the administrative tasks can become onerous.• Data grids provide mechanisms to manage recovery

from all errors that occur in the distributed environment

• Need to minimize labor support through automation of administrative functions• File ingestion tasks• Verification of desired collection properties• Integrity checks and replica management

Page 16: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Requirements Driving EvolutionRequirements Driving Evolution

• Observe that each community has unique management policies• User administration • File retention & deletion • Time-dependent access controls• Data distribution and replication• File update (versions, backups)• Descriptive metadata

Page 17: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Requirements Driving EvolutionRequirements Driving Evolution

• Socialization of collections• The creators of the collection have specific

properties that they assert the collection will possess

• Completeness• Authoritative sources• Authenticity

• The users of the collection have their own criteria for the properties they expect

• Socialization is the mapping from creator assertions to user expectations

Page 18: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Data Grid MechanismsData Grid Mechanisms

• Essential components needed for synergism implemented in SRB • Infrastructure independence• Data and trust virtualization

• Components needed for specific management policies and processes implemented in iRODS• Map policies to rules that control all processes• Map processes to standard micro-services

Page 19: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Data ManagementData Management

Data ManagementEnvironment

ConservedProperties

ControlMechanisms

RemoteOperations

ManagementFunctions

AssessmentCriteria

ManagementPolicies

Capabilities

Data grid – Management virtualizationData Management

InfrastructurePersistent

StateRules Micro-services

Data grid – Data and trust virtualizationPhysical

InfrastructureDatabase Rule Engine Storage

System

iRODS - integrated Rule-Oriented Data SystemiRODS - integrated Rule-Oriented Data System

Page 20: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

RulesRules

• Rule classes• System enforced rules• Administrator controlled rules• User defined rules

• Rule execution• Atomic rules - executed on each operation invoked by

a client• Deferred rules - executed at a future time• Periodic rules - executed to validate assessment

criteria and enforce desired properties (integrity)

Page 21: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

iRODS Rule SyntaxiRODS Rule Syntax

• Event | Condition | Action-set | Recovery-set• Event - triggered by operation or queued rule

• Condition - composed of tests on any attributes in

the persistent state information

• Action-set - composed from both micro-services

and rules

• Recovery-set - used to ensure transaction semantics

and consistent state information

• Executed by a rule engine installed at each storage location - server side workflows

Page 22: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Micro-ServicesMicro-Services

• Challenge is that storage systems do not provide desired processes• Have “minimal” set of standard operations that

are performed at the storage system• Have actions required by clients such as

replication, metadata extraction• Create standard micro-services that aggregate

storage operations into modules that can be used to implement desired processes.

Page 23: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Data VirtualizationData Virtualization

Storage SystemStorage System

Storage ProtocolStorage Protocol

Access InterfaceAccess Interface

Standard Micro-servicesStandard Micro-services

Data GridData Grid

Map from the actions

requested by the access

method to a standard set of

micro-services. The

standard micro-services

are mapped to the

operations supported by the storage system

Standard OperationsStandard Operations

Page 24: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

integrated Rule-Oriented Data Systemintegrated Rule-Oriented Data System

Client Interface Admin Interface

Current State

Rule Invoker

MicroService

Modules

Metadata-based Services

Resources

MicroService

Modules

Resource-based Services

ServiceManager

ConsistencyCheck

Module

RuleModifierModule

ConsistencyCheck

Module

Engine

Rule

Confs

ConfigModifierModule

MetadataModifierModule

MetadataPersistent

Repository

ConsistencyCheck

Module

RuleBase

Page 25: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Distributed Management SystemDistributed Management System

RuleRule

EngineEngine

DataData

TransportTransport

MetadataMetadata

CatalogCatalog

ExecutionExecution

ControlControl

MessagingMessaging

SystemSystem

ExecutionExecution

EngineEngine

VirtualizationVirtualization

ServerServer

SideSide

WorkflowWorkflow

PersistentPersistent

StateState

informationinformation

SchedulingScheduling

PolicyPolicy

ManagementManagement

Page 26: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Micro-service ClassesMicro-service Classes

• Test

• System

• Workflow control

• Client

• iCAT catalog

• User level invoked by “irule”

• Image manipulation

Page 27: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Digital PreservationDigital Preservation

• Preservation community is defining the rules need to assert trustworthiness of a digital repository• RLG/NARA - Trustworthy Repositories Audit &

Certification: Criteria and Checklist.

http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/trac.pdf

• Defined 105 rules that are being implemented in iRODS

Page 28: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

RLG/NARA Assessment

• Example TRAC assessment criteria

90 Verify descriptive metadata and source against SIP template and set SIP compliance flag

91 Verify descriptive metadata against semantic term list

92 Verify status of metadata catalog backup (create a snapshot of metadata catalog)

93 Verify consistency of preservation metadata after hardware change or error

Page 29: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Classes of Assessment CriteriaClasses of Assessment Criteria• Collection properties

• List properties of associated name spaces• Verify properties• Compare properties with assertions

• Collection operations• Transform file formats• Migrate data• Generate audit trails

• Structured information• Parse audit trails to generate compliance reports• Apply templates to extract information• Apply templates to format state information

Page 30: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

iRODS DevelopmentiRODS Development

• NSF - SDCI grant “Adaptive Middleware for Community Shared Collections”• iRODS development, SRB maintenance

• NARA - Transcontinental Persistent Archive Prototype• Trusted repository assessment criteria

• NSF - Ocean Research Interactive Observatory Network (ORION)• Real-time sensor data stream management

• NSF - Temporal Dynamics of Learning Center data grid• Management of Institution Research Board approval

Page 31: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

iRODS Development StatusiRODS Development Status

• Current release is version 0.9.2• June 2007

• Production release will be version 1.0• Fall quarter 2007

• International collaborations• SHAMAN - University of Liverpool

• Sustaining Heritage Access through Multivalent ArchiviNg

• UK e-Science data grid• IN2P3 in Lyon, France• DSpace policy management

Page 32: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

Planned DevelopmentPlanned Development

• GSI support• Time-limited sessions via a one-way hash authentication• Python Client library• GUI Browser (AJAX in development)• Driver for HPSS (in development)• Driver for SAM-QFS• Porting to additional versions of Unix/Linux• Porting to Windows• Support for MySQL as the metadata catalog• API support packages based on existing mounted collection driver• MCAT to ICAT migration tools• Extensible Metadata including Databases Access Interface• Zones/Federation • Auditing - mechanisms to record and track iRODS persistent state

changes

Page 33: Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center moore@sdsc.edu

For More InformationFor More Information(iRODS Tutorial on Thursday)(iRODS Tutorial on Thursday)

Reagan W. MooreSan Diego Supercomputer Center

[email protected]

http://www.sdsc.edu/srb/http://irods.sdsc.edu/