Data Management for Grown Ups

Data Management For Grown Ups Terrell Russell, Ph.D. @terrellrussell Senior Data Scientist, iRODS Consortium Renaissance Computing Institute (RENCI), UNC-Chapel Hill

Upload: all-things-open

Post on 21-Jan-2017


TRANSCRIPT

Data Management For Grown Ups

Terrell Russell, Ph.D.

@terrellrussell

Senior Data Scientist, iRODS Consortium

Renaissance Computing Institute (RENCI), UNC-Chapel Hill

iRODS Consortium

was created to ensure the sustainability of iRODS and to further its adoption and continued evolution. To this end, the Consortium works to standardize the definition, development, and release of iRODS-based data middleware technologies, evangelize iRODS among potential users, promote new advances in iRODS, and expand the adoption of iRODS-based data middleware technologies through the development, release, and support of an open-source, mission-critical, production-level distribution of iRODS.

Current Members:

RENCI, DICE, Seagate, DDN, Novartis, IBM, Complete Genomics, Wellcome Trust Sanger Institute, UCL, Cleversafe, EMC, and the NASA Atmospheric Science Data Center

The iRODS Consortium

Data Management

Multiple pieces

Multiple meanings

Multiple goals

Data Management

Access - Authentication, Authorization, Revocation

Description - Standards for discovery, compliance

Integrity - Confidence that nothing has changed

Replication - Multiple copies, multiple locations

Availability - If things are down, nothing else matters

Migration - Hardware changes, format changes

Recovery - Robust plans for when things go wrong

Provenance - Full record of all related activity

Retention - Deleting data on a defined schedule
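
As a concrete illustration of the Integrity item above ("confidence that nothing has changed"), here is a minimal sketch of checksum verification. It assumes a digest was recorded when the file was first stored; the function names `sha256_of` and `is_unchanged` are hypothetical and are not taken from the talk or from iRODS.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with the one recorded at ingest."""
    return sha256_of(path) == recorded_digest
```

In practice the recorded digest would live alongside the file's other metadata, so that a scheduled job can re-check every copy.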

Policy Enforcement - Through the Years

People with Keys + Notes/Reports

Passwords + Folders + Scripts (Maybe)

Credentials + Metadata + Automation

Data Management

Fraught with People

Four Verticals → Four Case Studies

Health Care & Life Science

Oil & Gas

Media & Entertainment

Archives & Records Management

Health Care & Life Science

Genomics Use Case - Data begins as a series of images from a sequencer, which is converted to bases (A, T, C, G), then fragmented, aligned, annotated for variants, filtered, and analyzed.

Extensive Data Pipelines

Saved State

Diverse Data Products

Share Results

Health Care & Life Science

Priorities:

reproducibility

multi-institutional collaboration

Oil & Gas

Ingest Use Case - As existing storage fills up, there are two complementary strategies: 1) migrate data from active storage to a slower, cheaper archive and 2) add more active storage. Traditional HSM has limited flexibility (access date, physical location, etc.) and additional namespaces just add more complexity.
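
The first strategy above (moving data from active storage to a cheaper archive) can be sketched as a simple policy over file records. This is a minimal illustration, assuming last-access time is the only criterion; the `FileRecord` class, the tier names, and the example path are hypothetical. The logical path does not change when the tier does, which is the point of keeping a single namespace.

```python
import time
from dataclasses import dataclass


@dataclass
class FileRecord:
    logical_path: str    # name users see; unchanged when the tier changes
    tier: str            # "active" or "archive" (hypothetical tier names)
    last_access: float   # seconds since the epoch


def select_for_archive(records, max_idle_days, now=None):
    """Return active records not accessed within the last max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    return [r for r in records if r.tier == "active" and r.last_access < cutoff]


def migrate(records, max_idle_days, now=None):
    """Mark idle records as archived and return their logical paths."""
    moved = select_for_archive(records, max_idle_days, now)
    for record in moved:
        # A real system would copy the bytes and verify them before this step.
        record.tier = "archive"
    return [r.logical_path for r in moved]
```

A richer policy could add further criteria (project, file type, owner) to `select_for_archive` without changing the paths users work with.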

Diverse Data Sources

Spread Geographically

Computationally Intense

Oil & Gas

Priorities:

unified namespace

automated analytics

Media & Entertainment

Born Digital Use Case - New valuable creative content (movie assets, original musical tracks) requires large, robust, long-term, flexible, accessible infrastructure.

Popular Content

Unique

Largely Video and Games

Media & Entertainment

Priorities:

access control

backups

integrity

Archives & Records Management

Provenance Use Case - Libraries, museums, and other cultural institutions take a 100+ year view of their digital assets. They must maintain both archival and dissemination copies, along with lots of metadata.

Cultural Heritage

Original and Derivative Copies

Quality Search and Browse

Archives & Records Management

Priorities:

provenance

integrity

migration

metadata

replication

Four Verticals → Four Case Studies

Health Care & Life Science

Oil & Gas

Media & Entertainment

Archives & Records Management

The Four Pillars

Open Source Data Management Middleware

iRODS enables data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the data grid.

iRODS automates data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid.

iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote grid.

iRODS implements data virtualization, allowing access to distributed storage assets under a unified namespace, and freeing organizations from getting locked in to single-vendor storage solutions.
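
To make the "any action initiated by any trigger" idea above more concrete, here is a minimal trigger-to-action sketch in Python. It is not the iRODS rule language or API; the `RuleEngine` class, the `after_put` trigger name, and the in-memory `catalog` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Action = Callable[[dict], None]


class RuleEngine:
    """Maps trigger names to the actions that should run when they fire."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Action]] = defaultdict(list)

    def on(self, trigger: str, action: Action) -> None:
        """Register an action to run whenever the trigger fires."""
        self._rules[trigger].append(action)

    def fire(self, trigger: str, event: dict) -> int:
        """Run every action registered for the trigger; return how many ran."""
        actions = self._rules.get(trigger, [])
        for action in actions:
            action(event)
        return len(actions)


# Hypothetical stand-in for a metadata catalog, keyed by logical path.
catalog: Dict[str, dict] = {}


def record_metadata(event: dict) -> None:
    """Example action: describe a newly stored file in the catalog."""
    catalog[event["path"]] = {"owner": event["owner"], "size": event["size"]}


engine = RuleEngine()
engine.on("after_put", record_metadata)  # "after_put" is a made-up trigger name
engine.fire("after_put", {"path": "/zone/home/alice/run1.bam",
                          "owner": "alice", "size": 1024})
```

Because the policy is expressed as registered actions rather than ad hoc scripts, the same trigger can later drive checksumming, replication, or retention without changing how clients store data.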

Questions?

irods.org

github.com/irods

@irods

Creative Commons Images Used:

https://www.flickr.com/photos/addieplum/116062198/

https://www.flickr.com/photos/ajmexico/3281139507/

https://www.flickr.com/photos/future15/2037742362/