Balancing Performance and Preservation: Lessons Learned with HDF5

Mike Folk, The HDF Group. US DPIF Workshop, NIST, Gaithersburg, Maryland, March 29-31, 2010.


Page 1

Balancing Performance and Preservation

Lessons learned with HDF5

Mike Folk

The HDF Group

US DPIF Workshop
NIST, Gaithersburg, Maryland

March 29-31, 2010

Page 2

Data Challenges

3/30/2010 DPIF NIST March 2010 2

Page 3

Answering big questions …

Matter and the universe

Life and nature

Weather and climate

[Figure: Total Column Ozone (Dobson), August 24, 2001 and August 24, 2002]

Page 4

… involves big data …

Page 5

… highly varied data …

Thanks to Mark Miller, LLNL

Page 6

… and complex relationships …

[Diagram labels: Contig Summaries, Discrepancies, SNP Score, Contig Qualities, Coverage Depth, Trace, Aligned bases, Reads, Read quality, Contig, Percent match]

Page 7

… on big computers …

… and small computers …

Page 8

HDF

• At once, HDF serves as
- a container for big data and varied data
- a platform upon which to build data applications
- high performance middleware for capturing, storing, and accessing data

Page 9

HDF = Hierarchical Data Format

• HDF4 is the first HDF format
- Originally called HDF
- First release was 1988
- Still supported by The HDF Group

• HDF5 is the second HDF format
- First release was in 1998

Page 10

HDF5 File

lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6

An HDF5 file is a container that holds data objects.

Page 11

Organizing data with HDF5

HDF5 groups and links organize data objects.

/

Experiment Notes:
Serial Number: 99378920
Date: 3/13/09
Configuration: Standard 3

lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6

Background

Results
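The idea that groups and links organize data objects can be sketched with a toy in-memory model. This is not the HDF5 library or its API: the `Group`, `Dataset`, and `resolve` names are hypothetical, and the placement of the table under a "Results" group is an assumption, since the slide does not say where each object lives.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Dataset:
    """A table-like data object plus descriptive attributes."""
    columns: Tuple[str, ...]
    rows: List[Tuple[float, ...]]
    attrs: Dict[str, str] = field(default_factory=dict)


@dataclass
class Group:
    """A container whose links map names to groups or datasets."""
    links: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, str] = field(default_factory=dict)


def resolve(root: Group, path: str) -> Union[Group, Dataset]:
    """Follow a '/'-separated path of link names, starting at the root group."""
    node: Union[Group, Dataset] = root
    for name in (part for part in path.split("/") if part):
        if not isinstance(node, Group):
            raise KeyError(f"{name!r}: parent is not a group")
        node = node.links[name]
    return node


# Hypothetical layout: object names and nesting are assumed for illustration.
temps = Dataset(
    columns=("lat", "lon", "temp"),
    rows=[(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)],
)
root = Group(
    links={"Results": Group(links={"temps": temps}), "Background": Group()},
    attrs={
        "Serial Number": "99378920",
        "Date": "3/13/09",
        "Configuration": "Standard 3",
    },
)

table = resolve(root, "/Results/temps")
print(table.columns, table.rows[0])
```

The point of the sketch is that a path such as `/Results/temps` is just a chain of named links from the root group, and that notes like the serial number can travel with the data as attributes.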

Page 12

HDF5 Technology Platform

• HDF5 Software
- Manage, analyze, view, query data

• HDF5 Data Model
- Building blocks for data organization and storage

• HDF5 Binary File Format
- Bit-level organization of HDF5 file

Page 13

Uses and users of HDF5

Page 14

Earth Science (Earth Observing System)

Aqua (6/01)

Aura: TES, HIRDLS, MLS, OMI

Terra: CERES, MISR, MODIS, MOPITT

Aqua: CERES, MODIS, AMSR

Page 15

Big simulations

A simulation can have billions of elements

Each element can have dozens of associated values

Page 16

Bioinformatics

Page 17

Images

25-80Å resolution electron tomography

8k x 8k x 1k images soon (256 GB)
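The 256 GB figure can be checked with a short calculation. The slide does not state the sample size, so this sketch assumes that "8k" and "1k" are binary multiples (8192 and 1024) and that each voxel takes 4 bytes; under those assumptions the numbers agree.

```python
# Assumptions (not stated on the slide): "8k" and "1k" are binary
# multiples (8192 and 1024), and each voxel is a 4-byte value.
K = 1024
width, height, depth = 8 * K, 8 * K, 1 * K
bytes_per_voxel = 4  # assumed, e.g. a 32-bit float or integer

voxels = width * height * depth
total_bytes = voxels * bytes_per_voxel
total_gib = total_bytes / 2**30

print(f"{voxels:,} voxels, {total_gib:.0f} GiB")
# prints: 68,719,476,736 voxels, 256 GiB
```

With 1-byte voxels the same volume would be 64 GiB, so the stated size implies roughly 4 bytes per voxel.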

Page 18

Flight testing

Page 19

Vehicle testing

Page 20

Making movies

Spiderman 3

The Polar Express

Page 21

Target audience

• Applications facing big data challenges
• Academia, government, industry
• Hundreds of different apps
• Millions of users world-wide

Page 22

Something is missing

Page 23

What is on these tapes?

Page 24

What about users in the future?

Page 25

HDF is .. (revised)

Page 26

HDF is .. (revised)

• A technology platform for addressing some of today’s greatest data challenges

• A set of features and practices to help preserve access to data for the long term

Page 27

Target audience ... (revised)

• Users today
- Those who face challenges in organizing, accessing and integrating big, complex data.

• Future users, and we don’t know…
- what data will be important to them
- what they will do with the data once they get it
- what knowledge and tools they will have for accessing and interpreting the data

Page 28

“What makes a good archive format?” (1997, Folk)

“Attributes of File Formats for Long-Term Preservation of Scientific and Engineering Data in Digital Libraries” (2002, Folk and Barkstrom)*

And what can we do about it?

*http://www.hdfgroup.org/projects/nara/Sci_Formats_and_Archiving.pdf

Page 29

What Makes a Good Archive Format?

• Ease of Archival Storage
- Compactness
- Size
- Ability to aggregate related objects

• Ease of Archival Access
- Raw I/O efficiency
- Ease of subsetting
- Ability to name file elements

• Usability
- Popularity
- Availability of readers
- Ability to embed data extraction software in the files
- Ease of implementing readers
- Simplicity

Page 30

What Makes a Good Archive Format?

• Support for Data Scholarship
- Provenance traceability
- Rigorous definition
- Self-describing
- Referential extensibility
- URN embedding
- Citability

• Support for Data Integrity
- Source verification
- File corruption detection & correction
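Corruption detection, one of the data integrity criteria above, is commonly done by recording a checksum when data is archived and recomputing it later. The sketch below uses SHA-256 from Python's standard library as a stand-in; the helper names `sha256_hex` and `is_intact` are hypothetical, and the example shows detection only, not correction, and no HDF5 feature.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, recorded_digest: str) -> bool:
    """Compare a freshly computed digest with the one recorded at archive time."""
    return sha256_hex(data) == recorded_digest


original = b"lat,lon,temp\n12,23,3.1\n15,24,4.2\n17,21,3.6\n"
recorded = sha256_hex(original)  # would be stored alongside the archived data

damaged = bytearray(original)
damaged[5] ^= 0x01  # flip one bit to simulate corruption

print(is_intact(original, recorded))        # True
print(is_intact(bytes(damaged), recorded))  # False
```

Correcting the damage, as opposed to detecting it, would need redundancy such as a second copy or an error-correcting code.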

Page 31

What Makes a Good Archive Format?

• Maintainability and Durability
- Long-term institutional support
- Suitability for a variety of storage technologies
- Stability
- Formal (BNF- or XML-like) description of format
- Multi-language implementation of library software
- Open Source software or equivalent

Page 32

HDF strategies for long-term preservation

Technological

Institutional

Page 33

Technology strategies

Page 34

A simple, durable but evolvable model and implementation

[Image: John of England signs Magna Carta]

Page 35

Self-description

Page 36

Specification documentation

Page 37

Preservation-based evolution

[Image: Darwin’s first evolutionary tree, 1837]

Page 38

Preservation-based evolution

A technology development strategy that allows the software and format to evolve, at the same time giving legacy applications a decent chance to meet their users’ needs, and preserving access to all data.
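One way to picture this strategy is a reader that keeps a decoder for every format version it has ever shipped, so that new software can still open old files. The sketch below uses an invented JSON-based toy format with hypothetical names (`READERS`, `read_temps`); it says nothing about how HDF5 itself handles versions.

```python
import json
from typing import Callable, Dict, List


def _read_v1(payload: dict) -> List[float]:
    # Toy version 1 stored temperatures as a bare list of numbers.
    return list(payload["temps"])


def _read_v2(payload: dict) -> List[float]:
    # Toy version 2 wraps each value in a record that also carries a unit.
    return [record["value"] for record in payload["temps"]]


# Decoders for old versions stay registered as the format evolves.
READERS: Dict[int, Callable[[dict], List[float]]] = {1: _read_v1, 2: _read_v2}


def read_temps(text: str) -> List[float]:
    """Pick the decoder that matches the version recorded in the file."""
    doc = json.loads(text)
    version = doc["version"]
    if version not in READERS:
        raise ValueError(f"unsupported format version {version}")
    return READERS[version](doc)


old_file = '{"version": 1, "temps": [3.1, 4.2, 3.6]}'
new_file = '{"version": 2, "temps": [{"value": 3.1, "unit": "C"}]}'
print(read_temps(old_file), read_temps(new_file))
```

The design choice illustrated here is that adding version 2 did not remove the version 1 decoder, so access to older data is preserved while the format changes.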

Page 39

Providing different ways to view the same information

Page 40

Integration with preservation frameworks

Page 41

Institutional strategies

Page 42

Long-term institutional support

Page 43

A mission-driven business

Page 44

Human, financial, legal foundations for sustainability

Page 45

Open source

Page 46

One keeper of the format and software

Page 47

Cross the chasm to new users and applications

Page 48

Promoting standardization

Page 49

Summary

• Technical strategies
- A simple, durable but evolvable model and implementation
- Self-description
- Specification documentation
- Preservation-based evolution
- Providing different ways to view the same information
- Integration with preservation frameworks

• Institutional strategies
- Long-term institutional support
- A mission-driven business
- Human, financial, legal foundations for sustainability
- Open source
- One keeper of the format and software
- Cross the chasm to new users and applications
- Promoting standardization

Page 50

HDF Group Mission

To ensure long-term accessibility of HDF data through sustainable development and support of HDF technologies.

Page 51

Thank you.