approaches to preservation storage technologies

Prepared for MIT Libraries Informatics Program Brown Bag Talk, June 2013. Approaches to Preservation Storage Technologies. Dr. Micah Altman <[email protected]>, Director of Research, MIT Libraries

Post on 15-Jan-2015

DESCRIPTION

The structure and design of digital storage systems is a cornerstone of digital preservation. To better understand ongoing storage practices of organizations committed to digital preservation, the National Digital Stewardship Alliance conducted a survey of member organizations. This talk discusses findings from this survey, common gaps, and trends in this area. (I also have a little fun highlighting the hidden assumptions underlying Amazon Glacier's reliability claims. For more on that see this earlier post: http://drmaltman.wordpress.com/2012/11/15/amazons-creeping-glacier-and-digital-preservation )

TRANSCRIPT

Page 1: Approaches to Preservation Storage Technologies

Prepared for

MIT Libraries Informatics Program Brown Bag Talk

June 2013

Approaches to Preservation Storage Technologies

Dr. Micah Altman <[email protected]>

Director of Research, MIT Libraries

Page 2: Approaches to Preservation Storage Technologies

DISCLAIMER

These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.

Secondary disclaimer:

“It’s tough to make predictions, especially about the future!”

-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.

Page 3: Approaches to Preservation Storage Technologies

Collaborators & Co-Conspirators

• Jefferson Bailey, Karen Cariani, Jonathan Crabtree, Michelle Gallinger, Jane Mandelbaum, Nancy McGovern, Trevor Owens

• NDSA Coordination Committee & Working Group Chairs

• Research Support: Thanks to the Library of Congress and the Massachusetts Institute of Technology.

Page 4: Approaches to Preservation Storage Technologies

Related Work

• Altman, et al., 2013. “NDSA Storage Report: Reflections on National Digital Stewardship Alliance Member Approaches to Preservation Storage Technologies”, D-Lib Magazine 19 (5/6)

• National Digital Stewardship Alliance, 2013 (Forthcoming), 2014 National Agenda for Digital Stewardship.

• Micah Altman, Jonathan Crabtree (2011) Using the SafeArchive System: TRAC-Based Auditing of LOCKSS, 165-170. In Archiving 2011.

Most reprints available from: informatics.mit.edu

Page 5: Approaches to Preservation Storage Technologies

Simple question?

• If you have 1000 files (bitstreams) and you’d like a 99.99% chance of accessing them in 20 years, how do you store them?

Page 6: Approaches to Preservation Storage Technologies

Simplistic Answer: Put it in AWS

• Amazon Glacier claims a design reliability of 99.999999999%

• Neat-o !!!!!!!!!!
  – Longer odds than winning Powerball OR
  – Getting struck by lightning, three times OR
  – (Possibly) eventually finding alien civilization

• But …

Page 7: Approaches to Preservation Storage Technologies

Clarifying Requirements

• What are the units of reliability? Collection? Object? Bit?
• What is the natural unit of risk?
• Is value of information uniform across units?
• How many of these do you have?
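A rough back-of-the-envelope sketch of why the unit matters. It assumes (this is not stated in the claim quoted above) that the 99.999999999% figure is an annual, per-unit survival probability and that losses are independent across units and years. The function name and the example collection sizes are illustrative only.

```python
import math

def prob_any_loss(annual_loss_per_unit: float, units: int, years: int) -> float:
    """P(at least one unit is lost), ASSUMING independent per-unit, per-year losses."""
    # log1p/expm1 keep precision when the per-unit loss probability is tiny.
    return -math.expm1(units * years * math.log1p(-annual_loss_per_unit))

# 1 - 99.999999999%, read here as "per unit per year" (an assumption).
ANNUAL_LOSS = 1e-11

examples = [
    ("1,000 files, unit = object", 1_000),
    ("1 billion objects, unit = object", 10**9),
    ("1 TB, unit = bit", 8 * 10**12),
]
for label, units in examples:
    p = prob_any_loss(ANNUAL_LOSS, units, years=20)
    print(f"{label}: P(any loss in 20 years) = {p:.3g}")
```

Under these assumptions the 1000-file case stays well inside a 99.99% target, while the same figure read per bit, or applied to a much larger collection, does not. The answer depends on what the claim is a claim about.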

Page 8: Approaches to Preservation Storage Technologies

Hidden Assumptions

• Reliability estimates appear entirely theoretical
  – (MTBF + Independence) * enough replicas -> as many 9’s as you like…
  – No details for estimate provided
  – No historical reliability statistics provided
  – No service reliability auditing provided
• Empirical Issues
  – Storage manufacturers’ hardware MTBF (mean time between failures) does not match observed error rates in real environments…
  – Failures across hardware replicas are observed to be correlated
• Unmodeled failure modes
  – Software failure (e.g. a bug in the AWS software for its control backplane might result in permanent loss that would go undetected for a substantial time)
  – Legal threats (leading to account lock-out, deletion, or content removal)
  – Institutional threats (such as a change in Amazon’s business model)
  – Process threats (someone hits the delete button by mistake; forgets to pay the bill; or AWS rejects the payment)
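A minimal sketch of the "independence + enough replicas" arithmetic, and of how a single common-mode failure caps it. The per-replica failure probability and the common-mode probability are made-up illustrative values, not measurements, and the function names are hypothetical.

```python
import math

def p_loss_independent(p_replica_fail: float, replicas: int) -> float:
    """P(every replica fails in one period), ASSUMING independent failures."""
    return p_replica_fail ** replicas

def p_loss_with_common_mode(p_replica_fail: float, replicas: int,
                            p_common: float) -> float:
    """Same, plus a hypothetical common-mode event (software bug, account
    lock-out, operator error) that takes out all replicas at once."""
    return p_common + (1.0 - p_common) * p_replica_fail ** replicas

def nines(p_loss: float) -> float:
    """Number of 'nines' of reliability implied by a loss probability."""
    return -math.log10(p_loss)

P_FAIL = 0.01     # illustrative per-replica failure probability (assumption)
P_COMMON = 1e-6   # illustrative common-mode probability (assumption)

print("replicas  nines(independent)  nines(with common mode)")
for n in range(1, 7):
    independent = nines(p_loss_independent(P_FAIL, n))
    correlated = nines(p_loss_with_common_mode(P_FAIL, n, P_COMMON))
    print(f"{n:8d}  {independent:18.2f}  {correlated:23.2f}")
```

With independence, each extra replica adds two nines in this example. With the common-mode term, the count stalls near six nines however many replicas are added, which is why the unmodeled failure modes dominate the estimate.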

Page 9: Approaches to Preservation Storage Technologies

Business Risks?

• Amazon SLAs do not incorporate or reflect “design” reliability claims:
  – No claim to reliability in SLAs
  – Sole recovery for breach limited to refund of fees for periods the service was unavailable
  – No right to audit logs, or other evidence of reliability

Page 10: Approaches to Preservation Storage Technologies

What practices are leading stewardship organizations using?

Page 11: Approaches to Preservation Storage Technologies

Results from the NDSA Bi-Annual Preservation Storage Survey

• 74 institutions surveyed. 58 met selection criteria.
  – Follow up on non-responders: 100% response rate.
  – Low rolloff on individual questions
  – Next round will be > 2x bigger
• Survey Methods
  – Closed-ended, with open-ended extensions
  – Selected qualitative follow-up
• Survey Data
  – Instrument and data available as open data

Page 12: Approaches to Preservation Storage Technologies

About the NDSA

Page 13: Approaches to Preservation Storage Technologies

Key Findings: What are Current Institutional Practices?

• 90% of respondents are distributing copies of at least part of their content geographically
• 88% of respondents are responsible for their content for an indefinite period of time
• 80% of respondents use some form of fixity checking for their content
• 75% of respondents report a strong preference to host and control their own technical infrastructure for preservation storage
• 69% of respondents are considering or currently participating in a distributed storage cooperative or system (e.g. LOCKSS alliance, MetaArchive, Data-PASS)
• 64% of respondents are planning to make significant changes in the technologies in their preservation storage architecture in the next three years
• 51% of respondents are considering or already using a cloud storage provider to keep one or more copies of their content
• 48% of respondents are considering, or currently contracting out, storage services to be managed by another organization or company

Page 14: Approaches to Preservation Storage Technologies

How Many Copies

Page 15: Approaches to Preservation Storage Technologies

How Many Copies? …by Role

Page 16: Approaches to Preservation Storage Technologies

Preservation Storage -- New Approaches

Page 17: Approaches to Preservation Storage Technologies

What do organizations want from their preservation systems?

Page 18: Approaches to Preservation Storage Technologies

What are most memory organizations not doing yet?

• Formal cost and valuation models
• Auditing & evaluation
• Certification
• Comprehensive content review

Page 19: Approaches to Preservation Storage Technologies

Plans for future

Page 20: Approaches to Preservation Storage Technologies

Emerging State of the Practice

Page 21: Approaches to Preservation Storage Technologies

Methods for Mitigating Bit-Level Risk

[Diagram. Recoverable labels, in order of appearance:]

• Physical: Media, Hardware, Environment
• Number of copies
• Diversification of copies
• Formats
• File Transforms: compression, encoding, encryption
• Fixity
• Repair
• Local Storage
• File Systems: transforms, deduplication, redundancy
• Replication
• Verification
• Audit

Page 22: Approaches to Preservation Storage Technologies

Emerging State of Practice

• Organizational: Multi-Institutional Stewardship
  – Institutions hold digital assets they wish to preserve, many unique
  – Many of these assets are not replicated at all
  – Even when institutions keep multiple backups offsite, many single points of failure remain, because replicas are managed by a single institution
  – Approaches: LOCKSS, Digital Preservation Network, MetaArchive, Data-PASS, Datanet Federation Consortium, Data-ONE
• Technical: Fixity, verification and auditing
• Legal: Succession planning, Confidentiality, …
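A minimal sketch of what a fixity audit can look like: record a digest for each file once, then periodically recompute and compare. SHA-256, the manifest layout, and the names `sha256_of` and `audit` are illustrative choices here, not a description of any surveyed organization's system.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit(manifest: dict, root: Path) -> dict:
    """Compare each file under root with its recorded digest.

    manifest maps relative path -> expected hex digest.
    Returns relative path -> 'ok', 'corrupt', or 'missing'.
    """
    report = {}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            report[rel_path] = "missing"
        elif sha256_of(target) == expected:
            report[rel_path] = "ok"
        else:
            report[rel_path] = "corrupt"
    return report

# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"hello")
    (root / "b.txt").write_bytes(b"world")
    manifest = {name: sha256_of(root / name) for name in ("a.txt", "b.txt")}
    manifest["c.txt"] = "0" * 64                 # recorded but never stored
    (root / "b.txt").write_bytes(b"w0rld")       # simulate silent corruption
    report = audit(manifest, root)

print(report)  # {'a.txt': 'ok', 'b.txt': 'corrupt', 'c.txt': 'missing'}
```

Detection alone does not restore anything. A file flagged as corrupt or missing still has to be repaired from another copy, and the manifest itself needs protecting, which is where replication and independent auditing come in.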

Page 23: Approaches to Preservation Storage Technologies

Future research: What do we need to know?

Page 24: Approaches to Preservation Storage Technologies

Modeling Bit Corruption

[Diagram. Recoverable labels, in order of appearance:]

• Media characteristics
• Threat characteristics
• Correlations
• Logical Scope of Corruption
• File Characteristics
• File system Characteristics
• Probability of Successful Repair
• Auditing Frequency
• Number of copies
• Repair frequency
• Corruption
• Detection
• Repair
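A toy Monte Carlo sketch of the corruption, detection, and repair cycle for a single object. It uses only three of the factors listed above (number of copies, auditing frequency, probability of successful repair) and assumes independent corruption across copies. Correlations, media and threat characteristics, and the logical scope of corruption are deliberately left out. All parameter values are hypothetical.

```python
import random

def simulate_loss_rate(copies, p_corrupt_per_year, audit_interval_years,
                       p_repair_success, years, trials, seed=0):
    """Estimate P(object lost within `years`) by simulation.

    An object is lost once every copy is corrupt, since nothing is left
    to repair from. Corruption is ASSUMED independent across copies.
    """
    rng = random.Random(seed)  # fixed seed so runs are replicable
    losses = 0
    for _ in range(trials):
        intact = [True] * copies
        lost = False
        for year in range(1, years + 1):
            # Corruption: each intact copy may be damaged this year.
            intact = [ok and rng.random() >= p_corrupt_per_year for ok in intact]
            if not any(intact):
                lost = True
                break
            # Detection and repair happen only in audit years.
            if year % audit_interval_years == 0:
                intact = [ok or rng.random() < p_repair_success for ok in intact]
        losses += lost
    return losses / trials

for interval in (1, 5, 100):  # 100 > 20 years, i.e. never audited
    rate = simulate_loss_rate(copies=3, p_corrupt_per_year=0.1,
                              audit_interval_years=interval,
                              p_repair_success=1.0, years=20, trials=5000)
    print(f"audit every {interval:3d} years: estimated loss rate {rate:.3f}")
```

Even in this crude model, three copies that are never audited do far worse than three copies audited annually. The number of copies cannot be assessed separately from detection and repair frequency.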

Page 25: Approaches to Preservation Storage Technologies

The Risk Problem Restated

• Keeping risk of object loss fixed -- what choices minimize $?
• “Dual problem”: Keeping $ fixed, what choices minimize risk?
• Extension: For specific cost functions for loss of object, Loss(object_i), summed over all lost objects, what choices minimize:
  Total cost = preservation cost + sum(E(Loss))

[Plot with axes labeled risk and cost]

Are we there yet?
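A small numerical sketch of the extension, with the number of copies as the only choice variable. It assumes independent copy loss, a flat cost per copy, and made-up object values. None of these figures come from the survey, and the function names are hypothetical.

```python
def total_cost(copies, cost_per_copy, p_copy_loss, object_values):
    """Total cost = preservation cost + sum of expected losses.

    ASSUMES each copy is lost independently with probability p_copy_loss,
    so an object is lost with probability p_copy_loss ** copies.
    """
    preservation_cost = copies * cost_per_copy
    p_object_loss = p_copy_loss ** copies
    expected_loss = sum(p_object_loss * value for value in object_values)
    return preservation_cost + expected_loss

def best_copy_count(max_copies, cost_per_copy, p_copy_loss, object_values):
    """Number of copies (1..max_copies) with the lowest total cost."""
    return min(range(1, max_copies + 1),
               key=lambda n: total_cost(n, cost_per_copy, p_copy_loss,
                                        object_values))

# Hypothetical collection: 1000 objects, each valued at 1000 cost units.
values = [1000.0] * 1000
for n in range(1, 6):
    print(n, round(total_cost(n, 500.0, 0.05, values), 2))
print("cheapest:", best_copy_count(10, 500.0, 0.05, values))
```

With these numbers the minimum falls at three copies: a fourth copy costs more than the expected loss it removes. Changing the loss function, or dropping the independence assumption, moves the optimum, which is why formal cost and valuation models matter.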

Page 26: Approaches to Preservation Storage Technologies

Research Directions

• Growing the evidence base
  – Descriptive inference – patterns of use
  – Descriptive inference – outcomes
  – Predictive inference – trend analysis
  – Causal inference – effectiveness of interventions
• Modes of inquiry
  – probability-based surveys (e.g. of information management practice and outcomes);
  – replicable simulation experiments tied to theoretically grounded models of information management and risk;
  – creation of testbeds and test-corpuses which can be used to systematically compare new practices, tools, and methods;
  – field experiments, in which randomized interventions are applied and evaluated in real operational environments.

Page 27: Approaches to Preservation Storage Technologies

Bibliography (Selected)

• David S.H. Rosenthal, Thomas S. Robertson, Tom Lipkis, Vicky Reich, Seth Morabito. “Requirements for Digital Preservation Systems: A Bottom-Up Approach”, D-Lib Magazine, vol. 11, no. 11, November 2005.

• Pinheiro, E., Weber, W.D., & Barroso, L. A. (2007). Failure trends in a large disk drive population. In Proceedings of 5th USENIX Conference on File and Storage Technologies.

• Rosenthal, David SH. "Bit preservation: a solved problem?." International Journal of Digital Curation 5.1 (2010): 134-148.

Page 28: Approaches to Preservation Storage Technologies

Questions?

E-mail: [email protected]
micahaltman.com
Twitter: @drmaltman
