A Funding and Operational Model for Long-Term Preservation and Sharing of Research Data 1 EDUCAUSE Live!, 9/1/2010



Special Thanks To

• Mark Ratliff
  • Digital Repository Architect
  • Implemented DataSpace at Princeton
• Princeton University Library and OIT
  • Marvin Bielawski
  • Joyce Bell (metadata) and Dan Santamaria (archives)
• http://arks.princeton.edu/ark:/88435/dsp01w6634361k


Goal of DataSpace Model

• Store data “forever”
  • Preserve and protect
• Share data with everyone
  • Make available
  • Make accessible
• Make it Work!
  • Financially
  • Operationally


What is “Forever”?

• Longer than a typical project?
• Longer than a typical career?
• Longer than a typical institution?
• 5 years, 10 years, 25 years, 100 years?
• Suggestion: treat data same way library treats books
  • Intent is to preserve indefinitely
  • As long as practical, feasible
  • Cannot be precisely defined


Why Save Data “Forever”

• Because we want to make it:
  • Available to ourselves and our students and colleagues
    • Where are the data sitting today? On a departmental server? On a computer under your desk? On a CD or DVD somewhere?
    • Where is your dissertation data?
  • Available to future scholars, including ourselves


Why Save Data “Forever”

• Because we need to:
  • Encourage honesty?
    • Gregor Mendel probably cheated
  • Like open-source: help uncover mistakes, bugs?
  • Open Data Movement
    • Mostly library/catalog data, map data, WordNet
  • Open Access Movement
    • Mostly publications
• Because it’s not “our” data


Why Save Data “Forever”

Because we have to:

Science Insider, May 5 reports:

“Edward Seidel, acting head of NSF's mathematics and physical sciences directorate, described NSF's intention to require all applicants to submit a data management plan along with their grant application in a presentation this morning to the National Science Board, NSF's oversight body. … NSF's current policy requires grantees to share their data within a reasonable length of time so long as the cost is modest. "That's nice, but it doesn't have much teeth," said Seidel. Under the new policy, which is expected to be unveiled this fall, a researcher would submit a data management plan as a two-page supplement to any regular grant proposal. That would make it an element of the merit review process.”


Data Sharing Policies

• New NSF policy continues trend
• NIH Data Sharing Policy (2003): http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html
  • “all investigator-initiated applications with direct costs greater than $500,000 in any single year will be expected to address data sharing in their application.”


NIH Data Sharing Policy

• “Applicants may request funds for data sharing and archiving. The financial issues should be addressed in the budget section of the application.”

• Specifics depend on grant, published in RFP, RFA or PA


NSF Data Archiving Policy

• Division of Social and Economic Sciences
• http://www.nsf.gov/sbe/ses/common/archive.jsp
• “Grantees from all fields will develop and submit specific plans to share materials collected with NSF support, except where this is inappropriate or impossible.”


NSF Data Archiving

• From Grant Proposal Guide
• NSF “expects PIs to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work.”
• Specifics depend on grant and program officer


Other Agency Policies

• See Gary King’s page on “Data Sharing and Replication”
  • http://gking.harvard.edu/replication.shtml
• See National Academy of Sciences, “Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age”, July 2009
  • http://www.nap.edu/catalog/12615.html


Current Storage Models

• Let someone else do it
  • Government agency/lab/bureau
    • NOAA National Geophysical Data Center
    • GenBank (DNA data)
    • fMRIDC (fMRI publications and data)
    • NCSA Astronomy Digital Image Library


Current Storage Models

• Professional society/Journals
  • Global Ocean Observing System: coordinates distributed data
  • Dryad: ecology/evolutionary biology
• Nice folks at another University
  • ICPSR, University of Michigan (political/social)
  • Dryad: ecology/evolutionary biology
  • Protein Data Bank (PDB): 3-D protein data
  • NCSA Astronomical Image Library
• The “Cloud”


Current Funding Models

• Institution/department pays
• Grants pay monthly/yearly
• Haphazard
  • Some grant money
  • Some departmental money
  • Use whatever is available
  • Don’t worry, someone will pay


Current Funding Models

• Most require some form of on-going payment
• Advantages
  • Capitalist approach to data storage
  • If someone wants to pay, data gets saved
  • “Natural” expiration process
• Disadvantages
  • Capitalist approach to data storage
  • Who pays to save rarely used data?


Different Approach

PAY ONCE, STORE ENDLESSLY (POSE)

Why Pay Once?
• Grants expire often and quickly
• Researchers retire/move/move-on pretty often

How Store Forever?
• Administrators expire slowly
• Institutions expire rarely
• Cost is FINITE (how is that possible?)
  • Magic of Math


The Business Model (1)

• I = initial cost of storage
• d = rate at which storage costs decrease yearly, expressed as a fraction (e.g., 20% would be 0.2)
• r = how often, in years, storage is replaced
• T = cost to store the data “forever”

\( T = I + (1-d)^{r} I + (1-d)^{2r} I + \dots \)

If d = 20%, r = 4:

\( T = I + (0.8^{4}) I + (0.8^{8}) I + \dots \)
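
The series above can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the function name `replacement_costs` and the unit initial cost are assumptions made for illustration. It lists what each successive storage purchase costs when prices fall by d per year and the hardware is replaced every r years.

```python
def replacement_costs(initial_cost, d, r, generations):
    """Cost of each purchase: the k-th replacement costs I * (1 - d)^(k * r)."""
    return [initial_cost * (1 - d) ** (k * r) for k in range(generations)]


# Slide example: d = 20% per year, replacement every 4 years, unit initial cost.
costs = replacement_costs(1.0, d=0.2, r=4, generations=5)

running_total = 0.0
for k, cost in enumerate(costs):
    running_total += cost
    print(f"purchase {k}: cost {cost:.4f}, running total {running_total:.4f}")
```

Each purchase is a fixed fraction of the previous one, so the running total grows more and more slowly.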


The Business Model (2)

If d > 0, the geometric series

\( T = I + (1-d)^{r} I + (1-d)^{2r} I + \dots \)

CONVERGES:

\( T = \dfrac{I}{1 - (1-d)^{r}} \)

For d = 20%, r = 3: \( T = I / (1 - 0.8^{3}) = I / 0.488 \approx 2I \)

Charge 2x initial storage cost, save half, store forever!

To simplify: let \( S = \dfrac{1}{1 - (1-d)^{r}} \), the “Storage Factor” or “S factor”

Then \( T = I \cdot S \): Total Cost = Initial Cost * S Factor
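
As a rough check of the closed form, the sketch below computes the S factor directly and compares it with a long truncated sum of the series. The function names `storage_factor` and `truncated_series` are illustrative assumptions, not names from the talk.

```python
def storage_factor(d, r):
    """S = 1 / (1 - (1 - d)^r); only meaningful when prices fall (0 < d < 1)."""
    if not 0 < d < 1:
        raise ValueError("d must be between 0 and 1 for the series to converge")
    if r <= 0:
        raise ValueError("r must be positive")
    return 1.0 / (1.0 - (1.0 - d) ** r)


def truncated_series(d, r, terms):
    """Sum of the first `terms` purchases for a unit initial cost."""
    return sum((1.0 - d) ** (k * r) for k in range(terms))


s = storage_factor(0.2, 3)
approx = truncated_series(0.2, 3, terms=50)
print(f"S factor (closed form): {s:.3f}")
print(f"S factor (50-term sum): {approx:.3f}")
```

For d = 20% and r = 3 the two agree at roughly 2, which is where the "charge 2x" rule of thumb comes from.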


“S” Factor is very stable

• The “S” factor is small, between 1 and 2, for a broad range of reasonable values
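
To see how the S factor responds to its two inputs, the sketch below tabulates it over a small grid. The grid values are assumptions chosen for illustration and are not taken from the slide.

```python
def storage_factor(d, r):
    """S = 1 / (1 - (1 - d)^r) for yearly decline d and replacement cycle r."""
    return 1.0 / (1.0 - (1.0 - d) ** r)


decline_rates = [0.10, 0.20, 0.30, 0.40]  # assumed grid, not from the slide
replacement_years = [3, 4, 5]             # assumed grid, not from the slide

table = {(d, r): storage_factor(d, r)
         for d in decline_rates for r in replacement_years}

print("d \\ r " + "".join(f"{r:>8}" for r in replacement_years))
for d in decline_rates:
    row = "".join(f"{table[(d, r)]:>8.2f}" for r in replacement_years)
    print(f"{d:>5.0%} " + row)
```

Under this formula S is always above 1 and moves toward 1 as either the decline rate or the replacement interval grows. The slowest declines paired with short replacement cycles give the largest values.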


Rate of Decrease in Storage

• Is 20%/year too much?
  • 1981: 10 MB drive costs $3,000
  • 2010: 500 GB drive costs $600
    • 250,000-fold decrease over 30 years, or an average of 35% per year
  • 2000: IBM 20 GB drive costs $280
  • 2010: 500 GB drive costs $600
    • 12-fold decrease over 10 years, or about 23%/year
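
The yearly rates quoted above can be approximated from the drive prices. The sketch below assumes a constant compound decline, 1 GB = 1000 MB, and spans of 30 and 10 years as stated on the slide. The helper name `annual_decline` is hypothetical.

```python
def annual_decline(old_unit_price, new_unit_price, years):
    """Constant yearly fractional decrease that turns the old unit price into the new one."""
    return 1.0 - (new_unit_price / old_unit_price) ** (1.0 / years)


# Prices in dollars per gigabyte (assuming 1 GB = 1000 MB).
price_1981 = 3000 / 0.010  # 10 MB drive for $3,000
price_2000 = 280 / 20      # 20 GB drive for $280
price_2010 = 600 / 500     # 500 GB drive for $600

long_run = annual_decline(price_1981, price_2010, years=30)
recent = annual_decline(price_2000, price_2010, years=10)

print(f"1981-2010: {price_1981 / price_2010:,.0f}-fold, {long_run:.1%} per year")
print(f"2000-2010: {price_2000 / price_2010:,.1f}-fold, {recent:.1%} per year")
```

Under these assumptions the compound rates come out at roughly 34% and 22% per year, close to the rounded figures on the slide and comfortably above the 20% used in the model.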


An Example: DataSpace at Princeton

• FC costs decrease by about 16% per year
• SATA costs decrease by about 17% per year


The “S” factor for Princeton

• SATA cost = $1.81/GB
• Replace every four years
• Costs decrease by 20% per year

\( T = \$1.81 / (1 - 0.8^{4}) \approx \$3 \) per GB

Adding tape backup jumps this to $6/GB

$6K one-time to store a terabyte forever
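
The Princeton figures can be reproduced with a few lines of arithmetic. In the sketch below the disk cost, decline rate, and replacement cycle come from the slide. The $6/GB figure with tape backup is taken as reported rather than derived, and 1 TB = 1000 GB is an assumption.

```python
sata_cost_per_gb = 1.81  # from the slide
d, r = 0.20, 4           # 20% yearly decline, replacement every four years

# One-time charge that covers disk purchases forever, per gigabyte.
disk_forever_per_gb = sata_cost_per_gb / (1 - (1 - d) ** r)

# The slide reports about $6/GB once tape backup is included; that number is
# used as given here, not computed from a tape price.
with_tape_per_gb = 6.00
per_terabyte = with_tape_per_gb * 1000  # assuming 1 TB = 1000 GB

print(f"disk only: ${disk_forever_per_gb:.2f} per GB")
print(f"with tape: ${with_tape_per_gb:.2f} per GB, ${per_terabyte:,.0f} per TB")
```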


But what about Other Costs?

• Cost of disk drives only (small) part of cost of saving/sharing data

• What about costs of people, buildings, electronics, software?

• Some claim these costs “swamp” the actual cost of the disks


Model can handle other costs

• People, building, infrastructure, software costs MUCH higher than disk costs …
• … but NOT if pro-rated across storage units.
  • Storage administrator in 2010 can manage 1000 times more storage than in 2000.
  • Have to look at “marginal costs”: how much more in people costs to store an extra terabyte?


Model can handle other costs

• On a per-unit-of-storage basis, people, electronics, software and infrastructure costs are also decreasing rapidly

• These costs can be built into the “S” factor (just as we built the tape backup costs into the “S” factor at Princeton).

• As long as a cost is decreasing over time on a pro-rated basis, the model applies.
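
One way to fold several cost streams into a single one-time charge is to apply the same formula to each component with its own rate of decline and payment cycle, then add the results. The sketch below illustrates that idea only. The class name `CostComponent` and every dollar figure and rate in it are hypothetical placeholders, not Princeton's numbers.

```python
from dataclasses import dataclass


@dataclass
class CostComponent:
    name: str
    initial_cost: float  # dollars per GB at deposit time
    d: float             # yearly fractional decline in the per-GB cost
    r: int               # years between repeat payments

    def forever_cost(self):
        """One-time charge covering this component indefinitely."""
        return self.initial_cost / (1 - (1 - self.d) ** self.r)


# Hypothetical figures for illustration only.
components = [
    CostComponent("disk", 1.80, d=0.20, r=4),
    CostComponent("tape backup", 1.50, d=0.20, r=4),
    CostComponent("staff and facilities", 0.50, d=0.10, r=1),
]

total = sum(c.forever_cost() for c in components)
first_year = sum(c.initial_cost for c in components)

for c in components:
    print(f"{c.name:<22} ${c.forever_cost():.2f} per GB")
print(f"{'total':<22} ${total:.2f} per GB (blended S = {total / first_year:.2f})")
```

A component paid every year with a slow decline, like the staff line here, carries a much larger factor than hardware replaced every few years, which is why the slides stress keeping such costs small.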


When does model break down?

• If costs increase on a per-unit basis over time, then you must have an ongoing income stream
  • Administrative, data-translation, data-delivery costs are candidates
• Need to minimize such costs
  • Keep administrative overhead and service overhead to a minimum
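
The breakdown can be seen by summing purchases over longer and longer horizons. The sketch below is an illustration under assumed rates (a 20% yearly fall, flat costs, and a 5% yearly rise), with a hypothetical helper `cumulative_cost`. A negative d stands for a per-unit cost that rises over time.

```python
def cumulative_cost(initial_cost, d, r, years):
    """Total spent on purchases made at years 0, r, 2r, ... up to `years`."""
    return sum(initial_cost * (1 - d) ** t for t in range(0, years + 1, r))


scenarios = [("falling 20%/yr", 0.20), ("flat", 0.0), ("rising 5%/yr", -0.05)]
horizons = (20, 100, 400)

for label, d in scenarios:
    totals = [cumulative_cost(1.0, d, r=4, years=y) for y in horizons]
    print(f"{label:>16}: " + ", ".join(f"{t:,.2f}" for t in totals))
```

Only the falling-cost scenario settles to a finite total. With flat or rising per-unit costs the total keeps growing with the horizon, so no one-time payment can cover it.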


Organizational facet of model

• So far, only looked at financial facet
• Financial facet needs to be combined with organizational facet to keep costs down
• Write Once, Read Forever (WORF)
  • Set of principles aimed at minimizing operational costs
  • Also implements the “sharing” requirement of current grant policies


WORF Principles

• Storage may not be “re-used”; it is archival, not working

• Permanent URL assigned; data cannot be changed. Can be made unavailable or superseded by new data.

• All data permanently publicly accessible (minus “embargo” period)

• Repository only provides storage and web access


WORF Principles (2)

• POSE
• Ancillary services (data conversion) separately charged

We propose that a repository implementing the above set of principles be called a “DataSpace Repository”
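
The write-once behaviour described by these principles can be sketched as a toy in-memory store. This is an illustration only and says nothing about how DSpace or the Princeton repository is implemented. The class name `WriteOnceArchive`, its methods, and the identifier format are all hypothetical.

```python
from datetime import date


class WriteOnceArchive:
    """Toy archive: each permanent ID is written once and never changed."""

    def __init__(self):
        self._items = {}

    def deposit(self, permanent_id, data, embargo_until=None):
        if permanent_id in self._items:
            raise ValueError(f"{permanent_id} already exists; deposits are write-once")
        self._items[permanent_id] = {
            "data": bytes(data),
            "embargo_until": embargo_until,
            "available": True,
            "superseded_by": None,
        }

    def supersede(self, old_id, new_id):
        """Record that new_id replaces old_id; the old bytes stay untouched."""
        if new_id not in self._items:
            raise KeyError(new_id)
        self._items[old_id]["superseded_by"] = new_id

    def make_unavailable(self, permanent_id):
        self._items[permanent_id]["available"] = False

    def read(self, permanent_id, today):
        item = self._items[permanent_id]
        if not item["available"]:
            raise PermissionError("item has been made unavailable")
        if item["embargo_until"] is not None and today < item["embargo_until"]:
            raise PermissionError("item is still under embargo")
        return item["data"]


archive = WriteOnceArchive()
archive.deposit("example/0001", b"survey results, first release",
                embargo_until=date(2011, 1, 1))
print(archive.read("example/0001", today=date(2011, 6, 1)))
```

The store offers only deposit and read, mirroring the principle that the repository provides storage and access while anything else is a separately charged service.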


DataSpace at Princeton

• OIT and the Library have implemented a DataSpace repository

• Uses “DSpace” repository software developed at MIT and the Archival Resource Key (ARK) system for permanent URLs

• Just started:
  • http://dataspace.princeton.edu
