elag workshop sessie 1 en 2 v10

ELAG Workshop “Data repository challenges” Wednesday, May 16 th 2012 Session 1 & 2 Jeroen Rombouts & Egbert Gramsbergen

Upload: jeroen-rombouts

Post on 26-Jan-2015

DESCRIPTION

ELAG2012 Workshop 3TU.Datacentrum Session 1 and 2

TRANSCRIPT

Page 1: Elag workshop sessie 1 en 2 v10

ELAG Workshop

“Data repository challenges”

Wednesday, May 16th 2012

Session 1 & 2

Jeroen Rombouts & Egbert Gramsbergen

Page 2: Elag workshop sessie 1 en 2 v10

Programme

Session 1 (14:30 – 15:30): “meta - data - value - …”

1. Round of introduction: who-is-who and why this workshop?

2. Short intro 3TU.DC

3. Background information

4. Case: Traffic flow observations

5. Warming-up Graphs

Break

Session 2 (16:00 – 17:00): “producers - consumers - attitudes - …”

• ‘Discipline’ differences (researchers & repositories)

• Dotmocracy ‘Lite’

• Conclusions

Page 3: Elag workshop sessie 1 en 2 v10

1. Who is who?

• Who are you?

• Why interested in this topic?

Page 4: Elag workshop sessie 1 en 2 v10

2. 3TU.Datacentrum = …

• 3 Dutch TUs: Delft, Eindhoven, Twente
• Project 2008-2011, going concern 2012-
• Data archive
– 2008 -
– “finished” data
– preserve but do not forget usability
– meta data harvestable (OAI-PMH)
– crawlable (OAI-ORE linked data)
– data citation information (incl. DataCite DOIs)
• Data labs
– Just starting (hosting)
– Unfinished data + software/scripts
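To make "meta data harvestable (OAI-PMH)" concrete, here is a minimal sketch of what a harvester does: it builds a `ListRecords` request and reads Dublin Core fields from the XML reply. The endpoint URL and the sample response are assumptions made up for illustration (the slides give neither), and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint: the slides do not state the archive's OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Made-up response fragment, shaped like an oai_dc ListRecords reply.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:dataset-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
          <dc:identifier>doi:10.0000/example</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Extract the OAI identifier, title and dc:identifier of each record."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "oai_id": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": record.findtext(".//dc:title", namespaces=NS),
            "identifier": record.findtext(".//dc:identifier", namespaces=NS),
        })
    return records


if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    print(parse_records(SAMPLE_RESPONSE))
```

A real harvester would also follow `resumptionToken` elements to page through large result sets; that is left out here to keep the sketch short.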

Page 5: Elag workshop sessie 1 en 2 v10

Website & Data-archive

• http://datacentrum.3tu.nl
• Information
News, announcements
Publications, links and tutorials

• http://data.3tu.nl
• Data sets download and ‘management’
• ‘Use’ data with Google Maps/Earth, OPeNDAP, …

Page 6: Elag workshop sessie 1 en 2 v10

Data archiving options

• ‘Simple’ sets (Do It Yourself)
Standard (self)upload form and descriptive information, single file per object (can be a ‘zipped’ collection), single DOI, …

E.g.: Zandvliet, H.J.W. et al. (2010): Diffusion driven concerted motion of surface atoms: Ge on Ge(001). MESA+ Institute For Nanotechnology, University of Twente. doi:10.4121/uuid:3f71549c-6097-4bb8-bc00-6db77deb161d

• Special collections (Do It Together)
Negotiate: deposit procedure, description (xml, picture, preview), data model, level of DOI assignment, query online, …

E.g.: Otto, T., Russchenberg, H.W.J. (2010): IDRA weather radar measurements - all data. TU Delft - Delft University of Technology. doi:10.4121/uuid:5f3bcaa2-a456-4a66-a67b-1eec928cae6d
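The two example citations on this slide follow a "creators (year): title. publisher. doi:…" pattern. The sketch below rebuilds the first one from its parts and derives a resolvable link. The class and method names are illustrative assumptions, not part of any 3TU.Datacentrum or DataCite software; the only external fact relied on is that a DOI resolves under `https://doi.org/`.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    """Illustrative container for the citation elements seen on the slide."""
    creators: str
    year: int
    title: str
    publisher: str
    doi: str

    def as_text(self):
        # Same element order as the slide's examples.
        return f"{self.creators} ({self.year}): {self.title}. {self.publisher}. doi:{self.doi}"

    def resolver_url(self):
        # DOIs resolve through the doi.org proxy.
        return "https://doi.org/" + self.doi


# Values copied from the 'simple set' example on the slide.
example = DataCitation(
    creators="Zandvliet, H.J.W. et al.",
    year=2010,
    title="Diffusion driven concerted motion of surface atoms: Ge on Ge(001)",
    publisher="MESA+ Institute For Nanotechnology, University of Twente",
    doi="10.4121/uuid:3f71549c-6097-4bb8-bc00-6db77deb161d",
)

if __name__ == "__main__":
    print(example.as_text())
    print(example.resolver_url())
```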

Page 7: Elag workshop sessie 1 en 2 v10

Training & Data-labs

• http://dataintelligence.3tu.nl
• Reference, News & Events for training library staff.
• OpenEarth, SHARE, …?

Page 8: Elag workshop sessie 1 en 2 v10

Questions

Page 9: Elag workshop sessie 1 en 2 v10

3. Background information

• Workshop scope
– Need for change ?/!
– Questions (for now)
• Report inputs
– NSF/NSB: Definitions
– RIN: Discipline/Data Differences
– DANS/3TU.DC: Value/selection/DSA/…???

Page 10: Elag workshop sessie 1 en 2 v10

Data Deluge

• Data in 2015 approx. 18 million times Library of Congress (in size).

• Video data in 2005 half of all digital data.

• According to Eric Sieverts: At current growth rate in 2210 number of bytes equal to number of atoms on planet earth. (predicts that before that happens something will change ;-))

• CERN-LHC: 10-15 PB/yr.

Page 11: Elag workshop sessie 1 en 2 v10

Workshop scope

Preconditions
• Challenge: Too much data (to keep).
Technology (storage capacity, cooling, energy), organizations (strategies, budgets) and people (awareness, training) can’t keep (this) up!

• Upside: Not all data is valuable in the future
some relevant (de)selection experience in archiving, some efficiency improvements, ‘some’ increase in storage capacity, …

Questions
A. Which research output to share and preserve?

B. Who are the players involved?

C. How to collect and preserve the research output?

Roles of University Libraries…

Conclusions on differences between documents and research data?

Page 12: Elag workshop sessie 1 en 2 v10

NSF/NSB - 1/3

• Data. For the purposes of this document, data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.

• Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, interrelationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.

Page 13: Elag workshop sessie 1 en 2 v10

NSF/NSB - 2/3

3 functional types of data collections:

• Research Collections
Authors are individual investigators and investigator teams. Research collections are usually maintained to serve immediate group participants only for the life of a project, and are typically subjected to limited processing or curation. Data may not conform to any data standards.

• Resource Collections
Resource collections are authored by a community of investigators, often within a domain of science or engineering, and are often developed with community level standards. Budgets are often intermediate in size. Lifetime is between the mid- and long-term.

Page 14: Elag workshop sessie 1 en 2 v10

NSF/NSB - 3/3

• Reference Collections
Reference collections are authored by and serve large segments of the science and engineering community and conform to robust, well-established and comprehensive standards, which often lead to a universal standard. Budgets are large and are often derived from diverse sources with a view to indefinite support.

[NSF, Originally: National Science Board report on Long-Lived Digital Data Collections, …]

Differences:
• Community size
• Collection lifetime
• Level of standardization
• Amount of processing
• Budget size & sources
• …

Page 15: Elag workshop sessie 1 en 2 v10

RIN

• Many different kinds and categories of data:
– scientific experiments;
– models or simulations; and
– observations of specific phenomena at a specific time or location.
…

• Datasets are generated for different purposes and through different processes.

• Data may undergo various stages of transformation.

• The quality of metadata provided for research datasets is very variable.

• Varying degrees of data management, efforts, resources and expertise.

• There are significant variations – as well as commonalities - in researchers’ attitudes, behaviors and needs, in the available infrastructure, and in the nature and effect of policy initiatives, in different disciplines and subject areas

• …

Page 16: Elag workshop sessie 1 en 2 v10

DANS/3TU.DC

Key findings
• No solid definition of “research data” found
• Lot of literature on selection process, but…
• Not a single case of selection policy of digital data found

Apparently a lot of implicit selection going on considering the available digital research data

Reasons for preserving research data:

a) Obligation to enable re-use (by funder, publisher)

b) Other arguments: inter or intra disciplinary value, hard to repeat, value for historic research

c) Obligation for verification (by code of conduct, employer, publisher)

d) Non scientific arguments (heritage, responsibility to society)

Page 17: Elag workshop sessie 1 en 2 v10

Docs vs. Data (Differences)

• Object sizes (capacity)
• Collection sizes/granularity (number of objects)
• Meta data (type, standards and distinction from object)
• Heterogeneity of collections (not discipline differences)
– Data category (experiment, model/simulation, observation)
– Data generation process (man made vs. machine made or …)
– File formats
• Attitudes to ‘publishing’
• Resources, expertise, efforts on data management
• Selection inevitable
• Value?
• …
• … Anything to add?

(list to be expanded in workshop)

Page 18: Elag workshop sessie 1 en 2 v10

Questions, suggestions, …

Page 19: Elag workshop sessie 1 en 2 v10

4. Case: Traffic flow observations

• Case
Researchers needed to clear disk space and offered data which were “expensive to gather and had required quite a lot of computation to process.” Project was already finished.

• Content
Pictures of highway stretches shot from helicopter. Shoulder open/closed, several flights, raw/stabilized, several dates, calibration image, calibration software and settings.

Page 20: Elag workshop sessie 1 en 2 v10

Questions for case

A. Which data to ingest?
raw pictures, stabilized pictures, movies or … vectors and type of cars?
GPS logs
calibration image
stabilisation software/data

B. Who is involved?
data-producer (researcher)
research funder (owner)
data repository

C. How to preserve?
GPS logs: as data or meta data, all flight data or only when recording?
the software (code or executable?)
picture formats (tiff, png, jpeg2000, …)?
granularity (per flight, per location, per recording, ...)?
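The granularity question under C can be made tangible with a small sketch: the same set of files yields a different number of datasets (and so a different number of candidate DOIs) depending on the key used to group them. The file inventory below is entirely hypothetical; the slides do not list the actual files, flights or locations.

```python
from collections import defaultdict

# Hypothetical inventory: names, flights and locations are invented for illustration.
FILES = [
    {"flight": "flight-1", "location": "site-A", "recording": "rec-01", "name": "img_0001.tiff"},
    {"flight": "flight-1", "location": "site-A", "recording": "rec-02", "name": "img_0002.tiff"},
    {"flight": "flight-1", "location": "site-B", "recording": "rec-03", "name": "img_0003.tiff"},
    {"flight": "flight-2", "location": "site-A", "recording": "rec-04", "name": "img_0004.tiff"},
]


def group_by(files, key):
    """Group file names into candidate datasets by one descriptive field."""
    groups = defaultdict(list)
    for entry in files:
        groups[entry[key]].append(entry["name"])
    return dict(groups)


if __name__ == "__main__":
    for key in ("flight", "location", "recording"):
        datasets = group_by(FILES, key)
        print(f"per {key}: {len(datasets)} dataset(s)")
```

Coarser grouping means fewer identifiers to manage but less precise citation; finer grouping gives the reverse. Which trade-off fits is exactly what the case asks participants to decide.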

Page 21: Elag workshop sessie 1 en 2 v10

The data

Page 22: Elag workshop sessie 1 en 2 v10

Collection

Page 23: Elag workshop sessie 1 en 2 v10

Top level dataset

Page 24: Elag workshop sessie 1 en 2 v10

Low level dataset (stabilized data)

Page 25: Elag workshop sessie 1 en 2 v10

…

• …

Page 26: Elag workshop sessie 1 en 2 v10

Citation information

Page 27: Elag workshop sessie 1 en 2 v10

Docs vs. Data (Differences)

• Object sizes (capacity)
• Collection sizes/granularity (number of files)
• Meta data (type, standards and distinction from object)
• Heterogeneity of collections
– Data category (experiment, model/simulation, observation)
– Data generation process (man made vs. machine made or …)
– File formats
• Attitudes to ‘publishing’
• Resources, expertise, efforts on data management
• Selection inevitable
• Value?
• Citation practice
• …
• … Anything to add?

Page 28: Elag workshop sessie 1 en 2 v10

Questions, suggestions, …

Page 29: Elag workshop sessie 1 en 2 v10

5. To the graphs…

Page 30: Elag workshop sessie 1 en 2 v10

Break

Page 31: Elag workshop sessie 1 en 2 v10
Page 32: Elag workshop sessie 1 en 2 v10
Page 33: Elag workshop sessie 1 en 2 v10
Page 34: Elag workshop sessie 1 en 2 v10
Page 35: Elag workshop sessie 1 en 2 v10

Session 2

Session 2 (16:00 – 17:00): “producers - consumers - attitudes - …”

1. ‘Discipline’ differences (researchers & repositories)

2. Dotmocracy ‘Lite’

3. Preliminary conclusions?

Back to plenary presentations

Page 36: Elag workshop sessie 1 en 2 v10

What our account managers ‘sell’…

The benefits for data producers and data consumers

• Increased visibility of research output (metadata in repository networks, assigning DOIs, facilitating increased citation rates for ‘enhanced publications’, ...);

• Improved quality of datasets (quality assurance for multi-user setup, checks on ingest, …);

• (Long-term) preservation of, and accessibility to, valuable research data;

• Distribution of research data for reuse, including administration and usage statistics;

• Advice on data management, rights, formats, metadata, etc.

Page 37: Elag workshop sessie 1 en 2 v10

Value

• Secure research data
• Cite/Claim (DOIs)
• Quality Assurance (support)
• Data exchange
• Data visibility

• Support EU projects, Communities
• Extra show window
• Relation with non-academic research, society
• Prepare for paradigm shift
• Enable verification

Page 38: Elag workshop sessie 1 en 2 v10

What do data producers say? 1/2

• Nobody needs my data
• Data transfer not needed, every PhD does own project
• Our datasets are confidential
• Interesting but not for me
• Only for long term continuous data
• Datasets are stored by publisher
• No time!
• Our research is once only

Page 39: Elag workshop sessie 1 en 2 v10

What do data producers say? 2/2

• Surprising our university had no facility for data preservation
• Transfer of data between PhDs can be improved
• Would like to publish data
• Good opportunity to share datasets we bought
• Very useful, essential metadata often missing
• Much to improve in reuse of data
• When can I store my datasets?

Page 40: Elag workshop sessie 1 en 2 v10

Workshop with researchers

Data should only become available after publication

Page 41: Elag workshop sessie 1 en 2 v10

Workshop results

• Confirmed:
– Different domains have commonalities
– Need for support on research data management exists

• There are strong differences depending on
– Research type
– Data types
– Individual attitudes

Page 42: Elag workshop sessie 1 en 2 v10

‘Conclusions’ on valuable data

• Data of ‘enhanced publications’ (underlying data and visualisations linked to publications).
Increase publication value (stronger basis, more citations, …);

• Data generated by ‘hard to repeat’ processes.
E.g. high cost, (environmental) observations, complex or continuous experiments, …;

• Data collected with public funding.
Conditions by funding organisations or publishers like Nature Publishing Group, NWO, governmental organisations, universities, …;

• Preferably open access data with potential for reuse (verification, new research, …).
Increase visibility, efficiency and quality of research efforts.

• … Anything to add?

Which data to preserve? And why?

Page 43: Elag workshop sessie 1 en 2 v10

Docs vs. Data (Differences)

• Object sizes (capacity)
• Collection sizes/granularity (number of files)
• Meta data (type, standards and distinction from object)
• Heterogeneity of collections
– Data category (experiment, model/simulation, observation)
– Data generation process (man made vs. machine made or …)
– File formats
• Attitudes to ‘publishing’
• Resources, expertise, efforts on data management
• Selection inevitable (due to size)
• Value of research data higher
• Readability of research data is lower (zero without metadata)
• Citation practice
• …
• … Anything to add?

Page 44: Elag workshop sessie 1 en 2 v10

The End

In one line:

“Challenge is to find the ready, able and willing (researchers)”

Page 45: Elag workshop sessie 1 en 2 v10

To Dotmocracy…

1. 15 min. to select or define new propositions (approx. 3) and write them on a sheet.

2. 15 min. to ‘vote’ on every sheet.

3. 15 min. for plenary discussion on opposing opinions.

3. 15 min. for plenary discussion on opposing opinions.
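The voting step can be sketched as a simple tally over the five-point scale used on the result sheets (Str. Agree to Str. Disagree). The dots in the example are hypothetical, not the workshop's results, and `is_contested` is an assumed helper that flags sheets with dots on both sides as candidates for the plenary discussion in step 3.

```python
from collections import Counter

# Scale labels as they appear on the result sheets.
SCALE = ["Str. Agree", "Agree", "Neutral", "Disagree", "Str. Disagree"]


def tally(dots):
    """Count the dots per scale option for one proposition sheet."""
    counts = Counter(dots)
    unknown = set(counts) - set(SCALE)
    if unknown:
        raise ValueError(f"dots outside the scale: {sorted(unknown)}")
    return {option: counts.get(option, 0) for option in SCALE}


def is_contested(result):
    """True when a sheet has dots on both the agree and the disagree side."""
    agree = result["Str. Agree"] + result["Agree"]
    disagree = result["Disagree"] + result["Str. Disagree"]
    return agree > 0 and disagree > 0


# Hypothetical dots for one sheet (not taken from the workshop).
example_dots = ["Agree", "Agree", "Str. Agree", "Neutral", "Disagree"]

if __name__ == "__main__":
    result = tally(example_dots)
    print(result)
    print("discuss in plenary:", is_contested(result))
```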

Page 46: Elag workshop sessie 1 en 2 v10

Responsibility Propositions 1/4

• All research data should be stored in disciplinary archives.

• Research institutes must register data produced by their researchers.

• Libraries are the best departments at universities to take on research data archiving.

Page 47: Elag workshop sessie 1 en 2 v10

Obligation Propositions 2/4

• Data-producers should be obliged to publish their (anonymous) research data as open data.

• High cost research facilities should be obliged to share (and preserve) their data.

• Users should login to download data

• Data-repositories should never accept data in proprietary file formats

Page 48: Elag workshop sessie 1 en 2 v10

Value Propositions 3/4

• Only datasets which are linked to publications need to be preserved for the long term.

• Not simulation results but algorithms and boundary conditions should be stored.

• Each dataset should also include the data in its rawest form.

Page 49: Elag workshop sessie 1 en 2 v10

Misc. Propositions 4/4

• University libraries have a harder job to attract datasets from exact sciences than from humanities.

• Researchers are sloppy (they regard documentation as irrelevant and annoying).

• Session #4 should be on the beach with lots of beer.

Page 50: Elag workshop sessie 1 en 2 v10

Docs vs. Data (Differences)

• Object sizes (capacity)
• Collection sizes/granularity (number of files)
• Meta data (type, standards and distinction from object)
• Heterogeneity of collections
– Data category (experiment, model/simulation, observation)
– Data generation process (man made vs. machine made or …)
– File formats
• Attitudes to ‘publishing’
• Resources, expertise, efforts on data management
• Selection inevitable (due to size)
• Value of research data higher
• Readability of research data is lower (zero without metadata)
• Citation practice
• (A document is data)
• Boundaries of data (sets) are less clear than for documents
• Assigned responsibilities and tasks
• Legal status
• …

Page 51: Elag workshop sessie 1 en 2 v10

All Propositions 1/1

• All research data should be stored in disciplinary archives.

• Research institutes must register data produced by their researchers.

• Libraries are the best departments at universities to take on research data archiving.

• Data-producers should be obliged to publish their (anonymous) research data as open data.

• High cost research facilities should be obliged to share (and preserve) their data.

• Users should login to download data

• Data-repositories should never accept data in proprietary file formats

• Only datasets which are linked to publications need to be preserved for the long term.

• Not simulation results but algorithms and boundary conditions should be stored.

• Each dataset should also include the data in its rawest form.

• University libraries have a harder job to attract datasets from exact sciences than from humanities.

• Researchers are sloppy (they regard documentation as irrelevant and annoying).

• Session #4 should be on the beach with lots of beer.

Page 52: Elag workshop sessie 1 en 2 v10

Dotmocracy results 1/3

“Users should login to download data”

+ Should be for some data types (sensitive)

+ It helps to get an idea of usage

+ Anonymity(?) on the net is a ‘2000’ thought anyway

+ Accept license

+ Trace of use for data-producers

- Raise threshold for re-use

Str. Agree Agree Neutral Disagree Str. Disagree

xx xx x

Page 53: Elag workshop sessie 1 en 2 v10

Dotmocracy results 2/3

“Data repositories should never accept files in proprietary formats”

+ Easy to reuse data in open formats

- Better to have proprietary data than none at all

- May preclude data if we insist on open formats

- Can be migrated to open formats (sometimes)

Str. Agree Agree Neutral Disagree Str. Disagree

xxxxxx xxxxxx xxxxxx xx

Page 54: Elag workshop sessie 1 en 2 v10

Dotmocracy results 3/3

“Libraries are the best departments at universities to take on research data archiving”

+ Co-operation already with researchers

+ Librarians have good meta data skills

o The library’s vendor should deliver the service(?)

+ Full control and close to researcher(?)

- Challenge too big: long term sustainability

+ Builds on metadata knowledge of libraries

- Must have IT in co-operation

- Archiving skills

Str. Agree Agree Neutral Disagree Str. Disagree

xx xxxxxxxxxxxxxxxxxx

xx