elag workshop sessie 1 en 2 v10

ELAG Workshop

“Data repository challenges”

Wednesday, May 16th 2012

Session 1 & 2

Jeroen Rombouts & Egbert Gramsbergen

Programme

Session 1 (14:30 – 15:30): “meta - data - value - …”

1. Round of introduction: who is who, and why this workshop?

2. Short intro 3TU.DC

3. Background information

4. Case: Traffic flow observations

5. Warming-up Graphs

Break

Session 2 (16:00 – 17:00): “producers - consumers - attitudes - …”
• ‘Discipline’ differences (researchers & repositories)
• Dotmocracy ‘Lite’
• Conclusions

1. Who is who?

• Who are you?

• Why are you interested in this topic?

2. 3TU.Datacentrum = …

• 3 Dutch TUs: Delft, Eindhoven, Twente
• Project 2008-2011, going concern 2012-
• Data archive
  – 2008 -
  – “finished” data
  – preserve but do not forget usability
  – metadata harvestable (OAI-PMH)
  – crawlable (OAI-ORE linked data)
  – data citation information (incl. DataCite DOIs)
• Data labs
  – Just starting (hosting)
  – Unfinished data + software/scripts
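As an illustration of what "harvestable (OAI-PMH)" means in practice, here is a minimal Python sketch that builds harvesting request URLs. The `verb` and `metadataPrefix` arguments and the `oai_dc` format are part of the OAI-PMH protocol; the base URL is a hypothetical placeholder (not the actual 3TU.Datacentrum endpoint), and the helper name `oai_request_url` is invented for this example.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://example.org/oai"

# Ask the repository to describe itself.
identify_url = oai_request_url(BASE_URL, "Identify")

# Ask for all records in unqualified Dublin Core (oai_dc).
list_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(list_url)
```

A harvester would fetch these URLs and parse the XML responses; that part is left out here because it depends on the repository.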

Website & Data-archive
• http://datacentrum.3tu.nl
• Information: news, announcements, publications, links and tutorials
• http://data.3tu.nl
• Data sets download and ‘management’
• ‘Use’ data with Google Maps/Earth, OPeNDAP, …


Data archiving options

• ‘Simple’ sets (Do It Yourself)
Standard (self-)upload form and descriptive information, single file per object (can be a ‘zipped’ collection), single DOI, …

E.g.: Zandvliet, H.J.W. et al. (2010): Diffusion driven concerted motion of surface atoms: Ge on Ge(001). MESA+ Institute For Nanotechnology, University of Twente. doi:10.4121/uuid:3f71549c-6097-4bb8-bc00-6db77deb161d

• Special collections (Do It Together)
Negotiate: deposit procedure, description (xml, picture, preview), data model, level of DOI assignment, query online, …

E.g.: Otto, T., Russchenberg, H.W.J. (2010): IDRA weather radar measurements - all data. TU Delft - Delft University of Technology. doi:10.4121/uuid:5f3bcaa2-a456-4a66-a67b-1eec928cae6d

http://dx.doi.org/doi:10.4121/uuid:3f71549c-6097-4bb8-bc00-6db77deb161d

http://dx.doi.org/doi:10.4121/uuid:5f3bcaa2-a456-4a66-a67b-1eec928cae6d
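The step from a DOI in a citation to a clickable link can be sketched in a few lines of Python. This is an illustration only: the helper name `doi_to_url` is invented, and it assumes the general resolver at `https://doi.org/` rather than the `dx.doi.org` form shown on the slide.

```python
def doi_to_url(doi, resolver="https://doi.org/"):
    """Turn a DOI as written in a citation into a resolver URL."""
    doi = doi.strip()
    # Citations often write the DOI with a "doi:" label; drop it.
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return resolver + doi


# DOI taken from the 'simple' set example in the citation above.
print(doi_to_url("doi:10.4121/uuid:3f71549c-6097-4bb8-bc00-6db77deb161d"))
```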

Training & Data-labs
• http://dataintelligence.3tu.nl
• Reference, News & Events for training library staff.

• OpenEarth, SHARE, …?


Questions

3. Background information

• Workshop scope
  – Need for change ?/!
  – Questions (for now)

• Report inputs
  – NSF/NSB: Definitions
  – RIN: Discipline/Data Differences
  – DANS/3TU.DC: Value/selection/DSA/…???

Data Deluge

• Data in 2015 approx. 18 million times Library of Congress (in size).

• Video data in 2005 half of all digital data.

• According to Eric Sieverts: at the current growth rate, by 2210 the number of bytes will equal the number of atoms on planet Earth. (He predicts that something will change before that happens ;-))

• CERN-LHC: 10-15PB/yr.

Workshop scope

Preconditions
• Challenge: Too much data (to keep).
Technology (storage capacity, cooling, energy), organizations (strategies, budgets) and people (awareness, training) can’t keep (this) up!

• Upside: Not all data is valuable in the future.
Some relevant (de)selection experience in archiving, some efficiency improvements, ‘some’ increase in storage capacity, …

Questions
A. Which research output to share and preserve?

B. Who are the players involved?

C. How to collect and preserve the research output?

Roles of University Libraries…

Conclusions on differences between documents and research data?

NSF/NSB - 1/3

• Data. For the purposes of this document, data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.

• Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, interrelationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.


3 functional types of data collections:

• Research Collections
Authors are individual investigators and investigator teams. Research collections are usually maintained to serve immediate group participants only for the life of a project, and are typically subjected to limited processing or curation. Data may not conform to any data standards.

• Resource Collections
Resource collections are authored by a community of investigators, often within a domain of science or engineering, and are often developed with community level standards. Budgets are often intermediate in size. Lifetime is between the mid- and long-term.

• Reference Collections
Reference collections are authored by and serve large segments of the science and engineering community and conform to robust, well-established and comprehensive standards, which often lead to a universal standard. Budgets are large and are often derived from diverse sources with a view to indefinite support.

[NSF, Originally: National Science Board report on Long-Lived Digital Data Collections, …]

Differences:
• Community size
• Collection lifetime
• Level of standardization
• Amount of processing
• Budget size & sources
• …


RIN

• Many different kinds and categories of data:
  – scientific experiments;
  – models or simulations; and
  – observations of specific phenomena at a specific time or location.
  …

• Datasets are generated for different purposes and through different processes.

• Data may undergo various stages of transformation.

• The quality of metadata provided for research datasets is very variable.

• Varying degrees of data management, efforts, resources and expertise.

• There are significant variations – as well as commonalities – in researchers’ attitudes, behaviors and needs, in the available infrastructure, and in the nature and effect of policy initiatives, in different disciplines and subject areas.

• …

DANS/3TU.DC

Key findings
• No solid definition of “research data” found
• Lot of literature on selection process, but…
• Not a single case of selection policy of digital data found

Apparently a lot of implicit selection is going on, considering the digital research data that is available.

Reasons for preserving research data:

a) Obligation to enable re-use (by funder, publisher)

b) Other arguments: inter- or intra-disciplinary value, hard to repeat, value for historic research

c) Obligation for verification (by code of conduct, employer, publisher)

d) Non-scientific arguments (heritage, responsibility to society)

Docs vs. Data (Differences)

• Object sizes (capacity)
• Collection sizes/granularity (number of objects)
• Meta data (type, standards and distinction from object)
• Heterogeneity of collections (not discipline differences)
  – Data category (experiment, model/simulation, observation)
  – Data generation process (man made vs. machine made or …)
  – File formats
• Attitudes to ‘publishing’
• Resources, expertise, efforts on data management
• Selection inevitable
• Value?
• …
• … Anything to add?

(list to be expanded in workshop)

Questions, suggestions, …

4. Case: Traffic flow observations

• Case
Researchers needed to clear disk space and offered data which were “expensive to gather and had required quite a lot of computation to process.” The project was already finished.

• Content
Pictures of highway stretches shot from a helicopter. Shoulder open/closed, several flights, raw/stabilized, several dates, calibration image, calibration software and settings.

Questions for case

A. Which data to ingest?
  – raw pictures, stabilized pictures, movies or … vectors and type of cars?
  – GPS logs
  – calibration image
  – stabilisation software/data

B. Who is involved?
  – data-producer (researcher)
  – research funder (owner)
  – data repository

C. How to preserve?
  – GPS logs: as data or meta data, all flight data or only when recording?
  – the software (code or executable?)
  – picture formats (tiff, png, jpeg2000, …)?
  – granularity (per flight, per location, per recording, ...)?

The data

Collection

Top level dataset

Low level dataset (stabilized data)

…

Citation information


• Object sizes (capacity)
• Collection sizes/granularity (number of files)
• Meta data (type, standards and distinction from object)
• Heterogeneity of collections
• Resources, expertise, efforts on data management
• Selection inevitable
• Value?
• Citation practice
• …
• … Anything to add?

Questions, suggestions, …

5. To the graphs…

Break

Session 2

Session 2 (16:00 – 17:00): “producers - consumers - attitudes - …”

1. ‘Discipline’ differences (researchers & repositories)

2. Dotmocracy ‘Lite’

3. Preliminary conclusions?

Back to plenary presentations

What our account managers ‘sell’…

The benefits for data producers and data consumers

• Increased visibility of research output (metadata in repository networks, assigning DOIs, facilitating increased citation rates for ‘enhanced publications’, ...);

• Improved quality of datasets (quality assurance for multi-user setup, checks on ingest, …);

• Provides (long-term) preservation of, and accessibility to, valuable research data;

• Distribution of research data for reuse, including administration and usage statistics;

• Provides advice on data management, rights, formats, metadata, etc.

Value

• Secure research data
• Cite/Claim (DOIs)
• Quality Assurance (support)
• Data exchange
• Data visibility

• Support EU projects, Communities
• Extra show window
• Relation with non-academic research, society
• Prepare for paradigm shift
• Enable verification

What do data producers say? 1/2

Nobody needs my data

Data transfer not needed, every PhD does own project

Our datasets are confidential

Interesting but not for me

Only for long term continuous data

Datasets are stored by publisher

No time!

Our research is once only

What do data producers say? 2/2

Surprising our university had no facility for data preservation

Transfer of data between PhDs can be improved

Would like to publish data

Good opportunity to share datasets we bought

Very useful, essential metadata often missing

Much to improve in reuse of data

When can I store my datasets?

Workshop with researchers

Data should only become available after publication

Workshop results

• Confirmed:
  – Different domains have commonalities
  – Need for support on research data management exists

• There are strong differences depending on
  – Research type
  – Data types
  – Individual attitudes

‘Conclusions’ on valuable data

• Data of ‘enhanced publications’ (underlying data and visualisations linked to publications).
Increase publication value (stronger basis, more citations, …);

• Data generated by ‘hard to repeat’ processes.
E.g. high cost, (environmental) observations, complex or continuous experiments, …;

• Data collected with public funding.
Conditions by funding organisations or publishers like Nature Publishing Group, NWO, governmental organisations, universities, …;

• Preferably open access data with potential for reuse (verification, new research, …).
Increase visibility, efficiency and quality of research efforts.

• … Anything to add?

Which data to preserve? And why?





• Resources, expertise, efforts on data management
• Selection inevitable (due to size)
• Value of research data higher
• Readability of research data is lower (zero without metadata)
• Citation practice
• …
• … Anything to add?

The End

In one line:

“Challenge is to find the ready, able and willing (researchers)”

To Dotmocracy…

1. 15 min. to select or define new propositions (approx. 3) and write them on a sheet.

2. 15 min. to ‘vote’ on every sheet.

3. 15 min. for plenary discussion on opposing opinions.

Responsibility Propositions 1/4

• All research data should be stored in disciplinary archives.

• Research institutes must register data produced by their researchers.

• Libraries are the best departments at universities to take on research data archiving.

Obligation Propositions 2/4

• Data-producers should be obliged to publish their (anonymous) research data as open data.

• High cost research facilities should be obliged to share (and preserve) their data.

• Users should login to download data

• Data-repositories should never accept data in proprietary file formats

Value Propositions 3/4

• Only datasets which are linked to publications need to be preserved for the long term.

• Not simulation results but algorithms and boundary conditions should be stored.

• Each dataset should also include the data in its rawest form.

Misc. Propositions 4/4

• University libraries have a harder job to attract datasets from exact sciences than from humanities.

• Researchers are sloppy (they regard documentation as irrelevant and annoying).

• Session #4 should be on the beach with lots of beer.




• Attitudes to ‘publishing’
• Resources, expertise, efforts on data management
• Selection inevitable (due to size)
• Value of research data higher
• Readability of research data is lower (zero without metadata)
• Citation practice
• (A document is data)
• Boundaries of data (sets) are less clear than for documents
• Assigned responsibilities and tasks
• Legal status
• …

All Propositions 1/1

• All research data should be stored in disciplinary archives.

• Research institutes must register data produced by their researchers.

• Libraries are the best departments at universities to take on research data archiving.

• Data-producers should be obliged to publish their (anonymous) research data as open data.

• High cost research facilities should be obliged to share (and preserve) their data.

• Users should login to download data

• Data-repositories should never accept data in proprietary file formats

• Only datasets which are linked to publications need to be preserved for the long term.

• Not simulation results but algorithms and boundary conditions should be stored.

• Each dataset should also include the data in its rawest form.

• University libraries have a harder job to attract datasets from exact sciences than from humanities.

• Researchers are sloppy (they regard documentation as irrelevant and annoying).

• Session #4 should be on the beach with lots of beer.

Dotmocracy results 1/3

“Users should login to download data”

+ Should be for some data types (sensitive)

+ It helps to get an idea of usage

+ Anonymity(?) on the net is a ‘2000’ thought anyway

+ Accept license

+ Trace of use for data-producers

- Raise threshold for re-use

Str. Agree Agree Neutral Disagree Str. Disagree

xx xx x


“Data repositories should never accept files in proprietary formats”

+ Easy to reuse data in open formats

- Better to have proprietary data than none at all

- May preclude data if we insist on open formats

- Can be migrated to open formats (sometimes)


xxxxxx xxxxxx xxxxxx xx


“Libraries are the best departments at universities to take on research data archiving”

+ Co-operation already with researchers

+ Librarians have good meta data skills

o The library’s vendor should deliver the service(?)

+ Full control and close to researcher(?)

- Challenge too big: long-term sustainability

+ Builds on metadata knowledge of libraries

- Must have IT in co-operation

- Archiving skills


xx xxxxxxxxxxxxxxxxxx

xx
