a basic course on research data management: part 1 - part 4


Upload: leon-osinski

Post on 11-Apr-2017


TRANSCRIPT

Page 1: A basic course on Research data management: part 1 - part 4

A basic course on Research data management

part 1: what and why

PROOF course Information Literacy and Research Data Management

TU/e, 07-03-2017

[email protected], TU/e IEC/Library

Available under CC BY-SA license, which permits copying and redistributing the material in any medium or format & adapting the material for any purpose, provided the original author and source are credited & you distribute the adapted material under the same license as the original

Page 2: A basic course on Research data management: part 1 - part 4

Research data management [RDM]: what #1

Essence of RDM: “… tracking back to what you did 7 years ago and recovering it (...) immediately in a re-usable manner.” (Henry Rzepa)

Page 3: A basic course on Research data management: part 1 - part 4

Research data management [RDM]: what #2

RDM: caring for your data with the purpose to:
1. protect their mere existence: data loss, data authenticity (RDM basics)
2. share them with others
   a. for reasons of reuse: in the same context or in a different context; during research and after research
   b. for reasons of reproducibility checks: scientific integrity; data quality

RDM = good data practices [1,2,3,4,5,6] that make your data understandable, easy to work with, and available to other scientists

1. Dynamic ecology (2016), Ten commandments for good data management. https://dynamicecology.wordpress.com/2016/08/22/ten-commandments-for-good-data-management/

2. Borer, E.T., Seabloom, E.W., Jones, M.B., et al. (2009) Some simple guidelines for effective data management, Bulletin of the Ecological Society of America, 90(2), p. 205-214. doi: 10.1890/0012-9623-90.2.205

3. Hook, L.A., Santhana Vannan, S.K., Beaty, T.W. et al. Best practices for preparing environmental data sets to share and archive. Available online http://daac.ornl.gov/PI/BestPractices-2010.pdf . doi: 10.3334/ORNLDAAC/BestPractices-2010

4. White, E.P., Baldridge, E., Brym, T. et al. (2013) Nine simple ways to make it easier to (re)use your data, Ideas in Ecology and Evolution, 6(2), p. 1-10. doi: 10.4033/iee.2013.6b.6.f

5. Goodman, A., Pepe, A., Blocker, A.W., et al. (2014) Ten simple rules for the care and feeding of scientific data, PLOS Computational Biology, 10(4), e1003542. doi: 10.1371/journal.pcbi.1003542

6. Sandve, G.K., et al. (2013), Ten simple rules for reproducible computational research, PLOS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Page 4: A basic course on Research data management: part 1 - part 4

Source: Research Data Netherlands / Marina Noordegraaf

Topics

1. Research data management [RDM]: what and why
   a. data management plan
2. Sharing your data, or making your data findable and accessible
   a. data protection: back up, file naming, organizing data
   b. data sharing: via collaboration platforms, data archives
3. Caring for your data, or making your data re-usable and interoperable
   a. metadata, tidy data, licenses

Page 5: A basic course on Research data management: part 1 - part 4

Why sharing research data? #1

Because you work together with other researchers: collaborative science
Because of re-using results: data-driven science, open science
Because of scientific integrity: validating data analysis by reproducibility checks requires data and the code that is used to clean, process and analyze the data and to produce the final outputs

Additional reasons
Because your data are unique / not easily repeatable (long term observational data)
Because you benefit from it: increases your visibility and enhances the trustworthiness / credibility of your research

Page 7: A basic course on Research data management: part 1 - part 4

EC: Horizon 2020 #1: Open research data (ORD) pilot

“The ORD pilot aims to improve and maximise access to and re-use of research data generated by Horizon 2020…”

“The ORD pilot applies primarily to the data needed to validate the results presented in scientific publications. Other data can also be provided…”

“A data management plan (DMP) is required for all projects participating in the extended ORD pilot…”

“Participating in the ORD pilot does not necessarily mean opening up all your research data. Rather, the ORD Pilot follows the principle “as open as possible, as closed as necessary” and focuses on encouraging sound data management as an essential part of research best practice.” (my underlining)

Page 8: A basic course on Research data management: part 1 - part 4

EC: Horizon 2020 #2: sound research data management

Sound research data management is data management following the FAIR principles. All research data should be:
Findable: easy to find by both humans and computer systems;
Accessible: stored for long term with well-defined license and access conditions (open access when possible);
Interoperable: ready to be combined with other datasets by humans as well as computer systems;
Reusable: ready to be used for future research and to be processed further using computational methods.

Page 9: A basic course on Research data management: part 1 - part 4

Source: Research Data Netherlands / Marina Noordegraaf

EC: Horizon 2020 #3: requirements

The conditions set by Horizon 2020 with regard to research data management come down to two requirements:
1. Formulate a data management plan, and
2. Deposit research data in a data repository

Page 10: A basic course on Research data management: part 1 - part 4

EC: Horizon 2020 #4: data management plan

The DMP is a set of questions along the FAIR principles about:
1. The handling of research data during and after the project
2. What data sets the project will collect, process and/or generate
3. Whether and how data sets will be findable/discoverable, re-useable and shared/made open access
4. How data will be curated and preserved
5. What measures are taken to safeguard and protect (sensitive) data

DMP template Horizon 2020 (via DMPOnline): recommended but voluntary
DMP template by 4TU.Centre of Research Data
Examples of H2020 DMPs: http://www.dcc.ac.uk/resources/data-management-plans/guidance-examples

Page 11: A basic course on Research data management: part 1 - part 4

Research data management: discussion topics and questions

Storage and back-up
   What sort of data do you use? Are you creating new data or are you working with pre-existing data?
   Where do you store your research data? Is there a back-up? Where?
   Are data selections made? Not everything is to be stored but…?

Metadata and documentation (information to let you find, use and understand the data)
   Do you describe your research data? Who measured or collected what, when, how? Other context information?
   Are you content with the way you document or describe your research data?
   Do you succeed in finding the right (version of your) research data?
   Can other researchers understand and (re-)use your research data (during and after research)? Should they be able to?

Access and re-use
   Who can access your research data?
   What will happen to your research data when you leave TU/e?
   Would you consider publishing your research data, i.e. making them publicly available?

Page 12: A basic course on Research data management: part 1 - part 4

Research data management: which of these statements is true?

Storage and back-up
1. My research data is stored safely and securely, including regular back-ups

Metadata and documentation
2. I keep metadata with my data: who measured/collected what, when, how

Access and re-use
3. My colleagues are able to access and use my data
4. Other researchers are able to access and use my data
5. My nearest colleagues and I are the only ones who can understand my data
6. Anyone should be able to use my data when I have finished with it

Page 13: A basic course on Research data management: part 1 - part 4

Reasons not to share your data

Preparing my data for sharing takes time and effort
   But research data management also increases your research efficiency

My data are confidential
   But you can anonymize or pseudonymize your data

My data still need to yield publications
   But you can publish your data under an embargo, and by publishing your data you establish priority and you can get credit for it

My data can be misused or misinterpreted
   But the best defense against malicious use is to refer to an archival copy of your data which is guaranteed exactly as you mean it to be

My data are only interesting for me
   But sharing your data may be required by a funder / journal or your data may be requested to validate your results

Page 14: A basic course on Research data management: part 1 - part 4

1. Website IEC/Library [TU/e]: https://www.tue.nl/en/university/library/

2. Figshare support, The importance of data management for research: https://youtu.be/Ae205CNrk6w

3. Henry Rzepa, Collaborative FAIR data sharing: http://www.ch.imperial.ac.uk/rzepa/blog/?p=16292

4. Dynamic ecology (2016), Ten commandments for good data management. https://dynamicecology.wordpress.com/2016/08/22/ten-commandments-for-good-data-management/

5. Borer, E.T., Seabloom, E.W., Jones, M.B., et al. (2009) Some simple guidelines for effective data management, Bulletin of the Ecological Society of America, 90(2), p. 205-214. doi: 10.1890/0012-9623-90.2.205

6. Hook, L.A., Santhana Vannan, S.K., Beaty, T.W. et al. Best practices for preparing environmental data sets to share and archive. doi: 10.3334/ORNLDAAC/BestPractices-2010

7. White, E.P., Baldridge, E., Brym, T. et al. (2013) Nine simple ways to make it easier to (re)use your data, Ideas in Ecology and Evolution, 6(2), p. 1-10. doi: 10.4033/iee.2013.6b.6.f

8. Goodman, A., Pepe, A., Blocker, A.W., et al. (2014) Ten simple rules for the care and feeding of scientific data, PLOS Computational Biology, 10(4), e1003542. doi: 10.1371/journal.pcbi.1003542

9. Sandve, G.K., et al. (2013), Ten simple rules for reproducible computational research, PLOS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

10. Data sharing increases visibility: http://dx.doi.org/10.7717/peerj.175

11. Data sharing enhances trustworthiness: http://dx.doi.org/10.1371/journal.pone.0026828

URLs of mentioned webpages in order of appearance #1

Page 15: A basic course on Research data management: part 1 - part 4

12. Data availability policy journals: http://www.nap.edu/openbook.php?record_id=10613&page=33

13. Data availability policy American Economic Review: https://www.aeaweb.org/aer/data.php

15. Data availability policy PLoS: http://journals.plos.org/plosone/s/data-availability

16. Data availability policy Nature: http://www.nature.com/authors/policies/availability.html

17. VSNU Code of Scientific Conduct (Dutch, revision 2014): http://www.vsnu.nl/files/documenten/Domeinen/Onderzoek/Code_wetenschapsbeoefening_2004_(2014).pdf

18. KNAW responsible research data management: https://www.knaw.nl/en/news/publications/responsible-research-data-management-and-the-prevention-of-scientific-misconduct?set_language=en

19. Radboud University research data policy: http://www.ru.nl/research-information-services/institutional-policy/policy-research-data-management/

20. TU/e Code of Scientific Conduct: http://www.tue.nl/en/university/about-the-university/integrity/scientific-integrity/

21. NWO and research data: http://www.nwo.nl/en/policies/open+science/data+management

21. ZonMW Toegang tot data: http://www.zonmw.nl/nl/programmas/programma-detail/toegang-tot-data-ttdata/algemeen/

22. Horizon 2020 Guidelines on data management: http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf

URLs of mentioned webpages in order of appearance #2

Page 16: A basic course on Research data management: part 1 - part 4

23. Data management plan Horizon 2020: https://dmponline.dcc.ac.uk/

24. About FAIR: Mons, B. et al., Cloudy, increasingly FAIR: revisiting the FAIR Data guiding principles for the European Open Science Cloud: http://dx.doi.org/10.3233/ISU-170824

25. Data management plan template (4TU.ResearchData): http://researchdata.4tu.nl/en/planning-research/data-management-plan/

25. Emilio M. Bruna (04-09-2014), The opportunity cost of my #OpenScience was 36 hours + $690 (UPDATED) . http://brunalab.org/blog/2014/09/04/the-opportunity-cost-of-my-openscience-was-35-hours-690/

26. Rouder, Jeffrey N., The what, why, and how of born-open data, Behavior Research Methods, vol. 48(2016), p. 1062-1069. http://dx.doi.org/10.3758/s13428-015-0630-z

URLs of mentioned webpages in order of appearance #3

Page 17: A basic course on Research data management: part 1 - part 4

A basic course on Research data management

part 2: protecting and organizing your data

PROOF course Information Literacy and Research Data Management

TU/e, 07-03-2017


Page 18: A basic course on Research data management: part 1 - part 4

Research data management: what was it again

Sharing your data, or making your data findable and accessible with good data practices
→ protecting your data: back up, access control; file naming, organizing data, versioning
+ sharing your data via collaboration platforms and archives

Caring for your data, or making your data re-usable and interoperable with good data practices
+ metadata, tidy data, licenses

Page 19: A basic course on Research data management: part 1 - part 4

Protecting your data: good data practices during your research

“…we can copy everything and do not manage it well.” (Indra Sihar)

Be safe
+ storage, backup (data safety, protecting against loss): use local ICT infrastructure (departmental servers, including SURFdrive) as much as possible
+ access control (data security, protecting against unauthorized use): with DataverseNL for example

Be organized, or: you should be able to tell what’s in a file without opening it
+ file-naming, organizing data in folders, versioning
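A back-up only protects against loss if the copy is identical to the original. The slides do not prescribe a method for checking this; the sketch below assumes a SHA-256 checksum comparison, and the file names are made up for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(original, backup):
    """True when the back-up has the same content as the original."""
    return sha256_of(original) == sha256_of(backup)


# Illustration with a made-up data file in a temporary folder.
workdir = Path(tempfile.mkdtemp())
original = workdir / "measurements.csv"
original.write_text("sample,value\nA,1.0\n", encoding="utf-8")

backup = workdir / "measurements_backup.csv"
shutil.copy2(original, backup)
print(backup_is_intact(original, backup))  # True: the copy is identical

# Simulate a damaged or edited back-up.
backup.write_text("sample,value\nA,9.9\n", encoding="utf-8")
print(backup_is_intact(original, backup))  # False: the content differs
```

Storing the checksum next to the data also gives a simple check on data authenticity later on.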

Page 20: A basic course on Research data management: part 1 - part 4

File-naming #1: be consistent and aim for concise but informative names

How you organize and name your files has a big impact on your ability to find those files later and to understand what they contain. Good file names are consistent (use file-naming conventions), unique (distinguishes a file from files with similar subjects as well as different versions of the file) and meaningful (use descriptive names).

File-naming conventions help you find your data, help others to find your data and help track which version of a file is most current

Avoid using special characters in a file name: \ / : * ? < > | [ ] & $
Use underscores instead of periods or spaces to separate logical elements in a file name
Avoid very long names: usually 25 characters is sufficient length
Names should include all necessary descriptive information independent of where it is stored
Include dates and a version number on files
Add a readme.txt to each folder in which the file naming and its meaning is explained

Source: Best practices for file naming (Stanford University Libraries)
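A file-naming convention is easier to keep consistent when the names are generated rather than typed. The sketch below is one possible convention, not the Stanford recommendation itself: the function name `build_filename`, the element order (date, description, initials, version) and the `v01` version format are assumptions made for illustration.

```python
import re
from datetime import date

# Special characters the slide advises against in file names.
FORBIDDEN = re.compile(r"[\\/:*?<>|\[\]&$]")


def build_filename(when, description, initials, version, extension):
    """Join the logical elements with underscores, e.g. date_description_initials_v01.ext."""
    parts = [when.isoformat(), description, initials, f"v{version:02d}"]
    cleaned = []
    for part in parts:
        part = FORBIDDEN.sub("", part)                # drop special characters
        part = re.sub(r"[\s.]+", "-", part.strip())   # no spaces or periods inside an element
        cleaned.append(part)
    return "_".join(cleaned) + "." + extension.lstrip(".")


name = build_filename(date(2013, 4, 12), "interview recording", "THD", 1, "mp3")
print(name)  # 2013-04-12_interview-recording_THD_v01.mp3
```

Whatever convention is chosen, it should be written down in the readme.txt of the folder.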

Page 21: A basic course on Research data management: part 1 - part 4

File naming #2: think about the ordering of elements within a filename

Order by date:
2013-04-12_interview-recording_THD.mp3
2013-04-12_interview-transcript_THD.docx
2012-12-15_interview-recording_MBD.mp3
2012-12-15_interview-transcript_MBD.docx

Order by subject:
MBD_interview-recording_2012-12-15.mp3
MBD_interview-transcript_2012-12-15.docx
THD_interview-recording_2013-04-12.mp3
THD_interview-transcript_2013-04-12.docx

Order by type:
Interview-recording_MBD_2012-12-15.mp3
Interview-recording_THD_2013-04-12.mp3
Interview-transcript_MBD_2012-12-15.docx
Interview-transcript_THD_2013-04-12.docx

Forced order with numbering:
01_THD_interview-recording_2013-04-12.mp3
02_THD_interview-transcript_2013-04-12.docx
03_MBD_interview-recording_2012-12-15.mp3
04_MBD_interview-transcript_2012-12-15.docx
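The first element of a file name decides how a plain alphabetical listing groups the files. A small sketch with the example names from this slide; the helper `subject_first` is hypothetical and only reorders the three elements.

```python
files = [
    "2013-04-12_interview-recording_THD.mp3",
    "2012-12-15_interview-transcript_MBD.docx",
    "2013-04-12_interview-transcript_THD.docx",
    "2012-12-15_interview-recording_MBD.mp3",
]

# With the date written as YYYY-MM-DD in front, alphabetical order is chronological order.
by_date = sorted(files)


def subject_first(name):
    """Rewrite date_type_subject.ext as subject_type_date.ext."""
    stem, extension = name.rsplit(".", 1)
    day, kind, subject = stem.split("_")
    return f"{subject}_{kind}_{day}.{extension}"


# With the subject in front, the same alphabetical sort groups the files per subject.
by_subject = sorted(subject_first(name) for name in files)

for name in by_date:
    print(name)
for name in by_subject:
    print(name)
```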


Page 22: A basic course on Research data management: part 1 - part 4

File organization


Source: Beatriz Ramirez, Data management plan for the PhD project: development and application of a monitoring system to assess the impacts of climate and land cover changes on eco-hydrological processes in an eastern Andes catchment area

Source: Haselager, dr. G.J.T. (Radboud University Nijmegen); Aken, prof. dr. M.A.G. van (Utrecht University) (2000): Personality and Family Relationships. DANS. http://dx.doi.org/10.17026/dans-xk5-y7vc .

Page 23: A basic course on Research data management: part 1 - part 4

Organizing your data in folders #1: based on the TIER documentation protocol (http://www.projecttier.org/)

1. Main project folder (name of your research project/working title of your paper)
   1.1. Original data and metadata
      1.1.1. Original data
      1.1.2. Metadata
         1.1.2.1. Supplements
   1.2. Processing and analysis files
      1.2.1. Importable data files
      1.2.2. Command files
      1.2.3. Analysis files
   1.3. Documents
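The folder hierarchy above can be created in one go at the start of a project. A minimal sketch, assuming lower-case folder names with hyphens (the TIER protocol describes the structure; these exact names and the function `create_project_skeleton` are illustrative choices).

```python
import tempfile
from pathlib import Path

# Folder names are an assumption; only the hierarchy follows the outline above.
TIER_FOLDERS = [
    "original-data-and-metadata/original-data",
    "original-data-and-metadata/metadata/supplements",
    "processing-and-analysis-files/importable-data-files",
    "processing-and-analysis-files/command-files",
    "processing-and-analysis-files/analysis-files",
    "documents",
]


def create_project_skeleton(root):
    """Create the folder hierarchy under root and return all folders created."""
    root = Path(root)
    for relative in TIER_FOLDERS:
        (root / relative).mkdir(parents=True, exist_ok=True)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())


# Illustration in a temporary location; use your own project folder instead.
project = Path(tempfile.mkdtemp()) / "my_research_project"
created = create_project_skeleton(project)
for folder in created:
    print(folder)
```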

Page 24: A basic course on Research data management: part 1 - part 4

Organizing your data in folders #2: based on the TIER documentation protocol

1. Main project folder (name of your research project/working title of your paper)
   1.1. Original data and metadata
      1.1.1. Original data (raw data, obtained/gathered data)
         Any data that were necessary for any part of the processing and/or analysis you reported in your paper.
         Copies of all your original data files, saved in exactly the format they were in when you first obtained them.
         The name of the original data file may be changed.
         Keep these data read only!
      1.1.2. Metadata
         1.1.2.1. Supplements
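"Keep these data read only" can be enforced at file level. A sketch that removes the write permission from every file in the original-data folder; the function name and the example file are hypothetical, and on shared or networked storage the access rights set by the system administrator may matter more than these file permissions.

```python
import stat
import tempfile
from pathlib import Path


def make_read_only(folder):
    """Remove write permission from every file below folder; return the file names."""
    locked = []
    for path in Path(folder).rglob("*"):
        if path.is_file():
            mode = path.stat().st_mode
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            locked.append(path.name)
    return sorted(locked)


# Illustration with a made-up original data file in a temporary folder.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "original-data"
raw.mkdir()
(raw / "survey_2017-03-07.csv").write_text("id,answer\n1,yes\n", encoding="utf-8")

locked = make_read_only(raw)
print(locked)
```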

Page 25: A basic course on Research data management: part 1 - part 4

Organizing your data in folders #3: based on the TIER documentation protocol

1. Main project folder (name of your research project/working title of your paper)
   1.1. Original data and metadata
      1.1.1. Original data
      1.1.2. Metadata
         The Metadata Guide: document that provides information about each of your original data files. Applies especially to obtained data files.
            A bibliographic citation of the original data files, including the date you downloaded or obtained the original data files and unique identifiers that have been assigned to the original data files.
            Information about how to obtain a copy of the original data file
            Whatever additional information is needed to understand and use the data in the original data file
         1.1.2.1. Supplements
            Additional information about an original data file that’s not written by yourself but that is found in existing supplementary documents, such as users’ guides and code books that accompany the original data file

Page 26: A basic course on Research data management: part 1 - part 4

Organizing your data in folders #4: based on the TIER documentation protocol

1. Main project folder (name of your research project/working title of your paper)
   1.1. Original data and metadata
      1.1.1. Original data
      1.1.2. Metadata
         1.1.2.1. Supplements
   1.2. Processing and analysis files
      1.2.1. Importable data files (the data you work with, input data, suitable for processing and analysis)
         A corresponding version for each of the original data files. This version can be identical to the original version, or in some cases it will be a modified version.
         For example modifications required to allow your software to read the file (converting the file to another format, removing unusable data or explanatory notes from a table)
         The original and importable versions of a data file should be given different names
         The importable data file should be as nearly identical as possible to the original
         The changes you make to your original data files to create the corresponding importable data files should be described in a Readme file
      1.2.2. Command files
      1.2.3. Analysis files
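A sketch of how an importable version could be derived from an original file while the change is recorded in a Readme file. It assumes a text table whose first lines are explanatory notes; the function `make_importable`, the file names and the table content are made up for illustration.

```python
import tempfile
from pathlib import Path


def make_importable(original, importable, readme, skip_rows):
    """Write a copy of original without its first skip_rows lines and log the change."""
    original, importable = Path(original), Path(importable)
    if original.name == importable.name:
        raise ValueError("original and importable versions need different names")
    lines = original.read_text(encoding="utf-8").splitlines()
    importable.write_text("\n".join(lines[skip_rows:]) + "\n", encoding="utf-8")
    with open(readme, "a", encoding="utf-8") as handle:
        handle.write(
            f"{importable.name}: copy of {original.name} with the first "
            f"{skip_rows} explanatory line(s) removed\n"
        )


# Illustration with a made-up table; the original file is left untouched.
workdir = Path(tempfile.mkdtemp())
original = workdir / "rainfall_original.csv"
original.write_text(
    "Table 3: monthly rainfall\nSource: field station\nmonth,mm\njan,80\n",
    encoding="utf-8",
)
importable = workdir / "rainfall_importable.csv"
readme = workdir / "readme.txt"

make_importable(original, importable, readme, skip_rows=2)
print(importable.read_text(encoding="utf-8"))
print(readme.read_text(encoding="utf-8"))
```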

Page 27: A basic course on Research data management: part 1 - part 4

Organizing your data in folders #5: based on the TIER documentation protocol

1. Main project folder (name of your research project/working title of your paper)
   1.1. Original data and metadata
      1.1.1. Original data
      1.1.2. Metadata
         1.1.2.1. Supplements
   1.2. Processing and analysis files
      1.2.1. Importable data files
      1.2.2. Command files
         One or more files containing code written in the syntax of the (statistical) software you use for the study
         Importing phase: commands to import or read the files and save them in a format that suits your software
         Processing phase: commands that execute all the processing required to transform the importable version of your files into the final data files that you will use in your analysis (i.e. cleaning, recoding, joining two or more data files, dropping variables or cases, generating new variables)
         Generating the results: commands that open the analysis data file(s), and then generate the results reported in your paper.
      1.2.3. Analysis files
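The three phases of a command file can be kept apart as three functions. A minimal sketch in Python (the slides refer to statistical software in general); the inline table stands in for an importable data file, and the variable names and values are invented for illustration.

```python
import csv
import io
import statistics

# Stand-in for an importable data file; the values are made up.
IMPORTABLE = "id,height_cm\n1,170\n2,\n3,182\n"


def import_data(text):
    """Importing phase: read the importable file into a list of records."""
    return list(csv.DictReader(io.StringIO(text)))


def process(rows):
    """Processing phase: drop cases with a missing height, generate height in metres."""
    cleaned = []
    for row in rows:
        if row["height_cm"] == "":
            continue  # dropping a case with a missing value
        cleaned.append({"id": int(row["id"]), "height_m": float(row["height_cm"]) / 100})
    return cleaned


def generate_results(rows):
    """Generating the results: compute the figures reported in the paper."""
    return {"n": len(rows), "mean_height_m": statistics.mean(r["height_m"] for r in rows)}


analysis_data = process(import_data(IMPORTABLE))
summary = generate_results(analysis_data)
print(summary)
```

Running the phases from one script, in this order, is what makes the reported results reproducible from the original data.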

Page 28: A basic course on Research data management: part 1 - part 4

Organizing your data in folders #6: based on the TIER documentation protocol

1. Main project folder (name of your research project/working title of your paper)
   1.1. Original data and metadata
      1.1.1. Original data
      1.1.2. Metadata
         1.1.2.1. Supplements
   1.2. Processing and analysis files
      1.2.1. Importable data files
      1.2.2. Command files
      1.2.3. Analysis files
         The fully cleaned and processed data files that you use to generate the results reported in your paper
         The Data Appendix: codebook for your analysis data files: brief description of the analysis data file(s), a complete definition of each variable (including coding and/or units of measurement), the name of the original data files from which the variable was extracted, the number of valid observations for the variable, and the number of cases with missing values
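The counts required in the Data Appendix (valid observations and missing values per variable) can be computed rather than counted by hand. A sketch under the assumption that missing values are coded as an empty field or as NA; the table and variable names are made up.

```python
import csv
import io


def variable_summary(rows, missing=("", "NA")):
    """Count valid and missing observations for every variable."""
    summary = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        n_missing = sum(1 for value in values if value in missing)
        summary[name] = {"valid": len(values) - n_missing, "missing": n_missing}
    return summary


# Stand-in for an analysis data file; the values are made up.
table = "id,age,income\n1,34,2100\n2,NA,1800\n3,51,\n"
rows = list(csv.DictReader(io.StringIO(table)))

appendix = variable_summary(rows)
for name, counts in appendix.items():
    print(f"{name}: {counts['valid']} valid, {counts['missing']} missing")
```

The definitions, coding, units and source file of each variable still have to be written by hand.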

Page 29: A basic course on Research data management: part 1 - part 4

Organizing your data in folders #7: based on the TIER documentation protocol

1. Main project folder (name of your research project/working title of your paper)
   1.1. Original data and metadata
      1.1.1. Original data
      1.1.2. Metadata
         1.1.2.1. Supplements
   1.2. Processing and analysis files
      1.2.1. Importable data files
      1.2.2. Command files
      1.2.3. Analysis files
   1.3. Documents
      An electronic copy of your complete final paper
      The Readme-file for your replication documentation
         What statistical software or other computer programs are needed to run the command files
         Explain the structure of the hierarchy of folders in which the documentation is stored
         Describe precisely any changes you made to your original data files to create the corresponding importable data files
         Step-by-step instructions for using your documentation to replicate the statistical results reported in your paper

Page 30: A basic course on Research data management: part 1 - part 4

1. Best practices for file naming: http://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming

2. File organization: http://www.wageningenur.nl/web/file?uuid=3f974938-79a0-421f-b1ad-95eef49d777c&owner=c057b578-4a6a-4449-881b-17fff17e2f1a (paragraph 6, example 1)

3. File organization: Haselager, dr. G.J.T. , Aken, prof. dr. M.A.G. van (2000): Personality and Family Relationships. DANS. http://dx.doi.org/10.17026/dans-xk5-y7vc (Data guide, p. 24-26)

4. Version control: http://www.data-archive.ac.uk/create-manage/format/versions

5. Storage, back up of data: http://www.data-archive.ac.uk/create-manage/storage

6. Local ICT infrastructure: https://intranet.tue.nl/en/university/services/ict-services/ict-service-catalog/management-services/data-management-storage/ (TU/e intranet)

7. DataverseNL: https://dataverse.nl/dvn/

8. TIER documentation protocol: http://www.projecttier.org/

URLs of mentioned webpages in order of appearance

Page 31: A basic course on Research data management: part 1 - part 4

A basic course on Research data management

part 3: sharing your data

PROOF course Information Literacy and Research Data Management

TU/e, 07-03-2017


Page 32: A basic course on Research data management: part 1 - part 4

Research data management: what was it again

Sharing your data, or making your data findable and accessible with good data practices
+ protecting your data: back up, access control; file naming, organizing data, versioning
→ sharing your data via collaboration platforms and archives

Caring for your data, or making your data re-usable and interoperable with good data practices
+ metadata, tidy data, licenses

Page 34: A basic course on Research data management: part 1 - part 4

Sharing your data: collaboration or sharing platforms (during your research)

DataverseNL [TU/e only]: data sharing platform for active research data [based on Harvard’s Dataverse Project] where you may:
   store your data in an organized and safe way
   clearly describe your data
   apply version control to your data
   arrange access to your data
   get recognition for your data
   [collaborate on your data]

Storage and backup of data through DANS [Data Archiving and Networked Services]
Data transfer: up to 2 GB per dataset
Dataverse via 4TU.ResearchData: up to 50 GB free

Various disciplinary initiatives: Open Science Framework, OpenML, RodRep, CRCNS…

General data sharing platforms:
   SURFdrive [TU/e only]: Dutch academic Dropbox, 100 GB, maximum data transfer 16 GB; every TU/e employee can use SURFdrive
   Google Drive, Dropbox, Beehub…

SURF Filesender [secure data transfer up to 500 GB!, WeTransfer up to 2 GB]

Page 35: A basic course on Research data management: part 1 - part 4

Sharing your data: DataverseNL

If you are interested in using DataverseNL, please contact me (Leon Osinski)

How to create an account:

Go to: https://dataverse.nl/
Click ‘Log in’ (at the top right); under Institutional account click SURFconext
Select Eindhoven University of Technology and log on with your TU/e username and password
When asked for it, give permission to share your data by answering Yes or click this Tab
When asked to create an account, answer Yes or click this Tab
Once you have created an account, your username is the prefix of your email address

You now have a user account with DataverseNL: you can create and publish data sets, upload files and assign access rights to data sets or files. However, before you proceed, contact me (for more options) or first use the demo version: https://act.dataverse.nl

Page 36: A basic course on Research data management: part 1 - part 4

Sharing your data: after your research has ended

On request
“I'd like to thank E.J. Masicampo and Daniel LaLande for sharing and allowing me to share their data…”
Daniël Lakens (2014), What p-hacking really looks like: A comment on Masicampo & LaLande (2012)

On a (personal) website
“Let me start by saying that the reason why I put all excel files online, including all the detailed excel formulas about data constructions and adjustments, is precisely because I want to promote an open and transparent debate about these important and sensitive measurement issues.”
Thomas Piketty, My response to the Financial Times, HuffPost The Blog, 29-05-2014; originally published as Addendum: Response to FT, 28-05-2014

A data journal
Journal of open psychology data, Geoscience data journal, Data in brief, Scientific data, Data reports

Source: www.aukeherrema.nl

Page 37: A basic course on Research data management: part 1 - part 4

Sharing your data: in an established repository (after your research has ended)

Choose a repository where other researchers in your discipline are sharing their data, for example LXcat (for plasma data) or GenBank (for genetic sequence data)

Overview of research data repositories: Re3data.org

Use a repository that at least assigns a persistent identifier to your data (DOI) and requires that you provide adequate metadata

General or multidisciplinary repositories: Zenodo, Figshare, DANS, Dryad, B2SHARE

4TU.ResearchData
+ small to medium-sized data sets, long tail data
+ static data, ‘frozen’ data sets, ‘milestone’ data sets
+ preferably nonproprietary software formats suitable for long-term preservation
+ DOIs [persistent identifiers for citability and retrievability]
+ open access
+ long-term availability, Data Seal of Approval
+ Data Citation Index (Thomson Reuters)
+ self-upload (single data sets < 3 GB)
+ special collections of related data sets

Page 39: A basic course on Research data management: part 1 - part 4

1. DataverseNL: https://www.dataverse.nl/dvn/
2. Harvard’s Dataverse Project: http://dataverse.org/
3. Open Science Framework: https://cos.io/osf/
4. OpenML: http://www.openml.org
5. RodRep: http://www.rodrep.com/
6. CRCNS: http://crcns.org/
7. SURFdrive: https://www.surfdrive.nl/
8. Google Drive: https://www.google.com/drive/
9. Dropbox: https://www.dropbox.com/
10. Beehub: https://beehub.nl/system/
11. SURF filesender: https://filesender.surfnet.nl/
12. Data on request (blog post Daniel Lakens): http://daniellakens.blogspot.nl/2014/09/what-p-hacking-really-looks-like.html
13. Data on personal website (Thomas Piketty): http://piketty.pse.ens.fr/en/capital21c2
14. Data journal: Journal of Open Psychology Data: http://openpsychologydata.metajnl.com/
15. Data journal: Geoscience Data Journal: http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)2049-6060

URLs of mentioned webpages in order of appearance #1

Page 40: A basic course on Research data management: part 1 - part 4

16. Data journal: Data in brief: http://www.journals.elsevier.com/data-in-brief

17. Data journal: Scientific data: http://www.nature.com/sdata/

18. Data journal: Data reports: http://www.frontiersin.org/news/Data_Reports_a_new_type_of_peer-reviewed_article_in_Frontiers_journals/1051?utm_source=FRN&utm_medium=ECOM&utm_campaign=TWT_FRN_1502_datareport

19. Research data catalogue: Re3data.org: http://service.re3data.org/search/results?term=

20. Publishing data: Zenodo: http://www.zenodo.org/

21. Publishing data: Figshare: http://www.figshare.com

22. Publishing data: DANS: http://www.dans.knaw.nl/en

23. Publishing data: Dryad: http://datadryad.org/

24. Publishing data: B2SHARE: https://b2share.eudat.eu/

25. Publishing data: 4TU.ResearchData: https://data.4tu.nl/

26. Long tail research data: http://www.nature.com/neuro/journal/v17/n11/fig_tab/nn.3838_F1.html

27. Nonproprietary software formats: http://datacentrum.3tu.nl/fileadmin/editor_upload/File_formats/Digital_Preservation_Support_levels.pdf

28. Data Seal of Approval: http://www.datasealofapproval.org

URLs of mentioned webpages in order of appearance #2

Page 41: A basic course on Research data management: part 1 - part 4

29. Data Citation Index (Thomson Reuters): http://wokinfo.com/products_tools/multidisciplinary/dci/

30. Self upload 4TU.ResearchData: https://data.4tu.nl/account/login/?next=/upload/

31. Data sets underlying PhD thesis Joos Buijs: http://dx.doi.org/10.4121/uuid:26aba40d-8b2d-435b-b5af-6d4bfbd7a270

32. PhD thesis Joos Buijs: http://dx.doi.org/10.6100/IR780920

URLs of mentioned webpages in order of appearance #3


A basic course on Research data management

part 4: caring for your data, or making data reusable

PROOF course Information Literacy and Research Data Management

TU/e, 07-03-2017

[email protected], TU/e IEC/Library

Available under CC BY-SA license, which permits copying and redistributing the material in any medium or format & adapting the material for any purpose, provided the original author and source are credited & you distribute the adapted material under the same license as the original


Research data management
what was it again

Sharing your data, or making your data findable and accessible with good data practices
+ protecting your data: back up, access control; file naming, organizing data, versioning
+ sharing your data via collaboration platforms and archives

→ Caring for your data, or making your data reusable and interoperable with good data practices
+ metadata, tidy data, licenses

Before data can be reusable, they first have to be usable


What is the nature of the “unusual episode” to which this table refers?


Raw data: https://www.amstat.org/publications/jse/datasets/titanic.dat.txt

Documentation accompanying the data: https://www.amstat.org/publications/jse/datasets/titanic.txt
  Size (number of observations and variables)
  Description
  Provenance
  Variable descriptions

Based on: The "Unusual Episode" Data Revisited / by Robert J. MacG. Dawson, in: Journal of Statistics Education vol. 3 (1995), issue 3


1. Morphological Measurements of Galapagos Finches: http://dx.doi.org/10.5061/dryad.152
  Use of standard names (taxonomy, species)
  Variable names clear enough? WingL must be wing length, but what is N.Ubkl?
  Units of measurement?

Based on: Looking after datasets / by Antony Unwin, 01-09-2015, http://blog.revolutionanalytics.com/2015/09/looking-after-datasets.html

2. Collaborative FAIR data sharing / by Henry Rzepa


The welfare consequences… / by Jonathan J. Cooper et al., http://dx.doi.org/10.1371/journal.pone.0102722

Word.doc

These data are findable and accessible – but usable?


Lessons learned
table structure [ tidy data ]

To allow your data to be easily:
  imported by data management systems;
  analyzed by analysis software; and
  combined with other data (interoperability)
make sure that:
  each row represents a single observation (record)
  each column represents a single variable (parameter) or type of measurement (field)
  every cell contains only one piece of information (no highlighting of cells)
  there is only one table for each type of information (no multiple worksheets)

Cross-tab structure / contingency table: different columns contain measurements of the same variable. This is easier to read, but it makes it difficult to add data (columns) to the records (rows). See the Titanic table versus the Titanic raw data.

“The problem is that people like to view data in a totally different way than a computer likes to process it.” (Kien Leong)
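The difference between a cross-tab and a tidy table can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the table, its column names and its counts are invented for the example (they are not the Titanic figures), and `cross_tab_to_tidy` is a hypothetical helper, not part of any tool mentioned in this course.

```python
import csv
import io

# Hypothetical cross-tab: the columns "survived" and "died" both hold
# measurements of the same variable (a count), split by outcome.
# All numbers are made up for illustration.
CROSS_TAB = """class,survived,died
first,10,5
second,8,7
"""


def cross_tab_to_tidy(text, id_column, variable_name, value_name):
    """Turn every (row, measurement column) cell into its own record."""
    reader = csv.DictReader(io.StringIO(text))
    tidy_rows = []
    for row in reader:
        for column, cell in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row
            tidy_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: int(cell),
            })
    return tidy_rows


tidy = cross_tab_to_tidy(CROSS_TAB, "class", "outcome", "count")
for record in tidy:
    print(record)
```

In the tidy result every row is one observation and every column one variable (class, outcome, count), so a new variable can be added as one extra column without restructuring the table.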


Lessons learned
table metadata: variables (columns) and observations/records (rows)

include a row at the top of each table that contains full column (variable) names (no hard-to-understand abbreviations)
columns: use clear, descriptive variable names and avoid special characters (they can cause problems with some software)
rows: if possible, use standard names within cells (derived from a taxonomy, for example: standard species names, standard date formats, …)
try to avoid coding categorical or ordinal data as numbers
missing data / null values: the best option is to leave the cell blank
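Two of these points, avoiding special characters in variable names and using a blank for missing values, are easy to check with a script. The sketch below uses only the Python standard library. The table contents are invented for illustration, and the rule for a "safe" name (letters, digits and underscores only) is an assumption; it tests for special characters, not for whether a name is understandable.

```python
import csv
import io
import re

# Hypothetical table; the measurement values are invented for illustration.
TABLE = """species,wing_length_mm,beak_depth_mm
Geospiza fortis,68.5,
Geospiza scandens,71.0,9.1
"""

# Assumption: a "safe" variable name starts with a letter and contains
# only letters, digits and underscores.
SAFE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def unsafe_column_names(header):
    """Return the column names that contain special characters."""
    return [name for name in header if not SAFE_NAME.match(name)]


def read_with_blank_as_missing(text):
    """Read a CSV table and represent blank cells as None (missing)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in reader
    ]
    return reader.fieldnames, rows


header, rows = read_with_blank_as_missing(TABLE)
print(unsafe_column_names(header))              # names in the example table
print(unsafe_column_names(["WingL", "N.Ubkl"]))  # names from the finches data
print(rows[0]["beak_depth_mm"])                  # blank cell read as missing
```

A blank cell is unambiguous here: it cannot be mistaken for a real measurement, which a placeholder number could be.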


Lessons learned
data set metadata (documentation), discovery metadata, licenses

size of the data set: number of observations and variables
explanation of the variables, how each was measured and its measurement units (code book)
provenance (origin) of the data, how you collected the data, data manipulation steps (study design)
description of the data set: what’s included and excluded, known problems or inconsistencies in the data, why data are missing
add license information: what are others allowed to do with your data?

a simple readme file can be enough (see documentation Titanic dataset) but not always

“Research outputs that are poorly documented are like canned goods with the label removed (…)” (Carly Strasser)
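Part of such a readme can be generated from the data itself, so that the stated size and the variable list cannot drift away from the table. The following is a minimal sketch using the Python standard library. The data set, the code book entries, the provenance text and the license name are all placeholders invented for illustration, and `build_readme` is a hypothetical helper.

```python
import csv
import io

# Hypothetical data set and code book, invented for illustration.
DATA = """plot_id,tree_height_m,measured_on
P01,12.4,2016-05-03
P02,,2016-05-03
P03,9.8,2016-05-04
"""

CODE_BOOK = {
    "plot_id": "Identifier of the sampled plot",
    "tree_height_m": "Tree height in metres; blank = not measured",
    "measured_on": "Measurement date, ISO 8601 (YYYY-MM-DD)",
}


def build_readme(title, data_text, code_book, provenance, license_name):
    """Build readme text: size, provenance, license, variable descriptions."""
    reader = csv.DictReader(io.StringIO(data_text))
    rows = list(reader)
    variables = reader.fieldnames
    # Refuse to produce a readme that leaves a variable unexplained.
    undocumented = [name for name in variables if name not in code_book]
    if undocumented:
        raise ValueError("No code book entry for: " + ", ".join(undocumented))
    lines = [
        title,
        "",
        f"Size: {len(rows)} observations, {len(variables)} variables",
        f"Provenance: {provenance}",
        f"License: {license_name}",
        "",
        "Variables:",
    ]
    lines += [f"  {name}: {code_book[name]}" for name in variables]
    return "\n".join(lines) + "\n"


readme = build_readme(
    title="Example tree height data set (hypothetical)",
    data_text=DATA,
    code_book=CODE_BOOK,
    provenance="Field measurements, placeholder description",
    license_name="CC BY-SA 4.0 (placeholder choice)",
)
print(readme)
```

The description of what is included and excluded, known problems, and why data are missing still has to be written by hand; a script can only report what is in the table.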


Lessons learned
long term availability

if possible, use a non-proprietary (open) file format, like csv for tabular data (open formats are easier to use in a variety of software)
if possible, take the preferred formats of a data archive into account. See for example the 4TU.ResearchData overview of file formats and types of support: http://researchdata.4tu.nl/en/publishing-research/data-description-and-formats/
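As an illustration of the csv recommendation, the sketch below writes a small table as plain CSV text with the Python standard library. The records and column names are invented for the example, and `to_csv_text` is a hypothetical helper; an in-memory buffer is used so the example runs without touching the file system.

```python
import csv
import io

# Hypothetical records, invented for illustration.
records = [
    {"sample_id": "S1", "temperature_c": 21.5, "note": "clear, no wind"},
    {"sample_id": "S2", "temperature_c": None, "note": ""},
]


def to_csv_text(rows, fieldnames):
    """Serialize rows as CSV with a header row; missing values become blanks."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {name: ("" if row[name] is None else row[name]) for name in fieldnames}
        )
    return buffer.getvalue()


csv_text = to_csv_text(records, ["sample_id", "temperature_c", "note"])
print(csv_text)

# Any CSV-aware software can read the result back, including the quoted comma.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
print(round_trip[0]["note"])
```

To write to disk instead, the same writer can be given a file opened with `open(path, "w", newline="", encoding="utf-8")`.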




Tools
for working with messy data and working reproducibly

OpenRefine
  runs on your computer (not in the cloud), inside the Firefox browser (not in IE); no web connection is needed
  working with OpenRefine: http://www.datacarpentry.org/OpenRefine-ecology-lesson/01-working-with-openrefine.html
  captures all steps done to your raw data; the original dataset is not modified; steps are easily reversed

Tabula
  “… tool for liberating data tables locked inside PDF files.”

Excel vs scripting based software tools
  Excel: processing data through a graphical user interface is bad for data provenance and documentation, because it doesn’t leave a record
  use a scripted language (R (free), Matlab, SAS…) to process data, run the analysis and produce final outputs
  A reproducible workflow (bartomeuslab)

Reusability ≠ reproducibility
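The advantage of scripted processing over manual editing can be sketched as follows: every cleaning step is a named function, the raw data are never modified, and the script records what each step did. This is a minimal illustration in Python (the slide names R, Matlab and SAS; Python is used here only as an example of a scripted language). The raw table and the step functions are invented for the example.

```python
import csv
import io

# Hypothetical raw data with typical problems: stray spaces,
# inconsistent capitalisation and a missing value.
RAW = """site,height_cm
 A ,120
b,
A,135
"""


def load(text):
    return list(csv.DictReader(io.StringIO(text)))


def strip_whitespace(rows):
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def uppercase_site(rows):
    return [{**row, "site": row["site"].upper()} for row in rows]


def drop_missing_height(rows):
    return [row for row in rows if row["height_cm"] != ""]


# The ordered list of steps is itself the documentation of the processing.
STEPS = [strip_whitespace, uppercase_site, drop_missing_height]


def run_pipeline(raw_rows, steps):
    """Apply each step to a copy of the data and log the row counts."""
    rows, log = raw_rows, []
    for step in steps:
        before = len(rows)
        rows = step(rows)  # each step returns new rows; input is untouched
        log.append(f"{step.__name__}: {before} -> {len(rows)} rows")
    return rows, log


raw_rows = load(RAW)
clean_rows, log = run_pipeline(raw_rows, STEPS)
for entry in log:
    print(entry)
```

Rerunning the script on the same raw file gives the same cleaned table, and removing a step from `STEPS` reverses it, which is the same property the slide attributes to OpenRefine.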


Support

Data Coach [ website ]
TU/e data librarians ([email protected])
Leon Osinski, Sjef Öllers

Recommended reading
Van den Eynden, Veerle et al. (2011), Managing and sharing data: best practice for researchers, UK Data Archive
Strasser, Carly (2015), Research data management, NISO

Recommended online course
Essentials 4 data support [English & Dutch]


URLs of mentioned webpages
in order of appearance #1

1. Overview research data storage services: http://dataservices.silk.co/
2. Raw Titanic data: https://www.amstat.org/publications/jse/datasets/titanic.dat.txt
3. Documentation to Titanic data: https://www.amstat.org/publications/jse/datasets/titanic.txt
4. The “Unusual Episode” Data Revisited: https://www.amstat.org/publications/jse/v3n3/datasets.dawson.html
5. Morphological Measurements of Galapagos Finches: http://dx.doi.org/10.5061/dryad.152
6. Looking after data sets: http://blog.revolutionanalytics.com/2015/09/looking-after-datasets.html
7. Collaborative FAIR data sharing: http://www.ch.imperial.ac.uk/rzepa/blog/?p=16292
8. The welfare consequences… : http://dx.doi.org/10.1371/journal.pone.0102722
9. Tidy data: http://vita.had.co.nz/papers/tidy-data.pdf
10. Data guide example: http://dx.doi.org/10.17026/dans-xk5-y7vc
11. Preferred data formats of 4TU.ResearchData: http://researchdata.4tu.nl/en/publishing-research/data-description-and-formats/
12. Excel: http://production-scheduling.com/seven-deadly-spreadsheet-sins/
13. R: https://www.r-project.org/


URLs of mentioned webpages
in order of appearance #2

14. Bartomeuslab, A reproducible workflow: https://youtu.be/s3JldKoA0zw
15. OpenRefine: http://openrefine.org/
16. Working with OpenRefine: http://www.datacarpentry.org/OpenRefine-ecology-lesson/01-working-with-openrefine.html
17. TU/e Data Coach: http://www.tue.nl/datacoach
18. Carly Strasser, Research data management: http://www.niso.org/apps/group_public/download.php/15375/PrimerRDM-2015-0727.pdf
19. Online course ‘Essentials for data support’: http://datasupport.researchdata.nl/en/