Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility
Nicole Nogoy, eResearch NZ, 2 July 2014
Open-Access, Open-Data, Open-Source, Open-Review
What can be achieved?
It's all about the re-use
To do this, everything needs to be free and accessible, to be read by humans & machines*
* See: http://www.biomedcentral.com/about/datamining
Take home message:
Challenges/Opportunities in the Data-Driven Era
Enables:
• Quick response to climate change, food security & disease outbreaks
• Using the networking power of the internet to tackle problems
• Can ask new questions & find hidden patterns & connections
• Build on each other's efforts quicker & more efficiently
• More collaborations across more disciplines
• Harness wisdom of the crowds: crowdsourcing, citizen science, crowdfunding
Enabled by: removing silos, standards/formats, open-access/data
Challenges:
Not enabled by: paywalls, silos, dead trees
• "Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods which support the scholarship, remain largely inaccessible." (Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995)
• Lack of transparency, lack of credit for anything other than “regular” dead tree publication
• If there is interest in data, only to monetise & repackage
Problem: growing replication gap
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results from 10 could not be reproduced
Growing issue: increasing number of retractions
>15X increase in last decade
Strong correlation of “retraction index” with higher impact factor
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract
At the current rate of increase, by 2045 as many papers will be retracted as published!
How?
GigaSolution: Deconstructing the paper
www.gigadb.org
www.gigasciencejournal.com
Utilizes big-data infrastructure and expertise from:
Combines and integrates:
• Open-access journal
• Data Publishing Platform
• Data Analysis Platform
• Data
• Software
• Review
• Re-use…
= Credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
New incentives/credit
Anatomy of a Publication (diagram labels): Data, Idea, Study, Analysis, Answer, Metadata
Anatomy of a Data Publication (diagram labels): Data, Idea, Study, Analysis, Answer, Metadata
Submission Workflow
1. Submitter logs in to GigaDB website and uploads Excel submission file
2. Validation checks
   • Fail: submitter is provided error report
   • Pass: dataset is uploaded to GigaDB
3. Curator review
4. DOI assigned; DataCite XML file is generated and registered with DataCite
5. Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues)
6. Submitter provides files by ftp or Aspera
7. Curator makes dataset public (can be set as future date if required)
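The validation and DataCite registration steps of the submission workflow can be sketched roughly as below. This is a minimal illustration, not GigaDB's actual code: the field names, the `validate_submission` and `build_datacite_xml` helpers, and the simplified XML layout (no namespace, only the core DataCite properties) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names for a parsed submission spreadsheet row.
REQUIRED_FIELDS = ("title", "creators", "publisher", "year")


def validate_submission(record):
    """Return a list of error messages; an empty list means the record passes."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS
              if not record.get(name)]
    year = str(record.get("year", ""))
    if year and not (year.isdigit() and len(year) == 4):
        errors.append("year must be a four-digit number")
    return errors


def build_datacite_xml(record, doi):
    """Build a simplified DataCite-style XML string for a validated record."""
    root = ET.Element("resource")
    identifier = ET.SubElement(root, "identifier", identifierType="DOI")
    identifier.text = doi
    creators = ET.SubElement(root, "creators")
    for name in record["creators"]:
        creator = ET.SubElement(creators, "creator")
        ET.SubElement(creator, "creatorName").text = name
    titles = ET.SubElement(root, "titles")
    ET.SubElement(titles, "title").text = record["title"]
    ET.SubElement(root, "publisher").text = record["publisher"]
    ET.SubElement(root, "publicationYear").text = str(record["year"])
    return ET.tostring(root, encoding="unicode")


def process_submission(record, doi):
    """Fail with an error report, or pass and return the metadata XML."""
    errors = validate_submission(record)
    if errors:
        return {"status": "fail", "errors": errors}
    return {"status": "pass", "xml": build_datacite_xml(record, doi)}


if __name__ == "__main__":
    record = {
        "title": "Genomic data from Escherichia coli O104:H4 isolate TY-2482",
        "creators": ["Li, D", "Xi, F"],  # author list truncated for brevity
        "publisher": "BGI Shenzhen",
        "year": 2011,
    }
    print(process_submission(record, "10.5524/100001"))
```

A real deposit would also need to follow the full DataCite metadata schema and be registered through DataCite's services, which this sketch does not attempt.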
Public GigaDB dataset
DOI 10.5524/100003: Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
See: http://database.oxfordjournals.org/content/2014/bau018.abstract
GigaScience Data Publishing Platform
Currently 120 datasets & ~50 TB of data
• Data from: BGI, ACRG, G10K, Bird10K, 3K Rice Genomes
• Provide curation & integration with other DBs (INSDC databases)
Many data types…
BGI Datasets Get DOIs
Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum; Wheat A+B; Rice
Microbe/metagenomics: E. coli O104:H4 TY-2482; T2D gut metagenome; Bulk pooled insects; T. tengcongensis proteome
Cell-Lines: Chinese Hamster Ovary; Mouse methylomes; Cancer quantitative proteomics
Human: Asian individual (YH) (DNA Methylome, Genome Assembly v1+2, Transcriptome); Cancer (14TB); Single cell bladder cancer; HBV infected exomes; Ancient DNA (Saqqaq Eskimo, Aboriginal Australian)
Vertebrates: Darwin’s Finch; Giant panda; Macaque (Chinese rhesus, Crab-eating); Mini-Pig; Naked mole rat; Parrot, Puerto Rican; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; DA and F344 rats; Sheep; Tibetan antelope
Other: fMRI & Retinal waves
Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Schistosoma; Silkworm; Parasitic nematode; Pacific oyster
Key: Released pre-publication; Paper published in GigaScience
Cloud solutions?
Reward better handling of metadata…
Novel tools/formats for data interoperability/handling.
Examples
Our first DOI:
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
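The dataset citation appears both as a `doi:` string and as a resolver URL. A small sketch of how such forms might be normalized to a bare DOI and a resolver link follows. The helper names `normalize_doi` and `doi_url` are made up for this example, and the regular expression is only a common heuristic for DOI shape, not an official grammar.

```python
import re

# Heuristic shape check: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that commonly wrap a DOI in citations and links.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(text):
    """Strip a known prefix and return the bare DOI, or raise ValueError."""
    doi = text.strip()
    lowered = doi.lower()
    for prefix in KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {text!r}")
    return doi


def doi_url(text):
    """Return a resolver URL on doi.org for any accepted DOI form."""
    return "https://doi.org/" + normalize_doi(text)


if __name__ == "__main__":
    print(normalize_doi("doi:10.5524/100001"))
    print(doi_url("http://dx.doi.org/10.5524/100001"))
```

Normalizing to one form makes it easier to count citations of the same dataset across papers, which is the kind of attribution the talk argues for.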
IRRI GALAXY
Beneficiaries of the genomics revolution?
Rice 3K project: 3,000 rice genomes, 13.4TB public data
Collaborations with Pensoft & PLOS
Cyber-centipedes & virtual worms
SOURCE
USE/REUSE
PUBLISH
INTEGRATION WITH DOMAIN-SPECIFIC DATABASES VIA ISA-TOOLS
NARRATIVE DATA
(SOCIAL) MEDIA
DATA PRODUCTION
Sneddon, T.P., Zhe, X.S., Edmunds, S.C., et al. GigaDB: promoting data dissemination and reproducibility. Database (2014) Vol. 2014: article ID bau018; doi:10.1093/database/bau018
Disseminating new types of data
How are we supporting data reproducibility?
Data sets: linked to DOI
Analyses: linked to DOI
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
~21,000 accesses
Open-Code
8 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines, Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/
~21,000 downloads
Enabled code to be picked apart by bloggers in wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2
New & more transparent peer-review: the GigaScience way
• 8 referees downloaded & tested data, then signed reports
• Real-time open-review = paper in arXiv + blogged reviews
Implement workflows in a community-accepted format
http://galaxyproject.org
Over 36,000 main Galaxy server users
Over 1000 papers citing Galaxy use
Over 55 Galaxy servers deployed
Open source
Visualizations & DOIs for workflows
SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk
Implemented entire workflow in our Galaxy server, inc.:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools
Also will be available to download by >36K Galaxy users in
Screenshot labels: Tool list, Tool parameterisation, Results panel
GigaGalaxy & Metabolomics
Rewarding and aiding reproducibility
OMERO: providing access to imaging data…
Changing the way we publish:
“Deconstructed” Journal
“Regular” Journal
“Conscientious” Online Journal
Thanks to:
Team: Peter Li, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman, Rob Davidson, Amye Kenall (BMC)
Our collaborators: Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)
Case study: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)
CBIIT
Funding from:
@gigascience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com