scott edmunds: data dissemination in the era of "big-data"
DESCRIPTION
Scott Edmunds' talk at the ICG Europe meeting in Copenhagen on Data Dissemination in the era of "Big-Data", May 24th 2012
TRANSCRIPT
![Page 1: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/1.jpg)
Scott Edmunds
Data dissemination in the era of “big data”
www.gigasciencejournal.com
William Gibson: "Information is the currency of the future world”
Sir Tim Berners-Lee: "Data is a precious thing and will last longer than the systems themselves”
ICG-Europe Meeting, 24th May 2012
Image: s-ariga cc/Flickr
![Page 2: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/2.jpg)
Is data “the new oil”?
Data Bonanza?
Data Deluge?
1.2 zettabytes (1.2 × 10^21 bytes) of electronic data generated each year [1]
[1] Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.
![Page 3: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/3.jpg)
Global Sequencing Capacity
Data Production 5.6 Tb / day
> 1500X of human genome / day
Multiple Supercomputing Centers 157 TFlops
20 TB Memory
14.7 PB Storage
![Page 4: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/4.jpg)
BGI Sequencing Capacity
Data Production 5.6 Tb / day
> 1500X of human genome / day
Multiple Supercomputing Centers 157 TFlops
20 TB Memory
14.7 PB Storage
Sequencers
• 137 Illumina/HiSeq 2000
• 27 LifeTech/SOLiD 4
• 1 454 GS FLX+
• 2 Illumina iScan
• 1 Illumina MiSeq
• 1 Ion Torrent
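The "> 1500X of human genome / day" figure can be sanity-checked with one division. This is a minimal sketch, assuming that "Tb" on the slide means terabases (10^12 bases) and that a haploid human genome is roughly 3.1 gigabases; neither value is stated on the slide, and the function name is illustrative.

```python
# Rough check of the "> 1500X of human genome / day" figure.
# Assumptions (not stated on the slide): "Tb" = terabases (1e12 bases),
# and a haploid human genome of about 3.1e9 bases.
DAILY_OUTPUT_BASES = 5.6e12
HUMAN_GENOME_BASES = 3.1e9


def genome_equivalents_per_day(daily_bases: float, genome_bases: float) -> float:
    """Return how many genome-lengths of sequence are produced per day."""
    return daily_bases / genome_bases


fold = genome_equivalents_per_day(DAILY_OUTPUT_BASES, HUMAN_GENOME_BASES)
print(f"{fold:.0f}x human genome per day")
```

Under these assumptions the result is about 1,800-fold, which is consistent with the slide's "> 1500X".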
![Page 5: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/5.jpg)
www.gigasciencejournal.com
Large-Scale Data: Journal/Database/Platform
Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD
Lead BioCurator: Tam Sneddon, DPhil
Data Platform: Peter Li, PhD
In conjunction with:
Now taking submissions…
![Page 6: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/6.jpg)
Data-data everywhere?
![Page 7: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/7.jpg)
Data Silos
©$
Interoperability
Paywalls
Metadata
![Page 8: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/8.jpg)
?
There are many hurdles…
![Page 9: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/9.jpg)
There are many hurdles…
Technical:
• too large volumes
• too heterogeneous
• no home for many data types
• too time consuming
Cultural:
• inertia
• no incentives to share
• unaware of how
![Page 10: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/10.jpg)
Technical challenges…
Cloud solutions?
Better handling of metadata…
Novel tools/formats for data interoperability/handling.
![Page 11: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/11.jpg)
Data quality assessment
Tools making work more easily reproducible…
Workflows
Interoperability/Ease of use
Technical challenges…
![Page 12: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/12.jpg)
Cloud?
More efficient handling of data…
Do we need to keep everything?
Compression?
Technical challenges…
![Page 13: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/13.jpg)
![Page 14: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/14.jpg)
![Page 15: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/15.jpg)
Cultural challenges…
![Page 16: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/16.jpg)
Data Re-use
($)
Effort
Usability
![Page 17: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/17.jpg)
Need to lower the hurdles…
($)
Effort
Usability
![Page 18: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/18.jpg)
Better incentives?
($)
Effort
Usability
![Page 19: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/19.jpg)
Incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
Prepublication data sharing (Toronto International Data Release Workshop)
“Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)
![Page 20: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/20.jpg)
Datacitation: DataCite and DOIs
Digital Object Identifiers (DOIs) offer a solution
• Most widely used identifier for scientific articles
• Researchers, authors, publishers know how to use them
• Put datasets on the same playing field as articles
Dataset: Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
Aims to:
“increase acceptance of research data as legitimate, citable contributions to the scholarly record”.
“data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
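A dataset DOI is used the same way as an article DOI: strip the `doi:` prefix and append the rest to a resolver address. The sketch below illustrates this with the PANGAEA example from the slide. It assumes the public `https://doi.org/` resolver; `doi_to_url` and `format_dataset_citation` are hypothetical helper names, not part of any DataCite tooling.

```python
from urllib.parse import quote

# Public DOI resolver; any DOI appended to it redirects to the landing page.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn 'doi:10.1594/PANGAEA.587840' into a resolvable URL."""
    bare = doi.strip()
    if bare.lower().startswith("doi:"):
        bare = bare[4:]
    # Keep '/' unescaped: it separates the registrant prefix from the suffix.
    return DOI_RESOLVER + quote(bare, safe="/")


def format_dataset_citation(authors, year, title, publisher, doi) -> str:
    """Hypothetical helper: lay out a dataset reference like the slide's example."""
    return f"{authors} ({year}). {title}. {publisher}. {doi_to_url(doi)}"


print(format_dataset_citation(
    "Yancheva et al", 2007, "Analyses on sediment of Lake Maar",
    "PANGAEA", "doi:10.1594/PANGAEA.587840",
))
```

Because the citation carries a resolvable identifier, a reference list entry for a dataset can be handled by the same tools that already process article DOIs.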
![Page 21: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/21.jpg)
Datacitation: DataCite and DOIs
Central metadata repository:
• >1 million entries to date
• Stability
• Data discoverability
• Open & harvestable
• Potential to track & credit use
![Page 22: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/22.jpg)
www.gigasciencejournal.com
Data publishing/DOI
New journal format combines standard manuscript publication with an extensive database to host all associated data, and integrated tools.
Data hosting will follow standard funding agency and community guidelines.
DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking.
![Page 23: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/23.jpg)
www.gigaDB.org
Data Publishing
![Page 24: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/24.jpg)
BGI Datasets Get DOI®s
doi:10.5524/100004
Plants
• Chinese cabbage
• Cucumber
• Foxtail millet
• Pigeonpea
• Potato
• Sorghum
Microbe
• E. coli O104:H4 TY-2482
Cell-Line
• Chinese Hamster Ovary
Human
• Asian individual (YH): DNA Methylome, Genome Assembly, Transcriptome
• Cancer (14TB)
• Ancient DNA: Saqqaq Eskimo, Aboriginal Australian
Vertebrates
• Giant panda
• Macaque: Chinese rhesus, Crab-eating
• Mini-Pig
• Naked mole rat
• Penguin: Emperor penguin, Adelie penguin
• Pigeon, domestic
• Polar bear
• Sheep
• Tibetan antelope
Invertebrate
• Ant: Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant
• Roundworm
• Schistosoma
• Silkworm
Many released pre-publication…
![Page 25: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/25.jpg)
For data citation to work, needs:
• Proven utility/potential user base.
• Acceptance/inclusion by journals.
• Data+Citation: inclusion in the references.
• Tracking by citation indexes.
• Usage of the metrics by the community…
![Page 26: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/26.jpg)
Data+Citation: inclusion in the references
![Page 27: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/27.jpg)
• Data submitted to NCBI databases:
  - Raw data: SRA:SRA046843
  - Assemblies of 3 strains: Genbank:AHAO00000000-AHAQ00000000
  - SNPs: dbSNP:1056306
  - CNVs, InDels, SV: dbGAP:nstd63
• Submission to public databases complemented by its citable form in GigaDB.
![Page 28: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/28.jpg)
![Page 29: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/29.jpg)
In the references…
![Page 30: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/30.jpg)
Is the DOI…
![Page 31: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/31.jpg)
![Page 32: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/32.jpg)
And now in Nature Biotech…
![Page 33: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/33.jpg)
Datacitation: tracking?
Plans in 2012 to link central metadata repository with WoS
- Will finally track and credit use!
To be continued…
DataCite metadata in harvestable form (OAI-PMH)
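"Harvestable" here refers to OAI-PMH, a simple HTTP protocol in which a harvester sends a verb such as `ListRecords` and receives XML records back. The following is a minimal sketch of building such a request and reading identifiers from a response. The endpoint address is an assumption to be checked against DataCite's documentation, the helper names are illustrative, and `SAMPLE_RESPONSE` is a hand-written stand-in rather than real repository output.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint; check DataCite's documentation for the current address.
BASE_URL = "https://oai.datacite.org/oai"
# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL; oai_dc is the format every repository must offer."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def record_identifiers(response_xml):
    """Extract the header identifier of each record in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]


# Hand-written illustrative response, not actual DataCite output.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>doi:10.5524/100001</identifier></header></record>
    <record><header><identifier>doi:10.5524/100004</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url(BASE_URL))
print(record_identifiers(SAMPLE_RESPONSE))
```

A real harvester would also follow the `resumptionToken` element that OAI-PMH uses to page through large result sets; that step is omitted here for brevity.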
![Page 34: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/34.jpg)
![Page 35: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/35.jpg)
Final step: open licensing
![Page 36: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/36.jpg)
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
Our first DOI:
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
![Page 37: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/37.jpg)
![Page 38: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/38.jpg)
![Page 39: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/39.jpg)
![Page 40: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/40.jpg)
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”
Other consequences: speed/legal-freedom
![Page 41: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/41.jpg)
The era of the data consumer?
![Page 42: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/42.jpg)
The era of the data consumer?
?
![Page 43: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/43.jpg)
The era of the data consumer?
?
Free access to data – but analysis hubs/nodes will form around it
![Page 44: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/44.jpg)
GDSAP: Genomic Data Submission and Analytical platform
Big data from the “Sequencing Oil Field”
Data, Data, Data…
Data Modeling
Pipeline design
Validation
Commercial applications
“Apps”
Tin-Lap Lee, CUHK
![Page 45: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/45.jpg)
GDSAP: Genomic Data Submission and Analytical platform
![Page 46: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/46.jpg)
GDSAP: Genomic Data Submission and Analytical platform
mirror/open platform
![Page 47: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/47.jpg)
Papers in the era of big-data
To review: (>6TBp, >1500 datasets)
S3 = $15,000
EC2 (BLASTx) = $500,000
$1000 genome = million $ peer-review?
Source: Folker Meyer/Wilkening et al. 2009, CLUSTER'09. IEEE International Conference on Cluster Computing and Workshops
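The two cloud figures on this slide can be combined into a rough total and a per-dataset cost. This is only a sketch of the arithmetic: it uses the slide's dollar amounts as given and takes 1500 as the dataset count, which the slide states as a lower bound, so the per-dataset value is an upper bound. The variable names are illustrative.

```python
# Figures as given on the slide (attributed to Wilkening et al. 2009).
S3_STORAGE_USD = 15_000
EC2_BLASTX_USD = 500_000
# The slide says "> 1500 datasets", so dividing by 1500 gives an upper bound.
MIN_DATASETS = 1500

total_usd = S3_STORAGE_USD + EC2_BLASTX_USD
per_dataset_usd = total_usd / MIN_DATASETS

print(f"total: ${total_usd:,}")
print(f"at most about ${per_dataset_usd:,.0f} per dataset")
```

The total comes to $515,000, roughly half a million dollars, which is the scale behind the slide's question about "million $ peer-review".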
![Page 48: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/48.jpg)
Papers in the era of big-data
Analysis Data
Tools/Workflows
Compute
goal: Executable Research Objects
Citable DOI
![Page 49: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/49.jpg)
Papers in the era of big-data
Interested in Reproducible Research?
Take part in our session on: “Cloud and workflows for reproducible bioinformatics”
• Rapid review/Open Access/High-visibility
• Article Processing Charge covered by BGI
• Hosting of any test datasets/workflows in GigaDB
Submit to:
![Page 50: Scott Edmunds: Data Dissemination in the era of "Big-Data"](https://reader036.vdocuments.mx/reader036/viewer/2022062617/54c8ae4e4a79594b1c8b45cc/html5/thumbnails/50.jpg)
www.gigasciencejournal.com
Thanks to:
Laurie Goodman, Alexandra Basford, Tam Sneddon, Peter Li, Shaoguang Liang, Tin-Lap Lee (CUHK), Qiong Luo (HKUST)
Contact us:
Follow us:
@gigascience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/