Big Data Standards - Workshop, ExpBio, Boston, 2015
TRANSCRIPT
![Page 1: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/1.jpg)
Big Data Standards: how to set the bar?
Susanna-Assunta Sansone, PhD
@biosharing @isatools
Experimental Biology, Big Data Workshop, 28 March, 2015
Data Consultant, Honorary Academic Editor
Associate Director, Principal Investigator
http://www.slideshare.net/SusannaSansone
![Page 2: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/2.jpg)
Credit to: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/
![Page 3: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/3.jpg)
A community mobilization for “openness”
![Page 4: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/4.jpg)
Is open data understandable, reusable?
“Reproducing the method took several months of effort, and required using new versions and new software that posed challenges to reconstructing and validating the results”
![Page 5: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/5.jpg)
Is open data understandable, reusable? Not always…but why?
• Outputs are multi-dimensional, diverse, and not always well cited or stored
• Software, code, workflows etc. are hard(er) to get hold of
• Data are often distributed and fragmented to fit (siloed) databases
o They may not contain enough information for others to understand them
• Uneven level of detail and annotation across different databases
o Specialized, generalist, public and institutional
• Data curation activities are perceived as time consuming
o Collection and harmonization of detailed methods and experimental steps is done/rushed at publication stage
![Page 6: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/6.jpg)
Not just open, but FAIR data
![Page 7: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/7.jpg)
Responsibilities lie across several stakeholder groups
Understand the benefits of sharing FAIR datasets and enact them
Engage and assist researchers to enable them to share FAIR datasets
Release or endorse practices and policies, but also incentive and credit mechanisms for researchers, curators and developers
![Page 8: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/8.jpg)
Rise of a data-centric enterprise, e.g.:
![Page 9: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/9.jpg)
Not just data, but FAIR digital research objects
![Page 10: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/10.jpg)
• We need to report sufficient information to reuse the dataset
• We must strike a balance between depth and breadth of information
Without context data is meaningless
![Page 11: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/11.jpg)
Information intensive experiments
• Not too much • Not too little • But just right
![Page 12: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/12.jpg)
And conversely….
LS1_C2_LD_TP2_P1    file1.gz
![Page 13: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/13.jpg)
…how not to report the experimental information

| Sample name (?!) | Data file |
| --- | --- |
| LS1_C2_LD_TP2_P1 | file1.gz |

• LS1: liver sample 1
• C2: compound 2
• LD: low dose
• TP2: time point 2
• P1: protocol 1
• file1.gz: compressed data file with phenotypic and other information on this sample
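
To make the problem concrete, here is a minimal Python sketch of what a reuser would need in order to interpret the sample name on this slide. The lookup table only restates the key given on the slide; the function name `decode_sample_name` and the output field names (`sample`, `compound`, and so on) are hypothetical, chosen purely for illustration.

```python
# Key restated from the slide. Field names ("sample", "dose", ...) are
# hypothetical; without this key the sample name cannot be interpreted.
CODE_KEY = {
    "LS1": ("sample", "liver sample 1"),
    "C2": ("compound", "compound 2"),
    "LD": ("dose", "low dose"),
    "TP2": ("time_point", "time point 2"),
    "P1": ("protocol", "protocol 1"),
}


def decode_sample_name(name: str) -> dict:
    """Expand an underscore-delimited sample name into explicit fields."""
    annotations = {}
    for token in name.split("_"):
        if token not in CODE_KEY:
            # An undocumented code is exactly the failure mode the slide warns about.
            raise KeyError(f"no definition for code {token!r}")
        field, meaning = CODE_KEY[token]
        annotations[field] = meaning
    return annotations


record = decode_sample_name("LS1_C2_LD_TP2_P1")
record["data_file"] = "file1.gz"
print(record)
```

The sketch only works because the key is available; publishing the explicit fields alongside the data file removes that dependency.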
![Page 14: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/14.jpg)
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project
• To make any dataset ‘FAIR’, one must have standards, tools and best practices to:
§ report sufficient details
§ capture all salient features of the experimental workflow
§ make annotation explicit and discoverable
§ structure the descriptions for consistency
§ ensure/regulate access
§ deposit and publish
§ etc.
![Page 15: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/15.jpg)
…breadth and depth of the experimental context
…is pivotal
…and has to be both human and machine readable
![Page 16: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/16.jpg)
nature.com/scientificdata
A new category of publication that provides detailed descriptors of scientifically valuable datasets. They are a highly effective link between traditional research articles and data repositories.
Introducing the Data Descriptor
![Page 17: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/17.jpg)
Research papers
Data records
Data Descriptors
To add value to research articles and data records
![Page 18: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/18.jpg)
Experimental metadata or structured component (in-house curated, machine-readable format)
Article or narrative component (PDF and HTML)
Data Description narrative and structured components
![Page 19: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/19.jpg)
A curated, structured component - why?
• Supplements the scientific discourse
o natural language has a degree of ambiguity
• Brings clarity in reporting research methods and procedures
o no trimming, no cooking
o clear links from samples to data files, and their relation to methods
• Provides the basis for search and discovery features
[Figure: the structured content of multiple SciData DDs, linked by shared attributes (same tissue, same organism, same assay) and to community data repositories]
![Page 20: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/20.jpg)
Seven week old C57BL/6N mice were treated with low-fat diet.
Liver was dissected out, hepatocytes prepared…
From natural language to ‘computable’ concepts
Data Curation Editor
Responsible for creating the structured component, ensuring that the most appropriate metadata is being captured.
![Page 21: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/21.jpg)
![Page 22: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/22.jpg)
Age value Unit
Strain name Subject of the experiment
Type of diet and experimental condition Anatomy part
Seven week old C57BL/6N mice were treated with low-fat diet.
Liver was dissected out, hepatocytes prepared …
From natural language to ‘computable’ concepts
Type of protocol: cell preparation
Type of protocol: sample treatment
Type of protocol: liver preparation
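
As an illustration of what "computable" could mean here, the sketch below encodes the example sentence as a structured Python record and runs a simple query against it. The field names are adapted from the labels on the slide; pairing each label with a particular phrase in the sentence is an assumed reading, and the helper `find_records` is hypothetical, not part of any tool mentioned in the talk.

```python
import json

# Hypothetical structured record. Field names follow the slide's labels;
# the pairing of labels with phrases from the sentence is an assumption.
structured = {
    "subject_of_the_experiment": "mice",
    "strain_name": "C57BL/6N",
    "age": {"value": 7, "unit": "week"},
    "diet_and_experimental_condition": "low-fat diet",
    "anatomy_part": "liver",
    "protocols": [
        {"type": "sample treatment"},
        {"type": "liver preparation"},
        {"type": "cell preparation"},
    ],
}


def find_records(records, **criteria):
    """Return records whose top-level fields equal every given criterion."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in criteria.items())
    ]


matches = find_records([structured], strain_name="C57BL/6N", anatomy_part="liver")
print(json.dumps(matches, indent=2))
```

Once the description is held in explicit fields, a question such as "which datasets used liver from this strain?" becomes an exact match rather than a free-text search.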
![Page 23: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/23.jpg)
Including minimum information reporting requirements, or checklists to report the same core, essential information
Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’
Including conceptual model, conceptual schema from which an exchange format is derived to allow data to flow from one system to another
Community-developed content standards
To structure and enrich the description of datasets, facilitating understanding, sharing and reuse
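
A rough sketch of how two of these standard types might work together in practice: a checklist names the fields every record must carry, and a controlled vocabulary restricts the values a field may take. The required fields and the term list below are invented for illustration; real community standards define their own.

```python
# Hypothetical checklist and vocabulary, for illustration only; real
# community standards define their own required fields and terms.
REQUIRED_FIELDS = {"organism", "tissue", "assay"}
CONTROLLED_TERMS = {"tissue": {"liver", "kidney", "heart"}}


def check_record(record: dict) -> list:
    """Return a list of problems found; an empty list means the record passes."""
    problems = [
        f"missing required field: {field}"
        for field in sorted(REQUIRED_FIELDS - set(record))
    ]
    for field, allowed in CONTROLLED_TERMS.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}={value!r} is not a controlled term")
    return problems


# "Liver" with a capital L is not in the vocabulary, and "assay" is absent.
print(check_record({"organism": "Mus musculus", "tissue": "Liver"}))
```

The checklist catches what is missing, while the vocabulary catches two records that mean the same thing but spell it differently.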
![Page 24: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/24.jpg)
de jure de facto
grass-roots groups
standard organizations
Community mobilization, some examples
• Structural and operational differences
§ organization types (open, closed to members, society, WG etc.)
§ standards development (how to formulate, conduct and maintain)
§ adoption, uptake, outreach (link to journals, funders and commercial sector)
§ funds (sponsors, memberships, grants, volunteering)
![Page 25: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/25.jpg)
~ 156
~ 70
~ 334
MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE…, REMARK, CONSORT
MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML…, GELML, ISA-Tab, CML, MITAB
AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO…, TEDDY, PRO, XAO, DO, VO
In the life sciences… almost 600
Databases, annotation, curation tools implementing standards
![Page 26: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/26.jpg)
A web-based, curated and searchable registry ensuring that standards are registered, informative and discoverable; monitoring their development and evolution and their use in databases, and the adoption of both in data policies.
Launched Jan 2011
![Page 27: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/27.jpg)
Core functionalities:
• search and filtering, e.g. by funder
• submission forms to add new records
• “claim” functionality of existing records
• person’s profile (as maintainer of records) associated to the ORCID profile (for credit, as incentive)
• visualization and views of content
Search, filter, claim, view and more
![Page 28: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/28.jpg)
Assists users to make informed decisions
![Page 29: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/29.jpg)
Advisory Board and Working Group - core members and adopters
Operational Team
![Page 30: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/30.jpg)
Standards as an area of research - still a lot to do! E.g.:
1. Create relation or “usage maps and guides”, e.g.: the relationship among popular standard formats for pathway information (Demir, et al., The BioPAX community standard for pathway data sharing, Nat Biotech. 2010)
2. Metrics of maturity, usability and popularity
3. Embed in the ecosystem of complementary registries
![Page 31: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/31.jpg)
Technologically-delineated views of the world: transcriptomics, proteomics, metabolomics
Biologically-delineated views of the world: plant biology, epidemiology, microbiology
Generic features (‘common core’)
- description of source biomaterial
- experimental design components
[Figure labels: Arrays, Scanning, Columns, Gels, MS, FTIR, NMR]
To compare and integrate data we need interoperable standards
How do we address fragmentation, duplications and gaps?
![Page 32: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/32.jpg)
Global alliances are needed, e.g.:
![Page 33: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/33.jpg)
biocaddie.org
![Page 34: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/34.jpg)
metadatacenter.org
![Page 35: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/35.jpg)
![Page 36: Big Data Standards - Workshop, ExpBio, Boston, 2015](https://reader034.vdocuments.mx/reader034/viewer/2022042716/55a696db1a28ab6b2d8b4671/html5/thumbnails/36.jpg)
• Most researchers understand the value of standardized descriptions when using third-party datasets
• But when asked to structure their datasets, they view requests for even “minimal” information as burdensome
➢ There is an urgent need to lower the bar for authoring good metadata
Researchers hate standards!