Common languages in genomic epidemiology:
from ontologies to algorithms
João André Carriço, Mario Ramirez
Microbiology Institute and Instituto de Medicina Molecular, Faculty of Medicine, University of [email protected]
Twitter: @jacarrico
RAMI-NGS, Hamburg, Germany, 9-11 June 2016
Genomic epidemiology goals
Moving from Typing into High Throughput Sequencing (HTS) Genomics:
Increase in discrimination
Extra information to be extracted from the genome (resistance profiles, virulence factors, genome organization)
Global Outbreak detection / Surveillance
Direct application in public health Source attribution -> intervention
Need for common languages
Image credits:
1) http://www.iissiidiology.net/en/publications/104-ayfaar-interpersonal-and-true-human-relationship-harmonization-mechanisms
2) http://blog.f1000research.com/2014/04/04/reproducibility-tweetchat-recap/
Data Integration
Harmonization Reproducibility
Three tiers
Algorithms
Interfaces
Ontologies
SNP calling
Read mapping algorithms:
Bowtie2, BWA, SOAP2, Saruman, mr/mrsFAST …. (and a lot more)
Algorithms
Hatem M et al. BMC Bioinformatics 2013;14:184. DOI: 10.1186/1471-2105-14-184
+ a plethora of parameters for each of them
+ a (proper) choice of reference
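Why the choice of reference matters can be shown with a toy comparison. This is only a sketch under strong assumptions: the sequences are invented, already aligned, of equal length and free of indels, and no read mapper or quality filter is involved. The names `call_snps`, `ref_a` and `ref_b` are hypothetical.

```python
from typing import List, Tuple

Snp = Tuple[int, str, str]  # (1-based position, reference base, sample base)


def call_snps(reference: str, sample: str) -> List[Snp]:
    """List mismatching positions between two pre-aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


# Invented sequences: the same sample compared against two references.
ref_a = "ACGTACGTAC"
ref_b = "ACGTTCGTAA"
sample = "ACGTTCGTAC"

snps_vs_a = call_snps(ref_a, sample)
snps_vs_b = call_snps(ref_b, sample)
```

The same sample yields a different SNP list for each reference, so SNP results are only comparable when the reference (and the mapping parameters) are reported alongside them.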
Gene-by-gene
Gene-by-gene approach allele call algorithms:
BIGSdb (Jolley, K. A. & Maiden, M. C. J. BMC Bioinf 11, 595 (2010))
Enterobase (https://enterobase.warwick.ac.uk/)
GEP (Genome Profiler) (JCM. 2015 May;53(5):1765-7)
Ridom Seqsphere
Bionumerics (Applied Maths)
Mostly assembly based (yes it is a lot of work …)
Assembly algorithms have some parameters (mostly k-mer sizes)
Lots of heuristics for allele definition
Definitions needed!
Gene-by-gene approaches:
What is a locus?
What is an allele?
It depends on the algorithm(s) used!
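One possible answer to "what is an allele?" is sketched below. It assumes the simplest definition: an allele is any distinct sequence observed at a locus, matched exactly, and a new sequence receives the next free number. This is not how any of the tools listed earlier is confirmed to work; the function `call_allele` and the locus data are hypothetical.

```python
from typing import Dict, Tuple


def call_allele(sequence: str, known_alleles: Dict[str, int]) -> Tuple[int, bool]:
    """Return (allele_id, is_new) for one locus using exact sequence matching.

    Assumption: allele identity means identical sequence. Similarity
    thresholds, start/stop codon checks and other heuristics are left out.
    """
    seq = sequence.upper()
    if seq in known_alleles:
        return known_alleles[seq], False
    # Unknown sequence: register it under the next free identifier.
    new_id = max(known_alleles.values(), default=0) + 1
    known_alleles[seq] = new_id
    return new_id, True


# Hypothetical locus with two known alleles.
locus_alleles = {"ATGGCA": 1, "ATGGCG": 2}
known = call_allele("atggcg", locus_alleles)
novel = call_allele("ATGACA", locus_alleles)
```

Changing the definition (for example, accepting truncated sequences or requiring a complete coding sequence) changes which sequences count as new alleles, which is why the definitions need to be stated explicitly.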
However, the results are largely congruent!
Ontology definition
Ontologies
Image from http://www.emiliosanfilippo.it/?page_id=1172
Ontology definition: “Formal representation of knowledge as a set of concepts within a domain, and the relationships between those concepts” (Wikipedia)
Domain modeling: represents all the concepts involved in microbial typing by sequence-based methods
Provides a shared vocabulary, where the concepts should be unambiguous
Enables a machine-readable format that software and algorithms can use to interact automatically with multiple databases
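The idea of "concepts plus relationships in a machine-readable form" can be sketched as a list of subject/relationship/object triples. The concept and relationship names below are invented for illustration and are not taken from TypOn or any published ontology; `related` and `reachable` are hypothetical helpers.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject concept, relationship, object concept)

# Hypothetical terms, chosen only to illustrate the structure.
triples: List[Triple] = [
    ("Isolate", "has_typing_result", "AllelicProfile"),
    ("AllelicProfile", "is_composed_of", "Allele"),
    ("Allele", "is_variant_of", "Locus"),
    ("Locus", "belongs_to", "Schema"),
]


def related(concept: str, facts: List[Triple]) -> Set[Tuple[str, str]]:
    """Return the (relationship, target) pairs that start at `concept`."""
    return {(p, o) for s, p, o in facts if s == concept}


def reachable(start: str, facts: List[Triple]) -> Set[str]:
    """Follow relationships transitively and collect every concept reached."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for s, _, o in facts:
            if s == node and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen
```

Because every statement has the same shape, a program can answer questions such as "what does an isolate ultimately link to?" without knowing the vocabulary in advance, which is what lets software query several databases that share the ontology.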
TypOn
GenEpiO: Combining Different Epi, Lab, Genomics and Clinical Data Fields.
Lab Analytics: Genomics, PFGE, Serotyping, Phage typing, MLST, AMR
Clinical Data: Patient demographics, Medical History, Comorbidities, Symptoms, Health Status
Reporting: Case/Investigation Status
GenEpiO
(Genomic Epidemiology Application Ontology)
See draft version at https://github.com/Public-Health-Bioinformatics/IRIDA_ontology
Original slide from Emma Griffiths
Infectious Disease Epidemiology (from case to intervention)
Steps: Public Health Surveillance, Case Cluster Analysis, Evidence Collection & Outbreak Investigation, Result Reporting
Lab Surveillance (from sample to strain typing results)
Steps: Sample Collection & Processing, Sequence Data Generation & Processing, Bioinformatics Analysis, Result Reporting
Ontologies annotated on the workflow steps (some recur at several steps):
Whole Genome Sequencing (SO, ERO, OBI etc)
Quality Control (OBI, ERO)
Anatomy (FMA)
Environment (Envo)
Food (FoodOn)
Clinical Sampling (OBI)
Custom LIMS
AMR (ARO)
LOINC
Virulence (PATO)
Phylogenetic Clustering (EDAM)
Mobile Elements (MobiO)
Surveillance (SurvO)
Demographics (SIO)
Patient History (SIO)
Symptoms (SYMP)
Exposures (ExO)
Source Attribution (IDO)
Travel (IDO)
Transmission (TRANS)
Geography (OMRSE)
Outbreak Protocols
Infectious Disease (IDO)
Typing (TypON)
Nomenclature & Taxonomy (NCBItaxon)
Original slide from Emma Griffiths / IRIDA
http://foodontology.github.io/foodon/
(pipeline) NGSOnto
Application programming interfaces (API)
Provides a machine-readable, web-based interface, i.e., algorithms (not humans) can:
retrieve, submit and update data / analysis results
launch analyses / algorithms
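The four operations above map naturally onto HTTP requests. The sketch below only builds the requests and never sends them; the server address, the paths and the JSON fields are all hypothetical and do not describe the API of BIGSdb, Enterobase or any other real service.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "https://typing.example.org/api"  # hypothetical server


def build_request(method: str, path: str, payload: Optional[dict] = None) -> Request:
    """Build (but do not send) a JSON request for the hypothetical API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(f"{BASE_URL}{path}", data=data, headers=headers, method=method)


# One request per operation named on the slide (paths are invented).
retrieve = build_request("GET", "/isolates/42")
submit = build_request("POST", "/isolates", {"name": "strainA"})
update = build_request("PUT", "/isolates/42", {"country": "PT"})
launch = build_request("POST", "/analyses", {"tool": "goeBURST", "dataset": 7})
```

Against a real service each request would be sent with `urllib.request.urlopen(...)` and would normally also carry authentication; the actual paths and fields have to come from that service's documentation.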
Interfaces
http://www.clker.com/cliparts/q/P/V/D/5/R/cog-allgrey-hi.png
Databases: BIGSdb, Enterobase
Offer a RESTful API for data retrieval, submission and analysis
Tools: microreact.org
Tools: PHYLOViZ Online
https://online.phyloviz.net/
API:
* account creation
* profile + metadata upload
* running goeBURST
* retrieving a link
Private or Public data sharing
Scalable to thousands of nodes
Tree Analysis tools:
Interactive distance matrix
NLV graph
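The two tree analysis tools can be illustrated with a small sketch. It assumes that the distance between two allelic profiles is the number of loci at which they differ, and that an NLV (N locus variant) graph links every pair of profiles differing at no more than N loci. The profiles are invented, and this is not the PHYLOViZ Online implementation; goeBURST tie-break rules are not reproduced.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Profile = List[int]  # allele numbers, one per locus, in a fixed locus order


def hamming(a: Profile, b: Profile) -> int:
    """Count the loci at which two allelic profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles: Dict[str, Profile]) -> Dict[Tuple[str, str], int]:
    """Pairwise locus-difference counts for every ordered pair of profiles."""
    return {(p, q): hamming(profiles[p], profiles[q]) for p in profiles for q in profiles}


def nlv_edges(profiles: Dict[str, Profile], n: int) -> List[Tuple[str, str, int]]:
    """Edges between profiles that differ at no more than n loci."""
    edges = []
    for p, q in combinations(profiles, 2):
        d = hamming(profiles[p], profiles[q])
        if d <= n:
            edges.append((p, q, d))
    return edges


# Invented 7-locus profiles.
profiles = {
    "ST1": [1, 1, 1, 1, 1, 1, 1],
    "ST2": [1, 1, 1, 1, 1, 1, 2],
    "ST3": [1, 1, 1, 1, 1, 3, 2],
    "ST4": [5, 6, 7, 8, 1, 1, 1],
}
matrix = distance_matrix(profiles)
slv_graph = nlv_edges(profiles, 1)
dlv_graph = nlv_edges(profiles, 2)
```

Raising N adds links between more distant profiles, so the same dataset can be inspected at several levels of relatedness.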
Conclusions / Future Work
Transparency of analytical methods
Better definition of concepts (Clinical/Lab/Analysis)
Better tool/database interoperability
• Reproducibility of results
• Added value of analysis
• Custom interfaces for non-bioinformatics specialists
Acknowledgments
UMMI Members: Bruno Gonçalves, Mickael Silva, Miguel Machado, Mário Ramirez, José Melo-Cristino
INESC-ID: Alexandre Francisco, Cátia Vaz, Marta Nascimento
EFSA INNUENDO Project (https://sites.google.com/site/innuendocon/): Mirko Rossi
FP7 PathoNGenTrace (http://www.patho-ngen-trace.eu/): Dag Harmsen (Univ. Muenster), Stefan Niemann (Research Center Borstel), Keith Jolley, James Bray and Martin Maiden (Univ. Oxford), Joerg Rothganger (RIDOM), Hannes Pouseele (Applied Maths)
Genome Canada IRIDA project, Integrated Rapid Infectious Disease Analysis (www.irida.ca): Franklin Bristow, Thomas Matthews, Aaron Petkau, Morag Graham and Gary Van Domselaar (NLM, PHAC), Ed Taboada and Peter Kruczkiewicz (Lab Foodborne Zoonoses, PHAC), Fiona Brinkman (SFU), William Hsiao (BCCDC)