Forum on Personalized Medicine: Challenges for the Next Decade
Bioinformatics and Big Data in the era of Personalized Medicine. 10th Anniversary Instituto Roche Forum on Personalized Medicine: Challenges for the next decade. Santiago de Compostela (Spain), September 25th 2014
Joaquín Dopazo
Computational Genomics Department,
Centro de Investigación Príncipe Felipe (CIPF),
Functional Genomics Node (INB),
Bioinformatics Group (CIBERER) and
Medical Genome Project,
Spain.
http://bioinfo.cipf.es http://www.medicalgenomeproject.com http://www.babelomics.org
http://www.hpc4g.org @xdopazo
Forum on Personalized Medicine, 25 September 2014
Bioinformatics and Big Data in the
era of Personalized Medicine
Allison, 2008. Is personalized medicine finally
arriving? Nature.
Personalized medicine: just about a better
understanding of the phenotype-genotype
relationship
Personalized medicine through
precision medicine
• Precision medicine requires better ways of defining diseases by introducing genomic technologies into diagnostic procedures.
• A more precise diagnosis of diseases, based on the description of their molecular mechanisms, is critical for creating innovative diagnostic, prognostic, and therapeutic strategies properly tailored to each patient’s needs.
The future of personalized medicine
is strongly based on genomics
• Personalized medicine is based on the availability of diagnostic biomarkers
• Genome sequencing offers ALL this information (if properly analyzed)
• Genome sequencing prices are in free fall (exome price expected < 300€ in 2-3 years)
• Over 30-40% of the budget (> $500 B) per year is spent on costs associated with “overuse, underuse, misuse, ...”
While costs fall, the amount of data to manage and its complexity rise exponentially.
Costs are already almost competitive enough for use in the clinic.
The problem is… are we ready to deal with this data?
Exome sequencing has been used successfully.
NGS prices will soon be affordable.
http://www.genome.gov/sequencingcosts/
http://www.nih.gov/news/health/jun2014/nhgri-18.htm
http://www.nejm.org/doi/full/10.1056/NEJMra1312543
More than 10,000 exomes will be ordered for diagnostic purposes
Clinical application of exomes
Personalized Genomic Medicine. Phase I: generating the knowledge database
sequencing
Patient List of variants
Database. Query: variant/pathway
Therapy Outcome
System feedback
Genetic variants are linked to therapies through the knowledge of their functional effects (systems biology)
Initially the system will need much feedback: Knowledge generation phase. Growing knowledge database
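The feedback loop described above can be sketched as a small in-memory knowledge base. This is only an illustration under strong assumptions: the class name `KnowledgeDB`, the variant identifier and the therapy names are all hypothetical, and a real system would link variants to therapies through curated functional knowledge rather than raw outcome counts.

```python
from collections import defaultdict


class KnowledgeDB:
    """Toy knowledge database that counts therapy outcomes per variant (hypothetical design)."""

    def __init__(self):
        # (variant, therapy) -> [successes, failures]
        self._outcomes = defaultdict(lambda: [0, 0])

    def feedback(self, variant, therapy, success):
        """System feedback: record the observed outcome of a therapy in a variant carrier."""
        self._outcomes[(variant, therapy)][0 if success else 1] += 1

    def query(self, variant):
        """Return the therapies seen for this variant, ranked by observed success rate."""
        rates = {}
        for (var, therapy), (ok, fail) in self._outcomes.items():
            if var == variant:
                rates[therapy] = ok / (ok + fail)
        return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Knowledge generation phase: outcomes are fed back into the database.
db = KnowledgeDB()
db.feedback("variant_X", "therapy_1", success=True)
db.feedback("variant_X", "therapy_1", success=True)
db.feedback("variant_X", "therapy_2", success=False)

# Application phase: the database is queried for a new carrier of the same variant.
ranking = db.query("variant_X")
print(ranking)
```

The more outcomes are fed back, the more informative the ranking becomes, which is why the first phase needs so much feedback.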
Genomic medicine
Knowledge
database
Personalized genomic medicine.
Phase II: applying the knowledge database
Patient
1) Genomic sequencing 2) Database of markers 3) Therapy prediction
Genomic core facility phase II
Clinician receives hints on possible prescriptions and therapeutic interventions
+ Other factors (risk, cost, etc.)
Prescription
Pre-symptomatic:
• Genetic predisposition to acquired diseases (>6000, some treatable)
• Early diagnosis of genetic diseases
Symptomatic analysis:
• Diagnosis of acquired diseases
• Early cancer detection
• Cancer treatment recommendation
From genetics to genomic medicine
Genetic medicine: Test 1, Test 2; Therapy 1, Therapy 2, Therapy 3, ?
Genomic medicine: Test; Therapy 1, Therapy 2, Therapy 3, ?
Genomic analysis allows patients to be associated with therapies from the very beginning, saving time and costs and increasing the success of treatments.
feedback
Some examples
Disease              Conventional sequencing        NGS (with capture)
Marfan syndrome      1300€ (2 genes, 75 exons)      900€ (3 genes, 237 exons)
Hereditary deafness  12500€ (36 genes, 1500 exons)  1100€ (38 genes, > 1500 exons)
• Low initial investment
• Already existing infrastructure
• Quick implementation
• Easy implementation as a cloud service that guarantees sustainability
Preparing the scenario for the introduction of the genome in the clinic
Patient, Treatment, eHR, feedback
Corporate systems: Orion clinic, Abucasis, Gaia, etc.
• Acceleration of algorithms for data pre-processing. Data storage optimization
• Integration of the data in the eHR
• Visualization and data presentation. Ready for the clinical interpretation
• Decision support techniques: algorithms that relate biomarkers to treatments, outcomes, etc. (gene prioritization and predictors)
New Big Data storage strategies
Automatic QC and sequence cleansing; Mapping + QC; Variant calling + QC
8-10 hours, 8-12 hours, 8-12 hours
CLOUD
FASTQ (10 GB), BAM (7 GB), VCF (200 MB)
Data sizes are for exomes. For whole genomes, sizes are >20x larger.
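The sizes quoted for one exome can be turned into a rough per-sample storage estimate. The sketch below assumes that all three files are kept and uses a factor of 20 as a lower bound for whole genomes, since the slide only says ">20x". The function names are illustrative.

```python
# Exome file sizes quoted on the slide, in GB.
EXOME_SIZES_GB = {"FASTQ": 10.0, "BAM": 7.0, "VCF": 0.2}

# The slide says whole genomes are ">20x" larger, so 20 is only a lower bound.
GENOME_FACTOR_LOWER_BOUND = 20


def per_sample_total_gb(sizes):
    """Total storage for one sample if every intermediate file is kept."""
    return sum(sizes.values())


def scale_to_genome(sizes, factor=GENOME_FACTOR_LOWER_BOUND):
    """Lower-bound whole-genome sizes obtained by scaling the exome sizes."""
    return {name: size * factor for name, size in sizes.items()}


exome_total = per_sample_total_gb(EXOME_SIZES_GB)
genome_sizes = scale_to_genome(EXOME_SIZES_GB)
genome_total = per_sample_total_gb(genome_sizes)

print(f"Exome, all files kept: {exome_total:.1f} GB")
print(f"Whole genome, lower bound: {genome_total:.1f} GB")
```

Keeping only the VCF after quality control would reduce the per-sample footprint by roughly two orders of magnitude, which is one reason storage strategy matters here.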
Remote visualization of big data.
Data production phase
e-health record
Final human supervision of data QC
Tools developed to improve the pipeline
Genome Maps, an HTML5+SVG data visualization of VCF and BAM
o Genome-scale data visualization plays an important role in the data analysis process. It is a big data management problem.
o Features of Genome Maps (Medina, 2013, NAR; ICGC data analysis portal)
● First 100% HTML5 web-based viewer: HTML5+SVG (inspired by Google Maps)
● Always updated, no browser plugins or installation
● Data taken from CellBase, remote NGS data, local files and DAS servers: genes, transcripts, exons, SNPs, TFBS, miRNA targets, etc.
● Other features: multi-species, API oriented, easy integration, plugin framework, etc.
BAM viewer, VCF viewer, ICGC genomic viewer
www.genomemaps.org
Finding new biomarkers
Test; Therapy 1, Therapy 2, Therapy 3, ?; feedback
Feedback: treatment failures are reanalyzed to search for:
1) Biomarkers (of failure)
2) Subgroups (to search for new personalized and rational therapeutic interventions)
Diagram labels: Treatables, Non treatables, Failure treatment biomarkers, Group A biomarkers, Irrelevant
Signaling, Protein interaction, Regulation
Variants are used as biomarkers to distinguish between responders and non-responders and to sub-classify non-responders.
Rational design of therapies relies on Systems Biology concepts. Pathways are complex and must be understood with the proper bioinformatic tools.
BiERapp: interactive web-based tool for easy candidate prioritization by successive filtering
SEQUENCING CENTER
Data preprocessing: FASTQ, BAM, VCF
Genome Maps
BiERapp filters: Population frequencies, Consequence types, Experimental design
No-SQL (Mongo) VCF indexing
BAM viewer and Genomic context
Easy scale up
NA19660, NA19661, NA19600, NA19685
BiERapp: the interactive filtering tool for easy candidate prioritization
http://bierapp.babelomics.org Aleman et al., 2014 NAR
Successive filtering approach: an example with 3-Methylglutaconic aciduria syndrome
3-Methylglutaconic aciduria (3-MGA-uria) is a heterogeneous group of syndromes characterized by an increased excretion of 3-methylglutaconic and 3-methylglutaric acids.
WES with a consecutive filter approach is enough to detect the new mutation in this case.
Use known variants and their population frequencies to filter out irrelevant polymorphisms.
• Typically dbSNP, 1000 genomes and the 6515 exomes from the ESP are used as sources of population frequencies.
• We sequenced 300 healthy controls (rigorously phenotyped) to add an extra filtering step to the analysis pipeline
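The successive filtering idea can be sketched in a few lines. This is not the BiERapp implementation: the `Variant` fields, the 1% frequency threshold, the consequence labels and the example data are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    gene: str
    consequence: str
    # Allele frequencies per public reference population, e.g. {"1000g": 0.02}.
    pop_freqs: dict = field(default_factory=dict)
    # Frequency in the local healthy-control cohort (hypothetical field).
    local_freq: float = 0.0


def successive_filter(variants, max_freq=0.01,
                      damaging=("missense", "stop_gained", "frameshift")):
    """Apply the filters one after another and count the survivors of each step."""
    steps = [
        ("public frequency", lambda v: all(f <= max_freq for f in v.pop_freqs.values())),
        ("local controls", lambda v: v.local_freq <= max_freq),
        ("consequence type", lambda v: v.consequence in damaging),
    ]
    counts = [("input", len(variants))]
    for name, keep in steps:
        variants = [v for v in variants if keep(v)]
        counts.append((name, len(variants)))
    return variants, counts


# Hypothetical variants from one patient.
variants = [
    Variant("GENE_A", "missense", {"1000g": 0.20, "ESP": 0.18}, 0.22),  # common everywhere
    Variant("GENE_B", "missense", {"1000g": 0.0, "ESP": 0.0}, 0.15),    # common only locally
    Variant("GENE_C", "synonymous", {"1000g": 0.001}, 0.0),             # rare but silent
    Variant("GENE_D", "stop_gained", {}, 0.0),                          # rare and damaging
]
candidates, counts = successive_filter(variants)
print(counts)
print([v.gene for v in candidates])
```

In this toy data the GENE_B variant is absent from the public resources and is only removed by the local-control step, which is the situation the extra filter is meant to catch.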
Novembre et al., 2008. Genes mirror geography within Europe. Nature
Comparison of MGP controls to 1000g
How important do you think local information is for detecting disease genes?
Filtering with or without local variants
Number of genes as a function of individuals in the study of a dominant disease (autosomal dominant Retinitis Pigmentosa)
The use of local variants makes an enormous difference
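How the number of candidate genes shrinks as affected individuals are added can be illustrated with a simple intersection. The sketch assumes a dominant model in which every affected individual carries a candidate variant in the same gene. The gene names, variant ids and the function name are hypothetical.

```python
def candidate_gene_counts(individuals, local_variants=frozenset()):
    """Count genes that still carry a candidate variant in every individual seen so far.

    individuals: list of sets of (gene, variant_id) pairs, one set per affected person.
    local_variants: variant ids seen in local healthy controls, which are discarded.
    """
    shared = None
    counts = []
    for variants in individuals:
        genes = {gene for gene, var in variants if var not in local_variants}
        shared = genes if shared is None else shared & genes
        counts.append(len(shared))
    return counts


# Hypothetical candidate variants in three affected individuals.
patients = [
    {("GENE_A", "v1"), ("GENE_B", "v2"), ("GENE_C", "v3")},
    {("GENE_A", "v4"), ("GENE_B", "v2"), ("GENE_D", "v5")},
    {("GENE_A", "v1"), ("GENE_B", "v2")},
]

without_local = candidate_gene_counts(patients)
with_local = candidate_gene_counts(patients, local_variants={"v2"})
print(without_local)
print(with_local)
```

Here variant `v2` is shared by all patients only because it is a local polymorphism; once it is filtered out, a single candidate gene remains after the second individual.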
New variants and disease genes found with WES and successive filtering
WES: IRDs, arRP (EYS), BBS, arRP, arRP (USH2), 3-MGA-uria (SERAC1), NBD (BCKDK)
The final schema: diagnostic and discovery
Diagram labels: MySeq, IonTorrent, IonProton, Illumina, Future technologies; Data preprocessing; Sequence DB (Sequences, Freqs.); Freq. popul.; Candidate Prioritization; Knowledge DB; Disease; New variants; All; NO; Diagnostic; Therapeutic decision; New knowledge for future diagnostic
Diagnostic by targeted sequencing (panels of genes)
Tool for defining panels
New filter based on local population variant frequencies
If no diagnostic variants appear, then secondary findings are studied
Diagnostic mutations
http://team.babelomics.org
Implementation of tools in the IT4I Supercomputing Center (Czech Republic)
The pipelines of primary and secondary analysis developed by the Computational Genomics Department of the CIPF in close collaboration with the Bull Chair have proven their efficiency in the analysis of more than 1000 exomes in a joint collaborative project of the CIBERER and the MGP.
A first pilot implementation has been done in the IT4I supercomputing center, which aims to centralize the analysis of genomics data in the country.
Implementation in the AVS
1PB DB
We have taken advantage of the already operational corporate medical image system, using a quite similar philosophy.
eHR gateway
Upload image
Retrieve (by patient ID)
Genomic gateway
Pilot project with 20 leukemias
Gene discovery and diagnostic implemented
But… what about personalized treatments?
Are individualized treatments a realistic option?
Dopazo, 2003, Drug Discovery Today
Patient’s omic data: Genomics and transcriptomics, Epigenomics, Proteomics, Metabolomics
Biological knowledge: Regulation, Interaction, Function
Systems biology computational models
Diagnostic biomarkers, Therapeutic targets, Network drugs
Cell culture, Xenograft model, Drug treatment, Best combination
Personalized therapy, Personalized medicine
Modeling pathways
The effect of gene expression on signaling can be estimated. Virtual KOs (or over-expressions) can be simulated.
Colorectal cancer activates a signaling circuit of the VEGF pathway that produces PGI2.
Virtual KO of COX2 interrupts the circuit (known therapeutic inhibitor in CGR)
COX2 gene KO
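A virtual KO can be illustrated with a deliberately simple toy model, which is not the pathway model used in the talk. It assumes a linear circuit in which each node scales the incoming signal by an activity value between 0 and 1 derived from its expression. Only COX2 is named on the slide; the other node names and all numeric values are placeholders.

```python
def circuit_signal(chain, activity, input_signal=1.0):
    """Propagate a signal along a linear circuit; each node scales it by its activity."""
    signal = input_signal
    for node in chain:
        signal *= activity[node]
    return signal


def virtual_ko(activity, gene):
    """Return a copy of the activity values with one gene switched off."""
    knocked = dict(activity)
    knocked[gene] = 0.0
    return knocked


# Hypothetical circuit: RECEPTOR and KINASE are placeholder nodes upstream of COX2.
chain = ["RECEPTOR", "KINASE", "COX2"]
activity = {"RECEPTOR": 0.9, "KINASE": 0.8, "COX2": 0.95}

baseline = circuit_signal(chain, activity)
after_ko = circuit_signal(chain, virtual_ko(activity, "COX2"))

print(f"Signal reaching the end of the circuit: {baseline:.3f}")
print(f"Signal after virtual KO of COX2: {after_ko:.3f}")
```

An over-expression could be simulated in the same way by raising a node's activity instead of setting it to zero.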
Future prospects
Exome vs complete genome
The ENCODE project suggests a functional role for a large fraction of the genome
What percentage of the genome is occupied by:
Coding genes: 2.4%
TFBSs: 8.1%
Open chromatin regions: 15.2%
Different RNA types: 62.0%
Total annotated elements: 80.4%
Exomes cover only a small fraction of the potential functionality of the genome (2.4%).
Is the missing heritability hidden in the remaining 78%?
If so, what type of variant should we expect to discover? SNVs? SVs?
Future prospects
We need to efficiently query all the information contained in the genome, including all the epigenomic signatures as well as the structural variation.
This involves data integration and “epistatic” queries.
We need to prepare our health systems to deal with the flood of genomic data
Information about variations        Processed   Raw
Genome variant information (VCF)    150 MB      250 GB
Epigenome                           150 MB      250 GB
Each transcriptome                  20 MB       80 GB
Individual complete variability     400 MB      525 GB
Hospital (100,000 patients)         40 TB       50 PB
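The hospital-level figures follow from the per-individual row by simple multiplication. The short check below assumes decimal (SI) units; the variable names are illustrative.

```python
PATIENTS = 100_000

# Per-individual figures from the "Individual complete variability" row.
PROCESSED_MB_PER_PATIENT = 400
RAW_GB_PER_PATIENT = 525

# Decimal (SI) units: 1 TB = 10**6 MB and 1 PB = 10**6 GB.
processed_tb = PATIENTS * PROCESSED_MB_PER_PATIENT / 10**6
raw_pb = PATIENTS * RAW_GB_PER_PATIENT / 10**6

print(f"Processed data for the hospital: {processed_tb:g} TB")
print(f"Raw data for the hospital: {raw_pb:g} PB")
```

The processed total matches the 40 TB in the table, and the raw total comes to 52.5 PB, which the table gives rounded as 50 PB.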
We are only starting to realize the dimension of the daunting challenges posed by genomic big data
There are technical (data size) and conceptual (data analysis) problems in the way genomic information is managed that must be addressed.
The Computational Genomics Department at the Centro de Investigación Príncipe Felipe (CIPF),
Valencia, Spain, and… ...the INB, National Institute of Bioinformatics (Functional Genomics
Node) and the CIBERER Network of Centers for Rare Diseases.
@xdopazo
@bioinfocipf