Forum on Personalized Medicine: Challenges for the Next Decade

Joaquín Dopazo

Computational Genomics Department,

Centro de Investigación Príncipe Felipe (CIPF),

Functional Genomics Node, (INB),

Bioinformatics Group (CIBERER) and

Medical Genome Project,

Spain.

http://bioinfo.cipf.es http://www.medicalgenomeproject.com http://www.babelomics.org

http://www.hpc4g.org @xdopazo

Forum on Personalized Medicine, 25 September 2014

Bioinformatics and Big Data in the

era of Personalized Medicine

Allison, 2008. Is personalized medicine finally

arriving? Nature.

Personalized medicine: just about a better understanding of the phenotype-genotype relationship

Personalized medicine through

precision medicine

• Precision medicine requires better ways of defining diseases by introducing genomic technologies into diagnostic procedures.

• A more precise diagnosis of diseases, based on the description of their molecular mechanisms, is critical for creating innovative diagnostic, prognostic, and therapeutic strategies properly tailored to each patient’s needs.

The future of personalized medicine

is strongly based on genomics

•Personalized medicine is based on the availability of

diagnostic biomarkers

•Genome sequencing offers ALL this information (if

properly analyzed)

•Genome sequence prices are in free fall (exome price

expected < 300€ in 2-3 years)

•Over 30-40% of the budget (>500 B $) per year is spent on costs associated with “overuse, underuse, misuse, ...”

While the cost falls, the amount of data to manage and its complexity rise exponentially.

Costs are already almost competitive enough for clinical use.

The problem is… are we ready to deal with this data?

Exome sequencing has already been used successfully.

NGS prices will soon be affordable.

http://www.genome.gov/sequencingcosts/

http://www.nih.gov/news/health/jun2014/nhgri-18.htm

http://www.nejm.org/doi/full/10.1056/NEJMra1312543

More than 10,000 exomes will be ordered for diagnostic purposes

Clinical application of exomes

Personalized Genomic Medicine. Phase I: generating the knowledge database


sequencing

Patient List of variants

Database. Query: variant/pathway

Therapy Outcome

System feedback

Genetic variants are linked to therapies through the knowledge of their functional effects (systems biology)

Initially the system will need much feedback: Knowledge generation phase. Growing knowledge database
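The feedback loop described above (variants queried against a growing knowledge database, with therapy outcomes fed back in) can be sketched as a small in-memory store. This is only an illustration under assumed names: the class `KnowledgeDB`, its methods `feedback` and `query`, and the variant and therapy labels are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


class KnowledgeDB:
    """Toy knowledge database linking variants to therapy outcomes (hypothetical design)."""

    def __init__(self):
        # (variant, therapy) -> [number of successes, number of failures]
        self._outcomes = defaultdict(lambda: [0, 0])

    def feedback(self, variant, therapy, success):
        """Record the outcome of a therapy given to a patient carrying the variant."""
        self._outcomes[(variant, therapy)][0 if success else 1] += 1

    def query(self, variants):
        """Rank therapies by observed success rate among carriers of any queried variant."""
        totals = defaultdict(lambda: [0, 0])
        for (variant, therapy), (ok, fail) in self._outcomes.items():
            if variant in variants:
                totals[therapy][0] += ok
                totals[therapy][1] += fail
        rates = {therapy: ok / (ok + fail) for therapy, (ok, fail) in totals.items()}
        return sorted(rates.items(), key=lambda item: item[1], reverse=True)


db = KnowledgeDB()
db.feedback("variant_A", "therapy_1", True)
db.feedback("variant_A", "therapy_1", True)
db.feedback("variant_A", "therapy_2", True)
db.feedback("variant_A", "therapy_2", False)
print(db.query({"variant_A"}))
```

In the knowledge generation phase most queries would return little or nothing; the ranking only becomes informative as outcomes accumulate.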

Genomic medicine

Knowledge

database

Personalized genomic medicine.

Phase II: applying the knowledge database

Patient

1) Genomic sequencing 2) Database of markers 3) Therapy prediction

Genomic core facility phase II

Clinician receives hints on possible prescriptions and therapeutic interventions

+ Other factors (risk, cost, etc.)

Prescription

Pre-symptomatic:

• Genetic predisposition to acquired diseases (>6000, some treatable)

• Early diagnosis of genetic diseases

Symptomatic analysis:

• Diagnosis of acquired diseases

• Early cancer detection

• Cancer treatment recommendation

From genetics to genomic medicine

Test 1

Test 2

Therapy 1

Therapy 2

Therapy 3

?

Genetic medicine

Test

Therapy 1

Therapy 2

Therapy 3

?

Genomic medicine

+

Genomic analysis allows patients to be associated with therapies from the very beginning, saving time and costs and increasing the success of treatments.

Some examples

                     Conventional sequencing          NGS (with capture)
Marfan syndrome      1300€ (2 genes, 75 exons)        900€ (3 genes, 237 exons)
Hereditary deafness  12500€ (36 genes, 1500 exons)    1100€ (38 genes, >1500 exons)

• Low initial investment

• Already existent infrastructure

• Quick implementation

• Easy implementation as a cloud service that guarantees sustainability

Preparing the scenario for the introduction of the genome in clinics

Patient

Treatment

eHR

Decision support

techniques: algorithms

that relate biomarkers to

treatments, outcomes, etc.

(gene prioritization and

predictors)

Integration of

the data in

the eHR

Visualization and

data presentation.

Ready for the

clinical interpretation

Acceleration of algorithms for data pre-processing. Data storage optimization

feedback

Corporate systems

Orion clinic

Abucasis, Gaia,

etc.




New Big Data storage strategies

Automatic QC Sequence cleansing

Variant calling + QC

Mapping + QC

8-10 hours 8-12 hours 8-12 hours

CLOUD

FASTQ (10 GB), BAM (7 GB), VCF (200 MB)

Data sizes are for exomes. For whole genomes, sizes are >20x larger.
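The per-sample sizes above translate into storage requirements by simple multiplication. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and uses a factor of 20 for whole genomes, which is only the lower bound implied by ">20x"; the function names are illustrative.

```python
# Per-exome file sizes quoted on the slide, in megabytes (decimal units assumed).
EXOME_SIZES_MB = {"FASTQ": 10_000, "BAM": 7_000, "VCF": 200}

# The slide only states ">20x" for whole genomes, so 20 is a lower bound.
GENOME_FACTOR = 20


def storage_mb(n_samples, factor=1):
    """Storage per file type (MB) for n_samples samples, scaled by factor."""
    return {kind: size * factor * n_samples for kind, size in EXOME_SIZES_MB.items()}


def total_tb(sizes_mb):
    """Sum all file types and convert megabytes to terabytes."""
    return sum(sizes_mb.values()) / 1_000_000


exomes = storage_mb(1000)
genomes = storage_mb(1000, factor=GENOME_FACTOR)
print(f"1000 exomes: {total_tb(exomes):.1f} TB")
print(f"1000 genomes: at least {total_tb(genomes):.1f} TB")
```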

Remote visualization

of big data.

Data production phase

e-health record

Final human supervision

of data QC

Tools developed to improve the pipeline: Genome Maps, an HTML5+SVG data visualization of VCF and BAM

o Genome-scale data visualization plays an important role in the data analysis process. It is a big data management problem.

o Features of Genome Maps (Medina, 2013, NAR; ICGC data analysis portal)

● First 100% HTML5 web-based viewer: HTML5+SVG (inspired by Google Maps)

● Always updated, no browser plugins or installation

● Data taken from CellBase, remote NGS data, local files and DAS servers: genes, transcripts, exons, SNPs, TFBS, miRNA targets, etc.

● Other features: multi-species, API-oriented, easy integration, plugin framework, etc.

BAM

viewer

VCF viewer ICGC genomic viewer

www.genomemaps.org




Finding new biomarkers

Test

Therapy 1

Therapy 2

Therapy 3

?

feedback

Feedback: treatment failures are reanalyzed to search for:

1) Biomarkers (of failure)

2) Subgroups (to search for new personalized and rational therapeutic interventions)

Treatables

Failure

treatment

biomarkers

Group A

biomarkers

Group A

biomarkers

Irrelevant

Non treatables

Signaling

Protein interaction Regulation

Variants are used as biomarkers to distinguish

between responders and non-responders and to

sub-classify non-responders

Rational design of therapies relies on Systems Biology concepts. Pathways are complex and must be understood with the proper bioinformatic tools.


Preparing the scenario for the introduction of the genome in clinics

BiERapp: interactive web-based tool for easy candidate prioritization by successive filtering

SEQUENCING CENTER

Data preprocessing

VCF FASTQ

Genome Maps

BAM

BiERapp filters

No-SQL (Mongo) VCF indexing

Population frequencies Consequence types

Experimental design

BAM viewer and Genomic context ?

Easy scale-up

NA19660 NA19661

NA19600 NA19685

BiERapp: the interactive filtering tool for easy candidate prioritization

http://bierapp.babelomics.org Aleman et al., 2014 NAR

http://bierapp.babelomics.org/

3-Methylglutaconic aciduria (3-

MGA-uria) is a heterogeneous

group of syndromes

characterized by an increased

excretion of 3-methylglutaconic

and 3-methylglutaric acids.

WES with a consecutive filter

approach is enough to detect

the new mutation in this case.
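The idea of successive filtering can be sketched as a chain of predicates applied in order, recording how many candidates survive each step. This is a minimal illustration, not the BiERapp implementation: the variant records, the 1% frequency threshold, the consequence categories, and the homozygous (recessive) model are all assumptions made for the example.

```python
def successive_filter(variants, filters):
    """Apply named filters in order; return survivors and the count left after each step."""
    remaining = list(variants)
    counts = []
    for name, keep in filters:
        remaining = [v for v in remaining if keep(v)]
        counts.append((name, len(remaining)))
    return remaining, counts


# Hypothetical variant records; real input would come from an annotated VCF.
variants = [
    {"id": "v1", "freq": 0.30, "consequence": "missense", "genotype": "hom"},
    {"id": "v2", "freq": 0.00, "consequence": "synonymous", "genotype": "hom"},
    {"id": "v3", "freq": 0.00, "consequence": "stop_gained", "genotype": "het"},
    {"id": "v4", "freq": 0.00, "consequence": "missense", "genotype": "hom"},
]

# Assumed filter order: population frequency, consequence type, experimental design.
filters = [
    ("rare in populations", lambda v: v["freq"] < 0.01),
    ("damaging consequence", lambda v: v["consequence"] in {"missense", "stop_gained"}),
    ("recessive model (homozygous)", lambda v: v["genotype"] == "hom"),
]

survivors, counts = successive_filter(variants, filters)
for name, left in counts:
    print(f"{name}: {left} variants left")
print([v["id"] for v in survivors])
```

Keeping the per-step counts makes it easy to see which filter removes the most candidates and to relax one step if nothing survives.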

Successive Filtering approach An example with 3-Methylglutaconic aciduria syndrome

Use known variants and their population frequencies to filter out irrelevant polymorphisms.

• Typically dbSNP, 1000 Genomes and the 6515 exomes from the ESP are used as sources of population frequencies.

• We sequenced 300 healthy controls (rigorously phenotyped) to add an extra filtering step to the analysis pipeline
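A frequency filter that combines public sources with local controls could look like the following sketch. The source keys (`dbsnp`, `1000g`, `esp`, `local_controls`), the 1% threshold, and the example frequencies are assumptions chosen for illustration; a variant absent from a source is treated as having frequency 0 there.

```python
def passes_frequency_filter(variant, sources, max_freq=0.01):
    """True if the variant is rarer than max_freq in every listed source."""
    return all(variant["freqs"].get(source, 0.0) < max_freq for source in sources)


# Hypothetical variants with made-up frequencies.
variants = [
    {"id": "common_everywhere", "freqs": {"dbsnp": 0.20, "1000g": 0.22, "esp": 0.19, "local_controls": 0.21}},
    {"id": "common_locally", "freqs": {"local_controls": 0.04}},
    {"id": "rare_everywhere", "freqs": {"esp": 0.001}},
]

public_sources = ["dbsnp", "1000g", "esp"]
all_sources = public_sources + ["local_controls"]

kept_public = [v["id"] for v in variants if passes_frequency_filter(v, public_sources)]
kept_all = [v["id"] for v in variants if passes_frequency_filter(v, all_sources)]
print(kept_public)
print(kept_all)
```

The variant that is frequent only in the local controls survives the public filters and is removed only once the local frequencies are added.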

Novembre et al., 2008. Genes mirror geography within Europe. Nature

Comparison of MGP controls to 1000g

How important do you

think local information is

to detect disease genes?

Filtering with or without local variants

Number of genes as a function of the number of individuals in the study of a dominant disease (autosomal dominant retinitis pigmentosa)

The use of local

variants makes

an enormous

difference
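The effect can be illustrated with a toy calculation: for a dominant disease, a candidate gene must carry a non-filtered variant in every affected individual, so the candidate list is an intersection that shrinks as individuals are added. The gene names, variant labels, and the set of local variants below are invented for the example and carry no biological meaning.

```python
def genes_shared(individuals, known_variants=frozenset()):
    """Number of genes still mutated in every individual as individuals are added one by one."""
    shared = None
    counts = []
    for variants in individuals:
        # Ignore variants already seen in (local) healthy controls.
        genes = {gene for gene, variant in variants if variant not in known_variants}
        shared = genes if shared is None else shared & genes
        counts.append(len(shared))
    return counts


# Each affected individual is a set of (gene, variant) pairs (hypothetical data).
individuals = [
    {("GENE1", "a"), ("GENE2", "b"), ("GENE3", "c")},
    {("GENE1", "d"), ("GENE2", "b"), ("GENE3", "c")},
    {("GENE1", "e"), ("GENE2", "b"), ("GENE4", "f")},
]

# Variants assumed to be present in local healthy controls.
local_variants = {"b", "c"}

without_local = genes_shared(individuals)
with_local = genes_shared(individuals, known_variants=local_variants)
print(without_local)
print(with_local)
```

In this toy data the local filter leaves a single candidate gene from the first individual onward, whereas without it two genes remain even after three individuals.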

New variants and disease genes found with WES and successive filtering

WES

IRDs

arRP (EYS)

BBS

arRP arRP (USH2)

3-MGA-uria

(SERAC1)

NBD (BCKDK )

Knowledge DB

Freq. popul.

MySeq

IonTorrent

IonProton

Illumina

NO

Diagnostic

Therapeutic decision

New variants

Disease

All

Candidate Prioritization

Data preprocessing

Sequence DB

Sequences

Freqs.

Future technologies

New knowledge for future diagnostic

The final schema: diagnostic and discovery

Diagnostic by targeted sequencing

(panels of genes)

Tool for defining panels

New filter based on

local population variant

frequencies

If no diagnostic variants appear, then

secondary findings are studied

Diagnostic mutations

http://team.babelomics.org

Implementation of tools in the IT4I

Supercomputing Center (Czech Republic)

The primary and secondary analysis pipelines developed by the Computational Genomics Department of the CIPF, in close collaboration with the Bull Chair, have proven their efficiency in the analysis of more than 1000 exomes in a joint collaborative project of the CIBERER and the MGP.

A first pilot implementation has been carried out at the IT4I supercomputing center, which aims to centralize the analysis of genomics data in the country.

Implementation in the AVS

…..

1PB DB

We have taken advantage of the corporate medical imaging system, which is already in operation and follows a very similar philosophy.

eHR

gateway

Upload

image

Retrieve (by

patient ID)

Genomic

gateway

Pilot

project

with 20

leukemias


Gene discovery and diagnostic

implemented

But… what about personalized treatments?

Patient’s omic data Biological knowledge

Systems biology

computational models

Epigenomics Regulation

Interaction

Function

Proteomics

Genomics and transcriptomics

Patient

Metabolomics

Diagnostic biomarkers Personalized medicine

Therapeutic targets

Cell culture

Best combination

Xenograft model

Drug treatment

Network drugs

Personalized therapy

Are individualized treatments a realistic option?

Dopazo, 2003, Drug Discovery Today

Modeling pathways

The effect of gene expression on signaling can be estimated. Virtual KOs (or over-expressions) can be simulated.

Colorectal cancer activates a signaling circuit of the VEGF pathway that produces PGI2.

Virtual KO of COX2 interrupts the circuit (known therapeutic inhibitor in CRC).

COX2

gene KO
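A virtual knock-out can be illustrated with a deliberately simplified model in which the signal transmitted along a linear circuit is the product of the activities of its nodes, each scaled to [0, 1]. This is an assumption made for the sketch, not the model used in the talk; the node names other than COX2 and all activity values are hypothetical.

```python
from math import prod


def circuit_signal(circuit, activity):
    """Signal reaching the end of a linear circuit: product of node activities in [0, 1]."""
    return prod(activity[node] for node in circuit)


def knock_out(activity, gene):
    """Return a copy of the activities with one gene silenced (virtual KO)."""
    silenced = dict(activity)
    silenced[gene] = 0.0
    return silenced


# Hypothetical circuit and activities (e.g. derived from normalized expression).
circuit = ["receptor", "COX2", "effector"]
activity = {"receptor": 0.9, "COX2": 0.8, "effector": 0.7}

before = circuit_signal(circuit, activity)
after = circuit_signal(circuit, knock_out(activity, "COX2"))
print(f"signal before KO: {before:.3f}")
print(f"signal after COX2 KO: {after:.3f}")
```

An over-expression could be simulated the same way by setting a node's activity to 1.0 instead of 0.0.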

Future prospects Exome vs complete genome

The ENCODE project suggests a functional

role for a large fraction of the genome

What percentage of the genome is occupied by:

Coding genes: 2.4%

TFBSs: 8.1%

Open chromatin regions: 15.2%

Different RNA types: 62.0%

Total annotated elements: 80.4%

Exomes cover only a small fraction of the potential functionality of the genome (2.4%).

Is the missing heritability hidden in the remaining 78%?

If so, what type of variant should we expect to discover? SNVs? SVs?

Future prospects

We need to efficiently query all the information contained in the genome, including all the epigenomic signatures as well as the structural variation.

This involves data integration and “epistatic” queries.

We need to prepare our health systems to deal with the flood of genomic data.

Information about variations        Processed   Raw
Genome variant information (VCF)    150 MB      250 GB
Epigenome                           150 MB      250 GB
Each transcriptome                  20 MB       80 GB
Individual complete variability     400 MB      525 GB
Hospital (100,000 patients)         40 TB       50 PB
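The hospital-level figures follow from multiplying the per-individual totals by the number of patients. The short check below assumes decimal units (1 TB = 1,000,000 MB and 1 PB = 1,000,000 GB); the variable names are illustrative.

```python
PATIENTS = 100_000

# Per-individual totals from the table above.
PROCESSED_MB_PER_PATIENT = 400
RAW_GB_PER_PATIENT = 525

# Decimal unit conversions (assumed).
MB_PER_TB = 1_000_000
GB_PER_PB = 1_000_000

processed_tb = PROCESSED_MB_PER_PATIENT * PATIENTS / MB_PER_TB
raw_pb = RAW_GB_PER_PATIENT * PATIENTS / GB_PER_PB

print(f"processed: {processed_tb:.1f} TB")
print(f"raw: {raw_pb:.1f} PB")
```

The processed total matches the 40 TB in the table, and the raw total comes to 52.5 PB, which the table rounds to 50 PB.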

We are only starting to realize the dimension of the

daunting challenges posed by genomic big data

There are technical (data

size) and conceptual

problems (data analysis) in

the way genomic information

is managed that must be

addressed.

The Computational Genomics Department at the Centro de Investigación Príncipe Felipe (CIPF),

Valencia, Spain, and… ...the INB, National Institute of Bioinformatics (Functional Genomics

Node) and the CIBERER Network of Centers for Rare Diseases.

@xdopazo

@bioinfocipf
