analysis of datasets globally coherent

Florian Markowetzmarkowetzlab.org

Analysis ofglobally coherent datasets

From enrichment and clustersto networks and mechanisms

“A SNP in FGFR2promotes cancer

susceptibility”

“Silencing a single genedeforms morphology”

“HER2 positivebreast cancers arevery aggressive”

Boutros and Ahringer, Nat Rev 2008

FGFR2 gene

homozygous odds ratio: 1.63heterozygous odds ratio: 1.22

Meyer et al, 2008; Easton et al 2007

perturbationperturbation

copy number alterationamplification

deletion SNP

RNA interferenceKnock-out

genome-wide

pathway-specific

phenotypephenotype

cancer subtypes

different risk groups

morphologyviability

differential gene expression

phenotypephenotype

perturbationperturbation

networksnetworks

Natural perturbationsPerturbation• copy-number alterationsNetwork• Pathways and regulatory

networks in the cellPhenotype• disease subtypes, survival, etcGoal• find regulatory hotspots to

explain heterogeneity ofdisease

Our lab

Experimental perturbationsPerturbations• RNA interferenceNetwork• Signal transduction pathwaysPhenotype• differential gene expression

downstream of pathwayGoal• Reconstruct pathway

What is globally coherent data?Phenotype

DNA Intermediate phenotype

SNPs

Copy-numberalterations mRNAproteins

metabolites

e.g. Q

TLs

e.g. marker genes

e.g. eQTLs

survival

cancer subtypesobesity

Examples of Globally Coherent Data

METABRIC @ CRI2000 breast tumours

Challenges of Globally Coherent Data

linking/integrating/comparing different data types– find homogeneous sub-types of heterogeneous disease

with impact on outcome– Mechanistic explanation of observed changes (“causal

models”)– not single marker, but complete story

• methods applicable in many other integrative tasks(eg microRNA + expression)

Keith Haring, Untitled, 1986

Urs Wehrli, Tidying Up Art, 2003

This course is about tools

DataPre-

processing Analysis Follow-up Result

Raw clay

Quality co

ntrol

sanity

check

s

Statistic

s

Machine le

arning

et al Sto

ry

Yeah!

We are here!

Programme

Todaymorning• lectures on statistical and

machine learning conceptsto analyse GCDs (FM)

afternoon• practical session to try

them out on real data (YY,MC)

Tomorrowmorning• Discussion of tools from

day 1, merits and pitfalls,how best to use them foryour projects (FM)

afternoon• practical session continues

(YY, MC)

Overview

Clustering• Hierarchical clustering, Mixture models, Dirichlet process,

Bayesian hierarchical clustering, integrative clusteringEnrichment• hypergeometric test, Gene set enrichment analysis, rich

subnetworks, HTSanalyzeRNetworks• Schadt’s ‘causal’ networks, Bayesian networks, conditional

independence• DANCE: Differential regulation in different disease sub-

types, NEMs: Nested Effects Models

Genes -> networks -> disease

clustering + enrichment = story

From data to distances

Samples

Gen

es

i

j

i

j

j

i

D[i,j] = dist( M[i,] , M[j,] )

M D

D[j,i] = D[i,j]

D[i,i] = 0 for all i

dist(. , .)

What distancemeasure shouldwe use?

Distance or dissimilarity matrix

a

bEuclidean distance

dist( a,b ) =

Examples of distances

a

b dist( a,b ) =Manhattan distance

a

b dist( a,b ) =Cosine distance

how is this related to correlation?

Linkage: distances to clusters

D

Sam

ple

2

Sample 1

1 2

3

6 5

4

D[3,4]

1 2

3

6 5

4Sa

mpl

e 2

Sample 1

D[ (2,3) ,4] = ???

dist(C1, C2) = max { dist(i,j) : i in C1, j in C2 }complete linkage

= min { dist(i,j) : i in C1, j in C2 }single linkage

= mean { dist(i,j) : i in C1, j in C2 }average linkage

Distances between individual genes

3

Hierarchical clusteringIngredients: data matrix , distance function, linkage function

Sam

ple

2

Sample 1

1 2

3

6 5

4

1 2 3 654

Dendrogram

Normal mixture models

All mixturecomponents are

Normal, but withdifferent means and

covariances.

Special case: k-means

All mixturecomponents are

Normal withdifferent means, but

same simplecovariance matrices.

How many cluster?

too few :( too many :(

Key questions in clustering

• How many clusters?

• How to integrate many data types?– eg to find concomitant patterns in gene expression and

copy-number

• How to identify important features?– which genes/alterations drive the clustering

BHC

Generating mixture data

A fixed collection of parameter sets(= means and variances)

Randomly select one parameter set

Given the parameter set, sample the data

Jordan, 2005

Dirichlet process clusteringNot a fixedcollection ofparameters

Parameter setsare sampled

One is chosen…

… to generate adata point

Jordan, 2005

Dirichlet process: Chinese restaurant

p1 p2 p4p3 …

Tables = parameter setCustomers = data pointsCustomer either (i) chooses already occupied table or (ii)

opens new table.Probability increases with number of people already at the

table (-> clustering!)

Dirichlet process mixtures (DPMs)

• Mixture models + Dirichlet process = models with(countably) infinite mixtures

• Big advantage:– estimate of how many clusters are in the data– not limited to pre-defined number

• How do we find the best model for our data?– popular (as always): Markov Chain Monte Carlo

• Uh oh … but that will be taking a reeeaaaalllly longtime for large genomic datasets.

Bayesian Hierarchical Clustering

• Like hierarchical clustering, but using marginallikelihood of DPM instead of (ad hoc) distance.

• Bayesian hypothesis testing decides which clustersto merge.

• Advantage: estimates optimal number and size ofclusters in the data.

p1 p2 …

Dirichlet process

Hierarchical clustering

BHC in practical session

Key questions in clustering

• How many clusters?

• How to integrate many data types?– eg to find concomitant patterns in gene expression and

copy-number

• How to identify important features?– which genes/alterations drive the clustering

iCluster

iCluster

BHC

iCluster: basic model

• (k-means) mixture model• + written as latent variable model• + sparsity constraints

X = W Z + data

genes X sa

mples

coefficients

genes X cluste

rscluste

ring

clusters

X samples

noise

for one data type:

iCluster: integrative clustering

X1 = W1 Z + 1

X2 = W2 Z + 2

…Xm = Wm Z + m

for many data types: 1, 2 ,.., m

dataerror terms foreach data type

different coefficients fordifferent data types

the same clustering matrixfor all data types!

iCluster: Sparsityfor many data types: 1, 2 ,.., m

Sparsity assumption:The W-matrices aresparsely populated,

i.e. most entries = 0

X1 = W1 Z + 1

X2 = W2 Z + 2

…Xm = Wm Z + m

Implemented by Lasso penaltyduring estimation:

Sum of absolute value of entriesmust be small

Clustering: PROs and CONs

Brown et al (2005) Ivanova et al (2006)Bakal et al (2007)

• Standard analysis, almost always applicable• Global first overview• can identify strong trends and patterns in the data

PRO

NEG• Often applied in situations where other methods wouldbe more appropriate (e.g. supervised analysis)

Overview






Gene Ontology (GO)

www.geneontology.org

A GO Term with a gene set annotated to it

Over-representation analysis

Weak Stronggenes ranked bystrength of phenotype

All genes

All HitsHits in GO term

Genes in GO term

Hyper-geometric test

Collection of gene sets

Hyper-geometric distribution

N genes altogether

n hitsk hits in GO term

m genes in GO term

Probability to observek hits in GO term

Number of possibilities to choose n hits out of N genes

k hits fall intothe GO term ofsize m

The other n-k hitsare all genes outside

the GO term

Hyper-geometric test: example

N

nk

m

pvalue = phyper( k , m , N-m , n , lower.tail=FALSE)

Hold these fixed:N = 10 000m = 200

See what happens if we vary:k = 1,2,…, 30n = 50, 100, 200, 300, 400Number k of hits in GO term

p-va

lue

[log1

0]

50100

200

300

400

k or more!

Gene Set Enrichment Analysis (GSEA)

Weak StrongPhenotype


No significant trend

Significant trend

Subramanian et al. (2005)

GSEA: construction


Nr. non-hits before iNr. all non-hits

i

Nr. hits before iNr. all hits if p = 0phenotype before i sum all phenotype if p = 1

GSEA: examples

Differential GSEA



Phenotype 1

Phenotype 2

PROs and CONs

Gene set 1Gene set 2Gene set 3Gene set 4

…

e.g. Wnt pathway

DNA repairChromosome 1

Expressed in liver

p-values (hyper-geometric or GSEA)

Advantages:• standard analysis• comprehensive first overview• “unbiased” and “hypothesis-free”

Disadvantages:• “unbiased” and “hypothesis-free”• relies on known gene sets• can not uncover new pathways• pathway = “unconnected” gene set• soon: more gene sets than genes!

Result:

Sub-networks rich in hits

Sub-networkswith highly correlated phenotypes

Which networks?

Networks from large-scale experiments

Networks from analyzing theexperimental literature

Networks fromprobabilistic dataintegration

HTSanalyzeR

Overview






Schadt’s “causal” networks

Nature Genetics 2005

Really really reallycomplicated model

Bayesian network: DefinitionA Bayesian network for

X = (X1, .., Xn) consists of1. A directed acyclic graph (DAG) with

vertices corresponding to X1, .., Xn

2. for each node a conditionalprobability distribution (LPD) P(Xi |Pai)

3. Each LPD is parameterized by (avector) i

1 2

34

5

P (X3 | X1, X2)

P (X5 | X4, X3)

P (X1) P (X2)

P (X4)P (X3 | X1, X2 , 3)

P (X5 | X4, X3 , 5)

P (X1 | 1) P (X2 | 2)

P (X4 | 4)

There is nothing

really ‘Bayesian’

about Bayesian

networks so far

Joint distribution

“Naïve” chain rule: for every (X1, .., Xn) the jointdistribution can be factorized as

… but sparser by using conditional independencerelations encoded in the DAG:

Conditional independence

Single family Bayesian network

Directed acyclic graph defines families: gene andregulators.

homehome.8.1.1

homework.1.8.1

workhome.1.8.1

workwork.1.1.8

cryin

gOk hap

py

Paren

t 1

Paren

t 2

P (child | parents)

Relation of parents to child is described byconditional distribution

X|PaX=u

Which model is the right one?The one that fits the data best!

Model selection: Akaike Information Criterion (AIC)– find parameters that maximize the likelihood for each model– subtract number of parameters to penalize complex models– select model with highest AIC value

P(L,R,C) = P(L) P(R | L) P(C | R)

P(L,R,C) = P(L) P(R | L) P(C | R)

P(L,R,C) = P(L) P(R | L) P(C | R)

Score based

model selection

Conditional independence tests

P(L,R,C) = P(L) P(R | L) P(C | R)

P(L,R,C) = P(L) P(R | L) P(C | R)

P(L,R,C) = P(L) P(R | L) P(C | R)

• L is correlated with R and C• C is independent of L given R(= partial cor. between C and L equals 0)shrinkage tests in GeneNet package

Assumptions of “causal” networks

• Inference from observational data based onconditional independence

• In general: direction of edges not determined(partial correlation still only a correlation)

• In Schadt’s scenario: genomic locus serves asanchor to direct edges

• This assumption needs to be checked, e.g. in cancergenomic alterations are not fixed but evolutionaryselected.

Overview






Yinyin Yuan

Automated quantitative cellularitycorrection

Yinyin Yuan

SVM Spatialsmoothing

Quantitativecellularity score in [0,1]

# cancer# all cells

H&E stain

Oscar Rueda(Caldas lab)

Impact of CNA on expression

Baseline

cis-effect

cis- andtrans-effects

Gen

e-ex

pres

sion Copy-number

Yinyin Yuan

Differential regulationSubtype A, eg ER+ breast cancer

Subtype B, eg ER- breast cancer

Yinyin Yuan

Differential regulationG

ene

Expr

essi

on

Copy-number

Differential“network”

Reference“network”

solved by Lasso

Yinyin Yuan

Yinyin Yuan

Data from Chin et al, 07

21q22.3Amp with few

cis-changes

concomitantCN signaturesonly

Hotspots: ER+ versus ER-

6q22.31WISP3/PPAC/PPD

Wnt-1 inducible protein

“The end of thescreen is thebeginning of theexperiment”

Boutros and Ahringer 2008

high-throughput

experiment

validation

Anatomy of the NF B pathway


Step 1

Step 2

Hits

Knock-downKnown pathway membersNew RNAi Hits

Compareexpression phenotypesby NEMs

NF B

Roland Schwarz+ Meyer lab @MPI IB Berlin

?

TF1 TF2

Kinase TF1 TF2

TF3

Nested effect models: subset relationsOther statistical methods: similarity

Markowetz et al 2005, 2007

Tresch and Markowetz 2008

Nested Effects Models

Nested Effects Models

Inferred pathway

A B

C D E F

G H

Phenotypic profiles

A

B

C

F

D

H

E

GGen

e pe

rtur

batio

ns

Effects

1. Set of candidate pathway genes2. High-dimensional phenotypic profile, e.g. microarrayINPUT

OUTPUT Graph explaining the phenotypes

Roland Schwarz

Nested Effect Models for NF B

Summary






data visualization result paper insight

Today

Tomorrow ??

tool

A B C+p

better algorithms better questions

Network biology

Networks

“We do network analysis”what network?

what analysis?

“The network is a model”Model of what?

mechanistic model or predictive?

“The network”‘The’ or ‘A’?

Which one?

Is the network the solution?More than (just) a cluster?

Would anotherrepresentation be clearer?

Shameless self-promotion

http://www.markowetzlab.org/software/

Software

• HTSanalyzeR provides an integrated interface toenrichment and network analysis.

• DANCE quantifies the impact of genomicalterations on gene expression and compares itbetween tumour sub-types.

• lol contains various optimization methods formatrix-to-matrix Lasso inference

• nem infers Nested Effects Models from data

Some take home messages

• Large chunks of network analysis are based onclustering and enrichment.

• Most of the rest is based on conditionalindependence and (sparse) regression.

• Big networks tend to be hairy. Avoid the hairball byasking more focused questions.

• Network analysis is even (more?) useful whentargeting a single pathway .

Practical session

Clustering• Hierarchical clustering, BHC: Bayesian hierarchical

clustering, iCluster: integrative clusteringEnrichment• Hypergeometric test, Gene set enrichment analysis,

BioNet: rich subnetworks, HTSanalyzeRNetworks• Schadt’s ‘causal’ networks, conditional independence• DANCE: Differential regulation in different disease sub-

types

Yinyin Yuan Mauro Castro

the team

Cambridge Breast CancerFunctional Genomics

Carlos CaldasSuet-Feung Chin

Stefan GräfOscar RuedaZhihao Ding

Irene PapatheodorouElena Provenzano

Helen BardwellHans Kristian Moen Vollan

Edouard Hatton

Cambridge Ovarian CancerFunctional Genomics

James Brenton

Cambridge CompBio & StatsSimon TavaréChristina Curtis

Andy LynchShamith Samarajiwa

Doug SpeedInma Spiteri

Cambridge CompBioFlorian Markowetz

Yinyin Yuan

CRI Bioinformatics CoreMatt EldridgeMark DunningRoslin RussellKevin Howe

British Columbia Cancer AgencySamuel Aparicio

Sohrab ShahGulisa Turashvili

Gavin HaDavid Huntsman

John Fee

NottinghamIan Ellis

Sarah WattsAndrew Green

OsloAnne-Lise Børresen-Dale

Anita Langerød

CRI Genomics CoreJames Hadfield

Michele OsborneSara SayaleroClaire FieldingSarah Moffatt

CRI Histopathology CoreWill Howat

CRI BiorepositoryBob Geraghty

Maria Vias

Addenbrooke’s HospitalGordon Wishart

Stephen ShamuthBeverly Haynes

Linda Jones

Strangeway’s LaboratoryPaul Pharoah

Mitul ShahPatricia Harrington

Guy’s /St. ThomasArnie Purushotham

Sarah PinderAndy Tutt

Cheryl GillettRobert Springall

ManitobaLee Murphy

Michelle ParisienSandy Troup

Leigh MurphyEtienne Leygue

METABRIC

Florian Markowetzmarkowetzlab.org

Analysis ofglobally coherent datasets

Thank you!

analysis of datasets globally coherent

Documents