MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Pengyu Hong, 10/06/2005


Page 1: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Pengyu Hong

10/06/2005

Page 2: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Motivation

• Understand transcriptional regulation

Figure labels: TF, Gene X, mRNA transcript

• Model transcriptional regulatory networks

Figure labels: binding sites, regulators, genes

Page 3: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Motivation

Previous work on motif finding

• AlignACE (Hughes et al 2000)
• ANN-Spec (Workman et al 2000)
• BioProspector (Liu et al 2001)
• Consensus (Hertz et al 1999)
• Gibbs Motif Sampler (Lawrence et al 1993)
• LogicMotif (Keles et al 2004)
• MDScan (Liu et al 2002)
• MEME (Bailey and Elkan 1995)
• Motif Regressor (Conlon et al 2003)
• …

Page 4: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Motivation

A widely used model – Motif Weight Matrix (Stormo et al 1982)

Position      1      2      3      4      5      6      7      8
A          0.19   1.11  -0.17   1.65  -2.65  -2.66  -1.98   0.92
C         -0.14  -0.49   1.89  -1.81   1.70   2.32   2.14  -2.07
G         -1.39   0.25  -1.22  -1.07  -2.07  -2.07  -2.07   1.13
T          0.86  -1.39  -2.65  -2.65   0.41  -2.65  -1.16  -1.80

Site: • • • A A C A T C C G • • •

Score of the site = 0.19 + 1.11 + 1.89 + 1.65 + 0.41 + 2.32 + 2.14 + 1.13 = 10.84

The score is compared against a threshold. A sequence is a target if it contains a binding site (score > threshold).

Computational << Molecular
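The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the names `WEIGHTS`, `site_score`, and `is_target` are hypothetical, only the forward strand is scanned, and the threshold passed in is an arbitrary example value.

```python
# Weight matrix from the slide: one row per base, one column per motif position.
WEIGHTS = {
    "A": [0.19, 1.11, -0.17, 1.65, -2.65, -2.66, -1.98, 0.92],
    "C": [-0.14, -0.49, 1.89, -1.81, 1.70, 2.32, 2.14, -2.07],
    "G": [-1.39, 0.25, -1.22, -1.07, -2.07, -2.07, -2.07, 1.13],
    "T": [0.86, -1.39, -2.65, -2.65, 0.41, -2.65, -1.16, -1.80],
}
WIDTH = 8  # number of motif positions


def site_score(site):
    """Sum the matrix entry of the observed base at each position."""
    return sum(WEIGHTS[base][pos] for pos, base in enumerate(site))


def is_target(sequence, threshold):
    """True if any window of the sequence scores above the threshold."""
    windows = (sequence[i:i + WIDTH] for i in range(len(sequence) - WIDTH + 1))
    return any(site_score(window) > threshold for window in windows)


print(round(site_score("AACATCCG"), 2))  # the site shown on the slide
```

Calling `is_target("TTAACATCCGTT", 10.0)` finds the embedded site, while a sequence with no high-scoring window is rejected.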

Page 5: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Motivation

Non-linear binding effects, e.g., different binding modes.

Preferred binding
Mode 1: • • • CACCCATACAT • • •
Mode 2: • • • CATCCGTACAT • • •

Non-preferred binding
Mode 3: • • • CACCCGTACAT • • •
Mode 4: • • • CATCCATACAT • • •

All four modes match: • • • CA C/T CC A/G TACAT • • •
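A short derivation shows why an additive weight matrix cannot separate such modes. It rests on assumptions not spelled out on the slide: the two variable positions are called 3 and 6 here, the preferred modes are taken to pair C with A and T with G, the non-preferred modes pair C with G and T with A, $w_j(b)$ denotes the matrix entry for base $b$ at position $j$, and $c$ is the shared contribution of the fixed positions.

```latex
\begin{align*}
\mathrm{score}(C,A) + \mathrm{score}(T,G)
  &= 2c + w_3(C) + w_6(A) + w_3(T) + w_6(G) \\
  &= \mathrm{score}(C,G) + \mathrm{score}(T,A)
\end{align*}
```

Because the two sums are equal, both preferred modes cannot score strictly higher than both non-preferred modes, whatever weights are chosen.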

Page 6: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Modeling

Model a TF-DNA binding classifier as an ensemble model.

$$Q(S_i) = \sum_m \alpha_m q_m(S_i)$$

where $Q$ is the ensemble model, $q_m$ is the $m$th base classifier, and $\alpha_m$ is its weight.

$$\mathrm{Label}(S_i) = \begin{cases} +1 & Q(S_i) > 0 \\ -1 & Q(S_i) \le 0 \end{cases}$$
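As a rough sketch of the ensemble rule, the weighted vote can be written as follows. The function names and the two toy base classifiers are hypothetical stand-ins, and sending a score of exactly zero to the negative class is an assumption.

```python
def ensemble_score(sequence, model):
    """Q(S): weighted sum of base classifier outputs.

    `model` is a list of (alpha_m, q_m) pairs, where q_m maps a
    sequence to a value in [-1, 1].
    """
    return sum(alpha * q(sequence) for alpha, q in model)


def label(sequence, model):
    """+1 if Q(S) > 0, otherwise -1 (ties assumed negative)."""
    return 1 if ensemble_score(sequence, model) > 0 else -1


# Toy base classifiers (illustration only): substring tests in place of
# trained weight-matrix classifiers.
toy_model = [
    (1.0, lambda s: 1.0 if "CACCCA" in s else -1.0),
    (0.6, lambda s: 1.0 if "CATCCG" in s else -1.0),
]

print(label("TTCACCCATT", toy_model), label("GGGGGGGG", toy_model))
```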

Page 7: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Modeling

The $m$th base classifier:

$$q_m(S_i) = \tanh(h_m(S_i))$$

Sequence scoring function:

$$h_m(S_i) = \log\Big(\sum_{s_{ik} \,\mid\, f_m(s_{ik}) > 0} e^{f_m(s_{ik})}\Big)$$

$f_m(s_{ik})$ is a site scoring function (weight matrix + threshold), applied to the $k$th site $s_{ik}$ of sequence $S_i$.

The scoring function considers
(a) the number of matching sites
(b) the degree of matching

Figure: plot of $q_m(S_i)$ against $h_m(S_i)$.
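A minimal sketch of one base classifier follows, under two assumptions that the slide does not state: the site score is taken to be the weight-matrix sum minus the threshold, and a sequence with no site scoring above zero gets $h_m = -\infty$, so that $q_m = -1$. All function and parameter names are hypothetical.

```python
import math


def site_score(site, weights, threshold):
    """f_m(s): weight-matrix sum for one site, minus the threshold (assumed form)."""
    return sum(weights[base][pos] for pos, base in enumerate(site)) - threshold


def sequence_score(sequence, weights, threshold, width):
    """h_m(S): log of the summed exp(f_m) over sites with f_m > 0."""
    total = 0.0
    for start in range(len(sequence) - width + 1):
        f = site_score(sequence[start:start + width], weights, threshold)
        if f > 0:
            total += math.exp(f)
    # No matching site: treat the score as minus infinity (assumption).
    return math.log(total) if total > 0 else -math.inf


def base_classifier(sequence, weights, threshold, width):
    """q_m(S) = tanh(h_m(S)), a confidence in [-1, 1]."""
    return math.tanh(sequence_score(sequence, weights, threshold, width))


# Tiny two-position matrix that favours "AA" (illustration only).
toy = {"A": [1.0, 1.0], "C": [-1.0, -1.0], "G": [-1.0, -1.0], "T": [-1.0, -1.0]}
print(base_classifier("AACC", toy, 1.0, 2), base_classifier("CCCC", toy, 1.0, 2))
```

With this toy matrix, a sequence holding two matching sites ("AAA") scores higher than one holding a single site ("AACC"), which reflects point (a) above; a stronger match raises $f_m$ and so reflects point (b).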

Page 8: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Training – Boosting

Modify the confidence-rated boosting (CRB) algorithm (Schapire et al. 1999) to train ensemble models

$$Q(S_i) = \sum_m \alpha_m q_m(S_i)$$

(a) Decide the number of base classifiers.
(b) Learn the parameters of each base classifier and its weight.

Page 9: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Why Boosting?

Boosting is a Newton-like technique that iteratively adds base classifiers to minimize the upper bound on the training error.

Figure labels: training error, margin of training samples, generalization error (Schapire et al. 1998)

Page 10: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Challenges

• Positive sequences – targets of a TF
• Negative sequences

1. Sequences are labeled, but not the sites in the sequences.
2. Positive and negative sequences cannot be well separated by the weight matrix model (linear).
3. Number of negative sequences >> number of positive sequences.

Page 11: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Boosting

Initialization

Figure legend: • Positive • Negative

Total weight of the positive samples == Total weight of the negative samples.

Since the motif must be an enriched pattern in the positive sequences, use Motif Regressor to find a seed motif matrix $W_0$.
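One simple way to satisfy the balance condition on the initial weights is sketched below. The uniform split within each class and the normalization to a total of 1 are assumptions; the slide only requires that the two class totals be equal.

```python
def initial_weights(labels):
    """Give each class a total weight of 0.5, spread evenly within the class.

    `labels` holds +1 for positive and -1 for negative sequences.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return [0.5 / n_pos if y == 1 else 0.5 / n_neg for y in labels]


# One positive and four negatives: the positive starts with weight 0.5,
# each negative with 0.125.
print(initial_weights([1, -1, -1, -1, -1]))
```

Balancing the totals keeps the many negative sequences from dominating the first round of training.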

Page 12: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Boosting

Train a base classifier (BC)

Use the seed matrix $W_0$ plus a second term (its symbol is missing from the transcript) to initialize the $m$th base classifier $q_m(\cdot)$ and let $\alpha_m = 1$.

Refine $\alpha_m$ and the parameters of $q_m(\cdot)$ to minimize

$$\sum_i d_i^m \exp(-\alpha_m y_i q_m(S_i))$$

where $y_i$ is the label of $S_i$ and $d_i^m$ is the weight of $S_i$ in the $m$th round.

Negative information is explicitly used to train $q_m(\cdot)$ and $\alpha_m$.

Figure legend: • Positive • Negative; BC 1
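The weighted exponential loss can be evaluated directly, and for fixed base classifier outputs it is convex in the weight $\alpha_m$. The sketch below fits only $\alpha_m$ by bisection on the slope; it does not refine the weight matrix itself, which the slide does jointly. The function names, the search interval `alpha_max`, and the iteration count are hypothetical choices.

```python
import math


def weighted_loss(alpha, weights, labels, outputs):
    """Sum over i of d_i * exp(-alpha * y_i * q(S_i))."""
    return sum(d * math.exp(-alpha * y * q)
               for d, y, q in zip(weights, labels, outputs))


def loss_slope(alpha, weights, labels, outputs):
    """Derivative of weighted_loss with respect to alpha."""
    return sum(-y * q * d * math.exp(-alpha * y * q)
               for d, y, q in zip(weights, labels, outputs))


def fit_alpha(weights, labels, outputs, alpha_max=10.0, iterations=60):
    """Find the alpha in [0, alpha_max] where the slope crosses zero."""
    low, high = 0.0, alpha_max
    if loss_slope(low, weights, labels, outputs) >= 0:
        return low   # the classifier is no better than chance
    if loss_slope(high, weights, labels, outputs) <= 0:
        return high  # loss still falling at the cap
    for _ in range(iterations):
        middle = (low + high) / 2
        if loss_slope(middle, weights, labels, outputs) < 0:
            low = middle
        else:
            high = middle
    return (low + high) / 2


# Four equally weighted samples; the last one is misclassified.
d, y, q = [0.25] * 4, [1, 1, -1, -1], [0.5, 0.5, -0.5, 0.5]
print(fit_alpha(d, y, q))
```

In this toy case the minimum can be checked by hand: the slope vanishes where $e^{\alpha} = 3$.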

Page 13: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Boosting

Adjust the sample weights, giving higher weights to previously misclassified samples.

$$d_i^{m+1} = \frac{d_i^m \exp(-\alpha_m y_i q_m(S_i))}{\sum_j d_j^m \exp(-\alpha_m y_j q_m(S_j))}$$

• $y_i$ is the label of $S_i$.
• $d_i^m$ is the weight of $S_i$ in the $m$th round.
• $d_i^{m+1}$ is the new weight of $S_i$.

Figure legend: • Positive • Negative; BC 1
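The reweighting step can be sketched as below, assuming the new weights are renormalized to sum to 1. The function name is hypothetical, and the example values are chosen only so the result is easy to check by hand.

```python
import math


def update_weights(weights, labels, outputs, alpha):
    """Scale each weight by exp(-alpha * y_i * q(S_i)), then renormalize."""
    raw = [d * math.exp(-alpha * y * q)
           for d, y, q in zip(weights, labels, outputs)]
    total = sum(raw)
    return [value / total for value in raw]


# Four equally weighted samples; the last one is misclassified (y * q < 0).
new = update_weights([0.25] * 4, [1, 1, -1, -1], [0.5, 0.5, -0.5, 0.5],
                     alpha=math.log(3))
print(new)
```

Here the single misclassified sample ends up carrying half of the total weight, so the next base classifier is pushed to get it right.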

Page 14: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Boosting

Add a new base classifier

Figure legend: • Positive • Negative; BC 1, BC 2

Page 15: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Boosting

Add a new base classifier

Figure legend: • Positive • Negative; decision boundary

Page 16: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Boosting

Adjust sample weights again

Figure legend: • Positive • Negative; decision boundary

Page 17: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Boosting

Add one more base classifier

Figure legend: • Positive • Negative; BC 3

Page 18: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Boosting

Add one more base classifier

Figure legend: • Positive • Negative; decision boundary

Page 19: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Boosting

Stop if the result is perfect or the performance on the internal validation sequences drops.

Figure legend: • Positive • Negative; decision boundary
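The full loop, including both stopping rules, might look like the sketch below. It is a simplification under stated assumptions, not the author's procedure: `fit_base` is a hypothetical callable that returns a trained base classifier with outputs in [-1, 1], and the weight $\alpha_m$ is set with the closed-form approximation $\frac{1}{2}\ln\frac{1+r}{1-r}$ from confidence-rated boosting rather than refined jointly with the classifier.

```python
import math


def ensemble_label(model, x):
    """Sign of the weighted vote; a score of zero is assumed negative."""
    score = sum(alpha * q(x) for alpha, q in model)
    return 1 if score > 0 else -1


def error_rate(model, samples, labels):
    wrong = sum(1 for x, y in zip(samples, labels)
                if ensemble_label(model, x) != y)
    return wrong / len(samples)


def boost(train, train_labels, valid, valid_labels, fit_base, max_rounds=20):
    # Balanced start: each class holds half of the total weight.
    n_pos = sum(1 for y in train_labels if y == 1)
    n_neg = len(train_labels) - n_pos
    weights = [0.5 / n_pos if y == 1 else 0.5 / n_neg for y in train_labels]
    model, best_valid_error = [], 1.0
    for _ in range(max_rounds):
        q = fit_base(train, train_labels, weights)
        outputs = [q(x) for x in train]
        # Weighted edge r, clamped so the logarithm stays finite.
        r = sum(d * y * o for d, y, o in zip(weights, train_labels, outputs))
        r = max(min(r, 1 - 1e-9), -1 + 1e-9)
        alpha = 0.5 * math.log((1 + r) / (1 - r))
        model.append((alpha, q))
        valid_error = error_rate(model, valid, valid_labels)
        if valid_error > best_valid_error:
            model.pop()  # validation performance dropped: discard and stop
            break
        best_valid_error = valid_error
        if error_rate(model, train, train_labels) == 0:
            break        # the result is perfect on the training set
        raw = [d * math.exp(-alpha * y * o)
               for d, y, o in zip(weights, train_labels, outputs)]
        total = sum(raw)
        weights = [value / total for value in raw]
    return model
```

A usage example with a hand-written learner: `boost([-2, -1, 1, 2], [-1, -1, 1, 1], [-3, 3], [-1, 1], lambda s, l, w: (lambda x: 1.0 if x > 0 else -1.0))` stops after one round because the first classifier already separates the training data.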

Page 20: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Results

Data: ChIP-chip data of Saccharomyces cerevisiae (Lee et al. 2002)

– Positive sequences
   – p-value < 0.001
   – Number of positive sequences ≥ 25.
– Negative sequences
   – p-value ≥ 0.05 & ratio ≤ 1

This yields 40 TFs.

Page 21: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Results

Boosted models vs. seed weight matrices: leave-one-out test results

Figure: bar chart. Horizontal axis: TFs. Vertical axis: improvements on specificity. TFs and values are listed below in chart order, left to right.

TF       Improvement
FKH1      0.00%
FKH2      0.00%
RLM1      0.00%
YAP6      0.00%
ABF1      2.43%
REB1      3.70%
FHL1      5.79%
CAD1      7.97%
NRG1      9.17%
MBP1     13.17%
CIN5     14.14%
GCN4     14.24%
SMP1     14.41%
SUM1     14.48%
HAP4     14.90%
DAL81    15.02%
SKN7     16.02%
BAS1     16.41%
ACE2     18.23%
MCM1     18.74%
SWI4     19.01%
STE12    19.44%
SWI6     21.06%
CBF1     24.44%
HSF1     25.18%
YAP5     25.28%
YAP1     26.30%
SWI5     27.66%
PDR1     30.25%
PHD1     34.24%
RAP1     38.96%

The slide's formula uses FP with subscripts W (seed weight matrix) and BW (boosted model).

Page 22: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Results

RAP1: Capture Position-Correlation

Figure panels: Weight Matrix; Boosting (Base classifier 1, Base classifier 2, Base classifier 3). Scale labels: +, 0.

Page 23: MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Results

REB1: Capture Position-Correlation

Figure panels: Weight Matrix; Boosting (Base classifier 1, Base classifier 2).