
CISC 841 Bioinformatics

Combining HMMs with SVMs

Li Liao, CISC841, F07

HMM gradients

• Fisher score: $U_X = \nabla_\theta \log P(X \mid H, \theta)$

• The gradient of a sequence X with respect to a given model is computed using the forward-backward algorithm.

• Each dimension corresponds to one parameter of the model.

• The feature space is tailored to the sequences from which the model was trained.
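The forward-backward computation mentioned above can be sketched as follows. This is a minimal illustration, not the course's implementation: it covers emission parameters only, uses the form $U_{ij} = E_j(i)/e_j(i) - \sum_k E_j(k)$ given by Jaakkola et al., and skips the scaling or log-space arithmetic that long sequences would need. The function names and the two-state model values are assumptions made for the example.

```python
def forward(obs, start, trans, emit):
    """alpha[t][j] = P(x_1..x_t, state at t is j)."""
    n = len(start)
    alpha = [[start[j] * emit[j][obs[0]] for j in range(n)]]
    for x in obs[1:]:
        prev = alpha[-1]
        alpha.append([
            emit[j][x] * sum(prev[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ])
    return alpha


def backward(obs, trans, emit):
    """beta[t][i] = P(x_{t+1}..x_T | state at t is i)."""
    n = len(trans)
    beta = [[1.0] * n]
    for x in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [
            sum(trans[i][j] * emit[j][x] * nxt[j] for j in range(n))
            for i in range(n)
        ])
    return beta


def fisher_score_emissions(obs, start, trans, emit):
    """Fisher score of one sequence w.r.t. the emission parameters."""
    alpha = forward(obs, start, trans, emit)
    beta = backward(obs, trans, emit)
    likelihood = sum(alpha[-1])
    n_states, n_symbols = len(emit), len(emit[0])
    # E[j][i]: expected number of times state j emits symbol i, given obs.
    E = [[0.0] * n_symbols for _ in range(n_states)]
    for t, x in enumerate(obs):
        for j in range(n_states):
            E[j][x] += alpha[t][j] * beta[t][j] / likelihood
    # U[j][i] = E_j(i) / e_j(i) - sum_k E_j(k): one dimension per parameter.
    return [
        [E[j][i] / emit[j][i] - sum(E[j]) for i in range(n_symbols)]
        for j in range(n_states)
    ]


# Hypothetical two-state HMM over a two-symbol alphabet (illustrative values).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
U = fisher_score_emissions([0, 1, 1, 0], start, trans, emit)
```

Each entry of `U` corresponds to one emission parameter, so sequences of any length map to vectors of the same fixed dimension.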


SVM-Fisher discrimination

A probabilistic hidden Markov model is trained from some example sequences $x_1, x_2, x_3, \ldots, x_N$.

Usually the model probability $P(x_i \mid \theta)$ (or a function of $P(x_i \mid \theta)$) is used as a measure of sequence-model membership, and a threshold on this measure decides membership.

The Fisher vector is the vector of gradients of $P(x_i \mid \theta)$ (or of a function of $P(x_i \mid \theta)$, such as its logarithm) with respect to the parameters of the model: $U_{x_i} = \nabla_\theta P(x_i \mid \theta)$

One can take the training example sequences (positive set) and other sequences that are known to be non-members (negative set), and transform them into Fisher vectors.

A Support Vector Machine (SVM) can be trained using the positive and negative Fisher vectors, and can be used to classify other sequences.
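The training step can be illustrated with a small linear SVM. The sketch below uses the Pegasos stochastic subgradient method as a dependency-free stand-in for a full SVM package, with no bias term and no kernel. The two-dimensional "Fisher vectors" are made-up values; real Fisher vectors have one entry per model parameter.

```python
import random


def train_linear_svm(vectors, labels, lam=0.1, n_steps=2000, seed=0):
    """Pegasos: stochastic subgradient descent on the hinge-loss objective."""
    rng = random.Random(seed)
    w = [0.0] * len(vectors[0])
    for t in range(1, n_steps + 1):
        i = rng.randrange(len(vectors))
        x, y = vectors[i], labels[i]
        eta = 1.0 / (lam * t)
        margin = y * sum(wk * xk for wk, xk in zip(w, x))
        # Shrink w (regularisation), then step towards x if it violates the margin.
        w = [(1.0 - eta * lam) * wk for wk in w]
        if margin < 1.0:
            w = [wk + eta * y * xk for wk, xk in zip(w, x)]
    return w


def classify(w, x):
    """Return +1 (member) or -1 (non-member)."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) >= 0 else -1


# Hypothetical Fisher vectors for the positive and negative training sets.
positive = [[2.0, 2.0], [3.0, 1.0]]
negative = [[-2.0, -1.0], [-1.0, -3.0]]
w = train_linear_svm(positive + negative, [1, 1, -1, -1])
predictions = [classify(w, x) for x in positive + negative]
```

A new sequence would be converted to its Fisher vector with the same trained HMM and then passed to `classify`.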



Application: Protein remote homology detection

SVM-Pairwise method

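[Figure: SVM-Pairwise workflow. Protein homologs (positive train) and protein non-homologs (negative train) are converted into positive and negative pairwise score vectors, which train a support vector machine. The trained machine then gives a binary classification for a target protein of unknown function (testing data). The steps are numbered 1 to 3 in the figure.]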


Experiment: known protein families

Jaakkola, Diekhans and Haussler 1999

Sample family sizes

Family ID   Positive train   Positive test   Negative train   Negative test
1.27.1.1    12               6               2890             1444
1.27.1.2    10               8               2408             1926
1.36.1.1    29               7               3477             839
1.4.1.1     26               23              2256             1994
2.1.1.3     113              8               3895             275
3.1.8.3     17               10              2686             1579
3.2.1.5     46               7               3732             567
2.44.1.2    11               140             307              3894

A measure of sensitivity and specificity

[Figure: examples with ROC = 1, ROC = 0.67, and ROC = 0.]

ROC: the receiver operating characteristic score is the normalized area under a curve that plots true positives as a function of false positives.
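The ROC score can be computed directly from classifier scores. The sketch below relies on the fact that the normalized area under the true-positive versus false-positive curve equals the fraction of positive-negative pairs in which the positive is ranked higher. Counting ties as half a pair is an assumption of this example, and the function name is hypothetical.

```python
def roc_score(positive_scores, negative_scores):
    """Normalized area under the true-positive vs. false-positive curve."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counted as half (assumption)
    return wins / (len(positive_scores) * len(negative_scores))


perfect = roc_score([0.9, 0.8], [0.3, 0.1])    # every positive above every negative
inverted = roc_score([0.1], [0.5, 0.9])        # every positive below every negative
partial = roc_score([0.6], [0.9, 0.4, 0.2])    # positive beats 2 of 3 negatives
```

Here `perfect` is 1, `inverted` is 0, and `partial` is 2/3, matching the three kinds of case a ROC score distinguishes.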


Application: Discriminating signal peptide from transmembrane proteins

Feature selection

We expect gradients w.r.t transition parameters to be better discrimination features

We look for those transitions that are differentially used by TM proteins and SP proteins

- transform each signal peptide sequence (1275) into a Fisher vector w.r.t. transition parameters and find the resultant vector

- transform each TM sequence into a Fisher vector w.r.t. transition parameters and find the resultant vector

- compare the two resultant vectors
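The comparison of resultant vectors can be sketched as below. Two choices here are assumptions rather than details from the slides: each resultant is divided by its set size so that the 1275 SP sequences and the smaller TM set are comparable, and transitions are ranked by the absolute difference. The transition names and vector values are made up.

```python
def resultant(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def rank_transitions(sp_vectors, tm_vectors, names):
    """Rank transition parameters by how differently the two classes use them."""
    # Normalise by set size (assumption) so unequal set sizes do not dominate.
    sp_mean = [v / len(sp_vectors) for v in resultant(sp_vectors)]
    tm_mean = [v / len(tm_vectors) for v in resultant(tm_vectors)]
    diffs = [(abs(t - s), name) for s, t, name in zip(sp_mean, tm_mean, names)]
    return [name for _, name in sorted(diffs, reverse=True)]


# Hypothetical Fisher vectors over three transition parameters.
names = ["a->b", "b->c", "c->d"]
sp_vectors = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
tm_vectors = [[2.5, 4.0, 2.0]]
ranked = rank_transitions(sp_vectors, tm_vectors, names)
```

Transitions at the top of `ranked` are the candidates for discrimination features.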

[Figure: panels labeled SignalP and TM protein.]


Gradients of P(s|x)

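In pattern recognition problems, we are interested in $P(s \mid x, \theta)$ rather than $P(x \mid \theta)$.

$$U_{s|x} = \nabla_\theta \log P(s \mid x, \theta) = \nabla_\theta \log P(s, x \mid \theta) - \nabla_\theta \log P(x \mid \theta)$$

First term:

$$P(s, x) = a_{B s_1} e_{s_1}(x_1) \cdot a_{s_1 s_2} e_{s_2}(x_2) \cdot a_{s_2 s_3} e_{s_3}(x_3) \cdots = \prod_i \left( \frac{\theta_i}{\sum_a \theta_a} \right)^{n_i(s,x)}$$

where $n_i(s,x)$ is the number of times $\theta_i$ is used, and $\sum_a \theta_a = 1$.

$$\frac{\partial P(x, s)}{\partial \theta_k} = \frac{1 - \theta_k}{\theta_k} \, n_k(s,x) \, P(s,x)$$

Dividing by $P(s,x)$ gives

$$\frac{\partial \log P(s, x)}{\partial \theta_k} = \frac{m_k(x)}{\theta_k} - m_k(x)$$

where $m_k(x)$ is the expected number of times $\theta_k$ is used in $x$ following the given path $s$.

Second term:

$$P(x) = \sum_\pi P(x, \pi)$$

$$P(x, \pi) = a_{0 \pi_1} e_{\pi_1}(x_1) \cdot a_{\pi_1 \pi_2} e_{\pi_2}(x_2) \cdot a_{\pi_2 \pi_3} e_{\pi_3}(x_3) \cdots = \prod_i \left( \frac{\theta_i}{\sum_a \theta_a} \right)^{n_i(\pi,x)}$$

where $n_i(\pi,x)$ is the number of times $\theta_i$ is used, and $\sum_a \theta_a = 1$.

$$\frac{\partial \log P(x)}{\partial \theta_k} = \frac{1}{P(x)} \sum_\pi \frac{\partial P(x, \pi)}{\partial \theta_k}$$

But

$$\frac{\partial P(x, \pi)}{\partial \theta_k} = \frac{1 - \theta_k}{\theta_k} \, n_k(\pi,x) \, P(x, \pi)$$

Thus

$$\frac{\partial \log P(x)}{\partial \theta_k} = \sum_\pi \frac{1 - \theta_k}{\theta_k} \, n_k(\pi,x) \, \frac{P(x, \pi)}{P(x)} = \sum_\pi \frac{1 - \theta_k}{\theta_k} \, n_k(\pi,x) \, P(\pi \mid x) = \frac{n_k(x)}{\theta_k} - n_k(x)$$

where $n_k(x)$ is the expected number of times $\theta_k$ is used in $x$ following any path.

Finally, for each parameter $\theta_k$:

$$U_{s|x} = \frac{m_k(x)}{\theta_k} - m_k(x) - \frac{n_k(x)}{\theta_k} + n_k(x)$$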
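As a small numeric illustration of the final expression, the sketch below evaluates one component of Us|x per parameter from three inputs: the parameter values, the usage counts along the given path s (m_k), and the expected usage counts over all paths (n_k). The function name and all numbers are hypothetical; in practice the counts would come from the labeled path and from the forward-backward algorithm.

```python
def posterior_path_score(theta, path_counts, expected_counts):
    """One component per parameter k: m_k/theta_k - m_k - n_k/theta_k + n_k."""
    return [
        m / th - m - n / th + n
        for th, m, n in zip(theta, path_counts, expected_counts)
    ]


# Hypothetical values for two parameters.
theta = [0.5, 0.25]          # parameter values theta_k
path_counts = [2.0, 0.0]     # m_k: uses of theta_k along the given path s
expected_counts = [1.5, 1.0] # n_k: expected uses of theta_k over all paths
U = posterior_path_score(theta, path_counts, expected_counts)
```

With these inputs the components are 0.5 and -3.0. A parameter used more often on the given path than expected over all paths gets a positive component, and one used less often gets a negative component.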

Classification experiment

10-fold cross validation experiment using

- positive set (247 TM proteins)

- negative set (1275 signal peptide containing proteins)

SVM-light package is used.

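[Figure: experiment pipeline. Subsets of the 247 TM proteins and subsets of the 1275 SP proteins are converted from sequence to vector (x to Us|x) by TMMOD. The vectors go to SVM Learn, and the resulting SVM Classifier labels the held-out sequences.]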
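The 10-fold split over the two sets can be sketched as follows. Stratifying the folds, so that every fold contains both TM and SP examples in roughly the original proportion, is an assumption of this example; the slides do not say how the subsets were drawn. Function names are hypothetical, and indices 0 to 246 stand for the TM proteins while the rest stand for the SP proteins.

```python
import random


def stratified_folds(n_positive, n_negative, k=10, seed=0):
    """Split example indices into k folds, keeping both classes in every fold."""
    rng = random.Random(seed)
    pos = list(range(n_positive))
    neg = list(range(n_positive, n_positive + n_negative))
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Deal indices out round-robin so fold sizes differ by at most one per class.
    return [sorted(pos[i::k] + neg[i::k]) for i in range(k)]


def train_test_splits(folds):
    """Yield (train, test) index lists, holding out each fold once."""
    for held_out, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train_idx, test_idx


folds = stratified_folds(247, 1275)
splits = list(train_test_splits(folds))
```

Each of the ten `(train, test)` pairs would be used to train one classifier on nine folds and evaluate it on the remaining fold.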


Discrimination results

Results

About a third (68 of 185) of the SP proteins that TMMOD alone incorrectly classified as TM proteins are identified correctly once SVM-Fisher is added.

Method               TM proteins incorrectly classified as SP proteins   SP proteins incorrectly classified as TM proteins
Phobius              7.7% (19/247)                                       3.5% (45/1275)
SignalP-NN           42.9%                                               2.3%
SignalP-HMM          19.0%                                               1.4%
TMMOD                6.1% (15/247)                                       14.5% (185/1275)
TMMOD + SVM-Fisher   6.1% (15/247)                                       9.2% (117/1275)
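The percentages for TMMOD and TMMOD + SVM-Fisher follow directly from the reported counts, and the same arithmetic gives the number of SP proteins recovered by adding SVM-Fisher. The helper name below is hypothetical; the counts are the ones reported above.

```python
def error_rate(misclassified, total):
    """Percentage of a class that was misclassified, rounded to one decimal."""
    return round(100.0 * misclassified / total, 1)


# Counts reported for SP proteins incorrectly classified as TM proteins.
tmmod_sp_errors, fisher_sp_errors, n_sp = 185, 117, 1275

tmmod_rate = error_rate(tmmod_sp_errors, n_sp)     # TMMOD alone
fisher_rate = error_rate(fisher_sp_errors, n_sp)   # TMMOD + SVM-Fisher
recovered = tmmod_sp_errors - fisher_sp_errors     # SP proteins now classified correctly
recovered_fraction = recovered / tmmod_sp_errors   # share of TMMOD's SP errors removed
```

This gives 14.5% and 9.2% for the two rates, with 68 recovered proteins, a little over a third of TMMOD's SP errors.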



Application: Protein-Protein Interaction Prediction


Interaction Profile Hidden Markov Model (ipHMM)

Friedrich et al. (2006)


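Likelihood Score Vector: $\langle LS_{a_i,A},\ LS_{a_i,B},\ LS_{b_j,A},\ LS_{b_j,B} \rangle$

Fisher Score Vector: $U(x) = \nabla_\theta \log P(x \mid \theta)$, with components $U_{ij} = E_j(i) / e_j(i) - \sum_k E_j(k)$, where $e_j(i)$ is the probability of emitting symbol $i$ in state $j$ and $E_j(i)$ is the expected number of such emissions for $x$.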

Knowledge transfer:

• Build ipHMM from proteins whose structural information is available.

• Align the sequences of proteins whose structural information is not available to the model.


Scheme       Mean ROC score
FS_NM        0.7487
LS           0.7997
FS_IM        0.8202
FS_IM + LS   0.8626

Data set: Friedrich et al. (2006), 2018 proteins in 36 domain families.

Conclusions

• Structural information at binding sites enhances protein-protein interaction prediction.

• The interaction profile HMM can transfer structural information.

• Fisher scores extracted from domain profiles further enhance protein-protein interaction prediction for proteins with no available structural information.

