CISC 841 Bioinformatics
Combining HMMs with SVMs
Li Liao, CISC841, F07
HMM gradients
• Fisher score: $U_X = \nabla_\theta \log P(X \mid H, \theta)$
• The gradient of a sequence X with respect to a given model is computed using the forward-backward algorithm.
• Each dimension corresponds to one parameter of the model.
• The feature space is tailored to the sequences from which the model was trained.
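As a rough illustration of the gradient computation described above, the sketch below runs a scaled forward-backward pass on a toy HMM and returns the derivative of log P(x|θ) with respect to each emission probability. The model values, the function names, and the choice to treat every emission probability as a free parameter (the sum-to-one constraint is not enforced) are assumptions made for illustration only.

```python
import math

def forward_backward(start, trans, emit, obs):
    """Scaled forward-backward pass.
    Returns (log P(x), gamma) with gamma[t][j] = P(state at t is j | x)."""
    S, T = len(start), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [start[j] * emit[j][obs[0]] for j in range(S)]
        else:
            row = [sum(alpha[t - 1][i] * trans[i][j] for i in range(S))
                   * emit[j][obs[t]] for j in range(S)]
        c = sum(row)                      # scaling factor for position t
        scale.append(c)
        alpha.append([v / c for v in row])
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(S)) / scale[t + 1]
    gamma = [[alpha[t][j] * beta[t][j] for j in range(S)] for t in range(T)]
    return sum(math.log(c) for c in scale), gamma

def emission_fisher_score(start, trans, emit, obs):
    """d log P(x | theta) / d e_j(i) for every state j and symbol i.
    Assumption: parameters are treated as free (no sum-to-one constraint)."""
    _, gamma = forward_backward(start, trans, emit, obs)
    S, M = len(emit), len(emit[0])
    counts = [[0.0] * M for _ in range(S)]    # expected emission counts
    for t, sym in enumerate(obs):
        for j in range(S):
            counts[j][sym] += gamma[t][j]
    return [[counts[j][i] / emit[j][i] for i in range(M)] for j in range(S)]

# Hypothetical 2-state, 2-symbol model and sequence (illustrative values).
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.2, 0.8]]
EMIT = [[0.9, 0.1], [0.3, 0.7]]
OBS = [0, 1, 1, 0]
score = emission_fisher_score(START, TRANS, EMIT, OBS)
```

Each entry of `score` corresponds to one emission parameter, so the resulting vector has one dimension per parameter, whatever the length of the input sequence.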
SVM-Fisher discrimination
A probabilistic hidden Markov model is trained from example sequences $x_1, x_2, x_3, \dots, x_N$.
Usually the model probability $P(x_i \mid \theta)$ (or a function of $P(x_i \mid \theta)$) is used as a measure of sequence-model membership, and a threshold on this measure decides membership.
The Fisher vector is the vector of gradients of $P(x_i \mid \theta)$ (or of a function of $P(x_i \mid \theta)$) with respect to the parameters of the model: $U_{x_i} = \nabla_\theta P(x_i \mid \theta)$
One can take the training example sequences (positive set) and other sequences that are known to be non-members (negative set), and transform them into Fisher vectors.
A Support Vector Machine (SVM) can be trained using the positive and negative Fisher vectors, and can be used to classify other sequences.
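The training step can be sketched with a very small linear soft-margin SVM fitted by subgradient descent on the hinge loss. This is a stand-in for a real SVM package, not the method used in the experiments; the toy two-dimensional vectors play the role of Fisher vectors, and the function names, learning rate and regularization value are all hypothetical.

```python
def train_linear_svm(vectors, labels, lam=0.01, eta=0.1, epochs=100):
    """Minimize lam/2 * |w|^2 + hinge loss by plain subgradient descent.
    labels must be +1 (member) or -1 (non-member)."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - eta * lam) for wi in w]    # regularization shrink
            if margin < 1.0:                             # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def classify(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy stand-ins for Fisher vectors of the positive and negative sets.
POSITIVE = [[2.0, 1.0], [1.5, 2.0], [2.5, 1.5]]
NEGATIVE = [[-1.0, -0.5], [-2.0, -1.0], [-1.5, -2.0]]
X = POSITIVE + NEGATIVE
Y = [1] * len(POSITIVE) + [-1] * len(NEGATIVE)

w, b = train_linear_svm(X, Y)
predictions = [classify(w, b, x) for x in X]
```

In practice the input vectors would be the Fisher vectors of the positive and negative sequences, and a kernel SVM from an established package would replace this toy trainer.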
Application: Protein remote homology detection
SVM-Pairwise method
[Figure: three-step workflow. Protein homologs and protein non-homologs are converted into positive and negative pairwise score vectors (positive and negative training sets); a support vector machine is trained for binary classification; the classifier is applied to a target protein of unknown function (testing data).]
Experiment: known protein families
Jaakkola, Diekhans and Haussler 1999
Sample family sizes
Family ID   Positive train   Positive test   Negative train   Negative test
1.27.1.1    12               6               2890             1444
1.27.1.2    10               8               2408             1926
1.36.1.1    29               7               3477             839
1.4.1.1     26               23              2256             1994
2.1.1.3     113              8               3895             275
3.1.8.3     17               10              2686             1579
3.2.1.5     46               7               3732             567
2.44.1.2    11               140             307              3894
A measure of sensitivity and specificity
[Figure: example rankings with ROC = 1, ROC = 0 and ROC = 0.67]
ROC: the receiver operating characteristic score is the normalized area under a curve that plots true positives as a function of false positives.
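A minimal sketch of the ROC score, assuming each test sequence has a real-valued classifier score and a 0/1 label. Instead of tracing the curve, it counts the fraction of (positive, negative) pairs in which the positive example is ranked higher, with ties counted as one half; this pair count equals the normalized area under the true-positive versus false-positive curve. The function name and the toy scores are hypothetical.

```python
def roc_score(scores, labels):
    """Normalized area under the ROC curve via pairwise comparison.
    labels: 1 for a true member, 0 for a non-member."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

# Toy rankings (illustrative values only).
perfect = roc_score([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])           # 1.0
reversed_ = roc_score([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])         # 0.0
mixed = roc_score([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 0, 1, 0])     # 4 of 6 pairs
```

The pairwise form is quadratic in the number of examples, which is fine for a sketch; a sort-based version would be preferable for the thousands of negatives listed in the family table.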
Application: Discriminating signal peptide from transmembrane proteins
Feature selection
We expect gradients w.r.t transition parameters to be better discrimination features
We look for those transitions that are differentially used by TM proteins and SP proteins
- transform each signal peptide sequence (1275) into a Fisher vector w.r.t. transition parameters and find the resultant vector
- transform each TM sequence into a Fisher vector w.r.t. transition parameters and find the resultant vector
- compare the two resultant vectors
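The three steps above can be sketched as follows, taking the resultant to be the component-wise sum of the Fisher vectors in a set. How the two resultants were actually compared is not stated here; dividing by the set size and ranking transitions by absolute difference is an assumption of this sketch, as are the function names and toy vectors.

```python
def resultant(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]

def rank_differential_features(sp_vectors, tm_vectors):
    """Rank feature indices by |mean SP gradient - mean TM gradient|.
    Assumption: resultants are divided by set size so that sets of
    different sizes (e.g. 1275 vs 247) are comparable."""
    sp_mean = [v / len(sp_vectors) for v in resultant(sp_vectors)]
    tm_mean = [v / len(tm_vectors) for v in resultant(tm_vectors)]
    diff = [abs(a - b) for a, b in zip(sp_mean, tm_mean)]
    return sorted(range(len(diff)), key=lambda i: diff[i], reverse=True)

# Toy Fisher vectors over three hypothetical transition parameters.
SP = [[0.2, 1.0, -0.5], [0.1, 1.2, -0.4], [0.3, 0.8, -0.6]]
TM = [[0.2, -1.0, -0.5], [0.2, -0.9, 0.1]]
ranking = rank_differential_features(SP, TM)
```

Transitions at the front of `ranking` are the ones used most differently by the two sets and are therefore candidate discrimination features.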
[Figure panels: SignalP, TM protein]
Gradients of P(s|x)
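In pattern recognition problems, we are interested in $P(s \mid x, \theta)$ rather than $P(x \mid \theta)$:

$$U_{s|x} = \nabla_\theta \log P(s \mid x, \theta) = \nabla_\theta \log P(s, x \mid \theta) - \nabla_\theta \log P(x \mid \theta)$$

First term:

$$P(s, x) = a_{B s_1} e_{s_1}(x_1) \cdot a_{s_1 s_2} e_{s_2}(x_2) \cdot a_{s_2 s_3} e_{s_3}(x_3) \cdots = \prod_i \left( \frac{\theta_i}{\sum_a \theta_a} \right)^{n_i(s,x)}$$

where $n_i(s,x)$ is the number of times $\theta_i$ is used, and $\sum_a \theta_a = 1$.

$$\frac{\partial P(x, s)}{\partial \theta_k} = \frac{1 - \theta_k}{\theta_k} \, n_k(s,x) \, P(s,x)$$

Dividing by $P(s,x)$ gives the gradient of the logarithm:

$$\frac{\partial \log P(s, x)}{\partial \theta_k} = \frac{m_k(x)}{\theta_k} - m_k(x)$$

where $m_k(x) = n_k(s,x)$ is the number of times $\theta_k$ is used in x following the given path s.

Second term:

$$P(x) = \sum_\pi P(x, \pi)$$

$$P(x, \pi) = a_{0 \pi_1} e_{\pi_1}(x_1) \cdot a_{\pi_1 \pi_2} e_{\pi_2}(x_2) \cdot a_{\pi_2 \pi_3} e_{\pi_3}(x_3) \cdots = \prod_i \left( \frac{\theta_i}{\sum_a \theta_a} \right)^{n_i(\pi,x)}$$

where $n_i(\pi,x)$ is the number of times $\theta_i$ is used, and $\sum_a \theta_a = 1$.

$$\frac{\partial \log P(x)}{\partial \theta_k} = \frac{1}{P(x)} \sum_\pi \frac{\partial P(x, \pi)}{\partial \theta_k}$$

But

$$\frac{\partial P(x, \pi)}{\partial \theta_k} = \frac{1 - \theta_k}{\theta_k} \, n_k(\pi,x) \, P(x,\pi)$$

Thus

$$\frac{\partial \log P(x)}{\partial \theta_k} = \sum_\pi \frac{1 - \theta_k}{\theta_k} \, n_k(\pi,x) \, \frac{P(x,\pi)}{P(x)} = \sum_\pi \frac{1 - \theta_k}{\theta_k} \, n_k(\pi,x) \, P(\pi \mid x) = \frac{n_k(x)}{\theta_k} - n_k(x)$$

where $n_k(x)$ is the expected number of times $\theta_k$ is used in x following any path.

Finally:

$$U_{s|x} = \frac{m_k(x)}{\theta_k} - m_k(x) - \frac{n_k(x)}{\theta_k} + n_k(x)$$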
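To make the final expression concrete, the sketch below evaluates it for the transition parameters of a toy two-state HMM. The expected counts n_k(x) are obtained by brute-force enumeration of all paths rather than by forward-backward, which is only practical for very short sequences. The model values and function names are hypothetical.

```python
from itertools import product

# Hypothetical toy model: 2 states, 2 symbols (illustrative values only).
START = [0.5, 0.5]
TRANS = [[0.8, 0.2], [0.3, 0.7]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]

def joint(path, obs):
    """P(x, path): product of start, transition and emission terms."""
    p = START[path[0]] * EMIT[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= TRANS[path[t - 1]][path[t]] * EMIT[path[t]][obs[t]]
    return p

def transition_counts(path):
    """n_k(path, x) for every transition parameter k = (i, j)."""
    counts = {(i, j): 0 for i in range(2) for j in range(2)}
    for a, b in zip(path, path[1:]):
        counts[(a, b)] += 1
    return counts

def posterior_fisher_vector(path, obs):
    """Component for each transition k:
    m_k/theta_k - m_k - n_k/theta_k + n_k = (m_k - n_k) * (1/theta_k - 1)."""
    paths = list(product(range(2), repeat=len(obs)))
    px = sum(joint(p, obs) for p in paths)          # P(x)
    m = transition_counts(path)                     # counts along given path s
    n = {k: 0.0 for k in m}                         # expected counts, any path
    for p in paths:
        weight = joint(p, obs) / px                 # P(path | x)
        for k, c in transition_counts(p).items():
            n[k] += weight * c
    return {k: (m[k] - n[k]) * (1.0 / TRANS[k[0]][k[1]] - 1.0) for k in m}

obs = [0, 0, 1, 1]
u = posterior_fisher_vector((0, 0, 1, 1), obs)
```

A component is positive when the given path uses a transition more often than expected over all paths, and negative when it uses it less often; averaged over paths weighted by P(s|x), every component is zero.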
Classification experiment
10-fold cross validation experiment using
- positive set (247 TM proteins)
- negative set (1275 signal peptide containing proteins)
SVM-light package is used.
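A minimal sketch of how the two sets could be divided for 10-fold cross validation. How the folds were actually built is not stated; splitting each class separately in round-robin order, so that every fold keeps roughly the 247:1275 class ratio, is an assumption of this sketch, and the function names are hypothetical.

```python
def interleaved_folds(n_items, k=10):
    """Assign item indices 0..n_items-1 to k folds in round-robin order."""
    return [list(range(f, n_items, k)) for f in range(k)]

def cross_validation_splits(n_tm=247, n_sp=1275, k=10):
    """Yield (train, test) lists of (label, index) pairs, one per fold.
    Each class is split separately so every fold keeps the class ratio."""
    tm_folds = interleaved_folds(n_tm, k)
    sp_folds = interleaved_folds(n_sp, k)
    for f in range(k):
        test = [("TM", i) for i in tm_folds[f]]
        test += [("SP", i) for i in sp_folds[f]]
        train = [("TM", i) for g in range(k) if g != f for i in tm_folds[g]]
        train += [("SP", i) for g in range(k) if g != f for i in sp_folds[g]]
        yield train, test

splits = list(cross_validation_splits())
```

Each of the ten rounds trains on nine folds and tests on the remaining one, so every protein is used for testing exactly once.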
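[Figure: subsets of the 247 TM proteins and subsets of the 1275 SP proteins are converted by TMMOD from sequence x to vector $U_{s|x}$ and passed to SVM Learn; the resulting SVM classifier labels the held-out sequences.]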
Discrimination results
Method               TM proteins incorrectly classified as SP proteins   SP proteins incorrectly classified as TM proteins
Phobius              7.7% (19/247)                                       3.5% (45/1275)
SignalP-NN           42.9%                                               2.3%
SignalP-HMM          19.0%                                               1.4%
TMMOD                6.1% (15/247)                                       14.5% (185/1275)
TMMOD + SVM-Fisher   6.1% (15/247)                                       9.2% (117/1275)
Result: with SVM-Fisher, about a third (68 of 185) of the SP proteins that TMMOD incorrectly classified as TM proteins are identified correctly.
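The percentages reported with raw counts can be re-derived from the set sizes (247 TM proteins, 1275 SP proteins). The short check below does that arithmetic; the variable and function names are hypothetical.

```python
TM_TOTAL, SP_TOTAL = 247, 1275

def percent(errors, total):
    """Error rate as a percentage, rounded to one decimal place."""
    return round(100.0 * errors / total, 1)

# Counts taken from the results reported with raw numbers.
tm_as_sp = {"Phobius": 19, "TMMOD": 15, "TMMOD + SVM-Fisher": 15}
sp_as_tm = {"Phobius": 45, "TMMOD": 185, "TMMOD + SVM-Fisher": 117}

tm_rates = {name: percent(n, TM_TOTAL) for name, n in tm_as_sp.items()}
sp_rates = {name: percent(n, SP_TOTAL) for name, n in sp_as_tm.items()}

# SP proteins rescued by adding SVM-Fisher to TMMOD.
rescued = sp_as_tm["TMMOD"] - sp_as_tm["TMMOD + SVM-Fisher"]
rescued_fraction = rescued / sp_as_tm["TMMOD"]
```

The difference of 68 proteins is a little over a third of the 185 SP proteins that TMMOD alone misclassified, while the TM error count is unchanged at 15.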
Application: Protein-Protein Interaction Prediction
Interaction Profile Hidden Markov Model (ipHMM)
Friedrich et al. (2006)
Likelihood Score Vector
$\langle LS_{a_i, A},\ LS_{a_i, B},\ LS_{b_j, A},\ LS_{b_j, B} \rangle$
Fisher Score Vector
$U(x) = \nabla_\theta \log P(x \mid \theta)$
$U_{ij} = E_j(i) / e_j(i) + \sum_k E_j(k)$
Knowledge transfer:
• Build ipHMM from proteins whose structural information is available.
• Align the sequences of proteins whose structural information is not available to the model.
Scheme       Mean ROC score
FS_NM        0.7487
LS           0.7997
FS_IM        0.8202
FS_IM + LS   0.8626
Data set: Friedrich et al. (2006), 2018 proteins in 36 domain families
Conclusions
• Structural information at binding sites enhances protein-protein interaction prediction.
• Interaction profile HMM can transfer structural information
• Fisher scores extracted from domain profiles further enhance protein-protein interaction prediction for proteins with no available structural information.