Probabilistic Suffix Trees
Maria Cutumisu
CMPUT 606
October 13, 2004
Goal
- Provide efficient prediction for protein families
- Probabilistic Suffix Trees (PSTs) are variable length Markov models (VMMs)
Conceptual Map
[Figure: concept map with the nodes Suffix Trees, Variable Length Markov Model, Probabilistic Suffix Trees, bPST, and ePST]
Background
- PSTs were introduced by Ron, Singer, and Tishby
- Bejerano and Yona made further improvements (bPST)
- Poulin: efficient PSTs (ePSTs)
- PSTs are also known as prediction suffix trees
Higher Order Markov Models
- A k-order Markov chain uses a history of length k for its conditional probabilities
- Exponential storage requirements
- As the order of the chain increases, more training data is needed to keep the parameter estimates accurate
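The exponential growth can be made concrete with a small count. The sketch below is not from the slides; the function name is hypothetical, and it assumes a 20-letter amino acid alphabet and one full conditional distribution per possible history.

```python
def markov_chain_parameters(alphabet_size: int, order: int) -> int:
    """Free parameters of a fixed-order Markov chain.

    There are alphabet_size**order possible histories, and each history
    needs (alphabet_size - 1) free probabilities, because the last one is
    fixed by the requirement that the distribution sums to 1.
    """
    return alphabet_size ** order * (alphabet_size - 1)


if __name__ == "__main__":
    # Assumption: 20 amino acids, orders 0 through 4.
    for k in range(5):
        print(k, markov_chain_parameters(20, k))
```

Each extra unit of order multiplies the parameter count by the alphabet size, which is why both storage and the amount of training data become a problem.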
Variable Length Markov Models (VMMs)
- Space and parameter-estimation efficient
  - Variable length of the history sequence for prediction
  - Only needed parameters are stored
- Created from less training data
[Figure: a test sequence (>T1 AHGSGYMNAB) compared against the training sequences. Is T1 in the training set?]
VMMs
- P(sequence) = product of the probabilities of each amino acid given those that precede it
- The conditional probability is based on the context of each amino acid
- A context function k(·) can select the history length for $x_i$ based on the context $x_1 \dots x_{i-1}$
- VMMs were first introduced as PSTs
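Written out, the factorization described on this slide might look as follows. The shorthand $k_i$ for the history length chosen at position $i$ is introduced here for readability and is not notation from the slides.

```latex
P(x_1 \dots x_n) \;=\; \prod_{i=1}^{n} P\bigl(x_i \mid x_{i-k_i} \dots x_{i-1}\bigr),
\qquad k_i = k(x_1 \dots x_{i-1}), \quad 0 \le k_i \le i-1
```

With $k_i$ fixed to a constant $k$ (truncated near the start of the sequence), this reduces to an ordinary k-order Markov chain.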
PSTs
- VMMs for efficient prediction
- Pruned during training to contain only required parameters
- bPST: represents histories
- ePST: represents sequences
bPST
- Used to represent the histories for prediction instead of the training sequences
- The possible histories are the reversed strings of all the substrings of the training sequences
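As an illustration of what "reversed strings of all the substrings" means, here is a naive enumeration. It is only a sketch: the function name and the `max_length` cap are assumptions made for the example, and no frequency filtering or pruning is applied.

```python
def candidate_histories(training_sequences, max_length):
    """Return the set of reversed substrings of length 1..max_length."""
    histories = set()
    for seq in training_sequences:
        for start in range(len(seq)):
            last_end = min(start + max_length, len(seq))
            for end in range(start + 1, last_end + 1):
                # Reverse so that the most recent residue comes first,
                # matching the order in which the tree is walked.
                histories.add(seq[start:end][::-1])
    return histories


if __name__ == "__main__":
    print(sorted(candidate_histories(["AYYA"], 2)))
```

Enumerating every substring this way is quadratic in the sequence length, which is consistent with the need to have all training sequences available before building the tree.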
Prediction with bPSTs
- The conditional probabilities $P(x_i \mid x_{i-1} \dots)$ are obtained for each position by tracing a path from the root that matches the preceding residues
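Construction bPST
- We add histories for the training data
- Nodes: parameters that estimate the conditional probabilities, $\gamma_{\text{history}}(a) = P(a \mid \text{history})$

$$P_{\text{bPST}}(x_i \mid x_{i-1}, \dots, x_1) = \begin{cases} \gamma_{x_1 \dots x_{i-1}}(x_i) & \text{if } x_1 \dots x_{i-1} \text{ is in the bPST} \\ \gamma_{x_2 \dots x_{i-1}}(x_i) & \text{else if } x_2 \dots x_{i-1} \text{ is in the bPST} \\ \quad\vdots & \\ \gamma(x_i) & \text{otherwise} \end{cases}$$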
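The back-off rule can be sketched as a walk down a tree whose edges are labelled with the preceding residues, most recent first. This is a minimal illustration, not the authors' implementation: the `Node` class, the `predict` function, and the toy probabilities are all assumptions made for the example.

```python
class Node:
    def __init__(self, gamma):
        self.gamma = gamma    # dict: next symbol -> P(symbol | history)
        self.children = {}    # dict: one earlier residue -> deeper Node


def predict(root, history, symbol):
    """P(symbol | history) from the longest history suffix in the tree."""
    node = root
    for residue in reversed(history):   # most recent residue first
        child = node.children.get(residue)
        if child is None:
            break                        # back off: stop at the deepest match
        node = child
    return node.gamma[symbol]


# Toy tree with made-up numbers over the alphabet {A, B}.
root = Node({"A": 0.5, "B": 0.5})                       # empty history
root.children["A"] = Node({"A": 0.1, "B": 0.9})         # history ends in A
root.children["A"].children["B"] = Node({"A": 0.7, "B": 0.3})  # ends in BA

if __name__ == "__main__":
    print(predict(root, "BA", "A"))   # uses the node for history BA
    print(predict(root, "AA", "B"))   # backs off to the node for history A
    print(predict(root, "AB", "A"))   # backs off to the root
```

Because the walk stops at the first missing child, the node reached always corresponds to the longest suffix of the history that the tree contains, which is the case analysis in the formula above.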
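bPST created and pruned using 010010010011110101100010111
[Figure: the pruned bPST for this string. Credit: Brett Poulin]

$$P(01001) = P(0)\,P(1 \mid 0)\,P(0 \mid 01)\,P(0 \mid 010)\,P(1 \mid 0100) = \gamma(0)\,\gamma_{0}(1)\,\gamma_{01}(0)\,\gamma_{0*}(0)\,\gamma_{00*}(1)$$

$$= \tfrac{13}{27} \cdot \tfrac{8}{13} \cdot \tfrac{5}{8} \cdot \tfrac{5}{13} \cdot \tfrac{4}{5} = \tfrac{10400}{182520} \approx 0.057$$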
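The arithmetic in this example can be checked by counting directly in the training string. The sketch below assumes that the pruned tree keeps exactly the histories "", "0", "00", and "01" (written in forward order); that set is inferred from the factors on the slide rather than read off the figure, and the function names are hypothetical.

```python
from fractions import Fraction

TRAIN = "010010010011110101100010111"
CONTEXTS = {"", "0", "00", "01"}   # assumed contents of the pruned tree


def count(sub, text=TRAIN):
    """Number of (possibly overlapping) occurrences of sub in text."""
    return sum(text.startswith(sub, i) for i in range(len(text) - len(sub) + 1))


def gamma(context, symbol, alphabet="01"):
    """Estimate P(symbol | context) from counts in the training string."""
    total = sum(count(context + a) for a in alphabet)
    return Fraction(count(context + symbol), total)


def longest_context(history):
    """Longest suffix of history that is kept in the pruned tree."""
    for start in range(len(history) + 1):
        if history[start:] in CONTEXTS:
            return history[start:]
    raise ValueError("the empty context must be in CONTEXTS")


def sequence_probability(seq):
    prob = Fraction(1)
    for i, symbol in enumerate(seq):
        prob *= gamma(longest_context(seq[:i]), symbol)
    return prob


if __name__ == "__main__":
    p = sequence_probability("01001")
    print(p, float(p))
```

Under that assumption the product equals 10400/182520, matching the slide: the histories 010 and 0100 are not kept, so the fourth and fifth factors fall back to the shorter histories 0 and 00.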
Complexity bPST
- The bPST building process requires $O(Ln^2)$ time
  - L is the length limit of the tree
  - n is the total length of the training set
- bPST building requires all training sequences at once (in order to get all the reverse substrings) and cannot be done online (the bPST cannot be built as the training data is encountered)
- Prediction: $O(mL)$, where m is the sequence length
Improved bPST
- Idea: tree with training sequences
- n: length of all training sequences
- m: length of tested sequence
- Result (theoretical):
  - Linear time building, $O(n)$
  - Linear time prediction, $O(m)$
Efficient PST (ePST)
- Used for predicting protein function
- ePST represents sequences
- Linear construction and prediction
Example ePST
[Figure: example ePST. Credit: Brett Poulin]
Prediction with ePSTs
- The probabilities for a substring are obtained for each position by tracing the path representing the sequence from the root
- If the entire sequence is not found in the tree, suffix links are followed
Construction ePST
- ePSTs gain efficiency by representing the training sequences in the PST
- Nodes store counts of the subsequence occurrences in the training data (with respect to the complete tree)
- Conditional probabilities derived from the counts are stored as well
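The relation between stored counts and conditional probabilities can be sketched with a plain dictionary of substring counts. This is a naive stand-in, not the linear-time suffix-tree construction the ePST relies on; the function names and the `max_depth` cap are assumptions, and the string AYYYA is borrowed from the example slide in this deck.

```python
from collections import defaultdict


def substring_counts(training_sequences, max_depth):
    """Count every substring of length 1..max_depth (naive, not linear time)."""
    counts = defaultdict(int)
    for seq in training_sequences:
        for start in range(len(seq)):
            last_end = min(start + max_depth, len(seq))
            for end in range(start + 1, last_end + 1):
                counts[seq[start:end]] += 1
    return counts


def conditional_probability(counts, context, symbol, alphabet):
    """P(symbol | context) as count(context+symbol) / sum of sibling counts."""
    total = sum(counts.get(context + a, 0) for a in alphabet)
    if total == 0:
        return None   # the context was never followed by anything
    return counts.get(context + symbol, 0) / total


if __name__ == "__main__":
    counts = substring_counts(["AYYYA"], max_depth=3)
    print(dict(counts))
    print(conditional_probability(counts, "Y", "Y", "AY"))
    print(conditional_probability(counts, "YY", "A", "AY"))
```

Storing the counts rather than only the probabilities means the probabilities can be recomputed from the children of a node, which is one plausible reading of "with respect to the complete tree".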
Example ePST - AYYYA
[Figure: ePST built for the sequence AYYYA. Credit: Brett Poulin]
Complexity ePST
- Linear time and space with regard to the combined length of the training sequences, $O(n)$
- Linear prediction time, $O(m)$
Advantages and Disadvantages
- Avoid exponential space requirements and parameter estimation problems of higher order Markov chains
- Pruned during training to contain only required parameters
- bPSTs for local predictions: more accurate prediction than global
- Loss in classification performance: Pfam, SCOP
Conclusions
- PSTs require less training and prediction time than HMMs
- Despite some loss in classification performance, PSTs compete with HMMs because of their reduced resource demands
- PSTs take advantage of the higher order correlations captured by VMMs
References
- Brett Poulin, Sequence-based Protein Function Prediction, Master's thesis, University of Alberta, 2004
- G. Bejerano, G. Yona, Modeling protein families using probabilistic suffix trees, RECOMB '99
- G. Bejerano, Algorithms for variable length Markov chain modeling, Bioinformatics Applications Note, 20(5):788–789, 2004
PSTs and HMMs
- “HMMs do not capture any higher-order correlations. An HMM assumes that the identity of a particular position is independent of the identity of all other positions.” [1]
- PSTs are variable length Markov models for efficient prediction. The prediction uses the longest available context matching the history of the current amino acid.
- For protein prediction in general, “the main advantage of PSTs over HMMs is that the training and prediction time requirements of PSTs are much less than for the equivalent HMMs.” [1]
Suffix Trees (ST)
[Figure: example suffix tree. Credit: Brett Poulin]
bPST
- Histories added to the tree must occur more frequently than a threshold $P_{\min}$
- The substrings are added in order of length from smallest to largest
bPST vs ST
- The string s is only added to the tree if:
  - the resulting conditional probability at the node to be created will be greater than the minimum prediction probability $\gamma_{\min} + \alpha$, and
  - the probability for the prefix of the string is different (with some ratio r) from the probability assigned to the next shortest substring suf(s), which is already in the tree
- After all the substrings are added to the tree, the probabilities are smoothed according to the parameter $\gamma_{\min}$
- The smoothing (as calculated by the equation below) prevents any probability from being less than $\gamma_{\min}$
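The smoothing equation itself is not preserved in this transcript. One rule with the stated property, assumed here because it is the form usually given for PSTs, rescales each estimate as $(1 - |\Sigma|\gamma_{\min})\,P(\sigma \mid s) + \gamma_{\min}$. The sketch below uses hypothetical names and may differ from the equation on the original slide.

```python
def smooth(distribution, gamma_min):
    """Rescale a distribution so that no probability is below gamma_min.

    Assumed rule: (1 - |alphabet| * gamma_min) * p + gamma_min.
    The result still sums to 1 when the input does.
    """
    size = len(distribution)
    if size * gamma_min >= 1:
        raise ValueError("gamma_min must be smaller than 1 / alphabet size")
    scale = 1 - size * gamma_min
    return {symbol: scale * p + gamma_min for symbol, p in distribution.items()}


if __name__ == "__main__":
    # A node that never saw the symbol 1 after its history.
    print(smooth({"0": 1.0, "1": 0.0}, gamma_min=0.01))
```

The practical effect is that a residue never observed after some history no longer forces the probability of the whole test sequence to zero.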