Application of latent semantic analysis to protein remote homology detection
Wu Dongyin, 4/13/2015


Page 1: Application of latent semantic analysis to protein remote homology detection Wu Dongyin 4/13/2015

Application of latent semantic analysis to protein remote homology detection

Wu Dongyin, 4/13/2015

Page 2:

ABSTRACT

LSA

Related Work on Remote Homology Detection

LSA-based SVM and Data set

Result and Discussion

CONCLUSION

Page 3:

Motivation: Remote homology detection is a central problem in computational biology: the classification of proteins into functional and structural classes given their amino acid sequences.

Results

Discriminative methods such as SVM are among the most effective.

Explicit feature sets are usually large and may introduce noisy data, which leads to the peaking phenomenon.

This work introduces LSA, an efficient feature-extraction technique from NLP. The LSA model significantly improves the performance of remote homology detection in comparison with the basic formalisms, and its performance is comparable with complex kernel methods such as SVM-LA and better than other sequence-based methods.

ABSTRACT

Page 4:

Related Work on Remote Homology Detection

Three main approaches:

- Pairwise sequence comparison (dynamic programming) algorithms: BLAST, FASTA, PSI-BLAST, etc.
- Generative models for protein families: HMM, etc.
- Discriminative classifiers: SVM, SVM-Fisher, SVM-k-spectrum, mismatch-SVM, SVM-pairwise, SVM-I-sites, SVM-LA, SVM-SW, etc.

Structure is more conserved than sequence, so detecting very subtle sequence similarities, i.e. remote homology, is important.

Most methods can detect homology at a high level of similarity, while remote homologs are often difficult to separate from pairs of proteins that share similarities owing to chance: the 'twilight zone'.

The success of an SVM classification method depends on the choice of the feature set used to describe each protein. Most research efforts focus on finding useful representations of protein sequence data for SVM training, using either explicit feature vector representations or kernel functions.

Page 5:

LSA

Latent semantic analysis (LSA) is a theory and method for extracting and representing contextual-usage meaning by statistical computations applied to a large corpus of text.

LSA analyzes the relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text.

Page 6:

LSA

            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   .   .   1   .   .   .   .   .
interface    1   .   1   .   .   .   .   .   .
computer     1   1   .   .   .   .   .   .   .
user         .   1   1   .   1   .   .   .   .
system       .   1   1   2   .   .   .   .   .
response     .   1   .   .   1   .   .   .   .
time         .   1   .   .   1   .   .   .   .
EPS          .   .   1   1   .   .   .   .   .
survey       .   1   .   .   .   .   .   .   1
tree         .   .   .   .   .   1   1   1   .
graph        .   .   .   .   .   .   1   1   1
minor        .   .   .   .   .   .   .   1   1

Bag-of-words model: N documents, M words in total, giving a word-document matrix of size M × N.

(This representation does not recognize synonymous or related words, and the dimensions are too large.)
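The bag-of-words construction above can be sketched in a few lines of Python. The mini-corpus here is a made-up stand-in for the slide's c1..c5 / m1..m4 documents, not the actual data:

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the slide's example documents.
docs = {
    "c1": "human interface computer",
    "c2": "user system response time",
    "m1": "tree graph",
}

# Vocabulary (M words) and the M x N word-document count matrix:
# one row per word, one column per document.
vocab = sorted({w for text in docs.values() for w in text.split()})
matrix = []
for word in vocab:
    row = [Counter(text.split())[word] for text in docs.values()]
    matrix.append(row)

print(vocab)
print(matrix)
```

Each column is a document's raw word-count vector; this is exactly the representation whose size and noise LSA is introduced to reduce.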

Page 7:

LSA applies singular value decomposition to the word-document matrix:

W ≈ U S V^T

where U is M×K, S is K×K, and V^T is K×N, with K = R ≤ min(M, N).
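The truncated SVD step can be sketched with NumPy. The matrix values here are toy numbers, not the paper's data:

```python
import numpy as np

# W: M x N word-document matrix (toy values for illustration).
W = np.array([
    [1., 0., 0., 1.],
    [1., 1., 0., 0.],
    [0., 1., 1., 0.],
    [0., 0., 1., 1.],
])

# Full SVD, then keep only the top R singular values (R <= min(M, N)).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
R = 2
U_r, S_r, Vt_r = U[:, :R], np.diag(s[:R]), Vt[:R, :]

# Rank-R approximation W ~ U_r S_r Vt_r; the columns of S_r @ Vt_r are the
# R-dimensional latent representations of the documents.
W_approx = U_r @ S_r @ Vt_r
print(np.round(W_approx, 2))
```

Truncating to R dimensions is what discards the smallest singular values, which is the noise-removal effect the slides attribute to LSA.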

Page 8:

(The same example word-document matrix as on the previous slide.)

Sequences of proteins are treated as the documents: build the word-document matrix from protein sequences, then apply LSA in exactly the same way.

Page 9:

For a new document (sequence) that is not in the training set, one would otherwise have to add the unseen document (sequence) to the original training set and recompute the LSA model. Instead, the new vector t can be approximated as

t = d U

where d is the raw count vector of the new document, analogous to a column of the matrix W.
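This "folding-in" of an unseen sequence can be sketched as follows, again with a toy matrix rather than the paper's data:

```python
import numpy as np

# Training word-document matrix (toy values), decomposed once.
W = np.array([
    [1., 0., 1.],
    [0., 1., 0.],
    [1., 1., 0.],
])
U, s, Vt = np.linalg.svd(W, full_matrices=False)
R = 2
U_r = U[:, :R]

# Fold in an unseen document: d is its raw word-count vector (same
# M-dimensional layout as the columns of W), and t = d @ U_r projects it
# into the R-dimensional latent space without recomputing the SVD.
d = np.array([1., 0., 1.])
t = d @ U_r
print(t)
```

The point of folding-in is efficiency: classification of a new sequence needs only a matrix-vector product, not a fresh decomposition of the enlarged corpus.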

Page 10:

LSA-based SVM and Data set

Structural Classification of Proteins (SCOP) 1.53; sequences from the ASTRAL database.

54 families, 4352 distinct sequences. Remote homology is simulated by holding out all members of a target SCOP 1.53 family from a given superfamily.

Three basic building blocks of proteins serve as the 'words':

- N-grams: N = 3, giving 20^3 = 8000 words.
- Patterns: over the alphabet Σ ∪ {'.'}, where Σ is the set of the 20 amino acids and '.' can match any amino acid; χ² selection yields 8000 patterns.
- Motifs: limited, highly conserved regions of proteins; 3231 motifs.
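The N-gram "words" are the simplest of the three building blocks; a minimal sketch, using a made-up fragment rather than a real SCOP sequence:

```python
# Every run of N consecutive residues in a sequence counts as one "word".
def ngrams(sequence, n=3):
    """Return the overlapping n-grams of an amino-acid sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

seq = "MKVLAT"  # hypothetical fragment
print(ngrams(seq))  # ['MKV', 'KVL', 'VLA', 'LAT']
```

With a 20-letter amino-acid alphabet and N = 3 there are 20^3 = 8000 possible such words, matching the vocabulary size quoted above.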

Page 11:

Result and Discussion

Two measures are used to evaluate the experimental results:

- Receiver operating characteristic (ROC) scores.
- Median rate of false positives (M-RFP) scores: the fraction of negative test sequences that score as high as, or better than, the median score of the positive sequences.
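The M-RFP definition reduces to a one-liner; the scores below are invented toy values, used only to show the arithmetic:

```python
import numpy as np

# Hypothetical classifier decision scores; higher means "more positive".
pos_scores = np.array([2.0, 1.5, 0.9, 0.4])
neg_scores = np.array([1.8, 0.8, 0.3, 0.2, 0.1])

# M-RFP: fraction of negatives scoring at or above the positives' median.
median_pos = np.median(pos_scores)          # (1.5 + 0.9) / 2 = 1.2
m_rfp = np.mean(neg_scores >= median_pos)   # 1 of 5 negatives -> 0.2
print(m_rfp)
```

A lower M-RFP is better: it means few negatives outrank a typical positive.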

Page 12:

Result and Discussion

Page 13:

Result and Discussion

When a family falls in the upper-left area, the method labeled on the y-axis outperforms the method labeled on the x-axis for that family.

Page 14:

Result and Discussion

[SCOP hierarchy diagram: fold1 → superfamily1.1 (family1.1.1, family1.1.2, family1.1.3) and superfamily1.2 (family1.2.1, family1.2.2); fold2 → superfamily2.1 (family2.1.1)]

1. Family level: positive train 20, positive test 13; negative train & negative test: 3033 & 1137.

Page 15:

Result and Discussion

[SCOP hierarchy diagram: fold1 → superfamily1.1 (family1.1.1, family1.1.2, family1.1.3) and superfamily1.2 (family1.2.1, family1.2.2); fold2 → superfamily2.1 (family2.1.1)]

2. Superfamily level: positive train 88, positive test 33; negative train & negative test: 3033 & 1137.

Page 16:

Result and Discussion

[SCOP hierarchy diagram: fold1 → superfamily1.1 (family1.1.1, family1.1.2, family1.1.3) and superfamily1.2 (family1.2.1, family1.2.2); fold2 → superfamily2.1 (family2.1.1)]

3. Fold level: positive train 61, positive test 33; negative train & negative test: 3033 & 1137.

Page 17:

Result and Discussion

Page 18:

Result and Discussion

LSA: better than SVM-pairwise and SVM-LA; worse than the methods without LSA and PSI-BLAST.

Computational efficiency:

                   vectorization step   optimization step
 SVM-pairwise      O(n^2 l^2)           O(n^3)
 SVM-LA            O(n^2 l^2)           O(n^2 p)
 SVM-Ngram         O(nml)               O(n^2 m)
 SVM-Pattern       O(nml)               O(n^2 m)
 SVM-Motif         O(nml)               O(n^2 m)
 SVM-Ngram-LSA     O(nmt)               O(n^2 R)
 SVM-Pattern-LSA   O(nmt)               O(n^2 R)
 SVM-Motif-LSA     O(nmt)               O(n^2 R)

n: the number of training examples; l: the length of the longest training sequence; m: the total number of words; t: min(m, n); p: the length of the representation vector (p = n in SVM-pairwise, p = m in the methods without LSA, p = R in the LSA methods).

Page 19:

CONCLUSION

In this paper, the LSA model from natural language processing is successfully applied to protein remote homology detection, and improved performance is achieved in comparison with the basic formalisms.

Each document is represented as a linear combination of hidden abstract concepts, which arise automatically from the SVD mechanism.

LSA defines a transformation between high-dimensional discrete entities (the vocabulary) and a low-dimensional continuous vector space S, the R-dimensional space spanned by the left singular vectors, leading to noise removal and an efficient representation of the protein sequence.

As a result, the LSA model achieves better performance than the methods without LSA.