Finding motifs in Omics Sequences
Thesis submitted in Partial Fulfillment
of the Requirements for the Award of the Degree of
Master of Science (by Research)
by
Aniruddha Maiti (Roll No.: 11EE72P01)
under the supervision of
Dr. Anirban Mukherjee and Dr. Niloy Ganguly
Department of Electrical Engineering
Indian Institute of Technology, Kharagpur
Kharagpur - 721 302, INDIA
August, 2014
© 2014 Aniruddha Maiti. All rights reserved.
"You can't cross the sea merely by standing and staring at the water."
- Rabindranath Tagore
~ Dedicated to my Parents ~
Declaration
I certify that
a. the work contained in this thesis is original and has been done by me under the
guidance of my supervisor.
b. the work has not been submitted to any other Institute for any degree or diploma.
c. I have followed the guidelines provided by the Institute in preparing the thesis.
d. I have conformed to the norms and guidelines given in the Ethical Code of Conduct
of the Institute.
e. whenever I have used materials (data, theoretical analysis, figures, and text) from
other sources, I have given due credit to them by citing them in the text of the
thesis and giving their details in the references.
Aniruddha Maiti
Date:
Place:
Certificate
This is to certify that the thesis entitled "Finding motifs in Omics Sequences", submitted
by Aniruddha Maiti to the Department of Electrical Engineering, Indian Institute
of Technology, Kharagpur, India, in partial fulfillment for the award of the degree of Master
of Science (by Research) in Electrical Engineering, is a bona fide record of work carried
out by him under my supervision and guidance. The thesis has fulfilled all the requirements
as per the regulations of the institute and, in my opinion, has reached the standard needed
for submission.
Dr. Anirban Mukherjee
Department of Electrical Engineering
Indian Institute of Technology
Kharagpur - 721 302, India
Dr. Niloy Ganguly
Department of Computer Science and Engineering
Indian Institute of Technology
Kharagpur - 721 302, India
Acknowledgments
I would like to take this opportunity to express my sincere gratitude to the many individuals
who have given me a great deal of support during my tenure at IIT Kharagpur. First and
foremost, I express my deepest sense of gratitude to my supervisor and guide Dr. Anirban
Mukherjee, whose expert guidance and support have made my research work fruitful here
at IIT Kharagpur. I am thankful that I had the opportunity to work under an excellent human
being like him. He never got tired of discussing my ideas and patiently proofread my
publications. His support has not only enhanced the quality of my research work, it has
also helped me keep going through the difficult patches and made me a stronger person
over the last two years. I am also grateful to my co-supervisor Dr. Niloy Ganguly for his
constant encouragement and support.
Apart from my guides, I have learnt valuable technical aspects from Dr. Pabitra Mitra
through attending his classes. I am fortunate enough to have Mr. Surajit Panja as a caring
elder brother whose selfless help and support have made my work environment as congenial
as it could get.
I would like to thank the Council of Scientific and Industrial Research, Govt. of India for
sponsoring my research project.
This journey would not have been possible without the blessings and love of my parents.
Their continuous support has given me the courage to reach the verge of
completing my degree.
Aniruddha Maiti
Abstract
Given the availability of a large number of genomic and proteomic sequences, the motif finding
problem has received intense attention in the field of computational biology over the
last two decades. For DNA, short conserved patterns or motifs can represent transcription
factor binding sites. For proteins, motifs may represent binding domains. For RNA, they
may represent splice junctions. Thus, discovering short conserved substrings in biological
sequences can lead to a better understanding of transcriptional regulation, mRNA splicing
and formation, and the classification of protein complexes.
In the motif finding problem, the objective is to locate short conserved substrings or motifs
in a set of long strings. This thesis presents methodologies to find conserved locations
in protein and DNA sequences. A method is proposed to classify unlabeled protein sequences
using additional topological information besides primary structural information.
G protein-coupled receptors (GPCRs) are selected in this work as they contain such additional
topological information. Two kernels are designed
to classify GPCR sequences using the available structural information. Improved accuracy
is achieved in both the GPCR family and the GPCR Class-A subfamily classification problems by
using kernel classifiers. The proposed framework can classify sequences with a fixed topology
and identify family-specific conserved triplets.
For DNA sequences, two Expectation Maximization (EM)-based techniques are developed.
The first is random projection based, and the second is Monte-Carlo (MC) simulation
based. The effectiveness of the proposed algorithms is validated using both synthetic
datasets and biological datasets containing experimentally verified motifs.
Keywords: Kernel Function, Motif Finding, GPCR Classification, Expectation Maximiza-
tion, Random Projection, Monte Carlo
Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Symbols and Abbreviations . . . . . . . . . . . . . . . . . . . . . .. . . ix
1 Introduction 1
1.1 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . .. 2
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 GPCR Classification and Motif Identification Techniques . . . . . 2
1.2.2 Motif Finding Problem in DNA Sequences . . . . . . . . . . . . 4
2 GPCR classification using family specific conserved triplets 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Kernel Classifiers and Spectrum Kernel . . . . . . . . . . . . . . . .. . . 9
2.2.1 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 VVRKFA Method of Classification . . . . . . . . . . . . . . . . . 10
2.2.3 Spectrum Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2.1 Alphabet Reduction Scheme . . . . . . . . . . . . . . . 13
2.3.3 Proposed String Kernel . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3.1 Normalization . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3.2 An Example . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 Extension to the 8-fold Kernel . . . . . . . . . . . . . . . . 16
2.3.5 Feature Reduction Schemes . . . . . . . . . . . . . . . . . . . . . 17
2.3.5.1 Identification of Receptor-Ligand Interaction Sites . . . . 17
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.3.1 Performance to predict the Families and Sub-families . . 21
2.4.3.2 Class A Subfamily Detection . . . . . . . . . . . . . . . 22
2.4.4 Comparison with Some Related Work . . . . . . . . . . . . . . . . 23
2.4.5 Effect of Variation of η . . . . . . . . . . . . . . . . . . . 24
2.4.6 Identified Binding-Site Triplets and Their Positions .. . . . . . . . 25
2.4.7 Effect of Selected Reduced Feature Set . . . . . . . . . . . . . .. 27
2.5 Contribution of this Chapter . . . . . . . . . . . . . . . . . . . 27
3 Expectation Maximization in Random Projected Spaces to Find Motifs in DNA
Sequences 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Local Alignment Problem for Motif Discovery . . . . . . . . . . 30
3.2.2 Expectation Maximization (EM) Method for Motif Discovery . . . 31
3.2.3 Random Projection Method . . . . . . . . . . . . . . . . . . . . . 33
3.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Proposed Method: EM in Randomly Projected Spaces . . . . . . . . . 34
3.4.1 An Example of Projected Motif Model . . . . . . . . . . . . . . 35
3.5 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . .. . . 36
3.6.1 Results on synthetic data set . . . . . . . . . . . . . . . . . . . . . 36
3.6.2 Results on JASPAR data set . . . . . . . . . . . . . . . . . . . . . 37
3.7 Conclusion and Contribution . . . . . . . . . . . . . . . . . . . . 38
4 Monte-Carlo Expectation Maximization for Finding Motifs in DNA Sequences 39
4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Multiple Local Alignment for Motif Discovery . . . . . . .. . . . 39
4.1.2 Expectation Maximization in Motif Finding . . . . . . . . .. . . . 41
4.1.3 Monte Carlo EM Motif Discovery Algorithm . . . . . . . . . . . 42
4.1.3.1 MCEMDA as a Markov Chain . . . . . . . . . . . . . . 43
4.2 Motivation and Methodology . . . . . . . . . . . . . . . . . . . . . . . .. 44
4.2.1 Simplified Q Function . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Selection of Promising Markov Chains . . . . . . . . . . . . . 45
4.2.3 Goodness Measure: ψ . . . . . . . . . . . . . . . . . . . . . . 45
4.2.4 Overcoming the Limitation of the Phase Shift . . . . . . . . . 46
4.2.5 An Example of Shifted Θ . . . . . . . . . . . . . . . . . . . . 48
4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1 Computational Leverage . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.2 Comparison with other EM-based Algorithms . . . . . . . . . . . 53
4.4.2.1 Results on Synthetic Dataset . . . . . . . . . . . . . . . 53
4.4.2.2 Results on JASPAR Dataset . . . . . . . . . . . . . . . . 53
4.4.3 A Study on the Proposed Initialization Scheme . . . . . . .. . . . 54
4.4.4 A Study on the Stand-alone Model Shifting Mechanism . .. . . . 56
4.4.5 The Effect of End-Clustering . . . . . . . . . . . . . . . . . . . . . 57
4.4.6 Computation Time of Individual Markov Chain . . . . . . . . . .. 58
4.5 Contribution of this Chapter . . . . . . . . . . . . . . . . . . . 59
5 Conclusion 61
REFERENCES 63
AUTHOR’S PUBLICATIONS 69
List of Figures
2.1 Snake plot of a GPCR sequence. . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Multiclass data classification through regression approach. . . . . . . . . . 11
2.3 An Example of Feature-Map Formation. . . . . . . . . . . . . . . . .. . . 15
2.4 Schematic Diagram of the Classification Process. . . . . . . .. . . . . . . 18
2.5 Comparison of Classification Performances among GPCRPred, GPCRBind
and proposed String Kernels on data set-II. . . . . . . . . . . . . . 25
2.6 Effect of variation of η on classification accuracy (%) in GPCR class-A
subfamily prediction (Based on Dataset-II). . . . . . . . . . . . . . 25
2.7 Number of features versus classification accuracy (%) in GPCR class-A
subfamily prediction (A 17-class problem). . . . . . . . . . . . . . . 27
3.1 Comparison of conventional EM and proposed method (error measure is
taken to be the average number of mismatches between true motif and
identified motif) . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Comparison of conventional EM and proposed method (error measure is
taken to be the average number of mismatches between identified motif
start locations via indicator variable Aik and true motif start locations) . 37
3.3 Performance for different projection lengths in a (15,4) problem (error
measure is taken to be the average number of mismatches between true
motif and identified motif) . . . . . . . . . . . . . . . . . . . . . 37
4.1 Visualization of the MCEMDA as a Markov chain (Number of MC simulations
is 3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Improvement of the Q-function during the first 400 iterations. (Number of
MC simulations in each stage: m = 3). . . . . . . . . . . . . . . . . 44
4.3 Convergence of the Monte Carlo EM based algorithm to the shifted version
of the true motif model. (a) (15, 3) motif finding problem with m = 1. (b)
(15, 3) motif finding problem with m = 3. (c) (12, 2) motif finding problem
with m = 1. (d) (12, 2) motif finding problem with m = 3. . . . . . . 47
4.4 Flowchart to overcome the phase shift problem. . . . . . . . . . . . . 48
4.5 Improvement in the (15,3) motif finding problem. . . . . . . . . . . . 49
4.6 Simplified description of the proposed algorithm. . . . . . . . . . . 51
4.7 Evolution of the goodness measure, ψ, with respect to the number of iterations
for finding a promising Markov chain (itrmax). . . . . . . . . . 57
List of Tables
2.1 Sezerman Alphabet Reduction Scheme . . . . . . . . . . . . . . . . . .. . 14
2.2 Start and end points of different segments . . . . . . . . . . . .. . . . . . 15
2.3 Data set-I: Human GPCR sequences . . . . . . . . . . . . . . . . . . 19
2.4 Data set-II: Class A subfamilies and TMHMM performance . .. . . . . . 20
2.5 Classification Accuracy in Data set-I . . . . . . . . . . . . . . . . .. . . . 22
2.6 Classification Accuracy in Naveed’s GPCR Data set . . . . . . . .. . . . . 22
2.7 Classification Accuracy in Data set-II . . . . . . . . . . . . . . . .. . . . 23
2.8 Misclassification Table . . . . . . . . . . . . . . . . . . . . . . . . . . .. 23
2.9 Comparison of Classification Performances among GPCRPred, GPCRBind
and proposed String Kernels . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.10 Identified motifs and their locations
(dataset-I: GPCR Family) . . . . . . . . . . . . . . . . . . . . . . 26
2.11 Subfamily-specific motifs and their locations (8-fold Kernel)
(data set-II: Class-A subfamily) . . . . . . . . . . . . . . . . . . 26
3.1 Performance on JASPAR data set . . . . . . . . . . . . . . . . . . . . . .. 38
4.1 Convergence to the true motif model due to shifting . . . . . .. . . . . . . 49
4.2 Performance on Synthetic Dataset . . . . . . . . . . . . . . . . . . .. . . 54
4.3 Performance on JASPAR Dataset : Large Group . . . . . . . . . . .. . . . 55
4.4 Performance on JASPAR Dataset : Small Group . . . . . . . . . . .. . . . 56
4.5 Comparison of the Stand-alone Shifting Scheme . . . . . . . . .. . . . . . 57
4.6 Performance Improvement on Synthetic Dataset due to Clustering . . . . . 58
4.7 Computation Time Comparison of a Single Chain . . . . . . . . . . . .. . 59
List of Symbols and Abbreviations
Σ Character Alphabet
CV Cross Validation
DNA Deoxyribonucleic acid
ECL Exo-Cellular Loops
EM Expectation Maximization
GPCR G protein-coupled receptors
HMM Hidden Markov Model
ICL Intra-Cellular Loops
MC Monte-Carlo
MC EM Monte-Carlo Expectation Maximization
MCEMDA Monte Carlo EM Motif Discovery Algorithm
mRMR minimum Redundancy Maximum Relevance
mRNA Messenger Ribonucleic acid
oops One Occurrence Per Sequence
PCA Principal Component Analysis
PWM Position Weight Matrix
RBF Radial Basis Function
RNA Ribonucleic acid
SVM Support Vector Machines
VVRKFA Vector-Valued Regularized Kernel Function Approximation
Chapter 1
Introduction
Finding conserved locations or motifs in biological sequences is of paramount importance.
Given the advancement in sequencing technologies, a large number of biological
sequences, e.g. protein, DNA or RNA sequences, are now available in the public domain. For
DNA, short conserved patterns or motifs can represent transcription factor binding sites.
For RNA, they may represent splice junctions [1]. For proteins, motifs may represent
binding domains. Thus, discovering short conserved substrings in biological sequences can
lead to a better understanding of transcriptional regulation, mRNA splicing and formation,
and the classification of protein complexes [2] [3]. This thesis presents some methodologies
to find conserved locations in protein and DNA sequences. Based on the identified conserved
locations in protein sequences, a method is proposed to classify unlabeled protein
sequences. Some protein sequences possess additional topological information besides
primary structural information, and this primary structural information can be combined
with the additional topological information. G protein-coupled receptors (GPCRs) are selected in this
work as they contain additional topological information besides primary structural information.
The proposed framework can classify sequences with a fixed topology and identify
family-specific conserved triplets. In the case of DNA sequences, Expectation Maximization
(EM)-based techniques are explored, and two EM-based techniques are proposed to
identify motifs in DNA sequences.
The organization of the thesis is as follows:
1.1 Organization of the Thesis
• Chapter 1 discusses some of the related motif finding techniques for both protein
and DNA sequences.
• Chapter 2 presents a framework for G-Protein Coupled Receptor classification and
conserved triplet identification with the help of available topological information.
• Chapter 3 presents a framework where the expectation maximization method is performed
in randomly projected spaces to find motifs in DNA sequences.
• Chapter 4 presents a variant of Monte Carlo Expectation Maximization method to
find motifs in DNA sequences.
• Chapter 5 concludes the work described in this thesis and discusses possible directions
for future research.
1.2 Related Work
For protein sequences, G protein-coupled receptors (GPCRs) are selected as the target class.
The proposed framework classifies GPCRs based on identified conserved triplets. Identifying
conserved triplets is important in the context of GPCR classification because the more a
triplet is conserved within a particular family, the better its chance of classifying sequences accurately.
This section discusses some related works on GPCR classification and conserved location
identification techniques.
For DNA sequences, the proposed techniques are Expectation Maximization (EM)-based.
Some related methods concerning motif finding in DNA sequences are also discussed
in this section.
1.2.1 GPCR Classification and Motif Identification Techniques
There are a number of techniques employed by different researchers to classify GPCRs.
For example, Raghava et al. proposed the GPCRpred [4] method, which uses a combination
of 20 Support Vector Machines (SVMs) [5] to classify GPCR sequences at different
levels. In GPCRpred, the feature vectors are constructed using the dipeptide composition of
each protein, where the entire sequence represents a point in a high dimensional space.
The Hidden Markov Model (HMM) is also used to classify GPCR sequences; for example, the
PRED-GPCR server uses 265 signature profile HMMs [6]. In the GRIFFIN project, the combined
use of SVM and HMM is employed to predict GPCR-G protein coupling [7].
The predictive power of the SVM is higher than that of the rest of the methods, but it is
opaque in the sense that it fails to identify the key features which are primarily responsible
for classification. On the other hand, the HMM is capable of pinpointing those features, but its
classification performance is not as good as that of the SVM. The GPCR data suffer from the curse
of dimensionality, i.e., the number of features is much larger than the number of samples in
the different classes, and the prediction accuracy decreases as a result. In order to overcome
this problem, most of the classification techniques employ Principal Component
Analysis (PCA) to reduce the number of features [8] [9]. As PCA combines all the features,
it is difficult to pinpoint the exact set of features which are responsible for ligand-binding
processes. In their work, Cobanoglu et al. devised an exhaustive search method for family-specific
triplets to classify and identify the key ligand-receptor binding sites, considering
the linear position of a particular triplet [10]. The accuracy of this method is high because
the exhaustive search is employed based on the relative linear distance of triplets; in
other words, apart from the primary sequence information, topological information of the
sequence is also taken into account. The authors of [11] attempted to classify GPCRs based
on the protein power spectrum from the Fast Fourier Transform (FFT). Recent work of Naveed
et al. shows that the classification accuracy can be improved by using genetic ensembles
[12], where different feature extraction and classification strategies are used for GPCR
prediction, and the evolutionary ensemble approach is then used for enhanced prediction
performance. State-of-the-art classification techniques either use conventional classification
algorithms like HMM or SVM, or an exhaustive search method to identify the family-specific
motifs. When it comes to conventional classification techniques, none of them exploits
properties specific to GPCR sequences. Although GPCR-SVMFS employed the minimum
Redundancy Maximum Relevance (mRMR) method [13] and a genetic algorithm to extract
features specific to GPCRs, it uses the entire sequence information [14]. The recent
work of Cobanoglu et al. shows that the exo-cellular portion of the sequence is primarily
responsible for ligand-binding processes; in order to classify GPCRs, it is sufficient to
take into consideration only the exo-cellular part [10]. The work presented in Chapter 2
attempts to exploit these GPCR-specific properties while using
a conventional classification method such as SVM.
1.2.2 Motif Finding Problem in DNA Sequences
In the motif finding problem, the objective is to locate short conserved substrings or motifs
in a set of long strings. A more general and difficult problem is to find short substrings
which are almost conserved, e.g. the (l, d) motif finding problem. Given a set S of N
sequences of lengths Li (i ∈ {1, 2, ..., N}), the task is to find a substring m of length
l which appears frequently in S, accompanied by mutations in at most d random positions.
An example is the (15, 4) motif finding problem. The problem is a difficult one, as two mutated
versions of the substring m can differ in up to 2d (i.e., 8) positions. This problem is commonly
known as the challenge problem [15]. Later, Buhler and Tompa provided a mathematical
analysis explaining the inherent intractability of the problem [16]. Given the intractability
and the importance of the motif finding problem in the context of transcription factor binding
site identification, a number of computational tools (such as MEME [17], MCEMDA
[18], Projection [16], Weeder [19], MUSCLE [20], ClustalW [21], BioProspector [22], etc.)
have been developed to solve the problem. Among them, MEME and MCEMDA
are Expectation Maximization (EM)-based methods. The Projection method uses a random
projection technique to identify a favorable starting seed for EM-based algorithms.
MUSCLE and ClustalW are multiple sequence alignment algorithms. MUSCLE incorporates
fast distance estimation using k-mer counting, progressive alignment using a
profile function called the log-expectation score, and a refinement technique using tree-dependent
restricted partitioning [20]. ClustalW is also a progressive multiple sequence
alignment algorithm, which uses sequence weighting and position-specific gap penalties [21].
BioProspector uses Markov background models [22] to search for regulatory sequence
motifs, employing a Gibbs sampling strategy.
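The (l, d) formulation above can be made concrete with a short sketch. The following Python snippet is illustrative only (the helper names `hamming` and `plant_motif` are our own, not part of any cited tool): it generates a planted (15, 4) instance and checks that every sequence contains a window within Hamming distance d of the true motif.

```python
import random

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def plant_motif(motif: str, d: int, length: int, n: int, alphabet="ACGT"):
    """Generate n random sequences, each containing one copy of `motif`
    mutated in exactly d distinct positions: an (l, d) planted-motif instance."""
    seqs = []
    for _ in range(n):
        m = list(motif)
        for pos in random.sample(range(len(motif)), d):
            # replace the letter at this position with a different one
            m[pos] = random.choice([c for c in alphabet if c != m[pos]])
        background = [random.choice(alphabet) for _ in range(length - len(motif))]
        start = random.randrange(length - len(motif) + 1)
        seqs.append("".join(background[:start] + m + background[start:]))
    return seqs

random.seed(0)
motif = "ATTGCATTGCATTGC"          # l = 15
seqs = plant_motif(motif, d=4, length=600, n=20)

# each sequence has a planted occurrence at Hamming distance exactly 4,
# while two planted occurrences can differ in up to 2d = 8 positions
best = [min(hamming(motif, s[i:i + 15]) for i in range(len(s) - 14)) for s in seqs]
print(max(best))  # never exceeds 4
```

Note that the verification is only easy because the planted motif is known; recovering it from the sequences alone is the hard part discussed above.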
The Expectation Maximization (EM) method was introduced to the conserved site identification
problem by Lawrence and Reilly [23]. This method identifies motifs which occur
only once in each sequence. Later, Bailey and Elkan provided a more generalized model
for the EM-based motif identification problem [17]. This iterative model is effective given a
reasonably good starting point. In [16], Buhler and Tompa proposed a locality-sensitive
hashing method called random projection to pinpoint a good starting point for EM-based
algorithms. In [24], it is shown that the performance of uniform projection is better than
that of random projection. If the initial guess about the starting point of EM is not
reasonably good, the deterministic algorithm converges quickly to a local optimum.
To ameliorate this limitation, the Monte Carlo Expectation Maximization (MC
EM) method was proposed by Wei and Tanner, where the expectation step is calculated through
Monte Carlo simulation [25]. This incorporates randomness into the EM algorithm. In the Monte
Carlo EM Motif Discovery Algorithm (MCEMDA) [18], this concept is used to discover
motifs in DNA sequences.
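The one-occurrence-per-sequence EM iteration of Lawrence and Reilly can be sketched as follows. This is a simplified, MEME-style illustration under stated assumptions (uniform background, fixed pseudocounts, the function name `em_oops` is hypothetical), not the implementation used in any cited tool: the E-step scores every length-w window under the current position weight matrix, and the M-step re-estimates the matrix from the soft window weights.

```python
import numpy as np

def em_oops(seqs, w, n_iter=50, seed=0):
    """EM for motif finding under the one-occurrence-per-sequence model:
    E-step computes P(motif starts at i | data, theta) for every window;
    M-step re-estimates the position weight matrix (PWM) from those weights."""
    rng = np.random.default_rng(seed)
    idx = {c: i for i, c in enumerate("ACGT")}
    X = [np.array([idx[c] for c in s]) for s in seqs]
    theta = rng.dirichlet(np.ones(4), size=w)        # PWM: w x 4, random start
    bg = np.full(4, 0.25)                            # uniform background model
    for _ in range(n_iter):
        counts = np.full((w, 4), 0.1)                # pseudocounts for stability
        for x in X:
            n_win = len(x) - w + 1
            # E-step: likelihood ratio of each window vs. the background
            ll = np.array([np.prod(theta[np.arange(w), x[i:i + w]] /
                                   bg[x[i:i + w]]) for i in range(n_win)])
            z = ll / ll.sum()                        # soft start-site posteriors
            # M-step accumulation: expected letter counts per motif column
            for i, zi in enumerate(z):
                for j in range(w):
                    counts[j, x[i + j]] += zi
        theta = counts / counts.sum(axis=1, keepdims=True)
    return theta
```

Because the update is deterministic given the starting PWM, a poor initialization converges to a poor local optimum, which is exactly the limitation the projection and Monte Carlo variants discussed above try to address.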
The work presented in Chapter 3 discusses a method where EM is performed in randomly
projected spaces to reduce computation. In Chapter 4, a Monte Carlo Expectation
Maximization (MC EM) based motif finding method is proposed and studied.
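The random projection seeding step of Buhler and Tompa, mentioned above, can be sketched in a few lines (an illustrative reconstruction, not the thesis's or the Projection tool's code; the helper name `projection_buckets` is hypothetical): hash every l-mer by the letters at k randomly chosen positions, so that l-mers derived from the same motif tend to collide whenever no chosen position was mutated, and a heavily populated bucket then suggests a starting seed for EM.

```python
import random
from collections import defaultdict

def projection_buckets(seqs, l, k, seed=0):
    """Locality-sensitive hashing step of random projection (a sketch):
    pick k of the l motif positions at random and hash every l-mer by the
    letters at those positions; crowded buckets are promising EM seeds."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))      # the random projection
    buckets = defaultdict(list)
    for s_id, s in enumerate(seqs):
        for i in range(len(s) - l + 1):
            key = "".join(s[i + p] for p in positions)
            buckets[key].append((s_id, i))           # remember l-mer location
    return positions, buckets
```

In the full method the largest buckets are converted into initial motif models and refined by EM; here only the bucketing is shown.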
Chapter 2
GPCR classification using family specific
conserved triplets
2.1 Introduction
G protein-coupled receptors (GPCRs), being one of the largest superfamilies of transmembrane
proteins, transduce extracellular signals across the cell membrane [26–28]. Various
extracellular signals, related to vision, metabolism, and immune and inflammatory responses
[29–32], activate these receptors; on ligand binding, they transduce these signals into
intracellular responses via heterotrimeric G proteins. Different ligands bind to different
types of GPCRs and consequently activate them by allowing them to bind with G proteins
[33]. As a result, GPCRs are of paramount importance for pharmacological intervention.
Presently, more than 50 percent of modern drugs available on the market target these protein
sequences [34]. Given their importance, considerable attention has been devoted to identifying
the key ligand-receptor binding sites using the GPCR sequence.
A GPCR is a single polypeptide chain (shown in Fig. 2.1) which crosses the cell membrane
seven times. The segments external to the cell membrane are one amino terminus
and three exo-cellular loops (ECL). These four segments are in direct physical contact with
ligands during interaction. Although the transmembrane regions, three intra-cellular loops
(ICL) and one carboxyl terminus have an important role in the signal transduction mechanism,
they have a minor functional role in ligand-binding processes [10]. Previously, GPCRs have
been classified using primary sequence information only [4]. In this work, structural information
is used along with the primary sequence information to classify GPCR sequences.
[Figure: snake plot showing the N terminus, ECL 1-3 and the carboxyl terminus outside the cell, ICL 1-3 inside the cell, the seven transmembrane regions, the cross-over points a-i, and a bound ligand.]
Figure 2.1: Snake plot of a GPCR sequence.
Out of all the major families of GPCRs, class-A (Rhodopsin-like) is the most important one, as
more than 80% of GPCRs are Rhodopsin-like [35].
From here onwards, the word family will be used in the context of the first-level classification
of GPCRs. Similarly, the word subfamily will be used in the context of Class-A
GPCR subfamilies unless otherwise specified. As it is the most abundant among all GPCR
families, GPCR Class-A is the main focus of this work.
A novel string kernel is designed for the classification of GPCRs. This framework can
identify the triplets conserved within a family as potential ligand-receptor binding sites.
These triplets may be represented as motifs. As these motifs are specific to a particular
family, it can be inferred that they are primarily responsible for ligand-binding processes.
The primary sequence information and cross-over points (b, c, d, e, f, g, h; in Fig. 2.1)
are considered as input to the proposed string kernel for feature extraction. A few selected
features are fed to kernel classifiers. The improved classification accuracy indicates the
effectiveness of the proposed method over previous methods in terms of the ease of the
feature extraction method and classification accuracy. The rest of the chapter is organized
as follows. Two kernel classifiers, namely the Support Vector Machine (SVM) and vector-valued
regularized kernel function approximation (VVRKFA), are described in section 2.2.
In section 2.3, the proposed string kernel, feature extraction and feature selection methodology
are described. Experimental results are presented in section 2.4. These two sections
provide the principal contribution of the work presented in this chapter. Finally, section 2.5
concludes by discussing the contribution of this chapter.
2.2 Kernel Classifiers and Spectrum Kernel
In this section, two kernel classifiers used in this work are described briefly. The basic
construction of the spectrum kernel and its limitations are also described here.
2.2.1 Support Vector Machine
SVM, introduced in [5], is a supervised learning algorithm used for classification tasks.
Given a set of features and their binary class labels, SVM computes a linear decision
boundary in a high dimensional feature space to discriminate between positive and negative
samples by maximizing the margin between them [36]. Let S be a training set consisting
of labeled input vectors (x_i, l_i), i = 1, ..., m, where x_i ∈ ℜ^n are the instances and
l_i ∈ {+1, −1} are their respective labels. The input training data are projected to a higher
dimensional feature space by a nonlinear transformation function φ which maps every point
of the input space X ⊆ ℜ^n to a feature space F, i.e., φ : X → F. In order to train the SVM
classifier, knowledge of the feature space in its explicit form is not required; only
knowledge of the inner products between training patterns in the feature space is sufficient.
Therefore, the computational problem that arises from the high dimensional feature space
is overcome with the help of the kernel trick, by replacing the dot product in the feature space by
some function K, called the kernel function [36], such that

K(x_i, x_j) = φ(x_i)^T φ(x_j),    (2.1)

where x_i and x_j are two points in the input space X and φ(x_i), φ(x_j) are their mapped
points, respectively, in the feature space F. SVM finds the optimal separating hyperplane
by minimizing the following quadratic optimization problem:
min_{w,b,ξ}  (1/2)‖w‖² + C Σ_{i=1..m} ξ_i
s.t.  l_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,  i = 1, 2, ..., m.    (2.2)
Here, w ∈ ℜ^n and b ∈ ℜ are, respectively, the normal to the hyperplane and the bias term. ξ_i (≥ 0) is the soft-margin error of the i-th training sample, and C is a regularization parameter regulating the trade-off between the margin width and the generalization performance.
This formulation of SVM in (2.2) is known as the soft-margin SVM. The solution of the quadratic programming (QP) problem (2.2) is obtained by forming its Lagrangian dual as follows:
max_α  L(α) = Σ_{i=1..m} α_i − (1/2) Σ_{i,j=1..m} α_i α_j l_i l_j K(x_i, x_j)
s.t.  Σ_{i=1..m} α_i l_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., m,    (2.3)
where the α_i are the Lagrange multipliers. The solution of this problem leads to the following decision function to test a new pattern x ∈ ℜ^n:
f(x) = sgn(w^T φ(x) + b) = sgn( Σ_{i=1..m} l_i α_i K(x_i, x) + b ).    (2.4)
This classification approach was originally developed for binary classification and was later extended to the multiclass setting [37–40]. There are various methods to extend binary classifiers to the multiclass version. As the performance of the one-vs-one (OVO) SVM classifier is better than that of the others [39], it is used in this work. In the OVO method, ^N C_2 = N(N − 1)/2 binary classifier models are trained, one for each possible pair of classes, for an N-class problem. The testing of a sample in the OVO method is performed by the max-win strategy [39]: each trained binary classifier casts one vote for its favored class, and the class with the maximum number of votes gives the class label of the sample. Thus, as the number of classes increases, the training and testing times also increase in this method.
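The max-win voting step of the OVO scheme can be sketched as follows (an illustrative sketch, not the thesis implementation; the decision stubs below stand in for trained binary SVMs):

```python
# One-vs-one (OVO) max-win voting: each of the N*(N-1)/2 binary models casts
# one vote for its favored class; the class with the most votes wins.
from itertools import combinations

def ovo_predict(pairwise_models, classes, x):
    """pairwise_models maps a class pair (a, b) to a decision function that
    returns +1 (favors a) or -1 (favors b) for a sample x."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        decision = pairwise_models[(a, b)](x)
        votes[a if decision > 0 else b] += 1
    return max(votes, key=votes.get)   # max-win strategy

# Toy 3-class example with hypothetical decision stubs.
models = {
    (0, 1): lambda x: +1,   # prefers class 0 over class 1
    (0, 2): lambda x: -1,   # prefers class 2 over class 0
    (1, 2): lambda x: -1,   # prefers class 2 over class 1
}
label = ovo_predict(models, [0, 1, 2], x=None)   # class 2 collects two votes
```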
2.2.2 VVRKFA Method of Classification
VVRKFA is a recently proposed multiclass kernel classifier that classifies a pattern in a low-dimensional label space through a regression technique [41]. This method overcomes the limitation of decomposition techniques that classify multiclass data with binary classifiers. The concept of the VVRKFA method is depicted graphically in Fig. 2.2 [42]. Here, the training data are first mapped to a feature space by using the kernel trick. Then a regularized nonlinear function is fitted to map the data from the feature space to a lower-dimensional label space, whose dimension equals the number of class labels in the data set. As observed in Fig. 2.2, the fitted vector-valued response has three elements, which is the number of classes. The classification is performed in the label space with these low-dimensional patterns obtained through vector-valued regression. The testing of a new pattern is also carried out in this label space by mapping it from feature space to label space.

Figure 2.2: Multiclass data classification through regression approach.
The mapping function of VVRKFA for an N-class data classification problem, based on n input attributes and m training samples X = {(x_i, y_i) : x_i ∈ H_x ⊂ ℜ^n, y_i ∈ H_y ⊂ ℜ^N, i = 1, 2, ..., m}, is obtained by solving the following optimization problem:
min_{Θ,b,ξ}  J(Θ, b, ξ) = (C/2) tr([Θ b]^T [Θ b]) + (1/2) Σ_{i=1..m} ‖ξ_i‖²
s.t.  Θ K(x_i^T, B^T)^T + b + ξ_i = y_i,  i = 1, 2, ..., m.    (2.5)
The label vector y_i of a sample x_i of the j-th class is chosen as the indicator vector of the classes with the following rule:

y_i = [y_i1, y_i2, ..., y_iN]^T  with  y_ij = 1 and y_ik = 0 for k ≠ j.    (2.6)
Θ ∈ ℜ^{N×m̄} is the regression coefficient matrix that maps a feature vector φ(x_i) = K(x_i^T, B^T)^T from the feature space H_φ(x) into the label space H_y; B ∈ ℜ^{m̄×n} is a matrix formed by randomly picking m̄ rows of the training data matrix A ∈ ℜ^{m×n}; and K(·,·) is the kernel function that determines the dot product of two vectors in the feature space. b ∈ H_y is the bias vector, ξ_i ∈ H_y is the slack or error variable, and dim(H_φ(x)) = m̄ with m̄ ≤ m. This leads to the following vector-valued function, ρ(x_i) = Θ K(x_i^T, B^T)^T + b, which maps a feature vector into the low-dimensional subspace H_y, where ρ is the fitted vector-valued response of an input x ∈ ℜ^n by VVRKFA.
A test pattern is classified in the space H_y depending on its proximity (in the Mahalanobis sense) to the class centroids. The Mahalanobis distance takes into account the data scattering and makes the classification rule invariant under scale changes and location shifts. The vector-valued responses {ρ(x_i) ∈ H_y}_{i=1..m} of the training patterns {x_i}_{i=1..m} are obtained from the fitted function ρ using a nonlinear kernel function K. These vector-valued responses are used to form the class centroids of the N classes, denoted by {ρ̄^(1), ρ̄^(2), ..., ρ̄^(N)}. Each class centroid ρ̄^(j) is calculated by
ρ̄^(j) = (1/m_j) Σ_{i=1..m_j} ρ(x_i),    (2.7)
where m_j is the number of training samples of the j-th class. The class label of a test pattern x_t is determined by
class(x_t) = arg min_{1≤j≤N} d_M(ρ(x_t), ρ̄^(j) | Σ̂),    (2.8)

where

Σ̂ = Σ_{j=1..N} (m_j − 1) Σ^(j) / (m − N)    (2.9)

is the pooled within-class sample covariance matrix, and
Σ^(j) = (1/(m_j − 1)) Σ_{i : x_i ∈ Cl_j} (ρ(x_i) − ρ̄^(j))(ρ(x_i) − ρ̄^(j))^T    (2.10)
is the covariance matrix of the j-th class, and d_M(ρ_1, ρ_2 | Σ̂) represents the Mahalanobis distance between two patterns ρ_1 and ρ_2, given by
d_M(ρ_1, ρ_2 | Σ̂) = √( (ρ_1 − ρ_2)^T Σ̂^{−1} (ρ_1 − ρ_2) ).    (2.11)
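The decision stage of Eqs. (2.7)–(2.11) — class centroids, pooled within-class covariance, and Mahalanobis nearest-centroid assignment — can be sketched on synthetic responses (an illustrative Python sketch, not the thesis MATLAB code):

```python
# Classify a low-dimensional response by its Mahalanobis distance to the
# class centroids, using the pooled within-class covariance (Eqs. 2.7-2.11).
import numpy as np

def pooled_covariance(responses, labels, classes):
    m, N = len(labels), len(classes)
    p = responses.shape[1]
    S = np.zeros((p, p))
    for j in classes:
        Rj = responses[labels == j]
        S += (len(Rj) - 1) * np.cov(Rj, rowvar=False)   # (m_j - 1) * Sigma^(j)
    return S / (m - N)                                  # Eq. (2.9)

def classify(rho_t, centroids, Sigma):
    Sinv = np.linalg.inv(Sigma)
    d = [np.sqrt((rho_t - c) @ Sinv @ (rho_t - c)) for c in centroids]  # Eq. (2.11)
    return int(np.argmin(d))                            # Eq. (2.8)

# Synthetic 2-D responses for three well-separated classes.
rng = np.random.default_rng(0)
R = np.vstack([rng.normal(mu, 0.1, size=(20, 2)) for mu in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 20)
centroids = [R[y == j].mean(axis=0) for j in range(3)]  # Eq. (2.7)
Sigma = pooled_covariance(R, y, [0, 1, 2])
pred = classify(np.array([2.9, 0.1]), centroids, Sigma)  # near centroid of class 1
```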
2.2.3 Spectrum Kernel
The spectrum kernel is an n-gram kernel introduced in [43] for protein classification. Let Σ be an alphabet with |Σ| = l (for protein sequences, l = 20, with each letter representing one amino acid). Let t be a sequence consisting of letters from Σ. The k-spectrum of t is the set of all k-length (k ≥ 1) contiguous subsequences that t contains. Let p ∈ Σ^k; then the feature-map is indexed by all the elements belonging to Σ^k. The feature-map, which translates a point from the input space X to ℜ^{l^k}, is defined as the column vector Φ_k(t) = (φ_p(t))_{p∈Σ^k}, where φ_p(t) is the frequency of occurrence of p in t. In the kernel matrix, the value corresponding to the sequences x_i and x_j is computed from Φ_k(x_i) and Φ_k(x_j) by using any kernel function, such as the linear kernel or the Gaussian radial basis function (RBF) kernel. For the RBF kernel, K_k(x_i, x_j) = exp(−µ‖Φ_k(x_i) − Φ_k(x_j)‖²), where µ ≥ 0 is an adjustable parameter.
This work constructs the kernel matrix by incorporating the structural information of a GPCR sequence, thus improving the classification accuracy. The limitation of the traditional string kernel is that, in the feature selection step, it can pinpoint the k-mers that are primarily responsible for classification, but it fails to predict their positions. This limitation is overcome by the proposed string kernel.
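A minimal k-spectrum feature map in the sense of [43] can be sketched as follows (toy alphabet and sequences, purely illustrative):

```python
# Count every contiguous k-mer of a sequence over a fixed alphabet, yielding
# the k-spectrum feature map Phi_k(t) = (phi_p(t)) for all p in Sigma^k.
from itertools import product

def k_spectrum(seq, alphabet, k=3):
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    phi = [0] * len(index)                    # one slot per p in Sigma^k
    for i in range(len(seq) - k + 1):
        phi[index[seq[i:i + k]]] += 1         # frequency of occurrence of p in t
    return phi

phi = k_spectrum("ABAB", alphabet="AB", k=2)  # 2-mer order: AA, AB, BA, BB
# Linear spectrum kernel value = inner product of two feature maps.
kval = sum(a * b for a, b in zip(phi, k_spectrum("ABBA", "AB", 2)))
```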
2.3 Methodology
2.3.1 Problem Statement
The research problem is formulated as follows:
Given a data set D of GPCR sequences along with their respective family and subfamily names, compute the kernel matrix, using GPCR-specific structural information, to classify a new unlabeled sequence by kernel classifiers. From D, identify the family- or subfamily-specific conserved locations as potential receptor-ligand binding sites.
2.3.2 Feature Extraction
For every GPCR sequence, the start and end points of its different regions are used. This information can be taken from a pre-existing knowledge base, or it can be predicted by an HMM, as is the case in this work. From these segments, all the triplets are collected and the feature-map is formed. The alphabet size (l) of the amino-acid sequences is twenty, which is quite a large value for forming the feature vector. For this reason, an alphabet reduction scheme is employed here.
2.3.2.1 Alphabet Reduction Scheme
An alphabet reduction scheme groups amino acids based on the similarity of their physico-chemical properties. There is prior work on these types of grouping schemes in [44], [45], [10]. Among these, the Sezerman grouping scheme, shown in TABLE-2.1, is considered in this work because it gives the best results in GPCR classification [10].
Table 2.1: Sezerman Alphabet Reduction Scheme

  Amino-acid group:  IVLM  RKH  DE  QN  ST  A  G  W  C  YF  P
  Reduced letter:    A     B    C   D   E   F  G  H  I  J   K
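TABLE-2.1 can be applied as a simple lookup table; the sketch below reproduces the reduction of the example segment from section 2.3.3.2 (an illustrative Python sketch):

```python
# Sezerman alphabet reduction built from TABLE-2.1: each amino-acid group
# collapses to one reduced letter, giving |Sigma| = 11.
GROUPS = {"IVLM": "A", "RKH": "B", "DE": "C", "QN": "D", "ST": "E",
          "A": "F", "G": "G", "W": "H", "C": "I", "YF": "J", "P": "K"}
REDUCE = {aa: letter for group, letter in GROUPS.items() for aa in group}

def reduce_sequence(seq):
    return "".join(REDUCE[aa] for aa in seq)

# The example segment of section 2.3.3.2.
reduced = reduce_sequence("MELNSRVDSFRYTLPIVLGANGWAMPV")
```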
2.3.3 Proposed String Kernel
Given a receptor sequence t and any k-length protein sequence q, the k-spectrum kernel only possesses information about the frequency of occurrence of q in t. However, the position of q in t is equally important in the ligand-binding process. To include the position information in the spectrum kernel, a GPCR sequence is divided into four separate regions. These regions are represented as t_i (i ∈ {1, 2, 3, 4}). Each t_i is composed of t_i(out) and t_i(in). The start and end points of these segments are shown in TABLE-2.2. Now, for each t_i (i ∈ {1, 2, 3, 4}), its feature-map, Φ³_i(t), is defined as:
Φ³_i(t) = (φ^i_p(t))_{p∈Σ³},    (2.12)

where Σ is the reduced alphabet (for Sezerman's grouping scheme, |Σ| = 11) and

φ^i_p(t) = η · f_p^{t_i(out)} + (1 − η) · f_p^{t_i(in)},    (2.13)
where f_p^l is the frequency of occurrence of p in segment l. The value of the parameter η (0 ≤ η ≤ 1) is a measure of confidence on two aspects: the first is the belief that the external segments contain more informative triplets, and the second is the accuracy of the method employed to predict the different segments (the HMM in this case). For example, if the positions of all the segments are known exactly, then the value of η should be close to unity, giving high importance to the external regions and little importance to the rest. On the other hand, if the GPCR structure prediction is not very reliable, then the value of η should come near 0.5, giving almost equal importance to both the regions t_i(out) and t_i(in). The feature-map corresponding to the entire sequence t is defined as
Φ(t) = [(Φ³_1(t))^T  (Φ³_2(t))^T  (Φ³_3(t))^T  (Φ³_4(t))^T]^T.    (2.14)
The feature-map of the entire GPCR sequence t is thus formed by concatenating the individual feature-maps of the four regions. As the length of p is clear from the context, the superscript 3 will be omitted from the feature-map notation from here onwards. Each vector Φ_i(t) ∈ ℜ^{|Σ|³} (i ∈ {1, 2, 3, 4}) with |Σ| = 11, so the feature vector Φ(t) ∈ ℜ^{4|Σ|³}. Φ(t) possesses not only the frequency information of a particular triplet p ∈ Σ³ in the different regions of t, but also its positional information. To compute an entry of the kernel matrix corresponding to two GPCR sequences x and y, any similarity measure between the feature-maps Φ(x) and Φ(y) can be incorporated; for example, the inner product or the Gaussian RBF kernel can be used for this purpose. In this work, the latter is used to compute the kernel matrix.

Table 2.2: Start and end points of different segments†

  4-fold kernel              8-fold kernel
  Segment   Start  End       Segment  Start  End
  t1(out)   a      b         t1       a      b
  t1(in)    b      c         t2       b      c
  t2(out)   c      d         t3       c      d
  t2(in)    d      e         t4       d      e
  t3(out)   e      f         t5       e      f
  t3(in)    f      g         t6       f      g
  t4(out)   g      h         t7       g      h
  t4(in)    h      i         t8       h      i

† Refer to Fig. 2.1 for start and end points.
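Eq. (2.13) for one region can be sketched as follows, using the segments of the example in section 2.3.3.2 and η = 0.7 as in Fig. 2.3 (an illustrative Python sketch over the reduced alphabet of TABLE-2.1):

```python
# 4-fold feature map for one region t_i (Eqs. 2.12-2.13): triplet frequencies
# from the external and internal parts are mixed with weight eta.
from itertools import product
from collections import Counter

def triplet_freq(seq):
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))

def region_feature_map(t_out, t_in, alphabet, eta=0.7):
    f_out, f_in = triplet_freq(t_out), triplet_freq(t_in)
    return [eta * f_out[p] + (1 - eta) * f_in[p]        # Eq. (2.13)
            for p in ("".join(q) for q in product(alphabet, repeat=3))]

# Segments from the worked example of section 2.3.3.2 (already reduced).
phi_i = region_feature_map("ACADEBACEJBJEAKAAAGF", "DGHFAKA",
                           alphabet="ABCDEFGHIJK", eta=0.7)
```

The triplet AKA occurs once in t_i(out) and once in t_i(in), so its entry is 0.7 + 0.3 = 1.0, matching the annotation in Fig. 2.3.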
[Figure 2.3 illustrates the feature-map formation for one region: an amino-acid segment is mapped through the alphabet reduction, split into the exo-cellular loop part t2(out) (weight η = 0.7) and the trans-membrane part t2(in) (weight 1 − η = 0.3), and the weighted triplet frequencies are accumulated into the feature-map.]

Figure 2.3: An Example of Feature-Map Formation.
2.3.3.1 Normalization
The lengths of GPCR sequences vary significantly, and so does the total number of triplets in them. Therefore, a proper normalization step is required before the computation of the kernel matrix. Let D be a data set containing m GPCR sequences. At first, Φ(t) is computed for each t ∈ D. Then a matrix D′ ∈ ℜ^{m×(4·|Σ|³)} is formed, where each row represents a GPCR sequence and each column represents a triplet of one of the four regions of the GPCR.
Then D′ is normalized row-wise to form D″, for i = 1, 2, ..., m, as:

D″(i, j) = D′(i, j) / Σ_{∀j} D′(i, j),    j = 1, 2, ..., 4·|Σ|³.    (2.15)

The above normalization takes care of the problem of variable-length GPCR sequences. In the next step, a column-wise normalization is performed to bring all the feature values between 0 and 1. Here D‴ is formed from D″, for j = 1, 2, ..., 4·|Σ|³, as:

D‴(i, j) = ( D″(i, j) − min_{∀i} D″(i, j) ) / ( max_{∀i} D″(i, j) − min_{∀i} D″(i, j) ),    i = 1, 2, ..., m.    (2.16)
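The two normalization steps (2.15)–(2.16) amount to the following (illustrative sketch with a toy matrix; a zero column range would need guarding in practice):

```python
# Row-wise normalization (each row sums to 1) followed by per-column
# min-max scaling into [0, 1], as in Eqs. (2.15)-(2.16).
import numpy as np

def normalize(D):
    Dpp = D / D.sum(axis=1, keepdims=True)     # Eq. (2.15)
    lo, hi = Dpp.min(axis=0), Dpp.max(axis=0)
    return (Dpp - lo) / (hi - lo)              # Eq. (2.16)

D = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [2.0, 2.0, 4.0]])
Dppp = normalize(D)
```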
2.3.3.2 An Example
Consider a protein sequence segment (t_i): MELNSRVDSFRYTLPIVLGANGWAMPV. After Sezerman's alphabet reduction scheme is applied, according to TABLE-2.1, t_i becomes ACADEBACEJBJEAKAAAGFDGHFAKA. Let t_i(out) = ACADEBACEJBJEAKAAAGF and t_i(in) = DGHFAKA. The triplets present in t_i(out) are ACA, CAD, ..., AGF, and the triplets present in t_i(in) are DGH, GHF, ..., AKA. In Φ_i(t), the values corresponding to the different triplets are computed using (2.13). This scheme is illustrated in Fig. 2.3 (i is taken to be 2 in this case). Similarly, the feature-maps of the other segments are computed and concatenated to form the feature-map of the entire sequence, Φ(t), according to (2.14). Then normalization is performed as described in section 2.3.3.1.
2.3.4 Extension to the 8-fold Kernel
A new 8-fold kernel is proposed here as a variant of the 4-fold kernel. Instead of forming the feature-map of t_i by weighted contributions from t_i(in) and t_i(out), a separate feature-map is formed for t_i(in) and t_i(out). In this scheme, a protein sequence t is divided into eight segments: t_i (i ∈ {1, 2, ..., 8}). The start and end points of these segments are shown in TABLE-2.2. The feature-map Φ_i(t), for each t_i (i ∈ {1, 2, ..., 8}), is defined as:

Φ_i(t) = (φ^i_p(t))_{p∈Σ³}.    (2.17)
As before, Σ is the reduced alphabet, and

φ^i_p(t) = η · f_p^{t_i}  if i is odd;  (1 − η) · f_p^{t_i}  otherwise,    (2.18)
where f_p^{t_i} is the frequency of occurrence of p in t_i and 0 ≤ η ≤ 1. Note that t_i is an external segment if i is odd and an internal segment otherwise. The kernel formed in this manner is the 8-fold kernel. The feature-map corresponding to the entire sequence t is formed as before:

Φ(t) = [(Φ_1(t))^T  (Φ_2(t))^T  ...  (Φ_8(t))^T]^T.    (2.19)

Each vector Φ_i(t) ∈ ℜ^{|Σ|³} (i ∈ {1, 2, ..., 8}) with |Σ| = 11, so the feature vector Φ(t) ∈ ℜ^{8|Σ|³}. Here, the localization property is better than in the previous scheme, and higher classification accuracy can be achieved at the expense of increased computational complexity.
2.3.5 Feature Reduction Schemes
As described in section 2.3.3, the feature-map of a sequence, Φ(t), is of length 5324 for the 4-fold kernel; the length is 10648 for the 8-fold one. As a result, it is important to reduce the dimension of Φ(t) before classification. The minimum Redundancy Maximum Relevance (mRMR) method [13] is employed in this work for feature selection. The mRMR method utilizes mutual-information criteria to select a set of the most informative features: it takes the maximum-relevance criterion together with the minimum-redundancy criterion into account, choosing additional features that are maximally dissimilar to the already selected ones. The mRMR method with the quotient scheme [13], being a popular choice, is employed in this work.
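A greedy mRMR selection with the quotient scheme can be sketched as follows (a simplified illustration on discrete toy features, not the implementation of [13]):

```python
# Greedy mRMR with the quotient scheme: at each step pick the feature
# maximizing I(f; class) / mean I(f; already-selected features).
import numpy as np

def mutual_info(a, b):
    joint = np.histogram2d(a, b, bins=(len(set(a)), len(set(b))))[0]
    p = joint / joint.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def mrmr_quotient(X, y, k):
    rest = list(range(X.shape[1]))
    relevance = [mutual_info(X[:, j], y) for j in rest]   # I(f; class)
    selected = [int(np.argmax(relevance))]                # most relevant first
    rest.remove(selected[0])
    while len(selected) < k:
        scores = [relevance[j] / (np.mean([mutual_info(X[:, j], X[:, s])
                                           for s in selected]) + 1e-12)
                  for j in rest]
        best = rest[int(np.argmax(scores))]
        selected.append(best)
        rest.remove(best)
    return selected

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
X = np.column_stack([y,                          # perfectly relevant
                     y,                          # relevant but fully redundant
                     rng.integers(0, 2, 200)])   # uninformative noise
picked = mrmr_quotient(X, y, k=2)
```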
2.3.5.1 Identification of Receptor-Ligand Interaction Sites
While using the mRMR method to select the best features before classifying GPCRs, the effect of the number of selected features on classification accuracy is studied. It is found that the classification accuracy saturates with an increasing number of features, and a further increase in the number of features results in a drop in classification accuracy. It can be inferred that the features offering maximum classification accuracy correspond to the triplets responsible for ligand-binding processes. The work of Cobanoglu et al. [10] also supports this observation. These triplets and their positions are identified in the corresponding GPCR sequences.
A schematic diagram of the entire workflow of the proposed method is summarized in Fig. 2.4.

[Figure 2.4 sketches the pipeline: the GPCR data set together with its topological information feeds feature-vector formation using the proposed kernel; mRMR-based feature selection identifies motifs and yields a feature vector of reduced dimension; model selection by cross-validation gives the trained classifier and the final accuracy calculation.]

Figure 2.4: Schematic Diagram of the Classification Process.

Once the triplets are identified, their corresponding classes can be identified using
the following method. Let r be the number of features selected by mRMR. The normalized data set D‴, containing the information of m GPCR sequences as described in section 2.3.3.1, is now transformed into D^r_mrmr ∈ ℜ^{m×r}, which contains the r columns corresponding to the r mRMR features. If d ∈ ℜ^m is the vector containing the corresponding class information of D^r_mrmr, and h is the number of classes, then an indicator vector d_j ∈ ℜ^m is formed from d for a class j as:
d_j(i) = 1 if d(i) = j;  0 otherwise.    (2.20)
Let D^r_mrmr(:, r_i) represent the column vector of D^r_mrmr corresponding to feature r_i. The label class(r_i), for which the triplet corresponding to r_i is responsible for ligand binding, is:

class(r_i) = arg max_{j∈{1,2,...,h}} [ d_j^T · D^r_mrmr(:, r_i) ].    (2.21)

Here, in (2.21), d_j^T represents the transpose of d_j ∈ ℜ^m.
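Eqs. (2.20)–(2.21) reduce to an indicator-weighted column sum; a sketch with toy numbers (classes indexed from 0 here, purely for illustration):

```python
# Attribute a selected feature to the class whose samples carry the largest
# total value in that feature's column (Eqs. 2.20-2.21).
import numpy as np

def feature_class(column, d, n_classes):
    scores = []
    for j in range(n_classes):
        d_j = (d == j).astype(float)    # indicator vector, Eq. (2.20)
        scores.append(d_j @ column)     # d_j^T . D_mrmr(:, r_i), Eq. (2.21)
    return int(np.argmax(scores))

d = np.array([0, 0, 1, 1, 2, 2])                  # class labels of six sequences
col = np.array([0.1, 0.0, 0.9, 0.8, 0.2, 0.1])    # one mRMR-selected feature
cls = feature_class(col, d, n_classes=3)          # class 1 dominates this feature
```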
2.4 Experiments
In this section, the data sets and experimental setup are described. This is followed by the
results and their analyses.
2.4.1 Data Set
The performance of the proposed string kernel is evaluated on three different data sets, described below:
• Data set-I: This is used for the GPCR family detection problem. It is prepared from the GPCRDB website [46]. A set of 390 human GPCR sequences, belonging to four different families, is studied. The number of sequences in each family in this data set is given in TABLE-2.3. The topological information of these sequences is taken from the UniProt website [47] as the input to the proposed method.

Table 2.3: Data set-I: Human GPCR sequences

  GPCR Families                                  # of sequences
  Class-A Rhodopsin like                         323
  Class-B Secretin like                          47
  Class-C Metabotropic glutamate or pheromone    15
  Vomeronasal receptors (V1R and V3R)            5
  TOTAL                                          390
• Data set-II: This data set contains protein sequences of the GPCR Class-A family. It has been used in the work of Raghava et al. [4]. In order to automate the topological region prediction process, the TMHMM server [48], which predicts the different regions of a GPCR based on an HMM, is used. TMHMM may fail to predict a valid GPCR structure (one having four extracellular regions); those sequences, for which TMHMM fails to predict a valid GPCR structure, are excluded from the study. Out of the total of 1054 sequences used in the work of Raghava et al. [4], TMHMM predicted 885 sequences correctly. The number of sequences processed correctly by TMHMM in the different subfamilies is shown in TABLE-2.4. It is observed that TMHMM processes all the GPCR Class-A subfamilies with high accuracy except for the Prostanoid subfamily. After processing with the TMHMM server, the prepared data set becomes identical to the data used in [10], where the family specificity of motifs is considered to classify GPCRs. As the Class-A family is considered the most important among all GPCR families, the maximum emphasis is placed on this data set. The data set is used to compare the results of this work with some existing GPCR classification methods.
Table 2.4: Data set-II: Class A subfamilies and TMHMM performance

  Class A subfamilies                     Number of†   TMHMM‡   % of
                                          Sequences             accuracy
  Amine (AMN)                             221          208      94.1
  Peptide (PEP)                           381          304      79.8
  Hormone proteins (HMP)                  25           24       96.0
  (Rhod)opsin (RHD)                       183          174      95.1
  Olfactory (OLF)                         87           69       79.3
  Prostanoid (PRS)                        38           8        21.0
  Nucleotide-like (NUC)                   48           33       68.7
  Cannabinoid (CAN)                       11           11       100.0
  Gonadotropin-releasing hormone (GRH)    10           9        90.0
  Thyrotropin-releasing hormone (TRH)     7            7        100.0
  Melatonin (MEL)                         13           13       100.0
  Viral (VIR)                             17           13       76.5
  Lysosphingolipids (LYS)                 9            8        88.9
  Platelet activating hormone (PAF)       4            4        100.0
  TOTAL                                   1054         885      84.0

† Total number of sequences available in data set-II.
‡ Number of sequences correctly processed by TMHMM.
• Data set-III: This data set is used by Naveed et al. [12]. Being the largest among the three data sets used in this work, it consists of five families of GPCR, divided into 39 subfamilies. TMHMM is used to predict the topological information. After preprocessing, this data set contains a total of 6125 sequences. The numbers of sequences in the five families and 39 subfamilies are shown in TABLE-2.6.
2.4.2 Experimental Setup
The effectiveness of the selected reduced feature set in classifying GPCR sequences is principally studied with SVM, and most of the results shown here are based on SVM. Additionally, in some cases, results using VVRKFA are also shown along with the SVM results, because VVRKFA, being a kernel-based classifier, validates our claim about the effectiveness of the proposed string kernel. To implement the SVM classification algorithm, the libSVM software package [49] is used, and the VVRKFA algorithm is implemented in MATLAB [41]. To evaluate the performance of the proposed method, 10-fold cross-validation (CV) testing is carried out. The total data set is divided into 10 parts; at a time, nine parts are taken as the training data set and the remaining part is taken as the test data set. The experiment is performed 10 times, taking each part as the test set once, and the final accuracy is calculated by averaging the accuracies over all parts. A grid-search technique is employed to identify the optimal parameter set for each classifier. The experiments are performed with a Gaussian RBF kernel of the form K(x_i, x_j) = exp(−µ‖x_i − x_j‖²), where µ is the kernel parameter. The regularization parameters C of SVM and VVRKFA are selected by tuning from the sets {C = 2^i | i = −5, −4, ..., 12} and {C = 10^i | i = −7, −6, ..., −1}, respectively. The kernel parameter µ for both methods is selected from the set {µ = 2^i | i = −8, −7, ..., 8}. The optimal parameter set for both classifiers is selected by the performance of a parameter set on a tuning set comprising 30% of the total data. Once the optimal parameter set is chosen, the 10-fold CV is performed to compute the performance of the classifiers. For each data set, the 10-fold CV is performed 100 times with random permutations of the training data to calculate the average testing accuracy.
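The protocol above can be sketched as follows (an illustrative Python sketch; `train_and_score` is a hypothetical callable standing in for SVM or VVRKFA training, which this sketch leaves abstract):

```python
# Parameter grids as described in the text, plus a 10-fold CV loop where each
# of the ten random parts serves as the test set exactly once.
import numpy as np

C_grid  = [2.0 ** i for i in range(-5, 13)]     # C = 2^i, i = -5..12 (SVM)
mu_grid = [2.0 ** i for i in range(-8, 9)]      # mu = 2^i, i = -8..8

def ten_fold_indices(m, rng):
    idx = rng.permutation(m)
    return np.array_split(idx, 10)              # 10 roughly equal parts

def cross_validate(m, train_and_score, rng):
    folds = ten_fold_indices(m, rng)
    accs = []
    for k in range(10):                         # part k is the test set
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        accs.append(train_and_score(train, test))
    return float(np.mean(accs))

rng = np.random.default_rng(0)
acc = cross_validate(100, lambda tr, te: 0.9, rng)   # dummy scorer for the sketch
```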
2.4.3 Experimental Results
With the experimental setup described in section 2.4.2, the performance of the proposed method is computed in three ways: 1) performance in predicting the different families of GPCR, 2) performance in predicting the different subfamilies, and 3) performance in predicting the subfamilies within a specific family of GPCR, with an emphasis on Class-A subfamilies in particular.
2.4.3.1 Performance to predict the Families and Sub-families
In data set-I, the human GPCR sequences are grouped into the four major classes shown in TABLE-2.5. The proposed method achieves 99.1% and 98.9% classification accuracy using SVM and VVRKFA, respectively. The average accuracies of 100 runs of the algorithm, using 10-fold CV, are shown in TABLE-2.5.
On data set-III, the task is to classify the GPCR sequences from five families and thirty-nine subfamilies in two subsequent stages. The results are given in TABLE-2.6. For the family prediction, a set of 50 mRMR features is used to solve the 5-class problem. For the prediction of the 39 subfamilies, a set of 400 mRMR features is used to solve the 39-class problem. The value of η is taken to be 0.60 and 0.55 for the 4-fold and 8-fold kernels, respectively.
The data sets used in this work have the additional information about the start and end points
Table 2.5: Classification Accuracy in Data set-I

  Total No. of   Total No. of   Percentage of Accuracy^a
  Features       Sequences      SVM                  VVRKFA
                                4-fold†   8-fold†    4-fold†   8-fold†
  25             390            90.8      91.0       89.9      90.1
  50             390            95.1      96.3       94.9      96.1
  80             390            98.7      99.1       98.6      98.9

a 10-fold cross-validation is used.
† The value of η is taken to be 0.6 and 0.55 for the 4-fold and 8-fold kernels, respectively.
Table 2.6: Classification Accuracy in Naveed's GPCR Data set

  Levels               No. of      No. of       No. of     % of Accuracy^c
                       Classes^a   Sequences^b  Features   SVM                  VVRKFA
                                                           4-fold†   8-fold†    4-fold†   8-fold†
  Family               5           6125         50         96.2      98.9       96.3      97.5
  Subfamily            39          6125         400        83.8      90.1       82.5      87.3
  Class-A subfamily    17          4283         150        89.8      92.9       91.6      91.8
  Class-B subfamily^d  14          402          150        90.1      94.3       92.2      93.9
  Class-C subfamily    6           1415         80         76.1      80.6       77.8      78.4

a Both Class-D and Class-E consist of a single subfamily.
b Number of sequences for which TMHMM predicts a valid GPCR structure.
c In all results, 10-fold cross-validation is used to measure the percentage accuracies.
d TMHMM fails to predict a valid GPCR structure for all the sequences of the Gastric subfamily in Class-B.
† The value of η is taken to be 0.6 and 0.55 for the 4-fold and 8-fold kernels, respectively.
of the different regions of a GPCR sequence. As a result, it may not be feasible to compare the classification accuracy with that of some existing methods without incorporating this extra information.
2.4.3.2 Class A Subfamily Detection
In this stage of classification, the experiment is performed to identify only the Class-A subfamilies, as Class-A is the most important GPCR family. For this purpose, data set-II, given in TABLE-2.4, is used. This data set, prepared by Raghava et al. [4], is also used in the work of Cobanoglu et al. [10]. Emphasis is placed on this data set because it helps to compare the results with the existing methods. In data set-II, as shown in TABLE-2.4, there are 14 subfamilies of the GPCR Class-A family, and the number of sequences varies in each subfamily. Using 200 mRMR features, the proposed method achieves CV accuracies of 99.4% and 99.1% with SVM and VVRKFA, respectively, for the 8-fold kernel, as shown in TABLE-2.7.
Table 2.7: Classification Accuracy in Data set-II

  Total No. of   Total No. of   Percentage of Accuracy^a
  Features       Sequences      SVM                  VVRKFA
                                4-fold†   8-fold†    4-fold†   8-fold†
  50             885            94.6      95.4       94.9      95.7
  100            885            98.6      98.9       98.1      98.8
  150            885            95.1      96.3       94.9      96.1
  200            885            99.2      99.4       99.0      99.1

a 10-fold cross-validation is used.
† The value of η is taken to be 0.6 and 0.55 for the 4-fold and 8-fold kernels, respectively.
Table 2.8: Misclassification Table

  True         Predicted subfamilies of the misclassified seq.‡
  Subfamily†   4-fold kernel   8-fold kernel
  NUC          PEP (2)         PEP (1)
  PEP          NUC (1)         -
  RHD          PEP (1)         PEP (1)
  VIR          PEP (3)         PEP (3)

† Subfamilies not shown here have 100% accuracies.
‡ The number of misclassified sequences is inside the parentheses.
The misclassification table of a single run of the program using SVM is shown in TABLE-2.8. A total of 200 features are used by SVM. Out of 885 sequences, only 7 and 5 sequences are misclassified with the 4-fold and 8-fold kernels, respectively.
2.4.4 Comparison with Some Related Work
There have been prior efforts to predict the family and subfamily of a GPCR sequence from primary sequence information, but in every case different kinds of features are used. To demonstrate the effectiveness of the proposed string kernel, two methods, GPCRBind [10] and GPCRPred [4], are selected for comparison. In GPCRBind, the presence of a triplet (in an ECL) along with its position in a linear sense is considered; to incorporate this information, a computationally expensive exhaustive search is employed [10]. In GPCRPred, a combination of several SVMs is used, but the information related to the external regions is not considered [4]. GPCRPred is chosen because it employs a kernel classifier, SVM, so the effectiveness of the proposed string kernel as a preprocessing tool can be verified. GPCRBind is chosen because it is a method that makes use of the positional information of triplets. In TABLE-2.9, a comparison of the proposed work with GPCRPred and GPCRBind on the same data set is presented. The reported result of GPCRBind corresponds to that of its most successful run. It may be noted that GPCRPred, an SVM-based classification server, offers poor performance on some subfamilies. It is observed that the classification accuracy of the proposed method is much better than that of GPCRPred (92.8%). Additionally, the requirement of an ensemble of SVMs (as in GPCRPred) is relaxed in the proposed work, reducing the computational complexity. In this work, out of a total of 5324 features (4-fold case), at most 200 features (less than 5% of the total number of features) are used.
Table 2.9: Comparison of Classification Performances among GPCRPred, GPCRBind and proposed String Kernels

  Subfamily   Total   GPCRPred       GPCRBind       Proposed Method
                                                    8-fold         4-fold
  AMN         208     204 (98.1%)    203 (97.6%)    208 (100%)     208 (100%)
  CAN         11      9 (81.8%)      9 (81.8%)      11 (100%)      11 (100%)
  GRH         9       5 (55.5%)      9 (100%)       9 (100%)       9 (100%)
  HMP         24      21 (87.5%)     24 (100%)      24 (100%)      24 (100%)
  LYS         8       6 (75.0%)      8 (100%)       8 (100%)       8 (100%)
  MEL         13      10 (76.9%)     13 (100%)      13 (100%)      13 (100%)
  NUC†        33      24 (73.7%)     29 (87.8%)     32 (96.9%)     31 (93.9%)
  OLF         69      60 (86.9%)     68 (98.5%)     69 (100%)      69 (100%)
  PAF         4       0 (00.0%)      1 (25.0%)      4 (100%)       4 (100%)
  PEP†        304     301 (99.0%)    302 (99.3%)    304 (100%)     303 (99.7%)
  PRS         8       3 (37.5%)      8 (100%)       8 (100%)       8 (100%)
  RHD†        174     174 (100%)     169 (97.1%)    173 (99.4%)    173 (99.4%)
  TRH         7       4 (57.1%)      6 (85.7%)      7 (100%)       7 (100%)
  VIR†        13      0 (00.0%)      12 (92.3%)     10 (76.9%)     10 (76.9%)
  Overall     885     821 (92.8%)    861 (97.3%)    880 (99.4%)    878 (99.2%)

Note: Results are based on data set-II.
† For misclassification information, refer to TABLE-2.8.
2.4.5 Effect of Variation of η
As described in section 2.3.3, the confidence in the method employed to predict the start and end locations of the different segments can be modulated by the parameter η in (2.13) and (2.18).
[Figure 2.5 plots, for each of the 14 Class-A subfamilies (AMN through VIR), the classification accuracy of GPCRPred, GPCRBind, and the proposed 4-fold and 8-fold kernels.]

Figure 2.5: Comparison of Classification Performances among GPCRPred, GPCRBind and proposed String Kernels on data set-II.
The effect of the variation of η on classification accuracy is presented in Fig. 2.6. As observed in Fig. 2.6, the classification accuracy is maximum in the region 0.45 ≤ η ≤ 0.75. There may be two reasons for this. First, the performance of TMHMM may not be reliable. Second, there may be a large number of conserved triplets present in the internal segments. The second reason is more likely to be true, as it is evident from TABLE-2.11 that some of the most conserved triplets are in fact from the trans-membrane regions and ICLs. This is the rationale for selecting η close to 0.5 for the classification results presented in section 2.4.3.
[Figure 2.6 plots classification accuracy (88–100%) against the weight η (0–1) for the 4-fold and 8-fold kernels with 25, 100, and 200 selected features.]

Figure 2.6: Effect of variation of η on classification accuracy (%) in GPCR Class-A subfamily prediction (based on data set-II).
2.4.6 Identified Binding-Site Triplets and Their Positions
In this work, the feature-map constructed from a single sequence t is of length 5324 (for the 4-fold kernel). While reducing the dimension using the mRMR method, the most information-carrying triplets and their positions are identified. The identification of position is direct, because the feature-map is constructed in such a way that the first 1331 features correspond to the segment t1, the next 1331 features correspond to the segment t2, and so on. The same is true for the 8-fold kernel. The triplets and their locations are identified using the mRMR method
while performing the 4-Class classification problem of GPCRs using data set-I. Similarly in
Class A subfamily detection, some triplets and their location are identified using data set-II.
Some of the triplets, identified while classifying these twodata sets, are shown in TABLE-
2.10 and TABLE-2.11 respectively. To identify the particular family or subfamily related
Table 2.10: Identified motifs and their locations(dataset-I : GPCR Family)Triplets ti
† Triplets ti Triplets ti Triplets ti Triplets ti
GHG t2 DGJ t1 BAA t1 EDB t2 JAH t2AEJ t1 IAB t2 IHK t1 BCA t1 GBE t4AJA t2 HKJ t2 FAC t2 BIH t3 BDI t1AAA t4 JIJ t4 IKG t1 BAB t4 AAD t2AAE t3 CGH t1 EAC t2 JKE t4 JGB t2CKE t3 AJH t3 HFK t3 KCB t2 JJH t2DIH t3 IHC t3 DJE t2 ADH t1 FDJ t2IHA t3 JJI t2 CIA t1 HAF t2 KJH t1KJE t3 HHA t3 JDJ t2 IHE t3 EIH t1AGD t4 AFA t2 AAG t1 BCE t1 AAD t1† Refer to TABLE-2.2 for segment locations
Table 2.11: Subfamily-specific motifs and their locations (8-fold kernel) (Dataset-II : Class-A subfamily)

Triplet  ti  Subfamily   Triplet  ti  Subfamily   Triplet  ti  Subfamily
EIG      t5  RHD         JCB      t4  OLF         BKE      t2  CAN
GHD      t4  NUC         GEJ      t6  AMN         FJI      t8  CAN
IIF      t1  HMP         KAB      t4  PEP         CKB      t6  GRH
BAJ      t4  PEP         IEF      t4  AMN         HIJ      t5  PRS
AIB      t3  PEP         AAH      t4  PEP         KDA      t3  VIR
KAJ      t2  OLF         JFJ      t8  PEP         IEJ      t4  TRH
JAK      t5  RHD         FGA      t2  LYS         DBJ      t4  MEL
HAA      t6  NUC         BJH      t4  AMN         AJK      t2  MEL
AGH      t4  RHD         FAI      t6  PEP         KAJ      t5  LYS
BFJ      t8  AMN         BDB      t8  OLF         DIA      t5  GRH
† Refer to TABLE-2.2 for segment locations
to a particular triplet, the reduced feature-map needs to be considered. The procedure to
identify the relevant class is described in subsection 2.3.5.1. The subfamilies corresponding to some of the features are listed in TABLE-2.11.
2.4.7 Effect of Selected Reduced Feature Set
After constructing the proposed string kernel, the number of features is reduced using the mRMR method. The effect of the number of features on classification accuracy is presented in Fig. 2.7. It can be observed that the classification accuracy saturates beyond a certain number of features. It is inferred that most of the features in the feature-map are redundant; only a few features suffice to classify GPCRs. This justifies the use of mRMR to reduce the number of features.
[Figure: classification accuracy (70 to 100%) versus number of features (0 to 300); one curve each for the 4-fold and 8-fold kernels.]
Figure 2.7: Number of features versus classification accuracy (%) in GPCR Class-A subfamily prediction (a 17-class problem).
2.5 Contribution of this Chapter
In this work, two kernels are designed to classify GPCR sequences using available structural information. Improved accuracy is achieved in both the GPCR family and the GPCR Class-A subfamily classification problems by using kernel classifiers. A comparison with GPCRPred and GPCRBind shows that these kernels can improve the classification accuracy. A few triplets, or motifs, relevant to ligand-binding processes are identified, along with their locations. This class of methods can also classify other types of protein sequences with a-priori structural information.
Chapter 3
Expectation Maximization in Random
Projected Spaces to Find Motifs in DNA
Sequences
3.1 Introduction
The motif finding problem has received attention in the field of computational biology over the last two decades, given the availability of large genomic sequences and the necessity of finding conserved regions in those sequences to identify transcription factor binding sites. In the motif finding problem, the task is to identify short conserved substrings, or motifs, in a data set of long strings. The problem becomes more difficult if the short substrings are not identical but instead carry some form of mutation. In that case, the problem is to find short substrings which are approximately conserved across the data set. For example, in the (l, d) motif finding problem, given a set S of N sequences of length L, the task is to find a substring m of length l which appears frequently in S accompanied by mutations in d random positions. An example is the (15, 4) motif finding problem. This problem, known as the challenge problem, was introduced in [15]. Later, in [16], a mathematical analysis was given explaining the inherent intractability of the problem. Expectation Maximization (EM) based techniques are popular for finding the motif model from the data set in such a way that the motif model is maximally dissimilar to the background model. The solution of this problem leads to the identification of hidden motif start positions. In this work, a variant of the Expectation Maximization (EM) method is discussed, and the solution is achieved using fewer computations while maintaining the effectiveness of the EM method for
de-novo motif discovery.
3.2 Preliminaries
The Expectation Maximization (EM) method was first introduced to the conserved location identification problem in [23]. This was a simple model to identify motifs which occur only once in each sequence. This framework is called the one occurrence per sequence, or oops, model. Later, in [17], a more accurate and effective model for the EM-based motif identification problem was established. This model is effective given a reasonably good starting point. To incorporate some randomness into the EM algorithm, the Monte Carlo Expectation Maximization method was proposed in [25], where the expectation is calculated through Monte Carlo simulation. In [18], this concept is used to discover motifs in biomolecular sequences. In [16], a locality-sensitive hashing method called random projection is proposed. This method can be used to identify a starting point for EM-based algorithms. In [24], it is shown that the performance can be improved by choosing projections uniformly instead of randomly. In this section, the local alignment problem for motif discovery and two algorithms related to this work are described.
3.2.1 Local Alignment Problem for Motif Discovery
Let S = {S_1, S_2, ..., S_N} be the data set of N sequences of length L_i (i ∈ {1, 2, ..., N}). Let S_ij ∈ Σ be the residue symbol at position j of the ith sequence S_i, where Σ is the alphabet of biomolecules. For DNA sequences, Σ = {A, T, G, C} and |Σ| = 4. If the oops model is considered, then a total of N motifs are present in S. This model can be generalized to any number of motif occurrences in S. In the stochastic model of sequence generation, the assumption is that the w-mer motif instances and the background (the rest of the sequence, which is not motif) come from different distributions, i.e., the ratio of different residues in the w-mer motifs present in S is essentially different from the ratio of residues comprising the background. The residues comprising the background are drawn from an independent and identical multinomial distribution θ_0. The residues constituting the w-mer motifs in S are drawn from an independent but not identical multinomial distribution θ_j, where j (1 ≤ j ≤ w) is the position of the residue in the motif. The distributions are not identical because θ_j differs across positions j. A motif can be thought of as a sequence whose residues are drawn from a product of w multinomial distributions: Θ = [θ_1, θ_2, ..., θ_w]. Θ, together with θ_0, completely characterizes the Position Weight Matrix (PWM), whose each
element w_jk = log(θ_jk / θ_0k) is a measure of the dissimilarity between the motif model at the jth position and the background model. All EM-based algorithms for motif finding iteratively update the PWM to maximize this dissimilarity. Let A_i ∈ {0, 1}^(L_i−w+1) be the indicator vector containing the information of motif start locations in sequence S_i. The value of the jth element of A_i, A_ij, is one if a w-mer motif starts from location j in S_i, and zero otherwise. A = [A_1^T, A_2^T, ..., A_N^T] represents a possible local alignment of S. The total number of such possible alignments is

    ∏_{i=1}^{N} C(L_i − w + 1, |A_i|)

where C(n, r) denotes the binomial coefficient and |A_i| = Σ_{l=1}^{L_i−w+1} A_il is the total number of motifs present in sequence S_i. For the oops model, only one element in A_i has a nonzero value, and a single variable a_i is sufficient to store the information; a_i = j if A_ij = 1. Corresponding to each A, the model parameters θ_0 and Θ can be computed easily. From θ_0 and Θ, the extent to which the motif is conserved can be inferred. For example, a score Q can be assigned to an alignment A as a measure of conservation of the motifs such that:

    Q = |A| · Σ_{k∈Σ} θ_0k ln(θ_0k) + |A| · Σ_{j=1}^{w} Σ_{k∈Σ} θ_jk ln(θ_jk)        (3.1)

The objective of a motif finding algorithm is to find a suitable alignment A, or equivalently a model Θ and θ_0, such that the corresponding Q is maximized. Given the large alignment space and the NP-completeness of the problem [50], EM-based algorithms are employed to solve the problem iteratively.
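The score in (3.1) can be evaluated directly from an alignment by counting residues. Below is a minimal, illustrative sketch for the oops model (one start position per sequence); the function name `q_score` and its argument names are assumptions, not the thesis implementation:

```python
import math
from collections import Counter

ALPHABET = "ATGC"

def q_score(sequences, starts, w):
    """Score an oops alignment with Eq. (3.1):
    Q = |A|*sum_k th0k*ln(th0k) + |A|*sum_j sum_k thjk*ln(thjk)."""
    n = len(sequences)
    motif_counts = [Counter() for _ in range(w)]   # column-wise counts -> theta_j
    background = Counter()                          # residues outside motifs -> theta_0
    for seq, a in zip(sequences, starts):
        for j in range(w):
            motif_counts[j][seq[a + j]] += 1
        background.update(seq[:a] + seq[a + w:])
    total_bg = sum(background.values())
    q = sum((background[c] / total_bg) * math.log(background[c] / total_bg)
            for c in ALPHABET if background[c] > 0)
    for col in motif_counts:                        # motif-model (second) term
        q += sum((col[c] / n) * math.log(col[c] / n)
                 for c in ALPHABET if col[c] > 0)
    return n * q                                    # |A| = N in the oops model
```

A perfectly conserved alignment makes every motif column pure, so the second term vanishes and Q reduces to the background entropy term; shifting the alignment away from the true starts lowers Q.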
3.2.2 Expectation Maximization (EM) Method for Motif Discovery
The EM algorithm is used to learn the parametric model of a partially observable stochastic process. In the multiple local alignment problem, the dataset S containing the sequences is the observable data. The indicator variable A is not observable; it corresponds to hidden data. Θ and θ_0 are model parameters. The objective is to find a model which maximizes the marginal probability P(S | Θ, θ_0). A simple EM-based model is presented in this section for formalising the concepts. A random alignment is taken at the initialization step by randomly initializing A. From A, the model parameters Θ and θ_0 are updated. The update scheme can be simplified by counting residues to calculate the elements of Θ and θ_0. From these model parameters, the expected value of each element of A is calculated. This two-step iteration goes
on until convergence. At the rth iteration, in the E-step, the expected value of A_ik is calculated as:

    A_ik = E[A_ik | S_i, Θ^r, θ_0^r]
         = 1 × Pr(A_ik = 1 | S_i, Θ^r, θ_0^r)
         = Pr(S_i | A_ik = 1, Θ^r, θ_0^r) · [ Pr(A_ik = 1 | Θ^r, θ_0^r) / Pr(S_i | Θ^r, θ_0^r) ]        (3.2)
In (3.2), A_ik is a binary variable. In the last step, Bayes' Theorem is used. As motifs occur independently in different sequences, A_ik depends only on sequence S_i. The numerator inside the bracket is the prior probability that a motif starts at position k, which is taken to be the same (1/(L_i − w + 1)) for all k. Θ^r and θ_0^r are the last updated values of Θ and θ_0. Without calculating the bracketed term in (3.2) explicitly, at each step, A_ik is divided by Σ_∀k A_ik, using the fact that Σ_∀k Pr(A_ik = 1 | S_i, Θ^r, θ_0^r) = 1. The likelihood is determined as:
    Pr(S_i | A_ik = 1, Θ^r, θ_0^r) = ∏_{u∉μ} ∏_{c∈Σ} (θ_0c)^δ_{c,S_iu} · ∏_{m=1}^{w} ∏_{b∈Σ} (θ_mb)^δ_{b,S_{i,k+m−1}}        (3.3)
In (3.3), μ represents the set of indices from k to k + w − 1 in S_i. δ is the Kronecker delta function: δ_{a,b} = 1 if a = b, and δ_{a,b} = 0 otherwise. In (3.3), the terms corresponding to the background model vary insignificantly given a sufficiently large value of L_i. So, it is safe to assume that Pr(S_i | A_ik = 1, Θ^r, θ_0^r) ∝ ∏_{m=1}^{w} ∏_{b∈Σ} (θ_mb)^δ_{b,S_{i,k+m−1}}. The constant arising from this proportionality is taken care of in the normalization process of A_ik, as described earlier. In the M-step, the Θ^{r+1} for which the probability Pr(S, A | Θ) (or equivalently its logarithm) is maximized is found. Here, the term θ_0 is dropped because of its insignificant contribution. The update is done for 1 ≤ j ≤ w and b ∈ Σ as:
The update procedure is done for1 ≤ j ≤ w andb ∈ Σ as:
    θ_jb^{r+1} = (1/N) Σ_{i=1}^{N} Σ_{k=1}^{L_i−w+1} A_ik δ_{b,S_{i,k+j−1}}        (3.4)
In (3.4), a weighted sum of frequencies of the different b ∈ Σ is used. After a few iterations, if the initialization is reasonably good, the algorithm will adjust the weights in favour of the true motif start locations. Sophisticated EM-based algorithms use word statistics to generate seeds for initialization.
3.2.3 Random Projection Method
Random projection is used for dimensionality reduction in the context of image and text data. A similar concept was first introduced to the motif finding problem in [16]. The algorithm deals with a projected version of the motif instances and is performed over n independent trials. At each trial, a random projection of length k, where 0 < k < w, is selected. The projection is a set P of indices chosen uniformly at random from the set {1, 2, 3, ..., l} without replacement. If x is a w-mer, then let f(x) be the k-mer that results from concatenating the bases at the selected k positions of x. A hash table containing all possible combinations of k-length strings from the alphabet Σ is constructed. If x is considered as a point in a w-dimensional Hamming space, f(x) is the projection of x onto a k-dimensional subspace, and each x has a corresponding entry in the hash table. w-mers from the data set are chosen randomly and hashed. If M is the true motif, let P_i be the set of mutated positions in a motif instance (a mutated version of the motif). The motif instance will be hashed to the index corresponding to f(M) if P ∩ P_i = ∅, where ∅ is the empty set. The trade-off in random projection is maintained by selecting an appropriate k. If k is large, then there is less probability of P ∩ P_i being ∅; in this case, true motif instances will not hash to the index corresponding to f(M) in the hash table. If k is small, then a large number of spurious w-mers will make it difficult for f(M) to be identified.
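The hashing step described above can be sketched as follows. This is an illustrative helper (the function name and 0-based index convention are assumptions); a dictionary stands in for the hash table, keyed by the projected k-mer f(x):

```python
import random
from collections import defaultdict

def projection_buckets(sequences, w, k, seed=0):
    """Hash every w-mer in the data set by its projection f(x) onto k
    randomly chosen positions P; instances of the true motif whose
    mutated positions avoid P land in the same bucket as f(M)."""
    rng = random.Random(seed)
    P = sorted(rng.sample(range(w), k))            # the projection (0-based here)
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for start in range(len(seq) - w + 1):
            x = seq[start:start + w]
            f_x = "".join(x[p] for p in P)          # f(x): the projected k-mer
            buckets[f_x].append((i, start))         # remember the w-mer's origin
    return P, buckets
```

In a trial, the heaviest bucket is taken as the candidate f(M), and the w-mers that hashed into it seed the subsequent refinement.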
3.3 Motivation
The intuition behind the random projection algorithm is incorporated in this work to identify an appropriate motif model from the data using EM. The purpose of random projection is to identify a reasonably good starting point for EM. The projection of the potential motif is identified by the random projection method. The initial value for the motif model, Θ, is then biased towards the projection. For example, given motif length w = 5, if the projection P = {1, 3, 4} and the potential motif projection is CAC, then the initial value for the motif model, Θ, can be given as:
Θ =
0.1 0.25 0.6 0.2 0.25
0.1 0.25 0.1 0.1 0.25
0.2 0.25 0.1 0.1 0.25
0.6 0.25 0.2 0.6 0.25
where the rows of Θ represent the probabilities of the nucleotides A, T, G and C respectively, and the columns of Θ represent the different positions in the motif. The first, third and fourth columns are biased towards C, A and C respectively. The second and fifth columns of Θ are kept equiprobable, as random projection does not provide any information for them. It might be advantageous, from a computational point of view, not to use these equiprobable columns in the EM method. This work explores this possibility.
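A biased initial model of this form can be constructed mechanically from a projection and its projected k-mer. The sketch below is illustrative: it splits the non-target mass equally, so the exact values differ slightly from the hand-chosen ones in the example above:

```python
def biased_initial_theta(w, projection, projected_motif,
                         alphabet="ATGC", strong=0.6):
    """Build a 4 x w initial motif model biased towards the bases seen
    at the projected (1-based) positions; unprojected columns stay
    uniform at 0.25. `strong` = 0.6 mirrors the example above."""
    weak = (1.0 - strong) / (len(alphabet) - 1)    # remaining mass, split equally
    theta = [[1.0 / len(alphabet)] * w for _ in alphabet]
    for pos, base in zip(projection, projected_motif):
        for row, a in enumerate(alphabet):
            theta[row][pos - 1] = strong if a == base else weak
    return theta
```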
3.4 Proposed Method: EM in Randomly Projected Spaces
Let the motif model be Θ. The projected motif model corresponding to the projection P is represented by Θ_P. Given Θ_P, the Q-function in (3.1) can be written as:

    Q = |A| · Σ_{k∈Σ} θ_0k ln(θ_0k) + |A| · Σ_{j∈P} Σ_{k∈Σ} θ_jk ln(θ_jk)        (3.5)
In the E-step, following (3.2), the expected value of the indicator variable A_ik is calculated as:

    A_ik = E[A_ik | S_i, Θ_P^r, θ_0^r]
         = 1 × Pr(A_ik = 1 | S_i, Θ_P^r, θ_0^r)
         = Pr(S_i | A_ik = 1, Θ_P^r, θ_0^r) · [ Pr(A_ik = 1 | Θ_P^r, θ_0^r) / Pr(S_i | Θ_P^r, θ_0^r) ]        (3.6)
The likelihood in (3.3) can be modified as:

    Pr(S_i | A_ik = 1, Θ_P^r, θ_0^r) = ∏_{u∉μ} ∏_{c∈Σ} (θ_0c)^δ_{c,S_iu} · ∏_{m∈P} ∏_{b∈Σ} (θ_mb)^δ_{b,S_{i,k+m−1}}        (3.7)
The update in (3.4) is done for j ∈ P and b ∈ Σ as:

    θ_jb^{r+1} = (1/N) Σ_{i=1}^{N} Σ_{k=1}^{L_i−w+1} A_ik δ_{b,S_{i,k+j−1}}        (3.8)
Thus, only the values of the model parameters corresponding to the projected positions are calculated. This reduces the computational complexity.
3.5. Data Set 35
3.4.1 An example of Projected Motif Model
Let the motif length w be five. Suppose that, in some iteration, the value of the model parameter is
Θ =
0.1 0.1 0.6 0.2 0.1
0.1 0.1 0.1 0.2 0.1
0.2 0.7 0.1 0.2 0.1
0.6 0.1 0.2 0.4 0.7
If the projected position set is P = {2, 4, 5}, then the projected motif model is
Θ[2,4,5] =
0.1 0.2 0.1
0.1 0.2 0.1
0.7 0.2 0.1
0.1 0.4 0.7
3.5 Data Set
To validate the efficiency of the proposed algorithm, two types of datasets are used. The first is a synthetic data set in which different instances of (l, d) motifs are generated randomly. The second data set is prepared from the JASPAR website [51].
• Synthetic Data set: The conventional procedure to test the effectiveness of a motif finding algorithm is to use a synthetic data set containing (l, d) motif instances and apply the algorithm to the dataset to verify its efficiency. Although there are some fundamental differences between a synthetic data set and a true biological data set, this type of testing platform provides a reasonable approximate measure of the effectiveness of motif finding algorithms. In the synthetic data set used in this work, the motif instances are generated by taking a random l-length genomic sequence and mutating d positions randomly. These motif instances are implanted at random locations of random background sequences with a fixed GC fraction. 20 such background sequences containing different instances of the (l, d) motif are used. Different background lengths are used to vary the difficulty level of the problem.
• JASPAR Data set: This data set is prepared by taking some of the experimentally verified biological motif instances (transcription factors of eukaryotes) from the JASPAR website [51]. These instances are implanted in random background sequences
[Figure: average number of errors (0 to 3) versus background length (50 to 600) for conventional EM and the proposed method.]
Figure 3.1: Comparison of conventional EM and the proposed method (the error measure is the average number of mismatches between the true motif and the identified motif).
(with different lengths) at random locations to increase the difficulty level of the problem. These backgrounds act as promoter regions. The GC fraction of the promoter regions is taken to be 0.45.
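The synthetic generation procedure above can be sketched as follows. This is an illustrative helper under the oops convention (one implanted instance per sequence); the function name and defaults are assumptions:

```python
import random

def make_planted_dataset(n_seq=20, bg_len=300, l=15, d=4, gc=0.45, seed=1):
    """Generate an (l, d) planted-motif dataset: one motif instance with
    exactly d mutated positions implanted per background sequence."""
    rng = random.Random(seed)
    def bg_base():                               # background with fixed GC fraction
        return rng.choice("GC") if rng.random() < gc else rng.choice("AT")
    motif = "".join(rng.choice("ATGC") for _ in range(l))
    seqs, starts = [], []
    for _ in range(n_seq):
        inst = list(motif)
        for p in rng.sample(range(l), d):        # mutate d distinct positions
            inst[p] = rng.choice([b for b in "ATGC" if b != inst[p]])
        bg = [bg_base() for _ in range(bg_len)]
        pos = rng.randrange(bg_len - l + 1)
        bg[pos:pos + l] = inst                   # implant at a random location
        seqs.append("".join(bg))
        starts.append(pos)
    return motif, seqs, starts
```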
3.6 Experimental Results and Analysis
3.6.1 Results on synthetic data set
The synthetic data set of 20 sequences with a planted (15, 4) motif (oops model) is prepared as described in Section 3.5. The background length is varied from 50 to 650. With an increase in the background length, the difficulty level of the problem increases. The results corresponding to the conventional EM and the proposed method are shown in Fig. 3.1. For a fixed background length, the experiment is performed on 50 different synthetic data sets. The error is calculated as the average number of mismatches between the true motif and the identified motif (the projected version of the identified motif in the case of the proposed method). The proposed method shows better results on difficult problems.

Another measure of performance is the number of mismatches between the motif start locations identified via the indicator variable A_ik and the true motif start locations. The results corresponding to this measure are shown in Fig. 3.2. Although the proposed method seems inferior according to this measure, it should be noted that only a few correct motif start locations are sufficient to identify the true motif. If a consensus is taken from the identified motif start locations, a few erroneous motif start locations will not lead to an erroneous result. Another study is performed to examine the effect of different projection lengths on the performance of the proposed method. The projection length is varied from 5 to 15.
[Figure: error in motif start locations (0 to 8) versus background length (100 to 600) for conventional EM and the proposed method.]
Figure 3.2: Comparison of conventional EM and the proposed method (the error measure is the average number of mismatches between the motif start locations identified via the indicator variable A_ik and the true motif start locations).
[Figure: average number of errors (0 to 4) versus projection length (5 to 15).]
Figure 3.3: Performance for different projection lengths in a (15, 4) problem (the error measure is the average number of mismatches between the true motif and the identified motif).
The result in Fig. 3.3 shows that the performance does not vary significantly. The marginal improvement of the algorithm for longer projection lengths comes at the expense of increased computational cost.
3.6.2 Results on JASPAR data set
This dataset is prepared from the JASPAR website. Various motifs are grouped together according to their length. The lengths of these motifs and the results corresponding to the conventional EM and the proposed method are shown in TABLE-3.1.

It should be noted that even if only 50% of the motif start positions are identified correctly, a consensus at the end can produce a satisfactory result.
Table 3.1: Performance on the JASPAR data set

JASPAR Group             | Motif Length | Background Length | EM   | Proposed Method
MA0081, MA0093, MA0096   | 7            | 200               | 0.18 | 0.24
MA0067                   | 8            | 200               | 0.13 | 0.19
MA0005, MA0009           | 11           | 300               | 0.28 | 0.36
MA0114                   | 13           | 300               | 0.21 | 0.25
MA0116                   | 15           | 500               | 0.30 | 0.39
MA0060                   | 16           | 500               | 0.23 | 0.33
MA0007                   | 22           | 500               | 0.40 | 0.53

Note: The EM and Proposed Method columns give the average number of errors, shown as a fraction of the total number of motif locations that are erroneously predicted.
3.7 Conclusion and Contribution
A method is proposed to perform EM through a projected motif model to identify motifs in genomic sequences. Given a reasonably good initial starting point, the deterministic EM algorithm converges quickly to a local optimum. In that case, the proposed method can be used to reduce the computation. The projected motif model is smaller than the entire motif model, which reduces the computational cost while maintaining the effectiveness of EM to some extent. The limitation of this method is overcome by using a variant of the Monte-Carlo Expectation Maximization method, which is discussed in the next chapter.
Chapter 4
Monte-Carlo Expectation Maximization
for Finding Motifs in DNA Sequences
4.1 Preliminaries
The Expectation Maximization (EM) method was introduced to the conserved site identification problem by Lawrence and Reilly [23]. This method identifies motifs which occur only once in each sequence. Later, Bailey and Elkan provided a more generalized model for the EM-based motif identification problem [17]. This iterative model is effective given a reasonably good starting point. If the initial guess about the starting point of EM is not good enough, the deterministic algorithm converges quickly to a local optimum. To ameliorate this limitation, the Monte Carlo Expectation Maximization (MC EM) method was proposed by Wei and Tanner, where the expectation step is calculated through Monte Carlo simulation [25]. This incorporates randomness into the EM algorithm. In the Monte Carlo EM Motif Discovery Algorithm (MCEMDA) [18], this concept is used to discover motifs in DNA sequences. In this section, the problem of multiple local alignment for motif discovery is discussed again, along with the two related algorithms, EM [17] and MCEMDA [18], for formalizing the concepts.
4.1.1 Multiple Local Alignment for Motif Discovery
Let S = {S_1, S_2, ..., S_N} be the dataset containing N sequences of length L_i (i ∈ {1, 2, ..., N}). Let S_ij ∈ Σ denote the residue symbol at position j of the ith sequence S_i, wherein Σ is the alphabet of biomolecules. For DNA sequences, Σ = {A, T, G, C} and |Σ| = 4. If the One
Occurrence Per Sequence (oops) model is assumed, then N motifs are present in S. This model can be generalized to any number of motif occurrences in S. In the stochastic model of sequence generation, the assumption is that the w-mer motifs and the background (the non-motif portion of a sequence) are generated from different A-T-G-C distributions, i.e., the ratio of different residues in the w-mer motifs present in S is essentially different from that of the background. The background is drawn from an independent and identical multinomial distribution θ_0. The residues constituting the w-mer motifs in S are drawn from an independent but not identical multinomial distribution θ_j, where j (1 ≤ j ≤ w) is the position of the residue in the motif. The A-T-G-C distributions θ_j vary with j. Any motif present in S can be considered as a sequence whose residues are drawn from a product of w multinomial distributions: Θ = [θ_1, θ_2, ..., θ_w]. The joint distribution Θ, together with the background distribution θ_0, completely characterizes the Position Weight Matrix (PWM), whose each element w_jk = log(θ_jk / θ_0k) (where k ∈ Σ) is a measure of the dissimilarity between the motif model at the jth position and the background model. All EM-based motif finding algorithms iteratively update the PWM to maximize this dissimilarity. Let A_i ∈ {0, 1}^(L_i−w+1) be the indicator vector containing the information of motif start locations in sequence S_i. The value of the jth element of A_i, A_ij, is one if a w-mer motif starts from location j in S_i, and zero otherwise. A = [A_1^T, A_2^T, ..., A_N^T] represents a possible local alignment of S, and the total number of such possible alignments is

    ∏_{i=1}^{N} C(L_i − w + 1, |A_i|)

where C(n, r) denotes the binomial coefficient and |A_i| = Σ_{l=1}^{L_i−w+1} A_il is the total number of motifs present in sequence S_i. For the oops model, |A_i| = 1 and a single variable a_i is sufficient to store the information; a_i = j if A_ij = 1. For a given A, the model parameters θ_0 and Θ can be computed. From θ_0 and Θ, one can infer the extent to which the motif is conserved. For example, a score Q_n can be assigned to an alignment A as a measure of conservation of the motifs such that:

    Q_n = |A| · Σ_{k∈Σ} θ_0k ln(θ_0k) + |A| · Σ_{j=1}^{w} Σ_{k∈Σ} θ_jk ln(θ_jk)        (4.1)

The first term corresponds to the background uncertainty and the second term relates to that of the motif model. The objective of a motif finding algorithm is to find a suitable alignment A, or equivalently a model Θ and θ_0, so that the corresponding Q_n of the model is maximized. Given the large alignment space and the NP-completeness of the problem [50], EM-based algorithms are employed to solve this problem iteratively.
4.1.2 Expectation Maximization in Motif Finding
The EM algorithm is employed to learn the parametric model of a partially observable stochastic process. In the multiple local alignment problem, the dataset S containing the sequences is the observable data. The indicator variable A is not observable and corresponds to hidden data. The model parameters are Θ and θ_0. The objective here is to find a model which maximizes the marginal probability P(S | Θ, θ_0). A simple EM-based model is presented in this sub-section to formalize the concept. A random alignment is taken at the initialization step with a random choice of A. From A, the model parameters Θ and θ_0 are updated. The model update scheme can be simplified by counting A-T-G-C residues to compute the elements of Θ and θ_0. From these model parameters, the expected value of each element of A is calculated. This two-step iteration goes on until convergence. At the rth iteration, in the E-step, the expected value of A_ij is calculated as:

    A_ij = E[A_ij | S_i, Θ^r, θ_0^r] = 1 × Pr(A_ij = 1 | S_i, Θ^r, θ_0^r)
         = Pr(S_i | A_ij = 1, Θ^r, θ_0^r) · [ Pr(A_ij = 1 | Θ^r, θ_0^r) / Pr(S_i | Θ^r, θ_0^r) ]        (4.2)
In (4.2), A_ij ∈ {0, 1} is a binary variable. In the final step, Bayes' Theorem is used. As motifs occur independently in different sequences, A_ij depends only on the sequence S_i. The numerator inside the bracket is the prior probability that a motif starts at position j, which is taken to be the same, (L_i − w + 1)^−1, for all j. The model parameters in the rth iteration are denoted as Θ^r and θ_0^r. Without calculating the bracketed term in (4.2) explicitly, at each step, A_ij is divided by Σ_∀j A_ij, using the fact that Σ_∀j Pr(A_ij = 1 | S_i, Θ^r, θ_0^r) = 1. The likelihood is determined as:

    Pr(S_i | A_ij = 1, Θ^r, θ_0^r) = ∏_{u∉μ} ∏_{c∈Σ} (θ_0c)^δ_{c,S_iu} · ∏_{m=1}^{w} ∏_{b∈Σ} (θ_mb)^δ_{b,S_{i,j+m−1}}        (4.3)
In (4.3), μ represents the set of indices from j to j + w − 1 in S_i. δ is the Kronecker delta function: δ_{a,b} = 1 if a = b, and δ_{a,b} = 0 otherwise. In (4.3), given a sufficiently large value of the length L_i of the ith sequence, the first term, corresponding to the background model, remains almost constant. So, it can be assumed that Pr(S_i | A_ij = 1, Θ^r, θ_0^r) ∝ ∏_{m=1}^{w} ∏_{b∈Σ} (θ_mb)^δ_{b,S_{i,j+m−1}}. The proportionality constant is taken care of during the normalization process of A_ij.
In the M-step, the Θ^{r+1} for which the probability Pr(S, A | Θ) (or equivalently its logarithm) is maximized is found. Here, the term θ_0 is omitted as it is nearly constant. The model parameter is updated for 1 ≤ t ≤ w and b ∈ Σ as:

    θ_tb^{r+1} = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{L_i−w+1} A_ij δ_{b,S_{i,j+t−1}}        (4.4)
In (4.4), a weighted sum of frequencies of the different b ∈ Σ is used. After a few iterations (with a good initialization), the algorithm will converge towards the true motif start locations. An advanced EM-based algorithm such as MEME uses word statistics in order to identify a motif having the best statistical relevance with respect to the background [17]. The parameter θ_0 is an example of a low-order background model which considers the frequency of individual letters. Additionally, MEME takes into account the frequency of words (combinations of A-T-G-C). In this sense, it uses a higher-order Markov background model and is thus more effective than a naive EM-based motif finding algorithm.
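The E-step (4.2) and M-step (4.4) for the oops model can be sketched compactly. This is an illustrative NumPy sketch, not the thesis implementation: as in the text, the near-constant background term is dropped from the likelihood, and a small pseudocount in the M-step (an added assumption) keeps the model away from zeros:

```python
import numpy as np

ALPHA = "ATGC"

def em_oops(seqs, w, theta, n_iter=50, pseudo=0.1):
    """Deterministic EM of Eqs. (4.2)-(4.4), oops model.
    theta is a w x 4 motif model (rows = positions, cols = A,T,G,C)."""
    idx = [np.array([ALPHA.index(c) for c in s]) for s in seqs]
    for _ in range(n_iter):
        # E-step (4.2): posterior over the start position in each sequence.
        A = []
        for s in idx:
            n_pos = len(s) - w + 1
            lik = np.array([theta[np.arange(w), s[k:k + w]].prod()
                            for k in range(n_pos)])   # motif part of Pr(S_i | A_ik=1)
            A.append(lik / lik.sum())                  # normalization absorbs the prior
        # M-step (4.4): expected base counts at each motif column.
        counts = np.zeros((w, len(ALPHA)))
        for s, a in zip(idx, A):
            for k, weight in enumerate(a):
                counts[np.arange(w), s[k:k + w]] += weight
        theta = (counts + pseudo) / (counts + pseudo).sum(axis=1, keepdims=True)
    return theta, A
```

Given a slightly biased initialization, the posteriors A_i sharpen around the planted start positions over the iterations.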
4.1.3 Monte Carlo EM Motif Discovery Algorithm
In MCEMDA [18], randomness is introduced in the M-step by carrying out Monte Carlo simulation to update Θ. The deterministic averaging makes EM a local greedy search algorithm. The randomness introduced in MCEMDA may help the algorithm to escape from local optima and thus may increase the chance of producing a better solution, as described in [18]. The introduction of this randomness makes MCEMDA more effective than the conventional EM algorithm. In the M-step of MCEMDA, an integer value is assigned to a_i according to the distribution:
    u(a_i = l | S_i, Θ^r) = [ ∏_{j=1}^{w} ∏_{b∈Σ} (θ_jb^r / θ_0b^r)^δ_{b,S_{i,l+j−1}} ] / [ Σ_{t=1}^{L_i−w+1} ∏_{j=1}^{w} ∏_{b∈Σ} (θ_jb^r / θ_0b^r)^δ_{b,S_{i,t+j−1}} ]        (4.5)
where 1 ≤ a_i ≤ L_i − w + 1 (refer to sub-section 4.1.1). Then, Θ is updated as:

    θ_jb^{r+1} = [ Σ_{i=1}^{N} δ_{b,S_{i,a_i+j−1}} + β_jb ] / [ Σ_{i=1}^{N} ( Σ_{b∈Σ} δ_{b,S_{i,a_i+j−1}} + Σ_{b∈Σ} β_jb ) ]        (4.6)
The vector β serves as a pseudocount to overcome the zero-count problem; it can also be thought of as a Dirichlet prior. The value of a_i is drawn multiple times (say m times) from the distribution given in (4.5). In each simulation, Θ is calculated according to (4.6), and the Q-function in (4.1) is evaluated. If the Θ corresponding to the best Q-function is stored and used in the next iteration, the strategy is called the m-best strategy. On the other hand, if an average of all the model parameters is used in the next iteration, the strategy is called the m-average strategy. The m-average strategy is computationally demanding and slower than the m-best strategy [18].
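The randomized M-step of (4.5)-(4.6) can be sketched as a sampling step plus a count-based update. This is an illustrative sketch (function names and the uniform θ_0 in the test are assumptions); sequences are assumed pre-encoded as integer arrays over A,T,G,C:

```python
import numpy as np

ALPHA = "ATGC"

def sample_starts(seqs_idx, theta, theta0, rng):
    """Draw one start position a_i per sequence from the
    likelihood-ratio distribution of Eq. (4.5)."""
    w = theta.shape[0]
    starts = []
    for s in seqs_idx:
        n_pos = len(s) - w + 1
        u = np.array([(theta[np.arange(w), s[t:t + w]]
                       / theta0[s[t:t + w]]).prod() for t in range(n_pos)])
        starts.append(int(rng.choice(n_pos, p=u / u.sum())))
    return starts

def update_theta(seqs_idx, starts, w, beta=0.1):
    """Model update of Eq. (4.6); beta acts as the Dirichlet pseudocount."""
    counts = np.full((w, len(ALPHA)), beta)
    for s, a in zip(seqs_idx, starts):
        counts[np.arange(w), s[a:a + w]] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)
```

Under the m-best strategy, this pair would be evaluated m times per iteration and the Θ with the best Q-function kept for the next iteration.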
4.1.3.1 MCEMDA as a Markov Chain
MCEMDA can be visualized as a Markov chain, as shown in Fig. 4.1. At each step, the process has the chance of taking a new direction depending on the distribution in (4.5), and thus it has the potential of avoiding local optima. At each step, the chain can follow m different directions, as opposed to the single direction of the deterministic EM algorithm. R such Markov chains are generated (each with a different seed) at the beginning, to get n ≤ R alignments at the end of the chains. Multiple Markov chains may produce identical alignments. Furthermore, two alignments may not be identical but can still be very similar to each other (owing to phase shift in the motif model [18]). For this reason, a clustering algorithm is used in MCEMDA, with the normalized longest common block as the distance metric between two alignments. Each cluster-center is considered as a potential motif.
[Figure: R Markov chains H_1, H_2, ..., H_R of states, each branching into m = 3 directions per step; the paths with maximum Q are clustered to yield the consensus motif.]
Figure 4.1: Visualization of MCEMDA as a Markov chain (number of MC simulations is 3).
4.2 Motivation and Methodology
Consider a family of Markov chains (Hi i ∈ {1, 2, ..., R}), shown in Fig. 4.1, corre-
sponding to the MCEMDA. The transitions and states are denoted by arrows and circles
respectively. The number (m) of the MC simulation in each iteration is taken as three. Dur-
ing first few iterations,Hi visits states which are unreliable in nature. These initialstages,
called the burn-in phase, are crucial forHi in terms of convergence. If it fails to track a
suitable path, it may never converge to the actual solution [18]. This situation is shown in
Fig. 4.2. The dynamics of the goodness measure,Qn, is shown for some of the Markov
chains with different initialization. The Markov chain converges to the true motif model if
Qn ≥ QT . Here, the true motif model refers to the hidden parameter,Θ, to be estimated,
or a neighborhood ofΘ that produces the same motif as doesΘ. Out ofR such chains,
only a few converges to the true motif model (or to its phase shifted version).
[Figure: Q (−170 to −110) versus iteration (0 to 400); successful chains rise above unsuccessful ones.]
Figure 4.2: Improvement of the Q-function during the first 400 iterations (number of MC simulations in each stage: m = 3).
As a result, a clustering algorithm is employed at the end of all Hi (i ∈ {1, 2, ..., R}). It would be better to stop some unreliable chains at the earlier stages. If the unpromising chains are stopped after the first few iterations, the computational cost reduces. Additionally, the requirement of the clustering stage may be relaxed. This study focuses on an algorithm which stops some unpromising Markov chains at the beginning.
4.2.1 Simplified Q Function
The negative entropy-like goodness measure of a motif model is shown in (4.1). Given the motif length, w ≪ Li, the distribution of background residues is almost constant for different motif positions. As a result, the first term in (4.1) may be omitted and the modified measure is taken to be:

Q = |A| ∑_{j=1..w} ∑_{k∈Σ} θ_jk ln θ_jk    (4.7)

This modification reduces the computation in each iteration while maintaining the true purpose of the measure.
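As a concrete illustration, the simplified measure (4.7) can be sketched as below. The array layout (a |Σ| × w column-stochastic matrix Θ and the alignment size |A| as an integer) is an assumption of this sketch, not taken from the thesis implementation.

```python
import numpy as np

def simplified_q(theta, n_aligned):
    """Simplified goodness measure (4.7): Q = |A| * sum_j sum_k theta_jk * ln(theta_jk).

    theta     : |Sigma| x w motif model; each column sums to one
    n_aligned : |A|, the number of aligned motif instances
    """
    eps = 1e-12  # guard so zero entries contribute ~0 rather than -inf
    return n_aligned * float(np.sum(theta * np.log(theta + eps)))
```

Q is non-positive and moves toward zero as the columns of Θ concentrate on a single residue, which is what makes it usable as a convergence score.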
4.2.2 Selection of Promising Markov Chains
The determination of a suitable threshold, QT, is essential to identify and stop the unpromising Markov chains. However, finding such a hard threshold on Q proves to be a cumbersome task. To overcome this limitation, the proposed method employs a greedy scheme by initiating itrmax different Markov chains, Hi (i ∈ {1, 2, ..., itrmax}), with random seeds. After the first q iterations (called the look-ahead) in each Hi, the corresponding goodness measure, Q, is assumed to be Qi (i ∈ {1, 2, ..., itrmax}). Following this look-ahead stage, the algorithm proceeds with the best chain, Hm, where m = argmax_{i ∈ {1,2,...,itrmax}} Qi. The rest of the chains are discarded.
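The greedy look-ahead scheme can be sketched as follows. Here `em_step` and `q_measure` are hypothetical placeholders standing in for one Monte-Carlo EM iteration and the measure (4.7); they are assumptions of this sketch.

```python
def select_promising_chain(seed_models, em_step, q_measure, q_lookahead):
    """Greedy look-ahead of sub-section 4.2.2: run every randomly seeded
    chain for q iterations, keep the one with the largest Q, discard the rest."""
    best_model, best_q = None, float("-inf")
    for model in seed_models:
        for _ in range(q_lookahead):   # the q look-ahead (burn-in) iterations
            model = em_step(model)
        q = q_measure(model)
        if q > best_q:                 # retain only the most promising chain
            best_model, best_q = model, q
    return best_model, best_q
```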
4.2.3 Goodness Measure: ψ
In order to validate the effectiveness of the proposed method, it is assumed that the true motif start locations are available. Given these locations, the effectiveness can be measured from Aik. Let γ = [γ1 γ2 ... γN]^T be the known true motif start position vector. In the final iteration, the distribution of the index variable ai is assumed to be u (refer to (4.5)). Let

γ′_i = argmax_{l ∈ {1,2,...,Li−w+1}} u(ai = l | Si, Θ)

be the most probable motif start location of the ith sequence (i ∈ {1, 2, ..., N}). The effectiveness of the proposed algorithm can be evaluated using γ and γ′ = [γ′_1 γ′_2 ... γ′_N]^T. A score, Ψ = (1/N) ∑_{i=1..N} ψ_i, is used in this work, where

ψ_i = (w − |γ′_i − γ_i|)/w   if |γ′_i − γ_i| ≤ w,
ψ_i = 0                      otherwise.    (4.8)
This metric is similar to the nucleotide-level accuracy (nla) [18], which is defined as:

nla(γ′, γ) = (1/|N|) ∑_{i=1..|N|} (γ′_i ∩ γ_i)/w    (4.9)
where γ′_i ∩ γ_i represents the size of the overlapping block between the predicted and observed motifs.
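Both scores can be computed directly from the start-position vectors. The sketch below assumes, as in (4.8) and (4.9), that each motif is represented by its start index and that predicted and true vectors have equal length.

```python
def psi_score(pred_starts, true_starts, w):
    """Score Psi of (4.8): per-sequence credit (w - |shift|)/w, zero beyond shift w."""
    total = 0.0
    for gp, g in zip(pred_starts, true_starts):
        d = abs(gp - g)
        total += (w - d) / w if d <= w else 0.0
    return total / len(true_starts)

def nla_score(pred_starts, true_starts, w):
    """Nucleotide-level accuracy of (4.9): average overlap of predicted and true blocks."""
    total = 0.0
    for gp, g in zip(pred_starts, true_starts):
        total += max(0, w - abs(gp - g)) / w  # size of the overlapping block
    return total / len(true_starts)
```

For a phase shift of one position in a length-15 motif (the case of Table 4.1), both scores give (15 − 1)/15 ≈ 0.93.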
4.2.4 Overcoming the Limitation of the Phase Shift
The EM-based algorithms have a propensity to converge to a shifted version of the true motif model. As any shifted version of the motif model also possesses a high score compared to any random alignment, more often than not the EM-based algorithms converge to a shifted version of the true motif model. In a sense, the shifted versions of the true motif model form local minima in the motif alignment space, and EM-based algorithms have an inherent inability to escape these local minima. To illustrate this phenomenon, two relatively easy planted (l, d) motif finding problems are considered. The dynamics of the scoring function, Ψ, is shown in Fig. 4.3 for different (l, d) and numbers of MC simulations (m). It is observed that the algorithm frequently converges to the shifted version of the true motif model. Here, the true motif model is considered as the motif model which produces the correct motif without any mutation. The A-T-G-C distribution of the true motif model need not necessarily be the same as that in the planted motifs: as long as the probability of the correct nucleotide is higher than that of the rest of the nucleotides in a particular motif position, the model will produce the correct motif. In Fig. 4.3, the false motif model refers to a model which produces neither the correct motif nor its phase-shifted version.
As this phase-shift phenomenon leads to erroneous results, it would be better if some measure were taken to force the algorithm to converge to the true motif model. This is accomplished by shifting the motif model, Θ, to the left or to the right with a given probability pshift. If there is no improvement in the goodness measure, Q, after a few iterations, the algorithm returns to the original Θ. The scheme is illustrated in Fig. 4.4, where the proposed phase-shifting mechanism is marked by the discontinuous lines. A better but more complex option might be to make the shifting probability pshift dependent on Q, because shifting the current motif model is only helpful when the algorithm reaches the neighborhood of the true motif model. In this work, however, pshift is kept constant to avoid complexity. If the problem is difficult (i.e. the motifs are less conserved), then it is likely that all the EM-based methods will end up producing a wrong result. But if the algorithm manages to reach a nearby model, the proposed shifting mechanism will look around the shifted versions of the current motif model in the hope of finding a better alignment.
Figure 4.3: Convergence of the Monte-Carlo EM based algorithm to the shifted version of the true motif model. Each panel plots the score ψ against the iteration for the true motif model, the shifted true motif model, and the false motif model. (a) (15, 3) motif finding problem with m = 1. (b) (15, 3) motif finding problem with m = 3. (c) (12, 2) motif finding problem with m = 1. (d) (12, 2) motif finding problem with m = 3.
Figure 4.4: Flowchart to overcome the phase shift problem. After random initialization and the search for the best initial motif model, the current motif model Θ is kept with probability 1 − 2·pshift, or replaced by its left-shifted (ΘLeft) or right-shifted (ΘRight) version, each with probability pshift; if there is no improvement after a few iterations, the algorithm reverts to the previous model, and the final motif model is output.
4.2.5 An Example of Shifted Θ
Let the motif length w be five. The model parameter at the tth iteration is as follows:

Θ(t) = [ 0.1  0.1  0.6  0.2  0.1
         0.1  0.1  0.1  0.2  0.1
         0.2  0.7  0.1  0.2  0.1
         0.6  0.1  0.2  0.4  0.7 ]

The left-shifted model parameter at the tth iteration will be:

Θ(t)Left = [ 0.1  0.6  0.2  0.1  v^l_A
             0.1  0.1  0.2  0.1  v^l_T
             0.7  0.1  0.2  0.1  v^l_G
             0.1  0.2  0.4  0.7  v^l_C ]

Similarly, the right-shifted model parameter at the tth iteration will be:

Θ(t)Right = [ v^r_A  0.1  0.1  0.6  0.2
              v^r_T  0.1  0.1  0.1  0.2
              v^r_G  0.2  0.7  0.1  0.2
              v^r_C  0.6  0.1  0.2  0.4 ]
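The shift itself amounts to dropping one boundary column and appending a random column normalized to unit L1 norm. A minimal NumPy sketch follows; the |Σ| × w array layout is an assumption of this sketch.

```python
import numpy as np

def shift_model(theta, direction, rng=None):
    """Left- or right-shift a |Sigma| x w motif model (sub-section 4.2.5).

    The freed boundary column is padded with a non-negative random vector
    normalized so that its L1 norm equals one.
    """
    rng = rng or np.random.default_rng()
    pad = rng.random(theta.shape[0])
    pad /= pad.sum()                                  # ||v||_1 = 1
    if direction == "left":
        return np.column_stack([theta[:, 1:], pad])   # drop first column, pad last
    return np.column_stack([pad, theta[:, :-1]])      # drop last column, pad first
```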
Figure 4.5: Improvement in the (15, 3) motif finding problem (score Ψ over 500 iterations for Monte-Carlo EM and the proposed method).
The empty columns that arise due to this shifting are padded by a non-negative random vector v^x = [v^x_A v^x_T v^x_G v^x_C]^T, x ∈ {l, r}, such that ‖v^x‖_1 = ∑_{k∈Σ} |v^x_k| = 1. If the goodness measure, Q, is not improved due to this shifting at the (t + t′)th iteration, the algorithm returns to the pre-stored original motif model, obtained at the tth iteration. In Fig. 4.5, a (15, 3) motif finding problem with a background length of 300 is considered to demonstrate the effectiveness of the proposed shifting technique. It can be observed that the proposed methodology helps in avoiding the local minima that arise due to the high alignment score of the phase-shifted version of the true motif model. The proposed method converges to the true motif while Monte-Carlo EM converges to a phase-shifted version of the true motif. This is reflected in the final value of the scoring function, Ψ, as shown in Fig. 4.5. Here, the proposed method offers an exact match to the true motif with Ψ = 1.0, whereas the Monte-Carlo EM offers Ψ = 0.8 after 500 iterations. In TABLE-4.1, the true motif is shown in the first row, and the final motifs produced by the Monte-Carlo EM and the proposed method are shown for comparison. The proposed method converges to the true motif whereas the Monte-Carlo EM converges to a phase-shifted version (shift = 1) of the true motif of length 15. The values of ψc in (4.8) for the consensus sequences are 1 and (15 − 1)/15 = 0.93 for the proposed method and Monte-Carlo EM, respectively.
Table 4.1: Convergence to the true motif model due to shifting

                          Consensus sequence    ψc
True motif                CGCTAAATGAGCTAA
Monte-Carlo EM solution   -GCTAAATGAGCTAAT     0.93
Proposed method           CGCTAAATGAGCTAA      1
4.3 Implementation
Given a set of N sequences, S, the motif width, w, and the number of MC simulations, m, the proposed algorithm starts by randomly initializing an alignment A(0). The look-ahead is assumed to be tmax. The pseudocode of the proposed initialization scheme (described in sub-section 4.2.2) is given in Algorithm 1. It finds a promising motif model, Θ*BEST.
Algorithm 1 Proposed Initialization
 1: Initialize: itrmax, tmax, Q* ← −∞
 2: for itr = 1 to itrmax do
 3:   Initialize: S, A(0), t ← 0, Qmax ← −∞
 4:   Evaluate Θ(0) and Q(0) from A(0)
 5:   Θ*itr ← Θ(0); Qmax ← Q(0)
 6:   while t ≤ tmax do
 7:     t ← t + 1
 8:     for r = 1 to m do
 9:       for i = 1 to N do
10:         draw a sample according to (4.5)
11:       end for
12:       update Θ(t,r) and Q(t,r)
13:     end for
14:     evaluate Θ(t) and Q(t) using the m-best strategy
15:     if Q(t) ≥ Qmax then
16:       Qmax ← Q(t) and Θ*itr ← Θ(t)
17:     end if
18:   end while
19:   if Qmax ≥ Q* then
20:     Θ*BEST ← Θ*itr
21:     Q* ← Qmax
22:   end if
23: end for
24: Output the best initialized motif model Θ*BEST
Once the best initialization is identified, the Markov chain is started from the initialized motif model, Θ*BEST. At each step, when the motif model, Θ(t), is updated, the algorithm calls a function which shifts Θ(t) to the left or right with a given probability, pshift, as described in sub-section 4.2.5. This shifting is performed by the function ModelShift(), whose pseudocode is given in Algorithm 2.
The proposed method employs both the initialization and the shifting scheme. The corresponding pseudocode is given in Algorithm 3. The entire work-flow, along with the
Algorithm 2 Proposed Shifting: ModelShift(Θ(t), Q(t))
1: Input: motif model Θ(t)
2: Θ(t)BEST ← Θ(t)
3: With a given probability pshift: Θ(t) ← Θ(t)Left or Θ(t) ← Θ(t)Right
4: if Q(t+t′) ≤ Q(t) at the end of the (t + t′)th iteration then
5:   Θ(t) ← Θ(t)BEST
6: end if
7: Return: motif model Θ(t)
end-clustering, is shown in Fig. 4.6.
Figure 4.6: Simplified description of the proposed algorithm: proposed initialization, followed by MC EM with the proposed model shifting mechanism, then clustering, and finally the consensus motif.
4.3.1 Computational Leverage
In MCEMDA, let the total number of iterations performed for a Markov chain, Hi, i ∈ {1, 2, ..., R}, be L. If a single iteration of the MC Expectation Maximization algorithm is considered as a unit computation, then the total computational cost required, before employing any clustering algorithm, is RL. On the other hand, in the proposed method, if a q-step (q < L) look-ahead and a total of R′ randomly initialized Markov chains are employed, then the total computational cost is R′q + (R′/10)(L − q). The parameters used in MCEMDA are R = 1000 and L = 100. If R′ and q are chosen as 1000 and 15 respectively, the proposed method reduces the computational cost approximately by a factor of four.
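The factor-of-four figure can be checked directly. The sketch below assumes, as the cost expression implies, that roughly the best tenth of the R′ chains are continued past the look-ahead.

```python
# MCEMDA: R chains, each run for all L iterations
R, L = 1000, 100
mcemda_cost = R * L                                      # 100000 unit computations

# Proposed: R' chains for a q-step look-ahead, then the retained
# fraction (about R'/10) continued for the remaining L - q iterations
R_prime, q = 1000, 15
proposed_cost = R_prime * q + (R_prime // 10) * (L - q)  # 15000 + 8500 = 23500

speedup = mcemda_cost / proposed_cost                    # about 4.3
```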
4.4 Results and Analysis
4.4.1 Dataset
To validate the efficiency of the proposed algorithm, two types of datasets are used. The first one is a synthetic dataset having different instances of randomly generated (l, d) motifs. A
Algorithm 3 Proposed Algorithm of the Entire Work-flow
 1: Input: initialized motif model Θ*BEST; Output: final motif model Θ*
 2: Initialize: tmax, t ← 0
 3: Θ(0) ← Θ*BEST; Qmax ← −∞
 4: while t ≤ tmax do
 5:   t ← t + 1
 6:   for r = 1 to m do
 7:     for i = 1 to N do
 8:       draw a sample according to (4.5)
 9:     end for
10:     update Θ(t,r) and Q(t,r)
11:     Θ(t,r) ← ModelShift(Θ(t,r), Q(t,r))
12:   end for
13:   evaluate Θ(t) and Q(t) using the m-best strategy
14:   if Q(t) ≥ Qmax then
15:     Qmax ← Q(t) and Θ* ← Θ(t)
16:   end if
17: end while
18: Output the final motif model Θ*
second dataset is prepared by taking motif instances from the JASPAR website [51].
• Synthetic Dataset: The standard procedure to test the effectiveness of any motif finding algorithm is to use a synthetic dataset containing (l, d) motif instances and then apply the algorithm to the dataset to verify its efficiency. Although there are some fundamental differences between a synthetic dataset and a true biological dataset, this type of testing platform provides a reasonably good measure of the effectiveness of motif finding algorithms. Each motif instance of the dataset used in this work is generated by taking a random l-length genomic sequence and mutating at most d positions randomly. This motif instance is implanted at a random location of a random background sequence with a fixed GC fraction. Different combinations of l, d, and background length are used to vary the difficulty level of the motif finding problem.
• JASPAR Dataset: This dataset is prepared by taking experimentally verified biological motif instances (transcription factors of eukaryotes) from the JASPAR website [51]. These instances are then implanted in random background sequences at random locations. These backgrounds act as promoter regions. The GC fraction of the promoter regions and the background length are taken as 0.45 and 500 respectively.
4.4.2 Comparison with other EM-based Algorithms
There are multiple methods available for finding motifs, e.g. MEME [17], MCEMDA [18], Random Projection [16], Weeder [19], MUSCLE [20], ClustalW [21], [22], etc. Out of these, MCEMDA, an excellent algorithm in itself, outperforms the other algorithms in most cases [18]. MCEMDA and a simple EM-based method are selected for comparison. For difficult motif finding problems, the probability of EM-based algorithms converging to an erroneous motif model is high. For this reason, the EM-based algorithms are conventionally applied to the dataset multiple times with different seeds and, at the end, a clustering algorithm is applied to identify the largest cluster in motif model space. The intuition is that the spurious motif models will remain scattered in motif model space while the optimal and near-optimal solutions will form a larger cluster. This clustering technique improves the probability of detection of the true motif model. As clustering is a form of hard quantization, the proposed method is compared to the other two EM-based methods without this clustering stage. Since all the algorithms are EM-based, the calculation of the average score after a fixed number of iterations suits the purpose more appropriately. It may be noted that the use of an identical clustering technique at the final stage can equally improve the performance of every EM-based algorithm, as discussed in sub-section 4.4.5.
4.4.2.1 Results on Synthetic Dataset
The synthetic dataset with various combinations of planted motifs is used to validate the effectiveness of the proposed method. The average scores of each algorithm over 100 independent trials are given in TABLE-4.2. It may be noted that the scores can be improved by employing a suitable clustering technique. The conventional EM used for comparison is a simplistic one; the score might be better in the case of a sophisticated EM-based algorithm, such as MEME, where additional statistical information (such as word statistics) is incorporated [17].
4.4.2.2 Results on JASPAR Dataset
The results on the JASPAR dataset are shown in TABLE-4.3 and TABLE-4.4. The dataset is divided into two groups: TABLE-4.3 shows the results from sequence groups with N ≥ 25, and TABLE-4.4 shows the results for smaller (N < 25) groups. In both cases, the Monte-Carlo method outperforms the conventional EM. This is due to the advantage of the Monte-Carlo method in avoiding local minima. The proposed initialization and model
Table 4.2: Performance on Synthetic Dataset

(l,d)    Background   Av. Score Ψ in 100 runs
motif    Length       Conventional EM   MC EM   Proposed Method
(10,2)   200          0.26              0.35    0.37
(10,2)   300          0.11              0.18    0.28
(10,2)   500          0.02              0.09    0.14
(11,2)   300          0.35              0.58    0.65
(11,2)   400          0.30              0.49    0.53
(11,2)   500          0.15              0.39    0.38
(12,3)   300          0.09              0.19    0.23
(13,3)   300          0.12              0.31    0.39
(13,3)   500          0.07              0.28    0.28
(14,4)   300          0.18              0.36    0.32
(14,3)   500          0.21              0.38    0.41
(15,4)   300          0.31              0.41    0.49
(15,4)   500          0.25              0.33    0.41
(16,5)   300          0.10              0.18    0.23
(16,5)   500          0.07              0.16    0.20
(17,5)   500          0.20              0.30    0.38
(18,5)   500          0.15              0.17    0.22
(19,6)   500          0.15              0.27    0.31
(20,6)   500          0.19              0.36    0.43
shifting mechanism, when applied together with Monte-Carlo Expectation Maximization, shows another level of improvement in most cases.
It may be noted that the proposed method performs better as the motif length increases. This phenomenon may be attributed to the fact that for smaller motifs, the deviation of the current motif model due to model shifting is relatively larger than for a longer motif model. It may be inferred that the advantage of the model-shifting mechanism is minimal in the case of small motif lengths.
4.4.3 A Study on the Proposed Initialization Scheme
In order to further evaluate the effectiveness of the proposed algorithm, a study is performed on the proposed initialization scheme. With an increase in the number of initial candidate chains used to find a promising Markov chain (itrmax in Algorithm-1), the performance of the proposed method improves. This improvement comes at the additional computational cost spent in the proposed initialization scheme. In TABLE-4.3 and 4.4, the results are shown
Table 4.3: Performance on JASPAR Dataset: Large Group

                                   Motif    Av. Score Ψ in 100 runs
Large Group                        Length   Conventional EM   Monte-Carlo EM   Proposed Method
MA0094                             4        0.02              0.06             0.05
MA0036 MA0075                      5        0.09              0.12             0.12
MA0037 MA0080 MA0086 MA0089
MA0098 MA0103 MA0104               6        0.14              0.24             0.21
MA0081 MA0093 MA0096               7        0.36              0.46             0.47
MA0067                             8        0.10              0.07             0.10
MA0002 MA0054 MA0077
MA0078 MA0084 MA0118               9        0.12              0.26             0.29
MA0001 MA0015 MA0028 MA0038
MA0052 MA0061 MA0092               10       0.11              0.29             0.36
MA0005 MA0009                      11       0.16              0.31             0.35
MA0019 MA0041 MA0048
MA0083 MA0091 MA0097               12       0.21              0.36             0.39
MA0114                             13       0.29              0.42             0.49
MA0029 MA0069 MA0072 MA0082        14       0.30              0.52             0.54
MA0116                             15       0.35              0.51             0.59
MA0060                             16       0.42              0.48             0.53
MA0065 MA0066                      20       0.36              0.54             0.60

Note: Motifs of identical length are grouped together.
taking the value of itrmax = 25. This can be further improved by increasing its value. A study is performed for the (15, 4) motif model to demonstrate the usefulness of the proposed initialization scheme. The evolution of the goodness measure, ψ, is shown (continuous line) in Fig. 4.7. The value of ψ is measured by computing the average score for 100 different instances of the (15, 4) motif finding problem with background length L = 500. The value of ψ for MC EM is also shown with the discontinuous line. It can be observed from the figure that the performance of the proposed method improves at the expense of computational cost.
Table 4.4: Performance on JASPAR Dataset: Small Group

                                   Motif    Av. Score Ψ in 100 runs
Small Group                        Length   Conventional EM   Monte-Carlo EM   Proposed Method
MA0053 MA0064                      5        0.08              0.06             0.01
MA0004 MA0006 MA0020
MA0021 MA0056 MA0095               6        0.12              0.18             0.13
MA0026 MA0063 MA0087               7        0.11              0.18             0.17
MA0008 MA0011 MA0024
MA0031 MA0117 MA0121               8        0.16              0.21             0.24
MA0044 MA0076 MA0122               9        0.10              0.25             0.31
MA0023 MA0034 MA0049 MA0057        10       0.14              0.29             0.31
MA0058 MA0062 MA0071
MA0079 MA0101 MA0107
MA0012 MA0013 MA0025
MA0027 MA0040 MA0059
MA0105 MA0111                      11       0.14              0.26             0.26
MA0018 MA0022 MA0043 MA0047
MA0070 MA0102 MA0120               12       0.12              0.28             0.35
MA0010 MA0017
MA0046 MA0119                      14       0.36              0.44             0.51
MA0074                             15       0.50              0.57             0.60
MA0045 MA0085                      16       0.17              0.30             0.31
MA0115                             17       0.51              0.60             0.67
MA0051 MA0112 MA0113               18       0.19              0.22             0.26
MA0014 MA0073
MA0088 MA0106                      20       0.12              0.24             0.32
MA0007                             22       0.29              0.46             0.51

Note: Motifs of identical length are grouped together.
4.4.4 A Study on the Stand-alone Model Shifting Mechanism
In order to identify the improvement due to the stand-alone shifting mechanism, a study is performed on a few (l, d) motif finding problems excluding the proposed initialization scheme. The average scores over 100 runs are computed for MC EM and the proposed method. The results are presented in TABLE-4.5. In the fourth column, the score due to the stand-alone shifting mechanism (i.e. without the proposed initialization step) is provided. The overall scores, considering both the initialization and shifting, are also provided inside
Figure 4.7: Evolution of the goodness measure, ψ, with respect to the number of candidate chains used for finding a promising Markov chain (itrmax).
the parentheses.
Table 4.5: Comparison of the Stand-alone Shifting Scheme

(l,d)    Background   Average Score Ψ in 100 runs
motif    Length       MC EM   Proposed Model Shifting
(10,2)   200          0.35    0.39 (0.37)
(11,2)   400          0.49    0.47 (0.53)
(12,3)   300          0.19    0.24 (0.23)
(13,3)   300          0.31    0.35 (0.39)
(14,4)   300          0.36    0.29 (0.32)
(15,4)   500          0.33    0.40 (0.41)
(16,5)   500          0.16    0.19 (0.20)
(17,5)   500          0.30    0.34 (0.38)
(18,5)   500          0.37    0.40 (0.42)

Note: Results are based on the synthetic dataset.
4.4.5 The Effect of End-Clustering
As discussed in section 4.3.1, careful selection of the parameters associated with the proposed algorithm can make it more cost-effective than the conventional EM or MC EM method. Due to the initial screening process and the model shifting mechanism, a single Markov chain of the proposed method is computationally more expensive than that of the conventional EM or MC EM. It should be noted that the Markov chain selected by the proposed method is more likely to converge to the true motif model, as observed in TABLE-4.2, 4.3 and 4.4. Therefore, the number of chains required to identify the motif using a clustering technique is smaller than that of the conventional EM or MC EM. To establish this claim, a study is performed on different instances of (l, d) motifs. As the output after the clustering
scheme is the motif sequence (not the motif start positions), the nucleotide-level accuracy (nla) is considered as the goodness measure. The values of itrmax and tmax in Algorithm-1 are taken to be 20 and 15 respectively. For the conventional EM and MC EM, 100 random seeds are considered. The final motif consensus is obtained via k-means clustering (with varying k) at the final stage. It is observed that for difficult problems, only a few seeds produce an output motif model in the neighborhood of the true motif. Therefore, the cluster corresponding to the desired solution is small. In such cases, the value of k should be kept as high as 10–15. For relatively easy problems, k = 5 is sufficient for obtaining a satisfactory result. In TABLE-4.6, the value of k is taken as five for both the conventional EM and the MC EM method. In the proposed method, the number of seeds is taken as 20 and the number of clusters as three. If the motif finding problem is easy (say, a (10, 1) problem), then the EM-based techniques may produce correct results in almost all cases, resulting in one or more empty cluster(s). In such cases, a simple majority consensus may be used instead of clustering.
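The simple majority consensus mentioned above, applied column-wise to candidate consensus strings of equal length, can be sketched as:

```python
from collections import Counter

def majority_consensus(candidates):
    """Column-wise majority vote over candidate motif strings of equal length."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*candidates))
```

For example, `majority_consensus(["ACGT", "ACGA", "ACTT"])` returns `"ACGT"`, the per-column most frequent nucleotide.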
Table 4.6: Performance Improvement on Synthetic Dataset due to Clustering

(l,d)    Background   Conventional EM        MC EM                  Proposed Method
motif    Length       nla      Av. time      nla      Av. time      nla      Av. time
                      score†   (minute)      score†   (minute)      score†   (minute)
10,2     300          0.41     8.45          0.52     8.97          0.49     7.42
11,2     300          0.48     8.83          0.67     9.23          0.71     7.43
12,3     300          0.28     9.03          0.39     9.62          0.39     7.46
13,3     300          0.50     9.15          0.64     9.70          0.67     7.49
14,4     300          0.13     9.22          0.29     9.83          0.36     7.57
15,4     300          0.64     9.38          0.60     10.07         0.52     7.94
16,5     500          0.28     16.00         0.29     16.92         0.35     13.01
17,5     500          0.44     16.35         0.53     17.10         0.50     13.27
18,5     500          0.28     16.55         0.27     17.22         0.36     13.54
19,6     500          0.36     16.98         0.42     17.35         0.51     13.83

† nla score is computed by performing clustering 10 times.
Note: The computation is performed on a PC with an Intel 3.0 GHz i5 processor and 4 GB memory.
4.4.6 Computation Time of Individual Markov Chain
Due to the proposed seed initialization and model shifting techniques, a single Markov chain of the proposed method is computationally more expensive than one produced by the conventional EM or MC EM, as discussed in sub-section 4.4.5. In TABLE-4.7, the computation time of a single Markov chain is shown for the three EM-based algorithms and different motif lengths. The values of itrmax and tmax in Algorithm-1 are taken as 20 and 15 respectively. In the right-most column, the values within parentheses indicate the computation time without the proposed initialization scheme.
Table 4.7: Computation Time Comparison of a Single Chain

Motif    Background   Chain    Time for a single chain (in seconds)
Length   Length       Length   EM       MC EM    Proposed Method
10       300          100      5.07     5.38     22.25 (5.41)
10       300          500      25.01    26.87    43.85 (27.98)
13       300          100      5.49     5.82     22.48 (5.96)
13       300          500      27.50    29.19    46.64 (31.90)
15       300          100      5.63     6.04     23.81 (6.18)
15       300          500      28.20    30.30    48.48 (31.91)
17       500          100      9.81     10.26    39.80 (10.41)
17       500          500      49.08    51.34    82.23 (53.10)
19       500          100      10.19    10.41    41.49 (10.59)
19       500          500      50.92    52.12    83.42 (54.89)
4.5 Contribution of this Chapter
An improved version of the Monte-Carlo EM method for motif finding in genomic sequences is proposed. The method identifies and terminates unpromising Markov chains to improve the performance. A model shifting method is proposed to avoid the local minima that arise due to the high alignment score of the shifted versions of a true motif model. These two modifications, when incorporated, improve the performance of EM-based motif finding techniques, and they can be incorporated into any EM-based motif finding algorithm. The effectiveness of the proposed algorithm is validated using both a synthetic dataset and a biological dataset containing experimentally verified motifs.
Chapter 5
Conclusion
The thesis proposes a novel framework for the classification of GPCR sequences based on family-specific conserved triplets or motifs. Two kernels are designed to classify GPCRs using available structural information. An improved accuracy is achieved in both the GPCR family and the GPCR Class-A subfamily classification problems by using conventional kernel classifiers. A comparison with existing methods shows that these kernels can improve the classification accuracy. A few triplets or motifs relevant to ligand binding processes, and their locations, are identified. This class of methods can also classify other types of protein sequences with a-priori structural information. As a future extension of this work, the proposed string kernel can be made more informative by incorporating information about the exact position of motifs (triplets). Different amino acid grouping schemes can be used based on their ligand-binding properties, as opposed to their physico-chemical properties.
In the case of DNA sequences, two EM-based methods are proposed to find motifs. The first method performs EM through a projected motif model to identify motifs in genomic sequences. Given a reasonably good initial starting point, this deterministic EM-based algorithm converges quickly to a local optimum. Due to the loss of information in the random-projection process, this method shows less accuracy compared to the conventional EM method. To overcome this problem, an improved version of the Monte-Carlo EM method is proposed for motif finding in genomic sequences. This method identifies and terminates unpromising Markov chains to improve the overall performance. A model shifting method is proposed to avoid the local optima that arise due to the high alignment score of the shifted versions of a true motif model. These two modifications, when incorporated, improve the performance of EM-based motif finding techniques, and they can be incorporated into any EM-based motif finding algorithm. The effectiveness of the
proposed algorithm is validated on both synthetic and biological datasets containing experimentally verified motifs. The proposed algorithm may be used to identify transcription factor binding sites in promoter regions of DNA sequences. The algorithm is formulated for DNA sequences; however, the motif finding problem is also important in the context of protein sequences. For example, the helix-turn-helix (HTH) motif is a common pattern used by transcription regulators of prokaryotes and eukaryotes. Any EM-based motif identification algorithm can be used to identify HTH motifs in protein sequences, and the proposed initialization technique may be employed in this case. The proposed model shifting scheme, however, needs to be modified to suit an alphabet size as large as 20, because shifting a column of the model parameter, Θ, might move the point of interest far away from the true motif model, resulting in a very slow rate of convergence. As a future extension of this work, a similar algorithm may be developed to deal with motifs in protein sequences.
REFERENCES
[1] A. R. Kornblihtt, M. De La Mata, J. P. Fededa, M. J. Munoz, and G. Nogues, “Multi-
ple links between transcription and splicing,”RNA, vol. 10, pp. 1489–98, 2004.
[2] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F.Neuwald, and J. C.
Wootton, “Detecting subtle sequence signals: A gibbs sampling strategy for multiple
alignment,”Science,, vol. 262, pp. 208–214, 1993.
[3] A. F. Neuwald, J. S. Liu, and C. E. Lawrence, “Gibbs motif sampling: Detection of
bacterial outer membrane protein repeats,”Protein Science, vol. 4, pp. 1618–1632,
1995.
[4] M. Bhasin and G. P. Raghava, “GPCRpred: An SVM-based method for prediction
of families and subfamilies of G-protein coupled receptors,” Nucleic Acids Research,
vol. 32, pp. W383–W389, July 2004.
[5] V. N. Vapnik, Statistical Learning Theory, Springer, 1998.
[6] P. K. Papasaikas, Z. I. Litou P. G. Bagos, V. J. Promponas, and S. J. Hamodrakas,
“PRED-GPCR: GPCR recognition and family classification server,” Nucleic Acids
Research, vol. 32, pp. W380–W382, 2004.
[7] Y. Yabuki, T. Muramatsu, T. Hirokawa, H. Mukai, and M. Suwa, “Griffin: A system
for predicting GPCR-G-protein coupling selectivity using a support vector machine
and a hidden markov model,”Nucleic Acids Research, vol. 33 (suppl. 2), pp. W148–
W153, July 2005.
[8] Zhen Ling Peng, Jian Yi Yang, and Xin Chen, “An improved classification of G-
protein-coupled receptors using sequence-derived features,” BMC Bioinformatics,
vol. 11, pp. 420, 2010.
64 REFERENCES
[9] M. Hayat and A. Khan, “Predicting membrane protein typesby fusing composite pro-
tein sequence features into pseudo amino acid composition,” Journal of Theoretical
Biology, vol. 271, pp. 10–17, 2011.
[10] M. C. Cobanoglu, Y. Saygin, and U. Sezerman, “Classification of GPCRs using
family specific motifs,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 8, pp. 1495–
1508, 2011.
[11] Y. Z. Guo, M. Li, M. Lu, Z. Wen, K. Wang, G. Li, and J. Wu, “Classifying G protein-coupled receptors and nuclear receptors based on protein power spectrum from fast Fourier transform,” Amino Acids, vol. 30, pp. 397–402, 2006.
[12] M. Naveed and A. U. Khan, “GPCR-MPredictor: Multi-level prediction of G protein-coupled receptors using genetic ensemble,” Amino Acids, vol. 42, pp. 1809–1823, 2012.
[13] H. C. Peng, F. Long, and C. Ding, “Feature selection based on mutual information:
criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[14] Z. Li, X. Zhou, Z. Dai, and X. Zou, “Classification of G-protein coupled receptors based on support vector machine with maximum relevance minimum redundancy and genetic algorithm,” BMC Bioinformatics, vol. 11, p. 325, 2010.
[15] P. A. Pevzner and S. H. Sze, “Combinatorial approaches to finding subtle signals in DNA sequences,” Proc. Eighth Int'l Conf. Intelligent Systems for Molecular Biology, pp. 269–278, 2000.
[16] J. Buhler and M. Tompa, “Finding motifs using random projections,” J. Computational Biology, vol. 9, no. 2, pp. 225–242, 2002.
[17] T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Machine Learning, vol. 21, no. 1-2, pp. 51–80, 1995.
[18] C. Bi, “A Monte Carlo EM algorithm for de novo motif discovery in biomolecular sequences,” IEEE/ACM Trans. on Computational Biology and Bioinformatics, vol. 6, no. 3, pp. 370–386, 2009.
[19] G. Pavesi, G. Mauri, and G. Pesole, “An algorithm for finding signals of unknown length in DNA sequences,” Bioinformatics, vol. 17, suppl. 1, pp. S207–S214, 2001.
[20] R. C. Edgar, “MUSCLE: Multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Research, vol. 32, no. 5, pp. 1792–1797, 2004.
[21] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, vol. 22, pp. 4673–4680, 1994.
[22] X. Liu, D. L. Brutlag, and J. S. Liu, “BioProspector: Discovering conserved DNA
motifs in upstream regulatory regions of co-expressed genes,” Proc. Sixth Pacific
Symp. Biocomputing (PSB 01), vol. 6, pp. 127–138, 2001.
[23] C. E. Lawrence and A. A. Reilly, “An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences,” Proteins: Struct., Funct., Genet., vol. 7, no. 1, pp. 41–51, 1990.
[24] B. Raphael, L.-T. Liu, and G. Varghese, “A uniform projection method for motif discovery in DNA sequences,” IEEE/ACM Trans. on Computational Biology and Bioinformatics, vol. 1, no. 2, pp. 91–94, 2004.
[25] G. C. G. Wei and M. A. Tanner, “A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms,” J. Am. Statistical Assoc., vol. 85, no. 411, pp. 699–704, 1990.
[26] J. Bockaert and J. P. Pin, “Molecular tinkering of G protein-coupled receptors: An evolutionary success,” EMBO J., vol. 18, pp. 1723–1729, 1999.
[27] A. Marchese, S. R. George, L. F. Kolakowski, K. R. Lynch, and B. F. O'Dowd, “Novel GPCRs and their endogenous ligands: Expanding the boundaries of physiology and pharmacology,” Trends Pharmacol. Sci., vol. 20, pp. 370–375, 1999.
[28] D. W. Elrod and K. C. Chou, “A study on the correlation of G-protein-coupled receptor types with amino acid composition,” Protein Eng., vol. 15, pp. 713–715, 2002.
[29] K. C. Chou and D. W. Elrod, “Bioinformatical analysis of G-protein-coupled receptors,” J. Proteome Res., vol. 1, pp. 429–433, 2002.
[30] R. J. Lefkowitz, “The superfamily of heptahelical receptors,” Nat. Cell Biol., vol. 2, pp. E133–E136, 2000.
[31] B. Qian, O. S. Soyer, R. R. Neubig, and R. A. Goldstein, “Depicting a protein’s two
faces: GPCR classification by phylogenetic tree-based HMMs,” FEBS Lett., vol. 554,
pp. 95–99, 2003.
[32] R. Karchin, K. Karplus, and D. Haussler, “Classifying G-protein coupled receptors with support vector machines,” Bioinformatics, vol. 18, pp. 147–159, 2002.
[33] K. C. Chou, “Coupling interaction between thromboxane A2 receptor and alpha-13 subunit of guanine nucleotide-binding proteins,” J. Proteome Res., vol. 4, pp. 1681–1686, 2005.
[34] D. Filmore, “It's a GPCR world,” Modern Drug Discovery, vol. 7, no. 11, pp. 24–28, 2004.
[35] R. J. Bryson-Richardson, D. W. Logan, P. D. Currie, and I. J. Jackson, “Large-scale analysis of gene structure in rhodopsin-like GPCRs: Evidence for widespread loss of an ancient intron,” Gene, vol. 338, pp. 15–23, 2004.
[36] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge Univ. Press, Cambridge, U.K., 2000.
[37] U. Kreßel, “Pairwise classification and support vector machines,” in Advances in Kernel Methods: Support Vector Learning, B. Scholkopf, C. J. C. Burges, and A. J. Smola, Eds., Cambridge, MA: MIT Press, 1999, pp. 255–268.
[38] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, “Large margin DAGs for multi-class classification,” in Advances in Neural Information Processing Systems, vol. 12, pp. 547–553, 2000.
[39] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multi-class support vector machines,” IEEE Trans. Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.
[40] K. Crammer and Y. Singer, “Ultraconservative online algorithms for multiclass prob-
lems,” The Journal of Machine Learning Research, vol. 3, no. 3, pp. 951–991, 2003.
[41] S. Ghorai, A. Mukherjee, and P. K. Dutta, “Discriminant analysis for fast multiclass data classification through regularized kernel function approximation,” IEEE Trans. Neural Networks, vol. 21, no. 6, pp. 1020–1029, June 2010.
[42] S. Ghorai, A. Mukherjee, and P. K. Dutta, Advances in Proximal Kernel Classifiers, LAP LAMBERT Academic Publishing, Germany, Nov. 2012.
[43] C. Leslie, E. Eskin, and W. S. Noble, “The spectrum kernel: A string kernel for SVM protein classification,” Proceedings of the Pacific Symposium on Biocomputing, pp. 564–575, 2002.
[44] M. N. Davies, A. Secker, A. A. Freitas, E. Clark, J. Timmis, and D. R. Flower, “Optimizing amino acid groupings for GPCR classification,” Bioinformatics, vol. 24, no. 18, pp. 1980–1986, Sept. 2008.
[45] L. R. Murphy, A. Wallqvist, and R. M. Levy, “Simplified amino acid alphabets for protein fold recognition and implications for folding,” Protein Engineering, vol. 13, no. 3, pp. 149–152, 2000.
[46] B. Vroling, M. Sanders, C. Baakman, A. Borrmann, S. Verhoeven, J. Klomp, L. Oliveira, J. de Vlieg, and G. Vriend, “GPCRDB: Information system for G protein-coupled receptors,” Nucleic Acids Res., Nov. 2010.
[47] UniProt Consortium, “Reorganizing the protein space at the universal protein resource (UniProt),” Nucleic Acids Res., vol. 40, pp. D71–D75, 2012.
[48] A. Krogh, B. Larsson, G. von Heijne, and E. L. L. Sonnhammer, “Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes,” Journal of Molecular Biology, vol. 305, no. 3, pp. 567–580, January 2001.
[49] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[50] L. Wang and T. Jiang, “On the complexity of multiple sequence alignment,” J.
Computational Biology, vol. 1, pp. 337–348, 1994.
[51] A. Sandelin, W. Alkema, P. Engstrom, W. W. Wasserman, and B. Lenhard, “JASPAR: An open-access database for eukaryotic transcription factor binding profiles,” Nucleic Acids Research, vol. 32, pp. D91–D94, 2004.
AUTHOR’S PUBLICATIONS
[1] Aniruddha Maiti, Santanu Ghorai, and Anirban Mukherjee, “A Multi-Fold String Kernel for Fixed Topology Sequence Classification,” *****.
[2] Aniruddha Maiti and Anirban Mukherjee, “Expectation Maximization in Random Projected Spaces to Find Motifs in Genomic Sequences,” in International Conference on Electronics, Communication and Instrumentation 2014, Kolkata, 2014.
[3] Aniruddha Maiti and Anirban Mukherjee, “On the Monte-Carlo Expectation Maximization for Finding Motifs in DNA Sequences,” IEEE Journal of Biomedical and Health Informatics, 2014.
BIO-DATA
Aniruddha Maiti received the B.E. degree in electronics and telecommunication engineering from the Bengal Engineering and Science University, Shibpur, India, in 2010. He is currently pursuing the M.S. degree in the Department of Electrical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India. His principal research interests are machine learning and computational biology.