
Mining Motifs in Omics Networks

Aniruddha Maiti

Finding motifs in Omics Sequences

Thesis submitted in Partial Fulfillment

of the Requirements for the Award of the Degree of

Master of Science (by Research)

by

Aniruddha Maiti (Roll No.: 11EE72P01)

under the supervision of

Dr. Anirban Mukherjee and Dr. Niloy Ganguly

Department of Electrical Engineering

Indian Institute of Technology, Kharagpur

Kharagpur - 721 302, INDIA

August, 2014

© 2014 Aniruddha Maiti. All rights reserved.

"You can't cross the sea merely by standing and staring at the water."

- Rabindranath Tagore

~ Dedicated to my Parents ~

Declaration

I certify that

a. the work contained in this thesis is original and has been done by me under the

guidance of my supervisor.

b. the work has not been submitted to any other Institute for any degree or diploma.

c. I have followed the guidelines provided by the Institute in preparing the thesis.

d. I have conformed to the norms and guidelines given in the Ethical Code of Conduct of the Institute.

e. whenever I have used materials (data, theoretical analysis, figures, and text) from

other sources, I have given due credit to them by citing them in the text of the

thesis and giving their details in the references.

Aniruddha Maiti

Date:

Place:

Certificate

This is to certify that the thesis entitled "Finding motifs in Omics Sequences", submitted by Aniruddha Maiti to the Department of Electrical Engineering, Indian Institute of Technology, Kharagpur, India, in partial fulfillment for the award of the degree of Master of Science (by Research) in Electrical Engineering, is a bona fide record of work carried out by him under my supervision and guidance. The thesis has fulfilled all the requirements as per the regulations of the institute and, in my opinion, has reached the standard needed for submission.

Dr. Anirban Mukherjee

Department of Electrical Engineering

Indian Institute of Technology

Kharagpur - 721 302, India

Dr. Niloy Ganguly

Department of Computer Science and Engineering

Indian Institute of Technology

Kharagpur - 721 302, India

Acknowledgments

I would like to take this opportunity to express my sincere gratitude to the many individuals who have given me a lot of support during my tenure here at IIT Kharagpur. First and foremost, I express my deepest sense of gratitude to my supervisor and guide Dr. Anirban Mukherjee, whose expert guidance and support have made my research work fruitful here at IIT Kharagpur. I am thankful that I had the opportunity to work under an excellent human being like him. He never got tired of discussing my ideas and patiently proofread my publications. His support has not only enhanced the quality of my research work, it has also helped me to keep going through the difficult patches and made me a stronger person over the last two years. I am also grateful to my co-supervisor Dr. Niloy Ganguly for his constant encouragement and support.

Apart from my guides, I have learnt valuable technical aspects from Dr. Pabitra Mitra through attending his classes. I am fortunate to have Mr. Surajit Panja as a caring elder brother, whose selfless help and support have made my work environment as congenial as it could get.

I would like to thank the Council of Scientific and Industrial Research, Govt. of India, for sponsoring my research project.

This journey would not have been possible without the blessings and love of my parents. Their continuous support has provided me the necessary courage to reach the verge of completing my degree.

Aniruddha Maiti

Abstract

Given the availability of a large number of genomic and proteomic sequences, the motif finding problem has received intense attention in the field of computational biology over the last two decades. For DNA, short conserved patterns or motifs can represent transcription factor binding sites. For proteins, motifs may represent binding domains. For RNA, they may represent splice junctions. Thus, discovering short conserved substrings in biological sequences can lead to a better understanding of transcriptional regulation, mRNA splicing and formation, and classification of protein complexes.

In the motif finding problem, the objective is to locate short conserved substrings, or motifs, in a set of long strings. This thesis presents methodologies to find conserved locations in protein and DNA sequences. A method is proposed to classify unlabeled protein sequences using additional topological information besides primary structural information. G protein-coupled receptors (GPCRs) are selected in this work as they contain such additional topological information. Two kernels are designed to classify GPCR sequences using the available structural information. Improved accuracy is achieved in both the GPCR family and the GPCR Class-A subfamily classification problems by using kernel classifiers. The proposed framework can classify sequences with a fixed topology and identify family-specific conserved triplets.

For DNA sequences, two Expectation Maximization (EM)-based techniques are developed. The first is random projection based and the second is Monte Carlo (MC) simulation based. The effectiveness of the proposed algorithms is validated using both synthetic datasets and a biological dataset containing experimentally verified motifs.

Keywords: Kernel Function, Motif Finding, GPCR Classification, Expectation Maximization, Random Projection, Monte Carlo

Contents

List of Figures
List of Tables
List of Symbols and Abbreviations

1 Introduction
  1.1 Organization of the Thesis
  1.2 Related Work
    1.2.1 GPCR Classification and Motif Identification Techniques
    1.2.2 Motif Finding Problem in DNA Sequences

2 GPCR classification using family specific conserved triplets
  2.1 Introduction
  2.2 Kernel Classifiers and Spectrum Kernel
    2.2.1 Support Vector Machine
    2.2.2 VVRKFA Method of Classification
    2.2.3 Spectrum Kernel
  2.3 Methodology
    2.3.1 Problem Statement
    2.3.2 Feature Extraction
      2.3.2.1 Alphabet Reduction Scheme
    2.3.3 Proposed String Kernel
      2.3.3.1 Normalization
      2.3.3.2 An Example
    2.3.4 Extension to the 8-fold Kernel
    2.3.5 Feature Reduction Schemes
      2.3.5.1 Identification of Receptor-Ligand Interaction Sites
  2.4 Experiments
    2.4.1 Data Set
    2.4.2 Experimental Setup
    2.4.3 Experimental Results
      2.4.3.1 Performance to Predict the Families and Sub-families
      2.4.3.2 Class A Subfamily Detection
    2.4.4 Comparison with Some Related Work
    2.4.5 Effect of Variation of η
    2.4.6 Identified Binding-Site Triplets and Their Positions
    2.4.7 Effect of Selected Reduced Feature Set
  2.5 Contribution of this Chapter

3 Expectation Maximization in Random Projected Spaces to Find Motifs in DNA Sequences
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Local Alignment Problem for Motif Discovery
    3.2.2 Expectation Maximization (EM) Method for Motif Discovery
    3.2.3 Random Projection Method
  3.3 Motivation
  3.4 Proposed Method: EM in Randomly Projected Spaces
    3.4.1 An Example of Projected Motif Model
  3.5 Data Set
  3.6 Experimental Results and Analysis
    3.6.1 Results on Synthetic Data Set
    3.6.2 Results on JASPAR Data Set
  3.7 Conclusion and Contribution

4 Monte-Carlo Expectation Maximization for Finding Motifs in DNA Sequences
  4.1 Preliminaries
    4.1.1 Multiple Local Alignment for Motif Discovery
    4.1.2 Expectation Maximization in Motif Finding
    4.1.3 Monte Carlo EM Motif Discovery Algorithm
      4.1.3.1 MCEMDA as a Markov Chain
  4.2 Motivation and Methodology
    4.2.1 Simplified Q Function
    4.2.2 Selection of Promising Markov Chains
    4.2.3 Goodness Measure: ψ
    4.2.4 Overcoming the Limitation of the Phase Shift
    4.2.5 An Example of Shifted Θ
  4.3 Implementation
    4.3.1 Computational Leverage
  4.4 Results and Analysis
    4.4.1 Dataset
    4.4.2 Comparison with other EM-based Algorithms
      4.4.2.1 Results on Synthetic Dataset
      4.4.2.2 Results on JASPAR Dataset
    4.4.3 A Study on the Proposed Initialization Scheme
    4.4.4 A Study on the Stand-alone Model Shifting Mechanism
    4.4.5 The Effect of End-Clustering
    4.4.6 Computation Time of Individual Markov Chain
  4.5 Contribution of this Chapter

5 Conclusion

REFERENCES

AUTHOR'S PUBLICATIONS

List of Figures

2.1 Snake plot of a GPCR sequence.
2.2 Multiclass data classification through regression approach.
2.3 An Example of Feature-Map Formation.
2.4 Schematic Diagram of the Classification Process.
2.5 Comparison of Classification Performances among GPCRPred, GPCRBind and proposed String Kernels on data set-II.
2.6 Effect of variation of η on classification accuracy (%) in GPCR class-A subfamily prediction (based on data set-II).
2.7 Number of features versus classification accuracy (%) in GPCR class-A subfamily prediction (a 17-class problem).
3.1 Comparison of conventional EM and proposed method (error measure: average number of mismatches between true motif and identified motif).
3.2 Comparison of conventional EM and proposed method (error measure: average number of mismatches between identified motif start locations, via indicator variable A_ik, and true motif start locations).
3.3 Performance for different projection lengths in a (15, 4) problem (error measure: average number of mismatches between true motif and identified motif).
4.1 Visualization of the MCEMDA as a Markov chain (number of MC simulations is 3).
4.2 Improvement of the Q-function during the first 400 iterations (number of MC simulations in each stage: m = 3).
4.3 Convergence of the Monte Carlo EM based algorithm to the shifted version of the true motif model. (a) (15, 3) motif finding problem with m = 1. (b) (15, 3) motif finding problem with m = 3. (c) (12, 2) motif finding problem with m = 1. (d) (12, 2) motif finding problem with m = 3.
4.4 Flowchart to overcome the phase shift problem.
4.5 Improvement in the (15, 3) motif finding problem.
4.6 Simplified description of the proposed algorithm.
4.7 Evolution of goodness measure ψ with respect to the number of iterations for finding a promising Markov chain (itrmax).

List of Tables

2.1 Sezerman Alphabet Reduction Scheme
2.2 Start and end points of different segments
2.3 Data set-I: Human GPCR sequences
2.4 Data set-II: Class A subfamilies and TMHMM performance
2.5 Classification Accuracy in Data set-I
2.6 Classification Accuracy in Naveed's GPCR Data set
2.7 Classification Accuracy in Data set-II
2.8 Misclassification Table
2.9 Comparison of Classification Performances among GPCRPred, GPCRBind and proposed String Kernels
2.10 Identified motifs and their locations (data set-I: GPCR Family)
2.11 Subfamily-specific motifs and their locations (8-fold Kernel) (data set-II: Class-A subfamily)
3.1 Performance on JASPAR data set
4.1 Convergence to the true motif model due to shifting
4.2 Performance on Synthetic Dataset
4.3 Performance on JASPAR Dataset: Large Group
4.4 Performance on JASPAR Dataset: Small Group
4.5 Comparison of the Stand-alone Shifting Scheme
4.6 Performance Improvement on Synthetic Dataset due to Clustering
4.7 Computation Time Comparison of a Single Chain

List of Symbols and Abbreviations

Σ Character Alphabet

CV Cross Validation

DNA Deoxyribonucleic acid

ECL Exo-Cellular Loops

EM Expectation Maximization

GPCR G protein-coupled receptors

HMM Hidden Markov Model

ICL Intra-Cellular Loops

MC Monte-Carlo

MC EM Monte-Carlo Expectation Maximization

MCEMDA Monte Carlo EM Motif Discovery Algorithm

mRMR minimum Redundancy Maximum Relevance

mRNA Messenger Ribonucleic acid

oops One Occurrence Per Sequence

PCA Principal Component Analysis

PWM Position Weight Matrix

RBF Radial Basis Function

RNA Ribonucleic acid


SVM Support Vector Machines

VVRKFA Vector-Valued Regularized Kernel Function Approximation

Chapter 1

Introduction

Finding conserved locations or motifs in biological sequences is of paramount importance. Given the advancement in sequencing technologies, a large number of biological sequences, e.g. protein, DNA or RNA sequences, are now available in the public domain. For DNA, short conserved patterns or motifs can represent transcription factor binding sites. For RNA, they may represent splice junctions [1]. For proteins, motifs may represent binding domains. Thus, discovering short conserved substrings in biological sequences can lead to a better understanding of transcriptional regulation, mRNA splicing and formation, and classification of protein complexes [2] [3]. This thesis presents methodologies to find conserved locations in protein and DNA sequences. Based on the identified conserved locations in protein sequences, a method is proposed to classify unlabeled protein sequences. Some protein sequences possess additional topological information besides primary structural information, and in the proposed method the primary structural information is combined with this additional topological information. G protein-coupled receptors (GPCRs) are selected in this work as they contain additional topological information besides primary structural information. The proposed framework can classify sequences with a fixed topology and identify the family-specific conserved triplets. In the case of DNA sequences, Expectation Maximization (EM)-based techniques are explored; two EM-based techniques are proposed to identify motifs in DNA sequences.

The organization of the thesis is as follows:


1.1 Organization of the Thesis

• Chapter 1 discusses some of the related motif finding techniques for both protein and DNA sequences.

• Chapter 2 presents a framework for G protein-coupled receptor classification and conserved triplet identification with the help of available topological information.

• Chapter 3 presents a framework where the expectation maximization method is performed in randomly projected spaces to find motifs in DNA sequences.

• Chapter 4 presents a variant of the Monte Carlo Expectation Maximization method to find motifs in DNA sequences.

• Chapter 5 concludes the work described in this thesis and outlines possible future research directions.

1.2 Related Work

For protein sequences, G protein-coupled receptors (GPCRs) are selected as the target class. The proposed framework classifies GPCRs based on identified conserved triplets. Identifying conserved triplets is important in the context of GPCR classification because the more a triplet is conserved within a particular family, the better the chance it offers of accurate classification. This section discusses some related work on GPCR classification and conserved location identification techniques.

For DNA sequences, the proposed techniques are Expectation Maximization (EM)-based. Some related methods concerning motif finding in DNA sequences are also discussed in this section.

1.2.1 GPCR Classification and Motif Identification Techniques

There are a number of techniques employed by different researchers to classify GPCRs. For example, Raghava et al. proposed the GPCRpred method [4], which uses a combination of 20 Support Vector Machines (SVMs) [5] to classify GPCR sequences at different levels. In GPCRpred, the feature vectors are constructed using the dipeptide composition of each protein, where the entire sequence represents a point in a high-dimensional space. The Hidden Markov Model (HMM) is also used to classify GPCR sequences; for example, the PRED-GPCR server uses 265 signature profile HMMs [6]. In the GRIFFIN project, the combined use of SVM and HMM is employed to predict GPCR-G protein coupling [7]. The predictive power of the SVM is higher than that of the rest of the methods, but it is opaque in the sense that it fails to identify the key features which are primarily responsible for classification. On the other hand, the HMM is capable of pinpointing those features, but its classification performance is not as good as that of the SVM. The GPCR data suffer from the curse of dimensionality, i.e., the number of features is much larger than the number of samples in the different classes, and the prediction accuracy decreases due to this problem. In order to overcome it, most classification techniques have employed Principal Component Analysis (PCA) to reduce the number of features [8] [9]. As PCA combines all the features, it is difficult to pinpoint the exact set of features which are responsible for ligand-binding processes. Cobanoglu et al. devised an exhaustive search method for family-specific triplets to classify and identify the key ligand-receptor binding sites considering the linear position of a particular triplet [10]. The accuracy of this method is high because the exhaustive search is based on the relative linear distance of triplets; in other words, apart from the primary sequence information, topological information of the sequence is also taken into account. The authors of [11] attempted to classify GPCRs based on the protein power spectrum from the Fast Fourier Transform (FFT). Recent work of Naveed et al. shows that the classification accuracy can be improved by using genetic ensembles [12], where different feature extraction and classification strategies are used for GPCR prediction and then an evolutionary ensemble approach is used for enhanced prediction performance. State-of-the-art classification techniques either use conventional classification algorithms like HMM or SVM, or exhaustive search methods to identify the family-specific motifs. When it comes to conventional classification techniques, none of them exploits the properties specific to GPCR sequences. Although GPCR-SVMFS employed the minimum Redundancy Maximum Relevance (mRMR) method [13] and a genetic algorithm to extract features specific to GPCRs, it uses the entire sequence information [14]. The recent work of Cobanoglu et al. shows that the exo-cellular portion of the sequence is primarily responsible for ligand-binding processes; in order to classify GPCRs, it is sufficient to take into consideration only the exo-cellular part [10]. The work presented in Chapter 2 attempts to exploit these properties, which are specific to a GPCR sequence, while using a conventional method of classification such as SVM.


1.2.2 Motif Finding Problem in DNA Sequences

In the motif finding problem, the objective is to locate short conserved substrings, or motifs, in a set of long strings. A more general and difficult problem is to find short substrings which are almost conserved, e.g. the (l, d) motif finding problem. Given a set S of N sequences of lengths Li (i ∈ {1, 2, ..., N}), the task is to find a substring m of length l which appears frequently in S, accompanied by mutations in at most d random positions. An example is the (15, 4) motif finding problem. The problem is a difficult one, as two mutated versions of the substring m can differ in as many as 2d (here 8) positions. This problem is commonly known as the challenge problem [15]. Later, Buhler and Tompa provided a mathematical analysis explaining the inherent intractability of the problem [16]. Given the intractability and the importance of the motif finding problem in the context of transcription factor binding site identification, a number of computational tools (such as MEME [17], MCEMDA [18], Projection [16], Weeder [19], MUSCLE [20], ClustalW [21], BioProspector [22], etc.) have been developed to solve it. Among them, MEME and MCEMDA are Expectation Maximization (EM)-based methods. The Projection method uses a random projection technique to identify a favorable starting seed for EM-based algorithms. MUSCLE and ClustalW are multiple sequence alignment algorithms. MUSCLE incorporates a fast distance estimation using k-mer counting, a progressive alignment using a profile function called the log-expectation score, and a refinement technique using tree-dependent restricted partitioning [20]. ClustalW is also a progressive multiple sequence alignment algorithm, which uses sequence weighting and position-specific gap penalties [21]. BioProspector uses Markov background models to search for regulatory sequence motifs [22]; it uses the Gibbs sampling strategy.
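To make the (l, d) notation concrete, the following sketch (illustrative only, not one of the algorithms proposed in this thesis) scans a set of sequences for every l-mer within Hamming distance d of a candidate motif:

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_ld_occurrences(sequences, motif, d):
    """Return, per sequence, the start indices of all l-mers that lie
    within Hamming distance d of the given motif."""
    l = len(motif)
    hits = []
    for seq in sequences:
        hits.append([i for i in range(len(seq) - l + 1)
                     if hamming(seq[i:i + l], motif) <= d])
    return hits

# Each sequence carries one instance of "ACGTA" with at most one mutation:
seqs = ["TTACGTATT", "TTACCTATT"]
print(find_ld_occurrences(seqs, "ACGTA", 1))  # [[2], [2]]
```

Note that a brute-force search over all 4^l candidate motifs quickly becomes infeasible, which is exactly why the heuristic tools listed above exist.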

The Expectation Maximization (EM) method was introduced to the conserved site identification problem by Lawrence and Reilly [23]. This method identifies motifs which occur only once in each sequence. Later, Bailey and Elkan provided a more generalized model for the EM-based motif identification problem [17]. This iterative model is effective given a reasonably good starting point. In [16], Buhler and Tompa proposed a locality-sensitive hashing method called random projection to pinpoint a good starting point for EM-based algorithms. In [24], it is shown that the performance of uniform projection is better than that of random projection. If the initial guess about the starting point of EM is not reasonably good, the deterministic algorithm converges quickly to a local optimum. To ameliorate this limitation, the Monte Carlo Expectation Maximization (MC EM) method was proposed by Wei and Tanner, where the expectation step is calculated through Monte Carlo simulation [25]. This incorporates randomness into the EM algorithm. The Monte Carlo EM Motif Discovery Algorithm (MCEMDA) [18] uses this concept to discover motifs in DNA sequences.
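The random projection idea can be sketched as follows (a rough illustration of the hashing step, not Buhler and Tompa's actual implementation): each l-mer is hashed by its characters at k randomly chosen positions, so l-mers derived from the same planted motif tend to collide in a heavily populated bucket, which can then seed EM.

```python
import random
from collections import defaultdict

def project(lmer, positions):
    """Hash an l-mer by the characters at the chosen k positions."""
    return "".join(lmer[p] for p in positions)

def random_projection_buckets(sequences, l, k, seed=0):
    """Bucket every l-mer of the input by one random k-position
    projection; heavily populated buckets are candidate EM seeds."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for s, seq in enumerate(sequences):
        for i in range(len(seq) - l + 1):
            buckets[project(seq[i:i + l], positions)].append((s, i))
    return positions, buckets

positions, buckets = random_projection_buckets(
    ["ACGTACGT", "TTACGTAA"], l=6, k=3)
best = max(buckets.values(), key=len)  # most populated bucket -> EM seed
```

In the real algorithm many independent projections are drawn and only buckets exceeding a population threshold are refined by EM.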

The work presented in Chapter 3 discusses a method where EM is performed in randomly projected spaces to reduce computation. In Chapter 4, a Monte Carlo Expectation Maximization (MC EM) based motif finding method is proposed and studied.

Chapter 2

GPCR classification using family specific

conserved triplets

2.1 Introduction

G protein-coupled receptors (GPCRs), one of the largest superfamilies of transmembrane proteins, transduce extracellular signals across the cell membrane [26-28]. Various extracellular signals, related to vision, metabolism, immune and inflammatory responses [29-32], activate these receptors and, on ligand binding, they transduce these signals into intracellular responses via heterotrimeric G proteins. Different ligands bind to different types of GPCRs and consequently activate them by allowing them to bind with G proteins [33]. As a result, GPCRs are of paramount importance for pharmacological intervention. Presently, more than 50 percent of modern drugs available in the market target these protein sequences [34]. Given their importance, considerable attention has been devoted to identifying the key ligand-receptor binding sites using the GPCR sequence.

A GPCR is a single polypeptide chain (shown in Fig. 2.1) which crosses the cell membrane seven times. The segments external to the cell membrane are one amino terminus and three exo-cellular loops (ECL). These four segments are in direct physical contact with ligands during interaction. Although the transmembrane regions, three intra-cellular loops (ICL) and one carboxyl terminus have an important role in the signal transduction mechanism, they have a minor functional role in ligand-binding processes [10]. Previously, GPCRs have been classified using primary sequence information only [4]. In this work, structural information is used along with the primary sequence information to classify GPCR sequences.


[Figure 2.1 depicts the seven transmembrane regions crossing the cell membrane, with the N terminus and ECL 1-3 outside the cell, ICL 1-3 and the carboxyl terminus inside, a bound ligand, and the membrane cross-over points a-i.]

Figure 2.1: Snake plot of a GPCR sequence.

Out of all major families of GPCRs, class-A (Rhodopsin-like) is the most important one, as more than 80% of GPCRs are Rhodopsin-like [35].

From here onwards, the word family will be used in the context of the first-level classification of GPCRs. Similarly, the word subfamily will be used in the context of Class-A GPCR subfamilies unless otherwise specified. As it is the most abundant among all GPCR families, GPCR Class-A is the main focus of this work.

A novel string kernel is designed for the classification of GPCRs. This framework can identify the triplets conserved within a family as potential ligand-receptor binding sites. These triplets may be represented as motifs. As these motifs are specific to a particular family, it can be inferred that they are primarily responsible for ligand-binding processes. The primary sequence information and the cross-over points (b, c, d, e, f, g, h in Fig. 2.1) are considered as input to the proposed string kernel for feature extraction. A few selected features are fed to kernel classifiers. The improved classification accuracy indicates the effectiveness of the proposed method over previous methods, both in the ease of the feature extraction method and in classification accuracy. The rest of the chapter is organized as follows. Two kernel classifiers, namely the Support Vector Machine (SVM) and vector-valued regularized kernel function approximation (VVRKFA), are described in section 2.2. In section 2.3, the proposed string kernel, feature extraction and feature selection methodology are described. Experimental results are presented in section 2.4. These two sections provide the principal contribution of the work presented in this chapter. Finally, section 2.5 concludes by discussing the contribution of this chapter.
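As an illustration of how cross-over points delimit the exo-cellular part of a sequence before triplet extraction, the sketch below splices out the N terminus and exo-cellular loops and enumerates their overlapping triplets. The segment labels, boundary convention and toy sequence are hypothetical, chosen only to show the mechanics:

```python
def exocellular_segments(sequence, segments):
    """Keep only the segments outside the membrane: the amino (N)
    terminus and the exo-cellular loops. `segments` is a list of
    (start, end, label) with half-open indices derived from the
    cross-over points of the snake plot (hypothetical convention)."""
    keep = {"N-term", "ECL1", "ECL2", "ECL3"}
    return [sequence[s:e] for s, e, label in segments if label in keep]

def triplets(segment):
    """Overlapping amino-acid triplets of one segment."""
    return [segment[i:i + 3] for i in range(len(segment) - 2)]

# Toy sequence with hypothetical cross-over boundaries:
seq = "MKTAYIAKQRQISFVKSHFSRQ"
segs = [(0, 4, "N-term"), (4, 9, "TM1"), (9, 12, "ICL1"),
        (12, 16, "TM2"), (16, 20, "ECL1"), (20, 22, "TM3")]
exo = exocellular_segments(seq, segs)          # ["MKTA", "SHFS"]
feats = [t for part in exo for t in triplets(part)]
print(feats)  # ['MKT', 'KTA', 'SHF', 'HFS']
```

Triplets are taken per segment here so that none spans a membrane crossing; the actual feature map of the proposed kernel is defined in section 2.3.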


2.2 Kernel Classifiers and Spectrum Kernel

In this section, the two kernel classifiers used in this work are described briefly. The basic construction of the spectrum kernel and its limitation are also described here.
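In essence, the spectrum kernel maps a sequence to its vector of k-mer counts and takes the inner product of two such vectors; a minimal sketch of that construction:

```python
from collections import Counter

def spectrum(seq, k=3):
    """The k-spectrum feature map: counts of every k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    """Inner product of the two k-spectrum count vectors."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(count * sy[kmer] for kmer, count in sx.items())

# "ACG" occurs twice in the first sequence and once in the second,
# and no other 3-mer is shared, so the kernel value is 2 * 1 = 2:
print(spectrum_kernel("ACGACG", "ACGT"))  # 2
```

Its limitation, as discussed in section 2.2.3, is that the counts discard where in the sequence each k-mer occurs.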

2.2.1 Support Vector Machine

SVM, introduced in [5], is a supervised learning algorithm used for classification tasks. Given a set of features and their binary class labels, SVM computes a linear decision boundary in a high dimensional feature space to discriminate between positive and negative samples by maximizing the margin between them [36]. Let $S$ be a training set consisting of labeled input vectors $(x_i, l_i)$, $i = 1, \ldots, m$, where $x_i \in \Re^n$ are the instances and $l_i \in \{+1, -1\}$ are their respective labels. The input training data are projected to a higher dimensional feature space by a nonlinear transformation function $\phi$ which maps every point of the input space $\mathcal{X} \subset \Re^n$ to a feature space $\bar{\mathcal{X}}$, i.e., $\phi : \mathcal{X} \to \bar{\mathcal{X}}$. In order to train an SVM classifier, knowledge of the feature space in its explicit form is not required; knowledge of the inner products between training patterns in the feature space is sufficient. Therefore, the computational problem that arises from the high dimensional feature space is overcome with the help of the kernel trick by replacing the dot product in feature space with some function $K$, called the kernel function [36], such that:

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j), \tag{2.1}$$

where $x_i$ and $x_j$ are two points in the input space $\mathcal{X}$ and $\phi(x_i)$, $\phi(x_j)$ are their mapped points, respectively, in the feature space $\bar{\mathcal{X}}$. SVM finds the optimal separating hyperplane

by minimizing the following quadratic optimization problem:

Min 12||w||2 + C

m∑

i=1

ξi

s.t. li(

wTφ(xi) + b)

≥ 1− ξi for i = 1, 2, ...,m.(2.2)

Here, $w \in \Re^n$ and $b \in \Re$, respectively, are the normal to the hyperplane and the bias term; $\xi_i \,(\ge 0)$ is the soft margin error of the $i$th training sample, and $C$ is a regularization parameter regulating the trade-off between the margin width and the generalization performance. This formulation of SVM in (2.2) is known as the soft-margin SVM. The solution of the quadratic programming (QP) problem (2.2) is obtained by forming its Lagrangian dual as follows:

$$\max_{\alpha} \ L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j l_i l_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i l_i = 0, \ \ 0 \le \alpha_i \le C \ \text{ for } i = 1, 2, \ldots, m, \tag{2.3}$$

where $\alpha_i \,(0 \le \alpha_i \le C)$ are the Lagrange multipliers. The solution of this problem leads to the following decision function to test a new pattern $x \in \Re^n$:

$$f(x) = \mathrm{sgn}\left( w^T \phi(x) + b \right) = \mathrm{sgn}\left( \sum_{i=1}^{m} l_i \alpha_i K(x_i, x) + b \right). \tag{2.4}$$
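As a concrete illustration, the decision rule (2.4) can be evaluated directly from a set of trained dual variables; the toy support vectors, multipliers and bias below are invented for the sketch:

```python
import numpy as np

def rbf_kernel(x, y, mu=0.5):
    # Gaussian RBF kernel K(x, y) = exp(-mu * ||x - y||^2)
    return np.exp(-mu * np.sum((x - y) ** 2))

def svm_decision(x, support_vectors, labels, alphas, b, mu=0.5):
    # Evaluate f(x) = sgn(sum_i l_i * alpha_i * K(x_i, x) + b), as in (2.4)
    s = sum(l * a * rbf_kernel(sv, x, mu)
            for sv, l, a in zip(support_vectors, labels, alphas))
    return 1 if s + b >= 0 else -1

# Hypothetical trained parameters for a two-point toy problem
svs = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
labels = [-1, +1]
alphas = [1.0, 1.0]
b = 0.0
print(svm_decision(np.array([1.8, 1.9]), svs, labels, alphas, b))  # near the +1 vector: prints 1
```

Only the kernel evaluations against the training patterns are needed; $\phi$ never appears explicitly, which is the point of the kernel trick.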

This classification approach was originally developed for binary classification and was later extended to the multiclass case [37-40]. There are various methods to extend binary classifiers to a multiclass version. As the performance of the one-vs-one (OVO) SVM classifier is better than the others [39], it is used in this work. In the OVO method, $^{N}C_2 = N(N-1)/2$ binary classifier models are trained, one for each possible pair of classes, for an $N$-class problem. The testing of a sample in the OVO method is performed by the max-win strategy [39]: each trained binary classifier casts one vote for its favored class, and the class with the maximum votes specifies the class label of the sample. Thus, as the number of classes increases, the training and testing time also increase in this method.
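The max-win voting just described can be sketched as follows; the pairwise models here are hypothetical stand-ins for trained binary SVMs:

```python
from collections import Counter
from itertools import combinations

def ovo_predict(x, binary_models, classes):
    # binary_models[(a, b)] returns the winning class (a or b) for sample x.
    # Each of the N(N-1)/2 pairwise classifiers casts one vote; the class
    # with the most votes wins (max-win strategy).
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[binary_models[(a, b)](x)] += 1
    return votes.most_common(1)[0][0]

# Toy example: three classes with hypothetical pairwise classifiers
classes = ["A", "B", "C"]
models = {
    ("A", "B"): lambda x: "A",
    ("A", "C"): lambda x: "A",
    ("B", "C"): lambda x: "C",
}
print(ovo_predict(None, models, classes))  # "A" wins with 2 votes
```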

2.2.2 VVRKFA Method of Classification

VVRKFA is a recently proposed multiclass kernel classifier that classifies a pattern in a low dimensional label space through a regression technique [41]. The limitation of decomposition techniques, which classify multiclass data with binary classifiers, is overcome by this method. The concept of the VVRKFA method is depicted graphically in Fig. 2.2 [42]. Here, the training data are first mapped to a feature space by using the kernel trick. Then a regularized nonlinear function is fitted to map the data from the feature space to a lower dimensional label space, whose dimension equals the number of class labels in the data set. As observed in Fig. 2.2, the fitted vector-valued response has three elements, which is the number of classes. The classification is performed in the label space with these low dimensional patterns obtained through the vector-valued regression. The testing of a new pattern is also carried out in this label space by mapping it from the feature space to the label space.

Figure 2.2: Multiclass data classification through regression approach.

The mapping function of VVRKFA for an $N$-class data classification problem, based on $n$ input attributes and $m$ training samples $X = \{(x_i, y_i) : x_i \in H_x \subset \Re^n, \ y_i \in H_y \subset \Re^N, \ i = 1, 2, \ldots, m\}$, is obtained by solving the following optimization problem:

$$\min_{\Theta, b, \xi} \ J(\Theta, b, \xi) = \frac{C}{2} \mathrm{tr}\left( [\Theta \ b]^T [\Theta \ b] \right) + \frac{1}{2} \sum_{i=1}^{m} \|\xi_i\|^2 \quad \text{s.t.} \quad \Theta K(x_i^T, B^T)^T + b + \xi_i = y_i, \ i = 1, 2, \ldots, m. \tag{2.5}$$

The label vector $y_i$ of a sample $x_i$ of the $j$th class is chosen as the indicator vector of the classes with the following rule:

$$y_i = [y_{i1}, y_{i2}, \ldots, y_{iN}]^T \ \text{ with } \ y_{ij} = 1, \ y_{ik} = 0 \ \text{ for } k \ne j. \tag{2.6}$$

Here, $\Theta \in \Re^{N \times \bar{m}}$ is the regression coefficient matrix that maps a feature vector $\phi(x_i) = K(x_i^T, B^T)^T$ from the feature space $H_{\phi(x)}$ into the label space $H_y$; $B \in \Re^{\bar{m} \times n}$ is a matrix formed by randomly picking $\bar{m}$ rows of the training data matrix $A \in \Re^{m \times n}$; $K(\cdot, \cdot)$ is the kernel function that determines the dot product of two vectors in the feature space; $b \in H_y$ is the bias vector; $\xi_i \in H_y$ is the slack or error variable; and $\dim(H_{\phi(x)}) = \bar{m}$, with $\bar{m} \le m$. This leads to the following vector-valued function, $\rho(x_i) = \Theta K(x_i^T, B^T)^T + b$, to map a feature vector into the low dimensional subspace $H_y$, where $\rho$ is the fitted vector-valued response of an input $x \in \Re^n$ by VVRKFA.
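Eliminating $\xi_i$ turns (2.5) into a regularized least-squares problem in $[\Theta \ b]$. The sketch below is a non-authoritative reading of the method: it solves that least-squares system with NumPy, with the random reduced basis $B$, the kernel parameter `mu` and the regularization `C` as illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, mu=0.5):
    # Pairwise RBF kernel matrix between the rows of A and the rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-mu * d2)

def vvrkfa_fit(X, y, n_classes, m_bar, C=1e-3, mu=0.5):
    # Sketch of (2.5) as regularized least squares.
    # B is formed by randomly picking m_bar rows of the training data.
    B = X[rng.choice(len(X), size=m_bar, replace=False)]
    K = rbf(X, B)                                # m x m_bar feature vectors
    Y = np.eye(n_classes)[y]                     # indicator label vectors (2.6)
    Kt = np.hstack([K, np.ones((len(X), 1))])    # extra column absorbs the bias b
    W = np.linalg.solve(Kt.T @ Kt + C * np.eye(Kt.shape[1]), Kt.T @ Y)
    return B, W

def vvrkfa_project(X, B, W, mu=0.5):
    # rho(x) = Theta k(x) + b: map samples into the label space
    Kt = np.hstack([rbf(X, B), np.ones((len(X), 1))])
    return Kt @ W
```

Classification then proceeds in the label space, by comparing `vvrkfa_project` outputs with the class centroids as described next.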

A test pattern is classified in the space $H_y$ depending on its proximity (in the Mahalanobis sense) to the class centroids. The Mahalanobis distance takes into account the scatter of the data and makes the classification rule invariant under scale changes and location shifts. The vector-valued responses $\{\rho(x_i) \in H_y\}_{i=1}^{m}$ of the training patterns $\{x_i\}_{i=1}^{m}$ are obtained from the fitted function $\rho$ using a nonlinear kernel function $K$. These vector-valued responses are used to form the class centroids of the $N$ classes, denoted by $\{\bar{\rho}^{(1)}, \bar{\rho}^{(2)}, \ldots, \bar{\rho}^{(N)}\}$. Each class centroid $\bar{\rho}^{(j)}$ is calculated as

$$\bar{\rho}^{(j)} = \frac{1}{m_j} \sum_{i=1}^{m_j} \rho(x_i), \tag{2.7}$$

where $m_j$ is the number of training samples of the $j$th class. The class label of a test pattern $x_t$ is determined by

$$\mathrm{class}(x_t) = \arg\min_{1 \le j \le N} d_M\left( \rho(x_t), \bar{\rho}^{(j)} \mid \hat{\Sigma} \right), \tag{2.8}$$

where

$$\hat{\Sigma} = \sum_{j=1}^{N} (m_j - 1)\, \Sigma^{(j)} / (m - N) \tag{2.9}$$

is the pooled within-class sample covariance matrix, and

$$\Sigma^{(j)} = \frac{1}{m_j - 1} \sum_{i=1 \,\mid\, x_i \in Cl_j}^{m_j} \left( \rho(x_i) - \bar{\rho}^{(j)} \right) \left( \rho(x_i) - \bar{\rho}^{(j)} \right)^T \tag{2.10}$$

is the covariance matrix of the $j$th class. Here $d_M(\rho_1, \rho_2 \mid \hat{\Sigma})$ denotes the Mahalanobis distance between two patterns $\rho_1$ and $\rho_2$, given by

$$d_M(\rho_1, \rho_2 \mid \hat{\Sigma}) = \sqrt{ (\rho_1 - \rho_2)^T \, \hat{\Sigma}^{-1} (\rho_1 - \rho_2) }. \tag{2.11}$$
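Equations (2.7)-(2.11) amount to nearest-centroid classification under a pooled Mahalanobis metric. A minimal sketch, with toy label-space responses standing in for real VVRKFA outputs:

```python
import numpy as np

def fit_centroids(P, y, n_classes):
    # Class centroids (2.7) and pooled within-class covariance (2.9)
    m, k = P.shape
    centroids = np.stack([P[y == j].mean(axis=0) for j in range(n_classes)])
    Sigma = np.zeros((k, k))
    for j in range(n_classes):
        D = P[y == j] - centroids[j]
        Sigma += D.T @ D          # (m_j - 1) * Sigma^(j), summed over classes (2.10)
    Sigma /= (m - n_classes)
    return centroids, Sigma

def classify(p, centroids, Sigma):
    # Nearest centroid under the Mahalanobis distance (2.8), (2.11)
    Si = np.linalg.inv(Sigma)
    d = [np.sqrt((p - c) @ Si @ (p - c)) for c in centroids]
    return int(np.argmin(d))

# Toy label-space responses for two classes
P = np.array([[0.9, 0.1], [1.1, 0.0], [0.1, 0.9], [0.0, 1.1]])
y = np.array([0, 0, 1, 1])
cents, Sig = fit_centroids(P, y, 2)
print(classify(np.array([0.8, 0.2]), cents, Sig))  # prints 0
```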

2.2.3 Spectrum Kernel

The spectrum kernel is an $n$-gram kernel introduced in [43] for protein classification. Let $\Sigma$ be an alphabet with $|\Sigma| = l$ (for protein sequences, $l = 20$, with each letter representing one amino acid), and let $t$ be a sequence consisting of letters from $\Sigma$. The $k$-spectrum of $t$ is the set of all $k$-length ($k \ge 1$) contiguous subsequences that $t$ contains. Let $p \in \Sigma^k$; then the feature-map is indexed by all the elements belonging to $\Sigma^k$. The feature-map, which translates a point from the input space $\mathcal{X}$ to $\Re^{l^k}$, is defined as the column vector $\Phi_k(t) = (\phi_p(t))_{p \in \Sigma^k}$, where $\phi_p(t)$ is the frequency of occurrence of $p$ in $t$. In the kernel matrix, the value corresponding to the sequences $x_i$ and $x_j$ is computed from $\Phi_k(x_i)$ and $\Phi_k(x_j)$ by using any kernel function, such as the linear kernel or the Gaussian Radial Basis Function (RBF) kernel. For the RBF kernel, $K_k(x_i, x_j) = \exp(-\mu \|\Phi_k(x_i) - \Phi_k(x_j)\|^2)$, where $\mu \ge 0$ is an adjustable parameter.
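A minimal sketch of the $k$-spectrum feature map and the RBF kernel built on it, using a toy two-letter alphabet in place of the amino-acid alphabet:

```python
import numpy as np
from itertools import product

def spectrum_features(seq, alphabet, k=3):
    # k-spectrum feature map Phi_k(t): frequency of every k-mer over the alphabet
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {p: i for i, p in enumerate(kmers)}
    phi = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        phi[index[seq[i:i + k]]] += 1
    return phi

def rbf_spectrum_kernel(s, t, alphabet, k=3, mu=0.1):
    # RBF kernel on spectrum features: exp(-mu * ||Phi_k(s) - Phi_k(t)||^2)
    d = spectrum_features(s, alphabet, k) - spectrum_features(t, alphabet, k)
    return np.exp(-mu * np.dot(d, d))

print(spectrum_features("ABAB", "AB", k=2))  # counts for AA, AB, BA, BB: [0. 2. 1. 0.]
```

Note that the feature vector grows as $l^k$, which is why the alphabet reduction of section 2.3.2.1 matters.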

This work implements the kernel matrix by incorporating the structural information of a GPCR sequence, thus improving the classification accuracy. The limitation of the traditional string kernel is that, in the feature selection step, it can pinpoint the $k$-mers that are primarily responsible for classification but fails to predict their positions. This limitation is overcome by the proposed string kernel.

2.3 Methodology

2.3.1 Problem Statement

The research problem is formulated as follows:

Given a data set $D$ of GPCR sequences and the names of their respective families and subfamilies, compute the kernel matrix, using GPCR-specific structural information, to classify a new unlabeled sequence by kernel classifiers. From $D$, identify the family- or subfamily-specific conserved locations as the potential receptor-ligand binding sites.

2.3.2 Feature Extraction

For every GPCR sequence, the start and end points of the different regions are used. This information can be taken from a pre-existing knowledge base, or it can be predicted by an HMM, as is the case in this work. From these segments, all the triplets are collected and the feature-map is formed. The alphabet size ($l$) of amino-acid sequences is twenty, which is quite large for forming the feature vector. For this reason, an alphabet reduction scheme is employed here.

2.3.2.1 Alphabet Reduction Scheme

The alphabet reduction scheme groups amino acids based on the similarity of their physico-chemical properties. There is prior work on such grouping schemes in [44], [45], [10]. Among these, the Sezerman grouping scheme, shown in TABLE-2.1, is used in this work because it gives the best results in GPCR classification [10].

Table 2.1: Sezerman Alphabet Reduction Scheme

Amino-acid group:  IVLM  RKH  DE  QN  ST  A  G  W  C  YF  P
Reduced letter:     A     B   C   D   E  F  G  H  I  J   K
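The reduction of Table 2.1 is a simple character mapping; the sketch below reproduces the worked example of section 2.3.3.2 (the group-to-letter assignment follows the table's ordering):

```python
# Sezerman grouping from Table 2.1: each amino-acid group maps to one reduced letter
GROUPS = ["IVLM", "RKH", "DE", "QN", "ST", "A", "G", "W", "C", "YF", "P"]
REDUCE = {aa: chr(ord("A") + i) for i, group in enumerate(GROUPS) for aa in group}

def reduce_alphabet(seq):
    # Translate an amino-acid sequence into the 11-letter reduced alphabet
    return "".join(REDUCE[aa] for aa in seq)

# The worked example from section 2.3.3.2
print(reduce_alphabet("MELNSRVDSFRYTLPIVLGANGWAMPV"))
# -> ACADEBACEJBJEAKAAAGFDGHFAKA
```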

2.3.3 Proposed String Kernel

Given a receptor sequence $t$ and any $k$-length protein sequence $q$, the $k$-spectrum kernel only possesses information about the frequency of occurrence of $q$ in $t$. However, the position of $q$ in $t$ is equally important in the ligand-binding process. To include the position information in the spectrum kernel, a GPCR sequence is divided into four separate regions, represented as $t_i$ ($i \in \{1, 2, 3, 4\}$). Each $t_i$ is composed of $t_{i(out)}$ and $t_{i(in)}$. The start and end points of these segments are shown in TABLE-2.2. Now, for each $t_i$ ($i \in \{1, 2, 3, 4\}$), its feature-map $\Phi_i^3(t)$ is defined as:

$$\Phi_i^3(t) = \left( \phi_p^i(t) \right)_{p \in \Sigma^3}, \tag{2.12}$$

where $\Sigma$ is the reduced alphabet (for Sezerman's grouping scheme, $|\Sigma| = 11$) and

$$\phi_p^i(t) = \eta \, f_p^{t_{i(out)}} + (1 - \eta) \, f_p^{t_{i(in)}}, \tag{2.13}$$

where $f_p^{l}$ is the frequency of occurrence of $p$ in $l$. The value of the parameter $\eta$ ($0 \le \eta \le 1$) is a measure of confidence in two aspects: first, the belief that the external segments contain more informative triplets, and second, the accuracy of the method employed to predict the different segments (HMM in this case). For example, if the positions of all the segments are known exactly, then the value of $\eta$ should be close to unity, giving high importance to the external regions and little importance to the rest. On the other hand, if the GPCR structure prediction is not very reliable, then the value of $\eta$ should be near $0.5$, giving almost equal importance to both the regions $t_{i(out)}$ and $t_{i(in)}$. The feature-map corresponding to the entire sequence $t$ is defined as

$$\Phi(t) = \left[ (\Phi_1^3(t))^T \ (\Phi_2^3(t))^T \ (\Phi_3^3(t))^T \ (\Phi_4^3(t))^T \right]^T. \tag{2.14}$$

That is, the feature-map of the entire GPCR sequence $t$ is formed by concatenating the individual feature-maps of the four different regions of the GPCR sequence. As the length of $p$ is clear from the context, the subscript $3$ will be omitted from the feature-map notation from here onwards. Each vector $\Phi_i(t) \in \Re^{|\Sigma|^3}$ ($i \in \{1, 2, 3, 4\}$), where $|\Sigma| = 11$, so the feature vector $\Phi(t) \in \Re^{4|\Sigma|^3}$. $\Phi(t)$ possesses not only the frequency information of a particular triplet $p \in \Sigma^3$ in the different regions of $t$ but also specific positional information. In order to compute an entry of the kernel matrix corresponding to two GPCR sequences $x$ and $y$, any similarity measure between the feature-maps of these sequences, i.e., $\Phi(x)$ and $\Phi(y)$, can be used; for example, the inner product or the Gaussian RBF kernel. In this work, the latter is used to compute the kernel matrix.

Table 2.2: Start and end points of different segments†

4-fold kernel                        8-fold kernel
Segment            Start   End       Segment   Start   End
t1:  t1(out)         a      b        t1          a      b
     t1(in)          b      c        t2          b      c
t2:  t2(out)         c      d        t3          c      d
     t2(in)          d      e        t4          d      e
t3:  t3(out)         e      f        t5          e      f
     t3(in)          f      g        t6          f      g
t4:  t4(out)         g      h        t7          g      h
     t4(in)          h      i        t8          h      i

† Refer to Fig. 2.1 for the start and end points.

Figure 2.3: An example of feature-map formation: the segment is reduced via the Sezerman alphabet and split into the exo-cellular loop ($t_{2(out)}$, weight $\eta = 0.7$) and the trans-membrane region ($t_{2(in)}$, weight $1 - \eta = 0.3$), and the weighted triplet frequencies are combined.

2.3.3.1 Normalization

The length of GPCR sequences varies significantly, and so does the total number of triplets in them. Therefore, a proper normalization step is required before the computation of the kernel matrix. Let $D$ be a dataset containing $m$ GPCR sequences. At first, $\Phi(t)$ is computed for each $t \in D$. Then a matrix $D' \in \Re^{m \times 4|\Sigma|^3}$ is formed, where each row represents a GPCR sequence and each column represents a triplet in one of the four different regions of the GPCR. Next, $D'$ is normalized row-wise to form $D''$, for $i = 1, 2, \ldots, m$:

$$D''(i, j) = \frac{D'(i, j)}{\sum_{\forall j} D'(i, j)}, \quad j = 1, 2, \ldots, 4|\Sigma|^3. \tag{2.15}$$

The above normalization takes care of the problem of variable-length GPCR sequences. In the next step, column-wise normalization is performed to bring all the feature values between $0$ and $1$. Here, $D'''$ is formed from $D''$, for $j = 1, 2, \ldots, 4|\Sigma|^3$:

$$D'''(i, j) = \frac{D''(i, j) - \min_{\forall i} D''(i, j)}{\max_{\forall i} D''(i, j) - \min_{\forall i} D''(i, j)}, \quad i = 1, 2, \ldots, m. \tag{2.16}$$
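The two normalization steps (2.15)-(2.16) can be sketched as follows; leaving constant columns (where the max equals the min) at zero is an implementation choice not specified in the text:

```python
import numpy as np

def normalize(Dp):
    # Row-wise normalization (2.15): divide each row by its sum
    Dpp = Dp / Dp.sum(axis=1, keepdims=True)
    # Column-wise min-max normalization (2.16): scale every feature to [0, 1]
    mn, mx = Dpp.min(axis=0), Dpp.max(axis=0)
    return (Dpp - mn) / np.where(mx > mn, mx - mn, 1.0)

Dp = np.array([[2.0, 2.0, 4.0], [1.0, 3.0, 4.0]])
print(normalize(Dp))  # [[1. 0. 0.] [0. 1. 0.]]
```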

2.3.3.2 An Example

Consider a protein sequence segment ($t_i$): MELNSRVDSFRYTLPIVLGANGWAMPV. After Sezerman's alphabet reduction scheme is applied according to TABLE-2.1, $t_i$ becomes ACADEBACEJBJEAKAAAGFDGHFAKA. Let $t_{i(out)} = $ ACADEBACEJBJEAKAAAGF and $t_{i(in)} = $ DGHFAKA. The triplets present in $t_{i(out)}$ are ACA, CAD, ..., AGF, and the triplets present in $t_{i(in)}$ are DGH, GHF, ..., AKA. In $\Phi_i(t)$, the values corresponding to the different triplets are computed using (2.13). This scheme is illustrated in Fig. 2.3 ($i$ is taken to be $2$ in this case). Similarly, the feature-maps of the other segments are computed and concatenated to form the feature-map of the entire sequence, $\Phi(t)$, according to (2.14). Then normalization is performed as described in section 2.3.3.1.
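The weighted triplet frequency (2.13) for the segment of this example can be sketched as follows (the $\eta = 0.7$ weight matches Fig. 2.3):

```python
from collections import Counter

def triplet_freq(seq):
    # Frequency of every 3-mer present in seq
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))

def segment_feature(t_out, t_in, eta=0.7):
    # phi_p(t) = eta * f_p^{t(out)} + (1 - eta) * f_p^{t(in)}   (2.13)
    f_out, f_in = triplet_freq(t_out), triplet_freq(t_in)
    triplets = set(f_out) | set(f_in)
    return {p: eta * f_out[p] + (1 - eta) * f_in[p] for p in triplets}

# The reduced segment from the example, split into its two parts
phi = segment_feature("ACADEBACEJBJEAKAAAGF", "DGHFAKA", eta=0.7)
print(phi["AKA"])  # appears once in each part: 0.7*1 + 0.3*1 = 1.0
```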

2.3.4 Extension to the 8-fold Kernel

A new 8-fold kernel is proposed here as a variant of the 4-fold kernel. Instead of forming the feature-map of $t_i$ by the weighted contributions of $t_{i(in)}$ and $t_{i(out)}$, a separate feature-map is formed for each of $t_{i(in)}$ and $t_{i(out)}$. In this scheme, a protein sequence $t$ is divided into eight segments: $t_i$ ($i \in \{1, 2, \ldots, 8\}$). The start and end points of these segments are shown in TABLE-2.2. The feature-map $\Phi_i(t)$ for each $t_i$ ($i \in \{1, 2, \ldots, 8\}$) is defined as:

$$\Phi_i(t) = \left( \phi_p^i(t) \right)_{p \in \Sigma^3}, \tag{2.17}$$

where, as before, $\Sigma$ is the reduced alphabet, and

$$\phi_p^i(t) = \begin{cases} \eta \, f_p^{t_i} & \text{if } i \text{ is odd} \\ (1 - \eta) \, f_p^{t_i} & \text{otherwise,} \end{cases} \tag{2.18}$$

where $f_p^{t_i}$ is the frequency of occurrence of $p$ in $t_i$ and $0 \le \eta \le 1$. Note that $t_i$ is an external segment if $i$ is odd and an internal segment otherwise. The kernel formed in this manner is the 8-fold kernel. The feature-map corresponding to the entire sequence $t$ is formed as before:

$$\Phi(t) = \left[ (\Phi_1(t))^T \ (\Phi_2(t))^T \ \cdots \ (\Phi_8(t))^T \right]^T. \tag{2.19}$$

Each vector $\Phi_i(t) \in \Re^{|\Sigma|^3}$ ($i \in \{1, 2, \ldots, 8\}$), where $|\Sigma| = 11$, so the feature vector $\Phi(t) \in \Re^{8|\Sigma|^3}$. Here, the localization property is better than in the previous scheme, and higher classification accuracy can be achieved at the expense of increased computational complexity.

2.3.5 Feature Reduction Schemes

As described in section 2.3.3, the feature-map of a sequence, $\Phi(t)$, is of length $5324$ for the 4-fold kernel and $10648$ for the 8-fold one. As a result, it is important to reduce the dimension of $\Phi(t)$ before classification. The minimum Redundancy Maximum Relevance (mRMR) method [13] is employed in this work for feature selection. The mRMR method utilizes mutual information criteria for selecting a set of the most informative features: it takes the maximum relevance criterion together with the minimum redundancy criterion into account to choose additional features that are maximally dissimilar to the already selected ones. The mRMR method with the quotient scheme [13], being a popular choice, is employed in this work.
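A hedged sketch of greedy mRMR selection with the quotient scheme; the histogram-based mutual information estimator and its bin count are illustrative choices, not the estimator of [13]:

```python
import numpy as np

def mutual_info(a, b, bins=8):
    # Discrete mutual information between two feature vectors via a 2-D histogram
    h, _, _ = np.histogram2d(a, b, bins=bins)
    p = h / h.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def mrmr(X, y, r):
    # Greedy mRMR (quotient scheme): at each step pick the feature maximizing
    # relevance-to-class divided by mean redundancy with already chosen features.
    n = X.shape[1]
    rel = np.array([mutual_info(X[:, j], y) for j in range(n)])
    chosen = [int(np.argmax(rel))]
    while len(chosen) < r:
        best, best_score = None, -np.inf
        for j in range(n):
            if j in chosen:
                continue
            red = np.mean([mutual_info(X[:, j], X[:, c]) for c in chosen])
            score = rel[j] / (red + 1e-12)
            if score > best_score:
                best, best_score = j, score
        chosen.append(best)
    return chosen
```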

2.3.5.1 Identification of Receptor-Ligand Interaction Sites

While using the mRMR method to select the best features before classifying GPCRs, the effect of the number of selected features on classification accuracy is studied. It is found that the classification accuracy saturates with an increasing number of features; a further increase in the number of features results in a drop in classification accuracy. It can be inferred that the features offering maximum classification accuracy correspond to triplets responsible for ligand-binding processes. The work of Cobanoglu et al. [10] also supports this observation. These triplets and their positions are identified in the corresponding GPCR sequences.


A schematic diagram of the entire workflow of the proposed method is summarized in Fig. 2.4.

Figure 2.4: Schematic Diagram of the Classification Process.

Once the triplets are identified, their corresponding classes can be identified using the following method. Let $r$ be the number of features selected by mRMR. The normalized dataset $D'''$, containing the information of $m$ GPCR sequences as described in section 2.3.3.1, is now transformed into $D^r_{mrmr} \in \Re^{m \times r}$, which contains the $r$ columns corresponding to the $r$ mRMR features. If $d \in \Re^m$ is the vector containing the class information of $D^r_{mrmr}$, and $h$ is the number of classes, then an indicator vector $d_j \in \Re^m$ is formed from $d$ for a class $j$ as:

$$d_j(i) = \begin{cases} 1 & \text{if } d(i) = j \\ 0 & \text{otherwise.} \end{cases} \tag{2.20}$$

Let $D^r_{mrmr}(:, r_i)$ denote the column vector of $D^r_{mrmr}$ corresponding to feature $r_i$. The label $\mathrm{class}(r_i)$, for which the triplet corresponding to $r_i$ is responsible for ligand binding, is:

$$\mathrm{class}(r_i) = \arg\max_{j \in \{1, 2, \ldots, h\}} \left[ d_j^T \, D^r_{mrmr}(:, r_i) \right]. \tag{2.21}$$

Here, in (2.21), $d_j^T$ represents the transpose of $d_j \in \Re^m$.
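Equations (2.20)-(2.21) reduce to an argmax over per-class column sums; a minimal sketch (classes are indexed from 0 here, rather than 1..h):

```python
import numpy as np

def feature_class(D_mrmr, d, h, col):
    # Eq. (2.21): assign feature column `col` to the class whose samples
    # accumulate the largest total feature value.
    scores = []
    for j in range(h):
        dj = (d == j).astype(float)      # indicator vector (2.20)
        scores.append(dj @ D_mrmr[:, col])
    return int(np.argmax(scores))

# Toy data: feature 0 is strongly expressed in class-1 sequences
D = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
d = np.array([0, 0, 1, 1])
print(feature_class(D, d, h=2, col=0))  # prints 1
```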

2.4 Experiments

In this section, the data sets and experimental setup are described. This is followed by the

results and their analyses.


2.4.1 Data Set

The performance of the proposed string kernel is evaluated on three different data sets, described below:

• Data set-I: This is used for the GPCR family detection problem. It is prepared from the GPCRDB website [46]. A set of 390 human GPCR sequences, belonging to four different families, is studied. The number of sequences in each family is given in TABLE-2.3. The topological information of these sequences is taken from the UniProt website [47] as the input to the proposed method.

Table 2.3: Data set-I: Human GPCR sequences

GPCR Families                                  # of sequences
Class-A Rhodopsin like                              323
Class-B Secretin like                                47
Class-C Metabotropic glutamate or pheromone          15
Vomeronasal receptors (V1R and V3R)                   5
TOTAL                                               390

• Data set-II: This data set contains protein sequences of the GPCR Class-A family. It has been used in the work of Raghava et al. [4]. In order to automate the topological region prediction process, the TMHMM server [48], which predicts the different regions of a GPCR based on an HMM, is used. TMHMM may fail to predict a valid GPCR structure (one having four extracellular regions); those sequences for which TMHMM fails to predict a valid GPCR structure are excluded from the study. Out of the total 1054 sequences used in the work of Raghava et al. [4], TMHMM predicted 885 sequences correctly. The number of sequences of the different subfamilies processed correctly by TMHMM is shown in TABLE-2.4. It is observed that TMHMM processes all the GPCR Class-A subfamilies with high accuracy except for the Prostanoid subfamily. After processing with the TMHMM server, the prepared data set becomes identical to the data used in [10], where the family specificity of motifs is considered to classify GPCRs. As the Class-A family is considered the most important among all GPCR families, maximum emphasis is placed on this data set. The data set is used to compare the results of this work with some existing GPCR classification methods.

Table 2.4: Data set-II: Class A subfamilies and TMHMM performance

Class A subfamilies                     Number of†    TMHMM‡    % of
                                        sequences               accuracy
Amine (AMN)                                221          208       94.1
Peptide (PEP)                              381          304       79.8
Hormone proteins (HMP)                      25           24       96.0
(Rhod)opsin (RHD)                          183          174       95.1
Olfactory (OLF)                             87           69       79.3
Prostanoid (PRS)                            38            8       21.0
Nucleotide-like (NUC)                       48           33       68.7
Cannabinoid (CAN)                           11           11      100.0
Gonadotropin-releasing hormone (GRH)        10            9       90.0
Thyrotropin-releasing hormone (TRH)          7            7      100.0
Melatonin (MEL)                             13           13      100.0
Viral (VIR)                                 17           13       76.5
Lysosphingolipids (LYS)                      9            8       88.9
Platelet activating hormone (PAF)            4            4      100.0
TOTAL                                     1054          885       84.0

† Total number of sequences available in data set-II.
‡ Number of sequences correctly processed by TMHMM.

• Data set-III: This data set is used by Naveed et al. [12]. Being the largest among all three data sets used in this work, it consists of five families of GPCRs, divided into 39 subfamilies. TMHMM is used to predict the topological information. After preprocessing, this data set contains a total of 6125 sequences. The numbers of sequences in the five families and 39 subfamilies are shown in TABLE-2.6.

2.4.2 Experimental Setup

The effectiveness of the selected reduced feature set in classifying GPCR sequences is principally studied with SVM, and most of the results shown here are based on SVM. Additionally, in some cases, results using VVRKFA are shown alongside those using SVM, because VVRKFA, being a kernel-based classifier, validates our claim about the effectiveness of the proposed string kernel. To implement the SVM classification algorithm, the libSVM software package [49] is used, and the VVRKFA algorithm is implemented in MATLAB [41]. To evaluate the performance of the proposed method, 10-fold cross validation (CV) testing is carried out. The total dataset is divided into 10 parts; at a time, nine parts are taken as the training dataset and the remaining part as the testing dataset. The experiment is performed 10 times, taking each part as the testing set once, and the final accuracy is calculated by averaging the accuracies over all parts. The grid search technique is employed to identify the optimal parameter set for a classifier. The experiment is performed with a Gaussian RBF kernel of the form $K(x_i, x_j) = \exp(-\mu \|x_i - x_j\|^2)$, where $\mu$ is the kernel parameter. The regularization parameters $C$ of SVM and VVRKFA are selected by tuning from the sets $\{C = 2^i \mid i = -5, -4, \ldots, 12\}$ and $\{C = 10^i \mid i = -7, -6, \ldots, -1\}$, respectively. The kernel parameter $\mu$ for both methods is selected from the set $\{\mu = 2^i \mid i = -8, -7, \ldots, 8\}$. The optimal parameter sets for both classifiers are selected by the performance of a parameter set on a tuning set comprising 30% of the total data. Once the optimal parameter set is chosen, the 10-fold CV is performed to compute the performance of the classifiers. For each data set, the 10-fold CV is performed 100 times with random permutations of the training data to calculate the average testing accuracy.
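The tuning protocol above, a 10-fold split plus a grid search over $(C, \mu)$, can be sketched as follows; `score_fn` stands in for training and scoring a classifier on the tuning set and is a hypothetical placeholder:

```python
import numpy as np
from itertools import product

def kfold_indices(m, k=10, seed=0):
    # Randomly permute the m sample indices and split them into k folds
    perm = np.random.default_rng(seed).permutation(m)
    return np.array_split(perm, k)

def grid_search(score_fn, Cs, mus):
    # Return the (C, mu) pair with the best tuning-set score
    return max(product(Cs, mus), key=lambda p: score_fn(*p))

# Parameter grids from section 2.4.2 (SVM case)
Cs = [2.0 ** i for i in range((-5), 13)]
mus = [2.0 ** i for i in range((-8), 9)]

folds = kfold_indices(100, k=10)   # e.g. 100 samples split into 10 CV folds
```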

2.4.3 Experimental Results

With the experimental setup described in section 2.4.2, the performance of the proposed method is computed in three ways: 1) performance in predicting the different families of GPCRs, 2) performance in predicting the different subfamilies, and 3) performance in predicting the subfamilies within a specific family of GPCRs, with an emphasis on Class-A subfamilies in particular.

2.4.3.1 Performance to predict the Families and Sub-families

In data set-I, the human GPCR sequences are grouped into the four major classes shown in TABLE-2.3. The proposed method achieves 99.1% and 98.9% classification accuracy using SVM and VVRKFA, respectively. The average accuracies over 100 runs of the algorithm, using 10-fold CV, are shown in TABLE-2.5.

On data set-III, the task is to classify the GPCR sequences from five families and thirty-nine subfamilies in two subsequent stages. The results are given in TABLE-2.6. For the family prediction, a set of 50 mRMR features is used to solve the 5-class problem; for the prediction of the 39 subfamilies, a set of 400 mRMR features is used to solve the 39-class problem. The value of $\eta$ is taken to be $0.60$ and $0.55$ for the 4-fold and 8-fold kernels, respectively.

The data sets used in this work have the additional information about the start and end points of the different regions of a GPCR sequence. As a result, it may not be feasible to compare the classification accuracy of some of the existing methods without incorporating this extra information.

Table 2.5: Classification Accuracy in Data set-I

Total No. of   Total No. of          Percentage of Accuracy (a)
Features       Sequences          SVM                  VVRKFA
                              4-fold†  8-fold†    4-fold†  8-fold†
25             390              90.8     91.0       89.9     90.1
50             390              95.1     96.3       94.9     96.1
80             390              98.7     99.1       98.6     98.9

(a) 10-fold cross validation is used.
† The value of η is taken to be 0.6 and 0.55 for the 4-fold and 8-fold kernels, respectively.

Table 2.6: Classification Accuracy in Naveed's GPCR Data set

Levels               No. of      No. of        No. of        % of Accuracy (c)
                     Classes (a) Sequences (b) Features   SVM               VVRKFA
                                                       4-fold† 8-fold†  4-fold† 8-fold†
Family                  5         6125           50      96.2    98.9     96.3    97.5
Subfamily              39         6125          400      83.8    90.1     82.5    87.3
Class-A subfamily      17         4283          150      89.8    92.9     91.6    91.8
Class-B subfamily (d)  14          402          150      90.1    94.3     92.2    93.9
Class-C subfamily       6         1415           80      76.1    80.6     77.8    78.4

(a) Both Class-D and Class-E consist of a single subfamily.
(b) Number of sequences for which TMHMM predicts a valid GPCR structure.
(c) In all results, 10-fold cross validation is used to measure the percentage accuracies.
(d) TMHMM fails to predict a valid GPCR structure for all the sequences of the Gastric subfamily in Class-B.
† The value of η is taken to be 0.6 and 0.55 for the 4-fold and 8-fold kernels, respectively.

2.4.3.2 Class A Subfamily Detection

In this stage of classification, the experiment is performed to identify only the Class-A subfamilies, as Class-A is the most important GPCR family. For this purpose, data set-II, given in TABLE-2.4, is used. This data set, prepared by Raghava et al. [4], is also used in the work of Cobanoglu et al. [10]. Emphasis is placed on this data set because it helps to compare the results with the existing methods. In data set-II, as shown in TABLE-2.4, there are 14 subfamilies of the GPCR Class-A family, and the number of sequences varies across subfamilies. Using 200 mRMR features, the proposed method achieves CV accuracies of 99.4% and 99.1% with SVM and VVRKFA, respectively, for the 8-fold kernel, as shown in TABLE-2.7.

Table 2.7: Classification Accuracy in Data set-II

Total No. of   Total No. of          Percentage of Accuracy (a)
Features       Sequences          SVM                  VVRKFA
                              4-fold†  8-fold†    4-fold†  8-fold†
50             885              94.6     95.4       94.9     95.7
100            885              98.6     98.9       98.1     98.8
150            885              95.1     96.3       94.9     96.1
200            885              99.2     99.4       99.0     99.1

(a) 10-fold cross validation is used.
† The value of η is taken to be 0.6 and 0.55 for the 4-fold and 8-fold kernels, respectively.

Table 2.8: Misclassification Table

True          Subfamilies of the misclassified seq.‡
Subfamily†    4-fold kernel     8-fold kernel
NUC           PEP (2)           PEP (1)
PEP           NUC (1)           -
RHD           PEP (1)           PEP (1)
VIR           PEP (3)           PEP (3)

† Subfamilies not shown here have 100% accuracy.
‡ The number of misclassified sequences is given in parentheses.

The misclassification table of a single run of the program using SVM is shown in TABLE-2.8. A total of 200 features are used by SVM. Out of 885 sequences, only 7 and 5 sequences are misclassified using the 4-fold and 8-fold kernels, respectively.

2.4.4 Comparison with Some Related Work

There have been prior efforts at predicting the family and subfamily of a GPCR sequence from primary sequence information, but in every case different kinds of features are used. To demonstrate the effectiveness of the proposed string kernel, two methods, GPCRBind [10] and GPCRPred [4], are selected for comparison. In GPCRBind, the presence of a triplet (in the ECL) along with its position in a linear sense is considered; in order to incorporate this information, a computationally expensive exhaustive search is employed [10]. In GPCRPred, a combination of several SVMs is used, but the information related to the external regions is not considered [4]. GPCRPred is chosen because it employs a kernel classifier, SVM, so the effectiveness of the proposed string kernel as a preprocessing tool can be verified. GPCRBind is chosen because it is a method that makes use of the positional information of triplets. In TABLE-2.9, a comparison of the proposed work with GPCRPred and GPCRBind on the same data set is presented. The reported result of GPCRBind corresponds to its most successful run. It may be noted that GPCRPred, an SVM-based classification server, offers poor performance on some subfamilies. It is observed that the classification accuracy of the proposed method is much better than that of GPCRPred (92.8%). Additionally, the requirement for an ensemble of SVMs (as in GPCRPred) is relaxed in the proposed work, reducing the computational complexity. In this work, out of a total of 5324 features (4-fold case), at most 200 features (less than 5% of the total number of features) are used.

Table 2.9: Comparison of Classification Performances among GPCRPred, GPCRBind and the Proposed String Kernels

Subfamily   Total   GPCRPred        GPCRBind            Proposed Method
                                                     8-fold          4-fold
AMN          208    204 (98.1%)     203 (97.6%)     208 (100%)     208 (100%)
CAN           11      9 (81.8%)       9 (81.8%)      11 (100%)      11 (100%)
GRH            9      5 (55.5%)       9 (100%)        9 (100%)       9 (100%)
HMP           24     21 (87.5%)      24 (100%)       24 (100%)      24 (100%)
LYS            8      6 (75.0%)       8 (100%)        8 (100%)       8 (100%)
MEL           13     10 (76.9%)      13 (100%)       13 (100%)      13 (100%)
NUC†          33     24 (73.7%)      29 (87.8%)      32 (96.9%)     31 (93.9%)
OLF           69     60 (86.9%)      68 (98.5%)      69 (100%)      69 (100%)
PAF            4      0 (00.0%)       1 (25.0%)       4 (100%)       4 (100%)
PEP†         304    301 (99.0%)     302 (99.3%)     304 (100%)     303 (99.7%)
PRS            8      3 (37.5%)       8 (100%)        8 (100%)       8 (100%)
RHD†         174    174 (100%)      169 (97.1%)     173 (99.4%)    173 (99.4%)
TRH            7      4 (57.1%)       6 (85.7%)       7 (100%)       7 (100%)
VIR†          13      0 (00.0%)      12 (92.3%)      10 (76.9%)     10 (76.9%)
Overall      885    821 (92.8%)     861 (97.3%)     880 (99.4%)    878 (99.2%)

Note: Results are based on data set-II.
† For misclassification information, refer to TABLE-2.8.

2.4.5 Effect of Variation of η

As described in section 2.3.3, the confidence in the method employed to predict the start and end locations of the different segments can be modulated by the parameter $\eta$ in (2.13) and (2.18).

Figure 2.5: Comparison of Classification Performances among GPCRPred, GPCRBind and the proposed String Kernels on data set-II.

The effect of the variation of $\eta$ on classification accuracy is presented in Fig. 2.6. As observed in the figure, the classification accuracy is maximal in the region $0.45 \le \eta \le 0.75$. There may be two reasons for this. First, the performance of TMHMM may not be reliable. Second, there may be a large number of conserved triplets present in the internal segments. The second reason is more likely to be true, as it is evident from TABLE-2.11 that some of the most conserved triplets are in fact from trans-membrane regions and ICLs. This is the rationale for selecting $\eta$ close to $0.5$ for the classification results presented in section 2.4.

Figure 2.6: Effect of the variation of η on classification accuracy (%) in GPCR Class-A subfamily prediction (based on data set-II), for the 4-fold and 8-fold kernels with 25, 100 and 200 selected features.

2.4.6 Identified Binding-Site Triplets and Their Positions

In this work, the feature-map constructed from a single sequence t is of length 5324

(for the 4-fold kernel). While reducing the dimension using the mRMR method, the most

information-carrying triplets and their positions are identified. The identification of position is direct

because the feature-map is constructed in such a way that the first 1331 features correspond to

the segment t1, the next 1331 features correspond to the segment t2, and so on. The same is

true for the 8-fold kernel. The triplets and their locations are identified using the mRMR method

while performing the 4-class classification problem of GPCRs using data set-I. Similarly, in

Class-A subfamily detection, some triplets and their locations are identified using data set-II.

Some of the triplets identified while classifying these two data sets are shown in TABLE-

2.10 and TABLE-2.11 respectively. To identify the particular family or subfamily related

Table 2.10: Identified motifs and their locations (data set-I : GPCR Family)

Triplet  ti†   Triplet  ti   Triplet  ti   Triplet  ti   Triplet  ti
GHG      t2    DGJ      t1   BAA      t1   EDB      t2   JAH      t2
AEJ      t1    IAB      t2   IHK      t1   BCA      t1   GBE      t4
AJA      t2    HKJ      t2   FAC      t2   BIH      t3   BDI      t1
AAA      t4    JIJ      t4   IKG      t1   BAB      t4   AAD      t2
AAE      t3    CGH      t1   EAC      t2   JKE      t4   JGB      t2
CKE      t3    AJH      t3   HFK      t3   KCB      t2   JJH      t2
DIH      t3    IHC      t3   DJE      t2   ADH      t1   FDJ      t2
IHA      t3    JJI      t2   CIA      t1   HAF      t2   KJH      t1
KJE      t3    HHA      t3   JDJ      t2   IHE      t3   EIH      t1
AGD      t4    AFA      t2   AAG      t1   BCE      t1   AAD      t1

† Refer to TABLE-2.2 for segment locations.

Table 2.11: Subfamily-specific motifs and their locations (8-fold kernel) (data set-II : Class-A subfamily)

Triplet  ti  Subfamily   Triplet  ti  Subfamily   Triplet  ti  Subfamily
EIG      t5  RHD         JCB      t4  OLF         BKE      t2  CAN
GHD      t4  NUC         GEJ      t6  AMN         FJI      t8  CAN
IIF      t1  HMP         KAB      t4  PEP         CKB      t6  GRH
BAJ      t4  PEP         IEF      t4  AMN         HIJ      t5  PRS
AIB      t3  PEP         AAH      t4  PEP         KDA      t3  VIR
KAJ      t2  OLF         JFJ      t8  PEP         IEJ      t4  TRH
JAK      t5  RHD         FGA      t2  LYS         DBJ      t4  MEL
HAA      t6  NUC         BJH      t4  AMN         AJK      t2  MEL
AGH      t4  RHD         FAI      t6  PEP         KAJ      t5  LYS
BFJ      t8  AMN         BDB      t8  OLF         DIA      t5  GRH

† Refer to TABLE-2.2 for segment locations.

to a particular triplet, the reduced feature-map needs to be considered. The procedure to

identify the relevant class is described in 2.3.5.1. The subfamilies corresponding to some

of the features are listed in TABLE-2.11.
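Since 1331 = 11³, the position decoding described above can be sketched in a few lines. The 11-letter reduced alphabet 'A'..'K' (as the triplets in TABLE-2.10 suggest) and 0-based indexing are assumptions for illustration only.

```python
# Sketch of how a flat feature index maps back to (segment, triplet),
# assuming the 11-letter reduced alphabet 'A'..'K' (11**3 = 1331 triplets
# per segment) and 0-based indices; the exact encoding in the thesis may differ.
ALPHABET = [chr(ord('A') + i) for i in range(11)]  # 'A'..'K'
TRIPLETS_PER_SEGMENT = 11 ** 3                     # 1331

def decode_feature(index, n_segments=4):
    """Return (segment_label, triplet) for a flat feature index."""
    assert 0 <= index < n_segments * TRIPLETS_PER_SEGMENT
    segment = index // TRIPLETS_PER_SEGMENT        # 0 -> t1, 1 -> t2, ...
    t = index % TRIPLETS_PER_SEGMENT
    a, rem = divmod(t, 11 ** 2)
    b, c = divmod(rem, 11)
    return f"t{segment + 1}", ALPHABET[a] + ALPHABET[b] + ALPHABET[c]

print(decode_feature(0))     # ('t1', 'AAA')
print(decode_feature(1331))  # ('t2', 'AAA')
```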

2.4.7 Effect of Selected Reduced Feature Set

After constructing the proposed string kernel, the number of features is reduced using the

mRMR method. The effect of the number of features on classification accuracy is presented

in Fig. 2.7. It can be observed that classification accuracy saturates beyond a certain

number of features. It is inferred that most of the features in the feature-map are redundant;

only a few features suffice to classify GPCRs. This justifies the use of mRMR to reduce

the number of features.
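The reduction step can be illustrated with a greedy mRMR sketch. The standard max-relevance/min-redundancy criterion with discrete mutual information is assumed here; the implementation used in the thesis may differ in detail.

```python
# A sketch of greedy mRMR selection (an assumption: the standard criterion
# of picking, at each step, the feature maximizing I(f; class) minus the
# mean mutual information with the already-selected features).
import math
from collections import Counter
import numpy as np

def mutual_info(x, y):
    """I(X;Y) for discrete sequences, in nats."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def mrmr(X, labels, n_select):
    remaining = list(range(X.shape[1]))
    relevance = [mutual_info(X[:, f], labels) for f in remaining]
    selected = [int(np.argmax(relevance))]      # most relevant feature first
    remaining.remove(selected[0])
    while len(selected) < n_select and remaining:
        def score(f):                           # relevance minus redundancy
            red = np.mean([mutual_info(X[:, f], X[:, s]) for s in selected])
            return relevance[f] - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```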


Figure 2.7: Number of features versus classification accuracy (%) in GPCR class-A subfamily prediction (a 17-class problem).

2.5 Contribution of this Chapter

In this work, two kernels are designed to classify GPCR sequences using available structural

information. An improved accuracy is achieved in both the GPCR family and the GPCR

Class-A subfamily classification problems by using kernel classifiers. A comparison with

GPCRPred and GPCRBind shows that these kernels can improve the classification accuracy.

A few triplets or motifs relevant to ligand binding processes, and their locations, are

identified. This class of methods can also classify other types of protein sequences with

a priori structural information.

Chapter 3

Expectation Maximization in Random

Projected Spaces to Find Motifs in DNA

sequences

3.1 Introduction

The motif finding problem has received attention in the field of computational biology over

the last two decades, given the availability of large genomic sequences and the necessity of

finding conserved regions in those sequences to identify transcription factor binding sites.

In the motif finding problem, the task is to identify short conserved substrings, or motifs, in a

data set of long strings. The problem becomes more difficult if the short substrings are not

identical themselves but carry some form of mutations. In that case, the problem is

to find short substrings which are approximately conserved across the data set. For example,

in the (l, d) motif finding problem, given a set S of N sequences of length L, the task

is to find a substring m of length l which appears frequently in S, accompanied by mutations

in d random positions. An example is the (15, 4) motif finding problem. The problem,

known as the challenge problem, was introduced in [15]. Later, in [16], a mathematical analysis

was given explaining the inherent intractability of the problem. Expectation Maximization

(EM) based techniques are popular for finding the motif model from the data set in such a

way that the motif model is maximally dissimilar to the background model. The solution

of this problem leads to the identification of hidden motif start positions. In this work, a

variant of the Expectation Maximization (EM) method is discussed, and the solution is achieved

using fewer computations while maintaining the effectiveness of the EM method for


de-novo motif discovery.

3.2 Preliminaries

The Expectation Maximization (EM) method was first introduced in the conserved location

identification problem in [23]. This was a simple model to identify motifs which occur only once

in each sequence. This framework is called the one occurrence per sequence, or oops, model.

Later, in [17], a more accurate and effective model for the EM-based motif identification

problem was established. This model is effective given a reasonably good starting point. To

incorporate some randomness in the EM algorithm, the Monte Carlo Expectation Maximization

method was proposed in [25], where the expectation is calculated through Monte Carlo simulation.

In [18], this concept is used to discover motifs in biomolecular sequences. In [16], a

locality-sensitive hashing method called random projection is proposed. This method can

be used to identify a starting point for EM-based algorithms. In [24], it is shown that the

performance can be improved by choosing projections uniformly instead of choosing them

randomly. In this section, the local alignment problem for motif discovery and two algorithms

related to this work are described.

3.2.1 Local Alignment Problem for Motif Discovery

Let S = {S1, S2, ..., SN} be the data set of N sequences of length Li (i ∈ {1, 2, ..., N}).

Let Sij ∈ Σ be the residue symbol at position j of the ith sequence Si, where Σ is the alphabet of

biomolecules. For DNA sequences, Σ = {A, T, G, C} and |Σ| = 4. If the oops model is

considered, then there are a total of N motifs present in S. This model can be generalized

to any number of motif occurrences in S. In the stochastic model of sequence generation, the

assumption is that the w-mer motif instances and the background (the rest of the sequence,

which is not motif) come from different distributions, i.e. the ratio of different residues in the

w-mer motifs present in S is essentially different from the ratio of residues comprising the

background. The residues comprising the background are drawn from an independent and

identical multinomial distribution θ0. The residues constituting the w-mer motifs in S are

drawn from an independent but not identical multinomial distribution θj, where j (1 ≤ j ≤

w) is the position of the residue in the motif. The distributions are not identical because θj is not identical for different positions j. A motif can be thought of as a sequence whose

residues are drawn from a product of w multinomial distributions: Θ = [θ1, θ2, ..., θw].

Θ, together with θ0, completely characterizes the Position Weight Matrix (PWM), whose each

element wjk = log(θjk/θ0k) is a measure of the dissimilarity between the motif model at the

jth position and the background model. All EM-based algorithms for motif finding iteratively

update the PWM to maximize this dissimilarity. Let Ai ∈ {0, 1}^(Li−w+1) be the indicator

vector containing the information of motif start locations in sequence Si. The value of the jth

element of Ai, Aij, is one if a w-mer motif starts from location j in Si, and zero

otherwise. A = [A1^T, A2^T, ..., AN^T] represents a possible local alignment of S. The total number

of such possible alignments is

\prod_{i=1}^{N} \binom{L_i - w + 1}{|A_i|}, \qquad \text{where } |A_i| = \sum_{l=1}^{L_i - w + 1} A_{il}

is the total number of motifs present in sequence Si. For the oops model, only one element in Ai has a nonzero value,

and a single variable ai is sufficient to store the information: ai = j if Aij = 1.

Corresponding to each A, the model parameters θ0 and Θ can be computed easily. From

θ0 and Θ, the extent to which the motif is conserved can be inferred. For example,

a score Q can be assigned to an alignment A as a measure of conservation of the motifs

such that:

Q = |A| \sum_{k \in \Sigma} \theta_{0k} \ln \theta_{0k} + |A| \sum_{j=1}^{w} \sum_{k \in \Sigma} \theta_{jk} \ln \theta_{jk}    (3.1)

The objective of a motif finding algorithm is to find a suitable alignment A, or equivalently

a model Θ and θ0, such that the Q corresponding to the model is maximized. Given the

large alignment space and the NP-completeness of the problem [50], EM-based algorithms are

employed to solve the problem iteratively.
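The score Q can be computed directly by counting, as in the minimal sketch below. The oops model and the DNA alphabet are assumed; the function and variable names are illustrative, not from the thesis.

```python
# A minimal sketch of the alignment score Q in (3.1) for the oops model:
# theta_0 and Theta are estimated by counting residues, then
# Q = |A| * sum_k theta_0k ln theta_0k + |A| * sum_{j,k} theta_jk ln theta_jk.
import math
from collections import Counter

SIGMA = "ATGC"

def score_Q(seqs, starts, w):
    n = len(seqs)                               # |A| = N under the oops model
    motifs = [s[a:a + w] for s, a in zip(seqs, starts)]
    # background frequencies from the non-motif residues
    bg = Counter()
    for s, a in zip(seqs, starts):
        bg.update(s[:a] + s[a + w:])
    total = sum(bg.values())
    theta0 = {b: bg[b] / total for b in SIGMA}
    q = n * sum(t * math.log(t) for t in theta0.values() if t > 0)
    # positional motif frequencies, one multinomial per motif column
    for j in range(w):
        col = Counter(m[j] for m in motifs)
        q += n * sum((col[b] / n) * math.log(col[b] / n)
                     for b in SIGMA if col[b] > 0)
    return q
```

A perfectly conserved alignment makes the second term zero (each column has a single residue), so its Q is higher than that of a scrambled alignment.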

3.2.2 Expectation Maximization (EM) Method for Motif Discovery

The EM algorithm is used to learn the parametric model of a partially-observable stochastic

process. In the multiple local alignment problem, the dataset S containing the sequences is the

observable data. The indicator variable A is not observable; it corresponds to hidden data.

Θ and θ0 are model parameters. The objective is to find a model which maximizes the

marginal probability P(S|Θ, θ0). A simple EM-based model is presented in this section for

formalising the concepts. A random alignment is taken at the initialization step by randomly

initializing A. From A, the model parameters Θ and θ0 are updated. The update scheme can

be simplified by counting residues to calculate the elements of Θ and θ0. From these model

parameters, the expected value of each element of A is calculated. This two-step iteration goes

on until convergence. At the rth iteration, in the E-step, the expected value of Aik is calculated as:

A_{ik} = E[A_{ik} | S_i, \Theta^r, \theta_0^r] = 1 \times \Pr(A_{ik} = 1 | S_i, \Theta^r, \theta_0^r)
       = \Pr(S_i | A_{ik} = 1, \Theta^r, \theta_0^r) \left[ \frac{\Pr(A_{ik} = 1 | \Theta^r, \theta_0^r)}{\Pr(S_i | \Theta^r, \theta_0^r)} \right]    (3.2)

In (3.2), Aik is a binary variable. In the last step, Bayes' Theorem is used. As motifs occur

independently in different sequences, Aik depends only on sequence Si. The numerator

inside the bracket is the prior probability that a motif starts at position k, which is taken to

be the same (1/(Li − w + 1)) for all k. Θ^r and θ0^r are the last updated values of Θ and θ0.

Without calculating the bracketed term in (3.2) explicitly, at each step Aik is divided by

\sum_{\forall k} A_{ik}, using the fact that \sum_{\forall k} \Pr(A_{ik} = 1 | S_i, \Theta^r, \theta_0^r) = 1. The likelihood is determined as:

\Pr(S_i | A_{ik} = 1, \Theta^r, \theta_0^r) = \prod_{u \notin \mu} \prod_{c \in \Sigma} \theta_{0c}^{\delta_{c S_{iu}}} \prod_{m=1}^{w} \prod_{b \in \Sigma} \theta_{mb}^{\delta_{b S_{i,k+m-1}}}    (3.3)

In (3.3), µ represents the set of indices from k to k + w − 1 in Si. δ is the Kronecker delta

function: δab = 1 if a = b, and δab = 0 otherwise. In (3.3), the terms corresponding to the

background model vary insignificantly given a sufficiently large value of Li. So, it is safe

to assume that

\Pr(S_i | A_{ik} = 1, \Theta^r, \theta_0^r) \propto \prod_{m=1}^{w} \prod_{b \in \Sigma} \theta_{mb}^{\delta_{b S_{i,k+m-1}}}.

The constant arising due to this proportionality is taken care of in the normalization process of Aik, as described earlier.

In the M-step, Θ^{r+1} is found for which the probability \Pr(S, A | \Theta) (or equivalently its logarithm) is

maximized. Here, the term θ0 is dropped because of its insignificant contribution.

The update procedure is done for 1 ≤ j ≤ w and b ∈ Σ as:

\theta_{jb}^{r+1} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{L_i - w + 1} A_{ik} \, \delta_{b S_{i,k+j-1}}    (3.4)

In (3.4), a weighted sum of frequencies of the different b ∈ Σ is used. After a few iterations,

if the initialization is reasonably good, the algorithm will adjust the weights in favour of the

true motif start locations. Sophisticated and advanced EM-based algorithms use word

statistics to generate seeds for initialization.
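The two-step iteration above can be sketched as follows for the oops model with a uniform prior. The small pseudocount is an assumption to avoid zero probabilities, not part of the text; names are illustrative.

```python
# A compact sketch of one EM iteration for the oops model, following
# (3.2)-(3.4): the E-step scores every start position k under the current
# motif model theta, and the M-step re-estimates theta from expected counts.
import numpy as np

SIGMA = "ATGC"
IDX = {b: i for i, b in enumerate(SIGMA)}

def em_step(seqs, theta, w, pseudo=1e-3):
    """theta: (w, 4) array of column distributions; returns (theta_new, posteriors)."""
    new_counts = np.full((w, 4), pseudo)         # pseudocount avoids zeros
    posteriors = []
    for s in seqs:
        n_pos = len(s) - w + 1
        lik = np.empty(n_pos)
        for k in range(n_pos):                   # Pr(S_i | A_ik = 1), up to a constant
            lik[k] = np.prod([theta[m, IDX[s[k + m]]] for m in range(w)])
        post = lik / lik.sum()                   # normalization replaces the bracket in (3.2)
        posteriors.append(post)
        for k, p in enumerate(post):             # expected counts for the M-step (3.4)
            for m in range(w):
                new_counts[m, IDX[s[k + m]]] += p
    theta_new = new_counts / new_counts.sum(axis=1, keepdims=True)
    return theta_new, posteriors
```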


3.2.3 Random Projection Method

Random Projection is used for dimensionality reduction in the context of image and text

data[? ]. Similar concept was first introduced in motif finding problem in [16]. The

algorithm deals with a projected version of the motif instances. The algorithm is performed

byn independent trial. At each trial of the algorithm, a random projection of lengthk where

0 < k < w is selected. The projection is a set,P , of indices chosen uniformly at random

from the set{1, 2, 3, ..., l} without replacement. Ifx is aw-mer, then letf(x) be thek-mer

that results from concatenating the bases at the selectedk positions ofx. A hash table

containing all possible combination ofk-length strings from the alphabet,Σ is constructed.

Eachw-mer from the dataset is now has a corresponding entry in the hash table. Ifx is

considered as a point in anw-dimensional Hamming space,f(x) is the projection ofx onto

a k-dimensional subspace. Eachx has a corresponding entry in the hash table.w-mers

from the data set are chosen randomly and hashed. IfM be the true motif, then let the set

of mutated positions in a motif instance (a mutated version of the motif) bePi . The motif

instance will be hashed to the index corresponding tof(M) if P ∩ Pi = ∅, where∅ is the

empty set. The trade off in random projection is maintained by selecting appropriatek. If

k is large then there is less probability ofP ∩ Pi being∅. In this case, true motif instances

will not hash to the index corresponding tof(M) in the hash table. Ifk is small, then a

large number of spuriousw-mers will make it difficult forf(M) to be identified.
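One trial of the hashing scheme above might look like the sketch below; the names are illustrative. The most populated bucket is a candidate for f(M), since motif instances whose mutated positions avoid P all land there.

```python
# A minimal sketch of one random-projection trial: choose k of the w
# positions, hash every w-mer in the data set by its projected k-mer f(x),
# and report the most populated bucket as a candidate projected motif.
import random
from collections import defaultdict

def projection_trial(seqs, w, k, rng=random):
    P = sorted(rng.sample(range(w), k))          # projection positions (0-based)
    buckets = defaultdict(list)
    for i, s in enumerate(seqs):
        for j in range(len(s) - w + 1):
            x = s[j:j + w]
            fx = "".join(x[p] for p in P)        # f(x): the projected k-mer
            buckets[fx].append((i, j))           # remember where x came from
    best = max(buckets, key=lambda key: len(buckets[key]))
    return P, best, buckets[best]
```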

3.3 Motivation

The intuition behind the random projection algorithm is incorporated in this work to identify

an appropriate motif model from the data using EM. The purpose of random projection

is to identify a reasonably good starting point for EM. The projection of the potential motif is

identified by the random projection method. Then the initial value for the motif model, Θ, is made

biased towards the projection. For example, given motif length w = 5, if the projection is

P = {1, 3, 4} and the potential motif projection is CAC, then the initial value for the motif

model, Θ, can be given as:

Θ = [ 0.1  0.25  0.6  0.2  0.25
      0.1  0.25  0.1  0.1  0.25
      0.2  0.25  0.1  0.1  0.25
      0.6  0.25  0.2  0.6  0.25 ]

where the rows of Θ represent the probabilities of the nucleotides A, T, G and C respectively,

and the columns of Θ represent different positions in the motif. The first, third and

fourth columns are biased towards C, A and C respectively. The second and fifth columns of

Θ are kept equiprobable, as random projection does not provide any information for them.

It might be advantageous, from a computational point of view, not to use these equiprobable

columns in the EM method. This work explores this possibility.
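The biased initialization in the example can be sketched as follows. The off-bias mass is split evenly here, slightly unlike the hand-picked 0.1/0.2 values in the matrix above; the function name is illustrative.

```python
# A sketch of the biased initialization described above: positions in the
# projection P get most of their mass on the projected base; all other
# positions stay equiprobable. Rows follow the text's A, T, G, C order.
import numpy as np

SIGMA = "ATGC"

def biased_theta(w, P, projected, main=0.6):
    theta = np.full((4, w), 0.25)               # equiprobable by default
    rest = (1.0 - main) / 3.0                   # remaining mass split evenly
    for pos, base in zip(P, projected):
        col = np.full(4, rest)
        col[SIGMA.index(base)] = main
        theta[:, pos - 1] = col                 # the text uses 1-based positions
    return theta
```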

3.4 Proposed Method : EM in randomly projected spaces

Let the motif model be Θ. The projected motif model corresponding to the projection P is

represented by Θ_P. Given Θ_P, the Q-function in (3.1) can be written as:

Q = |A| \sum_{k \in \Sigma} \theta_{0k} \ln \theta_{0k} + |A| \sum_{j \in P} \sum_{k \in \Sigma} \theta_{jk} \ln \theta_{jk}    (3.5)

In the E-step of (3.2), the expected value of the indicator variable, Aik, is calculated as:

A_{ik} = E[A_{ik} | S_i, \Theta_P^r, \theta_0^r] = 1 \times \Pr(A_{ik} = 1 | S_i, \Theta_P^r, \theta_0^r)
       = \Pr(S_i | A_{ik} = 1, \Theta_P^r, \theta_0^r) \left[ \frac{\Pr(A_{ik} = 1 | \Theta_P^r, \theta_0^r)}{\Pr(S_i | \Theta_P^r, \theta_0^r)} \right]    (3.6)

The likelihood in (3.3) can be modified as:

\Pr(S_i | A_{ik} = 1, \Theta_P^r, \theta_0^r) = \prod_{u \notin \mu} \prod_{c \in \Sigma} \theta_{0c}^{\delta_{c S_{iu}}} \prod_{m \in P} \prod_{b \in \Sigma} \theta_{mb}^{\delta_{b S_{i,k+m-1}}}    (3.7)

The update in (3.4) is done for j ∈ P and b ∈ Σ as:

\theta_{jb}^{r+1} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{L_i - w + 1} A_{ik} \, \delta_{b S_{i,k+j-1}}    (3.8)

Thus, only the values of the model parameters corresponding to the projected positions are

calculated. This reduces the computational complexity.


3.4.1 An example of Projected Motif Model

Let the motif length w be five. Suppose in some iteration the value of the model parameter is

Θ = [ 0.1  0.1  0.6  0.2  0.1
      0.1  0.1  0.1  0.2  0.1
      0.2  0.7  0.1  0.2  0.1
      0.6  0.1  0.2  0.4  0.7 ]

If the projected positions are P = {2, 4, 5}, then the projected motif model is

Θ_{[2,4,5]} = [ 0.1  0.2  0.1
                0.1  0.2  0.1
                0.7  0.2  0.1
                0.1  0.4  0.7 ]
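In code, the projected model is simply a column slice of Θ, as this sketch with the matrices above shows (positions are 1-based in the text, 0-based here).

```python
# The projected motif model Theta_P is a column slice of Theta; a sketch
# with the example values above, using numpy fancy indexing.
import numpy as np

theta = np.array([[0.1, 0.1, 0.6, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.2, 0.1],
                  [0.2, 0.7, 0.1, 0.2, 0.1],
                  [0.6, 0.1, 0.2, 0.4, 0.7]])

P = [2, 4, 5]                                 # 1-based projected positions
theta_P = theta[:, [p - 1 for p in P]]        # columns 2, 4 and 5
```

Only these |P| columns need to be touched in the E- and M-steps, which is where the computational saving comes from.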

3.5 Data Set

To validate the efficiency of the proposed algorithm, two types of datasets are used. The first

one is a synthetic data set where different instances of (l, d) motifs are generated randomly.

A second data set is prepared from the JASPAR website [51].

• Synthetic Data set: The conventional procedure to test the effectiveness of a motif

finding algorithm is to use a synthetic data set containing (l, d) motif instances and

apply the algorithm to the dataset to verify its efficiency. Although there are some

fundamental differences between a synthetic data set and a true biological data set, this

type of testing platform provides a reasonable approximate measure of the effectiveness

of motif finding algorithms. In the synthetic data set used in this work, the motif

instances are generated by taking a random l-length genomic sequence and mutating

d positions randomly. These motif instances are implanted in random locations of

random background sequences with a fixed GC fraction. 20 such background

sequences containing different instances of the (l, d) motif are used. Different background

lengths are used to vary the difficulty level of the problem.

• JASPAR Data set: This data set is prepared by taking some of the experimentally

verified biological motif instances (transcription factors of Eukaryotes) from the JASPAR

website [51]. These instances are implanted in random background sequences



Figure 3.1: Comparison of conventional EM and the proposed method (the error measure is the average number of mismatches between the true motif and the identified motif).

(with different lengths) at random locations to increase the difficulty level of the

problem. These backgrounds act as promoter regions. The GC fraction of the

promoter regions is taken to be 0.45.
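The synthetic construction described above can be sketched as follows. Whether implantation replaces background bases or inserts between them is not specified in the text; this sketch replaces, keeping each sequence length equal to the background length, and all names are illustrative.

```python
# A sketch of the synthetic (l, d) data set construction: mutate d random
# positions of an l-length motif and implant each instance at a random
# location in a background with a fixed GC fraction.
import random

def make_dataset(motif, d, n_seqs, bg_len, gc=0.45, rng=random):
    l = len(motif)
    seqs, starts = [], []
    for _ in range(n_seqs):
        inst = list(motif)
        for p in rng.sample(range(l), d):        # mutate d distinct positions
            inst[p] = rng.choice([b for b in "ACGT" if b != inst[p]])
        bg = "".join(rng.choice("GC") if rng.random() < gc else rng.choice("AT")
                     for _ in range(bg_len))     # background with GC fraction gc
        pos = rng.randrange(bg_len - l + 1)      # implant location
        seqs.append(bg[:pos] + "".join(inst) + bg[pos + l:])
        starts.append(pos)
    return seqs, starts
```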

3.6 Experimental Results and Analysis

3.6.1 Results on synthetic data set

The synthetic data set of 20 sequences with a planted (15, 4) motif (oops model) is prepared

as described in section 3.5. The background length is varied from 50 to 650. With an

increase in the background length, the difficulty level of the problem increases. The results

corresponding to the conventional EM and the proposed method are shown in Fig. 3.1. For a fixed

background length, the experiment is performed on 50 different synthetic datasets. The error

is calculated as the average number of mismatches between the true motif and the identified motif

(the projected version of the identified motif in the case of the proposed method). The proposed

method shows better results on difficult problems.

Another measure of performance is the number of mismatches between the motif start locations

identified via the indicator variable Aik and the true motif start locations. The results

corresponding to this measure are shown in Fig. 3.2. Although the proposed method seems

inferior according to this measure, it should be noted that only a few correct motif start

locations are sufficient for identifying the true motif. If a consensus is taken from the identified

motif start locations, a few erroneous motif start locations will not lead to an erroneous

result. Another study is performed to examine the effect of different projection lengths on

the performance of the proposed method. The projection length is varied from 5 to 15.



Figure 3.2: Comparison of conventional EM and the proposed method (the error measure is the average number of mismatches between the motif start locations identified via the indicator variable Aik and the true motif start locations).


Figure 3.3: Performance for different projection lengths in a (15, 4) problem (the error measure is the average number of mismatches between the true motif and the identified motif).

The result in Fig. 3.3 shows that the performance does not vary significantly with projection

length. The marginal improvement of the algorithm for longer projection lengths comes at

the expense of an increased computational cost.

3.6.2 Results on JASPAR data set

This dataset is prepared from the JASPAR website. Various motifs are grouped together

according to their length. The lengths of these motifs and the results corresponding to the

conventional EM and the proposed method are shown in TABLE-3.1.

It should be noted that even if only 50% of the motif start positions are identified

correctly, a consensus at the end can produce a satisfactory result.
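The consensus step mentioned above can be sketched as a per-column majority vote over the w-mers at the predicted start locations; the function name is illustrative.

```python
# A sketch of the consensus step: take the w-mer at each predicted start
# location and vote per column, so a few erroneous start locations do not
# corrupt the recovered motif.
from collections import Counter

def consensus(seqs, starts, w):
    motifs = [s[a:a + w] for s, a in zip(seqs, starts)]
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*motifs))

# e.g. one wrong start out of three still recovers the planted motif:
print(consensus(["GGATGCGG", "CCATGCCC", "ATGCATGC"], [2, 2, 3], 4))  # ATGC
```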


Table 3.1: Performance on JASPAR data set

                                                          Average number of error
JASPAR Group              Motif Length  Background Length  EM     Proposed Method
MA0081, MA0093, MA0096    7             200                0.18   0.24
MA0067                    8             200                0.13   0.19
MA0005, MA0009            11            300                0.28   0.36
MA0114                    13            300                0.21   0.25
MA0116                    15            500                0.30   0.39
MA0060                    16            500                0.23   0.33
MA0007                    22            500                0.40   0.53

Note: The average number of error is shown as the fraction of the total number of motif locations that are erroneously predicted.

3.7 Conclusion and Contribution

A method is proposed to perform EM through a projected motif model to identify motifs in

genomic sequences. Given a reasonably good initial starting point, the deterministic EM

algorithm converges quickly to a local optimum. In that case, the proposed method can

be used to reduce the computation. The projected motif model is smaller than the

entire motif model, which reduces the computational cost while maintaining the effectiveness

of EM to some extent. The limitation of this method is overcome by using a variant

of the Monte-Carlo Expectation Maximization method, which will be discussed in the next

chapter.

Chapter 4

Monte-Carlo Expectation Maximization

for Finding Motifs in DNA Sequences

4.1 Preliminaries

The Expectation Maximization (EM) method was introduced in the conserved site identification

problem by Lawrence and Reilly [23]. This method identifies motifs which occur only

once in each sequence. Later, Bailey and Elkan provided a more generalized model for the EM-

based motif identification problem [17]. This iterative model is effective given a reasonably

good starting point. If the initial guess about the starting point of EM is not

good enough, the deterministic algorithm converges quickly to a local optimum. To

ameliorate this limitation, the Monte Carlo Expectation Maximization (MC EM) method

was proposed by Wei and Tanner, where the expectation step is calculated through Monte Carlo

simulation [25]. This incorporates randomness in the EM algorithm. In the Monte Carlo EM

Motif Discovery Algorithm (MCEMDA) [18], this concept is used to discover motifs in DNA

sequences. In this section, the problem of multiple local alignment for motif discovery is

discussed again, along with the two related algorithms, EM [17] and MCEMDA [18],

for formalizing the concepts.

4.1.1 Multiple Local Alignment for Motif Discovery

Let S = {S1, S2, ..., SN} be the dataset containing N sequences of length Li (i ∈ {1, 2, ..., N}).

Let Sij ∈ Σ denote the residue symbol at position j of the ith sequence Si, wherein Σ is

the alphabet of biomolecules. For DNA sequences, Σ = {A, T, G, C} and |Σ| = 4. If the One

Occurrence Per Sequence model (oops model) is assumed, then N motifs are present in S.

This model can be generalized to any number of motif occurrences in S. In the stochastic

model of sequence generation, the assumption is that the w-mer motifs and the background

(the non-motif portion of a sequence) are generated from different A-T-G-C distributions, i.e. the

ratio of different residues in the w-mer motifs present in S is essentially different from that of

the background. The background is drawn from an independent and identical multinomial

distribution θ0. The residues constituting the w-mer motifs in S are drawn from an independent

but not identical multinomial distribution θj, where j (1 ≤ j ≤ w) is the position of the

residue in the motif. The A-T-G-C distributions θj vary with j. Any motif present in S can

be considered as a sequence whose residues are drawn from a product of w multinomial

distributions: Θ = [θ1, θ2, ..., θw]. The joint distribution Θ, together with the background

distribution θ0, completely characterizes the Position Weight Matrix (PWM), whose each

element wjk = log(θjk/θ0k) (where k ∈ Σ) is a measure of the dissimilarity between the

motif model at the jth position and the background model. All EM-based motif finding

algorithms iteratively update the PWM to maximize this dissimilarity. Let Ai ∈ {0, 1}^(Li−w+1) be

the indicator vector containing the information of motif start locations in sequence Si. The

value of the jth element of Ai, Aij, is one if a w-mer motif starts from location j in Si, and zero

otherwise. A = [A1^T, A2^T, ..., AN^T] represents a possible local alignment of S,

and the total number of such possible alignments is

\prod_{i=1}^{N} \binom{L_i - w + 1}{|A_i|}, \qquad \text{where } |A_i| = \sum_{l=1}^{L_i - w + 1} A_{il}

is the total number of motifs present in sequence Si. For the oops model, |Ai| = 1 and a

single variable ai is sufficient to store the information: ai = j if Aij = 1. For

a given A, the model parameters θ0 and Θ can be computed. From θ0 and Θ, one can infer

the extent to which the motif is conserved. For example, a score Qn can be assigned

to an alignment A as a measure of conservation of the motifs such that:

Q_n = |A| \sum_{k \in \Sigma} \theta_{0k} \ln \theta_{0k} + |A| \sum_{j=1}^{w} \sum_{k \in \Sigma} \theta_{jk} \ln \theta_{jk}    (4.1)

The first term corresponds to the background uncertainty and the second term relates to

that of the motif model. The objective of a motif finding algorithm is to find a suitable

alignment A, or equivalently a model Θ and θ0, such that the corresponding Qn of the

model is maximized. Given the large alignment space and the NP-completeness of the problem

[50], EM-based algorithms are employed to solve this problem iteratively.


4.1.2 Expectation Maximization in Motif Finding

The EM algorithm is employed to learn the parametric model of a partially-observable

stochastic process. In the multiple local alignment problem, the dataset S containing the

sequences is the observable data. The indicator variable A is not observable and corresponds

to hidden data. The model parameters are Θ and θ0. The objective here is to find a

model which maximizes the marginal probability P(S|Θ, θ0). A simple EM-based model

is presented in this sub-section to formalize the concept. A random alignment is taken at

the initialization step with a random choice of A. From A, the model parameters Θ and θ0

are updated. The model update scheme can be simplified by counting A-T-G-C residues to

compute the elements of Θ and θ0. From these model parameters, the expected value of

each element of A is calculated. This two-step iteration goes on until convergence. At

the rth iteration, in the E-step, the expected value of Aij is calculated as:

A_{ij} = E[A_{ij} | S_i, \Theta^r, \theta_0^r] = 1 \times \Pr(A_{ij} = 1 | S_i, \Theta^r, \theta_0^r)
       = \Pr(S_i | A_{ij} = 1, \Theta^r, \theta_0^r) \left[ \frac{\Pr(A_{ij} = 1 | \Theta^r, \theta_0^r)}{\Pr(S_i | \Theta^r, \theta_0^r)} \right]    (4.2)

In (4.2), Aij ∈ {0, 1} is a binary variable. In the final step, Bayes' Theorem is used. As

motifs occur independently in different sequences, Aij depends only on the sequence Si.

The numerator inside the bracket is the prior probability that a motif starts at position j,

which is taken to be the same, (Li − w + 1)^{-1}, for all j. The model parameters in the rth iteration

are denoted as Θ^r and θ0^r. Without calculating the bracketed term in (4.2) explicitly, at

each step Aij is divided by \sum_{\forall j} A_{ij}, using the fact that \sum_{\forall j} \Pr(A_{ij} = 1 | S_i, \Theta^r, \theta_0^r) = 1. The

likelihood is determined as:

\Pr(S_i | A_{ij} = 1, \Theta^r, \theta_0^r) = \prod_{u \notin \mu} \prod_{c \in \Sigma} \theta_{0c}^{\delta_{c S_{iu}}} \prod_{m=1}^{w} \prod_{b \in \Sigma} \theta_{mb}^{\delta_{b S_{i,j+m-1}}}    (4.3)

In (4.3), µ represents the set of indices from j to j + w − 1 in Si. δ is the Kronecker

delta function: δab = 1 if a = b, and δab = 0 otherwise. In (4.3), given a sufficiently large

value of the length Li of the ith sequence, the first term, corresponding to the background

model, remains almost constant. So, it can be assumed that

\Pr(S_i | A_{ij} = 1, \Theta^r, \theta_0^r) \propto \prod_{m=1}^{w} \prod_{b \in \Sigma} \theta_{mb}^{\delta_{b S_{i,j+m-1}}}.

The proportionality constant is taken care of during the normalization process of Aij.


In the M-step, Θ^{r+1} is found for which the probability \Pr(S, A | \Theta) (or equivalently its

logarithm) is maximized. Here, the term θ0 is omitted as it is nearly constant. The model

parameter is updated for 1 ≤ t ≤ w and b ∈ Σ as:

\theta_{tb}^{r+1} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{L_i - w + 1} A_{ij} \, \delta_{b S_{i,j+t-1}}    (4.4)

In (4.4), a weighted sum of frequencies of the different b ∈ Σ is used. After a few iterations

(with a good initialization), the algorithm will converge towards the true motif start locations.

An advanced EM-based algorithm such as MEME uses word statistics in order to

identify a motif having the best statistical relevance with respect to the background [17]. The

parameter θ0 is an example of a low-order background model which considers the frequency

of individual letters. Additionally, MEME takes into account the frequency of words (combinations

of A-T-G-C). In a sense, it uses a higher-order Markov background model and thus

is more effective than a naive EM-based motif finding algorithm.
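To illustrate the difference, the sketch below scores a word under an order-1 Markov background built from conditional pair frequencies instead of single-letter frequencies alone. This is not MEME's exact background model, and the function name is illustrative.

```python
# A sketch of an order-1 Markov background model: the log-probability of a
# word is the log-frequency of its first letter plus the conditional
# log-frequencies of each successive letter pair observed in the data.
import math
from collections import Counter

def markov1_logprob(word, seqs):
    pair, single = Counter(), Counter()
    for s in seqs:
        for a, b in zip(s, s[1:]):
            pair[(a, b)] += 1     # counts of letter pairs (a followed by b)
            single[a] += 1        # counts of conditioning letters
    lp = math.log(single[word[0]] / sum(single.values()))
    for a, b in zip(word, word[1:]):
        lp += math.log(pair[(a, b)] / single[a])
    return lp
```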

4.1.3 Monte Carlo EM Motif Discovery Algorithm

In MCEMDA [18], randomness is introduced in the M-step by carrying out a Monte Carlo

simulation to update Θ. The deterministic averaging makes EM a local greedy search algorithm.

The randomness introduced in MCEMDA may help the algorithm to escape from

local optima and thus may increase the chance of producing a better solution, as described

in [18]. The introduction of this randomness makes MCEMDA more effective compared to

the conventional EM algorithm. In the M-step of MCEMDA, an integer value is assigned

to ai according to the distribution:

u(a_i = l \mid S_i, \Theta^r) = \frac{\prod_{j=1}^{w} \prod_{b \in \Sigma} \left( \theta_{jb}^r / \theta_{0b}^r \right)^{\delta_{b S_{i,l+j-1}}}}{\sum_{t=1}^{L_i - w + 1} \left[ \prod_{j=1}^{w} \prod_{b \in \Sigma} \left( \theta_{jb}^r / \theta_{0b}^r \right)^{\delta_{b S_{i,t+j-1}}} \right]}    (4.5)

where 1 ≤ ai ≤ Li − w + 1 (refer to sub-section 4.1.1). Then, Θ is updated as:

\theta_{jb}^{r+1} = \frac{\sum_{i=1}^{N} \delta_{b S_{i,a_i+j-1}} + \beta_{jb}}{\sum_{i=1}^{N} \left[ \sum_{b \in \Sigma} \delta_{b S_{i,a_i+j-1}} + \sum_{b \in \Sigma} \beta_{jb} \right]}    (4.6)


The vector β serves as a pseudocount to overcome the problem of zero counts. It can

also be thought of as a Dirichlet prior. The value of ai is drawn multiple times (say m

times) from the distribution given in (4.5). In each simulation, Θ is calculated according to

(4.6), and the Q-function in (4.1) is evaluated. If the Θ corresponding to the best Q-function

is stored and used in the next iteration, then the strategy is called the m-best strategy. On

the other hand, if an average of all model parameters is used in the next iteration, then

the strategy is called the m-average strategy. The m-average strategy is computationally

demanding and slower than the m-best strategy [18].
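The stochastic M-step of (4.5) and (4.6) can be sketched as follows for the oops model. Names are illustrative, and the pseudocount is applied once per cell, a common simplification of the denominator in (4.6).

```python
# A sketch of the stochastic M-step: each start a_i is sampled from the
# likelihood-ratio distribution u of (4.5), and theta is re-estimated from
# the sampled alignment with pseudocounts beta, in the spirit of (4.6).
import numpy as np

SIGMA = "ATGC"
IDX = {b: i for i, b in enumerate(SIGMA)}

def mc_m_step(seqs, theta, theta0, w, beta=0.5, rng=None):
    rng = rng or np.random.default_rng()
    counts = np.full((w, 4), beta)               # beta acts as a Dirichlet prior
    for s in seqs:
        n_pos = len(s) - w + 1
        u = np.empty(n_pos)
        for t in range(n_pos):                   # likelihood ratio, eq. (4.5)
            u[t] = np.prod([theta[j, IDX[s[t + j]]] / theta0[IDX[s[t + j]]]
                            for j in range(w)])
        a_i = rng.choice(n_pos, p=u / u.sum())   # sample one start location
        for j in range(w):                       # accumulate counts, eq. (4.6)
            counts[j, IDX[s[a_i + j]]] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```

Repeating this step m times and keeping the Θ with the best Q-function gives the m-best strategy; averaging the m results gives the m-average strategy.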

4.1.3.1 MCEMDA as a Markov Chain

MCEMDA can be visualized as a Markov chain, as shown in Fig. 4.1. At each step, the

process has the chance of taking a new direction depending on the distribution in (4.5), and

thus it has the potential of avoiding local optima. At each step, the chain can follow m

different directions, as opposed to the single direction of the deterministic EM algorithm. R

such Markov chains are generated (each with a different seed) at the beginning, to get n ≤

R alignments at the end of the chains. Multiple Markov chains may produce identical

alignments. Furthermore, two alignments may not be identical but can be almost similar

to each other (owing to the phase shift in the motif model [18]). For this reason, a clustering

algorithm is used in MCEMDA, using the normalized longest common block as the

distance metric between two alignments. Each cluster-center is considered as a potential

motif.

Figure 4.1: Visualization of the MCEMDA as a Markov chain (number of MC simulations is 3).


4.2 Motivation and Methodology

Consider a family of Markov chains (H_i, i ∈ {1, 2, ..., R}), shown in Fig. 4.1, corresponding to the MCEMDA. The transitions and states are denoted by arrows and circles respectively. The number (m) of MC simulations in each iteration is taken as three. During the first few iterations, H_i visits states which are unreliable. These initial stages, called the burn-in phase, are crucial for H_i in terms of convergence. If it fails to track a suitable path, it may never converge to the actual solution [18]. This situation is shown in Fig. 4.2. The dynamics of the goodness measure, Q_n, is shown for some of the Markov chains with different initializations. A Markov chain converges to the true motif model if Q_n ≥ Q_T. Here, the true motif model refers to the hidden parameter, Θ, to be estimated, or a neighborhood of Θ that produces the same motif as does Θ. Out of R such chains, only a few converge to the true motif model (or to its phase-shifted version).

Figure 4.2: Improvement of the Q-function during the first 400 iterations (number of MC simulations in each stage: m = 3).

As a result, a clustering algorithm is employed at the end of all H_i (i ∈ {1, 2, ..., R}). It would be better to stop unreliable chains at an earlier stage. If the unpromising chains are stopped after the first few iterations, the computational cost reduces. Additionally, the requirement of the clustering stage may be relaxed. This study focuses on an algorithm which stops unpromising Markov chains at the beginning.

4.2.1 Simplified Q Function

The negative entropy-like goodness measure of a motif model is shown in (4.1). Given the motif length, w ≪ L_i, the distribution of background residues is almost constant for different motif positions. As a result, the first term in (4.1) may be omitted and the modified measure


is taken to be:

\[
Q = |A| \sum_{j=1}^{w} \sum_{k\in\Sigma} \theta_{jk} \ln \theta_{jk}
\tag{4.7}
\]

This modification reduces the computation in each iteration while maintaining the true purpose of the measure.
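As a quick illustration of (4.7): the simplified measure rewards conserved columns (where θ ln θ ≈ 0) and penalizes uniform ones. The following is a hypothetical sketch, with |A| taken as the number of aligned sequences (one motif occurrence per sequence):

```python
import numpy as np

def simplified_q(theta, n_sequences):
    """Simplified goodness measure (4.7):
    Q = |A| * sum_j sum_k theta_jk * ln(theta_jk)."""
    return n_sequences * float(np.sum(theta * np.log(theta)))

# A uniform column contributes the most negative entropy; a highly
# conserved column contributes nearly zero, so conserved models score higher.
uniform = np.full((5, 4), 0.25)
conserved = np.array([[0.97, 0.01, 0.01, 0.01]] * 5)
```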

4.2.2 Selection of Promising Markov Chains

The determination of a suitable threshold, Q_T, is essential to identify and stop the unpromising Markov chains. However, finding such a hard threshold on Q proves to be a cumbersome task. To overcome this limitation, the proposed method employs a greedy scheme by initiating itrmax different Markov chains, H_i (i ∈ {1, 2, ..., itrmax}), with random seeds. After the first q iterations (called the look-ahead) of each H_i, the corresponding goodness measure, Q, is assumed to be Q_i (i ∈ {1, 2, ..., itrmax}). Following this look-ahead stage, the algorithm proceeds with the best chain, H_m, where m = argmax_{i ∈ {1,2,...,itrmax}} Q_i. The rest of the chains are discarded.
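The look-ahead scheme above can be sketched as follows. Here each chain is replaced by a placeholder that merely returns a Q value after q iterations; in the actual method, each call would run Monte Carlo EM updates. The function names and the placeholder dynamics are assumptions for illustration.

```python
import numpy as np

def run_chain(seed, q_iters):
    """Stand-in for q 'look-ahead' iterations of one Markov chain.
    Returns the goodness measure Q_i reached after q_iters steps.
    (A real chain would perform Monte Carlo EM updates here.)"""
    chain_rng = np.random.default_rng(seed)
    q = -150.0                             # initial Q on the scale of Fig. 4.2
    for _ in range(q_iters):
        q += chain_rng.uniform(0.0, 1.0)   # placeholder per-step improvement
    return q

def select_promising_chain(itrmax=25, q_iters=15):
    """Greedy look-ahead selection (sub-section 4.2.2): start itrmax chains,
    run each for q_iters iterations, keep only the best-scoring one."""
    scores = [run_chain(seed, q_iters) for seed in range(itrmax)]
    best = int(np.argmax(scores))
    return best, scores[best]
```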

4.2.3 Goodness Measure: ψ

In order to validate the effectiveness of the proposed method, it is assumed that the true motif start locations are available. Given these locations, the effectiveness can be measured from A_ik. Let γ = [γ_1 γ_2 ... γ_N]^T be the known true motif start position vector. In the final iteration, the distribution of the index variable a_i is assumed to be u (refer to (4.5)). Let γ′_i = argmax_{l ∈ {1,2,...,L_i−w+1}} u(a_i = l | S_i, Θ) be the most probable motif start location of the i-th sequence (i ∈ {1, 2, ..., N}). The effectiveness of the proposed algorithm can be evaluated using γ and γ′ = [γ′_1 γ′_2 ... γ′_N]^T. A score, Ψ = (1/N) ∑_{i=1}^{N} ψ_i, is used in this work, where

\[
\psi_i =
\begin{cases}
\dfrac{w - |\gamma'_i - \gamma_i|}{w} & \text{if } |\gamma'_i - \gamma_i| \le w \\[4pt]
0 & \text{otherwise.}
\end{cases}
\tag{4.8}
\]

This metric is similar to the nucleotide-level accuracy (nla) [18], which is defined as:

\[
\mathrm{nla}(\gamma', \gamma) = \frac{1}{|N|} \sum_{i=1}^{|N|} \frac{\gamma'_i \cap \gamma_i}{w}
\tag{4.9}
\]


where γ′_i ∩ γ_i represents the size of the overlapping block between the predicted and observed motif.
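A direct transcription of (4.8) into code might look like this (a sketch; the function name is illustrative):

```python
def psi_score(pred_starts, true_starts, w):
    """Average alignment score Psi per (4.8): each sequence contributes
    (w - |gamma'_i - gamma_i|) / w when the predicted start is within w
    positions of the true start, and 0 otherwise."""
    total = 0.0
    for p, t in zip(pred_starts, true_starts):
        d = abs(p - t)
        if d <= w:
            total += (w - d) / w
    return total / len(true_starts)
```

For example, a prediction shifted by one position with w = 15 scores (15 − 1)/15 ≈ 0.93, matching the phase-shift example discussed later in the chapter.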

4.2.4 Overcoming the Limitation of the Phase Shift

The EM-based algorithms have a propensity to converge to a shifted version of the true motif model. As any shifted version of the motif model also possesses a high score compared to any random alignment, more often than not the EM-based algorithms converge to a shifted version of the true motif model. In a sense, the shifted versions of the true motif model form local minima in motif alignment space. EM-based algorithms have an inherent inability to come out of these local minima. To illustrate this phenomenon, two relatively easy planted (l, d) motif finding problems are considered. The dynamics of the scoring function, Ψ, is shown in Fig. 4.3 for different (l, d) and numbers of MC simulations (m). It is observed that the algorithm frequently converges to the shifted version of the true motif model. Here, the true motif model is considered as the motif model which produces the correct motif without any mutation. The A-T-G-C distribution of the true motif model need not necessarily be the same as that in the planted motifs. As long as the probability of the correct nucleotide is greater than that of the other nucleotides in a particular motif position, the model will produce the correct motif. In Fig. 4.3, the false motif model refers to a model which produces neither the correct motif nor its phase-shifted version.

As this phase-shift phenomenon leads to erroneous results, it would be better if some measure were taken to force the algorithm to converge to the true motif model. This is accomplished by shifting the motif model, Θ, to the left or to the right with a given probability p_shift. After a few iterations, if there is no improvement in the goodness measure, Q, then the algorithm returns to the original Θ. The scheme is illustrated in Fig. 4.4, where the proposed phase-shifting mechanism is marked by the discontinuous lines. A better but more complex option might be to make the shifting probability p_shift dependent on Q, because shifting the current motif model is only helpful when the algorithm reaches the neighborhood of the true motif model. But, in this work, p_shift is kept constant to avoid complexities. If the problem is difficult (i.e. motifs are less conserved), then it is likely that all the EM-based methods will end up producing a wrong result. But if the algorithm somehow manages to reach a nearby model, the proposed shifting mechanism will look around the shifted versions of the current motif model in the hope of finding a better alignment.


Figure 4.3: Convergence of the Monte Carlo EM based algorithm to the shifted version of the true motif model. (a) (15, 3) motif finding problem with m = 1. (b) (15, 3) motif finding problem with m = 3. (c) (12, 2) motif finding problem with m = 1. (d) (12, 2) motif finding problem with m = 3.


Figure 4.4: Flowchart to overcome the phase shift problem.

4.2.5 An Example of Shifted Θ

Let the motif length w be five. The model parameter at the t-th iteration is as follows (rows correspond to A, T, G, C):

\[
\Theta^{(t)} =
\begin{pmatrix}
0.1 & 0.1 & 0.6 & 0.2 & 0.1\\
0.1 & 0.1 & 0.1 & 0.2 & 0.1\\
0.2 & 0.7 & 0.1 & 0.2 & 0.1\\
0.6 & 0.1 & 0.2 & 0.4 & 0.7
\end{pmatrix}
\]

The left-shifted model parameter at the t-th iteration will be:

\[
\Theta^{(t)}_{Left} =
\begin{pmatrix}
0.1 & 0.6 & 0.2 & 0.1 & v^l_A\\
0.1 & 0.1 & 0.2 & 0.1 & v^l_T\\
0.7 & 0.1 & 0.2 & 0.1 & v^l_G\\
0.1 & 0.2 & 0.4 & 0.7 & v^l_C
\end{pmatrix}
\]

Similarly, the right-shifted model parameter at the t-th iteration will be:

\[
\Theta^{(t)}_{Right} =
\begin{pmatrix}
v^r_A & 0.1 & 0.1 & 0.6 & 0.2\\
v^r_T & 0.1 & 0.1 & 0.1 & 0.2\\
v^r_G & 0.2 & 0.7 & 0.1 & 0.2\\
v^r_C & 0.6 & 0.1 & 0.2 & 0.4
\end{pmatrix}
\]


Figure 4.5: Improvement in the (15, 3) motif finding problem.

The empty columns that arise due to this shifting are padded by a non-negative random vector v^x = [v^x_A v^x_T v^x_G v^x_C]^T, x ∈ {l, r}, such that ‖v^x‖_1 = ∑_{k∈Σ} |v^x_k| = 1. If the goodness measure, Q, is not improved due to this shifting by the (t + t′)-th iteration, the algorithm reverts to the pre-stored original motif model obtained at the t-th iteration. In Fig. 4.5, a (15, 3) motif finding problem with a background length of 300 is considered to demonstrate the effectiveness of the proposed shifting technique. It can be observed that the proposed methodology helps in avoiding the local minima that arise due to the high alignment score of the phase-shifted version of the true motif model. The proposed method converges to the true motif while Monte-Carlo EM converges to a phase-shifted version of the true motif. This is reflected in the final value of the scoring function, Ψ, as shown in Fig. 4.5. Here, the proposed method offers an exact match to the true motif with Ψ = 1.0, whereas the Monte-Carlo EM offers Ψ = 0.8 after 500 iterations. In TABLE-4.1, the true motif is shown in the first row. The final motifs produced by the Monte-Carlo EM and the proposed method are shown for comparison. The proposed method converges to the true motif, whereas the Monte-Carlo EM converges to a phase-shifted version (shift = 1) of the true motif of length 15. The values of ψ_c in (4.8) for the consensus sequences are 1 and (15 − 1)/15 = 0.93 for the proposed method and the Monte-Carlo EM respectively.

Table 4.1: Convergence to the true motif model due to shifting

  Method                    Consensus          ψ_c
  True motif                CGCTAAATGAGCTAA
  Monte-Carlo EM solution   -GCTAAATGAGCTAAT   0.93
  Proposed method           CGCTAAATGAGCTAA    1
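The shifting operation on Θ can be sketched with numpy as follows, using the w = 5 example above. The padding column follows the unit-L1-norm random vector described in the text, but the helper name and other implementation details are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_model(theta, direction):
    """Shift the 4 x w motif model one column left or right (sub-section 4.2.5).
    The column vacated by the shift is padded with a random non-negative
    vector v with ||v||_1 = 1, as in Theta_Left / Theta_Right."""
    v = rng.random(4)
    v /= v.sum()                      # random column with unit L1 norm
    if direction == "left":
        return np.column_stack([theta[:, 1:], v])
    return np.column_stack([v, theta[:, :-1]])

# The w = 5 model from the text (rows: A, T, G, C).
theta = np.array([[0.1, 0.1, 0.6, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.2, 0.1],
                  [0.2, 0.7, 0.1, 0.2, 0.1],
                  [0.6, 0.1, 0.2, 0.4, 0.7]])
left = shift_model(theta, "left")
right = shift_model(theta, "right")
```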


4.3 Implementation

Given a set of N sequences, S, the motif width, w, and the number of MC simulations, m, the proposed algorithm starts by randomly initializing an alignment A(0). The look-ahead is assumed to be tmax. The pseudocode of the proposed initialization scheme (described in sub-section 4.2.2) is given in Algorithm 1. It finds a promising motif model, Θ⋆_BEST.

Algorithm 1 Proposed Initialization
 1: Initialize: itrmax, tmax
 2: for itr = 1 to itrmax do
 3:   Initialize: S, A(0), t ← 0, Qmax ← −∞, Q⋆ ← −∞
 4:   Evaluate Θ(0) and Q(0) from A(0)
 5:   Θ⋆_itr ← Θ(0); Qmax ← Q(0)
 6:   while t ≤ tmax do
 7:     t ← t + 1
 8:     for r = 1 to m do
 9:       for i = 1 to N do
10:         draw a sample according to (4.5)
11:       end for
12:       update Θ(t,r) and Q(t,r)
13:     end for
14:     evaluate Θ(t) and Q(t) using the m-best strategy
15:     if Q(t) ≥ Qmax then
16:       Qmax ← Q(t) and Θ⋆_itr ← Θ(t)
17:     end if
18:   end while
19:   if Qmax ≥ Q⋆ then
20:     Θ⋆_BEST ← Θ⋆_itr
21:     Q⋆ ← Qmax
22:   end if
23: end for
24: Output the best initialized motif model Θ⋆_BEST

Once the best initialization is identified, the Markov chain is started using the initialized motif model, Θ⋆_BEST. At each step, when the motif model, Θ(t), is updated, the algorithm calls a function which shifts Θ(t) to the left or right with a given probability, p_shift, as described in sub-section 4.2.5. This shifting mechanism is performed by the function ModelShift(), and its pseudocode is given in Algorithm 2.

The proposed method employs both the initialization and shifting schemes. The corresponding pseudocode is given in Algorithm 3. The entire work-flow, along with the end-clustering, is shown in Fig. 4.6.

Algorithm 2 Proposed Shifting: ModelShift(Θ(t), Q(t))
1: Input: motif model Θ(t)
2: Θ(t)_BEST ← Θ(t)
3: With a given probability p_shift: Θ(t) ← Θ(t)_Left or Θ(t) ← Θ(t)_Right
4: if Q(t+t′) ≤ Q(t) at the end of the (t + t′)-th iteration then
5:   Θ(t) ← Θ(t)_BEST
6: end if
7: Return: motif model Θ(t)

Figure 4.6: Simplified description of the proposed algorithm.

4.3.1 Computational Leverage

In MCEMDA, let the total number of iterations performed for a Markov chain, H_i, i ∈ {1, 2, ..., R}, be L. If a single iteration of the MC Expectation Maximization algorithm is considered as a unit computation, then the total computational cost required, before employing any clustering algorithm, is RL. On the other hand, in the proposed method, if a q-step (q < L) look-ahead and a total of R′ randomly initialized Markov chains are employed, then the total computational cost is R′q + (R′/10)(L − q). The parameters used in MCEMDA are R = 1000 and L = 100. If R′ and q are chosen as 1000 and 15 respectively, the proposed method offers an improvement in computational cost by approximately a factor of four.
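The factor-of-four claim can be checked directly from the stated parameter values:

```python
# Computational cost comparison from sub-section 4.3.1, with the stated values.
R, L = 1000, 100             # chains and iterations per chain in MCEMDA
R_prime, q = 1000, 15        # chains and look-ahead length in the proposed method

mcemda_cost = R * L                                       # 100000 unit computations
proposed_cost = R_prime * q + (R_prime // 10) * (L - q)   # 15000 + 8500 = 23500
speedup = mcemda_cost / proposed_cost                     # about 4.26
```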

4.4 Results and Analysis

4.4.1 Dataset

To validate the efficiency of the proposed algorithm, two types of dataset are used. The first one is a synthetic dataset having different instances of randomly generated (l, d) motifs. A second dataset is prepared by taking motif instances from the JASPAR website [51].

Algorithm 3 Proposed Algorithm of the Entire Work-flow
 1: Input: initialized motif model Θ⋆_BEST; Output: final motif model Θ⋆
 2: Initialize: tmax
 3: Θ(0) ← Θ⋆_BEST; Qmax ← −∞
 4: while t ≤ tmax do
 5:   t ← t + 1
 6:   for r = 1 to m do
 7:     for i = 1 to N do
 8:       draw a sample according to (4.5)
 9:     end for
10:     update Θ(t,r) and Q(t,r)
11:     Θ(t,r) ← ModelShift(Θ(t,r), Q(t,r))
12:   end for
13:   evaluate Θ(t) and Q(t) using the m-best strategy
14:   if Q(t) ≥ Qmax then
15:     Qmax ← Q(t) and Θ⋆ ← Θ(t)
16:   end if
17: end while
18: Output the final motif model Θ⋆

• Synthetic Dataset: The procedure to test the effectiveness of any motif finding algorithm is to use a synthetic dataset containing (l, d) motif instances and then apply the algorithm to the dataset to verify its efficiency. Although there are some fundamental differences between a synthetic dataset and a true biological dataset, this type of testing platform provides an approximate but good measure of the effectiveness of motif finding algorithms. Each motif instance of the dataset used in this work is generated by taking a random l-length genomic sequence and mutating at most d positions randomly. This motif instance is implanted at a random location of a random background sequence with a fixed GC fraction. Different combinations of l, d and background length are used to vary the difficulty level of the motif finding problem.

• JASPAR Dataset: This dataset is prepared by taking experimentally verified biological motif instances (transcription factors of eukaryotes) from the JASPAR website [51]. These instances are then implanted in random background sequences at random locations. These backgrounds act as promoter regions. The GC fraction of the promoter regions and the background length are taken as 0.45 and 500 respectively.
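The dataset generation procedure above can be sketched as follows. This is a hypothetical implementation: the helper name and the exact GC-fraction weighting of the background are assumptions, and a mutated position may draw the original base again, giving "at most d" mutations as described.

```python
import random

random.seed(0)
BASES = "ATGC"

def planted_motif_dataset(n_seqs, l, d, bg_len, gc_fraction=0.45):
    """Generate an (l, d) planted-motif dataset in the style of sub-section
    4.4.1: one random l-length motif, each instance mutated in at most d
    positions and implanted at a random location in a background sequence
    with a fixed GC fraction."""
    weights = [(1 - gc_fraction) / 2, (1 - gc_fraction) / 2,
               gc_fraction / 2, gc_fraction / 2]   # P(A), P(T), P(G), P(C)
    motif = "".join(random.choice(BASES) for _ in range(l))
    seqs, starts = [], []
    for _ in range(n_seqs):
        instance = list(motif)
        for pos in random.sample(range(l), d):     # mutate at most d sites
            instance[pos] = random.choice(BASES)
        bg = random.choices(BASES, weights=weights, k=bg_len)
        a = random.randrange(bg_len - l + 1)
        bg[a:a + l] = instance                     # implant the instance
        seqs.append("".join(bg))
        starts.append(a)
    return motif, seqs, starts
```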


4.4.2 Comparison with other EM-based Algorithms

There are multiple methods available for finding motifs, e.g. MEME [17], MCEMDA [18], Random Projection [16], Weeder [19], MUSCLE [20], ClustalW [21], [22], etc. Out of these, MCEMDA, an excellent algorithm in itself, outperforms the other algorithms in most cases [18]. MCEMDA and a simple EM-based method are selected for comparison. For difficult motif finding problems, the probability of EM-based algorithms converging to an erroneous motif model is high. For this reason, conventionally, the EM-based algorithms are applied to the dataset multiple times with different seeds and, at the end, a clustering algorithm is applied to identify the largest cluster in motif model space. The intuition is that all spurious motif models will remain scattered in motif model space but the optimal and near-optimal solutions will form a larger cluster. This clustering technique improves the probability of detection of the true motif model. As clustering is a form of hard quantization, the proposed method is compared to the other two EM-based methods without this clustering stage. Since all the algorithms are EM-based, the calculation of the average score after a fixed number of iterations may suit the purpose more appropriately. It may be noted that the use of an identical clustering technique at the final stage can equally improve the performance of every EM-based algorithm, as discussed in sub-section 4.4.5.

4.4.2.1 Results on Synthetic Dataset

The synthetic dataset with various combinations of planted motifs is used to validate the effectiveness of the proposed method. The average scores of each algorithm over 100 independent trials are given in TABLE-4.2. It may be noted that the scores can be improved by employing a suitable clustering technique. The conventional EM used for comparison is a simplistic one. The score might be better in the case of a sophisticated EM-based algorithm, such as MEME, where additional statistical information (such as word statistics) is incorporated [17].

4.4.2.2 Results on JASPAR Dataset

The results on the JASPAR dataset are shown in TABLE-4.3 and TABLE-4.4. The dataset is divided into two groups: TABLE-4.3 shows the results for groups with N ≥ 25, and TABLE-4.4 shows the results for the smaller (N < 25) groups. In both cases, the Monte-Carlo method outperforms the conventional EM. This is due to the advantage of the Monte-Carlo method in avoiding local minima. The proposed initialization and model shifting mechanisms, when applied together with Monte-Carlo Expectation Maximization, show another level of improvement in most cases.

It may be noted that the performance of the proposed method improves as the motif length increases. This phenomenon may be attributed to the fact that for smaller motifs, the deviation of the current motif model due to model shifting is relatively larger than that in a longer motif model. It may be inferred that the advantage of the model-shifting mechanism is minimal for small motif lengths.

Table 4.2: Performance on Synthetic Dataset

  (l,d)   Background   Av. Score Ψ in 100 runs
  motif   Length       Conventional EM   MC EM   Proposed Method
  (10,2)  200          0.26              0.35    0.37
  (10,2)  300          0.11              0.18    0.28
  (10,2)  500          0.02              0.09    0.14
  (11,2)  300          0.35              0.58    0.65
  (11,2)  400          0.30              0.49    0.53
  (11,2)  500          0.15              0.39    0.38
  (12,3)  300          0.09              0.19    0.23
  (13,3)  300          0.12              0.31    0.39
  (13,3)  500          0.07              0.28    0.28
  (14,4)  300          0.18              0.36    0.32
  (14,3)  500          0.21              0.38    0.41
  (15,4)  300          0.31              0.41    0.49
  (15,4)  500          0.25              0.33    0.41
  (16,5)  300          0.10              0.18    0.23
  (16,5)  500          0.07              0.16    0.20
  (17,5)  500          0.20              0.30    0.38
  (18,5)  500          0.15              0.17    0.22
  (19,6)  500          0.15              0.27    0.31
  (20,6)  500          0.19              0.36    0.43

4.4.3 A Study on the Proposed Initialization Scheme

In order to further evaluate the effectiveness of the proposed algorithm, a study is performed on the proposed initialization scheme. With an increase in the number of initial candidate chains used to find a promising Markov chain (itrmax in Algorithm 1), the performance of the proposed method improves. This improvement is due to the additional computational cost spent in the proposed initialization scheme. In TABLE-4.3 and 4.4, the results are shown


Table 4.3: Performance on JASPAR Dataset: Large Group (Av. Score Ψ in 100 runs)

  Large Group                                         Motif    Conventional   Monte-     Proposed
                                                      Length   EM             Carlo EM   Method
  MA0094                                              4        0.02           0.06       0.05
  MA0036 MA0075                                       5        0.09           0.12       0.12
  MA0037 MA0080 MA0086 MA0089 MA0098 MA0103 MA0104    6        0.14           0.24       0.21
  MA0081 MA0093 MA0096                                7        0.36           0.46       0.47
  MA0067                                              8        0.10           0.07       0.10
  MA0002 MA0054 MA0077 MA0078 MA0084 MA0118           9        0.12           0.26       0.29
  MA0001 MA0015 MA0028 MA0038 MA0052 MA0061 MA0092    10       0.11           0.29       0.36
  MA0005 MA0009                                       11       0.16           0.31       0.35
  MA0019 MA0041 MA0048 MA0083 MA0091 MA0097           12       0.21           0.36       0.39
  MA0114                                              13       0.29           0.42       0.49
  MA0029 MA0069 MA0072 MA0082                         14       0.30           0.52       0.54
  MA0116                                              15       0.35           0.51       0.59
  MA0060                                              16       0.42           0.48       0.53
  MA0065 MA0066                                       20       0.36           0.54       0.60

Note: Motifs of identical length are grouped together.

taking the value of itrmax = 25. This can be further improved by increasing its value. A study is performed for the (15, 4) motif model to demonstrate the usefulness of the proposed initialization scheme. The evolution of the goodness measure, ψ, is shown (continuous line) in Fig. 4.7. The value of ψ is measured by computing the average score over 100 different instances of the (15, 4) motif finding problem with background length L = 500. The value of ψ for MC EM is also shown, with the discontinuous line. It can be observed from the figure that the performance of the proposed method improves at the expense of computational cost.


Table 4.4: Performance on JASPAR Dataset: Small Group (Av. Score Ψ in 100 runs)

  Small Group                                         Motif    Conventional   Monte-     Proposed
                                                      Length   EM             Carlo EM   Method
  MA0053 MA0064                                       5        0.08           0.06       0.01
  MA0004 MA0006 MA0020 MA0021 MA0056 MA0095           6        0.12           0.18       0.13
  MA0026 MA0063 MA0087                                7        0.11           0.18       0.17
  MA0008 MA0011 MA0024 MA0031 MA0117 MA0121           8        0.16           0.21       0.24
  MA0044 MA0076 MA0122                                9        0.10           0.25       0.31
  MA0023 MA0034 MA0049 MA0057                         10       0.14           0.29       0.31
  MA0058 MA0062 MA0071 MA0079 MA0101 MA0107 MA0012
  MA0013 MA0025 MA0027 MA0040 MA0059 MA0105 MA0111    11       0.14           0.26       0.26
  MA0018 MA0022 MA0043 MA0047 MA0070 MA0102 MA0120    12       0.12           0.28       0.35
  MA0010 MA0017 MA0046 MA0119                         14       0.36           0.44       0.51
  MA0074                                              15       0.50           0.57       0.60
  MA0045 MA0085                                       16       0.17           0.30       0.31
  MA0115                                              17       0.51           0.60       0.67
  MA0051 MA0112 MA0113                                18       0.19           0.22       0.26
  MA0014 MA0073 MA0088 MA0106                         20       0.12           0.24       0.32
  MA0007                                              22       0.29           0.46       0.51

Note: Motifs of identical length are grouped together.

4.4.4 A Study on the Stand-alone Model Shifting Mechanism

In order to identify the improvement due to the stand-alone shifting mechanism, a study is performed on a few (l, d) motif finding problems, excluding the proposed initialization scheme. The average scores over 100 runs are computed for MC EM and the proposed method. The results are presented in TABLE-4.5. In the fourth column, the score due to the stand-alone shifting mechanism (i.e. without the proposed initialization step) is provided. The overall scores, considering both the initialization and shifting, are also provided inside


Figure 4.7: Evolution of the goodness measure, ψ, with respect to the number of candidates for finding a promising Markov chain (itrmax).

the parentheses.

Table 4.5: Comparison of the Stand-alone Shifting Scheme (Average Score Ψ in 100 runs)

  (l,d)   Background   MC EM   Proposed Model Shifting
  motif   Length
  (10,2)  200          0.35    0.39 (0.37)
  (11,2)  400          0.49    0.47 (0.53)
  (12,3)  300          0.19    0.24 (0.23)
  (13,3)  300          0.31    0.35 (0.39)
  (14,4)  300          0.36    0.29 (0.32)
  (15,4)  500          0.33    0.40 (0.41)
  (16,5)  500          0.16    0.19 (0.20)
  (17,5)  500          0.30    0.34 (0.38)
  (18,5)  500          0.37    0.40 (0.42)

Note: Results are based on the synthetic dataset.

4.4.5 The Effect of End-Clustering

As discussed in section 4.3.1, a careful selection of the parameters associated with the proposed algorithm can make it more cost-effective than the conventional EM or MC EM method. Due to the initial screening process and the model shifting mechanism, a single Markov chain of the proposed method is computationally more expensive than that of the conventional EM or MC EM. It should be noted that the Markov chain selected by the proposed method is more likely to converge to the true motif model, as observed in TABLES 4.2, 4.3 and 4.4. Therefore, the number of chains required to identify the motif using a clustering technique is smaller than that of the conventional EM or MC EM. To establish this claim, a study is performed on different instances of (l, d) motifs. As the output after the clustering


scheme is the motif sequence (not the motif start positions), the nucleotide-level accuracy (nla) is considered as the goodness measure. The values of itrmax and tmax in Algorithm 1 are taken to be 20 and 15 respectively. For the conventional EM and MC EM, 100 random seeds are considered. The final motif consensus is achieved via k-means clustering (with varying k) at the final stage. It is observed that for difficult problems, only a few seeds produce an output motif model in the neighborhood of the true motif. Therefore, the cluster corresponding to the desired solution is small. In such cases, the value of k should be kept as high as 10-15. For relatively easy problems, k = 5 is sufficient for obtaining a satisfactory result. In TABLE-4.6, the value of k is taken as five for both the conventional EM and the MC EM method. In the proposed method, the number of seeds is taken as 20 and the number of clusters as three. If the motif finding problem is easy (say, a (10, 1) problem), then EM-based techniques may produce correct results in almost all cases, resulting in one or more empty cluster(s). In such cases, a simple majority consensus may be used instead of clustering.
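The majority-consensus fallback mentioned above can be sketched as a column-wise vote over the candidate motifs returned by the chains. This is an illustrative helper, not part of the original implementation:

```python
from collections import Counter

def majority_consensus(candidate_motifs):
    """Column-wise majority consensus over equal-length candidate motifs,
    usable in place of end-clustering when most chains agree (section 4.4.5)."""
    w = len(candidate_motifs[0])
    return "".join(
        Counter(m[j] for m in candidate_motifs).most_common(1)[0][0]
        for j in range(w))
```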

Table 4.6: Performance Improvement on Synthetic Dataset due to Clustering

  (l,d)   Background   Conventional EM        MC EM                  Proposed Method
  motif   Length       nla score†  Av. time   nla score†  Av. time   nla score†  Av. time
                                   (minute)               (minute)               (minute)
  (10,2)  300          0.41        8.45       0.52        8.97       0.49        7.42
  (11,2)  300          0.48        8.83       0.67        9.23       0.71        7.43
  (12,3)  300          0.28        9.03       0.39        9.62       0.39        7.46
  (13,3)  300          0.50        9.15       0.64        9.70       0.67        7.49
  (14,4)  300          0.13        9.22       0.29        9.83       0.36        7.57
  (15,4)  300          0.64        9.38       0.60        10.07      0.52        7.94
  (16,5)  500          0.28        16.00      0.29        16.92      0.35        13.01
  (17,5)  500          0.44        16.35      0.53        17.10      0.50        13.27
  (18,5)  500          0.28        16.55      0.27        17.22      0.36        13.54
  (19,6)  500          0.36        16.98      0.42        17.35      0.51        13.83

† The nla score is computed by performing clustering 10 times.
Note: The computation is performed on a PC with an Intel 3.0 GHz i5 processor and 4 GB memory.

4.4.6 Computation Time of Individual Markov Chain

Due to the proposed seed initialization and model shifting techniques, a single Markov chain of the proposed method is computationally more expensive than one produced by the conventional EM or MC EM, as discussed in sub-section 4.4.5. In TABLE-4.7, the computation times of a single Markov chain associated with the three EM-based algorithms are shown for different motif lengths. The values of itrmax and tmax in Algorithm 1 are taken as 20 and 15 respectively. In the right-most column, the values within parentheses indicate the computation time without the proposed initialization scheme.

Table 4.7: Computation Time Comparison of a Single Chain

  Motif    Background   Chain    Time for a single chain (in seconds)
  Length   Length       Length   EM      MC EM   Proposed Method
  10       300          100      5.07    5.38    22.25 (5.41)
  10       300          500      25.01   26.87   43.85 (27.98)
  13       300          100      5.49    5.82    22.48 (5.96)
  13       300          500      27.50   29.19   46.64 (31.90)
  15       300          100      5.63    6.04    23.81 (6.18)
  15       300          500      28.20   30.30   48.48 (31.91)
  17       500          100      9.81    10.26   39.80 (10.41)
  17       500          500      49.08   51.34   82.23 (53.10)
  19       500          100      10.19   10.41   41.49 (10.59)
  19       500          500      50.92   52.12   83.42 (54.89)

4.5 Contribution of this Chapter

An improved version of the Monte-Carlo EM method for motif finding in genomic sequences is proposed. The method identifies and terminates unpromising Markov chains to improve the performance. A model shifting method is proposed to avoid the local minima that arise due to the high alignment score of shifted versions of a true motif model. These two modifications, when incorporated, improve the performance of EM-based motif finding techniques. The proposed modification can be incorporated into any EM-based motif finding algorithm. The effectiveness of the proposed algorithm is validated using both a synthetic dataset and a biological dataset containing experimentally verified motifs.

Chapter 5

Conclusion

The thesis proposes a novel framework for classification of GPCR sequences based on family-specific conserved triplets or motifs. Two kernels are designed to classify GPCRs using the available structural information. An improved accuracy is achieved in both the GPCR family and GPCR Class-A subfamily classification problems by using conventional kernel classifiers. A comparison with existing methods shows that these kernels can improve the classification accuracy. A few triplets or motifs relevant to ligand binding processes, and their locations, are identified. This class of method can also classify other types of protein sequences with a-priori structural information. As a future extension of this work, the proposed string kernel can be made more informative by incorporating information about the exact positions of the motifs (triplets). Different amino acid grouping schemes can be used based on their ligand-binding properties, as opposed to their physico-chemical properties.

In the case of DNA sequences, two EM-based methods are proposed to find motifs. The first method performs EM through a projected motif model to identify motifs in genomic sequences. Given a reasonably good initial starting point, this deterministic EM-based algorithm converges quickly to a local optimum. Due to the loss of information in the random-projection process, this method shows lower accuracy compared to the conventional EM method. To overcome this problem, an improved version of the Monte-Carlo EM method is proposed for motif finding in genomic sequences. This method identifies and terminates unpromising Markov chains to improve the overall performance. A model shifting method is proposed to avoid the local optima that arise due to the high alignment score of shifted versions of a true motif model. These two modifications, when incorporated, improve the performance of EM-based motif finding techniques. The proposed modification can be incorporated into any EM-based motif finding algorithm. The effectiveness of the


proposed algorithm is validated on both synthetic and biological datasets containing experimentally verified motifs. The proposed algorithm may be used to identify transcription factor binding sites in the promoter regions of DNA sequences. Although the algorithm is formulated for DNA sequences, the motif finding problem is also important in the context of protein sequences. For example, the helix-turn-helix (HTH) motif is a common pattern used by transcription regulators of prokaryotes and eukaryotes. Any EM-based motif identification algorithm can be used to identify HTH motifs in protein sequences, and the proposed initialization technique may be employed in this case. The proposed model shifting scheme, however, needs to be modified to suit an alphabet size as large as 20. This is required because shifting a column of the model parameter, Θ, might move the point of interest far away from the true motif model, resulting in a very slow rate of convergence. As a future extension of this work, a similar algorithm may be developed to deal with motifs in protein sequences.
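The column-shift operation on the motif model Θ discussed above can be sketched as follows. Representing Θ as a list of per-position distributions and filling vacated columns with the background distribution are illustrative assumptions; the actual scheme in this work may differ.

```python
def shift_model(theta, background, offset):
    """Shift the columns of a motif model Theta by `offset` positions
    (positive = right), filling vacated columns with the background
    distribution. Theta is a list of w columns, each a probability
    distribution over the alphabet."""
    w = len(theta)
    shifted = [list(background) for _ in range(w)]
    for j in range(w):
        src = j - offset  # column src of Theta lands at column j
        if 0 <= src < w:
            shifted[j] = list(theta[src])
    return shifted

# A width-2 DNA model (alphabet order A, C, G, T):
theta = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]]
background = [0.25, 0.25, 0.25, 0.25]
print(shift_model(theta, background, 1))
# → [[0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]]
```

In a full implementation one would score each candidate shift against the data and continue EM from the best-scoring version. For protein sequences each column is a length-20 distribution, so a shifted model can land much farther from the true model than in the 4-letter DNA case, which is why the scheme would need modification as noted above.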


AUTHOR’S PUBLICATIONS

[1] Aniruddha Maiti, Santanu Ghorai, and Anirban Mukherjee, “A Multi-Fold String Kernel for Fixed Topology Sequence Classification,” *****.

[2] Anirban Mukherjee and Aniruddha Maiti, “Expectation Maximization in Random Projected Spaces to Find Motifs in Genomic Sequences,” in International Conference on Electronics, Communication and Instrumentation 2014, Kolkata, 2014.

[3] Aniruddha Maiti and Anirban Mukherjee, “On the Monte-Carlo Expectation Maximization for Finding Motifs in DNA Sequences,” IEEE Journal of Biomedical and Health Informatics, 2014.

BIO-DATA

Aniruddha Maiti received the B.E. degree in electronics and telecommunication engineering from the Bengal Engineering and Science University, Shibpur, India, in 2010. He is currently pursuing the M.S. degree in the Department of Electrical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India. His principal research interests are machine learning and computational biology.