frmsdpred: predicting local rmsd between structural fragments using sequence information

14
proteins STRUCTURE FUNCTION BIOINFORMATICS fRMSDPred: Predicting local RMSD between structural fragments using sequence information Huzefa Rangwala* and George Karypis Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota 55455 INTRODUCTION Over the years, several computational methodologies have been developed for determining the 3D structure of a protein (target) from its linear chain of amino acid residues. 1–6 Among them, approaches based on comparative modeling 1,6 are the most widely used and have been shown to produce some of the best predictions when the target has some degree of homology with proteins of known 3D structure (templates). 7,8 The key idea behind comparative modeling approaches is to align the sequence of the target to the sequence of one or more template proteins and then construct the target’s structure from the structure of the template(s) using the alignment(s) as a reference. Thus, the construction of high-quality target-template alignments plays a critical role in the overall effectiveness of the method, as it is used to both select the suitable template(s) and to build good reference alignments. The overall performance of comparative modeling approaches will be significantly improved, if the target-template alignment constructed by considering sequence and sequence-derived information is as close as possible to the structure-based alignment between these two proteins. The development of increasingly more sensitive target-template alignment algorithms, 9–11 that incorporate profiles, 12,13 profile-to-profile scoring func- tions, 14–17 and predicted secondary structure information 18,19 have contrib- uted to the continuous success of comparative modeling. 20,21 The dynamic-programming-based algorithms 22,23 used in target-template alignment are also used by many methods to align a pair of protein struc- tures. However, the key difference between these two problem settings is that, while the target-template alignment methods score a pair of aligned residues using sequence-derived information, the structure alignment meth- ods use information derived from the structure of the protein. For example, structure alignment methods like CE 24 and MUSTANG 25 score a pair of residues by considering how well fixed-length fragments (i.e., short contigu- ous backbone segments) centered around each residue align with each other. This score is usually computed as the root mean squared deviation (RMSD) of the optimal superimposition of the two fragments. The Supplementary Material referred to in this article can be found online at http://www.interscience.wiley. com/jpages/0887-3585/suppmat/ Grant sponsor: NSF; Grant numbers: EIA-9986042, ACI-0133464, IIS-0431135; Grant sponsor: NIH; Grant number: RLM008713A; Grant sponsor: Army High Performance Computing Research Center; Grant number: DAAD19-01-2-0014; Grant sponsor: Digital Technology Center at the University of Minnesota. *Correspondence to: Mr. Huzefa Rangwala, 464 DTC 117 Pleasant Street, Minneapolis, Minnesota United States 55455. E-mail: [email protected] Received 28 August 2007; Revised 17 December 2007; Accepted 8 January 2008 Published online 25 February 2008 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21998 ABSTRACT The effectiveness of comparative modeling approaches for protein structure predic- tion can be substantially improved by incorporating predicted structural infor- mation in the initial sequence-structure alignment. Motivated by the approaches used to align protein structures, this arti- cle focuses on developing machine learn- ing approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the align- ment, assess the quality of an alignment, and identify high-quality alignment seg- ments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponen- tial kernel functions. Our comprehensive empirical study shows superior results compared with the profile-to-profile scor- ing schemes. We also show that for protein pairs with low sequence similarity (less than 12% sequence identity) these new local structural features alone or in conjunction with profile-based informa- tion lead to alignments that are consider- ably accurate than those obtained by schemes that use only profile and/or pre- dicted secondary structure information. Proteins 2008; 72:1005–1018. V V C 2008 Wiley-Liss, Inc. Key words: structure prediction; compar- ative modeling; machine learning; classi- fication; regression. V V C 2008 WILEY-LISS, INC. PROTEINS 1005

Upload: huzefa-rangwala

Post on 06-Jul-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

proteinsSTRUCTURE O FUNCTION O BIOINFORMATICS

fRMSDPred: Predicting local RMSDbetween structural fragments usingsequence informationHuzefa Rangwala* and George Karypis

Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota 55455

INTRODUCTION

Over the years, several computational methodologies have been developed

for determining the 3D structure of a protein (target) from its linear chain

of amino acid residues.1–6 Among them, approaches based on comparative

modeling1,6 are the most widely used and have been shown to produce

some of the best predictions when the target has some degree of homology

with proteins of known 3D structure (templates).7,8

The key idea behind comparative modeling approaches is to align the

sequence of the target to the sequence of one or more template proteins and

then construct the target’s structure from the structure of the template(s)

using the alignment(s) as a reference. Thus, the construction of high-quality

target-template alignments plays a critical role in the overall effectiveness of

the method, as it is used to both select the suitable template(s) and to build

good reference alignments. The overall performance of comparative modeling

approaches will be significantly improved, if the target-template alignment

constructed by considering sequence and sequence-derived information is as

close as possible to the structure-based alignment between these two proteins.

The development of increasingly more sensitive target-template alignment

algorithms,9–11 that incorporate profiles,12,13 profile-to-profile scoring func-

tions,14–17 and predicted secondary structure information18,19 have contrib-

uted to the continuous success of comparative modeling.20,21

The dynamic-programming-based algorithms22,23 used in target-template

alignment are also used by many methods to align a pair of protein struc-

tures. However, the key difference between these two problem settings is

that, while the target-template alignment methods score a pair of aligned

residues using sequence-derived information, the structure alignment meth-

ods use information derived from the structure of the protein. For example,

structure alignment methods like CE24 and MUSTANG25 score a pair of

residues by considering how well fixed-length fragments (i.e., short contigu-

ous backbone segments) centered around each residue align with each other.

This score is usually computed as the root mean squared deviation (RMSD)

of the optimal superimposition of the two fragments.

The Supplementary Material referred to in this article can be found online at http://www.interscience.wiley.

com/jpages/0887-3585/suppmat/

Grant sponsor: NSF; Grant numbers: EIA-9986042, ACI-0133464, IIS-0431135; Grant sponsor: NIH; Grant

number: RLM008713A; Grant sponsor: Army High Performance Computing Research Center; Grant number:

DAAD19-01-2-0014; Grant sponsor: Digital Technology Center at the University of Minnesota.

*Correspondence to: Mr. Huzefa Rangwala, 464 DTC 117 Pleasant Street, Minneapolis, Minnesota United States

55455. E-mail: [email protected]

Received 28 August 2007; Revised 17 December 2007; Accepted 8 January 2008

Published online 25 February 2008 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21998

ABSTRACT

The effectiveness of comparative modeling

approaches for protein structure predic-

tion can be substantially improved by

incorporating predicted structural infor-

mation in the initial sequence-structure

alignment. Motivated by the approaches

used to align protein structures, this arti-

cle focuses on developing machine learn-

ing approaches for estimating the RMSD

value of a pair of protein fragments.

These estimated fragment-level RMSD

values can be used to construct the align-

ment, assess the quality of an alignment,

and identify high-quality alignment seg-

ments. We present algorithms to solve this

fragment-level RMSD prediction problem

using a supervised learning framework

based on support vector regression and

classification that incorporates protein

profiles, predicted secondary structure,

effective information encoding schemes,

and novel second-order pairwise exponen-

tial kernel functions. Our comprehensive

empirical study shows superior results

compared with the profile-to-profile scor-

ing schemes. We also show that for

protein pairs with low sequence similarity

(less than 12% sequence identity) these

new local structural features alone or in

conjunction with profile-based informa-

tion lead to alignments that are consider-

ably accurate than those obtained by

schemes that use only profile and/or pre-

dicted secondary structure information.

Proteins 2008; 72:1005–1018.VVC 2008 Wiley-Liss, Inc.

Key words: structure prediction; compar-

ative modeling; machine learning; classi-

fication; regression.

VVC 2008 WILEY-LISS, INC. PROTEINS 1005

Page 2: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

In this article, motivated by the alignment requirements

of comparative modeling approaches and the operational

characteristics of protein structure alignment algorithms,

we focus on the problem of estimating the RMSD value of

a pair of protein fragments by considering only sequence-

derived information. Besides its direct application to target-

template alignment, accurate estimation of these frag-

ment-level RMSD values can also be used to solve a

number of other problems related to protein structure

prediction such as identifying the best template by assess-

ing the quality of target-template alignments and identi-

fying high-quality segments of an alignment.

We present algorithms to solve the fragment-level

RMSD prediction problem using a supervised learning

framework based on support vector regression and classi-

fication that incorporates sequence-derived information

in the form of position-specific profiles and predicted

secondary structure.26 This information is effectively

encoded in fixed-length feature vectors. We develop and

test second-order pairwise exponential kernel functions

designed to capture the conserved signals of a pair of

local windows centered at each of the residues and use a

fusion-kernel-based approach to incorporate the profile-

and secondary structure-based information.

An extensive experimental evaluation of the algorithms

and their parameter space is performed on datasets con-

sisting of residue-pairs selected randomly between pairs

of proteins, as well as residue-pairs derived from optimal

sequence-based local alignments of known protein struc-

tures. Our experimental results show that there is a high

correlation (0.633–0.733) between the estimated and

actual fragment-level RMSD scores. Moreover, the per-

formance of our algorithms is considerably better than

that obtained by state-of-the-art profile-to-profile scoring

schemes when used to solve the fragment-level RMSD

prediction problems.

We also study the extent to which the predicted frag-

ment-level RMSD (fRMSD) values can actually lead to

alignment improvements. Specifically, we study and eval-

uate various alignment scoring schemes that use informa-

tion derived from sequence profiles, predicted secondary

structure, predicted fRMSD values, and their combina-

tions. Results on two different datasets show that scoring

schemes using the predicted fRMSD values alone and/or

in combination with scores derived from sequence pro-

files lead to better alignments than those obtained by

current state-of-the-art schemes that utilize sequence

profiles and predicted secondary structure information,

especially for sequence pairs having less than 12%

sequence identity.

The rest of the article is organized as follows. Defini-

tions and Notations, provides key definitions and nota-

tions used throughout the article. Problem Statement and

Motivation formally defines the fragment-level RMSD pre-

diction and classification problems and describes their

applications. Methods section describes the prediction

methods that we developed. Material section describes the

datasets and the various computational tools used in this

article. Results section presents a comprehensive experi-

mental evaluation of the methods developed. Related

Research section summarizes some of the related research

in this area. Finally, Conclusion section summarizes the

work and provides some concluding remarks.

DEFINITIONS AND NOTATIONS

Throughout the article we will use X and Y to denote

proteins, xi to denote the ith residue of X, and p(xi,yj) todenote the residue-pair formed by residues xi and yj.

Given a protein X of length n and a user-specified pa-

rameter w, we define wmer(xi) to be the (2w 1 1)-length

contiguous subsequence of X centered at position i (w <i � n 2 w). Similarly, given a user-specified parameter v,

we define vfrag(xi) to be the (2v 1 1)-length contiguous

substructure of X centered at position i (v < i � n 2 v).

These substructures are commonly referred to as frag-

ments.24,25 Without loss of generality, we represent the

structure of a protein using the Ca atoms of its back-

bone. The wmers and vfrags are fixed-length windows

that are used to capture information about the sequence

and structure around a particular sequence position,

respectively.

Given a residue-pair p(xi,yj), we define fRMSD(xi,yj) to

be the structural similarity score between vfrag(xi) and

vfrag(yj). This score is computed as the root mean square

deviation between the pair of substructures after optimal

superimposition. A residue-pair p(xi,yj) will be called

reliable if its fRMSD is below a certain value (i.e., there

is a good structural superimposition of the correspond-

ing substructures).

Finally, we will use the notation ha,bi to denote the

dot-product operation between vectors a and b.

PROBLEM STATEMENT ANDMOTIVATION

The work in this article is focused on solving the fol-

lowing two problems related to predicting the local struc-

tural similarity of residue-pairs.

Definition 1 (fRMSD Estimation Problem) Given a

residue-pair p(xi,yj), estimate the fRMSD(xi,yj) score by

considering information derived from the amino acid

sequence of X and Y.

Definition 2 (Reliability Prediction Problem) Given a

residue-pair p(xi,yj), determine whether it is reliable or not

by considering only information derived from the amino

acid sequence of X and Y.

It is easy to see that the reliability prediction problem

is a special case to the fRMSD estimation problem. As

such, it may be easier to develop effective solution meth-

H. Rangwala and G. Karypis

1006 PROTEINS

Page 3: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

ods for it and this is why we consider it as a different

problem in this paper.

The effective solution to these two problems has four

major applications to protein structure prediction. First,

given an existing alignment between a (target) protein and

a template, a prediction of the fRMSD scores of the

aligned residue-pairs (or their reliability) can be used to

assess the quality of the alignment and potentially select

among different alignments and/or different templates.

Second, fRMSD scores (or reliability assessments) can be

used to analyze different protein-template alignments in

order to identify high-quality moderate-length fragments.

These fragments can then be used by fragment-assembly-

based protein structure prediction methods like TASSER27

and ROSETTA28 to construct the structure of a protein.

Third, since residue-pairs with low fRMSD scores are

good candidates for alignment, the predicted fRMSD

scores can be used to construct a position-to-position

scoring matrix between all pairs of residues in a protein

and a template. This scoring matrix can then be used by

an alignment algorithm to compute a high-quality align-

ment for structure prediction via comparative modeling.

Essentially, this alignment scheme uses predicted fRMSD

scores in an attempt to mimic the approach used by vari-

ous structural alignment methods.24,25 Fourth, the

fRMSD scores (or reliability assessments) can be used as

input to other prediction tasks such as remote homology

prediction and/or fold recognition.

METHODS

We approach the problems of distinguishing reliable/

unreliable residue-pairs and estimating their fRMSD

scores following a supervised machine learning frame-

work and use support vector machines (SVM)29,30 to

solve them.

Given a set of positive residue-pairs A1 (i.e., reliable)

and a set of negative residue-pairs A2 (i.e., unreliable),

the task of support vector classification is to learn a func-

tion f(p) of the form

f ðpÞ ¼Xpi2Aþ

kþi Kðp; piÞ �Xpi2A�

k�i Kðp; piÞ; ð1Þ

where ki1 and ki

2 are non-negative weights that are com-

puted during training by maximizing a quadratic objec-

tive function, and K(.,.) is the kernel function designed

to capture the similarity between pairs of residue-pairs.

Having learned the function f(p), a new residue-pair p is

predicted to be positive or negative depending on

whether f(p) is positive or negative. The value of f(p)also signifies the tendency of p to be a member of the

positive or negative class and can be used to obtain a

meaningful ranking of a set of the residue-pairs.

We use the error insensitive support vector regression

e-SVR30,31 for learning a function f(p) to predict the

fRMSD(p) scores. Given a set of training instances

(pi,fRMSD(pi)), the e-SVR aims to learn a function of

the form

f ðpÞ ¼Xpi2Dþ

aþi Kðp; piÞ �

Xpi2D�

a�i Kðp; piÞ; ð2Þ

where D1 contains the residue-pairs for which

fRMSD(pi) 2 f(pi) > e, D2 contains the residue pairs

for which fRMSD(pi) 2 f(pi) < 2e, and ai1 and ai

2 are

non-negative weights that are computed during training

by maximizing a quadratic objective function. The objec-

tive of the maximization is to determine the flattest f(p)in the feature space and minimize the estimation errors

for instances in D1 | D2. Hence, instances that have an

estimation error satisfying |f(pi) 2 fRMSD(pi)| < e are

neglected. The parameter e controls the width of the

regression deviation or tube.

In this work we focus on several key aspects related to

the formulation and solution of the classification and

regression problems. In particular we explore different

types of sequence information associated with the resi-

due-pairs, develop efficient ways to encode this informa-

tion to form fixed length feature vectors, and design sen-

sitive kernel functions to capture the similarity between

pairs of residues in the feature spaces. We also perform

an application specific case study to see how the esti-

mated fRMSD scores can be used for generating sensitive

sequence alignments.

Sequence-based information

For a given protein X, we encode the sequence infor-

mation using profiles and predicted secondary structure.

Profile information

The profile of a protein X is derived by computing a

multiple sequence alignment of X with a set of sequences

{Y1, . . . , Ym that have a statistically significant sequence

similarity with X (i.e., they are sequence homologs).

The profile of a sequence X of length n is represented

by two n 3 20 matrices, namely the position-specific

scoring matrix PX and the position-specific frequency

matrix FX. Matrix P can be generated directly by run-

ning PSI-BLAST,13 whereas matrix F consists of the fre-

quencies used by PSI-BLAST to derive P. These frequen-

cies, referred to as the target frequencies32 consists of

both the sequence-weighted observed frequencies (also

referred to as effective frequencies32) and the BLO-

SUM6233 derived-pseudocounts.13 Further, each row of

the matrix F is normalized to one.

Predicted secondary structure information

For a sequence X of length n we predict the secondary

structure and generate a position-specific secondary

Predicting Local RMSD between Structural Fragments

PROTEINS 1007

Page 4: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

structure matrix SX of length n 3 3. The (i,j) entry of

this matrix represents the strength of the amino acid resi-

due at position i to be in state j, where j [ (0,1,2) corre-

sponds to the three secondary structure elements: alpha

helices (H), beta sheets (E), and coil regions (C).

Coding schemes

The input to our prediction algorithms are a set of

wmer-pairs associated with each residue-pair p(xi,yj). Theinput feature space is derived using various combinations

of the elements in the P and S matrices that are associ-

ated with the subsequences wmer(xi) and wmer(yj).

For the rest of this paper, we will use PX(i 2 w . . .i 1 w) to denote the (2w 1 1) rows of matrix PX corre-

sponding to wmer(xi). A similar notation will be used

for matrix S.

Concatenation coding scheme

For a given residue-pair p(xi,yj), the feature-vector of

the concatenation coding scheme is obtained by first lin-

earizing the matrices PX(i 2 w . . . i 1 w) and PY(j 2w . . . j 1 w) and then concatenating the resulting vec-

tors. This leads to feature-vectors of length 2 3 (2w 11) 3 20. A similar representation is derived for matrix Sleading to feature-vectors of length 2 3 (2w 1 1) 3 3.

The concatenation coding scheme is order dependent

as the feature-vector representations for p(xi,yj) and

p(yj,xi) are not equivalent even though they represent the

same residue-pair. This order-dependency can potentially

reduce the accuracy of the classification/regression meth-

ods as there is no guarantee that the ordering used dur-

ing prediction was the same with that used during learn-

ing.

We addressed this problem as follows. During training,

we built a model using an arbitrary ordering of the

wmers for each residue-pair. However, during model

application, we classified/regressed both orderings of a

residue-pair and used the average of the SVM/e-SVR out-

puts as the final classification/regression result. We

denote this averaging method by avg. Our empirical eval-

uation shows that the averaging scheme achieves up to

1% improvement over both representations for both the

classification as well as regression problem. We present

these results as part of the supplementary data, and

always refer to the avg representation while using the

concatenation coding scheme in this study.

Pairwise coding scheme

For a given residue-pair p(xi,yj), the pairwise coding

scheme generates a feature-vector by linearizing the ma-

trix formed by an element-wise product between PX(i 2w . . . i 1 w) and PY(j 2 w . . . j 1 w). The length of

this vector is (2w 1 1) 3 20 and is order independent.

If we denote the element-wise product operation by ‘‘�’’,

then the element-wise product matrix is given by

PXð�w þ i . . .w þ iÞ � PY ð�w þ j . . .w þ jÞ: ð3Þ

A similar approach is used to obtain the pairwise coding

scheme for matrix S, leading to feature-vectors of length

(2w 1 1) 3 3.

Kernel functions

The general structure of the kernel function that we

use for capturing the similarity between a pair of resi-

due-pairs p(xi,yj) and p0(xi00,yj0 0) is given by

Kcsðp; p0Þ ¼ exp 1:0þ Kcs1 ðp; p0ÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiKcs

1 ðp; pÞKcs1 ðp0; p0Þ

p !

; ð4Þ

where K1cs(p, p0) is given by

Kcs1 ðp; p0Þ ¼ Kcs

2 ðp; p0Þ þ ðKcs2 ðp; p0ÞÞ2; ð5Þ

and K2cs(p, p0) is a kernel function that depends on the

choice of particular coding scheme (cs). For the concate-

nation coding scheme using matrix P (i.e., cs 5 Pconc),

K2cs(p, p0) is given by

KPconc

2 ðp; p0Þ ¼Xk¼þw

k¼�w

hPXði þ kÞ;PX 0 ði0 þ kÞiþ

Xk¼þw

k¼�w

hPY ðj þ kÞ;PY 0 ðj0 þ kÞi:ð6Þ

For the pairwise coding scheme using matrix P (i.e., cs

5 Ppair), K2cs(p, p0) is given as

KPpair

2 ðp; p0Þ ¼Xk¼þw

k¼�w

hPXði þ kÞ � PY ðj þ kÞ;

PX 0 ði0 þ kÞ � PY 0 ðj0 þ kÞi:ð7Þ

Similar kernel functions can be derived using matrix Sfor both the pairwise and the concatenation coding

schemes. We will denote these coding schemes as Spair

and Sconc, respectively. Since the overall structure of the

kernel that we used [Eqs. (4) and (5)] is that of a nor-

malized second-order exponential function, we will refer

to it as nsoe.

The second-order component of Eq. (5) allows the

nsoe kernel to capture pairwise dependencies among the

residues used at various positions within each wmer, and

we found that this leads to better results over the linear

function. This observation is also supported by earlier

research on secondary-structure prediction as well.26 In

addition, nsoe’s exponential function allows it to capture

non-linear relationships within the data just like the ker-

nels based on the Gaussian and radial basis function.30

H. Rangwala and G. Karypis

1008 PROTEINS

Page 5: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

Fusion kernels

We also developed a set of kernel functions that incor-

porate both profile and secondary structure information

using an approach motivated by fusion kernels.31,34 Spe-

cifically, we constructed a new kernel function as the

unweighted sum of the nsoe kernel function for the profile

and secondary structure information. For example, the

concatenation-based fusion kernel function is given by

KðPþSÞconc ðp; p0Þ ¼ KPconc ðp; p0Þ þ KSconc ðp; p0Þ: ð8ÞA similar kernel function can be defined for the pairwise

coding scheme as well. We will denote the pairwise-based

fusion kernel by K(P1S)pair(p, p0). Note that since these

fusion kernels are linear combinations of valid kernels,

they are also admissible kernels.

Case study: sequence alignment

In our recent work,35 we study the effectiveness of

using predicted fRMSD scores for construction of a posi-

tion-to-position scoring matrix between sequence pairs.

This scoring matrix was then used by a dynamic-pro-

gramming based alignment algorithm (Smith-Waterman

local alignment23) to produce sequence alignments. We

also derived position-to-position scoring matrices derived

from a weighted combination of profile-based, predicted

secondary structure and fRMSD-based scoring schemes.

In particular we use three position-specific scoring

matrices: (i) prof which uses the PICASSO17,32 scoring

function to compute similarity between two profile col-

umns (described in Profile-to-Profile Scoring schemes),

(ii) ss which uses a dot-product between the predicted

secondary structure scores, (iii) frmsd which uses the pre-

dicted fRMSD scores. The different alignments generated

by the scoring matrices are referred by prof, ss, and

frmsd.

MATERIALS

Datasets

We evaluated the classification and regression perform-

ance of the various kernels on a set of protein pairs used

in a previous study for learning a profile-to-profile scor-

ing function.36 These pairs of proteins were derived

from the SCOP 1.57 database, classes a-e, with no two

protein domains sharing greater than 40% sequence

identity. The dataset is comprised of 158 protein pairs

belonging to the same family, 394 pairs belonging to the

same superfamily but not the same family, and 184 pairs

belonging to the same fold but not the same superfamily.

We use a five-fold cross-validation framework to evalu-

ate the performance of the various classifiers and regres-

sion models. We ensure that for each of the cross-valida-

tion runs, the sequence identity between sequences in the

test and train set is atmost 40%.

We created two datasets consisting of selected residue-

pairs that allow us to evaluate the reliability prediction and

fRMSD estimation problems within the context of the

applications, discussed in Problem Statement and Motiva-

tion. The first dataset, referred to as ‘‘DSR,’’ is designed to

evaluate the fRMSD estimation accuracy for the applica-

tion of creating position-to-position scoring matrices for

generating sequence alignments. It contains 28,788 residue-

pairs that were selected randomly from each protein pair.

The second dataset, reffered to as ‘‘DSA,’’ is designed to

evaluate the fRMSD estimation accuracy for the applica-

tion of assessing the target-template alignment quality and

for selecting high-quality fragments from different align-

ments. It contains 46,803 residue-pairs that correspond to

the aligned positions produced by the Smith-Waterman23

algorithm from each protein pair. These alignments were

computed using the sensitive PICASSO17,32 profile-

to-profile scoring function.

For each residue-pair p(xi,yj) in the DSR and DSA

datasets, we compute its fRMSD(xi,yj) score by consider-

ing fragments of length seven (i.e., we optimally superim-

posed vfrags with v 5 3). These fRMSD scores are used as

the true values to be estimated in case of the fRMSD esti-

mation problem. The fRMSD scores are also used to

define the two classes (reliable/unreliable) in case of the

reliability prediction problem. In particular, we define the

positive class (i.e., reliable residue-pairs) to contain all res-

idue-pairs whose fRMSD score is less than 0.75A, and the

negative class to contain the rest of the residue-pairs.

We perform a detailed analysis using different subsets of

the datasets to train and test the performance of the mod-

els. Specifically, we train four models using (i) protein

pairs sharing the same SCOP family, (ii) protein pairs

sharing the same superfamily but not the family, (iii) pro-

tein pairs sharing the same fold but not the superfamily,

and (iv) protein pairs from all the three levels. These four

models are denoted by fam, suf, fold, and all. We also

report performance numbers by splitting the test set in the

aforementioned four levels. These subsets allow us to eval-

uate the performance of the schemes for different levels of

sequence similarity. Table I shows the different statistics

for the all, fam, suf, and fold levels as well.

All the datasets are available at the supplementary

website for this article.y

Profile generation

To generate the profile matrices P and F , we ran PSI-

BLAST, using the following parameters (blastpgp 2j 5

2e 0.01 2h 0.01). The PSI-BLAST was performed against

yhttp://bioinfo.cs.umn.edu/supplements/fRMSDPred/

Predicting Local RMSD between Structural Fragments

PROTEINS 1009

Page 6: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

NCBI’s nr database that was downloaded in November of

2004 and contained 2,171,938 sequences.

Secondary structure prediction

We use the state-of-the-art secondary structure predic-

tion server called YASSPP26 (default parameters) to gen-

erate the S matrix. The values of the S matrix are the

output of the three one-versus-rest SVM classifiers

trained for each of the secondary structure elements.

Evaluation methodology

We measure the quality of the methods using the stand-

ard receiver operating characteristic (ROC) scores and the

ROC5 scores averaged across every protein pair. The ROC

score is the normalized area under the curve that plots the

true positives against the false positives for different

thresholds for classification.12 The ROCn score is the area

under the ROC curve up to the first n false positives. We

compute the ROC and ROC5 numbers for every protein

pair and report the average results across all the pairs and

cross-validation steps. We selected to report ROC5 scores

because each individual ROC-based evaluation is per-

formed on a per protein-pair basis, which, on average,

involves one to two hundred residue-pairs.

The regression performance is assessed by computing

the standard Pearson correlation coefficient (CC)

between the predicted and observed fRMSD values for

every protein pair. We also compute the root mean

square error rmse between the predicted and observed

fRMSD values for every protein pair. The results reported

are averaged across the different protein pairs and cross-

validation steps.

Profile-to-profile scoring schemes

To assess the effectiveness of our supervised learning

algorithms we compare their performance against that

obtained by using two profile-to-profile scoring schemes

to solve the same problems. Specifically, we use the pro-

file-to-profile scoring schemes to compute the similarity

between the aligned residue-pairs summed over the

length of their wmers. To assess how well these scores

correlated with the fRMSD score of each residue-pair we

compute their correlation coefficients. Note that since

residue-pairs with high-similarity score are expected to

have low fRMSD scores, good values for these correlation

coefficients will be close to 21. Similarly, for the reliabil-

ity prediction problem, we sort the residue-pairs in

decreasing similarity score order and assess the perform-

ance by computing ROC and ROC5 scores.

The two profile-to-profile scoring schemes that we

used are based on the dot-product and the PICASSO

score, both of which are used extensively and shown to

produce good results.15,16,32 The dot-product similarity

score is defined both for the profile- as well as the

secondary-structure-based information, whereas the

PICASSO score is defined only for the profile-based in-

formation. The profile-based dot-product similarity score

between residues xi and yj is given by hPX(i), PY(j)i.Similarly, the secondary-structure-based dot-product sim-

ilarity score is given by hSX(i), SY(j)i. The PICASSO

similarity score17,32 between residues xi and yj uses both

the P and F matrices and is given by hFX(i) PY(j) 1FY(j) PX(i)i. We will use Pdotp, Sdotp, and PF pic to

denote these three similarity scores, respectively.

Support vector machines

The classification and regression is done using the

publicly available support vector machine tool SVMlight37

that implements an efficient soft margin optimization

algorithm.

The performance of SVM and e-SVR depends on the

parameter that controls the trade-off between the margin

and the misclassification cost (‘‘C’’ parameter). In addi-

tion, the performance of e-SVR also depends on the

Table IDataset Statistics

Reliability prediction

all fam suf fold

Total < 0.75 (1) > 0.75 (2) Total < 0.75 (1) > 0.75 (2) Total < 0.75 (1) > 0.75 (2) Total < 0.75 (1) > 0.75 (2)

DSA 46,803 16,127 30,676 12,356 6393 5693 24,616 6691 17,625 9831 2743 7088DSR 28,788 1977 26,811 5311 341 4970 17,360 1212 16148 6117 424 5693

fRMSD estimation

Total l(fRMSD) r(fRMSD) Total l(fRMSD) r(fRMSD) Total l(fRMSD) r(fRMSD) Total l(fRMSD) r(fRMSD)

DSA 46,803 1.84 1.28 12,356 1.38 1.22 24,616 2.01 1.26 9831 2.07 1.26DSR 28,788 2.61 1.0 5311 2.62 1.01 17,360 2.62 1.0 6117 2.59 1.01

Total represents the total number of residue-pairs in the dataset. l(fRMSD) represents the mean fRMSD scores and r(fRMSD) represents the standard deviation.

< 0.75 (1) and > 0.75 (2) represent the number of residue-pairs in the positive and negative classes for the reliability prediction problem.

H. Rangwala and G. Karypis

1010 PROTEINS

Page 7: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

value of the deviation parameter e. We performed a lim-

ited number of experiments to determine good values for

these parameters. These experiments showed that C 50.1 and e 5 0.1 achieved consistently good performance

and was the value used for all the reported results.

Alignment evaluation

Dataset description

We evaluate the accuracy of the alignment schemes on

two datasets. The first dataset, referred to as the ce_ref

dataset, was used in a previous study to assess the per-

formance of different profile-profile scoring functions for

aligning protein sequences.14 The ce_ref dataset consists

of 581 alignment pairs having high structural similarity

but low sequence identity (�30%). The gold standard

reference alignment was curated from a consensus of two

structure alignment programs: FSSP38 and CE.24 The

second dataset, referred to as the mus_ref dataset, was

derived from the SCOP 1.57 database.39 This dataset

consists of 190 protein pairs with an average sequence

identity of 9.6%. Mustang25 was used to generate the

gold standard reference alignments.

To better analyze the performance of the different

alignment methods, we segmented each dataset based on

the pairwise sequence identities of the proteins that they

contain. We segmented the ce_ref dataset into four

groups, of sequence identities in the range of 6–12%, 12–

18%, 18–24%, and 24–30% that contained 15, 140, 203,

and 223 pairs of sequences, respectively. We segmented

the mus_ref dataset into three groups, of sequence identi-

ties in the range of 0–6%, 6–12%, and 12–30% that con-

tained 76, 67, and 47 pairs of sequences, respectively.

Note that the three groups of the mus_ref are highly cor-

related with the bottom three levels of the SCOP hierar-

chy, with most pairs in the first group belonging to the

same fold but different superfamily, most pairs in the

second group belonging to the same superfamily but dif-

ferent family, and most pairs in the third group belong-

ing to the same family.

Evaluation metric

We evaluate the quality of the various alignment

schemes by comparing the differences between the gener-

ated candidate alignment and the reference alignment

generated from structural alignment programs.14,40,41

As a measure of alignment quality, we use the Cline Shift

score (CS)42 to compare the reference alignments with

the candidate alignments. The CS score is designed to pe-

nalize both under- and over-alignment and crediting the

parts of the generated alignment that may be shifted by a

few positions relative to the reference alignment.11,14,42

The CS score ranges from a small negative value to 1.0,

and is symmetric in nature. We also assessed the per-

formance on the standard Modeler’s (precision) and

Developer’s (recall) score,40 but found similar trends to

the CS score and hence do not report the results here.

fRMSD estimation

To estimate the fRMSD scores for a residue-pair p(xi,yj)we use the fusion-based nsoe kernel with a concatenation

coding scheme, denoted as (P1S)conc in the study. The

results are reported for the regression model, trained on a

‘‘DSA’’ like dataset described earlier (Datasets). The align-

ment scheme using such a matrix is represented by frmsd.

Gap modeling and shift parameters

For all the different scoring schemes, we use a local

alignment framework with an affine gap model, and a

zero-shift parameter16 to maintain the necessary require-

ments for a good optimal alignment.43 We optimize the

gap modeling parameters (gap opening (go), gap exten-

sion (ge)), the zero shift value (zs), and weights on the

individual scoring matrices for integrating them to obtain

the highest quality alignments for each of the schemes.

Having optimized the alignment parameters on the ce_ref

dataset, we keep the alignment parameters unchanged for

evaluation on the mus_ref dataset.

RESULTS

We performed a comprehensive study evaluating the

classification and regression performance of the various

information sources, coding schemes, and kernel func-

tions (Methods) and compared it against the perform-

ance achieved by the profile-to-profile scoring schemes

(Profile-to-Profile Scoring schemes).

We performed a number of experiments using different

length wmers for both the SVM/e-SVR and profile-to-

profile-based schemes. These experiments showed that

the supervised learning schemes achieved the best results

when 5 � w � 7, whereas in the case of the profile-to-

profile scoring schemes, the best performing value of w

was dependent on the particular scoring scheme. For

these reasons, for all the SVM/e-SVR-based schemes we

only report results for w 5 6, whereas for the profile-to-

profile schemes we report results for the values of w that

achieved the best performance. For all our experiments

we compared the performance of the concatenation cod-

ing scheme with the pairwise coding scheme, and found

the concatenation coding scheme to always have better or

comparable results to the pairwise coding scheme. Hence,

we present results only for the concatenation coding

scheme.

RBF versus NSOE kernel functions

Table II compares the classification and regression per-

formance achieved by the standard rbf kernel against that

achieved by the nsoe kernel described in Kernel Functions

Predicting Local RMSD between Structural Fragments

PROTEINS 1011

Page 8: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

section. The rbf results were obtained after normalizing

the feature-vectors to unit length, as it produced substan-

tially better results over the unnormalized representation.

These results show that the performance achieved by

the nsoe kernel is consistently better than that of the rbf

kernel for the reliability prediction and fRMSD estima-

tion problems. For example, for the concatenation coding

scheme, the nsoe kernel achieves 3 and 1% better ROC5

values and its correlation coefficient is 3.5 and 13.4%

higher than the rbf kernel for the DSA and DSR datasets,

respectively.

The key difference between the two kernels is that in

the nsoe kernel the even-ordered terms are weighted

higher in the expansion of the infinite exponential series

than the rbf kernel. As discussed in Kernel Functions sec-

tion, this allows the nsoe kernel function to better capture

the pairwise dependencies that exists at different posi-

tions of each wmer.

In Table II, we also present results for the pairwise

coding scheme, and the regression and classification per-

formance is always poorer compared with the concatena-

tion coding scheme. This trend was verified across all the

other experiments, and hence we report results for the

concatenation coding scheme only.

Input information

Table III compares how the features derived from the

profiles and the predicted secondary structure impact the

performance achieved for the reliability prediction and

fRMSD estimation problems on the DSA and DSR data-

sets. The table presents results for the SVM-based

schemes using the concatenation coding scheme as well

as results obtained by the dot-product-based profile-to-

profile scoring scheme. (See Profile-to-Profile Scoring

schemes section for a discussion on how these scoring

schemes were used to solve the reliability prediction

problem and fRMSD estimation problem). Also, as dis-

cussed in Profile-to-Profile Scoring schemes section, the

scores computed by the profile-to-profile scoring schemes

should be negatively correlated with the fRMSD; thus,

negative correlations represent good estimations.

Table IIComparing the Classification and Regression Performance of the rbf

and nsoe Kernel Functions

Scheme

RP EST

ROC5 ROC CC rmse

DSA datasetPconc-all (rbf ) 0.560 0.875 0.555 0.974Pconc-all (nsoe) 0.577 0.881 0.575 0.946Ppair-all (rbf ) 0.518 0.859 0.532 0.976Ppair-all (nsoe) 0.525 0.858 0.535 0.968

DSR datasetPconc-all (rbf ) 0.387 0.531 0.297 0.953Pconc-all (nsoe) 0.392 0.531 0.337 0.933Ppair-all (rbf ) 0.316 0.495 0.247 0.963Ppair-all (nsoe) 0.323 0.494 0.250 0.960

The test and training set consisted of proteins from the all set. RP denotes the

reliability prediction results, and EST denotes the fRMSD estimation results. rbf

denotes the standard radial basis kernel function, and nsoe denotes the normalized

second order exponential kernel function introduced in this work. The numbers

in bold show the best performing schemes for each of the sub-tables.

Table IIIClassification and Regression Performance of the Individual and Fusion Kernels

Scheme

all fam suf fold

RP EST RP EST RP EST RP EST

ROC5 ROC CC rmse ROC5 ROC CC rmse ROC5 ROC CC rmse ROC5 ROC CC rmse

DSA datasetPdotp (6) 0.328 0.701 20.331 – 0.465 0.762 20.451 – 0.314 0.699 20.322 – 0.230 0.647 20.235 –Sdotp (3) 0.432 0.814 20.541 – 0.455 0.789 20.523 – 0.428 0.824 20.547 – 0.423 0.818 20.550 –(P 1 S)dotp (6) 0.339 0.713 20.350 – 0.470 0.769 20.460 – 0.327 0.712 20.342 – 0.240 0.664 20.263 –PF pic 1 Sdotp (3) 0.512 0.855 20.602 – 0.541 0.835 20.596 – 0.494 0.858 20.601 – 0.521 0.869 20.614 –Pconc-all 0.577 0.881 0.575 0.946 0.589 0.881 0.634 0.838 0.574 0.884 0.561 0.974 0.573 0.875 0.549 0.992Sconc-all 0.622 0.898 0.712 0.776 0.589 0.875 0.692 0.747 0.628 0.907 0.715 0.790 0.640 0.902 0.728 0.772(P 1 S)conc-all 0.666 0.914 0.733 0.744 0.657 0.902 0.727 0.699 0.665 0.919 0.733 0.762 0.675 0.916 0.743 0.745

DSR datasetPdotp (6) 0.152 0.355 20.094 – 0.148 0.336 20.083 – 0.143 0.369 20.093 – 0.174 0.344 20.101 –Sdotp (3) 0.403 0.541 20.521 – 0.392 0.512 20.499 – 0.398 0.561 20.516 – 0.425 0.525 20.550 –(P 1 S)dotp (6) 0.160 0.368 20.120 – 0.155 0.346 20.107 – 0.150 0.383 20.119 – 0.185 0.357 20.128 –PF pic 1 Sdotp (3) 0.363 0.528 20.457 – 0.355 0.491 20.416 – 0.355 0.551 20.453 – 0.387 0.514 20.499 –Pconc-all 0.392 0.531 0.337 0.933 0.398 0.512 0.339 0.950 0.390 0.556 0.321 0.926 0.393 0.495 0.365 0.938Sconc-all 0.436 0.537 0.623 0.717 0.436 0.521 0.587 0.758 0.426 0.549 0.620 0.712 0.460 0.525 0.662 0.694(P 1 S)conc-all 0.451 0.546 0.633 0.712 0.441 0.525 0.614 0.752 0.449 0.564 0.629 0.707 0.467 0.527 0.660 0.694

RP denotes the reliability prediction results, and EST denotes the fRMSD estimation results. The test set consisted of proteins from the fam, suf, fold and all sets,

whereas the training set used the all set. The numbers in parentheses for the profile-to-profile scoring schemes indicate the value of w for the wmers that were used.

The numbers in bold show the best performing schemes for each of the sub-tables.

H. Rangwala and G. Karypis

1012 PROTEINS

Page 9: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

Analyzing the results across the different SCOP-derived

test sets, it can be seen that the secondary structure cod-

ing schemes perform better compared to the profile-

based coding schemes. Moreover, the relative perform-

ance gap between secondary-structure- and profile-based

schemes tends to increase as we move from the super-

family- to the fold-derived set. These results should not

be surprising as the secondary structure of two wmers

(assuming that it is predicted correctly) provides better

information as to the structural similarity of their corre-

sponding fragments than profiles alone. Also note that

the only time that the profile-based coding scheme per-

forms comparably to the secondary structure-based cod-

ing is for the DSA fragment pairs that were derived from

the fam-level test set. This is due to the fact that these

fragments pairs tend to have high sequence similarity

which by itself translates to high structural similarity.

Table III also highlights the superior performance of the

SVM-based methods in comparison to the dot-product-

based profile-to-profile scoring schemes. Comparing the

ROC5 value for the reliability prediction problem, the

SVM-based scheme (Sconc) is about 43 and 8% better than

the secondary structure based dot-product scoring schemes

(Sdotp) on the DSA and DSR datasets, respectively.

The results obtained on the DSA and DSR datasets for

the reliability prediction and fRMSD estimation problem,

indicate that it is more challenging to perform the classi-

fication/estimation task for residue pairs that do not

come from local alignments.

Fusion kernels

Table III also shows the performance achieved by the

fusion kernels. For comparison purposes, this table also

shows the best results that were obtained by using the

profile-to-profile-based schemes to solve the reliability

prediction and fRMSD estimation problem. Specifically,

we present dot-product-based results that score each

wmer as the sum of its profile and secondary-structure

information ((P 1 S)dotp) and results that score each

wmer as the sum of its PICASSO score and a secondary-

structure-based dot-product score (PF pic 1 Sdotp).

From the results in Table III, we see that the SVM-

based schemes, regardless of their coding schemes, consis-

tently outperform the profile-to-profile scoring schemes.

Comparing the best results obtained by the concatenation

scheme against those obtained by the PF pic 1 Sdotp

scheme for the reliability prediction problem, we see that

the former achieves 29–34% and 20–26% higher ROC5

scores on the DSA and DSR datasets, respectively. Similar

improvements ranging between 20–21% and 32–47% were

obtained for the fRMSD estimation problem as well.

To further illustrate the performance difference

between the two schemes on the DSA dataset, Figures 1

and 2 plot the actual fRMSD scores against the estimated

fRMSD scores of (P 1 S)conc-all and the similarity scores

of PF pic 1 Sdotp, respectively. Comparing the two figures

we can see that the fRMSD estimations produced by the

e-SVR-based scheme are significantly better correlated

with those produced by PF pic 1 Sdotp.

Comparing the reliability prediction and fRMSD esti-

mation performance achieved by the fusion kernels with

that achieved by the individual nsoe kernels (Table III) we

see that by combing both profile and secondary structure

information we can achieve improvements in the range of

3% to 11% in terms of ROC5 and 1.2% to 7.0% in terms

of the rmse metrics. These performance improvements are

consistent across the different test sets (all, fam, suf, and

fold) and datasets (DSA and DSR).

Also, the simple dot-product based scoring of the sec-

ondary structure and profiles does not lead to an

improvement in the classification and regression result for

the DSR dataset, but using the fusion kernel always leads

to an improvement compared to the individual kernels.

Rather than using the unweighted combination of the

profile-based and secondary structure based kernels [Eq.

(8)], we also experimented changing the weighting on

the two kernels. Our experiments indicated that the best

performance for the reliability prediction classification

problem and the fRMSD estimation regression problem

was achieved by an equal weighting of the two kernels.

Additional considerations

We also experimented with variations in the input in-

formation, coding schemes and datasets. Our empirical

results showed that the use of non-position specific in-

formation, decayed weighting function, cascading models,

and SCOP-specific datasets did not improve the classifi-

cation and estimation performance. These results are not

included in this paper but are provided on our supple-

mentary website for this work.{

Position independent features

Motivated by the work in,26 we also derived a feature

space using the BLOSUM62 matrix. These scores corre-

spond to the position non-specific score for the various

residue positions. The primary motivation for the use of

the BLOSUM62 derived feature vectors was to improve

the classification accuracy in cases where the target

sequence does not have a sufficiently large number of ho-

mologous sequences in nr, and/or PSI-BLAST fails to

compute a correct alignment for some segments of the

sequence.26 By augmenting the input coding schemes to

contain both PSSM- as well as BLOSUM62-based infor-

mation, the SVM could potentially learn a model that is

also partially based on the non-position specific informa-

tion. This information will remain valid even in cases in

which PSI-BLAST could not generate correct alignments.

{http://bioinfo.cs.umn.edu/supplements/fRMSDPred/

Predicting Local RMSD between Structural Fragments

PROTEINS 1013

Page 10: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

Figure 1Scatter plot for test protein-pairs at all levels between estimated and actual fRMSD scores on the aligned dataset. The color coding represents the approximate density of

points plotted in a fixed normalized area.

Figure 2Scatter plot for test protein-pairs at all levels between profile-to-profile scores and actual fRMSD scores on the aligned dataset. The color coding represents the

approximate density of points plotted in a fixed normalized area.

H. Rangwala and G. Karypis

1014 PROTEINS

Page 11: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

Decay weighted features

In the coding schemes described above, the different

positions within the wmer contribute equally to the final

similarity score. We also experimented with a linearly

decaying weight function, such that the contribution of

each position in K2cs (p, p0) decreased linearly with

respect to its distance from the central residue-pairs.

Cascaded-level models

We also experimented with a second-level model that

incorporates the predictions obtained from the first-level

models and the original sequence information. This form

of cascaded models has found success in secondary struc-

ture prediction algorithms like PHD,44 PSIPRED45 and

YASSPP.26

SCOP-level trained models

To investigate the extent to which the classification

and regression performance can be improved by building

models that are specific to different levels of sequence

similarity we performed an experiment in which we

trained three different models using the fam, suf, and

fold subsets of the DSA dataset.

Comparing the reliability prediction and fRMSD esti-

mation performance of the (P 1 S)conc-all model i.e.,

trained using all the three levels of protein-pairs to the

specific SCOP-level trained (P 1 S)conc-fam, (P 1S)conc-suf and (P 1 S)conc-fold models the (P 1 S)conc-all trained model achieves the best or almost the best

performance across different test sets. This suggests that

there is little advanatge to be gained by building models

that are tuned to different levels of sequence similarity

(Hence, we do not report these results here.)

We also noticed that for the models trained on the dif-

ferent subsets of the data, the best performance for a par-

ticular subset is achieved by the model that was trained

on the same subset, i.e., for a test residue-pair belonging

to the fam subset, the best performance is achieved by a

model trained on the fam subset as well (similar is the

case for the suf and fold subsets).

Alignment results

We performed a comprehensive study to evaluate the

accuracy of the alignments obtained by the scoring

scheme derived from the estimated frmsd values against

those obtained by the prof and ss scoring schemes and

their combinations. These results are summarized in Fig-

ures 3 and 4, which show the accuracy performance of

the different scoring schemes on the ce_ref and mus_ref

datasets, respectively. The alignment accuracy is assessed

using the average CS scores across the entire dataset and

at the different pairwise sequence identity segments. To

better illustrate the differences between the schemes, the

results are presented relative to the CS score obtained by

the prof alignment and are shown on a log2 scale.

Analyzing the performance of the different scoring

schemes we see that most of those that utilize predicted

information about the protein structure (ss, frmsd, and

combinations involving them and prof) lead to substan-

tial improvements over the prof scoring scheme for the

low sequence identity segments. However, the relative

advantage of these schemes somewhat diminishes for the

segments that have higher pairwise sequence identities.

In fact, in the case of the 12–30% segment for mus_ref,

most of these schemes perform worse than prof. This

result is not surprising, and confirms our earlier discus-

sion in Introduction.

Comparing the ss and frmsd scoring schemes, we see

that the latter achieves consistently and substantially bet-

ter performance across the two datasets and sequence

identity segments. For instance, for the first segment of

ce_ref (sequence identities in the range of 6–12%),

frmsd’s CS score is 20% higher than that achieved by the

ss scoring scheme. In the first segment of mus_ref dataset

(sequence identity in the range of 0–6%), frmsd’s CS

score is 33% higher than achieved by the ss scoring

scheme, and is 19% higher for the second segment

(sequence identity in the range of 6–12%).

Comparing most of the schemes based on frmsd and

its combinations with the other scoring schemes we see

that for the segments with low sequence identities they

achieve the best results. Among them, the frmsd 1prof

scheme achieves the best results for ce_ref, whereas the

frmsd 1prof 1ss performs the best for mus_ref. For the

first segments of ce_ref and mus_ref, both of these

schemes perform 6.1% and 27.8% better than prof 1ss,

respectively, which is the best non-frmsd-based scheme.

Moreover, for many of these segments, the performance

achieved by frmsd alone is comparable to that achieved by

the prof 1ss scheme. Also, comparing the results obtained

by frmsd and frmsd 1ss we see that by adding information

about the predicted secondary structure the performance

does improve. In the case of the segments with somewhat

higher sequence identities, the relative advantage of frmsd

1prof diminishes and becomes comparable to PF pic 1ss.

Finally, comparing the overall performance of the vari-

ous schemes on the ce_ref and mus_ref datasets, we see

that frmsd 1prof is the overall winner as it performs the

best for ce_ref and similar to the best for mus_ref.

Comparison to other alignment schemes

Since the ce_ref dataset has been previously used to

evaluate the performance of various scoring schemes we

can directly compare the results obtained here with those

presented in Ref. 14. In particular, according to that

study, the best PSI-BLAST-profile based scheme achieved

a CS score of 0.805, which is considerably lower than the

CS scores of 0.854 and 0.845 obtained by the frmsd

1prof and prof 1ss, respectively.

Predicting Local RMSD between Structural Fragments

PROTEINS 1015

Page 12: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

Figure 3Relative CS Scores on the ce_ref dataset. For each segment we display the range of percent sequence identity, the number of pairs in the segment, and the average CS score

of the baseline prof scheme.

Figure 4Relative CS Scores on the mus_ref dataset. For each segment we display the range of percent sequence identity, the number of pairs in the segment, and the average CS

score of the baseline prof scheme.

H. Rangwala and G. Karypis

1016 PROTEINS

Page 13: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

Also, to ensure that the CS scores achieved by our

schemes on the mus_ref dataset are reasonable, we com-

pared them against the CS scores obtained by the state-

of-the-art CONTRALIGN46 and ProbCons47 schemes.

These schemes were run locally using the default parame-

ters. CONTRALIGN and ProbCons achieved average CS

scores of 0.197 and 0.174 across the 190 alignments,

respectively. In comparison the frmsd scheme achieved an

average CS score of 0.299, whereas frmsd 1prof achieved

an average CS score of 0.337.

RELATED RESEARCH

The problem of determining the reliability of residue-

pairs has been visited before in several different settings.

ProfNet36,48 uses artificial neural networks to learn a

scoring function to align a pair of protein sequences. In

essence, ProfNet aims to differentiate related and unre-

lated residue-pairs and also estimate the RMSD score

between these residue-pairs using profile information.

Protein pairs are aligned using STRUCTAL,49 residue-

pairs within 3A apart are considered to be related, and

unrelated residue-pairs are selected randomly from pro-

tein pairs known to be in different folds. A major differ-

ence between our methods and ProfNet is in the defini-

tion of reliable/unreliable residue-pairs and on how the

RMSD score between residue-pairs is measured. As dis-

cussed in Definitions and Notations, we measure the

structural similarity of two residues (fRMSD) by looking

at how well their vfrags structurally align with each other.

However, ProfNet only considers the proximity of two

residues within the context of their global structural

alignment. As such, two residues can have a very low

RMSD and still correspond to fragments whose structure

is substantially different. This fundamental difference

makes direct comparisons between the results impossible.

The other major differences lie in the development of

order independent coding schemes and the use of infor-

mation from a set of neighboring residues by using a

wmer size greater than zero.

The task of aligning a pair of sequences has also been

casted as a problem of learning parameters (gap opening,

gap extension, and position independent substitution

matrix) within the framework of discriminatory learn-

ing50,51 and setting up optimization parameters for an

inverse learning problem.52 Recently, pair conditional

random fields were also used to learn a probabilistic

model for estimating the alignment parameters (i.e., gap

and substitution costs).46

CONCLUSION

In this article we defined the fRMSD estimation and the

reliability prediction problems to capture the local struc-

tural similarity using only sequence-derived information.

We developed a machine-learning approach for solving

these problems by using a second-order exponential kernel

function to encode profile and predicted secondary struc-

ture information into a kernel fusion framework. Our

results showed that the fRMSD values of any general resi-

due-pairs can be predicted at a good level of accuracy.

We believe that this lays the foundation for using esti-

mated fRMSD values for producing sensitive and accu-

rate sequence alignments. Our promising results shown

on the aligned residue-pairs allows us to evaluate the

quality of target-template alignments and refine them.

We also evaluated the effectiveness of using estimated

fRMSD scores to aid in the alignment of protein sequen-

ces. Our results showed that the structural information

encoded in these estimated scores are substantially better

than the corresponding information in predicted second-

ary structures and when coupled with existing state-of-

the-art profile scoring schemes, they lead to considerable

improvements in aligning protein pairs with very low

sequence identities.

Note, however sequence alignment may be one of the

applications of our method. We would like to perceive

our method to be used for protein modeling irrespective

of the target difficulty (i.e whether it has an easy detecta-

ble known homolog). Our method, would be able to

rank or assign the fragments from different set of tem-

plates per position which could be later assembled to-

gether using a fragment assembly method like TASSER,27

or Rosetta.28

ACKNOWLEDGMENTS

We would like to express our deepest thanks to Profes-

sor Arne Elofsson and Dr. Tomas Ohlson for helping us

with datasets for the study. We would also like to thank

the anonymous reviewers for their suggestions to

improve the manuscript.

REFERENCES

1. Sanchez R, Sali A. Advances in comparative protein-structure mod-

elling. Curr Opin Struc Biol 1997;7:206–214.

2. Jones DT. Genthreader: an efficient and reliable protein fold recog-

nition method for genomic sequences. J Mol Biol 1999;287:797–

815.

3. Skolnick J, Kihara D. Defrosting the frozen approximation: Prospec-

tor—a new approach to threading. Proteins 2001;42:319–331.

4. Pillardy J, Czaplewski C, Liwo A, Lee J, Ripoll DR, Kazmierkiewicz

R, Oldziej S, Wedemeyer WJ, Gibson KD, Arnautova YA, Saunders

J, Ye YJ, Scheraga HA. Recent improvements in prediction of pro-

tein structure by global optimization of a potential energy function.

Proc Natl Acad Sci USA 2001;98:2329–2333.

5. Simons KT, Strauss C, Baker D. Prospects for ab initio protein

structural genomics. J Mol Biol 2001;306:1191–1199.

6. Schwede T, Kopp J, Guex N, Peltsch MC. Swiss-model: an auto-

mated protein homology-modeling server. Nucleic Acids Res 2003;

31:3381–3385.

7. Helen M. Berman, Bhat TN, Philip E. Bourne, Zukang Feng, Gary

Gilliland Helge Weissig, John Westbrook. The Protein Data Bank

Predicting Local RMSD between Structural Fragments

PROTEINS 1017

Page 14: fRMSDPred: Predicting local RMSD between structural fragments using sequence information

and the challenge of structural genomics. Nat Struct Biol 2000;7:

957–959.

8. Zhang Y, Skolnick J. The protein structure prediction problem

could be solved using the current pdb library. Proc Natl Acad Sci

USA 2005;1024:1029–1034.

9. Altschul SF, Gish W, Miller EW, Lipman DJ. Basic local alignment

search tool. J Mol Biol 1990;215:403–410.

10. William R. Pearson, David J. Lipman. Improved tools for biological

sequence comparison. Proc Natl Acad Sci 1988;85:2444–2448.

11. Rangwala H, Karypis G. Incremental window-based protein

sequence alignment algorithms. Bioinformatics 2007;23:e17–23.

12. Gribskov M, Robinson N. Use of receiver operating characteristic

(roc) analysis to evaluate sequence matching. Comput Chem 1996;

20:25–33.

13. Altschul SF, Madden LT, Schffer AA, Zhang J, Zhang Z, Miller W,

Lipman DJ. Gapped blast and psi-blast: a new generation of protein

database search programs. Nucleic Acids Res 1997;25:3389–402.

14. Edgar R, Sjolander K. A comparison of scoring functions for pro-

tein sequence profile alignment. Bioinformatics 2004;20:1301–1308.

15. Marti-Renom M, Madhusudhan M, Sali A. Alignment of protein

sequences by their profiles. Protein Sci 2004;13:1071–1087.

16. Wang G, Dunbrack RL JR. Scoring profile-to-profile sequence align-

ments. Protein Sci 2004;13:1612–1626.

17. Heger A, Holm L. Picasso: generating a covering set of protein fam-

ily profiles. Bioinformatics 2001;17:272–279.

18. Jones DT, Taylor WR, Thorton JM. A new approach to protein fold

recognition. Nature 1992;358:86–89.

19. Qiu J, Elber R. Ssaln: an alignment algorithm using structure-de-

pendent substitution matrices and gap penalties learned from struc-

turally aligned protein pairs. Proteins 2006;62:881–891.

20. Venclovas C. Comparative modeling in casp5: progress is evident,

but alignment errors remain a significant hindrance. Proteins 2003;

53:380–388.

21. Venclovas C, Margelevicius M. Comparative modeling in casp6

using consensus approach to template selection, sequence-structure

alignment, and structure assessment. Proteins 2005;7:99–105.

22. Needleman SB, Wunsch CD. A general method applicable to the

search for similarities in the amino acid sequence of two proteins.

J Mol Biol 1970;48:443–453.

23. Smith TF, Waterman MS. Identification of common molecular sub-

sequences. J Mol Biol 1981;147:195–197.

24. Shindyalov I, Bourne PE. Protein structure alignment by incremen-

tal combinatorial extension (ce) of the optimal path. Protein Eng

1998;11:739–747.

25. Konagurthu AS, Whisstock JC, Stuckey PJ, Lesk AM. Mustang: a

multiple structural alignment algorithm. Proteins 2006;64:559–574.

26. George Karypis. Yasspp: better kernels and coding schemes lead to

improvements in svm-based secondary structure prediction. Pro-

teins 2006;64:575–586.

27. Zhang Y, Arakaki AJ, Skolnick J. Tasser: an automated method for

the prediction of protein tertiary structures in casp6. Proteins 2005;

7:91–98.

28. Rohl CA, Strauss CEM, Misura KMS, Baker D. Protein structure

prediction using rosetta. Methods Enzymol 2004;383:66–93.

29. Joachims T. Text categorization with support vector machines: learn-

ing with many relevant features. In Proc. of the European Conference

on Machine Learning, 1998.

30. Vladimir N. Vapnik. The Nature of statistical learning theory. New

York: Springer Verlag; 1995.

31. Smola A, Scholkopf B. A tutorial on support vector regression.

NeuroCOLT2, NC2-TR-1998-030, 1998.

32. Mittelman D, Sadreyev R, Grishin N. Probabilistic scoring measures

for profile-profile comparison yield more accurate short seed align-

ments. Bioinformatics 2003;19:1531–1539.

33. Henikoff S, Henikoff JG. Amino acid subsitution matrices from

protein blocks. Proc Natl Acad Sci USA 1992;89:10915–10919.

34. Lanckriet GRG, Bie TD, Cristianini N, Jordan MI, Noble WS. A sta-

tistical framework for genomic data fusion. Bioinformatics 2004;

20:2626–2635.

35. Rangwala H, Karypis G. frmsdalign: protein sequence alignment

using predicted local structure information. Technical Report 07-

014, Department of Computer Science, University of Minnesota,

Minneapolis, 2007.

36. Ohlson T, Elofsson A. Profnet, a method to derive profile-profile

alignment scoring functions that improves the alignments of dis-

tantly related proteins. BMC Bioinform 2005;6:253.

37. Schlkopf B, Burges C, Smola A, editors. Making large-scale SVM

learning practical. Advances in Kernel methods — support vector

learning. Cambridge: MIT Press; 1999.

38. Holm L, Sander C. Mapping the protein universe. Science 1996;

273:595–602.

39. Murzin AG, Brenner SE, Hubbard T, Chothia C. Scop: a structural

classification of proteins database for the investigation of sequences

and structures. J Mol Biol 1995;247:536–540.

40. Sauder JM, Arthur JW, Dunbrack RL. Large-scale comparison of

protein sequence alignments with structural alignments. Proteins

2000;40:6–22.

41. Elofsson A. A study on protein sequence alignment quality. Proteins

2002;46:330–339.

42. Cline M, Hughey R, Karplus K. Predicting reliable regions in pro-

tein sequence alignments. Bioinformatics 2002;18:306–314.

43. Dan Gusfield. Algorithms on strings, trees, and sequences: com-

puter science and computational biology. New York, NY: Cambridge

University Press; 1997.

44. Rost B, Sander C. Prediction of protein secondary structure at bet-

ter than 70% accuracy. J Mol Biol 1993;232:584–599.

45. David T. Jones. Protein secondary structure prediction based on

position-specific scoring matricies. J Mol Biol 1999;292:195–202.

46. Do CB, Gross SS, Batzoglou S. Contralign: discriminative training

for protein sequence alignment. In Proceedings of the Tenth Annual

International Conference on Computational Molecular Biology

(RECOMB), 2006.

47. Do CB, Mahabashyam MSP, Batzoglou S. Probcons: probabilistic

consistency-based multiple sequence alignment. Genome Res

2005;15:330–340.

48. Ohlson T, Aggarwal V, Elofsson A, Maccallum R. Improved align-

ment quality by combining evolutionary information, predicted sec-

ondary structure and self-organizing maps. BMC Bioinform 2006;

1:357.

49. Gerstein M, Levitt M. Comprehensive assessment of automatic

structural alignment against a manual standard, the scop classifica-

tion of proteins. Protein Sci 1998;7:445–456.

50. Joachims T, Galor T, Elber R. Learning to align sequences: a maxi-

mum-margin approach. New Algorithms for Macromolecular Simu-

lation 2005;49:57–68.

51. Yu C, Joachims T, Elber R, Pillardy J. Support vector training of

protein alignment models. To appear in Proceeding of the Eleventh

International Conference on Research in Computational Molecular

Biology (RECOMB), 2007.

52. Sun F, Fernandez-Baca D, Yu W. Inverse parametric sequence align-

ment. Proceedings of the international computing and combinator-

ics conference (COCOON), 2002.

H. Rangwala and G. Karypis

1018 PROTEINS