

IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 4, NO. 3, SEPTEMBER 2005 255

Improved K-Means Clustering Algorithm for Exploring Local Protein Sequence Motifs Representing Common Structural Property

Wei Zhong, Gulsah Altun, Robert Harrison, Phang C. Tai, and Yi Pan*, Senior Member, IEEE

Abstract—Information about local protein sequence motifs is very important to the analysis of biologically significant conserved regions of protein sequences. These conserved regions can potentially determine the diverse conformation and activities of proteins. In this work, recurring sequence motifs of proteins are explored with an improved K-means clustering algorithm on a new dataset. The structural similarity of these recurring sequence clusters to produce sequence motifs is studied in order to evaluate the relationship between sequence motifs and their structures. To the best of our knowledge, the dataset used by our research is the most updated dataset among similar studies for sequence motifs. A new greedy initialization method for the K-means algorithm is proposed to improve traditional K-means clustering techniques. The new initialization method tries to choose suitable initial points, which are well separated and have the potential to form high-quality clusters. Our experiments indicate that the improved K-means algorithm satisfactorily increases the percentage of sequence segments belonging to clusters with high structural similarity. Careful comparison of sequence motifs obtained by the improved and traditional algorithms also suggests that the improved K-means clustering algorithm may discover some relatively weak and subtle sequence motifs, which are undetectable by the traditional K-means algorithms. Many biochemical tests reported in the literature show that these sequence motifs are biologically meaningful. Experimental results also indicate that the improved K-means algorithm generates more detailed sequence motifs representing common structures than previous research. Furthermore, these motifs are universally conserved sequence patterns across protein families, overcoming some weak points of other popular sequence motifs. The satisfactory result of the experiment suggests that this new K-means algorithm may be applied to other areas of bioinformatics research in order to explore the underlying relationships between data samples more effectively.

Index Terms—K-means clustering algorithm, protein structure, sequence motif.

Manuscript received November 20, 2004; revised February 28, 2005. This work was supported in part by the U.S. National Institutes of Health (NIH) under Grants R01 GM34766-17S1 and P20 GM065762-01A1, in part by the U.S. National Science Foundation (NSF) under Grants ECS-0196569 and ECS-0334813, in part by the Georgia Cancer Coalition, and in part by the Georgia Research Alliance. The work of W. Zhong is supported by a Georgia State University Molecular Basis of Disease Program Fellowship and by several NIH and NSF grants. The work of R. Harrison is supported by the NIH and the Georgia Cancer Coalition. The work of Y. Pan is supported by the NSF, the NIH, the National Natural Science Foundation of China (NSFC), the Air Force Office of Scientific Research (AFOSR), the Air Force Research Laboratory (AFRL), the Japan Society for Promotion of Science (JSPS), the International Information Science Foundation (IISF), and the states of Georgia and Ohio. Asterisk indicates corresponding author.

W. Zhong and G. Altun are with the Computer Science Department, Georgia State University, Atlanta, GA 30303-4110 USA (e-mail: [email protected]; [email protected]).

R. Harrison is with the Computer Science Department and the Biology Department, Georgia State University, Atlanta, GA 30303-4110 USA (e-mail: [email protected]).

P. C. Tai is with the Biology Department, Georgia State University, Atlanta, GA 30303-4110 USA (e-mail: [email protected]).

*Y. Pan is with the Computer Science Department, Georgia State University, Atlanta, GA 30303-4110 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNB.2005.853667

I. INTRODUCTION

UNDERSTANDING the relationship between protein structure and its sequence is one of the most important tasks of current bioinformatics research. Many biochemical tests suggest that a sequence determines conformation completely, because all the information that is necessary to specify protein interaction sites with other molecules is embedded into its amino acid sequence [23]. This close relationship between protein sequences and structures forms the theoretical basis for exploring the sequence motifs representing a strong common structure. Various studies show that a relatively small number of structurally or functionally conserved sequence regions occur in a large number of protein families. Sequence motifs obtained from functionally conserved sequence regions may be used to predict any subsequent reoccurrence of structural or functional areas on other proteins. These functional and structural areas may include enzyme-binding sites, prosthetic group attachment sites, or regions involved in binding other small molecules.

PROSITE [16], PRINTS [2], and BLOCKS [14] are three popular databases for sequence motifs. Core PROSITE sequence patterns are created from observation of short conserved sequences, which are experimentally proven to be significant to the biological function of certain protein families. Analysis of the three-dimensional (3-D) structure of PROSITE patterns suggests that recurrent sequence motifs imply common structure and function [24]. Fingerprints from PRINTS contain several motifs from different regions of multiple sequence alignments, increasing the discriminating power to predict the existence of similar motifs because identification of individual parts of the fingerprint is mutually conditional [2]. Since sequence motifs from PROSITE, PRINTS, and BLOCKS are developed from multiple sequence alignments, these sequence motifs capture only conserved elements of sequence alignments from the same protein family and carry little information about conserved sequence regions that transcend protein families. Furthermore, knowledge about the biologically important regions or residues is a precondition for finding these motifs.

1536-1241/$20.00 © 2005 IEEE


As a result, the discovery of sequence motifs and profiles requires intensive human intervention.

In many applications, researchers have little prior knowledge about data and have to make as few assumptions about the data as possible [18]. Under these restrictions, clustering algorithms are particularly suitable to discover the underlying relationship among the data samples and assess their common characteristics. Methods to produce the popular sequence motifs of PROSITE, PRINTS, and BLOCKS databases require human intervention to explore the biologically significant regions of protein sequences. In contrast, the clustering technique provides an automatic, unsupervised discovery process.

Han and Baker have used the K-means clustering program to find recurring local sequence motifs for proteins [12], [13]. In their work, a set of initial points for cluster centers is chosen randomly [12]. Since the performance of K-means clustering is very sensitive to initial point selection, their technique may not yield satisfactory results. Random selection often obtains either initial points that are close together or outliers of clusters, producing unsatisfactory partitions, since initial points need to be well separated to approximate each cluster in the sparse data space. To overcome the problem of random selection, we propose a new greedy algorithm to select suitable initial points in order to allow the K-means algorithm to converge to a better local minimum.

In our research, protein sequences are converted into sliding sequence segments. These sliding sequence segments are classified into different groups with the improved K-means clustering algorithm. The structural similarity of these groups is evaluated. The recurrent groups with high structural similarity become the candidates for generating sequence motifs representing a common structure. Our sequence motifs are represented by frequency profiles.

This paper is divided into six sections. In Section II, the major characteristics of the traditional and improved K-means algorithms are compared. In Section III, the experimental setup is explained. In Section IV, experimental results are presented to show that the improved K-means algorithm is better than the traditional K-means algorithm and to give evidence that our research finds some previously undiscovered sequence motifs. In Section V, our research is compared to other state-of-the-art approaches in order to emphasize the advantages of our research. In Section VI, the major highlights of this research are summarized.

II. K-MEANS CLUSTERING ALGORITHMS

In this section, we reveal weak points of the traditional K-means algorithms and analyze other researchers’ efforts to explore new initialization methods. Then, we propose the improved K-means clustering algorithm and explain its advantages.

A. Traditional K-Means Clustering Algorithm

K-means clustering is computationally efficient for large datasets with both numeric and categorical attributes [10]. For the traditional K-means clustering algorithm, K samples are chosen at random from the whole sample space to approximate centroids of initial clusters. The K-means clustering algorithm then iteratively updates the centers until no reassignment of patterns to new cluster centers occurs. In every step, each sample is allocated to its closest cluster center and cluster centers are reevaluated based on current cluster memberships [18]. Some researchers have adopted the K-means clustering algorithm to perform knowledge discovery in bioinformatics research [11], [37].
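The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the function names, the mean-based centroid update, the choice of the city block distance, the iteration cap, and the toy data are all assumptions made for the example.

```python
import random


def city_block(a, b):
    """City block (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(samples, k, max_iter=100, seed=0):
    """Plain K-means: random initial samples, then assign/update until stable."""
    rng = random.Random(seed)
    # Assumption: K samples drawn at random serve as the initial centers.
    centers = [list(s) for s in rng.sample(samples, k)]
    assignment = None
    for _ in range(max_iter):
        # Allocate each sample to its closest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: city_block(s, centers[c]))
            for s in samples
        ]
        if new_assignment == assignment:
            break  # no sample changed cluster, so the partition is stable
        assignment = new_assignment
        # Re-evaluate each center as the mean of its current members.
        for c in range(k):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Toy data (hypothetical): two well-separated groups of 2-D points.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(toy, k=2)
```

On data this clean, any random start recovers the two groups; the sensitivity to initial points discussed in this section only shows up on larger and sparser sample spaces.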

Random, Forgy, MacQueen, and Kaufman are four initialization methods for the K-means algorithm [30]. In these four initialization methods, the choice of initial data points defines a deterministic mapping from the initial partition to the results, since the K-means algorithm converges to a local minimum. Inappropriate choices of initial points in these four initialization methods may result in distorted or incorrect partitions, which are far from the globally optimal solution. A large percentage of data samples may be concentrated into a small number of clusters while the remaining clusters have very few samples. Due to restrictions of the current protein database design and the very large number of sequence segments generated by our protein dataset, it is impractical to implement Random, MacQueen, and Kaufman as the initialization method for the K-means clustering technique in our application. As a result, we choose Forgy as the initialization method for the traditional K-means clustering algorithm. The Forgy approach selects K samples from the database randomly as the representation for initial cluster centers [30]. In our paper, random selection of data samples refers to the Forgy approach.

Many efforts have been made to choose suitable initial clustering centers so that the algorithm is more likely to find the global minimum value [19], [39]. Special assumptions about the data distribution, which are the precondition for implementing these new initialization methods, are not appropriate to our application due to the complex underlying distribution patterns of our dataset. Therefore, we propose a new greedy initialization method for the K-means algorithm. This new initialization method does not depend on knowledge about the underlying distribution patterns of the dataset, which is its advantage over other available improved initialization methods for the K-means algorithm.

B. New Greedy Initialization Method for the K-Means Algorithm

To overcome potential problems of random initialization, the new greedy initialization method tries to choose suitable initial points so that final partitions can represent the underlying distribution of the data samples more consistently and accurately. Each initial point is represented by one local sequence segment. In the new initialization method, the clustering algorithm is performed for only several iterations during each run. After each run, initial points that form clusters with good structural similarity are chosen, and their distances to all points already selected in the initialization array are checked. If the minimum distance of a new point is greater than the specified distance, the point is added to the initialization array. Satisfying the minimum distance guarantees that each newly selected point is well separated from all the existing points in the initialization array and will potentially belong to a different natural cluster. This process is repeated until the specified number of points is chosen. After this procedure, these carefully selected points can be used as the initial centers for the K-means clustering algorithm.

Here is an example of how this new initialization method works. Let us suppose that a structural similarity threshold is given as 65% and the distance threshold is given as 1400. After three iterations, one initial point creates the cluster with the structural similarity of 67%, which is greater than the structural similarity threshold. As a result, this point will be one of the possible candidates. In the second step, the distance of this point against all the existing points in the initialization array is calculated. The minimum distance against all the existing points is 1439, which is greater than the distance threshold. Therefore, the point is added into the initialization array. This process will continue until the specified number of initial points for the K-means algorithm is chosen. The pseudocode for the initialization method of the improved K-means algorithm is given in the following:

WHILE (the number of initial points discovered is less than the total number of clusters)
{
    Run the traditional K-means algorithm for a fixed number of iterations on the whole sample space
    Assess the structural similarity of clusters created by each initial point
    IF (the structural similarity for one cluster is bigger than or equal to a given threshold)
    {
        Check the minimum distance of the point creating this cluster with existing points in the initialization array
        IF (the minimum distance is bigger than threshold)
            This new point is included into the initialization array
        END IF
    }
    END IF
}
END WHILE
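The pseudocode can be turned into a small runnable sketch. The version below is only an illustration under several assumptions: `greedy_init`, `score_cluster`, `short_iters`, and `max_runs` are hypothetical names, the structural similarity score is supplied by the caller as a callback, the centroid update is a plain mean, and a cap on the number of runs is added so that the loop always terminates.

```python
import random


def city_block(a, b):
    """City block (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def greedy_init(samples, k, score_cluster, sim_threshold, dist_threshold,
                short_iters=3, max_runs=50, seed=0):
    """Collect up to k well-separated initial points that seed good clusters.

    score_cluster is a caller-supplied function (an assumption of this
    sketch) that returns the structural similarity of a list of members.
    short_iters must be at least 1.
    """
    rng = random.Random(seed)
    chosen = []  # the initialization array
    for _ in range(max_runs):
        if len(chosen) >= k:
            break
        # Short run of the traditional algorithm from random starting points.
        seeds = rng.sample(samples, k)
        centers = [list(s) for s in seeds]
        for _ in range(short_iters):
            clusters = [[] for _ in range(k)]
            for s in samples:
                nearest = min(range(k), key=lambda c: city_block(s, centers[c]))
                clusters[nearest].append(s)
            for c, members in enumerate(clusters):
                if members:
                    centers[c] = [sum(col) / len(members) for col in zip(*members)]
        # Keep starting points whose clusters score well and that lie far
        # enough from every point already in the initialization array.
        for start, members in zip(seeds, clusters):
            if len(chosen) >= k:
                break
            if not members or score_cluster(members) < sim_threshold:
                continue
            if all(city_block(start, p) > dist_threshold for p in chosen):
                chosen.append(start)
    return chosen


# Toy usage (hypothetical data): every cluster is treated as structurally good.
toy = [(0.0,), (1.0,), (100.0,), (101.0,), (200.0,), (201.0,)]
init_points = greedy_init(toy, k=3, score_cluster=lambda members: 1.0,
                          sim_threshold=0.65, dist_threshold=50.0)
```

If fewer than k points are found within `max_runs`, the sketch returns a shorter list; how to fill the remaining slots is left open here.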

III. EXPERIMENT SETUP

In this section, we introduce experimental parameters, the dataset, and the method to generate and represent the sequence segments. Then, we discuss the cluster membership calculation for sequence segments and the structural similarity of a given cluster. Finally, we provide two measures in order to evaluate the performance of clustering algorithms.

A. Experimental Parameters

In this research, 800 initial clusters are chosen arbitrarily for the K-means clustering algorithm. Since the K-means clustering algorithm is very sensitive to starting points, the numerical stability of the cluster algorithm is estimated by performing K-means clustering five times with different random starting points. Only recurrent clusters come into the analysis of results. To save the computation time, the structural similarity threshold is set as 65% for all the experiments. Different minimum distances are used to evaluate their effects on clustering performance.

B. Dataset

The dataset used in this work includes 2290 protein sequences obtained from the Protein Sequence Culling Server (PISCES) [40]. In this protein database, the percentage identity cutoff is 25%, the resolution cutoff is 2.2 Å, and the R-factor cutoff is 1.0. No sequences of this database share more than 25% sequence identity. Since PISCES uses PSI-BLAST [1] alignments to distinguish many underlying patterns below 40% identity, PISCES produces a rigorous nonredundant database.

C. Generation and Representation of Sequence Segments

Sliding windows of ten successive residues are generated from protein sequences. Each window represents one sequence segment of ten continuous positions. Five hundred thousand sequence segments from 2290 protein sequences are produced by the sliding window method. These sequence segments of ten continuous positions are classified into different groups with the K-means algorithm. The frequency profiles are chosen to represent each sequence segment. The frequency profiles are defined in the homology-derived secondary structure of proteins (HSSP) [34]. Each sequence segment is represented by a 20 × 10 matrix. Twenty rows represent 20 amino acids and ten columns represent each position of the sliding window. For the frequency profile representation of sequence segments, each position of the matrix represents the frequency for a specified amino acid residue in a given sequence position for the multiple sequence alignment. The frequency profile defined in the HSSP is very important in exploring preferences and patterns for sequence analysis and in explaining structural roles of conserved residues.
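As an illustration of the sliding window method, the short Python sketch below extracts every window of ten consecutive positions and arranges the per-position frequency columns of one window into a 20 × 10 matrix. The function names, the toy sequence, and the uniform placeholder frequencies are assumptions for the example; real frequency columns would come from the HSSP profile of the protein.

```python
def sliding_windows(items, size=10):
    """Return every run of `size` consecutive items (residues or profile columns)."""
    return [items[i:i + size] for i in range(len(items) - size + 1)]


def window_to_matrix(columns):
    """Transpose `size` columns of 20 frequencies into a 20 x size matrix."""
    return [list(row) for row in zip(*columns)]


# Hypothetical 12-residue sequence: 12 - 10 + 1 = 3 windows.
windows = sliding_windows("ACDEFGHIKLMN")

# Placeholder profile: one column of 20 amino acid frequencies per position.
profile = [[0.05] * 20 for _ in range(12)]
matrices = [window_to_matrix(w) for w in sliding_windows(profile)]
```

A sequence shorter than the window size yields no segments, which matches the requirement of ten continuous positions.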

D. Distance Measure and Cluster Membership Calculation for Sequence Segments in the Sequence Space

Each centroid of a sequence cluster is represented by a 20 × 10 matrix. In the matrix representing the centroid of sequence clusters, each position of the matrix represents the frequency for a specified amino acid residue in a given window position of the sequence clusters. The city block metric is used for calculating the difference between a sequence segment and the centroid of a given sequence cluster. Han and Baker also chose the city block metric because of complications associated with the use of the Euclidean metric for clustering algorithms [12]. The following formula is used to calculate the distance between two sequence segments [12]:

$$\text{Distance} = \sum_{i=1}^{L} \sum_{j=1}^{N} \left| F_k(i,j) - F_c(i,j) \right|$$

where $L$ is the window size and $N$ is 20. $F_k(i,j)$ is the value of the matrix at row $i$ and column $j$ used to represent the sequence segment $k$. $F_c(i,j)$ is the value of the matrix at row $i$ and column $j$ used to represent the centroid of a given sequence cluster.

In our K-means clustering algorithm, a sequence segment is assigned to a specific cluster if the sequence segment has the minimum distance against the centroid of this specific cluster. The minimum distance between a sequence segment and its assigned cluster centroid might increase the possibility for accurately assigning this sequence segment to a cluster.
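The city block distance and the minimum-distance assignment rule can be sketched as follows. This is a hedged illustration: the function names are hypothetical, matrices are stored as nested lists, and tiny 2 × 2 matrices stand in for the 20 × 10 profiles to keep the example readable.

```python
def city_block_distance(segment, centroid):
    """Sum of absolute differences over all matrix positions."""
    return sum(
        abs(a - b)
        for row_s, row_c in zip(segment, centroid)
        for a, b in zip(row_s, row_c)
    )


def nearest_cluster(segment, centroids):
    """Index of the centroid with the minimum city block distance."""
    distances = [city_block_distance(segment, c) for c in centroids]
    return distances.index(min(distances))


# Toy matrices (hypothetical values).
segment = [[1.0, 0.0], [0.0, 1.0]]
centroid_a = [[0.0, 0.0], [0.0, 0.0]]  # distance 2.0 from segment
centroid_b = [[1.0, 0.0], [0.0, 0.5]]  # distance 0.5 from segment
assigned = nearest_cluster(segment, [centroid_a, centroid_b])
```

When two centroids are equally close, this sketch picks the one with the lower index; the text does not specify a tie-breaking rule.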

E. Secondary Structure Assignment

Definition for Secondary Structure of Proteins (DSSP) [20] is used for determining the secondary structure of proteins. In this paper, H represents helices, E represents sheets, and C represents coils.

F. Measure of Structural Similarity for a Given Cluster

The following formula calculates the level of structural similarity. Structural similarity for a given cluster (%) is [12]

$$\frac{\sum_{i=1}^{ws} \max\left(P_{i,H},\, P_{i,E},\, P_{i,C}\right)}{ws}$$

Here, $ws$ is the window size. $P_{i,H}$ is the frequency of occurrence of helices among the sequence segments for the cluster in position $i$. $P_{i,E}$ and $P_{i,C}$ are similarly defined. The secondary structure with the maximum frequency is used for representing the common structure in that position. The average of the maximum frequencies from all positions of a given window gives the structural similarity level for a given cluster. If the structural similarity for secondary structure within the cluster exceeds 70%, the cluster can be considered structurally similar [34]. If the structural similarity for secondary structure within the cluster is between 60% and 70%, the cluster can be considered weakly structurally similar.
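A worked example may help make the measure concrete. The sketch below assumes that each sequence segment in a cluster comes with a string of per-position secondary structure labels, using the single letters H, E, and C for helices, sheets, and coils; the function name and the toy cluster are hypothetical.

```python
def structural_similarity(structures):
    """Average, over window positions, of the most frequent state's share (in %)."""
    ws = len(structures[0])  # window size
    total = 0.0
    for pos in range(ws):
        column = [s[pos] for s in structures]
        # Frequency of the dominant state among H, E, and C at this position.
        total += max(column.count(state) for state in "HEC") / len(column)
    return 100.0 * total / ws


# Toy cluster of four segments with window size 3 (hypothetical labels).
# Position 0: H in 4/4; position 1: H in 3/4; position 2: C in 3/4.
cluster = ["HHE", "HHC", "HEC", "HHC"]
similarity = structural_similarity(cluster)  # (1.0 + 0.75 + 0.75) / 3 * 100
```

With these labels the cluster scores about 83.3%, which would place it above the 70% level used for structurally similar clusters.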

G. Evaluation of Performance for the Clustering Algorithm and Generation of Frequency Profiles for Sequence Motifs

The percentage of sequence segments belonging to clusters with high structural similarity and the number of clusters with high structural similarity are two measures to evaluate the performance of the clustering algorithm. In the section of experimental results, both measures are averaged over the results of five runs. An improved average percentage of sequence segments belonging to clusters with high structural similarity indicates that the clustering algorithm is more effective at classifying data with specified similar characteristics. If new sequence patterns are discovered from the increased number of clusters with high structural similarity, the clustering algorithm can reveal more underlying relationships between data samples. The percentage of sequence segments belonging to clusters with the structural similarity greater than 60% is calculated by dividing the sum of all sequence segments belonging to clusters with the structural similarity greater than 60% by the total number of sequence segments in the database.
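The two performance measures reduce to simple arithmetic, sketched below in Python. The function names and the toy numbers are hypothetical, and the use of the sample standard deviation when summarizing runs is an assumption, since the text does not say which form is used.

```python
from statistics import mean, stdev


def percent_segments_above(cluster_sizes, cluster_similarities, threshold,
                           total_segments):
    """Share (in %) of all segments that fall in clusters above the threshold."""
    covered = sum(
        size
        for size, sim in zip(cluster_sizes, cluster_similarities)
        if sim > threshold
    )
    return 100.0 * covered / total_segments


def clusters_above(cluster_similarities, threshold):
    """Number of clusters whose structural similarity exceeds the threshold."""
    return sum(1 for sim in cluster_similarities if sim > threshold)


def summarize_runs(values):
    """Average and sample standard deviation over repeated runs."""
    return mean(values), stdev(values)


# Toy run (hypothetical): three clusters holding 100 segments in total.
sizes = [50, 30, 20]
similarities = [72.0, 61.0, 55.0]
pct_60 = percent_segments_above(sizes, similarities, 60.0, 100)
pct_70 = percent_segments_above(sizes, similarities, 70.0, 100)
```

In this toy run, 80% of the segments sit in clusters above the 60% level and 50% in clusters above the 70% level.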

TABLE I
COMPARISON OF PERCENTAGE OF SEQUENCE SEGMENTS BELONGING TO CLUSTERS WITH HIGH STRUCTURAL SIMILARITY

During the process of generating frequency profiles for sequence motifs, the frequency for the specified amino acid residue in a given window position for a cluster is calculated by dividing the number of occurrences of the specified residue by the total number of residues in that position. Only recurrent clusters with high structural similarity from five runs are used to generate sequence motifs.
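The frequency calculation for a motif can be illustrated with a short sketch. It assumes that a cluster is available as a list of equal-length residue strings in the one-letter amino acid code; `frequency_profile` and the two-position toy cluster are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter code


def frequency_profile(segments):
    """For each window position, the fraction of segments with each residue."""
    ws = len(segments[0])  # window size
    profile = []
    for pos in range(ws):
        counts = Counter(seg[pos] for seg in segments)
        total = sum(counts.values())  # total number of residues at this position
        profile.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return profile


# Toy cluster of four segments with two positions (hypothetical).
motif = frequency_profile(["AC", "AD", "GC", "AC"])
```

Here alanine occurs in three of four segments at the first position, giving a frequency of 0.75, and each position's frequencies sum to one as long as only the 20 standard residues appear.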

IV. EXPERIMENTAL RESULTS

In this section, we compare the experimental results of the traditional and improved K-means algorithms. We also discuss the sequence motifs generated by the improved K-means algorithm and use biochemical experiments published in the literature to support the biological meaning of our sequence motifs.

A. Comparison of Performance for the Traditional and Improved K-Means Algorithm

In Table I, the average percentage of sequence segments belonging to clusters with high structural similarity for the traditional and improved K-means algorithm is given.

The first column of Table I shows the algorithm with different parameters. “Traditional” refers to the traditional K-means algorithm, which randomly selects the initial points from the whole sample space. “New 1100” denotes the improved K-means algorithm, which chooses initial points that can potentially form clusters with good structural similarity. The minimum distance among these points in the initialization array is at least 1100. “New 1200,” “New 1300,” “New 1400,” and “New 1500” are similarly defined. The second column of Table I gives the average percentage of sequence segments belonging to clusters with the structural similarity greater than 60% from five runs. The third column of Table I gives the standard deviation of the percentage of sequence segments belonging to clusters with the structural similarity greater than 60%. The fourth column of Table I gives the average percentage of sequence segments belonging to clusters with the structural similarity greater than 70% from five runs. The fifth column of Table I gives the standard deviation of the percentage of sequence segments belonging to clusters with the structural similarity greater than 70%.

Our experimental results show that an average of 40 clusters out of 800 clusters are empty after the first iteration of the traditional K-means algorithm with random selection of initial points. Further analysis indicates that most initial points that create these 40 clusters come from outliers of clusters. Outliers of clusters refer to sequence segments that are far away from centers of natural clusters. Analysis of the clustering process of the traditional clustering algorithm also reveals that some of the initial points are very close to each other, creating strong interference with each other. Strong interference among initial points affects final partitioning negatively. The results of Table I show that the average percentage of sequence segments belonging to clusters with the structural similarity greater than 60% steadily improves with increasing minimum distances among initial points. This improved percentage results from decreased interference among initial points when the distances among initial points are increased. The average percentage performance of “New 1100” is worse than that of the traditional K-means algorithm as a result of strong interference among initial points, which are too close to each other. “New 1500” increases the average percentage of sequence segments belonging to clusters with the structural similarity greater than 60% by almost 4.5% and improves the average percentage of sequence segments belonging to clusters with the structural similarity greater than 70% by 1.4%. Furthermore, “New 1500” reduces the standard deviation for the percentage of sequence segments belonging to clusters with the structural similarity greater than 60%. The increased average percentage and decreased standard deviation suggest that the improved K-means algorithm performs better and more consistently than the traditional algorithm because the improved K-means algorithm avoids outliers of clusters and keeps initial points as far apart as possible.

TABLE II
COMPARISON OF NUMBER OF CLUSTERS WITH HIGH STRUCTURAL SIMILARITY

Table II shows the number of clusters exceeding given structural similarity thresholds for the traditional and improved K-means algorithms. The first column of Table II is the same as that of Table I. The second column shows the average number of clusters with structural similarity greater than 60% from five runs, and the third column shows the corresponding standard deviation. The fourth column shows the average number of clusters with structural similarity greater than 70% from five runs, and the fifth column shows the corresponding standard deviation. “New 1500” increases the average number of clusters with structural similarity greater than 60% by 42. Comparison between the sequence motifs obtained by both algorithms suggests that the improved K-means clustering algorithm may discover some relatively weak and subtle sequence motifs. These motifs are undetectable by the traditional K-means algorithm because random selection may place two starting points within one natural cluster. For example, some of the weak amphipathic helices and sheets are not discovered by the traditional K-means algorithm. In addition, the traditional K-means algorithm finds fewer repeated substitution patterns among the sequence motifs than the improved K-means algorithm does.

B. Sequence Motifs

The motif tables for Patterns 1–27 are given in Tables III–V. Each sequence motif table uses the following format.

The average number of sequence segments used to generate the given motif and their average structural similarity are indicated above the columns of each motif table.

• The first column of each motif table shows the position of the amino acid profile within the local sequence motif, which spans ten consecutive positions.

• The second column of each motif table shows the types of amino acids at the given position. Amino acids appearing with a frequency greater than 10% are shown in upper case, which emphasizes their high occurrence rate at that position. Amino acids appearing with a frequency between 8% and 10% are shown in lower case.

• The third column shows the variability, that is, the number of amino acids occurring with a frequency greater than 5%.

• The fourth column indicates the hydrophobicity index, which is the sum of the frequencies of occurrence of alanine, valine, isoleucine, leucine, methionine, proline, phenylalanine, and tryptophan.

• The fifth column indicates the representative secondary structure at that position.
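As an illustration of the table format above, the following Python sketch derives the second, third, and fourth columns for a single position from a frequency profile. The function name and input layout are assumptions, and the 8% and 10% boundaries are treated as inclusive for the lower-case band, which the text does not specify.

```python
# One-letter codes for Ala, Val, Ile, Leu, Met, Pro, Phe, Trp (from the text).
HYDROPHOBIC = set("AVILMPFW")


def motif_row(freqs):
    """Summarize one profile position.

    freqs maps one-letter amino acid codes to frequencies in [0, 1].
    Returns (amino acid string, variability, hydrophobicity index).
    """
    # Most frequent first; ties broken alphabetically for a stable order.
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    upper = "".join(aa for aa, f in ranked if f > 0.10)
    # Assumption: the 8%-10% band includes both end points.
    lower = "".join(aa.lower() for aa, f in ranked if 0.08 <= f <= 0.10)
    variability = sum(1 for f in freqs.values() if f > 0.05)
    hydrophobicity = sum(f for aa, f in freqs.items() if aa in HYDROPHOBIC)
    return upper + lower, variability, hydrophobicity
```

For a hypothetical position with L at 30%, A at 20%, E at 9%, K at 6%, and G at 4%, the row would read `LAe` with a variability of 4 and a hydrophobicity index of 0.5.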

More than 170 local sequence motifs indicating common structure are discovered in this study. These 170 sequence motifs have been grouped into 27 major patterns according to their common characteristics, and one representative of each group is chosen to show the sequence pattern of that group. Since the statistics of the structural database indicate that the average length of helices is ten, 70% of the sequence motifs generated by the K-means clustering algorithm with a window size of ten are related to helices. Analysis of related biochemical studies indicates that the patterns obtained by the K-means algorithm may play vital roles in the intramolecular and intermolecular interactions that determine the structure and function of proteins. Furthermore, analysis of these sequence motifs provides important insight into the degree to which changes in the primary sequence are tolerated. This knowledge can help us understand structurally conservative substitutions among the 20 amino acids during the evolutionary process.

Patterns 4, 5, and 6 contain conserved glutamic acid, lysine, or serine. Glutamic acid and lysine are charged residues whose side chains act as relatively strong organic acids and bases, whereas serine is polar but uncharged. As a result, the charged residues can establish ionic bonds with other charged molecules in the cell and play important roles in catalysis and salt bridges [4]. These charged amino acids are also important in determining the characteristics of protein surfaces, which act as the major functional locations for many proteins [33].

TABLE III
MOTIF TABLES FOR PATTERNS 1–9

TABLE IV
MOTIF TABLES FOR PATTERNS 10–18

TABLE V
MOTIF TABLES FOR PATTERNS 19–27

The hydrophobic property of amino acid side chains can affect protein conformation and function [26], [41]. Thermodynamics shows that polar or hydrophilic residues tend to lie on the protein surface, where they interact with the surrounding water, whereas nonpolar residues tend to gather within the interior of most soluble proteins, associating with one another through van der Waals forces and hydrophobic interactions [4], [25], [31]. These hydrophobic interactions among nonpolar residues increase the overall stability of the protein. In many enzymes, reactive polar residues can move into the nonpolar interior in order to increase chemical reaction between polar groups [23]. Since the level of hydrophobicity plays an important role in determining the structure and activities of proteins, special attention has been paid to analyzing the hydrophobic and hydrophilic patterns of the sequence motifs.

Many patterns related to helices, such as Patterns 7, 8, and 9, show pronounced amphipathicity, since amphipathic helices are one of the common structural motifs in proteins [36]. In a soluble protein, the hydrophobic face of a helix is buried in the protein interior and the polar face projects into the polar surroundings [4]. Patterns 7, 8, and 9 show that hydrophobic amino acids are regularly arranged three or four positions apart. Amphipathic helices were first found in myoglobin. Several methods have been proposed to identify these amphipathic helices [8], [35], and their possible functions have been experimentally tested [6], [22]. Peptides that show amphipathic structural motifs have been widely adopted as model systems for understanding problems associated with protein folding and stability [5], [28].
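The three-or-four-position spacing mentioned above can be checked with a rough Python sketch. This is only an illustrative heuristic, not one of the published methods in [8] or [35]: it works on a single hypothetical consensus string rather than a frequency profile, and the function name and scoring rule are assumptions.

```python
# Hydrophobic residues as listed for the hydrophobicity index in this paper.
HYDROPHOBIC = set("AVILMPFW")


def spaced_partner_fraction(segment):
    """Fraction of hydrophobic positions that have another hydrophobic
    residue exactly three or four positions before or after them."""
    pos = {i for i, aa in enumerate(segment) if aa in HYDROPHOBIC}
    if not pos:
        return 0.0
    paired = sum(
        1 for i in pos if any((i + d) in pos for d in (-4, -3, 3, 4))
    )
    return paired / len(pos)
```

A value close to 1 indicates that the hydrophobic residues follow the i, i+3, i+4 periodicity expected when one face of a helix is hydrophobic, while a contiguous hydrophobic run scores low.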

Pattern 11 shows helices with very high hydrophobicity. This suggests that Pattern 11 may be located in the core of proteins, with its NH and CO groups linked by hydrogen bonding. Pattern 10 reveals amphiphilic helices with very low hydrophobicity, which may point toward the polar solvent. Amphiphilic helices may determine the functions of representative apolipoproteins, peptide toxins, and peptide hormones [21]. By increasing the amphiphilicity of the structurally significant regions of the molecule, the biological activity of a peptide can surpass that of the naturally occurring polypeptide [21]. As a result, amphiphilic helices are very important for protein design projects [6].

Many patterns associated with coils show very low hydrophobicity. Coils are located on the surfaces of proteins and are sometimes involved in chemical interactions between proteins and other molecules [4], [17]. Many patterns associated with sheets have high levels of hydrophobicity, since hydrophobic amino acids are statistically preferred in the sheet structure [17], [27]. Pattern 14 shows an interesting alternation of hydrophobic and polar residues. Patterns 15 and 16 indicate sheet-coil motifs with a clear hydrophobicity transition.

Pattern 17 illustrates coils containing conserved glycines in several positions. Many other patterns also contain conserved glycine residues. The side chain of glycine consists of a single hydrogen atom. This lack of a bulky side chain allows the protein backbone to move and to approach other backbones very closely [23]. As a result, it is worthwhile to study the position of conserved glycine in the sequence patterns. Patterns 18 and 19 contain conserved proline residues in several positions. Proline does not easily fit into an ordered secondary structure because its ring structure increases the restriction on its conformation [4]. As a result, the frequency of proline is low for patterns related to helices and sheets and high for patterns related to coils. Patterns 20 and 21 provide helix-coil motifs. Transitional regions between helices and coils contain conserved glycine, since glycine favors disruption of helices. Helix-termination rules of thumb indicate that helix termination by glycine and proline is to be expected [3].

Many patterns, such as Patterns 22, 23, and 26, also show very similar substitution patterns at several positions. These similarities can provide insight into conserved substitutions that preserve the structure of proteins.

V. DISCUSSION

In this section, we compare our work with other state-of-the-art approaches and discuss future work that may enhance our research.

A. Result Comparison With Other Research

Our results reveal much more detailed hydrophobicity patterns for helices, sheets, and coils than the previous studies [12], [13]. These elaborate hydrophobicity patterns are supported by various biochemical experiments published in the literature. Increased information about the hydrophobicity patterns associated with these sequence motifs can expand our knowledge of how proteins fold and how proteins interact with each other. Furthermore, the analysis of the discovered sequence motifs shows that some elaborate and subtle sequence patterns, such as Patterns 1, 9, and 22, have never been reported in previous work. In particular, the increased number of repeated substitution patterns reported in this study may provide additional strong evidence for structurally conservative substitutions during the evolutionary process for protein families.

The sequence motifs discovered in this study indicate conserved residues that are structurally and functionally important across protein families, because the protein sequences used in this study share less than 25% sequence identity. These features of our sequence motifs may help to compensate for some of the weak points of the motifs created by PROSITE, PRINTS, and BLOCKS. Our sequence motifs may reflect general structural or functional characteristics shared by different protein families, while sequence motifs from PROSITE, PRINTS, and BLOCKS represent structural or functional constraints specific to a particular protein family.

B. Potential Improvements

In this study, we have focused on designing and testing the improved K-means algorithm, and a sliding window of size ten is used to generate the sequence segments. In the next step, the size of the sliding window can be changed in order to produce sequence segments with lengths ranging from 3 to 15 residues. Applying the improved K-means clustering algorithm to sequence segments with lengths from 3 to 15 may produce further important sequence motifs that reflect common secondary and tertiary structure.
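The sliding-window segmentation described above can be sketched in Python as follows. The function names are hypothetical; the input may be a plain amino acid string or a list of per-position profile rows, since slicing works on both.

```python
def sliding_segments(sequence, window):
    """Return every contiguous segment of the given window length.

    A sequence shorter than the window yields an empty list.
    """
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]


def segments_by_length(sequence, min_len=3, max_len=15):
    """Map each window length in [min_len, max_len] to its list of segments."""
    return {w: sliding_segments(sequence, w) for w in range(min_len, max_len + 1)}
```

A sequence of length n yields n - w + 1 segments for window size w, so shorter windows enlarge the sample space that the clustering step has to process.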

The structural similarity of clusters is evaluated based on the secondary structure in this study. Common tertiary structures of the given clusters are implied but are not evaluated explicitly, so that our work can be compared with other similar research. In the next step, the 3-D structural similarity of clusters will be analyzed. Comparison between the common secondary and tertiary structure of a given cluster may give us important insight into the relationship between secondary and tertiary structure.
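A secondary-structure-based similarity score for a cluster can be sketched as below. The sketch assumes the score is the average, over window positions, of the frequency of the most common secondary structure class (H, E, or C) at each position; this definition and the function name are assumptions for illustration and are not taken from this section.

```python
from collections import Counter


def structural_similarity(structures):
    """Secondary-structure similarity of one cluster.

    structures is a non-empty list of equal-length strings over 'H', 'E',
    and 'C', one string per sequence segment in the cluster. Returns a
    value in (0, 1].
    """
    window = len(structures[0])
    total = 0.0
    for i in range(window):
        # Frequency of the dominant secondary structure at position i.
        counts = Counter(s[i] for s in structures)
        total += counts.most_common(1)[0][1] / len(structures)
    return total / window
```

Under this assumed definition, a cluster whose segments all share the same secondary structure at every position scores 1.0, and the 60% and 70% thresholds used in the tables correspond to scores of 0.6 and 0.7.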

In our study, the cluster number of 800 is chosen arbitrarily, and 800 may not be the optimal cluster number. Therefore, the improved K-means algorithm will be run several times with different values of K in order to discover the most suitable number of clusters. With the optimal cluster number, the clustering results may come closest to the underlying distribution patterns of the sample space. However, the time spent searching for good initial points grows substantially when the minimum distance and the structural similarity threshold are increased. For example, it takes 18 days to obtain appropriate initial points with a distance threshold of 1500 in the very large sample space. Due to time and processing power constraints, the search for the optimal cluster number has not been completed. The long search time for initial points motivates us to implement a parallel K-means algorithm in order to reduce the search time for suitable initial points to one to two days. Parallelization of the improved K-means algorithm will make exploration of the optimal cluster number possible. We predict that the performance gains of the improved K-means algorithm will increase further once the optimal cluster number is found.

VI. CONCLUSION

In this study, a new initialization method for the K-means algorithm has been proposed to solve problems associated with random selection of initial points. The new initialization method tries to choose suitable initial points that are well separated and have the potential to form high-quality clusters. Many biochemical tests published in the literature indicate that the discovered sequence motifs are biologically meaningful. Analysis of the sequence motifs also shows that the improved K-means algorithm may detect some very subtle sequence motifs overlooked by the traditional algorithm. The experimental results show that the improved K-means clustering technique is effective in classifying data with specified similar biological characteristics and in discovering the underlying relationships among the data samples. The sequence motifs discovered across protein families may overcome the shortcomings of other popular sequence motifs. The most updated dataset from PISCES is used for the first time to create sequence motifs. Because the dataset from PISCES has several advantages over other existing databases, the sequence motifs discovered in this process can reveal more evolutionarily meaningful patterns than other studies. Since the K-means algorithm is a very powerful tool for data mining problems, the improved K-means algorithm may be useful for other important bioinformatics applications.

ACKNOWLEDGMENT

The authors would like to thank Prof. R. L. Dunbrack for providing the dataset from PISCES.

REFERENCES

[1] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nucleic Acids Res., vol. 25, no. 17, pp. 3389–3402, 1997.

[2] T. K. Attwood, M. Blythe, D. R. Flower, A. Gaulton, J. E. Mabey, N. Maudling, L. McGregor, A. Mitchell, G. Moulton, K. Paine, and P. Scordis, “PRINTS and PRINTS-S shed light on protein ancestry,” Nucleic Acids Res., vol. 30, no. 1, pp. 239–241, 2002.

[3] R. Aurora and G. D. Rose, “Helix capping,” Protein Sci., vol. 7, no. 1, pp. 21–38, 1998.

[4] J. M. Berg, J. L. Tymoczko, and L. Stryer, Biochemistry, 5th ed. New York: W.H. Freeman, 2002, pp. 53–70.

[5] Y. Chen, C. T. Mant, and R. S. Hodges, “Determination of stereochemistry stability coefficients of amino acid side-chains in an amphipathic α-helix,” J. Peptide Res., vol. 59, pp. 18–33, 2002.

[6] W. F. DeGrado, “Design of peptides and proteins,” Adv. Protein Chem., vol. 39, pp. 51–124, 1988.

[7] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, U.K.: Cambridge Univ. Press, 1998.

[8] J. Finer-Moore and R. M. Stroud, “Amphipathic analysis and possible formation of the ion channel in an acetylcholine receptor,” Proc. Nat. Acad. Sci. USA, vol. 81, no. 1, pp. 155–159, 1984.

[9] D. Frishman and P. Argos, “Knowledge-based protein secondary structure assignment,” Proteins Struct. Funct. Genet., vol. 23, pp. 566–579, 1995.

[10] S. K. Gupta, K. S. Rao, and V. Bhatnagar, “K-means clustering algorithm for categorical attributes,” in Proc. Data Warehousing and Knowledge Discovery (DaWaK-99), pp. 203–208.

[11] V. Guralnik and G. Karypis, “A scalable algorithm for clustering protein sequences,” in Proc. Workshop Data Mining in Bioinformatics (BIOKDD), 2001, pp. 73–80.

[12] K. F. Han and D. Baker, “Recurring local sequence motifs in proteins,” J. Mol. Biol., vol. 251, no. 1, pp. 176–187, 1995.

[13] K. F. Han and D. Baker, “Global properties of the mapping between local amino acid sequence and local structure in proteins,” Proc. Nat. Acad. Sci. USA, vol. 93, no. 12, pp. 5814–5818, 1996.

[14] S. Henikoff, J. G. Henikoff, and S. Pietrokovski, “Blocks+: A nonredundant database of protein alignment blocks derived from multiple compilations,” Bioinformatics, vol. 15, no. 6, pp. 471–479, 1999.

[15] U. Hobohm, M. Scharf, R. Schneider, and C. Sander, “Selection of representative protein data sets,” Protein Sci., vol. 1, no. 3, pp. 409–417, 1992.

[16] N. Hulo, C. J. A. Sigrist, V. Le Saux, P. S. Langendijk-Genevaux, L. Bordoli, A. Gattiker, E. De Castro, P. Bucher, and A. Bairoch, “Recent improvements to the PROSITE database,” Nucleic Acids Res., vol. 32, Database issue, pp. D134–D137, 2004.

[17] E. G. Hutchinson and J. M. Thornton, “A revised set of potentials for β-turn formation in proteins,” Protein Sci., vol. 3, no. 12, pp. 2207–2216, 1994.

[18] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.

[19] A. Juan and E. Vidal, “Comparison of four initialization techniques for the K-medians clustering algorithm,” in Proc. 8th Int. Workshop Structural and Syntactic Pattern Recognition, 3rd Int. Workshop Statistical Techniques in Pattern Recognition, vol. 1876, 2000, pp. 842–852.

[20] W. Kabsch and C. Sander, “Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features,” Biopolymers, vol. 22, pp. 2577–2637, 1983.

[21] E. T. Kaiser and F. J. Kézdy, “Secondary structures of proteins and peptides in amphiphilic environments (a review),” Proc. Nat. Acad. Sci. USA, vol. 80, no. 4, pp. 1137–1143, 1983.

[22] E. T. Kaiser and F. J. Kézdy, “Amphiphilic secondary structure: Design of peptide hormones,” Science, vol. 223, pp. 249–255, 1984.

[23] G. Karp, Cell and Molecular Biology (Concepts and Experiments), 3rd ed. New York: Wiley, 2002, pp. 52–65.

[24] A. Kasuya and J. M. Thornton, “Three-dimensional structure analysis of PROSITE patterns,” J. Mol. Biol., vol. 286, no. 5, pp. 1673–1691, 1999.

[25] W. Kauzmann, “Some factors in the interpretation of protein denaturation,” Adv. Protein Chem., vol. 14, pp. 1–63, 1959.

[26] J. Kyte and R. F. Doolittle, “A simple method for displaying the hydropathic character of a protein,” J. Mol. Biol., vol. 157, pp. 105–132, 1982.

[27] S. Lifson and C. Sander, “Antiparallel and parallel beta-strands differ in amino acid residue preferences,” Nature, vol. 282, no. 5734, pp. 109–111, 1979.

[28] C. T. Mant, N. E. Zhou, and R. S. Hodges, “The role of amphipathic helices in stabilizing peptide and protein structure,” in The Amphipathic Helix, R. M. Epand, Ed. Boca Raton, FL: CRC, 1993.

[29] T. Noguchi, H. Matsuda, and Y. Akiyama, “PDB-REPRDB: A database of representative protein chains from the protein data bank (PDB),” Nucleic Acids Res., vol. 29, no. 1, pp. 219–220, 2001.

[30] J. M. Peña, J. A. Lozano, and P. Larrañaga, “An empirical comparison of four initialization methods for the K-means algorithm,” Pattern Recognit. Lett., vol. 20, no. 10, pp. 1027–1040, 1999.

[31] P. L. Privalov, “Thermodynamics of protein folding,” J. Chem. Thermodyn., vol. 29, pp. 447–474, 1997.

[32] F. M. Richards and C. E. Kundrot, “Identification of structural motifs from protein coordinate data: Secondary structure and first-level supersecondary structure,” Proteins Struct. Funct. Genet., vol. 3, pp. 71–84, 1988.

[33] A. D. Robertson, “Intramolecular interactions at protein surfaces and their impact on protein function,” Trends Biochem. Sci., vol. 27, pp. 521–526, 2002.

[34] C. Sander and R. Schneider, “Database of homology-derived protein structures and the structural meaning of sequence alignment,” Proteins Struct. Funct. Genet., vol. 9, no. 1, pp. 56–68, 1991.

[35] M. Schiffer and A. B. Edmundson, “Use of helical wheels to represent the structures of proteins and to identify segments with helical potential,” Biophysical J., vol. 7, no. 2, pp. 121–135, 1967.

[36] J. P. Segrest, H. De Loof, J. G. Dohlman, C. G. Brouillette, and G. M. Anantharamaiah, “Amphipathic helix motif: Classes and properties,” Proteins Struct. Funct. Genet., vol. 8, no. 2, pp. 103–117, 1990.

[37] J. Selbig and P. Argos, “Relationships between protein sequence and structure patterns based on residue contacts,” Proteins Struct. Funct. Genet., vol. 31, pp. 172–185, 1998.

[38] E. L. L. Sonnhammer, S. R. Eddy, E. Birney, A. Bateman, and R. Durbin, “Pfam: Multiple sequence alignments and HMM-profiles of protein domains,” Nucleic Acids Res., vol. 26, no. 1, pp. 320–322, 1998.

[39] Y. Sun, Q. Zhu, and Z. Chen, “An iterative initial-points refinement algorithm for categorical data clustering,” Pattern Recognit. Lett., vol. 23, no. 7, pp. 875–884, 2002.

[40] G. Wang and R. L. Dunbrack, Jr., “PISCES: A protein sequence-culling server,” Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

[41] J. M. Zimmerman, N. Eliezer, and R. Simha, “The characterization of amino acid sequences in proteins by statistical methods,” J. Theor. Biol., vol. 21, pp. 170–201, 1968.

Wei Zhong received the B.S. degree in computer science from Georgia State University, Atlanta, in 2001. He is currently working toward the Ph.D. degree in the Department of Computer Science, Georgia State University, under the supervision of Dr. Y. Pan.

He has served as a reviewer for many conferences and journal papers. His main research interests include bioinformatics, machine learning algorithms, and high-performance computing.

Mr. Zhong received the Outstanding Ph.D. Research Award from Georgia State University for his work on bioinformatics.

Gulsah Altun was born in Canakkale, Turkey. She received the B.S. degree in electronics and telecommunication engineering from Kocaeli University, Turkey, in 1999 and the M.S. degree in computer science from Georgia State University, Atlanta, in 2003. She is currently working toward the Ph.D. degree in the Department of Computer Science, Georgia State University, under the supervision of Dr. R. Harrison.

She has served as a reviewer for many conferences and journal papers. Her research interests include bioinformatics, data mining, and parallel computing.

Ms. Altun is the President of the Student Chapter of the Association for Computing Machinery at Georgia State University.

Robert Harrison received the B.S. degree in biophysics from Pennsylvania State University, University Park, in 1979 and the Ph.D. degree in molecular biochemistry and biophysics from Yale University, New Haven, CT, in 1985.

He is currently an Associate Professor in the Department of Computer Science at Georgia State University, Atlanta. An active scientist, he has published over 95 papers in journals and refereed proceedings on a wide range of computational issues in computational biology, structural biology, and bioinformatics. His current research interests include computational approaches to the prediction and design of molecular structure, machine learning, and the development of grammar-based models for systems and structural biology.

Dr. Harrison is a Georgia Cancer Coalition Distinguished Scholar.

Phang C. Tai received the Ph.D. degree in microbiology from the University of California, Davis.

He has done postdoctoral work at Harvard Medical School, Boston, MA, and is currently a Regents’ Professor and Chair of the Department of Biology, Georgia State University, Atlanta. His research interest is in molecular biology and microbial physiology. His current research focuses on the mechanism of protein secretion across bacterial membranes, with emphasis on the structure and function of SecA protein.

Yi Pan (S’90–SM’91) received the B.Eng. and M.Eng. degrees in computer engineering from Tsinghua University, China, in 1982 and 1984, respectively, and his Ph.D. degree in computer science from the University of Pittsburgh, Pittsburgh, PA, in 1991.

Currently, he is the Chair and a full Professor in the Department of Computer Science at Georgia State University, Atlanta. He has published more than 80 journal papers with 29 papers published in various IEEE journals. In addition, he has published over 100 papers in refereed conferences. He has also coedited 18 books (including proceedings) and contributed several book chapters. He has served as an editor-in-chief or editorial board member for eight journals. He has delivered over 50 invited talks, including keynote speeches and colloquium talks, at conferences and universities worldwide. His research interests include high-performance computing, networking, and bioinformatics.