

BioSystems 72 (2003) 1–4

Editorial

Computational intelligence in bioinformatics

The 50th anniversary of the discovery of the DNA double helix was celebrated on April 25, 2003. This landmark event heralded a new age of understanding in biology at the level of molecules and their interplay within a cell. Since 1953, our knowledge has expanded from the structure of DNA and proteins to the sequencing of genes and complete genomes. We have the ability to monitor gene expression in a cell, a better appreciation for the amazing dynamics between genome, metabolome, and proteome, and a sincere desire for better understanding of information processing at the molecular level. It has been an astounding 50 years of unparalleled advancement.

The success of the molecular age was tied to key advances in technology. New tools for crystallography and nuclear magnetic resonance have provided us with better visions of molecular structures in three dimensions. The polymerase chain reaction (PCR), coupled with advances in DNA sequencing machines, has revolutionized the speed with which we can interrogate DNA sequence information. What took several years to sequence only a decade ago can now be sequenced in a matter of days or weeks. Microarray or “DNA chip” technology has provided us with a window to gene expression, a means to discover the underlying patterns that may predicate disease. These are just a sampling of the tools that are assisting us in discovery.

No technological advancement has been more directly responsible for the success of molecular biology over the last 50 years than the computer. Computers have become so important in biology that it is difficult to think of any significant advancement in the last 10 years that did not have direct assistance from a computer, whether as a viewer for three-dimensional structures, a controller for automated robot maneuvering of 96-well plates for PCR, or a means to interpret DNA sequencing gels, microarray data, etc. The revolution of the last 50 years has resulted in such a wealth of data that our understanding of the underlying processes would be significantly reduced if computers were not at hand as our assistants with respect to “bioinformatics”. The scale of the biological problems of interest and our understanding of those problems have closely paralleled Moore’s Law.

Realistically, it was not until the 1970s and 1980s that computers became truly relevant for biological information processing at a rate commensurate with the data being generated. The 1980s and 1990s heralded the Internet, which has become an invaluable resource for sharing biological information. However, in parallel with molecular biology, methods of computational intelligence also share their origins in the 1950s, with refinement over time into a wide array of algorithms useful for data mining, pattern recognition, optimization, and simulation. Today, many of these same algorithms can be said to offer “computational intelligence” that can handle the large size of experimental output from modern biology.

What is “computational intelligence”? Computational intelligence can be broadly defined as the ability of a machine to react to an environment in new ways, making useful decisions in light of current and previous information. Computational intelligence is generally accepted to include evolutionary computation, fuzzy systems, neural networks, and combinations thereof. One might also extend this definition to include reaction speeds and error rates approaching human performance as an answer to Turing’s comment “we may hope that machines will eventually compete with men in all purely intellectual fields” (Turing, 1956).

0303-2647/$ – see front matter © 2003 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/S0303-2647(03)00129-1

Successes in computational intelligence have been forthcoming. These methods have been used widely in engineering for the optimization of plant control and scheduling, for the design of small robots for locomotion (Lipson and Pollack, 2000), and for the evolution of a human expert-level checkers player (Fogel, 2002), all without human expertise. As suggested at the dawn of this era in 1966 by Fogel et al., “the old saw ‘the computer never knows more than the programmer’ is simply no longer true”.

These same methods are now being applied to problems in molecular biology and bioinformatics (Fogel and Corne, 2002). This volume is a collection of papers that addresses the importance of computational intelligence in bioinformatics and presents a survey of the wide opportunity and promise that these methods hold for the computer science and biology communities. Thirteen contributions representing 31 authors from 8 countries were accepted for inclusion in this special issue, with topics ranging from problems with biological sequence analysis to metabolic pathway analysis.

Thomas Kiel Rasmussen and Thiemo Krink’s paper “Improved hidden Markov model training for multiple sequence alignment by a particle swarm optimization—evolutionary algorithm hybrid” focuses on a new training method for hidden Markov models (HMMs). Multiple sequence alignment is a central problem in bioinformatics. The authors demonstrate how the combination of particle swarm optimization and evolutionary algorithms can generate better protein sequence alignments than more traditional HMM training methods, such as Baum–Welch and simulated annealing.

Daniel Howard and Karl Benson address the problem of promoter prediction in eukaryotes in their paper “Evolutionary computation method for pattern recognition of cis-acting sites”. Promoters are commonly defined as DNA sequences immediately upstream of transcription start sites that attract proteins essential for transcription. In prokaryotes, these sequences are well known; in eukaryotes, however, promoter elements are more difficult to discover. The authors utilize evolved finite-state machines for this purpose and test their ability to distinguish promoter sequences from non-promoter sequences.

Structure prediction also plays a central role in bioinformatics, and three of the papers in this special issue relate directly to this area of research. Kay Wiese and Edward Glen’s contribution, “A permutation-based genetic algorithm for the RNA folding problem: a critical look at selection strategies, crossover operators, and representation issues”, reviews the area of RNA secondary structure prediction. A permutation-based representation is used, and the effects of different variation operators on the convergence of the algorithm are investigated.

The paper by Hitoshi Iba, Shusuke Saeki, Kiyoshi Asai, Katsutoshi Takahashi, Yutaka Ueno, and Katsunori Isono, entitled “Inference of Euler angles for single particle analysis by means of evolutionary algorithms”, focuses on the reconstruction of three-dimensional protein structure images. Euler angles have recently been offered in the literature as an approach for automated three-dimensional structure analysis. The approach offered here applies evolutionary computation to reduce the computational time and increase the precision of the resolved structure.

René Thomsen’s contribution “Flexible ligand docking using evolutionary algorithms: investigating the effects of variation operators and local-search hybrids” reviews the area of drug docking and evaluates the performance of different settings with respect to an evolutionary algorithm, including choice of variation operator, population size, and local search. The best parameter settings after this analysis resulted in an algorithm that was more robust than methods reported previously in the literature.

Gene expression analysis and clustering methods have recently become of critical importance in our understanding of proteomics. Carlos Cotta and Pablo Moscato introduce a heuristic approach in their paper, “A memetic-aided approach to hierarchical clustering from distance matrices: application to gene expression clustering and phylogeny”. This approach is placed within the context of the well-known traveling salesman problem and demonstrates the extent to which classic problems in computer science and associated algorithms can be adapted successfully to real-world problems in biology.

Peter Merz also makes use of a memetic algorithm approach in his contribution “Analysis of gene expression profiles: an application of memetic algorithms to the minimum-sum-of-squares clustering problem”. This approach is evaluated relative to standard clustering methods in the biological literature, such as k-means and self-organizing maps, on benchmark and biological data. The results demonstrate the clear effectiveness of this new approach to clustering.
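The minimum-sum-of-squares criterion named in that title can be made concrete with a small sketch. The code below is illustrative only and is not the memetic algorithm from the paper: the function name `sum_of_squares` and the toy data are assumptions introduced here. It simply scores a given assignment of points to clusters, which is the quantity that k-means and any other minimum-sum-of-squares search try to make as small as possible.

```python
def sum_of_squares(points, labels, k):
    """Sum of squared Euclidean distances from each point to the
    centroid of the cluster it is assigned to (hypothetical helper)."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    # Accumulate per-cluster coordinate sums and sizes.
    for point, cluster in zip(points, labels):
        counts[cluster] += 1
        for j, value in enumerate(point):
            sums[cluster][j] += value
    # Centroid of each non-empty cluster; empty clusters contribute nothing.
    centroids = [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(k)
    ]
    total = 0.0
    for point, cluster in zip(points, labels):
        total += sum((x - m) ** 2 for x, m in zip(point, centroids[cluster]))
    return total


# Toy "expression profiles" (assumed data): two tight groups far apart.
profiles = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = sum_of_squares(profiles, [0, 0, 1, 1], k=2)  # 4.0
bad = sum_of_squares(profiles, [0, 1, 0, 1], k=2)   # 100.0
```

A lower score means tighter clusters, so the first assignment is preferred over the second under this criterion.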

Kalyanmoy Deb and A. Raji Reddy utilize a multi-objective evolutionary algorithm in their paper “Reliable classification of two-class cancer data using evolutionary algorithms”. Classification in this regard focused on the identification of gene subsets related to one or another type of cancer. This approach minimized the misclassification error in both training and testing while simultaneously minimizing gene subset size.
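Minimizing several objectives at once is usually framed in terms of Pareto dominance, and a minimal sketch may help readers unfamiliar with the idea. The code below is not the algorithm used in the paper; the function names and the candidate values are assumptions made for illustration. Each candidate is a tuple of objective values to be minimized, here (misclassified samples, gene subset size).

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Assumed toy values: (misclassified samples, number of genes used).
candidates = [(0, 12), (1, 5), (3, 5), (2, 20), (4, 2)]
front = pareto_front(candidates)  # [(0, 12), (1, 5), (4, 2)]
```

The surviving candidates represent different trade-offs between accuracy and subset size; none of them can be improved in one objective without getting worse in the other.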

Methods of computational intelligence, such as neural networks, can be used for a very wide range of topics. Jonathan Clark outlines an approach to utilize neural networks as an aid for botanical identification in “Artificial neural networks for species identification by taxonomists”. A multi-layer perceptron is used to identify varieties of flowering plants of the genus Lithops, and the value of this tool is compared to traditional methods of identification. The neural network approach demonstrated improved accuracy in identifying taxa wherein the data were limited or the species were difficult to distinguish by morphology alone.

Dana Weekes and Gary B. Fogel utilize evolutionary algorithms to train neural networks for drug activity prediction in their paper “Evolutionary optimization, backpropagation, and data preparation issues in QSAR modeling of HIV inhibition by HEPT derivatives”. The results of this work further confirm the growing indication that evolutionary computation can outperform backpropagation as a method of artificial neural network training. The results also indicate the degree to which bias in the initial training and testing data can affect performance and the importance of bootstrapping.

Zheng Rong Yang, Rebecca Thomson, T. Charles Hodgman, Jon Dry, Austin K. Doyle, Ajit Narayanan, and XiKun Wu utilize a computational intelligence approach for the characterization and prediction of proteolytic cleavage sites by proteases in their paper “Searching for discrimination rules in protease proteolytic cleavage activity using genetic programming with a min–max scoring function”. An interesting representation method coupled with use of a min–max scoring function and minimum description length provides increased predictive performance relative to other methods of pattern recognition on four datasets of protease cleavage.

Jason H. Moore and Lance W. Hahn model disease susceptibility in their paper “Petri net modeling of high-order genetic systems using grammatical evolution”. Petri nets are used to generate biochemical network models that are consistent with genetic models of disease susceptibility. A method of grammatical evolution is introduced as a means to search for optimal Petri net models. The results indicate that this approach can result in useful models that relate genes and disease.

As a conclusion to the special issue, in “Model selection methodology in supervised learning with evolutionary computation”, Jem J. Rowland reviews the importance of computational intelligence methods in bioinformatics, focusing on areas of concern that make these methods susceptible to overtraining and chance relationships in the data. An approach to model selection that avoids these pitfalls is proposed and tested on problems of metabolite determination and disease prediction from gene expression data. This contribution is of great importance to those readers interested in applying their own versions of the models described in the remainder of the issue.

Acknowledgements

We would like to thank David Fogel and Ray Paton for their help and support during the development of this special issue.

References

Fogel, D., 2002. Blondie24: Playing at the Edge of AI. Morgan Kaufmann, San Francisco.

Fogel, G.B., Corne, D.W., 2002. Evolutionary Computation in Bioinformatics. Morgan Kaufmann, San Francisco.

Fogel, L.J., Owens, A.J., Walsh, M.J., 1966. Artificial Intelligence Through Simulated Evolution. Wiley, New York, p. 113.

Lipson, H., Pollack, J., 2000. Automatic design and manufacture of artificial lifeforms. Nature 406, 974–978.

Turing, A.M., 1956. Can a machine think? In: Newman, J.R. (Ed.), The World of Mathematics, vol. 4. Simon and Schuster, New York, p. 2122.


Gary B. Fogel
Natural Selection, Inc., 3333 N. Torrey Pines Ct., Suite 200, La Jolla, CA 92037, USA
Corresponding author
Tel.: +1-858-455-6449; fax: +1-858-455-1560
E-mail address: [email protected] (G.B. Fogel)

David W. Corne
Department of Computer Science, Harrison Building, University of Exeter, Exeter EX4 4QF, UK
Tel.: +44-1392-263628; fax: +44-1392-217965
E-mail address: [email protected] (D.W. Corne)