
Bioinformatics – An Introduction for Computer Scientists

    JACQUES COHEN

    Brandeis University

Abstract. The article aims to introduce computer scientists to the new field of bioinformatics. This area has arisen from the needs of biologists to utilize and help interpret the vast amounts of data that are constantly being gathered in genomic research and its more recent counterparts, proteomics and functional genomics. The ultimate goal of bioinformatics is to develop in silico models that will complement in vitro and in vivo biological experiments. The article provides a bird's-eye view of the basic concepts in molecular cell biology, outlines the nature of the existing data, and describes the kind of computer algorithms and techniques that are necessary to understand cell behavior. The underlying motivation for many of the bioinformatics approaches is the evolution of organisms and the complexity of working with incomplete and noisy data. The topics covered include: descriptions of the current software especially developed for biologists, computer and mathematical cell models, and areas of computer science that play an important role in bioinformatics.

Categories and Subject Descriptors: A.1 [Introductory and Survey]; F.1.1 [Computation by Abstract Devices]: Models of Computation – Automata (e.g., finite, push-down, resource-bounded); F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems; G.2.0 [Discrete Mathematics]: General; G.3 [Probability and Statistics]; H.3.0 [Information Storage and Retrieval]: General; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search; I.5.3 [Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications – Text processing; I.6.8 [Simulation and Modeling]: Types of Simulation – Continuous, discrete event; I.7.0 [Document and Text Processing]: General; J.3 [Life and Medical Sciences]: Biology and genetics

    General Terms: Algorithms, Languages, Theory

Additional Key Words and Phrases: Molecular cell biology, computer, DNA, alignments, dynamic programming, parsing biological sequences, hidden Markov models, phylogenetic trees, RNA and protein structure, cell simulation and modeling, microarray

    1. INTRODUCTION

It is undeniable that, among the sciences, biology played a key role in the twentieth century. That role is likely to acquire further importance in the years to come. In the wake of the work of Watson and Crick [2003] and the sequencing of the human genome, far-reaching discoveries are constantly being made.

Author's address: Department of Computer Science, Brandeis University, Waltham, MA 02454; email: [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. ©2004 ACM 0360-0300/04/0600-0122 $5.00
ACM Computing Surveys, Vol. 36, No. 2, June 2004, pp. 122–158.

One of the central factors promoting the importance of biology is its relationship with medicine. Fundamental progress in medicine depends on elucidating some of the mysteries that occur in the biological sciences.


Biology depended on chemistry to make major strides, and this led to the development of biochemistry. Similarly, the need to explain biological phenomena at the atomic level led to biophysics. The enormous amount of data gathered by biologists, and the need to interpret it, requires tools that are in the realm of computer science. Thus, bioinformatics.

Both chemistry and physics have benefited from the symbiotic work done with biologists. The collaborative work functions as a source of inspiration for novel pursuits in the original science. It seems certain that the same benefit will accrue to computer science: work with biologists will inspire computer scientists to make discoveries that improve their own discipline.

A common problem with the maturation of an interdisciplinary subject is that, inevitably, the forerunner disciplines call for differing perspectives. I see these differences in working with my biologist colleagues. Nevertheless, we are so interested in the success of our dialogue that we make special efforts to understand each other's point of view. That willingness is critical for joint work, and this is particularly true in bioinformatics.

An area called computational biology preceded what is now called bioinformatics. Computational biologists also gathered their inspiration from biology and developed some very important algorithms that are now used by biologists. Computational biologists take justified pride in the formal aspects of their work. Those often involve proofs of algorithmic correctness, complexity estimates, and other themes that are central to theoretical computer science.

Nevertheless, the biologists' needs are so pressing and broad that many other aspects related to computer science have to be explored. For example, biologists need software that is reliable and can deal with huge amounts of data, as well as interfaces that facilitate the human-machine interactions.

I believe it is futile to argue the differences and scope of computational biology as compared to bioinformatics. Presently, the latter is more widely used among biologists than the former, even though there is no agreed definition for the two terms.

A distinctive aspect of bioinformatics is its widespread use of the Web. It could not be otherwise. The immense databases containing DNA sequences and 3D protein structures are available to almost any researcher. Furthermore, the community interested in bioinformatics has developed a myriad of application programs accessible through the Internet. Some of these programs (e.g., BLAST) have taken years of development and have been finely tuned. The vast numbers of daily visits to some of the NIH sites containing genomic databases are comparable to those of widely used search engines or active software downloading sites. This explains the great interest that bioinformaticians have in script languages such as Perl and Python that allow the automatic examination and gathering of information from websites.

With the above preface, we can put forward the objectives of this article and state the background material necessary for reading it. The article is both a tutorial and a survey. As its title indicates, it is oriented towards computer scientists.

Some biologists may argue that the proper way to learn bioinformatics is to have a good background in organic chemistry and biochemistry and to take a full course in molecular cell biology. I beg to disagree: in an interdisciplinary field like bioinformatics there must be several entry points, and one of them is using the language that is familiar to computer scientists. This does not imply that one can skip the fundamental knowledge available in a cell and molecular biology text. It means that a computer scientist interested in learning what is presently being done in bioinformatics can save some precious time by reading the material in this article.

It was mentioned above that, in interdisciplinary areas like bioinformatics, the players often view topics from a different perspective.


This is not surprising since both biologists and computer scientists have gone through years of education in their respective fields. A related plausible reason for such differences is as follows:

In computer science, we favor generality and abstractions; our approach is often top-down, as if we were developing a program or writing a manual. In contrast, biologists often favor a bottom-up approach. This is understandable because the minutiae are so important and biologists are often involved with time-consuming experiments that may yield ambiguous results, which in turn have to be resolved by further tests. The work of synthesis eventually has to take place but, since in biology most of the rules have exceptions, biologists are wary of generalizations.

Naturally, in this article's presentation, I have used a top-down approach. To make the contents of the article clear and self-contained, certain details have had to be omitted. From a pragmatic point of view, articles like this one are useful in bridging disciplines, provided that the readers are aware of the complexities lying behind abstractions.

The article should be viewed as a tour of bioinformatics enabling the interested reader to search subsequently for deeper knowledge. One should expect to expend considerable effort gaining that knowledge because bioinformatics is inextricably tied to biology, and it takes time to learn the basic concepts of biology.

This work is directed to a mature computer scientist who wishes to learn more about the areas of bioinformatics and computational biology. The reader is expected to be at ease with the basic concepts of algorithms, complexity, language and automata theory, topics in artificial intelligence, and parallelism.

As to knowledge in biology, I hope the reader will be able to recall some of the rudiments of biology learned in secondary education or in an undergraduate course in the biological sciences. An appendix to this article reviews a minimal set of facts needed to understand the material in the subsequent sections. In addition, The Cartoon Guide to Genetics [Gonick and Wheelis 1991] is an informative and amusing text explaining the fundamentals of cell and molecular biology.

The reader should keep in mind the pervasive role of evolution in biology. It is evolution that allows us to infer the cell behavior of one species, say the human, from existing information about the cell behavior of other species like the mouse, the worm, the fruit fly, and even yeast.

In the past decades, biologists have gathered information about the cell characteristics of many species. With the help of evolutionary principles, that information can be extrapolated to other species. However, most available data is fragmented, incomplete, and noisy. So if one had to characterize bioinformatics in logical terms, it would be: reasoning with incomplete information. That includes providing ancillary tools allowing researchers to compare carefully the relationship between new data and data that has been validated by experiments. Since understanding the human cell is a primary concern in medicine, one usually wishes to infer human cell behavior from that of other species.

This article's contents aim to shed some light on the following questions:

How can one describe the actors and processes that take place within a living cell?

What can be determined or measured to infer cell behavior?

What data is presently available for that purpose?

What are the present major problems in bioinformatics and how are they being solved?

What areas in computer science relate most to bioinformatics?

The next section offers some words of caution that should be kept in the reader's mind. The subsequent sections aim at answering the above questions. A final section provides information about how to proceed if one wishes to further explore this new discipline.


    2. WORDS OF CAUTION

Naturally, it is impossible to condense the material presently available in many bioinformatics texts into a single survey and tutorial. Compromises had to be made that may displease purists who think that no shortcuts are possible to explain this new field. Nevertheless, the objective of this work will be fulfilled if it incites the reader to continue along the path opened by reading this précis.

There is a dichotomy between the various presentations available in bioinformatics articles and texts. At one extreme are those catering to algorithms, complexity, statistics, and probability. On the other are those that utilize tools to infer new biological knowledge from existing data. The material in this article should be helpful to initiate potential practitioners in both areas. It is always worthwhile to keep in mind that new algorithms become useful when they are developed into packages that are used by biologists.

It is also wise to recall some of the distinguishing aspects of biology to which computer scientists are not accustomed. David B. Searls, in a highly recommended article on "Grand challenges in computational biology" [Searls 1998], points out that, in biology:

There are no rules without exception;

Every phenomenon has a nonlocal component;

Every problem is intertwined with others.

For example, for some time, it was thought that a gene was responsible for producing a single protein. Recent work in alternate splicing indicates that a gene may generate several proteins. This may explain why the number of genes in the human genome is smaller than had been anticipated earlier. Another example of biology's fundamentally dynamic and empirical state is that it has been recently determined that gene-generated proteins may contain amino acids beyond the 20 that are normally used as constituents of those proteins.

The second item in Searls' list warns us that the existence of common local features cannot be generalized. For example, similar 3D substructures may originate from different sequences of amino acids. This implies that similarity at one level cannot be generalized to another.

Finally, the third item cautions us to consider biological problems as an aggregate and not to get lost in studying only individual components. For example, simple changes of nucleotides in DNA may result in entirely different protein structures and function. This implies that the study of genomes has to be tied to the study of the resulting proteins.

3. BRIEF DESCRIPTION OF A SINGLE CELL AT THE MOLECULAR LEVEL

In this section, we assume that the reader has a rudimentary level of knowledge in cell and molecular biology. (The appendix reviews some of that material.) The intent here is to show the importance three-dimensional structures have in understanding the behavior of a living cell. Cells in different organisms or within the same organism vary significantly in shape, size, and behavior. However, they all share common characteristics that are essential for life.

The cell is made up of molecular components, which can be viewed as 3D structures of various shapes. These molecules can be quite large (like DNA molecules) or relatively small (like the proteins that make up the cell membrane). The membrane acts as a filter that controls the access of exterior elements and also allows certain molecules to exit the cell.

Biological molecules in isolation usually maintain their structure; however, they may also contain articulations that allow movements of their subparts (thus, the interest of nanotechnology researchers in those molecules).

The intracellular components are made of various types of molecules. Some of them navigate randomly within the media inside the membrane. Other molecules are attracted to each other.


In a living cell, the molecules interact with each other. By interaction it is meant that two or more molecules are combined to form one or more new molecules, that is, new 3D structures with new shapes. Alternatively, as a result of an interaction, a molecule may be disassembled into smaller fragments. An interaction may also reflect mutual influence among molecules. These interactions are due to attractions and repulsions that take place at the atomic level. An important type of interaction involves catalysis, that is, the presence of a molecule that facilitates the interaction. These facilitators are called enzymes.

Interactions amount to chemical reactions that change the energy level of the cell. A living cell has to maintain its orderly state, and this takes energy, which is supplied by surrounding light and nutrients.

It can be said that biological interactions frequently occur because of the shape and location of the cell's constituent molecules. In other words, the proximity of components and the shape of components trigger interactions. Life exists only when the interactions can take place.

A cell grows because of the availability of external molecules (nutrients) that can penetrate the cell's membrane and participate in interactions with existing intracellular molecules. Some of those may also exit through the membrane. Thus, a cell is able to digest surrounding nutrients and produce other molecules that are able to exit through the cell's membrane. A metabolic pathway is a chain of molecular interactions involving enzymes. Signaling pathways are molecular interactions that enable communication through the cell's membrane. The notions of metabolic and signaling pathways will be useful in understanding gene regulation, a topic that will be covered later.

Cells, then, are capable of growing by absorbing outside nutrients. Copies of existing components are made by interactions among exterior and interior molecules. A living cell is thus capable of reproduction: this occurs when there are enough components in the original cell to produce a duplicate of the original cell, capable of acting independently.

So far, we have intuitively explained the concepts of growth, metabolism, and reproduction. These are some of the basic characteristics of living organisms. Other important characteristics of living cells are: motility, the capability of searching for nutrients, and eventually death.

Notice that we assumed the initial existence of a cell, and it is fair to ask the question: how could one possibly have engineered such a contraption? The answer lies in evolution. When the cell duplicates, it may introduce slight (random) changes in the structure of its components. If those changes extend the life span of the cell, they tend to be incorporated in future generations. It is still unclear what ingredients made up the primordial living cell that eventually generated all other cells by the process of evolution.

The above description is very general and abstract. To make it more detailed one has to introduce the differences between the various components of a cell. Let us differentiate between two types of cell molecules: DNA and proteins. DNA can be viewed as a template for producing additional (duplicate) DNA and also for producing proteins.

Protein production is carried out using cascading transformations. In bacterial cells (called prokaryotes), RNA is first generated from DNA and proteins are produced from RNA. In a more developed type of cells (eukaryotes), there is an additional intermediate transformation: pre-RNA is generated from DNA, RNA from pre-RNA, and proteins from RNA. Indeed, the present paragraph expresses what is known as the central dogma in molecular biology. (Graphical representations of these transformations are available at the site of the National Health Museum [http://www.accessexcellence.org/AB/GG/].)

Note that the above transformations are actually molecular interactions such as we had previously described. A transformation A → B means that the resulting molecules of B are constructed anew using subcomponents that are copies of the existing molecules of A.


(Notice the similarity with Lisp programs that use constructors like cons to carry out transformations of lists.)

The last two paragraphs implicitly assume the existence of processors capable of effecting the transformations. Indeed that is the case with RNA-polymerases, spliceosomes, and ribosomes. These are marvelous machineries that are made themselves of proteins and RNA, which in turn are produced from DNA! They demonstrate the omnipresence of loops in cell behavior.

One can summarize the molecular transformations that interest us using the notation:

DNA --[RNA-polymerase]--> pre-RNA --[spliceosome]--> RNA --[ribosome]--> Protein

The arrows denote transformations, and the bracketed entities on the arrows indicate the processors responsible for carrying out the corresponding transformations. Some important remarks are in order:

(1) All the constituents in the above transformation are three-dimensional structures.

(2) It is more appropriate to consider a gene (a subpart of DNA) as the original material processed by RNA-polymerase.

(3) An arsenal of processors in the vicinity of the DNA molecule works on multiple genes simultaneously.

(4) The proteins generated by various genes are used as constituents making up the various processors.

(5) A generated protein may prevent (or accelerate) the production of other proteins. For example, a protein Pi may place itself at the origin of gene Gk and prevent Pk from being produced. It is said that Pi represses Pk. In other cases, the opposite occurs: one protein activates the production of another.

(6) It is known that a spliceosome is capable of generating different RNAs (alternate splicing), and therefore the old notion that a given gene produces one given protein no longer holds true. As a matter of fact, a gene may produce several different proteins, though the mechanism of this is still a subject of research.

(7) It is never repetitious to point out that in biology, most rules have exceptions [Searls 1998].
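To make the cascade above concrete in programming terms, the following is a minimal Python sketch of two of the transformations (transcription and translation). It is an illustration only: the codon table is a tiny subset of the real genetic code, splicing and regulation are ignored, and the function names are hypothetical.

```python
# Minimal sketch of the central-dogma transformations described above.
# Splicing, regulation, and the molecular "processors" themselves are omitted.

CODON_TABLE = {            # small subset of the standard genetic code
    "AUG": "M",            # start codon -> methionine
    "UUU": "F", "GGC": "G", "GAA": "E",
    "UAA": "*", "UAG": "*", "UGA": "*",   # stop codons
}

def transcribe(dna: str) -> str:
    """DNA coding strand -> RNA (T replaced by U)."""
    return dna.upper().replace("T", "U")

def translate(rna: str) -> str:
    """RNA -> protein, reading codons until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE.get(rna[i:i + 3], "?")   # '?' = codon not in the subset
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate(transcribe("ATGTTTGGCGAATAA")))   # -> "MFGE"
```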

The term gene expression refers to the production of RNA by a given gene. Presumably, the amount of RNA generated by the various genes of an organism establishes an estimate of the corresponding protein levels.

An important datum that can be obtained by laboratory experiments is an estimate of the simultaneous RNA production of thousands of genes. Gene expressions vary depending on a given state of the cell (e.g., starvation or lack of light, abnormal behavior, etc.).

3.1. Analogy with Computer Science Programs

We now open a parenthesis to recall the relationship that exists between computer programs and data; that relationship has analogies that are applicable to understanding cell behavior. Not all biologists will agree with a metaphor equating DNA to a computer program. Nevertheless, I have found that metaphor useful in explaining DNA to computer scientists.

In the universal Turing Machine (TM) model of computing, one does not distinguish between program and data: they coexist in the machine's tape, and it is the TM interpreter that is commanded to start computations at a given state examining a given element of the tape.

Let us introduce the notion of interpretation in our simplified description of a single biological cell. Both DNA and proteins are components of our model, but the interactions that take place between DNA and other components (existing proteins) result in producing new proteins, each of which has a specific function needed for cell survival (growth, metabolism, replication, and others).


The following statement is crucial to understanding the process of interpretation occurring within a cell. Let a gene G in the DNA component be responsible for producing a protein P. Interpreters I capable of processing any gene may well utilize P as one of their components. This implies that if P has not been assembled into the machinery of I, no interpretation takes place.

Another instance in which P cannot be produced is the already mentioned fact that another protein may position itself at the beginning of gene G and (temporarily) prevent the transcription.

The interpreter in the biological case is either one that already exists in a given cell (prior to cell replication) or else it can be assembled from proteins and RNA generated by specific genes (e.g., ribosomal genes). In biology, the interpreter can be viewed as a mechanical gadget that is made of moving parts that produce new components based on given templates (DNA or RNA). The construction of new components is made by subcomponents that happen to be in the vicinity. If they are not, interpretation cannot proceed.

One can imagine a similar situation when interpreting computer programs (although it is unlikely to occur in actual interpreters). Assume that the components of I are first generated on the fly and, once I is assembled (as data), control is transferred to the execution of I (as a program).

The above situation can be simulated in actual computers by utilizing concurrent processes that comprise a multitude of interrupts to control program execution. This could be implemented using interpreters that first test that all the components have been assembled: execution proceeds only if that is the case; otherwise, an interrupt takes place until the assembly is completed. Alternatively, one can execute program parts as soon as they are produced and interrupt execution if a sequel has not yet been fully generated. In Section 7.5.1, we will describe one such model of gene interaction.

4. LABORATORY TOOLS FOR DETERMINING BIOLOGICAL DATA

We start with a warning that the explanations that follow are necessarily coarse. The goal of this section is to enable the reader to have some grasp of how biological information is gathered and of the degree of difficulty in obtaining it. This will be helpful in understanding the various types of data available and the programs needed to utilize and interpret that data.

Sequencers are machines capable of reading off a sequence of nucleotides in a strand of DNA in biological samples. The machines are linked to computers that display the DNA sequence being analyzed. The display also provides the degree of confidence in identifying each nucleotide. Present sequencers can produce over 300 kilobase pairs per day at very reasonable costs. It is also instructive to remark that the inverse operation of sequencing can also be performed rather inexpensively: it is now common practice to order from biotech companies vials containing short sequences of nucleotides specified by a user.

A significant difficulty in obtaining an entire genome's DNA is the fact that the sequences gathered in a wet lab consist of relatively short random segments that have to be reassembled using computer programs; this is referred to as the shotgun method of sequencing. Since DNA material contains many repeated subsequences, performing the assemblage can be tricky. This is due to the fact that a fragment can be placed ambiguously in two or more positions of the genome being assembled. (DNA assembly will be revisited in Section 7.7.)

Recently, there has been a trend to attempt to identify proteins using mass spectroscopy. The technique involves determining genes and obtaining the corresponding proteins in purified form. These are cut into short sequences of amino acids (called peptides) whose molecular weights can be determined by a mass spectrograph. It is then computationally possible to infer the constituents of the peptides yielding those molecular weights.


By using existing genomic sequences, one can attempt to reassemble the desired sequence of amino acids.

The 3D structure of proteins is mainly determined by X-ray crystallography and by nuclear magnetic resonance (NMR). Both these experiments are time consuming and costly. In X-ray crystallography, one attempts to infer the 3D position of each of the protein's atoms from a projection obtained by passing X-rays through a crystallized sample of that protein. One of the major difficulties of the process is obtaining good crystals. X-ray experiments may require months and even years of laboratory work.

In the NMR technique, one obtains a number of matrices that express the fact that two atoms (that are not in the same backbone chain of the protein) are within a certain distance. One then deduces a 3D shape from those matrices. The NMR experiments are also costly. A great advantage is that they allow one to study mobile parts of proteins, a task which cannot be done using crystals.

The preceding paragraphs explain why DNA data is so much more abundant than 3D protein data.

Another type of valuable information obtainable through lab experiments is known as ESTs, or expressed sequence tags. These are RNA chunks that can be gathered from a cell in minute quantities, but can easily be duplicated. Those chunks are very useful since they do not contain material that would be present in introns (see the Appendix for a definition). The availability of EST databases comprising many organisms allows bioinformaticians to infer the positions of introns and even deduce alternate splicing.

A powerful new tool available in biology is microarrays. They allow determining simultaneously the amount of mRNA production of thousands of genes. As mentioned earlier, this amount corresponds to gene expression; it is presumed that the amount of RNA generated by the various genes of an organism establishes an estimate of the corresponding protein levels.

Microarray experiments require three phases. In the first phase, one places thousands of different one-stranded chunks of RNA in minuscule wells on the surface of a small glass chip. (This task is not unlike that done by a jet printer using thousands of different colors and placing each of them in different spots of a surface.) The chunks correspond to the RNA known to have been generated by a given gene. The 2D coordinates of each of the wells are of course known. Some companies mass produce custom preloaded chips for cells of various organisms and sell them to biological labs.

The second phase consists of spreading on the surface of the glass genetic material (again one-stranded RNA) obtained by a cell experiment one wishes to perform. Those could be the RNAs produced by a diseased cell, or by a cell being subjected to starvation, high temperature, etc. The RNA already in the glass chip combines with the RNA produced by the cell one wishes to study. The degree of combined material obtained by complementing nucleotides is an indicator of how much RNA is being expressed by each one of the genes of the cell being studied.

The third phase consists of using a laser scanner connected to a computer. The apparatus measures the amount of combined material in each chip well and determines the degree of gene expression (a real number) for each of the genes originally placed on the chip. Microarray data is becoming available in huge amounts. A problem with this data is that it is noisy and its interpretation is difficult. Microarrays are becoming invaluable for biologists studying how genes interact with each other. This is crucial in understanding disease mechanisms.

The microarray approach has been extended to the study of protein expression. There exist chips whose wells contain molecules that can be bound to particular proteins.

Another recent development in experimental biology is the determination of protein interaction by what are called two-hybrid experiments. The goal of such experiments is to construct huge Boolean matrices, whose rows and columns represent the proteins of a genome.


If a protein interacts with another, the corresponding position in the matrix is set to true. Again, one has to deal with thousands of proteins (genes); the data of their interactions is invaluable in reconstructing metabolic and signaling pathways.

A final experimental tool described in this section is the availability of libraries of variants of a given organism, yeast being a notable example. Each variant corresponds to cells having a single one of its genes knocked out. (Of course, researchers are only interested in living cells, since certain genes are vital to life.) These libraries enable biologists to perform experiments (say, using microarrays) and deduce information about cell behavior and fault tolerance.

A promising development in experimental biology is the use of RNAi (the i denoting interference). It has been found that when chunks of the RNA of a given gene are inserted in the nucleus of a cell, they may prevent the production of that gene. This possibility is not dissimilar to that offered by libraries of knocked-out genes.

The above descriptions highlight the trend of molecular biology experiments being done by ordering components, or by having them analyzed by large biotech companies.

5. BIOLOGICAL DATA AVAILABLE

In a previous section, we mentioned that all the components of a living cell are 3D structures and that shape is crucial in understanding molecular interactions. A fundamental abstraction often done in biology is to replace the spatial 3D information specifying chemical bindings with a much simpler sequence of symbols: nucleotides or amino acids. In the case of DNA, we know that the helix is the underlying 3D structure.

Although it is much more convenient to deal with sequences of symbols than with complex 3D entities, the problem of shape determination remains a critical one in the case of RNA and proteins.

The previous section outlined the laboratory tools for gathering biological data. The vast majority of the existing information has been obtained through sequencing, and it is expressible by strings, that is, sequences of symbols. These sequences specify mostly nucleotides (genomic data), but there is also substantial information on sequences of amino acids.

Next in volume of available information are the results of microarray experiments. These can be viewed as very large, usually dense, matrices of real numbers. These matrices may have thousands of rows and columns. And that is also the case of the sparse Boolean matrices describing protein interactions.

The information about 3D structures of proteins pales in comparison to that available in sequence form. The Protein Data Bank (PDB) is the repository for all known three-dimensional protein structures.

In a recent search, I found that there are now about 26 billion base pairs (bp) representing the various genomes available in the server of the National Center for Biotechnology Information (NCBI). Besides the human genome with about 3 billion bp, many other species have their complete genome available there. These include several bacteria (e.g., E. coli) and higher organisms including yeast, worm, fruit fly, mouse, and plants (e.g., Arabidopsis).

The largest known gene in the NCBI server has about 20 million base pairs, and the largest protein consists of about 34,000 amino acids. These figures give an idea of the lengths of the entities we have to deal with.

In contrast, the PDB has a catalogue of only 45,000 proteins specified by their 3D structure. These proteins originate from various organisms. The relatively meager protein data shows the enormous need of inferring protein shape from data available in the form of sequences. This is one of the major tasks facing biologists. But many others lie ahead.

The goal of understanding protein structure is only part of the task. Next we have to understand how proteins interact and form the metabolic and signaling pathways in the cell.


There is information available about metabolic pathways in simple organisms, and parts of those pathways are known for human cells. The formidable task is to put all the available information together so that it can be used to understand better the functioning of the human cell. That pursuit is called functional genomics.

The term genomics is used to denote the study of various genomes as entities having similar contents. In the past few years, other terms ending with the suffixes -ome or -omics have been popularized. That explains proteomics (the study of all the proteins of a genome), transcriptome, metabolome, and so forth.

6. PRESENT GOALS OF BIOINFORMATICS

The present role of bioinformatics is to aid biologists in gathering and processing genomic data to study protein function. Another important role is to aid researchers at pharmaceutical companies in making detailed studies of protein structures to facilitate drug design. Typical tasks done in bioinformatics include:

Inferring a protein's shape and function from a given sequence of amino acids,

Finding all the genes and proteins in a given genome,

Determining sites in the protein structure where drug molecules can be attached.

To perform these tasks, one usually has to investigate homologous sequences or proteins for which genes have been determined and structures are available. Homology between two sequences (or structures) suggests that they have a common ancestor. Since those ancestors may well be extinct, one hopes that similarity at the sequence or structural level is a good indicator of homology.

It is important to keep in mind that sequence similarity does not always imply similarity in structure, and vice-versa. As a matter of fact, it is known that two fairly dissimilar sequences of amino acids may fold into similar 3D structures.

Nevertheless, the search for similarity is central to bioinformatics. When given a sequence (nucleotides or amino acids), one usually performs a search of similarity with databases that comprise all available genomes and known proteins. Usually, the search yields many sequences with varying degrees of similarity. It is up to the user to select those that may well turn out to be homologous.

In the next section we describe the various computer science algorithms that are frequently used by bioinformaticians.

7. ALGORITHMS FREQUENTLY USED IN BIOINFORMATICS

We recall that a major role of bioinformatics is to help infer gene function from existing data. Since that data is varied, incomplete, noisy, and covers a variety of organisms, one has to constantly resort to the biological principles of evolution to filter out useful information.

Based on the availability of the data and the goals described in Sections 4 to 6, we now present the various algorithms that lead to a better understanding of gene function. They can be summarized as follows:

(1) Comparing Sequences. Given the huge number of sequences available, there is an urgent need to develop algorithms capable of comparing large numbers of long sequences. These algorithms should allow the deletion, insertion, and replacement of symbols representing nucleotides or amino acids, for such transmutations occur in nature.

(2) Constructing Evolutionary (Phylogenetic) Trees. These trees are often constructed after comparing sequences belonging to different organisms. Trees group the sequences according to their degree of similarity. They serve as a guide to reasoning about how these sequences have been transformed through evolution. For example, they infer homology from similarity, and may rule out erroneous assumptions that contradict known evolutionary processes.


(3) Detecting Patterns in Sequences. There are certain parts of DNA and amino acid sequences that need to be detected. Two prime examples are the search for genes in DNA and the determination of subcomponents of a sequence of amino acids (secondary structure). There are several ways to perform these tasks. Many of them are based on machine learning and include probabilistic grammars or neural networks.

(4) Determining 3D Structures from Sequences. The problems in bioinformatics that relate sequences to 3D structures are computationally difficult. The determination of RNA shape from sequences requires algorithms of cubic complexity. The inference of shapes of proteins from amino acid sequences remains an unsolved problem.

(5) Inferring Cell Regulation. The function of a gene or protein is best described by its role in a metabolic or signaling pathway. Genes interact with each other; proteins can also prevent or assist in the production of other proteins. The available approximate models of cell regulation can be either discrete or continuous. One usually distinguishes between cell simulation and modeling. The latter amounts to inferring the former from experimental data (say, microarrays). This process is usually called reverse engineering.

(6) Determining Protein Function and Metabolic Pathways. This is one of the most challenging areas of bioinformatics, and one for which there is not considerable data readily available. The objective here is to interpret human annotations for protein function and also to develop databases representing graphs that can be queried for the existence of nodes (specifying reactions) and paths (specifying sequences of reactions).

(7) Assembling DNA Fragments. Fragments provided by sequencing machines are assembled using computers. The tricky part of that assemblage is that DNA has many repetitive regions and the same fragment may belong to different regions. The algorithms for assembling DNA are mostly used by large companies (like the former Celera).

(8) Using Script Languages. Many of the above applications are already available on websites. Their usage requires scripting that provides data for an application, receives it back, and then analyzes it.

The algorithms required to perform the above tasks are detailed in the following subsections. What differentiates bioinformatics problems from others is the huge size of the data and its (sometimes questionable) quality. That explains the need for approximate solutions.

It should be remarked that several of the problems in bioinformatics are constrained optimization problems. The solution to those problems is usually computationally expensive. One of the efficient known methods in optimization is dynamic programming. That explains why this technique is often used in bioinformatics. Other approaches like branch-and-bound are also used, but they are known to have higher complexity.

7.1. Comparing Sequences

From the biological point of view, sequence comparison is motivated by the fact that all living organisms are related by evolution. That implies that the genes of species that are closer to each other should exhibit similarities at the DNA level; one hopes that those similarities also extend to gene function.

The following definitions are useful in understanding what is meant by the comparison of two or more sequences. An alignment is the process of lining up sequences to achieve a maximal level of identity. That level expresses the degree of similarity between sequences. Two sequences are homologous if they share a common ancestor, which is not always easy to determine. The degree of similarity obtained by alignment can be useful in determining the possibility of homology between two sequences.


In biology, the sequences to be compared are either nucleotides (DNA, RNA) or amino acids (proteins). In the case of nucleotides, one usually aligns identical nucleotide symbols. When dealing with amino acids, the alignment of two amino acids occurs if they are identical or if one can be derived from the other by substitutions that are likely to occur in nature.

An alignment can be either local or global. In the former, only portions of the sequences are aligned, whereas in the latter one aligns over the entire length of the sequences. Usually, one uses gaps, represented by the symbol '-', to indicate that it is preferable not to align two symbols because, in so doing, many other pairs can be aligned. In local alignments, there are larger regions of gaps. In global alignments, gaps are scattered throughout the alignment.

A measure of likeness between two sequences is percent identity: once an alignment is performed, we count the number of columns containing identical symbols. The percent identity is the ratio between that number and the number of symbols in the (longest) sequence. A possible measure, or score, of an alignment is calculated by summing up the matches of identical (or similar) symbols and counting gaps as negative.

With these preliminary definitions in mind, we are ready to describe the algorithms that are often used in sequence comparison.
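As a small illustration of the two measures just defined, the sketch below computes percent identity and a simple score for a pair of sequences that have already been aligned (gaps written as '-'). The weights are the ones adopted later in this section (+1 for a match, -1 for a mismatch, -2 per gap); the function names are illustrative only.

```python
# Percent identity and a simple alignment score for two pre-aligned sequences.

def percent_identity(a: str, b: str) -> float:
    """Identical columns divided by the length of the longest (ungapped) sequence."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / max(len(a.replace("-", "")), len(b.replace("-", "")))

def alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Sum of column weights for an alignment."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

print(percent_identity("ab-cd", "-buc-"))   # 2 identical columns / 4 symbols -> 0.5
print(alignment_score("ab-cd", "-buc-"))    # 2 matches and 3 gaps -> -4
```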

7.1.1. Pairwise Alignment. Many of the methods of pattern matching used in computer science assume that matches contain no gaps. Thus there is no match for the pattern bd in the text abcd. In biological sequences, gaps are allowed, and an alignment of abcd with bd yields the representation:

a b c d
- b - d

Similarly, an alignment of abcd with buc yields:

a b - c d
- b u c -

The above implies that gaps can appear both in the text and in the pattern. Therefore there is no point in distinguishing texts from patterns. Both are called sequences. Notice that, in the above examples, the alignments maximize matches of identical symbols in both sequences. Therefore, sequence alignment is an optimization problem.

A similar problem exists when we attempt to automatically correct typing errors like character replacements, insertions, and deletions. Google and Word, for example, are able to handle some typing errors and display suggestions for possible corrections. That implies searching a dictionary for best matches.

An intuitive way of aligning two sequences is by constructing dot matrices. These are Boolean matrices representing possible alignments that can be detected visually. Assume that the symbols of a first sequence label the rows of the Boolean matrix and those of the second sequence label the columns. The matrix is initially set to zero (or false). An entry becomes a one (or true) if the labels of the corresponding row and column are identical.

Consider now two identical sequences. Initially, assume that all the symbols in a sequence are different. The corresponding dot matrix has ones in the diagonal, indicating a perfect match. If the second sequence is a subsequence of the first, the dot matrix also shows a shorter diagonal line of ones indicating where matches occur.

The usage of dot matrices requires a visual inspection to detect large chunks of diagonals indicating potential common regions of identical symbols. Notice, however, that if the two sequences are long and contain symbols of a small vocabulary (like the four nucleotides in DNA), then noise occurs: that means that there will be a substantial number of scattered ones throughout the matrix, and there may be several possible diagonals that need to be inspected to find the one that maximizes the number of symbol matches. Comparing multiple symbols instead of just two (one in a row, the other in a column) may reduce noise.
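A dot matrix is easy to sketch in a few lines; the fragment below (illustrative names, no windowing over multiple symbols) simply marks with a 1 every cell whose row and column symbols are identical.

```python
# Minimal sketch of a dot matrix: a 1 marks every cell whose row symbol and
# column symbol are identical.  Comparing windows of several symbols at a
# time, as suggested above, is left out for brevity.

def dot_matrix(s, t):
    return [[1 if a == b else 0 for b in t] for a in s]

for row in dot_matrix("ACGT", "ACGT"):
    print(row)          # 1s on the main diagonal indicate a perfect match
```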


An interesting case that often occurs in biology is one in which a sequence contains repeated copies of its subsequences. That results in multiple diagonals, and again visual inspection is used to detect the best matches.

The time complexity of constructing dot matrices for two sequences of lengths m and n is mn. The space complexity is also mn. These values may well be intolerable for very large values of m and n. Notice also that, if we know that the sequences are very similar, we do not need to build the entire matrix. It suffices to construct the elements around the diagonal. Therefore, one can hope to achieve almost linear complexity in those cases. It will be seen later that the most widely used pairwise sequence alignment algorithm (BLAST) can be described in terms of finding diagonals without constructing an entire dot matrix.

The dot matrix approach is useful but does not yield precise measures of similarity among sequences. To do so, we introduce the notion of costs for the gaps, exact matches, and the fact that sometimes an alignment of different symbols is tolerated and considered better than introducing gaps.

We will now open a parenthesis to convince the reader of the advantages of using dynamic programming in finding optimal paths in graphs. Let us consider a directed acyclic graph (DAG) with possibly multiple entry nodes and a single exit node. Each directed edge is labeled with a number indicating the cost (or weight) of that edge in a path from the entry to the exit node. The goal is to find an optimal (maximal or minimal) path from an entry to the exit node.

The dynamic programming (DP) approach consists of determining the best path to a given node. The principle is simple: consider all the incoming edges ei of a node V. Each of these edges is labeled by a value vi indicating the weight of the edge ei. Let pi be the optimal values for the nodes that are the immediate predecessors of V. An optimal path from an entry point to V is the one corresponding to the maximum (or the minimum) of the quantities p1 + v1, p2 + v2, p3 + v3, etc.

Starting from the entry nodes, one determines the optimal path connecting those nodes to their successors. Each successor node is then considered and processed as above. The time complexity of the DP algorithm for DAGs is linear in its number of nodes. Notice that the algorithm determines a single value representing the total cost of the optimal path.

If one wished to determine the sequence of nodes in that path, then one would have to perform a second (backward) pass starting from the exit node and retrace one by one the nodes in that optimal path. The complexity of the backward pass is also linear. A word of caution is in order. Let us assume that there are several paths that yield the same total cost. Then the complexity of the backward pass could be exponential!
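The sketch below illustrates this forward pass on a DAG whose nodes are assumed to be given in topological order; the data layout and names are illustrative, not a prescribed representation. The backward pass amounts to retracing the stored predecessors from the exit node.

```python
# Sketch of dynamic programming on a DAG: maximum-weight path to each node.
# Nodes are assumed to be listed in topological order; 'edges' maps each node
# to a list of (predecessor, weight) pairs.

def best_path(nodes, edges, entry_nodes):
    best = {n: (0 if n in entry_nodes else float("-inf")) for n in nodes}
    back = {}                                  # predecessor on an optimal path
    for v in nodes:                            # topological order
        for u, w in edges.get(v, []):
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                back[v] = u
    return best, back

nodes = ["a", "b", "c", "d"]
edges = {"b": [("a", 2)], "c": [("a", 1), ("b", 3)], "d": [("b", 1), ("c", 2)]}
best, back = best_path(nodes, edges, {"a"})
print(best["d"])                               # 7, along the path a -> b -> c -> d
```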

We will now show how to construct a DAG that can be used to determine an optimal pairwise sequence alignment. Let us assume that the cost of introducing a gap is g, the cost of matching two identical symbols is s, and the cost of tolerating the alignment of different symbols is d. In practice, when matching nucleotide sequences, it is common to use the weights g = -2, s = 1, and d = -1. That attribution of weights penalizes gaps most, followed by a tolerance of unmatched symbols; identical symbols induce the highest weight. The optimal path being sought is the one with a total maximum cost.

Consider now the following part of a DAG expressing the three choices dealing with a pair of symbols being aligned.

[Figure omitted: a single tile of the alignment DAG, with a horizontal edge, a vertical edge, and a diagonal edge.]

The horizontal and vertical arrows state that a gap may be introduced either in the top or in the bottom sequence. The diagonal indicates that the symbols will be aligned, and the cost of that choice is either s (if the symbols match) or d if they do not match.


Consider now a two-dimensional matrix organized as the previously described dot matrix, with its rows and columns being labeled by the elements of the sequences being aligned. The matrix entries are the nodes of a DAG, each node having the three outgoing directed edges as in the above representation. In other words, the matrix is tiled with copies of the above subgraph. Notice that the bottommost row (and the rightmost column) consists only of a sequence of directed horizontal (vertical) edges labeled by g.

An optimal alignment amounts to determining the optimal path in the overall DAG. The single entry node is the matrix element with indices [0, 0] and the single exit node is the element indexed by [m, n], where m and n are the lengths of the two sequences being aligned.

The DP measure of similarity of the pairwise alignment is a score denoting the sum of weights in the optimal path. With the weights mentioned earlier (g = -2, s = 1, and d = -1), the best score occurs when two identical sequences of length n are aligned; the resulting score is n, since the cost attributed to a diagonal edge is 1. The worst (unrealistic) case occurs when aligning a sequence with the empty sequence, resulting in a score of -2n. Another possible outcome is that different symbols become aligned, since the resulting path yields better scores than those that introduce gaps. In that case, the score becomes -n.

Now a few words about the complexity of the DP approach for alignment. Since there are m × n nodes in the DAG, the time and space complexity is quadratic when the two sequences have similar lengths. As previously pointed out, that complexity can become intolerable (exponential) if there exist multiple optimal solutions and all of them need to be determined. (That occurs in the backward pass.)

The two often-used algorithms for pairwise alignment are those developed by the pairs of co-authors Needleman-Wunsch and Smith-Waterman. They differ in the costs attributed to the outermost horizontal and vertical edges of the DAG. In the Needleman-Wunsch approach, one uses weights for the outermost edges that encourage the best overall (global) alignment. In contrast, the Smith-Waterman approach favors the contiguity of segments being aligned.

Most of the textbooks mentioned in the references (e.g., Setubal and Meidanis [1997], Durbin et al. [1998], and Dwyer [2002]) contain good descriptions of using dynamic programming for performing pairwise alignments.
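As an illustration of the scoring pass just described, here is a sketch of global (Needleman-Wunsch style) alignment scoring with the weights used above (gap -2, match +1, mismatch -1). Only the forward pass is shown; recovering an actual alignment requires the backward pass discussed earlier, and the function name is illustrative.

```python
# Sketch of global alignment scoring in the Needleman-Wunsch style,
# using the weights discussed above (gap -2, match +1, mismatch -1).

def global_alignment_score(x, y, gap=-2, match=1, mismatch=-1):
    m, n = len(x), len(y)
    # score[i][j] = best score for aligning x[:i] with y[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap                 # outermost column: all gaps
    for j in range(1, n + 1):
        score[0][j] = j * gap                 # outermost row: all gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + diag,   # align two symbols
                              score[i - 1][j] + gap,        # gap in y
                              score[i][j - 1] + gap)        # gap in x
    return score[m][n]

print(global_alignment_score("abcd", "abcd"))   # 4: identical sequences of length 4
print(global_alignment_score("abcd", "bd"))     # -2: two matches and two gaps (a b c d over - b - d)
```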

7.1.2. Aligning Amino Acid Sequences. The DP algorithm is applicable to any sequence provided the weights for comparisons and gaps are properly chosen. When aligning nucleotide sequences, the previously mentioned weights yield good results. A more careful assessment of the weights has to be done when aligning sequences of amino acids. This is because the comparison between any two amino acids should take evolution into consideration.

Biologists have developed 20 × 20 triangular matrices that provide the weights for comparing identical and different amino acids, as well as the weight that should be attributed to gaps. The two most frequently used matrices are known as PAM (Percent Accepted Mutation) and BLOSUM (Blocks Substitution Matrix). These matrices reflect the weights obtained by comparing the amino acid substitutions that have occurred through evolution. They are often called substitution matrices.

One usually qualifies those matrices by a number: higher values of X in either PAM X or BLOSUM X indicate more lenience in estimating the difference between two amino acids. An analogy with the previously mentioned weights clarifies what is meant by lenience: a weight of 1 attributed to identical symbols and 0 attributed to different symbols is more lenient than retaining the weight of 1 for symbol identity and utilizing the weight -1 for nonidentity.


Many bioinformatics texts (e.g., Mount [2001] and Pevzner [2000]) provide detailed descriptions of how substitution matrices are computed.

7.1.3. Complexity Considerations and BLAST. The quadratic complexity of the DP-based algorithms renders their usage prohibitive for very large sequences. Recall that the present genomic database contains about 30 billion base pairs (nucleotides), and thousands of users accessing that database simultaneously would like to determine if a sequence being studied, made up of thousands of symbols, can be aligned with existing data. That is a formidable problem!

The program called BLAST (Basic Local Alignment Search Tool), developed by the National Center for Biotechnology Information (NCBI), has been designed to meet that challenge. The best way to explain the workings of BLAST is to recall the approach using dot matrices. In BLAST, the sequence whose presence one wishes to investigate in a huge database is split into smaller subsequences. The presence of those subsequences in the database can be determined efficiently (say, by hashing and indexing).

BLAST then attempts to pursue further matching by extending the left and right contexts of the subsequences. The pairings that do not succeed well are abandoned, and the best match is chosen as the result of the search. The functioning of BLAST can therefore be described as finding portions of the diagonals in a dot matrix and then attempting to determine the ones that can be extended as much as possible. It appears that such a technique yields practically linear complexity. The BLAST site handles about 90,000 searches per day. Its success demonstrates that excellent hacking has its place in computer science.

BLAST allows comparisons of either nucleotide or amino acid sequences with those existing in the NCBI database. In the case of amino acids, the user is offered various options for selecting the applicable substitution matrices. Other input parameters are also available.
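
The seed-and-extend idea can be sketched in a few lines: index every fixed-length word of the database sequence, look up the words of the query, and extend each hit exactly to the left and right. This is a drastic simplification (no scored extensions, no neighborhood words, no statistics), intended only to illustrate why the approach behaves almost linearly in practice.

    # Drastically simplified seed-and-extend search (illustration only).
    from collections import defaultdict

    def find_seeds(db, query, w=4):
        index = defaultdict(list)              # word -> positions in db
        for i in range(len(db) - w + 1):
            index[db[i:i + w]].append(i)
        return [(i, j)                         # word of query at j matches db at i
                for j in range(len(query) - w + 1)
                for i in index.get(query[j:j + w], [])]

    def extend(db, query, i, j, w=4):
        # Grow the exact word match to the left and to the right.
        left = 0
        while i - left - 1 >= 0 and j - left - 1 >= 0 and db[i - left - 1] == query[j - left - 1]:
            left += 1
        right = 0
        while (i + w + right < len(db) and j + w + right < len(query)
               and db[i + w + right] == query[j + w + right]):
            right += 1
        return db[i - left:i + w + right]

    db = "TTACGGATTCAGGCATTACGGATC"
    query = "CGGATTCAG"
    print(max((extend(db, query, i, j) for i, j in find_seeds(db, query)), key=len))
    # prints CGGATTCAG, the longest exactly extended seed hit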

Among the important information provided by a BLAST search is the p-value associated with each of the sequences that match a user-specified sequence. A p-value is a measure of how much evidence we have against the null hypothesis. (The null hypothesis is that the observations are purely the result of chance.) A very small p-value (say, of the order of 10^-19) indicates that it is very unlikely that the sequence found by the search is totally unrelated to the one provided by the user. The home page of BLAST is an excellent source for a tutorial and a wealth of other information (http://www.ncbi.nlm.nih.gov/BLAST/).

FASTA is another frequently used search program with a strategy similar to that of BLAST. The differences between BLAST and FASTA are discussed in many texts (e.g., Pevsner [2003]).

A topic that has attracted the attention of present researchers is the comparison between two entire genomes. That involves aligning sequences containing billions of nucleotides. Programs have been developed to handle these time-consuming tasks. Among these programs is PipMaker [Schwartz et al. 2000]; a discussion of the specific problems of comparing two genomes is presented in Miller [2001]. An interesting method of entire-genome comparison using suffix trees is described in Delcher et al. [1999].

7.1.4. Multiple Alignments. Let us assume that a multiple alignment is performed for a set of sequences. One calls the consensus sequence the one obtained by selecting, for each column of the alignment, the symbol that appears most often in that column.
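
For instance, a simple majority-rule consensus can be computed column by column, as in the short sketch below (real tools treat ties and gap characters more carefully).

    from collections import Counter

    # Consensus of a multiple alignment: most frequent symbol per column.
    def consensus(alignment):
        return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))

    print(consensus(["GAT-ACA", "GATTACA", "GCTTAC-"]))  # GATTACA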

Multiple alignments are usually performed using sequences of amino acids that are believed to have similar structures. The biological motivation for multiple alignments is to find common patterns that are conserved among all the sequences being considered. Those patterns may elucidate aspects of the structure of a protein being studied.

Trying to extend the dot matrix and DP approaches to the alignment of three or more sequences is a tricky proposition. One soon gets into difficult, time-consuming problems. A three-dimensional dot matrix cannot be easily inspected visually. The DP approach has to consider DAGs whose nodes have seven outgoing edges instead of the three edges needed for pairwise alignment (see, e.g., Dwyer [2002]).

As dimensionality grows, so does algorithmic complexity. It has been proved that the complexity of multiple alignment grows exponentially with the number of sequences to be aligned. That does not prevent biologists from using approximate methods. These approximate approaches are sometimes unsatisfactory; therefore, multiple alignment remains a worthy topic of research. Among the approximate approaches, we consider two.

The first is to reduce a multiple alignment to a series of pairwise alignments and then combine the results. One can use the DP approach to align all pairs of sequences and display the results in a triangular matrix form such that each entry [i, j] represents the score obtained by aligning sequence i with sequence j.

What follows is more an art than a science. One can select a center sequence C as the one that yields the maximum sum of pairwise scores with all the others. The other sequences are then aligned with C following the empirical rule: once a gap is introduced, it is never removed. As in the case of pairwise alignments, one can obtain global or local alignments. The former attempts to obtain an alignment with maximum score regardless of the positions of the symbols. In contrast, local alignments favor contiguity of matched symbols.
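
The selection of the center sequence can be sketched directly; the fragment below reuses the global_alignment_score function from the pairwise-alignment sketch earlier in this section and simply picks the sequence whose summed pairwise scores are largest (the first step of this star-like heuristic).

    # Pick the "center" sequence of a star-style multiple alignment:
    # the sequence with the maximum sum of pairwise alignment scores.
    def pick_center(seqs, score=global_alignment_score):
        totals = [sum(score(s, t) for t in seqs if t is not s) for s in seqs]
        return seqs[totals.index(max(totals))]

    print(pick_center(["GATTACA", "GATTTACA", "GACTACA", "TATTACA"]))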

Another approach to performing multiple alignments is to use Hidden Markov Models (HMMs), which are covered in Section 7.3.2.

CLUSTAL and its variants are software packages often used to produce multiple alignments. As in the case of pairwise alignments, these packages offer the capability of utilizing substitution matrices like BLOSUM or PAM. A description of CLUSTAL W appears in Thompson et al. [1994].

7.1.5. Pragmatic Aspects of Alignments. An important aspect of providing results for sequence alignments is their presentation. Visual inspection is crucial in obtaining meaningful interpretations of those results. The more elaborate packages that perform alignments use various colors to indicate regions that are conserved and provide statistical data to assess the confidence level of the results.

Another aspect worth mentioning is the variety of formats that are available for input and display of sequences. Some packages require specific formats and, in many cases, it is necessary to translate from one format to another.

    7.2. Phylogenetic Trees

Since evolution plays a key role in biology, it is natural to attempt to depict it using trees. These are referred to as phylogenetic trees: their leaves represent various organisms, species, or genomic sequences; an internal node P stands for an abstract organism (species, sequence) whose existence is presumed and whose evolution led to the organisms that are its direct descendants, on the branches emanating from P.

A motivation for depicting trees is to express, in graphical form, the outcome of multiple alignments by the relationships that exist between pairs or groups of sequences. These trees may reveal evolutionary inconsistencies that have to be resolved. In that sense, the construction of phylogenetic trees validates or invalidates conjectures made about possible ancestors of a group of organisms.

Consider a multiple alignment: among its sequences one can select the two whose pairwise score yields the highest value. We then create an abstract node representing the direct ancestor of those two sequences.

A tricky step then is to reconstruct, among several possible sequences, one that best represents its children. This requires both ingenuity and intuition. Once the choice is made, the abstract ancestor sequence replaces its two children, and the algorithm continues recursively until a root node is determined. The result is a binary tree whose root represents the primordial sequence that is assumed to have generated all the others. We will soon revisit this topic.

There are several types of trees used in bioinformatics. Among them, we mention the following:

(1) Unrooted trees are those that specify distances (differences) between species. The length of a path between any two leaves represents the accumulated differences.

(2) Cladograms are rooted trees in which the branch lengths have no meaning; the initial example in this section is a cladogram.

(3) Phylograms are extended cladograms in which the length of a branch quantifies the number of genetic transformations that occurred between a given node and its immediate ancestor.

(4) Ultrametric trees are phylograms in which the accumulated distances from the root to each of the leaves are quantified by the same number; ultrametric trees are therefore the ones that provide the most information about evolutionary changes. They are also the most difficult to construct.

The above definitions suggest establishing some sort of molecular clock, in which mutations occur at some predictable rate and there exists a linear relationship between time and number of changes. These rates are known to be different for different organisms and even for the various cell components (e.g., DNA and proteins). That shows the magnitude and difficulty of establishing correct phylogenies.

An algorithm frequently used to construct unrooted trees is called UPGMA (for Unweighted Pair Group Method using Arithmetic averages). Let us reconsider the initial example of multiple alignments and assume that one can quantify the distances between any two pairwise alignments (the DP score of those alignments could yield the information about distances: the higher the score, the lower the distance between the sequences). The various distances can be summarized in triangular matrix form.

The UPGMA algorithm is similar to the one described in the initial example. Consider the two elements E1 and E2 having the lowest distance between them. They are grouped into a new element (E1, E2). An updated matrix is constructed in which the new distances to the grouped element are the averages of the previous distances to E1 and E2. The algorithm continues until all nodes are collapsed into a single node. Note that if the original matrix contains many identical small entries, there would be multiple solutions and the results may be meaningless. In bioinformatics, as in any field, one has to exercise care and judgment in interpreting program output.
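
The agglomerative loop just described fits in a few lines. The sketch below makes the simplifying assumption that the distance from a newly grouped element to every other element is the plain average of its two members' distances (weighted variants exist); the tree is returned as nested tuples.

    # UPGMA-style clustering over a dictionary of pairwise distances.
    # dist[(a, b)] must be defined for every unordered pair of labels.
    def upgma(labels, dist):
        d = {frozenset(p): v for p, v in dist.items()}
        nodes = list(labels)
        while len(nodes) > 1:
            a, b = min(((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
                       key=lambda p: d[frozenset(p)])     # closest pair
            merged = (a, b)                                # new internal node
            for c in nodes:
                if c not in (a, b):
                    d[frozenset((merged, c))] = (d[frozenset((a, c))] +
                                                 d[frozenset((b, c))]) / 2
            nodes = [c for c in nodes if c not in (a, b)] + [merged]
        return nodes[0]

    dist = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
            ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
    print(upgma(["A", "B", "C", "D"], dist))   # ('D', ('C', ('A', 'B')))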

The notion of parsimony is often invoked in constructing phylograms, rooted trees whose branches are labeled by the number of evolutionary steps. Parsimony is based on the hypothesis that mutations occur rarely. Consequently, the overall number of mutations assumed to have occurred throughout evolutionary steps ought to be minimal. If one considers the change of a single nucleotide as a mutation, the problem of constructing trees from sequences becomes hugely combinatorial.

An example illustrates the difficulty. Let us assume an unrooted tree with two types of labels. Sequences label the nodes, and numbers label the branches. The numbers specify the mutations (symbol changes within a given position of the sequence) occurring between the sequences labeling adjacent nodes.

The problem of constructing a tree using parsimony is: given a small number of short nucleotide sequences, place them in the nodes and leaves of an unrooted tree so that the overall number of mutations is minimal.

One can imagine a solution in which all possible trees are generated, their nodes labeled, and the tree with the minimal overall number of mutations is chosen. This approach is of course forbidding for large sets of long sequences, and it is another example of the ubiquity of difficult combinatorial optimization problems in bioinformatics. Mathematicians and theoretical computer scientists have devoted considerable effort to solving these types of problems efficiently (see, e.g., Gusfield [1997]).

We end this section by presenting an interesting recent development in attempting to determine evolutionary trees for entire genomes using data compression [Bennett et al. 2003]. Let us assume that there is a set of long genomic sequences that we want to organize as a cladogram. Each sequence often includes a great deal of intergenic (noncoding) DNA material whose function is still not well understood. The initial example in this section is the basis for constructing the cladogram.

Given a text T and its compressed form C, the ratio r = |C|/|T| (where |S| denotes the length of a sequence S) expresses the degree of compression that has been achieved by the program. The smaller the ratio, the more compressed C is. Data compressors usually operate by using dictionaries to replace commonly used words with pointers to words in the dictionary.

If two sequences S1 and S2 are very similar, it is likely that their respective r ratios are close to each other. Assume now that all the sequences in our set have been compressed and the ratios r are known. It is then easy to construct a triangular matrix whose entries specify the differences in compression ratios between any two sequences.

As in the UPGMA algorithm, we consider the smallest entry and construct the first binary node representing the pair of sequences Si and Sj that are most similar. We then update the matrix by replacing the rows corresponding to Si and Sj by a row representing the combination of Si with Sj. The compression ratio for the combined sequences can be taken to be the average of the compression ratios of Si and Sj. The algorithm proceeds as before by searching for the smallest entry, and so on, until the entire cladogram is constructed.

An immense advantage of the data compression approach is that it will consider as very similar two sequences such as αβγδ and γδαβ, where α, β, γ, and δ are long subsequences that have been swapped around. This is because the two sequences are likely to have comparable compression ratios. Genome rearrangements occur during evolution and could be handled by using the data compression approach.
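
A rough illustration of the idea using a general-purpose compressor (zlib) is given below; the ratio difference used here is only one of several reasonable choices and is not the exact measure of Bennett et al. [2003].

    import zlib

    def compression_ratio(seq):
        # r = |C| / |T|: the smaller the ratio, the more compressible the text.
        data = seq.encode()
        return len(zlib.compress(data)) / len(data)

    seqs = {"S1": "acgtacgt" * 80,                      # highly repetitive
            "S2": "acgtacgt" * 70 + "ttgaccgtaa" * 8,   # mostly the same repeat
            "S3": "gattacagattacca" * 40}               # a different repeat
    r = {name: compression_ratio(s) for name, s in seqs.items()}

    # Triangular "distance" matrix: differences in compression ratios,
    # which could then be fed to the UPGMA-style grouping sketched above.
    names = list(seqs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(a, b, round(abs(r[a] - r[b]), 4))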

Finally, we should point out the existence of horizontal transfers in molecular biology. This term implies that the DNA of a given organism can be modified by the inclusion of foreign DNA material that cannot be explained by evolutionary arguments. That occurrence may possibly be handled using the notion of similarity derived from data compression ratios.

A valuable reference on phylogenetic trees is the recent text by Felsenstein [2003]. It includes a description of PHYLIP (Phylogenetic Inference Package), a frequently used software package developed by Felsenstein for determining trees expressing the relationships among multiple sequences.

    7.3. Finding Patterns in Sequences

It is frequently the case in bioinformatics that one wishes to delimit parts of sequences that have a biological meaning. Typical examples are determining the locations of promoters, exons, and introns in RNA, that is, gene finding, or detecting the boundaries of α-helices, β-sheets, and coils in sequences of amino acids. There are several approaches for performing those tasks. They include neural nets, machine learning, and grammars, especially variants of grammars called probabilistic [Wetherell 1980].

In this subsection, we will deal with two such approaches. One uses grammars and parsing. The other, called Hidden Markov Models or HMMs, is a probabilistic variant of parsing using finite-state grammars.

It should be remarked that the recent capability of aligning entire genomes (see Section 7.1.3) also provides a means for gene finding in new genomes: assuming that all the genes of a genome G1 have been determined, a comparison with the genome G2 should reveal likely positions for the genes in G2.

7.3.1. Grammars and Parsing. Chomsky's language theory is based on grammar rules used to generate sentences. In that theory, a nonterminal is an identifier naming groups of contiguous words that may have subgroups identified by other nonterminals.

In the Chomsky hierarchy of grammars and languages, the finite-state (FS) model is the lowest. In that case, a nonterminal corresponds to a state in a finite-state automaton. In context-free grammars one can specify a potentially infinite number of states. Context-free grammars (CFGs) allow the description of palindromes or matching parentheses, which cannot be described or generated by finite-state models.

Higher than the context-free languages are the so-called context-sensitive ones (CSLs). Those can specify repetitions of a sequence of words, like ww, where w is any sequence over a vocabulary. These repetitions cannot be described by CFGs.

Parsing is the technique of retracing the generation of a sentence using the given grammar rules. The complexity of parsing depends on the language or grammar being considered. Deterministic finite-state models can be parsed in linear time. The worst-case parsing complexity of CF languages is cubic. Little is known about the complexity of general CS languages, but parsing of their strings can be done in finite time.

The parse of a sentence in a finite-state language can be represented by the sequence of states taken by the corresponding finite-state automaton when it scans the input string. A tree conveniently represents the parse of a sentence in a context-free language. Finally, one can represent the parse of a sentence in a CSL by a graph. Essentially, an edge of the graph denotes the symbols (or nonterminals) that are grouped together.

This prelude is helpful in relating language theory to biological sequences. Palindromes and repetitions of groups of symbols often occur in those sequences, and they can be given a semantic meaning. Searls [1992, 2002] has been a pioneer in relating linguistics to biology, and his papers are highly recommended.

All the above grammars, including finite-state, can generate ambiguous strings, and ambiguity and nondeterminism are often present when analyzing biological sequences. In ambiguous situations, as in natural language, one is interested in the most likely parse. And that parse can be determined by using probabilities and contexts. In biology, one can also use energy considerations and dynamic programming to disambiguate multiple parses of sequences.

There is an important difference between linguistics as used in natural language processing and linguistics applied to biology. The sentences or speech utterances in natural language usually amount to relatively few words. In biology, we have to deal with thousands!

It would be wonderful if we could produce a grammar defining a gene, as a nonterminal in a language defining DNA. But that is an extremely difficult task. Similarly, it would be very desirable to have a grammar expressing protein folds. In that case, a nonterminal would correspond to a given structure in 3D space. As in context-sensitive languages, the parse (a graph) would indicate the subcomponents that are close together.

It will be seen later (Section 7.4.1) that CFGs can be conveniently used to map RNA sequences into 2D structures. However, it is doubtful that practical grammars would exist for detecting genes in DNA or determining the tertiary structure of proteins.

In what follows, we briefly describe the types of patterns that are necessary to detect genes in DNA. A nonterminal G, defined by the rules below, can roughly describe the syntax of genes:

G → P R
P → N
R → E I R | E
E → N
I → gt N ag,

where N denotes a sequence of nucleotides a, c, g, t; E is an exon, I an intron, R a sequence of alternating exons and introns, and P is a promoter region, that is, a heading announcing the presence of the gene.

In this simplified grammar, the markers gt and ag are delimiters for introns.

Notice that it is possible to transform the above CFG into an equivalent FSG, since there is a regular expression that defines the above language. But the important remark is that the grammar is highly ambiguous, since the markers gt or ag could appear anywhere within an exon, an intron, or a promoter region. Therefore, the grammar is descriptive but not usable in constructing a parser.
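
The ambiguity is easy to exhibit with an ordinary regular-expression engine. In the sketch below (which follows the toy grammar above, not any real gene model), the same string contains both a short and a long candidate intron, depending on whether gt...ag is matched lazily or greedily.

    import re

    s = "ccaatgtacagttag"                   # a made-up nucleotide string
    print(re.findall(r"gt[acgt]*?ag", s))   # ['gtacag']      shortest intron-like match
    print(re.findall(r"gt[acgt]*ag", s))    # ['gtacagttag']  longest intron-like match

A parser built naively from the grammar would face exactly this choice at every occurrence of the markers, which is why probabilities (next section) or additional constraints are needed.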

One could introduce further constraints about the nature of promoters, require that the lengths of introns and exons adhere to certain bounds, and require that the combined lengths of all the exons be a multiple of three, since a gene is transcribed and spliced to form a sequence of triplets (codons).

Notice that even if one transforms the above rules with constraints into finite-state rules, the ambiguities will remain. The case of alternative splicing bears witness to the presence of ambiguities. The alternation of exons and introns can be interpreted in many different ways, thus accounting for the fact that a given gene may generate alternate proteins depending on contexts. Furthermore, in biology there are exceptions to most rules. All this implies that probabilities will have to be introduced. This is a good motivation for probabilistic grammars, as shown in the following section.

7.3.2. Hidden Markov Models (HMMs). HMMs are widely used in biological sequence analysis. They originated, and still play a significant role, in speech recognition [Rabiner 1989].

HMMs can be viewed as variants of probabilistic or stochastic finite-state transducers (FSTs). In an FST, the automaton changes states according to the input symbols being examined. In a given state, the automaton also outputs a symbol. Therefore, FSTs are defined by sets of states, transitions, and input and output vocabularies. There is, as usual, an initial state and one or more final states.

In a probabilistic FST, the transitions are specified by probabilities denoting the chance that a given state will be changed to a new one upon examining a symbol in the input string. Obviously, the probabilities of the transitions emanating from a given state for a given symbol have to add up to 1. The automata that we are dealing with can be, and usually are, nondeterministic. Therefore, upon examining a given input symbol, the transition taken depends on the specified probabilities.

As in the case of an FST, an output is produced upon reaching a new state. An HMM is a probabilistic FST in which there is also a set of pairs [p, s] associated with each state; p is a probability and s is a symbol of the output vocabulary. The sum of the p's in each set of pairs within a given state also has to equal 1. One can assume that the input vocabulary for an HMM consists of a unique dummy symbol (say, the equivalent of an empty symbol).

Actually, in the HMM paradigm, we are solely interested in state transitions and output symbols. As in the case of finite-state automata, there is an initial state and a final state.

Upon reaching a given state, the HMM automaton produces the output symbol s with a probability p. The p's are called emission probabilities. As described so far, the HMM behaves as a string generator.

The following example, inspired by Durbin et al. [1998], is helpful in understanding the algorithms involved in HMMs.

Assume we have two coins: one unbiased, the other biased. We will use the letters F (fair) for the former and L (loaded) for the latter. When tossed, the L coin yields Tails 75% of the time. The two coins are indistinguishable from each other in appearance.

Now imagine the following experiment: the person tossing the coins uses only one coin at a given time. From time to time, he alternates between the fair and the crooked coin. However, we do not know at a given time which coin is being used (hence the term hidden in HMM). But let us assume that the transition probabilities of switching coins are known. The transition from F to L has probability u, and the transition from L to F has probability v.

Let F be the state representing the usage of the fair coin and L the state representing the usage of the loaded coin. The emission probabilities for the F state are obviously 1/2 for Heads and 1/2 for Tails. Let us assume that the corresponding probabilities while in L are 3/4 for Tails and 1/4 for Heads.

Let [r, O] denote emission of the symbol O with probability r, and let {S, [r, O], w, S'} denote the transition from state S to state S' with probability w. In our particular example, we have:

{F, [1/2, H], u, L}
{F, [1/2, T], u, L}
{F, [1/2, H], 1 - u, F}
{F, [1/2, T], 1 - u, F}
{L, [3/4, T], v, F}
{L, [1/4, H], v, F}
{L, [3/4, T], 1 - v, L}
{L, [1/4, H], 1 - v, L}.

Let us assume that both u and v are small, that is, one rarely switches from one coin to another. Then a generator simulating the above HMM could produce the string:

. . . HTHTTHTT . . . . . . HTTTTTTTHT . . .

. . . FFFFFFFF . . . . . . LLLLLLLLLLL . . .

The sequence of states below each emitted symbol indicates the present state, F or L, of the generator.
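
The generator just described can be written down directly. In the sketch below, the emission tables are those of the text; the switching probabilities u and v are given illustrative numerical values, since the text leaves them symbolic.

    import random

    # Fair/loaded-coin HMM used as a string generator (states F and L).
    u, v = 0.05, 0.05                                  # illustrative values
    emission = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.25, "T": 0.75}}
    transition = {"F": {"F": 1 - u, "L": u}, "L": {"L": 1 - v, "F": v}}

    def generate(n, state="F"):
        symbols, states = [], []
        for _ in range(n):
            symbols.append(random.choices(["H", "T"],
                           weights=[emission[state]["H"], emission[state]["T"]])[0])
            states.append(state)
            state = random.choices(list(transition[state]),
                                   weights=list(transition[state].values()))[0]
        return "".join(symbols), "".join(states)

    observed, hidden = generate(40)
    print(observed)   # e.g. HTHTTHTTTT...
    print(hidden)     # e.g. FFFFFFLLLL...  (hidden in a real experiment)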

The main usage of HMMs is in the reverse problem: recognition, or parsing. Given a sequence of Hs and Ts, attempt to determine the most likely corresponding state sequence of Fs and Ls.

We now pause to mention an often-neglected characteristic of nondeterministic, and even ambiguous, finite-state automata (FSA). Given an input string accepted by the automaton, it is possible to generate a directed acyclic graph (DAG) expressing all possible parses (sequences of transition states). The DAG is a beautifully compact form in which to express all those parses. The complexity of the DAG construction is O(n|S|), in which n is the size of the input string and |S| is the number of states. If |S| is small, then the DAG construction is linear!

Let us return to the biased-coin example. Given an input sequence of Hs and Ts produced by an HMM generator, and also the numeric values for the transition and emission probabilities, we could generate a labeled DAG expressing all possible parses for the given input string. The labels of the edges and nodes of the graph correspond to the transition and emission probabilities.

To determine the optimal parse (path), we are therefore back to dynamic programming (DP), as presented in the section on alignments (7.1). The DP algorithm that finds the path of maximum likelihood in the DAG is known as the Viterbi algorithm: given an HMM and an input string it accepts, determine the most likely parse, that is, the sequence of states that best represents the parsing of the input string.
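
A direct transcription of the Viterbi recurrence for the two-state coin model is shown below; it works in log space to avoid underflow and uses the same illustrative parameters as the generator sketch (u = v = 0.05) plus an assumed uniform initial distribution.

    from math import log

    states = "FL"
    emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.25, "T": 0.75}}
    trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
    start = {"F": 0.5, "L": 0.5}     # assumed initial distribution

    def viterbi(obs):
        # v[s] = best log-probability of any state path ending in state s
        v = {s: log(start[s]) + log(emit[s][obs[0]]) for s in states}
        back = []
        for o in obs[1:]:
            prev, v, ptr = v, {}, {}
            for s in states:
                best = max(states, key=lambda p: prev[p] + log(trans[p][s]))
                v[s] = prev[best] + log(trans[best][s]) + log(emit[s][o])
                ptr[s] = best
            back.append(ptr)
        path = [max(states, key=lambda s: v[s])]
        for ptr in reversed(back):          # follow the back-pointers
            path.append(ptr[path[-1]])
        return "".join(reversed(path))

    print(viterbi("HTHH" + "T" * 20 + "HHTH"))   # the long run of Tails comes out as Ls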

A biological application closely related to the coin-tossing example is the determination of GC islands in DNA. Those are regions of DNA in which the nucleotides G and C appear with a higher frequency than the others. The detection of GC islands is relevant since genes usually occur in those regions.

An important consideration not yet discussed is how to determine the HMM's transition and emission probabilities. To answer that question, we have to enter the realm of machine learning.

Let us assume the existence of a learning set in which the sequence of tosses is annotated by very good guesses of when the changes of coins occurred. If that set is available, one can compute the transition and emission probabilities simply by ratios of counts. This learning is referred to as supervised learning, since the user provides the answers (states) for each sequence of tosses.
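
With fully annotated data, those ratios of counts are one-liners, as the following sketch (using a made-up annotated toss sequence) illustrates.

    from collections import Counter

    def estimate(symbols, states):
        # Supervised (counting) estimates of emission and transition probabilities.
        emit = Counter(zip(states, symbols))
        trans = Counter(zip(states, states[1:]))
        state_total = Counter(states)          # denominators for emissions
        out_total = Counter(states[:-1])       # denominators for transitions
        emission = {(s, o): c / state_total[s] for (s, o), c in emit.items()}
        transition = {(a, b): c / out_total[a] for (a, b), c in trans.items()}
        return emission, transition

    e, t = estimate("HTHTTTHTHTHH", "FFFLLLLLFFFF")   # hypothetical annotated tosses
    print(e)   # e.g. ('L', 'T'): 0.8  means P(T | L) = 0.8 in this tiny sample
    print(t)   # e.g. ('L', 'L'): 0.8  means P(L -> L) = 0.8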


An even more ambitious question is: can one determine the probabilities without having to annotate the input string? The answer is yes, with reservations. First, one has to suspect the existence of the different states and the topology of the HMM; furthermore, the generated probabilities may not be of practical use if the data is noisy. This approach is called unsupervised learning, and the corresponding algorithm is called Baum–Welch.

The interested reader is highly encouraged to consult the book by Durbin et al. [1998], where the Viterbi and Baum–Welch (probability-generating) algorithms are explained in detail. A very readable paper by Krogh [1998] is also recommended. That paper describes interesting HMM applications such as multiple alignments and gene finding. Many applications of HMMs in bioinformatics consist of finding subsequences of nucleotides or amino acids that have biological significance. These include determining promoter or regulatory sites and protein secondary structure.

It should be remarked that there is also a theory of stochastic context-free grammars; those grammars have been used to determine RNA structure. That topic is discussed in the following section.

    7.4. Determining Structure

From the beginning of this article, we reminded the reader of the importance of structure in biology and its relation to function. In this section, we review some of the approaches that have been used to determine 3D structure from linear sequences. A particular case of structure determination is that of RNA, whose structure can be approximated in two dimensions. Nevertheless, it is known that 3D knot-like structures exist in RNA.

This section has two subsections. In the first, we cover some approaches available to infer 2D representations from RNA sequences. In the second, we describe one of the most challenging problems in biology: the determination of the 3D structure of proteins from sequences of amino acids. Both problems deal with minimizing energy functions.

7.4.1. RNA Structure. It is very convenient to describe the RNA structure problem in terms of parsing strings generated by context-free grammars (CFGs). As in the case of the finite-state automata used in HMMs, we have to deal with highly ambiguous grammars. The generated strings can be parsed in multiple ways, and one has to choose an optimal parse based on energy considerations.

RNA structure is determined by the attractions among its nucleotides: A (adenine) attracts U (uracil), and C (cytosine) attracts G (guanine). These nucleotides will be represented using lowercase letters. The CFG rules

S → aSu | uSa | ε

generate palindrome-like sequences of u's and a's of even length. One could map this palindrome to a 2D representation in which each a in the left of the generated string matches the corresponding u in the right p