o que é bioinformática

8/12/2019 O que bioinformtica

1/13

2 0 0 1 Scha tt aue r Gm bH

3 4 6

Method In fo rm Med 4 / 200 1

related to this new field has been surging,and now comprise almost 2% of theannual tota l of papers in PubMed.

This unexpected union between the two

subjects is attributed to the fact that lifeitself is an information technology; anorganisms physiology is largely deter-mined by its genes, which at its most basiccan be viewed as digital information. At thesame time, there have been major advancesin the technologies that supply the initialdata; Anthony Kervalage of Celera recent-ly cited that an experimental laboratorycan produce over 100 gigabytes of data aday w ith ease [5]. This incredible pro-cessing power has been ma tched by devel-

opments in computer technology; the mostimportant areas of improvements havebeen in the CPU, disk storage and Internet,allowing faster computations, better datastorage and revolutionalised the methodsfor accessing and exchanging data .

1.1 Aims of BioinformaticsIn general, the aims of bioinformatics arethree-fold. First, at its simplest bioinfor-matics organises data in a way that allowsresearchers to access existing informationand to submit new entries as they areproduced, e.g. the Protein Data B ank for3D macromolecular structures [6, 7]. Whiledata -curation is an essential task, the in-formation stored in these databases isessentially useless until ana lysed. Thus thepurpose of bioinformatics extends muchfurther. The second aim is to develop toolsand resources that aid in the analysis ofdata. For example, having sequenced a par-ticular protein, it is of interest to compare it

with previously characterised sequences.

What is Bioinformatics?

A Proposed Definition and Overviewof the FieldN. M. Luscombe, D. Greenbaum, M. GersteinDepartment of Molecular Biophysics and BiochemistryYale University, New Haven, USA

1. Introduction

Biological data are being produced at a

phenomenal rate [1]. For example asof April 2001, the G enBank repository ofnucleic acid sequences contained11,546,000 entries [2] and the SWISS-PROT database of protein sequences con-tained 95,320 [3]. On average, these data ba-ses are d oubling in size every 15 months [2].In addition, since the publication ofthe H . inf luenzaegenome [4], completesequences for nearly 300 organisms havebeen released, ranging from 450 genes toover 100,000.A dd to this the data from the

myriad of related projects that study geneexpression, determine the protein structu-res encoded by the genes, and detail howthese products interact with one another,and w e can begin to imagine the enormousquantity and variety of information that isbeing produced.

As a result of this surge in data , compu-ters have become indispensable to biologi-cal research. Such an approach is idealbecause of the ease with which computerscan handle large quantities of data andprobe the complex dynamics observed innature. B ioinformatics, the subject of thecurrent review,is often defined a s the appli-cation of computational techniques tounderstand and organise the informationassociated with biological ma cromolecules.Fig. 1 shows that the number of papers

SummaryBackground: The recent flood of data from genomesequences and functional genomics has given rise to

new field, bioinformatics, which combines elementsof biology and computer science.Objectives: Here we propose a definition for thisnew f ield and review some of the research that isbeing pursued, particularly in relation to transcriptionalregulatory systems.Methods: Our definition is as follows: Bioinform aticsis conceptualizing biology in terms of macromolecules( in the sense of physical-chemistry) and then applying informatics techniques (derived from disciplinessuch as applied maths, computer science, and statis-tics) to understand and organize the informationassociated with these molecules, on a large-scale.Results and Conclusions: Analyses in bioinformatics

predominantly focus on three ty pes of large datasetsavailable in m olecular biology: m acromolecular struc-tures, genome sequences, and the results of function-al genomics experiments (eg expression data).Additional information includes the text of scientificpapers and relationship data from m etabolic path-ways, t axonomy trees, and protein-protein interactionnetworks. Bioinformatics employs a wide rangeof com putational techniques including sequence andstructural alignment, database design and datamining, macromolecular geometry, phylogenetic treeconstruction, prediction of protein structure andfunction, gene finding, and expression data clustering.

The emphasis is on approaches integrating a variety ofcomputational m ethods and heterogeneous datasources. Finally, bioinformatics is a practical discipline.We survey some representative applications, such asfinding hom ologues, designing drugs, and performinglarge-scale censuses. Additional information pertinentto the review is available over the web atht tp: / / b io info.mbb.yale.edu/ what- is-it .

KeywordsBioinformatics, Genomics, Introduction, TranscriptionRegulation

Method InformMed 2001; 40: 34658

Updated version of an invited review paperthat a ppeared in H aux,R .,Kulikowski, C. (eds.)(2001). IMI A Yearbook o f Medical Informa tics2001: D igital Libra ries and Med icine,pp. 8399.

Stuttgart:Schattauer.


2/13

What is Bioinform atics?

3 4 7

Method In form Med 4 / 200 1

This needs more tha n just a simple text-based search, and programs such as FASTA[8] and PSI-BLAST [9] must consider whatconstitutes a biologically significant match.

Development of such resources dictatesexpertise in computational theory, as wellas a thorough understanding of biology.The third aim is to use these tools to ana-lyse the data and interpret the results in abiologically meaningful manner. Traditio-nally, biological studies examined individu-al systems in detail, and frequently com-pared them with a few that are related. Inbioinformatics, we can now conduct globalanalyses of all the available data with theaim o f uncovering common principles thatapply across many systems and highlightnovel features.

In this review, we provide a systematicdefinition of bioinformatics as shown inB ox 1.We focus on the first and third a imsjust described, with particular reference tothe keywords: information, informatics,organisation, understanding, large-scaleand practical applications. Specifically, wediscuss the range of data that a re currentlybeing examined, the databa ses into whichthey are organised, the types of analysesthat are being conducted using transcrip-

tion regulatory systems as an example, and

finally some of the ma jor practical applica-tions of bioinformatics.

2. the INFORMATIONassociated with theseMolecules

Table 1 lists the types of d ata that areanalysed in bioinformatics and the range oftopics that we consider to fall within thefield. Here we take a broad view and in-clude subjects that may not normally belisted. We also give approximate valuesdescribing the sizes of data being discussed.

We start w ith an o verview of the sourcesof information. Most bioinformatics analy-ses focus on three primary sources of dat a:D NA or protein sequences, macromolecu-lar structures and the results of functionalgenomics experiments. Raw D NA se-quences are strings of the four ba se-letterscomprising genes, each t ypically 1,000 baseslong. The G enBank [2] repository ofnucleic acid sequences currently holds atotal of 12.5 billion bases in 11.5 million

entries (all database figures as of April

2001).At the next level are protein sequenc-es comprising strings of 20 amino acid-letters. At present there are about 400,000known protein sequences [3], with a typicalbacterial protein containing approximately300 amino acids. Macromolecular struc-tural dat a represents a more complex formof informat ion. There are currently 15,000entries in the Protein Data B ank, PD B[6, 7], containing atomic structures of pro-teins, D NA and RNA solved by x-raycrystallography and NMR . A typical PD Bfile for a medium-sized protein contains thexyz-coordinates of approximately 2,000atoms.

Scientific euphoria ha s recently centred

on whole genome sequencing. As with theraw D NA sequences, genomes consist ofstrings of base-letters, ranging from 1.6million bases in H aemophil us i nf luenzae

[10] to 3 billion in huma ns [11, 12]. TheEntrez database [13] currently has com-plete sequences for nearly 300 archaeal,bacterial and eukaryotic organisms. Inaddition to producing the raw nucleotidesequence, a lot of work is involved inprocessing this data . An important aspectof complete genomes is the distinction

between coding regions and non-codingregions -junk repetitive seq uences makingup the bulk of base sequences especially ineukaryotes. Within the coding regions,genes are annotated with their translatedprotein sequence, and often with theircellular function.

Fig.1 Plot showing the growth of scientificpublications in bioinformatics between 1973 and 2000. The histogrambars(left vertical axis) counts the total number of scientificarticles relating to bioinformatics, and the black line (right verticalaxis) gives the percentage of the annual total of articles relating to bioinformatics. The data are taken fromPubMed.

Bioinformatics a Definition1

(M olecular) bio informatics:bioinfor-matics is conceptualising biology in

terms of molecules (in the sense of P hy-sical chemistry) and applying informa-tics techniques (derived from disci-plines such as applied ma ths, computerscience and statistics) to understandandorganise the information associatedwith these molecules, on a large scale.Inshort, bioinformatics is a managementinformation system for molecular biolo-gy and has many practical applications.

1As submitted to the Oxford EnglishDictionary.


3/13

Luscombe, Greenbaum, Gerstein

3 4 8


More recent sources of data have beenfrom functional genomics experiments, ofwhich the most common are gene expres-sion studies.We can now determine expres-sion levels of a lmost every gene in a givencell on a whole-genome level, howeverthere is currently no centra l repository f orthis data and public availability is limited.These experiments measure the amount ofmRNA that is produced by the cell [14-18]under different environmental conditions,different stages of t he cell cycle and d iffer-ent cell types in multi-cellular organisms.Much of the effort ha s so fa r focused on theyea st [19-24] and huma n geno mes [25, 26].One of the largest dataset for yeast hasmade approximately 20 time-point meas-urements for 6,000 genes [19]. How ever,

there is potential for much greater quan-

tities of data when experiments are con-ducted for larger organisms and at moretime-points.

Further genomic-scale data includebiochemical information on metabolicpathways, regulatory networks, protein-protein interaction data from two-hybridexperiments, and systematic knockouts ofindividual genes to test the viability of anorganism.

What is apparent from this list is thediversity in the size a nd complexity of d if-ferent data sets. There are invariably moresequence-based data than others becauseof the relative ease with which they can beproduced.This is partly related to the great-er complexity and information-content ofindividual structures or gene expression

experiments compared to individual se-

quences. While more biological informa-tion can be d erived from a single structurethan a protein sequence, the lack of depthin the latter is compensated by analysinglarger quantities of data.

3. ORGANISE the Infor-mation on a LARGE SCALE3.1 Redundancy and Multiplicityof Data

A concept that underpins most researchmethods in bioinformatics is that much of

the data can be grouped together based onbiologically meaningful similarities. Forexample, sequence segments are oftenrepeated at different positions of genomicD NA [27]. G enes can be clustered intothose with particular functions (eg enzy-matic actions) or according to the meta-bolic pathway to which they belong [28],although here, single genes may actuallypossess several functions [29]. G oingfurther, distinct proteins frequently havecomparable sequences organisms often

have multiple copies of a particular genethrough duplication and different specieshave equivalent or similar proteins thatwere inherited when they diverged fromeach other in evolution. At a structurallevel,w e predict there to be a finite numberof different tertiary structures estimatesrange between 1,000 and 10,000 folds[30, 31] and proteins adopt eq uivalentstructures even when they differ great ly insequence [32]. As a result, although thenumber of structures in the PDB hasincreased exponentially, the rate o f discov-ery of novel folds has actually decreased.

There are common terms to describe therelationship between pairs of proteins orthe genes from which they are derived:analogo us proteins have related folds, butunrelated sequences, while homologousproteins are both seq uentially and structu-rally similar. The two categories can some-times be difficult to distinguish especially ifthe relationship between the two proteinsis remote [33, 34].A mong homologues, it isuseful to distinguish between orthologues,

proteins in different species that have evolv-

Table1 Sources of data used in bioinformatics, the quantity of each type of data that is currently (April 2001) available,and bioinformatics subject areas that utilize this data.


4/13


5/13


3 5 0


different secondary structural elements.Another example is the use of structuraldata to understand a proteins function;here studies have investigated the rela-tionship different protein folds and theirfunctions [52, 53] and a naly sed similaritiesbetween different binding sites in the ab-sence of homology [54]. Co mbined withsimilarity measurements, these studies pro-vide us with an understanding of how much

biological information can be accurately

transferred between homologous proteins[55].

4.1 The Bioinformatics SpectrumFig. 2 summarises the main points weraised in our discussions of organisingand understanding biological data thedevelopment of bioinformatics techniques

has allowed an expansion of biological

analysis in two dimension, depth andbreadt h. The first is represented by thevertical axis in the figure and outlines apossible approach to the rational drugdesign process. The aim is to take a singlegene and follow through an analysis thatmaximises our understanding of theprotein it encodes. Starting with a genesequence, we can determine the proteinsequence with strong certainty. From there,

prediction algorithms can be used to calcu-

Paradigmshifts during the past couple of decades have taken much of biology away fromthelaboratory bench and have allowed the integration of other scientific disciplines, specificallycomputing. The result is an expansion of biological research in breadth and depth. The vertical axisdemonstrates howbioinformatics can aid rational drug design with minimal work in the wet lab.Starting with a single gene sequence, we can determine with strong certainty, the proteinsequence. Fromthere, we can determine the structure using structure prediction techniques. Withgeometry calculations, we can further resolve the proteins surface and through molecular

simulation determine the force fields surrounding the molecule. Finally docking algorithms canprovide predictions of the ligands that will bind on the protein surface, thus paving the way for thedesign of a drug specific to that molecule. The horizontal axis shows howthe influx of biologicaldata and advances in computer technology have broadened the scope of biology. Initially with a pairof proteins, we can make comparisons between the between sequences and structures ofevolutionary related proteins. With more data, algorithms for multiple alignments of severalproteins become necessary. Using multiple sequences, we can also create phylogenetic trees totrace the evolutionary development of the proteins in question. Finally, with the deluge of data wecurrently face, we need to construct large databases to store, view and deconstruct theinformation. Alignments nowbecome more complex, requiring sophisticated scoring schemes andthere is enough data to compile a genome census a genomicequivalent of a population census providing comprehensive statistical accounting of protein features in genomes.

Fig.2 Organizing and understanding biological data


6/13


3 5 1


late the structure adopted by the protein.G eometry calculations can define theshape of the proteins surface and molecu-lar simulations can determine the forcefields surrounding the molecule. Finally,using docking algorithms, one couldidentify or design ligands that may bindthe protein, paving the way for designing adrug that specifically alters the proteinsfunction. In practise, the intermediate stepsare still difficult to achieve accurately, andthey are best combined with experimentalmethods to obtain some of the data, forexample characterising the structure of theprotein of interest.

The aim of the second dimension, the

breadt h in biological ana lysis, is to comparea gene or gene product with others. Ini-tially, simple algorithms can be used tocompare the sequences and structures of apair of rela ted proteins. With a larger num-ber of proteins,improved a lgorithms can beused to produce multiple alignments, andextract sequence patterns or structuraltemplates that define a family of proteins.U sing this data, it is also possible to con-struct phylogenetic trees to trace the evolu-tionary path of proteins. Finally, with even

more data, the information must be storedin large-scale databa ses. Comparisonsbecome more complex, requiring multiplescoring schemes, and we are able to con-duct genomic scale censuses that providecomprehensive statistical accounts ofprotein features, such as the abundance ofparticular structures or functions in diffe-rent genomes. It a lso allows us to buildphylogenetic trees that trace the evolutionof w hole organisms.

5. applying INFORMATICSTECHNIQUES

The distinct subject areas we mentionrequire different ty pes of informatics tech-niques. Briefly, for data organisation, thefirst biological databases were simple flatfiles. H owever with the increasing amountof information, rela tional databasemethods with Web-page interfaces havebecome increasingly popular. In sequence

ana lysis, techniques include string compari-

son methods such as text search and one-dimensional alignment a lgorithms. Motifand pattern identification for multiplesequences depend on machine learning,clustering and d ata -mining techniques. 3Dstructural analysis techniques include Eu-clidean geometry calculations combinedwith basic application of physical chemis-try, graphical representations of surfacesand volumes, and structural comparisonand 3D matching methods. For molecularsimulations, Newtonian mechanics, quan-tum mechanics, molecular mechanics andelectrostatic calculations are applied. Inmany of these areas, the computationalmethods must be combined with good

statistical analyses in order to provide anobjective measure for the significance ofthe results.

6. Transcription Regulation a Case Study in Bioinformatics

DNA-binding proteins have a central rolein all aspects of genetic activity within anorganism, participating in processes such a s

transcription, packaging, rearrangement,replication and repair. In this section, wefocus on the studies that have contributedto our understanding of transcriptionregulation in different organisms. Throughthis example, we demonstrate how bio-informatics has been used to increase ourknowledge of biological systems and alsoillustrate the practical applications of thedifferent subject areas that were brieflyoutlined earlier. We start by consideringstructural analyses of how DNA-bindingproteins recognise particular base se-quences. Later, we review several genomicstudies that have characterised the natureof transcription factors in different orga-nisms, and the metho ds that ha ve been usedto identify regulatory binding sites in theupstream regions. Finally, we provide anoverview of gene expression analyses thathave been recently conducted and suggestfuture uses of tra nscription regulatory ana -lyses to rationalise the observations madein gene expression experiments. All theresults that we describe have been found

through computat ional studies.

6.1 Structural Studies

As of A pril 2001, there w ere 379 structuresof protein-DNA complexes in the PDB.Analyses of these structures have providedvaluable insight into the stereochemicalprinciples of binding, including how par-ticular base sequences are recognizedand how the DNA structure is quite oftenmodified on binding.

A structural taxonomy of D NA-bindingproteins, similar to that presented in SCO Pand CATH, was first proposed by Harrison[56] and periodically updated to accom-modate new structures as they are solved[57]. The cla ssificat ion co nsists of a tw o-tier

system: the first level collects proteins intoeight groups that share gross structuralfeatures for D NA-binding, and the secondcomprises 54 families of proteins that arestructurally homologous to each other.Assembly of such a system simplifies thecomparison of different binding methods;ithighlights the diversity o f prot ein-D NAcomplex geometries found in nature, butalso underlines the importance of inter-actions between -helices and the DNAmajor groove, the main mode of binding in

over half the protein families. While thenumber of structures represented in thePDB does not necessarily reflect the rela-tive importance of t he different proteins inthe cell, it is clear tha t helix-turn-helix,zinc-coordinating and leucine zipper mot ifsare used repeatedly.These provide compactframeworks to present the -helix on thesurfaces of structurally diverse proteins. Ata gross level, it is possible to highlight thedifferences between transcription factordomains that just bind D NA and thoseinvolved in catalysis [58]. Although t hereare exceptions, the former typicallyapproach the DNA from a single face andslot into the grooves to interact with baseedges. The latter commonly envelope thesubstrate, using complex networks ofsecondary structures and loops.

Focusing on proteins with -helices, thestructures show many variations, both inamino acid sequences and detailed geo-metry. They have clearly evolved indepen-dently in accordance with the req uirementsof the context in which they are found.

While achieving a close fit between the


7/13


3 5 2


-helix and ma jor groove, there is enoughflexibility to allow both the protein andDNA to adopt distinct conformations.H owever, several studies that analysed thebinding geometries of -helices demon-strated that most adopt fairly uniform con-formations regardless of protein family.They are commonly inserted in the majorgroove sideways, with their lengthwise axisroughly parallel to the slope outlined bythe DNA backbone. Most start with theN-terminus in the groove and extend out,completing two to three turns withincontacting distance of the nucleic acid [59,60].

G iven the similar binding orientations,it

is surprising to find that the interactionsbetween each amino acid position alongthe -helices and nucleotides on t he D NAvary considerably between different pro-tein families. How ever, by classifying theamino acids according to the sizes of theirside chains, we are able to rationa lise thedifferent intera ctions patterns. The rules ofinteractions are based on the simple pre-mise that for a given residue position on-helices in similar conformations, smallamino acids interact with nucleotides that

are close in distance and large amino a cidswith tho se that a re further [60,61].E qui-va-lent studies for binding by ot her structuralmotifs, like -hairpins, have also been con-ducted [62]. When considering theseinteractions, it is important to rememberthat d ifferent regions of the protein surfacealso provide interfaces with the D NA.

This brings us to look a t the a tomic levelinteractions between individual aminoacid-base pairs. Such analyses are based onthe premise that a significant proportion ofspecific D NA-binding could b e ra tionalisedby a universal code of recognition betweenamino acids and bases, ie whether certainprotein residues preferably interact withparticular nucleotides regardless of thetype of protein-D NA complex [63]. Studieshave considered hydrogen bonds, van derWaals contacts and water-mediated bonds[64-66].R esults showed tha t a bout 2/3 of a llinteractions are with the DNA back-bone and that their main role is one ofsequence-independent stabilisation. Incontrast, interactions with bases display

some strong preferences, including the

interactions of arginine or lysine withguanine, asparagine or glutamine withadenine and threonine with thymine. Suchpreferences were explained through exami-nation of the stereochemistry of t he aminoacid side chains and base edges. Alsohighlighted were more complex types ofinteractions where single amino acidscontact more than one base-step simulta-neously, thus recognising a short D NAsequence. These results suggested thatuniversal specificity, one tha t is observedacross all protein-D NA complexes, indeedexists. Ho wever, many interactions that a renormally considered to be non-specific,such as those with the DNA backbone, can

also provide specificity depending on thecontext in which they are made.

Armed with an understanding ofprotein structure, D NA-binding motifs andside chain stereochemistry,a major a pplica-tion has been the prediction of bindingeither by proteins known to contain a part i-cular motif, or those with structures solvedin the uncomplexed form. Most commonare predictions for -helix-major grooveinteractions given the amino acid se-quence, what D NA sequence would it

recognise [61, 67]. In a d ifferent approach,molecular simulation techniques have beenused to dock whole proteins and DNAs onthe ba sis of force-field calculations aroundthe t wo molecules [68, 69].

The reason that both methods havebeen met with limited success is becauseeven for apparently simple cases like -helix-binding, there are many other factorsthat must be considered. Comparisonsbetween bound and unbound nucleic acidstructures show that DNA-bending is acommon feature of complexes formed withtra nscription fa ctors [58, 70].This and ot herfactors such as electrostatic and cation-mediated interactions assist indirectrecognition of the nucleotide sequence,although they are not well understood yet.Therefore, it is now clear that d etailed rulesfor specific DNA-binding will be familyspecific, but with underlying trends such asthe arginine-guanine interactions.

6.2 Genomic Studies

D ue to the wealth of biochemical data tha tare available, genomic studies in bioin-formatics have concentrated on modelorganisms, and the analysis of regulatorysystems has been no exception. Identification

of transcription factors in genomes invari-ably depends on similarity search strate-gies, which assume a functional a nd evolu-tionary relationship between homologousproteins. In E. coli, studies have so farestimat ed a t ota l of 300 to 500 tra nscriptionregulators [71] and PE DA NT [72], a d ata -base of automatically assigned gene funct-ions, shows that typically 2-3% of pro-

karyotic and 6-7% of eukaryotic genomescomprise DNA -binding proteins.A s assign-ments were only complete for 40-60% ofgenomes as o f August 2000, these figuresmost likely underestimate the actual num-ber. Nonetheless, they already represent alarge qua ntity of proteins and it is clear thatthere are more transcription regulatorsin eukaryotes than other species. This isunsurprising, considering the organismshave developed a relatively sophisticatedtranscription mechanism.

From the conclusions of the structuralstudies, the best strategy for chara cterisingD NA-binding of the putative transcriptionfactors in each genome is to group themby homology and to analyse the individualfamilies. Such classifications are providedin the secondary sequence databasesdescribed earlier and also those thatspecialise in regulatory proteins such asR egulonD B [73] and TRA NSFAC [74].Of even greater use is the provision ofstructural assignments to the proteins;given a transcription factor, it is helpful toknow the structural motif that it uses forbinding, therefore providing us with abetter understanding of how it recognisesthe target sequence. Structural genomicsthrough bioinformatics assigns structuresto the protein products of genomes bydemonstrating similarity to proteins ofknow n structure [75]. These studies haveshown that prokaryotic transcription fac-tors most frequently contain helix-turn-helix motifs [71, 76] and eukaryotic fa ctorscontain homeodomain type helix-turn-

helix, zinc finger or leucine zipper motifs.


8/13


3 5 3


From the protein classifications in eachgenome, it is clear that different types ofregulatory proteins differ in abundance andfamilies significantly d iffer in size. A studyby Huynen and van Nimwegen [77] hasshown that members of a single family havesimilar functions, but as the requirementsof this function vary over time, so doesthe presence of each gene family in thegenome.

Most recently, using a combination ofsequence and structural data, we examinedthe conservation of amino acid sequencesbetween related DNA-binding proteins,and the effect that mutations have onD NA sequence recognition. The structural

families described above were expanded toinclude proteins that a re related by sequence

similarity, but whose structures remainunsolved. Again, members of the samefamily are homologous, and proba bly derive

from a common a ncestor.Amino a cid conservations were calculat-

ed for the multiple sequence alignmentsof each family [78]. G enerally, alignmentpositions that interact with the DNA arebetter conserved than the rest of the pro-tein surface, although the detailed patterns

of conservation a re quite complex. R esiduesthat contact the DNA backbone are highlyconserved in a ll protein families, providinga set of stabilising interactions that arecommon to all homologous proteins. Theconservation of alignment positions thatcontact bases, and recognise the D NAsequence, are more complex and could berationa lised by defining a t hree-class modelfor D NA-binding. First , protein familiesthat bind non-specifically usually containseveral conserved base-contacting residues;without exception, interactions are made inthe minor groove where there is littlediscrimination between base types. Thecontacts are commonly used to stabilisedeforma tions in the nucleic acid structure,particularly in widening the DNA minorgroo ve. The second class comprise familieswhose members all target the samenucleotide sequence; here, base-contactingpositions are absolutely or highly conser-ved allowing related proteins to target thesame sequence.

The third, and most interesting, class

comprises families in which binding is also

specific but different members bind distinctbase sequences. Here protein residuesundergo frequent mutations, and familymembers can be divided into subfamiliesaccording to the amino acid sequencesat base-contacting positions; those in thesame subfamily are predicted to bind thesame D NA sequence and those of differentsubfamilies to bind distinct sequences. Onthe whole, the subfamilies correspondedwell with the proteins functions and mem-bers of the same subfamilies were found toregulate similar transcription pathways.The combined analysis of sequence andstructural da ta described by this study pro-vided an insight into how homologous

DNA-binding scaffolds achieve differentspecificities by altering their amino acidsequences. In doing so, proteins evolveddistinct functions, therefore allowingstructurally related transcription fa ctors toregulate expression of different genes.Therefore, the relative abunda nce of tran-scription regulatory families in a genomedepends, not only on the importance of aparticular protein function, but also in theadaptability of the DNA-binding motifs torecognise distinct nucleotide sequences.

This, in turn, appears to be best accommo-dated by simple binding motifs, such as thezinc fingers.

G iven the knowledge of the transcription

regulators that are contained in eachorganism, and an understanding of howthey recognise D NA sequences, it is ofinterest to search for their potential bind-ing sites within genome sequences [79].For prokaryotes, most analyses have in-volved compiling data on experimentallyknown binding sites for particular proteinsand building a consensus sequence tha t in-corporates any variations in nucleotides.Additional sites are found by conductingword-matching searches over the entiregenome and scoring candidate sites bysimilarity [80-83]. U nsurprisingly, most ofthe predicted sites are found in non-codingregions of the DNA [80] and the results ofthe studies are oft en presented in dat aba sessuch as R egulonD B [73]. The consensussearch approach is often complemented bycomparative genomic studies searchingupstream regions of orthologous genes in

closely related o rgan isms. Through such an

approach, it was found that at least 27% ofknown E. coliDNA-regulatory motifs areconserved in one or more distantly relatedbacteria [84].

The detection of regulatory sites ineukaryotes poses a more d ifficult problembecause consensus sequences tend to bemuch shorter, variable, and dispersed oververy large distances. How ever, initial stud-ies in S. cerevi siae provided an interestingobservation for the G ATA protein in nitro-gen metabolism regulation. While the 5base-pair G ATA co nsensus sequence isfound almost everywhere in the genome, asingle isolated binding site is insufficient toexert the regulatory f unction [85]. There-

fore specificity of G ATA activity comesfrom the repetition of the consensus se-quence within the upstream regions of con-trolled genes in multiple copies. An initialstudy has used this observation to predictnew regulatory sites by searching for over-represented oligonucleotides in non-codingregions of yeast and worm genomes [86,87].

Having detected the regulatory bindingsites, there is the problem of defining thegenes that a re actually regulated, commonly

termed regulons. G enerally, binding sitesare a ssumed to be located directly upstreamof the regulons; however there are differentproblems associated with this assumptiondepending on the organism. For prokary-otes, it is complicated by t he presence ofoperons; it is difficult to locate the regulat-ed gene within an operon since it can lieseveral genes downstream of the regulatorysequence. It is often difficult to predict theorganisation of operons [88], especially todefine the gene that is found at the head,and there is often a lack of long-range con-servation in gene order between relatedorga nisms [89]. The problem in eukar yot esis even more severe; regulatory sites oftenact in both directions, binding sites areusually distant from regulons because oflarge intergenic regions, and transcriptionregulation is usually a result of combinedaction by multiple transcription factors in acombinatorial manner.

D espite these problems, these studieshave succeeded in confirming the transcrip-

tion regulatory pathwa ys of well-character-

ized systems such as the hea t shock response


9/13


10/13


3 5 5


could be made in low-level organisms likeyeast and the results applied to homo-logues in higher-level organisms such ashumans, where experiments are moredemanding.

An eq uivalent a pproach is also employed

in genomics. H omologue-finding is exten-sively used to confirm coding regions innewly sequenced genomes and functionaldata is frequently transferred to annotate

individual genes. On a larger scale, it also

simplifies the problem of understandingcomplex genomes by analysing simpleorganisms first and then applying thesame principles to mor e complicated o nes this is one reason why early structuralgenomics projects focused on M ycoplasmagenitalium[75].

Ironically, the same idea can be appliedin reverse. Potential drug ta rgets are quickly

discovered by checking whether homo-

logues of essential microbial proteins are

missing in humans. On a smaller scale,structural d ifferences between similar pro-teins may be harnessed to design drugmolecules that specifically bind to onestructure but not a nother.

7.2 Rational Drug DesignOne of the earliest medical applications of

bioinformatics has been in aiding rational

Fig.3 Above is a schematicoutlining howscientists can use bioinformatics to aid rational drug discovery. MLH1 is a human gene encoding a mismatch repair protein (m m r ) situated on theshort armof chromosome 3. Through linkage analysis and its similarity tom m r genes in mice, the gene has been implicated in nonpolyposis colorectal cancer. Given the nucleotide sequence,the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can be used to find homologues in model organisms, andbased on sequence similarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms could design molecules thatcould bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.


11/13


3 5 6


drug design. Fig. 3 outlines the commonlycited approach, taking the MLH1 gene pro-duct as an example drug target. MLH 1 is ahuman gene encoding a mismatch repairprotein (mmr) situated on t he short arm ofchromo some 3 [110]. Through linka ge a na-lysis and its similarity to mmr genes in mice,the gene has been implicated in nonpoly-posis colorectal cancer [111]. G iven thenucleotide sequence, the probable aminoacid sequence of the encoded protein canbe determined using translation software.Sequence search techniques can then beused to find homologues in model orga-nisms, and based on sequence similarity, itis possible to model the structure of the

human protein o n experimentally chara cter-ized structures. Finally, docking algorithmscould design molecules that could bind themodel structure, leading the way for bio-chemical assays to test their biologicalactivity on the actual protein.

7.3 Large-scale CensusesAlthough da taba ses can efficiently store allthe information related to genomes, struc-

tures and expression data sets, it is useful tocondense all this information into under-standable trends and facts that users canreadily understand. B road generalisationshelp identify interesting subject areas forfurther detailed analysis, and place new ob-servations in a proper context. This enablesone to see whether they a re unusual in anyway.

Through these lar ge-scale censuses, onecan address a number of evolutionary, bio-chemical a nd biophysical q uestions. Forexample,a re specific protein folds associat-ed with certain phylogenetic groups? Howcommon are different folds within partic-ular organisms? And to what degree arefolds shared between related organisms?Does this extent of sharing parallel meas-ures of relatedness derived from trad itionalevolutionary trees? Initial studies showthat the frequency of folds differs greatlybetween organisms and that the sharing offolds between o rganisms does in fact fo llowtraditional phylogenetic classifications[37, 112, 113]. We can also integra te da ta o n

protein functions; given that the particular

protein folds are often related to specificbiochemical functions [52, 53], these find-ings highlight the diversity of metabolicpath wa ys in diff erent organisms [36, 89].

As we discussed earlier, one of the mostexciting new sources of genomic informa tion

is the expression dat a. Combining expression

information with structural and functionalclassifications of proteins we can askwhether the high occurrence of a proteinfold in a genome is indicative of high ex-pression levels [97]. Further genomic scaledata that we can consider in large-scale sur-veys include the subcellular localisations ofproteins and their interactions with eachother [114-116]. In conjunction w ith struc-

tural data, we can then begin to compile amap of all protein-protein interactions inan orga nism.

8. ConclusionsWith the current deluge of data, compu-tational methods have become indispens-able to biological investigations. Originallydeveloped for the a nalysis of biological se-

quences, bioinformatics now encompassesa wide range of subject areas includingstructural biology, genomics and gene ex-pression studies. In this review, we providedan introduction and overview of the cur-rent state of field. In particular, we discus-sed the types of biological information a nddatabases that are commonly used, exa-mined some of the studies tha t are being con-

ducted with reference to transcriptionregulatory systems and finally looked atseveral practical applications of the field.

Two principa l approa ches underpin all stud-ies in bioinformatics. First is that of com-paring and grouping the data according tobiologically meaningful similarities and sec-ond, that of analysing one type of data toinfer and understand the observations foranother type of da ta. These approaches arereflected in the main aims of the field,which are to understand and organise theinformation associated with biologicalmolecules on a large scale. As a result,bioinformatics has not only provided great -er depth to biological investigations, but

added the dimension of breadth as well. In

this way, we are able to examine individualsystems in detail and also compare themwith those that are related in order to un-cover common principles that apply acrossmany systems and highlight unusual fea-tures that are unique to some.

AcknowledgmentsWe thank Pa trick McG arvey for comments onthe manuscript.

References1. R eichhardt T. Its sink or swim as a tidal wave

of da ta appro aches. Nature 1999. 399 (6736):517-20.

2. Benson DA, et a l . G enBank. Nucleic Acids

R es 2000; 28 (1): 15-8.3. Bairoch A, Apweiler R. The SWISS-PROT

protein sequence database and its supplementTrE MB L in 2000. Nucleic Acids R es 2000; 28(1): 45-8.

4. Fleischmann RD , et al. Whole-genome ran-dom sequencing and assembly of Haemo-philus influenzae Rd. Science 1995; 269(5223): 496-512.

5. D rowning in data. The Economist 1999 (26June 1999).

6. B ernstein FC, et al.The Protein Data Ba nk.Acomputer-based archival file for macromolec-ular structures. Eur J B iochem 1977; 80 (2):319-24.

7. Berman HM, et a l. The Protein Data B ank.

Nucleic Acid s R es 2000; 28 (1): 235-42.8. Pearson WR, Lipman DJ. Improved tools for

biological sequence comparison. Proc NatlAca d S ci U SA 1988; 85 (8): 2444-8.

9. Altschul SF, et al. G apped BLA ST and PSI-B LAST: a new generation of protein data basesearch programs. Nucleic Acids Res 1997; 25(17): 3389-402.

10. Fleischmann R D, et al. Whole-genome ran-dom sequencing and assembly of Haemo-philus influenzae Rd. Science 1995; 269(5223): 496-512.

11. Lander ES, et al. Initial sequencing and analy-sis of the human genome. Nature 2001; 409:860-921.

12. Venter JC, et al. The sequence of the humangeno me. Science 2001; 291 (5507): 1304-51.13. Tatusova TA, Ka rsch-Mizrachi I, Ostell JA.

Complete genomes in WWW Entrez: datarepresentation and analysis. B ioinformatics1999; 15 (7-8): 536-43.

14. Eisen MB, Brown PO. DNA arrays for analy-sis of gene expression. Methods Enzymol,1999; 303:179-205.

15. Cheung VG, et al. Making and reading micro-arra ys. Nat G enet 1999; 21 (1 Suppl): 15-9.

16. D uggan D J, et al. Expression profiling usingcDNA microarrays. Nat Genet 1999. 21(1 Suppl): 10-4.

17. Lipshutz RJ, et al. High density syntheticoligonucleotide arra ys.Na t G enet 1999;21 (1):

20-4.


12/13


3 5 7


18. Velculescu VE, et al. Serial Analysis of G eneExpression. D etailed P rotocol 1999.

19. Holstege FC, et al. Dissecting the regulatorycircuitry of a eukaryot ic genome. Cell 1998; 95

(5): 717-28.20. Roth FP, Es tep PW, Church G M. FindingD NA regulatory motifs within unaligned non-coding sequences clustered by whole-genomemRNA quantitation. Nat Biotech 1998; 16(10): 939-45.

21. Jelinsky SA, Samson LD. G lobal response ofSaccharomyces cerevisiae to an alkylatingagent. Proc Natl Acad Sci USA 1999; 96 (4):1486-91.

22. Cho RJ, et al. A genome-wide transcriptionalanalysis of the mitotic cell cycle. Mol Cell1998; 2 (1): 65-73.

23. D eRisi JL, Iyer VR, Brown PO. Exploring themetabolic and genetic control of gene expres-

sion on a genomic scale. Science 1997; 278(5338): 680-6.

24. Winzeler E A, et a l. Functional characteriza-tion of the S. cerevisiae genome by genedeletion and par allel analysis. Science 1999;285 (5429): 901-6.

25. Perou CM, et al.Mo lecular portraits of humanbreast tumours. Nature 2000; 406 (6797):747-52.

26. G olub TR, et al. Molecular classification ofcancer: class discovery and class prediction bygene expression monito ring. Science 1999; 286(5439): 531-7.

27. Pedersendagger AG, et al. A D NA structuralatlas for Escherichia coli. J Mol B iol 2000; 299

(4): 907-30.28. Kanehisa M; G oto S. KEG G : kyoto encyclo-

pedia of genes and genomes. Nucleic AcidsR es 2000; 28 (1): 27-30.

29. Jeffery C J. Moonlighting proteins. TIB S 1999;24 (1): 8-11.

30. Chothia , C. Proteins. One thousand famil iesfor t he molecular biologist. Nature 1992; 357(6379): 543-4.

31. Orengo CA, Jones D T, Thornton JM. Proteinsuperfamilies and domain superfolds. Nature1994; 372 (6507): 631-4.

32. Lesk AM, Chothia C. How different aminoacid sequences determine similar proteinstructures: the structure and evolutionary

dynamics of the globins. J Mol B iol 1980; 136(3): 225-70.

33. Russell RB, et al. Recognition of analogousand homologous protein folds: analysis ofsequence and structure conservation. J MolB iol 1997; 269 (3): 423-39.

34. Russell RB, et al. Recognition of analogousand homologous protein folds assessment ofprediction success and associated alignmentaccuracy using empirical substitution matri-ces. Pro tein E ng 1998; 11 (1): 1-9.

35. Fitch WM. D istinguishing homologous fromana logous prot eins. Syst Z ool 1970; 19: 99-110.

36. Tatusov RL , Koonin EV, Lipman D J. A geno-mic perspective on protein families. Science

1997; 278 (5338): 631-7.

37. G erstein M, Hegyi H. Comparing genomes interms of protein structure: surveys of a f initeparts list. FEM S Microbiol Rev 1998; 22 (4):277-304.

38. Skolnick J, Fetrow JS. From genes to proteinstructure and f unction: novel applications ofcomputational approaches in the genomic era.Trend s B iot ech 2000; 18: 34-9.

39. Qian J, et al. PartsList: a web-based system fordynamically ranking protein folds based ondisparate attributes, including whole-genomeexpression and interaction information.Nucleic Acid s R es 2001; 29 (8): 1750-64.

40. G erstein M. Integrative data base analysis instructural genomics. Nat Struct Biol 2000; 7Suppl: 960-3.

41. Etzold T, Ulyanov A, Argos P. SRS: informa-tion retrieval system for molecular biology da tabanks. Metho ds E nzymol 1996; 266: 114-28.

42. Schuler G D, et al. Entrez: molecular biology

database and retrieval system. MethodsE nzymol 1996; 266: 141-62.

43. Wade K. Searching Entrez PubMed anduncover on the internet. Aviat Space EnvironMed 2000; 71 (5): 559.

44. B e rt on e P, e t al. S P IN E : An in te gr at edtracking database and datamining approachfor high-throughput structural proteomics,enabling the determination of the propertiesof readily characterized proteins. NucleicAcids Res.I n Press.

45. Zhang MQ. Promo ter analysis of co-regulatedgenes in the yeast genome. Comput Chem1999; 23 (3-4): 233-50.

46. Boguski MS. B iosequence exegesis. Science1999; 286 (5439): 453-5.

47. M iller C, G u rd J, B r ass A . A R A PI Dalgorithm for sequence da tabase comparisons:application to the identification of vectorcontamination in the EMBL databases. Bio-info rma tics 1999; 15 (2): 111-21.

48. G onnet G H, Korostensky C , Brenner S.Evaluation measures of multiple sequencealignments. J Comput Biol 2000; 7 (1-2):261-76.

49. Orengo CA, Taylor WR. SSAP: sequentia lstructure alignment program for protein struc-ture comparison. Metho ds Enzymo l 1996; 266:617-35.

50. Orengo CA. CO R A topological fingerprintsfor protein structural families. Protein Sci

1999; 8 (4): 699-715.51. Russell RB, Sternberg MJ. Structure predic-tion. Ho w good a re we? Curr B iol 1995; 5 (5):488-90.

52. Martin AC, et al. Protein folds and functions.Str ucture 1998; 6 (7): 875-84.

53. Hegyi H, G erstein M. The relationship be-tween protein structure and function: a com-prehensive survey with application to theyeast genome. J Mol B iol 1999; 288 (1):147-64.

54. Russel l RB, Sasieni PD, Sternberg MJE.Supersites within superfolds. B inding sitesimilarity in the absence of homology. J Mo lB iol 1998; 282 (4): 903-18.

55. Wilson CA, Kreychman J, G erstein M. As-sessing annotation transfer for genomics:

quantifying the relations between protein

sequence, structure and function throughtraditional and proba bilistic scores. J Mol B iol2000; 297 (1): 233-49.

56. Harrison SC.A structural taxonomy of D NA-

binding domains. Nature 1991; 353 (6346):715-9.57. Luscombe NM,et a l.An overview of the struc-

tures of protein-D NA complexes. G enomeB iology 2000; 1 (1): 1-37.

58. Jones S, et al. Protein-DNA interactions: Astructural analysis. J Mol B iol 1999; 287 (5):877-96.

59. Suzuki M, G erstein M. B inding geometry ofalpha-helices that recognize DNA. Proteins1995; 23 (4): 525-35.

60. Luscombe NM, Thornton JM. Protein-D NAinteractions: a 3D ana lysis of alpha-helix-binding in the major groove. Manuscript inpreparation.

61. Suzuki M, et al. D NA recognition code of

transcription factors. Protein E ng 1995; 8 (4):319-28.

62. Suzuki M. D NA recognition by a -sheet.Pro tein E ng 1995; 8 (1): 1-4.

63. Seeman NC, Rosenberg JM, Rich A.Sequencespecific recognition of double helical nucleicacids by proteins. Proc Natl Acad Sci U SA1976; 73: 804-8.

64. Suzuki M.A framework for the D NA-proteinrecognition code of the probe helix intranscription factors: the chemical and stereo-chemica l rules. Str ucture 1994; 2 (4): 317-26.

65. Mandel-G utfreund Y, Schueler O, Margalit H.Comprehensive analysis of hydrogen bonds inregulatory protein-D NA complexes: in search

of common principles. J M ol B iol 1995; 253(2): 370-82.

66. Luscombe NM, Laskowski RA , Thornton JM.Protein-DNA interactions: a 3D ana lysis ofamino a cid-base interactions. Nucleic AcidsR es. In Press.

67. Mandel-G utfreund Y, Margalit H. Quantita-tive parameters for amino acid-base inter-action: inplications for prediction of protein-D NA binding sites. Nucleic Acids R es 1998;26: 2306-12.

68. Sternberg MJ,G abb HA, Jackson RM. Predic-tive docking of protein-protein and protein-D NA complexes. Curr O pin Struct Biol 1998;8 (2): 250-6.

69. Aloy P, et al. Modelling repressor proteinsdocking to DNA. Proteins 1998; 33 (4):535-49.

70. D ickerson R E. D NA-binding: the prevalenceof kinkiness and the virtues of normality.Nucleic Acids R es 1998; 26 (8): 1906-26.

71. Perez-R ueda E, Collado-Vides J. The reper-toire of D NA-binding tra nscriptional regula-tors in Escherichia coli K-12. Nucleic AcidsR es 2000; 28 (8): 1838-47.

72. Mewes HW, et al. MIPS: a database for geno-mes and protein sequences. Nucleic Acids R es2000; 28 (1): 37-40.

73. Salgado H, et al. RegulonD B (version 3.0):transcriptional regulation and operon orga-nization in Escherichia coli K-12. Nucleic

Acid s Res 2000; 28 (1): 65-7.


13/13

o que é bioinformática

Documents