Functional Annotation of Gene Products

Patrik Georgii-Hemming

TRITA-NA-E04067

NADA

Numerisk analys och datalogi, KTH, 100 44 Stockholm

Department of Numerical Analysis and Computer Science, Royal Institute of Technology, SE-100 44 Stockholm, Sweden

Master’s Thesis in Computer Science (20 credits)

Single Subject Courses, Stockholm University 2004

Supervisor at Nada was Stefan Arnborg

Examiner was Stefan Arnborg

Abstract

In recent years new technologies have been developed that allow biologists to measure the expression of thousands of genes at the same time. The amount of data generated in a single experiment presents a significant challenge for the analyst. This project concerns one aspect of the analysis, functional annotation, where one associates relevant knowledge from different biological databases with gene products of interest. The first step in functional annotation is to decide what information should be included. The next step is to mine different biological databases for this information and to store the information locally to enable queries of the acquired data. In this project we have developed an application to automate retrieval of information from five selected biological databases. The application stores the data in an embedded database that can be queried using SQL.

Funktionell annotering av genprodukter (Functional Annotation of Gene Products)

Sammanfattning (Summary)

In recent years the development of new techniques has made it possible for biologists to study the expression of thousands of genes at the same time. The large amount of data from experiments where these techniques are used is a major challenge for those who analyze the data. In this project we have studied one aspect of the process, functional annotation. Functional annotation aims to associate relevant information from different biological databases with gene products of interest. The first step in functional annotation is to decide what kind of information is of interest. The next step is to use data mining to locate this information in different biological databases and then to store it in a local database for further use. In this project we have developed a program that automatically retrieves information from five selected databases. The program stores the data in an embedded database that allows SQL queries.

Contents

1 Introduction
    1.1 Background
        1.1.1 What we are trying to achieve
        1.1.2 Different kinds of information about genes
        1.1.3 Challenges and problems
    1.2 Biological databases
        1.2.1 Data modeling and data management
        1.2.2 Data retrieval
        1.2.3 Data acquisition
        1.2.4 Databases used in this project
    1.3 Current annotation strategies
        1.3.1 Manual annotation
        1.3.2 Analysis pipelines
        1.3.3 More ambitious approaches based on link integration
        1.3.4 Data warehousing

2 The present work
    2.1 Goals and strategy
        2.1.1 Clarifying the task and the problems
        2.1.2 How should applications for functional annotation be designed?
        2.1.3 Providing a small proof-of-concept annotation program
    2.2 Results
        2.2.1 Choosing data sources
        2.2.2 The terminology problem
        2.2.3 The validation problem
        2.2.4 The GeneAnnotator program (GEA)
    2.3 What remains to be done
        2.3.1 Designing a user-friendly interface
        2.3.2 Robustness
        2.3.3 Designing for extensibility and flexibility

3 Future directions
    3.1 Ongoing research
        3.1.1 Web services
        3.1.2 Ontologies
        3.1.3 Globally unique qualifiers
    3.2 Recommendations

References

Appendix

Chapter 1

Introduction

1.1 Background

This project was done at the Center for Genomics and Bioinformatics (CGB), Karolinska Institute. It forms a small part of a collaborative effort between biologists and bioinformaticians to create an application for automatic analysis of results from high-throughput experiments.

1.1.1 What we are trying to achieve

One of the most central properties shared by all life forms is the ability to store and propagate information. The series of discoveries of how diverse life forms can be coded into a string of chemical entities called nucleotides gave birth to the scientific discipline of molecular biology. These nucleotides come in four variations (coded A, C, G, T) and are combined linearly to form a string of DNA¹. Defined regions of DNA in the genome, called genes, contain information on how proteins should be built from amino acids. The information flow from gene to protein is divided into two processes: transcription, where the information in the gene is copied into an RNA² sequence, and translation, where the RNA strand is translated into a protein, which does the actual work in the cell.

To understand the mechanisms behind different processes in the cells it is necessary to understand which proteins are present and how they interact during these processes. For technical reasons it is much easier to study the RNA sequences than it is to study the proteins directly. The amounts of different RNA sequences present during different cellular processes can serve as markers for their corresponding proteins. In recent years several high-throughput methods have been developed to measure the levels of several thousands or even tens of thousands of these RNAs in a single experiment. One big challenge is how to interpret the results of these experiments.

The task we are concerned with in this project is how to begin to understand what the presence of these RNAs “means”. Regardless of the hypothesis that led the biologist to perform a particular experiment, there are some steps that must always be taken when analyzing data from the high-throughput methods we are interested in here. One of the first steps is to find as much information as possible about the function of the individual gene products³. Functional annotation of the gene products is the process of associating individual gene products with relevant information derived from different sources. Ideally this information should be stored in a way that makes all of it easily accessible to the biologists.

¹ deoxyribonucleic acid

² ribonucleic acid

1.1.2 Different kinds of information about genes

It is of course necessary to define what kind of information one is looking for. I will outline the type of information we have focused on in this work.

Databases containing literature references: Scientific articles describing the function of gene products are important sources of information. In practice it is enough to search PubMed, a service of the National Library of Medicine in the United States. PubMed includes over 14 million citations for biomedical articles back to the 1950s [21].

Sequence databases: Sequence databases hold the most basic information about nucleotide sequences⁴ and protein sequences⁵. The data includes the sequences, references to the scientists that submitted the sequences and the name of the entity the sequence is derived from, if available. The sequence entities can be a section of the genome, a known gene, RNA or a protein. The database may also hold more information but this varies from database to database. GenBank [17], EMBL [6] and DDBJ [5] are the most comprehensive sequence databases.

Ontology databases: An ontology in this context is simply a list of terms (including a definition of their meaning) and the relationships between the terms. An ontology provides a conceptualization of a domain of knowledge, facilitates communication between domain experts and makes it easier to write software that is dependent on domain knowledge [14]. In this project we have used the Gene Ontology (GO), which is actually three ontologies named biological process, molecular function and cellular component [3]. Each ontology is a directed acyclic graph where the terms are the nodes and the relationships are the arcs. There are two types of relationships, “is-a” and “part-of”. Most gene products have been annotated with one or several Gene Ontology terms. These terms give a good indication of the function of the gene product.

³ Since we are interested in making inferences about proteins and not RNAs, I will sometimes use this more vague term to denote the abstract entity composed of RNA and corresponding protein.

⁴ A nucleotide sequence is the “word” created by the ordering of nucleotides in a gene or RNA.

⁵ A protein sequence is the ordering of the amino acids that build up the protein.
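The graph structure described for the Gene Ontology can be illustrated with a minimal Python sketch. The terms and relationships below are a small hand-picked fragment chosen for illustration; they are assumptions and not data retrieved from the GO database, and the function name `ancestors` is likewise hypothetical.

```python
# A toy ontology stored as a directed acyclic graph (DAG).
# Each term maps to a list of (relationship, parent term) arcs.
# The terms below are illustrative assumptions, not an excerpt from GO.
ONTOLOGY = {
    "cellular component": [],
    "cell": [("is-a", "cellular component")],
    "membrane": [("part-of", "cell")],
    "nucleus": [("part-of", "cell")],
    "plasma membrane": [("is-a", "membrane")],
    # A term may have several parents, which is why a tree is not enough.
    "nuclear membrane": [("is-a", "membrane"), ("part-of", "nucleus")],
}


def ancestors(term, ontology=ONTOLOGY):
    """Return every term reachable from `term` by following is-a/part-of arcs."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in ontology[current]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found
```

A gene product annotated with a specific term is implicitly annotated with all ancestors of that term, so a query such as `ancestors("nuclear membrane")` gives the more general terms that also apply.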

Databases with information about metabolic pathways: A metabolic pathway is a small part of the biochemistry of a cell, e.g. fat metabolism or energy production in the Krebs cycle. Many gene products are enzymes involved in the regulation of these processes. Several databases hold information about metabolic pathways but most of them are very specialized and only hold information about a particular organism or cell type. In this project we decided to use KEGG (Kyoto Encyclopedia of Genes and Genomes), which is the most comprehensive metabolic pathways database [15].

Signaling pathways databases: The activity of a cell is regulated by external signals. Insulin, for example, binds to a receptor molecule at the cell surface and starts a cascade of events where one protein (gene product) activates the next in the cell. One result of this signaling is that the cell starts taking up sugar from the bloodstream. If we can place a gene product in one or several signaling pathways we will have learned a lot about its function. Unfortunately this information is available only in pictorial form. This means that these databases can only answer queries about which signaling pathways a gene product participates in. It is not possible, for example, to ask about the location of a gene product in a signaling pathway; this information must be deduced by the user looking at the pictures of the signaling pathways. Ideally, the data about each signaling pathway should be held in a directed graph since this would allow more advanced queries. We have used the Biocarta database in this project [16]. Biocarta holds information about signaling pathways in humans and mice.

Domain databases: Proteins are composed of several domains, distinct parts each having its own function. One kind of domain is common to proteins that sit in the cell membrane, another kind of domain is common to proteins that can bind to DNA, and so forth. A domain is a kind of recurrent “theme” in proteins. If we know that a gene product has a certain domain we can guess what function the gene product has, provided that the function of other gene products with this domain is known. The problem with this kind of information is that it is often based on computational predictions. Pfam (Protein families), for example, is a database that uses hidden Markov models to predict the presence of domains based on the protein sequence. Other databases hold information from crystallographic experiments which provide direct evidence for the presence of particular domains. The quality of the data is obviously not the same. However, we have chosen to ignore this difficulty in this project and we use the CDD (Conserved Domain Database) [18], which includes information from several other databases, e.g. Pfam.
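To illustrate why a directed graph would allow richer queries than pictures of signaling pathways, the following sketch stores a toy pathway as an adjacency list. The protein names and the order of activation are simplified assumptions made for illustration and are not taken from Biocarta; `downstream` and `upstream` are hypothetical helper names.

```python
# Toy signaling pathway: each key activates the gene products in its list.
# The names and their order are simplified assumptions for illustration only.
PATHWAY = {
    "insulin": ["insulin receptor"],
    "insulin receptor": ["IRS1"],
    "IRS1": ["PI3K"],
    "PI3K": ["AKT"],
    "AKT": [],
}


def downstream(product, pathway=PATHWAY):
    """Gene products activated, directly or indirectly, by `product`."""
    seen = []
    queue = list(pathway[product])
    while queue:
        node = queue.pop(0)
        if node not in seen:
            seen.append(node)
            queue.extend(pathway[node])
    return seen


def upstream(product, pathway=PATHWAY):
    """Gene products that directly activate `product`."""
    return [source for source, targets in pathway.items() if product in targets]
```

With this representation the location of a gene product in the pathway becomes a query (`upstream("IRS1")`, `downstream("IRS1")`) instead of something the user has to read off a picture.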

1.1.3 Challenges and problems

The problem with functional annotation is to a large degree a data integration problem since the relevant information is fragmented across many different data sources. EBI (European Bioinformatics Institute) and Infobiogen (a research institute in France) maintain a catalog of biological databases, dbcat, which currently holds links to 511 different databases [12]. One seemingly trivial problem is how to identify the entities you want to annotate. There is no standardized way to assign and maintain names of biological objects across databases. For example, searching the OMIM (Online Mendelian Inheritance in Man) database for “SLAP” returns two completely unrelated proteins, “Sarcolemmal associated protein” and “Src-like adaptor protein”. A more subtle problem is the clash of concepts as you move from one database to another. An example is the definition of a gene: it differs between researchers and databases, which makes it very hard or even impossible to merge data from some sources. There are also technical challenges. The various databases use different DBMSs and none provides a standard way of accessing the data. Some databases provide large text dumps of their contents, others offer access to the underlying DBMS and still others provide only web pages as their primary mode of access.
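The naming problem can be made concrete with a small sketch of a name lookup that refuses to guess when a symbol is ambiguous. The index below is hypothetical: the identifiers are made-up placeholders, not real database accessions, and `resolve` is an illustrative helper, not part of any database interface.

```python
# Hypothetical name index: one symbol can point at unrelated gene products.
# The identifiers are made-up placeholders, not real database accessions.
NAME_INDEX = {
    "SLAP": [
        {"id": "example-1", "name": "Sarcolemmal associated protein"},
        {"id": "example-2", "name": "Src-like adaptor protein"},
    ],
    "INS": [{"id": "example-3", "name": "Insulin"}],
}


def resolve(symbol, index=NAME_INDEX):
    """Return the single matching entry, or raise if the symbol is unknown or ambiguous."""
    matches = index.get(symbol, [])
    if len(matches) != 1:
        names = [match["name"] for match in matches]
        raise LookupError(f"{symbol!r} matches {len(matches)} entries: {names}")
    return matches[0]
```

Raising an error for “SLAP” forces the caller to supply a more specific identifier instead of silently annotating the wrong protein.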

1.2 Biological databases

As already mentioned, there are several hundred biological databases. Well-known examples are DDBJ [5], EMBL [6], GenBank [17], PIR [9], and SWISS-PROT [23]. It is difficult to keep track of all these databases and dbcat was developed for this purpose. Most biological databases are also large; GenBank, for example, contains more than 23 × 10⁶ gene sequence records. To make matters even more complicated, these databases are growing very rapidly. Both the actual size and the growth rate of these databases have become a serious problem, and without automated methods, such as data mining algorithms, the data collected can no longer be fully exploited.

1.2.1 Data modeling and data management

Molecular databases can be classified as follows [8]:

• Databases using a standard DBMS, i.e. relational, object or object-relational.

• Databases using the database management system ACEDB [27]. ACEDB is a DBMS which was originally developed for the biology database called A C. elegans Data Base.

• Databases using the OPM (Object Protocol Model) [11] together with a relational or object database management system. OPM is a data model combining standard object-oriented modeling constructs with specific constructs for modeling of scientific experiments.

• Databases implemented as flat files.

Most biology databases were first implemented as a collection of flat files. Later, many of them were reimplemented using relational or object database management systems (DBMS). Unfortunately the relational model is not ideal for biological data, which often has a semi-structured form, and this has led to very complex schemas that are not intuitive. The object model fits better but is less well known.

ACEDB is a database management system originally developed to hold data on a small worm (C. elegans). ACEDB was later extended to be able to manage other such specialized databases. ACEDB resembles an object database management system. With ACEDB, data are modeled as objects organized in classes. However, ACEDB supports neither class hierarchies nor inheritance. An ACEDB object has a set of attributes that are objects or atomic values such as numbers or strings. ACEDB objects are represented as trees where the named nodes are objects or atomic values and the arcs express the attribute relationships. One advantage of ACEDB is that it accommodates irregular data items. The schema can also be extended easily by adding attributes to objects because the objects of a class need not all have every attribute. With ACEDB it is therefore possible to extend a database schema without having to restructure the database, since existing objects need not be changed. ACEDB has its own query language, AQL.
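The tree representation and the tolerance for irregular data can be sketched with nested dictionaries in Python. This is only an analogy under stated assumptions: it is not ACEDB's storage format or AQL, and the object names, attribute names and the helper `get_path` are invented for the example.

```python
# ACEDB-style objects sketched as trees: nested dictionaries whose leaves are
# atomic values. The object and attribute names are invented examples.
genes = {
    "gene-a": {"Name": "example gene A", "Map": {"Chromosome": "II", "Position": 1.5}},
    "gene-b": {"Name": "example gene B"},  # no Map attribute: irregular data is allowed
}


def get_path(obj, *path, default=None):
    """Follow attribute names down the tree; return `default` if a branch is missing."""
    node = obj
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Extending the "schema": a new attribute is added to one object only,
# and the existing objects are left untouched.
genes["gene-b"]["Phenotype"] = "example phenotype"
```

Because a missing branch simply yields the default value, objects of the same class can carry different sets of attributes without any restructuring.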

The Object Protocol Model (OPM) has been developed for modeling both biological data and the event sequences in scientific experiments. OPM is similar to an object model but provides specific constructs for the modeling of scientific experiments. The SQL-like query language of OPM supports nested queries with path expressions and set predicates. OPM also offers an ontology of scientific terms.

It has been argued that DBMSs are unnecessary in biology because transactions are so rare, most access is read-only, and the cost of reimplementing the database in a relational DBMS is often very high. Another reason is that biological data is often very complex and includes deeply nested records, sets and lists. Such data types are difficult to model in a relational or object DBMS. Flat-file databases generally have no explicit data model. Their entries are structured either implicitly or explicitly by search indexes. Flat files are the de facto data exchange standard in biology. Many tools biologists use work only with flat files.
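How a search index gives structure to a flat file can be sketched as follows. The record format (an "ID" line opening each entry and "//" closing it) and the contents are invented for this example, and `build_index` and `fetch` are hypothetical helper names; real flat-file databases define their own formats.

```python
import io

# A tiny flat file in an invented record format: each entry starts with an
# "ID" line and ends with "//". The content is made up for illustration.
FLAT_FILE = (
    "ID   rec1\nDE   first example entry\n//\n"
    "ID   rec2\nDE   second example entry\n//\n"
)


def build_index(handle):
    """Map each entry identifier to the offset where the entry starts."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith("ID"):
            index[line.split()[1]] = offset
    return index


def fetch(handle, index, identifier):
    """Jump to an entry through the index and read it up to the '//' terminator."""
    handle.seek(index[identifier])
    lines = []
    for line in iter(handle.readline, ""):
        if line.startswith("//"):
            break
        lines.append(line.rstrip("\n"))
    return lines


handle = io.StringIO(FLAT_FILE)
index = build_index(handle)
```

The index is the only structure imposed on the file: a lookup such as `fetch(handle, index, "rec2")` seeks directly to the entry instead of scanning the whole text.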

Many research projects are currently investigating alternative means of data storage for bioinformatics. Different XML-based strategies seem to hold a lot of promise and I will return to this subject in a later section.

1.2.2 Data retrieval

In general a biological database provides access via at least one of the following approaches:

• Query interface. The ability to query the database directly using SQL is actually a rarity in this field. My only explanation for this is that many biologists do not know SQL and besides, many databases are nothing but indexed flat files.

• Indirect retrieval using web browsers. This resembles the approach taken by common search engines on the web (e.g. Google). These databases allow users to input boolean search strings to query the database.

• Database downloading (as flat file). This is also quite common and it depends on the user having software to sift through the text file to extract interesting data.

1.2.3 Data acquisition

The information is collected in different ways:

• From other databases. Many databases only summarize the data in other databases. In one way this is convenient for the biologist who may find it easier to locate interesting information. The problem is that it can be hard to find the original source of the information.

• From the research community. Nowadays it is common that scientific journals demand that the scientists submit their data to relevant databases before publication. This ensures that the raw data are available to other research groups.

• From the scientific literature. Some databases have large staffs of curators who are experts in the field and who regularly update database records based on new published findings.

1.2.4 Databases used in this project

In this section I will provide an overview of the databases that were used in this work. The factors involved in the choice of these particular databases will be discussed in chapter 2.

GenBank

The GenBank sequence database is an annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced at the National Center for Biotechnology Information (NCBI) [22] as part of an international collaboration with the European Molecular Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) [6] and the DNA Data Bank of Japan (DDBJ) [5]. In February 2003, GenBank contained more than 23 million records. GenBank is built by direct submissions from individual research groups. Entries are found and retrieved using keyword searches. It is possible to search the database using a web browser, or programmatically by using the fact that the queries are sent as HTTP GET requests. Since the query string is a part of the URL it is relatively easy to automate the searches. The result is returned as text, HTML or XML. It is also possible to do bulk downloads using FTP. The format of the query string and the result report is specified, which makes the access protocol to GenBank relatively stable. GenBank is a flat file database and no access to the database backend is provided.

Figure 1.1 shows a shortened version of a GenBank record. The first thing to notice is that LOCUS is given with the prefix “NM_” for this record, which means that this sequence is a reference sequence (see RefSeq below). The second thing to notice is that the data is semi-structured and can easily be stored as a tree, which makes XML a perfect match for this data type. However, it is not obvious how to design the schema for a relational database that will hold this data in third normal form. The result is a schema that is not intuitive for the biologist, a problem I will return to later.
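Because the query travels in the URL, automating a search amounts to encoding a few parameters. The sketch below only builds such a URL and does not contact any server. It assumes the NCBI E-utilities `efetch` endpoint as an example of a GET interface; the text does not state which URL the project used, and the accession "NM_000000" is a placeholder rather than a real record.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities "efetch" endpoint is used here as an example
# of a GET interface; the thesis does not state which URL the project used.
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_query_url(accession, fmt="gb", mode="text"):
    """Encode the search parameters into the query string of a GET request."""
    params = {"db": "nucleotide", "id": accession, "rettype": fmt, "retmode": mode}
    return f"{BASE_URL}?{urlencode(params)}"


# "NM_000000" is a placeholder accession, not a real record.
url = build_query_url("NM_000000")
```

The resulting string could then be passed to `urllib.request.urlopen` to retrieve the record; changing `mode` to "xml" would request the XML report instead of plain text, under the same assumption about the endpoint.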

RefSeq

The goal of the reference sequence (RefSeq) database is to provide a biologically non-redundant collection of DNA, RNA and protein sequences [19]. Each RefSeq represents a single, naturally occurring molecule from a particular organism. RefSeqs are frequently based on GenBank records but are really a synthesis of information from several sources. GenBank contains records based on genomic sequences, transcribed (RNA) sequences and protein sequences. The problem is that we are interested in the functional unit, the gene product, which encompasses a genomic sequence that is transcribed to RNA and translated to a protein. In GenBank this information is scattered over several records. Furthermore, several records are derived from the same biological entity but submitted by different research groups using different methods and reporting slightly different results. The RefSeq database removes redundant records and retains one reference sequence for the genomic sequence, one reference sequence for the RNA and one reference sequence for the protein. Every record in the RefSeq database also holds a link to the other members of the functional unit. This makes it easier to retrieve the functional information about the gene products. The RefSeq database is maintained by NCBI and is accessed in the same way as GenBank.

LocusLink

LocusLink organizes information from several public databases to provide a locus-centered view of genomic information [20]. A locus is a defined place in the genome that is transcribed to RNA. LocusLink is the most important source of functional information precisely because it is based on the concept of a locus and not on sequences (sequences have a many-to-one relationship to a locus). Figure 1.2 shows a shortened version of a LocusLink record. The data is again semi-structured but with the additional problem that the data is only available as an HTML page. The GenBank records are always given in the same format (both the text format and the XML format are standardized), but since the LocusLink records are only meant to look similar to a user who browses the pages, there is no guarantee that the HTML code will not change.

Figure 1.1. An abbreviated GenBank record

Figure 1.2. An abbreviated LocusLink record

Figure 1.3. A Biocarta record.

Biocarta

The BioCarta database stores pictures of different signaling pathways [16]. One uses a web interface to search for a particular gene product by name or by LocusLink ID. If the gene product is found, a web page with links to the relevant records is returned. If one follows the links the pictures are shown in the web browser. Unfortunately, a lot of information is lost because the data is stored as pictures and not, for example, as graphs. An example record is shown in figure 1.3.

KEGG

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a Japanese database that is similar to BioCarta but contains pictures of metabolic pathways [15]. KEGG is accessed exactly like the BioCarta database and is subject to the same restrictions. Figure 1.4 shows an example record.

GeneOntology

In this work we are interested in going from a gene product to its function. Unfortunately the Gene Ontology database is adapted for the opposite problem of going from a defined function and finding gene products involved in that particular function [26]. The critical information contained in the Gene Ontology database can also be found by links from the LocusLink database records and we decided to use this indirect route instead of using the Gene Ontology database directly.

1.3 Current annotation strategies

To put the present work in perspective I will give an overview of the most commonly used strategies for functional annotation today.

1.3.1 Manual annotation

Today the most common approach to annotation is to do “database surfing”, where the biologist starts the search for information by querying a few databases that he or she happens to know about. The database records frequently contain hyperlinks to information in other databases on the internet, so the biologist follows these links to get more data and stops when enough information is collected. This approach often starts with a keyword search, which is in itself a problem due to the large number of irrelevant hits returned by the search engine. This, combined with the fact that every link has to be followed and checked manually, makes the whole process very laborious and prone to error. Another problem is that it depends on the hyperlinks, which must be correct and kept up to date.

1.3.2 Analysis pipelines

Some improvements are possible by writing software that hides the problems from the biologist who is only interested in the end result. When a group of biologists have common information needs, it is common that a piece of tailor-made software is written and then used by all researchers in the group. These applications are typically rather small and their purpose is to do automatically what would otherwise be done manually. The benefit is standardization and the possibility to make necessary changes in one place instead of having every biologist relearn how to accomplish the goal when the database model changes or when databases disappear from the internet (this happens, for example, when they are bought by a company). Another benefit is that it becomes possible to handle a lot of information, which can be integrated in a local database. The problem with this approach is that the applications can never grow to be truly general because this would rapidly lead to maintenance problems that cannot be handled by a small research group.

Figure 1.4. A KEGG record

1.3.3 More ambitious approaches based on link integration

An example of a more general application that is based on link integration is SRS (Sequence Retrieval System), which is a keyword indexing and search system for biological databases [7]. SRS is more sophisticated than general web-based search tools (such as Google) because it recognizes the existence of structured fields in source databases and allows maintainers to explicitly relate a field in one database to a differently named field in another. Biologists can go to an SRS site and perform their searches there. This system depends on having maintainers that constantly work to keep the information accurate.
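The idea of relating differently named fields across databases can be sketched with a simple mapping table. This is not how SRS is implemented; the database names "db_a" and "db_b", the field names and the records are all invented, and `search` is a hypothetical helper.

```python
# Hypothetical field mappings: each source database names the same concept
# differently. The database names, field names and records are invented.
FIELD_MAP = {
    "db_a": {"organism": "OS", "description": "DE"},
    "db_b": {"organism": "species", "description": "title"},
}

RECORDS = {
    "db_a": [{"OS": "Homo sapiens", "DE": "example kinase"}],
    "db_b": [
        {"species": "Homo sapiens", "title": "example receptor"},
        {"species": "Mus musculus", "title": "example receptor"},
    ],
}


def search(common_field, value):
    """Search every database for `value`, translating the common field name."""
    hits = []
    for db, records in RECORDS.items():
        local_field = FIELD_MAP[db][common_field]
        hits.extend((db, record) for record in records if record.get(local_field) == value)
    return hits
```

The mapping table is exactly the part that maintainers must keep accurate: when a source renames a field, only its entry in `FIELD_MAP` has to change.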

1.3.4 Data warehousing

The idea in data warehousing is to collect all relevant information in one database. The first step is to develop a data model that can accommodate all the information contained in the various source databases. It is also necessary to develop software programs that will fetch the data from the source databases, transform them to match the data model and then load them into the database. Data warehousing is difficult because the database must be updated constantly. New information is continuously added to the source databases, which means that the new data must be re-imported into the data warehouse. To make matters worse, database designs do not stand still: their maintainers change the data model by adding new data types, changing fields and nomenclature, and changing the relationships among data types. This means that software to fetch, transform and load information that has been written for one version of a database will not necessarily work with a later version. One ambitious attempt at the warehouse approach in bioinformatics was the Integrated Genome Database (IGD) project [25]. At its peak, IGD integrated more than a dozen source databases. The IGD project survived for slightly longer than a year before collapsing. The main reason for its collapse was the rapid change of the source databases. On average, each of the source databases changed its data model twice a year. This meant that the IGD data import system broke down every two weeks and the software had to be rewritten.
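The two-week figure follows from simple arithmetic, which the short calculation below spells out. It assumes exactly twelve source databases (the text says “more than a dozen”) and that the data model changes are spread evenly over a 52-week year.

```python
# Assumptions: exactly 12 sources (the text says "more than a dozen") and
# data model changes spread evenly over a 52-week year.
source_databases = 12
changes_per_database_per_year = 2

breaking_changes_per_year = source_databases * changes_per_database_per_year
weeks_between_breakages = 52 / breaking_changes_per_year

print(f"{breaking_changes_per_year} changes per year, "
      f"one every {weeks_between_breakages:.1f} weeks")
```

With more than twelve sources the interval between breakages becomes even shorter, which is consistent with the import system failing roughly every two weeks.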

Chapter 2

The present work

2.1 Goals and strategy

This project was initiated to explore some of the problems and possibilities in functional annotation. In the planning phase we decided to concentrate on the following subgoals.

2.1.1 Clarifying the task and the problems

The first goal of this project was to find a good strategy for functional annotation. This can in itself be useful for biologists doing manual annotation since it will help them avoid many problems. From this viewpoint the application (see 2.2.4) is only important in so far as it validates the strategy. The points to consider were:

• How to get a handle on all information. About 500 biological databases are available on the internet. Many of them have overlapping and sometimes conflicting information.

• How to handle the terminology problem. It is necessary to make sure that there is no doubt about what one is talking about. What genes are we annotating?

• How to get and organize the information once it is found. Every database that one wants to get information from must be treated separately since they differ in how data is accessed and in how the results are delivered. As a final step it is necessary to set up a local database to store all the data.

These points will be discussed beginning with section 2.2.1.

2.1.2 How should applications for functional annotation be designed?

The second goal was to clarify how an application for functional annotation should be designed. When we planned this project we came up with the following important factors.

• It is necessary to design the applications with change in mind. The demands on the application will change. The application will have dependencies on data formats, query interfaces and other things. These dependencies will be broken. The question is how to create a design that makes changing the application as painless as possible.

• The application must be able to work as a component of a larger system. This is necessary to allow the user to adapt the application to his or her particular needs. Therefore it is necessary to give careful thought to the input and output of the application.

• Even though the application will use a few selected databases it is necessary to allow other databases to be added or removed at a later date.

• Quality control. How to ensure that the information is valid.
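One way to meet the requirements on change and on adding or removing databases is to hide every data source behind the same small interface. The following is a minimal sketch of that idea and not the design of the application described here; all function and source names are hypothetical and the adapters return fixed stub data instead of querying a real database.

```python
# Sketch of a pluggable design: each data source is wrapped in an adapter with
# the same interface, so sources can be added or removed without touching the
# rest of the program. All function and source names here are hypothetical.
from typing import Callable, Dict, List

SOURCES: Dict[str, Callable[[str], List[dict]]] = {}


def register(name: str):
    """Decorator that adds an adapter function to the registry under `name`."""
    def decorator(func):
        SOURCES[name] = func
        return func
    return decorator


@register("source_a")
def fetch_from_source_a(gene_name: str) -> List[dict]:
    # A real adapter would query the remote database; this stub returns fixed data.
    return [{"source": "source_a", "gene": gene_name, "note": "stub annotation"}]


@register("source_b")
def fetch_from_source_b(gene_name: str) -> List[dict]:
    return []  # this source knows nothing about the gene


def annotate(gene_name: str) -> List[dict]:
    """Collect annotations from every registered source.

    A failing source is recorded as an error entry instead of stopping the run.
    """
    results = []
    for name, fetch in SOURCES.items():
        try:
            results.extend(fetch(gene_name))
        except Exception as error:  # a broken dependency must not stop the run
            results.append({"source": name, "gene": gene_name, "error": str(error)})
    return results
```

Removing a database is then a matter of deleting its adapter, and a change in a remote format is confined to the one adapter that depends on it.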

2.1.3 Providing a small proof-of-concept annotation program

The final goal of this project is to provide a proof-of-concept implementation of an application that takes a number of gene names and outputs annotated genes. The application must be usable, although somewhat limited, and incorporate the features mentioned above. The major limitation is that the only interface to the result will be SQL queries of the created database. Since most biologists do not know SQL, an alternative interface must be constructed for this application to be really useful.

2.2 Results

I will put emphasis on the data selection and data integration problem since that was the major challenge of this project. The writing of the application was relatively straightforward. To make the discussion more concrete I will first give an outline of the steps in the annotation process.

1. Read the file with the names of the genes to be annotated.

2. Find the entities (in this case the gene products) to annotate. This is done by querying the GenBank or the LocusLink databases to find all genes with names found in the input file.

3. Use the information from these databases to learn how to query other databases for more information. At present the application will try to find information in the KEGG database, the Biocarta database and the Gene Ontology database apart from the information already found in GenBank and LocusLink.

4. The application uses an embedded database and will automatically create a relational database on the user's computer to hold all information.

5. Write all information to the local database.

The embedded database, SQLite [4], comes with a client program that can be used for SQL queries of the created database.
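The five steps above can be sketched as a small Python pipeline. This is a minimal sketch and not the GEA source: the function names, the `lookup` callable and the single `annotation` table are hypothetical, and the remote queries are replaced by a caller-supplied function so that the example runs offline.

```python
import sqlite3
from typing import Callable, Iterable

def read_gene_names(lines: Iterable[str]) -> list[str]:
    """Step 1: one gene name per line; blank lines are skipped."""
    return [line.strip() for line in lines if line.strip()]

def annotate(names, lookup: Callable[[str], list[tuple[str, str]]], db_path=":memory:"):
    """Steps 2-5: look up each name, then write (gene, source, value) rows.

    `lookup` is a hypothetical stand-in for the remote queries; it returns
    (source, value) pairs for one gene name.
    """
    conn = sqlite3.connect(db_path)  # step 4: the database is created if missing
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotation (gene TEXT, source TEXT, value TEXT)"
    )
    for name in names:
        for source, value in lookup(name):  # steps 2-3
            conn.execute("INSERT INTO annotation VALUES (?, ?, ?)", (name, source, value))
    conn.commit()  # step 5
    return conn

def fake_lookup(name):
    # Made-up data, for illustration only.
    return [("locuslink", f"id-for-{name}"), ("kegg", f"pathway-for-{name}")]

conn = annotate(read_gene_names(["geneA\n", "\n", "geneB\n"]), fake_lookup)
rows = conn.execute("SELECT gene, source FROM annotation ORDER BY gene, source").fetchall()
```

Passing the lookup as a parameter keeps the retrieval code separate from the storage code, which is one way to limit the damage when a remote interface changes.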

2.2.1 Choosing data sources

It took a long time to decide which databases to use as sources of information. The major problem was to find the relevant databases. First the databases have to be found and then it is necessary to quickly determine whether they contain relevant information. There is no easy way to do this. Basically, information was gathered from biologists at CGB, from the bioinformatics literature and from web searches. It was difficult to know when to stop searching and to accept that a search like this cannot be exhaustive. Another problem was to keep the application simple by not including two databases with very similar information. Several parameters were considered in the selection of the five databases.

1. Completeness. In our case this is a question of whether the database holds information on all gene products of interest to biologists at CGB. This criterion is fulfilled by the chosen databases if the definition of completeness is restricted to mean that the databases hold all information that exists within their scope. The records in the Biocarta and KEGG databases depend on a rather deep understanding of the function of the gene products so it is not so surprising that these databases lack information about many gene products. But they are still complete in the restricted sense that there is not more information of the same kind anywhere else.

2. Quality. Even though quality is difficult to measure we did what we could. One criterion was frequency of citation in biological research articles. We reasoned that databases that are frequently cited are regarded as trustworthy by the researchers in the field. The databases we chose are often cited in the biological literature and they are well known to most biologists.

3. Accessibility. How is information retrieved and in what format is the information returned? The most troublesome databases are completely designed for access via a web browser. To retrieve data from these databases automatically, the application must first perform a direct GET or, even worse, a direct POST request via HTTP and then parse the returned HTML-code. This makes the application very brittle since even small changes in an HTML-form or in the returned HTML-code will lead to a collapse of the application. We had hoped to be able to choose databases where the database backend is available but this was not possible and accessibility was a problem with all chosen databases.
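The kind of access described under the accessibility criterion can be illustrated with the Python standard library. This is a sketch under stated assumptions: the URL, the parameter names and the HTML snippet are invented for the example and do not reflect the layout of any of the chosen databases.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

def build_get_url(base_url: str, **params: str) -> str:
    """Append URL-encoded query parameters to a base URL (a GET request)."""
    return f"{base_url}?{urlencode(params)}"

class LinkExtractor(HTMLParser):
    """Collect href values of <a> tags whose href contains a marker string."""

    def __init__(self, marker: str):
        super().__init__()
        self.marker = marker
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.marker in href:
                self.links.append(href)

# Hypothetical URL and HTML; no real server layout is implied.
url = build_get_url("http://example.org/query", term="rad-3", db="gene")

page = '<html><body><a href="/pathway/p1">P1</a> <a href="/help">Help</a></body></html>'
extractor = LinkExtractor("/pathway/")
extractor.feed(page)
```

The brittleness is visible in the marker string: if the server renamed `/pathway/` in its links, the extractor would silently return nothing.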

2.2.2 The terminology problem

There are many potential ways to handle the ambiguous terminology in this field. I chose to implement two strategies in parallel. The first strategy is to let the user handle the problem. If the user inputs an ambiguous gene name then the output will be ambiguous. This is not as bad as it seems because the application will first find all gene products that the gene name refers to and then create an entity for each gene product and annotate these entities. The result will be correct in the sense that the annotations are correctly associated with the relevant entities but this approach will generate a lot of irrelevant results that the user must handle.

The second strategy is to let the user use LocusLink IDs instead of gene names. LocusLink IDs are unique so there is no longer any ambiguity but this approach involves more work for the user who has to find the LocusLink IDs.
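The two strategies can be combined in one lookup function, sketched below. The assumption that a purely numeric entry is an ID, the `resolve` function and the contents of `name_index` are all hypothetical and only serve to illustrate the idea.

```python
def resolve(entry: str, name_index: dict[str, list[int]]) -> list[int]:
    """Return the IDs an input entry refers to.

    A purely numeric entry is taken to be an ID and returned as-is;
    anything else is looked up as a (possibly ambiguous) gene name.
    """
    entry = entry.strip()
    if entry.isdigit():
        return [int(entry)]
    return name_index.get(entry.lower(), [])

# Made-up index: one name maps to two different gene products.
name_index = {"abc1": [101, 202], "xyz": [303]}
```

An ambiguous name such as `abc1` yields one entity per matching gene product, while an ID yields exactly one.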

2.2.3 The validation problem

Another problem when one wants to build an application that is dependent on data from several external sources is the question of validation of data. Every database uses a different strategy to check the correctness of data before it is entered in the database. GenBank makes no guarantees; the research group that submits the data is responsible for its correctness. In most cases the research group supplies a reference to an article where they have published their work and this article can then be scrutinized by any sceptical database user. RefSeq is another story since the database records are assemblies of data from several sources. The RefSeq maintainers therefore accept responsibility for validation of the data in RefSeq. Of course, any user could in principle backtrack the steps that were taken in the assembly process but this quickly becomes impractical even if the user is a software application. To validate all records would mean that the whole database would have to be recreated. LocusLink is also a curated database with data assembled from several sources. The difference from RefSeq is that it is more obvious where the information comes from and links to the original sources are provided. Still, validation of all data means recreating the database so the best thing one can do is to retain the references to the original data source and let the (human) user decide what needs to be checked. Biocarta records are created by scientists who volunteer. The pictures come with literature references but this is not very useful if one wants to validate the information automatically. Once again validation must be left to the end-user. The information in KEGG is validated by the maintainers and can be taken as is.

The conclusion is that most of the validation must be left to the end-user. It is theoretically possible to do some checks, e.g. using multiple sources and checking for consistency, but then the question becomes how to resolve conflicts between different data sources. In this project the data is taken "as-is".
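A consistency check of the kind mentioned above could look like the following sketch. GEA does not perform such a check; the record layout, the source names and the values are invented. The function only detects disagreements and leaves their resolution open, which is exactly the unresolved question.

```python
from collections import defaultdict

def find_conflicts(records):
    """Group (source, gene_id, field, value) tuples and report disagreements.

    Returns {(gene_id, field): {value: [sources...]}} for every field on
    which at least two different values were reported.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for source, gene_id, field, value in records:
        seen[(gene_id, field)][value].append(source)
    return {key: dict(values) for key, values in seen.items() if len(values) > 1}

# Made-up records from two hypothetical sources.
records = [
    ("sourceA", 101, "chromosome", "7"),
    ("sourceB", 101, "chromosome", "7"),
    ("sourceA", 202, "chromosome", "X"),
    ("sourceB", 202, "chromosome", "11"),
]
conflicts = find_conflicts(records)
```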

2.2.4 The GeneAnnotator program (GEA)

A major design goal was to maximize flexibility for the user. One result of this is that the user can enter either gene names, gene symbols or LocusLink IDs in the input file. Another result is the use of an embedded database, SQLite, to store the data [4]. By using an embedded database the user does not have to set up the database and can run the application on his or her computer directly. At the same time it is very easy for a knowledgeable user to switch to another relational database should that be necessary. It was important to store the result in a database not only to allow queries but also to allow GEA to be a part of an analysis pipeline. The analysis for the biologist does not stop with the functional annotation. The next step may be visualization of different aspects of the result or the use of different data mining algorithms. It is difficult to predict exactly what the next steps can be but by storing the data in a database most needs can be met. However, the need to write the application to fit a particular database schema is a real problem. If a user wants to add information from another biological database the database schema must be changed and with it a large part of the application. The problem is that even a small addition may necessitate a large change to keep the database schema in third normal form. This problem would be reduced if an object-based database or an XML database had been used instead. A future version of GEA may switch to an XML database but right now this is not practical because the end-users know about relational databases and are reluctant to change to anything else.

The application reads an entry at a time from the input file and must then first retrieve all GenBank record IDs that match the entry. This is done by building a query string that is concatenated to a certain URL. This is just a way of sending a GET-request using HTTP. The GenBank server then sends back the IDs of all matching records. The next step is to send another GET-request to retrieve the GenBank records. At this stage some filtering is done and only GenBank records that are also RefSeq records are retrieved. The GenBank records are retrieved as text. The data in the parsed records are stored in classes which have attributes that match the tables in the database. To retrieve more information the LocusLink ID of each record is extracted and used to get the relevant records from LocusLink. The LocusLink ID is also used to find the relevant Biocarta and KEGG records. In all these cases the requests for records are sent as GET-requests via HTTP. The problem with the responses from LocusLink, Biocarta and KEGG is that the application must parse HTML-code. A decision was made to only extract the data we need from the HTML-code. This is in contrast with the GenBank records where all information is retained. By not trying to parse the records in their entirety we hope that the code will not be so brittle.
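The idea of classes whose attributes match the database tables can be sketched with a Python dataclass. The class name `KeggRow`, the helper `insert_row` and the two-column `kegg` table are assumptions made for the example; the actual GEA classes are not shown in this thesis.

```python
import sqlite3
from dataclasses import dataclass, astuple, fields

@dataclass
class KeggRow:
    """Mirrors a hypothetical two-column `kegg` table (id, url)."""
    id: str
    url: str

def insert_row(conn: sqlite3.Connection, table: str, row) -> None:
    """Insert a dataclass instance into the table with matching column names.

    The table name is interpolated into the SQL text, so it must come from
    trusted code, never from user input.
    """
    columns = ", ".join(f.name for f in fields(row))
    marks = ", ".join("?" for _ in fields(row))
    conn.execute(f"INSERT INTO {table} ({columns}) VALUES ({marks})", astuple(row))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kegg (id TEXT PRIMARY KEY, url TEXT)")
insert_row(conn, "kegg", KeggRow("example-pathway", "http://example.org/p.gif"))
stored = conn.execute("SELECT id, url FROM kegg").fetchall()
```

Because the column list is derived from the class, adding an attribute to the class and a column to the table is enough to store a new field.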

The user is given the choice to either create a new database or use an existing one before the results are written to disk. SQLite comes with a client program that can be used to query an existing database. The resulting database is a single file which can easily be imported to a more full-featured database if desired.
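The choice between a new and an existing database can be sketched with Python's `sqlite3` module, which creates the database file when it does not exist. The function `open_database`, its `create` flag and the file name `gea.db` are hypothetical.

```python
import os
import sqlite3
import tempfile

def open_database(path: str, create: bool) -> sqlite3.Connection:
    """Open an existing database file, or create a new one on request."""
    exists = os.path.exists(path)
    if create and exists:
        raise FileExistsError(f"{path} already exists")
    if not create and not exists:
        raise FileNotFoundError(f"{path} does not exist")
    return sqlite3.connect(path)  # sqlite3 creates the file if it is missing

workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "gea.db")  # hypothetical file name

conn = open_database(db_path, create=True)
conn.execute("CREATE TABLE kegg (id TEXT, url TEXT)")
conn.commit()
conn.close()

conn = open_database(db_path, create=False)  # reopen the same single file
tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
conn.close()
```

The explicit existence checks guard against the two likely mistakes: overwriting earlier results and silently starting an empty database after a mistyped path.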

The whole application is written in Python, an object-oriented scripting language well suited to our purpose. Instead of burdening this thesis with a lot of code I have included pseudocode for some important parts of the application in the appendix.

Figure 2.1 shows the entity-relationship diagram for the GEA database. The locuslink entity is central, which is natural since a LocusLink record represents a gene product, which is what we want to annotate. The RefSeq database has records for genomic sequences (DNA), mRNA sequences and protein sequences so in reality one LocusLink record can correspond to at least three RefSeq records but a decision was made by the biologists to only include the mRNA RefSeq records since these records refer to the other RefSeq records (the genomic and protein sequences). This is reflected in the one-to-one relationship between the locuslink and refseq entities. The chromosome attribute of the locuslink entity gives the chromosome on which the gene is located. For many of the entities the id is actually very informative. An example is the kegg and biocarta entities where the id is the name of the process the gene product is a part of. Sometimes the name of a process is enough; in these cases the pictures are not important. Initially we planned to store the pictures but it takes a long time to download all pictures and the database becomes very large. In the present version we only store the URLs of the pictures. This limitation will be discussed in section 2.3. Generif is a table of literature references. Each record in this table contains a summary of an article that describes the function of a gene product and the URL of the article itself. The domain table holds the names of the protein domains of gene products. The URL of a domain points to a record in the Conserved Domain Database (CDD) that holds information about that particular domain. The lltodomain table associates a gene product with a domain and contains a literature reference that justifies the association. The go table contains the name of a Gene Ontology term, its aspect (one of biological process, molecular function or cellular component) and the URL of the GO term. The URL actually points to a record in a database maintained by the Gene Ontology Consortium.

In the present version the user must use SQL to access the data. This will change in later versions.
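A query against such a database might look like the sketch below. The table names follow Figure 2.1, but the column lists are simplified assumptions and the rows are made up, so this should not be read as the actual GEA schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locuslink (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE kegg      (id TEXT PRIMARY KEY, url TEXT);
    CREATE TABLE keggtoll  (keggid TEXT REFERENCES kegg(id),
                            llid   INTEGER REFERENCES locuslink(id));
    -- Made-up rows, for illustration only.
    INSERT INTO locuslink VALUES (101, 'geneA'), (202, 'geneB');
    INSERT INTO kegg VALUES ('pathway one', 'http://example.org/p1.gif');
    INSERT INTO keggtoll VALUES ('pathway one', 101), ('pathway one', 202);
""")

# Which gene products take part in a given pathway?
query = """
    SELECT l.name
    FROM locuslink AS l
    JOIN keggtoll AS k ON k.llid = l.id
    WHERE k.keggid = ?
    ORDER BY l.name
"""
genes = [row[0] for row in conn.execute(query, ("pathway one",))]
```

The linking table `keggtoll` is what allows one gene product to take part in several pathways and one pathway to involve several gene products.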

2.3 What remains to be done

The application is functional and used by some of the biologists in CGB but more work is needed before it can be considered finished.

2.3.1 Designing a user-friendly interface

The most important improvement will be to allow automatic retrieval of the resources pointed to by the URLs in the database. It will be relatively easy to add a GUI which shows the URLs as real hyperlinks. Another improvement would be to add a simplified query interface. This could be modeled on web search engines like Google. Another possibility would be to show the data as one large table and let the user mark interesting columns with the mouse. Exactly how the user interface will look is not clear but since the users have an interest in the application they are willing to participate in its design.

2.3.2 Robustness

Figure 2.1. The entity-relationship diagram for the database schema used in GEA. [Diagram not reproduced. Its entities are locuslink, refseq, generif, go, gotoll, biocarta, bctoll, kegg, keggtoll, domain and lltodomain.]

At present the error handling is very primitive. Errors are simply ignored. If a record from a database cannot be retrieved, the application will simply proceed to the next step. An obvious improvement would be to let the application create a log file where all problems would be recorded. Another improvement would be to let the application retry every failed operation at least once.
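The two improvements suggested above, logging and a single retry, could be combined as in the following sketch. The function `fetch_with_retry` and the failing stand-in fetcher are hypothetical; nothing like this exists in the present version of GEA.

```python
import logging

def fetch_with_retry(fetch, key, retries=1, logger=logging.getLogger("gea")):
    """Call fetch(key); on failure log the problem and retry up to `retries` times.

    Returns the result, or None if every attempt failed.
    """
    for attempt in range(1 + retries):
        try:
            return fetch(key)
        except Exception as exc:  # broad on purpose: any failure is logged
            logger.warning("attempt %d for %r failed: %s", attempt + 1, key, exc)
    return None

# A stand-in fetcher that fails on its first call and succeeds on the second.
calls = []

def flaky_fetch(key):
    calls.append(key)
    if len(calls) == 1:
        raise IOError("temporary failure")
    return f"record for {key}"

result = fetch_with_retry(flaky_fetch, "101")
```

Calling `logging.basicConfig(filename=...)` once at start-up would send these warnings to a log file instead of the console.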

A different kind of problem presents itself when the user wants to update the database. It is easy to add records to an existing database by running the application again with a new input file but at present it is not possible to change the information about a gene product already in the database. Either the user must create a new database or he or she must first remove the old record.
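One possible way to support updates is SQLite's `INSERT OR REPLACE`, sketched below. This is a suggestion and not part of GEA; the reduced `locuslink` table and the function `upsert_locuslink` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locuslink (id INTEGER PRIMARY KEY, name TEXT)")

def upsert_locuslink(conn, ll_id, name):
    """Insert the record, or overwrite the existing row with the same id."""
    conn.execute(
        "INSERT OR REPLACE INTO locuslink (id, name) VALUES (?, ?)", (ll_id, name)
    )

upsert_locuslink(conn, 101, "old name")
upsert_locuslink(conn, 101, "new name")  # same id: the row is replaced
rows = conn.execute("SELECT id, name FROM locuslink").fetchall()
```

Rows in the linking tables that refer to the replaced record would still have to be refreshed separately.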

2.3.3 Designing for extensibility and flexibility

As already discussed, the use of a relational database limits the extensibility of the application. We are thinking about using an XML database, e.g. Xindice from the Apache Software Foundation [1]. Xindice is an Open Source database which can be queried with XPath. Using an XML database would make it easier to add or remove information about gene products without disrupting the whole application.
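The flexibility argument can be illustrated with a small XML document queried by XPath-style expressions. The sketch uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in; it is not Xindice, and the element names and data are invented.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<genes>
  <gene id="101">
    <name>geneA</name>
    <pathway source="kegg">pathway one</pathway>
  </gene>
  <gene id="202">
    <name>geneB</name>
    <pathway source="kegg">pathway one</pathway>
    <note>a field added later, without touching other records</note>
  </gene>
</genes>
""")

# Select genes by the text of a child element.
in_pathway = [g.get("id") for g in doc.findall(".//gene[pathway='pathway one']")]
# Select genes that have the newly added child element at all.
with_note = [g.get("id") for g in doc.findall(".//gene[note]")]
```

The second record carries an extra `note` element, yet the query for pathway membership is unaffected, which is the property a relational schema in third normal form does not give for free.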

Chapter 3

Future directions

3.1 Ongoing research

There is a lot of research in bioinformatics that tries to address the problems I have touched upon in this thesis. I will briefly mention some ideas that will probably prove very useful in the area of functional annotation.

3.1.1 Web services

Web services can be seen as a variant of link integration. In this view, the heterogeneous collection of linked data sources on the web is turned upside-down and becomes a web of services that are linked by service names and definitions. GenBank, e.g., is no longer a database for retrieval of sequences but is transformed to a service that transforms sequence accession numbers into GenBank flat files. The difference seems minor but it allows users to establish a common framework that can encompass many data sources. One example of a web service is the Distributed Annotation System (DAS) [2]. DAS provides a web service for exchanging genomic annotations, information that can be associated with a region of the genome. The DAS protocol is simple. The user asks for a genomic region and the server returns a structured document that contains information about all annotations that overlap the specified region. The DAS service allows data providers to exchange information about annotations and allows a limited form of data integration. DAS is unfortunately semantically weak and a lot of ongoing work tries to overcome this limitation. The first aspect of this weakness is that the annotation fields are without type information. Another aspect is that DAS does nothing to handle the terminology problem even though all objects exchanged through DAS have names. Two other technologies, ontologies and globally unique identifiers, go a long way towards solving these problems.
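The core of the request described above, returning all annotations that overlap a region, is an interval overlap test. The sketch below shows only that test; it does not implement the DAS protocol or its document format, and the `Annotation` class and the feature data are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    name: str
    start: int  # inclusive
    end: int    # inclusive

def overlapping(annotations, start, end):
    """Return annotations that share at least one position with [start, end]."""
    return [a for a in annotations if a.start <= end and a.end >= start]

features = [
    Annotation("featureA", 100, 200),
    Annotation("featureB", 180, 300),
    Annotation("featureC", 400, 500),
]
hits = [a.name for a in overlapping(features, 190, 350)]
```

Note that the annotations carry only a name and coordinates, with no type information, which mirrors the semantic weakness discussed above.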

3.1.2 Ontologies

We have already discussed one ontology, the Gene Ontology (GO), but ontologies are a very active research area in bioinformatics and there are several other ontology projects as well [24]. Ontologies cannot by themselves lead to integration of biological databases but they can be important facilitators. The existence of an ontology allows a data integrator to merge the information in different databases with some guarantee that a term means the same thing in all databases. The Sequence Ontology (SO), e.g., defines a set of terms and definitions that describe features on a genome, such as exon, pseudogene and transcription start site. An important feature of biological ontologies is that terms are organized in a hierarchical manner where more specific terms are specializations of more general ones. This makes it possible to merge specific, detailed information with more general information by first moving up in the hierarchy from the specialized terms to a common, more general term. To support the complex relationships that are common in biology, terms are allowed to have more than one parent, leading to a data structure that is a DAG (directed acyclic graph). The most common type of relationship is "is-a" but other relationships are found in certain ontologies, e.g., "part-of" in GO.
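Moving up the hierarchy to a common, more general term can be sketched as a graph traversal. The term names below are placeholders and not real GO or SO terms, and the functions are illustrative only.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    found = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def common_ancestors(a, b, parents):
    """Terms under which annotations for `a` and `b` can be merged."""
    return (ancestors(a, parents) | {a}) & (ancestors(b, parents) | {b})

# Hypothetical hierarchy; "d" has two parents, so this is a DAG, not a tree.
parents = {"b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["c"]}
```

Here `d` and `e` can be merged under `c` or under the more general `a`; a real integrator would usually prefer the most specific common term.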

3.1.3 Globally unique identifiers

Since the same biological object may have several names and different biological objects may have the same name, terminology is a major problem, as already discussed. One solution might be to have a names commission to manage the definitive list of such names, as the HUGO Gene Nomenclature Committee is attempting to do with human gene symbols [10]. The problem is that names come, go, are merged and are split too rapidly for any commission to keep up with. Even if the names commission could handle this, it is unclear how the changes can be propagated to the databases that depend on them. Another solution is to create a globally unique identifier. The Life Sciences Identifier (LSID), put forward by the Interoperable Informatics Infrastructure Consortium (I3C), combines the internet domain name of the source database with the local identifier from the database [13]. An example is the C. elegans rad-3 gene, which might get the LSID "urn:lsid:www.wormbase.org:gene/rad-3". The "urn:" prefix identifies the resource as a Uniform Resource Name (URN) to distinguish it from a Uniform Resource Locator (URL).
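The construction described above, a domain name combined with a local identifier, can be sketched in a few lines. The functions follow only the shape of the example in the text and are not an implementation of the full LSID specification; their names are hypothetical.

```python
def make_identifier(domain: str, local_id: str) -> str:
    """Combine a source database's domain name with its local identifier."""
    return f"urn:lsid:{domain}:{local_id}"

def split_identifier(identifier: str) -> tuple[str, str]:
    """Recover (domain, local_id) from an identifier built by make_identifier."""
    prefix = "urn:lsid:"
    if not identifier.startswith(prefix):
        raise ValueError(f"not an LSID-style URN: {identifier!r}")
    domain, _, local_id = identifier[len(prefix):].partition(":")
    return domain, local_id

lsid = make_identifier("www.wormbase.org", "gene/rad-3")
```

Because the domain name is already globally unique, the source database only has to guarantee uniqueness of its own local identifiers.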

The combination of ontologies and globally unique identifiers increases the chances that web services can exchange data without manual intervention. A future version of DAS will use SO to describe sequence annotation types and LSIDs to identify biological objects.

3.2 Recommendations

There is at present no "best practice" when it comes to functional annotation of gene products. The field is full of ad hoc solutions based on the opinions of individual biologists so it is very important to design for change and to develop applications in this area in close cooperation with the end users. Any application written today will be obsolete in six months so it is no use trying to design a "killer application".

References

URLs last visited the 20th of May, 2004

[1] Apache Software Foundation. http://xml.apache.org/xindice

[2] Biodas. http://biodas.org

[3] The Gene Ontology Consortium. Creating the gene ontology resource: Design and implementation. Genome Research, 2001.

[4] Applied Software Research D R Hipp. http://www.sqlite.org

[5] http://www.ddbj.nig.ac.jp/Welcome-e.html

[6] European Bioinformatics Institute. http://www.ebi.ac.uk/embl/index.html

[7] European Bioinformatics Institute. http://srs.embl-heidelberg.de:8000/srs5

[8] Peer Kröger and Francois Bry. A molecular biology database digest. Distributed and Parallel Databases, 2003.

[9] Georgetown University Medical Center. http://pir.georgetown.edu/home.shtml

[10] HUGO Gene Nomenclature Committee. http://www.gene.ucl.ac.uk/nomenclature

[11] V Markowitz and I M Chen. An overview of the object protocol model (OPM) and the OPM data management tools. Information Systems, 1995.

[12] Infobiogen. http://www.infobiogen.fr/services/dbcat

[13] Interoperable Informatics Infrastructure Consortium. http://www.i3c.org/wgr/ta/resources/lsid/docs/index.asp

[14] Seung Y Rhee and Jonathan B L Bard. Ontologies in biology: Design, applications and future challenges. Nature Reviews Genetics, 2004.

[15] Kyoto University Bioinformatics Center. http://www.genome.ad.jp/kegg

[16] National Cancer Institute. http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways

[17] National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/Genbank

[18] National Center for Biotechnology Information. http://www.ncbi.nih.gov/Structure/cdd/cdd.shtml

[19] National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/RefSeq

[20] National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/LocusLink

[21] National Library of Medicine. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

[22] http://www.ncbi.nlm.nih.gov

[23] Swiss Institute of Bioinformatics. http://us.expasy.org/sprot

[24] Open Biological Ontologies. http://obo.sourceforge.net

[25] L Stein. Integrating biological databases. Nature Reviews Genetics, 2003.

[26] The Gene Ontology Consortium. http://www.godatabase.org/cgi-bin/go.cgi

[27] The Sanger Institute. http://www.acedb.org

Appendix

Pseudocode for parts of GEA

    read command-line arguments
    create or open SQLite database

    for each entry in inputFile:
        queryString = genbankIdURL + entry
        listOfGenBankIDs = sendGETrequest(queryString)
        listOfGenBankRecords = empty list
        for id in listOfGenBankIDs:
            queryString = genbankRecordURL + id
            listOfGenBankRecords.add(sendGETrequest(queryString))

        for record in listOfGenBankRecords:
            parse record
            store in shadow classes    # the shadow classes have attributes that
                                       # mirror the database table attributes
            locuslinkID = LocusLink ID extracted from the parsed record
            queryString = locuslinkURL + locuslinkID
            locuslinkRecord = sendGETrequest(queryString)
            extract data from locuslinkRecord
            store in shadow classes
            queryString = biocartaURL + locuslinkID
            biocartaPathways = sendGETrequest(queryString)
            extract data from biocartaPathways
            store in shadow classes
            queryString = keggURL + locuslinkID
            keggPathways = sendGETrequest(queryString)
            extract data from keggPathways
            store in shadow classes
            write data in shadow classes to database
