transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex...

20
Transparent access to multiple bioinformatics information sources by C. A. Goble R. Stevens G. Ng S. Bechhofer N. W. Paton P. G. Baker M. Peim A. Brass This paper describes the Transparent Access to Multiple Bioinformatics Information Sources project, known as TAMBIS, in which a domain ontology for molecular biology and bioinformatics is used in a retrieval-based information integration system for biologists. The ontology, represented using a description logic and managed by a terminology server, is used both to drive a visual query interface and as a global schema against which complex intersource queries are expressed. These source-independent declarative queries are then rewritten into collections of ordered source-dependent queries for execution by a middleware layer. In bioinformatics, the majority of data sources are not databases but tools with limited accessible interfaces. The ontology helps manage the interoperation between these resources. The paper emphasizes the central role that is played by the ontology in the system. The project distinguishes itself from others in the following ways: the ontology, developed by a biologist, is substantial; the retrieval interface is sophisticated; the description logic is managed by a sophisticated terminology server. A full pilot application is available as a Java TM applet integrating five sources concerned with proteins. This pilot is currently undergoing field trials with working biologists and is being used to answer real questions in biology, one of which is used as a case study throughout the paper. T he molecular biology community is a distributed one with a culture of sharing substantial quan- tities of rapidly evolving information. However, the development of a global informatics infrastructure to support this community has been piecemeal. Each area of molecular biology generates its own data- bases, and a wide range of specialized interrogation and analysis tools are commonly used over these re- sources. The Molecular Biology Database Collec- tion, 1 for example, currently holds over 500 infor- mation resources, excluding the tools that analyze the information contained therein. The most pop- ular resources include those concerned with protein sequences (SWISS-PROT, an annotated protein se- quence database that originated in Switzerland, and PIR, the Protein Information Resource), genome data (ACeDB,a Caenorhabditis elegans database), DNA (deoxyribonucleic acid) sequences (EMBL—the Eu- ropean Molecular Biology Laboratory, and Gen- Bank), protein structure (PDB, the Protein Data Bank), motifs (PROSITE, a database of protein fam- ilies and domains, and PRINTS, a compendium of pro- tein fingerprints), and sequence matching (BLAST, Basic Local Alignment Search Tool). Others are more specialized; for example, Yeast Proteome Da- tabase (YPD) and FlyBase are species-specific. This network of information services forms a loose federation of autonomous, distributed, heteroge- neous data repositories, ripe for information inte- gration. 2 A number of approaches of varying sophis- tication have been adopted, from Web-based browsers to data warehouses. 3 The characteristics of this collection of resources are worth expressing in order to set the context for the rest of the paper, and our approach in particular. rCopyright 2001 by International Business Machines Corpora- tion. Copying in printed form for private use is permitted with- out payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copy- right notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor. GOBLE ET AL. 0018-8670/01/$5.00 © 2001 IBM IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 532

Upload: others

Post on 01-Jan-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

Transparent accessto multiplebioinformaticsinformation sources

by C. A. GobleR. StevensG. NgS. Bechhofer

N. W. PatonP. G. BakerM. PeimA. Brass

This paper describes the Transparent Access toMultiple Bioinformatics Information Sourcesproject, known as TAMBIS, in which a domainontology for molecular biology and bioinformaticsis used in a retrieval-based informationintegration system for biologists. The ontology,represented using a description logic andmanaged by a terminology server, is used bothto drive a visual query interface and as a globalschema against which complex intersourcequeries are expressed. These source-independentdeclarative queries are then rewritten intocollections of ordered source-dependent queriesfor execution by a middleware layer. Inbioinformatics, the majority of data sources arenot databases but tools with limited accessibleinterfaces. The ontology helps manage theinteroperation between these resources. Thepaper emphasizes the central role that is playedby the ontology in the system. The projectdistinguishes itself from others in the followingways: the ontology, developed by a biologist,is substantial; the retrieval interface issophisticated; the description logic is managedby a sophisticated terminology server. A full pilotapplication is available as a JavaTM appletintegrating five sources concerned with proteins.This pilot is currently undergoing field trials withworking biologists and is being used to answerreal questions in biology, one of which is usedas a case study throughout the paper.

The molecular biology community is a distributedone with a culture of sharing substantial quan-

tities of rapidly evolving information. However, thedevelopment of a global informatics infrastructureto support this community has been piecemeal. Eacharea of molecular biology generates its own data-bases, and a wide range of specialized interrogationand analysis tools are commonly used over these re-

sources. The Molecular Biology Database Collec-tion,1 for example, currently holds over 500 infor-mation resources, excluding the tools that analyzethe information contained therein. The most pop-ular resources include those concerned with proteinsequences (SWISS-PROT, an annotated protein se-quence database that originated in Switzerland, andPIR, the Protein Information Resource), genomedata (ACeDB, a Caenorhabditis elegans database), DNA(deoxyribonucleic acid) sequences (EMBL—the Eu-ropean Molecular Biology Laboratory, and Gen-Bank), protein structure (PDB, the Protein DataBank), motifs (PROSITE, a database of protein fam-ilies and domains, and PRINTS, a compendium of pro-tein fingerprints), and sequence matching (BLAST,Basic Local Alignment Search Tool). Others aremore specialized; for example, Yeast Proteome Da-tabase (YPD) and FlyBase are species-specific.

This network of information services forms a loosefederation of autonomous, distributed, heteroge-neous data repositories, ripe for information inte-gration.2 A number of approaches of varying sophis-tication have been adopted, from Web-basedbrowsers to data warehouses.3 The characteristicsof this collection of resources are worth expressingin order to set the context for the rest of the paper,and our approach in particular.

rCopyright 2001 by International Business Machines Corpora-tion. Copying in printed form for private use is permitted with-out payment of royalty provided that (1) each reproduction is donewithout alteration and (2) the Journal reference and IBM copy-right notice are included on the first page. The title and abstract,but no other portions, of this paper may be copied or distributedroyalty free without further permission by computer-based andother information-service systems. Permission to republish anyother portion of this paper must be obtained from the Editor.

GOBLE ET AL. 0018-8670/01/$5.00 © 2001 IBM IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001532

Page 2: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

The data resources are frequently not databases inthe conventional sense in that they do not have aseparate schema containing their meta-data (or ifthey do, it is not freely accessible), and they do nothave a declarative query language such as SQL (Struc-tured Query Language). Most are tools, processes(e.g., sequence alignment), or proprietary flat filestructures containing embedded meta-data, with alimited set of parameterizable services accessedthrough a call-based interface. Little distinction ismade between databases (e.g., SWISS-PROT) and tools(e.g., BLAST). The sources have complete autonomy,continually extending their intensional and exten-sional coverage.

The resources are poorly integrated and difficult touse together (partially a consequence of the previ-

ous point). This condition is a drawback if we con-sider the complex retrieval tasks that biologists work-ing in this environment are typically required toundertake. For example, consider the query in Fig-ure 1, a typical query in drug target detection. Bi-ologists must:

● Construct their own view of the meta-data in eachsource (the intension) and the instances coveredby that source (the extension), resolving any se-mantic heterogeneities between the sources (e.g.,SWISS-PROT covers some information on proteins,PROSITE covers motifs on protein sequences, andBLAST is a tool for matching sequences)

● Construct the various parts of the request in thedifferent formats and terms required by the dif-ferent sources, taking care to resolve structural and

Figure 1 Running example Query 1: A typical bioinformatics query over multiple data resources

Query 1: Select motifs for antigenic human proteins that participate in apoptosis and are homologous to the lymphocyteassociated receptor of death (also known as lard).

Translation: Select patterns in the proteins that invoke an immunological response and participate in programmed cell death thatare similar in their sequence of amino acids to the protein that is associated with triggering cell death in the white cells of theimmune system.

(A) Concept expression in GRAIL:

Motif which<isComponentOf (Protein which

<hasOrganismClassification SpeciesFunctionsInProcess ApoptosisHasFunction Antigen isHomologousToProtein which <hasName

ProteinName>)>)>

Species: Is instantiated by value “human”ProteinName: Is instantiated by value “lard”

(C) Informal query plan:

• Select proteins with protein name “lard” from SWISS-PROT• Execute a BLAST sequence alignment process against SWISS-PROT results• Check the entries for apoptosis process and antigen function• Pass the resultant sequences to PROSITE to scan for their motifs

(D) CPL expression:

set-unique {(#motif1:motif1)I\protein3 <- get-sp-entries-by-de("lard"), \protein2 <- do-blastp-by-sq-in-entry(protein3),Check-sp-entries-by-kwd("apoptosis",protein2), check-sp-entries-by-de("antigen",protein2),Check-sp-entry-for-species("human",protein2), \motif1 <- do-ps-scan-by-sq-in-entry(protein2)}

(B) Equivalent expression inALC standard Description Logic notation:

A ≡ Protein ∃ hasName.ProteinNameB ≡ Protein ∃ isHomologousTo.A

∃ hasFunction.Antigen∃ functionsInProcess.Apoptosis∃ hasOrganismClassification Species

Motif ∃ isComponentOf.B

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 GOBLE ET AL. 533

Page 3: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

semantic differences between seemingly similar in-formation (e.g., the global unique identifier of aprotein is its accession number, but these are in-consistent between sources)

● Locate and communicate with the sources, andprocess intermediate results into appropriate in-put formats for successive stages

● Interoperate between resources, unprotected fromthe vagaries of the various services provided, plan-ning a series of requests that pick from each re-source the information relevant to the query, andtracking and linking related instances through thesources. The requests usually incorporate sometransformation processes that link together strictlyordered chains of retrieval, filtering, and process-ing procedures. There are often many alternativeways to resolve a request, which have varying ef-ficiency, and the user has to choose among theseoptions.

Heavy reliance is made on biologists’ knowledge ofmolecular biology and bioinformatics, and their in-terpretation of each source whose intensional andextensional coverage is dynamic. If biologists wishto go beyond the standard provision offered by pre-defined query systems such as SRS (Sequence Re-trieval System),4 they must resort to developing theirown analysis programs.

In this paper we present a prototype mediation sys-tem called TAMBIS (Transparent Access to MultipleBioinformatics Information Sources), designed to re-lieve biologists of the task of choosing, combining,and interacting with the resources required to an-swer their queries. The TAMBIS architecture is ser-vice-oriented, centered on an extensive source-in-dependent global ontology of molecular biology andbioinformatics, represented in a Description Logic,and managed by a terminology server. The ontolog-ical services are used extensively by the user inter-face component and during the query transforma-tion process. The emphasis in TAMBIS is on thefollowing:

● High transparency. TAMBIS aims to provide the il-lusion of a single query language, a single datamodel, and a single location. Biologists express re-quests independently of any knowledge of thesources.

● Read-only access. The autonomy of the sources isa basic requirement, and performing updates onthe integrated view of the data resources is inap-propriate and undesirable.

● Complex queries expressed over multiple diverse

data sources. This characteristic implies a retrieval-oriented architecture in which there is a require-ment for the production of efficient and correctquery plans over multiple sources; the emphasisis on the coordinated interoperation of diversesources.

● Heterogeneity management. As one would expect,there is considerable syntactic heterogeneity be-tween the various bioinformatics sources, of thekind traditionally classified in the integration lit-erature. There is some semantic heterogeneity, al-though the sources have limited overlap in theirintensions or extensions.

● A visual query interface. Many researchers2,5 stressthat visual query interfaces are very important forfacilitating the interaction of scientists with com-ponent databases, and that users are not preparedto handle query languages such as SQL. Currentgraphical user interfaces to biological sources gen-erally only support the specification of predefinedqueries or have limited query languages (e.g.,SRS4).

Many other researchers have proposed the use ofontologies for integrating heterogeneous sources,6

although the approach has serious limitations.7,8 Thescope of the TAMBIS project is unusually wide, cov-ering not only the ontological representation of acomplex domain, but also the expression of queriesin a visual interface and the execution of those que-ries over highly heterogeneous sources. We believethat TAMBIS has the following distinctive features:

● An unusually rich domain ontology, which cur-rently contains around 1800 biological conceptsand their relationships and is capable of inferringmany more by way of compositional constraints en-compassed in the ontology, known as sanctioning.Its coverage includes proteins and nucleic acids,their motifs, protein structure and structural clas-sification, biological processes, and functions.

● A Web-based dynamic, compositional query for-mulation and ontology browsing interface entirelydriven by the ontology service. The user interfaceensures that only biologically coherent queries canbe expressed and acts as an effective tutorial onthe services available to the biologist.

● A query translation and planning process that iden-tifies appropriate sources, plans an efficient wayof executing a query, and generates an executionplan for use with a middleware layer. Sources ac-cessed through TAMBIS need not themselves pro-vide query language interfaces.

GOBLE ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001534

Page 4: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

The TAMBIS pilot is currently in the field for eval-uation trials by biologists. The pilot uses a subset ofthe ontology (250 asserted concepts covering pro-teins), with complete mappings to five of the mostpopular bioinformatics resources: SWISS-PROT (pro-tein sequences); PROSITE (protein motifs); BLAST (se-quence homology); ENZYME (data bank of enzymeclasses); and CATH (a hierarchical classification ofprotein domain structures, clustering proteins at fourlevels: class, architecture, topology, and homologoussuperfamily, thus structural classification). With justthis collection, the TAMBIS pilot can pose a signif-icant number of the queries desired by biologists.9

Note that, in line with the points above, the sourcescover connected but complementary and barely over-lapping information.

In the next section of this paper we present theTAMBIS Ontology (TaO) and the terminology ser-vices, and discuss the utility of combining a descrip-tion logic with a compositional constraint system. Inthe subsequent section, we describe the TAMBIS ar-chitecture and its use of the ontology and wrapperservices. Then we give an idea of TAMBIS in use,drawn from the pilot implementation. In the fifth sec-tion we present related work in bioinformatics andinformation integration, concluding the paper witha summary.

TaO: The TAMBIS global domain ontology

Biologists’ knowledge of molecular biology andbioinformatics, and their interpretation of the re-sources with respect to this knowledge, is essentialto the task of combining resources to answer que-ries. Practical exploitation of knowledge-based in-formation integration systems has often been hin-dered in the past by the lack of suitable ontologiesin challenging domains. Bioinformatics researchershave recognized that semantic schema and datamatching could be aided by a comprehensive the-saurus of terms or a reusable reference ontology ofbiological concepts.2,10

TAMBIS uses a global, source-independent domainontology to provide a unified conceptual level rep-resentation of its registered component resources.The ontology is described fully by Baker et al.11

Bioinformatics is mainly concerned with the studyof the protein and nucleic acid of biological macro-molecules. Hence, these form the core of the ontol-ogy. The databases and the tools that TAMBIS cov-ers describe the principal concepts of the molecularbiology and bioinformatics part of the ontology: mac-

romolecules and their motifs, their structure, func-tion, cellular location, and the processes in which theyact.

The ontology was developed by a biologist and abioinformatician over a period of two years, usinga range of modeling tools12 and feedback from the

TAMBIS interface itself. A mixed top-down and bot-tom-up iterative methodology was employed. Thetop-down component extended the upper levels ofan ontology previously developed for a medical ap-plication13 and reused the taxonomy of CATH. Thistop-down part of the ontology construction was com-plemented with a bottom-up approach that addedfurther concepts to these general concepts such asspecific motifs, kinds of secondary and tertiary struc-ture of molecules, and cellular components. Theseconcepts were gathered from the coverage of the da-tabases themselves, their schemas, and their keywordcollections. This bottom-up part of the constructionhelps ensure that the ontology covers the informa-tion in the sources themselves. Part of the futurework of TAMBIS is to develop tools to facilitate thisprocess. The emphasis is on the information cover-age of these popular data sources and was specif-ically developed for a retrieval task. Only passing at-tention was paid to reusability and encoding bias.The ontology is organized into subdomains aroundprotein structure, function, homology, location, andprocess, and is organized into those concepts con-cerned with molecular biology and those concernedwith bioinformatics (for example, accession numbersof protein database entries). For a detailed descrip-tion see Baker et al.11 No other biological ontolo-gies were reused at the time—the TAMBIS ontologypredated the Gene Ontology14 by some years, forexample.

The ontology is described using the DescriptionLogic GRAIL (GALEN Representation and Integra-tion Language),15 which is a formal and declarativerepresentation. Descendants of the KL-ONE knowl-edge representation system, Description Logics16

(DLs) describe the domain in terms of a limited set

TAMBIS uses a global,source-independent

domain ontology.

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 GOBLE ET AL. 535

Page 5: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

of primitive concepts that denote a set of individuals(instances) and roles that denote a set of binary re-lationships between individuals. An individual in thedenotation of a concept is said to be an instance ofthat concept. Recursive term constructors associateconcepts and roles to define new complex compo-sitional concepts. A concept filling the value of roleis a role filler. Using this Description Logic, the on-tology comprises:

● A vocabulary for representing and communicat-ing knowledge about molecular biology and a setof relationships that hold among the terms in thatvocabulary

● A logic-based framework for reasoning about con-cepts and inter-relationships; for example, to in-fer that a concept is either more specialized thananother or that a new concept is inconsistent withthe rest of the ontology

● A constraint system, referred to as sanctioning, tocontrol the combination of terms

● Constants represented as values and some individ-uals represented as nominals. GRAIL, in commonwith many DLs, has only reasoning over the ter-minology itself. Nominals are instances in the on-tology. For example, Heam could be argued to bean instance of prosthetic group and not a class inits own right.

● Instantiation directives that indicate if a role filleris instantiable with a value (e.g., Protein Name orSpecies)

● Visibility directives that indicate whether an ab-stract concept should be visible to a browser

Reasoning services for description logics. The se-mantics of the term constructors is sufficiently well-defined to support reasoning about the concept de-scriptions. Consequently, Description Logics providea variety of reasoning services16 that make them at-tractive as models for describing complex and incom-plete information, including:

● Subsumption: One concept is said to subsume an-other when its extension must be a superset of thesubsumed concepts’ extension, as a logical conse-quence of their descriptions. Primitive conceptsare just described as inclusion assertions (con-ceptsubsumee # conceptsubsumer). For defined compo-sitional concepts, subsumption is automatically in-ferable using a suitable algorithm that is provablycomplete and decidable.17

● Classification: By using subsumption tests, a col-lection of conceptual definitions can be organizedinto a partial order. Newly defined concepts intro-

duced to a pre-existing lattice are positioned intotheir correct place, dynamically evolving the clas-sification structure.

● Satisfiability: Given a concept definition, we candetermine whether the concept is satisfiable withrespect to the subsumption lattice, i.e., if we can-not classify the concept, then it is unsatisfiable.

● Retrieval: Given a concept definition, we can re-trieve all the instances of that concept, which in-cludes all instances of subsumed concepts.

Description Logics balance expressivity, complete-ness, and tractability. In the past, results concerningthe intractability of reasoning within DLs have beenused to dismiss their use in real-world applications.Although, in the worst case, DL languages are gen-erally known to be intractable, there have been tre-mendous advances in the last five years in the DLcommunity in implementations of optimized reason-ing engines, which can deliver realistic performancefor practical applications. This is even true of highlyexpressive DLs such as FaCT (Fast Classification ofTerminologies) and the SHIQ18 reasoner logic. Thus,the old arguments of empirical intractability are out-dated. GRAIL is less expressive than many other De-scription Logics and has a tractable subsumption test.GRAIL has only one concept-forming operator, which,and its language restricts expressions to be conjunc-tive ones with existential role quantification of theform (keywords are in italics):

BaseConcept which ,role1 Concept1, . . . ,rolen Conceptn. name ConceptName

Although GRAIL has a simple language and is com-paratively inexpressive, we had a number of tech-nical and pragmatic reasons for adopting its use ini-tially. GRAIL compensates for its limited expressivityby supporting transitive roles, role hierarchies, a pow-erful set of concept assertion axioms, and a novelmultilayered sanctioning mechanism for roles. De-fining the role partOf to be transitive allows the queryprocessor, for instance, to reason that a protein thatis part of the inner mitochondrial membrane is alsopart of the mitochondrion. This can be useful whenthe sources only allow the larger grained query tobe asked, but local processing can achieve the finergrained request. Both transitivity and role hierarchiesallow sophisticated descriptions to be made for con-cepts that allow the reasoner to automatically infersubsumption (for example, a cytosolPart is a kind ofcellularPart, so anything defined as a kind of cytosol-Part is also a kind of cellularPart). The sanctioning

GOBLE ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001536

Page 6: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

mechanism is crucial to the TAMBIS query formula-tion interface, since it constrains what concepts usersmay form, thus guiding them to form meaningfulqueries.

Sanctioned term construction. GRAIL operates a kindof “closed world” model. To restrict the construc-tion of complex concepts to only those that are se-mantically meaningful (in terms of the ontology),GRAIL provides rules or sanctions that dictate whichroles may legitimately be applied to which concepts.This model is quite different from the usual role re-striction approach normally found in DescriptionLogics. Two concepts can only be combined by a roleif they have been explicitly sanctioned to do so. Sanc-tions are inherited. The sanctioning of concepts andtheir relationships allows a large number of complexconcepts to be inferred (or generated) from a rel-atively sparsely populated model. Two levels of sanc-tioning are provided: grammatical and sensible.Grammatical sanctions express abstract or generalrelationships between classes of things, whereas sen-sible sanctions indicate that instantiable composi-tions can be built. A grammatical sanction must bein place before a sensible sanction can be made.

Figure 2 shows the sanctioning of the relationshiphasComponent at the grammatical and sensible lev-els. The relationship between the concepts Biomol-ecule and StructuralComponent is sanctioned at thegrammatical level because it is grammatically per-missible to speak of biomolecules having structuralcomponents, but not all kinds of biomolecule canlegitimately have any kind of structural component.The solid arrows in Figure 2 show explicit subsump-tion relationships. Thus, Protein is subsumed by

Biomolecule and an AlphaHelix is subsumed by Struc-turalComponent. The hasComponent relationship be-tween the concepts Protein and AlphaHelix is sanc-tioned at the sensible level because any kind ofprotein could legitimately have an alpha helix. How-ever, not all biomolecules will have alpha helices(DNA, for instance) —sanctioning is about represent-ing the possibility of composition, not its necessity.Although grammatical sanctions on their own do notpermit the construction of instantiable compositedefinitions, they do represent valid queries that maybe formed. In the example, asking for all biomol-ecules that have some structural component is a validquestion. Given a set of concepts and suitably sanc-tioned roles, we can systematically combine them tocreate all possible legitimate composite concepts. Formore on the ontology itself and the sanctioningmechanism, see Baker et al.11

Querying. Description Logics are naturally suited forexpressing queries and defining views.16 A conceptformed as a query is resolved when its extension isretrieved. For example, the concept Protein whichhasFunction Receptor describes the class of receptorproteins; retrieving the instances of that concept an-swers the query “find all proteins that act as recep-tors.” The subsumption hierarchy is effectively aquery inclusion hierarchy. We can navigate the lat-tice to find concepts that are the direct subsumersand subsumees of a concept. Such query generali-zation or specialization and imprecise querying issupported by: generalizing or specializing conceptor role terms (through role hierarchies) and role fill-ers; relaxing or restricting constraints on roles; andadding or removing terms. Abstract or intensionalanswers are possible as well as the enumeration of

Figure 2 Sanctioned roles

StructuralComponent

ElementOfSecondaryStructure

AlphaHelix DNAProtein

Biomolecule

NucleicAcid

grammatically hasComponent

sensibly hasComponent

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 GOBLE ET AL. 537

Page 7: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

instances, for example, “What kinds of cellular pro-cesses are there?” or “What things can a motif bea component of?”

Terminology services. The reasoning services asso-ciated with GRAIL are encapsulated within a Termi-nology Server (TeS). The TeS supports concept rea-soning, role sanctioning, thesauri, and extrinsics19

services. The TeS supports a range of ASK and TELLinteractions.16 For example, given concepts A andB: Is A satisfiable? Does A subsume B? What arethe direct and indirect subsumers or subsumees ofA? What are the sanctioned roles and role fillers forA? What is the English language expression for A?Is the role filler for role r on A instantiable? Is roler1 the inverse of role r2? For further discussion seeBechhofer and Goble.20 In the next section we showhow the global ontology and the sources relate, andwe outline how the TeS services are used in theTAMBIS components.

Architecture

The TAMBIS project adopts a top-down approach tointegrating information sources, with three layersgiven in Figure 3, using a mixture of procedural(wrappers) and declarative (ontology) interopera-tion.21 The domain ontology was constructed first,with mappings from the model to the underlying datasource schemas being determined subsequently.

The physical model. The data resources are encap-sulated by wrappers, described in the functional mul-tidatabase language CPL, the Collection Program-ming Language.22 The underlying data structures andfile formats are converted to the nested, value-basedmodel of CPL (see part D of Figure 1). CPL modelscomplex data types such as lists, sets, and variants,with drivers (wrappers) that execute requests overdata sources. Type coercions between functions ofthe different wrappers are handled at the wrapper

Figure 3 The TAMBIS models

SOURCE 1 SOURCE 2

CONCEPTUAL MODEL

MAPPING MODEL

PHYSICAL MODEL

GLOBAL DOMAIN ONTOLOGY CONCEPTS:MOTIF WHICH <isComponentOf (Protein which hasAccessionNumber AccessionNumber)>

MAPPING OF GLOBAL DOMAIN CONCEPTSTO LOCAL SOURCE CONCEPTS

SOURCE-INDEPENDENTCONCEPT REWRITING RULES

RESOURCE-SPECIFIC CONCEPTSACCESSIBLE THROUGH FUNCTIONS

RESOURCE-SPECIFIC CONCEPTSACCESSIBLE THROUGH FUNCTIONS

WRAPPER CONCEPTS,COSTS AND CARDINALITY DATATYPE COERCIONS

GOBLE ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001538

Page 8: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

level, as is location and format transparency. Con-sequently the wrappers are quite substantial. Thereis no local ontology in the sense of OBSERVER;23 in-stead there are collections of functions that form theapplication programming interface (API) for thesource. These functions provide a physical, ratherthan a logical, level of mapping since aspects suchas alternative access paths are presented by functions.No cross-model assertions or representations exist;all intersource mapping is channeled through theglobal ontology. In effect the ontology coordinatesintersource management.

The conceptual model. The global ontology is a uni-fied conceptual level representation of its registeredcomponent resources, encompassing the conceptsmade accessible through their CPL wrapper functions.Source-independent queries are formulated in thesame language as the conceptual description, hid-ing the sources from the user. However, the globalontology is more than the union of the schemas ofits sources in that it provides an abstract frameworkfor relating, reconciling, and coordinating the con-cepts of the sources. There are also queries (or sub-queries) that can be answered intensionally basedon the ontology alone.

The mapping model. The role of the query proces-sor is to convert a query phrased only in terms of theconceptual layer into an executable plan in terms ofthe classes and methods of the physical layer. To dothis, a range of mappings is required. These map-pings are used either during query planning or byqueries at run time. The mappings constitute theTAMBIS Sources and Services Model (SSM), whichrelates the wrapper services in the sources with theirconceptual counterparts in the domain ontology.Currently, the sources and services data are con-structed manually. Use of the same information usedin the bottom-up part of the ontology constructionhelps to ensure that the SSM is complete—each con-cept is systematically checked to see whether it hasa commensurate mapping. This task is, however, dif-ficult to perform by hand, especially when sourcesor concepts change. Part of the future work onTAMBIS will be the development of tools to managethis mapping of concepts and relationships to wrap-per classes and values in the sources.

Mapping concepts and roles to functions. Seven cat-egories of mapping capture the relationships betweenthe conceptual and physical representations of que-ries.24 For example, the iteration mapping indicatesthat the instances of a concept can be obtained by

iterating over values obtained by evaluating somefunction. An iteration mapping is thus a pair, con-sisting of a concept description and the descriptionof the function that can be used to retrieve the in-stances of the concept. Each function carries CPLtype, cardinality, and cost information. For example:

, concept: protein,function: , name: ''get-all-sp-entries'',

arguments: [ ],resultType: ''protein_record'',cardinality: 80000,cost: 1000,source: ''SwissProt''

..

In the above display, the CPL function, get-all-sp-en-tries, is declared to be able to return all proteins inthe source SwissProt, in the form of values of CPLtype protein_record. A more specialized concept mayprovide an alternative approach to iteration. Forexample, instances of the concept Protein whichhasFunction CatalysisProtein which catalyzes Reactioncan be obtained from the source Enzyme:

, concept: Protein which hasFunctionCatalysisProtein which catalyzes Reaction,

function: , name: ''get-all-enzyme-entries'',arguments: [ ],resultType: ''enzyme_record'',cardinality: 5000,cost: 200,source: ''Enzyme''

..

The ontology is used during the query transforma-tion process as a semantic index to the wrapper meth-ods, and the subsumption mechanism is used to se-lect the most specialized mapping available. Forexample, there is no entry for the concept Proteinwhich hasFunction Receptor, but there is an SSM en-try for the concept Protein which hasFunction biologi-calFunction. The latter concept subsumes the formerconcept and the class of sources and services rules.Another entry indicates that a mapping can be usedfor the role filler Receptor as an argument to themethod indicated by the subsuming concept. Thus,the ontology guides the choice of an appropriatemethod for the query concept component. Source-independent rewrite rules govern the choice of map-pings and how the query components are com-bined.24

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 GOBLE ET AL. 539

Page 9: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

Mapping values. It is sometimes the case that a con-cept in the ontology maps onto a scalar value in aCPL query. For example, in Figure 1, the conceptname Apoptosis maps (rather directly) to the string“apoptosis.” These mappings can, however, be muchless obvious. For example, the concept Kinase mapsto “2.7.-.-” in a classification scheme used by thesource Enzyme. This would be represented by thetriple:

, Kinase, ''Enzyme'', ''2.7.-.-'' .

Architectural components. To realize the abovemodel, TAMBIS has five major components organizedinto a classical mediator-wrapper three-layer mod-el: a presentation layer, a mediation layer (dealingwith mappings), and a wrapper layer (dealing withthe physical models), as shown in Figure 4.

The Terminology Server, discussed earlier, is exten-sively used in the query formulation and transfor-mation and in the Sources and Services Model. Thecomponents are written in, or interfaced to, theJava** programming language.

1. Query formulation interface: This is a graphicaland forms-based interface in which the userbrowses the ontology and forms complex concep-tual requests without having to memorize termsor be aware of the relevant information sources.The result is a declarative GRAIL concept expres-sion. The interface has a desirable side effect ofbeing a tutorial for biologists new to bioinformat-ics. More details on the interface are provided inthe next section.

2. Query transformation processor. The input to thequery processor is a GRAIL query, and the outputis a CPL program. Because the CPL program rep-resents an ordered collection of function calls, thequery processor must not only generate a validexecution plan, it must also take into account thelikely performance of alternative implementationsfor a query. Briefly, the process resembles a tra-ditional database query planner: the conceptualquery is decomposed into a collection of querycomponents, and, using a cost model, alternativeevaluation orders are identified and ranked. Theways in which each query component can be eval-uated are generated with reference to the map-pings described as outlined in the subsection onthe mapping model. The subsumption service ofthe TeS is used in the selection of the most spe-cialized mapping available at each point, so the

mappings are effectively indexed by the conceptmodel. The query processor is described in de-tail in Paton et al.24

3. Wrapper service. The wrapper service is providedby BioKleisli, a low-level mediation system de-veloped for the biological community22 that of-fers format and location transparency, but doesnot hide the sources from the user and does notoffer schema or data reconciliation. BioKleisli isbuilt on CPL. The service coordinates and dis-patches the execution plans generated by TAMBISto the appropriate wrapped component informa-tion services. Results are returned in HyperTextMarkup Language (HTML) as a Web page in alocal Web browser.

A flavor of TAMBIS

The TAMBIS interface, dynamically driven by the TeSservices, supports the following:

● Browsing the ontology, so that users can find outwhat they can retrieve and how they can ask ques-tions. The browser acts as an educational guide toinformation availability.

● Controlled incremental building and manipulationof query expressions through interaction with a vi-sual representation of the ontology

● Instantiation of certain concepts with values (e.g.,species with human)

● Identification of concepts that should be returnedin the result

● Bookmarking queries

Users may construct complex queries by combiningappropriately sanctioned concepts and roles. So,rather than testing whether an expression is correctby classifying it (as in other Description Logics), theinterface forces users to create only expressions thatare classifiable. The user interface does not forceusers to freely type their expressions; instead theyare able to select from options presented to themthat are guaranteed to produce a legal expression.The user interface guides the user as to which rolesmay be applied to any given concept at any givenpoint during the construction of a query concept.Consequently, we need to be able to navigate be-tween concepts and to expose which roles can legallybe applied to a concept.

Figure 5 shows one of the ontology browsers on theconcept Protein. The concept currently in focus oc-cupies the center of the frame, with its subsumingand subsumed concepts and its sanctioned roles dis-

GOBLE ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001540

Page 10: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

played around it. Another TAMBIS browser, Figure6, presents the local subsumption hierarchy aroundthe concept; this is used for substituting a more gen-eral, more specific, or sibling concept when manip-ulating a query expression.

The sanctioning mechanism allows us to explore po-tential legitimate relationships between concepts, re-stricting the user to asking questions it is sensible to

ask. The role browser displays the legally sanctionedrole-role filler pairs that may be applied to a baseconcept. Figure 7B gives those that can be appliedto Protein which HasComponent Motif. Selecting therole filler selects the role (unfilled roles are not ca-tered for).

Figure 8A shows the query expression for the queryin Figure 1, and Figure 8B shows the CPL that the

Figure 4 The TAMBIS component architecture

ONTOLOGYBROWSER

QUERYFORMULATION

PRESENTATION LAYER

QUERYFORMULATIONDIALOG

TaODOMAINONTOLOGY

TERMINOLOGY SERVERCONCEPTUALMODEL

LINGUISTICMODEL

WRAPPER SERVICE

WRAPPER LAYER

WRAPPERCLIENT

WRAPPERCLIENT

CPLFUNCTION

SOURCES AND SERVICES

MEDIATION LAYERMAPPINGMODEL

ONTOLOGY <->FUNCTIONSMAPPING

COSTS ANDCOERCIONS

QUERY TRANSFORMATION

PHYSICALMODEL

REWRITERULES

INTE

RFA

CE

CPLFUNCTIONS

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 GOBLE ET AL. 541

Page 11: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

rewriter has produced. A query is built up from abase concept by adding or removing criteria usingthe role browser and substituting concepts for sib-ling concepts or those that are more specific or gen-eral by using the explorer browsers. The equivalentEnglish expression for the concept is generated us-ing the linguistic capabilities of the TeS and displayedat the top of the window. The subquery componentsof the query can be aggregated into complex expres-sions that can be manipulated semi-independently,in that the browsers operate on them independently,but the options available are dependent on the over-all context of the query expression.

Although there is little space here to describe themanipulation interface in detail, we can give an ideaof it. The primary results of Query 1 in Figure 1 aremotifs, and the concept representing motifs formsthe base of the query concept. The TAMBIS mainscreen holds a “find” dialog box with Explore andBuild Query buttons, or bookmarks for previous que-ries. Suppose we pick a bookmark for the conceptMotif isComponentOf Protein. Selecting “Build Query”launches a query builder window, with the query rep-resented as two linked buttons (as in the two linkedbuttons near the top of Figure 8A). This concept rep-resents all motifs that are components of proteins.

Figure 5 An ontology browser showing the concept hierarchy and sanctioned roles for Protein

GOBLE ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001542

Page 12: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

We need to restrict the query so that it applies toonly certain sorts of proteins. Clicking on the “pro-tein” button invokes a menu displaying options foractions upon this concept (the concept menu). If wechoose to Explore protein, we obtain an ontologybrowser as in Figure 5. If we choose to “restrict viaa relationship,” the ontology is asked what relation-ships this concept can hold. The answer returned bythe TeS is displayed for the user as a list, as in Fig-ure 7B. Criteria selected and accepted are added tothe concept in the query builder window. In this case,the criteria functionsInProcess cellularProcess, has-Function biological function, isHomologousTo protein,and hasOrganismClassification species are chosen.Now the query builder shows a new concept shownin Figure 7A. Each constituent concept is shown asa button, and the concepts are linked by lines rep-resenting the relationships between them. The “spe-

cies” concept button is in an italic typeface—this in-dicates that the user may set a value for that concept.Choosing “set value” from the concept menu invokesa dialog so that the string “human” can be assignedto this concept. In this sense, the user is as much aninformation source as any of the bioinformatics da-tabases. This value is also displayed in the querybuilder.

The Protein filler of isHomologousTo denotes homol-ogy to all proteins and hence should be restrictedby, perhaps, a hasName relationship in the same wayas before. The concepts cellularProcess and biologi-calFunction are perfectly permissible but would re-sult in a very general query. To restrict the query toproteins whose cellularProcess is apoptosis andwhose biologicalFunction is antigen, these conceptsshould be specialized. By selecting the concept menu

Figure 6 A local subsumption hierarchy browser centered on cellular process

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 GOBLE ET AL. 543

Page 13: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

option “replace with a kind of this concept” on cel-lularProcess, the user interface asks the TeS whatthe parents and children of the concept are and dis-plays them as a hierarchy as shown in Figure 6. Theuser navigates to the more specialized concept ap-optosis and selects “Replace,” substituting apopto-sis for cellularFunction in the query. A similar ap-proach is taken for biologicalFunction. The finalresult is Figure 8A. Pressing the “Submit” button ini-tiates the process of transforming the conceptualquery into the concrete query plan shown in Figure8B.

The compositional concept Query 1 has been clas-sified and installed into the classification lattice asa result of its construction. Figure 9 shows the ex-plorer browser on Query 1, illustrating some of thedefined compositional concepts (bottom left). Thisdemonstrates that the interface dynamically uses the

TeS services to access the current state of the clas-sification lattice. The visualization of defined con-cepts is usually disabled because it can be disturbingfor users.

A number of ontology browsers have been designedto support ontology development25,26 or retrieval.27

However, query construction has been left toHTML-form-based query tools that require the userto have some knowledge of the query language andthe modeling language (e.g., what a frame or slot is).In TAMBIS users are shielded from these concerns

Figure 7B The roles and fillers sanctioned for Protein thathas a Motif component

Figure 7A A rather general query expression after selectingroles and fillers

GOBLE ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001544

Page 14: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

leaving them to concentrate in the domain ratherthan in the knowledge representation of the domain.In addition, the graphical layout makes the querymuch more readable and, consequently, comprehen-sible. More details on the interface can be found inBechhofer et al.28

Related work

The closest work to TAMBIS in the biological domainis the Object-Protocol Model (OPM),29 which usesan object model to implement a unifying schema.OPM does not provide source transparency, and que-ries are expressed in a variant of Object Query Lan-guage (OQL), or in a link-following manner througha graphical interface. SRS,4 Entrez,30 and BioNavi-gator31 link several databanks and processes togetherthrough World Wide Web (WWW) front ends. Theirsource-linking functionality is similar to TAMBIS, butthere is no source or schema transparency, no abil-

ity to issue complex declarative and multidatabasequeries, and limited flexibility, since only a limitedrange of access paths through predefined links or re-trieval functions are available. Systems such as Bio-Kleisli32 and DiscoveryLink33 are lower-level middle-ware solutions that concentrate on multisource querylanguages, wrapping sources, and intersource queryoptimization. They offer format and location trans-parency but do not hide the sources from the userand do not offer schema or data reconciliation.TAMBIS sits on top of such a middleware layer as wehave already done with an earlier version ofCPL/BioKleisli. 22

Many proposals have been made that use domainontologies, often expressed in a Description Logic,in intelligent “read-only view” mediation ser-vices.23,34–36 However, none of these has the ontology-driven query dialog of TAMBIS. But most are alsobased on query decomposition and expansion andsend subqueries to source databases.37 UnlikeTAMBIS, most [e.g., Levy et al.35 and Mena et al.23]target sources that provide declarative query facil-ities. Many34,35,38–40 adopt a “top-down” approach us-

Figure 8A The query builder expression for Query 1 inFigure 1

Figure 8B The CPL produced

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 GOBLE ET AL. 545

Page 15: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

ing a global schema encompassing relevant informa-tion. However, TAMBIS does not express the data heldin the sources as views over this schema.41 SCOPE42

and DQW (Data Quality Warehouse)21 emphasize se-mantic reconciliation of heterogeneous sources, us-ing ontologies to identify interschema semantic re-lationships and representing them as assertions. InTAMBIS there are no interschema relationships; allmappings are from local concepts to global ones. Theemphasis is not on discovering semantic similarities

or conflicts but on managing the already identifiedglobal-local concept mappings.

Carnot36 used the Cyc knowledge base as the globalschema, which is a general ontology rather than thedetailed domain ontology used in TAMBIS. The scopeof its follow-on, InfoSleuth,43 is more concerned withthe provision of a generic agent architecture for in-formation integration than the development of a spe-cific user-centered retrieval mediator such as TAMBIS.

Figure 9 Explorer browser on Query 1 from Figure 1

GOBLE ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001546

Page 16: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

A particular application of InfoSleuth, EDEN,44 andCoopWARE40 uses a global domain ontology to sup-port agent-based semantic interoperability.

Unlike TAMBIS, Information Manifold35 adopts a “lo-cal as view” approach,41 where the sources are de-scribed as views over a global ontology expressed inthe CLASSIC Description Logic system. This approachpotentially makes the source descriptions more mod-ular, proving useful when sources change or join thefederation. However, efficient execution of queriesexpressed in the ontology across multiple sources ismuch more difficult. OBSERVER23 associates eachsource with an ontology (again using CLASSIC) thatdescribes its contents. It thus exploits existing on-tologies or describes each source using a new on-tology, and relates them. However, query process-ing is targeted more at the selection of a single sourcefor answering a query than at global planning for que-ries run over multiple sources. TSIMMIS (TheStanford-IBM Manager of Multiple InformationSources)38 uses a lightweight integrating objectmodel and places great emphasis on wrappers, con-centrating on automated wrapper generation. Thereis no attempt to create a declarative integrating do-main model, and no schema reconciliation. The clos-est project to TAMBIS is SIMS.34 It is based arounda source-independent domain model expressed inthe LOOM Description Logic system; queries ex-pressed against this model are rewritten to source-dependent queries also in LOOM. The query plan-ner and optimizer of SIMS have influenced those inTAMBIS. TAMBIS differs from SIMS in that our sourcesrarely provide query interfaces, and in having a vi-sual query construction interface provided byTAMBIS. The first TAMBIS prototype reported heredoes not follow SIMS in having source models in theDescription Logic, though the next generationTAMBIS adopts a variant of this approach.

Discussion

TAMBIS aims to provide complete transparent access:transparency of location, source, and data structures.A substantial global domain ontology supported bya range of terminological services is the cornerstoneof the provision of this transparency. The biologistinteracts with a single schema through one interface,using a single language. The schema and query lan-guage are effectively the same. The ontology of some1800 assertions was developed by a biologist andbioinformatician.11

A visual query interface that supports the explorationof the ontology and the controlled incremental cre-ation and manipulation of complex query expressionsis made possible through the use of the terminolog-ical services. We know of no comparable query in-terface supported by a knowledge-based informationintegration system. Although the interface lookssomewhat like a form-based interface to a databasesystem, it is able to detect a range of biologically non-sensical queries that more conventional query inter-faces to databases would be happy to compile, butwhich are sure to yield empty results.

We can express source-independent declarative com-plex queries that range across multiple diverse sources.A sources and services model associates conceptsfrom the ontology with source-dependent wrapperfunctions written in CPL. The ontology forms a ca-nonical model to relate the shared concepts to theirsource-specific counterparts and forms a frameworkto manage heterogeneity between the sources. A queryplanner uses the sources and services model and theterminological services to generate a source-depen-dent execution plan from the GRAIL query, which isexecuted through the BioKleisli middleware.

TAMBIS does not assume that individual sources ex-port query facilities—in bioinformatics very fewsources do. Sources are wrapped by CPL at essen-tially the same level as CORBA** (Common ObjectRequest Broker Architecture**) sources arewrapped, so TAMBIS can be seen as generating out-put at a level that is typical of that required for useover widely accepted middleware layers.

The first TAMBIS prototype is operational and acces-sible through a password-protected Java applet athttp://img.cs.man.ac.uk/tambis.html. TAMBIS was de-veloped in close collaboration with biologists in ac-ademic institutions and pharmaceutical companies,and the applet is currently undergoing trials in arange of universities and specialist national bioin-formatics institutes, and at AstraZeneca Pharmaceu-ticals. Five popular resources have been fully inte-grated into TAMBIS covering an extensive range ofreal biological questions that have been difficult topose before, such as that given in Figure 1. TAMBIScan now pose complex questions that the datasources cannot answer or on which they hold no data,which in itself is a useful piece of information. Forexample, the following are a list of some of the typesof queries that can be formulated in TAMBIS. Theyrange from standard bioinformatics queries to those

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 GOBLE ET AL. 547

Page 17: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

that are difficult to ask without development of a be-spoke program:

● Find human homologues of yeast receptor pro-teins.

● Find rat proteins that have a domain with a seven-propeller domain architecture.

● Find phosphorylation motifs on human apoptosisreceptor proteins.

● Find protein homologues of a protein with a par-ticular accession number.

● Find the binding sites of human enzymes with zinccofactors.

The evaluation of TAMBIS broadly falls into two cat-egories: technical effectiveness and user usability.From the technical point of view, the TAMBIS pro-totype has demonstrated that the approach is fea-sible, identifying issues of source integration andtransparency, which we discuss later. From the userinterface point of view, we have not undertaken asystematic usability study but rather an informal anal-ysis. On the plus side, the concept browsers and con-trolled query formulation is recognized as a flexibleway of forming complex queries by term combina-tion. This also has the side effect of acting as a tu-torial on what requests to the services are available.User trials have also identified a range of issues in-cluding: the appropriate presentation of a large on-tology on a small amount of screen; the accurate in-terpretation of the query by the user; and theidentification of paths to follow in the constructionof a query concept; for example, is a concept spe-cialized by adding a restricting role or choosing amore specialized subclass? The TAMBIS query inter-face is different from a normal text-based one andtakes some familiarization. The emphasis is on que-rying rather than click-based navigation as in SRS orEntrez. Describing what is wanted rather than de-scribing how to get it is a new paradigm for manybioinformaticians used to thinking in terms of pro-cesses rather than questions. By way of a compro-mise, we suggest that an expert bioinformatician canuse TAMBIS to generate a series of “canned” prede-termined parameterizable queries and make this theinterface for the bench biologist.

Relieving the user of the onerous task of linking thesources passes the creation and maintenance of theillusion of one terminology and one source to themediation system. The TAMBIS approach is a top-down one. The global domain model was constructedinitially, with mappings from the model to the un-derlying data resource schemas determined subse-

quently. This “global as view”41 approach is effec-tive for the purposes of expanding and rewritingqueries against the global ontology, which is the mainpurpose of TAMBIS.

However, this approach carries the cost of (A) build-ing, validating, and maintaining the ontology and (B)building, validating, and maintaining the mediationmappings.8 There is a strong dependency betweenthe ontology and the resources. This is particularlyso because the mappings are often complex. Two fea-tures of the web of resources make these difficulttasks, which is why the mediation process is hiddenfrom the user in the first place. The number of re-sources is large, and the resources change withoutnotification. Resources fall out of use and othersemerge. The relationships between resources changeas well as their relevance or popularity.

The development of a single schema or ontology isa serious and expensive undertaking best tackled asa joint exercise with others by merging and adopt-ing pre-existing ontologies, with the intention thatthe result will be reusable by other applications. Thistask is difficult.45 The use of a single terminology bya mediator requires that the user know what is inthe terminology, understand what the terms and con-cepts mean, and buy into it. Gaining consensus isparticularly difficult because one user’s or commu-nity’s vocabulary might differ from that of another.The ontology will need to be tended and updated tocater to new sources or changes in sources. It alsoneeds to be comprehensive enough to cater to anappropriately adequate range of resource types. In-terpretations of concepts often depend on context,and one ontology cannot be viewed as a repositoryof all possible interpretations.7,45 Attempts to tacklethis issue range from the adoption of de facto com-mon vocabularies by a community prepared to adaptto some form of common consensus, for example,the Gene Ontology,14 to mechanisms for defining on-tological commitment, multiple definitions for con-cepts in the same ontology, and ontological views.Description Logics are particularly convenient forgiving multiple definitions to concepts, unlike frameor other object-based schemes. The TAMBIS ontol-ogy is designed specifically for one task, retrieval overbioinformatics resources, and thus may not be an ap-propriate interpretation for, say, reasoning aboutprotein function.45

The two major issues that we are tackling in the nextTAMBIS prototype are source management andsource transparency. Source management—the in-

GOBLE ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001548

Page 18: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

troduction of new sources and keeping up withchanges—is the time-consuming activity in an envi-ronment where the data sources are autonomous.Wrappers must be developed and mapped to onto-logical concepts through the SSM, remembering thatthe mapping is often complex. Fundamental researchinto a more flexible “local as view” source integra-tion mechanism, where sources are described inde-pendently in terms of a shared terminology, is stillat an early stage.41 The support of automated sourceintegration without hampering the efficient globalrewriting machinery is our next challenge. Integra-tion systems such as TAMBIS are generally hamperedby the paucity of the accessibility functions of thesources. Many offer only point and click interactioninterfaces designed for people, not programs, andHTML or flat file results. Web wrappers that screen-scrape are brittle because sources change their in-terfaces frequently. Consequently, in practice, me-diation systems tend to integrate a rather modestnumber of sources.

TAMBIS, in contrast to similar systems such as OPMand SRS, is very transparent; users are completelyshielded from any notion of which source will be usedto answer their query, or the order in which thesources will be used. In some cases this position istoo extreme; many biologists will be keen to inter-vene in the rewriting process to direct requests totheir favorite sources, to alter the order of execu-tion, or to siphon off intermediate results. At the veryleast they require an explanation of the query planand the results to be decorated with their originat-ing source. Removing the responsibility of integra-tion from the user can have the potentially detrimen-tal side effect of taking control from the user, or usersperceiving that control has been wrested from them.Confidence in the results is related to control. Giventhe volatility of the Web, all users seek reassuranceon the quality and suitability of the resources used,which requires that the provenance of the results,and an explanation of how they were obtained, is asessential as the results themselves.

A second TAMBIS prototype has already been devel-oped that replaces GRAIL in the Conceptual Modelwith a far more expressive and mainstream Descrip-tion Logic, FaCT,18 and replaces CPL in the PhysicalModel with Java wrappers. This forms a platformon which we can address the issues of source inte-gration and transparency.

Acknowledgments

This work is supported by the BBSRC/EPSRC Bioin-formatics Programme grant BIF/05344, by AstraZen-eca Pharmaceuticals, and by EPSRC grant GR/M76607under the DIM initiative. The Java version of the in-terface was initially ported by Alex Jacoby. We aregrateful for his contribution.

**Trademark or registered trademark of Sun Microsystems, Inc.,or Object Management Group.

Cited references and note

1. A. D. Baxevanis, “The Molecular Biology Database Collec-tion: An Online Compilation of Relevant Database Resourc-es,” Nucleic Acids Research 28, No. 1, 1–7 (2000).

2. P. Karp, “A Strategy for Database Interoperation,” Journalof Computational Biology 2, No. 4, 573–586 (1995).

3. O. Ritter, P. Kocab, M. Senger, D. Wold, and S. Suhai, “Pro-totype Implementation of the Integrated Genomic Database,”Computers and Biomedical Research 27, 97–115 (1994).

4. T. Etzold and P. Argos, “SRS—An Indexing and RetrievalTool for Flat File Data Libraries,” Computer Applications inthe Biosciences 9, No. 1, 49–57 (1993).

5. V. M. Markowitz and O. Ritter, “Characterizing Heteroge-neous Molecular Biology Database Systems,” Journal of Com-putational Biology 2, No. 4 (1995).

6. A. Ouksel and A. Sheth, “Editorial: Special Section on Se-mantic Interoperability in Global Information Systems,” ACMSIGMOD Record 28, No. 1, 5–12 (1999).

7. A. Ouksel and I. Ahmed, “Ontologies Are Not the Panaceain Data Integration,” Journal of Distributed and Parallel Da-tabases 7, 1–29 (1999).

8. C. A. Goble, “Supporting Web-based Biology with Ontolo-gies,” Proceedings of the Third IEEE International Conferenceon Information Technology Applications in Biomedicine(ITAB00), Arlington, VA (November 2000), pp. 384–390.

9. R. Stevens, C. A. Goble, P. G. Baker, and A. Brass, “A Sys-tematic Classification of Bioinformatics Queries,” Bioinfor-matics 17, No. 1, 1–9 (2001).

10. S. B. Davidson, C. Overton, and P. Buneman, “Challengesin Integrating Biological Data Sources,” Journal of Compu-tational Biology 2, No. 4 (1995).

11. P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton,R. Stevens, and A. Brass, “An Ontology for BioinformaticsApplications,” Bioinformatics 15, No. 6, 510–520 (1999).

12. W. D. Solomon, GALEN-IN-USE Project Deliverable 9.2,The GRAIL KnoME Knowledge Modelling Environment(1998), http://www.galen-organisation.com.

13. A. L. Rector, P. E. Zanstra, and the GALEN-IN-USE Con-sortium, “The GALEN-IN-USE Project,” European HealthTelematics Observatory Journal (1996).

14. Ashburner et al., “Gene Ontology: Tool for the Unificationof Biology,” Nature Genetics 25, 25–29 (2000).

15. A. L. Rector, S. Bechhofer, C. A. Goble, I. Horrocks, W. A.Nowlan, and W. D. Solomon, “The GALEN Modelling Lan-guage for Medical Terminology,” Artificial Intelligence in Med-icine 9, 139–171 (1997).

16. A. Borgida, “Description Logics in Data Management,” IEEETransactions on Knowledge and Data Engineering 7, No. 5, 671–682 (1995).

17. F. M. Donini, M. Lenzerini, D. Nardi, and A. Schaerf, “Rea-

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 GOBLE ET AL. 549

Page 19: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

soning in Description Logics,” Foundations of Knowledge Rep-resentation, CSLI Publications (1996), pp. 191–236.

18. I. Horrocks and U. Sattler, “A Description Logic with Tran-sitive and Inverse Roles and Role Hierarchies,” Journal ofLogic and Computation 9, No. 3, 385–410 (1999).

19. Arbitrary meta-data associated with concepts and roles thattake no part in the Description Logic reasoning services.

20. S. Bechhofer and C. A. Goble, “Delivering TerminologicalServices,” AI*IA Notizie 12, No. 1, 27–32 (1999).

21. D. Calvanese, D. De Giacomo, M. Lenzerini, D. Nardi, andR. Rosati, “Information Integration: Conceptual Modelingand Reasoning Support,” Proceedings of the 6th InternationalConference on Cooperative Information Systems (CoopIS-98)(1998), pp. 280–291.

22. P. Buneman, S. B. Davidson, K. Hart, C. Overton, andL. Wong, “A Data Transformation System for Biological DataSources,” Proceedings of VLDB, Zurich (September 1995).

23. E. Mena, V. Kashyap, A. Seth, and A. Illarramendi, “OB-SERVER: An Approach for Query Processing in GlobalInformation Systems Based on Interoperating Across Pre-Existing Ontologies,” Proceedings of COOPIS (1996),pp. 14–25.

24. N. W. Paton, R. Stevens, P. G. Baker, C. A. Goble, S. Bech-hofer, and A. Brass, “Query Processing in the TAMBIS Bioin-formatics Source Integration System,” 11th International Con-ference on Scientific and Statistical Database Systems (1999),pp. 138–147.

25. P. D. Karp, V. K. Chaudhri, and S. M. Paley, A CollaborativeEnvironment for Authoring Large Knowledge Bases (1997).

26. B. Swartout, R. Patil, K. Knight, and T. Russ, “Towards Dis-tributed Use of Large-Scale Ontologies,” Proceedings of 10thKnowledge Acquisition Workshop (KAW’96) (1996).

27. D. Fensel, S. Decker, M. Erdmann, and R. Studer, “Onto-broker: How to Make the WWW Intelligent,” Proceedings of11th Banff Knowledge Acquisition for Knowledge Based Sys-tems Workshop (KAW’98) (1998).

28. S. Bechhofer, C. A. Goble, R. Stevens, G. S. Ng, and A. Ja-coby, “Browsing and Query Formulation in Large Ontolo-gies,” Workshop on User Interface for Data Intensive Systems(UIDIS), IEEE Press (September 1999), pp. 158–161.

29. I. A. Chen and V. M. Markowitz, “An Overview of the Object-Protocol Model (OPM) and the OPM Data ManagementTools,” Information Systems 20, No. 5, 393–418 (1995).

30. D. A. Benson, M. Boguski, D. J. Lipman, and J. Ostell, “Gen-bank,” Nucleic Acids Research 22, 3441–3444 (1994).

31. BioNavigator, see http://www.bionavigator.com.32. L. Wong, “Kleisli: Its Exchange Format, Supporting Tools

and an Application in Protein Interaction Extraction,” Pro-ceedings of the IEEE International Symposium on Bio-Infor-matics and Biomedical Engineering, Arlington, VA (2000), pp.21–28.

33. L. Hass, P. Kodali, J. Rice, P. Schwarz, and W. Swope, “In-tegrating Life Science Data—With a Little Garlic,” Proceed-ings of the IEEE International Symposium on Bio-Informaticsand Biomedical Engineering, Arlington, VA (2000), pp. 5–12.

34. Y. Arens, C. A. Knoblock, and W.-M. Shen, “Query Refor-mulation for Dynamic Information Integration,” Journal ofIntelligent Information Systems 6, No. 2/3, 99–130 (1996).

35. A. Y. Levy, A. Rajaraman, and J. J. Ordille, “Querying Het-erogeneous Information Sources Using Source Descriptions,”Proceedings of the 22nd International Conference on Very LargeDatabases, VLDB-96, Bombay, India (September 1996).

36. M. P. Singh, P. Cannata, M. Huhns, N. Jacobs, T. Ksiezyk,K. Ong, A. Sheth, C. Tomlinson, and D. Woelk, “The Car-not Heterogeneous Database Project: Implemented Appli-

cations,” Journal of Distributed and Parallel Databases 5, 207–225 (1997).

37. R. Hull, “Managing Semantic Heterogeneity in Databases:A Theoretical Perspective,” ACM PODS (1997), pp. 51–61.

38. Y. Papakonstantinou, A. Gupta, H. Garcina-Molina, andJ. Ullman, “A Query Translation Scheme for Rapid Imple-mentation of Wrappers,” Proceedings of DOOD 95, Springer-Verlag, Inc. (1995).

39. S. Bergamaschi, S. Castano, and M. Vincini, “Semantic In-tegration of Semi-Structured and Structured Data Sources,”ACM SIGMOD Record 28, No. 1, 54–59 (1999).

40. A. Gal, “Semantic Interoperability in Information Services:Experiences with CoopWARE,” ACM SIGMOD Record 28,No. 1, 68–75 (1999).

41. J. D. Ullman, “Information Integration Using Logical Views,”Proceedings of the International Conference on Database The-ory—ICDT97 (1997), pp. 19–40.

42. A. M. Ouksel and C. F. Naiman, “Coordinating Context Build-ing in Heterogeneous Information Systems,” Journal of In-telligent Information Systems 3, No. 1, 151–183 (1994).

43. R. Bayardo et al., “InfoSleuth: Semantic Integration of In-formation in Open and Dynamic Environments,” Proceed-ings of ACM SIGMOD, Tucson, AZ (May 1997), pp. 195–206.

44. J. Fowler, B. Perry, M. Nodine, and B. Bargmeyer, “Agent-Based Semantic Interoperability in InfoSleuth CoopWARE,”ACM SIGMOD Record 28, No. 1, 60–67 (1999).

45. R. Stevens, C. A. Goble, and S. Bechhofer, “Ontology-basedKnowledge Representation for Bioinformatics,” Briefings inBioinformatics 1, No. 4, 398–414 (2000).

Accepted for publication December 22, 2000.

Carole A. Goble Department of Computer Science, University ofManchester, Oxford Road, Manchester M13 9PL, United Kingdom(electronic mail: [email protected]). Professor Goble co-leadsthe Information Management Group, which she founded in 1995with Norman Paton. Her academic career has been spent at theUniversity of Manchester. She gained her B.Sc. degree in com-puting and information systems in 1982, joined the faculty staffin 1985, and became full professor in 2000. Her principal inter-ests are in meta-data, knowledge representation, and ontologies,and their use in hypermedia, information integration, intelligentuser interfaces, and intelligent retrieval. She has worked in a va-riety of application areas, notably medical informatics, bioinfor-matics, and conceptual hypermedia and multimedia databases.Current work includes automating scientific database annotation,ontology-driven intelligent interfaces for retrieval and data entryin scientific databases, the visualization of ontologies, and tech-nologies for the semantic web.

Robert Stevens Department of Computer Science, University ofManchester, Oxford Road, Manchester M13 9PL, United Kingdom(electronic mail: [email protected]). Dr. Stevens is a ResearchAssociate in the Information Management Group. He gained hisB.Sc. degree in biochemistry at the University of Bristol in 1986;an M.Sc. degree in bioinformatics in 1991, and a D.Phil. degreein computer science in 1996, both at the University of York. Hisdoctoral research was on user interfaces. His specializations arein ontology construction and reconciliation of semantic hetero-geneity in bioinformatics resources.

Gary Ng Department of Computer Science, University of Manches-ter, Oxford Road, Manchester M13 9PL, United Kingdom (electron-

GOBLE ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001550

Page 20: Transparent access to multiple bioinformatics information ...stevensr/papers/goble01.pdf · complex domain, but also the expression of queries in a visual interface and the execution

ic mail: [email protected]). Dr. Ng is a Research Associate inthe Information Management Group. He gained his B.Sc. degreein 1993 and his M.Sc. degree in computer science in 1995 at theUniversity of Manchester. He recently completed his doctoratedegree on visualization techniques for ontology development. Hismain interest is in user interfaces for knowledge bases.

Sean Bechhofer Department of Computer Science, University ofManchester, Oxford Road, Manchester M13 9PL, United Kingdom(electronic mail: [email protected]). Mr. Bechhofer is a Re-search Fellow in the Information Management Group. He gainedhis B.Sc. degree in mathematics at the University of Bristol in1988, and has worked in industry as well as academia. He hasbeen a member of the Information Management Group since itwas founded in 1995. His main research interests are in the ap-plications of description logics, particularly as a delivery mech-anism for terminologies, semantic meta-data, and ontologies.

Norman W. Paton Department of Computer Science, Universityof Manchester, Oxford Road, Manchester M13 9PL, United King-dom (electronic mail: [email protected]). Dr. Paton is a Pro-fessor of Computer Science. He obtained a B.Sc. degree in com-puting science from Aberdeen University in 1986 and a Ph.D.degree from the same institution in 1989. He was a lecturer incomputer science at Heriot-Watt University from 1989 to 1995,before moving to Manchester, where he co-leads the Informa-tion Management Group. His research interests have principallybeen in databases, in particular, active databases, spatial data-bases, deductive object-oriented databases, and user interfacesto databases. He is currently working on parallel object databases,spatio-temporal databases, and distributed information systems.He is also leading work in the development of an object-orienteddata warehouse for yeast genomic data and provides databaseexpertise for a number of databases in microbial eukarytes.

Patricia G. Baker Sagitus Solutions Ltd., Incubator Building,Grafton Street, Manchester M13 9XX, United Kingdom (electronicmail: [email protected]). Dr. Baker gained her B.Sc.degree in biochemistry from Liverpool John Moores Universityin 1990, an M.Sc. degree in computer science from UMIST in1991, and a Ph.D. degree in computational biochemistry fromthe School of Biological Science at the University of Manchesterin 1995. She has worked in academia and in the biotechnologyindustry and is currently the technical director of Sagitus Solu-tions, a company specializing in knowledge-based bioinformaticssystems for drug discovery.

Martin Peim Department of Computer Science, University ofManchester, Oxford Road, Manchester M13 9PL, United Kingdom(electronic mail: [email protected]). Dr. Peim is a Research As-sociate in the Information Management Group. He obtained hisB.Sc. degree in 1981 and a Ph.D. degree in 1989, both in math-ematics, at the University of Manchester. He spent two years atthe University of Kentucky (1986–1988), before taking up com-puter science. Before joining the TAMBIS project, he worked ondistributed decision support systems, automated reasoning, andhardware design verification.

Andy Brass School of Biological Sciences, University of Manches-ter, Oxford Road, Manchester M13 9PL, United Kingdom (electron-ic mail: [email protected]). Dr. Brass is a Senior Lecturer in theSchool of Biological Sciences. He started his career as a theo-retical physicist working on molecular modeling of superionic sol-ids, gaining his Ph.D. degree from Edinburgh University in 1987.

He then moved to McMaster University in Canada on a NATOfellowship to study aspects of high-temperature superconductiv-ity and strongly coupled electron systems. In 1990 he moved tothe University of Manchester to become a founding member ofthe bioinformatics group, where he has a wide range of projectsin protein function prediction, gene expression analysis, intelli-gent integration, automated curation, and bioinformatics educa-tion.

IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001 GOBLE ET AL. 551