text analytics for life science using the unstructured ... · the unstructured information...

Text analytics forlife science usingthe UnstructuredInformationManagementArchitecture

by R. MackS. MukherjeaA. SofferN. UramotoE. BrownA. Coden

J. CooperA. InokuchiB. IyerY. MassH. MatsuzawaL. V. Subramaniam

Biomedical text plays a fundamental role inknowledge discovery in life science, inboth basic research (in the field ofbioinformatics) and in industry sectorsdevoted to improving medical practice, drugdevelopment, and health care (such asmedical informatics, clinical genomics, andother sectors). Several groups in the IBMResearch Division are collaborating on thedevelopment of a prototype system for textanalysis, search, and text-mining methods tosupport problem solving in life science. Thesystem is called “BioTeKS” (“Biological TextKnowledge Services”), and it integratesresearch technologies from multiple IBMResearch labs. BioTeKS is also the first majorapplication of the UIMA (UnstructuredInformation Management Architecture)initiative also emerging from IBM Research.BioTeKS is intended to analyze biomedicaltext such as MEDLINE™ abstracts, medicalrecords, and patents; text is analyzed byautomatically identifying terms or namescorresponding to key biomedical entities (e.g.,“genes,” “proteins,” “compounds,” or “drugs”)and concepts or facts related to them. In thispaper, we describe the value of text analysisin biomedical research, the development ofthe BioTeKS system, and applications whichdemonstrate its functions.

The large scale sequencing of the human genomehas greatly increased our knowledge of the geneticbasis of biological processes and accelerated the pace

of research and development aimed at treating dis-ease and enhancing the health and well-being of hu-mans. However, these advances also result in in-creased complexity in understanding and applyingbiomedical research and data. There is consensus inthe life-science (LS) industry and academic labora-tories that managing the complexity of biological dataand knowledge requires an integrative, information-based systems approach, in which computer technol-ogy must play an essential role. For a cogent anal-ysis of this situation and the role of computationalmethods in life science, see References 1–3.

Key components of computational technology thatare relevant to this effort include analyzing, search-ing, and mining biomedical text, and correlating thestructured data derived from texts with data derivedfrom biomedical experiments, transcribed medicalrecords, and so on. This paper describes an IBM Re-search project to exploit and develop the text-analytical technology needed for managing, analyz-ing, and using biomedical text to solve problems inlife science. We call the system BioTeKS for “Bio-logical Text Knowledge Services.” BioTeKS is alsoone of the first major systems implemented with theIBM Unstructured Information Management Archi-tecture (UIMA), which is described later in this pa-per, in other papers in this issue,4 and elsewhere.5

�Copyright 2004 by International Business Machines Corpora-tion. Copying in printed form for private use is permitted with-out payment of royalty provided that (1) each reproduction is donewithout alteration and (2) the Journal reference and IBM copy-right notice are included on the first page. The title and abstract,but no other portions, of this paper may be copied or distributedroyalty free without further permission by computer-based andother information-service systems. Permission to republish anyother portion of this paper must be obtained from the Editor.

R. MACK ET AL. 0018-8670/04/$5.00 © 2004 IBM IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004490

This paper begins by describing the role and valueof text analysis in LS research and development, andhow BioTeKS fits into the broad range of technol-ogies needed to manage text content. It then focusesin detail on the BioTeKS system specifically, and howBioTeKS is being used to explore text analysis, textsearch, and text mining to support problem solvingin life science.

The role of text analysis in life-scienceresearch and development

Text analysis is a key component in text-oriented un-structured information management (UIM). The gen-eral goal of text analysis in UIM is to transform un-structured text information into structuredinformation, and to use this information to supporthigher-level processes of text search, mining, and dis-covery. (For comprehensive reviews of UIM, see Ref-erences 4, 6, and 7.) Transforming unstructured textinto structured information means transforming“chunks” of text into specific, discrete data objectscategorized or labeled by one or more attributes,where the data objects are words, phrases, or largertext segments. The essence of what the BioTeKS sys-tem does is information extraction (IE) for life-science text. Examples of IE include identifyingnames of biomedical entities, like gene, protein, anddisease names (which may be expressed in multiwordphrases), and identifying more complex facts aboutand relations between entities, such as interactionsbetween proteins, genes, and the functions associ-ated with them, or the correlations between drug ef-fects and disease indications. Several overviews oftext IE exist, in general,8,9 and specifically for life sci-ence,10–12 and we assume readers have some famil-iarity with the basic technical issues in IE.

The business and research value of extracting struc-tured text information is that it can be used to solveproblems in key biomedical domains and increaseproductivity in research and development. Text anal-ysis can enhance general knowledge managementpractices and tools, for example, by improving theeffectiveness (i.e., precision) of searching for doc-uments in large collections, and by organizing thesecollections into taxonomy groupings or topic clus-ters for easier browsing. More importantly, text anal-ysis can support knowledge discovery in various do-mains. Papers on knowledge portals and text-miningresearch in IBM can be found in recent special issuesof the IBM Systems Journal. 6,7,13,14

Figure 1 provides examples of text analysis phasesin life science in relation to four key phases of drugresearch and development (see Reference 3). In the“Target Selection” phase for a drug (scenario 1), forexample, a researcher needs to search scientific andpatent literature to find out what drugs or diseasesother researchers or institutions are working on, andwhat is already known and patented in this field.Knowledge discovery can increase the speed (andhence the productivity) of a drug researcher findinga drug target, a competitor�s patent activity, or a par-ticipant in a clinical trial. In the “Preclinical” phase,researchers may conduct experiments that provideindications of relevant gene responses to drugs ordisease agents. Scenarios 2, 3, 4, and 5 all pertain tofinding literature describing aspects of genes and pro-teins that can help researchers investigate hypoth-eses about the relevance of these genes or gene prod-ucts to some drug, disease, or biological process ofinterest. For example, studies have shown that lit-erature references to genes can improve the searchfor gene homologies that may be relevant to iden-tifying functions of novel target genes15 (scenario 2),validate molecular pathways,16,17 and help interpretwhy a cluster of genes might react together undersome experimental conditions18 (scenario 3).

In later stages of drug development, the effective-ness of target drugs is evaluated in the “Clinical Tri-als” phase (scenario 6). Text analysis of medical rec-ords can provide information which can be combinedwith other kinds of more traditional structured data(e.g., individual genomic information), to assist indata mining of factors associated with positive or neg-ative drug and treatment effects. Text mining can beused to find participants for a clinical trial (scenario7), based on specific attributes of treatment, med-ical history, and so on. Finally, once a drug is in themarketplace, feedback about large-scale use underuncontrolled circumstances provides additional valu-able feedback. Text mining can be used to analyzetreatment effects as reported in various forums, fieldreports, and call center records (scenario 8).

In each of these scenarios, text analysis has the po-tential for reducing the time it takes researchers tofind relevant documents and to find specific factualcontent within documents that can help researchersinterpret experimental data, clinical record informa-tion, and business intelligence data contained in pat-ents. As Ng and Wong put it, “The race to a newgene or drug is now increasingly dependent on howquickly a scientist can keep track of the voluminousinformation online to capture the relevant picture

IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 R. MACK ET AL. 491

(such as protein-protein interaction pathways) hid-den within the latest research articles.”16

These text-based search and mining tasks are typ-ically functions available in end-user applications.BioTeKS has not yet been used to build all the ap-plications implied in Figure 1. These applicationstend to be highly customized for specific needs andpractices of specific pharmaceutical organizations.BioTeKS has focused instead on the core text anal-ysis methods needed to extract the text data that suchapplications would use as input for further higher-level processing; that is, indexing a text search en-gine or clustering search results, mining trends, orassociations among terms expressing concepts ofinterest.

Document clustering using BioTeKS componentscan be found as a function in the publicly accessibleWeb-based Bio-Dictionary* tool19 developed by theComputational Biology Center20 at the IBM Watson

Research Center. Bio-Dictionary uses pattern-matching algorithms to analyze protein sequencesin relation to molecular-level subunits called “se-qlets” (seqlets are to proteins roughly what wordsand phrases in a dictionary are to sentences that usethem to express ideas). Much of the protein analysishas nothing to do with text, per se, but as the systemsearches for known and homologous proteins onpublic protein databases such as Swiss-Prot**,21 italso retrieves and compiles MEDLINE** abstracts thatdescribe facets of these proteins. These documentscontain descriptions of potentially relevant functionsand characteristics of the proteins under analysis.The document clustering results shown in Figure 2are a subset of several hundred MEDLINE documentscompiled by Bio-Dictionary for the sample proteinsequence “SODUM-BORU,” and the clusters are la-beled with topically relevant keywords. These includephrases suggesting attributes and functions. Hierar-chical document clustering with cluster labels helps

Figure 1 Text–mining scenarios for four phases of drug development

Do modeling and detailed analysis of interactions between drug compounds and gene and protein targets

Preclinical

Determine effectiveness, toxicity, and delivery characteristics of drug treatments with humans

Clinical Trials

Track drug effects and side effects in real-world (post-sales) contexts via field reports, call center reports, etc.

Production/Deployment

TASKS

DRUG DEVELOPMENTPHASE

Given disease, identify candidate target drugs, genes, and proteins associated with disease

Target Selection

Scenario 3 Retrieve literature and analyze common functions and characteristics of gene clusters from micro-array experiments

Scenario 6 Correlate drug treatment effects with individual clinical and genomic data contained in clinical records and databases

Scenario 8Track negative drug side effects expressed in field reports, call center reports, and Web forums

Scenario 1 Search and analyze patent and PubMed literature for competitive information ( i.e., who is working on what?)

Scenario 4 Search and analyze literature on experimental data for gene products implicated in cell metabolism for a target genome (e.g., Drosophila)

Scenario 5Search and analyze literature on protein-protein interactions, and extract as part of compiling pathogen (or other) pathways.

Scenario 7 Identify participants with specified treatment conditions, disease, indications, and phenotypic characteristics for clinical trials, using text and data in clinical records

Scenario 2 Given novel gene (or protein) sequence, search and analyze literature for references to cell functions and disease correlates associated with these sequences.

R. MACK ET AL. IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004492

users browse document collections more easily thanunordered lists of MEDLINE titles, especially whenthe cluster labels suggest topics for subsets of MED-LINE abstracts. The clustering engine, an exampleof BioTeKS application-enabling middleware, usesnoun phrases in its clustering algorithm and for la-beling each cluster. BioTeKS annotators (described

later) extract these noun phrases and make themavailable to the clustering engine. Other BioTeKScomponents format the results of clustering as anXML (eXtensible Markup Language) file that can bepresented to an end user as a dynamic (interactive)HTML (HyperText Markup Language) Web page, asshown in Figure 2.

Figure 2 Hierarchical document clustering results


In this context, BioTeKS functions as a text-analysismiddleware component of a larger system that startswith input documents obtained from some source(e.g., search processes in the Bio-Dictionary tool),and ends as an end-user application view (e.g., theWeb page in the figure) that enables end users todo something useful with text documents, such asbrowse them by means of a hierarchy of labeled clus-ters. The labels suggest specific topics relevant to in-terpreting the characteristics of the original novelprotein.

In the next section, we discuss the larger middlewaretechnology context for BioTeKS, and then we de-vote the rest of the paper to describing BioTeKS text-analysis methods.

Technologies for managing text contentA variety of technologies are needed to support text-based knowledge management and discovery, fromaccessing and managing documents (unstructured in-formation) from various sources to delivering anal-ysis results to end user applications. Figure 3 showsa generic platform depicting this end-to-end se-

quence of technologies. BioTeKS focuses on thegreen components in the figure: text-analysis meth-ods, indexing, and storage of text-analysis data. Ap-plication middleware engines access and further an-alyze this text data. Many products are available fromIBM and other vendors to implement this platform(see the Life Sciences Framework22 for a descrip-tion of available IBM products and technologies).

Managing unstructured text information begins withaccessing text information using crawling or feder-ated search methods. Once text is accessed, it needsto be filtered or converted from a variety of formats(e.g., PDF [Portable Document Format], HTML, XML)into a standard form for further processing. The lat-ter includes document parsing to extract text con-tent (“blobs”) and meta-data, and it often involvesstoring this information in local data warehouseswithin an organization. Collected text informationis typically stored in diverse and heterogeneous re-positories, including public and corporate Web sitesand internal corporate databases (e.g., Lotus Notes,file systems). These basic functions are available instandard IBM products (e.g., IBM DB2* Information

Figure 3 Generic platform for managing unstructured ( text ) information

APPLICATION INTEGRATION / WEB SERVICES

APPLICATIONSERVICES

DOCUMENTWAREHOUSE/DOCUMENTMANAGEMENT

PORTLETSANDSERVLETS

APPLICATION-ENABLING MIDDLEWARE ENGINES

TEXT ANALYSIS

SEMANTIC SEARCHTEXT SEARCH

ENGINE

DOCUMENTCLUSTERING

ASSOCIATION MINING

LEXICAL NETWORKS

DOCUMENT ACCESS / DOCUMENT FILTERING

TEXT-ANALYSIS-DATAINDEXING AND ACCESS

END-USERAPPLICATIONSFOR SEARCH AND TEXT MINING

DATA MINING

INTERNET / WEB

CORPORATEINTRANETS

CRAWLER

FEDERATEDACCESS

1

2

4

7

3

DATABASE

DATABASE

PUBLIC WEB SITESAND DATABASES

BUSINESS DOCUMENTS

CLINICALRECORDS

R&D REPORTS

US PATENT RECORDS

PubMedSWISSPROT

5

6


Integrator for Content23) or other vendor productsin the domain of content and documentmanagement.

Once the basics of text access and warehousing orstorage are accomplished, text analysis of variouskinds can be carried out at a finer granularity, to sup-port a wide range of text searching and mining ap-plications. This is where the BioTeKS system andUIMA come into play. The numbered arrows in Fig-ure 3 show the high-level flow of information be-tween these components. Accumulated documentsflow in to the text analysis component (arrow 3). Textanalysis processes in this component automaticallyextract text features of various kinds and annotatethese features with linguistic and semantic informa-tion that associates meaning with text features (e.g.,a string is a “gene” relative to some resource). Textanalyses produce fine-grained, structured text datathat typically need to be stored and indexed for useby higher-level applications. Structured data ex-tracted from the document as meta-data (e.g., au-thor, title, date, keyword annotations based onMeSH [Medical Subject Headings] codes, etc.) andextracted from text content (e.g., entity names suchas genes, drugs, or diseases contained in MEDLINEabstracts) can be stored (arrow 4) in a relational da-tabase for access by upstream text-mining applica-tions using SQL (structured query language) databasequeries. In addition, text content can be indexed ina text search engine, enabling text search based onmany more terms indexed in the full document textcontent.

The application-enabling engines identified in Fig-ure 3 provide access to text-analysis data for text-search and higher-level text-mining functions, includ-ing tools that allow researchers to control theseanalyses and visualize results. Arrow 7 in the figureindicates access to indexed information by applica-tion-level components, and arrow 5 indicates howend-user (Web) applications communicate user re-quests to these middleware application components.For example, the user may submit a text query insome query language, which is ultimately providedas input to the text search engine via the “semanticsearch” component in the figure. This componentcan also contain additional functions to enhancesearching, for example, to help users express searchintentions in query languages, or to iteratively re-fine searches by suggesting query terms based onusers selecting result documents they judge to be rel-evant to their interests (i.e., “show me more likethis”). Searching can also be enhanced by ranking

result documents by their relevance to query terms,and by clustering or categorizing documents, mak-ing it easier for users to browse large collections ofsearch result documents.

Text-mining functions provide additional ways of or-ganizing and visualizing text, either at the documentlevel or at the level of more specific text content.Text-mining methods include analyzing associationsand trends between categories of entities, such ascorrelations between names of researchers and re-search topics, genes and gene products, drug andcompound effects and disease indications, and so on.We discuss these and other text-mining applicationsin more depth later.

The technology platform indicated in Figure 3 in-cludes tools for developing applications, includinggraphical user interfaces and Web services to man-age the application logic supporting application cli-ents. These services are used for controlling the in-tegration and information flow among middlewarecomponents like databases and search engines anddocument-clustering engines. In the section, “Ap-plication prototypes and application-enabling en-gines,” we discuss examples of such applications.

An overview of the BioTeKS systemBioTeKS is a system focused on methods for ana-lyzing or annotating biomedical text, including ascheme for storing and indexing text-analysis datagenerated from these methods and a suite of pro-totype application-enabling engines for supportingcertain types of biomedical text-mining applications.

Figure 4 presents specific text-analysis componentsof BioTeKS which were shown in green in Figure 3.As shown, the text-analysis engine applies a sequenceof text annotators to annotate input documents. Textanalysis data is stored in a database and in a textsearch engine, and application-enabling engines ac-cess this text data and apply additional text-miningmethods to the data.

BioTeKS functions in a larger context where doc-uments have already been crawled or compiled intoa collection, and a set of applications of interest mightalready exist (e.g., text search applications). BioTeKSfocuses on the analysis and extraction of text infor-mation from documents in this collection to supporttext searching and mining applications.

BioTeKS uses the UIMA framework and tools fromwhich it derives much of its value. The component


labeled “Text Analysis Engine” in Figure 4 also de-scribes the essential elements of the UIMA implemen-tation of BioTeKS. Other papers describe UIMA indepth, and we only highlight aspects of it relevantto describing BioTeKS.4,5

How UIMA improves text analysis processing. UIMA(and BioTeKS) address the challenges of analyzingtext by standardizing and improving the process ofdeveloping, deploying, and integrating the operationof multiple text-analysis methods that operate on dif-ferent levels of linguistic analysis.5 Document textexpresses ideas that involve all the complexity of hu-man language expression. Analyzing the form andcontent of text is complicated because it entails in-tegrating and coordinating analysis methods for mul-tiple levels of analysis of linguistic structure, fromcharacter strings and word tokens to names and

phrases, to syntactic elements including clauses andsentences, to still higher-level topic structures thatmight span multiple paragraphs within documentsand even multiple documents. Each of these levelsof information requires specialized analysis meth-ods, all of which need to be orchestrated andintegrated.

How BioTeKS improves text analysis processing forlife science. BioTeKS in turn is focused on meth-ods for text analysis of biomedical text, which is es-pecially complicated scientific text. What these meth-ods are, and how these methods improve the analysisof biomedical text, is the subject of the rest of thispaper. The starting point, however, is to use UIMAas the framework to develop and orchestrate a col-lection of text-analysis methods that automaticallyidentify names of biomedical entities (e.g., genes and

Figure 4 Overview of BioTeks system and UIMA

•••••

SEMANTICSEARCH

DOCUMENTCLUSTERING

ASSOCIATIONMINING

LEXICAL NL ETWORKANALYSIS

TEXTXTX ANALYSIS

KNOWLEDGERESOURCES(DICTIONARIES/ ONTOLOGY)

TEXTXTX ANALYSIS ENGINE

COLLECTIONPROCESSINGMANAGER


proteins) and facts about them (e.g., a gene expressesa protein, one protein activates another protein, agene cluster appears implicated in a disease syn-drome, etc.). Table 1 summarizes features and func-tions of the annotators that are used in the “TextAnalysis Engine” component shown in Figure 4.

In BioTeKS, text annotators have a standard imple-mentation specified by UIMA, in which annotatorscreate, use, and interoperate via the Common Anal-

ysis System (CAS). CAS annotations are structuresconsisting of attribute-value pairs, where the at-tribute is some type of classifying or identifying de-scription (e.g., a linguistic category like “noun,” ora semantic category like “gene”) and the value is aspan of text in the document to which this annota-tion applies. Each annotation can have additionalproperties associated with it as well, such as the lo-cation of the span of text in the document. Each an-notator takes a CAS structure as input and returns

Table 1 Annotators used in BioTeKS

Annotator Name Function in BioTeKS Comments

LanguageWarelinguisticengine

Used for tokenizing textinto strings and sentences.Assigns generic lexicalinformation to strings.

Tokenization based on regular expression rules and dictionary lookup.Dictionary lookup also provides generic lexical (lemma) information (e.g.,part of speech). Dictionaries can be customized for specific categories ofentities.

POS (part ofspeech) tagger

Assigns parts of speechto tokens, using context.

Assigns parts of speech to tokens based on statistical model of terms incontext. Can also disambiguate POS annotations from prior annotators. POStagger can be trained on manually annotated document corpus. Defaulttraining corpus is licensed from Language Data Consortium.28

FST (finite statetransducer)with shallowparsing rules

Parses sentences intosyntactic units (e.g.,subject noun phrase).

General-purpose FST engine. Syntactic parsing rules customize the FST forshallow parsing. Shallow parser assumes POS tags of individual string tokens.

DictionaryLookup

Assigns a lexical orsemantic category to astring (e.g., “Trk A is aprotein”) and otherlexical information

General-purpose dictionary-based pattern-matching lookup. Takes tokens asinput or does its own tokenization. Includes reference to an authority (e.g.,MeSH), canonical form, and synonyms. Multiple dictionaries for flexibleentity extraction; e.g., MeSH terms, genes, proteins, and drugs. Includesfunctions for handling case, stemming, etc. Dictionaries developed as XMLfiles.

BioAnnotator Identifies biomedicalterms and phrases andassociates a UMLSidentifier with them.

General-purpose dictionary-lookup method specialized for UMLS. Takesnoun phrases as input and classifies them as “biological” if they matchselected categories of UMLS terms or contain UMLS terms. Includes regularexpression rules for disambiguating terms in context.

ChemFrag Identifies complex tokensas organic chemical names.

Identifies organic chemical names using rules based on chemical fragmentstrings. Does not identify specific chemical name with respect to chemicalname-resource.

DrugDosage Identifies phrasescontaining drug nameand a dosage qualifier.

Uses dictionary lookup for drug names, quantities, and dosage qualifiers andFST rules for identifying phrasal combinations of these elements.

Term OntologyMapper

Adds taxonomyinformation to identifiedterms.

Dictionary lookup methods identify terms relative to ontology and dictionaryresources, but in some cases, like MeSH (Medical Subject Headings), there isadditional information for placing terms in a hierarchical taxonomy. Thisannotator fills in this additional semantic information.

Term Categorizer Assigns a lexical andsemantic category to astring (e.g., “congestiveheart failure” is a diseasesymptom).

Term Categorizer is an annotator based upon machine-learning techniquesapplied to training documents (see text).

RelationExtractor

Identifies syntacticclauses containing nounand verb phrases.

Extracts clauses consisting of subjects, verbs, and objects. Takes shallowparsing annotations as input.


a modified CAS structure as output. Text annotatorsare executed in the text analysis engine shown in Fig-ure 4, and multiple annotators can be organized ina pipeline.

The current set of BioTeKS text annotators areshown in Table 1. These annotators focus on infor-mation extraction; that is, identifying the location,category, and properties of the names of significantbiomedical entities and identifying certain relationsbetween these entities. However, to analyze entitiesand relations requires exploiting more generic nat-ural language processing (NLP) analyses, beginningwith tokenization and part-of-speech tagging of textstrings. These are discussed in the next section.

UIMA also specifies various services and methods forcollection-processing management, resource man-agement for linguistic and other resources neededby annotators (e.g., dictionary or ontology resources),and CAS consumer methods for translating CAS an-notations to other forms of data that can be indexedand stored (see Figure 4). Neither UIMA norBioTeKS have functions for accessing document col-lections in the sense of crawling. (Figure 4 shows doc-ument access and management as external to thetext-analysis component). However, BioTeKS doesadopt the UIMA collection-processing scheme for im-plementing “collection reader” functions, for feed-ing documents from a compiled collection into a text-analysis engine. A collection reader parses each inputdocument and initializes a new CAS structure con-taining the initial flat text on which annotators willoperate, as well as optional document meta-data(e.g., title, author, etc.). BioTeKS has a collectionreader specialized for MEDLINE abstracts, and areader for aspects of patent documents. In bothcases, the documents are initially available as XMLdocuments with labeled fields for document-levelmeta-data, such as, “title,” “author,” and “date,” aswell as fields for extended segments of text contain-ing the contents of the documents (e.g., MEDLINEabstract, patent abstract or claims, etc.).

NLP annotators. BioTeKS includes the following ge-neric NLP text annotation methods:

● LanguageWare* linguistic engine● Part-of-speech (POS) tagger● Finite state transducer (FST) with shallow parsing

syntax rules

The LanguageWare linguistic engine24 segments textinto tokens and sentences, using a specific text an-

notation model. The LanguageWare tokenizer com-bines dictionary lookup with algorithmic processingto segment input text into distinct lexical units.25 Lan-guageWare dictionaries also contain additional lex-ical information that can be associated with the lex-ical items identified as part of segmentation, such asa word�s lemma or part of speech. This lexical in-formation is useful for the subsequent annotationprocesses, including disambiguation of POS tags.

The POS tagger and FST component annotators areresearch-enhanced annotators based on an earliertext analysis engine developed by IBM Research,called Textract, which was available in the IBM In-telligent Miner* for Text (IM4T) software product.26

(Textract components are described in more detailin Reference 27.) These two annotators, along withthe LanguageWare tokenizer, use a common “textannotation framework” (TAF), also described in Ref-erence 27. TAF specifies a set of CAS annotation typesappropriate for multiple levels of linguistic process-ing, for example, tokens (strings), terms (includingmultiword phrases), sentences, clauses, and so forth,and properties of these linguistic objects, such as thecharacter location of the span of annotated text ina document, part of speech for the span, and so forth.Most BioTeKS annotators interoperate through thisannotation type system, or through annotation typesderived from this set.

The output of the tokenizer consists of CAS anno-tations which form the input to the POS tagger. ThePOS tagger uses a statistical language model, createdduring a separate training phase, to disambiguate thepossible POS tags based on the language model. Sta-tistical methods are also used to determine a POS tagfor tokens which are not in the language model. Thetagger needs to be trained manually to develop thisstatistical language model by using a corpus anno-tated with correct POS tags.28

The shallow parser analyzes syntactic relationshipsamong these POS-tagged terms, and produces a shal-low syntactic parse, which groups tagged terms intonoun, verb, prepositional phrases, and so forth, andin some cases, identifying the “subject,” “verb,” and“object” clause structure of sentences. Figure 5 showsan example of shallow-parsing results for a sentence,indicating POS tags and syntactic groups of taggedterms (NP is noun phrase, VP is verb phrase, PP isprepositional phrase, NPP is noun phrase with prep-ositional phrase, and VG is verb group). BioTeKSdoes not generate the ideal semantic representationin the top frame, but it does generate the interme-


diate shallow parse. This syntactic parse is useful forcertain types of relation extraction because verbs typ-ically describe relations between named entities ina sentence.

The shallow parser is realized as a cascade of finitestate grammars running within a generalized FST en-gine. This transduces one sequence of symbols intoanother.8, 22 For example, rules corresponding to En-glish grammar are used by the FST to compose a se-quence of word tokens (with parts of speech, suchas noun or adjective, disambiguated by a statisticaltagger) into noun phrases (NPs) to combine NPs andother phrasal units into syntactic arguments, and toassign grammatical roles (such as “subject” or “ob-ject”) to these arguments. FST rules are declarativeand capture linguistic regularities in human-readableform. An FST compiler compiles these rules into aruntime “state transition graph” executable that canbe run efficiently against sequences of text fragments(e.g., document content) or, more abstractly, appliedto sequences of annotations that may be of lexical,syntactic, or indeed any annotation type. In additionto its use for shallow parsing, the FST engine pro-vides powerful general-purpose capabilities formanipulating text structures, syntactic or otherwise.

These annotators produce a rich set of generic lin-guistic annotations that capture key linguistic infor-

mation which can be used by other BioTeKS anno-tators and applications, as we describe in thefollowing. This information is useful outside the con-text of biomedical text. Indeed, pre-UIMA versionsof the TALENT (Text Analysis and Language Engi-neering Technology) tools on which the TAF com-ponents are based have been used in applicationsinvolving customer call centers and otherinformation.13,29

These annotators can be modified to work effectivelyon biomedical text. For example, standard POS tag-gers and shallow-parser rules are developed for sen-tence structures in generic text (e.g., news articles),and need to be modified for sentences and word-to-word statistical patterns characteristic of narratedphysician reports or other kinds of biomedical text.In addition we should note that additional NLP an-notators exist, including deep syntactic parsers.30

However, we have not yet exploited these for life-science tasks.

Annotators for biomedical entity extraction. Entityextractors are annotators that identify the locationof an entity name in a text and categorize the namerelative to one or more knowledge resources likeMeSH and UMLS** (Unified Medical Language Sys-tem), both developed by the National Library ofMedicine.31 Examples shown in Table 1 include en-

Figure 5 Shallow parsing results for a sentance in PubMed abstract 12568865

(MUTATED_GENE: BrCA1 FUNCTION: inhibits, PROCESS, soluble HLA secretion CONFIDENCE: 75%)

PP

NPP

NP NP NPVG

A/a/DT point//NN mutation//NN of//IN the//DT BrCA1//JJ gene//NNappears/appear/VBZ to//TO inhibit//VB soluble//JJ HLA//JJ secretion//NN

A point mutation of the BrCA1 gene appears to inhibit soluble HLA secretion

A point mutation of the BrCA1 gene appears to inhibit soluble HLA secretion.


tity extractors for identifying genes, MeSH terms(also used for manual annotation of MEDLINE ab-stracts), drug names, and chemical-compoundnames.

Figure 6 shows a debugging tool used in BioTeKSto annotate in color specific categories of entities ina MEDLINE abstract. The legend at the bottom iden-tifies semantic categories of words (e.g., “genes,”“diseases,” “chemicals and drugs”), and the datastructure in the right frame is a CAS annotation forthe selected term in the document. The annotationframe shows the CAS annotation for lamin A/C. Notethat this is a variant of the gene LMNA, which ap-pears in the title line (TI). Note also that the genestring is annotated relative to LocusLink,32 a pub-licly available gene description database.

Entity extraction is a broad domain (see Reference8), and there are multiple techniques available.

BioTeKS is exploring three approaches to entity ex-traction, and Table 1 identifies for each annotatorthe general technique used to implement each an-notator, namely:

● Pattern matching of terms (roughly string lookup)using a dictionary or database of known terms insome target category of terms (e.g., “MeSH”terms, LocusLink-derived “gene” names, etc.)

● Rules defined over a set of features or annotationscharacteristic of a category of terms

● Machine learning, based on human-created train-ing documents containing correct examples ofsome target category of terms and also based onfeatures or annotations associated with the targetcategory of terms

The strategy in BioTeKS is not to build annotatorsfor every possible biomedical entity. This is an open-ended task that typically requires access to special-

Figure 6 MEDLINE abstract annotated with automatically extracted MeSH and gene names


ized text sources (e.g., medical records) and domainexpertise. Rather, we have developed examples ofgeneral techniques for a set of representative enti-ties typical of bioinformatics (e.g., genes and pro-teins), medical informatics (e.g., drugs and diseaseindicators), and patent mining (e.g., drug and chem-ical names, disease indicators). New annotators canbe built by extending this set of annotators by add-ing new dictionaries, FST rules, or training documentsin the case of machine learning. This process is bestdone with domain experts, who can provide morein-depth expertise necessary for evaluating the qual-ity of entity identification and for inferring how toiteratively improve the quality of annotators by add-ing terms and synonyms to dictionaries, improvingrules, or developing more accurate trainingdocuments.

The Dictionary Lookup annotator uses a “dictio-nary” of entity names, each categorized in relationto a knowledge resource such as MeSH. For exam-ple, in addition to a dictionary of MeSH terms, wedeveloped a dictionary for gene names, compiledfrom the public LocusLink database.32 These dic-tionaries are stored as XML files that include the ca-nonical form of biomedical entity names, as well aslexical variants, and other information, such as ref-erences to the source (database, authority, or “on-tology”) and an identifier of the entity name in thatsource.

The annotator applies pattern matching to match to-kens in the text (potential names) to each item inthe dictionary. The matching process is more thansimple lookup because it can also handle variationsin case and morphology (e.g., “Trk A” or “Trk-A”),including stemming (e.g., the common “stem” un-derlying plural vs. singular forms of a word).

The Dictionary Lookup annotator is actually severalannotators, each specialized for specific diction-aries. Dictionaries can be built from MeSH and UMLSresources available from the National Library ofMedicine33 (NLM), as well as publicly available data-bases specialized for specific biomedical entities likeproteins (Swiss-Prot21) or genes (LocusLink32). Notethat many pharmaceutical and biotechnology com-panies have also developed internal and proprietarydictionaries of terms, including variants and syn-onyms. In general, dictionary tools need to be ableto incorporate new dictionary resources, and the Dic-tionary Lookup tool can do so, using other dictio-naries when they are properly formatted.

The BioAnnotator (see Table 1) categorizes nounphrases provided by a shallow parser as biomedicalphrases when they match terms in UMLS, either ascomplete matches or as partial (substring) matches.BioAnnotator uses the LanguageWare linguistic en-gine described earlier to identify UMLS terms. Thisis done by replacing the default English language dic-tionary in the engine with a dictionary based onUMLS. BioAnnotator also has a rule-based compo-nent to identify biological terms not present in UMLSand to resolve certain types of ambiguity in extractedgene names (e.g., some gene names are also used toname proteins or nonbiological entities, such as“BIKE”).34 Dictionary Lookup and BioAnnotatorare annotator options that have overlapping func-tions, but also explore different entity identificationmethods.

The Term Categorizer annotator elaborates on thesemantic context of identified terms. For example,MeSH terms are categorized in a hierarchical tax-onomy. This annotator optionally associates iden-tified terms with additional information such as syn-onym and cross-reference information. Upstreamapplications can use this information in various ways,for example, to create a navigation function forbrowsing and selecting terms. This function couldbe bundled with Dictionary Lookup, but because itis an optional level of annotation, it is appropriateto keep it separate and invoked only when needed.

Entity extraction using dictionary lookup works wellwhen domain experts can enumerate the names ofentities of interest (as well as variants of thesenames). However, this is not always possible. Thesecond approach to entity extraction is based onrules, not enumeration, where the rules can be de-veloped by domain experts, either directly or by us-ing machine-learning methods. The context in whicha name occurs can sometimes provide feature cuesfor the semantic type of name for a term. Where thisis the case, it is sometimes possible to write rulesbased on these features. The ChemFrag and Drug-Dosage annotators are examples of annotators basedon rules.

The ChemFrag annotator combines regular expres-sion rules that recognize organic chemical nameswith rules that assemble these fragments into largerdescriptions. A small dictionary of prefixes and suf-fixes is used in some of the rules. An example of arecognized chemical fragment is Pivaloyloxymethyl1-ethyl-1,4-dihydro-4-oxo-7-(4-pyridyl)-3-quinolinecar-boxylate. Note that identification in this case only


means categorizing chemical names as “chemicalnames,” and does not mean identifying the specificchemical name in a standard resource (e.g., Refer-ence 35). Rules are formal expressions, such as, “Afragment contains balanced parentheses or brack-ets and possibly, numbers and hyphens.” ChemFragis a hybrid consisting mainly of rule classificationbased on features, augmented with a small dictio-nary of known prefixes and suffixes.

The DrugDosage annotator (see Table 1) is also ahybrid annotator.36 In DrugDosage, dictionaries areused to identify known drug names (e.g., “Ibupro-fin”) and strings associated with quantities and dos-ages (e.g., the quantity “20,” and the abbreviation“mg” for milligrams). Rules classify co-occurrencesof drug names, quantities and dosage abbreviations(e.g., “ibuprofen, 20 mg”) as “drug and dosage” con-cepts. The rules for identifying these concepts usea modified version of the FST engine used in the TAFshallow parser. However, instead of compiling andusing English language syntax rules, the rules fordrugs and dosages are modified to apply to nounphrases that contain patterns of drug names, quan-tities, and dosage modifiers. Writing rules manuallyrequires expertise and iterative refinement to achievesatisfactory levels of accuracy.

The BioTeKS team is also exploring machine learn-ing (ML) approaches to entity extraction. ML ap-proaches are especially useful when neither dictio-naries nor explicit rules are easy or possible to build.ML approaches consist of a training phase involvingthe creation by humans of a training corpus, provid-ing true examples of a category of entity to be learned(e.g., drug or gene names). The ML process automat-ically builds a classification model of the target cat-egory of term based on features associated with termsin the document context. These may be features ofthe term itself (e.g., distinctive characters, substrings,prefixes, or suffixes) or features of the linguistic andsemantic context in which entity names occur.

There are several ML tools, some of which are avail-able in the public domain, for example, the WEKAtools.36 In BioTeKS, we are using an IBM Researchprototype for interactive and semi-supervised learn-ing.37 The prototype uses a statistical-machine-learn-ing algorithm called Generalized Winnow (an im-proved version of the standard Winnow algorithm),described in References 38 and 39. Some of the ad-vantages of this tool are its interactive visual userinterface for guiding the process of identifying trueinstances of an entity category based on confidence

levels (estimates of in-class probabilities), the op-tional use of UIMA text annotators by the learningalgorithm, and the use of the resulting classificationmodel as a UIMA annotator for entity extraction.

Annotators for biomedical relation extraction.BioTeKS also provides the capability to extract whatis expressed or predicated about specific entities,once they have been identified. Extracting predica-tions means identifying relationships expressed be-tween two or more terms describing simpler enti-ties. The following are examples of biomedicalrelationships (where the bracketed and capitalizedterms refer to semantic categories of entities relatedto each other):

1. [PROTEIN:]“ABP1”co-occurswith[PROTEIN:]“SRV2.” (NOTE: the actual source sentence is“SH3 domain of ABP1p binds specifically in vitroto the proline rich segment of SRV2p.”)

2. [RELATION:] located in, [SUBJECT:] collagenmolecules, [OBJECT:] type XIII plasma mem-branes (of these cells)

3. [RELATION:] localized in, [SUBJECT:] [PRO-TEIN:] “PRP20”, [OBJECT:] [CELL STRUC-TURE] the nucleus

4. [MUTATED_GENE:] BrCA1, [FUNCTION:]inhibits, [PROCESS:] soluble HLA secretion(The original sentence is: “Apocytochrome cblocks caspase-9 activation and BAX inducedapoptosis.”)

As with entity identification, there are multiple ap-proaches for extracting information about relationsbetween entities, and BioTeKS is exploring at leastthree methods.

● Computation of mutual co-occurrence of terms(“unnamed relations” for short)

● Extraction of syntactic clauses using shallow anddeep linguistic parsing, with additional filtering ofrelations based on the category of entities in therelations (“named relations” for short)

● Graph mining, based on compiling relationshipswith semantic and syntactic information, andsearching for common structural patterns insubgraphs.

The first technique computes the frequency of co-occurrence of entity names within a window of text,for example, a sentence, paragraph, abstract, or othersegment. Co-occurrences can be found in multipledocuments, or multiple occurrences can be foundwithin a single document. We call these “unnamed


relations” because they identify a likely associationbetween two or more terms, but cannot identify orname the specific relationship. An example is item1 in the examples listed previously (a co-occurrencerelation). Unnamed relations can be expressed in anumber of ways, including in a database triplet con-sisting of the associated entities and a quantitativemeasure of the strength of association (see Refer-ence 40 for details).

The next two analysis methods use linguistic infor-mation to identify relations, including informationabout verbs connecting two entities. The name ofthe verb can be used to name the relationship, andhence we call these “named relations.” The Rela-tion Extractor annotator (see Table 1) extracts syn-tactic relations, using the shallow-parsing results ofthe Shallow-Parser annotator to extract syntacticclauses connecting nouns and verbs. In this case, therelation is expressed as a verb connecting a subjectnoun and an object noun, as shown in examples 2and 3. These examples are essentially a compilationof verb, subject, and object noun phrases extractedfrom the Relation Extractor annotator.

Once a set of such relations is extracted, it needs tobe further analyzed to focus on specific relations ofinterest. For example, example 4 is an instance ofa set of relations filtered by identifying verbs thatare characteristic of gene functions. Identifying genefunctions is a key bioinformatics task, and the bioin-formatics research community has provided a mech-anism for manual creation of descriptions of genefunctions, including a reference to a specific MED-LINE record where this function is described in de-tail. These descriptions are called Gene RIF (refer-ence into function) descriptions. Example 4 isexpressed as a sentence and also by a somewhat moreformal representation labeling the entities and func-tions expressed by verbs such as “blocks,” “activa-tion,” and “induced.” This representation is shownonly for illustrative purposes because there are cur-rently no formal standards for creating Gene RIF de-scriptions (although Gene Ontology41 would be acandidate). The National Institute of Standards andTechnology (NIST) sponsored the first GenomicsTREC (Text Retrieval Conference) track in 2003, witha focus on finding MEDLINE abstracts and Gene RIFdescriptions for target genes using text informationextraction and search methods. Our work on iden-tifying gene functions is based on IBM�s participa-tion in this NIST-sponsored Genomics track.42

The Relation Extractor annotator is also being ex-tended to include graph-mining techniques. Graphmining applies data-mining techniques to graphs thatare derived from syntactic parse trees, where thegraphs� nodes and edges represent terms and syn-tactic relations, respectively. Inokuchi has developedgraph-mining algorithms to find salient and frequentpatterns or subgraphs corresponding to interestingrelations.43

Performance of BioTeKS annotators. It is reason-able to ask is how fast the BioTeKS annotators arein processing documents. In the case of MEDLINE ab-stracts, the corpus we are indexing currently consistsof about 11 million abstracts, corresponding to 35gigabytes of text. The performance depends on whatannotators are applied in a text-analysis engine (seeFigure 4), what is done with the annotations (e.g.,are they used to index a search engine or load a da-tabase), the computing resources applied, and po-tential optimization methods. We are using a singleXSeries* 335 Intel Xeon** 2.8 GHz processor with1.5 GB of memory running Linux** release 9.0. Someidea of the throughput can be gained from the re-sults of two analysis runs:

● In the first run, it took 40 hours to annotate thefull MEDLINE corpus, using selected annotatorsand annotator translation applied to each MED-LINE abstract. Each MEDLINE abstract in XML for-mat was parsed, terms and sentences were token-ized, and Dictionary Lookup was used three timesto identify MeSH terms, expanded gene names,and gene function verbs. CAS consumers were usedto translate annotations to an indexable format.An additional 120 hours were needed to create aJuru XML index (see “Semantic text search”) fromthe MEDLINE abstract content, meta-data, and ex-tracted annotations.

● In the second run, 360 hours were needed to an-notate the full MEDLINE corpus, using selected an-notators and annotation translation. Each MED-LINE abstract in XML format was parsed, terms andsentences were tokenized, and POS tagging, shal-low parsing, and Dictionary Lookup for MeSHterms were applied. CAS consumers were used tocreate indexing input for text search (without ac-tually creating a Juru XML index), and for creat-ing noun-phrase feature files for on-demand clus-tering of document search results.

The difference between these two cases is the use ofPOS tagging and shallow linguistic parsing. It is gen-erally the case that such processes are computation-


ally intensive compared to other processes: for ex-ample, the current shallow parser can parse 14documents per second, as compared to DictionaryLookup, which can process between 200 and 1558documents per second, depending on the size andcomplexity of the dictionary entries. However, in-dexing the entire MEDLINE corpus (or any other cor-pus) is likely to be done very infrequently, and onceit is done, the indexes can be incrementally updatedas new documents are added to the source collec-tion. There are numerous parameters influencingprocessing throughput, and several optimizationstrategies are being explored.

Quality of BioTeKS entity and relation annotators.It is important to evaluate the quality of entity andrelation extraction, but it is also quite difficult, giventhe state of the art in these technologies. Evaluationrequires the manual definition of a test bed of ac-curately categorized entities or relations, againstwhich the results of text annotations can be com-pared. This is difficult because of the inherent dif-ficulty of unambiguously assigning meaning to lan-guage expressions (humans do not always agree onhow to categorize a term or phrase), and becausethere are virtually no comprehensive test beds againstwhich to compare performance (see Reference 10for a discussion of this state of affairs). Nonetheless,the BioTeKS project is pursuing the evaluation ofselected annotators. The following are someexamples.

The BioAnnotator tool described earlier identifies“biomedical concepts” corresponding to nounphrases that contain one or more keywords in oneor more UMLS thesauruses (including MeSH key-words). An evaluation against the GENIA test bed of670 MEDLINE abstracts44 produced standard preci-sion, recall, and F-values45 of 0.87, 0.94 and 0.90, re-spectively, for approximate matching (i.e., findingphrases that have any GENIA term in them; for a fullreport, see Reference 34). These are reasonable fig-ures, although identifying more specific categoriesof terms, such as specific MeSH or UMLS terms, maybe more difficult and is the subject of ongoinginvestigation.

An unpublished evaluation of the ChemFrag anno-tator evaluated the accuracy of identification of or-ganic chemical names in a manually annotated testbed of ten patent documents. The evaluation pro-duced precision, recall, and F-values of 0.91, 0.94 and0.92, respectively. These are also reasonable figuresalthough the sample is small and there is consider-

able room for improvement. Note that identificationin this case means only categorizing names as any“organic chemical names,” and does not mean iden-tifying the structure of that compound (see Refer-ence 35).

Regardless of these results, information extraction(entity names and relations) is very difficult, and avery active and challenging domain of research. Afew examples can indicate the difficulties. Annota-tors need to identify lexical variants of an entity namebased on case and stemming. A case variant is ex-emplified by “trk A” vs. “Trk A.” Handling stem-ming variants means identifying the common stemsexpressed in variations resulting from factors like plu-ralization, as in “actin filament(s),” or by tense vari-ation as in “bind(s/ing).” The entity annotators de-scribed earlier include an approach for handling caseand stemming variants, but there are also ambigu-ities that require taking the larger context into ac-count. In a biomedical context, case may carry in-formation; for example, the term “BIKE” capitalizedis likely to be a gene name, whereas “bike” or “Bike”is likely to be the vehicle. Annotators need to be ableto take into account the context in which ambiguousterms occur in order to resolve the ambiguity.

Annotators need to be able to handle partitive phrasevariants, that is, terms that are phrases, and that maymatch part of larger phrases of biomedical interest.A partitive phrase variant is exemplified by findingthe term “dendritic” in the phrase “dendritic struc-tures” in a MEDLINE abstract, but not finding thisspecific phrase in MeSH, while many other phraseswith the term “dendritic” (e.g., “dendritic cells”) dooccur in MeSH. The entity annotators described ear-lier can also handle partitive phrase variants, butagain, there are policy issues to resolve. In somecases, we may want only strict matching of phrasesrelative to some resource like MeSH, whereas inother cases, we may want to be made aware ofphrases that contain biomedically relevant terms(e.g., “dendritic structures”) even if these specificphrases do not actually appear in MeSH. The Bio-Annotator uses this criterion to identify “biologicallyinteresting” phrases.

Annotators need to be able to resolve other typesof word-sense ambiguity. Names of entities can beambiguous without taking context into account. Thisis especially true of gene names, which often do notconform to simple naming conventions, and whichcan be ambiguous with respect to other common En-glish language terms. A simple dictionary lookup


finds the gene names “FHC” or “OF” in the followingsentences. Both gene names are represented in agene dictionary based on LocusLink:32

● “Recent genotype-phenotype correlation studiesin familial hypertrophic cardiomyopathy (FHC)have revealed that some mutations in the beta-my-soin heavy chain (BMHC) gene may be associatedwith a high incidence of sudden death and a poorprognosis.” (from PubMeD** PMID 8655135)

● “BACKGROUND OF THE INVENTION”(from a section title of a US patent)

In the first sentence, “FHC is also an acronym for “fa-milial hypertrophic cardiomyopathy,” and in this con-text “FHC” is not a gene name. The “OF” gene in thesecond example is actually a preposition in a patentdocument title, which is capitalized. These are ex-amples of word-sense ambiguity, where the sameterm can mean different things depending on the con-text in which it occurs. This is a well-known and clas-sic problem in NLP.8 Identifying the names of en-tities often requires additional word-sensedisambiguation processes, which typically entail an-alyzing the context in which names occur. The Bio-Annotator annotator includes methods that use cer-tain types of features to select the correct word sense.An example is exemplified in the PubMeD sentenceabove: “. . . in the beta-mysoin heavy chain (BMHC)gene. . .” The presence of the term “gene” in thephrase “. . . (BMHC) gene” is additional evidence that“BMHC” is indeed the name of a gene.

BioTeKS includes annotators that compare the re-sults of multiple annotators and choose the mostplausible annotation. For example, the POS taggingand shallow-parsing annotators assign the syntacticannotation “PREPOSITION” to the “OF” keyword inthe patent example above. A disambiguation rulethat requires gene name annotations to apply onlyto keywords that are annotated with the syntactic an-notation “NOUN” will reject the “GENE” entity nameannotation for this string annotated as a “PREPOSI-TION.” Similarly for the “FHC” gene, an abbrevia-tion annotator can associate phrases like “familialhypertrophic cardiomyopathy” with acronyms like“FHC” in close textual proximity.46 A disambigua-tion rule gives precedence to the “ABBREVIATION”annotation, compared to the “GENE” entity name.Disambiguation rules must be written for a varietyof cases, and we are exploring these in the BioTeKSproject.

Like entity extraction, evaluating the accuracy of ex-tracted relations is quite difficult. It is necessary todevelop test beds with real examples of relationsagainst which to compare relation extractors. TheBioTeKS project has data for two evaluations of re-lation extraction. One evaluation focused on protein-protein interactions in a manually annotated set of573 MEDLINE abstracts based on a protein interac-tion database developed by the Munich InformationCenter for Protein Sequences.47 We conducted ananalysis of mutual co-occurrence, and filtered the un-named relations by selecting relations with “protein”entity names as subject and object nouns. This re-sulted in precision and recall measures of 0.30 and0.40 respectively.48 An analysis of the errors pointsto the complexity of identifying protein names andlimitations of the MEDLINE abstracts, as opposed tofull source literature. For example, we identifiedproblems relating to handling complicated synonyms,such as the synonyms for “SRV2,” which are “Srv2p,”“SRV2p,” “CAP,” “CAP1,” and so forth. We devel-oped functions to identify protein complexes, suchas those represented as “prot1-prot2,” which also ex-press protein interactions.

Other problems are related to complex syntax, as ina sentence like “Proteins A, B and C interact withD.” In this case, mutual co-occurrence might findall pair-wise relations among A, B, C and D, but onlythose connecting A, B and C with D may be valid.We discovered that many of the relations we detectedwhich had not been tabulated actually were discussedin the abstracts, but had not been tabulated by theMunich Information Center for Protein Sequencesdatabase because they were not reports of new ex-periments. Finally we eliminated missing relations,which were not discussed in the abstracts but onlyin the full papers. After taking all of these factorsinto account, we found a more respectable precisionof 0.91 and recall of 0.84.

A second evaluation involved finding MEDLINE doc-uments (and sentences in them) that describe genefunctions for a target gene. We alluded to this task inour earlier discussion of the NIST Genomics track.42

The track organizers created a test bed of 500 000MEDLINE abstracts that were associated with man-ually created Gene RIFs describing the functions as-sociated with genes described in the abstracts. Par-ticipants in the Genomics track had two tasks. Thefirst task was to find relevant abstracts and Gene RIFs,given a target gene, where a relevant abstract is onethat describes a function of the gene and, in partic-ular, is cited in a Gene RIF entry for that gene. For


the 50 genes provided as test queries by NIST, theBioTeKS team achieved a mean average precisionof 0.28. The second task was to automatically extractthe best sentence describing a gene�s function froman article known to describe the gene�s function.Gene RIF descriptions (created by biology experts)were used as a “gold standard” for describing genefunctions. The BioTeKS team achieved .505 on ameasure of similarity computed between the textsummaries generated by the BioTeKS system foreach MEDLINE abstract (a sentence selected from theabstract), and the “standard” description containedin the Gene RIF text.42

Relation extraction based on co-occurrence statis-tics, syntax, or even graph mining is a good start butmay not be abstract enough to represent the deepersemantics of what is being asserted or predicatedabout the entities named in the relations. Annota-tors need to map varied forms of relationship expres-sion to a common underlying relationship. For ex-ample, a purely syntactic analysis would extract asseparate relations the following active and passiveforms of a relationship: “Trk A expression is inhib-ited by factor X” vs. “Factor X inhibits Trk A expres-sion.” To extract the common underlying relation-ship, a top-down approach may be necessary, usinga more knowledge-based and ontology-based ap-proach for representing relations. The analysis of se-mantic relations is an active area of research in thebioinformatics and computational linguistic researchcommunities.17,49,50

Indexing and accessing text annotations. The textannotators described in the previous sections all pro-duce CAS annotations. Typically, these annotationsneed to be stored for each document in the collec-

tion so that they can be accessed easily for furthercollection-level text-mining processing and applica-tions. In UIMA, CAS annotations are converted toother forms of data by implementing CAS consum-ers. Examples of CAS conversions in BioTeKS areshown in Table 2, and include extractors which cre-ate formatted input for indexing a database and text-search engine, noun phrases for a clustering engine(described later), and an XML file of the annotationsrepresented as XML tags embedded in the source doc-ument (and viewable as Web pages, as in Figure 6).

CAS consumers iterate, filter, and translate CAS an-notations into other formatted data, including stringsand flat files. The details depend on the text-anal-ysis data needed by applications. For example, thedatabase schema alluded to in the figure includes atable for storing extracted terms and their offsets (lo-cations) in documents. This information is used tocompute mutual co-occurrence statistics for text min-ing applications.40 In other cases, storing text-anal-ysis results is convenient for performance reasons.For example, the noun phrases used for documentclustering can be extracted in real time for an ad hoccollection of documents (e.g., resulting from asearch), but computing and storing them ahead oftime in a specialized file format improves overall clus-tering performance.

Biomedical ontology resources. An ontology is aconceptual framework for defining the basic classesof entities in some domain of knowledge, the rela-tionships these entities have to each other, and theorganization of concepts in terms of higher-level con-cepts, typically taxonomic in nature.51,52 The term on-tology is often used to refer to a range of linguisticand conceptual resources, from thesauri and dictio-

Table 2 Examples of CAS conversions in BioTeKS

1. CAS to index specification: Generates an indexing input for a text search engine. The text search engine used in BioTeKSindexes conventional content keywords indexed in the full-text index, sentence-boundary annotations, and a specifiable setof “semantic” annotations extracted by the text annotators.

2. CAS to database index: Generates a database load file for batch database indexing of text annotations. The databaseschema indexes document meta-data (e.g., title, author, date, etc.), extracted terms and statistics (e.g., location in sourcedocument), and semantic identifying information for terms (e.g., MeSH identification code).

3. CAS to noun phrases: Generates a formatted flat-file representation of indexing features needed as input for application-level clustering of document collections, such as that resulting from a text or database search. These cluster features couldalso be stored in a database.

4. CAS to XML file: Generates XML for the document text, with text annotations represented as inline XML-tagged stringsand used for viewing entity annotations (e.g., MeSH terms) in a Web browser, with annotated terms highlighted and linkedto their CAS annotation data structures. (see Figure 6)


naries containing records defining key terms, syn-onyms, and so forth in some domain, to taxonomiesthat classify names and phrases in higher-level cat-egories, and formal knowledge representations thatmight support automatic inferences and certain typesof reasoning.

In biology, an ontology like MeSH33 provides a setof broad-based, multidisciplinary concepts and cat-egories that biomedical experts (librarians) use toannotate the content of literature in terms of key con-cepts describing genes, proteins, cell function, an-atomical objects, diseases, and so on. The Gene On-tology41 (GO) is a more specialized and semanticallyricher ontology that organizes knowledge aboutgenes and cell functions associated with genes. Re-lations may be taxonomic (e.g., “Trk A” is a “recep-tor”), or part-whole (e.g., “cell membrane” is partof “cell”).41 Ontologies have to be built and main-tained, and this is a difficult, labor-intensive, and ex-pertise-intensive task. The National Library of Med-icine supports major ontology development efforts(the UMLS metathesaurus33,53), as do many other re-search institutions (e.g., the GENIA Project11).

As discussed earlier, we use MeSH to create a lookupdictionary of terms expressing biomedical concepts.Dictionaries are not the only way that entity namescan be identified (rules and machine-learning meth-ods are among the alternatives), but they are an effec-tive first approximation. Ontology information canalso be stored in a relational database to support da-tabase queries on this information, and BioTeKS in-cludes software components for loading MeSH andUMLS information into database schemas.

Application prototypes and application-enabling enginesThe value of the annotations produced by BioTeKSis a function of the applications that use these an-notations. In this section, we briefly highlight sev-eral application prototypes we have built withBioTeKS to indicate how it can be used. We havefocused on two broad classes of applications: textsearching and text mining. Searching refers to find-ing collections of documents that meet some searchcriteria. Text mining refers to analyzing collectionsof documents at a more fine-grained level, to iden-tify, for example, relationships and trends among spe-cific topics and concepts expressed within documentsin a collection. Text mining can also be applied tothe results of a search or to collections of documentsderived from other kinds of processes. The follow-ing are several examples.

Semantic text search. One of the most basic appli-cation functions is searching for documents, for ex-ample MEDLINE abstracts, using biomedical terms.The focus in BioTeKS on generating rich semanticannotations of terminology in biomedical documentssuggests the opportunity for text search to index notonly keyword content, but also the semantic anno-tations associated with those keywords. That is, wewant to explore the value of indexing and searchingnot only keywords corresponding to a specific genename, such as “LMNA,” but also indexing and search-ing annotations associated with these keywords, suchas the category annotation “gene” associated with“LMNA.” BioTeKS uses an IBM Research text searchengine called Juru XML54–56 to explore the indexingand search of annotations generated by BioTeKSannotators.

Semantic search can refer to many things,8 but forour purposes it means simply indexing semantic in-formation (in our case expressed as annotations) as-sociated with keywords, and using this in the searchprocess. Juru XML can index and search both key-words (based on text content) and semantic anno-tations of text keywords expressed as XML tags forkeyword annotations, initially stored in CAS anno-tations.54,56 The results of a Juru XML search are re-turned as a list of documents or document compo-nents, ordered by their relevance to the original queryterms. For example, we can index all “sentences,”the entity names contained within the span of thesesentences, and other syntactic phrases (typically verbphrases) expressing biological functions of interest.This allows us to form queries on structures moreclosely approximating certain types of relations, suchas protein-protein interactions.

For example, the following query, expressed in XMLtags, will find sentences that contain keywords an-notated as “proteins” and syntactic phrases that con-tain keywords annotated as “biological function”(typically verb phrases like “binds to” or “inhibits”):

�Sentence�

�Protein�SRV2�/Protein�

�Function��/Function��Protein��/Protein�

�/Sentence�

In this query, the user wants to see MEDLINE abstractsthat contain sentences with a specific protein“SRV2,” any other protein, and terms and phrases


that have been annotated by BioTeKS as “biolog-ical functions.” The specific annotator used for an-notating sentences in this way is the DictionaryLookup annotator (see Table 1), using dictionariesof protein names and biological function terms. Thisquery does not ensure that the abstracts which arefound will contain an actual protein-protein inter-action, but pending evaluation, we believe such que-ries will greatly increase the likelihood of finding thiscombination of semantically annotated terms, andhence find actual interactions of interest.

Document clustering. Document clustering is a wayto organize document collections (such as those de-rived from search results) in topical clusters. Clus-tering can complement text search or any other func-tion that compiles collections of documents as partof an analytic process. We previously discussed anexample of document clustering in the context of theBio-Dictionary tool12 (shown in Figure 2). The roleof BioTeKS, as we indicated, is to extract text fea-tures, such as noun phrases, for input to the clus-tering-engine algorithm. These noun phrases are ex-tracted using the shallow linguistic parser, and thisannotator in turn uses tokenization and POS taggingannotators.

There are numerous clustering methods and tools,and we have experimented with several approachesin BioTeKS for different purposes. The clusteringengine in Figure 2 uses a so-called subspace projec-tion clustering method.57 We also have used this clus-tering engine to organize the results of the text searchengine described in the previous section. Other clus-tering methods available in BioTeKS include a mod-el-based hierarchical clustering tool58 and a clustertool called “SearchEssence,”59 which create a con-cept hierarchy from the collection of results returnedby a search and label the clusters using the namesof entities extracted by other annotators.

Association mining using MedTAKMI. MedTAKMI(Medical Text Analysis and Knowledge Manage-ment) is a version of the TAKMI* (Text Analysis andKnowledge Mining) system for mining relations andtrends among different categories of named entitiesin a text collection based on search.13 Users selecta collection of documents (e.g., MEDLINE abstracts)and one or more categories or subcategories of en-tity in an ontology, such as MeSH, and then invokean analysis of associations between these categoriesof entities. Association analysis can be presented toend users as a spreadsheet table. For example, a usermight want to view associations between two selected

term categories like “membrane proteins” and “or-gans and diseases.” Cells in the table correspondingto terms that co-occur relatively more frequently thansome background value are highlighted. Links canbe provided from a cell to documents with the cor-responding associations. Association mining andother MedTAKMI text-mining functions, includingtrend analysis, term distribution analysis, and so on,are described more fully in the paper by Uramotoet al. in this issue.60

Analyzing mutual co-occurrences among terms us-ing lexical networks. Lexical networks are visualiza-tions of the unnamed relations between extractedterms which we described earlier.40 Nodes in a net-work graph represent the names of entities (e.g., pro-teins), and the links represent relatively strong sta-tistical correlations between these entities within asentence or paragraph (e.g., protein-protein inter-actions). Lexical networks allow pair-wise term re-lations computed in documents to be compiled intolonger sequences that span multiple documents. Themeaning of these longer sequences is not always clearand must be treated cautiously, but the links do pro-vide clues to possible interesting connections. Insome cases, these connections are genuine discov-eries. For example, we replicated Swanson�s discov-ery of a connection between “Raynaud�s disease” and“fish oils” using lexical networks,61–64 and we haveexplored the use of lexical networks in life-sciencedemonstration prototypes. Figure 7 shows two ex-amples of lexical networks. Term-to-term relationsare generated from text-analysis data, which is ex-tracted using BioTeKS text annotators. Numbers arestrengths of associations. One network shows con-nections between a set of protein interactions ex-tracted from a set of 600 MEDLINE documents. Forexample, “TIP20” is linked to “Sec20P” with an as-sociation strength of 65. The latter in turn is linkedto “Sec22.” Other biomedical phrases, such as “Golgicomplex,” also co-occur with these protein namesin MEDLINE abstracts. The second network showsterms that co-occur with the drug name “tamoxifen”in the context of a collection of documents retrievedfrom a text search engine using the query “breastcancer.” The connection between “tamoxifen” and“EGF” (“epidermal growth factor,” circled in red)is of special interest.

In BioTeKS, mutual co-occurrence computations arebased on text annotators that extract terms (e.g., pro-tein names like “TIP20” and chemical names like“tamoxifen”) and their location in documents. CASconsumers index these extracted terms for each MED-


LINE document in a relational database. Mutual co-occurrence is a text-mining method applied to a col-lection of MEDLINE documents and implemented asa database query on the text data stored in the da-tabase for this collection.40 These computations canbe performed for predefined collections or subcol-lections of documents and on the unnamed relationsalso stored in a relational database for later accessby an application.

Analyzing gene clustering using the MedMeSHSummarizer. MedMeSH is an application prototypethat takes as input a cluster of gene names derived

from a micro-array expression analysis, retrievesMEDLINE abstracts that contain those genes, and clus-ters MeSH terms extracted from those abstracts inways intended to suggest the functions associatedwith these genes, at least insofar as these have beenexpressed in the literature as MeSH keywords. Theidea is to organize MeSH keywords into topical clus-ters that can help to explain why these genes reactcollectively.65 Figure 8 shows a sample screen shotof a MedMeSH Summarizer cluster analysis. Thegene cluster is derived from gene expression dataobtained from micro-array experiments. Keywords

Figure 7 Lexical networks showing terms and relations derived from a collection of text documents


in the TERMS window are MeSH terms extracted andclustered from MEDLINE abstracts containing thegene names in the cluster. The figure shows thatterms like “glucose,” “yeast,” and “hexokinase” areassociated with the given set of genes. These termsindicate that these genes are involved in glycolysis,the process by which glucose is broken down in thecells of all higher animals.

The input genes are shown as the nodes of a graph.The input gene cluster for this example is a groupof yeast genes encoding the enzymes involved in gly-colysis. If the similarity between two genes is greaterthan the threshold (which can be specified by the enduser), an edge is drawn between the nodes. Clickingon an edge shows the terms common to the genes,as expressed in MEDLINE abstracts that refer to the

genes in the graph. For example, the terms commonbetween TDH1 and TDH2 are shown in the popupwindow when you click on the link shown in red.

Building applications using BioTeKS. The applica-tion prototypes we have described all have a similargeneral design. Each application has one or morefunctions, such as document search, document clus-tering, or text mining of associations or statistical co-occurrences between terms across a collection of doc-uments. Each function requires structured text dataas input and computes new information based on it.The input text data is extracted using one or moretext annotation methods executed by BioTeKS. Typ-ically, text-mining methods apply to text data for acollection of documents, and therefore, text datamust be accumulated across documents in a collec-

Figure 8 Sample results of cluster analysis by MedMeSH Summarizer


tion and stored in a persistent form. The role ofBioTeKS is to apply a sequence of text annotators(orchestrated in the “Text Analysis Engine” in Fig-ure 3) to documents in a collection, and to providethe translation methods (CAS consumers) to createtext data formats (e.g., database load files) to indexthe text data in a form that allows it to be easily ac-cessed by upstream text-search and text-mining appli-cations, such as MedTAKMI or document clustering.

In this sense, BioTeKS is a middleware system thatcan be used to build applications and that can be usedfor many application-level tasks. Many important de-tails of how BioTeKS is used will vary with differenttarget application requirements. For example, dif-ferent sets of annotators will be needed for differentapplication-level functions. We also anticipate thattext annotators built for BioTeKS will need to be cus-tomized or extended for specific research or poten-tial customer applications. We know, for example,that pharmaceutical companies have internal knowl-edge resources including document collections andthesauri of terms and concepts proprietary to howthese companies conduct research and development.Using BioTeKS to solve specific problems would re-quire incorporating additional resources (dictionar-ies, etc.) into the BioTeKS annotators.

The value of BioTeKS in this applied setting is two-fold. Firstly, it casts each application scenario intoa similar sequence of application-design consider-ations—each application implies one or more typesof documents (we have focused on MEDLINE ab-stracts), one or more types of text data, a configura-tion of one or more text annotators that can extractrelevant types of text data, and one or more CAS con-sumers for translating annotations into persistentforms for convenient access by upstream applica-tions. The database model implied for the text-min-ing applications discussed earlier may or may not beuseful for other applications. Secondly, the value ofBioTeKS is in the quality of its specific text anno-tation methods. The development and refinementof these methods is an ongoing process, involvingboth software engineering and basic research.

ConclusionsBioTeKS is a significant technical initiative withinthe IBM Research laboratories to integrate and cus-tomize a broad suite of text-analysis projects andtechnologies targeting problems in the domain ofbiomedical text analysis. The goal of the BioTeKStext-analysis methods is to convert initially unstruc-

tured text information into structured text data, com-mensurate with structured data derived from sourcesother than text (e.g., gene names in micro-array ex-periments, or drug, treatment, and disease referencesin clinical records). The domain of text-analysis prob-lems that BioTeKS addresses is very broad, and manyproblems are the subject of basic research initiativesin industry and academic institutions.

Accordingly, both BioTeKS and UIMA are works inprogress. Future work will focus on research issuesand on solving real-world problems. Research willfocus on improving the quality of BioTeKS anno-tators for both biomedical entities and the facts andrelations associated with them. We need to betterexploit emerging biomedical ontology resources inthe process of IE. In addition, we believe machinelearning techniques hold great promise for improv-ing the quality of IE. Improving text-analysis qualityalso requires the development of, or access to, testbeds or “gold standards” for correct identificationof biomedical entities and relations. Lack of such testbeds is a problem in the bioinformatics research com-munity, as we noted earlier (see Reference 10), andit is equally problematic in analyzing clinical recordsand patents. Finally, a key requirement for makingprogress is the use of text analysis for the kinds ofreal-world drug-discovery and biomedical-discoverytasks surveyed in the introduction. The domain ofpotential IE problems is very large, and progress willrequire focus and feedback that we believe impliessignificant collaboration with domain experts (andproblem solvers) in biomedical research and devel-opment organizations in both academia and industry.

In this context, BioTeKS has two key advantages thatmake it a useful platform for pursuing research inbiomedical text analysis. First and foremost,BioTeKS is implemented using the UIMA framework.This means that BioTeKS can support serious ex-perimentation in the development of text-analysismethods. Second, IBM Research has invested in coretext-analysis technologies, including machine learn-ing approaches. Combined with the UIMA frame-work, BioTeKS provides a broad-based platform fortackling all the key text-analysis problems, whose so-lutions are essential to progress in life-science re-search and development.

Acknowledgments

BioTeKS is in large part a systems integration effortthat builds on technologies and expertise developed


by many groups in IBM Research. All these groupsgenerously provided their time and expertise to de-liver code to the BioTeKS team, teach us how to useit, and in some cases modify their code for our needs.These groups include the UIMA / JEDII team in Haw-thorne (David Ferrucci, Adam Lally, and Jerry Cwik-lik), the TALENT text analysis team (Roy Byrd, MaryNeff, Rie Ando, Bran Boguraev, and Youngja Park),the machine learning team (David Johnson, SylvieLevesque, Tong Zhang, Frank Oles, and BrianWhite), and research management at various levelsfor support and technical direction (Alan Marwick,Marshall Schor, Arthur Ciccolo, Ajay Royyuru, andKohichi Takeda). Additional colleagues at otherIBM Research locations were involved in various ap-plications cited in this paper. These colleagues in-clude Ravi Kothar, Pankaj Kankar, Biplav Srivas-tava, Vishal Batra, Krishna Kummamuru, andPasumarti Kamesam at India Research; Jeff Kreulen,Dan Gruhl, and Shivakumar Vaithaynathan at Al-maden Research; and Tetsuya Nasukawa at TokyoResearch. We also thank Tony Eng (MIT graduatestudent) for useful comments on early drafts of thispaper, and four other anonymous reviewers whosecareful reading greatly improved this paper. Col-leagues in business units also provided opportuni-ties for learning about real-world customer require-ments. These include Michael Hehenberger, AvijitChatterjee, Usha Reddy, Edwin Hilpert, StephenBoyer, John Christina, Peter Libal, Thilo Gotz, Tho-mas Hampp, Deb Sutherland, Lance Snow, Eric Will,Jeff Tenner, Dave Herbeck, Erik Voldal, Bryan Tou-sley, Joanna Batstone, Beth Smith, and of course,Carol Kovac.

*Trademark or registered trademark of International BusinessMachines Corporation.

**Trademark or registered trademark of United States NationalLibrary of Medicine, Institut Suisse de Bioinformatique (SIB)Foundation, Intel Corporation, or Linus Torvalds.

Cited references and notes

1. Pharma2010: The Threshold of innovation, IBM Corporation(2002), http:// www-1.ibm.com / industries / healthcare / doc /content/resource/thought/390030105.html.

2. W. C. Swope, “Deep Computing for the Life Sciences,” IBMSystems Journal 40, No. 2, 284–262 (2001), http://www.research.ibm.com/journal/sj/402/swope.html.

3. J. Augen, “The Evolving Role of Information Technology inthe Drug Discovery Process,” Drug Discovery Today 7, No.5, 315–323 (March 2002).

4. D. Ferrucci and A. Lally, “Building an Example Applicationwith the Unstructured Information Management Architec-ture,” IBM Systems Journal 43, No. 3, 455–475 (2004, thisissue).

5. D. Ferrucci and A. Lally, “Accelerating Corporate Researchin the Development, Application and Deployment of HumanLanguage Technologies,” Proceedings of the Workshop on Soft-ware Engineering and Architecture of Language Technology Sys-tems (SEALTS), Edmonton, CA (May 31, 2003).

6. A. D. Marwick, “Knowledge Management Technology,” IBMSystems Journal 40, No. 4, 814–830 (2001).

7. R. Mack, R. Byrd, and Y. Ravin, “Knowledge Portals andthe Emerging Digital Knowledge Workplace,” IBM SystemsJournal 40, No. 4, 925–955 (2001).

8. D. Jurafsky and J. Martin, Speech and Language Processing:An Introduction to Natural Language Processing, Computa-tional Linguistics, and Speech Recognition, Prentice-Hall, Sad-dle River, NJ (2000).

9. R. Baeza-Yates and B. Ribeiro-Neto, B. (Editors) ModernInformation Retrieval, ACM Press, New York (1999).

10. C. Blaschke, L. Hirschman, and A. Valencia, “InformationExtraction in Molecular Biology,” Briefings in Bioinformatics3, No. 2, 154–165 (June 2002).

11. J. Tsujii, “Tutorial on Information Extraction in BiologicalSciences,” Proceedings of the Pacific Symposium on Biocom-puting (2001), http://www-tsujii.is.s.u-tokyo.ac.jp/�genia/tutorial/index.html#psb2001.

12. L. Hunter, On-line Course Notes on Bioinformatics, Centerfor Computational Pharmacology, University of ColoradoHealth Sciences Center, http://compbio.uchsc.edu/hunter/.

13. T. Nasukawa and T. Nagano, “Text Analysis and KnowledgeMining System,” IBM Systems Journal 40, No. 4, 967–984(2001).

14. W. Cody, J. Kreulen, V. Krishna, and W. S. Spangler, “TheIntegration of Business Intelligence and Knowledge Manage-ment,” IBM Systems Journal 41, No. 4, 697–713 (2002).

15. J. Chang, S. Raychaudhuri, and R. Alman, “Including Bio-logical Literature Improves Homology Search,” Proceedingsof the Pacific Symposium on Biocomputing, World Scientific,River Edge, NJ, 374–383 (2001).

16. S.-K. Ng and M. Wong, “Toward Routine Automatic Path-way Discovery from Online Scientific Text Abstracts,” Ge-nome Informatics 10, 104–112 (1999).

17. T. Ono, H. Hishigaki, A. Tanigami, and T. Takagi, “Auto-mated Extraction of Information on Protein-protein Inter-actions from the Biological Literature,” Bioinformatics 17,155–161 (2001).

18. T.-K. Jenssen, A.-L. Komorowski, and E. Hovig, “A Liter-ature Network of Human Genes for High Throughput Anal-ysis of Gene Expression,” Nature Genetics 28, 21–28 (May2001).

19. T. Huynh, I. Rigoutsos, L. Parida, D. Platt, and T. Shibuya,“The Web Server of IBM�s Bioinformatics and Pattern Dis-covery Group,” Nucleic Acids Research 31, No. 13, 3645–3650(2003).

20. IBM Research, Computational Biology Center, http://www.research.ibm.com/bioinformatics/.

21. Swiss-Prot is available (along with other related databases)at http://www.expasy.org/sprot/.

22. Life Sciences Framework, IBM Corporation (October, 2003),http: // www-3 . ibm . com / solutions / lifesciences / solutions/framework.html.

23. DB2 Information Integrator for Content, IBM Corporation,http://www-3.ibm.com/software/data/eip/.

24. IBM LanguageWare Linguistic Engine, http://www-306.ibm-.com/software/globalization/topics/languageware/index.jsp.

25. The LanguageWare linguistic engine provides dictionariesfor several languages. A legacy version, LanguageWare v2,


was embedded in the pre-UIMA Textract tool, available inthe IM4T toolkit.

26. Intelligent Miner for Text Product Review, IBM Corporation(February, 1997), http://www-3.ibm.com/software/data/iminer/fortext/.

27. M. Neff, R. Byrd, and B. Boguraev, “The Talent System: TEX-TRACT Architecture and Data Model,” Proceedings of theWorkshop on Software Engineering and Architecture of Lan-guage Technology Systems (SEALTS), Edmonton, CA (May31, 2003).

28. The default training corpus used to train the POS tagger isavailable from the Language Data Consortium (http://www.ldc.upenn.edu/) and is based on non-biomedical text contentand style.

29. “IBM, SAS join to help automakers to comply with TREADact,” Computer World (April 3, 2003).

30. M. McCord, “Slot Grammar: A System for Simpler Construc-tion of Practical Natural Language Grammars,” in NaturalLanguage and Logic: International Scientific Symposium, Lec-ture Notes in Computer Science, R. Studer Editor, SpringerVerlag, Berlin (1990), pp. 118–145.

31. Unified Medical Language System (UMLS), http://www.nlm.nih.gov/research/umis.

32. LocusLink, http://www.ncbi.nlm.nih.gov/LocusLink/.33. National Library of Medicine, http://www.nlm.nih.gov/

libserv.html.34. L. Subramaniam, S. Mukherjea, P. Kankar, B. Srivastava, V.

Batra, P. Kamesam, and R. Kothari, “Information Extrac-tion from Biomedical Literature: Methodology, Evaluationand an Application,” Proceedings of the 2003 ACM CIKM Con-ference, New Orleans, LA (2003).

35. International Union of Pure and Applied Chemistry, http://www.chem.qmw.ac.uk/iupac/.

36. WEKA Machine Learning Project, http://www.cs.waikato.ac.nz/�ml/.

37. Personal communication, David E. Johnson, IBM ThomasJ. Watson Research Center.

38. T. Zhang, F. Damereau, and D. E. Johnson, “Text ChunkingBased on a Generalization of Winnow,” Journal of MachineLearning Research 2, 615–637 (2002).

39. T. Zhang, “Regularized Winnow Methods,” Advances in Neu-ral Information Processing Systems 13, 703–709 (2001).

40. J. Cooper and R. Byrd, “Lexical Navigation: VisuallyPrompted Query Expansion and Refinement,” Proceedingsof ACM DIGLIB97, Philadelphia, PA (1997), pp. 237–246.

41. M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. But-ler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T.Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis,S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M.Rubin, and G. Sherlock, “Gene Ontology: Tool for the Uni-fication of Biology,” Nature Genetics 25, 25–29 (2000).

42. E. Brown, A. Dolbey, and L. Hunter, “IBM Research and theUniversity of Colorado TREC 2003 Genomics Track,” Proceed-ings of the 12th NIST TREC Conference (November, 2003), http://trec.nist.gov/pubs/trec12/papers/ibm-brown.genomics.pdf.

43. A. Inokuchi and H. Kashima, “Mining Significant Pairs ofPatterns from Graph Structures with Class Labels,” Proceed-ings of the Third IEEE International Conference on Data Min-ing, Melbourne, FL (November 2003).

44. The GENIA project Web site provides links to several re-sources of Junichi Tsujii, http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/.

45. These three measures are standard measures of accuracy infinding specified categories of objects in search or informa-tion extraction. “Recall” is the percentage of correct iden-

tifications relative to all possible correct answers in the col-lection. “Precision” is the percentage of correct identificationrelative to all correct answers provided by the system. TheF-value is derived from and tries to balance the trade-off im-plied in precision and recall measures (see Reference 8, page578).

46. P. Youngja and R. Byrd, “Hybrid Text Mining for FindingTerms and Their Abbreviations,” Proceedings of the 2001 Con-ference on Empirical Methods in Natural Language Processing(EMNLP-2001), Carnegie Mellon University, Pittsburgh, PA(June 2001), http://www.cs.cornell.edu/home/llee/emnlp.html.

47. Munich Information Center for Protein Sequences, http://www.mips.biochem.mpg.de/.

48. J. Cooper, “An Evaluation of Unnamed Relations in Discov-ery of Protein-Protein Interactions,” Presented at ACM SI-GIR 2003, Workshop on Text Analysis and Search for Bioin-formatics, Toronto, CA (2003).

49. C. Blaschke, M. Andrade, C. Ouzounis, and A. Valencia, “Au-tomatic Extraction of Biological Information from ScientificText: Protein-Protein Interactions,” Proceedings of the Inter-national Conference on Intelligent Systems in Molecular Biol-ogy (1999), pp. 60–67.

50. T. Rindflesch, L. Tanabe, J. Weinstein, and L. Hunter,“EDGAR: Extraction of Drugs, Genes and Relations fromthe Biomedical Literature,” Proceedings of the 5th Pacific Sym-posium on Biocomputing (2000), pp. 538–549.

51. J. Sowa, “Conceptual Structures: Information Processing inMind and Machine. Reading,” Addison-Wesley, Reading, MA(1984).

52. M. Gruninger and J. Lee, “Ontology Applications and De-sign, Introduction,” ACM Communications 45, No. 2, 39–41(February 2002).

53. B. Humphreys, D. Lindberg, H. Schoolman, and G. Barnett,“The Unified Medical Language System: An Information Re-search Collaboration,” Journal of the American Medical In-formatics Association 5, 1–11, (1998).

54. D. Carmel, E. Amitay, M. Herscovici, Y. Maarek, Y. Petruschka,and A. Soffer, “Juru at TREC 10 - Experiments with Index Prun-ing,” Proceedings of the 10th Text Retrieval Conference (2001).

55. D. Carmel, Y. Maarek, M. Mandelbrod, Y. Mass, and A. Sof-fer, “Searching XML Documents via XML Fragments,” Pro-ceedings of the Twenty-Sixth Annual International ACM SIGIRConference on Research and Development in Information Re-trieval (SIGIR 2003), Toronto, Canada, (August 2003), pp.151–158.

56. The Juru text search engine is described in http://www.haifa.il.ibm.com/km/ir/juru/ and the version of Juru for indexingand searching XML fields is described in “Juru XML - anXML retrieval system at INEX�02,” http://qmir.dcs.qmul.ac.uk/inex/Slides/YosiMass_etal_talk.pdf.

57. R. Y. Ando, B. Boguraev, R. Byrd, and M. Neff, “Multido-cument Summarization by Visualizing Topic Content,” Pro-ceedings of ANLP/NAACL Workshop on Automatic Summa-rization, Seattle, WA (2000), pp. 79–88.

58. S. Vaithyanathan and B. Dom, “Model-Based HierarchicalClustering,” Proceedings of UAI-2000, Stanford (2000), http://citeseer.nj.nec.com/vaithyanathan00modelbased.html.

59. K. Krishna and R. Krishnapuram, “A Clustering Algorithmfor Asymmetrically Related Data with its Applications to TextMining,” Proceedings of the 2001 ACM CIKM Conference, At-lanta, GA, (2001), pp. 571–573.

60. N. Uramoto, H. Matsuzawa, T. Nagano, A. Murakami, H.Takeuchi, and K. Takeda, “A Text-Mining System for Knowl-edge Discovery from Biomedical Documents,” IBM SystemsJournal 43, No. 3, 516–533 (2004, this issue).


61. S. Grell, “Information Retrieval in Life Sciences: How to Dis-cover Relations between Concepts,” Unpublished master�sthesis, University of Heidelberg (December, 2001).

62. R. Mack and M. Hehenberger, “Text-Based Knowledge Dis-covery: Search and Mining of Life-Sciences Documents,” DrugDiscovery Today, 7, 89–98 (2002).

63. D. Swanson, “Implicit Text Linkages between MEDLINERecords: Using Arrowsmith as an Aid to Scientific Discov-ery,” Library Trends 48, No. 1, 48–59 (1999).

64. M. Weeber, H. Klein, T. W. Lolkje, and L. de Jong-van denBerg, “Using Concepts in Literature-Based Discovery: Sim-ulating Swanson�s Raynaud-Fish Oil and Migraine-Magne-sium Discoveries,” Journal of the American Society for Infor-mation Science and Technology 52, No. 7, 548–557 (2001).

65. P. Kankar, S. Adak, A. Sarkar, K. Murari, and G. Sharma,“MedMeSH Summarizer: Text Mining for Gene Clusters,”Proceedings of the Second SIAM International Conference onData Mining, Arlington, VA, (2002), pp. 548–565.

Accepted for publication April 18, 2004.

Robert Mack IBM Research Division, Thomas J. Watson Re-search Center, 19 Skyline Drive, Hawthorne, N.Y., 10532([email protected]). Dr. Mack is a research staff memberat the IBM Watson Research lab. He received his Ph.D. degreein cognitive and experimental psychology at the University ofMichigan in 1981. He has worked on a wide range of softwaresystems, both as a user-interface and human-computer-interac-tion specialist, and as a project leader for prototype systems andmiddleware for applications relating to digital libraries, digitalvideo archive and search, and knowledge portals. Dr. Mack andhis team are currently working on applying text mining to sup-port knowledge discovery in life science.

Sougata Mukherjea IBM Research Division, India ResearchLab., Block 1, IIT Hauz Khas, New Delhi, India 110017([email protected]). Dr. Mukherjea is a research staff mem-ber at the IBM India Research Lab. He received a bachelor�s de-gree from Jadavpur University in Calcutta, a master�s degree fromNortheastern University in Boston, and a Ph.D. degree from theGeorgia Institute of Technology in Atlanta (all in computer sci-ence). Before joining IBM, he held research and software archi-tecture positions in companies in the Silicon Valley (California)including NEC USA, BEA Systems, and Verity. His research in-terests include information visualization and retrieval and appli-cations of text mining in areas such as Web search and bioinfor-matics.

Aya Soffer IBM Research Division, Haifa Research Laboratory,University Campus, Haifa, Israel 31905 ([email protected]). Dr. Sof-fer is a research staff member at the IBM Research Lab in Haifa,Israel, where she manages the Information Retrieval group. Shereceived her M.S. and Ph.D. degrees in computer science fromthe University of Maryland at College Park in 1992 and 1995,respectively. Before joining IBM, Dr. Soffer was a research sci-entist at the NASA Goddard Space Flight Center, where sheworked on digital libraries for earth science data. Her researchinterests include information retrieval, pictorial information sys-tems, multidimensional indexing, and non-traditional databasesystems. Dr. Soffer and her group have developed several knowl-edge management solutions for IBM and Lotus products, amongthem the WebSphere* Portal�s search engine.

Naohiko Uramoto IBM Research Division, Tokyo Research Lab-oratory, 1623-14, Shimotsuruma, Yamato, Kanagawa, Japan([email protected]). Dr. Uramoto joined the IBM Tokyo Re-search Laboratory in 1990 after receiving a master�s degree incomputer science from Kyushu University. He received a Ph.D.degree in computer science from Kyushu University in 1990. Hehas been a visiting associate professor at the National Instituteof Informatics since 2000. His research interests include naturallanguage processing, text mining, XML and meta-data, and in-formation integration.

Eric Brown IBM Research Division, Thomas J. Watson ResearchCenter, 19 Skyline Drive, Hawthorne, N.Y., 10532([email protected]). Dr. Brown is a research staff member at theWatson Research Center. He received a B.S. degree from theUniversity of Vermont and M.S. and Ph.D. degrees from the Uni-versity of Massachusetts at Amherst, all in computer science. Dr.Brown joined IBM in 1995 and conducts research in a variety ofinformation-retrieval and text-analysis areas, including paralleland distributed information retrieval, question answering, text cat-egorization, applications of speech recognition in knowledge man-agement, and text analysis and search for biomedical text.

Anni Coden IBM Research Division, Thomas J. Watson ResearchCenter, 19 Skyline Drive, Hawthorne, N.Y., 10532([email protected]). Dr. Coden is a research staff member at theWatson Research Center. She received her Ph.D. degree in com-puter science from the Massachusetts Institute of Technology in1981 and continued there as a research scientist. She joined theIBM Research Division in 1982, and has worked in various areasof computer science. She started her career in developing toolsfor computer chip design and continued her work in IBM�s anti-virus project. More recently, she worked on algorithms and sys-tems in the area of digital multimedia libraries and on analyzingand searching speech transcripts in real-time video systems. Hercurrent investigations are in the area of automatically analyzingmedical records for knowledge discovery.

James Cooper IBM Research Division, Thomas J. Watson Re-search Center, 19 Skyline Drive, Hawthorne, N.Y., 10532([email protected]). Dr. Cooper is a research staff member atthe Watson Research Center, where he studies problems in textanalytics and object-oriented design. He received his Ph.D. de-gree in chemistry and has been working on problems in comput-ing and chemistry at IBM and elsewhere in the industry for morethan 20 years. He is a well-known writer, having published 15books in computer science, is a columnist for JavaPro Magazine,and is currently the most widely translated author at IBM Re-search.

Akihiro Inokuchi IBM Research Division, Tokyo Research Lab-oratory, 1623-14, Shimotsuruma, Yamato, Kanagawa, Japan([email protected]). Dr. Inokuchi joined the IBM Tokyo Re-search Laboratory in 2000 after receiving a master�s degree incommunication engineering from Osaka University. He receiveda Ph.D. degree in communication engineering from Osaka Uni-versity in 2004. His current focus is on data mining, machine learn-ing, text mining, chemo-informatics, bioinformatics, and their ap-plications.

Bhavani Iyer IBM Research Division, Thomas J. Watson Re-search Center, 19 Skyline Drive, Hawthorne, N.Y., 10532([email protected]). Ms. Iyer is a software engineer at theWatson Research Center. She received an M.B.A. degree in fi-


nance from the Asian Institute of Management in the Philippinesand an M.S. degree in computer science and information man-agement from Pace University. Her most recent project involvedthe application of text analytics and unstructured informationmanagement systems to the life-science domain. Other interestsinclude database management, distributed systems, data mining,and systems and application integration.

Yosi Mass IBM Research Division, Haifa Research Laboratory,University Campus, Haifa, Israel 31905 ([email protected]). Mr.Mass received a B.Sc. degree in computer science from Bar IlanUniversity in Israel in 1980 and an M.Sc. degree from the Weiz-mann Institute in Israel in 1993. Mr. Mass has worked for IBMsince 1996 at the Haifa Research lab as a researcher. Before join-ing IBM he worked in the Israeli high-tech industry for Orbotechand for Ubique. His research areas include virtual reality, visu-alization, security and trust, multi-user collaboration, XML, andinformation retrieval. In recent years, he has been working onan information retrieval engine for search in XML collections.

Hirofumi Matsuzawa IBM Research Division, Tokyo ResearchLaboratory, 1623-14, Shimotsuruma, Yamato, Kanagawa, Japan([email protected]). Mr. Matsuzawa joined the IBM TokyoResearch Laboratory in 1993 after receiving a master�s degreein software engineering from Waseda University. His experienceincludes 3D solid modeling, business intelligence, parallel OLAP,data mining, and text mining. His current focus is on data min-ing, text mining, and the application of grid architectures to lifescience.

L. Venkata Subramaniam IBM Research Division, India Re-search Lab., Block 1, IIT Hauz Khas, New Delhi, India 110017([email protected]). Dr. Subramaniam has been a researchstaff member at the IBM India Research lab since 1998. He re-ceived a B.E. degree in electronics and communications engineer-ing from the University of Mysore in India, an M.S. degree inelectrical engineering from Washington University in St. Louis,and a Ph.D. degree in electronics from the Indian Institute ofTechnology in New Delhi, in 1991, 1993, and 1998, respectively.His current research interests include statistical natural languageprocessing, text and data mining, and bioinformatics.