


72 1094-7167/01/$10.00 © 2001 IEEE IEEE INTELLIGENT SYSTEMS

The Semantic Web

Ontology Learning for the Semantic Web

Alexander Maedche and Steffen Staab, University of Karlsruhe

The Semantic Web relies heavily on formal ontologies to structure data for comprehensive and transportable machine understanding. Thus, the proliferation of ontologies factors largely in the Semantic Web’s success. Ontology learning greatly helps ontology engineers construct ontologies. The vision of ontology learning that we propose includes a number of complementary disciplines that feed on different types of unstructured, semistructured, and fully structured data to support semiautomatic, cooperative ontology engineering. Our ontology-learning framework proceeds through ontology import, extraction, pruning, refinement, and evaluation, giving the ontology engineer coordinated tools for ontology modeling. Besides the general framework and architecture, this article discusses techniques in the ontology-learning cycle that we implemented in our ontology-learning environment, such as ontology learning from free text, dictionaries, and legacy ontologies. We also refer to other techniques for future implementation, such as reverse engineering of ontologies from database schemata or learning from XML documents.

Ontologies for the Semantic Web

The conceptual structures that define an underlying ontology provide the key to machine-processable data on the Semantic Web. Ontologies serve as metadata schemas, providing a controlled vocabulary of concepts, each with explicitly defined and machine-processable semantics. By defining shared and common domain theories, ontologies help people and machines to communicate concisely—supporting semantics exchange, not just syntax. Hence, the Semantic Web’s success and proliferation depend on quickly and cheaply constructing domain-specific ontologies.

Although ontology-engineering tools have matured over the last decade,1 manual ontology acquisition remains a tedious, cumbersome task that can easily result in a knowledge acquisition bottleneck. When developing our ontology-engineering workbench, OntoEdit, we faced this problem directly: we were asked questions that dealt with time (“Can you develop an ontology quickly?”), difficulty (“Is it difficult to build an ontology?”), and confidence (“How do you know that you’ve got the ontology right?”).

These problems resemble those that knowledge engineers have dealt with over the last two decades as they worked on knowledge acquisition methodologies or workbenches for defining knowledge bases. The integration of knowledge acquisition with machine-learning techniques proved extremely beneficial for knowledge acquisition.2 The drawback to such approaches,3 however, was their rather strong focus on structured knowledge or databases, from which they induced their rules.

Conversely, in the Web environment we encounter when building Web ontologies, structured knowledge bases or databases are the exception rather than the norm. Hence, intelligent support tools for an ontology engineer take on a different meaning than the integration architectures for more conventional knowledge acquisition.4

In ontology learning, we aim to integrate numerous disciplines to facilitate ontology construction, particularly machine learning. Because fully automatic machine knowledge acquisition remains in the distant future, we consider ontology learning as semiautomatic with human intervention, adopting the paradigm of balanced cooperative modeling for constructing ontologies for the Semantic Web.5

The authors present an ontology-learning framework that extends typical ontology-engineering environments by using semiautomatic ontology-construction tools. The framework encompasses ontology import, extraction, pruning, refinement, and evaluation.

With this objective in mind, we built an architecture that combines knowledge acquisition with machine learning, drawing on resources that we find on the syntactic Web—free text, semistructured text, schema definitions (such as document type definitions [DTDs]), and so on. Thereby, our framework’s modules serve different steps in the engineering cycle (see Figure 1):

• Merging existing structures or defining mapping rules between these structures allows importing and reusing existing ontologies. (For instance, Cyc’s ontological structures have been used to construct a domain-specific ontology.6)
• Ontology extraction models major parts of the target ontology, with learning support fed from Web documents.
• The target ontology’s rough outline, which results from import, reuse, and extraction, is pruned to better fit the ontology to its primary purpose.
• Ontology refinement profits from the pruned ontology but completes the ontology at a fine granularity (in contrast to extraction).
• The target application serves as a measure for validating the resulting ontology.7

Finally, the ontology engineer can begin this cycle again—for example, to include new domains in the constructed ontology or to maintain and update its scope.

Architecture

Given the task of constructing and maintaining an ontology for a Semantic Web application such as an ontology-based knowledge portal,8 we produced support for the ontology engineer embedded in a comprehensive architecture (see Figure 2). The ontology engineer interacts only via the graphical interfaces, which comprise two of the four components: the OntoEdit Ontology Engineering Workbench and the Management Component. Resource Processing and the Algorithm Library are the architecture’s remaining components.

The OntoEdit Ontology Engineering Workbench offers sophisticated graphical means for manual modeling and refining of the final ontology. The interface gives the user different views, targeting the epistemological level rather than a particular representation language. However, the user can export the ontological structures to standard Semantic Web representation languages such as OIL (ontology interchange language) and DAML-ONT (DAML ontology language), as well as our own F-Logic-based extensions of RDF(S)—we use RDF(S) to refer to the combined technologies of the resource description framework and RDF Schema. Additionally, users can generate and access executable representations for constraint checking and application debugging through SilRi (simple logic-based RDF interpreter, www.ontoprise.de), our F-Logic inference engine, which connects directly to OntoEdit.

MARCH/APRIL 2001 computer.org/intelligent 73

[Figure 1. The ontology-learning process: import and reuse, extract, prune, refine, and apply, cycling over legacy and application data and the domain ontology.]

[Figure 2. Ontology-learning architecture for the Semantic Web: the Management Component and the OntoEdit Ontology Engineering Workbench front a Resource Processing Center (crawled Web documents, HTML, DTDs, XML schemata, legacy databases, imported ontologies, WordNet, and a natural-language-processing system) and an Algorithm Library, which exchange processed data, lexicons, and result sets with the ontology engineer.]

We knew that sophisticated ontology-engineering tools—for example, the Protégé modeling environment for knowledge-based systems1—would offer capabilities roughly comparable to OntoEdit. However, in trying to construct a knowledge portal, we found that a large conceptual gap existed between the ontology-engineering tool and the input (often legacy data), such as Web documents, Web document schemata, databases on the Web, and Web ontologies, which ultimately determine the target ontology. Into this void we have positioned new components of our ontology-learning architecture (see Figure 2). The new components support the ontology engineer in importing existing ontology primitives, extracting new ones, pruning given ones, or refining with additional ontology primitives. In our case, the ontology primitives comprise

• a set of strings that describe lexical entries L for concepts and relations;
• a set of concepts C (roughly akin to synsets in WordNet9);
• a taxonomy of concepts with multiple inheritance (heterarchy) HC;
• a set of nontaxonomic relations R described by their domain and range restrictions;
• a heterarchy of relations HR;
• relations F and G that relate concepts and relations with their lexical entries; and
• a set of axioms A that describe additional constraints on the ontology and make implicit facts explicit.8
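To make the primitives listed above concrete, they can be pictured as a small data container. The following Python sketch is our illustration, not the paper’s implementation; all field names are ours:

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    """Sketch of the ontology structure O = (L, C, HC, R, HR, F, G, A)."""
    lexicon: set = field(default_factory=set)              # L: lexical entries
    concepts: set = field(default_factory=set)             # C: concept identifiers
    concept_heterarchy: set = field(default_factory=set)   # HC: (sub, super) pairs
    relations: dict = field(default_factory=dict)          # R: name -> (domain, range)
    relation_heterarchy: set = field(default_factory=set)  # HR: (sub, super) pairs
    concept_refs: dict = field(default_factory=dict)       # F: lexical entry -> concepts
    relation_refs: dict = field(default_factory=dict)      # G: lexical entry -> relations
    axioms: list = field(default_factory=list)             # A: additional constraints

o = Ontology()
o.concepts |= {"event", "festival"}
o.concept_heterarchy.add(("festival", "event"))  # festival is-a event
o.lexicon.add("festival")
o.concept_refs["festival"] = {"festival"}
```

Storing HC as explicit (subconcept, superconcept) pairs makes multiple inheritance—a heterarchy rather than a strict tree—unproblematic.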

This structure corresponds closely to RDF(S), except for the explicit consideration of lexical entries. Separating concept reference from concept denotation permits very domain-specific ontologies without incurring an instantaneous conflict when merging ontologies—a standard Semantic Web request. For instance, the lexical entry school might refer to a building in ontology A, an organization in ontology B, or both in ontology C. Also, in ontology A, we can refer to the concept referred to in English by school and school building by the German Schule and Schulgebäude.

Ontology learning relies on an ontology structured along these lines and on input data as described earlier to propose new knowledge about reasonably interesting concepts, relations, and lexical entries or about links between these entities—proposing some for addition, deletion, or merging. The graphical result set presents the ontology-learning process’s results to the ontology engineer (we’ll discuss this further in the “Association rules” section). The ontology engineer can then browse the results and decide to follow, delete, or modify the proposals, as the task requires.
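The separation of lexical reference (F) from concept denotation in the school example can be sketched with plain mappings. The ontologies and concept names below are hypothetical stand-ins, not the article’s data:

```python
# F maps lexical entries to concept sets, per ontology. Separating reference
# from denotation lets the same string denote different concepts.
ontology_a = {"school": {"building"}, "schule": {"building"}}
ontology_b = {"school": {"organization"}}

def concepts_for(f, entry):
    """Look up the concepts a lexical entry refers to in one ontology."""
    return f.get(entry.lower(), set())

# Merging ontologies A and B need not conflict: the merged F simply unions
# the concept sets for the shared lexical entry "school".
merged = {e: concepts_for(ontology_a, e) | concepts_for(ontology_b, e)
          for e in set(ontology_a) | set(ontology_b)}
```

After the merge, "school" denotes both a building and an organization, while the German "schule" still denotes only the building concept.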

Components

By integrating the previously discussed considerations into a coherent generic architecture for extracting and maintaining ontologies from Web data, we have identified several core components (including the graphical user interface discussed earlier).

Management component (graphical user interface)

The ontology engineer uses the management component to select input data—that is, relevant resources such as HTML and XML documents, DTDs, databases, or existing ontologies that the discovery process can further exploit. Then, using the management component, the engineer chooses from a set of resource-processing methods available in the resource-processing component and from a set of algorithms available in the algorithm library.

The management component also supports the engineer in discovering task-relevant legacy data—for example, an ontology-based crawler gathers HTML documents that are relevant to a given core ontology.

Resource processing

Depending on the available input data, the engineer can choose various strategies for resource processing:

• Index and reduce HTML documents to free text.
• Transform semistructured documents, such as dictionaries, into a predefined relational structure.
• Handle semistructured and structured schema data (such as DTDs, structured database schemata, and existing ontologies) by following different strategies for import, as described later in this article.
• Process free natural text. Our system accesses the natural-language-processing system SMES (Saarbrücken Message Extraction System), a shallow-text processor for German.10 SMES comprises a tokenizer based on regular expressions, a lexical analysis component including various word lexicons, a morphological analysis module, a named-entity recognizer, a part-of-speech tagger, and a chunk parser.

After first preprocessing data according to one of these or similar strategies, the resource-processing module transforms the data into an algorithm-specific relational representation.

Algorithm library

We can describe an ontology by a number of sets of concepts, relations, lexical entries, and links between these entities. We can acquire an existing ontology definition (including L, C, HC, R, HR, A, F, and G) using various algorithms that work on this definition and the preprocessed input data. Although specific algorithms can vary greatly from one type of input to the next, a considerable overlap exists for underlying learning approaches such as association rules, formal concept analysis, or clustering. Hence, we can reuse algorithms from the library for acquiring different parts of the ontology definition.

In our implementation, we generally use a multistrategy learning and result-combination approach. Thus, each algorithm plugged into the library generates normalized results that adhere to the ontology structures we’ve discussed and that we can apply toward a coherent ontology definition.

Import and reuse

Given our experiences in medicine, telecommunications, tourism, and insurance, we expect that domain conceptualizations are available for almost any commercially significant domain. Thus, we need mechanisms and strategies to import and reuse domain conceptualizations from existing (schema) structures. We can recover the conceptualizations, for example, from legacy database schemata, DTDs, or from existing ontologies that conceptualize some relevant part of the target ontology.

In the first part of import and reuse, we identify the schema structures and discuss their general content with domain experts. We must import each of these knowledge sources separately. We can also import manually—which can include a manual definition of transformation rules. Alternatively, reverse-engineering tools—such as those that exist for recovering extended entity-relationship diagrams from a given database’s SQL description (see the sidebar)—might facilitate the recovery of conceptual structures.

In the second part of the import and reuse step, we must merge or align imported conceptual structures to form a single common ground from which to springboard into the subsequent ontology-learning phases of extracting, pruning, and refining. Although the general research issue of merging and aligning is still an open problem, recent proposals have shown how to improve the manual merging and aligning process. Existing methods mostly rely on matching heuristics for proposing the merger of concepts and similar knowledge base operations. Our research also integrates mechanisms that use an application-data-oriented, bottom-up approach.11 For instance, formal concept analysis lets us discover patterns between application data and the use of concepts, on one hand, and their heterarchies’ relations and semantics, on the other, in a formally concise way (see B. Ganter and R. Wille’s work on formal concept analysis in the sidebar).
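Formal concept analysis itself is mechanical enough to sketch briefly: given a binary context relating objects to attributes, every formal concept is an (extent, intent) pair whose intent is an intersection of object intents. The following naive Python enumeration is our illustration of the technique, not an ontology-merging tool:

```python
from itertools import combinations

def formal_concepts(context):
    """Enumerate the formal concepts (extent, intent) of a binary context.

    context: dict mapping each object to its set of attributes.
    Every concept intent is an intersection of object intents, so we collect
    all such intersections (plus the full attribute set as the top intent)."""
    objects = list(context)
    all_attrs = set().union(*context.values()) if context else set()
    intents = {frozenset(all_attrs)}
    for r in range(1, len(objects) + 1):
        for group in combinations(objects, r):
            intents.add(frozenset.intersection(
                *(frozenset(context[o]) for o in group)))
    concepts = []
    for intent in intents:
        # The extent is every object whose attributes include the intent.
        extent = frozenset(o for o in objects if intent <= context[o])
        concepts.append((extent, intent))
    return sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1])))

ctx = {
    "hotel":    {"bookable", "building"},
    "festival": {"bookable", "event"},
    "concert":  {"bookable", "event"},
}
lattice = formal_concepts(ctx)
```

For the toy context above, the lattice contains, among others, a concept grouping festival and concert under the shared attributes bookable and event. Real FCA implementations use incremental algorithms (e.g., next-closure) rather than this exponential enumeration.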

Overall, the ontology-learning import and reuse step seems to be the hardest to generalize. The task vaguely resembles the general problems encountered in data warehousing—adding, however, challenging problems of its own.

Extraction

Ontology extraction models major parts—the complete ontology or large chunks representing a new ontology subdomain—with learning support exploiting various types of Web sources. Ontology-learning techniques partially rely on given ontology parts. Thus, we here encounter an iterative model where previous revisions through the ontology-learning cycle can propel subsequent ones, and more sophisticated algorithms can work on structures that previous, more straightforward algorithms have proposed.

To describe this phase, let’s look at some of the techniques and algorithms that we embedded in our framework and implemented in our ontology-learning environment Text-To-Onto (see Figure 3). We cover a substantial part of the overall ontology-learning task in the extraction phase. Text-To-Onto proposes many different ontology-learning algorithms for the primitives we described previously (that is, L, C, R, and so on) to the ontology engineer, building on several types of input.


Figure 3. Screenshot of our ontology-learning workbench, Text-To-Onto.


A Common Perspective

Until recently, ontology learning—for comprehensive ontology construction—did not exist. However, much work in numerous disciplines—computational linguistics, information retrieval, machine learning, databases, and software engineering—has researched and practiced techniques that we can use in ontology learning. Hence, we can find techniques and methods relevant for ontology learning referred to as

• “acquisition of selectional restrictions,”1,2
• “word sense disambiguation and learning of word senses,”3
• “computation of concept lattices from formal contexts,”4 and
• “reverse engineering in software engineering.”5

Ontology learning puts many research activities—which focus on different input types but share a common domain conceptualization—into one perspective. The activities in Table A span a variety of communities, with references from 20 completely different events and journals.

References

1. P. Resnik, Selection and Information: A Class-Based Approach to Lexical Relationships, PhD thesis, Dept. of Computer Science, Univ. of Pennsylvania, Philadelphia, 1993.
2. R. Basili, M.T. Pazienza, and P. Velardi, “Acquisition of Selectional Patterns in a Sublanguage,” Machine Translation, vol. 8, no. 1, 1993, pp. 175–201.
3. P. Wiemer-Hastings, A. Graesser, and K. Wiemer-Hastings, “Inferring the Meaning of Verbs from Context,” Proc. 20th Ann. Conf. Cognitive Science Society (CogSci-98), Lawrence Erlbaum, New York, 1998.
4. B. Ganter and R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer-Verlag, Berlin, 1999.
5. H.A. Mueller et al., “Reverse Engineering: A Roadmap,” Proc. Int’l Conf. Software Eng. (ICSE-00), ACM Press, New York, 2000, pp. 47–60.
6. P. Buitelaar, CORELEX: Systematic Polysemy and Underspecification, PhD thesis, Dept. of Computer Science, Brandeis Univ., Waltham, Mass., 1998.
7. H. Assadi, “Construction of a Regional Ontology from Text and Its Use within a Documentary System,” Proc. Int’l Conf. Formal Ontology and Information Systems (FOIS-98), IOS Press, Amsterdam, 1998.
8. D. Faure and C. Nedellec, “A Corpus-Based Conceptual Clustering Method for Verb Frames and Ontology Acquisition,” Proc. LREC-98 Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, European Language Resources Distribution Agency, Paris, 1998.
9. F. Esposito et al., “Learning from Parsed Sentences with INTHELEX,” Proc. Learning Language in Logic Workshop (LLL-2000), Assoc. for Computational Linguistics, New Brunswick, N.J., 2000, pp. 194–198.
10. A. Maedche and S. Staab, “Discovering Conceptual Relations from Text,” Proc. European Conf. Artificial Intelligence (ECAI-00), IOS Press, Amsterdam, 2000, pp. 321–325.
11. J.-U. Kietz, A. Maedche, and R. Volz, “Semi-Automatic Ontology Acquisition from a Corporate Intranet,” Proc. Learning Language in Logic Workshop (LLL-2000), ACL, New Brunswick, N.J., 2000, pp. 31–43.
12. E. Morin, “Automatic Acquisition of Semantic Relations between Terms from Technical Corpora,” Proc. Fifth Int’l Congress on Terminology and Knowledge Engineering (TKE-99), TermNet-Verlag, Vienna, 1999.
13. U. Hahn and K. Schnattinger, “Towards Text Knowledge Engineering,” Proc. Am. Assoc. for Artificial Intelligence (AAAI-98), AAAI/MIT Press, Menlo Park, Calif., 1998.
14. M.A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora,” Proc. Conf. Computational Linguistics (COLING-92), 1992.
15. Y. Wilks, B. Slator, and L. Guthrie, Electric Words: Dictionaries, Computers, and Meanings, MIT Press, Cambridge, Mass., 1996.
16. J. Jannink and G. Wiederhold, “Thesaurus Entry Extraction from an On-Line Dictionary,” Proc. Second Int’l Conf. Information Fusion (Fusion-99), Omnipress, Wisconsin, 1999.
17. J.-U. Kietz and K. Morik, “A Polynomial Approach to the Constructive Induction of Structural Knowledge,” Machine Learning, vol. 14, no. 2, 1994, pp. 193–211.
18. S. Schlobach, “Assertional Mining in Description Logics,” Proc. 2000 Int’l Workshop on Description Logics (DL-2000), 2000; http://SunSITE.Informatik.RWTH-Aachen.DE/Publications/CEUR-WS/Vol-33.
19. A. Doan, P. Domingos, and A. Levy, “Learning Source Descriptions for Data Integration,” Proc. Int’l Workshop on the Web and Databases (WebDB-2000), Springer-Verlag, Berlin, 2000, pp. 60–71.
20. P. Johannesson, “A Method for Transforming Relational Schemas into Conceptual Schemas,” Proc. Int’l Conf. Data Engineering (ICDE-94), IEEE Press, Piscataway, N.J., 1994, pp. 190–201.
21. Z. Tari et al., “The Reengineering of Relational Databases Based on Key and Data Correlations,” Proc. Seventh Conf. Database Semantics (DS-7), Chapman & Hall, 1998, pp. 40–52.


Table A. A survey of ontology-learning approaches.

| Domain | Methods | Features used | Prime purpose | Papers |
| --- | --- | --- | --- | --- |
| Free text | Clustering | Syntax | Extract | P. Buitelaar,6 H. Assadi,7 and D. Faure and C. Nedellec8 |
| Free text | Inductive logic programming | Syntax, logic representation | Extract | F. Esposito et al.9 |
| Free text | Association rules | Syntax, tokens | Extract | A. Maedche and S. Staab10 |
| Free text | Frequency-based | Syntax | Prune | J.-U. Kietz et al.11 |
| Free text | Pattern matching | — | Extract | E. Morin12 |
| Free text | Classification | Syntax, semantics | Refine | U. Hahn and K. Schnattinger13 |
| Dictionary | Information extraction | Syntax | Extract | M. Hearst,14 Y. Wilks,15 and J.-U. Kietz et al.11 |
| Dictionary | PageRank | Tokens | — | J. Jannink and G. Wiederhold16 |
| Knowledge base | Concept induction, A-Box mining | Relations | Extract | J.-U. Kietz and K. Morik17 and S. Schlobach18 |
| Semistructured schemata | Naive Bayes | Relations | Reverse engineering | A. Doan et al.19 |
| Relational schemata | Data correlation | Relations | Reverse engineering | P. Johannesson20 and Z. Tari et al.21 |


Lexical entry and concept extraction

One of the baseline methods applied in our framework for acquiring lexical entries with corresponding concepts is lexical entry and concept extraction. Text-To-Onto processes Web documents on the morphological level, including multiword terms such as “database reverse engineering,” by n-grams, a simple statistics-based technique. Based on this text preprocessing, we apply term-extraction techniques, which are based on (weighted) statistical frequencies, to propose new lexical entries for L.

Often, the ontology engineer follows the proposal by the lexical entry and concept-extraction mechanism and includes a new lexical entry in the ontology. Because the new lexical entry comes without an associated concept, the ontology engineer must then decide (possibly with help from further processing) whether to introduce a new concept or link the new lexical entry to an existing concept.
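A minimal stand-in for such frequency-based term extraction can be sketched as follows. Real systems add stopword filtering, lemmatization, and weighting such as tf-idf, none of which is shown here; this is our illustration, not the Text-To-Onto code:

```python
import re
from collections import Counter

def extract_terms(documents, max_n=3, min_count=2):
    """Propose candidate lexical entries by simple n-gram frequency counts.

    Multiword terms like 'database reverse engineering' surface as frequent
    n-grams; only n-grams seen at least min_count times are proposed."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]

docs = [
    "Database reverse engineering recovers schemata.",
    "Tools for database reverse engineering exist.",
]
proposals = extract_terms(docs)
```

The trigram “database reverse engineering” appears in both toy documents and is therefore proposed as a candidate lexical entry.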

Hierarchical concept clustering

Given a lexicon and a set of concepts, one major next step is taxonomic concept classification. One generally applicable method for this is hierarchical clustering, which exploits items’ similarities to propose a hierarchy of item categories. We compute the similarity measure on the properties of items.

When extracting a hierarchy from natural-language text, term adjacency or syntactical relationships between terms yield considerable descriptive power to induce the semantic hierarchy of concepts related to these terms.

David Faure and Claire Nedellec give a sophisticated example of hierarchical clustering (see the sidebar). They present a cooperative machine-learning system, Asium (acquisition of semantic knowledge using machine-learning methods), which acquires taxonomic relations and subcategorization frames of verbs based on syntactic input. The Asium system hierarchically clusters nouns based on the verbs to which they are syntactically related and vice versa. Thus, they cooperatively extend the lexicon, the concept set, and the concept heterarchy (L, C, HC).
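The underlying idea—clustering nouns by the verbs they occur with—can be sketched with a greedy agglomerative procedure. This is our simplified illustration, not the Asium algorithm; the nouns, verbs, and threshold are invented:

```python
def jaccard(a, b):
    """Similarity of two clusters measured on the verbs their nouns attach to."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(noun_verbs, threshold=0.5):
    """Repeatedly merge the most similar pair of clusters until no pair
    exceeds the threshold; each merge is a candidate concept generalizing
    its member nouns."""
    clusters = [({n}, set(vs)) for n, vs in noun_verbs.items()]
    while len(clusters) > 1:
        i, j, sim = max(
            ((i, j, jaccard(clusters[i][1], clusters[j][1]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda t: t[2])
        if sim < threshold:
            break
        nouns = clusters[i][0] | clusters[j][0]
        verbs = clusters[i][1] | clusters[j][1]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((nouns, verbs))
    return [ns for ns, _ in clusters]

data = {
    "hotel":    {"book", "stay"},
    "pension":  {"book", "stay"},
    "festival": {"attend", "visit"},
    "concert":  {"attend"},
}
groups = cluster(data)
```

On the toy data, hotel and pension merge (identical verb sets), as do festival and concert, suggesting two candidate superconcepts for the heterarchy HC.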

Dictionary parsing

Machine-readable dictionaries (MRDs) are frequently available for many domains. Although their internal structure is mostly free text, comparatively few patterns are used to give text definitions. Hence, MRDs exhibit a large degree of regularity that can be exploited to extract a domain conceptualization.

We have used Text-To-Onto to generate a concept taxonomy from an insurance company’s MRD (see the sidebar). Likewise, we’ve applied morphological processing to term extraction from free text—this time, however, complemented by several pattern-matching heuristics. Take, for example, the following dictionary entry:

Automatic Debit Transfer: Electronic service arising from a debit authorization of the Yellow Account holder for a recipient to debit bills that fall due direct from the account….

We applied several heuristics to the morphologically analyzed definitions. For instance, one simple heuristic relates the definition term, here automatic debit transfer, with the first noun phrase in the definition, here electronic service. The heterarchy relation HC(automatic debit transfer, electronic service) links their corresponding concepts. Applying this heuristic iteratively, we can propose large parts of the target ontology—more precisely, L, C, and HC—to the ontology engineer. In fact, because verbs tend to be modeled as relations, we can also use this method to extend R (and the linkage between R and L).
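The first-noun-phrase heuristic can be approximated in a few lines. Lacking a real morphological analyzer, this sketch crudely takes the first two content words of the definition as the noun phrase; it is our illustration, not the Text-To-Onto implementation:

```python
import re

def head_of_definition(entry, definition, stopwords=("a", "an", "the")):
    """Toy heuristic: link the defined term to the first noun phrase of its
    definition, yielding a candidate HC pair (term, hypernym). The 'first
    noun phrase' is approximated by the first two non-stopword words."""
    words = [w for w in re.findall(r"[A-Za-z]+", definition.lower())
             if w not in stopwords]
    return (entry.lower(), " ".join(words[:2]))

pair = head_of_definition(
    "Automatic Debit Transfer",
    "Electronic service arising from a debit authorization ...")
```

For the article’s example entry, this proposes the hypernym link from automatic debit transfer to electronic service.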

Association rules

One typically uses association-rule-learning algorithms for prototypical applications of data mining—for example, finding associations that occur between items such as supermarket products in a set of transactions, such as customers’ purchases. The generalized association-rule-learning algorithm extends its baseline by aiming at descriptions at the appropriate taxonomy level—for example, “snacks are purchased together with drinks” rather than “chips are purchased with beer” and “peanuts are purchased with soda.”

In Text-To-Onto (see the sidebar), we use a modified generalized association-rule-learning algorithm to discover relations between concepts. A given class hierarchy HC serves as background knowledge. Pairs of syntactically related concepts—for example, the pair (festival, island) describing the head–modifier relationship contained in the sentence “The festival on Usedom attracts tourists from all over the world.”—are given as input to the algorithm. The algorithm generates association rules that compare the relevance of different rules while climbing up or down the taxonomy. The algorithm proposes what appear to be the most relevant binary rules to the ontology engineer for modeling relations into the ontology, thus extending R.

As the algorithm tends to generate a high number of rules, we offer various interaction modes. For example, the ontology engineer can restrict the number of suggested relations by defining so-called restriction concepts that must participate in the extracted relations. The flexible enabling and disabling of taxonomic knowledge for extracting relations is another way of focusing.

Figure 4 shows various views of the results. We can induce a generalized relation from the example data given earlier—the relation rel(event, area), which the ontology engineer could name locatedin, namely, events located in an area (which extends L and G). The user can add extracted relations to the ontology by dragging and dropping them. To explore and determine the right aggregation level for adding a relation to the ontology, the user can browse the relation views for extracted properties (see the left side of Figure 4).
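The taxonomy-climbing step of a generalized association-rule learner can be sketched as follows. This toy version (ours, not the Text-To-Onto algorithm) only counts support for every taxonomic generalization of each observed concept pair; the real algorithm also weighs confidence and prunes redundant rules:

```python
from collections import Counter
from itertools import product

def ancestors(concept, parent):
    """A concept plus all its ancestors in the taxonomy HC."""
    out = [concept]
    while concept in parent:
        concept = parent[concept]
        out.append(concept)
    return out

def generalized_pairs(pairs, parent, min_support=2):
    """Count every taxonomic generalization of each observed concept pair
    and keep those meeting the support threshold."""
    counts = Counter()
    for a, b in pairs:
        for ga, gb in product(ancestors(a, parent), ancestors(b, parent)):
            counts[(ga, gb)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

taxonomy = {"festival": "event", "concert": "event", "island": "area"}
observed = [("festival", "island"), ("concert", "island")]
rules = generalized_pairs(observed, taxonomy)
```

Neither specific pair recurs, but both generalize to (event, island) and (event, area), which reach the support threshold—mirroring how the learner proposes rel(event, area) rather than a festival-specific rule.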

Pruning

A common theme of modeling in various disciplines is the balance between completeness and domain-model scarcity. Targeting completeness for the domain model appears to be practically unmanageable and computationally intractable, but targeting the scarcest model overly limits expressiveness. Hence, we aim for a balance between the two that works: our model should capture a rich target-domain conceptualization but exclude the parts that are out of its focus. Ontology import and reuse, as well as ontology extraction, put the scale considerably out of balance, with out-of-focus concepts reigning. Therefore, we appropriately diminish the ontology in the pruning phase.

We can view the problem of pruning in at least two ways. First, we need to clarify how pruning particular parts of the ontology (for example, removing a concept or relation) affects the rest. For instance, Brian Peterson and his colleagues have described strategies that leave the user with a coherent ontology (that is, no dangling or broken links).6 Second, we can consider strategies for proposing ontology items that we should either keep or prune. Given a set of application-specific documents, several strategies exist for pruning the ontology that are based on absolute or relative counts of term frequency combined with the ontology's background knowledge (see the sidebar).
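A relative-frequency pruning heuristic of the kind just described might look like the following sketch. The corpora, the ratio test, and the threshold are illustrative assumptions, not the strategies actually implemented in Text-To-Onto.

```python
from collections import Counter

def prune_candidates(ontology_concepts, domain_docs, generic_docs, min_ratio=2.0):
    """Propose concepts to prune: those whose relative frequency in the
    application-specific corpus does not sufficiently exceed their relative
    frequency in a generic reference corpus. Thresholds are assumptions."""
    domain = Counter(w for doc in domain_docs for w in doc.split())
    generic = Counter(w for doc in generic_docs for w in doc.split())
    d_total = sum(domain.values()) or 1
    g_total = sum(generic.values()) or 1
    prune = []
    for concept in ontology_concepts:
        d_rel = domain[concept] / d_total
        g_rel = generic[concept] / g_total
        if d_rel < min_ratio * g_rel:  # not domain-specific enough
            prune.append(concept)
    return prune

# Toy corpora: a tourism domain corpus versus a generic corpus.
domain_docs = ["hotel hotel festival", "hotel island"]
generic_docs = ["computer hotel", "computer computer"]
print(prune_candidates(["hotel", "computer"], domain_docs, generic_docs))
# -> ['computer']: frequent generically but not in the tourism domain
```

In a real setting, the ontology's background knowledge would additionally propagate counts along the taxonomy (a concept whose specializations are frequent should survive pruning even if its own label rarely occurs).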

Refinement

Refining plays a similar role to extracting; the difference lies on a sliding scale rather than in a clear-cut distinction. Although extracting serves mainly for cooperative modeling of the overall ontology (or at least of very significant chunks of it), the refinement phase is about fine-tuning the target ontology and supporting its evolving nature. The refinement phase can use data that comes from a concrete Semantic Web application, for example, log files of user queries or generic user data. Adapting and refining the ontology with respect to user requirements plays a major role in the application's acceptance and its further development.

In principle, we can use the same algorithms for extraction and refinement. However, during refinement we must consider in detail the existing ontology and its existing connections, while extraction more often than not works practically from scratch.

Udo Hahn and Klemens Schnattinger presented a prototypical approach for refinement (see the sidebar), although not for extraction. They introduced a methodology for automating the maintenance of domain-specific taxonomies, which incrementally updates an ontology as it acquires new concepts from text. The acquisition process is centered on the linguistic and conceptual "quality" of various forms of evidence underlying concept-hypothesis generation and refinement. In particular, to determine a proposal's quality, Hahn and Schnattinger consider semantic conflicts and analogous semantic structures from the knowledge base for the ontology, thus extending an existing ontology with new lexical entries for L, new concepts for C, and new relations for HC.
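The spirit of quality-based hypothesis refinement can be loosely illustrated as follows. This is not Hahn and Schnattinger's actual system; the evidence labels, weights, and scoring scheme are invented for the example.

```python
from collections import defaultdict

# Assumed evidence weights: stronger linguistic or conceptual evidence
# contributes more to a concept hypothesis's quality score (illustrative).
WEIGHTS = {"apposition": 3, "case_frame": 2, "similarity": 1}

def rank_hypotheses(evidence):
    """evidence: iterable of (candidate_parent_concept, evidence_label)
    pairs gathered while reading text about an unknown concept.
    Returns candidate parent concepts, best-supported first."""
    score = defaultdict(int)
    for parent, label in evidence:
        score[parent] += WEIGHTS.get(label, 0)
    return sorted(score, key=score.get, reverse=True)

# Evidence collected about an unknown term, e.g. a new lodging brand name.
evidence = [("hotel", "apposition"), ("building", "similarity"),
            ("hotel", "case_frame")]
print(rank_hypotheses(evidence))  # -> ['hotel', 'building']
```

Each new sentence either strengthens or contradicts a hypothesis, so the ranking evolves incrementally as text is processed, which is the incremental-maintenance behavior the methodology is after.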


Figure 4. Result presentation in Text-To-Onto.


Ontology learning could add significant leverage to the Semantic Web because it propels the construction of domain ontologies, which the Semantic Web needs to succeed. We have presented a comprehensive framework for ontology learning that crosses the boundaries of single disciplines, touching on a number of challenges. The good news, however, is that you don't need perfect or optimal support for cooperative ontology modeling. At least in our experience, cheap methods in an integrated environment can tremendously help the ontology engineer.

While a number of problems remain within individual disciplines, additional challenges arise that specifically pertain to applying ontology learning to the Semantic Web. With the use of XML-based namespace mechanisms, the notion of an ontology with well-defined boundaries (for example, only definitions that are in one file) will disappear. Rather, the Semantic Web might yield an amoeba-like structure regarding ontology boundaries, because ontologies refer to and import each other (for example, the DAML-ONT primitive import). However, we do not yet know what the semantics of these structures will look like. In light of these facts, the importance of methods such as ontology pruning and crawling will drastically increase. Moreover, we have so far restricted our attention in ontology learning to the conceptual structures that are almost contained in RDF(S). Additional semantic layers on top of RDF (for example, future OIL or DAML-ONT with axioms, A) will require new means for improved ontology engineering with axioms, too.

Acknowledgments

We thank our students, Dirk Wenke and Raphael Volz, for work on OntoEdit and Text-To-Onto. Swiss Life/Rentenanstalt (Zurich, Switzerland), Ontoprise GmbH (Karlsruhe, Germany), the US Air Force DARPA DAML ("OntoAgents" project), the European Union (IST-1999-10132 "On-To-Knowledge" project), and the German BMBF (01IN901C0 "GETESS" project) partly financed the research for this article.

References

1. E. Grosso et al., "Knowledge Modeling at the Millennium—the Design and Evolution of Protégé-2000," Proc. 12th Int'l Workshop Knowledge Acquisition, Modeling and Management (KAW-99), 1999.

2. G. Webb, J. Wells, and Z. Zheng, "An Experimental Evaluation of Integrating Machine Learning with Knowledge Acquisition," Machine Learning, vol. 35, no. 1, 1999, pp. 5–23.

3. K. Morik et al., Knowledge Acquisition and Machine Learning: Theory, Methods, and Applications, Academic Press, London, 1993.

4. B. Gaines and M. Shaw, "Integrated Knowledge Acquisition Architectures," J. Intelligent Information Systems, vol. 1, no. 1, 1992, pp. 9–34.

5. K. Morik, "Balanced Cooperative Modeling," Machine Learning, vol. 11, no. 1, 1993, pp. 217–235.

6. B. Peterson, W. Andersen, and J. Engel, "Knowledge Bus: Generating Application-Focused Databases from Large Ontologies," Proc. Fifth Workshop Knowledge Representation Meets Databases (KRDB-98), 1998, http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-10 (current 19 Mar. 2001).

7. S. Staab et al., "Knowledge Processes and Ontologies," IEEE Intelligent Systems, vol. 16, no. 1, Jan./Feb. 2001, pp. 26–34.

8. S. Staab and A. Maedche, "Knowledge Portals—Ontologies at Work," to be published in AI Magazine, vol. 21, no. 2, Summer 2001.

9. G. Miller, "WordNet: A Lexical Database for English," Comm. ACM, vol. 38, no. 11, Nov. 1995, pp. 39–41.

10. G. Neumann et al., "An Information Extraction Core System for Real World German Text Processing," Proc. Fifth Conf. Applied Natural Language Processing (ANLP-97), 1997, pp. 208–215.

11. G. Stumme and A. Maedche, "FCA-Merge: A Bottom-Up Approach for Merging Ontologies," to be published in Proc. 17th Int'l Joint Conf. Artificial Intelligence (IJCAI '01), Morgan Kaufmann, San Francisco, 2001.

The Authors

Alexander Maedche is a PhD student at the Institute of Applied Informatics and Formal Description Methods at the University of Karlsruhe. His research interests include knowledge discovery in data and text, ontology engineering, learning and application of ontologies, and the Semantic Web. Together with Rudi Studer, he recently founded a research group at the FZI Research Center for Information Technologies at the University of Karlsruhe that researches Semantic Web technologies and applies them to knowledge management applications in practice. He received a diploma in industrial engineering, majoring in computer science and operations research, from the University of Karlsruhe. Contact him at the Institute AIFB, Univ. of Karlsruhe, 76128 Karlsruhe, Germany; [email protected].

Steffen Staab is an assistant professor at the University of Karlsruhe and cofounder of Ontoprise GmbH. His research interests include computational linguistics, text mining, knowledge management, ontologies, and the Semantic Web. He received an MSE from the University of Pennsylvania and a Dr. rer. nat. from the University of Freiburg, both in informatics. He has organized several national and international conferences and workshops, and is now chairing the Semantic Web Workshop in Hong Kong at WWW10. Contact him at the Institute AIFB, Univ. of Karlsruhe, 76128 Karlsruhe, Germany; [email protected].
